GitHub - databonsai/databonsai: clean & curate your data with LLMs.

GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

GitHub - tensorlakeai/indexify: A scalable realtime and continuous indexing engine for Unstructured Data to build Generative AI Applications

GitHub - promptslab/LLMtuner: Tune LLM in few lines of code

global-data-consortium-working-draft

The Global Data Consortium (GDC) aims to advance the role of AI in universities by promoting collaboration and sharing data to enhance learning outcomes and address challenges in education.
