GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation

GitHub - nomic-ai/nomic: Interact, analyze and structure massive text, image, embedding, audio and video datasets

GitHub - alibaba/data-juicer: A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷