Data Loading

Data Engineering The Open Data Stack Distilled into Four Core Tools

WebDataset

google GitHub - google/magika: Detect file content types with deep learning

Filimoa GitHub - Filimoa/open-parse: Improved file parsing for LLM’s

Stability and scalability for search

Bap Our 5 favourite open-source customer data platforms

VikParuchuri GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy

Unstructured-IO GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.