Data Loading
Data Integration. Integration is needed when your organization collects large amounts of data in various systems such as databases, CRM systems, and application servers. Accessing and analyzing data that is spread across multiple systems can be a challenge. To address this challenge, data integration can be used to create a unified view of your data…
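The unified-view idea can be sketched in a few lines of plain Python: records about the same customer live in two hypothetical source systems (a CRM and a billing database, both invented for this example) and are merged under one key.

```python
# Minimal sketch of data integration: merging records about the same
# customer from two hypothetical systems (a CRM and a billing database)
# into one unified view, keyed by customer id.
crm = {
    "c1": {"name": "Ada", "email": "ada@example.com"},
    "c2": {"name": "Lin", "email": "lin@example.com"},
}
billing = {
    "c1": {"plan": "pro", "mrr": 49},
    "c3": {"plan": "free", "mrr": 0},
}

def unified_view(*sources):
    """Merge per-customer records from several source systems."""
    view = {}
    for source in sources:
        for customer_id, record in source.items():
            view.setdefault(customer_id, {}).update(record)
    return view

customers = unified_view(crm, billing)
print(customers["c1"])  # fields from both systems in one record
```

Real integration pipelines add schema mapping, deduplication, and conflict resolution on top of this basic join.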
Data Engineering • The Open Data Stack Distilled into Four Core Tools
WebDataset
WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.
The WebDataset format
A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are considered…
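The format convention can be illustrated with the standard library alone: a tar archive whose successive members share a basename prefix form one sample. This is only a sketch of the grouping convention, not the webdataset library itself.

```python
import io
import itertools
import tarfile

# Build a tiny WebDataset-style shard in memory: files sharing the same
# basename prefix (e.g. "sample0.img" + "sample0.txt") form one sample.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [
        ("sample0.img", b"<image bytes>"),
        ("sample0.txt", b"a cat"),
        ("sample1.img", b"<image bytes>"),
        ("sample1.txt", b"a dog"),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read it back sequentially, grouping successive members by prefix.
buf.seek(0)
samples = []
with tarfile.open(fileobj=buf, mode="r") as tar:
    for prefix, group in itertools.groupby(
        tar.getmembers(), key=lambda m: m.name.split(".")[0]
    ):
        sample = {m.name.split(".")[1]: tar.extractfile(m).read() for m in group}
        samples.append((prefix, sample))

print(samples[0])  # ('sample0', {'img': b'<image bytes>', 'txt': b'a cat'})
```

Because samples sit next to each other in the archive, the whole shard can be consumed with purely sequential reads, which is what makes the format friendly to streaming.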
WebDataset
Magika
Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1 MB and enables precise file identification within milliseconds, even when running on a single CPU.
In an evaluation…
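The core idea, detecting type from content rather than from the file extension, can be shown in miniature with a magic-byte lookup. Magika itself uses a learned model instead of a signature table; this stdlib sketch only illustrates the content-over-extension principle.

```python
# Content-based file type detection in miniature: inspect leading bytes
# ("magic numbers") instead of trusting the file extension. This is a
# hand-written signature table, not Magika's model.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\xff\xd8\xff", "jpeg"),
]

def sniff(content: bytes) -> str:
    """Return a type label based on the file's leading bytes."""
    for magic, label in SIGNATURES:
        if content.startswith(magic):
            return label
    return "unknown"

print(sniff(b"%PDF-1.7 ..."))  # pdf
```

Signature tables break down on text-like formats (CSV vs. Markdown vs. source code), which is exactly the gap a trained model like Magika's is meant to close.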
google • GitHub - google/magika: Detect file content types with deep learning
Easily chunk complex documents the same way a human would.
Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.
Open Parse is designed to fill this gap by providing a flexible, e…
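The "chunk the way a human would" contrast can be sketched with the stdlib: splitting on document structure keeps each section intact, while fixed-size splitting cuts blindly mid-thought. Open Parse's real pipeline is far richer (layout analysis, tables, semantic merging); this only shows the basic contrast.

```python
import re

# A markdown-ish toy document with three sections.
doc = """# Intro
Welcome to the manual.

# Setup
Install the package.
Run the config step.

# Usage
Call the API.
"""

def chunk_by_heading(text):
    """Structure-aware: split at headings so each chunk is one section."""
    parts = re.split(r"(?m)^(?=# )", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_fixed(text, size=40):
    """Naive: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

print(chunk_by_heading(doc)[1])  # the whole "Setup" section, intact
print(chunk_fixed(doc)[1])       # an arbitrary 40-character slice
```

For retrieval, the structure-aware chunks matter because a query about "setup" now maps to one coherent chunk instead of two half-sentences.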
Filimoa • GitHub - Filimoa/open-parse: Improved file parsing for LLM’s
The solution: The ingestion service
To meet these unique demands, the Search Infrastructure team implemented the Ingestion Service to gracefully handle Twitter's traffic trends. The Ingestion Service queues requests from the client service into a single Kafka topic per Elasticsearch cluster. Worker clients then read from this topic and send the requests…
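The queue-then-workers shape described above can be simulated with the stdlib: producers enqueue index requests into one buffer (standing in for the per-cluster Kafka topic) and worker threads drain it at their own pace, smoothing traffic spikes before they reach the downstream cluster. All names here are illustrative; this is the pattern, not Twitter's implementation.

```python
import queue
import threading

requests = queue.Queue()          # stands in for the Kafka topic
indexed = []                      # stands in for the Elasticsearch cluster
lock = threading.Lock()

def worker():
    """Drain the queue; a None sentinel tells the worker to stop."""
    while True:
        req = requests.get()
        if req is None:
            requests.task_done()
            break
        with lock:
            indexed.append(req)   # stands in for an indexing write
        requests.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for i in range(100):              # a burst of client traffic
    requests.put(f"doc-{i}")
for _ in workers:                 # one sentinel per worker
    requests.put(None)
for w in workers:
    w.join()

print(len(indexed))  # 100
```

The key property is decoupling: the producer's burst rate and the workers' drain rate are independent, so a traffic spike fills the queue instead of overwhelming the cluster.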
Stability and scalability for search
1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.
2️⃣ Its API architecture and SDKs let developers gather data from different sources and use it in their applications without disruption.
3️⃣ RudderStack is highly versatile and integrates with more than 90 tools and data warehouse destinations…
Bap • Our 5 favourite open-source customer data platforms
Marker
Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
- Support for a range of PDF documents (optimized for books and scientific papers)
- Removes headers/footers/other artifacts
- Converts most equations to LaTeX
- Formats code blocks and tables
- Support for multiple…
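One of the cleanup steps listed above, removing headers and footers, can be approximated in plain Python: a line that repeats at nearly every page is treated as a running header and dropped. Marker's actual pipeline is model-based; this stdlib sketch only shows the idea, on an invented three-page document.

```python
from collections import Counter

# Invented pages: "ACME Handbook" is a running header on every page.
pages = [
    ["ACME Handbook", "Chapter text page 1.", "Page 1"],
    ["ACME Handbook", "Chapter text page 2.", "Page 2"],
    ["ACME Handbook", "Chapter text page 3.", "Page 3"],
]

def strip_repeats(pages, min_ratio=0.8):
    """Drop lines that appear on at least min_ratio of all pages."""
    counts = Counter(line for page in pages for line in set(page))
    threshold = min_ratio * len(pages)
    return [[l for l in page if counts[l] < threshold] for page in pages]

cleaned = strip_repeats(pages)
print(cleaned[0])  # ['Chapter text page 1.', 'Page 1']
```

Note the page-number footers survive because their text differs per page; a fuller version would normalize digits before counting.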
VikParuchuri • GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy
Data Extraction Stack
Open-Source Pre-Processing Tools for Unstructured Data
The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular…
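The partition-into-elements idea can be shown in miniature with the stdlib HTML parser: walk a document and emit typed elements (Title vs. NarrativeText), loosely echoing the element model that unstructured produces. unstructured's own partitioners handle many more formats and edge cases; the class below is invented for illustration.

```python
from html.parser import HTMLParser

class MiniPartitioner(HTMLParser):
    """Toy partitioner: classify text nodes by their enclosing tag."""
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._tag = None

    def handle_starttag(self, tag, attrs):
        self._tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        kind = "Title" if self._tag in self.HEADINGS else "NarrativeText"
        self.elements.append((kind, text))

p = MiniPartitioner()
p.feed("<h1>Report</h1><p>Sales rose.</p><p>Costs fell.</p>")
print(p.elements)
# [('Title', 'Report'), ('NarrativeText', 'Sales rose.'),
#  ('NarrativeText', 'Costs fell.')]
```

Typed elements are what make downstream LLM steps easier: titles can seed chunk boundaries and narrative text can be embedded, without re-parsing raw markup.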
Unstructured-IO • GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Snowplow
1️⃣ Snowplow is made with developers in mind. It currently offers over 20 SDKs to get data from the web, mobile, and server-side applications.
2️⃣ Snowplow's best-known feature is its unique schema-based approach and validation process. Its architecture ensures reliable data.
3️⃣ Snowplow supports integration with multiple data st…
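The schema-first validation idea can be sketched in a few lines: every event is checked against a declared schema before it enters the pipeline, and bad events are rejected rather than silently stored. Snowplow uses full JSON Schema with versioning; this stdlib sketch, with an invented event schema, checks field presence and types only.

```python
# Hypothetical event schema: field name -> required Python type.
SCHEMA = {"event": str, "user_id": str, "value": int}

def validate(event: dict, schema: dict):
    """Return a list of validation errors; empty means the event passes."""
    errors = []
    for field, ftype in schema.items():
        if field not in event:
            errors.append(f"missing: {field}")
        elif not isinstance(event[field], ftype):
            errors.append(f"wrong type: {field}")
    return errors

good = {"event": "page_view", "user_id": "u1", "value": 1}
bad = {"event": "page_view", "value": "high"}

print(validate(good, SCHEMA))  # []
print(validate(bad, SCHEMA))   # ['missing: user_id', 'wrong type: value']
```

Rejecting bad events at ingest time is what gives the downstream warehouse its reliability guarantee: every stored row is known to match the schema it claims.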