Data Processing

The backbone for versatile AI

Meet Instill Cloud, a no-code/low-code platform that accelerates AI application development by 10x. Effortlessly connect to diverse data sources, seamlessly integrate AI models, and deploy customized logic for your projects, no matter how complex, with lightning speed.

Instill AI

WebDataset

WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.

The WebDataset format

A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are consider...

WebDataset
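The format described above can be illustrated with nothing but the Python standard library. This is a minimal sketch, not the WebDataset library's own reader: it assumes that consecutive TAR members sharing the same prefix (the name up to the first dot) form one sample, and the helper names `write_shard` and `read_samples` are hypothetical.

```python
import io
import tarfile
from itertools import groupby


def write_shard(samples):
    """Write (key, {extension: bytes}) pairs into an in-memory TAR shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_samples(shard):
    """Group consecutive members with the same prefix into one dict per sample."""
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        members = [(m.name, tar.extractfile(m).read()) for m in tar if m.isfile()]
    grouped = []
    # Assumption: the prefix is everything before the first dot in the name.
    for key, group in groupby(members, key=lambda item: item[0].split(".", 1)[0]):
        sample = {"__key__": key}
        for name, data in group:
            sample[name.split(".", 1)[1]] = data
        grouped.append(sample)
    return grouped


shard = write_shard([
    ("sample0", {"jpg": b"<image bytes>", "cls": b"3"}),
    ("sample1", {"jpg": b"<image bytes>", "cls": b"7"}),
])
samples = read_samples(shard)
```

Because members are read strictly in order, a reader like this never needs to seek, which is what makes the layout suitable for sequential and streamed I/O.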

Magika

Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and enables precise file identification within milliseconds, even when running on a single CPU.

In an evaluati...

google • GitHub - google/magika: Detect file content types with deep learning

Traditional ETL solutions are still quite powerful when it comes to:

Common connectors with small-medium data volumes: we still have a lot of respect for companies like Fivetran, who have really nailed the user experience for the most common ETL use cases, like syncing Zendesk tickets or a production Postgres read replica into Snowflake. The only

Why you should move your ETL stack to Modal

SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 20 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.

It is ...

tobymao • GitHub - tobymao/sqlglot: Python SQL Parser and Transpiler

Most commonly, ETL means moving data from some source system (e.g. a production database, Slack API) into an analytical data warehouse (e.g. Snowflake) where the data is easier to combine and analyze. Most data teams use a vendor like Fivetran or an orchestration platform like Airflow to do this.

Modal is a great solution for ETL if you are primaril...

Why you should move your ETL stack to Modal
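The extract-transform-load pattern described above can be sketched in a few lines. This is only an illustration: two in-memory SQLite databases stand in for the production source and the analytical warehouse, and the table names, columns, and helper functions are all hypothetical. It does not use Modal, Fivetran, or Airflow.

```python
import sqlite3


def extract(source):
    """Pull raw rows out of the source system."""
    return source.execute("SELECT id, status FROM tickets").fetchall()


def transform(rows):
    """Normalize the status field so it is easier to analyze."""
    return [(ticket_id, status.strip().lower()) for ticket_id, status in rows]


def load(warehouse, rows):
    """Write cleaned rows into the warehouse; reruns overwrite by primary key."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS tickets_clean (id INTEGER PRIMARY KEY, status TEXT)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO tickets_clean VALUES (?, ?)", rows)
    warehouse.commit()


# Stand-in for a production database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE tickets (id INTEGER, status TEXT)")
source.executemany("INSERT INTO tickets VALUES (?, ?)", [(1, " Open "), (2, "CLOSED")])

# Stand-in for an analytical warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
loaded = warehouse.execute("SELECT id, status FROM tickets_clean ORDER BY id").fetchall()
```

Keying the load step on the primary key means the job can be rerun without duplicating rows, which is a common design choice for scheduled syncs.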

Spice.ai OSS

What is Spice?

Spice is a small, portable runtime that provides developers with a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.

Spice makes it easy to build data-driven and data-intensive applications by streamlining the use of data and mach...

spiceai • GitHub - spiceai/spiceai: A unified SQL query interface and portable runtime to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.

Open source, high-throughput, fault-tolerant vector embedding pipeline

Simple API endpoint that ingests large volumes of raw data, processes, and stores or returns the vectors quickly and reliably

dgarnitz • GitHub - dgarnitz/vectorflow: VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice.

1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.

2️⃣ Its API architecture and SDKs ensure developers can gather data from different sources and leverage them into their applications without disruptions.

3️⃣ RudderStack is highly versatile and integrates with 90+ tools and data warehouse dest...

Bap • Our 5 favourite open-source customer data platforms