Data Processing
The backbone for Versatile ai
Meet Instill Cloud, a no-code/low-code platform that accelerates AI application development by 10x. Effortlessly connect to diverse data sources, seamlessly integrate AI models, and deploy customized logic for your projects, no matter how complex, with lightning speed.
Meet Instill Cloud, a no-code/low-code platform that accelerates AI application development by 10x. Effortlessly connect to diverse data sources, seamlessly integrate AI models, and deploy customized logic for your projects, no matter how complex, with lightning speed.
Instill AI
WebDataset
WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.
The WebDataset format
A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are consider... See more
WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.
The WebDataset format
A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are consider... See more
WebDataset
Magika
Magika is a novel AI powered file type detection tool that relies on the recent advance of deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that only weighs about 1MB, and enables precise file identification within milliseconds, even when running on a single CPU.
In an evaluati... See more
Magika is a novel AI powered file type detection tool that relies on the recent advance of deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that only weighs about 1MB, and enables precise file identification within milliseconds, even when running on a single CPU.
In an evaluati... See more
google • GitHub - google/magika: Detect file content types with deep learning
Traditional ETL solutions are still quite powerful when it comes to:
- Common connectors with small-medium data volumes : we still have a lot of respect for companies like Fivetran, who have really nailed the user experience for the most common ETL use cases, like syncing Zendesk tickets or a production Postgres read replica into Snowflake. The only
Why you should move your ETL stack to Modal
SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 20 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.
It is ... See more
It is ... See more
tobymao • GitHub - tobymao/sqlglot: Python SQL Parser and Transpiler
Most commonly, ETL means moving data from some source system (e.g. a production database, Slack API) into an analytical data warehouse (e.g. Snowflake) where the data is easier to combine and analyze. Most data teams use a vendor like Fivetran or an orchestration platform like Airflow to do this.
Modal is a great solution for ETL if you are primaril... See more
Modal is a great solution for ETL if you are primaril... See more
Why you should move your ETL stack to Modal
Spice.ai OSS
What is Spice?
Spice is a small, portable runtime that provides developers with a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
Spice makes it easy to build data-driven and data-intensive applications by streamlining the use of data and mach... See more
What is Spice?
Spice is a small, portable runtime that provides developers with a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
Spice makes it easy to build data-driven and data-intensive applications by streamlining the use of data and mach... See more
spiceai • GitHub - spiceai/spiceai: A unified SQL query interface and portable runtime to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
Open source, high-throughput, fault-tolerant vector embedding pipeline
Simple API endpoint that ingests large volumes of raw data, processes, and stores or returns the vectors quickly and reliably
Simple API endpoint that ingests large volumes of raw data, processes, and stores or returns the vectors quickly and reliably
dgarnitz • GitHub - dgarnitz/vectorflow: VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice.
1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.
2️⃣ Its API architecture and SDKs ensure developers can gather data from different sources and leverage them into their applications without disruptions.
3️⃣ RudderStack is highly versatile and integrates with over 90+ tools and data warehouse dest... See more
2️⃣ Its API architecture and SDKs ensure developers can gather data from different sources and leverage them into their applications without disruptions.
3️⃣ RudderStack is highly versatile and integrates with over 90+ tools and data warehouse dest... See more