
GitHub - Filimoa/open-parse: Improved file parsing for LLM’s

Zerox OCR
A dead simple way of OCR-ing a document for AI ingestion. Documents are meant to be a visual representation after all. With weird layouts, tables, charts, etc. The vision models just make sense!
The general logic:
- Pass in a PDF (URL or file buffer)
- Turn the PDF into a series of images
- Pass each image to GPT and ask nicely for Markdown
- Aggregat
Tyler Maran • GitHub - getomni-ai/zerox: Zero shot pdf OCR with gpt-4o-mini
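The general logic listed for Zerox can be sketched in a few lines of Python. This is a minimal illustration and not Zerox's actual API: `pdf_to_markdown`, `render_pages`, and `image_to_markdown` are hypothetical names, the page renderer and the vision-model call are passed in as plain callables, and the last step (joining per-page Markdown with blank lines) is an assumption, since the final bullet is cut off in the source.

```python
from typing import Callable, Iterable, List


def pdf_to_markdown(
    pdf_bytes: bytes,
    render_pages: Callable[[bytes], Iterable[bytes]],
    image_to_markdown: Callable[[bytes], str],
) -> str:
    """Convert a PDF to Markdown one page image at a time.

    render_pages:      hypothetical callable that turns PDF bytes into page images.
    image_to_markdown: hypothetical callable that sends one image to a vision
                       model and returns the Markdown it produces.
    """
    pages: List[str] = []
    for image in render_pages(pdf_bytes):
        # One model call per page image.
        pages.append(image_to_markdown(image).strip())
    # Assumed aggregation step: join pages with a blank line between them.
    return "\n\n".join(pages)
```

In practice `render_pages` would wrap a PDF rasterizer and `image_to_markdown` would wrap a call to a vision model; keeping both injectable lets the flow be tested with stubs.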
Super JSON Mode is a Python framework that enables the efficient creation of structured output from an LLM by breaking up a target schema into atomic components and then performing generations in parallel.
It supports both state-of-the-art LLMs via OpenAI's legacy completions API and open-source LLMs via Hugging Face Transformers and vLLM ...
varunshenoy • GitHub - varunshenoy/super-json-mode: Low latency JSON generation using LLMs ⚡️
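The idea of splitting a schema into atomic components and generating them in parallel can be illustrated with a small sketch. It is not Super JSON Mode's API: `fill_schema_in_parallel` and `generate` are hypothetical names, the prompt wording is invented for illustration, and a thread pool stands in for whatever batching the library actually performs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def fill_schema_in_parallel(
    document: str,
    schema: Dict[str, str],
    generate: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fill each schema field with an independent, concurrent generation.

    schema:   maps field name -> short description of the field.
    generate: hypothetical callable wrapping an LLM completion call.
    """
    # One small prompt per atomic field instead of one large JSON prompt.
    prompts = {
        field: f"{document}\n\nExtract the {field} ({description}):"
        for field, description in schema.items()
    }
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            field: pool.submit(generate, prompt)
            for field, prompt in prompts.items()
        }
        # Reassemble the independent answers into one structured result.
        return {field: future.result().strip() for field, future in futures.items()}
```

Because each field is generated independently, the slowest single field sets the overall latency, instead of the total length of one long JSON completion.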
DataTrove
DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory...
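The "processing blocks" idea can be pictured as a chain of small generator stages. The sketch below is plain Python and does not use DataTrove's classes or API: `min_length_filter`, `exact_dedup`, and `run_pipeline` are hypothetical names, and hash-based exact matching is a stand-in for whatever deduplication a real pipeline would use.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# A block consumes a stream of documents and yields a (possibly smaller) stream.
Block = Callable[[Iterable[str]], Iterator[str]]


def min_length_filter(min_chars: int) -> Block:
    """Hypothetical filter block: drop documents shorter than min_chars."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            if len(doc.strip()) >= min_chars:
                yield doc
    return block


def exact_dedup() -> Block:
    """Hypothetical dedup block: keep the first copy of each normalized text."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        seen = set()
        for doc in docs:
            digest = hashlib.sha1(doc.strip().lower().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc
    return block


def run_pipeline(docs: Iterable[str], blocks: List[Block]) -> List[str]:
    """Chain the blocks lazily, then materialize the final stream."""
    stream: Iterable[str] = docs
    for block in blocks:
        stream = block(stream)
    return list(stream)
```

Because each stage is a generator, documents flow through one at a time; only the dedup stage holds state, and that state is a set of hashes, not the documents themselves.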