GitHub - turboderp/exllamav2: A fast inference library for running LLMs locally on modern consumer-class GPUs

Ollama

slowllama

Fine-tune Llama2 and CodeLlama models, including 70B/35B, on Apple M1/M2 devices (for example, MacBook Air or Mac Mini) or consumer NVIDIA GPUs.

slowllama does not use any quantization. Instead, it offloads parts of the model to SSD or main memory on both the forward and backward passes. In contrast with training large models from scratch (unattainable...

okuvshynov • GitHub - okuvshynov/slowllama: Finetune llama2-70b and codellama on MacBook Air without quantization

Deep learning at the speed of light.

Luminal is a deep learning library that uses composable compilers to achieve high performance.

use luminal::prelude::*;

// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.tensor().set([[1.0], [2.0], [3.0]]);
let b = cx.tensor().set([[1.0, 2.0, 3.0, 4.0]]);

// Do math...
let mut c = a.matmul(b).retrieve();
...

jafioti • GitHub - jafioti/luminal: Deep learning at the speed of light.

2-5x faster 50% less memory local LLM finetuning

Manual autograd engine - hand derived backprop steps.

2x to 5x faster than QLoRA. 50% less memory usage.

All kernels written in OpenAI's Triton language.

0% loss in accuracy - no approximation methods - all exact.

No hardware changes necessary. Supports NVIDIA GPUs from 2018 onward. Minimum CUDA Compute Cap...

unslothai • GitHub - unslothai/unsloth: 5X faster 50% less memory LLM finetuning