GitHub - mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks
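A minimal sketch of the cache policy StreamingLLM describes: keep a few initial "attention sink" tokens plus a sliding window of recent tokens in the KV cache, evicting everything in between. The function and parameter names here (evict_kv_cache, sink_size, window_size) are illustrative, not the repo's actual API.

```python
# Sketch of the StreamingLLM eviction rule: retain the first few
# "attention sink" tokens plus a sliding window of recent tokens.
# Names below are illustrative, not the mit-han-lab/streaming-llm API.

def evict_kv_cache(positions, sink_size=4, window_size=1024):
    """Return the token positions retained in the KV cache."""
    if len(positions) <= sink_size + window_size:
        return positions  # cache not full yet, nothing to evict
    sinks = positions[:sink_size]      # always keep the first tokens
    recent = positions[-window_size:]  # plus the most recent window
    return sinks + recent

# After 2000 generated tokens, only 4 sinks + 1024 recent entries remain,
# so cache size stays constant no matter how long generation runs.
kept = evict_kv_cache(list(range(2000)))
assert kept[:4] == [0, 1, 2, 3] and len(kept) == 1028
```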

GitHub - microsoft/LLMLingua: To speed up LLM inference and sharpen a model's perception of key information, LLMLingua compresses the prompt and KV cache, achieving up to 20x compression with minimal performance loss.
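A hedged usage sketch following the pattern shown in the LLMLingua README; the default compressor model, keyword arguments, and return keys may differ between versions of the library.

```python
# Sketch of prompt compression with LLMLingua, per its README usage
# pattern (exact parameters may vary across releases).
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small LM to score token salience

result = compressor.compress_prompt(
    "...long retrieved context, many paragraphs...",  # placeholder context
    instruction="Answer using only the context.",
    question="What are attention sinks?",
    target_token=200,  # rough token budget for the compressed prompt
)
print(result["compressed_prompt"])  # result also reports token counts / ratio
```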

Tom Aarsen 🕳️ Attention Sinks in LLMs for endless fluency
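A sketch of the drop-in usage the attention_sinks package advertises: swap the transformers import and pass sink/window sizes. The model name and keyword arguments follow its README and may change between releases.

```python
# Sketch of Tom Aarsen's attention_sinks drop-in: same interface as
# transformers' AutoModelForCausalLM, but the KV cache is capped at
# sink tokens + a recent window. Kwarg names taken from the README.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",       # any decoder-only HF model
    attention_sink_size=4,            # initial tokens kept as sinks
    attention_sink_window_size=1020,  # recent tokens kept in the cache
)
# The model can then generate fluently past its pretraining context
# length, since the cache never grows beyond 4 + 1020 entries.
```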