🕳️ Attention Sinks in LLMs for endless fluency

GitHub - mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks

mit-han-lab · github.com
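The repo's core observation: transformer heads dump a large share of attention mass on the first few tokens, so keeping those "attention sink" tokens in the KV cache alongside a sliding window of recent tokens lets the model stream far beyond its training length without fluency collapsing. A minimal sketch of that eviction policy in plain Python (`sink_size` and `window_size` are illustrative names and values, not the repo's actual API):

```python
# Sketch of StreamingLLM-style KV-cache eviction: keep the first
# `sink_size` tokens (attention sinks) forever, plus the most recent
# `window_size` tokens, and evict everything in between.
def evict_kv_cache(keys, values, sink_size=4, window_size=1020):
    """keys/values: per-token KV entries, oldest first."""
    capacity = sink_size + window_size
    if len(keys) <= capacity:
        return keys, values  # cache still fits; nothing to evict
    # Retain sinks + recent window; drop the middle of the sequence.
    keys = keys[:sink_size] + keys[-window_size:]
    values = values[:sink_size] + values[-window_size:]
    return keys, values

# Toy usage: the cache stays bounded at sink_size + window_size entries
# no matter how long generation runs.
cache_k, cache_v = [], []
for t in range(5000):
    cache_k.append(f"k{t}")
    cache_v.append(f"v{t}")
    cache_k, cache_v = evict_kv_cache(cache_k, cache_v)
assert len(cache_k) == 4 + 1020
```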

What We Learned From a Year of Building With LLMs

Bryan Bischof · oreilly.com

AI Revolution - Transformers and Large Language Models (LLMs)

Elad Gil · blog.eladgil.com

GitHub - microsoft/LLMLingua: To speed up LLM inference and sharpen the model's perception of key information, compress the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.

microsoft · github.com
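LLMLingua drives the compression with a small causal LM that scores and drops low-information tokens. A minimal usage sketch based on the project's README (assumes `pip install llmlingua`; the placeholder prompt, question, and token budget are illustrative, and argument names may shift between versions):

```python
from llmlingua import PromptCompressor

# The default compressor downloads a small scoring model on first use.
compressor = PromptCompressor()

long_prompt = "...many paragraphs of retrieved context..."  # placeholder

# Request a ~200-token budget; LLMLingua removes the tokens its scoring
# model judges least informative given the instruction and question.
result = compressor.compress_prompt(
    long_prompt,
    instruction="Answer the question using the context.",
    question="What are attention sinks?",
    target_token=200,
)

# Per the README, the result is a dict containing the compressed text
# along with before/after token counts.
print(result["compressed_prompt"])
```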