You have a 128K context window. You fill it.
That is not leverage. That is laziness with a large budget.
Every token you send costs latency and money. Long prompts delay the first token. They pressure KV cache memory. On shared infrastructure, they squeeze out concurrent requests. On your own hardware, they are the difference between serving one user and serving ten.
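The KV cache pressure is easy to put a number on. The sketch below is back-of-the-envelope arithmetic, not a profiler: it assumes a hypothetical model with 32 layers, 8 key/value heads of dimension 128, and 16-bit cache values. Those figures are illustrative assumptions; substitute the ones from your own model config.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for a single sequence.

    Each layer stores one key vector and one value vector
    per KV head for every token in the context.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


# Hypothetical config (an assumption, not tied to any specific model).
config = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

full = kv_cache_bytes(128 * 1024, **config)    # a filled 128K window
trimmed = kv_cache_bytes(16 * 1024, **config)  # the same request cut to 16K

print(f"128K-token prompt: {full / 2**30:.1f} GiB of KV cache")
print(f"16K-token prompt:  {trimmed / 2**30:.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with prompt length, so every token you remove frees the same fixed amount of accelerator memory for another request.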
The instinct to throw context at a problem is understandable. More context feels safer. It feels like you are giving the model everything it needs. But attention is not uniform across a long context. A model asked to reason over 60,000 tokens does not attend to all 60,000 tokens equally — research on long-context behavior consistently shows that retrieval accuracy degrades when the relevant information is buried in the middle of a long window. Filling the context is not the same as informing the model.
Compression is the discipline of sending what matters.
LLMLingua, published at EMNLP 2023 by researchers at Microsoft, treats prompt compression as a perplexity-scoring problem. A small language model (GPT2-small or LLaMA-7B) scores each token by its conditional perplexity. High-perplexity tokens carry information; low-perplexity tokens are redundant. The method removes the redundant ones. The result: up to 20x compression with minimal downstream task performance loss. The compressed prompt is barely readable by humans. The model does not care. It needs the high-information tokens, not the prose.
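The score-then-drop idea can be sketched in a few lines. This is a toy stand-in, not LLMLingua itself: it replaces the small language model's conditional probabilities with smoothed unigram frequencies from a tiny reference corpus, and it omits the rest of the real pipeline entirely. The function names and the `keep_ratio` parameter are hypothetical.

```python
import math
from collections import Counter


def surprisal_scores(tokens, reference_counts):
    """Score each token by -log2 of its add-one-smoothed unigram probability.

    Assumption: unigram frequency stands in for the conditional
    probability a small causal LM would assign.
    """
    total = sum(reference_counts.values())
    vocab_size = len(reference_counts) + 1  # +1 bucket for unseen tokens
    return [
        -math.log2((reference_counts.get(t, 0) + 1) / (total + vocab_size))
        for t in tokens
    ]


def compress(tokens, reference_counts, keep_ratio=0.5):
    """Keep the highest-surprisal tokens, preserving their original order."""
    scores = surprisal_scores(tokens, reference_counts)
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank by surprisal, breaking ties by earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


reference = Counter("the of and to a in is that it the the of to a is the".split())
prompt = "the invoice total is 4812 and it is due in march".split()

compressed = compress(prompt, reference, keep_ratio=0.5)
print(" ".join(compressed))
```

Frequent function words score low and are dropped first; rare content words survive. A real compressor scores tokens in context rather than in isolation, which is what makes aggressive ratios workable.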
LLMLingua-2, published in the Findings of ACL 2024, reframes compression as a token classification task: a bidirectional Transformer encoder decides, token by token, what to keep. It is faster and more faithful to the original content, because a bidirectional encoder can assess a token in full context, not just the left-to-right view of a generative model.
These are token-dropping methods. A parallel family of approaches, including AutoCompressors (Princeton, 2023) and ICAE (ICLR 2024), goes further. Instead of dropping tokens, they compress context into dense summary vectors that the model conditions on as soft prompts. ICAE achieves 4x context compression using only ~1% additional LoRA parameters on top of the base model. The compressed representation is not text; it is a learned embedding of what the text contains.
The practical question is not which method to use. The practical question is: are you thinking about this at all?
Most RAG pipelines retrieve and concatenate. Most agents dump their tool outputs directly into context. Most system prompts are written once and never audited for redundancy. Every token in those contexts is a token you are paying to attend to — even if the model would have answered correctly without it.
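One alternative to retrieve-and-concatenate is to give the context an explicit budget. The sketch below is a minimal illustration under stated assumptions: each retrieved chunk arrives with a relevance score from your retriever, and whitespace-separated words serve as a rough proxy for tokens. `pack_context` is a hypothetical helper, not part of any library.

```python
def pack_context(chunks, budget_tokens):
    """Greedily select the highest-scoring chunks that fit a token budget.

    chunks: list of (relevance_score, text) pairs.
    Returns (selected_texts, tokens_used). Token cost is approximated
    by whitespace word count, which is an assumption; use your model's
    tokenizer for real accounting.
    """
    selected = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = len(text.split())
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow the budget
        selected.append(text)
        used += cost
    return selected, used


retrieved = [
    (0.91, "Invoice 4812 is due on the first of March."),
    (0.15, "Our company was founded with a mission to delight customers everywhere."),
    (0.78, "Late payments incur a two percent fee."),
]

context, used = pack_context(retrieved, budget_tokens=18)
print(used, context)
```

The point is less the greedy policy than the existence of a budget at all: once there is a ceiling, low-relevance text has to compete for its place.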
Start with the cheapest intervention: audit your prompts. Remove boilerplate. Remove repeated instructions. Remove context that does not change the answer. Measure output quality before and after each cut. You will usually find you can cut 30-40% of tokens before accuracy moves.
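The mechanical part of that audit can be scripted. The sketch below covers only one step, removing repeated lines, and reports a rough before-and-after size. It assumes instructions sit one per line and uses whitespace word count as a stand-in for tokens. The helper names are hypothetical, and the quality measurement still has to come from your own evaluation set.

```python
def dedupe_lines(prompt):
    """Drop lines that repeat an earlier line, ignoring case and spacing."""
    seen = set()
    kept = []
    for line in prompt.splitlines():
        key = " ".join(line.lower().split())
        if key and key in seen:
            continue  # an exact repeat of an earlier instruction
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def rough_token_count(text):
    """Whitespace word count; an approximation, not a real tokenizer."""
    return len(text.split())


system_prompt = (
    "You are a helpful assistant.\n"
    "Answer in JSON.\n"
    "\n"
    "Answer in JSON.\n"
    "you are a  helpful assistant.\n"
    "Question: what is due?"
)

audited = dedupe_lines(system_prompt)
before = rough_token_count(system_prompt)
after = rough_token_count(audited)
print(f"{before} -> {after} rough tokens")
```

Run the same evaluation prompts against both versions before keeping the shorter one; deduplication is safe only if the repeated instruction was not doing real work.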
Then, if you need more, reach for a tool.
Compression is not cheating. It is what engineers do with budgets.