Contextual compression is a retrieval and preprocessing technique that reduces the amount of text sent to a language model by extracting, rewriting, or summarizing only the information relevant to a specific query, while preserving citations and faithfulness.
What is Contextual Compression?
In many RAG systems, retrieval returns multiple chunks that contain both relevant and irrelevant material. Passing all of that text into the LLM increases cost, latency, and the risk of distraction or hallucination. Contextual compression adds an intermediate step between retrieval and generation: a “compressor” model or algorithm filters each retrieved chunk to keep only query-relevant spans, or produces a shorter, query-conditioned summary. Compression can be extractive (select sentences), abstractive (rewrite into a concise summary), or hybrid (extract + rewrite), and it can be applied per-document, per-chunk, or across the whole retrieved set.
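The extractive variant can be sketched with plain keyword overlap — a toy stand-in for the learned relevance scorers or LLM-based compressors used in practice (the function name and scoring heuristic here are illustrative, not a standard API):

```python
import re

def extract_relevant_sentences(query: str, chunk: str, top_k: int = 3) -> list[str]:
    """Extractive compression: keep the sentences with the most query-term overlap."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Split the chunk into sentences on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    scored = []
    for idx, sent in enumerate(sentences):
        overlap = len(query_terms & set(re.findall(r"\w+", sent.lower())))
        if overlap > 0:
            scored.append((overlap, idx, sent))
    # Take the highest-overlap sentences, then restore original document order.
    top = sorted(scored, key=lambda t: -t[0])[:top_k]
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]

chunk = ("Refunds are processed within 14 days. "
         "Our offices are closed on public holidays. "
         "Refund requests must include the original receipt.")
print(extract_relevant_sentences(
    "When are refunds processed and what must requests include?", chunk, top_k=2))
```

Note that the irrelevant sentence about office hours is dropped while the two refund-related sentences survive in their original order — the essence of extractive compression.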
A common implementation is a two-stage pipeline: (1) retrieve candidate passages using vector and/or keyword search, then (2) run a compressor that outputs a compact context along with source references. Compression is particularly useful when retrieved documents are long (policies, technical manuals) or when the model’s context window is limited.
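The two-stage pipeline can be sketched end to end — a minimal sketch assuming a toy in-memory corpus and naive keyword scoring in place of real vector/keyword search (all names here are hypothetical):

```python
def retrieve(query: str, corpus: dict[str, str], top_n: int = 2) -> list[tuple[str, str]]:
    """Stage 1: rank (doc_id, text) pairs by naive query-term overlap."""
    q = set(query.lower().split())
    return sorted(corpus.items(),
                  key=lambda kv: -len(q & set(kv[1].lower().split())))[:top_n]

def compress(query: str, retrieved: list[tuple[str, str]]) -> str:
    """Stage 2: keep only query-relevant sentences, each tagged with its source ID."""
    q = set(query.lower().split())
    kept = []
    for doc_id, text in retrieved:
        for sent in text.split(". "):
            if q & set(sent.lower().split()):
                kept.append(f"[{doc_id}] {sent.rstrip('.')}.")
    return "\n".join(kept)

corpus = {
    "doc1": "The warranty covers parts for two years. Shipping is free over $50.",
    "doc2": "Returns require a receipt. The warranty excludes water damage.",
    "doc3": "Gift cards never expire.",
}
context = compress("warranty coverage", retrieve("warranty coverage", corpus))
print(context)
```

The compact context that reaches the LLM carries `[doc1]`/`[doc2]` source references, so generated answers can still cite evidence after compression.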
Where it’s used and why it matters
Contextual compression is used in production RAG for enterprise search, customer support, and compliance assistants. It matters because it improves the “signal-to-noise” ratio of the context that conditions generation. With less irrelevant text, models tend to follow instructions better and cite evidence more accurately. It also reduces token usage, enabling lower cost per request or allowing more documents to be considered within the same context budget. The main risks are loss of crucial details (over-compression) and faithfulness issues in abstractive summaries, so teams often favor extractive compression or enforce citation-backed summaries.
Examples
- Extractive span selection: keep only sentences that mention the queried entity or constraint.
- Query-conditioned summarization: rewrite a long policy section into 5–10 bullet points relevant to the question.
- Hybrid compression with citations: extract key sentences, then paraphrase while attaching chunk IDs.
- Adaptive compression: compress more aggressively when many passages are retrieved or when context budget is tight.
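The adaptive case in the last bullet can be reduced to a budget heuristic — a hypothetical sketch (the quota formula and the 25-token average are assumptions, not a published rule) showing how a per-passage sentence quota shrinks as more passages compete for the same context budget:

```python
def adaptive_quota(num_passages: int, token_budget: int,
                   avg_tokens_per_sentence: int = 25) -> int:
    """How many sentences each retrieved passage may keep under a shared token budget."""
    total_sentences = token_budget // avg_tokens_per_sentence
    # Split the sentence budget evenly; always keep at least one sentence per passage.
    return max(1, total_sentences // max(1, num_passages))

print(adaptive_quota(4, 1000))   # few passages: generous quota per passage
print(adaptive_quota(20, 1000))  # many passages: aggressive compression
```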
FAQs
Is contextual compression the same as summarization?
It can include summarization, but it is explicitly query-conditioned and typically constrained to preserve evidence and relevance.
When should you use extractive vs. abstractive compression?
Extractive is safer for faithfulness and citations; abstractive can be shorter and clearer but needs stronger evaluation and guardrails.
How do you evaluate compression quality?
Measure answer accuracy and citation faithfulness, and separately evaluate whether the compressed context retains all necessary supporting facts.
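The "retains all necessary supporting facts" check can be approximated with a simple retention metric — a minimal sketch assuming a hand-labeled list of gold supporting facts per query (substring matching here stands in for more robust entailment-based checks):

```python
def fact_retention(compressed: str, gold_facts: list[str]) -> float:
    """Fraction of required supporting facts still present after compression."""
    text = compressed.lower()
    kept = sum(1 for fact in gold_facts if fact.lower() in text)
    return kept / len(gold_facts)

score = fact_retention("Refunds are issued within 14 days.",
                       ["14 days", "receipt required"])
print(score)  # 0.5 — one of the two required facts was compressed away
```

A score below 1.0 flags over-compression for that query, independently of whether the downstream answer happened to be correct.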
Does compression replace better retrieval?
No. It complements retrieval: good retrieval finds relevant sources; compression reduces noise and fits evidence into the model’s context window.