Context Window Packing

aka Context Compression, Token Budget Management, Fit in Context, Token Cost Reduction

category: memory · status: mature

Intent

Choose what fits in the context window each turn given a fixed token budget.

Context

Agents whose candidate context (system prompt + history + retrieved chunks + tool definitions + state) exceeds the model's context window.

Problem

Naive concatenation overflows; naive truncation loses critical state.

Forces

What to drop is task-dependent.
Compression has its own LLM cost.
Part of the window must stay reserved for the response itself.

Solution

Define a packing policy. Reserve N tokens for system + tools + response. Allocate the rest across history (compressed), retrieved chunks (top-k after rerank), and current state. Use eviction (drop oldest), summarisation (compress), or selection (relevance-rank) policies. Audit token counts before each call.
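
The policy above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: count_tokens is a whitespace stand-in for the model's real tokenizer, and the names pack, pack_history, pack_chunks and the history_share split are hypothetical. It shows eviction for history and selection for already-reranked chunks; summarisation is left out because it needs an LLM call.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    # ASSUMPTION: whitespace split as a stand-in. Use the model's own
    # tokenizer in practice; real counts will differ.
    return len(text.split())


def pack_history(turns: List[str], limit: int) -> List[str]:
    """Eviction: keep the newest turns that fit, drop the oldest."""
    kept: List[str] = []
    used = 0
    for turn in reversed(turns):          # walk newest to oldest
        cost = count_tokens(turn)
        if used + cost > limit:
            break                         # everything older is evicted
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    return kept


def pack_chunks(ranked_chunks: List[str], limit: int) -> List[str]:
    """Selection: take chunks in rank order, skipping any that do not fit."""
    kept: List[str] = []
    used = 0
    for chunk in ranked_chunks:           # assumed already reranked, best first
        cost = count_tokens(chunk)
        if used + cost > limit:
            continue                      # a smaller, lower-ranked chunk may still fit
        kept.append(chunk)
        used += cost
    return kept


def pack(window: int, reserved: int, turns: List[str],
         ranked_chunks: List[str], state: str,
         history_share: float = 0.6) -> Dict[str, object]:
    """Split the non-reserved budget across state, history and chunks.

    reserved covers system prompt + tools + response.
    history_share is a hypothetical knob, not a recommended value.
    """
    available = window - reserved
    state_cost = count_tokens(state)
    if state_cost > available:
        raise ValueError("current state alone exceeds the available budget")
    remaining = available - state_cost    # state is kept whole
    history_limit = int(remaining * history_share)
    chunk_limit = remaining - history_limit
    return {
        "state": state,
        "history": pack_history(turns, history_limit),
        "chunks": pack_chunks(ranked_chunks, chunk_limit),
    }
```

In this sketch the current state is always kept whole and the remainder is split by a fixed ratio; a task-dependent policy would replace that ratio with whatever allocation rule suits the agent.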

Constraints

Total tokens passed to the model must not exceed the window minus the reserved response budget.
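
One way to enforce this before each call is a small audit step. The sketch below assumes the caller has already counted tokens per section with the model's tokenizer; the function name audit_budget and the section labels are hypothetical.

```python
from typing import Dict


def audit_budget(section_tokens: Dict[str, int], window: int,
                 response_reserve: int) -> int:
    """Check that input tokens fit in window - response_reserve.

    section_tokens maps a section label (e.g. "system", "history") to its
    token count, as measured by the caller. Returns the unused headroom,
    or raises ValueError naming the largest section.
    """
    limit = window - response_reserve
    total = sum(section_tokens.values())
    if total > limit:
        largest = max(section_tokens, key=section_tokens.get)
        raise ValueError(
            f"input is {total} tokens, limit is {limit}; "
            f"largest section is '{largest}' ({section_tokens[largest]} tokens)"
        )
    return limit - total
```

Failing loudly here, rather than silently truncating, keeps the trade-off inspectable: the error names the section most worth compressing or evicting.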

Consequences

Benefits

Predictable behaviour at the window edge.
Inspectable trade-offs.

Liabilities

Complexity of the packing logic.
Compression artefacts.

Known Uses

LangChain ConversationSummaryBufferMemory
Most production agent frameworks

Related Patterns

uses → episodic-summaries
alternative-to → memgpt-paging
complements → dynamic-scaffolding
used-by → todo-list-driven-agent
used-by → reasoning-trace-carry-forward

References

Lost in the Middle: How Language Models Use Long Contexts — Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, Liang (2023) paper