Context Is a Data Structure, Not a String
Why LLM agents degrade before hitting token limits, and how treating context as a graph with tiered compression changes everything.
The hardest problem in building LLM agents isn’t prompting or tool use—it’s context management. And most systems get it wrong by treating context as a flat string that you stuff tokens into until it overflows.
I spent the last few months building ContextEngine, an open-source platform for managing LLM context. The core insight: context degradation happens well before you hit token limits, and the solution requires thinking about context as a first-class data structure.
The Pre-Rot Problem
Here’s something that surprised me: LLM agent quality drops significantly after about 50 tool calls, typically around 60-70% of context capacity. Not at 100%. Not even close.
Quality │
100% │████████████████████████
│ ████████
80% │ ████████
│ ████
60% │ ████
│────────────────────────────────────────────────────
0% 25% 50% 65% 80% 100%
Token Usage
│◄──── Safe ────►│◄ Pre-Rot ►│◄── Degraded ──►│
I call this “pre-rot.” The context window fills with tool results, intermediate outputs, and accumulated state. The model can still process it—but the signal-to-noise ratio tanks. Important information gets buried. The model starts missing things it saw earlier.
This isn’t a token limit problem. It’s an information density problem.
Why Flat Context Fails
Most RAG and agent systems treat context as a buffer: append new content, maybe truncate old content when you run out of space. This breaks in predictable ways:
No relationships. A tool result exists in isolation from the query that spawned it. An entity mentioned in message 5 has no explicit link to the same entity in message 47.
Naive truncation. When space runs out, old content gets dropped. But “old” doesn’t mean “unimportant.” Often the most critical context—the original task, key constraints, early decisions—came first.
No reversibility. Once you summarize or drop content, it’s gone. If the model later needs that detail, you’re out of luck.
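The truncation failure is easy to reproduce with a toy buffer that evicts oldest-first when it overflows. This is a sketch of the generic pattern, not any particular framework's behavior; `truncate_oldest` and the character-count token proxy are illustrative assumptions:

```python
# Toy flat-context buffer: append-only, evict oldest when over budget.
def truncate_oldest(messages, max_tokens, count_tokens=len):
    """Drop messages from the front until the buffer fits the budget."""
    total = sum(count_tokens(m) for m in messages)
    kept = list(messages)
    while kept and total > max_tokens:
        total -= count_tokens(kept.pop(0))  # oldest-first eviction
    return kept

history = [
    "TASK: migrate the billing service, but never touch prod data",  # critical constraint
    "tool_result: " + "x" * 200,  # bulky intermediate output
    "tool_result: " + "y" * 200,
]
survivors = truncate_oldest(history, max_tokens=450)
# The original task -- the most important line -- is the first to go.
assert not any(s.startswith("TASK") for s in survivors)
```

"Old" was evicted first, but "old" here included the one constraint the agent must never violate.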
Context as a Graph
The fix starts with representation. Context isn’t a string—it’s a graph of typed nodes with relationships between them.
┌─────────────┐ TEMPORAL ┌─────────────┐
│ MESSAGE │──────────────────►│ TOOL_CALL │
│ "Find all │ │ glob │
│ Python files│ CAUSAL │ pattern:... │
└─────────────┘◄──────────────────└─────────────┘
│
TOOL_IO
▼
┌─────────────┐
│ TOOL_RESULT │
│ [files...] │
└─────────────┘
Nodes have types: MESSAGE, TOOL_CALL, TOOL_RESULT, ARTIFACT, ENTITY, SUMMARY. Edges have types: TEMPORAL, CAUSAL, REFERENCES, SUMMARIZES, SAME_ENTITY.
This representation enables operations that flat strings can’t support:
- Query nodes by entity (show me everything about “the API endpoint”)
- Traverse causal chains (what led to this result?)
- Preserve relationships through compression (keep the links, shrink the content)
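Here is a minimal sketch of what such a graph can look like in Python. The node and edge types mirror the list above, but the class names and methods are illustrative, not ContextEngine's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    type: str      # MESSAGE, TOOL_CALL, TOOL_RESULT, ARTIFACT, ENTITY, SUMMARY
    content: str

@dataclass
class ContextGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src_id, edge_type, dst_id)

    def add(self, node):
        self.nodes[node.id] = node

    def link(self, src, edge_type, dst):
        self.edges.append((src, edge_type, dst))

    def by_entity(self, name):
        # Query nodes by entity (naive substring match for illustration).
        return [n for n in self.nodes.values() if name in n.content]

    def causes_of(self, node_id):
        # Traverse CAUSAL edges backwards: what led to this node?
        chain, current = [], node_id
        while True:
            parents = [s for s, t, d in self.edges if t == "CAUSAL" and d == current]
            if not parents:
                return chain
            current = parents[0]
            chain.append(current)

g = ContextGraph()
g.add(Node("m1", "MESSAGE", "Find all Python files"))
g.add(Node("c1", "TOOL_CALL", "glob pattern: **/*.py"))
g.add(Node("r1", "TOOL_RESULT", "[main.py, utils.py]"))
g.link("m1", "CAUSAL", "c1")
g.link("c1", "TOOL_IO", "r1")
g.link("c1", "CAUSAL", "r1")
```

With this in place, "what led to this result?" is a walk (`g.causes_of("r1")` yields `["c1", "m1"]`) rather than a substring hunt through a transcript.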
Tiered Compression
With a graph representation, compression becomes strategic instead of desperate. ContextEngine uses three tiers:
Lossless (100% recoverable). Externalize large payloads to storage with pointers. Deduplicate semantically similar content. Collapse sequential tool chains into summaries with preserved references. This alone typically achieves 2-5x compression.
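Externalization, the simplest of the lossless moves, swaps a bulky payload for a content-addressed pointer while the full payload lives in a side store. A sketch with an in-memory dict standing in for real blob storage; the pointer format is an assumption:

```python
import hashlib

store = {}  # stands in for external blob storage

def externalize(content, inline_limit=100):
    """Swap large payloads for a pointer; small ones stay inline."""
    if len(content) <= inline_limit:
        return content
    key = hashlib.sha256(content.encode()).hexdigest()[:12]
    store[key] = content                        # full payload kept in storage
    return f"<blob:{key} size={len(content)}>"  # pointer left in context

def restore(pointer_or_content):
    """Reverse the swap: follow the pointer back to the stored payload."""
    if pointer_or_content.startswith("<blob:"):
        key = pointer_or_content.split(":", 1)[1].split(" ")[0]
        return store[key]
    return pointer_or_content

payload = "file contents " * 50        # ~700 characters of tool output
pointer = externalize(payload)
assert len(pointer) < len(payload)     # context shrinks...
assert restore(pointer) == payload     # ...but nothing is lost
```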
Compaction (80-95% recoverable). Extract and reference repeated JSON schemas instead of duplicating them. Keep only entity-relevant sentences. Filter by current task relevance. Another 2-4x on top of lossless.
Summarization (last resort). Hierarchical summaries that preserve relationships. Task-aware compression that weights by relevance. Incremental updates to running summaries. 5-10x compression, but irreversible.
The key is ordering: try lossless first, then compaction, then summarization only when necessary. And track everything in a recovery manifest so you can restore content later if the model needs it.
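That ordering can be sketched as a pipeline that tries each tier in turn, stops once the target is met, and records every step in a recovery manifest. The tier functions and compression ratios here are stand-ins, not the package's actual strategies:

```python
# Tiers ordered cheapest-to-restore first; each returns (new_size, note).
def lossless(size):      return size // 3, "externalized + deduplicated"
def compaction(size):    return size // 3, "schema refs + relevance filter"
def summarization(size): return size // 7, "hierarchical summary (irreversible)"

def compress(size, target, manifest):
    for tier, recoverable in [(lossless, True), (compaction, True), (summarization, False)]:
        if size <= target:
            break
        size, note = tier(size)
        # Recovery manifest: record what was done and whether it can be undone.
        manifest.append({"tier": tier.__name__, "note": note, "recoverable": recoverable})
    return size

manifest = []
final = compress(90_000, target=20_000, manifest=manifest)
# Lossless (90k -> 30k) then compaction (30k -> 10k) suffice;
# the irreversible summarization tier never runs.
assert final <= 20_000
assert [m["tier"] for m in manifest] == ["lossless", "compaction"]
```

The manifest is what makes the "potentially restore" promise concrete: every reversible step leaves a trail back to the original content.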
Pre-Rot Detection
Compression is reactive. It's better to catch pre-rot before it sets in.
ContextEngine tracks token budgets with configurable thresholds:
budget = TokenBudget(
    total_tokens=100_000,
    warning_threshold=0.5,    # Start monitoring
    trigger_threshold=0.65,   # Start compressing
    critical_threshold=0.8,   # Aggressive compression
)

if budget.status.needs_compression:
    pipeline.compress(graph, max_tier=CompressionTier.COMPACTION)
The thresholds are based on empirical observation: quality starts dropping around 60-70% capacity. By triggering compression at 65%, you maintain headroom for new context while preserving information density.
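The status check behind those thresholds reduces to a few comparisons. A minimal sketch; the field names follow the snippet above, but the `status()` method and its return values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    total_tokens: int
    warning_threshold: float = 0.5
    trigger_threshold: float = 0.65
    critical_threshold: float = 0.8
    used: int = 0

    def status(self):
        usage = self.used / self.total_tokens
        if usage >= self.critical_threshold:
            return "critical"   # aggressive compression, all tiers allowed
        if usage >= self.trigger_threshold:
            return "compress"   # start lossless/compaction passes
        if usage >= self.warning_threshold:
            return "warning"    # monitor, no action yet
        return "ok"

budget = TokenBudget(total_tokens=100_000, used=67_000)
assert budget.status() == "compress"  # 67% is past the 65% trigger
```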
What I Learned
Three things surprised me building this:
Structure beats volume. A well-organized 30k context consistently outperforms a naive 80k context in my testing. The model can find and use information that’s properly linked and deduplicated.
Reversibility matters more than ratio. A 3x compression you can undo beats a 10x compression you can’t. When the model needs detail later, you want the option to restore it.
Graph operations are fast enough. I worried that maintaining a proper graph structure would add latency. In practice, the operations are sub-millisecond. The bottleneck is always the LLM call, never the context management.
Current State
ContextEngine is open source with five packages:
- context-core: Graph, entities, semantic index, token budget (358 tests)
- context-compression: Pipeline with 9 strategies (311 tests)
- context-memory: Storage backends, tiered storage (307 tests)
- context-tools: Tool caching and patterns (283 tests)
- context-observe: OpenTelemetry tracing, metrics
Phase 4 (multi-agent coordination) is planned but not started. The foundation is solid enough to use today.
Try It
uv add context-core context-compression
from context_core import ContextGraph, TokenBudget
from context_compression import CompressionPipeline

graph = ContextGraph(session_id="my-session")
# ... add nodes ...

budget = TokenBudget(total_tokens=100_000)
if budget.allocate("context", graph.total_tokens).needs_compression:
    CompressionPipeline().compress(graph)
The full code is at github.com/Sean-Koval/context-engineering. Contributions welcome—especially on the multi-agent coordination layer.
Context is too valuable to waste on naive string concatenation.