Context Management
Myrm’s context management system ensures agents can sustain 200+ turn conversations without information loss, while keeping costs low through intelligent compression and cache optimization.
Architecture Overview
Every message passes through a multi-stage pipeline before reaching the LLM.
Compression Pipeline
Layer 1: Instant Filtering
Large tool outputs (e.g., file contents, search results) are immediately truncated so that only the most relevant portions remain. This step makes no LLM call, so it has zero API cost.
Layer 2: Cache TTL Pruning
Expired cached content is automatically cleared. This prevents stale data from consuming valuable context window space.
Layer 3: Priority-Aware Compression
Messages are classified into three priority tiers:
| Priority | Treatment |
|---|---|
| Critical | Never compressed (user instructions, active errors) |
| Important | Compressed last (recent tool results, key decisions) |
| Standard | Compressed first (older conversation turns) |
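The tier ordering above can be illustrated with a minimal sketch. The `Message` type, the `compress_to_budget` function, and the choice to model "compression" as dropping whole messages are all assumptions made for illustration; they are not Myrm's actual implementation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    STANDARD = 0   # compressed first
    IMPORTANT = 1  # compressed last
    CRITICAL = 2   # never compressed


@dataclass
class Message:  # hypothetical message type for this sketch
    text: str
    tokens: int
    priority: Priority


def compress_to_budget(messages: list[Message], budget: int) -> list[Message]:
    """Drop messages until the total fits the budget: lowest tier first, oldest first."""
    total = sum(m.tokens for m in messages)
    # Critical messages are never candidates; the rest are ordered by (tier, age).
    candidates = sorted(
        (i for i, m in enumerate(messages) if m.priority != Priority.CRITICAL),
        key=lambda i: (messages[i].priority, i),
    )
    dropped: set[int] = set()
    for i in candidates:
        if total <= budget:
            break
        dropped.add(i)
        total -= messages[i].tokens
    return [m for i, m in enumerate(messages) if i not in dropped]
```

In this sketch a Critical message survives even when the budget cannot be met, which mirrors the "never compressed" rule in the table.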
Layer 4: Structured Summarization
When compression alone isn’t sufficient, an LLM generates a structured summary with 11 fields (user goal, active tasks, errors, decisions, etc.) that replaces the compressed history.
Layer 5: Cache Optimization
Provider-specific cache markers (Anthropic `cache_control`, Qwen prefixed cache) are injected to maximize prompt cache hits on subsequent requests.
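As an illustration of marker injection, the sketch below builds an Anthropic-style request body with a `cache_control` breakpoint on the system prompt block. The helper name `mark_stable_prefix` is hypothetical, the body omits required fields such as the model name, and the Qwen variant is not shown.

```python
def mark_stable_prefix(system_prompt: str, messages: list[dict]) -> dict:
    """Build a partial request body whose system block carries a cache breakpoint."""
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the end of the cacheable prefix for this request.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content stays in the messages and is left unmarked here.
        "messages": messages,
    }
```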
Reversible Compression
Unlike competitors that permanently discard compressed content, Myrm’s compression is reversible:
- Tool outputs are offloaded to `.context/storage`, not deleted
- Archive Checkpoints capture full state before compression
- Content can be restored on demand
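A minimal sketch of the offload-and-restore idea follows. The `offload` and `restore` helpers, the hash-based file naming, and the stub format are assumptions for illustration; only the storage location comes from the text above.

```python
import hashlib
from pathlib import Path


def offload(output: str, storage: Path, preview_chars: int = 80) -> dict:
    """Write the full output to disk and return a small stub to keep in context."""
    storage.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a short content hash doubles as the reference key.
    key = hashlib.sha256(output.encode("utf-8")).hexdigest()[:16]
    (storage / f"{key}.txt").write_text(output, encoding="utf-8")
    return {"ref": key, "preview": output[:preview_chars]}


def restore(stub: dict, storage: Path) -> str:
    """Read the original output back using the reference stored in the stub."""
    return (storage / f"{stub['ref']}.txt").read_text(encoding="utf-8")
```

The context keeps only the stub, so the token cost drops while the full output remains recoverable.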
Prompt Cache Optimization
Prompt caching reduces input token costs by up to 90% when cache hits occur.
Design Principles
- Static/Dynamic Separation: System prompt is frozen (cacheable); all dynamic content goes into user messages
- 4-Layer Stable Prefix: System prompt → tools → workspace rules → first user message form a stable prefix
- Cache Break Detection: `cache_break_detector` actively monitors for cache invalidation and reports the cause
- Only-Append Policy: History messages are never modified in-place, preserving prefix stability
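One way to attribute a cache break to a specific prefix layer is to hash each layer separately and compare hashes between requests. The sketch below assumes that approach; the function names and layer keys are hypothetical and do not describe how `cache_break_detector` is actually implemented.

```python
import hashlib
import json
from typing import Optional

# The four stable-prefix layers, in the order they appear in the request.
PREFIX_LAYERS = ("system_prompt", "tools", "workspace_rules", "first_user_message")


def fingerprint(prefix: dict) -> dict:
    """Hash each prefix layer separately so a change can be attributed to one layer."""
    return {
        layer: hashlib.sha256(
            json.dumps(prefix[layer], sort_keys=True).encode("utf-8")
        ).hexdigest()
        for layer in PREFIX_LAYERS
    }


def detect_cache_break(previous: dict, current: dict) -> Optional[str]:
    """Return the earliest layer whose hash changed, or None if the prefix is stable."""
    for layer in PREFIX_LAYERS:
        if previous[layer] != current[layer]:
            return layer
    return None
```

Reporting the earliest changed layer is useful because a change there invalidates everything cached after it.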
Protection Features
| Feature | Description |
|---|---|
| Hot Cache Bypass | When cache is already warm, skip unnecessary compression to preserve hits |
| Anti-Thrashing | Detect and skip repeated low-yield compression cycles |
| 90% Safety Net | Emergency compression at 90% context window utilization to prevent OOM |
| Cache-TTL Archive | Expired cache entries are archived (not deleted) for potential restoration |
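The Anti-Thrashing row can be illustrated with a small guard that tracks how much each recent compression cycle saved. The `ThrashGuard` class, the three-cycle window, and the 5% minimum yield are assumptions chosen for the sketch, not documented Myrm settings.

```python
from collections import deque


class ThrashGuard:
    """Skip compression when the last few cycles all saved very little."""

    def __init__(self, window: int = 3, min_yield: float = 0.05):
        self.min_yield = min_yield          # assumed threshold: 5% of tokens saved
        self.recent = deque(maxlen=window)  # yields of the most recent cycles

    def record(self, tokens_before: int, tokens_after: int) -> None:
        """Store the fraction of tokens saved by one compression cycle."""
        if tokens_before <= 0:
            self.recent.append(0.0)
            return
        self.recent.append((tokens_before - tokens_after) / tokens_before)

    def should_skip(self) -> bool:
        """True only when the window is full and every cycle in it was low-yield."""
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(y < self.min_yield for y in self.recent)
```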
Thinking Content Management
When using reasoning/thinking models (DeepSeek, MiMo, Kimi, Anthropic Claude), the `ThinkingBlockCleaner` processor automatically manages `reasoning_content` and `thinking_blocks` to prevent context bloat:
- Anthropic models: `reasoning_content` is removed (redundant with `thinking_blocks`); `thinking_blocks` are preserved
- DeepSeek/MiMo/Kimi: Historical `reasoning_content` from older turns (before the last user message) is selectively removed, except on messages that carry `tool_calls` (API requirement). Current-turn reasoning is always preserved
- Model switching: When switching from a non-thinking model to a thinking model mid-session, empty `reasoning_content` fields are automatically back-filled on historical assistant messages to prevent 400 errors
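The DeepSeek/MiMo/Kimi rule and the model-switching back-fill can be sketched as two small functions over OpenAI-style message dicts. The function names and the dict layout are assumptions for illustration and are not the `ThinkingBlockCleaner` source; both functions return new lists rather than mutating the history.

```python
def clean_historical_reasoning(messages: list[dict]) -> list[dict]:
    """Drop reasoning_content before the last user message, keeping tool-call turns."""
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    cleaned = []
    for i, msg in enumerate(messages):
        is_historical = i < last_user
        if is_historical and "reasoning_content" in msg and not msg.get("tool_calls"):
            # Copy without the field so the stored history is left untouched.
            msg = {k: v for k, v in msg.items() if k != "reasoning_content"}
        cleaned.append(msg)
    return cleaned


def backfill_reasoning(messages: list[dict]) -> list[dict]:
    """Add an empty reasoning_content to assistant messages that lack the field."""
    return [
        {**m, "reasoning_content": ""}
        if m["role"] == "assistant" and "reasoning_content" not in m
        else m
        for m in messages
    ]
```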
Extreme Scenario Anti-Explosion
To stay stable even during massive context accumulations and multimodal autonomous tasks, Myrm employs a 4-layer protective moat:
1. Gateway Hygiene (Token Block)
Before requests even reach the Agent Harness, the Control Plane gateway scans the payload size. Massive malicious or malformed payloads (>120K tokens) are instantly intercepted with a `400 Bad Request`. This prevents LLM compute nodes from suffering Out-Of-Memory (OOM) crashes and system halts.
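A gateway-side size check of this kind might look like the sketch below. The `check_payload` function and the four-characters-per-token estimate are assumptions for illustration; a real gateway would likely use an actual tokenizer.

```python
import json

MAX_GATEWAY_TOKENS = 120_000  # limit stated in the text above
CHARS_PER_TOKEN = 4           # assumed rough heuristic, not a real tokenizer


def check_payload(payload: dict) -> tuple[int, str]:
    """Return an (HTTP status, reason) pair for an incoming request body."""
    estimated_tokens = len(json.dumps(payload)) // CHARS_PER_TOKEN
    if estimated_tokens > MAX_GATEWAY_TOKENS:
        # Reject before the request reaches the agent or any model.
        return 400, f"payload too large: ~{estimated_tokens} tokens"
    return 200, "ok"
```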
2. Auxiliary Ratio Shield (Graceful Degradation)
When the main model (e.g., 200K window) is nearing its limit, context is compressed using an auxiliary model. If the user configured an overly small auxiliary model (e.g., 8K window), passing 100K tokens to it would cause a fatal crash and loss of conversation history. Myrm dynamically checks this ratio; if the auxiliary model is too small, it falls back to the main model for summarization and issues a warning, keeping the session alive.
3. Smart Media Stripping
For vision models operating autonomously (e.g., Computer Use), screenshots accumulate quickly in the history. Myrm implements a Sliding Visual Evidence Window, retaining only the last 2 media-containing messages for visual reasoning while stripping all large Base64 images from older history. This drastically reduces token bloat while maintaining vision capability.
4. Tail Budget Ratio
Instead of arbitrarily truncating messages, Myrm calculates a dedicated token budget (e.g., 20% of max context) exclusively reserved for the most recent conversation tail. This ensures that the agent’s current working memory and active tool results are never compressed or squeezed out, guaranteeing task continuity.
Session Notes
Agents can create structured notes during a session. These persist in the context at zero API cost (no LLM call needed) and serve as a lightweight alternative to full compression.
Dynamic Thresholds
Compression thresholds adapt automatically based on context utilization:
| Utilization | Action |
|---|---|
| 40% | Begin monitoring, prepare for compression |
| 50% | Light compression (dedup, truncation) |
| 70% | Full compression (priority-aware removal) |
| 90% | Emergency compression (safety net) |
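The threshold table can be read as a simple lookup from utilization to action. The sketch below assumes each percentage is a lower bound ("at or above"); the `compression_action` function and the action labels are hypothetical names.

```python
# (threshold percent, action), checked from the highest threshold down.
THRESHOLDS = [
    (90, "emergency"),
    (70, "full"),
    (50, "light"),
    (40, "monitor"),
]


def compression_action(used_tokens: int, window_tokens: int) -> str:
    """Map context utilization to the action for the highest threshold reached."""
    percent = used_tokens * 100 // window_tokens  # integer percent, rounded down
    for threshold, action in THRESHOLDS:
        if percent >= threshold:
            return action
    return "none"
```

For a 200K-token window, for example, 150K tokens in use is 75% utilization, which falls in the full-compression band.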

