Context Management

Myrm’s context management system lets agents sustain 200+ turn conversations without information loss, while keeping costs low through intelligent compression and cache optimization. The full pipeline is validated by 1,400+ automated tests.

Architecture Overview

Every message passes through a 12-step progressive offloading pipeline before reaching the LLM:

User Message → Thinking Cleanup → Media Filter → Smart Filter → Cache TTL Prune → Pre-Compact → Compress → Session Notes → Summarize → Post-Compaction Reread → Normalize → Media Resolve → Prompt Cache Optimize → LLM

Compression Pipeline

Layer 0: Intent-Guided Compression (CompressionIntent)

Before any data is removed, the pipeline analyzes the current user query and recent history to build a compression intent — a map of what must be preserved:

Focus files: File paths actively referenced in the current query are flagged as protected
Focus modules: Code modules under active discussion are shielded from aggressive compression
Failed tool recovery: IDs of recently failed tool calls are retained so the agent can retry or explain failures

This intent flows through all subsequent layers, ensuring that compression never destroys context the user is actively working with. Competitors lack this step — their compression is topic-blind and frequently discards critical working context.
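As a rough illustration of the idea above, the intent can be modelled as an immutable record that later layers consult before dropping a message. This is a minimal sketch, not Myrm's implementation: the field names, the `build_intent` helper, the `status` key on tool messages, and the crude path regex are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CompressionIntent:
    """Hypothetical shape; the real CompressionIntent fields are not documented here."""
    focus_files: frozenset = field(default_factory=frozenset)
    focus_modules: frozenset = field(default_factory=frozenset)
    failed_tool_call_ids: frozenset = field(default_factory=frozenset)

    def protects(self, message: dict) -> bool:
        """Return True if this message must survive compression."""
        if message.get("tool_call_id") in self.failed_tool_call_ids:
            return True
        text = message.get("content") or ""
        return any(path in text for path in self.focus_files)


# Crude file-path matcher, for illustration only.
_PATH_RE = re.compile(r"[\w./-]+\.\w+")


def build_intent(query: str, history: list[dict]) -> CompressionIntent:
    """Derive the intent from the current query and recent history."""
    files = frozenset(_PATH_RE.findall(query))
    # Assumption: a module is named after the file stem.
    modules = frozenset(f.rsplit("/", 1)[-1].split(".")[0] for f in files)
    failed = frozenset(
        m["tool_call_id"]
        for m in history
        if m.get("role") == "tool" and m.get("status") == "error"
    )
    return CompressionIntent(files, modules, failed)
```

Later layers would call `intent.protects(message)` before truncating or removing anything, which is how a single analysis pass can influence every subsequent step.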

Layer 1: Instant Filtering (ContextBudgetGuard)

Large tool outputs (e.g., file contents, search results) are processed by an intelligent budget guard. Unlike competitors that set a fixed per-tool truncation threshold, Myrm uses holistic budget management:

Smart exemptions: File read/write tools bypass truncation to avoid “read→truncate→re-read” loops
Structure-aware: JSON/XML/CSV outputs use structural trimming (preserving schema) instead of brute character cutting
Disk persistence: Oversized outputs are automatically saved to disk with a summary + path returned to the Agent
Predictive overflow protection: Truncation intensity adapts dynamically based on remaining token budget

Layer 1.5: Universal Tool Output Intelligence (FilterProcessor)

Protection is not limited to terminal commands: output from every tool (search, API, browser, MCP, etc.) passes through a four-layer defense:

Layer	Component	Trigger	Effect
L1	11 Bash command compressors + YAML declarative engine	Shell output	Smart compression for git/pytest/npm etc., preserving key information
L2	`FilterProcessor` + `SemanticFilter` / `StructuralFilter`	Any tool output >5K tokens (single) or >15K tokens (turn aggregate)	Structural content: zero-LLM-cost extraction (9 formats); Unstructured: lightweight LLM semantic summary
L3	`ContextBudgetGuard`	Single output >100K chars	Hard safety net: persist to disk + return summary reference
L4	`OversizedResultHandler` (MCP)	MCP tool output > `max_output_chars`	Vault spill: full content → ArtifactVault with `vault://` pointer + head/tail summary. Fallback: head-truncation if vault unavailable

Tool protection whitelist: file_read_tool, file_edit_tool and other critical tools are automatically exempt from filtering to avoid “read→truncate→re-read” loops.
Prompt Cache aware: Filtering is automatically skipped during session resume (Resume) or human-in-the-loop (HITL) scenarios to protect existing cache prefixes.
Full content recoverable: All filtered large results are persisted to .myrm/artifacts/tool_outputs/. MCP oversized outputs are stored in ArtifactVault with vault:// pointers, so the Agent can retrieve original content at any time via file_read_tool(paths=["vault://..."]).

Layer 2: Cache TTL Pruning

Expired cached content is automatically cleared. This prevents stale data from consuming valuable context window space.

Layer 3: Priority-Aware Compression

Messages are classified into four priority tiers:

Priority	Treatment
System Critical	Never compressed (system prompt, user instructions)
Active Task	Compressed last (current tool results, active errors)
Older Context	Compressed earlier (previous conversation turns)
Background	Compressed first (stale outputs, resolved errors)

Three strategies are applied in order: Dedup (remove exact duplicates) → Truncate (shorten verbose outputs) → Remove (discard low-value turns).
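The tiering and the strategy order can be sketched as follows. This is a simplified model under stated assumptions: messages are plain dicts with a `priority` key, the budget is measured in characters rather than tokens, and `max_len` is an arbitrary illustrative cut-off. None of these names come from Myrm's code.

```python
from enum import IntEnum


class Priority(IntEnum):
    BACKGROUND = 0       # compressed first
    OLDER_CONTEXT = 1    # compressed earlier
    ACTIVE_TASK = 2      # compressed last
    SYSTEM_CRITICAL = 3  # never compressed


def compress(messages, budget_chars, max_len=200):
    """Apply Dedup -> Truncate -> Remove until content fits budget_chars."""
    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    # Strategy 1, Dedup: keep the first copy of each exact duplicate.
    seen, out = set(), []
    for m in messages:
        key = (m["role"], m["content"])
        if m["priority"] != Priority.SYSTEM_CRITICAL and key in seen:
            continue
        seen.add(key)
        out.append(dict(m))  # copy so the caller's messages are not mutated

    # Strategies 2 and 3 visit the lowest tier first.
    # SYSTEM_CRITICAL is never visited, so it is never touched.
    tiers = (Priority.BACKGROUND, Priority.OLDER_CONTEXT, Priority.ACTIVE_TASK)
    for strategy in ("truncate", "remove"):
        for tier in tiers:
            for m in [m for m in out if m["priority"] == tier]:
                if size(out) <= budget_chars:
                    return out
                if strategy == "truncate":
                    m["content"] = m["content"][:max_len]
                else:
                    out = [kept for kept in out if kept is not m]
    return out
```

Stopping as soon as the budget is met means higher tiers are only touched when compressing the lower ones was not enough.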

Layer 4: Structured Summarization

When compression alone isn’t sufficient, an LLM generates a structured summary with 14 fields (user goal, active tasks, completed actions, errors, decisions, artifacts, blocked items, next steps, etc.) to replace the compressed history. A SummaryAuditor validates quality, and incremental merge ensures new information is folded into existing summaries without loss. Before persistence, all summary fields undergo dual redaction (credential leak detection + PII redaction) to ensure sensitive data from raw messages never survives into compressed history.

Task continuity fields: The summary includes dedicated blocked_items (up to 3 current blockers) and next_steps (up to 5 planned actions). These are rendered in the U-curve tail zone of the summary message, the high-attention region (~80% recall) identified by Lost-in-Middle research, so that the agent precisely remembers “what’s blocking progress” and “what to do next” after compaction.

Incremental merge intelligence: When a new summary is merged with an existing one, resolved blockers are automatically removed and completed steps are discarded, keeping the handoff context current without manual intervention.

Layer 4.5: Post-Compaction Reread

After summarization, the PostCompactionRereadProcessor automatically re-reads the top 5 most recently modified/created files from the ArtifactTracker, injecting their latest content into the context as a HumanMessage. This ensures the agent has fresh file contents immediately after compaction — not stale snapshots from before compression occurred.

Dynamic budget: Total reread content is capped at 50,000 tokens to prevent context re-inflation
Integrity Guard sync: Re-read file paths are automatically registered with FileIntegrityGuard, preventing “file modified since last read” false alarms
Prompt Cache safe: Content is injected as HumanMessage, never touching the system prompt prefix
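The selection rule (five most recent files, 50,000-token cap) could look roughly like the sketch below. The tuple layout, the `select_rereads` name, the skip-on-overflow policy, and the 4-characters-per-token estimate are assumptions for illustration; the document does not specify how oversized files are handled.

```python
def select_rereads(tracked, max_files=5, token_budget=50_000,
                   count_tokens=lambda text: len(text) // 4):
    """Pick recently modified files whose content fits the token budget.

    tracked: iterable of (path, modified_at, content) tuples.
    count_tokens: rough 4-chars-per-token estimate (an assumption).
    Returns (chosen_paths, tokens_used).
    """
    chosen, used = [], 0
    recent = sorted(tracked, key=lambda t: t[1], reverse=True)[:max_files]
    for path, _, content in recent:
        cost = count_tokens(content)
        if used + cost > token_budget:
            continue  # skip files that would re-inflate the context
        chosen.append(path)
        used += cost
    return chosen, used
```

Skipping an oversized file instead of stopping lets smaller, older files in the top five still be re-read.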

This matches the post-compaction reread capability documented in Claude Code and Codex, while adding Guard synchronization that competitors lack.

Layer 5: Cache Optimization

Provider-specific cache markers (Anthropic cache_control, Qwen prefixed cache) are injected to maximize prompt cache hits on subsequent requests.

Reversible Compression

Unlike competitors that permanently discard compressed content, Myrm’s compression is reversible:

Tool outputs are offloaded to .context/ storage, not deleted
Archive Checkpoints capture full state before compression
Content can be restored on demand

This means agents can “look back” at earlier details even after compression.

System Prompt Architecture

Four Prompt Modes

Myrm supports four prompt modes that control the density and scope of system instructions injected into each LLM call:

Mode	Content	Use Case
full	Identity + obedience rules + response rules + security + task integrity + memory rules	Default for all conversations
lean	Identity + security + task integrity	Advanced users who want less AI intervention
naked	Security rules + tool guidance only	Full user control, minimal system overhead
search	Dedicated search prompt	Lightweight fast-search interactions

Each mode is pre-built as a static string at startup, ensuring the same parameter combination always returns the identical string object — maximizing KV Cache hit rates across users.
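One way to get the "identical string object" property is to assemble every mode once at import time and hand out references from a read-only mapping. The sketch below assumes placeholder section texts and a hypothetical `get_system_prompt` accessor; the section composition follows the table above.

```python
from types import MappingProxyType

# Placeholder section texts; the real prompt contents are not shown in this document.
_SECTIONS = {
    "identity": "[identity]",
    "obedience": "[obedience rules]",
    "response": "[response rules]",
    "security": "[security rules]",
    "task_integrity": "[task integrity]",
    "memory": "[memory rules]",
    "tool_guidance": "[tool guidance]",
    "search": "[dedicated search prompt]",
}

_MODE_LAYOUT = {
    "full": ("identity", "obedience", "response", "security",
             "task_integrity", "memory"),
    "lean": ("identity", "security", "task_integrity"),
    "naked": ("security", "tool_guidance"),
    "search": ("search",),
}

# Built exactly once; the mapping is read-only, so the strings never change.
PROMPTS = MappingProxyType({
    mode: "\n".join(_SECTIONS[name] for name in parts)
    for mode, parts in _MODE_LAYOUT.items()
})


def get_system_prompt(mode: str = "full") -> str:
    """Return the pre-built prompt; the same mode yields the same object."""
    return PROMPTS[mode]
```

Because no per-request formatting happens, the prompt bytes sent to the provider are identical across calls, which is what a prefix cache keys on.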

XML-Tagged Rule Isolation

All system prompt rules are wrapped in semantic XML tags, enabling LLMs to parse rule boundaries precisely and avoid attention drift during long sessions:

Framework layer (model_discipline.py): <agent_behavior_rules>, <tool_use_enforcement>, <execution_discipline> (per-model), <escalation_contract>
Business layer (shared_rules.py): <security_rules>, <memory_rules>, <task_integrity>, <response_rules>, <desktop_control_rules>, <absolute_obedience_override>
Identity layer (general_agent_prompt.py): <identity>, <ruleset>, <tool_guidance>
Middleware layer: <user_instructions>, <workspace_context>, <cli_tools>

Per-Model Execution Discipline

Different LLM families have known failure modes. Myrm corrects these automatically with model-specific discipline prompts:

Model Family	Corrections Applied
GPT / Codex / Grok	Tool persistence, mandatory tool use, act-don’t-ask
Gemini / Gemma	Absolute paths, parallel calls, non-interactive
Claude	Execute-when-instructed, reduce disclaimers
DeepSeek / Qwen / GLM	Reduce over-explanation, enforce tool calls

These corrections are determined at initialization and never change during a session — fully KV Cache safe.

Prompt Cache Optimization

Prompt caching reduces input token costs by up to 90% when cache hits occur.

Design Principles

Static/Dynamic Separation — System prompt is frozen (cacheable); all dynamic content goes into user messages
4-Layer Stable Prefix — System prompt → tools → workspace rules → first user message form a stable prefix
Cache Break Detection — cache_break_detector actively monitors for cache invalidation and reports the cause
Only-Append Policy — History messages are never modified in-place, preserving prefix stability
Cache-Safe Skill Attenuation — After loading a skill, tool access is narrowed via tool_choice.allowed_tools; the bound tool schema list stays frozen so Prefix Cache is not invalidated mid-session

Cache Preheat & Idle KeepAlive (Zero Cold-Start)

Myrm automatically pre-warms the Anthropic/Qwen server-side prefix cache at three key moments:

Agent initialization — Right after the system prompt is built, a fire-and-forget max_tokens=0 request seeds the cache while the user is still typing. The first real message hits a warm cache, reducing TTFT by up to 52%.
Post-compaction — After context compaction rewrites the message list, the new prefix is immediately preheated to avoid a cache miss on the next message.
Idle keep-alive — When the agent is idle for more than 4 minutes, a background CacheKeepAliveManager sends lightweight probes (10 input tokens, 0 output) every 4 minutes to prevent the provider’s 5-minute TTL from evicting the prefix cache. This ensures consistent 0.5-1s TTFT when users resume after thinking breaks, instead of 2-5s cold restarts. Cost: ~$0.09/day/session. The manager automatically pauses during active conversations and correctly replaces itself during LLM failover.

This is an exclusive capability — no competing framework implements prompt cache pre-warming or idle keep-alive.

Protection Features

Feature	Description
Hot Cache Bypass	When cache is already warm, skip unnecessary compression to preserve hits
Anti-Thrashing	Detect and skip repeated low-yield compression cycles
90% Safety Net	Emergency compression at 90% context window utilization to prevent OOM
Cache-TTL Archive	Expired cache entries are archived (not deleted) for potential restoration

Tool-Call Linear Alignment

Long conversations inevitably require compaction — but compressing messages must never break the pairing between AI tool calls and their results. Myrm provides 3-layer architectural protection:

Layer	Mechanism	When
L1: ID-based Grouping	`tool_call_groups` pairs messages by `tool_call_id` (not position), handling cross-turn reuse	During compression selection
L2: Integrity Guard	Post-compaction validation removes orphan tool messages and trims partially-matched AI messages	After compression completes
L3: Dangling-Call Repair	Pre-LLM middleware inserts synthetic error results for interrupted/timed-out tool calls (covers 3 call sources)	Before every LLM invocation

This eliminates the Tool message must follow tool_calls API rejection that plagues competing frameworks after session truncation. Verified by 239 dedicated tests.
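The L2 and L3 behaviours in the table can be approximated in a few lines. This is an illustrative sketch over OpenAI-style message dicts, not Myrm's code: the `repair_tool_pairs` name and the wording of the synthetic error are assumptions.

```python
def repair_tool_pairs(messages):
    """Drop orphan tool results and add synthetic results for dangling calls."""
    declared, resolved, kept = set(), set(), []

    # Pass 1: keep a tool result only if an earlier assistant message
    # declared its call ID and no result for that ID was kept already.
    for m in messages:
        if m["role"] == "assistant":
            declared.update(c["id"] for c in m.get("tool_calls", []))
        elif m["role"] == "tool":
            call_id = m["tool_call_id"]
            if call_id not in declared or call_id in resolved:
                continue  # orphan or duplicate result
            resolved.add(call_id)
        kept.append(m)

    # Pass 2: every declared call without a result gets a synthetic error,
    # placed directly after the assistant message that issued it.
    repaired = []
    for m in kept:
        repaired.append(m)
        if m["role"] == "assistant":
            for c in m.get("tool_calls", []):
                if c["id"] not in resolved:
                    repaired.append({
                        "role": "tool",
                        "tool_call_id": c["id"],
                        "content": "[error] tool call was interrupted "
                                   "before returning a result",
                    })
    return repaired
```

Matching on `tool_call_id` instead of list position is what keeps the pairing intact when compaction has removed or reordered neighbouring messages.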

Session-Affinity Routing (OpenAI)

For OpenAI models, Myrm automatically injects a prompt_cache_key routing hint on every request. This ensures all calls within the same session route to the same inference node, maximizing the Auto Prefix Cache hit rate:

Automatic detection — Only injected when using native OpenAI endpoints (api.openai.com)
Zero configuration — Uses the existing session ID, no user action needed
Subagent propagation — Child agents inherit the routing key via Python ContextVar
Effect — Cache hit rate improves from ~60% to ~87% (OpenAI documented figures)

This is a production-proven optimization used by competing frameworks (hermes-agent, openclaw). Myrm implements it in ~25 lines with zero side effects for non-OpenAI providers.

Thinking Content Management

When using reasoning/thinking models (DeepSeek, MiMo, Kimi, Anthropic Claude), the ThinkingBlockCleaner processor automatically manages reasoning_content and thinking_blocks to prevent context bloat:

Anthropic models — reasoning_content is removed (redundant with thinking_blocks); thinking_blocks are preserved
DeepSeek/MiMo/Kimi — Historical reasoning_content from older turns (before the last user message) is selectively removed, except on messages that carry tool_calls (API requirement). Current-turn reasoning is always preserved
Model switching — When switching from a non-thinking model to a thinking model mid-session, empty reasoning_content fields are automatically back-filled on historical assistant messages to prevent 400 errors

In a typical 20-turn DeepSeek session, this saves ~8,000–20,000 reasoning tokens (~50% of reasoning overhead).
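The DeepSeek/MiMo/Kimi rule can be sketched as a single pass over the history. The function name is hypothetical and messages are assumed to be plain dicts; only the rule itself (strip reasoning before the last user message unless the message carries tool_calls) comes from the description above.

```python
def strip_historical_reasoning(messages):
    """Remove reasoning_content from older assistant turns.

    Kept as-is: messages at or after the last user message (current turn)
    and any assistant message that carries tool_calls.
    """
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    cleaned = []
    for i, m in enumerate(messages):
        historical = i < last_user
        if (
            m["role"] == "assistant"
            and historical
            and not m.get("tool_calls")
            and "reasoning_content" in m
        ):
            # Copy without the field so the original message is not mutated.
            m = {k: v for k, v in m.items() if k != "reasoning_content"}
        cleaned.append(m)
    return cleaned
```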

Extreme Scenario Anti-Explosion

To ensure unparalleled stability even during massive context accumulations and multimodal autonomous tasks, Myrm employs a 4-layer protective moat:

1. Gateway Hygiene (Token Block)

Before requests even reach the Agent Harness, the Control Plane gateway scans the payload size. Massive malicious or malformed payloads (>120K tokens) are instantly intercepted with a 400 Bad Request. This prevents LLM compute nodes from suffering Out-Of-Memory (OOM) crashes and system halts.

2. Auxiliary Ratio Shield (Graceful Degradation)

When the main model (e.g., 200K window) is nearing its limit, context is compressed using an auxiliary model. If the user configured an overly small auxiliary model (e.g., 8K window), passing 100K tokens to it would cause a fatal crash and loss of conversation history. Myrm dynamically checks this ratio; if the auxiliary model is too small, it silently degrades and uses the main model for summarization, issuing a warning but keeping the session alive.
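The degradation decision might be expressed as below. The `pick_summarizer` name, the logger name, and the 80% headroom factor are assumptions; the document only states that the ratio is checked and that the main model is used with a warning when the auxiliary model is too small.

```python
import logging

logger = logging.getLogger("context.summarizer")  # hypothetical logger name


def pick_summarizer(aux_window: int, payload_tokens: int,
                    headroom: float = 0.8) -> str:
    """Return which model should summarize a payload of the given size.

    headroom leaves room for the summarization prompt and the output;
    0.8 is an illustrative value, not a documented setting.
    """
    if payload_tokens <= aux_window * headroom:
        return "auxiliary"
    logger.warning(
        "auxiliary window (%d tokens) is too small for %d tokens; "
        "falling back to the main model",
        aux_window, payload_tokens,
    )
    return "main"
```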

3. Smart Media Stripping

For vision models operating autonomously (e.g., Computer Use), screenshots are heavily appended. Myrm implements a Sliding Visual Evidence Window, retaining only the last 2 media-containing messages for visual reasoning while stripping all large Base64 images from older history. This drastically reduces token bloat while maintaining vision capability.
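A sliding window of this kind could be sketched as follows, assuming OpenAI-style content parts where images appear as `{"type": "image_url", ...}`. The function name and the placeholder text are illustrative.

```python
def strip_old_media(messages, keep_last=2, placeholder="[image removed]"):
    """Keep images only in the last keep_last media-containing messages."""
    def has_media(m):
        return isinstance(m["content"], list) and any(
            part.get("type") == "image_url" for part in m["content"]
        )

    media_idx = [i for i, m in enumerate(messages) if has_media(m)]
    keep = set(media_idx[-keep_last:]) if keep_last > 0 else set()

    out = []
    for i, m in enumerate(messages):
        if has_media(m) and i not in keep:
            # Replace each image part with a short text marker; keep the rest.
            parts = [
                {"type": "text", "text": placeholder}
                if part.get("type") == "image_url" else part
                for part in m["content"]
            ]
            m = {**m, "content": parts}
        out.append(m)
    return out
```

Text parts of older messages are preserved, so the agent still knows what each earlier step was even though the screenshot itself is gone.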

4. Tail Budget Ratio

Instead of arbitrarily truncating messages, Myrm calculates a dedicated token budget (e.g., 20% of max context) exclusively reserved for the most recent conversation tail. This ensures that the agent’s current working memory and active tool results are never compressed or squeezed out, guaranteeing task continuity.

Subagent Result Distillation

When sub-agents execute tasks (e.g., running test suites, researching codebases), their raw output can be thousands of lines. Myrm applies 3-tier progressive protection to ensure the parent agent’s context stays clean:

Tier	Mechanism	Effect
Tier 1	`truncate_result`	Hard limit by `max_result_tokens` (last line of defense)
Tier 2	`_auto_vault_or_truncate`	Outputs exceeding 8,000 chars are stored in ArtifactVault; parent receives a compact summary (head + tail), a `vault://` pointer, and an explicit `file_read_tool(paths=["vault://…"])` recovery hint
Tier 3	`AgentHandoverState`	Structured handover (completed tasks, pending todos, risks, relevant files) is extracted and separated from the raw result
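Tier 2 can be pictured with the sketch below. The vault is modelled as a plain dict, and the URI scheme based on a content hash, the head/tail sizes, and the function name are assumptions; only the 8,000-character trigger and the head + tail + pointer + recovery hint structure come from the table.

```python
import hashlib


def vault_or_passthrough(output: str, vault: dict, limit: int = 8_000,
                         head: int = 500, tail: int = 500) -> str:
    """Return small outputs unchanged; store large ones and return a summary."""
    if len(output) <= limit:
        return output

    # Hypothetical URI scheme: a short content hash after vault://
    key = hashlib.sha256(output.encode("utf-8")).hexdigest()[:16]
    uri = f"vault://{key}"
    vault[uri] = output  # the full content stays recoverable

    omitted = len(output) - head - tail
    return (
        f"{output[:head]}\n"
        f"[{omitted} chars omitted]\n"
        f"{output[-tail:]}\n"
        f"Full output: {uri}\n"
        f'Recover with file_read_tool(paths=["{uri}"])'
    )
```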

Additionally, sub-agents inherit the full context pipeline via enable_context_compression, so their internal execution already benefits from the same compression, pruning, and summarization layers, preventing bloated results from forming in the first place.

Subagent compaction safety: When a sub-agent’s context is compacted, the same pipeline protections apply: extract_protected_head preserves leading system messages, the summary is injected as a HumanMessage (not SystemMessage) to preserve prompt cache, and ensure_tool_pair_integrity validates message structure. This means sub-agent context always starts with a valid system message after compaction, eliminating the “assistant-first rejection” failure pattern seen in competing frameworks.

Why this matters: Competitors describe “letting sub-agents handle long logs so the main agent only sees results” as a goal. Myrm already implements this with zero information loss (file_read_tool on vault:// URIs, line ranges supported) and structured handover state. The chat UI shows VaultArtifactCard for vaulted results; no extra LLM vault tools are needed.

Session Notes

Agents can create structured notes during a session — these persist in the context at zero API cost (no LLM call needed) and serve as a lightweight alternative to full compression.

Dynamic Thresholds

Compression thresholds adapt automatically based on context utilization:

Utilization	Action
40%	Begin monitoring, prepare for compression
50%	Light compression (dedup, truncation)
70%	Full compression (priority-aware removal)
90%	Emergency compression (safety net)
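Read as a lookup, the table amounts to picking the highest threshold that current utilization has reached. The sketch below assumes inclusive thresholds and hypothetical action labels.

```python
# (utilization floor, action), highest first. Labels are illustrative.
THRESHOLDS = (
    (0.90, "emergency"),
    (0.70, "full"),
    (0.50, "light"),
    (0.40, "monitor"),
)


def action_for(used_tokens: int, max_context_tokens: int) -> str:
    """Map current context utilization to a compression action."""
    utilization = used_tokens / max_context_tokens
    for floor, action in THRESHOLDS:
        if utilization >= floor:  # assumption: thresholds are inclusive
            return action
    return "none"
```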

Auxiliary Model Guard

When using a smaller LLM for summarization, Myrm dynamically detects the aux model’s context window and truncates messages to fit before sending. This prevents small models from failing on oversized input during compression, whatever the size of the original history.

Human Anchor Protection

During context compaction, the user’s original instruction (the first HumanMessage after system prompts) must never be confused with synthetic messages generated by the system (e.g., compaction summaries, session notes). Myrm solves this through correct-by-construction architecture rather than runtime heuristics:

Execution order guarantee: extract_protected_head() runs before any synthetic messages (summaries, session notes, pre-compact recalls) are injected into the message list. The function only sees real user messages.
Role-based isolation: When chat history is reloaded from the database, the compacted_summary is injected as an assistant role message (AIMessage), so it can never be mistakenly identified as a HumanMessage by the protected-head extractor.
Marker defense-in-depth: All synthetic user-role messages carry UNVERIFIED_CONTEXT_MARKER and <memory-context> / <pre_compact_recall_context> tags, providing an additional semantic boundary.

vs. competitors: Hermes uses _is_real_user_message() — a ~120-line runtime checker that inspects 4 flags + 5 prefixes + compression metadata on every compaction cycle. This approach requires updating the checker whenever a new synthetic message type is added, creating a maintenance burden and bypass risk. Myrm’s architecture makes misidentification structurally impossible.

7-Layer Oversized Output Deep Defense

When tool outputs exceed the context window budget (e.g., a 200KB JSON response, a 50K-line test log), Myrm applies a 7-layer defense-in-depth strategy. Each layer operates independently with its own fallback, ensuring zero data loss and full recoverability:

Layer	Component	Trigger	Strategy
L1	MCP Vault Spill	MCP tool output > `max_output_chars`	Full content → ArtifactVault, return head+tail summary + `vault://` pointer + `file_read_tool` hint
L2	Subagent Auto-Vault	Subagent output > 8K chars	Same vault strategy + VaultArtifactCard UI in frontend
L3	Structure-Aware Trimming	JSON/text tool results	Preserve JSON skeleton (4-depth + 12-key + 6-item limits); text: head+tail retention
L4	Stream Recovery	`CONTEXT_OVERFLOW` from LLM API	Stage 1: emergency compact tool outputs → Stage 2: truncate oldest rounds
L5	Preflight Guard	Pre-request token estimation	Block request before API call if estimated tokens exceed threshold, preventing cost waste
L6	Hook Output Spiller	Hook output > 2500 tokens	Persist to disk + return truncated preview with file path
L7	Background Output Spill	Bash background command output	Auto-spill oversized background process output to disk
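The L3 limits (depth 4, 12 keys, 6 items) can be illustrated with a small recursive trimmer. The limits come from the table; the function name and the marker strings used for omitted content are assumptions.

```python
def trim_json(value, depth=0, max_depth=4, max_keys=12, max_items=6):
    """Keep the skeleton of a parsed JSON value within fixed limits."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return f"<object with {len(value)} keys>"
        items = list(value.items())
        out = {
            k: trim_json(v, depth + 1, max_depth, max_keys, max_items)
            for k, v in items[:max_keys]
        }
        if len(items) > max_keys:
            out["<truncated>"] = f"{len(items) - max_keys} more keys"
        return out
    if isinstance(value, list):
        if depth >= max_depth:
            return f"<array with {len(value)} items>"
        out = [
            trim_json(v, depth + 1, max_depth, max_keys, max_items)
            for v in value[:max_items]
        ]
        if len(value) > max_items:
            out.append(f"<{len(value) - max_items} more items>")
        return out
    return value  # scalars pass through unchanged
```

Unlike character truncation, the result is still valid JSON once serialized, so the agent can see the schema even when most of the data has been dropped.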

Full data preservation: Unlike competitors that replace oversized outputs with brief instructions (losing original data) or apply simple character truncation, Myrm’s vault mechanism retains 100% of the original content. The Agent can retrieve any portion via file_read_tool(paths=["vault://..."], line_range="100:200").

vs. competitors: nanobot uses a single-layer “in-flight context governor” that replaces tool output with a bounded instruction, so the original data is lost. OpenClaw uses 2-layer truncation (per-result + aggregate budget) with file spill, which is functional but lacks vault protocol and frontend visualization. Myrm’s 7-layer approach ensures every tool pathway (MCP, bash, browser, subagent, hooks) has dedicated protection with independent fallback chains.

Zero-Config Compression & Model Hot-Switch

Switching models or adjusting context parameters takes effect on the next message — no restart, no /reset, no YAML editing.

How It Works

DB-backed config: All model/compression settings are stored in the database via GUI Settings. Changes persist immediately.
Per-message fingerprint: compute_execution_fingerprint() hashes 30+ agent parameters (model, provider, engine_params, prompt_mode, skills, etc.). Every new message compares the current fingerprint against the cached one.
Auto rebuild: If the fingerprint differs, ChatAgentExecutionCache tears down the old agent unit and builds a fresh one with updated config — transparent to the user.
Proportional threshold auto-calculation: ContextConfig(max_context_tokens=N) automatically computes all compression thresholds as ratios of N:
- compress_threshold = 50% of N
- compress_force_threshold = 70% of N
- summarize_trigger_threshold = 90% of N
- Example: switching from GPT-4o (128K) to Claude Sonnet (200K) scales all thresholds up automatically.
Auto context window discovery: enrich_model_context_window() fetches the real max_input_tokens from the LiteLLM registry for 500+ models. No manual context_length config needed.
Auto summarizer model: summarizer_llm defaults to lite_model (cheaper, faster) with automatic fallback to the main model if unavailable.
Circuit breaker protection: Summarization failures trigger a circuit breaker with half-open probe recovery — no cascading errors.
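The fingerprint-and-rebuild step can be sketched as a hash over canonicalized parameters plus a single-slot cache. This is not the actual compute_execution_fingerprint() or ChatAgentExecutionCache; the `fingerprint` function, the `AgentCache` class, and the choice of SHA-256 over sorted JSON are assumptions for illustration.

```python
import hashlib
import json


def fingerprint(params: dict) -> str:
    """Hash agent parameters; key order must not affect the result."""
    canonical = json.dumps(params, sort_keys=True,
                           separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AgentCache:
    """Rebuild the agent only when the parameter fingerprint changes."""

    def __init__(self, build):
        self._build = build      # callable: params -> agent
        self._fingerprint = None
        self._agent = None
        self.rebuilds = 0

    def get(self, params: dict):
        fp = fingerprint(params)
        if fp != self._fingerprint:
            # Settings changed since the last message: discard and rebuild.
            self._agent = self._build(params)
            self._fingerprint = fp
            self.rebuilds += 1
        return self._agent
```

Calling `get()` on every message is cheap when nothing changed (one hash and one comparison) and makes a settings change take effect on the very next message.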

Frontend Visualization

Context Health Ring (ContextUsageIndicator): real-time SVG ring showing token usage percentage, health status dot (green/amber/red), one-click manual compression button, and auto fork-CTA at ≥75% usage.
Session Context Health Panel (SessionContextHealthPanel): three-card layout with 20+ metrics covering compaction efficiency, pruning ROI, cache hit rates, and adaptive backoff status.

vs. Competitors

Aspect	Hermes	Myrm
Config changes	Edit `config.yaml` → next message rebuilds agent	GUI Settings → DB → per-message fingerprint → auto rebuild
Compression params	15+ manual YAML keys (`threshold`, `target_ratio`, `protect_last_n`, etc.)	Zero-config: proportional auto-calculation from `max_context_tokens`
Context window	Manual `model.context_length` in YAML	Auto-fetched from LiteLLM registry (500+ models)
Compression model	Manual `auxiliary.compression.model/provider`	Auto-selects `lite_model` with main-model fallback
Visibility	CLI text progress bar (60% info / 85% warning)	GUI Health Ring + 3-card analytics panel + fork guidance
Error handling	`hygiene_timeout_seconds` + cooldown	Circuit breaker with half-open auto-recovery

Verified by 409 hot-reload tests (execution_cache 21 + context_management 312 + config/hot-reload 17 + frontend 59).
