Error Recovery

Myrm’s error recovery system ensures agents keep running through network failures, model outages, rate limits, and unexpected errors — automatically, without user intervention.

14-Layer Recovery Architecture

Layer	Mechanism	What It Handles
L1	Stream Recovery	Network interruptions and stalls during LLM streaming — dual-phase stall detection (first-event 60s + inter-chunk 180s) + token-level precise resume + auto-failover on silent provider hangs
L2	Continuation Recovery	Consecutive stream failures — maintains context across multiple interruptions
L3	Circuit Breaker	Model provider outages — 3-tier cooldown (auth 30min / permanent 10min / transient 1min) with half-open probe
L4	Smart Deferred Failover	Provider failover with intelligence — rate-limit (429) never triggers model switch (always retries same model); overload (529) requires 3 consecutive failures before switching to backup model. Saves primary model quota and avoids unnecessary degradation
L5	Agent Recovery	Tool execution failures — automatic re-planning with alternative strategies
L6	Truncation Recovery	Response cutoff — local JSON repair (stack-based nesting, literal/numeric completion, dangling key fill) + progressive output budget boost (2x → 3x → 4x) with auto-retry
L7	Aux Model Guard	Small model safety — dynamic message truncation before summarization to prevent aux model crashes
L8	Deterministic Fallback	LLM-free safety net — rule-based summary generation when all LLM options fail, preventing deadlocks
L9	Image Auto-Resize	Oversized images that exceed model limits — automatic re-encoding and compression
L10	Media Rejection Recovery	Model rejects media content (unsupported format) — removes media and retries with text only
L11	Thinking Signature Recovery	Model thinking mode signature errors — disables thinking mode and retries
L12	Long Context Tier Switch	Context exceeds standard window — auto-switches to long-context model variant
L13	Empty Response Recovery	Model returns blank — retries with adjusted parameters
L14	Grace-Call Summary	Iteration limit reached — one final toolless LLM call generates a structured summary, ensuring users never see a blank response
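
The L4 rules in the table above can be sketched as a small decision helper. This is a minimal illustration, not Myrm's implementation: `FailoverPolicy`, `decide`, and the returned action strings are hypothetical names, and resetting the overload streak on a 429 or on any other status is an assumption.

```python
from dataclasses import dataclass

# From the L4 row: overload (529) must fail 3 times in a row before switching.
OVERLOAD_SWITCH_THRESHOLD = 3


@dataclass
class FailoverPolicy:
    """Hypothetical helper that applies the L4 deferred-failover rules."""

    consecutive_overloads: int = 0

    def decide(self, status_code: int) -> str:
        if status_code == 429:
            # Rate limit: always retry the same model, never switch.
            # Assumption: a 429 also breaks the current 529 streak.
            self.consecutive_overloads = 0
            return "retry_same_model"
        if status_code == 529:
            self.consecutive_overloads += 1
            if self.consecutive_overloads >= OVERLOAD_SWITCH_THRESHOLD:
                self.consecutive_overloads = 0
                return "switch_to_backup"
            return "retry_same_model"
        # Any other status is outside L4 and is left to the other layers.
        self.consecutive_overloads = 0
        return "defer_to_other_layers"
```

Keeping the counter per model means a single overloaded response costs one retry, while a sustained overload still reaches the backup model after the third failure.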

Circuit Breaker

The circuit breaker prevents cascading failures when a model provider goes down:

States

CLOSED (normal) → OPEN (failures detected) → HALF-OPEN (probe) → CLOSED (recovered)

Error Classification

Error Type	Cooldown	Recovery Strategy
`auth`	30 minutes	Check credentials, try alternative keys
`permanent`	10 minutes	Switch to fallback model
`transient`	1 minute	Retry with backoff
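
As a rough sketch of how the states and cooldown tiers fit together, the following breaker opens on failure, waits out the cooldown for the error type, and then lets exactly one probe through. The class and method names are illustrative assumptions; only the state names and cooldown values come from the tables above.

```python
import time
from typing import Callable

# Cooldowns from the Error Classification table, in seconds.
COOLDOWNS = {"auth": 30 * 60, "permanent": 10 * 60, "transient": 60}


class CircuitBreaker:
    """Hypothetical breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.state = "CLOSED"
        self._open_until = 0.0

    def record_failure(self, error_type: str) -> None:
        # Any failure (including a failed probe) opens the circuit again.
        self.state = "OPEN"
        self._open_until = self._clock() + COOLDOWNS[error_type]

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self._clock() >= self._open_until:
            # Cooldown expired: allow a single probe request.
            self.state = "HALF-OPEN"
            return True
        # While HALF-OPEN, further requests wait for the probe's outcome.
        return self.state == "CLOSED"

    def record_success(self) -> None:
        self.state = "CLOSED"
```

Injecting the clock keeps the sketch testable without sleeping; a real deployment would also need locking if several tasks share one breaker.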

Credential Pool

When one API key hits rate limits, the system automatically rotates to the next available key:

4 dispatch strategies (round-robin, least-used, random, priority)
Per-key error-aware cooldown
Exponential backoff per key
Automatic probe when cooldown expires
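
To make the rotation concrete, here is a minimal sketch of one of the four strategies (round-robin) combined with per-key exponential backoff. `CredentialPool` and its methods are hypothetical names, and the base cooldown value is an assumption; the source does not state the actual delays.

```python
import time
from typing import Callable, Optional


class CredentialPool:
    """Hypothetical round-robin pool that skips keys on cooldown."""

    def __init__(self, keys: list[str], base_cooldown: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._keys = list(keys)
        self._base = base_cooldown
        self._clock = clock
        self._failures = {k: 0 for k in self._keys}
        self._ready_at = {k: 0.0 for k in self._keys}
        self._next = 0

    def acquire(self) -> Optional[str]:
        """Return the next usable key, or None if every key is cooling down."""
        now = self._clock()
        for offset in range(len(self._keys)):
            idx = (self._next + offset) % len(self._keys)
            key = self._keys[idx]
            if self._ready_at[key] <= now:
                self._next = (idx + 1) % len(self._keys)
                return key
        return None

    def report_rate_limited(self, key: str) -> None:
        # Exponential backoff per key: base, 2*base, 4*base, ...
        self._failures[key] += 1
        delay = self._base * 2 ** (self._failures[key] - 1)
        self._ready_at[key] = self._clock() + delay

    def report_success(self, key: str) -> None:
        self._failures[key] = 0
```

A key whose cooldown has expired is simply offered again by `acquire`, which plays the role of the automatic probe: success resets its failure count, another rate limit doubles its cooldown.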

Error Diagnostics

When errors occur, the system provides structured, actionable feedback:

9 Error Categories

Category	Example	Recovery Hint
`FileNotFoundError`	Missing file reference	Suggest searching for correct path
`PermissionError`	Insufficient access	Suggest requesting approval
`ConnectionError`	Network failure	Auto-retry with backoff
`TimeoutError`	LLM response timeout	Increase timeout or simplify request
`RateLimitError`	API quota exceeded	Switch key or wait
`ContextOverflow`	Window exceeded	Trigger compression
`AuthError`	Invalid credentials	Rotate to next key
`ToolError`	Tool execution failed	Try alternative tool
`ModelError`	Model capability gap	Escalate to stronger model

Each error includes a structured context with error_hint, error_category (28 canonical categories via ToolErrorCategory StrEnum, fully i18n-translated in 4 languages), and suggested RecoveryAction — displayed as clickable buttons in the GUI. A cross-layer sync test suite (46 tests) ensures the harness enum and frontend i18n keys never drift.

Interactive Recovery Buttons

For common LLM errors, the error card includes one-click fix buttons that take you directly to the right settings page:

Error	Button	Action
API key invalid/expired	“Update API Key”	Opens Settings page
Billing quota exceeded	“Top Up Balance”	Opens Settings page
Model not found	“Change Model”	Opens Settings page

Button labels are localized into 5 languages (English, Chinese, Japanese, Korean, German) and automatically match your interface language. If the diagnostic engine encounters an unexpected error, it degrades gracefully — the base error message still displays without recovery buttons.

Code Execution Auto-Diagnosis

When the agent runs Python code or Bash commands, the execution engine automatically classifies errors and generates actionable hints:

Category	Trigger	Auto-generated Hint
`import`	`ModuleNotFoundError`	Smart install command with PyPI name mapping (e.g. `cv2` → `pip install opencv-python`)
`not_found`	Command not found	Platform-specific install command via tool discovery
`permission`	Permission denied	`chmod` suggestion with the exact path
`timeout`	Execution timed out	Guidance to reduce input or split task
`oom`	Out of memory	Guidance to process data in chunks
`sandbox_ro`	Read-only filesystem	Redirect writes to `/workspace`
`network_blocked`	Network access blocked	Tells agent to avoid retrying with different HTTP libraries
`syntax`	Syntax error	No hint (agent should fix the code)

The engine includes a built-in import-to-PyPI mapping table (PIL → Pillow, sklearn → scikit-learn, yaml → PyYAML, and more) and auto-detects whether uv pip is available. All code runs in an isolated shared virtual environment managed by VenvManager, ensuring user-installed packages never pollute the system Python.
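
The `import` row can be illustrated with a short sketch that turns a `ModuleNotFoundError` message into an install command. The mapping entries are the ones named above; the function name, the regular expression, and the `use_uv` flag are assumptions made for this example, not the engine's real interface.

```python
import re
from typing import Optional

# Import name -> PyPI package name (only the pairs named in this section).
IMPORT_TO_PYPI = {
    "cv2": "opencv-python",
    "PIL": "Pillow",
    "sklearn": "scikit-learn",
    "yaml": "PyYAML",
}

_MISSING_MODULE = re.compile(r"No module named '([A-Za-z0-9_.]+)'")


def install_hint(stderr: str, use_uv: bool = False) -> Optional[str]:
    """Build an install command from stderr, or None if no module is missing."""
    match = _MISSING_MODULE.search(stderr)
    if match is None:
        return None
    # "sklearn.linear_model" -> "sklearn": only the top-level name is mapped.
    top_level = match.group(1).split(".")[0]
    package = IMPORT_TO_PYPI.get(top_level, top_level)
    installer = "uv pip install" if use_uv else "pip install"
    return f"{installer} {package}"
```

For instance, `install_hint("ModuleNotFoundError: No module named 'cv2'")` yields `pip install opencv-python`, while unmapped names fall back to the import name itself.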

Model Self-Escalation

When a lightweight model detects it lacks the capability to complete a task:

Model outputs a special <<<NEEDS_PRO>>> marker
The EscalationScrubber intercepts the marker (hidden from user)
Agent automatically switches to the configured stronger model
Task continues seamlessly

This enables cost-efficient routing: simple tasks use cheap models, complex tasks auto-escalate.
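
The marker handling described above might look roughly like this for a complete (non-streamed) reply. The function names are hypothetical and this is not the actual `EscalationScrubber`; a streaming version would additionally have to handle a marker split across chunks.

```python
MARKER = "<<<NEEDS_PRO>>>"


def scrub_escalation(text: str) -> tuple[str, bool]:
    """Return (visible_text, needs_escalation) for one complete model reply."""
    if MARKER not in text:
        return text, False
    # Hide the marker from the user and signal that a stronger model is needed.
    return text.replace(MARKER, "").strip(), True


def pick_model(needs_escalation: bool, light: str, strong: str) -> str:
    """Choose the model for the next turn (model names are caller-supplied)."""
    return strong if needs_escalation else light
```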

Loop Detection

7 independent detectors identify different types of agent loops:

Detector	Pattern	Action
Repetition	Same tool called with identical parameters	Warning → Break
Ping-Pong	Alternating A→B→A→B pattern	Warning → Break
No Progress	Output content unchanged across turns	Warning → Break
Divergence	Tool calls increasingly off-topic (adaptive threshold: Exploration 60% / Execution 30% / Recovery 15%)	Warning → Break
Output Diminishing	Response quality declining	Warning → Break
Consecutive Failures	Multiple tool calls failing in a row	Warning → Break
Cross-tool Error Signature	Same error pattern repeated across different tools (line/path normalized)	ToolStuckException

Detection follows a graduated response: first a warning with context-aware suggestions is injected into the agent’s context, then a forced break if the pattern persists (severity: WARNING 3-5x → ERROR 6-9x → CRITICAL 10+x).
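
As an illustration of the Repetition detector and the severity bands, the sketch below counts identical tool calls and maps the count to a severity. It is a simplification under stated assumptions: the names are hypothetical, and it counts over the whole session rather than the sliding window the real detector is described as using.

```python
import json
from collections import Counter
from typing import Any, Optional


def severity(count: int) -> Optional[str]:
    """Map a repeat count to the bands above: 3-5, 6-9, 10+."""
    if count >= 10:
        return "CRITICAL"
    if count >= 6:
        return "ERROR"
    if count >= 3:
        return "WARNING"
    return None


class RepetitionDetector:
    """Hypothetical detector for the same tool called with identical parameters."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def observe(self, tool: str, params: dict[str, Any]) -> Optional[str]:
        # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} identical.
        signature = (tool, json.dumps(params, sort_keys=True, default=str))
        self._counts[signature] += 1
        return severity(self._counts[signature])
```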

Post-Compaction Loop Protection

When context overflow triggers emergency compaction, LoopGuard handles the transition with precision:

Loop detection state survives intact — the sliding window (pattern detection) and error signatures operate in ContextVar, fully decoupled from the message list that compaction modifies
Iteration budget is intelligently reset — notify_compaction() resets total_calls so the agent is not prematurely terminated due to pre-compaction call history, while preserving error_signatures for cross-compaction failure tracking
Agent phase is preserved — the current execution phase (exploration, execution, etc.) carries over, maintaining context-aware detection thresholds

This dual approach — detect loops more sensitively while giving the agent a fresh budget — fundamentally eliminates both the “post-compaction doom loop” and the “premature termination after compaction” failure patterns. No competing system addresses this compaction×budget intersection.

Post-Compaction Memory Protection

After context compaction, the agent’s memory retrieval remains fully intact with zero lag or data loss, thanks to a 5-layer memory protection architecture:

SystemMessage Immune to Compaction: User profile and rules are injected as SystemMessage at position 0, never touched by the compress processor
Learned Context Immune to Compaction: Learned context is injected as HumanMessage, not a tool call pair, so it’s never selected for compression
PreCompactProcessor Proactive Recall: Before compaction, the system automatically triggers vector database semantic search and injects relevant memories as standalone message blocks, ensuring the LLM retains access to critical memories after compaction
Real-time Vector Index: Qdrant vector database entries are searchable immediately after write — no index lag
Independent Memory Extraction: End-of-session memory extraction uses the original dialogue, unaffected by in-session compaction

This architectural design fundamentally eliminates “post-compaction memory loss” — a problem competitors must patch with forced index refresh mechanisms.

Iteration Budget

Agents have configurable iteration limits (default: 50) with dynamically computed thresholds based on the graph recursion limit:

Threshold	Action
~70% budget	First warning: “Review your original goal and prioritize”
~90% budget	Critical warning: “Finalize your work immediately”
100% budget	ToolStuckException → Hard stop with grace summary

The thresholds are automatically derived from graph_recursion_limit and converted to tool-call counts, ensuring the budget scales correctly regardless of configuration. The grace summary provides a structured wrap-up of completed work, remaining tasks, and suggestions for continuation.
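
One way to picture the derivation is shown below. The conversion factor is an assumption: this sketch supposes each tool call consumes two graph steps (`steps_per_tool_call=2`), which the source does not specify, and the function names are hypothetical.

```python
def budget_thresholds(graph_recursion_limit: int,
                      steps_per_tool_call: int = 2) -> dict[str, int]:
    """Convert a graph recursion limit into tool-call thresholds.

    Assumption: one tool call costs `steps_per_tool_call` graph steps.
    """
    max_calls = graph_recursion_limit // steps_per_tool_call
    return {
        "warn": max_calls * 70 // 100,      # ~70% of budget
        "critical": max_calls * 90 // 100,  # ~90% of budget
        "stop": max_calls,                  # 100% of budget
    }


def budget_action(calls_made: int, thresholds: dict[str, int]) -> str:
    """Return the action for the current tool-call count."""
    if calls_made >= thresholds["stop"]:
        return "stop"
    if calls_made >= thresholds["critical"]:
        return "critical"
    if calls_made >= thresholds["warn"]:
        return "warn"
    return "continue"
```

Under that assumption, a recursion limit of 100 gives a 50-call budget with warnings at 35 and 45 calls.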

Silent Tool Retry

When a tool call fails due to transient errors (network timeouts, rate limits, temporary unavailability), the system retries automatically — the user only sees a heartbeat timer ticking, never the failure.

6-Layer Retry Architecture

Layer	Mechanism	User Experience
Tool Executor	2 automatic retries + exponential backoff + `Retry-After` header respect + circuit breaker	TOOL_HEARTBEAT updates elapsed time in real-time
Planner 3-Strike	Per-error retry counting + structured attempt history + auto-escalation	Silent for 3 attempts, then escalates to user approval
Goal Verification	Automatic verification retries with counter reset on success	Invisible to user
Stream Recovery	10+ recovery strategies (overflow/failover/escalation/transient retry)	Brief pause, then seamless continuation
File Snapshot	Shadow Git + per-operation snapshots for sandbox state recovery	One-click rollback in GUI
Frontend Heartbeat	SSE TOOL_HEARTBEAT events with real-time elapsed_ms updates	User sees “Tool running… 15s” instead of a frozen screen
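
The Tool Executor row (2 retries, exponential backoff, `Retry-After` respected) can be sketched as follows. `TransientToolError`, `call_with_retry`, and the 0.5 s base delay are assumptions for illustration; the circuit breaker and heartbeat events are omitted.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientToolError(Exception):
    """Hypothetical error type for failures that are worth retrying."""

    def __init__(self, message: str, retry_after: Optional[float] = None) -> None:
        super().__init__(message)
        self.retry_after = retry_after  # seconds, e.g. parsed from Retry-After


def call_with_retry(fn: Callable[[], T], max_retries: int = 2,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying transient failures up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientToolError as exc:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to upper layers
            backoff = base_delay * 2 ** attempt
            # A server-provided Retry-After takes precedence over our backoff.
            sleep(exc.retry_after if exc.retry_after is not None else backoff)
    raise AssertionError("unreachable")
```

Only `TransientToolError` is retried; any other exception propagates immediately, which keeps non-transient failures visible to the layers above.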

How It Differs from Competitors

Not prompt-based: Retry logic is deterministic code (Pydantic schemas + counters), not LLM instructions that may be ignored
Not developer-only: Unlike framework-level retry configs (e.g. LangGraph’s RetryPolicy), the heartbeat UI provides end-user visibility
Not noisy: Retries are silent — no error popups, no user decisions required for transient failures

File Checkpoint

Before any destructive file operation, AutoSnapshotInterceptor automatically takes a snapshot:

Covers 6 tool categories: write_file, patch_file, delete_file, move_file, execute_terminal, code_execute
Per-turn deduplication prevents redundant snapshots
Snapshots enable single-click rollback in the GUI

Database Safety

A 5-layer protection system ensures your data (conversations, scheduled tasks, memories) survives any failure:

Layer	Protection	When
Pre-migration snapshot	Automatic backup before every schema migration	App startup
3-tier disaster recovery	Rescue (`.dump`) → restore from backup → in-memory degradation	DB init failure
Periodic hot backup	SQLite backup API every 6 hours + shutdown snapshot	Runtime
Framework BackupManager	SHA-256 verification, manifest, retention, quarantine	Harness layer
Health checker	Dual-layer probes with auto-repair	Continuous

Multi-step table rebuild migrations (e.g. CREATE TABLE AS SELECT → DROP → RENAME) are fully protected: if the process is interrupted mid-migration, the pre-migration backup provides a clean restore point.
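
For readers unfamiliar with the SQLite backup API mentioned in the table, here is a minimal sketch using Python's standard `sqlite3` module. The function name and file paths are placeholders; scheduling, SHA-256 verification, and retention are not shown.

```python
import sqlite3


def hot_backup(source_path: str, backup_path: str) -> None:
    """Copy a live SQLite database to backup_path using the backup API."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    try:
        # Connection.backup copies page by page and yields a consistent
        # snapshot even if other connections keep writing to the source.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Taking such a copy immediately before a table-rebuild migration is what provides the clean restore point if the rebuild is interrupted.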

Subagent Error Compaction

When a child agent crashes with a long traceback, the error message is automatically compacted before reaching the parent agent’s context — preventing pollution that would degrade the parent’s reasoning quality.

Aspect	Implementation
Strategy	Head + truncation marker + tail (preserves error type at head, recent stack frames at tail)
Default limit	2000 characters (~500 tokens) — configurable via `SubagentConfig.max_error_chars`
Defense-in-depth	Dual-layer: executor error path + notifications formatting layer
Savings	Typical 8000-char traceback → 2000 chars (75% reduction, ~1500 tokens saved per failure)
Disable	Set `max_error_chars=0` to pass raw errors through

This prevents a common multi-agent failure pattern: a child agent’s verbose crash output consuming the parent’s context window, causing cascading reasoning degradation across the agent hierarchy.
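
The head + marker + tail strategy can be sketched in a few lines. The 2000-character default and the `0` = disabled behaviour follow the table above; the function name, the marker text, and the 30/70 head/tail split are assumptions.

```python
TRUNCATION_MARKER = "\n... [truncated] ...\n"


def compact_error(text: str, max_chars: int = 2000) -> str:
    """Keep the head and tail of a long error, dropping the middle."""
    if max_chars <= 0 or len(text) <= max_chars:
        return text  # 0 disables compaction; short errors pass through
    budget = max_chars - len(TRUNCATION_MARKER)
    if budget <= 0:
        return text[:max_chars]
    # Assumption: ~30% for the head (error type), the rest for the tail
    # (most recent stack frames).
    head_len = budget * 3 // 10
    tail_len = budget - head_len
    return text[:head_len] + TRUNCATION_MARKER + text[-tail_len:]
```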

Subagent Partial Progress on Failure

When a sub-agent fails mid-execution (LLM error, budget exceeded, timeout, or runtime exception), all accumulated work is preserved and returned to the parent agent — never lost.

Failure Path	What Happens	Parent Agent Receives
LLM Error (MyrmLLMError)	Provider returns error after partial work	`SubAgentResult(success=False, result="80% completed work...")`
Budget Exceeded	Token/cost limit hit mid-task	`SubAgentResult(success=False, result="work done before budget hit...")`
Timeout	Task exceeds deadline	`SubAgentResult(success=False, result="progress before timeout...")`
Runtime Exception	Unexpected crash	`SubAgentResult(success=False, result="all output accumulated...")`

Why This Matters

Without partial progress preservation, a sub-agent that completed 80% of a complex task before hitting a rate limit would lose all its work. The parent agent would have to start from scratch — wasting the tokens already consumed and doubling the cost. With Myrm’s approach:

Parent receives all completed work as structured output
Parent can resume from where the sub-agent left off
Token cost for the successful portion is not wasted
Over-long partial output is automatically truncated (the limit is twice the configured max_error_chars)

Competitor Comparison

Feature	Myrm	Claude Code	OpenClaw	Hermes
Partial progress on failure	4 paths	None	None	None
Structured error status	12 states	Exit code only	tool_error	Timeout/crash
Truncation protection	Auto	None	None	None

Myrm provides defense-in-depth against cascading failures at every level of the agent hierarchy:

Layer	Mechanism	Scope
Infrastructure	Circuit breaker (CLOSED → OPEN → HALF_OPEN)	Per-model, per-browser pool
Terminal Error	Hard circuit breaker for network/sandbox failures	Session-wide
Behavior Pattern	LoopGuard (7 pattern types) + FrequencyGuard (sliding window rate limit)	Per-session
Sub-Agent	`_cascade_cancel_descendants` — recursive cancellation of all descendant agents	Agent hierarchy
Emergency	E-Stop (KILL_ALL) — immediately halts all tool execution	Global

Unlike traditional microservice dependency chains, LLM agents don’t have explicit tool DAGs — the LLM decides the call sequence at runtime. Myrm addresses cascading errors at the right abstraction level: infrastructure-level circuit breaking, behavioral pattern detection, and hierarchical cancellation — rather than attempting to model non-existent tool dependencies. Verified: 604 tests passed across all cascading error protection modules (LoopGuard, FrequencyGuard, E-Stop, ToolCallBroadcaster, SubagentExecutor, CircuitBreaker, ToolGuards).

Retry Storm & Budget Protection

Myrm actively prevents runaway retry loops and protects your API budget:

Guard	What It Does	Response
LoopGuard	Detects 7 loop patterns (repetition, ping-pong, no-progress, divergence, diminishing output, consecutive-failures, cross-tool error-signature)	WARN → BREAK
FrequencyGuard	Sliding-window rate limit (100 global / 30 per-tool per minute)	WARN at 80% → BREAK at 100%
MultidimensionalBudgetGuard	Per-session, daily, and per-call USD limits	OK → WARNING → FINALIZATION → EXCEEDED
Iteration Budget	Dynamic tool-call budget derived from recursion limit	70% warn → 90% critical → 100% ToolStuckException

This is active truncation (immediately stop execution), not passive monitoring (just log and alert). When a retry storm is detected, the agent is forced to stop and provide its best answer with whatever results it has — protecting both cloud compute costs and local API key balance. Verified: 330 tests passed across all retry protection modules (LoopGuard, FrequencyGuard, BudgetGuard, MultidimensionalBudgetGuard, BudgetBoundaryMiddleware).
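
A sliding-window limiter in the spirit of FrequencyGuard might look like the sketch below. The 100-calls-per-minute default and the 80%/100% levels come from the table; the class shape, method name, and return values are assumptions, and the per-tool limit of 30 would be a second instance per tool.

```python
import time
from collections import deque
from typing import Callable


class FrequencyGuard:
    """Hypothetical sliding-window guard: WARN at 80% of limit, BREAK at 100%."""

    def __init__(self, limit: int = 100, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_s
        self._clock = clock
        self._calls = deque()  # timestamps of calls inside the window

    def record_call(self) -> str:
        now = self._clock()
        # Drop timestamps that have slid out of the window.
        while self._calls and now - self._calls[0] >= self._window:
            self._calls.popleft()
        self._calls.append(now)
        used = len(self._calls)
        if used >= self._limit:
            return "BREAK"
        if used * 100 >= self._limit * 80:
            return "WARN"
        return "OK"
```

Because the guard returns a verdict on every call, the caller can stop execution as soon as `BREAK` appears instead of only logging the event.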

Budget Protection

Myrm provides comprehensive budget control across all deployment modes:

MultidimensionalBudgetGuard: Per-session, daily, and per-call USD limits with 4-level progressive response (OK → WARNING → FINALIZATION → EXCEEDED)
Dynamic Budget Hints: When budget drops to WARNING or FINALIZATION, the exact remaining USD is injected into the LLM prompt — the AI knows precisely how much it can spend and self-adjusts behavior accordingly
BudgetBadge: Real-time budget indicator in the chat input area showing usage percentage with color-coded status
BudgetExceededDialog: One-click top-up or plan upgrade when budget is exceeded
ChannelBudget: Independent budget limits for each IM channel (Telegram, WeChat, etc.)
BudgetPolicySection: Full UI for configuring budget policies with finalization reserve
DailyChart: 30-day usage trend with cache hit rate overlay

Verified: 181 tests passed across all budget protection modules (harness framework: 102 passed, server business layer: 79 passed).

Data Lifecycle Management

Myrm automatically manages data retention across all storage engines — no manual cleanup needed:

9 Automated Schedulers: Context files (3-tier: 30d/14d/7d), auth logs (configurable retention), chat trash (30d auto-purge), SQLite WAL checkpoint (every 6h), database rotation backup, Qdrant segment optimization, browser zombie detection (48h threshold), Kanban GC, incognito auto-wipe (1h)
MemoryGuardian: Adaptive maintenance frequency — every 6h when healthy, every 2h when degraded. Health scoring (70 normal / 35 critical) drives automatic force maintenance after 2 consecutive unhealthy checks
File Access Tracking: Prevents accidental deletion of referenced context files via file_access_tracker
Scheduler Health API: Real-time green/yellow/red status monitoring for all background schedulers
Hot Backups: Automatic SQLite hot backup after every maintenance cycle

Verified: 413 tests passed across all data lifecycle modules (server lifecycle: 265 passed, harness lifecycle: 90 passed, cron + memory: 58 passed).

Skill Evolution — Self-Improving Agent

Myrm’s agents learn from failures and evolve their skills autonomously:

Automatic Evolution: When a skill fails or receives negative feedback, the system generates an evolution proposal that updates the skill itself — not a temporary prompt patch
Review Lifecycle: Safe changes auto-apply; risky ones become reviewable growth cases with approve/reject workflow
Semantic Deduplication: Similarity checker prevents skill entropy — duplicate or near-identical skills are caught before saving
Experience Ledger: Every evolution event (14 types) is permanently recorded for audit and analytics
Quality Alerts: Webhook notifications when skill quality degrades, enabling proactive maintenance

Verified: 489 tests passed across all skill evolution modules (server: 107 passed, harness framework: 382 passed).

What Users Experience

All recovery happens transparently:

Model goes down? — Automatic switch to backup in milliseconds
Network drops? — Stream resumes from the exact token where it stopped
Rate limited? — Key rotation or backoff, then retry
API key expired? — One-click button to update it, right in the error card
Agent loops? — Detected early, before wasting budget
Response truncated? — Text truncation: seamless keep+continue with progressive output boost (2x/3x/4x, cap 32768); Tool truncation: discard invalid + auto-retry; JSON truncation: local repair. Output-cap auto-recovery across 5 provider formats (Anthropic, OpenRouter, LM Studio, vLLM, DashScope). SSE status notification in 5 languages. 388 output-cap + 294 truncation/recovery tests verified (Jul 2026)
Upgrade interrupted? — Pre-migration snapshot restores your data automatically
Child agent crashes? — Error auto-compacted, parent receives a clean summary, reasoning stays unpolluted
Tool call fails? — Silent retry with heartbeat timer; you see “running… 15s” instead of an error
Stream error with actionable recovery? — Errors include recovery_actions buttons (retry, switch model, install dependency) directly in the chat UI, plus diagnostic_result with i18n error messages and step-by-step resolution guides. No guesswork needed — just click the suggested action
Repeated failures? — 3-Strike protocol auto-escalates to ask for your help — no infinite loops
Environment broken? — Doctor Dashboard runs 9 parallel diagnostic probes (Python version, dependencies, LLM connectivity, network, workspace storage, database, browser, hooks, desktop control) and shows health status at a glance with one-click repair actions. Unlike CLI-only competitors that require terminal access, the GUI health cards display real-time status with actionable fix buttons
SSE disconnects mid-task? — Goal progress panel auto-clears stale indicators on reconnect, re-syncs from server, and preserves completed steps. No ghost spinners, no misleading “in progress” after the agent has stopped
Cancel a running task? — End-to-end cancellation propagates through the entire execution chain within 0.5s (CancellationMonitor polling interval). Background jobs killed, subagents cascade-cancelled, resources cleaned up, token registry unregistered
Disconnect during long task? — Grace period tolerance keeps the task alive. If disconnection persists, OfflineDurableTask registers the work for background completion with user notification on finish. Background processes (npm install, webpack watch, test suites) are managed by a process-level singleton registry completely decoupled from the SSE stream — refreshing the page or reconnecting never kills live daemon sessions. SSE reconnection uses Last-Event-ID with a 5MB sliding window buffer for lossless event replay. 108 tests verified across 4 batches (registry, streaming, reconnect)
Server restarts mid-goal? — Orphaned goals auto-pause with clear reason. Durable tasks resume from LangGraph checkpoint on next startup — zero repeated work
Process crash during normal conversation? — InterruptedTurnMarker writes a durable write-ahead record before every agent stream. On restart, eligible markers are scanned and automatically dispatched for background continuation with chat history reload, message persistence, crash-loop breaker (max 2 attempts), 15-minute freshness window, and user notification on success or failure. User-controllable via autoContinueInterruptedTurns setting (default: enabled)
Emergency halt needed? — E-Stop API (/freeze) cancels ALL active agent streams globally in one call — the panic button for production incidents
