Error Recovery
Myrm’s error recovery system ensures agents keep running through network failures, model outages, rate limits, and unexpected errors, automatically and without user intervention.
14-Layer Recovery Architecture
| Layer | Mechanism | What It Handles |
|---|---|---|
| L1 | Stream Recovery | Network interruptions during LLM streaming — token-level precise resume |
| L2 | Continuation Recovery | Consecutive stream failures — maintains context across multiple interruptions |
| L3 | Circuit Breaker | Model provider outages — 3-tier cooldown (auth 30min / permanent 10min / transient 1min) with half-open probe |
| L4 | Fallback Presets | Provider failover — pre-configured backup model chains |
| L5 | Agent Recovery | Tool execution failures — automatic re-planning with alternative strategies |
| L6 | Truncation Recovery | Response cutoff — local JSON repair (stack-based nesting, literal/numeric completion, dangling key fill) + progressive output budget boost (2x → 3x → 4x) with auto-retry |
| L7 | Aux Model Guard | Small model safety — dynamic message truncation before summarization to prevent aux model crashes |
| L8 | Deterministic Fallback | LLM-free safety net — rule-based summary generation when all LLM options fail, preventing deadlocks |
| L9 | Image Auto-Resize | Oversized images that exceed model limits — automatic re-encoding and compression |
| L10 | Media Rejection Recovery | Model rejects media content (unsupported format) — removes media and retries with text only |
| L11 | Thinking Signature Recovery | Model thinking mode signature errors — disables thinking mode and retries |
| L12 | Long Context Tier Switch | Context exceeds standard window — auto-switches to long-context model variant |
| L13 | Empty Response Recovery | Model returns blank — retries with adjusted parameters |
| L14 | Grace-Call Summary | Iteration limit reached — one final toolless LLM call generates a structured summary, ensuring users never see a blank response |
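The local JSON repair listed for L6 can be illustrated with a single pass that tracks open containers on a stack. This is a minimal sketch, not Myrm's implementation: `repair_truncated_json` is a hypothetical name, and the sketch only covers cut-off string values, trailing commas, dangling keys (filled with `null`), and unclosed containers. Literal and numeric completion, and keys cut off mid-string, are left out.

```python
import json

def repair_truncated_json(text: str) -> str:
    """Close whatever a truncated JSON document left open (sketch only)."""
    stack = []          # currently open containers, "{" or "["
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]" and stack:
            stack.pop()

    repaired = text
    if in_string:
        if escaped:
            repaired = repaired[:-1]    # drop a dangling backslash
        repaired += '"'                 # terminate the cut-off string
    else:
        repaired = repaired.rstrip()
        if repaired.endswith(","):
            repaired = repaired[:-1]    # a trailing comma before a closer is invalid
        elif repaired.endswith(":"):
            repaired += " null"         # dangling key: fill with null
    # Close containers innermost first.
    for opener in reversed(stack):
        repaired += "}" if opener == "{" else "]"
    return repaired

data = json.loads(repair_truncated_json('{"a": [1, 2, {"b": "hel'))
```

If the repaired text still fails to parse, a caller would presumably fall through to the retry path with the larger output budget (2x → 3x → 4x) described in the table.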
Circuit Breaker
The circuit breaker prevents cascading failures when a model provider goes down.
States
Error Classification
| Error Type | Cooldown | Recovery Strategy |
|---|---|---|
| auth | 30 minutes | Check credentials, try alternative keys |
| permanent | 10 minutes | Switch to fallback model |
| transient | 1 minute | Retry with backoff |
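The cooldown tiers above, combined with the half-open probe mentioned in the layer table, can be sketched as a small state machine. This is an illustration under assumptions: the closed/open/half-open naming follows the conventional circuit-breaker pattern, and the `CircuitBreaker` class, its method names, and the injectable `clock` are hypothetical rather than Myrm's API.

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # requests flow normally
    OPEN = "open"            # requests blocked until the cooldown expires
    HALF_OPEN = "half_open"  # one probe request is in flight

# Cooldowns from the Error Classification table, in seconds.
COOLDOWN_SECONDS = {"auth": 30 * 60, "permanent": 10 * 60, "transient": 60}

class CircuitBreaker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.state = State.CLOSED
        self._open_until = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN and self._clock() >= self._open_until:
            self.state = State.HALF_OPEN   # cooldown over: let one probe through
            return True
        return self.state is State.CLOSED  # OPEN or HALF_OPEN: block

    def record_failure(self, error_type: str) -> None:
        # Any failure (including a failed probe) re-opens the breaker.
        self.state = State.OPEN
        self._open_until = self._clock() + COOLDOWN_SECONDS[error_type]

    def record_success(self) -> None:
        self.state = State.CLOSED
```

Injecting the clock keeps the sketch testable without sleeping through real cooldowns.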
Credential Pool
When one API key hits rate limits, the system automatically rotates to the next available key:
- 4 dispatch strategies (round-robin, least-used, random, priority)
- Per-key error-aware cooldown
- Exponential backoff per key
- Automatic probe when cooldown expires
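Key rotation with per-key cooldown can be sketched as follows. The example shows only the round-robin strategy with exponential backoff per key; `CredentialPool`, its methods, and the `base_cooldown` value are hypothetical names chosen for illustration, not Myrm's actual interface.

```python
import time

class CredentialPool:
    def __init__(self, keys, base_cooldown=1.0, clock=time.monotonic):
        self._keys = list(keys)
        self._base = base_cooldown
        self._clock = clock
        self._failures = {k: 0 for k in self._keys}   # consecutive rate-limit hits
        self._ready_at = {k: 0.0 for k in self._keys}  # time each key becomes usable
        self._next = 0                                 # round-robin cursor

    def acquire(self):
        """Return the next key that is not cooling down, or None if all are."""
        now = self._clock()
        for offset in range(len(self._keys)):
            idx = (self._next + offset) % len(self._keys)
            key = self._keys[idx]
            if self._ready_at[key] <= now:
                self._next = (idx + 1) % len(self._keys)
                return key
        return None

    def report_rate_limited(self, key) -> None:
        # Exponential backoff: base, 2x base, 4x base, ... per consecutive hit.
        self._failures[key] += 1
        delay = self._base * 2 ** (self._failures[key] - 1)
        self._ready_at[key] = self._clock() + delay

    def report_success(self, key) -> None:
        self._failures[key] = 0
```

A key whose cooldown has expired is simply handed out again, so the next request on that key acts as the probe.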
Error Diagnostics
When errors occur, the system provides structured, actionable feedback.
9 Error Categories
| Category | Example | Recovery Hint |
|---|---|---|
| FileNotFoundError | Missing file reference | Suggest searching for correct path |
| PermissionError | Insufficient access | Suggest requesting approval |
| ConnectionError | Network failure | Auto-retry with backoff |
| TimeoutError | LLM response timeout | Increase timeout or simplify request |
| RateLimitError | API quota exceeded | Switch key or wait |
| ContextOverflow | Window exceeded | Trigger compression |
| AuthError | Invalid credentials | Rotate to next key |
| ToolError | Tool execution failed | Try alternative tool |
| ModelError | Model capability gap | Escalate to stronger model |
Each error carries an error_hint, an error_category, and a suggested RecoveryAction, which are displayed as clickable buttons in the GUI.
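One way to picture the structured feedback is a lookup from exception type to category, hint, and action. The sketch below covers only the four categories that map onto Python built-in exceptions; the `Diagnostic` dataclass, the `classify` function, and the action identifiers such as `search_path` are assumptions made for illustration, not Myrm's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Diagnostic:
    error_category: str
    error_hint: str
    recovery_action: str   # hypothetical identifier a GUI could render as a button

# (exception type, category, hint from the table above, hypothetical action id)
_RULES = [
    (FileNotFoundError, "FileNotFoundError", "Suggest searching for correct path", "search_path"),
    (PermissionError, "PermissionError", "Suggest requesting approval", "request_approval"),
    (ConnectionError, "ConnectionError", "Auto-retry with backoff", "retry_with_backoff"),
    (TimeoutError, "TimeoutError", "Increase timeout or simplify request", "increase_timeout"),
]

def classify(exc: BaseException) -> Optional[Diagnostic]:
    """Return a Diagnostic for known exception types, or None if unrecognized."""
    for exc_type, category, hint, action in _RULES:
        if isinstance(exc, exc_type):   # also matches subclasses
            return Diagnostic(category, hint, action)
    return None
```

Because `isinstance` matches subclasses, a `ConnectionResetError` is classified under ConnectionError without a separate rule.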
Model Self-Escalation
When a lightweight model detects it lacks the capability to complete a task:
- Model outputs a special <<<NEEDS_PRO>>> marker
- The EscalationScrubber intercepts the marker (hidden from user)
- Agent automatically switches to the configured stronger model
- Task continues seamlessly
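Hiding the marker from a streamed response requires handling the case where it arrives split across chunks. The following is a hedged sketch of that idea only: `ScrubberSketch` and its `feed`/`flush` methods are hypothetical, and the real EscalationScrubber may work differently.

```python
MARKER = "<<<NEEDS_PRO>>>"

class ScrubberSketch:
    def __init__(self):
        self._pending = ""                 # held-back text that may start a marker
        self.escalation_requested = False

    def feed(self, chunk: str) -> str:
        """Return text that is safe to display; hold back a possible partial marker."""
        buf = self._pending + chunk
        if MARKER in buf:
            self.escalation_requested = True
            buf = buf.replace(MARKER, "")
        # Hold back the longest suffix that is a proper prefix of the marker.
        hold = 0
        for n in range(min(len(buf), len(MARKER) - 1), 0, -1):
            if MARKER.startswith(buf[-n:]):
                hold = n
                break
        cut = len(buf) - hold
        self._pending = buf[cut:]
        return buf[:cut]

    def flush(self) -> str:
        """Call at end of stream to release any held-back text."""
        out, self._pending = self._pending, ""
        return out
```

Once `escalation_requested` is set, the agent loop would switch to the configured stronger model and continue the task.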
Loop Detection
5 independent detectors identify different types of agent loops:
| Detector | Pattern | Action |
|---|---|---|
| Repetition | Same tool called with identical parameters | Warning → Break |
| Ping-Pong | Alternating A→B→A→B pattern | Warning → Break |
| No Progress | Output content unchanged across turns | Warning → Break |
| Divergence | Tool calls increasingly off-topic | Warning → Break |
| Output Diminishing | Response quality declining | Warning → Break |
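The first two detectors lend themselves to a short illustration over a history of tool-call fingerprints. This is a sketch under assumptions: the function names, the `threshold` and `cycles` defaults, and the JSON-based fingerprint are all hypothetical, and the Warning → Break escalation is not modeled.

```python
import json

def fingerprint(tool_name: str, params: dict) -> tuple:
    """Hashable identity for a call; key order in params does not matter."""
    return (tool_name, json.dumps(params, sort_keys=True))

def detect_repetition(calls: list, threshold: int = 3) -> bool:
    """True if the last `threshold` calls are identical."""
    return len(calls) >= threshold and len(set(calls[-threshold:])) == 1

def detect_ping_pong(calls: list, cycles: int = 2) -> bool:
    """True if the tail alternates A, B, A, B for `cycles` full cycles, with A != B."""
    n = 2 * cycles
    if len(calls) < n:
        return False
    tail = calls[-n:]
    a, b = tail[0], tail[1]
    return a != b and all(
        call == (a if i % 2 == 0 else b) for i, call in enumerate(tail)
    )

read = fingerprint("read_file", {"path": "a.txt"})
grep = fingerprint("search", {"query": "TODO"})
looping = detect_ping_pong([read, grep, read, grep])
```

Requiring A != B in the ping-pong check keeps it from double-reporting plain repetition.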
Iteration Budget
Agents have configurable iteration limits with graduated warnings:
| Threshold | Action |
|---|---|
| 35 turns | First warning: “Getting close to limit” |
| 45 turns | Second warning: “Almost at limit, wrap up” |
| 48 turns | Final warning: “Last 2 turns” |
| 50 turns | Hard stop with grace summary |
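The threshold table can be read as a simple per-turn lookup. The sketch below uses the values from the table as defaults; the `budget_action` function and its return shape are hypothetical, and a configurable deployment would pass its own thresholds.

```python
DEFAULT_WARNINGS = {
    35: "Getting close to limit",
    45: "Almost at limit, wrap up",
    48: "Last 2 turns",
}
DEFAULT_HARD_STOP = 50

def budget_action(turn: int, warnings=None, hard_stop: int = DEFAULT_HARD_STOP):
    """Return (action, detail) for the given 1-based turn number."""
    warnings = DEFAULT_WARNINGS if warnings is None else warnings
    if turn >= hard_stop:
        return ("stop", "grace_summary")   # hard stop, then the grace summary call
    if turn in warnings:
        return ("warn", warnings[turn])
    return ("continue", None)
```

Warnings fire only on the exact threshold turn, so each message is delivered once per run.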
File Checkpoint
Before any destructive file operation, AutoSnapshotInterceptor automatically takes a snapshot:
- Covers 6 tool categories: write_file, patch_file, delete_file, move_file, execute_terminal, code_execute
- Per-turn deduplication prevents redundant snapshots
- Snapshots enable single-click rollback in the GUI
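The interception and deduplication logic might look like the following sketch. It assumes that "per-turn deduplication" means at most one snapshot per turn; `SnapshotInterceptorSketch`, `before_tool_call`, and the `take_snapshot` callback are hypothetical names, not the real AutoSnapshotInterceptor interface.

```python
DESTRUCTIVE_TOOLS = {
    "write_file", "patch_file", "delete_file",
    "move_file", "execute_terminal", "code_execute",
}

class SnapshotInterceptorSketch:
    def __init__(self, take_snapshot):
        self._take_snapshot = take_snapshot   # callback taking a turn id
        self._snapshotted_turns = set()

    def before_tool_call(self, turn_id: int, tool_name: str) -> bool:
        """Snapshot once per turn before the first destructive tool call.

        Returns True if a snapshot was taken for this call.
        """
        if tool_name not in DESTRUCTIVE_TOOLS:
            return False
        if turn_id in self._snapshotted_turns:
            return False                      # already covered this turn
        self._take_snapshot(turn_id)
        self._snapshotted_turns.add(turn_id)
        return True

taken = []
interceptor = SnapshotInterceptorSketch(taken.append)
interceptor.before_tool_call(1, "read_file")    # not destructive: no snapshot
interceptor.before_tool_call(1, "write_file")   # first destructive call in turn 1
interceptor.before_tool_call(1, "delete_file")  # same turn: deduplicated
interceptor.before_tool_call(2, "patch_file")   # new turn: new snapshot
```

After these calls, `taken` holds one entry per turn that touched a destructive tool.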
What Users Experience
All recovery happens transparently:
- Model goes down? Automatic switch to backup in milliseconds
- Network drops? Stream resumes from the exact token where it stopped
- Rate limited? Key rotation or backoff, then retry
- Agent loops? Detected early, before wasting budget
- Response truncated? Local JSON repair first, then auto-retry with larger output budget

