Error Recovery

Myrm’s error recovery system ensures agents keep running through network failures, model outages, rate limits, and unexpected errors — automatically, without user intervention.

14-Layer Recovery Architecture

| Layer | Mechanism | What It Handles |
| --- | --- | --- |
| L1 | Stream Recovery | Network interruptions during LLM streaming: token-level precise resume |
| L2 | Continuation Recovery | Consecutive stream failures: maintains context across multiple interruptions |
| L3 | Circuit Breaker | Model provider outages: 3-tier cooldown (auth 30 min / permanent 10 min / transient 1 min) with half-open probe |
| L4 | Fallback Presets | Provider failover: pre-configured backup model chains |
| L5 | Agent Recovery | Tool execution failures: automatic re-planning with alternative strategies |
| L6 | Truncation Recovery | Response cutoff: local JSON repair (stack-based nesting, literal/numeric completion, dangling key fill) plus progressive output budget boost (2x → 3x → 4x) with auto-retry |
| L7 | Aux Model Guard | Small model safety: dynamic message truncation before summarization to prevent aux model crashes |
| L8 | Deterministic Fallback | LLM-free safety net: rule-based summary generation when all LLM options fail, preventing deadlocks |
| L9 | Image Auto-Resize | Oversized images that exceed model limits: automatic re-encoding and compression |
| L10 | Media Rejection Recovery | Model rejects media content (unsupported format): removes media and retries with text only |
| L11 | Thinking Signature Recovery | Model thinking mode signature errors: disables thinking mode and retries |
| L12 | Long Context Tier Switch | Context exceeds standard window: auto-switches to long-context model variant |
| L13 | Empty Response Recovery | Model returns blank: retries with adjusted parameters |
| L14 | Grace-Call Summary | Iteration limit reached: one final toolless LLM call generates a structured summary, ensuring users never see a blank response |
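
The local JSON repair step in L6 can be pictured as a single pass that tracks open containers on a stack. The sketch below is a minimal illustration, not Myrm's implementation: `repair_truncated_json` is a hypothetical name, and it covers only unterminated strings, dangling keys, trailing commas, and unclosed containers. Literal/numeric completion and truncated `\u` escapes are left out.

```python
import json


def repair_truncated_json(text: str) -> str:
    """Close whatever a cut-off JSON document left open (illustrative only)."""
    stack = []        # open containers, innermost last: "{" or "["
    expect_key = []   # parallel to stack: would the next string be an object key?
    in_string = escaped = string_is_key = False
    dangling = None   # "key", "colon", "comma", or None

    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                dangling = "key" if string_is_key else None
            continue
        if ch.isspace():
            continue
        if ch == '"':
            in_string = True
            string_is_key = bool(stack) and stack[-1] == "{" and expect_key[-1]
        elif ch in "{[":
            stack.append(ch)
            expect_key.append(ch == "{")
            dangling = None
        elif ch in "}]":
            if stack:
                stack.pop()
                expect_key.pop()
            dangling = None
        elif ch == ":":
            if stack:
                expect_key[-1] = False
            dangling = "colon"
        elif ch == ",":
            if stack:
                expect_key[-1] = stack[-1] == "{"
            dangling = "comma"
        else:
            dangling = None  # part of a number or literal

    repaired = text
    if in_string:
        if escaped:
            repaired = repaired[:-1]  # drop a lone trailing backslash
        repaired += '"'
        dangling = "key" if string_is_key else None
    if dangling == "key":
        repaired += ": null"          # dangling key fill
    elif dangling == "colon":
        repaired += " null"
    elif dangling == "comma":
        repaired = repaired.rstrip()[:-1]
    for opener in reversed(stack):    # close innermost container first
        repaired += "}" if opener == "{" else "]"
    return repaired


print(json.loads(repair_truncated_json('{"a": [1, 2, {"b": "hel')))
```

If the repaired text still fails to parse, the table above suggests the next step would be a retry with a larger output budget.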

Circuit Breaker

The circuit breaker prevents cascading failures when a model provider goes down. It moves through three states and classifies each error to choose a cooldown.

States

CLOSED (normal) → OPEN (failures detected) → HALF-OPEN (probe) → CLOSED (recovered)

Error Classification

| Error Type | Cooldown | Recovery Strategy |
| --- | --- | --- |
| auth | 30 minutes | Check credentials, try alternative keys |
| permanent | 10 minutes | Switch to fallback model |
| transient | 1 minute | Retry with backoff |
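
The state cycle and the three cooldown tiers can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, the breaker opens on the first failure (the real threshold is not documented here), and the clock is injectable so the behavior can be tested without waiting.

```python
import time

# Cooldowns from the table above, in seconds.
COOLDOWN_SECONDS = {"auth": 30 * 60, "permanent": 10 * 60, "transient": 60}


class CircuitBreaker:
    """Hypothetical sketch of CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.state = "CLOSED"
        self._retry_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self._clock() >= self._retry_at:
            self.state = "HALF_OPEN"  # cooldown elapsed: let one probe through
            return True
        # While HALF_OPEN, further requests wait for the probe's outcome.
        return self.state == "CLOSED"

    def record_success(self) -> None:
        self.state = "CLOSED"

    def record_failure(self, error_type: str) -> None:
        # A failed probe also lands here and reopens the breaker.
        self.state = "OPEN"
        self._retry_at = self._clock() + COOLDOWN_SECONDS[error_type]
```

A caller would check `allow_request()` before each provider call and report the outcome with `record_success()` or `record_failure(error_type)`.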

Credential Pool

When one API key hits rate limits, the system automatically rotates to the next available key:
  • 4 dispatch strategies (round-robin, least-used, random, priority)
  • Per-key error-aware cooldown
  • Exponential backoff per key
  • Automatic probe when cooldown expires
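
As a rough illustration of key rotation with per-key backoff, the sketch below implements only the round-robin strategy. `CredentialPool`, `KeyState`, and the one-second base cooldown are assumptions made for the example, not Myrm's actual API or defaults.

```python
import time
from dataclasses import dataclass


@dataclass
class KeyState:
    key: str
    failures: int = 0          # consecutive rate-limit hits
    available_at: float = 0.0  # time at which the cooldown expires


class CredentialPool:
    """Hypothetical round-robin pool with exponential backoff per key."""

    def __init__(self, keys, base_cooldown=1.0, clock=time.monotonic):
        self._keys = [KeyState(k) for k in keys]
        self._base = base_cooldown
        self._clock = clock
        self._next = 0

    def acquire(self):
        """Return the next key that is not cooling down, or None if all are."""
        now = self._clock()
        for offset in range(len(self._keys)):
            idx = (self._next + offset) % len(self._keys)
            state = self._keys[idx]
            if state.available_at <= now:
                self._next = (idx + 1) % len(self._keys)
                return state.key
        return None

    def report_rate_limit(self, key) -> None:
        state = self._find(key)
        # Cooldown doubles with each consecutive failure: base, 2*base, 4*base...
        state.available_at = self._clock() + self._base * 2 ** state.failures
        state.failures += 1

    def report_success(self, key) -> None:
        self._find(key).failures = 0

    def _find(self, key) -> KeyState:
        return next(s for s in self._keys if s.key == key)
```

In this sketch, a key whose cooldown has expired simply becomes eligible again, so the next request that picks it acts as the probe.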

Error Diagnostics

When errors occur, the system provides structured, actionable feedback:

9 Error Categories

| Category | Example | Recovery Hint |
| --- | --- | --- |
| FileNotFoundError | Missing file reference | Suggest searching for correct path |
| PermissionError | Insufficient access | Suggest requesting approval |
| ConnectionError | Network failure | Auto-retry with backoff |
| TimeoutError | LLM response timeout | Increase timeout or simplify request |
| RateLimitError | API quota exceeded | Switch key or wait |
| ContextOverflow | Window exceeded | Trigger compression |
| AuthError | Invalid credentials | Rotate to next key |
| ToolError | Tool execution failed | Try alternative tool |
| ModelError | Model capability gap | Escalate to stronger model |

Each error carries a structured context with error_hint, error_category, and a suggested RecoveryAction. The suggested actions are displayed as clickable buttons in the GUI.
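
One way to picture the structured context is a small record built from a category lookup. The field names `error_hint` and `error_category` and the type name `RecoveryAction` come from the text above. Everything else (the `label` field, `diagnose`, the `RECOVERY_HINTS` mapping) is a hypothetical shape chosen for illustration.

```python
from dataclasses import dataclass

# The nine categories and their hints, copied from the table above.
RECOVERY_HINTS = {
    "FileNotFoundError": "Suggest searching for correct path",
    "PermissionError": "Suggest requesting approval",
    "ConnectionError": "Auto-retry with backoff",
    "TimeoutError": "Increase timeout or simplify request",
    "RateLimitError": "Switch key or wait",
    "ContextOverflow": "Trigger compression",
    "AuthError": "Rotate to next key",
    "ToolError": "Try alternative tool",
    "ModelError": "Escalate to stronger model",
}


@dataclass(frozen=True)
class RecoveryAction:
    label: str  # assumed: the text a GUI button would show


@dataclass(frozen=True)
class ErrorContext:
    error_category: str
    error_hint: str
    action: RecoveryAction


def diagnose(category: str) -> ErrorContext:
    """Build the structured context for a known error category."""
    hint = RECOVERY_HINTS.get(category)
    if hint is None:
        raise ValueError(f"unknown error category: {category}")
    return ErrorContext(category, hint, RecoveryAction(label=hint))
```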

Model Self-Escalation

When a lightweight model detects it lacks the capability to complete a task:
  1. Model outputs a special <<<NEEDS_PRO>>> marker
  2. The EscalationScrubber intercepts the marker (hidden from user)
  3. Agent automatically switches to the configured stronger model
  4. Task continues seamlessly
This enables cost-efficient routing: simple tasks use cheap models, complex tasks auto-escalate.
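
The four steps above can be sketched as follows. This is a simplified, non-streaming illustration: `scrub_escalation` and `run_with_escalation` are hypothetical names, and the models are stand-in callables. A streaming scrubber would also need to buffer partial output in case the marker is split across chunks.

```python
from typing import Callable, Tuple

MARKER = "<<<NEEDS_PRO>>>"


def scrub_escalation(text: str) -> Tuple[str, bool]:
    """Return (text with the marker removed, whether escalation was requested)."""
    if MARKER not in text:
        return text, False
    return text.replace(MARKER, "").strip(), True


def run_with_escalation(
    prompt: str,
    light_model: Callable[[str], str],
    pro_model: Callable[[str], str],
) -> str:
    """Try the cheap model first; rerun on the stronger one if it asks for help."""
    visible, escalate = scrub_escalation(light_model(prompt))
    if escalate:
        return pro_model(prompt)  # the marker never reaches the user
    return visible
```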

Loop Detection

5 independent detectors identify different types of agent loops:

| Detector | Pattern | Action |
| --- | --- | --- |
| Repetition | Same tool called with identical parameters | Warning → Break |
| Ping-Pong | Alternating A→B→A→B pattern | Warning → Break |
| No Progress | Output content unchanged across turns | Warning → Break |
| Divergence | Tool calls increasingly off-topic | Warning → Break |
| Output Diminishing | Response quality declining | Warning → Break |

Detection follows a graduated response: first a warning is injected into the agent’s context, then a forced break if the pattern persists.
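
Two of the detectors are simple enough to sketch. The example below is an assumption-laden illustration: the class and function names are hypothetical, and the thresholds (warn after 3 identical calls, break after 5) are placeholders, since the document does not state the real values.

```python
import json


class RepetitionDetector:
    """Flags the same tool being called with identical parameters."""

    def __init__(self, warn_after: int = 3, break_after: int = 5):
        self.warn_after = warn_after
        self.break_after = break_after
        self._last = None
        self._streak = 0

    def observe(self, tool: str, params: dict) -> str:
        # Serialize with sorted keys so equal dicts compare equal.
        call = (tool, json.dumps(params, sort_keys=True))
        if call == self._last:
            self._streak += 1
        else:
            self._last = call
            self._streak = 1
        if self._streak >= self.break_after:
            return "break"
        if self._streak >= self.warn_after:
            return "warn"
        return "ok"


def is_ping_pong(calls: list, cycles: int = 2) -> bool:
    """True if the tail of `calls` alternates A, B, A, B for `cycles` rounds."""
    tail = calls[-2 * cycles:]
    if len(tail) < 2 * cycles:
        return False
    a, b = tail[0], tail[1]
    return a != b and all(
        c == (a if i % 2 == 0 else b) for i, c in enumerate(tail)
    )
```

The "warn" result corresponds to injecting a warning into the agent's context, and "break" to the forced stop.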

Iteration Budget

Agents have configurable iteration limits with graduated warnings:

| Threshold | Action |
| --- | --- |
| 35 turns | First warning: “Getting close to limit” |
| 45 turns | Second warning: “Almost at limit, wrap up” |
| 48 turns | Final warning: “Last 2 turns” |
| 50 turns | Hard stop with grace summary |

The grace summary provides a structured wrap-up of completed work, remaining tasks, and suggestions for continuation.
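
The threshold table maps directly onto a lookup. The sketch below uses the values from the table as defaults; `budget_action` and its return shape are hypothetical, chosen only to show the graduated behavior.

```python
from typing import Optional, Tuple

# (turn, message) pairs from the table above.
WARNINGS = [
    (35, "Getting close to limit"),
    (45, "Almost at limit, wrap up"),
    (48, "Last 2 turns"),
]
HARD_STOP = 50


def budget_action(turn: int) -> Tuple[str, Optional[str]]:
    """Return what should happen at the given turn number."""
    if turn >= HARD_STOP:
        return ("stop", "grace summary")
    for threshold, message in WARNINGS:
        if turn == threshold:
            return ("warn", message)
    return ("continue", None)
```

Each warning fires once, on the exact turn that reaches its threshold, so the agent's context is not flooded with repeated reminders.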

File Checkpoint

Before any destructive file operation, AutoSnapshotInterceptor automatically takes a snapshot:
  • Covers 6 tool categories: write_file, patch_file, delete_file, move_file, execute_terminal, code_execute
  • Per-turn deduplication prevents redundant snapshots
  • Snapshots enable single-click rollback in the GUI
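
Per-turn deduplication can be pictured as a set of already-snapshotted targets that resets whenever the turn changes. The sketch below is not the real AutoSnapshotInterceptor: the class name, the `before_tool` hook, and the choice to deduplicate by target within a turn are all assumptions made for illustration.

```python
# The six tool categories listed above.
DESTRUCTIVE_TOOLS = {
    "write_file", "patch_file", "delete_file",
    "move_file", "execute_terminal", "code_execute",
}


class AutoSnapshotSketch:
    """Hypothetical interceptor that snapshots each target once per turn."""

    def __init__(self, take_snapshot):
        self._take_snapshot = take_snapshot  # callable receiving the target
        self._turn = None
        self._seen = set()

    def before_tool(self, turn_id, tool: str, target: str) -> bool:
        """Snapshot if needed; return True when a snapshot was taken."""
        if tool not in DESTRUCTIVE_TOOLS:
            return False
        if turn_id != self._turn:
            # New turn: forget what was snapshotted previously.
            self._turn = turn_id
            self._seen.clear()
        if target in self._seen:
            return False
        self._seen.add(target)
        self._take_snapshot(target)
        return True
```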

What Users Experience

All recovery happens transparently:
  • Model goes down? — Automatic switch to backup in milliseconds
  • Network drops? — Stream resumes from the exact token where it stopped
  • Rate limited? — Key rotation or backoff, then retry
  • Agent loops? — Detected early, before wasting budget
  • Response truncated? — Local JSON repair first, then auto-retry with larger output budget