Model Configuration

Myrm uses LiteLLM for unified access to 100+ models from 26+ built-in providers, plus three compatibility protocols (OpenAI-like, Gemini-like, Anthropic-like) for adding custom providers. Each built-in provider comes with a pre-filled API URL and supports one-click switching between domestic and international regions, so a built-in provider needs only an API key to work.

First-Time Setup

When you first launch Myrm, the onboarding wizard guides you through model configuration:

Local model auto-detection: Myrm automatically scans for running Ollama or LM Studio instances and recommends the best available model with a one-click activation button.
Cloud Quick Start: No local GPU? The wizard shows cloud providers with free tiers — Google Gemini, SiliconFlow (registration credits), and OpenRouter (free open-source models) — with one-click navigation to the configuration page.
Smart Routing onboarding: When two or more models are configured, the wizard offers a one-click Smart Routing activation. It auto-classifies your models into quick/standard/reasoning tiers and shows estimated cost savings (40–70%). Enable it to automatically route simple messages to lighter models while reserving powerful models for complex tasks.
Persistent guidance: Even if you skip the wizard, a gentle banner on the chat screen reminds you to configure a model provider before you can start chatting.

All three deployment modes (Local WebUI, Tauri Desktop, Cloud-hosted) share the same configuration UI in Settings > Models.

Adding API Keys

Navigate to Settings > Models or set environment variables:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=...
GOOGLE_API_KEY=...
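
To check which of these keys a process can actually see, a small sketch such as the following may help. The helper name `configured_providers` and the provider labels are hypothetical and not part of Myrm; only the four variable names come from the list above.

```python
import os

# Environment variable names taken from the list above.
# The provider labels on the left are illustrative only.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose key variable is set to a non-blank value."""
    env = os.environ if env is None else env
    return sorted(
        name
        for name, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    )


if __name__ == "__main__":
    print(configured_providers())
```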

Smart Routing

The 3-dimensional complexity router automatically selects the optimal model based on task requirements, privacy sensitivity, and provider health:

Phase 1, Rule Engine (zero LLM cost): Multi-dimensional scoring across 6 signal types: keywords (30+ bilingual), code blocks, math/LaTeX, URLs/file paths, images, and message length. 99% of requests resolve instantly.
Phase 2, LLM Judge (ambiguous only): For borderline cases, a lightweight judge model classifies the query. Results are SHA-256 cached (5-min TTL, 256-entry LRU) to avoid repeated calls.
Session Momentum: Short follow-ups (e.g., “ok”, “yes”) inherit the session’s complexity tier instead of being downgraded to SIMPLE. Message-length-based weight decay ensures long messages are classified independently.
PenaltyTracker (self-learning): When you flag a routing decision as incorrect, the tier receives a penalty score (24-hour half-life). Future queries need stronger signals to activate penalized tiers, so routing accuracy improves over time.
Fast Lane: Tasks classified as SIMPLE skip the heavy middleware pipeline and go directly to the model with minimal overhead. This reduces response latency by ~30-50% for everyday chat, greetings, and short follow-ups, so the messages you send most often feel the fastest.
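
The judge cache and the penalty half-life described above can be sketched as follows. This is an illustration under stated assumptions, not Myrm's implementation: the class and function names are hypothetical, and only the SHA-256 key, 5-minute TTL, 256-entry LRU limit, and 24-hour half-life come from the text.

```python
import hashlib
import time
from collections import OrderedDict


class JudgeCache:
    """LRU cache with a per-entry TTL, keyed by the SHA-256 of the message."""

    def __init__(self, max_entries=256, ttl_seconds=300.0, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (expires_at, tier)
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def _key(message):
        return hashlib.sha256(message.encode("utf-8")).hexdigest()

    def get(self, message):
        key = self._key(message)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, tier = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # entry has expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return tier

    def put(self, message, tier):
        key = self._key(message)
        self._entries[key] = (self._clock() + self._ttl, tier)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used


def decayed_penalty(initial, age_hours, half_life_hours=24.0):
    """Penalty remaining after age_hours, halving every half_life_hours."""
    return initial * 0.5 ** (age_hours / half_life_hours)
```

With a 24-hour half-life, a penalty of 1.0 is worth 0.5 after one day and 0.25 after two, so a single mistaken flag fades rather than permanently suppressing a tier.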

Model Speed Test

Before committing to a model, test its real-world performance right from Settings. Go to Settings > Model Service and click the speed test button in the top-right corner.

What it measures:

TTFT (Time to First Token) — How quickly the model starts responding
TPS (Tokens per Second) — Sustained generation throughput
Total tokens — Full response length for the test prompt

How to use:

Click Run All to benchmark every enabled model in sequence
Click the retry icon on any individual model to retest it
Results show success/error status with color-coded badges

Speed test results reflect your actual network conditions and API key quotas — not synthetic benchmarks. Combined with Smart Routing’s automatic model selection, you get data-driven confidence that the right model is handling each task.
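
As a rough illustration of how the three metrics relate, the sketch below derives them from the arrival time of each streamed token. The names `SpeedResult` and `measure` are hypothetical, and measuring throughput from the first token onward is an assumption; the page does not say how Myrm computes TPS.

```python
from dataclasses import dataclass


@dataclass
class SpeedResult:
    ttft_seconds: float
    tokens_per_second: float
    total_tokens: int


def measure(request_sent_at, token_times):
    """Compute speed metrics from the arrival time of each streamed token."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent_at
    generation_time = token_times[-1] - token_times[0]
    # The first token marks the start of generation, so the remaining
    # len - 1 tokens arrived during generation_time.
    if generation_time > 0:
        tps = (len(token_times) - 1) / generation_time
    else:
        tps = 0.0
    return SpeedResult(ttft, tps, len(token_times))
```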

Thinking Intensity Control

For reasoning-capable models (Claude 3.7+, o1/o3/o4-mini, Gemini 2.5, DeepSeek R1, Qwen3, etc.), Myrm provides fine-grained control over how deeply the model “thinks” before responding.

6 preset levels: Off / Low / Medium / High / Extra High / Max, covering everything from quick answers to deep multi-step reasoning.
Custom token budget: Enter any value (e.g. 16384) to set a precise thinking budget for providers that support it (like Claude’s budget_tokens).
Per-model memory: Your intensity preference is saved automatically per model. Switch between Claude and GPT and each remembers its own last setting, with no re-configuration needed.
Auto-detection: The thinking intensity button only appears when the current model supports reasoning. Non-reasoning models (e.g. GPT-4o-mini) show no button. Detection uses a dual-layer approach: API-reported capability takes priority, with a regex fallback covering 25+ model families.

The selected intensity is passed through the full pipeline (frontend model_kwargs → server passthrough → harness extra_body → LiteLLM → provider API) with zero framework coupling.

Thinking Headroom (automatic): All major reasoning models (Claude 4.6+, DeepSeek R1, OpenAI o-series, Gemini 2.5+) count thinking tokens against the max_tokens budget. If max_tokens is too small, the thinking phase exhausts the budget before the response starts, causing truncation. Myrm automatically raises max_tokens to a safe floor based on the selected thinking intensity (8K for low, 16K for medium, 32K for high, 65K for xhigh/max). When no intensity is explicitly set, a conservative 16K default floor is applied, since all thinking models default to thinking-on. This works alongside stream recovery as a dual-layer defense: proactive prevention plus reactive fallback. The adjustment is completely transparent and does not increase API costs (providers bill actual tokens used, not the max_tokens ceiling).
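
The headroom rule amounts to taking the larger of the requested max_tokens and a per-intensity floor. The sketch below assumes the "8K/16K/32K/65K" figures mean binary kilo-tokens (8192 up to 65536) and that the Off level needs no floor; both are assumptions, as is the function name `apply_thinking_headroom`.

```python
# Floors from the text above, read as binary kilo-tokens (an assumption).
THINKING_FLOORS = {
    "low": 8 * 1024,
    "medium": 16 * 1024,
    "high": 32 * 1024,
    "xhigh": 64 * 1024,
    "max": 64 * 1024,
}
DEFAULT_FLOOR = 16 * 1024  # applied when no intensity is set


def apply_thinking_headroom(max_tokens, intensity=None):
    """Raise max_tokens to the floor for the given intensity; never lower it."""
    if intensity == "off":
        return max_tokens  # assumption: thinking disabled, so no headroom
    if intensity is None:
        floor = DEFAULT_FLOOR
    else:
        floor = THINKING_FLOORS[intensity]
    return max(max_tokens, floor)
```

Because the function only ever raises the ceiling, a caller that already asks for a large max_tokens is left untouched.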

Per-Model Prompt Adaptation

Different AI models have different behavioral tendencies. Myrm automatically adapts its system prompt for each model family to get the best results:

GPT / Codex / Grok: Tool persistence enforcement, mandatory tool use for facts, act-don’t-ask bias
Claude: Disclaimer minimization, execution-first mindset
Gemini / Gemma: Absolute file path construction, dependency verification
DeepSeek / Qwen / GLM: Concise Chinese-aware discipline
Claude Opus 5+: Automatic behavioral tuning for three Anthropic-documented issues — scope expansion (doing more than asked), self-correction narration (unnecessary “I was wrong” statements), and default verbosity (responses longer than needed)

All prompt adaptations are fixed at agent initialization and cached in the system prompt, ensuring zero impact on KV Cache hit rates. You don’t need to configure anything — Myrm detects the model family from the model name and applies the appropriate discipline automatically.

In-Chat Model Transparency

Every assistant message shows a Token Economics badge in the action bar. Click it to see exactly which model handled the request, how it was routed, what it cost, and how fast it ran:

Model name & routing tier — see the exact model (e.g. gpt-4o, claude-sonnet-4) and its routing classification (Simple / Standard / Reasoning)
Prompt cache hit rate — percentage of cached tokens, estimated dollar savings, and cache-break attribution when the cache is invalidated
Cost breakdown — per-message cost with actual/estimated badge, multi-model breakdown when routing uses multiple models, and per-tool token consumption
Performance baseline — TTFT, tokens-per-second, and latency compared to your session average, with color-coded delta indicators
Context budget ring — circular progress showing context window usage with healthy/warning/critical thresholds
Privacy level — data sensitivity classification (S1 Public / S2 Internal / S3 Confidential) and routing path (local / cloud)

No black boxes — every routing decision is visible and verifiable directly in the chat.

Key Rotation

Add multiple API keys per provider and Myrm automatically rotates them with smart failover. When you have 2+ active keys, a strategy selector appears in the pool status bar — pick the rotation mode that best fits your setup:

Strategy	Best For
Round Robin	Even distribution across all keys (default)
Fill First	Maximizing free tier quota before using paid keys
Least Used	Balancing by actual call count
Random	Simple unpredictable selection

On rate-limit errors, the pool applies exponential backoff with ±15% jitter and automatically switches to the next available key — users never notice the switch.
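
A minimal sketch of the default Round Robin strategy with rate-limit cooldowns is shown below. The class `RoundRobinPool`, the 1-second base delay, and the 60-second cap are hypothetical choices for illustration; only the exponential backoff, the ±15% jitter, and the switch to the next available key come from the text.

```python
import random


def backoff_delay(attempt, base_seconds=1.0, cap_seconds=60.0, jitter=0.15, rng=random):
    """Exponential backoff with +/-15% jitter; attempt starts at 0."""
    delay = min(cap_seconds, base_seconds * (2 ** attempt))
    return delay * (1.0 + rng.uniform(-jitter, jitter))


class RoundRobinPool:
    """Cycles through keys, skipping any that are cooling down."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("at least one key is required")
        self._keys = list(keys)
        self._cooldown_until = {key: 0.0 for key in self._keys}
        self._failures = {key: 0 for key in self._keys}
        self._next = 0

    def acquire(self, now):
        """Return the next key that is not cooling down, or None."""
        for offset in range(len(self._keys)):
            index = (self._next + offset) % len(self._keys)
            key = self._keys[index]
            if self._cooldown_until[key] <= now:
                self._next = (index + 1) % len(self._keys)
                return key
        return None

    def report_rate_limit(self, key, now):
        """Put a key on cooldown; repeated failures back off exponentially."""
        delay = backoff_delay(self._failures[key])
        self._failures[key] += 1
        self._cooldown_until[key] = now + delay

    def report_success(self, key):
        """Reset the failure count so the next cooldown starts small again."""
        self._failures[key] = 0
```

The jitter spreads retries from several clients so they do not all hit the provider again at the same instant.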

Privacy Routing

Myrm’s Privacy Routing automatically selects cloud or local models based on data sensitivity — no manual switching required:

Sensitivity	Routing	Data Handling
S1 — Public	Cloud model	Direct to cloud
S2 — Internal	Cloud (after PII redaction) or local	Auto-redaction or local routing
S3 — Confidential	Local model only	Data never leaves your machine

Privacy Routing wraps the model behind a standard interface. Agents, middlewares, and the execution loop are completely unaware of the routing — they interact with a normal model. Combined with Ollama, LM Studio, or vLLM for local backends, this enables fully air-gapped operation for sensitive workloads.
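
The table above can be read as a small decision function. The sketch below is one possible reading, not Myrm's code: the names are hypothetical, and preferring redacted cloud routing over local routing for S2 data is an assumption, since the table lists both without an order.

```python
from enum import Enum


class Sensitivity(Enum):
    S1_PUBLIC = 1
    S2_INTERNAL = 2
    S3_CONFIDENTIAL = 3


def choose_route(level, local_available, can_redact):
    """Return (target, redact_first) following the sensitivity table."""
    if level is Sensitivity.S1_PUBLIC:
        return ("cloud", False)
    if level is Sensitivity.S2_INTERNAL:
        if can_redact:
            return ("cloud", True)  # PII is redacted before the request leaves
        if local_available:
            return ("local", False)
        raise RuntimeError("S2 data needs PII redaction or a local model")
    # S3: local only; refuse rather than fall back to the cloud.
    if local_available:
        return ("local", False)
    raise RuntimeError("S3 data requires a local model")
```

Raising an error for S3 data when no local model exists keeps the "data never leaves your machine" guarantee explicit instead of silently degrading it.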

Local Models & Hardware Cookbook

Myrm auto-detects local model services and your hardware capabilities for a zero-friction local AI experience:

First Launch (Auto-Discovery): During onboarding, Myrm probes for Ollama and LM Studio on their default ports. If found, it offers one-click activation: configuring the provider, selecting the recommended model, and setting it as default in a single step.
Hardware Cookbook: In Settings, the Hardware Cookbook panel displays your machine’s specs (CPU, RAM, GPU, VRAM, disk space) and calculates a Fit Score for each available local model. Models are rated perfect / good / fair / poor based on your VRAM or RAM. The download button is disabled when disk space is insufficient, preventing system hangs from downloading models that won’t fit.
Inference Speed Preview (~tok/s): Alongside the VRAM estimate, Myrm shows a pre-download inference speed badge for each model, calculated from your GPU’s memory bandwidth and the model’s parameter count (Q4_K_M quantization). Speed is color-coded for instant readability:

🟢 ≥ 20 tok/s — smooth, real-time conversation
🟡 8–19 tok/s — usable with slight latency
🔴 < 8 tok/s — noticeably slow for live chat
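
A bandwidth-based estimate of this kind can be sketched as below. The formula treats decoding as memory-bound (each token reads every active weight once); the 0.6 bytes per parameter for Q4_K_M and the default efficiency of 0.7 are assumptions for illustration, not values taken from Myrm.

```python
BYTES_PER_PARAM_Q4_K_M = 0.6  # assumption: roughly 4.8 bits per weight


def estimate_tokens_per_second(bandwidth_gb_s, active_params_billions, efficiency=0.7):
    """Memory-bound decode estimate: each token reads all active weights once."""
    model_gb = active_params_billions * BYTES_PER_PARAM_Q4_K_M
    return bandwidth_gb_s * efficiency / model_gb


def speed_badge(tokens_per_second):
    """Map an estimate onto the color bands listed above."""
    if tokens_per_second >= 20:
        return "green"
    if tokens_per_second >= 8:
        return "yellow"
    return "red"
```

For a Mixture-of-Experts model, passing the active parameter count rather than the total is what keeps the estimate from looking misleadingly slow.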

Because the speed badge is shown before download, you know whether a model will feel snappy or sluggish before committing to a multi-GB download. Vendor efficiency is factored in (Apple Silicon, Nvidia, AMD, and Intel each have different real-world bandwidth utilization).

Smart Recommendation (Best Fit First): Models are sorted by hardware fit level, then by capability (parameter count) within each tier. Myrm therefore recommends the most powerful model your hardware can handle, not just the smallest. The top-ranked model is highlighted with a “Best Fit” badge.
MoE-Aware Speed Estimation: Mixture-of-Experts models activate only a fraction of their total parameters per token (for example, DeepSeek R1 activates about 37B of its 671B parameters). Myrm uses the active parameter count for speed estimation, so MoE models show their true inference speed rather than a misleadingly slow estimate based on total parameters.
100% Offline (Zero Network Dependency): The entire recommendation flow works without any internet connection. Model specifications are bundled as a static asset shipped with the application: no external API calls, no cache expiration, no “first launch needs internet” requirement. Hardware detection is a local system call, and Ollama probing targets localhost only. The Hardware Cookbook therefore works identically in air-gapped environments, on planes, in offline server rooms, or anywhere without connectivity.
Curated & Fraud-Proof Model List: Unlike tools that dynamically scrape model repositories and must defend against inflated benchmark scores, Myrm’s Hardware Cookbook uses a team-curated model list. Every data point (parameter count, VRAM requirement, disk size, active parameters) is an objective, verifiable fact, not a subjective benchmark score. This eliminates the risk of manipulated leaderboard rankings influencing your recommendations.
One-Click Install & Remove: Download models directly from the UI with SSE streaming progress and cancel support. After install, models appear instantly in the selector. Uninstall reclaims disk space immediately.
Deploy Mode Awareness: In SaaS mode, local model features are automatically hidden, so cloud-only users see no clutter. In Local WebUI or Desktop modes, the full Hardware Cookbook is available.
Adaptive Stall Detection for Local Endpoints: When your API URL points to a local address (localhost, 192.168.x.x, 10.x.x.x, or any private network), Myrm automatically relaxes internal timeout thresholds. Large local models (70B+) can take minutes for initial token generation with big contexts; cloud-optimized 60-second timeouts would kill these requests prematurely, triggering futile retry loops. Myrm detects local endpoints via RFC1918/IPv6 address analysis and extends timeouts to 5–30 minutes so that local inference can complete. Remote API behavior remains unchanged, and no configuration is needed.
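
Local-endpoint detection of the kind described can be approximated with Python's standard `ipaddress` module. The function names and the fixed 5-minute local timeout below are hypothetical; the page gives a 5–30 minute range without saying how a value within it is chosen.

```python
import ipaddress
from urllib.parse import urlsplit


def is_local_endpoint(api_url):
    """True when the URL host is localhost or a loopback/private IP address."""
    host = urlsplit(api_url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; this sketch does not resolve it
    return address.is_loopback or address.is_private or address.is_link_local


def stall_timeout_seconds(api_url, cloud_timeout=60.0, local_timeout=300.0):
    """Pick a stall timeout: 60 s for remote APIs, 5 minutes for local ones."""
    return local_timeout if is_local_endpoint(api_url) else cloud_timeout
```

`is_private` covers the RFC1918 IPv4 ranges as well as IPv6 unique local addresses, which is why no explicit range list is needed here.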

Config Auto-Healing

All model configuration values are automatically sanitized on save, preventing common copy-paste errors from causing connection failures:

API URL — trailing slashes and whitespace removed (prevents 404 errors)
API Key — leading/trailing spaces and newlines stripped (prevents auth failures)
Model name — extra whitespace removed (prevents model-not-found errors)
Empty values — converted to unset state gracefully (prevents crashes)
Legacy providerType values — old exports that used bare ids like openai instead of openai-like are auto-normalized on load; routing falls back to the provider id instead of crashing the app
Developer System Health — the context bundle health panel calls /context-bundle under the API base URL (never double-prefixes /api, which would 404)

Values pasted from docs or a terminal therefore still work when they carry extra spaces, trailing slashes, or newlines, so you do not need to clean up formatting by hand.
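
The first four rules in the list above could look like the following sketch. The field names `api_url`, `api_key`, and `model` and the function `sanitize_model_config` are hypothetical; Myrm's actual config schema is not shown on this page.

```python
def sanitize_model_config(raw):
    """Return a cleaned copy of a provider config dict."""
    cleaned = {}
    for field, value in raw.items():
        if isinstance(value, str):
            value = value.strip()  # drops spaces, tabs, and newlines at both ends
            if field == "api_url":
                value = value.rstrip("/")  # a trailing slash can cause 404s
            if value == "":
                value = None  # empty strings become "unset"
        cleaned[field] = value
    return cleaned
```

For example, `" https://api.example.com/v1/ \n"` comes out as `"https://api.example.com/v1"`, and a key pasted with a trailing newline loses the newline.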

Multi-Device Config Sync

Myrm keeps settings consistent across browser tabs, the desktop app, and SaaS, without nagging you on every refresh.

What you get:

Change language, TTS, or model defaults on your phone — your desktop picks it up automatically
Open multiple tabs or hard-refresh — no false “Configuration Conflict” dialogs
Work offline — changes queue locally and sync when you’re back online
Break something? Config Time Machine rolls back any key to a previous version

How it works (user-facing):

Smart merge — if two devices edit different fields, both changes apply automatically
Honest conflicts — if two devices edit the same field, you choose which version to keep
Same-device silence — your own tab refreshing never triggers a conflict prompt
Idempotent sync — identical content never bumps version numbers or creates noise
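
The smart-merge and honest-conflict rules resemble a field-level three-way merge. The sketch below is a generic version of that technique under stated assumptions, not Myrm's sync code: `merge_configs` is a hypothetical name, configs are flat dicts, and a missing key is treated the same as `None`.

```python
def merge_configs(base, ours, theirs):
    """Three-way merge per field. Returns (merged, conflicts)."""
    merged = {}
    conflicts = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o  # both sides agree, or neither changed the field
        elif o == b:
            value = t  # only the other device changed this field
        elif t == b:
            value = o  # only this device changed this field
        else:
            conflicts[key] = (o, t)  # same field edited on both sides
            value = o  # keep ours until the user picks a version
        if value is not None:
            merged[key] = value
    return merged, conflicts
```

Merging identical inputs returns the same content with no conflicts, which is the property that lets a sync layer skip version bumps for unchanged data.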

Compared with file-based agents (OpenClaw, Hermes CLI): their settings live in scattered files with no merge UI, so multi-device use means manual copying or git conflicts. Myrm instead provides config sync with field-level merging, conflict resolution, and rollback through a full GUI.

Network Proxy (For Restricted Networks)

Myrm automatically respects your system’s proxy settings. If you already have a VPN or proxy (e.g., ClashX, V2Ray) configured at the OS level, all LLM API calls route through it with zero additional configuration.

How it works:

The HTTP client (httpx) defaults to trust_env=True, automatically reading HTTP_PROXY/HTTPS_PROXY/ALL_PROXY environment variables
System-level proxy settings (macOS Network Preferences, Windows Internet Options) are automatically applied
No restart required — proxy changes take effect on the next API call

Alternative (Custom Base URL): If you prefer a relay/forwarding service (e.g., one-api, new-api), set each provider’s Base URL to your relay endpoint. Myrm supports this per provider, so you can mix direct connections and relays as needed.
Cloud users: In SaaS mode, the control plane’s LLM Relay handles all provider connections, so no proxy configuration is needed.

Vision Fallback

When your main model does not support vision, set a Vision Fallback model in Settings → Models. Myrm automatically converts attached images into text descriptions before sending them to the main model — with live progress in chat and session-level caching for duplicate images. See the Vision & Image Understanding guide for setup and usage.

Multi-Model Consensus (MoA)

Multiple models generate answers in parallel, and an aggregator model synthesizes the most reliable result. Ideal for critical decisions, contract reviews, and technical assessments that require cross-validation.

How to use: Go to Settings > Agent > Capabilities > Multi-Model Consensus, enable it, and select reference and aggregator models.

Key parameters:

Reference Temperature — Controls reference model output diversity (default 0.6, higher = more diverse)
Aggregator Temperature — Controls final synthesis precision (default 0.4, lower = more precise)
Reference Output Limit — Limits max output tokens per reference model (default unlimited, recommend 600-2000 for cost savings)
Reference Reasoning Effort — Controls reasoning depth for reference models (low/medium/high); recommend low to save reasoning token costs
Aggregator Reasoning Effort — Controls reasoning depth for the aggregator model; recommend high for synthesis quality
Min Successful — Minimum reference models that must succeed (default 1)

Smart optimizations:

Skips aggregation when only 1 reference succeeds — returns directly
Aggregator output is never truncated — user-visible answers stay complete
Compatible with reasoning model output formats (DeepSeek-R1, etc.) with independent reasoning effort control
Full multi-turn conversation context preserved
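
The overall flow (parallel references, a minimum-success check, and skipping aggregation when only one reference succeeds) can be sketched with `asyncio`. The function `consensus` and the shape of the model callables are hypothetical stand-ins; real calls would go through the configured providers.

```python
import asyncio


async def consensus(prompt, references, aggregator, min_successful=1):
    """Query reference models in parallel, then synthesize one answer.

    references: async callables taking the prompt and returning text.
    aggregator: async callable taking (prompt, answers) and returning text.
    """
    results = await asyncio.gather(
        *(ref(prompt) for ref in references), return_exceptions=True
    )
    # Failed references come back as exception objects; keep only answers.
    answers = [r for r in results if not isinstance(r, BaseException)]
    if len(answers) < max(1, min_successful):
        raise RuntimeError(
            f"only {len(answers)} of {len(references)} reference models succeeded"
        )
    if len(answers) == 1:
        return answers[0]  # nothing to aggregate; return the answer directly
    return await aggregator(prompt, answers)
```

Using `return_exceptions=True` lets one slow or failing reference model drop out without cancelling the others.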

Cost Visualization

Complete token economics visualization pipeline from backend to frontend:

Level	Capability	Key Components
Per-message	Token usage and cost per message	TokenUsageDisplay, cache savings water-drop animation
Per-session	Session diagnostics + execution trace replay	SessionAnalyticsDialog, ExecutionTraceTimeline
Dashboard	7/30/365-day multi-dimensional usage stats	UsageStatisticsSection, routing analytics panel
Budget control	MultidimensionalBudgetGuard (per-session/daily/per-call) + 4-stage progressive response (OK → WARNING eco-compress → FINALIZATION forced output → EXCEEDED hard block) + budget_boundary_middleware auto-intercept + SSE real-time push	BudgetPolicySection, BudgetBadge, BudgetDialog, channel budget management
Cloud quota	Work Unit quota pre-check before execution	useQuotaGuard hook
Usage ledger	Per-provider per-model token/cost full accounting	UsageLedger, cost_engine (5 providers, vision/reasoning multipliers)
Enterprise	Org usage overview + daily charts + model breakdown + user quota table + budget settings + audit logs	EnterpriseUsageTab
Model Policy	Org-level model whitelist (restrict which LLMs are available)	EnterpriseModelPolicyTab

Metric	Value
Budget & quota tests	374 passed (Jul 2026)
Token economics tests	355 passed
Token types tracked	7 (prompt/completion/cached/reasoning/audio_in/audio_out/image)
Frontend visualization components	18+
Backend statistics API endpoints	12

Tool Schema Auto-Normalization

When you switch between models (e.g., GPT-4o → Gemini → Claude), each provider has different requirements for tool schemas. Myrm automatically normalizes all tool schemas before sending them to the model, with no manual configuration required.

What it fixes automatically:

Orphan required entries (fields listed as required but not defined in properties) — causes 400 errors on Gemini/Vertex AI
Nested nullable patterns (anyOf: [{type: X}, {type: null}]) — causes errors on strict OpenAI function calling
$ref/$defs inline definitions — most providers don’t support JSON Schema references
Missing type annotations on nested objects — strict providers reject them
Anthropic-unsupported keywords (minimum, maxItems, pattern, etc.) — constraints are folded into description

Why this matters: MCP tools come from third-party servers with varying schema quality. Without normalization, switching your agent from one model to another often breaks tool calling entirely. Myrm ensures every tool works on every model, transparently.
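
As an illustration of one of these fixes, the sketch below removes orphan `required` entries from a JSON Schema, recursing into nested objects. It covers only that single rule, the function names are hypothetical, and it assumes a field counts as defined only when it appears in the sibling `properties` object.

```python
import copy


def drop_orphan_required(schema):
    """Return a copy of a JSON Schema with undefined `required` names removed."""
    schema = copy.deepcopy(schema)
    _fix(schema)
    return schema


def _fix(node):
    if isinstance(node, list):
        for item in node:
            _fix(item)
        return
    if not isinstance(node, dict):
        return
    properties = node.get("properties")
    required = node.get("required")
    if isinstance(required, list):
        defined = properties if isinstance(properties, dict) else {}
        kept = [name for name in required if name in defined]
        if kept:
            node["required"] = kept
        else:
            del node["required"]  # nothing left to require
    # Recurse into nested schemas (properties, items, anyOf, and so on).
    for value in node.values():
        _fix(value)
```

Working on a deep copy leaves the tool's original schema untouched, so the same definition can be normalized differently per provider.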

Fault Tolerance

The 14-layer error recovery system handles failures automatically:

Rate limit errors (4-strategy key rotation + credential pool + ManagedLLM/KeyPool in-agent failover)
Provider outages (Circuit Breaker with 3-tier cooldown + fallback presets)
Stream interruptions (token-level precise resume)
Response truncation (progressive output budget boost 2x → 3x → 4x)
Oversized images (automatic re-encoding and compression)
Model thinking mode errors (automatic mode adjustment and retry)
Empty responses (parameter adjustment and retry)
Iteration limits (grace-call summary — users never see a blank response)
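
The progressive output budget boost for truncated responses can be sketched as a short retry loop. Applying each multiplier to the original budget (rather than compounding) is an assumption, and `generate_with_budget_boost` and the `call_model` callable are hypothetical names used only for this example.

```python
BOOST_MULTIPLIERS = (2, 3, 4)  # the 2x, 3x, 4x sequence from the list above


def generate_with_budget_boost(call_model, base_max_tokens):
    """Retry a truncated response with a progressively larger output budget.

    call_model: callable taking max_tokens and returning (text, truncated).
    Returns the last text and the budget that produced it.
    """
    budgets = [base_max_tokens] + [base_max_tokens * m for m in BOOST_MULTIPLIERS]
    text = ""
    for budget in budgets:
        text, truncated = call_model(budget)
        if not truncated:
            return text, budget
    return text, budgets[-1]  # still truncated after the final boost
```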

See Error Recovery for the full 14-layer architecture.
