Vision & Image Understanding

Myrm lets you attach images in chat, on channels, and in desktop uploads. If your main model does not support vision, Myrm automatically routes images through a Vision Fallback model and injects a text description — so you never have to switch models manually.

How It Works

You paste, drag, or attach an image in the WebUI (or send one on a supported channel).
Myrm checks whether the selected main model supports vision (supports_vision).
If yes — the image is sent to the main model as a native multimodal block.
If no — Myrm shows a live “Analyzing image…” status, calls your configured Vision Fallback model, replaces the image with a concise text description, then continues the conversation with your main model.
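The routing decision in the steps above can be sketched in a few lines. This is a minimal illustration, not Myrm's actual code: `Model`, `route_image`, the `describe` callback, and the shape of the returned block are hypothetical names chosen for the example. Only the `supports_vision` flag comes from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    supports_vision: bool  # capability flag named in the steps above


def route_image(main: Model, fallback: Model, image: bytes,
                describe: Callable[[Model, bytes], str]) -> dict:
    """Return the content block the main model would receive (hypothetical shape)."""
    if main.supports_vision:
        # Native path: pass the image through unchanged.
        return {"type": "image", "data": image}
    if not fallback.supports_vision:
        raise ValueError(f"Vision Fallback model {fallback.name!r} cannot read images")
    # Fallback path: swap the image for a text description.
    return {"type": "text", "text": describe(fallback, image)}
```

Passing `describe` in as a callback keeps the sketch independent of any particular provider SDK.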

The same pipeline applies to video: native video-capable models receive the file directly; others get frame extraction plus vision analysis.
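As a rough illustration of the frame-extraction step, the sketch below picks evenly spaced timestamps at which frames could be sampled. How Myrm actually selects frames is not documented here: the function name `sample_timestamps`, the default of 8 frames, and the midpoint strategy are all assumptions.

```python
def sample_timestamps(duration_s: float, max_frames: int = 8) -> list[float]:
    """Evenly spaced timestamps in seconds, one at the midpoint of each segment."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    # Midpoints avoid the very first and last frames, which are often blank.
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]
```

Each sampled frame would then go through the same image path as a still image.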

Setup

Open Settings → Models.
Pick your Main chat model (any provider).
Set Vision Fallback to a vision-capable model (e.g. GPT-4o, Gemini Flash, Qwen-VL).
Optional: use the capability icons in the model picker — models with the eye icon support vision natively.

Myrm auto-detects model capabilities via LiteLLM and models.dev. If detection is wrong for a model, you can override its capabilities in that model's card.
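One plausible way to combine auto-detection with a manual override is "override wins, otherwise use the detected value". The sketch below assumes that precedence and treats an unknown capability as no vision support. Both choices, and the function name, are assumptions for illustration.

```python
from typing import Optional


def resolve_supports_vision(detected: Optional[bool],
                            override: Optional[bool]) -> bool:
    """User override wins; otherwise use detection; unknown means no vision."""
    if override is not None:
        return override
    # None (capability not found in any registry) is treated as False.
    return bool(detected)
```

Defaulting unknown models to "no vision" is the safer choice in this sketch, because it sends images to the fallback rather than to a model that may reject them.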

What You Can Do

Screenshot Q&A — paste a screenshot and ask what is wrong or what to click next.
Annotation editor — draw circles or arrows on an image before sending so the agent focuses on the right region.
Non-vision main model — use a cheap text model for reasoning while the Vision Fallback model handles images.
Channel images — Telegram, Discord, iMessage, and other channels deliver images into the same pipeline.
PDF & documents — scanned or image-heavy PDFs can route through vision when text extraction is sparse.
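For the PDF case, "text extraction is sparse" could be decided with a simple heuristic such as average extracted characters per page. The sketch below is one such heuristic under stated assumptions: the function name `needs_vision` and the threshold of 50 characters are hypothetical, not Myrm's documented behavior.

```python
def needs_vision(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """True when extracted text is too sparse to rely on (likely a scanned PDF)."""
    if not page_texts:
        return False  # nothing to analyze
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```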

Status & Caching

While fallback analysis runs, the chat shows an Analyzing image (or Analyzing video) indicator, which clears automatically when analysis finishes. Identical images in the same session are cached by content hash, so repeat uploads do not trigger duplicate vision API calls.
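Content-hash caching can be sketched as a dictionary keyed by a digest of the image bytes. The text does not say which hash Myrm uses. SHA-256 and the class name `DescriptionCache` are assumptions for this example.

```python
import hashlib
from typing import Callable


class DescriptionCache:
    """Per-session cache of image descriptions, keyed by content hash."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def get_or_describe(self, image: bytes,
                        describe: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(image).hexdigest()  # assumed hash function
        if key not in self._by_hash:
            # Only call the vision model the first time these bytes are seen.
            self._by_hash[key] = describe(image)
        return self._by_hash[key]
```

Because the key depends only on the bytes, the same screenshot uploaded under a different filename still hits the cache.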

Tips

Configure a fast, cost-effective model for Vision Fallback if you send many screenshots.
For large images, Myrm compresses automatically before calling the vision model.
If analysis fails, you get a clear error message, and the rest of your message is still processed.
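The failure behavior in the last tip amounts to catching the analysis error and continuing with the text portion of the message. A minimal sketch, assuming a hypothetical `build_prompt` helper and a `describe` callback that raises on failure:

```python
from typing import Callable, Optional


def build_prompt(user_text: str, image: bytes,
                 describe: Callable[[bytes], str]) -> tuple[str, Optional[str]]:
    """Return (prompt, error). The user's text survives even if analysis fails."""
    try:
        description = describe(image)
    except Exception as exc:  # broad on purpose: report any failure, do not re-raise
        return user_text, f"Image analysis failed: {exc}"
    return f"{user_text}\n\n[Image description: {description}]", None
```

The caller can show the error string to the user while still sending the returned prompt to the main model.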

Related

Model Configuration — slots, routing, and API keys
Voice Interaction — audio and video message transcription
Browser Automation — vision-assisted page verification
