Promptyard

LLM Capability Matrix

Which frontier models support multimodal input, vision, audio, JSON mode, and tool calling.

Model Provider Multimodal Vision Audio JSON mode Tool calling
GPT-4o OpenAI
GPT-4o mini OpenAI
GPT-4 Turbo OpenAI
GPT-4 OpenAI
GPT-3.5 Turbo OpenAI
Claude 3.7 Sonnet Anthropic
Claude 3 Opus Anthropic
Claude 3.5 Haiku Anthropic
Gemini 2.0 Pro Google AI
Gemini 2.0 Flash Google AI
Mistral Large 2 Mistral
DeepSeek V3 DeepSeek

supported · not supported. Last reviewed 2026-05-10.

Pick the cheapest model that has every flag you need

For most production workloads the right question isn't "which model is best" but "which is the cheapest model that supports the flags I need". If you need vision + tool calling + JSON mode, GPT-4o mini ($0.15 input / $0.60 output per 1M tokens) gives you all three at a fraction of GPT-4o's price.
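The selection rule can be sketched as a filter followed by a minimum. This is a minimal illustration, not the site's implementation: only the GPT-4o mini row uses figures from the text, the other rows are made-up placeholders, and the single "blended" price (an assumed 25% share of output tokens) is just one hypothetical way to rank models with separate input and output prices.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    flags: frozenset[str]   # capabilities marked "supported" in the matrix
    input_price: float      # USD per 1M input tokens
    output_price: float     # USD per 1M output tokens

    def blended_price(self, output_share: float = 0.25) -> float:
        """Price per 1M tokens, assuming `output_share` of tokens are output."""
        return (1 - output_share) * self.input_price + output_share * self.output_price


# Only the GPT-4o mini row comes from the text; the other two rows are
# placeholders with invented flags and prices, not real models.
MODELS = [
    Model("GPT-4o mini", frozenset({"vision", "tool_calling", "json_mode"}), 0.15, 0.60),
    Model("placeholder-flagship",
          frozenset({"vision", "audio", "tool_calling", "json_mode"}), 3.00, 12.00),
    Model("placeholder-text-only", frozenset({"json_mode"}), 0.05, 0.20),
]


def cheapest_with(required: set[str], models: list[Model]) -> Model | None:
    """Return the cheapest model whose flags include every required flag."""
    qualifying = [m for m in models if required <= m.flags]
    return min(qualifying, key=Model.blended_price, default=None)


choice = cheapest_with({"vision", "tool_calling", "json_mode"}, MODELS)
print(choice.name if choice else "no model qualifies")
```

The text-only placeholder is cheaper but lacks two of the required flags, so it is filtered out before prices are compared.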

Eval the cheapest qualifying model on your task before reaching for a flagship. The capability matrix narrows the list; an eval picks the winner.
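One way to picture "an eval picks the winner" is to walk the qualifying models from cheapest to most expensive and stop at the first one that clears a quality bar. The sketch below assumes exact-match scoring and a 0.9 threshold, both arbitrary choices for illustration, and it uses stand-in functions instead of real provider calls so that it runs offline.

```python
from typing import Callable, Optional

# A "model" here is any callable mapping a prompt to a response string.
# In practice it would wrap a provider SDK call; that wiring is left out.
ModelFn = Callable[[str], str]


def accuracy(model: ModelFn, cases: list) -> float:
    """Fraction of (prompt, expected) cases the model answers exactly."""
    hits = sum(model(prompt).strip() == expected for prompt, expected in cases)
    return hits / len(cases)


def pick_winner(candidates: list, cases: list, threshold: float = 0.9) -> Optional[str]:
    """Candidates are (name, model) pairs sorted cheapest-first.

    Returns the name of the first candidate that meets the threshold,
    or None if none does.
    """
    for name, model in candidates:
        if accuracy(model, cases) >= threshold:
            return name
    return None


# Stand-in task and models (hypothetical, for illustration only).
cases = [("2+2", "4"), ("3+3", "6")]
answers = dict(cases)


def cheap(prompt: str) -> str:
    return "4"  # gets only the first case right


def flagship(prompt: str) -> str:
    return answers[prompt]  # gets every case right


print(pick_winner([("cheap", cheap), ("flagship", flagship)], cases))  # prints "flagship"
```

Lowering the threshold to 0.5 would make the cheaper stand-in win, which is the point: the bar should come from what the task needs, not from the model's reputation.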

FAQ

What does "JSON mode" mean exactly?
A guarantee that the model returns syntactically valid JSON. It does not guarantee that the output conforms to your schema; for that you need tool calling with a JSON Schema. OpenAI exposes JSON mode as response_format: { type: "json_object" }; Anthropic surfaces the equivalent through tool calling.
What about reasoning / chain-of-thought modes?
Not yet a column. Reasoning capability is currently a model-level distinction (o1, o3, Claude's extended thinking) rather than a flag. We'll add a "reasoning" column once enough models in the matrix offer it.
Why does Claude show audio = false but it can transcribe audio in some clients?
The matrix tracks API-native capability, not what client wrappers (e.g., Claude Desktop with audio attachments) layer on top. Claude's API doesn't take raw audio yet.
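To make the JSON mode answer above concrete, the sketch below shows a response that parses as valid JSON yet still fails a schema check. The sample payload, the expected field types, and the `schema_errors` helper are all hypothetical; a real project would more likely validate against a full JSON Schema than use a hand-rolled check like this.

```python
from __future__ import annotations

import json

# Hypothetical model output: syntactically valid JSON, wrong shape.
raw = '{"title": "Quarterly report", "pages": "twelve"}'

# Hypothetical expected shape: field name -> required Python type.
EXPECTED = {"title": str, "pages": int}


def schema_errors(payload: dict, expected: dict[str, type]) -> list[str]:
    """List every missing field and every field with the wrong type."""
    errors = []
    for field, kind in expected.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], kind):
            actual = type(payload[field]).__name__
            errors.append(f"{field}: expected {kind.__name__}, got {actual}")
    return errors


payload = json.loads(raw)  # succeeds: this is all JSON mode promises
problems = schema_errors(payload, EXPECTED)
print(problems)  # ['pages: expected int, got str']
```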
