LLMs 101

Monthly model tracker

How AI models rank when real humans compare them

Benchmark scores don't tell the full story. This tracker translates human preference data and real-world usage into plain-English tiers — updated every month.

Updated June 2026
🏆
What is human preference ranking?
Real users are shown two anonymous AI responses to the same prompt and pick the one they prefer. Thousands of these battles produce an Elo-style ranking — the same system used to rank chess players.
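To make the idea concrete, here is a minimal sketch of a classic Elo update applied to pairwise "battles". It is illustrative only: the starting rating of 1000, the K-factor of 32, and the model names are assumptions, and the leaderboard this tracker draws on may use a different statistical fit over the same kind of data.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one battle. K=32 is an assumed step size."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


def rank(battles, start: float = 1000.0):
    """battles is a list of (winner, loser) pairs; returns models sorted by rating."""
    ratings = {}
    for winner, loser in battles:
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update(r_w, r_l, a_won=True)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical battles between three placeholder models.
battles = [
    ("model_x", "model_y"),
    ("model_x", "model_z"),
    ("model_y", "model_z"),
]
leaderboard = rank(battles)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected win, which is why thousands of votes are needed before the ordering settles.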
💰
What does "cost vibe" mean?
Instead of listing raw dollar-per-token figures (which change constantly), we use plain-English cost tiers: Free, Ultra-low, Low, Standard, and Premium. Each tier reflects how expensive it is to run a model at real production scale.
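For readers who want to see why per-token prices matter at scale, the arithmetic can be sketched as below. Every number here is hypothetical (request volume, tokens per request, and both prices are made up for illustration), and none of them is a real provider's rate.

```python
def monthly_cost_usd(requests_per_day: int,
                     tokens_per_request: int,
                     usd_per_million_tokens: float,
                     days: int = 30) -> float:
    """Rough monthly spend: total tokens processed times the per-token price."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: 10,000 requests a day at about 1,500 tokens each.
workload = {"requests_per_day": 10_000, "tokens_per_request": 1_500}

# Hypothetical prices in USD per million tokens, for illustration only.
pricier = monthly_cost_usd(usd_per_million_tokens=2.00, **workload)
cheaper = monthly_cost_usd(usd_per_million_tokens=0.25, **workload)

print(f"pricier model: ${pricier:,.2f} per month")
print(f"cheaper model: ${cheaper:,.2f} per month")
```

A gap that looks tiny per token becomes a large monthly difference once volume is factored in, which is the distinction the cost tiers are meant to capture.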
📊
Why not show benchmark scores?
Benchmarks like MMLU or HumanEval measure narrow academic tasks. A score of "86.2%" on its own says little about how a model will handle your actual work. Human preference rankings and real use-case guidance are far more useful for most readers.
🔄
How often is this updated?
Monthly. The AI landscape moves fast — a model released last month can jump several tiers. Check back each month, or read the Trends section for major shifts as they happen.
How to read this table: Rankings reflect human preference data from LMSYS Chatbot Arena, combined with our own qualitative assessment of real-world professional use. Tier 1 = consistently preferred by humans over models in the lower tiers. Tiers are relative: all listed models are genuinely excellent by any historical standard.
1
OpenAI
o3 / o4-mini
Reasoning-first models
🏆 Tier 1 — Top
Best for
Advanced reasoning, maths, competitive coding, and any task requiring careful multi-step logic. The current benchmark leader.
Premium · Closed API · ↑ Holding #1
2
Anthropic
Claude 4 Opus
Flagship intelligence model
🥈 Tier 1 — Near top
Best for
Long-form analysis, nuanced writing, complex document workflows, and tasks requiring careful, principled reasoning with a safety-conscious approach.
Premium · Closed API · ↑ +1 this month
3
Google DeepMind
Gemini 2.5 Pro
1M token context, multimodal
🥉 Tier 1 — Highly competitive
Best for
Analysing enormous inputs — full video, large codebases, lengthy research archives. Unmatched context window at this tier.
Standard · Closed API · → Stable
4
OpenAI
GPT-4o / GPT-4.1
Versatile workhorse
Tier 2 — Excellent all-rounder
Best for
The proven, reliable default for business use. Fast, multimodal, excellent at structured tasks. The model most enterprise software is built on.
Standard · Closed API · → Stable
5
DeepSeek
DeepSeek R1
Open reasoning model
🚀 Tier 2 — Market disruptor
Best for
High-volume reasoning at very low cost. Comparable to o1 on many benchmarks at a fraction of the price, and arguably the most significant cost disruption in AI so far.
Ultra-low · Open weights · ↑ Rising fast
6
Anthropic
Claude 4 Sonnet
Balanced capability model
Tier 2 — Best value Claude
Best for
Everyday professional tasks — writing, editing, coding, analysis — where Opus-tier quality isn't required but quality still matters. Excellent cost/performance ratio.
Standard · Closed API · → Stable
7
Meta AI
Llama 3.3 70B
Best open-weight model
Tier 2 — Open weight champion
Best for
Private deployments and high-volume automation where data cannot leave your infrastructure. Frontier-competitive quality with no API costs.
Free* · Open weights · → Stable
8
Google DeepMind
Gemini 2.0 Flash
Speed-optimised production model
Tier 3 — Speed & cost leader
Best for
Real-time applications, high-frequency API calls, and production use cases where response speed and cost-per-call are the primary constraints.
Low · Closed API · → Stable

* Llama models are free to download and run but require your own hardware or cloud compute. Self-hosting costs (electricity, GPU rental) vary. · Rankings reflect human preference data primarily sourced from LMSYS Chatbot Arena, combined with editorial assessment. · Cost tiers are illustrative — actual pricing changes frequently. Check each provider's current pricing page for exact figures. · This tracker focuses on general-purpose chat and reasoning models. Specialised models (image generation, audio, video) are not included.