AI Model Tracker — Human Rankings & Cost Guide

🏆

What is human preference ranking?

Real users are shown two anonymous AI responses to the same prompt and pick the one they prefer. Thousands of these battles produce an Elo-style ranking — the same system used to rank chess players.
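
To make the idea concrete, here is a minimal sketch of how pairwise battles can be turned into an Elo-style ranking. It uses the textbook Elo update from chess. The starting rating of 1000, the K-factor of 32, and the model names are illustrative assumptions, not this tracker's actual parameters or data.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Chance that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new (rating_a, rating_b) after one battle between A and B."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)  # a surprising result moves ratings more
    return rating_a + delta, rating_b - delta


def rank_models(battles, start: float = 1000.0, k: float = 32.0):
    """battles is an iterable of (winner, loser) pairs, in vote order.

    Assumes the two names in each pair are different models.
    """
    ratings = {}
    for winner, loser in battles:
        r_win = ratings.get(winner, start)   # unseen models start at `start`
        r_lose = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update_ratings(r_win, r_lose, True, k)
    # Highest rating first
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (preferred response, other response)
battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ranked = rank_models(battles)
for name, rating in ranked:
    print(f"{name}: {rating:.1f}")
```

Beating a higher-rated model earns more points than beating a lower-rated one, which is why a ranking built this way needs thousands of votes before it settles.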

💰

What does "cost vibe" mean?

Instead of listing raw dollar-per-token figures (which change constantly), we use plain-English cost tiers: Free, Low, Standard, and Premium — reflecting how expensive it is to run each model at real production scale.
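
As a rough illustration of why production scale matters more than the per-token price itself, the sketch below estimates a monthly bill from traffic volume and buckets it into a tier. Every number here is a made-up assumption: the prices, the traffic, and the tier cut-offs are hypothetical, and the tiers in this guide are assigned by editorial review rather than by a formula like this one.

```python
def monthly_cost_usd(requests_per_day: int, tokens_per_request: int,
                     usd_per_million_tokens: float, days: int = 30) -> float:
    """Estimate a monthly bill from traffic volume and a per-token price."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


def cost_tier(monthly_usd: float) -> str:
    """Map a monthly bill to a plain-English tier (hypothetical cut-offs)."""
    if monthly_usd == 0:
        return "Free"
    if monthly_usd < 1_000:
        return "Low"
    if monthly_usd < 10_000:
        return "Standard"
    return "Premium"


# Hypothetical workload: 10,000 requests a day at 2,000 tokens each
cheap = monthly_cost_usd(10_000, 2_000, usd_per_million_tokens=1.0)
pricey = monthly_cost_usd(10_000, 2_000, usd_per_million_tokens=20.0)
print(cheap, cost_tier(cheap))
print(pricey, cost_tier(pricey))
```

The same workload lands in very different tiers depending on the model, even though the per-token prices look small in isolation.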

📊

Why not show benchmark scores?

Benchmarks like MMLU or HumanEval measure narrow academic tasks. A score of "86.2%" on its own tells you little about how a model will handle your actual work. Human preference rankings and real use-case guidance are far more useful for most readers.

🔄

How often is this updated?

Monthly. The AI landscape moves fast — a model released last month can jump several tiers. Check back each month, or read the Trends section for major shifts as they happen.

How to read this table: Rankings reflect our monthly editorial review of current AI benchmarks and independent testing, combined with qualitative assessment of real-world professional use. Tier 1 = consistently preferred across that review. Tiers are relative — all listed models are genuinely excellent by any historical standard.

OpenAI

GPT-5.6 Sol

Flagship frontier intelligence model

🏆 Tier 1 - Top

Best for

The hardest reasoning, coding, and agentic work. Runs behind ChatGPT Pro mode and powers long autonomous tasks split across subagents.

Premium Closed API

Anthropic

Claude Opus 4.8

Flagship intelligence model

🥈 Tier 1 - Top

Best for

The most demanding coding, judgment, and agentic tasks. Leads on hard coding and is tuned for honesty and catching its own mistakes.

Premium Closed API

Google DeepMind

Gemini 3.1 Pro

Flagship reasoning and multimodal model

🥉 Tier 1 - Top

Best for

Complex projects needing deep reasoning plus strong multimodal understanding. Google's top generally available Pro model while 3.5 Pro remains delayed.

Premium Closed API

xAI

Grok 4.5

Opus-class coding and agentic model

Tier 1 - Top (challenger)

Best for

Software engineering and agentic workflows, trained alongside Cursor on real developer sessions. Fast and token-efficient with live access to X data.

Standard Closed API

Anthropic

Claude Sonnet 5

Best-value all-rounder

Tier 2 - Best value Claude

Best for

Everyday coding, writing, and agentic work at near-Opus quality for roughly 40% less. The default model on Claude Free and Pro plans.

Standard Closed API

Google DeepMind

Gemini 3.6 Flash

Fast, efficient workhorse model

Tier 2 - Best value Gemini

Best for

High-volume coding, knowledge work, and multimodal tasks. Delivers quality close to Gemini Pro using up to 17% fewer tokens than 3.5 Flash.

Low Closed API

Moonshot AI

Kimi K3

Largest open-weight frontier model

Tier 3 - Open-weight frontier

Best for

Long-horizon coding, knowledge work, and reasoning at open-weight cost. At 2.8T parameters, it benchmarks near the very top of widely available models.

Low Open weights

Z.ai

GLM-5.2

Top open-weight coding model

Tier 3 - Open-weight value

Best for

Long-horizon agentic coding on a 1M-token context at a fraction of closed-model cost. MIT-licensed and downloadable, it lands near Opus 4.8 on key agent benchmarks.

Low Open weights

DeepSeek

DeepSeek V4-Pro

Cost-efficient open reasoning model

Tier 3 - Open-weight value

Best for

Reasoning, math, and agentic coding on a 1M-token context. MIT-licensed open weights that rival top closed models while leading most other open releases.

Ultra-low Open weights

OpenAI

GPT-5.6 Luna

Budget high-volume model

Tier 4 - Speed and cost leader

Best for

Cost-sensitive, high-volume workloads on OpenAI's current frontier generation. The cheapest sibling of the GPT-5.6 family, built for scale.

Low Closed API

Anthropic

Claude Haiku 4.5

Fast, low-cost small model

Tier 4 - Speed and cost leader

Best for

Real-time chat, customer support, and high-volume subagent tasks. Delivers near-frontier coding at low cost and high speed.

Low Closed API

Mistral

Mistral Medium 3.5

European balanced enterprise model

Tier 2 - Emerging challenger

Best for

Balanced enterprise workloads from Europe's leading lab, with strong performance and simplified deployment. A credible sovereign alternative to US-based models.

Standard Closed API

* Llama models are free to download and run but require your own hardware or cloud compute. Self-hosting costs (electricity, GPU rental) vary. · Rankings reflect monthly editorial review of current AI benchmarks and independent testing, combined with editorial assessment — see our Methodology & Glossary page for full detail. · Cost tiers are illustrative and based on that same monthly review — actual pricing changes frequently; check each provider's current pricing page for exact figures. · This tracker focuses on general-purpose chat and reasoning models. Specialised models (image generation, audio, video) are not included.

How AI models rank when real humans compare them