# Changelog
Score updates, methodology changes, and notable events.
## Benchmark Update: March 3, 2026
**zhipu-ai:**
- ARC-AGI-2: 4.86% (new)
## Benchmark Update: March 2, 2026
**openai:**
- Chatbot Arena ELO: 1429 ELO (1362 → 1429 ↑)
- SWE-bench Verified: 74.4% (69% → 74.4% ↑)
- BigCodeBench: 61.1% (78% → 61.1% ↓)
- AIME 2025: 100% (86.7% → 100% ↑)
- ARC-AGI-2: 54.16% (32% → 54.16% ↑)
- Video Arena ELO: 1199 ELO (1245 → 1199 ↓)
- AA Video Arena ELO: 1199 ELO (1240 → 1199 ↓)
- MMMU: 85.4% (72% → 85.4% ↑)
- GAIA: 33.22% (75% → 33.22% ↓)

**google-deepmind:**
- Chatbot Arena ELO: 1474 ELO (1355 → 1474 ↑)
- HumanEval+: 79.3% (90.5% → 79.3% ↓)
- BigCodeBench: 59.9% (74% → 59.9% ↓)
- ARC-AGI-2: 33.61% (77.1% → 33.61% ↓)
- Image Arena ELO: 1101 ELO (1230 → 1101 ↓)
- AA Image Arena ELO: 1101 ELO (1225 → 1101 ↓)
- Video Arena ELO: 1221 ELO (1280 → 1221 ↓)
- AA Video Arena ELO: 1221 ELO (1275 → 1221 ↓)
- MMMU: 84% (70.5% → 84% ↑)

**xai:**
- Chatbot Arena ELO: 1443 ELO (1330 → 1443 ↑)
- AIME 2025: 92.5% (65% → 92.5% ↑)
- Image Arena ELO: 918 ELO (new)
- AA Image Arena ELO: 918 ELO (new)
- MMMU: 78% (new)

**deepseek:**
- Chatbot Arena ELO: 1424 ELO (1325 → 1424 ↑)
- SWE-bench Verified: 70% (55% → 70% ↑)
- BigCodeBench: 62.2% (72% → 62.2% ↓)
- AIME 2025: 95.83% (87% → 95.83% ↑)
- ARC-AGI-2: 4.03% (22% → 4.03% ↓)

**mistral:**
- Chatbot Arena ELO: 1372 ELO (1295 → 1372 ↑)
- HumanEval+: 73.8% (82% → 73.8% ↓)
- BigCodeBench: 52.5% (60% → 52.5% ↓)

**cohere:**
- Chatbot Arena ELO: 1327 ELO (1250 → 1327 ↑)

**stepfun:**
- Chatbot Arena ELO: 1320 ELO (1160 → 1320 ↑)

**alibaba-qwen:**
- SWE-bench Verified: 9% (42% → 9% ↓)
- HumanEval+: 87.2% (80% → 87.2% ↑)
- MMMU: 51.4% (58% → 51.4% ↓)

**anthropic:**
- HumanEval+: 77.4% (94% → 77.4% ↓)
- BigCodeBench: 59% (80% → 59% ↓)
- MMMU: 80.7% (68% → 80.7% ↑)

**meta-ai:**
- BigCodeBench: 61.4% (68% → 61.4% ↓)
- MMMU: 73.4% (62% → 73.4% ↑)

**midjourney:**
- Image Arena ELO: 1070 ELO (1285 → 1070 ↓)
- AA Image Arena ELO: 1070 ELO (1280 → 1070 ↓)

**recraft:**
- Image Arena ELO: 1073 ELO (1220 → 1073 ↓)
- AA Image Arena ELO: 1073 ELO (1215 → 1073 ↓)

**stability-ai:**
- Image Arena ELO: 1028 ELO (1180 → 1028 ↓)
- AA Image Arena ELO: 1028 ELO (1175 → 1028 ↓)

**kuaishou-kling:**
- Video Arena ELO: 1087 ELO (1275 → 1087 ↓)
- AA Video Arena ELO: 1087 ELO (1270 → 1087 ↓)

**luma-ai:**
- Video Arena ELO: 947 ELO (1235 → 947 ↓)
- AA Video Arena ELO: 947 ELO (1230 → 947 ↓)

**pika-labs:**
- Video Arena ELO: 1028 ELO (1210 → 1028 ↓)
- AA Video Arena ELO: 1028 ELO (1205 → 1028 ↓)

**tencent-hunyuan:**
- Video Arena ELO: 1002 ELO (1195 → 1002 ↓)
- AA Video Arena ELO: 1002 ELO (new)

**zhipu-ai:**
- MMMU: 75.4% (55% → 75.4% ↑)
- SWE-bench Verified: 72.8% (new)
- AIME 2025: 96.67% (new)
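Entries in these updates follow the pattern `metric: new (old → new ↑/↓)`, with `(new)` for first-time scores. A minimal sketch of a formatter producing that pattern, rounding raw floats before display; the function name and signature are illustrative assumptions, not the site's actual code:

```python
from typing import Optional

def format_delta(metric: str, old: Optional[float], new: float, unit: str = "%") -> str:
    """Render one changelog entry as 'metric: new (old → new ↑/↓)'.

    Rounds to two decimals so raw float artifacts (e.g. 95.83333333333334)
    never reach the rendered changelog.
    """
    new_s = f"{round(new, 2):g}{unit}"
    if old is None:
        return f"{metric}: {new_s} (new)"
    arrow = "↑" if new > old else "↓" if new < old else "→"
    old_s = f"{round(old, 2):g}{unit}"
    return f"{metric}: {new_s} ({old_s} → {new_s} {arrow})"

print(format_delta("AIME 2025", 87.0, 95.83333333333334))
# → AIME 2025: 95.83% (87% → 95.83% ↑)
```

The `:g` format drops trailing zeros, so whole-number scores render as `87%` rather than `87.0%`, matching the entries above.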
## Benchmark Update: Claude Sonnet/Opus 4.6 & Gemini 3.1 Pro
Updated benchmark scores across multiple categories to reflect two major model releases.

**Anthropic — Claude Opus 4.6 & Sonnet 4.6:**
- GPQA Diamond: 74.8% → 91.3% (Opus 4.6)
- SWE-bench Verified: 72.5% → 80.8% (Opus 4.6)
- MATH-500: 93.0% → 97.8% (Sonnet 4.6)
- ARC-AGI-2: 25.0% → 68.8% (Opus 4.6)

**Google DeepMind — Gemini 3.1 Pro:**
- GPQA Diamond: 76.5% → 94.3%
- SWE-bench Verified: 63.8% → 80.6%
- MMLU-Pro: 86.8% → 89.8%
- Humanity's Last Exam: 24.5% → 44.4%
- AIME 2025: 83.0% → 95.0%
- ARC-AGI-2: 28.0% → 77.1%
- FrontierMath: 28.5% → 38.0%
- Terminal-Bench: 55.0% → 68.5%

Google DeepMind moves to #1 overall. Anthropic climbs to #3.
## Video Generation Expansion
Added 13 new organizations specializing in image and video generation: Midjourney, Black Forest Labs, Runway, Kuaishou (Kling), ByteDance Seed, Pika Labs, Luma AI, MiniMax (Hailuo), Ideogram, Recraft, Lightricks, StepFun, and Zhipu AI.
## Initial Rankings Published
First edition of The AI Race rankings with 25 organizations scored across capability, velocity, adoption, compute, ecosystem, and trust dimensions. Expanded to cover image and video generation specialists alongside frontier text labs.
## DeepSeek-R1 Impact Update
Updated DeepSeek scores following the release of DeepSeek-R1 reasoning model. Also added Janus-Pro image generation scores.
## Real Benchmark Data Integration
We've integrated 24 public AI benchmarks across 7 categories into The AI Race, replacing placeholder scores with real, verifiable data from authoritative sources.

Benchmarks tracked:
- Language & Knowledge: Chatbot Arena ELO, MMLU-Pro, GPQA Diamond, SimpleQA, Humanity's Last Exam, IFEval
- Coding: SWE-bench Verified, HumanEval+, LiveCodeBench, Aider Polyglot, BigCodeBench
- Reasoning & Math: MATH-500, AIME 2025, ARC-AGI-2, FrontierMath
- Image Generation: Image Arena ELO (LM Arena + Artificial Analysis)
- Video Generation: Video Arena ELO (LM Arena + Artificial Analysis)
- Multimodal: MMMU, Video-MME, GAIA
- Agents & Tools: TAU2-bench, WebArena, Terminal-Bench

Data is now automatically refreshed daily at 6:00 AM UTC from LM Arena, Artificial Analysis, and HuggingFace. Each organization's profile page now shows a full benchmark breakdown with scores, source links, and category groupings. The methodology page has been updated to reflect these changes.
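The daily refresh described above can be sketched as a fetch-and-merge job that keeps the newest reading per (organization, benchmark) pair. The source names match the text, but the record schema and function name are assumptions for illustration:

```python
from datetime import datetime, timezone

def merge_scores(readings):
    """Keep the latest reading per (org, benchmark) pair.

    readings: iterable of dicts with keys org, benchmark, score, source, ts.
    Later timestamps win, regardless of which source they came from.
    """
    latest = {}
    for r in readings:
        key = (r["org"], r["benchmark"])
        if key not in latest or r["ts"] > latest[key]["ts"]:
            latest[key] = r
    return latest

# Hypothetical readings for one org/benchmark from two of the listed sources.
readings = [
    {"org": "openai", "benchmark": "MMMU", "score": 72.0,
     "source": "lmarena", "ts": datetime(2026, 3, 1, tzinfo=timezone.utc)},
    {"org": "openai", "benchmark": "MMMU", "score": 85.4,
     "source": "artificialanalysis", "ts": datetime(2026, 3, 2, tzinfo=timezone.utc)},
]
merged = merge_scores(readings)
print(merged[("openai", "MMMU")]["score"])  # → 85.4
```

Keeping the source alongside each retained score is what allows the profile pages to show source links next to each benchmark value.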