Explore AI benchmarks and see how organizations rank across categories.
Last updated: Feb 1, 2026
Crowdsourced human-preference Elo from blind pairwise comparisons
Broad knowledge across 57 subjects with 10-option multiple choice
PhD-level science questions (198 expert-validated)
Short-form factual accuracy (4,326 questions)
Frontier human knowledge across dozens of academic subjects
Verifiable instruction-following evaluation
Real-world software engineering — resolving GitHub issues
Functional code correctness from docstrings (164 problems)
Contamination-free coding evaluation with fresh problems
Multi-language coding (225 exercises across 6 languages)
Practical programming tasks (1,140 tasks)
Competition-level mathematics across 6 domains
American Invitational Mathematics Examination (30 problems)
Abstract visual reasoning and fluid intelligence
Frontier-level mathematics (350 problems, 4 tiers)
Human-preference Elo for text-to-image generators
Artificial Analysis image quality Elo via blind votes
Human-preference Elo for text-to-video generators
Artificial Analysis video quality Elo via blind votes
College-level multimodal understanding (11.5K questions)
Multimodal video analysis across 6 visual domains
General AI assistant capabilities requiring reasoning & tool use
Conversational AI agent task completion (retail, airline, telecom)
Web interaction tasks in realistic simulated environments
AI agent CLI task completion in sandboxed Docker environments
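Several of the leaderboards above rank models by an Elo rating updated from blind pairwise human votes. A minimal sketch of the standard logistic Elo update (the K-factor of 32 and the 400-point scale are conventional defaults for illustration, not the parameters of any specific arena):

```python
# Sketch of logistic Elo updates from blind pairwise preference votes.
# K-factor and scale are illustrative defaults, not any arena's actual settings.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_wins else 0.0
    # Winner gains, loser loses, proportional to how surprising the result was.
    return (rating_a + k * (s_a - e_a),
            rating_b + k * ((1.0 - s_a) - (1.0 - e_a)))

# Two models start level at 1000; model A wins one blind comparison.
a, b = elo_update(1000.0, 1000.0, a_wins=True)
# Equal ratings give expected score 0.5, so A gains k/2 = 16 and B loses 16.
```

In practice arena leaderboards aggregate many thousands of such votes (and often fit ratings with a Bradley-Terry model rather than sequential updates), but the pairwise logistic comparison above is the core idea.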