| BENCHMARK | Opus 4.8 | Sonnet 4.6 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro: Fix real bugs in actual software repos (can it code like a senior dev?) | 69.2% | 79.6% ▲ BEST | 58.6% | 54.2% |
| Terminal-Bench 2.1: Complete tasks by running commands in a real terminal (shell, scripts, tools) | 74.6% | 59.1% | 78.2% ▲ BEST | 70.3% |
| Humanity's Last Exam: 3,000 questions that stump most human experts, no outside tools allowed | 49.8% ▲ BEST | 33.2% | 41.4% | 44.4% |
| Humanity's Last Exam (+ tools): Same expert questions, but with web search and calculators enabled | 57.9% ▲ BEST | 49.0% | 52.2% | 51.4% |
| OSWorld-Verified: Control a real computer (click buttons, open apps, fill forms, navigate UIs) | 83.4% ▲ BEST | — | 78.7% | 76.2% |
| GDPval-AA: Realistic knowledge-work tasks such as research, writing, analysis (higher score = more productive) | 1890 ▲ BEST | — | 1769 | 1314 |
| Finance Agent v2: End-to-end financial analysis tasks (modeling, forecasting, due diligence) | 53.9% ▲ BEST | — | 51.8% | 43.0% |
| GPQA Diamond: PhD-level questions in chemistry, biology, and physics, written by domain experts | — | 89.9% | — | 94.3% ▲ BEST |
| BrowseComp: Find specific hard-to-locate facts by searching and reading the web | — | — | 84.4% | 85.9% ▲ BEST |
| ARC-AGI-2: Spot the pattern in visual grids (tasks humans find easy but AI historically fails) | — | 58.3% | — | 77.1% ▲ BEST |