BENCHMARK COMPARE

FRONTIER AI MODELS // INDEPENDENT BENCHMARK DATA

BENCHMARK	DESCRIPTION	Opus 4.8	Sonnet 4.6	GPT-5.5	Gemini 3.1 Pro
SWE-Bench Pro	Fix real bugs in actual software repos (can it code like a senior dev?)	69.2%	79.6% ▲ BEST	58.6%	54.2%
Terminal-Bench 2.1	Complete tasks by running commands in a real terminal (shell, scripts, tools)	74.6%	59.1%	78.2% ▲ BEST	70.3%
Humanity's Last Exam	3,000 questions that stump most human experts; no outside tools allowed	49.8% ▲ BEST	33.2%	41.4%	44.4%
Humanity's Last Exam (+ tools)	Same expert questions, but with web search and calculators enabled	57.9% ▲ BEST	49.0%	52.2%	51.4%
OSWorld-Verified	Control a real computer: click buttons, open apps, fill forms, navigate UIs	83.4% ▲ BEST	—	78.7%	76.2%
GDPval-AA	Realistic knowledge-work tasks: research, writing, analysis (higher score = more productive)	1890 ▲ BEST	—	1769	1314
Finance Agent v2	End-to-end financial analysis tasks: modeling, forecasting, due diligence	53.9% ▲ BEST	—	51.8%	43.0%
GPQA Diamond	PhD-level questions in chemistry, biology, and physics, written by domain experts	—	89.9%	—	94.3% ▲ BEST
BrowseComp	Find specific hard-to-locate facts by searching and reading the web	—	—	84.4%	85.9% ▲ BEST
ARC-AGI-2	Spot the pattern in visual grids: tasks humans find easy but AI historically fails	—	58.3%	—	77.1% ▲ BEST

// SOURCES: anthropic.com/news/claude-opus-4-8 · openai.com · deepmind.google
// LEGEND: BOLD = best score in row among active models
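
// SKETCH: HOW "BEST IN ROW" COULD BE COMPUTED

The "best score in row among active models" rule can be illustrated with a minimal Python sketch. It is not the page's actual implementation. The function name `best_in_row`, the `SCORES` structure, and the choice to treat a missing cell as `None` are all assumptions made for illustration. It also assumes that higher is better on every benchmark, which holds for the rows copied here. Only three rows of the table are included.

```python
from typing import Optional

# Three rows copied from the table; None stands for a missing cell.
# The dictionary layout is an illustrative assumption, not the page's data model.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "SWE-Bench Pro": {
        "Opus 4.8": 69.2, "Sonnet 4.6": 79.6, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2,
    },
    "OSWorld-Verified": {
        "Opus 4.8": 83.4, "Sonnet 4.6": None, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2,
    },
    "GPQA Diamond": {
        "Opus 4.8": None, "Sonnet 4.6": 89.9, "GPT-5.5": None, "Gemini 3.1 Pro": 94.3,
    },
}


def best_in_row(row: dict[str, Optional[float]], active: set[str]) -> Optional[str]:
    """Return the active model with the highest score, or None if no active model has one."""
    # Keep only models that are toggled on and actually report a score.
    candidates = {m: s for m, s in row.items() if m in active and s is not None}
    if not candidates:
        return None
    # Assumes higher is better; on a tie, the first model in row order wins.
    return max(candidates, key=lambda m: candidates[m])


if __name__ == "__main__":
    all_models = {"Opus 4.8", "Sonnet 4.6", "GPT-5.5", "Gemini 3.1 Pro"}
    for name, row in SCORES.items():
        print(name, "->", best_in_row(row, all_models))
```

Toggling a model off changes the winner only for rows where that model led: with Sonnet 4.6 inactive, the SWE-Bench Pro marker would move to Opus 4.8, while a row in which no active model reports a score gets no marker at all.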

⚠ AI-CURATED — data may be inaccurate or outdated. Verify at official sources.

// ALWAYSBERUSHING · AI INTELLIGENCE FEED