BENCHMARK COMPARE

FRONTIER AI MODELS // INDEPENDENT BENCHMARK DATA
BENCHMARK · Opus 4.8 · Sonnet 4.6 · GPT-5.5 · Gemini 3.1 Pro
SWE-Bench Pro
Fix real bugs in actual software repos (can it code like a senior dev?)
69.2% · 79.6% ▲ BEST · 58.6% · 54.2%
Terminal-Bench 2.1
Complete tasks by running commands in a real terminal (shell, scripts, tools)
74.6% · 59.1% · 78.2% ▲ BEST · 70.3%
Humanity's Last Exam
3,000 questions that stump most human experts — no outside tools allowed
49.8% ▲ BEST · 33.2% · 41.4% · 44.4%
Humanity's Last Exam (+ tools)
Same expert questions, but with web search and calculators enabled
57.9% ▲ BEST · 49.0% · 52.2% · 51.4%
OSWorld-Verified
Control a real computer — click buttons, open apps, fill forms, navigate UIs
83.4% ▲ BEST · 78.7% · 76.2%
GDPval-AA
Realistic knowledge-work tasks — research, writing, analysis (higher score = more productive)
1890 ▲ BEST · 1769 · 1314
Finance Agent v2
End-to-end financial analysis tasks — modeling, forecasting, due diligence
53.9% ▲ BEST · 51.8% · 43.0%
GPQA Diamond
PhD-level questions in chemistry, biology, and physics — written by domain experts
89.9% · 94.3% ▲ BEST
BrowseComp
Find specific hard-to-locate facts by searching and reading the web
84.4% · 85.9% ▲ BEST
ARC-AGI-2
Spot the pattern in visual grids — tasks humans find easy but AI historically fails
58.3% · 77.1% ▲ BEST
// SOURCES: anthropic.com/news/claude-opus-4-8 · openai.com · deepmind.google
BOLD = best score in row among active models
⚠ AI-CURATED — data may be inaccurate or outdated. Verify at official sources.
// ALWAYSBERUSHING · AI INTELLIGENCE FEED