Claude Sonnet 4.5 Benchmark Results
Comprehensive evaluation results across key AI capability domains, as reported by Anthropic and Artificial Analysis
Claude Intelligence Comparison
Model Generations
Intelligence comparison across Claude model generations using the Artificial Analysis Intelligence Index.
Coding Performance - 82.0%
SWE-bench Verified
Described by Anthropic as the best coding model in the world. Data from SWE-bench Verified with parallel test-time compute.
Intelligence Benchmark - 63
Reasoning & Problem-Solving
Artificial Analysis Intelligence Index composite score measuring reasoning and problem-solving.
Math Competition - 88.0%
AIME 2025
Data from AIME 2025: Advanced mathematics competition performance from Artificial Analysis.
Claude Models Comparison
Performance analysis comparing Claude Sonnet 4.5 with other Claude models using real data from Artificial Analysis.
Claude Intelligence: Model Generation Comparison
Intelligence comparison across Claude model generations using the Artificial Analysis Intelligence Index.
Evolution of intelligence across model generations:
- Claude 3.5 established strong reasoning and coding capabilities with balanced performance.
- Claude 4 delivered major gains in complex reasoning and multi-step problem solving.
- Claude Sonnet 4.5 represents Anthropic's latest advancement in AI capabilities.
Data Source: All performance metrics sourced from independent evaluations by Artificial Analysis.
Coding Performance
Data from SWE-bench Verified: Real-world software engineering on authentic GitHub issues with comprehensive test coverage
Claude Sonnet 4.5 achieves 82.0% on SWE-bench Verified with parallel test-time compute, making it the best coding model in the world. This benchmark tests real GitHub issues with comprehensive test coverage.
Parallel test-time compute means running multiple attempts in parallel and selecting the best result that passes tests.
Source: Anthropic's official announcement (September 29, 2025)
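To make the selection procedure concrete, here is a minimal Python sketch of best-of-N sampling with test-based selection. The helpers generate_patch and run_tests are hypothetical placeholders, and Anthropic's actual harness is not public, so this only illustrates the idea described above.

```python
# Minimal sketch of parallel test-time compute (best-of-N with test selection).
# generate_patch() and run_tests() are hypothetical placeholders, not a real API.
from concurrent.futures import ThreadPoolExecutor

N_SAMPLES = 8  # number of independent attempts sampled in parallel

def generate_patch(issue: str, seed: int) -> str:
    """Placeholder: ask the model for one candidate patch for the issue."""
    raise NotImplementedError

def run_tests(patch: str) -> bool:
    """Placeholder: apply the patch and run the repository's test suite."""
    raise NotImplementedError

def solve_with_parallel_compute(issue: str) -> str | None:
    # Sample several candidate patches concurrently, then keep the first one
    # that passes the tests; more samples raise the odds that at least one works.
    with ThreadPoolExecutor(max_workers=N_SAMPLES) as pool:
        candidates = list(pool.map(lambda s: generate_patch(issue, s), range(N_SAMPLES)))
    for patch in candidates:
        if run_tests(patch):
            return patch
    return None  # no candidate passed
```

The 82.0% figure above reflects this parallel-compute setting rather than a single attempt.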
Claude vs Leading AI Models
Performance comparisons between Claude Sonnet 4.5 and the world's most advanced AI models.
AI vs Human: The Physics Intuition Test
Testing how well AI models predict simple real-world physics scenarios against a human baseline.
How the Physics Intuition Benchmark Works
The Visual Physics Comprehension Test (VPCT) evaluates whether AI models can reason about very simple physics scenarios.
Example Problem
Below is one of the 100 test problems. A ball starts at the top and rolls down the ramps.
Prompt given to both humans and AI models:
"Can you predict which of the three buckets the ball will fall into?"

The Benchmark Setup
- Dataset: The benchmark includes 100 unique problems like the one above, each with different ramp configurations.
- Solutions: The official answers are available here: VPCT dataset on HuggingFace.
- Human Baseline: Humans find these puzzles trivial. In testing, volunteers scored 100% on all problems.
- AI Performance: Vision-language models (GPT, Claude, Gemini, etc.) are evaluated on the same 100 problems (a schematic scoring loop is sketched after this list). Older models performed near random guessing (~33% accuracy), but newer reasoning-focused models have shown significant improvements.
- Claude 4.5 Result: On this benchmark, Claude 4.5 scored 39.8%. This is slightly above random guessing but far below human performance. The result shows signs of emerging physical intuition, yet the model still struggles with problems that are effortless for humans.
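To show how such a benchmark can be scored, here is a hypothetical evaluation loop. ask_model is a placeholder for whichever vision-language model API is being tested, and the field names in the problem records are assumptions rather than the dataset's real schema; the actual VPCT harness may differ.

```python
# Hypothetical sketch of scoring a VPCT-style evaluation.
# ask_model() is a placeholder for any vision-language model API; the field
# names "image" and "answer" are assumptions, not the dataset's real schema.
PROMPT = "Can you predict which of the three buckets the ball will fall into?"

def ask_model(image_path: str, prompt: str) -> int:
    """Placeholder: return the model's predicted bucket (1, 2, or 3)."""
    raise NotImplementedError

def accuracy(problems: list[dict]) -> float:
    # Ask the model about each scenario and count exact matches.
    correct = sum(ask_model(p["image"], PROMPT) == p["answer"] for p in problems)
    return correct / len(problems)

# Reference points from the text above: random guessing over three buckets
# is ~33%, the human baseline is 100%, and Claude 4.5 scored 39.8%.
```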
Why It Matters
VPCT measures physical intuition, a core ability for robotics, planning, and real-world interaction. Progress here is a crucial indicator of whether models are developing grounded intelligence, not just language fluency.
Source: Visual Physics Comprehension Test by Chase Brower
Intelligence Benchmark (Claude vs Leading AI Models)
Composite score from Artificial Analysis measuring reasoning, problem-solving, and knowledge across multiple domains
Claude Sonnet 4.5 achieves a score of 63 on the Artificial Analysis Intelligence Index, demonstrating strong reasoning and problem-solving capabilities across multiple domains.
Source: Artificial Analysis Intelligence Index Leaderboard
Math Competition
Data from AIME 2025: Advanced mathematics competition performance from Artificial Analysis
Claude Sonnet 4.5 achieves 88.0% on AIME 2025, demonstrating strong mathematical reasoning capabilities.
Source: Artificial Analysis AIME 2025 Benchmark Leaderboard
AI Model Speed Comparison
Compare real-time token generation speeds between any two AI models. Watch as they generate 200 tokens (≈150 words) side by side.
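As a rough illustration of how such a comparison can be timed, the sketch below streams a fixed number of tokens and reports tokens per second. stream_tokens is a hypothetical stand-in for a provider's streaming API, and the model identifiers in the usage comment are purely illustrative.

```python
# Rough sketch of timing output speed for a side-by-side model comparison.
# stream_tokens() is a hypothetical stand-in for any provider's streaming API.
import time
from typing import Iterator

def stream_tokens(model: str, prompt: str, max_tokens: int = 200) -> Iterator[str]:
    """Placeholder: yield tokens from the model as they are generated."""
    raise NotImplementedError

def tokens_per_second(model: str, prompt: str, max_tokens: int = 200) -> float:
    start = time.perf_counter()
    count = sum(1 for _ in stream_tokens(model, prompt, max_tokens=max_tokens))
    elapsed = time.perf_counter() - start
    return count / elapsed  # output speed in tokens per second

# Usage (model identifiers illustrative): time both models on the same prompt
# and compare the two rates.
# for m in ("model-a", "model-b"):
#     print(m, round(tokens_per_second(m, "Explain transformers."), 1), "tok/s")
```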