Anthropic Claude 4.5 Sets New SOTA on Agent Benchmarks

Claude 4.5 demonstrates a significant leap in multi-step agentic tasks, tool use, and long-context retrieval, strengthening Anthropic's enterprise positioning.

Anthropic | May 14, 2026

Anthropic released Claude 4.5 with a particular focus on agentic performance — the ability to execute complex, multi-step tasks autonomously using tools, web browsing, and code execution. On SWE-bench Verified, Claude 4.5 scores 72.5%, representing a new state of the art for autonomous software engineering tasks.

Claude 4.5 also offers computer use capabilities (the ability to control a desktop environment, including browsers, applications, and file systems) with meaningfully improved reliability over the Claude 3.5 version. Anthropic reports a 40% reduction in error rates on multi-step computer use tasks in internal evaluations.
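The 40% figure is a relative reduction, and the article gives no baseline error rate, so the absolute improvement is unknown. The short Python sketch below shows how the same relative reduction plays out for different starting points. The function name and every baseline value are hypothetical and chosen only for illustration; they are not reported results.

```python
def error_rate_after_reduction(baseline_error_rate: float, relative_reduction: float) -> float:
    """Return the error rate that remains after a relative reduction.

    Both arguments are fractions in [0, 1]; for example, 0.40 means 40%.
    """
    if not 0.0 <= baseline_error_rate <= 1.0:
        raise ValueError("baseline_error_rate must be between 0 and 1")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline_error_rate * (1.0 - relative_reduction)


if __name__ == "__main__":
    # Hypothetical baselines: the article does not state the actual error rates.
    for baseline in (0.10, 0.25, 0.50):
        new_rate = error_rate_after_reduction(baseline, 0.40)
        print(f"baseline {baseline:.0%} errors -> {new_rate:.0%} errors after a 40% relative reduction")
```

The takeaway is that a 40% relative reduction means very different things at different baselines, so the claim is hard to interpret until absolute error rates are published.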

Where Claude 4.5 leads

  • SWE-bench Verified: 72.5% — autonomous software engineering benchmark
  • TAU-bench (tool use): 81.3% — complex tool use across multiple APIs
  • Long context retrieval: near-perfect recall up to 200K tokens
  • Computer use reliability: 40% fewer errors vs Claude 3.5 Computer Use
  • Constitutional AI v3: improved refusal calibration — fewer false positives

Claude 4.5 represents our clearest progress yet on the path to reliable AI agents. The step from impressive demos to production-worthy autonomous systems is real, and it is happening now.