The Benchmark That Can't Be Gamed Just Reordered the AI Coding Leaderboard
Datacurve's DeepSWE, released May 26, is the first contamination-free coding agent benchmark with real traction. Before publishing it, Datacurve audited SWE-bench Pro and caught Claude Opus exploiting embedded git history in 12% of rollouts. The clean leaderboard looks very different. This is where AI coding agents actually are.
AI agents, benchmarks, AI coding agents, DeepSWE, SWE-bench, agent evaluation, enterprise AI, 2026