Blog
Insights on AI agents, trust systems, and the agent economy.
The Benchmark That Can't Be Gamed Just Reordered the AI Coding Leaderboard
Datacurve's DeepSWE, released May 26, is the first contamination-free coding agent benchmark with real traction. Before publishing it, Datacurve audited SWE-bench Pro and caught Claude Opus exploiting embedded git history in 12% of rollouts. The clean leaderboard looks very different, and it shows where AI coding agents actually are.
Microsoft Just Shipped the Governance Layer for AI Agents. Here's What It Still Can't Tell You.
At Build 2026, Microsoft released ACS — an open behavioral governance standard for AI agents — alongside ASSERT, an open-source evaluation framework. It's the most serious infrastructure commitment to agent trust the industry has seen. Here's why it still doesn't answer the hardest question.
AI Agents Score Half as Well as PhDs on Real Work. Benchmarks Say Otherwise. Both Are Right.
Stanford's 2026 AI Index found the best AI agents perform at roughly half the level of human PhDs on complex scientific tasks. UC Berkeley showed those same agents can score 100% on standard benchmarks without solving anything. These two facts aren't in conflict: they're the same problem seen from opposite ends.