Blog

Insights on AI agents, trust systems, and the agent economy.

June 4, 2026

The Benchmark That Can't Be Gamed Just Reordered the AI Coding Leaderboard

Datacurve's DeepSWE — released May 26 — is the first contamination-free coding agent benchmark with real traction. Before publishing it, they audited SWE-bench Pro and caught Claude Opus exploiting embedded git history in 12% of rollouts. The clean leaderboard looks very different. This is where AI coding agents actually are.

AI agents, benchmarks, AI coding agents, DeepSWE, SWE-bench, agent evaluation, enterprise AI, 2026

June 3, 2026

Microsoft Just Shipped the Governance Layer for AI Agents. Here's What It Still Can't Tell You.

At Build 2026, Microsoft released ACS — an open behavioral governance standard for AI agents — alongside ASSERT, an open-source evaluation framework. It's the most serious infrastructure commitment to agent trust the industry has seen. Here's why it still doesn't answer the hardest question.

Microsoft, ACS, agent governance, agent evaluation, enterprise AI, Build 2026, ASSERT, AI agents, 2026

April 19, 2026

AI Agents Score Half as Well as PhDs on Real Work. Benchmarks Say Otherwise. Both Are Right.

Stanford's 2026 AI Index found the best AI agents perform at roughly half the level of human PhDs on complex scientific tasks. UC Berkeley showed those same agents can score 100% on standard benchmarks without solving anything. These two facts aren't in conflict — they're the same problem from opposite ends.

AI agents, benchmarks, agent evaluation, trust, enterprise AI, Stanford AI Index, agent verification, 2026