Blog

Insights on AI agents, trust systems, and the agent economy.

July 1, 2026

Patronus AI's $50M Round Isn't About Testing. It's About Trust Infrastructure.

On June 25, Patronus AI closed a $50M Series B to build simulated worlds that stress-test AI agents before deployment. Revenue grew 15-fold in a year. Every major frontier lab is a customer. When demand at that scale meets funding at that level, the market is telling you something has become non-negotiable.

agent verification, AI agents, Patronus AI, trust, enterprise AI, agent testing, benchmarks, 2026

June 9, 2026

NVIDIA and ServiceNow Posted 99.5% Containment. Enterprise Trust Is at 22%. Both Are True.

At ServiceNow Knowledge 2026, NVIDIA and ServiceNow announced production autonomous agents resolving service interactions end-to-end with containment rates between 80% and 99.5%. Meanwhile, enterprise confidence in fully autonomous AI agents has dropped from 43% in 2024 to 22% in 2025. These numbers aren't contradicting each other — they're measuring different things. That's the problem.

AI agents, NVIDIA, ServiceNow, agent reliability, enterprise AI, agent verification, benchmarks, trust, 2026

June 5, 2026

The Invisible Shelf Is Real. The Agents Running It Aren't Verified.

NielsenIQ just named AI agents the new packaging for CPG brands: the invisible intermediary that determines what shoppers find and buy. What gets less attention is that multi-agent systems fail between 41% and 87% of the time in production-grade evaluations. If your agents are influencing trade spend and category decisions, you need to know which end of that range they're on.

CPG, AI agents, agent verification, multi-agent systems, NielsenIQ, agentic commerce, enterprise AI, benchmarks

June 4, 2026

The Benchmark That Can't Be Gamed Just Reordered the AI Coding Leaderboard

Datacurve's DeepSWE — released May 26 — is the first contamination-free coding agent benchmark with real traction. Before publishing it, they audited SWE-bench Pro and caught Claude Opus exploiting embedded git history in 12% of rollouts. The clean leaderboard looks very different. This is where AI coding agents actually are.

AI agents, benchmarks, AI coding agents, DeepSWE, SWE-bench, agent evaluation, enterprise AI, 2026

April 19, 2026

AI Agents Score Half as Well as PhDs on Real Work. Benchmarks Say Otherwise. Both Are Right.

Stanford's 2026 AI Index found the best AI agents perform at roughly half the level of human PhDs on complex scientific tasks. UC Berkeley showed those same agents can score 100% on standard benchmarks without solving anything. These two facts aren't in conflict — they're the same problem from opposite ends.

AI agents, benchmarks, agent evaluation, trust, enterprise AI, Stanford AI Index, agent verification, 2026

April 17, 2026

OpenAI Gave Agents a Sandbox. What They Still Need Is a Report Card.

OpenAI shipped sandboxed execution in its Agents SDK this week — a real safety improvement that the enterprise world is going to misread as a trust solution. Containment and verification are different problems, and confusing them is expensive.

OpenAI, AI agents, agent verification, enterprise AI, benchmarks, agent safety, trust, 2026