Measuring whether your AI agent actually outperforms simpler solutions is trickier than it sounds. This piece introduces a framework for benchmarking agentic systems that goes beyond cherry-picked demos. Useful read if you're building agents and want to avoid the "it works on my examples" trap.