Making AI work through eval hygiene
Anthropic, of all companies, just shipped three quality regressions in Claude Code that its own evals didn’t catch. Think about that. Three regressions over a short six weeks, by the most sophisticated eval shop in AI. If this can happen to Anthropic, it most definitely can happen to you, and it likely will. In a…