Custom evals and edit-rate monitoring
- Engineers building large-language-model apps are increasingly pairing custom evaluation sets with live edit tracking, using user corrections to catch quality drift earlier.
- The practice shifts monitoring from generic benchmark scores to task-specific checks, then flags rising rewrite or retry rates as production warnings.
- Vendors now package evals, tracing, and quality dashboards together as standard observability for AI apps. (anthropic.com)

A model can look fine on a benchmark and still fail users in production, which is why teams are building custom evals and tracking how often outputs get edited. (anthropic.com) (braintrust.dev)

Custom evals are test sets built around one company’s real tasks, like file edits, concision, retrieval accuracy, or policy compliance, instead of broad public leaderboards. Anthropic said it added narrow evals for Claude Code behaviors including concision, file edits, and over-engineering to find failures that generic tests missed. (anthropic.com)

The second signal comes from users themselves. If people frequently rewrite, retry, or override an answer, that behavior can expose degradation before an automated score drops enough to trigger an alert. (atlan.com) (venturebeat.com)

That changes what “monitoring” means for artificial intelligence products. Traditional software dashboards watch uptime and latency; LLM observability adds output quality, drift, and human correction patterns on top. (langfuse.com) (braintrust.dev)

The idea is to treat production like a continuous exam. Teams run domain-specific evals before shipping changes, then watch live telemetry to see whether real users are accepting outputs or fixing them. (langfuse.com) (braintrust.dev)

Tool vendors are now selling that workflow directly. Langfuse pitches dashboards for quality and user behavior analysis, while Braintrust markets observability that traces prompts, scores outputs, and catches silent regressions. (langfuse.com) (braintrust.dev)

The underlying problem is drift: user needs change, prompts evolve, retrieval data goes stale, and model versions behave differently over time. Static test sets can miss that shift if they are not refreshed with real production examples. (venturebeat.com) (christianjmills.com)

Edit-rate monitoring is not a perfect quality score. Some users edit good drafts for style, and some bad answers go uncorrected, so teams usually combine edit signals with automated judges, sampling, and manual review. (braintrust.dev) (arxiv.org)

The result is a more practical feedback loop: test the behavior you actually need, watch where users intervene, and update the eval set with those failures. That is becoming the standard playbook for keeping LLM products reliable after launch. (anthropic.com) (langfuse.com)
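
As a rough illustration of the edit-rate signal described above, the sketch below tracks the share of recent outputs that users rewrote, retried, or overrode, and flags a warning when that share rises well above a known-good baseline. It is a minimal sketch, not any vendor's API: the class name `EditRateMonitor`, the window size, the baseline, and the tolerance are all hypothetical values chosen for illustration.

```python
from collections import deque


class EditRateMonitor:
    """Rolling edit-rate tracker (hypothetical helper, not a vendor API)."""

    def __init__(self, baseline_rate: float, window_size: int = 200,
                 tolerance: float = 0.10) -> None:
        # baseline_rate: assumed edit rate measured during a known-good period.
        # tolerance: assumed absolute rise over baseline that triggers a warning.
        self.baseline_rate = baseline_rate
        self.window_size = window_size
        self.tolerance = tolerance
        # deque(maxlen=...) silently drops the oldest event once the window is full.
        self._events = deque(maxlen=window_size)

    def record(self, was_edited: bool) -> None:
        """Log one output: True if the user rewrote, retried, or overrode it."""
        self._events.append(bool(was_edited))

    def edit_rate(self) -> float:
        """Fraction of outputs in the current window that were edited."""
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def is_warning(self) -> bool:
        """Warn only on a full window, to avoid alerting on a handful of events."""
        window_full = len(self._events) == self.window_size
        return window_full and self.edit_rate() > self.baseline_rate + self.tolerance


# Usage with a deliberately tiny window so the behavior is easy to follow.
monitor = EditRateMonitor(baseline_rate=0.20, window_size=10, tolerance=0.10)
for edited in [False] * 8 + [True] * 2:
    monitor.record(edited)
print(monitor.edit_rate(), monitor.is_warning())

# A burst of edits pushes the rolling rate past baseline + tolerance.
for edited in [True] * 3:
    monitor.record(edited)
print(monitor.edit_rate(), monitor.is_warning())
```

Because some users edit good drafts for style and some bad answers go uncorrected, a flag like this is better treated as a prompt for sampling and manual review than as a quality verdict on its own.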