Evals being treated as infra

An industry thread argues that evaluation suites should be built as production infrastructure, used continuously to measure, optimize, and monitor models, prompts, and retrieval components rather than produced as one-off reports. The same conversation in product-ML circles stresses combining offline benchmarks, human rubrics, and production telemetry into a single eval pipeline. (x.com)

In artificial intelligence products, an “eval” is a test: give a model an input, grade the output, and track whether it still meets the bar after every change. (developers.openai.com)

That framing has shifted in 2026 from occasional benchmarking toward continuous operations. OpenAI now pitches “eval-driven system design” and an Evals API for building, running, and reviewing tests as part of application development, not just model comparison. (developers.openai.com)

Anthropic made the same point in a January 9, 2026 engineering post on agents, saying teams use evals to catch failures before they hit users and that the value of those tests “compounds over the lifecycle” of a system. Anthropic also says agent evals often need multiple trials because outputs vary from run to run. (anthropic.com)

The operational shift comes from how these systems are built. A chatbot or agent is usually a stack of parts (model calls, retrieval, tool use, and formatting), so teams increasingly test each part separately instead of treating the model as a single black box. (docs.langchain.com)

That is why product teams now split evals into two lanes. Offline evals use curated datasets before launch to benchmark versions and catch regressions, while online evals watch live traffic after launch for drift, safety issues, and edge cases that never showed up in the test set. (docs.langchain.com)

The newer argument in industry circles is that those lanes should connect into one pipeline. Production failures become labeled examples, those examples go back into offline datasets, and the next prompt, retrieval, or model change gets tested against the updated set before rollout. (docs.langchain.com)

Tooling vendors are building around that workflow. Braintrust says its platform helps teams “measure, evaluate, and improve AI in production” with model comparison, prompt iteration, regression checks, and real user data in one system. (braintrust.dev)

Weights & Biases takes a similar line with Weave. Its documentation centers eval runs around a dataset plus scoring functions, and its public materials now pair labeled-dataset evaluation with online monitoring for production traffic. (docs.wandb.ai)

Arize’s Phoenix docs make the split explicit: one set of tools for evaluation traces and another for “continuous monitoring” with alerting and threshold triggers on production traffic. Phoenix’s open-source repository also describes the product as an observability and evaluation platform, not just a benchmark harness. (arize.com, github.com)

The practical effect is that evals are being treated less like a report card and more like logging, testing, and monitoring infrastructure. In that setup, a prompt edit, a retrieval tweak, or a model swap is not finished when it looks better in a demo; it is finished when the eval pipeline says it still works in production. (developers.openai.com, docs.langchain.com)
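
For readers who want the loop in concrete terms, here is a minimal Python sketch of the workflow described above: grade outputs against a dataset, repeat each case over several trials, gate a change on the result, and fold production failures back into the offline set. Every name in it (`Case`, `pass_rate`, `gate`, `add_production_failures`), the exact-match grader, and the 0.02 tolerance are illustrative assumptions, not any vendor's API or recommended settings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    """One labeled example: an input and the output we expect (hypothetical schema)."""
    prompt: str
    expected: str


def exact_match(output: str, case: Case) -> bool:
    # Simplest possible grader; real suites often use rubrics or model-based scoring.
    return output.strip().lower() == case.expected.strip().lower()


def pass_rate(
    system: Callable[[str], str],
    cases: List[Case],
    grader: Callable[[str, Case], bool] = exact_match,
    trials: int = 3,
) -> float:
    """Fraction of graded runs that pass, with each case run `trials` times."""
    if not cases or trials < 1:
        raise ValueError("need at least one case and one trial")
    passed = 0
    for case in cases:
        for _ in range(trials):  # repeated because outputs can vary run to run
            if grader(system(case.prompt), case):
                passed += 1
    return passed / (len(cases) * trials)


def gate(candidate_rate: float, baseline_rate: float, tolerance: float = 0.02) -> bool:
    """Allow a change only if it does not regress beyond the assumed tolerance."""
    return candidate_rate >= baseline_rate - tolerance


def add_production_failures(dataset: List[Case], failures: List[Case]) -> List[Case]:
    """Return a new offline dataset with previously unseen production failures appended."""
    seen = {case.prompt for case in dataset}
    merged = list(dataset)
    for case in failures:
        if case.prompt not in seen:
            merged.append(case)
            seen.add(case.prompt)
    return merged
```

In this sketch, `system` stands in for whatever is under test (a prompt, a retrieval step, or a full agent), so the same gate can run before each change ships.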
