Use rubric evals and human review

- Anthropic and OpenAI have both pushed a clearer pattern for AI teams in 2025 and 2026: use rubric-based evals, then calibrate them with humans.
- The telling detail is operational, not philosophical: OpenAI says to “log everything,” mine failures into eval cases, and keep humans aligned with graders.
- This matters because agent failures compound across turns, so teams now treat review data as training fuel, not just a support queue.

AI evals are turning into product infrastructure. Not a side dashboard. Not a nice-to-have before launch. The shift is that teams building agents now score outputs against explicit rubrics, then use human reviewers to check whether those scores actually match reality. That closes a loop that used to stay broken for months: users hit weird failures, support teams noticed them, but the model never really learned from them.

### What is a rubric eval?

A rubric eval is just a structured scorecard for model behavior. You define the dimensions that matter for your task (factual accuracy, policy compliance, instruction following, formatting, tool use, whatever), then grade outputs against those dimensions instead of relying on a vague thumbs-up. OpenAI’s eval docs frame evals as structured tests for model performance, and its reinforcement fine-tuning docs go one step further: rubrics can be turned into graders that score factual accuracy, functional success, or policy compliance. (developers.openai.com)

### Why isn’t automation enough?

Because models are slippery. A grader can tell you whether JSON parsed, or whether an answer mentioned a forbidden term, but it can miss the thing a human instantly notices: the answer is technically formatted right and still misleading. OpenAI’s guidance is blunt here: combine metrics with human judgment, maintain agreement [...] review consistent, and the human keeps the rubric honest. (developers.openai.com)

### Why is this harder for agents?

Agents don’t just answer once. They call tools, change state, make intermediate decisions, and keep going. Anthropic’s January 9, 2026 guide makes the core problem clear: in multi-turn systems, mistakes propagate and compound. A bad retrieval step can poison the next tool call, which can poison the final answer. That means [...] trajectory. (anthropic.com)

### So where do humans come in?
Humans show up where the rubric is underspecified, where the stakes are high, or where the model found a weird edge case. In OpenAI’s November 4, 2025 “Self-Evolving Agents” cookbook, the whole point is that autonomous systems plateau when humans are only fixing edge cases manually. The proposed fix is a repeatable loop: collect structured [...] into prompt optimization and retraining logic. (developers.openai.com)

### What happens to low-scoring outputs?

The useful ones become training assets. OpenAI’s best-practices guide says to log everything, mine logs for good eval cases, and grow the eval set over time. That is the practical trick: a failed answer is not just a bug report. It becomes a labeled example, a regression test, or a [...] like regular maintenance. (developers.openai.com)

### Why does auditability matter so much?

Because a lot of these use cases are not toy chatbots. The healthcare retraining example in OpenAI’s cookbook is built around regulated document drafting, where accuracy and compliance both matter and teams need traceable feedback. Rubrics help because they preserve the reason an output failed. You are not just stor[...]ar on criterion 4.” That makes later policy updates and retraining much easier to justify. (developers.openai.com)

### Is this replacing fine-tuning or enabling it?

Enabling it. OpenAI’s docs explicitly tie evals to one of the few reliable ways to improve an LLM application through fine-tuning, and its reinforcement fine-tuning guide describes training runs where graders score sampled responses and those scores drive updates. The catch [...] signal itself. (developers.openai.com)

### Bottom line?

The new pattern is simple: write a rubric, automate what you can, send ambiguous or high-risk cases to humans, and turn the reviewed failures back into evals and training data. That is how teams are moving from “the model messed up again” to a real improvement loop, with receipts.
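To make the loop concrete, here is a minimal Python sketch of the pattern described above: grade an output against a weighted rubric, route low-scoring or high-risk cases to a human, turn a reviewed failure into a regression case, and track how often the grader agrees with reviewers. Everything in it is an assumption for illustration: the names `Criterion`, `grade`, `needs_human`, `to_eval_case`, and `agreement`, the three example checks, and the 0.75 threshold are hypothetical and are not taken from OpenAI’s or Anthropic’s tooling.

```python
import json
from dataclasses import dataclass
from typing import Callable


def _as_object(output: str):
    """Return the parsed JSON object, or None if output is not a JSON object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # automated check on the raw model output
    weight: float = 1.0


@dataclass
class Result:
    output: str
    passed: dict[str, bool]  # per-criterion verdicts, kept so failures stay traceable
    score: float             # weighted fraction of criteria passed, 0.0 to 1.0


# Hypothetical rubric for a support bot that must answer in JSON.
RUBRIC = [
    Criterion("json_object", lambda o: _as_object(o) is not None),
    Criterion("no_forbidden_term", lambda o: "guaranteed" not in o.lower(), weight=2.0),
    Criterion("cites_source", lambda o: "source" in (_as_object(o) or {})),
]


def grade(output: str, rubric: list[Criterion]) -> Result:
    """Score one output against every criterion in the rubric."""
    passed = {c.name: c.check(output) for c in rubric}
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if passed[c.name])
    return Result(output, passed, earned / total)


def needs_human(result: Result, high_risk: bool, threshold: float = 0.75) -> bool:
    """Route to a reviewer when the case is high-risk or the score is low."""
    return high_risk or result.score < threshold


def to_eval_case(prompt: str, result: Result, reviewer_note: str) -> dict:
    """Turn a reviewed failure into a regression case that records why it failed."""
    failed = [name for name, ok in result.passed.items() if not ok]
    return {
        "prompt": prompt,
        "bad_output": result.output,
        "failed_criteria": failed,
        "reviewer_note": reviewer_note,
    }


def agreement(grader_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the automated grader and the human agree."""
    matches = sum(g == h for g, h in zip(grader_labels, human_labels))
    return matches / len(grader_labels)


bad = grade('{"answer": "Returns are guaranteed.", "source": "policy.pdf"}', RUBRIC)
if needs_human(bad, high_risk=False):
    case = to_eval_case("Can I return this?", bad, "Overpromises; policy has exceptions.")
    print(case["failed_criteria"])
```

The design choice worth noting is that `Result.passed` keeps the verdict for each criterion rather than only the final score, so a stored failure still says which criterion it broke. Raw agreement is the simplest possible calibration signal; a real pipeline would likely want a chance-corrected statistic and a larger reviewed sample.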
