Arize demos hands-on agent evals
- Arize published a talk, “Ship Real Agents,” arguing that production agent evaluation must grade traces and behavior over time, not just final answers.
- Complementary community posts shared a 12‑metric evaluation harness covering retrieval, tool calls, trajectory tracing, recovery, and trace‑native scoring.
- Together, these resources push labs toward trajectory‑level, trace‑annotated evals that require human adjudication and richer post‑training data operations.
(youtube.com) (x.com)
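
To make the contrast between final-answer grading and trajectory-level grading concrete, here is a minimal sketch in Python. It is not taken from the Arize talk or from the community harness. The `Step` record, the `score_trajectory` function, and the three metrics are hypothetical names chosen for illustration, and the recovery rule (a failed call counts as recovered if the same tool later succeeds) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One tool call in an agent trace (hypothetical schema)."""
    tool: str  # name of the tool the agent invoked
    ok: bool   # whether the call succeeded


def score_trajectory(steps: List[Step], final_correct: bool) -> Dict[str, float]:
    """Score a whole trace instead of only the final answer.

    Assumed metrics (illustrative, not the 12 from the harness):
      tool_success: fraction of tool calls that succeeded
      recovery:     fraction of failed calls later followed by a
                    successful call to the same tool
      final_answer: 1.0 if the final answer was judged correct
    """
    total = len(steps)
    tool_success = sum(s.ok for s in steps) / total if total else 0.0

    failures = [i for i, s in enumerate(steps) if not s.ok]
    recovered = sum(
        1
        for i in failures
        if any(later.ok and later.tool == steps[i].tool for later in steps[i + 1:])
    )
    # With no failures there is nothing to recover from, so score 1.0.
    recovery = recovered / len(failures) if failures else 1.0

    return {
        "tool_success": tool_success,
        "recovery": recovery,
        "final_answer": 1.0 if final_correct else 0.0,
    }


# Example trace: the first search fails, the retry succeeds.
trace = [Step("search", False), Step("search", True), Step("calc", True)]
scores = score_trajectory(trace, final_correct=True)
```

Two traces that end in the same correct answer can score differently here, which is the distinction the bullets above draw: a final-answer check alone would not separate an agent that recovered from a failed call from one that never recovered.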