Use Samson-style traces to correlate production model evals with company-specific benchmarks — Akshay Ramaswamy
- Akshay Ramaswamy at Elise AI recommends running company-specific benchmarks on each new model release and validating the results against runtime telemetry, so that benchmark scores reflect production behavior. (x.com)
- He ties benchmark outputs to concrete runtime signals (latency, error rates, and observed hallucination counts) so teams can reject models that look good in the lab but fail in production. (x.com)
- The operational step is to stitch evaluation results into distributed traces and logs so that every failed production sample maps back to a benchmark slice. (x.com)
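
The stitching step can be sketched in a few lines of Python. This is a minimal illustration, not the workflow used at Elise AI: the `ProdSample` record, the `bench_slice` tag, the slice names, and the thresholds are all hypothetical, and it assumes each production sample has already been exported from the tracing system with a label naming the benchmark slice it belongs to. The sketch groups samples by slice, computes the runtime signals mentioned above, and flags slices where the benchmark looks healthy but production does not.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProdSample:
    """One production request, as exported from traces/logs (hypothetical schema)."""
    trace_id: str       # ID copied from the distributed trace
    bench_slice: str    # assumed tag linking the sample to a benchmark slice
    latency_ms: float
    errored: bool
    hallucinated: bool


def summarize_by_slice(samples):
    """Group production samples by benchmark slice and compute runtime signals."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s.bench_slice].append(s)

    summary = {}
    for name, group in grouped.items():
        n = len(group)
        summary[name] = {
            "count": n,
            "mean_latency_ms": sum(s.latency_ms for s in group) / n,
            "error_rate": sum(s.errored for s in group) / n,
            "hallucinations": sum(s.hallucinated for s in group),
            # Trace IDs let an engineer jump from a slice back to the failing requests.
            "failed_trace_ids": [
                s.trace_id for s in group if s.errored or s.hallucinated
            ],
        }
    return summary


def find_disagreements(bench_pass_rate, summary, min_pass=0.9, max_error_rate=0.05):
    """Return slices that pass the benchmark but exceed the production error budget.

    The thresholds are illustrative defaults, not recommended values.
    """
    return sorted(
        name
        for name, stats in summary.items()
        if bench_pass_rate.get(name, 0.0) >= min_pass
        and stats["error_rate"] > max_error_rate
    )


# Made-up example data: two slices, four production samples.
samples = [
    ProdSample("t1", "scheduling", 120.0, False, False),
    ProdSample("t2", "scheduling", 480.0, True, False),
    ProdSample("t3", "billing", 95.0, False, False),
    ProdSample("t4", "billing", 105.0, False, True),
]
bench_pass_rate = {"scheduling": 0.97, "billing": 0.80}

summary = summarize_by_slice(samples)
flagged = find_disagreements(bench_pass_rate, summary)
print(flagged)
```

Here the `scheduling` slice is flagged because it scores well on the benchmark while half of its production samples errored, which is the "good in the lab, bad in production" case a release gate would want to catch. The `billing` slice is not flagged by this check because its benchmark score is already below the pass bar, although its `failed_trace_ids` still point to the sample with an observed hallucination.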