Repeatability tests halve pass rates — single-run evals fall from ~60% to ~30%, GG_Observatory finds

- GG_Observatory found a dramatic reliability drop: single-run eval pass rates near 60% fell to about 25% when tests were repeated across eight runs, exposing brittle evaluation signals. (x.com)
- That gap means a model that “passes” once may fail most of the time; teams should report multi-run pass rates and variance, not just single-shot scores. (x.com)
- The practical fix: run benchmarks across multiple seeds and runs, then correlate that distribution with runtime telemetry to know how bench numbers map to production stability. (x.com)
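
As a rough illustration of the multi-run reporting described above, here is a minimal Python sketch. It is not GG_Observatory's method: the function name `summarize_runs`, the input layout (task id mapped to one pass/fail outcome per run), and the toy data are all assumptions made for the example. In particular, it assumes the "repeated" figure counts a task as passing only if it passes in every run, which the source does not spell out.

```python
from statistics import mean, pstdev


def summarize_runs(results):
    """Summarize repeated eval runs.

    `results` is a hypothetical layout: {task_id: [bool, bool, ...]},
    one pass/fail outcome per run, same number of runs for every task.
    """
    outcomes = list(results.values())
    if not outcomes:
        raise ValueError("no tasks given")
    n_runs = len(outcomes[0])
    if n_runs == 0 or any(len(o) != n_runs for o in outcomes):
        raise ValueError("every task needs the same, non-zero number of runs")

    # Pass rate of each run on its own: what a single-shot score would report.
    per_run = [
        mean([1.0 if o[i] else 0.0 for o in outcomes]) for i in range(n_runs)
    ]
    # Assumed definition of the repeated score: a task counts only if it
    # passes in every run.
    pass_all = mean([1.0 if all(o) else 0.0 for o in outcomes])

    return {
        "n_runs": n_runs,
        "mean_single_run_pass_rate": mean(per_run),
        "stdev_across_runs": pstdev(per_run),  # spread between runs
        "pass_all_runs_rate": pass_all,
    }


# Toy data, invented for illustration only: 4 tasks, 3 runs each.
example = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [False, True, True],
    "t4": [False, False, False],
}
print(summarize_runs(example))
```

Reporting the single-run mean, the spread between runs, and the all-runs rate side by side makes the gap between a one-off pass and a repeatable pass visible in a single summary.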

Shared from Scout - Be the smartest in the room.