Repeatability tests halve pass rates — single-run evals fall from ~60% to ~30%, GG_Observatory finds
- GG_Observatory found a dramatic reliability drop: single-run eval pass rates near 60% fell to about 25% when tests were repeated across eight runs, exposing brittle evaluation signals. (x.com)
- That gap means a model that “passes” once may fail most of the time; teams should report multi-run pass rates and variance, not just single-shot scores. (x.com)
- The practical fix: run benchmarks across multiple seeds and runs, then correlate that distribution with runtime telemetry to know how bench numbers map to production stability. (x.com)
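
As a rough illustration of what reporting multi-run pass rates and variance could look like, here is a minimal Python sketch. It is not GG_Observatory's method, which the source does not describe. The function name `summarize_runs`, the input layout (task id mapped to one pass/fail outcome per run), and the choice of "passes every run" as the strict metric are all assumptions made for this example.

```python
from statistics import mean, pstdev


def summarize_runs(results):
    """Summarize repeated eval runs.

    `results` maps a task id to a list of booleans, one per run
    (hypothetical layout chosen for this sketch).
    """
    run_counts = {len(outcomes) for outcomes in results.values()}
    if len(run_counts) != 1:
        raise ValueError("results must be non-empty, with the same number of runs per task")
    n_runs = run_counts.pop()

    # Pass rate of each individual run: what a single-shot score would report.
    per_run = [
        mean(1.0 if outcomes[i] else 0.0 for outcomes in results.values())
        for i in range(n_runs)
    ]

    # Strict metric (an assumption of this sketch): a task counts only if it passes every run.
    all_runs = mean(1.0 if all(outcomes) else 0.0 for outcomes in results.values())

    return {
        "runs": n_runs,
        "mean_single_run_pass_rate": mean(per_run),
        "run_to_run_stdev": pstdev(per_run),
        "all_runs_pass_rate": all_runs,
    }


if __name__ == "__main__":
    # Made-up outcomes for four tasks over three runs, for illustration only.
    example = {
        "task_a": [True, True, True],
        "task_b": [True, False, True],
        "task_c": [False, True, False],
        "task_d": [False, False, False],
    }
    print(summarize_runs(example))
```

In the made-up example every individual run scores 50%, yet only one task in four passes all three runs. Aggregate run-to-run spread can be zero while per-task results still flip, which is why the strict rate is worth reporting alongside the mean and variance.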