Eduardopto: track LLM retry rates

- An observability practitioner using the handle Eduardopto argued teams should track large language model retries and refusals, not just latency and error dashboards.
- The post singled out three signals: retry counts, refusal frequency, and changes in top-token probabilities across repeated calls to the same prompt.
- Those metrics target silent drift in agent pipelines that standard traces can miss. (opentelemetry.io)

Large language models can fail without crashing: they answer differently, refuse unexpectedly, or need hidden retries before a workflow completes. (opentelemetry.io) (www.datadoghq.com)

That is the point behind a post from Eduardopto, who said teams should watch retry rates and refusal patterns instead of relying only on standard health checks. The post framed those signals as a way to catch failures that look “successful” in ordinary dashboards. (x.com)

The concrete checklist was short: count how often a call is retried, measure how often the model refuses, and compare top-token probabilities across repeated runs of the same prompt. Those are all signals of instability even when latency, uptime, and HTTP status codes look normal. (x.com) (developers.openai.com)

Top-token probabilities are the model’s ranked guesses for the next word, like a weather forecast for text. OpenAI’s logprobs output exposes those token-level probabilities, which makes it possible to compare how confident or shaky a model was from one run to the next. (developers.openai.com) (responsible-ai-developers.googleblog.com)

Refusals are also now explicit enough to measure in some APIs. Anthropic’s Claude documentation says streaming classifiers can end a response with `stop_reason: "refusal"`, which gives developers a machine-readable event instead of only a natural-language apology. (platform.claude.com)

That matters most in agent systems, where one model call can trigger tools, planners, validators, and follow-up prompts. A single refusal or low-confidence branch can force a retry upstream while the overall request still returns a final answer to the user. (www.langchain.com) (www.comet.com)

Mainstream observability stacks already track duration, token usage, and traces, but standards are still catching up on model-behavior signals.
OpenTelemetry’s generative artificial intelligence semantic conventions are still marked experimental, and the current public spec focuses on spans, events, and core metrics rather than a standard retry-or-refusal dashboard. (opentelemetry.io) (github.com)

The practical claim in Eduardopto’s post is that non-deterministic failures leave statistical fingerprints before they become outages. If retries rise, refusals spike, or token-confidence swings widen for the same prompt, the model may have drifted even when the service still looks healthy. (x.com) (www.datadoghq.com)

For teams shipping agents, the message is operational, not philosophical: measure the extra attempts, measure the no’s, and measure how much the model hesitates before it speaks. (x.com)
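
As a rough illustration, the post's three checks (retries, refusals, and top-token probability swings) could be aggregated along the following lines. This is a minimal sketch, not the method described in the post: the `CallRecord` schema, the `BehaviorStats` class, and the `"greeting"` prompt identifier are hypothetical names invented here, and the sketch assumes the calling code has already recorded the attempt count, the API's reported stop reason, and the log-probability of the first generated token for each call.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    """One logical model call. Hypothetical schema, not taken from any SDK."""
    prompt_id: str      # stable identifier for the prompt being repeated
    attempts: int       # total attempts, counting the first one
    stop_reason: str    # value reported by the API, e.g. "refusal"
    top_logprob: float  # log-probability of the first generated token


@dataclass
class BehaviorStats:
    """Aggregates the three signals over a batch of recorded calls."""
    records: list = field(default_factory=list)

    def add(self, record: CallRecord) -> None:
        self.records.append(record)

    def retry_rate(self) -> float:
        """Fraction of calls that needed more than one attempt."""
        if not self.records:
            return 0.0
        retried = sum(1 for r in self.records if r.attempts > 1)
        return retried / len(self.records)

    def refusal_rate(self) -> float:
        """Fraction of calls whose reported stop reason was a refusal."""
        if not self.records:
            return 0.0
        refused = sum(1 for r in self.records if r.stop_reason == "refusal")
        return refused / len(self.records)

    def confidence_spread(self, prompt_id: str) -> float:
        """Max minus min top-token probability across runs of one prompt."""
        probs = [
            math.exp(r.top_logprob)  # convert log-probability to probability
            for r in self.records
            if r.prompt_id == prompt_id
        ]
        if len(probs) < 2:
            return 0.0  # a spread needs at least two runs to compare
        return max(probs) - min(probs)


# Illustrative data only; these values are made up for the example.
stats = BehaviorStats()
stats.add(CallRecord("greeting", attempts=1, stop_reason="end_turn", top_logprob=math.log(0.9)))
stats.add(CallRecord("greeting", attempts=3, stop_reason="end_turn", top_logprob=math.log(0.6)))
stats.add(CallRecord("greeting", attempts=1, stop_reason="refusal", top_logprob=math.log(0.5)))
print(stats.retry_rate(), stats.refusal_rate(), stats.confidence_spread("greeting"))
```

In practice a team would likely emit these numbers as time-series metrics and alert on changes against a baseline rather than on absolute values, since some retries and refusals are expected even from a healthy model.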

Shared from Scout - Be the smartest in the room.