Observability: diagnosis, not dashboards

SRE commentary is shifting the goal of observability from generating signals to delivering diagnostic leverage (correlation, causality hints, and ownership mapping) so teams can answer questions about affected workflows rather than wade through more telemetry. Open-source agent runtimes and token tracking through OpenTelemetry (OTel) for production LLM costs were called out as practical additions for LLM-era observability. (x.com/DatalayerIO/status/2044131972755194208)

Observability teams are rewriting the job description: the point is no longer to collect more signals, but to shorten diagnosis. (opentelemetry.io) Observability means instrumenting software so engineers can follow one request across many services with traces, metrics, and logs. OpenTelemetry describes that stack as a vendor-neutral framework for generating, collecting, and exporting telemetry data. (opentelemetry.io)

The practical shift is from dashboards that show red lights to traces that explain which request broke, where it slowed, and which dependency failed. OpenTelemetry’s primer says distributed tracing breaks a request into steps so teams can find root causes in systems that are too complex to reproduce locally. (opentelemetry.io)

That changes what engineers ask for from their tools. Honeycomb describes observability as the ability to answer new questions about a system’s state without shipping new code, and Grafana’s tracing docs center on linking traces with logs, metrics, and profiles during troubleshooting. (honeycomb.io) (grafana.com)

The newer wrinkle is artificial intelligence software, where the expensive unit is often tokens, not just requests or central processing unit time. OpenTelemetry’s generative artificial intelligence semantic conventions define attributes and metrics for token counts, model operations, conversations, and errors. (opentelemetry.io)

Those conventions now reach into agent software, the orchestration layer that lets a model call tools and hand work between steps. OpenTelemetry has separate semantic conventions for generative artificial intelligence agent and framework spans, and a recent proposal adds a low-cardinality `gen_ai.workflow.name` field so teams can tie model activity back to a named workflow. (opentelemetry.io) (github.com)

That is why cost tracking is moving into the same traces engineers already use for outages. OpenTelemetry’s OpenAI client conventions include input-token fields, and tools such as OpenLIT and Langfuse market OpenTelemetry-based tracing that records token usage, latency, and cost for large language model applications. (opentelemetry.io) (openlit.io) (langfuse.com)

The ownership problem is part of the same diagnosis push. Semantic conventions give teams shared names for spans, metrics, logs, and resources, which makes it easier to map a failing workflow to the service, library, or team that emitted it. (opentelemetry.io)

Vendors are still selling dashboards, alerts, and artificial intelligence copilots, but the center of gravity has moved toward correlation. Grafana teaches teams to use logs, metrics, and traces together, while Honeycomb now offers a Model Context Protocol server that lets coding agents query traces, triggers, and service-level objectives in natural language. (grafana.com) (honeycomb.io)

The result is a narrower question with a more useful answer: not whether the system is noisy, but which workflow is broken, what changed, and who owns the next fix. (opentelemetry.io)
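To make the cost-in-traces idea concrete, here is a minimal Python sketch of rolling token usage up to a named workflow. It is an illustration under stated assumptions, not any vendor's implementation: spans are represented as plain dictionaries standing in for exported span attributes, the attribute keys `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens` are assumed to follow OpenTelemetry's generative artificial intelligence conventions, `gen_ai.workflow.name` is the proposed field mentioned above, and the model name, workflow names, and prices are hypothetical.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens. Real prices vary by
# provider and model; "example-model" is a made-up model name.
PRICE_PER_1K = {
    "example-model": {"input": 0.50, "output": 1.50},
}


def span_cost(attrs):
    """Estimate the cost of one model call from its span attributes."""
    price = PRICE_PER_1K[attrs["gen_ai.request.model"]]
    input_tokens = attrs.get("gen_ai.usage.input_tokens", 0)
    output_tokens = attrs.get("gen_ai.usage.output_tokens", 0)
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


def cost_by_workflow(spans):
    """Sum estimated cost per workflow name.

    Spans without a workflow attribute are grouped under "unattributed"
    so that missing ownership is visible rather than silently dropped.
    """
    totals = defaultdict(float)
    for attrs in spans:
        workflow = attrs.get("gen_ai.workflow.name", "unattributed")
        totals[workflow] += span_cost(attrs)
    return dict(totals)


# Hypothetical span attributes, as they might look after export.
SPANS = [
    {
        "gen_ai.workflow.name": "refund-triage",
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 1000,
        "gen_ai.usage.output_tokens": 200,
    },
    {
        "gen_ai.workflow.name": "refund-triage",
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 2000,
        "gen_ai.usage.output_tokens": 0,
    },
    {
        "gen_ai.workflow.name": "doc-search",
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 400,
        "gen_ai.usage.output_tokens": 400,
    },
]

if __name__ == "__main__":
    for workflow, usd in sorted(cost_by_workflow(SPANS).items()):
        print(f"{workflow}: ${usd:.2f}")
```

Grouping by a low-cardinality workflow name rather than by request or conversation ID keeps the number of distinct groups small, which is the property the proposal highlights, and it lets the same rollup answer both "which workflow is expensive" and "which workflow is failing".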
