Observability’s three pillars

A widely shared post summarized observability as three pillars — metrics via Prometheus, logs via Loki, and traces via Jaeger — and used that framework to explain system-health troubleshooting. (x.com) A companion interview-style thread applied that split to triaging API latency, showing where logs, metrics and traces each fit in a real incident workflow. (x.com)

Observability is the practice of instrumenting software so engineers can see what a system is doing, and the common split is three data types: metrics, logs, and traces. (opentelemetry.io)

Metrics are the fast-moving gauges: request rate, error rate, memory use, or latency over time. Prometheus, created in 2012 at SoundCloud, is an open-source monitoring and alerting toolkit built to collect, store, and query those time-series measurements. (prometheus.io)

Logs are the line-by-line records a program writes when something happens, like a database timeout or a failed login. Grafana Loki is a log aggregation system that stores and queries those records and, unlike many logging systems, indexes metadata labels rather than the full text of every line. (grafana.com)

Traces are request maps: they follow one user action or application call as it moves across services. Jaeger is an open-source distributed tracing platform that shows how a request traversed a system and where delays or errors appeared along the way. (jaegertracing.io)

That three-part frame has become a simple way to explain incident response in cloud software. A latency spike usually starts in metrics, moves into logs for concrete error messages, and ends in traces to pinpoint which service or database call consumed the time. (prometheus.io, grafana.com, jaegertracing.io)

The tools are separate projects, but the industry has spent the past few years trying to join the signals. OpenTelemetry describes itself as a vendor-neutral framework for generating, collecting, and exporting traces, metrics, and logs, and says shared context can correlate all three across a request path. (opentelemetry.io, opentelemetry.io) That shift also changes how teams instrument code. Jaeger’s current documentation recommends OpenTelemetry instrumentation and software development kits, noting that Jaeger’s older client libraries were retired beginning in 2022. (jaegertracing.io)

Prometheus remains the default metrics reference point in many cloud-native setups. Its OpenMetrics specification says Prometheus has been the default for cloud-native observability since 2015, reflecting how deeply the project’s scrape-and-query model shaped monitoring practice. (prometheus.io)

Loki’s role is different: it keeps the raw evidence. Grafana’s documentation says Loki is designed for cost-effective log storage and query at scale, with labels used to find log streams efficiently when an alert or dashboard shows something has gone wrong. (grafana.com)

Traces close the loop by showing sequence, not just symptoms. OpenTelemetry’s tracing guide says traces provide the full path a request takes through an application, which is why the three-pillar model keeps resurfacing whenever engineers explain how they actually debug production systems. (opentelemetry.io)
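
The metrics-to-logs-to-traces triage path described above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not how Prometheus, Loki, or Jaeger are actually queried: the in-memory stores, the `handle_request` and `triage` helpers, the route names, and the sample data are all hypothetical, and request latency is simplified to the sum of span durations. The one idea it borrows from the text is that a shared trace ID lets an engineer move from one signal to the next.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical in-memory stand-ins for the three backends (not real client APIs).
metrics = defaultdict(list)  # route -> latency samples in ms (the metrics role)
logs = []                    # structured log records (the logs role)
spans = []                   # trace spans (the traces role)


def handle_request(trace_id, route, span_durations_ms, error=None):
    """Record one simulated request in all three signals, sharing one trace_id."""
    for name, duration in span_durations_ms.items():
        spans.append({"trace_id": trace_id, "span": name, "ms": duration})
    # Simplification: total latency is taken as the sum of the span durations.
    metrics[route].append(sum(span_durations_ms.values()))
    if error:
        logs.append({"trace_id": trace_id, "route": route,
                     "level": "error", "msg": error})


def triage():
    """Walk metrics -> logs -> traces and return (route, error message, slowest span)."""
    # 1. Metrics: which route has the worst average latency?
    slow_route = max(metrics, key=lambda r: mean(metrics[r]))
    # 2. Logs: which error records belong to that route?
    errors = [rec for rec in logs
              if rec["route"] == slow_route and rec["level"] == "error"]
    if not errors:
        return slow_route, None, None
    # 3. Traces: which span in the failing request took the longest?
    trace_id = errors[0]["trace_id"]
    slowest = max((s for s in spans if s["trace_id"] == trace_id),
                  key=lambda s: s["ms"])
    return slow_route, errors[0]["msg"], slowest["span"]


# Made-up sample traffic: one fast route, one route with a slow database call.
handle_request("t1", "/login", {"auth-service": 20, "db-query": 15})
handle_request("t2", "/checkout", {"cart-service": 30, "db-query": 900},
               error="database timeout")
handle_request("t3", "/checkout", {"cart-service": 25, "db-query": 40})

print(triage())  # ('/checkout', 'database timeout', 'db-query')
```

In a real deployment each step would be a query against a separate backend, and the trace ID would typically be propagated by instrumentation such as OpenTelemetry rather than passed in by hand.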
