Engineers embed faithfulness and hallucination evals into OpenTelemetry traces
- Arize Phoenix and its OpenInference instrumentation let engineers attach LLM evaluation results to OpenTelemetry traces instead of storing them separately.
- Phoenix Evals 2.0 says its evaluators are natively instrumented with OpenTelemetry tracing and ship prebuilt hallucination-detection metrics.
- The shift ties quality scores to the same traces that already capture model calls, retrieval, timing, and token usage, giving teams one debugging record for runtime and evals (arize.com) (github.com).
A trace is the step-by-step receipt for one large language model request, and engineers are now attaching quality grades to that same receipt. (arize-ai.github.io) (arize.com) Arize’s Phoenix platform accepts traces over the OpenTelemetry protocol, and its OpenInference layer defines AI-specific fields for model calls, retrieval, tools, inputs, and outputs. (arize.com) (github.com) Phoenix also lets teams export trace datasets, run evaluators on those traces, and log the resulting labels and scores back into the Phoenix user interface. (arize.com)

That changes how LLM debugging works in practice. Instead of checking one system for latency and another for hallucination or relevance, developers can inspect a single run and see what the model did, how long it took, and how it scored. (arize.com 1) (arize.com 2)

Phoenix’s documentation says a trace can capture model calls, retrieval, tool use, and custom logic, while OpenInference shows those traces as linked spans with shared trace IDs and parent-child relationships. (arize.com) (arize-ai.github.io) Arize’s evals package says it includes prebuilt metrics for tasks such as hallucination detection, and that evaluators are natively instrumented through OpenTelemetry tracing. (github.com)

Older observability tools were built to watch servers and networks. Arize’s 2023 tracing write-up said LLM systems need different span types, including evals, agents, embeddings, and model calls, because a few lines of orchestration code can trigger many downstream operations. (arize.com)

The result is a more unified record for AI operations: one trace can carry the request path, the model output, and the score that says whether the answer stayed grounded in its source material. (arize.com 1) (arize.com 2)

That is why engineers are pushing evals into traces. The same telemetry stream that already explains where an LLM call went can now show whether it was fast, expensive, and wrong. (arize.com) (github.com)
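To make the idea concrete, here is a minimal sketch of a trace whose spans share a trace ID, link to a parent, and carry an evaluation result alongside timing data. It uses only the Python standard library and is not the Phoenix, OpenInference, or OpenTelemetry API. The `Span` class, the `child_of` and `attach_eval` helpers, the `eval.faithfulness.*` and `latency_ms` attribute names, the 0.8 threshold, and the word-overlap scorer are all assumptions made for illustration; a real setup would typically use an LLM-based evaluator rather than word overlap.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    """Toy stand-in for a trace span (hypothetical, not the OpenTelemetry SDK class)."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: Dict[str, Any] = field(default_factory=dict)


def child_of(parent: Span, name: str,
             attributes: Optional[Dict[str, Any]] = None) -> Span:
    """Create a span that shares the parent's trace ID and points back to it."""
    return Span(name=name, trace_id=parent.trace_id,
                parent_id=parent.span_id, attributes=dict(attributes or {}))


def toy_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer words that also appear in the retrieved context.

    A crude placeholder for a real hallucination or faithfulness evaluator.
    """
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)


def attach_eval(span: Span, name: str, score: float, threshold: float = 0.8) -> None:
    """Store the eval label and score on the span it grades (hypothetical keys)."""
    span.attributes[f"eval.{name}.score"] = score
    span.attributes[f"eval.{name}.label"] = (
        "faithful" if score >= threshold else "unfaithful"
    )


# One request: a root span with a retrieval child and a model-call child.
context = "Phoenix accepts traces over the OpenTelemetry protocol"
answer = "Phoenix accepts traces over gRPC only"

root = Span(name="rag_request", trace_id=uuid.uuid4().hex)
retrieval = child_of(root, "retrieve", {"retrieved.text": context})
llm = child_of(root, "llm_call", {"output.text": answer, "latency_ms": 840})

# The grade lands on the same span that already records output and latency.
attach_eval(llm, "faithfulness", toy_faithfulness(answer, context))

for span in (root, retrieval, llm):
    print(span.name, span.trace_id[:8], span.parent_id, span.attributes)
```

In this sketch the answer adds words that never appear in the retrieved context, so the model-call span ends up holding its latency and an "unfaithful" label in one place, which is the single debugging record the article describes.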