Nirixa launches agent-monitoring beta
- Nirixa released a public beta SDK for AI observability on April 11, adding agent tracing so multi-step runs appear as one monitored workflow.
- The beta groups agent runs into a single trace with aggregated cost, token, and latency totals, plus a waterfall view for debugging.
- The launch lands as agent observability tools push tracing and automated evaluation into production workflows. (opensearch.org)
Nirixa’s new public beta is aimed at a basic problem in AI software: when an agent fails, teams often see only the final answer, not the steps that broke. Nirixa says its SDK now groups multi-step agent runs into a single trace developers can inspect. (pypi.org)

The company’s PyPI release page shows version 2.2.0 published on April 11, 2026. It describes “Agent Tracing” that rolls a multi-step run into one observable trace with total cost, token, and latency figures. (pypi.org)

That same release says the trace includes a waterfall view in Nirixa’s dashboard. In practice, that means developers can see where time and money were spent across a chain of model calls and tool use. (pypi.org)

Nirixa’s website pitches the broader platform as “AI Observability & Cost Intelligence” for teams using OpenAI, Anthropic, Gemini, Groq, and other model providers. It says the service tracks token costs, prompt stability, hallucination risk, and latency in real time. (nirixa.in)

The company also says teams can break down spend by feature, endpoint, user, model, and prompt version. Its site advertises p50, p95, and p99 latency tracking, prompt diffs, and configurable risk thresholds for outputs. (nirixa.in)

Nirixa is entering a crowded part of the AI tooling market. OpenSearch published an “Agent Health” framework in March that combines trace observability, automated benchmarking, and LLM-as-a-judge evaluation for AI agents. (opensearch.org)

Other vendors are making similar arguments: that agent systems need more than ordinary application logs because they make variable, multi-step decisions across models and tools. Langfuse’s documentation, for example, describes LLM-as-a-judge as a way to score live outputs or full traces against a rubric at production scale. (langfuse.com)

Researchers have also been warning that automated judging needs careful controls. A 2024 survey on arXiv says LLM-as-a-judge can scale evaluation, but reliability, bias, and standardization remain open problems. (arxiv.org)

For Nirixa, the pitch is straightforward: if AI agents are going to act more like software workers, companies want a replayable record of what each one did, how long it took, and what it cost. (pypi.org) (nirixa.in)