Comet ML promotes Opik evaluation framework

- Comet promoted Opik on May 20 as an open-source framework for tracing, evaluating and monitoring LLM applications with heuristic checks and judge-based scoring.
- Opik’s documentation lists 30-plus metrics, including IsJson, RegexMatch, Hallucination and Answer Relevance, plus G-Eval presets and custom metric support.
- Developers can configure scoring rules in Opik’s UI or API and run them on production traces.

Comet has been promoting Opik as an evaluation layer for large language model applications, pairing rule-based checks with model-graded scoring and production telemetry.

Opik is an open-source project built by Comet, according to the project’s GitHub repository and product documentation. The framework is positioned for teams building LLM applications, retrieval systems and agent workflows that need both tracing and evaluation. Comet’s product pages say Opik can score traces from development, testing or production and surface errors across thousands of runs.

### What exactly is Opik being promoted as?

Comet’s GitHub repository describes Opik as a platform to “debug, evaluate, and monitor” LLM applications, RAG systems and agentic workflows. The company’s product page uses similar language and says the platform combines end-to-end observability with evaluation tooling.

The documentation splits evaluation into two tracks. Opik says users can create test suites with natural-language assertions for pass-fail checks, or use datasets and metrics to score outputs quantitatively across many traces. (github.com)

### Which metrics are built in?

Opik’s metrics overview says the framework includes two main categories: heuristic metrics and “LLM as a Judge” metrics. (github.com)

The heuristic side covers deterministic checks such as exact matching, regex validation and similarity scoring. The same page lists IsJson, which validates whether output can be parsed as JSON, and RegexMatch, which checks whether output matches a specified regular expression pattern. (comet.com)

Opik’s judge-based metrics are aimed at semantic or task-specific scoring. The documentation lists built-in evaluators including Answer Relevance, Context Precision, Context Recall, Agent Task Completion and other quality checks. Comet’s product page says the platform offers more than 30 metrics for answer relevance, context precision, task completion and hallucination.
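
To make the deterministic side concrete, here is a minimal plain-Python sketch of what a JSON-validity check and a regex check do. The function names `is_json` and `regex_match` are hypothetical helpers written for illustration; they only mirror the behavior described for IsJson and RegexMatch and are not Opik's implementation or API. The choice of `re.search` (match anywhere in the output) is also an assumption.

```python
import json
import re


def is_json(output: str) -> float:
    """Return 1.0 if the output parses as JSON, else 0.0 (illustrative helper)."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0


def regex_match(output: str, pattern: str) -> float:
    """Return 1.0 if the pattern is found anywhere in the output, else 0.0."""
    # Assumption: "matches" means a search hit anywhere in the text.
    return 1.0 if re.search(pattern, output) else 0.0
```

Checks like these need no model call, so they are cheap to run on every output; a call such as `regex_match("Order ID: 12345", r"\d{5}")` simply reports whether the expected pattern is present.
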
(comet.com)

### How do hallucination and relevance scoring work?

Opik’s online evaluation rules documentation says the platform can automatically score logged LLM calls with LLM-as-a-Judge metrics. The built-in production rules include Hallucination, Moderation and Answer Relevance, and users can apply those rules to production traces through the UI or REST API. (comet.com)

The hallucination metric documentation says Opik uses an LLM judge with a prompt template and, by default, uses OpenAI’s gpt-4o model unless the user changes the model through LiteLLM-supported options. Comet’s broader LLM-as-a-judge guide says the method is meant to automate checks that would otherwise require human review and can flag hallucinations or off-topic responses at scale. (comet.com)

### Where does G-Eval fit in?

Opik’s documentation includes a dedicated G-Eval metric. The docs say Opik ships preset G-Eval judges for common use cases and that each preset inherits from a GEval class with shared parameters such as model and temperature. The optimization documentation says teams can use Equals when there is a single correct answer, or G-Eval when answers vary and a model needs to score quality. (comet.com)

That places G-Eval in the part of the stack where developers want structured model-based grading rather than strict string matching. (comet.com)

### Can teams extend Opik for domain-specific evaluation?

Opik’s custom metric documentation says users can define their own metrics by subclassing the BaseMetric class and implementing a scoring method. The docs also say developers who need an LLM-as-a-Judge metric can use G-Eval or build one from scratch. Comet’s evaluation overview says custom metrics are intended for domain-specific evaluation, while product materials say production traces can be monitored in real time with alerts if interactions fail test criteria.
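
The subclass-and-score pattern described for custom metrics can be sketched without the library. In the example below, `BaseMetric` and `ScoreResult` are hypothetical stand-ins defined locally so the snippet runs on its own; they are not Opik's classes, and the real signatures may differ. `ContainsDisclaimer` is an invented domain-specific check used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    """Hypothetical stand-in for a metric result (not Opik's class)."""
    name: str
    value: float
    reason: str = ""


class BaseMetric:
    """Hypothetical stand-in for a metric base class (not Opik's class)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def score(self, output: str, **kwargs) -> ScoreResult:
        raise NotImplementedError


class ContainsDisclaimer(BaseMetric):
    """Domain-specific check: does the answer include a required phrase?"""

    def __init__(self, phrase: str, name: str = "contains_disclaimer") -> None:
        super().__init__(name)
        self.phrase = phrase.lower()

    def score(self, output: str, **kwargs) -> ScoreResult:
        # Case-insensitive substring test; 1.0 means the phrase is present.
        found = self.phrase in output.lower()
        return ScoreResult(
            name=self.name,
            value=1.0 if found else 0.0,
            reason="phrase found" if found else "phrase missing",
        )
```

A team could instantiate this as `ContainsDisclaimer("not financial advice")` and call `score()` on each model output; returning a named result with a reason keeps failed checks easy to trace back to the rule that produced them.
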
(comet.com)

A May 20 social post cited Comet and OpenLedger as examples of linking runtime telemetry to evaluation metrics. OpenLedger’s public site describes its platform as infrastructure for AI models and agents; the comparison with Comet was the post’s own framing. (comet.com)

### How does telemetry connect to evaluation in practice?

Opik’s evaluation workflow starts with a production trace, according to Comet’s docs. The platform says users can inspect a trace’s span tree, turn a failure into a test case, update the agent, and rerun the test suite to check for regressions.

The production rules documentation says scoring rules can be attached to logged traces with a sampling rate, selected model, prompt and score definition. (comet.com)

Comet’s product page also says Opik tracks token usage and model cost, alongside real-time monitoring and alerts on production interactions. (comet.com)
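
The idea of attaching a scoring rule to logged traces with a sampling rate can be sketched in plain Python. Everything here is an assumption made for illustration: `apply_rule`, the trace dictionary shape, and the `scorer` callable (which stands in for the rule's model, prompt and score definition) are hypothetical and do not reflect Opik's UI, REST API or SDK.

```python
import random
from typing import Callable, Dict, Iterable, List


def apply_rule(
    traces: Iterable[Dict[str, str]],
    scorer: Callable[[Dict[str, str]], float],
    sampling_rate: float,
    rng: random.Random,
) -> List[Dict[str, object]]:
    """Score a random subset of traces; each is picked with probability sampling_rate."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    results: List[Dict[str, object]] = []
    for trace in traces:
        # rng.random() is in [0, 1), so a rate of 1.0 scores every trace
        # and a rate of 0.0 scores none.
        if rng.random() < sampling_rate:
            results.append({"trace_id": trace["id"], "score": scorer(trace)})
    return results
```

Sampling matters mostly when the scorer is a model call: scoring a fraction of production traffic bounds judge cost while still giving a signal on quality trends. Passing the random generator in explicitly keeps the sketch reproducible in tests.
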
