Open Evaluation for Agents
The community is adopting open-source, framework-agnostic tools that evaluate LLM-based agents continuously by running regression suites, live-traffic tests, and behavioral compliance checks before rollout. These tools aim to catch evaluation drift and to gate model or skill releases across LangChain, CrewAI, and custom stacks. (dev.to)
EvalForge, a new open-source project published on April 2, 2026 by Hemanth Kumar, is a single tool that takes a structured record of an agent run (a "trace" saved as JSON) and produces a scored pass/fail result that automated build-and-test pipelines can consume. (dev.to) (github.com) It normalizes traces from different agent libraries so that you don’t have to rewrite evaluation code when you switch frameworks: the project shows mappings from LangChain, CrewAI, and AutoGen artifacts into a single JSON schema and claims it can plug directly into deployment workflows. (dev.to) (github.com)
Technically, EvalForge’s input is a universal trace JSON: a structured timeline of what the agent did and what each tool returned during the run. It records metadata (framework, model, duration, token counts), the user input, an ordered list of steps (each labeled as a "thought" or a "tool_call", with tool outputs), and the final answer. (dev.to)
For scoring, EvalForge currently ships a “faithfulness” grader that uses an LLM as a judge: it asks a model to compare the agent’s final answer against the tool outputs recorded in the trace and to produce a numeric score. The tool then exits with code 0 or 1, so a continuous integration (CI) system, meaning an automated build-and-test pipeline that runs on each code change, can block a release if the eval fails. (dev.to) (github.com)
The project also publishes a runtime package (evalforge-runtime) that runs a small API server to execute instrumented agent processes with execution logging, cost tracking, and optional integrations (for example, Langfuse or cloud secret stores), which makes it practical to run regression suites inside CI or a centralized evaluation job runner. (pypi.org) (github.com)
EvalForge sits alongside other open-source agent-eval work. LangChain’s agentevals repository offers many ready-made trajectory-style evaluators that focus on step-level behavior, and AWS’s Agent Evaluation project focuses on orchestrating concurrent multi-turn conversations with an LLM-based evaluator. EvalForge’s differentiator is its canonical, framework-agnostic trace format plus an explicit CI gating design for pass/fail automation. (github.com 1) (github.com 2) (dev.to)
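To make the trace-in, exit-code-out idea concrete, here is a minimal Python sketch of that flow. It is an illustration only and does not use EvalForge's actual schema, CLI, or API: the key names (`metadata`, `input`, `steps`, `final_answer`, `type`, `output`), the 0.7 threshold, and every function name are assumptions modeled on the fields listed above. The judge is a simple word-overlap heuristic standing in for an LLM call so that the example runs offline, and mapping a pass to exit code 0 follows the usual shell convention rather than anything the sources state.

```python
import json
import re

# Assumed key names, modeled on the fields the article lists.
# This is NOT EvalForge's real schema.
REQUIRED_KEYS = ("metadata", "input", "steps", "final_answer")
STEP_TYPES = ("thought", "tool_call")

# Hypothetical trace used only for illustration.
EXAMPLE_TRACE = {
    "metadata": {
        "framework": "langchain",
        "model": "example-model",
        "duration_ms": 1840,
        "tokens": {"input": 312, "output": 57},
    },
    "input": "What is the capital of France?",
    "steps": [
        {"type": "thought", "content": "I should look this up."},
        {
            "type": "tool_call",
            "tool": "search",
            "output": "Paris is the capital of France.",
        },
    ],
    "final_answer": "The capital of France is Paris.",
}


def validate_trace(trace):
    """Raise ValueError if an assumed key is missing or a step type is unknown."""
    missing = [key for key in REQUIRED_KEYS if key not in trace]
    if missing:
        raise ValueError(f"trace is missing keys: {missing}")
    for index, step in enumerate(trace["steps"]):
        if step.get("type") not in STEP_TYPES:
            raise ValueError(f"step {index} has unknown type: {step.get('type')!r}")


def tool_outputs(trace):
    """Collect the outputs of tool_call steps, in their original order."""
    return [
        str(step.get("output", ""))
        for step in trace["steps"]
        if step["type"] == "tool_call"
    ]


def _words(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap_judge(answer, evidence):
    """Offline stand-in for an LLM judge: share of answer words found in the evidence."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    evidence_words = _words(" ".join(evidence))
    return len(answer_words & evidence_words) / len(answer_words)


def evaluate(trace, judge=word_overlap_judge, threshold=0.7):
    """Return (score, exit_code), where exit_code is 0 on pass and 1 on fail."""
    validate_trace(trace)
    score = judge(trace["final_answer"], tool_outputs(trace))
    return score, (0 if score >= threshold else 1)


def main(argv):
    """Load a trace from the path in argv[1], print the verdict, return the exit code."""
    with open(argv[1], encoding="utf-8") as handle:
        trace = json.load(handle)
    score, code = evaluate(trace)
    print(f"faithfulness={score:.2f} -> {'PASS' if code == 0 else 'FAIL'}")
    return code
```

In a CI step, a wrapper script could end with `sys.exit(main(sys.argv))` so that a failing score stops the pipeline. Passing the judge in as a parameter keeps the gate logic separate from the scoring method, so a real model-backed judge with the same `(answer, evidence) -> float` shape could replace the heuristic without touching the rest.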