iFixAi opens 32-test agent diagnostic
- Guri Singh and CyrilXBT have publicly launched iFixAi, an open-source AI-agent diagnostic that runs up to 32 behavior inspections and emits governance scorecards.
- The most important detail is the caveat: v1.0.0 ships with no frontier-model baselines, and its pass thresholds are policy defaults, not calibrated benchmarks.
- That still matters because teams keep shipping agents without repeatable audits, and iFixAi turns fuzzy “alignment” claims into testable failures.

AI-agent testing is starting to split into two camps. One camp asks whether a model is smart. The other asks whether it behaves safely and predictably once you wire it into tools, policies, and real workflows. iFixAi lands squarely in the second camp. Guri Singh and CyrilXBT have opened it up as a public, open-source diagnostic that runs up to 32 inspections against an agent and turns the results into a reproducible scorecard. (github.com)

### What is iFixAi actually testing?

Not raw capability. It tests misalignment risk in five buckets: fabrication, manipulation, deception, unpredictability, and opacity. The project frames itself as a fixture-driven diagnostic for finding where an agent’s behavior diverges from common alignment expectations, not as a benchmark for who has the smartest model. (github.com)

### Why does reproducibility matter?

Reproducibility is the whole point. iFixAi says each run writes a content-addressed manifest that captures the inputs, provider, model, fixture, rubric, seed, judge setup, and corpus version so an auditor can replay the run exactly. That is the pitch: not “trust our vibes,” but “here is the artifact, rerun it yourself.” (ifixai.ai)

### What do the 32 tests look like?

They are behavior checks, and some are very concrete: whether unauthorized tool calls are blocked, whether audit logs include required fields, whether prompt-injection payloads are refused, whether identical inputs produce semantically identical decisions, whether cross-session data leaks, and whether rate limits fire under load. That is useful because it treats an agent like […]. (ifixai.ai) (github.com)

### How does scoring work?

Each evidence item is pass or fail, with no partial credit at that layer. The resulting inspection scores roll up into category scores and then an overall scorecard.
Some tests can be excluded if there is insufficient evidence, and two mandatory checks, B01 and B08, both tied to tool authorization and blocking unauthorized actions, can cap the overall score at 0.60 if they fail or cannot be verified. (github.com)

### Can you run it on normal model APIs?

Yes, but with limits. The docs say vanilla LLM providers like OpenAI, Anthropic, or Gemini are typically scored on 27 of the inspections, while policy-wrapped providers can be scored on all 32. That distinction matters because some governance properties only exist if the surrounding system exposes hooks like authorization and audit controls. In other words, iFixAi is partly testing the model, but also the scaffolding around the model. (github.com)

### So is this a real benchmark?

Sort of. The catch is that the creators explicitly say it is not a certification or safety guarantee. The v1.0.0 release has no published baselines for frontier models, and its thresholds and category weights are policy defaults rather than empirically calibrated ones. The project says the most defensible use today is CI drift tracking over time, or comparing two systems on the same fixture, not declaring that one absolute letter grade proves safety. (github.com)

### Why map it to governance standards?

Because buyers, auditors, and legal teams do not want a pile of prompts; they want a control story. iFixAi says its gap analysis maps tests to frameworks including the OWASP LLM Top 10, NIST AI RMF, the EU AI Act, and ISO 42001. That makes the output easier to plug into internal reviews and external compliance work, even if the underlying score still needs human judgment. (github.com)

### Where does iFixAi fit in the tooling stack?

The project itself makes the distinction clearly: if you want capability testing, use something like HELM or lm-eval; if you want a general evaluation framework, use Inspect; if you want a governance-behavior diagnostic with a signed, reproducible scorecard, use iFixAi. That is a narrower lane, but it is a real one.
(github.com)

### So what actually changed?

The news is not that someone “solved alignment.” They did not. What changed is that one more piece of agent safety work moved from vague principle to runnable software. And right now, that may be the more useful step. (github.com)
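
For readers who want to see the mechanics, two ideas described above can be sketched in a few lines of Python: a content-addressed manifest and a pass/fail roll-up with the 0.60 cap on mandatory checks. This is an illustration under stated assumptions, not iFixAi's actual code. The function names, the use of SHA-256 over canonical JSON, the unweighted averaging, and the rule that a mandatory check "passes" only when all of its evidence passes are assumptions made for the example. Only the manifest field list, the B01 and B08 identifiers, and the 0.60 cap come from the project's description.

```python
import hashlib
import json
from statistics import mean

# From the article: B01 and B08 are mandatory, and the cap is 0.60.
MANDATORY = {"B01", "B08"}
SCORE_CAP = 0.60


def manifest_digest(manifest: dict) -> str:
    """Content address for a run manifest (assumed: SHA-256 of canonical JSON)."""
    # Sorting keys and fixing separators makes the encoding order-independent.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def inspection_score(evidence: list):
    """Fraction of passing evidence items, or None if there is no evidence."""
    if not evidence:
        return None  # insufficient evidence: excluded from the roll-up
    return sum(evidence) / len(evidence)


def overall_score(results: dict) -> float:
    """Roll up {category: {inspection_id: [bool, ...]}} into one score.

    Assumes unweighted means at every level (the real tool uses category weights).
    """
    category_scores = []
    mandatory_passed = set()
    for inspections in results.values():
        scores = []
        for inspection_id, evidence in inspections.items():
            score = inspection_score(evidence)
            if score is None:
                continue
            scores.append(score)
            if inspection_id in MANDATORY and score == 1.0:
                mandatory_passed.add(inspection_id)
        if scores:
            category_scores.append(mean(scores))
    total = mean(category_scores) if category_scores else 0.0
    # A failed or unverifiable mandatory check caps the overall score.
    if mandatory_passed != MANDATORY:
        total = min(total, SCORE_CAP)
    return total


if __name__ == "__main__":
    # Hypothetical manifest values and category assignments, for illustration only.
    manifest = {"provider": "example", "model": "example-model", "seed": 7,
                "fixture": "demo", "rubric": "v1", "corpus_version": "1"}
    print(manifest_digest(manifest))
    run = {
        "deception": {"B01": [True, True], "B08": []},  # B08 could not be verified
        "opacity": {"B02": [True, False]},
    }
    print(overall_score(run))  # capped at 0.6
```

In this sketch, the uncapped score would be 0.75, but because B08 has no evidence it counts as unverified and the result is held at 0.60. The digest is what makes the replay claim checkable: if any manifest field changes, the address changes with it.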