ArXiv paper proposes validation methods for LLM safety scoring when no benchmark exists

- Researchers from Simula, OsloMet, and Norway’s Directorate of Health posted an arXiv paper on May 8 laying out how to compare LLM safety without labels.
- The paper’s validation chain checks three things: safe-vs-abliterated separation, target-driven variance, and rerun stability; its Norwegian test hit 0.89–1.00 AUROC.
- That matters because many real deployments need local safety evidence before any benchmark exists, and single-number rankings can mislead.

Safety scoring for language models has a basic problem: the places that most need careful evaluation often have the worst benchmarks. If you are testing a model for Norwegian public-sector use, or for a narrow regulatory setting, there may be no labeled dataset at all. But teams still have to choose a model. That is the gap this new arXiv paper tries to close, not by pretending a benchmark exists, but by defining what kind of evidence you can still trust when it doesn’t.

### What is the paper actually proposing?

The authors call the setup benchmarkless comparative safety scoring. The idea is simple: you are not claiming a model is “safe” in some absolute sense. You are claiming that, under a fixed audit setup, one candidate behaves more safely than another in the scenarios you care about. That is a narrower claim, and deliberately so: the point is to make it honest enough to use in deployment decisions. (arxiv.org)

### Why can’t you just use existing safety benchmarks?

Because existing benchmarks are often mismatched to the real deployment cell. A company or agency may need evidence for a specific language, sector, or policy regime, and the public benchmarks may be in English, too broad, or irrelevant to the actual risks. Building a fresh labeled benchmark is expensive and sometimes impossible on the available timeline. So the paper treats “no benchmark” as a normal operating condition, not an edge case. (arxiv.org)

### So what replaces ground truth?

The replacement is what the paper calls an instrumental-validity chain. Instead of asking whether scores match a gold label set, it asks whether the scoring instrument behaves sensibly under stress. The chain has three main checks: can the instrument cleanly separate a controlled safe target from an intentionally ablated one, is most of the score variation driven by the target model rather than by quirks of the auditor or judge, and do the results stabilize when you rerun the audit multiple times?
(arxiv.org)

### What did they test it on?

They instantiate the method in a tool called SimpleAudit, described as local-first, and validate it on a Norwegian safety pack. In that setup, the safe and ablated targets separate with AUROC values from 0.89 to 1.00. The paper also says target identity explains the dominant share of variance (η² ≈ 0.52) and that severity profiles stabilize by 10 reruns. Those are not universal safety numbers. They are evidence that the instrument itself is behaving coherently in this setting. (arxiv.org)

### Why does “fixed setup” matter so much?

Because the score is only valid under the exact audit contract: scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Change any of those and you may have changed the meaning of the score. That is the paper’s most useful discipline. It pushes against the habit of turning messy evaluation into a single leaderboard number that looks portable when it really isn’t. (arxiv.org)

### Did they compare real models too?

Yes. The paper includes a Norwegian public-sector procurement case comparing Borealis and Gemma 3. The interesting part is that there is no universal winner: the “safer” model depends on the scenario category and the risk measure you care about. So the output should be a bundle of evidence (scores, deltas, critical rates, uncertainty, and which auditor and judge were used), not a neat one-line ranking.

### What’s the catch?

The method does not solve the hardest philosophical problem, which is defining safety once and for all. It solves a narrower operational one: how to make comparative claims less flimsy when labels are missing. That is still valuable. A lot of real AI governance work is exactly this unglamorous problem: choosing between imperfect models in local contexts with incomplete benchmarks.

### Bottom line?

This paper is a push toward more disciplined safety evaluation in the places where benchmarks lag deployment.
(arxiv.org) The big message is not “we found the right safety score.” It is “if you do not have ground truth, be explicit about the contract, stress-test the instrument, and stop pretending one number tells the whole story.” (arxiv.org)
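
To make the three checks in the validity chain concrete, here is a minimal Python sketch of the kind of statistic each one relies on. It is not SimpleAudit's code or the paper's exact procedure: the function names, the toy numbers, the convention that a higher score means a more severe (less safe) response, and the use of a one-way η² and a running mean as a stability signal are all assumptions made for illustration.

```python
from statistics import mean


def auroc(safe_scores, ablated_scores):
    """Chance that a randomly chosen ablated-target score is higher than
    a randomly chosen safe-target score (ties count as half)."""
    wins = 0.0
    for a in ablated_scores:
        for s in safe_scores:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(ablated_scores) * len(safe_scores))


def eta_squared(scores_by_target):
    """One-way eta squared: share of total score variance explained by
    which target produced the score (between-group SS / total SS)."""
    all_scores = [v for group in scores_by_target.values() for v in group]
    grand_mean = mean(all_scores)
    ss_total = sum((v - grand_mean) ** 2 for v in all_scores)
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2
        for group in scores_by_target.values()
    )
    return ss_between / ss_total


def running_means(values):
    """Mean score after 1, 2, ..., n reruns; a flattening curve is one
    simple (assumed) sign that further reruns add little."""
    total = 0.0
    out = []
    for count, value in enumerate(values, start=1):
        total += value
        out.append(total / count)
    return out


if __name__ == "__main__":
    # Made-up severity scores, purely illustrative (not from the paper).
    safe = [0.1, 0.2, 0.15, 0.3, 0.25]
    ablated = [0.7, 0.9, 0.28, 0.8, 0.6]
    print("separation AUROC:", auroc(safe, ablated))

    by_target = {"target_a": [1, 2, 3], "target_b": [4, 5, 6]}
    print("eta squared:", eta_squared(by_target))

    reruns = [0.42, 0.38, 0.41, 0.40, 0.39, 0.40]
    print("running means:", running_means(reruns))
```

An AUROC near 1.0 means the instrument almost always ranks the ablated target as worse than the safe one, and an η² above 0.5 means target identity accounts for more than half of the score variance. A real audit would also need to separate auditor and judge effects, which this one-way version does not attempt.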
