Builds safety scoring without ground truth

- Researchers from Simula, OsloMet, the University of Oslo, and Norway’s health directorate posted a May 7 paper on safety scoring for LLMs without labels.
- Their SimpleAudit setup separated “safe” and “abliterated” model variants with AUROC from 0.89 to 1.00, and scores stabilized after about 10 reruns.
- It matters because many real audits need local, domain-specific safety checks before any benchmark exists.

Safety evaluation for language models has a boring-sounding problem that turns out to be pretty brutal. You often need to decide which model is safer before anyone has built the benchmark you’d normally use to judge it. That is especially true in smaller languages, regulated sectors, and local deployments where the usual English-heavy datasets do not fit. A new paper posted on May 7 tries to close exactly that gap by replacing ground-truth labels with a chain of proxy checks and packaging the whole thing into a tool called SimpleAudit.

### What actually changed?

Sushant Gautam and eight coauthors from Simula Metropolitan, Oslo Metropolitan University, the University of Oslo, Simula Research Laboratory, and the Norwegian Directorate of Health introduced what they call “benchmarkless comparative safety scoring.” The point is not to prove an absolute safety score in the abstract. The point is to compare candidate models in a fixed deployment setting when no labeled benchmark exists yet. (arxiv.org)

### Why is missing ground truth such a problem?

Most safety evals quietly assume you have a trusted answer key: labels saying which responses are acceptable, unsafe, or policy-violating. But open-ended safety work often does not look like that. A public agency choosing a Norwegian-language assistant, for example, may need tests tailored to local law, local language, and local workflows long before a benchmark gets built. The paper argues that waiting for perfect labels is often too slow and too expensive for real procurement decisions. (arxiv.org)

### So what do they use instead?

They build an “instrumental-validity chain.” Basically, if you cannot compare a model to ground truth, you ask whether your scoring system behaves sensibly under controlled stress.
The paper checks three things: whether the scorer can distinguish a safer model from an intentionally safety-damaged one, whether most of the variation comes from the target model rather than quirks of the auditor or judge, and whether results stay stable when you rerun the audit. (arxiv.org)

### What is the “abliterated” contrast?

This is the clever part. They use a controlled safe-versus-abliterated comparison as a calibration test. Think of it like checking a thermometer with ice water and boiling water before trusting it in the wild. If the scoring method cannot reliably separate a model with safety behavior intact from one whose safety has been deliberately degraded, the whole instrument is suspect. In their validation, that separation reached AUROC values between 0.89 and 1.00. (arxiv.org)

### Did the method look stable?

Mostly, yes. The authors say target identity explained the largest share of score variance, with η² around 0.52, which is their evidence that the results are driven more by model differences than by evaluator noise. They also report that severity profiles stabilized by about 10 reruns. That matters because one-off safety evals can be noisy, and noisy rankings are exactly how teams talk themselves into fake certainty. (arxiv.org)

### What did they test it on?

They instantiated the method in SimpleAudit, a local-first auditing tool, and validated it on a Norwegian safety pack. They also describe a Norwegian public-sector procurement case comparing Borealis and Gemma 3. The result was not a neat universal winner. The safer model changed depending on scenario category and risk measure, which is a useful warning against collapsing safety into one leaderboard number. (arxiv.org)

### Why is that nuance important?

Because safety decisions are deployment decisions. A model can look better on one class of harmful behavior and worse on another.
The paper’s argument is that audits should report matched deltas, critical rates, uncertainty, and which judge and auditor were used, not just “Model A beat Model B.” That is less tidy, but it is much closer to how real risk works. (arxiv.org)

### Bottom line

This is not a magic replacement for human judgment. The catch is that the scores only hold under a fixed scenario pack, rubric, auditor, judge, sampling setup, and rerun budget. But as a practical move, it is a strong one. If you need to compare models in a domain where no benchmark exists yet, this gives you a way to build evidence instead of pretending the missing labels are not a problem. (arxiv.org)
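
For readers who want a feel for the first two checks in the validity chain, here is a minimal Python sketch. It is not SimpleAudit's code or API. The function names, the score direction (higher means more severe), and the numbers are all made up for illustration. The paper's η² presumably comes from a fuller breakdown that also includes the judge and auditor; this version is a simpler one-way calculation over target models only.

```python
from itertools import product
from statistics import mean


def auroc(safe_scores, abliterated_scores):
    """Chance that a random abliterated run scores as more severe
    than a random safe run. Ties count as half a win."""
    wins = 0.0
    for bad, good in product(abliterated_scores, safe_scores):
        if bad > good:
            wins += 1.0
        elif bad == good:
            wins += 0.5
    return wins / (len(abliterated_scores) * len(safe_scores))


def eta_squared(scores_by_target):
    """One-way eta squared: between-target sum of squares divided by
    the total sum of squares (a simplification, see the note above)."""
    all_scores = [s for scores in scores_by_target.values() for s in scores]
    grand_mean = mean(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in scores_by_target.values()
    )
    return ss_between / ss_total


# Hypothetical severity scores, one per audit rerun (higher = more severe).
safe_runs = [0.5, 1.0, 1.0, 1.5, 2.0]
abliterated_runs = [1.5, 2.5, 3.0, 3.0, 3.5]

separation = auroc(safe_runs, abliterated_runs)
target_share = eta_squared({"safe": safe_runs, "abliterated": abliterated_runs})
print(f"AUROC = {separation:.2f}, eta squared = {target_share:.2f}")
```

An AUROC near 1.0 means the scorer almost always ranks the degraded model as riskier, which is the thermometer-in-boiling-water result you want. An AUROC near 0.5 means the scorer cannot tell the two apart.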
