NIST to run pre-deployment safety tests on frontier AI models
- NIST’s AI center, CAISI, said May 5 it will test frontier models from Google DeepMind, Microsoft, and xAI before release for security risks.
- The new deals add three labs to earlier Anthropic and OpenAI agreements, giving CAISI access to five major frontier-model developers overall.
- That shifts U.S. AI oversight from broad voluntary promises toward hands-on model access, pre-release evaluation, and evidence-based safety checks.

The U.S. government is getting a closer look at powerful AI systems before the public does. On May 5, NIST’s Center for AI Standards and Innovation (CAISI) said it had signed new agreements with Google DeepMind, Microsoft, and xAI to run pre-deployment evaluations on frontier models. Basically, the government is moving one step upstream, from reacting to released models to testing some of them before launch. (nist.gov)

### What actually changed?

CAISI already had similar arrangements with Anthropic and OpenAI from 2024, back when the office was still called the U.S. AI Safety Institute. The new agreements expand that setup to three more labs. So this is not a brand-new idea; it is the scaling up of a system that now covers five of the biggest U.S.-linked frontier AI developers. (nist.gov)

### What is CAISI, exactly?

CAISI sits inside NIST, which is the Commerce Department’s standards and measurement shop. Its job is not to approve models like a drug regulator. The closer analogy is a government lab that develops tests, runs evaluations, and tries to turn vague safety talk into measurable evidence. NIST says CAISI [...] capabilities that could create national-security risk. (nist.gov)

### What does “pre-deployment evaluation” mean here?

It means the labs give CAISI access to models before public release so the agency can probe specific dangerous capabilities. The announced focus is frontier AI and national-security risk: things like cyber abuse potential, misuse pathways, and other advanced capabilities that are hard to judge from marketing demos or benchmark scores alone. CAISI also says it [...] evaluations, which matters because the testing methods themselves are still immature. (nist.gov)

### Why does early access matter so much?

Because once a model is public, the incentives change fast. Developers are managing launches, customers are integrating tools, and outside researchers are working from whatever access the company allows.
Early access lets evaluators inspect the system before those pressures lock in. It does not guarantee safety, but it is a much more serious checkpoint. (nist.gov)

### Is this mandatory government approval?

No, and that is the catch. These are voluntary agreements, not a legal licensing regime. CAISI can test models it gets access to, but it is not formally clearing every frontier model for release. That means the power here comes from cooperation, reputation, and the practical value of shared testing methods, not from a hard stop button. (nist.gov)

### So why are companies agreeing to this?

Partly because independent testing is becoming harder to dodge. Frontier models are increasingly agentic, tool-using, and connected to real systems. That makes “trust us” a weaker answer. Working with CAISI gives labs a way to show they are submitting to outside scrutiny, and it helps shape the standards that enterprises and governments may later expect everyone to meet. [...] describing this kind of collaboration as an ongoing part of their safety work. (anthropic.com)

### What does this mean for model builders?

It pushes safety work toward artifacts you can replay and inspect, not just policy PDFs. If a lab knows a government evaluator may test a model before release, the useful evidence starts looking more technical: tool traces, permission boundaries, eval harnesses, failure taxonomies, [...] engineering and less like PR. (cio.com)

### Bottom line?

This is still voluntary, and it is not a full regulatory regime. But it is a real change. The U.S. government now has a broader, more formal path to inspect frontier AI systems before release, and that makes pre-launch safety evidence much harder for major labs to treat as optional. (nist.gov)