Safety talk: prompts, sanitization, review
Recent discussions emphasize concrete safety layers for LLMs in high‑stakes settings — system prompts to set behavior, input sanitization to remove risky data, and human review for outputs that affect liability or privacy. (x.com) Researchers and practitioners also warned of commercial behaviors that need multi‑stakeholder scrutiny, especially where models are used without engineered guardrails. (x.com)
Large language models read instructions and user data in the same text stream, which is why recent safety guidance keeps returning to three layers: a system prompt, input cleaning, and human review. (owasp.org) A system prompt is the hidden rulebook developers give a model before any user message arrives. Anthropic said its “Constitutional” approach uses a written set of principles to steer outputs, and OpenAI said system-level steering is one part of deployment safety. (anthropic.com) (openai.com) Input sanitization means screening what goes into the model before the model treats it like an instruction. The Open Worldwide Application Security Project said prompt injection works because normal data and commands are mixed together, and it lists filtering dangerous patterns, limiting formats, and validating inputs as basic defenses. (owasp.org 1) (owasp.org 2) Human review is the last check when a model’s answer could create legal, medical, financial, privacy, or security consequences. OpenAI’s API safety guide says human review is “especially critical” in high-stakes domains and for code generation. (openai.com) That stack is showing up in formal risk frameworks, not just product advice. The National Institute of Standards and Technology released its Generative Artificial Intelligence Profile on July 26, 2024 as a companion to the Artificial Intelligence Risk Management Framework, and said it helps organizations identify generative artificial intelligence risks and choose management actions. (nist.gov 1) (nist.gov 2) The pressure is highest in tools that can act on outside content, not just answer questions. Anthropic said in December 2025 that prompt injections are “one of the most significant security challenges” for browser-based artificial intelligence agents, because malicious instructions can be hidden inside pages, emails, or other material the agent reads. (anthropic.com) Security groups now treat prompt injection as a top model risk, not a fringe edge case. The Open Worldwide Application Security Project’s Generative Artificial Intelligence Security Project lists prompt injection as “LLM01” and says attacks can alter behavior, leak data, or bypass intended restrictions. (owasp.org 1) (owasp.org 2) The commercial warning behind these discussions is straightforward: companies can ship a chatbot faster than they can engineer the controls around it. NIST says its risk framework is voluntary and meant to be adapted across sectors, which is one reason researchers, product teams, and regulators are all weighing in at once. (nist.gov) (nist.gov) The practical consensus is narrower than the broader politics around artificial intelligence. If a model touches sensitive records, customer commitments, or decisions that could expose a company to liability, the baseline advice from standards bodies and platform providers is to set behavior up front, clean inputs before inference, and keep a person in the loop before action. (nist.gov) (openai.com) (owasp.org)