OpenAI's Privacy Filter
- OpenAI released an open-source, on-device 'Privacy Filter' to remove personal information from enterprise datasets.
- The model runs locally on devices to sanitise data before it leaves corporate environments.
- The tool is being pitched as a prerequisite for letting agents access internal documents while protecting personal data. (venturebeat.com)
A privacy filter is software that spots names, emails, account numbers, and other personal details before text is shared. OpenAI said April 22 it has released one as an open-source model that runs locally on a laptop or in a browser. (openai.com)

OpenAI’s release is called Privacy Filter, and the company published it on GitHub and Hugging Face under an Apache 2.0 license for commercial use and modification. The repository says teams can run, evaluate, and fine-tune it in their own environments. (github.com) (huggingface.co)

The model does not write text the way a chatbot does. OpenAI says it reads a document in one pass, labels each token for possible privacy risk, and then groups those labels into spans to mask or redact. (openai.com) (huggingface.co)

OpenAI says the point of running it locally is to keep unfiltered documents on the device instead of sending raw files to a server for de-identification. The company said that setup is meant for training, indexing, logging, and review pipelines that handle internal business data. (openai.com) (github.com)

That fits a growing enterprise problem: companies want AI systems and agents to search internal files, but those files often contain employee, customer, or patient information. OpenAI’s enterprise privacy page says business customers own and control their data and that OpenAI does not train on business data by default. (openai.com)

OpenAI says Privacy Filter is designed to do more than rule-based scrubbing tools that only catch fixed patterns like phone numbers or email addresses. In its announcement, the company said the model uses language context to decide when a detail belongs to a private person and when similar text should stay because it is public information. (openai.com)

The technical tradeoff is size and speed. The GitHub and Hugging Face pages say the model has 1.5 billion total parameters, uses 50 million active parameters at runtime, and supports a 128,000-token context window, which is long enough to process large documents without splitting them into chunks. (github.com) (huggingface.co)

OpenAI said the released version reached state-of-the-art performance on the PII-Masking-300k benchmark after correcting annotation issues it identified in the evaluation set. The model card says the system was also stress-tested on multilingual, reasoning, and adversarial cases. (openai.com) (cdn.openai.com)

OpenAI is also warning that the model is not a complete compliance system by itself. The model card includes sections on failure modes, over-reliance risk, and high-risk deployment cautions, which means companies still have to decide what counts as sensitive data and what to do when the filter misses or over-masks something. (cdn.openai.com)

The immediate use case is simple: scrub the document first, then let the agent read it. OpenAI’s bet is that more companies will connect AI systems to internal knowledge once the first pass happens on their own machines instead of in someone else’s cloud. (openai.com 1) (openai.com 2)
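To picture the label-then-group step OpenAI describes, here is a minimal Python sketch. It does not load the released model or use its API. The per-token labels are written by hand, and the BIO-style label names, the `group_spans` and `mask` helpers, and the bracketed placeholders are all hypothetical choices made for illustration; the model's real label set and output format may differ.

```python
from typing import List, Optional, Tuple

# (start_token, end_token_exclusive, category)
Span = Tuple[int, int, str]


def group_spans(labels: List[str]) -> List[Span]:
    """Merge per-token labels (B-X starts, I-X continues, O is safe) into spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    kind: Optional[str] = None
    for i, label in enumerate(labels):
        prefix, _, category = label.partition("-")
        continues = prefix == "I" and category == kind
        # Close the open span when this token does not extend it.
        if start is not None and not continues:
            spans.append((start, i, kind))
            start, kind = None, None
        # Open a new span on B-X, or on a stray I-X with nothing open.
        if prefix == "B" or (prefix == "I" and start is None):
            start, kind = i, category
    if start is not None:
        spans.append((start, len(labels), kind))
    return spans


def mask(tokens: List[str], labels: List[str]) -> str:
    """Replace each labeled span with a single [CATEGORY] placeholder."""
    out: List[str] = []
    cursor = 0
    for start, end, category in group_spans(labels):
        out.extend(tokens[cursor:start])
        out.append(f"[{category}]")
        cursor = end
    out.extend(tokens[cursor:])
    return " ".join(out)


# Hand-written stand-ins for what a token classifier might emit.
tokens = ["Email", "Jane", "Doe", "at", "jane@example.com", "today"]
labels = ["O", "B-NAME", "I-NAME", "O", "B-EMAIL", "O"]

print(mask(tokens, labels))  # prints: Email [NAME] at [EMAIL] today
```

In a real pipeline the labels would come from the model rather than a hand-written list, and the masked text is what would be passed on to an agent, index, or log.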