What is AI watermarking?
AI watermarking is the process of embedding a recognizable, unique signal into the output of an artificial intelligence model, such as text or an image, to identify that content as AI generated. That signal, known as a watermark, can then be detected by algorithms designed to scan for it.
Ideally, an AI watermark should be invisible to the naked eye, but extractable using specialized software or algorithms. A generative AI model that incorporates watermarking can be used like any other model, but its output carries a hidden, detectable signal indicating that it was created using AI. Effective AI watermarking should also avoid impairing model performance; resist attempts at forgery, removal or modification; and be compatible with a range of model architectures.
AI watermarking is a relatively new technique that has seen increased interest in the wake of consumer-facing text and image generators, which have made it much easier to create believable content using AI. In March 2023, for instance, an image of the pope wearing a white puffer jacket was created using the image generator Midjourney and went viral on social media, where many users believed the image to be genuine.
Although that example is relatively benign, the ability to widely disseminate high-quality content produced by generative AI raises broader concerns about AI-manipulated media. For example, AI-generated images could be used to spread political misinformation and create deepfakes, while AI-generated text could help malicious actors conduct phishing campaigns and scams at a larger scale. As AI systems become capable of producing increasingly convincing output and AI-generated media becomes more prevalent online, researchers are exploring how to use hidden signals to indicate the origin of that content to audiences.
How AI watermarking works
The AI watermarking process involves two stages: watermark encoding during model training and watermark detection after output generation.
AI watermarks are created during model training by teaching the model to embed a specific signal or identifier in generated content -- for example, a textual watermark hidden in a sentence generated by a large language model (LLM) or a visual watermark concealed in the output of an image generator. This process usually involves making subtle changes to the model during the training stage, such as alterations to model weights or features.
After model training and deployment, specialized algorithms detect the presence of the watermark embedded earlier, thereby checking whether a piece of media was generated by AI. For example, an algorithm might search for the presence of rare phrases or analyze an image's pixels to detect hidden patterns.
As an example, consider a watermarking technique proposed by Scott Aaronson, a computer scientist and researcher at OpenAI. An LLM such as OpenAI's GPT-4 generates output by predicting the next token -- a natural language processing term referring to a short unit of text, such as a word, syllable or punctuation mark -- based on the previous tokens. Each candidate for the next token is assigned a probability score indicating how likely it is to come next.
Normally, the model selects the next token by sampling from these probability scores. To create an AI watermark, however, the model could instead bias that selection using a cryptographic function whose key is accessible only to the model's developers -- for example, making the model slightly more likely to choose certain words or sequences of tokens in a pattern that a human writer would be unlikely to replicate.
The presence of these subtly favored words and phrases would then function as a watermark. To an end user, the model's output would still read naturally, as if sampled at random. However, someone with the cryptographic key could analyze the text and reveal the hidden watermark based on how often the encoded biases occur.
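The general idea can be sketched in simplified form. The toy example below is not OpenAI's actual implementation -- the key, vocabulary and bias parameter are all illustrative. It hashes the previous token together with a secret key to derive a "preferred" half of the vocabulary, nudges generation toward that half, and detects the watermark by measuring how often tokens land in it:

```python
import hashlib
import random

SECRET_KEY = "watermark-demo-key"  # illustrative; known only to the model's developers
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran", "fast", "slow"]

def preferred_tokens(prev_token: str) -> set:
    """Derive a keyed 'preferred' half of the vocabulary from the previous token."""
    seed = hashlib.sha256((SECRET_KEY + prev_token).encode()).hexdigest()
    rng = random.Random(seed)
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: len(VOCAB) // 2])

def generate(n_tokens: int, bias: float = 0.9, seed: int = 0) -> list:
    """Generate tokens, picking from the preferred set with probability `bias`."""
    rng = random.Random(seed)
    out = ["the"]
    for _ in range(n_tokens):
        pref = preferred_tokens(out[-1])
        pool = sorted(pref) if rng.random() < bias else sorted(set(VOCAB) - pref)
        out.append(rng.choice(pool))
    return out

def detect(tokens: list) -> float:
    """Fraction of tokens drawn from the preferred set -- requires the secret key."""
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:]) if tok in preferred_tokens(prev))
    return hits / (len(tokens) - 1)
```

Watermarked output scores close to the bias value (here, 0.9), while human-written or unbiased text scores near 0.5, since only half the vocabulary is "preferred" at any step.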
Similar techniques could theoretically be implemented to watermark images. For example, model developers could alter certain weights in early layers of convolutional neural networks to encode noise that functions as a watermark or include watermarked images in training data so that the model's output inherits those markers.
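One classical approach to hiding noise in an image is a spread-spectrum-style watermark: add a faint, keyed pseudorandom pattern to the pixels and detect it later by correlation. The sketch below is a minimal illustration of that idea in pure Python (the key and strength parameter are made up, and real systems operate on model weights or feature maps rather than directly on output pixels):

```python
import random

def keyed_noise(n_pixels: int, key: str) -> list:
    """Reproducible +/-1 noise pattern derived from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(n_pixels)]

def embed_watermark(pixels: list, key: str, strength: int = 8) -> list:
    """Add a faint keyed noise pattern to grayscale pixel values (0-255)."""
    noise = keyed_noise(len(pixels), key)
    return [max(0, min(255, p + strength * n)) for p, n in zip(pixels, noise)]

def detect_watermark(pixels: list, key: str) -> float:
    """Correlate the mean-centered image with the keyed pattern.
    Scores near zero for unwatermarked images, near `strength` for watermarked ones."""
    noise = keyed_noise(len(pixels), key)
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) * n for p, n in zip(pixels, noise)) / len(pixels)
```

Because the noise pattern is uncorrelated with natural image content, the correlation score concentrates near zero for ordinary images and near the embedding strength for watermarked ones -- and only someone holding the key can regenerate the pattern to check.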
The benefits of AI watermarking
Watermarking AI-generated content has several benefits:
- Preventing the spread of AI-generated misinformation. Social media networks, news organizations and other online platforms could use AI watermarks to indicate to readers that a piece of content was created using AI. Adding a disclaimer label to an Instagram post that contains an AI-generated image could help thwart attempts to spread disinformation, for example.
- Indicating authorship. Because watermarks can trace content back to the model or tool that produced it, they are useful for flagging AI output such as deepfake videos and bot-authored books. This could limit the spread of fraudulent content by helping creators prove that their name or likeness was used deceptively.
- Establishing authenticity. Similar to a physical watermark on paper currency, AI watermarks serve as digital signatures that can demonstrate provenance, or the origin of a piece of media. This could be useful in contexts such as scientific investigations or legal proceedings, where research findings or evidence could be scanned for AI watermarks to evaluate their integrity.
The limitations of current AI watermarking techniques
Unfortunately, current AI watermarking techniques are unreliable and relatively easy to circumvent. In January 2023, for example, OpenAI launched an AI text detector for ChatGPT developed by Aaronson and other OpenAI researchers. But just six months later, OpenAI took down the AI classifier tool, citing its "low rate of accuracy."
Developing persistent AI watermarks that not even determined hackers can eliminate remains an open research problem. One significant issue is that watermarks are often easy to remove, particularly in text. For example, text watermarking strategies that involve slightly emphasizing certain words or using specific patterns can be overcome simply by human editing of AI-generated text.
There is also the problem of false positives -- incorrectly identifying a human-created piece of media as the product of AI. Malicious actors could trigger a false positive by adding a watermark to a real image to instill doubt about its authenticity. False positives could also arise through random chance if an image or passage of text happens to mimic the hallmarks of a particular watermark, leading to unfair accusations of plagiarism or deceit.
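For detectors that count how often a text matches a watermark's encoded pattern, the false positive rate on human-written text can be estimated with a binomial tail probability. A minimal sketch, with illustrative token counts and thresholds:

```python
from math import comb

def false_positive_rate(n_tokens: int, threshold_hits: int, p_chance: float = 0.5) -> float:
    """Probability that unwatermarked text reaches the detection threshold by chance,
    assuming each token independently matches the pattern with probability p_chance."""
    return sum(
        comb(n_tokens, k) * p_chance**k * (1 - p_chance) ** (n_tokens - k)
        for k in range(threshold_hits, n_tokens + 1)
    )
```

Under these assumptions, a detector requiring 130 matches out of 200 tokens would flag human text well under one time in ten thousand, while one requiring only 105 matches would misfire more than a fifth of the time -- illustrating why threshold choice matters so much for fairness.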
Other watermarking techniques might work only for specific data sets and can break down when a model is fine-tuned. Challenges also remain around ensuring that watermarks persist across model versions and applications, and creating flexible watermarking techniques that work across model architectures is likely to prove difficult as well.
Finally, finding the right balance when it comes to watermark detectability is another hurdle. Including too much modified data in the training set or altering a model's weights and features too aggressively during training can degrade the model's overall accuracy. Likewise, a too-obvious watermark could make AI-generated content useless -- for example, watermarked text that sounds highly unnatural due to heavily overemphasizing rare words and syntax patterns. But, at the other extreme, subtler watermarks are more vulnerable to tampering and risk being too weak for detectors to notice.
Even if these practical limitations are overcome, widespread AI watermarking could also raise ethical concerns. Namely, embedding unique watermarks into AI-generated content could compromise users' privacy by making it possible to track individuals' use of generative AI tools.