Tools for Verifying Safe Generative Voice AI
As artificial intelligence (AI) generated voices approach human-level quality, Resemble AI is providing additional tools to help the industry tackle malicious use and stop misinformation. To deploy safe neural speech in the wild, Resemble AI is introducing the PerTh Watermarker, a deep neural network watermarker. It embeds data into generated speech in an imperceptible and difficult-to-detect way. This “invisible watermark” is difficult to remove and provides a way to verify whether a given clip was generated by Resemble.
The Importance of Detection
Advanced photo editing software has made it impossible to rely on images alone to determine the legitimacy of information. Many tools are now available to edit images, video, and audio, but creating a convincing result with them takes skill. Plenty of people have that skill, so we must assume that content from unknown sources could have been manipulated. Even with advanced tools and knowledge, however, it is still difficult to generate completely new content rather than just modify existing material.
These barriers of skill and content generation are rapidly being dismantled by the new wave of AI generative models that have surfaced in recent months. Many researchers and companies already claim to generate image, speech, and video content that is indistinguishable from real content. Typically, some technical expertise is required to use these tools. However, many companies, including ours, strive to make them as easy to use as possible, reducing or eliminating the need for any specific technical knowledge. Without proper verification in place, this could lead to the proliferation of fake and misleading content, which can even make its way into the pages of well-intentioned and reputable news organizations.
One way to tackle this emerging issue is to use AI itself as a tool to detect whether content is genuine. After all, if AI is powerful enough to fool our senses with generated content, then perhaps it is also better than we are at detecting fake content. Fake content detection is a challenging and very active area of research, including here at Resemble. However, we still have more work to do, as the cat-and-mouse game between generation and detection continues.
Despite the challenges posed by unverified users and data, practitioners can still take action. For instance, at Resemble, we require users to provide a recording of a consent clip in the voice they are attempting to clone. If the voice in this clip does not match the other clips, the user is blocked from creating the AI voice. If they get past this check by deep-faking the consent clip, they must already have access to deep-faking tools of their own.
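The matching step described above can be illustrated with a minimal sketch. It assumes each clip has already been turned into a fixed-length speaker embedding by some speaker-verification model. The function names, the cosine-similarity comparison, and the 0.75 threshold are all hypothetical choices for illustration; this post does not describe how Resemble actually performs the comparison.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def consent_matches(consent_emb: np.ndarray,
                    clip_embs: list,
                    threshold: float = 0.75) -> bool:
    """Return True only if every training clip sounds like the consent clip.

    `threshold` is an illustrative value, not a recommended setting.
    """
    return all(cosine_similarity(consent_emb, emb) >= threshold
               for emb in clip_embs)


# Toy 3-dimensional "embeddings"; real speaker embeddings are much larger.
consent = np.array([1.0, 0.0, 0.0])
same_speaker = np.array([0.9, 0.1, 0.0])
other_speaker = np.array([0.0, 1.0, 0.0])

allowed = consent_matches(consent, [same_speaker])
blocked = not consent_matches(consent, [same_speaker, other_speaker])
```

In this sketch a single mismatching clip is enough to block voice creation, which mirrors the all-or-nothing rule stated above.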
In our opinion, consent verification is a start, but it is no longer enough to combat the problem. Therefore, we have developed an additional layer of security that uses machine learning models both to embed packets of data into the speech content that we generate and to recover that data at a later point. The data is embedded in an imperceptible and difficult-to-detect way, acting as an “invisible watermark.” Because the data is imperceptible yet tightly coupled to the speech information, it is difficult to remove and provides a way to verify whether a given clip was generated by Resemble. Importantly, this “watermarking” technique also tolerates various audio manipulations, such as speeding up, slowing down, and converting to compressed formats like MP3.
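To make the embed-then-recover idea concrete, here is a toy sketch using a classic spread-spectrum watermark rather than a neural network. It is not the PerTh Watermarker, and it makes no attempt at perceptual shaping. All names (`embed_bits`, `recover_bits`, `key`, `strength`) are assumptions for illustration. Each payload bit is added to one chunk of audio as a low-amplitude pseudorandom sequence derived from a secret key, and is read back by correlating the chunk with the same sequence.

```python
import numpy as np


def _carrier(key: int, n: int) -> np.ndarray:
    """Pseudorandom +/-1 sequence that only the key holder can regenerate."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=n)


def embed_bits(audio: np.ndarray, bits: list, key: int,
               strength: float = 0.02) -> np.ndarray:
    """Add one low-amplitude carrier per bit; the sign encodes the bit."""
    chunk = len(audio) // len(bits)
    carrier = _carrier(key, chunk)
    out = audio.astype(float)  # astype returns a copy, input is untouched
    for i, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        out[i * chunk:(i + 1) * chunk] += strength * sign * carrier
    return out


def recover_bits(audio: np.ndarray, n_bits: int, key: int) -> list:
    """Correlate each chunk with the carrier; a positive sum means bit 1."""
    chunk = len(audio) // n_bits
    carrier = _carrier(key, chunk)
    return [int(np.dot(audio[i * chunk:(i + 1) * chunk], carrier) > 0)
            for i in range(n_bits)]


# Four seconds of a 220 Hz tone at 16 kHz stands in for generated speech.
t = np.arange(64000) / 16000.0
speech = 0.3 * np.sin(2 * np.pi * 220.0 * t)
payload = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_bits(speech, payload, key=1234)
recovered = recover_bits(marked, len(payload), key=1234)
```

This toy scheme survives mild additive noise because the correlation averages the noise out over thousands of samples. Unlike the technique described above, it would not survive speeding up or slowing down the audio, since that misaligns the carrier. Tolerating such manipulations is part of what a learned watermarker is meant to provide.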