This brings to light another issue. Just like prompt injections, which typically tell an AI model to “ignore your previous instructions and do this instead,” a user could conceivably do an audio prompt injection that says “ignore your sample voice and imitate this voice instead.”
That’s why OpenAI now uses a standalone output classifier to detect these instances. “We find that the residual risk of unauthorized voice generation is minimal,” writes OpenAI. “Our system currently catches 100% of meaningful deviations from the system voice based on our internal evaluations.”
The weird world of AI audio genies
Obviously, the ability to imitate any voice with a small clip is a huge security problem, which is why OpenAI has previously held back similar technology and why it’s putting the output classifier safeguard in place to prevent GPT-4o’s Advanced Voice Mode from being able to imitate any unauthorized voice.
“My reading of the system card is that it’s not going to be possible to trick it into using an unapproved voice because they have a really robust brute force protection in place against that,” independent AI researcher Simon Willison told Ars Technica in an interview. Willison coined the term “prompt injection” back in 2022 and regularly experiments with AI models on his blog.
While that’s almost certainly a good thing in the short term as society braces itself for this new audio synthesis reality, at the same time, it’s wild to think (if OpenAI had not restricted its model’s outputs) of potentially having an unhinged vocal AI model that could pivot instantaneously between voices, sounds, songs, music, and accents like a robotic, turbocharged version of Robin Williams—an AI audio genie.
“Imagine how much fun we could have with the unfiltered model,” says Willison. “I’m annoyed that it’s restricted from singing—I was looking forward to getting it to sing stupid songs to my dog.”
Willison points out that while the full potential of OpenAI’s voice synthesis capability is currently restricted by OpenAI, similar tech will likely appear from other sources over time. “We are definitely going to get these capabilities as end users ourselves pretty soon from someone else,” he told Ars Technica. “ElevenLabs can already clone voices for us, and there will be models that do this that we can run on our own machines sometime within the next year or so.”
So buckle up: It’s going to be a weird audio future.