Microsoft’s new AI needs just 3 seconds of audio to clone a voice

VALL-E can even mimic a speaker’s emotions and acoustic environment.

Microsoft’s new voice-cloning AI can simulate a speaker’s voice with remarkable accuracy — and all it needs to get started is a three-second sample of them talking.

Voice cloning 101: Voice cloning isn’t new. Google the term, and you’ll get a long list of links to websites and apps offering to train an AI to produce audio that sounds just like you. You can then use the clone to hear yourself “read” any text you like.

For a writer, this can be useful for creating an author-narrated audio version of their book without spending days in a recording studio. A voice actor, meanwhile, might clone their voice so that they can rent out the AI for projects they don’t have time to tackle themselves.

Depending on the service, the voice cloning process might start with you reciting 50 predetermined sentences or uploading a clip of you saying anything at all. Some services will ask for hours of audio to train their AI, while others will boast about needing just 5 seconds.

Often, you get out of these voice cloning services what you put into them — a shorter sample typically leads to a clone that sounds like a robot trying to impersonate a person, while longer clips can result in AI-generated audio that sounds just like the original speaker.

Short and sweet: Microsoft’s new voice-cloning AI, VALL-E, bucks this trend, generating audio that sounds remarkably like the original speaker from a voice sample just three seconds long. 

You can’t clone your own voice with VALL-E, but Microsoft has shared a research paper on arXiv and created a GitHub page where you can compare snippets of human voices to speech generated by VALL-E and a “baseline” voice-cloning AI (YourTTS).

On this page, Microsoft also demonstrates how the AI can mimic a speaker’s emotion and the acoustic environment of a sample — if the speaker sounds angry, VALL-E can generate angry-sounding audio, and if the original clip sounds like it was recorded over the phone, the AI can generate audio that matches those acoustics.

How it works: An AI is typically only as good as its training data, and Microsoft opted to use Meta’s LibriLight — an audio library containing 60,000 hours of speech from more than 7,000 English speakers — to train VALL-E.

This means the AI’s training set was “hundreds of times larger” than those used to train existing voice cloning systems, according to the research paper.

When VALL-E is presented with a new voice to clone, it breaks the three-second audio clip into bits Microsoft calls “acoustic tokens.” Using those tokens and what it learned from its training data, it can then predict what the voice would sound like saying other phrases.
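
To get a feel for what turning audio into tokens means, here is a toy sketch in Python. It is not VALL-E's tokenizer: Microsoft's paper describes a learned neural audio codec, while this example simply chops a synthetic three-second clip into 20-millisecond frames and labels each frame with the nearest entry in a tiny hand-written "codebook" of loudness levels. The sample rate, frame size, codebook values, and function names are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate in Hz (illustrative only)
FRAME_SIZE = 320      # assumed frame length: 20 ms at 16 kHz

# Hypothetical codebook of loudness levels; a real codec learns its codebooks.
CODEBOOK = [0.0, 0.25, 0.5, 0.75]


def make_test_clip(seconds, freq=220.0, rate=SAMPLE_RATE):
    """Synthetic stand-in for a voice clip: a tone that fades in from silence."""
    n = int(seconds * rate)
    return [(i / n) * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


def frame_energy(frame):
    """Root-mean-square loudness of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def tokenize(samples, codebook=CODEBOOK, frame_size=FRAME_SIZE):
    """Replace each full frame with the index of its nearest codebook entry."""
    tokens = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        energy = frame_energy(samples[start:start + frame_size])
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - energy))
        tokens.append(nearest)
    return tokens


clip = make_test_clip(3.0)   # a three-second "prompt"
tokens = tokenize(clip)      # one small integer per 20 ms frame
print(len(tokens), tokens[:5], tokens[-5:])
```

The point of the exercise is the change of representation: 48,000 raw audio samples become a sequence of 150 small integers, which is the kind of input a language-model-style system can learn to continue. A real codec captures far more than loudness, including the traits that make a voice recognizable.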

The big picture: If you go back to that list of “voice cloning” search results, you’ll likely find links to articles detailing how the AIs are being used for nefarious purposes.

There’s the cybercriminal who cloned a boss’s voice to trick an employee into transferring company cash into the criminal’s bank account, and there are warnings to seniors that bad actors can now clone their grandchildren’s voices to extort money.

The Microsoft team addresses the potential for people to misuse VALL-E in their research paper, noting that such risks could be mitigated by the creation of a “detection model” capable of determining if a clip was generated by the AI. 

Even if bad actors find ways around such tools, though, other people will use the tech for good: creating synthetic voices for ALS patients, helping people connect with deceased loved ones, or doing something so remarkable we can’t even yet imagine it.

We’d love to hear from you! If you have a comment about this article or if you have a tip for a future Freethink story, please email us at tips@freethink.com.
