Microsoft’s new AI needs just 3 seconds of audio to clone a voice

VALL-E can even mimic a speaker’s emotions and acoustic environment.

January 12, 2023

Microsoft’s new voice-cloning AI can simulate a speaker’s voice with remarkable accuracy — and all it needs to get started is a three-second sample of them talking.

Voice cloning 101: Voice cloning isn’t new. Google the term, and you’ll get a long list of links to websites and apps offering to train an AI to produce audio that sounds just like you. You can then use the clone to hear yourself “read” any text you like.

For a writer, this can be useful for creating an author-narrated audio version of their book without spending days in a recording studio. A voice actor, meanwhile, might clone their voice so that they can rent out the AI for projects they don’t have time to tackle themselves.

Depending on the service, the voice cloning process might start with you reciting 50 predetermined sentences or uploading a clip of you saying anything at all. Some services will ask for hours of audio to train their AI, while others will boast about needing just 5 seconds.

Often, you get out of these voice cloning services what you put into them — a shorter sample typically leads to a clone that sounds like a robot trying to impersonate a person, while longer clips can result in AI-generated audio that sounds just like the original speaker.

Short and sweet: Microsoft’s new voice-cloning AI, VALL-E, bucks this trend, generating audio that sounds remarkably like the original speaker from a voice sample just three seconds long.

You can’t clone your own voice with VALL-E, but Microsoft has shared a research paper on arXiv and created a GitHub page where you can compare snippets of human voices to speech generated by VALL-E and a “baseline” voice-cloning AI (YourTTS).

On this page, Microsoft also demonstrates how the AI can mimic a speaker’s emotion and the acoustic environment of a sample — if the speaker sounds angry, VALL-E can generate angry-sounding audio, and if the original clip sounds like it was recorded over the phone, the AI can generate audio that matches those acoustics.

How it works: An AI is typically only as good as its training data, and Microsoft opted to use Meta’s LibriLight — an audio library containing 60,000 hours of speech from more than 7,000 English speakers — to train VALL-E.

This means the AI’s training set was “hundreds of times larger” than those used to train existing voice cloning systems, according to the research paper.

When VALL-E is presented with a new voice to clone, it breaks the three-second audio clip into bits Microsoft calls “acoustic tokens.” Using those tokens and its training data, it can then predict what the voice would sound like saying other phrases.
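
To make the idea of “acoustic tokens” a little more concrete, here is a toy sketch in Python. It is not Microsoft’s method: VALL-E relies on a learned neural audio codec, while this example simply measures the loudness of each short frame and snaps it to the nearest entry in a small, made-up codebook. The sample rate, frame size, codebook, and function names are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 16_000   # samples per second (assumed for this toy)
FRAME_SIZE = 320       # 20 ms per frame at 16 kHz (assumed)
CODEBOOK = [i / 15 for i in range(16)]  # 16 hypothetical loudness levels in [0, 1]


def make_clip(seconds=3.0, freq=220.0):
    """Synthesize a stand-in 'voice' clip: a sine wave whose loudness swells."""
    n = int(seconds * SAMPLE_RATE)
    return [
        (t / n) * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
        for t in range(n)
    ]


def tokenize(samples):
    """Map each full frame to the index of the nearest codebook entry.

    The only feature used here is RMS loudness; a real codec learns far
    richer features from data.
    """
    tokens = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        rms = math.sqrt(sum(x * x for x in frame) / FRAME_SIZE)
        nearest = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - rms))
        tokens.append(nearest)
    return tokens


clip = make_clip()
tokens = tokenize(clip)
print(len(clip), "samples ->", len(tokens), "tokens")
```

Under these assumptions, the three-second clip of 48,000 samples shrinks to 150 small integers. That compression into a short sequence of discrete symbols is the part of the idea the sketch is meant to show; predicting new speech from those symbols is the job of the trained model and is not attempted here.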

The big picture: If you go back to that list of “voice cloning” search results, you’ll likely find links to articles detailing how the AIs are being used for nefarious purposes.

There’s the cybercriminal who cloned a boss’s voice to trick an employee into transferring company cash into the criminal’s bank account, and there are warnings to seniors that bad actors can now clone the voices of their grandchildren to extort money.

The Microsoft team addresses the potential for people to misuse VALL-E in their research paper, noting that such risks could be mitigated by the creation of a “detection model” capable of determining if a clip was generated by the AI.

Even if bad actors find ways around such tools, though, other people will use the tech for good: creating synthetic voices for ALS patients, helping people connect with deceased loved ones, or doing something so remarkable we can’t even yet imagine it.
