Canadian Developer Says Voice-imitation Technology Raises ‘Societal Issues’

In a unique and open blog posting, the Canadian developer of advanced voice-imitation technology warns prospective users that the product could result in serious risks, including fraud, identity theft and more.

As such, the mere existence of the technology “raises important societal issues” readily acknowledged and discussed by the team at Montréal-based tech start-up Lyrebird.

Using the power of artificial intelligence and machine learning, Lyrebird offers new speech synthesis solutions (as both a voice-imitation algorithm and a soon-to-be-released developer API) that let users create entire speeches or conversations in which one or more chosen voices speak text tailored to their needs.

New voice-imitation technology that can reproduce the human voice (say, that of a popular actor or other recognizable celebrity) is being developed in Canada.

These new tools can be used, for example, to give a voice (say, that of a popular actor or other recognizable celebrity) to personal digital assistants, to read audiobooks, to voice characters in animated movies or video games, and to bring a human touch to devices used by people with disabilities.

But other potential uses are quite disturbing, and seem limited only by the imagination of the user. The company’s own demo (which you can hear online) synthesizes the voices of Donald Trump, Barack Obama and Hillary Clinton. The approximations are good, but if and when they become perfect, the ability to mimic a politician’s voice opens the door to illegal, immoral and possibly unsurvivable consequences.

And the potential for perfect voice simulations is quite good: several advanced technology companies are using sophisticated artificial intelligence and machine learning techniques so that their software can actually learn how to speak better. The development team at Lyrebird says they can clone anyone’s voice by simply listening – the AI-driven software can mimic a person’s voice and get it to speak any text, based on hearing just a few seconds of source audio recording.

It’s not just Lyrebird, although their product is fast and flexible: companies like Google (DeepMind), Adobe (Project VoCo), Baidu (Deep Voice) and Intel (Speech Plus) are all creating life-like synthesized voices using AI.

Google’s DeepMind is creating realistic-sounding machine speech using the power of AI. DeepMind recently opened its first international AI research office in Edmonton, working with the University of Alberta.

Now, keep in mind these are truly synthesized voices, not prerecorded. Apple Siri and Microsoft Cortana, for example, have a real person come in and read a big long list of words, basically. Then, they edit the individually recorded words into a sentence. That is not the same technique, nor does it have the same potential, as truly synthesized voices.
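To see why the prerecorded approach is limited, here is a minimal sketch of word stitching in Python. Everything in it is a hypothetical stand-in: the `RECORDED_CLIPS` table, the `stitch_sentence` helper and the tiny lists of numbers posing as audio are illustrations only, not how Siri, Cortana or any real assistant is actually built.

```python
# Hypothetical stand-ins for studio recordings: each "clip" is just a short
# list of audio samples. A real system would hold actual recorded waveforms.
RECORDED_CLIPS = {
    "listen": [0.1, 0.3, 0.2],
    "carefully": [0.4, 0.1, -0.2, -0.1],
}


def stitch_sentence(words, clips, gap_samples=2):
    """Join prerecorded word clips into one waveform, with short silences between words."""
    waveform = []
    for index, word in enumerate(words):
        if word not in clips:
            # The core limitation: a word nobody recorded cannot be spoken.
            raise KeyError(f"no recording for {word!r}")
        if index > 0:
            waveform.extend([0.0] * gap_samples)  # brief pause between words
        waveform.extend(clips[word])
    return waveform


sentence = stitch_sentence(["listen", "carefully"], RECORDED_CLIPS)
print(len(sentence), "samples in the stitched sentence")
```

Asking this sketch for a word that was never recorded simply fails, which is the gap that truly synthesized voices are meant to close.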

Voice synthesis solutions listen to a real voice by sampling and analyzing it digitally; the tools can then re-create the uniquely identifiable waveforms that make up that voice. It’s not as easy as it sounds: our voice is made up of several discrete audio elements, including tone, volume, pitch, articulation, pronunciation and inflection. It is influenced by factors such as how we breathe, the shapes formed by our mouth and the vibrations created in our larynx.
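As a rough illustration of what “sampling and analyzing” can mean, the Python sketch below samples a pure tone and measures two of the elements named above, volume and pitch. It is a toy under stated assumptions: a single sine wave stands in for a voice, the function names are invented for this example, and the zero-crossing pitch estimate is a deliberately crude textbook method, not the deep learning approach Lyrebird describes.

```python
import math


def sample_tone(freq_hz, amplitude, rate_hz, n_samples):
    """Digitally sample a pure tone: a toy stand-in for a recorded voice."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz)
        for i in range(n_samples)
    ]


def volume_rms(samples):
    """Volume, measured as the root-mean-square level of the samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pitch_from_zero_crossings(samples, rate_hz):
    """Crude pitch estimate: count the cycles between upward zero crossings."""
    rising = [
        i for i in range(1, len(samples))
        if samples[i - 1] < 0 <= samples[i]
    ]
    if len(rising) < 2:
        return None  # too little signal to estimate a pitch
    cycles = len(rising) - 1
    return cycles * rate_hz / (rising[-1] - rising[0])


# Assumed values for the demo: a 220 Hz tone sampled 8,000 times per second.
tone = sample_tone(freq_hz=220, amplitude=0.5, rate_hz=8000, n_samples=800)
print("volume (RMS):", round(volume_rms(tone), 3))
print("pitch (Hz):", round(pitch_from_zero_crossings(tone, 8000), 1))
```

A real voice is far messier than one clean tone, with many frequencies shifting at once, which is part of why the other elements, such as articulation and inflection, are so much harder to capture.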

Of course, speech is always affected by the emotions of the speakers, not to mention the content of the speech.

So Lyrebird relies on deep learning models developed at the leading-edge Montréal Institute for Learning Algorithms laboratory at the University of Montréal; the three company founders are all PhD students there: Alexandre de Brébisson, Kundan Kumar and Jose Sotelo. They’re working with some of the top researchers in the AI community, including MILA director Yoshua Bengio.

Lyrebird says its algorithms can bring emotion to the speech it synthesizes, letting customers make voices sound angry, sympathetic or stressed out. The results are not perfect, but they represent an amazing start and a clear direction: tomorrow’s speech will be as easily manipulated – faked – as today’s images (using software like Photoshop).

Even with some kind of embedded digital watermarking, it’s possible even other machines will not be able to tell the difference between human and synthetic voices.

“This could potentially have dangerous consequences,” the company states on its website, with some understatement. Referencing the occasional use of voice recordings in court proceedings and legal trials as one example, Lyrebird notes that “[o]ur technology questions the validity of such evidence as it allows (a user) to easily manipulate audio recordings”.

Lyrebird’s posted concerns and cautions are a welcome statement on a burgeoning new sector: voice synthesis technology is still in development, but the interest is already clear. As many as 10,000 people have reportedly signed up to receive the company’s latest beta news and application releases.

“We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible,” Lyrebird adds in its statement.

So, be aware… and listen carefully.

Baidu’s Deep Voice is a text-to-speech system built on deep neural networks. The company says it can do audio synthesis in real time for applications like speech-enabled devices, navigation systems and aids for the visually impaired.

-30-

About Lee Rickwood
