On the website Infinite Conversation, the German filmmaker Werner Herzog and the Slovenian philosopher Slavoj Žižek are having a public chat about anything and everything. Their discussion is compelling, in part, because these intellectuals have distinctive accents when speaking English, not to mention a tendency toward eccentric word choices. But they have something else in common: both voices are deepfakes, and the text they speak in those distinctive accents is being generated by artificial intelligence.
I built this conversation as a warning. Improvements in what’s called machine learning have made deepfakes—incredibly realistic but fake images, videos or speech—too easy to create, and their quality too good. At the same time, language-generating AI can quickly and inexpensively churn out large quantities of text. Together, these technologies can do more than stage an infinite conversation. They have the capacity to drown us in an ocean of disinformation.
Machine learning, an AI technique that uses large quantities of data to “train” an algorithm to improve as it repetitively performs a particular task, is going through a phase of rapid growth. This is pushing entire sectors of information technology to new levels, including speech synthesis, systems that produce utterances that humans can understand. As someone who is interested in the liminal space between humans and machines, I’ve always found it a fascinating application. So when those advances in machine learning allowed voice synthesis and voice cloning technology to improve in giant leaps over the past few years—after a long history of small, incremental improvements—I took note.
Infinite Conversation got started when I stumbled across an exemplary speech synthesis program called Coqui TTS. Many projects in the digital domain begin with finding a previously unknown software library or open-source program. When I discovered this tool kit, accompanied by a flourishing community of users and plenty of documentation, I knew I had all the necessary ingredients to clone a famous voice.
As an appreciator of Werner Herzog’s work, persona and worldview, I’ve always been drawn to his voice and way of speaking. I’m hardly alone, as pop culture has made Herzog into a literal cartoon: his cameos and collaborations include The Simpsons, Rick and Morty and Penguins of Madagascar. So when it came to picking someone’s voice to tinker with, there was no better option, particularly since I knew I would have to listen to that voice for hours on end. It’s almost impossible to get tired of hearing his dry speech and heavy German accent, which convey a gravitas that can’t be ignored.
Building a training set for cloning Herzog’s voice was the easiest part of the process. Between his interviews, voice-overs and audiobook work there are literally hundreds of hours of speech that can be harvested for training a machine-learning model or, in my case, fine-tuning an existing one. A machine-learning algorithm’s output generally improves in “epochs,” each of which is one full pass of the neural network over all the training data. The algorithm can then produce sample output at the end of each epoch, giving the researcher material to review in order to evaluate how well the program is progressing. With the synthetic voice of Werner Herzog, hearing the model improve with each epoch felt like witnessing a metaphorical birth, with his voice gradually coming to life in the digital realm.
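To make the idea of an epoch concrete, here is a minimal sketch in Python. It is not the voice model or the Coqui TTS training code; it fits a single number to toy data, and the function name, learning rate and data are all assumptions chosen for illustration. What it shares with real training is the shape of the loop: one full pass over the data per epoch, followed by a checkpoint the researcher can inspect.

```python
import random

def train(data, epochs=5, lr=0.01, seed=0):
    """Fit y = w * x by gradient descent. One epoch = one full pass over data."""
    rng = random.Random(seed)
    samples = list(data)  # copy so the caller's list is not reordered
    w = 0.0
    history = []
    for epoch in range(epochs):
        rng.shuffle(samples)  # visit the examples in a fresh order each epoch
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # derivative of the squared error w.r.t. w
            w -= lr * grad
        # End-of-epoch check: the analogue of listening to a sample of the voice.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        history.append(loss)
    return w, history

# Toy data (hypothetical): the "true" relationship is y = 3x.
toy_data = [(float(x), 3.0 * x) for x in range(1, 6)]
weight, losses = train(toy_data)
```

In this toy setup the recorded error shrinks from one epoch to the next, which is the numerical counterpart of hearing a synthetic voice get a little closer to the real one after each pass.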
Once I had a satisfactory Herzog voice, I started working on a second voice and intuitively picked Slavoj Žižek. Like Herzog, Žižek has an interesting, quirky accent, a relevant presence within the intellectual sphere and connections with the world of cinema. He has also achieved somewhat popular stardom, in part thanks to his polemical fervor and sometimes controversial ideas.
At this point, I still wasn’t sure what the final format of my project was going to be—but having been taken by surprise by how easy and smooth the whole process of voice-cloning was, I knew it was a warning to anyone who would pay attention. Deepfakes have become too good and too easy to make; just this month, Microsoft announced a new speech synthesis tool called VALL-E that, researchers claim, can imitate any voice based on just three seconds of recorded audio. We’re about to face a crisis of trust, and we’re utterly unprepared for it.
In order to emphasize this technology’s capacity to produce large quantities of disinformation, I settled on the idea of a never-ending conversation. I only needed a large language model—fine-tuned on texts written by each of the two participants—and a simple program to control the back-and-forth of the conversation, so that its flow would feel natural and believable.
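The control program can be very small. The sketch below is an assumption about its general shape, not the code behind the project: the stub speakers stand in for calls to a fine-tuned language model, and every name here is hypothetical. It simply alternates between the two participants forever and hands each one the conversation so far.

```python
from itertools import cycle, islice

def make_stub_speaker(name):
    """Return a placeholder reply function; a real one would call a language model."""
    def reply(history):
        return f"{name} responds to turn {len(history)}"
    return reply

def infinite_conversation(speakers, opening):
    """Yield (speaker, text) pairs without end, alternating between speakers."""
    history = [("moderator", opening)]
    for name, reply in cycle(list(speakers.items())):
        text = reply(history)          # each speaker sees everything said so far
        history.append((name, text))
        yield name, text

speakers = {
    "Herzog": make_stub_speaker("Herzog"),
    "Žižek": make_stub_speaker("Žižek"),
}
# The generator never stops on its own, so take only the first four turns.
first_turns = list(islice(infinite_conversation(speakers, "Let us begin."), 4))
```

Writing the loop as a generator keeps the conversation open-ended while leaving it to the caller to decide how much of it to consume.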
At their very core, language models predict the next word in a sequence, given a series of words already present. By fine-tuning a language model, it is possible to replicate the style and concepts that a specific person is likely to speak about, provided that you have abundant conversation transcripts for that individual. I decided to use one of the leading commercial language models available. That’s when it dawned on me that it’s already possible to generate a fake dialogue, including its synthetic voice form, in less time than it takes to listen to it. This provided me with an obvious name for the project: Infinite Conversation. After a couple of months of work, I published it online last October. The Infinite Conversation will also be displayed, starting February 11, at the Misalignment Museum art installation in San Francisco.
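Next-word prediction can be illustrated with a toy far simpler than any commercial model. The sketch below counts which word follows which in a short made-up text and predicts the most frequent follower. Modern language models use neural networks and far more context, so treat this only as an illustration of the principle; the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, how often each other word comes right after it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text: "sat" follows "cat" twice, "slept" only once.
corpus = "the cat sat on the mat the cat sat on the rug the cat slept"
model = train_bigram(corpus)
guess = predict_next(model, "cat")
```

Fine-tuning follows the same intuition: feed the model text from one particular speaker, and its guesses about the next word drift toward that speaker's habits.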
Once all the pieces fell into place, I marveled at something that hadn’t occurred to me when I started the project. Like their real-life personas, my chatbot versions of Herzog and Žižek often converse about topics of philosophy and aesthetics. Because of the esoteric nature of these topics, the listener can temporarily ignore the occasional nonsense that the model generates. For example, AI Žižek’s view of Alfred Hitchcock alternates between seeing the famous director as a genius and as a cynical manipulator; in another inconsistency, the real Herzog notoriously hates chickens, but his AI imitator sometimes speaks about the fowl compassionately. Because actual postmodern philosophy can read as muddled, a problem Žižek himself noted, the lack of clarity in the Infinite Conversation can be interpreted as profound ambiguity rather than impossible contradictions.
This probably contributed to the overall success of the project. Several hundred of the Infinite Conversation’s visitors have listened for over an hour, and in some cases people have tuned in for much longer. As I mention on the website, my hope for visitors to the Infinite Conversation is that they not dwell too seriously on what is being said by the chatbots, but gain awareness of this technology and its consequences; if this AI-generated chatter seems plausible, imagine the realistic-sounding speeches that could be used to tarnish the reputations of politicians, scam business leaders or simply distract people with misinformation that sounds like human-reported news.
But there is a bright side. Infinite Conversation visitors can join a growing number of listeners who report that they use the soothing voices of Werner Herzog and Slavoj Žižek as a form of white noise to fall asleep. That’s a use of this new technology I can get into.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.