Human-Like Chatbots: Benefits, Dangers, and Possibilities
Chatbots are all around us, and their presence in our daily lives will only increase in the near future. But despite a burst of attention and investment, building a human-like chatbot remains a challenge.
June 4, 2020
From booking flights to helping diagnose medical issues, chatbots are all around us. What’s more, the presence of chatbots in our daily lives will only increase in the near future. According to the research firm Research and Markets, the global chatbot market will expand from $2.6 billion in 2019 to $9.4 billion in 2024—an almost 30 percent annual growth rate.
Despite this burst of attention and investment, however, building a human-like chatbot remains frustratingly out of reach. Even today’s most sophisticated chatbots don’t come close to the breadth, depth, and skill that an average human being possesses.
Given this persistent gap, we need to ask several questions. What would it mean for a chatbot to be judged human-like? What is the current state of the art, and which techniques are used to make chatbots more human? And what would the repercussions for society be, in terms of both advantages and drawbacks?
What is a human-like chatbot, anyway?
The field of artificial intelligence (AI) has long debated whether it should focus on building machines that behave humanly or on those that behave rationally. Yet creating human-like chatbots is complicated by the fact that we have trouble defining what it is to be human in the first place—let alone building chatbots that imitate our behavior.
Humans are irrational creatures. We are constantly led astray by emotions and cognitive biases that are very far from the cold, calculating circuits of a machine. That hasn’t prevented us, however, from trying to build chatbots that behave in an entirely human manner.
The Turing test
The most well-known method for assessing whether a chatbot is human-like is the Turing test, first proposed by computer scientist Alan Turing in 1950. Turing posited that a machine would pass the test if human interrogators could not reliably tell the difference between responses from the machine and those from another human.
Using the Turing test as the prevailing metric has both advantages and disadvantages for assessing a chatbot’s “humanity.” First, the test is simple and easily understandable, even by laymen. Second, the test evaluates much of what we consider to be human: the ability to comprehend and produce natural language; the ability to learn, reason, and extrapolate from limited information; and the ability to account for emotions and context.
As a barometer for human-like chatbots, however, the Turing test also has some disadvantages. For one, the test is highly dependent on the human interrogator. Experts in philosophy and computer science will likely be better at discerning between humans and machines than will the general public. The test might also be exploited by a machine that behaves in an unintelligent, yet distressingly human, manner—for example, spewing insults, lying, remaining silent, or giving incoherent answers.
Every year, Turing’s idea is put to the test at the Loebner Prize, a competition for the most convincingly human chatbot. To date, no chatbot has ever successfully fooled half the Loebner Prize judges—the necessary criterion for winning the event’s grand prize.
In the decades since the Turing test’s introduction, researchers have proposed several evolutions and variations, including:
-The Feigenbaum test, which measures a machine’s domain knowledge about a particular field.
-The Minimum Intelligent Signal Test (MIST), which only allows for binary yes/no or true/false answers, using questions that test knowledge of basic facts in science, history, and culture.
-The visual Turing test, which determines a machine’s ability to understand visual images.
While it’s the best-known metric, the Turing test isn’t the only way to measure the “humanity” of a chatbot. Various AI researchers have proposed alternative metrics for assessing the strength of a chatbot’s conversational skills.
In January 2020, Google released the Meena chatbot, a state-of-the-art conversational model. (More on Meena follows.) To evaluate Meena and other chatbots, Google proposed a human evaluation metric which it calls the Sensibleness and Specificity Average (SSA). For each utterance that a chatbot makes, human evaluators grade it along two scales:
-Does the response make sense?
-Is the response specific?
Given the prompt “I love tennis,” for example, a chatbot might respond with:
-“Me too, I love Roger Federer!” (sensible and specific)
-“That’s nice.” (sensible but not specific)
-“Me too, I love Cristiano Ronaldo!” (specific but not sensible)
-“I don’t know.” (not sensible and not specific)
While metrics such as SSA offer a viable alternative to the Turing test, they should be approached with the same skepticism we apply to the Turing test itself: any metric is only as reliable as the human judgments behind it.
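Google reported SSA as the simple average of per-utterance sensibleness and specificity rates. A minimal sketch of that aggregation, using the four example responses above (the `Rating` structure and names here are our own, not Google's code):

```python
from dataclasses import dataclass

@dataclass
class Rating:
    """One human judgment of a single chatbot utterance."""
    sensible: bool   # does the response make sense in context?
    specific: bool   # is it specific to the prompt, not generic?

def ssa(ratings: list[Rating]) -> float:
    """Sensibleness and Specificity Average: the mean of the fraction
    of sensible responses and the fraction of specific responses."""
    n = len(ratings)
    sensibleness = sum(r.sensible for r in ratings) / n
    specificity = sum(r.specific for r in ratings) / n
    return (sensibleness + specificity) / 2

# The four example responses to "I love tennis" from the text above:
ratings = [
    Rating(sensible=True,  specific=True),    # "Me too, I love Roger Federer!"
    Rating(sensible=True,  specific=False),   # "That's nice."
    Rating(sensible=False, specific=True),    # "Me too, I love Cristiano Ronaldo!"
    Rating(sensible=False, specific=False),   # "I don't know."
]
print(ssa(ratings))  # 0.5
```

Half the responses are sensible and half are specific, so this toy chatbot scores an SSA of 0.5.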
Other criteria for assessing a chatbot’s “humanity” include personality and factuality. Humans typically have a personality that is consistent over time and across conversations. In addition, humans often display in-depth knowledge about select topics, such as music festivals or Arnold Schwarzenegger, while being unfamiliar with other topics.
How to build a human-like chatbot
Perhaps the greatest challenge in building a human-like chatbot is that there are many, many ways to go wrong and far fewer ways to do it right. A single grammatical mistake or misinterpretation can instantly shatter the illusion for users.
The need for data
While the first chatbots such as ELIZA were hand-coded, research has now shifted to a data-driven approach, using techniques from AI, machine learning, and natural language processing (NLP). Thus far, the most successful way of making chatbots more human has been to train them on massive text corpora, just as humans learn language by listening to and reading vast quantities of information.
By training on this data, machines grow to understand the relationships between words, and how each word is used in context—essentially, learning the implicit rules of grammar and vocabulary. The input text data must also be drawn from diverse sources so that the resulting chatbot is familiar with a wide range of topics.
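At toy scale, the idea of learning word relationships from raw text can be illustrated with a bigram model, which simply counts which words follow which. Real chatbots learn far richer, contextual representations, and the tiny corpus below is invented purely for illustration:

```python
from collections import defaultdict, Counter

def train_bigrams(corpus: str) -> dict[str, Counter]:
    """Count, for each word, which words follow it in the corpus.
    Even this crude statistic captures a sliver of 'how each word
    is used in context'; large models learn vastly more."""
    following: dict[str, Counter] = defaultdict(Counter)
    words = corpus.lower().split()
    for word, nxt in zip(words, words[1:]):
        following[word][nxt] += 1
    return following

# A hypothetical miniature corpus:
corpus = "i love tennis . i love federer . tennis is fun ."
model = train_bigrams(corpus)

# After "i", the model has only ever seen "love":
print(model["i"].most_common())  # [('love', 2)]
```

Scaling this idea up, from counting adjacent words to modeling long-range context with neural networks trained on gigabytes of diverse text, is precisely what the data-driven approach described above does.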
GPT-2: A breakthrough for NLP
One recent development that demonstrates the success of this approach is OpenAI’s GPT-2, a generative language model that was trained on 40 gigabytes of internet-sourced data, totaling 8 million documents. Upon its release in February 2019, GPT-2 made waves for its ability to “write” synthetic text samples based on a brief prompt.
Following is one such human-written prompt:
In a shocking finding, scientist [sic] discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Given this prompt, the GPT-2 model returned a short article beginning with the following paragraphs:
The scientists named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
(Note that this text is a cherry-picked example representing the best of 10 tries by OpenAI researchers, based on the given prompt.)
This example has a plausible narrative, grammatical correctness, and fluid, complex dependencies—for example, the reference to the scientist across multiple sentences. However, there are still noticeable errors that prevent it from being entirely authentic (e.g., the reference to “almost two centuries” for a species that has just been discovered).
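Samples like the one above are produced one token at a time: the model assigns a probability to each possible next word, a word is drawn, and the process repeats. OpenAI generated its GPT-2 samples with truncated, "top-k" sampling, which discards the improbable tail of the distribution before drawing. A toy sketch of that step (the probability table is invented for illustration):

```python
import random

def sample_top_k(probs: dict[str, float], k: int, rng: random.Random) -> str:
    """Keep only the k most likely next tokens, renormalize their
    probabilities, and sample one. Truncating the tail avoids the
    incoherent low-probability words that plain sampling can emit."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    total = sum(p for _, p in top)
    weights = [p / total for _, p in top]
    return rng.choices(tokens, weights=weights)[0]

# Hypothetical next-token probabilities after the prefix "The scientists":
probs = {"named": 0.4, "found": 0.3, "said": 0.2,
         "purple": 0.05, "zebra": 0.05}
rng = random.Random(0)
token = sample_top_k(probs, k=3, rng=rng)
print(token)  # always one of "named", "found", or "said"
```

With k=3, the implausible continuations "purple" and "zebra" can never be drawn, however many samples are taken.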
Chatbots and deep learning
The race to build a human-like chatbot is now focused on deep learning, a machine learning technique that relies on vast quantities of unlabeled input data. Deep learning uses very large neural networks—loose models of the human brain composed of many interconnected, artificial “neurons.”
With deep learning now the dominant approach, beating the current state of the art largely depends on using better corpora and smarter network architectures. Google’s Meena and GPT-2 are both built on the Transformer architecture for neural networks. Transformer architectures are more easily parallelizable than alternatives such as long short-term memory (LSTM) networks, which enables them to be trained on much larger datasets.
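The parallelism advantage shows up in the Transformer’s core operation, scaled dot-product attention: each position’s output depends only on the fixed query, key, and value inputs, never on the previous position’s output, so all positions can be computed simultaneously. An LSTM, by contrast, must process tokens one after another. A toy pure-Python version, for illustration only:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of plain-Python vectors.
    Each output depends only on Q, K, and V: there is no step-by-step
    recurrence, so (unlike an LSTM) every query position could be
    computed in parallel."""
    d = len(keys[0])
    outputs = []
    for q in queries:  # each iteration is independent of the others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # softmax over the keys
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# One query attending over two key/value pairs:
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Because the query matches the first key more closely, the output is weighted toward the first value vector, roughly [1.66, 2.66] rather than the unweighted midpoint [2.0, 3.0].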
The current state of human-like chatbots
Released by Google in January 2020, Meena uses the Evolved Transformer seq2seq architecture, with 2.6 billion parameters in the network. The Meena chatbot was trained on 341 gigabytes of text from social media platforms, over eight times larger than the training data for GPT-2.
That training corpus amounts to roughly 40 billion words. Meena models conversations as trees, in which the first message is the root node and each response is a child of the message it answers. Following is a short example dialogue between a human user and Meena:
Meena: Hi! How are you this fine evening?
User: I’m doing well. it’s morning my time!
Meena: Good morning! What are your plans for today?
User: Get my work done then watch a movie at home in the night
Meena: Sounds fun! What movie?
User: I haven’t decided yet, perhaps an Oscar-nominated one. any suggestion?
Meena: I’m not too familiar with recent Oscar winners, but I do know I wanted to see The Grand Budapest Hotel.
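The conversation-tree representation mentioned above can be sketched with a small node class. Each turn in the tree, paired with the path of turns that preceded it, yields a (context, response) training example; the class and helper names here are our own invention, not Google’s code:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One message in a conversation tree. The first message is the
    root; each reply is a child of the message it responds to."""
    text: str
    children: list["Turn"] = field(default_factory=list)

    def reply(self, text: str) -> "Turn":
        child = Turn(text)
        self.children.append(child)
        return child

def training_pairs(node: Turn, context: tuple[str, ...] = ()):
    """Walk the tree, yielding (context, response) pairs: each turn
    is a response to the sequence of turns above it."""
    for child in node.children:
        yield (context + (node.text,), child.text)
        yield from training_pairs(child, context + (node.text,))

root = Turn("Hi! How are you this fine evening?")
a = root.reply("I'm doing well. It's morning my time!")
a.reply("Good morning! What are your plans for today?")
root.reply("Pretty tired, honestly.")  # a second branch off the root

pairs = list(training_pairs(root))
```

Note that a single tree with branching replies produces multiple training pairs, one per edge, which is what makes threaded social media data such a rich source of conversational training examples.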
DialoGPT is a conversational model released by Microsoft Research in November 2019. The DialoGPT model is based on OpenAI’s GPT-2, and was trained on 12 years of posts from the social news aggregation website Reddit totaling 27 gigabytes. The full DialoGPT model contains 762 million parameters.
According to Microsoft, DialoGPT responses are “comparable to human response quality under a single-turn conversation Turing test.” Following is a short example dialogue between a human user and DialoGPT:
User: Does money buy happiness?
DialoGPT: Depends how much money you spend on it .
User: What is the best way to buy happiness ?
DialoGPT: You just have to be a millionaire by your early 20s, then you can be happy .
User: This is so difficult !
DialoGPT: You have no idea how hard it is to be a millionaire and happy . There is a reason the rich have a lot of money
Not to be outdone by Google and Microsoft, Facebook has released its own state-of-the-art chatbot. In April 2020, Facebook AI Research (FAIR) released Blender, which it calls the largest open-source chatbot ever released. The Blender chatbot uses 9.4 billion parameters, almost four times as many as Meena.
Facebook engineers designed Blender to have a broad general knowledge base by training it on Wikipedia articles, then fine-tuning it on actual conversations between real human beings. The bot’s realism is also improved by giving it a consistent personality and identity and the ability to display empathy. The total body of work is available in a new dataset called BlendedSkillTalk.
In a head-to-head chatbot comparison, a majority of human evaluators said that Blender was more human-like and more engaging to speak with than Meena. Following is a sample dialogue between a human user and Blender:
User: If you could have dinner with one person, dead or alive, who would that be?
Blender: That's a tough one. I think I'd have to go with Steve Jobs. I'd love to pick his brain.
User: Ah, interesting choice. What do you know about Steve Jobs?
Blender: He was one of the most influential people in the history of technology. He was a visionary.
User: What questions would you ask him?
Blender: I'd want to know how he came up with the idea for the iPod. It was so revolutionary at the time.
User: Totally...the iPod changed the way everyone listens to music.
Blender: It really did. It's crazy to think that it was released in 2001. That was a long time ago.
The benefits (and dangers) of human-like chatbots
Meena, DialoGPT, and Blender are all highly impressive efforts from the world’s tech titans—but they still aren’t equal to the fluidity and authenticity of conversations with real human beings. So what will happen when a truly human-like chatbot arrives on the scene?
It’s easy to imagine a variety of benefits from human-like chatbots. For example, customer service is one of the biggest use cases for chatbots, yet 86 percent of consumers say that they still prefer to speak with a human agent. Much of this resistance is likely due to chatbots’ less-than-human language skills; as bots approach truly human-like conversation, both companies and consumers should become more willing to rely on them.
However, truly human-like chatbots also present substantial dangers. It’s easy to imagine chatbots that are used to impersonate a real human being—for example, in order to automate cybercrimes, such as scamming and phishing. In addition, chatbots that can simulate a realistic human conversation will still not be the vaunted artificial general intelligence (AGI) that would be as smart as a human being. Using chatbots in the wild, no matter how human-like, will still require human supervision to ensure they don't "go rogue."
Despite the potential risks involved, human-like chatbots have many possible benefits and use cases: improving the customer experience, automating parts of business workflows (such as marketing and lead generation), providing social companionship, bringing fictional characters to life, and much more. Yet while there have been many advancements made in the past year alone, it's clear that we're still far from chatbots that can truly be considered human-like. The coming months and years will be very exciting as we see what lies ahead for the field.