The Turing Test – Time for Change?
The Holy Grail of conversational chatbot design is to create a computer program that can pass the famous Turing test—that is, to make one which is indistinguishable from a human in everyday general conversation. However, I believe that the Turing test as it currently stands has serious flaws, which I’ll discuss in this article, and I’ll offer an alternative to the test.
August 1, 2019
The Turing test in full flow - are these judges talking to a computer or a human?
So, what’s the point of the Turing test?
My main objection to the test itself is simple: why even bother? I’m sure back in Alan Turing’s day, having a machine appear indistinguishable from a human must have seemed like the peak of technological achievement. However, computers have moved on a lot since the 1950s. They can process and store information far faster and more efficiently than any human. Internet access now means we have the entire knowledge of the world at our fingertips, something Alan Turing could only have dreamt about. Let’s take an example in which we ask a question in a typical Turing test, and see if you can guess which of the two responses came from the computer.
Judge: How many people live in Brazil?
Response 1: No idea but I bet it’s at least a couple of hundred million people.
Response 2: The current population of Brazil is 212,328,972 as of Monday, June 17, 2019, based on the latest United Nations estimates. Brazil’s population is equivalent to 2.75% of the total world population. It has a population density of 24.66 people per square kilometre (62 per square mile), which ranks 5th in the world.
Not too hard to spot the computer, was it? The mechanical knowledge retrieval of Response 2 easily gave it away and, while Response 1 was definitely human-like, it wasn’t more intelligent or even useful. Imagine typing the same question into your favourite search engine and getting a virtual shrug of the shoulders and a vague “I dunno” in reply. Trying to fool people into thinking they are talking to a real person may be a fun party trick, but it has little practical application, in my opinion. The computers are often programmed to produce deliberate misspellings, backspacing, and grammatical errors in order to appear human-like and, for me, it’s this deliberate dumbing down of the bots that seems pointless. We should use artificial intelligence (AI) as a tool to help people, not to deceive them.
Also bear in mind that the Turing test is traditionally carried out with a judge typing to both a computer program and a real person, and it’s the judge’s task to decide which is which. This means that not only does the chatbot have to appear human-like, but it also has to be more human-like than the actual person!
In the Turing test, the judge must decide which of two “people” he is talking to is a human.
It’s not all bad news
One of the things I do like about the Turing test, though, is the way it encourages development of natural language understanding (NLU) in machines. The ability to type your message or simply talk to the computer without having to remember commands or syntax is going to be huge in the future, as it’s the way humans communicate with each other now. Also, people enjoy talking to computers simply for fun and entertainment rather than always trying to achieve a particular task. My bot, Mitsuku, has over 1.5 million interactions with people every month, but it always tells people it’s a computer program rather than pretending to be a real person.
So how can we improve the test in its current format?
Remove the pass/fail part of it
At the moment, the chatbot is said to have passed the Turing test if it can fool a human judge into thinking they are talking to a real person rather than a computer. Anyone who has even glanced at trying to design a program to achieve this goal will have soon realised what an impossible task this is. Human conversation is incredibly complicated to reproduce in a machine and, as the computer is solely judged on whether it can outperform its human opponent, it is highly unlikely that this goal will be achieved any time soon without resorting to tactics such as using non-English-speaking judges, or having the bot pretend not to understand English very well or not to be interested in taking part in the test.
I believe that, rather than a binary pass/fail judgment, the programs should be measured in stages of progress. Let’s face it, if we were checking on how well NASA were doing in sending an astronaut to Mars and, each month, they said, “No, we haven’t done it yet,” people would think it was a waste of time and no developments were being made. The government would probably also cut their funding. Instead, NASA give news about their progress and targets achieved, such as getting a man into orbit, landing on the moon, sending probes to Mars, and so on. It’s easy for the public to follow their progress instead of thinking we are no closer to landing on Mars than we were back in the 1900s.
Chatbots undergoing the Turing test today have no such luck. They are simply judged on whether they have passed or failed. This gives the impression to the general public that they haven’t evolved at all since the ELIZA days, which is untrue, as chatbots are far more sophisticated—and getting better all the time.
How can progress in the Turing test be measured?
Rather than being judged on a simple pass or fail, I would like to see a graded Turing test to get an idea as to where we currently are in comparison to a human level of intelligence. The closest we have to a regular Turing test is the annual Loebner Prize, where a panel of judges talk to humans and chatbots, in four rounds of 25 minutes each, before deciding on which program is the most human-like.
The problem is that the human volunteers in the contest are usually a mixture of university students, journalists, and other people who are generally well educated. Can a chatbot be more convincing than these opponents? No, the programs simply do not have the same wealth of life experience and knowledge that these humans have. So, each year, the chatbots fail to convince anyone they are human, the media reports that we are safe from Skynet, and the whole thing is forgotten about until next year, with people saying there has been no progress for decades.
And that’s the problem. No chatbot today is even close to outperforming a well-educated adult human. Nowhere near close. So how can we monitor the progress and developments of the world’s best chatbots?
An age-based Turing test
While it’s true that chatbots are far less convincing than an adult human, it’s probably safe to say that the best chatbots can perform better than a human baby with regard to language comprehension and answering Turing test-style questions, such as, “What colour is a red ball?” The baby cannot understand or reply to such input. The chatbot wins.
So now that the bots have outperformed a baby, let’s try them with a one- or two-year-old child. Very young children can speak and understand basic things about the world. They can answer the simple, “What colour is a red ball?” type questions but will struggle with something more complicated, such as, “John has five apples and eats two of them. How many does he have left?” A good chatbot can easily answer such queries and so wins again.
Now we try older children, maybe three or four years old. Some can answer the apple question just described but will struggle with, “Which is larger, a hydrogen atom or the planet Saturn?” as they do not have any knowledge of these objects. It would be unreasonable to expect a three-year-old child to know what an atom is.
We continue this process until the human volunteer can outperform the chatbot. I’m guessing that the best current chatbots are probably on the same level as an average five- or six-year-old child. Maybe in a few years, the bots will regularly outclass a seven-year-old. Nobody knows, but it would be great to show the general public that bots are making some progress instead of just being seen to fail each year.
Practicalities of running such a test
I would still keep the format of the Turing test, in that a judge exchanges messages with both a human and a computer program via an anonymous method, but it would be impossible for a baby or a very young child to type their responses back to the judge. To get around this issue, a parent could type the child’s responses until the chatbots reach the level of a six- or seven-year-old human, at which point the children could type for themselves.
To achieve a fairer sample than just a computer versus one child, we should run the contest with as many children and judges as possible. This would smooth out any rogue results, such as a child deliberately answering incorrectly or deciding they no longer wanted to take part in the contest, and it would also stop an exceptionally gifted child from skewing the outcome. The children should represent the average ability of their age group, and so, the larger the sample, the more accurate the results.
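As a rough illustration of why a larger sample helps, here is a minimal Python sketch. It assumes each child's result is simply the number of questions answered correctly, and it uses the median as the "typical" score for the age group. The median, the function names, and the sample scores are my own assumptions for illustration; they are not part of any existing contest.

```python
from statistics import median


def age_group_baseline(child_scores):
    """Return the median number of correct answers for one age group."""
    if not child_scores:
        raise ValueError("need at least one child score")
    return median(child_scores)


def bot_matches_age_group(bot_score, child_scores):
    """True if the bot scores at least as well as the typical child."""
    return bot_score >= age_group_baseline(child_scores)


# Hypothetical scores out of 20 questions for a group of five-year-olds.
# 0 = a child who stopped taking part, 20 = an exceptionally gifted child.
five_year_olds = [0, 11, 12, 12, 13, 14, 20]
baseline = age_group_baseline(five_year_olds)
print(baseline)
```

With these made-up numbers, the child who gave up and the gifted child barely move the baseline, which is the effect a large sample is meant to achieve. A mean or a trimmed mean would be equally reasonable choices.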
Let’s look at an example of a possible interaction between a judge, a computer, and a child, over several contests. The contest itself would have maybe 20 or 30 questions, depending on the attention span of the child. This could be reduced for very young children and increased as the chatbots improve over time. The first contest would feature newborn babies and should be held simply to try and find a base level for future contests. Here is a series of very basic contests with just one question as an example:
Judge: What do you call a shape with three sides?
Contest 1
Computer: A triangle
Zero-year-old human: [No response]
Contest 2
Computer: A triangle
One-year-old human: [*Gurgling noises*]
Contest 3
Computer: A triangle
Two-year-old human: I don’t know, Mummy.
Contest 4
Computer: A triangle
Three-year-old human: Is it a circle?
Contest 5
Computer: A triangle
Four-year-old human: A triangle
The computer here has demonstrated that it is equal to the intelligence level of a four-year-old human, for this particular question, and so the questions for the next contest should be increased in difficulty. Each contest should only include children older than the age of the computer’s current intelligence level. Let’s continue the contest now that the computer is equivalent to a four-year-old.
Contest 6
Judge: My name is John. What is my name?
Computer: My name is Botty.
Five-year-old human: John
Contest 7
Judge: How many brothers and sisters do you have?
Computer: I am an only child.
Five-year-old human: One older brother
In Contest 6, the child outperformed the computer, so Contest 7 again features five-year-old children. The bot was successful in Contest 7, so Contest 8 will feature six-year-old children and the media can now announce that chatbots are of equivalent intelligence to a five-year-old child. Bear in mind that these examples are very much simplified and that the actual contest should have many questions and human subjects, with the correct responses aggregated to find a winner.
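The progression through the contests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each contest is reduced to a pair of aggregated scores (bot, children), a tie counts as the bot matching that age group, and a loss means the next contest repeats the same age group. The function name and the data layout are hypothetical.

```python
def equivalent_age(contests, start_age=0):
    """Walk through contests in order and return the bot's equivalent age.

    Each contest is a (bot_score, child_score) pair for the age group
    currently being tested. Returns None if the bot never matches a group.
    """
    age_under_test = start_age
    matched_age = None
    for bot_score, child_score in contests:
        if bot_score >= child_score:
            # The bot kept up with this age group, so move to older children.
            matched_age = age_under_test
            age_under_test += 1
        # Otherwise the next contest repeats the same age group.
    return matched_age


# The seven one-question contests from the example, scored 1 for a
# correct answer and 0 otherwise.
contests = [(1, 0), (1, 0), (1, 0), (1, 0), (1, 1), (0, 1), (1, 1)]
print(equivalent_age(contests))  # 5
```

Run on the simplified example, the sketch reports an equivalent age of five, in line with the announcement described above. A real contest would feed in scores aggregated over many questions, children, and judges rather than single answers.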
I dare say that, once the children get to around eight or nine years old, it will probably take many more contests in order for the chatbots to develop further. But at least, using this method, progress can be seen to be made in the field of conversational AI and, who knows, maybe one day the computer will be equivalent to an average adult human. We’re entering the realms of fiction now, but if chatbots ever surpass the intelligence level of an average human, only then is it time to bring in the university-educated professors to be human subjects rather than including them as standard in today’s Turing tests.
These are just my initial thoughts, and I’m sure there could be improvements or additions to increase the reliability of the results. But it surely has to be better than the unrealistic expectation of the current Turing test in trying to outwit an adult human when we are nowhere near that stage yet.