Why the Turing Test Became Obsolete

What's generally called "The Turing Test" is intended to tell humans apart from machines pretending to be humans. That distinction between "human-made" and "machine-made" looks more relevant each day, doesn't it?
In this post, I'll briefly explain how the Turing Test gained the status of the ultimate proof of human-level intelligence, and then how that status was lost. I'll also mention some good alternatives for testing human-level (and even superhuman) cognition.
Sound measurement of the cognitive abilities of machines is a must for data professionals, as we are flooded with "intelligence" or even "consciousness" claims that are mostly noise and hype.
What is the Turing Test?
Turing himself called his test "the imitation game," but here things become murky, because that's also the name of a movie about Turing's personal life and struggles, which were considerable.
While I liked the movie, as well as how Benedict Cumberbatch delivered the lead role, the "imitation game" is not the same as the "Turing machine," which is not the same as the "Enigma machine." The film confused them all.
Look, I taught a course on Automata Theory for more than ten years at my university, and the final topic was the Turing Machine. It is just one of several types of automata, such as finite-state automata, stack automata, and so on.
I can tell you that the Turing Machine has nothing to do with the Enigma machine, which was a device the Germans used to encipher the messages they sent during the war. Turing worked for British intelligence to decipher those messages so the Allies could see what the Germans were up to.
Unfortunately, Turing's great service wasn't enough to spare him from a trial over his homosexuality. If you saw the movie, you know Turing was forced to endure chemical castration, which left emotional scars so deep that he later committed suicide.
I hope this is enough to clarify that the Turing machine and the Enigma machine were both related to Turing but had nothing to do with the Turing Test.
The imitation game, later called the "Turing Test," is a thought experiment proposed by Turing in 1950. He imagined a possible "kind of intelligent" machine pretending to be a human, as well as two humans, let's call them Human 1 and Human 2. Human 1 is the "judge" and communicates with either Human 2 or the machine using a teletype, but s/he doesn't know which one it is. Whoever is at the other end, Human 2 or the machine, tries to convince the judge that they are human, chatting about any conceivable topic for a limited but substantial time. At the end of the interaction, Human 1 says whether s/he was chatting with a human or a machine.
If, across a substantial number of judges, the machine is regularly mistaken for a human, then it "succeeds" at the Turing Test.
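To make the pass criterion concrete, here's a minimal Python sketch of the protocol. Everything in it is a stand-in I made up for illustration: the canned replies, the random "judge," and the two-question session format. It only shows the bookkeeping of the test, not real judging.

```python
import random

# Stand-in participants; in a real test these would be chat interfaces
# to a person and to the candidate machine.
def machine_reply(prompt: str) -> str:
    return "I enjoy long walks and discussing the weather."

def human_reply(prompt: str) -> str:
    return "Honestly, it depends on the day."

def judge_verdict(transcript: list[str]) -> str:
    # A stub judge that guesses at random; a real judge would read
    # the transcript and decide based on the conversation itself.
    return random.choice(["human", "machine"])

def run_session() -> tuple[str, str]:
    """One session: the judge chats with a hidden respondent and labels it."""
    respondent, truth = random.choice(
        [(machine_reply, "machine"), (human_reply, "human")]
    )
    transcript = []
    for question in ["What do you do for fun?", "Tell me about your family."]:
        transcript.append(question)
        transcript.append(respondent(question))
    return truth, judge_verdict(transcript)

# Aggregate over many sessions: the machine "passes" if it is mistaken
# for a human often enough (what counts as "often enough" is debatable).
results = [run_session() for _ in range(1000)]
machine_verdicts = [v for t, v in results if t == "machine"]
fooled_rate = machine_verdicts.count("human") / len(machine_verdicts)
print(f"Machine judged 'human' in {fooled_rate:.0%} of its sessions")
```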
Turing never implemented the imitation game; there were no "intelligent machines" at the time. But just six years later, the field of "Artificial Intelligence" was founded at the Dartmouth workshop, where John McCarthy coined the term. Many years later, I met McCarthy at a conference, and at lunchtime, I was lucky enough to be seated right next to him. He told me then that the term "Artificial Intelligence" was more about getting funded than anything else. He did get his research funded, and for better or for worse, we are still stuck with the name.
As early as 1966, there was a conversational system called ELIZA, built by Joseph Weizenbaum, that could be seen as a challenger for the Turing Test. It simulated a psychotherapist of the Rogerian school consulting with the user, and it very often turned the user's answers into new questions, so it looked "empathetic" in a way familiar to anyone who has had psychological help.
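Under the hood, ELIZA was little more than keyword matching plus canned "reflections." The snippet below is not Weizenbaum's code, just my minimal Python sketch of the idea: match a pattern, swap first- and second-person words, and bounce the user's own statement back as a question.

```python
import re

# Swap first-person words for second-person ones, so a user's
# statement can be mirrored back ("my eyes" -> "your eyes").
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are", "i'm": "you're"}

# (pattern, response template) pairs: a tiny stand-in for ELIZA's
# real script, which had many more rules and ranked keywords.
RULES = [
    (r"i'?m (.*)", "Why do you say you are {0}?"),
    (r"i have (.*)", "How long have you had {0}?"),
    (r"because (.*)", "Is that the real reason?"),
    (r".*", "Please tell me more."),  # fallback, like ELIZA's defaults
]

def reflect(fragment: str) -> str:
    return " ".join(REFLECTIONS.get(w, w) for w in fragment.split())

def respond(user_input: str) -> str:
    text = user_input.lower().strip(" .!?")
    for pattern, template in RULES:
        match = re.match(pattern, text)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return "Please go on."

print(respond("I'm having some disturbing dreams."))
# -> Why do you say you are having some disturbing dreams?
print(respond("I have black shadows around my eyes."))
# -> How long have you had black shadows around your eyes?
```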
As weird as it may seem, you can run ELIZA right now! It's embedded in systems like the EMACS editor, which I've been using for nearly 40 years (not the same distribution, obviously). Now I use the "Aquamacs" distribution for Mac. In EMACS, you summon ELIZA as follows:
Esc-x doctor (that is, press the "Esc" key at the top left of the keyboard, then "x", then type "doctor" and press Enter).
That's it. You'll be greeted by the virtual "psychiatrist."
My recent conversation with ELIZA went as follows:
ELIZA: How do you do? What brings you to see me?
Me: Hello, I'm consulting you because I'm having some disturbing dreams.
ELIZA: I would appreciate it if you would tell me about your plans.
Me: Plans? I don't have much plans, as I'm lacking sleep because of the dreams I'm having…
ELIZA: Are you sure that is the real reason?
Me: Yes, I'm pretty sure. Can't you see my tired eyes?
ELIZA: Why do you say that?
Me: It's obvious to me that I look tired.
ELIZA: What makes you believe it's obvious to you that you look tired?
Me: I have black shadows around my eyes.
ELIZA: Maybe your life has something to do with this.
… (I cut it here)
According to Weizenbaum, many people interacting with ELIZA—including his secretary—attributed human feelings to the contraption. Of course, by today's standards, ELIZA interactions don't look natural anymore.
What took the Turing Test down
Once AI took off, for a number of years, there were competitions inspired by the Turing Test, the best known of which was the Loebner Prize, which ran yearly from 1991 to 2019 with few exceptions. It granted cash prizes of around $2,500 to the winner.
The Loebner Prize was not seen as a prestigious scientific competition. Rather, many in academia considered it a "publicity stunt." Famously, Marvin Minsky, one of the AI founders, went as far as offering a "prize" to anyone who could stop the competition.
Why such scathing critiques? The competition was more of a social event than a scientifically controlled experiment, with poorly qualified judges, overly short interactions (two and a half minutes), and an orientation toward "whimsical" conversation instead of verifiable answers.
But the final blow to the Turing Test came in 2012, when a chatbot named "Eugene Goostman" was declared to have "passed the Turing Test." Wait, this sounds like a victory, so why is it a blow to the Turing Test?
Developed in St. Petersburg by two Russians and one Ukrainian (how rare this sounds nowadays), Eugene Goostman was cleverly designed as a 13-year-old Ukrainian boy, so he was not very fluent in English and not even aware of some details of American culture. This cunning design was intended to make the judges forgiving. Earlier, it had finished second in the 2005 and 2008 editions of the Loebner Prize.
So when, in 2012 (the year of Alan Turing's 100th birthday), it convinced 29% of its judges that it was human, it became big news. But in a bad way…
Many academics, like Hector Levesque of the University of Toronto, argued that a good test of intelligent behavior should be more than a clever trick for deceiving a bunch of gullible judges.
Levesque then published the paper listed in the references below, where he not only questions the Turing Test as an adequate way of testing cognition but also proposes something to use instead.
The following year, Levesque was awarded the prestigious IJCAI Award for Research Excellence. Was it a sort of endorsement of his critique of the Turing Test?
What will replace the Turing Test?
In that influential paper demolishing the Turing Test, Levesque proposed something he didn't invent but named and repurposed: the "Winograd Schema."
If the "Winograd" name rings a bell for you, it's for a good reason: Terry Winograd was the Ph.D. Thesis advisor of Larry Page, Google's co-founder. Actually, I met Larry Page in real life at the Google headquarters in a "Faculty Summit" session, and he told us a personal anecdote about his work with Terry Winograd:
When Larry was about to start his Ph.D., he contacted a possible advisor, Dr. Terry Winograd, and asked him about suitable doctoral projects. Winograd came up with not one but two possible projects. The problem was that the projects couldn't have been more different from one another:
One was about self-driving cars, which were, at the time, a robotics research project far from any real-world application.
The other one was to investigate the structure of the internet to see how search could be done more efficiently.
After agonizingly weighing the advantages of each project for a week, he chose the second one, and the rest is history.
But back to the subject: Winograd also proposed a common-sense reasoning task based on disambiguating pronouns in sentences. For instance, consider the following sentence from his 1971 report (reference below):
"The city councilmen refused the demonstrators a permit because they feared violence."
The question is, "Who feared violence?" The city councilmen or the demonstrators?
For a human, it's easy to come up with the "city councilmen" answer, but for a machine it's not, because background assumptions are involved, such as the role a city council is supposed to play.
These kinds of questions are good candidates for assessing the commonsense reasoning of a machine because:
- They're not about deceiving anybody;
- There is a correct answer and a wrong one;
- They involve reasoning as well as handling common knowledge.
So Levesque coined the term "Winograd Schema" for this kind of test, and in the referenced paper below (I encourage you to read it; it's not too hard), we can find a collection of dozens of such sentences.
We could say the paper contains the first test set for AI chatbots, written before they even existed (obviously, you couldn't put ELIZA to this test).
As an example, I asked both Microsoft Copilot and Google Bard the following Winograd quiz:
Tom threw his school bag down to Ray after he reached the top of the stairs. Who reached the top of the stairs? Answers: Tom / Ray.
You can see that the problem is to disambiguate the "he" pronoun in "…he reached."
Both Bard and Copilot got the right answer (Tom). Bard 1, Copilot 1.
Then I used the twin formulation typical of Winograd schemas, where a single changed word flips the answer:
Tom threw his school bag down to Ray after he reached the bottom of the stairs. Who reached the bottom of the stairs? Answers: Tom / Ray.
Here Copilot gave the wrong answer (Tom) while Bard got it right (Ray). Bard 2, Copilot 1.
Of course, my little experiment here is not conclusive evidence of better cognitive capabilities in Bard; much more testing is needed. This is why standard test collections have been set up.
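To go beyond a couple of hand-run examples, this kind of quiz is easy to automate. Below is a minimal, hypothetical harness of my own: ask_model is a placeholder you'd wire to whatever chatbot API you're testing, and the two items are just the pair from above; a real evaluation would loop over hundreds of schemas.

```python
# Each Winograd schema item: a sentence, a question, the two candidate
# referents, and the single correct answer.
SCHEMAS = [
    {
        "sentence": "Tom threw his school bag down to Ray "
                    "after he reached the top of the stairs.",
        "question": "Who reached the top of the stairs?",
        "options": ("Tom", "Ray"),
        "answer": "Tom",
    },
    {
        "sentence": "Tom threw his school bag down to Ray "
                    "after he reached the bottom of the stairs.",
        "question": "Who reached the bottom of the stairs?",
        "options": ("Tom", "Ray"),
        "answer": "Ray",
    },
]

def ask_model(prompt: str) -> str:
    """Placeholder: wire this to the chatbot you want to evaluate."""
    raise NotImplementedError

def score(schemas: list[dict]) -> float:
    """Fraction of schemas where the model names the right referent."""
    correct = 0
    for item in schemas:
        opt1, opt2 = item["options"]
        prompt = (
            f"{item['sentence']}\n{item['question']} "
            f"Answer with exactly one name: {opt1} or {opt2}."
        )
        correct += int(ask_model(prompt).strip() == item["answer"])
    return correct / len(schemas)

# score(SCHEMAS) would be 1.0 for a model resolving both pronouns correctly.
```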
The WinoGrande quiz set, composed of 44k questions, is the most comprehensive collection of Winograd Schemas I have found.
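As far as I know, WinoGrande can also be downloaded through the Hugging Face datasets library; the dataset name, configuration, and field names below are what I recall from its public dataset card, so treat them as assumptions and double-check before relying on them.

```python
from datasets import load_dataset  # pip install datasets

# "winogrande_xl" should be the ~40k-item configuration; smaller
# configurations exist. The dataset may also live under "allenai/winogrande".
ds = load_dataset("winogrande", "winogrande_xl", split="validation")

item = ds[0]
# Each record has a sentence with a "_" placeholder, two candidate
# fillers, and "answer" as "1" or "2" pointing at the right one.
print(item["sentence"])
print(item["option1"], "/", item["option2"], "->", item["answer"])
```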
There are also other, similar collections of common-sense reasoning tests.

Collections like these are frequently used to assess chatbot cognitive performance, though each company chooses which results to publish; we can suspect that they mostly publish the results of tests where they excelled.
Closing thoughts
Once cognitive abilities can be measured with standard test collections, the question is no longer about "consciousness" or "real intelligence"; it becomes just a matter of gradually improving scores on the scale.
By the way, humans normally don't get the maximum possible score on these test collections, so a perfect score would be "superhuman" performance.
As I have pointed out in previous posts, I'd dismiss human-oriented tests such as the GRE used for school admissions: their time constraints are meaningful for humans but meaningless for machines. It would be like pitting a human against a calculator in arithmetic. In my view, running machines through those tests only chases sensational headlines and produces clickbait, not real value.
Don't get me wrong: Alan Turing provided a ground-breaking framework for testing intelligent behavior with his "imitation game." However, as machines' cognitive abilities improved, it became necessary to replace the Turing Test with more precise and relevant measuring tools.
Turing was a pioneer in machine cognition, perhaps a genius, but not an up-to-date scientist by today's standards.
References
- Levesque, Hector, Ernest Davis, and Leora Morgenstern. "The Winograd schema challenge." Thirteenth international conference on the principles of knowledge representation and reasoning. 2012.
- Kejriwal, Mayank, and Ke Shen. "Do Fine-tuned Commonsense Language Models Really Generalize?" arXiv preprint arXiv:2011.09159 (2020).
- Winograd, Terry. "Procedures as a representation for data in a computer program for understanding natural language." MIT report (1971).
Get AI and tech news, with commentary and a healthy dose of skepticism, by subscribing to my free "SkepTech" newsletter at https://rafebrena.substack.com/