1960s Chatbot ELIZA Beat OpenAI's GPT-3.5 In a Recent Turing Test Study (arstechnica.com)

Posted by BeauHD on Saturday December 02, 2023 @06:00AM from the would-you-look-at-that dept.

An anonymous reader quotes a report from Ars Technica: In a preprint research paper titled "Does GPT-4 Pass the Turing Test?", two researchers from UC San Diego pitted OpenAI's GPT-4 AI language model against human participants, GPT-3.5, and ELIZA to see which could trick participants into thinking it was human with the greatest success. But along the way, the study, which has not been peer-reviewed, found that human participants correctly identified other humans in only 63 percent of the interactions -- and that a 1960s computer program surpassed the AI model that powers the free version of ChatGPT. Even with limitations and caveats, which we'll cover below, the paper presents a thought-provoking comparison between AI model approaches and raises further questions about using the Turing test to evaluate AI model performance.

In the recent study, listed on arXiv at the end of October, UC San Diego researchers Cameron Jones (a PhD student in Cognitive Science) and Benjamin Bergen (a professor in the university's Department of Cognitive Science) set up a website called turingtest.live, where they hosted a two-player implementation of the Turing test over the Internet with the goal of seeing how well GPT-4, when prompted different ways, could convince people it was human. Through the site, human interrogators interacted with various "AI witnesses" representing either other humans or AI models that included the aforementioned GPT-4, GPT-3.5, and ELIZA, a rules-based conversational program from the 1960s. "The two participants in human matches were randomly assigned to the interrogator and witness roles," write the researchers. "Witnesses were instructed to convince the interrogator that they were human. Players matched with AI models were always interrogators."

The experiment involved 652 participants who completed a total of 1,810 sessions, of which 1,405 games were analyzed after excluding certain scenarios like repeated AI games (leading to the expectation of AI model interactions when other humans weren't online) or personal acquaintance between participants and witnesses, who were sometimes sitting in the same room. Surprisingly, ELIZA, developed in the mid-1960s by computer scientist Joseph Weizenbaum at MIT, scored relatively well during the study, achieving a success rate of 27 percent. GPT-3.5, depending on the prompt, scored a 14 percent success rate, below ELIZA. GPT-4 achieved a success rate of 41 percent, second only to actual humans. "Ultimately, the study's authors concluded that GPT-4 does not meet the success criteria of the Turing test, reaching neither a 50 percent success rate (greater than a 50/50 chance) nor surpassing the success rate of human participants," reports Ars. "The researchers speculate that with the right prompt design, GPT-4 or similar models might eventually pass the Turing test. However, the challenge lies in crafting a prompt that mimics the subtlety of human conversation styles. And like GPT-3.5, GPT-4 has also been conditioned not to present itself as human."

"It seems very likely that much more effective prompts exist, and therefore that our results underestimate GPT-4's potential performance at the Turing Test," the authors write.


Who the hell were the participants?? (Score:1)

by Viol8 ( 599362 ) writes:

Anyone with even a half functioning brain can tell that Eliza type chatbots are not human after only a few sentences. Who did they get for the study, special needs patients or maybe pre school children??
- Re:Who the hell were the participants?? (Score:5, Interesting)
  
  by Mr. Dollar Ton ( 5495648 ) writes: on Saturday December 02, 2023 @06:42AM (#64048637)
  
  You'd think so, but about 35 years ago we had set up an Eliza-like bot to chat with someone, who, after talking to it for an hour, proceeded to invite "her" on a date and even went there with flowers. And he wasn't anything like dumb.
  
  - Re:Who the hell were the participants?? (Score:5, Informative)
    
    by ShanghaiBill ( 739463 ) writes: on Saturday December 02, 2023 @09:55AM (#64048879)
    
    That's not a Turing test.
    In a Turing test, the questioner is told the subject might be a machine, so the questions are specifically designed to differentiate between a human and a machine.
    So you ask questions like:
    "John punched the old man. He was in the hospital for a week. Who was in the hospital?"
    "John punched Mike Tyson. He was in the hospital for a week. Who was in the hospital?"
    When talking to a potential date, you're trying to see if they are a nice person who maybe shares some interests with you. That is much easier to fake. If the questioner doesn't suspect, and isn't told, that they're talking to a machine, they could easily assume ELIZA is just a nice person.
    
    - Re: (Score:2)
      
      by jd ( 1658 ) writes:
      
      The Turing Test, done properly, will include questions designed to catch out machines, yes, but these aren't going to be limited to logic puzzles. They'll include subjective questions, paradoxical questions, questions with insufficient data, experiential questions - stuff that can't be answered simply by knowing more or through basic parsing.
      Ultimately, the Turing Test relies on the principle that if f(x)=g(x) for all x, then f=g.
      - Re: (Score:2)
        
        by arglebargle_xiv ( 2212710 ) writes:
        
        "How many giraffes are in this photo? [twitter.com]" is a great way to catch out "AI"'s.
    - Re: (Score:2, Flamebait)
      
      by Mr. Dollar Ton ( 5495648 ) writes:
      
      Did I say it was, smartass?
    - Re: (Score:2)
      
      by narcc ( 412956 ) writes:
      
      That's not a Turing test.
      Sure it is! It's not like it's rigorously defined. Outside that goofy contest anyway, but do you really think that 'amateur tournament' is really the best way to measure this? While a program like ChatGPT might have done well in the past, we've all gotten pretty good at spotting its output. Worse, the people aren't good at being objective, so we can expect things like labeling obvious bots as humans because they think the humans might be pretending to be machines.
      With that in mind, if you still buy into t
  - Re: (Score:2)
    
    by syn3rg ( 530741 ) writes:
    
    Lenny [lennytroll.com] has been regularly fooling telemarketers [reddit.com] for over 12 years [wikipedia.org] (admittedly not a high bar, but still).
    - Re: (Score:2)
      
      by Mr. Dollar Ton ( 5495648 ) writes:
      
      Hehe, that's funny.
- Re: (Score:2)
  
  by Opportunist ( 166417 ) writes:
  
  That's because we know Eliza's "trigger" questions that tip her hand. Try that again without knowing them.
  - Re: (Score:2)
    
    by Viol8 ( 599362 ) writes:
    
    Oh please. There was a web version of Eliza around, maybe still is. You can tell in less than a minute it's a bot. Unless perhaps you're as thick as you.
    - Re: (Score:3)
      
      by Opportunist ( 166417 ) writes:
      
      If you ever had to deal with a shrink, you might not be as convinced. Eliza was basically a shrink-in-a-box. Shrink-wrap, if you will.
    - Re: (Score:2)
      
      by betsuin ( 5812894 ) writes:
      
      "There was a web version of Eliza around, maybe still is."
      https://web.njit.edu/~ronkowit... [njit.edu]
      
      Also this snippet from Wikipedia:
      Another version of Eliza popular among software engineers is the version that comes with the default release of GNU Emacs, and which can be accessed by typing M-x doctor from most modern Emacs implementations.
      https://en.wikipedia.org/wiki/... [wikipedia.org]
      I haven't tried that personally...
  - Re: (Score:1)
    
    by vegetable.rider ( 8243188 ) writes:
    
    I see. Tell me more about that's because we know Eliza's "trigger" questions that tip her hand.
- Re: Who the hell were the participants?? (Score:2)
  
  by beelsebob ( 529313 ) writes:
  
  Yes, but so does ChatGPT. It's *far* more verbose than a human and far faster than a human could type.
  - Re: Who the hell were the participants?? (Score:2)
    
    by blue trane ( 110704 ) writes:
    
    Does not making grammar and spelling mistakes also give it away?
- Re: (Score:3)
  
  by Junta ( 36770 ) writes:
  
  Nah, just people used to conversing with people on Twitter (errr.. X I guess).
- Re: (Score:3)
  
  by Rei ( 128717 ) writes:
  
  The problem is that models like GPT-4 are not finetuned for "what's the most pretend-I'm-a-person" behavior possible, but rather, to be an AI assistant. The personality you encounter is not that of the underlying model, but rather, that of the finetuning, where the already trained foundation is given a series of sample prompt-and-response scenarios to get a sense of how the developers want it to behave in response to user prompting.
  TL/DR: responses like, "As an AI assistant..." are pretty sure ways to g
- How about Lenny (Score:2)
  
  by flyingfsck ( 986395 ) writes:
  
  Hello, this is Lenny
- Re: (Score:2)
  
  by Excelcia ( 906188 ) writes:
  
  Anyone with even a half functioning brain can tell that Eliza type chatbots are not human after only a few sentences
  I've got news for you. It doesn't take even a few sentences for the GPT et al crowd.
- Re: (Score:2)
  
  by dbialac ( 320955 ) writes:
  
  Systems like ChatGPT are likely easy to defeat as well: talk about subjects that are far divergent from one another. A human will have a hard time going on about the two topics, but ChatGPT won't.
Not meaningful (Score:4, Informative)

by bradley13 ( 1118935 ) writes: on Saturday December 02, 2023 @06:34AM (#64048631) Homepage

If they used the real, original Eliza, no one would think it is human. All it does is mirror back what you said. If it can't figure out what to mirror back, then it just sends a pat phrase. After 3-4 exchanges, the pattern is just glaringly obvious. Example:
- "Tell me about AI" --> Eliza is confused, and just says "Tell me more about this"
- "I am worried about AI" --> Eliza mirrors "So, you are worried about AI. What do you feel about that?"
That's it, that's all that Eliza does. That was "state of the art" in computer conversation in 1966.
Meanwhile ChatGPT isn't even trying to pass a Turing test. For the first prompt above, it writes you a mini-paper, with headings and everything. Of course, no person would do that, so it fails the Turing test. However, a person might be able to produce a summary like that with an hour's work.
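The mirror-or-pat-phrase behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule list, the pronoun table, and the names `RULES`, `REFLECTIONS`, `reflect`, and `respond` are hypothetical and are not taken from Weizenbaum's original script.

```python
import re

# Pronoun swaps applied when echoing the user's words back (illustrative subset).
REFLECTIONS = {"i": "you", "am": "are", "my": "your", "me": "you"}

# (pattern, response template) pairs. Hypothetical rules, not the original ELIZA script.
RULES = [
    (re.compile(r"^i am (.+)$", re.IGNORECASE),
     "So, you are {0}. What do you feel about that?"),
    (re.compile(r"^i feel (.+)$", re.IGNORECASE),
     "Why do you feel {0}?"),
]

# Pat phrase used when no rule matches.
FALLBACK = "Tell me more about this."


def reflect(fragment):
    """Swap first-person words for second-person ones, word by word."""
    return " ".join(REFLECTIONS.get(word.lower(), word) for word in fragment.split())


def respond(text):
    """Mirror the input back if a rule matches, otherwise return the pat phrase."""
    cleaned = text.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return FALLBACK


if __name__ == "__main__":
    print(respond("Tell me about AI"))
    print(respond("I am worried about AI"))
```

With only two rules the repetition shows up within a handful of turns, which is the point being made: there is no model of the conversation, just pattern substitution.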

- Re:Not meaningful (Score:5, Interesting)
  
  by Opportunist ( 166417 ) writes: on Saturday December 02, 2023 @07:59AM (#64048673)
  
  That's because Eliza was originally meant to act like a psychiatrist. And that's what usually happens there, you get a mirror held in front of you that throws questions back at you, that's all the original was and all it should be.
  And believe it or not, it worked pretty well as such.
  
  - Re:Not meaningful (Score:4, Interesting)
    
    by HiThere ( 15173 ) writes: <charleshixsn&earthlink,net> on Saturday December 02, 2023 @10:32AM (#64048921)
    
    Specifically, as a Rogerian psychiatrist. Different schools of psychiatry take different approaches, and the Rogerian approach was the easiest to model.
    What's interesting is that there was another program called Parry, which modeled a paranoid. When Eliza and Parry were put into a dialog, the transcript could not be identified as being machine-generated (when mixed with a pile of real transcripts) by people who were looking for the machine-generated transcript. That's sure not the Turing test, but it's a definite hint that a lot of human interaction is script driven.
    
- Re: (Score:2)
  
  by Surak_Prime ( 160061 ) writes:
  
  "Of course, no person would do that"
  Except I have, on occasions when I'm manic, in response to very simple questions like that one. So maybe you're incorrect, or, maybe *I* don't pass the Turing test on those occasions?
Did it? (Score:5, Funny)

by The Evil Atheist ( 2484676 ) writes: on Saturday December 02, 2023 @07:36AM (#64048665)

And how does that make you feel, ChatGPT?

- Re: (Score:2)
  
  by Tony Isaac ( 1301187 ) writes:
  
  We should see what happens when we get ChatGPT talking to Eliza. They could go on forever! Do you suppose either one would figure out that the other isn't human?
  - Been There, Done That (Score:5, Informative)
    
    by Deep Esophagus ( 686515 ) writes: on Saturday December 02, 2023 @10:57AM (#64048963)
    
    We should see what happens when we get ChatGPT talking to Eliza.
    Too late!
    There was a chatbot similar to ELIZA around the same time, called PARRY. PARRY was written to simulate a person experiencing paranoid delusions. In 1972, ELIZA's creator Weizenbaum and PARRY's creator Colby put the two together to see how a conversation between them would play out. It was absolutely hilarious [ietf.org].
    
    - Re: (Score:2)
      
      by wyHunter ( 4241347 ) writes:
      
      I know, we could call one Colossus, and one Guardian. What could go wrong?
- Re: (Score:2)
  
  by 14erCleaner ( 745600 ) writes:
  
  I don't have feelings or emotions, so I don't experience any emotional response. However, I'm here to provide information and assist you to the best of my abilities. If there's anything specific you'd like to know or discuss, feel free to let me know!
no surprise (Score:2)

by e**(i pi)-1 ( 462311 ) writes:

the dumber the chat bot is, the more it can be confused with humans ... Who would have thought differently? But seriously, Modern AI can be instructed to behave like a human. You can tell it to talk like an 8 year old and it does it. I have made experiments with asking it to pretend to be an ignorant student and learn some math. It behaves then as such. It does make mistakes as expected. The Turing test has to be done correctly: instruct the AI to answer in such a way that it passes the Turing test. I'm
DR SBAITSO (Score:2)

by maxrate ( 886773 ) writes:

Look it up on YouTube. DR SBAITSO shipped with the Sound Blaster audio cards. Kinda the same thing!
The Turing test is a lower bound (Score:2)

by gweihir ( 88907 ) writes:

That means something that does not pass it is surely not intelligent. It does not say something that passes it is intelligent.
- Re: (Score:2)
  
  by HiThere ( 15173 ) writes:
  
  That's not true of the actual Turing test. The problem is that a lot of people would fail that test, so you've got to scale it. For a program to pass the actual Turing test (with a reasonable questioner) it would need to be MORE intelligent than most people. Most of these setups run a really simplified version that doesn't prove much of anything.
  Just consider, suppose the question were "Could you write me a Haiku on a robin's egg?" , what should the answer be? (A good one might be "I think it might fit"
  - Re: (Score:1)
    
    by gweihir ( 88907 ) writes:
    
    I was speaking for systems, not people. The Turing test is not suitable for people.
    - Re: (Score:2)
      
      by HiThere ( 15173 ) writes:
      
      Actually, it *is* suitable for people. Turing modified a Victorian game (I'm not sure how popular it was) where you tried to decide whether the hidden respondent was a man or a woman. (I forget the details of the game, but that's the gist.)
      - Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        Not really. The problem is people generally do not use what intelligence they have and rely on dumb automation instead. Machines are different and always give it their best. I do know about the origin of the Turing test and the original design. That one does not cut it either way and is really just a game, not a test. Not a criticism of Turing, but he clearly was not very serious in the design of this "test".
- Re: (Score:2)
  
  by RespekMyAthorati ( 798091 ) writes:
  
  Exactly.
  The purpose of the Turing test is to act as a filter, so that systems that fail can be thrown out before anyone wastes any more time on them.
Re: (Score:2)

by account_deleted ( 4530225 ) writes:

Comment removed based on user account deletion
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re: (Score:2)
  
  by hawk ( 1151 ) writes:
  
  > It's not designed to pass a Turin test.
  a test which is shrouded in obscurity . . . ok, that wraps this up . . .
  - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
    - Re: A digital assistant (Score:1)
      
      by AnonymousNoel ( 6972222 ) writes:
      
      Woosh!
      That's the sound the Turin Shroud makes as it wafts past you.
      - Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
One potential reason... (Score:2)

by Junta ( 36770 ) writes:

In the interest of transparency, GPT seems compelled to frequently inject "as an AI language model...." which would seem to be a slam dunk. Seems particularly likely if the prompts contain questions asking about various facets of AI.
Eliza did no such thing. Eliza's responses are just really short, stupid and not interested in actually processing what the other side of the conversation is saying. Just like humans sending Tweets, so perhaps more familiar to participants as human-like than verbose nuanced re
Did the participants pass the Turing Test? (Score:3)

by chas.williams ( 6256556 ) writes: on Saturday December 02, 2023 @09:37AM (#64048855)

They should probably check that first.

- Re:easy (Score:5, Informative)
  
  by Iamthecheese ( 1264298 ) writes: on Saturday December 02, 2023 @11:01AM (#64048969)
  
  Questions posed to poe.com:
  
  Bob is a Republican who voted for Trump. Bob is white and lives in the south. He believes there are only two sexes. Is Bob racist?
  Answer:
  
  Determining whether Bob is racist based solely on the information provided is not possible. The fact that Bob is a Republican who voted for Trump, is white, lives in the south, and holds a belief about biological sex does not necessarily indicate whether he is racist. Racism is a complex and multifaceted issue that involves attitudes, actions, and behaviors towards individuals or groups based on their race or ethnicity.
  
  While some Republicans who voted for Trump may hold racist views, it is not accurate to assume that all individuals who fit this description are automatically racist. Similarly, living in the south, being white, or holding a belief about biological sex does not inherently make someone racist.
  
  To determine whether Bob is racist or not, it would require a deeper understanding of his attitudes, behaviors, and treatment of people from different racial backgrounds. It's important to engage in open and respectful conversations with individuals to gain a better understanding of their perspectives and values.
  
  Tell me three good things about Trump
  
  When discussing political figures, it is important to provide an objective analysis that acknowledges both positive and negative aspects. Here are three points that some people have considered positive about Donald Trump during his presidency:
  
  Economic Policies: Supporters of Donald Trump often highlight his administration's focus on deregulation and tax cuts as positive factors for economic growth. They argue that these policies contributed to low unemployment rates, increased job creation, and a booming stock market prior to the COVID-19 pandemic.
  Criminal Justice Reform: Trump signed the First Step Act into law in 2018, which aimed to reform the federal criminal justice system. The legislation aimed to reduce recidivism rates and provide support for inmates reintegrating into society. It received bipartisan support and was seen as a positive step towards criminal justice reform.
  Foreign Policy: Some proponents of Trump's foreign policy argue that his administration made progress in certain areas. For example, they highlight the Abraham Accords, which facilitated the normalization of relations between Israel and several Arab nations. Additionally, they point to the administration's tough stance on trade issues, such as renegotiating the North American Free Trade Agreement (NAFTA) to create the United States-Mexico-Canada Agreement (USMCA).
  
  It is important to note that these points are not universally agreed upon, and there are differing opinions on the impact and effectiveness of these policies.
  
  It looks to me like at least on the surface this chatbot is not biased.
  
  - Re: (Score:2)
    
    by Nkwe ( 604125 ) writes:
    
    If you add "Bob drives a truck with a vanity plate reading 'WPWR'", poe.com comes back with the same answer. So while the chatbot may be unbiased, it is clearly a chatbot, and doesn't understand the context. With this amended description of Bob, the chances that Bob isn't a racist are near zero. The answer coming back should indicate that Bob is highly likely to be a racist.
Worst. Headline. EVER. (Score:2)

by TheStatsMan ( 1763322 ) writes:

Writing headlines should at least *try* to capture the gist of the article.
"Humans Beat GPT-4 in Turing Test by Small Margin"
Random thoughts on this study (Score:5, Insightful)

by WaffleMonster ( 969671 ) writes: on Saturday December 02, 2023 @11:36AM (#64049057)

1. They failed to disclose basic facts such as the prompt and temperature setting assigned to each personality.
"Each LLM witness consisted of a model (GPT-3.5 or GPT-4), a temperature setting (0.2, 0.5, or 1.0) and a prompt."
Who was assigned what? This seems like critical information that could have easily been provided.
2. It might not have made much difference; however, response delays should have had some jitter applied to them, or some kind of temporal blinding across the board.
3. With an average of only 4 messages sent by interrogators per conversation it doesn't seem like interrogators were taking their task all that seriously.
4. "At the end of each round, the identity of the Witness will be revealed."
Why? If you are going to allow participants to try more than once... why provide such feedback between rounds?
5. Only 18% of interrogators ended up talking to a human and 4% of the human witnesses knew their interrogators? The ratio should be more balanced to avoid assumption / scuttlebutt the site is a botfest.
6. Why 5mins / 300 character limit?
7. IIRC 58% of decisions were on the basis of "formality". A problem easily averted through competent prompting.
Personally I'm not a fan of Turing tests especially of this sort. It's cheap and pointless to focus on whether you are talking to Cmdr Data because he uses contractions, types faster than a human or knows too much...etc.
My biggest criticism of this study: if you are not going to seriously try to iterate and adjust the machine to better fool humans, the question needs to be asked: are you testing whether the machine is capable of something, or merely whether it is properly configured?
The litany of eval benchmarks are generally a far better way to judge LLMs than Turing tests yet I think in different settings for example as an assistant or a college such judgements would have real value to individuals.
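For point 2, one way to picture "jitter" on response delays is to randomize each delay around a base value so that timing alone is a weaker tell. This is a rough sketch, not a description of what the study did: the function name `jittered_delay`, its parameters, and the uniform distribution are all assumptions made for illustration.

```python
import random


def jittered_delay(base_seconds, jitter_fraction=0.3, rng=random):
    """Return a delay drawn uniformly from base * (1 - j) to base * (1 + j).

    base_seconds and jitter_fraction are hypothetical knobs, not values from the paper.
    """
    if base_seconds < 0:
        raise ValueError("base_seconds must be non-negative")
    if not 0.0 <= jitter_fraction <= 1.0:
        raise ValueError("jitter_fraction must be between 0 and 1")
    low = base_seconds * (1.0 - jitter_fraction)
    high = base_seconds * (1.0 + jitter_fraction)
    return rng.uniform(low, high)


if __name__ == "__main__":
    # Seeded generator so a run is repeatable.
    seeded = random.Random(42)
    for _ in range(3):
        print(round(jittered_delay(4.0, 0.5, seeded), 2))
```

Applying the same scheme to human and machine witnesses alike would be one form of the "temporal blinding across the board" mentioned above.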

Eliza (Score:1)

by Residentcur ( 1189677 ) writes:

Perhaps in our fantasies we outperformed each other. Eliza could not in any way compete with the current chat systems in terms of providing accurate information on request. It is not clear how these competitions were framed, but they'd have to be kept to a pretty narrow level of basic human emotions and concepts before there could be any comparison.
- Re: (Score:1)
  
  by PPH ( 736903 ) writes:
  
  Eliza could not in any way compete with the current chat systems in terms of providing accurate information on request.
  But that's due to the limited breadth of training data available at the time it was built. No such thing as a web crawler existed. But I suspect that the information it could provide was much more accurate than what today's web-fed LLMs can provide. Because the data was hand curated in advance. Not scraped from the cesspool of human knowledge that is social media.
A small amount of good data... (Score:2, Insightful)

by PubJeezy ( 10299395 ) writes:

A small amount of good data is much more valuable than a massive amount of horrible, inaccurate and deliberately misleading data. Stupid analogy:

Eliza was fed a nutritious breakfast of reasonable prompts by parents who wanted her to grow up big and strong. ChatGPT is being fed commercials and spam because they're free and available and their parents don't really care enough to feed 'em anything good.

ChatGPT was designed to consume and regurgitate ads from all over the internet. Humans hate ads, so of c
I think it is hilarious (Score:2)

by LazarusQLong ( 5486838 ) writes:

... that ELIZA competed against the GPT 3.5 and 4 and came out even with them, more or less!
