OpenAI Admits that AI Writing Detectors Don't Work

In a recent blog post, OpenAI shared guidelines for educators on using ChatGPT as a teaching tool, and noted in a related FAQ that AI writing detectors are ineffective and often produce false positives against students. ArsTechnica: In a section of the FAQ titled "Do AI detectors work?", OpenAI writes, "In short, no. While some (including OpenAI) have released tools that purport to detect AI-generated content, none of these have proven to reliably distinguish between AI-generated and human-generated content."
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Regulation (Score:5, Insightful)

    by systemd-anonymousd ( 6652324 ) on Friday September 08, 2023 @12:47PM (#63832956)

    "The only thing that'll work is immediately regulating our competition, which is getting dangerously close to providing a FOSS solution to our darling, ChatGPT 4. We can't have another repeat of DALL-E 2 on our hands."

  • by VeryFluffyBunny ( 5037285 ) on Friday September 08, 2023 @12:52PM (#63832966)
    This is really not surprising. Since LLMs basically work on statistical probabilities from massive datasets, the output texts exhibit pretty authentic genre-specific distributions of recurrent lexicogrammatical register configurations that you'd expect from a human being (who's very literate!)

    The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work. I don't see anything wrong with students using an LLM to aid them in researching, planning, & writing but by the end of the course/programme, they've got to be able to do it unaided/independently, i.e. it's *all* their own work.
    • "but by the end of the course/programme, they've got to be able to do it unaided/independently, i.e. it's *all* their own work."

      Exactly, no slide-rules allowed.

    • by ranton ( 36917 )

      The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work.

      I'm curious if these LLM detection tools have started using an individual's previous writing as an input to determine if a current piece of writing has a similar writing style. Looking at a single essay may not be enough to tell if an LLM was used, but looking at a dozen essays could either show a sudden shift in writing style or perhaps even a complete lack of unique writing style (suggesting all works are LLM generated).
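
      A crude sketch of what such a per-student style-consistency check might look like, using two toy stylometric features (average sentence length and type-token ratio) and a made-up z-score threshold; real detectors, if they take this approach at all, would use far richer features.

      import re
      import statistics

      def style_features(text):
          """Two crude stylometric features for one essay."""
          sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
          words = re.findall(r"[A-Za-z']+", text.lower())
          avg_sentence_len = len(words) / max(len(sentences), 1)
          type_token_ratio = len(set(words)) / max(len(words), 1)
          return (avg_sentence_len, type_token_ratio)

      def looks_out_of_character(new_essay, previous_essays, z_threshold=2.5):
          """Flag the new essay if its features sit far outside the student's own history."""
          history = [style_features(e) for e in previous_essays]
          new = style_features(new_essay)
          for i, value in enumerate(new):
              past = [h[i] for h in history]
              mean = statistics.mean(past)
              spread = statistics.pstdev(past) or 1e-9   # degenerate with very few prior essays
              if abs(value - mean) / spread > z_threshold:
                  return True
          return False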

    • The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work.

      The obvious counter-measure is to use an LLM for ALL of your work, so it's consistent.

      Another obvious counter-measure is to start an LLM essay-writing service that's fed a student's previous work and writes a new essay on the specified topic using the same style.

      • So that they get consistently poor marks for the entirety of their education?
        • So that they get consistently poor marks for the entirety of their education?

          Most people don't hire outside essayists because they can't write; they do it to save time.

          When I was in college, I did programming assignments for money[1]. I always asked my clients what grade they wanted. A few said an "A", but most wanted a "B", and more requested a "C" than an "A".

          They could have done it themselves for a B or C, but they paid me because they had loads of other schoolwork or maybe a party over the weekend that was more important than coding.

          [1] My excuse: I was broke.

            • So were those assignments in order to develop their knowledge, skills, & attitudes (KSAs) or simply busy-work? If it was the former, they'd have been in trouble come the next related, extending assignments, i.e. those that build on the previously acquired/developed KSAs... or maybe I'm expecting too much from higher education?
            • Then they paid me to do the next assignment as well. I had a lot of repeat customers.

              • That may explain why I once had to explain the basic differences between server-side & client-side development to someone who apparently had a post-grad degree from MIT. Luckily, I didn't have to work with him.
      • Surely it's okay if the student gets better? Otherwise, why be a student?
    • The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work.

      So, please correct me if I misunderstood: to pass your course, the key thing will be to never do your studying and homework, and never suddenly become knowledgeable in the subject? To begin with, your marks will be low, but as all the people that are changed ("educated") by the course gradually get eliminated, the grade average will gradually fall to meet your level of ignorance.

    • That's not true. Depending on the topic, the temperature setting can have a huge impact. If you set T=0 the LLM sounds very unnatural. There is no one best temperature for everything. It's not really like humans, unless you consider alcohol level = Temperature.
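
      For reference, a minimal sketch of what the temperature parameter does to next-token sampling, with toy logits and numpy rather than any real model's API: T=0 collapses to greedy argmax (the repetitive, "very unnatural" case), while higher T flattens the distribution and adds variety.

      import numpy as np

      def sample_next_token(logits, temperature, rng):
          """Temperature-scaled sampling over a toy logit vector."""
          if temperature == 0:
              return int(np.argmax(logits))             # deterministic greedy pick
          scaled = np.asarray(logits, dtype=float) / temperature
          probs = np.exp(scaled - scaled.max())          # numerically stable softmax
          probs /= probs.sum()
          return int(rng.choice(len(logits), p=probs))

      rng = np.random.default_rng(0)
      logits = [2.0, 1.5, 0.3, -1.0]                     # scores for 4 candidate tokens
      print([sample_next_token(logits, 0.0, rng) for _ in range(5)])   # always token 0
      print([sample_next_token(logits, 1.0, rng) for _ in range(5)])   # a mix of tokens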
  • by WDot ( 1286728 ) on Friday September 08, 2023 @12:57PM (#63832992)
    Detecting AI-generated images seems like an easier problem, because the data being generated is very high dimensional and there can be “tells” that are artifacts of the faking algorithm (e.g. blurry backgrounds, misaligned teeth, or hands with incorrect numbers of fingers). With text from a large language model, there’s not a lot to go on. Earlier models would make silly mistakes, like getting stuck in a loop or forgetting details from its own recent memory, but now that those seem to have been solved, what can you observe about AI-generated text that gives away the game? That the details are often vague and the writing passable but mediocre? That is precisely the kind of writing that teachers grade dozens of times a day!
    • Yep, that's the difference between symbolic (language) & analogue (natural world) data. Words & the stuff they make up are essentially symbols of categories of recurrent phenomena. They're not the phenomena themselves. That's orders of magnitude simpler & easier to train. LLMs are just better trained than image generators because it's easier to train them.
    • there can be “tells” that are artifacts of the faking algorithm

      GANs work by searching for, and eliminating, these "tells". If you can find it, then so can the GAN. Maybe not today, but soon.
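
      A toy illustration of that dynamic, assuming a PyTorch setup on 1-D data: whatever signal the detector learns to use becomes, through the generator's loss, exactly the signal the generator learns to erase. Purely illustrative; this is not how any production text model is trained.

      import torch
      import torch.nn as nn

      def real_batch(n=64):
          # Toy "real" data with a statistical tell: mean 3, std 1.
          return torch.randn(n, 1) + 3.0

      gen = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # generator
      det = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # detector
      opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
      opt_d = torch.optim.Adam(det.parameters(), lr=1e-3)
      bce = nn.BCEWithLogitsLoss()

      for step in range(2000):
          # 1) Train the detector to tell real from generated.
          fake = gen(torch.randn(64, 8)).detach()
          d_loss = bce(det(real_batch()), torch.ones(64, 1)) + bce(det(fake), torch.zeros(64, 1))
          opt_d.zero_grad(); d_loss.backward(); opt_d.step()

          # 2) Train the generator to make the detector call its output "real".
          fake = gen(torch.randn(64, 8))
          g_loss = bce(det(fake), torch.ones(64, 1))
          opt_g.zero_grad(); g_loss.backward(); opt_g.step()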

    • It always follows a set pattern - introduction, three ideas & conclusion.
    • This is really a solution in search of a problem. There are already tried and true methods that work well for teachers.

      I guess I'm showing my age, but in my day as a student I got to write essays during class on paper in a time constrained environment, with the teacher walking up and down the room. Also for exams. And if it was a really big exam with official people it might be called a "defense" and involve a lot of talking in a room with people sitting in chairs.

      It should be obvious that the final gra

      • by Mal-2 ( 675116 )

        I had to write essays longhand as well, but there's no way I'd sit a class that makes me do that now. I learned to type for a reason, and writing is physically draining for me. If it's longer than a shopping list, forget it, my hand will cramp before I get to the end.

        I'm also pretty sure the Americans with Disabilities Act (ADA) or its equivalent in your jurisdiction says that you can't demand that anything be done in a particular way, as you will encounter students that cannot comply and must be accommodat

        • You paid your dues, and now you get to choose what you do in life and where you draw the line. If you were young today and at the start of your career and someone says "longhand!" you'd do it.

          I agree that it's unfair to require certain skillsets from people who are disabled, but it makes no difference to me if I'm interviewing. Being able to write longhand legibly on a piece of scrap paper and pass it quickly to a colleague for review is a condition of being able to fit into the team.

  • by bradley13 ( 1118935 ) on Friday September 08, 2023 @01:14PM (#63833046) Homepage
    Seriously? Of course they don't work. LLMs are trained to mimic human-written text. That's what they do. Unless their output is watermarked somehow, there is literally no way to differentiate it from what a human might write.
    • by RobinH ( 124750 )
      Yeah, our biggest complaint is that the output is often factually incorrect. But you wouldn't expect a student essay to be completely factual either.
    • by gweihir ( 88907 )

      Make that a "not very capable human". In some contexts, LLMs can perform on the level of a person that cannot think deeply. The real problem is low academic standards, IMO.

    • Seriously? Of course they don't work. LLMs are trained to mimic human-written text. That's what they do. Unless their output is watermarked somehow, there is literally no way to differentiate it from what a human might write.

      Not quite.

      LLMs are very good, but they still sometimes produce nonsensical text and hallucinations which are easily caught by humans.

      The difficulty in training a model to do the same isn't so much that the LLM mimics human-written text, but that the LLM tries to mimic human text.

      If the AI writing detector could reliably differentiate the LLM output from humans that would mean the LLM would be generating output that an AI model can statistically flag as not human.

      And then the LLM could then apply that same

  • Do the AIs pass the Turing Test, or do most humans fail it?

    • "Let us fix our attention on one particular digital computer C. Is it true that by modifying this computer to have an adequate storage, suitably increasing its speed of action, and providing it with an appropriate programme, C can be made to play satisfactorily the part of A in the imitation game, the part of B being taken by a man?"

      [...]

      I believe that in about fifty years' time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.

    • by gweihir ( 88907 )

      Essentially, it is that humans fail it. Yes, LLMs cannot do any "deep thinking" (i.e. iterated deduction). As statistical models all they can do is flat, very broad and quite unreliable and randomized "deduction". But as it turns out, the average person cannot do any better in most cases either. It is not that LLMs are "smart", they are dumb as bread. But so are a great many people.

      As to "hallucinations", remember anti-vaxxers, flat-earthers, or the morons that claimed here in this very place that storming

        • I think we can generalise this - we are all contextual language generators. We're just running the "language stack" on the "brain stack" to solve our problems. And language is damn smart, the result of billions of people's lives and experience over tens of millennia. The ideas we have collected over time, and replicated by language, are the backbone of our civilisation. What we add on top of this language intelligence is a bit of our own experience, sometimes just luck - and we may add a small discovery on top
        • by gweihir ( 88907 )

          Well, I _can_ do real deduction and I can iterate it. And I do it all the time. Not even a need for language, I can do it symbolically if needed. I have been told that my skills and ease with that are quite unusual.

          I do agree that most people basically stay on the level of a pretty advanced LLM (plus things like fear, greed, and other base emotions) most of the time and cannot get far beyond the rest of the time. Explains a lot.

      • Essentially, it is that humans fail it. Yes, LLMs cannot do any "deep thinking" (i.e. iterated deduction).

        "We explore how generating a chain of thoughtâ"a series of intermediate reasoning
        stepsâ"signiïcantly improves the ability of large language models to perform
        complex reasoning."

        https://arxiv.org/pdf/2201.119... [arxiv.org]

        "Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is
        flexible enough to incorporate various types (scalar values or free-form language)
        and sourc

        • by gweihir ( 88907 )

          So some researchers are looking into it. Sure. Does that mean this is actually a promising idea? Not at all. All they will get is pretty spectacular hallucinations. Also not the first time that researchers were unaware or chose to ignore pretty fundamental limitations. As I have quite a bit of experience as a paper-reviewer, I have seen that countless times.

  • by OrangeTide ( 124937 ) on Friday September 08, 2023 @01:29PM (#63833074) Homepage Journal

    Then it would be trivial to train AI not to look like AI content. It's the turtles-all-the-way-down problem: at some point your AI isn't going to be more powerful than itself.

  • What's the difference?

    • by Junta ( 36770 )

      It depends.

      If it is truly novel content touching on a relatively advanced topic, basically writing that would actually carry intrinsic value, the "LLM" smell can be a bit distinct. "LLM" smell is kind of like a school student trying to hit a mandatory word count on an essay assignment that countless students have done before.

      Which brings us to the scenario where it can be extremely hard to tell: school students doing essays that have no particular value and have been done countless times before.

  • by Tony Isaac ( 1301187 ) on Friday September 08, 2023 @02:27PM (#63833228) Homepage

    "Our AI is so good, it fools our own AI-detection tools!"

    Hmmm...

    • I call it spin on the part of the reporter to claim that OpenAI "admits" this. Nowadays the press always has to tart up a story by brushing on some hint of scandal such as saying somebody "admits" something or that something is a "leak" when it isn't really.
      • by Junta ( 36770 )

        Well, it's a fair assessment.

        OpenAI participated in the "here are some AI detectors" push, so to do that and then say it didn't work is an admission.

        I suspect further that the detector approach may work in some contexts, but not for homework assignments, as sincere student work has the same sort of traits that LLM output has, since the LLM likely trained on essays on the exact same topic the assignment was about, and students aren't exactly breaking new ground in their history report on Benjamin Franklin or anything

  • by Tony Isaac ( 1301187 ) on Friday September 08, 2023 @02:29PM (#63833242) Homepage

    Turing tests are supposed to be able to tell the difference between human intelligence and computer automation.

    What they actually tested was the difference between human intelligence, and _primitive_ computer automation.

  • by Tony Isaac ( 1301187 ) on Friday September 08, 2023 @02:35PM (#63833268) Homepage

    I've had a couple of programmer candidates who tried to cheat on a Teams interview, by quickly looking up answers as I asked them. I got suspicious when, every time I asked a question, the candidate would look away and then answer. One candidate would repeat my questions back to me while looking down, then provided me with very precise, very detailed answers on every imaginable subject. I was able to reproduce the exact wording of the candidate's answers by doing a similar search with Google or ChatGPT. My guess is that an AI-detection tool would be hard-pressed to use this kind of reaction analysis to detect that AI was being used. I'll bet a lot of human interviewers would be fooled too.

    • by Junta ( 36770 ) on Friday September 08, 2023 @03:36PM (#63833420)

      I think the best strategy for interviewing is to start a scenario with incomplete information, to make the candidate have to ask questions to get more data.

      I don't think I've ever seen an LLM approach that would recognize and prompt for missing data to iterate on.

      Besides, it's more informative about a candidate too. You want a candidate that recognizes missing data and asks for clarification. As long as you let them know up front that the problem is incomplete and asking for more info is expected, and wouldn't be seen as a weakness.

      • Good idea.

        My go-to approach is to ask "why" questions.

        "You've done both Entity Framework and Dapper? Which one would you choose, and why?"
        "Why are interfaces superior to abstract base classes?"
        "Why is dependency injection important?"

        Also, I'll ask leading questions that lead the wrong direction, like:

        "How does Razer affect the separation of concerns?" (it blows it up)
        And then "How would you unit test Razer code?" (you can't)

        And I like to follow threads that the candidate brings up.
        "I did a lot of work with

      • by gweihir ( 88907 )

        LLMs cannot actually iterate. Or rather, it makes no sense to have them do it. Sure, they can fake it to a limited degree, but that is all they do.

        • How hard can it be to send those questions out to Wolfram Alpha?

          • by gweihir ( 88907 )

            Actually not difficult at all. And ChatGPT _has_ an API to Wolfram Alpha. The problem is that LLMs cannot reliably identify which questions to send. For example, if you ask an LLM "is n prime", it will compute statistics on the term "prime" but also on the number n. The second may well prevent it from recognizing that this should be sent to Wolfram Alpha. Yes, completely bereft of any insight, but that is how LLMs work. Also, there is no Wolfram Alpha for other areas.
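
            A toy sketch of that routing problem, with a hypothetical ask_wolfram_alpha stub standing in for the real plugin call; the keyword dispatcher below just illustrates the unreliability, it is not how ChatGPT's plugin routing actually works.

            import re

            def ask_wolfram_alpha(query):
                """Hypothetical stand-in for a call to an external compute engine."""
                raise NotImplementedError("stub: the real system would call out here")

            MATH_PATTERNS = [r"\bprime\b", r"\bintegral\b", r"\bsolve\b", r"\bfactor\b"]

            def route(question):
                """Naive keyword routing: 'Is 7919 prime?' matches, but a paraphrase
                like 'does 7919 have any divisors besides 1 and itself?' slips past."""
                if any(re.search(p, question.lower()) for p in MATH_PATTERNS):
                    return "wolfram"
                return "llm"

            print(route("Is 7919 prime?"))                                      # wolfram
            print(route("Does 7919 have any divisors besides 1 and itself?"))   # llm (missed)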

    • Why are you bothering to interview with questions like that? Everyone should be using tools for simple tasks, a code cookbook or AI; it is buried in code. Stop giving bus drivers the test from the DMV; you're not going to be happy with the filtering you are doing.
      I honestly would rather spend 20 minutes telling war stories about how stupid DBAs are with indexes and Project Managers are with timelines. Ask candidates how they deal with death marches and unwritten specs. How does one automate installati
      • It's amazing how much you seem to know about the questions I ask, since I didn't specify! And my filtering is working quite nicely, thank you! Of the 10 candidates I've hired in the last 6 months, all 10 have turned out to be outstanding developers.

        I frankly don't care that somebody worked on T1 lines 25 years ago. What have you done _lately_? Are you stuck in the past, or have you kept your skills updated?

        My preference is to ask "why" questions.

        "You've used Entity Framework and Dapper. Which one do you pr

    • by gweihir ( 88907 )

      Amateurs. Obviously not thinkers either. Otherwise they would have understood that a) somebody competent needs a bit of time to think about an answer and b) if you fake the job interview, how can you expect to be able to do the job?

      • That's very logical. But those who try to cheat on an interview are already doing something inherently illogical. They believe they can bluff their way through their job, if they can bluff their way through an interview. And at some companies, they might be right.

        • by gweihir ( 88907 )

          Hmm. I admit I always only had jobs where "faking it" was completely impossible. I do see your point though.

      • They'll try to cheat their way through that as well. I don't know if it was on Slashdot or another site, but a developer was fired from a company for sub-contracting all of their work out to some code-for-hire guy overseas. As I recall the person was only caught because someone in IT was wondering why there was someone connecting to the network from the other side of the world.

        I don't know if they're thinking that far ahead though. That's a problem for them to figure out later, though they'll probably tr
    • I've had a couple of programmer candidates who tried to cheat on a Teams interview, by quickly looking up answers as I asked them. I got suspicious when, every time I asked a question, the candidate would look away and then answer. One candidate would repeat my questions back to me while looking down, then provided me with very precise, very detailed answers on every imaginable subject. I was able to reproduce the exact wording of the candidate's answers by doing a similar search with Google or ChatGPT. My guess is that an AI-detection tool would be hard-pressed to use this kind of reaction analysis to detect that AI was being used. I'll bet a lot of human interviewers would be fooled too.

      LLMs don't provide the same exact answers to the same exact questions. Not only is randomness intentionally injected as part of inference, but the user's context influences responses even when it doesn't seem like it would be relevant at all.

      Here is an example asking a LLM the same exact question "when launching a water rocket what is the optimal mixture of water to air?"

      The answers are different with each run.

      1st...

      The optimal mixture of water to air in a water rocket depends on various factors such as the size an
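
      A minimal sketch of why identical prompts give different completions: with nonzero temperature the decoder samples at every step, so repeated runs take different paths through the distribution. The toy word probabilities below are made up; a real LLM's come from the network.

      import random

      # Toy next-word distributions standing in for a model's predictions.
      NEXT = {
          "<s>":     [("the", 0.6), ("an", 0.4)],
          "the":     [("optimal", 0.5), ("best", 0.5)],
          "an":      [("optimal", 1.0)],
          "optimal": [("mixture", 1.0)],
          "best":    [("ratio", 1.0)],
          "mixture": [("is", 1.0)],
          "ratio":   [("is", 1.0)],
          "is":      [("roughly one-third water", 0.7), ("situation-dependent", 0.3)],
      }

      def generate(rng):
          word, out = "<s>", []
          while word in NEXT:                        # stop once a terminal phrase is emitted
              words, weights = zip(*NEXT[word])
              word = rng.choices(words, weights=weights)[0]
              out.append(word)
          return " ".join(out)

      rng = random.Random()                          # unseeded, like repeated API calls
      for _ in range(3):
          print(generate(rng))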

      • That's right. The case where I found the exact same answer was a regular Google search. But the LLM answers still follow certain recognizable patterns. The answers tend to be much more thorough than a human answer would be, itemizing bullet points, for example.

        • So if the programmer's work is more thorough and quicker because they used AI, is that a bad thing because you just want to see them sweat and that's the real point of hiring ppl?

          • Yes of course, you got it! I just want to see them sweat!

            Actually, I want to see if they can think. I'd rather have somebody who can think, even if they don't have the exact skillset we are looking for. The kinds of developers I hire, actually want to be challenged in an interview, because they understand it's not just a checklist.

    • I usually end the interview with "emacs" or "vi" ? ChatGPT *always* gets this question wrong.
  • If it can be generated by an algorithm, it can be detected by an algorithm. This is a mathematical fact, no matter how many VCs sign on.

  • Maybe schools and colleges should build AI tools into their curricula and accept that they are part of the landscape now. AI doesn't stop students writing expressively and accurately and may even remove some roadblocks to expression.
    • One of my master's course professors takes the transcript of our class and has a variety of different models produce summaries and key points and complete the assignments. He then emails those out to us as a bar that we have to beat.
    • by godrik ( 1287354 )

      We do. In some cases it is easier to adapt than in others.

      The core problem that instructors have with these tools is that they make many of our typical assessments useless. We are trying to assess the ability to achieve an outcome through some assessment. Now, the assessment was never perfect, but it was good enough.

      So maybe we want to assess whether you can analyze randomized algorithms. We are not going to give you an analysis of the complexity of quicksort because we did it in class, so maybe we give quick hul
