OpenAI Admits that AI Writing Detectors Don't Work
OpenAI recently shared guidelines for educators on using ChatGPT as an educational tool in a blog post, also noting in a related FAQ that AI writing detectors are ineffective and often produce false positives against students. ArsTechnica: In a section of the FAQ titled "Do AI detectors work?", OpenAI writes, "In short, no. While some (including OpenAI) have released tools that purport to detect AI-generated content, none of these have proven to reliably distinguish between AI-generated and human-generated content."
Regulation (Score:5, Insightful)
"The only thing that'll work is immediately regulating our competition, which is getting dangerously close to providing a FOSS solution to our darling, ChatGPT 4. We can't have another repeat of DALL-E 2 on our hands."
Not surprising (Score:3)
The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work. I don't see anything wrong with students using an LLM to aid them in researching, planning, & writing but by the end of the course/programme, they've got to be able to do it unaided/independently, i.e. it's *all* their own work.
Re: (Score:2)
"but by the end of the course/programme, they've got to be able to do it unaided/independently, i.e. it's *all* their own work."
Exactly, no slide-rules allowed.
Re: (Score:2)
The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work.
I'm curious if these LLM detection tools have started using an individual's previous writing as an input to determine if a current piece of writing has a similar writing style. Looking at a single essay may not be enough to tell if an LLM was used, but looking at a dozen essays could either show a sudden shift in writing style or perhaps even a complete lack of unique writing style (suggesting all works are LLM generated).
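To make that idea concrete, here is a toy stylometry sketch; every name and threshold in it is hypothetical, not how any shipping detector works:

    # Toy sketch: flag an essay whose style deviates from a student's own
    # history. Hypothetical features and threshold, for illustration only.
    import re
    from statistics import mean, stdev

    FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]

    def features(text):
        words = re.findall(r"[a-z']+", text.lower())
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        avg_sentence_len = len(words) / max(len(sentences), 1)
        # Relative frequency of common function words, a classic stylometric cue.
        rates = [words.count(w) / max(len(words), 1) for w in FUNCTION_WORDS]
        return [avg_sentence_len] + rates

    def style_shift(past_essays, new_essay, z_threshold=3.0):
        history = [features(e) for e in past_essays]
        new = features(new_essay)
        for i, value in enumerate(new):
            column = [h[i] for h in history]
            sigma = stdev(column) if len(column) > 1 else 0.0
            if sigma and abs(value - mean(column)) / sigma > z_threshold:
                return True  # this feature sits far outside the student's norm
        return False

With only a dozen essays the statistics are noisy, which hints at why judging a single essay in isolation is even harder.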
Re:Not surprising (Score:4, Insightful)
Re: (Score:2)
The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work.
The obvious counter-measure is to use an LLM for ALL of your work, so it's consistent.
Another obvious counter-measure is to start an LLM essay-writing service that's fed a student's previous work and writes a new essay on the specified topic using the same style.
Re: (Score:2)
Re: (Score:2)
So that they get consistently poor marks for the entirety of their education?
Most people don't hire outside essayists because they can't write; they do it to save time.
When I was in college, I did programming assignments for money[1]. I always asked my clients what grade they wanted. A few said an "A", but most wanted a "B", and more requested a "C" than an "A".
They could have done it themselves for a B or C, but they paid me because they had loads of other schoolwork or maybe a party over the weekend that was more important than coding.
[1] My excuse: I was broke.
Re: (Score:2)
Re: (Score:2)
Then they paid me to do the next assignment as well. I had a lot of repeat customers.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
The only way I can think of to catch students out is to get to know their writing so when it suddenly changes, you have a pretty good idea that it probably wasn't all their own work.
So, please correct me if I misunderstood: to pass your course, the key thing will be to never do your studying and homework and suddenly become knowledgeable in the subject? To begin with, your marks will be low, but as all the people that are changed ("educated") by the course gradually get eliminated, the grade average will gradually fall to meet your level of ignorance.
Re: (Score:2)
Re: (Score:2)
Different problem from DeepFakes (Score:4, Interesting)
Re: (Score:2)
Re: Different problem from DeepFakes (Score:2)
Have you heard of context-sensitive grammars, which combine symbols in natural ways that precluded simple rule-based AI from passing simple grammar tests?
Re: (Score:2)
Re: (Score:2)
there can be “tells” that are artifacts of the faking algorithm
GANs work by searching for, and eliminating these "tells". If you can find it, then so can the GAN. Maybe not today, but soon.
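For the unfamiliar, that adversarial loop looks roughly like this toy PyTorch sketch; it is the generic GAN recipe on made-up 1-D data, not any particular text or image model:

    # Toy GAN loop: the discriminator learns "tells", the generator is then
    # updated specifically to erase whatever tells it currently keys on.
    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # detector
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(1000):
        real = torch.randn(32, 1) * 0.5 + 2.0   # stand-in for "real" samples
        fake = G(torch.randn(32, 8))            # stand-in for "generated" samples

        opt_d.zero_grad()  # detector: learn to separate real from fake
        d_loss = (bce(D(real), torch.ones(32, 1))
                  + bce(D(fake.detach()), torch.zeros(32, 1)))
        d_loss.backward()
        opt_d.step()

        opt_g.zero_grad()  # generator: make the detector call fake "real"
        g_loss = bce(D(fake), torch.ones(32, 1))
        g_loss.backward()
        opt_g.step()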
Re: (Score:2)
Re: (Score:3)
I guess I'm showing my age, but in my day as a student I got to write essays during class on paper in a time-constrained environment, with the teacher walking up and down the room. Also for exams. And if it was a really big exam with official people it might be called a "defense" and involve a lot of talking in a room with people sitting in chairs.
It should be obvious that the final gra
Re: (Score:2)
I had to write essays longhand as well, but there's no way I'd sit a class that makes me do that now. I learned to type for a reason, and writing is physically draining for me. If it's longer than a shopping list, forget it, my hand will cramp before I get to the end.
I'm also pretty sure the Americans with Disabilities Act (ADA) or its equivalent in your jurisdiction says that you can't demand that anything be done in a particular way, as you will encounter students that cannot comply and must be accommodated.
Re: (Score:2)
I agree that it's unfair to require certain skillsets from people who are disabled, but it makes no difference to me if I'm interviewing. Being able to write longhand legibly on a piece of scrap paper and pass it quickly to a colleague for review is a condition of being able to fit into the team.
Of course they don't work (Score:4, Interesting)
Re: (Score:2)
Re: (Score:2)
Make that a "not very capable human". In some contexts, LLMs can perform on the level of a person that cannot think deeply. The real problem is low academic standards, IMO.
Re: (Score:2)
Seriously? Of course they don't work. LLMs are trained to mimic human-written text. That's what they do. Unless their output is watermarked somehow, there is literally no way to differentiate it from what a human might write.
Not quite.
LLMs are very good, but they still sometimes produce nonsensical text and hallucinations which are easily caught by humans.
The difficulty in training a model to do the same isn't so much that the LLM mimics human-written text, but that the LLM tries to mimic human text.
If the AI writing detector could reliably differentiate the LLM output from humans that would mean the LLM would be generating output that an AI model can statistically flag as not human.
And then the LLM could apply that same statistical flag as a training signal and learn to produce output that no longer trips it.
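Even without retraining, there is a cheap inference-time version of that arms race. A sketch with stand-in functions (nothing here is a real model or detector API): sample several candidates and keep whichever one the detector likes best.

    # Toy detector evasion by rejection sampling; generate() and
    # detector_score() are hypothetical stand-ins, not real APIs.
    import random

    def generate(prompt, seed):
        random.seed(seed)
        filler = " ".join(random.choice(["alpha", "beta", "gamma"]) for _ in range(5))
        return prompt + " " + filler

    def detector_score(text):
        # Stand-in statistic: higher = judged more "human" by the detector.
        words = text.split()
        return len(set(words)) / len(words)

    def evade(prompt, n=8):
        candidates = [generate(prompt, s) for s in range(n)]
        return max(candidates, key=detector_score)  # least-flagged candidate wins

Any fixed detector you can query becomes the evasion signal.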
So the question is (Score:2)
Do the AIs pass the Turing Test, or do most humans fail it?
Re: So the question is (Score:2)
Re: (Score:3)
Essentially, it is that humans fail it. Yes, LLMs cannot do any "deep thinking" (i.e. iterated deduction). As statistical models all they can do is flat, very broad and quite unreliable and randomized "deduction". But as it turns out, the average person cannot do any better in most cases either. It is not that LLMs are "smart", they are dumb as bread. But so are a great many people.
As to "hallucinations", remember anti-vaxxers, flat-earthers, or the morons that claimed here in this very place that storming
Re: (Score:2)
Re: (Score:2)
Well, I _can_ do real deduction and I can iterate it. And I do it all the time. Not even a need for language, I can do it symbolically if needed. I have been told that my skills and ease with that are quite unusual.
I do agree that most people basically stay on the level of a pretty advanced LLM (plus things like fear, greed, and other base emotions) most of the time and cannot get far beyond the rest of the time. Explains a lot.
Re: (Score:2)
Essentially, it is that humans fail it. Yes, LLMs cannot do any "deep thinking" (i.e. iterated deduction).
"We explore how generating a chain of thoughtâ"a series of intermediate reasoning
stepsâ"signiïcantly improves the ability of large language models to perform
complex reasoning."
https://arxiv.org/pdf/2201.119... [arxiv.org]
"Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is
flexible enough to incorporate various types (scalar values or free-form language)
and sourc
Re: (Score:2)
So some researchers are looking into it. Sure. Does that mean this is actually a promising idea? Not at all. All they will get is pretty spectacular hallucinations. Also not the first time that researchers were unaware or chose to ignore pretty fundamental limitations. As I have quite a bit of experience as a paper-reviewer, I have seen that countless times.
if AI could detect AI content (Score:4, Insightful)
Then it would be trivial to train AI not to look like AI content. It is the turtles-all-the-way-down problem. At some point your AI isn't going to be more powerful than itself.
How can anyone tell? (Score:2)
What's the difference?
Re: (Score:2)
It depends.
If it is truly novel content touching on a relatively advanced topic, i.e. writing that would actually carry intrinsic value, the "LLM smell" can be fairly distinct. "LLM smell" is kind of like a school student trying to hit a mandatory word count on an essay assignment that countless students have done before.
Which brings us to the scenario where it can be extremely hard to tell: school students doing essays that have no particular value and have been done countless times before.
Underhanded self-compliment? (Score:3)
"Our AI is so good, it fools our own AI-detection tools!"
Hmmm...
Re: (Score:2)
Re: (Score:2)
Well, it's a fair assessment.
OpenAI participated in releasing AI detectors itself, so to do that and then say they didn't work is an admission.
I suspect, further, that the detector approach may work in some contexts, but not for homework assignments: sincere student work has the same sort of traits that LLM output has, since the LLM likely trained on essays on the exact same topic the assignment was about, and students aren't exactly breaking new ground in their history report on Benjamin Franklin or anything.
So Turing tests have never really delivered (Score:4, Funny)
Turing tests are supposed to be able to tell the difference between human intelligence and computer automation.
What they actually tested was the difference between human intelligence, and _primitive_ computer automation.
Re: So Turing tests have never really delivered (Score:2)
Did you just move the goalposts?
Re: (Score:2)
The goalposts have certainly moved.
Detecting AI-enabled interviews (Score:4, Funny)
I've had a couple of programmer candidates who tried to cheat on a Teams interview, by quickly looking up answers as I asked them. I got suspicious when, every time I asked a question, the candidate would look away and then answer. One candidate would repeat my questions back to me while looking down, then provided me with very precise, very detailed answers on every imaginable subject. I was able to reproduce the exact wording of the candidate's answers by doing a similar search with Google or ChatGPT. My guess is that an AI-detection tool would be hard-pressed to use this kind of reaction analysis to detect that AI was being used. I'll bet a lot of human interviewers would be fooled too.
Re:Detecting AI-enabled interviews (Score:4, Interesting)
I think the best strategy for interviewing is to start a scenario with incomplete information, to make the candidate have to ask questions to get more data.
I don't think I've ever seen an LLM approach that would recognize and prompt for missing data to iterate on.
Besides, it's more informative about a candidate too. You want a candidate that recognizes missing data and asks for clarification. As long as you let them know up front that the problem is incomplete and asking for more info is expected, and wouldn't be seen as a weakness.
Re: (Score:2)
Good idea.
My go-to approach is to ask "why" questions.
"You've done both Entity Framework and Dapper? Which one would you choose, and why?"
"Why are interfaces superior to abstract base classes?"
"Why is dependency injection important?"
Also, I'll ask leading questions that lead the wrong direction, like:
"How does Razer affect the separation of concerns?" (it blows it up)
And then "How would you unit test Razer code?" (you can't)
And I like to follow threads that the candidate brings up.
"I did a lot of work with
Re: (Score:2)
LLMs cannot actually iterate. Or rather, it makes no sense to have them do it. Sure, they can fake it to a limited degree, but that is all they do.
Re: Detecting AI-enabled interviews (Score:2)
How hard can it be to send those questions out to Wolfram Alpha?
Re: (Score:2)
Actually not difficult at all. And ChatGPT _has_ an API to Wolfram Alpha. The problem is that LLMs cannot reliably identify what questions to send. For example, if you ask an LLM "is n prime", it will compute statistics on the term "prime" but also on the number n. The second may well prevent it from recognizing that this should be sent to Wolfram Alpha. Yes, completely bereft of any insight, but that is how LLMs work. Also, there is no Wolfram Alpha for other areas.
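To illustrate the routing problem, here is a toy dispatcher; the regex and the sympy call are stand-ins for the actual plugin machinery, which works differently and is not public:

    # Toy router: send exact-math questions to an exact solver instead of
    # letting the LLM guess statistically. sympy stands in for Wolfram Alpha.
    import re
    from sympy import isprime

    def route(question):
        m = re.search(r"is\s+(\d+)\s+prime", question.lower())
        if m:
            n = int(m.group(1))
            return f"{n} is {'prime' if isprime(n) else 'not prime'}"
        return None  # fall through to the LLM's statistical answer

    print(route("Is 2147483647 prime?"))  # -> 2147483647 is prime

The brittle part is exactly the parent's point: the router only fires on phrasings it anticipates.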
Re: (Score:2)
I honestly would rather spend 20 minutes telling war stories about how stupid DBAs are with indexes and Project Managers are with timelines. Ask candidates how they deal with death marches and unwritten specs. How does one automate installation
Re: (Score:2)
It's amazing how much you seem to know about the questions I ask, since I didn't specify! And my filtering is working quite nicely, thank you! Of the 10 candidates I've hired in the last 6 months, all 10 have turned out to be outstanding developers.
I frankly don't care that somebody worked on T1 lines 25 years ago. What have you done _lately_? Are you stuck in the past, or have you kept your skills updated?
My preference is to ask "why" questions.
"You've used Entity Framework and Dapper. Which one do you pr
Re: Detecting AI-enabled interviews (Score:2)
Can I bet that your company produces something utterly worthless to me, so why should I care about your stupid control issues?
Re: (Score:2)
Oh, so not only do you know what kinds of questions I ask, but you know what my company produces! You're truly amazing!
Re: (Score:3)
Amateurs. Obviously not thinkers either. Otherwise they would have understood that a) somebody competent needs a bit of time to think about an answer and b) if you fake the job interview, how can you expect to be able to do the job?
Re: (Score:2)
That's very logical. But those who try to cheat on an interview are already doing something inherently illogical. They believe they can bluff their way through their job, if they can bluff their way through an interview. And at some companies, they might be right.
Re: (Score:2)
Hmm. I admit I always only had jobs where "faking it" was completely impossible. I do see your point though.
Re: (Score:2)
I don't know if they're thinking that far ahead though. That's a problem for them to figure out later, though they'll probably tr
Re: Detecting AI-enabled interviews (Score:2)
How many customers does the hiring company cheat?
Re: (Score:2)
I've had a couple of programmer candidates who tried to cheat on a Teams interview, by quickly looking up answers as I asked them. I got suspicious when, every time I asked a question, the candidate would look away and then answer. One candidate would repeat my questions back to me while looking down, then provided me with very precise, very detailed answers on every imaginable subject. I was able to reproduce the exact wording of the candidate's answers by doing a similar search with Google or ChatGPT. My guess is that an AI-detection tool would be hard-pressed to use this kind of reaction analysis to detect that AI was being used. I'll bet a lot of human interviewers would be fooled too.
LLMs don't provide the same exact answers to the same exact questions. Not only is randomness intentionally injected as part of inference, but a user's context also influences responses, even when it doesn't seem like it would be at all relevant.
Here is an example asking an LLM the same exact question: "when launching a water rocket what is the optimal mixture of water to air?"
The answers are different with each run.
1st...
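For the curious, the injected randomness is typically temperature sampling over the model's next-token distribution. A minimal, model-free illustration of why the same prompt keeps producing different answers:

    # Temperature sampling over fixed logits: same input, different output
    # on every run. Toy numbers, not any real model's distribution.
    import numpy as np

    rng = np.random.default_rng()
    tokens = ["one third", "one quarter", "40%", "about half"]
    logits = np.array([2.0, 1.5, 1.0, 0.2])

    def sample(temperature=0.8):
        p = np.exp(logits / temperature)
        p /= p.sum()                    # softmax with temperature
        return rng.choice(tokens, p=p)  # a different pick on each call

    print([sample() for _ in range(3)])  # e.g. ['one third', '40%', 'one third']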
Re: (Score:2)
That's right. The case where I found the exact same answer was a regular Google search. But the LLM answers still follow certain recognizable patterns. The answers tend to be much more thorough than a human answer would be, itemizing bullet points, for example.
Re: Detecting AI-enabled interviews (Score:2)
So if the programmer's work is more thorough and quicker because they used AI, is that a bad thing because you just want to see them sweat, and that's the real point of hiring people?
Re: (Score:2)
Yes of course, you got it! I just want to see them sweat!
Actually, I want to see if they can think. I'd rather have somebody who can think, even if they don't have the exact skillset we are looking for. The kinds of developers I hire, actually want to be challenged in an interview, because they understand it's not just a checklist.
Re: (Score:2)
Re: (Score:2)
There is *no* right answer to that question!
Pure Nonsense (Score:1)
If it can be generated by an algorithm, it can be detected by an algorithm. This is a mathematical fact, no matter how many VCs sign on.
If you can't beat it - embrace it (Score:2)
Re: (Score:2)
Re: (Score:2)
We do. In some cases it is easier to adapt than in others.
The core problem that instructors have with these tools is that they make many of our typical assessments useless. We are trying to assess the ability to achieve an outcome through some assessment. Now, the assessment was never perfect, but it was good enough.
So maybe we want to assess whether you can analyze randomized algorithms. We are not going to ask for an analysis of the complexity of quicksort because we did it in class, so maybe we give quickhull
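For reference, the kind of derivation such an assessment targets is the standard indicator-variable analysis of randomized quicksort: elements of ranks $i < j$ are compared exactly when one of them is the first pivot drawn from the span between them, giving

$$\mathbb{E}[C] = \sum_{i<j} \Pr[z_i \text{ compared to } z_j] = \sum_{i<j} \frac{2}{j-i+1} \le 2n H_n = O(n \log n).$$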