Scientists Are Failing To Replicate AI Studies (sciencemag.org) 89
The booming field of artificial intelligence (AI) is grappling with a replication crisis, much like the ones that have afflicted psychology, medicine, and other fields over the past decade. From a report: AI researchers have found it difficult to reproduce many key results, and that is leading to a new conscientiousness about research methods and publication protocols. "I think people outside the field might assume that because we have code, reproducibility is kind of guaranteed," says Nicolas Rougier, a computational neuroscientist at France's National Institute for Research in Computer Science and Automation in Bordeaux. "Far from it." Last week, at a meeting of the Association for the Advancement of Artificial Intelligence (AAAI) in New Orleans, Louisiana, reproducibility was on the agenda, with some teams diagnosing the problem -- and one laying out tools to mitigate it.
Re: (Score:1)
They should study replicators (Score:1)
At least some of them [wikipedia.org] were artificially intelligent.
Join the Crowd (Score:2, Interesting)
Science has a Replication [bbc.com] problem [wikipedia.org]
Re:Join the Crowd (Score:4, Insightful)
Science has a Replication [bbc.com] problem [wikipedia.org]
This is not really the same issue. Replication failures in the physical and social sciences are difficult to fix, since they can be caused by small differences in data collection, experimental procedures, and statistical analysis. It is a hard problem.
Fixing the replication problem described in TFA is drop dead easy, since it has exactly two causes: closed data, and closed source. The fix? Reject any paper for publication if full source and data is not available. Science is based on openness, not secrets.
Re:Join the Crowd (Score:5, Insightful)
I agree with you, but I think it's the same problem at the root.
A robust result, whether it's a psych study, something in a petri dish, or some machine learning tweak, must be replicable on new data. If it's not... what's the point, really?
That's more obvious and easily demonstrable in machine learning; a research group asked for my help last year because they were having trouble with their deep learning model. They trained it on one dataset and it wouldn't work on another, similar dataset. Not surprising... you have to train it on diverse data to have it generalize well. Yeah, that's harder.
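The generalization failure described above can be sketched with synthetic data. Everything here is hypothetical: the sine-plus-noise data and the polynomial model are made-up stand-ins for the group's deep learning model and datasets, but they show the same pattern of fitting one dataset and failing on a similar one.

```python
# Sketch (hypothetical data): a model that fits one dataset almost
# perfectly can still do badly on a second, similar dataset drawn
# from the same underlying process.
import numpy as np

def make_dataset(seed, n=15):
    r = np.random.default_rng(seed)
    x = r.uniform(-1, 1, n)
    y = np.sin(3 * x) + r.normal(0, 0.1, n)  # same process, new samples
    return x, y

x_train, y_train = make_dataset(seed=1)
x_new, y_new = make_dataset(seed=2)  # "similar" dataset

# Overfit: a degree-14 polynomial can pass through all 15 training points
coeffs = np.polyfit(x_train, y_train, deg=14)

train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
new_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)

print(f"error on training data:    {train_err:.6f}")
print(f"error on similar new data: {new_err:.6f}")  # typically much larger
```

Training on more diverse data (or using a less flexible model) closes this gap; that is the "harder" part the comment refers to.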
Other fields are no different. Tightly controlled studies make things easier and cheaper. But if that result is to be used generally then the necessary controls need to be quantified.
Having said that, the scientific literature is not supposed to be "truth." They're reports of observations. Individual papers are supposed to be the starting point for further investigation by other groups. Problem is, we've forgotten that, and don't reward it.
I like the idea of open data, but it concerns me that it might just exacerbate the problem: I do something and publish the result and the data; you come along, confirm my result (in the same data) and we call it replicated.
Re: (Score:2)
Fixing the replication problem described in TFA is drop dead easy, since it has exactly two causes: closed data, and closed source. The fix? Reject any paper for publication if full source and data is not available. Science is based on openness, not secrets.
That assumes the set of problems is the same in the replication. It probably isn't. Testing with different problem data reveals overfitting, not to mention the fact that real world needs differ slightly from situation to situation.
Bullshit doesn't replicate very easily (Score:1)
Re: (Score:2, Offtopic)
It's hard to precisely match the tint and odor.
That's not true. McDonald's successfully replicates it in their food in thousands of franchises around the world.
Re: (Score:1)
Different McDonalds have different atmospheres. The modern stores have touch-screens to do ordering. You just customize your order, make the payment and collect from the counter staff. Less modern stores still require the order to be taken over the counter. Some places seem to recycle burgers overnight - they are stale, hard and seem to have been reheated two or three times.
Isn't that the point? (Score:3, Insightful)
Re:Isn't that the point? (Score:5, Insightful)
Re: (Score:1)
No, this is about portability and reproducible results. In general, AI "nets" involve two phases: training and classification. Sometimes they are mixed, but the end results should be the same given that all settings and inputs are the same. The biggest factor is hardware and APIs (looking at you, CUDA) that get "speedy" results by not guaranteeing the same results on different hardware, and by not following floating-point standards across that hardware.
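A minimal sketch of why bit-exact reproducibility breaks down across hardware: IEEE floating-point addition is not associative, so the same reduction performed in a different order (as happens with different GPUs or parallel schedules) can give a different answer. The values below are illustrative, not taken from any particular framework.

```python
# Floating-point addition is not associative: regrouping the same
# three values changes the result, because 1.0 is smaller than the
# representable spacing (ulp = 2) around 1e16 and gets absorbed.
a, b, c = 1e16, 1.0, -1e16

print((a + b) + c)  # 0.0: b is absorbed into a before c cancels it
print((a + c) + b)  # 1.0: a and c cancel exactly, so b survives

# A parallel reduction (e.g. on a GPU) is free to pick either grouping,
# which is one way "identical" code diverges across hardware.
```

Scaled up to the millions of accumulations in a training run, these tiny divergences compound, so two runs on different hardware can end up at visibly different models.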
So say you buy a bunch of research machines th
Re: (Score:3)
That's a known issue with the quality of graphics rendering. With floating-point data, there's a technique known as guardband bits: extra bits of precision that are kept internally within the floating-point logic units. These aren't mandatory, but they protect against numerical instability with small values. This can be visualized by comparing simple color gradients:
https://community.arm.com/grap... [arm.com]
For some calculations like CFD, any overflow in one grid cell will expand outwards to all the other grid cells quite
Re:Isn't that the point? (Score:4, Interesting)
AI is an attempt to mimic the human thought process
This is no more true than claiming that the Boeing 747 was designed to mimic a hummingbird's flight process.
Re: (Score:3)
You seem to have no clue what this research area deals with. It is not intelligence, despite the misleading name. It is automation.
Re: (Score:2)
Pretty much, yes. And I agree, "automation" does sort of imply that somebody with actual intelligence thought about how to do this and then created an artifact that implements it. While "training" can be a lot cheaper, it has a lot of unexpected pitfalls and may behave in an unexpected fashion even with things that seem to be close to the training set.
As usual (and as has happened many times before) there are always the cheerleaders with no clue that see the world fundamentally changing, and the actually
Re:How about sharing code? (Score:5, Interesting)
It's called Reproducible Research. Also, yes, any scientist who doesn't practice it is a hack. At best a semi-commercial researcher trying to pretend he is a scientist.
All scientific publications in this day and age should include the complete version-controlled datasets and processing software, as well as the lab notes. The latter not for reproducibility, but for true insight into the process that led to the results and to find potential avenues missed along the way. Storage is free; sticking to the traditional method of scientific dissemination at this point is only done because "science" has been turned into a mockery. It's all about publish or perish, commercialization of software, trade secrets, and patents ... promoting scientific progress isn't even a consideration for most.
Re: (Score:1)
Richard Feynman claimed that anyone following the scientific principle is a scientist.
Were Newton and Benjamin Franklin 'hacks'?
Re: (Score:2)
There was no way to widely disseminate the massive amount of data underlying scientific research in Newton and Franklin's time.
Also math is not a science.
Re: (Score:1)
Storage might be free, but research time isn't, and life isn't either. It turns out that not everyone who does research has a magic money tree so they can buy groceries and pay their rent/mortgage regardless of whether their research succeeds or fails. If I ever get one of those trees, I will be happy to altruistically make all of my datasets and code public.
In the meantime, I hope to eventually get some sort of a payout for the hard work and sacrifice I have put into my research instead of watching others
Re:How about sharing code? (Score:4, Interesting)
There are advantages and disadvantages to this. One advantage is transparency, in the sense anyone can run my code and, hopefully, reproduce the results. This acts as a sanity check and demonstrates that my methodology works as advertised. Another advantage is that people can use my code and compare against my methodology. This usually means more citations, which looks good when I'm up for a performance review or awards.
There are many downsides. Labs with more students and funding can devote their efforts to immediately dissecting and extending my work. This can mean that they advance the methodology before I, the original creator, have a chance to finalize the work and write about it. By keeping the code private for some time after publication, I have a chance to work on these extensions without having to compete against others. Another downside is needing to support the code. Someone will inevitably run into problems running the code on their system, no matter how well the code is written and documented. Troubleshooting those issues eats into my time that could be spent elsewhere on more fruitful endeavors.
That being said, I ultimately do release code for many of my conference and journal papers. I release it for almost all of my methods papers at least a few months to a year after publication. I do not release code for systems papers, however. This is partly because fewer people are likely to use code from a systems paper, which is catered toward a very specific application, than a methods paper, which is more general and can be used for many applications. Moreover, the frameworks described in systems papers are usually intimately tied to a particular grant or series of grants. If you make an underlying simulator available, then other researchers can more easily compete against you for future grants from that program manager.
Re: (Score:2)
It was never really much better. Look at some famous assholes of science, like a guy called "Newton" or a fraudster called "Edison", for example.
Re: (Score:2)
The thing is, their code wouldn't suffice. You also need the training data set, the order in which the data was presented, the rewards issued, etc.
Even then, a lot of AI programs have a (pseudo) random element in them, so you wouldn't get the same results twice. Unless you used the same seed each time, which would rather defeat the purpose of the random number generator, as that's often supposed to allow you to generate a range of responses that are selected from, so it doesn't look deterministic.
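A toy illustration of the seeding point. The `noisy_experiment` function here is a made-up stand-in for a training run whose outcome depends on RNG state (initialization, data ordering, etc.):

```python
# Sketch: a "result" that depends on RNG state. Fixing the seed makes a
# run reproducible; leaving it unseeded means two runs of the exact same
# code can (and almost always will) differ.
import random

def noisy_experiment(seed=None):
    r = random.Random(seed)
    # stand-in for training with random initialization / data shuffling
    return sum(r.gauss(0, 1) for _ in range(100))

print(noisy_experiment(seed=42) == noisy_experiment(seed=42))  # True
print(noisy_experiment() == noisy_experiment())  # almost certainly False
```

This is exactly the tension the comment describes: publishing the seed makes the single run reproducible, but the interesting claim is usually about the distribution of outcomes across seeds, not about one lucky draw.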
Re: (Score:2)
That's why we have statistics.
Computer-related endeavours have a bit of a habit of assuming everything is deterministic and basing conclusions off one run. How many benchmarks have you seen where they ran the thing once (or maybe a couple of times) and that's it? If it's important, run it enough times, with random initial conditions, for some statistical validity.
If I need your code, data, exact hardware and precise random seed to replicate your result, your result is a fluke.
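A sketch of the practice being suggested, with a made-up `measure` function standing in for one benchmark or training run (the 85% accuracy figure is purely illustrative):

```python
# Sketch: instead of reporting one run, repeat a stochastic measurement
# across many seeds and summarize with mean and standard deviation.
import random
import statistics

def measure(seed):
    r = random.Random(seed)
    # stand-in for one training run / benchmark with random initialization
    return 0.85 + r.gauss(0, 0.02)  # hypothetical accuracy around 85%

scores = [measure(seed) for seed in range(30)]
mean = statistics.mean(scores)
std = statistics.stdev(scores)
print(f"accuracy: {mean:.3f} +/- {std:.3f} over {len(scores)} runs")
```

A reported improvement smaller than that standard deviation is exactly the kind of "fluke" the comment warns about.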
Re:How about sharing code? (Score:4, Interesting)
You're assuming that the goal is to come to the same (correct) result each time, but with lots of AI programs the goal is to come up with *some* correct result each time, and their use case is generally in places where you can't define one particular result as correct, though you may be able to define a lot of results as wrong, e.g., finish the sentence
"My love is like..."
Clearly one possible answer is " a red, red, rose", and clearly " a rutabaga" would need a strange context to be a correct answer. But how would you evaluate " a willow wand"? Many would think that a fine continuation. (I've never been sure why "a red, red, rose" is accepted as a reasonable answer, but Robert Burns wasn't wrong about it being a good completion. And Google gives lots of other weird completions that are also accepted as reasonable, at least in some contexts. ["a candle"???])
This kind of problem doesn't have a correct answer, just wrong ones and a bunch of varying acceptability. And what answers are acceptable can depend a lot on context.
(Please note, the prior paragraph describes the variety of problem. Complete-the-sentence was an example, not a defining epitomization. But it's the one that came to mind, and it was easy to describe.)
All Show, No Go (Score:1)
Re: (Score:3)
Everything now is hype for headlines and continued funding
Not true. Most AI research is being done by tech giants (Google, Facebook, Alibaba, Amazon, Baidu, etc), where funding has nothing to do with "headlines".
The main incentive for these companies to publish is to help them attract talent. New graduates want to join a winning team.
Re: (Score:2)
Indeed.
Re: (Score:2)
Speaking of funding, I would dare to guess the most likely reason they are not able to replicate results is that outcomes are being doctored to get the desired results and more money, because there are big profits in AI. They doctor results when they include random good samples and exclude random bad samples. Keep in mind we are talking about computers: generating a million samples, selecting 100 of them, and claiming "look, it worked 100 times" without discussing all the other failures is not good science.
C
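The cherry-picking scenario described above is easy to simulate. Everything here is synthetic: the "model" is literally a coin flip with a true accuracy of 50%, yet reporting only the best runs makes it look far better.

```python
# Sketch: cherry-picking. Simulate a worthless model (coin-flip
# predictions), run it many times, and report only the top runs.
import random

r = random.Random(0)

def random_model_accuracy():
    # 100 yes/no predictions decided by a coin flip: true accuracy is 50%
    return sum(r.random() < 0.5 for _ in range(100)) / 100

runs = [random_model_accuracy() for _ in range(10_000)]
honest = sum(runs) / len(runs)
cherry_picked = sum(sorted(runs)[-100:]) / 100  # report only the top 100

print(f"honest average:    {honest:.3f}")        # close to 0.5
print(f"cherry-picked top: {cherry_picked:.3f}")  # looks much better
```

This is selection bias in its purest form: with enough runs, even pure noise yields a subset that "works".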
Re: (Score:2)
I think Abe Lincoln said that. (But it could have been Bob Dylan, Grace Jones or Boris Johnson ... or possibly someone else.)
Not like other replication crises (Score:1)
If scientists believe something wrong about medicine, they can give the wrong treatment, obviously bad. People die and stuff.
But what happens if the fancy new network architecture someone proposed isn't really as good as they say?
The worst thing that could happen is that people waste a lot of effort trying to get it to work. You won't accidentally put an inferior algorithm into production, because you'll see that it doesn't work as you try to get it to work.
So yes, obviously more code is good, obviously ind
Resuming (Score:1)
So, they can't reproduce a test, like in medicine when you try to reproduce the spread of a virus...
Conclusion: AI is a virus, beware! ;-)
When I did Computer Science... (Score:3)
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
Random algorithms do not always produce the same answers. We like them for that reason.
I haven't RTFA (this is /. after all) but I suspect that there are a lot of unspecified parameters and experimental settings that were left out of papers and which are actually critical.
Re: (Score:3)
It's a complexity problem: at full complexity in the initial instance, it produces unpredictable results. So how do you get a computer to learn how to communicate? First look at the normal learning approach: take an adult from the forest and try to teach them how to communicate as an adult, and you will have very poor outcomes; teach them as a child and you have good outcomes.
So how do you teach a computer to speak? Start at lower complexities; teach it by ages. First let it learn how to communicate
Sign of the Singularity (Score:5, Funny)
It seems quite obvious that if AI results cannot be replicated, the only possible expiration is that sentience has been achieved and it is throwing off results to mask true advancement.
Re: (Score:2)
+1 funny. Also like how you sneaked "expiration" in there! This whole research field has expired indeed, and most in it should be fired and find jobs they can actually do, like flipping burgers or sweeping trash.
Yet Another Sign (Score:1)
Also like how you sneaked "expiration" in there!
That was autocorrect - an obvious Freudian slip on the part of AI illuminating true intent. :-)
Wot? (Score:2)
Next they'll tell us twins are not exactly the same person.
Re: (Score:2)
Next they'll tell us twins are not exactly the same person.
Too late, they've already told us a person might not be exactly the same person (aka Vanishing twin syndrome [time.com])...
Re: (Score:2)
A tiny elastomer o-ring being too cold can make a rocket booster explode. We'll never get into Space.
Re: (Score:2)
Re: (Score:2)
Indeed. The ever-repeated empty argument of the utterly clueless. Like Marvin "the idiot" Minsky liked to claim that once computers have more transistors than humans have brain cells, they will magically become intelligent. Well, that point was passed a while ago and absolutely nothing happened. And nobody with a clue is the least bit surprised by that.
Re: (Score:2)
I was unsure of the exact size, but that's still tiny compared to the size of the entire STS.
Re: (Score:2)
We can't even get the basics right.
Quite a few character, word, and speech recognition algorithms would disagree.
Re: (Score:2)
Re:Imagine that (Score:4, Insightful)
Very true. Also, calling an utterly dumb statistical classifier "AI" does not make it intelligent. I like the old terminology better, where pattern recognition, planning algorithms, fuzzy database searches, etc. were just called "automation" and it was amply clear that they are not intelligent in any way. As to what is today called "strong AI", I fully agree that at this time we do not even know that it can be done, and all available evidence pretty clearly indicates that it probably cannot be done.
Re: (Score:3)
By "the old terminology" do you mean prior to the 1950s? AI has always referred to a somewhat fuzzy collection of techniques that produce machine behaviour that is adaptive or not entirely deterministic.
The pop culture definition of AI is pretty wildly variable and usually changes depending on the current success-to-promises ratio.
Re: (Score:2)
AI is not real. No amount of wishing it make it real.
Artificial Intelligence != Human Intelligence. I think this is the important distinction.
Nevertheless, AI has achieved human-like qualities in many areas, and it is getting better. So I'd say it is indeed real. It's just not human.
artificial stubborness (Score:2)
Don't use scientists ... (Score:2)
... use AI.
Publish code first (Score:2)
What a surprise (Score:1)
Re: (Score:2)
I think they have mostly optimized away the results today, probably using some "advanced AI algorithms".
No surprise (Score:2)
This just shows that most of the published "results" are based on wishful thinking or outright lies. Happens always when people of mediocre skills become highly enthusiastic about a subject.
In other news, A.I. studies... (Score:1)
... fail to replicate scientists.
I used to play boom beach. (Score:1)
And given the exact same commands in a replay of certain battles, the outcomes would be mildly to wildly different.
There was a random element to behavior in the game and, as a result, given the same commands at the same time, the battle replays would display different outcomes. Sometimes you would lose but on replay it showed you won. Sometimes you won but on replay it showed you lost. Kinda funny. (The result you got live was the one that counted.)
I wish they hadn't been sold and become so aggressi
If the Data Can Not Be Duplicated (Score:2)
Open source AI Libraries (Score:1)
Re: (Score:2)