AI Researcher Warns Data Science Could Face a Reproducibility Crisis (beabytes.com) 56

Long-time Slashdot reader theodp shared this warning from a long-time AI researcher arguing that data science "is due" for a reckoning over whether results can be reproduced. "Few technological revolutions came with such a low barrier of entry as Machine Learning..." Unlike Machine Learning, Data Science is not an academic discipline, with its own set of algorithms and methods... There is an immense diversity, but also disparities in skill, expertise, and knowledge among Data Scientists... In practice, depending on their backgrounds, data scientists may have large knowledge gaps in computer science, software engineering, theory of computation, and even statistics in the context of machine learning, despite those topics being fundamental to any ML project. But it's ok, because you can just call the API, and Python is easy to learn. Right...?

Building products using Machine Learning and data is still difficult. The tooling infrastructure is still very immature, and the non-standard combination of data and software creates unforeseen challenges for engineering teams. But in my view, a lot of the failures come from this explosive cocktail of ritualistic Machine Learning:

- Weak software engineering knowledge and practices compounded by the tools themselves;
- Knowledge gaps in mathematical, statistical, and computational methods, encouraged by black-box APIs;
- Ill-defined range of competence for the role of data scientist, reinforced by a pool of candidates with an unusually wide range of backgrounds;
- A tendency to follow the hype rather than the science.


What can you do?

- Hold your data scientists accountable using Science.
- At a minimum, any AI/ML project should include an Exploratory Data Analysis, whose results directly support the design choices for feature engineering and model selection.
- Data scientists should be encouraged to think outside the box of ML, which is a very small box.
- Data scientists should be trained to use eXplainable AI methods to provide context about the algorithm's performance beyond traditional performance metrics like accuracy, FPR, or FNR.
- Data scientists should be held to similar standards as other software engineering specialties, with code review, code documentation, and architectural design.
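The EDA recommendation above can be sketched concretely. This is a minimal, made-up toy example (the numbers and labels are invented, and Python's standard library stands in for the usual pandas-style workflow): summarize the feature distribution and the class balance before committing to features or a model.

```python
import statistics
from collections import Counter

# Hypothetical toy dataset: (feature, label) pairs standing in for real data
data = [(2.1, "a"), (2.3, "a"), (8.9, "b"), (9.4, "b"), (2.0, "a"), (9.1, "b")]

features = [x for x, _ in data]
labels = [y for _, y in data]

# Basic distributional summary that should inform feature engineering
print("mean:", round(statistics.mean(features), 2))
print("stdev:", round(statistics.stdev(features), 2))

# Class balance: a skewed split here should drive model and metric choices
print("class balance:", Counter(labels))
```

Even this trivial pass surfaces the kind of facts (bimodal feature, class proportions) that should directly justify later design choices, which is the author's point.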

The article concludes, "Until such practices are established as the norm, I'll remain skeptical of Data Science."
  • by Yo,dog! ( 1819436 ) on Sunday June 16, 2024 @07:39PM (#64554199)
    The author seems confused. When the subject of p-values was brought up, I was also expecting mention of Bonferroni, but alas there was none.
    • by Mr. Dollar Ton ( 5495648 ) on Sunday June 16, 2024 @09:19PM (#64554303)

      WTF is "data science" anyway?

      There was a time decades ago when we called it "statistics", and we'd learn how to do some math to make sure we get the best estimators out of a sample, calculate correlation coefficients and significance figures so that we can evaluate the validity of a model that came from some non-statistical assumptions that were called "understanding".

      Then we got a generation of "statisticians" that said seriously "I'm a statishitshian, I don't know any calculus". Then we got "specialists" in "AI" data massaging who don't even know what they do when they "tune" their models.

      What do them "data scientists" do today?

      • by znrt ( 2424692 )

this role became prominent with the discovery of the holy grail of big data. data scientists process, analyze and seek patterns in big data using statistics, software tools, models, visualization, etc. it's a multidisciplinary field, and they are often expected to have a background in the particular domain of the data. i would assume they use ai now too, and suppose you could call them "full stack" statisticians.

You're a "full stack statistician" if you, given a model and some population, can develop your own estimators, collect and pre-process your data, calculate whatever statistics you need from it, and draw inferences from said statistics with an assessment of their significance. You can then plug them into the model and get predictions.
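That end-to-end loop can be sketched in a few lines. This is an illustrative stand-in, not real fieldwork: the "collected" data is simulated, and the significance assessment is the usual rough normal-approximation interval.

```python
import math
import random
import statistics

# Simulate "collected" measurements; a stand-in for real data gathering
rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(100)]

mean = statistics.mean(sample)                      # the estimator
sem = statistics.stdev(sample) / math.sqrt(len(sample))
ci95 = (mean - 1.96 * sem, mean + 1.96 * sem)       # rough 95% significance assessment

print(f"estimate {mean:.2f}, 95% CI ({ci95[0]:.2f}, {ci95[1]:.2f})")
```

The point is that every step, from estimator to interval, is something the practitioner wrote and can defend, rather than an opaque library call.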

          Typically when you're there, you know enough numerical methods, perl and fortran to be able to undertake the tasks above efficiently. It is also quite common that you're not limit

          • Let me put it this way

            When I was a neophyte computer scientist working for an academic services search engine thing, circa 2007... I saw the first job offer for a bioinformatician at the uni I was based at. My first thought was "gosh it's gotta be an awful lot of effort to learn both biology and computer science to the extent that you can write domain specific code like that..."

            Then I saw the pricetag and realised that actually they were asking for the intersection of those skills only when, in my opinion a

            • people with too much individual power trying to apply tools they're familiar with in ways they're familiar with without understanding the tasks involved except at a superficial (in some cases consumer) capacity

              Yes, that's always a potential problem, when we've had it in rare cases in my vicinity it is typically dealt with by the culprit getting schooled by the people who know better. Most of the time the people are smart enough to ask for help or a review beforehand, though.

              • It's hard when the people who need to be schooled are the people writing the job description, and they think you having a meeting with them is a privilege for you.

                I imagine they'll learn over time, but this is a period of growing pains. I hope I'm not being generous when I say this.

                • Yeah, that sucks. I've been lucky not to ever be in this situation, people around me have been knowledgeable, humble and helpful.

            • by laughing_badger ( 628416 ) on Monday June 17, 2024 @05:20AM (#64554793) Homepage

              People with a union set of skills are rare and expensive.

It's a similar sort of post to 'Research Software Engineering' where the ideal candidate has enough of a software engineering background (probably picked up by elective courses and experience rather than directly taught) and enough of a domain background (likely a degree in a related field) to be able to bridge the gap between the people who know the domain and the people who know the minutiae of the software. You need someone with enough of a background to quickly learn the domain, and enough humility to ask for support and clarification for the tiny details of it. Similarly, they need enough of a software background to either be able to code the solution themselves or to manage a team of software people that can. Since we already have domain experts and software experts, it's looking for someone who can act as a bridge between the two camps.

          • by znrt ( 2424692 )

i used the term "full stack" with my tongue firmly in my cheek, to play on both the multidisciplinary nature of the role and the hype surrounding it. i don't think your paraphrasing of "full stack" in your rant really makes any sense, but don't let that stop you from yelling at the clouds :-)

          • This describes most people who do research in any of the sciences - physics, chemistry, economics, biology, etc.

Not the perl and Fortran part. It's Python and C++ in particle physics and has been for several decades; even decades ago perl was almost never used, it was almost entirely Fortran plus some use of C where we could.

Dunno about C++, around here all particle physicists use a piece of shit library they call root, which is a mess of spaghetti code and a C++-like interpreter with a shitton of memory leaks and other bugs, and the only thing they use this root for is to fit Gaussian curves. I've heard it also has a python wrapper, may the gods have mercy on its users.

              Of course, something simpler like gnuplot runs circles around root, but you can't unteach BDSM.

              Also, when a particle physicist tries to write something in C++,

I've heard it also has a python wrapper, may the gods have mercy on its users.

The Python interface is the only sane way I have ever found to use ROOT since the compiled C++ API has no clearly defined memory management so it is never clear if you are responsible for deleting objects or ROOT is (hence memory leaks or random crashes depending on how you guess wrong). Plus it also avoids the utter insanity of the C++-like interpreter: it's not C++ because it misses key functionality like virtual functions.

                So yes, ROOT is an utter piece of garbage that has insane bugs (I found one once

                • Fortunately the experiments I now work on have refused to touch it with a barge pole

                  I'm glad to hear that there are still sane projects around :)

                  And yes, matplotlib is nice.

                  Speaking of C++, we're using one piece of HEP software that is reasonably good - Geant4. That one is, indeed, C++ and can be used as such. Sadly, the way I see it used is - copy-paste an example, "enrich" it with another 75,000 lines of C-like code, including your configuration and yay! C++ here we go!

  • by ebonum ( 830686 ) on Sunday June 16, 2024 @08:09PM (#64554233)

    If the next guy can't follow what you wrote and reproduce your result, there isn't much point to publishing. Authors must include everything someone working in the field (other experts) needs to know to get the same result. Same code (you must share), same data set to train on (you must share), and same series of steps = same result. The code isn't thinking. It isn't intelligent. It is executing an algorithm. (Yes. Certain mathematical equations can lead to chaos. If this comes into play, this can be explained in detail and quantified in the paper.)
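The "same code, same data, same steps = same result" demand can be made mechanical. A hypothetical sketch (the function name and fields are invented for illustration): fingerprint everything a reviewer needs, so any silent change to code, data, or parameters is immediately detectable.

```python
import hashlib
import json

def experiment_fingerprint(code_text, data_rows, params):
    # Hash everything a reviewer needs to rerun the experiment exactly
    payload = json.dumps(
        {"code": code_text, "data": data_rows, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

fp1 = experiment_fingerprint("model.py v1", [[1, 2], [3, 4]], {"seed": 42})
fp2 = experiment_fingerprint("model.py v1", [[1, 2], [3, 4]], {"seed": 42})
assert fp1 == fp2  # identical inputs -> identical fingerprint

fp3 = experiment_fingerprint("model.py v1", [[1, 2], [3, 4]], {"seed": 43})
assert fp3 != fp1  # any change (here, the seed) is detectable
```

Publishing such a fingerprint alongside results is a cheap way to let "the next guy" confirm they are running exactly what the authors ran.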

    "data scientists may have large knowledge gaps in computer science, software engineering, theory of computation, and even statistics in the context of machine learning, despite those topics being fundamental to any ML project."
Dude. If you don't understand these things, don't publish. You're not qualified. Please STFU.

    What are these people doing? Calling an API someone else wrote and trying to fake a paper out of it? This isn't worthy of publishing. If you don't understand the code the API executes, please STFU. If you can't explain the How and the Why, there is no meat in your paper.

Not all, but many are trend-following fakes.

      Before AI we had: crypto, quantum blah, and a bunch of kids with English degrees who skimmed a css book calling themselves web developers.

      Welcome to the new fotw, just like the old one but spelled differently.

You raise a question that has bothered me. If you ask one of these so-called AI programs the same question repeatedly, will it always give the same answer? I am assuming that no additional data is added to the system between questions, since that would potentially change the answer. That would seem to differentiate them from humans, who can be very inconsistent in that regard.
      • No. You will not inherently get the same answer.

They utilize a seed value to drive some level of variability in the output across the problem space. If you had the seed you'd see the same results, but that wouldn't necessarily be useful overall.

        The image models work similarly, you want some uniqueness put into these models, or you'll only really experiment with a single problem space.

But you shouldn't be using an LLM for the entirety of data science;
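The seed-and-sampling mechanism described above can be sketched in miniature. This is not any particular vendor's API, just the generic temperature-sampling idea: softmax the logits, then draw one index; pin the seed and the draw repeats exactly.

```python
import math
import random

def sample_token(logits, temperature=1.0, seed=None):
    # Softmax over temperature-scaled logits, then sample one index
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1

# Pinning the seed makes the draw repeatable; leaving it None does not
assert sample_token([1.0, 2.0, 4.0], seed=7) == sample_token([1.0, 2.0, 4.0], seed=7)
```

With `seed=None` the RNG is seeded from system entropy, which is exactly the "same question, different answer" behavior the parent comment asked about.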
         

      • What about extending that, is AI guaranteed to give an honest truthful answer? I asked perplexity a question and to provide citations, it did but when I checked into the first few, they were all made up. Real authors, but the titles were fake. From news articles over the years (like lawyers getting in trouble for not double checking AI written briefs), it seems this isn't an isolated incident. Some LLMs give an answer, but is it just telling us what we want to hear/read? Or tries to influence opinions by ta
I work as a research developer/consultant in an academic institution. I'm currently trying to explain to a medical researcher that a random page they found on github (some guy's abandoned interface to a public database of biological data that changes over time, most recently updated 5 years ago) is not an 'established pipeline' for research purposes... but this is the kind of thing I encounter quite regularly.
      I don't think the depth of things that people new to the environment don't understan

      • Let me guess - the 'established pipeline' has already formed part of a funded research proposal in which someone self-estimated that you'd probably need no more than a week of money to attach it to the 'it worked for this specific case' pile of python that a research assistant left behind when their post finished and they moved overseas?

        • I see you're familiar with the environment.

          To be fair... I could do it in far less than a week if only the researcher would actually listen to me when I said something wasn't going to work. The problem is really emphasised in situations where the researcher pushes back on the idea that they should have to understand something strategic about the tool to use the tool... It very much feels in those situations like they think they're using a consumer service that should take responsibility for the task away fr

    • There is another issue, that fails to be mentioned.

When training a neural network on a dataset, the order of the input units is most likely randomized, and the order of input can (probably will) change the resulting weights of the nodes.

So unless the order is well defined (and not randomized*), it is impossible to get the same resulting network after training.

      * unless the randomizing function always produce the same series of random values, given the same seed and the seed is known.
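The footnote's point, that a seeded RNG restores determinism, can be demonstrated with a toy stand-in for training (the update rule here is an invented order-sensitive one, playing the role that SGD's order sensitivity plays in real training):

```python
import random

def toy_train(samples, seed):
    # Shuffle the data, then apply an order-sensitive update rule
    rng = random.Random(seed)
    order = samples[:]
    rng.shuffle(order)
    w = 0.0
    for x in order:
        w += 0.5 * (x - w)  # EMA-style step: the final w depends on the order
    return w

# Same data + same known seed -> bit-identical "weights"
assert toy_train([1.0, 2.0, 3.0, 4.0], seed=42) == toy_train([1.0, 2.0, 3.0, 4.0], seed=42)
```

With an unseeded shuffle, two runs would generally visit the data in different orders and end at different weights, which is exactly the reproducibility hazard the comment describes.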

      • Why does Wikipedia say:

        "Transformers have the advantage of having no recurrent units, and thus requires less training time than previous recurrent neural architectures, such as long short-term memory (LSTM),[7] and its later variation has been prevalently adopted for training large language models (LLM) on large (language) datasets, such as the Wikipedia corpus and Common Crawl.[8]"

        Do large language models actually need any neural networks at all?

From the little I have now read up on transformers, I think they are only there for encoding the input into, and decoding the output out of, the neural net.

Thus getting rid of the problem that long short-term memory models and recurrent models have, i.e. that the first data in the stream has less weight than later data. And a few other things.

    • The code is irrelevant. I can have crappy buggy code that I share and you reproduce my crappy results. At this point you can be either content with me being correct, or dig the code and find my errors. Generally, description of the method I used should be enough. You want to reproduce my results you reproduce my method in *your* code. Then the effect of my errors is eliminated and the problem will be clear.

      There is a big push in code sharing nowadays, but I think it is some form of self tricking into trus

  • by itamblyn ( 867415 ) on Sunday June 16, 2024 @08:37PM (#64554271) Homepage
The article makes it sound like a PhD in CS makes you better suited to work with data than an undergrad in physics. This does not agree with what I've seen. I'm sure there is a joke to be made about sample bias and anecdotal evidence, but notwithstanding that, this is what I have seen.
    • Re: Physics vs CS (Score:4, Insightful)

      by topham ( 32406 ) on Sunday June 16, 2024 @09:28PM (#64554315) Homepage

      You're not wrong.

You know what gets you better results: cross-discipline understanding.

Comp Sci talks about data a lot, and about applying the same process over it. What they often don't do much of is actual data analysis. They're more worried about the data structure to hold it than the analysis of the data itself.

    • The article says psychology. Not physics.

      It incidentally applies to some other fields in addition to psychology... but physics was into simulations decades ago, before marketers relied on people not understanding that Markov models can never do more than probability analysis, and are therefore useless in stochastic scenarios. Physics actually needs its outcomes to be directly testable.

      I think you can maybe see some fundamental differences between physics and psychology research methodology. I know I probabl

  • If it can't be reproduced - then it can't be called Data SCIENCE

  • by Gideon Fubar ( 833343 ) on Sunday June 16, 2024 @09:11PM (#64554297) Journal

It potentially could have been summarised as "many fields took the sales folks' proclamations about computing literally and thought that 'data science' was just an interface to a magic box. They have been suspending disbelief, and in practice have been applying transformations to data in ways that don't make functional sense or in some cases are actual formal fallacies"... But on the other hand it's a hard topic to talk about, and many people who seem to have read the article clearly haven't understood what it's about.

An obvious general example, a specific instance of which is used early in the piece, is the assertion that a random slice of data should fit a Gaussian distribution... but that's just the *pleasing* example, and asserting it will obviously distort the dataset. We understand this readily when discussing the basics of data distribution, but many researchers still make assumptions like this in actual research.
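The Gaussian assumption above is also cheap to check rather than assert. A minimal sketch using simulated data (real work would use a proper normality test, e.g. from scipy; here a hand-rolled skewness estimate on stdlib-generated samples makes the same point):

```python
import random
import statistics

def skewness(xs):
    # Third standardized moment: approximately 0 for Gaussian data
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
skewed = [rng.expovariate(1.0) for _ in range(10_000)]  # clearly non-Gaussian

print(round(skewness(gaussian), 2))  # close to 0
print(round(skewness(skewed), 2))    # clearly positive
```

A researcher who runs even this before fitting has evidence for or against the distributional assumption, instead of a pleasing default.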

    I wish this wasn't my personal experience, as a research developer. But here we are.

  • There is an immense diversity, but also disparities in skill, expertise, and knowledge among Data Scientists... In practice, depending on their backgrounds, data scientists may have large knowledge gaps in computer science, software engineering, theory of computation

    This isn't really new with "data science." You've got people in every IT discipline that don't know what the heck they're doing, and managed to shuffle along from one job to the next not doing much of anything. If you're hiring data scientists--or regular programmers--you need to know what you're doing (how to tell what quality looks like), or you'll get screwed.

  • by jimjoe ( 527975 )
GNU Guix [nature.com] could be a big part of the solution.
Data "science" is in many ways what economists have been doing for decades, when they thought that applying their analytical skills (which I am not gonna question) by throwing their spreadsheets at whatever problem they fancied would result in undeniable progress. You know: "hammer", "nail"... A bit like Elon Musk seeing himself as the world's best manufacturing engineer... I digress.

    This can certainly lead to progress. Out-of-the-box and sideways thinking is good. But at some point you also need to throw in actual

Unlike many sciences, where you have to ask yourself whether a result is reproducible, here the successful papers are instantly applied. There is a great new ML training algorithm? Expect hobby users to implement it to train their stable diffusion models within the next week. After the first hundred models you know if it really works better, or if they go back to the older algorithms.

  • Data scientists use Python for its simplicity, but core libraries like NumPy, Pandas, and Scikit-learn are written in C, C++, and Fortran for efficiency. This creates a black box environment where users rely on high-level APIs without understanding the underlying implementations. While this approach boosts productivity, ultimately, the reliance on abstracted, efficient libraries can result in a gap in fundamental understanding and control. Part of the solution could be to move away from Python to more effi
    • What do you mean, "black box"?

      This is like saying "BLAS and LAPACK are a black box environment".

      The source of all these libraries is available for anyone to examine and learn from. This is what you do when you study numerical methods and statistical computing, which I assume are the core of any "data science" education.
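The habit the parent comment describes, treating library routines as open references rather than black boxes, can be practiced directly: re-derive a routine from its textbook definition and check the library against it. A tiny illustration using the stdlib `statistics` module as a stand-in for NumPy/BLAS:

```python
import math
import statistics

def naive_pstdev(xs):
    # Textbook population standard deviation, with nothing hidden
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
assert math.isclose(statistics.pstdev(xs), naive_pstdev(xs))
print(naive_pstdev(xs))  # 2.0 for this classic example
```

Once the naive reference agrees with the library, you know exactly which definition the fast implementation computes, which is the opposite of black-boxing it.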

      Most certainly a "data scientist" should know in detail what they are doing, including the nitty-gritty details of how the software they use works, otherwise what is the point of having a "da

  • by akw0088 ( 7073305 ) on Monday June 17, 2024 @04:24AM (#64554725)
    So I view data science much like the current wave of Computer Security Professionals. A lot of people with degrees that have no value claiming to be experts but not knowing a single thing about what they are actually doing. These people combined with non-technical management is a recipe for disaster (which no one will acknowledge because that would mean admitting failure)
In decades past, machine learning was the red-headed stepchild of AI for, well, these basic reasons. In many ways it is the worst domain in the field. But it happens to run REALLY well on easy-to-build hardware, and you can make a lot of money using it, so the field has kinda ignored those problems and embraced it.

    ML is great for industry, esp recommendation systems and search (so, pretty much anywhere where advertisers want access to your eyeballs), but has never been a good tool for research.
Don't worry, Life Sciences already has you covered, with unreproducible results for decades now. I have had professors tell me that just because their results can't be reproduced doesn't mean jack. So if you can get a Nature or Science paper published with unreproducible results, and have been able to do so for decades, it's nothing new. The notion that if it is not reproducible it's not real seems to be something reserved for physics and chemistry alone.

  • I was recently at an ML/AI conference and when I asked a creator of an image classifier what results it gave for randomly generated images from black and white static to randomized color shapes to randomly oriented and placed features at the second-layer discrimination level (e.g. 12x12), they asked "why would you do that?" We have not reached the point where anyone wants to understand their black-boxes, they just want to find datasets where they can claim their black-box is better than someone else's usin
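The commenter's probe, feeding a classifier inputs unlike anything it was trained on, is easy to mimic with a deliberately tiny stand-in model (everything here is invented: a nearest-centroid "classifier" over two made-up classes, with 2-D points standing in for images):

```python
import random

# Nearest-centroid stand-in for an image classifier, two made-up classes
centroids = {"cat": (0.2, 0.2), "dog": (0.8, 0.8)}

def classify(x, y):
    # Always emits *some* label, however unlike the training data the input is
    return min(
        centroids,
        key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
    )

rng = random.Random(0)
noise = [(rng.random(), rng.random()) for _ in range(5)]
print([classify(x, y) for x, y in noise])  # random static still gets labels
```

The model has no notion of "none of the above," so random static is confidently labeled, which is precisely the blind spot the "why would you do that?" reaction reveals.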
So, "data science" is not science. There are now 4 kinds of lies: lies, damn lies, statistics, and data "science."
