AI Twitter

Stack Overflow Will Charge AI Giants For Training Data (wired.com) 31

An anonymous reader quotes a report from Wired: Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says. The site has more than 20 million registered users. Stack Overflow's decision to seek compensation from companies tapping its data, part of a broader generative AI strategy, has not been previously reported. It follows an announcement by Reddit this week that it will begin charging some AI developers to access its own content starting in June.

"Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow's Chandrasekar says. "We're very supportive of Reddit's approach." Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need "to be trained on something that's progressing knowledge forward. They need new knowledge to be created." But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.

Chandrasekar says that LLM developers are violating Stack Overflow's terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they "are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license," Chandrasekar says. Neither Stack Overflow nor Reddit has released pricing information.
"Both Stack Overflow and Reddit will continue to license data for free to some people and companies," notes Wired. "Chandrasekar says Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes."

"When people start charging for products that are built on community-built sites like ours, that's where it's not fair use," he says.
This discussion has been archived. No new comments can be posted.

  • Is that legal? (Score:4, Interesting)

    by backslashdot ( 95548 ) on Friday April 21, 2023 @06:24PM (#63468342)

    We provided stackoverflow with responses/questions and now they are charging companies for access to it? I want AI to learn that stuff and give solutions to people.

    • Stack Overflow sees AI replacing them so they are trying to cash in while they can. AI is the new Stack Overflow…
      • Stack Overflow sees AI replacing them so they are trying to cash in while they can. AI is the new Stack Overflow…

        It sure is and you don't get the snark and half-assed answers that you often get with stack overflow.

    • by kiviQr ( 3443687 ) on Saturday April 22, 2023 @12:20AM (#63468944)
      Think about the future: everyone asks GPT and no one contributes to public knowledge like stackoverflow. We will be owned by the GPT knowledge base.
      • by AmiMoJo ( 196126 )

        It's Stack Overflow. Most of the answers are half baked at best. Often out of date, rarely best practice, security isn't even considered.

  • ...after the horse has bolted.
    Seems like the original terms & conditions could have been better. Experts Exchange died because they aggressively tried to monetize their contributors' IP. There's a middle ground, if they want it.
  • by Dr. Spork ( 142693 ) on Friday April 21, 2023 @06:30PM (#63468360)
    It's incredibly hypocritical that this hosting site wants money for "their" data when they never paid a cent for the data that helpful users posted there. If they're ready to take money for the data, they should pay money for the data.
    • It's incredibly hypocritical that this hosting site wants money for "their" data when they never paid a cent for the data that helpful users posted there. If they're ready to take money for the data, they should pay money for the data.

      So you think you can run a StackOverflow clone for free? I am pretty confident they have a surprisingly large run rate and cloud hosting cost. They've paid far more than "a cent" throughout the years keeping the site running, free of spam/porn, and building infrastructure to ensure smooth functionality, as well as CDN fees to ensure it loads in a timely manner.

      It costs a LOT of money to run a popular website. Is that equivalent to the value of what users contributed for free already?...not my place t

    • While Stack Overflow may not own the copyright of any user-submitted content, they certainly have full rights to limit or deny anyone's access to their systems.

      i.e. if you already have a copy of some code from their site, they probably cannot stop you from doing whatever you want with it, but they sure can block you from accessing their site anytime they want.

  • by JoshuaZ ( 1134087 ) on Friday April 21, 2023 @06:31PM (#63468364) Homepage
    This is not how Creative Commons works. Humans can learn from it. It is very unclear why an LLM is any different. But more to the point, if you really, sincerely think that an LLM doing this is violating Creative Commons, then charging for it does not make the violation go away. Nothing in the Creative Commons license allows Stack Exchange to charge a fee to let people violate the license.
  • by ffkom ( 3519199 ) on Friday April 21, 2023 @06:33PM (#63468366)
    Or is Stack Overflow, a company that lives off the content others contribute for free, just squealing on other companies that would also like to live off the same content for free?
  • I don't believe they can, either legally or morally.
    • Why can't they charge someone that wants to put a relatively large strain on their servers? They don't have to charge for the data itself, just the data mining type of access to it.

      Ok it's about licensing - well the lawyers will have fun with that one. Their fun will be cashing the checks.
    • by znrt ( 2424692 ) on Friday April 21, 2023 @07:16PM (#63468482)

      they can try, and make noise, and they're not alone. it's not just stackoverflow, as tfa reports it's also reddit and an association of over 2000 newspapers in the us. and a lot more, they all have smelt the blood in the water and want a share. for some like stackoverflow in particular this is critical because they are going out of business right away.

      that's how this has always worked, current batshit crazy ip law didn't come from thin air, it exists because lobbies relentlessly pushed outright outlandish and ridiculous claims just like these which are now considered normal. besides, they're "teh media" no less. consider it done.

    • by Anonymous Coward
      They cannot. Stack Overflow contributions are accepted under the CC-BY-SA 4.0 [creativecommons.org] license. Stack Overflow may not charge for access to its content, nor can the LLM thieves charge for any services they build using that data.
  • ChatGPT and others were already trained with Stack Overflow data. Sorry SO, that ship has sailed.
  • by linuxguy ( 98493 ) on Friday April 21, 2023 @07:23PM (#63468498) Homepage

    As others have mentioned Stack Overflow is just a middleman and does not own the data. The data belongs to the user who uploaded it. Everybody, including LLMs should have access to it. If this ever goes to court, I hope LLMs win. We don't need more toll collectors on public data.

  • by fahrbot-bot ( 874524 ) on Friday April 21, 2023 @07:26PM (#63468512)

    Offer them a discount if they're willing to click through Stack's GIANT "Accept all cookies" overlay every time they want to access data. :-)

  • by HotNeedleOfInquiry ( 598897 ) on Friday April 21, 2023 @07:50PM (#63468558)
    Typical Stack Overflow cynicism and smart-asserry?
  • by LostMyBeaver ( 1226054 ) on Friday April 21, 2023 @10:19PM (#63468758)
    I have to admit that since installing copilot, I have barely once searched google for a coding answer or found myself on stackoverflow.

    I produced a few thousand lines of well written, cleanly documented and tested code yesterday without using Google or Stackoverflow even once. Now I spend most of my time simply auditing code and rewriting the descriptions of what I want copilot to generate to produce better results.

    I had not really considered what impact this would have on stackoverflow. But honestly, I cannot imagine I'll visit their site very often in the future. And this is sad because I often learned quite a lot reading through more responses.

    I hope they manage to survive the existence of AIs, but somehow I do not think I will actually notice if they fade into the abyss.
  • Copyright law allows the creators of works to control who can make copies of their work. It does not give them rights to control who... or what... is allowed to read or learn from their works or for what purpose. It'd be hard to call ML training a public performance. They might argue that a knowledge model is a derivative work. But that's gonna be hard to swing in court since the model is just information ABOUT the training material and it's so hard to get the model to reconstitute anything recognizable a
  • someone needs to ask GPT to answer all open questions on stack overflow.
    • by narcc ( 412956 ) on Saturday April 22, 2023 @12:50AM (#63468968) Journal

      It's not really possible for GPT to "contribute back". I don't mean the temporary ban, that's probably widely ignored, it's that these kinds of models simply can't produce anything novel.

      Not that you'd want it to contribute anything. I've explained before why AI generated content is poison for future models. Sadly, this fad will be over before that becomes a real problem.

      • by kiviQr ( 3443687 )
        My guess is that 60% of questions are duplicates. 20% could be derived from code/github access to source code and issues. For remaining 20% GPT could show intelligence by answering - I don't know.
        • by narcc ( 412956 )

          I shouldn't need to point this out, but models like this lack anything like understanding or analysis. It's just probability. There is no possibility for novelty here.

          This is easy to see in other kinds of models, where what is being retained is more obvious/direct, like an n-gram model, but make no mistake, there is no fundamental difference between the two. Both merely produce output on the basis of statistical information encoded in the model from the training data.

          That's probably worth digging into a

  • Sorry, your programming question is not about programming.
    Your question is a duplicate, even though the "duplicate" has nothing to do with your question.
    Why would you want to know that? You should solve your problem a way I consider easier.

  • You don't sell training data... you charge royalties for anything built on it.

    Otherwise, it's a one-time cost for the AI trainer and afterwards they get all the ongoing profits from your work.
