Stack Overflow Will Charge AI Giants For Training Data (wired.com)
An anonymous reader quotes a report from Wired: Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says. The site has more than 20 million registered users. Stack Overflow's decision to seek compensation from companies tapping its data, part of a broader generative AI strategy, has not been previously reported. It follows an announcement by Reddit this week that it will begin charging some AI developers to access its own content starting in June.
"Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow's Chandrasekar says. "We're very supportive of Reddit's approach." Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need "to be trained on something that's progressing knowledge forward. They need new knowledge to be created." But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.
Chandrasekar says that LLM developers are violating Stack Overflow's terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they "are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license," Chandrasekar says. Neither Stack Overflow nor Reddit has released pricing information. "Both Stack Overflow and Reddit will continue to license data for free to some people and companies," notes Wired. "Chandrasekar says Stack Overflow only wants remuneration only from companies developing LLMs for big, commercial purposes."
"When people start charging for products that are built on community-built sites like ours, that's where it's not fair use," he says.
"Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow's Chandrasekar says. "We're very supportive of Reddit's approach." Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need "to be trained on something that's progressing knowledge forward. They need new knowledge to be created." But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.
Chandrasekar says that LLM developers are violating Stack Overflow's terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they "are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license," Chandrasekar says. Neither Stack Overflow nor Reddit has released pricing information. "Both Stack Overflow and Reddit will continue to license data for free to some people and companies," notes Wired. "Chandrasekar says Stack Overflow only wants remuneration only from companies developing LLMs for big, commercial purposes."
"When people start charging for products that are built on community-built sites like ours, that's where it's not fair use," he says.
Is that legal? (Score:4, Interesting)
We provided stackoverflow with responses/questions and now they are charging companies for access to it? I want AI to learn that stuff and give solutions to people.
Re: Is that legal? (Score:2)
Re: (Score:1)
Stack Overflow sees AI replacing them so they are trying to cash in while they can. AI is the new Stack Overflow...
It sure is and you don't get the snark and half-assed answers that you often get with stack overflow.
Reverse:Is that legal? (Score:5, Interesting)
Re: (Score:2)
It's Stack Overflow. Most of the answers are half baked at best. Often out of date, rarely best practice, security isn't even considered.
Closing the barn door... (Score:2)
Seems like the original terms & conditions could have been better. Experts Exchange died because they aggressively tried to monetize their contributors' IP. There's a middle ground, if they want it.
Re: (Score:2)
Oh God, I remember Expert Sexchange. Thank God it died.
Stack Overflow owns the user data? (Score:5, Insightful)
So who hosts your site for free? (Score:3, Interesting)
That's incredibly hypocritical that this hosting site wants money for "their" data when they never paid a cent for the data that helpful users posted there. If they're ready to take money for the data, they should pay money for the data.
So you think you can run a StackOverflow clone for free? I am pretty confident they have a surprisingly large run rate and cloud hosting cost. They've paid far more than "a cent" throughout the years keeping the site running, free of spam/porn, and building infrastructure to ensure smooth functionality as well as CDN fees to ensure it loads in a timely manner.
It costs a LOT of money to run a popular website. Is that equivalent to the value of what users contributed for free already?...not my place t
Re: So who hosts your site for free? (Score:1)
Google and Bing bring in revenue (Score:2)
Does it cost Stack Exchange more when AI software crawls the content than when search engines crawl it? Why don't they charge Google and Bing a fee to harvest their site?
Referrals bring business. ChatGPT swallows their data and gives answers without forcing the user to visit their website.
Access != ownership (Score:3)
While Stack Overflow may not own the copyright of any user-submitted content, they certainly have full rights to limit or deny anyone's access to their systems.
i.e. if you already have a copy of some code from their site, they probably cannot stop you from doing whatever you want with it, but they sure can block you from accessing their site anytime they want.
Not how Creative Commons works (Score:5, Insightful)
And then they'll pay their contributors, right? (Score:3, Insightful)
Given Stack Overflows license... (Score:2)
Re: (Score:2)
Ok it's about licensing - well the lawyers will have fun with that one. Their fun will be cashing the checks.
Re: (Score:2)
Re:Given Stack Overflows license... (Score:5, Insightful)
they can try, and make noise, and they're not alone. it's not just stackoverflow, as tfa reports it's also reddit and an association of over 2000 newspapers in the us. and a lot more, they all have smelt the blood in the water and want a share. for some like stackoverflow in particular this is critical because they are going out of business right away.
that's how this has always worked, current batshit crazy ip law didn't come from thin air, it exists because lobbies relentlessly pushed outright outlandish and ridiculous claims just like these which are now considered normal. besides, they're "teh media" no less. consider it done.
Re: (Score:1)
LOL! Stack Overflow was mined already (Score:2)
I am rooting for LLM in this fight (Score:4, Insightful)
As others have mentioned Stack Overflow is just a middleman and does not own the data. The data belongs to the user who uploaded it. Everybody, including LLMs should have access to it. If this ever goes to court, I hope LLMs win. We don't need more toll collectors on public data.
Just for fun ... (Score:3)
Offer them a discount if they're willing to click through Stack's GIANT "Accept all cookies" overlay every time they want to access data. :-)
Does the data set come with.... (Score:4, Funny)
Since installing copilot (Score:3)
I produced a few thousand lines of well written, cleanly documented and tested code yesterday without using Google or Stackoverflow even once. Now I spend most of my time simply auditing code and rewriting the descriptions of what I want copilot to generate to produce better results.
I had not really considered what impact this would have on stackoverflow. But honestly, I cannot imagine I'll visit their site very often in the future. And this is sad because I often learned quite a lot reading through more responses.
I hope they manage to survive the existence of AIs, but somehow I do not think I will actually notice if they fade into the abyss.
Good Luck (Score:1)
simple solution - have GPT contribute back (Score:2)
Re:simple solution - have GPT contribute back (Score:4, Interesting)
It's not really possible for GPT to "contribute back". I don't mean the temporary ban, that's probably widely ignored, it's that these kinds of models simply can't produce anything novel.
Not that you'd want it to contribute anything. I've explained before why AI generated content is poison for future models. Sadly, this fad will be over before that becomes a real problem.
Re: (Score:2)
Re: (Score:2)
I shouldn't need to point this out, but models like this lack anything like understanding or analysis. It's just probability. There is no possibility for novelty here.
This is easy to see in other kinds of models, where what is being retained is more obvious/direct, like an n-gram model, but make no mistake, there is no fundamental difference between the two. Both merely produce output on the basis of statistical information encoded in the model from the training data.
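To make the n-gram case mentioned above concrete, here is a minimal sketch of a bigram model. The toy corpus and the function names `train_bigram` and `generate` are assumptions made up for illustration, not anything from a real library. It only shows the n-gram side of the comparison: every adjacent word pair it emits was already present in the training text.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Sample up to `length` tokens, drawing each next token in
    proportion to how often it followed the previous one in training."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            # The last token never had a successor in the training text.
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = train_bigram(corpus)
sample = generate(model, "the", 8)
print(" ".join(sample))
```

The output can be a sentence that never appeared verbatim in the corpus, yet it is assembled entirely from observed word pairs. Whether that carries over to large neural models is the commenter's claim, not something this sketch demonstrates.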
That's probably worth digging into a
New ChatGPT responses (Score:2)
Sorry, your programming question is not about programming.
Your question is a duplicate, even though the "duplicate" has nothing to do with your question.
Why would you want to know that? You should solve your problem a way I consider easier.
Selling is stupid (Score:2)
You don't sell training data... you charge royalties for anything built on it.
Otherwise, it's a one-time cost for the AI trainer and afterwards they get all the ongoing profits from your work.