Stack Overflow Will Charge AI Giants for Training Data

Large language models can generate strings of text based on word patterns learned from the web pages, books, and other bodies of text in their training data. Besides ChatGPT, the programs make up the guts of search chatbots such as Microsoft Bing chat and Google’s Bard, and they underlie a growing number of applications that produce professional and creative copy in a flash. Their counterparts that generate AI-composed illustrations and videos draw on patterns from image datasets such as photos gathered from Pinterest and Flickr.

Often, datasets used in AI development are built through unofficial means such as dispatching software that scrapes content from websites. In the US, that is typically considered legal, though copyright issues and websites’ terms of use against the practice have left it in dispute.

A few websites such as Reddit and Stack Overflow have been more inviting. They offer downloadable “data dumps” or real-time data portals, known as APIs, that help software access their content. In Stack Overflow’s case, LLM developers are getting their hands on data through a mix of dumps, APIs, and scraping, Chandrasekar says, all of which today can be done for free.

But Chandrasekar says that LLM developers are violating Stack Overflow’s terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.

Neither Stack Overflow nor Reddit has released pricing information. “We’re working on that as we speak,” Reddit spokesperson Tim Rathschmidt says, “and will share more with partners in the coming weeks.” Stack Overflow will study Reddit’s strategy and consult with its own potential customers, some of whom have already reached out about data access, Chandrasekar says.

A potential roadmap to pricing could come from Elon Musk, who this month hiked prices for access to Twitter data. They start at $42,000 per month for access to 50 million tweets. About three times that volume of tweets had previously been available for free. In a tweet this week, Musk accused Microsoft, a major AI developer and close partner of OpenAI, of training algorithms “illegally using Twitter data.” Without elaboration, he added, “Lawsuit time.”

Both Stack Overflow and Reddit will continue to license data for free to some people and companies. Chandrasekar says Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes. “When people start charging for products that are built on community-built sites like ours, that’s where it’s not fair use,” he says.

Reddit CEO Steve Huffman told The New York Times this week that he didn’t want to give a freebie to the world’s largest companies. “Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with,” he said.
