OpenAI will pay to scrape Stack Overflow data; surface it with links

That's the sound of a lifeline being thrown...

OpenAI will pay Stack Overflow to scrape code from its community under a new “API partnership” that will also see OpenAI “surface validated technical knowledge from Stack Overflow directly into ChatGPT.”

Stack Overflow is a Quora- or Reddit-like Q&A forum where grizzled developers lurk and sometimes answer your esoteric programming-related questions, and sometimes simply insult you. The company was bought by investor Prosus for $1.8 billion in a deal that closed in August 2021.

Just over a year later, ChatGPT launched, and its ability (like that of other LLMs) to sometimes make light work of obscure coding-related questions suggested that Stack Overflow might be killed off by generative AI. Even if that turns out to be the case, it will take some cash to the grave.

The new partnership will see OpenAI “utilize Stack Overflow’s OverflowAPI product” (that’s “a subscription-based API service that provides continuous access to Stack Overflow’s public dataset to train and fine-tune large language models”) and “collaborate with Stack Overflow to improve model performance for developers who use their products.”

This integration will “help OpenAI improve its AI models using enhanced content and feedback from the Stack Overflow community and provide attribution to the Stack Overflow community within ChatGPT to foster deeper engagement with content,” Stack Overflow said on May 6.

“You have to wonder how that will look in the long run,” mused Apache Cassandra community member Patrick McFadin on LinkedIn.

“Stackoverflow answers are usually based on other people’s experience and many times it’s some arcane knowledge... ‘The docs say do it this way, but here’s what really works.’ Unless there’s a simulator out there trying everything and getting real world experience, I don’t know how that gets replaced... One thing I can hope for is a much better signal to noise ratio. Elimination of the RTFM [read the f******* manual] questions and a better way to reward people sharing actual experience.”

The deal is the latest commercial partnership OpenAI has signed with organisations whose data some observers had presumed it was already scraping. A deal with the Financial Times, inked in April, also surfaces clearly attributed content in ChatGPT responses, including “quotes and links to FT journalism in response to relevant queries.”

Google, meanwhile, has signed a deal worth an estimated $60 million to scrape data from Reddit and use it to train its AI models.