Reddit strikes AI training deal for user content

Reddit’s reported deal with an unnamed AI partner to repurpose its users’ musings and self-generated “content” as model training fodder has sparked concerns over exactly what Redditors have signed up for.

The deal, according to Bloomberg, will be worth $60m a year, and comes ahead of an expected IPO by Reddit.

Reddit, like many content-generating and aggregation sites, has been pushing back against AI giants that have been scraping the web for content to train models. It also made API changes last year to prevent third-party apps from free-riding off its (users’) content.

At the time, it said, “Reddit needs to be a self-sustaining business, and to do that we can no longer subsidize commercial entities that require large scale data use from our API.”

If the reports are true, its putative AI partner would gain access to a mountain of content. The site has been active since 2005 and claims over 100,000 communities and 70 million unique visitors a day. That amounts to over 16 billion posts and comments to fuel and tune whatever foundational model the potential partner is running.

But it might also be expected to spook users of Reddit’s communities, the topics of which range from the inane – r/Renault, Adam and the Ants – to the topical – r/uktrains – to the really, definitely not safe for work.

However, anyone looking to pull their content of any description, or otherwise object to its being used to train AI, may be disappointed.

The emergence of the potential deal coincides with a scrub-up of the Reddit user agreement, flagged up to users in recent days.

The updated terms make it clear to users in the US and outside the EEA, UK and Switzerland, that when they post content, “You grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world.”

Reddit also gains the right to “make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit.” It also gets to remove associated metadata, and users “irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.”

Users in the UK, EEA and Switzerland shouldn’t breathe a sigh of relief. They’re covered by the same clause. And similar wording has been used in previous versions of the user agreement.

A Reddit spokesperson declined to comment.

But Bruna de Castro e Silva, AI Governance Specialist at Saidot, a Finnish AI governance and alignment company, said “As a social news forum that hosts millions of daily users, the data used to shape AI models will come from millions of diverse, individual sources who haven’t necessarily given explicit consent for their contributions to be used to develop AI models.”

She said Reddit should clarify exactly how its terms and conditions allow for selling on user content for AI, and whether users have an option to opt out. It also needs to clarify the status of archived data, she said.

“Compounding these concerns, potential copyright issues arise if users have posted content belonging to others on Reddit. Until these issues are properly addressed, Reddit is treading on uncertain territory when it comes to privacy and copyright law.”

The situation underlined the need for “clear, enforceable rules” around data usage in AI training.

“While such partnerships can drive innovation, they mustn’t do so at the expense of ethical considerations and individual rights,” de Castro e Silva said. “AI firms and platforms alike must take responsibility for ensuring transparency, informed consent, and robust data protection and governance.”

If Reddit has sealed a deal, both parties must be pretty sure it’s sufficiently bulletproof. In which case, it seems inevitable that Reddit’s content will flow into an AI model, while Reddit and its partner work through any legal fallout after the fact.

Just how sophisticated the end result will be remains to be seen. Perhaps we can look forward to a chatbot that can deliver the definitive answer to the eternal question, “AITA”?