Reddit user posts will fuel Google’s LLM – others should think twice before trying to do the same

Changing T/Cs retrospectively ain't gonna work


Reddit looks set to hand its vast archive of user-generated content to Google as AI model training fodder without too many legal headaches, but smaller-scale tech leaders looking to do the same are advised to tread carefully.

The deal, worth a reported $60m a year, was confirmed at the end of last week. It coincided with Reddit’s long-awaited IPO filing, which claimed “Reddit will be core to the capabilities of organizations that use data as well as the next generation of generative AI and LLM platforms” and made clear its intention to monetize its data.

Google gushed, “Reddit plays a unique role on the open internet as a large platform with an incredible breadth of authentic, human conversations and experiences, and we’re excited to partner to make it even easier for people to benefit from that useful information.”

Under the deal, it continued, Reddit will use Google’s Vertex AI tools, while Google “now has access to Reddit’s Data API, which delivers real-time, structured, unique content from their large and dynamic platform.”

This means Google “will now have efficient and structured access to fresher information, as well as enhanced signals that will help us better understand Reddit content and display, train on, and otherwise use it in the most accurate and relevant way.”

The move inevitably sparked ire amongst some Redditors and weary resignation from others. More broadly, observers were concerned that all public discourse is now simply grist to the AI training mill.

Eliot Bendinelli at Privacy International said the move “perfectly illustrates the problematic nature of internet giants' business model. It consolidates the exploitation of people's personal data for profit and power, with little regard for our rights, interests and security.”

But there seems little that reluctant Redditors can do about it. Reddit’s own T/Cs are widely drawn, warning users they grant it a (perpetual) license for an incredibly wide range of purposes, and that Reddit gains the right to user content for “syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit.”

Mark Webber, US managing partner at Fieldfisher, said, “Broadly, if you have a use right and that right includes the right to sublicence to another company, then you can use content to train AI or licence that content for third parties to train AI. But if that content is personal data, there's more to do to be fair.”

Giles Parsons, partner at UK law firm Browne Jacobson, said terms of service have to be fair, but “I think that the UK courts would likely find that a free platform is able to licence public content, including to AIs, as Reddit appear to be doing. Users concerned about the selling of the data they put on services like this need to make sure they read the terms before deciding whether to post.”

However, other companies looking at Reddit’s move and thinking that it gives them carte blanche to repurpose data may not be in the clear.

Webber said businesses need to think about the right to use and whether the user can grant that right. “I could not upload a photo where the copyright is owned by someone else and grant a license to content that isn't mine.”

They also need to consider privacy concerns. “There is an obligation to explain what personal information is collected and why, i.e. for what purposes. Then, as well as notice, either consent is required, which is cumbersome to collect but may sometimes be required, or, if it's possible to work on a lawful basis other than consent, best practice would say a form of choice, like an opt-out.”

The laws around the world have some nuances, he added. “So the combination of the Terms, the Privacy Notice and typically an Acceptable Use Policy (AUP) is going to set the platform up for success.”

Webber said the context of the platform was also important. “If B2C, the platform has a relationship and direct contract. If B2B, the platform has a relationship with a business, typically the employer, [and] the employer can't give permissions for its employees and particularly when privacy is involved.”

Parsons added, “Whether other companies can use their users’ data is going to depend on what the data is. Is it personal data? Do they have a licence to sublicence it?

“What it does tell you though is that data is valuable and if you want to monetise it, ensure you have the appropriate ownership or licence to do that.”

Terms of service apart, Bendinelli said, “Before their data gets embedded in AI products, people deserve to be informed and asked for their consent. We need to know what is happening to our data, what it will be used for and by whom.”

He compared the practices around AI to those in the early days of the adtech industry, adding it took too long for regulators to bring that industry under control, and the same must not be allowed to happen in the AI sector.

“Regulators must take urgent steps to protect users' interests in those juicy deals before they transform all online spaces into nothing more than perpetual surveillance machines designed to enrich companies beyond our control.”

Ironically, the deal came just a few weeks after the FTC warned companies about surreptitiously changing their terms of service to allow data to be shovelled into AI models.

“Market participants should be on notice that any firm that reneges on its user privacy commitments risks running afoul of the law,” it said.

The regulator warned it would continue to bring actions against companies that engage in unfair or deceptive practices, “including those that try to switch up the ‘rules of the game’ on consumers by surreptitiously re-writing their privacy policies or terms of service to allow themselves free rein to use consumer data for product development.”

Reddit, it should be said, has not changed the rules of the game. Previous versions of its terms going back years said exactly the same thing.

Perhaps the bigger question is just how much value Google will be getting. As one Redditor noted, “Lots of content on here already is created by bots and AI ...so AI training AI I guess ¯\_(ツ)_/¯”