Tech Thoughts: Media needs a united front against data scraping to train AI

Data scraping to train artificial intelligence is a big issue for the web, seeing as only a handful of companies will be able to benefit – through training their AIs – from the content and information coming from millions of people online.

While OpenAI, the maker of ChatGPT, has already confirmed the existence of its data-scraping GPTBot, Google is pushing for an environment where people are co-opted by default into sharing their data for the purposes of training AI. Its proposal, made earlier this month in Australia, asks for revisions to copyright laws that would facilitate training of AI models in the country by allowing “workable opt-outs for entities that prefer their data not to be trained in using AI systems.”

I’d have thought that this would light a fire under media companies and website owners large and small to create a united front pushing for either an opt-in approach to data scraping to train AI – meaning you have to choose to allow your data to be scraped instead of opting out of such things – or otherwise finding ways to make it easier for everyone to opt out of being mined for AI training.

What seems to be happening instead is that the media entities that stand to gain the most by selling their data – in this case, The New York Times (NYT) – are forging their own path instead of teaming up with others to ensure data is treated fairly when it comes to artificial intelligence training.

NYT’s Terms of Service changes

According to Nieman Lab, the NYT updated its terms of service to essentially say that scraping the media organization's data is a no-no.

Specifically, under Section 2.1 of its terms of service, it says, “Non-commercial use does not include the use of Content without prior written consent from The New York Times Company in connection with: (1) the development of any software program, including, but not limited to, training a machine learning or artificial intelligence system; or (2) providing archived or cached data sets containing Content to another person or entity.”

Meanwhile, Section 4, which lists prohibited uses of the service, says that without NYT’s prior consent, other entities may not “use the Content for the development of any software program, including, but not limited to, training a machine learning or artificial intelligence system.”

Dropping out of the so-called ‘coalition’

Semafor reported on August 13 that the NYT would also be dropping out of a “group of media companies attempting to jointly negotiate with the major tech companies over use of their content to power artificial intelligence.”

This so-called “coalition” wasn’t of all media stakeholders, mind you, but rather a group of major media entities trying to negotiate a good deal for the use of their data.

As such, the coalition they were trying to build did not include smaller outlets around the world which could also use support and funding to further local journalism.

The New York Times’ actions splintered an already fragmented alliance that did not take into account all the possible stakeholders who might be harmed by data scraping, AI training, and the potential other harmful ramifications of AI on the media and information landscape.

Stronger together than apart

What does this mean for AI companies and media?

It means AI companies will probably pick or fund those institutions carrying the largest datasets so they can use that data to train their AI. The NYT appears to be making its case for negotiations by standing on its own in the hope of finding an AI-loving patron to appreciate it.

Actions like forming a coalition of only major news outlets and splintering alliances leave out smaller outfits, as I mentioned. My mind immediately goes to news outfits under attack around the world, including the Marion County Record, which was doing good journalism in Kansas but was raided by police over allegations of “identity theft” that supposedly violated a local restaurant owner’s privacy. Such actions also make it more difficult for even major media groups to unite and say “No thanks” to data scraping.

I wish small media outfits and larger, established media companies would band together to push back against data scraping for AI, either to force an opt-in agreement for data training, or to make AI companies pay the various news groups, big and small, for their contributions.

Maybe it will work, or perhaps it won’t, but if all these groups stood their ground as one to say, “If you want our news, pay for it!” then maybe the AI-owning companies would put their money where their mouths are and actually stop exploiting the web, and news organizations, for profit. – Rappler.com

Victor Barreiro Jr.

Victor Barreiro Jr is part of Rappler's Central Desk. An avid patron of role-playing games and science fiction and fantasy shows, he also yearns to do good in the world, and hopes his work with Rappler helps to increase the good that's out there.