artificial intelligence

The Guardian, New York Times, CNN figure in growing list of sites blocking OpenAI crawler

Gelo Gonzales

CHATGPT. A smartphone with a displayed ChatGPT logo is placed on a computer motherboard in this illustration taken February 23, 2023

Dado Ruvic/Reuters

'The scraping of intellectual property from the Guardian’s website for commercial purposes is, and has always been, contrary to our terms of service,' says a spokesperson from the site's publishing firm

MANILA, Philippines – The Guardian on Friday, September 1, became the latest major news publication to block GPTBot, the web crawler of ChatGPT maker OpenAI.

OpenAI uses GPTBot to crawl websites and gather data that can be used to train AI systems and LLMs (large language models) such as its own GPT (Generative Pre-trained Transformer).

In August, the company published a blog post with instructions on how to block GPTBot. OpenAI has never disclosed what data and content it used to train its systems. Its blog post on blocking GPTBot also revealed that the company was indeed using a web crawler to scrape data from sites.
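For site owners, the block is typically set in a site's robots.txt file, with a rule that names the GPTBot user agent and disallows it from the whole site. The Python sketch below is illustrative only and is not taken from OpenAI's post: the robots.txt content and the `is_blocked` helper are assumptions made for this example, and it relies on the standard library's `urllib.robotparser` to check which crawlers such a file turns away.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: GPTBot is shut out of the
# whole site, while every other crawler is still allowed in.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if the robots.txt rules forbid user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        status = "blocked" if is_blocked(ROBOTS_TXT, agent) else "allowed"
        print(f"{agent}: {status}")
```

A rule like this is a request rather than an enforcement mechanism: it only works if the crawler chooses to honor robots.txt.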

While search engines like Google use bots to index sites for search results, the benefit to publishers of letting AI systems freely process copyrighted content is unclear.

The Guardian quoted a spokesperson for Guardian News & Media, the site’s publisher as well as the Observer’s, on the blocking: “The scraping of intellectual property from the Guardian’s website for commercial purposes is, and has always been, contrary to our terms of service. The Guardian’s commercial licensing team has many mutually beneficial commercial relationships with developers around the world, and looks forward to building further such relationships in the future.”

Other major news sites such as the New York Times, CNN, Reuters, the Washington Post, and Bloomberg have blocked GPTBot, the Guardian reported. It also said that major non-news sites such as the question-and-answer site Quora, Amazon, and dictionary.com have blocked the bot.

CNN’s Reliable Sources, its newsletter on the information economy, found that other media sites and entities such as Disney, The Atlantic, Insider, ABC News, and ESPN, and publishers Condé Nast, Hearst, and Vox Media have also blocked GPTBot.

Axios, which also blocks the bot, reported that nearly 20% of the top 1,000 websites in the world are blocking crawlers for AI services, citing data from AI content detector Originality.AI.

Amid the blocking, X recently updated its privacy policy to confirm its use of public data to train AI models. Google, which is behind the AI tool Bard, proposed in August a revision of copyright laws in Australia that would allow it to gather data unless a rights owner opts out. – Rappler.com

author

Gelo Gonzales

Gelo Gonzales is Rappler’s technology editor. He covers consumer electronics, social media, emerging tech, and video games.