OpenAI lets websites block GPTBot, a web crawler that looks for AI training data

Gelo Gonzales

OPENAI. The OpenAI logo and 'artificial intelligence' words are seen in this illustration taken May 4, 2023.

Dado Ruvic/Reuters

OpenAI says 'web pages crawled with the GPTBot user agent may potentially be used to improve future models'

MANILA, Philippines – OpenAI is letting website owners opt out of being scraped by its web crawler GPTBot. 

Web crawlers are most commonly used by search engines to browse the internet and index websites. The ChatGPT creator will use GPTBot to improve its AI systems by scouring the internet for websites that hold publicly available, potentially valuable training data.

On its blog, OpenAI said, “Web pages crawled with the GPTBot user agent may potentially be used to improve future models…. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”

The company also said some websites are automatically excluded from being scraped, such as those “that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.” 

GPTBot would be most useful to OpenAI if it could access every website freely, but the company has also released instructions on how to block it. Site owners can either add a few lines to their website’s robots.txt file, a document that tells crawlers which parts of a website they can access, or block the bot’s IP address.

The specific instructions can be found in OpenAI’s blog post. 
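As a rough illustration of the robots.txt route described above, the sketch below checks how a rule aimed at the GPTBot user agent would be read by a crawler that honors the file. The two-line rule is the site-wide block as we understand it from OpenAI's published guidance; the `is_allowed` helper, the `SomeOtherBot` name, and the example.com URLs are placeholders made up for this example. It uses Python's standard-library parser, which is not necessarily how GPTBot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: blocks GPTBot from the entire site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from a live website.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# GPTBot is blocked everywhere on the site.
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1"))  # False

# Crawlers not named in the file are unaffected by this rule.
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/articles/1"))  # True
```

A rule like this only works if the crawler chooses to respect robots.txt, which is why blocking by IP address is offered as a separate option.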

OpenAI’s bot opt-out options come amid an ongoing debate on AI makers’ use of internet data to train their systems, specifically generative AI such as ChatGPT and Google Bard. 

As these tools rose in popularity, platforms such as Reddit, Twitter (now known as X), and developer-centric forum Stack Overflow announced plans to charge for access to their data, which AI companies need to train their systems.

Several lawsuits have also been filed against AI companies, alleging copyright infringement for scraping creative works without the consent of their rights holders. – Rappler.com

Gelo Gonzales

Gelo Gonzales is Rappler’s technology editor. He covers consumer electronics, social media, emerging tech, and video games.