OpenAI lets websites block GPTBot, a web crawler that looks for AI training data

MANILA, Philippines – OpenAI is letting website owners opt out of being scraped by its web crawler GPTBot.

Web crawlers are most commonly used by search engines to browse the internet and index websites. OpenAI, the creator of ChatGPT, will use its GPTBot crawler to improve its AI systems by scouring the internet for websites that hold publicly available, potentially valuable training data.

On its blog, OpenAI said, “Web pages crawled with the GPTBot user agent may potentially be used to improve future models…. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”

The company also said some websites are automatically excluded from being scraped, such as those “that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”

Although GPTBot would benefit most from unrestricted access to every website, OpenAI has also released instructions on how to block the bot. Site owners can either add a few lines to their website’s robots.txt file, a document that tells crawlers which parts of a site they may access, or block the crawler’s IP addresses.

The specific instructions can be found in OpenAI’s blog post.
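
As a rough illustration of the robots.txt approach, the sketch below assumes a site-wide block written as a `User-agent: GPTBot` line followed by `Disallow: /`, and uses Python's standard `urllib.robotparser` module to check how a crawler that honors robots.txt would read it. The `is_allowed` helper and the example.com URL are hypothetical names used only for this example; OpenAI's own post remains the authority on the exact lines to add.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for a site-wide GPTBot block.
# Check OpenAI's instructions for the exact lines to use on a real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # hypothetical URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

Under these rules, only the crawler identifying itself as GPTBot is turned away, while crawlers with other user agents are unaffected. Note that robots.txt is a request that well-behaved crawlers follow, not an enforcement mechanism, which is why blocking by IP address is offered as a second option.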

OpenAI’s bot opt-out options come amid an ongoing debate on AI makers’ use of internet data to train their systems, specifically generative AI such as ChatGPT and Google Bard.

As these tools rose in popularity, platforms such as Reddit, X (formerly Twitter), and the developer-centric forum Stack Overflow announced plans to charge for access to their data, which AI companies need to train their systems.

Various lawsuits have also been filed against AI companies under copyright laws for scraping data, allegedly without the consent of the rights holders of the creative works involved. – Rappler.com

author

Gelo Gonzales