Authors allege Meta used pirated books to train its AI, despite warnings from lawyers

Victor Barreiro Jr.

META. Meta AI logo is seen in this illustration taken September 28, 2023

Dado Ruvic/Reuters

The amended complaint includes chat logs from a Meta-affiliated researcher who asked lawyers if it was 'legally ok' to use a dataset of allegedly pirated book files

MANILA, Philippines – A new filing in a copyright infringement lawsuit against Meta Platforms alleges that Meta's own lawyers warned the company against using thousands of pirated books to train its AI models, but that the company did so anyway.

Reuters reports that the amended complaint – filed Monday night, December 11 – consolidates two lawsuits brought by comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon, and other authors.

The lawsuits allege Meta used their written works without permission to train its Llama artificial intelligence (AI) large language model (LLM). In this case, the direct copyright infringement claim comes from the alleged use of a pirated book dataset to train the Llama LLM.

The Atlantic, in an August report, noted how pirated books were powering generative AI’s LLMs.

A California judge dismissed part of the Silverman lawsuit in November, but gave the authors permission to amend their claims.

Chat logs

The new filing quotes chat logs from a Meta-affiliated researcher, Tim Dettmers, who discussed with Meta's lawyers whether procuring the specific dataset of book files via a Discord server would be "legally ok." The chat logs are seen as potentially significant evidence, as they could show that Meta was aware the use of the books may not be protected by US copyright law.

Dettmers in 2021 wrote that “At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons.”

The first version of Meta’s Llama large language model, released in February, listed among its training datasets “the Books3 section of ThePile.” The complaint added that the person who assembled that dataset had said elsewhere it contained 196,640 books.

The complaint said Dettmers earlier wrote that Meta’s lawyers had told him “the data cannot be used or models cannot be published if they are trained on that data.”

Reuters, in its report, added that while Dettmers did not describe the lawyers’ concerns, counterparts in the chat identified “books with active copyrights” as the biggest likely source of worry. They said training on the data should “fall under fair use,” a US legal doctrine protecting certain unlicensed uses of copyrighted works.

Dettmers, a doctoral student at the University of Washington, was not immediately able to comment on the claims.

Meta did not disclose the datasets it used to train Llama 2, which was released over the summer. –

Victor Barreiro Jr.

Victor Barreiro Jr is part of Rappler's Central Desk. An avid patron of role-playing games and science fiction and fantasy shows, he also yearns to do good in the world, and hopes his work with Rappler helps to increase the good that's out there.