MANILA, Philippines – A new filing in a copyright infringement lawsuit against Meta Platforms said lawyers for Meta warned the company about using thousands of pirated books to train its AI models, though the company did so regardless.
The lawsuits allege that Meta used the authors’ written works without permission to train its Llama artificial intelligence (AI) large language model (LLM). In this case, the direct copyright infringement claim stems from the alleged use of a pirated book dataset to train the Llama LLM.
The Atlantic, in an August report, noted how pirated books were powering generative AI’s LLMs.
A California judge dismissed part of the Silverman lawsuit in November, but gave the authors permission to amend their claims.
The new filing now quotes chat logs from a Meta-affiliated researcher, Tim Dettmers, who was discussing with Meta’s lawyers whether procuring the specific dataset of book files in a Discord server would be considered “legally ok.” The chat logs are seen as potentially significant evidence, as they could indicate that Meta was aware its use of the books may not be protected by US copyright law.
Dettmers wrote in 2021, “At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons.”
The first version of Meta’s Llama large language model, which came out in February, listed “the Books3 section of ThePile” among its training datasets. The complaint added that the person who assembled that dataset had said elsewhere it contained 196,640 books.
The complaint said Dettmers earlier wrote that Meta’s lawyers had told him “the data cannot be used or models cannot be published if they are trained on that data.”
Reuters, in its report, added that while Dettmers did not describe the lawyers’ concerns, counterparts in the chat identified “books with active copyrights” as the biggest likely source of worry. They said training on the data should “fall under fair use,” a US legal doctrine protecting certain unlicensed uses of copyrighted works.
Dettmers, a doctoral student at the University of Washington, was not immediately able to comment on the claims.
Meta did not disclose the datasets it used to train Llama 2, which came out in the summer. –Rappler.com