Sarah Silverman is part of a new copyright class action suit against OpenAI, which alleges that ChatGPT was illegally trained on copyrighted books.

Silverman is one of three lead plaintiffs, alongside the authors Christopher Golden and Richard Kadrey. They claim, on behalf of the prospective class, that their books were part of a trove of copyrighted material “copied by OpenAI” and used to train ChatGPT “without consent, without credit, and without compensation.” 

The new lawsuit is similar to one brought by authors Paul Tremblay and Mona Awad against OpenAI earlier this year. In fact, Silverman and co. have enlisted the same attorney, Matthew Butterick of the Joseph Saveri Law Firm, to represent them.

OpenAI — a research lab with nonprofit and corporate arms — officially launched ChatGPT last year. The software, known as a “large language model” (LLM), is fed copious amounts of text and is thus able to generate human-like responses to text inputs. 

Noting that an LLM’s output is “entirely and uniquely reliant on the material in its dataset,” the new lawsuit alleges that “much of the material in OpenAI’s training datasets [came] from copyrighted works.” The suit alleges that books by the three lead plaintiffs — Silverman’s memoir The Bedwetter, Golden’s Ararat, and Kadrey’s Sandman Slim — were among the copyrighted works used to train ChatGPT. 

The plaintiffs’ lawyers say they can prove this by using ChatGPT itself: “The reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model as part of its training data,” the suit alleges. “When ChatGPT was prompted to summarize books written by each of the Plaintiffs, it generated very accurate summaries.”

While the suit acknowledges that the summaries got “some details wrong,” it asserts that such mistakes are “expected” since LLMs combine “expressive material derived from many sources.” Still, the suit claims the overall accuracy suggests “ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content.”

The suit also proffers some theories as to how OpenAI allegedly trained ChatGPT on copyrighted works. It notes that a July 2020 paper about an earlier version of ChatGPT said the software was trained on two troves of books, known as Books1 and Books2. Though OpenAI did not specify where the books came from, the suit says statistics mentioned in that OpenAI paper suggest Books1 came from Project Gutenberg (“an online archive of e-books whose copyright has expired”), while Books2 allegedly came from “notorious ‘shadow library’ websites” where e-books can be pirated.

The lawsuit goes on to cite OpenAI’s March 2023 paper introducing GPT-4, saying it “contained no information about its dataset at all.” It quotes OpenAI’s statement in the paper that “given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about … dataset construction.”


OpenAI did not immediately return Rolling Stone’s request for comment.

In June, Rolling Stone spoke with several experts about a wave of lawsuits against artificial intelligence companies like OpenAI, suits that raised both privacy and copyright issues (a defamation suit was also brought). Several suggested that the similar copyright suit brought by Tremblay and Awad would face an uphill battle. For instance, Mehtab Khan, a resident fellow at Yale Law School and the lead for the Yale/Wikimedia Initiative on Intermediaries and Information, called the claim that the plaintiffs’ books were used to train ChatGPT “tenuous” and said the authors will have to prove their writing was infringed and that there’s “substantial similarity between their works and the output generated by the chatbot.”