It was a highly contentious ruling, one that could’ve changed the course of not only this case but others against AI companies. The court had just ruled that OpenAI waived attorney-client privilege by denying allegations that it knowingly infringed upon the copyrights of authors whose books it illegally downloaded. The finding opened the door to internal communications behind the company’s deletion of two huge datasets of pirated books, potentially exposing it to massive damages. Its in-house legal team was set to be deposed.
OpenAI immediately appealed the ruling, bringing on Lisa Blatt, a veteran of the Supreme Court bar who counts Google, Bank of America and Starbucks as clients. In her brief, she delivered a dire warning: the decision, if allowed to stand, would eviscerate assertions of privilege in any copyright case involving what's known as state of mind, an analysis used to determine whether a defendant intentionally committed infringement or was unaware of doing so.
On Friday, OpenAI prevailed in its bid to reverse the court's ruling. Billions of dollars were at stake. The communications could've helped prove willful infringement, which triggers damages as high as $150,000 per work, as opposed to as little as $200. And perhaps even more importantly, the decision threatened to provide a pathway for those suing AI companies to obtain evidence typically considered privileged.
The issue has been a key battleground in discovery. It relates to an OpenAI employee in 2018 downloading pirated copies of books and using them to create two datasets, known as "books 1" and "books 2," to train two discontinued GPT models. After initially telling the court that the datasets were deleted in 2022 "due to their non-use," the company has since maintained that information about the reasons for their deletion is privileged. Lawyers representing authors and publishers have cried foul.
In November, Magistrate Judge Ona Wang found that OpenAI must hand over evidence revealing the company’s motivations for deleting the datasets. She concluded that the company opened the door to the privileged material when it disclosed that “books 1” and “books 2” were deleted because of “non-use,” though it was another part of her order that attracted the attention of the copyright bar. Wang reasoned that the company effectively waived attorney-client privilege by denying allegations of willful infringement, which she said puts the company’s state of mind under the court spotlight. For “OpenAI to deny that it willfully infringed class plaintiffs’ copyrighted works is to argue that it acted in good faith,” per the order.
Ruling against authors and publishers, including Sarah Silverman, in Friday's order, U.S. District Judge Sidney Stein stressed that denying allegations of willful infringement isn't equivalent to advancing a good faith defense. This puts discovery into the reasons OpenAI erased the datasets in 2022 squarely off the table, he said. "There is a distinction between a copyright defendant merely denying allegations of willfulness — on which a plaintiff bears the burden of proof — and a copyright defendant affirmatively asserting its good faith belief that its actions were lawful," Stein wrote.
With the reversal, the discovery battle becomes a notable detour in the case. Although it was ultimately rejected, the argument was a crafty one by lawyers representing authors, led by Justin Nelson and Craig Smyser of Susman Godfrey, the firm that negotiated the $1.5 billion Anthropic settlement. Had the ruling been allowed to stand, AI companies would've borne the burden of proving they didn't intend to violate copyright law anytime they denied allegations that they knowingly infringed on copyrighted works.
Also discussed in the reversal was whether OpenAI revealed privileged information when it said that the datasets were deleted "due to non-use." On this issue, Stein said that the assertion doesn't represent legal advice, meaning it can't be used as a basis for finding that OpenAI waived privilege.
Despite the loss on the discovery battleground, the authors' lawyers are gaining ground on what's increasingly looking like a winning argument over the practice of pirating books from shadow libraries. That theory has changed over the course of AI litigation. At first, lawyers for the authors directly connected the piracy to OpenAI's training of its models under a single umbrella. But later, they separated the theories and alleged that the distinct act of illegally downloading the works, regardless of whether they were used, constitutes copyright infringement.
The move takes advantage of the one win for authors in another AI copyright case, this one initiated by Andrea Bartz against Anthropic, related to the company illegally downloading millions of books and storing them in a central library. The decision heavily leaned in favor of Anthropic, but the court greenlit that theory, which is now a part of the OpenAI case, for trial. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft,” wrote U.S. District Judge William Alsup. After the ruling, Anthropic agreed to pay $1.5 billion to settle the lawsuit.
It remains largely unknown what today’s AI systems are trained on. OpenAI used “books 1” and “books 2” — downloaded from LibGen, a shadow library website — to train old versions of GPT but later deleted the datasets. Still, billions of dollars were raised by AI firms largely because of models they trained on pirated books.