Academics File Class Action Lawsuit Against Apple Over Alleged Use of Pirated Books in AI Training

Professors Susana Martinez-Conde and Stephen Macknik of SUNY Downstate Health Sciences University have filed a class action lawsuit against Apple Inc. The lawsuit alleges that Apple used pirated copies of their books to train its Apple Intelligence models without authorization.

Background of the Authors and Their Works

Martinez-Conde and Macknik are renowned for their contributions to neuroscience and visual perception. Their notable publications include Champions of Illusion: The Science Behind Mind-Boggling Images and Mystifying Brain Puzzles and Sleights of Mind: What the Neuroscience of Magic Reveals About Our Everyday Deceptions. These works delve into the intricacies of human perception and cognition, offering insights into how illusions and magic tricks exploit the brain’s processing mechanisms.

Allegations Against Apple

The crux of the lawsuit is Apple’s alleged use of a dataset known as Books3 to train its Apple Intelligence models. Books3 is a component of The Pile, a curated collection of English-language text assembled for training language models. Books3 itself reportedly contains the full text of the books indexed by the Bibliotik private BitTorrent tracker, a shadow library, amounting to roughly 196,640 books. The authors’ works were among those listed in this dataset.

The plaintiffs contend that Apple copied their copyrighted books in their entirety without authorization to train its OpenELM language models. They argue that this constitutes direct infringement of their copyrights, as well as those of other authors whose works were included in the dataset. The lawsuit further claims that the materials were used to test model performance and to implement filters preventing model outputs from containing copyrighted material.

Legal Context and Precedents

The lawsuit raises complex questions about the legality of using copyrighted materials to train artificial intelligence models. Earlier rulings offer some guidance. In a case involving AI startup Anthropic, Judge William Alsup ruled that using copyrighted works to train language models qualified as fair use, but that building a central library of pirated digital books for that purpose did not. Anthropic later agreed to a $1.5 billion settlement to resolve the lawsuit, which included compensating authors and destroying the pirated files it had collected.

In another case, a U.S. District Court ruled that training AI models using copyrighted works was permissible under fair use, emphasizing the transformative nature of the technology. However, the court also noted that creating a library of pirated digital books, even if not used for training, does not constitute fair use.

Apple’s Stance on AI Training Practices

Apple has publicly emphasized its commitment to ethical AI training practices. In a research paper, the company stated that it trains its models using diverse and high-quality data, including licensed content from publishers and publicly available information. Apple asserts that it does not use users’ private personal data or user interactions when training its foundation models. Additionally, the company claims to follow best practices for ethical web crawling, including adhering to widely adopted robots.txt protocols to allow web publishers to opt out of their content being used for training.
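
To illustrate how a robots.txt opt-out of this kind works in general, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the example.com URL, and the may_fetch helper are hypothetical and do not come from the lawsuit or from Apple's paper. Applebot-Extended is the token Apple has documented for publishers who want to opt out of AI training use; it is used here as an assumption about how a publisher might write such a rule, not as a description of how Apple's systems process it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site (an assumption,
# not taken from the article): opt out for one AI-training token while
# leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/books/chapter-1"
    print("Applebot-Extended allowed:", may_fetch("Applebot-Extended", url))
    print("SomeOtherBot allowed:", may_fetch("SomeOtherBot", url))
```

A rule like this only governs crawlers that choose to honor it, and it applies to content gathered from the web. It would not cover a dataset such as Books3 that was obtained by other means.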

Challenges and Implications

The lawsuit faces several challenges. One key issue is proving that Apple specifically used the plaintiffs’ publications to train its models. Apple has acknowledged using The Pile, which includes Books3, but it is unclear whether the specific books in question were used. Apple does not publicly disclose the individual documents used to train its language models, and it is not evident whether the company keeps records of the specific books involved.

The plaintiffs are seeking a jury trial, monetary damages, and an injunction preventing Apple from using their copyrighted works in the future. Under U.S. copyright law, willful infringement can result in statutory damages of up to $150,000 per work. However, it remains to be determined whether Apple’s actions constitute willful infringement.

Broader Industry Context

This lawsuit is part of a broader trend of legal actions against tech companies for allegedly using copyrighted materials without authorization to train AI models. Authors and content creators have increasingly raised concerns about their works being used without compensation in the development of AI technologies. The outcomes of these cases could have significant implications for the future of AI development and the rights of content creators.

Conclusion

The lawsuit filed by Martinez-Conde and Macknik against Apple underscores the ongoing tension between technological advancement and intellectual property rights. As AI technologies continue to evolve, the legal frameworks governing their development and use will likely face further scrutiny and potential reform. The resolution of this case may set important precedents for how copyrighted materials are used in AI training and the responsibilities of tech companies in respecting intellectual property rights.