Exploring a Books Data Commons for AI Training

Our work on copyright has long focused on supporting libraries and archives in the service of their missions to preserve and ensure access to culture. Our 2022 copyright reform agenda centers those sorts of institutions (and more generally GLAMs) and the critical role they play in society. Among other things, that agenda calls attention to the ways in which copyright might impede libraries and archives who wish to make their collections available for research uses, including use for AI training in order to fulfill their public interest missions.

That issue – AI training – has become ever more relevant. Mass digitization of books, including to support text and data mining (of which AI training is a subset), is not a new concept. But AI training is newly of the zeitgeist, and its transformative use makes questions about how we digitize, preserve, and make knowledge and cultural heritage accessible salient in a distinct way.

In 2023, multiple news publications reported on the availability and use of a dataset of books called “Books3” to train large language models (LLMs), a form of generative AI. The Books3 dataset contains text from over 170,000 books, a mix of in-copyright and out-of-copyright works. It is believed to have been originally sourced from a website that was not authorized to distribute all of the works it contained. Lawsuits brought against OpenAI, Microsoft, Meta, and Bloomberg over their LLMs specifically cited the use of Books3 as training data.

The Books3 controversy highlights a critical question at the heart of generative AI: what role do books play in training AI models, and how might digitized books be made widely accessible for the purposes of training AI for the public good? What dataset of books could be constructed and under what circumstances? 

Earlier this year, we collaborated with Open Future and Proteus Strategies on a series of workshops to explore these questions and more. We brought together practitioners on the front lines of building next-generation AI models, as well as legal and policy scholars with expertise in the copyright and licensing challenges surrounding digitized books. We also sought to bridge the perspective of stewards of content repositories, like libraries, with that of AI developers: a “books data commons” needs to be both responsibly managed and useful to developers of AI models. Today, we’re releasing a paper based on those workshops and additional research.

While this paper does not prescribe a particular path forward, we do think it’s important to move beyond the status quo. Today, large swaths of knowledge contained in books are effectively locked up and inaccessible to nearly everyone. Large companies have huge advantages when it comes to access to books for AI training (and access to data in general). At the same time, as the paper highlights, there are already relevant examples of nonprofit and library-led efforts to provide responsible, fair access to books for many more people, not just a privileged few. We hope this paper can support further research, collaboration, and investment in this space.

Read the full paper

Posted 08 April 2024