Industry & Advocacy News
May 19, 2025
On Friday, May 9, 2025, the U.S. Copyright Office released a draft of Part 3 of its AI report, titled Generative AI Training, which addresses critical issues surrounding the use of copyrighted materials in training AI models. The released draft is a pre-publication version, but the Copyright Office has made clear that the final version will not differ substantively from the current draft.
This part of the report delves into the legality of training and developing AI systems under copyright law and the rights of copyright owners whose works are used to develop these systems. Although the report does not choose sides between AI companies and rightsholders, its findings—which have important ramifications for ongoing infringement lawsuits against AI companies, including our own action against OpenAI—question some of the arguments routinely made by AI proponents, such as the claim that AI does not copy expressive elements, or that there are no existing licensing markets in which to acquire works for training data. The report’s conclusions align with most of the analysis and recommendations that we provided in our comments submitted in response to the Copyright Office’s Notice of Inquiry. Our comments are cited more than a dozen times throughout the report.
We are enormously grateful to the staff of the Copyright Office for undertaking this important study and to Register of Copyrights Shira Perlmutter, who was wrongfully terminated a day after the report was issued.
The report presents thoughtful and balanced analysis of fair use in the context of AI training, emphasizing that AI training is not categorically fair use and that whether a use qualifies as a fair use is a matter of context and degree.
Fair use is a legal doctrine that permits certain unlicensed uses of copyrighted material, but it must be assessed on a case-by-case basis using a fact-specific inquiry. In determining whether a use is a fair use, courts apply a four-factor test that considers: (1) the purpose and character of the use, including whether it is commercial and whether it transforms the original work; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used; and (4) the effect of the use on the market for the original. In the context of AI training, the Copyright Office cautions that fair use cannot be presumed and must be evaluated based on how copyrighted works are used during training and deployment.
While acknowledging that some uses of copyrighted works in AI training may qualify as fair use, the Copyright Office makes it clear that “[m]aking commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets… goes beyond established fair use boundaries.”
Below is a breakdown of the report’s findings on each of the four factors.
In analyzing the first factor, the Copyright Office devotes considerable attention to whether the use of copyrighted works in training is a “transformative use”—i.e., whether the use adds new expression, meaning, or message to the original work. It notes that under the Supreme Court’s recent decision in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, transformativeness alone does not render a use fair, and that a “transformative” use of a work can nevertheless be infringing if it serves the same purpose as the original work and could serve as a market substitute for it. The report suggests that while AI training transforms works by creating a statistical model, the fair use inquiry does not end there; where a model trained on unlicensed copyrighted works is deployed to create expressive works similar to the works ingested during training, the use is less likely to be fair than, for example, where the model is used for research or in a closed system that prevents the model from generating works that substitute for the originals. Further, the Copyright Office notes that the use of AI models is commercial and for-profit, which weighs against a finding of fair use.
The report rejects the argument made by AI companies that AI training is “inherently transformative” because it is “not for expressive purposes,” noting that language models absorb “not just the meaning and parts of speech of words, but how they are selected and arranged at the sentence, paragraph, and document level—the essence of linguistic expression” and can be “used to generate expressive content, or potentially reproduce copyrighted expression.”
The report underscores that where AI developers acquire works from pirate sites to build LLMs that can be used to generate content that competes in the marketplace for the works used, that use is unlikely to be fair use. This observation is especially germane to ongoing court cases because all current LLMs, from GPT to Llama and Claude, were built using books acquired from pirate sources. The report also addresses increasingly popular retrieval-augmented generation (or “RAG”) uses, in which an existing LLM queries and retrieves information from an external database to correct or amplify its response, finding that such uses are less likely to be transformative if the response summarizes or abridges the source work, and rejecting the analogy of such uses to hyperlinking.
The second factor considers the nature of the copyrighted work, including how close the works used in training lie to the core of copyright protection. The report explains that “[t]he use of more creative or expressive works (such as novels, movies, art, or music) is less likely to be fair use than use of factual or functional works (such as computer code).” Although courts give this factor less weight overall, it is still relevant—especially when expressive, published content is used. Citing our comments, the Copyright Office notes that training sets “usually include expressive works” such as books and musical compositions that “are highly creative and closer to the heart of copyright,” and thus the use of such works may disfavor fair use under the second factor. The report acknowledges that some training data may be unpublished, which can further weigh against a finding of fair use, though most content used appears to be published, which “modestly supports a fair use argument.”
Under the third factor, the report considers “whether the amount and substantiality of the portion used . . . are reasonable in relation to the purpose of the copying.” In the context of AI, the Copyright Office notes that models generally ingest entire works, and while courts have sometimes upheld mass copying when necessary for a transformative use (such as full-text search), the report states that “the use of entire copyrighted works is less clearly justified in the context of AI training than it was for Google Books or a thumbnail image search.” Where AI systems are capable of producing expressive outputs that may compete with original works, the argument for wholesale copying being fair use is even weaker. However, in some cases where the AI systems contain adequate guardrails against the public accessing the copied text, the third factor may not weigh against fair use.
This is an area where the Authors Guild disagrees with the report: We do not believe that AI companies’ current practices regarding guardrails should be considered in the analysis, given that they can and do change them on a regular basis and there is no assurance that today’s guardrails will be there tomorrow.
The report devotes significant attention to the fourth factor of fair use—the effect of the use on the market for or value of the copyrighted work—concluding that the “copying involved in AI training threatens significant potential harm to the market for or value of copyrighted works” (emphasis added). It identifies three distinct types of harm to the market for copyrighted works:
The report identifies cases where generative AI training can lead directly to lost sales, such as when pirated collections of copyrighted works are used to build training datasets and made publicly available, hurting the market for included books. And if training enables models to generate substantially similar outputs, those outputs may serve as direct substitutes for the original works, displacing legitimate sales.
The threat of market dilution from AI-generated books is one that we have long warned about and emphasized. The report delves into how the flood of AI-generated works diminishes the overall value of human-authored content, making the critical observation that the effect on the market should be viewed broadly to encompass any effect on the potential market for works of the same kind, instead of individualized harm to the market for specific works. Using the romance genre as an example, the Copyright Office notes that:
“The speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data. That means more competition for sales of an author’s works and more difficulty for audiences in finding them. If thousands of AI-generated romance novels are put on the market, fewer of the human-authored romance novels that the AI was trained on are likely to be sold. Royalty pools can also be diluted.”
This kind of harm is deeply corrosive. If AI outputs saturate the marketplace, they will lower prices, reduce demand for original works, and harm authorship.
The most direct harm is the loss of existing or potential licensing revenue. The report notes that if a market exists—or could exist—for licensing works for training, bypassing it cuts directly against a fair use defense. It underscores that licensing is already occurring in multiple sectors, from music and news to images, undercutting the argument that licensing isn’t practical or available.
The report notes that “licensing is core to the business model of many content industries,” and that several industry representatives have expressed their willingness and ability to license works for AI training. It also observes that “AI developers were licensing copyrighted works in a number of sectors, including music, vocal recordings, and news reports.” Importantly, the report emphasizes that the existence of licensing options weakens fair use claims, stating that “where licensing options exist or are likely to be feasible, this consideration will disfavor fair use under the fourth factor.”
Read the Full Draft Report (PDF)
You can also find our previous post on Part 2 of the report, which covered copyrightability, here. We will continue to update authors on developments in copyright and AI in our newsletter and on our website.