Entertainment and Media Guide to AI

Text and data mining in the US

Read time: 5 minutes

Introduction

Copyright is a territorial beast, and not all countries have taken the same approach to the text and data mining (TDM) debate.

The U.S. addresses TDM through its doctrine of “fair use,” which permits limited use of copyright protected material without first acquiring permission from the copyright holder – in particular where the contemplated use is deemed “transformative.”


Copying copyright protected content for TDM purposes

In the United States, the reproduction right is reserved for the copyright owner of a work or its licensees under section 106 of the U.S. Copyright Act of 1976. While U.S. copyright law contains no express exception for TDM, section 107 of the Copyright Act authorizes the fair use of a copyright protected work, “including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching […], scholarship, or research.” The technology sector has traditionally considered copying copyright protected works for the sole purpose of text and data mining to be fair use. The creative sector disagrees, and the launch of generative AI solutions capable of producing photos, paintings and music at the push of a button has seen copyright holders rally behind the “unfair use” banner to condemn the use of their content by AI businesses.

What is fair use?

To determine whether the use of a copyright protected work without the consent of the copyright owner constitutes non-infringing fair use, courts balance the following four factors in a case-by-case, highly fact-specific inquiry:

  1. The purpose and character of the use, including whether the use is of a commercial nature or is for non-profit educational purposes;
  2. The nature of the copyright protected work;
  3. The amount and substantiality of the portion used in relation to the copyright protected work as a whole;
  4. The effect of the use upon the potential market for or value of the copyright protected work.

The first factor, also known as the “transformative use factor,” is generally the most heavily weighted by the courts. The inquiry is whether the new use merely supersedes the existing work or, to the contrary, adds something new, with a further purpose or different character, altering the first work with new expression, meaning or message; a use of the latter kind is transformative.1 Even if a work is copied and stored in substantially the same form as the original without meaningful alteration, the use may still be considered transformative, so long as it serves a materially different function than the original work.2

Examples of uses that courts have found to be transformative include making digital copies of student papers for use in anti-plagiarism software (where the defendant’s use of the works was unrelated to their expressive content),3 and scanning books to create a full-text searchable database and public search function (in a manner that did not allow users to read the texts).4 While educational and non-commercial uses are generally more likely to be found fair, courts will not necessarily find a commercial use to be unfair and will instead balance the purpose and character of the use against the other factors.

Copies of original works made for TDM purposes appear to serve a purely functional purpose, namely, to teach an AI model about the underlying characteristics of a work through pattern recognition. Such copies are typically not released or made available to the public, so their use would appear to be transformative in a manner consistent with existing case law.

The second fair use factor examines whether the reproduced content is factual in nature, in which case it is entitled to a lower level of protection in an attempt to encourage the spread of scientific or educational works for the public’s benefit. While the reproduction of nonfactual, creative works such as images or sound recordings is less likely to satisfy this factor, the second factor has been considered by courts to hold little weight in the fair use balance and is rarely found to be determinative.

The third fair use factor assesses whether the quantity and significance of the portion of the work reproduced is justifiable in light of the purpose of the copying. Even though using an entire image, sound recording or other creative work may seem contradictory to fair use, it does not necessarily preclude a finding of fair use.5 Importantly, the factors should not be scrutinized in isolation but should be weighed collectively. In this regard, the fourth factor, which along with the first factor is generally given the greatest weight by the courts, could tip the balance towards a fair use finding.

The fourth factor examines whether the copy brings to the marketplace a competing substitute for the original work or diminishes the original work’s value by serving as an alternative that potential buyers might prefer.6 More generally, in order to be deemed fair, the use should not negatively impact the market (or the potential market) for, or the value of, the original copyright protected work by serving as a viable substitute. As copyright is a commercial right intended to protect the ability of authors to profit from their work, this factor is often influential in a fair use analysis.7 The interrelation between the fourth and first factors is crucial: the more the new use serves a purpose different from that of the original work (the first factor), the less likely it is to serve as a market substitute for the original work (the fourth factor).8

Should the long-term purpose of the TDM operations be considered by the courts when assessing the fairness of the practice? Should the court’s fair use analysis differ based on the type of AI model being trained (generative or not)? These questions are highly topical and, in large part, hinge on the U.S. courts’ response to a small number of highly visible lawsuits, which the drafters of this guide will watch closely.


  1. Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994).
  2. A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630, 639 (4th Cir. 2009).
  3. Id.
  4. Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 97-98 (2d Cir. 2014).
  5. Capitol Records, LLC v. ReDigi Inc., 910 F.3d 649, 662 (2d Cir. 2018).
  6. Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018).
  7. Authors Guild, Inc. v. Google Inc., 804 F.3d 202 (2d Cir. 2015).
  8. Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994).
