Text and Data Mining (TDM) generally involves the identification of previously unknown patterns or relationships in data sets. TDM can be used to build predictive models of behavior in the retail context, so that when a customer visits Amazon or opens their Facebook page, they are presented with advertising keyed to their individual tastes and preferences.
In the media and entertainment context, one form of TDM, machine learning, is being used to train AI programs to create content, whether in text, audio, visual or audiovisual form. Machine learning, like traditional TDM, is intended to discover novel and useful knowledge in data. However, a fundamental difference between the two is that TDM, in and of itself, can extract data for human comprehension, whereas machine learning extracts data to improve an AI program’s own understanding and ability to produce output. In addition, TDM does not necessarily involve rule or pattern discovery, while machine learning almost always does.
TDM in the U.S.: What is ‘fair use’ anyway?
As discussed in the Text and data mining in US section, the legality of making copies of text or data through TDM has become a serious issue. As AI search engines crawl the World Wide Web, endlessly seeking, digesting, and aggregating content, they inevitably digest copyrighted works such as music videos, songs, novels, and news stories. Since this digestion, which generally requires the making of a copy, is frequently performed without the express consent of the copyright holder, its legality often depends on whether it is permitted under an exception to, or falls outside the framework of, copyright law. Under U.S. copyright law, the exception most frequently relied upon is fair use.
Under section 107 of the Copyright Act, fair use is a four-factor test: (1) the purpose and character of the use; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the whole; and (4) the effect of the use on the potential market for, or value of, the copyrighted work. Section 107 specifically names purposes such as teaching, scholarship, and research as examples of uses that may qualify as fair use. A key consideration that courts have used in deciding whether fair use exists is whether the use is “transformative.”
Whether copying of copyrighted material for the purpose of machine learning constitutes fair use is a hotly debated topic that will affect the future of AI in the United States. For example, Thomson Reuters and West Publishing Corp. have sued Ross Intelligence, Inc. over, among other things, its alleged use of machine learning to create a legal research platform for Ross from the Westlaw database. The case is still pending, although Ross’ motion to dismiss the copyright infringement claim was denied.1
Will fair use protect machine learning?
In a seminal case from 2015, the Second Circuit found Google Books’ scanning of more than 20 million books, many of which were subject to copyright, to be a non-expressive and transformative fair use of the texts because Google Books enabled users to find information about copyrighted books, as opposed to the expressions contained in the books themselves.2 A key takeaway from the case was the distinction drawn between “expressive” and “non-expressive” use of copyrighted materials, the latter being deemed fair use by the court. Applied to AI, could this mean that so long as the original text is not “expressed” in the final work product, the act of machine reading is fair use?
We are not aware of U.S. courts applying fair use in the context of TDM, in part because cases considering AI functionality have often involved the express use of copyrighted material that qualified as traditional copyright infringement. For example, in a 2018 case the Second Circuit found that, although TVEyes’ “search feature” for Fox News content might in and of itself have been sufficiently transformative to be fair use, TVEyes’ “watch feature,” which redistributed copyrighted Fox News content to TVEyes users for a monthly fee, did not qualify for a fair use defense (Fox News Network, LLC v. TVEyes, Inc., No. 15-3885 (Feb. 27, 2018)).
In practice, major TDM search projects are generally dealt with under contract, which has resulted in few instances of litigation. Academic and commercial arguments have also been raised against over-reliance on fair use for TDM. As a practical matter, a key factor that U.S. courts will look at is whether TDM deprives the copyright owner of the value of their copyrighted material.
1. Thomson Reuters Enter. Ctr. GmbH v. ROSS Intelligence Inc., 529 F. Supp. 3d 303 (D. Del. Mar. 29, 2021).
2. Authors Guild, Inc. v. Google Inc., 804 F.3d 202 (2d Cir. 2015).