The use, ownership and exploitation of data are extremely valuable. The era of AI has ushered in a veritable gold rush of companies and individuals seeking to mine this man-made resource, which, unlike gold, is available in great abundance. However, the alchemy involved in turning a seemingly infinite supply of data into something valuable requires tremendous computational power and investment.
Text and Data Mining (TDM) generally involves the identification of previously unknown patterns or relationships in data sets. TDM can be used to build predictive models of behavior in the retail context, so that when a customer visits Amazon, or opens their Facebook page, they are presented with advertising keyed to their individual tastes and preferences.
In the media and entertainment context, one form of TDM, machine learning, is being used to train AI programs to create content, whether in text, audio, visual or audiovisual form. Machine learning, like traditional TDM, is intended to discover novel and useful knowledge in data. However, a fundamental difference between the two is that traditional TDM, in and of itself, can extract data for human comprehension, whereas machine learning extracts data to improve an AI program’s own understanding and ability to produce output. In addition, TDM does not necessarily involve rule or pattern discovery, while machine learning almost always does.
TDM in the U.S.: What is ‘fair use’ anyway?
As discussed in the Geopolitics of AI section, the legality of making copies of the text or data through TDM has become a serious issue. As AI search engines crawl through the world wide web endlessly seeking, digesting, and aggregating content, they inevitably digest copyrighted works such as music videos, songs, novels, and news stories. Since this digestion – which generally requires the making of a copy – is frequently performed without the express consent of the copyright holder, its legality often depends on whether it is permitted under an exception to, or outside the framework of, copyright law. Under U.S. copyright law, the exception that is most frequently relied upon is fair use.
Under section 107 of the Copyright Act, fair use is a four-factor test: (1) the purpose and character of the use; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the whole; and (4) the effect of the use on the potential market for, or value of, the copyrighted work. Fair use of a copyrighted work for such things as teaching, scholarship, and research is specifically permitted by section 107. A key consideration that courts have used in deciding whether fair use exists is whether the use is “transformative.”
Whether copying of copyrighted material for the purpose of machine learning constitutes fair use is a hotly debated topic that will affect the future of AI in the United States. For example, Thomson Reuters and West Publishing Corp. have sued Ross Intelligence, Inc. over, among other things, its alleged use of machine learning to create a legal research platform for Ross from the Westlaw database. The case is still pending, and Ross’ motion to dismiss the copyright infringement claim was denied.1
Will fair use protect machine learning?
In a seminal case from 2015, the Second Circuit found Google Books’ scanning of more than 20 million books, many of which were subject to copyright, to be a non-expressive and transformative fair use of the texts because Google Books enabled users to find information about copyrighted books, as opposed to the expressions contained in the books themselves.2 A key lesson from the case was the distinction made between “expressive” and “non-expressive” use of copyrighted materials, the latter being deemed fair use by the court. Applied to AI, could this mean that so long as the original text is not “expressed” in the final work product, the act of machine reading is fair use?
We are not aware of U.S. courts applying fair use in the context of TDM, in part because cases considering AI functionality have often involved the express use of copyrighted material that qualified as traditional copyright infringement. For example, the Second Circuit found in a 2018 case that, although TVEyes’ “search feature” for Fox News content might in and of itself have been sufficiently transformative to be fair use, the fact that TVEyes also had a “watch feature” that redistributed copyrighted Fox News content to TVEyes users for a monthly fee did not permit a fair use defense (Fox News Network, LLC v. TVEyes, Inc., No. 15-3885 (Feb. 27, 2018)).
In practice, major TDM search projects are generally dealt with under contract, which has resulted in little litigation. Academic and commercial arguments have also been raised against over-reliance on fair use for TDM. As a practical matter, a key factor that U.S. courts will look at is whether TDM deprives the copyright owner of the value of their copyrighted material.
In addition, licensing can help address confidentiality, usage considerations or limitations and, increasingly, learning and other issues, often experiential and machine-aided, that arise in connection with the collection, use and disclosure of data. For example, secondary or derivative usage of data, which may not be subject to copyright or trade secret protection, is increasingly addressed by contract. Similarly, residuals, meaning information in intangible form that may be remembered by persons with access to confidential information, are increasingly important for parties to consider when exchanging confidential information. The information generated by a business relationship can itself be valuable, and so can the answers to who has a right to keep it secret and whether and how the counterparty can use it. These questions have become so important that the entire enterprise value of certain businesses has been written off when rights in underlying data were questioned and, more recently, acquisitions have had their purchase price changed or have failed to close because of uncertainty about data rights.
With this in mind, it is helpful to understand common contractual provisions used in licensing relating to the collection, use and disclosure of data.
- Most AI systems are trained by analyzing and extracting information from vast quantities of data
- Copyrighted material is making its way into AI products, potentially changing the way that data is licensed
- Purchasers of AI systems should consider clauses in contracts to protect themselves from new AI risks