Of all the legal issues affecting the input stage of AI, the training set is undoubtedly the star. Surprisingly, it is also one of the rare legal issues that most countries with a vibrant AI sector dealt with a number of years ago. Yet it remains in dispute, and is likely to stay that way for months to come, as two AI heavyweights, the UK and the U.S., come to terms with it. We set out below the key elements of the debate.
Most AI systems are trained using a process known as “text and data mining” (TDM). TDM, or “machine-reading,” involves analyzing and extracting information from vast quantities of data used as “training material.” Humans learn in much the same way, but when we read books, watch movies or listen to music, the law does not restrict that process. Machines? Well, it’s different for them.
Text and data mining
In the words of the UK IPO, “TDM is the use of automated computational techniques to analyze large amounts of information to identify patterns, trends and other useful information. TDM may be used to develop and train AI and has a range of other uses including enabling research. This includes the analysis of medical and scientific data, business intelligence, and data analytics. TDM automates and accelerates what would traditionally be done by eye – reading a document, making notes, and understanding relationships and trends.”
The importance of a copy
In any TDM process it is typically necessary to “clean” the text and data being mined (which in some cases takes up to 80% of the mining time), in order to remove inconsistent, unreliable or redundant data, and to “normalize” the data into a specific format adapted to the relevant application. For example, when normalizing text, one is seeking to reduce its randomness, bringing it closer to a predefined “standard.” This reduces the variety of information that the machine has to deal with, and therefore improves efficiency. In other words, to be processed by an AI system, the data must be copied, sometimes stored (at least temporarily), and often modified (for example by formatting, cutting, merging or compilation) to be made usable. If the data contains works protected by copyright, each of these operations implicates rights that are exclusive to the copyright owner or its licensees, and may constitute copyright infringement if performed without a license from the rights owner.
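To make the cleaning and normalization steps described above more concrete, here is a minimal sketch in Python. The function names (`normalize`, `clean_corpus`) and the specific rules (Unicode NFKC, lowercasing, whitespace collapsing, exact-duplicate removal) are assumptions chosen for illustration; real pipelines vary widely and this is not any particular system's method. Note that every step produces a new, modified copy of the input text, which is the technical fact the legal analysis turns on.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Return a normalized copy of `text` (illustrative rules only)."""
    # Unify equivalent Unicode forms (e.g. ligatures, full-width characters).
    text = unicodedata.normalize("NFKC", text)
    # Reduce variation in casing.
    text = text.lower()
    # Collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_corpus(documents: list[str]) -> list[str]:
    """Normalize each document and drop empty or duplicate entries."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for doc in documents:
        norm = normalize(doc)  # a new copy of the source text is created here
        if not norm or norm in seen:
            continue  # skip empty or redundant data
        seen.add(norm)
        cleaned.append(norm)  # the modified copy is stored for later use
    return cleaned


if __name__ == "__main__":
    raw = ["  The  Quick\tBrown Fox ", "the quick brown fox", "", "Another   TEXT"]
    print(clean_corpus(raw))  # ['the quick brown fox', 'another text']
```

Even in this toy version, the source text is read, duplicated in memory, altered and retained, which mirrors the copying, storage and modification described above.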
- If the training set includes works protected by copyright, the text and data mining operations will engage the right of reproduction, which is reserved to copyright owners and requires their authorization unless an exception applies.