Entertainment and Media Guide to AI

Why the AI sector has a training problem

Of all the legal issues affecting the input stage of AI, the training set is undoubtedly the star. Surprisingly, it is also one of the few legal issues that most countries with a vibrant AI sector addressed a number of years ago. Yet it continues to be disputed, and likely will be for months to come, as two AI heavyweights, the UK and the U.S., come to terms with it. We set out below the key elements of the debate.

Most AI systems are trained using a process known as “text and data mining” (TDM). TDM, or “machine-reading,” involves analyzing and extracting information from vast quantities of data used as “training material.” It is comparable to the way humans learn, but when we read books, watch movies or listen to music, that is not a process restricted by law. Machines? Well, it’s different for them.

Text and data mining

In the words of the UK IPO, “TDM is the use of automated computational techniques to analyze large amounts of information to identify patterns, trends and other useful information. TDM may be used to develop and train AI and has a range of other uses including enabling research. This includes the analysis of medical and scientific data, business intelligence, and data analytics. TDM automates and accelerates what would traditionally be done by eye – reading a document, making notes, and understanding relationships and trends.”

The importance of a copy

In any TDM process it is typically necessary to “clean” the text and data being mined (which in some cases takes up to 80% of the mining time), in order to remove inconsistent, unreliable or redundant data, and to “normalize” the data into a specific format adapted to the relevant application. For example, when normalizing text, one is seeking to reduce its randomness, bringing it closer to a predefined “standard.” This reduces the variety of information that the machine has to deal with, and therefore improves efficiency. In other words, to be processed by an AI system, the data must be copied, sometimes stored (at least temporarily), and often modified (e.g., by formatting, cutting, merging, compilation, etc.) to be made usable. If the data contains works protected by copyright, each of these operations implicates rights that are exclusive to the copyright owner or its licensees, and may constitute copyright infringement if performed without a license from the rights owner.
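To make the cleaning and normalizing steps concrete, here is a minimal Python sketch. It is an illustration only, not a description of any particular AI developer's pipeline: the function names `normalize` and `clean`, the sample corpus, and the choice of normalization rules (Unicode normalization, lowercasing, whitespace collapsing, removal of duplicates) are all assumptions made for the example.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Bring a raw string closer to a predefined "standard" form (illustrative rules)."""
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = text.lower()                         # ignore differences in letter case
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


def clean(records: list[str]) -> list[str]:
    """Normalize each record, then drop empty and redundant (duplicate) ones."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in records:
        item = normalize(raw)  # creates a new, modified copy of the input text
        if item and item not in seen:
            seen.add(item)
            cleaned.append(item)  # the modified copy is stored for later use
    return cleaned


# Hypothetical sample corpus: two variants of the same sentence, an empty
# record, and one distinct line.
raw_corpus = ["  The Cat sat.\n", "the cat   sat.", "", "A different line"]
print(clean(raw_corpus))  # ['the cat sat.', 'a different line']
```

Even in this toy version, the input text is read into memory, rewritten into a modified copy, and stored in a new collection. Those are the operations (copying, storing, modifying) that raise the copyright question when the input is a protected work.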

The container vs. content debate

Copyright law protects original expressions; it does not protect ideas, concepts, facts, knowledge or mere data. Yet more often than not, data and facts are embedded within an original text, song, image, film, etc. Whether or not the law should facilitate access to underlying data captured within protected works is a question that has divided, and continues to divide, opinions.

The tech industry and the AI sector have long argued that the content of a protected work is not protected by copyright; only its form, the container of the work, is. Yet, because there is currently no viable way of using the content without also using the container, questions have arisen as to whether a dedicated copyright exemption or exception is needed. The argument that unprotected facts and data should not be captured by copyright has been at the center of the AI sector’s campaign in favor of a broad TDM exception to copyright.

On the other hand, copyright holders are unsurprisingly adamant that the law cannot deprive them of their ability to control or monetize their works, be it for TDM or other purposes. The recent launch of AI-powered solutions capable of producing photos, paintings and music at the push of a button has only reinforced the creative sector’s distrust of tech businesses, which are seen as making a living on the back of the creative sector’s content. Amid a flurry of lawsuits over the legality of TDM, including those brought by Getty Images before the High Court of Justice in London and the U.S. District Court in Delaware, the debate seems to intensify daily.
