Show me the data: Ways in which personal data is processed in AI
Like other industries, media and entertainment companies must pay close attention to how they use and protect personal and sensitive data. Globally, companies must navigate an increasingly complex regulatory landscape with respect to consumer personal and sensitive data. Personal data refers to information that can be used to identify an individual, such as name, address, date of birth, etc. Sensitive data is a subset of personal data that warrants heightened protection because of its private nature, such as medical records, financial information, race, religion, ethnicity, etc.
In the context of AI, media and entertainment companies rely heavily on personal and sensitive data to personalize content recommendations, target advertising, create interactive content and develop new products and services. These companies engage in complex data-gathering efforts to provide services accessible across multiple platforms (e.g., mobile, web, television, gaming consoles, etc.) and across a vast range of jurisdictions. They leverage algorithms to analyze consumer viewing habits and preferences to make targeted decisions about content. For these reasons, these companies must pay close attention to data privacy regulations governing personal data.
In the United States, one of the key legal frameworks governing personal data is the California Consumer Privacy Act (CCPA), which went into effect on Jan. 1, 2020. The law applies to companies that meet at least one of the following thresholds: (1) annual gross revenue greater than $25 million; (2) annual purchase, receipt or sale of the personal data of 50,000 or more California residents, households or devices; or (3) half or more of annual revenue earned from selling California residents’ personal information. Companies can continue to collect consumer data but are required under the CCPA to disclose the personal information they process and with whom it is shared. Under the CCPA, personal information is broadly defined as “information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” This can include commercial, electronic, behavioral, biometric, financial and educational information. For example, companies that stream podcasts would be required to disclose the mobile phone numbers they use to identify listeners.
One of the most influential legal frameworks governing the use of personal data is the EU’s General Data Protection Regulation and the UK GDPR (collectively, GDPR). The GDPR applies to companies based in the EU/UK that process personal data, including special category data (the term for sensitive data under the GDPR), but it also has extraterritorial effect – for example, where a media company based in the United States provides goods or services to, or monitors, individuals located in the EU/UK. Therefore, any company targeting an audience in the EU or UK for goods and services, or monitoring the behavior of individuals there, including through geolocation data and online identifiers – which is likely to be a core function of AI data mining – must comply with the GDPR. The GDPR has spawned a proliferation of similar laws in many countries around the world, and the issues under those laws will be similar. An AI provider’s role is likely to be that of a controller because of the discretion it exercises in formulating the AI algorithms, and the only way to fall outside the GDPR is to anonymize data, a process that is itself fraught, given the difficulty of doing so and the question of whether the data would retain any utility. Those using AI technologies and outputs may also be acting as data controllers.
To comply with these regulations, throughout the AI life cycle from development to deployment, companies must consider how to address their use of personal and sensitive information. Various considerations are discussed in the sections that follow, such as the lawful basis for processing personal data, transparency, data minimization and data integrity or accuracy. Other privacy/data protection by design considerations include:
- Design and governance. During the design phase, determine what personal data will be required and how it will be used. Document how personal and sensitive data will flow through and out of the AI system. Many privacy laws like GDPR require accountability and governance to show the steps taken to protect personal data. This may require records of processing, policies and assessments. If other companies are involved, appropriate contracts may need to be put in place.
- Data rights. Consider how individual rights requests will be addressed. When designing an AI framework, consider how to index personal data so that it can be retrieved when a request is received (see the sketch after this list). Functionality should exist at the outset that enables the company to respond to requests from consumers.
- Training and testing. To guard against unintended scope creep caused by a learning algorithm, conduct training and testing. Evaluate whether there are any changes to the purpose for which the data has been collected and confirm that any new purposes are lawful. If there are any changes in purpose from what was originally disclosed, update privacy information provided to consumers and consider whether additional consent must be obtained.
- Security. Care is needed to ensure that personal data is kept secure. This can be another driver for ensuring that personal data sets are not used as inputs without thinking through how that personal data will then be held in systems and to whom it may become accessible.
- Transfers. Many privacy laws around the world include restrictions on data transfers. Consider how it may be possible to comply with them, thinking through where personal data is inputted and how and where outputs can be used.
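On the data rights point above, the following minimal sketch (in Python, with hypothetical class and field names invented for illustration) shows one way the locations of a data subject's personal data might be indexed at ingestion time so that access and deletion requests can be serviced later. It is an illustration of the design consideration, not a statement of how any particular system works.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PersonalDataIndex:
    """Hypothetical index mapping a data subject to the locations of their
    personal data across an AI pipeline (training sets, feature stores, logs)."""
    locations: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, subject_id: str, location: str) -> None:
        # Called whenever personal data for this subject enters a data store.
        self.locations.setdefault(subject_id, []).append(location)

    def access_request(self, subject_id: str) -> List[str]:
        # Every known location holding this subject's data, so it can be
        # retrieved, corrected or exported in response to a rights request.
        return self.locations.get(subject_id, [])

    def deletion_request(self, subject_id: str) -> List[str]:
        # The locations to purge; the index entry itself is then forgotten.
        return self.locations.pop(subject_id, [])


index = PersonalDataIndex()
index.record("subject-123", "s3://training-data/viewing-history/2024-01.parquet")
index.record("subject-123", "feature-store/recommendations/subject-123")
print(index.access_request("subject-123"))
```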
What is the lawful basis for AI processing, and is consent required?
In the UK and EU, data protection laws require a data controller (for example, a developer training an AI system or an organization making predictions using an existing system) to establish a lawful basis to process personal data. Relevant lawful bases may include performance of a contract, legitimate business interests or consent; the development and deployment process of an AI system must be broken down, with one or more lawful bases assigned to each step. Additional, more stringent rules apply where health or other sensitive data (as described above) is being processed. We have already seen some enforcement activity on this issue. In July 2017, the UK data protection regulator (the Information Commissioner) found that the UK NHS Royal Free Hospital had violated UK data protection law by sharing the data of 1.6 million patients with Google DeepMind for clinical safety testing. UK ICO Undertaking, RFA0627721, provision of patient data to DeepMind (July 17, 2017). That decision considered the fairness of the processing in terms of the reasonable expectations of individuals – would they have expected their health data to be shared for the purpose of training an AI algorithm, and was there a suitable lawful basis? The answer was a resounding “no,” and the hospital should have obtained consent.
While consent may seem like the natural path to go down, and may be the only option in some situations, such as where sensitive personal data is concerned, the GDPR standard of consent is very prescriptive as to form and, ultimately, consent can be withdrawn by the data subject at any time, requiring the data controller to delete that individual’s personal data. This may prove tricky, if not impossible, in situations such as the ingestion of consumer viewing-history or streaming data as training data. Accordingly, data controllers deploying, developing or otherwise making use of AI systems in the UK and EU may wish to consider a lawful basis other than consent.
The landscape is entirely different in the United States, which relies largely on a notice-and-consent approach, with state laws imposing varying levels of consent (affirmative, written, explicit, etc.) and prescribing various disclosure requirements (categories of personal information collected, purposes for collection, etc.). These concepts are similar to consent requirements in the UK and EU. The key difference, however, is that alternatives to consent are not generally available in the United States, so a feasible approach to the collection of consent is something that must be considered. Consent is only valid if it is informed, which requires accurate disclosure to the individual. This model works relatively well under a static system, and its rollout onto mobile and web experiences has been relatively straightforward; it is used successfully across the mobile web, connected TV, SVOD and desktop devices. However, it will pose some issues when it comes to black box AI systems. What kind of disclosure should creators of AI models provide when the data collected and used depends on user input, and user input is essentially unlimited? More to the point, what can the creators disclose about the AI model and its underlying algorithm if that system is a black box even to its creators? Even if it is known, such a system is designed to evolve over time yet must continue to comply with evolving legal data protection requirements.
One possible solution is to provide a notice of the limitations on data input (together with automated systems that enforce those limitations – although that enforcement is itself a processing of data), along with an explanation of how the AI model is intended to work, a statement that the system will continue to evolve and a description of its guiding principles. Another solution is to provide periodic notices triggered by user input and interactions with the AI, which is functionally equivalent to today’s cookie banners.
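By way of illustration only, the sketch below shows what input gating with a just-in-time notice might look like. The category names, declared purpose and regex-based detectors are all hypothetical and deliberately simplistic; a real system would need far more robust detection and, as noted above, would itself be processing personal data.

```python
import re

# Hypothetical declared purpose and simple detectors for categories of
# personal data the service has said it does not collect.
DECLARED_PURPOSE = "content recommendations based on titles you mention"
UNDISCLOSED_PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone number": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def gate_user_input(text: str) -> tuple[str, list[str]]:
    """Return the text to pass to the model plus any notices to show the user."""
    notices = []
    for label, pattern in UNDISCLOSED_PATTERNS.items():
        if pattern.search(text):
            # Redact before the input reaches the model and surface a
            # just-in-time notice, analogous to a cookie banner.
            text = pattern.sub("[redacted]", text)
            notices.append(
                f"We detected a {label} in your message. This service only uses "
                f"your input for {DECLARED_PURPOSE}; the {label} was not stored."
            )
    return text, notices

cleaned, notices = gate_user_input("Recommend a thriller. My email is fan@example.com")
print(cleaned)   # "Recommend a thriller. My email is [redacted]"
print(notices)
```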
Another issue looming here is consent specification and granularity. U.S. states are increasingly passing data privacy laws that require purpose specification. However, AI can and likely will use data for many purposes, some of which may be reasonable and necessary and others that likely will not be. Will regulators require that users be able to limit the use of their data to a specific purpose, and if so, how will that impact the accuracy of these models that depend on learning from user input?
And finally, there is the question of how to obtain this consent. Consent requires affirmative action, such as clicking a box or selecting technical settings – not pre-ticked boxes and not being bundled with other non-privacy related terms and conditions. Users must have genuine choice and be able to refuse or withdraw consent without detriment. Some AI systems may be designed to manipulate individuals into giving consent, especially in light of the smooth-talking AI systems we have already seen out there.
Ultimately, rather than trying to fit new technology into existing legal frameworks, we will need to come up with a framework that is designed for this new technology. AI has been around for a long time and will continue to improve our lives. We should likewise develop new systems for educating consumers and, where necessary, obtaining consent.
The transparency struggle: Shedding light on data protection in the dark
Transparency is a crucial aspect of data protection in the context of AI, since it is a basic requirement of data protection rules that individuals know when their personal data is being processed, by whom and for what purposes. For example, Articles 13 and 14 of the GDPR set out the information that must be provided to individuals when their personal data is collected. This includes information about the purpose and legal basis for processing, the categories of personal data collected and any recipients of the data. As discussed in the section above, U.S. privacy laws also place great importance on disclosures.
Media organizations have become very familiar with such transparency requirements, particularly in Europe after almost five years of the GDPR. However, in the context of AI, transparency requirements require some new reflection on both what to explain and how to do so. For example, individuals should be informed if automated decision-making, including profiling, is taking place (see below for more detail on these concepts), and they have the right to obtain meaningful information about the logic involved. This means that organizations must be able to explain how their AI systems work in a way that is understandable to the average person – no mean feat!
In addition, individuals have the right to access their personal data, including any decisions AI systems have made about them. This includes the right to know the source of the data.
Organizations in the media and entertainment sector, such as game developers, movie producers and advertisers, must ensure that they are transparent about their AI systems and the data they process. This includes being clear about any biases or limitations in the technology and taking steps to mitigate them. It also includes finding ways to provide the information efficiently and in a timely manner, for example, at the beginning of a game or when an AI-supported chat starts. Ultimately, transparency is essential for building trust with individuals and ensuring that their privacy rights are respected in the context of AI. The March 2023 decision by the Italian data protection authority (Garante per la protezione dei dati personali) concerning a GPT chatbot shows that providing sufficient and correct information is a real struggle for organizations.
Can AI ever conform to data protection requirements for data minimization and accuracy?
AI can produce valuable output through algorithms trained on training sets formed of vast amounts of data of varying quality. The specific nature of AI therefore poses unique challenges to the fundamental principles of data accuracy and data minimization under data protection laws around the world, including the GDPR and the Personal Data (Privacy) Ordinance (PDPO) in Hong Kong, to name just two.
Accuracy in data protection law requires ensuring that personal data is correct and up to date. That may present obvious issues for AI, which is, by its nature, learning and guessing. For example, will AI used in advertising really be able to accurately predict what I might be interested in purchasing? AI tools can also be used to make parody or fake versions of things, whether they be chapters “written in the style of” a particular author or an image of a celebrity in a potentially compromising or humorous position; readers may recall the pictures purporting to show Trump running from the law or the Pope in a Balenciaga coat – enjoyable, but not necessarily accurate in a data protection sense. More serious is the risk that a lack of accuracy exacerbates the potential for bias and discrimination. No AI system claims to be 100% accurate, but this does not necessarily mean it cannot satisfy the accuracy principle under data protection law. Regulators have said that users and developers should focus on steps to make sure that the output of the AI system is not incorrect or misleading (for example, by ensuring that information about an individual used as an input to the system is correct in the first place) and that measures are in place to rectify any inaccuracies that arise.1
As huge amounts of personal data are generally used to develop an AI system, this gives rise to the need to continually add more data to ensure the system improves over time and produces relevant and accurate results. An ad agency wanting to use AI-generated images in an ad campaign will generally find, for example, that an algorithm that has been trained on thousands of training set pictures in the area it is keen to portray will give better results than one that only had a few random examples to work with.
Ultimately, the more data a system is trained on, the more accurate it will hopefully be. Running contrary to this, however, the data minimization principle under most data protection laws around the world requires processing only the minimum amount of personal data necessary to fulfill a specific purpose, and no more.
Again, it may appear difficult to see how AI systems can comply with the data minimization principle. For example, it may be hard, particularly with generative AI, to define the purpose of any AI use at the outset, and the purpose may change as the system develops. Regulators nonetheless recommend limiting the amount of data involved in training or using a model and, wherever possible, defining the purpose at the outset.
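As a rough, hypothetical illustration of those recommendations (the record format and field names below are invented for the example), the following sketch keeps only the fields relevant to a stated purpose and pseudonymizes the identifier before a record reaches a training set. Note that salted hashing is pseudonymization rather than anonymization, so the output remains personal data under the GDPR.

```python
import hashlib

# Hypothetical raw viewing-history record; only some fields are needed for the
# stated purpose of training a genre-recommendation model.
raw_record = {
    "subscriber_id": "subject-123",
    "email": "fan@example.com",
    "home_address": "1 Example Street",
    "title_watched": "Example Thriller",
    "genre": "thriller",
    "watch_duration_min": 94,
}

RELEVANT_FIELDS = {"genre", "watch_duration_min"}  # defined by the stated purpose

def minimize(record: dict, salt: str) -> dict:
    """Drop fields irrelevant to the purpose and pseudonymize the identifier."""
    pseudonym = hashlib.sha256((salt + record["subscriber_id"]).encode()).hexdigest()[:16]
    minimized = {k: v for k, v in record.items() if k in RELEVANT_FIELDS}
    minimized["pseudonym"] = pseudonym  # re-identification requires the separately held salt
    return minimized

print(minimize(raw_record, salt="keep-this-secret-and-separate"))
```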
There is an obvious need therefore to balance these two potentially competing principles to ensure that the rights of data subjects are protected and that confidence in the effectiveness and accuracy of models is retained, but the trade-off between them certainly creates a challenge for compliance.
In an attempt to provide clearer direction on this complex topic, the UK’s ICO has produced a Guidance on AI and Data Protection. In Hong Kong, the Office of the Privacy Commissioner for Personal Data has issued a Guidance Note on the ethical development and use of artificial intelligence. The Guidance Note provides helpful suggestions to address some key concerns in relation to the collection of data, including putting in place measures to ensure compliance with the requirements under the PDPO2 and minimizing the amount of personal data used in the development and use of AI to reduce privacy risk,3 in line with the existing data protection principles under the PDPO and with a view to striking a balance between accuracy and data minimization in the development of AI technology.
1. For more details, please refer to the article “How to use AI and personal data appropriately and lawfully,” published by the UK’s ICO, and section 4.3 of the Guidance Note on the ethical development and use of artificial intelligence issued by the HK Office of the Privacy Commissioner for Personal Data.
2. Examples of such measures include collecting an adequate but not excessive amount of personal data by lawful and fair means; refraining from using personal data for any purpose that is not compatible with the original purpose of collection unless the express and voluntary consent of the data subjects has been obtained or the personal data has been anonymized; taking all practical steps to ensure the accuracy of personal data before use; taking all practical steps to ensure the security of personal data; and erasing or anonymizing personal data when the original purpose of collection has been achieved.
3. This can be achieved by collecting only the data that is relevant to the particular purpose of the AI in question and discarding data containing characteristics of individuals that are irrelevant to that purpose; using anonymized, pseudonymized or synthetic data to train AI models; using federated learning to train AI models so as to avoid unnecessary sharing of training data from different sources; and erasing personal data from the AI system when the data is no longer required for the development and use of the AI.