EU privacy body weighs in on some tricky GenAI lawfulness questions

Dec 17, 2024

The European Data Protection Board (EDPB) published an opinion on Wednesday that explores how AI developers might use personal data to develop and deploy AI models, such as large language models (LLMs), without falling foul of the bloc’s privacy laws. The Board plays a key steering role in the application of these laws, issuing guidance that supports regulatory enforcement, so its views are important.

Areas the EDPB opinion covers include whether AI models can be considered anonymous (which would mean privacy laws wouldn’t apply); whether a “legitimate interests” legal basis can be used for lawfully processing personal data for the development and the deployment of AI models (which would mean individuals’ consent would not need to be sought); and whether AI models that were developed with unlawfully processed data could subsequently be deployed lawfully.

The question of what legal basis might be appropriate for AI models to ensure they are compliant with the General Data Protection Regulation (GDPR), especially, remains a hot and open one. We’ve already seen OpenAI’s ChatGPT getting into hot water here. And failing to abide by the privacy rules could lead to penalties of up to 4% of global annual turnover and/or orders to change how AI tools work.

Almost a year ago, Italy’s data protection authority issued a preliminary finding that OpenAI’s chatbot breaches the GDPR. Since then, other complaints have been lodged against the tech, including in Poland and Austria, targeting aspects such as its lawful basis for processing people’s data, its tendency to make up information and its inability to correct erroneous pronouncements on individuals.

The GDPR contains both rules for how personal data can be processed lawfully and a suite of data access rights for individuals — including the ability to ask for a copy of data held about them, to have data about them deleted, and to correct inaccurate information about them. But for confabulating AI chatbots (or “hallucinating,” as the industry frames it) these are not trivial asks.

But while generative AI tools have quickly faced multiple GDPR complaints, there has — so far — been far less enforcement. EU data protection authorities are clearly wrestling with how to apply long-established data protection rules to a technology that demands so much data for training. The EDPB opinion is intended to help oversight bodies with their decision-making.

Responding in a statement, Ireland’s Data Protection Commission (DPC), the regulator which instigated the request for Board views on the areas the opinion tackles — and the watchdog that’s set to lead on GDPR oversight of OpenAI following a legal switch late last year — suggested the EDPB’s opinion will “enable proactive, effective and consistent regulation” of AI models across the region.

“It will also support the DPC’s engagement with companies developing new AI models before they launch on the EU market, as well as the handling of the many AI related complaints that have been submitted to the DPC,” commissioner Dale Sunderland added.

As well as giving pointers to regulators on how to approach generative AI, the opinion offers some steer to developers on where privacy regulators may come down on crux issues such as lawfulness. But the main message they should take away is that there won’t be a one-size-fits-all solution to the legal uncertainty they face.

Model anonymity

For instance, on the question of model anonymity — which the Board defines as an AI model that should be “very unlikely” to “directly or indirectly identify individuals whose data was used to create the model” and very unlikely to allow users to extract such data from the model through prompt queries — the opinion stresses this must be assessed “on a case-by-case basis.”

The document also provides what the Board dubs “a non-prescriptive and non-exhaustive list” of methods whereby model developers might demonstrate anonymity, such as via source selection for training data that contains steps to avoid or limit collection of personal data (including by excluding “inappropriate” sources); data minimization and filtering steps during the data preparation phase pre-training; making robust “methodological choices” that “may significantly reduce or eliminate” the identifiability risk, such as choosing “regularization methods” aimed at improving model generalization and reducing overfitting, and applying privacy-preserving techniques like differential privacy; as well as any measures added to the model that could lower the risk of a user obtaining personal data from training data via queries.
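To make that list a little more concrete, below is a minimal, hypothetical sketch of the kind of data minimization and filtering step the opinion points to for the pre-training data preparation phase. The regular expressions, placeholders and function names are illustrative assumptions rather than anything prescribed by the EDPB, and a production pipeline would rely on far more robust personal-data detection.

```python
import re

# Illustrative pre-training filtering step: strip obvious direct identifiers
# from raw text before it enters a training corpus. Real pipelines use far
# more robust PII detection (named-entity recognition, dedicated scanners).

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_personal_data(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

def prepare_corpus(documents: list[str]) -> list[str]:
    """Data minimization pass: drop empty records and scrub identifiers."""
    return [scrub_personal_data(doc) for doc in documents if doc.strip()]

if __name__ == "__main__":
    raw = ["Contact jane.doe@example.com or +44 20 7946 0958 for details.", "   "]
    print(prepare_corpus(raw))
    # ['Contact [EMAIL] or [PHONE] for details.']
```

The design point the sketch illustrates is simply that identifiability risk can be reduced before training ever starts, which is one of the factors the Board says regulators should weigh when assessing a claim of anonymity.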

This indicates that a whole host of design and development choices AI developers make could influence regulatory assessments of the extent to which the GDPR applies to a particular model. Only truly anonymous data, where there is no risk of re-identification, falls outside the scope of the regulation — but in the context of AI models the bar is being set at a “very unlikely” risk of identifying individuals or extracting their data.

Prior to the EDPB opinion, there had been some debate among data protection authorities over AI model anonymity — including suggestions that models can never themselves be personal data — but the Board is clear that AI model anonymity is not a given. Case-by-case assessments are necessary.

Legitimate interest

The opinion also looks at whether a legitimate interest legal basis can be used for AI development and deployment. This is important because there are only a handful of available legal bases in the GDPR, and most are inappropriate for AI — as OpenAI has already discovered via the Italian DPA’s enforcement.

Legitimate interest is likely to be the basis of choice for AI developers building models, since it does not require obtaining consent from every individual whose data is processed to build the tech. (And given the quantities of data used to train LLMs, it’s clear that a consent-based legal basis would not be commercially attractive or scalable.)

Again, the Board’s view is that DPAs will have to undertake assessments to determine whether legitimate interest is an appropriate legal basis for processing personal data for the development and the deployment of AI models — referring to the standard three-step test, which requires watchdogs to consider the purpose and necessity of the processing (i.e., whether it is lawful and specific, and whether there are any alternative, less intrusive ways to achieve the intended outcome) and to perform a balancing test looking at the impact of the processing on individual rights.

The EDPB’s opinion leaves the door open to it being possible for AI models to meet all the criteria for relying on a legitimate interest legal basis, suggesting, for example, that the development of an AI model to power a conversational agent service to assist users, or the deployment of improved threat detection in an information system, would meet the first test (lawful purpose).

For assessing the second test (necessity), assessments must look at whether the processing actually achieves the lawful purpose and whether there is no less intrusive way to achieve the aim — paying particular attention to whether the amount of personal data processed is proportionate to the goal, with the GDPR’s data minimization principle in mind.

The third test (balancing individual rights) must “take into account the specific circumstances of each case,” per the opinion. Special attention is required for any risks to individuals’ fundamental rights that may emerge during development and deployment.

Part of the balancing test also requires regulators to consider the “reasonable expectations” of data subjects — meaning, whether individuals whose data was processed for AI could have expected their information to be used in such a way. Relevant considerations here include whether the data was publicly available, the source of the data and the context of its collection, any relationship between the individual and the processor, and potential further uses of the model.

In cases where the balancing test fails because the individuals’ interests outweigh the processor’s, the Board says mitigation measures to limit the impact of the processing on individuals could be considered — and these should be tailored to the “circumstances of the case” and “characteristics of the AI model,” such as its intended use.

Examples of mitigation measures the opinion cites include technical measures (such as those listed above in the section on model anonymity); pseudonymization measures (such as checks that would prevent any combination of personal data based on individual identifiers); measures to mask personal data or substitute it with fake personal data in the training set; measures that aim to enable individuals to exercise their rights (such as opt-outs); and transparency measures.
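As a rough illustration of the pseudonymization and masking measures mentioned above, the sketch below swaps a direct identifier for a keyed, non-reversible token before a record enters a training set. The key, record fields and helper names are assumptions made for the example, not terminology from the opinion.

```python
import hmac
import hashlib

# Illustrative pseudonymization step: replace a direct identifier with a keyed,
# non-reversible token before the record reaches a training set. The key is a
# placeholder and, in practice, would be stored separately from the data.

SECRET_KEY = b"keep-this-key-outside-the-training-pipeline"

def pseudonymize(identifier: str) -> str:
    """Map an identifier to a stable token that cannot be reversed without the key."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]

def mask_record(record: dict) -> dict:
    """Substitute the direct identifier and keep only the fields needed for training."""
    return {
        "user": pseudonymize(record["email"]),  # stable token replaces the e-mail
        "text": record["text"],                 # content retained for training
    }

if __name__ == "__main__":
    sample = {"email": "jane.doe@example.com", "text": "Support ticket body..."}
    print(mask_record(sample))
```

Keeping the key separate from the data is what distinguishes pseudonymization from anonymization under the GDPR: the data remains personal data, but the measure lowers the impact of the processing on the individuals concerned.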

The opinion also discusses measures for mitigating risks associated with web scraping, which the Board says raises “specific risks.”

Unlawfully trained models

The opinion also weighs in on the sticky issue of how regulators should approach AI models that were trained on data that was not processed lawfully, as the GDPR demands.

Again, the Board recommends regulators take into account “the circumstances of each individual case” — so the answer to how EU privacy watchdogs will respond to AI developers who fall into this law-breaking category is… it depends.

However, the opinion appears to offer a sort of get-out clause for AI models that may have been built on shaky (legal) foundations, say because they scraped data from anywhere they could get it with no consideration of any consequences, if they take steps to ensure that any personal data is anonymized before the model goes into the deployment phase.

In such cases — so long as the developer can demonstrate that subsequent operation of the model does not entail the processing of personal data — the Board says the GDPR would not apply, writing: “Hence, the unlawfulness of the initial processing should not impact the subsequent operation of the model.”

Discussing the significance of this element of the opinion, Lukasz Olejnik, an independent consultant and affiliate of the KCL Institute for Artificial Intelligence — whose GDPR complaint against ChatGPT remains under consideration by Poland’s DPA more than a year on — warned that “care must be taken not to allow systematic misuse schemes.”

“That’s an interesting potential divergence from the interpretation of data protection laws until now,” he told TechCrunch. “By focusing only on the end state (anonymization), the EDPB may unintentionally or potentially legitimize the scraping of web data without proper legal bases. This potentially undermines GDPR’s core principle that personal data must be lawfully processed at every stage, from collection to disposal.”

Asked what impact he sees the EDPB opinion as a whole having on his own complaint against ChatGPT, Olejnik added: “The opinion does not tie hands of national DPAs. That said I am sure that PUODO [Poland’s DPA] will consider it in its decision,” though he also stressed that his case against OpenAI’s AI chatbot “goes beyond training, and includes accountability and Privacy by Design.”
