From Privacy Compliance to AI Governance: Sourcing Training Data

For years, internet privacy compliance mainly focused on consumer-facing disclosures: privacy policies, cookie banners, and notices explaining how personal data would be collected, used, stored, and shared. The growing integration of generative AI into everyday digital services has disrupted this model, shifting legal attention away from disclosure and toward how data is collected, scraped, licensed, retained, and used to build and train AI systems. Recent litigation involving AI systems demonstrates growing legal scrutiny of how AI training data is sourced, processed, and deployed.

Recent privacy law developments demonstrate that privacy compliance can no longer end with consumer disclosures. The Texas Data Privacy and Security Act, which took effect on July 1, 2024, requires companies not only to provide privacy notices but also to limit personal data collection to what is “reasonably necessary,” implement reasonable safeguards, and conduct data protection assessments for high-risk processing activities such as targeted advertising or profiling that may influence significant decisions about consumers. These obligations tie compliance to how companies actually manage personal data, not to disclosures alone. The California Privacy Protection Agency (CPPA) has publicly highlighted enforcement concerns about excessive data collection and dark patterns that manipulate consent, and it continues rulemaking on risk assessments, cybersecurity audits, and automated decision-making. These developments push companies to look beyond front-end disclosures and toward internal data governance, especially where personal information is repurposed for model development, profiling, or automated outputs.

The same trend is visible in AI-specific legislation. The relevant provisions of the EU AI Act generally become applicable on August 2, 2026, and extend compliance beyond disclosure to the governance of high-risk AI systems. Article 6 and Annex III classify certain systems as high-risk, including systems used in biometrics, critical infrastructure, education, employment, law enforcement, migration and border control, and the administration of justice and democratic processes. For those systems, Article 10 requires providers to implement data-governance measures for training, validation, and testing datasets, including measures addressing data collection, data origin, the purpose of personal data collection, and data preparation. These requirements are designed to mitigate bias, errors, and discriminatory outcomes. In the United States, Colorado’s SB24-205 similarly reflects a shift toward risk-based AI governance. Although its implementation timeline remains subject to ongoing legislative revision, it currently requires developers and deployers of high-risk AI systems to use reasonable care to prevent algorithmic discrimination against consumers, including obligations related to risk management, impact assessments, and correction of incorrect personal data.

Recent AI-related litigation has similarly scrutinized both sides of the AI lifecycle: how training data was acquired, retained, and deployed, and what AI systems later produced. In Bartz v. Anthropic PBC, the U.S. District Court for the Northern District of California held that using copyrighted books to train large language models (LLMs) could qualify as fair use, while retaining pirated books in a permanent LLM training library was not protected by fair use. New York Times v. Microsoft Corporation highlights both training-data and output risk: copyrighted news works were allegedly scraped from the plaintiffs’ websites to train LLMs, and the models were later alleged to reproduce portions of those works in response to user prompts. In Andersen v. Stability AI Ltd., plaintiffs allege that Stable Diffusion used their copyrighted artworks as training data and can generate images in the artists’ styles. In re Clearview AI, Inc., Consumer Privacy Litigation illustrates the risks posed by biometric data sources. In that case, Clearview was sued for allegedly scraping facial images from publicly available websites without consent and compiling them into a searchable biometric database. Although these matters remain at different procedural stages, including a partial summary-judgment ruling in Bartz, unresolved copyright claims in New York Times and Andersen, and a court-approved settlement in In re Clearview AI, Inc. in March 2025, they collectively demonstrate that some AI legal risks depend on the original source of the training data.

Cross-border regulation reinforces the same point. Jurisdictions are adopting divergent rules governing AI training data. For example, the EU AI Act imposes data-governance requirements for training, validation, and testing datasets used in high-risk systems, while U.S. regulators are addressing training data through national-security and data-transfer rules. In April 2025, the U.S. Department of Justice’s Data Security Program took effect, restricting certain data transactions that could allow countries of concern (such as China, Russia, and Iran), or entities subject to their control, to access U.S. Government-related data or Americans’ bulk genomic, geolocation, biometric, health, financial, or other sensitive personal data. The Department warned that foreign adversaries could exploit such large datasets to train AI systems and develop military capabilities. Together, these developments make one point clear: AI legal risks increasingly hinge on the source of training data.

In the age of generative AI, legal risk no longer turns only on what a model produces; it also turns on where its training data comes from and whether that data can be lawfully collected, licensed, transferred, and reused. Output-related risks remain central to AI law, including hallucinations, defamation, discrimination, safety harms, consumer deception, and professional responsibility. But generative AI adds another layer of compliance risk at the data-input stage. Privacy notices still matter, but they no longer define the full scope of AI legal risk. Organizations must demonstrate that training data was lawfully sourced, licensed, transferred across jurisdictions, and governed throughout the AI lifecycle. That is the real transition from traditional privacy compliance to AI governance: not replacing output regulation but recognizing training-data provenance as a central legal issue for regulators, courts, and sophisticated counsel.
