
How Rules for Publicly Available Data Are Shaping the Future of AI

by Daniel Castro

Artificial intelligence (AI) systems learn by analyzing vast quantities of digital information. As governments debate how to regulate AI, a central question has emerged: Should developers be allowed to train models on information that is publicly available on the Internet, even when that information contains personal data?

The answer will shape not only privacy protections but also the future trajectory of AI development. Publicly accessible websites, open databases, government records, and other online resources form a critical pool of knowledge that AI systems rely on to understand language, reason about the world, and verify information. At the same time, the ability of AI systems to analyze this information at scale raises legitimate questions about how they should use and protect personal data.

Different jurisdictions have begun to approach these questions in different ways. The United States generally treats publicly accessible web data as available for automated collection unless site owners impose technical barriers. The European Union, by contrast, places broader restrictions on how organizations may process personal data, even when that information appears on public websites. As AI capabilities advance and agentic systems begin interacting with information and services across the Internet, these policy differences will increasingly influence where AI development occurs and which countries capture the economic benefits of AI adoption.

Policymakers can protect individuals while preserving the open information ecosystem that supports innovation. This approach can be grounded in three key principles:

  1. Focus on outputs rather than training inputs. Address harmful uses of AI systems—such as revealing sensitive personal information—instead of restricting the collection of publicly available data for model training.
  2. Encourage transparency norms for autonomous AI agents. Promote voluntary industry practices for AI developers to help people understand when they are interacting with automated systems, while allowing flexibility for evolving uses of agentic AI.
  3. Create a safe harbor for responsible use of publicly available data. Provide legal certainty for developers that respect machine-readable opt-out signals from websites and that use automated tools to filter sensitive personal information during data preparation.

The Internet has long served as a shared source of public knowledge. In the AI era, it has become a foundational input for building systems that can reason, retrieve information, and interact effectively with the world. Policymakers in the United States, the European Union, and everywhere else that aspires to be at the forefront of AI development and adoption should ensure developers can continue using it that way.

Read the report.
