In the Wake of Generative AI, Industry-Led Standards for Data Scraping Are a Must

by Morgan Stevens & Daniel Castro September 1, 2023

Shortly after the release of OpenAI’s popular generative AI system, ChatGPT, some website owners began complaining about AI companies scraping data—automatically gathering data from the public Internet—to train their AI systems. While courts have repeatedly reaffirmed that web scraping is legal in the United States, significant public concerns about AI have raised the risk that Congress might step in and pass anti-scraping legislation. However, a legislative intervention would be a mistake, especially given that the private sector has previously resolved a similar issue through voluntary measures.

Nearly 30 years ago, similar complaints arose over the use of web crawlers, which are automated programs that index the content of webpages. Internet search engines widely deployed these bots to systematically browse the Internet to build and update their databases of webpages. However, website owners complained about their use, as the bots can create unwanted website traffic, increase server and network loads, and degrade user experiences. In response, Internet engineers created the Robots Exclusion Protocol in 1994, a voluntary, community-developed standard that tells a web crawler which parts of a website it may crawl. Since then, website owners and crawler operators have widely adopted this standard, balancing website owners’ concerns and search engines’ requirements without the need for regulatory action.

The concerns today about data scraped from the public Internet to train AI systems are similar to prior complaints from website owners about search engine web crawlers. Just as search engine companies need to scrape data to provide accurate and up-to-date search results, so too do AI companies need to scrape data to train their AI systems. For example, OpenAI has explained that scraping data from websites “can help AI models become more accurate and improve their general capabilities and safety.”

Web scraping is legal in the United States, but there is a risk that policymakers could decide to intervene. Indeed, top data protection regulators from a dozen countries—including Australia, Canada, Mexico, China, and the UK—recently published an open letter to website operators urging them to “implement measures to protect against unlawful data scraping.” But new laws and regulations are not necessary given that the private sector is already taking steps to give website operators more control over whether AI web crawlers scrape their sites.

First, many websites can use the existing Robots Exclusion Protocol to restrict web crawlers from popular AI companies. OpenAI, for example, publishes details on its crawler, which allows website owners to easily disallow it from accessing their sites. Almost 20 percent of the top 1,000 websites in the world have blocked AI crawlers using this method, which shows how easily it can be done. Second, the private sector is exploring additional technical standards that would give website owners and content producers more control. For example, Adobe has proposed that creators be able to attach “Do Not Train” metadata to their work to tell AI companies not to add that work to datasets used to train AI. Google has similarly stated that the Internet community should collaborate with the AI community on developing machine-readable standards that give website owners more control, and the company has announced its intent to lead a public discussion on this topic. As these initiatives gain momentum, organizations such as the Internet Engineering Task Force will likely provide a forum for finalizing any resulting standard.
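To make the first point concrete, here is a minimal sketch of how a Robots Exclusion Protocol rule can turn away one AI crawler while leaving other crawlers largely unaffected. It uses Python's standard-library `urllib.robotparser`. The sample file contents, the `example.com` URLs, the `SearchBot` name, and the helper `is_allowed` are illustrative assumptions rather than anything taken from a real site. `GPTBot` is used here as the user-agent token OpenAI has documented for its crawler, but website owners should confirm the current token in the company's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an example site (an assumption for
# illustration): block one AI crawler entirely, and keep every other crawler
# out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string rather than
    downloading them, so the example runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("GPTBot", "https://example.com/articles/1"),
        ("SearchBot", "https://example.com/articles/1"),
        ("SearchBot", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent} -> {url}: {verdict}")
```

As with search engine crawlers, the protocol is voluntary: the file only states the site owner's wishes, and it works because crawler operators choose to check it before fetching pages.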

Given the success of non-government solutions to concerns about web crawling, policymakers should not intervene at this stage. The AI industry is evolving rapidly. Creating new laws or regulations in the United States to restrict how organizations and individuals can scrape publicly available data on the Internet to train their AI models would blunt progress, impede their ability to adapt to new developments or challenges, and hurt U.S. competitiveness in AI.

Image credit: Flickr user Jernej Furman

Morgan Stevens

Morgan Stevens is a Research Assistant at the Center for Data Innovation. She holds a J.D. from the Sandra Day O'Connor College of Law at Arizona State University and a B.A. in Economics and Government from the University of Texas at Austin.

Daniel Castro

Daniel Castro is the director of the Center for Data Innovation and vice president of the Information Technology and Innovation Foundation. Mr. Castro writes and speaks on a variety of issues related to information technology and internet policy, including data, privacy, security, intellectual property, internet governance, e-government, and accessibility for people with disabilities. His work has been quoted and cited in numerous media outlets, including The Washington Post, The Wall Street Journal, NPR, USA Today, Bloomberg News, and Businessweek. In 2013, Mr. Castro was named to FedScoop’s list of “Top 25 most influential people under 40 in government and tech.” In 2015, U.S. Secretary of Commerce Penny Pritzker appointed Mr. Castro to the Commerce Data Advisory Council. Mr. Castro previously worked as an IT analyst at the Government Accountability Office (GAO) where he audited IT security and management controls at various government agencies. He contributed to GAO reports on the state of information security at a variety of federal agencies, including the Securities and Exchange Commission (SEC) and the Federal Deposit Insurance Corporation (FDIC). In addition, Mr. Castro was a Visiting Scientist at the Software Engineering Institute (SEI) in Pittsburgh, Pennsylvania where he developed virtual training simulations to provide clients with hands-on training of the latest information security tools. He has a B.S. in Foreign Service from Georgetown University and an M.S. in Information Security Technology and Management from Carnegie Mellon University.
