Language datasets power many applications, from translation programs and spam detection to digital assistants, chatbots, and the sentiment analysis tools researchers use. But not all languages have high-quality datasets, meaning many tools are built with data for just one or a select few languages. As a result, people who speak languages with lower-quality datasets either lack access to these tools outright or find that the tools don’t work effectively for them. Federal offices, like the Office of Science and Technology Policy (OSTP) and the National Science Foundation, should aid the development of high-quality language datasets to enhance the accuracy of natural language processing models and enable more communities to benefit from data-driven decision making.
Natural language processing (NLP) is a branch of artificial intelligence that uses computational linguistics and machine learning to power computer understanding of text and spoken words. Its applications range from translation programs and spam detection to speech-to-text digital assistants, chatbots, and sentiment analysis for researchers. But NLP models are the product of their training data. When these models are built on incomplete or inaccurate data, they can inadvertently discriminate against certain groups, such as in resume or job screenings, and create obstacles to the detection of online harms, such as extremism or cyberbullying. Datasets for NLP are inaccurate for many languages, meaning tools like translation applications that enable access to information don’t work for many communities.
The volume of text and audio available online varies widely from language to language. The top 10 languages used online make up 82 percent of all content on the Internet. This leaves some languages as “low-resource,” with little data available and greater difficulty for natural language processing. Finding high-quality data to scrape from websites for these languages may be difficult, even on public sources like Wikipedia.
Low-resource languages often have levels of data availability disproportionate to the actual number of speakers of a given language. For example, more than 200 million people speak Indonesian, yet the language is considered low-resource. Commonly spoken languages like Bengali, Hindi, Urdu, and Swahili also cannot be analyzed effectively due to a lack of accurate datasets. Datasets for these languages are often filled with errors, as demonstrated by a recent study examining five major public language datasets. The study found that lower-resource language datasets have systemic issues, including some corpora that lack usable text outright and some in which fewer than half of the sentences are of acceptable quality. In addition, the corpora lack standardization and high-quality labeling.
Disparities exist even among English dialects. English language training data is often generalized to all English speakers, even though syntax structures, word choices, and slang often differ along geographic, racial, gender, and age lines. This generalization can create problems for models that are only trained to accept one specific dialect. For example, an NLP-enabled resume and cover letter screener may only flag documents with one specific presentation of English. Additionally, some researchers have highlighted lower levels of accuracy for African American Vernacular English (AAVE). Language identification tools try to predict the language used in a given piece of text. This is the critical first step in translation or sentiment analysis tasks, and the accuracy of every downstream application depends on getting it right. But one popular tool for language identification misclassifies AAVE as Danish with a 99.9 percent confidence level. This means that users who post online using AAVE might be excluded from research datasets or be limited in their ability to use NLP-powered tools.
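To see how a misidentified post can drop out of a research dataset, consider the minimal Python sketch below. It is illustrative only: `Prediction`, `filter_english`, and `stub_identifier` are hypothetical names rather than part of any real tool, and the stub's outputs are hard-coded placeholders (the 0.999 score simply mirrors the 99.9 percent figure above), not measurements.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Prediction:
    language: str      # predicted language code, e.g. "en" or "da"
    confidence: float  # score in [0, 1] reported by the identifier


def filter_english(
    posts: List[str],
    identify: Callable[[str], Prediction],
    min_confidence: float = 0.9,
) -> Tuple[List[str], List[str]]:
    """Split posts into (kept, dropped) using a language identifier.

    A post is kept only if the identifier labels it English with at
    least `min_confidence`; everything else is dropped.
    """
    kept, dropped = [], []
    for post in posts:
        pred = identify(post)
        if pred.language == "en" and pred.confidence >= min_confidence:
            kept.append(post)
        else:
            dropped.append(post)
    return kept, dropped


# Placeholder posts; real text is deliberately not invented here.
POST_A = "post A (variety well represented in the tool's training data)"
POST_B = "post B (variety missing from the tool's training data)"

# Hypothetical identifier outputs, hard-coded for illustration only.
_STUB_OUTPUTS = {
    POST_A: Prediction("en", 0.98),
    POST_B: Prediction("da", 0.999),  # English post mislabeled as Danish
}


def stub_identifier(text: str) -> Prediction:
    """Stand-in for a real language identification tool."""
    return _STUB_OUTPUTS[text]


kept, dropped = filter_english([POST_A, POST_B], stub_identifier)
print("kept:", kept)
print("dropped:", dropped)
```

In this sketch, lowering `min_confidence` would not recover the second post, because the identifier is confidently wrong rather than uncertain. The filter behaves as designed, and the loss traces back to the data the identifier was trained on.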
The White House Office of Science and Technology Policy (OSTP) and the National Science Foundation (NSF) have placed a particular focus on improving large AI training datasets through the development of the National AI Research Resource (NAIRR). While a NAIRR task force continues to develop plans for such a resource, it has already identified building more representative and equitable datasets as a key priority. Creating language datasets would be one way to address this goal, as well as boost the use of NLP more widely. In particular, the NAIRR should focus on building high-quality datasets for the most widely spoken languages in the United States, including dialects like AAVE.
The government should also explore opportunities to partner with other countries to aid the development of accurate NLP-enabled tools for non-English languages. As a forum for international collaboration, the United Nations can also play a key role in building, funding, and expanding existing language datasets. Without better language data, a divide will persist between those with access to accurate, data-driven technologies powered by NLP and those without.
NLP holds great potential to transform society by enabling more people to connect and communicate online, access information, and participate in democracy, but only if major improvements are made in the quality of language datasets available.
Image credit: National Academy of Medicine