Improving Language Translation Models

by Morgan Stevens, September 22, 2023


Google has created a multilingual dataset to improve language translation models. It contains 4 billion documents with 100 billion sentences in 419 languages. The dataset improves on past multilingual datasets because researchers manually audited the text to remove unusable, misaligned, or mislabeled data.

Get the data.

Image credit: Flickr user Ivan Radic

Morgan Stevens

Morgan Stevens is a Research Assistant at the Center for Data Innovation. She holds a J.D. from the Sandra Day O'Connor College of Law at Arizona State University and a B.A. in Economics and Government from the University of Texas at Austin.
