Google has created a multilingual dataset to improve language translation models. It contains 4 billion documents with 100 billion sentences in 419 languages. The dataset improves upon past multilingual datasets as researchers manually audited the text to remove unusable, misaligned, or mislabeled data.
Image credit: Flickr user Ivan Radic