Building a Dataset of Translated Sentences

by Michael McLaughlin
A person clicking the screen on a smartphone.

Facebook has released CCMatrix, a dataset of 4.5 billion parallel sentences: sentences in one language paired with their translations in other languages. The dataset covers more than 500 language pairs. CCMatrix can help advance the development of translation systems, particularly for languages that have relatively little digitized material.
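
To make the idea of parallel sentences concrete, here is a minimal Python sketch of how such data might be represented and grouped by language pair. The `ParallelSentence` type, the `group_by_language_pair` helper, and the example sentences are all hypothetical and made up for illustration; they are not taken from CCMatrix and do not reflect its actual file format.

```python
from collections import defaultdict
from typing import NamedTuple


class ParallelSentence(NamedTuple):
    """One sentence and its translation (hypothetical structure, not the CCMatrix format)."""
    src_lang: str
    tgt_lang: str
    src_text: str
    tgt_text: str


def group_by_language_pair(pairs):
    """Group sentence pairs under their (source, target) language codes."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[(pair.src_lang, pair.tgt_lang)].append(pair)
    return dict(grouped)


# Illustrative sentences invented for this sketch, not drawn from the dataset.
examples = [
    ParallelSentence("en", "fr", "Good morning.", "Bonjour."),
    ParallelSentence("en", "fr", "Thank you.", "Merci."),
    ParallelSentence("en", "de", "Thank you.", "Danke."),
]

by_pair = group_by_language_pair(examples)
for (src, tgt), items in by_pair.items():
    print(f"{src}-{tgt}: {len(items)} sentence pairs")
```

A translation system would typically be trained on many such pairs per language pair, which is why a large collection matters most for languages with little digitized text.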

Get the data.

Image: PxHere
