Building a Dataset of Translated Sentences

by Michael McLaughlin, February 14, 2020


A person clicking the screen on a smartphone.

Facebook has released CCMatrix, a dataset of 4.5 billion parallel sentences: sentences in one language paired with their translations in another. The dataset covers more than 500 language pairs. CCMatrix can help advance the development of translation systems, particularly for languages that have relatively little digitized text.

Get the data.

Image: PxHere

Michael McLaughlin

Michael McLaughlin is a research analyst at the Center for Data Innovation. He researches and writes about a variety of issues related to information technology and Internet policy, including digital platforms, e-government, and artificial intelligence. Michael graduated from Wake Forest University, where he majored in Communication with minors in Politics and International Affairs and in Journalism. He received his master’s in Communication from Stanford University, specializing in data journalism.
