An Open Dataset for Multilingual Speech Research

by Cassidy Chansirik February 9, 2021

Facebook AI has released Multilingual LibriSpeech (MLS), a multilingual audio dataset intended to improve speech research for AI-powered services such as voice assistants. MLS expands upon English-only audiobook data from LibriVox to provide more than 50,000 hours of audio in English and seven additional languages: German, Dutch, French, Spanish, Italian, Portuguese, and Polish. MLS also provides language-model training data and pretrained language models, which help researchers compare different automatic speech recognition systems on the same data.

Get the data.

Image credit: Mahesh Patel

Cassidy Chansirik

Cassidy Chansirik is an intern at the Center for Data Innovation. She is currently a student at the University of California, Berkeley, pursuing a B.A. in Legal Studies and a minor in Education. She is passionate about the intersections of technology, education, and law and how they affect the community.
