Capturing India’s Linguistic Diversity

by Aswin Prabhakar March 7, 2024

Researchers at the Indian Institute of Technology Madras have created IndicVoices, a dataset of natural and spontaneous speech that captures the cultural, linguistic, and demographic diversity of India. The dataset contains around 7,300 hours of audio from 16,000 speakers, covering 145 Indian districts and 22 languages. It will enable the development of new speech recognition tools and make essential services more accessible to people across India, particularly in remote areas where language barriers have been a significant hurdle.

Get the data.

Image credits: Unsplash user Rohan Solankurkar

Aswin Prabhakar

Aswin Prabhakar is a Policy Analyst at the Center for Data Innovation. He was previously a James Buchanan Fellow at George Mason University, where he earned a Master's degree in Economics. Aswin also holds an Integrated Master's degree in Development Studies from the Indian Institute of Technology Madras. His research focuses on the policy implications of emerging technologies, with a particular emphasis on open-source AI models, and he also analyzes competition policy issues in the e-commerce sector.
