Transcribing YouTube Videos for LLM Training

by Martin Makaryan May 7, 2024

Pleias, a French startup that builds energy-efficient large language models (LLMs) for information-sensitive industries, has released a dataset called YouTube-Commons that contains over two million copyright-free video transcripts. YouTube-Commons includes the full transcript of each video and, at nearly 30 billion words, is one of the largest collections of conversational data. The dataset gives LLM developers a large amount of freely available training data.

Get the data.

Image credit: Alexander Shatov

Martin Makaryan

Martin Makaryan is a research assistant specializing in digital policy. He is currently a master's student at the School of Advanced International Studies (SAIS) at Johns Hopkins University, where he specializes in security and strategy, with a focus on the intersection of security, policy, and emerging technologies. He holds a B.A. in Political Science and Global Studies from UCLA and previously worked in government affairs and policy research in California, in both the non-profit and government sectors. His academic and professional interests include the impact of innovation and technology on foreign policy and national security policy, as well as automation and AI, cybersecurity, and digital policy.
