Transcribing YouTube Videos for LLM Training

by Martin Makaryan

Pleias, a French startup that builds energy-efficient large language models (LLMs) for information-sensitive industries, has released YouTube-Commons, a dataset of over two million copyright-free video transcripts. Each entry contains the full transcript of a YouTube video, and at nearly 30 billion words the dataset is one of the largest collections of conversational data. It gives LLM developers a large, freely available source of training data.

Get the data.

Image credit: Alexander Shatov
