Making Scholarly Articles More Accessible for Machine Learning

by Cassidy Chansirik September 1, 2020

ArXiv, an open-access digital repository of scholarly articles maintained by Cornell University in New York, has made all of its 1.7 million research articles available on Kaggle, a public online platform for machine learning training datasets. For each article, the dataset includes information such as the author, article title, category, abstract, and citations, as well as a link to the full-text PDF. With this dataset, researchers can more easily use data from arXiv articles to perform trend analysis, create algorithms that group scholarly papers by topic, and improve search engines for scholarly papers.
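As a rough illustration of the kind of trend analysis described above, the following minimal Python sketch counts articles per year and category. It assumes the metadata is stored as one JSON object per line and that each record has a space-separated `categories` field and an `update_date` field in `YYYY-MM-DD` form; these field names, the function name, and the sample records are assumptions made for illustration, not details confirmed by this article.

```python
import io
import json
from collections import Counter


def count_by_year_and_category(lines):
    """Count articles per (year, category) from JSON Lines metadata.

    Assumes each non-empty line is a JSON object with a space-separated
    "categories" string and an "update_date" string starting with the year.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        year = record["update_date"][:4]
        for category in record["categories"].split():
            counts[(year, category)] += 1
    return counts


# Made-up records for illustration only; these are not real arXiv entries.
SAMPLE = io.StringIO(
    '{"title": "Example A", "categories": "cs.LG stat.ML", "update_date": "2019-05-01"}\n'
    '{"title": "Example B", "categories": "cs.LG", "update_date": "2020-01-15"}\n'
    '{"title": "Example C", "categories": "cs.LG", "update_date": "2020-03-02"}\n'
)

counts = count_by_year_and_category(SAMPLE)
for (year, category), n in sorted(counts.items()):
    print(year, category, n)
```

To run the same count on a downloaded file, pass an open file handle in place of `SAMPLE`; reading line by line keeps memory use low for a dataset of this size.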

Get the data.

Image: Susan Yin

Cassidy Chansirik

Cassidy Chansirik is an intern at the Center for Data Innovation. Currently, she is a student at the University of California, Berkeley and is pursuing a B.A. in Legal Studies and a minor in Education. She is passionate about the intersections of technology, education, and law and how they impact the community.
