BigCode, a project led by U.S. AI research company Hugging Face and Canadian AI research company ServiceNow Research, has created a dataset of permissively licensed code from GitHub. The dataset contains over 300 million code files in 30 programming languages, such as Java, Python, and Dockerfile, as well as information on each file’s repository, size, and content. Researchers can use the dataset to train AI systems that generate code.
Image credit: Flickr user CyberHades