Project CodeNet


Project CodeNet is a large-scale dataset of approximately 14 million code samples, each an intended solution to one of 4,000 coding problems. The samples were obtained by downloading submissions from two online judge websites: AIZU Online Judge and AtCoder. They are written in over 50 programming languages (the dominant ones being C++, C, Python, and Java), and each is annotated with a rich set of information, such as code size, memory footprint, CPU run time, and a status indicating acceptance or the type of error. The dataset is accompanied by a repository that provides tools to aggregate code samples based on user criteria and to transform code samples into token sequences, simplified parse trees, and other code graphs. A detailed discussion of Project CodeNet is available in this paper.
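
The annotations lend themselves to simple filtering. The following is a minimal sketch, assuming the metadata is available as CSV with columns named `submission_id`, `language`, `status`, `cpu_time`, `memory`, and `code_size`. These column names, the status strings, and the sample rows are illustrative assumptions rather than something stated on this page; consult the metadata documentation for the actual schema.

```python
import csv
import io

# Hypothetical metadata rows. Column names and values are assumptions
# made for illustration only; the real schema may differ.
SAMPLE_CSV = """submission_id,language,status,cpu_time,memory,code_size
s001,Python,Accepted,20,3000,120
s002,C++,Wrong Answer,10,1200,450
s003,Python,Time Limit Exceeded,2000,3100,98
s004,Python,Accepted,35,2900,210
"""


def select_submissions(csv_text, language, status="Accepted"):
    """Return the metadata rows that match the given language and status."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader
            if row["language"] == language and row["status"] == status]


accepted_python = select_submissions(SAMPLE_CSV, "Python")
accepted_ids = [row["submission_id"] for row in accepted_python]
print(accepted_ids)  # ['s001', 's004']
```

For a real metadata file, the same function could be fed the file's text, or the `csv.DictReader` could wrap an open file handle directly.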

The rich annotation of Project CodeNet enables research in code search, code completion, code-code translation, and a myriad of other use cases. We also extracted several language-specific benchmark datasets in Python, Java, and C++ to drive innovation in deep learning and machine learning models for code classification and code similarity. To expedite AI-for-code research using graph neural networks, we also made available the simplified parse tree (SPT) representation of the code samples in the four benchmark datasets.

We are also launching a contest around Project CodeNet to create excitement and drive innovation in AI for Code, with a focus on inclusion and diversity. The experimentation and test datasets are hosted here as they become available.

Get this Dataset

Data Description | Archive Dataset File | Archive SPT and CASS Files
Full (Original) Dataset | Project_CodeNet.tar.gz | N/A
Full Dataset, Metadata Only | Project_CodeNet_metadata.tar.gz | N/A
Mini Project CodeNet | Mini_Project_CodeNet.tar.gz | N/A
Python benchmark | Project_CodeNet_Python800.tar.gz | Project_CodeNet_Python800_spts.tar.gz, Project_CodeNet_Python800_cass.tar.gz
Java benchmark | Project_CodeNet_Java250.tar.gz | Project_CodeNet_Java250_spts.tar.gz, Project_CodeNet_Java250_cass.tar.gz
C++ benchmark 1 | Project_CodeNet_C++1000.tar.gz | Project_CodeNet_C++1000_spts.tar.gz, Project_CodeNet_C++1000_cass.tar.gz
C++ benchmark 2 | Project_CodeNet_C++1400.tar.gz | Project_CodeNet_C++1400_spts.tar.gz, Project_CodeNet_C++1400_cass.tar.gz
Sample dataset for language classification | Project_CodeNet_LangClass.tar.gz | N/A
Sample dataset for masked language models | Project_CodeNet_MLM.tar.gz | N/A
Experimentation dataset for Project CodeNet contest | Project_CodeNet_experimentation_dataset.tar.gz | N/A
Development dataset for Project CodeNet contest | Project_CodeNet_dev_dataset.tar.gz | N/A
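
All of the archives above are gzip-compressed tarballs, so a downloaded file can be inspected before it is fully extracted. The sketch below uses Python's standard `tarfile` module to list the first few member names. So that it runs without a download, it builds a tiny stand-in archive first; the file names inside that demo archive are hypothetical and do not reflect the real directory layout.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_archive_members(archive_path, limit=10):
    """Return up to `limit` member names from a .tar.gz archive without extracting it."""
    names = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


def build_demo_archive(archive_path):
    """Create a small stand-in archive (hypothetical contents, for illustration only)."""
    files = {
        "Demo/README.md": b"demo archive\n",
        "Demo/data/p00001/Python/s001.py": b"print('hello')\n",
    }
    with tarfile.open(archive_path, "w:gz") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tar.gz"
    build_demo_archive(demo_path)
    demo_members = list_archive_members(demo_path)

print(demo_members)
```

To inspect a real download, `list_archive_members` could be called with the path to one of the archives named in the table, for example `Mini_Project_CodeNet.tar.gz`.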

Dataset Metadata

Field | Value
Format | C++, Java, Python, other programming languages, csv, text
Dataset license | CDLA Permissive v2.0
Source code license | Apache 2.0
Domain | Solutions to programming problems
Number of code samples | 14 million
Size | 9 GB
Author | Ruchir Puri, David Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss
Dataset Version Update | Version 1 – May 5, 2021
Use Cases | AI for Code: code search, code completion, code-code translation

Dataset Archive Contents

Click here for the full dataset’s directory structure and contents. The contents of each benchmark dataset, its SPT and CASS files, and the experimentation dataset for the Project CodeNet contest are described in the README.

Data Glossary and Preview

Click here to explore the data glossary and here for details about the metadata. Small code samples with README are available for preview.

Use the Dataset

This dataset is complemented by a collection of data exploration and data analysis Python notebooks to help you get started:

Notebook 1: Project CodeNet Language Classification

This notebook takes you through the steps of a simple experiment that shows how to create and exercise a Keras model to detect the language of a piece of source code.

Get the notebook | Run the notebook in Colab
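
Before a model can classify source code by language, each sample has to be turned into numbers. The sketch below shows one plausible preprocessing step, a byte-level encoding padded to a fixed length. It is an assumption for illustration and not necessarily what the notebook does; the function name `encode_source`, the label set, and the two toy samples are all hypothetical.

```python
def encode_source(text, max_len=16, pad_id=0):
    """Map source text to a fixed-length list of integer ids.

    Each UTF-8 byte b becomes id b + 1, so id 0 stays free for padding.
    Text longer than max_len bytes is truncated.
    """
    ids = [b + 1 for b in text.encode("utf-8")[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))


# Hypothetical label set and toy samples, for illustration only.
LABELS = ["C", "C++", "Java", "Python"]
samples = [("print(1)", "Python"), ("int main(){}", "C")]

features = [encode_source(code) for code, _ in samples]
targets = [LABELS.index(lang) for _, lang in samples]

print(features[0])  # [113, 115, 106, 111, 117, 41, 50, 42, 0, 0, 0, 0, 0, 0, 0, 0]
print(targets)      # [3, 0]
```

Fixed-length integer sequences of this kind could then be passed to an embedding or convolutional layer of a classifier.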

Notebook 2: A Masked Language Model for Project CodeNet

This experiment investigates whether a popular attention model, normally used to construct a masked language model (MLM) for natural-language sentences, can be applied to source code instead.

Get the notebook | Run the notebook in Colab
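
The core idea of masked language modeling is to hide some input tokens and train the model to predict them. The following is a simplified sketch of that masking step applied to a tokenized line of code. The `[MASK]` placeholder, the masking probability, and the whitespace tokenization are assumptions chosen for illustration; the notebook's actual tokenizer and masking scheme may differ.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string; an assumption for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK_TOKEN.

    Returns (masked_tokens, labels), where labels[i] is the original token
    at each masked position and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Whitespace tokenization of a toy code line, for illustration only.
tokens = "for i in range ( n ) : total += i".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During training, the model would see `masked` as input and be scored only on the positions where `labels` is not `None`.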


  author = {Ruchir Puri and David Kung and Geert Janssen and Wei Zhang and Giacomo Domeniconi and Vladimir Zolotov and Julian Dolby and Jie Chen and Mihir Choudhury and Lindsey Decker and Veronika Thost and Luca Buratti and Saurabh Pujar and Shyam Ramji and Ulrich Finkler and Susan Malaika and Frederick Reiss},
  title = {CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks},
  year = {2021},
  journal = {arXiv preprint arXiv:2105.12655}