IBM Developer Blog

Explore how AI for Code can help you improve your productivity by automating the software engineering process

Software permeates every part of our existence. Google services alone combine for 2 billion lines of code, and a vehicle contains approximately 100 million lines of code. It’s a monumental challenge to create, debug, maintain, and update these complex software systems.

A fast-growing discipline known as AI for Code aims to help software developers improve their productivity by automating the software engineering process. AI for Code researchers have been leveraging technologies like natural language processing and augmenting them with code analysis and compilation techniques to perform a myriad of practical tasks, such as code search, summarization, and completion, as well as code-to-code translation. The discipline isn’t limited to academic research, either. Ruchir Puri, IBM Research’s chief research scientist, discussed in a recent podcast how technologies from AI for Code are being used to modernize legacy software by helping IBM enterprise clients migrate monolithic applications to microservices. To support this kind of work, the IBM AI Research division has released a new data set called Project CodeNet.

What is Project CodeNet?

Project CodeNet is a large-scale data set with approximately 14 million code samples, totaling around 500 million lines of code in 55 different programming languages. Each sample is an intended solution to one of about 4,000 coding problems. CodeNet also provides sample input and output test sets for over 7 million code samples. The data set contains problems, submissions, and metadata obtained by downloading submissions from two online judging web sites: AIZU Online Judge and AtCoder.

The data set is accompanied by a GitHub repository where we provide a set of tools to aggregate code samples based on user criteria and to transform code samples into token sequences, simplified parse trees, and other code graphs. A detailed discussion of Project CodeNet is available in this paper.
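To give a feel for what "transforming a code sample into a token sequence" means, here is a minimal sketch that uses Python's standard `tokenize` module on a Python snippet. It is not the CodeNet tooling from the GitHub repository; the function name `to_token_sequence` and the choice to drop comments and layout tokens are assumptions made for illustration.

```python
import io
import tokenize

# Token types that carry layout or commentary rather than program content.
# Dropping them is an illustrative choice, not something CodeNet prescribes.
SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
    tokenize.COMMENT, tokenize.ENDMARKER,
}


def to_token_sequence(source: str) -> list[str]:
    """Return the lexical tokens of a Python snippet as a flat list of strings."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in SKIPPED]


sample = "n = int(input())\nprint(n * 2)  # double it\n"
print(to_token_sequence(sample))
```

A sequence like this is the kind of input that sequence models consume, while parse trees and code graphs preserve more of the program structure.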

Most importantly, Project CodeNet is intended to drive innovation in deep learning and machine learning models for code classification and code similarity. To expedite AI for Code research using graph neural networks, CodeNet researchers also made available the simplified parse tree (SPT) representation of the code samples in the four benchmark data sets extracted from CodeNet. The researchers describe Project CodeNet as a “very large-scale, diverse, and high-quality data set to accelerate the algorithmic advances in AI for Code.”

How Project CodeNet helps in machine learning tasks

The following examples show where machine learning models derived from CodeNet can help improve programming tasks.

  • Programming language detection and translation. You can take the Project CodeNet data set and build a deep learning model to detect the language of a piece of source code. This notebook showcases how to perform language classification using a Keras model in TensorFlow. In a future release of the data set, we plan to support more use cases, for example, by enriching the data set so that you can create machine learning models that translate source code from one programming language to another. This would save engineers much of the manual effort of porting code, and would help teams move old code to newer programming languages so that it is accessible to new development tools.

  • Code recommendations. Models derived from CodeNet could help with code recommendations. By running clustering methods, you can build recommendation tools that auto-complete anything from a single line of code to a block of code or even a full function.

  • Masked language modeling on source code. With a masked language model (MLM), the goal is to infer the correct token for a masked-out token at an arbitrary position in the source code text. IBM researchers created a notebook that carries out this experiment.
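The masked language model task in the last item can be illustrated by how a single training example is built: one token is hidden, and the hidden token becomes the label the model must predict. The following is a minimal sketch of that data preparation step only, not the IBM notebook; the `[MASK]` symbol and the function name `mask_one_token` are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the actual mask token depends on the model


def mask_one_token(tokens, rng):
    """Hide one randomly chosen token.

    Returns (masked_tokens, position, label), where label is the original
    token that a model would be trained to recover at that position.
    """
    if not tokens:
        raise ValueError("cannot mask an empty token sequence")
    position = rng.randrange(len(tokens))
    masked = list(tokens)          # copy so the input sequence is left intact
    label = masked[position]
    masked[position] = MASK
    return masked, position, label


tokens = ["for", "i", "in", "range", "(", "n", ")", ":"]
masked, position, label = mask_one_token(tokens, random.Random(0))
print(masked, "->", label)
```

Repeating this over many code samples yields (masked sequence, label) pairs for training. Passing an explicit random generator keeps the example reproducible.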

What makes Project CodeNet outstanding

There are two great features of Project CodeNet when comparing it with related data sets.

  • The data set is very large and covers a comprehensive set of programming languages, and its code samples are annotated with a rich set of information, such as code size, memory footprint, CPU run time, and status, which indicates acceptance or the type of error. Over 90% of the problems come with a problem description that contains a concise problem statement and a specification of the input and output formats. When available, sample input and output are also extracted from the problem description and provided as part of the data set. You can run the accepted code samples to extract additional metadata and to verify outputs from generative AI models for correctness.

  • Project CodeNet addresses the quality of the data samples. Many frequently used AI for Code data sets contain duplicate code samples, which could inflate performance metrics by up to 100%. In addition, problem-submission style data sets from online judging systems can contain clusters of identical problems, which also skew the performance metrics. In Project CodeNet, the researchers have identified near-duplicates and identical problem clusters so that you can account for them.
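To make the idea of a near-duplicate concrete, here is a minimal sketch that flags pairs of code samples whose token sets overlap heavily, using Jaccard similarity. This is one common heuristic and not necessarily the method the CodeNet researchers used; the tokenizing regular expression, the 0.9 threshold, and the sample IDs are all assumptions made for illustration.

```python
import re


def token_set(source: str) -> set[str]:
    """Split source text into a set of identifiers, numbers, and single symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two samples' token sets (1.0 means identical sets)."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def near_duplicate_pairs(samples: dict[str, str], threshold: float = 0.9):
    """Return ID pairs whose similarity is at or above the (assumed) threshold."""
    ids = sorted(samples)
    return [
        (x, y)
        for i, x in enumerate(ids)
        for y in ids[i + 1:]
        if jaccard(samples[x], samples[y]) >= threshold
    ]


# Hypothetical samples: s2 differs from s1 only in spacing.
samples = {
    "s1": "a = int(input())\nprint(a + 1)",
    "s2": "a = int( input() )\nprint( a + 1 )",
    "s3": "import sys\nfor line in sys.stdin:\n    print(line.strip())",
}
print(near_duplicate_pairs(samples))  # [('s1', 's2')]
```

The all-pairs comparison is quadratic in the number of samples, so it only suits small collections. It is meant to show what is being detected, not how to do it at the scale of millions of submissions.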

Figure 1. Related data sets comparison

Data set statistics

Let’s look at the data set statistics. The data set comprises 13,916,868 submissions, divided into 4,053 problems. Of the submissions, 53.6% (7,460,588) are accepted, 29.5% are marked as wrong answer, and the remainder are rejected for one of the other possible causes.

Figure 2. Percentage of submissions per status

The data contains submissions in 55 different languages, although 95% of them are coded in the six most common languages: C++, Python, Java, C, Ruby, and C#. C++ is the most common language with 8,008,527 submissions (57.5% of the total), of which 4,353,049 are accepted.
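The shares quoted in this section can be rechecked from the raw counts. The short sketch below does only that arithmetic; the counts come from the text above, and the helper name `pct` is just for illustration.

```python
# Counts quoted in the data set statistics above.
total_submissions = 13_916_868
accepted = 7_460_588
cpp_submissions = 8_008_527
cpp_accepted = 4_353_049


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print(pct(accepted, total_submissions))         # 53.6  accepted share overall
print(pct(cpp_submissions, total_submissions))  # 57.5  C++ share of all submissions
print(pct(cpp_accepted, cpp_submissions))       # 54.4  acceptance rate within C++
```

By this arithmetic, the acceptance rate within C++ (about 54.4%) is close to the overall acceptance rate.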

Figure 3. Percentage of submissions per programming language


The rich metadata and language diversity enable Project CodeNet to help with a variety of use cases. You can use the problem-submission relationship in CodeNet for code search and clone detection. The code samples in Project CodeNet are labeled with their acceptance status, so you can explore AI techniques to distinguish correct code from problematic code. Project CodeNet’s metadata also enables the tracking of how a submission evolves from problematic to accepted, which can be used for exploring automatic code correction. Each code sample is labeled with CPU run time and memory footprint, which can be used for regression studies and prediction. Project CodeNet can also be used for program translation given its large collection of programs written in different languages. One considerable challenge of neural machine translation is that model training depends on large, parallel corpora. Project CodeNet helps here because it covers a rich set of languages with ample training instances.
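As a rough illustration of tracking how a submission evolves from problematic to accepted, the sketch below pairs each rejected submission with the same user's next accepted submission for the same problem. The metadata excerpt is made up, and the column names (`submission_id`, `problem_id`, `user_id`, `date`, `status`) and status strings are assumptions; check them against the actual CodeNet metadata files before relying on them.

```python
import csv
import io

# Hypothetical metadata excerpt; column names and values are assumptions.
METADATA = """submission_id,problem_id,user_id,date,status
s001,p0001,u1,100,Wrong Answer
s002,p0001,u1,160,Accepted
s003,p0001,u2,120,Accepted
s004,p0002,u1,200,Runtime Error
"""


def fix_pairs(rows):
    """Pair rejected submissions with the same user's next accepted one.

    Returns a list of (rejected_id, accepted_id) tuples for the same
    (user, problem) combination, processed in chronological order.
    """
    pending = {}  # (user, problem) -> rejected submission IDs awaiting a fix
    pairs = []
    for row in sorted(rows, key=lambda r: int(r["date"])):
        key = (row["user_id"], row["problem_id"])
        if row["status"] == "Accepted":
            pairs.extend((bad, row["submission_id"]) for bad in pending.pop(key, []))
        else:
            pending.setdefault(key, []).append(row["submission_id"])
    return pairs


rows = list(csv.DictReader(io.StringIO(METADATA)))
print(fix_pairs(rows))  # [('s001', 's002')]
```

Each resulting pair is a candidate "before and after" example for code correction experiments. Submissions that were never followed by an accepted attempt, such as `s004` here, are simply left out.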

In summary, Project CodeNet is a first-of-its-kind, large-scale, diverse, and high-quality data set to accelerate the algorithmic advances in AI for Code. The data set is unique not only in its scale but also in the diversity of coding tasks it can help benchmark: code similarity and classification for advances in code recommendation algorithms, code translation between a large variety of programming languages, and advances in code performance improvement techniques and code quality. The rich annotation of Project CodeNet enables research in code search, code completion, code-to-code translation, and a myriad of other use cases.

We also extracted several language-specific data sets for benchmarking in Python, Java, and C++ to drive innovation in deep learning and machine learning models for code classification and code similarity. To expedite AI for Code research using graph neural networks, we made available the simplified parse tree representation of the code samples in the four benchmark data sets.