Overview
This is the first code vulnerability dataset derived from open-source projects that goes beyond simple functions by including the trace associated with each identified vulnerability. Examining the code commits before and after a potential vulnerability is addressed increases confidence in the dataset's labels. We expect this dataset to continue to evolve through community contributions.
To learn more about contributing tools and new entries to the dataset, visit the D2A GitHub Repo. This dataset is introduced in the paper "D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis", to be presented at ICSE-SEIP 2021.
Dataset Metadata
Field | Value |
---|---|
Format | JSON |
License | CDLA-Sharing |
Domain | Code Vulnerability Identification, Security |
Number of Records | 1,314,276 |
Data Split | 80% train, 10% dev, 10% test |
Size | 5.4GB |
Author | IBM Research |
Dataset Version Update | Feb 2021 – Version 1.0.0 |
Data Coverage | The samples were generated and labelled by running differential static program analysis on more than 11k git version pairs from six mid- to large-sized open-source C/C++ programs: OpenSSL, FFmpeg, httpd, NGINX, libtiff, and libav |
Business Use Case | Code Vulnerability Detection |
Dataset Archive Contents
File or Folder | Description |
---|---|
ffmpeg.tar.gz | samples from FFmpeg |
httpd.tar.gz | samples from HTTPD |
libav.tar.gz | samples from Libav |
libtiff.tar.gz | samples from Libtiff |
nginx.tar.gz | samples from NGINX |
openssl.tar.gz | samples from OpenSSL |
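As a minimal sketch, one of the archives listed above could be unpacked with Python's standard `tarfile` module. The helper name `extract_archive` and the destination directory are hypothetical and not part of the D2A tooling; the GitHub repo remains the reference for the supported workflow.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_archive(archive: str, dest: str) -> List[str]:
    """Unpack one .tar.gz archive into dest and return its member names."""
    # Create the destination directory if it does not exist yet.
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


# Hypothetical usage, assuming nginx.tar.gz is in the current directory:
# print(extract_archive("nginx.tar.gz", "d2a_data"))
```

Only extract archives obtained from the official dataset source, since `extractall` writes whatever paths the archive contains.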
Note: after decompression, there will be three pickle.gz files per project, for example nginx_after_fix_extractor_0.pickle.gz, nginx_labeler_1.pickle.gz and nginx_labeler_0.pickle.gz. These samples were produced and labeled by two different extractors. More details about the extractors and samples can be found in Sec. III-C and Sec. III-D of the D2A paper. Examples of viewing and using the sample files can be found in the GitHub readme.
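The per-project pickle.gz files can be read with Python's standard `gzip` and `pickle` modules. The sketch below assumes each file holds one or more objects pickled back to back; that layout and the helper name `iter_pickled_records` are assumptions for illustration, and the GitHub readme is the authoritative source for the actual record structure.

```python
import gzip
import pickle
from typing import Any, Iterator


def iter_pickled_records(path: str) -> Iterator[Any]:
    """Yield every object stored back to back in a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as fp:
        while True:
            try:
                yield pickle.load(fp)
            except EOFError:
                # End of the compressed stream: no more pickled objects.
                break


# Hypothetical usage, assuming the NGINX archive has been extracted:
# for record in iter_pickled_records("nginx_labeler_1.pickle.gz"):
#     print(type(record))
#     break
```

Streaming records one at a time keeps memory use low for the larger projects. Because unpickling can execute arbitrary code, load only files obtained from the official dataset source.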
File or Folder | Description |
---|---|
splits.tar.gz | The train, dev and test splits of all samples listed above |
Note: this is the global split file. It’s an input to the data preparation script. Please refer to the example in the GitHub repo for details.
Data Glossary and Preview
Click here to explore the data glossary, sample records, and additional dataset metadata.
Citation
@inproceedings{D2A,
author = {Zheng, Yunhui and Pujar, Saurabh and Lewis, Burn and Buratti, Luca and Epstein, Edward and Yang, Bo and Laredo, Jim and Morari, Alessandro and Su, Zhong},
title = {D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis},
year = {2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the ACM/IEEE 43rd International Conference on Software Engineering: Software Engineering in Practice},
series = {ICSE-SEIP '21}
}