
D2A – Differential Analysis Dataset

Overview

D2A is the first code vulnerability dataset derived from open-source projects that goes beyond isolated functions by including the trace associated with each identified vulnerability. By comparing the code commits before and after a potential vulnerability is addressed, we can increase confidence in the dataset's labels. We expect the dataset to keep evolving through community contributions.

To learn more about contributing tools and new entries to the dataset, visit the D2A GitHub Repo. The dataset is introduced in the paper D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis. The dataset and a leaderboard for models using it were presented at ICSE-SEIP 2021. More information on the leaderboard can be found here.

Get this Dataset

Data Description                 Zipped File Name
ffmpeg Sample Dataset, 1.7 GB    ffmpeg.tar.gz
httpd Sample Dataset, 33 MB      httpd.tar.gz
libav Sample Dataset, 588 MB     libav.tar.gz
libtiff Sample Dataset, 60 MB    libtiff.tar.gz
nginx Sample Dataset, 38 MB      nginx.tar.gz
openssl Sample Dataset, 1.3 GB   openssl.tar.gz
splits Sample Dataset, 32 MB     splits.tar.gz
Leaderboard Dataset, 242 MB      d2a_leaderboard_data.tar.gz
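
Once an archive has been downloaded, it can be unpacked with any standard tar tool. The following is a minimal Python sketch using only the standard library. The function name `extract_archive` and the paths in the usage comment are illustrative assumptions, not part of the D2A tooling.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python releases can reject unsafe members
            # (absolute paths, "..", device files) during extraction.
            tar.extractall(path=dest, filter="data")
        else:
            tar.extractall(path=dest)
    return names


# Hypothetical usage, assuming nginx.tar.gz is in the current directory:
#   for name in extract_archive("nginx.tar.gz", "d2a_data"):
#       print(name)
```

Listing the member names before working with the files is a cheap way to check that the download is complete and to see which files a given project archive contains.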

Dataset Metadata

Field                    Value
Format                   JSON
License                  CDLA-Sharing
Domain                   Code Vulnerability Identification, Security
Number of Records        1,314,276
Data Split               80% train, 10% dev, 10% test
Size                     3.7 GB
Author                   IBM Research
Dataset Version Update   April 2021 – Version 1.1.0
Data Coverage            The samples were generated and labelled by running differential static program analysis on more than 11k git version pairs from six mid- to large-sized open-source C/C++ programs: OpenSSL, FFmpeg, httpd, NGINX, libtiff, and libav
Business Use Case        Code Vulnerability Detection

Dataset Archive Contents

File or Folder    Description
ffmpeg.tar.gz     samples from FFmpeg
httpd.tar.gz      samples from HTTPD
libav.tar.gz      samples from Libav
libtiff.tar.gz    samples from Libtiff
nginx.tar.gz      samples from NGINX
openssl.tar.gz    samples from OpenSSL

Note: after decompression, each project contains three pickle.gz files, for example nginx_after_fix_extractor_0.pickle.gz, nginx_labeler_1.pickle.gz, and nginx_labeler_0.pickle.gz. The samples were produced and labeled by two different extractors. More details about the extractors and samples can be found in Sec. III-C and Sec. III-D of the D2A paper. Examples of viewing and using the sample files can be found in the GitHub readme.
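
As a rough illustration of how such a file might be read, the sketch below iterates over the contents of a pickle.gz file with the Python standard library. It assumes each file is a gzip stream holding one or more pickled Python objects written back to back. The exact layout and the fields of each sample are not described on this page, so the GitHub readme remains the authoritative source. The function name `iter_samples` is hypothetical.

```python
import gzip
import pickle


def iter_samples(path):
    """Yield every object pickled into a gzip-compressed file, in order."""
    with gzip.open(path, mode="rb") as fp:
        while True:
            try:
                yield pickle.load(fp)
            except EOFError:
                # The stream holds no further pickled objects.
                return


# Hypothetical usage, assuming the nginx archive was extracted locally:
#   for sample in iter_samples("nginx_labeler_1.pickle.gz"):
#       print(type(sample))
#       break
```

Reading lazily with a generator avoids holding every sample in memory at once, which matters for the larger projects. Unpickling can execute arbitrary code, so only load pickle files obtained from a source you trust.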

File or Folder    Description
splits.tar.gz     The train, dev and test splits of all samples listed above

Note: this is the global split file. It’s an input to the data preparation script. Please refer to the example in the GitHub repo for details.

File or Folder                 Description
d2a_leaderboard_data.tar.gz    The full leaderboard dataset with splits for each of the 4 tasks

Data Glossary and Preview

Click here to explore the data glossary, sample records, and additional dataset metadata.

Citation

  @inproceedings{D2A,
  author = {Zheng, Yunhui and Pujar, Saurabh and Lewis, Burn and Buratti, Luca and Epstein, Edward and Yang, Bo and Laredo, Jim and Morari, Alessandro and Su, Zhong},
  title = {D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis},
  year = {2021},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the ACM/IEEE 43rd International Conference on Software Engineering: Software Engineering in Practice},
  series = {ICSE-SEIP '21}
}