IBM Functional Genomics Platform BLAST Databases

Overview

This dataset provides gene and protein sequences for comparative analysis and search across more than 2,700 bacterial and viral genera. The IBM Functional Genomics Platform processed the source genomes, which comprise de novo assembled genomes (raw data from NCBI SRA) and reference genomes (from NCBI RefSeq and GenBank). The resulting sequences were then processed with NCBI BLAST tools to create both nucleotide and protein BLAST databases. The databases contain over 76M gene and 58M protein sequences for bacteria and 662K gene and 521K protein sequences for viruses, and can be searched using the blastn and blastp command line tools from NCBI. They were built with NCBI BLAST version 2.9.0+.

Note: The data included in this dataset reflects a best effort based on references and tools available today in the public domain. IBM does not represent or guarantee the accuracy of data provided or of the original sources and tools. IBM does not represent or guarantee that conclusions drawn from these tools and this data are free from defects including false positive or false negative classifications.

Get this Dataset

Data | Description | Zipped File Name
Genes and proteins for bacteria | Dataset, 35 GB | ibm_fgp_blast_bacteria_gene_and_proteins.tar.gz
Genes and proteins for virus | Dataset, 256 MB | ibm_fgp_blast_virus_gene_and_proteins.tar.gz

Dataset Metadata

Field Value
Format BLAST index
License CC-BY 4.0
Domain Computational Biology and Bioinformatics
Number of Records 76,233,193 bacterial genes, 58,669,101 bacterial proteins, 662,729 viral genes, 521,851 viral proteins
Data Split 76,233,193 bacterial genes, 58,669,101 bacterial proteins, 662,729 viral genes, 521,851 viral proteins
Size Bacteria – 35 GB (compressed), 113 GB (uncompressed); Virus – 256 MB (compressed), 1.2 GB (uncompressed)
Author Ed Seabolt, Kristen L. Beck, Gowri Nayar, Akshay Agarwal, Harsha Krishnareddy, Hakan Bulu, Thuan Doan, James Kaufman, Vandana Mukherjee
Dataset Origin NCBI and IBM Functional Genomics Platform
Dataset Version Update 1.0.0 – December 23rd, 2020

Dataset Archive Contents

File or Folder | Description
ibm_fgp_blast_bacteria_gene_and_proteins.tar.gz | Contains the genes and proteins for bacteria
ibm_fgp_blast_virus_gene_and_proteins.tar.gz | Contains the genes and proteins for virus
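
Both archives are gzip-compressed tarballs, so they must be unpacked before blastn or blastp can read the databases. A minimal sketch using Python's standard tarfile module follows. The function name `extract_archive` and the destination directory are illustrative assumptions, and the layout of the files inside the archives is not documented here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()  # list the contents before unpacking
        archive.extractall(dest)
    return names
```

For example, `extract_archive("ibm_fgp_blast_virus_gene_and_proteins.tar.gz", "blast_db")` would unpack the virus archive into a `blast_db` directory. The uncompressed sizes listed under Dataset Metadata (113 GB for bacteria, 1.2 GB for virus) indicate how much free disk space to plan for.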

Example Records

Example nucleotide search:

query.txt
ATGGCGATAACATTGACCGAAGCCGCGGCAAACCAAATCCGCAAACAACTTGCCAAACGAGGCAAGGGGCTGGCGTTACGAATCGGTGTGAAGAAGGTGGGGTGCTCAGGGTTCGCCTACACCTTCGATTATGCTGACGAGGTCCGTCAGGGTGATGAGATTTTTGCGTTTCATGATGCCAGTCTAGTGGTTGATGCCGATAGTCTGCCGTTTCTTGATGGCTCGCGTGTCGACTATATACGGGAAGGTCTGAACGATTCATTCCGACTTCATAATCCCAACGTTGGCGATACGTGTGGTTGTGGTGAAAGCTTCAGCTTGAAGGAGCCAGCAAAGGTTTAG


execution:
blastn -db all_gene_table -num_alignments 5 -num_threads 8 -query query.txt -outfmt 10


result:
Query_1,seq684|00003251fe1841590fb4940a5d0b2491|Iron-binding,100.000,342,0,0,1,342,1,342,8.29e-178,632
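
When searches need to be scripted, the command line above can be assembled and launched from Python. The sketch below mirrors the flags used in this example. The helper names `build_blast_command` and `run_blast` are assumptions for illustration, and `run_blast` assumes the BLAST+ executables are on the PATH and the database files are reachable from the working directory.

```python
import subprocess


def build_blast_command(program, db, query_path, num_alignments=5, num_threads=8):
    """Return the argument list for a blastn or blastp search with CSV output."""
    if program not in ("blastn", "blastp"):
        raise ValueError("program must be 'blastn' or 'blastp'")
    return [
        program,
        "-db", db,
        "-num_alignments", str(num_alignments),
        "-num_threads", str(num_threads),
        "-query", str(query_path),
        "-outfmt", "10",  # comma-separated tabular output
    ]


def run_blast(program, db, query_path, **options):
    """Run the search and return its CSV output as text (needs BLAST+ installed)."""
    command = build_blast_command(program, db, query_path, **options)
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


# Building the command does not require BLAST+ to be installed.
print(" ".join(build_blast_command("blastn", "all_gene_table", "query.txt")))
```

The same builder covers the protein search by passing "blastp" and "all_protein_table".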

Example protein search:

query.txt
MKEGPDIAQIGSLIGDPARANMLTALMSGKALTATELAGTAGITLQTASSHLSKLEAGSLISQRKQGRHRYFALADDEVGLLLESLMGFAADRGFTRHRTGPKDPALRKARVCYNHLAGDYGVRLLDSLVAEEVIAGSGDTLALTGAGREKMAALGIDLSALTKSRRPVCRTCLDWSERRSHLAGSLGQALLGLFLDRGWAVREPGSRAVRFTGNGEKEFARLFPLPG


execution:
blastp -db all_protein_table -num_alignments 5 -num_threads 8 -query query.txt -outfmt 10


result:
Query_1,seq42570230|000437351cf4f2dd075e1a7eafd95488|Cadmium,100.000,228,0,0,1,228,1,228,8.80e-163,456
Query_1,seq25140563|ef96ce0a70a1ea2cd2924b9c1e7d3c91|Cadmium,90.308,227,22,0,1,227,1,227,3.06e-145,412
Query_1,seq24693335|9b56709f54a452d1bb05be9e9e4391b7|hypothetical,85.903,227,32,0,1,227,35,261,4.31e-140,401
Query_1,seq235012|07bc0628a30cef287c7b9c320d7983fd|hypothetical,86.784,227,30,0,1,227,1,227,5.46e-140,399
Query_1,seq35397595|66996c0ef9f0b7dc752d29edbc616f1f|hypothetical,86.344,227,31,0,1,227,1,227,8.75e-140,399
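
Both example searches use -outfmt 10, which writes one comma-separated row per hit. With no custom field list, BLAST+ emits twelve columns: query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, and bit score. The sketch below shows one way to turn such rows into typed records. The function name `parse_hits` is an assumption for illustration, and the subject ID is kept as a single string because the meaning of its pipe-separated parts is not documented here.

```python
import csv
import io

# Default column order for BLAST+ tabular output when no custom fields are given.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_hits(csv_text):
    """Parse -outfmt 10 text into a list of dicts with numeric fields converted."""
    hits = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# The nucleotide result row from the example on this page.
example = ("Query_1,seq684|00003251fe1841590fb4940a5d0b2491|Iron-binding,"
           "100.000,342,0,0,1,342,1,342,8.29e-178,632\n")
for hit in parse_hits(example):
    print(hit["sseqid"], hit["pident"], hit["length"], hit["evalue"])
```

Once parsed, hits can be filtered or sorted on numeric fields, for instance by keeping only rows whose `pident` exceeds a chosen threshold.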

Citation

@ARTICLE{9184986,
  author={E. {Seabolt} and G. {Nayar} and H. {Krishnareddy} and A. {Agarwal} and K. L. {Beck} and E. {Kandogan} and M. {Kunitomi} and M. {Roth} and I. {Terrizzano} and J. {Kaufman} and V. {Mukherjee}},
  journal={IEEE/ACM Transactions on Computational Biology and Bioinformatics}, 
  title={IBM Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale}, 
  year={2020},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/TCBB.2020.3021231}}