CC-BY 4.0 | BLAST index

IBM Functional Genomics Platform BLAST Databases

NCBI BLAST formatted databases for bacterial and viral gene and protein sequences derived using analytics from the IBM Functional Genomics Platform.

By

IBM Research

Overview

This dataset provides gene and protein sequences for comparative analysis and search across more than 2,700 bacterial and viral genera. The source genomes, comprising de novo assembled genomes (from raw data in NCBI SRA) and reference genomes (from NCBI RefSeq and GenBank), were processed by the IBM Functional Genomics Platform and then formatted with NCBI BLAST tools into both nucleotide and protein BLAST databases. The resulting databases contain over 76M gene and 58M protein sequences for bacteria and 662K gene and 521K protein sequences for viruses, and can be searched with the blastn and blastp command-line tools from NCBI. The databases were built with NCBI BLAST version 2.9.0+.

Note: The data included in this dataset reflects a best effort based on references and tools available today in the public domain. IBM does not represent or guarantee the accuracy of data provided or of the original sources and tools. IBM does not represent or guarantee that conclusions drawn from these tools and this data are free from defects including false positive or false negative classifications.

Get this Dataset

Data Description | Zipped File Name
Genes and proteins for bacteria Dataset, 35 GB | ibm_fgp_blast_bacteria_gene_and_proteins.tar.gz
Genes and proteins for virus Dataset, 256 MB | ibm_fgp_blast_virus_gene_and_proteins.tar.gz
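
The archives are gzip-compressed tarballs and expand considerably (this page lists 113 GB uncompressed for bacteria and 1.2 GB for virus), so it can help to check free disk space before unpacking. The following is a minimal Python sketch under that assumption; the function name extract_archive and the local paths in the usage comment are hypothetical and not part of the dataset.

```python
import shutil
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir, required_free_bytes=0):
    """Extract a .tar.gz archive into dest_dir after a free-space check.

    required_free_bytes is supplied by the caller (for example, the
    uncompressed size listed in the dataset metadata).
    Returns the member names stored in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    free = shutil.disk_usage(dest).free
    if free < required_free_bytes:
        raise OSError(
            f"need {required_free_bytes} bytes free in {dest}, found {free}"
        )

    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names


# Hypothetical usage for the virus archive (1.2 GB uncompressed):
# extract_archive("ibm_fgp_blast_virus_gene_and_proteins.tar.gz",
#                 "blast_db/virus", required_free_bytes=int(1.2 * 10**9))
```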

Dataset Metadata

Field | Value
Format | BLAST index
License | CC-BY 4.0
Domain | Computational Biology and Bioinformatics
Number of Records | 76,233,193 bacterial genes, 58,669,101 bacterial proteins, 662,729 viral genes, 521,851 viral proteins
Data Split | 76,233,193 bacterial genes, 58,669,101 bacterial proteins, 662,729 viral genes, 521,851 viral proteins
Size | Bacteria - 35 GB (compressed), 113 GB (uncompressed); Virus - 256 MB (compressed), 1.2 GB (uncompressed)
Author | Ed Seabolt, Kristen L. Beck, Gowri Nayar, Akshay Agarwal, Harsha Krishnareddy, Hakan Bulu, Thuan Doan, James Kaufman, Vandana Mukherjee
Dataset Origin | NCBI and IBM Functional Genomics Platform
Dataset Version Update | 1.0.0 - December 23rd, 2020

Dataset Archive Contents

File or Folder | Description
ibm_fgp_blast_bacteria_gene_and_proteins.tar.gz | Contains the genes and proteins for bacteria
ibm_fgp_blast_virus_gene_and_proteins.tar.gz | Contains the genes and proteins for virus

Example Records

Example nucleotide search:

query.txt
ATGGCGATAACATTGACCGAAGCCGCGGCAAACCAAATCCGCAAACAACTTGCCAAACGAGGCAAGGGGCTGGCGTTACGAATCGGTGTGAAGAAGGTGGGGTGCTCAGGGTTCGCCTACACCTTCGATTATGCTGACGAGGTCCGTCAGGGTGATGAGATTTTTGCGTTTCATGATGCCAGTCTAGTGGTTGATGCCGATAGTCTGCCGTTTCTTGATGGCTCGCGTGTCGACTATATACGGGAAGGTCTGAACGATTCATTCCGACTTCATAATCCCAACGTTGGCGATACGTGTGGTTGTGGTGAAAGCTTCAGCTTGAAGGAGCCAGCAAAGGTTTAG


execution:
blastn -db all_gene_table -num_alignments 5 -num_threads 8 -query query.txt -outfmt 10


result:
Query_1,seq684|00003251fe1841590fb4940a5d0b2491|Iron-binding,100.000,342,0,0,1,342,1,342,8.29e-178,632
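
With -outfmt 10 and no explicit column list, BLAST+ writes its twelve default tabular columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) as comma-separated values. A minimal Python sketch for reading such a result line into named fields follows; the helper name parse_hit is illustrative and not part of BLAST or of this dataset.

```python
import csv

# Default column order for BLAST+ -outfmt 10 (same columns as -outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_hit(line):
    """Parse one -outfmt 10 line into a dict keyed by column name."""
    fields = next(csv.reader([line.strip()]))
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    hit = dict(zip(COLUMNS, fields))
    for name in INT_COLUMNS:
        hit[name] = int(hit[name])
    for name in FLOAT_COLUMNS:
        hit[name] = float(hit[name])
    return hit


if __name__ == "__main__":
    example = (
        "Query_1,seq684|00003251fe1841590fb4940a5d0b2491|Iron-binding,"
        "100.000,342,0,0,1,342,1,342,8.29e-178,632"
    )
    hit = parse_hit(example)
    print(hit["sseqid"], hit["pident"], hit["length"], hit["evalue"])
```

For the nucleotide example above, the parsed fields show a 342-base alignment at 100% identity with no mismatches or gap openings.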

Example protein search:

query.txt
MKEGPDIAQIGSLIGDPARANMLTALMSGKALTATELAGTAGITLQTASSHLSKLEAGSLISQRKQGRHRYFALADDEVGLLLESLMGFAADRGFTRHRTGPKDPALRKARVCYNHLAGDYGVRLLDSLVAEEVIAGSGDTLALTGAGREKMAALGIDLSALTKSRRPVCRTCLDWSERRSHLAGSLGQALLGLFLDRGWAVREPGSRAVRFTGNGEKEFARLFPLPG


execution:
blastp -db all_protein_table -num_alignments 5 -num_threads 8 -query query.txt -outfmt 10


result:
Query_1,seq42570230|000437351cf4f2dd075e1a7eafd95488|Cadmium,100.000,228,0,0,1,228,1,228,8.80e-163,456
Query_1,seq25140563|ef96ce0a70a1ea2cd2924b9c1e7d3c91|Cadmium,90.308,227,22,0,1,227,1,227,3.06e-145,412
Query_1,seq24693335|9b56709f54a452d1bb05be9e9e4391b7|hypothetical,85.903,227,32,0,1,227,35,261,4.31e-140,401
Query_1,seq235012|07bc0628a30cef287c7b9c320d7983fd|hypothetical,86.784,227,30,0,1,227,1,227,5.46e-140,399
Query_1,seq35397595|66996c0ef9f0b7dc752d29edbc616f1f|hypothetical,86.344,227,31,0,1,227,1,227,8.75e-140,399
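
A search that returns several hits is often narrowed by percent identity (the third column of the output). The Python sketch below keeps hits at or above a threshold and splits each subject identifier on "|". Reading the three parts as a sequence ID, a hash-like key, and the first word of the annotated name is an interpretation of the example rows, not something this page states; the function names are likewise illustrative.

```python
def split_subject_id(sseqid):
    """Split 'seqNNN|key|label' into its three parts (assumed layout)."""
    parts = sseqid.split("|", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected subject id: {sseqid!r}")
    return tuple(parts)


def filter_by_identity(csv_text, min_identity):
    """Return (seq_id, label, pident) for hits with pident >= min_identity.

    csv_text holds -outfmt 10 rows, one hit per line. The simple comma
    split assumes subject IDs contain no commas, as in the examples.
    """
    kept = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(",")
        pident = float(fields[2])
        if pident >= min_identity:
            seq_id, _key, label = split_subject_id(fields[1])
            kept.append((seq_id, label, pident))
    return kept


if __name__ == "__main__":
    rows = """\
Query_1,seq42570230|000437351cf4f2dd075e1a7eafd95488|Cadmium,100.000,228,0,0,1,228,1,228,8.80e-163,456
Query_1,seq25140563|ef96ce0a70a1ea2cd2924b9c1e7d3c91|Cadmium,90.308,227,22,0,1,227,1,227,3.06e-145,412
Query_1,seq24693335|9b56709f54a452d1bb05be9e9e4391b7|hypothetical,85.903,227,32,0,1,227,35,261,4.31e-140,401
"""
    for seq_id, label, pident in filter_by_identity(rows, 90.0):
        print(seq_id, label, pident)
```

With a 90% threshold, only the two Cadmium hits from the protein example pass; the three hypothetical hits (85.9% to 86.8% identity) are dropped.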

Citation

@ARTICLE{9184986,
  author={E. {Seabolt} and G. {Nayar} and H. {Krishnareddy} and A. {Agarwal} and K. L. {Beck} and E. {Kandogan} and M. {Kunitomi} and M. {Roth} and I. {Terrizzano} and J. {Kaufman} and V. {Mukherjee}},
  journal={IEEE/ACM Transactions on Computational Biology and Bioinformatics}, 
  title={IBM Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale}, 
  year={2020},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/TCBB.2020.3021231}}