PRROMenade Microbiome Annotation Index

Overview

This dataset provides a PRROMenade search index for hierarchical functional annotation of nucleotide sequences against bacterial and viral protein domains (amino acid sequences). The underlying microbial sequences were processed by the IBM Functional Genomics Platform and annotated with KEGG enzyme codes. The annotated domains and their enzyme code hierarchies were then used to build the PRROMenade index. The index is described and used in the Scientific Reports 2021 publication listed under Citation.
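To illustrate what an enzyme code hierarchy looks like, here is a minimal sketch that expands a four-level EC number into its progressively broader classes. The function name `ec_ancestors` is hypothetical and is not part of PRROMenade; how the index stores the hierarchy internally is not described here.

```python
from typing import List


def ec_ancestors(ec_code: str) -> List[str]:
    """Return the EC code followed by its progressively broader classes."""
    parts = ec_code.split(".")
    # Drop trailing unspecified levels written as "-", e.g. "1.1.-.-"
    while parts and parts[-1] == "-":
        parts.pop()
    # Shorten the code one level at a time: 2.7.11.1 -> 2.7.11 -> 2.7 -> 2
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


print(ec_ancestors("2.7.11.1"))  # ['2.7.11.1', '2.7.11', '2.7', '2']
```

A read assigned to a broader class such as 2.7 rather than to a full four-level code can be understood as an annotation at an internal node of this hierarchy.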

Please see the PRROMenade publication (iScience 2020) for more details on the method. The PRROMenade executable is also available on GitHub. This index has been optimized for short reads and therefore allows matches of at most 100 amino acids (300 nucleotides).
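The 100 amino acid limit follows from three nucleotides encoding one amino acid. The sketch below is plain arithmetic under that assumption; the names `CODON_LENGTH`, `MAX_MATCH_AA`, and `max_match_for_read` are hypothetical, and the code does not reproduce how PRROMenade translates or matches reads.

```python
CODON_LENGTH = 3    # nucleotides per amino acid
MAX_MATCH_AA = 100  # maximum match length stated for this index


def max_match_for_read(read_length_nt: int) -> int:
    """Longest amino acid match a read of this length could yield, capped by the index limit."""
    return min(read_length_nt // CODON_LENGTH, MAX_MATCH_AA)


print(max_match_for_read(150))   # 50
print(max_match_for_read(1000))  # 100
```

For typical short reads (for example 150 nucleotides) the cap is never reached; it only matters for reads longer than 300 nucleotides.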

Dataset Metadata

Field                    Value
Format                   PRROMenade index
License                  CDLA-Sharing
Domain                   Microbial genomics, Sequence annotation
Number of Records        21,199,760 bacterial protein domains, 52,607 viral protein domains
Data Split               21,199,760 bacterial protein domains, 52,607 viral protein domains
Size                     50GB (compressed), 98GB (uncompressed)
Author                   Filippo Utro, Niina Haiminen, Ed Seabolt, James Kaufman, Laxmi Parida
Dataset Origin           NCBI and IBM Functional Genomics Platform
Dataset Version Update   1.0.0 – Feb 16th, 2020

Dataset Archive Contents

File or Folder         Description
bactvirus2020.*        These files define the sequence search index and associated taxonomy and are required for classifying query sequences
name_taxid.dmp         Associates each DB sequence with a taxonomy id
nodes_prromenade.dmp   Defines the child, parent links in the taxonomy tree
taxid_name.txt         Associates each taxonomy id with a KEGG enzyme code
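The child, parent links can be used to walk from any taxonomy id up to the root of the tree. The sketch below ASSUMES that nodes_prromenade.dmp follows the NCBI taxonomy dump layout (fields separated by `|`, with the child id first and the parent id second); this layout is not confirmed by the dataset description, so check the file before relying on it. The helper names `parse_nodes` and `lineage` and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_nodes(lines: Iterable[str]) -> Dict[str, str]:
    """Map each child id to its parent id (ASSUMED 'child | parent | ...' rows)."""
    parent_of: Dict[str, str] = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) >= 2 and fields[0]:
            parent_of[fields[0]] = fields[1]
    return parent_of


def lineage(node: str, parent_of: Dict[str, str]) -> List[str]:
    """Walk from a node up to the root, stopping at a self-parent, unknown id, or cycle."""
    path = [node]
    while node in parent_of and parent_of[node] != node and parent_of[node] not in path:
        node = parent_of[node]
        path.append(node)
    return path


# Made-up rows in the assumed layout, not taken from the dataset
sample = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tclass\t|\n",
    "3\t|\t2\t|\tsubclass\t|\n",
]
tree = parse_nodes(sample)
print(lineage("3", tree))  # ['3', '2', '1']
```

If the assumption about the layout holds, the real file could be loaded by passing an open file handle for nodes_prromenade.dmp to `parse_nodes`, and each id in the resulting lineage could then be looked up in taxid_name.txt to obtain its KEGG enzyme code.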

Citation

@article{UTRO2020100988,
title = "Hierarchically Labeled Database Indexing Allows Scalable Characterization of Microbiomes",
journal = "iScience",
volume = "23",
number = "4",
pages = "100988",
year = "2020",
issn = "2589-0042",
doi = "https://doi.org/10.1016/j.isci.2020.100988",
url = "http://www.sciencedirect.com/science/article/pii/S2589004220301723",
author = "Filippo Utro and Niina Haiminen and Enrico Siragusa and Laura-Jayne Gardiner and Ed Seabolt and Ritesh Krishna and James H. Kaufman and Laxmi Parida"}

@article{Haiminen2021,
author = {Haiminen, Niina and Utro, Filippo and Seabolt, Ed and Parida, Laxmi},
title = {{Functional profiling of COVID-19 respiratory tract microbiomes}},
year = {2021},
volume = {11},
number = {6433},
doi = {https://doi.org/10.1038/s41598-021-85750-0},
URL = {https://www.nature.com/articles/s41598-021-85750-0},
journal = {Scientific Reports}}

@article{9184986,
author = {E. {Seabolt} and G. {Nayar} and H. {Krishnareddy} and A. {Agarwal} and K. L. {Beck} and E. {Kandogan} and M. {Kuntomi} and M. {Roth} and I. {Terrizzano} and J. {Kaufman} and V. {Mukherjee}},
journal = {IEEE/ACM Transactions on Computational Biology and Bioinformatics},
title = {IBM Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale},
year = {2020},
volume = {},
number = {},
pages = {1-1},
url = {https://doi.ieeecomputersociety.org/10.1109/TCBB.2020.3021231},
doi = {10.1109/TCBB.2020.3021231}}