Genomic Data Retrieval

Perform metagenomic data retrieval and functional annotation retrieval. In detail, this package aims to provide users with a standardized way to automate genome, proteome, coding sequence ('CDS'), 'GFF', and metagenome retrieval from 'NCBI' and 'ENSEMBL' databases. Furthermore, an interface to the 'BioMart' database (Smedley et al. (2009)) allows users to retrieve functional annotation for genomic loci. Users can download entire databases such as 'NCBI RefSeq' (Pruitt et al. (2007)), 'NCBI nr', 'NCBI nt' and 'NCBI Genbank' (Benson et al. (2013)) as well as 'ENSEMBL' and 'ENSEMBLGENOMES' with only one command.


The rapidly growing number of sequenced genomes allows us to perform a new type of biological research. Studied with a comparative approach, these genomes provide new insights into how biological information is encoded on the molecular level and how this information changes over evolutionary time.

The first step of any genome-based study, however, is to retrieve genomes from databases. To automate this retrieval process on a meta-genomic scale, the biomartr package provides interface functions for genomic sequence retrieval and functional annotation retrieval. The major aim of biomartr is to facilitate reproducibility and large-scale handling of genomic data for (meta-)genomic analyses.

In detail, biomartr aims to provide users with an easy-to-use framework to obtain genome, proteome, CDS, GFF, and metagenome project data. Furthermore, an interface to the BioMart database allows users to retrieve functional annotation for genomic loci. Users can download entire databases such as NCBI RefSeq, NCBI nr, NCBI nt, and NCBI Genbank, as well as ENSEMBL and ENSEMBLGENOMES, with only one command.

Hence, the biomartr package is designed with reproducible research as its primary goal.

Getting Started with biomartr:

Before users can download and install biomartr they need to install the following packages from Bioconductor:

# install Bioconductor base packages
# (biocLite() was the Bioconductor installer at the time of this release;
# newer R versions use BiocManager::install() instead)
source("https://bioconductor.org/biocLite.R")
biocLite()
 
# install the biomaRt package
biocLite("biomaRt")
 
# install the Biostrings package
biocLite("Biostrings")

During the installation of Biostrings and biomaRt, users might be asked whether they would like to update all package dependencies of these packages. Please type "a" to specify that all package dependencies shall be updated. This is important for biomartr to function properly.

Now users can install biomartr from CRAN:

# install the current release of biomartr from CRAN
install.packages("biomartr")

The current status of the package as well as a detailed history of the functionality of each version of biomartr can be found in the NEWS section.

  • meta.retrieval() : Perform Meta-Genome Retrieval from NCBI of species belonging to the same kingdom of life
  • meta.retrieval.all() : Perform Meta-Genome Retrieval from NCBI of the entire kingdom of life
  • getMetaGenomes() : Retrieve metagenomes from NCBI Genbank
  • getMetaGenomeAnnotations() : Retrieve annotation *.gff files for metagenomes from NCBI Genbank
  • listMetaGenomes() : List available metagenomes on NCBI Genbank
  • getMetaGenomeSummary() : Helper function to retrieve the assembly_summary.txt file from NCBI genbank metagenomes
  • listGenomes() : List all genomes available on NCBI and ENSEMBL servers
  • listKingdoms() : List the number of available species per kingdom of life on NCBI and ENSEMBL servers
  • listGroups() : List the number of available species per group on NCBI and ENSEMBL servers
  • listSubgroups() : List the number of available species per subgroup on NCBI and ENSEMBL servers
  • getKingdoms() : Retrieve available kingdoms of life
  • getGroups() : Retrieve available groups for a kingdom of life
  • getSubgroups() : Retrieve available subgroups for a kingdom of life
  • is.genome.available() : Check genome availability on NCBI and ENSEMBL servers
  • getGenome() : Download a specific genome stored on NCBI and ENSEMBL servers
  • getProteome() : Download a specific proteome stored on NCBI and ENSEMBL servers
  • getCDS() : Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers
  • getGFF() : Genome Annotation Retrieval (*.gff) from NCBI and ENSEMBL servers
  • getKingdomAssemblySummary() : Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms
  • getSummaryFile() : Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom
  • read_genome() : Import genomes as Biostrings or data.table object
  • read_proteome() : Import proteome as Biostrings or data.table object
  • read_cds() : Import CDS as Biostrings or data.table object
  • read_gff() : Import GFF file
  • listDatabases() : Retrieve a list of available NCBI databases
  • download.database() : Download an NCBI database to your local hard drive
  • download.database.all() : Download a complete NCBI database such as NCBI nr to your local hard drive
  • biomart() : Main function to query the BioMart database
  • getMarts() : Retrieve All Available BioMart Databases
  • getDatasets() : Retrieve All Available Datasets for a BioMart Database
  • getAttributes() : Retrieve All Available Attributes for a Specific Dataset
  • getFilters() : Retrieve All Available Filters for a Specific Dataset
  • organismBM() : Function for organism specific retrieval of available BioMart marts and datasets
  • organismAttributes() : Function for organism specific retrieval of available BioMart attributes
  • organismFilters() : Function for organism specific retrieval of available BioMart filters
  • getGO() : Function to retrieve GO terms for a given set of genes
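
As a rough illustration of how the sequence retrieval functions listed above fit together, the following sketch checks whether a genome is available, downloads it, and imports it. The wrapper fetch_genome() and the example organism are hypothetical and not part of biomartr. The arguments db, organism, and obj.type follow the News section of this page; the assumption that is.genome.available() returns a single TRUE or FALSE, and the exact signatures, may not hold in every biomartr version.

```r
# Minimal sketch: check availability, download, and import a genome.
# fetch_genome() is a hypothetical wrapper, not a biomartr function.
fetch_genome <- function(organism, db = "refseq") {
  # basic input validation before any network access
  stopifnot(is.character(organism), length(organism) == 1)

  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install biomartr first: install.packages('biomartr')")
  }

  # is.genome.available() reports whether the organism is stored in 'db'
  # (assumed here to return a single logical value)
  available <- biomartr::is.genome.available(organism = organism, db = db)
  if (!isTRUE(available)) {
    stop("No genome found for '", organism, "' in database '", db, "'")
  }

  # getGenome() downloads the genome and returns the path to the file
  genome_path <- biomartr::getGenome(db = db, organism = organism)

  # read_genome() imports the downloaded file, here as a Biostrings object
  biomartr::read_genome(file = genome_path, obj.type = "Biostrings")
}

# Example call (requires an internet connection):
# genome <- fetch_genome("Arabidopsis thaliana")
```

The same pattern should apply to getProteome() with read_proteome() and to getCDS() with read_cds(), since all get* functions return the path of the downloaded file.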

The developer version of biomartr might include more functionality than the stable version on CRAN.

You can use the devtools package to install the developer version of biomartr from GitHub.

# install devtools first if it is not yet available
# install.packages("devtools")
 
# install the current development version of biomartr on your system
library(devtools)
install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
 
# On Windows, the call above may not work out of the box - see ?build_github_devtools
# Windows users first need to install Rtools, which is a standalone
# toolchain downloaded from CRAN rather than an R package.
 
# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:
 
devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
 
# and then load it from the library (adjust lib.loc to your R installation)
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")
  • Install biomartr on a Win 8 laptop: solution (thanks to Andres Romanowski)

I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.

Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:

twitter: HajkDrost or email

For bug reports: please open an issue.

News

biomartr 0.2.1

In this version of biomartr the organism*() functions were adapted to the new ENSEMBL 87 release in which organism name specification in the Biomart description column was changed from a scientific name convention to a mix of common name and scientific name convention.

  • all organism*() functions have been adapted to the new ENSEMBL 87 release organism name notation that is used in the Biomart description

  • fixing error handling bug that caused commands such as download.database(db = "nr.27.tar.gz") to not execute properly
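
The database calls mentioned in these notes can be combined as in the sketch below. The wrapper fetch_ncbi_database() is a hypothetical helper, not part of biomartr. It assumes that listDatabases("all") returns a character vector of database file names, which may not hold in every version. The explicit availability check is only there for illustration, since download.database() performs its own check as of version 0.2.0.

```r
# Minimal sketch: list the available NCBI databases and download one file.
# fetch_ncbi_database() is a hypothetical wrapper, not a biomartr function.
fetch_ncbi_database <- function(db_file) {
  # basic input validation before any network access
  stopifnot(is.character(db_file), length(db_file) == 1)

  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install biomartr first: install.packages('biomartr')")
  }

  # assumed to return a character vector of available database files
  available <- biomartr::listDatabases("all")
  if (!db_file %in% available) {
    stop("Database file '", db_file, "' is not listed on the NCBI server")
  }

  # download the selected database file to the local hard drive
  biomartr::download.database(db = db_file)
}

# Example call (requires an internet connection and enough disk space):
# fetch_ncbi_database("nr.27.tar.gz")
```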

biomartr 0.2.0

In this version, biomartr was extended to retrieve genome, proteome, CDS, GFF and meta-genome data from ENSEMBL and ENSEMBLGENOMES as well. Furthermore, all NCBI retrieval functions were updated to the new server folder structure standards of NCBI.

  • new meta-retrieval function meta.retrieval.all() allows users to download all individual genomes of all kingdoms of life with one command

  • new metagenome retrieval function getMetaGenomes() allows users to retrieve metagenome projects from NCBI Genbank

  • new metagenome retrieval function getMetaGenomeAnnotations() allows users to retrieve annotation files for genomes belonging to a metagenome project stored at NCBI Genbank

  • new retrieval function getGFF() allows users to retrieve annotation (*.gff) files for specific genomes from NCBI and ENSEMBL databases

  • new import function read_gff() allowing users to import GFF files downloaded with getGFF()

  • new internal functions to check for availability of ENSEMBL or ENSEMBLGENOMES databases

  • new database retrieval function download.database.all() allows users to download entire NCBI databases with one command

  • new function listMetaGenomes() allowing users to list available metagenomes on NCBI Genbank

  • new external helper function getSummaryFile() to retrieve the assembly_summary.txt file from NCBI

  • new external helper function getKingdomAssemblySummary() to retrieve the assembly_summary.txt files from NCBI for all kingdoms and combine them into one big data.frame

  • new function listKingdoms() allows users to list the number of available species per kingdom of life

  • new function listGroups() allows users to list the number of available species per group

  • new function listSubgroups() allows users to list the number of available species per subgroup

  • new function getGroups() allows users to retrieve available groups for a kingdom of life

  • new function getSubgroups() allows users to retrieve available subgroups for a kingdom of life

  • new external helper function getMetaGenomeSummary() to retrieve the assembly_summary.txt files from NCBI genbank metagenomes

  • new internal helper function getENSEMBL.Seq() acting as main interface function to communicate with the ENSEMBL database API for sequence retrieval

  • new internal helper function getENSEMBLGENOMES.Seq() acting as main interface function to communicate with the ENSEMBL database API for sequence retrieval

  • new internal helper function getENSEMBL.Annotation() acting as main interface function to communicate with the ENSEMBL database API for GFF retrieval

  • new internal helper function getENSEMBLGENOMES.Annotation() acting as main interface function to communicate with the ENSEMBL database API for GFF retrieval

  • new internal helper function get.ensemblgenome.info() to retrieve general organism information from ENSEMBLGENOMES

  • new internal helper function get.ensembl.info() to retrieve general organism information from ENSEMBL

  • new internal helper function getGENOMEREPORT() to retrieve the genome reports file from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/overview.txt

  • new internal helper function connected.to.internet() enabling internet connection check

  • the functions getGenome(), getProteome(), and getCDS() can now retrieve genomes, proteomes, or CDS from ENSEMBL and ENSEMBLGENOMES in addition to NCBI

  • the functions getGenome(), getProteome(), and getCDS() were completely re-written and now use the assembly_summary.txt files provided by NCBI to retrieve the download path to the corresponding genome. These functions no longer have a kingdom argument: users only need to specify the organism name. Furthermore, all get* functions now return the path to the downloaded genome so that this path can be used as input to all read_* functions.

  • download_databases() has been renamed to download.databases() to be more consistent with other function notation

  • the argument db_format was removed from listDatabases() and download.database() because it was misleading

  • the command listDatabases("all") now returns all available NCBI databases that can be retrieved with download.database()

  • download.database() now internally checks if input database specified by the user is actually available on NCBI servers

  • the documentary file generated by getGenome(), getProteome(), and getCDS() is now extended to store more details about the downloaded genome

  • argument database in is.genome.available() and listGenomes() has been renamed to db to be consistent with all other sequence retrieval functions

  • is.genome.available() now also checks availability of organisms in ENSEMBL. See db = "ensembl"

  • the argument db_name in listDatabases() has been renamed db to be more consistent with the notation in other functions

  • the argument name in download.database() has been renamed db to be more consistent with the notation in other functions

  • getKingdoms() now retrieves also kingdom information for ENSEMBL and ENSEMBLGENOMES

  • getKingdoms() received new argument db to specify from which database (e.g. refseq, genbank, ensembl or ensemblgenomes) kingdom information shall be retrieved

  • getKingdoms(db = "refseq") received one more member: "viral", allowing the genome retrieval of all viruses

  • argument out.folder in meta.retrieval() has been renamed to path to be more consistent with other retrieval functions

  • all read_* functions have received a new argument obj.type allowing users to choose between storing input genomes as Biostrings object or data.table object

  • all read_* functions now have format = "fasta" as default

  • the kingdom argument in the listGenomes() function was renamed to type, now allowing users to specify not only kingdoms, but also groups and subgroups. Use: listGenomes(type = "kingdom") or listGenomes(type = "group") or listGenomes(type = "subgroup")

  • the listGenomes() function receives a new argument subset to specify a subset of the selected type argument. E.g. subset = "Eukaryota" when specifying type = "kingdom"

  • new Vignette Meta-Genome Retrieval
  • Update examples and extend Introduction Vignette
  • Update examples and extend Database Retrieval Vignette
  • Update examples and extend Sequence Retrieval Vignette
  • Update examples and extend Functional Annotation Vignette
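
The renamed listGenomes() arguments described in this release can be sketched as follows. The wrapper list_available_genomes() is hypothetical and not part of biomartr. It only forwards the db, type, and subset arguments named in the notes above, and the default db = "refseq" is an assumption.

```r
# Minimal sketch: list genomes by kingdom, group, or subgroup.
# list_available_genomes() is a hypothetical wrapper, not a biomartr function.
list_available_genomes <- function(type = c("kingdom", "group", "subgroup"),
                                   subset = NULL,
                                   db = "refseq") {
  # restrict 'type' to the three values named in the release notes
  type <- match.arg(type)

  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install biomartr first: install.packages('biomartr')")
  }

  # forward the selection to listGenomes()
  biomartr::listGenomes(db = db, type = type, subset = subset)
}

# Example call (requires an internet connection):
# eukaryotes <- list_available_genomes(type = "kingdom", subset = "Eukaryota")
```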

biomartr 0.1.0

  • fixing a parsing error of the file ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt. The problem was that comment lines had been introduced, so columns could no longer be parsed correctly. As a result, genomes, proteomes, and CDS files could not be downloaded properly. This has been fixed now.

  • genomes, proteomes, and CDS as well as meta-genomes can now be retrieved from RefSeq and Genbank (not only RefSeq); only getCDS() does not have Genbank access, because Genbank does not provide CDS sequences

  • adding new function meta.retrieval() to mass retrieve genomes for entire kingdoms of life

  • fixed a major bug in organismBM() causing the function to fail. The failure of this function affected all downstream organism*() functions. Bug is now fixed and everything works properly

  • updated Vignettes

biomartr 0.0.3

  • updating unit tests for new API

  • fixing API problems that caused all BioMart related functions to fail

  • fixing retrieval problems in getCDS(), getProteome(), and getGenome()

  • the listDatabases() function now has a new option db_name = "all" allowing users to list all available databases stored on NCBI

  • adding new vignette: Database Retrieval
  • update the vignettes: Phylotranscriptomics, Sequence Retrieval, and Functional Annotation

biomartr 0.0.2

  • adding vignettes: Introduction, Functional Annotation, Phylotranscriptomics, and Sequence Retrieval

biomartr 0.0.1

Release Version


0.4.0 by Hajk-Georg Drost, 16 days ago


https://github.com/HajkD/biomartr


Report a bug at https://github.com/HajkD/biomartr/issues


Browse source code at https://github.com/cran/biomartr


Authors: Hajk-Georg Drost




GPL-3 license


Imports biomaRt, Biostrings, stringi, curl, tibble, jsonlite, data.table, dplyr, readr, downloader, RCurl, XML, httr, stringr

Suggests knitr, rmarkdown, devtools, testthat

