Perform large scale genomic data retrieval and functional annotation retrieval. This package aims to provide users with a standardized
way to automate genome, proteome, 'RNA', coding sequence ('CDS'), 'GFF', and metagenome
retrieval from 'NCBI RefSeq', 'NCBI Genbank', 'ENSEMBL', 'ENSEMBLGENOMES',
and 'UniProt' databases. Furthermore, an interface to the 'BioMart' database
(Smedley et al. (2009)
This package was born out of my own frustration with automating the genomic data retrieval process to create computationally reproducible scripts for large-scale genomics studies. Since I couldn't find easy-to-use and fully reproducible software libraries, I sat down and tried to implement a framework that would enable anyone to automate and standardize the genomic data retrieval process. I hope that this package is useful to others as well and that it helps to promote reproducible research in genomics studies.
I happily welcome anyone who wishes to contribute to this project :) Just drop me an email.
The vastly growing number of sequenced genomes allows us to perform a new type of biological research. Using a comparative approach these genomes provide us with new insights on how biological information is encoded on the molecular level and how this information changes over evolutionary time.
The first step, however, of any genome based study is to retrieve genomes and their annotation from databases. To automate the
retrieval process of this information on a meta-genomic scale, the biomartr
package provides interface functions for genomic sequence retrieval and functional annotation retrieval. The major aim of biomartr
is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses.
In addition, biomartr aims to address the genome version crisis. With biomartr, users can now control and be informed about the genome versions they retrieve automatically. Many large-scale genomics studies lack this information, and thus reproducibility and data interpretation become nearly impossible when documentation of genome version information gets neglected.
In detail, biomartr automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (annotation), genome assembly quality, and metagenome project data retrieval from the major biological databases such as NCBI RefSeq, NCBI Genbank, ENSEMBL, ENSEMBLGENOMES, and UniProt.
Furthermore, an interface to the Ensembl Biomart database allows users to retrieve functional annotation for genomic loci using a novel and organism-centric search strategy. In addition, users can download entire databases such as NCBI RefSeq, NCBI nr, NCBI nt, NCBI Genbank, etc. as well as ENSEMBL and ENSEMBLGENOMES with only one command.
I would be very grateful if you could cite the following paper in case biomartr was useful for your own research. I plan on vastly extending the biomartr functionality and usability in the next years. Many thanks in advance :)
I truly value your opinion and improvement suggestions. Hence, I would be extremely grateful if you could take this 1-minute, 3-question survey (https://goo.gl/forms/Qaoxxjb1EnNSLpM02) so that I can learn how to improve biomartr in the best possible way. Many many thanks in advance.
# install biomartr 0.8.0
source("http://bioconductor.org/biocLite.R")
biocLite('biomartr')
The automated retrieval of collections (= Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) makes sure that the genome file of an organism matches the CDS, proteome, RNA, GFF, etc. files and that all of them were generated using the same genome assembly version. One reason why genomics studies fail in computational and biological reproducibility is that it is not clear whether the CDS, proteome, RNA, GFF, etc. files used in a proposed analysis were generated using the same genome assembly file denoting the same genome assembly version.
To avoid this seemingly trivial mistake we encourage users to retrieve genome file collections using the biomartr function getCollection() and attach the corresponding output as Supplementary Data to the respective genomics study to ensure computational and biological reproducibility.
# download collection for Saccharomyces cerevisiae
getCollection(db = "refseq", organism = "Saccharomyces cerevisiae", path = file.path("refseq","Collections"))
Internally, the getCollection() function will now generate a folder named refseq/Collections/Saccharomyces_cerevisiae and will store all genome and annotation files for Saccharomyces cerevisiae in the same folder.
In addition, the exact genome and annotation version will be logged in the doc folder.
Internally, a text file named doc_Saccharomyces_cerevisiae_db_refseq.txt is generated. The information stored in this log file is structured as follows:
File Name: Saccharomyces_cerevisiae_assembly_stats_refseq.txt
Organism Name: Saccharomyces_cerevisiae
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt
Download_Date: Wed Jun 27 15:21:51 2018
refseq_category: reference genome
assembly_accession: GCF_000146045.2
bioproject: PRJNA128
biosample: NA
taxid: 559292
infraspecific_name: strain=S288C
version_status: latest
release_type: Major
genome_rep: Full
seq_rel_date: 2014-12-17
submitter: Saccharomyces Genome Database
In an ideal world this reference file could then be included as supplementary information in any life science publication that relies on genomic information so that reproducibility of experiments and analyses becomes achievable.
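Because the log file shown above is a plain list of `Key: value` lines, it can be read back into R, for example to build a supplementary table. The following is a minimal sketch in base R, assuming every relevant line has the form `Key: value`; the helper name `parse_collection_log()` is hypothetical and not part of biomartr.

```r
# Hypothetical helper (not part of biomartr): turn "Key: value" lines
# into a named character vector.
parse_collection_log <- function(lines) {
  # position of the first ": " in each line (-1 if the line has none);
  # splitting at the first occurrence keeps colons in URLs and times intact
  pos <- as.integer(regexpr(": ", lines, fixed = TRUE))
  keep <- pos > 0
  lines <- lines[keep]
  pos <- pos[keep]
  keys <- trimws(substr(lines, 1, pos - 1))
  values <- trimws(substr(lines, pos + 2, nchar(lines)))
  stats::setNames(values, keys)
}

# A few lines copied from the example log above
example_log <- c(
  "Organism Name: Saccharomyces_cerevisiae",
  "Database: NCBI refseq",
  "Download_Date: Wed Jun 27 15:21:51 2018",
  "assembly_accession: GCF_000146045.2"
)
log_info <- parse_collection_log(example_log)
log_info[["assembly_accession"]]
```

For a real log file, the lines would come from `readLines()` applied to the file in the doc folder; the exact location of that file on disk depends on the `path` passed to getCollection().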
Download all mammalian vertebrate genomes from NCBI RefSeq via:
# download all mammalian vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")
All genomes are stored in a folder named according to the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the path argument (named out.folder in earlier versions) to define a custom output folder path.
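According to the release notes, meta.retrieval() also accepts a group argument to restrict the download to a taxonomic subgroup. The sketch below shows one way to prepare such a call and inspect it before starting a long download. The wrapper `retrieve_group()` and its `dry_run` flag are hypothetical and not part of biomartr, and the group name is only an example value; available groups can be listed with getGroups().

```r
# Hypothetical wrapper (not part of biomartr): collect the arguments for a
# group-level meta.retrieval() call and only start the download on request.
retrieve_group <- function(db, kingdom, group, type = "genome", dry_run = TRUE) {
  args <- list(db = db, kingdom = kingdom, group = group, type = type)
  if (dry_run) {
    return(args)  # inspect the planned call without downloading anything
  }
  # requires the biomartr package and an internet connection
  do.call(biomartr::meta.retrieval, args)
}

# Plan the retrieval of one bacterial subgroup (example value)
planned <- retrieve_group(db = "refseq", kingdom = "bacteria",
                          group = "Gammaproteobacteria")
str(planned)
```

Calling `retrieve_group(..., dry_run = FALSE)` would pass the same arguments on to meta.retrieval(); keeping the argument list around also makes it easy to record exactly what was requested.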
Find biomartr also at OmicTools.
Please find all FAQs here.
I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.
Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:
For Bug Reports: Please send me an issue.
Getting Started with biomartr:
Users can also read the tutorials within RStudio:
# source the biomartr package
library(biomartr)
# look for all tutorials (vignettes) available in the biomartr package
# this will open your web browser
browseVignettes("biomartr")
The current status of the package as well as a detailed history of the functionality of each version of biomartr can be found in the NEWS section.
Some bug fixes or new functionality will not be available on CRAN yet, but in the developer version here on GitHub. To download and install the most recent version of biomartr run:
# install the current version of biomartr on your system
source("http://bioconductor.org/biocLite.R")
biocLite("ropensci/biomartr")
meta.retrieval(): Perform Meta-Genome Retrieval from NCBI of species belonging to the same kingdom of life or to the same taxonomic subgroup
meta.retrieval.all(): Perform Meta-Genome Retrieval from NCBI of the entire kingdom of life
getMetaGenomes(): Retrieve metagenomes from NCBI Genbank
getMetaGenomeAnnotations(): Retrieve annotation *.gff files for metagenomes from NCBI Genbank
listMetaGenomes(): List available metagenomes on NCBI Genbank
getMetaGenomeSummary(): Helper function to retrieve the assembly_summary.txt file from NCBI genbank metagenomes
listGenomes(): List all genomes available on NCBI and ENSEMBL servers
listKingdoms(): List the number of available species per kingdom of life on NCBI and ENSEMBL servers
listGroups(): List the number of available species per group on NCBI and ENSEMBL servers
getKingdoms(): Retrieve available kingdoms of life
getGroups(): Retrieve available groups for a kingdom of life
is.genome.available(): Check genome availability on NCBI and ENSEMBL servers
getCollection(): Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats
getGenome(): Download a specific genome stored on NCBI and ENSEMBL servers
getProteome(): Download a specific proteome stored on NCBI and ENSEMBL servers
getCDS(): Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers
getRNA(): Download a specific RNA file stored on NCBI and ENSEMBL servers
getGFF(): Genome Annotation Retrieval from NCBI (*.gff) and ENSEMBL (*.gff3) servers
getGTF(): Genome Annotation Retrieval (*.gtf) from ENSEMBL servers
getRepeatMasker(): Repeat Masker TE Annotation Retrieval
getAssemblyStats(): Genome Assembly Stats Retrieval from NCBI
getKingdomAssemblySummary(): Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms
getMetaGenomeSummary(): Helper function to retrieve the assembly_summary.txt files from NCBI genbank metagenomes
getSummaryFile(): Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom
getENSEMBLInfo(): Retrieve ENSEMBL info file
getGENOMEREPORT(): Retrieve GENOME_REPORTS file from NCBI
read_genome(): Import genomes as Biostrings or data.table object
read_proteome(): Import proteome as Biostrings or data.table object
read_cds(): Import CDS as Biostrings or data.table object
read_gff(): Import GFF file
read_rna(): Import RNA file
read_rm(): Import Repeat Masker output file
read_assemblystats(): Import Genome Assembly Stats File
listNCBIDatabases(): Retrieve a List of Available NCBI Databases for Download
download.database(): Download a NCBI database to your local hard drive
download.database.all(): Download a complete NCBI Database such as e.g. NCBI nr to your local hard drive
biomart(): Main function to query the BioMart database
getMarts(): Retrieve All Available BioMart Databases
getDatasets(): Retrieve All Available Datasets for a BioMart Database
getAttributes(): Retrieve All Available Attributes for a Specific Dataset
getFilters(): Retrieve All Available Filters for a Specific Dataset
organismBM(): Function for organism specific retrieval of available BioMart marts and datasets
organismAttributes(): Function for organism specific retrieval of available BioMart attributes
organismFilters(): Function for organism specific retrieval of available BioMart filters
getGO(): Function to retrieve GO terms for a given set of genes

# On Windows, this won't work - see ?build_github_devtools
install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# When working with Windows, first you need to install the
# R package: rtools -> install.packages("rtools")
# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:
devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# and then call it from the library
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")
biomartr on a Win 8 laptop: solution (Thanks to Andres Romanowski)
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
getCollection() for retrieval of a collection: the genome sequence, protein sequences, gff files, etc. for a particular species
getProteome() can now retrieve proteomes from the UniProt database by specifying getProteome(db = "uniprot"). An example can be found here
is.genome.available() now prints out more useful interactive messages when searching for available organisms
is.genome.available() can now handle taxids and assembly_accession ids in addition to the scientific name when specifying argument organism. An example can be found here
is.genome.available() can now check for organism availability in the UniProt database
getGenome(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve genome assemblies
getProteome(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve proteomes
getCDS(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve CDS
getRNA(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve RNAs
is.genome.available(): argument order was changed from is.genome.available(organism, details, db) to is.genome.available(db, organism, details) to be logically more consistent with all get*() functions
meta.retrieval receives a new argument restart_at_last to indicate whether the download process, when re-running the meta.retrieval function, shall pick up at the last species or crawl through all existing files to check the md5 checksum
meta.retrieval now generates a csv overview file in the doc folder which stores genome version, date, origin, etc. information for all downloaded organisms and can be directly used as Supplementary Data file in publications to increase computational and biological reproducibility of the genomics study
download.database.all() can now skip already downloaded files and internally removes corrupted files with non-matching md5 checksum. Re-downloading of corrupted files can be performed by simply re-running the download.database.all() command
the function meta.retrieval() will now pick up the download at the organism where it left off and will report which species have already been retrieved
all get*() functions and the meta.retrieval() function receive a new argument reference which allows users to retrieve non-reference or non-representative genome versions when downloading from NCBI RefSeq or NCBI Genbank
the argument order in meta.retrieval() changed from meta.retrieval(kingdom, group, db, ...) to meta.retrieval(db, kingdom, group, ...) to make the argument order more consistent with the get*() functions
the argument order in getGroups() changed from getGroups(kingdom, db) to getGroups(db, kingdom) to make the argument order more consistent with the get*() and meta.retrieval() functions
existingOrganisms() and existingOrganisms_ensembl() which check the organisms that have already been downloaded
fixing a bug in exists.ftp.file() and getENSEMBLGENOMES.Seq() that caused bacterial genome, proteome, etc. retrieval to fail due to the wrong construction of a query ftp request https://github.com/HajkD/biomartr/issues/7 (Many thanks to @dbsseven)
fix a major bug in which organisms having no representative genome would generate NULL paths that subsequently crashed the meta.retrieval() function when it tried to print out the result paths.
new function getRepeatMasker() for retrieval of Repeat Masker output files
new function getGTF() for genome annotation retrieval from ensembl and ensemblgenomes in gtf format (Thanks for suggesting it Ge Tan)
new function getRNA() to perform RNA Sequence Retrieval from NCBI and ENSEMBL databases (Thanks for suggesting it @carlo-berg)
new function read_rna() for importing RNA downloaded with getRNA() as Biostrings or data.table object
new function read_rm() for importing Repeat Masker output files downloaded with getRepeatMasker()
new helper function custom_download() that aims to make the download process more robust and stable
-> In detail, the download process now adapts to the operating system, e.g. using either curl (macOS), wget (Linux), or wininet (Windows)
function name listDatabases() has been renamed listNCBIDatabases(). In biomartr version 0.6.0 the function name listDatabases() will be deprecated
meta.retrieval() and meta.retrieval.all() now allow the bulk retrieval of GTF files for db = 'ensembl' and db = 'ensemblgenomes' via type = "gtf". See getGTF() for more details.
meta.retrieval() and meta.retrieval.all() now allow the bulk retrieval of RNA files via type = "rna". See getRNA() for more details.
meta.retrieval() and meta.retrieval.all() now allow the bulk retrieval of Repeat Masker output files via type = "rm". See getRepeatMasker() for more details.
all get*() retrieval functions now skip the download of a particular file if it already exists in the specified file path
download.database() and download.database.all() now internally perform md5 checksum checks to make sure that the file download was successful
download.database() and download.database.all() now return the file paths of the downloaded file so that it is easier to use these functions when constructing pipelines, e.g. download.database() %>% ... or download.database.all() %>% ...
meta.retrieval() and meta.retrieval.all() now return the file paths of the downloaded file so that it is easier to use these functions when constructing pipelines, e.g. meta.retrieval() %>% ... or meta.retrieval.all() %>% ...
getGenome(), getProteome(), getCDS(), getRNA(), getGFF(), and getAssemblyStats() now internally perform md5 checksum tests to make sure that files are retrieved intact.
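The md5 checks mentioned in the entries above follow a general pattern: compute the checksum of the downloaded file and compare it with the checksum published alongside it. The following base R sketch illustrates that pattern only; it is not biomartr's internal implementation, and the helper name `md5_matches()` is hypothetical.

```r
# Hypothetical helper (not part of biomartr): compare a file's md5 checksum
# with an expected checksum, e.g. one published by the data provider.
md5_matches <- function(file, expected_md5) {
  observed <- unname(tools::md5sum(file))  # NA if the file does not exist
  !is.na(observed) && identical(tolower(observed), tolower(expected_md5))
}

# Self-contained demonstration with a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines("ACGT", tmp)
reference_md5 <- unname(tools::md5sum(tmp))  # stands in for the published checksum

ok_intact <- md5_matches(tmp, reference_md5)     # file unchanged
writeLines("ACGA", tmp)                           # simulate a corrupted download
ok_corrupted <- md5_matches(tmp, reference_md5)  # checksum no longer matches
unlink(tmp)
```

A file that fails such a check would typically be deleted and downloaded again, which is the behaviour the release notes describe for download.database.all().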
get*() (genome, proteome, gff, etc.) and meta.retrieval*() functions
the meta retrieval process errored and terminated whenever NCBI or ENSEMBL didn't store all types of sequences for a particular organism: genome, proteome, cds, etc. This has been fixed now and function calls such as meta.retrieval(kingdom = "bacteria", db = "genbank", type = "proteome") should work properly now (Thanks to @ARamesh123 for making me aware of this bug). This bug affected all attempts to download all proteome sequences, e.g. for bacteria and viruses, because NCBI does not store genome AND proteome information for all bacterial or viral species.
new function getAssemblyStats() allows users to retrieve the genome assembly stats file from NCBI RefSeq or Genbank, e.g. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_assembly_stats.txt
new function read_assemblystats() allows users to import the genome assembly stats file from NCBI RefSeq or Genbank that was retrieved using the getAssemblyStats() function
meta.retrieval() and meta.retrieval.all() can now also download genome assembly stats for all selected species
meta.retrieval() receives a new argument group that allows users to retrieve species belonging to a subgroup instead of the entire kingdom. Available groups can be retrieved with getGroups().
functions getSubgroups() and listSubgroups() have been removed and their initial functionality has been merged and integrated into getGroups() and listGroups()
listGroups() receives a new argument details that allows users to retrieve the organism names that belong to the corresponding subgroups
getGroups() is now based on listGroups()
internal function getGENOMEREPORT() is now exported and available to the user
all organism*() functions now also support Ensembl Plants, Ensembl Metazoa, Ensembl Protist, and Ensembl Fungi (Thanks for pointing out Alex Gabel)
getMarts() and getDatasets() now also support Ensembl Plants, Ensembl Metazoa, Ensembl Protist, and Ensembl Fungi (Thanks for pointing out Alex Gabel)
Meta-Genome Retrieval has more examples of how to download genomes of species that belong to the same subgroup
getSummaryFile(), getKingdomAssemblySummary(), getMetaGenomeSummary(), getENSEMBL.Seq() and getENSEMBLGENOMES.Seq() functions causing quoted lines in the assembly_summary.txt to be omitted when reading these files. As a result of this artefact, e.g. instead of information on 80,000 Bacteria genomes only 40,000 (those without quotations) were read (Thanks to Xin Wu).
In this version of biomartr the organism*() functions were adapted to the new ENSEMBL 87 release in which organism name specification in the Biomart description column was changed from a scientific name convention to a mix of common name and scientific name convention.
all organism*() functions have been adapted to the new ENSEMBL 87 release organism name notation that is used in the Biomart description
fixing error handling bug that caused commands such as download.database(db = "nr.27.tar.gz") to not execute properly
In this version, biomartr was extended to now retrieve genome, proteome, CDS, GFF and meta-genome data also from ENSEMBL and ENSEMBLGENOMES.
Furthermore, all NCBI retrieval functions were updated to the new server folder structure standards of NCBI.
new meta-retrieval function meta.retrieval.all() allows users to download all individual genomes of all kingdoms of life with one command
new metagenome retrieval function getMetaGenomes() allows users to retrieve metagenome projects from NCBI Genbank
new metagenome retrieval function getMetaGenomeAnnotations() allows users to retrieve annotation files for genomes belonging to a metagenome project stored at NCBI Genbank
new retrieval function getGFF() allows users to retrieve annotation (*.gff) files for specific genomes from NCBI and ENSEMBL databases
new import function read_gff() allowing users to import GFF files downloaded with getGFF()
new internal functions to check for availability of ENSEMBL or ENSEMBLGENOMES databases
new database retrieval function download.database.all() allows users to download entire NCBI databases with one command
new function listMetaGenomes() allowing users to list available metagenomes on NCBI Genbank
new external helper function getSummaryFile() to retrieve the assembly_summary.txt file from NCBI
new external helper function getKingdomAssemblySummary() to retrieve the assembly_summary.txt files from NCBI for all kingdoms and combine them into one big data.frame
new function listKingdoms() allows users to list the number of available species per kingdom of life
new function listGroups() allows users to list the number of available species per group
new function listSubgroups() allows users to list the number of available species per subgroup
new function getGroups() allows users to retrieve available groups for a kingdom of life
new function getSubgroups() allows users to retrieve available subgroups for a kingdom of life
new external helper function getMetaGenomeSummary() to retrieve the assembly_summary.txt files from NCBI genbank metagenomes
new internal helper function getENSEMBL.Seq() acting as main interface function to communicate with the ENSEMBL database API for sequence retrieval
new internal helper function getENSEMBLGENOMES.Seq() acting as main interface function to communicate with the ENSEMBL database API for sequence retrieval
new internal helper function getENSEMBL.Annotation() acting as main interface function to communicate with the ENSEMBL database API for GFF retrieval
new internal helper function getENSEMBLGENOMES.Annotation() acting as main interface function to communicate with the ENSEMBL database API for GFF retrieval
new internal helper function get.ensemblgenome.info() to retrieve general organism information from ENSEMBLGENOMES
new internal helper function get.ensembl.info() to retrieve general organism information from ENSEMBL
new internal helper function getGENOMEREPORT() to retrieve the genome reports file from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/overview.txt
new internal helper function connected.to.internet() enabling internet connection check
functions getGenome(), getProteome(), and getCDS() can now retrieve genomes, proteomes or CDS from ENSEMBL and ENSEMBLGENOMES in addition to NCBI
the functions getGenome(), getProteome(), and getCDS() were completely re-written and now use the assembly_summary.txt files provided by NCBI to retrieve the download path to the corresponding genome. Furthermore, these functions lost the kingdom argument. Users now only need to specify the organism name and not the kingdom anymore. Furthermore, all get* functions now return the path to the downloaded genome so that this path can be used as input to all read_* functions.
download_databases() has been renamed to download.databases() to be more consistent with other function notation
the argument db_format was removed from listDatabases() and download.database() because it was misleading
the command listDatabases("all") now returns all available NCBI databases that can be retrieved with download.database()
download.database() now internally checks if the input database specified by the user is actually available on NCBI servers
the documentation file generated by getGenome(), getProteome(), and getCDS() is now extended to store more details about the downloaded genome
argument database in is.genome.available() and listGenomes() has been renamed to db to be consistent with all other sequence retrieval functions
is.genome.available() now also checks availability of organisms in ENSEMBL. See db = "ensembl"
the argument db_name in listDatabases() has been renamed db to be more consistent with the notation in other functions
the argument name in download.database() has been renamed db to be more consistent with the notation in other functions
getKingdoms() now retrieves also kingdom information for ENSEMBL and ENSEMBLGENOMES
getKingdoms() received new argument db to specify from which database (e.g. refseq, genbank, ensembl or ensemblgenomes) kingdom information shall be retrieved
getKingdoms(db = "refseq") received one more member: "viral", allowing the genome retrieval of all viruses
argument out.folder in meta.retrieval() has been renamed to path to be more consistent with other retrieval functions
all read_* functions now received a new argument obj.type allowing users to choose between storing input genomes as Biostrings object or data.table object
all read_* functions now have format = "fasta" as default
the kingdom argument in the listGenomes() function was renamed to type, now allowing users to specify not only kingdoms, but also groups and subgroups. Use: listGenomes(type = "kingdom") or listGenomes(type = "group") or listGenomes(type = "subgroup")
the listGenomes() function receives a new argument subset to specify a subset of the selected type argument. E.g. subset = "Eukaryota" when specifying type = "kingdom"
Meta-Genome Retrieval
Introduction Vignette
Database Retrieval Vignette
Sequence Retrieval Vignette
Functional Annotation Vignette
fixing a parsing error of the file ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt
The problem was that comment lines were introduced and columns couldn't be parsed correctly anymore. As a result, genomes, proteomes, and CDS files could not be downloaded properly. This has been fixed now.
genomes, proteomes, and CDS as well as meta-genomes can now be retrieved from RefSeq and Genbank (not only RefSeq); only getCDS() does not have genbank access, because genbank does not provide CDS sequences
adding new function meta.retrieval() to mass retrieve genomes for entire kingdoms of life
fixed a major bug in organismBM() causing the function to fail. The failure of this function affected all downstream organism*() functions. Bug is now fixed and everything works properly
updated Vignettes
updating unit tests for new API
fixing API problems that caused all BioMart related functions to fail
fixing retrieval problems in getCDS(), getProteome(), and getGenome()
the listDatabases() function now has a new option db_name = "all" allowing users to list all available databases stored on NCBI
Release Version