Generating Various Numerical Representation Schemes for Protein Sequences

Comprehensive toolkit for generating various numerical features of protein sequences described in Xiao et al. (2015) <DOI:10.1093/bioinformatics/btv042>. For full functionality, the software 'ncbi-blast+' is needed; see <https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download> for more information.


Paper Citation

Formatted citation:

Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31 (11), 1857-1859.

BibTeX entry:

@article{Xiao2015,
  author = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song},
  title = {{protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences}},
  journal = {Bioinformatics},
  year = {2015},
  volume = {31},
  number = {11},
  pages = {1857--1859},
  doi = {10.1093/bioinformatics/btv042},
  issn = {1367-4803},
  url = {http://bioinformatics.oxfordjournals.org/content/31/11/1857}
}

Installation

To install protr from CRAN:

install.packages("protr")

Or try the latest version on GitHub:

# install.packages("devtools")
devtools::install_github("nanxstats/protr")

See the package vignette for a quick start.

Shiny Web Application

ProtrWeb, the Shiny web application built on protr, can be accessed from http://protr.org.

ProtrWeb is a user-friendly web application for computing the protein sequence descriptors (features) presented in the protr package.

Descriptors List

Commonly used descriptors

  • Amino acid composition descriptors
    • Amino acid composition
    • Dipeptide composition
    • Tripeptide composition
  • Autocorrelation descriptors
    • Normalized Moreau-Broto autocorrelation
    • Moran autocorrelation
    • Geary autocorrelation
  • CTD descriptors
    • Composition
    • Transition
    • Distribution
  • Conjoint Triad descriptors
  • Quasi-sequence-order descriptors
    • Sequence-order-coupling number
    • Quasi-sequence-order descriptors
  • Pseudo amino acid composition (PseAAC)
    • Pseudo amino acid composition
    • Amphiphilic pseudo amino acid composition
  • Profile-based descriptors
    • Profile-based descriptors derived by PSSM (Position-Specific Scoring Matrix)
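
To make the first two composition descriptors concrete, here is a minimal base R sketch of how amino acid composition and dipeptide composition can be computed as residue and residue-pair frequencies. It is an illustration under stated assumptions, not the package's implementation: the helper names aa_composition() and dipeptide_composition() and the toy sequence are hypothetical. In protr, the corresponding descriptors are computed by extractAAC() and extractDC().

```r
# The 20 standard amino acids in one-letter code
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical helper: fraction of each residue type in the sequence
aa_composition <- function(x) {
  residues <- strsplit(x, "")[[1]]
  counts <- table(factor(residues, levels = aa))
  setNames(as.numeric(counts) / length(residues), aa)
}

# Hypothetical helper: fraction of each of the 400 ordered residue pairs
dipeptide_composition <- function(x) {
  residues <- strsplit(x, "")[[1]]
  n <- length(residues)
  pairs <- paste0(residues[-n], residues[-1])  # overlapping adjacent pairs
  all_pairs <- as.vector(outer(aa, aa, paste0))
  counts <- table(factor(pairs, levels = all_pairs))
  setNames(as.numeric(counts) / (n - 1), all_pairs)
}

toy_seq <- "MKVLAAGIVALLLAAGCSSA"  # made-up sequence for illustration only
aac <- aa_composition(toy_seq)
dc <- dipeptide_composition(toy_seq)

length(aac)  # 20 features
length(dc)   # 400 features
```

Both vectors have a fixed length regardless of the sequence length, which is what makes such descriptors usable as inputs to standard statistical and machine learning models.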

Proteochemometric (PCM) modeling descriptors

  • Scales-based descriptors derived by principal components analysis
  • Scales-based descriptors derived by amino acid properties (AAindex)
  • Scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.)
  • Scales-based descriptors derived by factor analysis
  • Scales-based descriptors derived by multidimensional scaling
  • BLOSUM and PAM matrix-derived descriptors

Similarity Computation

Local and global pairwise sequence alignment for protein sequences:

  • Between two protein sequences
  • Parallelized pairwise similarity calculation with a list of protein sequences

GO semantic similarity measures:

  • Between two groups of GO terms / two Entrez Gene IDs
  • Parallelized pairwise similarity calculation with a list of GO terms / Entrez Gene IDs

Miscellaneous tools and datasets

  • Retrieve protein sequences from UniProt
  • Read protein sequences in FASTA format
  • Read protein sequences in PDB format
  • Sanity check of the amino acid types appearing in the protein sequences
  • Protein sequence segmentation
  • Auto cross covariance (ACC) for generating scales-based descriptors of the same length
  • 20+ pre-computed 2D and 3D descriptor sets for the 20 amino acids to use with the scales-based descriptors
  • BLOSUM and PAM matrices for the 20 amino acids
  • Meta information of the 20 amino acids
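
As an illustration of the sanity check listed above, the following base R sketch filters out sequences that contain anything other than the 20 standard amino acid letters. The helper name is_standard_protein() and the example sequences are hypothetical; in protr, the check itself is provided by protcheck().

```r
# The 20 standard amino acids in one-letter code
standard_aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical helper: TRUE only if every residue is a standard amino acid
is_standard_protein <- function(x) {
  all(strsplit(x, "")[[1]] %in% standard_aa)
}

# Made-up example sequences: one clean, one with a gap, one with X and B
seqs <- c(ok = "MKVLAAGIV", gapped = "MKV-LAAG", ambiguous = "MKXLAAGB")

keep <- vapply(seqs, is_standard_protein, logical(1))
clean <- seqs[keep]  # only sequences that passed the check
```

Running such a check before feature extraction matters because most descriptors are defined only over the 20 standard residue types.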

Contribute

To contribute to this project, please take a look at the Contributing Guidelines first. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

News

protr 1.6-1 (2019-02-24)

Improvements

  • Added a new argument batches to parSeqSim() that splits the pairwise similarity computation into smaller batches. This is useful when you have a large number of protein sequences and enough CPU cores, but not enough RAM to compute and hold all the pairwise similarities in a single batch. Also added a new argument verbose to track the computation progress.
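
The batching idea can be sketched in base R as follows. This only illustrates how a set of sequence pairs might be divided into roughly equal batches; it is not the package's internal code, and the helper split_pairs() is hypothetical.

```r
# Hypothetical helper: split all unordered pairs (i, j), i < j, into batches
split_pairs <- function(n_seqs, batches) {
  pairs <- t(combn(n_seqs, 2))  # one row per pair
  n_pairs <- nrow(pairs)
  # Assign consecutive pairs to batches of roughly equal size
  batch_id <- ceiling(seq_len(n_pairs) * batches / n_pairs)
  split(as.data.frame(pairs), batch_id)
}

# 6 sequences give choose(6, 2) = 15 pairs, here spread over 4 batches
pair_batches <- split_pairs(n_seqs = 6, batches = 4)
vapply(pair_batches, nrow, integer(1))
```

Only one batch of results has to be computed at a time, so peak memory use depends on the batch size rather than on the total number of pairs. With protr, a call would look something like parSeqSim(protlist, batches = 10, verbose = TRUE), where protlist stands for your own list of sequences.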

New Features

  • Added a new function parSeqSimDisk(). Compared to the in-memory version parSeqSim(), this new function caches the partial results in each batch to the hard drive and merges the results together in the end. This could further reduce the memory usage for parallel similarity computations involving a large number of protein sequences.
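
The cache-then-merge pattern described here can be sketched in base R. This is a simplified illustration, not the implementation of parSeqSimDisk(): toy_similarity() is a hypothetical stand-in for a real alignment score, and the sequences, the number of batches, and the file naming are all assumptions.

```r
# Hypothetical stand-in for an expensive alignment-based similarity
toy_similarity <- function(a, b) 1 / (1 + abs(nchar(a) - nchar(b)))

seqs <- c("MKVLA", "MKVLAAG", "GIVALLL", "CSSA")  # made-up sequences
pairs <- t(combn(length(seqs), 2))
batch_id <- ceiling(seq_len(nrow(pairs)) * 3 / nrow(pairs))  # 3 batches

cache_dir <- tempfile("simcache")
dir.create(cache_dir)

# Compute one batch at a time and write it to disk instead of keeping it in RAM
for (b in unique(batch_id)) {
  idx <- pairs[batch_id == b, , drop = FALSE]
  sims <- mapply(
    function(i, j) toy_similarity(seqs[i], seqs[j]),
    idx[, 1], idx[, 2]
  )
  saveRDS(cbind(idx, sims), file.path(cache_dir, sprintf("batch-%03d.rds", b)))
}

# Merge the cached batches into one symmetric similarity matrix
files <- sort(list.files(cache_dir, full.names = TRUE))
merged <- do.call(rbind, lapply(files, readRDS))
simmat <- diag(length(seqs))
simmat[merged[, 1:2]] <- merged[, 3]
simmat[merged[, 2:1]] <- merged[, 3]
```

The trade-off is extra disk I/O in exchange for holding only one batch of partial results in memory during the computation.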

Bug Fixes

  • Fixed an issue in parGOSim() that caused minor numerical inconsistencies in results due to argument matching.

protr 1.6-0 (2019-02-11)

Bug Fixes

  • Updated twoGOSim() and parGOSim() to use the latest GOSemSim API for computing GO-based semantic similarity. Also fixed issues in the code examples. We thank Denisa Duma for the feedback.

protr 1.5-2 (2018-11-21)

Bug Fixes

  • Fixed the API endpoint issue (from HTTP to HTTPS) in getUniProt().

Improvements

  • Added two new parameters gap.opening and gap.extension to parSeqSim(), allowing more flexible tuning of the sequence alignment for more types of amino acid sequence data. We thank Dr. Maisa Pinheiro for the feedback.
  • Added floating TOC and new CSS style in the vignette to improve navigation and readability.

protr 1.5-1 (2018-07-12)

New Features

  • Added a new function removeGaps() for removing/replacing gaps (-) or any irregular characters from protein sequences, to make them suitable for feature extraction or sequence alignment based similarity computation. We thank Dr. Maisa Pinheiro for the feedback.
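
The kind of cleaning this enables can be sketched in base R with gsub(). The helpers strip_gaps() and strip_irregular() and the example sequences are hypothetical and are not the package's code; they only show the two operations mentioned above, removing or replacing gap characters and dropping any non-standard character.

```r
# Hypothetical helper: remove (or replace) a literal gap character
strip_gaps <- function(x, pattern = "-", replacement = "") {
  gsub(pattern, replacement, x, fixed = TRUE)
}

# Hypothetical helper: drop anything that is not a standard amino acid letter
strip_irregular <- function(x) {
  gsub("[^ACDEFGHIKLMNPQRSTVWY]", "", x)
}

aligned <- "MK-VLA--AGIV"  # made-up aligned sequence containing gaps

strip_gaps(aligned)                     # "MKVLAAGIV"
strip_gaps(aligned, replacement = "G")  # "MKGVLAGGAGIV"
strip_irregular("MKX-LAB*")             # "MKLA"
```

Whether to delete gaps or replace them with a placeholder residue depends on the downstream analysis; deletion shortens the sequence, while replacement preserves positions.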

protr 1.5-0 (2017-11-17)

Bug Fixes

  • Resolved a critical bug due to improper ifelse conditioning (3f6e106) for the distribution descriptor in CTD. We thank Jielu Yan from the University of Macau for kindly reporting this issue.

Improvements

  • General fixes and improvements for the package vignette.

protr 1.4-2 (2017-09-28)

Improvements

  • The function list is now organized into sections on the package website (https://nanx.me/protr/reference/).
  • Use system font stack instead of Google Fonts in vignettes to avoid a pandoc SSL issue.

protr 1.4-1 (2017-07-08)

Improvements

  • Converted table images to markdown tables in the vignette
  • Updated the screenshot of ProtrWeb in the vignette

protr 1.4-0 (2017-06-06)

Improvements

  • Migrated from Sweave-based PDF vignette to knitr-based HTML vignette

protr 1.3-0 (2017-05-07)

Improvements

  • Fix obsolete URLs
  • Better R code formatting
  • Better function documentation and vignette formatting

protr 1.2-1 (2016-12-29)

Improvements

  • New website: https://nanx.me/protr/
  • Added Windows continuous integration support using AppVeyor.
  • Better R file naming scheme

protr 1.2-0 (2016-11-12)

Improvements

  • Added continuous integration
  • Code style improvements

protr 1.1-1 (2015-12-29)

Bug Fixes

  • Fix URLs that cannot be accessed by curl -I -L:

    1. Use http://protr.org
    2. Remove all inaccessible URLs

protr 1.1-0 (2015-12-28)

Bug Fixes

  • Bug fix in extractCTDD()

protr 1.0-1 (2015-11-26)

Bug Fixes

  • Improvements for dealing with boundary cases in several functions (thanks to @koefoed for the patches)

Improvements

  • Added citation information

protr 0.5-1 (2014-12-22)

Improvements

  • Minor improvements and fixes for documentation

protr 0.5-0 (2014-12-18)

Improvements

  • Added functions allowing users to specify their own classification of the amino acids
  • Documentation improvements
  • Other minor improvements

protr 0.4-1 (2014-10-10)

Improvements

  • General documentation improvements

protr 0.4-0 (2014-09-20)

New Features

  • Added profile-based descriptors derived by PSSM

protr 0.3-0 (2014-06-20)

Improvements

  • Added example workflow using protr in the vignette

protr 0.2-1 (2014-01-25)

Improvements

  • Added LICENSE file according to CRAN policies

protr 0.2-0 (2013-12-10)

New Features

  • second release
  • added Proteochemometric (PCM) Modeling descriptors, parallelized similarity computation derived by protein sequence alignment and Gene Ontology (GO) semantic similarity measures between a list of protein sequences / GO terms / Entrez Gene IDs
  • added misc tools and datasets
  • initial version of Scales-Based Descriptors derived by Principal Components Analysis
  • initial version of Scales-Based Descriptors derived by AA-Properties (AAindex)
  • initial version of Scales-Based Descriptors derived by 20+ classes of 2D and 3D Molecular Descriptors
  • initial version of Scales-Based Descriptors derived by Factor Analysis
  • initial version of Scales-Based Descriptors derived by Multidimensional Scaling
  • initial version of BLOSUM and PAM Matrix-Derived Descriptors
  • initial version of parallelized pairwise similarity calculation with a list of protein sequences
  • initial version of pairwise semantic similarity calculation with a list of GO terms / Entrez Gene IDs
  • initial version of Auto Cross Covariance (ACC) for generating scales-based descriptors of the same length
  • introducing ProtrWeb, the web service based on protr: http://protr.org

protr 0.1-0 (2012-11-18)

New Features

  • initial version
  • first version of Amino Acid Composition descriptor
  • first version of Dipeptide Composition descriptor
  • first version of Tripeptide Composition descriptor
  • first version of Normalized Moreau-Broto Autocorrelation descriptor
  • first version of Moran Autocorrelation descriptor
  • first version of Geary Autocorrelation descriptor
  • first version of CTD - Composition descriptor
  • first version of CTD - Transition descriptor
  • first version of CTD - Distribution descriptor
  • first version of Conjoint Triad descriptor
  • first version of Sequence Order Coupling Number descriptor
  • first version of Quasi-Sequence-Order descriptor
  • first version of Pseudo Amino Acid Composition descriptor
  • first version of Amphiphilic Pseudo Amino Acid Composition descriptor
  • first version of readFASTA()
  • first version of getUniProt()
  • first version of protcheck()
  • first version of protseg()


Version: 1.6-2 by Nan Xiao, 5 months ago

URL: https://nanx.me/protr/, https://github.com/nanxstats/protr, http://protr.org

Report a bug at https://github.com/nanxstats/protr/issues

Browse source code at https://github.com/cran/protr

Authors: Nan Xiao [aut, cre], Qing-Song Xu [aut], Dong-Sheng Cao [aut]

License: BSD_3_clause + file LICENSE

Suggests: knitr, rmarkdown, Biostrings, GOSemSim, foreach, doParallel, org.Hs.eg.db

System requirements: ncbi-blast+ (see <https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download>)