Rapid Implementation of Machine Learning Algorithms for Genomic Data

Supervised machine learning has an increasingly important role in biological studies. However, the sheer complexity of classification pipelines poses a significant barrier to the expert biologist unfamiliar with machine learning. Moreover, many biologists lack the time or technical skills necessary to establish their own pipelines. This package introduces a framework for the rapid implementation of high-throughput supervised machine learning built with the biologist user in mind. Written by biologists, for biologists, this package provides a user-friendly interface that empowers investigators to execute state-of-the-art binary and multi-class classification, including deep learning, with minimal programming experience necessary.


Welcome to the exprso GitHub page!

The exprso package provides a user-friendly R interface to the supervised machine learning framework described above. You can get started with exprso by installing the most up-to-date version of this package directly from GitHub.

library(devtools)  # provides install_github
devtools::install_github("tpq/exprso")  # install the development version from GitHub
library(exprso)

The exprso package organizes the myriad of methodological approaches to classification into analytical modules that provide the user with stackable and interchangeable data processing tools. Although this package primarily revolves around dichotomous (i.e., binary) classification, exprso also includes a rudimentary framework for multi-class classification. Some of the modules available include:

  • array: Modules that import data stored as a data.frame, eSet, or local file.
  • mod: Modules that modify the imported data prior to classification.
  • split: Modules that split these data into training and test sets.
  • fs: Modules that perform feature selection (e.g., statistical filters, SVM-RFE, mRMR, and more).
  • build: Modules that build classifiers (e.g., SVMs, artificial neural networks, random forests, and more).
  • predict: Modules that deploy classifiers and classifier ensembles.
  • calc: Modules that calculate classifier performance, including area under the ROC curve.
  • pl: Modules that manage elaborate classification pipelines (e.g., nested cross-validation, and more).
  • pipe: Modules that filter classification pipeline results.

To showcase this package, we use the publicly available hallmark Golub 1999 dataset to differentiate ALL (acute lymphocytic leukemia) from AML (acute myelogenous leukemia) based on gene expression as measured by microarray technology. We begin by importing this dataset from the golubEsets package, which exposes these data as an eSet (i.e., ExpressionSet) object. Then, using the arrayExprs function, we load the data into exprso. The modFilter, modTransform, and modNormalize functions allow us to replicate the pre-processing steps taken by the original investigators.

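library(golubEsets)
data(Golub_Merge)  # load the Golub data as an ExpressionSet
set.seed(12345)    # fix the random seed so later steps are reproducible
# Import the eSet, grouping samples by the "ALL.AML" annotation column
array <- arrayExprs(Golub_Merge,
                    colBy = "ALL.AML",
                    include = list("ALL",
                                   "AML"))
# Replicate the original pre-processing: filter, transform, then normalize
array <- modFilter(array, 20, 16000, 500, 5)
array <- modTransform(array)
array <- modNormalize(array, c(1, 2))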
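As a rough illustration of what these pre-processing steps amount to, the base R sketch below applies Golub-style thresholding, variation filtering, log transformation, and per-feature standardization to a small hand-made matrix. It does not call exprso. The mapping of 20, 16000, 500, and 5 onto a lower bound, an upper bound, a minimum range, and a minimum fold change is an assumption based on the commonly described protocol for this dataset, and the helper name preprocess_sketch, its argument names, and the toy data are all hypothetical. Only per-feature standardization is sketched; the c(1, 2) argument in the exprso call may do more than this.

```r
# Hypothetical helper: a base R approximation of Golub-style pre-processing.
# Rows are features (genes); columns are samples.
preprocess_sketch <- function(x, lower = 20, upper = 16000,
                              min.range = 500, min.fold = 5) {
  # 1. Threshold: clamp every value into [lower, upper]
  x <- pmin(pmax(x, lower), upper)
  # 2. Filter: keep only features that vary enough across samples
  feature.max <- apply(x, 1, max)
  feature.min <- apply(x, 1, min)
  keep <- (feature.max - feature.min) > min.range &
    (feature.max / feature.min) > min.fold
  x <- x[keep, , drop = FALSE]
  # 3. Transform: take the base 10 logarithm
  x <- log10(x)
  # 4. Normalize: center and scale each feature (row)
  t(scale(t(x)))
}

# Toy data (made up for illustration): 5 features by 6 samples
toy <- rbind(
  flat      = rep(100, 6),                            # no variation
  low       = c(5, 10, 15, 10, 5, 10),                # entirely below the lower bound
  variable  = c(50, 4000, 100, 9000, 30000, 60),      # varies widely
  modest    = c(1000, 1200, 1400, 1100, 1300, 1500),  # varies too little
  variable2 = c(20000, 25, 300, 700, 45, 5000)        # varies widely
)
clean <- preprocess_sketch(toy)
rownames(clean)  # features that survive the filter
```

In this toy example, only the two widely varying features pass the filter; the remaining rows are removed before the log transformation.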

In the next code chunk, we split the dataset randomly into training and test sets. Then, we perform feature selection on the training set by ranking features according to the results of a Student's t-test.

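# Randomly assign about 67% of samples to the training set
arrays <- splitSample(array, percent.include = 67)
# Rank training-set features by t-test (top = 0 uses all features)
array.train <- fsStats(arrays$array.train, top = 0, how = "t.test")
array.test <- arrays$array.valid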
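To make the logic of this step concrete, the following base R sketch performs a random 67% split and a t-test ranking on simulated data. It is a minimal illustration under stated assumptions, not the exprso implementation: the data are simulated, all variable names are hypothetical, and R's default Welch t-test is used. The point it demonstrates is that features are ranked using the training samples only, so the test set plays no part in feature selection.

```r
set.seed(12345)
# Simulated data: 100 features (rows) by 40 samples (columns), two classes
n.features <- 100
labels <- factor(rep(c("Control", "Case"), each = 20))
x <- matrix(rnorm(n.features * 40), nrow = n.features,
            dimnames = list(paste0("f", 1:n.features), NULL))
# Shift five features in the cases so that they carry signal
x[1:5, labels == "Case"] <- x[1:5, labels == "Case"] + 3

# Random split: about 67% of samples for training, the rest for testing
train.idx <- sample(ncol(x), size = round(0.67 * ncol(x)))
x.train <- x[, train.idx]
x.test <- x[, -train.idx]
y.train <- labels[train.idx]

# Rank features by t-test p-value using the training samples only
p.values <- apply(x.train, 1, function(feature) {
  t.test(feature[y.train == "Case"], feature[y.train == "Control"])$p.value
})
ranked <- names(sort(p.values))
head(ranked, 5)  # the most strongly differentiating features
```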

With the training set established, we can now build a classifier and deploy it on the test set. For this example, we will build a linear kernel support vector machine with the cost parameter set to 1. We will build this classifier using the top 50 features as prioritized by fsStats.

mach <- buildSVM(array.train,
                 top = 50,
                 kernel = "linear",
                 cost = 1)
## Setting probability to TRUE (forced behavior, cannot override)...
## Setting cross to 0 (forced behavior, cannot override)...
pred <- predict(mach, array.train)
## Individual classifier performance:
## Arguments not provided in an ROCR AUC format. Calculating accuracy outside of ROCR...
## Classification confusion table:
##          actual
## predicted Control Case
##   Control      29    0
##   Case          0   19
##   acc sens spec
## 1   1    1    1
pred <- predict(mach, array.test)
## Individual classifier performance:
## Arguments not provided in an ROCR AUC format. Calculating accuracy outside of ROCR...
## Classification confusion table:
##          actual
## predicted Control Case
##   Control      18    0
##   Case          0    6
##   acc sens spec
## 1   1    1    1
calcStats(pred)
## Calculating accuracy using ROCR based on prediction probabilities...

##   acc sens spec auc
## 1   1    1    1   1
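For reference, the accuracy, sensitivity, and specificity reported above follow from the confusion table by the standard definitions. The base R sketch below recomputes them for the test set, assuming that "Case" is treated as the positive class. It illustrates the arithmetic only and is not taken from the exprso source.

```r
# Test-set confusion table copied from the output above
confusion <- matrix(c(18, 0, 0, 6), nrow = 2,
                    dimnames = list(predicted = c("Control", "Case"),
                                    actual = c("Control", "Case")))

# Assumption: "Case" is the positive class
tp <- confusion["Case", "Case"]        # cases predicted as cases
tn <- confusion["Control", "Control"]  # controls predicted as controls
fp <- confusion["Case", "Control"]     # controls predicted as cases
fn <- confusion["Control", "Case"]     # cases predicted as controls

stats <- c(acc = (tp + tn) / sum(confusion),  # fraction of correct predictions
           sens = tp / (tp + fn),             # fraction of cases detected
           spec = tn / (tn + fp))             # fraction of controls detected
stats
```

With no off-diagonal counts, all three values equal 1, in agreement with the output shown.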

When constructing a classifier using build modules, we can only specify one set of parameters at a time. However, investigators often want to test models across a vast range of parameters. We provide the plGrid function for high-throughput parameter searches. This function wraps not only classifier construction, but deployment as well. By supplying a non-NULL argument to fold, this function will also calculate v-fold cross-validation using the training set. In the example below, fold = NULL, so cross-validation is skipped.

pl <- plGrid(array.train,
             array.test,
             how = "buildSVM",
             top = c(5, 10, 25, 50),
             kernel = "linear",
             cost = 10^(-3:3),
             fold = NULL)
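Conceptually, a grid search of this kind evaluates every combination of the supplied argument vectors. The base R sketch below enumerates those combinations with expand.grid and, separately, shows one way samples could be assigned to folds for v-fold cross-validation. Whether plGrid builds its grid or its folds in this way internally is an assumption; the sketch only illustrates the size and shape of the search, and the choice of fold = 10 is hypothetical.

```r
# Every combination of the arguments passed to plGrid above
grid <- expand.grid(top = c(5, 10, 25, 50),
                    kernel = "linear",
                    cost = 10^(-3:3),
                    stringsAsFactors = FALSE)
nrow(grid)  # 4 values of top x 7 values of cost = 28 combinations
head(grid)

# Hypothetical fold assignment for 48 training samples with fold = 10
fold <- 10
n.train <- 48
set.seed(1)
fold.id <- sample(rep(seq_len(fold), length.out = n.train))
table(fold.id)  # each fold holds 4 or 5 samples
```

Each row of the grid corresponds to one classifier, so the number of models grows multiplicatively with each additional parameter vector.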

What if we wanted to analyze multiple splits of a dataset simultaneously? This package provides a simple interface for executing Monte Carlo style bootstrapping, embedding split, fs, and pl modules all within a single wrapper. The plMonteCarlo function effectively iterates over the above modules (including plGrid) B times. Using this function requires custom argument handlers that organize the split, fs, and pl methods, respectively.

ss <- ctrlSplitSet(func = "splitSample", percent.include = 67)
fs <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs <- ctrlGridSearch(func = "plGrid",
                     how = "buildSVM",
                     top = c(5, 10, 25, 50),
                     kernel = "linear",
                     cost = 10^(-3:3),
                     fold = NULL)
boot <- plMonteCarlo(array, B = 5,
                     ctrlSS = ss,
                     ctrlFS = fs,
                     ctrlGS = gs)
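The structure of this procedure can be sketched in base R as a loop that repeats the split, feature selection, build, and predict steps B times. The example below is a simplified illustration on simulated data, not the exprso implementation: it uses a nearest-centroid classifier as a stand-in for the SVM so that it runs without additional packages, and all variable names are hypothetical.

```r
set.seed(12345)
# Simulated data: 50 features (rows) by 40 samples (columns), two classes
labels <- rep(c("Control", "Case"), each = 20)
x <- matrix(rnorm(50 * 40), nrow = 50)
x[1:5, labels == "Case"] <- x[1:5, labels == "Case"] + 3

B <- 5
accuracy <- numeric(B)
for (b in seq_len(B)) {
  # split: draw a fresh random training set for this iteration
  train <- sample(ncol(x), size = round(0.67 * ncol(x)))
  y.train <- labels[train]

  # fs: rank features by t-test p-value on the training samples only
  p <- apply(x[, train], 1, function(f) {
    t.test(f[y.train == "Case"], f[y.train == "Control"])$p.value
  })
  top <- order(p)[1:5]

  # build: nearest-centroid classifier (a simple stand-in for an SVM)
  centroid.case <- rowMeans(x[top, train[y.train == "Case"], drop = FALSE])
  centroid.ctrl <- rowMeans(x[top, train[y.train == "Control"], drop = FALSE])

  # predict: assign each held-out sample to the closer centroid
  test <- x[top, -train, drop = FALSE]
  d.case <- colSums((test - centroid.case)^2)
  d.ctrl <- colSums((test - centroid.ctrl)^2)
  predicted <- ifelse(d.case < d.ctrl, "Case", "Control")
  accuracy[b] <- mean(predicted == labels[-train])
}
accuracy        # one held-out accuracy per random split
mean(accuracy)  # average performance across the B splits
```

Because feature selection is repeated inside every iteration, each held-out accuracy reflects the whole pipeline rather than a single fixed feature set.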

We refer you to the official vignette for a more comprehensive discussion of the exprso package, including an elaboration of the modules introduced here.

  1. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

News


  • 2.1-import.R
    • Make Biobase and GEOquery optional packages.
  • 5.1-fs.R
    • Make limma an optional package.
    • Renamed from 5.1-fs-binary.R.
    • Add fsNULL method to return feature input unaltered.
    • Add fsANOVA feature selection method.
  • 5.2-build.R
    • Renamed from 5.2-build-binary.R.
  • 7.2-plGrid.R
    • Add back-end argument handler makeGridFromArgs.
    • Add default arguments for how = buildSVM.
  • 7.3-plGridMulti.R
    • New function for 1-vs-all classification with ctrlFS.
  • 7.4-plMonteCarlo.R
    • Renamed from 7.3-plMonteCarlo.R.
    • ctrlGS will now accept any pl function.
    • Use rbind.fill instead of rbind to prevent error.
  • 7.5-plNested.R
    • Renamed from 7.4-plNested.R.
    • ctrlGS will now accept any pl function.
    • Use rbind.fill instead of rbind to prevent error.
    • Add plGridMulti check to check.ctrlGS.

  • 1.2-methods.R
    • Rename getProbeSet to getFeatures.
  • 2.1-import.R
    • Rename probes.begin argument to begin.
  • 4.1-modSwap.R
    • Have fsPrcomp use top argument.
  • 4.2-modCluster.R
    • Change probes argument to top.
  • 5.1-fs-binary.R
    • Change probes argument to top.
    • Make fsSample an ExprsArray method.
  • 5.2-build-binary.R
    • Change probes argument to top.
    • Add ExprsArray signature to all build methods.
    • Add doMulti call to build. function.
  • 5.3-doMulti.R
    • Change probes argument to top.
    • Change what argument to method to allow for do.call.
    • Use NA instead of NULL for empty ExprsMachine.
    • Remove ExprsMulti method for fsSample.
    • Remove ExprsMulti method for fsStats.
  • 6-predict.R
    • Use NA instead of NULL for empty ExprsMachine.
    • Fix ExprsMulti tie breaker warning.
  • 7.1-plCV.R
    • Change probes argument to top.
  • 7.2-plGrid.R
    • Change probes argument to top.
  • 7.3-plMonteCarlo.R
    • Change probes argument to top.
  • 8.1-pipe.R
    • Change top.N argument to top.
  • 8.2-ens.R
    • Change top.N argument to top.
  • 9-global.R
    • Add documentation for data.

  • 1.1-classes.R
    • Add actual slot to ExprsPredict object.
  • 1.2-methods.R
    • Add 2D plotting to plot method.
  • 2.1-import.R
    • Force stringsAsFactors = FALSE.
  • 4.1-modSwap.R
    • Clean up plot calls using new plot method.
  • 6-predict.R
    • Add class check for modHistory @reductionModel.
    • Add class check for predict @mach.
    • Pass along known class values to ExprsPredict result.
    • Remove calcStats array argument.
  • 7.1-plCV.R
    • Tidy calcStats calls.
  • 7.2-plGrid.R
    • Tidy calcStats calls.
  • 8.2-ens.R
    • Pass along known class values to ExprsPredict result.
  • 9-global.R
    • Find optimal import combination.
  • 9-tidy.R
    • Add pipeSubset function.

  • 2.1-import.R
    • Deprecated arrayRead and arrayEset functions.
    • New arrayExprs function adds data.frame support.
  • 3.1-split.R
    • Remove splitSample warning.
    • Add details to the "Please Read" vignette.
  • 4.2-cluster.R
    • Add support for numeric vector probes.
  • 4.3-compare.R
    • Convert compare warning into error.
  • 5.1-fs-binary.R
    • Add imports to NAMESPACE.
    • Add fs. method to wrap repetitive code.
    • Add support for numeric vector probes.
    • Remove the fsStats "ks-boot" method.
    • Remove fsPenalizedSVM.
  • 5.2-build-binary.R
    • Add imports to NAMESPACE.
    • Add build. method to wrap repetitive code.
    • Add support for numeric vector probes.
  • 7.1-plCV.R
    • Remove plCV warning.
    • Add details to the "Please Read" vignette.
  • 7.2-plGrid.R
    • Convert plGrid warning into a message.
    • Consolidate numeric probes handling.
    • Add handling for a list of numeric or character probes.
    • Add details to the "Please Read" vignette.
  • 8.1-pipe.R
    • Convert pipeFilter warning into a message.
    • Add details to the "Please Read" vignette.
  • 9-deprecated.R
    • Contains deprecated functions.
  • 9-tidy.R
    • Add getArgs, defaultArg, and forceArg functions.
    • Add trainingSet, validationSet, and testSet functions.
    • Add modSubset wrapper for subset method.

  • 1.1-classes.R
    • Fixed warnings and notes.
  • 1.2-methods.R
    • Fixed warnings and notes.
  • 2.1-import.R
    • Fixed warnings and notes.
  • 2.2-misc.R
    • Code renamed to file 2.2-process.R.
    • Fixed warnings and notes.
  • 3.1-split.R
    • Fixed warnings and notes.
  • 4.1-modSwap.R
    • Store mutated annotation as boolean (not factor).
    • Fixed warnings and notes.
  • 4.2-modCluster.R
    • Fixed warnings and notes.
  • 4.3-compare.R
    • Fixed warnings and notes.
  • 5.1-fs-binary.R
    • Fixed warnings and notes.
  • 5.2-build-binary.R
    • Fixed warnings and notes.
  • 5.3-doMulti.R
    • Fixed warnings and notes.
  • 6-predict.R
    • Fixed warnings and notes.
  • 8.1-pipe.R
    • Fixed warnings and notes.
  • 8.2-ens.R
    • Fixed warnings and notes.
  • 9-global.R
    • Contains global imports.

  • 1.2-methods.R
    • Added subset method for ExprsArray objects.
    • Added subset method for ExprsPipeline objects.
  • 3-split.R
    • Code renamed to file 3.1-split.R.
  • 4-conjoin.R
    • Code renamed to file 3.2-conjoin.R.
    • modMutate merged with modSwap as 4.1-modSwap.R.
    • modCluster method added as 4.2-modCluster.R.
    • compare method added as 4.3-compare.R.
    • compare now handles ExprsMulti objects.
    • compare test added to validate method.
  • 5.1-fs-binary.R
    • fsStats has tryCatch to address rare error.

  • 1.2-methods.R
    • summary method now accommodates lists of vector arguments.
  • 5.2-build-binary.R
    • Added buildDNN.ExprsBinary method.
    • Removed e1071 cross-validation.
  • 5.3-doMulti.R
    • Added buildDNN.ExprsMulti method.
  • 6-build.R
    • Added buildDNN predict clause.
  • 7.2-plGrid.R
    • plGrid method now accommodates lists of vector arguments.
    • Removed e1071 cross-validation.

  • Project now organized in a package distribution format.
  • 0-misc.R
    • Temporarily removed compare function.
    • Code renamed to file 2.2-misc.R.
  • 1-classes.R
    • Code divided into files 1.1-classes.R and 1.2-methods.R.
    • Remaining 1-classes.R code renamed to 4-conjoin.R.
    • Removed getCases and getConts. Use [ and $ instead.
    • getProbeSet extended to replace getProbeSummary.
  • 2-import.R
    • Code renamed to file 2.1-import.R.
  • 3-split.R
    • arraySubset replaced with [ and $ in 1.1-classes.R.
    • splitSample code heavily edited, including an all.in bug fix.
    • splitStratify now handles ExprsMulti objects.
  • 4-speakEasy.R
    • Temporarily removed speakEasy and abridge functions.
  • 5-fs.R
    • Code renamed to file 5.1-fs-binary.R.
  • 6-build.R
    • reRank function serializes doMulti fs added to 5.3-doMulti.R.
    • fsSample and fsStats now have ExprsMulti methods.
    • Some code moved to 5.2-build-binary.R and 5.3-doMulti.R.
    • Remaining 6-build.R code renamed to 6-predict.R.
  • 7-pl.R
    • Code divided into a separate file for each pl method.
    • Replaced ctrlGS (ctrlGridSearch) with ctrlPL (ctrlPipeLine).
    • plCV, fixed "1-subject artifact" with drop = FALSE.
    • plNested, fixed "1-subject artifact" with drop = FALSE.
    • plNested argument checks moved to separate function.
  • 8-ens.R
    • Code divided into files 8.1-pipe.R and 8.2-ens.R.
    • Removed pipeSubset. Use [ and $ instead.


install.packages("exprso")

0.1.8 by Thomas Quinn, 10 months ago


http://github.com/tpq/exprso


Report a bug at http://github.com/tpq/exprso/issues


Browse source code at https://github.com/cran/exprso


Authors: Thomas Quinn [aut, cre], Daniel Tylee [ctb]




GPL-2 license


Imports affy, Biobase, cluster, MASS, e1071, lattice, methods, mRMRe, nnet, pathClass, plyr, stats, randomForest, ROCR, sampling

Depends on kernlab

Suggests GEOquery, h2o, golubEsets, knitr, limma, magrittr, rmarkdown, testthat

