Two implementations of canonical correlation analysis
(CCA) that are based on iterated regression. By choosing the
appropriate regression algorithm for each data domain, it is
possible to enforce sparsity, non-negativity or other kinds of
constraints on the projection vectors. Multiple canonical
variables are computed sequentially using a generalized
deflation scheme, where the additional correlation not
explained by previous variables is maximized. 'nscancor' is
used to analyze paired data from two domains, and has the same
interface as the 'cancor' function from the 'stats' package
(plus some extra parameters). 'mcancor' is appropriate for
analyzing data from three or more domains. See
Sigg et al. (2007).
An R package for non-negative and sparse canonical correlation analysis (CCA).
CCA is a method for finding associations between paired data sets. For example, a health study might record gene expression levels and a number of physiological parameters for a patient cohort. If one conjectures that the cause of the physiological symptoms has a genetic component, one could expect to find a correlation between the expression of certain genes and the strength of certain symptoms. CCA finds a pair of linear projections (called canonical vectors), one for each data modality, such that the projected values (called canonical variables) have maximum correlation. The next pair of canonical variables is found by again maximizing their correlation, under the additional constraint that they have to be uncorrelated with all previous ones, and so on.
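As a point of reference, classical CCA can be tried out with the `cancor` function from the `stats` package. The following is a minimal sketch on simulated data; the data set and all variable names are made up for illustration.

```r
# Simulated paired data: both data sets share one latent signal z.
set.seed(1)
n <- 200
z <- rnorm(n)
x <- cbind(z + 0.5 * rnorm(n), rnorm(n), rnorm(n))  # three features
y <- cbind(rnorm(n), z + 0.5 * rnorm(n))            # two features

cc <- cancor(x, y)  # classical, unconstrained CCA

# Canonical variables: center each data set, then project it onto
# its canonical vectors (the columns of xcoef and ycoef).
xc <- scale(x, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
yc <- scale(y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef

cc$cor                 # canonical correlations, largest first
cor(xc[, 1], yc[, 1])  # matches cc$cor[1]
cor(xc[, 1], yc[, 2])  # approximately zero: different pairs are uncorrelated
```

The first canonical pair should pick up the shared signal, that is, put most of its weight on the first feature of `x` and the second feature of `y`.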
CCA was first introduced by Hotelling in 1936, and has many similarities to principal component analysis (PCA). Where the PCA solution is computed from the eigenvalue decomposition (EVD) of the covariance matrix of a single data set, the CCA solution is computed from the EVD of the cross-covariance matrix of the two data sets. This approach is very efficient, but one sometimes encounters the following problems during an analysis. First, if at least one of the data sets contains more features than samples (a common case for gene expression data), there exist infinitely many trivial projections that achieve perfect correlation. Regularization of the canonical vectors is necessary to make the problem well-posed again. Second, the projections are typically linear combinations with non-zero weights for all features, which makes the weights difficult to interpret. A sparse solution which only includes a small number of important features is often desirable.
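The first problem is easy to reproduce. The sketch below uses pure noise (all names are illustrative): with more features than samples, a projection of `x` can match an unrelated target exactly, so the resulting correlation is perfect but meaningless.

```r
set.seed(1)
n <- 10   # samples
p <- 20   # features, more than samples

x <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)  # centered noise
y <- rnorm(n)                                          # unrelated target
y <- y - mean(y)

# Minimum-norm solution of x %*% w = y via the SVD pseudoinverse.
# Centering removes one degree of freedom, so the (numerically) zero
# singular value is dropped.
s <- svd(x)
keep <- s$d > 1e-8 * s$d[1]
w <- s$v[, keep] %*% ((t(s$u[, keep]) %*% y) / s$d[keep])

drop(cor(x %*% w, y))  # equal to 1 up to floating-point error
```

Adding any vector from the null space of `x` to `w` gives another perfect solution, which is why a bound on the norm of the weights is needed.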
This package implements a CCA algorithm called
nscancor, which can
enforce appropriate constraints on the canonical vectors to address
both aforementioned problems. Enforcing a bound on the Euclidean norm
(also called the L2 norm) of the projections avoids trivial
correlations. Enforcing a bound on the L1 norm leads to sparse
solutions, where many of the weights are exactly zero. Enforcing
non-negativity of the projection weights is useful for analyzing data
where only a positive influence of features is deemed appropriate. The
algorithm executes iterated regression steps, and the constraints
enter via the regression functions.
nscancor is therefore modular,
and builds on the many regression methods that are
available, e.g. ridge regression or the elastic net. By using two
different regression functions, the proper constraints can be enforced
for each domain.
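To illustrate the idea of iterated regression, here is a simplified sketch for the first pair of canonical variables only. It is not the package's implementation: the function `first_pair` and its `xreg`/`yreg` arguments are hypothetical, and deflation, random restarts and convergence checks are left out. Each regression function takes a data matrix and a target vector and returns a weight vector, so swapping in a different function changes the constraint for that domain.

```r
# Two interchangeable regression functions (illustrative only).
ols <- function(x, t) qr.solve(x, t)            # ordinary least squares
ridge <- function(lambda) function(x, t)        # L2-penalized least squares
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, t))

first_pair <- function(x, y, xreg = ols, yreg = ols, iter = 100) {
  x <- scale(x, scale = FALSE)
  y <- scale(y, scale = FALSE)
  v <- rowSums(y)                        # initial guess for y's canonical variable
  for (i in seq_len(iter)) {
    wx <- xreg(x, v / sqrt(sum(v^2)))    # regress y's canonical variable on x
    u  <- drop(x %*% wx)
    wy <- yreg(y, u / sqrt(sum(u^2)))    # regress x's canonical variable on y
    v  <- drop(y %*% wy)
  }
  list(xcoef = drop(wx), ycoef = drop(wy), cor = cor(u, v))
}

# Simulated data with one shared latent signal z.
set.seed(1)
n <- 200
z <- rnorm(n)
x <- cbind(z + 0.5 * rnorm(n), rnorm(n), rnorm(n))
y <- cbind(rnorm(n), z + 0.5 * rnorm(n))

fit_ols   <- first_pair(x, y)                     # unconstrained in both domains
fit_ridge <- first_pair(x, y, xreg = ridge(50))   # L2 penalty on x only

fit_ols$cor         # agrees with cancor(x, y)$cor[1]
fit_ridge$cor       # no larger than the unconstrained correlation
```

With ordinary least squares in both domains, the iteration recovers the classical first canonical correlation. A non-negative or sparse regression method could be plugged in the same way.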
The package also provides a generalization of constrained CCA for
analyzing more than two data sets. The
mcancor algorithm is
structurally analogous to
nscancor, but it maximizes the sum of all
pairwise correlations of canonical variables. As with nscancor,
specifying the regression function for each domain makes it possible
to enforce appropriate constraints on each canonical vector.
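The objective for more than two data sets can be written down in a few lines. The helper below is a hypothetical illustration of the sum of pairwise correlations, not a function of the package: it takes a list of data matrices and a list of weight vectors, one per domain.

```r
# Sum of all pairwise correlations between the projected data sets.
sum_pairwise_cor <- function(xs, ws) {
  # One canonical variable per domain, collected as columns of a matrix.
  cv <- mapply(function(x, w) drop(x %*% w), xs, ws)
  r <- cor(cv)
  sum(r[upper.tri(r)])   # each pair of domains is counted once
}

# Three simulated domains that share one latent signal z.
set.seed(1)
n <- 100
z <- rnorm(n)
xs <- list(
  cbind(z + 0.5 * rnorm(n), rnorm(n)),
  cbind(rnorm(n), z + 0.5 * rnorm(n), rnorm(n)),
  cbind(z + 0.5 * rnorm(n), rnorm(n))
)

ws_signal <- list(c(1, 0), c(0, 1, 0), c(1, 0))  # weights on the signal features
ws_noise  <- list(c(0, 1), c(1, 0, 0), c(0, 1))  # weights on pure noise features

sum_pairwise_cor(xs, ws_signal)  # large: all three projections track z
sum_pairwise_cor(xs, ws_noise)   # small: the projections are unrelated
```

For three domains the value is bounded above by 3, which is reached only if all three canonical variables are perfectly correlated.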
This blog post explains how to use the package and demonstrates its benefits.