Adjacency-Constrained Clustering of a Block-Diagonal Similarity Matrix

Implements a constrained version of hierarchical agglomerative clustering, in which each observation is associated with a position and only adjacent clusters can be merged. Typical applications in bioinformatics include Genome-Wide Association Studies and Hi-C data analysis, where the similarity between items is a decreasing function of their genomic distance. Taking advantage of this feature, the implemented algorithm is time and memory efficient. This algorithm is described in Chapter 4 of Alia Dehman (2015) <https://hal.archives-ouvertes.fr/tel-01288568v1>.



adjclust is a package that provides methods to perform adjacency-constrained hierarchical agglomerative clustering (HAC). In this variant of HAC, each observation is associated with a position, and the clustering is constrained so that only adjacent clusters are merged. It is useful in bioinformatics (e.g. Genome-Wide Association Studies or Hi-C data analysis).
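
To make the adjacency constraint concrete, here is a minimal base-R sketch of a naive constrained HAC. It is not the algorithm implemented in adjclust, which is designed to be time and memory efficient. The sketch assumes Ward's linkage computed from a similarity matrix, and the function name naive_adj_hac and its output fields are hypothetical, chosen only for this illustration.

```r
# Naive adjacency-constrained HAC (illustration only, not the adjclust algorithm).
# Assumption: Ward's linkage expressed with a similarity (kernel) matrix `sim`.
naive_adj_hac <- function(sim) {
  n <- nrow(sim)
  # Each cluster is a contiguous range of positions [starts[i], ends[i]]
  starts <- seq_len(n)
  ends <- seq_len(n)
  merged <- matrix(NA_integer_, nrow = n - 1, ncol = 2)  # range created at each step
  height <- numeric(n - 1)                               # linkage value at each step

  # Ward's linkage between two sets of positions a and b
  ward <- function(a, b) {
    na <- length(a)
    nb <- length(b)
    saa <- sum(sim[a, a]) / na^2
    sbb <- sum(sim[b, b]) / nb^2
    sab <- sum(sim[a, b]) / (na * nb)
    (na * nb / (na + nb)) * (saa + sbb - 2 * sab)
  }

  for (step in seq_len(n - 1)) {
    k <- length(starts)
    # Only pairs made of a cluster and its right-hand neighbour are candidates
    link <- vapply(seq_len(k - 1), function(i) {
      ward(starts[i]:ends[i], starts[i + 1]:ends[i + 1])
    }, numeric(1))
    best <- which.min(link)
    height[step] <- link[best]
    merged[step, ] <- c(starts[best], ends[best + 1])
    # Fuse cluster `best` with its right-hand neighbour
    ends[best] <- ends[best + 1]
    starts <- starts[-(best + 1)]
    ends <- ends[-(best + 1)]
  }
  list(merged = merged, height = height)
}

sim <- matrix(c(1.0, 0.5, 0.2, 0.1,
                0.5, 1.0, 0.1, 0.2,
                0.2, 0.1, 1.0, 0.6,
                0.1, 0.2, 0.6, 1.0), nrow = 4)
res <- naive_adj_hac(sim)
res$merged  # first and last position of the cluster formed at each step
res$height  # linkage value of each merge
```

Each step only evaluates pairs of neighbouring clusters, so positions 2 and 3 can never be merged before one of them has been merged with its other neighbour. This naive version recomputes every linkage from scratch at each step and is only meant to show the constraint.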

adjclust provides three user-level functions: adjClust, snpClust and hicClust, which are briefly explained below.

Installation

You can install adjclust from GitHub with:

devtools::install_github("pneuvial/adjclust")

adjClust

adjClust performs adjacency-constrained HAC on standard and sparse matrices of similarities or dissimilarities, and on dist objects. Matrix::dgCMatrix and Matrix::dsCMatrix are the supported sparse matrix classes. Let's look at a basic example:

library("adjclust")

# Symmetric similarity matrix for 4 ordered items
sim <- matrix(c(1.0, 0.5, 0.2, 0.1,
                0.5, 1.0, 0.1, 0.2,
                0.2, 0.1, 1.0, 0.6,
                0.1, 0.2, 0.6, 1.0), nrow = 4)
h <- 3  # band width
fit <- adjClust(sim, "similarity", h)
plot(fit)

The result is of class chac. It can be plotted as a dendrogram with plot(fit). The successive merges and the corresponding heights are available as fit$merge and fit$height, respectively.
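
As a further sketch, the same call can be made on a sparse matrix or on a dist object, and the components of the result can be inspected directly. This assumes that the adjclust and Matrix packages are installed. The object names sparse_sim, fit_sparse, fit_dist and the toy vector x are hypothetical and only serve the illustration.

```r
library("adjclust")
library("Matrix")

sim <- matrix(c(1.0, 0.5, 0.2, 0.1,
                0.5, 1.0, 0.1, 0.2,
                0.2, 0.1, 1.0, 0.6,
                0.1, 0.2, 0.6, 1.0), nrow = 4)

# Same similarity, stored as a sparse Matrix object
sparse_sim <- Matrix(sim, sparse = TRUE)
fit_sparse <- adjClust(sparse_sim, type = "similarity", h = 3)

# Dissimilarity input given as a 'dist' object
# (x is a made-up vector of 4 observations, ordered by position)
x <- c(0.1, 0.2, 1.0, 1.3)
fit_dist <- adjClust(dist(x), type = "dissimilarity", h = 3)

# Components of the 'chac' result
fit_sparse$merge   # successive merges
fit_sparse$height  # height of each merge
```

With 4 items there are 3 merges, so fit_sparse$merge has 3 rows and fit_sparse$height has 3 entries.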

snpClust

snpClust performs adjacency-constrained HAC for the specific application of Genome-Wide Association Studies (GWAS). A minimal example is given below. See the GWAS vignette for details.

library("snpStats")
#> Loading required package: survival
#> Loading required package: Matrix
data("ld.example", package = "snpStats")
geno <- ceph.1mb[, -316]  ## drop one SNP leading to one missing LD value
h <- 100
ld.ceph <- ld(geno, stats = "R.squared", depth = h)
image(ld.ceph, lwd = 0)

 
fit <- snpClust(geno, stats = "R.squared", h = h)
#> Note: 132 merges with non increasing heights.
plot(fit)
#> Warning in plot.chac(fit): 
#> Detected reversals in dendrogram: mode = 'corrected', 'within-disp' or 'total-disp' might be more relevant.

sel_clust <- select(fit, "bs")
plotSim(as.matrix(ld.ceph), clustering = sel_clust, dendro = fit)

hicClust

hicClust performs adjacency-constrained HAC for the specific application of Hi-C data analysis. A minimal example is given below. See the Hi-C vignette for details.

library("HiTC")
#> Warning: multiple methods tables found for 'acbind'
#> Warning: multiple methods tables found for 'arbind'
#> Warning: multiple methods tables found for 'rglist'
load(system.file("extdata", "hic_imr90_40_XX.rda", package = "adjclust"))
binned <- binningC(hic_imr90_40_XX, binsize = 5e5)
#> Bin size 'xgi' =500488 [1x500488]
#> Bin size 'ygi' =500488 [1x500488]
mapC(binned)
#> minrange= 104  - maxrange= 36776.8

 
fitB <- hicClust(binned)
#> Note: 5 merges with non increasing heights.
plot(fitB)
#> Warning in plot.chac(fitB): 
#> Detected reversals in dendrogram: mode = 'corrected', 'within-disp' or 'total-disp' might be more relevant.

plotSim(intdata(binned), dendro = fitB) # default: log scale for colors

Credits

Version 0.4.0 of this package was completed by Shubham Chaturvedi as a part of the Google Summer of Code 2017 program.

News

Version 0.5.7 [2018-09-26]

  • Example Hi-C data now 10x smaller (subset of the original one). The package is smaller and tests are faster.
  • implemented a model selection approach based on slope heuristic or on the broken stick heuristic to select a relevant number of clusters
  • fixed minor problems in some method definition for class 'chac'
  • proposed a log-transformation of data in the wrapper 'hicClust'
  • implemented a heatmap with possible highlighting of the constrained clustering
  • implemented an option to display number of the merge on the dendrogram

Version 0.5.6 [2018-02-08]

  • moved dependencies on Bioconductor packages 'HiTC' and 'snpStats' to Suggests; they are now used conditionally

Version 0.5.5 [2018-01-30]

  • simplified code (replaced many C functions by a unique R function using Matrix)
  • adjClust now properly handles similarities with diagonal entries different from 1
  • removed arguments that were not used (blMin and verbose)
  • simplified Hi-C example

Version 0.5.4 [2018-01-12]

  • More tests for modify and modifySparse
  • BUG FIX in condnCheck

Version 0.5.3 [2017-12-04]

  • 'height' is now defined as the value of the linkage criterion (as is done in 'hclust'), rather than the total inertia of the clustering (as is done in 'rioja').
  • Added several representations for the dendrogram corresponding to different choices for the height.
  • Improved documentation and vignettes.
  • Removed non-standard fields in the output of 'adjclust' (#13).
  • Added tests for: equivalence with 'hclust', comparing sum of heights and pseudo inertia, plots, non-increasing heights, cutree (#14).
  • Fixed #13 (man).
  • Fixed #15 (Cutree with decreasing merges).
  • Fixed #3 (Non-positive 'gains').
  • Using BiocStyle::html_document2 as a temporary fix for vignette compilation errors.

Version 0.5.2 [2017-10-17]

  • Added citation to Alia Dehman's PhD thesis to DESCRIPTION.

Version 0.5.1 [2017-10-16]

  • More informative 'Description' of the method in DESCRIPTION
  • Updates to test scripts to pass R CMD check on all windows platforms
  • Moved README-*.png files to man/figures

Version 0.5.0 [2017-10-13]

  • Bump version number for CRAN submission

Version 0.4.2 [2017-10-05]

  • Added 'chac' S3 class and corresponding 'plot' and 'summary' methods
  • Documentation cleanups
  • Removed objects "R2.100" and "Dprime.100" (can be obtained from the imported 'snpStats' package)
  • In 'snpClust': argument 'stat' is now passed to the 'snpStats::ld' function through '...'
  • Some code cleanups
  • Improved handling of default value for 'h' in 'adjclust' for 'dist' objects
  • Renamed 'prevfit' into the more explicit 'res_adjclust_0.3.0'
  • Dropped 'simmatrix' toy data set (now generated on the fly in tests)

Version 0.4.1 [2017-09-15]

  • Cleanups in Hi-C and LD vignettes and corresponding tests
  • Dropped outdated BALD test script
  • Added test script for NA values in LD
  • Renamed Hi-C data sets and updated corresponding documentation
  • Added package website generated by pkgdown

Version 0.4.0 [2017-08-29]

  • Implemented interface to handle standard and sparse matrices in adjClust
  • Implemented interface to handle either kernel or dissimilarities
  • Implemented wrapper for SNP and Hi-C data
  • Documented the package and created vignettes for the different use cases
  • Added scripts to increase package coverage and test the equivalence with rioja for the small dimensional case
  • Cleaned up code to improve efficiency and removed unnecessary scripts and functions

Version 0.3.0 [2017-02-13]

  • Removed 'adjClustBand': main entry points are now 'HeapHop' and 'adjClustBand_heap'.
  • Updated test scripts and LD vignette accordingly.
  • Added Travis CI and Appveyor support.

Version 0.2.*

Version 0.2.3 [2017-02-02]

  • Updated LD vignette
  • In adjClustBand, renamed flavor "Koskas" to "PseudoMatrix"

Version 0.2.2 [2016-12-01]

  • Added dummy R/adjclust.R so that document() adds 'importFrom Rcpp evalCpp' to NAMESPACE
  • "Fixed" warning at check due to .hpp file in src (this warning should not exist IMHO)

Version 0.2.1 [2016-11-09]

  • Added minimal documentation
  • Replaced "std::cout" by "Rcpp::Rcout", and so on for "exit()" and "cerr".

Version 0.2.0 [2016-06-24]

  • Incorporated Michel's implementation (R function 'HeapHop')
  • 'adjClustBand' is now a wrapper to call either Alia's or Michel's implementation

Version 0.1.0 [2016-06-24]

  • Created from BALD
  • Added a test to check that we are reproducing the results of BALD::cWard


install.packages("adjclust")

0.5.7 by Pierre Neuvial, 9 months ago


https://github.com/pneuvial/adjclust


Report a bug at https://github.com/pneuvial/adjclust/issues


Browse source code at https://github.com/cran/adjclust


Authors: Christophe Ambroise [aut], Shubham Chaturvedi [aut], Alia Dehman [aut], Michel Koskas [aut], Pierre Neuvial [aut, cre], Guillem Rigaill [aut], Nathalie Vialaneix [aut]




GPL-3 license


Imports stats, graphics, grDevices, Matrix, matrixStats, methods, utils, capushe

Suggests knitr, testthat, rmarkdown, rioja, HiTC, snpStats, BiocGenerics

