Performs cluster analysis using an ensemble clustering framework, Chiu & Talhouk (2018).
The goal of diceR is to provide a systematic framework for generating diverse cluster ensembles in R. Cluster analysis involves many nuances. We provide a process and a suite of functions and tools that guide the user through generating diverse clustering solutions from data, forming ensembles, selecting algorithms, and arriving at a final consensus solution. We have also developed visual and analytical validation tools to help assess the final result. The wrapper function dice() lets the user obtain and assess results easily. Thus, the package is accessible to end users with limited statistical knowledge, while informaticians and statisticians have full access to the underlying functions, which are easily extended. More details can be found in our companion paper published in BMC Bioinformatics.
You can install diceR from CRAN with:
install.packages("diceR")
Or get the latest development version from GitHub:
devtools::install_github("AlineTalhouk/diceR")
The following example shows how to use the main function of the package, dice(). The data matrix hgsc contains a subset of gene expression measurements of High Grade Serous Carcinoma ovarian cancer patients from the publicly available Cancer Genome Atlas datasets, with samples as rows and features as columns. In the call below, we specify nk clusters (a single value or a range) over reps subsamples of the data, each containing 80% of the full samples. We also specify the clustering algorithms to be used and, in cons.funs, the ensemble functions used to aggregate them.
library(diceR)
data(hgsc)
obj <- dice(hgsc, nk = 4, reps = 5, algorithms = c("hc", "diana"),
            cons.funs = c("kmodes", "majority"))
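To make the aggregation step concrete, here is a minimal base R sketch of majority voting over an ensemble of cluster assignments. It is an illustration only and not diceR's implementation: the toy matrix `ensemble` and the helper `majority_vote()` are hypothetical, and the sketch assumes cluster labels are already aligned across the base clusterings.

```r
# Toy ensemble: rows are samples, columns are base clusterings.
# Assumption: label 1 means the same cluster in every column.
ensemble <- matrix(
  c(1, 1, 2,
    1, 1, 1,
    2, 2, 2,
    2, 1, 2,
    3, 3, 3),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("s", 1:5), c("run1", "run2", "run3"))
)

# Hypothetical helper: return the most frequent label among the votes.
# Ties go to the smallest label because which.max() picks the first maximum.
majority_vote <- function(votes) {
  counts <- table(votes)
  as.integer(names(counts)[which.max(counts)])
}

# One consensus label per sample
consensus <- apply(ensemble, 1, majority_vote)
consensus
```

In practice, labels from different algorithms usually need to be matched to a common reference before votes can be counted, which this sketch skips.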
The first few cluster assignments are shown below:
knitr::kable(head(obj$clusters))
Sample | kmodes | majority |
---|---|---|
TCGA.04.1331_PRO.C5 | 3 | 3 |
TCGA.04.1332_MES.C1 | 3 | 3 |
TCGA.04.1336_DIF.C4 | 1 | 3 |
TCGA.04.1337_MES.C1 | 3 | 3 |
TCGA.04.1338_MES.C1 | 3 | 3 |
TCGA.04.1341_PRO.C5 | 3 | 3 |
You can also compare the base algorithms with the cons.funs using internal evaluation indices:
knitr::kable(obj$indices$ii$`4`)
Algorithms | calinski_harabasz | dunn | pbm | tau | gamma | c_index | davies_bouldin | mcclain_rao | sd_dis | ray_turi | g_plus | silhouette | s_dbw | Compactness | Connectivity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HC_Euclidean | 4.945499 | 0.3025234 | 38.34704 | 0.1992999 | 0.5598731 | 0.3122823 | 3.100302 | 0.8237540 | 0.1795670 | 3.0886000 | 0.0278858 | 0.0300838 | NaN | 24.81662 | 49.69405 |
DIANA_Euclidean | 51.332198 | 0.3348103 | 32.92726 | 0.4271483 | 0.6216897 | 0.1639431 | 3.037874 | 0.8077658 | 0.2034291 | 3.1687896 | 0.0892952 | 0.0700862 | NaN | 22.05147 | 227.34841 |
kmodes | 39.127460 | 0.3352598 | 49.27019 | 0.3907289 | 0.5528538 | 0.2020221 | 1.563373 | 0.8254116 | 0.1046540 | 1.1356906 | 0.1116735 | NaN | 0.7207352 | 22.66419 | 148.61865 |
majority | 5.645220 | 0.4315581 | 96.93674 | 0.2221915 | 0.7330421 | 0.2458043 | 1.379460 | 0.7781939 | 0.0948754 | 0.8261741 | 0.0122634 | NaN | 0.7224928 | 24.70600 | 24.35079 |
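One way to read a table like this is to rank the algorithms within each index. The base R sketch below copies four columns from the table above and ranks them. The direction assumed for each index is stated in the code: the Dunn index is treated as better when larger, while Davies-Bouldin, compactness, and connectivity are treated as better when smaller. The object names are illustrative and not part of diceR.

```r
# Values copied from the table above
indices <- data.frame(
  Algorithms     = c("HC_Euclidean", "DIANA_Euclidean", "kmodes", "majority"),
  dunn           = c(0.3025234, 0.3348103, 0.3352598, 0.4315581),
  davies_bouldin = c(3.100302, 3.037874, 1.563373, 1.379460),
  Compactness    = c(24.81662, 22.05147, 22.66419, 24.70600),
  Connectivity   = c(49.69405, 227.34841, 148.61865, 24.35079)
)

# Assumed direction of optimization for each index (TRUE = larger is better)
maximize <- c(dunn = TRUE, davies_bouldin = FALSE,
              Compactness = FALSE, Connectivity = FALSE)

# Rank the algorithms within each index; rank 1 is the best
ranks <- sapply(names(maximize), function(idx) {
  x <- indices[[idx]]
  rank(if (maximize[[idx]]) -x else x)
})
rownames(ranks) <- indices$Algorithms
ranks

# Best algorithm per index
best <- indices$Algorithms[apply(ranks, 2, which.min)]
names(best) <- colnames(ranks)
best
```

No single algorithm wins on every index here, which is why it helps to look at several indices together rather than one in isolation.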
This figure is a visual schematic of the pipeline that dice() implements.
Please visit the overview page for more detail.
Fix "length > 1 in coercion to logical" error in consensus_evaluate() due to comparisons using the || operator
Add suppressWarnings(RNGversion("3.5.0")) before call to set.seed() in examples, tests, and vignette to use old RNG sampling
Use .covrignore to exclude zzz.R from being considered in code coverage
Use dplyr version >= 0.7.5 to ensure bind_rows() works
Fixed bug where scaled matrix using the "robust" method in prepare_data() was nested in the data element (@AlineTalhouk, #134)
Add parameter hc.method in dice and consensus_cluster to pass to method parameter in stats::hclust (@JakeNel28, #130)
Remove dependencies on largeVis: package will be archived
Revert to using NMF since NNLM has been archived and NMF is back in active maintenance.
Choose fuzzifier m in cmeans using Equation 5 from https://academic.oup.com/bioinformatics/article/26/22/2841/227572 (thanks @Asduveneck)
Replace all code that depended on NMF with NNLM and pheatmap: CRAN notified that NMF will be archived because of inactive maintenance
Update .yml files default templates
Fix bug in consensus_cluster() where custom algorithms were excluded from output (thanks @phiala)
Use markdown language for documentation
Various performance improvements and code simplifications
Suppress success/fail message printout and fix input data to be matrix for block clustering
Fix bug in algii_heatmap() when k.method = "all" in dice()
Fix bug in calculating internal indices when data has categorical variables (thanks Kurt Salmela)
Updated object output names in consensus_evaluate()
Fix unit test in test-dice.R for R-devel
Add internal function: ranked algorithms vs internal validity indices heatmap graph
Fix bugs in graph_cdf() and graph_tracking() when only one k is selected
Progress messages in dice()
Fix bug in consensus_evaluate() when algorithm has NA for all PAC values
New dimension reduction methods: t-SNE, largeVis (@dustin21)
Better annotated progress bar using progress package
Speed up the operation that transforms a matrix to become "NMF-ready"
Simplify saving mechanism in consensus_cluster() such that only file.name needs to be specified; the save parameter has been removed
New algorithms: SOM, Fuzzy C-Means, DBSCAN (@dustin21, #118)
Added significance testing section to vignette
Fixed direction of optimization: compactness and connectivity should be minimized
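The changelog entry about the `||` operator refers to a common R pitfall: `||` expects a single TRUE/FALSE on each side, and passing a longer logical vector silently uses the first element, warns, or errors depending on the R version and settings. The sketch below shows the usual remedy in base R. The vector `pac` and the variable names are hypothetical and are not taken from the code of consensus_evaluate().

```r
# Hypothetical vector of PAC values with one missing entry
pac <- c(0.12, NA, 0.30)

# Fragile pattern (not run): `is.na(pac) || FALSE` hands a length-3
# logical vector to `||`.

# Safer pattern: reduce each condition to a single TRUE/FALSE first,
# then combine the scalars with `||`.
has_missing <- any(is.na(pac))
all_missing <- all(is.na(pac))
needs_attention <- has_missing || all_missing
needs_attention
```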
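The changelog entry about RNGversion() concerns the change in R 3.6.0 to the default algorithm used by sample(). A minimal base R sketch of the pattern it describes follows; the variable names are illustrative. Calling RNGversion("3.5.0") selects the older "Rounding" sampler and emits a warning about it, which is why the call is wrapped in suppressWarnings().

```r
# Switch to the sampler that was the default before R 3.6.0
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
old_kind <- RNGkind()[3]  # third element is the sample.kind setting
draw <- sample(10, 3)     # reproduces draws made under the old sampler

# Restore the defaults of the running R version
RNGversion(as.character(getRversion()))
new_kind <- RNGkind()[3]

c(old = old_kind, new = new_kind)
```

Restoring the defaults afterwards keeps the older sampler from leaking into the rest of the session.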
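The changelog entry about hc.method says the value is passed to the method parameter of stats::hclust. The base R sketch below shows, on random toy data, how that parameter changes the agglomeration rule and can change the resulting partition. The commented dice() call at the end is an untested illustration of how the parameter might be supplied and requires diceR and the hgsc data.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)  # 10 toy samples, 4 features
d <- dist(x)                       # Euclidean distances between samples

# Same distances, two different linkage methods, cut into 3 clusters
cl_average <- cutree(hclust(d, method = "average"), k = 3)
cl_ward    <- cutree(hclust(d, method = "ward.D2"), k = 3)

# Cross-tabulate the two partitions to see where they agree
table(cl_average, cl_ward)

# With diceR loaded, the linkage would be chosen via hc.method, e.g.:
# dice(hgsc, nk = 4, reps = 5, algorithms = "hc",
#      hc.method = "ward.D2", cons.funs = "majority")
```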
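The changelog entry about the fuzzifier m cites Equation 5 of the linked Bioinformatics article. The sketch below writes out the formula in the form it is commonly quoted from that article, with N the number of objects being clustered and D the number of dimensions. The function name `fuzzifier_m()` is hypothetical, and the constants should be checked against the paper before being relied on; this is not a copy of diceR's code.

```r
# Estimate the fuzzy c-means fuzzifier from the data dimensions.
# N: number of objects being clustered; D: number of dimensions.
fuzzifier_m <- function(N, D) {
  1 + (1418 / N + 22.05) * D^(-2) +
    (12.33 / N + 0.243) * D^(-0.0406 * log(N) - 0.1134)
}

# The estimate is always above 1 and, for fixed N, shrinks as D grows
m_low_dim  <- fuzzifier_m(N = 500, D = 10)
m_high_dim <- fuzzifier_m(N = 500, D = 100)
c(D10 = m_low_dim, D100 = m_high_dim)
```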