Gaussian Mixture Models, K-Means, Mini-Batch-Kmeans, K-Medoids and Affinity Propagation Clustering

Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering with the option to plot, validate, predict (new data) and estimate the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. For more information, see (i) "Clustering in an Object-Oriented Environment" by Anja Struyf, Mia Hubert, Peter Rousseeuw (1997), Journal of Statistical Software; (ii) "Web-scale k-means clustering" by D. Sculley (2010), ACM Digital Library; (iii) "Armadillo: a template-based C++ library for linear algebra" by Sanderson et al (2016), The Journal of Open Source Software; (iv) "Clustering by Passing Messages Between Data Points" by Brendan J. Frey and Delbert Dueck, Science 16 Feb 2007: Vol. 315, Issue 5814, pp. 972-976.



The ClusterR package provides Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering algorithms with the option to plot, validate, predict (new data) and find the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. More details on the functionality of ClusterR can be found in the package vignette and documentation.

UPDATE 16-08-2018

As of version 1.1.4, the ClusterR package allows R package maintainers to link between packages at the C++ (Rcpp) level. This means that the Rcpp functions of the ClusterR package can be called from the C++ files of another package. The steps below explain how this can be done:


Assume that an R package ('PackageA') calls one of the ClusterR Rcpp functions. The maintainer of 'PackageA' then has to:


  • 1st. install the ClusterR package to take advantage of the new functionality either from CRAN using,

 
install.packages("ClusterR")
 
 

or download the latest version from Github using the devtools package,


 
devtools::install_github('mlampros/ClusterR')
 
 

  • 2nd. update the DESCRIPTION file of 'PackageA' and especially the Imports, Depends and LinkingTo fields by adding the ClusterR package (besides any other packages),

 
Imports: ClusterR
Depends: ClusterR
LinkingTo: ClusterR
 
 

  • 3rd. update the NAMESPACE file of 'PackageA' by importing the ClusterR package (besides any other imports),

 
import(ClusterR)
 
 

  • 4th. open a new C++ file (for instance in Rstudio) and at the top of the file add the following 'headers', 'depends' and 'plugins',

 
#include <RcppArmadillo.h>
#include <ClusterRHeader.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]
 
 

The available functions can be found in the ClusterRHeader.h file.


A complete minimal example would be :


#include <RcppArmadillo.h>
#include <ClusterRHeader.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]

using namespace clustR;

// Thin wrapper that forwards all arguments to the mini_batch_kmeans method of ClustHeader.
// [[Rcpp::export]]
Rcpp::List mini_batch_kmeans(arma::mat& data, int clusters, int batch_size, int max_iters, int num_init = 1,
                             double init_fraction = 1.0, std::string initializer = "kmeans++",
                             int early_stop_iter = 10, bool verbose = false,
                             Rcpp::Nullable<Rcpp::NumericMatrix> CENTROIDS = R_NilValue,
                             double tol = 1e-4, double tol_optimal_init = 0.5, int seed = 1) {

  ClustHeader clust_header;

  return clust_header.mini_batch_kmeans(data, clusters, batch_size, max_iters, num_init, init_fraction,
                                        initializer, early_stop_iter, verbose, CENTROIDS, tol,
                                        tol_optimal_init, seed);
}
 
 

Then, by opening an R file a user can call the mini_batch_kmeans function using,


 
# assuming that the previous Rcpp code is saved in 'example.cpp'
Rcpp::sourceCpp('example.cpp')

set.seed(1)
dat = matrix(runif(100000), nrow = 1000, ncol = 100)

mbkm = mini_batch_kmeans(dat, clusters = 3, batch_size = 50, max_iters = 100, num_init = 2,
                         init_fraction = 1.0, initializer = "kmeans++", early_stop_iter = 10,
                         verbose = TRUE, CENTROIDS = NULL, tol = 1e-4, tol_optimal_init = 0.5, seed = 1)

str(mbkm)
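
For readers who want a feel for what a single mini-batch k-means iteration does, here is a minimal standalone C++ sketch of the per-batch centroid update described in "Web-scale k-means clustering" (Sculley, 2010). It is not the ClusterR implementation: it uses only the standard library instead of RcppArmadillo, it omits initialization, early stopping and multiple restarts, and the names Point, squared_distance, nearest_centroid and mini_batch_step are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical alias, not a ClusterR type

// Squared Euclidean distance between two points of equal dimension.
double squared_distance(const Point& a, const Point& b) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j) {
    const double d = a[j] - b[j];
    sum += d * d;
  }
  return sum;
}

// Index of the centroid closest to x.
std::size_t nearest_centroid(const Point& x, const std::vector<Point>& centroids) {
  assert(!centroids.empty());
  std::size_t best = 0;
  double best_dist = std::numeric_limits<double>::max();
  for (std::size_t k = 0; k < centroids.size(); ++k) {
    const double d = squared_distance(x, centroids[k]);
    if (d < best_dist) {
      best_dist = d;
      best = k;
    }
  }
  return best;
}

// One mini-batch step: each point in the batch pulls its nearest centroid
// towards itself with a per-centroid learning rate of 1 / (points assigned so far).
void mini_batch_step(const std::vector<Point>& batch,
                     std::vector<Point>& centroids,
                     std::vector<std::size_t>& counts) {
  assert(counts.size() == centroids.size());

  // Cache the assignments first, using the centroids as they were before this batch.
  std::vector<std::size_t> assignment(batch.size());
  for (std::size_t i = 0; i < batch.size(); ++i) {
    assignment[i] = nearest_centroid(batch[i], centroids);
  }

  // Gradient-style update of the assigned centroids.
  for (std::size_t i = 0; i < batch.size(); ++i) {
    const std::size_t k = assignment[i];
    counts[k] += 1;
    const double eta = 1.0 / static_cast<double>(counts[k]);
    for (std::size_t j = 0; j < centroids[k].size(); ++j) {
      centroids[k][j] = (1.0 - eta) * centroids[k][j] + eta * batch[i][j];
    }
  }
}
```

Calling mini_batch_step repeatedly on random subsets of the rows, with counts starting at zero, gives the basic algorithm. Parameters such as batch_size and max_iters in the R call above correspond to the size of each subset and the number of such steps.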
 
 

Use the following link to report bugs/issues,

https://github.com/mlampros/ClusterR/issues


News

ClusterR 1.1.5

As of version 1.1.5 the ClusterR functions can take tibble objects as input too.

ClusterR 1.1.4

I modified the ClusterR package to a cpp-header-only package to allow linking of cpp code between Rcpp packages. See the update of the README.md file (16-08-2018) and issue #11 for more information.

ClusterR 1.1.3

I updated the example section of the documentation by replacing the optimal_init with the kmeans++ initializer

ClusterR 1.1.2

  • I fixed an issue related to NAs produced by integer overflow in the external_validation function. See the commented line of the Clustering_functions.R file (line 1830).
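
To illustrate the kind of problem behind this fix: the entry does not say which computation overflowed, so the sketch below assumes, purely for illustration, that it was a pair count of the form n * (n - 1) / 2, a quantity that commonly appears in external validation indices. R's integer type is 32-bit and returns NA on overflow, so such counts need a wider type. The function names pair_count and overflows_int32 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Number of unordered pairs among n items, n * (n - 1) / 2, in 64-bit arithmetic.
std::int64_t pair_count(std::int64_t n) {
  assert(n >= 0);
  return n * (n - 1) / 2;
}

// True if the value does not fit in a 32-bit signed integer.
bool overflows_int32(std::int64_t value) {
  return value > std::numeric_limits<std::int32_t>::max() ||
         value < std::numeric_limits<std::int32_t>::min();
}
```

With 1,000 observations the pair count still fits in 32 bits, but with 100,000 observations it does not, which is why the computation has to be carried out in a wider type (or as a double in R).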

ClusterR 1.1.1

ClusterR 1.1.0

  • I added the DARMA_64BIT_WORD flag in the Makevars file to allow the package to process big datasets
  • I modified the kmeans_miniBatchKmeans_GMM_Medoids.cpp file and especially all Rcpp::List::create() objects to address the clang-ASAN errors.

ClusterR 1.0.9

  • I modified the Optimal_Clusters_KMeans function to return a vector with the distortion_fK values if criterion is distortion_fK (instead of the WCSSE values).
  • I added the 'Moore-Penrose pseudo-inverse' for the case of the 'mahalanobis' distance calculation.

ClusterR 1.0.8

  • I modified the OpenMP clauses of the .cpp files to address the ASAN errors.
  • I removed the threads parameter from the KMeans_rcpp function to address the ASAN errors (the performance difference between the threaded and non-threaded versions is negligible, especially if the num_init parameter is less than 10). The threads parameter was also removed from the Optimal_Clusters_KMeans function, as it uses the KMeans_rcpp function to find the optimal clusters for the various methods.

ClusterR 1.0.7

I modified the kmeans_miniBatchKmeans_GMM_Medoids.cpp file in the following lines in order to fix the clang-ASAN errors (without loss in performance):

  • lines 1156-1160 : I commented out the second OpenMP parallel-loop and replaced the k variable with the i variable in the second for-loop [in the dissim_mat() function]
  • lines 1739-1741 : I commented out the second OpenMP parallel-loop [in the silhouette_matrix() function]
  • I replaced all silhouette_matrix (arma::mat) variable names with Silhouette_matrix, because the name overlapped with the name of the Rcpp function [in the silhouette_matrix function]
  • I replaced all sorted_medoids.n_elem with the variable unsigned int sorted_medoids_elem [in the silhouette_matrix function]

I modified the following functions in the clustering_functions.R file:

  • KMeans_rcpp() : I added an experimental note in the details for the optimal_init and quantile_init initializers.
  • Optimal_Clusters_KMeans() : I added an experimental note in the details for the optimal_init and quantile_init initializers.
  • MiniBatchKmeans() : I added an experimental note in the details for the optimal_init and quantile_init initializers.

ClusterR 1.0.6

The normalized variation of information was added in the external_validation function (https://github.com/mlampros/ClusterR/pull/1)
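
As a rough illustration of this measure, the following standalone C++ sketch computes one common form of the normalized variation of information between two labelings: the variation of information, H(A|B) + H(B|A), divided by the joint entropy H(A,B). Whether external_validation uses exactly this normalization is an assumption here, and the function names entropy and normalized_variation_of_information are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Shannon entropy (natural log) of a discrete distribution given as counts over n items.
template <typename Key>
double entropy(const std::map<Key, std::size_t>& counts, std::size_t n) {
  double h = 0.0;
  for (const auto& kv : counts) {
    const double p = static_cast<double>(kv.second) / static_cast<double>(n);
    h -= p * std::log(p);
  }
  return h;
}

// Normalized variation of information between two labelings of the same items.
// Returns 0 for identical partitions (up to relabeling) and 1 for independent ones.
double normalized_variation_of_information(const std::vector<int>& a,
                                           const std::vector<int>& b) {
  assert(a.size() == b.size() && !a.empty());

  std::map<int, std::size_t> count_a, count_b;
  std::map<std::pair<int, int>, std::size_t> count_joint;
  for (std::size_t i = 0; i < a.size(); ++i) {
    ++count_a[a[i]];
    ++count_b[b[i]];
    ++count_joint[{a[i], b[i]}];
  }

  const std::size_t n = a.size();
  const double h_a = entropy(count_a, n);
  const double h_b = entropy(count_b, n);
  const double h_ab = entropy(count_joint, n);

  // Both labelings put everything in one cluster: treat them as identical.
  if (h_ab == 0.0) return 0.0;

  // VI = H(A|B) + H(B|A) = 2 * H(A,B) - H(A) - H(B)
  const double vi = 2.0 * h_ab - h_a - h_b;
  return vi / h_ab;
}
```

For example, the labelings {0, 0, 1, 1} and {5, 5, 7, 7} describe the same partition and give a value of 0, while {0, 0, 1, 1} and {0, 1, 0, 1} share no information and give a value of 1.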

ClusterR 1.0.5

I fixed the valgrind memory errors

ClusterR 1.0.4

I removed the warnings which occurred during compilation. I corrected the UBSAN memory errors which occurred due to a mistake in the check_medoids() function of the utils_rcpp.cpp file. I also modified the quantile_init_rcpp() function of the utils_rcpp.cpp file to print a warning if duplicates are present in the initial centroid matrix.

ClusterR 1.0.3

  • I updated the dissimilarity functions to accept data with missing values.
  • I added an error exception in the predict_GMM() function for the case where the determinant is equal to zero. This can happen if the data includes highly correlated variables or variables with low variance.
  • I replaced all unsigned int types in the rcpp files with int data types
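
The determinant check mentioned in the list above can be pictured with a small standalone C++ sketch. It is a generic two-variable illustration, not the code of predict_GMM(): the function names covariance_determinant and gaussian_normaliser, and the eps threshold, are assumptions made for this example. The point is that a Gaussian density divides by the square root of the covariance determinant, so a zero determinant has to be caught before evaluating it.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Determinant of the symmetric 2x2 covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
double covariance_determinant(double var_x, double var_y, double cov_xy) {
  return var_x * var_y - cov_xy * cov_xy;
}

// Normalising constant of a bivariate Gaussian density, 1 / (2 * pi * sqrt(det)).
// Throws instead of dividing by zero when the covariance is (numerically) singular.
double gaussian_normaliser(double var_x, double var_y, double cov_xy, double eps = 1e-12) {
  assert(var_x >= 0.0 && var_y >= 0.0);
  const double det = covariance_determinant(var_x, var_y, cov_xy);
  if (det <= eps) {
    throw std::invalid_argument("covariance matrix is singular (determinant is zero)");
  }
  const double pi = std::acos(-1.0);
  return 1.0 / (2.0 * pi * std::sqrt(det));
}
```

Two perfectly correlated variables with unit variance (cov_xy = 1) give a determinant of zero and trigger the exception, as does a variable with zero variance.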

ClusterR 1.0.2

I modified the RcppArmadillo functions so that ClusterR passes the Windows and OSX OS package check results

ClusterR 1.0.1

I modified the RcppArmadillo functions so that ClusterR passes the Windows and OSX OS package check results

ClusterR 1.0.0


install.packages("ClusterR")

1.1.6 by Lampros Mouselimis, 10 days ago


https://github.com/mlampros/ClusterR


Report a bug at https://github.com/mlampros/ClusterR/issues


Browse source code at https://github.com/cran/ClusterR


Authors: Lampros Mouselimis [aut, cre], Conrad Sanderson [cph] (Author of the C++ Armadillo library), Ryan Curtin [cph] (Author of the C++ Armadillo library), Siddharth Agrawal [cph] (Author of the C code of the Mini-Batch-Kmeans algorithm (https://github.com/siddharth-agrawal/Mini-Batch-K-Means)), Brendan Frey [cph] (Author of the matlab code of the Affinity propagation algorithm (for commercial use please contact the author)), Delbert Dueck [cph] (Author of the matlab code of the Affinity propagation algorithm)




GPL-3 license


Imports Rcpp, OpenImageR, graphics, grDevices, utils, gmp, FD, stats, ggplot2

Depends on gtools

Suggests testthat, covr, knitr, rmarkdown

Linking to Rcpp, RcppArmadillo


Imported by CensMixReg, demu, jackstraw.

