Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering with the option to plot, validate, predict (new data) and estimate the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. For more information, see (i) "Clustering in an Object-Oriented Environment" by Anja Struyf, Mia Hubert, Peter Rousseeuw (1997), Journal of Statistical Software,
The ClusterR package implements Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering algorithms with the option to plot, validate, predict (new data) and find the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. More details on the functionality of ClusterR can be found in the blog post, the vignette and the package documentation.
UPDATE 16-08-2018
As of version 1.1.4 the ClusterR package allows R package maintainers to perform linking between packages at a C++ code (Rcpp) level. This means that the Rcpp functions of the ClusterR package can be called in the C++ files of another package. In the next lines I'll give detailed explanations on how this can be done:
Assume that an R package ('PackageA') calls one of the ClusterR Rcpp functions. The maintainer of 'PackageA' then has to:
install the ClusterR package, either from CRAN using,
install.packages("ClusterR")
or download the latest version from GitHub using the devtools package,
devtools::install_github('mlampros/ClusterR')
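add ClusterR to the LinkingTo field in the DESCRIPTION file of 'PackageA',
LinkingTo: ClusterR
and add the following headers and Rcpp attributes at the top of the C++ file that calls the ClusterR functions,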
#include <RcppArmadillo.h>
#include <ClusterRHeader.h>
#include <affinity_propagation.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]
The available functions can be found in the following files: inst/include/ClusterRHeader.h and inst/include/affinity_propagation.h
A complete minimal example would be:
#include <RcppArmadillo.h>
#include <ClusterRHeader.h>
#include <affinity_propagation.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]

using namespace clustR;

// [[Rcpp::export]]
Rcpp::List mini_batch_kmeans(arma::mat& data, int clusters, int batch_size, int max_iters,
                             int num_init = 1, double init_fraction = 1.0,
                             std::string initializer = "kmeans++", int early_stop_iter = 10,
                             bool verbose = false,
                             Rcpp::Nullable<Rcpp::NumericMatrix> CENTROIDS = R_NilValue,
                             double tol = 1e-4, double tol_optimal_init = 0.5, int seed = 1) {

  ClustHeader clust_header;

  return clust_header.mini_batch_kmeans(data, clusters, batch_size, max_iters, num_init,
                                        init_fraction, initializer, early_stop_iter, verbose,
                                        CENTROIDS, tol, tol_optimal_init, seed);
}
Then, by opening an R file a user can call the mini_batch_kmeans function using,
Rcpp::sourceCpp('example.cpp')              # assuming that the previous Rcpp code is included in 'example.cpp'

set.seed(1)
dat = matrix(runif(100000), nrow = 1000, ncol = 100)

mbkm = mini_batch_kmeans(dat, clusters = 3, batch_size = 50, max_iters = 100, num_init = 2,
                         init_fraction = 1.0, initializer = "kmeans++", early_stop_iter = 10,
                         verbose = TRUE, CENTROIDS = NULL, tol = 1e-4, tol_optimal_init = 0.5,
                         seed = 1)

str(mbkm)
Use the following link to report bugs/issues,
https://github.com/mlampros/ClusterR/issues
As of version 1.1.5 the ClusterR functions can take tibble objects as input too.
I converted ClusterR into a C++ header-only package to allow linking of C++ code between Rcpp packages. See the update of the README.md file (16-08-2018) for more information.
I updated the example section of the documentation by replacing the optimal_init with the kmeans++ initializer
I modified the kmeans_miniBatchKmeans_GMM_Medoids.cpp file in the following lines in order to fix the clang-ASAN errors (without loss in performance):
I modified the following functions in the clustering_functions.R file:
The normalized variation of information was added in the external_validation function (https://github.com/mlampros/ClusterR/pull/1)
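As a rough illustration of this measure, the sketch below computes one common normalization of the variation of information between two labelings: the variation of information divided by the joint entropy, which equals 1 - I(A;B) / H(A,B). Whether external_validation uses exactly this normalization is an assumption, and normalized_vi is a hypothetical name, not a ClusterR function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Normalized variation of information between two labelings of the same items.
// Assumed definition: NVI = 1 - I(A;B) / H(A,B), which is 0 for identical
// partitions (up to relabeling) and 1 for statistically independent ones.
double normalized_vi(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());

    // Contingency counts: marginals for each labeling and the joint table.
    std::map<int, double> count_a, count_b;
    std::map<std::pair<int, int>, double> count_ab;
    for (std::size_t i = 0; i < a.size(); ++i) {
        count_a[a[i]] += 1.0;
        count_b[b[i]] += 1.0;
        count_ab[{a[i], b[i]}] += 1.0;
    }

    double joint_entropy = 0.0;
    double mutual_info = 0.0;
    for (const auto& cell : count_ab) {
        const double p_ab = cell.second / n;
        const double p_a = count_a[cell.first.first] / n;
        const double p_b = count_b[cell.first.second] / n;
        joint_entropy -= p_ab * std::log(p_ab);
        mutual_info += p_ab * std::log(p_ab / (p_a * p_b));
    }

    // Both labelings put every item in a single cluster: treat as identical.
    if (joint_entropy == 0.0) return 0.0;
    return 1.0 - mutual_info / joint_entropy;
}
```

Because the value depends only on the contingency table, relabeling the clusters of either partition does not change the result.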
I fixed the valgrind memory errors
I removed the warnings which occurred during compilation. I corrected the UBSAN memory errors which occurred due to a mistake in the check_medoids() function of the utils_rcpp.cpp file. I also modified the quantile_init_rcpp() function of the utils_rcpp.cpp file to print a warning if duplicates are present in the initial centroid matrix.
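The kind of duplicate check mentioned for the initial centroid matrix can be sketched in a few lines of plain C++. This is only an illustration under the assumption that "duplicates" means identical rows; has_duplicate_rows is a hypothetical helper and not the code used in quantile_init_rcpp().

```cpp
#include <algorithm>
#include <vector>

// True if at least two rows (centroids) of the matrix are exactly identical.
// The matrix is taken by value so that the caller's row order is untouched.
bool has_duplicate_rows(std::vector<std::vector<double>> centroids) {
    // Lexicographic sort places identical rows next to each other.
    std::sort(centroids.begin(), centroids.end());
    return std::adjacent_find(centroids.begin(), centroids.end()) != centroids.end();
}
```

A caller could emit a warning when this returns true, since duplicated starting centroids typically leave some clusters empty.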
I modified the RcppArmadillo functions so that ClusterR passes the Windows and OSX OS package check results