Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering with the option to plot, validate, predict (new data) and estimate the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. For more information, see (i) "Clustering in an Object-Oriented Environment" by Anja Struyf, Mia Hubert, Peter Rousseeuw (1997), Journal of Statistical Software,

The ClusterR package provides Gaussian mixture model, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering algorithms, with options to plot, validate, predict (on new data) and find the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. More details on the functionality of ClusterR can be found in the blog post, the Vignette and the package Documentation.

**UPDATE 16-08-2018**

As of version 1.1.4 the *ClusterR* package allows R package maintainers to perform **linking between packages at a C++ code (Rcpp) level**. This means that the Rcpp functions of the *ClusterR* package can be called in the C++ files of another package. In the next lines I'll give detailed explanations on how this can be done:

Assuming that an R package ('PackageA') calls one of the *ClusterR* Rcpp functions, the maintainer of 'PackageA' has to:

**1st.** install the *ClusterR* package to take advantage of the new functionality, either from CRAN using,

` install.packages("ClusterR") `

or download the latest version from Github using the *devtools* package,

` devtools::install_github('mlampros/ClusterR') `

**2nd.** update the **DESCRIPTION** file of 'PackageA', and especially the *LinkingTo* field, by adding the *ClusterR* package (besides any other packages),

` LinkingTo: ClusterR `

**3rd.** open a **new C++ file** (for instance in RStudio) and at the top of the file add the following 'headers', 'depends' and 'plugins',

```cpp
#include <RcppArmadillo.h>
#include <ClusterRHeader.h>
#include <affinity_propagation.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]
```

The available functions can be found in the following files: **inst/include/ClusterRHeader.h** and **inst/include/affinity_propagation.h**

A *complete minimal example* would be :

```cpp
#include <RcppArmadillo.h>
#include <ClusterRHeader.h>
#include <affinity_propagation.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]

using namespace clustR;

// Thin wrapper that forwards its arguments to the ClusterR implementation.
// [[Rcpp::export]]
Rcpp::List mini_batch_kmeans(arma::mat& data, int clusters, int batch_size, int max_iters,
                             int num_init = 1, double init_fraction = 1.0,
                             std::string initializer = "kmeans++", int early_stop_iter = 10,
                             bool verbose = false,
                             Rcpp::Nullable<Rcpp::NumericMatrix> CENTROIDS = R_NilValue,
                             double tol = 1e-4, double tol_optimal_init = 0.5, int seed = 1) {

  ClustHeader clust_header;

  return clust_header.mini_batch_kmeans(data, clusters, batch_size, max_iters, num_init,
                                        init_fraction, initializer, early_stop_iter, verbose,
                                        CENTROIDS, tol, tol_optimal_init, seed);
}
```

Then, by opening an R file a user can call the *mini_batch_kmeans* function using,

```r
# assuming that the previous Rcpp code is included in 'example.cpp'
Rcpp::sourceCpp('example.cpp')

set.seed(1)
dat = matrix(runif(100000), nrow = 1000, ncol = 100)

mbkm = mini_batch_kmeans(dat, clusters = 3, batch_size = 50, max_iters = 100, num_init = 2,
                         init_fraction = 1.0, initializer = "kmeans++", early_stop_iter = 10,
                         verbose = TRUE, CENTROIDS = NULL, tol = 1e-4, tol_optimal_init = 0.5,
                         seed = 1)

str(mbkm)
```

Use the following link to report bugs/issues,

https://github.com/mlampros/ClusterR/issues

- I added parallelization for the *exact* method of the *AP_preferenceRange* function, which is more computationally intensive than the *bound* method
- I modified the *Optimal_Clusters_KMeans*, *Optimal_Clusters_GMM* and *Optimal_Clusters_Medoids* functions to also accept a contiguous or non-contiguous vector, besides single values, as the *max_clusters* parameter. However, the current limitation is that the user won't be able to plot the clusters but only to receive the output data (this can be changed in the future, however the plotting function for contiguous and non-contiguous vectors must be a separate plotting function outside of the existing one). Moreover, the *distortion_fK* criterion can't be computed in the *Optimal_Clusters_KMeans* function if the *max_clusters* parameter is a contiguous or non-contiguous vector (the *distortion_fK* criterion requires consecutive clusters). The same applies to the *Adjusted_Rsquared* criterion, which returns incorrect output. For this feature request see the following Github issue.
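
Why *distortion_fK* needs consecutive cluster counts can be seen from its definition. The following is only a sketch, assuming the criterion follows the f(K) measure of Pham, Dimov and Nguyen (2005), "Selection of K in K-means clustering"; nothing in this file confirms that ClusterR implements it in exactly this form. Here S_K denotes the total within-cluster distortion for K clusters and N_d the number of data dimensions:

```latex
f(K) =
\begin{cases}
1 & \text{if } K = 1, \\
\dfrac{S_K}{\alpha_K \, S_{K-1}} & \text{if } S_{K-1} \neq 0 \text{ and } K > 1, \\
1 & \text{if } S_{K-1} = 0 \text{ and } K > 1,
\end{cases}
\qquad
\alpha_K =
\begin{cases}
1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1, \\
\alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1.
\end{cases}
```

Under that assumption, each value f(K) depends on the distortion of the previous cluster count, S_{K-1}, and the weights are built up recursively from K = 2, so a vector of cluster counts that skips values or does not start at the beginning of the sequence leaves some terms undefined.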

- I moved the *OpenImageR* dependency in the DESCRIPTION file from 'Imports' to 'Suggests', as it appears only in the Vignette file.

- I fixed the *clang-UBSAN* errors

- I updated the README.md file (I removed unnecessary calls of ClusterR in DESCRIPTION and NAMESPACE files)
- I renamed the *export_inst_header.cpp* file in the src folder to *export_inst_folder_headers.cpp*
- I modified the *Predict_mini_batch_kmeans()* function to accept an armadillo matrix rather than an Rcpp Numeric matrix. The function appears both in the *ClusterRHeader.h* file ('inst' folder) and in the *export_inst_folder_headers.cpp* file ('src' folder)
- I added the *mini_batch_params* parameter to the *Optimal_Clusters_KMeans* function. Now, the optimal number of clusters can also be found based on the mini-batch-kmeans algorithm (except for the *variance_explained* criterion)
- I changed the license from MIT to GPL-3
- I added the *affinity propagation algorithm* (www.psi.toronto.edu/index.php?q=affinity propagation). Specifically, I converted the matlab files *apcluster.m* and *referenceRange.m*.
- I modified the minimum version of RcppArmadillo in the DESCRIPTION file to 0.9.1 because the Affinity Propagation algorithm requires the *.is_symmetric()* function, which was included in version 0.9.1

As of version 1.1.5 the ClusterR functions can take tibble objects as input too.

I modified the ClusterR package to a cpp-header-only package to allow linking of cpp code between Rcpp packages. See the update of the README.md file (16-08-2018) for more information.

I updated the example section of the documentation by replacing the *optimal_init* with the *kmeans++* initializer

- I fixed an issue related to *NAs produced by integer overflow* in the *external_validation* function. See the commented line of the *Clustering_functions.R* file (line 1830).
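
To illustrate how such an overflow can arise, here is a minimal standalone C++ sketch. It assumes that the overflowing quantity was a count of observation pairs, n(n - 1)/2, which is common in external validation indices; the entry above does not say which expression was actually affected, and the helper names `pair_count` and `fits_in_int32` are hypothetical, not part of ClusterR. R integers are 32-bit, so a pair count stops fitting once n exceeds 65536.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: number of unordered pairs among n observations,
// n * (n - 1) / 2, computed in 64-bit arithmetic so that the intermediate
// product cannot overflow for any non-negative 32-bit n.
std::int64_t pair_count(std::int32_t n) {
    assert(n >= 0);
    const std::int64_t wide = n;
    return wide * (wide - 1) / 2;
}

// Hypothetical helper: true when the value could be stored in an R integer.
// R integers are 32-bit signed, with the minimum value reserved for NA.
bool fits_in_int32(std::int64_t value) {
    return value > std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}
```

With these helpers, `fits_in_int32(pair_count(65536))` holds while `fits_in_int32(pair_count(65537))` does not, which is the kind of boundary where integer arithmetic in R starts to return NA. Doing the computation in a wider or floating-point type avoids the problem.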

- I added a *tryCatch* in the *Optimal_Clusters_Medoids()* function to account for the error described in the issue "Error in Optimal_Clusters_Medoids function #5"

- I added the *DARMA_64BIT_WORD* flag in the Makevars file to allow the package to process big datasets
- I modified the *kmeans_miniBatchKmeans_GMM_Medoids.cpp* file, and especially all *Rcpp::List::create()* objects, to address the clang-ASAN errors.

- I modified the *Optimal_Clusters_KMeans* function to return a vector with the *distortion_fK* values if the criterion is *distortion_fK* (instead of the *WCSSE* values).
- I added the 'Moore-Penrose pseudo-inverse' for the case of the 'mahalanobis' distance calculation.

- I modified the *OpenMP* clauses of the .cpp files to address the ASAN errors.
- I removed the *threads* parameter from the *KMeans_rcpp* function to address the ASAN errors (negligible performance difference between the threaded and non-threaded versions, especially if the *num_init* parameter is less than 10). The *threads* parameter was also removed from the *Optimal_Clusters_KMeans* function, as it utilizes the *KMeans_rcpp* function to find the optimal clusters for the various methods.

I modified the *kmeans_miniBatchKmeans_GMM_Medoids.cpp* file in the following lines in order to fix the clang-ASAN errors (without loss in performance):

- lines 1156-1160 : I commented out the second OpenMP parallel-loop and I replaced the *k* variable with the *i* variable in the second for-loop [in the *dissim_mat()* function]
- lines 1739-1741 : I commented out the second OpenMP parallel-loop [in the *silhouette_matrix()* function]
- I replaced (all) the *silhouette_matrix* (arma::mat) variable names with *Silhouette_matrix*, because the name overlapped with the name of the Rcpp function [in the *silhouette_matrix* function]
- I replaced all *sorted_medoids.n_elem* with the variable *unsigned int sorted_medoids_elem* [in the *silhouette_matrix* function]

I modified the following *functions* in the *clustering_functions.R* file:

- *KMeans_rcpp()*: I added an *experimental* note in the details for the *optimal_init* and *quantile_init* initializers.
- *Optimal_Clusters_KMeans()*: I added an *experimental* note in the details for the *optimal_init* and *quantile_init* initializers.
- *MiniBatchKmeans()*: I added an *experimental* note in the details for the *optimal_init* and *quantile_init* initializers.

The *normalized variation of information* was added in the *external_validation* function (https://github.com/mlampros/ClusterR/pull/1)

I fixed the valgrind memory errors

I removed the warnings which occurred during compilation.
I corrected the UBSAN memory errors which occurred due to a mistake in the *check_medoids()* function of the *utils_rcpp.cpp* file.
I also modified the *quantile_init_rcpp()* function of the *utils_rcpp.cpp* file to print a warning if duplicates are present in the initial centroid matrix.

- I updated the dissimilarity functions to accept data with missing values.
- I added an error exception in the predict_GMM() function for the case where the determinant is equal to zero. This is possible if the data includes highly correlated variables or variables with low variance.
- I replaced all unsigned ints in the Rcpp files with int data types
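
As an illustration of the zero-determinant case mentioned for predict_GMM(), the following standalone C++ sketch builds the 2 x 2 sample covariance matrix of two variables and checks its determinant. It is a generic example and not the package's code: the helper names and the tolerance are hypothetical, and the covariance structure used internally by predict_GMM() is not described here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sample covariance of two equally long series (denominator n - 1).
double covariance(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double n = static_cast<double>(x.size());
    double mean_x = 0.0, mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i] / n;
        mean_y += y[i] / n;
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += (x[i] - mean_x) * (y[i] - mean_y);
    }
    return sum / (n - 1.0);
}

// Determinant of the 2 x 2 covariance matrix of x and y:
// var(x) * var(y) - cov(x, y)^2.
double covariance_determinant(const std::vector<double>& x,
                              const std::vector<double>& y) {
    const double vxx = covariance(x, x);
    const double vyy = covariance(y, y);
    const double vxy = covariance(x, y);
    return vxx * vyy - vxy * vxy;
}

// Hypothetical guard: treat the matrix as singular when the determinant is
// within a small absolute tolerance of zero. The tolerance is an arbitrary
// choice and would need to be adapted to the scale of the data.
bool is_numerically_singular(double determinant, double tolerance = 1e-12) {
    return std::fabs(determinant) < tolerance;
}
```

For example, with x = {1, 2, 3, 4} and y = 2x the two variables are perfectly correlated and the guard reports a singular matrix, so a density that divides by the determinant (or inverts the matrix) cannot be evaluated and raising an error is a reasonable response.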

I modified the RcppArmadillo functions so that ClusterR passes the Windows and OSX OS package check results
