Size-Constrained Clustering

Provides wrappers for 'scclust', a C library for computationally efficient size-constrained clustering with near-optimal performance. See <https://github.com/fsavje/scclust> for more information.



The scclust package is an R wrapper for the scclust library. The package provides functions to construct near-optimal size-constrained clusterings.

Most conventional clustering functions restrict the number of clusters but impose no restrictions on the content of the clusters (see, for example, k-means). scclust takes another route. It imposes conditions on the content of the clusters but allows any number of them to be formed. Specifically, subject to user-specified constraints on the size and composition of the clusters, scclust constructs a clustering so that within-cluster pairwise distances are minimized.
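To make the objective concrete, the following base R sketch compares within-cluster pairwise distances for two labellings of four points. It does not call scclust. The helper name within_cluster_distances is hypothetical, and reporting both the mean and the maximum is an assumption, since the description above does not say which summary of the distances is minimized.

```r
# Hypothetical helper (not part of scclust): summarizes the distances between
# all pairs of points that share a cluster.
within_cluster_distances <- function(points, cluster) {
  d <- as.matrix(dist(points))            # Euclidean distances between all rows
  same <- outer(cluster, cluster, "==")   # TRUE where two points share a cluster
  pairs <- same & upper.tri(d)            # count each unordered pair once
  c(mean = mean(d[pairs]), max = max(d[pairs]))
}

# Four points forming two well-separated pairs
points <- matrix(c(0, 0,
                   0, 1,
                   5, 5,
                   5, 7), ncol = 2, byrow = TRUE)

# Grouping neighbours together keeps within-cluster distances small ...
tight <- within_cluster_distances(points, c(1, 1, 2, 2))
# ... while grouping distant points together makes them large.
loose <- within_cluster_distances(points, c(1, 2, 1, 2))
```

Both labellings have clusters of size two, so they satisfy the same size constraint; the first is preferable because its within-cluster distances are smaller.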

It is possible to impose an overall size constraint so that each cluster must contain at least a certain number of points in total. It is also possible to impose constraints on the composition of the clusters so that each cluster must contain a certain number of points of different types. For example, in a sample with "red" and "blue" data points, one can constrain the clustering so that each cluster must contain at least 10 points in total of which at least 3 must be "red" and at least 2 must be "blue".
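As an illustration of what these constraints mean, the base R sketch below checks a toy labelling against the "red"/"blue" example (at least 10 points per cluster, of which at least 3 "red" and at least 2 "blue"). The function satisfies_constraints is a hypothetical helper written for this illustration, not part of scclust; the package's own check_clustering function serves this purpose for real clusterings.

```r
# Hypothetical helper (not part of scclust): checks a cluster labelling against
# a minimum total size and minimum counts per type.
satisfies_constraints <- function(cluster, type, min_size, min_per_type) {
  # Rows are clusters, columns are types
  counts <- table(cluster, factor(type))
  size_ok <- all(rowSums(counts) >= min_size)
  type_ok <- all(vapply(names(min_per_type), function(t) {
    all(counts[, t] >= min_per_type[[t]])
  }, logical(1)))
  size_ok && type_ok
}

# Toy data: two clusters of 10 points each
cluster <- rep(c(1, 2), each = 10)
type <- c(rep("red", 3), rep("blue", 7),   # cluster 1: 3 red, 7 blue
          rep("red", 9), rep("blue", 1))   # cluster 2: 9 red, 1 blue

constraints <- c(red = 3, blue = 2)

# Cluster 2 has only one "blue" point, so the full labelling fails
all_ok <- satisfies_constraints(cluster, type, 10, constraints)

# Cluster 1 on its own meets every constraint
first_ok <- satisfies_constraints(cluster[1:10], type[1:10], 10, constraints)
```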

scclust was made with large data sets in mind, and it can cluster tens of millions of data points within minutes on an ordinary desktop computer.

How to install

scclust is on CRAN and can be installed by running:

install.packages("scclust")

How to install the development version

The stable CRAN version is recommended, but the latest development version can be installed directly from GitHub using devtools:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("fsavje/scclust-R")

The package contains compiled code, so you must have a development environment to install the development version. (Use devtools::has_devel() to check whether you do.) If no development environment exists, Windows users should download and install Rtools, and macOS users should download and install Xcode.

How to use scclust

The following snippet shows how scclust can be used to make clusters with both size and type constraints. See the package documentation for more details.

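# Load the package (this also attaches the distances package,
# which scclust depends on)
library(scclust)

# Make example data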
my_data <- data.frame(id = 1:100000,
                      type = factor(rbinom(100000, 3, 0.3),
                                    labels = c("A", "B", "C", "D")),
                      x1 = rnorm(100000),
                      x2 = rnorm(100000),
                      x3 = rnorm(100000))
 
# Construct distance metric
my_dist <- distances(my_data,
                     id_variable = "id",
                     dist_variables = c("x1", "x2", "x3"))
 
# Make clustering with at least 3 data points in each cluster
my_clustering <- sc_clustering(my_dist, 3)
 
# Check that the clustering satisfies the constraint
check_clustering(my_clustering, 3)
# > TRUE
 
# Get statistics about the clustering
get_clustering_stats(my_dist, my_clustering)
# > num_data_points        1.000000e+05
# > ...
 
# Make clustering with at least one point of each type in each cluster
my_clustering <- sc_clustering(my_dist,
                               type_labels = my_data$type,
                               type_constraints = c("A" = 1, "B" = 1,
                                                    "C" = 1, "D" = 1))
 
# Check that the clustering satisfies the constraints
check_clustering(my_clustering,
                 type_labels = my_data$type,
                 type_constraints = c("A" = 1, "B" = 1,
                                      "C" = 1, "D" = 1))
# > TRUE
 
# Make clustering with at least 8 points in total of which at least
# one must be "A", two must be "B" and five can be any type
my_clustering <- sc_clustering(my_dist,
                               size_constraint = 8,
                               type_labels = my_data$type,
                               type_constraints = c("A" = 1, "B" = 2))
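The last clustering above is not followed by a check. A minimal self-contained sketch of such a check, on a smaller data set, could look as follows. It uses only the functions shown above (distances, sc_clustering, check_clustering). The object names, the sample size of 1000, and the requireNamespace guard (which skips the example when scclust is not installed) are choices made for this illustration.

```r
# Only run the example when scclust is available
if (requireNamespace("scclust", quietly = TRUE)) {
  library(scclust)

  # Smaller example data; explicit levels keep the labels aligned even if
  # some type happens not to occur in the sample
  n <- 1000
  small_data <- data.frame(type = factor(rbinom(n, 3, 0.3),
                                         levels = 0:3,
                                         labels = c("A", "B", "C", "D")),
                           x1 = rnorm(n),
                           x2 = rnorm(n))

  small_dist <- distances(small_data, dist_variables = c("x1", "x2"))

  # At least 8 points per cluster, of which at least one "A" and two "B"
  small_clustering <- sc_clustering(small_dist,
                                    size_constraint = 8,
                                    type_labels = small_data$type,
                                    type_constraints = c("A" = 1, "B" = 2))

  # Check the size and type constraints together
  constraints_ok <- check_clustering(small_clustering,
                                     size_constraint = 8,
                                     type_labels = small_data$type,
                                     type_constraints = c("A" = 1, "B" = 2))
} else {
  constraints_ok <- NA
}
```

Passing the same size_constraint, type_labels and type_constraints to check_clustering as to sc_clustering keeps the check aligned with the constraints the clustering was built under.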

News

scclust 0.2.2

  • Fixes incompatibility error with Sun make.

scclust 0.2.1

  • Fixes compilation so that it uses the local Makeconf file.

scclust 0.2.0

  • Makes defaults friendlier with discrete data.

scclust 0.1.2

  • Updates maintainer information.

scclust 0.1.1

  • Uses new version of the scclust library (fixes a memory overflow issue).

  • Makes C-code POSIX compliant so package builds on Solaris.

scclust 0.1.0

  • Initial release.


0.2.2 by Fredrik Savje, 3 months ago


https://github.com/fsavje/scclust-R


Report a bug at https://github.com/fsavje/scclust-R/issues


Browse source code at https://github.com/cran/scclust


Authors: Fredrik Savje [aut, cre] , Michael Higgins [aut] , Jasjeet Sekhon [aut]




GPL (>= 3) license


Depends on distances

Suggests testthat


Imported by quickblock, quickmatch.

