Compare two classifications or clustering solutions that may or may not have the same number of classes, and that might have hard or soft (fuzzy, probabilistic) membership. Calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. This package is not geared towards traditional accuracy assessment for classification/mapping applications. The motivating use case is comparing a probabilistic clustering solution to a set of reference or existing class labels that could have any number of classes (that is, without having to degrade the probabilistic clustering to hard classes).

An R package for comparing two classifications or clustering solutions that have different structures - i.e. the two classifications have a different number of classes, or one classification has soft membership and one classification has hard membership. You can create a confusion matrix (error matrix) and then calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. The helper functions also help you to do things like make a soft classification into a hard one, or turn a set of class labels into a binary classification matrix.

The basic premise is that you already have two (or perhaps more) classifications that you would like to compare - these could be from a clustering algorithm, extracted from a remote sensing map, a set of classes assigned manually etc. There already exist a number of tools and packages to calculate cluster diagnostics or accuracy metrics, but they are usually focused on comparing clustering solutions that are hard (i.e. each observation has only one class) and have the same number of classes (e.g. clustering solution vs. the 'truth'). c2c is designed to allow you to compare classifications that do not fit into this scenario. The motivating problem was the need to compare a probabilistic clustering of vegetation data to an existing hard classification of that data (which had a hierarchy with differing numbers of classes), without losing the probabilistic component that the clustering algorithm produces.

This example is on silly fake data, but it's quick and will run
without any additional data or package loads. Check out the
vignette for something a little more sensible.

c2c vignette.

First install and load c2c

`install.packages("c2c")`
`library(c2c)`

Make a silly made up soft classification matrix

`my_soft_mat <- matrix(runif(50,0,1), nrow = 10, ncol = 5)`

and a made up set of class labels, with matching number of observations

`my_labels <- rep(c("a","b","c"), length.out = 10)`
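To make the idea of a "binary classification matrix" concrete, here is a minimal base R sketch of how a vector of labels can be expanded into a 0/1 membership matrix. It is only an illustration: the names `labels`, `classes` and `indicator` are made up for this example, and it is not the package's own helper code.

```r
# Made-up labels for illustration (same pattern as the example above)
labels <- rep(c("a", "b", "c"), length.out = 10)

# One column per class; 1 where the observation carries that label, else 0
classes <- sort(unique(labels))
indicator <- outer(labels, classes, FUN = "==") * 1
colnames(indicator) <- classes

indicator[1:4, ]
rowSums(indicator)  # a hard classification has exactly one class per observation
```

Each row contains a single 1, which is what makes the labels "hard" when viewed as a membership matrix.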

The two main functions are `get_conf_mat` and `calculate_clustering_metrics`.
First generate the confusion matrix

`conf_mat <- get_conf_mat(my_soft_mat, my_labels)`
`conf_mat`
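If you are curious where the matrix multiplication comes in, the sketch below cross-tabulates a soft membership matrix against a set of hard labels in base R. It is an assumption about the general technique, not a copy of what `get_conf_mat` does internally. The row scaling step and the names `soft`, `hard` and `conf` are choices made for this example only.

```r
set.seed(1)  # reproducible fake data
soft <- matrix(runif(50, 0, 1), nrow = 10, ncol = 5)
soft <- soft / rowSums(soft)  # assumption: scale memberships so each row sums to 1

# Hard labels expanded to a 0/1 matrix, one column per class
labels <- rep(c("a", "b", "c"), length.out = 10)
classes <- sort(unique(labels))
hard <- outer(labels, classes, FUN = "==") * 1
colnames(hard) <- classes

# Entry [i, j] sums the membership in soft class i over observations labelled j
conf <- t(soft) %*% hard
conf
colSums(conf)  # recovers the number of observations per label
```

Because each observation's memberships sum to 1 here, the column totals equal the label counts, so no information about the soft memberships is thrown away.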

then calculate the metrics - see `?calculate_clustering_metrics` for details

`calculate_clustering_metrics(conf_mat)`
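As a rough idea of the kind of numbers that can be derived from a confusion matrix, the sketch below computes purity and entropy using common textbook definitions. These may differ from what `calculate_clustering_metrics` reports, so treat the help page as the reference. The confusion matrix itself is made up.

```r
# Made-up confusion matrix: rows = clusters, columns = reference classes
conf <- matrix(c(8, 1, 1,
                 2, 6, 2,
                 0, 3, 7),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("cluster", 1:3), c("a", "b", "c")))

# Purity: share of all observations that fall in their cluster's dominant class
purity <- sum(apply(conf, 1, max)) / sum(conf)

# Entropy per cluster (natural log), skipping zero cells so 0 * log(0) counts as 0
row_props <- conf / rowSums(conf)
cluster_entropy <- apply(row_props, 1, function(p) -sum(p[p > 0] * log(p[p > 0])))

# Overall entropy: per-cluster entropy weighted by cluster size
overall_entropy <- sum(cluster_entropy * rowSums(conf) / sum(conf))

purity
cluster_entropy
overall_entropy
```

Higher purity and lower entropy both indicate that clusters line up more cleanly with the reference classes.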

You could also just pass any confusion matrix (that you have already generated elsewhere).
Another thing you can do within `get_conf_mat` is turn a soft matrix into a hard one.
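For reference, "hardening" a soft matrix usually means assigning each observation to the class with its largest membership. Here is a minimal base R sketch of that idea. It does not use the package, and the names `soft`, `winner` and `hard` are made up for this example.

```r
set.seed(1)  # reproducible fake data
soft <- matrix(runif(50, 0, 1), nrow = 10, ncol = 5)

# Column index of the largest membership in each row
winner <- max.col(soft, ties.method = "first")

# Rebuild a 0/1 matrix of the same shape, with a single 1 per row
hard <- matrix(0, nrow = nrow(soft), ncol = ncol(soft))
hard[cbind(seq_len(nrow(soft)), winner)] <- 1

rowSums(hard)  # exactly one class per observation
```

Note that this throws away the probabilistic information, which is exactly what c2c lets you avoid when you do not need hard classes.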

You can install directly from CRAN as above

`install.packages("c2c")`

or if you want to get the development version, which might have some new functionality, you can install from GitHub. It's very easy: simply use Hadley Wickham's (excellent) `devtools` package

`install.packages("devtools")`

then call

`library(devtools)`
`devtools::install_github("mitchest/c2c")`

There are probably some bugs. If you find any, please let me know about them - either directly on GitHub, or via the contact details below.

- Mitchell Lyons
- [email protected] / [email protected]

Lyons, Foster and Keith (2017) "Simultaneous vegetation classification and mapping at large spatial scales". Journal of Biogeography.

Foster, Hill and Lyons (2017) "Ecological Grouping of Survey Sites when Sampling Artefacts are Present". Journal of the Royal Statistical Society: Series C (Applied Statistics). DOI: http://dx.doi.org/10.1111/rssc.12211

- This is the first release, coinciding with first release on CRAN too
- At present c2c contains all the functionality described in some 2017 papers
- This version has a vignette, so that should be your first port of call. The next release will have some more motivating examples, and may include some data too
- c2c is very light on tests, which will be rectified in the next release
- Check for changes in between CRAN releases: https://github.com/mitchest/c2c