Compare two classifications or clustering solutions that may or may not have the same number of classes, and that might have hard or soft (fuzzy, probabilistic) membership. Calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. This package is not geared towards traditional accuracy assessment for classification/mapping applications - the motivating use case is comparing a probabilistic clustering solution to a set of reference or existing class labels that could have any number of classes (that is, without having to degrade the probabilistic clustering to hard classes).
An R package for comparing two classifications or clustering solutions that have different structures - i.e. the two classifications have a different number of classes, or one classification has soft membership and one classification has hard membership. You can create a confusion matrix (error matrix) and then calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. The helper functions also help you to do things like make a soft classification into a hard one, or turn a set of class labels into a binary classification matrix.
The basic premise is that you already have two (or perhaps more) classifications that you would like to compare - these could be from a clustering algorithm, extracted from a remote sensing map, a set of classes assigned manually etc. There already exist a number of tools and packages to calculate cluster diagnostics or accuracy metrics, but they are usually focused on comparing clustering solutions that are hard (i.e. each observation has only one class) and have the same number of classes (e.g. clustering solution vs. the 'truth'). c2c is designed to allow you to compare classifications that do not fit into this scenario. The motivating problem was the need to compare a probabilistic clustering of vegetation data to an existing hard classification of that data (which had a hierarchy with varying numbers of classes), without losing the probabilistic component that the clustering algorithm produces.
This example is on silly fake data, but it's quick and will run
without any additional data or package loads. Check out the
vignette for something a little more sensible.
First install and load c2c
Make a silly made up soft classification matrix
my_soft_mat <- matrix(runif(50,0,1), nrow = 10, ncol = 5)
and a made up set of class labels, with matching number of observations
my_labels <- rep(c("a","b","c"), length.out = 10)
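The two main functions are get_conf_mat, which builds the confusion matrix, and the metrics function documented at ?calculate_cluster_metrics.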
First generate the confusion matrix
conf_mat <- get_conf_mat(my_soft_mat, my_labels)
conf_mat
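If you are curious what a confusion matrix between a soft classification and a set of hard labels looks like under the hood, here is a minimal base R sketch of the matrix multiplication idea. It is only an illustration and not necessarily how c2c does it internally: the object names (`soft`, `labels`, `binary`, `conf`) are made up for this example, and the rows of the soft matrix are rescaled to sum to 1, which is an assumption rather than something the package requires.

```r
# Illustrative base R sketch; not the c2c implementation.
set.seed(1)

# Soft memberships: 10 observations x 5 clusters, rows rescaled to sum to 1
soft <- matrix(runif(50), nrow = 10, ncol = 5)
soft <- soft / rowSums(soft)

# Hard labels for the same 10 observations
labels <- rep(c("a", "b", "c"), length.out = 10)

# Binary (one-hot) membership matrix: 10 observations x 3 classes
classes <- sort(unique(labels))
binary <- outer(labels, classes, FUN = "==") * 1
colnames(binary) <- classes

# Cross-tabulate by matrix multiplication: 5 clusters x 3 classes
conf <- t(soft) %*% binary
conf
```

Because every row of `soft` sums to 1, each column of `conf` sums to the number of observations carrying that label, so no membership weight is lost along the way.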
then calculate the metrics - see
?calculate_cluster_metrics for details
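To give a feel for the kind of metric that can be read off a confusion matrix, here is a small base R sketch of two common ones, overall purity and per-cluster Shannon entropy. These are textbook definitions used for illustration only; the exact set of metrics and their definitions in c2c may differ, so check the help page. The matrix `conf` and its values are invented for this example.

```r
# Hypothetical confusion matrix: rows = clusters, columns = reference classes
conf <- matrix(c(4, 1,
                 0, 3,
                 1, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("cluster", 1:3), c("a", "b")))

# Overall purity: share of the total weight that sits in each
# cluster's dominant reference class
purity <- sum(apply(conf, 1, max)) / sum(conf)

# Per-cluster Shannon entropy of the class distribution
# (0 means the cluster maps onto a single reference class)
row_entropy <- apply(conf, 1, function(x) {
  p <- x[x > 0] / sum(x)  # drop zero cells so log() stays finite
  -sum(p * log(p))
})

purity
row_entropy
```

Neither calculation needs the two classifications to have the same number of classes, which is the situation this package is aimed at.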
You could also just pass any confusion matrix (that you have already generated elsewhere).
Another thing you can do within
get_conf_mat is turn a soft matrix into a hard one.
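For reference, turning a soft matrix into a hard one usually means giving each observation to the class with the highest membership. A minimal base R sketch of that idea follows; it is an assumption about the general technique rather than a copy of the package code, and the names `soft`, `winner` and `hard` are made up here.

```r
# Hypothetical soft matrix: 3 observations x 3 classes
soft <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.3, 0.6,
                 0.2, 0.5, 0.3),
               nrow = 3, byrow = TRUE)

# Column index of the largest membership in each row
# (which.max takes the first column if there is a tie)
winner <- apply(soft, 1, which.max)

# Binary matrix with a single 1 per row, at the winning class
hard <- matrix(0, nrow = nrow(soft), ncol = ncol(soft))
hard[cbind(seq_len(nrow(soft)), winner)] <- 1
hard
```

Note that hardening throws away the membership strengths, which is exactly the information the soft comparison is meant to keep, so it is best treated as an optional extra.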
You can install directly from CRAN as above
or if you want to get the development version,
which might have some new functionality, you can install from GitHub. It's very easy, simply
use Hadley Wickham's (excellent)
There are probably some bugs. If you find any, please let me know - either directly on GitHub, or via the contact details below.
Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.
Foster, Hill and Lyons (2017). Ecological Grouping of Survey Sites when Sampling Artefacts are Present. Journal of the Royal Statistical Society: Series C (Applied Statistics). DOI: http://dx.doi.org/10.1111/rssc.12211