Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms

A fast reimplementation of several density-based algorithms of the DBSCAN family for spatial data. Includes the DBSCAN (density-based spatial clustering of applications with noise) and OPTICS (ordering points to identify the clustering structure) clustering algorithms HDBSCAN (hierarchical DBSCAN) and the LOF (local outlier factor) algorithm. The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search. An R interface to fast kNN and fixed-radius NN search is also provided.


CRAN version CRAN RStudio mirror downloads Travis-CI Build Status AppVeyor Build Status

This R package provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

Clustering

  • DBSCAN: Density-based spatial clustering of applications with noise.
  • HDBSCAN: Hierarchical DBSCAN with simplified hierarchy extraction.
  • OPTICS/OPTICSXi: Ordering points to identify the clustering structure clustering algorithms.
  • FOSC: Framework for Optimal Selection of Clusters for unsupervised and semisupervised clustering of hierarchical cluster tree.
  • Jarvis-Patrick clustering
  • SNN Clustering: Shared Nearest Neighbor Clustering.

Outlier Detection

  • LOF: Local outlier factor algorithm.
  • GLOSH: Global-Local Outlier Score from Hierarchies algorithm.

Fast Nearest-Neighbor Search (using kd-trees)

  • kNN search
  • Fixed-radius NN search

The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are typically faster than the native R implementations (e.g., dbscan in package fpc), or the implementations in WEKA, ELKI and Python's scikit-learn.

Installation

Stable CRAN version: install from within R with

install.packages("dbscan")

Current development version: Download package from AppVeyor or install from GitHub (needs devtools).

library("devtools")
install_github("mhahsler/dbscan")

Usage

Load the package and use the numeric variables in the iris dataset

library("dbscan")
 
data("iris")
x <- as.matrix(iris[, 1:4])

Run DBSCAN

db <- dbscan(x, eps = .4, minPts = 4)
db
DBSCAN clustering for 150 objects.
Parameters: eps = 0.4, minPts = 4
The clustering contains 4 cluster(s) and 25 noise points.

 0  1  2  3  4 
25 47 38 36  4 

Available fields: cluster, eps, minPts

Visualize results (noise is shown in black)

pairs(x, col = db$cluster + 1L)

Calculate LOF (local outlier factor) and visualize (larger bubbles in the visualization have a larger LOF)

lof <- lof(x, k = 4)
pairs(x, cex = lof)

Run OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt
OPTICS clustering for 150 objects.
Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA
Available fields: order, reachdist, coredist, predecessor, minPts, eps, eps_cl, xi

Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)

Extract a hierarchical clustering using the Xi method (captures clusters of varying density)

opt <- extractXi(opt, xi = .05)
opt
plot(opt)

Run HDBSCAN (captures stable clusters)

hdb <- hdbscan(x, minPts = 4)
hdb
HDBSCAN clustering for 150 objects.
Parameters: minPts = 4
The clustering contains 2 cluster(s) and 0 noise points.

  1   2 
100  50 

Available fields: cluster, minPts, cluster_scores, membership_prob, outlier_scores, hc

Visualize the results as a simplified tree

plot(hdb, show_flat = T)

See how well each point corresponds to the clusters found by the model used

  colors <- mapply(function(col, i) adjustcolor(col, alpha.f = hdb$membership_prob[i]), 
                   palette()[hdb$cluster+1], seq_along(hdb$cluster))
  plot(x, col=colors, pch=20)

License

The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with explicit permission granted by the original author, Erich Schubert.

Further Information

Maintainer: Michael Hahsler

News

dbscan 1.1-3 (2018-11-12)

Bugfix

  • pointdensity was double counting the query point (reported by Marius Hofert).

dbscan 1.1-2 (2018-05-18)

New Features

  • OPTICS now calculates eps if it is omitted.

Bugfix

  • Example now only uses igraph conditionally since it is unavailable on Solaris (reported by B. Ripley).

dbscan 1.1-1 (2017-03-19)

Bugfix

  • Fixed problem with constant name on Solaris in ANN code (reported by B. Ripley).

dbscan 1.1-0 (2017-03-18)

New Features

  • HDBSCAN was added.
  • extractFOSC (optimal selection of clusters for HDBSCAN) was added.
  • GLOSH outlier score was added.
  • hullplot uses now filled polygons as the default.
  • hullplot now used PCA if the data has more than 2 dimensions.
  • Added NN superclass for kNN and frNN with plot and with adjacencylist().
  • Added shared nearest neighbor clustering as sNNclust() and sNN to calculate the number of shared nearest neighbors.
  • Added pointdensity function.
  • Unsorted kNN and frNN can now be sorted using sort().
  • kNN and frNN now also accept kNN and frNN objects, respectively. This can be used to create a new kNN (frNN) with a reduced k or eps.
  • Datasets added: DS3 and moon.

Interface Changes

  • Improved interface for dbscan() and optics(): ... it now passed on to frNN.
  • OPTICS clustering extraction methods are now called extractDBSCAN and extractXi.
  • kNN and frNN are now objects with a print function.
  • dbscan now also accepts a frNN object as input.
  • jpclust and sNNclust now return a list instead of just the cluster assignments.

dbscan 1.0-0 (2017-02-02)

New Features

  • The package has now a vignette.
  • Jarvis-Patrick clustering is now available as jpclust().
  • Improved interface for dbscan() and optics(): ... is now passed on to frNN.
  • OPTICS clustering extraction methods are now called extractDBSCAN and extractXi.
  • hullplot uses now filled polygons as the default.
  • hullplot now used PCA if the data has more than 2 dimensions.
  • kNN and frNN are now objects with a print function.
  • dbscan now also accepts a frNN object as input.

dbscan 0.9-8 (2016-08-05)

New Features

  • Added hullplot to plot a scatter plot with added convex cluster hulls.
  • OPTICS: added a predecessor correction step that is used by the ELKI implementation (Matt Piekenbrock).

Bugfixes

  • Fixed a memory problem in frNN (reported by Yilei He).

dbscan 0.9-7 (2016-04-14)

  • OPTICSXi is now implemented (thanks to Matt Piekenbrock).
  • DBSCAN now also accepts MinPts (with a capital M) to be compatible with the fpc version.
  • DBSCAN objects are now also of class db scan_fast to avoid clashes with fpc.
  • DBSCAN and OPTICS have now predict functions.
  • Added test for unhandled NAs.
  • Fixed LOF for more than k duplicate points (reported by Samneet Singh).

dbscan 0.9-6 (2015-12-14)

  • OPTICS: fixed second bug reported by Di Pang
  • all methods now also accept dist objects and have a search method "dist" which precomputes distances.

dbscan 0.9-5 (2015-10-04)

  • OPTICS: fixed bug with first observation reported by Di Pang
  • OPTICS: clusterings can now be extracted using optics_cut

dbscan 0.9-4 (2015-09-17)

  • added tests (testthat).
  • input data is now checked if it can safely be coerced into a numeric matrix (storage.mode double).
  • fixed self matches in kNN and frNN (now returns the first NN correctly).

dbscan 0.9-3 (2015-9-2)

  • Added weights to DBSCAN.

dbscan 0.9-2 (2015-08-11)

  • Added kNN interface.
  • Added frNN (fixed radius NN) interface.
  • Added LOF.
  • Added OPTICS.
  • All algorithms check now for interrupt (CTRL-C/Esc).
  • DBSCAN now returns a list instead of a numeric vector.

dbscan 0.9-1 (2015-07-21)

  • DBSCAN: Improved speed by avoiding repeated sorting of point ids.
  • Added linear NN search option.
  • Added fast calculation for kNN distance.
  • fpc and microbenchmark are now used conditionally in the examples.

dbscan 0.9-0 (2015-07-15)

  • initial release

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("dbscan")

1.1-4 by Michael Hahsler, a month ago


https://github.com/mhahsler/dbscan


Report a bug at https://github.com/mhahsler/dbscan/issues


Browse source code at https://github.com/cran/dbscan


Authors: Michael Hahsler [aut, cre, cph] , Matthew Piekenbrock [aut, cph] , Sunil Arya [ctb, cph] , David Mount [ctb, cph]


Documentation:   PDF Manual  


Task views: Cluster Analysis & Finite Mixture Models


GPL (>= 2) license


Imports Rcpp, graphics, stats, methods

Suggests fpc, microbenchmark, testthat, dendextend, igraph, knitr, DMwR

Linking to Rcpp

System requirements: C++11


Imported by AFM, Biopeak, DDoutlier, ParBayesianOptimization, cordillera, diceR, funtimes, haploReconstruct, projector, smotefamily, stream.

Depended on by ktaucenters.

Suggested by OTclust, opticskxi, shipunov, supc.


See at CRAN