Optimal and Fast Univariate Clustering
Fast optimal univariate clustering and segmentation
by dynamic programming. Three types of problems, univariate
k-means, k-median, and k-segments, are solved with
guaranteed optimality and reproducibility. The core algorithm
minimizes the sum of within-cluster distances using respective
metrics. Its advantage over heuristic clustering algorithms in
efficiency and accuracy is increasingly pronounced as the
number of clusters k increases. Weighted k-means and unweighted
k-segments algorithms can also optimally segment time series
and perform peak calling. An auxiliary function generates
histograms that are adaptive to patterns in data. In contrast to
heuristic methods, this package provides a powerful set of tools
for univariate data analysis with guaranteed optimality.
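To illustrate the idea behind the core algorithm, here is a minimal Python sketch of optimal univariate k-means by dynamic programming. It is not the package's R/C++ implementation: it uses the plain O(k n^2) recurrence rather than the faster variants listed in the changes below, and the names kmeans_1d_dp, cost, D, and B are hypothetical. Prefix sums of the sorted values and their squares make each within-cluster sum of squares a constant-time lookup.

```python
from itertools import accumulate


def kmeans_1d_dp(x, k):
    """Optimal univariate k-means via dynamic programming (O(k n^2) sketch).

    Returns (clusters, total within-cluster sum of squares).
    """
    xs = sorted(x)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(x)")

    # Prefix sums of values and of squared values.
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(v * v for v in xs))

    def cost(j, i):
        """Sum of squared deviations from the mean of xs[j..i], inclusive."""
        m = i - j + 1
        s = s1[i + 1] - s1[j]
        # Clamp tiny negative values caused by floating-point rounding.
        return max(s2[i + 1] - s2[j] - s * s / m, 0.0)

    inf = float("inf")
    # D[q][i]: minimum cost of splitting xs[0..i] into q + 1 clusters.
    D = [[inf] * n for _ in range(k)]
    # B[q][i]: index of the first point of the last cluster in that optimum.
    B = [[0] * n for _ in range(k)]

    for i in range(n):
        D[0][i] = cost(0, i)
    for q in range(1, k):
        for i in range(q, n):
            for j in range(q, i + 1):
                c = D[q - 1][j - 1] + cost(j, i)
                if c < D[q][i]:
                    D[q][i], B[q][i] = c, j

    # Backtrack from the last cluster to the first.
    bounds = []
    i = n - 1
    for q in range(k - 1, -1, -1):
        j = B[q][i]
        bounds.append((j, i))
        i = j - 1
    bounds.reverse()

    clusters = [xs[j:i + 1] for j, i in bounds]
    return clusters, D[k - 1][n - 1]
```

Because the input is sorted first, every optimal cluster is a contiguous run of the sorted values, which is what lets the recurrence search only over the left boundary j of the last cluster.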
- Expanded the vignette of adaptive histograms to a tutorial.
- Expanded the vignette of optimal univariate k-means clustering to a tutorial.
- Updated the time course example in the Ckmeans.1d.dp() function.
- Fixed a bug in the dynamic programming solution to optimal weighted
clustering. This bug affected the first weighted cluster. The code that
contains this bug was not used for non-weighted fast optimal clustering.
If no weight vector is provided when calling Ckmeans.1d.dp, the clustering
result will be the same and remains correct.
- Added a vignette to visualize examples of adaptive histograms.
- Added a vignette to visualize examples of optimal univariate k-means clustering.
- Added an equal bin width histogram example to contrast with the adaptive histogram.
- Moved ahist() function from visualize.R to a new R file ahist.R.
- Added a breaks argument to ahist() so as to use the default graphics::hist() but
with the ability to add sticks to the generated histogram.
- Added a skip.empty.bin.color argument to ahist() to gain more control over
colors of the histogram bars.
- Added a data argument to ahist() to provide raw data for visualization.
- Allowed the x argument of ahist() to be an object of the class "Ckmeans.1d.dp" to avoid
recomputing the clustering if it has already been done. This requires the
data for clustering to be provided via the data argument.
- Added an argument add.sticks=TRUE to ahist() to turn on or off the sticks
just above the horizontal axis.
- Added an argument style to ahist() for different styles of adaptive histograms.
- Updated examples in the Ckmeans.1d.dp() manual.
- Added a new function plot() to visualize the clusters.
- Updated examples of ahist().
- Added sticks to ahist() to show the original input data.
- Fixed ahist() when there is only a single bin detected.
- Made ahist() run much faster than the previous version.
- Updated previous examples and added more examples to illustrate
the use of ahist() better.
- Introduced a new function ahist() to generate adaptive histograms
corresponding to the optimal univariate k-means clustering.
- Removed usage of lambda functions in C++ code for compatibility with
older versions of C++ compilers.
- Shifted the values of x by median to improve numerical stability.
- Implemented an O(kn lg n) algorithm, speeding up the program greatly.
- s[j,i] is now computed in constant time based on pre-computed
sums of input x and its squares from 0 to i.
- Incorporated a numerically stable method for computing sample variance when
selecting the number of clusters.
- Improved documentation.
- Removed a typo in describing time complexity.
- The Ckmeans.1d.dp() function now returns the "totss", "tot.withinss", and
"betweenss" statistics to summarize the optimal clustering obtained.
- print.Ckmeans.1d.dp() now prints the above statistics.
- Upgraded to support C++11.
- Introduced optimal k-means clustering for weighted data.
- Implemented backward filling of the dynamic programming matrix to
utilize lower bounds for the optimal cluster boundary. This step
reduced the runtime by at least half (a speedup of two times or more).
- Implemented mathematically proven tighter ranges when searching for
cluster boundaries. The runtime of the function is greatly reduced.
Most notably, the runtime is roughly constant as the number of clusters
increases beyond k=2.
- Integrated all test cases into one single file.
- Substantial runtime reduction. Added code to check for an upper bound
on the sum of within-cluster squared distances. This reduced the runtime
by half when clustering 100000 points (from standard normal distribution)
into 10 clusters.
- Eliminated the unnecessary calculation of (n-1) elements in the dynamic
programming matrix that are not needed for the final result. This
resulted in an enormous reduction in runtime when the number of clusters
is 2: assigning one million points into two clusters took half a
second on an iMac with a 2.93 GHz Intel Core i7 processor.
- Included a reference to the first description of the dynamic programming
solution by Richard Bellman (1973).
- Fixed a bug on cluster assignment when there is only one cluster. This
was a bug introduced in version 3.3.2.
- Added automatic test cases.
- Removed an incorrect warning message when the number of clusters is equal
to the number of unique elements in the input vector.
- Changed from 1-based to 0-based C implementation.
- Optimized the code by reducing overhead. Observed a 22% reduction in runtime
when repeatedly clustering seven points into two clusters one million times.
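The "totss", "tot.withinss", and "betweenss" statistics mentioned in an entry above can be illustrated with a short Python sketch. It assumes the definitions follow the convention of R's stats::kmeans (total, within-cluster, and between-cluster sums of squares); the function name clustering_stats and its arguments are hypothetical and not part of the package.

```python
def clustering_stats(x, cluster):
    """Sums of squares summarizing a clustering of the values in x.

    x       -- list of numbers
    cluster -- list of cluster labels, one per element of x
    """
    n = len(x)
    grand_mean = sum(x) / n

    # Total sum of squares around the grand mean.
    totss = sum((v - grand_mean) ** 2 for v in x)

    # Group the values by cluster label.
    groups = {}
    for v, c in zip(x, cluster):
        groups.setdefault(c, []).append(v)

    tot_withinss = 0.0
    betweenss = 0.0
    for members in groups.values():
        center = sum(members) / len(members)
        # Squared deviations of members from their own cluster mean.
        tot_withinss += sum((v - center) ** 2 for v in members)
        # Size-weighted squared deviation of the cluster mean from the grand mean.
        betweenss += len(members) * (center - grand_mean) ** 2

    return {"totss": totss, "tot.withinss": tot_withinss, "betweenss": betweenss}
```

Up to floating-point rounding, totss equals tot.withinss plus betweenss, so a large betweenss relative to totss indicates well-separated clusters.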
Version 3.3.1 2015-02-10
- Fixed a problem that prevented compilation on Windows (the size_t type is
now forced to unsigned long in the max() function).
Version 3.3.0 2015-02-09
- Added automated test cases into the package.
- Changed the code to not issue a warning message when the number of clusters
is estimated to be 1.
- When the lower bound of the number of clusters is greater than the number
of unique elements in the input vector, both the min and max numbers of
clusters are set to the number of unique input values.
- When the upper bound of the number of clusters is greater than the number
of unique elements in the input vector, the max number of clusters is set
to the number of unique elements in the input vector.
- Used warning() instead of cat() to display warning messages.
- Incorporated changes suggested by a user to speed up the code.
- Revised the examples and documentation to improve usability of the package.
- Started the NEWS file.
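The two rules above for adjusting the range of the number of clusters can be sketched in Python as follows. This is an illustration of the described behavior only, not the package's code; the function name clamp_k_range and its parameters are hypothetical.

```python
def clamp_k_range(x, k_min, k_max):
    """Limit the requested range [k_min, k_max] by the number of unique values in x."""
    n_unique = len(set(x))
    if k_min > n_unique:
        # Lower bound exceeds the unique count: both bounds collapse to it.
        return n_unique, n_unique
    if k_max > n_unique:
        # Only the upper bound exceeds the unique count: cap it.
        return k_min, n_unique
    return k_min, k_max
```

For example, with the input 1, 1, 2, 3 (three unique values), a requested range of 2 to 7 becomes 2 to 3, and a requested range of 5 to 7 becomes 3 to 3.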
Version 3.02 2014-03-24 and earlier
- The program now automatically determines the number of clusters from a
- The code is optimized for further speedup.