proxyC computes proximity between rows or columns of large matrices efficiently in C++. It is optimized for large sparse matrices using the Armadillo and Intel TBB libraries. Among several built-in similarity/distance measures, computation of correlation, cosine similarity and Euclidean distance is particularly fast.
This code was originally written for quanteda to compute similarity/distance between documents or features in large corpora, but was later separated into a stand-alone package to make it available for broader data science purposes.
install.packages("proxyC")
require(Matrix)
require(microbenchmark)
require(RcppParallel)
require(ggplot2)

# Set number of threads
setThreadOptions(8)

# Make a matrix with 99% zeros
sm1k <- rsparsematrix(1000, 1000, 0.01) # 1,000 columns
sm10k <- rsparsematrix(1000, 10000, 0.01) # 10,000 columns

# Convert to dense format
dm1k <- as.matrix(sm1k)
dm10k <- as.matrix(sm10k)
With sparse matrices, proxyC is roughly 10 to 100 times faster than proxy.
bm1 <- microbenchmark(
    "proxyC 1k" = proxyC::simil(sm1k, margin = 2, method = "cosine"),
    "proxy 1k" = proxy::simil(dm1k, method = "cosine"),
    "proxyC 10k" = proxyC::simil(sm10k, margin = 2, method = "cosine"),
    "proxy 10k" = proxy::simil(dm10k, method = "cosine"),
    times = 10
)
autoplot(bm1)
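To make concrete what a cosine similarity between columns (margin = 2) means, here is a minimal sketch on a toy matrix using only the Matrix package. It illustrates the quantity being computed and is not how proxyC implements it internally; the matrix values and the names x, norms and cos_sim are made up for this example.

```r
library(Matrix)

# Toy 3 x 3 sparse matrix (illustrative values); columns are the items compared
x <- sparseMatrix(
    i = c(1, 2, 1, 3, 3),
    j = c(1, 1, 2, 2, 3),
    x = c(1, 2, 2, 1, 3),
    dims = c(3, 3)
)

# Cosine similarity between columns:
# dot products of column pairs, divided by the product of their Euclidean norms
norms <- sqrt(colSums(x^2))
cos_sim <- as.matrix(crossprod(x)) / outer(norms, norms)

cos_sim
```

Columns 1 and 3 share no non-zero rows, so their similarity is exactly zero. In a matrix with 99% zeros most column pairs look like this, which is why a sparse-aware implementation has far less work to do than a dense one.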
If the rank argument is used, proxyC becomes even faster because only the top-ranked similarity scores are kept and all others are discarded (rounded to zero).
bm2 <- microbenchmark(
    "proxyC rank" = proxyC::simil(sm1k, margin = 2, method = "cosine", rank = 10),
    "proxyC all" = proxyC::simil(sm1k, margin = 2, method = "cosine"),
    times = 10
)
autoplot(bm2)
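The effect of keeping only top-ranked scores can be sketched in base R. The helper keep_top_n below is hypothetical and not part of proxyC; it only illustrates the idea of retaining the n largest values per column and zeroing the rest. How proxyC handles ties or computes this internally is not covered here.

```r
# Hypothetical helper (not a proxyC function): keep the n largest values
# in each column and set everything else to zero
keep_top_n <- function(s, n) {
    apply(s, 2, function(col) {
        cutoff <- sort(col, decreasing = TRUE)[min(n, length(col))]
        # Values tied with the cutoff are all kept in this sketch
        ifelse(col >= cutoff, col, 0)
    })
}

# Small symmetric similarity matrix with made-up values
s <- matrix(c(1.0, 0.4,  0.0,
              0.4, 1.0,  0.45,
              0.0, 0.45, 1.0), nrow = 3)

s_top <- keep_top_n(s, 2)

# Compare the number of stored (non-zero) scores before and after
c(before = sum(s != 0), after = sum(s_top != 0))
```

Because the ranking is done column by column, the result need not stay symmetric: a score can survive in one column and be dropped from its mirror position in another.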
Setting min_simil also makes proxyC faster, because similarity scores below the threshold are discarded.
bm3 <- microbenchmark(
    "proxyC min_simil" = proxyC::simil(sm1k, margin = 2, method = "correlation", min_simil = 0.9),
    "proxyC all" = proxyC::simil(sm1k, margin = 2, method = "correlation"),
    times = 10
)
autoplot(bm3)
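As a rough illustration of thresholding, the sketch below zeroes every score under a cutoff in a small made-up correlation matrix and stores the result in sparse form with the Matrix package. The variable names and values are assumptions for this example, and the code does not reflect proxyC's internal implementation.

```r
library(Matrix)

# Made-up 3 x 3 correlation matrix
s <- matrix(c(1.0,  0.95,  0.2,
              0.95, 1.0,  -0.5,
              0.2, -0.5,   1.0), nrow = 3)

# Illustrative cutoff, playing the role of the min_simil argument
threshold <- 0.9

# Discard every score below the cutoff
s_kept <- s
s_kept[s_kept < threshold] <- 0

# A sparse representation only needs to store the surviving scores
s_sparse <- Matrix(s_kept, sparse = TRUE)

c(before = sum(s != 0), after = sum(s_kept != 0))
```

With a high cutoff such as 0.9, most off-diagonal scores in a typical similarity matrix would be dropped, so the result has far fewer entries to record.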