Statistical Tools for Filebacked Big Matrices

Easy-to-use, efficient, flexible and scalable statistical tools. Package bigstatsr provides and uses Filebacked Big Matrices via memory-mapping. It provides, for instance, matrix operations, Principal Component Analysis, sparse linear supervised models, utility functions and more.



R package {bigstatsr} provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory thanks to memory-mapping to binary files on disk. This is very similar to the format big.matrix provided by R package {bigmemory}, which is no longer used by this package (see the corresponding vignette). As inputs, package {bigstatsr} uses Filebacked Big Matrices (FBM).

LIST OF FEATURES

Note that most of the algorithms of this package don't handle missing values.
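
Because of this, it can be worth checking an FBM for missing values before running an analysis. The following is a minimal sketch, assuming the `big_apply()` interface used in the small example on this page; the helper name `count_na` and the toy data are hypothetical.

```r
library(bigstatsr)

# Small temporary FBM with two missing values (illustrative data only)
X <- FBM(20, 10, init = 0)
X[3, 2] <- NA_real_
X[7, 9] <- NA_real_

# Hypothetical helper: count NAs per column, one block of columns at a time
count_na <- function(X, ind) colSums(is.na(X[, ind, drop = FALSE]))

# Combine the per-block counts into a single vector (one entry per column)
nb_na <- big_apply(X, a.FUN = count_na, a.combine = 'c')
which(nb_na > 0)  # columns that contain at least one missing value
```

Columns flagged this way can then be imputed or excluded before calling functions that do not handle missing values.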

Installation

# For the current development version
devtools::install_github("privefl/bigstatsr")

Small example

library(bigstatsr)
 
# Create the data on disk
X <- FBM(5e3, 10e3, backingfile = "test")$save()
# If you open a new session you can do
X <- big_attach("test.rds")
 
# Fill it by chunks with random values
U <- matrix(0, nrow(X), 5); U[] <- rnorm(length(U))
V <- matrix(0, ncol(X), 5); V[] <- rnorm(length(V))
NCORES <- nb_cores()
# X = U V^T + E
big_apply(X, a.FUN = function(X, ind, U, V) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL  ## you don't want to return anything here
}, a.combine = 'c', ncores = NCORES, U = U, V = V)
# Check some values
X[1:5, 1:5]
 
# Compute first 10 PCs
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(), 
                         k = 10, ncores = NCORES)
plot(obj.svd)
 
# Cleanup
unlink(paste0("test", c(".bk", ".rds")))

Learn more with this introduction to package {bigstatsr}.

Bug report / Help

Please open an issue if you find a bug. If you want help using {bigstatsr}, please post on Stack Overflow with the tag bigstatsr, and include a reproducible example (see "How to make a great R reproducible example?").

Use cases

Parallelization

Package {bigstatsr} uses package {foreach} for its parallelization tasks. Learn more on parallelism with {foreach} with this tutorial.
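
For readers unfamiliar with {foreach}, here is a minimal sketch of a parallel loop using {foreach} with the {doParallel} backend. It only illustrates the general pattern and is not taken from the internals of {bigstatsr}; the number of workers (2) is an arbitrary assumption.

```r
library(foreach)
library(doParallel)

# Start a small cluster and register it as the parallel backend
cl <- parallel::makeCluster(2)  # two workers; adjust to your machine
registerDoParallel(cl)

# Each iteration runs on a worker; results are combined with c()
squares <- foreach(i = 1:4, .combine = 'c') %dopar% {
  i^2
}

# Always release the workers when done
parallel::stopCluster(cl)
squares
```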

Large datasets

News

bigstatsr 0.9.0

  • Use mio instead of boost for memory-mapping.

  • Add a parameter base.row to predict.big_sp_list() and automatically detect if needed (as well as for covar.row).

  • Possibility to subset a big_sp_list without losing attributes, so that one can access one model (corresponding to one alpha) even if it is not the 'best'.

  • Add parameters pf.X and pf.covar in big_sp***Reg() to provide different penalization for each variable (possibly no penalization at all).

bigstatsr 0.8.4

  • Add %*%, crossprod and tcrossprod operations for 'double' FBMs.
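
As an illustration of these operators, here is a minimal sketch on a small temporary FBM of type double (the default type). The data and the matrix `A` are made up for the example.

```r
library(bigstatsr)
set.seed(1)

# 10 x 5 FBM of type double, filled with random values
X <- FBM(10, 5, init = rnorm(50))
A <- matrix(rnorm(5 * 3), 5, 3)

XA  <- X %*% A        # product with a standard R matrix (10 x 3)
XtX <- crossprod(X)   # t(X) %*% X (5 x 5)
XXt <- tcrossprod(X)  # X %*% t(X) (10 x 10)
```

The results are ordinary R matrices, so they should match what you get by first loading the FBM in memory with `X[]`.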

bigstatsr 0.8.3

  • Now also returns the number of non-zero variables ($nb_active) and the number of candidate variables ($nb_candidate) for each step of the regularization paths of big_spLinReg() and big_spLogReg().

bigstatsr 0.8.0

  • Parameters warn and return.all of big_spLinReg() and big_spLogReg() are deprecated; now always return the maximum information. Now provide two methods (summary and plot) to get a quick assessment of the fitted models.

bigstatsr 0.7.3

  • Check of missing values for input vectors (indices and targets) and matrices (covariables).

  • AUC() is now stricter: it accepts only 0s and 1s for target.
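
A small worked example of AUC() with a valid 0/1 target (toy values chosen for illustration):

```r
library(bigstatsr)

pred   <- c(0.1, 0.4, 0.35, 0.8)  # predicted scores
target <- c(0, 0, 1, 1)           # only 0s and 1s are accepted

# 3 of the 4 (case, control) pairs are ranked correctly
auc <- AUC(pred, target)
auc
```

Targets coded differently (for example 1/2 or TRUE/FALSE) would need to be recoded as 0/1 first.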

bigstatsr 0.7.1

  • $bm() and $bm.desc() have been added in order to get an FBM as a filebacked.big.matrix. This enables using {bigmemory} functions.

bigstatsr 0.7.0

  • Type float added.

bigstatsr 0.6.2

  • big_write added.

bigstatsr 0.6.1

  • big_read now has a filter argument to filter rows, and argument nrow has been removed because it is now determined when reading the first block of data.

  • Removed the save argument from FBM (and others); now, you must use FBM(...)$save() instead of FBM(..., save = TRUE).

bigstatsr 0.6.0

  • You can now fill an FBM using a data frame. Note that factors will be used as integers.

  • Package {bigreadr} has been developed and is now used by big_read.
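
A minimal sketch of filling an FBM from a data frame, using the built-in `iris` dataset (its `Species` column is a factor, so it is stored as integer codes):

```r
library(bigstatsr)

# 150 x 5 FBM of type double, filled directly from a data frame
X <- FBM(150, 5)
X[] <- iris

X[1:3, ]           # first rows; column 5 holds the factor codes
unique(X[, 5])     # integer codes of the Species factor
```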

bigstatsr 0.5.0

  • There have been some changes regarding how conversion between types is checked. Before, you would get a warning for any possible loss of precision (without actually checking it). Now, any loss of precision due to conversion between types is reported as a warning, and only in this case. If you want to disable this feature, you can use options(bigstatsr.downcast.warning = FALSE), or you can use without_downcast_warning() to disable this warning for one call.
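
A short sketch of silencing the warning for a single call, assuming an integer FBM and a value (1.5) that cannot be stored exactly as an integer:

```r
library(bigstatsr)

# Integer FBM; assigning 1.5 loses precision and would normally warn
X <- FBM(2, 2, type = "integer", init = 0L)

# Disable the downcast warning for this one call only
status <- tryCatch({
  without_downcast_warning(X[] <- 1.5)
  "no warning"
}, warning = function(w) "warning")
status
```

Use the global option instead if you want to disable the warning for a whole session.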

bigstatsr 0.4.1

  • Change big_read so that it is faster (corresponding vignette updated).

bigstatsr 0.4.0

  • Possibility to add a "base predictor" for big_spLinReg and big_spLogReg.

  • Don't store the whole regularization path (as a sparse matrix) in big_spLinReg and big_spLogReg anymore because it caused major slowdowns.

  • Directly average the K predictions in predict.big_sp_best_list.

  • Only use the "PSOCK" type of cluster because "FORK" can leave zombies behind. You can change this with options(bigstatsr.cluster.type = "FORK").

bigstatsr 0.3.4

  • Fix a bug in big_spLinReg related to the computation of summaries.

  • Now provides function plus to be used as the combine argument in big_apply and big_parallelize instead of '+'.
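
A minimal sketch of using `plus` as the combine function: the total sum of an FBM is computed block by block and the partial sums are added together. The block size of 5 columns is an arbitrary choice for illustration.

```r
library(bigstatsr)

# 10 x 20 FBM filled with ones, so the total sum is 200
X <- FBM(10, 20, init = 1)

# Sum each block of 5 columns, then add the partial sums with 'plus'
total <- big_apply(X, a.FUN = function(X, ind) sum(X[, ind]),
                   a.combine = 'plus', block.size = 5)
total
```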

bigstatsr 0.3.3

  • Before, this package used only the "PSOCK" type of cluster, which has some significant overhead. Now, it uses the "FORK" type on non-Windows systems. You can change this with options(bigstatsr.cluster.type = "PSOCK"). Note that version 0.4.0 reverts to using only "PSOCK".

bigstatsr 0.3.2

  • You can now provide multiple $\alpha$ values (as a numeric vector) in big_spLinReg and big_spLogReg. One will be chosen by grid-search.
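
A minimal sketch of passing several $\alpha$ values, on simulated data. The data, the choice of `alphas = c(1, 0.1)` and `K = 5` are assumptions made for illustration only.

```r
library(bigstatsr)
set.seed(1)

# Simulated data: 200 samples, 30 variables, 2 of them truly associated
X <- FBM(200, 30, init = rnorm(200 * 30))
y <- X[, 1] - 2 * X[, 2] + rnorm(200)

# Fit penalized linear regression over a small grid of alpha values
mod <- big_spLinReg(X, y.train = y, alphas = c(1, 0.1), K = 5)

# Predict for all rows and compare with the simulated outcome
pred <- predict(mod, X)
cor(pred, y)
```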

bigstatsr 0.3.1

  • Fixed a bug in big_prodMat when using a dimension of 1 or 0.

bigstatsr 0.3.0

bigstatsr 0.2.6

  • No scaling is used by default for big_crossprod, big_tcrossprod, big_SVD and big_randomSVD (before, there was no default at all).

bigstatsr 0.2.4

  • Integrate Cross-Model Selection and Averaging (CMSA) directly in big_spLinReg and big_spLogReg, a procedure that automatically chooses the value of the $\lambda$ hyper-parameter.

  • Speed up big_spLinReg and big_spLogReg (issue #12)

bigstatsr 0.2.3

  • Speed up AUC computations

bigstatsr 0.2.0

  • No longer use the big.matrix format of package bigmemory


install.packages("bigstatsr")

0.9.1 by Florian Privé, 2 months ago


https://privefl.github.io/bigstatsr


Report a bug at https://github.com/privefl/bigstatsr/issues


Browse source code at https://github.com/cran/bigstatsr


Authors: Florian Privé [aut, cre], Michael Blum [ths], Hugues Aschard [ths]


Documentation: PDF Manual


Task views: High-Performance and Parallel Computing with R


GPL-3 license


Imports bigreadr, cowplot, doParallel, foreach, ggplot2, graphics, methods, parallel, Rcpp, RSpectra, stats, tibble, utils

Suggests bigalgebra, biglasso, bigmemory, covr, glmnet, hexbin, microbenchmark, ModelMetrics, RcppEigen, RhpcBLASctl, spelling, testthat

Linking to Rcpp, RcppArmadillo, rmio


Imported by bigdist.

