Statistical Tools for Filebacked Big Matrices

Easy-to-use, efficient, flexible and scalable statistical tools. Package bigstatsr provides and uses Filebacked Big Matrices via memory-mapping. It provides, for instance, matrix operations, Principal Component Analysis, sparse linear supervised models, utility functions and more.



R package {bigstatsr} provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory thanks to memory-mapping to binary files on disk. This is very similar to the format big.matrix provided by R package {bigmemory}, which is no longer used by this package (see the corresponding vignette).

LIST OF FEATURES

Note that most of the algorithms of this package don't handle missing values.
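Since most algorithms assume complete data, it can be useful to check for missing values before running them. The following is a minimal sketch, assuming a small toy FBM created only for illustration; it counts missing values per column, one block of columns at a time, so the whole matrix is never loaded at once.

```r
library(bigstatsr)

# Toy FBM backed by a temporary file (illustrative data only)
X <- FBM(10, 4, init = 1)
X[2, 3] <- NA_real_

# Count missing values per column, one block of columns at a time
n_na <- big_apply(X, a.FUN = function(X, ind) {
  colSums(is.na(X[, ind, drop = FALSE]))
}, a.combine = 'c')

n_na           # number of missing values in each column
any(n_na > 0)  # TRUE means the data need imputation or filtering first
```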

Installation

# For the current development version
devtools::install_github("privefl/bigstatsr")

Small example

library(bigstatsr)
 
# Create the data on disk
X <- FBM(5e3, 10e3, backingfile = "test")$save()
# If you open a new session you can do
X <- big_attach("test.rds")
 
# Fill it by chunks with random values
U <- matrix(0, nrow(X), 5); U[] <- rnorm(length(U))
V <- matrix(0, ncol(X), 5); V[] <- rnorm(length(V))
NCORES <- nb_cores()
# X = U V^T + E
big_apply(X, a.FUN = function(X, ind, U, V) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL  ## you don't want to return anything here
}, a.combine = 'c', ncores = NCORES, U = U, V = V)
# Check some values
X[1:5, 1:5]
 
# Compute first 10 PCs
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(), 
                         k = 10, ncores = NCORES)
plot(obj.svd)
 
# Cleanup
unlink(paste0("test", c(".bk", ".rds")))

Learn more with this introduction to package {bigstatsr}.

Input format

As inputs, package {bigstatsr} uses Filebacked Big Matrices (FBM).
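As an illustration, an ordinary in-memory matrix can be converted to an FBM with as_FBM. This is a minimal sketch with made-up data; the backing file is written to a temporary location because no backingfile is given.

```r
library(bigstatsr)

# Made-up in-memory matrix, used only for illustration
mat <- matrix(rnorm(20), nrow = 4, ncol = 5)

# Copy the data to a backing file on disk and return the FBM object
X <- as_FBM(mat)

dim(X)                      # same dimensions as 'mat'
file.exists(X$backingfile)  # the data now live in this binary file
all.equal(X[], mat)         # X[] reads the whole matrix back into memory
```

Accessing X[] is only reasonable for small matrices; for large ones, read subsets such as X[, 1:10] instead.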

To memory-map character text files, see package {mmapcharr}.

Bug report / Help

Please open an issue if you find a bug. If you want help using {bigstatsr}, please post on Stack Overflow with the tag bigstatsr (not yet created). For guidance on writing your question, see "How to make a great R reproducible example?".

Use cases

Parallelisation

Package {bigstatsr} uses package {foreach} for its parallelization tasks. Learn more about parallelism with {foreach} in this tutorial.
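For readers new to {foreach}, here is a minimal sketch of a parallel loop, independent of {bigstatsr} itself. The number of workers (2) and the toy computation are arbitrary choices for illustration.

```r
library(foreach)

# Start a cluster of 2 workers (an arbitrary choice) and register it
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# Each iteration runs on a worker; results are combined with c()
squares <- foreach(i = 1:4, .combine = 'c') %dopar% {
  i^2
}

# Always stop the cluster when done
parallel::stopCluster(cl)
squares
```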

Large datasets

News

bigstatsr 0.6.2

  • big_write added.

bigstatsr 0.6.1

  • big_read now has a filter argument to filter rows, and argument nrow has been removed because it is now determined when reading the first block of data.

  • Removed the save argument from FBM (and others); now, you must use FBM(...)$save() instead of FBM(..., save = TRUE).

bigstatsr 0.6.0

  • You can now fill an FBM using a data frame. Note that factors will be used as integers.

  • Package {bigreadr} has been developed and is now used by big_read.
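As a short sketch of filling an FBM from a data frame, the example below uses the built-in iris dataset, chosen here only for illustration. The factor column Species ends up stored as its integer codes.

```r
library(bigstatsr)

# FBM with the same dimensions as the data frame (type "double" by default)
X <- FBM(nrow(iris), ncol(iris))

# Fill the whole FBM from the data frame
X[] <- iris

X[1:3, ]
# The factor column Species is stored as its integer codes
table(X[, 5])
```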

bigstatsr 0.5.0

  • There have been some changes regarding how conversion between types is checked. Before, you would get a warning for any possible loss of precision (without actually checking it). Now, any loss of precision due to conversion between types is reported as a warning, and only in this case. If you want to disable this feature, you can use options(bigstatsr.downcast.warning = FALSE).
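The behaviour described above can be sketched as follows, assuming a small integer FBM created for illustration. The comments state what is expected from the description in this entry rather than verified output.

```r
library(bigstatsr)

# Integer FBM backed by a temporary file
X <- FBM(2, 2, type = "integer", init = 0)

X[1, 1] <- 2     # 2 is exactly representable as an integer: no warning expected
X[1, 2] <- 1.5   # 1.5 cannot be stored exactly: a warning is expected

# Disable these warnings for the rest of the session
options(bigstatsr.downcast.warning = FALSE)
X[2, 1] <- 2.5   # no warning now, but the stored value is still not 2.5
X[]
```

Disabling the warning does not make the conversion lossless; it only hides the message.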

bigstatsr 0.4.1

  • change big_read so that it is faster (corresponding vignette updated).

bigstatsr 0.4.0

  • possibility to add a "base predictor" for big_spLinReg and big_spLogReg.

  • don't store the whole regularization path (as a sparse matrix) in big_spLinReg and big_spLogReg anymore because it caused major slowdowns.

  • directly average the K predictions in predict.big_sp_best_list.

  • only use the "PSOCK" type of cluster because "FORK" can leave zombies behind. You can change this with options(bigstatsr.cluster.type = "FORK").

bigstatsr 0.3.4

  • Fix a bug in big_spLinReg related to the computation of summaries.

  • Now provides function plus to be used as the combine argument in big_apply and big_parallelize instead of '+'.
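A minimal sketch of using plus as the combine argument: row sums are computed per block of columns and then added element-wise across blocks. The toy matrix and the block size of 2 are arbitrary choices for illustration.

```r
library(bigstatsr)

# Toy FBM filled with ones
X <- FBM(3, 6, init = 1)

# Each block of 2 columns yields a vector of partial row sums;
# 'plus' adds these vectors element-wise across blocks
row_sums <- big_apply(X, a.FUN = function(X, ind) {
  rowSums(X[, ind, drop = FALSE])
}, a.combine = 'plus', block.size = 2)

row_sums
```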

bigstatsr 0.3.3

  • Before, this package used only the "PSOCK" type of cluster, which has some significant overhead. Now, it uses the "FORK" type on non-Windows systems. You can change this with options(bigstatsr.cluster.type = "PSOCK").

bigstatsr 0.3.2

  • you can now provide multiple $\alpha$ values (as a numeric vector) in big_spLinReg and big_spLogReg. One will be chosen by grid-search.

bigstatsr 0.3.1

  • fixed a bug in big_prodMat when using a dimension of 1 or 0.

bigstatsr 0.3.0

bigstatsr 0.2.6

  • no scaling is used by default for big_crossprod, big_tcrossprod, big_SVD and big_randomSVD (before, there was no default at all)
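To make the effect of this default concrete, the sketch below compares big_SVD with and without scaling against base R's svd on a small made-up matrix. The data, seed, and k = 2 are arbitrary choices for illustration.

```r
library(bigstatsr)

set.seed(1)
mat <- matrix(rnorm(100), nrow = 20, ncol = 5)
X <- as_FBM(mat)

# Default: columns are neither centered nor scaled
svd_raw <- big_SVD(X, k = 2)

# Explicitly center and scale each column
svd_scaled <- big_SVD(X, fun.scaling = big_scale(), k = 2)

# Compare the singular values with base R on the same data
all.equal(svd_raw$d, svd(mat)$d[1:2])
all.equal(svd_scaled$d, svd(scale(mat))$d[1:2])
```

For Principal Component Analysis, passing fun.scaling = big_scale() is usually what you want, as in the small example of this README.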

bigstatsr 0.2.4

  • Integrate Cross-Model Selection and Averaging (CMSA) directly in big_spLinReg and big_spLogReg, a procedure that automatically chooses the value of the $\lambda$ hyper-parameter.

  • Speed up big_spLinReg and big_spLogReg (issue #12)

bigstatsr 0.2.3

  • Speed up AUC computations

bigstatsr 0.2.0

  • No longer use the big.matrix format of package bigmemory


install.packages("bigstatsr")

0.6.2 by Florian Privé, 6 months ago


https://privefl.github.io/bigstatsr


Report a bug at https://github.com/privefl/bigstatsr/issues


Browse source code at https://github.com/cran/bigstatsr


Authors: Florian Privé [aut, cre] , Michael Blum [ths] , Hugues Aschard [ths]




Task views: High-Performance and Parallel Computing with R


GPL-3 license


Imports bigreadr, cowplot, doParallel, foreach, ggplot2, graphics, methods, parallel, Rcpp, RSpectra, stats, utils

Suggests biglasso, bigmemory, covr, glmnet, grid, ModelMetrics, RcppEigen, RhpcBLASctl, spelling, testthat

Linking to BH, Rcpp, RcppArmadillo

