# Statistical Tools for Filebacked Big Matrices

Easy-to-use, efficient, flexible and scalable statistical tools. Package bigstatsr provides and uses Filebacked Big Matrices via memory-mapping. It provides, for instance, matrix operations, Principal Component Analysis, sparse linear supervised models, utility functions and more.

R package {bigstatsr} provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory thanks to memory-mapping to binary files on disk. This is very similar to the format big.matrix provided by R package {bigmemory}, which is no longer used by this package (see the corresponding vignette).

LIST OF FEATURES

Note that most of the algorithms of this package don't handle missing values.

## Input format

As inputs, package {bigstatsr} uses Filebacked Big Matrices (FBM).
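As a rough illustration of what an FBM looks like in practice, the sketch below creates a small one, fills it, and reads part of it back. It assumes {bigstatsr} is installed. The dimensions and the random values are arbitrary choices made for this example, and the backing file is left at its default temporary location.

```r
library(bigstatsr)

# Create a 10 x 5 FBM of doubles, initialized to 0.
# The data live in a binary file on disk, not in R's memory.
X <- FBM(10, 5, type = "double", init = 0)

# Fill the whole matrix using the standard matrix replacement syntax
# (50 values for 10 * 5 cells; sizes are arbitrary for this sketch).
X[] <- rnorm(50)

# Standard accessors return ordinary in-memory R objects
dims   <- dim(X)        # dimensions of the FBM
corner <- X[1:3, 1:2]   # a regular 3 x 2 R matrix

# Path of the binary file that backs the matrix
bk_path <- X$backingfile
```

Only the parts of the matrix that are accessed are read from disk, which is what allows working with matrices larger than the available memory.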

To memory-map character text files, see package {mmapcharr}.

## Bug report / Help

Please open an issue if you find a bug. If you want help using {bigstatsr}, please post on Stack Overflow with the tag bigstatsr (not yet created). See "How to make a great R reproducible example?" for guidance on writing your question.

## Use cases

### Parallelisation

Package {bigstatsr} uses package {foreach} for its parallelization tasks. To learn more about parallelism with {foreach}, see this tutorial.
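To give an idea of how {foreach} can be combined with an FBM, here is a minimal sketch that computes column sums in parallel. It is not taken from the package internals. It assumes that {doParallel} is installed to register the backend, and the cluster size (2 workers), the matrix dimensions and the variable names are all arbitrary choices for this example.

```r
library(bigstatsr)
library(foreach)

# Small FBM where every entry is 1 (dimensions are arbitrary)
X <- FBM(100, 20, init = 1)

# Assumption: {doParallel} is available to register a parallel backend
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# Each worker reads one column from the backing file and sums it.
# `.packages` makes sure {bigstatsr} is loaded on the workers so that
# the FBM accessor is available there.
col_sums <- foreach(j = 1:20, .combine = "c", .packages = "bigstatsr") %dopar% {
  sum(X[, j])
}

parallel::stopCluster(cl)

# Every column holds 100 ones, so each sum is 100
```

Because the data live in a file on disk, the workers access the same backing file rather than each receiving a copy of the matrix contents.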

# News

## bigstatsr 0.6.2

• big_write added.

## bigstatsr 0.6.1

• big_read now has a filter argument to filter rows, and argument nrow has been removed because it is now determined when reading the first block of data.

• Removed the save argument from FBM (and others); now, you must use FBM(...)$save() instead of FBM(..., save = TRUE).

## bigstatsr 0.6.0

• You can now fill an FBM using a data frame. Note that factors will be used as integers.

• Package {bigreadr} has been developed and is now used by big_read.

## bigstatsr 0.5.0

• There have been some changes regarding how conversion between types is checked. Before, you would get a warning for any possible loss of precision (without actually checking it). Now, any loss of precision due to conversion between types is reported as a warning, and only in this case. If you want to disable this feature, you can use options(bigstatsr.downcast.warning = FALSE).

## bigstatsr 0.4.1

• Changed big_read so that it is faster (corresponding vignette updated).

## bigstatsr 0.4.0

• Possibility to add a "base predictor" for big_spLinReg and big_spLogReg.

• Don't store the whole regularization path (as a sparse matrix) in big_spLinReg and big_spLogReg anymore because it caused major slowdowns.

• Directly average the K predictions in predict.big_sp_best_list.

• Only use the "PSOCK" type of cluster because "FORK" can leave zombies behind. You can change this with options(bigstatsr.cluster.type = "PSOCK").

## bigstatsr 0.3.4

• Fix a bug in big_spLinReg related to the computation of summaries.

• Now provides function plus to be used as the combine argument in big_apply and big_parallelize instead of '+'.

## bigstatsr 0.3.3

• Before, this package used only the "PSOCK" type of cluster, which has some significant overhead. Now, it uses the "FORK" type on non-Windows systems. You can change this with options(bigstatsr.cluster.type = "PSOCK").

## bigstatsr 0.3.2

• You can now provide multiple $\alpha$ values (as a numeric vector) in big_spLinReg and big_spLogReg. One will be chosen by grid-search.

## bigstatsr 0.3.1

• Fixed a bug in big_prodMat when using a dimension of 1 or 0.

## bigstatsr 0.3.0

## bigstatsr 0.2.6

• No scaling is used by default for big_crossprod, big_tcrossprod, big_SVD and big_randomSVD (before, there was no default at all).

## bigstatsr 0.2.4

• Integrate Cross-Model Selection and Averaging (CMSA) directly in big_spLinReg and big_spLogReg, a procedure that automatically chooses the value of the $\lambda$ hyper-parameter.

• Speed up big_spLinReg and big_spLogReg (issue #12)

## bigstatsr 0.2.3

• Speed up AUC computations

## bigstatsr 0.2.0

• No longer use the big.matrix format of package bigmemory

# Reference manual

install.packages("bigstatsr")

0.6.2 by Florian Privé, 6 months ago

https://privefl.github.io/bigstatsr

Report a bug at https://github.com/privefl/bigstatsr/issues

Browse source code at https://github.com/cran/bigstatsr

Authors: Florian Privé [aut, cre], Michael Blum [ths], Hugues Aschard [ths]

Documentation:   PDF Manual