Extend lasso and elastic-net model fitting for ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It is much more memory- and computation-efficient than existing lasso-fitting packages such as 'glmnet' and 'ncvreg', thus allowing for very powerful big data analysis even on an ordinary laptop.

`biglasso` extends lasso and elastic-net linear and logistic regression models to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It utilizes memory-mapped files to store the massive data on disk, reading it into memory only when necessary during model fitting. Moreover, advanced feature screening rules are proposed and implemented to accelerate model fitting. **As a result, this package is much more memory- and computation-efficient and highly scalable as compared to existing lasso-fitting packages such as glmnet and ncvreg.** Benchmarking experiments using both simulated and real data sets show that `biglasso` is not only 1.5x to 4x faster than existing packages, but also at least 2x more memory-efficient. More importantly, to the best of our knowledge, `biglasso` is the first R package that enables users to fit lasso models with data sets larger than available RAM, thus allowing for powerful big data analysis on an ordinary laptop.

- It utilizes memory-mapped files to store the massive data on the disk, only loading data into memory when necessary during model fitting. Consequently, it's able to seamlessly handle out-of-core computation.
- It is built upon the pathwise coordinate descent algorithm with *warm start, active set cycling, and feature screening* strategies, which has been proven to be one of the fastest lasso solvers.
- We develop new, hybrid feature screening rules that outperform state-of-the-art screening rules such as the sequential strong rule (SSR) and the sequential EDPP rule (SEDPP), with an additional 1.5x to 4x speedup.
- The implementation is designed to be as memory-efficient as possible by eliminating the extra copies of the data created by other R packages, making `biglasso` at least 2x more memory-efficient than `glmnet`.
- The underlying computation is implemented in C++, and parallel computing with OpenMP is also supported.
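The workflow implied by these features can be sketched as follows. This is a minimal illustration, assuming the `bigmemory` and `biglasso` packages are installed; the file names and dimensions are placeholders, not part of the benchmarks described below.

```r
# Sketch of the out-of-core workflow (assumes the bigmemory and
# biglasso packages are installed; file names are placeholders).
library(bigmemory)
library(biglasso)

# Store the design matrix as a file-backed big.matrix: the data live
# on disk and are memory-mapped, so only the pages touched during
# fitting are brought into RAM.
X <- as.big.matrix(matrix(rnorm(100 * 1000), 100, 1000),
                   backingfile = "X.bin", descriptorfile = "X.desc")
y <- rnorm(100)

# Fit the lasso path; screen = "SSR-BEDPP" selects the hybrid
# screening rule, and ncores > 1 enables OpenMP parallelism.
fit <- biglasso(X, y, family = "gaussian",
                screen = "SSR-BEDPP", ncores = 4)

# 10-fold cross-validation reuses the same memory-mapped matrix,
# so it needs essentially no extra memory.
cvfit <- cv.biglasso(X, y, seed = 1234, nfolds = 10, ncores = 4)
```

In later sessions, the matrix can be reattached via its descriptor file instead of being recreated.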

**Packages** to be compared: `biglasso (1.2-3)` (with `screen = "SSR-BEDPP"`), `glmnet (2.0-5)`, `ncvreg (3.7-0)`, and `picasso (0.5-4)`.

**Platform**: MacBook Pro with Intel Core i7 @ 2.3 GHz and 16 GB RAM.

**Experiments**: solving lasso-penalized linear regression over the entire path of 100 $\lambda$ values equally spaced on the scale of `lambda / lambda_max` from 0.1 to 1; varying the number of observations `n` and the number of features `p`; 20 replications; the mean (SE) computing time (in seconds) is reported.

**Data generating model**: `y = X * beta + 0.1 * eps`, where the entries of `X` and `eps` are i.i.d. sampled from `N(0, 1)`.
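The data-generating step can be reproduced in base R as follows (dimensions shrunk for illustration; the sparsity pattern of `beta` is an assumption, since the text does not specify the true coefficients):

```r
# Simulate from the benchmark model y = X * beta + 0.1 * eps,
# with entries of X and eps i.i.d. N(0, 1). Dimensions are shrunk
# here; the experiments vary n and p. The true beta is not specified
# in the text, so a sparse choice is assumed for illustration.
set.seed(1234)
n <- 100
p <- 1000
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- c(rep(1, 10), rep(0, p - 10))  # assumed sparse signal
eps <- rnorm(n)
y <- as.numeric(X %*% beta + 0.1 * eps)
```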

**`biglasso` is more computation-efficient**: In all the settings, `biglasso` (1 core) is uniformly 2x faster than `glmnet` and `ncvreg`, and 2.5x faster than `picasso`. Moreover, the computing time of `biglasso` can be further reduced by half via parallel computation with 4 cores.

**`biglasso` is more memory-efficient**: To demonstrate that `biglasso` is much more memory-efficient, we simulate a `1000 X 100000` large feature matrix. The raw data is 0.75 GB. We used Syrupy to measure the memory used in RAM (i.e., the resident set size, RSS) every 1 second during lasso model fitting by each of the packages.

The maximum RSS (in **GB**) used by a single fit and by 10-fold cross-validation is reported in the table below. In the single-fit case, `biglasso` consumes 0.84 GB of memory in RAM, 50% of that used by `glmnet` and 22% of that used by `picasso`. Note that the memory consumed by `glmnet`, `ncvreg`, and `picasso` is respectively 2.2x, 2.1x, and 5.1x larger than the size of the raw data. More strikingly, `biglasso` does not require additional memory to perform cross-validation, unlike the other packages. For serial 10-fold cross-validation, `biglasso` requires just 27% of the memory used by `glmnet` and 23% of that used by `ncvreg`, making it 3.6x and 4.3x more memory-efficient than these two, respectively.

Maximum RSS (GB) | picasso | ncvreg | glmnet | biglasso |
---|---|---|---|---|
Single fit | 3.84 | 1.60 | 1.67 | 0.84 |
10-fold CV | - | 3.74 | 3.18 | 0.87 |

**Note**:

- the memory savings offered by `biglasso` would be even more significant if cross-validation were conducted in parallel; however, measuring memory usage across parallel processes is not straightforward and is not implemented in `Syrupy`;
- cross-validation is not implemented in `picasso` at this point.

The performance of the packages is also tested using diverse real data sets:

- Breast cancer gene expression data (GENE);
- MNIST handwritten image data (MNIST);
- Cardiac fibrosis genome-wide association study data (GWAS);
- Subset of New York Times bag-of-words data (NYT).

The following table summarizes the mean (SE) computing time (in seconds) of solving the lasso along the entire path of 100 `lambda` values equally spaced on the scale of `lambda / lambda_max` from 0.1 to 1, over 20 replications.

Package | GENE<br>`n=536`<br>`p=17,322` | MNIST<br>`n=784`<br>`p=60,000` | GWAS<br>`n=313`<br>`p=660,495` | NYT<br>`n=5,000`<br>`p=55,000` |
---|---|---|---|---|
picasso | 1.50 (0.01) | 6.86 (0.06) | 34.00 (0.47) | 44.24 (0.46) |
ncvreg | 1.14 (0.02) | 5.60 (0.06) | 31.55 (0.18) | 32.78 (0.10) |
glmnet | 1.02 (0.01) | 5.63 (0.05) | 23.23 (0.19) | 33.38 (0.08) |
biglasso | 0.54 (0.01) | 1.48 (0.10) | 17.17 (0.11) | 14.35 (1.29) |
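For reference, the penalty grid used in these timings can be built in base R. The `lambda_max` formula below is the standard one for the standardized lasso and is an assumption here, since the text does not spell it out:

```r
# Build a grid of 100 lambda values equally spaced on the scale of
# lambda / lambda_max, from 0.1 to 1 (note: linear spacing, unlike
# glmnet's default log spacing). lambda_max is taken as the smallest
# penalty at which all coefficients are zero: max_j |x_j' y| / n
# (assumed formula, for standardized columns).
lambda_path <- function(X, y, nlambda = 100, lambda_min_ratio = 0.1) {
  lambda_max <- max(abs(crossprod(X, y))) / nrow(X)
  seq(lambda_min_ratio, 1, length.out = nlambda) * lambda_max
}
```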

To demonstrate the out-of-core computing capability of `biglasso`, a 31 GB real data set from a large-scale genome-wide association study is analyzed. The dimensions of the design matrix are `n = 2,898`, `p = 1,339,511`. **Note that the size of the data is nearly 2x larger than the installed 16 GB of RAM.**
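A raw file of this size would first be converted once into a file-backed `big.matrix`. A hedged sketch using `biglasso::setupX` follows; the file names are placeholders, and the exact argument defaults should be checked against the package documentation:

```r
# One-time setup: read the raw file in chunks and create a
# memory-mapped backing file on disk (file name is a placeholder).
library(biglasso)
X <- setupX("gwas.raw", sep = "\t")

# In later sessions, reattach the descriptor instead of re-reading
# the raw 31 GB file; this is nearly instantaneous.
# X <- attach.big.matrix("gwas.raw.desc")  # descriptor name illustrative
```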

Since the other three packages cannot handle this data-larger-than-RAM case, we compare the performance of the screening rules `SSR` and `SSR-BEDPP` within our package `biglasso`. In addition, two cases of `lambda_min` are considered: (1) `lam_min = 0.1 lam_max`; and (2) `lam_min = 0.5 lam_max`, as in practice there is typically less interest in lower values of `lambda` for very high-dimensional data such as this. Again, the entire solution path with 100 `lambda` values is obtained. The table below summarizes the overall computing time (in **minutes**) for the screening rule `SSR` (which is what the other three packages use) and our new rule `SSR-BEDPP`. (Only 1 trial is conducted.)

Cases | SSR | SSR-BEDPP |
---|---|---|
`lam_min / lam_max = 0.1`, 1 core | 284.56 | 189.21 |
`lam_min / lam_max = 0.1`, 4 cores | 142.55 | 93.74 |
`lam_min / lam_max = 0.5`, 1 core | 285.61 | 102.75 |
`lam_min / lam_max = 0.5`, 4 cores | 141.28 | 51.02 |
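In `biglasso`, switching between the two rules is just a matter of the `screen` argument. A sketch of the comparison, assuming the design matrix has already been set up as a file-backed `big.matrix` (file names and the phenotype source are placeholders):

```r
# Compare screening rules on a memory-mapped GWAS design matrix.
# Assumes the backing files already exist; file names are placeholders.
library(biglasso)
X <- attach.big.matrix("gwas.raw.desc")  # memory-mapped, not loaded into RAM
y <- readRDS("phenotype.rds")            # hypothetical phenotype vector

# lambda.min sets lam_min / lam_max; the default 100-value path is used.
t_ssr <- system.time(
  biglasso(X, y, screen = "SSR", lambda.min = 0.1, ncores = 4)
)
t_bedpp <- system.time(
  biglasso(X, y, screen = "SSR-BEDPP", lambda.min = 0.1, ncores = 4)
)
```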

- The stable version: `install.packages("biglasso")`
- The latest version: `devtools::install_github("YaohuiZeng/biglasso")`

- Open an issue or send an email to Yaohui Zeng at [email protected]

- This package on GitHub has been updated to Version 1.2-5. See details in NEWS.
- The newest stable version will be submitted to CRAN soon after testing.

- `-` Minor change; `**` Major change

1.3-6 (04/12/2017)

- optimized the code for computing the Slores rule.
- added Slores screening without active cycling (-NAC) for logistic regression, for research use only.
- corrected BEDPP for elastic net.
- fixed a bug related to "exporting SSR-BEDPP".

1.3-5 (03/29/2017)

- redocumented using Roxygen2.
- registered native routines for faster and more stable performance.

1.3-4 (01/29/2017)

- fixed a bug related to the `dfmax` option. (Thank you, Florian Privé!)

1.3-3 (01/24/2017)

- fixed bugs related to KKT checking for elastic net. (Thank you, Florian Privé!)
- added references for screening rules and the technical paper of biglasso package.

1.3-2 (01/16/2017)

- added screening methods without active cycling (-NAC) for comparison, for research use only.
- fixed a bug related to numeric comparison in Dome test.

1.3-1 (12/24/2016)

- fixed bug in SSR-Slores related to numeric equality comparison.

1.3-0 (12/15/2016)

- version 1.3-0 for CRAN submission.

1.2-6 (12/15/2016)

** added a newly proposed screening rule, SSR-Slores, for lasso-penalized logistic regression.
** added SSR-BEDPP for elastic-net-penalized linear regression.

1.2-5 (12/10/2016)

- updated README.md with benchmarking results.
- added tutorial (vignette).

1.2-4 (11/14/2016)

- added gaussian.cpp: solve lasso without screening, for research only.
- added tests.

1.2-3 (11/13/2016)

- changed convergence criteria of logistic regression to be the same as that in glmnet.
- optimized source code; preparing for CRAN submission.
- fixed memory leaks that occurred on Windows.

1.2-2 (10/27/2016)

- added internal data set: the colon cancer data.

1.2-1 (10/18/2016)

** Implemented another new screening rule (SSR-BEDPP), also combining the hybrid strong rule with a safe rule (BEDPP).
** Implemented the EDPP rule with the active set cycling strategy for linear regression.

- changed convergence criteria to be the same as that in glmnet.

1.1-2 (9/1/2016)

- fixed bugs that occurred when some features have identical values for different observations. Such features are internally removed from model fitting.

1.1-1 (8/31/2016)

** Three sparse screening rules (SSR, EDPP, SSR-Dome) were implemented. Our newly proposed HSR-Dome combines HSR and the Dome test for feature screening, leading to even better performance as compared to 'glmnet'.
** OpenMP parallel computing was added to speed up single model fitting.
** Both the exact Newton and the majorization-minimization (MM) algorithms for logistic regression were implemented. The latter can be faster, especially in data-larger-than-RAM cases.
** Source code was rewritten in pure C++.

- Sparse matrix representation was added using the Armadillo library.

1.0-1 (3/1/2016)

** Package ready for CRAN submission.