Tools for Building OLS Regression Models

Tools designed to make it easier for users, particularly beginner and intermediate R users, to build ordinary least squares regression models. Includes comprehensive regression output, heteroskedasticity tests, collinearity diagnostics, residual diagnostics, measures of influence, model fit assessment and variable selection procedures.



Overview

The olsrr package provides the following tools for building OLS regression models using R:

  • Comprehensive Regression Output
  • Variable Selection Procedures
  • Heteroskedasticity Tests
  • Collinearity Diagnostics
  • Model Fit Assessment
  • Measures of Influence
  • Residual Diagnostics
  • Variable Contribution Assessment

Installation

You can install the released version of olsrr from CRAN, or the development version from GitHub, with:

# install olsrr from CRAN
install.packages("olsrr")
 
# the development version from github
# install.packages("devtools")
devtools::install_github("rsquaredacademy/olsrr")

Shiny App

Use ols_launch_app() to explore the package using a shiny app.
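
For example, from an interactive R session:

library(olsrr)
ols_launch_app()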

Usage

olsrr uses a consistent prefix ols_ for easy tab completion.

olsrr is built with the aim of helping users who are new to the R language. If you know how to write a formula or build models using lm, you will find olsrr very useful. Most of the functions take an object of class lm as input, so you just need to build a model using lm and then pass it to the functions in olsrr. Below is a quick demo:

Regression

ols_regress(mpg ~ disp + hp + wt + qsec, data = mtcars)
#>                         Model Summary                          
#> --------------------------------------------------------------
#> R                       0.914       RMSE                2.622 
#> R-Squared               0.835       Coef. Var          13.051 
#> Adj. R-Squared          0.811       MSE                 6.875 
#> Pred R-Squared          0.771       MAE                 1.858 
#> --------------------------------------------------------------
#>  RMSE: Root Mean Square Error 
#>  MSE: Mean Square Error 
#>  MAE: Mean Absolute Error 
#> 
#>                                ANOVA                                 
#> --------------------------------------------------------------------
#>                 Sum of                                              
#>                Squares        DF    Mean Square      F         Sig. 
#> --------------------------------------------------------------------
#> Regression     940.412         4        235.103    34.195    0.0000 
#> Residual       185.635        27          6.875                     
#> Total         1126.047        31                                    
#> --------------------------------------------------------------------
#> 
#>                                   Parameter Estimates                                    
#> ----------------------------------------------------------------------------------------
#>       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
#> ----------------------------------------------------------------------------------------
#> (Intercept)    27.330         8.639                  3.164    0.004     9.604    45.055 
#>        disp     0.003         0.011        0.055     0.248    0.806    -0.019     0.025 
#>          hp    -0.019         0.016       -0.212    -1.196    0.242    -0.051     0.013 
#>          wt    -4.609         1.266       -0.748    -3.641    0.001    -7.206    -2.012 
#>        qsec     0.544         0.466        0.161     1.166    0.254    -0.413     1.501 
#> ----------------------------------------------------------------------------------------
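
As noted above, most olsrr functions take a fitted lm object as input. A minimal sketch of that workflow; the two diagnostic functions shown here are just illustrative picks from the package:

library(olsrr)

# fit the model once with lm ...
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# ... and pass the fitted object to olsrr functions
ols_plot_resid_qq(model)     # normal Q-Q plot of the residuals
ols_test_normality(model)    # normality tests for the residuals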

Stepwise Regression

Build a regression model from a set of candidate predictor variables by entering and removing predictors based on p values, in a stepwise manner, until there are no variables left to enter or remove.

Variable Selection

# stepwise regression
model <- lm(y ~ ., data = surgical)
ols_step_both_p(model)
#> Stepwise Selection Method   
#> ---------------------------
#> 
#> Candidate Terms: 
#> 
#> 1. bcs 
#> 2. pindex 
#> 3. enzyme_test 
#> 4. liver_test 
#> 5. age 
#> 6. gender 
#> 7. alc_mod 
#> 8. alc_heavy 
#> 
#> We are selecting variables based on p value...
#> 
#> Variables Entered/Removed: 
#> 
#> - liver_test added 
#> - alc_heavy added 
#> - enzyme_test added 
#> - pindex added 
#> - bcs added 
#> 
#> No more variables to be added/removed.
#> 
#> 
#> Final Model Output 
#> ------------------
#> 
#>                           Model Summary                           
#> -----------------------------------------------------------------
#> R                       0.884       RMSE                 195.454 
#> R-Squared               0.781       Coef. Var             27.839 
#> Adj. R-Squared          0.758       MSE                38202.426 
#> Pred R-Squared          0.700       MAE                  137.656 
#> -----------------------------------------------------------------
#>  RMSE: Root Mean Square Error 
#>  MSE: Mean Square Error 
#>  MAE: Mean Absolute Error 
#> 
#>                                  ANOVA                                  
#> -----------------------------------------------------------------------
#>                    Sum of                                              
#>                   Squares        DF    Mean Square      F         Sig. 
#> -----------------------------------------------------------------------
#> Regression    6535804.090         5    1307160.818    34.217    0.0000 
#> Residual      1833716.447        48      38202.426                     
#> Total         8369520.537        53                                    
#> -----------------------------------------------------------------------
#> 
#>                                       Parameter Estimates                                        
#> ------------------------------------------------------------------------------------------------
#>       model         Beta    Std. Error    Std. Beta      t        Sig         lower       upper 
#> ------------------------------------------------------------------------------------------------
#> (Intercept)    -1178.330       208.682                 -5.647    0.000    -1597.914    -758.746 
#>  liver_test       58.064        40.144        0.156     1.446    0.155      -22.652     138.779 
#>   alc_heavy      317.848        71.634        0.314     4.437    0.000      173.818     461.878 
#> enzyme_test        9.748         1.656        0.521     5.887    0.000        6.419      13.077 
#>      pindex        8.924         1.808        0.380     4.935    0.000        5.288      12.559 
#>         bcs       59.864        23.060        0.241     2.596    0.012       13.498     106.230 
#> ------------------------------------------------------------------------------------------------
#> 
#>                                 Stepwise Selection Summary                                 
#> ------------------------------------------------------------------------------------------
#>                         Added/                   Adj.                                         
#> Step     Variable      Removed     R-Square    R-Square     C(p)        AIC         RMSE      
#> ------------------------------------------------------------------------------------------
#>    1    liver_test     addition       0.455       0.444    62.5120    771.8753    296.2992    
#>    2     alc_heavy     addition       0.567       0.550    41.3680    761.4394    266.6484    
#>    3    enzyme_test    addition       0.659       0.639    24.3380    750.5089    238.9145    
#>    4      pindex       addition       0.750       0.730     7.5370    735.7146    206.5835    
#>    5        bcs        addition       0.781       0.758     3.1920    730.6204    195.4544    
#> ------------------------------------------------------------------------------------------

Stepwise AIC Backward Regression

Build a regression model from a set of candidate predictor variables by removing predictors based on the Akaike information criterion, in a stepwise manner, until there are no variables left to remove.

Variable Selection

# stepwise aic backward regression
model <- lm(y ~ ., data = surgical)
k <- ols_step_backward_aic(model)
#> Backward Elimination Method 
#> ---------------------------
#> 
#> Candidate Terms: 
#> 
#> 1 . bcs 
#> 2 . pindex 
#> 3 . enzyme_test 
#> 4 . liver_test 
#> 5 . age 
#> 6 . gender 
#> 7 . alc_mod 
#> 8 . alc_heavy 
#> 
#> 
#> Variables Removed: 
#> 
#> - alc_mod 
#> - gender 
#> - age 
#> 
#> No more variables to be removed.
k
#> 
#> 
#>                         Backward Elimination Summary                         
#> ---------------------------------------------------------------------------
#> Variable        AIC          RSS          Sum Sq        R-Sq      Adj. R-Sq 
#> ---------------------------------------------------------------------------
#> Full Model    736.390    1825905.713    6543614.824    0.78184      0.74305 
#> alc_mod       734.407    1826477.828    6543042.709    0.78177      0.74856 
#> gender        732.494    1829435.617    6540084.920    0.78142      0.75351 
#> age           730.620    1833716.447    6535804.090    0.78091      0.75808 
#> ---------------------------------------------------------------------------

Breusch Pagan Test

The Breusch Pagan test is used to test for heteroskedasticity (non-constant error variance). It tests whether the variance of the errors from a regression depends on the values of the independent variables. It is a $\chi^{2}$ test.

model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
ols_test_breusch_pagan(model)
#> 
#>  Breusch Pagan Test for Heteroskedasticity
#>  -----------------------------------------
#>  Ho: the variance is constant            
#>  Ha: the variance is not constant        
#> 
#>              Data               
#>  -------------------------------
#>  Response : mpg 
#>  Variables: fitted values of mpg 
#> 
#>        Test Summary         
#>  ---------------------------
#>  DF            =    1 
#>  Chi2          =    1.429672 
#>  Prob > Chi2   =    0.231818

Collinearity Diagnostics

model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
ols_coll_diag(model)
#> Tolerance and Variance Inflation Factor
#> ---------------------------------------
#> # A tibble: 4 x 3
#>   Variables Tolerance   VIF
#>   <chr>         <dbl> <dbl>
#> 1 disp          0.125  7.99
#> 2 hp            0.194  5.17
#> 3 wt            0.145  6.92
#> 4 qsec          0.319  3.13
#> 
#> 
#> Eigenvalue and Condition Index
#> ------------------------------
#>    Eigenvalue Condition Index   intercept        disp          hp
#> 1 4.721487187        1.000000 0.000123237 0.001132468 0.001413094
#> 2 0.216562203        4.669260 0.002617424 0.036811051 0.027751289
#> 3 0.050416837        9.677242 0.001656551 0.120881424 0.392366164
#> 4 0.010104757       21.616057 0.025805998 0.777260487 0.059594623
#> 5 0.001429017       57.480524 0.969796790 0.063914571 0.518874831
#>             wt         qsec
#> 1 0.0005253393 0.0001277169
#> 2 0.0002096014 0.0046789491
#> 3 0.0377028008 0.0001952599
#> 4 0.7017528428 0.0024577686
#> 5 0.2598094157 0.9925403056

Getting Help

If you encounter a bug, please file a minimal reproducible example using reprex on GitHub. For questions and clarifications, use Stack Overflow.
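
A minimal sketch of preparing a reproducible example with reprex; the model and test below are placeholders for whatever code reproduces the problem:

# install.packages("reprex")
reprex::reprex({
  library(olsrr)
  model <- lm(mpg ~ disp + hp + wt, data = mtcars)
  ols_test_breusch_pagan(model)   # replace with the code that triggers the bug
})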

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

News

olsrr 0.5.2

This is a minor release to fix bugs caused by breaking changes in the recipes package, along with other enhancements.

Enhancements

  • variable selection procedures now return the final model as an object of class lm (#81); see the sketch after this list
  • data preparation functions for selected plots are now exported so that end users can create customized plots using the plotting library of their choice (#86)
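
A minimal sketch of the first enhancement. The assumption here is that the returned object stores the final fitted model in a model element; check the function documentation for the exact slot name:

library(olsrr)

model <- lm(y ~ ., data = surgical)
k <- ols_step_both_p(model)

final_model <- k$model   # assumed slot for the final lm object; see ?ols_step_both_p
class(final_model)       # expected: "lm"
summary(final_model)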

olsrr 0.5.1

This is a patch release to fix minor bugs and improve error messages.

Enhancements

olsrr now throws better error messages, written with beginner and intermediate R users in mind. This is a work in progress and should improve further in future releases.

Bug Fixes

Variable selection procedures based on p values now handle categorical variables in the same way as the procedures based on AIC values.
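
For instance, a minimal sketch in which a categorical predictor is passed to a p-value based procedure (assuming olsrr 0.5.1 or later):

library(olsrr)

d <- mtcars
d$cyl <- factor(d$cyl)                        # categorical predictor

model <- lm(mpg ~ disp + hp + cyl, data = d)
ols_step_forward_p(model)                     # cyl is handled as a single term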

olsrr 0.5.0

This is a minor release for bug fixes and API changes.

API Changes

We have made some changes to the API to make it more user friendly (a few illustrative calls follow the list):

  • all the variable selection procedures start with ols_step_*
  • all the tests start with ols_test_*
  • all the plots start with ols_plot_*
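
For example, a minimal sketch using one function from each group; the particular functions are chosen for illustration only:

library(olsrr)

model <- lm(mpg ~ disp + hp + wt, data = mtcars)

ols_step_forward_p(model)       # variable selection: ols_step_*
ols_test_breusch_pagan(model)   # tests: ols_test_*
ols_plot_resid_fit(model)       # plots: ols_plot_*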

Bug Fixes

  • ols_regress returns an error in the presence of interaction terms in the formula (#49)
  • ols_regress returns an error in the presence of interaction terms in the formula (#47)
  • return current version (#48)

olsrr 0.4.0

Enhancements

  • use ols_launch_app() to launch a shiny app for building models
  • save beta coefficients for each independent variable in ols_all_subset() (#41)

Bug Fixes

  • mismatch in sign of partial and semi partial correlations (#44)
  • error in diagnostic panel (#45)
  • standardized betas in the presence of interaction terms (#46)

A big thanks goes to Dr. Kimberly Henry for identifying bugs and providing valuable feedback that helped improve the package.

olsrr 0.3.0

This is a minor release containing bug fixes.

Bug Fixes

  • output from reg_compute rounded up to 3 decimal points (#24)
  • added variable plot fails when model includes categorical variables (#25)
  • all possible regression fails when model includes categorical predictors (#26)
  • output from bartlett test rounded to 3 decimal points (#27)
  • best subsets regression fails when model includes categorical predictors (#28)
  • output from breusch pagan test rounded to 4 decimal points (#29)
  • output from collinearity diagnostics rounded to 3 decimal points (#30)
  • cook's d bar plot threshold rounded to 3 decimal points (#31)
  • cook's d chart threshold rounded to 3 decimal points (#32)
  • output from f test rounded to 3 decimal points (#33)
  • output from measures of influence rounded to 4 decimal points (#34)
  • output from information criteria rounded to 4 decimal points (#35)
  • studentized residuals vs leverage plot threshold rounded to 3 decimal points (#36)
  • output from score test rounded to 3 decimal points (#37)
  • step AIC backward method AIC value rounded to 3 decimal points (#38)
  • step AIC backward method AIC value rounded to 3 decimal points (#39)
  • step AIC both direction method AIC value rounded to 3 decimal points (#40)

olsrr 0.2.0

This is a minor release containing bug fixes and minor improvements.

Bug Fixes

  • inline functions in model formula caused errors in stepwise regression (#2)
  • added variable plots (ols_avplots) returns error when model formula contains inline functions (#3)
  • all possible regression (ols_all_subset) returns an error when the model formula contains inline functions or interaction variables (#4)
  • best subset regression (ols_best_subset) returns an error when the model formula contains inline functions or interaction variables (#5)
  • studentized residual plot (ols_srsd_plot) returns an error when the model formula contains inline functions (#6)
  • stepwise backward regression (ols_step_backward) returns an error when the model formula contains inline functions or interaction variables (#7)
  • stepwise forward regression (ols_step_forward) returns an error when the model formula contains inline functions (#8)
  • stepAIC backward regression (ols_stepaic_backward) returns an error when the model formula contains inline functions (#9)
  • stepAIC forward regression (ols_stepaic_forward) returns an error when the model formula contains inline functions (#10)
  • stepAIC regression (ols_stepaic_both) returns an error when the model formula contains inline functions (#11)
  • outliers incorrectly plotted in (ols_cooksd_barplot) cook's d bar plot (#12)
  • regression (ols_regress) returns an error when the model formula contains inline functions (#21)
  • output from step AIC backward regression (ols_stepaic_backward) is not properly formatted (#22)
  • output from step AIC regression (ols_stepaic_both) is not properly formatted (#23)

Enhancements

  • cook's d bar plot (ols_cooksd_barplot) returns the threshold value used to classify the observations as outliers (#13)
  • cook's d chart (ols_cooksd_chart) returns the threshold value used to classify the observations as outliers (#14)
  • DFFITs plot (ols_dffits_plot) returns the threshold value used to classify the observations as outliers (#15)
  • deleted studentized residuals vs fitted values plot (ols_dsrvsp_plot) returns the threshold value used to classify the observations as outliers (#16)
  • studentized residuals vs leverage plot (ols_rsdlev_plot) returns the threshold value used to detect outliers/high leverage observations (#17)
  • standardized residuals chart (ols_srsd_chart) returns the threshold value used to classify the observations as outliers (#18)
  • studentized residuals plot (ols_srsd_plot) returns the threshold value used to classify the observations as outliers (#19)

Documentation

There were errors in the description of the values returned by some functions. The documentation has been thoroughly revised and improved in this release.

olsrr 0.1.0

First release.
