Tools for Standardizing Variables for Regression in R

Tools which allow regression variables to be placed on similar scales, offering computational benefits as well as easing interpretation of regression output.


standardize 0.2.1

Build Status

Installation

To install the standardize package, call:

install.packages("standardize")

Package use

The standardize package provides tools for controlling continuous variable scaling and factor contrasts. The goal of these standardizations is to keep the regression parameters on similar scales, and to ensure that the intercept (which is the predicted value of an observation when all other coefficients are multiplied by 0) represents the corrected mean (i.e. the predicted value for an observation which is average in every way, holding covariates at their mean values and averaging over group differences in factors). When the predictors are all on a similar scale, there are computational benefits for both frequentist and Bayesian approaches in mixed effects regressions, reasonable Bayesian priors are easier to specify, and regression output is easier to interpret. Take, for example, the ptk dataset included in the package (for which we will create a new ordered factor preheight):

library(standardize)
 
ptk$preheight <- "Mid"
ptk$preheight[ptk$prevowel == "a"] <- "Low"
ptk$preheight[ptk$prevowel %in% c("i", "u")] <- "High"
ptk$preheight <- factor(ptk$preheight, ordered = TRUE, levels = c("Low",
  "Mid", "High"))
 
summary(ptk)
#>  Min.   : 61.0   Min.   :  0.00   Bilabial:228   Post-Tonic:250   a:217   
#>  1st Qu.:117.0   1st Qu.: 49.00   Dental  :275   Tonic     :252   e:169   
#>  Median :133.0   Median : 62.00   Velar   :248   Unstressed:249   i:146   
#>  Mean   :137.5   Mean   : 60.69                                   o:151   
#>  3rd Qu.:155.0   3rd Qu.: 77.00                                   u: 68   
#>  Max.   :281.0   Max.   :190.00                                           
#>                                                                           
#>  posvowel    wordpos       wordfreq        speechrate         sex     
#>  a:212    Initial:322   Min.   :     1   Min.   : 2.000   Female:384  
#>  e:144    Medial :429   1st Qu.:  1104   1st Qu.: 5.000   Male  :367  
#>  i: 86                  Median :  4899   Median : 6.000               
#>  o:255                  Mean   : 18840   Mean   : 5.812               
#>  u: 54                  3rd Qu.: 26610   3rd Qu.: 7.000               
#>                         Max.   :158168   Max.   :10.000               
#>                                                                       
#>     speaker    preheight 
#>  s02    : 45   Low :217  
#>  s03    : 45   Mid :320  
#>  s04    : 45   High:214  
#>  s07    : 45             
#>  s11    : 45             
#>  s16    : 44             
#>  (Other):482

Suppose we want to fit a linear mixed effects regression with total consonant duration cdur as the response, place, stress, preheight, the natural log of wordfreq, and speaker-relative speechrate as fixed effects, and random intercepts for speaker. The variables for this regression can be easily placed into a standardized space with the standardize function:

sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
  scale_by(speechrate ~ speaker) + (1 | speaker), ptk)
  
sobj
#> 
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) + 
#>     scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker + 
#>     (1 | speaker)
#> 
#> Variables:
#>  Variable                       Standardized Name           
#>  cdur                           cdur                        
#>  place                          place                       
#>  stress                         stress                      
#>  preheight                      preheight                   
#>  log(wordfreq)                  log_wordfreq                
#>  scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker
#>  speaker                        speaker                     
#>  Class           
#>  response.numeric
#>  factor          
#>  factor          
#>  ordered         
#>  numeric         
#>  scaledby        
#>  group           
#> 
#> Response has mean 0 and standard deviation 1.
#> 
#> Continuous variables have mean 0 and standard deviation 1
#>   (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 1
#> Ordered factors have orthogonal polynomial contrasts whose
#>   columns have standard deviation 1
#> Grouping factors are coded as unordered factors with default contrasts
 
names(sobj)
#>  [1] "call"      "scale"     "formula"   "family"    "data"     
#>  [6] "offset"    "pred"      "variables" "contrasts" "groups"
 
head(sobj$data)
#>          cdur  place     stress preheight log_wordfreq
#> 1  0.58032620  Velar Unstressed       Low  -1.64753617
#> 2  0.21456174  Velar Post-Tonic      High  -1.64753617
#> 3 -0.11795140 Dental Unstressed      High  -0.05570344
#> 4  0.01505386 Dental Post-Tonic      High  -1.90076049
#> 5  0.91283934  Velar      Tonic       Mid  -0.19203879
#> 6  1.51136300 Dental Post-Tonic      High  -0.22152148
#>   speechrate_scaled_by_speaker speaker
#> 1                   -0.4984662     s01
#> 2                    1.3452937     s01
#> 3                   -0.4984662     s01
#> 4                    0.6454974     s01
#> 5                    2.2671736     s01
#> 6                   -2.3422261     s01
 
mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1
 
mean(sobj$data$log_wordfreq)
#> [1] 1.550764e-16
sd(sobj$data$log_wordfreq)
#> [1] 1
all.equal(scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE
 
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#>           s01           s02           s03           s04           s05 
#> -2.960268e-16  3.165268e-18 -3.119346e-16  1.004790e-16  2.183358e-16 
#>           s06           s07           s08           s09           s10 
#> -7.174906e-17 -1.953684e-16 -2.200950e-16  1.468327e-16 -1.266348e-16 
#>           s11           s12           s13           s14           s15 
#> -2.923792e-16  3.850162e-16  1.980309e-16  6.325185e-17  1.427623e-16 
#>           s16           s17           s18 
#> -2.220446e-16  1.582818e-16 -2.505491e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
 
sobj$contrasts
#> $place
#>          Bilabial Dental
#> Bilabial        1      0
#> Dental          0      1
#> Velar          -1     -1
#> 
#> $stress
#>            Post-Tonic Tonic
#> Post-Tonic          1     0
#> Tonic               0     1
#> Unstressed         -1    -1
#> 
#> $preheight
#>                 .L         .Q
#> Low  -1.000000e+00  0.5773503
#> Mid  -2.168241e-17 -1.1547005
#> High  1.000000e+00  0.5773503
 
sobj$groups
#> $speaker
#>  [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11"
#> [12] "s12" "s13" "s14" "s15" "s16" "s17" "s18"

The default settings for standardize have placed all the continuous variables on unit scale, set (named) sum contrasts for all unordered factors, and orthogonal polynomial contrasts with column standard deviations of 1 for all ordered factors. In the case of speechrate, the call to scale_by ensured that rather than simply placing speechrate on unit scale, it was placed on unit scale for each speaker, so that the resulting variable represents speaker-relative speech rate. We can then simply use the formula and data elements of the object returned by standardize to fit the mixed effects regression in this standardized space:

library(lme4)
#> Loading required package: Matrix
 
mod <- lmer(sobj$formula, sobj$data)
 
summary(mod)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: 
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +  
#>     (1 | speaker)
#>    Data: sobj$data
#> 
#> REML criterion at convergence: 2018
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.5479 -0.6919 -0.1014  0.5883  4.2981 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  speaker  (Intercept) 0.09375  0.3062  
#>  Residual             0.79081  0.8893  
#> Number of obs: 751, groups:  speaker, 18
#> 
#> Fixed effects:
#>                               Estimate Std. Error t value
#> (Intercept)                   0.015759   0.079462   0.198
#> placeBilabial                 0.008255   0.047766   0.173
#> placeDental                  -0.011703   0.045913  -0.255
#> stressPost-Tonic             -0.062456   0.048407  -1.290
#> stressTonic                   0.212176   0.047974   4.423
#> preheight.L                   0.017645   0.046657   0.378
#> preheight.Q                   0.059551   0.039241   1.518
#> log_wordfreq                 -0.037784   0.033815  -1.117
#> speechrate_scaled_by_speaker -0.306867   0.033216  -9.239
#> 
#> Correlation of Fixed Effects:
#>             (Intr) plcBlb plcDnt strP-T strssT prhg.L prhg.Q lg_wrd
#> placeBilabl  0.024                                                 
#> placeDental -0.039 -0.500                                          
#> strssPst-Tn  0.004  0.053 -0.051                                   
#> stressTonic -0.006  0.003 -0.003 -0.524                            
#> preheight.L  0.009  0.075 -0.106 -0.272  0.276                     
#> preheight.Q  0.083 -0.041 -0.127  0.029 -0.059  0.023              
#> log_wordfrq  0.005 -0.092  0.062 -0.131 -0.029  0.106  0.093       
#> spchrt_sc__  0.007  0.015 -0.043  0.058  0.046  0.061  0.073  0.013

The scaling of the predictors can be controlled through the scale argument to standardize. For example:

sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
  scale_by(speechrate ~ speaker) + (1 | speaker), ptk, scale = 0.5)
  
sobj
#> 
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) + 
#>     scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk, 
#>     scale = 0.5)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker + 
#>     (1 | speaker)
#> 
#> Variables:
#>  Variable                       Standardized Name           
#>  cdur                           cdur                        
#>  place                          place                       
#>  stress                         stress                      
#>  preheight                      preheight                   
#>  log(wordfreq)                  log_wordfreq                
#>  scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker
#>  speaker                        speaker                     
#>  Class           
#>  response.numeric
#>  factor          
#>  factor          
#>  ordered         
#>  numeric         
#>  scaledby        
#>  group           
#> 
#> Response has mean 0 and standard deviation 1.
#> 
#> Continuous variables have mean 0 and standard deviation 0.5
#>   (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 0.5
#> Ordered factors have orthogonal polynomial contrasts whose
#>   columns have standard deviation 0.5
#> Grouping factors are coded as unordered factors with default contrasts
 
names(sobj)
#>  [1] "call"      "scale"     "formula"   "family"    "data"     
#>  [6] "offset"    "pred"      "variables" "contrasts" "groups"
 
head(sobj$data)
#>          cdur  place     stress preheight log_wordfreq
#> 1  0.58032620  Velar Unstressed       Low  -0.82376808
#> 2  0.21456174  Velar Post-Tonic      High  -0.82376808
#> 3 -0.11795140 Dental Unstressed      High  -0.02785172
#> 4  0.01505386 Dental Post-Tonic      High  -0.95038025
#> 5  0.91283934  Velar      Tonic       Mid  -0.09601939
#> 6  1.51136300 Dental Post-Tonic      High  -0.11076074
#>   speechrate_scaled_by_speaker speaker
#> 1                   -0.2492331     s01
#> 2                    0.6726468     s01
#> 3                   -0.2492331     s01
#> 4                    0.3227487     s01
#> 5                    1.1335868     s01
#> 6                   -1.1711131     s01
 
mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1
 
mean(sobj$data$log_wordfreq)
#> [1] 7.753821e-17
sd(sobj$data$log_wordfreq)
#> [1] 0.5
all.equal(0.5 * scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE
 
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#>           s01           s02           s03           s04           s05 
#> -1.480134e-16  1.582634e-18 -1.559673e-16  5.023952e-17  1.091679e-16 
#>           s06           s07           s08           s09           s10 
#> -3.587453e-17 -9.768421e-17 -1.100475e-16  7.341635e-17 -6.331741e-17 
#>           s11           s12           s13           s14           s15 
#> -1.461896e-16  1.925081e-16  9.901544e-17  3.162593e-17  7.138116e-17 
#>           s16           s17           s18 
#> -1.110223e-16  7.914090e-17 -1.252746e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 
#> 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
 
sobj$contrasts
#> $place
#>          Bilabial Dental
#> Bilabial      0.5    0.0
#> Dental        0.0    0.5
#> Velar        -0.5   -0.5
#> 
#> $stress
#>            Post-Tonic Tonic
#> Post-Tonic        0.5   0.0
#> Tonic             0.0   0.5
#> Unstressed       -0.5  -0.5
#> 
#> $preheight
#>                .L         .Q
#> Low  -5.00000e-01  0.2886751
#> Mid  -1.08412e-17 -0.5773503
#> High  5.00000e-01  0.2886751
 
sobj$groups
#> $speaker
#>  [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11"
#> [12] "s12" "s13" "s14" "s15" "s16" "s17" "s18"

The standardize function works by making use of the function scale from base R, as well as the standardize functions scale_by, named_contr_sum, and scaled_contr_poly. For more details, install the package and see the vignette "Using the standardize Package".

News

standardize 0.2.1

  • Changes to default na.action are now deprecated and ignored with a warning

standardize 0.2.0

  • Offsets are now supported

  • The formula with renamed terms now always has the same structure/ordering as the input

  • Efficiency improvements

standardize 0.1.1

  • Minor documentation and vignette updates

  • Fixed reverse dependency issue for lme4 1.1-13

  • Now suggests the latest version of afex, 0.17-8

standardize 0.1.0

  • Initial release

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("standardize")

0.2.1 by Christopher D. Eager, 2 years ago


https://github.com/CDEager/standardize


Report a bug at https://github.com/CDEager/standardize/issues


Browse source code at https://github.com/cran/standardize


Authors: Christopher D. Eager [aut, cre]


Documentation:   PDF Manual  


GPL (>= 3) license


Imports stats, methods, MASS, lme4, stringr

Suggests testthat, knitr, rmarkdown, afex, lsmeans, lmerTest


Imported by Countr.


See at CRAN