Normalizing Transformation Functions

Estimate a suite of normalizing transformations, including a new adaptation of a technique based on ranks which can guarantee normally distributed transformed data if there are no ties: ordered quantile normalization (ORQ). ORQ normalization combines a rank-mapping approach with a shifted logit approximation that allows the transformation to work on data outside the original domain. It is also able to handle new data within the original domain via linear interpolation. The package is built to estimate the best normalizing transformation for a vector consistently and accurately. It implements the Box-Cox transformation, the Yeo-Johnson transformation, three types of Lambert WxF transformations, and the ordered quantile normalization transformation. It also estimates the normalization efficacy of other commonly used transformations.



The bestNormalize R package was designed to help find a normalizing transformation for a vector. Many techniques have been developed for this purpose; however, each has its own strengths and weaknesses, and it is unclear which will work best until the data are observed. This package evaluates a range of possible transformations and returns the best one, i.e., the one that makes the transformed data look the most normal.

Note that some authors use the term “normalize” differently than in this package. We define “normalize”: to transform a vector of data in such a way that the transformed values follow a Gaussian distribution (or equivalently, a bell curve). This is in contrast to other such techniques designed to transform values to the 0-1 range, or to the -1 to 1 range.
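To make the distinction concrete, the following base-R sketch contrasts min-max scaling with a rank-based normalizing mapping on a small made-up vector. The skewness helper and the sample values are illustrative assumptions and are not part of the package.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

x <- c(0.2, 0.5, 0.9, 1.4, 2.0, 3.1, 7.5)  # made-up, right-skewed values

# Min-max scaling: a linear map onto [0, 1]; the shape of the data is unchanged
scaled <- (x - min(x)) / (max(x) - min(x))

# Normalizing in this package's sense: map ranks to standard normal quantiles
normalized <- qnorm((rank(x) - 0.5) / length(x))

range(scaled)
skewness(x)
skewness(scaled)      # same skewness as x
skewness(normalized)  # essentially zero: values are symmetric around 0
```

Min-max scaling only changes the location and scale of the values, so a skewed vector stays skewed, whereas the normalizing mapping changes the shape of the distribution.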

This package also introduces a new adaptation of a normalization technique, which we call Ordered Quantile normalization (orderNorm(), or ORQ). ORQ transforms the data by mapping their ranks to the normal distribution. This allows us to guarantee normally distributed transformed data (if ties are not present). For newly observed data outside of the original domain, the adaptation applies a shifted logit approximation to the rank transformation. For new data within the original domain, the transformation uses linear interpolation of the fitted transformation.
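The idea can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the function names orq_fit and orq_predict are hypothetical, the (rank - 1/2) / n offset follows the 1.4.0 release note, and the way the logit fit is shifted to meet the fitted values at the boundaries is an assumption of this sketch.

```r
# Hypothetical sketch of ORQ fitting: map ranks to standard normal quantiles
orq_fit <- function(x) {
  n <- length(x)
  p <- (rank(x) - 0.5) / n   # rank-based probabilities, strictly inside (0, 1)
  z <- qnorm(p)              # transformed (normalized) values
  # Binomial (logit-link) model of p on x, used only for extrapolation;
  # non-integer "successes" trigger a harmless warning, which is suppressed
  fit <- suppressWarnings(
    glm(p ~ x, family = binomial, weights = rep(n, n))
  )
  list(x = x, z = z, p = p, fit = fit)
}

# Hypothetical sketch of applying a fitted ORQ transformation to new data
orq_predict <- function(obj, newdata) {
  lo <- min(obj$x)
  hi <- max(obj$x)
  out <- numeric(length(newdata))
  inside <- newdata >= lo & newdata <= hi
  below <- newdata < lo
  above <- newdata > hi
  # Linear predictor (logit scale) of the fitted model at arbitrary values
  eta <- function(v) {
    as.numeric(predict(obj$fit, newdata = data.frame(x = v), type = "link"))
  }
  # Within the original domain: linear interpolation of the fitted mapping
  if (any(inside)) {
    out[inside] <- approx(obj$x, obj$z, xout = newdata[inside])$y
  }
  # Outside the domain: logit model, shifted (assumption) so that it agrees
  # with the rank-based probabilities at the boundary of the observed data
  if (any(below)) {
    shift <- qlogis(min(obj$p)) - eta(lo)
    out[below] <- qnorm(plogis(eta(newdata[below]) + shift))
  }
  if (any(above)) {
    shift <- qlogis(max(obj$p)) - eta(hi)
    out[above] <- qnorm(plogis(eta(newdata[above]) + shift))
  }
  out
}

set.seed(1)
x <- rgamma(200, 1, 1)
obj <- orq_fit(x)
orq_predict(obj, c(min(x) / 2, median(x), max(x) + 1))
```

Because the in-sample values are exactly the normal quantiles of the ranks, the transformed training data are normal by construction when there are no ties; the interpolation and logit steps only matter for values that were not in the training data.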

To evaluate the efficacy of each normalization technique, the bestNormalize() function uses repeated cross-validation to estimate Pearson’s P statistic divided by its degrees of freedom. This ratio is called the “Normality statistic”; a value close to 1 (or less) indicates that the transformation is working well. The function selects the transformation that produces the lowest P / df value when estimated on out-of-sample data. (Estimating the statistic on in-sample data will always choose the orderNorm technique, and in-sample fit is generally not the main goal of these procedures.)
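A rough base-R sketch of this statistic and of a single round of k-fold cross-validation is shown below. It assumes the conventional Pearson chi-square normality test with equiprobable classes (the convention used by nortest::pearson.test, which the package imports); the helper names pearson_p_df, cv_normality, and fit_log are hypothetical, and the package's own procedure may differ in its details.

```r
# Hypothetical helper: Pearson P / df for normality, base R only.
# Assumes ceiling(2 * n^(2/5)) equiprobable classes and df = classes - 3
# (two estimated parameters: mean and standard deviation).
pearson_p_df <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  k <- ceiling(2 * n^(2 / 5))
  z <- (x - mean(x)) / sd(x)
  breaks <- c(-Inf, qnorm(seq_len(k - 1) / k), Inf)  # equiprobable under N(0, 1)
  observed <- table(cut(z, breaks = breaks))
  expected <- n / k
  P <- sum((observed - expected)^2 / expected)
  as.numeric(P / (k - 3))
}

# Hypothetical helper: one round of k-fold CV for a single transformation.
# fit_fun(train) must return a function that transforms new values.
cv_normality <- function(x, fit_fun, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = length(x)))
  stats <- vapply(seq_len(k), function(i) {
    transform <- fit_fun(x[folds != i])     # estimate on the training folds
    pearson_p_df(transform(x[folds == i]))  # evaluate on the held-out fold
  }, numeric(1))
  mean(stats)
}

# Example transformation: standardized log, with parameters from training data
fit_log <- function(train) {
  m <- mean(log(train))
  s <- sd(log(train))
  function(new) (log(new) - m) / s
}

set.seed(100)
x <- rgamma(1000, 1, 1)
cv_normality(x, fit_log, k = 10)
```

Comparing this out-of-sample value across candidate transformations, and repeating the fold assignment several times, gives the kind of ranking that the example output in this README reports.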

Installation

You can install the most recent (development) version of bestNormalize from GitHub with:

# install.packages("devtools")
devtools::install_github("petersonR/bestNormalize")

Or, you can install it from CRAN with:

install.packages("bestNormalize")

Example

In this example, we generate 1000 draws from a gamma distribution, and normalize them:

library(bestNormalize)
set.seed(100)
x <- rgamma(1000, 1, 1)
 
# Estimate best transformation with repeated cross-validation
BN_obj <- bestNormalize(x, allow_lambert_s = TRUE)
BN_obj
#> Best Normalizing transformation with 1000 Observations
#>  Estimated Normality Statistics (Pearson P / df, lower => more normal):
#>  - No transform: 6.966 
#>  - Box-Cox: 1.1176 
#>  - Lambert's W (type s): 1.1004 
#>  - Log_b(x+a): 2.0489 
#>  - sqrt(x+a): 1.6444 
#>  - exp(x): 50.7939 
#>  - arcsinh(x): 3.6245 
#>  - Yeo-Johnson: 1.933 
#>  - orderNorm: 1.2694 
#> Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
#>  
#> Based off these, bestNormalize chose:
#> Standardized Lambert WxF Transformation of type s with 1000 nonmissing obs.:
#>  Estimated statistics:
#>  - gamma = 0.4129
#>  - mean (before standardization) = 0.667563 
#>  - sd (before standardization) = 0.7488649
 
# Perform transformation
gx <- predict(BN_obj)
 
# Perform reverse transformation
x2 <- predict(BN_obj, newdata = gx, inverse = TRUE)
 
# Check that the transformation is 1:1 (the inverse recovers the original data)
all.equal(x2, x)
#> [1] TRUE

As of version 1.3, the package also supports leave-one-out cross-validation. ORQ normalization works very well when the test dataset is small relative to the training dataset, so it will often be selected via leave-one-out cross-validation (which is why we set allow_orderNorm = FALSE here).

(BN_loo <- bestNormalize(x, allow_orderNorm = FALSE, allow_lambert_s = TRUE, loo = TRUE))
#> Note: passing a cluster (?makeCluster) to bestNormalize can speed up CV process
#> Best Normalizing transformation with 1000 Observations
#>  Estimated Normality Statistics (Pearson P / df, lower => more normal):
#>  - No transform: 26.624 
#>  - Box-Cox: 0.8077 
#>  - Lambert's W (type s): 1.269 
#>  - Log_b(x+a): 4.5374 
#>  - sqrt(x+a): 3.3655 
#>  - exp(x): 451.435 
#>  - arcsinh(x): 14.0712 
#>  - Yeo-Johnson: 5.7997 
#> Estimation method: Out-of-sample via leave-one-out CV
#>  
#> Based off these, bestNormalize chose:
#> Standardized Box Cox Transformation with 1000 nonmissing obs.:
#>  Estimated statistics:
#>  - lambda = 0.2739638 
#>  - mean (before standardization) = -0.3870903 
#>  - sd (before standardization) = 1.045498

It is also possible to visualize these transformations:

plot(BN_obj, leg_loc = "bottomright")

For a more in-depth tutorial, please consult the package vignette.

News

bestNormalize 1.4.0

  • Correctly subtract 1/2 from ranks in ORQ transformation to make quantile estimation unbiased (this was a bug in 1.3.0, as ranks start at 1, not zero).
  • Specify the weights for the GLM in the ORQ transformation to be the number of observations. This doesn't change the transformation, but it seems to be slightly faster computationally and is more mathematically tractable.
  • Other various bug fixes to tests and to plotting functions.

bestNormalize 1.3.0

  • Add 1/2 to ranks in ORQ transformation to make quantile estimation unbiased (should have minimal impact)
  • Add option loo for leave-one-out cross-validation
  • Add progress bar for cross-validation methods (both with/without parallel)
  • Add "no_transform" function - does the same thing as I(x) but in the syntax of other transformations (this allows the normalization statistics to also be calculated if no transformation is performed).
  • Add support for lambert transforms of type "h" in the bestNormalize function via allow_lambert_h argument.
  • Add "before standardization" to printout of different transforms' means and sds to clarify output

bestNormalize 1.2.0

  • Added other transformations commonly used to normalize a vector
    • exponential, log, square root, arcsinh
  • Lambert WxF is no longer done by default by bestNormalize since it is unstable on certain OS (Linux, Solaris), and does not abide by the CRAN policy.

bestNormalize 1.1.0

  • Clarified that the transformations are standardized by default, and provided an option to not standardize in transformations
  • Updated tests to run a bit faster and to use proper S3 classes

bestNormalize 1.0.1

  • Added references for original papers (Van der Waerden, Bartlett) that cite the basis for the orderNorm transformation, as well as discussion in Beasley (2009)
  • Edited description to clarify that this procedure is a new adaptation of an older technique rather than a new technique in itself

bestNormalize 1.0.0

  • Added feature to estimate out-of-sample normality statistics in bestNormalize instead of in-sample ones via repeated cross-validation
    • Note: set out_of_sample = FALSE to maintain backward-compatibility with prior versions and set allow_orderNorm = FALSE as well so that it isn't automatically selected
  • Improved extrapolation of the ORQ (orderNorm) method
    • Instead of linear extrapolation, it uses a binomial (logit-link) model on ranks
    • No more issues with Cauchy transformation
  • Added plotting feature for transformation objects
  • Cleared up some documentation

bestNormalize 0.2.2

  • Changed the name of the orderNorm technique to "Ordered Quantile normalization".

bestNormalize 0.2.1

  • Made description more clear in response to comments from CRAN

bestNormalize 0.2.0

  • First submission to CRAN

1.4.0 by Ryan Andrew Peterson, a month ago


https://github.com/petersonR/bestNormalize


Browse source code at https://github.com/cran/bestNormalize


Authors: Ryan Andrew Peterson [aut, cre]




GPL-3 license


Imports LambertW, nortest, dplyr, doParallel, foreach, doRNG

Suggests knitr, rmarkdown, MASS, testthat, mgcv, parallel

