Fast Imputation of Missing Values

Alternative implementation of the beautiful 'MissForest' algorithm, introduced by Stekhoven, D.J. and Buehlmann, P. (2012), which imputes mixed-type data sets by chaining random forests. Under the hood, it uses the fast random forest package 'ranger'. Between the iterative model fitting steps, predictive mean matching can optionally be applied. First, this avoids imputing values that are not present in the original data (such as 0.3334 in a 0-1 coded variable). Second, predictive mean matching tries to raise the variance of the resulting conditional distributions to a realistic level, which makes it possible, for example, to do multiple imputation by repeating the call to missRanger(). A formula interface controls which variables are imputed by which.


Description

This package uses the ranger package [1] to do fast missing value imputation by chained random forests, see [2] and [3]. Between the iterative model fitting steps, it offers the option of using predictive mean matching. First, this avoids imputing values that are not present in the original data (such as 0.3334 in a 0-1 coded variable). Second, predictive mean matching tries to raise the variance of the resulting conditional distributions to a realistic level, which makes it possible, for example, to do multiple imputation by repeating the call to missRanger(). By comparison, the mice package [3] relies on the randomForest package and uses only ten trees by default.
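To make the idea of predictive mean matching more concrete, here is a minimal base-R sketch. The helper pmm_sketch() and the toy vectors are hypothetical and only illustrate the principle; they are not the package's internal implementation. For each raw prediction of a missing value, the k observed cases with the closest predictions are looked up, and the observed value of one randomly chosen candidate is used as the imputation.

```r
# Hypothetical helper for illustration only (not part of missRanger's API):
# a base-R sketch of predictive mean matching (PMM).
pmm_sketch <- function(pred_obs, y_obs, pred_mis, k = 3) {
  vapply(pred_mis, function(p) {
    # Indices of the k observed cases whose predictions are closest to p
    nearest <- order(abs(pred_obs - p))[seq_len(min(k, length(pred_obs)))]
    # Randomly pick one of these donors and return its observed value
    y_obs[nearest[sample.int(length(nearest), 1)]]
  }, FUN.VALUE = y_obs[1])
}

set.seed(1)
y_obs    <- c(0, 0, 1, 1, 1)                 # observed 0-1 coded values
pred_obs <- c(0.10, 0.25, 0.60, 0.80, 0.90)  # model predictions for observed rows
pred_mis <- c(0.3334, 0.70)                  # raw predictions for missing rows

# Every imputed value is drawn from y_obs, so it is always 0 or 1
imputed <- pmm_sketch(pred_obs, y_obs, pred_mis, k = 3)
imputed
```

Because the imputed value is sampled from several donors rather than set to the prediction itself, repeated runs can yield different imputations, which is what restores some of the variance.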

See the help page ?missRanger for how to call the function and for all available options.

Example

This example first generates a data set with about 10% missing values in each column. Then those gaps are filled by missRanger. In the end, the resulting data frame is displayed.

library(missRanger)
 
# Generate data with missing values in all columns
irisWithNA <- generateNA(iris)
 
# Impute missing values with missRanger
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
 
# Check results
head(irisImputed)
head(irisWithNA)
head(iris)
 
# With extra trees algorithm
irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, splitrule = "extratrees", num.trees = 100)
head(irisImputed_et)
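The formula interface mentioned in the package description can be sketched as follows. This assumes an installed version of missRanger that provides the formula argument; the left-hand side lists the variables to impute and the right-hand side the variables used as predictors.

```r
library(missRanger)

set.seed(1)
irisWithNA <- generateNA(iris)

# Impute all variables with missing values, using all variables
# except Species as predictors
irisImputed_f <- missRanger(irisWithNA, . ~ . - Species,
                            pmm.k = 3, num.trees = 50, verbose = 0)
head(irisImputed_f)
```

Species itself is still imputed here, since it appears on the left-hand side; it is only excluded from the set of predictors.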

Since release 1.0.3, thanks to Andrew Landgraf, missRanger can also be used in tidyverse pipelines.

library(tidyverse)
 
iris %>% 
  as_tibble %>% 
  generateNA %>% 
  missRanger(verbose = 0) %>% 
  head
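The description mentions that repeating the call to missRanger() with predictive mean matching allows multiple imputation. A minimal sketch of that idea is shown below; the number of imputations and the linear model are arbitrary choices made for illustration. Only the point estimates are averaged here. Proper standard errors would require Rubin's rules, for instance via mice::pool().

```r
library(missRanger)

set.seed(1)
irisWithNA <- generateNA(iris, p = 0.1)

# Create several completed data sets by repeating the call with pmm.k > 0
m <- 5  # number of imputations (arbitrary choice for this sketch)
filled <- replicate(
  m,
  missRanger(irisWithNA, pmm.k = 3, num.trees = 50, verbose = 0),
  simplify = FALSE
)

# Fit the same model on each completed data set
fits <- lapply(filled, function(d) {
  lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = d)
})

# Average the coefficients across imputations (point estimates only)
pooled_coef <- rowMeans(sapply(fits, coef))
pooled_coef
```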
 

How to deal with date variables etc.?

missRanger natively deals with numeric and character/factor variables. Real-world data sets also contain other variable types, e.g. dates. These can be imputed as well, but this requires some pre- and post-processing:

  1. Transform the variable to a numeric or character/factor.

  2. Impute with pmm.k > 0, so that no new values are created.

  3. Transform the imputed variable back to its original type.

Example

library(missRanger)
library(lubridate)
library(tidyverse)
 
# Add a date variable to iris
iris$random_date <- seq.Date(as.Date("1998-12-17"), 
                             by = "1 day", 
                             length.out = nrow(iris))
 
set.seed(3234)
irisWithNA <- generateNA(iris, p = 0.2)
head(irisWithNA$random_date)
# Output: "1998-12-17" NA           NA           NA           "1998-12-21" "1998-12-22"
 
# Convert date to numeric, impute with PMM, convert back to date
irisImputed <- irisWithNA %>% 
  mutate(random_date = as.numeric(random_date)) %>% 
  missRanger(pmm.k = 5, num.trees = 100) %>% 
  mutate(random_date = as.Date(random_date, origin = "1970-01-01"))
 
head(irisImputed$random_date)
# Output: "1998-12-17" "1999-01-13" "1999-02-04" "1999-01-22" "1998-12-21" "1998-12-22"
 

How to deal with censored variables?

There is no obvious way to deal with survival variables in imputation models, mainly because it is unclear how to use them as covariables to predict other variables.

Options discussed in [add citation] include:

  • Use both status variable s and (censored) time variable t

  • s and log(t)

  • KM(t), and, optionally s

By KM(t), we denote the Kaplan-Meier estimate at each value of t.

The third option is the most elegant one as it explicitly deals with censoring information.

Let's go through an example to explain it:

Example

to do
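As a rough illustration of the third option, the following sketch replaces the censored time variable by its Kaplan-Meier estimate before imputation. It rests on several assumptions that are not part of the original text: the survival package is installed, its lung data set serves as example data, and the helper km_at_t() is a hypothetical name introduced here.

```r
library(survival)
library(missRanger)

# Hypothetical helper: Kaplan-Meier survival estimate KM(t),
# evaluated at each subject's own (possibly censored) time
km_at_t <- function(time, status) {
  fit <- survfit(Surv(time, status) ~ 1)
  # Right-continuous step function through the KM curve (value 1 before first time)
  km <- stepfun(fit$time, c(1, fit$surv))
  km(time)
}

dat <- survival::lung              # contains missing values in several covariables
dat$km <- km_at_t(dat$time, dat$status)

# Use KM(t) and the status variable as covariables, but drop the raw time
set.seed(1)
imputed <- missRanger(dat[, setdiff(names(dat), "time")],
                      pmm.k = 3, num.trees = 50, verbose = 0)
head(imputed)
```

After imputation, the original time column can be added back to the completed data from dat, since neither time nor status contains missing values in this example.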
 

Installation

Release 1.0.4 on CRAN

install.packages("missRanger")

References

[1] Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.

[2] Stekhoven, D.J. and Buehlmann, P. (2012), 'MissForest - nonparametric missing value imputation for mixed-type data', Bioinformatics, 28(1) 2012, 112-118, doi: 10.1093/bioinformatics/btr597

[3] Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/

News

missRanger 1.0.4

  • Thanks to @markgrujic for suggesting an additional argument "returnOOB" to return the average OOB prediction error in the resulting data frame.


2.1.0 by Michael Mayer, 5 months ago


Browse source code at https://github.com/cran/missRanger


Authors: Michael Mayer [aut, cre, cph]




GPL (>= 2) license


Imports stats, FNN, ranger

Suggests mice, dplyr, survival, ggplot2, knitr, rmarkdown


Imported by wiseR.

