Last updated on 2019-02-09 by Julie Josse, Nicholas Tierney and Nathalie Vialaneix (r-miss-tastic team)

Missing data are very frequently found in datasets. Base R provides a few options to handle them using computations that involve only observed data (`na.rm = TRUE` in functions `mean`, `var`, ... or `use = complete.obs|na.or.complete|pairwise.complete.obs` in functions `cov`, `cor`, ...). The base package stats also contains the generic function `na.action`, which extracts information about the `NA` action used to create an object.
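The base R options mentioned above can be tried on a small example. The sketch below uses only base R and the stats package; the vectors `x` and `y` are made-up toy data, not taken from any package.

```r
# Toy data with missing values (hypothetical example data)
x <- c(1, 2, NA, 4, 5)
y <- c(2, NA, 6, 8, 10)

# Observed-data summaries: NA values are dropped before computing
m <- mean(x, na.rm = TRUE)
v <- var(x, na.rm = TRUE)

# Correlation computed on the rows where both x and y are observed
r <- cor(x, y, use = "complete.obs")

# na.action() reports which rows were dropped when fitting a model
fit <- lm(y ~ x, na.action = na.omit)
dropped <- na.action(fit)
```

Here `dropped` holds the indices of the rows that `lm` removed because at least one of `x` or `y` was missing.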

These basic options are complemented by many packages on CRAN, which we structure into main topics:

- Exploration of missing data
- Likelihood based approaches
- Single imputation
- Multiple imputation
- Weighting methods
- Specific types of data
- Specific application fields

If you think that we missed some important packages in this list, please contact the maintainer.

- *Manipulation of missing data* is implemented in the packages sjmisc and sjlabelled. memisc also provides definable missing values, along with infrastructure for the management of survey data and variable labels.
- *Missing data patterns* can be identified and explored using the packages mi, dlookr, wrangle, DescTools, extracat (`visna` function) and naniar.
- *Graphics that describe distributions and patterns of missing data* are implemented in VIM (which has a Graphical User Interface, VIMGUI) and naniar (which abides by tidyverse principles). tabplot also contains functions to visualize missing data with large datasets.
- *Tests of the MAR assumption (versus the MCAR assumption)* are implemented in the function `LittleMCAR` from BaylorEdPsych (Little's test) and in MissMech (a nonparametric test).
- *Evaluation with simulations* can be performed using the function `ampute` of mice.
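To make the idea of a missing data pattern concrete, here is a minimal base R sketch that tabulates patterns by hand. It does not use the API of any of the packages listed above, which offer much richer summaries and graphics; the data frame `df` is a hypothetical toy example.

```r
# Hypothetical toy data frame with missing values
df <- data.frame(
  a = c(1, NA, 3, 4, NA),
  b = c("u", "v", NA, "x", "y"),
  c = c(0.5, 1.5, 2.5, NA, NA)
)

# Indicator matrix: TRUE where a value is missing
miss <- is.na(df)

# Proportion of missing values per variable
prop_missing <- colMeans(miss)

# Missing data patterns: one string per row, 1 = missing, 0 = observed
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- table(pattern)
```

Each distinct string in `pattern_counts` is one pattern of missingness across the variables `a`, `b` and `c`, together with the number of rows that share it.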

- *Methods based on the Expectation Maximization (EM) algorithm* are implemented in norm (using the function `em.norm` for multivariate Gaussian data), in cat (function `em.cat` for multivariate categorical data) and in mix (function `em.mix` for multivariate mixed categorical and continuous data). These packages also implement *Bayesian approaches* (with Imputation and Posterior steps) for the same models (functions `da.XXX` for `norm`, `cat` and `mix`) and can be used to obtain imputed complete datasets or multiple imputations (functions `imp.XXX` for `norm`, `cat` and `mix`), once the model parameters have been estimated. In addition, TestDataImputation implements imputation based on EM estimation (and other simpler imputation methods) that are well suited to dichotomous and polytomous tests with item responses.
- *Full Information Maximum Likelihood* (also known as "direct maximum likelihood" or "raw maximum likelihood") is available in lavaan, OpenMx and rsem, for handling missing data in structural equation modeling.
- *Bayesian approaches* for handling missing values in model based clustering with variable selection are available in VarSelLCM. The package also provides imputation using the posterior mean.
- *Missing values in mixed-effect models and generalized linear models* are supported in the packages PSM, mdmb, icdGLM and JointAI, the last one being based on a Bayesian approach. brlrmr also handles MNAR values in the response variable for logistic regression using an EM approach.
- *Missing data in item response models* is implemented in TAM, mirt, ltm and idealstan.
- *Variable selection* under ignorable and non-ignorable missing data mechanisms is implemented in TVsMiss.
- *Robust covariance estimation* is implemented in the package GSE.
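As an illustration of the EM idea for Gaussian data, the following is a hand-written base R sketch for a bivariate normal model in which only `y` has missing values. It is not the implementation used by `em.norm` or any other package above; the simulated data, the missingness mechanism and all variable names are assumptions made for the example.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 60)] <- NA          # hypothetical MCAR missingness in y only
obs <- !is.na(y)

# Start from complete-case estimates of the mean vector and covariance
mu <- c(mean(x), mean(y[obs]))
S  <- cov(cbind(x, y)[obs, ])

for (iter in 1:200) {
  # E-step: expected sufficient statistics of the missing y given x
  beta   <- S[1, 2] / S[1, 1]
  resvar <- S[2, 2] - beta * S[1, 2]
  ey  <- ifelse(obs, y, mu[2] + beta * (x - mu[1]))
  ey2 <- ifelse(obs, y^2, ey^2 + resvar)

  # M-step: maximum likelihood updates from the completed statistics
  mu_new <- c(mean(x), mean(ey))
  sxx <- mean(x^2) - mu_new[1]^2
  sxy <- mean(x * ey) - mu_new[1] * mu_new[2]
  syy <- mean(ey2) - mu_new[2]^2
  S_new <- matrix(c(sxx, sxy, sxy, syy), 2, 2)

  # Stop when the parameters no longer change
  converged <- max(abs(mu_new - mu), abs(S_new - S)) < 1e-8
  mu <- mu_new
  S  <- S_new
  if (converged) break
}
```

The residual variance term added in `ey2` is what distinguishes EM from simply plugging in regression predictions: without it, the variance of `y` would be underestimated.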

- The simplest method for missing data imputation is *imputation by mean* (or median, mode, ...). This approach is available in many packages, among which ForImp, Hmisc and dlookr, which contain various proposals for imputing the same value for all missing data of a variable. This method and other simple imputation methods are also available in tidyimpute, which follows the tidyverse approach.
- *k-nearest neighbors* is a popular method for missing data imputation that is available in many packages including DMwR, impute, VIM, GenForImp and yaImpute (with many different methods for kNN imputation, including a CCA based imputation). wNNSel implements a kNN based method for imputation in large dimensional datasets.
- *hot-deck* imputation is implemented in hot.deck, HotDeckImputation, FHDI and VIM (function `hotdeck`).
- *Other regression based imputations* are implemented in VIM (linear regression based imputation in the function `regressionImp`). In addition, simputation is a general package for imputation by any prediction method; it can be combined with various regression methods and works well with the tidyverse. WaverR imputes data using a weighted average of several regressions.
- *Imputation based on random forest* is implemented in missForest.
- *Imputation based on copula* is implemented in CoImp and in sbgcop (semi-parametric Bayesian copula imputation). The last one supports multiple imputation.
- *PCA/Singular Value Decomposition/matrix completion* is implemented in the package missMDA for numerical, categorical and mixed data. softImpute contains several methods for iterative matrix completion, as do filling and denoiseR for numerical variables, and mimi, which uses a low rank assumption to impute mixed datasets. The package pcaMethods offers some Bayesian implementation of PCA with missing data. *NIPALS* (based on SVD computation) is implemented in the packages mixOmics (for PCA and PLS), ade4, nipals and plsRglm (for generalized model PLS). ddsPLS implements a multi-block imputation method based on PLS in a supervised framework. NNLM implements a non-negative matrix factorization imputation. ROptSpace and CMF propose a matrix completion method under low-rank assumption and collective matrix factorization for imputation using Bayesian matrix completion for groups of variables (binary, quantitative, Poisson). Imputation for groups is also available in missMDA in the function `imputeMFA`.
- *Imputation for non-parametric regression by wavelet shrinkage* is implemented in CVThresh using solely maximization of the h-likelihood.
- mi and VIM also provide diagnostic plots to *evaluate the quality of imputation*.
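The two simplest ideas in this list, mean imputation and k-nearest neighbors imputation, can be sketched in a few lines of base R. This is an illustrative sketch only: the helper `knn_impute`, the simulated data and the choice `k = 3` are assumptions, and the packages above provide more complete and better tested implementations.

```r
set.seed(42)
x <- rnorm(50)                       # fully observed auxiliary variable
y <- 3 + x + rnorm(50, sd = 0.5)
y_miss <- y
y_miss[sample(50, 15)] <- NA         # hypothetical missing values
mis <- is.na(y_miss)

# Mean imputation: every missing entry gets the observed mean
y_mean <- ifelse(mis, mean(y_miss, na.rm = TRUE), y_miss)

# kNN imputation (k = 3): average y over the 3 donors closest in x
knn_impute <- function(target_x, donors_x, donors_y, k = 3) {
  nearest <- order(abs(donors_x - target_x))[seq_len(k)]
  mean(donors_y[nearest])
}
y_knn <- y_miss
y_knn[mis] <- vapply(x[mis], knn_impute, numeric(1),
                     donors_x = x[!mis], donors_y = y_miss[!mis])
```

Mean imputation preserves the observed mean but shrinks the variance, since all imputed values sit exactly at the center of the distribution. The kNN variant uses the auxiliary variable `x` to produce imputations that differ between cases.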

Some of the above-mentioned packages can also handle multiple imputations.

- Amelia implements Bootstrap multiple imputation using EM to estimate the parameters; for quantitative data, it imputes assuming a Multivariate Gaussian distribution. In addition, AmeliaView is a GUI for Amelia, available from the Amelia web page. NPBayesImputeCat also implements multiple imputation by joint modelling for categorical variables with a Bayesian approach.
- mi, mice and smcfcs implement multiple imputation by Chained Equations. smcfcs extends the models covered by the two previous packages. miceFast provides an alternative implementation of mice imputation methods using object oriented style programming and C++. miceMNAR imputes MNAR responses under the Heckman selection model for use with mice.
- missMDA implements multiple imputation based on SVD methods.
- MixedDataImpute (for mixed datasets) suggests multiple imputation based on Bayesian nonparametric methods.
- hot.deck implements hot deck based multiple imputation and StatMatch uses multiple hot deck imputation to impute surveys from an external dataset.
- *Multilevel imputation*: Multilevel multiple imputation is implemented in hmi, jomo, mice, miceadds, micemd, mitml and pan.
- Qtools implements multiple imputation based on quantile regression.
- Tree based multiple imputation is available in CALIBERrfimpute, which performs multiple imputation based on random forest (also available in mice) and in sbart, which proposes sequential BART (Bayesian Additive Regression Trees) to impute missing covariates.
- BaBooN implements a Bayesian bootstrap approach for discrete data imputation that is based on Predictive Mean Matching (PMM).
- accelmissing implements multiple imputation with the zero-inflated Poisson lognormal model for missing count values in accelerometer data.

In addition, mitools provides a generic approach to handle multiple imputation in combination with any imputation method.
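Whatever imputation method is used, the analyses of the imputed datasets are usually pooled with Rubin's rules. The sketch below does this by hand in base R and does not use the mitools API; the estimates and variances are invented numbers standing in for the results of the same analysis on five imputed datasets.

```r
# Hypothetical estimates and squared standard errors from m = 5 imputed datasets
est <- c(1.02, 0.95, 1.10, 0.99, 1.04)
se2 <- c(0.040, 0.038, 0.045, 0.041, 0.039)
m <- length(est)

pooled_est <- mean(est)               # pooled point estimate
within     <- mean(se2)               # within-imputation variance
between    <- var(est)                # between-imputation variance
total_var  <- within + (1 + 1 / m) * between
pooled_se  <- sqrt(total_var)
```

The between-imputation term is what carries the extra uncertainty due to the missing values, so the pooled standard error is never smaller than the one suggested by the within-imputation variance alone.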

*Computation of weights* for observed data to account for unobserved data by *Inverse Probability Weighting (IPW)* is implemented in ipw. *Doubly Robust Inverse Probability Weighted Augmented GEE Estimator with missing outcome* is implemented in CRTgeeDR.
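The principle behind inverse probability weighting can be sketched in base R as follows. This is a generic illustration under an assumed missingness mechanism and does not reproduce the interface of ipw or CRTgeeDR; the simulated data and the logistic model for the observation probability are assumptions.

```r
set.seed(7)
n <- 2000
x <- rnorm(n)
y <- 2 + x + rnorm(n)

# Hypothetical MAR mechanism: y is more likely to be observed when x is large
p_obs <- plogis(0.5 + x)
r <- rbinom(n, 1, p_obs)             # r = 1 when y is observed
y_obs <- ifelse(r == 1, y, NA)

# Step 1: model the probability of being observed given x
fit <- glm(r ~ x, family = binomial)
p_hat <- fitted(fit)

# Step 2: weight each observed case by the inverse of that probability
w <- 1 / p_hat[r == 1]
ipw_mean <- weighted.mean(y_obs[r == 1], w)

# Naive complete-case mean, expected to be biased under this mechanism
cc_mean <- mean(y_obs, na.rm = TRUE)
```

Observed cases that had a low probability of being observed receive a large weight, so that they also stand in for the similar cases that went missing.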

- *Longitudinal data / time series and censored data*: Imputation for time series is implemented in imputeTS and imputePSF. Other packages, such as forecast, spacetime, timeSeries, xts, prophet, stlplus or zoo, are dedicated to time series but also contain some (often basic) methods to handle missing data (see also TimeSeries). To help fill down missing values for time series, the padr and tsibble packages provide methods for imputing implicit missing values. Imputation of time series based on Dynamic Time Warping is implemented in DTWBI for univariate time series and in DTWUMI for multivariate ones. naniar also imputes data below the range for exploratory graphical analysis with the function `impute_below`. TAR implements an estimation of the autoregressive threshold models with Gaussian noise and of positive-valued time series with a Bayesian approach in the presence of missing data. swgee implements a probability weighted generalized estimating equations method for longitudinal data with missing observations and measurement error in covariates based on SIMEX. icenReg performs imputation for censored responses for interval data. imputeTestbench proposes tools to benchmark missing data imputation in univariate time series.
- *Spatial data*: Imputation for spatial data is implemented in phylin using interpolation with spatial distance weights or kriging. gapfill is dedicated to satellite data, and geostatistical interpolation of data with irregular spatial support is implemented in rtop.
- *Spatio-temporal data*: Imputation for spatio-temporal data is implemented in the package cutoffR using different methods such as kNN and SVD. Similarly, reddPrec imputes missing values in daily precipitation time series across different locations and sptemExp imputes missing data in air pollutant concentrations.
- *Graphs/networks*: Imputation for graphs/networks is implemented in the package dils to impute missing edges. PST provides a framework for analyzing Probabilistic Suffix Trees, including functions for learning and optimizing VLMC (variable length Markov chains) models from sets of individual sequences possibly containing missing values.
- *Imputation for contingency tables* is implemented in lori, which can also be used for the analysis of contingency tables with missing data.
- *Imputation for compositional data (CODA)* is implemented in robCompositions (based on kNN or EM approaches) and in zCompositions (various imputation methods for zeros, left-censored and missing data).
- *Imputation for diffusion processes* is implemented in DiffusionRimp by imputing missing sample paths with Brownian bridges.
- experiment handles missing values in experimental design such as randomized experiments with missing covariate and outcome data, matched-pairs design with missing outcome.
- cdparcoord handles missing values in parallel coordinates settings.
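For the time series case, two of the basic methods alluded to above are linear interpolation and last observation carried forward (LOCF). The following base R sketch shows both on a made-up series; the helper `locf` is written for this example and is not a function from any of the packages listed.

```r
# Hypothetical regularly spaced series with gaps
y <- c(10, NA, NA, 16, 18, NA, 22)
time_idx <- seq_along(y)
seen <- !is.na(y)

# Linear interpolation between the nearest observed neighbours
y_lin <- approx(time_idx[seen], y[seen], xout = time_idx)$y

# Last observation carried forward (LOCF)
locf <- function(v) {
  idx <- cumsum(!is.na(v))        # position of the latest observed value so far
  idx[idx == 0] <- NA             # leading NAs have nothing to carry forward
  v[!is.na(v)][idx]
}
y_locf <- locf(y)
```

Interpolation uses the observed values on both sides of a gap, whereas LOCF only looks backwards, which makes it usable online but produces flat segments inside each gap.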

- *Genetics*: SNPassoc provides functions to visualize missing data in the case of SNP studies (genetics). Analyses of Case-Parent Triad and/or Case-Control Data with SNP haplotypes are implemented in Haplin, where missing genotypic data are handled with an EM algorithm. FamEvent and snpStats implement imputation of missing genotypes, respectively with an EM algorithm and a nearest neighbor approach. Imputation for genotype and haplotype is implemented in alleHap using solely deterministic techniques on pedigree databases, and imputation of missing genotypes is also implemented in QTLRel, which contains tools for QTL analyses. Tools for Hardy-Weinberg equilibrium for bi- and multi-allelic genetic marker data are implemented in HardyWeinberg, where genotypes are imputed with a multinomial logit model. StAMPP computes genomic relationship when SNP genotype datasets contain missing data and PSIMEX computes inbreeding depression or heritability on pedigree structures affected by missing paternities with a variant of the SIMEX algorithm.
- *Genomics*: Imputation for dropout events (*i.e.*, under-sampling of mRNA molecules) in single-cell RNA-Sequencing data is implemented in DrImpute and Rmagic. RNAseqNet uses hot-deck imputation to improve RNA-seq network inference with an auxiliary dataset.
- *Phylogeny*: Rphylopars can perform ancestral state reconstruction and missing data imputation on the estimated evolutionary model in phylogeny (traits/species) datasets. TreePar and TreeSim respectively estimate birth and death rates for phylogeny and simulate phylogenetic trees with incomplete phylogeny (missing species).
- *Epidemiology*: powerlmm implements power calculation for time x treatment effects in the presence of *dropouts* and missing data in mixed linear models and pseval evaluates principal surrogates in a single clinical trial in the presence of missing counterfactual surrogate responses. idem provides missing data imputation with a sensitivity analysis strategy to handle the unobserved functional outcomes not due to death. dejaVu implements imputation for recurrent event data sets with dropouts under MAR and MNAR assumptions.
- *Causal inference*: cobalt computes the balance of variables from multiple imputed data sets. Similarly, causal inference with interactive fixed-effect models is available in gsynth with missing values handled by matrix completion.
- *Sensitivity analysis* to help diagnose missing data and imputation is implemented in TippingPoint. In addition, sensitivity analysis of the MAR assumption is implemented in samon under monotone and non-monotone patterns of missing data.
- *Scoring*: Basic methods (mean, median, mode, ...) for imputing missing data in scoring datasets are proposed in scorecardModelUtils.
- *Preference models*: Missing data in preference models are handled with a *Composite Link* approach that allows for MCAR and MNAR patterns to be taken into account in prefmod.
- *Administrative records*: fastLink provides a Fellegi-Sunter probabilistic record linkage that allows for missing data and the inclusion of auxiliary information.
- *Regression and classification*: eigenmodel handles missing values in regression models for symmetric relational data. randomForest and StratifiedRF handle missing values in predictors for random forest like methods.
- robustrao computes the Rao-Stirling diversity index (a well-established bibliometric indicator to measure the interdisciplinarity of scientific publications) with data containing uncategorized references.

- Task view: TimeSeries
- Bioconductor package: impute
- Bioconductor package: snpStats
- Bioconductor package: pcaMethods
- Bioconductor package: mixOmics
- Amelia II: A Program for Missing Data
- A resource website on missing data
