Multiple Imputation

Michael Mayer

2021-03-27

How to use missRanger for multiple imputation?

For machine learning tasks, imputation is typically seen as a fixed data preparation step like dummy coding. There, multiple imputation is rarely applied as it adds another level of complexity to the analysis. This might be fine since a good validation schema will account for variation introduced by imputation.

For tasks with focus on statistical inference (p values, standard errors, confidence intervals, estimation of effects), the extra variability introduced by imputation has to be accounted for except if only very few missing values appear. One of the standard approaches is to impute the data set multiple times, generating e.g. 10 or 100 versions of a complete data set. Then, the intended analysis (t-test, linear model etc.) is applied independently to each of the complete data sets. Their results are combined afterward in a pooling step, usually by Rubin’s rule (Rubin 1987). For parameter estimates, averages are taken. Their variance is basically a combination of the average squared standard errors plus the variance of the parameter estimates across the imputed data sets, leading to inflated standard errors and thus larger p values and wider confidence intervals.

The package mice (Buuren and Groothuis-Oudshoorn 2011) takes case of this pooling step. The creation of multiple complete data sets can be done by mice or also by missRanger. In the latter case, in order to keep the variance of imputed values at a realistic level, we suggest to use predictive mean matching on top of the random forest imputations.

The following example shows how easy such workflow looks like.

library(missRanger)
library(mice)

set.seed(19)

irisWithNA <- generateNA(iris, p = c(0, 0.1, 0.1, 0.1, 0.1))

# Generate 20 complete data sets
filled <- replicate(20, missRanger(irisWithNA, verbose = 0, num.trees = 50, pmm.k = 5), simplify = FALSE)
                           
# Run a linear model for each of the completed data sets                          
models <- lapply(filled, function(x) lm(Sepal.Length ~ ., x))

# Pool the results by mice
summary(pooled_fit <- pool(models))
#>                term   estimate  std.error statistic       df      p.value
#> 1       (Intercept)  2.5318379 0.35328813  7.166496 78.42046 3.702585e-10
#> 2       Sepal.Width  0.4219628 0.10914841  3.865955 86.24543 2.139366e-04
#> 3      Petal.Length  0.7463245 0.09206148  8.106805 55.91280 5.215939e-11
#> 4       Petal.Width -0.1942882 0.18919218 -1.026936 65.08255 3.082525e-01
#> 5 Speciesversicolor -0.7083371 0.28697395 -2.468298 92.54440 1.541229e-02
#> 6  Speciesvirginica -0.9094360 0.39326030 -2.312555 91.05446 2.300101e-02

# Compare with model on original data
summary(lm(Sepal.Length ~ ., data = iris))
#> 
#> Call:
#> lm(formula = Sepal.Length ~ ., data = iris)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.79424 -0.21874  0.00899  0.20255  0.73103 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
#> Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
#> Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
#> Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
#> Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
#> Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3068 on 144 degrees of freedom
#> Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627 
#> F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

The standard errors and p values of the multiple imputation are larger than of the original data set. This reflects the additional uncertainty introduced by the presence of missing values in a realistic way.

References

Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in r.” Journal of Statistical Software, Articles 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Statistics. Wiley.