mixgb: Multiple Imputation Through XGBoost

Yongshi Deng

2022-06-07

Introduction

Mixgb offers a scalable solution for imputing large datasets using XGBoost, bootstrapping and predictive mean matching. Mixgb is built under Fully Conditional Specification (FCS), where XGBoost imputation models are built for each incomplete variable. Mixgb can automatically handle different types of variables and users do not need to encode categorical variables themselves. Users can also choose different settings regarding bootstrapping and the types predictive mean matching to enhance imputation performance.

Impute missing values with mixgb

We first load the mixgb package and the nhanes3_newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.

library(mixgb)
str(nhanes3_newborn)
#> tibble [2,107 × 16] (S3: tbl_df/tbl/data.frame)
#>  $ HSHSIZER: int [1:2107] 4 3 5 4 4 3 5 3 3 3 ...
#>  $ HSAGEIR : int [1:2107] 2 5 10 10 8 3 10 7 2 7 ...
#>  $ HSSEX   : Factor w/ 2 levels "1","2": 2 1 2 2 1 1 2 2 2 1 ...
#>  $ DMARACER: Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 2 ...
#>  $ DMAETHNR: Factor w/ 3 levels "1","2","3": 3 1 3 3 3 3 3 3 3 3 ...
#>  $ DMARETHN: Factor w/ 4 levels "1","2","3","4": 1 3 2 1 1 1 2 1 2 2 ...
#>  $ BMPHEAD : num [1:2107] 39.3 45.4 43.9 45.8 44.9 42.2 45.8 NA 40.2 44.5 ...
#>   ..- attr(*, "label")= chr "Head circumference (cm)"
#>  $ BMPRECUM: num [1:2107] 59.5 69.2 69.8 73.8 69 61.7 74.8 NA 64.5 70.2 ...
#>   ..- attr(*, "label")= chr "Recumbent length (cm)"
#>  $ BMPSB1  : num [1:2107] 8.2 13 6 8 8.2 9.4 5.2 NA 7 5.9 ...
#>   ..- attr(*, "label")= chr "First subscapular skinfold (mm)"
#>  $ BMPSB2  : num [1:2107] 8 13 5.6 10 7.8 8.4 5.2 NA 7 5.4 ...
#>   ..- attr(*, "label")= chr "Second subscapular skinfold (mm)"
#>  $ BMPTR1  : num [1:2107] 9 15.6 7 16.4 9.8 9.6 5.8 NA 11 6.8 ...
#>   ..- attr(*, "label")= chr "First triceps skinfold (mm)"
#>  $ BMPTR2  : num [1:2107] 9.4 14 8.2 12 8.8 8.2 6.6 NA 10.9 7.6 ...
#>   ..- attr(*, "label")= chr "Second triceps skinfold (mm)"
#>  $ BMPWT   : num [1:2107] 6.35 9.45 7.15 10.7 9.35 7.15 8.35 NA 7.35 8.65 ...
#>   ..- attr(*, "label")= chr "Weight (kg)"
#>  $ DMPPIR  : num [1:2107] 3.186 1.269 0.416 2.063 1.464 ...
#>   ..- attr(*, "label")= chr "Poverty income ratio"
#>  $ HFF1    : Factor w/ 2 levels "1","2": 2 2 1 1 1 2 2 1 2 1 ...
#>  $ HYD1    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 3 1 1 1 1 1 1 2 1 ...
colSums(is.na(nhanes3_newborn))
#> HSHSIZER  HSAGEIR    HSSEX DMARACER DMAETHNR DMARETHN  BMPHEAD BMPRECUM 
#>        0        0        0        0        0        0      124      114 
#>   BMPSB1   BMPSB2   BMPTR1   BMPTR2    BMPWT   DMPPIR     HFF1     HYD1 
#>      161      169      124      167      117      192        7        0

To impute this dataset, we can use the default settings. The default number of imputed datasets m = 5. Note that we do not need to convert our data into dgCMatrix or one-hot coding format. Our package will convert it automatically. Variables should be of the following types: numeric, integer, factor or ordinal factor.

# use mixgb with default settings
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

Customise imputation settings

We can also customise imputation settings:

# Use mixgb with chosen settings
params <- list(
  max_depth = 6,
  gamma = 0.1,
  eta = 0.3,
  min_child_weight = 1,
  subsample = 1,
  colsample_bytree = 1,
  colsample_bylevel = 1,
  colsample_bynode = 1,
  nthread = 4,
  tree_method = "auto",
  gpu_id = 0,
  predictor = "auto"
)

imputed.data <- mixgb(
  data = nhanes3_newborn, m = 5, maxit = 1,
  ordinalAsInteger = TRUE, bootstrap = TRUE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 50, early_stopping_rounds = 10, print_every_n = 10L, verbose = 0
)

Tune hyperparameters

Imputation performance can be affected by the hyperparameter settings. It may seem daunting to tune a large set of hyperparameters, but often we can narrow down the search as many hyperparameters are correlated. In our package, we have a function mixgb_cv() to tune nrounds. There is no default nrounds value in XGBoost, so we need to specify it. The default nrounds in mixgb is 50. However, we recommend using mixgb_cv() to find the optimal nrounds first.

cv.results <- mixgb_cv(data = nhanes3_newborn, verbose = FALSE)
cv.results$evaluation.log
#>     iter train_logloss_mean train_logloss_std test_logloss_mean
#>  1:    1          0.6208135       0.001946554         0.6552126
#>  2:    2          0.5767785       0.004676169         0.6374645
#>  3:    3          0.5449469       0.005228210         0.6272369
#>  4:    4          0.5191374       0.008095561         0.6242182
#>  5:    5          0.4941135       0.008357847         0.6243667
#>  6:    6          0.4755729       0.009002235         0.6254604
#>  7:    7          0.4595016       0.007635777         0.6274494
#>  8:    8          0.4438848       0.007674532         0.6277083
#>  9:    9          0.4337404       0.008486617         0.6281046
#> 10:   10          0.4190590       0.008824536         0.6305885
#> 11:   11          0.4102110       0.005878129         0.6333391
#> 12:   12          0.3978915       0.007290803         0.6358570
#> 13:   13          0.3889896       0.009138579         0.6394993
#> 14:   14          0.3781350       0.007533795         0.6391836
#>     test_logloss_std
#>  1:      0.005066467
#>  2:      0.001808090
#>  3:      0.005509982
#>  4:      0.005613646
#>  5:      0.007161814
#>  6:      0.008257190
#>  7:      0.011842903
#>  8:      0.014264475
#>  9:      0.014754702
#> 10:      0.013225266
#> 11:      0.010154894
#> 12:      0.010633046
#> 13:      0.009992737
#> 14:      0.009951158
cv.results$response
#> [1] "HFF1"
cv.results$best.nrounds
#> [1] 4

By default, mixgb_cv() will randomly choose an incomplete variable as the response and build an XGBoost model with other variables using the complete cases of the dataset. Therefore, each run of mixgb_cv() is likely to return different results. Users can also specify the response and covariates in the argument response and select_features, respectively.

cv.results <- mixgb_cv(data = nhanes3_newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1, 
                       response = "BMPHEAD", select_features = c("HSAGEIR", "HSSEX", "DMARETHN", "BMPRECUM",                           "BMPSB1", "BMPSB2", "BMPTR1", "BMPTR2", "BMPWT"), verbose = FALSE)

cv.results$best.nrounds
#> [1] 19

Since using mixgb_cv() with this dataset mostly returns a number less than 20, I’ll set nrounds = 20 in mixgb() to obtain m imputed datasets.

imputed.data <- mixgb(data = nhanes3_newborn, m = 5, nrounds = 20)

Convert ordinal factors to integers

If our dataset has ordinal factors, then the imputation process will be sped up by converting them to integers. In nhanes3_newborn data, there is only one ordinal factor, but we can still see that the imputation time is smaller with ordinalAsInteger = TRUE. If m and maxit are larger, the difference will be even more obvious. Note that when ordinalAsInteger = TRUE , the class of these ordinal factors in the imputed datasets will also be integer. We recommend to use the default PMM setting pmm.type = "auto", which will impute numeric and integer variables with PMM (type 2) and impute categorical variables without PMM. If we don’t apply PMM to integer variables, the missing values may be imputed as numeric. Sometimes this can be problematic, especially when the integers are actually converted from ordinal factors. Therefore, pmm.type = NULL is disabled when ordinalAsInteger = TRUE .

system.time(imputed.data <- mixgb(data = nhanes3_newborn, m = 5, maxit = 1, ordinalAsInteger = FALSE))
#>    user  system elapsed 
#>  10.626   0.235  10.907

system.time(imputed.data <- mixgb(data = nhanes3_newborn, m = 5, maxit = 1, ordinalAsInteger = TRUE))
#>    user  system elapsed 
#>  10.532   0.328  11.148

Inspect multiply imputed values

The mixgb package provides the following visual diagnostics functions:

  1. Single variable: plot_hist(), plot_box(), plot_bar() ;

  2. Two variables: plot_2num(), plot_2fac(), plot_1num1fac() ;

  3. Three variables: plot_2num1fac(), plot_1num2fac().

Each function will return m+1 panels to compare the observed data with m sets of actual imputed values.

For more details, please check the vignette on githubVisual diagnostics for multiply imputed values.