Citation: Haghish, E. F. (2022). mlim: Missing Data Imputation with Automated Machine Learning [Computer software]. https://CRAN.R-project.org/package=mlim
# mlim: Missing Data Imputation with Automated Machine Learning

In recent years, there have been several attempts to use machine learning for missing data imputation. Yet, `mlim` is unique: it is the first R package to automate the imputation process. In other words, `mlim` implements automated machine learning and brings the state of the art of this technique to missing data imputation, which is expected to yield a lower imputation error than standard imputation procedures.
The figure below shows the normalized RMSE of the imputations of several algorithms, including `MICE`, `missForest`, `missRanger`, and `mlim`. Here, two of `mlim`'s algorithms, Elastic Net (ELNET) and Gradient Boosting Machine (GBM), are used for the imputation, and the results are compared with Random Forest imputations as well as Multiple Imputation with Chained Equations (MICE), which uses Predictive Mean Matching (PMM). The imputation was carried out on the `iris` dataset in R, by adding 10% artificial missing data and comparing the imputed values with the originals.
`mlim` supports several algorithms. However, officially, only `ELNET` is recommended for personal computers with limited RAM. `mlim` is extremely computation-hungry and is more suitable for servers with plenty of RAM. `ELNET`, however, converges rather fast and hence provides a fast, scalable, yet highly flexible solution for missing data imputation. Compared to a fine-tuned `GBM`, `ELNET` generally performs worse, but their computational demands are vastly different. To fine-tune a `GBM` model that outperforms `ELNET`, you need to include a large number of models so that `mlim` can search for the ideal parameters for each variable, within each iteration.
| Algorithm | Speed | RAM | CPU |
|-----------|-------|-----|-----|
| ELNET | High | Low | Low |
| GBM | Low | High | High |
## GBM vs ELNET
But which one should you choose, assuming computational resources are not in question? Well, `GBM` is very likely to outperform `ELNET` if you specify a large enough `max_models` argument to tune the algorithm well for imputing each feature. That basically means generating at least 100 models, in exchange for a slight, yet probably statistically significant, improvement in imputation accuracy. The option is there for those who can use it, and to my knowledge, a `GBM` fine-tuned with a large enough number of models is the most accurate imputation algorithm of any procedure I know. But `ELNET` comes second, and given its speed advantage, it is indeed charming!
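As a rough illustration, a resource-heavy `GBM` run with a large model budget could look like the sketch below. The call mirrors the `mlim` arguments used in the benchmark script further down; the value of `max_models` is illustrative, not a recommendation.

```r
library(mlim)
library(missRanger)

# illustrative setup: 10% artificial missing data on iris
irisNA <- missRanger::generateNA(iris, p = 0.1, seed = 2022)

# a large model budget lets mlim search for well-tuned GBM parameters
# for each variable within each iteration (resource-heavy!)
mlimGBM <- mlim(irisNA, include_algos = "GBM",
                max_models = 100,  # at least ~100 models, as discussed above
                seed = 2022)
```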
Both of these algorithms offer one advantage over all the other machine learning missing data imputation methods, such as kNN, K-Means, PCA, and Random Forest: you do not need to specify any parameter yourself. Everything is automatic, and `mlim` searches for the optimal parameters for imputing each variable within each iteration (see the sketch below). For all the aforementioned methods, some parameters that influence the imputation accuracy must be specified by the user: the number of neighbors k for kNN, the number of components for PCA, the number of trees (and other parameters) for Random Forest, and so on. This is why `ELNET` outperforms the other packages: you get software that optimizes its models on its own.
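To make the contrast concrete, here is a minimal sketch: `VIM::kNN()` requires a user-chosen `k`, whereas the `mlim` call leaves all tuning to the package. The argument values are illustrative.

```r
library(VIM)
library(mlim)
library(missRanger)

irisNA <- missRanger::generateNA(iris, p = 0.1, seed = 2022)

# kNN: the number of neighbors k must be picked by the user
impKNN <- kNN(irisNA, k = 5, imp_var = FALSE)

# mlim: no tuning parameters to pick; models are fine-tuned per variable
impMLIM <- mlim(irisNA, include_algos = "ELNET", seed = 2022)
```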
`mlim` fine-tunes models for imputation, a procedure that has not been implemented in other R packages. This procedure often yields much higher accuracy than other machine learning imputation methods or missing data imputation procedures, because it uses more accurate models that are fine-tuned for each feature in the dataset. The cost, however, is computational resources. If you have access to a very powerful machine with a huge amount of RAM per CPU, then try `GBM`. If you specify a high enough number of models in each fine-tuning process, you are likely to get a more accurate imputation than with `ELNET`.
However, for personal machines and laptops, `ELNET` is generally recommended (see below). If your machine is not powerful enough, the imputation is likely to crash due to memory problems. So perhaps begin with `ELNET`, unless you are working with a powerful server. This is my general advice as long as `mlim` is in beta and under development.
`mlim` implements a trick to reduce the number of iterations needed to reach the optimized imputation. Usually, prior to the imputation, the missing data are replaced with the mean, the mode, or even random values from within the variable. This is a fair starting point for the imputation procedure, but it makes the optimization very time-consuming. Another possibility is to use a fast and well-established imputation algorithm for the pre-imputation and then improve the imputed values. `mlim` supports the following algorithms for pre-imputation (see the sketch after the table):
| Algorithm | Speed | RAM | CPU |
|-----------|-------|-----|-----|
| kNN | Very fast | Low | Low |
| ranger | Fast | High | High |
| missForest | Very slow | High | Very high |
| mm | Extremely fast | Very low | Very low |
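As a sketch of how this is selected, the pre-imputation algorithm is passed via the `preimpute` argument, as the benchmark script below does with `"knn"`; the other values are assumed to follow the algorithm names in the table above.

```r
library(mlim)
library(missRanger)

irisNA <- missRanger::generateNA(iris, p = 0.5, seed = 2022)

# pre-impute with fast kNN, then refine the values with ELNET
imputed <- mlim(irisNA, preimpute = "knn",
                include_algos = "ELNET", seed = 2022)
```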
`iris` is a small dataset with only 150 rows. Let's add 50% artificial missing data and compare several state-of-the-art machine learning missing data imputation procedures. `ELNET` comes out as the winner for a very simple reason: it was fine-tuned and the rest were not. The larger the dataset and the higher the number of features, the more pronounced the difference between `ELNET` and the others becomes.
```r
# Comparison of different R packages imputing the iris dataset
# ===========================================================
rm(list = ls())
library(mlim)
library(mice)
library(missForest)
library(missRanger)
library(VIM)

# Add artificial missing data
# ===========================================================
irisNA <- missRanger::generateNA(iris, p = 0.5, seed = 2022)

# ELNET imputation with mlim
# ===========================================================
mlimELNET <- mlim(irisNA, init = TRUE, maxiter = 10,
                  include_algos = "ELNET", preimpute = "knn",
                  report = "mlimELNET.log", verbosity = "debug",
                  max_models = 1, min_mem_size = "6G", nthreads = 1,
                  max_mem_size = "8G", iteration_stopping_tolerance = .01,
                  shutdown = TRUE, flush = FALSE, seed = 2022)
(mlimELNETerror <- mixError(mlimELNET, irisNA, iris))

# kNN imputation with VIM
# ===========================================================
kNN <- kNN(irisNA, imp_var = FALSE)
(kNNerror <- mixError(kNN, irisNA, iris))

# MICE imputation with mice (10 datasets)
# ===========================================================
m <- 10
mc <- mice(irisNA, m = m, maxit = 50, method = 'pmm', seed = 500)
MCerror <- NULL
for (i in 1:m) MCerror <- c(MCerror, mixError(complete(mc, i), irisNA, iris)[1])
(MCerror <- mean(MCerror))

# Random Forest imputation with missForest
# ===========================================================
set.seed(2022)
RF <- missForest(irisNA)
(RFerror <- mixError(RF$ximp, irisNA, iris))

# Random Forest imputation with missRanger
# ===========================================================
rngr <- missRanger(irisNA, num.trees = 100, seed = 2022)
(missRangerError <- mixError(rngr, irisNA, iris))
```
But that is not all! `mlim` also outperforms other R packages at imputing categorical and ordinal variables. Here is an example with the `trait` dataset, which is included in the package.
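A minimal sketch of such a run, assuming the `trait` dataset ships with `mlim` as stated and that factor and ordered columns are imputed without further setup:

```r
library(mlim)
library(missRanger)

# 'trait' is assumed to be loadable from the mlim package as described above
data("trait", package = "mlim")

traitNA  <- missRanger::generateNA(trait, p = 0.1, seed = 2022)
traitIMP <- mlim(traitNA, include_algos = "ELNET", seed = 2022)
```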