The purpose of the following case-studies is to demonstrate the core functionalities of DoubleML
.
The DoubleML package for R can be downloaded using (requires previous installation of the remotes
package).
Load the package after completed installation.
The python package DoubleML
is available via the github repository. For more information, please visit our user guide.
For our case study we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment and as a second example we simulate data from a partially linear regression model.
## inuidur1 female black othrace dep1 dep2 q2 q3 q4 q5 q6 agelt35 agegt54
## 1: 2.890372 0 0 0 0 1 0 0 0 1 0 0 0
## 2: 0.000000 0 0 0 0 0 0 0 0 1 0 0 0
## 3: 3.295837 0 0 0 0 0 0 0 1 0 0 0 0
## 4: 2.197225 0 0 0 0 0 0 1 0 0 0 1 0
## 5: 3.295837 0 0 0 1 0 0 0 0 1 0 0 1
## 6: 3.295837 1 0 0 0 0 0 0 0 1 0 0 1
## durable lusd husd tg
## 1: 0 0 1 0
## 2: 0 1 0 0
## 3: 0 1 0 0
## 4: 0 0 0 1
## 5: 1 1 0 0
## 6: 0 1 0 0
\[\begin{align*} Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0, \\ D = m_0(X) + V, & &\mathbb{E}(V | X) = 0, \end{align*}\] where \(Y\) is the outcome variable and \(D\) is the policy variable of interest. The high-dimensional vector \(X = (X_1, \ldots, X_p)\) consists of other confounding covariates, and \(\zeta\) and \(V\) are stochastic errors.
DoubleMLData
DoubleML
provides interfaces to objects of class data.table
as well as R base classes data.frame
and matrix
. Details on the data-backend and the interfaces can be found in the user guide. The DoubleMLData
class serves as data-backend and can be initialized from a dataframe by specifying the column y_col="inuidur1"
serving as outcome variable \(Y\), the column(s) d_cols = "tg"
serving as treatment variable \(D\) and the columns x_cols=c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd")
specifying the confounders. Alternatively a matrix interface can be used as shown below for the simulated data.
# Specify the data and variables for the causal model
dml_data_bonus = DoubleMLData$new(df_bonus,
y_col = "inuidur1",
d_cols = "tg",
x_cols = c("female", "black", "othrace", "dep1", "dep2",
"q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54",
"durable", "lusd", "husd"))
print(dml_data_bonus)
## ================= DoubleMLData Object ==================
##
##
## ------------------ Data summary ------------------
## Outcome variable: inuidur1
## Treatment variable(s): tg
## Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
## Instrument(s):
## No. Observations: 5099
# matrix interface to DoubleMLData
dml_data_sim = double_ml_data_from_matrix(X=X, y=y, d=d)
dml_data_sim
## ================= DoubleMLData Object ==================
##
##
## ------------------ Data summary ------------------
## Outcome variable: y
## Treatment variable(s): d
## Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
## Instrument(s):
## No. Observations: 500
To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify machine learners to estimate \(m_0\) and \(g_0\). For the bonus data we use a random forest regression model and for our simulated data from a sparse partially linear model we use a Lasso regression model. The implementation of DoubleML
is based on the meta-packages mlr3 for R. For details on the specification of learners and their hyperparameters we refer to the user guide Learners, hyperparameters and hyperparameter tuning.
library(mlr3)
library(mlr3learners)
# surpress messages from mlr3 package during fitting
lgr::get_logger("mlr3")$set_threshold("warn")
learner = lrn("regr.ranger", num.trees=500, mtry=floor(sqrt(n_vars)), max.depth=5, min.node.size=2)
ml_g_bonus = learner$clone()
ml_m_bonus = learner$clone()
learner = lrn("regr.glmnet", lambda = sqrt(log(n_vars)/(n_obs)))
ml_g_sim = learner$clone()
ml_m_sim = learner$clone()
When initializing the object for PLR models DoubleMLPLR
, we can further set parameters specifying the resampling:
n_folds
(defaults to n_folds = 5
) as well asn_rep
(defaults to n_rep = 1
).Additionally, one can choose between the algorithms "dml1"
and "dml2"
via dml_procedure
(defaults to "dml2"
). Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions. For the PLR model the default score is "partialling out"
.
The user guide provides details about the Sample-splitting, cross-fitting and repeated cross-fitting, the Double machine learning algorithms and the Score functions
We now initialize DoubleMLPLR
objects for our examples using default parameters. The models are estimated by calling the fit()
method and we can for example inspect the estimated treatment effect using the summary()
method. A more detailed result summary can be obtained via the print()
method. Besides the fit()
method DoubleML
model classes also provide functionalities to perform statistical inference like bootstrap()
, confint()
and p_adjust()
, for details see the user guide Variance estimation, confidence intervals and boostrap standard errors.
set.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_g=ml_g_bonus, ml_m=ml_m_bonus)
## Warning: The argument ml_g was renamed to ml_l. Please adapt the argument name accordingly. ml_g is redirected to ml_l.
## The redirection will be removed in a future version.
## ================= DoubleMLPLR Object ==================
##
##
## ------------------ Data summary ------------------
## Outcome variable: inuidur1
## Treatment variable(s): tg
## Covariates: female, black, othrace, dep1, dep2, q2, q3, q4, q5, q6, agelt35, agegt54, durable, lusd, husd
## Instrument(s):
## No. Observations: 5099
##
## ------------------ Score & algorithm ------------------
## Score function: partialling out
## DML algorithm: dml2
##
## ------------------ Machine learner ------------------
## ml_l: regr.ranger
## ml_m: regr.ranger
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: TRUE
##
## ------------------ Fit summary ------------------
## Estimates and significance testing of the effect of target variables
## Estimate. Std. Error t value Pr(>|t|)
## tg -0.07385 0.03528 -2.093 0.0364 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning: The argument ml_g was renamed to ml_l. Please adapt the argument name accordingly. ml_g is redirected to ml_l.
## The redirection will be removed in a future version.
## ================= DoubleMLPLR Object ==================
##
##
## ------------------ Data summary ------------------
## Outcome variable: y
## Treatment variable(s): d
## Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100
## Instrument(s):
## No. Observations: 500
##
## ------------------ Score & algorithm ------------------
## Score function: partialling out
## DML algorithm: dml2
##
## ------------------ Machine learner ------------------
## ml_l: regr.glmnet
## ml_m: regr.glmnet
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: TRUE
##
## ------------------ Fit summary ------------------
## Estimates and significance testing of the effect of target variables
## Estimate. Std. Error t value Pr(>|t|)
## d 2.98094 0.05871 50.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1