This vignette is adapted from Example 2 on the SimEngine website.
library(tidyr)
library(SimEngine)
When developing a novel statistical method, we often wish to compare our proposed method with one or more existing methods. This serves to highlight the differences between our method and whatever is used in common practice. Generally, we wish to examine realistic settings, motivated by statistical theory, in which the novel method confers some advantage over the alternatives.
In this example, we will consider the problem of estimating the variance-covariance matrix of the least-squares estimator in linear regression. We assume the reader has some familiarity with linear regression.
Suppose our dataset consists of \(n\) independent observations \(\{(Y_1, X_1), \dots, (Y_n, X_n)\}\), where \(X\) and \(Y\) are both scalar variables. A general linear regression model posits that
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
where \(\epsilon_i\) is a mean-zero noise term with variance \(\sigma^2_i\). We refer to this as a heteroskedastic model, since the variances need not be equal across all \(i\). This is the true data-generating model. (Note: A more restrictive (but misspecified) model assumes that there is a common variance \(\sigma^2\) such that \(\sigma^2_i = \sigma^2\) for all \(i\). We refer to this incorrect model as the homoskedastic model.)
In linear regression, we use least-squares to estimate the coefficient vector \(\beta\). For the purposes of building confidence intervals and performing hypothesis tests, we need to also estimate the standard error of the least squares estimator. There are two common ways to do this: (1) Use a model-based standard error that assume homoskedasticity. This is the estimator used by default in most statistical software (including the lm()
function in R). (2) Use a so-called sandwich standard error. Statistical theory tells us that estimator will be consistent even under heteroskedasticity. See the Statistical Appendix for more details.
We will carry out a small simulation study to compare these two estimators.
We start by declaring a new simulation object and writing a function that generates some data according to our model. For this simulation, we will make \(\sigma^2_i\) larger for larger values of \(X_i\).
<- new_sim()
sim
<- function(n) {
create_regression_data <- c(-1, 10)
beta <- rnorm(n)
x <- exp(x)
sigma2 <- rnorm(n=n, mean=(beta[1]+beta[2]*x), sd = sqrt(sigma2))
y return(data.frame(x=x, y=y))
}
To get a sense of what heteroskedasticity looks like in practice, we can generate a dataset, fit a linear regression model, and make a scatterplot of the residuals against \(X\).
<- create_regression_data(n=500)
dat <- lm(y~x, data=dat)
linear_model $residuals <- linear_model$residuals
dat
library(ggplot2)
ggplot(dat, aes(x=x, y=residuals)) +
geom_point() +
theme_bw() +
labs(x="x", y="residual")
Now we declare two methods: one returns the least squares estimate and model-based estimator of the variance-covariance matrix of \(\hat{\beta}\), and the other returns the least squares estimate and sandwich estimator (using the sandwich package).
<- function(data) {
model_vcov <- lm(y~x, data=data)
mod return(list("coef"=mod$coefficients, "vcov"=diag(vcov(mod))))
}
<- function(data) {
sandwich_vcov <- lm(y~x, data=data)
mod return(list("coef"=mod$coefficients, "vcov"=diag(vcovHC(mod))))
}
Next, we write the simulation script. This script returns a point estimate and a standard error estimate for both the intercept parameter \(\beta_0\) and the slope parameter \(\beta_1\). We will tell SimEngine to run 500 simulation replicates for each of four sample sizes. It is important to use the seed
argument in set_config
so that our results will be reproducible. In addition, we will use the packages
option to load the sandwich package. Loading packages via set_config
(as opposed to running library(sandwich)
) is required if running simulations in parallel. Finally, we run the simulation.
%<>% set_script(function() {
sim <- create_regression_data(n=L$n)
data <- use_method(L$estimator, list(data))
estimates return(list(
"beta0_est" = estimates$coef[1],
"beta1_est" = estimates$coef[2],
"beta0_se_est" = sqrt(estimates$vcov[1]),
"beta1_se_est" = sqrt(estimates$vcov[2])
))
})
%<>% set_levels(
sim estimator = c("model_vcov", "sandwich_vcov"),
n = c(50, 100, 500, 1000)
)
%<>% set_config(
sim num_sim = 500,
seed = 24,
packages = c("sandwich")
)
%<>% run()
sim #> Done. No errors or warnings detected.
Now we can summarize the results using summarize
. There are two main quantities of interest for us. The primary purpose of the standard error estimate for \(\hat{\beta}\) is to form confidence intervals, so we will look at (1) the average width of the resulting interval (simply 1.96 times the average standard error estimate across simulation replicates), and (2) the estimated coverage of the interval, which is simply the proportion of simulation replicates in which the interval contains the true value of \(\beta\). We will focus on 95% confidence intervals in this simulation.
<- sim %>% summarize(
summarized_results mean = list(
list(name="mean_se_beta0", x="beta0_se_est"),
list(name="mean_se_beta1", x="beta1_se_est")
),coverage = list(
list(name="cov_beta0", estimate="beta0_est", se="beta0_se_est", truth=-1),
list(name="cov_beta1", estimate="beta1_est", se="beta1_se_est", truth=10)
)
)
print(summarized_results)
#> level_id estimator n mean_se_beta0 mean_se_beta1 cov_beta0 cov_beta1
#> 1 1 model_vcov 50 0.18110782 0.18232629 0.948 0.850
#> 2 2 sandwich_vcov 50 0.18229941 0.24313215 0.938 0.922
#> 3 3 model_vcov 100 0.12720997 0.12764561 0.946 0.864
#> 4 4 sandwich_vcov 100 0.12800377 0.17264265 0.942 0.946
#> 5 5 model_vcov 500 0.05703074 0.05697449 0.946 0.836
#> 6 6 sandwich_vcov 500 0.05749570 0.07968853 0.948 0.942
#> 7 7 model_vcov 1000 0.04055713 0.04058957 0.962 0.858
#> 8 8 sandwich_vcov 1000 0.04052118 0.05721305 0.958 0.958
To visualize our results, we set up a plotting function.
<- function(which_graph, n_est) {
plot_results if (n_est == 3) {
<- c("#999999", "#E69F00", "#56B4E9")
values <- c("model_vcov", "sandwich_vcov", "bootstrap_vcov")
breaks <- c("Model-based", "Sandwich", "Bootstrap")
labels else {
} <- c("#999999", "#E69F00")
values <- c("model_vcov", "sandwich_vcov")
breaks <- c("Model-based", "Sandwich")
labels
}if (which_graph == "width") {
%>%
summarized_results pivot_longer(
cols = c("mean_se_beta0", "mean_se_beta1"),
names_to = "parameter",
names_prefix = "mean_se_"
%>%
) ::mutate(value_j = jitter(value, amount = 0.01)) %>%
dplyrggplot(aes(x=n, y=1.96*value_j, color=estimator)) +
geom_line(aes(linetype=parameter)) +
geom_point() +
theme_bw() +
ylab("Average CI width") +
scale_color_manual(
values = values,
breaks = breaks,
name = "SE estimator",
labels = labels
+
) scale_linetype_discrete(
breaks = c("beta0", "beta1"),
name = "Parameter",
labels = c(expression(beta[0]), expression(beta[1]))
)else {
} %>%
summarized_results pivot_longer(
cols = c("cov_beta0","cov_beta1"),
names_to = "parameter",
names_prefix = "cov_"
%>%
) ::mutate(value_j = jitter(value, amount = 0.01)) %>%
dplyrggplot(aes(x=n, y=value, color=estimator)) +
geom_line(aes(linetype = parameter)) +
geom_point() +
theme_bw() +
ylab("Coverage") +
scale_color_manual(
values = values,
breaks = breaks,
name = "SE estimator",
labels = labels
+
) scale_linetype_discrete(
breaks = c("beta0", "beta1"),
name = "Parameter",
labels = c(expression(beta[0]), expression(beta[1]))
+
) geom_hline(yintercept=0.95)
} }
We then use the plotting function to make our figures.
plot_results("width", 2)
plot_results("coverage", 2)
Looking at these plots, we can see that the sandwich method results in a wider interval, on average, for \(\beta_1\). In terms of coverage, the sandwich estimator achieves near nominal coverage for both parameters, while there is moderate undercoverage for \(\beta_1\) using the model-based estimator.
The bootstrap is another popular approach to estimating standard errors. We can add a bootstrap method and use update_sim
to run the new simulation replicates without re-running any of our previous work. All we need to do is include the new estimator in our simulation levels. Since the bootstrap can be computationally intensive, we will use parallelization to speed things up. This requires us to specify the option parallel = "outer"
in set_config
. (Even with parallelization, update_sim
will likely take a few minutes to run.)
<- function(data) {
bootstrap_vcov <- lm(y~x, data=data)
mod <- matrix(NA, nrow=100, ncol=2)
boot_ests for (j in 1:100) {
<- sample(1:nrow(data), size=nrow(data), replace=TRUE)
indices <- data[indices,]
boot_dat <- lm(y~x, data=boot_dat)
boot_mod <- boot_mod$coefficients
boot_ests[j,]
}<- var(boot_ests[,1])
boot_v1 <- var(boot_ests[,2])
boot_v2 return(list("coef"=mod$coefficients, "vcov"=c(boot_v1, boot_v2)))
}
%<>% set_levels(
sim estimator = c("model_vcov", "sandwich_vcov", "bootstrap_vcov"),
n = c(50, 100, 500, 1000)
)
%<>% set_config(
sim num_sim = 500,
seed = 24,
parallel = "outer",
n_cores = 2,
packages = c("sandwich")
)
%<>% update_sim()
sim #> Done. No errors or warnings detected.
Now that we have the bootstrap results included in our simulation object, we can look at the updated results.
<- sim %>% summarize(
summarized_results mean = list(
list(name="mean_se_beta0", x="beta0_se_est"),
list(name="mean_se_beta1", x="beta1_se_est")
),coverage = list(
list(name="cov_beta0", estimate="beta0_est", se="beta0_se_est", truth=-1),
list(name="cov_beta1", estimate="beta1_est", se="beta1_se_est", truth=10)
)
)
plot_results("width", 3)
plot_results("coverage", 3)
Like the sandwich estimator, the bootstrap results in wider intervals for \(\beta_1\), but is much closer to achieving 95% coverage compared to the model-based estimator.
For notational simplicity, we build a matrix \(\mathbb{X}\), whose first column is all 1’s (the intercept column) and whose second column is \((X_1, \dots, X_n)^T\). We also define a matrix
\[ \Sigma = \begin{pmatrix}\sigma^2_1 &\dots& 0 \\ \vdots &\ddots& \vdots \\ 0 & \dots & \sigma^2_n\end{pmatrix}\]
The ordinary least squares estimator of \(\beta\) is \(\hat{\beta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}\). The variance of this estimator, which we call \(\mathbb{V}\) is
\[\text{Var}(\hat{\beta}) = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\Sigma\mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\]
The usual estimator of \(\mathbb{V}\) is the model-based standard error \(s^2(\mathbb{X}^T\mathbb{X})^{-1}\), where \(s^2 = \frac{\sum_i (Y_i - (\hat{\beta}_0 + \hat{\beta}_1X_i))^2}{n-1}\). However, under our heteroskedastic model where the \(\sigma^2_i\) are not all equal, this is not a consistent estimator of \(\mathbb{V}\)! That is to say, even in large samples, we cannot generally expect this estimator to be close to the truth.
A better estimator for our setting is the sandwich standard error, or Huber-White standard error, given by
\[(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\hat{\Sigma}\mathbb{X}(\mathbb{X}^T\mathbb{X})^{-1}\]
where
\[\hat{\Sigma} = \begin{pmatrix}(Y_1 - (\hat{\beta}_0 + \hat{\beta}_1X_1))^2 &\dots& 0 \\ \vdots &\ddots& \vdots \\ 0 & \dots & (Y_n - (\hat{\beta}_0 + \hat{\beta}_1X_n))^2\end{pmatrix}\]