Accounting for Weights in the Model

Simon Bonner

2021-11-22

In some cases the response variable may be an average of responses from within the same subject or group. In this case, the variances for two observations having the same covariates will not be identical if the size of the groups vary. Instead, it will be inversely proportional to the size of the group. Mathematically, if the observed response is the group average \(\bar{Y}_i=\sum_{j=1}^{n_i} Y_{ij}/n_i\) and \(Y_{ij} \sim N(\mu_i,\phi_i)\) where the dispersion parameter, \(\phi_i\), may vary by group and depend on covariates then \[\bar{Y}_i \sim N(\mu_i,\phi_i/n_i).\] This situation can be accommodated in dalmatian by specifying a column of weights within the data frame that provides the group size. Here is an example with simulated data that demonstrates this feature and shows the importance of properly accounting for weights.

Simulated data

Data for this example are contained in the data frame weights.df saved within the file weights-data-1.RData.

## Load library
library(dalmatian)

## Load simulated data
data(weights_data_1)
head(weights_data_1)
##    n     x     y
## 1  6 -1.00 -0.89
## 2  6 -0.98 -0.71
## 3  7 -0.96 -0.82
## 4 10 -0.94 -1.05
## 5  8 -0.92 -0.61
## 6 20 -0.90 -0.52

The three columns in the data frame record the number of responses per group (n), the value of the covariate (x), and the mean response (y). The data were generated from the model \[Y_{ij} \sim N(x_i,\exp(1+1x_i))\] with \(i=1,\ldots,100\) indexing the groups and \(j\) the observations within groups. In the data, the number of observations per group ranges from 5 to 40.

Model 1: No Weights

First we run the model with no weights.

## Mean model
mymean=list(fixed=list(name="alpha",formula=~ x,
            priors=list(c("dnorm",0,.001))))

## Variance model
mydisp=list(fixed=list(name="psi",
                       formula=~ x,
                       priors=list(c("dnorm",0,.001))),
            link="log")

## Set working directory
## By default uses a system temp directory. You probably want to change this.
workingDir <- tempdir()

## Define list of arguments for jags.model()
jm.args = list(file=file.path(workingDir,"weights_1_jags.R"),n.adapt=1000)

## Define list of arguments for coda.samples()
cs.args = list(n.iter = 5000, n.thin = 20)

## Run the model using dalmatian
results1 <- dalmatian(
  df = weights_data_1,
  mean.model = mymean,
  dispersion.model = mydisp,
  jags.model.args = jm.args,
  coda.samples.args = cs.args,
  response = "y",
  overwrite = TRUE,
  debug = FALSE)
## Step 1: Generating JAGS data...
## Done
## Step 2: Generating JAGS code...
## Done
## Step 3: Generating initial values...
## 
##     Running three parallel chains by default...
##      Initializing chain 1 ...
##      Initializing chain 2 ...
##      Initializing chain 3 ...
## Done
## Step 4: Running model
## module glm loaded
##     Initializing model
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 300
##    Unobserved stochastic nodes: 4
##    Total graph size: 1512
## 
## Initializing model
## 
##    Generating samples
## Done
## Step 5: Tidying Output...
## Done
## Numerical summary statistics
summary1 <- summary(results1)
summary1
## 
## Iterations = 1001:6000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## Posterior Summary Statistics for Each Model Component
## 
## Mean Model: Fixed Effects 
##              Mean Median   SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -0.02  -0.02 0.05     -0.12     -0.06      0.02      0.08
## x            0.87   0.87 0.08      0.71      0.82      0.93      1.03
## 
## Dispersion Model: Fixed Effects 
##             Mean Median   SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -1.5   -1.5 0.14     -1.78      -1.6      -1.4      -1.2
## x            1.2    1.2 0.27      0.67       1.0       1.4       1.7
## Graphical summaries
caterpillar1 <- caterpillar(results1, show = TRUE)

From the summaries we can see that that the intercept in the dispersion model is being underestimated. The true value is 1, but the posterior mean is -1.5 with 95% HPD interval (-1.78,-1.22).

Model 2: Weights

We now re-run the model including the weights.

## Specify column containing weights
mydisp$weights = "n"

## Run model
jm.args = list(file=file.path(workingDir,"weights_2_jags.R"),n.adapt=1000)

results2 = dalmatian(df=weights_data_1,
                     mean.model=mymean,
                     dispersion.model=mydisp,
                     jags.model.args=jm.args,
                     coda.samples.args=cs.args,
                     response="y",
                     overwrite = TRUE,
                     debug=FALSE)
## Step 1: Generating JAGS data...
## Done
## Step 2: Generating JAGS code...
## Done
## Step 3: Generating initial values...
## 
##     Running three parallel chains by default...
##      Initializing chain 1 ...
##      Initializing chain 2 ...
##      Initializing chain 3 ...
## Done
## Step 4: Running model
##     Initializing model
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 300
##    Unobserved stochastic nodes: 4
##    Total graph size: 1612
## 
## Initializing model
## 
##    Generating samples
## Done
## Step 5: Tidying Output...
## Done
## Numerical summary statistics
summary2 <- summary(results2)
summary2
## 
## Iterations = 1001:6000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## Posterior Summary Statistics for Each Model Component
## 
## Mean Model: Fixed Effects 
##              Mean Median   SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept) -0.04  -0.04 0.05     -0.13     -0.07      0.00      0.05
## x            0.88   0.89 0.08      0.73      0.83      0.93      1.04
## 
## Dispersion Model: Fixed Effects 
##             Mean Median   SD Lower 95% Lower 50% Upper 50% Upper 95%
## (Intercept)  1.1    1.1 0.14      0.80      0.97       1.2       1.4
## x            1.2    1.2 0.27      0.63      1.01       1.4       1.7
## Graphical summaries
caterpillar2 <- caterpillar(results2, show = TRUE)

The new output shows that the estimate of the intercept for the variance model, 1.08, is now very close to the truth and the 95% credible interval, (0.8,1.36) easily covers the true value of 1.