Demo of simulating multi-level data

Purpose of this vignette

The main purpose of PUMP is to estimate power, MDES, and sample size requirements. As a separate task, PUMP provides extra functions that allow a user to generate data that simulates multi-level RCTs. These functions may not be relevant for most users, but we provide them for users who may find it useful. This vignette documents how to use the simulation functions.

Choosing parameter values

The user first sets parameter values that will inform the data generating process. For full explanations of parameters, see the Technical Appendix.

Note that some of these parameters directly influence things like power calculations, while others are "nuisance" parameters that are nevertheless needed to generate a full dataset.

model.params.list <- list(
  M = 3                             # number of outcomes
  , J = 30                          # number of schools
  , K = 10                          # number of districts
                                      # (for two-level model, set K = 1)
  , nbar = 50                       # number of individuals per school
  , rho.default = 0.5               # default rho value (optional)
  ################################################## impact
  , MDES = 0.125                    # minimum detectable effect size      
  ################################################## level 3: districts
  , numCovar.3 = 1                  # number of district covariates
  , R2.3 = 0.1                      # percent of district variation
                                      # explained by district covariates
  , ICC.3 = 0.2                     # district intraclass correlation
  , omega.3 = 0.1                   # ratio of district effect size variability
                                      # to random effects variability
  ################################################## level 2: schools
  , numCovar.2 = 1                  # number of school covariates
  , R2.2 = 0.1                      # percent of school variation
                                    # explained by school covariates
  , ICC.2 = 0.2                     # school intraclass correlation 
  , omega.2 = 0.1                   # ratio of school effect size variability
                                      # to random effects variability
  ################################################## level 1: individuals
  , numCovar.1 = 1                  # number of individual covariates
  , R2.1 = 0.1                      # percent of indiv variation explained
                                      # by indiv covariates
)

Above is the minimum set of parameters the user can provide. However, the user can also make additional choices that influence the simulated dat. A couple of notes about possible choices: - By default, the function will generate a vector of school and district assignments S.id and D.id that are evenly split, e.g. with an equal number of schools in each district and an equal number of students in each school. However, the user can also provide their own vector of assignments if they require a specific setup. - If the user specifies a rho.default value, all matrices will be populated using the assumed \(\rho\). The user can instead provide their own \(\rho\) matrices.

The full set of parameters the user can specify is below.

M <- 3
rho.default <- 0.5
default.rho.matrix <- gen_corr_matrix(M = M, rho.scalar = rho.default)
default.kappa.matrix <- matrix(0, M, M) 

model.params.list <- list(
  M = 3                             # number of outcomes
  , J = 30                          # number of schools
  , K = 10                          # number of districts
                                      # (for two-level model, set K = 1)
  , nbar = 50                       # number of individuals per school
  , S.id = NULL                     # N-length vector of school assignments
  , D.id = NULL                     # N-length vector of district assignments
  ################################################## grand mean outcome and impact
  , Xi0 = 0                         # scalar grand mean outcome under no treatment
  , MDES = rep(0.125, M)            # minimum detectable effect size      
  ################################################## level 3: districts
  , numCovar.3 = 1                  # number of district covariates
  , R2.3 = rep(0.1, M)              # percent of district variation
                                      # explained by district covariates
  , rho.V = default.rho.matrix      # MxM correlation matrix of district covariates
  , ICC.3 = rep(0.2, M)             # district intraclass correlation
  , omega.3 = rep(0.1, M)           # ratio of district effect size variability
                                      # to random effects variability
  , rho.w0 = default.rho.matrix     # MxM matrix of correlations for district random effects
  , rho.w1 = default.rho.matrix     # MxM matrix of correlations for district impacts
  , kappa.w =  default.kappa.matrix # MxM matrix of correlations between district
                                      # random effects and impacts
  ################################################## level 2: schools
  , numCovar.2 = 1                  # number of school covariates
  , R2.2 = rep(0.1, M)              # percent of school variation
                                      # explained by school covariates
  , rho.X = default.rho.matrix      # MxM correlation matrix of school covariates
  , ICC.2 = rep(0.2, M)             # school intraclass correlation 
  , omega.2 = rep(0.1, M)           # ratio of school effect size variability
                                      # to random effects variability
  , rho.u0 = default.rho.matrix     # MxM matrix of correlations for school random effects
  , rho.u1 = default.rho.matrix     # MxM matrix of correlations for school impacts
  , kappa.u = default.kappa.matrix  # MxM matrix of correlations between school
                                      # random effects and impacts
  ################################################## level 1: individuals
  , numCovar.1 = 1                  # number of individual covariates
  , R2.1 = rep(0.1, M)              # percent of indiv variation explained
                                      # by indiv covariates
  , rho.C = default.rho.matrix      # MxM correlation matrix of individual covariates
  , rho.r = default.rho.matrix      # MxM matrix of correlations for individual residuals 
)

Simulate data

The user-given parameters are then converted into parameters that inform the data-generating process (DGP). For example, a certain value of \(R^2\) is converted in a coefficient value. Then, the user can generate a full set of simulated data. The simulated data includes both unobserved and unobserved quantities, such as both potential outcomes.

d_m <- 'd3.3_m3rc2rc'
dgp.params.list <- convert_params(model.params.list)
sim.data.full <- gen_full_data(dgp.params.list)

Finally, we generate the treatment assignment, and the observed outcomes \(Y^{obs}\).

T.x <- gen_T.x(
    d_m = d_m, S.id = sim.data.full$ID$S.id, D.id = sim.data.full$ID$D.id,
    nbar = model.params.list$nbar, Tbar = 0.5
)
sim.data.obs <- sim.data.full
sim.data.obs$Yobs <- gen_Yobs(sim.data.full, T.x)
## List of 7
##  $ Y0   :'data.frame':   15000 obs. of  3 variables:
##   ..$ m1: num [1:15000] -0.334 -0.89 -1.653 -0.88 -2.655 ...
##   ..$ m2: num [1:15000] 0.318 -0.424 -1.286 -0.755 0.316 ...
##   ..$ m3: num [1:15000] -0.5314 0.0173 -1.5291 -0.1862 -1.6029 ...
##  $ Y1   :'data.frame':   15000 obs. of  3 variables:
##   ..$ m1: num [1:15000] 0.0607 -0.4957 -1.2589 -0.4857 -2.2603 ...
##   ..$ m2: num [1:15000] 0.668 -0.074 -0.936 -0.405 0.666 ...
##   ..$ m3: num [1:15000] -0.149 0.4 -1.147 0.196 -1.22 ...
##  $ V.k  : num [1:15000, 1:3] -1.96 -1.96 -1.96 -1.96 -1.96 ...
##  $ X.jk : num [1:15000, 1:3] 1.29 1.29 1.29 1.29 1.29 ...
##  $ C.ijk: num [1:15000, 1:3] 1.1249 -1.7509 -0.9587 0.1567 -0.0352 ...
##  $ ID   :'data.frame':   15000 obs. of  2 variables:
##   ..$ S.id: int [1:15000] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ D.id: int [1:15000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Yobs :'data.frame':   15000 obs. of  3 variables:
##   ..$ m1: num [1:15000] 0.0607 -0.4957 -1.2589 -0.4857 -2.2603 ...
##   ..$ m2: num [1:15000] 0.668 -0.074 -0.936 -0.405 0.666 ...
##   ..$ m3: num [1:15000] -0.149 0.4 -1.147 0.196 -1.22 ...