Testing the Package

Testing the package is an important part of package development. Although we are unit-testing the functions, there is always a situation that is not tested. Please feel free to try the code under different conditions. Please report the bugs as a new issue. In the following, there steps to generate synthetic data and conduct some testing.

In order to test the package, you need to have the code on your system (R (>= 3.5.0)). There are two options:

If you are running the code on a cluster and trying to carry out some test on scaling out, installing it is the best option. However, if you want to dig deeper and debug the code interactively, cloning the code is your choice. If you are planning to test, debug, and commit the changes, use forking. In the following, we explain all these options.

Getting the Code

Installing the Package

Use devtools::install_github to install the package. If you do not specify the ref, it will install the master (or main) branch. The master branch hosts the latest released code. The latest updates are committed to the develop branch. For more details please refer Git Branching Model in the Developers Guide section.

library(devtools)
try(detach("package:CausalGPS", unload = TRUE), silent = TRUE) # if already you have the package, detach and unload it, to have a new install. 
install_github("fasrc/CausalGPS", ref="develop")
library(CausalGPS)

The process should run smoothly. Try ?CausalGPS. It should open the package description page under the help tab (assuming you are using RStudio).

Cloning the Package

Go to the package Github repository, at the top left (A), choose the branch that you want to clone, then click on the Code button (B), then choose Download ZIP. The following figure shows the buttons’ location.

After cloning the code, open one of the files using RStudio, then change the project directory to the project directory (Session > Set Working Directory > To Project Directory).

Forking the package

Forking is explained in the Contribution page.

Package development, test, check cycle

There are some commands that you need to use during testing and debugging the package. You can read about these commands in R Packages book. For debugging and testing, you do not need to know all steps. Here is the list of commands that you need:

Running Examples

You can use any causal inference studies data to test the package. The database needs to have the following attributes:

The package can generate synthetic data that can be used to test different features of the package. At the current implementation, the code only generates synthetic numerical data; however, with small innovation, one can add categorical data. In the following, we present some reproducible examples that you can copy and build upon them.

Generating Synthetic Data

library("CausalGPS")

mydata <- generate_syn_data(sample_size = 10000) 
str(mydata)
#'data.frame':  10000 obs. of  8 variables:
# $ Y    : num  -17 3.66 -45.88 -12.12 2.75 ...
# $ treat: num  9.74 14.88 5.22 10.19 16.65 ...
# $ cf1  : num  -0.128 0.155 -1.271 0.551 0.846 ...
# $ cf2  : num  0.2376 0.0466 0.8918 -1.8434 -2.4487 ...
# $ cf3  : num  1.5492 -0.4453 0.0718 0.8309 0.6749 ...
# $ cf4  : num  1.3738 0.8621 0.4735 0.7013 -0.0851 ...
# $ cf5  : num  -2 2 1 -2 0 1 1 2 -2 -1 ...
# $ cf6  : num  -1.794 -1.822 0.79 -0.591 2.673 ...

Estimating GPS Values for the Dataset

In the following example, first, we generate 10000 synthetic data samples, then we feed them into estimate_gps function to estimate their GPS values. Please note that the Y variable is not used inside the code, and it is only provided to cbind with the generated GPS values. You can read more about different arguments in the documentation (?estimate_gps). In summary, - We want to run a code using SuperLearner (sl) prediction model. Inside the SuperLearner package, we want to use the XGBoost package. As you know, we internally generate a customized wrapper for the XGBoost package; thus, the correct way to activate that is by passing m_xgboost, which stands for modified xgboost. - We requested for one thread (nthread = 1); you can use as much as you want; the package will use the available one and will ignore the rest of them. XGBoost uses OpenMP backend to use all cores. Sometimes it becomes a really big challenge on Mac systems. So if you are using Mac and do not see any performance improvement, be aware of that. - In params, we passed the list of parameters; the function will choose one from each list at random and will generate a customize wrapper based on them. If you want to use a specific value, just give a list one number (e.g., xgb_max_depth = c(3)). All parameters that start with xgb_ will only change XGBoost hyperparameters.

library("CausalGPS")
mydata <- generate_syn_data(sample_size = 10000) 
data_with_gps <- estimate_gps(mydata$Y,
                              mydata$treat,
                              mydata[c("cf1","cf2","cf3","cf4","cf5","cf6")],
                              pred_model = "sl",
                              internal_use = FALSE,
                              params = list(xgb_max_depth = c(3,4,5),
                                       xgb_nrounds=c(10,20,30,40,50,60)),
                              nthread = 1,
                              sl_lib = c("m_xgboost")
                             )

Now, let’s add some categorical data. Let’s say our data belongs to 5 different years (2000 observations per year), and each year we have data from 4 different regions, including North, South, East, and West.

library("CausalGPS")
mydata <- generate_syn_data(sample_size = 10000) 

year <- c(rep(c("2001"), each=2000),
          rep(c("2002"), each=2000),
          rep(c("2003"), each=2000),
          rep(c("2004"), each=2000), 
          rep(c("2005"), each=2000))

region <- rep(c(rep("North",each=500),
                rep("South",each=500),
                rep("East",each=500),
                rep("West",each=500)), each=5)

mydata$year <- as.factor(year)
mydata$region <- as.factor(region)

data_with_gps <- estimate_gps(mydata$Y,
                              mydata$treat,
                              mydata[c("cf1","cf2","cf3","cf4","cf5","cf6", "year", "region")],
                              pred_model = "sl",
                              internal_use = FALSE,
                              params = list(xgb_max_depth = c(3,4,5),
                                       xgb_nrounds=c(10,20,30,40,50,60)),
                              nthread = 1,
                              sl_lib = c("m_xgboost")
                             )

Generating Pseudo Population

Generating pseudo population (gen_pseudo_pop())is one of the important parts of the package. It internally uses estimate_gps() package. An acceptable pseudo population should pass the covariate balance test. Users choose which test and what threshold, and how many attempts. If the covariate balance test is met, the function stops trying and returns the generated population. However, if it cannot pass the test, it still returns what is generated with a message to the user indicating the population did not pass the test. At each iteration, we change the hyperparameters to modify GPS values (to better or worse) to pass the covariate balance test. After estimating the GPS values, we need to compile the population. There are three major approaches to compile pseudo population, including:

Among different methods for testing covariate balance test, only absolute approach is implemented.

library("CausalGPS")
mydata <- generate_syn_data(sample_size = 10000) 
pseudo_pop <- generate_pseudo_pop(mydata$Y,
                                  mydata$treat,
                                  mydata[c("cf1","cf2","cf3","cf4","cf5","cf6")],
                                  ci_appr = "matching",
                                  pred_model = "sl",
                                  sl_lib = c("m_xgboost"),
                                  params = list(xgb_nrounds=c(10,20,30),
                                  xgb_eta=c(0.1,0.2,0.3)),
                                  nthread = 1,
                                  covar_bl_method = "absolute",
                                  covar_bl_trs = 0.1,
                                  covar_bl_trs_type= "mean",
                                  max_attempt = 1,
                                  matching_fun = "matching_l1",
                                  delta_n = 1,
                                  scale = 0.5)

The second causal inference approach is weighting. Here is an example to generate pseudo population using weighting approach.

mydata <- generate_syn_data(sample_size = 10000) 

year <- c(rep(c("2001"), each=2000),
          rep(c("2002"), each=2000),
          rep(c("2003"), each=2000),
          rep(c("2004"), each=2000), 
          rep(c("2005"), each=2000))

region <- rep(c(rep("North",each=500),
                rep("South",each=500),
                rep("East",each=500),
                rep("West",each=500)), each=5)

mydata$year <- as.factor(year)
mydata$region <- as.factor(region)

pseudo_pop <- generate_pseudo_pop(mydata$Y,
                                  mydata$treat,
                                  mydata[c("cf1","cf2","cf3","cf4","cf5","cf6","year","region")],
                                  ci_appr = "weighting",
                                  pred_model = "sl",
                                  sl_lib = c("m_xgboost"),
                                  params = list(xgb_nrounds=c(10,20,30),
                                   xgb_eta=c(0.1,0.2,0.3)),
                                  nthread = 1,
                                  covar_bl_method = "absolute",
                                  covar_bl_trs = 0.1,
                                  covar_bl_trs_type = "mean",
                                  max_attempt = 1
                                  )

Further Processing

After generating a pseudo population, we can process the data for different purposes. So far, estimating exposure rate function (estimate_erf) is implemented.

# library("CausalGPS")
mydata <- generate_syn_data(sample_size = 10000) 
pseudo_pop <- generate_pseudo_pop(mydata$Y,
                                  mydata$treat,
                                  mydata[c("cf1","cf2","cf3","cf4","cf5","cf6")],
                                  ci_appr = "matching",
                                  pred_model = "sl",
                                  sl_lib = c("m_xgboost"),
                                  params = list(xgb_nrounds=c(10,20,30),
                                   xgb_eta=c(0.1,0.2,0.3)),
                                  nthread = 1,
                                  covar_bl_method = "absolute",
                                  covar_bl_trs = 0.1,
                                  covar_bl_trs_type= "mean",
                                  max_attempt = 1,
                                  matching_fun = "matching_l1",
                                  delta_n = 1,
                                  scale = 0.5)

erf_val <- estimate_npmetric_erf(pseudo_pop$pseudo_pop$Y,
                                 pseudo_pop$pseudo_pop$w,
                                 bw_seq=seq(0.2,2,0.2),
                                 w_vals = seq(2,20,0.5))

Steps for using precomputed data during the test

The package is being tested on different data samples, some of them are generated during testing, and some others are generated before and just are loaded. These data are located in “R/sysdata.rda” file. If you add new features to the code, you may need to use some external or pre-computed data set to test your functions. In the following, we explain the steps to append data to “R/sysdata.rda”. Please note that CRAN might reject large data sets.

# in terminal
cp  R/sysdata.rda R/sysdata_backup.rda
rm(list=ls())
load("R/sysdata.rd")
list_names <- ls()
list_names <- c(list_names, "mydata") 
save(list=list_names, file="R/sysdata.rda")