BranchGLM Vignette

1 Description

BranchGLM is a package for fitting GLMs and performing variable selection. Most functions in this package make use of RcppArmadillo and some of them can also make use of OpenMP to perform parallel computations. This vignette introduces the package, provides examples on how to use the main functions in the package and also briefly describes the methods employed by the functions.

2 Installation

BranchGLM can be installed using the install.packages() function


install.packages("BranchGLM")

It can also be installed via the install_github() function from the devtools package.


devtools::install_github("JacobSeedorff21/BranchGLM")

3 Fitting GLMs

3.1 Optimization methods

3.2 Examples

### Using mtcars

library(BranchGLM)

cars <- mtcars

### Fitting linear regression model with Fisher scoring

LinearFit <- BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")

LinearFit
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ .
#> 
#>             Estimate       SE       t p.values  
#> (Intercept) 12.30000 18.72000  0.6573  0.51810  
#> cyl         -0.11140  1.04500 -0.1066  0.91610  
#> disp         0.01334  0.01786  0.7468  0.46350  
#> hp          -0.02148  0.02177 -0.9868  0.33500  
#> drat         0.78710  1.63500  0.4813  0.63530  
#> wt          -3.71500  1.89400 -1.9610  0.06325 .
#> qsec         0.82100  0.73080  1.1230  0.27390  
#> vs           0.31780  2.10500  0.1510  0.88140  
#> am           2.52000  2.05700  1.2250  0.23400  
#> gear         0.65540  1.49300  0.4389  0.66520  
#> carb        -0.19940  0.82880 -0.2406  0.81220  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 7.0235
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 147 on 21 degrees of freedom
#> AIC: 166

### Fitting gamma regression with inverse link with L-BFGS

GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse",
                      method = "LBFGS", grads = 5)

GammaFit
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ .
#> 
#>               Estimate         SE       t p.values  
#> (Intercept) -6.794e-02  3.085e-02 -2.2020  0.03898 *
#> cyl          1.760e-03  1.946e-03  0.9047  0.37590  
#> disp        -7.947e-06  3.515e-05 -0.2261  0.82330  
#> hp          -6.724e-05  4.050e-05 -1.6600  0.11170  
#> drat        -4.283e-04  2.728e-03 -0.1570  0.87670  
#> wt          -9.224e-03  3.360e-03 -2.7450  0.01214 *
#> qsec         1.739e-03  1.165e-03  1.4930  0.15040  
#> vs          -3.087e-04  3.322e-03 -0.0929  0.92690  
#> am           6.304e-04  3.441e-03  0.1832  0.85640  
#> gear         4.882e-03  2.577e-03  1.8940  0.07204 .
#> carb        -1.025e-03  1.481e-03 -0.6926  0.49610  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0087
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 21 degrees of freedom
#> AIC: 152
#> Algorithm converged in 3 iterations using L-BFGS

3.3 Useful functions

### Predict method

predict(GammaFit)
#>  [1] 21.18646 20.58540 25.08262 19.73260 16.84361 19.66750 14.40825 22.08188
#>  [9] 23.72831 18.71060 19.08321 15.34256 16.20899 16.27084 12.61678 12.23244
#> [17] 12.09701 28.31601 31.25370 32.11988 22.42958 17.18607 17.63342 13.73885
#> [25] 15.78752 29.52892 26.61243 30.53420 16.86624 19.39720 14.08588 21.53099

### Accessing coefficients matrix

GammaFit$coefficients
#>                  Estimate           SE           t   p.values
#> (Intercept) -6.794026e-02 3.085396e-02 -2.20199519 0.03897971
#> cyl          1.760179e-03 1.945517e-03  0.90473592 0.37586830
#> disp        -7.947123e-06 3.515071e-05 -0.22608709 0.82331968
#> hp          -6.724158e-05 4.049734e-05 -1.66039498 0.11169449
#> drat        -4.282730e-04 2.727545e-03 -0.15701775 0.87673070
#> wt          -9.224050e-03 3.360483e-03 -2.74485832 0.01213720
#> qsec         1.739266e-03 1.165327e-03  1.49251309 0.15043663
#> vs          -3.086809e-04 3.322383e-03 -0.09290948 0.92685616
#> am           6.304156e-04 3.441185e-03  0.18319724 0.85640047
#> gear         4.881920e-03 2.577117e-03  1.89433406 0.07203615
#> carb        -1.025451e-03 1.480570e-03 -0.69260584 0.49614538

4 Performing variable selection

4.1 Stepwise methods

4.1.1 Forward selection example

### Forward selection with mtcars

VariableSelection(GammaFit, type = "forward")
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using forward selection with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 27
#> 
#> Order the variables were added to the model:
#> 
#> 1). wt
#> 2). hp
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt
#> 
#>               Estimate         SE      t  p.values    
#> (Intercept) -8.923e-03  2.806e-03 -3.180 0.0034910 ** 
#> hp          -8.887e-05  2.110e-05 -4.212 0.0002245 ***
#> wt          -9.826e-03  1.384e-03 -7.100  8.21e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0104
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 29 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring

4.1.2 Backward elimination example

### Backward elimination with mtcars

VariableSelection(GammaFit, type = "backward")
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using backward elimination with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 49
#> 
#> Order the variables were removed from the model:
#> 
#> 1). vs
#> 2). drat
#> 3). am
#> 4). disp
#> 5). carb
#> 6). cyl
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt + qsec + gear
#> 
#>               Estimate         SE      t  p.values    
#> (Intercept) -4.691e-02  1.782e-02 -2.633   0.01384 *  
#> hp          -6.284e-05  3.015e-05 -2.084   0.04675 *  
#> wt          -9.485e-03  1.819e-03 -5.213 1.719e-05 ***
#> qsec         1.299e-03  7.471e-04  1.739   0.09348 .  
#> gear         2.662e-03  1.652e-03  1.612   0.11870    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring

4.2 Branch and bound

4.2.1 Branch and bound example

  • If showprogress is true, then progress of the branch and bound algorithm will be reported occasionally.
  • Parallel computation can be used with these methods and can lead to very large speedups.
### Branch and bound with mtcars

VariableSelection(GammaFit, type = "branch and bound", showprogress = FALSE)
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 142
#> Number of models fit: 74
#> 
#> 
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt + qsec + gear
#> 
#>               Estimate         SE      t  p.values    
#> (Intercept) -4.691e-02  1.782e-02 -2.633   0.01384 *  
#> hp          -6.284e-05  3.015e-05 -2.084   0.04675 *  
#> wt          -9.485e-03  1.819e-03 -5.213 1.719e-05 ***
#> qsec         1.299e-03  7.471e-04  1.739   0.09348 .  
#> gear         2.662e-03  1.652e-03  1.612   0.11870    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring

### Can also use a formula and data

FormulaVS <- VariableSelection(mpg ~ . ,data = cars, family = "gamma", 
                               link = "inverse", type = "branch and bound",
                               showprogress = FALSE)

### Number of models fit divided by the number of possible models

FormulaVS$numchecked / 2^(length(FormulaVS$variables))
#> [1] 0.07226562

### Extracting final model

FormulaVS$finalmodel
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ hp + wt + qsec + gear
#> 
#>               Estimate         SE      t  p.values    
#> (Intercept) -4.691e-02  1.782e-02 -2.633   0.01384 *  
#> hp          -6.284e-05  3.015e-05 -2.084   0.04675 *  
#> wt          -9.485e-03  1.819e-03 -5.213 1.719e-05 ***
#> qsec         1.299e-03  7.471e-04  1.739   0.09348 .  
#> gear         2.662e-03  1.652e-03  1.612   0.11870    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0091
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 27 degrees of freedom
#> AIC: 142
#> Algorithm converged in 3 iterations using Fisher's scoring

4.3 Using keep

### Example of using keep

VariableSelection(mpg ~ . ,data = cars, family = "gamma", 
                               link = "inverse", type = "branch and bound",
                               keep = c("hp", "cyl"), metric = "AIC",
                               showprogress = FALSE)
#> Variable Selection Info:
#> --------------------------------------------
#> Variables were selected using branch and bound selection with AIC
#> The best value of AIC obtained was 143
#> Number of models fit: 38
#> Variables that were kept in each model:  hp, cyl
#> 
#> --------------------------------------------
#> Final Model:
#> --------------------------------------------
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ cyl + hp + wt + qsec + gear
#> 
#>               Estimate         SE       t  p.values    
#> (Intercept) -6.464e-02  2.700e-02 -2.3940   0.02418 *  
#> cyl          1.412e-03  1.642e-03  0.8603   0.39750    
#> hp          -7.523e-05  3.321e-05 -2.2650   0.03205 *  
#> wt          -1.037e-02  2.082e-03 -4.9830 3.517e-05 ***
#> qsec         1.816e-03  9.462e-04  1.9190   0.06603 .  
#> gear         3.861e-03  2.155e-03  1.7910   0.08490 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.0089
#> 32 observations used to fit model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0 on 26 degrees of freedom
#> AIC: 143
#> Algorithm converged in 3 iterations using Fisher's scoring

4.4 Convergence issues

5 Utility functions for binomial GLMs

5.1 Table

### Predicting if a car gets at least 18 mpg

catData <- ToothGrowth

catFit <- BranchGLM(supp ~ ., data = catData, family = "binomial", link = "logit")

Table(catFit)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

5.2 ROC


catROC <- ROC(catFit)

plot(catROC, main = "ROC Curve", col = "indianred")

5.3 Cindex/AUC


Cindex(catFit)
#> [1] 0.7127778

AUC(catFit)
#> [1] 0.7127778

5.4 MultipleROCPlots

### Showing ROC plots for logit, probit, and cloglog

probitFit <- BranchGLM(supp ~ . ,data = catData, family = "binomial", 
                       link = "probit")

cloglogFit <- BranchGLM(supp ~ . ,data = catData, family = "binomial", 
                       link = "cloglog")

MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit), 
                  names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))

5.5 Using predictions


preds <- predict(catFit)

Table(preds, catData$supp)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

AUC(preds, catData$supp)
#> [1] 0.7127778

ROC(preds, catData$supp) |> plot(main = "ROC Curve", col = "deepskyblue")