Contents

1 Installation

BOSO can be installed from CRAN repository:

install.packages("BOSO")

Note that it is necessary to have IBM ILOG CPLEX installed, with a version greater than 12.1, and the cplexAPI package, which is used to invoke CPLEX from R environment.

Furthermore, check if packages are installed. Check if bestsubset is installed. If not, source necessary functions from bestsubset.

if (!requireNamespace('cplexAPI', quietly = TRUE)) {
  testthat::skip('Package cplexAPI not installed (required for this vignette)!\n
         Install it from CRAN: https://cran.r-project.org/web/packages/cplexAPI/index.html')
  stop('Package cplexAPI not installed (required for this vignette)!\n
         Install it from CRAN: https://cran.r-project.org/web/packages/cplexAPI/index.html', call. = FALSE)
}
if (!requireNamespace('bestsubset', quietly = TRUE)) {
  devtools::source_url("https://raw.githubusercontent.com/ryantibs/best-subset/master/bestsubset/R/fs.R")
  devtools::source_url("https://raw.githubusercontent.com/ryantibs/best-subset/master/bestsubset/R/lasso.R")
  devtools::source_url("https://raw.githubusercontent.com/ryantibs/best-subset/master/bestsubset/R/sim.R")
  enlist <- function (...) 
  {
    result <- list(...)
    if ((nargs() == 1) & is.character(n <- result[[1]])) {
      result <- as.list(seq(n))
      names(result) <- n
      for (i in n) result[[i]] <- get(i)
    } else {
      n <- sys.call()
      n <- as.character(n)[-1]
      if (!is.null(n2 <- names(result))) {
        which <- n2 != ""
        n[which] <- n2[which]
      }
      names(result) <- n
    }
    result
  }
} else {require("bestsubset")}
## i SHA-1 hash of file is cc12499ddd2cc538f318f3ee14ad4a3911945a27
## i SHA-1 hash of file is e7c4bf157a3c33b0a11179541be100c0037a4fea
## i SHA-1 hash of file is 531fcc6ad06024eae37afaa3e15d83e0ea3ece89

2 Introduction

The package is designed to run the BOSO algorithm in one single function, which comprises two steps: block strategy and final problem.

Figure 1. Flow diagram of BOSO.

Figure 1 summarizes the BOSO algorithm. An example dataset with 7 features is split into training and validation sets. Green boxes represent optimal selected features for a Specific K value. For example, for K=2, the subset of features that minimizes the validation error is {X3, X6}. The problem of selecting the best subset of features of length K is formulated via mixed-integer quadratic programming (MIQP) and solved using IBM ILOG CPLEX This process is repeated for each K value until an information criterion, in this case the extended Bayesian Information Criterion (eBIC), is not further improved. Minimal eBIC is found in this example for K=2. The final model is derived from Ridge regression with only these two selected variables.

3 Different parameters that can be changed in the BOSO algortihm.

Function BOSO has a number of parameters that can be changed:

On the other hand, there are some parameters that are identical to the ones found in the glmnet function: lambda.min.ratio, lambda, intercept, standardize and dfmax.

Finally, some of the parameters are solver-specific:

4 Example: High-5 scenario discussed in the BOSO article

Given a synthetic data set (see Methods section in the article), we evaluate the capacity of BOSO to extract the features that are related with the response variable. First o, we set the parameters for the generation of synthetic data:

# Set some overall simulation parameters
n = 100;          # Size of training set
nval = n          # Size of validation set
p = 10;           # Number of variables
s = 5             # Number of nonzero coefficients
beta.type = 1     # Type of coefficiente structure
rho.vec = 0.35    # Variable autocorrelation level
snr.vec = exp(seq(log(0.05),log(6),length=10)) # Signal-to-noise ratios 
nrep = 10          # Number of repetitions for a given setting
seed = 0          # Random number generator seed

Benchmarked methods are defined: Forward Stepwise, Lasso, Relaxed lasso and BOSO.

# Regression functions: lasso, forward stepwise and BOSO
reg.funs = list()

reg.funs[["Forward stepwise"]] = function(x,y) fs(x,y,intercept=FALSE, max=min(n,p))

reg.funs[["Lasso"]] = function(x,y) lasso(x,y,intercept=FALSE,nlam=50)

reg.funs[["Relaxed lasso"]] = function(x,y) lasso(x,y,intercept=FALSE,
                                                  nrelax=10,nlam=50)

reg.funs[["BOSO - AIC"]] <- function(x,y,xval,yval) BOSO(x,y,xval,yval,
                                                         IC = "AIC",
                                                         nlambda = 100,
                                                         Threads = 4,
                                                         timeLimit = 60,
                                                         intercept = F,
                                                         standardize = F,
                                                         verbose = 0,
                                                         warmstart=T,
                                                         seed = 123456)

reg.funs[["BOSO - BIC"]] <- function(x,y,xval,yval) BOSO(x,y,xval,yval,
                                                         IC = "BIC",
                                                         nlambda = 100,
                                                         Threads = 4,
                                                         timeLimit = 60,
                                                         intercept = F,
                                                         standardize = F,
                                                         verbose = 0,
                                                         warmstart=T,
                                                         seed = 123456)

reg.funs[["BOSO - eBIC"]] <- function(x,y,xval,yval) BOSO(x,y,xval,yval,
                                                          IC = "eBIC",
                                                          nlambda = 100,
                                                          Threads = 4,
                                                          timeLimit = 60,
                                                          intercept = F,
                                                          standardize = F,
                                                          verbose = 0,
                                                          warmstart=T,
                                                          seed = 123456)

We have adapted the sim.masterfunction from the bestsubset library function to assess the different methods.

sim.results <- list()

for (i in 1:length(snr.vec)) {
  set.seed(i)
  sim.results[[i]]  <- sim.master(n = n, p = p, nval = nval,
                                  rho=rho.vec, s=s, beta.type=beta.type, 
                                  snr = snr.vec[[i]], 
                                  reg.funs, nrep=nrep, seed=i,
                                  verbose=T)
}

4.1 Comparison of BOSO with other methods

Finally, the next part of the code shows the comparison of BOSO-eBIC with some of the most widely used algorithms in the literature: Lasso, Relaxed lasso and Forward Stepwise.

method.nums = c(6,1,2,3)
method.names = c("BOSO - eBIC","Forward stepwise","Lasso","Relaxed lasso")


plot.BOSO <- list(
  "F" = plot.from.sim(sim.list = sim.results, what="F", rel.to=NULL, tuning="val",
                      method.nums=method.nums, method.names=method.names),
  "nonzeros" = plot.from.sim(sim.list = sim.results, what="nonzero", rel.to=NULL, tuning="val",
                             method.nums=method.nums, method.names=method.names),
  "fpos" = plot.from.sim(sim.list = sim.results, what="fpos", rel.to=NULL, tuning="val",
                         method.nums=method.nums, method.names=method.names),
  "fneg" = plot.from.sim(sim.list = sim.results, what="fneg", rel.to=NULL, tuning="val",
                         method.nums=method.nums, method.names=method.names))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
ggarrange(plotlist = plot.BOSO, ncol = 2, nrow = 2, common.legend = T, legend = "bottom")

4.2 Comparison of BOSO with different information criteria

The following part of code is designed to compare the performance of the BOSO algorithm using different information criteria. The options available are AIC, BIC and eBIC.

method.nums = c(4, 5, 6)
method.names = c("BOSO - AIC","BOSO - BIC","BOSO - eBIC")

plot.BOSO <- list(
  "F" = plot.from.sim(sim.list = sim.results, what="F", rel.to=NULL, tuning="val",
                      method.nums=method.nums, method.names=method.names),
  "nonzeros" = plot.from.sim(sim.list = sim.results, what="nonzero", rel.to=NULL, tuning="val",
                             method.nums=method.nums, method.names=method.names),
  "fpos" = plot.from.sim(sim.list = sim.results, what="fpos", rel.to=NULL, tuning="val",
                         method.nums=method.nums, method.names=method.names),
  "fneg" = plot.from.sim(sim.list = sim.results, what="fneg", rel.to=NULL, tuning="val",
                         method.nums=method.nums, method.names=method.names))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
ggarrange(plotlist = plot.BOSO, ncol = 2, nrow = 2, common.legend = T, legend = "bottom")

5 References

6 Session Information

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                     LC_CTYPE=English_Ireland.1252   
## [3] LC_MONETARY=English_Ireland.1252 LC_NUMERIC=C                    
## [5] LC_TIME=English_Ireland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggpubr_0.4.0     ggplot2_3.3.4    glmnet_4.1-1     Matrix_1.3-4    
## [5] kableExtra_1.3.4 dplyr_1.0.7      BOSO_1.0.3       knitr_1.33      
## [9] BiocStyle_2.16.1
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0            usethis_2.0.1       devtools_2.4.2     
##  [4] webshot_0.5.2       httr_1.4.2          rprojroot_2.0.2    
##  [7] tools_4.0.5         backports_1.2.1     bslib_0.2.5.1      
## [10] utf8_1.2.1          R6_2.5.0            DBI_1.1.1          
## [13] colorspace_2.0-1    withr_2.4.2         gridExtra_2.3      
## [16] tidyselect_1.1.1    prettyunits_1.1.1   processx_3.5.2     
## [19] curl_4.3.1          compiler_4.0.5      cli_2.5.0          
## [22] rvest_1.0.0         xml2_1.3.2          desc_1.3.0         
## [25] labeling_0.4.2      bookdown_0.22       sass_0.4.0         
## [28] scales_1.1.1        callr_3.7.0         systemfonts_1.0.2  
## [31] stringr_1.4.0       digest_0.6.27       foreign_0.8-81     
## [34] rmarkdown_2.9       svglite_2.0.0       rio_0.5.27         
## [37] pkgconfig_2.0.3     htmltools_0.5.1.1   sessioninfo_1.1.1  
## [40] highr_0.9           fastmap_1.1.0       rlang_0.4.11       
## [43] readxl_1.3.1        rstudioapi_0.13     farver_2.1.0       
## [46] shape_1.4.6         jquerylib_0.1.4     generics_0.1.0     
## [49] jsonlite_1.7.2      zip_2.2.0           car_3.0-10         
## [52] magrittr_2.0.1      Rcpp_1.0.6          munsell_0.5.0      
## [55] fansi_0.5.0         abind_1.4-5         lifecycle_1.0.0    
## [58] stringi_1.6.2       yaml_2.2.1          carData_3.0-4      
## [61] pkgbuild_1.2.0      grid_4.0.5          forcats_0.5.1      
## [64] crayon_1.4.1        lattice_0.20-44     cowplot_1.1.1      
## [67] haven_2.4.1         splines_4.0.5       hms_1.1.0          
## [70] magick_2.7.2        ps_1.6.0            pillar_1.6.1       
## [73] ggsignif_0.6.2      pkgload_1.2.1       codetools_0.2-18   
## [76] glue_1.4.2          evaluate_0.14       remotes_2.4.0      
## [79] data.table_1.14.0   BiocManager_1.30.16 vctrs_0.3.8        
## [82] foreach_1.5.1       testthat_3.0.3      cellranger_1.1.0   
## [85] gtable_0.3.0        purrr_0.3.4         tidyr_1.1.3        
## [88] assertthat_0.2.1    cachem_1.0.5        xfun_0.24          
## [91] openxlsx_4.2.4      broom_0.7.7         rstatix_0.7.0      
## [94] survival_3.2-11     viridisLite_0.4.0   tibble_3.1.2       
## [97] iterators_1.0.13    memoise_2.0.0       ellipsis_0.3.2