A Robust And Generalized Variable Selection Method (RAEN) For High Dimensional Competing Risk Analysis

Introduction

We propose a novel method, the Random Approximate Elastic Net (RAEN), with a robust and generalized solution to the variable selection problem focused on the competing risk analysis. RAEN is composed of three stages. (1) split variables. We randomly split the high dimensional data set into a number of lower dimensional, low correlation subsets with a de-correlation splitting algorithm. (2) prescreen and estimate variable importance. For each subgroup of variables, a penalized model is fit by minimizing the least square approximation of an elastic net objective function to many bootstrap samples. We screen relevant variables from each of the subgroups based on the bootstrap aggregation. A measure of importance is yielded from this step for each selected variable. (3) merge, select and estimate variables. We merge the screened variables from step 2 to one data set and perform another bootstrap aggregated penalized model fitting. Although we focus on competing risks analysis, the approach proposed can be applied to most of common regression models, including generalized linear models, Cox regression, quantile regression, and many others as special cases. Our algorithm naturally has a parallel structure, thus it can be easily implemented in a parallel architecture and applied to high or ultra-high dimensional problems with tens of thousands or more of predictors.

Installation

RAEN can be installed from R-CRAN

install.packages('RAEN')

Users can install the developmental version from Github.

library(devtools)
install_github('saintland/RAEN')

Splitting correlated variables

The simulated data toydata contains 200 rows, time to event, censoring status, and 1000 predictors, 20 of which are true predictors (\(X1-X20\)). The variable correlation blocks are identified as the following example.

require(RAEN,quietly = T)
#> Loaded lars 1.2
data(toydata)
x=toydata[,-c(1:2)]
y=toydata[,1:2]
fgrp<-deCorr(x)
#> 
#> No: 1 cluster contains : 10 , remaining 990 
#> 
#> No: 2 cluster contains : 9 , remaining 981 
#> 
#> No: 3 cluster contains : 9 , remaining 972 
#> 
#> No: 4 cluster contains : 4 , remaining 968

RAEN discovers 5 groups of correlated variables, with group sizes ranging from 2 to 19. Users can either determine the number of subgroups (ngrp) to separate the correlated variables, or use the default setting \(15p/n\). The correlated variables and the remaining weakly correlated variables will be separated randomly into different subsets or blocks, such that each subset is weakly correlated. The benefit of this de-correlation is that the penalization based variable selection method (like Lasso) will not randomly drop correlated variables in such subsets. The variable selection is executed via the functions grpselect and r2select. Users can call the main function RAEN to run the whole process,

library(RAEN)
myres<-RAEN(x,y, B = 50, family='competing')

where x is the predictor matrix of \(n \times p\), y is the time and censoring status data frame, and ncore is the number of threads to use for parallel processing. The selected variables and the regression coefficients are returned.

  id        coef
x1  -0.55913895
x2  -0.45441885
x3  -0.66547988
x4  -0.2042797
x5  -0.51010548
...
...
x18 -0.84730951
x19 -1.02583733
x20 -0.98603833
...