Stepwise Search

Mathematical Background

A common problem in building statistical models is determining which features to include in a model. Mathematical publications provide some suggestions, but there is no consensus. Some examples are the lasso or simply trying all possible combinations of predictors. Another option is stepwise search.

The more parameters a model has, the better it will fit the data. If the model is too complex, the worse it will perform on unseen data. AIC strikes a balance between fitting the training data well and keeping the model simple.

Using AIC, a search starts with no features. \[g(Y) = \beta_0\] Then each feature is considered. If there are 10 features, there are 10 models under consideration. For each model, AIC is calculated and the model with the lowest AIC is selected. In this case, X1 was selected. \[g(Y) = \beta_1X_1 + \beta_0\]

After the first feature is selected, all remaining 9 features are considered. Of the 9 features, the one with the lowest AIC is selected, creating a 2 feature model. In this round, X3 was selected. \[g(Y) = \beta_3X_3 + \beta_1X_1 + \beta_0\]

When adding more features does not improve AIC, the procedure stops.

Simulation Setup

How well does stepwise search work when there are unrelated variables? Is a large amount of data needed to find the correct variables? The below tests stepwise search in a variety of settings to answer these questions.

Easy Problem: Large N and half the variables are unrelated

library(GlmSimulatoR)
library(MASS)

#Creating data to work with
set.seed(1)
simdata <- simulate_inverse_gaussian(N = 100000, link = "1/mu^2", 
                                     weights = c(1, 2, 3), unrelated = 3)

#Setting the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3
)

#Run search
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg, trace = 0)

summary(glmSearch)
#> 
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1, family = inverse.gaussian(link = "1/mu^2"), 
#>     data = simdata)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.68546  -0.47420  -0.08869   0.29790   2.36148  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.81196    0.20842   13.49   <2e-16 ***
#> X3           3.03192    0.08116   37.36   <2e-16 ***
#> X2           2.05731    0.08100   25.40   <2e-16 ***
#> X1           1.02594    0.08066   12.72   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3335593)
#> 
#>     Null deviance: 34005  on 99999  degrees of freedom
#> Residual deviance: 33270  on 99996  degrees of freedom
#> AIC: -212019
#> 
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

Looking at the summary, the correct model was found. Stepwise search worked perfectly!

Medium Problem: Large N and most variables are unrelated

#Creating data to work with
set.seed(2)
simdata <- simulate_inverse_gaussian(N = 100000, link = "1/mu^2", 
                                     weights = c(1, 2, 3), unrelated = 20)
#Setting the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3 + Unrelated3 + 
    Unrelated4 + Unrelated5 + Unrelated6 + Unrelated7 + Unrelated8 + Unrelated9 + 
    Unrelated10 + Unrelated11 + Unrelated12 + Unrelated13 + Unrelated14 + Unrelated15 + 
    Unrelated16 + Unrelated17 + Unrelated18 + Unrelated19 + Unrelated20
)

#Run search
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg, trace = 0)

summary(glmSearch)
#> 
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14 + 
#>     Unrelated9 + Unrelated19, family = inverse.gaussian(link = "1/mu^2"), 
#>     data = simdata)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.58072  -0.46825  -0.08547   0.29830   2.38608  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.09440    0.34079   9.080  < 2e-16 ***
#> X3           3.02170    0.08106  37.276  < 2e-16 ***
#> X2           1.95173    0.08091  24.121  < 2e-16 ***
#> X1           1.03102    0.08070  12.776  < 2e-16 ***
#> Unrelated4   0.21602    0.08082   2.673  0.00752 ** 
#> Unrelated18 -0.13564    0.08087  -1.677  0.09347 .  
#> Unrelated14  0.13661    0.08080   1.691  0.09092 .  
#> Unrelated9  -0.13462    0.08079  -1.666  0.09567 .  
#> Unrelated19 -0.13306    0.08085  -1.646  0.09982 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3318067)
#> 
#>     Null deviance: 34006  on 99999  degrees of freedom
#> Residual deviance: 33294  on 99991  degrees of freedom
#> AIC: -212459
#> 
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

Some unrelated variables made it into the final model. At least all related features are in the model.

Hard Problem: Small N and half the variables are unrelated

#Creating data to work with
set.seed(3)
simdata <- simulate_inverse_gaussian(N = 1000, link = "1/mu^2", 
                                     weights = c(1, 2, 3), unrelated = 3)
#Setting the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3
)

#Run search
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg, trace = 0)

summary(glmSearch)
#> 
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1, family = inverse.gaussian(link = "1/mu^2"), 
#>     data = simdata)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -1.75902  -0.49421  -0.08637   0.30463   1.85664  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)   2.8867     2.1940   1.316  0.18857   
#> X3            2.7355     0.8309   3.292  0.00103 **
#> X2            1.8694     0.8490   2.202  0.02791 * 
#> X1            1.5190     0.8304   1.829  0.06766 . 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3466137)
#> 
#>     Null deviance: 348.02  on 999  degrees of freedom
#> Residual deviance: 341.58  on 996  degrees of freedom
#> AIC: -2104.6
#> 
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

The correct model was found. Again, stepwise search worked perfectly!

Very Hard Problem: Small N and most variables are unrelated

#Creating data to work with
set.seed(4)
simdata <- simulate_inverse_gaussian(N = 1000, link = "1/mu^2", 
                                     weights = c(1, 2, 3), unrelated = 20)
#Setting the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3 + Unrelated3 + 
    Unrelated4 + Unrelated5 + Unrelated6 + Unrelated7 + Unrelated8 + Unrelated9 + 
    Unrelated10 + Unrelated11 + Unrelated12 + Unrelated13 + Unrelated14 + Unrelated15 + 
    Unrelated16 + Unrelated17 + Unrelated18 + Unrelated19 + Unrelated20
)

#Run search
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg, trace = 0)

summary(glmSearch)
#> 
#> Call:
#> glm(formula = Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19 + Unrelated15 + 
#>     Unrelated20, family = inverse.gaussian(link = "1/mu^2"), 
#>     data = simdata)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.36149  -0.48517  -0.09361   0.29985   1.65156  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   3.2105     3.3206   0.967 0.333856    
#> X3            3.0476     0.8384   3.635 0.000292 ***
#> X2            1.7018     0.8362   2.035 0.042100 *  
#> Unrelated8   -1.4772     0.8320  -1.776 0.076117 .  
#> X1            1.5461     0.8362   1.849 0.064750 .  
#> Unrelated19   1.3527     0.8358   1.618 0.105881    
#> Unrelated15  -1.3127     0.8584  -1.529 0.126535    
#> Unrelated20   1.2959     0.8523   1.520 0.128706    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3392188)
#> 
#>     Null deviance: 345.03  on 999  degrees of freedom
#> Residual deviance: 334.55  on 992  degrees of freedom
#> AIC: -2142.8
#> 
#> Number of Fisher Scoring iterations: 5

rm(simdata, scopeArg, glmSearch, startingModel)

A few unrelated features made it into the model, but at least all true predictors were selected.