Introduction to glmnetUtils

The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also features a function cva.glmnet to do crossvalidation for both \(\alpha\) and \(\lambda\), as well as some utility functions.

The formula interface

The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here are some simple examples for different types of data:

# least squares regression
(mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars))

## Call:
## glmnetUtils:::glmnet.formula(formula = mpg ~ cyl + disp + hp, 
##     data = mtcars)
## 
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha: 1
##     Lambda summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03326 0.11690 0.41003 1.02839 1.44125 5.05505

# multinomial logistic regression with specified elastic net alpha parameter
(irisMod <- glmnet(Species ~ ., data=iris, family="multinomial", alpha=0.5))

## Call:
## glmnetUtils:::glmnet.formula(formula = Species ~ ., data = iris, 
##     family = "multinomial", alpha = 0.5)
## 
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha: 0.5
##     Lambda summary:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000870 0.0008707 0.0087093 0.0979220 0.0870709 0.8699915

# Poisson regression with an offset
(InsMod <- glmnet(Claims ~ District + Group + Age, data=MASS::Insurance,
                  family="poisson", offset=log(Holders)))

## Call:
## glmnetUtils:::glmnet.formula(formula = Claims ~ District + Group + 
##     Age, data = MASS::Insurance, family = "poisson", offset = log(Holders))
## 
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha: 1
##     Lambda summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02877 0.11613 0.46883 1.40515 1.89269 7.64083

Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. A simple print method is also provided, to show the main model details at a glance. I’ll describe shortly what the “sparse model matrix” and “use model.frame” options do.

Predicting from a model works as you’d expect: just pass a data frame containing the new observations to the predict method. You can also specify any arguments that predict.glmnet accepts.

# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)

# multinomial logistic regression: get predicted class
predict(irisMod, newdata=iris, type="class")

# Poisson regression: need to specify offset
predict(InsMod, newdata=MASS::Insurance, offset=log(Holders))

If you want, you can still use the original model matrix-plus-response syntax:

mtcarsX <- as.matrix(mtcars[c("cyl", "disp", "hp")])
mtcarsY <- mtcars$mpg
mtcarsMod2 <- glmnet(mtcarsX, mtcarsY)

summary(as.numeric(predict(mtcarsMod, mtcars) - 
                   predict(mtcarsMod2, mtcarsX)))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0

This shows that the resulting models are identical, in terms of the predictions they make and the regularisation parameters used.

Generating the model matrix

There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; and the second is to build the matrix one variable at a time.

Using `model.frame`

This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.

The major disadvantage of using model.frame is that it generates a terms object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a \(p \times p\) square matrix where \(p\) is the number of main effects in the model. For wide datasets with \(p > 10000\), this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can be very slow.

Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an \(N\)-level factor into an indicator matrix with \(N-1\) columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of \(N\) columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.

This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.

Manually building the model matrix

To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).

This works in an additive fashion, ie the formula ~ a + b:c + d*e is treated as consisting of three terms, a, b:c and d*e each of which is processed independently of the others. A dot in the formula includes all main effect terms, ie ~ . + a:b + f(x) expands to ~ a + b + x + a:b + f(x) (assuming a, b and x are the only columns in the data). Note that a formula like ~ (a + b) + (c + d) will be treated as two terms, a + b and c + d.

Speed comparisons

To examine the speed impact of using model.frame, let’s do some simple comparisons of run times. We’ll generate sample datasets with 100, 1,000 and 10,000 predictors, and then run glmnet with both options for generating the model matrix.

# generate sample (uncorrelated) data of a given size
makeSampleData <- function(N, P)
{
    X <- matrix(rnorm(N*P), nrow=N)
    data.frame(y=rnorm(N), X)
}

# test for three sizes: 100/1000/10000 predictors
df1 <- makeSampleData(N=1000, P=100)
df2 <- makeSampleData(N=1000, P=1000)
df3 <- makeSampleData(N=1000, P=10000)

library(microbenchmark)
res <- microbenchmark(
    glmnet(y ~ ., df1, use.model.frame=TRUE), 
    glmnet(y ~ ., df1, use.model.frame=FALSE), 
    glmnet(y ~ ., df2, use.model.frame=TRUE), 
    glmnet(y ~ ., df2, use.model.frame=FALSE), 
    glmnet(y ~ ., df3, use.model.frame=TRUE), 
    glmnet(y ~ ., df3, use.model.frame=FALSE),
    times=10 
)
print(res, unit="s", digits=2)

## Unit: seconds
##                                         expr    min     lq   mean median     uq    max neval
##   glmnet(y ~ ., df1, use.model.frame = TRUE)  0.024  0.024  0.027  0.025  0.029  0.032    10
##  glmnet(y ~ ., df1, use.model.frame = FALSE)  0.023  0.026  0.029  0.026  0.028  0.051    10
##   glmnet(y ~ ., df2, use.model.frame = TRUE)  3.703  3.916  4.258  4.272  4.428  5.153    10
##  glmnet(y ~ ., df2, use.model.frame = FALSE)  3.756  3.874  4.291  4.352  4.561  5.073    10
##   glmnet(y ~ ., df3, use.model.frame = TRUE) 11.973 12.353 13.262 13.350 13.864 14.622    10
##  glmnet(y ~ ., df3, use.model.frame = FALSE)  4.295  4.639  4.992  4.822  5.060  6.111    10

From this, we can see that for datasets with up to 1,000 predictors, both methods are about as fast as each other. However, for 10,000 predictors (not uncommon these days), the model.frame method takes three times as long as building the model matrix term by term.

What happens if we take it up to 100,000 predictors? As seen below, the standard approach of using model.frame fails when R runs out of memory: the data frame itself is about 800MB in size, but trying to allocate the terms object requires more then 67GB. However, building the model matrix by term still works, and (on this machine) finishes in about two minutes.

df4 <- makeSampleData(N=1000, P=100000)

glmnet(y ~ ., df4, use.model.frame=TRUE)

## Warning in terms.formula(formula, data = data): Reached total allocation of
## 32666Mb: see help(memory.size)

## Error: cannot allocate vector of size 37.3 Gb

glmnet(y ~ ., df4, use.model.frame=FALSE)

## Call:
## glmnet.formula(formula = y ~ ., data = df4, use.model.frame = FALSE)
## 
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha: 1
##     Lambda summary:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002393 0.006506 0.017680 0.032470 0.048080 0.130700

Sparse model matrix

As an option, glmnetUtils can also generate a sparse model matrix, using the sparse.model.matrix function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels. This can be combined with both of the previously mentioned options for generating model matrices.

Crossvalidation for \(\alpha\)

One piece missing from the standard glmnet package is a way of choosing \(\alpha\), the elastic net mixing parameter, similar to how cv.glmnet chooses \(\lambda\), the shrinkage parameter. To fix this, glmnetUtils provides the cva.glmnet function, which uses crossvalidation to examine the impact on the model of changing \(\alpha\) and \(\lambda\). The interface is the same as for the other functions:

# Leukemia dataset from Trevor Hastie's website:
# https://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
leuk <- do.call(data.frame, Leukemia)

leukMod <- cva.glmnet(y ~ ., data=leuk, family="binomial")
leukMod

## Call:
## cva.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
## 
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
##     Number of crossvalidation folds for lambda: 10

cva.glmnet uses the algorithm described in the help for cv.glmnet, which is to fix the distribution of observations across folds and then call cv.glmnet in a loop with different values of \(\alpha\). Optionally, you can parallelise this outer loop, by setting the outerParallel argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:

Via parLapply in the parallel package. To use this, set outerParallel to a valid cluster object created by makeCluster.
Via rxExec as supplied by Microsoft R Server’s RevoScaleR package. To use this, set outerParallel to a valid compute context created by RxComputeContext, or a character string specifying such a context.

If the outer loop is run in parallel, cva.glmnet can check if the inner loop (over \(\lambda\)) is also set to run in parallel, and disable this if it would lead to contention for cores.

Because crossvalidation is often a statistically noisy procedure, it doesn’t try to automatically choose \(\alpha\) and \(\lambda\) for you. Instead you can plot the output, to see how the results depend on the values of these parameters. Using this information, you can choose appropriate values for your data.

plot(leukMod)

In this case, we see that values of \(\alpha\) close to \(1\) tend to lead to better accuracy. The curves don’t have a well-defined minimum, but they do flatten out for lower values of \(\lambda\). As the cv.glmnet documentation recommends though, it’s a good idea to run cva.glmnet multiple times to reduce the impact of noise.

A cva.glmnet object contains a list of individual cv.glmnet objects, corresponding to the different \(\alpha\) values tried. This lets you plot the crossvalidation results easily for a given \(\alpha\):

plot(leukMod$modlist[[10]])  # alpha = 0.729