library(butcher)
library(parsnip)
One of the beauties of working with R is the ease with which you can implement intricate models and make challenging data analysis pipelines seem almost trivial. Take, for example, the parsnip package; with the installation of a few associated libraries and a few lines of code, you can fit something as complex as a boosted tree:
library(rpart)
fitted_model <- boost_tree(trees = 15) %>%
  set_engine("C5.0") %>%
  fit(as.factor(am) ~ disp + hp, data = mtcars)
Or, say you are working with petabytes of data distributed across many nodes. In that case, just switch out the parsnip engine:
library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbls <- sdf_copy_to(sc, mtcars[, c("am", "disp", "hp")])
fitted_model <- boost_tree(trees = 15) %>%
  set_engine("spark") %>%
  fit(am ~ disp + hp, data = mtcars_tbls)
Yet, while our code may appear compact, the underlying fitted result may not be. Since parsnip works as a wrapper for many modeling packages, its fitted model objects inherit the same properties as those that arise from the original modeling package. A straightforward example is the popular lm function from the base stats package. Whether you leverage parsnip or not, you arrive at the same result:
parsnip_lm <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = mtcars)
parsnip_lm
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = mpg ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) cyl disp hp drat wt
#> 12.30337 -0.11144 0.01334 -0.02148 0.78711 -3.71530
#> qsec vs am gear carb
#> 0.82104 0.31776 2.52023 0.65541 -0.19942
Using just lm:
old_lm <- lm(mpg ~ ., data = mtcars)
old_lm
#>
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#>
#> Coefficients:
#> (Intercept) cyl disp hp drat wt
#> 12.30337 -0.11144 0.01334 -0.02148 0.78711 -3.71530
#> qsec vs am gear carb
#> 0.82104 0.31776 2.52023 0.65541 -0.19942
Let's say we take this familiar old_lm approach in building our in-house modeling pipeline. Such a pipeline might entail wrapping lm() in another function, but in doing so, we may end up carrying some junk.
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars)
}
The linear model fit that comes out of our pipeline has the following size:
library(lobstr)
obj_size(in_house_model())
#> 8,022,440 B
Yet it is fundamentally the same model as our old_lm, which only takes up:
obj_size(old_lm)
#> 22,224 B
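As a side check, the same gap can be seen with base R alone. Base R's object.size() does not count the environment attached to a formula, whereas serializing the model (which is what saving to disk involves) pulls that environment along. The following is a minimal sketch, in which make_model() and its junk vector are hypothetical stand-ins for the pipeline above, using a smaller vector and a simpler formula:

```r
# Hypothetical stand-in for in_house_model(), with a smaller junk vector.
make_model <- function() {
  junk <- runif(1e5)  # never used by the model; about 800 KB of doubles
  lm(mpg ~ wt, data = mtcars)
}

fit <- make_model()

# object.size() skips environments, so the junk is invisible here.
in_memory_estimate <- as.numeric(object.size(fit))

# serialize() follows the captured environment, as saveRDS() would.
serialized_bytes <- length(serialize(fit, NULL))

c(object_size = in_memory_estimate, serialized = serialized_bytes)
```

This is one reason a size measure that follows environments is the more informative choice when the goal is to save a model to disk.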
Ideally, we want to avoid saving this new in_house_model() on disk, when we could have something like old_lm that takes up less memory. So, what the heck is going on here? We can examine possible issues with a fitted model object using the butcher package:
big_lm <- in_house_model()
butcher::weigh(big_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#> object size
#> <chr> <dbl>
#> 1 terms 8.01
#> 2 qr.qr 0.00666
#> 3 residuals 0.00286
#> 4 fitted.values 0.00286
#> 5 effects 0.0014
#> 6 coefficients 0.00109
#> 7 call 0.000728
#> 8 model.mpg 0.000304
#> 9 model.cyl 0.000304
#> 10 model.disp 0.000304
#> # … with 15 more rows
The problem here is in the terms component of big_lm. Because lm, as implemented in the base stats package, relies on intermediate forms of the data from the model.frame and model.matrix output, the environment in which the linear fit was created is carried along in the model output. We can see this with the env_print function from the rlang package:
library(rlang)
env_print(big_lm$terms)
#> <environment: 0x7fb365ac4b98>
#> Parent: <environment: global>
#> Bindings:
#> • some_junk_in_the_environment: <dbl>
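The mechanism is not specific to this pipeline: any formula remembers the environment in which it was created, and the terms of the fitted model keep that reference. A minimal base R sketch of this behavior follows; make_formula() and junk are hypothetical names used only for illustration:

```r
# Hypothetical helper: the formula is created inside a function call.
make_formula <- function() {
  junk <- runif(10)  # an unrelated local variable
  mpg ~ wt
}

f <- make_formula()

# The formula keeps a reference to the function's execution environment,
# so the local variable is still reachable through it.
captured <- environment(f)
ls(captured)

# lm() stores that same environment on the terms of the fitted model.
fit <- lm(f, data = mtcars)
identical(attr(fit$terms, ".Environment"), captured)
```

Anything else that happened to live in that environment when the formula was created travels with the model in the same way.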
To avoid carrying possible junk in our production pipeline, whether it comes from an lm model or something more complex, we can leverage axe_env() within the butcher package. In other words,
cleaned_lm <- butcher::axe_env(big_lm, verbose = TRUE)
Comparing it against our old_lm, we find:
butcher::weigh(cleaned_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#> object size
#> <chr> <dbl>
#> 1 terms 0.00789
#> 2 qr.qr 0.00666
#> 3 residuals 0.00286
#> 4 fitted.values 0.00286
#> 5 effects 0.0014
#> 6 coefficients 0.00109
#> 7 call 0.000728
#> 8 model.mpg 0.000304
#> 9 model.cyl 0.000304
#> 10 model.disp 0.000304
#> # … with 15 more rows
…it now takes up nearly the same amount of memory as old_lm:
butcher::weigh(old_lm, threshold = 0, units = "MB")
#> # A tibble: 25 × 2
#> object size
#> <chr> <dbl>
#> 1 terms 0.00781
#> 2 qr.qr 0.00666
#> 3 residuals 0.00286
#> 4 fitted.values 0.00286
#> 5 effects 0.0014
#> 6 coefficients 0.00109
#> 7 call 0.000728
#> 8 model.mpg 0.000304
#> 9 model.cyl 0.000304
#> 10 model.disp 0.000304
#> # … with 15 more rows
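To make the idea behind axing an environment more concrete, the base R sketch below repoints the environment attribute on the model terms and compares serialized sizes before and after. It is only a rough approximation under our own assumptions, not butcher's actual implementation: make_model() and strip_env() are hypothetical names, and the base environment is used here as a neutral stand-in.

```r
# Hypothetical stand-in for a pipeline function that leaves junk behind.
make_model <- function() {
  junk <- runif(1e5)
  lm(mpg ~ wt, data = mtcars)
}

# Hypothetical helper: point the terms at the base environment.
# An lm fit keeps its terms in two places: fit$terms and an
# attribute of the stored model frame, so both are updated.
strip_env <- function(fit) {
  attr(fit$terms, ".Environment") <- baseenv()
  attr(attr(fit$model, "terms"), ".Environment") <- baseenv()
  fit
}

fit <- make_model()
slim <- strip_env(fit)

# Serialized size approximates what would be written to disk.
size_before <- length(serialize(fit, NULL))
size_after <- length(serialize(slim, NULL))
c(before = size_before, after = size_after)

# Predictions still agree, since the predictors come from newdata.
all.equal(predict(fit, mtcars), predict(slim, mtcars))
```

In practice axe_env() is the safer choice, since it knows where each supported model class stores its environments.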
Axing the environment, however, is not the only functionality of butcher. This package provides five S3 generics:
axe_call(): Remove the call object.
axe_ctrl(): Remove the controls fixed for training.
axe_data(): Remove the original data.
axe_env(): Replace inherited environments with empty environments.
axe_fitted(): Remove fitted values.
In our case here with lm, if we are only interested in prediction as the end product of our modeling pipeline, we could free up a lot of memory by executing all the possible axe functions at once. To do so, we simply run butcher():
butchered_lm <- butcher::butcher(big_lm)
predict(butchered_lm, mtcars[, 2:11])
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 22.59951 22.11189 26.25064 21.23740
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 17.69343 20.38304 14.38626 22.49601
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 24.41909 18.69903 19.19165 14.17216
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 15.59957 15.74222 12.03401 10.93644
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 10.49363 27.77291 29.89674 29.51237
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 23.64310 16.94305 17.73218 13.30602
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 16.69168 28.29347 26.15295 27.63627
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 18.87004 19.69383 13.94112 24.36827
Alternatively, we can pick and choose specific axe functions, removing only those parts of the model object that we are no longer interested in characterizing.
butchered_lm <- big_lm %>%
  butcher::axe_env() %>%
  butcher::axe_fitted()
predict(butchered_lm, mtcars[, 2:11])
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> 22.59951 22.11189 26.25064 21.23740
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> 17.69343 20.38304 14.38626 22.49601
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> 24.41909 18.69903 19.19165 14.17216
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> 15.59957 15.74222 12.03401 10.93644
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> 10.49363 27.77291 29.89674 29.51237
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> 23.64310 16.94305 17.73218 13.30602
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> 16.69168 28.29347 26.15295 27.63627
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> 18.87004 19.69383 13.94112 24.36827
butcher makes it easy to axe parts of the fitted output that are no longer needed, without sacrificing much functionality from the original model object.