Just like most functions in R, all functions in civis block. This means that each function in a program must complete before the next function runs. For instance,
nap <- function(seconds) {
  Sys.sleep(seconds)
}
start <- Sys.time()
nap(1)
nap(2)
nap(3)
end <- Sys.time()
print(end - start)
This program takes 6 seconds to complete, since it takes 1 second for the first nap, 2 for the second, and 3 for the last. This program is easy to reason about because each function is executed sequentially. Usually, that is how we want our programs to run.
There are some exceptions to this rule. Executing each function sequentially might be inconvenient if each nap took 30 minutes instead of a few seconds. In that case, we might like our program to perform all 3 naps simultaneously. In the above example, running all 3 naps simultaneously would take 3 seconds (the length of the longest nap) rather than 6 seconds.
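As a minimal sketch of that idea, the three naps can be run in separate R processes using a socket cluster from the parallel package. The cluster size of 3 and the choice of parLapply are assumptions made for illustration, and nap is changed to return its argument so the results can be inspected. The elapsed time should be close to 3 seconds, plus some overhead for communicating with the workers.

```r
library(parallel)

# Same nap as above, but returning its argument so we can check the results.
nap <- function(seconds) {
  Sys.sleep(seconds)
  seconds
}

# One worker process per nap (assumption: 3 workers for 3 naps).
cluster <- makeCluster(3)

start <- Sys.time()
# Each worker receives one nap, so the naps run at the same time.
results <- parLapply(cluster, c(1, 2, 3), nap)
end <- Sys.time()

stopCluster(cluster)

elapsed <- as.numeric(difftime(end, start, units = "secs"))
print(elapsed)
```

The timer starts after makeCluster, so the cost of launching the worker processes is not counted.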
Because all function calls in civis block, civis relies on the mature R ecosystem for parallel programming to enable multiple simultaneous tasks. The three packages we introduce are future, foreach, and parallel (included in base R). For all packages, simultaneous tasks are enabled by starting each task in a separate R process. Examples for building several models in parallel with different libraries are included below. The libraries have strengths and weaknesses, and choosing which library to use is often a matter of preference.
It is important to note that when calling civis functions, the computation required to complete the task takes place in Platform. For instance, during a call to civis_ml, Platform builds the model while your laptop waits for the task to complete. This means that you don't have to worry about running out of memory or CPU cores on your laptop when training dozens of models, or when scoring a model on a very large population. The task being parallelized in the code below is simply the task of waiting for Platform to send results back to your laptop.
future
library(future)
library(civis)
# Define a concurrent backend with enough processes so each function
# we want to run concurrently has its own process. Here we'll need at least 2.
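# ("multiprocess" is deprecated in recent versions of future; multisession
# starts background R sessions and works on all operating systems.)
plan(multisession, workers = 10)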
# Load data
data(iris)
data(airquality)
airquality <- airquality[!is.na(airquality$Ozone),] # remove missing in dv
# Create a future for each model, using the special %<-% assignment operator.
# These futures are created immediately, kicking off the models.
air_model %<-% civis_ml(airquality, "Ozone", "gradient_boosting_regressor")
iris_model %<-% civis_ml(iris, "Species", "sparse_logistic")
# At this point, `air_model` has not finished training yet. That's okay,
# the program will just wait until `air_model` is done before printing it.
print("airquality R^2:")
print(air_model$metrics$metrics$r_squared)
print("iris ROC:")
print(iris_model$metrics$metrics$roc_auc)
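The same behaviour can be seen without a Platform account by using explicit futures. The sketch below is illustrative only: Sys.sleep stands in for a blocking call such as civis_ml, and the names slow_a and slow_b are hypothetical. Both futures start as soon as they are created, and value() blocks until the corresponding result is available, so the total elapsed time should be closer to 2 seconds than to 4.

```r
library(future)

# Two background R sessions, one per future.
plan(multisession, workers = 2)

start <- Sys.time()

# Sys.sleep() stands in for a blocking call such as civis_ml().
slow_a <- future({ Sys.sleep(2); "a finished" })
slow_b <- future({ Sys.sleep(2); "b finished" })

# value() blocks until the corresponding future has resolved.
result_a <- value(slow_a)
result_b <- value(slow_b)

elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
print(c(result_a, result_b))
print(elapsed)

# Return to sequential evaluation, which also shuts down the workers.
plan(sequential)
```

The %<-% operator used above creates the same kind of future implicitly and calls value() for you the first time the variable is read.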
foreach
library(parallel)
library(doParallel)
library(foreach)
library(civis)
# Register a local cluster with enough processes so each function
# we want to run concurrently has its own process. Here we'll
# need at least 3, with 1 for each model_type in model_types.
cluster <- makeCluster(10)
registerDoParallel(cluster)
# Model types to build
model_types <- c("sparse_logistic",
                 "gradient_boosting_classifier",
                 "random_forest_classifier")
# Load data
data(iris)
# Listen for multiple models to complete concurrently
model_results <- foreach(model_type = iter(model_types), .packages = "civis") %dopar% {
  civis_ml(iris, "Species", model_type)
}
stopCluster(cluster)
print("ROC Results")
lapply(model_results, function(result) result$metrics$metrics$roc_auc)
mcparallel
Note: mcparallel and mclapply, which is used below, rely on forking and thus are not available on Windows.
library(civis)
library(parallel)
# Model types to build
model_types <- c("sparse_logistic",
                 "gradient_boosting_classifier",
                 "random_forest_classifier")
# Load data
data(iris)
# Loop over all models in parallel with a max of 10 processes
model_results <- mclapply(model_types, function(model_type) {
  civis_ml(iris, "Species", model_type)
}, mc.cores = 10)
# mclapply blocks until every model has finished, so all results are ready here
print("ROC Results")
lapply(model_results, function(result) result$metrics$metrics$roc_auc)
Differences in operating systems and R environments may cause errors for some users of the parallel libraries listed above. In particular, mclapply does not work on Windows and may not work in RStudio on certain operating systems. future may require plan(multisession) on certain operating systems. If you encounter an error parallelizing functions in civis, we recommend first trying more than one method listed above. While we will address errors specific to civis with regard to parallel code, the technicalities of parallel libraries in R across operating systems and environments prevent us from providing more general support for issues regarding parallelized code in R.
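One way to try more than one method without rewriting a script is to choose the backend at run time. The helper below is a hypothetical sketch, not part of civis: run_parallel is an illustrative name, and the rule of using a socket cluster on Windows and forking elsewhere is an assumption that may not suit every environment (for example, forking inside RStudio). Sys.sleep again stands in for a blocking call.

```r
library(parallel)

# Hypothetical helper: apply `fun` to each element of `xs` in parallel,
# using forking where available and a socket cluster on Windows.
run_parallel <- function(xs, fun, workers = 2) {
  if (.Platform$OS.type == "windows") {
    cluster <- makeCluster(workers)
    # Make sure the workers are shut down even if `fun` fails.
    on.exit(stopCluster(cluster))
    parLapply(cluster, xs, fun)
  } else {
    mclapply(xs, fun, mc.cores = workers)
  }
}

# Stand-in for a blocking call: sleep for a second, then return a value.
squares <- run_parallel(1:3, function(x) { Sys.sleep(1); x^2 }, workers = 3)
print(unlist(squares))
```

With a socket cluster, any packages the function needs (such as civis) must be loaded inside the function or referenced with the package prefix, since each worker starts as a fresh R session.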