Common Resampling Patterns

The rsample package provides a number of resampling methods which are broadly applicable to a wide variety of modeling applications. This vignette walks through the most popular methods in the package, with brief descriptions of how they can be applied. For a more in-depth overview of resampling, check out the matching chapters in Tidy Modeling with R and Feature Engineering and Selection.

Let’s go ahead and load rsample now:

library(rsample)

As well as dplyr, for the pipe operator %>%:

library(dplyr)

We’ll also load in a few data sets from the modeldata package. First, the Ames housing data, containing the sale prices of homes in Ames, Iowa:

data(ames, package = "modeldata")
head(ames, 2)
#> # A tibble: 2 × 74
#>   MS_SubC…¹ MS_Zo…² Lot_F…³ Lot_A…⁴ Street Alley Lot_S…⁵ Land_…⁶ Utili…⁷ Lot_C…⁸
#>   <fct>     <fct>     <dbl>   <int> <fct>  <fct> <fct>   <fct>   <fct>   <fct>  
#> 1 One_Stor… Reside…     141   31770 Pave   No_A… Slight… Lvl     AllPub  Corner 
#> 2 One_Stor… Reside…      80   11622 Pave   No_A… Regular Lvl     AllPub  Inside 
#> # … with 64 more variables: Land_Slope <fct>, Neighborhood <fct>,
#> #   Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> #   Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> #   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
#> #   Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
#> #   Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
#> #   BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, …
#> # ℹ Use `colnames()` to see all variable names

Second, data on Chicago transit ridership numbers:

data(Chicago, package = "modeldata")
head(Chicago, 2)
#> # A tibble: 2 × 50
#>   rider…¹ Austin Quinc…² Belmont Arche…³ Oak_P…⁴ Western Clark…⁵ Clinton Merch…⁶
#>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1    15.7   1.46    8.37    4.60    2.01    1.42    3.32    15.6    2.40    6.48
#> 2    15.8   1.50    8.35    4.72    2.09    1.43    3.34    15.7    2.40    6.48
#> # … with 40 more variables: Irving_Park <dbl>, Washington_Wells <dbl>,
#> #   Harlem <dbl>, Monroe <dbl>, Polk <dbl>, Ashland <dbl>, Kedzie <dbl>,
#> #   Addison <dbl>, Jefferson_Park <dbl>, Montrose <dbl>, California <dbl>,
#> #   temp_min <dbl>, temp <dbl>, temp_max <dbl>, temp_change <dbl>, dew <dbl>,
#> #   humidity <dbl>, pressure <dbl>, pressure_change <dbl>, wind <dbl>,
#> #   wind_max <dbl>, gust <dbl>, gust_max <dbl>, percip <dbl>, percip_max <dbl>,
#> #   weather_rain <dbl>, weather_snow <dbl>, weather_cloud <dbl>, …
#> # ℹ Use `colnames()` to see all variable names

In addition to these data sets from the modeldata package, we’ll also make use of the Orange data set in base R, containing repeated measurements of 5 orange trees over time:

head(Orange, 2)
#>   Tree age circumference
#> 1    1 118            30
#> 2    1 484            58

And last but not least, we’ll set a seed so our results are reproducible:

set.seed(123)

Random Resampling

Far and away the most common use for rsample is to generate simple random resamples of your data. The rsample package includes a number of functions specifically for this purpose.

Initial Splits

To split your data into two sets – often referred to as the “training” and “testing” sets – rsample provides the initial_split() function:

initial_split(ames)
#> <Training/Testing/Total>
#> <2197/733/2930>

The output of this is an rsplit object with each observation assigned to one of the two sets. You can control the proportion of data assigned to the “training” set through the prop argument:

initial_split(ames, prop = 0.8)
#> <Training/Testing/Total>
#> <2344/586/2930>

To get the actual data assigned to either set, use the training() and testing() functions:

resample <- initial_split(ames, prop = 0.6)

head(training(resample), 2)
#> # A tibble: 2 × 74
#>   MS_SubC…¹ MS_Zo…² Lot_F…³ Lot_A…⁴ Street Alley Lot_S…⁵ Land_…⁶ Utili…⁷ Lot_C…⁸
#>   <fct>     <fct>     <dbl>   <int> <fct>  <fct> <fct>   <fct>   <fct>   <fct>  
#> 1 One_Stor… Reside…     110   14333 Pave   No_A… Regular Lvl     AllPub  Corner 
#> 2 One_Stor… Reside…      65    8450 Pave   No_A… Regular Lvl     AllPub  Inside 
#> # … with 64 more variables: Land_Slope <fct>, Neighborhood <fct>,
#> #   Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> #   Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> #   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
#> #   Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
#> #   Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
#> #   BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, …
#> # ℹ Use `colnames()` to see all variable names
head(testing(resample), 2)
#> # A tibble: 2 × 74
#>   MS_SubC…¹ MS_Zo…² Lot_F…³ Lot_A…⁴ Street Alley Lot_S…⁵ Land_…⁶ Utili…⁷ Lot_C…⁸
#>   <fct>     <fct>     <dbl>   <int> <fct>  <fct> <fct>   <fct>   <fct>   <fct>  
#> 1 One_Stor… Reside…     141   31770 Pave   No_A… Slight… Lvl     AllPub  Corner 
#> 2 One_Stor… Reside…      80   11622 Pave   No_A… Regular Lvl     AllPub  Inside 
#> # … with 64 more variables: Land_Slope <fct>, Neighborhood <fct>,
#> #   Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> #   Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> #   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
#> #   Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
#> #   Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
#> #   BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, …
#> # ℹ Use `colnames()` to see all variable names
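
As a rough sketch of how the two sets are typically used, the code below fits a model on the training set and evaluates it once on the testing set. The model itself (a linear regression of log sale price on Gr_Liv_Area) is an illustrative assumption and isn’t part of rsample:

```r
library(rsample)
data(ames, package = "modeldata")
set.seed(123)

split <- initial_split(ames, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Fit on the training set only (illustrative model, not a recommendation)
fit <- lm(log10(Sale_Price) ~ Gr_Liv_Area, data = train)

# Evaluate once on the held-out testing set
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((log10(test$Sale_Price) - pred)^2))
rmse
```

Every row of the original data ends up in exactly one of the two sets, so nrow(train) + nrow(test) equals nrow(ames).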

Validation Splits

You should only evaluate models against your test set once, when you’ve completely finished tuning and training your models. However, it’s possible to have additional sets of data “held out” from the model training process, which can be used to evaluate models multiple times before you’re ready to evaluate against the final test set.

These sets of data are often called “validation sets”, and can be created in rsample via validation_split():

validation_split(ames, prop = 0.8)
#> # Validation Set Split (0.8/0.2)  
#> # A tibble: 1 × 2
#>   splits             id        
#>   <list>             <chr>     
#> 1 <split [2344/586]> validation

These validation splits separate your data into “analysis” and “assessment” sets, which you can use to fit models and assess their accuracy while still preserving your initial hold-out test set.

Just like initial_split(), you can control the amount of data assigned to each set using the prop argument. Unlike the output from initial_split(), however, the output from validation_split() is an rset object, which can then be used by other packages in the tidymodels universe (such as tune) to evaluate model performance.
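
Because an rset is a tibble with a list-column of rsplit objects, you can pull out an individual split and extract its data with analysis() and assessment(). The following is a minimal sketch; the object names are arbitrary, and depending on your rsample version validation_split() may emit a deprecation warning:

```r
library(rsample)
data(ames, package = "modeldata")
set.seed(123)

val <- validation_split(ames, prop = 0.8)

# Each element of the `splits` column is an rsplit object
val_split <- val$splits[[1]]

nrow(analysis(val_split))    # rows available for fitting
nrow(assessment(val_split))  # rows held out for evaluation
```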

V-Fold Cross-Validation

For hyperparameter tuning and model fitting, it’s often useful to assess your model against more than just a single validation set in order to get a more stable estimate of model performance. As a result, modelers often use a process known as cross-validation, where your data is split into analysis and assessment sets multiple times.

Perhaps the most common cross-validation method is V-fold cross-validation. Also known as “k-fold cross-validation”, this method creates V resamples by splitting your data into V groups (also known as “folds”) of roughly equal size. The analysis set of each resample is made up of V-1 folds, with the remaining fold being used as the assessment set. This way, each observation in your data is used in exactly one assessment set.

To use V-fold cross-validation in rsample, use the vfold_cv() function:

vfold_cv(ames, v = 2)
#> #  2-fold cross-validation 
#> # A tibble: 2 × 2
#>   splits              id   
#>   <list>              <chr>
#> 1 <split [1465/1465]> Fold1
#> 2 <split [1465/1465]> Fold2

One downside to V-fold cross-validation is that it tends to produce “noisy”, or high-variance, estimates when compared to other resampling methods. To reduce that variance, it’s often helpful to perform what’s known as repeated cross-validation, effectively running the V-fold resampling procedure multiple times for your data. To perform repeated V-fold cross-validation in rsample, you can use the repeats argument inside of vfold_cv():

vfold_cv(ames, v = 2, repeats = 2)
#> #  2-fold cross-validation repeated 2 times 
#> # A tibble: 4 × 3
#>   splits              id      id2  
#>   <list>              <chr>   <chr>
#> 1 <split [1465/1465]> Repeat1 Fold1
#> 2 <split [1465/1465]> Repeat1 Fold2
#> 3 <split [1465/1465]> Repeat2 Fold1
#> 4 <split [1465/1465]> Repeat2 Fold2
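
To check the claim that V-fold cross-validation uses each observation in exactly one assessment set, the sketch below tags every row with an identifier before resampling. The row_id column is a hypothetical helper added only for this illustration:

```r
library(rsample)
data(ames, package = "modeldata")
set.seed(123)

# Add a row identifier so we can track where each observation ends up
ames_id <- ames
ames_id$row_id <- seq_len(nrow(ames_id))

folds <- vfold_cv(ames_id, v = 5)

# Collect the row identifiers of every assessment set
assessed <- unlist(lapply(folds$splits, function(s) assessment(s)$row_id))

# Every row appears in exactly one assessment set
identical(sort(assessed), ames_id$row_id)
#> [1] TRUE
```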

Monte-Carlo Cross-Validation

An alternative to V-fold cross-validation is Monte-Carlo cross-validation. Where V-fold assigns each observation in your data to one (and exactly one) assessment set, Monte-Carlo cross-validation takes a random subset of your data for each assessment set, meaning each observation can be used in 0, 1, or many assessment sets. The analysis set is then made up of all the observations that weren’t selected. Because each assessment set is sampled independently, you can repeat this as many times as you want.

To use Monte-Carlo cross-validation in rsample, use the mc_cv() function:

mc_cv(ames, prop = 0.8, times = 2)
#> # Monte Carlo cross-validation (0.8/0.2) with 2 resamples  
#> # A tibble: 2 × 2
#>   splits             id       
#>   <list>             <chr>    
#> 1 <split [2344/586]> Resample1
#> 2 <split [2344/586]> Resample2

Just as with validation_split(), you can control the proportion of your data assigned to the analysis fold using prop. You can also control the number of resamples you create using the times argument.
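
To see that an observation can land in zero, one, or many assessment sets, you can count how often each row is held out. This is a small sketch; the row_id column is a hypothetical helper added for tracking, and the exact counts depend on the seed:

```r
library(rsample)
data(ames, package = "modeldata")
set.seed(123)

ames_id <- ames
ames_id$row_id <- seq_len(nrow(ames_id))

resamples <- mc_cv(ames_id, prop = 0.8, times = 10)

# Row identifiers of every assessment set, pooled across resamples
assessed <- unlist(lapply(resamples$splits, function(s) assessment(s)$row_id))

# Number of assessment sets each row appears in (including zero)
counts <- tabulate(assessed, nbins = nrow(ames_id))
table(counts)
```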

Monte-Carlo cross-validation tends to produce more biased estimates than V-fold. As such, when computationally feasible we typically recommend using five or so repeats of 10-fold cross-validation for model assessment.
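
The recommended setup above can be expressed directly with vfold_cv(); this sketch simply combines the v and repeats arguments already introduced:

```r
library(rsample)
data(ames, package = "modeldata")
set.seed(123)

# Five repeats of 10-fold cross-validation
folds <- vfold_cv(ames, v = 10, repeats = 5)

nrow(folds)  # 10 folds x 5 repeats
#> [1] 50
```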

Bootstrap Resampling

The last primary resampling technique in rsample is bootstrap resampling. A “bootstrap sample” is a sample of your data set, the same size as your data set, taken with replacement so that a single observation might be sampled multiple times. The assessment set is then made up of all the observations that weren’t selected for the analysis set. Generally, bootstrap resampling produces pessimistic estimates of model accuracy.

You can create bootstrap resamples in rsample using the bootstraps() function. While you can’t control the proportion of data in each set (the analysis set of a bootstrap resample is always the same size as the original data set), the function otherwise works exactly like mc_cv():

bootstraps(ames, times = 2)
#> # Bootstrap sampling 
#> # A tibble: 2 × 2
#>   splits              id        
#>   <list>              <chr>     
#> 1 <split [2930/1081]> Bootstrap1
#> 2 <split [2930/1084]> Bootstrap2
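
The relationship between the two sets of a bootstrap resample can be checked directly. In this sketch the row_id column is a hypothetical helper used to track rows: the analysis set contains repeated rows, and the assessment set holds exactly the rows that were never drawn:

```r
library(rsample)
data(ames, package = "modeldata")
set.seed(123)

ames_id <- ames
ames_id$row_id <- seq_len(nrow(ames_id))

boots <- bootstraps(ames_id, times = 1)
b <- boots$splits[[1]]

in_bag     <- analysis(b)$row_id
out_of_bag <- assessment(b)$row_id

length(in_bag)                         # same as nrow(ames_id)
anyDuplicated(in_bag) > 0              # sampled with replacement, so repeats occur
length(intersect(in_bag, out_of_bag))  # 0: assessment rows were never drawn

# Distinct sampled rows plus unsampled rows account for the whole data set
length(unique(in_bag)) + length(out_of_bag) == nrow(ames_id)
```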

Stratified Resampling

If your data is heavily imbalanced (that is, if the distribution of an important continuous variable is skewed, or some classes of a categorical variable are much more common than others), simple random resampling may accidentally skew your data even further by allocating a disproportionate share of the “rare” observations to either the analysis or assessment fold. In these situations, it can be useful to instead use stratified resampling to ensure the analysis and assessment folds have a distribution similar to that of your overall data.

All of the functions discussed so far support stratified resampling through their strata argument. This argument takes a single column identifier and uses it to stratify the resampling procedure:

vfold_cv(ames, v = 2, strata = Sale_Price)
#> #  2-fold cross-validation using stratification 
#> # A tibble: 2 × 2
#>   splits              id   
#>   <list>              <chr>
#> 1 <split [1464/1466]> Fold1
#> 2 <split [1466/1464]> Fold2

By default, rsample will cut continuous variables into four bins, and ensure that each bin is proportionally represented in each set. If desired, this behavior can be changed using the breaks argument:

vfold_cv(ames, v = 2, strata = Sale_Price, breaks = 100)
#> #  2-fold cross-validation using stratification 
#> # A tibble: 2 × 2
#>   splits              id   
#>   <list>              <chr>
#> 1 <split [1439/1491]> Fold1
#> 2 <split [1491/1439]> Fold2
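
One informal way to see the effect of stratification is to compare summary statistics of the stratification variable across the two sets. This is a sketch rather than a formal diagnostic, and the choice of quartiles is arbitrary:

```r
library(rsample)
data(ames, package = "modeldata")
set.seed(123)

split <- initial_split(ames, prop = 0.8, strata = Sale_Price)

# Quartiles of Sale_Price should be similar in both sets
probs <- c(0.25, 0.5, 0.75)
rbind(
  training = quantile(training(split)$Sale_Price, probs),
  testing  = quantile(testing(split)$Sale_Price, probs)
)
```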

Grouped Resampling

Often, some observations in your data will be “more related” to each other than would be probable under random chance, for instance because they represent repeated measurements of the same subject or were all collected at a single location. In these situations, you often want to assign all related observations to either the analysis or assessment fold as a group, to avoid having assessment data that’s closely related to the data used to fit a model.

All of the functions discussed so far have a “grouped resampling” variation to handle these situations. These functions all start with the group_ prefix, and use the argument group to specify which column should be used to group observations. Other than respecting these groups, these functions all work like their ungrouped variants:

resample <- group_initial_split(Orange, group = Tree)

unique(training(resample)$Tree)
#> [1] 1 2 4 5
#> Levels: 3 < 1 < 5 < 2 < 4
unique(testing(resample)$Tree)
#> [1] 3
#> Levels: 3 < 1 < 5 < 2 < 4
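
The same guarantee holds for the other grouped functions. As a sketch, the code below uses group_vfold_cv() on the Orange data and confirms that no tree appears in both the analysis and assessment sets of any fold; converting Tree to character is just a convenience for the comparison:

```r
library(rsample)
set.seed(123)

# Five trees and five folds, so each fold holds out a single tree
folds <- group_vfold_cv(Orange, group = Tree, v = 5)

# Number of trees shared between analysis and assessment sets, per fold
overlap <- vapply(folds$splits, function(s) {
  length(intersect(
    as.character(analysis(s)$Tree),
    as.character(assessment(s)$Tree)
  ))
}, integer(1))

overlap
#> [1] 0 0 0 0 0
```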

It’s important to note that, while functions like group_mc_cv() and group_validation_split() still let you specify what proportion of your data should be in the analysis set (and group_bootstraps() still attempts to create analysis sets the same size as your original data), rsample won’t “split” groups in order to exactly meet that proportion. These functions start out by assigning one group at random to each set (or, for group_vfold_cv(), to each fold) and then assign each of the remaining groups, in a random order, to whichever set brings the relative sizes of each set closest to the target proportion. That means that resamples are randomized, and you can safely use repeated cross-validation just as you would with ungrouped resampling. It also means that, if your groups are unbalanced, you can wind up with analysis and assessment sets of very different sizes than you anticipated:

set.seed(1)
group_bootstraps(ames, Neighborhood, times = 2)
#> # Group bootstrap sampling 
#> # A tibble: 2 × 2
#>   splits             id        
#>   <list>             <chr>     
#> 1 <split [2939/907]> Bootstrap1
#> 2 <split [2958/635]> Bootstrap2

While most of the grouped resampling functions are always focused on balancing the proportion of data in the analysis set, by default group_vfold_cv() will attempt to balance the number of groups assigned to each fold. If instead you’d like to balance the number of observations in each fold (meaning your assessment sets will be of similar sizes, but smaller groups will be more likely to be assigned to the same folds than would happen under random chance), you can use the argument balance = "observations":

group_vfold_cv(ames, Neighborhood, balance = "observations", v = 2)
#> # Group 2-fold cross-validation 
#> # A tibble: 2 × 2
#>   splits              id       
#>   <list>              <chr>    
#> 1 <split [1475/1455]> Resample1
#> 2 <split [1455/1475]> Resample2

If you’re working with spatial data, your observations will often be more related to their neighbors than to the rest of the data set; as Tobler’s first law of geography puts it, “everything is related to everything else, but near things are more related than distant things.” However, you often won’t have a pre-defined “location” variable that you can use to group related observations. The spatialsample package provides functions for spatial cross-validation using rsample syntax and classes, and is often useful for these situations.

Time-Based Resampling

When working with time-based data, it usually doesn’t make sense to randomly resample your data: random resampling will likely result in your analysis set having observations from later than your assessment set, which isn’t a realistic way to assess model performance.

As such, rsample provides a few different functions to make sure that all data in your assessment sets are after that in the analysis set.

First off, two variants on initial_split() and validation_split(), initial_time_split() and validation_time_split(), will assign the first rows of your data to the analysis set (with the number of rows assigned determined by prop):

initial_time_split(Chicago)
#> <Training/Testing/Total>
#> <4273/1425/5698>

validation_time_split(Chicago)
#> # Validation Set Split (0.75/0.25)  
#> # A tibble: 1 × 2
#>   splits              id        
#>   <list>              <chr>     
#> 1 <split [4273/1425]> validation
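
A quick way to confirm the ordering is to compare dates on either side of the split. This sketch relies on the Chicago data already being sorted by its date column, since the time-based split functions work on row order rather than on the dates themselves:

```r
library(rsample)
data(Chicago, package = "modeldata")

split <- initial_time_split(Chicago, prop = 0.75)

# The latest training date precedes the earliest testing date
max(training(split)$date) < min(testing(split)$date)
#> [1] TRUE
```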

There are also several functions in rsample to help you construct multiple analysis and assessment sets from time-based data. For instance, sliding_window() will create “windows” of your data, moving down through the rows of the data frame:

sliding_window(Chicago) %>%
  head(2)
#> # A tibble: 2 × 2
#>   splits        id       
#>   <list>        <chr>    
#> 1 <split [1/1]> Slice0001
#> 2 <split [1/1]> Slice0002
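
The default windows contain a single row each, which is rarely what you want in practice. As a sketch, the lookback, assess_stop, and step arguments of sliding_window() can be used to build larger windows; the particular values here (30 rows for analysis, 7 for assessment, advancing 7 rows at a time) are arbitrary choices for illustration:

```r
library(rsample)
data(Chicago, package = "modeldata")

windows <- sliding_window(
  Chicago,
  lookback = 29,    # include the 29 rows before the current one
  assess_stop = 7,  # assess on the 7 rows that follow
  step = 7          # move forward 7 rows between resamples
)

first <- windows$splits[[1]]
nrow(analysis(first))    # 30 rows: the current row plus 29 before it
nrow(assessment(first))  # 7 rows immediately after the analysis set
```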

If you want to create sliding windows of your data based on a specific variable, you can use the sliding_index() function:

sliding_index(Chicago, date) %>%
  head(2)
#> # A tibble: 2 × 2
#>   splits        id       
#>   <list>        <chr>    
#> 1 <split [1/1]> Slice0001
#> 2 <split [1/1]> Slice0002

And if you want to set the size of windows based on units of time, for instance to have each window contain a year of data, you can use sliding_period():

sliding_period(Chicago, date, "year") %>%
  head(2)
#> # A tibble: 2 × 2
#>   splits            id     
#>   <list>            <chr>  
#> 1 <split [344/365]> Slice01
#> 2 <split [365/365]> Slice02

All of these functions produce analysis sets that span a fixed-size window, with the start and end of the analysis set “sliding” down your data frame. If you’d rather have your analysis set get progressively larger, so that you’re predicting new data based upon a growing set of older observations, you can use the rolling_origin() function:

rolling_origin(Chicago) %>%
  head(2)
#> # A tibble: 2 × 2
#>   splits        id       
#>   <list>        <chr>    
#> 1 <split [5/1]> Slice0001
#> 2 <split [6/1]> Slice0002
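
As with the sliding functions, the defaults produce very small sets. The sketch below uses the initial, assess, and skip arguments of rolling_origin() to start from roughly a year of data and grow the analysis set by 30 rows per resample; the specific values are illustrative assumptions:

```r
library(rsample)
data(Chicago, package = "modeldata")

origins <- rolling_origin(
  Chicago,
  initial = 365,  # rows in the first analysis set
  assess = 30,    # rows in each assessment set
  skip = 29       # keep every 30th resample
)

# Analysis sets grow over time while assessment sets stay the same size
sizes <- vapply(origins$splits[1:3], function(s) nrow(analysis(s)), integer(1))
sizes
#> [1] 365 395 425
```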

Note that all of these time-based resampling functions are deterministic: unlike the rest of the package, running these functions repeatedly under different random seeds will always return the same results.
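
A short sketch of this determinism: changing the seed has no effect on a time-based split, whereas a random split will generally differ between seeds.

```r
library(rsample)
data(Chicago, package = "modeldata")

set.seed(1)
a <- initial_time_split(Chicago)
set.seed(2)
b <- initial_time_split(Chicago)

# Time-based splits ignore the random seed
identical(training(a), training(b))
#> [1] TRUE
```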