Abstract

When performing cross-validation, we tend to go with the common 10 folds (k=10). In this vignette, we try different number of folds settings and assess the differences in performance. To make our results robust to this choice, we average the results of different settings. The functions of interest are cross_validate_fn() and groupdata2::fold().

Contact the author at r-pkgs@ludvigolsen.dk

Multiple-k repeated cross-validation

Our goal here is two-fold:

Try multiple values of k (different numbers of folds) and see the effect on the prediction error.
Repeat each scenario multiple times to get more robust results.

Whereas the previous section used a regression example (continuous y-variable), we will now perform multiclass classification on the iris dataset. This fairly well-known dataset has three species of iris flowers with 50 flowers from each species. The predictors are length and width measurements of the sepals and petals.

First, we attach the needed packages and set a random seed:

library(cvms)  # version >= 1.2.2 
library(groupdata2)  # version >= 1.4.1
library(dplyr)
library(ggplot2)

xpectr::set_test_seed(1)

As the fold creation and cross-validation will take some time to run, we can enable parallelization to speed up the processes:

# Enable parallelization
# NOTE: Uncomment to run
# library(doParallel)
# doParallel::registerDoParallel(6)

Now, we load the data and convert it to a tibble:

# Load iris
data("iris")

# Convert iris to a tibble
iris <- dplyr::as_tibble(iris)
iris
#> # A tibble: 150 × 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

We count the rows per species, to ensure everything is in order:

iris %>% 
  dplyr::count(Species)
#> # A tibble: 3 × 2
#>   Species        n
#>   <fct>      <int>
#> 1 setosa        50
#> 2 versicolor    50
#> 3 virginica     50

Creating folds

When creating the folds, we would like to balance them such that the distribution of the species are similar in all the folds. In groupdata2::fold(), this is possible with the cat_col argument. This also ensures that there’s at least 1 of each species in each fold. With this approach, our maximum number of folds becomes 50. The lowest number of folds we can meaningfully generate is technically 2, but that seems unlikely in practice, so we will set our lower limit at 3 (arbitrary yes, but I get to choose here!). We thus pick 10 ks in that range using the seq() function.

As we are interested in comparing the results at each k setting, we repeat each of the settings 3 times to have more robustness towards the randomness when splitting. You might want to increase this to 10 repetitions, but that increases running time too much for this tutorial. If you are only interested in the average results, you might not need to repeat each setting, as the multiple settings of k becomes a kind of repeated cross-validation in itself.

# Generate sequence of `k` settings in the 3-50 range
fold_counts <- round(seq(from = 3, to = 50, length.out = 10))
# Repeat each 3 times
fold_counts <- rep(fold_counts, each = 3)

fold_counts
#>  [1]  3  3  3  8  8  8 13 13 13 19 19 19 24 24 24 29 29 29 34 34 34 40 40 40 45
#> [26] 45 45 50 50 50

We pass this sequence of counts to the k argument in groupdata2::fold(). We must also set the num_fold_cols argument to match the length of our sequence. As explained previously, we set cat_col = "Species" to ensure a balanced distribution of the species in all folds. Finally, we enable parallelization to speed things up:

data <- iris %>% 
  groupdata2::fold(
    k = fold_counts, 
    cat_col = "Species", 
    num_fold_cols = length(fold_counts),  # Must match the length of `k` 
    parallel = TRUE
  )

data
#> # A tibble: 150 × 35
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species .folds_1 .folds_2
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <fct>    <fct>   
#>  1          5.1         3.5          1.4         0.2 setosa  1        2       
#>  2          4.9         3            1.4         0.2 setosa  2        2       
#>  3          4.7         3.2          1.3         0.2 setosa  2        1       
#>  4          4.6         3.1          1.5         0.2 setosa  3        2       
#>  5          5           3.6          1.4         0.2 setosa  1        2       
#>  6          5.4         3.9          1.7         0.4 setosa  3        1       
#>  7          4.6         3.4          1.4         0.3 setosa  3        3       
#>  8          5           3.4          1.5         0.2 setosa  2        1       
#>  9          4.4         2.9          1.4         0.2 setosa  2        1       
#> 10          4.9         3.1          1.5         0.1 setosa  1        3       
#> # … with 140 more rows, and 28 more variables: .folds_3 <fct>, .folds_4 <fct>,
#> #   .folds_5 <fct>, .folds_6 <fct>, .folds_7 <fct>, .folds_8 <fct>,
#> #   .folds_9 <fct>, .folds_10 <fct>, .folds_11 <fct>, .folds_12 <fct>,
#> #   .folds_13 <fct>, .folds_14 <fct>, .folds_15 <fct>, .folds_16 <fct>,
#> #   .folds_17 <fct>, .folds_18 <fct>, .folds_19 <fct>, .folds_20 <fct>,
#> #   .folds_21 <fct>, .folds_22 <fct>, .folds_23 <fct>, .folds_24 <fct>,
#> #   .folds_25 <fct>, .folds_26 <fct>, .folds_27 <fct>, .folds_28 <fct>, …

We see that the .folds_* columns have been added with the fold identifiers. We will need the names of the generated fold columns in a second so here’s my favorite approach to generating them with paste0():

# Quick way to generate the names of the fold columns
# Note: `seq_along()` is equal to `1:length(fold_counts)`
fold_columns <- paste0(".folds_", seq_along(fold_counts))
fold_columns
#>  [1] ".folds_1"  ".folds_2"  ".folds_3"  ".folds_4"  ".folds_5"  ".folds_6" 
#>  [7] ".folds_7"  ".folds_8"  ".folds_9"  ".folds_10" ".folds_11" ".folds_12"
#> [13] ".folds_13" ".folds_14" ".folds_15" ".folds_16" ".folds_17" ".folds_18"
#> [19] ".folds_19" ".folds_20" ".folds_21" ".folds_22" ".folds_23" ".folds_24"
#> [25] ".folds_25" ".folds_26" ".folds_27" ".folds_28" ".folds_29" ".folds_30"

groupdata2 has the tool summarize_group_cols() for inspecting the generated fold columns (and factors in general). We can use this to assure ourselves that the right number of folds were created in each of the fold columns:

fold_stats <- groupdata2::summarize_group_cols(
    data = data,
    group_cols = fold_columns
  ) %>% 
  rename(`Fold Column` = `Group Column`)

# View fold column statistics
# We only look at one fold column per `k` setting
fold_stats %>% 
  dplyr::group_by(`Num Groups`) %>% 
  dplyr::filter(dplyr::row_number() == 1) %>% 
  knitr::kable()

Fold Column	Num Groups	Mean Rows	Median Rows	Std Rows	IQR Rows	Min Rows	Max Rows
.folds_1	3	50.000000	51	1.7320508	1.50	48	51
.folds_4	8	18.750000	18	1.3887301	0.75	18	21
.folds_7	13	11.538462	12	1.1266014	0.00	9	12
.folds_10	19	7.894737	9	1.4867839	3.00	6	9
.folds_13	24	6.250000	6	0.8469896	0.00	6	9
.folds_16	29	5.172414	6	1.3645765	3.00	3	6
.folds_19	34	4.411765	3	1.5199212	3.00	3	6
.folds_22	40	3.750000	3	1.3155870	0.75	3	6
.folds_25	45	3.333333	3	0.9534626	0.00	3	6
.folds_28	50	3.000000	3	0.0000000	0.00	3	3

Performing cross-validation

Now we are ready to cross-validate a model on our data. We will use the e1071::svm() Support Vector Machine model function. To use this with cross_validate_fn(), we can use the included model_fn and predict_fn functions. We further need to specify the kernel and cost hyperparameters.

For more elaborate examples of cross_validate_fn(), see here.

# To use the e1071::svm() model function
# we specify the model and predict functions
model_fn <- cvms::model_functions("svm_multinomial")
predict_fn <- cvms::predict_functions("svm_multinomial")

# Specify hyperparameters
hyperparameters <- list('kernel' = 'radial', 'cost' = 10)

We define an (arbitrary) set of formulas to cross-validate:

formulas <- c(
  "Species ~ Sepal.Length + Sepal.Width",
  "Species ~ Petal.Length + Petal.Width",
  "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width",
  "Species ~ Sepal.Length * Sepal.Width + Petal.Length * Petal.Width"
)

Now, we are ready to run the cross-validation! We pass our data, formulas, functions, hyperparameters and fold column names to cross_validate_fn() and specify that the type of task is multiclass classification (i.e. multinomial). We also enable parallelization.

NOTE: This number of fold columns and formulas requires fitting 3180 model instances. That can take a few minutes to run, depending on your computer. Unfortunately, I have not yet found a way to include a progress bar when running in parallel.

cv <- cross_validate_fn(
  data = data,
  formulas = formulas,
  type = "multinomial",
  model_fn = model_fn,
  predict_fn = predict_fn,
  hyperparameters = hyperparameters,
  fold_cols = fold_columns,
  parallel = TRUE
)

cv
#> # A tibble: 4 × 26
#>   Fixed          `Overall Accura… `Balanced Accur…    F1 Sensitivity Specificity
#>   <chr>                     <dbl>            <dbl> <dbl>       <dbl>       <dbl>
#> 1 Sepal.Length+…            0.789            0.842 0.789       0.789       0.894
#> 2 Petal.Length+…            0.958            0.968 0.958       0.958       0.979
#> 3 Sepal.Length+…            0.960            0.970 0.960       0.960       0.980
#> 4 Sepal.Length*…            0.974            0.981 0.974       0.974       0.987
#> # … with 20 more variables: Pos Pred Value <dbl>, Neg Pred Value <dbl>,
#> #   Kappa <dbl>, MCC <dbl>, Detection Rate <dbl>, Detection Prevalence <dbl>,
#> #   Prevalence <dbl>, Predictions <chr>, Confusion Matrix <list>,
#> #   Results <list>, Class Level Results <list>, Coefficients <list>,
#> #   Folds <int>, Fold Columns <int>, Convergence Warnings <int>,
#> #   Other Warnings <int>, Warnings and Messages <list>, Process <list>,
#> #   HParams <list<tibble[,2]>>, Dependent <chr>

Above, we see the averaged results from all the fold columns. The last model formula seems to have performed the best as it has the highest Overall Accuracy, Balanced Accuracy, F1, and so on. If our objective was to average the results with different settings of k to increase robustness to that choice, we could stop now. In the following section, we will have a look at the results from the different settings of k.

Inspecting the effect of `k`

There is a list of nested tibbles (data frames) in the cross-validation output called Results. This has the results from each fold column. Let’s extract it and format it a bit.

Note: In regression tasks, the Results tibbles would have the results from each fold, from each fold column, but in classification we gather the predictions from all folds within a fold column before evaluation.

# Extract the fold column results
# This is a list of data frames (one per formula)
fold_column_results <- cv$Results

As this is currently a list of tibbles (one for each formula), we first name it by the model formulas and then combine the tibbles to a single data frame with dplyr::bind_rows().

# Set the names of the data frames to their respective formula
names(fold_column_results) <- cv$Fixed

# Combine the data frames
# Create a 'Formula' column from the names
fold_column_results <- fold_column_results %>% 
  dplyr::bind_rows(.id = "Formula")

We now have a single tibble where a new column (Formula) specifies what model the results came from. When plotting, the full model formula strings are a bit long though, so let’s convert them to something shorter:

# Make the formula string prettier for plotting
# Sepal -> S; Petal -> P; Width -> W; Length -> L
fold_column_results <- fold_column_results %>% 
  dplyr::mutate(
    Formula = gsub(
      x = Formula,
      pattern = "[^SPWL+*]",
      replacement = ""
    )
  )

This leaves us with the following data frame:

fold_column_results
#> # A tibble: 120 × 14
#>    Formula `Fold Column` `Overall Accuracy` `Balanced Accurac…    F1 Sensitivity
#>    <chr>   <chr>                      <dbl>              <dbl> <dbl>       <dbl>
#>  1 SL+SW   .folds_1                   0.773              0.83  0.774       0.773
#>  2 SL+SW   .folds_2                   0.767              0.825 0.768       0.767
#>  3 SL+SW   .folds_3                   0.76               0.82  0.761       0.76 
#>  4 SL+SW   .folds_4                   0.78               0.835 0.780       0.78 
#>  5 SL+SW   .folds_5                   0.8                0.85  0.801       0.8  
#>  6 SL+SW   .folds_6                   0.787              0.84  0.787       0.787
#>  7 SL+SW   .folds_7                   0.78               0.835 0.781       0.78 
#>  8 SL+SW   .folds_8                   0.807              0.855 0.806       0.807
#>  9 SL+SW   .folds_9                   0.787              0.84  0.788       0.787
#> 10 SL+SW   .folds_10                  0.787              0.84  0.787       0.787
#> # … with 110 more rows, and 8 more variables: Specificity <dbl>,
#> #   Pos Pred Value <dbl>, Neg Pred Value <dbl>, Kappa <dbl>, MCC <dbl>,
#> #   Detection Rate <dbl>, Detection Prevalence <dbl>, Prevalence <dbl>

Currently, we lack the number of folds for each of the fold columns. We have stored those in the fold column statistics we looked at in the beginning (fold_stats), so let’s add them with dplyr::left_join():

# Select the `Num Groups` column and rename to `Num Folds`
fold_counts_df <- fold_stats %>% 
  dplyr::select(`Fold Column`, `Num Groups`) %>% 
  dplyr::rename(`Num Folds` = `Num Groups`)

fold_counts_df
#> # A tibble: 30 × 2
#>    `Fold Column` `Num Folds`
#>    <chr>               <dbl>
#>  1 .folds_1                3
#>  2 .folds_2                3
#>  3 .folds_3                3
#>  4 .folds_4                8
#>  5 .folds_5                8
#>  6 .folds_6                8
#>  7 .folds_7               13
#>  8 .folds_8               13
#>  9 .folds_9               13
#> 10 .folds_10              19
#> # … with 20 more rows

# Add the counts with a join
fold_column_results <- fold_counts_df %>% 
  dplyr::left_join(fold_column_results, by = "Fold Column")

fold_column_results
#> # A tibble: 120 × 15
#>    `Fold Column` `Num Folds` Formula     `Overall Accura… `Balanced Accur…    F1
#>    <chr>               <dbl> <chr>                  <dbl>            <dbl> <dbl>
#>  1 .folds_1                3 SL+SW                  0.773            0.83  0.774
#>  2 .folds_1                3 PL+PW                  0.967            0.975 0.967
#>  3 .folds_1                3 SL+SW+PL+PW            0.967            0.975 0.967
#>  4 .folds_1                3 SL*SW+PL*PW            0.973            0.98  0.973
#>  5 .folds_2                3 SL+SW                  0.767            0.825 0.768
#>  6 .folds_2                3 PL+PW                  0.953            0.965 0.953
#>  7 .folds_2                3 SL+SW+PL+PW            0.96             0.97  0.960
#>  8 .folds_2                3 SL*SW+PL*PW            0.967            0.975 0.967
#>  9 .folds_3                3 SL+SW                  0.76             0.82  0.761
#> 10 .folds_3                3 PL+PW                  0.96             0.97  0.96 
#> # … with 110 more rows, and 9 more variables: Sensitivity <dbl>,
#> #   Specificity <dbl>, Pos Pred Value <dbl>, Neg Pred Value <dbl>, Kappa <dbl>,
#> #   MCC <dbl>, Detection Rate <dbl>, Detection Prevalence <dbl>,
#> #   Prevalence <dbl>

We further add a column that indicates which repetition of the k setting a fold column is. This is done with the l_starts method in groupdata2::group(), which automatically starts a new group every time the value of a column changes. So c(".folds_1", ".folds_1", ".folds_2", ".folds_2") would give the groups c(1, 1, 2, 2). By first grouping the data frame by the number of folds, these group indices start over for each setting of k. Note though, that this assumes that the Fold Column is sorted correctly. We should also remember to remove the grouping in the end, if we don’t need it in the following step.

# Add a column for indicating the repetition of the `k` setting
fold_column_results <- fold_column_results %>%
  # Group to restart groupdata2 group numbers for each `Num Folds` setting
  dplyr::group_by(`Num Folds`) %>% 
  # Create a group whenever the `Fold Column` column changes
  groupdata2::group(n = 'auto',
                    method = 'l_starts',
                    starts_col = "Fold Column",
                    col_name = "Repetition") %>% 
  dplyr::ungroup()

# Inspect the columns relevant to this
fold_column_results %>% 
  dplyr::select(`Fold Column`, `Num Folds`, Formula, Repetition) %>% 
  head(20) %>% 
  knitr::kable()

Fold Column	Num Folds	Formula	Repetition
.folds_1	3	SL+SW	1
.folds_1	3	PL+PW	1
.folds_1	3	SL+SW+PL+PW	1
.folds_1	3	SLSW+PLPW	1
.folds_2	3	SL+SW	2
.folds_2	3	PL+PW	2
.folds_2	3	SL+SW+PL+PW	2
.folds_2	3	SLSW+PLPW	2
.folds_3	3	SL+SW	3
.folds_3	3	PL+PW	3
.folds_3	3	SL+SW+PL+PW	3
.folds_3	3	SLSW+PLPW	3
.folds_4	8	SL+SW	1
.folds_4	8	PL+PW	1
.folds_4	8	SL+SW+PL+PW	1
.folds_4	8	SLSW+PLPW	1
.folds_5	8	SL+SW	2
.folds_5	8	PL+PW	2
.folds_5	8	SL+SW+PL+PW	2
.folds_5	8	SLSW+PLPW	2

To plot the average lines for each formula, we calculate the average Balanced Accuracy for each formula, for each number of folds setting:

avg_balanced_acc <- fold_column_results %>%
  dplyr::group_by(`Num Folds`, Formula) %>%
  dplyr::summarise(`Balanced Accuracy` = mean(`Balanced Accuracy`),
                   .groups = "drop")

avg_balanced_acc
#> # A tibble: 40 × 3
#>    `Num Folds` Formula     `Balanced Accuracy`
#>          <dbl> <chr>                     <dbl>
#>  1           3 PL+PW                     0.97 
#>  2           3 SL*SW+PL*PW               0.978
#>  3           3 SL+SW                     0.825
#>  4           3 SL+SW+PL+PW               0.973
#>  5           8 PL+PW                     0.967
#>  6           8 SL*SW+PL*PW               0.97 
#>  7           8 SL+SW                     0.842
#>  8           8 SL+SW+PL+PW               0.967
#>  9          13 PL+PW                     0.967
#> 10          13 SL*SW+PL*PW               0.98 
#> # … with 30 more rows

With the data ready, we plot the effect of k on the Balanced Accuracy metric. Feel free to plot one of the other metrics as well!

# Plot the balanced accuracy by the number of folds
# We add jitter to the points to separate overlapping points slightly
fold_column_results %>% 
  ggplot(aes(x = `Num Folds`, y = `Balanced Accuracy`, color = Formula)) +
  geom_point(
    aes(shape = Repetition), 
    size = 1,
    position = position_jitter(h = 0.0, w = 0.6)) + 
  geom_line(data = avg_balanced_acc) +
  theme_minimal()

For this dataset, the ranking of the models seems somewhat stable, although the PL+PW and SL+SW+PL+PW models are so close that it might switch their ranking at times. The variation in the three points at each k setting (for each formula) shows why repeated cross-validation is a good idea when possible. Without it, the random split can have a much bigger effect on the results (and potentially model ranking). By relying on the average of multiple k settings and repetitions, our results are more robust to fluctuations.

If you don’t want to run all these models always (e.g. in production), running this analysis at the beginning (and perhaps once in a while, in case of data drift) might help you check whether the choice of k makes a difference with your type of data.

Class level results

Finally, let’s have a look at the class level fold column results from the one-vs-all evaluations. These describe how well a model did for each of the species, for each of the fold columns. We will make a plot similar to the above to see whether the k setting affects the class level performance.

The cross-validation output has a nested tibble called Class Level Results for each formula. Within such tibble, we find another nested tibble (Results) that contains the results for each fold column, for each species.

To get the class level fold column results for the best model (i.e. Sepal.Length * Sepal.Width + Petal.Length * Petal.Width), we thus first get the Class Level Results for this (fourth) model and then extract the Results from that.

This gives us a list with the fold column results for each species which we concatenate to a single data frame.

# Extract class level fold column results for the best model
# It is a list of tibbles (one for each species) 
# so we concatenate them to a single tibble with bind_rows()
class_level_fold_results <- cv$`Class Level Results`[[4]]$Results %>% 
  dplyr::bind_rows()

# Add the Num Folds counts
class_level_fold_results <- fold_counts_df %>% 
  dplyr::left_join(class_level_fold_results, by = "Fold Column")

class_level_fold_results
#> # A tibble: 90 × 13
#>    `Fold Column` `Num Folds` Class      `Balanced Accuracy`    F1 Sensitivity
#>    <chr>               <dbl> <chr>                    <dbl> <dbl>       <dbl>
#>  1 .folds_1                3 setosa                   1     1            1   
#>  2 .folds_1                3 versicolor               0.965 0.959        0.94
#>  3 .folds_1                3 virginica                0.975 0.961        0.98
#>  4 .folds_2                3 setosa                   1     1            1   
#>  5 .folds_2                3 versicolor               0.96  0.949        0.94
#>  6 .folds_2                3 virginica                0.965 0.950        0.96
#>  7 .folds_3                3 setosa                   1     1            1   
#>  8 .folds_3                3 versicolor               0.96  0.958        0.92
#>  9 .folds_3                3 virginica                0.98  0.962        1   
#> 10 .folds_4                8 setosa                   0.99  0.990        0.98
#> # … with 80 more rows, and 7 more variables: Specificity <dbl>,
#> #   Pos Pred Value <dbl>, Neg Pred Value <dbl>, Kappa <dbl>,
#> #   Detection Rate <dbl>, Detection Prevalence <dbl>, Prevalence <dbl>

Again, we add the repetition column, using the l_starts method in groupdata2::group():

# Add a column for indicating the repetition of the `k` setting
class_level_fold_results <- class_level_fold_results %>% 
  # Group to restart groupdata2 group numbers for each `Num Folds` setting
  dplyr::group_by(`Num Folds`) %>% 
  # Create a group whenever the `Fold Column` column changes
  groupdata2::group(n = 'auto',
                    method = 'l_starts',
                    starts_col = "Fold Column",
                    col_name = "Repetition") %>% 
  dplyr::ungroup()

# Inspect the columns relevant to this
class_level_fold_results %>% 
  dplyr::select(`Fold Column`, `Num Folds`, Class, Repetition)
#> # A tibble: 90 × 4
#>    `Fold Column` `Num Folds` Class      Repetition
#>    <chr>               <dbl> <chr>      <fct>     
#>  1 .folds_1                3 setosa     1         
#>  2 .folds_1                3 versicolor 1         
#>  3 .folds_1                3 virginica  1         
#>  4 .folds_2                3 setosa     2         
#>  5 .folds_2                3 versicolor 2         
#>  6 .folds_2                3 virginica  2         
#>  7 .folds_3                3 setosa     3         
#>  8 .folds_3                3 versicolor 3         
#>  9 .folds_3                3 virginica  3         
#> 10 .folds_4                8 setosa     1         
#> # … with 80 more rows

To plot the average lines for each formula, we calculate the average Balanced Accuracy for each class, for each number of folds setting:

class_level_avg_balanced_acc <- class_level_fold_results %>%
  dplyr::group_by(`Num Folds`, Class) %>%
  dplyr::summarise(`Balanced Accuracy` = mean(`Balanced Accuracy`),
                   .groups = "drop")

class_level_avg_balanced_acc
#> # A tibble: 30 × 3
#>    `Num Folds` Class      `Balanced Accuracy`
#>          <dbl> <chr>                    <dbl>
#>  1           3 setosa                   1    
#>  2           3 versicolor               0.962
#>  3           3 virginica                0.973
#>  4           8 setosa                   0.993
#>  5           8 versicolor               0.955
#>  6           8 virginica                0.962
#>  7          13 setosa                   0.997
#>  8          13 versicolor               0.97 
#>  9          13 virginica                0.973
#> 10          19 setosa                   0.993
#> # … with 20 more rows

Now, we can plot the Balanced Accuracy by the number of folds:

# Plot the balanced accuracy by the number of folds
# We add jitter to the points to separate overlapping points slightly
class_level_fold_results  %>% 
  ggplot(aes(x = `Num Folds`, y = `Balanced Accuracy`, color = Class)) +
  geom_point(aes(shape = Repetition), size = 1, 
             position = position_jitter(h = 0.0, w = 0.6)) + 
  geom_line(data = class_level_avg_balanced_acc) +
  theme_minimal()

The Versicolor and Virginica species seem affected by the number of folds. They both have slightly lower average balanced accuracies at the lower k settings.

In this vignette, we have covered the choice of the “number of folds” setting when using cross-validation. We have discussed why larger k settings should give lower prediction errors on average and shown how to make results robust to this setting by averaging over a range of k values. The groupdata2::fold() and cvms cross-validation functions enable this type of analysis.

This concludes the vignette. If elements are unclear or you need help to apply this to your context, you can leave feedback in a mail or in a GitHub issue :-)

Multiple-k: Picking the number of folds for cross-validation

Ludvig Renbo Olsen

2022-07-14

Introduction

Does it matter?

Multiple-k repeated cross-validation

Creating folds

Performing cross-validation

Inspecting the effect of `k`

Class level results

Multiple-k: Picking the number of folds for cross-validation

Ludvig Renbo Olsen

2022-07-14

Introduction

Does it matter?

Multiple-k repeated cross-validation

Creating folds

Performing cross-validation

Inspecting the effect of k

Class level results

Inspecting the effect of `k`