Using seplyr to Program Over dplyr

John Mount

2021-09-01

Introduction

seplyr is an R package that makes it easy to program over dplyr 0.7.* and later without needing to directly use rlang notation.

seplyr

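seplyr is a dplyr adapter layer that prefers “slightly clunkier” standard interfaces (or referentially transparent interfaces). These interfaces take column names and expressions as ordinary string values, which is what makes them powerful: anything that can be held in a variable can be passed in as an argument.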

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our example.

Let’s translate the above example into a re-usable function in small (easy) stages. First, translate the interactive script from dplyr notation into seplyr notation. This step is a pure re-factoring: we are changing the code without changing its observable external behavior.

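The translation is mechanical in that it mostly uses the seplyr documentation as a lookup table. What you have to do is: change dplyr verbs to their matching seplyr *_se() adapters, add quote marks around names and expressions, collect sequences of expressions (such as those in summarize()) into explicit vectors with c(), and replace "=" in expressions with ":=".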

Our converted code looks like the following.

library("dplyr")
library("seplyr")
starwars %>%
  group_by_se("homeworld") %>%
  summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                 "mean_mass" := "mean(mass, na.rm = TRUE)",
                 "count" := "n()"))
## # A tibble: 49 x 4
##    homeworld      mean_height mean_mass count
##    <chr>                <dbl>     <dbl> <int>
##  1 Alderaan              176.      64       3
##  2 Aleen Minor            79       15       1
##  3 Bespin                175       79       1
##  4 Bestine IV            180      110       1
##  5 Cato Neimoidia        191       90       1
##  6 Cerea                 198       82       1
##  7 Champala              196      NaN       1
##  8 Chandrila             150      NaN       1
##  9 Concord Dawn          183       79       1
## 10 Corellia              175       78.5     2
## # … with 39 more rows

This code works the same as the original dplyr code. Also the translation could be performed by following the small set of explicit re-coding rules that we gave above.
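For comparison, here is a sketch of the plain dplyr pipeline that the seplyr version corresponds to. It is reconstructed from the seplyr code by reversing the re-coding steps (dropping the quotes, the c(), and the ":="), so treat it as an assumption about the shape of the interactive script rather than a verbatim copy. The name by_homeworld is introduced here only for illustration.

```r
library("dplyr")

# Plain dplyr form (non-standard evaluation): column names and
# expressions are written bare, so they cannot come from variables.
by_homeworld <- starwars %>%
  group_by(homeworld) %>%
  summarize(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())

print(by_homeworld)
```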

Obviously, at this point all we have done is make the code a bit less pleasant looking. We have yet to see any benefit from this conversion (though we can turn this on its head and say that all the original dplyr notation saves us is having to write a few quote marks).

The benefit is: this new code can very easily be parameterized and wrapped in a re-usable function. In fact it is now simpler to do than to describe.

grouped_mean <- function(data, 
                         grouping_variables, 
                         value_variables,
                         count_name = "count") {
  result_names <- paste0("mean_", 
                         value_variables)
  expressions <- paste0("mean(", 
                        value_variables, 
                        ", na.rm = TRUE)")
  calculation <- result_names := expressions
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(calculation,
                   count_name := "n()")) %>%
    ungroup()
}

starwars %>% 
  grouped_mean(grouping_variables = c("eye_color", "skin_color"),
               value_variables = c("mass", "birth_year"))
## # A tibble: 53 x 5
##    eye_color skin_color       mean_mass mean_birth_year count
##    <chr>     <chr>                <dbl>           <dbl> <int>
##  1 black     green                 80.5            44       2
##  2 black     grey                  78.7           NaN       4
##  3 black     none                 NaN             NaN       1
##  4 black     orange                80              22       1
##  5 black     red, blue, white      57             NaN       1
##  6 black     white, blue          NaN             NaN       1
##  7 blue      blue                 NaN             NaN       1
##  8 blue      brown                136             NaN       1
##  9 blue      dark                  50             NaN       1
## 10 blue      fair                  90              62.6    10
## # … with 43 more rows

We have translated our original interactive or ad-hoc calculation into a parameterized reusable function in two easy steps: translate the script into seplyr notation, then replace values with variables.

To be sure: there are some clunky details of using paste0() to build up the expressions, but the conversion process is very regular and easy. In seplyr parametric programming is intentionally easy (just replace values with variables).
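To make the paste0() step concrete, the sketch below builds the same intermediate values that grouped_mean() computes internally, for value_variables = c("mass", "birth_year"). It assumes that attaching seplyr makes the ":=" operator available, as in the code above, and that ":=" applied to two character vectors of equal length yields a named character vector.

```r
library("seplyr")  # assumed to make ":=" available, as in the examples above

value_variables <- c("mass", "birth_year")

# Names of the result columns: "mean_mass", "mean_birth_year"
result_names <- paste0("mean_", value_variables)

# Expressions as strings, one per value column
expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")

# Pair each result name with its expression
calculation <- result_names := expressions

print(calculation)
```

Because calculation is an ordinary value, it can be inspected, extended with c(), or built by any other string-manipulation code before being handed to summarize_se().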

Conclusion

The seplyr methodology is simple, easy to teach, and powerful.

There are alternatives that differ in philosophy.
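As one illustration of a different philosophy, the sketch below writes a similar function with dplyr's own tidy-evaluation tools instead of strings. It assumes dplyr 1.0.0 or later (for across() and the .groups argument), and the name grouped_mean_tidy is hypothetical. It is not taken from the seplyr documentation.

```r
library("dplyr")

# Hypothetical tidy-evaluation counterpart of grouped_mean():
# column names still arrive as character vectors, but they are
# resolved with all_of() and across() rather than pasted into strings.
grouped_mean_tidy <- function(data,
                              grouping_variables,
                              value_variables,
                              count_name = "count") {
  data %>%
    group_by(across(all_of(grouping_variables))) %>%
    summarize(across(all_of(value_variables),
                     ~ mean(.x, na.rm = TRUE),
                     .names = "mean_{.col}"),
              !!count_name := n(),
              .groups = "drop")
}

tidy_result <- starwars %>%
  grouped_mean_tidy(grouping_variables = c("eye_color", "skin_color"),
                    value_variables = c("mass", "birth_year"))

print(tidy_result)
```

The trade-off is one of notation: this version leans on selection helpers and the "!!" operator, while the seplyr version only requires building strings.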

The seplyr package contains a number of worked examples both in help() and vignette(package='seplyr') documentation.