Comparison with dplyr Tabulation

Gabriel Becker and Adrian Waddell

2022-05-20

Introduction

In this vignette we would like to discuss the similarities and differences between dplyr and rtable.

Much of the rtables framework focuses on tabulation/summarizing of data and then the visualization of the table. In this vignette we focus on summarizing data using dplyr and contrast it to rtables. We won’t pay attention to the table visualizaion/markup and just derive the cell content.

Using dplyr to summarize data and gt to visualize the table is a good way if the tabulation is of a certain nature or complexity. However, there are tables such as the table created in the introduction vignette that take some effort to create with dplyr. Part of the effort is due to fact that when using dplyr the table data is stored in data.frames or tibbles which is not the most natural way to represent a table as we will show in this vignette.

If you know a more elegant way of deriving the table content with dplyr please let us know and we will update the vignette.

library(rtables)
library(dplyr)

Here is table and data used in the introduction vignette:

n <- 400

set.seed(1)

df <- tibble(
  arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")),
  country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")),
  handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")),
  age = rchisq(n, 30) + 10
) %>% mutate(
  weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180)
) 

basic_table() %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  add_colcounts() %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x") %>%
  build_table(df)
                    Arm A                     Arm B         
             Female        Male        Female        Male   
             (N=96)      (N=105)       (N=92)      (N=107)  
————————————————————————————————————————————————————————————
CAN        45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
  Left     32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
    mean      38.9         40.4         40.3         37.7   
  Right    13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
    mean      36.6         40.2         40.2         40.6   
USA        51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
  Left     34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
    mean      40.4         39.7         39.2         40.1   
  Right    17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
    mean      36.9         39.8         38.5         39.0   

Getting Started

We will start by deriving the first data cell on row 3 (note, row 1 and 2 have content cells, see the introduction vignette). Cell 3,1 contains the mean age for left handed & female Canadians in “Arm A”:

mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female" & df$handed == "Left"])
[1] 38.86979

or with dplyr:

df %>%
  filter(country == "CAN", arm == "Arm A", gender == "Female", handed == "Left") %>%
  summarise(mean_age = mean(age))
# A tibble: 1 × 1
  mean_age
     <dbl>
1     38.9

Further, dplyr gives us other verbs to easily get the average age of left handed Canadians for each group defined by the 4 columns:

df %>%
  group_by(arm, gender) %>%
  filter(country == "CAN", handed == "Left") %>%
  summarise(mean_age = mean(age))
`summarise()` has grouped output by 'arm'. You can override using the `.groups`
argument.
# A tibble: 4 × 3
# Groups:   arm [2]
  arm   gender mean_age
  <fct> <fct>     <dbl>
1 Arm A Female     38.9
2 Arm A Male       40.4
3 Arm B Female     40.3
4 Arm B Male       37.7

We can further get to all the average age cell values with:

average_age <- df %>%
  group_by(arm, gender, country, handed) %>%
  summarise(mean_age = mean(age))
`summarise()` has grouped output by 'arm', 'gender', 'country'. You can
override using the `.groups` argument.
average_age
# A tibble: 16 × 5
# Groups:   arm, gender, country [8]
   arm   gender country handed mean_age
   <fct> <fct>  <fct>   <fct>     <dbl>
 1 Arm A Female CAN     Left       38.9
 2 Arm A Female CAN     Right      36.6
 3 Arm A Female USA     Left       40.4
 4 Arm A Female USA     Right      36.9
 5 Arm A Male   CAN     Left       40.4
 6 Arm A Male   CAN     Right      40.2
 7 Arm A Male   USA     Left       39.7
 8 Arm A Male   USA     Right      39.8
 9 Arm B Female CAN     Left       40.3
10 Arm B Female CAN     Right      40.2
11 Arm B Female USA     Left       39.2
12 Arm B Female USA     Right      38.5
13 Arm B Male   CAN     Left       37.7
14 Arm B Male   CAN     Right      40.6
15 Arm B Male   USA     Left       40.1
16 Arm B Male   USA     Right      39.0

In rtable syntax we need the following code to get to the same content:

basic_table() %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  split_rows_by("country") %>%
  split_rows_by("handed") %>%
  analyze("age", afun = mean, format = "xx.x") %>%
  build_table(df)
               Arm A           Arm B    
           Female   Male   Female   Male
————————————————————————————————————————
CAN                                     
  Left                                  
    mean    38.9    40.4    40.3    37.7
  Right                                 
    mean    36.6    40.2    40.2    40.6
USA                                     
  Left                                  
    mean    40.4    39.7    39.2    40.1
  Right                                 
    mean    36.9    39.8    38.5    39.0

As mentioned in the introduction to this vignette, please ignore the difference in arranging and formatting the data: it’s possible to condense the rtable more and it is possible to make the tibble look more like the reference table using the gt R package.

In terms of tabulation for this example there was arguably not much added by rtables over dplyr.

Content Information

We will now focus on the count percentage information for handedness for each country. For that we need to get the correct denominator for the percentages

df %>% 
  group_by(arm, gender, country, handed) %>%
  summarize(mean = mean(age), c_h_count = n()) %>%
  # now the `handed` grouping has been removed, therefore we can calculate percent now:
  mutate(c_h_percent = c_h_count / sum(c_h_count))
`summarise()` has grouped output by 'arm', 'gender', 'country'. You can
override using the `.groups` argument.
# A tibble: 16 × 7
# Groups:   arm, gender, country [8]
   arm   gender country handed  mean c_h_count c_h_percent
   <fct> <fct>  <fct>   <fct>  <dbl>     <int>       <dbl>
 1 Arm A Female CAN     Left    38.9        32       0.711
 2 Arm A Female CAN     Right   36.6        13       0.289
 3 Arm A Female USA     Left    40.4        34       0.667
 4 Arm A Female USA     Right   36.9        17       0.333
 5 Arm A Male   CAN     Left    40.4        42       0.656
 6 Arm A Male   CAN     Right   40.2        22       0.344
 7 Arm A Male   USA     Left    39.7        19       0.463
 8 Arm A Male   USA     Right   39.8        22       0.537
 9 Arm B Female CAN     Left    40.3        26       0.565
10 Arm B Female CAN     Right   40.2        20       0.435
11 Arm B Female USA     Left    39.2        25       0.543
12 Arm B Female USA     Right   38.5        21       0.457
13 Arm B Male   CAN     Left    37.7        37       0.597
14 Arm B Male   CAN     Right   40.6        25       0.403
15 Arm B Male   USA     Left    40.1        25       0.556
16 Arm B Male   USA     Right   39.0        20       0.444

which has 16 rows (cells) like the average_age data frame defined above. Next, we will derive the group information for countries:

df %>% 
  group_by(arm, gender, country, handed) %>%
  summarize(mean = mean(age), c_h_count = n()) %>%
  mutate(c_h_percent = c_h_count / sum(c_h_count)) %>%
  mutate(c_count = sum(c_h_count)) %>%
  ungroup(country) %>%
  # note that we always use the finest level of counts to avoid duplicate counting
  mutate(c_percent = c_count / sum(c_h_count))
`summarise()` has grouped output by 'arm', 'gender', 'country'. You can
override using the `.groups` argument.
# A tibble: 16 × 9
# Groups:   arm, gender [4]
   arm   gender country handed  mean c_h_count c_h_percent c_count c_percent
   <fct> <fct>  <fct>   <fct>  <dbl>     <int>       <dbl>   <int>     <dbl>
 1 Arm A Female CAN     Left    38.9        32       0.711      45     0.469
 2 Arm A Female CAN     Right   36.6        13       0.289      45     0.469
 3 Arm A Female USA     Left    40.4        34       0.667      51     0.531
 4 Arm A Female USA     Right   36.9        17       0.333      51     0.531
 5 Arm A Male   CAN     Left    40.4        42       0.656      64     0.610
 6 Arm A Male   CAN     Right   40.2        22       0.344      64     0.610
 7 Arm A Male   USA     Left    39.7        19       0.463      41     0.390
 8 Arm A Male   USA     Right   39.8        22       0.537      41     0.390
 9 Arm B Female CAN     Left    40.3        26       0.565      46     0.5  
10 Arm B Female CAN     Right   40.2        20       0.435      46     0.5  
11 Arm B Female USA     Left    39.2        25       0.543      46     0.5  
12 Arm B Female USA     Right   38.5        21       0.457      46     0.5  
13 Arm B Male   CAN     Left    37.7        37       0.597      62     0.579
14 Arm B Male   CAN     Right   40.6        25       0.403      62     0.579
15 Arm B Male   USA     Left    40.1        25       0.556      45     0.421
16 Arm B Male   USA     Right   39.0        20       0.444      45     0.421

Note that the information in c_count and c_percent must be repeated, as there are only 8 country group summaries but the rest of the data structure requires 16 rows for the information.

The rtables call in contrast is:

basic_table() %>%
  split_cols_by("arm") %>%
  split_cols_by("gender") %>%
  add_colcounts() %>%
  split_rows_by("country") %>%
  summarize_row_groups() %>%
  split_rows_by("handed") %>%
  summarize_row_groups() %>%
  analyze("age", afun = mean, format = "xx.x") %>%
  build_table(df)
                    Arm A                     Arm B         
             Female        Male        Female        Male   
             (N=96)      (N=105)       (N=92)      (N=107)  
————————————————————————————————————————————————————————————
CAN        45 (46.9%)   64 (61.0%)   46 (50.0%)   62 (57.9%)
  Left     32 (33.3%)   42 (40.0%)   26 (28.3%)   37 (34.6%)
    mean      38.9         40.4         40.3         37.7   
  Right    13 (13.5%)   22 (21.0%)   20 (21.7%)   25 (23.4%)
    mean      36.6         40.2         40.2         40.6   
USA        51 (53.1%)   41 (39.0%)   46 (50.0%)   45 (42.1%)
  Left     34 (35.4%)   19 (18.1%)   25 (27.2%)   25 (23.4%)
    mean      40.4         39.7         39.2         40.1   
  Right    17 (17.7%)   22 (21.0%)   21 (22.8%)   20 (18.7%)
    mean      36.9         39.8         38.5         39.0   

Further, the rtable syntax has hopefully also become a bit more straightforward to derive the cell values than with dplyr for this particular table.

Summary

In this vignette learned that:

We recommend that you continue reading the clinical_trials vignette where we create a number of more advanced tables using layouts.