dtrackr - CONSORT statement example

CONSORT statement

CONSORT diagrams are part of the requirements for reporting parallel group clinical trials. They are described in the updated 2010 CONSORT statement (Schulz, Altman, and Moher 2010), and clarify how patients were recruited, selected, randomised and followed up. For observational studies, such as case control designs, the equivalent requirement is the STROBE statement (von Elm et al. 2008). There are similar requirements for other types of study, such as the TRIPOD statement, which applies to multivariable prediction models (Collins et al. 2015).

As we do not have a randomised controlled trial to hand, the following example is geared towards reporting observational studies, and uses the Indian Liver Patients disease dataset (Ramana, Venkateswarlu, and Pradesh, n.d.; Ramana, Babu, and Venkateswarlu 2011; Dua and Graff 2017).

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
#> ✔ tibble  3.1.6     ✔ dplyr   1.0.7
#> ✔ tidyr   1.2.0     ✔ stringr 1.4.0
#> ✔ readr   2.0.2     ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(dtrackr)
#> 
#> Attaching package: 'dtrackr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     add_count, add_tally, bind_rows
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:utils':
#> 
#>     history
#> The following object is masked from 'package:base':
#> 
#>     comment

The Indian liver patients data set contains clinical data from a case control study of patients with and without liver disease. We can use it to pretend we are conducting an observational study.

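Our study criteria, as coded in the pipeline that follows, are:

- include women with a total bilirubin of at least 0.7, and men with a total bilirubin of at least 0.8
- exclude subjects aged under 18 or over 80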

This can be coded into the dplyr pipeline, with additional dtrackr functions:

# Some useful formatting options
old = options(
  dtrackr.strata_glue="{tolower(.value)}",
  dtrackr.strata_sep=", ",
  dtrackr.default_message = "{.count} patients",
  dtrackr.default_headline = NULL
)

ilpd = dtrackr::ILPD %>% 
  track() %>%
  capture_exclusions() %>%
  include_any(
    Gender == "Female" & Total_Bilirubin >= 0.7 ~ "{.included} women with bili>0.7",
    Gender == "Male" & Total_Bilirubin >= 0.8 ~ "{.included} men with bili>0.8"
  ) %>%
  group_by(Case_or_Control, .messages="cases versus controls") %>%
  comment() %>%
  exclude_all(
    Age<18 ~ "{.excluded} subjects under 18",
    Age>80 ~ "{.excluded} subjects over 80"
  ) %>%
  comment(.messages = "{.count} after exclusions") %>%
  status(
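    # mean ± SD of bilirubin (note: the ± values printed in this vignette were generated with sd(Total_Protein))
    mean_bili = sprintf("%1.2f \u00B1 %1.2f",mean(Total_Bilirubin),sd(Total_Bilirubin)),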
    mean_alb = sprintf("%1.2f \u00B1 %1.2f",mean(Albumin),sd(Albumin)),
    .messages = c(
      "bilirubin: {mean_bili}",
      "albumin: {mean_alb}"
    )                    
  ) %>%
  ungroup(.messages = "{.count} in final data set")

# restore to originals
options(old)
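For intuition, the inclusion and exclusion steps correspond roughly to the plain dplyr filtering sketched here. This is a minimal illustration on a hypothetical toy data frame (the `toy` rows are invented, not taken from ILPD), and it ignores missing values and all of the history tracking that dtrackr adds.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration (not ILPD records)
toy <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  Total_Bilirubin = c(0.6, 0.9, 0.8, 0.5, 2.1),
  Age = c(45, 17, 30, 60, 85)
)

# include_any(): a row is kept if it matches at least one inclusion rule
included <- toy %>%
  filter(
    (Gender == "Female" & Total_Bilirubin >= 0.7) |
      (Gender == "Male" & Total_Bilirubin >= 0.8)
  )

# exclude_all(): a row is dropped if it matches any exclusion rule
final <- included %>%
  filter(!(Age < 18 | Age > 80))

nrow(included)  # 3 rows pass the inclusion rules
nrow(final)     # 1 row remains after the age exclusions
```

The dtrackr versions apply the same logic but also record, for each rule, how many rows it included or excluded.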

With a bit of experimentation, the flowchart needed for a STROBE or CONSORT checklist can be generated. One output option is SVG, which can then be manually formatted as required, but for publication-ready output PDF is usually preferred.

# saving this flowchart for the JOSS paper.
# ilpd %>% flowchart(filename = here::here("vignettes/joss/figure1-ilpd-consort.pdf"))
ilpd %>% flowchart()
[Flowchart: 583 patients → inclusions: 128 women with bili>0.7, 354 men with bili>0.8 → cases versus controls → case: 357 patients; control: 125 patients → exclusions (case: 12 subjects under 18, 1 subjects over 80; control: 7 subjects under 18, 2 subjects over 80) → case: 344 after exclusions; control: 116 after exclusions → case: bilirubin 4.79 ± 1.07, albumin 3.03 ± 0.78; control: bilirubin 1.35 ± 1.05, albumin 3.38 ± 0.78 → 460 in final data set]

Excluded data

During this pipeline, we may be keen to understand why certain data items are being rejected. This lets us examine the source data and potentially correct it during the data collection process. As we regularly conduct analyses, we have used this to run continuous quality checks on the data and feed the results back to the data curators. By tracking the exclusions, we not only track the data flow through the pipeline but also retain every excluded item together with the reason for its exclusion, so we can reassure ourselves that the exclusions are as expected. We enabled this by calling capture_exclusions() in the pipeline above. Having tracked the exclusions, we can retrieve them by calling excluded(), which gives a data frame of the excluded records and the reasons. If the exclusions happened over multiple stages and the data frame format changed in between, the result is held as a nested data frame (see ?tidyr::nest):


# here we filter out the majority of the actual content of the excluded data to focus on the 
# metadata recovered during the exclusion.
ilpd %>% excluded() %>% select(.stage,.message,.filter,Age, Gender)
#> # A tibble: 22 × 5
#>    .stage  .message             .filter  Age   Gender
#>    <chr>   <glue>               <chr>    <chr> <chr> 
#>  1 stage 1 7 subjects under 18  Age < 18 17    Male  
#>  2 stage 1 7 subjects under 18  Age < 18 17    Female
#>  3 stage 1 7 subjects under 18  Age < 18 4     Male  
#>  4 stage 1 7 subjects under 18  Age < 18 4     Male  
#>  5 stage 1 7 subjects under 18  Age < 18 14    Male  
#>  6 stage 1 7 subjects under 18  Age < 18 12    Male  
#>  7 stage 1 7 subjects under 18  Age < 18 17    Male  
#>  8 stage 1 12 subjects under 18 Age < 18 14    Male  
#>  9 stage 1 12 subjects under 18 Age < 18 17    Male  
#> 10 stage 1 12 subjects under 18 Age < 18 15    Male  
#> # … with 12 more rows

This list may have multiple entries for a single data item if, for example, that item is excluded in a single step for more than one reason.
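To see how many records each rule removed, the excluded records can be summarised by reason. The following is a minimal sketch on a hypothetical toy data frame (the `toy` data and its `id` column are invented for illustration); it assumes excluded() returns an ordinary data frame with `.message` and `.filter` columns, as in the output shown above.

```r
library(dplyr)
library(dtrackr)

# Hypothetical toy data, invented for illustration
toy <- dplyr::tibble(id = 1:6, Age = c(10, 25, 40, 85, 17, 90))

tracked <- toy %>%
  track() %>%
  capture_exclusions() %>%
  exclude_all(
    Age < 18 ~ "{.excluded} subjects under 18",
    Age > 80 ~ "{.excluded} subjects over 80"
  )

# One row per exclusion reason, with the number of records it removed
tracked %>%
  excluded() %>%
  mutate(.message = as.character(.message)) %>%  # drop the glue class before grouping
  count(.message, .filter)
```

The same summary applied to the `ilpd` pipeline would give the per-rule counts that appear in the flowchart.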

It would be interesting to integrate this into a continuous integration workflow to run automated checks on data as it is collected.
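One way such a check might look is sketched below. `check_exclusion_rate()` is a hypothetical helper written for this illustration, not part of dtrackr, and the 10% threshold is an arbitrary assumption; it stops with an error when too large a share of records has been excluded, which would fail a CI job.

```r
# Hypothetical helper (not part of dtrackr): fail when too many records are excluded
check_exclusion_rate <- function(n_excluded, n_total, max_fraction = 0.1) {
  stopifnot(n_total > 0, n_excluded >= 0, n_excluded <= n_total)
  rate <- n_excluded / n_total
  if (rate > max_fraction) {
    stop(sprintf(
      "%.1f%% of records excluded (limit %.1f%%)",
      100 * rate, 100 * max_fraction
    ))
  }
  invisible(rate)
}

# Counts from the example above: 22 of the 482 included patients were outside the age limits
check_exclusion_rate(22, 482, max_fraction = 0.1)
```

In a real workflow the first argument could come from the number of rows returned by excluded().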

Tagging the pipeline

For reporting results, it is useful to have the numbers from the flowchart available to embed in the text of the results section of the write-up. Here we show the same pipeline as above, but with four uses of the .tag system for labelling parts of the pipeline. This captures data in a tag-value list during the pipeline, and retains it as metadata for later reuse.


ilpd = dtrackr::ILPD %>% 
  
  track(.tag="initial cohort") %>%
  #     ^^^^^^^^^^^^^^^^^^^^^
  #     TAGS DEFINED
  
  capture_exclusions() %>%
  
  include_any(
    Gender == "Female" & Total_Bilirubin >= 0.7 ~ "{.included} women with bili>0.7",
    Gender == "Male" & Total_Bilirubin >= 0.8 ~ "{.included} men with bili>0.8"
  ) %>%
  
  group_by(Case_or_Control, .messages="cases versus controls") %>%
  
  comment(.tag="study cohort") %>%
  #       ^^^^^^^^^^^^^^^^^^^
  #       SECOND SET OF TAGS DEFINED
  
  exclude_all(
    Age<18 ~ "{.excluded} subjects under 18",
    Age>80 ~ "{.excluded} subjects over 80"
  ) %>%
  
  comment(.messages = "{.count} after exclusions") %>%
  
  status(
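    # mean ± SD of bilirubin (note: the ± values printed in this vignette were generated with sd(Total_Protein))
    mean_bili = sprintf("%1.2f \u00B1 %1.2f",mean(Total_Bilirubin),sd(Total_Bilirubin)),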
    mean_alb = sprintf("%1.2f \u00B1 %1.2f",mean(Albumin),sd(Albumin)),
    .messages = c(
      "bilirubin: {mean_bili}",
      "albumin: {mean_alb}"
    ),
    .tag = "qualifying patients"
  # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  # THIRD SET OF TAGS DEFINED
  ) %>%
  
  ungroup(.messages = "{.count} in final data set", .tag="final set")
  #                                                 ^^^^^^^^^^^^^^^^
  #                                                 LAST TAGS DEFINED

The tagged data can be retrieved as follows, which will give you all tagged data for all 4 points in the pipeline:

ilpd %>% tagged() %>% tidyr::unnest(.content)
#> # A tibble: 6 × 7
#>   .tag                .count .total .strata   Case_or_Control mean_bili mean_alb
#>   <chr>                <int>  <int> <chr>     <ord>           <chr>     <chr>   
#> 1 initial cohort         583    583 ""        <NA>            <NA>      <NA>    
#> 2 study cohort           357    482 "Case_or… case            <NA>      <NA>    
#> 3 study cohort           125    482 "Case_or… control         <NA>      <NA>    
#> 4 qualifying patients    344    460 "Case_or… case            4.79 ± 1… 3.03 ± …
#> 5 qualifying patients    116    460 "Case_or… control         1.35 ± 1… 3.38 ± …
#> 6 final set              460    460 ""        <NA>            <NA>      <NA>

More often, though, you will want to retrieve specific values from specific points in the pipeline for the results text, for example:

initialSet = ilpd %>% tagged(.tag = "initial cohort", .glue = "{.count} patients")
finalSet = ilpd %>% tagged(.tag = "final set", .glue = "{.count} patients")

# there were `r initialSet` in the study, of whom `r finalSet` met the eligibility criteria.

Rendered, this reads: there were 583 patients in the study, of whom 460 patients met the eligibility criteria.

More complex formatting and calculations are made possible by use of the glue specification, including those that happen on a per group basis, and we can also pull in values from elsewhere in our analysis.

ilpd %>% tagged(
    .tag = "qualifying patients", 
    .glue = "{.strata}: {.count}/{.total} ({sprintf('%1.1f', .count/.total*100)}%) patients on {sysDate}, with a mean bilirubin of {mean_bili}", 
    sysDate = Sys.Date()
    # we could have included any number of other parameters here from the global environment
  ) %>% dplyr::pull(.label)
#> Case_or_Control:case: 344/460 (74.8%) patients on 2022-07-04, with a mean bilirubin of 4.79 ± 1.07
#> Case_or_Control:control: 116/460 (25.2%) patients on 2022-07-04, with a mean bilirubin of 1.35 ± 1.05

Sometimes it will be necessary to operate on all tagged content at once. This is possible, but be aware that the content available depends on where the tag was set in the pipeline, so not all fields will always be present (although .count and .total will be). The .total is the overall number of cases at that point in the pipeline, and .count is the number of cases in each stratum.

ilpd %>% tagged(.glue = "{.count}/{.total} patients")
#> # A tibble: 6 × 3
#>   .tag                .strata                   .label          
#>   <chr>               <chr>                     <glue>          
#> 1 initial cohort      ""                        583/583 patients
#> 2 study cohort        "Case_or_Control:case"    357/482 patients
#> 3 study cohort        "Case_or_Control:control" 125/482 patients
#> 4 qualifying patients "Case_or_Control:case"    344/460 patients
#> 5 qualifying patients "Case_or_Control:control" 116/460 patients
#> 6 final set           ""                        460/460 patients
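As a worked check of how .count and .total relate, the per-stratum percentages for the "qualifying patients" tag can be recomputed by hand in base R. The counts are copied from the output above; the object names are only illustrative.

```r
# Counts per stratum at the "qualifying patients" tag, copied from the output above
counts <- c(case = 344, control = 116)

# .total is the sum of .count over all strata at that point in the pipeline
total <- sum(counts)

labels <- sprintf(
  "%s: %d/%d (%1.1f%%)",
  names(counts), counts, total, 100 * counts / total
)
print(labels)
# "case: 344/460 (74.8%)"    "control: 116/460 (25.2%)"
```

These percentages match the 74.8% and 25.2% produced by the glue specification earlier.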

For comparing inclusions and exclusions at different stages in the pipeline using tags, the following example may be useful:

ilpd %>% 
  tagged() %>%   # selects only top level content
  tidyr::unnest(.content) %>% 
  dplyr::select(.tag, .total) %>% 
  dplyr::distinct() %>%
  tidyr::pivot_wider(values_from=.total, names_from=.tag) %>% 
  glue::glue_data("Out of {`initial cohort`} patients, {`study cohort`} were eligible for inclusion on the basis of their liver function tests but {`study cohort`-`qualifying patients`} were 
                  outside the age limits. This left {`final set`} patients included in the final study (i.e. overall {`initial cohort`-`final set`} were removed).")
#> Out of 583 patients, 482 were eligible for inclusion on the basis of their liver function tests but 22 were 
#> outside the age limits. This left 460 patients included in the final study (i.e. overall 123 were removed).
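The reshaping step in that pipeline can be followed on its own with a hand-built table of tag totals. This sketch assumes only dplyr and tidyr; the `totals` table is typed in from the tagged output shown earlier rather than extracted from the tracked data frame.

```r
library(dplyr)
library(tidyr)

# Tag totals typed in from the tagged() output shown earlier (one row per tag and stratum)
totals <- dplyr::tibble(
  .tag = c("initial cohort", "study cohort", "study cohort",
           "qualifying patients", "qualifying patients", "final set"),
  .total = c(583L, 482L, 482L, 460L, 460L, 460L)
)

# distinct() collapses the per-stratum duplicates; pivot_wider() gives one column per tag
wide <- totals %>%
  distinct() %>%
  pivot_wider(values_from = .total, names_from = .tag)

wide$`study cohort` - wide$`qualifying patients`  # 22 outside the age limits
wide$`initial cohort` - wide$`final set`          # 123 removed overall
```

With one column per tag, differences between any two stages are simple column arithmetic, which is what the glue template relies on.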

References

Collins, Gary S., Johannes B. Reitsma, Douglas G. Altman, and Karel GM Moons. 2015. “Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): The TRIPOD Statement.” BMC Medicine 13 (1): 1. https://doi.org/10.1186/s12916-014-0241-z.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml/.

Elm, Erik von, Douglas G Altman, Matthias Egger, Stuart J Pocock, Peter C Gøtzsche, Jan P Vandenbroucke, and STROBE Initiative. 2008. “The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement: Guidelines for Reporting Observational Studies.” Journal of Clinical Epidemiology 61 (4): 344–49.

Ramana, B., M. Babu, and N. Venkateswarlu. 2011. “A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis.” https://doi.org/10.5121/IJDMS.2011.3207.

Ramana, Bendi Venkata, Prof N. B. Venkateswarlu, and Andhra Pradesh. n.d. “A Critical Comparative Study of Liver Patients from USA and INDIA: An Exploratory Analysis.”

Schulz, Kenneth F., Douglas G. Altman, and David Moher. 2010. “CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials.” BMJ 340 (March): c332. https://doi.org/10.1136/bmj.c332.