dtrackr - Basic operations

Tracking data provenance

When wrangling raw data into a form suitable for analysis there may be many steps where explicit or implicit assumptions are made about the nature of the data, which determine the nature of the resulting analysis data-set. Checking that these assumptions have the expected results as the analysis is proceeding can be time consuming. If a data pipeline is broken up over a number of parameterised functions, following the data flow through the code is time consuming and getting an overview difficult. This is where tracking data provenance can help, by monitoring the steps the data goes through and summarizing the outcomes of the steps as they occur we can generate a flow chart of the data history.

# devtools::load_all()
library(tidyverse)
library(dtrackr)

Basic operation - comments and stratification

The main documentation for data pipelines is provided by the comment function. This (and all other functions) uses a glue package specification to define the comment text. This glue specification can use a range of variables to describe the data as it passes through the pipeline. Firstly it can use global variables such as filename in this example. Secondly the .count variable is the number of rows in the current group. Thirdly the .strata variable is defined to be a description of the group(s) we are currently in, but in grouped data the grouping variables (in this case Species) can also be used. Finally the .total variable returns the whole size of the ungrouped data-set.

Comments can either be a single .headline or a list of .messages. Setting either of these to "" disables them for a given comment. As in the example, thanks to glue, any expression can be evaluated in the messages but be warned, debugging them is hard. If an error in the glue spec is present dtrackr will try and tell you what is the problem was and what variables should have been available for the glue spec. (N.B. A common mistake is to use .message rather than .messages when providing a glue spec.)

# devtools::load_all()
filename = "~/tmp/iris.csv"
# this is just a pretend example
# iris = read.csv(filename)


iris %>%
  track() %>%
  comment(
    .headline = "Iris data:",
    .messages = c(
      "loaded from \"{filename}\"",
      "starts with {.count} items")) %>%
  group_by(Species) %>%
  comment(
    .headline = "{.strata}",
    .messages = c(
    "In {Species}",
    "there are {.count} items",
    "that is {sprintf('%1.0f',.count/.total*100)}% of the total"),
    .tag = "note1"
    ) %>%
  ungroup() %>%
  comment("Final data has {.total} items", .tag="note2") %>%  
  flowchart()
%0 7:s->8 4:s->7 5:s->7 6:s->7 3:s->4 3:s->5 3:s->6 2:s->3 1:s->2 8 Final data has 150 items 7 150 items 4 Species:setosa In setosa there are 50 items that is 33% of the total 5 Species:versicolor In versicolor there are 50 items that is 33% of the total 6 Species:virginica In virginica there are 50 items that is 33% of the total 3 stratify by Species 2 Iris data: loaded from "~/tmp/iris.csv" starts with 150 items 1 150 items

Status - Further analysis in the workflow

In the middle of a pipeline you may wish to document something about the data that is more complex than the simple counts. comment has a more sophisticated counterpart status. status is essentially a dplyr summarisation step which is connected to a glue specification output, that is recorded in the data frame history. In plain English this means you can do an arbitrary summarisation and put the result into the flowchart.


iris %>%
  track("starts with {.count} items") %>%
  group_by(Species) %>%
  status(
    petalMean = sprintf("%1.1f", mean(Petal.Width)),
    petalSd = sprintf("%1.1f", sd(Petal.Width)),
    .messages = c(
    "In {Species} the petals are",
    "on average {petalMean} \u00B1 {petalSd} cms wide")) %>%
  ungroup(.messages = "ends with {.total} items") %>%
  flowchart()
%0 3:s->6 4:s->6 5:s->6 2:s->3 2:s->4 2:s->5 1:s->2 6 ends with 150 items 3 Species:setosa In setosa the petals are on average 0.2 ± 0.1 cms wide 4 Species:versicolor In versicolor the petals are on average 1.3 ± 0.2 cms wide 5 Species:virginica In virginica the petals are on average 2.0 ± 0.3 cms wide 2 stratify by Species 1 starts with 150 items

A frequent use case for more detailed description is to have a subgroup count within a flowchart. This is different enough to have its own function count_subgroup(). This works best for factor subgroup columns but other data will be converted to a factor automatically. As this uses glue specifications a modified formatted output is also possible as in this example, where the subgroup percentages are calculated to 1dp.

diamonds %>%
  track() %>%
  group_by(cut) %>%
  count_subgroup(
    color,
    .messages = "colour {.name}: {sprintf('%1.1f%%', {.count}/{.subtotal}*100)}"
  ) %>%
  ungroup() %>%
  flowchart()
%0 3:s->8 4:s->8 5:s->8 6:s->8 7:s->8 2:s->3 2:s->4 2:s->5 2:s->6 2:s->7 1:s->2 8 53940 items 3 cut:Fair colour D: 10.1% colour E: 13.9% colour F: 19.4% colour G: 19.5% colour H: 18.8% colour I: 10.9% colour J: 7.4% 4 cut:Good colour D: 13.5% colour E: 19.0% colour F: 18.5% colour G: 17.8% colour H: 14.3% colour I: 10.6% colour J: 6.3% 5 cut:Ideal colour D: 13.2% colour E: 18.1% colour F: 17.8% colour G: 22.7% colour H: 14.5% colour I: 9.7% colour J: 4.2% 6 cut:Premium colour D: 11.6% colour E: 16.9% colour F: 16.9% colour G: 21.2% colour H: 17.1% colour I: 10.4% colour J: 5.9% 7 cut:Very Good colour D: 12.5% colour E: 19.9% colour F: 17.9% colour G: 19.0% colour H: 15.1% colour I: 10.0% colour J: 5.6% 2 stratify by cut 1 53940 items

Filtering, exclusions and inclusions

Documenting the data set is only useful if you can also manipulate it, and one part of this is including and excluding things we don’t want. The standard dplyr::filter approach works for this, and we can use the before and after .count.in and .count.out, and the difference between the two .excluded to document what the result was.

In this example we exclude items that are >1 SD above the mean. The default message (.messages = "excluded {.excluded} items") has been left as is which simply returns how many things have been excluded. With no customization the goal is for the pipeline to look as much as possible like a dplyr pipeline.


iris %>%
  track() %>%
  group_by(Species) %>%
  filter(
    Petal.Width < mean(Petal.Width)+sd(Petal.Width)
  ) %>%
  ungroup() %>%
  flowchart()
%0 2:s->6 2:e->3 2:e->4 2:e->5 1:s->2 6 130 items 2 stratify by Species 3 Species:setosa excluded 9 items 4 Species:versicolor excluded 5 items 5 Species:virginica excluded 6 items 1 150 items

This is useful but the reason for exclusion is not as clear as we would like, and this does not scale particularly well to multiple criteria, which are typical of the filters needed to massage real life data. For this we have written exclude_all which takes multiple criteria and applies them in a step-wise manner, summarising at each step. Rather than a logical expression expected by dplyr::filter we provide matching criteria as a formula relating the criteria to the glue specification (a trick inspired by case_when’s syntax). This is very much slower than filter but gives fine control over the output.

It should be noted that the logic of exlude_all is reversed compared to base filter for which a TRUE value is INCLUDED. In this example there are no missing values, however the behaviour of the filter when filter expressions cannot be evaluated and NAs are generated, is controlled by na.rm. This defaults to FALSE which means that values that cannot be evaluated are NOT excluded. You can also explicitly check for missingness in the filter expression.

Exclusions produced like this are additive and the items may be counted in more than one exclusion category, and so won’t add up to an exclusion total.


dataset1 = iris %>%
  track() %>%
  comment("starts with {.count} items") %>%
  exclude_all(
    Species=="versicolor" ~ "removing {.excluded} versicolor"
  ) %>%
  group_by(Species) %>%
  comment("{Species} has {.count} items") %>%
  exclude_all(
    Petal.Width > mean(Petal.Width)+sd(Petal.Width) ~ "{.excluded} with petals > 1 SD wider than the mean",
    Petal.Length > mean(Petal.Length)+sd(Petal.Length) ~ "{.excluded} with petals > 1 SD longer than the mean",
    Sepal.Width > mean(Sepal.Width)+sd(Sepal.Width) ~ "{.excluded} with sepals > 1 SD wider than the mean",
    Sepal.Length > mean(Sepal.Length)+sd(Sepal.Length) ~ "{.excluded} with sepals > 1 SD longer than the mean"
  ) %>%
  comment("{Species} now has {.count} items") %>%
  ungroup() %>%
  comment("ends with {.total} items")

dataset1 %>% flowchart()
%0 11:s->12 9:s->11 10:s->11 5:s->9 5:e->7 6:s->10 6:e->8 4:s->5 4:s->6 2:s->4 2:e->3 1:s->2 12 ends with 67 items 11 67 items 9 Species:setosa setosa now has 33 items 10 Species:virginica virginica now has 34 items 5 Species:setosa setosa has 50 items 6 Species:virginica virginica has 50 items 7 Species:setosa 9 with petals > 1 SD wider than the mean 6 with petals > 1 SD longer than the mean 6 with sepals > 1 SD wider than the mean 10 with sepals > 1 SD longer than the mean 8 Species:virginica 6 with petals > 1 SD wider than the mean 6 with petals > 1 SD longer than the mean 8 with sepals > 1 SD wider than the mean 8 with sepals > 1 SD longer than the mean 4 stratify by Species 2 starts with 150 items 3 removing 50 versicolor 1 150 items

Sometimes inclusion criteria are more important. For this we use include_any which works in a similar manner but including items which match any of the supplied criteria, essentially combining with a logical OR operation, and in this case resulting in very different result from our previous example.


dataset2 = iris %>%
  track() %>%
  comment("starts with {.count} items") %>%
  include_any(
    Species=="versicolor" ~ "{.included} versicolor",
    Species=="setosa" ~ "{.included} setosa"
  ) %>%
  #mutate(Species = forcats::fct_drop(Species)) %>%
  group_by(Species) %>%
  comment("{Species} has {.count} items") %>%
  include_any(
    Petal.Width < mean(Petal.Width)+sd(Petal.Width) ~ "{.included} with petals <= 1 SD wider than the mean",
    Petal.Length < mean(Petal.Length)+sd(Petal.Length) ~ "{.included} with petals <= 1 SD longer than the mean",
    Sepal.Width < mean(Sepal.Width)+sd(Sepal.Width) ~ "{.included} with sepals <= 1 SD wider than the mean",
    Sepal.Length < mean(Sepal.Length)+sd(Sepal.Length) ~ "{.included} with sepals <= 1 SD longer than the mean"
  ) %>%
  comment("{Species} now has {.count} items") %>%
  ungroup() %>%
  comment("ends with {.total} items")
  
dataset2 %>% flowchart()
%0 11:s->12 9:s->11 10:s->11 7:s->9 8:s->10 5:s->7 6:s->8 4:s->5 4:s->6 3:s->4 2:s->3 1:s->2 12 ends with 99 items 11 99 items 9 Species:setosa setosa now has 49 items 10 Species:versicolor versicolor now has 50 items 7 Species:setosa inclusions: 41 with petals <= 1 SD wider than the mean 44 with petals <= 1 SD longer than the mean 44 with sepals <= 1 SD wider than the mean 40 with sepals <= 1 SD longer than the mean 8 Species:versicolor inclusions: 45 with petals <= 1 SD wider than the mean 44 with petals <= 1 SD longer than the mean 42 with sepals <= 1 SD wider than the mean 41 with sepals <= 1 SD longer than the mean 5 Species:setosa setosa has 50 items 6 Species:versicolor versicolor has 50 items 4 stratify by Species 3 inclusions: 50 versicolor 50 setosa 2 starts with 150 items 1 150 items

Displaying combined data flows

It is possible to merge data flows into the same flow diagram. This might make sense if you want to try and document a branching data pipeline. This is not obviously essential but is possible.


p_flowchart(list(dataset1,dataset2))
%0 21:s->22 19:s->21 20:s->21 11:s->12 17:s->19 18:s->20 9:s->11 10:s->11 15:s->17 16:s->18 5:s->9 5:e->7 6:s->10 6:e->8 14:s->15 14:s->16 4:s->5 4:s->6 13:s->14 2:s->4 2:s->13 2:e->3 1:s->2 22 ends with 99 items 21 99 items 12 ends with 67 items 19 Species:setosa setosa now has 49 items 20 Species:versicolor versicolor now has 50 items 11 67 items 17 Species:setosa inclusions: 41 with petals <= 1 SD wider than the mean 44 with petals <= 1 SD longer than the mean 44 with sepals <= 1 SD wider than the mean 40 with sepals <= 1 SD longer than the mean 18 Species:versicolor inclusions: 45 with petals <= 1 SD wider than the mean 44 with petals <= 1 SD longer than the mean 42 with sepals <= 1 SD wider than the mean 41 with sepals <= 1 SD longer than the mean 9 Species:setosa setosa now has 33 items 10 Species:virginica virginica now has 34 items 15 Species:setosa setosa has 50 items 16 Species:versicolor versicolor has 50 items 5 Species:setosa setosa has 50 items 6 Species:virginica virginica has 50 items 7 Species:setosa 9 with petals > 1 SD wider than the mean 6 with petals > 1 SD longer than the mean 6 with sepals > 1 SD wider than the mean 10 with sepals > 1 SD longer than the mean 8 Species:virginica 6 with petals > 1 SD wider than the mean 6 with petals > 1 SD longer than the mean 8 with sepals > 1 SD wider than the mean 8 with sepals > 1 SD longer than the mean 14 stratify by Species 4 stratify by Species 13 inclusions: 50 versicolor 50 setosa 2 starts with 150 items 3 removing 50 versicolor 1 150 items

Excluded data

When considering a data pipeline it is sometimes important to know what has been excluded and at what stage. This can help for debugging or for addressing data quality issues. Dtrackr can collect exclusions at the same time as the history graph. This is enabled by the capture_exclusions() flag.


tmp = iris %>%
  track() %>% 
  capture_exclusions() %>%
  exclude_all(
    Petal.Length > 5.8 ~ "{.excluded} long ones",
    Petal.Length < 1.3 ~ "{.excluded} short ones",
    .stage = "petal length exclusion"
  ) %>%
  comment("leaving {.count}") %>%
  group_by(Species) %>%
  filter(
    Sepal.Length >= quantile(Sepal.Length, 0.05),
    .messages="removing {.count.in-.count.out} with sepals < q 0.05",
    .type = "comment",
    .stage = "sepal length exclusion"
  ) %>%
  comment("leaving {.count}") %>%
  exclude_all(
    Petal.Width < 0.2 ~ "{.excluded} narrow ones",
    Petal.Width > 2.1 ~ "{.excluded} wide ones"
  ) %>%
  comment("leaving {.count}")


tmp %>% flowchart()
%0 8:s->14 8:e->11 9:s->15 9:e->12 10:s->16 10:e->13 5:s->8 6:s->9 7:s->10 4:s->5 4:s->6 4:s->7 3:s->4 1:s->3 1:e->2 14 Species:setosa leaving 39 15 Species:versicolor leaving 47 16 Species:virginica leaving 24 8 Species:setosa leaving 43 9 Species:versicolor leaving 47 10 Species:virginica leaving 35 11 Species:setosa 4 narrow ones 0 wide ones 12 Species:versicolor 0 narrow ones 0 wide ones 13 Species:virginica 0 narrow ones 11 wide ones 5 Species:setosa removing 3 with sepals < q 0.05 6 Species:versicolor removing 3 with sepals < q 0.05 7 Species:virginica removing 2 with sepals < q 0.05 4 stratify by Species 3 leaving 133 1 150 items 2 13 long ones 4 short ones

Give the previous data pipeline we can identify the items that were excluded, the phase at which they were excluded and details of the code that resulted in them being excluded.

tmp %>% excluded()
#> # A tibble: 40 × 9
#>    .stage     .strata .message Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <chr>      <chr>   <glue>   <chr>        <chr>       <chr>        <chr>      
#>  1 petal len… ""      13 long… 6.3          3.3         6            2.5        
#>  2 petal len… ""      13 long… 7.1          3           5.9          2.1        
#>  3 petal len… ""      13 long… 7.6          3           6.6          2.1        
#>  4 petal len… ""      13 long… 7.3          2.9         6.3          1.8        
#>  5 petal len… ""      13 long… 7.2          3.6         6.1          2.5        
#>  6 petal len… ""      13 long… 7.7          3.8         6.7          2.2        
#>  7 petal len… ""      13 long… 7.7          2.6         6.9          2.3        
#>  8 petal len… ""      13 long… 7.7          2.8         6.7          2          
#>  9 petal len… ""      13 long… 7.2          3.2         6            1.8        
#> 10 petal len… ""      13 long… 7.4          2.8         6.1          1.9        
#> # … with 30 more rows, and 2 more variables: Species <chr>, .filter <chr>

Advanced handling of the history graph.

The history graph is a stored as an attribute on a tracked dataframe. The contents of this attribute is a list of dataframes including an edge list and node list. These can be imported into other graph processing packages, and visualised in different ways.

tmp2 = tmp %>% p_get()

# the nodes, .id is a graph unique identifier
tmp2$nodes %>% glimpse()
#> Rows: 16
#> Columns: 6
#> $ .id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
#> $ .rank   <dbl> 1, 1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6
#> $ .strata <chr> "", "", "", "", "Species:setosa", "Species:versicolor", "Speci…
#> $ .label  <chr> "150 items<BR ALIGN='LEFT'/>", "13 long ones<BR ALIGN='LEFT'/>…
#> $ .type   <chr> "info", "exclusion", "info", "stratify", "comment", "comment",…
#> $ .stage  <chr> "", "petal length exclusion", "", "", "sepal length exclusion"…

# the edges, .to and .from are foreign keys for .id
tmp2$edges %>% glimpse()
#> Rows: 15
#> Columns: 5
#> $ .to          <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
#> $ .from        <dbl> 1, 1, 3, 4, 4, 4, 5, 6, 7, 8, 9, 10, 8, 9, 10
#> $ .rel         <chr> "exclusion", "info", "stratify", "comment", "comment", "c…
#> $ .strata      <chr> "", "", "", "Species:setosa", "Species:versicolor", "Spec…
#> $ .strata.prev <chr> "", "", "", "", "", "", "Species:setosa", "Species:versic…

The GraphViz language provides many options for formatting the flowchart. Rather than try and provide and interface for them we have gone for sane defaults. If you want to change this or use a different layout engine the GraphViz output can be retrieved and edited directly. Alternatively after rendering SVG output can be edited but hand.

cat(tmp %>% p_get_as_dot())
#> digraph {
#>      graph [layout = 'dot',
#>         splines='ortho',
#>         rankdir = 'TB',
#>         outputorder = 'edgesfirst',
#>         bgcolor = 'white',
#>         ranksep = '0.25',
#>         nodesep = '0.2',
#>         newrank='true']
#> 
#>     node [fontname = 'Helvetica',
#>         fontsize = '8',
#>         shape='box',
#>         fixedsize = 'false',
#>         margin = '0.1,0.1',
#>         width = '0',
#>         height = '0',
#>         style = 'filled',
#>         color = 'black',
#>         fontcolor = 'black',
#>         labeljust='l']
#> 
#>     edge [fontname = 'Helvetica',
#>         fontsize = '8',
#>         len = '0.5',
#>         color = 'black',
#>         arrowsize = '0.5']
#>     
#> { rank='same';
#> '14' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>leaving 39<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='white'];
#> '15' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>leaving 47<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='white'];
#> '16' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>leaving 24<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='white'];
#> }
#> { rank='same';
#> '8' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>leaving 43<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='white'];
#> '9' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>leaving 47<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='white'];
#> '10' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>leaving 35<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='white'];
#> '11' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>4 narrow ones<BR ALIGN='LEFT'/>0 wide ones<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='grey80'];
#> '12' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>0 narrow ones<BR ALIGN='LEFT'/>0 wide ones<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='grey80'];
#> '13' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>0 narrow ones<BR ALIGN='LEFT'/>11 wide ones<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='grey80'];
#> }
#> { rank='same';
#> '5' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>removing 3 with sepals &lt; q 0.05<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='white'];
#> '6' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>removing 3 with sepals &lt; q 0.05<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='white'];
#> '7' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>removing 2 with sepals &lt; q 0.05<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='white'];
#> }
#> { rank='same';
#> '4' [label=<stratify by Species<BR ALIGN='LEFT'/>>,group='',fillcolor='white'];
#> }
#> { rank='same';
#> '3' [label=<leaving 133<BR ALIGN='LEFT'/>>,group='',fillcolor='white'];
#> }
#> { rank='same';
#> '1' [label=<150 items<BR ALIGN='LEFT'/>>,group='',fillcolor='white'];
#> '2' [label=<13 long ones<BR ALIGN='LEFT'/>4 short ones<BR ALIGN='LEFT'/>>,group='',fillcolor='grey80'];
#> }
#> 
#> 
#> '10' -> '16' [tailport='s',weight='100']
#> '9' -> '15' [tailport='s',weight='100']
#> '8' -> '14' [tailport='s',weight='100']
#> '10' -> '13' [tailport='e',weight='1']
#> '9' -> '12' [tailport='e',weight='1']
#> '8' -> '11' [tailport='e',weight='1']
#> '7' -> '10' [tailport='s',weight='100']
#> '6' -> '9' [tailport='s',weight='100']
#> '5' -> '8' [tailport='s',weight='100']
#> '4' -> '7' [tailport='s',weight='100']
#> '4' -> '6' [tailport='s',weight='100']
#> '4' -> '5' [tailport='s',weight='100']
#> '3' -> '4' [tailport='s',weight='100']
#> '1' -> '3' [tailport='s',weight='100']
#> '1' -> '2' [tailport='e',weight='1']
#> }