dtrackr - Joining data pipelines

Joins across data sets

dtrackr supports joins between tracked data frames, which lets us combine linked data sets. In this toy example, the data sets describe characters from a popular film from my youth.

library(dplyr)
library(tidyr)
library(dtrackr)
# One table of people, plus one long-format link table per list column
people = starwars %>% select(-films, -vehicles, -starships)
vehicles = starwars %>% select(name,vehicles) %>% unnest(cols = c(vehicles))
starships = starwars %>% select(name,starships) %>% unnest(cols = c(starships))
films = starwars %>% select(name,films) %>% unnest(cols = c(films))

tmp1 = people %>% track() %>% comment("People df {.total}")
tmp2 = films %>% track() %>% comment("Films df {.total}") %>% comment("a test comment")

tmp1 %>% inner_join(tmp2, by="name") %>% comment("joined {.total}") %>% flowchart()
[Flowchart: "87 items" -> "People df 87"; "173 items" -> "Films df 173" -> "a test comment"; both branches meet at "Inner join by name: 87 on LHS, 173 on RHS, 173 in linked set" -> "joined 173"]
# The join message is configurable but defaults to 
# {.count.lhs} on LHS
# {.count.rhs} on RHS
# {.count.out} in linked set
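
As a minimal sketch of overriding that default, the example below relabels the join stage. It assumes that dtrackr's `inner_join()` method accepts `.headline` and `.messages` arguments and that `{.keys}` is available in the headline; check `?dtrackr::inner_join` for the version you have installed. The names `lhs`, `rhs` and `joined` are illustrative only.

```r
library(dplyr)
library(tidyr)
library(dtrackr)

people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name, films) %>% unnest(cols = c(films))

lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}")

# Assumption: `.headline` and `.messages` control the label of the join stage.
# The {.count.*} placeholders are the ones listed in the default message above.
joined = lhs %>% inner_join(
  rhs, by = "name",
  .headline = "Link people to films by {.keys}",
  .messages = c(
    "{.count.lhs} people",
    "{.count.rhs} film appearances",
    "{.count.out} linked rows"
  )
)

joined %>% flowchart()
```

The rest of the pipeline is unchanged; only the text shown in the join box of the flowchart should differ.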

dtrackr supports all join types and reports the number of rows on each side of the join and in the result. This can help detect whether any data items are lost during the join. However, dtrackr does not yet capture the data that are excluded during a join, because what counts as excluded depends on the type of join.
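
Until exclusions are captured, one way to inspect what a join would drop is to run a filtering join alongside the pipeline. The sketch below uses plain dplyr on untracked data and is not a dtrackr feature. It uses the `vehicles` link table because `unnest()` drops characters with no vehicles, so an inner join against it loses rows; `matched` and `unmatched` are illustrative names.

```r
library(dplyr)
library(tidyr)

people = starwars %>% select(-films, -vehicles, -starships)
vehicles = starwars %>% select(name, vehicles) %>% unnest(cols = c(vehicles))

# People with at least one vehicle row would survive an inner join ...
matched = people %>% semi_join(vehicles, by = "name")
# ... and everyone else would be dropped by it.
unmatched = people %>% anti_join(vehicles, by = "name")

# Every person falls into exactly one of the two groups
stopifnot(nrow(matched) + nrow(unmatched) == nrow(people))

# Inspect the rows an inner join would lose
unmatched %>% select(name, species)
```

For an inner join, the rows in `unmatched` are the ones that disappear; for a left join they are kept and padded with missing values, which is why the meaning of "excluded" depends on the join type.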

Unions

Another type of binary operator is a union. This is a simpler problem and works as expected. In this example, dtrackr detects that the early part of the pipeline is the same on both branches of the data flow. The result is a flow that splits and then joins again at the union (bind_rows) step.

tmp = people %>% comment("start")

tmp1 = tmp %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
  )

tmp2 = tmp %>% include_any(
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans")

tmp3 = bind_rows(tmp1,tmp2) %>% comment("{.count} human,droids and gungans") 
tmp3 %>% flowchart()
[Flowchart: "start" splits into two branches: "inclusions: 35 humans, 6 droids" and "inclusions: 3 gungans" -> "3 gungans"; both branches meet at "Union: 44 in union" -> "44 human,droids and gungans"]