2. Application: hierarchical data structure

Global Historical Climatology Network (GHCN) provides daily climate measures from stations across the world. prcp_aus extracts daily precipitation and minimum temperature from GHCN for 639 stations in Australia from 2016 to 2020. This is where these stations locate in an Australia map:

state_map <- rmapshaper::ms_simplify(ozmaps::abs_ste, keep = 2e-3)
ggplot() + 
  geom_sf(data = state_map, inherit.aes = FALSE,
          color = "grey80", alpha = 0.4, linetype = 3) +
  geom_point(data = prcp_aus , aes(x = long, y = lat)) + 
  theme_void()

This is a lot of stations to look at for one climate variable and they can’t all fit into a glyph map. What we can do is to group stations into clusters and look at the aggregated series in the glyph map. In this vignette, I will introduce how to perform aggregation using a hierarchical data structure in cubble.

First, let’s summarise the daily data into weekly sum for each station:

station_long <- prcp_aus %>%  
  face_temporal(ts) %>%  
  mutate(wk = lubridate::week(date)) %>%  
  group_by(wk) %>%  
  summarise(prcp = sum(prcp, na.rm = TRUE))

station_long
#> # cubble:  wk, id [639]: long form
#> # bbox:    [113.53, -43.66, 153.64, -10.05]
#> # spatial: lat [dbl], long [dbl], elev [dbl], name [chr], wmo_id [dbl]
#>    id             wk  prcp
#>    <chr>       <dbl> <dbl>
#>  1 ASN00001006     1  3660
#>  2 ASN00001006     2  1766
#>  3 ASN00001006     3  1748
#>  4 ASN00001006     4  4614
#>  5 ASN00001006     5  1080
#>  6 ASN00001006     6  1868
#>  7 ASN00001006     7  2494
#>  8 ASN00001006     8  1696
#>  9 ASN00001006     9  3124
#> 10 ASN00001006    10  1834
#> # … with 33,814 more rows

First we need to assign each station a cluster number. Here we use a simple kmean algorithm based on the distance matrix to create 20 clusters. This creates station_nested as a station level nested cubble with a cluster column indicating the group each station belongs to.

station_nested
#> # cubble:   id [639]: nested form
#> # bbox:     [113.53, -43.66, 153.64, -10.05]
#> # temporal: wk [dbl], prcp [dbl]
#>    id            lat  long  elev name                wmo_id ts       cluster
#>    <chr>       <dbl> <dbl> <dbl> <chr>                <dbl> <list>     <int>
#>  1 ASN00001006 -15.5  128.   3.8 wyndham aero         95214 <tibble>       3
#>  2 ASN00001007 -13.8  126.   6   troughton island     94102 <tibble>       3
#>  3 ASN00001018 -16.4  126. 546   mount elizabeth      94211 <tibble>      16
#>  4 ASN00001019 -14.3  127.  23   kalumburu            94100 <tibble>       3
#>  5 ASN00001020 -14.1  126.  51   truscott             95101 <tibble>       3
#>  6 ASN00001025 -15.4  126. 385   doongan              94215 <tibble>       3
#>  7 ASN00002012 -18.2  128. 422   halls creek airport  94212 <tibble>      17
#>  8 ASN00002032 -17.0  128. 203   warmun               94213 <tibble>       3
#>  9 ASN00002056 -15.8  129.  44   kununurra aero       94216 <tibble>       3
#> 10 ASN00002064 -16.6  128. 164   argyle aerodrome     94217 <tibble>       3
#> # … with 629 more rows

To create a group level cubble, use switch_key() with the new key variable, cluster:

cluster_nested <- station_nested %>%  switch_key(cluster)
cluster_nested %>%  head(5)
#> # cubble:   cluster [5]: nested form
#> # bbox:     [113.53, -35.03, 136.83, -11.04]
#> # temporal: id [chr], wk [dbl], prcp [dbl]
#>   cluster .val              ts                  
#>     <int> <list>            <list>              
#> 1       3 <tibble [30 × 6]> <tibble [1,590 × 3]>
#> 2      16 <tibble [13 × 6]> <tibble [689 × 3]>  
#> 3      17 <tibble [15 × 6]> <tibble [795 × 3]>  
#> 4      12 <tibble [16 × 6]> <tibble [848 × 3]>  
#> 5      20 <tibble [61 × 6]> <tibble [3,233 × 3]>

The resulted cluster_nested now has cluster as the key and all the station level time invariant variables are nested inside .val. Currently, there is no cluster level time invariant variables and we can add the centroid of each cluster by get_centroid():

cluster_nested <- cluster_nested %>%  get_centroid()
cluster_nested
#> # cubble:   cluster [20]: nested form
#> # bbox:     [113.53, -43.66, 153.64, -10.05]
#> # temporal: id [chr], wk [dbl], prcp [dbl]
#>    cluster .val              ts                   hull     cent_long cent_lat
#>      <int> <list>            <list>               <list>       <dbl>    <dbl>
#>  1       3 <tibble [30 × 6]> <tibble [1,590 × 3]> <tibble>      131.    -13.7
#>  2      16 <tibble [13 × 6]> <tibble [689 × 3]>   <tibble>      122.    -19.5
#>  3      17 <tibble [15 × 6]> <tibble [795 × 3]>   <tibble>      133.    -17.5
#>  4      12 <tibble [16 × 6]> <tibble [848 × 3]>   <tibble>      115.    -24.2
#>  5      20 <tibble [61 × 6]> <tibble [3,233 × 3]> <tibble>      117.    -31.2
#>  6       4 <tibble [18 × 6]> <tibble [954 × 3]>   <tibble>      121.    -30.6
#>  7       6 <tibble [16 × 6]> <tibble [848 × 3]>   <tibble>      130.    -27.3
#>  8      13 <tibble [17 × 6]> <tibble [901 × 3]>   <tibble>      140.    -21.1
#>  9      10 <tibble [20 × 6]> <tibble [1,060 × 3]> <tibble>      137.    -29.8
#> 10      19 <tibble [47 × 6]> <tibble [2,491 × 3]> <tibble>      138.    -34.2
#> 11       5 <tibble [36 × 6]> <tibble [1,908 × 3]> <tibble>      142.    -36.1
#> 12      18 <tibble [10 × 6]> <tibble [530 × 3]>   <tibble>      143.    -13.6
#> 13       9 <tibble [22 × 6]> <tibble [1,166 × 3]> <tibble>      146.    -19.3
#> 14       2 <tibble [28 × 6]> <tibble [1,484 × 3]> <tibble>      149.    -24.5
#> 15      14 <tibble [54 × 6]> <tibble [2,845 × 3]> <tibble>      151.    -28.4
#> 16      15 <tibble [17 × 6]> <tibble [901 × 3]>   <tibble>      144.    -29.9
#> 17       1 <tibble [62 × 6]> <tibble [3,286 × 3]> <tibble>      150.    -32.1
#> 18       8 <tibble [55 × 6]> <tibble [2,915 × 3]> <tibble>      148.    -35.3
#> 19      11 <tibble [53 × 6]> <tibble [2,809 × 3]> <tibble>      145.    -37.8
#> 20       7 <tibble [49 × 6]> <tibble [2,571 × 3]> <tibble>      147.    -41.8

You can also use face_temporal() to get the cluster level long cubble:

cluster_long <- cluster_nested %>%  face_temporal(ts)
cluster_long
#> # cubble:  wk, cluster [20]: long form
#> # bbox:    [113.53, -43.66, 153.64, -10.05]
#> # spatial: id [chr], lat [dbl], long [dbl], elev [dbl], name [chr], wmo_id
#> #   [dbl], hull [list], cent_long [dbl], cent_lat [dbl]
#>    cluster id             wk  prcp
#>      <int> <chr>       <dbl> <dbl>
#>  1       3 ASN00001006     1  3660
#>  2       3 ASN00001006     2  1766
#>  3       3 ASN00001006     3  1748
#>  4       3 ASN00001006     4  4614
#>  5       3 ASN00001006     5  1080
#>  6       3 ASN00001006     6  1868
#>  7       3 ASN00001006     7  2494
#>  8       3 ASN00001006     8  1696
#>  9       3 ASN00001006     9  3124
#> 10       3 ASN00001006    10  1834
#> # … with 33,814 more rows

Now we should have access to both station and cluster level in the nested and long form. Let’s summarise them within a diagram:

With these data, we can make a glyph map to understand the precipitation pattern in Australia:

Or to inspect the station membership of each cluster:

ggplot() + 
  geom_sf(data = state_map, inherit.aes = FALSE,
          color = "grey80", alpha = 0.4, linetype = 3) +
  geom_point(data = station_nested, aes(x = long, y = lat), size = 0.5) +
  ggforce::geom_mark_hull(data = cluster_nested %>%  tidyr::unnest(hull),
                          expand = 0, radius = 0,
                          aes(x = long, y = lat, group = cluster)) + 
  theme_void()