3. Manipulating multiple signals

Many analyses involve working with multiple signals at once. The covidcast package provides helper functions for fetching multiple signals from the API and aggregating them into one data frame for downstream use.

Fetching multiple signals

To load confirmed cases and deaths at the state level, in a single function call, we can use covidcast_signals() (note the plural form of “signals”):

library(covidcast)

start_day <- "2020-06-01"
end_day <- "2020-10-01"

signals <- suppressMessages(
  covidcast_signals(data_source = "usa-facts",
                    signal = c("confirmed_incidence_num",
                               "deaths_incidence_num"),
                    start_day = start_day, end_day = end_day,
                    geo_type = "state")
)

summary(signals[[1]])
## A `covidcast_signal` dataframe with 6273 rows and 9 columns.
## 
## data_source : usa-facts
## signal      : confirmed_incidence_num
## geo_type    : state
## 
## first date                          : 2020-06-01
## last date                           : 2020-10-01
## median number of geo_values per day : 51
summary(signals[[2]])
## A `covidcast_signal` dataframe with 6273 rows and 9 columns.
## 
## data_source : usa-facts
## signal      : deaths_incidence_num
## geo_type    : state
## 
## first date                          : 2020-06-01
## last date                           : 2020-10-01
## median number of geo_values per day : 51

This returns a list of covidcast_signal objects. The argument structure for covidcast_signals() matches that of covidcast_signal(), except the first four arguments (data_source, signal, start_day, end_day) are allowed to be vectors. See the covidcast_signals() documentation for details.

Aggregating signals, wide format

To aggregate multiple signals together, we can use the aggregate_signals() function, which accepts a list of covidcast_signal objects, as returned by covidcast_signals(). With all arguments set to their default values, aggregate_signals() returns a data frame in “wide” format:

library(dplyr)

aggregate_signals(signals) %>% head()
##   geo_value time_value value+0:usa-facts_confirmed_incidence_num
## 1        tn 2020-06-01                                       540
## 2        tn 2020-06-02                                       740
## 3        tn 2020-06-03                                       409
## 4        tn 2020-06-04                                       356
## 5        tn 2020-06-05                                       465
## 6        tn 2020-06-06                                       529
##   value+0:usa-facts_deaths_incidence_num
## 1                                      1
## 2                                     10
## 3                                      7
## 4                                     13
## 5                                      9
## 6                                      9

In “wide” format, only the latest issue of data is retained, and the columns data_source, signal, issue, lag, stderr, sample_size are all dropped from the returned data frame. Each unique signal—defined by a combination of data source name, signal name, and time-shift—is given its own column, whose name indicates its defining quantities.
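
The column names in the output above appear to follow the pattern value<shift>:<data_source>_<signal>, with the shift always printed with an explicit sign. As a minimal sketch, we can reproduce that pattern in base R. The helper wide_col_name() is hypothetical and is not part of covidcast; it only mirrors the names printed above:

```r
# Hypothetical helper (not part of covidcast): rebuild a wide-format column
# name from its three defining quantities, following the pattern seen above.
wide_col_name <- function(dt, data_source, signal) {
  # "%+d" prints the shift with an explicit sign, e.g. "+0" or "-1"
  sprintf("value%+d:%s_%s", as.integer(dt), data_source, signal)
}

wide_col_name(0, "usa-facts", "confirmed_incidence_num")
## [1] "value+0:usa-facts_confirmed_incidence_num"
wide_col_name(-1, "usa-facts", "deaths_incidence_num")
## [1] "value-1:usa-facts_deaths_incidence_num"
```

Because these names contain characters such as + and :, they need backticks when referenced directly in R code, for example df$`value+0:usa-facts_deaths_incidence_num`.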

As hinted above, aggregate_signals() can also apply time-shifts to the given signals, through the optional dt argument. This can be either a single vector of shifts, in which case the same shifts are applied to every covidcast_signal object, or a list of vectors of shifts with the same length as the list of covidcast_signal objects, in which case each object gets its own set of shifts. Negative shifts translate into a lag and positive shifts into a lead; for example, if dt = -1, then the value reported on June 2 is the original value on June 1; if dt = 0, then the values are left as is.

aggregate_signals(signals, dt = c(-1, 0)) %>%
  filter(geo_value == "tx") %>% head()
##   geo_value time_value value-1:usa-facts_confirmed_incidence_num
## 1        tx 2020-06-02                                       592
## 2        tx 2020-06-20                                      3471
## 3        tx 2020-06-23                                      3278
## 4        tx 2020-06-25                                      5551
## 5        tx 2020-06-26                                      5982
## 6        tx 2020-06-27                                      5717
##   value+0:usa-facts_confirmed_incidence_num
## 1                                      1683
## 2                                      4391
## 3                                      5532
## 4                                      5982
## 5                                      5717
## 6                                      5758
##   value-1:usa-facts_deaths_incidence_num value+0:usa-facts_deaths_incidence_num
## 1                                      6                                     20
## 2                                     34                                     24
## 3                                     10                                     28
## 4                                     28                                     48
## 5                                     48                                     32
## 6                                     32                                     42
aggregate_signals(signals, dt = list(0, c(-1, 0, 1))) %>%
  filter(geo_value == "tx") %>% head()
##   geo_value time_value value+0:usa-facts_confirmed_incidence_num
## 1        tx 2020-06-02                                      1683
## 2        tx 2020-06-20                                      4391
## 3        tx 2020-06-23                                      5532
## 4        tx 2020-06-25                                      5982
## 5        tx 2020-06-26                                      5717
## 6        tx 2020-06-27                                      5758
##   value-1:usa-facts_deaths_incidence_num value+0:usa-facts_deaths_incidence_num
## 1                                      6                                     20
## 2                                     34                                     24
## 3                                     10                                     28
## 4                                     28                                     48
## 5                                     48                                     32
## 6                                     32                                     42
##   value+1:usa-facts_deaths_incidence_num
## 1                                     36
## 2                                     16
## 3                                     28
## 4                                     32
## 5                                     42
## 6                                     27
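
The difference between passing dt as a vector and as a list, seen in the two calls above, can be sketched in base R. The helper expand_dt() below is hypothetical (it is not a covidcast function and is not meant to reflect the package's internals); it only spells out which shifts end up attached to which signal:

```r
# Hypothetical helper (not part of covidcast): work out which shifts apply to
# each signal, following the rule described in the text.
expand_dt <- function(dt, n_signals) {
  if (is.list(dt)) {
    # A list must supply one vector of shifts per signal
    stopifnot(length(dt) == n_signals)
    dt
  } else {
    # A single vector of shifts is applied to every signal
    rep(list(dt), n_signals)
  }
}

# Same shifts (-1 and 0) for both signals
str(expand_dt(c(-1, 0), n_signals = 2))

# Shift 0 for the first signal; shifts -1, 0, 1 for the second
str(expand_dt(list(0, c(-1, 0, 1)), n_signals = 2))
```

This matches the outputs above: the vector form produces two columns per signal, while the list form produces one column for cases and three for deaths.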

Finally, aggregate_signals() also accepts a single data frame (instead of a list of data frames), which is convenient when applying shifts to a single covidcast_signal object:

aggregate_signals(signals[[1]], dt = c(-1, 0, 1)) %>%
  filter(geo_value == "tx") %>% head()
##   geo_value time_value value-1:usa-facts_confirmed_incidence_num
## 1        tx 2020-06-02                                       592
## 2        tx 2020-06-20                                      3471
## 3        tx 2020-06-23                                      3278
## 4        tx 2020-06-25                                      5551
## 5        tx 2020-06-26                                      5982
## 6        tx 2020-06-27                                      5717
##   value+0:usa-facts_confirmed_incidence_num
## 1                                      1683
## 2                                      4391
## 3                                      5532
## 4                                      5982
## 5                                      5717
## 6                                      5758
##   value+1:usa-facts_confirmed_incidence_num
## 1                                      1674
## 2                                      3864
## 3                                      5551
## 4                                      5717
## 5                                      5758
## 6                                      5352

Aggregating signals, long format

We can also use aggregate_signals() in “long” format, with one observation per row:

aggregate_signals(signals, format = "long") %>%
  filter(geo_value == "tx") %>% head()
##   data_source                  signal geo_value time_value      issue lag
## 1   usa-facts confirmed_incidence_num        tx 2020-06-01 2020-10-17 138
## 2   usa-facts confirmed_incidence_num        tx 2020-06-02 2021-02-10 253
## 3   usa-facts confirmed_incidence_num        tx 2020-06-03 2020-10-17 136
## 4   usa-facts confirmed_incidence_num        tx 2020-06-04 2020-10-17 135
## 5   usa-facts confirmed_incidence_num        tx 2020-06-05 2020-10-17 134
## 6   usa-facts confirmed_incidence_num        tx 2020-06-06 2020-10-17 133
##   stderr sample_size dt value
## 1     NA          NA  0   592
## 2     NA          NA  0  1683
## 3     NA          NA  0  1674
## 4     NA          NA  0  1614
## 5     NA          NA  0  1690
## 6     NA          NA  0  1936
aggregate_signals(signals, dt = c(-1, 0), format = "long") %>%
  filter(geo_value == "tx") %>% head()
##   data_source                  signal geo_value time_value      issue lag
## 1   usa-facts confirmed_incidence_num        tx 2020-06-01 2020-10-17 138
## 2   usa-facts confirmed_incidence_num        tx 2020-06-01 2020-10-17 138
## 3   usa-facts confirmed_incidence_num        tx 2020-06-02 2021-02-10 253
## 4   usa-facts confirmed_incidence_num        tx 2020-06-02 2021-02-10 253
## 5   usa-facts confirmed_incidence_num        tx 2020-06-03 2020-10-17 136
## 6   usa-facts confirmed_incidence_num        tx 2020-06-03 2020-10-17 136
##   stderr sample_size dt value
## 1     NA          NA -1    NA
## 2     NA          NA  0   592
## 3     NA          NA -1   592
## 4     NA          NA  0  1683
## 5     NA          NA -1  1683
## 6     NA          NA  0  1674
aggregate_signals(signals, dt = list(-1, 0), format = "long") %>%
  filter(geo_value == "tx") %>% head()
##   data_source                  signal geo_value time_value      issue lag
## 1   usa-facts confirmed_incidence_num        tx 2020-06-01 2020-10-17 138
## 2   usa-facts confirmed_incidence_num        tx 2020-06-02 2021-02-10 253
## 3   usa-facts confirmed_incidence_num        tx 2020-06-03 2020-10-17 136
## 4   usa-facts confirmed_incidence_num        tx 2020-06-04 2020-10-17 135
## 5   usa-facts confirmed_incidence_num        tx 2020-06-05 2020-10-17 134
## 6   usa-facts confirmed_incidence_num        tx 2020-06-06 2020-10-17 133
##   stderr sample_size dt value
## 1     NA          NA -1    NA
## 2     NA          NA -1   592
## 3     NA          NA -1  1683
## 4     NA          NA -1  1674
## 5     NA          NA -1  1614
## 6     NA          NA -1  1690

As we can see, time-shifts work just as they did in “wide” format. However, in “long” format, all columns are retained, and an additional dt column is added to record the time-shift being used.

Just as before, aggregate_signals() can also operate on a single data frame, to conveniently apply shifts, in “long” format:

aggregate_signals(signals[[1]], dt = c(-1, 0), format = "long") %>%
  filter(geo_value == "tx") %>% head()
##   data_source                  signal geo_value time_value      issue lag
## 1   usa-facts confirmed_incidence_num        tx 2020-06-01 2020-10-17 138
## 2   usa-facts confirmed_incidence_num        tx 2020-06-01 2020-10-17 138
## 3   usa-facts confirmed_incidence_num        tx 2020-06-02 2021-02-10 253
## 4   usa-facts confirmed_incidence_num        tx 2020-06-02 2021-02-10 253
## 5   usa-facts confirmed_incidence_num        tx 2020-06-03 2020-10-17 136
## 6   usa-facts confirmed_incidence_num        tx 2020-06-03 2020-10-17 136
##   stderr sample_size dt value
## 1     NA          NA -1    NA
## 2     NA          NA  0   592
## 3     NA          NA -1   592
## 4     NA          NA  0  1683
## 5     NA          NA -1  1683
## 6     NA          NA  0  1674
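
The row layout above (one row per time_value and dt, with NA where the lagged day is unavailable) can be imitated on a toy series in base R. This is only a sketch under the assumption of a single location; stack_shifts() is a hypothetical helper, not a covidcast function:

```r
# Toy series for one location; the values are made up.
toy <- data.frame(
  geo_value = "tx",
  time_value = as.Date("2020-06-01") + 0:2,
  value = c(10, 20, 30)
)

# Hypothetical helper (not part of covidcast): make one copy of the series
# per shift, tag each copy with its dt, and stack the copies row-wise.
stack_shifts <- function(df, dts) {
  pieces <- lapply(dts, function(dt) {
    out <- df
    out$dt <- dt
    # Day t reports the original value from day t + dt (NA if unavailable)
    out$value <- df$value[match(df$time_value + dt, df$time_value)]
    out
  })
  long <- do.call(rbind, pieces)
  long <- long[order(long$time_value, long$dt),
               c("geo_value", "time_value", "dt", "value")]
  rownames(long) <- NULL
  long
}

toy_long <- stack_shifts(toy, dts = c(-1, 0))
toy_long
```

Each date appears once per shift, and the first date has NA for dt = -1 because there is no earlier day to draw from, mirroring the pattern in the output above.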

Pivoting longer or wider

The package also provides functions for pivoting an aggregated signal data frame longer or wider. These are essentially wrappers around pivot_longer() and pivot_wider() from the tidyr package that set the column structure and column names appropriately. For example, to pivot longer:

aggregate_signals(signals, dt = list(-1, 0)) %>%
  covidcast_longer() %>%
  filter(geo_value == "tx") %>% head()
##   data_source                  signal geo_value time_value dt value
## 1   usa-facts confirmed_incidence_num        tx 2020-06-02 -1   592
## 2   usa-facts    deaths_incidence_num        tx 2020-06-02  0    20
## 3   usa-facts confirmed_incidence_num        tx 2020-06-20 -1  3471
## 4   usa-facts    deaths_incidence_num        tx 2020-06-20  0    24
## 5   usa-facts confirmed_incidence_num        tx 2020-06-23 -1  3278
## 6   usa-facts    deaths_incidence_num        tx 2020-06-23  0    28

And to pivot wider:

aggregate_signals(signals, dt = list(-1, 0), format = "long") %>%
  covidcast_wider() %>%
  filter(geo_value == "tx") %>% head()
##   geo_value time_value value-1:usa-facts_confirmed_incidence_num
## 1        tx 2020-06-01                                        NA
## 2        tx 2020-06-02                                       592
## 3        tx 2020-06-03                                      1683
## 4        tx 2020-06-04                                      1674
## 5        tx 2020-06-05                                      1614
## 6        tx 2020-06-06                                      1690
##   value+0:usa-facts_deaths_incidence_num
## 1                                      6
## 2                                     20
## 3                                     36
## 4                                     33
## 5                                     21
## 6                                     29
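
Pivoting between the two formats is possible because each wide column name encodes the shift, data source, and signal. As a rough base R illustration (not the package's implementation, which the text above describes as a wrapper around tidyr), the names can be split back into their parts. The regular expression assumes that the data source name contains no underscore, which holds for usa-facts but is an assumption in general:

```r
wide_names <- c("value-1:usa-facts_confirmed_incidence_num",
                "value+0:usa-facts_deaths_incidence_num")

# Capture groups: signed shift, data source (no underscores assumed), signal
pattern <- "^value([+-][0-9]+):([^_]+)_(.+)$"
parts <- regmatches(wide_names, regexec(pattern, wide_names))

# Element 1 of each match is the full string; elements 2 to 4 are the groups
parsed <- data.frame(
  dt = as.integer(sapply(parts, `[`, 2)),
  data_source = sapply(parts, `[`, 3),
  signal = sapply(parts, `[`, 4),
  stringsAsFactors = FALSE
)
parsed
```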

A sanity check

Lastly, here’s a small sanity check: lagging cases by 7 days using aggregate_signals() and then correlating the result with deaths using covidcast_cor() yields the same result as telling covidcast_cor() to do the time-shifting itself:

df_cor1 <- covidcast_cor(x = aggregate_signals(signals[[1]], dt = -7,
                                              format = "long"),
                        y = signals[[2]])

df_cor2 <- covidcast_cor(x = signals[[1]], y = signals[[2]], dt_x = -7)
identical(df_cor1, df_cor2)
## [1] TRUE
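
To see why lagging one series before correlating can matter, here is a small base R toy that is independent of covidcast_cor(). The data are simulated, and the "deaths" series is constructed (by assumption, purely for illustration) as an exact linear function of "cases" from 7 days earlier:

```r
set.seed(1)
n <- 30
days <- as.Date("2020-06-01") + 0:(n - 1)
cases <- data.frame(time_value = days, value = rnorm(n))

# Made-up "deaths" series: an exact linear function of cases 7 days earlier
deaths <- data.frame(
  time_value = days,
  value = 1 + 2 * cases$value[match(days - 7, days)]
)

# Lag cases by 7 days (dt = -7): day t reports the value from day t - 7
cases_lag7 <- cases$value[match(days - 7, days)]

# Correlate with and without the lag, dropping days with missing values
cor_unshifted <- cor(cases$value, deaths$value, use = "complete.obs")
cor_lagged <- cor(cases_lag7, deaths$value, use = "complete.obs")
c(unshifted = cor_unshifted, lagged = cor_lagged)
```

By construction, the lagged correlation is 1 up to floating-point error, while the unshifted correlation is not.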