PHEindicatormethods DSR function

Georgina Anderson

2022-08-08

Introduction

This vignette documents the method for calculating DSRs using the PHEindicatormethods::phe_dsr function which calculates DSRs and their confidence limits using the Dobson method.

The function can be used to calculate DSRs for grouping single or multiple geographic areas/ genders/ timeperiods/ indicators in a single execution and takes the following arguments as inputs:



Argument Type Definition Default value
data data.frame data.frame containing the data to be standardised none
x unquoted string field name from data containing the observed number of events for each standardisation category (eg ageband) within each grouping set (eg area or indicator) none
n unquoted string field name from data containing the populations for each standardisation category (eg ageband) within each grouping set (eg area or indicator) none
stdpop unquoted string standard populations for each standardisation category (eg age band) specified as a field name from data or a vector. esp2013
stdpoptype quoted string whether the stdpop argument has been specified as a vector or a field name “vector”
type quoted string defines the data and metadata columns to include in output. Can by ‘value’, ‘lower’, ‘upper’, ‘standard’ or ‘full’ “full”
confidence numeric value the required level of confidence expressed as a number between 0.9 and 1 or 90 and 100 0.95
multiplier numeric value the multiplier used to express the final values (eg 100,000 = rate per 100,000 100,000



Note that the European Standard Population 2013 divided into 19 five-year agebands (0-4, 5-9, 10-14, …..90+) is provided in vector format within the package and will be used as the default for the stdpop argument

If multiple DSRs are required from a single data frame then the data frame must be grouped prior to inputting to the function - this is demonstrated below

The following packages must be installed and loaded if not already available

library(PHEindicatormethods)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.1.3

First let’s create some data to play with

In a real situation we’d most likely be sourcing our numerators and denominators from different places so let’s create them separately for now.

pops <- data.frame(indicator = rep(c("Ind1","Ind2","Ind3","Ind4"), each = 19 * 2 * 5),
                   period = rep(2012:2016, each = 19 * 2),
                   region = rep(rep(c("Area1","Area2"),each=19), times = 5),
                   ageband = rep(c(0,5,10,15,20,25,30,35,40,45,50,
                                   55,60,65,70,75,80,85,90),times = 10),
                   pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(pops)
#>   indicator period region ageband   pop
#> 1      Ind1   2012  Area1       0 16242
#> 2      Ind1   2012  Area1       5 16259
#> 3      Ind1   2012  Area1      10 10521
#> 4      Ind1   2012  Area1      15 17762
#> 5      Ind1   2012  Area1      20 16859
#> 6      Ind1   2012  Area1      25 10866


deaths <- data.frame(indicator = rep(c("Ind1","Ind2","Ind3","Ind4"), each = 19 * 2 * 5),
                   period = rep(2012:2016, each = 19 * 2),
                   region = rep(rep(c("Area1","Area2"),each=19), times = 5),
                   ageband = rep(c(0,5,10,15,20,25,30,35,40,45,50,
                                   55,60,65,70,75,80,85,90),times = 10),
                   dths = sample(200, 19 * 2 * 5 * 4, replace=TRUE))
head(deaths)
#>   indicator period region ageband dths
#> 1      Ind1   2012  Area1       0   18
#> 2      Ind1   2012  Area1       5   94
#> 3      Ind1   2012  Area1      10   52
#> 4      Ind1   2012  Area1      15   40
#> 5      Ind1   2012  Area1      20   84
#> 6      Ind1   2012  Area1      25   21

Then let’s prepare and validate our data

Our data contains records for 4 different indicators, 5 time periods and 2 geographies so let’s calculate a DSR for each combination - that’s 40 separate DSRs from a single execution of the phe_dsr function……

Prepare the data frame

First we’ll need to join our datasets to create the input data.frame for the function and specify the grouping sets:

df <- left_join(pops,deaths, by = c("indicator","period","region","ageband")) %>%
  group_by(indicator, period, region)

Check the data meets the function requirements

It is important that your data meets the following criteria in order for the phe_dsr function to work so it is wise to check this before we move on.

1. Each grouping set within your data must contain an equal number of records.

The phe_dsr function has built in error handling to check this requirement, or you can check your data manually using code like this:

# check equal number of records in each grouping set - eyeball check
summarise(df,n=n())
#> `summarise()` has grouped output by 'indicator', 'period'. You can override
#> using the `.groups` argument.
#> # A tibble: 40 x 4
#> # Groups:   indicator, period [20]
#>    indicator period region     n
#>    <chr>      <int> <chr>  <int>
#>  1 Ind1        2012 Area1     19
#>  2 Ind1        2012 Area2     19
#>  3 Ind1        2013 Area1     19
#>  4 Ind1        2013 Area2     19
#>  5 Ind1        2014 Area1     19
#>  6 Ind1        2014 Area2     19
#>  7 Ind1        2015 Area1     19
#>  8 Ind1        2015 Area2     19
#>  9 Ind1        2016 Area1     19
#> 10 Ind1        2016 Area2     19
#> # ... with 30 more rows
#> # i Use `print(n = ...)` to see more rows
# or alternatively the following should return TRUE
n_distinct(select(ungroup(summarise(df,n=n())),n)) == 1
#> `summarise()` has grouped output by 'indicator', 'period'. You can override
#> using the `.groups` argument.
#> [1] TRUE

2. If you are supplying your standard population in vector format (stdpoptype=“vector”) then this vector must also contain the same number of records as each grouping set within your data.

In this example we’re going to use the default esp2013 vector that is provided with the PHEindicatormethods package for our standard population - it contains 19 ordered values representing the 5-year age bands 0-4, 5-9, 10-14….85-89, 90+.

The phe_dsr function has built in error handling to check this requirement, or you can check your data manually using code like this:

# check standard population has same number of records as in each grouping set of data in check 1 above - eyeball check
length(esp2013)
#> [1] 19
# or alternatively the following should return TRUE
pull(slice(select(ungroup(summarise(df,n=n())),n),1)) == length(esp2013)
#> `summarise()` has grouped output by 'indicator', 'period'. You can override
#> using the `.groups` argument.
#> [1] TRUE

3. If you are supplying your standard population in vector format (stdpoptype=“vector”) then the standard population and your data (for each grouping set) must be sorted in the same standardisation category order because the function will join these by position. This would normally mean sorting both the standard population vector and the records for each group within your data by age band from youngest to oldest.

The phe_dsr function does not have any built in error handling to check this requirement as the function does not require the standardisation category labels to be provided in your data. It is therefore the responsibility of the function user to ensure this requirement is met. If the standardisation category labels are included with your data (as in our example) then the following code can be used to check the requirement manually:

# check data is ordered by required agebands from youngest to oldest
all(df$ageband == rep(c(0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90)))
#> [1] TRUE

CAUTION: Failure to ensure that this 3rd requirement is met could result in apparent successful execution of the phe_dsr function even though the data may have been incorrectly standardised. This is demonstrated in the sample code below:


right <- phe_dsr(df,dths,pop)

wrong <- df %>%
          arrange(desc(ageband)) %>%
          phe_dsr(dths,pop)

# the following statement shows that execution of the phe_dsr function on the deliberately-incorrectly sorted data frame produces different (and incorrect) results.
identical(right,wrong)
#> [1] FALSE

Now let’s calculate some DSRs

Now we’re ready to calculate the DSRs using our correctly ordered df data frame.

By default the function will apply 95% confidence, a 100,000 multiplier and will output 3 data fields against each grouping set:

It will also output 3 metadata fields as an audit showing which argument parameters were passed:

phe_dsr(df, dths, pop)
#> # A tibble: 40 x 11
#> # Groups:   indicator, period, region [40]
#>    indicator period region total~1 total~2 value lowercl uppercl confi~3 stati~4
#>    <chr>      <int> <chr>    <int>   <int> <dbl>   <dbl>   <dbl> <chr>   <chr>  
#>  1 Ind1        2012 Area1     1273  275310  462.    434.    491. 95%     dsr pe~
#>  2 Ind1        2012 Area2     1693  291740  609.    578.    641. 95%     dsr pe~
#>  3 Ind1        2013 Area1     1938  283629  691.    657.    725. 95%     dsr pe~
#>  4 Ind1        2013 Area2     1710  278941  604.    574.    635. 95%     dsr pe~
#>  5 Ind1        2014 Area1     1826  286690  638.    607.    670. 95%     dsr pe~
#>  6 Ind1        2014 Area2     2425  290203  933.    894.    974. 95%     dsr pe~
#>  7 Ind1        2015 Area1     1839  275728  685.    652.    719. 95%     dsr pe~
#>  8 Ind1        2015 Area2     2343  298011  739.    706.    772. 95%     dsr pe~
#>  9 Ind1        2016 Area1     1832  312867  580.    553.    609. 95%     dsr pe~
#> 10 Ind1        2016 Area2     1884  297368  699.    665.    733. 95%     dsr pe~
#> # ... with 30 more rows, 1 more variable: method <chr>, and abbreviated
#> #   variable names 1: total_count, 2: total_pop, 3: confidence, 4: statistic
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names



Alternatively, we can just output the data fields by specifying the ‘type’ argument value as ‘standard’:

phe_dsr(df, dths, pop, type = "standard", confidence = 99.8, multiplier = 10000)
#> # A tibble: 40 x 8
#> # Groups:   indicator, period, region [40]
#>    indicator period region total_count total_pop value lowercl uppercl
#>    <chr>      <int> <chr>        <int>     <int> <dbl>   <dbl>   <dbl>
#>  1 Ind1        2012 Area1         1273    275310  46.2    41.8    50.8
#>  2 Ind1        2012 Area2         1693    291740  60.9    56.1    66.0
#>  3 Ind1        2013 Area1         1938    283629  69.1    63.9    74.5
#>  4 Ind1        2013 Area2         1710    278941  60.4    55.7    65.3
#>  5 Ind1        2014 Area1         1826    286690  63.8    59.0    68.9
#>  6 Ind1        2014 Area2         2425    290203  93.3    87.2    99.7
#>  7 Ind1        2015 Area1         1839    275728  68.5    63.4    74.0
#>  8 Ind1        2015 Area2         2343    298011  73.9    68.8    79.2
#>  9 Ind1        2016 Area1         1832    312867  58.0    53.7    62.6
#> 10 Ind1        2016 Area2         1884    297368  69.9    64.7    75.3
#> # ... with 30 more rows
#> # i Use `print(n = ...)` to see more rows



Alternative Standard Populations

In some cases you may wish to standardise against a different population to the default esp2013 one provided - such as the 1976 European Standard Population or an age and sex standardised population. There are two ways to specify an alternative standard population:

1. Provide the custom standard population as a vector

In the example below, the 1976 European Standard Population (which has 18 age groups) is provided as a vector and then referenced in the function call. To ensure the function works we must also ensure that our data has been broken down into these same 18 age bands (for the purposes of this example I’ve just combined the 85-90 and 90+ age band data into a single 85+ age band from the data.frame we used earlier).

The phe_dsr function can then be executed using a user-defined standard population:

esp1976 <- c(8000,  7000,   7000,   7000,   7000,   7000,   7000,   7000,   7000,   7000,   7000,   6000,   5000,   4000,   3000,   2000,   1000,   1000)

df18 <- df
df18$dths[df18$ageband == 85] <- df18$dths[df18$ageband == 85] + df18$dths[df18$ageband == 90]
df18$pop[df18$ageband == 85]  <- df18$pop[df18$ageband == 85] + df18$pop[df18$ageband == 90]
df18 <- filter(df18,ageband != 90)

phe_dsr(df18,dths,pop,stdpop = esp1976)
#> # A tibble: 40 x 11
#> # Groups:   indicator, period, region [40]
#>    indicator period region total~1 total~2 value lowercl uppercl confi~3 stati~4
#>    <chr>      <int> <chr>    <int>   <int> <dbl>   <dbl>   <dbl> <chr>   <chr>  
#>  1 Ind1        2012 Area1     1273  275310  432.    403.    461. 95%     dsr pe~
#>  2 Ind1        2012 Area2     1693  291740  617.    583.    651. 95%     dsr pe~
#>  3 Ind1        2013 Area1     1938  283629  716.    681.    753. 95%     dsr pe~
#>  4 Ind1        2013 Area2     1710  278941  630.    597.    665. 95%     dsr pe~
#>  5 Ind1        2014 Area1     1826  286690  605.    574.    637. 95%     dsr pe~
#>  6 Ind1        2014 Area2     2425  290203  941.    900.    984. 95%     dsr pe~
#>  7 Ind1        2015 Area1     1839  275728  665.    631.    700. 95%     dsr pe~
#>  8 Ind1        2015 Area2     2343  298011  718.    684.    753. 95%     dsr pe~
#>  9 Ind1        2016 Area1     1832  312867  550.    522.    579. 95%     dsr pe~
#> 10 Ind1        2016 Area2     1884  297368  699.    665.    734. 95%     dsr pe~
#> # ... with 30 more rows, 1 more variable: method <chr>, and abbreviated
#> #   variable names 1: total_count, 2: total_pop, 3: confidence, 4: statistic
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names



2. Append the standard populations to your data frame before executing the function

In the example below, the esp2013 standard population is appended to our data frame prior to calling the phe_dsr function. The field name can then be specified in the function call. If stdpop is specified as a field name we must also tell the function this by specifying stdpoptype = “field” as below:

df_with_stdpop <- df %>%
    mutate(spop = esp2013)
names(df_with_stdpop)
#> [1] "indicator" "period"    "region"    "ageband"   "pop"       "dths"     
#> [7] "spop"
phe_dsr(df_with_stdpop, dths, pop, stdpop = spop, stdpoptype = "field")
#> # A tibble: 40 x 11
#> # Groups:   indicator, period, region [40]
#>    indicator period region total~1 total~2 value lowercl uppercl confi~3 stati~4
#>    <chr>      <int> <chr>    <int>   <int> <dbl>   <dbl>   <dbl> <chr>   <chr>  
#>  1 Ind1        2012 Area1     1273  275310  462.    434.    491. 95%     dsr pe~
#>  2 Ind1        2012 Area2     1693  291740  609.    578.    641. 95%     dsr pe~
#>  3 Ind1        2013 Area1     1938  283629  691.    657.    725. 95%     dsr pe~
#>  4 Ind1        2013 Area2     1710  278941  604.    574.    635. 95%     dsr pe~
#>  5 Ind1        2014 Area1     1826  286690  638.    607.    670. 95%     dsr pe~
#>  6 Ind1        2014 Area2     2425  290203  933.    894.    974. 95%     dsr pe~
#>  7 Ind1        2015 Area1     1839  275728  685.    652.    719. 95%     dsr pe~
#>  8 Ind1        2015 Area2     2343  298011  739.    706.    772. 95%     dsr pe~
#>  9 Ind1        2016 Area1     1832  312867  580.    553.    609. 95%     dsr pe~
#> 10 Ind1        2016 Area2     1884  297368  699.    665.    733. 95%     dsr pe~
#> # ... with 30 more rows, 1 more variable: method <chr>, and abbreviated
#> #   variable names 1: total_count, 2: total_pop, 3: confidence, 4: statistic
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names



And what if the data are not so tidy?

Zero deaths for a specific age band within a small geography

This would be a fairly common scenario - maybe you have Local Authority data and there are no deaths in some of the younger age groups for some of the smaller areas. From PHEindicatormethods version 1.1.0 onwards, the phe_dsr function can handle this scenario and will automatically assign a zero death count where a death count is missing or recorded as NA.

Let’s fudge a couple of data frames to represent this. In this example, there are no deaths in the 10-14, 15-20 and 20-14 age bands:

pops2   <- data.frame(ageband    = c( 0, 5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90),
                      pop = c(30,35,35,35,40,40,45,50,50,50,60,60,70,75,70,60,20,20,15))

deaths2 <- data.frame(ageband = c(0,5,25,30,35,40,45,50,55,60,65,70,75,80,85,90),
                      dths    = c(1,1, 1, 1, 3, 3, 3, 3,10,10,10,10, 8, 8, 8, 8))

If we join these data frames to produce the input data frame required for the phe_dsr function then we get NA values in the Deaths column. From PHEindicatormethods version 1.1.0 onwards, the phe_dsr function will return the correct DSR, assuming zero deaths in the age groups with no deaths recorded. If you are using an earlier version of PHEindicatormethods then an error will be returned. See what you get…..

df2 <- left_join(pops2, deaths2, by="ageband")
phe_dsr(df2, dths, pop)
#>   total_count total_pop    value  lowercl  uppercl confidence      statistic
#> 1          88       860 8283.016 6570.819 10289.65        95% dsr per 100000
#>   method
#> 1 Dobson

For earlier versions of PHEindicatormethods, the NA values must be replaced with zeros before executing the function:

df3 <- df2 %>%
        mutate(dths = replace(dths, which(is.na(dths)), 0))
phe_dsr(df3, dths, pop)
#>   total_count total_pop    value  lowercl  uppercl confidence      statistic
#> 1          88       860 8283.016 6570.819 10289.65        95% dsr per 100000
#>   method
#> 1 Dobson