Databases in srvyr

Greg Freedman Ellis

2022-02-20

## Loading required package: RSQLite

Srvyr 0.3 has a completely rewritten database backend. Using databases that are already stored dplyr’s tbl_lazy objects is now just as easy as working with data stored in regular R data.frames as you don’t need to have a unique identifier. Additionally, it now works more similarly to the survey package’s database code and so shouldn’t be any slower.

During development, I have tested using SQLite (and the now defunct MonetDBLite) databases, but in theory other database backends should work as well.

This vignette shows the basics of how to use srvyr with databases. It is based on analysis from the wonderful resource asdfree ( website and github ). Many thanks to ajdamico and collaborators. Specifically, I have adapted code from American Community Survey - 2011 single year analysis and the associated data preparation scripts.

Database Setup

In order to focus on srvyr and databases, we start with a prepared dataset. The full code is available on Github, and the high level description of what it does is:

For more information on the specifics of the American Community Survey, see the asdfree site. Now, our code loads this prepared dataset, initiates a SQLite database, and puts the data into the dataset.

suppressMessages({
  library(survey)
  library(srvyr)
  library(dplyr)
  library(dbplyr)
  library(RSQLite)
})

# Load data (Currently only Alaska and Hawaii to keep file size small and with 
# limited variables, butcode that downloaded the files is available here
# https://github.com/gergness/srvyr/blob/main/vignettes/save_acs_data.R
# and could easily be adapted to download all states.)
load("acs_m.Rdata") # acs_m data

# Set up database and table
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
acs_m_db <- copy_to(db, acs_m, "acs_m", temporary = FALSE)

# Or, if the data was already stored in the database, you could do this
# acs_m_data <- tbl(db, sql("SELECT * FROM acs_m")) 

Now that we have the data in the database, we can interact with the database directly using sql commands, or we can use dplyr’s functionality to treat it mostly the same as a local data.frame. However, the data is not stored in memory, so we could work with much larger datasets (though in this case, the data is too small for this to be a problem).

# Same results
acs_m %>% 
  group_by(sex) %>%
  summarize(hicov = mean(hicov))
## # A tibble: 2 × 2
##     sex hicov
##   <int> <dbl>
## 1     1  1.14
## 2     2  1.10
acs_m_db %>% 
  group_by(sex) %>%
  summarize(hicov = mean(hicov))
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.37.0 [:memory:]
##     sex hicov
##   <int> <dbl>
## 1     1  1.14
## 2     2  1.10
# But smaller object size
object.size(acs_m)
## 7777312 bytes
object.size(acs_m_db)
## 10544 bytes

Note that though many commands behave exactly the same whether on a local data.frame or database, sometimes more advanced / complicated syntax around variable modification allowed in dplyr does not work on a particular database and so it is better to be more explicit. For example, creating a variable inside of a summarize call does not work in some databases. .

acs_m %>% 
  group_by(sex) %>%
  summarize(hicov = mean(hicov == 1))
## # A tibble: 2 × 2
##     sex hicov
##   <int> <dbl>
## 1     1 0.858
## 2     2 0.895
# Works in RSQlite, but didn't in the now defunct MonetDB
# acs_m_db %>% 
#   group_by(sex) %>%
#   summarize(hicov = mean(hicov == 1))
#
# > Error in .local(conn, statement, ...) :
# > Unable to execute statement....
# > ....

# Creating the variable separately works as an integer works though
acs_m_db %>% 
  group_by(sex) %>%
  mutate(hicov = ifelse(hicov == 1, 1L, 0L)) %>%
  summarize(hicov = mean(hicov))
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.37.0 [:memory:]
##     sex hicov
##   <int> <dbl>
## 1     1 0.858
## 2     2 0.895

Further, sometimes working with variable types can get difficult if you are used to working in R. Notice how in the above, instead of hicov = (hicov == 1), I wrote out the ifelse statement. If I hadn’t RSQLite would be unable to calculate the mean of the boolean variable created.

Finally, a major difference when transitioning from dplyr on local data.frames is that not all R functions are translated to SQL. For example, cut() isn’t implemented in SQL, so you can’t create a new variable in the data.frame using it.

acs_m %>% 
  group_by(agecat = cut(agep, c(0, 19, 35, 50, 65, 200))) %>%
  summarize(hicov = mean(hicov == 1))
## # A tibble: 6 × 2
##   agecat   hicov
##   <fct>    <dbl>
## 1 (0,19]   0.908
## 2 (19,35]  0.795
## 3 (35,50]  0.837
## 4 (50,65]  0.873
## 5 (65,200] 0.992
## 6 <NA>     0.953
# acs_m_db %>% 
#   group_by(agecat = cut(agep, c(0, 19, 35, 50, 65, 200))) %>%
#   summarize(hicov = mean(hicov == 1))
#
# > Error in .local(conn, statement, ...) :
# > Unable to execute statement....
# > ...

acs_m_db %>% 
  mutate(agecat = ifelse(agep < 19, "0-18", 
                         ifelse(agep >= 19 & agep < 35, "19-34", 
                                ifelse(agep >= 35 & agep < 50, "35-49", 
                                       ifelse(agep >= 50 & agep < 65, "50-64", 
                                              ifelse(agep >= 65, "65+", NA)))))) %>%
  group_by(agecat) %>% 
  summarize(hicov = mean(hicov))
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.37.0 [:memory:]
##   agecat hicov
##   <chr>  <dbl>
## 1 0-18    1.09
## 2 19-34   1.20
## 3 35-49   1.17
## 4 50-64   1.13
## 5 65+     1.01

For more information on the specifics of databases with dplyr, see vignette("database", package = "dplyr"), the DBI package or the specific database packages, like RSQLite.

Srvyr Setup

Srvyr commands are nearly identical to old. The only difference for setup is that you need a variable that uniquely identifies each row in the database (uid).

acs_m_db_svy <- acs_m_db %>% 
  as_survey_rep(
    weight = pwgtp,
    repweights = matches("pwgtp[0-9]+") ,
    scale = 4 / 80,
    rscales = rep(1 , 80),
    mse = TRUE,
    type = "JK1", 
    variables = -c(matches("^pwgtp"))
  )

acs_m_db_svy
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 80 replicates and MSE variances.
## Sampling variables:
##  - repweights: `pwgtp1 + pwgtp2 + pwgtp3 + pwgtp4 + pwgtp5 + pwgtp6 + pwgtp7 + pwgtp8 + pwgtp9 + pwgtp10 + pwgtp11 + pwgtp12 + pwgtp13 + pwgtp14 + pwgtp15 + pwgtp16 + pwgtp17 + pwgtp18 + pwgtp19 + pwgtp20 + pwgtp21 + pwgtp22 + pwgtp23 + pwgtp24 + pwgtp25 + pwgtp26 + pwgtp27 + pwgtp28 + pwgtp29 + pwgtp30 + pwgtp31 + pwgtp32 + pwgtp33 + pwgtp34 + pwgtp35 + pwgtp36 + pwgtp37 + pwgtp38 + pwgtp39 + pwgtp40 + pwgtp41 + pwgtp42 + pwgtp43 + pwgtp44 + pwgtp45 + pwgtp46 + pwgtp47 + pwgtp48 + pwgtp49 + pwgtp50 + pwgtp51 + \n    pwgtp52 + pwgtp53 + pwgtp54 + pwgtp55 + pwgtp56 + pwgtp57 + pwgtp58 + pwgtp59 + pwgtp60 + pwgtp61 + pwgtp62 + pwgtp63 + pwgtp64 + pwgtp65 + pwgtp66 + pwgtp67 + pwgtp68 + pwgtp69 + pwgtp70 + pwgtp71 + pwgtp72 + pwgtp73 + pwgtp74 + pwgtp75 + pwgtp76 + pwgtp77 + pwgtp78 + pwgtp79 + pwgtp80`
##  - weights: pwgtp
## Data variables: serialno (int), sporder (chr), pwgtp (int), pwgtp1 (int),
##   pwgtp2 (int), pwgtp3 (int), pwgtp4 (int), pwgtp5 (int), pwgtp6 (int), pwgtp7
##   (int), pwgtp8 (int), pwgtp9 (int), pwgtp10 (int), pwgtp11 (int), pwgtp12
##   (int), pwgtp13 (int), pwgtp14 (int), pwgtp15 (int), pwgtp16 (int), pwgtp17
##   (int), pwgtp18 (int), pwgtp19 (int), pwgtp20 (int), pwgtp21 (int), pwgtp22
##   (int), pwgtp23 (int), pwgtp24 (int), pwgtp25 (int), pwgtp26 (int), pwgtp27
##   (int), pwgtp28 (int), pwgtp29 (int), pwgtp30 (int), pwgtp31 (int), pwgtp32
##   (int), pwgtp33 (int), pwgtp34 (int), pwgtp35 (int), pwgtp36 (int), pwgtp37
##   (int), pwgtp38 (int), pwgtp39 (int), pwgtp40 (int), pwgtp41 (int), pwgtp42
##   (int), pwgtp43 (int), pwgtp44 (int), pwgtp45 (int), pwgtp46 (int), pwgtp47
##   (int), pwgtp48 (int), pwgtp49 (int), pwgtp50 (int), pwgtp51 (int), pwgtp52
##   (int), pwgtp53 (int), pwgtp54 (int), pwgtp55 (int), pwgtp56 (int), pwgtp57
##   (int), pwgtp58 (int), pwgtp59 (int), pwgtp60 (int), pwgtp61 (int), pwgtp62
##   (int), pwgtp63 (int), pwgtp64 (int), pwgtp65 (int), pwgtp66 (int), pwgtp67
##   (int), pwgtp68 (int), pwgtp69 (int), pwgtp70 (int), pwgtp71 (int), pwgtp72
##   (int), pwgtp73 (int), pwgtp74 (int), pwgtp75 (int), pwgtp76 (int), pwgtp77
##   (int), pwgtp78 (int), pwgtp79 (int), pwgtp80 (int), agep (int), hicov (int),
##   sex (int), st (chr), rt (chr)

Because srvyr stores the survey variables locally, the srvyr object takes up much more memory than the dplyr one. However, this object would not grow in size if you added more data variables to your survey, so if your survey is very wide, it will save a lot space.

object.size(acs_m_db_svy)
## 8391576 bytes

Analysis

Analysis commands from srvyr are also similar to ones that work on local data.frames. The main differences come from the issues discussed above about explicitly creating variables difficulties in translating R commands, and variable types.

The following analysis is based on the asdfree analysis and shows some basic analysis on the total population, insurance coverage, age and sex.

# You can calculate the population of the united states #
# by state
acs_m_db_svy %>%
  mutate(one = 1L) %>% # Note that because of weird behavior of MonetDB, need to use 1L not just 1
  group_by(st) %>% 
  summarize(count = survey_total(one))
## Adding missing grouping variables: `st`
## # A tibble: 2 × 3
##   st      count count_se
##   <chr>   <dbl>    <dbl>
## 1 02     722718        0
## 2 15    1374810        0
# Or the average age of downloaded states
acs_m_db_svy %>%
  summarize(agep = survey_mean(agep, na.rm = TRUE))
## # A tibble: 1 × 2
##    agep agep_se
##   <dbl>   <dbl>
## 1  37.4  0.0532
# Average age by state
acs_m_db_svy %>%
  group_by(st) %>% 
  summarize(agep = survey_mean(agep, na.rm = TRUE))
## Adding missing grouping variables: `st`
## # A tibble: 2 × 3
##   st     agep agep_se
##   <chr> <dbl>   <dbl>
## 1 02     34.6  0.133 
## 2 15     38.8  0.0404
# percent uninsured - nationwide (of downloaded states)
acs_m_db_svy %>%
  mutate(hicov = as.character(hicov)) %>% 
  group_by(hicov) %>% 
  summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `hicov`
## # A tibble: 2 × 3
##   hicov   pct  pct_se
##   <chr> <dbl>   <dbl>
## 1 1     0.885 0.00339
## 2 2     0.115 0.00339
# by state
acs_m_db_svy %>%
  mutate(hicov = as.character(hicov)) %>% 
  group_by(st, hicov) %>% 
  summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `st`, `hicov`
## # A tibble: 4 × 4
## # Groups:   st [2]
##   st    hicov    pct  pct_se
##   <chr> <chr>  <dbl>   <dbl>
## 1 02    1     0.802  0.00793
## 2 02    2     0.198  0.00793
## 3 15    1     0.928  0.00383
## 4 15    2     0.0719 0.00383
# 25th, median, and 75th percentile of age of residents of the united states (downloaded states)
acs_m_db_svy %>%
  summarize(agep = survey_quantile(agep, c(0.25, 0.5, 0.75), na.rm = TRUE))
## # A tibble: 1 × 6
##   agep_q25 agep_q50 agep_q75 agep_q25_se agep_q50_se agep_q75_se
##      <dbl>    <dbl>    <dbl>       <dbl>       <dbl>       <dbl>
## 1       19       36       55       0.251       0.251       0.251
# Filter works, so we can restrict the acs_m object to females only
acs_m_db_svy_female <- acs_m_db_svy %>%
  filter(sex == 2)

# Now any of the above commands can be re-run using the acs_m_female object
# instead of the acs_m object in order to analyze females only

# This is equivalent to using acs_m, and applying the filter every time.

# average age - nationwide (of downloaded states), restricted to females
acs_m_db_svy_female %>%
  summarize(agep = survey_mean(agep, na.rm = TRUE))
## # A tibble: 1 × 2
##    agep agep_se
##   <dbl>   <dbl>
## 1  38.2   0.103
# median age - nationwide (of downloaded states), restricted to females
acs_m_db_svy_female %>%
  summarize(agep = survey_median(agep, na.rm = TRUE))
## # A tibble: 1 × 2
##    agep agep_se
##   <dbl>   <dbl>
## 1    37   0.251
# Note that though some R functions are translated by dplyr into SQL, not
# all of them are. For example, when constructing a new age category 
# variable in the dataset neither findIntervals nor cut work on databases, 
# so we have to spell out groups with ifelse()
acs_m_db_svy %>% 
  mutate(agecat = ifelse(agep < 19, "0-18", 
                         ifelse(agep >= 19 & agep < 35, "19-34", 
                                ifelse(agep >= 35 & agep < 50, "35-49", 
                                       ifelse(agep >= 50 & agep < 65, "50-64", 
                                              ifelse(agep >= 65, "65+", NA)))))) %>%
  group_by(agecat) %>% 
  summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `agecat`
## # A tibble: 5 × 3
##   agecat   pct   pct_se
##   <chr>  <dbl>    <dbl>
## 1 0-18   0.248 0.00101 
## 2 19-34  0.228 0.00172 
## 3 35-49  0.198 0.00176 
## 4 50-64  0.202 0.00134 
## 5 65+    0.126 0.000980

Advanced Topics

Running survey commands with collect

If you’d like to run a command from the survey package, you’ll need to collect the data locally first. You can select only the variables you’ll need for the analysis so that you don’t have to store the whole dataset in memory.

acs_m_db_svy %>%
  select(agep, hicov, sex) %>%
  collect() %>%
  {survey::svyglm(hicov ~ sex + agep, .)} %>%
  summary()
## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced

## Warning in log(wt): NaNs produced
## 
## Call:
## survey::svyglm(formula = hicov ~ sex + agep, .)
## 
## Survey design:
## Called via srvyr
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.1882942  0.0101068 117.573  < 2e-16 ***
## sex         -0.0309300  0.0048065  -6.435 9.56e-09 ***
## agep        -0.0007165  0.0001143  -6.269 1.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2165.145)
## 
## Number of Fisher Scoring iterations: 2

Write Access

Note that srvyr does not require write access to perform calculations, the database created in this vignette was set to read-only at the beginning. This can be important when you want to make sure that your original data is not altered accidentally, or if you don’t have write access to a database.