## Loading required package: RSQLite
Srvyr 0.3 has a completely rewritten database backend. Using databases that are already stored dplyr’s tbl_lazy
objects is now just as easy as working with data stored in regular R data.frames as you don’t need to have a unique identifier. Additionally, it now works more similarly to the survey package’s database code and so shouldn’t be any slower.
During development, I have tested using SQLite (and the now defunct MonetDBLite) databases, but in theory other database backends should work as well.
This vignette shows the basics of how to use srvyr with databases. It is based on analysis from the wonderful resource asdfree ( website and github ). Many thanks to ajdamico and collaborators. Specifically, I have adapted code from American Community Survey - 2011 single year analysis and the associated data preparation scripts.
In order to focus on srvyr and databases, we start with a prepared dataset. The full code is available on Github, and the high level description of what it does is:
Download data from acs website (currently only Alaska and Hawaii to save time, though it would be easy to adapt to download to all 50 states and Puerto Rico).
Merges the household and person datasets so that we can look at the variables related to each person including those at the household level
Selects only a few variables that will be used in this analysis to save space, but again it could easily be adapted to keep all of the variables.
For more information on the specifics of the American Community Survey, see the asdfree site. Now, our code loads this prepared dataset, initiates a SQLite database, and puts the data into the dataset.
suppressMessages({
library(survey)
library(srvyr)
library(dplyr)
library(dbplyr)
library(RSQLite)
})
# Load data (Currently only Alaska and Hawaii to keep file size small and with
# limited variables, butcode that downloaded the files is available here
# https://github.com/gergness/srvyr/blob/main/vignettes/save_acs_data.R
# and could easily be adapted to download all states.)
load("acs_m.Rdata") # acs_m data
# Set up database and table
<- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db <- copy_to(db, acs_m, "acs_m", temporary = FALSE)
acs_m_db
# Or, if the data was already stored in the database, you could do this
# acs_m_data <- tbl(db, sql("SELECT * FROM acs_m"))
Now that we have the data in the database, we can interact with the database directly using sql commands, or we can use dplyr’s functionality to treat it mostly the same as a local data.frame
. However, the data is not stored in memory, so we could work with much larger datasets (though in this case, the data is too small for this to be a problem).
# Same results
%>%
acs_m group_by(sex) %>%
summarize(hicov = mean(hicov))
## # A tibble: 2 × 2
## sex hicov
## <int> <dbl>
## 1 1 1.14
## 2 2 1.10
%>%
acs_m_db group_by(sex) %>%
summarize(hicov = mean(hicov))
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.37.0 [:memory:]
## sex hicov
## <int> <dbl>
## 1 1 1.14
## 2 2 1.10
# But smaller object size
object.size(acs_m)
## 7777312 bytes
object.size(acs_m_db)
## 10544 bytes
Note that though many commands behave exactly the same whether on a local data.frame or database, sometimes more advanced / complicated syntax around variable modification allowed in dplyr does not work on a particular database and so it is better to be more explicit. For example, creating a variable inside of a summarize call does not work in some databases. .
%>%
acs_m group_by(sex) %>%
summarize(hicov = mean(hicov == 1))
## # A tibble: 2 × 2
## sex hicov
## <int> <dbl>
## 1 1 0.858
## 2 2 0.895
# Works in RSQlite, but didn't in the now defunct MonetDB
# acs_m_db %>%
# group_by(sex) %>%
# summarize(hicov = mean(hicov == 1))
#
# > Error in .local(conn, statement, ...) :
# > Unable to execute statement....
# > ....
# Creating the variable separately works as an integer works though
%>%
acs_m_db group_by(sex) %>%
mutate(hicov = ifelse(hicov == 1, 1L, 0L)) %>%
summarize(hicov = mean(hicov))
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.37.0 [:memory:]
## sex hicov
## <int> <dbl>
## 1 1 0.858
## 2 2 0.895
Further, sometimes working with variable types can get difficult if you are used to working in R. Notice how in the above, instead of hicov = (hicov == 1)
, I wrote out the ifelse statement. If I hadn’t RSQLite would be unable to calculate the mean of the boolean variable created.
Finally, a major difference when transitioning from dplyr on local data.frames is that not all R functions are translated to SQL. For example, cut()
isn’t implemented in SQL, so you can’t create a new variable in the data.frame using it.
%>%
acs_m group_by(agecat = cut(agep, c(0, 19, 35, 50, 65, 200))) %>%
summarize(hicov = mean(hicov == 1))
## # A tibble: 6 × 2
## agecat hicov
## <fct> <dbl>
## 1 (0,19] 0.908
## 2 (19,35] 0.795
## 3 (35,50] 0.837
## 4 (50,65] 0.873
## 5 (65,200] 0.992
## 6 <NA> 0.953
# acs_m_db %>%
# group_by(agecat = cut(agep, c(0, 19, 35, 50, 65, 200))) %>%
# summarize(hicov = mean(hicov == 1))
#
# > Error in .local(conn, statement, ...) :
# > Unable to execute statement....
# > ...
%>%
acs_m_db mutate(agecat = ifelse(agep < 19, "0-18",
ifelse(agep >= 19 & agep < 35, "19-34",
ifelse(agep >= 35 & agep < 50, "35-49",
ifelse(agep >= 50 & agep < 65, "50-64",
ifelse(agep >= 65, "65+", NA)))))) %>%
group_by(agecat) %>%
summarize(hicov = mean(hicov))
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.37.0 [:memory:]
## agecat hicov
## <chr> <dbl>
## 1 0-18 1.09
## 2 19-34 1.20
## 3 35-49 1.17
## 4 50-64 1.13
## 5 65+ 1.01
For more information on the specifics of databases with dplyr, see vignette("database", package = "dplyr")
, the DBI
package or the specific database packages, like RSQLite
.
Srvyr commands are nearly identical to old. The only difference for setup is that you need a variable that uniquely identifies each row in the database (uid).
<- acs_m_db %>%
acs_m_db_svy as_survey_rep(
weight = pwgtp,
repweights = matches("pwgtp[0-9]+") ,
scale = 4 / 80,
rscales = rep(1 , 80),
mse = TRUE,
type = "JK1",
variables = -c(matches("^pwgtp"))
)
acs_m_db_svy
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 80 replicates and MSE variances.
## Sampling variables:
## - repweights: `pwgtp1 + pwgtp2 + pwgtp3 + pwgtp4 + pwgtp5 + pwgtp6 + pwgtp7 + pwgtp8 + pwgtp9 + pwgtp10 + pwgtp11 + pwgtp12 + pwgtp13 + pwgtp14 + pwgtp15 + pwgtp16 + pwgtp17 + pwgtp18 + pwgtp19 + pwgtp20 + pwgtp21 + pwgtp22 + pwgtp23 + pwgtp24 + pwgtp25 + pwgtp26 + pwgtp27 + pwgtp28 + pwgtp29 + pwgtp30 + pwgtp31 + pwgtp32 + pwgtp33 + pwgtp34 + pwgtp35 + pwgtp36 + pwgtp37 + pwgtp38 + pwgtp39 + pwgtp40 + pwgtp41 + pwgtp42 + pwgtp43 + pwgtp44 + pwgtp45 + pwgtp46 + pwgtp47 + pwgtp48 + pwgtp49 + pwgtp50 + pwgtp51 + \n pwgtp52 + pwgtp53 + pwgtp54 + pwgtp55 + pwgtp56 + pwgtp57 + pwgtp58 + pwgtp59 + pwgtp60 + pwgtp61 + pwgtp62 + pwgtp63 + pwgtp64 + pwgtp65 + pwgtp66 + pwgtp67 + pwgtp68 + pwgtp69 + pwgtp70 + pwgtp71 + pwgtp72 + pwgtp73 + pwgtp74 + pwgtp75 + pwgtp76 + pwgtp77 + pwgtp78 + pwgtp79 + pwgtp80`
## - weights: pwgtp
## Data variables: serialno (int), sporder (chr), pwgtp (int), pwgtp1 (int),
## pwgtp2 (int), pwgtp3 (int), pwgtp4 (int), pwgtp5 (int), pwgtp6 (int), pwgtp7
## (int), pwgtp8 (int), pwgtp9 (int), pwgtp10 (int), pwgtp11 (int), pwgtp12
## (int), pwgtp13 (int), pwgtp14 (int), pwgtp15 (int), pwgtp16 (int), pwgtp17
## (int), pwgtp18 (int), pwgtp19 (int), pwgtp20 (int), pwgtp21 (int), pwgtp22
## (int), pwgtp23 (int), pwgtp24 (int), pwgtp25 (int), pwgtp26 (int), pwgtp27
## (int), pwgtp28 (int), pwgtp29 (int), pwgtp30 (int), pwgtp31 (int), pwgtp32
## (int), pwgtp33 (int), pwgtp34 (int), pwgtp35 (int), pwgtp36 (int), pwgtp37
## (int), pwgtp38 (int), pwgtp39 (int), pwgtp40 (int), pwgtp41 (int), pwgtp42
## (int), pwgtp43 (int), pwgtp44 (int), pwgtp45 (int), pwgtp46 (int), pwgtp47
## (int), pwgtp48 (int), pwgtp49 (int), pwgtp50 (int), pwgtp51 (int), pwgtp52
## (int), pwgtp53 (int), pwgtp54 (int), pwgtp55 (int), pwgtp56 (int), pwgtp57
## (int), pwgtp58 (int), pwgtp59 (int), pwgtp60 (int), pwgtp61 (int), pwgtp62
## (int), pwgtp63 (int), pwgtp64 (int), pwgtp65 (int), pwgtp66 (int), pwgtp67
## (int), pwgtp68 (int), pwgtp69 (int), pwgtp70 (int), pwgtp71 (int), pwgtp72
## (int), pwgtp73 (int), pwgtp74 (int), pwgtp75 (int), pwgtp76 (int), pwgtp77
## (int), pwgtp78 (int), pwgtp79 (int), pwgtp80 (int), agep (int), hicov (int),
## sex (int), st (chr), rt (chr)
Because srvyr stores the survey variables locally, the srvyr object takes up much more memory than the dplyr one. However, this object would not grow in size if you added more data variables to your survey, so if your survey is very wide, it will save a lot space.
object.size(acs_m_db_svy)
## 8391576 bytes
Analysis commands from srvyr are also similar to ones that work on local data.frames. The main differences come from the issues discussed above about explicitly creating variables difficulties in translating R commands, and variable types.
The following analysis is based on the asdfree analysis and shows some basic analysis on the total population, insurance coverage, age and sex.
# You can calculate the population of the united states #
# by state
%>%
acs_m_db_svy mutate(one = 1L) %>% # Note that because of weird behavior of MonetDB, need to use 1L not just 1
group_by(st) %>%
summarize(count = survey_total(one))
## Adding missing grouping variables: `st`
## # A tibble: 2 × 3
## st count count_se
## <chr> <dbl> <dbl>
## 1 02 722718 0
## 2 15 1374810 0
# Or the average age of downloaded states
%>%
acs_m_db_svy summarize(agep = survey_mean(agep, na.rm = TRUE))
## # A tibble: 1 × 2
## agep agep_se
## <dbl> <dbl>
## 1 37.4 0.0532
# Average age by state
%>%
acs_m_db_svy group_by(st) %>%
summarize(agep = survey_mean(agep, na.rm = TRUE))
## Adding missing grouping variables: `st`
## # A tibble: 2 × 3
## st agep agep_se
## <chr> <dbl> <dbl>
## 1 02 34.6 0.133
## 2 15 38.8 0.0404
# percent uninsured - nationwide (of downloaded states)
%>%
acs_m_db_svy mutate(hicov = as.character(hicov)) %>%
group_by(hicov) %>%
summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `hicov`
## # A tibble: 2 × 3
## hicov pct pct_se
## <chr> <dbl> <dbl>
## 1 1 0.885 0.00339
## 2 2 0.115 0.00339
# by state
%>%
acs_m_db_svy mutate(hicov = as.character(hicov)) %>%
group_by(st, hicov) %>%
summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `st`, `hicov`
## # A tibble: 4 × 4
## # Groups: st [2]
## st hicov pct pct_se
## <chr> <chr> <dbl> <dbl>
## 1 02 1 0.802 0.00793
## 2 02 2 0.198 0.00793
## 3 15 1 0.928 0.00383
## 4 15 2 0.0719 0.00383
# 25th, median, and 75th percentile of age of residents of the united states (downloaded states)
%>%
acs_m_db_svy summarize(agep = survey_quantile(agep, c(0.25, 0.5, 0.75), na.rm = TRUE))
## # A tibble: 1 × 6
## agep_q25 agep_q50 agep_q75 agep_q25_se agep_q50_se agep_q75_se
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 36 55 0.251 0.251 0.251
# Filter works, so we can restrict the acs_m object to females only
<- acs_m_db_svy %>%
acs_m_db_svy_female filter(sex == 2)
# Now any of the above commands can be re-run using the acs_m_female object
# instead of the acs_m object in order to analyze females only
# This is equivalent to using acs_m, and applying the filter every time.
# average age - nationwide (of downloaded states), restricted to females
%>%
acs_m_db_svy_female summarize(agep = survey_mean(agep, na.rm = TRUE))
## # A tibble: 1 × 2
## agep agep_se
## <dbl> <dbl>
## 1 38.2 0.103
# median age - nationwide (of downloaded states), restricted to females
%>%
acs_m_db_svy_female summarize(agep = survey_median(agep, na.rm = TRUE))
## # A tibble: 1 × 2
## agep agep_se
## <dbl> <dbl>
## 1 37 0.251
# Note that though some R functions are translated by dplyr into SQL, not
# all of them are. For example, when constructing a new age category
# variable in the dataset neither findIntervals nor cut work on databases,
# so we have to spell out groups with ifelse()
%>%
acs_m_db_svy mutate(agecat = ifelse(agep < 19, "0-18",
ifelse(agep >= 19 & agep < 35, "19-34",
ifelse(agep >= 35 & agep < 50, "35-49",
ifelse(agep >= 50 & agep < 65, "50-64",
ifelse(agep >= 65, "65+", NA)))))) %>%
group_by(agecat) %>%
summarize(pct = survey_mean(na.rm = TRUE))
## Adding missing grouping variables: `agecat`
## # A tibble: 5 × 3
## agecat pct pct_se
## <chr> <dbl> <dbl>
## 1 0-18 0.248 0.00101
## 2 19-34 0.228 0.00172
## 3 35-49 0.198 0.00176
## 4 50-64 0.202 0.00134
## 5 65+ 0.126 0.000980
If you’d like to run a command from the survey package, you’ll need to collect the data locally first. You can select only the variables you’ll need for the analysis so that you don’t have to store the whole dataset in memory.
%>%
acs_m_db_svy select(agep, hicov, sex) %>%
collect() %>%
::svyglm(hicov ~ sex + agep, .)} %>%
{surveysummary()
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
## Warning in log(wt): NaNs produced
##
## Call:
## survey::svyglm(formula = hicov ~ sex + agep, .)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1882942 0.0101068 117.573 < 2e-16 ***
## sex -0.0309300 0.0048065 -6.435 9.56e-09 ***
## agep -0.0007165 0.0001143 -6.269 1.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2165.145)
##
## Number of Fisher Scoring iterations: 2
Note that srvyr does not require write access to perform calculations, the database created in this vignette was set to read-only at the beginning. This can be important when you want to make sure that your original data is not altered accidentally, or if you don’t have write access to a database.