This small package contains a few utility functions for querying electronic health record (EHR) data in OMOP Common Data Model databases using a tidyverse
approach based on dbplyr
lazy tibble references. This allows efficient in-database querying and data wrangling without explicit writing of SQL queries. This vignette assumes a basic familiarity with OMOP CDM databases and tidyverse tools such as dplyr
. Anyone not familiar with dbplyr
(note the b
) should read the dbplyr
vignette before proceeding further.
The main substantive content of this package is a function, concept_names
, which joins a given lazy tibble reference against the CDM concept
table to add in human readable concept names. The package should work with any version of the OMOP CDM. Essentially the only thing the package assumes or requires is that there will be a concept
table containing at least the columns concept_id
and concept_name
.
It is assumed that users of this package will already have access to an RDBMS containing EHR data data according to the OMOP CDM. Connection details will be highly specific to the database, but this package is relatively unopinionated about the nature of the DBI
connection. In particular, both odbc
connections, and custom database connections, such as RPostgres
are fine. A minimal example might look something like:
but this must be customised for your own specific database details.
The package is initialised by generating a list of available CDM tables with
This should return an error if CDM tables are not found at the given connection. Note that this will work provided that the CDM tables are in the default RDBMS schema (often called public
or dbo
), but if a non-default schema is used (here, for example, cdm
), then this must be provided
The returned value, here called tRefs
, is a named list of the available CDM tables as lazy dbplyr
tibbles. So, for example,
is the person
table as a lazy tibble. Tables can be joined using standard dplyr
conventions, eg.
and the result will again be a lazy tibble reference. The point of this is that all of the joining and wrangling happens in the database, and not in the R session. For big databases this can be a massive advantage.
Note that the package includes a small function, row_counts
which will compute the number of populated rows in each CDM table:
For a large database, this query will be very slow, and the tibble returned will not be lazy.
The main function in the package is the function concept_names
.
For each column in the input table of the form XXX_concept_id
, this function adds a new column of the form XXX_concept_name
(provided that it does not already exist), obtained by a left join against the CDM concept
table. Again, the result is a lazy tibble, so the joining happens inside the RDBMS.
If not all concept names require resolving, a list of the required joins can be provided.
Since the resolving is carried out as a left join, any non-matching concept IDs appear as NA in the new column. In some cases this will be appropriate, but in others, it might be more useful to copy across a string representation of the concept ID in the non-matching cases. This can be obtained using the fill
argument.
Again, the result is a lazy tibble. The function is very much designed to be used with lazy tibbles, so that the joins happen in the RDBMS and not in in-memory in the R user session. So by default, the join will fail if applied to an in-memory tibble. In this case, it is possible to force an in-memory join by using the copy
argument.
but note that this join will be very slow for large databases.
Really the whole point of this package is to facilitate the querying and wrangling of data using tidyverse tools in R code rather than by explicitly writing SQL queries, but it is worth noting in passing that it is possible to create lazy dbplyr
tibbles directly from SQL statements.