Setting up ricu requires downloading datasets from several platforms. Two data sources, mimic_demo and eicu_demo, are available directly as R packages hosted on GitHub. The respective full-featured versions, mimic and eicu, as well as the hirid dataset, are available from PhysioNet, while access to the remaining standard dataset, aumc, is available from yet another website. The following steps guide through package installation and data source setup, and conclude with some example data queries.
Stable package releases are available from CRAN as
install.packages("ricu")
and the latest development version is available from GitHub as
remotes::install_github("eth-mds/ricu")
The demo datasets mimic_demo and eicu_demo are listed as Suggests dependencies, so their availability is determined by the value passed as dependencies to the above package installation function.
The following call explicitly installs the demo data set packages:
install.packages(
c("mimic.demo", "eicu.demo"),
repos = "https://eth-mds.github.io/physionet-demo"
)
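Whether the demo data packages ended up installed can be checked with base R alone. The following is a minimal sketch; the variable names `demo_pkgs` and `installed` are chosen here for illustration and are not part of ricu.

```r
# Check which of the demo data packages are available in the current library.
# Uses base R only; `demo_pkgs` and `installed` are illustrative names.
demo_pkgs <- c("mimic.demo", "eicu.demo")

installed <- vapply(demo_pkgs, requireNamespace, logical(1L), quietly = TRUE)

print(installed)
```

A value of FALSE for either entry suggests re-running the installation call above.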
Included with ricu are functions for download and setup of the following datasets: mimic (MIMIC-III), eicu, hirid, aumc and miiv (MIMIC-IV), which can be invoked in several different ways.
A directory is needed where the data can be stored; its location can be set using the environment variable RICU_DATA_PATH. The current value can be retrieved by calling data_dir().
If any of the datasets has already been downloaded in .csv form, it can be decompressed and copied to an appropriate sub-folder (mimic, eicu, hirid or aumc) of the directory identified by data_dir().
To have ricu download the required data, login credentials can be supplied as environment variables RICU_PHYSIONET_USER/RICU_PHYSIONET_PASS and RICU_AUMC_TOKEN (the string that follows token= in the download URL received from the AUMCdb data owners) or entered into the terminal manually in interactive sessions.
ricu converts .csv files into a binary format using the fst package.
Conversion to .fst format (and potentially data download) is automatically triggered upon first access of a table. In interactive sessions, the user is asked for permission to set up the given data source, and in non-interactive sessions, access to missing data throws an error.
Alternatively, data sources can be set up manually using setup_src_data().
Many commonly used clinical data concepts are available for all data sources where the required data exists. An overview of available concepts can be obtained by calling explain_dictionary(), and concepts can be loaded using load_concepts():
src <- "mimic_demo"
demo <- c(src, "eicu_demo")
head(explain_dictionary(src = demo))
#> name category description
#> 1 abx medications antibiotics
#> 2 adh_rate medications vasopressin rate
#> 3 adm demographics patient admission type
#> 4 age demographics patient age
#> 5 alb chemistry albumin
#> 6 alp chemistry alkaline phosphatase
load_concepts("alb", src, verbose = FALSE)
#> # A `ts_tbl`: 297 ✖ 3
#> # Id var: `icustay_id`
#> # Units: `alb` [g/dL]
#> # Index var: `charttime` (1 hours)
#> icustay_id charttime alb
#> <int> <drtn> <dbl>
#> 1 201006 0 hours 2.4
#> 2 203766 -18 hours 2
#> 3 203766 4 hours 1.7
#> 4 204132 7 hours 3.6
#> 5 204201 9 hours 2.3
#> …
#> 293 298685 130 hours 1.9
#> 294 298685 154 hours 2
#> 295 298685 203 hours 2
#> 296 298685 272 hours 2.2
#> 297 298685 299 hours 2.5
#> # … with 287 more rows
Concepts representing time-dependent measurements are loaded as ts_tbl objects, whereas static information is retrieved as id_tbl objects. Both classes inherit from data.table (and therefore also from data.frame) and can be coerced to either of these base classes using as.data.table() and as.data.frame(), respectively. Using data.table 'by-reference' operations, this is available as a zero-copy operation by passing by_ref = TRUE (see footnote 1).
(dat <- load_concepts("height", src, verbose = FALSE))
#> # An `id_tbl`: 63 ✖ 2
#> # Id var: `icustay_id`
#> # Units: `height` [cm]
#> icustay_id height
#> <int> <dbl>
#> 1 201006 157.
#> 2 201204 163.
#> 3 203766 165.
#> 4 204132 165.
#> 5 204201 157.
#> …
#> 59 293429 155.
#> 60 295043 165.
#> 61 295741 175.
#> 62 296804 173.
#> 63 298685 175.
#> # … with 53 more rows
head(tmp <- as.data.frame(dat, by_ref = TRUE))
#> icustay_id height
#> 1 201006 157.48
#> 2 201204 162.56
#> 3 203766 165.10
#> 4 204132 165.10
#> 5 204201 157.48
#> 6 210989 175.26
identical(dat, tmp)
#> [1] TRUE
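The fact that identical(dat, tmp) holds after the conversion reflects by-reference semantics: the original object itself changes class. The same effect can be reproduced with plain data.table, as in the sketch below. It assumes the data.table package is installed, uses a toy table with made-up values, and does not claim that ricu relies on setDF() internally.

```r
library(data.table)

# Toy table standing in for the `height` concept shown above
# (IDs and values are made up for illustration).
dt <- data.table(icustay_id = c(1L, 2L, 3L), height = c(157.5, 162.6, 165.1))

# setDF() converts in place: no copy is made and `dt` itself changes class.
df <- setDF(dt)

class(dt)          # `dt` is now a plain data.frame
identical(dt, df)  # both names refer to the same converted object
```

With the default by-value behaviour, the original object would keep its class and the converted copy would differ from it.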
Many functions exported by ricu use id_tbl and ts_tbl objects in order to enable more concise semantics. Merging an id_tbl with a ts_tbl, for example, will automatically use the columns identified by id_vars() of both tables as by.x/by.y arguments, while for two ts_tbl objects, the respective columns reported by id_vars() and index_var() will be used to merge on.
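The merge behaviour described above might look as follows in practice. This is a sketch only: it assumes that ricu and the mimic.demo data package are installed and set up, and the object names `alb`, `height` and `merged` are chosen here for illustration.

```r
library(ricu)

# Time-dependent concept (a ts_tbl) and static concept (an id_tbl),
# both loaded from the MIMIC-III demo data source.
alb    <- load_concepts("alb", "mimic_demo", verbose = FALSE)
height <- load_concepts("height", "mimic_demo", verbose = FALSE)

# No `by` arguments are given; the ID columns reported by id_vars()
# are expected to be picked up automatically.
merged <- merge(alb, height)

id_vars(merged)
head(merged)
```

Each albumin measurement should then carry the height of the corresponding ICU stay.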
When loading from multiple data sources simultaneously, load_concepts() will add a source column (which will be among the id_vars() of the resulting object), making it possible to identify the stay IDs corresponding to the individual data sources.
load_concepts("weight", demo, verbose = FALSE)
#> # An `id_tbl`: 2,434 ✖ 3
#> # Id vars: `source`, `icustay_id`
#> # Units: `weight` [kg]
#> source icustay_id weight
#> <chr> <int> <dbl>
#> 1 eicu_demo 141765 46.5
#> 2 eicu_demo 143870 77.5
#> 3 eicu_demo 144815 60.3
#> 4 eicu_demo 145427 91.7
#> 5 eicu_demo 147307 72.5
#> …
#> 2,430 mimic_demo 295043 96.6
#> 2,431 mimic_demo 295741 81.6
#> 2,432 mimic_demo 296804 71
#> 2,433 mimic_demo 297782 78.8
#> 2,434 mimic_demo 298685 52
#> # … with 2,424 more rows
In addition to the ~100 concepts that are available by default, user-defined concepts can be added either as R objects or, more robustly, as JSON configuration files.
Data concepts consist of zero, one, or several data items per data source, encoding how to retrieve the corresponding data. The constructors concept() and item() can be used to instantiate concepts as R objects.
ldh <- concept("ldh",
  item("mimic_demo", "labevents", "itemid", 50954),
  description = "Lactate dehydrogenase",
  unit = "IU/L"
)
load_concepts(ldh, verbose = FALSE)
#> # A `ts_tbl`: 365 ✖ 3
#> # Id var: `icustay_id`
#> # Units: `ldh` [IU/L]
#> # Index var: `charttime` (1 hours)
#> icustay_id charttime ldh
#> <int> <drtn> <dbl>
#> 1 201006 -45 hours 249
#> 2 201006 48 hours 399
#> 3 203766 4 hours 227
#> 4 204132 7 hours 489
#> 5 204132 36 hours 574
#> …
#> 361 298685 203 hours 222
#> 362 298685 226 hours 230
#> 363 298685 260 hours 218
#> 364 298685 272 hours 221
#> 365 298685 299 hours 253
#> # … with 355 more rows
Configuration files are looked for in both the package installation directory and in user-specified locations, either using the environment variable RICU_CONFIG_PATH or by passing paths as function arguments (load_dictionary(), for example, accepts a cfg_dirs argument).
Mechanisms for both extending and replacing existing concept dictionaries are supported by ricu. The default concept dictionary file is named concept-dict.json, and any file with the same name in user-specified locations will be used as an extension. In order to forgo the internal dictionary, a different file name can be chosen, which then has to be passed as a function argument (load_dictionary(), for example, has a name argument which defaults to concept-dict).
A JSON-based concept akin to the one above can be specified as
{
"ldh": {
"unit": "IU/L",
"description": "Lactate dehydrogenase",
"sources": {
"mimic_demo": [
{
"ids": 50954,
"table": "labevents",
"sub_var": "itemid"
}
]
}
}
}
and this can (given that it is saved as concept-dict.json in a directory pointed to by RICU_CONFIG_PATH) then be loaded using load_concepts() as
load_concepts("ldh", "mimic_demo")
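The steps around the JSON definition can be scripted end to end. The sketch below writes the definition to a temporary directory and points RICU_CONFIG_PATH at it. The directory name `ricu-config` and the variable names are hypothetical, and the final call only runs if ricu and the mimic.demo data package are installed.

```r
# Hypothetical location for user-defined configuration files.
cfg_dir <- file.path(tempdir(), "ricu-config")
dir.create(cfg_dir, showWarnings = FALSE, recursive = TRUE)

# The JSON concept definition from above, saved under the default
# dictionary file name so that it extends the built-in dictionary.
json <- '{
  "ldh": {
    "unit": "IU/L",
    "description": "Lactate dehydrogenase",
    "sources": {
      "mimic_demo": [
        {
          "ids": 50954,
          "table": "labevents",
          "sub_var": "itemid"
        }
      ]
    }
  }
}'
cfg_file <- file.path(cfg_dir, "concept-dict.json")
writeLines(json, cfg_file)

# Make the directory visible to ricu for the remainder of the session.
Sys.setenv(RICU_CONFIG_PATH = cfg_dir)

# Load the concept by name, provided the required packages are available.
if (requireNamespace("ricu", quietly = TRUE) &&
    requireNamespace("mimic.demo", quietly = TRUE)) {
  ldh <- ricu::load_concepts("ldh", "mimic_demo", verbose = FALSE)
  print(head(ldh))
}
```

For a persistent setup, the environment variable would more typically be set in a startup file such as .Renviron rather than per session.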
For further details on constructing concepts, refer to the documentation at ?concept and ?item.
1. While data.table by-reference operations can be very useful due to their inherent efficiency benefits, much care is required when they are enabled, as they break with the usual base R by-value (copy-on-modify) semantics.