The coder
package simplifies unit classification based on external code data. this is a generic aim that might be hard to grasp without further concretization. In this vignette, I will first explain the overall design principles, and then exemplify the concept with a typical use case involving patients with total hip arthroplasty (THA) and their pre-surgery comorbidity. Note, however, that the package is not limited to patient data or medical settings.
library(coder)
Functions of the package relies on a triad of objects:
It is easy to introduce new classification schemes (‘classcodes’ objects) or to use default schemes included in the package (see vignette("classcodes")
).
There are three important functions to control the intended work flow of the package:
codify()
will merge object (1) and (2) for a coded data set of the intended format. If optional dates are specified, those will be used to construct time windows in order to filter out only the important dates (i.e. comorbidity during one year before surgery or adverse events 90 days after).classify()
will then use the coded data and classify it using the classcodes
object (3) (i.e. to code comorbidity data by the Charlson or Elixhauser comorbidity classifications).index()
is a third optional step to summarize the individual classcodes
categories to a (possibly weighted) index sum for each coded item (i.e. to calculate the Charlson comorbidity index for each patient).Those steps could be performed explicitly as codify() %>% classify() %>% index()
or implicitly by the main function categorize()
combining all steps automatically.
A typical use case of the coder
package would consider patient data and comorbidity as described in the package readme.
The concept of comorbidity is often attributed to Feinstein (1970):
[T]he term co-morbidity will refer to any distinct additional clinical entity that has existed or that may occur during the clinical course of a patient who has the index disease under study.
Let’s consider a group of patients with THA, as identified from a national quality register, which might be large in size. Assume we are interested in those patients’ pre-surgery comorbidity, which is not captured by the quality register itself. Instead, this data might be codified in a secondary source, such as a national patient register containing all hospital visits and admissions during several years, both before and after the THA-surgery. Each hospital visit/admission might be recorded with one or several medical codes, for example using the International classification of diseases version 10 (ICD-10). Similarly, a medical prescription register might hold records of prescribed drugs with their corresponding codes from the Anatomic therapeutic chemical classification (ATC) system.
Thus, combining the primary and secondary data sets (objects 1-2 above) using some unique patient id, and a possible time window (i.e. to only consider comorbidity as recorded during one year before the THA), is a first step to identify patient comorbidity. This step is performed by the codify()
function in step (i) above.
We have now gathered all the relevant codes for each patient. Common classifications (i.e. ICD-10 and ATC) are wast, however, including tens of thousands of medical/chemical codes, which are cumbersome and impractical to use directly. It is therefore common to categorize such codes into broader categories (i.e. by the Charlson, Elixhauser or RxRisk V classifications as below). Such classification could be a simple code matching problem using a look-up table. This is generally a slow, cumbersome and error-prone process, however. I therefore recommend to use regular expression for a compact code representation, as well as a computationally faster procedure. This is implemented in the classify()
function from step (ii) above.
We have now reduced the data from tens of thousands of codes to perhaps 10-50 combined categories. This might be sufficient in some cases, although further simplifications might also be needed. It is thus common to simplify comorbidity into a single number, an index score, as the sum of individual comorbidities, possible weighted to differentiate more serious conditions from more trivial. Different weights might be of relevance under different circumstances or in different fields. This is implemented by the index()
function in step (iii) above.
The Charlson (1987) and Elixhauser (1998) comorbidity indices are two examples used in medical research. Each index consist of several medical conditions, possibly summarized by a (weighted) index. Each condition is defined by a set of medical codes (Quan et al. 2005). Different versions of the International Classification of Diseases (ICD) codes are often used.
The coder
package provides substantial functionality for both Charlson and Elixhauser, although we will not focus on those indices here (but see examples in vignette("classcodes")
). Several other R packages have functions for Charlson and Elixhauser:
icd
and comorbidity
are both good packages well suited for their purpose based on effective implementations. medicalrisk
can be used with ICD-9-CM codes but is not up-to-date with the latest version of ICD-10. comorbidities.icd10
and icdcoder
are not actively developed or maintained.
One advantage with the coder
package is the great flexibility for combining different sets of codes (ICD-8, ICD-9, ICD-9-CM and ICD-10 et cetera), with different weighted indices.
Another advantage of the coder
package is the inclusion of additional classifications (see ?all_classcodes()
), such as the pharmacy-based case-mix instrument Rx Risk V (Sloan et al. 2003). We will use this classification in an example. This classification, in contrast to Charlson and Elixhauser, relies on medical prescription data codified by the Anatomic Therapeutic Chemical classification system (ATC).
As for all classcodes objects in the package, additional information and references are found in the object documentation (?rxriskv
).
Let us consider the hypothetical setting above using some example data (ex_peopple
and ex_atc
) as described in vignette("ex_data")
.
A first attempt to calculate the Rx Risk V score for each patient:
<- categorize(
default codedata = ex_atc, cc = rxriskv, id = "name", code = "atc")
ex_people, #> Classification based on: atc_pratt
default#> # A tibble: 100 x 50
#> name surgery Alcohol.depende… Allergies Anticoagulants Antiplatelets
#> <chr> <date> <lgl> <lgl> <lgl> <lgl>
#> 1 Chen… 2020-09-29 FALSE TRUE FALSE FALSE
#> 2 Grav… 2020-06-21 FALSE FALSE TRUE FALSE
#> 3 Truj… 2020-06-08 FALSE FALSE FALSE FALSE
#> 4 Simp… 2020-09-10 FALSE FALSE FALSE FALSE
#> 5 Chin… 2020-08-24 FALSE FALSE FALSE FALSE
#> 6 Le, … 2020-03-28 FALSE TRUE FALSE FALSE
#> 7 Kang… 2020-06-30 TRUE TRUE FALSE FALSE
#> 8 Shue… 2020-03-29 FALSE TRUE FALSE FALSE
#> 9 Bouc… 2020-09-04 FALSE FALSE FALSE FALSE
#> 10 Le, … 2020-08-09 FALSE FALSE TRUE FALSE
#> # … with 90 more rows, and 44 more variables: Anxiety <lgl>, Arrhythmia <lgl>,
#> # Benign.prostatic.hyperplasia <lgl>, Bipolar.disorder <lgl>,
#> # Chronic.airways.disease <lgl>, Congestive.heart.failure <lgl>,
#> # Dementia <lgl>, Depression <lgl>, Diabetes <lgl>, Epilepsy <lgl>,
#> # Gastrooesophageal.reflux.disease <lgl>, Glaucoma <lgl>, Gout <lgl>,
#> # Hepatitis.B <lgl>, Hepatitis.C <lgl>, HIV <lgl>, Hyperkalaemia <lgl>,
#> # Hyperlipidaemia <lgl>, Hypertension <lgl>, Hyperthyroidism <lgl>,
#> # Hypothyroidism <lgl>, Incontinence <lgl>, Inflammation.pain <lgl>,
#> # Irritable.bowel.syndrome <lgl>,
#> # Ischaemic.heart.disease..hypertension <lgl>,
#> # Ischemic.heart.disease..angina <lgl>, Liver.failure <lgl>,
#> # Malignancies <lgl>, Malnutrition <lgl>, Migraine <lgl>,
#> # Osteoporosis.Paget.s <lgl>, Pain <lgl>, Pancreatic.insufficiency <lgl>,
#> # Parkinson.s.disease <lgl>, Psoriasis <lgl>, Psychotic.illness <lgl>,
#> # Pulmonary.hypertension <lgl>, Renal.disease <lgl>, Smoking.cessation <lgl>,
#> # Steroid.responsive.disease <lgl>, Transplant <lgl>, Tuberculosis <lgl>,
#> # pratt <dbl>, sum_all <dbl>
The first two columns are identical to ex_people
. Additional columns indicate whether patients had any of the individual comorbidities identified by Rx Risk V. Patients without any medical prescriptions have NA
values (which might be substituted by FALSE
). The last columns contain summarized index values (weighted sums of individual comorbidities). Let’s summarize the distribution of a weighted index according to pratt
(Pratt et al. 2018):
<- function(x) {
hist2 hist(x$pratt, main = NULL, xlab = "RxRisk V", col = "lightblue")
}hist2(default)
Some prescriptions might have been filed long before surgery, or even after. Those codes are less relevant for comorbidities present at surgery. We can limit the categorization to a time window of one year (365 days) prior to surgery. This is done internally by the codify()
function, hence by specifying a list of arguments passed to this function:
<-
codify_args list(date = "surgery", code_date = "prescription", days = c(-365, -1))
<-
ct categorize(
ex_people, codedata = ex_atc,
cc = rxriskv,
id = "name",
code = "atc",
codify_args = codify_args
)#> Classification based on: atc_pratt
hist2(ct)
Comorbidities are identified from ATC codes captured by regular expression (see vignette("classcodes")
and vignette("Intrpret_regular_expressions")
). Codes identified by atc_pratt
are used by default. Let’s use an alternative version adopted from Caughy (2010) as specified by an argument passed by the cc_args
argument.
hist2(
categorize(
ex_people, codedata = ex_atc,
cc = rxriskv,
id = "name",
code = "atc",
codify_args = codify_args,
cc_args = list(regex = "caughey")
) )
We did not specify how to calculate the weighted index sum above, wherefore all available indices were provided by default. We might go back to Pratt’s classification scheme (atc_pratt
) and only calculate the corresponding index pratt
. Let´s also perform the three computational steps explicitly instead of using the combining categorize()
function and tabulate the result
codify(
ex_people,
ex_atc, id = "name",
code = "atc",
date = "surgery",
code_date = "prescription",
days = c(-365, -1)
%>%
) classify(rxriskv) %>%
index("pratt") %>%
table()
#> Warning: 'classify()' does not preserve row order ('categorize()' does!)
#> Classification based on: atc_pratt
#> .
#> -1 0 1 2 3 4 5 6 7 8
#> 10 39 2 24 6 2 1 3 1 2
Let’s assume that our code data is not as clean as simulated above.
<- function(x) sample(x, 1e3, replace = TRUE)
s
$code <-
ex_atcpaste0(
s(letters), s(0:9), s(letters), s(c(".", "-", "?")),
$atc, s(letters), s(0:9)
ex_atc
)
ex_atc#> # A tibble: 10,000 x 4
#> name atc prescription code
#> <chr> <chr> <date> <chr>
#> 1 Meier, Hayden QC01BD02 2012-08-18 q7g-QC01BD02x8
#> 2 Garza, Kenia C05AA05 2017-02-26 l3e-C05AA05f2
#> 3 Chapa, Nicholas L01XC29 2017-07-17 o5v-L01XC29x2
#> 4 Slater, Trina QI07AA03 2012-08-03 d8r.QI07AA03a0
#> 5 Banks, Silbret QD06BB11 2014-03-01 x5q?QD06BB11g1
#> 6 Winn, Robert M01CC 2014-12-19 h4k.M01CCs1
#> 7 Cornelius, Kelly A16AA07 2017-11-30 r0j.A16AA07j6
#> 8 Rubin, Miriah QD10AD 2013-09-24 t2l?QD10ADg7
#> 9 Slater, Trina A12AA09 2015-09-13 n4w.A12AA09l3
#> 10 Jefferson, Deontrae S01AA05 2012-06-26 v1l?S01AA05s7
#> # … with 9,990 more rows
sum(
categorize(
ex_people, codedata = ex_atc,
cc = rxriskv,
id = "name",
code = "code"
$pratt,
)na.rm = TRUE
)#> Classification based on: atc_pratt
#> [1] 0
Thus, no codes are recognized (every one got index = 0). By default, codes are only recognized if found immediate in its corresponding column. This can be controlled by arguments start
and stop
specified via cc_args
. We can also ignore all non alphanumeric characters by setting alnum = TRUE
as passed to codify()
by argument codify_args
.
hist2(
categorize(
ex_people, codedata = ex_atc,
cc = rxriskv,
id = "name",
code = "code",
cc_args = list(
start = FALSE,
stop = FALSE
),codify_args = list(
alnum = TRUE
)
)
)#> Classification based on: atc_pratt