library(coder)
Let’s consider some example data (ex_people and ex_icd10) from vignette("ex_data").
Let’s categorize those patients by their Charlson comorbidity:
categorize(ex_people, codedata = ex_icd10, cc = charlson, id = "name", code = "icd10")
#> Classification based on: icd10
#> # A tibble: 100 x 25
#> name surgery myocardial.infa… congestive.hear… peripheral.vasc…
#> <chr> <date> <lgl> <lgl> <lgl>
#> 1 Chen… 2020-09-29 FALSE FALSE FALSE
#> 2 Grav… 2020-06-21 FALSE FALSE FALSE
#> 3 Truj… 2020-06-08 FALSE FALSE FALSE
#> 4 Simp… 2020-09-10 FALSE FALSE FALSE
#> 5 Chin… 2020-08-24 FALSE FALSE FALSE
#> 6 Le, … 2020-03-28 FALSE FALSE FALSE
#> 7 Kang… 2020-06-30 FALSE FALSE FALSE
#> 8 Shue… 2020-03-29 FALSE FALSE FALSE
#> 9 Bouc… 2020-09-04 FALSE FALSE TRUE
#> 10 Le, … 2020-08-09 FALSE FALSE FALSE
#> # … with 90 more rows, and 20 more variables: cerebrovascular.disease <lgl>,
#> # dementia <lgl>, chronic.pulmonary.disease <lgl>, rheumatic.disease <lgl>,
#> # peptic.ulcer.disease <lgl>, mild.liver.disease <lgl>,
#> # diabetes.without.complication <lgl>, hemiplegia.or.paraplegia <lgl>,
#> # renal.disease <lgl>, diabetes.complication <lgl>, malignancy <lgl>,
#> # moderate.or.severe.liver.disease <lgl>, metastatic.solid.tumor <lgl>,
#> # AIDS.HIV <lgl>, charlson <dbl>, deyo_ramano <dbl>, dhoore <dbl>,
#> # ghali <dbl>, quan_original <dbl>, quan_updated <dbl>
Here, charlson (as supplied by the cc argument) is a “classcodes” object containing a classification scheme. This is the specification of how to match ex_icd10$icd10 to each condition recognized by the Charlson comorbidity classification. It is based on regular expressions (see ?regex).
There are 7 default “classcodes” objects in the package (classcodes column below). Each of them might have several versions of regular expressions (column regex) and weighted indices (column indices):
all_classcodes()
#> # A tibble: 7 x 3
#> classcodes regex indices
#> <chr> <chr> <chr>
#> 1 charlson icd10, icd9cm_deyo, icd9cm_enhan… "charlson, deyo_ramano, dhoore…
#> 2 cps icd10 "only_ordinary"
#> 3 elixhauser icd10, icd10_short, icd9cm, icd9… "sum_all, sum_all_ahrq, walrav…
#> 4 hip_ae icd10, kva, icd10_fracture ""
#> 5 hip_ae_hail… icd10, kva ""
#> 6 knee_ae icd10, kva ""
#> 7 rxriskv atc_pratt, atc_caughey, atc_garl… "pratt, sum_all"
Each of those classcodes objects is documented (see for example ?charlson). Those objects are basically tibbles (data frames) with some additional attributes:
charlson
#>
#> Classcodes object
#>
#> Regular expressions:
#> icd10, icd9cm_deyo, icd9cm_enhanced, icd10_rcs, icd8_brusselaers, icd9_brusselaers
#> Indices:
#> charlson, deyo_ramano, dhoore, ghali, quan_original, quan_updated
#>
#> # A tibble: 17 x 14
#> group description icd10 icd9cm_deyo icd9cm_enhanced icd10_rcs
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 myoc… Acute myoc… I2([… 41[02] 41[02] "I2([1-3…
#> 2 cong… Heart fail… I(09… 428 39891|4(0(2([0… "I(1[13]…
#> 3 peri… Peripheral… I7([… 44(39|1)|7… 0930|4(373|4[0… "(I7([0-…
#> 4 cere… Cerebrovas… G4[5… 43[0-8] 36234|43[0-8] "G4[56]|…
#> 5 deme… Senile and… F0([… 290 29(0|41)|3312 "A810|F0…
#> 6 chro… Chronic ob… (I27… 490|50([0-… 4(16[89]|9)|50… "(I2[67]…
#> 7 rheu… Systemic l… M(0[… 7(1(0[014]… 4465|7(1(0[0-4… "M(0[569…
#> 8 pept… Gastric, d… K2[5… 53[1-4] 53[1-4] <NA>
#> 9 mild… Alcoholic … B18|… 571[24-6] 070([23]{2}|[4… <NA>
#> 10 diab… Diabetes w… E1[0… 250[0-37] 250[0-389] <NA>
#> 11 hemi… Paraplegia… G(04… 34(41|2) 3(341|4([23]|4… "G(114|8…
#> 12 rena… Chronic gl… I1(2… 58([2568]|… 40(3([019]1)|4… "I1[23]|…
#> 13 diab… Diabetes w… E1[0… 250[4-6] 250[4-7] "E1[0-4]"
#> 14 mali… Malignant … C([0… (1([4-68]|… 1([4-68]|7[0-2… "C([01]|…
#> 15 mode… Hepatic co… I(8(… 456[01]|57… 456[0-2]|572[2… "B18|I(8…
#> 16 meta… Secondary … C(7[… 19([6-8]|9… 19[6-9] "C(7[7-9…
#> 17 AIDS… HIV infect… B2[0… 04[2-4] 04[2-4] "B2[0-4]"
#> # … with 8 more variables: icd8_brusselaers <chr>, icd9_brusselaers <chr>,
#> # charlson <dbl>, deyo_ramano <dbl>, dhoore <dbl>, ghali <dbl>,
#> # quan_original <dbl>, quan_updated <dbl>
Columns have pre-specified names and/or content:
group: short descriptive names of all groups to classify by (i.e. medical conditions/comorbidities in the Charlson case)
description: (optional) details describing each group
regular expressions: one or more columns with regular expressions identifying each group (see vignette("Interpret_regular_expressions") for details and ?charlson for concrete examples). Multiple versions might be used if combined with different code sets (e.g. ICD-9 versus ICD-10) or as suggested by different sources/authors. (Column names are arbitrary but identified by attr(., "regexprs") and specified by argument regex in as.classcodes().)
indices: one or more columns with numeric weights used for the weighted index sums. (Column names are arbitrary but identified by attr(., "indices") and specified by argument indices in as.classcodes().)
condition: (optional) conditional classification (not used with charlson but see example below).

In the example above, we did not specify which version of the regular expressions to use. We see from the printed output above (or by attr(charlson, "regexprs")) that the first regular expression is “icd10”. This will be used by default. We have ICD-10 codes recorded in our code data set (ex_icd10$icd10). We might therefore use either “icd10” or the alternative “icd10_rcs”. Other versions might be relevant if the medical data is coded by other code sets (such as earlier versions of ICD). We will show below how to alter this setting in practice.
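To make the role of the regular expressions concrete, here is a minimal base R sketch of matching codes against two patterns. It is a simplified illustration and not how coder works internally. The two patterns are the ones shown in full in the icd10_rcs column of the printed charlson object above; the codes are hypothetical, and prefixing each pattern with "^" is an assumption mirroring the behaviour described for start = TRUE.

```r
# Simplified illustration in base R; not the internals of coder.
# Patterns copied from the icd10_rcs column of the printed charlson object.
patterns <- c(
  diabetes.complication = "E1[0-4]",
  AIDS.HIV              = "B2[0-4]"
)

# Hypothetical ICD-10 codes for illustration
codes <- c("E1021", "B20", "C01")

# One logical column per group: TRUE if the code starts with a match
# (assumption: "^" anchors the pattern at the beginning of the code)
classified <- vapply(
  patterns,
  function(p) grepl(paste0("^(", p, ")"), codes),
  logical(length(codes))
)
rownames(classified) <- codes
classified
```

Each row is a code and each column a group, which is the same shape of result that the classification functions build on, only for two groups instead of seventeen.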
Some classcodes objects have an additional class attribute “hierarchy,” controlling hierarchical groups where only one of possibly several groups should be used in weighted index sums. The classcodes object for the Elixhauser comorbidity classification has this property:
print(elixhauser, n = 0) # preview 0 rows but present the attributes
#>
#> Classcodes object
#>
#> Regular expressions:
#> icd10, icd10_short, icd9cm, icd9cm_ahrqweb, icd9cm_enhanced
#> Indices:
#> sum_all, sum_all_ahrq, walraven, sid29, sid30, ahrq_mort, ahrq_readm
#> Hierarchy:
#> c("metastatic cancer", "solid tumor"),
#> c("diabetes uncomplicated", "diabetes complicated")
This means that patients who have both metastatic cancer and solid tumors should be recognized as such if classified. If such a patient is assigned an aggregated index score, however, only the largest score within each hierarchy is used (in this case for metastatic cancer, which is superior to a solid tumor). The same is true for patients diagnosed with both uncomplicated and complicated diabetes.
Consider a patient Alice with some diagnoses:
pat <- tibble::tibble(id = "Alice")
diags <- c("C01", "C801", "E1010", "E1021")
decoder::decode(diags, decoder::icd10cm)
#> [1] "Malignant neoplasm of base of tongue"
#> [2] "Malignant (primary) neoplasm, unspecified"
#> [3] "Type 1 diabetes mellitus with ketoacidosis without coma"
#> [4] "Type 1 diabetes mellitus with diabetic nephropathy"
According to Elixhauser, poor Alice has both a solid tumor and a metastatic cancer, as well as diabetes both with and without complications. The (unweighted) index “sum_all”, however, will not equal 4 but 2, since metastatic cancer and diabetes with complications subsume solid tumors and diabetes without complications.
icd10 <- tibble::tibble(id = "Alice", icd10 = diags)
x <- categorize(pat, codedata = icd10, cc = elixhauser,
  id = "id", code = "icd10", index = "sum_all", check.names = FALSE)
#> Classification based on: icd10
t(x)
#> [,1]
#> id "Alice"
#> congestive heart failure "FALSE"
#> cardiac arrhythmias "FALSE"
#> valvular disease "FALSE"
#> pulmonary circulation disorder "FALSE"
#> peripheral vascular disorder "FALSE"
#> hypertension uncomplicated "FALSE"
#> hypertension complicated "FALSE"
#> paralysis "FALSE"
#> other neurological disorders "FALSE"
#> chronic pulmonary disease "FALSE"
#> diabetes uncomplicated "TRUE"
#> diabetes complicated "TRUE"
#> hypothyroidism "FALSE"
#> renal failure "FALSE"
#> liver disease "FALSE"
#> peptic ulcer disease "FALSE"
#> AIDS/HIV "FALSE"
#> lymphoma "FALSE"
#> metastatic cancer "TRUE"
#> solid tumor "TRUE"
#> rheumatoid arthritis "FALSE"
#> coagulopathy "FALSE"
#> obesity "FALSE"
#> weight loss "FALSE"
#> fluid electrolyte disorders "FALSE"
#> blood loss anemia "FALSE"
#> deficiency anemia "FALSE"
#> alcohol abuse "FALSE"
#> drug abuse "FALSE"
#> psychoses "FALSE"
#> depression "FALSE"
#> sum_all "2"
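The hierarchy rule behind this result can be sketched in a few lines of base R. This is only an illustration of the rule as described above, not coder's implementation; the flags reproduce Alice's four positive groups, and the unit weights stand in for the unweighted “sum_all” index.

```r
# Simplified base R sketch of the hierarchy rule; not coder's implementation.
# Alice's positive groups, as in the output above
flags <- c(
  "diabetes uncomplicated" = TRUE,
  "diabetes complicated"   = TRUE,
  "metastatic cancer"      = TRUE,
  "solid tumor"            = TRUE
)

# Unweighted index: every group counts 1
weights <- c(
  "diabetes uncomplicated" = 1,
  "diabetes complicated"   = 1,
  "metastatic cancer"      = 1,
  "solid tumor"            = 1
)

# Hierarchies as printed for the elixhauser object
hierarchy <- list(
  c("metastatic cancer", "solid tumor"),
  c("diabetes uncomplicated", "diabetes complicated")
)

# Contribution of each group, then only the largest one within each hierarchy
contrib   <- flags * weights
hier_sum  <- sum(vapply(hierarchy, function(h) max(contrib[h]), numeric(1)))
other_sum <- sum(contrib[setdiff(names(contrib), unlist(hierarchy))])
sum_all   <- hier_sum + other_sum
sum_all
#> [1] 2
```

With weighted indices the same logic applies: within each hierarchy only the group with the largest weight contributes to the sum.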
Consider Alice once more. Suppose she had a total hip arthroplasty (THA) and had some surgical procedure codes recorded at hospital visits before, during or after her index surgery. Those codes follow the NOMESCO classification of surgical procedures (known as KVA codes in Swedish). Here, “post_op” indicates whether a code was recorded after surgery. This information is not always available from date stamps alone (when it is, the approach illustrated in vignette("coder") could be used instead).
nomesco <-
  tibble::tibble(
    id = "Alice",
    kva = c("AA01", "NFC01"),
    post_op = c(TRUE, FALSE)
  )
Thus, the “post_op” column is a Boolean/logical vector with a name recognized from the “condition” column in hip_ae, a classcodes object used to identify adverse events after THA (the use of set_classcodes() is further explained below and is used here since hip_ae includes codes for both ICD and NOMESCO/KVA).
set_classcodes(hip_ae, regex = "kva")
#>
#> Classcodes object
#>
#> Regular expressions:
#> kva
#> Indices:
#>
#>
#> # A tibble: 1 x 3
#> group kva condition
#> <chr> <chr> <chr>
#> 1 KVA ^(NF([CF-HJ-MS-TW]|A(02|1[12]|2[0-2])|Q09|U[013489]9)|QD(A10|… post_op
A code from nomesco$kva will only be recognized as an adverse event if 1) the code is matched by the relevant regular expression, and 2) the extra condition (from nomesco$post_op) is TRUE.
We need to specify that codes are based on regular expressions matching NOMESCO codes. We do this with the regex argument, which is passed on to set_classcodes() through the cc_args argument.
In the data set (nomesco), “AA01” was recorded after surgery but does not indicate a potential adverse event. “NFC01” is a potential adverse event but was recorded before surgery. Therefore, no adverse event will be recognized in this case.
categorize(pat, codedata = nomesco, cc = hip_ae, id = "id", code = "kva",
cc_args = list(regex = "kva"))
#> index calculated as number of relevant categories
#> # A tibble: 1 x 3
#> id KVA index
#> <chr> <lgl> <dbl>
#> 1 Alice FALSE 0
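The two-step rule (regular expression match and TRUE condition) can be sketched in base R as follows. This is an illustration only and not coder's implementation. The pattern "^NFC" is a hypothetical, heavily simplified stand-in, since the real hip_ae expression is truncated in the printed output above.

```r
# Base R sketch of conditional classification; not coder's implementation.
# Data as in the nomesco example
kva     <- c("AA01", "NFC01")
post_op <- c(TRUE, FALSE)

# Hypothetical, heavily simplified pattern (the real hip_ae expression
# is much longer and is truncated in the printed output)
pattern <- "^NFC"

matched       <- grepl(pattern, kva)  # 1) code matches the regular expression
adverse_event <- matched & post_op    # 2) and the extra condition is TRUE
any(adverse_event)
#> [1] FALSE
```

“AA01” fails the first step and “NFC01” fails the second, so neither code counts as an adverse event.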
Most functions do not use a classcodes object as is, but a modified version passed through set_classcodes(). This function can be called directly but is more often invoked through the cc_args argument of other functions (as in the example above).
set_classcodes()
We might use set_classcodes() to prepare a classification scheme according to the Charlson comorbidity index based on ICD-8 (Brusselaers and Lagergren 2017). Assume that such codes might be found in character strings with leading prefixes or in the middle of a more verbose description. This is controlled by setting the argument start = FALSE, meaning that the identified ICD-8 codes do not need to appear at the beginning of the character string. We might assume, however, that there is no more information after the code (as specified by stop = TRUE). We can also use more specific and unique group names as specified by tech_names.
charlson_icd8 <-
  set_classcodes(
    "charlson",
    regex = "icd8_brusselaers", # Version based on ICD-8
    start = FALSE,              # Codes do not have to occur at the beginning of a string
    stop = TRUE,                # Strings must end with the specified codes
    tech_names = TRUE           # Use long but unique and descriptive variable names
  )
The resulting object has only one version of regular expressions (icd8_brusselaers as specified). Each regular expression is suffixed with $ (due to stop = TRUE). Group names might seem cumbersome but this will help to distinguish column names added by categorize() if this function is run repeatedly with different classcodes (e.g. if we calculate both Charlson and Elixhauser indices for the same patients). The original charlson object had 17 rows, but charlson_icd8 has only 13, since not all groups are used in this version.
charlson_icd8
#>
#> Classcodes object
#>
#> Regular expressions:
#> icd8_brusselaers
#> Indices:
#> charlson, deyo_ramano, dhoore, ghali, quan_original, quan_updated
#>
#> # A tibble: 13 x 9
#> group description icd8_brusselaers charlson deyo_ramano dhoore ghali
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 char… Acute myoc… (41[0-2])$ 1 1 1 1
#> 2 char… Heart fail… (4270|428)$ 1 1 1 4
#> 3 char… Peripheral… (44[0-5])$ 1 1 1 2
#> 4 char… Cerebrovas… (43[0-8])$ 1 1 1 1
#> 5 char… Senile and… (290[01])$ 1 1 1 0
#> 6 char… Chronic ob… (49[0-3]|51[5-8… 1 1 1 0
#> 7 char… Systemic l… (7(1[0-2]|34))$ 1 1 1 0
#> 8 char… Paraplegia… (344)$ 2 1 1 0
#> 9 char… Chronic gl… (40[34]|58[0-3]… 2 1 1 3
#> 10 char… Diabetes w… (250)$ 2 1 1 0
#> 11 char… Malignant … (1([4-68][0-9]|… 2 1 1 0
#> 12 char… Hepatic co… (070|4560|51[1-… 3 1 1 0
#> 13 char… Secondary … (19[6-9])$ 6 1 1 0
#> # … with 2 more variables: quan_original <dbl>, quan_updated <dbl>
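The effect of start = FALSE and stop = TRUE can be checked directly with base R. The sketch below uses the pattern from row 1 of the printed charlson_icd8 object; the text entries are hypothetical examples.

```r
# Base R sketch of what start = FALSE and stop = TRUE imply for matching.
# Pattern copied from row 1 of the printed charlson_icd8 object
pattern <- "(41[0-2])$"

# Hypothetical free-text entries for illustration
entries <- c("410", "ICD-8: 411", "410 acute infarction", "4101")

grepl(pattern, entries)
#> [1]  TRUE  TRUE FALSE FALSE
```

A leading prefix is accepted since the pattern is not anchored at the start, while any text after the code prevents a match because of the trailing $.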
Note that all index columns remain in the tibble. It is thus possible to combine any categorization with any index, although some combinations might be preferred (such as the regular expressions icd9cm_deyo combined with the index deyo_ramano).
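As a rough illustration of how one classification can be combined with different indices, the base R sketch below sums weights for a hypothetical patient. It is not coder's implementation. The weights are copied from rows 1, 5 and 13 of the printed charlson_icd8 object; the readable group labels are assumptions based on the descriptions in those rows.

```r
# Base R sketch of a weighted index sum; not coder's implementation.
# Weights copied from rows 1, 5 and 13 of the printed charlson_icd8 object
# (group labels are shortened, assumed names for readability)
weights <- data.frame(
  group    = c("myocardial infarction", "dementia", "metastatic solid tumor"),
  charlson = c(1, 1, 6),
  ghali    = c(1, 0, 0)
)

# Hypothetical patient with myocardial infarction and a metastatic solid tumor
has_group <- c(TRUE, FALSE, TRUE)

# The same classification combined with two different sets of weights
index_charlson <- sum(weights$charlson[has_group])
index_ghali    <- sum(weights$ghali[has_group])
c(charlson = index_charlson, ghali = index_ghali)
```

The classification step is identical in both cases; only the column of weights changes.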
We can now use charlson_icd8 for classification:
classify(410, charlson_icd8)
#> Classification based on: icd8_brusselaers
#>
#> The printed data is of class: classified, matrix.
#> It has 1 row(s).
#> It is here previewed as a tibble
#> Use `print(x, n = NULL)` to print as is (or use `n` to specify the number of rows to preview)!
#>
#> # A tibble: 1 x 13
#> charlson_icd8_b… charlson_icd8_b… charlson_icd8_b… charlson_icd8_b…
#> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE
#> # … with 9 more variables: charlson_icd8_brusselaers_dementia <lgl>,
#> # charlson_icd8_brusselaers_chronic_pulmonary_disease <lgl>,
#> # charlson_icd8_brusselaers_rheumatic_disease <lgl>,
#> # charlson_icd8_brusselaers_hemiplegia_or_paraplegia <lgl>,
#> # charlson_icd8_brusselaers_renal_disease <lgl>,
#> # charlson_icd8_brusselaers_diabetes_complication <lgl>,
#> # charlson_icd8_brusselaers_malignancy <lgl>,
#> # charlson_icd8_brusselaers_moderate_or_severe_liver_disease <lgl>,
#> # charlson_icd8_brusselaers_metastatic_solid_tumor <lgl>
The ICD-8 code 410 is recognized as (only) myocardial infarction.
set_classcodes()
Instead of pre-specifying charlson_icd8, a similar result is achieved by passing the same settings through cc_args:
classify(
410,
"charlson",
cc_args = list(
regex = "icd8_brusselaers",
start = FALSE,
stop = TRUE,
tech_names = TRUE
)
)
#>
#> The printed data is of class: classified, matrix.
#> It has 1 row(s).
#> It is here previewed as a tibble
#> Use `print(x, n = NULL)` to print as is (or use `n` to specify the number of rows to preview)!
#>
#> # A tibble: 1 x 13
#> charlson_icd8_b… charlson_icd8_b… charlson_icd8_b… charlson_icd8_b…
#> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE
#> # … with 9 more variables: charlson_icd8_brusselaers_dementia <lgl>,
#> # charlson_icd8_brusselaers_chronic_pulmonary_disease <lgl>,
#> # charlson_icd8_brusselaers_rheumatic_disease <lgl>,
#> # charlson_icd8_brusselaers_hemiplegia_or_paraplegia <lgl>,
#> # charlson_icd8_brusselaers_renal_disease <lgl>,
#> # charlson_icd8_brusselaers_diabetes_complication <lgl>,
#> # charlson_icd8_brusselaers_malignancy <lgl>,
#> # charlson_icd8_brusselaers_moderate_or_severe_liver_disease <lgl>,
#> # charlson_icd8_brusselaers_metastatic_solid_tumor <lgl>