Introduction to record linkage with diyar

Data required for epidemiological investigations are usually stored in different databases. Consolidating these datasets is often the first step for these investigations. This vignette provides a brief introduction to the basics of record linkage as implemented by diyar.

Let’s begin by reviewing missing_staff_id - a sample dataset containing incomplete staff information.

data(missing_staff_id)
missing_staff_id
#>   r_id staff_id age initials hair_colour     branch_office source_1 source_2
#> 1    1       NA  30     G.D.       Brown Republic of Ghana        A        3
#> 2    2       NA  30     B.G.        Teal            France        A        1
#> 3    3       NA  30     X.P.        <NA>              <NA>        A        1
#> 4    4       NA  30     X.P.       Green              <NA>        B        1
#> 5    5       NA  30     <NA>       Green            France        A        1
#> 6    6        2  30     G.D.  Dark brown             Ghana        A        1
#> 7    7        2  30     G.D.       Brown Republic of Ghana        B        2

A unique identifier that distinguishes one entity (staff) from another is often unavailable or incomplete as is the case with staff_id in this example. links() is used to create one. The identifier is created as an S4 class (pid) with useful information about each group in its slots.

The simplest strategy would be to select one attribute as a distinguishing characteristic for each entity. This is the simple deterministic approach to record linkage.

In the example below, we use initials and hair_colour as distinguishing characteristics.

missing_staff_id$p1 <- links(criteria = missing_staff_id$initials)
missing_staff_id$p2 <- links(criteria = missing_staff_id$hair_colour)
missing_staff_id[c("initials", "hair_colour", "p1", "p2")]
#>   initials hair_colour            p1            p2
#> 1     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001)
#> 2     B.G.        Teal P.2 (No hits) P.2 (No hits)
#> 3     X.P.        <NA> P.3 (CRI 001) P.3 (No hits)
#> 4     X.P.       Green P.3 (CRI 001) P.4 (CRI 001)
#> 5     <NA>       Green P.5 (No hits) P.4 (CRI 001)
#> 6     G.D.  Dark brown P.1 (CRI 001) P.6 (No hits)
#> 7     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001)

Unsurprisingly, the uniqueness of identifiers p1 and p2 correspond to the uniqueness of the initials and hair_colour respectively. In this case both strategies represent different outcomes. For example, p1 identifies records 3 and 4 as the same person, while p2 has it as records 4 and 5.

To maximise coverage, links() can implement an ordered multistage deterministic approach to record linkage. For example, we can say that records with matching initials should be linked to each other, then other records with a matching hair_colour should then be added to the group.

missing_staff_id$p3 <- links(criteria = as.list(missing_staff_id[c("initials", "hair_colour")]))
missing_staff_id[c("initials", "hair_colour", "p1", "p2", "p3")]
#>   initials hair_colour            p1            p2            p3
#> 1     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2     B.G.        Teal P.2 (No hits) P.2 (No hits) P.2 (No hits)
#> 3     X.P.        <NA> P.3 (CRI 001) P.3 (No hits) P.3 (CRI 001)
#> 4     X.P.       Green P.3 (CRI 001) P.4 (CRI 001) P.3 (CRI 001)
#> 5     <NA>       Green P.5 (No hits) P.4 (CRI 001) P.3 (CRI 002)
#> 6     G.D.  Dark brown P.1 (CRI 001) P.6 (No hits) P.1 (CRI 001)
#> 7     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)

We see that p3 now identifies records 3, 4 and 5 as the same person. The logic here is that since record 4 has the same initial as record 3 and also has the same hair_colour as record 5, all three are linked together as records of the same entity. Records 3 and 5 are only considered part of the same entity because of their share link with record 4. If record 4 is removed from this dataset, records 3 and 5 will considered part of separate entities. This process is referred to as record expansion.

At each stage, additional matching criteria can be specified. This is done through a sub_criteria object. This is an S3 class containing attributes to be compared and functions for the comparisons. A sub_criteria object is used for evaluated, fuzzy and/or nested matches.

For example, we can compare hair_colour and branch_office without any order (priority) to them. This is the equivalent of saying matching hair colour OR/AND branch office.

scri_1 <- sub_criteria(missing_staff_id$hair_colour, 
                       missing_staff_id$branch_office, 
                       operator = "or")
scri_2 <- sub_criteria(missing_staff_id$hair_colour, 
                       missing_staff_id$branch_office, 
                       operator = "and")
missing_staff_id$p4 <- links(criteria = "place_holder", 
                             sub_criteria = list(cr1 = scri_1), 
                             recursive = TRUE)
missing_staff_id$p5 <- links(criteria = "place_holder", 
                             sub_criteria = list(cr1 = scri_2), 
                             recursive = TRUE)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5")]
#>   hair_colour     branch_office            p4            p5
#> 1       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 2        Teal            France P.2 (CRI 001) P.2 (No hits)
#> 3        <NA>              <NA> P.3 (No hits) P.3 (No hits)
#> 4       Green              <NA> P.2 (CRI 001) P.4 (No hits)
#> 5       Green            France P.2 (CRI 001) P.5 (No hits)
#> 6  Dark brown             Ghana P.6 (No hits) P.6 (No hits)
#> 7       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)

There is no limit to the number of sub_criteria that can be specified but each sub_criteria must be paired to a criteria. Any unpaired sub_criteria will be ignored.

As mentioned, a sub_criteria can be nested. For example, scri_3 below is the equivalent of saying (scri_1; matching hair colour OR branch office) AND (matching initials OR branch office).

scri_3 <- sub_criteria(scri_1, 
                       sub_criteria(missing_staff_id$initials, 
                                    missing_staff_id$branch_office,
                                    operator = "or"),
                       operator = "and")

missing_staff_id$p6 <- links(criteria = "place_holder", 
                             sub_criteria = list(cr1 = scri_3), 
                             recursive = TRUE)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6")]
#>   hair_colour     branch_office            p4            p5            p6
#> 1       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2        Teal            France P.2 (CRI 001) P.2 (No hits) P.2 (CRI 001)
#> 3        <NA>              <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4       Green              <NA> P.2 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 5       Green            France P.2 (CRI 001) P.5 (No hits) P.2 (CRI 001)
#> 6  Dark brown             Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)

Evaluated matches can be implemented with user-defined functions. The only requirement is that they:

For example, there are variations of the same hair_colour and branch_office values in missing_staff_id. A quick look and we see that using the last word of each value will improve the linkage result. We can pass a custom function to the sub_criteria object that will make this comparison. After doing this below (p7), we see that record 6 has now been linked with records 1 and 7, which was not the case earlier.

# A function to extract the last word in a string
last_word_wf <- function(x) tolower(gsub("^.* ", "", x))
# A logical test using `last_word_wf`.
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)

scri_4 <- sub_criteria(missing_staff_id$hair_colour, 
                       missing_staff_id$branch_office,
                       match_funcs = c(last_word_cmp, last_word_cmp),
                       operator = "or")

missing_staff_id$p7 <- links(criteria = "place_holder", 
                             sub_criteria = list(cr1 = scri_4), 
                             recursive = TRUE)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6", "p7")]
#>   hair_colour     branch_office            p4            p5            p6
#> 1       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2        Teal            France P.2 (CRI 001) P.2 (No hits) P.2 (CRI 001)
#> 3        <NA>              <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4       Green              <NA> P.2 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 5       Green            France P.2 (CRI 001) P.5 (No hits) P.2 (CRI 001)
#> 6  Dark brown             Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#>              p7
#> 1 P.1 (CRI 001)
#> 2 P.2 (CRI 001)
#> 3 P.3 (No hits)
#> 4 P.2 (CRI 001)
#> 5 P.2 (CRI 001)
#> 6 P.1 (CRI 001)
#> 7 P.1 (CRI 001)

A sub_criteria provides a lot flexibly in terms of how attributes are compared however, it can cost a lot in processing time. This is because links() is an iterative function, comparing batches of record-pairs in iterations. This keeps memory usage down but at the cost of processing time. There are ways to mitigate this such as using the recursive and check_duplicate arguments.

In contrast to this, link_records() creates every possible record-pair of values for each attribute, which are then compared. This is often faster but costs more in memory usage. With a large enough dataset, we will eventually run out of memory, especially when there is no blocking_attribute. Below we see how link_records() is used for the same fuzzy matching.

p8 <- link_records(attribute = list(missing_staff_id$hair_colour, 
                                    missing_staff_id$branch_office), 
                   cmp_func = c(last_word_cmp, last_word_cmp), 
                   probabilistic = FALSE)
p8
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (CRI 001)" "P.3 (No hits)" "P.2 (CRI 001)"
#> [5] "P.2 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>    sn_x sn_y cmp.var_1 cmp.var_2 cmp.weight record.match
#> 1     1    2         0         0          0        FALSE
#> 2     1    3         0         0          0        FALSE
#> 3     1    4         0         0          0        FALSE
#> 4     1    5         0         0          0        FALSE
#> 5     1    6         1         1          2         TRUE
#> 6     1    7         1         1          2         TRUE
#> 7     2    3         0         0          0        FALSE
#> 8     2    4         0         0          0        FALSE
#> 9     2    5         0         1          1         TRUE
#> 10    2    6         0         0          0        FALSE
#> 11    2    7         0         0          0        FALSE
#> 12    3    4         0         0          0        FALSE
#> 13    3    5         0         0          0        FALSE
#> 14    3    6         0         0          0        FALSE
#> 15    3    7         0         0          0        FALSE
#> 16    4    5         1         0          1         TRUE
#> 17    4    6         0         0          0        FALSE
#> 18    4    7         0         0          0        FALSE
#> 19    5    6         0         0          0        FALSE
#> 20    5    7         0         0          0        FALSE
#> 21    6    7         1         1          2         TRUE

It’s worth noting that link_records() does not have the step-wise or nested matching features of links() however, merge_ids() can be used to achieve the same result by consolidating the result of multiple instances of link_records() (pid objects) in the same way links() would.

The advantage of link_records() over links() is it’s ability to implement a probabilistic approach to record linkage. Its implementation of probabilistic record linkage is based on Fellegi and Sunter (1969) model for deciding if two records belong to the same entity. In summary, m_probabilities and u_probabilities, which are the probabilities of a true and false match respectively are used to calculate a final match score for each record-pair. Records below or above a certain score_threshold are considered matches or non-matches respectively. See help(link_records) for a more detailed explanation of the method. Below we see the same analysis as above but as a probabilistic record linkage.

p9 <- link_records(attribute = list(missing_staff_id$hair_colour, 
                                    missing_staff_id$branch_office), 
                   cmp_func = c(last_word_cmp, last_word_cmp), 
                   probabilistic = TRUE)
p9
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>    sn_x sn_y cmp.var_1 cmp.var_2 cmp.weight prb.var_1 prb.var_2 prb.weight
#> 1     1    2         0         0          0 -3.298333 -3.235597  -6.533930
#> 2     1    3         0         0          0 -2.873027 -2.873027  -5.746054
#> 3     1    4         0         0          0 -3.235597 -2.873027  -6.108624
#> 4     1    5         0         0          0 -3.235597 -3.235597  -6.471194
#> 5     1    6         1         1          2  4.466709  4.466709   8.933417
#> 6     1    7         1         1          2  1.659354  1.659354   3.318708
#> 7     2    3         0         0          0 -3.136062 -2.873027  -6.009089
#> 8     2    4         0         0          0 -3.298333 -2.873027  -6.171360
#> 9     2    5         0         1          1 -3.298333  1.659354  -1.638979
#> 10    2    6         0         0          0 -3.328707 -3.298333  -6.627040
#> 11    2    7         0         0          0 -3.298333 -3.235597  -6.533930
#> 12    3    4         0         0          0 -3.358454 -3.358454  -6.716908
#> 13    3    5         0         0          0 -3.358454 -3.358454  -6.716908
#> 14    3    6         0         0          0 -3.358454 -3.358454  -6.716908
#> 15    3    7         0         0          0 -3.358454 -3.358454  -6.716908
#> 16    4    5         1         0          1  1.659354 -3.358454  -1.699100
#> 17    4    6         0         0          0 -3.298333 -3.358454  -6.656787
#> 18    4    7         0         0          0 -3.235597 -3.358454  -6.594051
#> 19    5    6         0         0          0 -3.298333 -3.298333  -6.596666
#> 20    5    7         0         0          0 -3.235597 -3.235597  -6.471194
#> 21    6    7         1         1          2  4.466709  4.466709   8.933417
#>    record.match
#> 1         FALSE
#> 2         FALSE
#> 3         FALSE
#> 4         FALSE
#> 5          TRUE
#> 6          TRUE
#> 7         FALSE
#> 8         FALSE
#> 9         FALSE
#> 10        FALSE
#> 11        FALSE
#> 12        FALSE
#> 13        FALSE
#> 14        FALSE
#> 15        FALSE
#> 16        FALSE
#> 17        FALSE
#> 18        FALSE
#> 19        FALSE
#> 20        FALSE
#> 21         TRUE

As mentioned, with a sufficiently large dataset, we can easily run out of memory. If memory usage is a concern, then use links_wf_probabilistic() instead. links_wf_probabilistic() is a wrapper function of links() with a specific sub_criteria to achieve probabilistic record linkage. It’s less memory intensive but is usually slower. The reasons for this has already been described above.

p10 <- links_wf_probabilistic(attribute = list(missing_staff_id$hair_colour,
                                               missing_staff_id$branch_office), 
                              cmp_func = c(last_word_cmp, last_word_cmp), 
                              probabilistic = TRUE,
                              recursive = TRUE)
p10
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>   sn_x sn_y cmp.var_1 cmp.var_2 cmp.weight prb.var_1 prb.var_2 prb.weight
#> 1    1    1         1         1          2  1.733354  1.733354   3.466709
#> 2    2    2        NA        NA         NA        NA        NA         NA
#> 3    3    3        NA        NA         NA        NA        NA         NA
#> 4    4    4        NA        NA         NA        NA        NA         NA
#> 5    5    5        NA        NA         NA        NA        NA         NA
#> 6    6    1         1         1          2  4.466709  4.466709   8.933417
#> 7    7    1         1         1          2  1.733354  1.733354   3.466709
#>   record.match
#> 1         TRUE
#> 2           NA
#> 3           NA
#> 4           NA
#> 5           NA
#> 6         TRUE
#> 7         TRUE

links() and link_records() can perform record linkage across multiple datasets in one instance. To achieve this, values from the different datasets should be passed to the functions as single atomic vectors. The corresponding identifier for each dataset should then be passed to the data_source argument. The strata argument is a related feature which provides a blocking attribute, limiting the linkage process to subsets of the dataset. See a demonstration of these below.

triplicate <- rbind(missing_staff_id[c(4:5, 7)],
                    missing_staff_id[c(4:5, 7)],
                    missing_staff_id[c(4:5, 7)])
triplicate$data_source <- c(rep("set_1", 7), rep("set_2", 7), rep("set_3", 7))

triplicate$p1 <- links(as.list(triplicate[1:2]), 
                       data_source = triplicate$data_source, 
                       strata = triplicate$source_1)
triplicate$p2 <- links(as.list(triplicate[1:2]), 
                       strata = triplicate$data_source, 
                       data_source = triplicate$source_1)

triplicate
#>    initials hair_colour source_1 data_source            p1             p2
#> 1      G.D.       Brown        A       set_1 P.1 (CRI 001) P.01 (CRI 001)
#> 2      B.G.        Teal        A       set_1 P.2 (CRI 001) P.02 (No hits)
#> 3      X.P.        <NA>        A       set_1 P.3 (CRI 001) P.03 (CRI 001)
#> 4      X.P.       Green        B       set_1 P.4 (CRI 001) P.03 (CRI 001)
#> 5      <NA>       Green        A       set_1 P.5 (CRI 002) P.03 (CRI 002)
#> 6      G.D.  Dark brown        A       set_1 P.1 (CRI 001) P.01 (CRI 001)
#> 7      G.D.       Brown        B       set_1 P.7 (CRI 001) P.01 (CRI 001)
#> 8      G.D.       Brown        A       set_2 P.1 (CRI 001) P.08 (CRI 001)
#> 9      B.G.        Teal        A       set_2 P.2 (CRI 001) P.09 (No hits)
#> 10     X.P.        <NA>        A       set_2 P.3 (CRI 001) P.10 (CRI 001)
#> 11     X.P.       Green        B       set_2 P.4 (CRI 001) P.10 (CRI 001)
#> 12     <NA>       Green        A       set_2 P.5 (CRI 002) P.10 (CRI 002)
#> 13     G.D.  Dark brown        A       set_2 P.1 (CRI 001) P.08 (CRI 001)
#> 14     G.D.       Brown        B       set_2 P.7 (CRI 001) P.08 (CRI 001)
#> 15     G.D.       Brown        A       set_3 P.1 (CRI 001) P.15 (CRI 001)
#> 16     B.G.        Teal        A       set_3 P.2 (CRI 001) P.16 (No hits)
#> 17     X.P.        <NA>        A       set_3 P.3 (CRI 001) P.17 (CRI 001)
#> 18     X.P.       Green        B       set_3 P.4 (CRI 001) P.17 (CRI 001)
#> 19     <NA>       Green        A       set_3 P.5 (CRI 002) P.17 (CRI 002)
#> 20     G.D.  Dark brown        A       set_3 P.1 (CRI 001) P.15 (CRI 001)
#> 21     G.D.       Brown        B       set_3 P.7 (CRI 001) P.15 (CRI 001)

The results are different as per the different linkage strategy. We can use as.data.frame or as.list to access the contents of each identifier to inspect the difference.

as.data.frame(triplicate$p1)
#>    pid sn pid_cri link_id pid_total iteration       pid_dataset
#> 1    1  1       1       1         6         1 set_1,set_2,set_3
#> 2    2  2       1       2         3         1 set_1,set_2,set_3
#> 3    3  3       1       3         3         1 set_1,set_2,set_3
#> 4    4  4       1       4         3         1 set_1,set_2,set_3
#> 5    5  5       2       5         3         1 set_1,set_2,set_3
#> 6    1  6       1       1         6         1 set_1,set_2,set_3
#> 7    7  7       1       7         3         1 set_1,set_2,set_3
#> 8    1  8       1       1         6         1 set_1,set_2,set_3
#> 9    2  9       1       2         3         1 set_1,set_2,set_3
#> 10   3 10       1       3         3         1 set_1,set_2,set_3
#> 11   4 11       1       4         3         1 set_1,set_2,set_3
#> 12   5 12       2       5         3         1 set_1,set_2,set_3
#> 13   1 13       1       1         6         1 set_1,set_2,set_3
#> 14   7 14       1       7         3         1 set_1,set_2,set_3
#> 15   1 15       1       1         6         2 set_1,set_2,set_3
#> 16   2 16       1       2         3         1 set_1,set_2,set_3
#> 17   3 17       1       3         3         1 set_1,set_2,set_3
#> 18   4 18       1       4         3         2 set_1,set_2,set_3
#> 19   5 19       2       5         3         1 set_1,set_2,set_3
#> 20   1 20       1       1         6         2 set_1,set_2,set_3
#> 21   7 21       1       7         3         1 set_1,set_2,set_3

as.data.frame(triplicate$p2)
#>    pid sn pid_cri link_id pid_total iteration pid_dataset
#> 1    1  1       1       1         3         1         A,B
#> 2    2  2       0       2         1         2           A
#> 3    3  3       1       3         3         1         A,B
#> 4    3  4       1       3         3         1         A,B
#> 5    3  5       2       4         3         1         A,B
#> 6    1  6       1       1         3         1         A,B
#> 7    1  7       1       1         3         2         A,B
#> 8    8  8       1       8         3         1         A,B
#> 9    9  9       0       9         1         2           A
#> 10  10 10       1      10         3         1         A,B
#> 11  10 11       1      10         3         1         A,B
#> 12  10 12       2      11         3         1         A,B
#> 13   8 13       1       8         3         1         A,B
#> 14   8 14       1       8         3         2         A,B
#> 15  15 15       1      15         3         1         A,B
#> 16  16 16       0      16         1         2           A
#> 17  17 17       1      17         3         1         A,B
#> 18  17 18       1      17         3         1         A,B
#> 19  17 19       2      18         3         1         A,B
#> 20  15 20       1      15         3         1         A,B
#> 21  15 21       1      15         3         2         A,B

Introduction to record linkage with diyar

04 December 2021