Deduplication using reclin

Jan van der Laan

2021-11-22

library(reclin)
library(dplyr)

The use of reclin for deduplication is demonstrated with an example data set of town names.

Town names containing ‘rdam’ or ‘rdm’ have been selected. This selection should cover most records concerning the two largest cities in the Netherlands: Amsterdam and Rotterdam.

data("town_names")
head(town_names)
#>                 name official_name
#> 1       alblasserdam  Alblasserdam
#> 2          amsterdam     Amsterdam
#> 3     amsterdam-z.o.     Amsterdam
#> 4 amsterdam-zuidoost     Amsterdam
#> 5      amsterdam z-o     Amsterdam
#> 6     amsterdam z.o.     Amsterdam

First, we do a little bit of cleanup of the names. There are many variants of the form ‘amsterdam z.o.’, ‘amsterdam zo’, etc., so removing non-alphanumeric characters will probably help. Also, some of the O’s are written as 0’s (zeros), which we replace.

town_names$clean_name <- gsub("[^[:alnum:]]", "", town_names$name)
town_names$clean_name <- gsub("0", "o", town_names$clean_name)

We will now compare all records in the dataset to each other. First, we generate all possible pairs of records. However, since comparing the first record to the second gives the same result as comparing the second to the first (and comparing a record to itself is pointless), we keep only the pairs for which the second index is larger than the first. This is done by filter_pairs_for_deduplication. We compare the names using jaro_winkler and select the pairs for which the Jaro-Winkler similarity is above 0.88. This threshold was determined by eyeballing the data; values around 0.9 usually work.

p <- pair_blocking(town_names, town_names) %>% 
  filter_pairs_for_deduplication() %>%
  compare_pairs("clean_name", default_comparator = jaro_winkler()) %>% 
  score_simsum() %>% 
  select_threshold(0.88)
head(p)
#> ldat with 6 rows and 5 columns
#>   x y clean_name    simsum select
#> 1 1 2  0.6679894 0.6679894  FALSE
#> 2 1 3  0.6208514 0.6208514  FALSE
#> 3 2 3  0.9393939 0.9393939   TRUE
#> 4 1 4  0.5459851 0.5459851  FALSE
#> 5 2 4  0.8431373 0.8431373  FALSE
#> 6 3 4  0.8823529 0.8823529   TRUE
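
The threshold of 0.88 used above was determined by looking at the similarity scores. One quick way to do this (a sketch, not in the original text; it assumes the ldat pairs object converts cleanly with as.data.frame) is to plot their distribution:

# Duplicates typically show up as a cluster of high similarity scores,
# separated from the bulk of the non-matching pairs.
hist(as.data.frame(p)$simsum, breaks = 50)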

We have now selected some pairs of town names that we consider the same: records 2 and 3 are classified as the same (pair 3 in the output above), as are records 3 and 4 (pair 6). However, records 2 and 4 are not classified as belonging to the same town (pair 5).
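
To check these classifications by eye, we can look up the cleaned names belonging to the selected pairs. A minimal sketch in base R (again assuming the pairs object converts with as.data.frame):

pd <- as.data.frame(p)                      # pairs as a regular data.frame
sel <- pd[pd$select, ]                      # keep only the selected pairs
sel$name_x <- town_names$clean_name[sel$x]  # name of the first record
sel$name_y <- town_names$clean_name[sel$y]  # name of the second record
head(sel)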

In the final step we want to assign each record in our original data set town_names to a group, with each group containing all records with the same town name. The function deduplicate_equivalence does this. It uses the ‘rules’ derived above (2 and 3 belong to the same group, 3 and 4 belong to the same group, etc.) to assign each record to a group. It will, therefore, also assign records 2 and 4 to the same group. For those familiar with graph theory: it treats the selected pairs as edges of a graph and assigns all nodes in the same connected subgraph the same identifier.
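
To make the graph analogy concrete, here is a small standalone illustration using the igraph package (not part of reclin; it only mimics the grouping on the three pairs discussed above):

library(igraph)
# Each selected pair becomes an edge; groups are the connected components.
g <- graph_from_data_frame(data.frame(from = c(2, 3), to = c(3, 4)))
components(g)$membership
# Records 2, 3 and 4 all end up in the same component, even though the
# pair (2, 4) itself was not selected.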

res <- deduplicate_equivalence(p)
head(res)
#>                 name official_name        clean_name duplicate_groups
#> 1       alblasserdam  Alblasserdam      alblasserdam              541
#> 2          amsterdam     Amsterdam         amsterdam              581
#> 3     amsterdam-z.o.     Amsterdam       amsterdamzo              581
#> 4 amsterdam-zuidoost     Amsterdam amsterdamzuidoost              581
#> 5      amsterdam z-o     Amsterdam       amsterdamzo              581
#> 6     amsterdam z.o.     Amsterdam       amsterdamzo              581

As we can see, records 2 to 6 are assigned to the same group. We can calculate the number of groups and compare that to the original number of town names:

length(unique(res$duplicate_groups))
#> [1] 53
length(unique(res$duplicate_groups))/nrow(res)
#> [1] 0.09075342

We are left with only 53 town names: a reduction of approximately 90 percent. For this small number of remaining groups it is feasible to derive the correct names manually, or, if such information is available, we could use the most frequent name in each group as the group name.

Let’s assume that we are able to determine the group names correctly. This means that we assign the most frequent official name in each group to that group:

# Count how often each official name occurs within each duplicate group,
# then use the most frequent official name as the name of the group
res <- res %>% group_by(duplicate_groups, official_name) %>% mutate(n = n()) %>% 
  group_by(duplicate_groups) %>%
  mutate(group_name = first(official_name, order_by = desc(n)))

We can then calculate the precision and recall for each official name:

# Precision per assigned group name: the fraction of records assigned to
# a group whose true official name equals the group name
precision <- res %>% group_by(group_name) %>% 
  summarise(precision = sum(group_name == official_name)/n())

# Recall per official name: the fraction of records with a given official
# name that ended up in the correspondingly named group; official names
# without a matching group get a precision of zero
precision_recall <- res %>% group_by(official_name) %>% 
  summarise(recall = sum(group_name == official_name)/n()) %>%
  left_join(precision, by = c("official_name" = "group_name")) %>% 
  mutate(precision = ifelse(is.na(precision), 0, precision))
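
These numbers summarise the underlying confusion matrix. As a quick check (not shown in the original text) the confusion matrix itself can be cross-tabulated directly; its diagonal holds the correctly grouped records:

# Cross-tabulate assigned group names against the true official names
table(assigned = res$group_name, truth = res$official_name)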

precision_recall
#> # A tibble: 19 × 3
#>    official_name           recall precision
#>    <fct>                    <dbl>     <dbl>
#>  1 Alblasserdam             1         1    
#>  2 Amsterdam                0.997     0.964
#>  3 Amsterdam-Duivendrecht   1         1    
#>  4 Botlek Rotterdam         0.5       1    
#>  5 Diemen                   0.167     1    
#>  6 Europoort Rotterdam      1         1    
#>  7 Hoogvliet Rotterdam      0.725     1    
#>  8 Leerdam                  1         0.778
#>  9 Naarden                  1         1    
#> 10 Nieuw Amsterdam          0.75      1    
#> 11 Pernis Rotterdam         0.688     1    
#> 12 Rotterdam                0.993     0.876
#> 13 Rotterdam Albrandswaard  1         1    
#> 14 Rotterdam Poortugaal     1         1    
#> 15 Rozenburg                1         1    
#> 16 Schiedam                 0         0    
#> 17 Spaarndam                1         1    
#> 18 Veendam                  0         0    
#> 19 Zwammerdam               0.4       1

Finally, we calculate the overall precision and recall:

summarise(precision_recall, mean(recall), mean(precision))
#> # A tibble: 1 × 2
#>   `mean(recall)` `mean(precision)`
#>            <dbl>             <dbl>
#> 1          0.748             0.875
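
Note that these are unweighted averages over the 19 official names, so small and large towns count equally. A record-weighted (micro) alternative, shown here as a sketch, is the fraction of individual records assigned to the correct group:

# Fraction of records whose assigned group name matches the true official name
mean(res$group_name == res$official_name)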