Extending partition

Malcolm Barrett

2021-10-04

Customizing existing partitioners

Partitioners are functions that tell the partition algorithm 1) what to try to reduce 2) how to measure how much information is lost from the reduction and 3) how to reduce the data. We call this approach Direct-Measure-Reduce. In partition, functions that handle 1) are thus called directors, functions that handle 2) are called metrics, and functions that handle 3) are called reducers. partition has a number of pre-specified partitioners for agglomerative data reduction. See the vignette introducing partition to learn more about these existing partitioners.

partition is agnostic to the direct-measure-reduce functions; it only needs to know what to apply to the data and will handle the rest. It’s then easy to extend partition to handle other directors, metrics, and reducers. Let’s consider a simple example: part_icc(), the default partitioner in partition(). part_icc() produces scaled row means for reduced variables, but let’s say we want to use unscaled means, instead. replace_partitioner() takes a partitioner and lets you reassign any part of the direct-measure-reduce algorithm. as_director(), as_measure(), and as_reducer() are helper functions to do so. Here, we’ll use as_reducer() and rowMeans() to create a new reducer.

library(partition)

part_icc_rowmeans <- replace_partitioner(
  part_icc,
  reduce = as_reducer(rowMeans)
)

part_icc_rowmeans
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: <custom reducer>

Note that it now uses a custom reducer. We can apply part_icc_rowmeans the same way as other partitioners.

set.seed(1234)

df <- simulate_block_data(
  block_sizes = rep(5, 3),
  lower_corr = .4,
  upper_corr = .6,
  n = 100
)

prt <- partition(df, .5, partitioner = part_icc_rowmeans)
prt
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: <custom reducer>
#> 
#> Reduced Variables:
#> 4 reduced variables created from 11 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block1_x1, block1_x2, block1_x3, block1_x4}
#> reduced_var_2 = {block3_x1, block3_x3, block3_x5}
#> reduced_var_3 = {block2_x2, block2_x5}
#> reduced_var_4 = {block2_x1, block2_x4}
#> 
#> Minimum information:
#> 0.507
partition_scores(prt)
#> # A tibble: 100 × 8
#>    block1_x5 block2_x3 block3_x2 block3_x4 reduced_var_1 reduced_var_2
#>        <dbl>     <dbl>     <dbl>     <dbl>         <dbl>         <dbl>
#>  1   -0.200     0.665    -0.959     -0.444       -0.790         -0.685
#>  2   -0.976     0.468    -0.738      1.44        -1.06          -0.636
#>  3    1.42     -0.142     0.292      0.944        1.27           0.282
#>  4    0.305    -0.824     1.85       1.57         0.512          1.37 
#>  5    0.0165    0.230     1.14       0.943       -0.0377        -0.646
#>  6   -1.02     -1.24      0.872     -0.143       -0.435         -0.114
#>  7   -0.201    -1.27      2.41       1.84         0.662          2.31 
#>  8    0.668     0.0977   -0.0585     0.869       -0.974          0.387
#>  9   -1.19     -1.01      0.618      0.305       -0.0846        -0.884
#> 10    0.272     0.605     1.18      -0.213       -0.346          1.10 
#> # … with 90 more rows, and 2 more variables: reduced_var_3 <dbl>,
#> #   reduced_var_4 <dbl>

as_measure() works much the same way: it accepts a function that returns a single metric to check against the threshold. Let’s swap out ICC for inter-item reliability:

inter_item_reliability <- function(.data) {
   corr(.data) %>%
    colMeans(na.rm = TRUE) %>%
    mean()
}

measure_iir <- as_measure(inter_item_reliability)

prt <- partition(df, .5, partitioner = replace_partitioner(part_icc, measure = measure_iir))
prt
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: <custom metric> 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 3 reduced variables created from 15 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block1_x1, block1_x2, block1_x3, block1_x4, block1_x5}
#> reduced_var_2 = {block2_x1, block2_x2, block2_x3, block2_x4, block2_x5}
#> reduced_var_3 = {block3_x1, block3_x2, block3_x3, block3_x4, block3_x5}
#> 
#> Minimum information:
#> 0.53

This returns a different partition because each reduced variable must now have an inter-item reliability of .5 or greater.

as_director() supports directors in the style of direct_distance(). Instead of a single function, however, it takes two: .pairs, a way to create a matrix comparing each variable, and .target, a way to select two variables to possibly reduce. direct_distance() does this by fitting a correlation-based distance matrix and using the variables with the smallest distance between them. Let’s try an example with Euclidean distance, instead. We’ll create two functions: euc_dist(), which returns a pairwise matrix of Euclidean distance between variables, and min_dist(), which finds the smallest distance and returns the names of two variables.

euc_dist <- function(.data) as.matrix(dist(t(.data)))

# find the pair with the minimum distance
min_dist <- function(.x) {
  indices <- arrayInd(which.min(.x), dim(as.matrix(.x)))

  #  get variable names with minimum distance
  c(
    colnames(.x)[indices[1]],
    colnames(.x)[indices[2]]
  )
}

We’ll pass these functions to as_director() and, as above, apply it to part_icc().

direct_euc_dist <- as_director(euc_dist, min_dist)

prt <- partition(df, .5, partitioner = replace_partitioner(part_icc, direct = direct_euc_dist))
prt
#> Partitioner:
#>    Director: <custom director> 
#>    Metric: Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 4 reduced variables created from 11 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x2, block2_x5}
#> reduced_var_2 = {block3_x1, block3_x3, block3_x5}
#> reduced_var_3 = {block2_x1, block2_x4}
#> reduced_var_4 = {block1_x1, block1_x2, block1_x3, block1_x4}
#> 
#> Minimum information:
#> 0.507

Creating new partitioners

The part_*() functions are actually wrappers for as_partitioner(). Each partitioner has a set of component functions that direct, measure, and reduce: direct_*(), measure_*(), and reduce_*(). ICC, for instance, is measured with measure_icc(). Passing direct-measure-reduce functions to as_partitioner() thus creates a partitioner. The source code for part_icc(), for instance, looks like this:

function(spearman = FALSE) {
  as_partitioner(
    direct = direct_dist(spearman = spearman),
    measure = measure_icc,
    reduce = reduce_scaled_mean
  )
}

We can use as_partitioner() with the built-in direct_*(), measure_*(), and reduce_*() functions or apply custom components, like we created above. It’s easy to create a totally new partitioner using the functions we wrote above:

custom_part <- as_partitioner(
  direct = as_director(euc_dist, min_dist),
  measure = as_measure(inter_item_reliability),
  reduce = as_reducer(rowMeans)
)

partition(df, .5, custom_part)
#> Partitioner:
#>    Director: <custom director> 
#>    Metric: <custom metric> 
#>    Reducer: <custom reducer>
#> 
#> Reduced Variables:
#> 3 reduced variables created from 15 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x1, block2_x2, block2_x3, block2_x4, block2_x5}
#> reduced_var_2 = {block1_x1, block1_x2, block1_x3, block1_x4, block1_x5}
#> reduced_var_3 = {block3_x1, block3_x2, block3_x3, block3_x4, block3_x5}
#> 
#> Minimum information:
#> 0.53

partition_step(), map_cluster(), and reduce_cluster()

as_director(), as_measure(), and as_reducer() help facilitate working with the the way partition() iterates; they essentially put the custom components in the right place, handle storage of objects, and pass the right results on for you. You can also work with the partition algorithm directly. Internally, partition() starts by creating a partition_step object. It’s this partition_step object that is passed on while the algorithm iterates; all directors, metrics, and reducers take a partition_step as the first argument and return a partition_step.

As a simple example, consider as_reducer(rowMeans). This returns a function that looks like this:

function(.partition_step) {
  reduce_cluster(.partition_step, rowMeans)
}

This takes a partition_step object, accesses the target (here, two variables to reduce) and metric, and if the metric is above the threshold, reduces them to a vector–a single reduced variable–using rowMeans(). The helper function reduce_cluster() applies the function to the partition_step the right way.

Some partitioners, like part_kmeans(), will assess many targets simultaneously, for instance by assigning all original variables to a cluster at a given level of k. as_reducer(rowMeans, returns_vector = FALSE) handles functions that should return a data frame instead. What it’s doing is using map_cluster() instead of reduce_cluster():

function(.partition_step) {
  map_cluster(.partition_step, rowMeans)
}

Knowing this can be useful because we can make changes to how the function works. Let’s say we want to add the ability to use the na.rm argument in rowMeans():

function(.partition_step, na.rm = FALSE) {
  partialized_rowMeans <- purrr::partial(rowMeans, na.rm = na.rm)
  map_cluster(.partition_step, partialized_rowMeans)
}

But we can also write these components from scratch. To do so, we need to work with partition_step directly. Let’s make a new director using hierarchical clustering with the hclust() function; it will work like part_kmeans() in that it will assign variables to a cluster for a given level of k, check if the information loss for each cluster, and reduce if no variables are below the threshold. We’ll create a director called direct_hcluster() to replace the director in part_kmeans(), but we’ll use the same metric (ICC) and reducer (scaled means).

A partition_step is just a list object, so you can use it store anything you need as the partition algorithm iterates. For instance, we’ll want to save k, the number of clusters we’re checking. Since hclust() only needs to be fit once (cuttree() does the assigning), we’ll also save that so we don’t have to keep fitting it. Each iteration will then have access to these objects. The target of the director should be saved to .partition_step$target, although you could handle it however you like if you make completely custom partitioners. Note that we can also use .partition_step$all_done <- TRUE as to tell the partition to end early. See ?as_partition_step for some of the common objects saved to partition_step.

For direct_hcluster, then, we’ll 1) create an initial k to check, 2) make sure we haven’t iterated through all levels of k without finding a set that have all reduced variables with their ICC above the threshold, 3) fit the hclust() function and save it for future iterations, and 4) generate cluster assignments using cuttree() and k. We now have a working director: it takes a partition_step, assigns a target, and returns a partition_step.

direct_hcluster <- function(.partition_step) {
  #  set initial k to 1 - number of cols in data
  if (is.null(.partition_step$k)) {
    .partition_step$k <- ncol(.partition_step$reduced_data) - 1
  }
  
  #  stop partition if all k checked
  if (.partition_step$k == 0) {
    #  tell the partition algorithm to stop
    .partition_step$all_done <- TRUE
    return(.partition_step)
  }
  
  if (is.null(.partition_step$hc)) {
    #  save hclust object for future use
    .partition_step$hc <- hclust(dist(t(.partition_step$reduced_data)))
  }
  
  .partition_step$target <- cutree(.partition_step$hc, k = .partition_step$k)
  
  .partition_step
}

As before, we can use as_partitioner() and apply it to our data using partition(). For this partitioner, part_hcluster, we’ll use our custom director, as well as measure_min_icc() and reduce_kmeans() from part_kmeans().

part_hcluster <- as_partitioner(
  direct = direct_hcluster,
  #  use same functions as part_kmeans() but search k linearly 
  measure = purrr::partial(measure_min_icc, search_method = "linear"),
  reduce =   purrr::partial(reduce_kmeans, search = "linear")
)

partition(df, .5, part_hcluster)
#> Partitioner:
#>    Director: <custom director> 
#>    Metric: Minimum Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 3 reduced variables created from 6 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x2, block2_x5}
#> reduced_var_2 = {block1_x1, block1_x4}
#> reduced_var_3 = {block3_x3, block3_x5}
#> 
#> Minimum information:
#> 0.534

partition is thus fully extensible. While we include many pre-specified partitioners, directors, metrics, and reducers, the partition algorithm is agnostic to those components; they can be fully specified by the user, making partition a very powerful framework for data reduction.