Example: Multi-Ethnicity Categorization

Vinh Nguyen

2022-06-07

Background on Multi-Ethnicity Categorization

When it comes to disaggregating student outcomes by ethnicity, colleges typically rely on an ethnicity categorization that assigns each student to a single race/ethnicity category. However, it is not uncommon for students to identify as a member of more than one race/ethnicity group. Categorization of ethnicity is typically simplified by assigning these multi-ethnicity students to a “2 or more Races” group or by assigning students to a single race/ethnicity category using a predefined rule set. For example, in the case of IPEDS race and ethnicity reporting, which mirrors the US Census, students are asked on their college application (CCC Apply for California Community Colleges) if they are “Hispanic / Latino” (yes or no) in one question. Then in a subsequent question, students are asked to check all races that they identify with (e.g., White, Black, Asian). If a student identifies as Hispanic or Latino, then that student is grouped into the “Hispanic / Latino” ethnicity group in IPEDS reporting regardless of how many additional race/ethnicity boxes are checked.
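As a rough sketch of such a predefined rule set, the function below collapses a Hispanic / Latino response and a set of checked races into one category. The function name, its arguments, and the "Unknown" fallback are assumptions made for illustration; actual IPEDS reporting has additional categories and rules that this simplified version ignores.

```r
# Hypothetical helper (not part of DisImpact): collapse the two application
# questions into a single race/ethnicity category using a simplified rule set.
assign_single_category <- function(hispanic, races) {
  if (isTRUE(hispanic)) {
    return("Hispanic / Latino")  # takes precedence over any race boxes checked
  }
  if (length(races) == 0) {
    return("Unknown")            # assumed fallback when no race is checked
  }
  if (length(races) > 1) {
    return("2 or more Races")
  }
  races[[1]]                     # exactly one race checked
}

assign_single_category(TRUE, c("White", "Asian"))   # "Hispanic / Latino"
assign_single_category(FALSE, c("Black", "Asian"))  # "2 or more Races"
assign_single_category(FALSE, "White")              # "White"
```

Under this rule, a student who is Hispanic and White never appears in the White group, which is the kind of omission the multi-ethnicity view is meant to address.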

Conducting a disproportionate impact (DI) analysis using a single ethnicity categorization can skew results when some students are left out (the impact can be large depending on the institution and the size of these groups), and can also mask student groups that remain hidden under a single categorization rule. A more inclusive approach to DI analysis would be to include students in all ethnicity groups that they identify with. For example, if a student identifies as Hispanic and White, then they should be included in both the Hispanic group and the White group. Similarly, if a student identifies as Black and Asian, then they should be included in both the Black group and the Asian group.

Carrying out the previous analysis is certainly feasible, but it poses practical implementation challenges for at least two reasons:

  1. The proportions of the ethnicity groups do not add up to 100%, leading to a potentially distorted view and confusion among the audience.
  2. Working with more granular ethnicity data from multiple variables increases the complexity of the analysis considerably, reducing the likelihood that a multi-ethnicity view is adopted in most DI analyses.
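Both the inclusive counting rule and the first caveat can be seen in a small made-up example. The data frame and its column names below are hypothetical; student 3 identifies as Asian and Black, and student 4 as Hispanic and White.

```r
# Toy data (made up): one row per student, 1 = student identifies with the group
toy <- data.frame(
  Student_ID    = 1:5,
  Flag_Asian    = c(1, 0, 1, 0, 0),
  Flag_Black    = c(0, 1, 1, 0, 0),
  Flag_Hispanic = c(0, 0, 0, 1, 1),
  Flag_White    = c(0, 0, 0, 1, 0)
)

flag_cols <- grep("^Flag_", names(toy), value = TRUE)

# Inclusive counts: each student is counted in every group they identify with
group_n <- colSums(toy[flag_cols])

# Share of all students in each group; these shares can exceed 100% in total
group_share <- group_n / nrow(toy)
group_share
sum(group_share)  # 1.4, i.e. 140%, because two students are counted twice
```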

In this vignette, we illustrate how the DisImpact package can be adapted to carry out a multi-ethnicity analysis, using the di_iterate function as the workhorse and then manipulating the returned summary data set.

Multi-Ethnicity Data Format

In the case of single ethnicity categorization, ethnicity is usually stored in a single variable or column that lists the ethnicity group for each student (row). In the case of multi-ethnicity data, where a student can belong to multiple groups, there are several ways to store this information. Here, we describe three common approaches:

  1. A single variable/column that lists all groups that a student (row) belongs to in a single data cell, usually delimited by a comma or some other delimiter. For example, the data cell could have the value Asian, Black, Hispanic if the student falls into these three groups.
  2. A wide format consisting of the same number of flags as there are groups, where each flag is binary (1/0), indicating group membership in each column. For example, if there are 9 groups, then there would be 9 additional variables/columns with names such as Flag_Group_1, …, Flag_Group_9, where each variable will take on a value of 1 or 0, with 1 indicating group membership, and 0 indicating non-membership.
  3. A long format that has each student corresponding to the same number of rows as the number of groups they are members of, with an ethnicity column similar to that of a single ethnicity categorization. For example, if a student self-identifies as Asian, Black, and Hispanic, then the student would have three rows in the data set, with an ethnicity column having the following three values in the three corresponding rows: Asian, Black, Hispanic.

The second, wide format is preferred when it comes to conducting a DI analysis using the DisImpact package. The included student_equity data set consists of ethnicity flags, and these variables will be used in a multi-ethnicity DI analysis.
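If the data arrive in one of the other two formats, they can be reshaped into flags before analysis. The following is a minimal base R sketch on made-up data; the data frame names and the EthnicityFlag_ prefix are chosen for illustration only.

```r
# Format 1 (made-up data): all groups listed in one comma-delimited cell
delimited <- data.frame(
  Student_ID = c(1, 2, 3),
  Ethnicity  = c("Asian, Black, Hispanic", "White", "Hispanic, White"),
  stringsAsFactors = FALSE
)

# Format 1 -> format 3 (long): split each cell and repeat the student ID
groups_list <- strsplit(delimited$Ethnicity, ",\\s*")
long <- data.frame(
  Student_ID = rep(delimited$Student_ID, lengths(groups_list)),
  Ethnicity  = unlist(groups_list),
  stringsAsFactors = FALSE
)

# Format 3 -> format 2 (wide): one 1/0 flag column per group
all_groups <- sort(unique(long$Ethnicity))
wide <- data.frame(Student_ID = delimited$Student_ID)
for (g in all_groups) {
  members <- long$Student_ID[long$Ethnicity == g]
  wide[[paste0("EthnicityFlag_", g)]] <- as.integer(wide$Student_ID %in% members)
}
wide
```

Group labels that contain spaces or punctuation would produce awkward column names, so in practice one would likely clean the labels before building the flag names.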

Multi-Ethnicity Example Data Set

As seen in the Scaling DI vignette, one could repeat DI calculations over various success variables, group (disaggregation) variables, and cohort variables using the di_iterate function. The original intent of the di_iterate function is to take in a student-level data set and output a data set with summary results of disaggregation that could be referenced in a dashboard tool like Tableau or PowerBI. A pre-calculated data set (the output of di_iterate) makes it relatively easy to visualize disaggregation, equity gaps, and disproportionate impact across many outcome variables, cohort variables, disaggregation variables, and scenarios (subsets) by filtering on the appropriate rows (summarized results). The following snippet illustrates this capability using default options:

# Load some necessary packages
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(forcats)
library(DisImpact)

# Load student equity data set
data(student_equity)

# Calculate DI over several scenarios
df_di_summary <- di_iterate(data=student_equity
                          , success_vars=c('Math', 'English', 'Transfer')
                          , group_vars=c('Ethnicity', 'Gender')
                          , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                          , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                            )

In addition to the Ethnicity variable, the student_equity data set also contains ethnicity flags that are more granular, based on what students report. A student could be assigned to more than one category. For example, a student could fall into the Asian and South East Asian categories. Similarly, a student could fall into both the White and South West Asian / North African (SWANA) categories.

head(student_equity)
##         Ethnicity Gender Cohort Transfer Cohort_Math Math Cohort_English
## 1 Native American Female   2017        0        2017    1           2017
## 2 Native American Female   2017        0        2018    1             NA
## 3 Native American Female   2017        0        2018    1           2017
## 4 Native American   Male   2017        1        2017    1           2018
## 5 Native American   Male   2017        0        2017    1           2019
## 6 Native American   Male   2017        1        2019    1           2018
##   English      Ed_Goal     College_Status Student_ID EthnicityFlag_Asian
## 1       0 Deg/Transfer First-time College     100001                   0
## 2      NA Deg/Transfer First-time College     100002                   0
## 3       0 Deg/Transfer First-time College     100003                   0
## 4       1        Other First-time College     100004                   0
## 5       0 Deg/Transfer              Other     100005                   0
## 6       1        Other First-time College     100006                   0
##   EthnicityFlag_Black EthnicityFlag_Hispanic EthnicityFlag_NativeAmerican
## 1                   0                      0                            1
## 2                   0                      0                            1
## 3                   0                      0                            1
## 4                   0                      0                            1
## 5                   0                      0                            1
## 6                   0                      0                            1
##   EthnicityFlag_PacificIslander EthnicityFlag_White EthnicityFlag_Carribean
## 1                             0                   0                       0
## 2                             0                   0                       0
## 3                             0                   0                       0
## 4                             0                   0                       0
## 5                             0                   0                       0
## 6                             0                   0                       0
##   EthnicityFlag_EastAsian EthnicityFlag_SouthEastAsian
## 1                       0                            0
## 2                       0                            0
## 3                       0                            0
## 4                       0                            0
## 5                       0                            0
## 6                       0                            0
##   EthnicityFlag_SouthWestAsianNorthAfrican EthnicityFlag_AANAPI
## 1                                        0                    1
## 2                                        0                    1
## 3                                        0                    1
## 4                                        0                    1
## 5                                        0                    1
## 6                                        0                    1
##   EthnicityFlag_Unknown EthnicityFlag_TwoorMoreRaces
## 1                     0                            0
## 2                     0                            0
## 3                     0                            0
## 4                     0                            0
## 5                     0                            0
## 6                     0                            0
## # Correlation to show overlap
## cor(student_equity[, str_detect(names(student_equity), 'EthnicityFlag')])
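Overlap between flags, such as students who fall into both the Asian and South East Asian categories, can also be counted directly rather than inferred from correlations. The sketch below uses a small made-up data frame with a few of the same column names. The same steps should apply to the flag columns of student_equity while they are still numeric.

```r
# Toy flags (made up): rows are students, 1 = member of the group
toy <- data.frame(
  EthnicityFlag_Asian                      = c(1, 1, 0, 0),
  EthnicityFlag_SouthEastAsian             = c(1, 0, 0, 0),
  EthnicityFlag_White                      = c(0, 0, 1, 1),
  EthnicityFlag_SouthWestAsianNorthAfrican = c(0, 0, 1, 0)
)
flags <- as.matrix(toy)

# Number of groups each student belongs to
n_groups <- rowSums(flags)
table(n_groups)

# Co-occurrence counts: entry [i, j] is the number of students in both
# group i and group j; the diagonal holds each group's total
overlap <- crossprod(flags)
overlap
```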

Analysis of Multi-Ethnicity Data Using DisImpact

For a multi-ethnicity analysis, one could pass a list of ethnicity flags to the group_vars parameter of di_iterate, similar to how Gender and Ethnicity were passed in the previous example to create df_di_summary. However, since the flags are binary (1’s and 0’s), and the ethnicity group names are in the variable names themselves (e.g., EthnicityFlag_Asian), the user needs to filter on the rows corresponding to the groups of interest (the 1 value of each flag), extract the group names from the disaggregation column, and store them in the group column of the returned summary data set. The following code illustrates this with the student_equity data set.

# Identify the ethnicity flag variables
want_vars <- names(student_equity)[str_detect(names(student_equity), '^EthnicityFlag')]
want_vars <- want_vars[!str_detect(want_vars, 'Unknown')] # Remove Unknown
want_vars <- want_vars[!str_detect(want_vars, 'Two')] # Remove Two or More Races
want_vars # Ethnicity Flags of interest
##  [1] "EthnicityFlag_Asian"                     
##  [2] "EthnicityFlag_Black"                     
##  [3] "EthnicityFlag_Hispanic"                  
##  [4] "EthnicityFlag_NativeAmerican"            
##  [5] "EthnicityFlag_PacificIslander"           
##  [6] "EthnicityFlag_White"                     
##  [7] "EthnicityFlag_Carribean"                 
##  [8] "EthnicityFlag_EastAsian"                 
##  [9] "EthnicityFlag_SouthEastAsian"            
## [10] "EthnicityFlag_SouthWestAsianNorthAfrican"
## [11] "EthnicityFlag_AANAPI"
# Number of students
## Total
student_equity %>%
  group_by(Cohort) %>%
  tally
## # A tibble: 2 x 2
##   Cohort     n
##    <int> <int>
## 1   2017 10000
## 2   2018 10000
## Each group
student_equity %>%
  select(Cohort, one_of(want_vars)) %>% 
  group_by(Cohort) %>%
  summarize_all(.funs=sum) %>%
  as.data.frame
##   Cohort EthnicityFlag_Asian EthnicityFlag_Black EthnicityFlag_Hispanic
## 1   2017                3414                1135                   2018
## 2   2018                3444                1129                   2029
##   EthnicityFlag_NativeAmerican EthnicityFlag_PacificIslander
## 1                          244                            60
## 2                          243                            65
##   EthnicityFlag_White EthnicityFlag_Carribean EthnicityFlag_EastAsian
## 1                4332                      68                    2043
## 2                4359                      75                    2114
##   EthnicityFlag_SouthEastAsian EthnicityFlag_SouthWestAsianNorthAfrican
## 1                         1439                                     1164
## 2                         1443                                     1130
##   EthnicityFlag_AANAPI
## 1                 3714
## 2                 3742
## Observation: students can be in more than 1 group

# Convert the ethnicity flags to character as required by di_iterate
for (varname in want_vars) {
  student_equity[[varname]] <- as.character(student_equity[[varname]])
}

# DI analysis
df_di_summary_mult_eth <- di_iterate(data=student_equity
                          , success_vars=c('Math', 'English', 'Transfer')
                          , group_vars=want_vars # specify the list of ethnicity flag variables
                          , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
                          , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
                          , di_80_index_reference_groups='all but current'
                            ) %>%
  filter(group=='1') %>% # Ethnicity flags have 1's and 0's; filter on just the 1 group as that is of interest
  # filter((group=='1') | (disaggregation=='- None' & group=='- All')) %>% 
  mutate(
    # Rather than show '1', identify the ethnicity group names and assign them to group
    group=str_replace(disaggregation, 'EthnicityFlag_', '') %>%
      gsub(pattern='([A-Z])', replacement=' \\1', x=.) %>%
      str_replace('^ ', '') %>%
      str_replace('A A N A P I', 'AANAPI')
    # Originally is a list of variable names corresponding to the various ethnicity flags; call this disaggregation 'Multi-Ethnicity'
    , disaggregation='Multi-Ethnicity'
  )

# Check if re-assignments are correct
table(df_di_summary_mult_eth$disaggregation, useNA='ifany')
## 
## Multi-Ethnicity 
##             981
table(df_di_summary_mult_eth$group, useNA='ifany')
## 
##                         AANAPI                          Asian 
##                             90                             90 
##                          Black                      Carribean 
##                             90                             87 
##                     East Asian                       Hispanic 
##                             90                             90 
##                Native American               Pacific Islander 
##                             90                             84 
##               South East Asian South West Asian North African 
##                             90                             90 
##                          White 
##                             90
# Illustration: the group proportions add up to more than 100% since a student could be counted in more than 1 group
df_di_summary_mult_eth %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Transfer', cohort=='2018') %>%
  select(group, n) %>%
  mutate(Proportion=n / sum(student_equity$Cohort=='2018')) %>%
  mutate(Sum_Proportion=sum(Proportion))
## # A tibble: 11 x 4
##    group                              n Proportion Sum_Proportion
##    <chr>                          <dbl>      <dbl>          <dbl>
##  1 Asian                           3444     0.344            1.98
##  2 Black                           1129     0.113            1.98
##  3 Hispanic                        2029     0.203            1.98
##  4 Native American                  243     0.0243           1.98
##  5 Pacific Islander                  65     0.0065           1.98
##  6 White                           4359     0.436            1.98
##  7 Carribean                         75     0.0075           1.98
##  8 East Asian                      2114     0.211            1.98
##  9 South East Asian                1443     0.144            1.98
## 10 South West Asian North African  1130     0.113            1.98
## 11 AANAPI                          3742     0.374            1.98
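The filter-and-relabel step applied to the di_iterate output above can be tried in isolation. The following base R sketch runs the same string manipulations on a made-up summary table. The helper flag_to_label and the toy values are assumptions for illustration, not part of DisImpact.

```r
# Toy stand-in (made up) for a di_iterate summary: one row per flag value
toy_summary <- data.frame(
  disaggregation = c("EthnicityFlag_SouthEastAsian", "EthnicityFlag_SouthEastAsian",
                     "EthnicityFlag_AANAPI", "EthnicityFlag_AANAPI"),
  group = c("0", "1", "0", "1"),
  n = c(80, 20, 60, 40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn a flag variable name into a readable group label
flag_to_label <- function(x) {
  x <- sub("^EthnicityFlag_", "", x)   # drop the common prefix
  x <- gsub("([A-Z])", " \\1", x)      # insert a space before each capital letter
  x <- sub("^ ", "", x)                # drop the leading space
  sub("A A N A P I", "AANAPI", x, fixed = TRUE)  # restore the acronym
}

# Keep only the member rows, then move the group name into the group column
members <- toy_summary[toy_summary$group == "1", ]
members$group <- flag_to_label(members$disaggregation)
members$disaggregation <- "Multi-Ethnicity"
members
```

Acronyms other than AANAPI would be split into single letters by the capital-letter rule, so each one needs its own repair step.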

Visualizing in Dashboard Platform

Once a DI summary data set for multi-ethnicity is available, it could be combined with other summary data sets to be used in dashboard development as described in the Scaling DI vignette.

# Combine
df_di_summary_combined <- bind_rows(
  df_di_summary
  , df_di_summary_mult_eth # Could first filter on rows of interest (e.g., just the categorizations of interest to the institution)
)

# Disaggregation: Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame
##    cohort           group    n       pct di_indicator_ppg
## 1    2017           Asian 1406 0.8968706                0
## 2    2017           Black  421 0.7862233                1
## 3    2017        Hispanic  815 0.7325153                1
## 4    2017 Multi-Ethnicity  211 0.8293839                0
## 5    2017 Native American   45 0.9333333                0
## 6    2017           White 1500 0.8773333                0
## 7    2018           Asian 2212 0.9235986                0
## 8    2018           Black  684 0.7441520                1
## 9    2018        Hispanic 1386 0.7366522                1
## 10   2018 Multi-Ethnicity  369 0.7940379                1
## 11   2018 Native American   68 0.8088235                0
## 12   2018           White 2576 0.8819876                0
## 13   2019           Asian 1429 0.9083275                0
## 14   2019           Black  411 0.7834550                1
## 15   2019        Hispanic  786 0.7404580                1
## 16   2019 Multi-Ethnicity  225 0.8000000                0
## 17   2019 Native American   47 0.8297872                0
## 18   2019           White 1558 0.8896021                0
## 19   2020           Asian  573 0.9301920                0
## 20   2020           Black  180 0.7333333                1
## 21   2020        Hispanic  304 0.7171053                1
## 22   2020 Multi-Ethnicity   99 0.7575758                0
## 23   2020 Native American   14 0.6428571                0
## 24   2020           White  610 0.8819672                0
##    di_indicator_prop_index di_indicator_80_index
## 1                        0                     0
## 2                        0                     0
## 3                        0                     1
## 4                        0                     0
## 5                        0                     0
## 6                        0                     0
## 7                        0                     0
## 8                        0                     0
## 9                        0                     1
## 10                       0                     0
## 11                       0                     0
## 12                       0                     0
## 13                       0                     0
## 14                       0                     0
## 15                       0                     0
## 16                       0                     0
## 17                       0                     0
## 18                       0                     0
## 19                       0                     0
## 20                       0                     1
## 21                       0                     1
## 22                       0                     0
## 23                       1                     1
## 24                       0                     0
# Disaggregation: Multi-Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Multi-Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame
##    cohort                          group    n       pct di_indicator_ppg
## 1    2017                          Asian 1571 0.8873329                0
## 2    2018                          Asian 2541 0.9086974                0
## 3    2019                          Asian 1622 0.8896424                0
## 4    2020                          Asian  655 0.9160305                0
## 5    2017                          Black  485 0.7752577                1
## 6    2018                          Black  767 0.7470665                1
## 7    2019                          Black  464 0.7887931                1
## 8    2020                          Black  204 0.7205882                1
## 9    2017                       Hispanic  822 0.7347932                1
## 10   2018                       Hispanic 1407 0.7391613                1
## 11   2019                       Hispanic  794 0.7430730                1
## 12   2020                       Hispanic  309 0.7184466                1
## 13   2017                Native American  103 0.8834951                0
## 14   2018                Native American  162 0.7901235                0
## 15   2019                Native American  110 0.8090909                0
## 16   2020                Native American   44 0.7727273                0
## 17   2017               Pacific Islander   25 0.3600000                1
## 18   2018               Pacific Islander   41 0.4390244                1
## 19   2019               Pacific Islander   30 0.3000000                1
## 20   2020               Pacific Islander   16 0.1875000                1
## 21   2017                          White 1880 0.8622340                0
## 22   2018                          White 3271 0.8550902                0
## 23   2019                          White 1960 0.8668367                0
## 24   2020                          White  767 0.8578879                0
## 25   2017                      Carribean   29 0.8620690                0
## 26   2018                      Carribean   47 0.7446809                0
## 27   2019                      Carribean   32 0.8437500                0
## 28   2020                      Carribean   16 0.7500000                0
## 29   2017                     East Asian  904 0.8816372                0
## 30   2018                     East Asian 1555 0.9163987                0
## 31   2019                     East Asian 1008 0.8898810                0
## 32   2020                     East Asian  388 0.9201031                0
## 33   2017               South East Asian  698 0.8982808                0
## 34   2018               South East Asian 1047 0.9006686                0
## 35   2019               South East Asian  665 0.8962406                0
## 36   2020               South East Asian  284 0.9014085                0
## 37   2017 South West Asian North African  507 0.8737673                0
## 38   2018 South West Asian North African  836 0.8720096                0
## 39   2019 South West Asian North African  532 0.8872180                0
## 40   2020 South West Asian North African  213 0.8591549                0
## 41   2017                         AANAPI 1699 0.8864038                0
## 42   2018                         AANAPI 2740 0.8952555                0
## 43   2019                         AANAPI 1758 0.8794084                0
## 44   2020                         AANAPI  702 0.8974359                0
##    di_indicator_prop_index di_indicator_80_index
## 1                        0                     0
## 2                        0                     0
## 3                        0                     0
## 4                        0                     0
## 5                        0                     0
## 6                        0                     0
## 7                        0                     0
## 8                        0                     0
## 9                        0                     0
## 10                       0                     0
## 11                       0                     0
## 12                       0                     0
## 13                       0                     0
## 14                       0                     0
## 15                       0                     0
## 16                       0                     0
## 17                       1                     1
## 18                       1                     1
## 19                       1                     1
## 20                       1                     1
## 21                       0                     0
## 22                       0                     0
## 23                       0                     0
## 24                       0                     0
## 25                       0                     0
## 26                       0                     0
## 27                       0                     0
## 28                       0                     0
## 29                       0                     0
## 30                       0                     0
## 31                       0                     0
## 32                       0                     0
## 33                       0                     0
## 34                       0                     0
## 35                       0                     0
## 36                       0                     0
## 37                       0                     0
## 38                       0                     0
## 39                       0                     0
## 40                       0                     0
## 41                       0                     0
## 42                       0                     0
## 43                       0                     0
## 44                       0                     0
# Disaggregation: Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>% 
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  ## geom_point(aes(size=factor(di_indicator_80_index, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02'), name='Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Ethnicity'"))
## Warning: Using size for a discrete variable is not advised.

# Disaggregation: Multi-Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Multi-Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>% 
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  ## geom_point(aes(size=factor(di_indicator_80_index, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99'), name='Multi-Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Multi-Ethnicity'"))
## Warning: Using size for a discrete variable is not advised.

Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages. Results may differ slightly when the reader replicates the code with different package versions.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.0    scales_1.1.1     ggplot2_3.3.2    stringr_1.4.0   
## [5] knitr_1.39       dplyr_1.0.8      DisImpact_0.0.18
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3     pillar_1.7.0     bslib_0.3.1      compiler_4.0.2  
##  [5] jquerylib_0.1.4  highr_0.9        prettydoc_0.4.1  tools_4.0.2     
##  [9] digest_0.6.25    gtable_0.3.0     jsonlite_1.5     evaluate_0.15   
## [13] lifecycle_1.0.1  tibble_3.1.6     fstcore_0.9.12   pkgconfig_2.0.3 
## [17] rlang_1.0.1      cli_3.2.0        yaml_2.3.5       parallel_4.0.2  
## [21] xfun_0.30        fastmap_1.1.0    withr_2.5.0      generics_0.1.2  
## [25] vctrs_0.3.8      sass_0.4.1       grid_4.0.2       tidyselect_1.1.2
## [29] glue_1.6.1       R6_2.3.0         fansi_1.0.2      rmarkdown_2.14  
## [33] farver_2.0.3     purrr_0.3.4      tidyr_1.2.0      magrittr_2.0.2  
## [37] ellipsis_0.3.2   htmltools_0.5.2  fst_0.9.8        colorspace_1.4-1
## [41] labeling_0.3     utf8_1.2.2       stringi_1.4.6    munsell_0.5.0   
## [45] crayon_1.5.0