Background on Multi-Ethnicity Categorization
When it comes to disaggregating student outcomes by ethnicity, colleges typically rely on an ethnicity categorization that assigns each student to a single race/ethnicity category. However, it is not uncommon for students to identify as a member of more than one race/ethnicity group. Categorization of ethnicity is typically simplified by assigning these multi-ethnicity students to a “2 or more Races” group or assigning students to a single race/ethnicity category using a predefined rule set. For example, in the case of IPEDS race and ethnicity reporting, which mirrors the US Census, students are asked on their college application (CCC Apply for California Community Colleges) if they are “Hispanic / Latino” (yes or no) in one question. Then in a subsequent question, students are asked to check all races that they identify with (e.g., White, Black, Asian, etc.). If a student identifies as Hispanic or Latino, then that student would be grouped into the “Hispanic / Latino” ethnicity group in IPEDS reporting regardless of however many additional race/ethnicity boxes are checked.
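A single-category rule set of this kind can be sketched in a few lines of base R. This is only an illustration on toy data: the column names are hypothetical, and the handling of non-Hispanic students (one race kept as-is, several races mapped to "2 or more Races") is an assumption about a typical rule set rather than a statement of the official IPEDS logic.

```r
# Toy data: one row per student; all column names are hypothetical
students <- data.frame(
  student_id = 1:4,
  hispanic   = c(TRUE, FALSE, FALSE, TRUE),
  races      = c("White", "Black, Asian", "Asian", "White, Black"),
  stringsAsFactors = FALSE
)

# Number of race boxes checked by each student
n_races <- lengths(strsplit(students$races, ",\\s*"))

# Assumed rule set: Hispanic first, then "2 or more Races", else the single race checked
students$single_category <- ifelse(
  students$hispanic, "Hispanic / Latino",
  ifelse(n_races > 1, "2 or more Races", students$races)
)

students[, c("student_id", "races", "single_category")]
```

In this toy example, students 1 and 4 are reported only as Hispanic / Latino, so the other races they checked no longer appear in the disaggregation.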
Conducting a disproportionate impact (DI) analysis using a single ethnicity categorization can skew results because some students are left out of groups they identify with (the impact can be large depending on the institution and the size of these groups), and it can mask student groups that are hidden by the categorization rule. A more inclusive approach to DI analysis would be to include students in all ethnicity groups that they identify with. For example, if a student identifies as Hispanic and White, then they should be included in both the Hispanic group and the White group. Similarly, if a student identifies as Black and Asian, then they should be included in both the Black group and the Asian group.
Carrying out such an analysis is certainly feasible, but it is difficult to implement in practice for at least two reasons:
- The shares of the ethnicity groups do not add up to 100%, leading to a potentially distorted view and confusion among the audience.
- Working with more granular ethnicity data from multiple variables increases the complexity of the analysis considerably, reducing the likelihood that a multi-ethnicity view is adopted in most DI analyses.
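The inclusive approach and the first drawback can be seen on a small toy example. The sketch below uses base R only; the flag names, the outcome column, and the data are hypothetical and are not taken from the package.

```r
# Toy data: 5 students, binary membership flags, and a 1/0 success indicator
toy <- data.frame(
  Flag_Hispanic = c(1, 1, 0, 0, 0),
  Flag_White    = c(1, 0, 1, 1, 0),
  Flag_Asian    = c(0, 0, 0, 1, 1),
  success       = c(1, 0, 1, 1, 0)
)
flag_vars <- c("Flag_Hispanic", "Flag_White", "Flag_Asian")

# Count each student in every group they identify with
group_n     <- colSums(toy[flag_vars])
group_share <- group_n / nrow(toy)

# Success rate among the members of each group
group_rate <- sapply(flag_vars, function(v) mean(toy$success[toy[[v]] == 1]))

data.frame(n = group_n, share = group_share, success_rate = group_rate)

# The shares sum to more than 1 because students 1 and 4 are counted in two groups
sum(group_share)
```

Each group's rate is still computed over real students, but the group sizes can no longer be read as a partition of the cohort.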
In this vignette, we illustrate how the DisImpact package could be adapted to carry out a multi-ethnicity analysis using the di_iterate function as the workhorse, and manipulating the returned summary data set.
Multi-Ethnicity Data Format
In the case of single ethnicity categorization, ethnicity is usually stored in a single variable or column that lists the ethnicity group for each student (row). In the case of multi-ethnicity data, when a student could correspond to multiple groups, there are multiple ways to describe such information. Here, we describe three common approaches:
- A single variable/column that lists all groups that a student (row) belongs to in a single data cell, usually delimited by a comma or some other delimiter. For example, the data cell could have the value Asian, Black, Hispanic if the student falls into these three groups.
- A wide format consisting of the same number of flags as there are groups, where each flag is binary (1/0), indicating group membership in each column. For example, if there are 9 groups, then there would be 9 additional variables/columns with names such as Flag_Group_1, …, Flag_Group_9, where each variable takes a value of 1 or 0, with 1 indicating group membership and 0 indicating non-membership.
- A long format that has each student corresponding to the same number of rows as the number of groups they are members of, with an ethnicity column similar to that of a single ethnicity categorization. For example, if a student self-identifies as Asian, Black, and Hispanic, then the student would have three rows in the data set, with an ethnicity column having the following three values in the three corresponding rows: Asian, Black, Hispanic.
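The three formats carry the same information, so one can be derived from another. The following base R sketch converts a toy delimited column into the long and wide formats; the data, the column names, and the Flag_ prefix are hypothetical.

```r
# Format 1: one delimited cell per student (toy data)
delimited <- data.frame(
  student_id = c(101, 102, 103),
  ethnicity  = c("Asian, Black, Hispanic", "White", "Asian, White"),
  stringsAsFactors = FALSE
)

# Split each cell into a character vector of group names
group_list <- strsplit(delimited$ethnicity, ",\\s*")

# Format 3 (long): one row per student-group pair
long <- data.frame(
  student_id = rep(delimited$student_id, lengths(group_list)),
  ethnicity  = unlist(group_list),
  stringsAsFactors = FALSE
)

# Format 2 (wide): one 1/0 flag column per group
groups <- sort(unique(long$ethnicity))
wide <- data.frame(student_id = delimited$student_id)
for (g in groups) {
  wide[[paste0("Flag_", g)]] <- as.integer(sapply(group_list, function(x) g %in% x))
}

long
wide
```

The long table has one row per membership (six rows here), while the wide table keeps one row per student, which is the layout the rest of this vignette relies on.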
The second, wide format is preferred when it comes to conducting a DI analysis using the DisImpact package. The included student_equity data set consists of ethnicity flags, and these variables will be used in a multi-ethnicity DI analysis.
Multi-Ethnicity Example Data Set
As seen in the Scaling DI vignette, one could repeat DI calculations over various success variables, group (disaggregation) variables, and cohort variables using the di_iterate function. The original intent of the di_iterate function is to take in a student-level data set and output a data set with summary results of disaggregation that could be referenced in a dashboard tool like Tableau or PowerBI. A pre-calculated data set (the output of di_iterate) makes it relatively easy to visualize disaggregation, equity gaps, and disproportionate impact across many outcome variables, cohort variables, disaggregation variables, and scenarios (subsets) by filtering on the appropriate rows (summarized results). The following snippet illustrates this capability using default options:
# Load some necessary packages
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(forcats)
library(DisImpact)
# Load student equity data set
data(student_equity)
# Calculate DI over several scenarios
df_di_summary <- di_iterate(data=student_equity
  , success_vars=c('Math', 'English', 'Transfer')
  , group_vars=c('Ethnicity', 'Gender')
  , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
  , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
)
In addition to the Ethnicity variable, the student_equity data set also contains ethnicity flags that are more granular, based on what students report. A student could be assigned to more than one category. For example, a student could fall into the Asian and South East Asian categories. Similarly, a student could fall into both White and South West Asian / North African (SWANA) categories.
head(student_equity)
## Ethnicity Gender Cohort Transfer Cohort_Math Math Cohort_English
## 1 Native American Female 2017 0 2017 1 2017
## 2 Native American Female 2017 0 2018 1 NA
## 3 Native American Female 2017 0 2018 1 2017
## 4 Native American Male 2017 1 2017 1 2018
## 5 Native American Male 2017 0 2017 1 2019
## 6 Native American Male 2017 1 2019 1 2018
## English Ed_Goal College_Status Student_ID EthnicityFlag_Asian
## 1 0 Deg/Transfer First-time College 100001 0
## 2 NA Deg/Transfer First-time College 100002 0
## 3 0 Deg/Transfer First-time College 100003 0
## 4 1 Other First-time College 100004 0
## 5 0 Deg/Transfer Other 100005 0
## 6 1 Other First-time College 100006 0
## EthnicityFlag_Black EthnicityFlag_Hispanic EthnicityFlag_NativeAmerican
## 1 0 0 1
## 2 0 0 1
## 3 0 0 1
## 4 0 0 1
## 5 0 0 1
## 6 0 0 1
## EthnicityFlag_PacificIslander EthnicityFlag_White EthnicityFlag_Carribean
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## EthnicityFlag_EastAsian EthnicityFlag_SouthEastAsian
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## EthnicityFlag_SouthWestAsianNorthAfrican EthnicityFlag_AANAPI
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
## EthnicityFlag_Unknown EthnicityFlag_TwoorMoreRaces
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## # Correlation to show overlap
## cor(student_equity[, str_detect(names(student_equity), 'EthnicityFlag')])
Analysis of Multi-Ethnicity Data Using DisImpact
For a multi-ethnicity analysis, one could pass a list of ethnicity flags to the group_vars parameter of di_iterate, similar to how Gender and Ethnicity were passed in the previous example to create df_di_summary. However, since the flags are binary (1's and 0's), and the ethnicity group names are in the variable names themselves (e.g., EthnicityFlag_Asian), the user needs to filter on the appropriate rows corresponding to the groups of interest (1 value in the flags), extract the group names, and store the group names in the group column of the returned summary data set. The following code illustrates this with the student_equity data set.
# Identify the ethnicity flag variables
want_vars <- names(student_equity)[str_detect(names(student_equity), '^EthnicityFlag')]
want_vars <- want_vars[!str_detect(want_vars, 'Unknown')] # Remove Unknown
want_vars <- want_vars[!str_detect(want_vars, 'Two')] # Remove Two or More Races
want_vars # Ethnicity Flags of interest
## [1] "EthnicityFlag_Asian"
## [2] "EthnicityFlag_Black"
## [3] "EthnicityFlag_Hispanic"
## [4] "EthnicityFlag_NativeAmerican"
## [5] "EthnicityFlag_PacificIslander"
## [6] "EthnicityFlag_White"
## [7] "EthnicityFlag_Carribean"
## [8] "EthnicityFlag_EastAsian"
## [9] "EthnicityFlag_SouthEastAsian"
## [10] "EthnicityFlag_SouthWestAsianNorthAfrican"
## [11] "EthnicityFlag_AANAPI"
# Number of students
## Total
student_equity %>%
  group_by(Cohort) %>%
  tally
## # A tibble: 2 x 2
## Cohort n
## <int> <int>
## 1 2017 10000
## 2 2018 10000
## Each group
student_equity %>%
  select(Cohort, one_of(want_vars)) %>%
  group_by(Cohort) %>%
  summarize_all(.funs=sum) %>%
  as.data.frame
## Cohort EthnicityFlag_Asian EthnicityFlag_Black EthnicityFlag_Hispanic
## 1 2017 3414 1135 2018
## 2 2018 3444 1129 2029
## EthnicityFlag_NativeAmerican EthnicityFlag_PacificIslander
## 1 244 60
## 2 243 65
## EthnicityFlag_White EthnicityFlag_Carribean EthnicityFlag_EastAsian
## 1 4332 68 2043
## 2 4359 75 2114
## EthnicityFlag_SouthEastAsian EthnicityFlag_SouthWestAsianNorthAfrican
## 1 1439 1164
## 2 1443 1130
## EthnicityFlag_AANAPI
## 1 3714
## 2 3742
## Observation: students can be in more than 1 group
# Convert the ethnicity flags to character as required by di_iterate
for (varname in want_vars) {
  student_equity[[varname]] <- as.character(student_equity[[varname]])
}
# DI analysis
df_di_summary_mult_eth <- di_iterate(data=student_equity
  , success_vars=c('Math', 'English', 'Transfer')
  , group_vars=want_vars # specify the list of ethnicity flag variables
  , cohort_vars=c('Cohort_Math', 'Cohort_English', 'Cohort')
  , scenario_repeat_by_vars=c('Ed_Goal', 'College_Status')
  , di_80_index_reference_groups='all but current'
) %>%
  filter(group=='1') %>% # Ethnicity flags have 1's and 0's; filter on just the 1 group as that is of interest
  # filter((group=='1') | (disaggregation=='- None' & group=='- All')) %>%
  mutate(group=str_replace(disaggregation, 'EthnicityFlag_', '') %>% gsub(pattern='([A-Z])', replacement=' \\1', x=.) %>% str_replace('^ ', '') %>% str_replace('A A N A P I', 'AANAPI') # Rather than show '1', identify the ethnicity group names and assign them to group
    , disaggregation='Multi-Ethnicity' # Originally is a list of variable names corresponding to the various ethnicity flags; call this disaggregation 'Multi-Ethnicity'
  )
# Check if re-assignments are correct
table(df_di_summary_mult_eth$disaggregation, useNA='ifany')
##
## Multi-Ethnicity
## 981
table(df_di_summary_mult_eth$group, useNA='ifany')
##
## AANAPI Asian
## 90 90
## Black Carribean
## 90 87
## East Asian Hispanic
## 90 90
## Native American Pacific Islander
## 90 84
## South East Asian South West Asian North African
## 90 90
## White
## 90
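The group-name cleanup inside the mutate step can be checked in isolation on a few flag names. The sketch below uses base R equivalents (sub and gsub) of the stringr calls, so it runs without any packages; the flag_names vector is just a small sample of the variable names listed earlier.

```r
# A sample of the ethnicity flag variable names
flag_names <- c("EthnicityFlag_Asian", "EthnicityFlag_SouthEastAsian",
                "EthnicityFlag_SouthWestAsianNorthAfrican", "EthnicityFlag_AANAPI")

labels <- sub("^EthnicityFlag_", "", flag_names)             # drop the common prefix
labels <- gsub("([A-Z])", " \\1", labels)                    # insert a space before each capital letter
labels <- sub("^ ", "", labels)                              # drop the leading space this creates
labels <- sub("A A N A P I", "AANAPI", labels, fixed = TRUE) # restore the acronym split by the previous step
labels
```

The last step is needed because the capital-letter rule also splits acronyms; any other all-caps group name would need a similar exception.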
# Illustration: the group proportions add up to more than 100% since a student could be counted in more than 1 group
df_di_summary_mult_eth %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Transfer', cohort=='2018') %>%
  select(group, n) %>%
  mutate(Proportion=n / sum(student_equity$Cohort=='2018')) %>%
  mutate(Sum_Proportion=sum(Proportion))
## # A tibble: 11 x 4
## group n Proportion Sum_Proportion
## <chr> <dbl> <dbl> <dbl>
## 1 Asian 3444 0.344 1.98
## 2 Black 1129 0.113 1.98
## 3 Hispanic 2029 0.203 1.98
## 4 Native American 243 0.0243 1.98
## 5 Pacific Islander 65 0.0065 1.98
## 6 White 4359 0.436 1.98
## 7 Carribean 75 0.0075 1.98
## 8 East Asian 2114 0.211 1.98
## 9 South East Asian 1443 0.144 1.98
## 10 South West Asian North African 1130 0.113 1.98
## 11 AANAPI 3742 0.374 1.98
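Because each flag splits students into members (1) and non-members (0), the 'all but current' reference used in the di_iterate call has a natural reading: members are compared against everyone else. The sketch below shows that comparison on toy data, assuming the 80% index is the ratio of the group's success rate to the reference rate, with ratios below 0.8 flagged as disproportionate impact. This is an assumed reading of the option for illustration, not a reproduction of the package internals, and the data are hypothetical.

```r
# Toy data: membership flag for one ethnicity group and a 1/0 outcome
flag    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
success <- c(1, 0, 0, 1, 1, 1, 1, 0, 1, 1)

rate_group <- mean(success[flag == 1]) # success rate of students in the group
rate_rest  <- mean(success[flag == 0]) # success rate of all other students (assumed reference)

ratio <- rate_group / rate_rest        # assumed 80% index ratio
di_80 <- as.integer(ratio < 0.8)       # 1 = flagged as disproportionately impacted

c(rate_group = rate_group, rate_rest = rate_rest, ratio = ratio, di_80 = di_80)
```

Unlike a highest-performing-group reference, this comparison stays well defined when groups overlap, since a student is either in the flagged group or not.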
Visualizing in Dashboard Platform
Once a DI summary data set for multi-ethnicity is available, it could be combined with other summary data sets to be used in dashboard development as described in the Scaling DI vignette.
# Combine
df_di_summary_combined <- bind_rows(
  df_di_summary # Could first filter on rows of interest (e.g., just the categorizations of interest to the institution)
  , df_di_summary_mult_eth
)
# Disaggregation: Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame
## cohort group n pct di_indicator_ppg
## 1 2017 Asian 1406 0.8968706 0
## 2 2017 Black 421 0.7862233 1
## 3 2017 Hispanic 815 0.7325153 1
## 4 2017 Multi-Ethnicity 211 0.8293839 0
## 5 2017 Native American 45 0.9333333 0
## 6 2017 White 1500 0.8773333 0
## 7 2018 Asian 2212 0.9235986 0
## 8 2018 Black 684 0.7441520 1
## 9 2018 Hispanic 1386 0.7366522 1
## 10 2018 Multi-Ethnicity 369 0.7940379 1
## 11 2018 Native American 68 0.8088235 0
## 12 2018 White 2576 0.8819876 0
## 13 2019 Asian 1429 0.9083275 0
## 14 2019 Black 411 0.7834550 1
## 15 2019 Hispanic 786 0.7404580 1
## 16 2019 Multi-Ethnicity 225 0.8000000 0
## 17 2019 Native American 47 0.8297872 0
## 18 2019 White 1558 0.8896021 0
## 19 2020 Asian 573 0.9301920 0
## 20 2020 Black 180 0.7333333 1
## 21 2020 Hispanic 304 0.7171053 1
## 22 2020 Multi-Ethnicity 99 0.7575758 0
## 23 2020 Native American 14 0.6428571 0
## 24 2020 White 610 0.8819672 0
## di_indicator_prop_index di_indicator_80_index
## 1 0 0
## 2 0 0
## 3 0 1
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 1
## 10 0 0
## 11 0 0
## 12 0 0
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 0
## 18 0 0
## 19 0 0
## 20 0 1
## 21 0 1
## 22 0 0
## 23 1 1
## 24 0 0
# Disaggregation: Multi-Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Multi-Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  as.data.frame
## cohort group n pct di_indicator_ppg
## 1 2017 Asian 1571 0.8873329 0
## 2 2018 Asian 2541 0.9086974 0
## 3 2019 Asian 1622 0.8896424 0
## 4 2020 Asian 655 0.9160305 0
## 5 2017 Black 485 0.7752577 1
## 6 2018 Black 767 0.7470665 1
## 7 2019 Black 464 0.7887931 1
## 8 2020 Black 204 0.7205882 1
## 9 2017 Hispanic 822 0.7347932 1
## 10 2018 Hispanic 1407 0.7391613 1
## 11 2019 Hispanic 794 0.7430730 1
## 12 2020 Hispanic 309 0.7184466 1
## 13 2017 Native American 103 0.8834951 0
## 14 2018 Native American 162 0.7901235 0
## 15 2019 Native American 110 0.8090909 0
## 16 2020 Native American 44 0.7727273 0
## 17 2017 Pacific Islander 25 0.3600000 1
## 18 2018 Pacific Islander 41 0.4390244 1
## 19 2019 Pacific Islander 30 0.3000000 1
## 20 2020 Pacific Islander 16 0.1875000 1
## 21 2017 White 1880 0.8622340 0
## 22 2018 White 3271 0.8550902 0
## 23 2019 White 1960 0.8668367 0
## 24 2020 White 767 0.8578879 0
## 25 2017 Carribean 29 0.8620690 0
## 26 2018 Carribean 47 0.7446809 0
## 27 2019 Carribean 32 0.8437500 0
## 28 2020 Carribean 16 0.7500000 0
## 29 2017 East Asian 904 0.8816372 0
## 30 2018 East Asian 1555 0.9163987 0
## 31 2019 East Asian 1008 0.8898810 0
## 32 2020 East Asian 388 0.9201031 0
## 33 2017 South East Asian 698 0.8982808 0
## 34 2018 South East Asian 1047 0.9006686 0
## 35 2019 South East Asian 665 0.8962406 0
## 36 2020 South East Asian 284 0.9014085 0
## 37 2017 South West Asian North African 507 0.8737673 0
## 38 2018 South West Asian North African 836 0.8720096 0
## 39 2019 South West Asian North African 532 0.8872180 0
## 40 2020 South West Asian North African 213 0.8591549 0
## 41 2017 AANAPI 1699 0.8864038 0
## 42 2018 AANAPI 2740 0.8952555 0
## 43 2019 AANAPI 1758 0.8794084 0
## 44 2020 AANAPI 702 0.8974359 0
## di_indicator_prop_index di_indicator_80_index
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 0
## 10 0 0
## 11 0 0
## 12 0 0
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 1 1
## 18 1 1
## 19 1 1
## 20 1 1
## 21 0 0
## 22 0 0
## 23 0 0
## 24 0 0
## 25 0 0
## 26 0 0
## 27 0 0
## 28 0 0
## 29 0 0
## 30 0 0
## 31 0 0
## 32 0 0
## 33 0 0
## 34 0 0
## 35 0 0
## 36 0 0
## 37 0 0
## 38 0 0
## 39 0 0
## 40 0 0
## 41 0 0
## 42 0 0
## 43 0 0
## 44 0 0
# Disaggregation: Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>%
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  ## geom_point(aes(size=factor(di_indicator_80_index, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02'), name='Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Ethnicity'"))
## Warning: Using size for a discrete variable is not advised.
# Disaggregation: Multi-Ethnicity
df_di_summary_combined %>%
  filter(Ed_Goal=='- All', College_Status=='- All', success_variable=='Math', disaggregation=='Multi-Ethnicity') %>%
  select(cohort, group, n, pct, di_indicator_ppg, di_indicator_prop_index, di_indicator_80_index) %>%
  mutate(group=factor(group) %>% fct_reorder(desc(pct))) %>%
  ggplot(data=., mapping=aes(x=factor(cohort), y=pct, group=group, color=group)) +
  geom_point(aes(size=factor(di_indicator_ppg, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  ## geom_point(aes(size=factor(di_indicator_80_index, levels=c(0, 1), labels=c('Not DI', 'DI')))) +
  geom_line() +
  xlab('Cohort') +
  ylab('Rate') +
  theme_bw() +
  scale_color_manual(values=c('#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99', '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99'), name='Multi-Ethnicity') +
  labs(size='Disproportionate Impact') +
  scale_y_continuous(labels = percent, limits=c(0, 1)) +
  ggtitle('Dashboard drop-down selections:', subtitle=paste0("Ed Goal = '- All' | College Status = '- All' | Outcome = 'Math' | Disaggregation = 'Multi-Ethnicity'"))
## Warning: Using size for a discrete variable is not advised.
Appendix: R and R Package Versions
This vignette was generated using an R session with the following packages. There may be some discrepancies when the reader replicates the code caused by version mismatch.
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.0 scales_1.1.1 ggplot2_3.3.2 stringr_1.4.0
## [5] knitr_1.39 dplyr_1.0.8 DisImpact_0.0.18
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 pillar_1.7.0 bslib_0.3.1 compiler_4.0.2
## [5] jquerylib_0.1.4 highr_0.9 prettydoc_0.4.1 tools_4.0.2
## [9] digest_0.6.25 gtable_0.3.0 jsonlite_1.5 evaluate_0.15
## [13] lifecycle_1.0.1 tibble_3.1.6 fstcore_0.9.12 pkgconfig_2.0.3
## [17] rlang_1.0.1 cli_3.2.0 yaml_2.3.5 parallel_4.0.2
## [21] xfun_0.30 fastmap_1.1.0 withr_2.5.0 generics_0.1.2
## [25] vctrs_0.3.8 sass_0.4.1 grid_4.0.2 tidyselect_1.1.2
## [29] glue_1.6.1 R6_2.3.0 fansi_1.0.2 rmarkdown_2.14
## [33] farver_2.0.3 purrr_0.3.4 tidyr_1.2.0 magrittr_2.0.2
## [37] ellipsis_0.3.2 htmltools_0.5.2 fst_0.9.8 colorspace_1.4-1
## [41] labeling_0.3 utf8_1.2.2 stringi_1.4.6 munsell_0.5.0
## [45] crayon_1.5.0