Estimating SLDAX Models

Kenneth Tyler Wilcox

20 October 2021

Setup Packages and Load Data

library(dplyr)   # For easier data manipulation with pipes `%>%`
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lda)     # Needed if using prep_docs() function
library(psychtm)

data(teacher_rate)  # Synthetic student ratings of instructors

Data Preparation

In this data set, the documents we want to model are stored as character vectors in a column of the data frame called doc. We can get a preview of the data structure to verify this. The documents are synthetic for this example but are generated to have similar word distributions to a real-world data set. Obviously, the synthetic documents are not as easily readable.

glimpse(teacher_rate)
#> Rows: 3,733
#> Columns: 4
#> $ id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
#> $ rating <dbl> 5, 5, 5, 5, 2, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 4, …
#> $ grade  <dbl> 3, 2, 2, 1, 3, 5, 4, 1, 5, 4, 5, 3, 1, 2, 5, 2, 5, 3, 4, 2, 2, …
#> $ doc    <chr> "he is an alright there to understand care to is make project w…

To use the topic modeling functions in psychtm, we need to prepare the text data to create a vector containing the vocabulary of all unique terms used in the documents of interest as well as a matrix of word indices that represent which term in the vocabulary was used in each document (a row) at each word position (a column). This matrix has D rows (assuming we have D documents in our data) and \(\max\{N_d\}\) columns, where \(N_d\) is the number of words in document \(d\), \(d = 1, 2, \ldots, D\); that is, the number of columns in this matrix is equal to the number of words in the longest document in our data. The psychtm package provides a function prep_docs() to convert a column of documents represented as character vectors in a data frame into a documents matrix and a vocab vector, which will be needed to use this package’s topic modeling functionality.

docs_vocab <- prep_docs(teacher_rate, "doc")

We can see that we have 6264 unique terms in the vocabulary. We will save the vocabulary size for later use when modeling.

str(docs_vocab$vocab)
#>  chr [1:6264] "he" "is" "an" "alright" "there" "to" "understand" "care" ...
vocab_len <- length(docs_vocab$vocab)

We can also inspect the structure of the documents matrix created by prep_docs(). Notice that we have 3733 rows — the same as the number of rows in our original data frame teacher_rate — and 73 columns — corresponding to the number of terms or words in the longest document in our data.

str(docs_vocab$documents)
#>  int [1:3733, 1:73] 1 18 42 64 107 18 117 24 1 117 ...

Shorter documents represent unused word positions with a 0. For example, the first document contains only 19 words, so the remaining elements of its row in the documents matrix are all set to 0. The first 25 elements are shown below.

print(docs_vocab$documents[1, 1:25])
#>  [1]  1  2  3  4  5  6  7  8  6  2  9 10 11 12 13 14 15 16 17  0  0  0  0  0  0
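
To see how this padding works across all documents, the length of each document can be recovered by counting the nonzero entries in its row; a minimal sketch (output not shown):

doc_lengths <- rowSums(docs_vocab$documents > 0)  # number of words per document
summary(doc_lengths)                              # e.g., shortest and longest documents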

Nonzero elements are indices into the terms stored in the vocab vector. To demonstrate, we can recover the original document word by word as follows. In practice, one may want to further preprocess or clean the documents before modeling (e.g., remove stop words or perform stemming). We do not demonstrate that here for simplicity, but see, for example, the ebook Welcome to Text Mining with R for examples using the tidytext R package (Silge & Robinson, 2016).

docs_vocab$vocab[docs_vocab$documents[1, 1:17]]
#>  [1] "he"          "is"          "an"          "alright"     "there"      
#>  [6] "to"          "understand"  "care"        "to"          "is"         
#> [11] "make"        "project"     "was"         "between"     "it"         
#> [16] "necessarily" "point"
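
The full text of a document can likewise be reassembled from its nonzero indices; a minimal sketch (output not shown):

idx <- docs_vocab$documents[1, ]                        # word indices for the first document
paste(docs_vocab$vocab[idx[idx > 0]], collapse = " ")   # first document as a single string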

Model Fitting

The Supervised Latent Dirichlet Allocation with Covariates (SLDAX) model, as well as its special cases Supervised Latent Dirichlet Allocation (SLDA) and Latent Dirichlet Allocation (LDA), can be fit using the gibbs_sldax() function. The function uses Gibbs sampling, a Markov chain Monte Carlo algorithm, so the number of iterations m to run the sampling algorithm needs to be specified. It is usually a good idea to specify a “burn-in” period of iterations burn to discard while the algorithm iterates toward a converged solution, so that pre-convergence values are not treated as draws from the posterior distribution we want to sample from. Finally, a thinning period thin can be specified so that only draws separated by the thinning period are kept, resulting in lower autocorrelation among the final posterior samples.

m can be calculated as the desired number of posterior draws T (e.g., 150 for speed in this tutorial; this is generally too low in practice) multiplied by thin plus burn: m = T x thin + burn. Below, we have m = 150 x 1 + 300 = 450. In practice, a longer burn-in period, a larger total number of iterations, and a larger thinning period may be advisable. For any of SLDAX, SLDA, and LDA, the documents matrix prepared above is supplied to the docs argument and the size of the vocabulary calculated earlier is supplied to the V argument. Finally, we specify that we are fitting an LDA model by supplying model = "lda". Be patient, as the algorithm may take a few minutes to complete. Progress can be displayed (optional) using the display_progress argument. Other options for gibbs_sldax(), such as prior specifications, can be found in the documentation by running ?gibbs_sldax.
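
For reference, the total number of iterations used below can be computed directly from these quantities:

n_draws <- 150              # desired number of retained posterior draws T
thin    <- 1                # thinning period
burn    <- 300              # burn-in iterations to discard
m <- n_draws * thin + burn  # 150 * 1 + 300 = 450 total iterations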

Estimating a Latent Dirichlet Allocation Model

Here, we fit an LDA model with K = 3 topics. Because LDA does not include an outcome, no covariate or outcome variables need to be provided. We set a seed for reproducibility.

set.seed(92850827)
fit_lda <- gibbs_sldax(m = 450, burn = 300, thin = 1,
                       docs = docs_vocab$documents,
                       V = vocab_len,
                       K = 3, model = "lda", display_progress = TRUE)

The estimated (posterior mean or median) topic proportions for each document \(\theta_d\) can be obtained using est_theta(). Here I show the estimated topic proportions for Topics 1, 2, and 3 for the first six documents. Note that each row sums to 1 across topics for each document.

theta_hat <- est_theta(fit_lda)
head(theta_hat)
#>           [,1]       [,2]       [,3]
#> [1,] 0.7988248 0.11116770 0.09000745
#> [2,] 0.9080048 0.04194245 0.05005275
#> [3,] 0.8932106 0.06379452 0.04299490
#> [4,] 0.8992168 0.05135308 0.04943008
#> [5,] 0.9279868 0.03686728 0.03514594
#> [6,] 0.9024830 0.05714753 0.04036944
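
As a quick sanity check that the topic proportions in each row sum to 1 (up to numerical tolerance; output not shown):

all.equal(rowSums(theta_hat), rep(1, nrow(theta_hat)))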

Similarly, we can obtain the estimated (posterior mean or median) topic–word probabilities \(\beta_k\) for each topic using est_beta(). Here I show the estimated topic–word probabilities for Topics 1, 2, and 3 for the first ten terms in the vocabulary. Note that each row sums to 1 across terms for each topic.

beta_hat <- est_beta(fit_lda)
colnames(beta_hat) <- docs_vocab$vocab
beta_hat[, 1:10]
#>                he           is           an      alright        there
#> [1,] 0.0201289387 0.0307835332 0.0057009246 1.975882e-05 0.0026515015
#> [2,] 0.0007937285 0.0010155003 0.0006101854 8.618505e-05 0.0007116854
#> [3,] 0.0004666478 0.0004244107 0.0005179860 1.194118e-04 0.0007991100
#>                to   understand         care         make      project
#> [1,] 0.0296174541 0.0022887459 0.0016311004 0.0055352938 0.0006269008
#> [2,] 0.0004114374 0.0002372083 0.0012341072 0.0003099339 0.0002623891
#> [3,] 0.0004771996 0.0002817537 0.0006374755 0.0003966857 0.0002771233

To help interpret the topics, two quantities can be useful and are provided by the get_topwords() function. First, the most probable terms for each topic, taken directly from the estimated \(\beta_k\), can be examined. Because common words such as stop words (e.g., “the”, “and”, “to”) were not removed ahead of analysis in this demo, the topics can be difficult to interpret using the raw probabilities.

get_topwords(beta_ = beta_hat, nwords = 10, docs_vocab$vocab, method = "prob") %>% 
  print(n = 30)
#> # A tibble: 30 × 3
#>    topic word         prob
#>    <chr> <chr>       <dbl>
#>  1 1     and       0.0404 
#>  2 1     the       0.0378 
#>  3 1     is        0.0308 
#>  4 1     to        0.0296 
#>  5 1     you       0.0277 
#>  6 1     a         0.0258 
#>  7 1     he        0.0201 
#>  8 1     i         0.0176 
#>  9 1     class     0.0156 
#> 10 1     of        0.0154 
#> 11 2     she       0.0198 
#> 12 2     professor 0.00599
#> 13 2     her       0.00475
#> 14 2     i         0.00236
#> 15 2     final     0.00227
#> 16 2     keep      0.00220
#> 17 2     offer     0.00197
#> 18 2     highly    0.00189
#> 19 2     science   0.00185
#> 20 2     now       0.00170
#> 21 3     took      0.00341
#> 22 3     her       0.00212
#> 23 3     over      0.00196
#> 24 3     look      0.00191
#> 25 3     this      0.00185
#> 26 3     she       0.00175
#> 27 3     but       0.00159
#> 28 3     dr.       0.00159
#> 29 3     can       0.00151
#> 30 3     our       0.00131

However, another metric, the term score, down-weights words that are highly probable in many topics (e.g., stop words like “and” or “to”). Term scores can be interpreted as a measure of how uniquely representative a term is for a topic: larger term scores denote terms that have a high probability for one topic and lower probabilities for the other topics (i.e., more “unique” to a given topic), whereas smaller term scores denote terms that are probable in multiple topics (i.e., more “common” to all topics). This is the default metric computed by get_topwords(). Inspecting the top 15 terms for each topic below using term scores, it is now clearer that Topic 1 corresponds with descriptions of course professors, whereas Topic 2 corresponds to course assessment methods such as readings and exams.

get_topwords(beta_hat, 15, docs_vocab$vocab, method = "termscore") %>%
  print(n = 30)
#> # A tibble: 45 × 3
#>    topic word       termscore
#>    <chr> <chr>          <dbl>
#>  1 1     and          0.117  
#>  2 1     the          0.114  
#>  3 1     to           0.0830 
#>  4 1     you          0.0803 
#>  5 1     is           0.0790 
#>  6 1     a            0.0673 
#>  7 1     he           0.0470 
#>  8 1     of           0.0362 
#>  9 1     class        0.0343 
#> 10 1     i            0.0271 
#> 11 1     are          0.0217 
#> 12 1     for          0.0195 
#> 13 1     his          0.0195 
#> 14 1     not          0.0189 
#> 15 1     if           0.0183 
#> 16 2     she          0.0182 
#> 17 2     professor    0.00348
#> 18 2     now          0.00272
#> 19 2     keep         0.00227
#> 20 2     science      0.00221
#> 21 2     offer        0.00203
#> 22 2     instructor   0.00201
#> 23 2     final        0.00165
#> 24 2     found        0.00161
#> 25 2     right        0.00154
#> 26 2     day.         0.00142
#> 27 2     print        0.00120
#> 28 2     mrs.         0.00117
#> 29 2     myself       0.00113
#> 30 2     able         0.00112
#> # … with 15 more rows
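
For reference, term scores of this kind can also be computed directly from beta_hat. A minimal sketch, assuming the Blei and Lafferty (2009) term-score definition (the exact formula used internally by get_topwords() may differ):

# Log geometric mean of each term's probability across topics
log_geo_mean <- colMeans(log(beta_hat))
# Down-weight terms that are probable in many topics
term_scores <- beta_hat * sweep(log(beta_hat), 2, log_geo_mean)

Larger values flag terms that are probable within one topic but relatively improbable in the others.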

For models with a larger number of topics than illustrated here, it can be useful to extract the most probable subset of topics for each document using the get_toptopics() function. For example, the two most probable topics for each document can be retrieved. Results for the first three documents are shown.

head(get_toptopics(theta = theta_hat, ntopics = 2))
#> # A tibble: 6 × 3
#>     doc topic   prob
#>   <dbl> <dbl>  <dbl>
#> 1     1     1 0.799 
#> 2     1     2 0.111 
#> 3     2     1 0.908 
#> 4     2     3 0.0501
#> 5     3     1 0.893 
#> 6     3     2 0.0638

Model Goodness of Fit Metrics

Model fit metrics of topic coherence and topic exclusivity can be computed using get_coherence() and get_exclusivity(). This can be useful, for example, when using cross-validation to determine the optimal number of topics. By default, coherence and exclusivity are computed for each topic, but a global measure can be defined, for example, by averaging over topics (a minimal sketch follows the output below).

get_coherence(beta_ = beta_hat, docs = docs_vocab$documents)
#> [1]  -23.06239 -136.66708  -99.65906
get_exclusivity(beta_ = beta_hat)
#> [1] 9.989794 7.714247 4.146112
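
A single global value can be obtained, for example, by averaging the per-topic values (output not shown):

mean(get_coherence(beta_ = beta_hat, docs = docs_vocab$documents))  # global coherence
mean(get_exclusivity(beta_ = beta_hat))                             # global exclusivity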

Estimating a SLDAX Model

To fit an SLDAX model where the latent topics along with covariates are used to model an outcome, we need to further specify the regression model for the covariates of interest (the topics are automatically entered into the model additively). We also need to specify a data frame data containing the covariates and outcome of interest. Be careful to ensure that this is the same set of observations for which documents were provided previously. Missing data imputation methods for missing documents, covariates, or outcomes are currently not implemented. Only complete data can be analyzed. We also change the model to be estimated to model = "sldax".
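
Because only complete cases can be analyzed, it may be worth confirming that the outcome and covariate contain no missing values before fitting; a minimal check (output not shown):

colSums(is.na(teacher_rate[, c("rating", "grade")]))  # should be 0 for both columns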

set.seed(44680835)
fit_sldax <- gibbs_sldax(rating ~ I(grade - 1), data = teacher_rate,
                         m = 450, burn = 300, thin = 1,
                         docs = docs_vocab$documents,
                         V = vocab_len,
                         K = 3, model = "sldax", display_progress = TRUE)

Topic proportions and topic–word probabilities can be summarized with the same functions demonstrated above in the section on LDA. As we did above for the LDA model, we can interpret the three topics from the SLDAX model using term scores.

get_topwords(est_beta(fit_sldax), 15, docs_vocab$vocab, method = "termscore") %>%
  print(n = 30)
#> # A tibble: 45 × 3
#>    topic word       termscore
#>    <chr> <chr>          <dbl>
#>  1 1     and         0.114   
#>  2 1     the         0.107   
#>  3 1     to          0.0877  
#>  4 1     is          0.0744  
#>  5 1     you         0.0743  
#>  6 1     a           0.0681  
#>  7 1     he          0.0487  
#>  8 1     class       0.0365  
#>  9 1     of          0.0350  
#> 10 1     i           0.0262  
#> 11 1     are         0.0204  
#> 12 1     not         0.0201  
#> 13 1     his         0.0189  
#> 14 1     if          0.0186  
#> 15 1     very        0.0181  
#> 16 2     less        0.00190 
#> 17 2     ton         0.00161 
#> 18 2     success.    0.00125 
#> 19 2     guy         0.00118 
#> 20 2     able        0.00110 
#> 21 2     print       0.00105 
#> 22 2     enjoyable   0.00105 
#> 23 2     funny,      0.000992
#> 24 2     pop         0.000983
#> 25 2     instructor  0.000959
#> 26 2     right       0.000927
#> 27 2     caring,     0.000905
#> 28 2     slow        0.000898
#> 29 2     can't       0.000883
#> 30 2     spanish     0.000875
#> # … with 15 more rows

Here we illustrate summarization of the regression relationships in the SLDAX model. The post_regression() function constructs a coda::mcmc object that can be further manipulated by the methods in the coda R package (Plummer et al., 2006), such as summary.mcmc(), as shown below. The burn-in length and thinning period are automatically reflected in these posterior summaries.

The results of summary() provide the posterior mean estimates, corresponding posterior standard deviations, Bayesian credible intervals, and both the naive and autocorrelation-adjusted (time-series) standard errors of the posterior mean estimates. See ?coda:::summary.mcmc for further information.

Here, we see that a 1-unit improvement in a student’s grade is associated with a decrease of roughly 0.26 in their rating of the instructor while holding the topical content of their comments constant. The posterior mean estimates of the regression coefficients for the topics (e.g., topic1) correspond to conditional mean ratings when a student received the lowest possible grade and only that topic was present in their written comment.

To obtain meaningful topic “effects” associated with changing the prevalence of a given topic while holding all others constant, we need to estimate the posterior distribution of a contrast of the topic regression coefficients. The post_regression() function calculates these contrasts at each posterior draw \(t\) as \(c_k^{(t)} = \eta_k^{(t)} - (K - 1)^{-1} \sum_{j \ne k} \eta_j^{(t)}\), where the \(\eta_k\) and \(\eta_j\) are regression coefficients for the topics (not for any of the covariates). These topic effects are labeled effect_t1 and so on. For example, using the posterior means reported below, the effect of Topic 1 is approximately 4.87 - (3.84 + 4.22) / 2 = 0.83, which matches effect_t1. Here, we see using the posterior means and credible intervals that Topic 1 is positively associated with instructor ratings while holding a student’s grade constant, Topic 2 is negatively associated with instructor ratings while holding a student’s grade constant, and Topic 3 does not appear to be associated with instructor ratings while holding a student’s grade constant. In practice, the topic interpretations should be inspected and longer chains should be used along with convergence checks.

eta_post <- post_regression(fit_sldax)
summary(eta_post)
#> 
#> Iterations = 301:450
#> Thinning interval = 1 
#> Number of chains = 1 
#> Sample size per chain = 150 
#> 
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#> 
#>                 Mean       SD  Naive SE Time-series SE
#> I(grade - 1) -0.2624 0.007971 0.0006508      0.0006508
#> topic1        4.8660 0.032105 0.0026213      0.0030664
#> topic2        3.8448 0.237901 0.0194245      0.0337216
#> topic3        4.2241 0.251126 0.0205044      0.0280480
#> effect_t1     0.8315 0.151761 0.0123912      0.0241455
#> effect_t2    -0.7003 0.313717 0.0256149      0.0455078
#> effect_t3    -0.1312 0.320977 0.0262077      0.0449124
#> sigma2        1.1340 0.028963 0.0023649      0.0023649
#> 
#> 2. Quantiles for each variable:
#> 
#>                 2.5%     25%      50%      75%   97.5%
#> I(grade - 1) -0.2765 -0.2680 -0.26234 -0.25762 -0.2474
#> topic1        4.8042  4.8451  4.86125  4.89006  4.9256
#> topic2        3.4282  3.6604  3.82417  4.00860  4.2798
#> topic3        3.7542  4.0594  4.20768  4.41378  4.6294
#> effect_t1     0.5422  0.7124  0.83249  0.93923  1.1372
#> effect_t2    -1.2408 -0.9318 -0.69565 -0.50744 -0.1145
#> effect_t3    -0.7336 -0.3298 -0.09543  0.07387  0.4314
#> sigma2        1.0802  1.1143  1.13087  1.15058  1.1965
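
Because eta_post is a coda::mcmc object, the convergence checks mentioned above can be carried out with standard coda functions; a minimal sketch (output not shown):

coda::effectiveSize(eta_post)  # effective sample size for each parameter
coda::geweke.diag(eta_post)    # Geweke convergence diagnostic
plot(eta_post)                 # trace and density plots for visual inspection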