Stoltz, Dustin S., and Marshall A. Taylor. 2019. “Concept Mover’s Distance.” Journal of Computational Social Science 2(2):293-313.
@article{stoltz2019concept,
title={Concept Mover's Distance},
author={Stoltz, Dustin S and Taylor, Marshall A},
journal={Journal of Computational Social Science},
volume={2},
number={2},
pages={293--313},
year={2019},
publisher={Springer}
}
text2map
Install and load the text2map package:
library(text2map)
To use the Concept Mover’s Distance function (CMDist), you will need to
transform your corpus into a document-term matrix (DTM). The preferred
DTM is a sparse matrix in Matrix format (class “dgCMatrix”), as output
by text2vec, quanteda, udpipe, and tidytext’s cast_sparse function, but
we have tried to make the package accommodate DTMs made with the tm
package (technically a “simple_triplet_matrix”) or just a regular old
base R dense matrix (see this for a comparison of DTM creation
methods). You can also use text2map’s super fast unigram DTM builder,
dtm_builder().
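As a rough illustration of what a sparse “dgCMatrix” DTM looks like, the sketch below builds one by hand with the Matrix package. The two-document corpus and the whitespace tokenizer are assumptions made for the example; in practice dtm_builder() or one of the packages named above would do this work.

```r
library(Matrix)

# Toy corpus (hypothetical); the names become document IDs.
docs <- c(d1 = "critical thinking matters", d2 = "thinking about thinking")

# Naive whitespace tokenization, for illustration only.
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Row index = document, column index = term.
# sparseMatrix() sums repeated (i, j) pairs, which yields term counts.
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)
toy.dtm <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                        dims = c(length(docs), length(vocab)),
                        dimnames = list(names(docs), vocab))

class(toy.dtm)
```

Rows are documents and columns are terms, which is the orientation the rest of this walkthrough assumes.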
You will also need a matrix of word embedding vectors (with the
“words” as rownames), and ultimately, CMDist is only as good as the
word embeddings used. Word embedding vectors can be from a pre-trained
source, for example:
It will take a little data wrangling to get these loaded as a matrix
in R with rownames as the words (feel free to contact us if you run
into any issues loading embeddings into R), or you can just download
these R-ready fastText English Word Vectors trained on the Common Crawl
(crawl-300d-2M.vec) hosted on Google Drive:
You can also download the file directly into your R session using the following:
library(googledrive) # (see https://googledrive.tidyverse.org/)
temp <- tempfile()
drive_download(as_id("17H4GOGedeGo0urQdDC-4e5qWQMeWLpGG"), path = temp, overwrite = TRUE)
my.wv <- readRDS(temp)
# save them to your project file so you don't have to re-download
saveRDS(my.wv, "data/ft.cc.en.300D.2M.Rds")
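If you start from a plain-text embeddings file instead, the data wrangling mentioned earlier might look roughly like the sketch below. It assumes the common .vec layout (a header line with the vocabulary size and dimensions, then one word followed by its values per line) and writes a tiny hypothetical file so the example is self-contained. A real 2M-word file would need a faster reader and more care with unusual tokens.

```r
# Write a tiny, hypothetical .vec file: header line, then word + values.
vec_path <- tempfile(fileext = ".vec")
writeLines(c("3 4",
             "thinking 0.1 0.2 0.3 0.4",
             "critical 0.5 0.1 0.0 0.2",
             "thought 0.2 0.2 0.3 0.5"),
           vec_path)

# Skip the header; disable quote and comment handling so tokens such as
# ' or # are read as ordinary words.
raw <- read.table(vec_path, skip = 1, quote = "", comment.char = "",
                  stringsAsFactors = FALSE)

# First column holds the words; the rest are the vector values.
toy.wv <- as.matrix(raw[, -1])
rownames(toy.wv) <- raw[[1]]
colnames(toy.wv) <- NULL

dim(toy.wv)
```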
Instead of pretrained embeddings, you can create your own embeddings
trained locally on the corpus on which you are using CMDist. For
example, the text2vec R package provides a nice workflow for using the
GloVe method to train embeddings. As we discuss in our original paper,
the choice between pre-trained and locally trained embeddings is
understudied in social-scientific applications. Pretrained embeddings
are a simple way to get an analysis started, but researchers should
always be critical about the corpus used to train these models.
One important caveat: the terms used to denote a concept or build a semantic direction need not be in the corpus, but they must be in the word embeddings matrix. If they are not, the function will stop and let you know. This means, obviously, that corpus-trained embeddings cannot be used with words not in the corpus (pre-trained embeddings must be used instead). As the fastText method can be trained on “subword” character strings, it is possible to average the character strings that make up an out-of-vocabulary term.
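Before calling CMDist, it can save time to check which anchor terms are actually present in the embedding matrix. A minimal sketch, using a small hypothetical matrix in place of my.wv:

```r
# Hypothetical stand-in for my.wv: rows are words, columns are dimensions.
toy.wv <- rbind(thinking = c(0.1, 0.2, 0.3),
                critical = c(0.5, 0.1, 0.0),
                thought  = c(0.2, 0.2, 0.3))

# Candidate anchor terms; "reasoning" is deliberately absent from toy.wv.
anchors <- c("critical", "thinking", "reasoning")

# Terms with no matching rowname cannot be used as concept words.
missing_terms <- setdiff(anchors, rownames(toy.wv))
missing_terms
```

An empty result means every anchor has a vector. Otherwise the missing terms need to be replaced, or a different set of embeddings used.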
The most difficult and important part of using Concept Mover’s
Distance is selecting anchor terms. This should be driven by (a) theory,
(b) prior literature, (c) domain knowledge, and (d) the word embedding
space. One way of double-checking that selected terms are appropriate is
to look at the terms’ nearest neighbors. Here we use the sim2 function
from text2vec to get the cosine similarity between “thinking” and its
top 10 nearest neighbors.
library(text2vec)
cos_sim <- sim2(x = my.wv,
                y = my.wv["thinking", , drop = FALSE],
                method = "cosine")
head(sort(cos_sim[,1], decreasing = TRUE), 10)
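To make clear what the numbers returned above represent, here is a base R sketch of the same cosine computation on a tiny hypothetical embedding matrix. Values near 1 indicate close neighbors, and negative values point in roughly opposite directions. It illustrates the measure and is not how sim2 is implemented.

```r
# Hypothetical three-word embedding matrix.
toy.wv <- rbind(thinking = c(1, 0, 1),
                thought  = c(1, 0.2, 0.9),
                banana   = c(-1, 1, 0))

# Scale each row to unit length; cosine similarity is then a dot product.
unit <- toy.wv / sqrt(rowSums(toy.wv^2))
cos_to_thinking <- drop(unit %*% unit["thinking", ])

# Most similar first; a word is always most similar to itself.
sort(cos_to_thinking, decreasing = TRUE)
```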
Once you have a DTM and word embedding matrix, the simplest use of
CMDist involves finding the closeness to a focal concept denoted by a
single word. Here, we use the word “thinking” as our concept word
(“cw”):
doc_closeness <- CMDist(dtm = my.dtm, cw = "thinking", wv = my.wv)
However, we may be interested in specifying the concept with additional words: for example, we might want to capture “critical thinking.” To handle this, it is as simple as including both words separated by a space in the quotes. This creates a pseudo-document that contains only “critical” and “thinking.”
doc_closeness <- CMDist(dtm = my.dtm, cw = "critical thinking", wv = my.wv)
What if instead of a compound concept we are interested in a common
concept represented with two words (i.e., a bigram)? First, just like
with any other word, the ngram must be in the embedding matrix (the
fastText pre-trained embeddings have a lot of n≥1 grams). As long as the
ngram is in the embeddings, the only difference for CMDist is that an
underscore rather than a space needs to be placed between words:
doc_closeness <- CMDist(dtm = my.dtm, cw = "critical_thinking", wv = my.wv)
In the original JCSS paper, we discussed the “binary concept problem,” where documents that are close to a concept with a binary opposite will likely also be close to both opposing poles. For example, if a document is close to “love” it will also be close to “hate.” But very often an analyst will want to know whether a document is close to one pole or the other of this binary concept. To deal with this, we incorporate insights from Kozlowski et al.’s (2019) paper “Geometry of Culture” to define “semantic directions” within word embeddings – i.e., pointing toward one pole and away from the other. We outline this procedure in more detail in a subsequent JCSS paper, “Integrating Semantic Directions with Concept Mover’s Distance to Measure Binary Concept Engagement”.
The procedure involves generating a list of antonym pairs for a given
binary concept – that is, terms which may occur in similar contexts but
are typically juxtaposed. Then we take the differences between these
antonyms’ respective vectors and average the result (with
get_direction()). The resulting vector will be the location of one
“pole” of this binary concept, and CMDist calculates the distance each
document is from this concept vector (“cv”). As this creates a
one-dimensional continuum, documents that are far from this “pole” will
be close to the opposing “pole.”
# first build the semantic direction:
additions <- c("death", "casualty", "demise", "dying", "fatality")
substracts <- c("life", "survivor", "birth", "living", "endure")
pairs <- cbind(additions, substracts)
death.sd <- get_direction(pairs, my.wv)
# input it into the function just like a concept word:
doc_closeness <- CMDist(dtm = my.dtm, cv = death.sd, wv = my.wv)
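The “difference, then average” step described above can be sketched in base R as follows. The two-dimensional vectors are invented for the example, and get_direction() may apply further steps (such as normalization) that this sketch leaves out.

```r
# Hypothetical 2-d embeddings for two antonym pairs.
toy.wv <- rbind(death  = c(1.0, 0.0),
                dying  = c(0.8, 0.2),
                life   = c(-1.0, 0.0),
                living = c(-0.6, 0.2))

additions  <- c("death", "dying")
substracts <- c("life", "living")

# Pairwise differences: death - life, dying - living.
diffs <- toy.wv[additions, , drop = FALSE] - toy.wv[substracts, , drop = FALSE]

# Averaging the differences gives one vector pointing toward the "death" pole.
toy.sd <- colMeans(diffs)
toy.sd
```

Swapping the two columns of anchor terms flips the sign of the result, which is why the order of the pairs determines which pole the direction points toward.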
Instead of building a pseudo-document with several terms, as in the
compound concept section above, one could also average several terms
together to arrive at a “centroid” in the word embedding space – this
can be understood as referring to the center of a “semantic region.”
Again, CMDist will calculate the distance each document is from this
concept vector (“cv”).
# first build the semantic centroid:
terms <- c("death", "casualty", "demise", "dying", "fatality")
death.sc <- get_centroid(terms, my.wv)
# input it into the function just like a concept word:
doc_closeness <- CMDist(dtm = my.dtm, cv = death.sc, wv = my.wv)
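Conceptually, the centroid is just the column-wise mean of the selected term vectors. A base R sketch with invented vectors (get_centroid() may differ in details such as how it handles missing terms):

```r
# Hypothetical 2-d embeddings for three related terms.
toy.wv <- rbind(death  = c(1.0, 0.0),
                dying  = c(0.8, 0.2),
                demise = c(0.9, 0.1))

terms <- c("death", "dying", "demise")

# The centroid is the mean of the term vectors, dimension by dimension.
toy.sc <- colMeans(toy.wv[terms, , drop = FALSE])
toy.sc
```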
An analysis might suggest multiple concepts are of interest, and so
it is useful to get multiple distances at the same time. This, in
effect, is adding more rows to our pseudo-document-term matrix. For
example, in our JCSS paper, we compare how Shakespeare’s plays
engage with “death” against 200 other concepts. CMDist can also
incorporate both concept words and concept vectors into the same run.
# example 1
concept.words <- c("thinking", "critical", "thought")
doc_closeness <- CMDist(dtm = my.dtm, cw = concept.words, wv = my.wv)
# example 2
concept.words <- c("critical thought", "critical_thinking")
doc_closeness <- CMDist(dtm = my.dtm, cw = concept.words, wv = my.wv)
# example 3
concept.words <- c("critical thought", "critical_thinking")
# concept.vectors is not defined in this section; here we assume it holds
# a concept vector built earlier, e.g., the semantic direction death.sd
concept.vectors <- death.sd
doc_closeness <- CMDist(dtm = my.dtm, cw = concept.words, cv = concept.vectors, wv = my.wv)
Calculating CMD (as of version 0.5.0) relies on Linear Complexity RWMD,
or LC-RWMD. The performance gain over the previous RWMD implementation
is considerable. Nevertheless, it may be desirable to boost performance
further on especially large projects.
One way to reduce complexity and thus time (without a noticeable drop
in accuracy) is by removing very sparse terms in your DTM. Parallelizing
is another option, so we built it in. To use parallel calculations, just
set parallel = TRUE. The default number of threads is 2, but you can set
it as high as you have threads/cores.
doc_closeness <- CMDist(dtm = my.dtm,
                        cw = "critical_thinking",
                        wv = my.wv,
                        parallel = TRUE,
                        threads = 2)
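The other option mentioned above, removing very sparse terms, can be sketched with the Matrix package. The toy DTM and the threshold of two documents are assumptions for illustration; a sensible cutoff depends on the corpus.

```r
library(Matrix)

# Hypothetical DTM: 4 documents (rows) by 3 terms (columns).
toy.dtm <- Matrix(c(1, 0, 2,
                    1, 1, 0,
                    3, 0, 0,
                    2, 0, 1),
                  nrow = 4, byrow = TRUE, sparse = TRUE,
                  dimnames = list(paste0("d", 1:4), c("the", "rare", "odd")))

# Document frequency: in how many documents does each term appear?
doc_freq <- colSums(toy.dtm > 0)

# Keep only terms that occur in at least `min_docs` documents.
min_docs <- 2
trimmed.dtm <- toy.dtm[, doc_freq >= min_docs, drop = FALSE]
colnames(trimmed.dtm)
```

Fewer columns means fewer word vectors enter the distance computation, which is where the time savings would come from.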
By default, the outputs are normalized using the scale() function in R.
If this is not desired, set scale = FALSE.
doc_closeness <- CMDist(dtm = my.dtm, cw = "thinking", wv = my.wv, scale = FALSE)
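For readers unfamiliar with scale(), the short sketch below shows what it does to a vector of scores: it subtracts the mean and divides by the standard deviation. The raw values are made up for the example.

```r
# Hypothetical unscaled closeness scores for three documents.
raw_closeness <- c(d1 = 0.42, d2 = 0.55, d3 = 0.47)

# scale() centers on the mean and divides by the standard deviation.
z_builtin <- as.numeric(scale(raw_closeness))
z_manual  <- (raw_closeness - mean(raw_closeness)) / sd(raw_closeness)

all.equal(z_builtin, unname(z_manual))
```

Scaled scores are therefore relative to the other documents in the same run. Unscaled scores are needed when values must be compared across separate runs.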
CMDist relies on text2vec, which currently uses cosine as the distance
metric to solve the underlying transportation problem. The Kusner et al.
(2015) and Atasu et al. (2017) papers used Euclidean distance to compare
word embedding vectors. Cosine and Euclidean distances are closely
related (for vectors normalized to unit length, squared Euclidean
distance is a monotonic function of cosine distance, so the two rank
neighbors identically). In all of our comparisons of Concept Mover’s
Distance using Euclidean and cosine, the results were nearly identical.
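The relationship between the two metrics can be checked numerically. For unit-length vectors, squared Euclidean distance equals twice the cosine distance. A small sketch with arbitrary vectors:

```r
# Two arbitrary vectors, scaled to unit length.
a <- c(1, 2, 3)
b <- c(2, 0, 1)
a_u <- a / sqrt(sum(a^2))
b_u <- b / sqrt(sum(b^2))

# Cosine distance and squared Euclidean distance on the unit vectors.
cos_dist  <- 1 - sum(a_u * b_u)
sq_euclid <- sum((a_u - b_u)^2)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
all.equal(sq_euclid, 2 * cos_dist)
```

This holds only after normalizing to unit length. With raw vectors of differing magnitude, the two metrics can order neighbors differently.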
Rather than the Relaxed Word Mover’s Distance (RWMD) as discussed in
Kusner et al.’s (2015) “From Word Embeddings To Document Distances”,
text2vec uses the Linear-Complexity Relaxed Word Mover’s Distance
(LC-RWMD) as described by Atasu et al. (2017). LC-RWMD reduces the
computational demands considerably. In addition, if we take the maximum
of each document pair across the upper and lower triangles (i.e., the
maximum of each [i,j]-[j,i] pair), the resulting symmetric distance
matrix is precisely the same as the one produced by the previous
algorithm using Kusner et al.’s approach. As far as our tests have
shown, the triangle corresponding to treating the pseudo-document as a
“query” and the documents as a “collection” (see the text2vec
documentation, p. 28) will always return this set of maximums, and
therefore there is no reason for the additional comparison step. If you
have reason to believe this is not the case, please contact us and we
will share the code necessary to reproduce results with the original
Kusner et al. approach to check.
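To give a feel for the two relaxations being compared, here is a toy base R sketch of RWMD between one document and a one-word pseudo-document, using cosine distance. The vectors and counts are invented, and this is an illustration of the general RWMD idea from Kusner et al. (2015), not text2vec's LC-RWMD code.

```r
# Hypothetical 2-d embeddings.
toy.wv <- rbind(death  = c(1.0, 0.0),
                dying  = c(0.9, 0.1),
                life   = c(-1.0, 0.2),
                garden = c(0.0, 1.0))
unit <- toy.wv / sqrt(rowSums(toy.wv^2))
cost <- 1 - unit %*% t(unit)          # cosine distance between word pairs

doc_counts <- c(dying = 2, garden = 1, life = 1)  # one toy document
concept    <- c(death = 1)                        # one-word pseudo-document

# Relative frequencies act as the "mass" to be moved.
d <- doc_counts / sum(doc_counts)
q <- concept / sum(concept)
C <- cost[names(d), names(q), drop = FALSE]

# Relaxation 1: each document word moves to its nearest concept word.
doc_to_concept <- sum(d * apply(C, 1, min))
# Relaxation 2: each concept word moves to its nearest document word.
concept_to_doc <- sum(q * apply(C, 2, min))

# RWMD takes the larger (tighter) of the two lower bounds.
rwmd <- max(doc_to_concept, concept_to_doc)
c(doc_to_concept = doc_to_concept, concept_to_doc = concept_to_doc, rwmd = rwmd)
```

In this toy case the pseudo-document has a single word, so the first relaxation is a weighted average of distances and the second is their minimum. The first can never be the smaller of the two, which shows in miniature why one direction can supply the maximum.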
For more discussion of the math behind Concept Mover’s Distance see Stoltz and Taylor (2019) “Concept Mover’s Distance” in the Journal of Computational Social Science.