Word Embeddings: Defaults and Specifications

A word embedding comprises numeric values that represent the latent meaning of a word. These numbers can be seen as coordinates in a space of several hundred dimensions. The more similar two words’ embeddings are, the closer the words are positioned in this embedding space, and thus, the more similar the words are in meaning. Hence, embeddings reflect the relationships among words, where proximity in the embedding space represents similarity in latent meaning. The text-package uses already existing language models to map text data to high-quality word embeddings.

To represent several words, sentences, or paragraphs, the word embeddings of single words can be combined or aggregated into one word embedding. This can be achieved, for example, by taking the mean, minimum, or maximum value of each dimension across the embeddings.
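As a minimal illustration of this kind of aggregation (a base-R sketch with made-up values; real embeddings have several hundred dimensions), the dimension-wise mean, minimum, or maximum of two word embeddings can be computed as follows:

# Two made-up 4-dimensional word embeddings (for illustration only)
embedding_word1 <- c(0.20, -0.31, 0.05, 0.72)
embedding_word2 <- c(0.10, -0.15, 0.40, 0.68)
embeddings <- rbind(embedding_word1, embedding_word2)

# Aggregate into one embedding by combining each dimension across the words
aggregated_mean <- colMeans(embeddings)      # dimension-wise mean
aggregated_min  <- apply(embeddings, 2, min) # dimension-wise minimum
aggregated_max  <- apply(embeddings, 2, max) # dimension-wise maximum
aggregated_mean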

This tutorial focuses on how to retrieve layers and how to aggregate them to obtain word embeddings in text. The focus is on the actual functions.

For more detailed information about word embeddings and the language models in regard to the text-package, please see text: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning; for more comprehensive information about the inner workings of the language models, see for example Illustrated BERT or the references given in Table 1.

Table 1 shows some of the more common language models; for more detailed information see HuggingFace.

Models                         | References          | Layers | Dimensions | Language
'bert-base-uncased'            | Devlin et al., 2019 | 12     | 768        | English
'roberta-base'                 | Liu et al., 2019    | 12     | 768        | English
'distilbert-base-cased'        | Sanh et al., 2019   | 6      | 768        | English
'bert-base-multilingual-cased' | Devlin et al., 2019 | 12     | 768        | 104 top languages at Wikipedia
'xlm-roberta-large'            | Liu et al., 2019    | 24     | 1024       | 100 languages

textEmbed: Reflecting standards and state-of-the-art

The main function for transforming text to word embeddings is textEmbed(). First, provide a tibble containing the text variable(s) that you want to transform (note that it is OK to submit other variables too; the function will only use the character variables). Second, set the language model; choosing one of the options listed for the model argument ensures that you use a model that has been tested with text.

Setting the advanced options pretrained_weights (e.g., pretrained_weights = 'bert-base-uncased'), tokenizer_class (e.g., tokenizer_class = BertTokenizer) and model_class (e.g., model_class = BertModel; together with model = NULL) allows you to specify a model directly through the HuggingFace interface. Make sure that pretrained_weights, tokenizer_class, and model_class fit together (otherwise you will get an error).
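For example (a brief sketch; whether these advanced arguments are available depends on your version of the text-package, and the tokenizer and model classes must match the chosen weights):

# Specify a model directly through the HuggingFace interface (sketch)
# wordembeddings_hf <- textEmbed(x = Language_based_assessment_data_8,
#                                model = NULL,
#                                pretrained_weights = 'bert-base-uncased',
#                                tokenizer_class = BertTokenizer,
#                                model_class = BertModel)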

Third, decide whether you want contextualized and/or decontextualized word embeddings by setting the contexts and decontexts parameters to TRUE/FALSE. Contextualized word embeddings are standard and return word embeddings that take into account the context in which the word was used; decontextualized word embeddings do not take into account the context in which the word was used (and are used in the plot functions).

Last, select which layers you want to use and how you want to aggregate them.

library(text)

# Transform the text data to BERT word embeddings
wordembeddings <- textEmbed(x = Language_based_assessment_data_8,
                            model = 'bert-base-uncased',
                            contexts = TRUE,
                            layers = 11:12,
                            context_aggregation = "mean",
                            decontexts = TRUE,
                            decontext_layers = 11:12,
                            decontext_aggregation = "mean")

# Save the word embeddings to avoid having to embed the texts again every time
# saveRDS(wordembeddings, "_YOURPATH_/wordembeddings.rds")
# Get the word embeddings again
# wordembeddings <- readRDS("_YOURPATH_/wordembeddings.rds")

# See how word embeddings are structured
wordembeddings

The textEmbed() function is suitable when you are just interested in getting good word embeddings to test a research hypothesis with. That is, the defaults are based on general experience of what works. Under the hood, textEmbed() uses one function for retrieving the layers (textEmbedLayersOutput()) and another function for aggregating them (textEmbedLayerAggregation()). So, if you are interested in examining different layers and different aggregation methods, it is better to split up the workflow so that you first retrieve all layers (which takes most of the time) and then test different aggregation methods.

textEmbedLayersOutput: Get tokens and all the layers

The textEmbedLayersOutput function is used to retrieve the layers of hidden states.

library(text)

# Transform the text data to BERT word embeddings

x <- Language_based_assessment_data_8[1:2, 1:2]
 
wordembeddings_tokens_layers <- textEmbedLayersOutput(x,
                                                contexts = TRUE,
                                                decontexts = FALSE,
                                                model = 'bert-base-uncased',
                                                layers = 'all',
                                                return_tokens = TRUE)
wordembeddings_tokens_layers

textEmbedLayerAggregation: Testing different layers

The output from the textEmbedLayerAggregation() function is the same as that from textEmbed(); but now you have the possibility to test different ways of aggregating the layers without having to retrieve them from the language model again. In textEmbedLayerAggregation(), you can select any combination of layers that you want to aggregate, and then aggregate them using the mean, minimum, or maximum value of each dimension.

library(text)

# Aggregate layers 11 and 12 by taking the mean of each dimension.
we_11_12_mean <- textEmbedLayerAggregation(word_embeddings_layers = wordembeddings_tokens_layers,
                                           layers = 11:12,
                                           aggregation = "mean")

# Aggregate layers 11 and 12 by taking the minimum of each dimension across the two layers.
we_11_12_min <- textEmbedLayerAggregation(word_embeddings_layers = wordembeddings_tokens_layers,
                                          layers = 11:12,
                                          aggregation = "min")

# Aggregate layers 1 to 12 by taking the maximum value of each dimension across the 12 layers.
we_1_12_max <- textEmbedLayerAggregation(word_embeddings_layers = wordembeddings_tokens_layers,
                                         layers = 1:12,
                                         aggregation = "max")
we_1_12_max

Now the word embeddings are ready to be used in downstream tasks, such as predicting numeric variables or plotting the words according to different dimensions.
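For instance (a brief sketch using the textTrain() function of the text-package; it assumes that the example data contain a text variable called harmonytexts and a numeric rating scale called hilstotal, and that the corresponding word embeddings are stored under the same name):

# Downstream task: predict a numeric rating scale from the word embeddings (sketch)
# hils_model <- textTrain(x = wordembeddings$harmonytexts,
#                         y = Language_based_assessment_data_8$hilstotal)
# hils_model$results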