The goal of wordpiece.data is to provide stable, versioned data for use in the {wordpiece} tokenizer package.
You can install the released version of wordpiece.data from CRAN with:
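install.packages("wordpiece.data")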
And the development version from GitHub with:
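# install.packages("remotes")
# Repository path assumed; adjust if the package lives under a different org.
remotes::install_github("macmillancontentscience/wordpiece.data")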
The datasets included in this package were retrieved from Hugging Face (specifically, the vocabularies for the bert-base-cased and bert-base-uncased BERT models). They were then processed using the {wordpiece} package. This is a bit circular, because this package is a dependency of {wordpiece}. The cased vocabulary was prepared like this:
# Download the cased vocabulary to a temporary file.
vocab_txt <- tempfile(fileext = ".txt")
download.file(
  url = "https://huggingface.co/bert-base-cased/resolve/main/vocab.txt",
  destfile = vocab_txt
)

# Parse the raw token list with {wordpiece}.
parsed_vocab <- wordpiece::load_vocab(vocab_txt)

# Name the output after the vocabulary and its size,
# e.g. "wordpiece_cased_28996.rds".
rds_filename <- paste0(
  paste(
    "wordpiece",
    "cased",
    length(parsed_vocab),
    sep = "_"
  ),
  ".rds"
)

# Save the parsed vocabulary into inst/rds/ and clean up.
saveRDS(parsed_vocab, here::here("inst", "rds", rds_filename))
unlink(vocab_txt)
The uncased vocabulary was prepared the same way:

# Download the uncased vocabulary to a temporary file.
vocab_txt <- tempfile(fileext = ".txt")
download.file(
  url = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",
  destfile = vocab_txt
)

# Parse the raw token list with {wordpiece}.
parsed_vocab <- wordpiece::load_vocab(vocab_txt)

# Name the output after the vocabulary and its size,
# e.g. "wordpiece_uncased_30522.rds".
rds_filename <- paste0(
  paste(
    "wordpiece",
    "uncased",
    length(parsed_vocab),
    sep = "_"
  ),
  ".rds"
)

# Save the parsed vocabulary into inst/rds/ and clean up.
saveRDS(parsed_vocab, here::here("inst", "rds", rds_filename))
unlink(vocab_txt)
You likely won’t ever need to use this package directly. It exists to supply data to {wordpiece}, and it exports wordpiece_vocab() to load that data.
library(wordpiece.data)
head(wordpiece_vocab())
#> [1] "[PAD]" "[unused0]" "[unused1]" "[unused2]" "[unused3]" "[unused4]"
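The cased vocabulary is also bundled. Assuming wordpiece_vocab() takes a cased argument to switch between the two datasets (hypothetical here; check ?wordpiece_vocab for the actual signature), it would be loaded with:

# Hypothetical: select the cased vocabulary instead of the uncased default.
head(wordpiece_vocab(cased = TRUE))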
Please note that the wordpiece.data project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
This is not an officially supported Macmillan Learning product.
Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).