The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.
There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the sources and instruments underlying the data. Advanced functionality lets you get estimates of words’ age of acquisition and word mappings across languages.
The get_administration_data() function gives by-administration information, either for a specific language and/or form or for all instruments.
## # A tibble: 5,520 x 15
## data_id age comprehension production language form birth_order ethnicity
## <dbl> <int> <int> <int> <chr> <chr> <fct> <fct>
## 1 129242 27 497 497 English… WS Fourth Hispanic
## 2 129243 21 369 369 English… WS Second White
## 3 129244 26 190 190 English… WS Fourth White
## 4 129245 27 264 264 English… WS Second White
## 5 129246 19 159 159 English… WS Second Other
## 6 129247 30 513 513 English… WS Second Other
## 7 129248 25 444 444 English… WS Second Other
## 8 129249 24 582 582 English… WS Second White
## 9 129250 28 558 558 English… WS Second Black
## 10 129251 18 7 7 English… WS Fourth Other
## # … with 5,510 more rows, and 7 more variables: sex <fct>, zygosity <chr>,
## # norming <lgl>, mom_ed <fct>, longitudinal <lgl>, source_name <chr>,
## # license <chr>
## # A tibble: 82,055 x 15
## data_id age comprehension production language form birth_order ethnicity
## <dbl> <int> <int> <int> <chr> <chr> <fct> <fct>
## 1 29821 13 293 88 Croatian WG <NA> <NA>
## 2 29822 16 122 12 Croatian WG <NA> <NA>
## 3 29823 9 3 0 Croatian WG <NA> <NA>
## 4 29824 12 0 0 Croatian WG <NA> <NA>
## 5 29825 12 44 0 Croatian WG <NA> <NA>
## 6 29826 8 14 5 Croatian WG <NA> <NA>
## 7 29827 9 2 1 Croatian WG <NA> <NA>
## 8 29828 10 44 1 Croatian WG <NA> <NA>
## 9 29829 13 172 51 Croatian WG <NA> <NA>
## 10 29830 16 241 68 Croatian WG <NA> <NA>
## # … with 82,045 more rows, and 7 more variables: sex <fct>, zygosity <chr>,
## # norming <lgl>, mom_ed <fct>, longitudinal <lgl>, source_name <chr>,
## # license <chr>
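The two printouts above appear without the calls that produced them. Judging from their row counts and languages, they presumably come from calling get_administration_data() once for English (American) WS and once with no arguments; this is an inference, not something the text states. The sketch below records those presumed calls as comments and then works on a small stand-in data frame, copied from the first five printed rows, to show a typical next step. The variable names and the use of base R are illustrative choices.

```r
# Presumed calls behind the two printouts (they need wordbankr and a database
# connection, so they are left as comments here):
# english_ws_admins <- wordbankr::get_administration_data("English (American)", "WS")
# all_admins <- wordbankr::get_administration_data()

# Stand-in built from the first five rows of the English (American) WS printout
english_ws_admins <- data.frame(
  data_id = c(129242, 129243, 129244, 129245, 129246),
  age = c(27L, 21L, 26L, 27L, 19L),
  production = c(497L, 369L, 190L, 264L, 159L)
)

# Median productive vocabulary size for each age (in months)
median_by_age <- aggregate(production ~ age, data = english_ws_admins, FUN = median)
median_by_age
```

With the full data frame in place of the stand-in, the same aggregate() call would give one median per age in the sample.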
The get_item_data() function gives by-item information, either for a specific language and/or form or for all instruments.
## # A tibble: 505 x 11
## item_id definition language form type category lexical_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 item_1 Risponde … Italian WG firs… <NA> <NA>
## 2 item_2 Risponde … Italian WG firs… <NA> <NA>
## 3 item_3 Reagisce … Italian WG firs… <NA> <NA>
## 4 item_4 Vuoi la p… Italian WG phra… <NA> <NA>
## 5 item_5 Hai sonno… Italian WG phra… <NA> <NA>
## 6 item_6 Vuoi bere? Italian WG phra… <NA> <NA>
## 7 item_7 Stai atte… Italian WG phra… <NA> <NA>
## 8 item_8 Stai buono Italian WG phra… <NA> <NA>
## 9 item_9 Batti le … Italian WG phra… <NA> <NA>
## 10 item_10 Cambiamo … Italian WG phra… <NA> <NA>
## # … with 495 more rows, and 4 more variables: lexical_class <chr>,
## # uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
## # A tibble: 31,811 x 11
## item_id definition language form type category lexical_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 item_81 gristi Croatian WG word action_… predicates
## 2 item_2… puhati Croatian WG word action_… predicates
## 3 item_2… razbiti Croatian WG word action_… predicates
## 4 item_64 donijeti Croatian WG word action_… predicates
## 5 item_1… kupiti Croatian WG word action_… predicates
## 6 item_36 čistiti Croatian WG word action_… predicates
## 7 item_3… zatvoriti Croatian WG word action_… predicates
## 8 item_2… plakati Croatian WG word action_… predicates
## 9 item_2… plesati Croatian WG word action_… predicates
## 10 item_42 crtati Croatian WG word action_… predicates
## # … with 31,801 more rows, and 4 more variables: lexical_class <chr>,
## # uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
If you are only looking at total vocabulary size, the administration data is all you need, since it includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data() function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).
get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
## # A tibble: 11,692 x 3
## data_id value num_item_id
## <dbl> <chr> <dbl>
## 1 129242 "produces" 26
## 2 129243 "produces" 26
## 3 129244 "produces" 26
## 4 129245 "produces" 26
## 5 129246 "" 26
## 6 129247 "produces" 26
## 7 129248 "produces" 26
## 8 129249 "produces" 26
## 9 129250 "produces" 26
## 10 129251 "" 26
## # … with 11,682 more rows
By default, get_instrument_data() returns a data frame with columns for the administration’s data_id, the item’s num_item_id (numerical item_id), and the corresponding value. To include administration information, set the administrations argument to TRUE, or pass it the result of get_administration_data() (which prevents the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data().
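Conceptually, adding administration and item information amounts to joining the instrument data to the other two tables on data_id and num_item_id. The sketch below illustrates that idea in base R on stand-in rows taken from the printouts above; it is an assumption about the shape of the result, not a description of the package's internals. The object names are hypothetical.

```r
# Reuse pattern described in the text (needs wordbankr and a connection):
# admins <- wordbankr::get_administration_data("English (American)", "WS")
# wordbankr::get_instrument_data("English (American)", "WS",
#                                items = c("item_26", "item_46"),
#                                administrations = admins)

# Stand-in rows copied from the printed output
instrument_data <- data.frame(
  data_id = c(129242, 129246),
  value = c("produces", ""),
  num_item_id = c(26, 26)
)
admins <- data.frame(data_id = c(129242, 129246), age = c(27L, 19L))
items <- data.frame(item_id = c("item_26", "item_46"))

# num_item_id is the numeric part of item_id
items$num_item_id <- as.numeric(sub("^item_", "", items$item_id))

# Join administration columns on data_id, then item columns on num_item_id
with_admins <- merge(instrument_data, admins, by = "data_id")
with_items <- merge(with_admins, items, by = "num_item_id")
with_items
```

Loading the administration data once and passing it in, as in the commented calls, avoids repeating that download for every get_instrument_data() call.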
Loading the data is fast if you need only a handful of items, but the loading time scales roughly linearly with the number of items and can get quite slow if you need many or all of them. It is therefore a good idea to filter down to only the items you need before calling get_instrument_data().
As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:
library(wordbankr)
library(dplyr)  # provides %>% and filter()
animals <- get_item_data(language = "English (American)", form = "WS") %>%
  filter(category == "animals")
Then we get the instrument data for those items:
animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administrations = TRUE)
Finally, we calculate how many animal words each child produces and the median number of animal words in each age bin:
animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  # number of animal words produced by each child
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  # median across children of the same age
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))
library(ggplot2)
ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median number of animal words produced")
The get_instruments() function gives information on all the CDI instruments in Wordbank.
## # A tibble: 56 x 7
## instrument_id language form age_min age_max has_grammar unilemma_covera…
## <int> <chr> <chr> <int> <int> <int> <dbl>
## 1 1 British Sig… WG 8 36 0 0.76
## 2 2 Cantonese WS 16 30 0 0.95
## 3 3 Croatian WG 8 16 0 1
## 4 4 Croatian WS 16 30 0 0.52
## 5 5 Danish WS 16 36 1 0.580
## 6 6 English (Am… WG 8 18 0 1
## 7 7 English (Am… WS 16 30 1 1
## 8 8 German WS 18 30 0 0.77
## 9 9 Hebrew WG 11 25 0 1
## 10 10 Hebrew WS 25 36 1 0.86
## # … with 46 more rows
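One practical use of this table is to find which instruments cover a given age. The sketch below does that in base R on a stand-in data frame built from rows of the printout above whose language names are not truncated; target_age and the object names are hypothetical.

```r
# Full table (needs wordbankr and a connection):
# instruments <- wordbankr::get_instruments()

# Stand-in built from printed rows 2, 3, 4, 8, 9 and 10
instruments <- data.frame(
  instrument_id = c(2L, 3L, 4L, 8L, 9L, 10L),
  language = c("Cantonese", "Croatian", "Croatian", "German", "Hebrew", "Hebrew"),
  form = c("WS", "WG", "WS", "WS", "WG", "WS"),
  age_min = c(16L, 8L, 16L, 18L, 11L, 25L),
  age_max = c(30L, 16L, 30L, 30L, 25L, 36L)
)

# Keep instruments whose age range includes the target age (in months)
target_age <- 17L
covering <- subset(instruments, age_min <= target_age & age_max >= target_age)
covering[, c("language", "form", "age_min", "age_max")]
```

For a target age of 17 months, this keeps the Cantonese WS, Croatian WS, and Hebrew WG rows of the stand-in.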
The get_sources() function gives information on all the data sources in Wordbank, either for a specific language and/or form or for all instruments. If the admin_data argument is set to TRUE, the results will also include the number of administrations in the database from that source and the minimum and maximum ages of those administrations.
## # A tibble: 29 x 9
## source_id name dataset instrument_lang… instrument_form contributor citation
## <int> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 9 Marc… "Normi… English (Americ… Words & Gestur… Larry Fens… "Fenson…
## 2 10 Byers "" English (Americ… Words & Gestur… Krista Bye… ""
## 3 11 Thal "13" English (Americ… Words & Gestur… Donna Thal… "Thal, …
## 4 12 Thal "16" English (Americ… Words & Gestur… Donna Thal… "Thal, …
## 5 14 Marc… "Normi… Spanish (Mexica… Words & Gestur… Donna Jack… "Jackso…
## 6 18 Kris… "" Norwegian Words & Gestur… Hanne Simo… "Simons…
## 7 19 Kris… "longi… Norwegian Words & Gestur… Hanne Simo… "Simons…
## 8 20 CLEX "" Croatian Words & Gestur… Melita Kov… "Kovace…
## 9 24 CLEX "" Russian Words & Gestur… Stella Cey… "Е.А.Ве…
## 10 26 CLEX "" Swedish Words & Gestur… Mårten Eri… "Erikss…
## # … with 19 more rows, and 2 more variables: longitudinal <lgl>, license <fct>
get_sources(language = "Spanish (Mexican)", admin_data = TRUE) %>%
  select(source_id, name, dataset, instrument_form, n_admins, age_min, age_max)
## # A tibble: 4 x 7
## source_id name dataset instrument_form n_admins age_min age_max
## <int> <chr> <chr> <fct> <int> <int> <int>
## 1 13 Marchman Norming Words & Sentences 1094 15 30
## 2 14 Marchman Norming Words & Gestures 778 8 19
## 3 65 Fernald Outreach Words & Gestures 55 16 22
## 4 66 Fernald Outreach Words & Sentences 80 18 38
The fit_aoa() function computes estimates of items’ age of acquisition (AoA). It needs to be provided with a data frame returned by get_instrument_data(), with one row per administration by item combination and, at a minimum, the columns age and num_item_id. It returns a data frame with one row per item and an aoa column with the estimate, preserving any item-level columns in the input data. The AoA is estimated by computing the proportion of administrations in which the child understands or produces (measure) each word, smoothing the proportion using method, and taking the age at which the smoothed value is greater than proportion.
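The three estimation steps just described can be sketched in base R for a single item. This is an illustration under stated assumptions rather than the package's implementation: the counts are made up, an ordinary logistic regression stands in for whatever method does, and the threshold of 0.5 is an arbitrary stand-in for proportion.

```r
# Hypothetical counts for one item: 20 children per age, number producing the word
item_counts <- data.frame(
  age = 16:29,
  n_children = 20,
  n_producing = c(1, 2, 3, 4, 6, 8, 9, 11, 12, 14, 16, 17, 18, 19)
)

# Step 1: raw proportion of children producing the word at each age
item_counts$prop <- item_counts$n_producing / item_counts$n_children

# Step 2: smooth the proportions (plain logistic regression as a stand-in)
fit <- glm(cbind(n_producing, n_children - n_producing) ~ age,
           family = binomial, data = item_counts)
item_counts$smoothed <- fitted(fit)

# Step 3: youngest age at which the smoothed curve exceeds the threshold;
# NA if the curve never gets there within the observed ages
threshold <- 0.5
above <- item_counts$age[item_counts$smoothed > threshold]
aoa <- if (length(above) > 0) min(above) else NA_integer_
aoa
```

The made-up counts are symmetric around 22.5 months, so the fitted curve crosses 0.5 between 22 and 23 months and the estimate is 23. An item whose smoothed curve never exceeds the threshold gets NA in this sketch.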
eng_ws_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = c("item_1", "item_42"),
                                   administrations = TRUE,
                                   iteminfo = TRUE)
fit_aoa(eng_ws_data)
## # A tibble: 2 x 10
## # Groups: num_item_id [2]
## num_item_id aoa item_id definition type category lexical_category
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 NA item_1 baa baa word sounds other
## 2 42 24 item_42 owl word animals nouns
## # … with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## # complexity_category <chr>
## # A tibble: 2 x 10
## # Groups: num_item_id [2]
## num_item_id aoa item_id definition type category lexical_category
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 21 item_1 baa baa word sounds other
## 2 42 27 item_42 owl word animals nouns
## # … with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## # complexity_category <chr>
One of the item-level fields is uni_lemma (“universal lemma”), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function get_crossling_items() simply gives all the available uni_lemma values.
## # A tibble: 1,380 x 1
## uni_lemma
## <chr>
## 1 a
## 2 a little
## 3 a lot
## 4 able
## 5 about
## 6 above
## 7 after
## 8 afternoon
## 9 again
## 10 air conditioner
## # … with 1,370 more rows
The function get_crossling_data() takes a vector of uni_lemmas and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on WG forms). Each row is a combination of item and age, and the columns indicate the number of children (n_children), means (comprehension, production), standard deviations (comprehension_sd, production_sd), and item-level fields.
get_crossling_data(uni_lemmas = c("hat", "nose")) %>%
  ungroup() %>%
  select(language, uni_lemma, definition, age, n_children, comprehension,
         production, comprehension_sd, production_sd) %>%
  arrange(uni_lemma)
## # A tibble: 381 x 9
## language uni_lemma definition age n_children comprehension production
## <chr> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 British… hat hat 8 4 0 0
## 2 British… hat hat 9 4 0 0
## 3 British… hat hat 10 4 0 0
## 4 British… hat hat 11 6 0.167 0
## 5 British… hat hat 12 6 0 0
## 6 British… hat hat 13 6 0 0
## 7 British… hat hat 14 7 0.143 0
## 8 British… hat hat 15 6 0 0
## 9 British… hat hat 16 7 0.143 0.143
## 10 British… hat hat 17 7 0.286 0.143
## # … with 371 more rows, and 2 more variables: comprehension_sd <dbl>,
## # production_sd <dbl>
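To make the summary columns concrete, the sketch below shows how values like the 0.143 and 0.286 printed above for ages 16 and 17 (7 children each) could arise from per-child responses. The individual responses are made up to reproduce those two means, and treating the standard deviation as the sample standard deviation of 0/1 responses is an assumption.

```r
# Hypothetical per-child responses (1 = understands the word, 0 = does not)
responses <- data.frame(
  age = rep(c(16L, 17L), each = 7),
  understands = c(1, 0, 0, 0, 0, 0, 0,
                  1, 1, 0, 0, 0, 0, 0)
)

# One summary row per age: number of children, mean, and standard deviation
by_age <- lapply(split(responses, responses$age), function(d) {
  data.frame(
    age = d$age[1],
    n_children = nrow(d),
    comprehension = mean(d$understands),
    comprehension_sd = sd(d$understands)
  )
})
summary_by_age <- do.call(rbind, by_age)
round(summary_by_age$comprehension, 3)
```

The rounded means are 0.143 (1 of 7 children) and 0.286 (2 of 7 children).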
The function fit_vocab_quantiles() uses quantile regression to fit a set of vocabulary size quantiles to a dataset. It takes a data frame returned by get_administration_data(), and additional arguments specifying which measure column to fit on (measure: “production” or “comprehension”), an optional demographic column to group by (group), and which type of quantiles to fit (quantiles: “standard”, “deciles”, “quintiles”, “quartiles”, “median”, or a numeric vector of quantile values). The quantiles argument defaults to “standard”, which is 0.10, 0.25, 0.50, 0.75, 0.90.
eng_ws <- get_administration_data("English (American)", "WS")
fit_vocab_quantiles(eng_ws, production)
## # A tibble: 75 x 5
## # Groups: language, form [1]
## language form age quantile production
## <chr> <chr> <int> <fct> <dbl>
## 1 English (American) WS 16 0.1 8.84
## 2 English (American) WS 17 0.1 10.4
## 3 English (American) WS 18 0.1 14.0
## 4 English (American) WS 19 0.1 19.5
## 5 English (American) WS 20 0.1 27.0
## 6 English (American) WS 21 0.1 36.5
## 7 English (American) WS 22 0.1 49.9
## 8 English (American) WS 23 0.1 67.7
## 9 English (American) WS 24 0.1 90.0
## 10 English (American) WS 25 0.1 117.
## # … with 65 more rows
## # A tibble: 150 x 6
## # Groups: language, form, sex [2]
## language form sex age quantile production
## <chr> <chr> <fct> <int> <fct> <dbl>
## 1 English (American) WS Female 16 0.1 8.06
## 2 English (American) WS Female 17 0.1 10.6
## 3 English (American) WS Female 18 0.1 16.2
## 4 English (American) WS Female 19 0.1 25.
## 5 English (American) WS Female 20 0.1 36.9
## 6 English (American) WS Female 21 0.1 51.9
## 7 English (American) WS Female 22 0.1 70.8
## 8 English (American) WS Female 23 0.1 93.8
## 9 English (American) WS Female 24 0.1 121.
## 10 English (American) WS Female 25 0.1 152.
## # … with 140 more rows
## # A tibble: 45 x 5
## # Groups: language, form [1]
## language form age quantile production
## <chr> <chr> <int> <fct> <dbl>
## 1 English (American) WS 16 0.25 18.0
## 2 English (American) WS 17 0.25 21.8
## 3 English (American) WS 18 0.25 30.1
## 4 English (American) WS 19 0.25 43.0
## 5 English (American) WS 20 0.25 60.6
## 6 English (American) WS 21 0.25 82.7
## 7 English (American) WS 22 0.25 109.
## 8 English (American) WS 23 0.25 141.
## 9 English (American) WS 24 0.25 177.
## 10 English (American) WS 25 0.25 217.
## # … with 35 more rows
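Only the first of the three printouts above is preceded by its call. The other two look like the result of grouping by sex and of requesting quartiles, as recorded in the comments below; those calls are inferred from the output, not taken from the text. The sketch also checks the printed row counts against the number of quantiles. resolve_quantiles() is a hypothetical helper, and apart from “standard” (given in the text) the numeric values for each name are the conventional definitions, assumed rather than confirmed.

```r
# Presumed calls behind the second and third printouts (need wordbankr):
# fit_vocab_quantiles(eng_ws, production, group = sex)
# fit_vocab_quantiles(eng_ws, production, quantiles = "quartiles")

# Hypothetical helper: map a quantiles specification to probabilities
resolve_quantiles <- function(quantiles = "standard") {
  if (is.numeric(quantiles)) return(quantiles)
  switch(quantiles,
    standard  = c(0.10, 0.25, 0.50, 0.75, 0.90),
    deciles   = seq(0.1, 0.9, by = 0.1),
    quintiles = seq(0.2, 0.8, by = 0.2),
    quartiles = c(0.25, 0.50, 0.75),
    median    = 0.5,
    stop("unknown quantiles value: ", quantiles)
  )
}

# English (American) WS covers ages 16 to 30 months, so 15 ages per quantile
n_ages <- length(16:30)
expected_rows <- c(
  default   = length(resolve_quantiles("standard")) * n_ages,
  by_sex    = length(resolve_quantiles("standard")) * n_ages * 2,
  quartiles = length(resolve_quantiles("quartiles")) * n_ages
)
expected_rows
```

The three counts are 75, 150, and 45, matching the row counts of the three printouts, which is consistent with the inferred calls.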