Interacting with the IPUMS microdata extract API

Warning: New features currently in development

The IPUMS microdata extract API is still in beta testing, and the interface for interacting with the API using ipumsr could still change in response to tester feedback. If you are interested in becoming a beta tester, contact IPUMS by email.

To install the latest version of ipumsr from CRAN, use:

install.packages("ipumsr")

Since we are still actively developing the functions for interacting with the API, there may be changes in our GitHub repo that are not yet on CRAN. To install the development version of ipumsr from GitHub, use:

if (!require(remotes)) install.packages("remotes")
remotes::install_github("ipums/ipumsr/ipumsexamples")
remotes::install_github(
  "ipums/ipumsr", 
  build_vignettes = TRUE, 
  dependencies = TRUE
)

Overview

The IPUMS microdata extract API allows registered IPUMS USA and CPS users to define extracts, submit extract requests, and download extract files without visiting the IPUMS website. ipumsr includes functions that help R users interact with the extract API from their R session.

library(ipumsr)
library(dplyr) # not necessary to use API functions, but used in some examples
library(purrr) # not necessary to use API functions, but used in some examples

Setting up your API key

If you don’t have an IPUMS USA or IPUMS CPS account, you’ll need to register for access on the IPUMS USA or IPUMS CPS website. You’ll also need to request API beta access by email, as mentioned above.

Once you’re registered, you’ll need to create an API key.

Once you’ve created an API key, you can choose to supply it as a function argument whenever interacting with the API, or you can set the value of the IPUMS_API_KEY environment variable to your key. The example code in this vignette assumes you have assigned your key to this environment variable.

To set the value of the IPUMS_API_KEY environment variable for your current session, you can use:

set_ipums_api_key("paste-your-key-here")

To set your API key and save it for use in future sessions, use the same function, but with save set to TRUE:

set_ipums_api_key("paste-your-key-here", save = TRUE)

This will add your API key to a file named “.Renviron” in your user home directory, so that the value of the IPUMS_API_KEY environment variable is set when R starts up.
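Before calling any API function, it can be handy to confirm that the key is actually visible to your R session. The following is a minimal sketch using only base R; api_key_is_set() is a hypothetical helper name chosen for illustration, not an ipumsr function.

```r
# Hypothetical helper (not part of ipumsr): returns TRUE when the
# IPUMS_API_KEY environment variable holds a non-empty value.
api_key_is_set <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (!api_key_is_set()) {
  message("IPUMS_API_KEY is not set; see set_ipums_api_key().")
}
```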

Defining your extract

Each IPUMS data collection with API support has its own function for defining an extract. These functions have names of the form define_extract_<collection>(). Thus, to define an IPUMS USA extract, you use define_extract_usa(), and to define an IPUMS CPS extract, you use define_extract_cps().

All define_extract_<collection>() functions return an ipums_extract object, which can then be submitted using the submit_extract() function.

usa_extract_definition <- define_extract_usa(
  description = "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP")
)

cps_extract_definition <- define_extract_cps(
  description = "CPS extract for API vignette",
  samples = c("cps1976_01s", "cps1976_02b"),
  variables = c("YEAR", "MISH", "CPSIDP", "AGE", "SEX", "RACE", "UH_SEX_B1")
)

For more details on the ipums_extract class, view the documentation page with ?ipums_extract-class.

Note that samples are specified using special sample ID codes, which can be browsed on the IPUMS USA and IPUMS CPS websites.
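When an extract spans several years, typing each sample ID by hand gets tedious. As a sketch, and assuming the samples you want follow the same "us<year>a" pattern as the two IDs used above (an assumption that should be checked against the sample ID list for your collection, since not every sample is named this way), the IDs can be generated with base R:

```r
# Build sample IDs for a range of years, assuming the "us<year>a" pattern.
years <- 2018:2019
usa_sample_ids <- paste0("us", years, "a")
usa_sample_ids
# "us2018a" "us2019a"
```

The resulting character vector can be passed as the samples argument of define_extract_usa().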

Submitting your extract

To submit your extract, use:

submit_extract(usa_extract_definition)

Like the define_extract_<collection>() functions, submit_extract() returns an ipums_extract object. The returned object has been updated to include the extract number, so it is useful to save it by assigning a name to it, like this:

submitted_usa_extract <- submit_extract(usa_extract_definition)

That way, you can use the submitted_usa_extract object as input to check the extract’s status, as shown in the next section, or to reference the extract number:

submitted_usa_extract$number

Checking the status of your extract

To retrieve the latest status of an extract, you can use the get_extract_info() function. get_extract_info() returns an ipums_extract object with the “status” element updated to reflect the latest status of the extract, and the “download_links” element updated to include links to any extract files that are available for download.

The “status” of a submitted extract is one of “queued”, “started”, “produced”, “canceled”, “failed”, or “completed”. Only “completed” extracts can be downloaded, but “completed” extracts older than 72 hours may not be available for download, since extract files are removed after that time (see discussion of the is_extract_ready() function below).

If you assigned a name to the return value of submit_extract(), as shown above, you could get updated information on the extract, returned as an ipums_extract object, with:

submitted_usa_extract <- get_extract_info(submitted_usa_extract)

To print the latest status, you can use:

submitted_usa_extract$status

If you forget to capture the return value of submit_extract(), you can pull down an ipums_extract object containing all the information on your most recent extract for a given data collection with:

submitted_usa_extract <- get_last_extract_info("usa")

get_last_extract_info() is just a convenience wrapper around get_recent_extracts_info_list(), described below.

If you don’t have an ipums_extract object in your environment that describes the extract you’re interested in, and you don’t want the most recent extract, you can also query the latest status of an extract by supplying the name of the IPUMS data collection and extract number of the extract, in one of two formats. Here’s how you’d get the latest information on IPUMS CPS extract number 33:

cps_extract_33 <- get_extract_info("cps:33")

or

cps_extract_33 <- get_extract_info(c("cps", "33"))

Note that in the first format, there are no spaces before or after the colon, and that in both formats, there is no need to zero-pad the extract number – in other words, use “33”, not “00033”.
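If the collection and extract number live in separate variables, the "collection:number" string can be assembled programmatically. The helper below is a hypothetical illustration written in base R (extract_spec() is not an ipumsr function); it also drops any zero-padding from the number.

```r
# Hypothetical helper (not part of ipumsr): builds a "collection:number"
# string, dropping any zero-padding from the extract number.
extract_spec <- function(collection, number) {
  paste0(collection, ":", as.integer(number))
}

extract_spec("cps", 33)       # "cps:33"
extract_spec("cps", "00033")  # "cps:33"
```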

If you want R to periodically check the status of your extract, and only return an updated ipums_extract object once the extract is ready to download, you can use wait_for_extract(), as shown below:

downloadable_cps_extract <- wait_for_extract(cps_extract_33)

wait_for_extract() also accepts the same "collection:number" and c("collection", "number") specifications shown above:

downloadable_cps_extract <- wait_for_extract("cps:33")

or

downloadable_cps_extract <- wait_for_extract(c("cps", "33"))

wait_for_extract() will tie up your R session until the extract is ready to download, so it might not be the best option for large extracts that take a long time to produce, or when the IPUMS servers are busy. However, wait_for_extract() does offer a timeout_seconds argument to set the maximum number of seconds you want the function to wait. By default, that argument is set to 10,800 seconds (3 hours).
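One way to use a shorter timeout without letting a long-running extract halt a script is to wrap the call in tryCatch(). This is only a sketch: wait_or_give_up() is a hypothetical wrapper, and it assumes that wait_for_extract() signals an error when the timeout elapses, which should be confirmed against the function's documentation. Note that it catches any error, not only timeouts.

```r
# Hypothetical wrapper (not part of ipumsr). Assumes wait_for_extract()
# signals an error when the timeout elapses; any error yields NULL.
# `wait_fun` is a parameter so the logic can be tried with a stand-in function.
wait_or_give_up <- function(extract, timeout_seconds = 600,
                            wait_fun = ipumsr::wait_for_extract) {
  tryCatch(
    wait_fun(extract, timeout_seconds = timeout_seconds),
    error = function(e) {
      message("Stopped waiting: ", conditionMessage(e))
      NULL
    }
  )
}
```

Calling wait_or_give_up("cps:33") would then wait for up to ten minutes and return either the updated ipums_extract object or NULL.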

An alternative way to check whether your extract is ready to download is using the is_extract_ready() function. This function accepts either an ipums_extract object or a "collection:number" or c("collection", "number") specification, and returns a single TRUE or FALSE value indicating whether the extract is ready to be downloaded.

is_extract_ready(cps_extract_33)
is_extract_ready("cps:33")
is_extract_ready(c("cps", "33"))

As noted above, only extracts with status “completed” can be ready to download, but not all “completed” extracts are ready to download, because extract files are removed from IPUMS servers after 72 hours. The is_extract_ready() function checks whether an extract can currently be downloaded by looking at the “download_links” element of the extract object returned by the API.

Note that the API has a limit of 60 requests with the same API key per minute, so you wouldn’t want to write a loop that repeatedly uses is_extract_ready() to check your extract status.
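If you do need your own polling logic, pausing between checks keeps the request rate far below that limit. The sketch below is a hypothetical helper, not an ipumsr function; the 30-second interval and the cap of 20 checks are arbitrary illustrative values, and wait_for_extract() is normally the simpler choice.

```r
# Hypothetical helper (not part of ipumsr). Checks readiness at most
# `max_checks` times, pausing `interval_seconds` between checks.
# `ready_fun` is a parameter so the loop can be tried with a stand-in function.
poll_until_ready <- function(extract, interval_seconds = 30, max_checks = 20,
                             ready_fun = ipumsr::is_extract_ready) {
  for (i in seq_len(max_checks)) {
    if (isTRUE(ready_fun(extract))) {
      return(TRUE)
    }
    if (i < max_checks) {
      Sys.sleep(interval_seconds)  # pause so checks are spaced out
    }
  }
  FALSE  # gave up before the extract became ready
}
```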

Downloading your extract

Once your extract is ready to download, use the download_extract() function to download the data and DDI codebook files to your computer. The download_extract() function returns the path to the DDI codebook file, which can be used to read in the downloaded data with ipumsr functions. By default, the function will download files into your current working directory, but alternative locations can be specified with the download_dir argument.

ddi_path <- download_extract(submitted_usa_extract)

ddi <- read_ipums_ddi(ddi_path)
data <- read_ipums_micro(ddi)

Or, using a "collection:number" or c("collection", "number") specification:

ddi_path <- download_extract("cps:33")
ddi_path <- download_extract(c("cps", "33"))
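To keep extract files out of your working directory, you can point download_dir at a dedicated folder. The sketch below creates the folder first, so it does not depend on whether download_extract() creates missing directories. The folder name is arbitrary, and the download is only attempted when an API key is set.

```r
# Create a dedicated folder for extract files (placed under tempdir() here
# so the example leaves nothing behind; any writable path would do).
extract_dir <- file.path(tempdir(), "ipums_extracts")
dir.create(extract_dir, recursive = TRUE, showWarnings = FALSE)

# Only attempt the download when an API key is available.
if (nzchar(Sys.getenv("IPUMS_API_KEY"))) {
  ddi_path <- ipumsr::download_extract("cps:33", download_dir = extract_dir)
}
```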

Sharing an extract definition

One exciting feature enabled by the IPUMS microdata extract API is the ability to share a standardized extract definition with other IPUMS users so that they can create an identical extract for themselves. ipumsr facilitates this by offering the functions save_extract_as_json() and define_extract_from_json() to write extract definitions to and read extract definitions from a standardized JSON-formatted file.

To write the definition of your CPS extract number 33 to a JSON-formatted file that can be shared with other users, you could use:

cps_extract_33 <- get_extract_info("cps:33")
save_extract_as_json(cps_extract_33, file = "cps_extract_33.json")

Then, you or another user could use that JSON file to create a duplicate ipums_extract object with the same definition, and submit it, using:

clone_of_cps_extract_33 <- define_extract_from_json("cps_extract_33.json")
submitted_cps_extract <- submit_extract(clone_of_cps_extract_33)

Note that the code in the previous chunk assumes that the file is saved in the current working directory. If it’s saved somewhere else, replace "cps_extract_33.json" with the full path to the file.

Revising a previous extract

ipumsr also includes convenience functions for revising a previous extract definition, facilitating a “revise and resubmit” workflow. Here’s how you would pull down the definition of USA extract number 33 and add a sample and a variable to it:

old_extract <- get_extract_info("usa:33")
new_extract <- add_to_extract(old_extract, samples = "us2020a", variables = "RELATE")

The add_to_extract() function returns an ipums_extract object that has been modified as requested and has been reset to an unsubmitted state, by stripping the extract number, status, and download links from the original extract. The revised extract can then be submitted with:

newly_submitted_extract <- submit_extract(new_extract)

To remove values from an extract, use remove_from_extract():

newer_extract <- remove_from_extract(new_extract, samples = "us2020a")

Getting info on multiple recent extracts

You can query the API for the details and status of recent extracts (the ten most recent, by default) using the functions get_recent_extracts_info_list() and get_recent_extracts_info_tbl(). The _list version of the function returns a list of ipums_extract objects, whereas the _tbl version returns a tibble (enhanced “data.frame”) in which each row contains information on one extract.

The list representation is useful if you want to be able to operate on elements as ipums_extract objects. For instance, to retrieve your second-most-recent extract and revise it for resubmission, you could use:

second_most_recent_extract <- get_recent_extracts_info_list("usa")[[2]]
revised_extract <- add_to_extract(
  second_most_recent_extract,
  samples = "us2010a"
)

Or to download all recent extracts that are ready to download, using purrr::keep() and purrr::map_chr():

ddi_paths <- get_recent_extracts_info_list("usa") %>% 
  keep(is_extract_ready) %>% 
  map_chr(download_extract)

The tibble representation is useful if you want to use functions for manipulating data.frames to find recent extracts matching particular criteria.

recent_usa_extracts_tbl <- get_recent_extracts_info_tbl("usa")

For example, to find extracts with descriptions including the word “occupation”, you could use:

recent_usa_extracts_tbl %>%
  filter(grepl("occupation", description))

Filtering on properties such as “samples” or “variables” is a little more complex, because these are stored in list columns, but it is possible. For example, to find extracts including the variable “AGE”, you could use purrr::map_lgl() like this:

recent_usa_extracts_tbl %>%
  filter(map_lgl(variables, ~"AGE" %in% .x))
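To see what a list-column filter does without querying the API, here is a self-contained sketch on a small mock data.frame. The column names and values are made up for illustration and only imitate the shape of the table, and base R's vapply() plays the role of purrr::map_lgl().

```r
# Mock table of extracts (hypothetical data, not returned by the API).
mock_extracts <- data.frame(
  number = 1:3,
  description = c("Age by sex", "Race only", "Age and occupation"),
  stringsAsFactors = FALSE
)
# Each row stores a character vector of variable names in a list column.
mock_extracts$variables <- list(c("AGE", "SEX"), "RACE", c("AGE", "OCC"))

# One TRUE/FALSE per row: does the row's variable vector include "AGE"?
has_age <- vapply(mock_extracts$variables, function(v) "AGE" %in% v, logical(1))
mock_extracts[has_age, c("number", "description")]

# Requiring several variables at once works the same way, using all().
has_age_and_sex <- vapply(
  mock_extracts$variables,
  function(v) all(c("AGE", "SEX") %in% v),
  logical(1)
)
mock_extracts$number[has_age_and_sex]
```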

To convert between these two representations, ipumsr provides the functions extract_list_to_tbl() and extract_tbl_to_list(), such that the following is TRUE:

identical(
  extract_list_to_tbl(get_recent_extracts_info_list("usa")),
  get_recent_extracts_info_tbl("usa")
)

Putting it all together, with pipes

The return values of the functions to interact with the API are configured in such a way that you can define, submit, wait for, download, and read in your extract all in one piped expression:

data <- 
  define_extract_usa(
    "USA extract for API vignette",
    c("us2018a", "us2019a"),
    c("AGE", "SEX", "RACE", "STATEFIP")
  ) %>% 
    submit_extract() %>% 
    wait_for_extract() %>% 
    download_extract() %>% 
    read_ipums_micro()