ipumsr - IPUMS Data in R
The ipumsr package allows you to read data from your
IPUMS extract into R along with the associated metadata like variable
labels, value labels and more. IPUMS is a great source of international
census and survey data.
IPUMS provides census and survey data from around the world integrated across time and space. IPUMS integration and documentation makes it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community context. Data and services are available free of charge.
Learn more here: https://www.ipums.org/mission-purpose
This vignette gives the basic outline of the ipumsr package. Additional
vignettes provide guidance on working with IPUMS value labels, IPUMS
geographic data, and data from three specific IPUMS projects – CPS,
NHGIS, and Terra.
To view any of these additional vignettes, run one of the following
commands after installing the ipumsr package:
vignette("value-labels", package = "ipumsr")
vignette("ipums-geography", package = "ipumsr")
vignette("ipums-cps", package = "ipumsr")
vignette("ipums-nhgis", package = "ipumsr")
vignette("ipums-terra", package = "ipumsr")
Registered users can download IPUMS data from our website at https://www.ipums.org. The website provides an interactive extract system that allows you to select only the samples and variables that are relevant to your research question.
For microdata projects (all supported projects except NHGIS and IPUMS Terra), once you have created your extract, you should choose to download your data in either a fixed-width text (.dat file extension) or comma delimited (.csv file extension) format.
Once your extract is complete, download the data file and the DDI. Downloading the DDI is a little bit different depending on your browser. On most browsers you should right-click the file and select “Save As…”. If this saves a file with a .xml file extension, then you should be ready. However, Safari users must select “Download Linked File” instead of “Download Linked File As”. On Safari, selecting the wrong version of these two will download a file with a .html file extension instead of a .xml extension.
For NHGIS, download the table data in a comma delimited format, and, if you want the associated mapping data, the GIS data. NHGIS provides the option to download comma-delimited files with an extra header row; it does not matter which option you select.
For IPUMS Terra, download the .zip extract “bundles”. If you want the associated mapping data, select the “include boundary files” option. You do not need to unzip the data.
Once your extract is downloaded, the ipumsr package's read_*() functions help you load the data into R.
read_ipums_micro() / read_ipums_micro_list(): Reads data from microdata projects (USA, CPS, International, DHS, Time Use, Health Surveys and Higher Ed).
read_nhgis() / read_nhgis_sf() / read_nhgis_sp(): Reads data from the NHGIS project. read_nhgis() loads only tabular data, whereas read_nhgis_sf() and read_nhgis_sp() load tabular data and shapefiles.
read_terra_micro(), read_terra_micro_sf(), read_terra_micro_sp(), read_terra_area() and read_terra_raster(): Load data from the IPUMS Terra project.
read_ipums_sf() and read_ipums_sp(): Load boundary files.
read_ipums_ddi(): Reads DDI files with metadata that are included with some extracts (mainly microdata).
read_ipums_codebook(): Reads the text codebook included with some extracts (mainly NHGIS and some TerraPop extracts).
Once the data is in R, you can view information about the extract using the metadata functions.
ipums_view(): Makes a webpage that displays in the RStudio Viewer which provides information about the extract as a whole (like your extract notes or the citation information) and the specific variables included (like the variable label, description and value labels).
ipums_file_info(): Returns the file-level metadata contained in ipums_view() as an R data structure.
ipums_var_info(): Returns the variable-level metadata contained in ipums_view() as an R data structure.
The data from most IPUMS projects contain some form of weighting variable that should be used to calculate estimates that are representative of the whole population. Many projects also provide specifications to help estimate variance given the complex design of the survey, such as replicate weights or design variables like STRATUM and PSU. The survey package provides functions that allow you to estimate variance taking this into account, and the srvyr package implements dplyr-like syntax for survey analysis, using the survey package’s functions.
For more information about what these variables mean and how to use them, see the website for the project you are interested in.
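As a rough illustration of why the weight matters, the sketch below compares an unweighted and a weighted mean in base R. The data frame is a hypothetical four-row stand-in for an extract (its values echo the first CPS rows printed later in this vignette), and treating codes of 99999998 and above as missing is an assumption made for the example. It does not estimate variance; the survey and srvyr packages are the tools for that.

```r
# Hypothetical toy data standing in for a CPS extract
toy <- data.frame(
  INCTOT = c(4883, 5800, 99999998, 14015),
  WTSUPP = c(1476, 1471, 1579, 1598)
)

# Assumption: codes of 99999998 and above mark missing or not-in-universe cases
toy$INCTOT[toy$INCTOT >= 99999998] <- NA

# Compare the plain mean with the mean that uses the person weight
unweighted_mean <- mean(toy$INCTOT, na.rm = TRUE)
weighted_mean <- weighted.mean(toy$INCTOT, toy$WTSUPP, na.rm = TRUE)

c(unweighted = unweighted_mean, weighted = weighted_mean)
```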
Some projects have data that is not contained within the extract
system, such that no DDI is provided for this data. In this case, either
use the comma delimited file (.csv file extension) if available, or use
the haven package to read one of the files intended for another
statistical software (like Stata, SAS or SPSS).
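A minimal sketch of the haven route, assuming a Stata file. Because no IPUMS file ships with this example, it first writes a small hypothetical data frame to a temporary .dta file and then reads it back with haven::read_dta(). With a real download you would pass the path to your own file instead; haven::read_sas() and haven::read_sav() play the same role for SAS and SPSS files.

```r
library(haven)

# Hypothetical stand-in for a downloaded Stata file
tmp <- tempfile(fileext = ".dta")
write_dta(data.frame(YEAR = c(1962, 1963), INCTOT = c(4883, 5800)), tmp)

# Read the file back into R as a tibble
dat <- read_dta(tmp)
dat
```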
The way that IPUMS treats value labels does not align with factors
(the main way that R is able to store values associated with labels).
R’s factor variables can only store values as an integer sequence
(1, 2, 3, …), but IPUMS conventions are to store missing and
not-in-universe codes as large numbers, to distinguish them from the
normal values.
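The mismatch can be seen in base R alone. In this sketch the codes 19, 27 and 55 are the STATEFIP values used elsewhere in this vignette, while 99 is a hypothetical missing code added for illustration. Once the values become a factor, the underlying integers are only positions in the level set and the original codes are gone.

```r
codes <- c(55, 27, 19, 99)

state <- factor(
  codes,
  levels = c(19, 27, 55, 99),
  labels = c("Iowa", "Minnesota", "Wisconsin", "Missing")
)

# The factor stores positions 1..4, not the original codes
as.integer(state)
#> [1] 3 2 1 4
```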
Therefore, the ipumsr package uses the labelled class from the haven
package to store labelled values. See the “value-labels” vignette for
more information (vignette('value-labels')).
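To make the contrast with factors concrete, here is a small sketch that builds a labelled vector by hand with haven::labelled(). The three STATEFIP codes come from the example extract; constructing the vector manually is only for illustration, since ipumsr creates these vectors for you when it reads an extract.

```r
library(haven)

statefip <- labelled(
  c(55, 27, 19),
  labels = c(Iowa = 19, Minnesota = 27, Wisconsin = 55)
)

# The original codes are kept as the values...
zap_labels(statefip)
#> [1] 55 27 19

# ...and the labels travel with them until you convert to a factor
as.character(as_factor(statefip))
#> [1] "Wisconsin" "Minnesota" "Iowa"
```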
If you want to use IPUMS value labels attached to a variable, it is
generally best to convert from the labelled class to factor early on in
your data analysis workflow. This is because many data manipulation
functions will drop the labels stored in a variable with class labelled.
The function as_factor() is the main function to create factors from
labels, but often you will need to do more manipulation before that.
library(ipumsr)
library(dplyr, warn.conflicts = FALSE)
# Note that you can pass the loaded DDI into `read_ipums_micro()`
cps_ddi <- read_ipums_ddi(ipums_example("cps_00006.xml"))
cps_data <- read_ipums_micro(cps_ddi, verbose = FALSE)
# Show which variables have labels
cps_data %>%
  select_if(is.labelled)
#> # A tibble: 7,668 × 3
#> STATEFIP MONTH INCTOT
#> <int+lbl> <int+lbl> <dbl+lbl>
#> 1 55 [Wisconsin] 3 [March] 4883
#> 2 55 [Wisconsin] 3 [March] 5800
#> 3 55 [Wisconsin] 3 [March] 99999998 [Missing.]
#> 4 27 [Minnesota] 3 [March] 14015
#> 5 27 [Minnesota] 3 [March] 16552
#> 6 27 [Minnesota] 3 [March] 6375
#> 7 19 [Iowa] 3 [March] 99999999 [N.I.U. (Not in Universe).]
#> 8 19 [Iowa] 3 [March] 0
#> 9 19 [Iowa] 3 [March] 600
#> 10 19 [Iowa] 3 [March] 99999999 [N.I.U. (Not in Universe).]
#> # … with 7,658 more rows
# Notice how the tibble print function shows the dbl+lbl class on top
# Investigate labels
ipums_val_labels(cps_data$STATEFIP)
#> # A tibble: 75 × 2
#> val lbl
#> <int> <chr>
#> 1 1 Alabama
#> 2 2 Alaska
#> 3 4 Arizona
#> 4 5 Arkansas
#> 5 6 California
#> 6 8 Colorado
#> 7 9 Connecticut
#> 8 10 Delaware
#> 9 11 District of Columbia
#> 10 12 Florida
#> # … with 65 more rows
# Convert the labels to factors (and drop the unused levels)
cps_data <- cps_data %>%
  mutate(STATE_factor = as_factor(lbl_clean(STATEFIP)))
table(cps_data$STATE_factor, useNA = "always")
#>
#> Iowa Minnesota North Dakota South Dakota Wisconsin <NA>
#> 1892 2362 188 227 2999 0
# Manipulating the labelled value before as_factor
# often leads to losing the information...
# Say we want to set Iowa (STATEFIP == 19) to missing
cps_data <- cps_data %>%
  mutate(STATE_factor2 = as_factor(ifelse(STATEFIP == 19, NA, STATEFIP)))
# ipumsr provides helpers for these kinds of tasks, like lbl_na_if().
# See the value-labels vignette for more information
cps_data <- cps_data %>%
  mutate(STATE_factor3 = as_factor(lbl_na_if(STATEFIP, ~.val == 19)))
# The as_factor function also has a "levels" argument that can
# put both the labels and values into the factor
cps_data <- cps_data %>%
  mutate(STATE_factor4 = droplevels(as_factor(STATEFIP, levels = "both")))
table(cps_data$STATE_factor4, useNA = "always")
#>
#> [19] Iowa [27] Minnesota [38] North Dakota [46] South Dakota
#> 1892 2362 188 227
#> [55] Wisconsin <NA>
#> 2999 0
As with value labels, the other attributes that ipumsr stores about the
data are often lost during an analysis. One way to deal with this is to
load the DDI or codebook in addition to the actual data using the
functions read_ipums_ddi() and read_ipums_codebook(). This way, when you
wish to refer to variable labels or other metadata, you can use the DDI
object, which does not get modified during your analysis.
library(ipumsr)
library(dplyr, warn.conflicts = FALSE)
# Note that you can pass the loaded DDI into `read_ipums_micro()`
cps_ddi <- read_ipums_ddi(ipums_example("cps_00006.xml"))
cps_data <- read_ipums_micro(cps_ddi, verbose = FALSE)
# Currently variable description is available for year
ipums_var_desc(cps_data$YEAR)
#> [1] "YEAR reports the year in which the survey was conducted. YEARP is repeated on person records."
# But after using ifelse it is gone
cps_data <- cps_data %>%
  mutate(YEAR = ifelse(YEAR == 1962, 62, NA))
ipums_var_desc(cps_data$YEAR)
#> [1] NA
# So you can use the DDI
ipums_var_desc(cps_ddi, "YEAR")
#> [1] "YEAR reports the year in which the survey was conducted. YEARP is repeated on person records."
# The DDI also has file level information that is not available from just
# the data.
ipums_file_info(cps_ddi, "extract_notes") %>% cat()
#> User-provided description: Minimal test extract
#> Samples: 1962, 1963
#> Variables: STATEFIP, INCTOT (automatically Year, SERIAL, HWTSUPP, MONTH, WTSUPP)
#> Select Cases: State - Minnesota, Iowa, Wisconsin, South Dakota, North Dakota
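The same DDI object can also be queried for several variables at once. The sketch below assumes the cps_00006.xml example file used above is available in your installed version of ipumsr, and that ipums_var_info() accepts a vars argument and returns a var_name column; both details may differ between package versions.

```r
library(ipumsr)

cps_ddi <- read_ipums_ddi(ipums_example("cps_00006.xml"))

# Variable-level metadata for a subset of variables, as a data frame
var_info <- ipums_var_info(cps_ddi, vars = c("YEAR", "INCTOT"))
var_info$var_name

# A single variable label, looked up from the DDI rather than the data
ipums_var_label(cps_ddi, "INCTOT")
```

Because the lookup goes through cps_ddi, it keeps working even after the corresponding columns in the data have lost their attributes.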
Several functions within the ipumsr package allow for “dplyr
select-style” syntax. This means that they accept a character vector of
values (e.g. c("YEAR", "AGE")), a bare vector of values
(e.g. c(YEAR, AGE)), or the helper functions allowed in dplyr::select()
(e.g. one_of(c("YEAR", "AGE"))).
library(ipumsr)
library(dplyr, warn.conflicts = FALSE)
# The vars argument for `read_ipums_micro` uses this syntax
# So these are all equivalent
cf <- ipums_example("cps_00006.xml")
read_ipums_micro(cf, vars = c("YEAR", "INCTOT"), verbose = FALSE) %>%
  names()
#> [1] "YEAR" "INCTOT"
read_ipums_micro(cf, vars = c(YEAR, INCTOT), verbose = FALSE) %>%
  names()
#> [1] "YEAR" "INCTOT"
read_ipums_micro(cf, vars = c(one_of("YEAR"), starts_with("INC")), verbose = FALSE) %>%
  names()
#> [1] "YEAR" "INCTOT"
# `data_layer` and `shape_layer` arguments to `read_nhgis()` and terra functions
# also use it.
# (Sometimes extracts have multiple files, though all examples only have one)
nf <- ipums_example("nhgis0008_csv.zip")
ipums_list_files(nf)
#> # A tibble: 1 × 2
#> type file
#> <chr> <chr>
#> 1 data nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv
ipums_list_files(nf, data_layer = "nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv")
#> # A tibble: 1 × 2
#> type file
#> <chr> <chr>
#> 1 data nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv
ipums_list_files(nf, data_layer = contains("ds135"))
#> # A tibble: 1 × 2
#> type file
#> <chr> <chr>
#> 1 data nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv
For certain IPUMS projects, the data is hierarchical: multiple people
are included in a single household, or multiple activities are performed
by a single person. The ipumsr package provides two data structures for
storing such data (for users who did not select the “rectangularize”
option on the website). The data can be loaded as a "list" or "long".
List data loads each record type into a separate data.frame. The
data.frames are named after the record types identified by the RECTYPE
variable (e.g. HOUSEHOLD and PERSON in the example below). Use the
function read_ipums_micro_list() to load the data this way.
Long data has one row per unit, regardless of what type of record the
unit is. Therefore, datasets loaded this way often contain variables
with a large number of missing values, for the variables that only apply
to certain record types. Use the function read_ipums_micro() to load the
data this way.
library(ipumsr)
library(dplyr, warn.conflicts = FALSE)
# List data
cps <- read_ipums_micro_list(
  ipums_example("cps_00010.xml"),
  verbose = FALSE
)
cps$PERSON
#> # A tibble: 7,668 × 6
#> RECTYPE YEAR SERIAL PERNUM WTSUPP INCTOT
#> <chr+lbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>
#> 1 P [Person Record] 1962 80 1 1476. 4883
#> 2 P [Person Record] 1962 80 2 1471. 5800
#> 3 P [Person Record] 1962 80 3 1579. 99999998 [Missing.]
#> 4 P [Person Record] 1962 82 1 1598. 14015
#> 5 P [Person Record] 1962 83 1 1707. 16552
#> 6 P [Person Record] 1962 84 1 1790. 6375
#> 7 P [Person Record] 1962 107 1 4355. 99999999 [N.I.U. (Not in Univer…
#> 8 P [Person Record] 1962 107 2 1386. 0
#> 9 P [Person Record] 1962 107 3 1629. 600
#> 10 P [Person Record] 1962 107 4 1432. 99999999 [N.I.U. (Not in Univer…
#> # … with 7,658 more rows
cps$HOUSEHOLD
#> # A tibble: 3,385 × 6
#> RECTYPE YEAR SERIAL HWTSUPP STATEFIP MONTH
#> <chr+lbl> <dbl> <dbl> <dbl> <int+lbl> <int+lbl>
#> 1 H [Household Record] 1962 80 1476. 55 [Wisconsin] 3 [March]
#> 2 H [Household Record] 1962 82 1598. 27 [Minnesota] 3 [March]
#> 3 H [Household Record] 1962 83 1707. 27 [Minnesota] 3 [March]
#> 4 H [Household Record] 1962 84 1790. 27 [Minnesota] 3 [March]
#> 5 H [Household Record] 1962 107 4355. 19 [Iowa] 3 [March]
#> 6 H [Household Record] 1962 108 1479. 19 [Iowa] 3 [March]
#> 7 H [Household Record] 1962 122 3603. 27 [Minnesota] 3 [March]
#> 8 H [Household Record] 1962 124 4104. 55 [Wisconsin] 3 [March]
#> 9 H [Household Record] 1962 125 2182. 55 [Wisconsin] 3 [March]
#> 10 H [Household Record] 1962 126 1826. 55 [Wisconsin] 3 [March]
#> # … with 3,375 more rows
# Long data
cps <- read_ipums_micro(
  ipums_example("cps_00010.xml"),
  verbose = FALSE
)
cps
#> # A tibble: 11,053 × 9
#> RECTYPE YEAR SERIAL HWTSUPP STATEFIP MONTH PERNUM WTSUPP INCTOT
#> <chr+lbl> <dbl> <dbl> <dbl> <int+lb> <int+lb> <dbl> <dbl> <dbl+lbl>
#> 1 H [Househ… 1962 80 1476. 55 [Wis… 3 [Mar… NA NA NA
#> 2 P [Person… 1962 80 NA NA NA 1 1476. 4.88e3
#> 3 P [Person… 1962 80 NA NA NA 2 1471. 5.8 e3
#> 4 P [Person… 1962 80 NA NA NA 3 1579. 1.00e8 [Mis…
#> 5 H [Househ… 1962 82 1598. 27 [Min… 3 [Mar… NA NA NA
#> 6 P [Person… 1962 82 NA NA NA 1 1598. 1.40e4
#> 7 H [Househ… 1962 83 1707. 27 [Min… 3 [Mar… NA NA NA
#> 8 P [Person… 1962 83 NA NA NA 1 1707. 1.66e4
#> 9 H [Househ… 1962 84 1790. 27 [Min… 3 [Mar… NA NA NA
#> 10 P [Person… 1962 84 NA NA NA 1 1790. 6.38e3
#> # … with 11,043 more rows
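If you load list data but later need household variables on each person record, one option is to join the two tables on their shared identifiers. The sketch below uses base R merge() on two hypothetical toy data frames whose values echo the first rows printed above. Joining on YEAR and SERIAL is an assumption that fits this CPS example; other projects or samples may need additional keys.

```r
# Hypothetical household-level records
household <- data.frame(
  YEAR = c(1962, 1962),
  SERIAL = c(80, 82),
  STATEFIP = c(55, 27)
)

# Hypothetical person-level records belonging to those households
person <- data.frame(
  YEAR = c(1962, 1962, 1962, 1962),
  SERIAL = c(80, 80, 80, 82),
  PERNUM = c(1, 2, 3, 1),
  INCTOT = c(4883, 5800, 99999998, 14015)
)

# One row per person, with the household's STATEFIP attached
rectangular <- merge(person, household, by = c("YEAR", "SERIAL"))
rectangular
```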
sf vs sp
The ipumsr package allows for loading geospatial data in two formats
(sf for Simple Features and sp for Spatial). The sf package is
relatively new, and so does not have as widespread support as the sp
package. However, (in my opinion) it does allow for easier analysis, and
so may be a better place to start if you have not used GIS data in R
before.
For more details about how to load geographic data using ipumsr, see
the vignette “ipums-geography”
(vignette("ipums-geography", package = "ipumsr")).