Cancensus and CensusMapper

The cancensus package was developed to give users programmatic access to Canadian Census data following good tidy data practices. While the structure and data in cancensus are unique to the Canadian Census, this package is inspired in part by tidycensus, a package to interface with the US Census Bureau data APIs.

As Statistics Canada does not provide direct API access to Census data, cancensus retrieves Census data indirectly through the CensusMapper API. CensusMapper is a project by Jens von Bergmann, one of the authors of cancensus, to provide interactive geographic visualizations of Canadian Census data. CensusMapper databases store all publicly available data from Statistics Canada for the 2006, 2011, and 2016 Censuses. CensusMapper data can be accessed via an API, and cancensus is built to interface directly with it.

API Key

cancensus requires a valid CensusMapper API key to use. You can obtain a free API key by signing up for a CensusMapper account. CensusMapper API keys are free and public API quotas are generous; however, due to the incremental costs of serving large quantities of data, there are limits to API usage in place. For most use cases, these API limits should not be an issue. Production uses with large extracts of fine-grained geographies may run into API quota limits. For larger quotas, please get in touch with Jens directly.

To check your API key, go to “Edit Profile” (in the top-right of the CensusMapper menu bar). Once you have your key, you can store it in your system environment so it is automatically used in API calls. To do so, enter set_api_key(<your_api_key>, install = TRUE).
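Before making API calls it can help to confirm that a key is actually visible to the R session. The following is a minimal sketch, not part of cancensus: has_api_key() is a hypothetical helper that checks the cancensus.api_key option used elsewhere in this document and falls back to an environment variable named CM_API_KEY, which is an assumption about where an installed key ends up.

```r
# Hypothetical helper, not a cancensus function.
# Returns TRUE when a non-empty key is found in the cancensus.api_key option
# or, as an assumed fallback, in the CM_API_KEY environment variable.
has_api_key <- function() {
  key <- getOption("cancensus.api_key", default = Sys.getenv("CM_API_KEY"))
  is.character(key) && length(key) == 1 && nzchar(key)
}

if (!has_api_key()) {
  message("No CensusMapper API key found; set one before calling the API.")
}
```

If the check fails after storing a key with install = TRUE, restarting the R session is a reasonable first step, since environment files are typically read at startup.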

Installing cancensus

The stable version of cancensus can be easily installed from CRAN.

install.packages("cancensus")

library(cancensus)

options(cancensus.api_key = "your_api_key")
options(cancensus.cache_path = "custom cache path")

Alternatively, the latest development version can be installed from GitHub using devtools.

# install.packages("devtools")
devtools::install_github("mountainmath/cancensus")

library(cancensus)

options(cancensus.api_key = "your_api_key")
options(cancensus.cache_path = "custom cache path")

For performance reasons, and to avoid unnecessarily drawing down API quotas, cancensus caches data queries under the hood. By default, cancensus caches in R’s temporary directory, but this cache is not persistent across sessions. To speed up performance, reduce quota usage, and avoid unnecessary network calls, we recommend assigning a persistent local cache using set_cache_path(<local cache path>, install = TRUE). This enables more efficient loading and reuse of downloaded data. Users who have not yet set a cache location will be prompted with a suggestion to change the default when making API calls.
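As a rough illustration of the session-level alternative, the sketch below creates a cache directory if it does not exist and points the cancensus.cache_path option (shown in the installation snippets above) at it. use_persistent_cache() is a hypothetical helper written for this example, and the directory name in the commented call is a placeholder.

```r
# Hypothetical helper, not a cancensus function.
# Creates the directory if needed and sets the cancensus.cache_path option.
use_persistent_cache <- function(path) {
  path <- path.expand(path)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  options(cancensus.cache_path = path)
  invisible(path)
}

# Example call (the directory name is a placeholder):
# use_persistent_cache("~/cancensus_cache")
```

An option set this way only lasts for the current session, so it would need to be repeated in each script or startup file; the install = TRUE route described above avoids that.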

Accessing Census Data

cancensus provides three different functions for retrieving Census data:

* get_census to retrieve Census data and geography as a spatial dataset
* get_census_data to retrieve Census data only as a flat data frame
* get_census_geometry to retrieve Census geography only as a collection of spatial polygons

get_census takes as inputs a dataset parameter, a list of specified regions, a vector of Census variables, and a Census geography level. You can specify one of three options for spatial formats: NA to return data only, sf to return an sf-class data frame, or sp to return a SpatialPolygonsDataFrame object.

# Returns a data frame with data only
census_data <- get_census(dataset='CA16', regions=list(CMA="59933"),
                          vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
                          level='CSD', use_cache = FALSE, geo_format = NA, quiet = TRUE)

# Returns data and geography as an sf-class data frame
census_data <- get_census(dataset='CA16', regions=list(CMA="59933"),
                          vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
                          level='CSD', use_cache = FALSE, geo_format = 'sf', quiet = TRUE)

# Returns a SpatialPolygonsDataFrame object with data and geography
census_data <- get_census(dataset='CA16', regions=list(CMA="59933"),
                          vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
                          level='CSD', use_cache = FALSE, geo_format = 'sp', quiet = TRUE)

cancensus utilizes caching to increase speed, minimize API token usage, and to make data available offline. Downloaded data is hashed and stored locally so if a call is made to access the same data, cancensus will read the local version instead. To force cancensus to refresh the data, specify use_cache = FALSE as a parameter for get_census.
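To make the idea concrete, here is a small sketch of a cache keyed by a hash of the query parameters. It is not how cancensus implements its cache; cached_fetch(), query_key(), and the demo fetch function are all hypothetical names, and the hash is simply an MD5 of the deparsed parameter list.

```r
# Demo cache location inside the session's temporary directory
cache_dir <- file.path(tempdir(), "demo_cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Derive a stable key from the query parameters via an MD5 checksum
query_key <- function(params) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(deparse(params), tmp)
  unname(tools::md5sum(tmp))
}

# Return the stored result when present, otherwise fetch and store it
cached_fetch <- function(params, fetch_fun, use_cache = TRUE) {
  path <- file.path(cache_dir, paste0(query_key(params), ".rds"))
  if (use_cache && file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch_fun(params)
  saveRDS(result, path)
  result
}

# Stand-in for a network call
demo_fetch <- function(params) {
  message("fetching...")
  data.frame(level = params$level, stringsAsFactors = FALSE)
}

params <- list(dataset = "CA16", level = "CSD")
first  <- cached_fetch(params, demo_fetch)                    # runs demo_fetch
second <- cached_fetch(params, demo_fetch)                    # read from disk
third  <- cached_fetch(params, demo_fetch, use_cache = FALSE) # forced refresh
```

In this sketch, use_cache = FALSE skips the lookup but still overwrites the stored copy, so later calls see the refreshed data.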

Additional parameters for advanced options can be viewed by running ?get_census.

Census Datasets

cancensus can access Statistics Canada Census data for Census years 1996, 2001, 2006, 2011, and 2016. You can run list_census_datasets to check what datasets are currently available for access through the CensusMapper API. Additional data for the 2016 Census will be included in CensusMapper within a day or two after public release by Statistics Canada. Statistics Canada maintains a release schedule for the Census 2016 Program which can be viewed on their website.

Thanks to contributions by the Canada Mortgage and Housing Corporation (CMHC), cancensus now includes additional Census-linked datasets as open-data releases. These include annual taxfiler data at the census tract level for tax years 2000 through 2017, which includes data on incomes and demographics, as well as specialized crosstabs for Structural type of dwelling by Document type, which details occupancy status for residences. These crosstabs are available for the 2001, 2006, 2011, and 2016 Census years at all levels starting with census tract.

The function list_census_datasets() will show all available datasets alongside their metadata.

list_census_datasets()
#> # A tibble: 28 × 6
#>    dataset description     geo_dataset attribution   reference reference_url    
#>    <chr>   <chr>           <chr>       <chr>         <chr>     <chr>            
#>  1 CA1996  1996 Canada Ce… CA1996      StatCan 1996… 92-351-U  https://www150.s…
#>  2 CA01    2001 Canada Ce… CA01        StatCan 2001… 92-378-X  https://www150.s…
#>  3 CA06    2006 Canada Ce… CA06        StatCan 2006… 92-566-X  https://www150.s…
#>  4 CA11    2011 Canada Ce… CA11        StatCan 2011… 98-301-X… https://www12.st…
#>  5 CA16    2016 Canada Ce… CA16        StatCan 2016… 98-301-X  https://www150.s…
#>  6 CA01xSD 2001 Canada Ce… CA01        StatCan 2001… 92-378-X  https://www150.s…
#>  7 CA06xSD 2006 Canada Ce… CA06        StatCan 2006… 92-566-X  https://www150.s…
#>  8 CA11xSD 2011 Canada Ce… CA11        StatCan 2011… 98-301-X  https://www150.s…
#>  9 CA16xSD 2016 Canada Ce… CA16        StatCan 2016… 98-301-X  https://www150.s…
#> 10 TX2000  2000 T1FF taxf… CA1996      StatCan 2000… 72-212-X  https://www150.s…
#> # … with 18 more rows

As other Census datasets become available via the CensusMapper API, they will be listed as output when calling list_census_datasets().
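Because the result is an ordinary data frame, it can be filtered like any other. The sketch below works on a hand-typed copy of two columns from the ten rows printed above, so it runs without an API key; reading geo_dataset as the Census geography a dataset is tied to is an interpretation of that output rather than something stated in it.

```r
# Two columns copied by hand from the list_census_datasets() output above
datasets <- data.frame(
  dataset     = c("CA1996", "CA01", "CA06", "CA11", "CA16",
                  "CA01xSD", "CA06xSD", "CA11xSD", "CA16xSD", "TX2000"),
  geo_dataset = c("CA1996", "CA01", "CA06", "CA11", "CA16",
                  "CA01", "CA06", "CA11", "CA16", "CA1996"),
  stringsAsFactors = FALSE
)

# Datasets that share the CA16 geography
on_2016_geography <- datasets$dataset[datasets$geo_dataset == "CA16"]

# Main Census datasets only: "CA" followed by digits and nothing else
main_census <- datasets$dataset[grepl("^CA[0-9]+$", datasets$dataset)]

print(on_2016_geography)
print(main_census)
```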

Census Regions

Census data is aggregated at multiple geographic levels. Census geographies at the national (C), provincial (PR), census metropolitan area (CMA), census agglomeration (CA), census division (CD), and census subdivision (CSD) levels are defined as named census regions.

Canadian Census geography can change between Census periods. cancensus provides a function, list_census_regions(dataset), to display all named census regions and their corresponding ids for a given census dataset.

list_census_regions("CA16")
#> # A tibble: 5,518 × 8
#>    region name              level     pop municipal_status CMA_UID CD_UID PR_UID
#>    <chr>  <chr>             <chr>   <int> <chr>            <chr>   <chr>  <chr> 
#>  1 01     Canada            C      3.52e7 <NA>             <NA>    <NA>   <NA>  
#>  2 35     Ontario           PR     1.34e7 <NA>             <NA>    <NA>   <NA>  
#>  3 24     Quebec            PR     8.16e6 <NA>             <NA>    <NA>   <NA>  
#>  4 59     British Columbia  PR     4.65e6 <NA>             <NA>    <NA>   <NA>  
#>  5 48     Alberta           PR     4.07e6 <NA>             <NA>    <NA>   <NA>  
#>  6 46     Manitoba          PR     1.28e6 <NA>             <NA>    <NA>   <NA>  
#>  7 47     Saskatchewan      PR     1.10e6 <NA>             <NA>    <NA>   <NA>  
#>  8 12     Nova Scotia       PR     9.24e5 <NA>             <NA>    <NA>   <NA>  
#>  9 13     New Brunswick     PR     7.47e5 <NA>             <NA>    <NA>   <NA>  
#> 10 10     Newfoundland and… PR     5.20e5 <NA>             <NA>    <NA>   <NA>  
#> # … with 5,508 more rows

The regions parameter in get_census requires as input a list of region id strings, each corresponding to that region's geoid. You can combine different regions together into region lists.

# Look up the region ids for Vancouver and Toronto
# (%>% and filter() come from dplyr)
library(dplyr)
list_census_regions('CA16') %>% 
  filter(level == "CMA", name %in% c("Vancouver","Toronto"))
#> # A tibble: 2 × 8
#>   region name      level     pop municipal_status CMA_UID CD_UID PR_UID
#>   <chr>  <chr>     <chr>   <int> <chr>            <chr>   <chr>  <chr> 
#> 1 35535  Toronto   CMA   5928040 B                <NA>    <NA>   35    
#> 2 59933  Vancouver CMA   2463431 B                <NA>    <NA>   59

census_data <- get_census(dataset='CA16', regions=list(CMA=c("59933","35535")),
                          vectors=c("v_CA16_408","v_CA16_409","v_CA16_410"),
                          level='CSD', use_cache = FALSE, quiet = TRUE)
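The region list can also be built from a filtered regions table rather than typed by hand. The sketch below uses base R on a hand-copied version of the two rows shown above, so it runs without an API key. cancensus also appears to provide a helper for this, as_census_region_list(); check the package documentation before relying on it.

```r
# Rows copied from the list_census_regions("CA16") output above
regions_tbl <- data.frame(
  region = c("35535", "59933"),
  name   = c("Toronto", "Vancouver"),
  level  = c("CMA", "CMA"),
  stringsAsFactors = FALSE
)

# Group region ids by level to get the list(LEVEL = c(ids)) shape
# that the regions parameter expects
region_list <- split(regions_tbl$region, regions_tbl$level)
str(region_list)
```

The resulting list has a single element named CMA holding both ids, matching the list(CMA = c(...)) form used in the call above.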

Census Geographic Levels

Census data accessible through cancensus is available at a number of different aggregation levels, including:

Code	Description	Count in Census 2016
C	Canada (total)	1
PR	Provinces/Territories	13
CMA	Census Metropolitan Area	35
CA	Census Agglomeration	14
CD	Census Division	287
CSD	Census Subdivision	713
CT	Census Tracts	5621
DA	Dissemination Area	56589
EA	Enumeration Area (1996 only)	-
DB	Dissemination Block (2001-2016)	489676
Regions	Named Census Region

Selecting regions = list(CMA = "59933") and level = "CT" will return data for all 478 census tracts in the Vancouver Census Metropolitan Area. Selecting level = "DA" will return data for all 3,450 dissemination areas, and selecting level = "DB" will retrieve data for 15,197 dissemination blocks. Working with CT, DA, EA, and especially DB level data significantly increases the size of data downloads and API usage. To help minimize additional overhead, cancensus supports local data caching to reduce usage and load times.

Setting level = "Regions" will produce data strictly for the selected region without any tiling of data at lower census aggregation levels.

Working with Census Variables

Census data contains thousands of different geographic regions as well as thousands of unique variables. In addition to enabling programmatic and reproducible access to Census data, cancensus has a number of tools to help users find the data they are looking for.

Displaying available Census variables

Run list_census_vectors(dataset) to view all available Census variables for a given dataset.

list_census_vectors("CA16")
#> # A tibble: 6,623 × 7
#>    vector     type   label      units parent_vector aggregation  details        
#>    <chr>      <fct>  <chr>      <fct> <chr>         <chr>        <chr>          
#>  1 v_CA16_401 Total  Populatio… Numb… <NA>          Additive     CA 2016 Census…
#>  2 v_CA16_402 Total  Populatio… Numb… <NA>          Additive     CA 2016 Census…
#>  3 v_CA16_403 Total  Populatio… Numb… <NA>          Average of … CA 2016 Census…
#>  4 v_CA16_404 Total  Total pri… Numb… <NA>          Additive     CA 2016 Census…
#>  5 v_CA16_405 Total  Private d… Numb… v_CA16_404    Additive     CA 2016 Census…
#>  6 v_CA16_406 Total  Populatio… Ratio <NA>          Average of … CA 2016 Census…
#>  7 v_CA16_407 Total  Land area… Numb… <NA>          Additive     CA 2016 Census…
#>  8 v_CA16_1   Total  Total - A… Numb… <NA>          Additive     CA 2016 Census…
#>  9 v_CA16_2   Male   Total - A… Numb… <NA>          Additive     CA 2016 Census…
#> 10 v_CA16_3   Female Total - A… Numb… <NA>          Additive     CA 2016 Census…
#> # … with 6,613 more rows

Variable characteristics

For each variable (vector) in that Census dataset, this shows:

Vector: short variable code
Type: variables are provided as aggregates of female responses, male responses, or total (male+female) responses
Label: detailed variable name
Units: provides information about whether the variable represents a count integer, a ratio, a percentage, or a currency figure
Parent_vector: shows the immediate hierarchical parent category for that variable, where appropriate
Aggregation: indicates how the variable should be aggregated with others, whether it is additive or if it is an average of another variable
Details: a rough description of a variable based on its hierarchical structure. This is constructed by cancensus by recursively traversing the labels for every variable’s hierarchy, and facilitates searching for specific variables using key terms.

Variable search

Each Census dataset features numerous variables, which can make it a challenge to find the exact variable you are looking for. The function find_census_vectors() searches through Census variable metadata in a few different ways. Three types of search are possible: exact search, which simply looks for exact string matches for a given query against the vector dataset; keyword search, which breaks vector metadata into unigram tokens and then tries to find the vectors with the greatest number of unique matches; and semantic search, which works better with search phrases and has tolerance for inexact searches. Switching between search modes is done using the query_type argument when calling find_census_vectors().

# Find the variable indicating the number of people of Australian ethnic origin
find_census_vectors("Australia", dataset = "CA16", type = "total", query_type = "exact")
#> # A tibble: 2 × 4
#>   vector      type  label      details                                          
#>   <chr>       <fct> <chr>      <chr>                                            
#> 1 v_CA16_3813 Total Australia  25% Data; Citizenship and Immigration; Total - S…
#> 2 v_CA16_4809 Total Australian 25% Data; Minority / Origin; Total - Ethnic orig…

find_census_vectors("Australia origin", dataset = "CA16", type = "total", query_type = "semantic")
#> # A tibble: 1 × 4
#>   vector      type  label      details                                          
#>   <chr>       <fct> <chr>      <chr>                                            
#> 1 v_CA16_4809 Total Australian 25% Data; Minority / Origin; Total - Ethnic orig…

find_census_vectors("Australian ethnic", dataset = "CA16", type = "total", query_type = "keyword", interactive = FALSE)
#> # A tibble: 1 × 4
#>   vector      type  label      details                                          
#>   <chr>       <fct> <chr>      <chr>                                            
#> 1 v_CA16_4809 Total Australian 25% Data; Minority / Origin; Total - Ethnic orig…
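To give a feel for how a keyword search of this kind can work, the sketch below tokenizes a query and some metadata into lowercase unigrams and keeps the entries with the most unique matches. It is an illustration of the general idea only, not the code find_census_vectors() uses; the helper names, vector ids, and details strings are all made up for the example.

```r
# Illustrative metadata only; ids and strings are invented for this sketch
vectors <- data.frame(
  vector  = c("v_demo_1", "v_demo_2", "v_demo_3"),
  details = c("Ethnic origin; Australian",
              "Citizenship and Immigration; Place of birth; Australia",
              "Ethnic origin; Austrian"),
  stringsAsFactors = FALSE
)

# Split text into unique lowercase unigram tokens
tokenize <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  unique(tokens[nzchar(tokens)])
}

# Keep the rows whose details share the most unique tokens with the query
keyword_search <- function(query, metadata) {
  query_tokens <- tokenize(query)
  scores <- vapply(
    metadata$details,
    function(d) sum(query_tokens %in% tokenize(d)),
    integer(1),
    USE.NAMES = FALSE
  )
  metadata[scores == max(scores) & scores > 0, , drop = FALSE]
}

result <- keyword_search("Australian ethnic", vectors)
print(result)
```

Because matching is on whole tokens, "Australian" does not match "Australia" here, which is one reason a more tolerant semantic mode is useful alongside keyword search.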

Managing variable hierarchy

Census variables are frequently hierarchical. As an example, consider the variable for the number of people of Austrian ethnic background. We can select that vector and quickly look up its entire hierarchy using parent_census_vectors on a vector list.

list_census_vectors("CA16") %>% 
  filter(vector == "v_CA16_4092") %>% 
  parent_census_vectors()
#> # A tibble: 3 × 7
#>   vector      type  label         units parent_vector aggregation details       
#>   <chr>       <fct> <chr>         <fct> <chr>         <chr>       <chr>         
#> 1 v_CA16_4089 Total Western Euro… Numb… v_CA16_4044   Additive    CA 2016 Censu…
#> 2 v_CA16_4044 Total European ori… Numb… v_CA16_3999   Additive    CA 2016 Censu…
#> 3 v_CA16_3999 Total Total - Ethn… Numb… <NA>          Additive    CA 2016 Censu…
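The upward traversal can be pictured as repeatedly following the parent_vector column until it is empty. The sketch below does this in base R on a small hand-built table; walk_parents() is a hypothetical helper, not the package's implementation, and the link from v_CA16_4092 to v_CA16_4089 is inferred from the output above rather than shown in it.

```r
# Parent links taken from the output above; the first row's parent is inferred
hierarchy <- data.frame(
  vector        = c("v_CA16_4092", "v_CA16_4089", "v_CA16_4044", "v_CA16_3999"),
  parent_vector = c("v_CA16_4089", "v_CA16_4044", "v_CA16_3999", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow parent_vector links until a vector has no parent
walk_parents <- function(vector_id, hierarchy) {
  parents <- character(0)
  current <- vector_id
  repeat {
    parent <- hierarchy$parent_vector[match(current, hierarchy$vector)]
    if (is.na(parent)) break
    parents <- c(parents, parent)
    current <- parent
  }
  parents
}

ancestors <- walk_parents("v_CA16_4092", hierarchy)
print(ancestors)
```

The ancestors come back nearest first, in the same order as the rows in the output above.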

Sometimes we want to traverse the hierarchy in the opposite direction. This is frequently required when looking to compare different variable stems that share the same aggregate variable. As an example, if we want to look at the total count of Northern European ethnic origin respondents disaggregated by individual countries, it is pretty easy to do so.

# Find the variable indicating the Northern European aggregate
find_census_vectors("Northern European", dataset = "CA16", type = "Total")
#> # A tibble: 7 × 4
#>   vector      type  label                     details                           
#>   <chr>       <fct> <chr>                     <chr>                             
#> 1 v_CA16_4122 Total Northern European origin… 25% Data; Minority / Origin; Tota…
#> 2 v_CA16_4125 Total Danish                    25% Data; Minority / Origin; Tota…
#> 3 v_CA16_4128 Total Finnish                   25% Data; Minority / Origin; Tota…
#> 4 v_CA16_4131 Total Icelandic                 25% Data; Minority / Origin; Tota…
#> 5 v_CA16_4134 Total Norwegian                 25% Data; Minority / Origin; Tota…
#> 6 v_CA16_4137 Total Swedish                   25% Data; Minority / Origin; Tota…
#> 7 v_CA16_4140 Total Northern European origin… 25% Data; Minority / Origin; Tota…

The search result shows that the vector v_CA16_4122 represents the aggregate for all Northern European origins. The child_census_vectors function can return a list of its constituent underlying variables.

# Show all child variable leaves
list_census_vectors("CA16") %>% 
  filter(vector == "v_CA16_4122") %>% child_census_vectors(leaves = TRUE)
#> # A tibble: 6 × 7
#>   vector      type  label      units parent_vector aggregation details          
#>   <chr>       <fct> <chr>      <fct> <chr>         <chr>       <chr>            
#> 1 v_CA16_4125 Total Danish     Numb… v_CA16_4122   Additive    CA 2016 Census; …
#> 2 v_CA16_4128 Total Finnish    Numb… v_CA16_4122   Additive    CA 2016 Census; …
#> 3 v_CA16_4131 Total Icelandic  Numb… v_CA16_4122   Additive    CA 2016 Census; …
#> 4 v_CA16_4134 Total Norwegian  Numb… v_CA16_4122   Additive    CA 2016 Census; …
#> 5 v_CA16_4137 Total Swedish    Numb… v_CA16_4122   Additive    CA 2016 Census; …
#> 6 v_CA16_4140 Total Northern … Numb… v_CA16_4122   Additive    CA 2016 Census; …

The leaves parameter specifies whether intermediate aggregates are included or not. If TRUE, only the lowest-level variables are returned: the “leaves” of the hierarchical tree.
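The difference between the two settings is easiest to see on a chain with intermediate aggregates. The sketch below is a base R illustration, not the package's implementation: walk_children() is a hypothetical helper, and the table holds only the ethnic origin chain from the parent lookup earlier (with the v_CA16_4092 link inferred), so the real function would return far more rows.

```r
# Small hand-built hierarchy; the parent of v_CA16_4092 is inferred
hierarchy <- data.frame(
  vector        = c("v_CA16_3999", "v_CA16_4044", "v_CA16_4089", "v_CA16_4092"),
  parent_vector = c(NA, "v_CA16_3999", "v_CA16_4044", "v_CA16_4089"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect descendants depth-first.
# With leaves = TRUE, keep only vectors that are nobody's parent.
walk_children <- function(vector_id, hierarchy, leaves = FALSE) {
  has_parent <- !is.na(hierarchy$parent_vector)
  direct <- hierarchy$vector[has_parent & hierarchy$parent_vector == vector_id]
  out <- character(0)
  for (child in direct) {
    is_leaf <- !(child %in% hierarchy$parent_vector)
    if (!leaves || is_leaf) {
      out <- c(out, child)
    }
    out <- c(out, walk_children(child, hierarchy, leaves))
  }
  out
}

all_children  <- walk_children("v_CA16_3999", hierarchy)
leaf_children <- walk_children("v_CA16_3999", hierarchy, leaves = TRUE)
print(all_children)
print(leaf_children)
```

With leaves = FALSE the intermediate aggregates v_CA16_4044 and v_CA16_4089 are included; with leaves = TRUE only v_CA16_4092 remains.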