Background

This package is intended for use in data management activities associated with fixed locations in space. The motivating fields include air and water quality monitoring where fixed sensors report at regular time intervals.

When working with environmental monitoring time series, one of the first things you have to do is create unique identifiers for each individual time series. In an ideal world, each environmental time series would have both a locationID and a deviceID that uniquely identify the specific instrument making measurements and the physical location where measurements are made. A unique timeseriesID could be produced as locationID_deviceID. Metadata associated with each timeseriesID would contain basic information needed for downstream analysis including at least:

timeseriesID, locationID, deviceID, longitude, latitude, ...

An extended time series for a mobile sensor would group by deviceID.
Multiple sensors placed at a single location could be grouped by locationID.
Maps would be created using longitude, latitude.
Time series measurements would be accessed from a secondary data table with timeseriesID column names.
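
As a minimal sketch of this idealized layout, the base R snippet below builds a timeseriesID from locationID and deviceID and sets up a secondary data table keyed by those IDs. All identifiers, coordinates and timestamps are invented for illustration and do not come from any real dataset.

```r
# Hypothetical metadata for three time series (all values are made up)
meta <- data.frame(
  locationID = c("loc_a", "loc_a", "loc_b"),
  deviceID   = c("dev_1", "dev_2", "dev_1"),
  longitude  = c(-122.3383, -122.3383, -120.6647),
  latitude   = c(47.55998, 47.55998, 47.59880),
  stringsAsFactors = FALSE
)

# Build the unique identifier as locationID_deviceID
meta$timeseriesID <- paste(meta$locationID, meta$deviceID, sep = "_")

# Group time series by location (co-located sensors) and by device
byLocation <- split(meta$timeseriesID, meta$locationID)
byDevice   <- split(meta$timeseriesID, meta$deviceID)

# Secondary data table: one time column plus one column per timeseriesID
measurements <- data.frame(
  datetime = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 3600 * 0:2
)
measurements[meta$timeseriesID] <- NA_real_

names(measurements)
```

Keeping measurements in a separate table keyed by timeseriesID means the metadata table stays small and can be filtered or joined without touching the data itself.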

Unfortunately, we are rarely supplied with a truly unique and truly spatial locationID. Instead we often use deviceID or an associated non-spatial identifier as a stand-in for locationID.

Complications we have seen include:

GPS-reported longitude and latitude can have jitter in the fourth or fifth decimal place making it challenging to use them to create a unique locationID.
Sensors are sometimes repositioned in what the scientist considers the “same location”.
Data from a single sensor goes through different processing pipelines using different identifiers and is later brought together as two separate time series.
The spatial scale of what constitutes a “single location” depends on the instrumentation and scientific question being asked.
Deriving location-based metadata from spatial datasets is computationally intensive unless saved and identified with a unique locationID.
Automated searches for spatial metadata occasionally produce incorrect results because of the finite resolution of spatial datasets and must be corrected by hand.
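
The GPS jitter problem in the first item can be illustrated with a short base R sketch. The coordinates below are hypothetical reports from a single stationary sensor. Building an ID directly from the coordinates is a naive scheme used only for illustration, not something the package does.

```r
# Two hypothetical GPS reports from the same stationary sensor
lon <- c(-122.33830, -122.33834)
lat <- c(47.55998, 47.56003)

# Naive ID built from raw coordinates: jitter yields two different IDs
naiveID <- sprintf("%.5f_%.5f", lon, lat)
length(unique(naiveID))

# Rounding to three decimals merges these two reports ...
roundedID <- sprintf("%.3f_%.3f", lon, lat)
length(unique(roundedID))

# ... but reports that straddle a rounding boundary still split
boundaryLat <- c(47.55949, 47.55951)
boundaryID <- sprintf("%.3f_%.3f", lon[1], boundaryLat)
boundaryID
```

Rounding only moves the problem to the edges of the rounding grid, which is one reason a distance-based comparison is more robust than any ID derived directly from the coordinates.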

Functionality

A solution to all these problems is possible if we store spatial metadata in simple tables in a standard directory. These tables will be referred to as collections. Location lookups can be performed with geodesic distance calculations where a longitude-latitude pair is assigned to a pre-existing known location if it is within distanceThreshold meters of that location. These lookups will be extremely fast.
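
To make the lookup idea concrete, here is a minimal base R sketch. It substitutes the haversine great-circle formula (a spherical approximation with an assumed Earth radius of 6371 km) for a true geodesic calculation. The functions haversineDistance() and findKnownLocation() are hypothetical helpers written for this illustration and are not part of the package.

```r
# Great-circle distance in meters using the haversine formula
haversineDistance <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 +
    cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Return the row index of the nearest known location if it lies within
# distanceThreshold meters, otherwise NA
findKnownLocation <- function(tbl, longitude, latitude, distanceThreshold) {
  d <- haversineDistance(longitude, latitude, tbl$longitude, tbl$latitude)
  i <- which.min(d)
  if (d[i] <= distanceThreshold) i else NA_integer_
}

# A tiny hypothetical collection of known locations
known <- data.frame(
  locationID = c("A", "B"),
  longitude  = c(-122.3383, -120.6647),
  latitude   = c(47.55998, 47.59880),
  stringsAsFactors = FALSE
)

# Roughly 50 m from location A: matches row 1
findKnownLocation(known, -122.3390, 47.5601, distanceThreshold = 500)

# Several kilometers from either location: no match
findKnownLocation(known, -122.4000, 47.5601, distanceThreshold = 500)
```

Because the distance calculation is vectorized over the whole table, one lookup against a few thousand known locations is a single arithmetic pass with no spatial dataset involved.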

If no previously known location is found, the relatively slow (seconds) creation of a new known location metadata record can be performed and then added to the growing collection.
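
The "look up first, create only when needed" pattern can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: distanceMeters() is a spherical haversine approximation, addIfUnknown() is a hypothetical helper, and the sequential locationID values are placeholders for whatever ID scheme a real collection uses.

```r
# Haversine distance in meters (spherical approximation, radius 6371 km)
distanceMeters <- function(lon1, lat1, lon2, lat2) {
  r <- pi / 180
  a <- sin((lat2 - lat1) * r / 2)^2 +
    cos(lat1 * r) * cos(lat2 * r) * sin((lon2 - lon1) * r / 2)^2
  2 * 6371000 * asin(pmin(1, sqrt(a)))
}

# Add a location to the collection only if no known location is nearby
addIfUnknown <- function(tbl, longitude, latitude, distanceThreshold) {
  if (nrow(tbl) > 0) {
    d <- distanceMeters(longitude, latitude, tbl$longitude, tbl$latitude)
    if (min(d) <= distanceThreshold) return(tbl)  # fast path: already known
  }
  # Slow path: in practice this is where spatial metadata would be derived
  newRecord <- data.frame(
    locationID = sprintf("loc_%03d", nrow(tbl) + 1),
    longitude  = longitude,
    latitude   = latitude,
    stringsAsFactors = FALSE
  )
  rbind(tbl, newRecord)
}

# Start with an empty collection and grow it
tbl <- data.frame(
  locationID = character(0),
  longitude  = numeric(0),
  latitude   = numeric(0),
  stringsAsFactors = FALSE
)
tbl <- addIfUnknown(tbl, -122.3383, 47.55998, 500)  # new location
tbl <- addIfUnknown(tbl, -122.3390, 47.56010, 500)  # about 50 m away: known
tbl <- addIfUnknown(tbl, -120.6647, 47.59880, 500)  # new location
nrow(tbl)
```

The expensive step runs once per distinct location, so its cost is amortized over every later lookup that falls within the threshold.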

For collections of stationary environmental monitors that only number in the thousands, the entire collection can be stored as either a .rda or .csv file and will be under a megabyte in size, making it fast to load. This small size also makes it possible to store multiple known locations files, each created with different locations and different distance thresholds, to address the needs of different scientific studies.

Example Usage

The package comes with some example known locations tables.

Let’s take some metadata we have for air quality monitors in Washington state and create a known locations table for them.

wa <- get(data("wa_airfire_meta", package = "MazamaLocationUtils"))
names(wa)

##  [1] "monitorID"             "longitude"             "latitude"             
##  [4] "elevation"             "timezone"              "countryCode"          
##  [7] "stateCode"             "siteName"              "agencyName"           
## [10] "countyName"            "msaName"               "monitorType"          
## [13] "siteID"                "instrumentID"          "aqsID"                
## [16] "pwfslID"               "pwfslDataIngestSource" "telemetryAggregator"  
## [19] "telemetryUnitID"

Creating a Known Locations table

We can create a known locations table for them with a minimum 500 meter separation between distinct locations:

library(MazamaLocationUtils)

# Initialize with standard directories
mazama_initialize()
setLocationDataDir("./data")

wa_monitors_500 <-
  table_initialize() %>%
  table_addLocation(wa$longitude, wa$latitude, distanceThreshold = 500)

Right now, our known locations table contains only automatically generated spatial metadata:

dplyr::glimpse(wa_monitors_500)

## Rows: 68
## Columns: 13
## $ locationID   <chr> "ddbb565d51fe74ba", "f4ac27b3de8b9c19", "efce6225e8b2b1b8…
## $ locationName <chr> "us.wa_ddbb56", "us.wa_f4ac27", "us.wa_efce62", "us.wa_3b…
## $ longitude    <dbl> -122.3383, -120.6647, -120.0231, -120.1051, -117.5890, -1…
## $ latitude     <dbl> 47.55998, 47.59880, 47.83861, 48.35412, 47.64535, 47.8852…
## $ elevation    <dbl> 3.19, 361.69, 338.56, 473.18, 729.72, 730.84, 13.03, 175.…
## $ countryCode  <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US…
## $ stateCode    <chr> "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA…
## $ countyName   <chr> "King", "Chelan", "Chelan", "Okanogan", "Spokane", "Steve…
## $ timezone     <chr> "America/Los_Angeles", "America/Los_Angeles", "America/Lo…
## $ houseNumber  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ street       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ city         <chr> "Seattle", "Leavenworth", "Chelan", "Okanogan County", "A…
## $ zip          <chr> "98106", "98826", "98816", "98856", "99001", "99040", "98…

Merging external metadata

Perhaps we would like to import some of the original metadata into our new table. This is a very common use case where non-spatial metadata like site name or agency responsible for a monitor can be added.

Just to make it interesting, let’s assume that our known locations table is already large and we are only providing additional metadata for a subset of the records.

# Use a subset of the wa metadata
wa_indices <- seq(5,65,5)
wa_sub <- wa[wa_indices,]

# Use a generic name for the location table
locationTbl <- wa_monitors_500

# Find the location IDs associated with our subset
locationID <- table_getLocationID(
  locationTbl, 
  longitude = wa_sub$longitude, 
  latitude = wa_sub$latitude, 
  distanceThreshold = 500
)

# Now add the "siteName" column for our subset of locations
locationData <- wa_sub$siteName
locationTbl <- table_updateColumn(
  locationTbl, 
  columnName = "siteName", 
  locationID = locationID, 
  locationData = locationData
)

# Let's see how we did
locationTbl_indices <- table_getRecordIndex(locationTbl, locationID)
locationTbl[locationTbl_indices, c("city", "siteName")]

## # A tibble: 13 × 2
##    city            siteName                  
##    <chr>           <chr>                     
##  1 Okanogan County Twisp-Ewell St            
##  2 <NA>            <NA>                      
##  3 Tacoma          Tacoma-Alexander Ave      
##  4 Auburn          Auburn 29th St            
##  5 Anacortes       Anacortes-202 Ave (SO-AQS)
##  6 Quincy          Quincy 3rd                
##  7 Tulalip Bay     Tulalip-Totem Beach Rd    
##  8 Shelton         Shelton-W Franklin        
##  9 Ellensburg      Ellensburg-Ruby St        
## 10 Wenatchee       Wenatchee-Fifth St        
## 11 Walla Walla     Walla Walla-12th St       
## 12 Chehalis        Chehalis-Market Blvd      
## 13 Tacoma          Tacoma-S 36th St

Very nice. We have added siteName to our known locations table for a more detailed description of each monitor’s location.

Finding known locations

The whole point of a known locations table is to speed up access to spatial and other metadata. Here’s how we can use it with a set of longitudes and latitudes that are not currently in our table.

# Create new locations near our known locations
lons <- jitter(wa_sub$longitude) 
lats <- jitter(wa_sub$latitude)

# Any known locations within 50 meters?
table_getNearestLocation(
  wa_monitors_500,
  longitude = lons,
  latitude = lats,
  distanceThreshold = 50
) %>% dplyr::pull(city)

##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA

# Any known locations within 500 meters?
table_getNearestLocation(
  wa_monitors_500,
  longitude = lons,
  latitude = lats,
  distanceThreshold = 500
) %>% dplyr::pull(city)

##  [1] NA           NA           NA           "Auburn"     "Anacortes" 
##  [6] "Quincy"     NA           "Shelton"    "Ellensburg" "Wenatchee" 
## [11] NA           NA           NA

# How about 5000 meters?
table_getNearestLocation(
  wa_monitors_500,
  longitude = lons,
  latitude = lats,
  distanceThreshold = 5000
) %>% dplyr::pull(city)

##  [1] "Okanogan County" NA                "Tacoma"          "Auburn"         
##  [5] "Anacortes"       "Quincy"          "Tulalip Bay"     "Shelton"        
##  [9] "Ellensburg"      "Wenatchee"       "Walla Walla"     "Chehalis"       
## [13] "Tacoma"

Standard Setup

Before using MazamaLocationUtils you must first install MazamaSpatialUtils and then install core spatial data with:

  library(MazamaSpatialUtils)
  setSpatialDataDir("~/Data/Spatial")
  
  installSpatialData("EEZCountries")
  installSpatialData("OSMTimezones")
  installSpatialData("NaturalEarthAdm1")
  installSpatialData("USCensusCounties")

Once the required datasets have been installed, the easiest way to set things up each session is with:

  library(MazamaLocationUtils)
  mazama_initialize()
  setLocationDataDir("~/Data/KnownLocations")

mazama_initialize() assumes spatial data are installed in the standard location and is just a wrapper for:

  MazamaSpatialUtils::setSpatialDataDir("~/Data/Spatial")
  
  MazamaSpatialUtils::loadSpatialData("EEZCountries.rda")
  MazamaSpatialUtils::loadSpatialData("OSMTimezones.rda")
  MazamaSpatialUtils::loadSpatialData("NaturalEarthAdm1.rda")
  MazamaSpatialUtils::loadSpatialData("USCensusCounties.rda")

Every time you table_save() your location table, a backup will be created so you can experiment without losing your work. File sizes are pretty tiny so you don’t have to worry about filling up your disk.
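
The general idea of keeping a backup on every save can be sketched in base R as shown below. This is not how table_save() is implemented. The helper saveWithBackup(), the ".bak" naming scheme and the use of .rds files are all assumptions made for this illustration.

```r
# Hypothetical helper: save an object, first copying any existing file
# to a timestamped backup alongside it
saveWithBackup <- function(tbl, path) {
  if (file.exists(path)) {
    stamp <- format(Sys.time(), "%Y%m%dT%H%M%S")
    backupPath <- paste0(path, ".", stamp, ".bak")
    file.copy(path, backupPath, overwrite = TRUE)
  }
  saveRDS(tbl, path)
  invisible(path)
}

# Work in a fresh temporary directory so nothing real is touched
demoDir <- tempfile("known_locations_")
dir.create(demoDir)
path <- file.path(demoDir, "wa_monitors_500.rds")

# First save: nothing to back up yet
saveWithBackup(data.frame(locationID = "A"), path)

# Second save: the previous version is copied to a .bak file first
saveWithBackup(data.frame(locationID = c("A", "B")), path)

list.files(demoDir)
```

With tables this small, keeping a copy per save costs very little disk space compared with the effort of recreating hand-corrected metadata.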

Best wishes for well organized spatial metadata!

Introduction to MazamaLocationUtils

Mazama Science

2022-01-16