geodimension: Definition of Geographic Dimensions

Jose Samos (jsamos@ugr.es)

2020-11-26

Introduction

The multidimensional data model was defined with the aim of supporting data analysis. In multidimensional systems, data is structured in facts and dimensions1. The geographic dimension plays a fundamental role in multidimensional systems. Apart from the analysis possibilities it offers, like any other dimension, it is very interesting to have the possibility of representing the reports obtained from multidimensional systems, using their geographic dimensions, on a map, or performing spatial analysis on them. This functionality is supported by package geomultistar.

To define a geographic dimension in a star schema, we need a table with attributes corresponding to the levels of the dimension. Additionally, we will also need one or more geographic layers to represent the data using this dimension. We can obtain this data from available vector layers of geographic information. In simple cases, one layer is enough; but we often need several layers related to each other. The relationships can be defined by common attribute values or can be inferred from the respective geographic information.

The main objective of this package is to support the definition of geographic dimensions from layers of geographic information that can be used in multidimensional systems. In particular, through package geomultistar they can be used directly.

The rest of this document is structured as follows: First, an illustrative example of how the package works is developed. Then, the document ends with conclusions.

An illustrative example

Suppose we have a multidimensional design on US data and the geographic dimension is defined at the city level. For each city we have its name and the code of the state in which it is located. It would be interesting to have other levels of detail in this dimension to be able to perform roll-up operations.

With this objective, we look for layers of geographic information. In United States Census Bureau we find layers at various levels of detail, including place, county, state, region, division and nation. We have obtained them (for place we have selected only the cities) and we have stored2 them in layer_us_city, layer_us_county, layer_us_state, layer_us_region, layer_us_division and layer_us_nation, respectively.

The layers are related to each other. In some cases they have attributes in common, in others, although there is a relationship, it is not explicitly defined. We can use geodimension to support the definition of these relationships. Once defined, it will also offer us support to exploit them and obtain information from them.

Thus, three phases can be distinguished:

Definition of levels

Each of the layers that we have obtained represents a conceptual level of the geographical dimension, in the package it is called geolevel. To define a geolevel, we need a layer and the set of attributes that make up the layer’s key (which uniquely identify each of its instances).

We can previously check if a set of attributes form a key of the layer using the check_key function.

library(tidyr)
library(sf)
library(geodimension)

names(layer_us_city)
#>  [1] "statefp"      "placefp"      "placens"      "geoid"        "name"        
#>  [6] "namelsad"     "lsad"         "classfp"      "pcicbsa"      "pcinecta"    
#> [11] "mtfcc"        "funcstat"     "aland"        "awater"       "shape_length"
#> [16] "shape_area"   "geoid_data"   "geometry"

check_key(layer_us_city, key = c("name", "statefp"))
#> [1] FALSE

check_key(layer_us_city, key = c("geoid"))
#> [1] TRUE

We might expect the city name (name) and state code (statefp) to be sufficient to identify a city, however, they are not3. We check that the geoid field is a valid key, therefore, it will be the one we use to define the geolevel.

We can check the geometry that is considered for the definition of the level by means of the get_geometry function (it simplifies the types into three: point, line, and polygon). In addition, we give each level a name to be able to refer to it, as shown below.

get_geometry(layer_us_city)
#> [1] "point"

city <-
  geolevel(name = "city",
           layer = layer_us_city,
           key = c("geoid"))

For county it is the same as for city, the name and the code of the state do not compose a valid key. In this case, the geometry is polygon. Additionally, a layer with another geometry can be associated using the add_geometry function or, in the case of point-type geometry, it can be generated from the polygon layer using the complete_point_geometry function, as is done below.

get_geometry(layer_us_county)
#> [1] "polygon"

county <-
  geolevel(name = "county",
           layer = layer_us_county,
           key = c("geoid")) %>%
  complete_point_geometry()

In the same way, the checks are made and the rest of the levels are defined. Although all the layers are of the polygon type, we generate the point layers only for those that we think may be needed.

In some cases, latitude and longitude are stored as fields in the attribute table, from which we can generate a layer of points using the coordinates_to_geometry function. We can later associate that layer to the corresponding level using the add_geometry function mentioned above.

Below is only the definition of the levels.

us_state_point <-
  coordinates_to_geometry(layer_us_state,
                          lon_lat = c("intptlon", "intptlat"))

state <-
  geolevel(name = "state",
           layer = layer_us_state,
           key = c("statefp")) %>%
  add_geometry(layer = us_state_point)

region <-
  geolevel(name = "region",
           layer = layer_us_region,
           key = c("regionce"))

division <-
  geolevel(name = "division",
           layer = layer_us_division,
           key = c("divisionce"))

nation <-
  geolevel(name = "nation",
           layer = layer_us_nation,
           key = c("name"))

Once the levels are defined, then we will define the dimension and the relationships between the levels.

Definition of relationships

To define a geodimension, we give it a name and start from any geolevel. Next, we add the rest of the geolevels in any order.

gd <-
  geodimension(name = "gd_us",
               level = region) %>%
  add_level(division) %>%
  add_level(state) %>%
  add_level(nation) %>%
  add_level(city) %>%
  add_level(county)

Next, we can define the relationships that we want to consider between the levels. In a relationship there are two parts, the lower level and the upper level. To define the relationships, the following points must be taken into account:

The relationships between state, region, division and nation are defined below.

gd <- gd %>%
  relate_levels(lower_level_name = "state",
                lower_level_attributes = c("division"),
                upper_level_name = "division") %>%
  relate_levels(lower_level_name = "division",
                upper_level_name = "region",
                by_geography = TRUE) %>%
  relate_levels(lower_level_name = "region",
                upper_level_name = "nation")

The relationship between state and division is defined by a state attribute that matches the division key.

The relationship between division and region is defined by geographic relationship because they do not have any attributes in common.

Finally, as nation has only one instance, it relates directly to all instances of the indicated level (in this case, region), no additional relationship is required.

In addition to these relationships there is a clear relationship between city and state and also between county and state. In both cases it can be defined by attributes.

gd <- gd %>%
  relate_levels(lower_level_name = "city",
                lower_level_attributes = c("statefp"),
                upper_level_name = "state") %>%
  relate_levels(lower_level_name = "county",
                lower_level_attributes = c("statefp"),
                upper_level_name = "state")

The relationship between city and county is a bit fuzzy (at least for me), it depends on the state, the city and also the county. In any case, if we are interested in establishing a relationship, we could define it between city and state using the geographical properties of the levels, since county includes polygon-type geometry.

gd <- gd %>%
  relate_levels(lower_level_name = "city",
                upper_level_name = "county",
                by_geography = TRUE)

We can check if all the instances have been related using the following function:

nrow(
  gd %>% get_unrelated_instances(lower_level_name = "city",
                                 upper_level_name = "county")
)
#> [1] 0

Since there are no unrelated instances, each city has been linked to the county whose boundaries contain it.

If we did a similar check for each defined relation, we would observe that all the instances of the lower levels have been related to those of the higher levels, except for the following:

nrow(
  gd %>% get_unrelated_instances(lower_level_name = "state",
                                 upper_level_name = "division")
)
#> [1] 1

If we look at the content:

state_key statefp region division statens geoid stusps name lsad mtfcc funcstat aland awater intptlat intptlon shape_length shape_area geoid_data
52 72 9 0 01779808 72 PR Puerto Rico 00 G4000 A 8.869e+09 4.922e+09 +18.2176480 -066.4107992 7.428 1.178 04000US72

The region or division code does not correspond to any of the existing ones. If we want to relate all the instances of state to the nation level, we must define the relationship explicitly (there are other solutions but we will not consider them here, in any case they can be generated from this one).

gd <- gd %>%
  relate_levels(lower_level_name = "state",
                upper_level_name = "nation")

With these operations we have defined a geodimension.

Obtaining information

From a geodimension we can obtain information in table or layer format, to define a geographic dimension in a star schema. We can also define new versions of the dimension.

We can consult the levels of the geodimension using the following function:

gd %>%
  get_level_names()
#> [1] "city"     "county"   "division" "nation"   "region"   "state"

A new geodimension is defined by selecting a subset of levels, which we want to take into account when obtaining information, or to define new dimensions.

gds <- gd %>%
  select_levels(level_names = c("state", "division", "region", "nation"))

gds %>%
  get_level_names()
#> [1] "division" "nation"   "region"   "state"

From any level of the geodimension, a data table can be obtained that includes only the data of the level or all the data inherited from higher levels. For each level you can indicate whether or not a prefix is added to identify the origin of the fields. By default it is added, as you can see below.

ld <- gd %>%
  get_level_data(level_name = "division")
names(ld)
#> [1] "division_key" "divisionce"   "affgeoid"     "geoid"        "name"        
#> [6] "lsad"         "aland"        "awater"

ld <- gd %>%
  get_level_data(level_name = "division",
                 inherited = TRUE)
names(ld)
#>  [1] "division_key"      "divisionce"        "affgeoid"         
#>  [4] "geoid"             "name"              "lsad"             
#>  [7] "aland"             "awater"            "NATION_nation_key"
#> [10] "NATION_name"       "NATION_affgeoid"   "NATION_geoid"     
#> [13] "REGION_region_key" "REGION_regionce"   "REGION_affgeoid"  
#> [16] "REGION_geoid"      "REGION_name"       "REGION_lsad"      
#> [19] "REGION_aland"      "REGION_awater"

Previously, if we need it, we can obtain the name of the levels from which a level will inherit attributes.

gd %>%
  get_higher_level_names(level_name = "state",
                         indirect_levels = TRUE)
#> [1] "division" "nation"   "region"

ld <- gd %>%
  get_level_data(level_name = "state",
                 inherited = TRUE)
names(ld)
#>  [1] "state_key"             "statefp"               "region"               
#>  [4] "division"              "statens"               "geoid"                
#>  [7] "stusps"                "name"                  "lsad"                 
#> [10] "mtfcc"                 "funcstat"              "aland"                
#> [13] "awater"                "intptlat"              "intptlon"             
#> [16] "shape_length"          "shape_area"            "geoid_data"           
#> [19] "DIVISION_division_key" "DIVISION_divisionce"   "DIVISION_affgeoid"    
#> [22] "DIVISION_geoid"        "DIVISION_name"         "DIVISION_lsad"        
#> [25] "DIVISION_aland"        "DIVISION_awater"       "NATION_nation_key"    
#> [28] "NATION_name"           "NATION_affgeoid"       "NATION_geoid"         
#> [31] "REGION_region_key"     "REGION_regionce"       "REGION_affgeoid"      
#> [34] "REGION_geoid"          "REGION_name"           "REGION_lsad"          
#> [37] "REGION_aland"          "REGION_awater"

For a level we can obtain the available geometries and a layer with the attribute configuration we want and the selected geometry.

gd %>%
  get_level_geometries(level_name = "state")
#> [1] "point"   "polygon"

ll <- gd %>%
  get_level_layer(level_name = "state",
                  geometry = "polygon",
                  only_key = TRUE)

plot(ll)

In addition to these functions, the package offers the possibility to change the CRS of all the layers of a geodimension using the transform_crs function.

Conclusions

The geographic dimension is very relevant for multidimensional systems. We can enrich a basic geographic dimension through information available in vector layers, generally, we will need several layers.

Relationships between layers can be established through attributes or through the geographic relationships between their instances. The definition of these relationships can be systematized and is in part what is intended in this package.

Additionally, once a geodimension has been defined, with the support of this package, we can easily obtain the attribute table with all the attributes of the levels that we want (if we need it), and also the layers with associated geographic information.


  1. Basic concepts of dimensional modelling and star schemas are presented in starschemar vignettes.↩︎

  2. To take up less space in the package, its geometry has also been simplified.↩︎

  3. For our example, this is not a problem, just a warning that we must pay attention when we select the cities, to determine if any are repeated.↩︎