Introduction to spnaf

The spnaf package is developed for calculating spatial network autocorrelation for flow data. Functions in the package are designed specifically to evaluate how networks are spatially clustered, in the form of \(G_{ij}\) statistic which is presented in the paper written by Berglund and Karlström.

Data: CA

The package has a dataset called CA which stands for California, US. This dataset contains migration amounts among CA counties in 2019. The data consists of origins and destinations of each residential flow.

dim(CA)
#> [1] 2580   12
head(CA)
#>   State.Code.of.Geography.A FIPS.County.Code.of.Geography.A
#> 1                       006                             001
#> 2                       006                             001
#> 3                       006                             001
#> 4                       006                             001
#> 5                       006                             001
#> 6                       006                             001
#>   State.U.S..Island.Area.Foreign.Region.Code.of.Geography.B
#> 1                                                       006
#> 2                                                       006
#> 3                                                       006
#> 4                                                       006
#> 5                                                       006
#> 6                                                       006
#>   FIPS.County.Code.of.Geography.B State.Name.of.Geography.A
#> 1                             005                California
#> 2                             007                California
#> 3                             009                California
#> 4                             011                California
#> 5                             013                California
#> 6                             015                California
#>   County.Name.of.Geography.A
#> 1             Alameda County
#> 2             Alameda County
#> 3             Alameda County
#> 4             Alameda County
#> 5             Alameda County
#> 6             Alameda County
#>   State.U.S..Island.Area.Foreign.Region.of.Geography.B
#> 1                                           California
#> 2                                           California
#> 3                                           California
#> 4                                           California
#> 5                                           California
#> 6                                           California
#>   County.Name.of.Geography.B Flow.from.Geography.B.to.Geography.A
#> 1              Amador County                                    0
#> 2               Butte County                                  304
#> 3           Calaveras County                                   49
#> 4              Colusa County                                    0
#> 5        Contra Costa County                                 8761
#> 6           Del Norte County                                    1
#>   Counterflow.from.Geography.A.to.Geography.B
#> 1                                          97
#> 2                                        1259
#> 3                                         349
#> 4                                          29
#> 5                                       18007
#> 6                                          95
#>   Net.Migration.from.Geography.B.to.Geography.A
#> 1                                           -97
#> 2                                          -955
#> 3                                          -300
#> 4                                           -29
#> 5                                         -9246
#> 6                                           -94
#>   Gross.Migration.between.Geography.A.and.Geography.B
#> 1                                                  97
#> 2                                                1563
#> 3                                                 398
#> 4                                                  29
#> 5                                               26768
#> 6                                                  96

Data: CA_polygon

The package also has a sf object called CA_polygon which is a sf class object that represents boundaries of CA counties. It has id column and geometry column and can be plotted by attaching the sf package. The polygon can be joined with CA since it has id column that matches County code of CA. You can learn more about how to deal with spatial objects at https://r-spatial.github.io/sf/.

library(sf)
plot(CA_polygon, col = 'white', main = 'CA polygon')

Function: Gij.polygon

spnaf package aims to measure spatial density of networks, which have origins (starting point) and destinations (ending point). Main function of spnaf is called Gij.polygon and the first main input of the function is df which is OD data in a data.frame form that must contain “oid”, “did”, and “n” (please refer to the help document) like CA above. The second important input is shape which is corresponding polygon object in sf class. The function also inherited two parameters from spdep such as queen, snap. The parameter method is one of c(“t”, “o”, “d”) which stand for total, origins only, and destinations only respectively (Please check this paper to get more information about the method). The last parameter n is used for bootstrapping permutation of resampling the individual statistic n times to generate a non-parametric distribution, since there would be a violation of the assumption of normality when one tries to calculate a spatial statistic with polygons(see how authors told about it in this paper). The process should be done to ensure a statistical significance of the statistic.

args(Gij.polygon)
#> function (df, shape, queen = TRUE, snap = 1, method = "t", R = 1000) 
#> NULL

How to execute

# Data manipulation
CA <- spnaf::CA
OD <- cbind(CA$FIPS.County.Code.of.Geography.B, CA$FIPS.County.Code.of.Geography.A)
OD <- cbind(OD, CA$Flow.from.Geography.B.to.Geography.A)
OD <- data.frame(OD)
names(OD) <- c("oid", "did", "n")
OD$n <- as.numeric(OD$n)
OD <- OD[order(OD[,1], OD[,2]),]
head(OD) # check the input df's format
#>     oid did     n
#> 63  001 005    97
#> 108 001 007  1259
#> 160 001 009   349
#> 201 001 011    29
#> 229 001 013 18007
#> 280 001 015    95

# Load sf polygon
CA_polygon <- spnaf::CA_polygon
head(CA_polygon) # it has geometry column
#> Simple feature collection with 6 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.4096 ymin: 35.90691 xmax: -118.3606 ymax: 41.46584
#> Geodetic CRS:  NAD83
#>    id                       geometry
#> 1 001 MULTIPOLYGON (((-122.3423 3...
#> 2 003 MULTIPOLYGON (((-120.0724 3...
#> 3 005 MULTIPOLYGON (((-121.0274 3...
#> 4 013 MULTIPOLYGON (((-122.4298 3...
#> 5 019 MULTIPOLYGON (((-120.9094 3...
#> 6 023 MULTIPOLYGON (((-124.4086 4...

# Execution of Gij.polygon with data above and given parameters
result <- Gij.polygon(df = OD, shape = CA_polygon, queen = TRUE, snap = 1,
method = 't', R = 1000)
#> (1) Creating Spatial Weights... Done !
#> note: Total 3306 network combinations are ready to be analyzed
#> note: which is calculated by the input polygon data
#> (2) Calculating Gij ... Done! 
#> (3) Do bootstrapping ... Done ! 
#> (4) Generating result in lines ... Done !

Interpretation of the result

The metric, an extended statistic of Getis and Ord (1992), \(G_{i}^{*}\), has similar intuition of hotspot analysis with static data: a high and significant value in a flow indicates spatial clustering of flows with high values. it can be interpreted as Z-value suitable for conducting statistical tests as the metric inherited the characteristics of \(G_{i}^{*}\). If one conducted bootstrapping for 1,000 times like above, those with a value greater than the 50th largest value of the distribution (i.e., at the significance level of 0.05) can be defined as positive clusters.

# positive clusters at the significance level of 0.05
head(result[[1]][result[[1]]$pval < 0.05,])
#>    oid did     n      Gij  pval
#> 6  001 013 18007 7.456884 0.008
#> 18 001 037  3957 2.754069 0.031
#> 20 001 041   959 2.281708 0.030
#> 23 001 047   586 2.548161 0.034
#> 29 001 059   953 1.960098 0.036
#> 33 001 067  3616 6.859656 0.012
# positive clusters at the significance level of 0.05 in lines class
head(result[[2]][result[[2]]$pval < 0.05,])
#> Simple feature collection with 6 features and 5 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -122.7238 ymin: 33.7029 xmax: -117.7608 ymax: 38.44905
#> Geodetic CRS:  NAD83
#>    oid did     n      Gij  pval                            sfc
#> 6  001 013 18007 7.456884 0.008 LINESTRING (-121.8887 37.64...
#> 18 001 037  3957 2.754069 0.031 LINESTRING (-121.8887 37.64...
#> 20 001 041   959 2.281708 0.030 LINESTRING (-121.8887 37.64...
#> 23 001 047   586 2.548161 0.034 LINESTRING (-121.8887 37.64...
#> 29 001 059   953 1.960098 0.036 LINESTRING (-121.8887 37.64...
#> 33 001 067  3616 6.859656 0.012 LINESTRING (-121.8887 37.64...

Visualization of all flows and Significant Flows(<0.05) only

library(tmap)
# plot all flows with the polygon (left)
tm_shape(CA_polygon) +
  tm_polygons()+
  tm_shape(result[[2]]) +
  tm_lines()
# plot significant flows only with the polygon (right)
tm_shape(CA_polygon) +
  tm_polygons()+
  tm_shape(result[[2]][result[[2]]$pval < 0.05,]) +
  tm_lines(col='pval')

Reference