Introduction

This document illustrates the tools that compx provides for studying spatial segregation using network analysis, especially spatial community detection: identifying clusters of spatially contiguous, demographically homogeneous communities.

Preparation

In the previous vignette, on computing with the metric tensor, we obtained compx and prepared our data. If the following code looks unfamiliar, please take a moment to read that vignette.

Libraries

library(compx)        # main analysis functions
library(tidyverse)    # for data manipulation and viz
library(igraph)       # for network analysis
library(RColorBrewer) # for plots
library(ggthemes)     # for theme_map
library(units)        # needed for sf
library(sf)           # needed for io and visualization in this vignette
library(scales)       # for ggplot2 viz
library(RSpectra)     # fast eigenvalues, also used within compx

Data Input

The network-related functions of compx have the same data input requirements as the metric tensor computation functions. As a reminder, the functions provided by package compx assume that your data is expressed in two components. The first is a SpatialPolygonsDataFrame (spdf) containing geographic polygons (“tracts”). For illustrative purposes, compx comes bundled with an spdf of the Census tracts for Wayne County, Michigan, which includes the urban core of Detroit, an oft-analyzed city in quantitative segregation studies. The tracts were originally accessed via the R package tigris.

The second data input must be a demographic table with class in c('tbl_df', 'tbl', 'data.frame'), which I will refer to colloquially as a “data frame.” compx includes an example in detroit_race, giving racial demographic counts for each tract for the decennial Censuses of 1970, 1980, 1990, 2000, and 2010. Due to changing questions and collection methods, only data from 1990 onward are comparable with 2010. The demographic data was assembled and normalized by the Project on Diversity and Disparities at Brown University.

Three columns are required:

  1. tract, the key relating the data to the corresponding spdf. compx assumes that tract matches the GEOID column of spdf@data.
  2. group, the demographic category (such as racial group, in this case).
  3. n, the count of members of group in each tract.

Additionally, you may include an optional column for time t. compx functions will automatically use this column if detected; if you don’t want to do temporal analysis, you should delete the \(t\) column if you have one. Any additional columns are ignored.
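To make the expected schema concrete, here is a toy demographic table that compx would accept. The tract IDs and counts are invented purely for illustration; detroit_race follows the same pattern.

# a minimal table with the three required columns plus the optional t
toy_demographics <- data_frame(
    tract = c('26163000100', '26163000100', '26163000200', '26163000200'),
    group = c('Black', 'White', 'Black', 'White'),
    n     = c(1200, 340, 615, 890),
    t     = rep(2010, 4)
)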

We now prepare the data we’ll use:

# keep only tract-years with total population greater than 100
tracts_to_use <- detroit_race %>% 
    group_by(tract, t) %>%
    summarise(n = sum(n)) %>%
    filter(n > 100) 

# restrict the polygons to those tracts
tracts <- detroit_tracts %>% 
    filter(GEOID %in% tracts_to_use$tract)

# demographic counts for 2010 only; drop t for the static analysis
race_2010 <- detroit_race %>%
    filter(t == 2010) %>%
    select(-t)

Network Analysis

Computing the Graph

Because network analysis is fundamental to the idea of compx, we provide a simple function to compute networks (as igraph objects) from input data.

Similarly to the hessian argument of compute_metric, here you need to specify a divergence or comparison function between distributions. It should operate on vectors of nonnegative real numbers n and m (note: do NOT assume that these are normalized), and return a real number. It should also be symmetric. An example is the Jensen-Shannon metric, defined below in terms of the KL divergence. The Jensen-Shannon distance is a logical choice for categorical variables, like race. For ordinal variables, a different divergence should be substituted for DKL below. Alternative weightings of the two terms are also possible.

js_dist <- function(n, m){
    
    N <- sum(n)    # total population of the first tract
    M <- sum(m)    # total population of the second tract
    
    p <- n/N       # normalize counts to distributions
    q <- m/M
    
    # population-weighted Jensen-Shannon divergence; the square root
    # converts it into a metric
    sqrt((N/(M+N))*DKL(p, p+q) + (M/(M+N))*DKL(q, p+q))
}
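The function above relies on a KL-divergence helper DKL from the previous vignette. If it isn’t already in your session, the following sketch is consistent with the usage above: it normalizes both arguments internally, so that DKL(p, p + q) compares p against the mixture (p + q)/2, which is exactly the Jensen-Shannon building block. Check the metric tensor vignette for the canonical definition.

# KL divergence between the normalized versions of its arguments;
# na.rm = TRUE discards the 0 * log(0) terms that arise when a group
# is entirely absent from a tract
DKL <- function(n, m){
    p <- n / sum(n)
    q <- m / sum(m)
    sum(p * log(p / q), na.rm = TRUE)
}

js_dist(c(10, 0), c(0, 10))  # maximally dissimilar tracts: sqrt(log(2)), about 0.83
js_dist(c(5, 5), c(50, 50))  # identical compositions: 0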

Having defined an appropriate divergence, we can now construct the graph:

g <- construct_information_graph(tracts = tracts, 
                                 data = race_2010, 
                                 divergence = js_dist)
g
## IGRAPH UN-- 606 1516 -- 
## + attr: name (v/c), n (v/x), x (v/n), y (v/n), dist (e/n)
## + edges (vertex names):
##  [1] 26163526500--26163526400 26163526500--26163526200
##  [3] 26163526500--26163526300 26163526500--26163534600
##  [5] 26163526500--26163985000 26163526500--26163573600
##  [7] 26163574202--26163576000 26163574202--26163574300
##  [9] 26163574202--26163574100 26163574202--26163575400
## [11] 26163546600--26163546500 26163546600--26163546100
## [13] 26163546600--26163546700 26163546600--26163546300
## [15] 26163541500--26163540900 26163541500--26163541700
## + ... omitted several edges

g is an igraph object with edge and node attributes. The dist edge attribute reflects how information-geometrically dissimilar the connected nodes are; this dissimilarity is just the information distance between the nodes, calculated using the metric tensor. It’s useful to visualize this network. In the plot below, dissimilar tracts have thin edges between them, while very similar ones are joined by thick edges.
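Before plotting, it can help to glance at the distribution of these distances numerically:

# quick numerical summary of the edge-wise information distances; in
# the plot below, exp(-7 * dist^2) maps small distances to thick edges
# and large distances to nearly invisible ones
summary(E(g)$dist)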

edges <- g %>% as_long_data_frame() %>% tbl_df()
nodes <- data_frame(x = V(g)$x, y = V(g)$y)

ggplot() + 
    geom_sf(aes(), fill = 'firebrick', data = tracts, alpha = .6, size = 0) + 
    geom_segment(aes(x = from_x, 
                     y = from_y, 
                     xend = to_x, 
                     yend = to_y, 
                     size = exp(-7 * dist^2)), color = 'black', data = edges) +
    ggthemes::theme_map() +
    scale_size_continuous(range = c(0,1)) + 
    guides(alpha = 'none', color = 'none', size = 'none')

Comparing to the figures in the previous vignette, you might notice that areas with thin or invisible edges correspond to the same “border” areas that we saw highlighted by the spatial trace. This is by design.

The Scale of Segregation

We’re interested in determining the “scale” of segregation – the characteristic spatial resolution or resolutions at which racial differences are most pronounced. There are many good ways to get at this quantitatively; one is inspired by a contemporary machine-learning algorithm called spectral clustering (see, e.g., von Luxburg’s “A Tutorial on Spectral Clustering”). The idea is to compute the Laplacian matrix corresponding to the graph g and then compute its spectrum of eigenvalues. “Gaps” in the eigenvalue spectrum correspond to strong “signals” about the spatial scale of separation: intuitively, it “makes sense” to cut the structure into that many pieces.

A   <- affinity_matrix(g, sigma = 100)          # squared-exponential affinity with bandwidth sigma
L   <- normalized_laplacian(A)                  # L_{rw} in the tutorial cited above
evL <- eigs(L, 50, which = 'LM', sigma = 1e-10) # 50 eigenvalues nearest zero, via shift-invert

data.frame(n = 1:30, ev = 1 - rev(Re(evL$values))[1:30]) %>%
    ggplot() +
    aes(x = n, y = ev) +
    geom_line() +
    geom_point() +
    # scale_y_continuous(trans = 'log10') + 
    geom_vline(xintercept  = 7.5, linetype = 'dashed') + 
    geom_vline(xintercept  = 9.5, linetype = 'dashed') + 
    geom_vline(xintercept = 10.5, linetype = 'dashed') + 
    theme_bw()

For example, we see that in Detroit, under the Jensen-Shannon distance, there are candidate gaps around \(n = 7\), \(n = 9\), and \(n = 10\). Of course, it’s important to realize that some judgment is required in identifying these hard cutoffs – it’s not any single cutoff but the full spectrum that most fully describes the community structure.
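Eyeballing the plot can be supplemented with a simple heuristic – just a sketch, not a compx function – that ranks the successive differences of the plotted spectrum:

# a large jump between the n-th and (n + 1)-th values suggests cutting
# the graph into n clusters
spec <- 1 - rev(Re(evL$values))[1:30]
data.frame(n = 1:29, gap = diff(spec)) %>% 
    arrange(desc(gap)) %>% 
    head()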

Spatial Community Detection

Spectral Clustering

The nice thing about taking inspiration from spectral clustering is that it also gives us a way to actually find the clusters. The idea is to do k-means clustering in the eigenspace of the Laplacian matrix. Since k-means is sensitive to its initial conditions, we’ll run it 1000 times and choose the best result.

set.seed(1234) # for reproducibility

k <- 7         # number of clusters, determined from previous graph
nreps <- 1000  # number of repetitions

# Extract the eigenspace of L corresponding to the first k components
Z <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)] 

# run kmeans nreps times, recording each run's total within-cluster
# sum of squares as a performance column
models <- data_frame(n = 1:nreps) %>% 
    mutate(model = map(n, ~ kmeans(Z, centers = k))) %>% 
    mutate(perf = map_dbl(model, ~.$tot.withinss))

# get the best model
model <- models %>% 
    filter(perf == min(perf))
km <- model$model[[1]]
  
# assign the best model's labels as vertex attributes to g  
g        <- g %>% set_vertex_attr(name = 'cluster', 
                                  index = colnames(A), 
                                  value = km$cluster)

We’ve also provided a convenience function that will do the same thing; it’s just a wrapper around the code above.

g <- g %>% spectral_cluster(sigma = 100, k = k, nreps = nreps)

Either way, now we can plot the result:

# extract as data frame. We already extracted the edges df above. 
nodes    <- data_frame(geoid = V(g)$name, cluster = V(g)$cluster)

pal <- scales::brewer_pal(palette = 'Set3')
pal <- colorRampPalette(pal(k))
pal <- pal(k)

tracts %>% 
    left_join(nodes, by = c('GEOID' = 'geoid')) %>% 
    ggplot() + 
    geom_sf(aes(fill = as.character(cluster)), size = .1) +
    ggthemes::theme_map() + 
    scale_fill_manual(values = pal) + 
    theme(legend.position = c(.8, .1)) + 
    guides(fill = guide_legend(title = 'Cluster'))

This is a reasonably strong result: the clustering has detected the predominant division between the white suburbs to the west (Cluster 7) and the predominantly black urban core (Cluster 3), as well as distinct areas like “Mexicantown” (Cluster 2), Grosse Pointe (Cluster 1), and Hamtramck (Cluster 6). It’s worth comparing this clustering to a demographic map of Detroit.

Agglomerative Hierarchical Clustering

Another approach is to use hierarchical clustering on the network. From a theoretical perspective, this approach has both virtues and drawbacks. Its main virtue is that it can be viewed as a form of direct information maximization; its main drawback is that it’s greedy and therefore has the potential to give very poor solutions. Empirically, however, it seems to do reasonably well on the data sets we work with. Agglomerative clustering may be performed either on the tracts and demographic data directly, or on the information graph. We’ll show the first way for the static case, and then use the information graph for the temporal case:

a <- h_clust(tracts, race_2010)

Now let’s take a look:

pal <- scales::brewer_pal(palette = 'Set3')
pal <- colorRampPalette(pal(k))
pal <- pal(k)

tracts %>% 
    mutate(cluster = a %>% cutree(k)) %>% # add cluster labels
    ggplot() + 
    geom_sf(aes(fill = as.character(cluster)), size = .1) +
    ggthemes::theme_map() + 
    scale_fill_manual(values = pal) + 
    theme(legend.position = c(.8, .1)) + 
    guides(fill = guide_legend(title = 'Cluster'))

We can see that we’ve extracted a similar set of clusters, but there are some differences; most notably, cluster 3 is more expansive than the corresponding cluster under spectral clustering. Whether this is desired behavior depends on the context. The difference may be ascribed either to the greedy nature of hierarchical clustering or to the fact that, unlike spectral methods, it weights tracts by population; separating out these considerations in practice may be difficult.
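One way to make the comparison concrete is to cross-tabulate the two labelings. This sketch assumes, as the plotting code above already does, that cutree(a, k) is ordered like the rows of tracts:

# cross-tabulate spectral vs. hierarchical labels; cluster numbers are
# arbitrary in both methods, so look for rows that concentrate their
# mass in a single column
comparison <- tracts %>% 
    st_set_geometry(NULL) %>%                 # drop the geometry column
    transmute(geoid = GEOID, 
              hierarchical = cutree(a, k)) %>% 
    left_join(nodes, by = 'geoid')            # nodes holds the spectral labels

table(spectral = comparison$cluster, hierarchical = comparison$hierarchical)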

Spatiotemporal Networks

It’s also possible to use compx to construct and analyze networks with multiple time-slices. Just like before, it’s necessary to use data that has a t column with appropriate values.

race_temporal <- detroit_race %>%
    filter(t %in% c(1990, 2000, 2010))          

g <- construct_information_graph(tracts, race_temporal, divergence = js_dist)
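Each vertex of the resulting graph represents a tract in a particular year. We can peek at the vertex attributes directly (a quick check; the geoid and t attributes are the same ones used by the plotting code later in this vignette):

# each vertex records its tract (geoid), its time slice (t), and a
# name combining the two
data_frame(name = V(g)$name, geoid = V(g)$geoid, t = V(g)$t) %>% 
    head()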

When a t column is provided, the names of g are concatenations of the GEOID of the tract and the corresponding value of \(t\). There’s an edge between each tract and its corresponding step forward or backward in time.

Spectral Clustering

We can use our same spectral clustering and agglomerative clustering techniques here as well. Let’s do spectral clustering first. Since the network we are clustering is more complex, one could justify allowing more clusters; here we’ll keep \(k = 7\).

k     <- 7
sigma <- 100
g     <- g %>% spectral_cluster(sigma = sigma, k = k, nreps = 1000)

These full spatiotemporal networks are harder to visualize in a workable way, but it’s easy to inspect the clusters over time.

nodes    <- data_frame(geoid = V(g)$geoid, t = V(g)$t, cluster = V(g)$cluster) 
    
pal <- scales::brewer_pal(palette = 'Set2')
pal <- colorRampPalette(pal(k))
pal <- pal(k)

tracts %>% 
    left_join(nodes, by = c('GEOID' = 'geoid')) %>% 
    ggplot() + 
    geom_sf(aes(fill = as.character(cluster)), size = .1) +
    ggthemes::theme_map() + 
    scale_fill_manual(values = pal) + 
    theme(legend.position = 'bottom',
          legend.justification = c(.5, .0),
          plot.margin=unit(c(0,0,0,0),"mm")) + 
    guides(fill = guide_legend(title = 'Cluster', nrow = 1)) + 
    facet_wrap(~t)

Spectral clustering is structurally sensitive to very large differences between tracts. Since tracts don’t tend to change dramatically over 10-year periods, the spectral solutions are relatively persistent over time. There is an exception: the historically white-dominated area centered on Grosse Pointe (Cluster 1) receded from 1990 to 2010, with its westernmost areas being “absorbed” into a predominantly black cluster.

We can easily visualize changing demographics within each of the clusters:

race_temporal %>% 
    left_join(nodes, by = c('t' = 't', 'tract' = 'geoid')) %>% 
    group_by(t, cluster, group) %>% 
    summarise(n = sum(n)) %>% 
    mutate(percentage = n / sum(n)) %>% 
    filter(!is.na(percentage)) %>% 
    ggplot() + 
    aes(x = t, y = percentage, color = group, group = group) + 
    geom_line() + 
    facet_wrap(~cluster) + 
    theme_bw()

Not only has the white-dominated area centered on Grosse Pointe become geographically smaller; even the area that remains majority white has become more diverse from 2000 to 2010.

Agglomerative Clustering

Agglomerative hierarchical clustering tends to be more willing to let individual tracts change clusters over time. This time, we’ll use the h_clust function on the information graph:

b <- h_clust(g)
data_frame(height = b$height, 
           n = length(b$height):1) %>% 
    mutate(value = lead(height) - height) %>% 
    filter(value > 0, !is.na(value)) %>% 
    ggplot() + 
    aes(x = n, y = value) + 
    geom_point() + 
    scale_x_continuous(limit = c(NA, 20)) 
    # scale_y_continuous(trans = 'log', limit = c(1e-6, NA)) # optional log scale

There’s a fair bit of subjectivity in interpreting plots like this. We see that there’s a huge information gain associated with using two clusters, which rapidly falls off. Let’s use \(k = 7\):

k <- 7
g <- g %>% set_vertex_attr('cluster', value = b %>% cutree(k))

# extract the nodes, with their cluster labels, as a data frame
nodes    <- data_frame(geoid = V(g)$geoid, t = V(g)$t, cluster = V(g)$cluster) 
    
pal <- scales::brewer_pal(palette = 'Set3')
pal <- colorRampPalette(pal(k))
pal <- pal(k)

tracts %>% 
    left_join(nodes, by = c('GEOID' = 'geoid')) %>% 
    ggplot() + 
    geom_sf(aes(fill = as.character(cluster)), size = .1) +
    ggthemes::theme_map() + 
    scale_fill_manual(values = pal) + 
    theme(legend.position = 'bottom',
          legend.justification = c(.5, .0)) + 
    guides(fill = guide_legend(title = 'Cluster', nrow = 1)) + 
    facet_wrap(~t)

Unlike spectral clustering, the hierarchical method tends to generate clusters that vary substantially over time. However, these clusters don’t track the visual boundaries in the data quite as strongly, suggesting that pure hierarchical methods may not be desirable here. It’s possible to use spectral methods as preprocessing to generate a form of regularization for subsequent hierarchical organization, but we won’t pursue that further here.

This version of hierarchical clustering is greedily information-maximizing, so it’s useful to ask how much information is captured by the clustering. This particular clustering uses 7 clusters to capture 0.2529845 nats of information, out of 0.258152 total – about 98%. So, despite the fact that our data set has 1833 tracts in it, we can “tell most of the story” with just 7 super-tracts.

Future Work

Priorities for future work on compx include refinement of clustering methods, performance enhancements, and improvements suggested by those in the planning and sociological communities. A particular future improvement would be the addition of a factor that increases the importance of temporal differences, allowing spectral methods to find more dynamic structure.

Session Information

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RSpectra_0.12-0    scales_0.4.1.9002  sf_0.5-3          
##  [4] units_0.4-5        ggthemes_3.2.0     RColorBrewer_1.1-2
##  [7] igraph_1.0.1       dplyr_0.5.0        purrr_0.2.2       
## [10] readr_1.0.0        tidyr_0.6.1        tibble_1.3.3      
## [13] ggplot2_2.2.1.9000 tidyverse_1.1.1    compx_0.0.0.9000  
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2   haven_1.0.0      lattice_0.20-34  colorspace_1.2-6
##  [5] htmltools_0.3.5  yaml_2.1.13      XML_3.98-1.4     rlang_0.1.1     
##  [9] foreign_0.8-67   DBI_0.7          sp_1.2-4         readxl_0.1.1    
## [13] modelr_0.1.0     plyr_1.8.4       stringr_1.1.0    rgeos_0.3-22    
## [17] munsell_0.4.3    gtable_0.2.0     rvest_0.3.2      psych_1.6.6     
## [21] evaluate_0.9     labeling_0.3     knitr_1.14       forcats_0.2.0   
## [25] maptools_0.8-39  acs_2.0          parallel_3.3.2   broom_0.4.1     
## [29] Rcpp_0.12.12     udunits2_0.13    formatR_1.4      jsonlite_1.2    
## [33] mnormt_1.5-4     hms_0.3          digest_0.6.12    stringi_1.1.2   
## [37] grid_3.3.2       tools_3.3.2      bitops_1.0-6     magrittr_1.5    
## [41] RCurl_1.95-4.8   lazyeval_0.2.0   Matrix_1.2-7.1   xml2_1.1.1      
## [45] lubridate_1.6.0  assertthat_0.1   rmarkdown_1.1    httr_1.2.1      
## [49] R6_2.2.0         nlme_3.1-128