3 Methodology
This page provides an outline of the methodology behind indicator modelling / calculation. For more details, please refer to the notebooks (provided in the Appendices of this book).
3.1 Indicator selection
We resolved to use four indicators of interest, which were shortlisted after an initial review of the available data and software, as well as the interests of the stakeholders. These are:
- Air quality
- House prices
- Job accessibility
- Greenspace accessibility
For each of these indicators, we have detailed below how they are generated. This step represents the basic scenario.
We have two different types of indicators: while air quality and house prices are derived using models which are fit to actual data, accessibilities are directly calculated using geospatial data, so are not directly modelled/predicted using explanatory variables.
3.1.1 Accessibility calculation
For accessibility computations, we calculate the cumulative potential accessibility to green spaces and jobs (collectively ‘opportunities’). We consider a set of origins and opportunity destination points, and the time it takes to move between them on the road network for different transport modes.
In order to calculate the time it takes from each origin to reach each destination, we need to have a Time Travel Matrix (TTM) between origins and destinations for different means of transport. To build this, we can generate a graph/network starting from two data sources: one for the road network, and one for public transportation schedules. Once a TTM is built between a set of origins and a set of destinations for a given mode of transport, we can run accessibility analyses on it.
Origins: For this project, we take the coordinates of the Population Weighted Centroids (PWC) of Output Areas (OA) as the origins.
Destinations: For each destination, we need to provide the coordinates as well as a datum for the opportunity (or ‘supply’, in our case this is the land use). For example, this could be the number of jobs available. (Other opportunities, though not considered here, could be the number of schools, health centres, or any other countable feature.)
Jobs: The destinations used for job accessibility are the PWCs of the working population for each Workplace Zone (WPZ) in the 2011 Census (see original data and definition).
To obtain an opportunity measure (job counts), we make the approximation that the number of workers is equal to the number of available jobs. This is in order to preserve the spatial location of the jobs, which is needed in the calculation of the time travel matrix. In fact, the number of jobs in the Census is given at the OA level, but the PWC for this aggregated level can be off-centered in relation to where people work.
Finally, we define job accessibility as the number of jobs which are accessible by public transport and walking within 15 minutes.
Greenspace: Here, we use the open data set available from Ordnance Survey (OS). In particular, as a first approximation we consider the layer “Access points” from this datum. This gives the coordinates of the access points to green spaces in all of Great Britain, which we use as destinations.
As an opportunity measure, we count the total areas of the greenspace (layer “Sites”) to which these points give access to. The greenspace accessibility is then defined as the sum of the areas of greenspace sites reachable within 15 minutes.
A step-by-step description of the accessibility calculations is as follows:
Indicator definition
- Obtain greenspace sites from OS
- Filter out irrelevant categories (allotments, golf courses, bowling greens)
- Retain entrances on the edge of the sites
- Associate
- Get the area of each site
- Filter entrances within time threshold (15 min) of each origin (OA)
- Consider unique values for parks (can be reached in )
- Generate metric for each OA as the sum of reachable site size
Build time travel matrix (TTM)
We can build a TTM from two data sources: the road network, and a timetable for public transport. We can obtain the first by donwloading OpenStreetMap (OSM) data for the area of interest. Timetables for England public transport in GTFS format are available from UK2GTFS. We use GTFS data because it is more compatible with the ttm
calculation package.
Get time table for public transport > GTFS data
Get roads network > OSM data
Generate network graph >
r5
engine (Conveyal) >r5py
packageGenerate TTM for 4 different modes > ‘transit’, ‘bike’, ‘car’, ‘walking’
(An idea for future improvement would be to adjust the time threshold depending on the time of day / day of the week, as these would influence e.g. traffic conditions.)
Run accessibility analysis
- Get land use data (opportunities detailed above in Destinations)
- Jobs > run
tracc
package on the TTM for each transport mode:- Compute impedance function based on a 15 minute cost (cumulative)
- Setting up the accessibility object, i.e. joining the destination data to the travel time data
- Measuring potential accessibility to jobs as the cumulative sum of opportunities at destination per origin
- Greenspace > convert the
tracc
functions to work only on reachable sites’ area, on the TTM per transport mode:- Compute impedance function based on a 15 minute cost (cumulative)
- Join the destination data to the travel time data
- Per OA (origin), select all entrances which can be reached within the threshold time, and deduplicate entrances which correspond to the same park
- Calculate the metric as the sum of parks’ areas
3.1.2 Air quality index
We develop an air quality index as a composite of PM2.5, PM10, NO2 and SO2 values, derived from the UK AIR project by DEFRA. Data are available as a 1 km grid. The composite index follows the methodology of the European Environmental Agency (EEA), reflecting the relative health risk associated with the exposure to particle intensities. Quoting from the EEA:
The bands are based on the relative risks associated with short-term exposure to PM2.5, O3 and NO2, as defined by the World Health Organization in their Health Risks of Air Pollution in Europe (HRAPIE) project report.
The relative risk of exposure to PM2.5 is taken as the basis for driving the index, specifically, the increase in the risk of mortality per 10 µg/m3 increase in the daily mean concentration of PM2.5.
Assuming linearity across the relative risk functions for O3 and NO2, we calculate the concentrations of these pollutants that pose an equivalent relative risk to a 10 µg/m3 increase in the daily mean of PM2.5.
For PM10 concentrations, a constant ratio between PM10 and PM2.5 of 1:2 is assumed, in line with the World Health Organization´s air quality guidelines for Europe.
For SO2, the bands reflect the limit values set under the EU Air Quality Directive.
The relationship between PM2.5 : PM10 : NO2 : O3: SO2 is then equal to 1 : 2 : 4 : 5 : 10. The combined index can then be computed as \[Q_{\text{air}} = \frac{\text{PM2.5}}{1} + \frac{\text{PM10}}{2} + \frac{\mathrm{NO_2}}{4} + \frac{\mathrm{O_3}}{5} + \frac{\mathrm{SO_2}}{10}\]
Except for the O3, UK AIR reports all of these values as concentrations in µg/m3. O3 is reported in terms of the number of days above a threshold of 120 µg/m3, and thus cannot be used in this formula, but even EEA omits data when unavailable, so we can create the index based on the four other measurements.
It should also be borne in mind that UK AIR data are not direct measurements, but rather a model.
The data from the 1 km grid are spatially interpolated to Output Area geometries. The index itself is computed on the grid.
3.1.3 House price
An optimal way of working with house prices in the modelling exercise like ours is to use price per square metre (sqm). However, these data are not generally available at the Output Area or similar level. Luckily, we can retrieve such data from the paper “A new attribute-linked residential property price dataset for England and Wales 2011-2019”, by Chi et al.. This dataset contains individual house prices, total floor area, and a resulting price per sqm obtained from a combination of Land Registry Price Paid Data and Domestic Energy Performance Certificates.
We use the data from the years 2018–2019, and compute the mean price per sqm for each output area. Given the data are not fully up-to-date, the results of our modelling are presented as a percent increase or decrease compared to the baseline value rather than in absolute terms.
3.2 Explanatory variables
We curated a set of explanatory variables, which describe the urban environment and can be changed in order to model the different scenarios. These variables are chosen to have predictive power for the four indicators, and can be categorised into several groups according to their source.
3.2.1 Group 1: ONS population estimates
The first group consists of only one variable, namely, population estimates. We use the population estimates by the Office for National Statistics (ONS) at the OA level for mid-2020. The other data used in the project generally reflect the period between 2018 and 2020, so this was chosen in order to describe the compatible point in time. The dataset (titled ‘SAPE23DT10d’) is retrieved from the ONS website. Since it is already reported at the Output Area level, it does not need to be processed further.
3.2.2 Group 2: Workplace population by industry
The workplace population is obtained from Census 2011 data, and is reported on Workplace Zone geometries. We use preprocessed data from the Urban Grammar project which aggregated industries into the following groups:
A, B, D, E. Agriculture, energy and water
C. Manufacturing
F. Construction
G, I. Distribution, hotels and restaurants
H, J. Transport and communication
K, L, M, N. Financial, real estate, professional and administrative activities
O, P, Q. Public administration, education and health
R, S, T, U. Other
The data are then interpolated from Workplace Zones to Output Areas, giving us another 8 variables.
3.2.3 Group 3: CORINE Land Cover classification
We use CORINE Land Cover classification data for 2018, which are provided as polygons of contiguous areas belonging to the same class. We use the data extracted for Great Britain within the Urban Grammar project, and interpolate the data to Output Areas, which gives us the proportion of each OA covered by each class. We further filter out fully or nearly invariant classes and use only:
Discontinuous urban fabric
Continuous urban fabric
Non-irrigated arable land
Industrial or commercial units
Green urban areas
Pastures
Sport and leisure facilities
leading to 7 more variables.
3.2.4 Group 4: Urban morphometrics
Urban morphometrics offers a way of describing the physical built environment (buildings, streets) in a set of quantitative measurements which capture different aspects of morphological elements. We directly use the set measured within the Urban Grammar project, which are presented as individual characters (prior contextualisation), and interpolate their values from the original geometry (enclosed tessellation cells) to Output Areas. This way, we obtain a contextual version using the project-specific aggregation to OA. This gives us 59 morphometric variables.
3.2.5 Variable pruning and final variable set
In total, from these four groups, we have 75 explanatory variables. Given that some of these are collected for different purposes than modelling, and with different geographical extents in mind, it may happen that some of the variables are correlated within our limited area of interest (Tyne and Wear). This may negatively affect the performance of the predictive models, and it is better to minimise the number of such pairs. We, therefore, measure both the Pearson and Spearman correlation indices between all pairs of explanatory variables. For each pair of variables where both correlation indices have an absolute value larger than 0.8, we retain only one, ideally the one that is more interpretable.
This procedure resulted in 16 of the morphometric variables being removed, leaving us with 59 explanatory variables in total:
Group | # | Variable name |
---|---|---|
Population | 1 | population estimate |
Workplace | 2 | A, B, D, E. Agriculture, energy and water |
3 | C. Manufacturing | |
4 | F. Construction | |
5 | G, I. Distribution, hotels and restaurants | |
6 | H, J. Transport and communication | |
7 | K, L, M, N. Financial, real estate, professional and administrative activities | |
8 | O, P, Q. Public administration, education and health | |
9 | R, S, T, U. Other | |
Corine | 10 | Land cover [Discontinuous urban fabric] |
11 | Land cover [Continuous urban fabric] | |
12 | Land cover [Non-irrigated arable land] | |
13 | Land cover [Industrial or commercial units] | |
14 | Land cover [Green urban areas] | |
15 | Land cover [Pastures] | |
16 | Land cover [Sport and leisure facilities] | |
Morphometric | 17 | area of building |
18 | courtyard area of building | |
19 | circular compactness of building | |
20 | corners of building | |
21 | squareness of building | |
22 | equivalent rectangular index of building | |
23 | centroid - corner mean distance of building | |
24 | centroid - corner distance deviation of building | |
25 | orientation of building | |
26 | area of ETC | |
27 | circular compactness of ETC | |
28 | equivalent rectangular index of ETC | |
29 | covered area ratio of ETC | |
30 | cell alignment of building | |
31 | alignment of neighbouring buildings | |
32 | mean distance between neighbouring buildings | |
33 | perimeter-weighted neighbours of ETC | |
34 | mean inter-building distance | |
35 | width of street profile | |
36 | width deviation of street profile | |
37 | openness of street profile | |
38 | length of street segment | |
39 | linearity of street segment | |
40 | mean segment length within 3 steps | |
41 | node degree of junction | |
42 | local proportion of 3-way intersections of street network | |
43 | local proportion of 4-way intersections of street network | |
44 | local proportion of cul-de-sacs of street network | |
45 | local closeness of street network | |
46 | local cul-de-sac length of street network | |
47 | square clustering of street network | |
48 | local degree weighted node density of street network | |
49 | street alignment of building | |
50 | area covered by edge-attached ETCs | |
51 | buildings per meter of street segment | |
52 | reached ETCs by neighbouring segments | |
53 | reached ETCs by tessellation contiguity | |
54 | area of enclosure | |
55 | circular compactness of enclosure | |
56 | equivalent rectangular index of enclosure | |
57 | orientation of enclosure | |
58 | perimeter-weighted neighbours of enclosure | |
59 | area-weighted ETCs of enclosure |
3.3 Modelling / regression analysis
The accessibility indicators result from a deterministic model based on a fixed travel-time matrix. Hence, any new scenario being modelled can re-use it by changing only the raw values at the destinations. On the other hand, we need to develop predictive models to be able to see the changes in air quality and house prices.
Both of these models use the same classifier, a histogram-based gradient boosting regression tree as implemented in the scikit-learn
Python package, and the set of 59 explanatory variables listed above. To ensure we are able to properly capture the spatial nature of the indicators, we add another 59 features which are a spatial lag of the original set. The spatial lag is measured as a mean value of the variable within a set neighbourhood, the definition of which is empirically determined. We test the neighbourhoods based on contiguity (from order of contiguity 1 to 5 inclusive of lower-order neighbors), Euclidean distance (500 m, 1000 m, 2000 m), and their unions. Each option is then included as part of the grid search aimed at the selection of the best model parameters and the best definition of the spatial lag in relation to the model performance metrics (the coefficient of determination R2, mean squared error, and mean absolute error). Other model parameters which we also assess are the learning rate, the maximum number of iterations, and the maximum number of bins. For more details, see the implementation in the Air Quality notebook and the House Price notebook.
Furthermore, we have experimented with the geographical extent of the training data. While the original models were based only on the data from the study area (Tyne and Wear), we have tested models trained on other geographical subsets:
- the whole of England
- the whole of England, but excluding Greater London, which is known to be an outlier that may not help in predictive quality within Tyne and Wear
- Tyne and Wear plus all OAs belonging to “Urbanity” classes from the rest of England
The aim of this is to add additional robustness to the model by training on a wider set of data.
3.3.1 Resulting model specifications
The resulting models, selected based on their performance, are based on the following specifications:
Air Quality
- Spatial extent: Tyne and Wear plus all OAs belonging to “Urbanity” classes from the rest of England
- Spatial lag: A union of 5 orders of Queen contiguity and 2000 m distance band
- Model parameters:
learning_rate=0.2
,max_bins=64
,max_iter=1000
The model performance (R2) is ~0.8.
House Price
- Spatial extent: England excluding Greater London
- Spatial lag: 5 orders of Queen contiguity
- Model parameters:
learning_rate=0.1
,max_bins=128
,max_iter=1000
The model performance (R2) is ~0.5.
3.4 Building scenarios
Scenarios are constructed by changing four macro variables, which together drive a sampling mechanism which derives the 59 variables used in the modelling. The four variables are:
- Level of urbanity, which is a proxy for a signature type of spatial signatures as defined in the Urban Grammar project;
- Use of buildings, referring to the ratio of residential and commercial or industrial use;
- Job types, which determines whether jobs in each Output Area are more white-collar or blue-collar; and
- Greenspace, reflecting the amount of formal greenspace (i.e. parks).
Scenarios are modelled by specifying the macro variables on Output Areas where the change takes place. Any of the four variables can be specified in combination with any other. The algorithm then samples the data from either the baseline (capturing the existing state) or from the known distribution of values per signature type based on the country-wide data.
The first step in the sampling procedure is the selection of a signature type. If it is not changed, subsequent steps modify the baseline. Otherwise, we sample the values for a set signature type reflecting its common characteristics as observed across Great Britain. For example, if we specify the ‘local urbanity’ signature type for an Output Area which previously had ‘dense urban neighbourhoods’ assigned, we look at how ‘local urbanity’ usually looks like and sample all the required values from a narrow normal distribution centred on the median of the nationwide distribution. Further macro variables are adjusting these values. The use macro variable will change the variables of population and workplace population, job types will change the distribution of jobs between the job type categories, and greenspace will allocate new parks onto an Output Area (adjusting other values such as population accordingly).
Development of each scenario is then comprised of a few simple steps:
- Select Output Areas that are supposed to change
- Assessing the target level of urbanity for each Output Area
- Assessing other macro variables for each Output Area
- From these, sample the values to be used as input for the model
- Run the models to assess the effect of tested changes