3 Methodology

This page provides an outline of the methodology behind indicator modelling / calculation. For more details, please refer to the notebooks (provided in the Appendices of this book).

3.1 Indicator selection

We resolved to use four indicators of interest, which were shortlisted after an initial review of the available data and software, as well as the interests of the stakeholders. These are:

Air quality
House prices
Job accessibility
Greenspace accessibility

For each of these indicators, we detail below how it is generated. This step represents the baseline scenario.

We have two different types of indicators: while air quality and house prices are derived using models which are fit to actual data, accessibilities are directly calculated using geospatial data, so are not directly modelled/predicted using explanatory variables.

3.1.1 Accessibility calculation

For accessibility computations, we calculate the cumulative potential accessibility to green spaces and jobs (collectively ‘opportunities’). We consider a set of origins and opportunity destination points, and the time it takes to move between them on the road network for different transport modes.

In order to calculate the time it takes from each origin to reach each destination, we need to have a Time Travel Matrix (TTM) between origins and destinations for different means of transport. To build this, we can generate a graph/network starting from two data sources: one for the road network, and one for public transportation schedules. Once a TTM is built between a set of origins and a set of destinations for a given mode of transport, we can run accessibility analyses on it.

Origins: For this project, we take the coordinates of the Population Weighted Centroids (PWC) of Output Areas (OA) as the origins.
Destinations: For each destination, we need to provide the coordinates as well as a datum for the opportunity (or ‘supply’, in our case this is the land use). For example, this could be the number of jobs available. (Other opportunities, though not considered here, could be the number of schools, health centres, or any other countable feature.)
- Jobs: The destinations used for job accessibility are the PWCs of the working population for each Workplace Zone (WPZ) in the 2011 Census (see original data and definition).
  
  To obtain an opportunity measure (job counts), we make the approximation that the number of workers is equal to the number of available jobs. This is in order to preserve the spatial location of the jobs, which is needed in the calculation of the time travel matrix. In fact, the number of jobs in the Census is given at the OA level, but the PWC for this aggregated level can be off-centered in relation to where people work.
  
  Finally, we define job accessibility as the number of jobs which are accessible by public transport and walking within 15 minutes.
- Greenspace: Here, we use the open data set available from Ordnance Survey (OS). In particular, as a first approximation we consider the layer “Access points” from this dataset. This gives the coordinates of the access points to green spaces in all of Great Britain, which we use as destinations.
  
  As an opportunity measure, we take the total area of the greenspace sites (layer “Sites”) to which these points give access. The greenspace accessibility is then defined as the sum of the areas of greenspace sites reachable within 15 minutes.
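The job accessibility definition above (jobs reachable within 15 minutes) can be illustrated with a minimal pandas sketch. The column names (`from_id`, `to_id`, `travel_time`, `jobs`), the toy values, and the helper `cumulative_accessibility` are hypothetical, and treating the 15-minute threshold as inclusive is an assumption; the project itself relies on the tracc package for this step.

```python
import pandas as pd

# Hypothetical travel time matrix in long form: one row per origin-destination pair.
ttm = pd.DataFrame({
    "from_id": ["OA1", "OA1", "OA1", "OA2", "OA2"],
    "to_id": ["WPZ1", "WPZ2", "WPZ3", "WPZ1", "WPZ3"],
    "travel_time": [5.0, 14.0, 22.0, 16.0, 9.0],  # minutes
})

# Hypothetical opportunities: workers per Workplace Zone, used as a proxy for jobs.
jobs = pd.DataFrame({"to_id": ["WPZ1", "WPZ2", "WPZ3"], "jobs": [100, 250, 40]})


def cumulative_accessibility(ttm, opportunities, value_col, threshold=15.0):
    """Sum the opportunities reachable from each origin within `threshold` minutes."""
    joined = ttm.merge(opportunities, on="to_id", how="left")
    # ASSUMPTION: a destination at exactly the threshold counts as reachable.
    reachable = joined[joined["travel_time"] <= threshold]
    acc = reachable.groupby("from_id")[value_col].sum()
    # Origins with nothing reachable get 0 instead of dropping out of the result.
    return acc.reindex(ttm["from_id"].unique(), fill_value=0)


job_access = cumulative_accessibility(ttm, jobs, "jobs")
print(job_access)
```

In this toy example, OA1 reaches WPZ1 and WPZ2 within the threshold, while OA2 reaches only WPZ3.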

A step-by-step description of the accessibility calculations is as follows:

Indicator definition

Obtain greenspace sites from OS
- Filter out irrelevant categories (allotments, golf courses, bowling greens)
- Retain entrances on the edge of the sites
- Associate
(An idea for future improvement would be to obtain “missing” areas from Corine or OSM data.)
Get the area of each site
Filter entrances within time threshold (15 min) of each origin (OA)
Consider unique values for parks (can be reached in )
Generate metric for each OA as the sum of reachable site size

Build time travel matrix (TTM)

We can build a TTM from two data sources: the road network, and a timetable for public transport. We can obtain the first by downloading OpenStreetMap (OSM) data for the area of interest. Timetables for public transport in England are available in GTFS format from UK2GTFS. We use GTFS data because it is more compatible with the TTM calculation package.

Get time table for public transport > GTFS data
Get roads network > OSM data
Generate network graph > r5 engine (Conveyal) > r5py package
Generate TTM for 4 different modes > ‘transit’, ‘bike’, ‘car’, ‘walking’

(An idea for future improvement would be to adjust the time threshold depending on the time of day / day of the week, as these would influence e.g. traffic conditions.)

Run accessibility analysis

Get land use data (opportunities detailed above in Destinations)
Jobs > run tracc package on the TTM for each transport mode:
- Compute impedance function based on a 15 minute cost (cumulative)
- Set up the accessibility object, i.e. join the destination data to the travel time data
- Measure potential accessibility to jobs as the cumulative sum of opportunities at destination per origin
Greenspace > convert the tracc functions to work only on reachable sites’ area, on the TTM per transport mode:
- Compute impedance function based on a 15 minute cost (cumulative)
- Join the destination data to the travel time data
- Per OA (origin), select all entrances which can be reached within the threshold time, and deduplicate entrances which correspond to the same park
- Calculate the metric as the sum of parks’ areas
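The greenspace steps above differ from the jobs case mainly in the deduplication: a park with several reachable entrances must be counted once. A minimal pandas sketch of that logic follows. The column names (`site_id`, `site_area`), the toy values, and the helper `greenspace_accessibility` are hypothetical, and this is not the project's adapted tracc code.

```python
import pandas as pd

# Hypothetical travel times from Output Areas to greenspace entrances (minutes).
ttm = pd.DataFrame({
    "from_id": ["OA1", "OA1", "OA1", "OA2"],
    "to_id": ["E1", "E2", "E3", "E3"],
    "travel_time": [4.0, 12.0, 10.0, 30.0],
})

# Hypothetical entrances: E1 and E2 both lead to park P1, E3 leads to park P2.
entrances = pd.DataFrame({
    "to_id": ["E1", "E2", "E3"],
    "site_id": ["P1", "P1", "P2"],
    "site_area": [20000.0, 20000.0, 5000.0],  # square metres
})


def greenspace_accessibility(ttm, entrances, threshold=15.0):
    """Sum the area of distinct greenspace sites reachable within `threshold` minutes."""
    joined = ttm.merge(entrances, on="to_id", how="inner")
    reachable = joined[joined["travel_time"] <= threshold]
    # Keep one row per (origin, site) so multi-entrance parks are counted once.
    unique_sites = reachable.drop_duplicates(subset=["from_id", "site_id"])
    area = unique_sites.groupby("from_id")["site_area"].sum()
    # Origins with no reachable site get an area of 0.
    return area.reindex(ttm["from_id"].unique(), fill_value=0.0)


green_access = greenspace_accessibility(ttm, entrances)
print(green_access)
```

Without the `drop_duplicates` step, park P1 would be counted twice for OA1, overstating the accessible area.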

3.1.2 Air quality index

We develop an air quality index as a composite of PM2.5, PM10, NO₂ and SO₂ values, derived from the UK AIR project by DEFRA. Data are available as a 1 km grid. The composite index follows the methodology of the European Environment Agency (EEA), reflecting the relative health risk associated with exposure to each pollutant. Quoting from the EEA:

The bands are based on the relative risks associated with short-term exposure to PM2.5, O₃ and NO₂, as defined by the World Health Organization in their Health Risks of Air Pollution in Europe (HRAPIE) project report.

The relative risk of exposure to PM2.5 is taken as the basis for driving the index, specifically, the increase in the risk of mortality per 10 µg/m³ increase in the daily mean concentration of PM2.5.

Assuming linearity across the relative risk functions for O₃ and NO₂, we calculate the concentrations of these pollutants that pose an equivalent relative risk to a 10 µg/m³ increase in the daily mean of PM2.5.

For PM10 concentrations, a constant ratio between PM10 and PM2.5 of 1:2 is assumed, in line with the World Health Organization’s air quality guidelines for Europe.

For SO₂, the bands reflect the limit values set under the EU Air Quality Directive.

The relationship between PM2.5 : PM10 : NO₂ : O₃ : SO₂ is then equal to 1 : 2 : 4 : 5 : 10. The combined index can then be computed as

\[Q_{\text{air}} = \frac{\text{PM2.5}}{1} + \frac{\text{PM10}}{2} + \frac{\mathrm{NO_2}}{4} + \frac{\mathrm{O_3}}{5} + \frac{\mathrm{SO_2}}{10}\]

Except for O₃, UK AIR reports all of these values as concentrations in µg/m³. O₃ is reported as the number of days above a threshold of 120 µg/m³, and thus cannot be used in this formula. Since the EEA itself omits data when unavailable, we compute the index from the four remaining measurements.
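As an illustration, the index without the O₃ term can be computed per grid cell as in the sketch below. The function name `air_quality_index` and the example concentrations are hypothetical; only the weights 1, 2, 4 and 10 come from the ratio stated above.

```python
import numpy as np


def air_quality_index(pm25, pm10, no2, so2):
    """Composite index from four pollutant concentrations (µg/m³), omitting the O₃ term."""
    pm25, pm10, no2, so2 = (
        np.asarray(x, dtype=float) for x in (pm25, pm10, no2, so2)
    )
    # Weights follow the 1 : 2 : 4 : 10 ratio for PM2.5 : PM10 : NO₂ : SO₂.
    return pm25 / 1 + pm10 / 2 + no2 / 4 + so2 / 10


# Hypothetical concentrations for two grid cells.
q_air = air_quality_index(
    pm25=[8.0, 10.0], pm10=[12.0, 16.0], no2=[20.0, 24.0], so2=[2.0, 5.0]
)
print(q_air)
```

For the first cell this gives 8 + 6 + 5 + 0.2 = 19.2, and for the second 10 + 8 + 6 + 0.5 = 24.5.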

It should also be borne in mind that UK AIR data are not direct measurements, but rather a model.

The data from the 1 km grid are spatially interpolated to Output Area geometries. The index itself is computed on the grid.

3.1.3 House price

The ideal way of working with house prices in a modelling exercise like ours is to use price per square metre (sqm). However, these data are not generally available at the Output Area or a similar level. Fortunately, we can retrieve such data from the paper “A new attribute-linked residential property price dataset for England and Wales 2011-2019” by Chi et al. This dataset contains individual house prices, total floor area, and a resulting price per sqm obtained from a combination of Land Registry Price Paid Data and Domestic Energy Performance Certificates.

We use the data from the years 2018–2019, and compute the mean price per sqm for each output area. Given the data are not fully up-to-date, the results of our modelling are presented as a percent increase or decrease compared to the baseline value rather than in absolute terms.
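The aggregation and the relative reporting described above can be sketched with pandas as follows. The column names (`oa_code`, `year`, `price`, `floor_area`), the toy transactions, and the `predicted` values are hypothetical; the real dataset already provides a price per sqm column, which is recomputed here only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical individual transactions.
sales = pd.DataFrame({
    "oa_code": ["OA1", "OA1", "OA2", "OA2", "OA2"],
    "year": [2018, 2019, 2017, 2018, 2019],
    "price": [150000, 180000, 90000, 100000, 126000],
    "floor_area": [75, 90, 50, 50, 60],  # square metres
})
sales["price_per_sqm"] = sales["price"] / sales["floor_area"]

# Keep only 2018 and 2019, then average per Output Area to get the baseline.
recent = sales[sales["year"].between(2018, 2019)]
baseline = recent.groupby("oa_code")["price_per_sqm"].mean()


def percent_change(predicted, baseline):
    """Express predicted values as a percent change relative to the baseline."""
    return (predicted - baseline) / baseline * 100.0


# Hypothetical model output for a scenario.
predicted = pd.Series({"OA1": 2100.0, "OA2": 2050.0})
change = percent_change(predicted, baseline)
print(change)
```

Here OA1 has a baseline of 2000 per sqm, so a prediction of 2100 is reported as a 5% increase rather than as an absolute price.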

3.2 Explanatory variables

We curated a set of explanatory variables, which describe the urban environment and can be changed in order to model the different scenarios. These variables are chosen to have predictive power for the four indicators, and can be categorised into several groups according to their source.

3.2.1 Group 1: ONS population estimates

The first group consists of only one variable, namely, population estimates. We use the population estimates by the Office for National Statistics (ONS) at the OA level for mid-2020. The other data used in the project generally reflect the period between 2018 and 2020, so this was chosen in order to describe a compatible point in time. The dataset (titled ‘SAPE23DT10d’) is retrieved from the ONS website. Since it is already reported at the Output Area level, it does not need to be processed further.

3.2.2 Group 2: Workplace population by industry

The workplace population is obtained from Census 2011 data, and is reported on Workplace Zone geometries. We use preprocessed data from the Urban Grammar project which aggregated industries into the following groups:

A, B, D, E. Agriculture, energy and water
C. Manufacturing
F. Construction
G, I. Distribution, hotels and restaurants
H, J. Transport and communication
K, L, M, N. Financial, real estate, professional and administrative activities
O, P, Q. Public administration, education and health
R, S, T, U. Other

The data are then interpolated from Workplace Zones to Output Areas, giving us another 8 variables.

3.2.3 Group 3: CORINE Land Cover classification

We use CORINE Land Cover classification data for 2018, which are provided as polygons of contiguous areas belonging to the same class. We use the data extracted for Great Britain within the Urban Grammar project, and interpolate the data to Output Areas, which gives us the proportion of each OA covered by each class. We further filter out fully or nearly invariant classes and use only:

Discontinuous urban fabric
Continuous urban fabric
Non-irrigated arable land
Industrial or commercial units
Green urban areas
Pastures
Sport and leisure facilities

leading to 7 more variables.

3.2.4 Group 4: Urban morphometrics

Urban morphometrics offers a way of describing the physical built environment (buildings, streets) through a set of quantitative measurements which capture different aspects of morphological elements. We directly use the set measured within the Urban Grammar project, which are presented as individual characters (prior to contextualisation), and interpolate their values from the original geometry (enclosed tessellation cells) to Output Areas. This way, we obtain a contextual version using the project-specific aggregation to OA. This gives us 59 morphometric variables.

3.2.5 Variable pruning and final variable set

In total, these four groups give us 75 explanatory variables. Because some of them were collected for purposes other than modelling, and with different geographical extents in mind, some may be correlated within our limited area of interest (Tyne and Wear). This may negatively affect the performance of the predictive models, so it is better to minimise the number of such pairs. We therefore measure both the Pearson and Spearman correlation coefficients between all pairs of explanatory variables. For each pair where both coefficients have an absolute value larger than 0.8, we retain only one variable, ideally the more interpretable one.
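A minimal sketch of this pruning rule is given below. The helper `prune_correlated` and the toy data are hypothetical. Choosing the more interpretable variable is a manual judgement in the project; as a stand-in, the sketch assumes columns are ordered by preference and always keeps the earlier one of a correlated pair.

```python
import numpy as np
import pandas as pd


def prune_correlated(df, threshold=0.8):
    """Drop one variable from each pair whose |Pearson| and |Spearman| both exceed `threshold`."""
    pearson = df.corr(method="pearson").abs()
    spearman = df.corr(method="spearman").abs()
    both = (pearson > threshold) & (spearman > threshold)

    cols = list(df.columns)
    dropped = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if both.loc[a, b]:
                # ASSUMPTION: the earlier column is the one we prefer to keep.
                dropped.add(b)
    return df.drop(columns=sorted(dropped)), sorted(dropped)


# Toy data: "b" is almost a linear copy of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
pruned, dropped = prune_correlated(df)
print(dropped)
```

Requiring both coefficients to exceed the threshold means a pair is only pruned when it is strongly related in both the linear and the rank sense.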

This procedure resulted in 16 of the morphometric variables being removed, leaving us with 59 explanatory variables in total:

Group	#	Variable name
Population	1	population estimate
Workplace	2	A, B, D, E. Agriculture, energy and water
	3	C. Manufacturing
	4	F. Construction
	5	G, I. Distribution, hotels and restaurants
	6	H, J. Transport and communication
	7	K, L, M, N. Financial, real estate, professional and administrative activities
	8	O, P, Q. Public administration, education and health
	9	R, S, T, U. Other
Corine	10	Land cover [Discontinuous urban fabric]
	11	Land cover [Continuous urban fabric]
	12	Land cover [Non-irrigated arable land]
	13	Land cover [Industrial or commercial units]
	14	Land cover [Green urban areas]
	15	Land cover [Pastures]
	16	Land cover [Sport and leisure facilities]
Morphometric	17	area of building
	18	courtyard area of building
	19	circular compactness of building
	20	corners of building
	21	squareness of building
	22	equivalent rectangular index of building
	23	centroid - corner mean distance of building
	24	centroid - corner distance deviation of building
	25	orientation of building
	26	area of ETC
	27	circular compactness of ETC
	28	equivalent rectangular index of ETC
	29	covered area ratio of ETC
	30	cell alignment of building
	31	alignment of neighbouring buildings
	32	mean distance between neighbouring buildings
	33	perimeter-weighted neighbours of ETC
	34	mean inter-building distance
	35	width of street profile
	36	width deviation of street profile
	37	openness of street profile
	38	length of street segment
	39	linearity of street segment
	40	mean segment length within 3 steps
	41	node degree of junction
	42	local proportion of 3-way intersections of street network
	43	local proportion of 4-way intersections of street network
	44	local proportion of cul-de-sacs of street network
	45	local closeness of street network
	46	local cul-de-sac length of street network
	47	square clustering of street network
	48	local degree weighted node density of street network
	49	street alignment of building
	50	area covered by edge-attached ETCs
	51	buildings per meter of street segment
	52	reached ETCs by neighbouring segments
	53	reached ETCs by tessellation contiguity
	54	area of enclosure
	55	circular compactness of enclosure
	56	equivalent rectangular index of enclosure
	57	orientation of enclosure
	58	perimeter-weighted neighbours of enclosure
	59	area-weighted ETCs of enclosure

3.3 Modelling / regression analysis

The accessibility indicators result from a deterministic model based on a fixed travel-time matrix. Hence, any new scenario being modelled can re-use the same matrix, changing only the raw opportunity values at the destinations. In contrast, we need to develop predictive models to be able to see the changes in air quality and house prices.

Both of these models use the same estimator, a histogram-based gradient boosting regression tree as implemented in the scikit-learn Python package, and the set of 59 explanatory variables listed above. To ensure we are able to properly capture the spatial nature of the indicators, we add another 59 features which are a spatial lag of the original set. The spatial lag is measured as the mean value of the variable within a set neighbourhood, the definition of which is empirically determined. We test neighbourhoods based on contiguity (from order of contiguity 1 to 5, inclusive of lower-order neighbours), Euclidean distance (500 m, 1000 m, 2000 m), and their unions. Each option is then included in the grid search, which selects the best model parameters and the best definition of the spatial lag according to the model performance metrics (the coefficient of determination R², mean squared error, and mean absolute error). The other model parameters we assess are the learning rate, the maximum number of iterations, and the maximum number of bins. For more details, see the implementation in the Air Quality notebook and the House Price notebook.

Furthermore, we have experimented with the geographical extent of the training data. While the original models were based only on the data from the study area (Tyne and Wear), we have tested models trained on other geographical subsets:

the whole of England
the whole of England, but excluding Greater London, which is known to be an outlier that may not help in predictive quality within Tyne and Wear
Tyne and Wear plus all OAs belonging to “Urbanity” classes from the rest of England

The aim of this is to make the model more robust by training it on a wider set of data.

3.3.1 Resulting model specifications

The resulting models, selected based on their performance, are based on the following specifications:

Air Quality

Spatial extent: Tyne and Wear plus all OAs belonging to “Urbanity” classes from the rest of England
Spatial lag: A union of 5 orders of Queen contiguity and 2000 m distance band
Model parameters: learning_rate=0.2, max_bins=64, max_iter=1000

The model performance (R²) is ~0.8.

House Price

Spatial extent: England excluding Greater London
Spatial lag: 5 orders of Queen contiguity
Model parameters: learning_rate=0.1, max_bins=128, max_iter=1000

The model performance (R²) is ~0.5.

3.4 Building scenarios

Scenarios are constructed by changing four macro variables, which together drive a sampling mechanism that derives the 59 variables used in the modelling. The four variables are:

Level of urbanity, which is a proxy for a signature type of spatial signatures as defined in the Urban Grammar project;
Use of buildings, referring to the ratio of residential and commercial or industrial use;
Job types, which determines whether jobs in each Output Area are more white-collar or blue-collar; and
Greenspace, reflecting the amount of formal greenspace (i.e. parks).

Scenarios are modelled by specifying the macro variables on Output Areas where the change takes place. Any of the four variables can be specified in combination with any other. The algorithm then samples the data from either the baseline (capturing the existing state) or from the known distribution of values per signature type based on the country-wide data.

The first step in the sampling procedure is the selection of a signature type. If it is not changed, subsequent steps modify the baseline. Otherwise, we sample the values for the chosen signature type, reflecting its common characteristics as observed across Great Britain. For example, if we specify the ‘local urbanity’ signature type for an Output Area which previously had ‘dense urban neighbourhoods’ assigned, we look at what ‘local urbanity’ usually looks like and sample all the required values from a narrow normal distribution centred on the median of the nationwide distribution. The other macro variables then adjust these values. The use macro variable changes the population and workplace population variables, job types changes the distribution of jobs between the job type categories, and greenspace allocates new parks onto an Output Area (adjusting other values such as population accordingly).
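The sampling from a narrow normal distribution centred on the nationwide median can be sketched as follows. The function name `sample_signature_values` and the toy data are hypothetical. The text does not say how narrow the distribution is; the sketch assumes a standard deviation equal to a fraction (`rel_scale`) of the interquartile range, which is an illustrative choice rather than the project's setting.

```python
import numpy as np


def sample_signature_values(national_values, n_samples, rel_scale=0.1, seed=None):
    """Draw values from a narrow normal centred on the median of `national_values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(national_values, dtype=float)
    median = np.median(values)
    # ASSUMPTION: "narrow" is taken as a fraction of the interquartile range.
    q1, q3 = np.percentile(values, [25, 75])
    sigma = rel_scale * (q3 - q1)
    return rng.normal(loc=median, scale=sigma, size=n_samples)


# Hypothetical nationwide values of one variable for one signature type.
national = np.arange(1, 102)  # median 51, interquartile range 50
samples = sample_signature_values(national, n_samples=1000, seed=0)
print(samples.mean(), samples.std())
```

Centring on the median rather than the mean keeps the sampled values close to a typical Output Area of that signature type even when the nationwide distribution is skewed.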

The development of each scenario then comprises a few simple steps:

Select the Output Areas that are supposed to change
Assess the target level of urbanity for each Output Area
Assess the other macro variables for each Output Area
From these, sample the values to be used as input for the models
Run the models to assess the effect of the tested changes