Appendix C — Air quality

This notebook contains the code used to derive the air quality/pollution index.

C.1 Exploration of air quality data from Newcastle urban observatory

Downloaded from https://archive.dev.urbanobservatory.ac.uk/agg/900/ aggregated for 15 minutes blocks. We can also get coarser aggregation (hour, 6 hours, day) or even raw minute data.

import pandas as pd
import geopandas as gpd
import xarray as xr
import contextily as ctx

Specify paths to existing files, some of which need to be downloaded manually (see Notes).

data_folder = "/Users/martin/Library/CloudStorage/OneDrive-SharedLibraries-TheAlanTuringInstitute/Daniel Arribas-Bel - demoland_data"

lsoa_list_file = (
    f"{data_folder}/processed/tynewear_lsoas_list.csv"  # list of LSOAs within AoI
)
aoi_file = f"{data_folder}/processed/authorities/"

# national level files
LSOA_boundaries_file = f"{data_folder}/raw/LSOA_(Dec_2011)_Boundaries_Super_Generalised_Clipped_(BSC)_EW_V3.geojson"  # administrative boundaries from gov.uk OS

Load the data for CO for December to see the structure

co_dec22 = pd.read_csv(f"{data_folder}/raw/air_quality/2022-12-CO-900.0.csv.zip")

Generate actual geometry

co_dec22 = gpd.GeoDataFrame(
    co_dec22, geometry=gpd.GeoSeries.from_wkt(co_dec22["Location (WKT)"], crs=4326)
).to_crs(27700)

co_dec22.head()

	Sensor Name	Variable	Units	Timestamp	Mean Value	Min Value	Max Value	Median Value	Standard Dev	Count Records	Location (WKT)	Ground Height Above Sea Level	Sensor Height Above Ground	Broker Name	Third Party	Sensor Centroid Longitude	Sensor Centroid Latitude	Raw ID	geometry
0	PER_AIRMON_MESH1972150	CO	ugm -3	2022-12-01 00:00:00	417.865689	406.879185	491.198130	410.974850	20.274449	15.0	POINT (-1.617121 54.974193)	47.290001	2.0	aq_mesh_api	False	-1.617121	54.974193	79208	POINT (424607.499 564463.931)
1	PER_AIRMON_MESH1907150	CO	ugm -3	2022-12-01 00:00:00	630.707678	614.071515	678.104495	627.741670	17.546116	15.0	POINT (-1.603448 54.965679)	31.170000	2.0	aq_mesh_api	False	-1.603448	54.965679	79150	POINT (425488.062 563521.339)
2	PER_AIRMON_MESH1968150	CO	ugm -3	2022-12-01 00:00:00	1.257286	1.196525	1.766735	1.217135	0.138612	15.0	POINT (-1.614848 54.97149)	42.040001	2.0	aq_mesh_api	False	-1.614848	54.971490	79204	POINT (424754.639 564163.933)
3	PER_AIRMON_MESH1969150	CO	ugm -3	2022-12-01 00:00:00	343.280313	337.130365	364.159235	342.825595	6.512988	15.0	POINT (-1.593537 55.039309)	64.440002	2.0	aq_mesh_api	False	-1.593537	55.039309	79205	POINT (426075.104 571718.592)
4	PER_AIRMON_MONITOR1157100	CO	ugm -3	2022-12-01 00:00:00	0.179765	0.179765	0.179765	0.179765	0.000000	1.0	POINT (-1.623043 54.970093)	45.270000	2.0	aq_mesh_api	False	-1.623043	54.970093	79565	POINT (424230.858 564005.621)

We have classic summary stats for the CO values, each per 15 minute slot. We can get those for every month back to roughly 2016.

The same can be done for the other metrics:

CO (this one)
NO
NO2
NOx
O3
PM 4
PM 1
PM 10
PM 2.5
Particle count

Given we are dealing with rather static scenarios, the target indicator in the modelling process needs to be an aggregation of some sort per a longer period of time (a year?).

Further, we may not want to model each metric individually, so there may be a scope for a creation of an index of air quality combining all these into a single metric that will be then predicted. Or we can predict them all…

aoi = gpd.read_file(aoi_file)

m = co_dec22.loc[~co_dec22["Location (WKT)"].duplicated()].explore(marker_type="marker")
aoi.boundary.explore(m=m, color="black")
m

Make this Notebook Trusted to load map: File -> Trust Notebook

The main issue with the dataset is its coverage. As shown above, it is very limited and even if we wanted to interpolate the data we wouldn’t be able to do so south of the city. UO has other sensors there but not those for air quality.

C.2 EEA-based index

EEA specifies the relationship between concentration of different pollutants to get an index of hazard for human health (here). We can use their multipliers to create a single index based on the UK AIR data. From EEA:

The bands are based on the relative risks associated to short-term exposure to PM2.5, O3 and NO2, as defined by the World Health Organization in its report on the Health Risks of Air Pollution in Europe project (HRAPIE project report).

The relative risk of exposure to PM2.5 is taken as basis for driving the index, specifically the increase in the risk of mortality per 10 µg/m3 increase in the daily mean concentration of PM2.5.

Assuming linearity across the relative risks functions for O3 and NO2, we calculate the concentrations of these pollutants that pose an equivalent relative risk to a 10 µg/m3 increase in the daily mean of PM2.5.

For PM10 concentrations, a constant ratio between PM10 and PM2.5 of 1:2 is assumed, in line with the World Health Organization´s air quality guidelines for Europe.

For SO2, the bands reflect the limit values set under the EU Air Quality Directive.

The relationship between \(PM_{2.5}\) : \(PM_{10}\) : \(NO_{2}\) : \(O_{3}\) : \(SO_{2}\) is then equal to 1 : 2 : 4 : 5 : 10. The combined index can then be computed as \[Q_{air} = \frac{PM_{2.5}}{1} + \frac{PM_{10}}{2} + \frac{NO_{2}}{4} + \frac{O_{3}}{5} + \frac{SO_{2}}{10}\]

Except for the \(O_{3}\), UK AIR reports all as a concentration in \(\mu g \cdot m^{-3}\). \(O_{3}\) is reported as a number of days above a threshold of 120 \(\mu g \cdot m^{-3}\) and cannot be used in this formula but even EEA omits data when unavailable so we shall be able to create the index based on 4 remaning measurements.

It shall also be condsidered that UK AIR is not representing a direct measurements but a model.

C.2.1 Data

Data is available from DEFRA (https://uk-air.defra.gov.uk/data/pcm-data) as CSVs in a UK grid.

The data begins on row 6 of each file and contains 4 columns: - a unique code (ukgridcode) for each 1x1km cell in the map - the x coordinates for the centroid of each grid cell - the y coordinates for the centroid of each grid cell - the values for the metric itself.

The coordinate system is OSGB and the coordinates represent the centre of each 1x1km cell.

There are some grid squares in each data set that do not have values associated with them and are labelled as ‘MISSING’

We can use pandas to read grids to Xarray for efficient array computation of the index.

pm10_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mappm102021g.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
pm25_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mappm252021g.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
no2_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mapno22021.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
so2_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mapso22021.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
pollutants_2021 = xr.merge([pm10_21, pm25_21, no2_21, so2_21])
pollutants_2021

We can compute the index based on the formula above.

aqi = (
    pollutants_2021.pm252021g
    + pollutants_2021.pm102021g / 2
    + pollutants_2021.no22021 / 4
    + pollutants_2021.so22021 / 10
)
pollutants_2021 = pollutants_2021.assign(aqi=aqi)

We can filter only the area of interest before converting to a GeoDataFrame.

bds = aoi.total_bounds
pollutants_2021_tyne = pollutants_2021.sel(
    x=slice(bds[0], bds[2]), y=slice(bds[1], bds[3])
)

We convert the array to a GeoDataFrame with polygons representing grid area. This will be needed for areal interpolation to LSOA/MSOA.

pollutants_2021_tyne_df = pollutants_2021_tyne.to_dataframe().reset_index()
pollutants_2021_tyne_df = gpd.GeoDataFrame(
    pollutants_2021_tyne_df,
    geometry=gpd.points_from_xy(
        pollutants_2021_tyne_df.x, pollutants_2021_tyne_df.y, crs=27700
    ).buffer(500, cap_style=3),
)

We can check how the derived index looks.

pollutants_2021_tyne_df.explore("aqi")

Make this Notebook Trusted to load map: File -> Trust Notebook

pollutants_2021_tyne_df.to_file(
    f"{data_folder}/processed/air_quality/air_quality_grid_2021.gpkg"
)