Appendix C — Air quality

This notebook contains the code used to derive the air quality/pollution index.

C.1 Exploration of air quality data from Newcastle urban observatory

Downloaded from https://archive.dev.urbanobservatory.ac.uk/agg/900/ aggregated for 15 minutes blocks. We can also get coarser aggregation (hour, 6 hours, day) or even raw minute data.

import pandas as pd
import geopandas as gpd
import xarray as xr
import contextily as ctx

Specify paths to existing files, some of which need to be downloaded manually (see Notes).

data_folder = "/Users/martin/Library/CloudStorage/OneDrive-SharedLibraries-TheAlanTuringInstitute/Daniel Arribas-Bel - demoland_data"
lsoa_list_file = (
    f"{data_folder}/processed/tynewear_lsoas_list.csv"  # list of LSOAs within AoI
)
aoi_file = f"{data_folder}/processed/authorities/"

# national level files
LSOA_boundaries_file = f"{data_folder}/raw/LSOA_(Dec_2011)_Boundaries_Super_Generalised_Clipped_(BSC)_EW_V3.geojson"  # administrative boundaries from gov.uk OS

Load the data for CO for December to see the structure

co_dec22 = pd.read_csv(f"{data_folder}/raw/air_quality/2022-12-CO-900.0.csv.zip")

Generate actual geometry

co_dec22 = gpd.GeoDataFrame(
    co_dec22, geometry=gpd.GeoSeries.from_wkt(co_dec22["Location (WKT)"], crs=4326)
).to_crs(27700)
co_dec22.head()
Sensor Name Variable Units Timestamp Mean Value Min Value Max Value Median Value Standard Dev Count Records Location (WKT) Ground Height Above Sea Level Sensor Height Above Ground Broker Name Third Party Sensor Centroid Longitude Sensor Centroid Latitude Raw ID geometry
0 PER_AIRMON_MESH1972150 CO ugm -3 2022-12-01 00:00:00 417.865689 406.879185 491.198130 410.974850 20.274449 15.0 POINT (-1.617121 54.974193) 47.290001 2.0 aq_mesh_api False -1.617121 54.974193 79208 POINT (424607.499 564463.931)
1 PER_AIRMON_MESH1907150 CO ugm -3 2022-12-01 00:00:00 630.707678 614.071515 678.104495 627.741670 17.546116 15.0 POINT (-1.603448 54.965679) 31.170000 2.0 aq_mesh_api False -1.603448 54.965679 79150 POINT (425488.062 563521.339)
2 PER_AIRMON_MESH1968150 CO ugm -3 2022-12-01 00:00:00 1.257286 1.196525 1.766735 1.217135 0.138612 15.0 POINT (-1.614848 54.97149) 42.040001 2.0 aq_mesh_api False -1.614848 54.971490 79204 POINT (424754.639 564163.933)
3 PER_AIRMON_MESH1969150 CO ugm -3 2022-12-01 00:00:00 343.280313 337.130365 364.159235 342.825595 6.512988 15.0 POINT (-1.593537 55.039309) 64.440002 2.0 aq_mesh_api False -1.593537 55.039309 79205 POINT (426075.104 571718.592)
4 PER_AIRMON_MONITOR1157100 CO ugm -3 2022-12-01 00:00:00 0.179765 0.179765 0.179765 0.179765 0.000000 1.0 POINT (-1.623043 54.970093) 45.270000 2.0 aq_mesh_api False -1.623043 54.970093 79565 POINT (424230.858 564005.621)

We have classic summary stats for the CO values, each per 15 minute slot. We can get those for every month back to roughly 2016.

The same can be done for the other metrics:

  • CO (this one)
  • NO
  • NO2
  • NOx
  • O3
  • PM 4
  • PM 1
  • PM 10
  • PM 2.5
  • Particle count

Given we are dealing with rather static scenarios, the target indicator in the modelling process needs to be an aggregation of some sort per a longer period of time (a year?).

Further, we may not want to model each metric individually, so there may be a scope for a creation of an index of air quality combining all these into a single metric that will be then predicted. Or we can predict them all…

aoi = gpd.read_file(aoi_file)
m = co_dec22.loc[~co_dec22["Location (WKT)"].duplicated()].explore(marker_type="marker")
aoi.boundary.explore(m=m, color="black")
m
Make this Notebook Trusted to load map: File -> Trust Notebook

The main issue with the dataset is its coverage. As shown above, it is very limited and even if we wanted to interpolate the data we wouldn’t be able to do so south of the city. UO has other sensors there but not those for air quality.

C.2 EEA-based index

EEA specifies the relationship between concentration of different pollutants to get an index of hazard for human health (here). We can use their multipliers to create a single index based on the UK AIR data. From EEA:

The bands are based on the relative risks associated to short-term exposure to PM2.5, O3 and NO2, as defined by the World Health Organization in its report on the Health Risks of Air Pollution in Europe project (HRAPIE project report).

The relative risk of exposure to PM2.5 is taken as basis for driving the index, specifically the increase in the risk of mortality per 10 µg/m3 increase in the daily mean concentration of PM2.5.

Assuming linearity across the relative risks functions for O3 and NO2, we calculate the concentrations of these pollutants that pose an equivalent relative risk to a 10 µg/m3 increase in the daily mean of PM2.5.

For PM10 concentrations, a constant ratio between PM10 and PM2.5 of 1:2 is assumed, in line with the World Health Organization´s air quality guidelines for Europe.

For SO2, the bands reflect the limit values set under the EU Air Quality Directive.

The relationship between \(PM_{2.5}\) : \(PM_{10}\) : \(NO_{2}\) : \(O_{3}\) : \(SO_{2}\) is then equal to 1 : 2 : 4 : 5 : 10. The combined index can then be computed as \[Q_{air} = \frac{PM_{2.5}}{1} + \frac{PM_{10}}{2} + \frac{NO_{2}}{4} + \frac{O_{3}}{5} + \frac{SO_{2}}{10}\]

Except for the \(O_{3}\), UK AIR reports all as a concentration in \(\mu g \cdot m^{-3}\). \(O_{3}\) is reported as a number of days above a threshold of 120 \(\mu g \cdot m^{-3}\) and cannot be used in this formula but even EEA omits data when unavailable so we shall be able to create the index based on 4 remaning measurements.

It shall also be condsidered that UK AIR is not representing a direct measurements but a model.

C.2.1 Data

Data is available from DEFRA (https://uk-air.defra.gov.uk/data/pcm-data) as CSVs in a UK grid.

The data begins on row 6 of each file and contains 4 columns: - a unique code (ukgridcode) for each 1x1km cell in the map - the x coordinates for the centroid of each grid cell - the y coordinates for the centroid of each grid cell - the values for the metric itself.

The coordinate system is OSGB and the coordinates represent the centre of each 1x1km cell.

There are some grid squares in each data set that do not have values associated with them and are labelled as ‘MISSING’

We can use pandas to read grids to Xarray for efficient array computation of the index.

pm10_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mappm102021g.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
pm25_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mappm252021g.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
no2_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mapno22021.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
so2_21 = (
    pd.read_csv(
        "https://uk-air.defra.gov.uk/datastore/pcm/mapso22021.csv",
        header=5,
        na_values=["MISSING"],
    )
    .set_index(["x", "y"])
    .drop(columns="gridcode")
    .to_xarray()
)
pollutants_2021 = xr.merge([pm10_21, pm25_21, no2_21, so2_21])
pollutants_2021

We can compute the index based on the formula above.

aqi = (
    pollutants_2021.pm252021g
    + pollutants_2021.pm102021g / 2
    + pollutants_2021.no22021 / 4
    + pollutants_2021.so22021 / 10
)
pollutants_2021 = pollutants_2021.assign(aqi=aqi)

We can filter only the area of interest before converting to a GeoDataFrame.

bds = aoi.total_bounds
pollutants_2021_tyne = pollutants_2021.sel(
    x=slice(bds[0], bds[2]), y=slice(bds[1], bds[3])
)

We convert the array to a GeoDataFrame with polygons representing grid area. This will be needed for areal interpolation to LSOA/MSOA.

pollutants_2021_tyne_df = pollutants_2021_tyne.to_dataframe().reset_index()
pollutants_2021_tyne_df = gpd.GeoDataFrame(
    pollutants_2021_tyne_df,
    geometry=gpd.points_from_xy(
        pollutants_2021_tyne_df.x, pollutants_2021_tyne_df.y, crs=27700
    ).buffer(500, cap_style=3),
)

We can check how the derived index looks.

pollutants_2021_tyne_df.explore("aqi")
Make this Notebook Trusted to load map: File -> Trust Notebook
pollutants_2021_tyne_df.to_file(
    f"{data_folder}/processed/air_quality/air_quality_grid_2021.gpkg"
)