Making UK weather and air quality data available to a diverse research community
Authors: Ann Gledson, Douglas Lowe (Research IT), David Topping (Department of Earth and Environmental Sciences - FSE) and Caroline Jay (Department of Computer Science - FSE).
The aim of the Alan Turing Institute funded ‘Understanding the relationship between human health and the environment’ project was to study how self-reported hay fever symptom data from the Britain Breathing project was affected by air quality.
The environment data had to be extracted from three data-sets: air quality measurements from the Automatic Urban and Rural Network (AURN); weather and pollen measurements from the Medical and Environmental Data Mash-up Infrastructure (MEDMI); and modelled forecast data generated using the European Monitoring and Evaluation Programme (EMEP) model.
Although these datasets are all theoretically ‘open’, downloading, cleaning and pre-processing them was extremely time-consuming, ultimately requiring months of work. Once obtained, much further work was required to ensure their validity, including detecting and removing duplicates and unrealistic measurements, imputing missing values and augmenting data for those regions with sparse coverage.
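To give a flavour of what this pre-processing involves, the sketch below outlines the steps described above (removing duplicate timestamps, discarding unrealistic readings, and imputing gaps by interpolation) in plain Python. This is an illustrative outline only, not the project's actual pipeline; the valid-range threshold and function name are invented for the example.

```python
# Illustrative cleaning sketch: de-duplicate, drop out-of-range
# readings, then fill gaps by linear interpolation. The threshold
# and names here are hypothetical, for illustration only.

def clean_series(readings, valid_range=(0.0, 1000.0)):
    """readings: list of (hour_index, value) tuples; value may be None."""
    lo, hi = valid_range
    # 1. Remove duplicate timestamps, keeping the first occurrence.
    seen, deduped = set(), []
    for t, v in readings:
        if t not in seen:
            seen.add(t)
            deduped.append((t, v))
    # 2. Treat unrealistic measurements as missing.
    cleaned = [(t, v if v is not None and lo <= v <= hi else None)
               for t, v in deduped]
    # 3. Impute missing values by linear interpolation between the
    #    nearest valid neighbours (leading/trailing gaps stay missing).
    times = [t for t, _ in cleaned]
    values = [v for _, v in cleaned]
    for i, v in enumerate(values):
        if v is None:
            prev_i = next((j for j in range(i - 1, -1, -1)
                           if values[j] is not None), None)
            next_i = next((j for j in range(i + 1, len(values))
                           if values[j] is not None), None)
            if prev_i is not None and next_i is not None:
                frac = (times[i] - times[prev_i]) / (times[next_i] - times[prev_i])
                values[i] = values[prev_i] + frac * (values[next_i] - values[prev_i])
    return list(zip(times, values))
```

Even a toy version like this shows why the work is slow: each step embeds a judgement call (what counts as a duplicate, what range is plausible, how gaps should be filled) that a downstream researcher may want to revisit.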
We knew that air quality and weather data are of great value to many research communities, and that across the country researchers were constantly ‘reinventing the wheel’ – going through the same lengthy process of finding and cleaning the data, replicating (and wasting) many months of work.
Keen to stop this cycle, we switched our attention to publishing these datasets along with the data extraction/cleaning tools used to create them, and a visualisation tool. We focussed on making them as accessible and adaptable as possible, allowing researchers to get on with the job of evaluating the impact of environmental factors, rather than becoming bogged down in pre-processing.
Applying open research practices
Our datasets are available on Zenodo and include daily air quality (NO2, NOx, SO2, O3, PM10, PM2.5), pollen (e.g. Ambrosia, Urtica) and weather (temperature, pressure and relative humidity) readings from AURN and MEDMI, covering the United Kingdom for the years 2016 to 2020 inclusive.
We publish a cleaned dataset, with duplicate and spurious values removed, and a further version with missing values imputed. We also publish a dataset of estimated environmental measurements for UK regions that lack sensor coverage. A data descriptor paper describing these datasets has been published in Nature Scientific Data.
All cleaning, imputation and regional estimation methodologies are available as command-line tools or Python libraries, with full documentation included. Each tool allows parameter adjustment, so adaptations can be made to suit different requirements, and is extensible, allowing further methods to be added.
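As an illustration of what a regional estimation step can look like, the sketch below uses inverse-distance weighting (IDW) to estimate a measurement at a location without a sensor from nearby stations. IDW is a generic spatial technique chosen here purely for illustration; it is not necessarily the method our published library implements, and all names in the snippet are invented for the example.

```python
import math

# Illustrative sketch: estimate an environmental measurement at a
# location with no sensor, using inverse-distance weighting over
# nearby stations. A generic technique shown for illustration; not
# necessarily the method used by the published library.

def idw_estimate(target, stations, power=2.0):
    """target: (x, y); stations: list of ((x, y), value) pairs."""
    num = den = 0.0
    for (x, y), value in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # target coincides with a station
        w = 1.0 / d ** power  # closer stations get more weight
        num += w * value
        den += w
    return num / den
```

With two equally distant stations reporting 10 and 20, the estimate is their mean, 15; as the target moves towards one station, its reading dominates. The `power` parameter is the kind of adjustable setting the tools expose, so users can tune the behaviour to their own requirements.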
A geo-spatial visualisation tool is provided, allowing users to visualise our data or their own environmental datasets, alongside the accuracy of regional estimations made using our regional estimation python library.
All our code is available on GitHub with full documentation. The code is designed to be easily extensible, so users can add their own algorithms. The mine-the-gaps code is containerised using Docker, allowing easy installation.
Data pre-processing is not a single, one-size-fits-all operation, and different users of our data are likely to have different expectations of the methodologies used. Clear explanations of those choices have been provided, so users can easily decide if the cleaned datasets fit their needs. Because the methods we have used to create the datasets are available as a tool-set, users who have different pre-processing requirements can modify these rather than starting from scratch, and still obtain the data considerably faster than before.
Benefits of using these open research practices
Sharing data and methodologies not only helps to advance the research domain by preventing repeated work; it also opens a dialogue with end-users. We have given invited talks about our approach at the University of Cambridge, the University of Warwick and the Alan Turing Institute, and are in the process of creating a ‘primer’ showcasing best practice. We have been able to get feedback from users and hope to soon see how people have used the data and tools.
Don’t aim for a perfect product that suits everyone, or it will never be finished. Decide on an approach, provide clear explanations, and allow users to adapt it if required.
You can visit the Research IT website for more information about the services they provide to researchers at the University.
- Cleaned and imputed datasets: https://doi.org/10.5281/zenodo.5118563
- Regional estimates: https://doi.org/10.5281/zenodo.5119234
- 2020 datasets (imputation was not possible because lock-downs changed air quality patterns):
  - Cleaned, with mean and max values: https://doi.org/10.5281/zenodo.4740965
  - Regional estimates: https://doi.org/10.5281/zenodo.5457270
- Data cleaning and imputation code: https://doi.org/10.5281/zenodo.5129076
- Regional estimation code: https://doi.org/10.5281/zenodo.5119778
- Visualisation tool (mine-the-gaps): https://github.com/UoMResearchIT/mine-the-gaps
- Nature Scientific Data journal article: describing the above datasets, methodologies and code.
- European Geosciences Union (EGU) General Assembly, May 2022 (in press): describing the mine-the-gaps visualisation tool.
Read the full journal article: ‘UK daily meteorology, air quality, and pollen measurements for 2016–2019, with estimates for missing data’.