A citizen-centred urban network for weather and air quality in Australian schools

High-quality, standardized urban canopy layer observations are a worldwide necessity for urban climate and air quality research and monitoring. The Schools Weather and Air Quality (SWAQ) network was developed and deployed across the Greater Sydney region with a view to establishing a citizen-centred network for investigating the intra-urban heterogeneity and inter-parameter dependency of all major urban climate and air quality metrics. The network comprises a matrix of eleven automatic weather stations, nested with a web of six automatic air quality stations, stretched across 2779 km², with an average spacing of 10.2 km. Six meteorological parameters and six air pollutants are recorded. The network focuses on Sydney's rapidly urbanizing western suburbs, but also extends to many eastern coastal sites where there are gaps in existing regulatory networks. Observations and metadata are available from September 2019 and undergo routine quality control, quality assurance and publication. Metadata, original datasets and quality-controlled datasets are open-source and available for extended academic and non-academic use.


Methods
Monitoring equipment. Air temperature, relative humidity, barometric pressure, wind speed and direction, as well as rainfall, are measured at each location using Vaisala WXT536 multi-parameter weather sensors 20. Wind is measured by a Vaisala WINDCAP® ultrasonic sensor that uses an array of three equally spaced transducers to determine horizontal wind speed and direction. Individual rain drops are detected by a Vaisala RAINCAP® piezoelectric sensor, while all other signals are recorded using capacitive sensors. The WXT536 is protected against flooding, clogging, wetting and evaporation losses, and is provided with the Vaisala Bird Spike Kit to reduce the interference caused by birds on wind and rain measurements. Vaisala weather sensors are also deployed in some of the aforementioned UMNs (see "Background & Summary"), namely in the OKCNET 7 (Vaisala WXT510 and Vaisala WINDCAP®), in the Helsinki Testbed 8 (Vaisala WXT510), and in the BULC 11 (Vaisala WXT520); SWAQ deploys the successor model of these instruments.
Six air pollutants (sulfur dioxide, nitrogen dioxide, carbon monoxide, ozone, PM10 and PM2.5) are measured at six locations using medium-cost, lightweight and compact Vaisala AQT420 air quality sensors 21. Proprietary intelligent algorithms compensate for the impact of ambient conditions and aging, allowing the use of affordable electrochemical sensors in lieu of costly gas sampling and conditioning equipment for large-scale deployment. Particulate matter is measured by a laser particle counter (LPC) that quantifies the angular light scattering engendered by particles in the 0.6 to 10 μm diameter range passing through the detection area. Particle size and concentration are estimated via digital signal processing (DSP), based on the spherical equivalent diameter. Range, accuracy and resolution for each variable are detailed in Table 1, along with overall dimensions, weight and power requirements.
The five weather stations (Vaisala WXT536 sensors only, hereinafter met stations) are powered by QMP201C 12 W solar panel units, mated with 12 V lead-acid or nickel-cadmium batteries 22. Each QMP201C is equipped with two boxes, one for the mains power supply (Vaisala Mains Power Supply Unit BWT15SXZ) and battery regulator (Vaisala Battery Regulator QBR101), and the other for a 7 Ah backup battery. The mains power supply operates on universal AC inputs and frequencies (85 to 264 VAC and 47 to 440 Hz). The output voltage (15 VDC) powers the sensors and charges the QBR101. The solar panel is provided with an angle-adjusting hand screw to set the site-optimized tilt precisely. Similarly, the six weather and air quality stations (Vaisala WXT536 and AQT420 sensors combined, hereinafter met + aqt stations) are powered by Ningbo Qixin Solar Electrical Appliance Co. SL30TU-18M panels (30 W peak power), each connected to a 12 V lead-acid or nickel-cadmium battery. All ancillary electronic components (e.g. LEDs) are regulated to maximize the autonomy time and draw little current (< 0.2 mA overall). One met + aqt station (STAT code "OEHS") is powered by mains power only, as it was installed at a regulatory site where direct access to the grid was available. When neither solar nor mains power is available, the battery working autonomy is approximately 4 days. Battery charging time depends strongly on solar radiation; in good conditions it takes about 4 days to charge the battery while also powering the system. Data transmission is performed via Multi-Observation Gateway MOG100 devices for all sensors, with a unique ID and Application Programming Interface (API) key per site 23. The MOG100 has dedicated connectors for the sensors and the solar panel, and operates as both a gateway and a logger device for the Vaisala WXT530 and AQT400 Series.
It comprises a GSM module for wireless communication, an additional battery regulator with input from the solar panel, and a memory for data logging and local buffering. Data are stored for approximately two days, with the oldest data being replaced first. The MOG100 operates at 8-30 VDC and has an average power consumption of 80 mW. As it is enclosed in an IP66-rated weatherproof aluminum casing, it can be installed directly outdoors. This is the case only for met stations, which do not require any extra battery to ensure continuing operation. For met + aqt stations, the MOG100 and additional solar power components (Vaisala Battery Regulator QBR101C and an extra 12 V lead-acid battery) are safely stored in a lockable IP66 weatherproof box.
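As a sanity check, the quoted ~4-day autonomy can be worked through from the backup battery's energy budget. The average system draw assumed below (gateway plus sensors and ancillaries) is an illustrative figure, not a published specification:

```python
# Back-of-envelope check of the quoted ~4-day battery autonomy.
# Assumed figure: an average combined draw of ~0.9 W for the MOG100
# (~80 mW), sensors and ancillary components.
BATTERY_VOLTAGE_V = 12.0
BATTERY_CAPACITY_AH = 7.0
AVG_DRAW_W = 0.9  # assumption, not from the equipment datasheets

energy_wh = BATTERY_VOLTAGE_V * BATTERY_CAPACITY_AH  # 84 Wh stored
autonomy_h = energy_wh / AVG_DRAW_W                  # ~93 h of operation
autonomy_days = autonomy_h / 24.0                    # ~3.9 days

print(f"{energy_wh:.0f} Wh -> {autonomy_days:.1f} days")
```

Under this assumed draw, the 7 Ah battery yields roughly four days of operation, consistent with the stated working autonomy.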
Sensors and gateways are installed following calibration and testing, performed directly by Vaisala in controlled conditions and documented in independent test reports. The data transfer stability (especially regarding solar energy availability) and the data quality were verified during an initial trial period that started in summer 2018. Validation was performed against the closest government station, as described in Di Virgilio et al. 15. Routine maintenance visits are performed as required, or otherwise at least annually, and metadata are updated accordingly. Maintenance typically includes cleaning of the radiation shields and the solar panels, battery checks, and visual inspection of cable integrity, mechanical stability and site clearance. Additional maintenance is performed at 12-36 month intervals, as detailed in Table 2, with recalibration every two years.
Data are recorded and transmitted at 20-minute intervals by the MOG100 to the Vaisala cloud service, Beacon View 24. The communication takes place via a 3.5G (4-band GSM) cellular modem with integrated SIM card and ready-to-use cellular data plan, through a secure HTTP protocol (HTTPS). The Beacon Cloud is a user-friendly, preconfigured, low-maintenance and scalable platform that i) ensures data integrity through embedded security features, ii) integrates and visualizes network-level data in near real-time, and iii) produces technical diagnostics on status and performance. Beacon's open API for third-party integrations was used to establish a live link with the Climate Change Research Centre (CCRC, UNSW, Sydney) central Beacon cloud server, the CCRC high performance computing (HPC) server "Storm", and the SWAQ website. More information is provided in the "Data Records" section.
Siting and metadata. Optimum site allocation was determined by undertaking a multi-criteria weighted overlay analysis exploring variables that may influence data representativeness (for example, distance from major roads) and variables that may influence the need for monitoring, such as the presence of vulnerable population groups and gaps in the current regulatory monitoring networks. The Australian Bureau of Meteorology (BoM) synoptic weather station network and the New South Wales Department of Planning, Industry and Environment (DPIE) air quality regulatory network were first assessed to determine locations with no current observation sites. Six non-sampled regions across the Sydney metropolitan area were identified. Each region was then analysed based on the following variables of interest: current and projected population density and proportion of vulnerable groups; socio-economic status, including level of education and household income; density of major roads, industrial areas and high-traffic areas; areas slated for urban growth; the mode of travel to work and number of cars per household; and local climate zones (LCZ). The layers were reclassified into a common evaluation scale of suitability or environmental risk from 1 to 10, with 10 indicating the most suitable location for placing a sensor. Schools in each region were then assigned a weighting between 1 and 10, and those scoring highly were prioritised for the network. The risk of low outdoor environmental quality was higher in areas i) more densely inhabited, ii) largely industrial, and iii) close to sections of high traffic. Combining appropriate siting and homogeneous spatial density required careful balancing of competing requirements 25,26. Beyond general considerations (e.g. vandalism, cost, site approvals), further challenges emerged as optimal siting is typically variable-specific 27.
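The weighted overlay can be illustrated with a minimal sketch. The layer names and weights below are hypothetical; the analysis described above specifies only that layers were reclassified to a common 1-10 scale and combined:

```python
# Illustrative multi-criterion weighted overlay for site prioritisation.
# Layer names and weights are hypothetical stand-ins for the actual GIS
# layers used in the siting analysis.
LAYER_WEIGHTS = {
    "population_density": 0.3,  # assumed relative importance per layer
    "road_density": 0.2,
    "industrial_area": 0.2,
    "network_gap": 0.3,
}

def overlay_score(layer_scores: dict) -> float:
    """Weighted mean of reclassified layer scores (each on a 1-10 scale)."""
    return sum(LAYER_WEIGHTS[k] * v for k, v in layer_scores.items())

# A hypothetical school in a densely populated area with a large gap in
# regulatory coverage scores highly and would be prioritised.
school = {"population_density": 9, "road_density": 6,
          "industrial_area": 4, "network_gap": 10}
print(overlay_score(school))  # 0.3*9 + 0.2*6 + 0.2*4 + 0.3*10 = 7.7
```

In a GIS implementation the same weighted sum is applied cell-by-cell across raster layers; the scalar version above conveys the scoring logic.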
Each SWAQ station measures 6 meteorological parameters through a single-body sensor, and 6 of them additionally detect 6 different air pollutants, again aggregated in a compact device, including both primary pollutants (which tend to be more localized to the emission sources) and secondary pollutants (which may accumulate further downwind). All related constraints resulted in a set of siting rules aimed at harmonizing the need for standardization with that for practical feasibility. Accordingly, all SWAQ sensors were installed:
• in homogeneous urban regions, without i) sections of anomalous variation in the regional urban makeup and surrounding aspect ratio, ii) local and mesoscale climate alterations (e.g. wind tunnels or sheltered areas, cold air drainage, fog regions, transition zones or other topographically-generated climate patterns), iii) unusually wet patches in an otherwise dry area, iv) individual buildings significantly different from the average, and v) large, concentrated heat/pollution sources or sinks or local spots of altered thermo-photochemistry 14,28,29;
• in areas falling into WMO Class 4 27, largely unshaded for sun elevations > 20° and with artificial heat sources and surfaces (e.g. buildings, asphalt car parks, concrete walls) covering < 50% and < 30% of the surface within circular areas of 10 m and 3 m radius around the sensors' screens, respectively. The selected areas were clear of high-power radio transmitters, antennas, power lines and generators that could have distorted the transmission;
• at a constant height of 2-3.5 m above ground level, on account of the dominant LCZs and thus the mean Urban Canopy Layer height (z_H). 2 m is the maximum height suggested by WMO 27; however, adjustments of up to +1.5 m were adopted due to security measures and mounting requirements. This is in line with the BULC UMN in the UK 11, where the height was fixed at 3 m.
The location of the stations is displayed in Fig. 2 with blue and black markers. Geographic and LCZ details are provided in Table 3, and land-use and land-cover details in Table 4. The minimum, average and maximum spacing are 3.7, 10.2 and 17.5 km, respectively, from −33.5995° to −34.0424° latitude and from 150.6918° to 151.2706° longitude. The SWAQ UMN complements by design the network of DPIE automated air quality and meteorology stations (met + aqt stations) and BoM automated weather stations (met stations). These regulatory stations are aimed at evaluating synoptic-scale conditions and are thereby sited to minimize the influence of urbanization. Fig. 2 clearly shows how the SWAQ UMN covers underrepresented areas by providing below-canopy observations. New sensors were installed by DPIE upon completion of our siting optimization analysis, which confirms the usefulness and representativeness of the analysis in better informing the Australian health protection system.

Data Records
The Beacon API was used with the "Storm" server at the University of New South Wales (UNSW) to download SWAQ raw data for analysis and archiving via a scheduled Python script. The script converts the downloaded raw data (in XML format) to structured JavaScript Object Notation (JSON) files for permanent storage in the UNSW Data Archive. All stations' outputs are stored as key-value pairs under the date and time stamp for each recording interval. A second script is then used to convert the JSON files to Comma Separated Value (CSV) files for later processing, with all stations' outputs concatenated horizontally. The headers take the general form "STAT_Variable", where "STAT" is the four-character station code (see Table 3) and "Variable" indicates the measured parameter (see "Symbol" in Table 1) or the Timestamp. Data points that fail one or more quality tests (see "Technical Validation" section) are flagged. The flags are horizontally concatenated to the raw output, with a dedicated column for each station and measurement, under the heading "STAT_Variable_Flags". All flags associated with the same data point are displayed as a semicolon-separated list.
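The horizontal concatenation step can be sketched as follows. Station codes, variable names and values here are illustrative only, not taken from the actual archive:

```python
import csv
import io

# Sketch of the JSON-to-CSV step: one row per 20-minute interval, with all
# stations' outputs concatenated horizontally under "STAT_Variable" headers.
records = {
    "2019-09-01T00:00:00": {"OEHS": {"ta": 18.2, "pm25": 6.1},
                            "UNSW": {"ta": 17.9, "pm25": 8.4}},
    "2019-09-01T00:20:00": {"OEHS": {"ta": 18.4, "pm25": 5.8},
                            "UNSW": {"ta": 18.0, "pm25": 8.1}},
}

# Build "STAT_Variable" headers from the first interval's station set.
first = next(iter(records.values()))
headers = ["Timestamp"] + [f"{stat}_{var}"
                           for stat in sorted(first)
                           for var in sorted(first[stat])]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=headers)
writer.writeheader()
for ts, stations in records.items():
    row = {"Timestamp": ts}
    for stat, variables in stations.items():
        for var, value in variables.items():
            row[f"{stat}_{var}"] = value
    writer.writerow(row)

print(buf.getvalue())
```

The actual pipeline additionally appends the "STAT_Variable_Flags" columns in the same horizontal fashion.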
This raw dataset, inclusive of all stations, all parameters and corresponding flags, is stored with the identifier "YYYY-MM-DD_Raw". Raw data are stored alongside a second CSV file called "YYYY-MM-DD_Cleaned". This is a ready-to-use dataset, quality-controlled as recommended by SWAQ's technicians. The cleaning procedure is described in the following section. Both datafiles are available from the Australian Terrestrial Ecosystem Research Network (TERN) data portal 30. The associated Zenodo record contains the metadata files.
Date and time in both the Raw and Cleaned data files are ISO-8601-compliant.

Technical Validation
Data quality in wireless networks like SWAQ depends on each element along the line that connects the sensed environment to the final user (e.g. power line, detectors, loggers, transmitters) and eventually determines the level of user acceptance and reliance 31 .
Quality assurance and control (QA/QC) involves different methods performed not just to ensure the quality of the data, but also to preserve and prolong the service life of the equipment. QA includes periodic station maintenance and field sensor checks as detailed in the metadata files, whereas QC includes tests routinely performed on the data output to identify defective functioning and incorrect readings. However, some of nature's most intriguing and life-threatening phenomena produce data that fail most automated QC tests 32. In view of the escalation of extreme weather and pollution events worldwide, and especially in urbanscapes, the QC procedures are designed to ensure observations of extreme episodes are not excluded.
QC on SWAQ data is performed monthly through an automated script in Python 3.9.2. In line with the Oklahoma Mesonet 33, the Birmingham Urban Climate Laboratory network 11 and the World Meteorological Organization 34, quality control flags are used to mark erroneous and suspicious data points according to a defined set of filters. The flags supplement but do not alter the original data 35, entrusting the ultimate decision on deleting or preserving flagged recordings to the end user. Fig. 3 schematizes the filtering and flagging systems. In line with the 6 Ws of the SWAQ sensor network (Fig. 1), both systems are conceived to maximize data preservation and to retain a substantiated narrative of climatological and air quality extremes.
Filters include continuity tests, fixed range tests (on both physical and instrumental limits), dynamic range and step tests (both performed on a monthly basis), internal consistency tests (on known atmospheric relations) and persistence tests. The continuity test verifies that the record structure is correct, complete and without any gaps in time. The fixed range tests look for non-physical or out-of-range data. Instrumental limits were derived from equipment specification sheets, except for PM10: manual inspection of PM10 data revealed saturation at 3276.2 µg/m³, which was thus set as the upper bound in fixed range tests. Dynamic range and step tests examine the relative magnitude of a given data point with respect to the statistical distribution of the same variable across the dataset; the former looks at absolute values, while the latter evaluates the rate of change of consecutive values. Lower and upper outlier thresholds for dynamic range and step tests are calculated monthly, rather than on an annual or seasonal basis, to implicitly account for seasonal cycles and to guarantee greater comparability during extreme episodes such as heat waves, droughts, thunderstorms, cold spells and bushfires. The outlier definition is stricter for step tests than the standard definition applied for dynamic range tests (refer to Fig. 3), on account of Sydney's extraordinary meteorological dynamicity, extensively reported in the literature and confirmed by routine statistical analysis [36][37][38]. Site-specific limit bounds defined from prior experience are customary across UMNs 11,33,39. The dynamic range test is applied to all variables but rain, RH and wind direction, whereas the step test is applied to all variables but rain and wind direction. No internal consistency test is in place for rain, as the criterion entails extensive cloud cover on top of high humidity levels, which would exclude most of the short-lived events that typify the region 35,40.
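A minimal sketch of the fixed range and dynamic range/step logic follows. The IQR-based outlier rule and its multipliers are assumptions standing in for the actual monthly thresholds of Fig. 3, and the instrumental limits shown are illustrative apart from the PM10 saturation bound:

```python
import statistics

# Illustrative instrumental limits; the PM10 upper bound reflects the
# observed saturation value, the temperature limits are assumed.
FIXED_RANGE = {"ta": (-40.0, 60.0), "pm10": (0.0, 3276.2)}

def fixed_range_flags(values, var):
    """Flag values outside the instrumental limits."""
    lo, hi = FIXED_RANGE[var]
    return [not (lo <= v <= hi) for v in values]

def iqr_bounds(values, k):
    """Outlier bounds as quartiles +/- k interquartile ranges (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

def dynamic_range_flags(values, k=3.0):
    """Flag absolute values outlying the month's distribution."""
    lo, hi = iqr_bounds(values, k)
    return [not (lo <= v <= hi) for v in values]

def step_flags(values, k=1.5):
    """Flag unusually large changes between consecutive values (stricter k)."""
    steps = [abs(b - a) for a, b in zip(values, values[1:])]
    _, hi = iqr_bounds(steps, k)
    return [False] + [s > hi for s in steps]

month = [18.0, 18.2, 18.1, 17.9, 55.0, 18.3, 18.2, 18.0]  # toy monthly sample
print(dynamic_range_flags(month))  # flags only the 55.0 spike
```

Because the bounds are recomputed each month from the data themselves, a genuinely hot month shifts the thresholds upward rather than flagging every warm reading.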
A 3-hour persistence criterion is applied as described in Meek and Hatfield 41 to all variables, except rain.
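With 20-minute recording intervals, 3 hours corresponds to 9 consecutive records, so the persistence criterion can be sketched as a sliding-window check for stuck values. The zero tolerance and exact window arithmetic are assumptions for illustration:

```python
# Sketch of a 3-hour persistence test: a value that does not change over
# 9 consecutive 20-minute records (3 h) suggests a stuck sensor.
def persistence_flags(values, window=9, tol=0.0):
    flags = [False] * len(values)
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        if max(chunk) - min(chunk) <= tol:   # no variation across the window
            for j in range(i, i + window):
                flags[j] = True
    return flags

series = [21.3] * 9 + [21.5, 21.8]  # first 3 h flat, then variation resumes
flags = persistence_flags(series)
print(flags)
```

Rain is excluded from this test, as above, because long runs of identical (zero) readings are meteorologically normal for it.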
Table 4. Land use and land cover attributes at each SWAQ site. Data extracted from Geoscape surface cover and buildings datasets 44. *Average in a 500 m radius. **Average in a 500 m radius, followed by average in a 50 m radius in brackets.

The flagging system embraces a two-fold dimension, individual and combinatorial. A Single Test Flag (STF) is first applied, following the sequence in Fig. 3. The coding takes the general form STFx.y, where the first digit (x) denotes increasing severity and decreasing confidence level from good to suspicious, erroneous and missing, whereas the second digit (y) discriminates across different filters. Months having more than 10% missing or erroneous data are issued a warning flag (STF4.2) to signal the lack of a proper statistical sample for dynamic range and step tests. Removal of all STF-flagged data points does not conserve extreme events, as most localized phenomena tend to be erroneously flagged when such algorithms are taken individually 42. The Combinatorial Flag (CF) system attempts to mitigate this risk by using Boolean operators to combine STFs. The coding takes the general form CFx. In the CF system, only data points simultaneously failing the dynamic range test and the step test are eventually CF-flagged as suspect, since they mark sensor spikes or isolated jumps. The CF system captures the magnitude and duration of extreme events with little distortion even when all flagged recordings are removed.
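The Boolean combination at the heart of the CF system can be sketched as follows. The flag labels are simplified placeholders for the CFx codes of Fig. 3:

```python
# Sketch of the combinatorial flag: a point is marked suspect only if it
# fails BOTH the dynamic range test AND the step test, so an extreme that
# trips a single filter (e.g. a genuine heatwave peak) is preserved.
def combinatorial_flags(dynamic_fail, step_fail):
    return ["CF-suspect" if d and s else "CF-good"
            for d, s in zip(dynamic_fail, step_fail)]

dynamic_fail = [False, True, True, False]   # e.g. heatwave peak + sensor spike
step_fail    = [False, False, True, False]  # only the spike is a sharp jump
print(combinatorial_flags(dynamic_fail, step_fail))
# the heatwave peak (index 1) survives; the isolated spike (index 2) is flagged
```

The AND combination is what lets removal of all CF-flagged points leave the magnitude and duration of real extreme events largely intact.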
The percentage of good (STF0, CF0) data is close to 90% on average, slightly lower in summer, which suggests adequate solar powering. Pollutants (especially PM 2.5 ) are much more frequently flagged, given the difficulty of discerning real spikes due to local emissions or advection from erroneous measurements. However, utilizing the CF system over the STF system helps to restore episodes of consistently poor air quality. The lowest percentages are typically associated with prolonged persistence test rejection, missing values and fixed range test failure.
The original data, as stored in the "YYYY-MM-DD_Raw" datafile, require critical usage (refer to the "Usage notes" section). Conversely, the ready-to-use "YYYY-MM-DD_Cleaned" dataset is filtered so as to ensure both the maximum reasonable standard of accuracy and the minimum data deletion, for optimum use of the data across different urban disciplines. Cleaning involves the following sequential steps: i) replacing all negative pollutant values with zero, ii) replacing RH and wind direction values slightly crossing the physical boundaries with the boundaries themselves, iii) removing all data points failing the instrumental fixed range test, and iv) removing all data flagged as CFx, with x > 1.
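The four steps can be sketched per data point as below. The ±2 tolerance used to interpret "slightly crossing" the physical bounds, the column semantics and the use of NaN for removed points (keeping the time grid regular) are all assumptions for illustration:

```python
import math

# Sketch of the four sequential cleaning steps.
PHYSICAL = {"rh": (0.0, 100.0), "wd": (0.0, 360.0)}  # physical bounds
INSTRUMENTAL = {"pm10": (0.0, 3276.2)}               # instrumental limits

def clean_point(var, value, cf_suspect=False):
    if var.startswith("pm") and value < 0:
        return 0.0                                   # step i: zero negatives
    if var in PHYSICAL:
        lo, hi = PHYSICAL[var]
        if lo - 2 <= value <= hi + 2:                # assumed tolerance for
            return min(max(value, lo), hi)           # step ii: clip to bounds
    if var in INSTRUMENTAL:
        lo, hi = INSTRUMENTAL[var]
        if not (lo <= value <= hi):
            return math.nan                          # step iii: drop failures
    return math.nan if cf_suspect else value         # step iv: drop CF-suspect

print(clean_point("pm10", -0.4))   # -> 0.0
print(clean_point("rh", 100.7))    # -> 100.0
```

Applied column by column to the raw CSV, this reproduces the spirit of the Cleaned dataset while leaving the Raw file untouched.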

Usage notes
SWAQ data are cleaned according to robust QA/QC procedures and presented in a user-friendly fashion. The "YYYY-MM-DD_Raw" datafile is meant for data analysts, scientists and expert users as it maintains the raw information intact, while flagging each test failed. The "YYYY-MM-DD_Cleaned" datafile is meant for the broader public as data are already filtered based on extensive in-house expertise in urban climatology and phenomenology.
Considering all the constraints in pursuing optimal site allocation, it is highly recommended to consult the metadata prior to data use. Further, it is suggested to run a final manual check aimed at identifying and removing likely unreliable data not picked up by the automatic tests, such as isolated (single-site) measurements twice the average maximum across all other locations, disturbances during QA operations recorded in the metadata, or temporary sensor failures (e.g. LEPP_PM10 from 2019-10-01T00:00:00 to 2019-10-03T02:00:00).
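The suggested cross-site check could be automated along the following lines. This is a simplified variant that compares each site against twice the maximum of the other sites for the same interval; station codes and values are illustrative:

```python
# Sketch of the suggested manual cross-site check: flag any site whose
# reading exceeds twice the maximum recorded at every other site for the
# same 20-minute interval.
def isolated_outliers(readings):
    flagged = []
    for site, value in readings.items():
        others = [v for s, v in readings.items() if s != site]
        if others and value > 2 * max(others):
            flagged.append(site)
    return flagged

interval = {"OEHS": 8.0, "UNSW": 9.5, "LEPP": 41.0}  # e.g. PM10 in ug/m3
print(isolated_outliers(interval))  # -> ['LEPP']
```

Any flag from such a check should still be weighed against the metadata, since a single-site spike can also reflect a genuine local emission.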
The data and metadata files include an additional met + aqt station placed on the University of New South Wales campus (STAT code = UNSW). UNSW is part of the SWAQ network, but its siting and metadata have unique features that require special attention before use. Indeed, the station is located in a car park, under scattered trees (due to setting constraints within the University campus). UNSW data should be used and interpreted on account of local emissions of heat and pollutants, as well as potential power insufficiencies.
In addition to collecting data for urban climate and air quality research, the SWAQ network is first and foremost a citizen-centred network. The project promotes STEM in schools, by providing them with access to scientific instruments and contact with research scientists within the local context that is relevant to their community. Students learn valuable STEM skills through directly being involved in the observation and analysis of the meteorological and air quality data. School teachers and students are able to monitor conditions at their school in real time and relate how changes in local pollution concentrations are driven by variation in local meteorological conditions, or how the onset of events such as bushfires, heatwaves, or thunderstorms can affect air quality. The project has produced curriculum-aligned lesson plans that use the SWAQ data.
These lesson plans are freely available on the SWAQ website (https://www.swaq.org.au/education) and are regularly presented at science teachers' conferences.
The data portal and visualisation of data at www.swaq.org.au/explore were developed in consultation with school students via concept testing workshops and provide timely weather and air quality data which can be freely accessed by anyone. Further, the website visualisations provide data found to be most useful and relevant to school students and members of the general public alike, with guidance on how to read the graphs and easily understandable descriptions of each of the variables presented.

Code availability
The code used for technical validation is publicly available in the SWAQ repository on GitHub: https://github.com/giuliaulpiani/SWAQ.