Background & Summary

Indoor air pollution has become a major global public health threat that requires joint efforts from policymaking, academic research, and industrial innovation. In most cases, indoor air is more polluted than outdoor air1. For instance, the concentrations of particulate matter (PM), formaldehyde (HCHO), and volatile organic compounds (VOCs) in indoor air can occasionally be several times higher than those found in outdoor air, imposing substantial health threats to indoor occupants2,3. Formaldehyde and PM are categorized as carcinogenic substances and could cause severe diseases such as lung cancer under long-term exposure4. Existing research indicates that exposure to indoor air pollution could contribute to excess deaths in developing countries, accounting for 4% of the global disease burden5. Other studies have also confirmed the adverse effects of VOC and particle exposure on public health, such as chronic body damage6,7,8, cancer9,10,11, heart diseases12,13,14, lung diseases15,16, and skin inflammation17.

Household air purifiers can improve indoor air quality by reducing concentrations of harmful pollutants in indoor air, such as fine particles, harmful gases and bacteria18. The application of air purifiers is therefore becoming increasingly popular in residential buildings. For instance, a study in China revealed that the use of air purifiers could significantly reduce indoor PM2.5 concentrations and positively affect population health19. Zhang et al.20 reviewed relevant research on applying air filtration technologies to improve indoor air quality, which used mechanical filters to efficiently remove larger particles and adopted electrostatic force-enhanced filtration to achieve large dust loading capacity and ultralow pressure drops21. Tian et al.22 provided a comprehensive summary of the working principles and performance of novel electrostatic force-enhanced filtration technologies. In their analysis, Tian et al. emphasized the importance of considering both the initial and long-term performance of these technologies. They found that charging PM or filters could significantly improve filtration efficiency and overall performance. Furthermore, the performance of these technologies was found to vary when exposed to non-oily particles, oily particles, and bioaerosols. As alternatives, adsorption technologies have also gained increasing popularity in air purifiers due to their efficiency and affordability23,24, although the pollutant removal capability varies between sorbents. Chen et al.25 thoroughly reviewed the methods for enhancing mass transfer in gas phase adsorption catalysis using the Direct Ink Writing (DIW) technique. The review encompassed various aspects such as raw materials, preparation processes, auxiliary optimization methods, and practical applications, and it also discussed the potential prospects and challenges of employing DIW methods to achieve efficient mass transfer kinetics. Air purifier efficiency has been greatly improved along with these advances in filtration and adsorption technology. As a result, air purifiers have been widely used to reduce indoor pollutant concentrations, improve indoor air quality and reduce the risk of associated diseases.

Air purifiers are typically equipped with various sensors for real-time measurements of formaldehyde, carbon dioxide, VOCs, etc. Currently, most studies focus on evaluating the environmental changes before and after using air purifiers while using external sensors rather than built-in sensors in air purifiers. Huang et al.26 used indoor PM concentration changes to develop predictive models for automated and real-time controls over air purifiers. Cooper et al.27 utilized the relationship between air purifier operations and indoor PM concentrations to make inferences on indoor occupant behavior and guide the use of air purifiers. Using indoor environmental parameters collected, Pei et al.28 analyzed the operational behavior of portable air purifiers in Chinese residences and evaluated their overall performance in improving indoor air quality. Malayeri et al.29 utilized neural networks and genetic algorithms to design photocatalytic reactors for air purifiers. Previous studies have shown that more detailed data on air purifiers are helpful for developing improved predictive models, facilitating product design, and guiding human behavior inferences.

Utilizing real-time data collected by built-in sensors in air purifiers, prediction models of indoor pollutant concentrations can be readily established to facilitate the optimal design and operation of air purifiers. Moreover, leveraging experience learned in diverse environments, artificial intelligence models can be developed for the easy and quick adaptation of air purifiers to new settings. At present, there is a scarcity of datasets that provide comprehensive coverage and descriptions of indoor environmental parameters and the corresponding air purifier operating conditions. This paper presents a dataset of air purifier operations to address this gap. The uniqueness of the dataset lies in the collection of one-year hourly data from 100 air purifiers in residential buildings dispersed across 18 provinces in China. The dataset includes the air purifier’s working status and airflow rate together with 5 monitored variables, i.e., formaldehyde concentration, PM2.5 concentration, total volatile organic compound (TVOC) concentration, temperature, and relative humidity. Such a dataset can be used for the following applications:

  • To derive optimal design guidance for new air purifiers.

  • To guide occupants’ use of air purifiers through indoor human behavior predictions.

  • To develop and validate prediction models on indoor pollutant emissions.

  • To perform area-level data analysis using statistical or machine learning algorithms.

  • To develop cost-effective control methods for air purifier operations.

  • To detect and diagnose faults in air purifier operations.

Methods

Data acquisition and anonymization

The original dataset was collected from July 1, 2021, to July 1, 2022, and consists of hourly data from 100 air purifiers situated across 18 Chinese provinces (Fig. 1). Detailed information on the distribution of air purifiers can be found in Table 1. Beijing 352 Environmental Protection Technology Co. LTD provided the raw information to the authors via collaboration. The air purifier model is the X86 Smart Air Purifier. Interested researchers can access the raw data and the processed dataset for free by visiting the website given in the “Data Records” section. The raw data were collected by various built-in sensors installed inside the air purifier. Before using the purifier, residents were asked whether they agreed to the collection, sharing, and publishing of the measurements collected in their homes. After the user’s consent was obtained, the air purifier recorded the current indoor air parameters every hour during operation and uploaded them to the cloud for storage via Wi-Fi. The collected data have been anonymized to avoid possible violations of privacy.

Fig. 1

Air purifiers in 18 provinces across 4 different climate zones in China. Each translucent red circle represents an air purifier; the darker the circle colour, the higher the density of air purifiers in that area.

Table 1 Air purifier distribution details.

As summarized in Table 2, the dataset consists of nine variables in total. The first three variables represent the data collection timestamp, machine ID, and on/off status. The remaining six variables represent measurements of airflow rate and indoor air quality. From the original 5582 datasets provided by the company, 100 air purifiers with a missing data ratio of no more than 7% were selected. Their data were recorded in 100 CSV files, each containing around 8760 rows. In the preliminary data collection phase, the project team contacted potential data providers with the following requirements to ensure validity:

  • The data must be completely derived from the designated site to reflect actual occupant behaviors in real buildings.

  • The dataset should contain all five air quality parameters (formaldehyde concentration, PM2.5 concentration, TVOC concentration, temperature, and relative humidity), and the data should be traceable to its corresponding machine.

  • The total length of the hourly data should be no less than one year.

Table 2 Columns of the raw and processed data, including their reporting resolution.

Data selection and processing

The processing of the original data is shown in Fig. 2. Generally, a missing rate below 10% is deemed low. To ensure improved data stability in the subsequent data imputation, a stricter threshold of 7% was selected for the initial data screening. To determine the missing rate of each air purifier, we evaluated the missing rate of each parameter separately and took the highest value as the missing rate of that air purifier. Air purifiers with missing rates exceeding 7% were then excluded from the data pool, and the remaining ones were retained as the initial dataset. After the initial dataset was established, the missing values were filled in. Afterward, data cleaning methods were applied to remove outliers, which were then imputed in turn. This process was repeated until the dataset contained no missing values.
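The screening rule can be illustrated with a minimal sketch. It assumes each air purifier is held as a dictionary of parameter columns in which missing readings are stored as None; the function names, the parameter keys and the toy records are hypothetical and are not taken from the released files.

```python
def purifier_missing_rate(columns):
    """Return the highest per-parameter missing rate of one air purifier.

    columns: dict mapping a parameter name to its list of hourly values,
    with None marking a missing reading (an assumed in-memory layout).
    """
    return max(
        sum(value is None for value in values) / len(values)
        for values in columns.values()
    )


def screen_purifiers(purifiers, threshold=0.07):
    """Keep the IDs of purifiers whose missing rate does not exceed the threshold."""
    return [
        purifier_id
        for purifier_id, columns in purifiers.items()
        if purifier_missing_rate(columns) <= threshold
    ]


# Toy records: purifier "A" misses 5% of PM2.5 readings, "B" misses 10% of HCHO readings.
purifier_a = {"pm25": [None] * 5 + [12.0] * 95, "temperature": [None] * 2 + [21.0] * 98}
purifier_b = {"hcho": [None] * 10 + [0.03] * 90, "temperature": [21.0] * 100}
retained = screen_purifiers({"A": purifier_a, "B": purifier_b})
```

In this toy case only purifier "A" is retained, because its worst parameter is missing 5% of its readings while purifier "B" reaches 10%.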

Fig. 2

Data processing flowchart.

Data Records

The dataset is available on figshare30 [https://doi.org/10.6084/m9.figshare.24278101]. The raw data are organized in a zipped file named “anonymized_raw_data”. The file size is 272 MB (3.90 GB before compression). It contains 5582 CSV files with anonymized raw data, the primary data obtained from Beijing 352 Environmental Protection Technology Co. LTD. The zipped file named “processed_final_data” contains 100 CSV files, each containing data processed with the method described above. The two zip files are packaged in a file called “One-year dataset of hourly air quality parameters from 100 air purifiers used in China residential buildings”. All the files in both folders cover the data collection period from July 1, 2021, to July 1, 2022. The only difference among the machines lies in their usage locations; the model number and duration of use are the same for all of them. The CSV file named “Machine parameters (with ID and Cities)” contains the iot_id of each air purifier and its corresponding city. The data are also available on GitHub [https://github.com/weijiaze/Scientific-data].

Technical Validation

Data pre-processing

Data selection

As shown in Fig. 3, most of the raw datasets exhibit substantial missing values. To ensure the reliability of data imputation, this study selects only raw datasets with a missing ratio of no more than 7% for analysis, as shown in Fig. 2.

Fig. 3

Distribution of the missing ratio for the raw data. The x-axis indicates the missing ratio in intervals of 7%, and the y-axis indicates the number of air purifiers in each interval. The one hundred air purifiers in the 0–7% interval were selected as the initial data.

Data imputation

In this study, missing value imputation methods have been applied to two groups of data separately, i.e., one for the on/off status and airflow rate, and the other for the air quality parameters.

a. imputation of on/off status and airflow rate

Based on data exploration, it is found that the on/off status and airflow rate become missing at the same time due to random discontinuous data loss. Therefore, the following method has been used to impute the missing data. The imputation of the on/off status is described as an example; the airflow rate is imputed using the same approach. (1) If there is no change in the on/off status (i.e., denoted as 1 or 0) before and after the missing data, the missing data are considered to be consistent with the adjacent measurements. In this case, the Next Observation Carried Backward (NOCB) method is used for data imputation, which fills each missing value with the closest observation that follows it. Because the status is identical on both sides of the gap, this gives the same result as filling with the preceding observation. (2) If the on/off status before and after the missing data differs, the on/off status of the missing data is inferred from the indoor air quality measurements. As an example, assuming the on/off status of the previous non-missing data sample is off (i.e., denoted as 0) and the indoor air quality measurements (e.g., formaldehyde concentration) are decreasing, the on/off status will be imputed as on (i.e., denoted as 1). The same method can be applied to the reverse situation.
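The two cases can be sketched as follows. This is a minimal illustration rather than the processing code used for the dataset: it assumes the pollutant series has no gaps, fills gaps at the very start or end of the record from the only available neighbour, and keeps the previous status when the concentration is unchanged. These tie-breaking choices and the function name are assumptions.

```python
def impute_status(status, concentration):
    """Fill missing on/off values (None) in an hourly status series.

    status: list of 0, 1 or None.
    concentration: list of pollutant readings of the same length,
    assumed complete for this sketch.
    """
    result = list(status)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        start = i
        while i < n and result[i] is None:
            i += 1
        end = i  # first index after the gap
        before = result[start - 1] if start > 0 else None
        after = result[end] if end < n else None
        if before is None and after is None:
            continue  # the whole series is missing, nothing to infer from
        if before is None or after is None or before == after:
            # Case (1): carry the next observation backward (NOCB);
            # fall back to the previous one for a gap at the end.
            fill = after if after is not None else before
            for j in range(start, end):
                result[j] = fill
        else:
            # Case (2): infer the status from the concentration trend.
            for j in range(start, end):
                if concentration[j] < concentration[j - 1]:
                    result[j] = 1  # falling concentration: purifier assumed on
                elif concentration[j] > concentration[j - 1]:
                    result[j] = 0  # rising concentration: purifier assumed off
                else:
                    result[j] = result[j - 1]  # unchanged: keep previous status
    return result


status = [0, None, None, 1, 1, None, 1, 0]
hcho = [50, 52, 45, 40, 38, 36, 35, 37]
filled = impute_status(status, hcho)
```

In the toy series, the first gap lies between an off and an on reading, so the trend rule decides each hour, while the second gap lies between two on readings and is filled by NOCB.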

b. imputation of air quality parameters

For the concentrations of formaldehyde, PM2.5, and TVOC, as well as temperature and relative humidity, Multiple Imputation by Chained Equations (MICE)32,33 has been used for data imputation. Multivariate imputation techniques can offer significantly improved capabilities in filling substantial gaps within the dataset by transforming time series data into matrices, with each row representing a weekly cycle. Although MICE requires substantially greater computational resources, it has been reported to be robust: irrespective of sensor categories, missing rates, or gap sizes, MICE consistently produced imputations with an average NRMSE of 1 standard deviation for each sensor31. In this study, prediction models have been developed to estimate each missing value using data from the other columns near the missing value as features. More specifically, linear regression is employed for predicting continuous missing values, while logistic regression is utilized for classifying categorical missing values.
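The chained-equations idea can be illustrated with a simplified, deterministic sketch for continuous columns. It is not the implementation used for this dataset: full MICE draws each imputed value from a predictive distribution and produces several completed datasets, whereas this sketch only cycles linear regressions until the filled values settle. The function names and the toy table are hypothetical, and the regressors are assumed not to be collinear.

```python
import statistics


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(xs, ys):
    """Least-squares coefficients (intercept first) from the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def chained_imputation(data, n_iter=10):
    """Impute None entries column by column, regressing each column on the others."""
    n_cols = len(data[0])
    missing = [[value is None for value in row] for row in data]
    means = [
        statistics.fmean(row[c] for row in data if row[c] is not None)
        for c in range(n_cols)
    ]
    # Start from a mean fill, then refine each column in turn.
    filled = [[means[c] if v is None else v for c, v in enumerate(row)] for row in data]

    def others(r, c):
        return [filled[r][j] for j in range(n_cols) if j != c]

    for _ in range(n_iter):
        for c in range(n_cols):
            observed = [r for r in range(len(data)) if not missing[r][c]]
            to_fill = [r for r in range(len(data)) if missing[r][c]]
            if not to_fill:
                continue
            beta = fit_linear([others(r, c) for r in observed],
                              [filled[r][c] for r in observed])
            for r in to_fill:
                filled[r][c] = beta[0] + sum(b * x for b, x in zip(beta[1:], others(r, c)))
    return filled


# Toy table in which the third column equals 1 + 2*x + 3*y; one value is removed.
pairs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (2, 3)]
table = [[float(x), float(y), 1.0 + 2.0 * x + 3.0 * y] for x, y in pairs]
table[4][2] = None
completed = chained_imputation(table)
```

Because the toy relationship is exactly linear, the regression recovers the removed value; real sensor data would leave a residual error that the multiple-imputation step is meant to capture.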

Data cleaning

The PauTa criterion34, a statistical method that identifies values deviating from the mean by more than three standard deviations as outliers, is used for outlier detection in this study. The criterion was applied dynamically over a sliding window of 7 days. Any outlier identified using this criterion is replaced using the above data imputation methods.
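A sliding-window version of the criterion might look like the sketch below. The alignment of the window is not stated in the text, so a window centred on each sample and truncated at the ends of the series is assumed here, and the series is assumed to contain no missing values; the function name is hypothetical.

```python
import statistics


def pauta_outliers(values, window=7 * 24):
    """Return indices of samples more than three standard deviations
    from the mean of their surrounding window.

    window: window length in hourly samples (7 days by default); the window
    is centred on each sample, which is an assumption of this sketch.
    """
    half = window // 2
    flagged = []
    for i, value in enumerate(values):
        segment = values[max(0, i - half): i + half + 1]
        mean = statistics.fmean(segment)
        std = statistics.pstdev(segment)
        if std > 0 and abs(value - mean) > 3 * std:
            flagged.append(i)
    return flagged


# Toy hourly series with a repeating pattern and one spike at index 50.
series = [10.0 + (i % 5) for i in range(100)]
series[50] = 100.0
outliers = pauta_outliers(series, window=24)
```

The flagged samples would then be set to missing and passed back to the imputation step, in line with the iterative procedure described for the processing workflow.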

Because the data originate from a single non-randomized case study, measurement accuracy is critical: no alternative use cases are available for cross-checking outliers or imputation approaches. The requirements on the validity of the imputation are also stricter here because the missing data are scattered throughout the records. This research utilizes the MICE method to estimate missing data points by considering the distribution of the original dataset. The method generates a series of complete datasets, typically 3 to 10, by filling in the missing values based on the values present in the original dataset. Subsequently, standard statistical methods are applied to each complete dataset, and the outcomes of these separate analyses are combined to form a comprehensive set of results. Finally, the imputation results are compared with the distribution curve of the original data: this study generates 10 imputation results and selects the one that best fits the original distribution curve as the final data. The comparison of the imputation results and the raw data indicates that the data distribution after imputation closely matches the distribution before imputation, which supports the reliability of the dataset. Overall, MICE provides a robust approach for handling missing data by generating multiple complete datasets and using statistical techniques to integrate the results, which enhances the validity and reliability of subsequent analyses.
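The selection among several imputation results can be sketched with a distribution-distance measure. The text does not state how the fit to the original distribution curve was quantified, so the two-sample Kolmogorov-Smirnov statistic used below is an assumption, as are the function names and the toy values.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical distribution functions of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def select_best_imputation(observed, candidates):
    """Return the index of the candidate series closest in distribution
    to the observed values, together with all distances."""
    scores = [ks_statistic(observed, candidate) for candidate in candidates]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return best, scores


# Toy example: three completed series compared against the observed values.
observed = [1, 2, 3, 4, 5]
candidates = [[10, 11, 12, 13, 14], [1, 2, 3, 4, 6], [3, 3, 3, 3, 3]]
best_index, distances = select_best_imputation(observed, candidates)
```

In practice each candidate would be one of the 10 completed series for a given parameter, and the observed sample would be the non-missing raw readings of that parameter.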

Removing outliers and filling in missing values is essential for presenting the processed data accurately: the resulting curves vary gradually and consistently, without sudden or steep fluctuations, so that trends can be seen clearly. Figure 4 shows the variations of all processed parameters of a specific air purifier installed in Shaanxi Province over one day. The temperature and humidity fluctuate regularly with time, while the peaks of air pollutants usually appear at around 9:00 am and 9:00 pm, and the air purifier gears are adjusted according to changes in pollutant concentrations. Around noon, the pollutant concentration typically decreases to a lower value, leading to the automatic shutdown of the purifier. This is why the peak pollutant concentration at night is higher than that during the day.

Fig. 4

Changes in all monitored parameters of the air purifier over one day (November 25, 2021, 7:00 am to November 26, 2021, 6:00 am). The green shading indicates periods when the air purifier is on, and the orange shading indicates periods when it is off.

Figure 5 shows another example of parameter variations with the on/off status over a 2-day period from the same device installed in Shaanxi Province. The sampling interval was set to one hour, and the air purifier was set to auto-regulation mode, recording the average concentration of pollutants within each hour and the corresponding status of the air purifier. It can be seen that the air purifier automatically turns on when the pollutant concentration rises above the threshold and keeps running until the concentration returns to the normal level, which indicates effective control of indoor pollutants.

Fig. 5

Indoor air quality parameters (PM2.5, HCHO, TVOC) coupled with the on/off status over two days (November 25, 2021, 6:00 am to November 27, 2021, 5:00 am). When on/off status = 1, the air purifier is on; when on/off status = 0, the air purifier is off.

Usage Notes

The dataset can serve as a foundation for future investigations. Variations in indoor pollutant concentrations can be used to discern specific occupant behaviors within the enclosed space and to ascertain the occupant’s daily schedule35. For instance, if there is a sudden surge in pollutant concentration at 12:00 noon or 6:00 pm, coinciding with the activation of the air purifier, we can infer that the spike is likely attributable to cooking activities by the user. Conversely, if the air purifier remains inactive while pollutant levels decrease concurrently with shifts in temperature and humidity, it could be hypothesized that the occupant has opened doors or windows. Furthermore, real-time local data on temperature, humidity, and pollutant concentration can be retrieved and compared against the data from the air purifier itself. This comparative analysis makes it possible to validate the air purifier’s efficacy in improving indoor air quality over time.
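The cooking example above can be turned into a simple rule-based sketch. The record layout, the function name and the surge threshold of 20 concentration units per hour are hypothetical choices for illustration only; they are not values reported for this dataset, and any real analysis would need to calibrate them.

```python
def flag_cooking_candidates(records, min_rise=20.0, meal_hours=(12, 18)):
    """Flag hours that look like cooking events.

    records: chronological list of (hour_of_day, pm25, on_off) tuples,
    one per hourly sample (an assumed layout).
    A sample is flagged when it falls at a meal hour, the concentration
    rose by at least min_rise since the previous hour, and the purifier
    switched from off to on.
    """
    flagged = []
    for i in range(1, len(records)):
        hour, pm25, status = records[i]
        _, previous_pm25, previous_status = records[i - 1]
        surge = pm25 - previous_pm25 >= min_rise
        activated = previous_status == 0 and status == 1
        if hour in meal_hours and surge and activated:
            flagged.append(i)
    return flagged


# Toy day: a sharp rise at noon with purifier activation, a mild rise at 6 pm.
day = [(10, 15.0, 0), (11, 16.0, 0), (12, 60.0, 1), (13, 30.0, 1),
       (17, 18.0, 0), (18, 25.0, 1), (19, 20.0, 1)]
cooking_hours = flag_cooking_candidates(day)
```

Only the noon sample is flagged in the toy day, because the 6 pm rise stays below the assumed threshold even though the purifier switches on.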