Room-level occupant counts and environmental quality from heterogeneous sensing modalities in a smart building

The research areas of occupant sensing and occupant behavior modeling lack comprehensive public datasets for providing baseline results and fostering data-driven approaches. This data descriptor covers a sensor-collected dataset of room-level occupant counts together with related data on indoor environmental quality. The dataset comprises 44 full days, collated in the period March 2018 to April 2019, and was collected in a public building in Northern Europe. Sensor readings cover three rooms: one lecture room and two study zones. The data release contains two versions of the dataset, one with the raw readings and one that has been upsampled to a one-minute resolution. The dataset can be used for developing and evaluating data-driven applications, occupant sensing, and building analytics. It can also serve researchers and designers as a basis for experiments and pilot studies, and hence for benchmarking.


Background & Summary
Accurate estimates of occupant counts in a building can be used in various application areas, including smart spaces, safety and evacuation, facility management, and building operations 1,2. In the building operation area, occupant counts can enable applications such as adaptive ventilation in rooms, occupant-based energy benchmarking, and model-predictive control of room setpoints. In these applications, the more accurately the number of occupants can be sensed, the more energy-efficiently a building can be operated 3. For studies of adaptive ventilation and model-predictive control, it is also necessary that occupant presence can be linked to indoor environmental quality and ventilation rates.
Such research needs datasets that capture the conditions in non-laboratory settings, because the developed systems and algorithms have to handle omissions and faults in the sensor data they process in order to be applicable beyond the lab. Therefore, the released dataset was collected in a living lab building, which is a standard building in normal use but where additional effort has been made to enable data collection. However, because standard components are used, uncertainties around systems and their calibration are higher than in a laboratory setting.
The building considered for this data release is a teaching and office building located at the University of Southern Denmark, Odense campus. Denmark has a temperate climate representative of the northern part of Europe. The building has been designed to serve as a living lab for data-driven research on building operation and optimization 4. The role as a living lab has been communicated at the university, and a screen at one of the entrances shows examples of the data collected. However, as the data collection is mostly invisible to occupants, the collection does not impact their behavior in any noticeable manner. The building is approximately 8500 m² split across three floors and a basement. It has 1000 occupants on a typical weekday and is used for both student activities and staff. Data is collected at the spatial granularity of rooms. The data have been collected in three rooms: two study zones and one lecture room. The rooms were selected to cover different usage patterns and differences in sunlight by facing either south-east or north-west. The study zones have mixed use for student activities, such as project work and solving exercises. The teaching room is mostly used for scheduled activities, typically lasting two to four hours.
Table 3 (excerpt, sensor placement): the enter and exit counters for door 1 and door 2 are installed on the ceiling in the hallway, 0.5 m from the respective door.
The rooms contain a unique collection of sensor modalities covering both occupant presence and indoor environmental quality factors, including CO2 concentration level, relative humidity, illuminance, occupant counts, counts of occupants entering and leaving the rooms, temperature, and the in-room airflow. The airflow is estimated from the damper position, which is correlated with the airflow; the supplied air is outdoor air heated using heat recovery. The placement of the sensors follows the standard practice of the building industry in Denmark.
Most existing sensor-based datasets for buildings consider residential homes (e.g., Barker et al. 5). With this dataset, we instead consider commercial buildings. Previous datasets for commercial buildings include fewer sensor modalities and have a lower temporal resolution than the presented dataset. They include a dataset with only three modalities and a small temporal range by the University of Southern Denmark, described in 6, a dataset by Lawrence Berkeley National Lab with lower granularity and fewer modalities but with more background variables 7, and a dataset with only one sensor modality by the University of Texas, San Antonio 8. The released dataset is thereby unique in the number of sensor modalities available.
The sensor modalities in the dataset enable researchers both to study new technical solutions (e.g., CO2-based occupant estimation algorithms 9, adaptive ventilation, or model-predictive control) and to establish knowledge on occupants and indoor environmental quality (e.g., quantify the correlation between occupants and air quality). The dataset can also be used to learn modeling parameters for occupants to more accurately parameterize building performance simulations 3.
The counts of occupants entering and leaving the rooms were collected using six state-of-the-art PC2 3D stereo vision cameras produced by the company Xovis, mounted over the entrances to the rooms. To estimate the number of occupants in the rooms, we applied the PLCount algorithm 10 to the raw readings from the cameras. The sensor and method have been evaluated in the building against a manually obtained ground truth based on video recordings. The study documented in 10 reported a Root Mean Square Error (RMSE) of 0.075. The other building data are collected via standard-grade sensors connected to a building management system (BMS), which for this particular building is a Schneider Electric BMS. The data for the release were collected through the application programming interfaces of the BMS. The CO2 sensor data have not been cleaned. Users of the data should therefore address known issues with this stream, including offsets and drifts 9. An overview of the different sensor streams, including units and uncertainties, can be found in Table 1. See Table 2 for details of the sampling strategies of the various sensors and rooms. The physical placement of the sensors can be found in Table 3.
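To illustrate how counts of people entering and leaving relate to a room-level occupant count, and how an estimate can be scored against a ground truth with RMSE, the following is a minimal sketch. It is a naive cumulative-sum baseline and not the PLCount algorithm 10; the function names and the example numbers are hypothetical.

```python
import math
from typing import List


def naive_occupancy(entered: List[int], left: List[int]) -> List[int]:
    """Running occupant count from per-interval door counts.

    Naive baseline (NOT PLCount): accumulate entered - left and clip at
    zero, since a room cannot hold a negative number of people.
    """
    counts = []
    current = 0
    for n_in, n_out in zip(entered, left):
        current = max(0, current + n_in - n_out)
        counts.append(current)
    return counts


def rmse(estimate: List[float], truth: List[float]) -> float:
    """Root Mean Square Error between an estimate and a ground truth."""
    if len(estimate) != len(truth) or not estimate:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimate, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical door counts for four intervals and a hand-made ground truth.
estimate = naive_occupancy(entered=[3, 2, 0, 0], left=[0, 1, 2, 3])
truth = [3, 4, 3, 0]
error = rmse(estimate, truth)
```

A simple running sum carries any miscount forward for the rest of the day, so it should be read only as a baseline for comparison.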

Methods
Selection methodology. The published dataset was collected in the period from March 1st, 2018, to April 30th, 2019. Ideally, we would publish only periods of continuous readings, but since the source is a real building and BMS APIs were used for collecting the data, we have had to adjust expectations slightly. We only considered full days of data. Days where the sampled CO2 stream had more than three missing readings in a row were not considered, which allows gaps of 15 minutes. Threshold-based sensors, which only collect a sample when there is a change larger than the threshold, were not considered when eliminating days, since it is impossible to evaluate how many readings such streams should contain. Additionally, we chose not to release two consecutive days, as this would make the released data susceptible to privacy attacks. This decision is based on the results of a study 11, which showed that the data could be deanonymized using CO2 streams. Such an attack could be used to identify the weekday, which in turn could reveal the identity of the room through a data linkage attack combining the teaching rooms' scheduled activities with the released streams, as showcased in 11. To eliminate consecutive days in a sequence, we used the following procedure: for sequences with an even number of days, days were removed at random to comply with the rule; for sequences with an odd number of days, days were removed so as to maximize the number of days in the output. This left 44 full days of data covering the three test rooms and all the sensor modalities. These make up the days of the released dataset.
Data processing. In the released dataset, we provide two forms of the data: the original raw form and a pre-processed form that allows easier use of the data by having a stable sample rate. The pre-processing applies forward fill and then backward fill to the original streams.
Forward fill fills in missing samples at the desired sampling frequency by repeating the last reported value in the stream until a new reported value is reached; backward fill does the same but backward in time. The sampling rate for the fills has been set to one sample per minute for all of the streams. Other sampling rates can be computed from the original dataset. The sample rate was selected to accommodate the identified use cases, e.g., occupancy and model-predictive control 9. The original dataset has only been changed for the event- and threshold-based sensors, by adding a reading at 00:00:00 with the value of the last reading of the previous day.
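The fill procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to produce the release: readings are assumed to be keyed by minute of day, and the function name and the toy grid size are hypothetical.

```python
from typing import Dict, List, Optional


def fill_to_grid(readings: Dict[int, float],
                 n_slots: int = 1440) -> List[Optional[float]]:
    """Place readings on a regular grid, then forward fill and backward fill.

    `readings` maps a slot index (assumed: minute of day) to a value.
    The default of 1440 slots corresponds to one sample per minute.
    """
    grid: List[Optional[float]] = [readings.get(i) for i in range(n_slots)]

    # Forward fill: repeat the last reported value until a new one appears.
    last = None
    for i in range(n_slots):
        if grid[i] is None:
            grid[i] = last
        else:
            last = grid[i]

    # Backward fill: only slots before the first reading are still empty.
    nxt = None
    for i in range(n_slots - 1, -1, -1):
        if grid[i] is None:
            grid[i] = nxt
        else:
            nxt = grid[i]
    return grid


# Toy example on a 6-slot grid with readings at slots 2 and 4.
filled = fill_to_grid({2: 5.0, 4: 7.0}, n_slots=6)
```

Because forward fill runs first, backward fill only affects the slots that precede the first reading of the day.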
Data suppression. The most sensitive part of the data release is the identity of the rooms, since, combined with the occupant counts and knowledge of room activities, it allows one to calculate a teacher performance index, as demonstrated in 11. We have therefore anonymized the room identity and replaced the dates with a DayId, which is a random number assigned in non-chronological order. To limit the effect on the usefulness of the dataset, we have included year, month, and workday indicators. The time of day is untouched.
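The date suppression can be illustrated with a minimal sketch. The exact procedure used for the release is not specified here, so everything below is an assumption: the function and field names are hypothetical, the DayIds are drawn as a seeded random permutation, and the workday indicator is taken to mean Monday to Friday, ignoring public holidays.

```python
import random
from datetime import date
from typing import Dict, List


def suppress_dates(days: List[date], seed: int = 0) -> Dict[date, dict]:
    """Replace each date with a random DayId plus coarse indicators.

    Assumptions (not from the data release): DayIds are a random
    permutation of 1..n, and a workday is any Monday to Friday.
    """
    rng = random.Random(seed)
    day_ids = list(range(1, len(days) + 1))
    rng.shuffle(day_ids)  # breaks the link between DayId and calendar order

    suppressed = {}
    for day, day_id in zip(sorted(days), day_ids):
        suppressed[day] = {
            "dayId": day_id,
            "year": day.year,
            "month": day.month,
            "workday": day.weekday() < 5,  # 0 = Monday, ..., 6 = Sunday
        }
    return suppressed


# Three hypothetical, non-consecutive days.
example = suppress_dates([date(2018, 3, 5), date(2018, 3, 7), date(2018, 3, 10)])
```

In a real release only the suppressed fields would be published; the mapping from dates to DayIds is kept here solely to make the sketch inspectable.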

Data Records
The released dataset, hosted on figshare 12, contains the mentioned sensor modalities; the number of readings for each of the rooms can be seen in Table 4. The upsampled dataset has 67680 readings per stream. The data coverage for the sensors using the dynamic sampling rate (DSR) and static sampling rate (SSR) strategies can be found in Table 5. The illuminance data stream has relatively low data coverage. We have included it because it still captures the tendency of the light level in the rooms through the days, although the coverage can affect the usefulness of the stream. Summary statistics for the upsampled dataset for Rooms 1, 2, and 3 can be found in Tables 6-8. The data for each combination of sensor modality and room can be found in separate comma-separated value (CSV) files. In addition to the sensor values, these files have columns for the metadata defined in the data suppression section, namely timestamps, year, month, and workday indicators, and dayId, which are also described in the readme file. The two versions of the data, original and upsampled, can be found in the folders original and filleddata, respectively. The metadata also contain room type, size, seating capacity, and volume, which can be found in Table 9 and in the roominfo file. Furthermore, we have provided a Brick representation of the sensor instrumentation 13, found in the brick_graph file generated using the brick_generator script. The Brick model consists of the physical relations between the sensor streams, the building, and the rooms. It is modeled using Resource Description Framework (RDF) triples between the elements in the model. The room models are identical; an example can be found in Fig. 1. Finally, we have included a categorization of the sensors following the Mahdavi and Taheri 14 ontology, which can be found in the occupant-behavior-ontology file or in Tables 10, 11, and 12 for the indoor conditions, inhabitants, and control systems, respectively.
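As an illustration of how one of the per-modality CSV files might be consumed, the following sketch groups sensor values by dayId. The column names (timestamp, year, month, workday, dayId, value) and the sample rows are assumptions made for this example; the readme file in the release should be consulted for the actual header names.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO

# Hypothetical file content; header names and values are illustrative only.
SAMPLE = """timestamp,year,month,workday,dayId,value
00:00:00,2018,3,1,7,412.5
00:01:00,2018,3,1,7,413.0
00:00:00,2019,1,0,2,405.8
"""


def readings_by_day(csv_file: TextIO) -> Dict[int, List[float]]:
    """Group the sensor values of one stream by their dayId."""
    by_day: Dict[int, List[float]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_day[int(row["dayId"])].append(float(row["value"]))
    return dict(by_day)


# An in-memory file stands in for a CSV file opened from the release.
days = readings_by_day(io.StringIO(SAMPLE))
```

Since dayId values are assigned in non-chronological order, grouping by dayId yields independent daily profiles rather than a continuous time series.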

Technical Reliability
To evaluate the technical reliability of the dataset, we provide plots showing the daily profiles of each sensor modality. Furthermore, we provide additional evidence for each of the streams based on statistical analysis.
In 15, the authors showcase the relation between the VAV damper position, CO2, and the number of occupants, which can be seen in Fig. 2. The Pearson product-moment correlation coefficients, which measure the linear association between two variables, between CO2 and damper position in the dataset are 0.87700, 0.89287, and 0.80668 for Rooms 1, 2, and 3, respectively. The correlation coefficients between CO2 and the number of occupants are 0.65863, 0.83680, and 0.81663 for the rooms. Finally, the correlation coefficients between damper position and the number of occupants are 0.70158, 0.77950, and 0.69057 in the dataset, using the data from the day with the DayId of 9. These numbers highlight the expected relationships between these modalities 16. As expected, there is a positive correlation among these data streams. This is because the damper position is regulated to keep the CO2 concentration below a particular threshold. Likewise, the CO2 concentration is mostly influenced by the number of occupants in a particular space. However, we do not expect the correlation coefficients to indicate a perfect relationship.
We have compared the total number of people entering and exiting the three rooms. The results show that, according to the sensors, 0.17% more people enter Room 1 than leave it. In Room 2, 1.12% more people leave the room than enter it. In Room 3, 0.29% more people leave the room than enter it. For all rooms, the comparison uses the total number of people entering and exiting over the monitored period. These numbers indicate that the observed sensing error is very low.
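Both checks in this section can be reproduced with a few lines of code. The sketch below computes a Pearson product-moment correlation coefficient and a relative enter/exit imbalance. The function names and sample values are hypothetical, and expressing the imbalance relative to the number of exits is an assumption, since the base of the reported percentages is not stated.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient of two streams."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long streams with at least 2 samples")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant stream")
    return cov / math.sqrt(var_x * var_y)


def enter_exit_imbalance(entered: int, exited: int) -> float:
    """Percentage by which entries exceed exits (assumed base: exits)."""
    return 100.0 * (entered - exited) / exited


# Hypothetical CO2 readings (ppm) and damper positions (%) at equal times.
co2 = [410.0, 450.0, 600.0, 820.0, 700.0]
damper = [0.0, 10.0, 40.0, 80.0, 60.0]
r = pearson(co2, damper)

# Hypothetical door totals over a monitored period.
imbalance = enter_exit_imbalance(entered=1002, exited=1000)
```

A positive imbalance means more recorded entries than exits; a value near zero suggests that the door counters miss few passages in either direction.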
Reliability tests for the CO2 and temperature streams were performed in 17, where it was found that the CO2 sensors were not calibrated; the sensors were therefore replaced and calibrated. Figs. 3-5 show the profiles of the CO2 concentrations, the damper position, and the occupancy estimates for the rooms, highlighting expected patterns during daytime versus night. The lowest CO2 readings are 406.72, 405.76, and 408.0 ppm for Rooms 1, 2, and 3, respectively. This is close to the ambient concentration for Denmark, which is around 400 ppm.
The daily illuminance level for the three rooms is shown in Fig. 6. Rooms 1 and 2 are both located on the west side of the building, which can be inferred from the daily profiles shown in the figure, where the illuminance peaks later in the day when there is direct sunlight on the windows. The same can be observed for Room 3, which has eastern exposure and therefore peaks in the morning. Furthermore, all three rooms have a lowest reading of 0 lux, which is the expected lowest value since the sensor cannot detect values below 10 lux, as specified in the technical product sheet. The daily humidity and temperature profiles can be seen in Figs. 7 and 8. Here the impact of the afternoon sun on Room 1 and Room 2 can be observed, which is not present for Room 3.