A Global Building Occupant Behavior Database

This paper introduces a database of 34 field-measured building occupant behavior datasets collected from 15 countries and 39 institutions across 10 climatic zones covering various building types in both commercial and residential sectors. This is a comprehensive global database about building occupant behavior. The database includes occupancy patterns (i.e., presence and people count) and occupant behaviors (i.e., interactions with devices, equipment, and technical systems in buildings). Brick schema models were developed to represent sensor and room metadata information. The database is publicly available, and a website was created for the public to access, query, and download specific datasets or the whole database interactively. The database can help to advance the knowledge and understanding of realistic occupancy patterns and human-building interactions with building systems (e.g., light switching, set-point changes on thermostats, fans on/off, etc.) and envelopes (e.g., window opening/closing). With these more realistic inputs of occupants’ schedules and their interactions with buildings and systems, building designers, energy modelers, and consultants can improve the accuracy of building energy simulation and building load forecasting.

last decade [9][10][11][12]. Examples of these studies include: (1) Occupant presence; (2) Occupant number; (3) Opening/closing windows; (4) Opening/closing window blinds; (5) Turning on/off lights; (6) Adjusting thermostats; (7) Turning on/off air-conditioners; (8) HVAC sizing and thermal comfort; (9) Crowd control and security; and (10) Circulation design [9]. Models from these studies were built to describe occupant behavior in buildings in order to evaluate the performance of building design and operation [9]. Although such studies have been carried out around the world, each has its own dataset and represents an individual case, which makes cumulative learning a key challenge. Without a large body of data to work on, occupant behavior researchers cannot compare occupant behaviors across building types and nations in depth, or derive valuable information for energy-efficient building design and operation from limited field-measured data.
In this paper, we develop and present the worldwide ASHRAE occupant behavior database (https://ashraeobdatabase.com) with data contributions from researchers across the globe, as part of the IEA EBC Annex 79 project [10]. A prior effort, under the IEA EBC Annex 66 project [11], published five occupant behavior datasets, which are also included in this database [13]. The database consists of 34 datasets from 39 institutions located in 15 countries and 10 climate zones. This is a comprehensive global occupant behavior database. The Brick schema [12] was adopted to develop sensor and room metadata models. As shown in Figure 1, the database includes 11 different types of occupant behavior measurements collected from 3 different types of building spaces. The database can support various use cases of occupant behavior research, including:
• Understand occupant behaviors in real buildings,
• Compare and understand the diversity and dynamics of occupant behaviors,
• Develop mathematical models of occupant behaviors at various spatial and temporal resolutions by building types,
• Benchmark various occupant behavior modeling approaches,
• Generate typical occupant schedules and behavior models for use in building performance simulations, as well as building energy codes and standards.

Methods
This section introduces the data collection, pre-processing and modeling approaches implemented in this study. Figure 2 shows the detailed technical approach we have followed to develop the occupant behavior database.
Data collection. Before collecting the data, a worldwide survey was developed and administered among researchers willing to contribute their datasets. As Table 1 shows, the survey contains basic questions about the metadata, type of occupant behavior, data collection method, period and frequency, geographic location and building type, heating/cooling strategy (interaction between occupant and thermostat), and climate zone. With the information collected from the survey, the project team reached out to potential contributors with detailed requirements. Below is a list of preliminary data requirements:
• Data should come from field experiments, and represent "real" occupant behavior in real buildings;
• The time span of the dataset should be at least one month to represent weekly and monthly behavior patterns, or represent any behavior changes within buildings;
• Data on adaptive behavior (e.g., opening or closing windows to maintain thermal comfort) should come with indoor and outdoor environmental parameters (e.g., ambient and indoor air temperature);
• The dataset should contain metadata information, a dictionary of data headings, experimental setup details, and data collection methods.
Data pre-processing. After receiving raw datasets from contributors, each dataset was inspected against the above requirements. The contributors were responsible for addressing the privacy issues relevant to occupant data [14], and further anonymization was added as part of pre-processing. All datasets were then separated into survey-based, in-situ-based, and mixed-type data. The in-situ-based data contains dynamic information and measurements in the building with constant sampling intervals, such as door and window status (OPEN/CLOSED), indoor equipment status (ON/OFF), and indoor environment information (temperature, humidity, CO2 concentration, illumination, etc.). Survey-based data contains information from the specific study, including occupant questionnaires, static information about the building envelope, floor plans, and sampled measurements. Datasets without a continuous and fixed sampling time interval were also classified as survey-based data. One dataset was categorized as mixed-type data since it has both survey-based and in-situ-based data. Table 2 provides a review of all the datasets, including the country of origin, collection method, and measurement categories. In total, there are 24 in-situ-based datasets, one mixed-type dataset, and nine survey-based datasets. Table 2 also lists the types of occupant behavior data included in the database. Each type of measurement has a CSV template file associated with it. Based on the templates, all the raw data were pre-processed to be consistent in naming, data types, and formats. The data types follow the entities and tags defined in the Brick schema, which is covered in the following section.
The detailed data pre-processing procedure includes the following:
• Empty columns from the raw datasets were removed;
• All the missing values in raw datasets were replaced with -999;
• As a process of anonymization, each building and room was assigned a unique ID number;
• The headings of common data columns in the raw data were standardized following pre-defined dataset templates. Survey data with unique questions or measurement naming kept the original naming schema. For the survey-based data, a dictionary of headings was created for each dataset;
• Timestamps in the raw data were converted to the format "yyyy-mm-dd hh:mm:ss". The time zone was kept as the local time zone to reflect the daily behavior patterns of the occupants. For datasets with minute-level granularity, the seconds value in the timestamp was set to zero;
• Numeric precision was adjusted by data type: status values are binary, occupant numbers are integers, and indoor and outdoor conditions are given to one decimal place;
• Raw occupancy data that only contains enter/leave events were aggregated to get a total number of occupants in the space.

Sensor and room metadata modeling. Subsystem vendors provide contextual information as "metadata", that is, data about data; these subsystems include HVAC, lighting, security, and sensing/monitoring systems [15]. In this database, each dataset contains one or more buildings with various types of sensors installed to measure occupant behavior patterns and indoor and outdoor conditions, as shown in Table 2. The Brick schema was adopted to develop sensor and room metadata models, to better present the information on different types of measurements and the relationships between subsystems and buildings.
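The pre-processing steps listed earlier in this section can be illustrated with a minimal sketch using only the Python standard library. It is not the project's actual cleaning code: the column names (`time`, `temp`, `note`), the source timestamp format, and all function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime

MISSING = "-999"  # placeholder for missing values, as stated in the text


def drop_empty_columns(rows):
    """Remove columns whose values are empty in every row."""
    keep = [key for key in rows[0] if any(row[key].strip() for row in rows)]
    return [{key: row[key] for key in keep} for row in rows]


def fill_missing(rows):
    """Replace empty cells with the -999 placeholder."""
    return [{k: (v if v.strip() else MISSING) for k, v in row.items()} for row in rows]


def standardize_timestamp(text, source_format):
    """Rewrite a local-time timestamp as 'yyyy-mm-dd hh:mm:ss'.

    Minute-level inputs carry no seconds, so the seconds come out as zero.
    """
    return datetime.strptime(text, source_format).strftime("%Y-%m-%d %H:%M:%S")


def events_to_counts(events):
    """Turn (timestamp, 'enter' | 'leave') events into a running occupant count."""
    count, counts = 0, []
    for timestamp, kind in events:
        count += 1 if kind == "enter" else -1
        counts.append((timestamp, count))
    return counts


# Hypothetical raw records; "note" is empty everywhere and one "temp" is missing.
raw = [
    {"time": "05/22/2018 08:15", "temp": "23.46", "note": ""},
    {"time": "05/22/2018 08:20", "temp": "", "note": ""},
]

rows = fill_missing(drop_empty_columns(raw))
for row in rows:
    row["time"] = standardize_timestamp(row["time"], "%m/%d/%Y %H:%M")
    if row["temp"] != MISSING:
        row["temp"] = f"{float(row['temp']):.1f}"  # one decimal place for conditions

counts = events_to_counts([("08:00", "enter"), ("08:05", "enter"), ("12:00", "leave")])
```

Keeping each step as a separate small function mirrors the bullet list and makes it easy to apply only the steps a given raw dataset needs.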
Brick is an open-source unified metadata schema for buildings, which standardizes semantic descriptions of the assets and their relationships in buildings. The assets include physical, logical, and virtual assets. The core concepts of Brick are Tag, Class, Relationship, and Graph. A Tag represents an atomic fact or attribute of an entity. A Class is a category with a definition used to represent a group of entities in the building. A Relationship defines the nature of the link between two related entities. A Graph is a summarized figure indicating the data structure of a set of entities and their relationships. Brick defines a detailed ontology (https://brickschema.org/ontology) to support and expand these core concepts. Database users can easily extract sensor and room metadata information (e.g., number of rooms in the building, number and types of sensors that were deployed in the space) without querying the database.

ASHRAE global occupant behavior database. A website (https://ashraeobdatabase.com) was created as a data warehouse for public access. Query builder tools were developed based on different behavior types, cities and countries, building types, study ID, and publication list. Users can select and download data from the database interactively with the query tools. Data analytic functions were developed to provide an interactive overview of the database and help users select a dataset. Figure 3 shows the geographical and institutional details of the global contributions to this occupant behavior database.

The Köppen-Geiger climate classification has been widely used in the smart building area by researchers around the world [16][17][18]. Since the datasets [19] were contributed by researchers around the globe, the Köppen-Geiger climate classification was adopted to represent the different climate zones in the datasets.
The database covers 10 different climate zones globally according to the Köppen-Geiger climate classification (https://en.climate-data.org): Tropical rainforest, Tropical savanna wet, Hot deserts, Humid subtropical, Temperate oceanic, Hot-summer Mediterranean, Cool-summer Mediterranean, Hot-summer humid continental, Warm-summer humid continental, and Monsoon-influenced hot-summer humid continental climates. All datasets were compressed into a zip file named "ASHRAE Global Occupant Behavior Database", with a total size of 548 MB. The final datasets have been uploaded to the figshare website for public use [19]. A website (https://ashraeobdatabase.com) was created to query and download the desired data from the database based on different selection criteria.

Data records
As Figure 4 indicates, the root folder contains a folder named "Dataset_Templates" with all the templates (.csv files) that were used to process the raw datasets. Those templates can be used as references for future data contributors. The "in-situ" folder contains 22 datasets representing the different dynamic measurement data (.csv files) with constant sampling intervals. The folder also includes the brick.pdf file, which is a PDF view of the Brick model, and the brick.ttl (Turtle) file, which is the Brick model that can be viewed interactively through the Brick server (https://viewer.brickschema.org/). In the PDF file, users can get a glimpse of sensor measurement types and their relationships with the building. Through the Turtle file, users can extract the complete sensor and room metadata information of the dataset without opening the data files. The "survey" folder contains questionnaires as well as dynamic or static measurement data without a constant sampling interval. Survey-based data varies greatly, as different research projects focus on different measurements and questions. Each dataset has a dictionary of headings to help users understand the data. In total, 12 survey-based datasets were collected in this database.

Sensor and room metadata. As discussed above, 22 Brick models have been developed for the in-situ datasets. [...] and their relationships in this dataset. The relationship Room "isLocationOf" Desk indicates that desk-level measurements exist in this dataset. In this model, there can be multiple "Building", "Room", and "Desk" entities. The number of those entities can be extracted from the Turtle file. The relationships include "isLocationOf", "hasPoint", and "Regulates". Those relationships can be reversed as "hasLocation", "isPointOf", and "isRegulatedBy".
The points represent different sensors, for instance, wind direction sensor, wind speed sensor, air temperature sensor, humidity sensor, etc. Detailed information such as numbers or names of the entities and points can be found in the Turtle file.
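To make the entity and relationship structure concrete, the following sketch models a tiny Brick-like graph with plain Python tuples. It is only an illustration: the entity identifiers and the example triples are hypothetical and do not come from any dataset in the database. In practice, the triples would be loaded from a dataset's brick.ttl file with an RDF library rather than typed by hand.

```python
from collections import Counter

# Hypothetical entities and their classes (not taken from a real brick.ttl file).
TYPES = {
    "bldg1": "Building",
    "room1": "Room",
    "room2": "Room",
    "desk1": "Desk",
    "temp1": "Air_Temperature_Sensor",
    "hum1": "Humidity_Sensor",
}

# Hypothetical (subject, relationship, object) triples.
TRIPLES = [
    ("bldg1", "isLocationOf", "room1"),
    ("bldg1", "isLocationOf", "room2"),
    ("room1", "isLocationOf", "desk1"),
    ("room1", "hasPoint", "temp1"),
    ("room2", "hasPoint", "hum1"),
]

# Inverse relationship names as listed in the text.
INVERSE = {
    "isLocationOf": "hasLocation",
    "hasPoint": "isPointOf",
    "Regulates": "isRegulatedBy",
}


def invert(triples):
    """Rewrite every triple in its reversed form."""
    return [(obj, INVERSE[rel], subj) for subj, rel, obj in triples]


def count_by_class(types):
    """Count entities per class, e.g. how many rooms a model contains."""
    return Counter(types.values())


def points_of(entity, triples):
    """List the sensor points attached to an entity via 'hasPoint'."""
    return sorted(obj for subj, rel, obj in triples if subj == entity and rel == "hasPoint")


class_counts = count_by_class(TYPES)
reversed_triples = invert(TRIPLES)
room1_points = points_of("room1", TRIPLES)
```

Counting entities by class and following "hasPoint" links are the two lookups needed to answer questions such as how many rooms a dataset covers and which sensors were deployed in each room.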

Technical Validation
In this section, we explore the datasets and perform an initial analysis of different occupant behavior data. The analysis focuses on occupant number, door opening, occupant presence, window opening, and outdoor measurements.

Occupant number historical data. Figure 6 shows the historical data of occupant number in a commercial office building from Dataset 32. The occupancy of two office rooms was measured from May 22, 2018 to July 11, 2018. Camera-based sensors were deployed to collect occupant counts in this study. To validate the dataset, researchers added an automatic daily calibration for the measurement, which includes two functions: first, set the occupant count to zero if it is less than zero; second, set the occupant count to zero at 3 AM each day. From the figure, we can observe the weekday and weekend trends of occupant number in both rooms. Holiday effects can also be captured, such as Memorial Day (Monday, May 28, 2018) and Independence Day (Wednesday, July 4, 2018). The occupant number dropped to relatively low values during weekends and holidays. Figure 7 provides a detailed view of the historical occupant number in one week; a common workday schedule was observed in both rooms.
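The two calibration rules can be sketched as follows. This is an illustrative interpretation, not the contributors' implementation: it assumes the counter produces a stream of +1/-1 entry and exit events, which the text does not state, and the function name and sample events are hypothetical.

```python
from datetime import datetime, timedelta


def calibrated_counts(events):
    """Accumulate (timestamp, delta) events into calibrated occupant counts.

    Rule 1: the count is never allowed to drop below zero.
    Rule 2: the count is reset to zero at 3 AM each day.
    Events are assumed to be sorted by timestamp.
    """
    count, current_day, result = 0, None, []
    for stamp, delta in events:
        # Shift by three hours so that each "counting day" runs from 3 AM to 3 AM.
        day = (stamp - timedelta(hours=3)).date()
        if day != current_day:
            count, current_day = 0, day  # rule 2: daily reset
        count = max(0, count + delta)  # rule 1: clamp at zero
        result.append((stamp, count))
    return result


# Hypothetical events: one exit is missed on the first day, and a spurious
# extra exit is recorded on the second day.
events = [
    (datetime(2018, 5, 22, 8, 0), +1),
    (datetime(2018, 5, 22, 9, 0), +1),
    (datetime(2018, 5, 22, 18, 0), -1),
    (datetime(2018, 5, 23, 8, 0), +1),
    (datetime(2018, 5, 23, 8, 30), -1),
    (datetime(2018, 5, 23, 8, 45), -1),
]
calibrated = [count for _, count in calibrated_counts(events)]
```

The daily reset stops a missed exit from inflating every later reading, while the clamp handles spurious exits within a day.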
Door status. Figure 8 shows the distribution of door status (Open/Closed) in educational offices from Dataset 5. The building in this dataset has multiple rooms, with each room having only one door. Cable-connected magnetic sensors were used to measure the opening/closing of doors in this study. The data was collected from October 27, 2016, to October 31, 2017, with 5-minute granularity. The figure indicates that Rooms 1 and 2 have a similar trend of door opening activities during working hours (8 AM-5 PM). However, Room 3 showed a different trend, with door opening behaviors spread across the 24 hours of a day. Overall, the door opening probability was low, which indicates that the doors remained closed most of the time.

Occupant presence. Figure 9 shows a cross-comparison of the first arrival and last departure times among three different datasets (Datasets 9, 10, and 30) from three different countries. The data was collected from educational offices in various months from 2016 to 2018. Passive infrared sensors were used in Dataset 9 to collect event-based occupancy data, with the minimum occupied time set to 15 minutes. Dataset 10 monitored space occupancy manually by a person in the office. Dataset 30 used Bluetooth device pairing technology to sample occupancy data every minute. In Figure 9, the first arrival captures the time when the space is first occupied, while the last departure captures the time when the space is last occupied during the day. Even though the distribution varies because of different lengths of data collection periods, it is clear that first arrival times are centered around 10 AM, and last departure times are centered between 6 PM and 7 PM.
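The definitions of first arrival and last departure used above can be expressed as a short sketch. It assumes occupancy is available as timestamped presence samples; the function name and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def first_arrival_last_departure(samples):
    """Return {date: (first occupied time, last occupied time)}.

    `samples` is an iterable of (timestamp, occupied) pairs, where `occupied`
    is True when the space is occupied. Days with no occupancy are omitted.
    """
    occupied_by_day = defaultdict(list)
    for stamp, occupied in samples:
        if occupied:
            occupied_by_day[stamp.date()].append(stamp)
    return {
        day: (min(stamps).time(), max(stamps).time())
        for day, stamps in occupied_by_day.items()
    }


# Hypothetical presence samples for one office on one day.
samples = [
    (datetime(2017, 3, 1, 9, 58), False),
    (datetime(2017, 3, 1, 10, 2), True),
    (datetime(2017, 3, 1, 12, 30), False),
    (datetime(2017, 3, 1, 18, 40), True),
    (datetime(2017, 3, 1, 18, 45), False),
]
daily = first_arrival_last_departure(samples)
```

Because the three datasets use different sensing methods and sampling intervals, the precision of these times is bounded by each dataset's own sampling interval.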
Window operations. Figure 10 shows window status data with indoor and outdoor temperature measurements from Dataset 5. The data was collected from educational offices. Researchers deployed cable-connected magnetic sensors to measure the opening/closing of windows in this study, with a sampling time of five minutes. The indoor temperature was measured by a temperature probe (PT1000 class A Cable) with an accuracy of 0.15 °C or less at 0 °C. Outdoor temperature was collected every 10 minutes from a rooftop weather station installed about 10 meters above the ground. From the changes in indoor and outdoor temperature over two days, it can be observed that the HVAC system helped to maintain an indoor temperature close to 24 °C during the day. In Rooms 3 and 4, window opening activities were captured during the afternoons on both days, when the indoor and outdoor temperatures were both relatively high.

Outdoor measurements. Field measurements of outdoor parameters were investigated based on the available datasets. Those outdoor measurements cover five different climate zones: Tropical savanna wet climate (Aw), Humid subtropical climate (Cfa), Temperate oceanic climate (Cfb), Monsoon-influenced hot-summer humid continental climate (Dwa), and Warm-summer humid continental climate (Dfb). The time granularity varies among those datasets. Dataset 5 collected outdoor measurements from a rooftop weather station with a 10-minute sampling time. Dataset 7 has a 5-minute sampling interval, and its data was collected from an onsite outdoor weather station. Researchers of Dataset 14 installed portable rooftop weather stations and sampled outdoor measurements every 10 minutes. Dataset 16, however, collected hourly outdoor measurements from a local weather station that is approximately 4 kilometers away from most of the experimental buildings.
Then, the data was resampled to 10-minute intervals using linear interpolation. Since the datasets were collected in different months of various years, data from November in four datasets (5, 7, 14, and 16) were identified and analyzed in order to compare outdoor measurements over the same time span. Hourly data from different days were analyzed and plotted as boxplots. Figure 11 shows hourly outdoor temperature distributions of the four datasets and climate zones. Figure 12 shows hourly outdoor relative humidity distributions of those datasets. Figure 13 shows the hourly outdoor solar radiation distributions in Datasets 5, 7, and 14, since Dataset 16 does not measure outdoor solar radiation. The results captured different trends of temperature, relative humidity, and solar radiation by time of day within the four climate zones.
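The resampling of hourly measurements to 10-minute intervals by linear interpolation can be sketched as below. This is a minimal standard-library illustration under stated assumptions, not the code used in the study: samples are assumed to be sorted, the step is assumed to divide each gap evenly, and the temperature values are hypothetical.

```python
from datetime import datetime, timedelta


def resample_linear(samples, step=timedelta(minutes=10)):
    """Linearly interpolate sorted (timestamp, value) samples onto a finer grid."""
    result = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        span = (t1 - t0).total_seconds()
        t = t0
        while t < t1:
            # Fraction of the way from the earlier sample to the later one.
            fraction = (t - t0).total_seconds() / span
            result.append((t, v0 + fraction * (v1 - v0)))
            t += step
    result.append(samples[-1])  # keep the final measured sample
    return result


# Hypothetical hourly outdoor temperatures in degrees Celsius.
hourly = [
    (datetime(2017, 11, 1, 0, 0), 10.0),
    (datetime(2017, 11, 1, 1, 0), 16.0),
    (datetime(2017, 11, 1, 2, 0), 13.0),
]
ten_minute = resample_linear(hourly)
```

Each interpolated value lies on the straight line between its two neighboring hourly readings, so the measured hourly values are preserved exactly while the intermediate points are estimates.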

Usage Notes
The datasets [19] have been uploaded to the public domain on the figshare website, and users can download the data through this link (https://doi.org/10.6084/m9.figshare.16920118.v6). A website (https://ashraeobdatabase.com) was created to query and download the desired data from the database based on different selection criteria. These criteria include types of measurement data, countries and cities, type of building, study ID, and publication. The website also provides an overall analysis of all the datasets and a list of available publications from those studies.

Code availability
All the code used to clean the raw datasets has been uploaded to GitHub for public use (https://github.com/yapanliu/ashrae-ob-database). The raw datasets are also available to the public on request.