Background & Summary

Commercial building, heating, ventilation, and air-conditioning (HVAC) systems account for nearly 10% of the global electric energy1. To enhance energy efficiency in the building sector, organizations such as the United States Green Building Council and the American Society of Heating, Refrigerating and Air Conditioning Engineers (ASHRAE), Federation of European Heating, Ventilation and Air Conditioning Associations (REHVA) or Chartered Institution of Building Services Engineers (CIBSE) have developed various design standards and guidelines (e.g. ASHRAE Standard 90.1 and 189.1, REHVA Guidebook 282, 293 and 314, CIBSE Guideline F5 and L6) to support professionals in upgrading building design and construction practices. Furthermore, the bloom of building environmental assessment schemes and certifications – such as Building Research Establishment’s Environmental Assessment Method, Leadership in Energy and Environmental Design (LEED), the French Haute Qualité Environnementale, or the German Deutsche Gesellschaft für Nachhaltiges Bauen are promoting the construction and renovation of green and sustainable buildings, mostly in the developed countries. However, according to a recent study, LEED-certified buildings consume merely 10% less site energy than similar buildings that are not LEED-certified7. One of the major reasons for the underperformance of LEED-certified buildings is that operation of the building technical systems did not operate as intended, originating what is acknowledged as a performance gap motivating the need to explore the problem in more depth, using tools such as performance monitoring and benchmarking, information visualization, better controls, fault detection and diagnostics. But the limitation of these technologies in the building industry leads to a bottleneck in which no more energy savings can be achieved beyond those technologies. Estimating energy consumption in buildings to provide services is comprehensively influenced by architectural design, engineering technologies and cultural background, operational practices, occupant behavior, social equity, etc. Among them, occupant behavior has been suggested to be one of the most important factors influencing that explain the performance gap in buildings8.

However, the challenges in studying occupant behavior in buildings are many including: (a) Occupant behavior is dynamic and complex in nature; (b) Privacy issues make data collection difficult; and (c) Monitoring occupant behavior relies on various types of sensors with relatively high costs. To better understand the occupant behavior in buildings, more than 500 papers have been published on the topic of occupant behavior over the last decade9,10,11,12. Examples of these studies include: (1) Occupant presence; (2) Occupant number; (3) Opening/closing windows; (4) Opening/closing window blinds; (5) Turning on/off lights; (6) Adjusting thermostats; (7) Turning on/off air-conditioners; (8) HVAC sizing and thermal comfort; (9) Crowd control and security; and (10) Circulation design9. Models from these studies were built to describe occupant behavior in buildings in order to evaluate the performance of building design and operation9. Each research study has its own dataset and represents an individual case, making cumulative learning a key challenge although the studies were carried out around the world. With such a large body of data to work on, occupant behavior researchers will not be able to dive deeper to compare occupant behaviors across various building types and nations, or derive valuable information for energy-efficient building design and operations based on limited field-measured data.

In this paper, we develop and present the worldwide ASHRAE occupant behavior database ( with data contributions from researchers across the globe, as part of the IEA EBC Annex 79 project10. A prior effort, under the IEA EBC Annex 66 project11, has published five occupant behavior datasets, which were also included in this database13. The database consists of 34 datasets from 39 institutions located in 15 countries and 10 climate zones. This is a comprehensive global occupant behavior database. The Brick schema12 was adopted to develop sensor and room metadata models. As shown in Figure 1, the database includes 11 different types of occupant behavior measurements collected from 3 different types of building spaces. The database can support various use cases of occupant behavior research, including:

  • Understand occupant behaviors in real buildings,

  • Compare and understand the diversity and dynamics of occupant behaviors,

  • Develop mathematical models of occupant behaviors at various spatial and temporal resolutions by building types,

  • Benchmark various occupant behavior modeling approaches,

  • Generate typical occupant schedules and behavior models for use in building performance simulations, as well as building energy codes and standards.

Fig. 1
figure 1

Summary of the ASHRAE Global Occupant Behavior Database by building types.


This section introduces the data collection, pre-processing and modeling approaches implemented in this study. Figure 2 shows the detailed technical approach we have followed to develop the occupant behavior database.

Fig. 2
figure 2

Overview of the technical approach.

Data collection

Before collecting the data, a worldwide survey was developed and administered among researchers willing to contribute their datasets. As Table 1 shows, the survey contains some basic questions about the metadata, type of occupant behavior, data collection method, period and frequency, geographic location and building type, heating/cooling strategy (interaction between occupant and thermostat), climate zone. With the information collected from the worldwide survey, the project team reached out to potential contributors with detailed requirements. Below is a list of preliminary data requirements:

  • Data should come from field experiments, and represent “real” occupant behavior in real buildings;

  • The time span of the dataset should be at least one month to represent weekly and monthly behavior patterns, or represent any behavior changes within buildings;

  • Data on adaptive behavior (e.g., opening or closing windows to maintain thermal comfort) should come with indoor and outdoor environmental parameters (e.g., ambient and indoor air temperature);

  • The dataset should contain metadata information, a dictionary of data headings, experimental setup details, and data collection methods.

Table 1 Dataset description in the survey (*Required field).

Data Pre-processing

After receiving raw datasets from contributors, each dataset was inspected based on the above requirements. The contributors were responsible for addressing privacy that relevant for occupant data14 and further anonymization was added as part of pre-processing. All datasets were then separated into survey-based, in-situ-based, and mixed-type of data. The in-situ-based data contains dynamic information and measurements in the building with constant sampling intervals, such as door and window status (OPEN/CLOSED), indoor equipment status (ON/OFF), indoor environment information (temperature, humidity, CO2 concentration, illumination, etc.). Survey-based data contains information from the specific study, including occupant questionnaires, static information about the building envelope, floor plan and sampled measurement. Datasets without a continuous and fixed sampling time interval were also classified as survey-based data. One dataset was categorized as mixed type data since it has both survey-based and in-situ-based data. Table 2 provides a review of all the datasets, including the country of origin, collection method, and measurement categories. There are in total 24 in-situ-based datasets, one mixed-type of dataset, and nine survey-based datasets.

Table 2 Summary of 34 datasets.

Table 2 listed the types of occupant behavior data that were included in the database. Each type of measurement has a CSV template file associated with it. Based on the templates, all the raw data were pre-processed to be consistent in standard naming, data types, and formats. The data types follow the entities and tags defined in the Brick schema, which is covered in the following section.

The detailed data pre-processing procedure includes the following:

  • Empty columns from the raw datasets were removed;

  • All the missing values in raw datasets were replaced with - 999;

  • As a process of anonymization, each building and room were assigned with a unique ID number;

  • The headings of common data columns in the raw data were standardized following pre-defined dataset templates. Survey data with unique questions or measurement naming remained the same as the original naming schema. For the survey-based data, a dictionary of headings was created for each dataset;

  • The format of timestamp in the raw data was revised to follow the format “yyyy-mm-dd hh:mm:ss”, the time zone remained as local time zone to reflect the daily behavior patterns of the occupant; The time granularity of some datasets is at the minute level. In that case, the value of seconds in the timestamp was kept as zero.

  • The decimal point of the raw data was adjusted accordingly, such as status (binary), occupant number data (integers), data of indoor and outdoor conditions (one decimal precision digit);

  • Raw occupancy data that only contains enter/leave events were aggregated to get a total number of occupants in the space.

Sensor and room metadata modeling

Contextual information provided by subsystem vendors as “metadata”, that is, the data about data, those subsystems include HVAC systems, lighting, security, and sensing/monitoring systems15. In this database, each dataset contains one or more buildings with various types of sensors installed to measure occupant behavior patterns, indoor and outdoor conditions as shown in Table 2. The Brick schema was adopted to develop sensor and room metadata models, to better present the information of different types of measurements and relationships between subsystems with buildings. Brick is an open-source unified metadata schema for buildings, which standardizes semantic descriptions of the assets and their relationships in buildings. The assets include physical, logical, and virtual assets. The core concepts of Brick are Tag, Class, Relationship, and Graph. A Tag represents an atomic fact or attribute of an entity. A Class is a category with a definition used to represent a group of entities in the building. The Relationship defines the nature of the link between two related entities. A graph is a summarized figure indicating the data structure of a set of entities and their relationships. Brick defines a detailed ontology ( to support and expand these core concepts. The database users can easily extract sensor and room metadata information (e.g., number of rooms in the building, number and types of sensors that were deployed in the space) without querying the database.

ASHRAE global occupant behavior database

A website ( was created as a data warehouse for public access. Query builder tools were developed based on different behavior types, cities and countries, building types, study ID, and publication list. Users can select and download data from the database interactively with the query tools. Data analytic functions were developed to provide an interactive overview of the database and assist users to select the dataset.

Data Records

Figure 3 shows the geographical and institutional details of the global contributions to this occupant behavior database. Köppen-Geiger climate classification has been widely used in the smart building area by researchers around the world16,17,18. Since the datasets19 were contributed by researchers around the globe, Köppen-Geiger climate classification was adopted to represent the different climate zones in the datasets. The database covers 10 different climate zones globally according to the Köppen-Geiger climate classification (, which covers Tropical rainforest, Tropical savanna wet, Hot deserts, Humid subtropical, Temperate oceanic, Hot-summer Mediterranean, Cool-summer Mediterranean, Hot-summer humid continental, Warm-summer humid continental, Monsoon-influenced hot-summer humid continental climates. All datasets were compressed into a zip file named “ASHRAE Global Occupant Behavior Database”, with a total size of 548 MB. The final datasets have been uploaded to the figshare website for public use19. A website ( was created to query and download the desired data from the database based on different selection criteria.

Fig. 3
figure 3

Global Contributions to the ASHRAE Occupant Behavior Database.

As Figure 4 indicates, in the root folder, a folder named “Dataset_Templates” contains all the templates (.csv files) that have been used to process raw datasets. Those templates can be used as references for future data contributors. The “in-situ” folder contains 22 datasets representing the different dynamic measurement data (.csv files) with constant sampling intervals. The folder also includes the brick.pdf file which is a PDF view of the Brick model, and the brick.ttl (Turtle) file, which is the Brick model that can be viewed interactively through the Brick server ( In the PDF file, users can get a glimpse of sensor measurement types and relationships with the building. Through the Turtle file, users can extract the complete sensor and room metadata information of the dataset without opening those datasets. The “survey” folder contains questionnaires as well as dynamic or static measurement data without a constant sample interval time. Survey-based data varies greatly, as different research projects focus on various measurements and questions. Each dataset has a dictionary of headings to assist users with understanding the data. In total, 12 survey-based datasets were collected in this database.

Fig. 4
figure 4

Folder view of the database.

Sensor and room metadata

As discussed above, 22 Brick models have been developed for the in-situ datasets. Figure 5 shows an example of the Brick model for Dataset 20. The data contributor collected plug load, indoor and outdoor measurements from an educational office building in Vienna, Austria. Data collection started on January 1, 2013 and ended on December 31, 2013. The building has six different rooms, plug load data was collected from four out of six rooms, and indoor measurements were collected from all the rooms. This Brick model covers all the entities and their relationships in this dataset. The relationship Room “isLocationOf” Desk indicates that desk-level measurements exist in this dataset. In this model, there can be multiple “Building”, “Room”, and “Desk”. The number of those entities can be extracted from the Turtle file. The relationships include “isLocationOf”, “hasPoint”, and “Regulates”. Those relationships could be reversed as “hasLocation”, “isPointOf”, and “isRegulatedBy”. The points represent different sensors, for instance, wind direction sensor, wind speed sensor, air temperature sensor, humidity sensor, etc. Detailed information such as numbers or names of the entities and points can be found in the Turtle file.

Fig. 5
figure 5

View of the Brick model of Dataset 20.

Technical Validation

In this section, we explore the datasets and perform an initial analysis of different occupant behavior data. The analysis focuses on occupant number, door opening, occupant presence, window opening, and outdoor measurements.

Occupant number historical data

Figure 6 shows the historical data of occupant number in a commercial office building from Dataset 32. The occupancy of two office rooms was measured from May 22, 2018 to July 11, 2018. Camera-based sensors were deployed to collect occupant counts in this study. To valid the dataset, researchers added an automatic daily calibration for the measurement, which includes two functions: first, set the occupant count to zero if it is less than zero; second, set the occupant count to zero at 3 AM each day. From the figure, we can observe the weekday and weekend trends of occupant number in both rooms. Holiday effects can also be captured, such as Memorial Day (Monday - May 28, 2018), and Independence Day (Wednesday - July 4, 2018). The occupant number dropped to relatively lower values during weekends and holidays. Figure 7 provides a detailed view of the historical occupant number in one week, a common workday schedule was observed from both rooms.

Fig. 6
figure 6

Historical occupant number data from Dataset 32.

Fig. 7
figure 7

Historical occupant number data from Dataset 32 in one week.

Door status

Figure 8 shows a data distribution of door status (Open/Closed) in educational offices from Dataset 5. The building in this dataset has multiple rooms, with each room having only one door. Cable-connected magnetic sensors were used to measure the opening/closing of doors in this study. The data was collected from October 27, 2016, to October 31, 2017, with 5-minute granularity. The figure indicates that Room 1 and 2 have a similar trend of the door opening activities during the working hours (8 AM–5 PM). However, Room 3 showed different trends where door opening behaviors spread across the 24 hours of a day. Overall, the door opening probability was low, which indicated that the doors remained closed most of the time.

Fig. 8
figure 8

Data distribution of door openings from Dataset 5.

Occupant presence

Figure 9 shows a cross-comparison of the first arrival and last departure times among three different datasets (Dataset 9, 10 and 30) from three different countries. The data was collected from educational offices in various months from 2016 to 2018. Passive infrared sensors were used in Dataset 9 to collect the event-based occupancy data, the minimum occupied time was set as 15 minutes. Dataset 10 monitored space occupancy manually by a person in the office. And Dataset 30 utilized Bluetooth device pairing technology to sample occupancy data every one minute. In Figure 9, the first arrival captures the time when the space is first occupied, while the last departure captures the time when the space is last occupied during the day. Even though the distribution varies because of different lengths of data collection periods, it is clear that first arrival times are centered around 10 AM during the day. And the last departure times during the day are centered between 6 PM and 7 PM.

Fig. 9
figure 9

Cross-comparison of first arrival and last departure between occupant presence datasets from different countries.

Window operations

Figure 10 shows window status data with indoor and outdoor temperature measurements from Dataset 5. The data was collected from educational offices. Researchers deployed cable-connected magnetic sensors to measure the opening/closing of windows in this study, the sampling time was five minutes. The indoor temperature was measured by the temperature probe (PT1000 class A Cable) with an accuracy of 0.15 °C or less at 0 °C. Outdoor temperature was collected every 10 minutes from an over roof weather station that was installed about 10 meters above the ground. From the changes in indoor and outdoor temperature over two days, it can be observed that the HVAC system helped to maintain an indoor temperature close to 24 °C during the day. In Rooms 3 and 4, window opening activities were captured during the afternoons on both days when the indoor and outdoor temperature both were relatively high.

Fig. 10
figure 10

Window operation coupled with indoor and outdoor temperature in Dataset 5.

Outdoor measurements

Field measurements of outdoor parameters were investigated based on the available datasets. Those outdoor measurements cover five different climate zones, such as Tropical savanna wet climate (Aw), Humid subtropical climate (Cfa), Temperate oceanic climate (Cfb), Monsoon-influenced hot-summer humid continental climate (Dwa), and Warm-summer humid continental climate (Dfb). The time granularity varies among those datasets. Dataset 5 collected outdoor measurements from an over roof weather station with a 10-minute sampling time. Dataset 7 has a 5-minute sampling interval and data was collected from an onsite outdoor weather station. Researchers of Dataset 14 installed over roof portable outdoor weather stations and sampled outdoor measurements every 10 minutes. However, Dataset 16 collected hourly outdoor measurements from local weather station which is approximately 4 kilometers away from the most experimental buildings. Then, the data was resampled into every 10 minutes using linear interpolation. Since datasets were collected from different months in various years, in order to compare outdoor measurements in the same time span, data from November of four datasets (5, 7, 14, and 16) were identified and analyzed. Hourly data from different days were analyzed and plotted using boxplot. Figure 11 shows hourly outdoor temperature distributions of four different datasets and climate zones. Figure 12 shows hourly outdoor relative humidity distributions of those datasets. Figure 13 shows the hourly outdoor solar radiation distributions in datasets 5, 7, and 14 since dataset 16 doesn’t measure outdoor solar radiation. The results captured different trends of temperature, relative humidity, and solar radiation within the four different climate zones by the time of the day.

Fig. 11
figure 11

Outdoor temperature distributions by hour in different datasets and climate zones. Dataset 16 – Aw; Dataset 5 – Cfa; Dataset 7- Cfb; Dataset 14 – Dwa.

Fig. 12
figure 12

Outdoor relative humidity distributions by hour in different datasets and climate zones. Dataset 16 – Aw; Dataset 5 – Cfa; Dataset 7- Cfb; Dataset 14 – Dwa.

Fig. 13
figure 13

Outdoor solar radiation distributions by hour in different datasets and climate zones. Dataset 5 – Cfa; Dataset 7- Cfb; Dataset 14 – Dwa.

Usage Notes

The datasets19 have been uploaded to a public domain of the figshare website, users can download data through this link ( A website ( was created to query and download the desired data from the database based on different selection criteria. These criteria include types of measurement data, countries and cities, type of building, study ID, and publication. The website also provides an overall analysis of all the datasets, a list of available publications from those studies, etc.