High resolution synthetic residential energy use profiles for the United States

Efficient energy consumption is crucial for achieving sustainable energy goals in the era of climate change and grid modernization. Thus, it is vital to understand how energy is consumed at finer resolutions, such as the household level, in order to plan demand-response events or analyze the impacts of weather, electricity prices, electric vehicles, solar, and occupancy schedules on energy consumption. However, detailed energy-use data that would enable such studies have rarely been available or accessible. In this paper, we release a unique, large-scale, digital-twin dataset of residential energy use for the contiguous United States, covering millions of households. The data comprise hourly energy use profiles for synthetic households, disaggregated into Thermostatically Controlled Loads (TCL) and appliance use. The underlying framework is constructed using a bottom-up approach. Diverse open-source surveys and first-principles models are used for end-use modeling. The synthetic dataset has been extensively validated through comparisons with reported energy-use data. We present a detailed, open, high resolution, residential energy-use dataset for the United States.


Background & Summary
Modernization of the U.S. electric grid is occurring at a noteworthy rate due to the installation of new technologies within the grid such as smart meters. They enable two-way communication between the customer and utilities, providing information and granular control of power usage for individual households 1,2. The grid is also witnessing rapid transformations due to increasing penetration of electric vehicles (EV) and distributed energy resources (DER) such as rooftop photovoltaics (PV), community solar, and wind energy. While this wave of modernization is beneficial, the electric grid is simultaneously facing a sharp increase in crisis situations as a result of climate change phenomena 3,4 such as extreme weather events and global warming. One example of extreme weather is the February 2021 North American cold wave that caused a tremendous strain on the power grid, especially in Texas, where millions lost power for days 5. Another example is the impact of global warming on household HVAC energy use. Although a rise of 1°C to 2°C in winter temperatures is expected to decrease heating requirements, a similar rise in summer temperatures is expected to increase cooling needs significantly 6.
In the face of these challenges, achieving sustainable energy goals has become paramount for maintaining a healthy grid. To this end, the research community is faced with important questions regarding reduction of carbon footprints 7-11, incentivizing DER adoption 12, studying benefits of building energy retrofit 9,13,14, integration of electric vehicles 15 and consumer behavior 16 in the grid, and mechanisms for designing electricity pricing 17,18 to create efficient residential consumption patterns. Answering many of these questions requires comprehensive knowledge of energy-use patterns, building stock, the structure of distribution networks, consumer behaviors, and so on. However, such exhaustive datasets are rarely freely available (or available at all) for research use, making it hard for the research community to pursue these endeavours 19. Reasons for unavailability of such data range from privacy concerns to the lack of a system for making data available to researchers.
Most of the published energy use data are metered data, a result of longitudinal studies conducted by researchers (Table 1) with relatively small samples of households that may not be representative of the wider geographical region and demographics. Some of these studies monitor households over a longer period of time (e.g., two years); however, the downside of such experiments is that they take a considerable amount of time (e.g., participant consent, equipment setup, monitoring) and manual effort (e.g., data cleaning, imputing missing values) before the data are usable. Although these studies release energy data for free use, many of them limit publishing participant details (e.g., building characteristics and location, household-level demographics). Participant details are usually withheld due to privacy reasons or participant consent, lack of information, or unavailability of these attributes in the free version of the data. The literature has attempted to address some of these issues by creating appropriate data structures for releasing appliance metadata for households along with their energy use data 35,51. However, we observe that many of the issues still persist in the U.S. context. One such example is the Pecan Street Dataport 28. Pecan Street Inc. 27 is the largest publisher of energy-use data in the U.S. through their portal, Dataport. They collect energy-use data in California (CA), Texas (TX), New York (NY), and Colorado (CO). This is a potentially very useful data set. However, only a small sample (∼25 households in CA and TX) of energy-use data is freely available for public use, and it does not contain sufficient (or any) demographic or building information.
A dataset synthesized over a larger spatial scope offers the opportunity to study regional and temporal differences in energy use, while a dataset for a smaller region allows the study of energy use patterns that may be particular to that region. Irrespective of spatial scope, a small sample size makes it difficult to get a good representation of the population variation in the region (e.g., explaining or exploiting the role of household demographics, behavior, and building characteristics in energy use). In addition to the limits on spatial scope and number of samples, many of the datasets do not release sufficient (or any) participant details. Such limited data restricts the usage of these energy-use data for detailed practical analyses or for studying scenario interventions and equity questions in the grid (e.g., which type of demographic and building stock is best suited for EV adoption, or how much carbon footprint can be reduced by retrofitting buildings). Thus, we observe that there is a general sparsity of large-scale, high resolution energy use datasets accompanied by detailed household-level metadata such as appliance ownership, building data, and important demographic features.
We summarize key drawbacks of energy datasets for the U.S. as follows: limited spatial scope, small sample size, and lack of sufficient household, appliance, and building metadata. Given this wide array of problems with state-of-the-art energy-use data availability, we introduce synthetic energy use datasets that are able to address many of these issues. Synthetic data is defined as data generated by models that provide accurate statistical representations of the real world. Examples of such data for the smart grid are synthetic power distribution networks 52, energy consumption profiles for offices and commercial buildings 53 and for residential buildings 20,54-56. Our work specifically addresses the data scarcity gap in energy use research for the U.S. residential sector. We propose a synthetic framework for modeling large-scale high resolution energy use data by integrating diverse datasets and end-use models for bottom-up dis-aggregate energy modeling. This results in a novel synthetic energy use dataset (i.e., a digital twin of household level energy demand) comprising hourly electrical energy demand profiles for
Table 1. Energy-use datasets published in the residential sector.

Authors/Dataset Description
Klemanjak et al. 20,21 A synthetic energy demand dataset was released for 21 appliances in Austria in 2020. Data collected from two households was used to train models and then appropriate noise was added for appliance start times and durations to mimic variations in actual consumption patterns.
Kolter et al. 22,23 The Reference Energy Disaggregation Data Set (REDD) is published by MIT. The dataset contains high-frequency current/voltage waveform data of the power mains in households along with labeled circuits in the house.
Makonin et al. 24 The Rainforest Automation Energy (RAE) dataset was published by Harvard in 2017. The dataset contains 1 Hz data (mains and sub-meters) from two residential houses.
Murray et al. 25,26 Load measurements from 20 households in the UK from a two-year longitudinal study.
Pecan Street 27,28 Labeled circuit data for households across major cities in the U.S. This is said to be the most comprehensive dis-aggregate energy data available for the U.S.
Rashid et al. 29,30 The I-blend dataset has recorded minute-level consumption of all the buildings at an academic institute in India over a period of 52 months.
Paige et al. 31,32 The flEECe dataset provides energy data at a 1 Hz sampling rate for four circuits for six net-zero energy senior housing units in Virginia, USA for nine months.
Shin et al. 33,34 The first Korean dataset measuring appliance-level energy data was released in 2019 for 22 houses in Korea.
Kelly et al. 35,36 Power demand is recorded from five UK houses at two levels: whole house and individual appliances. This dataset is referred to as the UK-DALE dataset. Two versions of this dataset have been released.
Anderson et al. 37,38 Building-Level fUlly-labeled dataset for Electricity Disaggregation (BLUED) for one household in Pittsburgh, U.S. for one week. State transitions of appliances are labeled and time-stamped, providing the necessary ground truth for the evaluation of NILM algorithms.
Barker et al. 39,40 Electricity usage data is monitored every minute from nearly every plug load from 400 anonymous homes.
Beckel et al. 41 Electricity consumption is monitored via smart plugs for six households in Switzerland over a period of 8 months.
Pereira et al. 42-44 Power usage for 44 apartments and 6 homes in Portugal is collected for 264 days at 30-minute intervals. The advanced version of this dataset, 'SustDataED2', contains 96 days of aggregated and individual appliance consumption from one household in Portugal.
Monacchi et al. 45,46 Common household devices are monitored for power consumption in Austria and Italy (GREEND dataset).
Pullinger et al.
household-level electricity consumption behaviors dataset for the U.S. Our synthetic energy-use infrastructure is well-suited to solve the newer smart grid problems mentioned earlier. We publish the dis-aggregated energy use time series for all the synthetic households. The published data are representative of U.S. households, provide household-level metadata, and are a good representation of real-world energy use.
Table 2. List of primary datasets used for constructing the residential demand models.

Dataset Description
American Time Use Survey (ATUS 2015)
ATUS provides nationally representative estimates of how, where, and with whom people in the U.S. spend their time, and is the only federal survey providing data on the full range of activities, from childcare to volunteering. This survey provides demographic information as well as information on energy-related activities 58. 24-hour data is recorded for 5115 participants.

Synthetic Populations and Ecosystems of the World (SPEW)
SPEW 57,59 is a framework that produces synthetic populations for various countries. We used the open-sourced version of the synthetic population available for the U.S. constructed for the year 2013.
PUMS is a 5% representative sample for a larger region than a block group, referred to as a Public Use Microdata Area (PUMA) 62. PUMAs are described by the Census as "a collection of counties or tracts within counties with more than 100,000 people". These statistical areas are defined for the circulation of PUMS data. PUMS contains individual records of the characteristics for a 5% sample of people and their households. One PUMS record is a complete Census record.
North American Land Data Assimilation System (NLDAS)
Hourly temperature data for North America. Data resolution is a 1/8th-degree grid over North America 63.

Residential Energy Consumption Survey (RECS 2015)
The U.S. Energy Information Administration (EIA) Residential Energy Consumption Survey (RECS) 64 is a national sample survey that collects energy-related data for housing units. For 2015, data was collected from 5,686 households to represent 118.2 million U.S. households. We use this dataset to obtain housing unit-specific information such as floor area, main heating fuel, fuel equipment, indoor temperature setting, presence of air conditioner, dishwasher, washer, dryer, and refrigerator, water heater fuel, water heater size, water heater age, number of lighting units, etc.

National Solar Radiation Database (NSRDB)
NREL provides solar radiation data for the U.S. We use hourly data that comes from the physics-based approach called the Physical Solar Model (PSM). Data is available for the U.S. for 1998-2014 65. The GHI variable is used as an indicator of irradiance level in the lighting model. GHI is modeled solar radiation on a horizontal surface received from the sky. This is measured in watts per square meter (W/m²).

Miscellaneous
Appliance power and efficiencies, gallons of hot water required for activities, and any other input data required for models is drawn from surveys and data collected from ground and/or testing [66][67][68][69] .

Methods
This section describes the datasets and models employed to generate synthetic energy use time series at the household level (see Table 2). All notations used in the paper are described in Table 3.
Daily amount of water consumed (in gallons) by a household H i in a day by an event v.
The presented framework is composed of a synthetic representation of the U.S. population, regression models for surveys, and bottom-up energy use models. A synthetic population is composed of households and people in households. The synthetic households are generated using census surveys and statistical methods such that the synthetic population is statistically similar to the original population. An open-source version of the U.S. synthetic population, Synthetic Populations and Ecosystems of the World (SPEW) 57,59, is used in our framework. The SPEW synthetic population is comprised of demographic characteristics of synthetic households and synthetic individuals. The synthetic population is created using U.S. census data such as PUMS (Table 2) and statistical methods such as sampling and the Iterative Proportional Fitting (IPF) method 70.
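To illustrate the IPF step mentioned above, the following is a minimal sketch of the generic two-dimensional algorithm; it is not the SPEW implementation. A seed cross-tabulation is rescaled alternately by rows and by columns until its margins match known totals. The function name `ipf`, the seed values, and the marginal targets are all hypothetical and chosen only for illustration.

```python
def ipf(seed, row_targets, col_targets, max_iters=1000, tol=1e-9):
    """Iterative Proportional Fitting on a 2-D table (list of lists).

    seed, row_targets and col_targets are illustrative inputs; the two
    target vectors are assumed to have the same total.
    """
    table = [list(row) for row in seed]
    for _ in range(max_iters):
        # Scale each row so that it sums to its target marginal.
        for r, target in enumerate(row_targets):
            row_sum = sum(table[r])
            if row_sum > 0:
                table[r] = [cell * target / row_sum for cell in table[r]]
        # Scale each column so that it sums to its target marginal.
        for c, target in enumerate(col_targets):
            col_sum = sum(row[c] for row in table)
            if col_sum > 0:
                for row in table:
                    row[c] *= target / col_sum
        # Columns now match (up to rounding); stop once the rows also match.
        row_error = max(abs(sum(table[r]) - t) for r, t in enumerate(row_targets))
        if row_error < tol:
            break
    return table


# Hypothetical example: household size (rows) by income band (columns).
fitted = ipf(seed=[[1.0, 2.0], [3.0, 4.0]],
             row_targets=[40.0, 60.0],
             col_targets=[30.0, 70.0])
```

The fitted table preserves the interaction structure of the seed sample while agreeing with both sets of marginal totals, which is the property a synthetic population generator relies on.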
The SPEW households are made of basic demographic (e.g., income, age) and locality information. Although the SPEW population is representative of the U.S. population at a finer spatial resolution, it is not equipped with the energy and activity related information (e.g., building characteristics, time spent at home, number of cooking activities) necessary for estimating energy use at the household or person level. Building stock, energy, and activity related information is collected by national surveys in the U.S.: the Residential Energy Consumption Survey (RECS) 64 and the American Time Use Survey (ATUS) 58, respectively. The basic synthetic population is augmented with energy and activity related attributes by building machine learning models. This augmentation is called the enrichment step. The enriched synthetic population, along with other freely available data sources, is used as input to the energy use modeling framework. The energy use modeling framework has six models for representing nine energy uses: HVAC, lighting, domestic hot water, refrigerator, dishwasher, cooking, clothes washer, clothes dryer, and miscellaneous plug load such as TV, computer use, and cleaning activities (e.g., vacuuming). The first subsection describes the modeling details of the enrichment step and the following subsection describes the energy demand models.
These datasets are input to different modeling components of the framework. Some datasets support augmentation of the synthetic population while others are input to the energy-use models. All the models are described in the Methodology section. The bottom rectangle describes the recorded data/smart meter data from different climate zones of the U.S. These datasets are used for validation of the synthetic energy-use timeseries. The validation block (yellow backdrop) describes three components of V&V: regional, magnitude, and structural/shape comparisons. This line of validation covers (a) different temporal aspects (hourly and daily), (b) spatial aspects in terms of regions and seasons, and (c) the diversity aspect of the large-scale synthetic data. The blue text refers to the V's of big data. Each colored block possesses the given V characteristic.

Enrichment models
The enrichment models support creating comprehensive synthetic structures for calculating residential energy usage. This step is called the enrichment step. Figure 2 gives a pictorial overview of the framework. Datasets used in this workflow are described in Table 2. Since the demographic features available in the synthetic population are not sufficient for computing energy usage, the population is made richer by adding layers of information related to building stock and energy consumption from the RECS survey, such as building characteristics, appliance ownership, and thermostat set-point behaviors. This mapping of features is made by building inference tree models. Activity schedules for a normative day of an ATUS survey respondent are attached to each synthetic individual by building a multivariate random forest regression model. These models are described below.
The ATUS model. The ATUS data provides nationally representative surveys of people's activities in different location types such as childcare in or outside the house, time spent at work, laundry time at home, waiting times in hospital, and so on (see Table 2 for a description). The time-use diaries of the survey individuals can be attached to synthetic individuals by matching an appropriate survey individual to a synthetic individual. In our work, we consider appropriate matching based on the amount of time a person spends in different location types such as home, work, school, shopping, and other miscellaneous locations. This seems a reasonable approach because we are interested in learning how an individual spends the 24 hours of the day by categorizing the amount of time spent at important location types; e.g., the time spent in different location types by a person who works full-time is quite different from that of a house-bound senior citizen or a college student. This rationale of assigning survey respondents to synthetic individuals is also presented in prior work by Lum et al. 71.
A random forest regression method is used to build a model that predicts the amount of time a person spends in location types such as home, work, shopping, other, and school, as well as trip counts during the day. Thus, six dependent variables are modeled: trip count during the day and time spent at each location type (home, work, shopping, other, school). Independent variables used to build the model are as follows: number of members in the household (hsize), number of children (nchild), age (age), working hours (wrkhrs), gender (gender), income modeled as a categorical variable (hinc2, hinc3), binary variables such as an American citizen or not (nativity), worker or not (worker), owns home or not (ownhome), and has a phone or not (tel), and race-related variables indicating whether the person is white, Hispanic, black, or Asian (white, hispanic, black, asian). Figure 3 shows an example of feature importance for two dependent variables. To select the parameters 'ntree' (number of decision trees) and 'node size' (minimum size of terminal nodes), eight conditions are tested for the combination of the two parameters: ntree = 500, 1000, 1500, and 2000; node size = 5 and 10. The plots show robust results across the different conditions. According to the plots, the following five independent variables (wrkhrs, worker, age, hinc3, hsize) mostly affect all the dependent variables. The right-hand y-axis shows the absolute Pearson Correlation Coefficient. The positive and negative coefficients are distinguished by blue dots and squares, respectively. Except for wrkhrs and worker, all other independent variables are weakly correlated with the dependent variables.
Once the model is trained on ATUS respondents, a synthetic person $P_{i,j}$ is randomly assigned a survey individual from the leaf nodes in the trained ensemble model. Thus, the result gives every synthetic individual a time-use diary. The energy-use models extract home activities from a time diary and also build a household-level occupancy schedule over the 24-hour duration, denoted as $O_{i,0}, O_{i,1}, \ldots, O_{i,23}$. These are used as an input to the energy use models. Activity scheduling conflicts among synthetic household members are handled in the activity model.
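The construction of the household-level occupancy schedule can be sketched as follows. This is an illustrative simplification under stated assumptions, not the published activity model: each member's time-use diary is assumed to be a list of 24 hourly location labels, and the occupancy at hour t is taken to be the number of members whose label is "home". The function name and the label strings are hypothetical.

```python
def household_occupancy(diaries):
    """Return the 24-value occupancy schedule O_{i,0..23} for one household.

    diaries: one list per household member holding 24 hourly location
    labels (an assumed, simplified diary format).
    """
    if any(len(diary) != 24 for diary in diaries):
        raise ValueError("each diary must contain exactly 24 hourly labels")
    # Count the members located at home in each hour of the day.
    return [sum(1 for diary in diaries if diary[t] == "home") for t in range(24)]


# Hypothetical two-person household: a full-time worker and a house-bound senior.
worker = ["home"] * 8 + ["work"] * 9 + ["home"] * 7
senior = ["home"] * 24
occupancy = household_occupancy([worker, senior])
```

In this example the occupancy is 2 overnight and in the evening, and 1 during the worker's working hours.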
The RECS mapping model. The baseline synthetic population does not have any building structural characteristics or appliance ownership information. These salient features are important for modeling different categories of energy use and are available in the RECS survey. We overlay RECS household attributes onto a synthetic household by building multivariate conditional inference trees 72,73. A conditional inference tree is a non-parametric class of regression trees that uses recursive partitioning of dependent variables based on the value of correlations. Four dependent variables are modeled: square footage of the dwelling, presence of laundry appliances, presence of air conditioner, and presence of dishwasher. The independent variables are the year in which the house was built, occupancy time of the current tenants, whether the residence is owned or rented, total number of rooms, income, number of refrigerators, number of members in the household, dwelling type, whether the dwelling is located in an urban or rural area, and primary heating fuel type. The independent variables are common attributes between RECS survey records and synthetic household records. Conditional inference trees are trained on different census regions in the U.S. to tease out regional differences. A RECS household $S_l$ is randomly selected from the appropriate leaf nodes of the conditional inference tree and assigned to the synthetic household $H_i$ every time a new simulation is run. This dynamic assignment introduces stochasticity when the simulation is executed for the same and/or different days.
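The random donor assignment described above can be sketched as below. The sketch assumes the tree has already been fitted, so that every synthetic household and every survey household is labeled with a leaf identifier; the leaf labels, household identifiers, and the function `assign_donors` are hypothetical and stand in for the output of the conditional inference tree.

```python
import random


def assign_donors(synthetic_to_leaf, leaf_to_survey_ids, rng):
    """Pick one survey household at random from each synthetic household's leaf."""
    return {
        household_id: rng.choice(leaf_to_survey_ids[leaf])
        for household_id, leaf in synthetic_to_leaf.items()
    }


# Hypothetical leaf memberships for three synthetic and five survey households.
synthetic_to_leaf = {"H1": "leaf_a", "H2": "leaf_b", "H3": "leaf_a"}
leaf_to_survey_ids = {"leaf_a": ["S1", "S2", "S3"], "leaf_b": ["S4", "S5"]}

# A fresh random generator per simulation run re-draws the donors.
assignment = assign_donors(synthetic_to_leaf, leaf_to_survey_ids, random.Random(0))
```

Re-running the assignment with a different seed can give a synthetic household a different donor from the same leaf, which is one way to picture the stochasticity across simulation runs.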

Energy use modeling
The enriched synthetic population (i.e., the output of the enrichment step) enables encoding of behaviors (time spent in different energy related activities at home), normative attributes (e.g., square footage, age, income, gender), declarative attributes (e.g., individual activities as a sequence) and procedural attributes (e.g., behaviors capturing dependencies, interactions, frequency of performing activities) into the knowledge required for building energy use profiles 74. The synthetic infrastructure is leveraged to build six energy use models (Figure 2). Nine end-uses are synthesized for each household. These end-uses are divided into two parts: Thermostatically Controlled Loads (TCL) and appliance use. For a household $i$, the end-uses published in the data include:

HVAC ($E^{hvac}$). This category includes heating and cooling electric load from central air conditioning during hot days and electric furnace/heater used during cold days. This is a TCL load.

Domestic hot water use ($E^{h2o}$). Energy consumed for heating water that is needed for personal grooming activities such as shower/bath, laundry activities such as using the clothes washer, and the dishwasher. This is a TCL load.

Clothes washer ($E^{cwasher}$). Energy used by electric clothes washers.

Cooking ($E^{cook}$). Energy consumed by electric cooking range, oven, and other kitchen appliances such as coffee maker, microwave, toaster, etc.

Miscellaneous plug load ($E^{misc}$). This type of energy indicates plug load attributed to cleaning activities and electronic devices such as TV, computers, and other smaller electronic gadgets.
Table 3 describes the notations used in the methodology sections. The total energy summed over 24 hours ($E^{total}_i$) of a household $i$ is given by the equations below -

HVAC model $E^{hvac}$
According to the U.S. Energy Information Administration (EIA) 75, HVAC is responsible for the highest proportion of energy consumption in households. The HVAC model calculates how much energy is required to maintain ambient/comfort temperature indoors. This depends on factors such as the area of the house, outdoor temperature, efficiency of HVAC equipment, and so on. Occupant thermostat-setting behaviour in different seasons and household occupancy during the day play an important role in understanding thermal comfort levels and their effect on electricity consumption. Engineering and statistical approaches 76 are presented in the literature to simulate energy consumption of heaters/furnaces and air conditioners 77-80. We adopt the engineering-based approach from Subbiah et al. 80 where the function of heating/cooling a household $H_i$ at hourly intervals is defined as:

Here $E^{hvac}_{i,t}$ is the energy consumed by household $H_i$ at the end of hour $t$ in kWh by heating/cooling equipment to maintain thermal comfort. $FloorArea_i$ is the floor area and $WallArea_i$ is the wall area (extrapolated from floor area 80) of $H_i$. The quantities $R_{roof}$ and $R_{wall}$ are R-values (insulation level) for households in different climate zones, while $\eta$ is defined in Table 3. Next, $\Delta T$ is the absolute difference between $T^{in}_t$ and $T^{out}_t$, and $T^{in}_t$ is the indoor thermostat temperature at hour $t$. The hourly outside temperature ($T^{out}_t$) is obtained from the NOAA NLDAS data mentioned in Table 2.
Efficiency and insulation data are obtained from guidelines published by EIA. All other household attributes are obtained from the enriched synthetic population. Depending upon occupancy patterns throughout the day, changes in thermostat behaviors are assigned to each household. Heating and cooling threshold temperatures for appliance on/off times are taken from the thermostat study published by NREL in 2017 81.
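The quantities listed for the HVAC model can be combined in a simple steady-state conduction sketch. The expression below is an assumption made for illustration: it takes the heat flow through roof and walls as area divided by R-value times the temperature difference, converts BTU/h to kWh over one hour, and divides by the equipment efficiency. It is not necessarily the exact form used by Subbiah et al. 80, and the imperial units (ft², °F, R-values in ft²·°F·h/BTU) and the function name are likewise assumed.

```python
BTU_PER_KWH = 3412.14  # standard conversion between BTU and kWh


def hvac_energy_kwh(floor_area_ft2, wall_area_ft2, r_roof, r_wall,
                    t_in_f, t_out_f, efficiency):
    """Estimate one hour of HVAC electricity use (kWh) for a household.

    Assumed steady-state form: (FloorArea/R_roof + WallArea/R_wall) * dT,
    with the roof area approximated by the floor area.
    """
    delta_t = abs(t_in_f - t_out_f)  # absolute indoor/outdoor difference
    heat_flow_btu_per_hour = (floor_area_ft2 / r_roof + wall_area_ft2 / r_wall) * delta_t
    # Energy over one hour, converted to kWh and adjusted for efficiency.
    return heat_flow_btu_per_hour / BTU_PER_KWH / efficiency


# Hypothetical dwelling on a cold hour: 70 F indoors, 40 F outdoors.
energy = hvac_energy_kwh(1500.0, 1200.0, r_roof=30.0, r_wall=15.0,
                         t_in_f=70.0, t_out_f=40.0, efficiency=1.0)
```

With these made-up inputs the heat flow is (50 + 80) × 30 = 3900 BTU/h, or roughly 1.14 kWh for the hour; summing such hourly values over the day gives the household's daily HVAC energy.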

Domestic Water Heating Model $E^{h2o}$
The EIA 75 shows that 17%-32% of household energy use is attributed to domestic hot water (DHW) use. The literature presents models for estimating hot water demand at multiple temporal resolutions: annual, daily, hourly, and minute intervals. One of the initial models for estimating load profiles of hot water demand was developed in 2001 by Jordan et al. 82 for a period of one year at temporal resolutions of 1 min, 6 min, and 1 hour. However, this work does not consider historical or actual flow rates to determine how much hot water (gallons/day) is used by a household. A follow-up paper synthesized water demand profiles for Switzerland 83 by calibrating this model using field data. A model to simulate a yearly DHW event schedule for a single-family household was developed by Hendron et al. 84 from the National Renewable Energy Laboratory (NREL) in 2010. The simulator used two surveys that collected information about water demand in U.S. households for five categories: sink, bath, shower, clothes washer, and dishwasher. This model has been widely accepted in the literature. One recent example of the adaptation of Hendron's model is for simulating hot water demand in Canadian households 85. The model is calibrated with survey data collected for Canada and appropriate adjustments are made with respect to Canadian lifestyles.
For our model, we use the distributions of duration and flow rates of activities involving hot water usage such as bath/shower, clothes washer, and dishwasher from Hendron et al. Note that sampled durations and flow rates can take negative values (Table 4). For any negative value, the flow rate is set to 0.05 gpm and the duration is set to 1 minute 84. Table 4 characterizes the average count of daily events, duration, and flow rates. The values of hot water temperature for different uses and the cold water inlet temperature are obtained from studies conducted by NREL in different regions of the U.S. 68,69,86. An engineering-based approach is used to estimate hot water usage 68,80 in household $i$ for event $v$ at time $t$.

The gallons of hot water $G^{hot}_{v,i,t}$ consumed by event $v$ is computed as a product of flow_rate (gpm) and duration (minutes). Both these characteristics are drawn from distributions in Table 4. $E^{hot}_v$ is the energy consumed by the event $v$ to heat $G^{hot}_v$ gallons of water. The last four entries in Table 3 show the summation of multiple events occurring across the time horizon. Here $\eta$ is the efficiency of the electric water heaters. Surveys conducted by NREL have shown that $\eta$ is a complex function of the storage capacity, type, and age of the water heater. No distributions are available for $\eta$ in the current studies. Field data collected from NREL surveys 68,69,86 show that the efficiency varies anywhere between 80% and 99%. Here 0.00189 kWh/(gal·°F) is a conversion constant obtained from Subbiah et al. 80, and $\Delta T$ is the temperature difference (°F) between the mains (inlet) water temperature $T^{cold}_{m,z}$ for a given month $m$ in a climate zone $z$ and the water temperature required for a particular end-point. The values for $T^{cold}_{m,z}$ and $T^{hot}_v$ are obtained from NREL surveys 68,69. Whenever the activity model detects the presence of an event $v$, we calculate the energy used by hot water for the event using Equation 3.
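The per-event hot water calculation can be illustrated with a short sketch. It uses the quantities named in the text (flow rate, duration, the 0.00189 kWh/(gal·°F) constant, the hot and cold water temperatures, and the heater efficiency), but the way they are combined here, in particular dividing by the efficiency, is an assumption and may differ from Equation 3. The function name and the example values are hypothetical.

```python
KWH_PER_GAL_F = 0.00189   # conversion constant quoted in the text
MIN_FLOW_GPM = 0.05       # replacement for a negative sampled flow rate
MIN_DURATION_MIN = 1.0    # replacement for a negative sampled duration


def hot_water_event_energy_kwh(flow_gpm, duration_min, t_hot_f, t_cold_f, efficiency):
    """Energy (kWh) used to heat the water drawn by one hot-water event."""
    # Negative samples from the event distributions are replaced by the minimums.
    if flow_gpm < 0:
        flow_gpm = MIN_FLOW_GPM
    if duration_min < 0:
        duration_min = MIN_DURATION_MIN
    gallons = flow_gpm * duration_min          # gallons of hot water for the event
    delta_t = t_hot_f - t_cold_f               # temperature rise in degrees F
    # Assumed form: dividing by efficiency gives the electricity drawn.
    return gallons * KWH_PER_GAL_F * delta_t / efficiency


# Hypothetical shower: 2 gpm for 8 minutes, heated from 60 F to 110 F at 90% efficiency.
shower_kwh = hot_water_event_energy_kwh(2.0, 8.0, 110.0, 60.0, 0.9)
```

For this made-up shower the event draws 16 gallons and about 1.68 kWh; daily hot water energy would then be the sum of such events over the day.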
Note that we compute hot water energy usage only for synthetic households having electric water heaters.

Lighting accounts for 5-10% 75 of the consumption, with lighting usage in residential settings mainly characterized by outdoor lighting conditions and occupancy schedules in households 87. A Markov-chain approach is adopted by Widen et al. 88 for modeling lighting demand in Swedish households using Swedish time-use data. A stochastic model is developed for residential lighting estimation for the city of Cordova in Spain by Palacios-Garcia 89 based on a model developed by Stokes et al. 90 using measured lighting data for 100 UK homes. Another stochastic model is developed by Richardson et al. 91 for UK households using time-use data and lighting data from the Energy Information Administration (EIA).

We build a stochastic model for lighting demand in U.S. dwellings by building on design concepts from work done by Richardson et al. 91, Stokes et al. 90, and Paatero & Lund 92. Richardson's model is particularly interesting since it supports important characteristics of light usage such as 'co-use' and 'relative weights'. The model uses the concept of 'co-use' of lighting, i.e., lighting in a dwelling is often shared by household members in the same space of the dwelling at the same time. The model also considers that all lighting units are not used at the same frequency (e.g., frequently occupied rooms such as the kitchen and living area will use more lighting than other rooms) and employs a weighting scheme to indicate relative usage.
Outdoor lighting conditions are modeled using an irradiance time series obtained from the NSRDB described in Table 2. Hourly irradiance data is collected using the NSRDB API for the 365 days of the year 2014 at census tract resolution for the U.S. Thus, all synthetic households in a census tract use the same irradiance time series for a given day. The household-level hourly occupancy profile $O_{i,0}, O_{i,1}, \ldots, O_{i,23}$ is developed by examining activities of awake synthetic household members of $H_i$ at home. The presence of awake occupants in the dwelling supports the decision to switch on lights. The distribution of lighting units in households is derived from the RECS survey. In general, the distribution of lighting units of a household $H_i$ is taken from the matching $S_l$. Three types of lighting units are considered: incandescent, CFL, and LED. Power ratings of lighting unit categories are taken from a study conducted by the Bonneville Power Administration (U.S.) where lighting fixtures were analyzed for a sample of 161 Northwest residences 93. For a given simulation day, we define an irradiance threshold ($Irr_i$) for a household $H_i$. It indicates that occupants may consider switching on lights when outdoor irradiance is less than $Irr_i$. $Irr_i$ is sampled from a normal distribution 91, Normal(60, 10). All notations used in the model are described in Table 3. Annual lighting data for the U.S. is summarized for different household sizes from the RECS survey.
Literature shows that lighting usage increases with the number of occupants in the household; however, lighting usage does not double for every occupant added to the house. In order to simulate shared lighting usage, the concept of effective occupancy 91 of a household, $\hat{O}_{i,0}, \ldots, \hat{O}_{i,t}, \ldots, \hat{O}_{i,23}$, is introduced. Effective occupancy ($\hat{O}_{i,t}$) is defined as a function of active occupancy ($O_{i,t}$). The values for effective occupancy are derived by scaling the annual lighting demand by household size such that the effective occupancy of a dwelling with one active occupant is one. The next step is to obtain the details of lighting units in a household. The proportions of lighting unit types are obtained from a RECS household $S_l$ that matches $H_i$ (RECS Model). Power ratings are attached to each lighting unit. In general, not all lighting units are used at the same frequency. This is observed in literature surveys such as the DECADE report 94. The frequency of usage of lighting units in households can be roughly modeled as a natural log curve 91; however, no formal methods have been presented in the literature due to a lack of quantitative data. We use the natural log curve presented in Richardson et al. 91 to model the relative usage of a lighting unit. Once weights are assigned to lighting units, the probability of a switch-on event for every lighting unit is calculated at a regular time interval (in our case 1 hour). The probability of a switch-on event $P^{on}_b$ of lighting unit b at hour t is calculated from the following quantities: $b_{weight}$ is sampled from a natural logarithmic curve, γ is a calibration constant used to achieve the appropriate annual lighting consumption for the U.S., and $\hat{O}_{i,t}$ is the effective occupancy of $H_i$ at time t. If a switch-on event occurs, then energy consumption is calculated for the respective lighting unit b. The lighting duration is picked randomly from the distribution described in Stokes et al. 90.
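The lighting procedure can be illustrated with a minimal sketch. It is not the authors' implementation: the product form of the switch-on probability (weight × γ × effective occupancy, applied only when irradiance is below the household threshold) follows the general design of Richardson-style models and is an assumption here, as are the function names, the `-ln(u)` weighting curve, and the simplification that a switched-on unit stays on for the full hour rather than for a duration drawn from the Stokes et al. distribution.

```python
import math
import random

def relative_weight(rng):
    """Assumed natural-log weighting curve: -ln(u) with u uniform on (0, 1]."""
    u = 1.0 - rng.random()
    return -math.log(u)

def switch_on_probability(weight, gamma, effective_occupancy, irradiance, threshold):
    """Assumed product form; zero when nobody is active or daylight is sufficient."""
    if effective_occupancy <= 0 or irradiance >= threshold:
        return 0.0
    return min(1.0, weight * gamma * effective_occupancy)

def simulate_day(irradiance_24h, effective_occupancy_24h, bulb_watts, gamma, rng):
    """Return 24 hourly lighting energies (kWh) for one household."""
    threshold = rng.gauss(60, 10)            # Irr_i ~ Normal(60, 10), as in the text
    weights = [relative_weight(rng) for _ in bulb_watts]
    energy_kwh = [0.0] * 24
    for t in range(24):
        for weight, watts in zip(weights, bulb_watts):
            p = switch_on_probability(weight, gamma, effective_occupancy_24h[t],
                                      irradiance_24h[t], threshold)
            if rng.random() < p:
                # Simplification: the unit is on for the whole hour.
                energy_kwh[t] += watts / 1000.0
    return energy_kwh

rng = random.Random(0)
dark_evening = [0.0] * 24                    # hypothetical irradiance series
nobody_home = [0.0] * 24
profile = simulate_day(dark_evening, nobody_home, [60, 10], gamma=0.05, rng=rng)
```

In this sketch γ plays the role described in the text: raising or lowering it scales the number of switch-on events until the simulated annual consumption matches a target.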

Refrigeration E refr
The energy consumed by a refrigerator depends upon its size, age, ambient temperature, and several other factors as described in literature. Refrigerators account for 3%-5% of total residential energy usage. Shimoda et al. 78 show that daily refrigerator consumption is affected by outside temperature, while Tsuji et al. 79 show a linear relationship between outside temperature and annual refrigerator demand. Both of these works were done in the context of refrigerators in Japan. The Lawrence Berkeley National Laboratory in California uses field-metered energy use data from ∼1500 refrigerators and freezers to develop a model that predicts annual usage of different freezer and refrigerator categories 95. All of the above models collected relevant data from the field or utilized detailed surveys on refrigeration.
Our approach is to develop a regression model for predicting the daily refrigerator usage (kWh/day) of a household ($E^{refr}_i$) as a function of outside environment temperature. The model is trained with the metered refrigerator usage data from Pecan Street Inc, where 30% of the total metered data is used for training and testing the model. The 30% data is obtained by conducting stratified sampling based on climate zones and daily average temperature bins. The dependent variable is the daily refrigerator usage $E^{refr}_i$ in kWh/day for $H_i$. The independent variables are the daily average temperature $T^{out}$ (°F) and categorical attributes indicating three major climate zones. The 24-hour load profile of a refrigerator, $E^{refr}_{i,0}, E^{refr}_{i,1}, \ldots, E^{refr}_{i,23}$, is constructed from the daily usage, and the variation in the hourly usage of the refrigerator is modeled using a Gaussian distribution. The refrigerator operates in an automated/standby mode, that is, occupant presence does not influence the energy consumption of this activity 79,80. Thus, computing the 24-hour profile of the refrigerator by adding a small Gaussian noise to the hourly load can be considered acceptable. The validation section shows that the addition of this noise creates a good match to real data.
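As an illustration of the two steps above (a temperature-based regression for the daily total, then an hourly profile with Gaussian variation), consider the following sketch. The coefficients, zone names, and noise level are hypothetical placeholders, not the fitted values from the Pecan Street data, and rescaling the noisy profile so that it sums back to the daily total is an assumption of this sketch.

```python
import random

# Hypothetical coefficients; the fitted values are not given in the text.
ZONE_OFFSET_KWH = {"cold": 0.0, "mixed": 0.1, "hot": 0.2}

def daily_refrigerator_kwh(t_out_f, zone, intercept=0.8, slope=0.01):
    """Linear model: daily kWh from average outdoor temperature (F) and climate zone."""
    return intercept + slope * t_out_f + ZONE_OFFSET_KWH[zone]

def hourly_profile(daily_kwh, rng, noise_sd=0.1):
    """Spread the daily total over 24 hours with small multiplicative Gaussian noise."""
    base = daily_kwh / 24.0
    raw = [max(0.0, base * (1.0 + rng.gauss(0.0, noise_sd))) for _ in range(24)]
    total = sum(raw)
    if total == 0:
        return [base] * 24
    # Rescale so the hourly values add up to the predicted daily usage.
    return [daily_kwh * x / total for x in raw]

rng = random.Random(42)
daily = daily_refrigerator_kwh(70.0, "hot")
profile = hourly_profile(daily, rng)
```

Because the refrigerator runs independently of occupancy, no occupancy input appears anywhere in the sketch.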

Activity model E appliances
Appliance usage and plug loads account for 20%-26% of a household's energy consumption. This energy is a result of the occupants' desires to perform activities such as taking baths, making hot meals, using the dishwasher, doing laundry, using or charging electronics such as TVs and computers, or using any other appliances that consume electricity. Equations 1b and 1c are used in this model. Based on the aforementioned end-uses, appliance usage behavior is characterized in Tsuji et al. 79 through the operational mode of appliances, duration of operation, power consumption, limit on daily event occurrence, and saturation rate. The operational mode of appliances describes how appliances function and the related behavior, which can be categorized into three types: automatic (appliance use is independent of a person), semi-automatic (appliance turned on by a household member but turned off automatically), and manual (appliance turned off and on manually). The saturation rate can be used to determine the presence and/or penetration of certain appliances in households. Generally, the operational mode of appliances and the saturation rate are deterministic in nature. However, parameters such as probability of activity occurrence, start time, duration, power consumption, and maximum occurrences vary from household to household and day to day. In general, some appliance usages can overlap and/or occur in parallel. These details are handled in this model.
Table 6 outlines all the modeled activities and related appliances, their modes of operation, maximum allowed daily occurrences, activity duration, and power consumption. Distributions marked with an asterisk (*) are modeled by engineering judgement and/or other sources such as Energy Calculator (energyusecalculator.com). Power rating distributions for dishwashers are obtained from a survey conducted by NIST 67,96. Power ratings and duration distributions for laundry appliances are derived from literature 54,80 and surveys 96. Appliances in the cook activity include electric ovens, microwaves, and electric cooktops (small and large burners). Power rating distributions for these appliances are derived from the NIST efficiency study 66, and durations of appliance usage are obtained from ATUS data, where the maximum number of cooking activities is capped at three. Sample power ratings for TVs are observed from EnergyStar reports 98 and modeled using a normal distribution. The TV activity duration is modeled as a log-normal distribution after examining the ATUS survey data. Power ratings for the computer use activity are derived from a small study conducted by EnergyStar 97. Standard values for charging duration are taken from reputed laptop manufacturers. Vacuum-related data are obtained from the EnergyStar vacuum report and a survey conducted by Electrolux covering 28,000 consumers from 23 countries including the U.S. 99,100. We assume that all households have vacuum cleaners. The usage frequency of vacuuming is 1-5 times per week 100 and the maximum number of daily occurrences is 1. Assuming a Normal distribution for power ratings and duration of appliance usage is reasonable after examining rudimentary results from surveys/reports. The results of the hot water usage study conducted by NREL 84,86, as summarized in Table 4, show that most of the processes can be modeled as a Normal distribution.
Table 5. Summary of referenced end-use modeling methods, including how these models are extended in this paper.

End-use: HVAC
Relevant models: Muratori et al. 77, Subbiah et al. 80, Thorve et al. 54, Tsuji et al. 79
Our approach: Our model is based on the approach adopted in Subbiah et al. 80 and Thorve et al. 54. These models were specific to Virginia state. The method employed in these works as well as ours is a physics model. This model is also documented in NREL Technical Reports. Additional details about thermostat settings and building characteristics such as insulation are obtained from the RECS survey, the EIA website, and NREL Technical Reports.

End-use: DHW
Relevant models: Maguire et al. 68, Hendron et al. 84, Thorve et al. 54
Our approach: Hendron et al. 84 and Maguire et al. 68 present a general stochastic method to reproduce sample hot water draws based on two water usage surveys conducted in the U.S. The analysis concludes by reporting distributions related to hot water usage events such as showering, using the dishwasher, and using the clothes washer. Some of these results are summarized in Table 4 and used in our model. Hot and cold water temperatures for specific end-uses are obtained from NREL surveys. The above model does not consider the setting of specific household schedules. This context of household occupancy and occurrence of events is added to an existing model in literature presented in Thorve et al. 54.

End-use: Lighting
Relevant models: Richardson et al. 91, Stokes et al. 90, Paatero & Lund 92
Our approach: We mainly improve upon the stochastic lighting model developed for U.K. households by Richardson et al. by adding the context of U.S. households such as household size, household occupancy, annual lighting consumption in the U.S. for different household sizes, calibration of γ for U.S. households, and the proportion of light bulbs in U.S. households and their power ratings. The probability of a switch-on event is modeled from Paatero & Lund 92 and Richardson et al. 91. The duration of a switch-on event is taken from Stokes et al. 90. Power ratings for different categories of lighting units in the U.S. are obtained from a study conducted by the Bonneville Power Administration 93. The proportion of lighting units in U.S. households and annual lighting consumption by household size are derived from the RECS survey. Irradiance data for the U.S. is obtained from NREL.

End-use: refr
Relevant models: -
Our approach: A linear regression model is developed to predict daily refrigerator usage for a household based on outside temperature and climate zones.

End-use: misc, act
Relevant models: Subbiah et al. 80, Thorve et al. 54, Tsuji et al. 79
Our approach: All three referenced models have inspired the design of activity models involving the use of appliances. The actual activity occurrence is obtained from the individual/household occupancy schedule. Duration and power usage distributions of appliances are modeled from NIST datasets 66,67,96 and other datasets 97-100. The start time is chosen randomly within the duration reported by ATUS individuals, and the power ratings and duration of the activity/appliance are selected from the above-mentioned distributions.

The activity model simulates appliance usage based on activity indicators provided by ATUS when the occupant is present in the house. It considers the presence of appliances in each household (from the matching RECS household) and the time use diaries of adults in the synthetic population; the frequency of occurrence of appliance usage such as dishwasher and laundry, and of activities such as cooking, is taken from the RECS household. The activity model focuses on activities performed by an individual when at home. Similar to lighting, activities such as cooking, vacuuming, and leisure activities such as watching TV are shared by household members. A procedure is outlined below for generating the household-level activity sequence $ActSeq_i$. Let M be the number of adult members in the synthetic household. Then each household member $P_{i,j}$ has an activity sequence $ActSeq_{i,j}$. The goal is to find one household-level activity sequence $ActSeq_i$ composed of n activities (individual + shared appliance usage related activities) such that the sequence satisfies the following constraints:
1. Each activity is performed when at least one occupant is home.
2. The limit on repeated usage is respected for each activity type.
3. The presence of the appliance is considered for activities such as dishwasher and laundry.
Once the above constraints are satisfied, a start time is randomly selected for each activity from the activity duration reported by ATUS. The actual duration and power ratings for appliances used in different activities are chosen from Table 6.
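The three constraints and the random start-time selection can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: the diary format, the function name, and the rule that identical windows reported by several members count as one shared activity are assumptions, and only the daily limits for cooking (three) and vacuuming (one) come from the text.

```python
import random

# Daily limits: "cook" and "vacuum" are from the text, the rest are placeholders.
MAX_DAILY = {"cook": 3, "vacuum": 1, "dishwasher": 1, "laundry": 1, "tv": 2}
NEEDS_APPLIANCE = {"dishwasher", "laundry"}

def household_sequence(member_diaries, occupied_hours, owned_appliances, rng):
    """Merge member diaries into one household activity sequence.

    member_diaries: one list per adult of (activity, start_hour, end_hour),
    with the window [start_hour, end_hour) taken from a time-use diary.
    Returns a sorted list of (start_hour, activity).
    """
    counts, merged, sequence = {}, set(), []
    for diary in member_diaries:
        for activity, start, end in diary:
            if activity in NEEDS_APPLIANCE and activity not in owned_appliances:
                continue  # constraint 3: appliance must be present
            if (activity, start, end) in merged:
                continue  # same window reported twice: treated as shared
            hours = [h for h in range(start, end) if h in occupied_hours]
            if not hours:
                continue  # constraint 1: someone must be home
            if counts.get(activity, 0) >= MAX_DAILY.get(activity, 1):
                continue  # constraint 2: daily limit reached
            merged.add((activity, start, end))
            counts[activity] = counts.get(activity, 0) + 1
            # Start time chosen at random within the reported window.
            sequence.append((rng.choice(hours), activity))
    return sorted(sequence)

diaries = [
    [("cook", 7, 8), ("tv", 19, 21), ("dishwasher", 20, 21)],
    [("cook", 7, 8), ("cook", 12, 13), ("cook", 18, 19), ("cook", 21, 22), ("vacuum", 3, 4)],
]
occupied = set(range(6, 9)) | set(range(12, 14)) | set(range(17, 23))
seq = household_sequence(diaries, occupied, {"laundry"}, random.Random(7))
```

Durations and power ratings would then be drawn per activity from the distributions in Table 6, which this sketch leaves out.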

Data Records
The dataset for the entire year of 2014 for U.S. households is publicly available for download from the net.science repository through University of Virginia Dataverse 101. The dataset is available in the form of csv files. It is organized in folders according to date and state. Figure 4 shows the hierarchy of data organization and file name templates. Each file corresponds to a U.S. county identifier and date. A county identifier is a FIPS code. FIPS codes are numbers which uniquely identify geographic areas by the U.S. census. A record in the file corresponds to a synthetic household. The record includes synthetic household metadata and energy data for that particular date. All energy-related data is in kWh. All the energy data is timestamped by local timezones in the country. A data header codebook is also included in the downloads. Note that this work was reviewed by the University of Virginia's Institutional Review Board (IRB) and was determined to be exempt from board IRB approval, as this research project did not involve human subject research.

Technical Validation
Three studies are presented for validating the synthetic energy profiles. The first study quantifies the similarity between the real and synthetic energy use probability distributions using the Jensen-Shannon and Hellinger distances. Comparisons are performed by end-use for real and synthetic data in all representative locations of the U.S. Strong similarities are observed for appliance use distributions between real and synthetic data as well as across spatial locations. TCL loads show differences in distributions across locations. The second study examines variations in the 24-hour energy use timeseries in real and synthetic data in all representative locations in the U.S. We uncover unique energy use patterns in the real and synthetic datasets and study similarities in patterns using unsupervised learning. We introduce two metrics in the process: coverage and closeness. The synthetic data has patterns similar to those of the real data. The last study focuses on observing trends in the synthetic energy use in different representative locations in the U.S. We notice that the synthetic data is able to incorporate the effects of a mixture of variables such as weather, irradiance, building attributes, and demographic characteristics on household-level energy usage. The study is a quick demonstration of energy use variability at multiple spatio-temporal levels in different end-uses.
The remaining V&V section is outlined as follows. First, we describe challenges in validating a large synthetic dataset for energy use. Then, we highlight the temporal and spatial resolutions of the data that are considered in the validation experiments. Next, ground truth datasets (real/recorded/actual data) used for evaluation are briefly described. This is followed by a description of the experimental setup and results.
Validating the quality of large-scale synthetic timeseries data for a sizeable region such as the U.S. is challenging, owing to the vast extent, diversity, and contrasting climates in the country. One of the challenges of validating an energy consumption timeseries at household level is the large variety and variability of the load patterns within and between households. In addition to external elements such as weather and building characteristics, consumer lifestyles and affordances play a vital role in shaping the demand, such as a curve with a morning peak, or a curve with a small afternoon peak and a sharp evening peak. This leads to a big spectrum of variations and patterns in energy use. Thus, an in-depth comparative analysis of synthetic data against actual data is required. However, it is conditioned on the availability of a reasonable amount of representative real data. Here, we employ real/recorded data such as load research data, end-use metering data, and smart meter data from ten locations in the country that are representative of the U.S. climate zones (Table 7). The availability of public smart meter data in the U.S. is limited, which may cause a potential skew towards the selected sample of households and may not be spatially representative. Thus, framing our understanding of validation in this context is important.
We address the quality of the synthetic energy consumption data on two intrinsic qualities of energy use data: magnitude (usage over 24 hours) and load shape (pattern of consumption). Magnitude and load shape can be examined across the temporal and spatial dimensions. The climate zones 102 in the contiguous United States are as follows: (i) marine, (ii) hot-dry/mixed-dry, (iii) hot-humid, (iv) mixed-humid, and (v) cold/very-cold. Comparisons are then performed at household and city/county resolutions.
• Temporal representativeness and resolutions. Temporal representativeness is studied by observing similarities between real and synthetic hourly demand profiles. Furthermore, daily and seasonal energy usage is studied for different locations.
• Dis-aggregate energy use. Note that we publish dis-aggregated energy use data at household level. Thus, a finer level of evaluation such as an energy use sub-type (e.g., HVAC, cooking, etc.) is possible at various temporal and spatial levels.
All the real datasets used in the V&V process are listed in Table 7. Recorded datasets are obtained from Pecan Street Dataport 27, Northwest Energy Efficiency Alliance (NEEA) 103, and National Rural Electric Cooperative Association (NRECA). The Los Alamos dataset is obtained from a public data sharing repository, Dryad 104. Unfortunately, we do not have any metadata about households (e.g., household size, dwelling type, etc.) in these datasets. The datasets only have energy use timeseries. Three studies are presented to cover the temporal, spatial, and dis-aggregate nature of the synthetic time-series:
I. Comparing real and synthetic end-use energy usage (magnitude)
II. Comparing real and synthetic energy use patterns (shape/structure)
III. Observing differences and similarities in synthetic energy use data in spatially representative locations

I. Comparing real and synthetic end-use energy usage (magnitude)
In this experiment, distributions of synthetic and real daily end-use data are compared using statistical metrics. One way of comparing these distributions is by measuring the distance between the real and synthetic end-use distributions. Many metrics can be used to perform this task (e.g., Kullback-Leibler divergence (KL), the Hellinger distance, total variation distance (TVD), the Wasserstein metric, the Jensen-Shannon divergence (JS), and the Kolmogorov-Smirnov statistic (KS)). Klemenjak et al. 20 use JS distance and Hellinger distance as examples to compare distributions of appliance energy use between different datasets. A similar method is implemented in this section using the JS distance and the Hellinger distance metric. In our case, computing the distances between daily end-use distributions allows us to perform regional comparisons as well as comparisons between real and synthetic datasets. The Jensen-Shannon distance is the square root of the Jensen-Shannon divergence 105. This metric ranges over [0, 1], where 0 implies the distributions are similar. We prefer JS divergence over KL divergence since it is a symmetric measure. If P and Q are two probability vectors, then the JS distance JS(P, Q) is given by

$$JS(P, Q) = \sqrt{\frac{KL(P \parallel M) + KL(Q \parallel M)}{2}} \qquad (5)$$

where M is the pointwise mean of P and Q and KL is the Kullback-Leibler divergence. To supplement our study, we use Hellinger distance as a second metric to quantify the similarity between two probability distributions. Hellinger distance is also a symmetric measure. Its range of values is [0, 1], with 0 encoding that the distributions are similar. The Hellinger distance of two probability vectors P and Q is denoted by H(P, Q) and defined as

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2} \qquad (6)$$

where k is the length of the vectors, and $p_i$, $q_i$ are the i-th elements of the vectors P and Q, respectively. Daily end-use energy usage at household level is compared in the real and synthetic data for every location specified in Table 7.
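Both distances are short to compute. The sketch below uses the standard definitions (JS distance as the square root of the JS divergence with base-2 logarithms, which bounds it by 1, and Hellinger distance with the 1/√2 normalization); the shared-bin `histogram` helper and the sample values are assumptions about how daily usage could be turned into probability vectors, not a description of the authors' binning.

```python
import math

def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return counts

def _normalize(v):
    total = sum(v)
    return [x / total for x in v]

def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    p, q = _normalize(p), _normalize(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]      # pointwise mean M
    return math.sqrt(max(0.0, 0.5 * _kl(p, m) + 0.5 * _kl(q, m)))

def hellinger(p, q):
    p, q = _normalize(p), _normalize(q)
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

# Hypothetical daily cooking usage (kWh) for a real and a synthetic sample.
real_kwh = [0.5, 1.5, 1.6, 4.0]
synthetic_kwh = [0.4, 1.2, 2.5, 3.9]
P = histogram(real_kwh, 0, 4, 4)
Q = histogram(synthetic_kwh, 0, 4, 4)
d_js, d_h = js_distance(P, Q), hellinger(P, Q)
```

Using the same bin edges for both samples matters: the two vectors must describe the same support for either distance to be meaningful.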
Vectors P and Q denote values in a single end-use for two datasets. Tables 6(a)(b)(c) list JS distances and Tables 6(d)(e)(f) list Hellinger distances for selected end-uses (HVAC, refrigerator, cooking appliances). Each matrix represents distances between two energy usage distributions for an end-use. The row and column headers represent different data sources and different regions, and each cell represents the probability distribution similarity/distance value in the form of a heatmap, where the bar shows the range of the values on a continuous scale.
The JS and Hellinger distance tables for end-uses show strong similarities (the distance is close to zero). Furthermore, within each matrix three types of comparisons are performed. We compute the similarity between end-use distributions for different regions within synthetic data, different regions within real data, and different regions in different data sources (namely real and synthetic data). For appliance usage (e.g., cooking), the distributions are quite similar across regions and data sources. This supports findings from Figure 11 that there exist significant similarities between different regions for synthetic daily energy consumption of different appliances. For the HVAC end-use, it is observed that the distributions grow apart between regions for both synthetic and real data sources. This is particularly true due to the strong association of HVAC with outdoor/environment temperature conditions and the time span for which these temperature conditions prevail (e.g., warmer temperatures are observed for a longer time in Texas (TX)).

II. Comparing energy use patterns (load shape/structural similarity)
In this section, the synthetic energy use timeseries are evaluated using the concepts of diversity, coverage, and closeness. The diversity in energy use patterns is captured by segmenting the normalized timeseries $e_0, \ldots, e_{23}$ using unsupervised learning techniques such as clustering. This is followed by studying coverage in terms of what percentage of the synthetic timeseries population is represented in the real timeseries population and vice versa. Thus, coverage is used to measure diversity. However, learning only coverage is not sufficient. It is necessary to measure the accuracy of the matches found. Hence, we introduce the closeness metric. It studies how close (e.g., dist(i, j)) the synthetic and real data points are.
Let R and S be the sets of load shapes of real and synthetic energy use timeseries. Let $K_R$ be the number of unique load shapes (segments/patterns/clusters) found in set R. Then, we define coverage(S) as the ratio

$$coverage(S) = \frac{\text{Number of unique shapes in } R \text{ that contain at least one data point from } S}{\text{Number of unique shapes in } R\ (K_R)}$$

Thus, coverage(S) reflects the degree to which samples from set S cover the patterns in set R. Similarly, if $K_S$ is the number of unique segments in set S, then coverage(R) reflects the percentage of unique patterns in set S covered by data points in set R. Coverage is bounded between 0 and 1. Figure 13b shows coverage(S) and coverage(R) as K varies.
To measure closeness we calculate the distance of each individual timeseries to its respective cluster center/representative. If $K_R$ is the number of clusters in set R, then the closeness(S, R) of set S to set R is measured by comparing the distributions of distances of individual timeseries i ∈ R and j ∈ S in each cluster c ∈ $K_R$ to the respective center/representative timeseries of the cluster. Figure 13b illustrates the schematic of building the distance distributions. Let $P_R$ and $P_S$ denote the probability vectors of distances of sets R and S, respectively. To measure the degree of closeness, we compare the two probability distributions using the Hellinger distance $H(P_R, P_S)$ (Equation 6). If distributions $P_R$ and $P_S$ are similar, then we say that set S is close to set R.
Closeness is bounded between 0 and 1, where 0 implies that the two sets are close. Note that closeness is not a symmetric metric, i.e., closeness(S, R) ≠ closeness(R, S). Figure 13b describes the variation in the similarity score of the probability distributions with different numbers of segments K. Now, we briefly describe the experimental setup. Two cases are considered to examine coverage, closeness, and robustness of cluster groupings (k). For each case the energy use timeseries is normalized, resulting in a load shape $e_0, \ldots, e_{23}$. We choose normalization by total consumption (Equation 9) in order to consider pronounced effects of peak load in the profile. Household preferences or lifestyles can typically be captured by one or more load shapes 106, hence we choose this representation for uncovering patterns in the data. Thus, every i ∈ R and j ∈ S are normalized energy use vectors of length 24.

Figure caption (coverage and closeness as k varies): Orange denotes findings of Case 1, where we cluster the real data set R and assign a cluster label to the synthetic data set S. Blue denotes findings of Case 2, where we cluster the synthetic data set S and assign a cluster label to the real data set R. (a) illustrates 100% coverage in both cases even as k varies. This means that, in each case, at least one data point belongs to every cluster for a given k. (b) shows the closeness between the two distance vectors: the distance of real data points in a cluster to its respective centroid and the distance of synthetic data points in a cluster to its respective centroid. Closeness is given by the Hellinger distance, for which a value of 0 signifies that the two distributions are similar. The value of the distances is close to 0 for all values of k in both cases. However, an upward trend is observed as k increases. Overall we see the robustness of results w.r.t. k.
In the first case (Case 1), we generate $K_R$ patterns from set R by clustering the real normalized energy use vectors using the k-means clustering algorithm with Euclidean distance. This is followed by assigning a cluster label $k \in K_R$ to each synthetic energy use timeseries $j \in S$. Let $c_k$ be the center/representative vector of group k. Then, $j \in S$ is assigned to the cluster whose center is at minimum distance from j, given by $\min_{k \in K_R} dist(j, c_k)$. Then, we calculate the coverage of synthetic data, coverage(S), and the closeness of synthetic data to real data among all clusters, closeness(S, R). In Case 2, we generate $K_S$ clusters from set S (synthetic data) by segmenting the normalized energy use vectors using the k-means clustering algorithm with Euclidean distance. This is followed by assigning a cluster label $k \in K_S$ to each real energy use timeseries $i \in R$. Each i is assigned to the cluster whose center is at minimum distance from i, given by $\min_{k \in K_S} dist(i, c_k)$. Then, we calculate the coverage of real data in synthetic groups, coverage(R), and the closeness of real data and synthetic data among all synthetic clusters, closeness(R, S).
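A compact sketch of the two metrics is given below, assuming the cluster centers have already been obtained (for example from k-means on the real data for Case 1). The function names, the pooling of distances across clusters, and the ten-bin histogram are assumptions made for illustration; the text does not specify how the distance distributions are binned.

```python
import math

def normalize_by_total(profile):
    """Load shape: each hourly value divided by the day's total consumption."""
    total = sum(profile)
    return [x / total for x in profile] if total > 0 else list(profile)

def nearest(point, centers):
    """Return (index, Euclidean distance) of the closest cluster center."""
    dists = [math.dist(point, c) for c in centers]
    k = min(range(len(centers)), key=dists.__getitem__)
    return k, dists[k]

def coverage(points, centers):
    """Fraction of clusters that receive at least one of the given points."""
    hit = {nearest(p, centers)[0] for p in points}
    return len(hit) / len(centers)

def _distance_distribution(dists, hi, bins):
    counts = [0] * bins
    for d in dists:
        counts[min(bins - 1, int(d / hi * bins))] += 1
    return [c / len(dists) for c in counts]

def closeness(points_a, points_b, centers, bins=10):
    """Hellinger distance between the two distance-to-center distributions."""
    d_a = [nearest(p, centers)[1] for p in points_a]
    d_b = [nearest(p, centers)[1] for p in points_b]
    hi = max(d_a + d_b) or 1.0               # shared bin range for both sets
    p_a = _distance_distribution(d_a, hi, bins)
    p_b = _distance_distribution(d_b, hi, bins)
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p_a, p_b))
    return math.sqrt(s) / math.sqrt(2)

# Toy two-dimensional "load shapes" keep the example short; real ones have length 24.
centers_from_real = [[1.0, 0.0], [0.0, 1.0]]
real = [[0.9, 0.1], [0.1, 0.9]]
synthetic = [[0.8, 0.2], [0.7, 0.3]]
cov_s = coverage(synthetic, centers_from_real)
close_s_r = closeness(synthetic, real, centers_from_real)
```

The asymmetry noted in the text shows up here through the `centers` argument: Case 1 passes centers learned from R, Case 2 passes centers learned from S, and the two generally give different values.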
Results of both cases are summarized in Figure 8. A 100% coverage is observed in both cases for different values of k. Observations for the closeness metric are interesting. The Hellinger distance is close to zero in all the scenarios; however, there is a slight uptick in the value as k increases. We inspect this further in Figure 7. Figure 7 shows histograms of distances of real data points and synthetic data points from their assigned cluster center. In Case 1, the distribution of distances of synthetic data points is slightly broader than the distribution of distances of real data points for all k. Thus, we see a nonzero distance for closeness(R, S) in Figure 8(c). As k increases, it is observed that some individual clusters have a broad and/or bimodal distance distribution, indicating that there are data points that are very close to the cluster center while a few are far away. This difference becomes more apparent as the number of clusters increases.
The goal of this V&V exercise was to verify whether the diversity and trends of the real energy use profiles are replicated in the synthetic energy use profiles. Due to a biased and skewed sample of the real energy use data, it is challenging to perform validation of synthetic data. Some of the characteristics of the real datasets that hinder the use of existing evaluation metrics as is are mentioned below. No supporting information about the real households is available (e.g., household size, dwelling type, square footage, indoor thermostat setting). We have shown that all of these factors are extremely important in the generation of household demand at a given time. Some of the households in the real data may also be participants in demand-response programs, resulting in unique load shapes due to shifting demand/reducing peak demand that may not be found in households not participating in DR programs (e.g., synthetic data). The real datasets are collected for different years for each region. The data are incomplete for some regions (e.g., San Diego samples do not have lighting data). The sample size (number of households) is highly skewed. It varies from 9 households in Montana to 56000 households for Horry, SC. Thus, it is important to note that |R| << |S| (e.g., the number of households simulated in our framework for Washington state is far greater than the 78 households in the real data for Washington state). All of these observations are summarized in Table 7.

III. Observing differences and similarities in synthetic energy use data in spatially representative locations
This empirical study uses only the synthetic data to conduct a comparative regional analysis to examine similarities and dissimilarities between energy use for different end-uses. We observe the spatio-temporal patterns and variations in different end-uses with respect to environmental elements such as irradiance and temperature as well as demographic and structural characteristics of the households. The selected target locations are spatially representative of different climate zones of the U.S.: Arlington, VA; Cook County, IL; Houston County, TX; Maricopa County, AZ; King County, WA. The composition of electric consumption by end-uses is shown in the form of pie diagrams in Figure 9. EIA reports the shares of the major end-uses as follows: DHW 17-32%, lighting 5-10%, refrigerator 3-5%, activities/appliances 20-26%, space heating 25-47%, and air conditioning 5-10%. In general, the percentages of major end-use categories lie in ranges similar to those reported by EIA. HVAC has a dominant share in the energy consumption in households as compared to usage of appliances and/or other activities. Electricity usage for heating water is the lowest during summer months for all locations (Figure 10c). In particular, regions from hot-humid and hot-dry climate zones consume the least amount of energy. This observation stems from the relation between $E^{h2o,v}$ and $T^{cold}_{m,z}$ described in Equation 3. The water inlet temperature ($T^{cold}_{m,z}$) differs across temporal as well as spatial scales and is dependent on outside environment temperatures 68 (Details in Appendix). Figure 13 shows plots describing the relation between household size and the number of gallons of hot water consumed and the energy required to heat water. Note that we consider only electric water heaters in this work.
Figure 10(a) shows that the HVAC consumption varies significantly throughout the year.HVAC use is higher in hot-dry areas in summer as compared to other regions possibly due to higher temperatures.Structural characteristics such as dwelling size (square footage), insulation quality, age and efficiency of HVAC equipment also affect household HVAC consumption.Another important variable that drives HVAC consumption is indoor thermostat behavior which is related to household occupants' behavior/actions.In this work, indoor thermostat temperatures are set constant throughout the day.Insulation quality is not monitored in households (due to lack of data).We assume that the dwelling is well-insulated and the insulation values are implemented according to the DOE standards for the respective climate zones.In Figure 12a we show effect of square footage (conditioned space) of a dwelling on hvac energy use.In general, we observe that as the conditioned space in the dwelling increases, the HVAC consumption increases.
Lighting energy use varies by season in all regions as irradiance levels change with weather events and seasons. Figure 14b shows the average irradiance time series for the target locations. The corresponding lighting usage is shown in Figure 14a. As an example, we look at monthly irradiance profiles across 24 hours in Virginia for the year 2014 (Figure 14d). The corresponding monthly lighting energy use time series is shown in Figure 14c. An example of lighting consumption with respect to household size is explored in Figure 12b.
Figure 11 shows the breakdown of appliance usage for different appliances and electronic devices. Both figures show a line chart indicating average daily consumption for the month. The scatter plot in the background describes average daily consumption for an end-use for sampled days, color coded by location, where the size of the markers denotes the standard deviation of the end-use consumption. We observe that appliance usage for activities such as cooking, dishwashing, laundry, watching TV, using a computer, and cleaning is fairly similar across regions. This is intuitive, since appliance use durations and ratings may not vary across regions. However, the timing of use throughout the day may vary from house to house depending on occupant schedules, irrespective of geographic region.

Usage Notes
To analyze the dataset, researchers can use any programming language such as Python, Java, Matlab, or R. As described in the 'Data Records' section, the files are stored in csv format, so the standard file reading functions in these languages can read the dataset. Next, we discuss the potential applications of the released synthetic data. We also highlight important challenges and limitations of this work.
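As a minimal sketch of reading a record file with Python's standard library, the snippet below parses a small in-memory csv sample and sums hourly energy use per household and end-use. The column names (`hid`, `hour`, `hvac_kwh`, `dhw_kwh`, `lighting_kwh`) and all values are hypothetical placeholders, not the released schema; the actual field names are given in the 'Data Records' section.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; the real files' columns and values may differ.
SAMPLE = """hid,hour,hvac_kwh,dhw_kwh,lighting_kwh
1,0,0.50,0.10,0.02
1,1,0.40,0.00,0.01
2,0,0.80,0.20,0.05
"""

def daily_totals(handle):
    """Sum every column ending in '_kwh' for each household id."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in csv.DictReader(handle):
        for name, value in row.items():
            if name.endswith("_kwh"):
                totals[row["hid"]][name] += float(value)
    return {hid: dict(cols) for hid, cols in totals.items()}

totals = daily_totals(io.StringIO(SAMPLE))
print(totals)
```

For a file on disk, the same function can be applied to a handle obtained with `open(path, newline="")`.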

Applicability and benefits of the dataset.
We are releasing a comprehensive household-level dataset for energy use. In addition to the household-level disaggregated energy use data, household composition derived from census data is also included. This work was reviewed by the University of Virginia's Institutional Review Board (IRB) and was determined to be exempt from IRB approval, as this research project did not involve human subject research. The dataset can be employed in applications such as NILM (non-intrusive load monitoring), load profile analyses of similarities and differences in end-use consumption across regions and seasons, evaluation of the effects of building retrofits, and studies of the effects of temperature rise in different regions. In addition, the data can be used for energy model calibration, occupant behavior evaluation, and the implementation of demand response strategies and policy interventions. The dataset is especially useful for training deep learning models, which benefit from massive amounts of data. Such models can be used for real-time residential demand forecasting. The released data are essentially time series accompanied by categorical and numerical attributes. Thus, any statistical tool or programming language can be used to analyze them. Study III in the 'Technical Validation' section illustrates examples of possible uses of the dataset.

Challenges and limitations.
The use of synthetic residential energy demand data has its pros and cons. National-scale hourly synthetic data can be used to carry out national and potentially even international policy analysis. The spatio-temporal variability allows one to assess important emerging questions related to energy equity, fairness, and accessibility at a fine scale. A systems-level approach can be taken to the vexing questions outlined in the 2030 Intergovernmental Panel on Climate Change (IPCC) goals. On the other hand, synthetic data sets have limitations as well. For instance, the fine-scale variability (minute-level as well as weekly variation) of usage among households cannot be captured easily in such synthetic data sets. Additionally, the behavior exhibited by any single synthetic family might be biased by the data used for synthesis. Thus, any insight generated from high-resolution analyses should be considered carefully.
An important challenge in developing realistic synthetic residential load profiles at a national scale and at a high spatio-temporal resolution is finding appropriate datasets that represent different types of climates, demographics, appliances, and activity patterns. Accessibility and availability of this information from legitimate sources is crucial to maintaining trust in the resulting models. A robust and extensible infrastructure is developed to synthesize diverse data sources into a detailed information structure at various spatial resolutions (e.g. combining household-level data with climate-zone-related data such as insulation values). The infrastructure consists of methods to compose multiple models and data sets. The overall time to generate the synthetic data was reduced by using high performance computing capabilities.
Some limitations of our work are discussed next. The current synthetic data does not include power consumption by electric vehicles or energy generation from renewable sources (e.g. solar panels, wind). The ATUS data is available for a normative day for individuals. Thus, activity- and appliance-related demands are generated for a normative day, with minor variations coming from the activity model. Hence, our synthetic data might not capture daily activity variation appropriately (e.g. as observed in real-time smart metering). This can be challenging to work with, especially when studying demand response scenarios. The building envelope considered for a synthetic household is simplified due to the lack of information needed to represent a large population group, which limits our ability to employ state-of-the-art and sophisticated building modeling techniques (e.g. we use a simple physics-based HVAC model to generate heating- and cooling-related energy demand).

Concluding remarks.
The paper describes a bottom-up approach to generate large-scale digital twin data of disaggregated hourly residential energy use time series at household resolution for millions of households across the contiguous United States. The approach integrates diverse open-source surveys and datasets, where the end-use models are developed by either extending well-established methods or building new models. Extensive validation of the synthetic datasets is conducted using real/recorded energy-use data across spatial and temporal resolutions.

Figure 1. Data overview. This figure shows examples of the spatio-temporal resolutions of multiple facets of the disaggregated synthetic energy demand data. The figure shows sample data at state, county, and household level at different temporal granularities. The data is generated for all households in the U.S.

Figure 2. Overview of the energy modeling infrastructure. Many different types of input data are used in the proposed modeling framework. These are shown at the top. For a complete description of the input datasets, refer to Table 2. These datasets are input to different modeling components of the framework. Some datasets support augmentation of the synthetic population while others are input to the energy-use models. All the models are described in the Methodology section. The bottom rectangle describes the recorded data/smart meter data from different climate zones of the U.S. These datasets are used for validation of the synthetic energy-use time series. The validation block (yellow backdrop) describes three components of V&V: regional, magnitude, and structural/shape comparisons. This line of validation covers (a) different temporal aspects (hourly and daily), (b) spatial aspects in terms of regions and seasons, and (c) the diversity aspect of the large-scale synthetic data. The blue text refers to the V's of big data. Each colored block possesses the given V characteristic.

Figure 3. Impurity-based feature importance and correlation. Each plot shows the Gini importance of features for two dependent variables: home and work. The x-axis shows the independent variables in order of importance based on IncNodePurity. Eight combinations of the parameters 'ntree' (number of decision trees) and 'node size' (minimum size of terminal nodes) are tested: ntree = 500, 1000, 1500, and 2000; node size = 5 and 10. The plots show robust results across the different conditions. According to the plots, the following five independent variables mostly affect all the dependent variables: wrkhrs, worker, age, hinc3, and hsize. The right-hand y-axis shows the absolute Pearson correlation coefficient. Positive and negative coefficients are distinguished by blue dots and squares, respectively. Except for wrkhrs and worker, all independent variables are weakly correlated with the dependent variables.

Figure 4. Data organization. The dataset is available in the form of csv files. The files are organized by dates (temporal) and states (spatial). The blue text indicates the type (e.g. folder, file, record). The text within angle brackets denotes nomenclature templates of folders and files. A record csv file contains energy use data and metadata for a synthetic household in the SPEW population. There will be one file per county and date. One day generates several GBs of data.

Figure 6. Left column: Jensen-Shannon distance matrices. Right column: Hellinger distance matrices. The two columns show the Jensen-Shannon and Hellinger distance matrices between end-use probability distributions. Each matrix represents distances between pairs of energy usage distributions for a particular end-use (e.g. HVAC, refrigerator, cooking). The row and column headers of the matrix represent different data sources and regions, and each cell shows the similarity/distance between the probability distributions as a heatmap, where the color bar shows the range of values on a continuous scale.
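The two distance measures named in this caption can be sketched as follows for discrete (histogram) distributions. This is an illustrative implementation under stated assumptions, not the authors' code: the base-2 logarithm, the helper names, and the example histograms are hypothetical choices, and the binning used in the paper is not reproduced here.

```python
import math

def normalize(hist):
    """Turn non-negative bin counts into a probability distribution."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

def _kl(p, q):
    """Kullback-Leibler divergence with base-2 logs; terms with p=0 are skipped."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon distance (square root of the divergence), in [0, 1]."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * _kl(p, m) + 0.5 * _kl(q, m)))

# Hypothetical hourly-usage histograms for one end-use from two sources.
real = normalize([5, 20, 40, 25, 10])
synthetic = normalize([8, 22, 35, 25, 10])
print(hellinger(real, synthetic), jensen_shannon(real, synthetic))
```

Both functions return 0 for identical distributions and 1 for distributions with disjoint support, so each cell of a distance matrix can be read on a common 0-1 scale.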

Figure 7. Example of closeness in different cases with varying k. The figures show the distances of data points from sets R and S to their respective cluster centers. (a) shows histograms of distances for different k. The plot on the left is for real data points and the plot on the right is for synthetic data points. We then calculate closeness(R, S) using the Hellinger distance (corresponds to the blue line in Figure 8(c)). For k = 5, a bimodal pattern is observed in the distances for synthetic data points, which tends to diminish as the number of clusters k increases. (b) shows histograms of distances for different k for case 2. The plot on the left is for synthetic data points and the plot on the right is for real data points. closeness(S, R) is calculated using the Hellinger distance (corresponds to the orange line in Figure 8(c)).

Figure 8. Summary of the two case scenarios. Orange denotes the findings of case 1, where we cluster the real data set R and assign a cluster label to each point of the synthetic data set S. Blue denotes the findings of case 2, where we cluster the synthetic data set S and assign a cluster label to each point of the real data set R. (a) illustrates 100% coverage in both cases even as k varies. This means that, in each case, at least one data point belongs to every cluster for a given k. (b) shows the closeness between the two distance vectors: the distances of real data points in a cluster to its centroid and the distances of synthetic data points in a cluster to its centroid. Closeness is given by the Hellinger distance, for which a value close to 0 signifies that the two distributions are similar. The distances are close to 0 for all values of k in both cases. However, an upward trend is observed as k increases. Overall, we see the robustness of the results with respect to k.
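The coverage and closeness measures described in the captions of Figures 7 and 8 can be sketched as below. This is a simplified reading of the captions, not the authors' implementation: the plain k-means routine, the shared 10-bin histogram of distances, and all function names and sample points are assumptions made for illustration.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; an empty cluster keeps its previous center."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def histogram(values, hi, bins):
    """Relative frequencies of values over `bins` equal bins on [0, hi]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v / hi * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]

def hellinger(p, q):
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

def coverage_and_closeness(reference, other, k, bins=10):
    """Cluster `reference`, label `other` by nearest centroid, compare distances."""
    centers = kmeans(reference, k)
    ref_d = [dist(p, centers[nearest(p, centers)]) for p in reference]
    labels = [nearest(p, centers) for p in other]
    oth_d = [dist(p, centers[j]) for p, j in zip(other, labels)]
    coverage = len(set(labels)) / k          # share of clusters that receive a point
    hi = max(ref_d + oth_d) or 1.0           # common histogram range
    closeness = hellinger(histogram(ref_d, hi, bins), histogram(oth_d, hi, bins))
    return coverage, closeness

# Hypothetical 2-D points standing in for real (R) and synthetic (S) load profiles.
R = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
S = [(0.2, 0.1), (0.9, 0.3), (10.4, 10.2), (10.1, 10.8)]
print(coverage_and_closeness(R, S, k=2))   # case 1: cluster R, label S
print(coverage_and_closeness(S, R, k=2))   # case 2: cluster S, label R
```

In this sketch, a coverage of 1.0 means every cluster receives at least one point from the other set, and a closeness near 0 means the two sets lie at similar distances from the shared centroids.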

III. Observing differences and similarities in synthetic energy use data in spatially representative locations
Panels: (a) Arlington, VA; (b) Cook, IL; (c) Houston, TX; (d) Maricopa, AZ; (e) King, WA; (f) Legend

Figure 9. Composition of synthetic electric consumption in the representative target locations. Heating and cooling constitute the majority of residential electric consumption. Refrigerators consume slightly more energy in hotter regions such as Maricopa and Houston. Activities such as dishwashing, laundry, and cooking account for 8-17% of consumption in the different regions. Lighting and water heating have a consistent proportion of consumption across all locations. The proportions are similar to data published by EIA.

Figure 10. Monthly synthetic energy use changes in end-uses such as HVAC, refrigerator, and domestic hot water with respect to outside temperature. The line chart shows average daily consumption over all households in the target regions. The scatter plot in the background describes average daily consumption for an end-use for sampled days, color coded by location. The size of the markers denotes the standard deviation of the end-use consumption. Legend: Arlington, VA (green); Cook County, IL (blue); Houston County, TX (yellow); Maricopa County, AZ (brown); King County, WA (cyan).

Figure 12. (a) Synthetic HVAC use and house area (i.e. floor area). Boxplot comparing daily HVAC consumption on a winter day for the selected target locations by house area. The x-axis groups the floor area of houses into five bins, given in two units: sq. ft (ft²) and sq. m (m²). The bins are as follows: ≤ 1000 ft², 1000-1500 ft², 1500-2000 ft², 2000-3000 ft², and ≥ 3000 ft². HVAC consumption increases with the floor area of the house in all regions. Winter temperatures are relatively moderate in AZ and TX; thus, HVAC consumption there is lower than in other regions. (b) Synthetic lighting use and household size. Lighting consumption increases as household size increases. Household size indicates the number of members in a household.
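A grouping like the one in panel (a) can be reproduced with a short binning helper. The sketch below is illustrative only: the function name, the treatment of interior bin edges (an area exactly on an edge falls in the lower bin), and the sample households are assumptions, since the caption does not specify them.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import median

EDGES_FT2 = [1000, 1500, 2000, 3000]
LABELS = ["<=1000", "1000-1500", "1500-2000", "2000-3000", ">=3000"]

def floor_area_bin(area_ft2):
    """Map a floor area in ft^2 to one of the five bins of Figure 12(a)."""
    if area_ft2 >= 3000:
        return LABELS[-1]
    # Assumption: interior edges (1500, 2000) belong to the lower bin.
    return LABELS[bisect_left(EDGES_FT2, area_ft2)]

# Hypothetical (floor area in ft^2, daily HVAC kWh) pairs.
households = [(850, 12.0), (1200, 15.5), (1400, 17.0), (2500, 24.0), (3200, 31.0)]

by_bin = defaultdict(list)
for area, kwh in households:
    by_bin[floor_area_bin(area)].append(kwh)

medians = {label: median(vals) for label, vals in by_bin.items()}
print(medians)
```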

Figure 13. Synthetic hot water usage and energy vs. synthetic household size. Household size indicates the number of household members. The clustered bar charts show (a) the amount of hot water consumed (in gallons) and (b) the corresponding energy usage, by household size, on a winter day. The vertical black line on each bar shows the variation. Water usage and its variation increase with household size. The amount of energy for the hot water end-use increases with household size and differs by region.

Figure 14. Heatmap depicting the relation between hourly synthetic lighting usage and hourly irradiance. (a) shows average annual 24-hour lighting profiles of the representative target locations. (b) shows average annual 24-hour irradiance profiles of the representative target locations. (c) and (d) present the variation in lighting usage and the corresponding irradiance profiles at the monthly level for Arlington, VA. (c) presents the variation in lighting consumption throughout the day in different months of the year. (d) shows the variation in the monthly irradiance profile. Energy usage is measured in kWh and irradiance in W/m². The lighting energy use is inversely proportional to the irradiance. The energy usage is higher in the evening and night hours when the occupant is active in the dwelling. The average lighting and irradiance profiles show regional differences in irradiance availability and subsequent lighting energy usage. The VA profiles show that in summer daylight is available for longer durations, leading to lower lighting energy consumption than in winter.

Table 4. Hot water model characteristics used to schedule hot water usage events.

Table 6. Modeled activity and appliance usage behaviors

Table 7. Datasets used for validation
• Spatial representativeness and resolutions. Due to limited availability of real data, we define spatial representativeness by choosing at least one location in each climate zone in the U.S. to carry out validation experiments. The major climate zones