Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# DEDDIAG, a domestic electricity demand dataset of individual appliances in Germany

## Abstract

Real-world domestic electricity demand datasets are the key enabler for developing and evaluating machine learning algorithms that facilitate the analysis of demand attribution and usage behavior. Breaking down the electricity demand of domestic households is seen as the key technology for intelligent smart-grid management systems that seek an equilibrium of electricity supply and demand. For the purpose of comparable research, we publish DEDDIAG, a domestic electricity demand dataset of individual appliances in Germany. The dataset contains recordings of 15 homes over a period of up to 3.5 years, wherein total 50 appliances have been recorded at a frequency of 1 Hz. Recorded appliances are of significance for load-shifting purposes such as dishwashers, washing machines and refrigerators. One home also includes three-phase mains readings that can be used for disaggregation tasks. Additionally, DEDDIAG contains manual ground truth event annotations for 14 appliances, that provide precise start and stop timestamps. Such annotations have not been published for any long-term electricity dataset we are aware of.

 Measurement(s) domestic appliance electricity usage • whole-house mains reading • ground truth event annotations Technology Type(s) energy measurement system • Annotation Sample Characteristic - Environment house Sample Characteristic - Location Germany

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14753556

## Background & Summary

For many years electricity consumption has only been monitored for billing purposes, thus only requiring metering devices that provide readings for each billing period. In recent years smart-meters have been installed to a greater extent to provide a basis for more complex pricing models as well as a more in-depth consumption analysis. The European Union has started this transformation as part of the Third Energy Package in the directive 2009/72/EC11 in order to promote energy efficiency by “developing innovative pricing formulas, or introducing intelligent metering systems or smart grid”. The target was to have smart-meters installed in 80% of households by 2020 with the goal to improve customer awareness regarding electricity efficiency and increase stability and reliability of the grid2. While higher resolution information is the basis for more detailed usage analysis, the raw electricity consumption data will not provide any meaningful benefit as it lags a required abstraction as to what causes the electricity consumption, and, even more important, what could be done to alter the consumption. Enabling meaningful retrospective insights as well as prospective suggestions on consumer electricity consumption, mechanisms for information retrieval, behavior analysis, and forecasting have to be developed. The biggest enabler for research are datasets, and even more significant are publicly available datasets, since only then different research teams can evaluate publications and further develop established techniques. This is especially critical as the gathering of high resolution electricity consumption data is accompanied by many ethical questions, especially if not implemented as opt-in3.

Kolter and Johnson4 published the first dataset for electricity disaggregation. At the time they argued, that “although there are vast amounts of data relevant to energy domains the majority of this data is unavailable to researchers”. Furthermore, the authors argue that many other domains have greatly benefited from public benchmark data sets such as MNIST5 for handwritten digit recognition or PASCAL6 for visual object category recognition and detection.

Since then, more datasets have been released with different target electricity applications in mind4,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21. A review on electricity datasets22 highlights the diversity of the datasets as they differ in location, duration, frequency, file format and many more aspects. The authors demand two primary objectives for the collection and provisioning of datasets, for (1) interoperability and (2) comparability. This would help to decrease heterogeneity, thus increase comparability of publications. An aspect highlighted by the authors is missing meta data describing the circumstances of the data recording. We found this to be especially relevant in order to be able to find explanations for behavior and behavior changes.

The DEDDIAG (domestic electricity demand dataset of individual appliances in Germany) dataset was mainly recorded in order to provide the basis for automated domestic load-shift applications where, e. g., a Real-Time-Price is used as an incentive. Hence, the dataset contains appliances that have the potential for automated load-shifting such as fridges, freezers, dishwashers, dryers and washing machines. According to the German Federal Environment Agency, other applications, outside of space heating and cooling, such as washing machines, dryers, stoves, refrigerators and similar, account for 52.6% of a household’s electricity demand in 201823. It is uncommon to use electricity based space heating, and therefore space heating accounts only for 5.8%; space cooling systems are in general uncommon and only account for 1.0%24. In some cases additional appliances that may not contribute to load-shifting were recorded as they give insight into habits. We find it especially important to provide annotations for appliance usage in the form of Start-End annotations that attribute a certain time-span to a label. For instance, on a washing machine this is the cycle start and end combined with the program used; for a fridge this highlights the compressor cycles as well as the light that indicates an open fridge door. The latter is especially interesting as it gives a clear indication of an occupant being present.

We therefore not only publish the raw data, but also manual annotations, demographic information and model names for some appliances to provide a basis for research of device identification, behavior analysis and load-shift recommendation systems.

## Methods

In the following, an electricity data collection system for single appliances as well as whole-house level is described. The general requirements were derived from a 2 year load-shift monitoring project, where homes were monitored. The goal of the project was to record real-world behavior to understand load-shift potential, and at a later stage provide the homes with RTP (Real-Time-Pricing) incentives for load-shifting. As part of the project, a statistic based appliance usage prediction algorithm was developed to reduce the complexity of RTP behavior recommendations systems25. Further, an event segmentation algorithm for appliances identification in raw electricity measurements was developed26. The complete data collection software is available for download (see Sect. Code Availability for details).

### Requirements

The collection system was built with a focus on recording data for a user behavior change scenario such as Real-Time-Pricing (RTP). Therefore, the focus was on recording appliances with the following properties:

• easily shiftable load,

• significant electricity consumption,

• standard power plug.

Appliances falling under these terms are washing machines, dishwashers, fridges and freezers, thus the datasets mostly contain these appliances.

The general requirements for the project were:

• collect appliances and, optionally, whole-house electricity usage,

• store data locally,

• upload data frequently to central server,

• collect over a long time period (>2 years) at ≈1 Hz,

• no technical knowledge required to install the system,

• keep costs low.

#### Hardware requirements

The hardware requirements identified for the project were: The complexity of the hardware and software installation must be reduced so that a person without any special technical knowledge is able to install the system in their home. Therefore, the hardware for appliance-level measurements must be an end-user product that simply needs to be plugged in between wall-plug and appliance. The measuring plugs should support wireless communication to lower the complexity of the setup as the user otherwise would have to install cables in each room. For safety reasons, the installation of a house-level smart-meter must be performed by an electrician. Therefore, the technical knowledge required can be higher, but should also be kept as low as possible. The sample frequency for both appliance-level and home-level meter must be at least 1 Hz. This rate is used by a large number of algorithms already, following the suggestion of Klemenjak, C. et al.22 for macroscopic datasets. Higher sample rates increase the data handling complexity. It needs to be kept in mind that doubling the sample rate will also double the data volume that needs to be handled. Collected data must be stored on a small on-premises computer. The overall cost should be kept as low as possible.

#### Software development requirements

With a limited overall time for building and using the monitoring system, the goal for the development was to measure as early in the project as possible. This meant that system had to be installed in the homes as soon as possible. Therefore, it was developed in a MVP (Minimum Viable Product) fashion. The homes were distributed throughout the south of Germany and the project team did not have direct physical access to the homes.

The software requirements identified for the project were: The software system must be able to read the measurements from appliance- and home-meter and store it on an on-premises computer. The users should be provided with a web interface where they can monitor real-time meter values. All local data (client) must be uploaded hourly to a centralized server over the internet. The uploads must be self healing, meaning the upload system needs to be robust against a broken internet link, aborted/partial uploads and server downtime. The software must be able to be updated automatically in order to ship bug fixes and new features. For security reasons direct network access to the households network must be avoided. Thus the system must be updated autonomously without any interaction simply by providing a new software version on a server.

### System architecture

Based on the requirements listed above, we derived the following system architecture (cf. Figure 1a). All components on server and client are designed as micro services and run as containers using Docker (https://www.docker.com). This enables a guaranteed software state on each client as well as a simple update mechanism.

#### Server-side

The server side provides three main functionalities: Data storage including secure upload, visualization, provide software updates, and monitoring. For data storage a PostgreSQL (https://www.postgresql.org) database is used, an open source relational database that provides a rich set of query functionality. The measurements of all homes and appliances are stored in a single table. In the following, all measurement sources, i. e. appliance or smart-meter phase, are called item. Each item is identified by a unique ID that is assigned on first upload of measurements to the server. All measurements are stored into a database scheme as shown in Fig. 1b, where all measurements are stored in a single table with a foreign-key to an items table. The items table contains the assigned unique ID as well as meta data such as item name and which household it is installed in.

As server-side hardware we used an Intel Xeon Gold 6140 CPU @ 2.30 GHz with 256 GB RAM. Such powerful hardware is in general not required, but increases SQL query speed when large number of measurements are recorded. It must be noted that in order to utilize the hardware power, the PostgreSQL instance needs to be configured to allocate more memory per query. The server requires an internet link capable of handling the uploaded data of about 140 KB per appliance per hour. This is insignificant compared to the required download capabilities. Daily backups and large queries require gigabytes to be transferred as fast as possible, and we therefore made use of a 1-Gbit internet link.

#### Home-/client-side

The hardware components used for each home were:

• computer: Raspberry Pi 3 Model B with 32 GB storage,

• WiFi Access-Point: TP-Link TL-WR802N,

• individual appliance meter: TP-Link SmartPlug HS110 (see Fig. 2c),

• whole-house meter: ABB B23 112-100 + MAX485 module (see Fig. 2d),

• 5 V DC power supply.

For the on-premises computer a Raspberry Pi 3 Model B with 32 GB storage with a standard casing (see Fig. 2b) was used. It offers WiFi, Ethernet and GPIOs for Modbus communication with a whole-house meter at very low cost. In cases where a whole-house meter was installed, a DIN rail casing as shown in Fig. 2a was used to mount the Pi next to the meter as the read out using Modbus requires a cable connection.

The computer runs Hypriot OS (https://blog.hypriot.com), a slim operating system with the main focus on Docker container support. The system is based on Eclipse Smarthome (https://github.com/eclipse-archived/smarthome), which is used for collecting and persisting the data locally into a PostgreSQL database. The appliance meters are read over the wireless network every second, where the timestamp is added at receiving time of the value from the meter, thus introducing a reading latency equivalent to response time of device plus network latency. The system time is set using the NTP protocol. In order to keep the 1 Hz rate, the next value reading request takes this latency into account. Readings are persisted only on value change to minimize storage space. Additionally, one value is always stored at the full hour in order to detect connectivity problems.

Each hour the new data is exported into a CSV file and uploaded to the central server using SSH with key authentication. In order to identify incomplete uploads on server side, the SHA checksum is used as a filename and checked before import on the server side.

The full system can be pre-configured in a lab. This reduces the steps required for installation by the residents in each home to: Plug in individual appliance meters and Raspberry Pi, and connect the access point to a local router. This also allows to ship the system as a package without requiring an electrician when only appliance data is collected (i. e., no whole-house meter installed).

### Event annotations

The collection of raw data is an important step of all electricity machine learning efforts as it can act as ground truth. For some applications such as appliance usage analysis, the raw electricity measurements are only one part of the required ground truth, as electricity consumption does not necessarily correlate with usage. While the BLUED9 dataset does provide ground truth event annotation, the dataset cannot be used for tasks such as user behavior analysis as the recorded period is about 12 hours.

Pereira27 takes the annotations from BLUED, adds manual annotations to the UK-DALE dataset and combines the results into a new event annotation dataset called NILMPEds. In this non-intrusive load monitoring context (NILM) events usually are defined as switch-on or off using a single timestamp. These two events have a logical relationship as a switch-on/start must be followed by a switch-off/stop. We therefore publish manual annotations for some of our data where an event is defined by two timestamps: $${e}_{0}=({t}_{0},\;{t}_{1})$$ where $${t}_{0} < {t}_{1}$$. Manual annotations were created using expert knowledge in a tool that allows multiple people to annotate data at the same time. Figure 3 shows manual annotations of compressor and light for a refrigerator in house 5. The measurements in the annotation tool have been rounded to full seconds using the provided SQL function round_timestamp() and the annotations are therefore created to full seconds.

## Data Records

We provide raw 1 Hz power readings of load-shifting relevant appliances over periods of 21 to 1351 days. For one home (house 8) we also provide whole-house mains readings. In total, the dataset contains 15 homes with a total of 50 appliances. A detailed listing of all homes and appliances, duration and missing data can be found in Table 1.

Figures 47 show stacked daily average consumption for all houses. The aligned charts show the periods the data was recorded and can be used for visual examination of missing data or analyzing seasonal changes wherein, for example, refrigerators require more electricity during summer periods.

For 14 cycle based appliances we provide manual annotations. These annotations provide start and stop timestamps of a cycle; there are multiple labels per appliance where possible in order to annotate different modes. For example, for the refrigerator with item ID 10 in house 0, we provide separate annotations for light and compressors. Thus, the annotations can be used as ground-truth for appliance usage predictions which, to our best knowledge, does not exist for any long-term electricity dataset available. This will allow training and evaluation of classifiers that can detect “usage” of the refrigerator based on the light being switched on when the door is opened. Additionally, we provide demographic data for each household’s residents such as age, absence duration and regularity of absence.

The DEDDIAG dataset is open-access and is hosted on Figshare28.

### Structure

All data are provided as plain text, tab-separated value files (TSV). There is one directory per house (house_00/, house_01/,) containing a house.tsv file with the house description. The description provides demographic data such as number of residents, their age, regularity of absence and normal absence duration. Appliances meta data is provided in items.tsv, containing a unique ID, name, device category and house ID. Appliance categories provide a grouping label. Available categories are: Refrigerator, Freezer, Washing Machine, Dryer, Dish Washer, Coffee Machine, TV, Office Desk, Smart Meter Phase, Smart Meter Total and Other.

The measurements, annotation and annotation labels of each appliance are provided in separate files:

• item_XXXX_data.tsv.gz,

• item_XXXX_annotations.tsv,

• item_XXXX_annotation_labels.tsv.

The measurement files are each compressed using gzip to reduce size. The data is split per house and appliance in order to be able to work on a single home or appliance. While the tsv files can easily be used directly, an import bash script called import.sh is provided to automatically import the data into a docker-based PostgreSQL database. Detailed instructions are provided as part of the dataset archive in README.md. The compressed dataset has a size of 14 GB, the uncompressed, imported data requires about 140 GB. All measurements are stored in a sparse manner where only value changes are recorded. In order to be able to detect connectivity problems of the metering device, there is at least one reading per hour. Therefore, if the time span between two measurements is greater than 1 h 5 sec, it should be assumed that there are missing values. A measurement record is stored as <item_id> <time> <value> where <item_id> corresponds to an ID in items.tsv. <time> is a UTC datetime string using the YYYY-MM-DD HH:MI:SS.US format, measurements are stored as float values.

Annotations are provided as <id> <item_id> <label_id> <start_date> <stop_date> where start, stop, are UTC datetime strings using the YYYY-MM-DD HH:MI:SS format. Annotations are only provided to full seconds, thus the corresponding measurement can be found using the provided custom SQL function round_timestamp().

## Technical Validation

All parts of the data collection system were tested in a lab environment. The collected measurements were captured as provided by the metering device. The mains meter was certified to be MID (Measuring Instruments Directive) calibrated, the appliance-level meters were not. A lab test on the wireless appliance meter devices was performed to assure constant readings over different metering devices. Another lab test was performed to compare readings of the appliance and mains meters, showing that the appliance meters systematically provided 2% to 4% lower readings compared to the mains meters.

### Sample rate precision

Although there is one reading per second, the exact rate varies due to the technical limitations of the metering device. For single appliance measurements the rate will usually fluctuate between 0.9 Hz and 1.1 Hz. The timestamp precision is sub-second, rounding to full seconds may result in having two measurements for the same second. The recommended handling for such cases is to only keep the later value to avoid changing time-order of the two values, since this may introduce a significant error considering only value changes are recorded.

### Missing values

There are missing values within the dataset. In order to be able to detect failures, the system recorded at least one value per hour per appliance and therefore time gaps that are greater than 1 h 5 sec must be treated as missing values. The additional 5 seconds are added as a fair dealing gap because small deviations from the 1 h are of no significance as these gaps only represent periods where the electricity demand has not changed, which in reality only applies to 0 Watt cases. An overview of monitoring gaps is given in Table 1 as sum of missing periods relative to the total duration. There are two columns, one for a gap size of more than 1 h 5 sec, and another for a gap size of more than one day, where the latter must always be smaller. The >1 h 5 sec missing data ranges from 0.28% on item 38 to 59.40% on item 4, and the >1 day missing data ranges from 0.00% to 49.26% for the same appliances.

As an example Table 2 provides a more detailed insight into monitoring gaps of item 4 and item 69 that have an overall >1 h 5 sec missing data of 59.40% and 35.06%. The items were chosen as an example since they have very different missing data patterns. The table shows how much of the missing data falls within a certain gap length, clearly indicating that item 69 suffers from long term interruptions where 89% of the missing data are from gaps >5 days where there is no gap of that length for item 4. Item 69 has a single monitoring gap of about 33 days, accounting for 34% of its missing data, while the largest gap for item 4 is 4 days.

As the analysis as well as impacts of the missing data depend on the application of the dataset, detailed inspections of missing data must be performed when using the dataset. The DEDDIAG -loader software library described in section Code Availability provides a method to find missing periods to assist analysis. Missing data can also be seen visually in the daily average power demand, Figs. 47.

## Usage Notes

The dataset release is accompanied by a full usage description. The dataset has not been preprocessed and is provided as recorded. It is advised to read the README.md for detailed usage instructions. All data is provided as plain-text TSV files. Although we recommend and provide information to import all data into a PostgreSQL database, the plain-text data can be parsed directly with standard tooling.

## Baseline Results

In addition to the dataset we provide baseline results for appliance category identification and automatic event annotations.

### Appliance category identification

The dataset is used to train and evaluate an appliance category classifier. Such a classifier can be used to identify single appliance recordings of unknown appliances and therefore act as the first classifier for further tasks such as event annotation or usage prediction. The task is performed on all available appliances using the unaligned recordings, meaning that the classifier has to be able to perform on a randomly chosen segment. The sparse recorded data is taken as is, which prevents trying to classify long switched off states where the power demand is 0 Watt.

As a baseline we train a k-nearest neighbors classifier for all available categories: Coffee Machine, Dish Washer, Dryer, Freezer, Office Desk, Other, Refrigerator, TV, Washing Machine.

The classifier is trained using a sliding window where the window size will also define the recording length required to identify an appliance. With a sample rate of 1 Hz, the window size represents the number of seconds the appliance needs to run in order to identify it. In order to reduce the feature space dimensionality, we extract window-based features using the window’s mean as well as coefficients of a discrete wavelet transform. The mean value provides a good indicator as appliances are composed of very distinct power demand groups. The wavelet transform is very common in time series classification and provides a low dimensional representation of the signal. It is performed as a multi-level decomposition using the Daubechies kernel db1. The decomposition level is set such that we get a total of four components, two detail and two approximation coefficients, making the feature space independent of the window size and therefore removing the problem of exploding dimensionality for large window sizes. Combined with the window’s mean, this results in a five-dimensional feature vector, which is normalized and standardized independently based on the training data.

The classification is performed using the k-nearest neighbor algorithm, a very robust and commonly used classifier with the capability of separating non-linear class boundaries based on a distance metric. We separate based on 5 neighbors using the Minkowski distance. 5 million values per category are used in order to reduce the total data size while still having a significant number of samples, taking equal numbers from each appliance in case multiple appliances are present for a category. The evaluation is performed using a 5-fold cross-validation for window sizes 2n with $$n\in [2,11]$$ and step size 3.

Table 3 shows the F1-score per class and the weighted average over all classes for tested window sizes. The F1-score is calculated as:

$${F}_{1}=2\cdot \frac{{\rm{precision}}\cdot {\rm{recall}}}{{\rm{precision}}+{\rm{recall}}}=\frac{{\rm{TP}}}{{\rm{TP}}+0.5({\rm{FP}}-{\rm{FN}})},\quad {\rm{precision}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FP}}},\quad {\rm{recall}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FN}}},$$
(1)

where TP, FP and FN are the number of true positives, false positives, and false negatives.

The results indicate that the F1-score increases with increasing window size, which means that although the number of features presented to the k-nearest neighbor algorithm are independent from the window size, the extracted window features contain more distinct class boundaries when using a larger window size. For very small window sizes, the Coffee Machine category already performs very well with a score >0.91 even for window size 4. The best performing category at large window sizes is Freezer, with a maximum score of 0.9767. While at window size 2048 nearly all categories perform well with scores >0.90, the Washing Machine performs considerably worse with a score of 0.87. Washing Machines and Dish Washers are the most difficult appliances to distinguish, as they both present the highest miss-classification class for each other.

### Event annotation

Event annotation, also known as event detection, is the task of finding events in electricity data. There have been numerous publications on event detection with a NILM background. These event detection tasks are usually performed using thresholding methods29. Since there are no public datasets that provide ground truth, the evaluation of most publications in the essence do not predict events, but rather the presence of electricity consumption. DEDDIAG provides manual event annotations that can act as ground truth to train and evaluate event annotation algorithms. Appliance usage prediction and user behavior algorithms, such as25,30, will benefit from more advanced event annotations that are evaluated against manual annotated ground truth, as theses algorithms require event annotations as their ground truth.

We denote event annotation as a segmentation task where an event segment $$e=[{t}_{0},{t}_{1}]$$ is defined by the interval between two timestamps $${t}_{0}$$, $${t}_{1}$$, $${t}_{0} < {t}_{1}$$ during which an appliance is running, e. g. a washing machine being started at $${t}_{0}$$, finishing a full washing cycle at $${t}_{1}$$ (note that the appliance might consume only a very small amount of power at certain times while running, which may make it very hard to distinguish this case from the actual end of the cycle). For most appliances, such as washing machine and dishwasher, there are no overlapping segments. This is also the case if there are different labels defined for one appliance, e. g., PreWash and Normal. For refrigerators and freezers, overlap between the light and compressor labels are possible, but overlap within one label cannot occur.

Since in a NILM context events are defined by a single timestamp, the evaluation of such events is usually done based on correctness of the label for each time-step using standard scores such as true positive rate, false positive rate, accuracy or F1. For usage analysis or event attribution these scores are problematic because all such events are extremely rare, especially when only annotating the start or end. To put it in perspective: an appliance that only is switched on once per day will, in a dataset recorded with 1 Hz, result in having to find one value within 86400 data points. Evaluating an appliance based on ON/OFF status for each available time-step will for long-running appliances result in a much more balanced task. While for electricity attribution tasks such as disaggregation this evaluation can be seen as fair, for appliance usage analysis it is still not strict enough. A per timestamp evaluation will not provide strong enough punishment for many unwanted annotations such as splitting or combining of event segments as shown in Fig. 8.

For usage analysis a split or combined event segment will result on wrong overall event counts, wrong usage lengths and thus a false analysis of usage behavior. We therefore introduce a new score called Jaccard-Time-Span-Event-Score (JTES) that is based on the Intersection-over-Union (IoU) metric, also known as Jaccard index. The Jaccard index computed on a single segment is:

$${\rm{iou}}({e}_{{\rm{T}}},{e}_{{\rm{P}}})=\frac{Intersection}{Union}=\frac{\min ({t}_{{\rm{T1}}},{t}_{{\rm{P1}}})-\max ({t}_{{\rm{T0}}},{t}_{{\rm{P0}}})}{\max ({t}_{{\rm{T0}}},{t}_{{\rm{P0}}},{t}_{{\rm{T1}}},{t}_{{\rm{P1}}})-\min ({t}_{{\rm{T0}}},{t}_{{\rm{P0}}},{t}_{{\rm{T1}}},{t}_{{\rm{P1}}})},$$
(2)

where $${e}_{{\rm{T}}}$$ is the true event segment and $${e}_{{\rm{P}}}$$ the predicted one; this basically assumes that a box of height one is used for comparing regions. The Jaccard index is widely used to evaluate image segmentation tasks such as object detection, where commonly true positives are defined as having an overlap of more than 50%6, which are then used to calculate an overall accuracy score. The JTES score avoids setting an arbitrary threshold and therefore eliminates the drawbacks of using pure IoU. In a first step we compute an average IoU for each true event segment, only taking into account predictions that actually have at least a partial overlap with the true event segment. Let $${e}_{{\rm{T}}i},i=0,\ldots ,{N}_{{\rm{T}}}-1$$ be a true event segment and $${e}_{{\rm{P}}j}$$, $$j=0,\ldots ,{N}_{{\rm{P}}}-1$$ a predicted segment, where $${N}_{{\rm{T}}},{N}_{{\rm{P}}}$$ is the number of true and predicted segments, respectively. The average IoU $${A}_{i}$$ for true event i is then given by

$${A}_{i}=\frac{1}{{N}_{{\rm{NZ}}i}}\mathop{\sum }\limits_{j=0}^{{N}_{{\rm{P}}}-1}{\rm{iou}}({e}_{{\rm{T}}i},{e}_{{\rm{P}}j}),$$
(3)

where $${\rm{iou}}({e}_{{\rm{T}}i},{e}_{{\rm{P}}j})$$ computes the IoU for the segments $${e}_{{\rm{T}}i}$$ and $${e}_{{\rm{P}}j}$$ as defined in (2), and $${N}_{{\rm{NZ}}i}$$ is the number of non-zero IoU-values in the sum (i. e. the predictions with an overlap). The final score is then calculated by summing all Ai and normalization to the number of true event segments NT corrected by the number of false positive predictions NFP:

$${\rm{JTES}}({{\boldsymbol{e}}}_{{\rm{T}}},{{\boldsymbol{e}}}_{{\rm{P}}})=\frac{\mathop{\sum }\limits_{i=0}^{{N}_{{\rm{T}}}}{A}_{i}}{{N}_{{\rm{T}}}+{N}_{{\rm{FP}}}},\hspace{7.5pt}{{\boldsymbol{e}}}_{{\rm{T}}}=\left({e}_{{\rm{T}}0},{e}_{{\rm{T}}1},\ldots ,{e}_{{\rm{T}}{N}_{{\rm{T}}}-1}\right),\hspace{7.5pt}{{\boldsymbol{e}}}_{{\rm{P}}}=\left({e}_{{\rm{P}}0},{e}_{{\rm{P}}1},\ldots ,{e}_{{\rm{P}}{N}_{{\rm{P}}}-1}\right).$$

The JTES score requires non-overlapping true event segments, i. e. $${e}_{{\rm{T}}i}\cap {e}_{{\rm{T}}j}={\rm{0}}\,/\;\forall i,j,i\ne j$$. It is designed to reflect real-world expectations on event annotation algorithms, where an event that is split in half will at best only be evaluated as half correct, the same applies to scenarios where two true events have been merged into one event as shown in Fig. 8. Splitting a real event in two successive events will result in a JTES score of at most 0.5, while common scores such as accuracy or F1 will evaluate each time step independently and still result in perfect scores.

The principles of JTES are:

• Score is in range [0, 1], where 0 is lowest and 1 best,

• false positives and false negatives are equally bad,

• if a true event spans over multiple predicted events, the result is averaged,

• if the predicted event spans over multiple true events, the result is accounted for accordingly,

• if $${N}_{{\rm{T}}}=0$$ and $${N}_{{\rm{P}}} > 0$$, the score is 0,

• if $${N}_{{\rm{T}}} > 0$$ and $${N}_{{\rm{P}}}=0$$, the score is 0,

• if $${{\boldsymbol{e}}}_{{\rm{T}}}={{\boldsymbol{e}}}_{{\rm{P}}}$$ (assuming the segments are ordered by start time), the score is 1.

As a baseline algorithm we present results for a simple lower bound thresholding algorithm. The appliance is seen as switched-on when the average electricity demand E in a sliding window w reaches a threshold t:

$$E(w)=\left\{\begin{array}{cc}0 & {\rm{avg}}(w) < t\\ 1 & {\rm{avg}}(w)\ge t\end{array}\right..$$
(4)

The threshold is defined as the non-zero minimum average window calculated over each event in the train split.

We use a 5-fold cross-validation and test window sizes in range 1–24. Table 4 lists results for 15 segmentation tasks, where the best window was chosen based on the highest JTES score. The results clearly show that the fridge compressor cycles can reliably be annotated having a JTES of 0.8817, 0.8855, and 0.9348. The fridge light however, having very low JTES scores of 0.1589, 0.2713, and 0.0403, cannot be annotated correctly, which can be explained by the lack of an upper bound that would allow distinguishing compressor and light power. It also shows the problem when evaluating rare events using an F1-score, as while the JTES on the light of item 10 is only 0.0403, the F1-score is at 0.95, which would indicate that the algorithm performs well. The light and compressor can be present at the same time, which would require a second lower bound threshold, if even possible using a simple threshold approach. Washing machines show very different JTES scores, ranging from 0.0385 to 0.6055, where especially item 6 with the lowest JTES score still has a F1-score of 0.9196. The algorithm had to choose a very low threshold as there are many periods where the washing machine uses very little electricity which then splits the event into multiple parts, which is heavily punished by the JTES, but not by the F1-score. Dish washers show similar results, having JTES scores ranging from 0.0817 to 0.7167 while having F1-scores ranging from 0.8722 to 0.9959. The very bad JTES score on the washing machine with item 6 and dish washer with item 26 can be explained by a very high number of false positives of short length, that are very mildly punished by the F1-score. This in particular demonstrates the weaknesses of evaluating based on single timestamps instead of a time span and highlights the importance of the novel JTES score.

Overall, using a simple thresholding algorithm will not provide a good and reliable event detection algorithm and the simple approach would benefit from merging small successive predictions as well as filtering based on event length.

## Code availability

The full data collection system is published under MIT license and is available under https://DEDDIAG.github.io. The dataset itself is published as tab-separated text files together with code to import all data into a PostgreSQL instance. An SQL function called get_measurements() is provided to get seconds-based measurements, where readings are converted from value-changes to seconds-based readings using interpolation; timestamps are rounded to nearest seconds. There is also a python package available https://github.com/DEDDIAG/DEDDIAG-loader.git that assists in retrieving data into a pandas-DataFrame/numpy-array.

## References

1. 1.

EU Directive 2009/72/EG. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32009L0072from=EN (2019).

2. 2.

NATIONAL ENERGY EFFICIENCY ACTION PLAN 2020. https://ec.europa.eu/energy/sites/ener/files/si_neeap_2017_en.pdf (2017).

3. 3.

Le Ray, G. & Pinson, P. The ethical smart grid: Enabling a fruitful and long-lasting relationship between utilities and customers. Energy Policy 140, 111258, https://doi.org/10.1016/j.enpol.2020.111258 (2020).

4. 4.

Kolter, Z. & Johnson, M. J. REDD: A public data set for energy disaggregation research. In Proc. of the KDD Workshop on Data Mining Applications in Sustainability (SustKDD) (2011).

5. 5.

Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324, https://doi.org/10.1109/5.726791 (1998).

6. 6.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, 303–338, https://doi.org/10.1007/s11263-009-0275-4 (2010).

7. 7.

Makonin, S., Ellert, B., Bajic, I. V. & Popowich, F. Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014. Scientific Data 3, 1–12, https://doi.org/10.1038/sdata.2016.37 (2016).

8. 8.

Kriechbaumer, T. & Jacobsen, H.-A. Blond: Building-level office environment dataset, https://doi.org/10.14459/2017mp1375836 (2017).

9. 9.

Anderson, K. et al. BLUED: a fully labeled public dataset for Event-Based Non-Intrusive load monitoring research. In Proc. of the 2nd KDD Workshop on Data Mining Applications in Sustainability (SustKDD), 1–5 (2012).

10. 10.

Batra, N., Parson, O., Berges, M., Singh, A. & Rogers, A. A comparison of non-intrusive load monitoring methods for commercial and residential buildings. Preprint at https://arxiv.org/abs/1408.6595 (2014).

11. 11.

Picon, T. et al. COOLL: Controlled on/off loads library, a public dataset of high-sampled electrical signals for appliance identification. Preprint at https://arxiv.org/abs/1611.05803 (2016).

12. 12.

Pecan Street Inc. Dataport. https://dataport.pecanstreet.org/ (2014).

13. 13.

Uttama Nambi, A. S., Reyes Lua, A. & Prasad, V. R. LocED: Location-aware Energy Disaggregation Framework. In Proc. of the 2nd ACM Int. Conf. on Embedded Systems for Energy-Efficient Built Environments, BuildSys, 45–54, https://doi.org/10.1145/2821650.2821659 (New York, USA, 2015).

14. 14.

Beckel, C., Kleiminger, W. & Cicchetti, R. The eco data set and the performance of non-intrusive load monitoring algorithms. In Proceedings of the 1st ACM Intern. Conf. BuildSys 2014, BuildSy’s 14, 80–89, https://doi.org/10.1145/2674061.2674064 (Association for Computing Machinery, 2014).

15. 15.

Monacchi, A., Egarter, D., Elmenreich, W., D’Alessandro, S. & Tonello, A. M. GREEND: An energy consumption dataset of households in Italy and Austria. In IEEE Int. Conf. SmartGridComm, 511–516, https://doi.org/10.1109/SmartGridComm.2014.7007698 (2014).

16. 16.

Batra, N., Gulati, M., Singh, A. & Srivastava, M. B. It’s different: Insights into home energy consumption in india. In Proceedings of the 5th ACM Workshop on Embedded Systems For https://doi.org/10.1145/2528282.2528293 (ACM, New York, NY, USA, 2013).

17. 17.

Makonin, S., Wang, Z. J. & Tumpach, C. RAE: The Rainforest Automation Energy Dataset for Smart Grid Meter Data Analysis. In MDPI Data Vol. 3, https://doi.org/10.3390/data3010008 (2018).

18. 18.

Murray, D. et al. A data management platform for personalised real-time energy feedback. In Proc. of the 8th Int. Conf. on Energy Efficiency in Domestic Appliances and Lighting, 1293–1307 (2015).

19. 19.

Barker, S. et al. Smart*: An Open Data Set and Tools for Enabling Research in Sustainable Homes. In Proc. of the 2nd KDD Workshop on Data Mining Applications in Sustainability (SustKDD) (2012).

20. 20.

Reinhardt, A. et al. On the accuracy of appliance identification based on distributed load metering data. In 2012 Sustainable Internet and ICT for Sustainability (SustainIT), 1–9 (2012).

21. 21.

Kelly, J. & Knottenbelt, W. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Scientific Data 2, https://doi.org/10.1038/sdata.2015.7 (2015).

22. 22.

Klemenjak, C. et al. Electricity consumption data sets: Pitfalls and opportunities. In Proc. of the 6th ACM Int. Conf. on Systems for Energy-Efficient Buildings, Cities, and Transportation, BuildSys 19, 159–162, https://doi.org/10.1145/3360322.3360867 (Association for Computing Machinery, New York, NY, USA, 2019).

23. 23.

German Federal Environment Agency. Anteile der Anwendungsbereiche am Netto-Stromverbrauch der privaten Haushalte 2008 und 2018. https://www.umweltbundesamt.de/sites/default/files/medien/384/bilder/dateien/5_abb_anteile-anwendungsbereiche-am-netto-sv-ph_2020-07-01.xlsx (2020).

24. 24.

BDEW Bundesverband der Energie-und Wasserwirtschaft e.V. Wie heizt Deutschland? - Studie zum Heizungsmarkt. https://www.bdew.de/media/documents/BDEW_Heizungsmarkt_final_30.09.2019_3ihF1yL.pdf (2019).

25. 25.

Wenninger., M., Schmidt., J. & Goeller., T. Appliance usage prediction for the smart home with an application to energy demand side management - and why accuracy is not a good performance metric for this problem. In Proc. of the 6th Int. Conf. on Smart Cities and Green ICT Systems - Volume 1: SMARTGREENS, 143–150, https://doi.org/10.5220/0006264401430150. INSTICC (SciTePress, 2017).

26. 26.

Wenninger, M., Stecher, D. & Schmidt, J. SVM-based segmentation of home appliance energy measurements. In 18th IEEE Int. Conf. on Machine Learning and Applications – ICMLA 2019, 1666–1670, https://doi.org/10.1109/ICMLA.2019.00272 (2019).

27. 27.

Pereira, L. NILMPEds: A performance evaluation dataset for event detection algorithms in non-intrusive load monitoring. Data 4, 127, https://doi.org/10.3390/data4030127 (2019).

28. 28.

Wenninger, M., Schmidt, J. & Maier, A. Deddiag, a domestic electricity demand dataset of individual appliances in germany. figshare https://doi.org/10.6084/m9.figshare.13615073 (2021).

29. 29.

Girmay, A. A. & Camarda, C. Simple event detection and disaggregation approach for residential energy estimation. In 3rd Int. Workshop on Non-Intrusive Load Monitoring (2016).

30. 30.

Basu, K., Hawarah, L., Arghira, N., Joumaa, H. & Ploix, S. A prediction system for home appliance usage. Energy and Buildings 67, 668–679, https://doi.org/10.1016/j.enbuild.2013.02.008 (2013).

## Acknowledgements

Major parts of this work were funded by the German Federal Ministry of Education and Research (BMBF), grant 01LY1506, and supported by the Bayerische Wissenschaftsforum (BayWISS). Furthermore, we thank all participating households for their data donation and all students involved in the system development.

## Funding

Open Access funding enabled and organized by Projekt DEAL.

## Author information

Authors

### Contributions

M.W. designed and implemented the data collection system, prepared the public release of the dataset. A.M. and J.S. provided conceptional guidance, helped finding participating households. J.S. furthermore managed the project that funded the data collection. All authors reviewed the manuscript.

### Corresponding author

Correspondence to Marc Wenninger.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.

Reprints and Permissions

Wenninger, M., Maier, A. & Schmidt, J. DEDDIAG, a domestic electricity demand dataset of individual appliances in Germany. Sci Data 8, 176 (2021). https://doi.org/10.1038/s41597-021-00963-2