Background & Summary

Buildings are responsible for 40% of the European Union’s energy consumption and 36% of its greenhouse gas emissions1. In the residential sector, space heating, cooling, and hot water supply constitute up to 80% of citizens’ energy consumption. At the same time, more than 75% of EU buildings are inefficient2, and the share of the building stock that undergoes major renovation is low, ranging from less than 0.4% to 1.2% across EU member states1. Therefore, renovating buildings is a key initiative of the European Green Deal3. To this end, the European Commission published the “A Renovation Wave for Europe” strategy that aims to tackle inefficient buildings and to double annual energy renovation rates4.

In practice, Energy Performance Certificates (EPC) are one of the central instruments to contribute to these goals by providing transparency of the existing building stock’s energy efficiency. It is mandatory to issue an EPC for buildings up for sale or rent and to display them in advertisements5. In Europe, the United Kingdom has the largest number of registered EPCs – more than 20 Million5 – with around 60% of coverage in England and Wales in 20226. EPCs are created for existing buildings by accredited energy assessors during on-site visits7, which makes the process slow, costly, and local. To achieve the aggressive renovation targets, methods are needed that generate relevant insights to buildings based on widely available data sources.

To this end, there is increasing research interest, in particular in the field of urban building energy modeling (UBEM)8,9,10. UBEM refers to the bottom-up building energy modeling and analysis on a city-scale. One aim of UBEM is to identify inefficient buildings automatically on large scale with continuous coverage. However, Ali et al. point out that necessary input data is often unavailable for an entire city or a district9. Moreover, the prevalent physics-based or engineering approach in UBEM is currently inadequate for large-scale building-level analysis because it requires highly detailed input data9.

As an alternative to physics-based modeling, some researchers propose data-driven approaches which aim at determining energy characteristics with fewer input data11,12,13,14,15,16. Publications mention different compositions of input features such as occupancy, year of construction, insulation of building envelope, or surface-to-volume ratio. However, these features are usually only available for a specific region or a subset of buildings in an area of interest.

Consequently, researchers turn to extracting relevant features, or proxies that indirectly indicate features, from remotely sensed data, which is available on a larger scale16,17,18,19,20,21. For example, building footprints16 and roof shapes18 can be gathered from aerial images. Street view images include information about the window-to-wall ratio or the number of floors19,21. Airborne LiDAR contains information about a building’s height15 and can be further processed into Digital Surface Models (DSM)19,20 or 3D models22 to reflect a building’s envelope and volume. Implicitly, LiDAR also contains information that is relevant for building energy modeling. For example, Tooke et al. predicted building age based on LiDAR17, and Castagno and Atkins improved roof shape classification with LiDAR18.

In summary, combining widely available remotely sensed data sources with data-driven algorithms to estimate energy efficiency can provide building insights fast and at scale21. LiDAR is a promising data source as it includes features linked to building height, envelope geometry, roof superstructures, and surface-to-volume ratio. In addition, age, building type, or architectural style can be inferred from a building’s point cloud representation.

To support and accelerate research in this field, this paper provides a large-scale dataset of building point clouds coupled with energy characteristics. The major contributions are two-fold:

  1. 1)

    The Points for Energy Renovation (PointER) dataset23 comprises more than 1 million building point clouds and covers 16 diverse authority districts in England. When available, building point clouds are linked to data from UK’s energy performance database, leading to more than half a million complete sets of point clouds with energy characteristics. The dataset can be downloaded from https://doi.org/10.14459/2023mp1713501

  2. 2)

    Open source code with a detailed documentation to replicate the building point cloud generation process enables follow-up studies. The process is region-agnostic and can be applied anywhere where LiDAR data and building footprints are available. The code is available at https://github.com/kdmayer/PointER

Methods

Figure 1 gives a high-level overview of the conducted steps to create the PointER dataset23.

Fig. 1
figure 1

Data pipeline overview: generating a building point cloud dataset in four steps.

Obtaining base data

Multiple open datasets in the UK enabled the creation of the PointER building point cloud dataset, including LiDAR data and Europe’s largest collection of EPCs. In this study, we focused on England, as not all of the required data was available for the entire UK. To produce the dataset, we obtained seven base datasets which are summarized in Table 1. All datasets were provided under the Open Government License (OGL) except for the building footprints dataset.

Table 1 Overview of base data sources.

The first and major component was UK’s National LiDAR Programme Point Cloud data, collected between 2017 and 2021. The LiDAR data covers all of England with a resolution of 1 m and is available online24. The point data is categorized into the ASPRS Standard LiDAR Point Classes25, such as building, ground, low vegetation, water, etc. Hence, building point clouds could simply be extracted from the LiDAR data based on their classification. However, to match energy characteristics on the property level, we required a distinction between neighboring properties. In particular, the boundaries of properties could not be extracted from LiDAR points alone in the common case of townhouses. Therefore, our alternative approach was to crop points using building footprint data.

To this end, we required accurate building footprint data. We chose the UKBuildings edition 13 online version dataset by Verisk26 because it had the most accurate data based on our visual analysis of aerial images. Figure 2 displays UKBuildings’ polygons as well as building outlines from OpenStreetMaps (OSM)27 and Ordnance Survey’s OS OpenMap - Local28. As can be seen from the examples, a common downside of OSM was the missing footprint information. Likewise, the OS dataset was not suitable because it did not contain exact property lines between the buildings. While Verisk’s footprint data has a higher quality, the drawback is that we could not publish the footprint data due to the private license.

Fig. 2
figure 2

Building outline polygons of three available datasets visualized on aerial images. UKBuildings26 (left), OSM27 (center), and OS OpenMap28 (right) on Google Maps aerial image.

Our goal was to link the point clouds to energy performance certificate data29. The certificates are available for a subset of properties in England and Wales. There are currently around 25.60 million domestic EPCs and 1.11 non-domestic EPCs in the database6. The main label is an energy rating between “A” for most efficient and “G” for least efficient. However, the EPC database includes more detailed information, such as roof insulation efficiency or window efficiency, which were also added as energy attributes. The data is available under the OGL, except for the Ordnance Survey address data entries, which were therefore excluded from our dataset.

To link the EPC with the building point clouds, we used the Unique Property Reference Numbers (UPRN), which exist for most of the UK’s buildings30. EPC data is structured by local authority districts (LAD). Accordingly, we also downloaded LADs’ boundary shape files provided by the Office for National Statistics31 to limit the point cloud generation to geographic Areas of Interest (AOI).

Lastly, we aimed at generating a dataset of buildings that possess a similar composition as the entire English building stock. We used the 2011 Rural Urban Classification32. The classification distinguishes ten rural urban classes ranging from “rural hamlets and isolated dwellings in a sparse setting” to “urban major conurbation”. We assigned one of these classes to each of the footprints. This allowed us to select AOIs so that our dataset possessed the same rural urban distribution as entire England. The smallest available geographic entities are the Output Areas (OA). OAs entail an average resident population of approximately 300 people33. Therefore, we obtained the OA boundary shape files34 as the last data source.

Storing data in a database

A key element of the dataset generation process was a postgres35 database to store the obtained data. We used the postGIS36 extension for geo-data and the pgpointcloud37 extension for the point cloud data, as well as gdal38 and pdal39 to insert the data into the database, respectively. The program itself was written in Python3. The program manipulates data and interacts with the postgres database. The code and utilized Python libraries can be found at https://github.com/kdmayer/PointER. The framework was set up in a Singularity40 container. Most of the data was inserted into the database before running the point cloud generation script. However, point cloud and EPC data were imported periodically during the runtime for one AOI only. This reduced the amount of data in the database and speeded up the cropping process.

Selecting areas of interest

To generate point clouds for a subset of England’s building stock, we chose a geographically diverse set of LADs. We first selected Coventry, Westminster, Oxford, and Peterborough, similarly to Mayer et al.21. Furthermore, we used the RUC for small geographies to calculate England’s RUC distribution and all LAD’s distribution. We chose LADs with the goal to reach a RUC distribution similar to England in our dataset. To account for the geographic diversity of buildings in England, we selected LADs from across the country including coastal and interior regions. Figure 3 visualizes the geographic distribution of selected LADs. Furthermore, columns one and two of Table 2 give and overview of selected LADs, sorted by district code.

Fig. 3
figure 3

Map of UK’s Local Authority Districts (LADs) and the 16 LADs selected for our dataset highlighted in green.

Table 2 Overview of number of footprints with point cloud, UPRN and EPC data in the dataset.

Our data includes other properties besides rural urban classification, such as building age, height, size and energy related features. In our dataset, these properties’ distribution should ideally be similar to the distribution of the entire English building stock. While we did not take this data into account for the selection of regions to reduce complexity, we analyzed these properties in retrospect and included the result in the chapter Technical Validation.

Generating building point clouds

Building point clouds were generated for one AOI at a time. Figure 4 depicts the eight steps of the program. We used python to process the steps and to execute SQL queries in the postgres database, whereas steps one to five used postgres and steps five to seven ran in python itself.

Fig. 4
figure 4

Overview and visualization of the program’s steps for generating building point clouds and their energy characteristics for one AOI.

Step 1: The program starts by inserting the respective point cloud and EPC data into the database. This reduces the amount of data simultaneously in the database and speeds up GIS-functions such as the intersection function.

Step 2: The program runs batches of footprints iteratively to avoid memory issues. In our experience, a batch size of 500 works well, leading to 120 iterations for an AOI with 60.000 footprints. The program selects all footprints inside the AOI and saves them as a materialized view in the second step. This reduces the computationally expensive intersection of footprints in the AOI at each of the thousands of iterations.

Step 3: To account for small deviations in the spatial correlation between the footprint and the point cloud data, a buffer of 0.5 m is added to each footprint. In Fig. 4, this buffer is indicated by the footprint’s black outlines.

Step 4: Next, all LiDAR points within footprints are selected and assigned to the respective footprint. In Fig. 4 points intersecting with a footprint are plotted in black. Points outside footprints are plotted in orange. In step 4, we also applied a minimum threshold of 100 points to the footprints and filtered out buildings with less points than this threshold. This is indicated by red polygons in Fig. 4. We used this pre-selection to avoid point clouds that did not provide a useful geometric representation of a building and would reduce the quality of the dataset.

Step 5: In step 5, EPC data are matched to the buildings by UPRN. Currently, only a subset of buildings holds EPCs which leads to a significant number of houses without energy attributes (red polygon in Fig. 4). Furthermore, it is possible that one footprint is allocated with multiple EPC if a building contains multiple dwellings. In addition, some footprints do not have a linked UPRN. This fact is discussed in more detail in section Technical Validation.

Step 6: Airborne LiDAR data only covers the buildings’ roofs and fragments of the walls. Therefore, floor points are added to the point cloud in a raster of 0.5 m, because we expected that this improves the representation of the building. The step is conducted in python using the shapely library.

Step 7 and 8: In the end, the point clouds are saved in “.npy” format and the footprint polygons and EPC data are saved in a “.geojson“ file.

Data Records

The Points for Energy Renovation (PointER) dataset23 can be downloaded from https://doi.org/10.14459/2023mp1713501.

Dataset size and content

The PointER dataset23 comprises more than one million point clouds. The selected regions include almost 1.4 million buildings, however our point cloud threshold of 100 points per building point cloud eliminated of around 25% of buildings. As Table 2 shows, this effect was strongest in Plymouth, Coventry, Erewash, and Sutton.

Furthermore, only around 45% of footprints could be linked with EPC data by UPRN. We observed a great variance with 56% in Peterborough and 30% in Sutton. As a result, our final dataset contains around half a million point clouds with energy feature data.

Dataset structure

Table 3 gives a schematic overview of the dataset’s folder structure. The dataset contains one result folder for each of the 16 selected AOIs, named according to the AOI’s LAD code, e.g. E06000014 for York. Each result folder contains one sub-folder with all building point clouds in python numpy “.npy” format. Building point cloud files are named according to their footprint’s centroid’s coordinate in the spatial reference system EPSG 27700, i.e. “XCOORDINATE_YCOORDINATE.npy”. Point coordinates of a point cloud are in EPSG 27700. In addition, there is a “final_result_AOI_CODE.json” file that maps the building point cloud file to the EPC data. Finally, a summary of the number of footprints with point cloud and EPC information of the AOI is stored in the “production_metrics_AOI_CODE.json” file.

Table 3 Folder structure of published dataset.

Table 4 visualizes the structure of the “final_result_.json” files. The first four columns originate from the point cloud generation process. They connect the footprint with the resulting point cloud file. The information about number of points refers to the building point cloud before adding floor points. This can be used to filter out footprints with too little or too many points depending on the use case. Furthermore, the UPRN column enables linking data to the point cloud.

Table 4 Structure of feature table provided in final_result.json.

In the provided dataset, we combined point clouds with EPC data. The EPC features consist of a unique identifier for each EPC entry, current and potential energy rating, current and potential energy efficiency as well as 80 more features with detailed building and energy information. Point cloud data and EPC were linked through the Verisk footprint UPRN and the EPC UPRN.

Technical Validation

Quality of input data

The National LiDAR Programme covers the entire area of England. We downloaded the LiDAR point cloud data with resolution of 1 m which translates to an average point cloud density of 1 point per square meter. The National LiDAR Programme offers a metadata dashboard with details on each survey’s mission dates and quality metrics41. The average ground truth error across 1215 surveys is 3.66 cm. Furthermore, 97.1% of LiDAR points from overlapping flightlines are less than 15 cm different in elevation41.

Verisk’s UKBuildings edition 13 online version dataset includes 28,733,631 buildings from the whole of Great Britain as well as the Belfast urban area. Verisk conducts an internal data quality analysis on accuracy and completeness, but only plans to publish the results in future releases. Therefore, we conducted a qualitative assessment by visualizing the footprint polygons and LiDAR data on Google aerial as depicted in Fig. 2.

Information about the quality of EPC data can be found in the publication’s technical notes42. The EPC dataset contains around 60% of the housing stock in England and just less than 60% in Wales42. The proportion is similar across all regions in England42. The energy assessment of individual buildings is conducted by energy assessors, who are responsible for the robustness of the data in relation to individual buildings. In addition, there are validation checks as the data is uploaded on the registery42. With regard to subsequent potential applications, it is important to note that users need to carefully interpret the EPC data. The creation process in the UK has several shortcomings leading to an over-prediction of energy use especially for lower efficiency classes, as described by Few et al.43.

Finally, Local Authority District boundaries, Output Area boundaries and Rural Urban Classification affect the selection of buildings for point cloud generation, but not the building point clouds itself.

Quality assessment of resulting building point clouds

Our approach uses building outlines to crop the point clouds. Hence, errors can arise from spatial or temporal mismatch between point cloud data and building footprints. Therefore, we conducted a manual inspection of the resulting building point clouds. To this end, we randomly selected a subset of 5000 point clouds, assuming that the quality of the subset can be extrapolated to the entire dataset. First, we classified the point clouds into “suitable”, “inspection required” and “unsuitable” based on a 2D visualization. In a second round, we inspected the 3D representation as well as an aerial image of the “inspection required” buildings and classified them into “suitable after inspection” and “unsuitable”. Figure 5 gives examples for building point clouds of different quality.

Fig. 5
figure 5

Visualization of building point cloud examples of three assessment categories. Around 85% of building point clouds are “suitable”, 10% are “suitable after inspection” and 5% are “unsuitable”.

Around 5% of the point clouds were of low-quality, meaning, that the building was not recognizable. There were a number of reasons for this. Some buildings missed points on the roof, usually caused by surrounding structures. Other buildings were covered by vegetation which results in a chaotic point structure. Most of these buildings were small buildings such as garages or auxiliary buildings. Furthermore, a poor spatial alignment between LiDAR data and building outline could led to missing roof structures. The different temporal representations in the two datasets led to some point clouds that appeared to be vegetation or a construction site instead of a building. Finally, for some of the geographic areas, LiDAR points of two surveys overlapped, which could result in distorted building point clouds in a few cases.

After reassessing “inspection required” point clouds in detail, 478 out of 650 buildings were classified as “suitable after inspection”. The vast number of these point clouds were divided into two cases. First, there were buildings that contained a small number of vegetation points covering the roof, but the overall building was clearly recognizable to the human eye. Second, the building point clouds appeared slightly odd at first, but the buildings were part of narrow townhouse-complexes. Therefore, these 478 point clouds were described as edge cases. They made up around 10% of buildings. The remaining 85% of buildings were classified as suitable.

Effect of point threshold

Part of our approach is filtering out point clouds that contain less points than a defined threshold of 100 points. This is because buildings with fewer than 100 points are found to lack a rich representation. Figure 6 provides an impression of the number of LiDAR points per footprint for Westminster and Coventry. In Westminster (left) almost all buildings have more points than the threshold, except for a few garage or auxiliary garden buildings, which are visualized in red. On the other hand, the selected district in Coventry includes many buildings with less than 100 points. These are small townhouse buildings that simply have a small footprint area and consequently do not contain enough points. Some of the footprints (yellow) comprise just enough points to be above the threshold. Furthermore, garages and auxiliary buildings, too, are mostly below 100 points. While the point cloud threshold leads to a higher number of well-represented buildings, it also leads to a bias by removing small buildings. This can also be observed in the representativity analysis illustrated in Fig. 8.

Fig. 6
figure 6

Building footprints of Westminster (left) and Coventry (right) with number of points within their boundary. Westminster has only few buildings with less than 100 points (visualized in red), whereas Coventry displays larger numbers of those footprints in some areas of the city.

Linkage of EPC features to point clouds through UPRN

To join the point cloud with EPC data we used the footprint dataset’s UPRN feature. As an alternative, UPRN can be allocated to footprints through spatial intersection. However, the spatial intersection approach led to a lower number of buildings with UPRN. Therefore, we decided to use the UPRN feature provided by Verisk. Figure 7 shows the number of UPRN and EPC entries per footprint for buildings in our dataset.

Fig. 7
figure 7

Number of UPRN and EPC data points per footprint for our dataset.

In our PointER dataset23, 1,040,425 or 84.54% of the footprints have at least one UPRN. Furthermore, some footprints are linked to more than one UPRN. As the UPRN refers to a dwelling not to a building, this can be the case for multi-dwelling units. Furthermore, multiple UPRN per footprint could also arise due to erroneous UPRN allocation in some cases. Overall, we linked 27.14% of footprints with exactly one EPC entry and 56.91% of footprints remained without EPC data. Moreover, 15.95% of footprints had multiple EPC values. In our dataset, in the final result data frame, this is reflected by multiple rows that have identical point cloud filenames, but different EPC data. Multiple EPC links require an EPC selection approach. For example, the selection of the average or lowest EPC rating linked to a footprint could be used.

Comparing england’s building feature distribution with our dataset’s distribution

This section evaluates our subset’s representativity in terms of RUC, area, height, age class and EPC rating. The distribution in our dataset in comparison to England is visualized in Fig. 8.

Fig. 8
figure 8

Distribution of Rural Urban Classification, building area, height, age class and EPC rating of our dataset in comparison to all buildings in England included in UK Buildings dataset.

The graph at the top shows the asymmetry between the 10 rural-urban classes, which led to the challenge of achieving exact alignment of the RUC distribution. Nevertheless, the overall representation is similar, especially for the largest classes “Urban city and town”, “Urban major conurbation”, “Rural town and fringe”, “Rural village” and “Rural hamlets and isolated dwellings”. The building characteristics area and height are shown in the two center graphs and display high agreement, except for small buildings with areas under 50 m² and a height of 2–4 m. This can be attributed to our approach of filtering out buildings that contain too few points. Furthermore, the figure depicts a difference in the age class distribution, where our dataset comprises less unclassified buildings than the England dataset. Instead, our dataset contains slightly more modern, interwar, and historic buildings. Finally, a similar distribution of EPC ratings is key for the application in building energy modeling. The bar graph on the bottom right in Fig. 8 indicates a similar distribution of EPC ratings of our dataset in comparison to all of England’s footprints. Therefore, we conclude, that with our approach of selecting building footprints based on RUC, we derive a subset of buildings that possess a similar feature composition as the entire English building stock.

Usage Notes

As mentioned in the paragraph on UPRN and EPC per footprint, some footprints are linked to multiple energy characteristics, because a building can contain multiple dwellings. When users require linking point clouds to unique energy features, they first need to apply a selection process. As there are multiple approaches with advantages and disadvantages depending on the use case, we leave this step to future users.

Depending on the application, the dataset might need to be normalized (e.g. in a point cloud deep learning pipeline). Currently, coordinates are in the metric coordinate reference system EPSG 27700, the Ordnance Survey National Grid reference system. Buildings can be normalized in relation to the largest building if scale preservation is required. However, most point clouds will consequently only occupy a fraction of the normalized space, because large buildings are less common, as visible in the height and area distribution displayed in Fig. 8.

All data is licensed under the Open Government License, except for the building footprints. Hence, users can exploit the dataset both commercially and non-commercially, but have to acknowledge the sources of the data. The UKBuildings footprint data is provided by Verisk under a private license for research purposes. Therefore, we can use building footprints to generate point clouds, but we cannot include them in this dataset for download. To extend the dataset to other regions, we recommend contacting Verisk for their UKBuildings dataset. Alternatively, other footprint data, such as OSM can be used. When using OSM, regions with high data quality and coverage should be selected.

This dataset can be used in a range of applications. For example, explorative studies can investigate the explanatory power of LiDAR data for building energy characteristics, e.g. through the application of deep learning methods. Recent advances in deep learning methods for point clouds and their application are promising44,45,46,47 and we expect building point clouds to include significant information. Hence, studies can build point cloud classification models and predict EPC labels for all buildings in England and the UK.

Using our open source code, building point clouds can be generated for any location with available building footprint and LiDAR data. This way, future studies can be conducted across multiple countries. In addition, our dataset could be coupled with additional data sources such as aerial images, street view images, historic energy usage, or socio-demographic data.

Although our dataset is motivated by the challenge of modeling building energy efficiency, it is also relevant for applications outside of this field. Building point clouds can be used to evaluate architectural features or to support urban planning activities. Through the standardized Unique Property Reference Number (UPRN)30 other datasets can easily be linked to the building point clouds.