Vectorized rooftop area data for 90 cities in China

Reliable information on building rooftops is crucial for utilizing limited urban space effectively. In recent decades, the demand for accurate and up-to-date large-scale data on rooftop areas has been increasing. However, obtaining these data is challenging because of the limited capability of conventional computer vision methods and the high cost of 3D modeling based on aerial photogrammetry. In this study, a geospatial artificial intelligence framework is presented for obtaining rooftop data from high-resolution open-access remote sensing imagery. This framework is used to generate vectorized rooftop data for 90 cities in China. The data were validated on test samples covering 180 km² across different regions, with a spatial resolution of 1 m, an overall accuracy of 97.95%, and an F1 score of 83.11%. In addition, the generated rooftop areas conform to urban morphological characteristics and reflect the level of urbanization. These results indicate that the generated dataset can provide data support for decision-making that facilitates sustainable urban development.


Background & Summary
Rooftops of buildings have been intensively studied in recent decades in fields such as sustainable urban development, building energy modeling, and urban planning and design [1][2][3]. Owing to urbanization in the digital age, reliable information on rooftops is in increasing demand [4][5][6]. Rapid access to accurate rooftop information is important for evaluating urban and rural development trends, which in turn inform development strategies and the protection of urban and rural ecosystems [7][8][9]. However, data on rooftop areas are unavailable in many developing countries because of resource constraints. Therefore, low-cost methods for generating reliable data on building rooftop areas are urgently needed [10][11][12].
The automatic extraction of rooftop area data is gaining popularity in diverse fields, and studies involving varied data sources exist [13]. Three-dimensional (3D) spatial data, such as Digital Surface Model (DSM) and Light Detection and Ranging (LiDAR) data, are exploited for reconstructing buildings, which includes rooftop area representation and geometric modeling [13][14][15][16]. However, acquiring 3D spatial data and constructing the associated 3D models are costly, especially at the city scale. With the development of image processing algorithms such as edge detection and image segmentation, rooftop data can be extracted from high-resolution remote sensing imagery [17][18]. Conventional image processing techniques, however, involve complex empirical rules and threshold settings and thus exhibit limitations when applied to high-resolution remote sensing imagery at large scale [14].
Open-access data from public service providers, such as Google Earth, Baidu Map, and OpenStreetMap, provide opportunities for acquiring urban information with broad coverage, fast updates, and low cost [19-21]. However, although open-access data such as Google Earth Satellite (GES) imagery are valuable for obtaining information on cities, conventional processing methods rarely uncover in-depth semantic information and lack flexibility when handling large datasets with complex features [22].
In recent years, deep learning methods have been employed for efficient feature learning and urban information acquisition [23][24]. In particular, deep learning-based semantic segmentation methods have been applied to the extraction of rooftop data [25][26]. Nevertheless, optimizing rooftop extraction applications requires incorporating geographic information into the data acquisition process [27][28]. In addition, public rooftop area datasets suitable for use as training data in machine learning are scarce [29]. Further, naive deep learning models trained on unbalanced and insufficient samples perform unsatisfactorily [30]. Therefore, a robust, high-performance rooftop extraction model remains elusive in China.
In the present study, the main objective is to extract accurate rooftop areas in China from high-resolution open-access remote sensing imagery using a geospatial artificial intelligence (GeoAI) framework. The principal components of this framework are illustrated in Fig. 1. The following steps were employed for generating the rooftop area dataset: (1) data preparation through spatial stratified sampling that incorporates geospatial prior knowledge, together with a data processing pipeline, to increase the representativeness and number of samples; (2) creation of a deep learning segmentation model based on an ensemble learning strategy and an improved prediction method to improve the rooftop extraction performance.
Based on the framework, we developed a national-scale vectorized rooftop area dataset covering 90 cities in China. The data were validated on test samples covering 180 km² across different regions, with a spatial resolution of 1 m, an overall accuracy of 97.95%, and an F1 score of 83.11%.

Methods
Data collection. GES imagery. In the present study, open-access GES imagery served as the primary data source (Table 1). GES imagery is advantageous because of its high resolution and wide coverage. The images were downloaded in October 2020 using Python scripts that called the open map service application programming interface (Google Earth API) provided by Google. The spatial resolution of GES imagery varies with its image level. Level-18 GES imagery has a spatial resolution of approximately 0.6 m/pixel in most developed countries, which adequately displays the geometry and structure of different rooftops. However, because GES images originate from multiple sensors, differences in sensor performance cause the resolution to vary from region to region. For example, in China, images of major cities are usually obtained from the WorldView and QuickBird satellites, with an original resolution within 1 m. In contrast, images of remote areas are acquired from the SPOT series satellites, whose original resolution is within 5 m.
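As a rough check on the figure of approximately 0.6 m/pixel, the following sketch computes the nominal ground resolution of a map tile at a given image level. It assumes that the GES image levels follow the standard Web Mercator tile pyramid with 256-pixel tiles, which is not stated in the text; the function name and parameters are illustrative only. The result is a nominal pixel size and says nothing about the native resolution of the sensor behind the image.

```python
import math

# Equatorial circumference of the WGS 84 ellipsoid, in metres.
EARTH_CIRCUMFERENCE_M = 2 * math.pi * 6378137.0


def ground_resolution(zoom: int, latitude_deg: float = 0.0, tile_size: int = 256) -> float:
    """Nominal metres per pixel of a Web Mercator tile at a zoom level and latitude.

    Assumption: a standard tile pyramid in which level `zoom` covers the
    equator with tile_size * 2**zoom pixels.
    """
    pixels_around_equator = tile_size * 2 ** zoom
    shrink = math.cos(math.radians(latitude_deg))  # pixels cover less ground away from the equator
    return EARTH_CIRCUMFERENCE_M * shrink / pixels_around_equator


if __name__ == "__main__":
    for lat in (0.0, 22.5, 40.0):
        print(f"level 18, latitude {lat:4.1f} deg: {ground_resolution(18, lat):.3f} m/pixel")
```

Under these assumptions the nominal pixel size at level 18 is close to 0.6 m at the equator and becomes smaller at higher latitudes.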
FROM-GLC30. The spatial stratified sampling standard was based on a priori knowledge of urban land cover from the global 30 m resolution land cover data (FROM-GLC30) created in 2017 (Table 1). These data include the following types: cultivated land, woodland, grassland, shrubland, wetland, waterbody, tundra, artificial surface, bare land, glacier, and permafrost. The overall accuracy of the FROM-GLC30 data is 72.43% [31], based on a global all-season validation sample set from more than 36,000 locations.
Site selection. In the present study, we selected 90 cities in China (Table 2), which were partitioned into four tiers based on the city administration hierarchy established by the Chinese government. Tier 1 comprised the municipalities directly under the central government and the special administrative regions. Tier 2 comprised mainly sub-provincial cities, Tier 3 comprised provincial capitals and major prefecture-level cities, and Tier 4 comprised ordinary prefecture-level cities. The locations of these 90 cities are displayed in Fig. 2. The cities include all provincial capitals and span the major administrative levels in China, and their distribution covers different climate regions; the selection is thus economically, politically, and geographically representative. In terms of population, the 90 selected cities account for about 40% of China's total.
Information about the characteristics of the 90 cities in each tier is presented in Table 3 (collated from national statistics [32][33]; no data are available for Hong Kong and Macau). The administrative-center bias in resource allocation gives cities of higher administrative rank an advantage in expansion. Therefore, the average population and built district size increase from Tier 4 to Tier 1. Cities in Tier 1 and Tier 2 are mostly the economic centers of provinces, with developed economies and high urbanization rates, and usually show a multi-core urban morphology. Cities in Tier 3 and Tier 4, on the other hand, usually have a single-core urban morphology. In addition, the shape of cities in each tier generally varies with the topography. Cities in the plains tend to have a clumped shape, while cities along rivers and valleys tend to have a striped shape.
Data preparation. GES imagery preprocessing. The quality of a GES image varies with the imaging sensor, imaging time, and environmental factors (e.g., atmospheric condition and climate), all of which affect model training and generalization. Therefore, a standardization procedure for GES imagery is necessary. In the present study, the Gamma Correction algorithm [34] and the Contrast Limited Adaptive Histogram Equalization algorithm [35] were used to resolve brightness and sharpness issues.
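The brightness part of this standardization can be illustrated with a minimal gamma-correction sketch. The lookup-table approach and the `auto_gamma` heuristic (choosing the exponent so that the mean brightness value maps to a target level) are assumptions made for illustration; the text does not state how the gamma value was chosen. The contrast-limited equalization step is not reproduced here; in practice a library routine such as OpenCV's `cv2.createCLAHE` could be applied afterwards.

```python
import numpy as np


def gamma_correct(image: np.ndarray, gamma: float) -> np.ndarray:
    """Apply power-law (gamma) correction to an 8-bit image through a lookup table."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    levels = np.arange(256, dtype=np.float64) / 255.0
    lut = np.round(255.0 * levels ** gamma).astype(np.uint8)
    return lut[image]  # index the table with the uint8 pixel values


def auto_gamma(image: np.ndarray, target_mean: float = 0.5) -> float:
    """Hypothetical heuristic: exponent that maps the mean brightness to target_mean."""
    mean = float(np.clip(image.mean() / 255.0, 1e-6, 1.0 - 1e-6))
    return float(np.log(target_mean) / np.log(mean))


if __name__ == "__main__":
    dark = np.full((4, 4), 64, dtype=np.uint8)   # a uniformly dark test patch
    g = auto_gamma(dark)
    print(f"gamma = {g:.3f}, corrected level = {int(gamma_correct(dark, g)[0, 0])}")
```

A gamma below 1 brightens a dark image and a gamma above 1 darkens a bright one, which is the sense in which the correction evens out brightness across scenes.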
Spatial stratified sampling strategy. The study area involves several land cover types, and the proportion of unbuilt areas, including water, cultivated land, and forest, exceeds that of built-up areas. Simple random sampling would therefore create a class-imbalance problem, in which negative samples (non-rooftop samples) greatly outnumber positive samples (rooftop samples). To address this, a priori knowledge of the urban land cover from the FROM-GLC30 data was used to partition the study area into built-up and unbuilt areas. The built-up area contains mainly artificial surfaces, which readily yield positive samples. In contrast, the unbuilt area comprises water bodies, wetlands, grasslands, bare lands, cultivated lands, shrublands, and forests, which commonly provide negative samples. However, empirical evidence from previous studies in unbuilt areas shows that farmlands, bare lands, and intersections of different land cover types can be misclassified by the rooftop extraction model. Therefore, in the present study, the unbuilt areas were manually filtered to identify confusing areas, which supplied confusing negative samples.
To obtain representative and balanced positive and negative samples, spatial stratified sampling was employed in the sample acquisition. The results of stratified sampling in the built-up and confusing areas are displayed in Fig. 3, and this approach was used to produce patches from the GES images.
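The idea of drawing samples separately from each stratum can be sketched as follows. This is an illustration only: the sampling in this study was carried out in a GIS platform, and the stratum codes, the raster layout, and the function name below are hypothetical. The input is assumed to be a raster in which each cell has already been assigned to the built-up stratum, the confusing stratum, or neither.

```python
import numpy as np

# Hypothetical stratum codes for a raster derived from land cover data
# (built-up from artificial surfaces, confusing from manual filtering).
OTHER, BUILT_UP, CONFUSING = 0, 1, 2


def stratified_patch_centers(strata: np.ndarray, n_per_stratum: dict, seed: int = 0) -> dict:
    """Draw random cell coordinates independently from each requested stratum.

    Returns a dict mapping stratum code to an array of (row, col) pairs.
    If a stratum has fewer cells than requested, all of its cells are returned.
    """
    rng = np.random.default_rng(seed)
    samples = {}
    for code, n in n_per_stratum.items():
        rows, cols = np.nonzero(strata == code)
        if rows.size == 0:
            samples[code] = np.empty((0, 2), dtype=int)
            continue
        picks = rng.choice(rows.size, size=min(n, rows.size), replace=False)
        samples[code] = np.column_stack((rows[picks], cols[picks]))
    return samples


if __name__ == "__main__":
    grid = np.zeros((20, 20), dtype=np.uint8)
    grid[:10, :10] = BUILT_UP      # toy built-up block
    grid[15:, 15:] = CONFUSING     # toy confusing block
    centers = stratified_patch_centers(grid, {BUILT_UP: 5, CONFUSING: 5})
    print({code: pts.tolist() for code, pts in centers.items()})
```

Because the number of samples is fixed per stratum rather than per unit area, the small built-up and confusing strata are not swamped by the much larger remaining area.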
Sample processing and division. In the present study, sample images were manually labeled as positive or negative. Patches of the image samples and the corresponding masks were then divided into training samples (which served as input for training the rooftop extraction model) and test samples (which served as ground truth for validating the rooftop area dataset).
For the rooftop extraction model training task, the built-up area covered by the image samples and their corresponding masks was 200 km², while the confusing area was 160 km²; the training samples thus covered 360 km² in total. According to previous deep learning studies, larger training data volumes produce more robust models. However, in practice, available data are usually limited. To address this limitation, data augmentation that does not change the labeled categories was conducted, thereby enhancing the generalization potential of the model. The data augmentation operations used in the present study were random cropping, image rotation, image flipping, image blurring, and noise addition.
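A minimal sketch of label-preserving augmentation is given below. It is an illustration under stated assumptions, not the pipeline used in this study: the crop size, the noise level, and the restriction to 90-degree rotations are arbitrary choices, and blurring is omitted for brevity. The point it shows is that geometric operations are applied identically to the image and its mask, whereas photometric operations touch the image only.

```python
import numpy as np


def augment(image: np.ndarray, mask: np.ndarray, crop: int, rng: np.random.Generator):
    """Return a randomly cropped, rotated, flipped and noised (image, mask) pair.

    image: H x W x C uint8 array; mask: H x W array of class labels.
    """
    h, w = mask.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    image = image[top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]

    # Geometric operations: identical for image and mask, so labels stay aligned.
    k = int(rng.integers(0, 4))                 # number of 90-degree rotations
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:                      # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]

    # Photometric operation: additive Gaussian noise on the image only.
    noise = rng.normal(0.0, 5.0, size=image.shape)
    image = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return image, np.ascontiguousarray(mask)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    msk = (img[..., 0] > 127).astype(np.uint8)
    out_img, out_msk = augment(img, msk, crop=32, rng=rng)
    print(out_img.shape, out_msk.shape)
```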
Modelling. Naive rooftop extraction model. Rooftop areas of cities were extracted from GES images using the DeepLabV3+ model. DeepLabV3+ is an open-source image semantic segmentation model released by the Google R&D team [36]. In GES images, rooftop areas exhibit varied sizes and shapes because of differences in architectural styles and dimensions [37]. Rooftop edges are sometimes difficult to identify accurately because the GES image quality is affected by weather conditions [38]. DeepLabV3+ can perceive features at different scales, thereby improving the recognition accuracy for diverse rooftops. In addition, DeepLabV3+ transforms the feature map into a constant-resolution map through its encoder-decoder structure, which mitigates the blurred-edge problem in rooftop area extraction [39].
In the present study, the cross-entropy [40] and dice [41] loss functions were integrated into a composite loss function that simultaneously handles the class-imbalance problem of the samples and accelerates the convergence of training. In the composite loss function (Eq. 1), p_i denotes the predicted value of the i-th sample, p_i* represents the ground truth value of the i-th sample, L_dice(·) is the dice loss function, L_bce(·) is the cross-entropy loss function, and α and β are the weight coefficients of the loss function, with corresponding values of 0.2 and 0.8 in the present study.
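A composite loss of this kind is commonly written as a weighted sum of the two terms, and the sketch below follows that common form. This is an assumption: the exact expression, the smoothing constant in the dice term, and which of the two weight values (0.2 and 0.8) applies to which term are not recoverable from the text, so the weights are left as explicit arguments with neutral names.

```python
import math


def bce_loss(pred, truth, eps=1e-7):
    """Mean binary cross-entropy over per-pixel rooftop probabilities."""
    total = 0.0
    for p, t in zip(pred, truth):
        p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, truth, smooth=1.0):
    """One minus the smoothed Dice coefficient (the smoothing term is an assumption)."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(truth) + smooth)


def composite_loss(pred, truth, w_bce, w_dice):
    """Weighted sum of the cross-entropy and Dice losses."""
    return w_bce * bce_loss(pred, truth) + w_dice * dice_loss(pred, truth)


if __name__ == "__main__":
    truth = [1, 0, 0, 0, 0, 0, 0, 0]                 # one rooftop pixel among eight
    all_background = [0.01] * 8                      # prediction that misses the rooftop
    print("cross-entropy:", round(bce_loss(all_background, truth), 3))
    print("dice:         ", round(dice_loss(all_background, truth), 3))
```

In the toy example, predicting background everywhere gives a modest cross-entropy because most pixels are correct, while the dice term stays comparatively large because the single rooftop pixel is missed; this is the usual motivation for combining the two on imbalanced samples.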
Expansion prediction. To extract rooftop areas, standardized GES images served as input for the ensemble model. Because the original GES images were significantly larger than the input dimension required by the model, they had to be cropped into smaller patches for prediction. However, cropping creates uneven transitions (the stitching problem) at the seams between the predictions of adjacent patches [42]. The expansion prediction technique is suitable for eliminating these uneven transitions at the seams.
The steps implemented in the expansion prediction (Fig. 4) were as follows: (1) An n × n sliding window with a step size of n was created. As the window moved, it was expanded to m × m, and the original remote sensing image was cropped into small patches of this expanded size. (2) The cropped images were concatenated into a tensor. (3) The ensemble rooftop extraction model was then used to predict the tensor obtained in step 2. (4) The central n × n portion of each tensor element was extracted and split into patches. (5) The patches obtained in step 4 were stitched to produce a large predicted image, which was then cropped to the extent of the original GES image.
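The five steps above can be sketched for a single-band image as follows. The reflection padding at the image border, the requirement that m - n be even, and the `model` callable (any function mapping a batch of m × m patches to a batch of m × m predictions) are assumptions introduced for the illustration and are not taken from the released code.

```python
import numpy as np


def expansion_predict(image: np.ndarray, model, n: int, m: int) -> np.ndarray:
    """Predict a large 2-D image patch by patch, keeping only the centre of each patch."""
    if m < n or (m - n) % 2 != 0:
        raise ValueError("m must be >= n and m - n must be even")
    margin = (m - n) // 2
    h, w = image.shape
    # Pad so the image splits into whole n x n windows, plus a margin on every side.
    pad_h, pad_w = (-h) % n, (-w) % n
    padded = np.pad(image, ((margin, margin + pad_h), (margin, margin + pad_w)), mode="reflect")

    # Steps 1-2: slide an n x n window with stride n, expand it to m x m, stack the crops.
    origins = [(r, c) for r in range(0, h + pad_h, n) for c in range(0, w + pad_w, n)]
    batch = np.stack([padded[r:r + m, c:c + m] for r, c in origins])

    # Step 3: run the model on the whole batch.
    preds = model(batch)

    # Steps 4-5: keep the central n x n part of each prediction, stitch, crop to the input size.
    out = np.empty((h + pad_h, w + pad_w), dtype=preds.dtype)
    for (r, c), p in zip(origins, preds):
        out[r:r + n, c:c + n] = p[margin:margin + n, margin:margin + n]
    return out[:h, :w]


if __name__ == "__main__":
    img = np.arange(10 * 13, dtype=np.float64).reshape(10, 13)
    threshold_model = lambda batch: (batch > 60).astype(np.uint8)   # stand-in for a network
    print(expansion_predict(img, threshold_model, n=4, m=8))
```

Each output pixel is thus predicted with at least (m - n) / 2 pixels of surrounding context, so predictions near the seams are no longer made at the edge of a model input.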
Model training. The overall network parameters were fine-tuned from weights pre-trained on the MS COCO dataset [43] with an Xception backbone network [44], while the AdamW optimizer [45] and the Cosine Annealing Warm Restarts algorithm [46] were employed for rapid convergence of the network. Four Tesla V100 GPUs were used to train the model with mixed-precision training, and the development was based on the PyTorch framework [47]. The hyperparameter settings are presented in Table 4.
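The learning-rate schedule produced by cosine annealing with warm restarts can be sketched directly from its usual definition. The base learning rate, cycle length, and cycle multiplier below are placeholder values, not the settings of Table 4. In PyTorch the same schedule is provided by `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.

```python
import math


def sgdr_learning_rate(epoch, base_lr, eta_min=0.0, t_0=10, t_mult=2):
    """Learning rate at a (possibly fractional) epoch under cosine annealing with warm restarts.

    t_0 is the length of the first cycle in epochs; each later cycle is
    t_mult times longer than the one before it.
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:          # skip over the cycles already completed
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


if __name__ == "__main__":
    for epoch in (0, 5, 9, 10, 20, 29, 30):
        print(epoch, round(sgdr_learning_rate(epoch, base_lr=1e-3), 6))
```

The rate decays along a cosine curve within each cycle and jumps back to the base value at every restart; the models reached near the end of each cycle are the natural candidates for the local optima mentioned in the text.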
After the model reached convergence, rooftop areas were predicted using snapshot models [48] combined through the union ensemble rule, together with the expansion prediction technique. The union ensemble rule is defined in Eq. 2. Considering the trade-off between the quality and speed of prediction, two local optimal models (in this work, the checkpoints of the 29th and 55th epochs) were integrated into the ensemble rooftop extraction model. In addition, the TensorRT framework was used to accelerate model inference at FP16 precision. Accordingly, the prediction speed increased approximately 5-fold without a significant loss in accuracy, reducing the prediction time from 100 to 20 days.
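Read literally, a union ensemble of binary segmentation masks marks a pixel as rooftop if any sub-model predicts rooftop there. The sketch below implements that reading; it assumes each sub-model output has already been converted to a binary category matrix, and the function name is illustrative.

```python
import numpy as np


def union_ensemble(masks) -> np.ndarray:
    """Pixel-wise union (logical OR) of the binary masks predicted by several sub-models."""
    stacked = np.stack([np.asarray(m).astype(bool) for m in masks])
    return np.logical_or.reduce(stacked, axis=0).astype(np.uint8)


if __name__ == "__main__":
    snapshot_a = np.array([[1, 0, 0], [0, 0, 0]])   # toy prediction of one checkpoint
    snapshot_b = np.array([[1, 1, 0], [0, 0, 1]])   # toy prediction of another checkpoint
    print(union_ensemble([snapshot_a, snapshot_b]))
```

A union can only add rooftop pixels relative to any single sub-model, so it tends to raise recall at some cost in precision.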
In Eq. 2, P_i denotes the matrix of the prediction category of the i-th sub-model, P_union-ensemble represents the result of the union integration, and N is the number of sub-models.
File structure. The rooftop area data of each city contain three fields (area, X, and Y), as presented in Table 5.
All area fields are in double-precision floating-point format, in square meters, calculated in the CGCS 2000 Albers projected coordinate system. All X and Y fields are in double-precision floating-point format, in decimal degrees, referenced to WGS 1984 Web Mercator Auxiliary Sphere coordinates.
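For readers who wish to recompute or check an area value, the planar area of a polygon whose vertices are expressed in a projected, equal-area coordinate system (in meters) can be obtained with the shoelace formula. The sketch below is a generic illustration; it assumes a simple polygon without holes and is not the procedure used to populate the area field.

```python
def polygon_area(vertices):
    """Planar area of a simple polygon from its (x, y) vertices, using the shoelace formula.

    Vertices are given in order (clockwise or counter-clockwise) as a list,
    without repeating the first vertex at the end.
    """
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


if __name__ == "__main__":
    # A hypothetical 10 m x 20 m rectangular rooftop in projected coordinates (meters).
    rooftop = [(500000.0, 3000000.0), (500010.0, 3000000.0),
               (500010.0, 3000020.0), (500000.0, 3000020.0)]
    print(polygon_area(rooftop), "square meters")
```

The formula is only meaningful in a planar coordinate system; applying it to coordinates in decimal degrees would not give square meters.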

Technical Validation
Sampling design. Based on the spatial stratified sampling method proposed in this study, the test samples used for validating the rooftop area dataset were obtained from GES images and manually labeled with the ground truth through visual interpretation. To better reflect the quality of the rooftop extraction results in each city tier, 45 km² of samples were created for each of the four tiers, producing a test dataset covering 180 km², in which the built-up and confusing areas covered by the image samples and their corresponding masks were 100 and 80 km², respectively.
Analysis design. Qualitative and quantitative evaluation criteria were used to validate the generated rooftop area dataset. For the qualitative evaluation, morphological and topological characteristics of the ground truth and the extracted rooftop area data were compared for the various city tiers. For the quantitative evaluation, the testing dataset covering 180 km² (45 km² for each of the four city tiers) was employed. The indicators used, all calculated from the confusion matrix [52], were accuracy, precision, recall, and F1 score. The closer the values of these indicators are to 100%, the higher the quality of the rooftop extraction. Precision is equivalent to user's accuracy, which is a measure of exactness, and recall is equivalent to producer's accuracy, which is a measure of completeness. Precision and recall can also be expressed in terms of error, as commission error (1 - precision) and omission error (1 - recall), respectively. In particular, the F1 score, which is the harmonic mean of recall and precision, is an important indicator for the comprehensive evaluation of rooftop extraction results. These data and indicators were then used to evaluate the agreement between the rooftop extraction results and the ground truth [53].
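The indicators described above follow directly from the four entries of the confusion matrix, as the sketch below shows. The pixel counts in the example are invented for illustration and are not taken from Table 6.

```python
def rooftop_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and the two error rates from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0   # user's accuracy
    recall = tp / (tp + fn) if tp + fn else 0.0      # producer's accuracy
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "commission_error": 1.0 - precision,
        "omission_error": 1.0 - recall,
    }


if __name__ == "__main__":
    # Invented counts: rooftop pixels are a small minority of the scene.
    for name, value in rooftop_metrics(tp=80, fp=20, fn=20, tn=880).items():
        print(f"{name}: {value:.2%}")
```

In this invented example the accuracy is 96% while the F1 score is 80%. Because non-rooftop pixels dominate, the many true negatives raise the accuracy but do not enter the F1 score, which is why an overall accuracy well above the F1 score is to be expected for this kind of data.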
Validation results. The validation results for the various city tiers are presented in Table 6. The overall accuracy for all city tiers is 97.95%, while the F1 score is 83.11%. However, the overall quality of the rooftop area dataset is better for high-tier cities than for lower-tier cities. Based on empirical exploration, we found that the GES image quality varied among cities according to the imaging sensor, imaging time, and environmental factors such as atmospheric condition and climate. These image quality differences affect model training and generalization. The quality of GES images is better for high-tier cities, so the model extraction results are generally better there, which explains the quality difference in the rooftop area dataset between city tiers. Therefore, when this dataset is used for other applications, the impact of regional differences in quality on the specific application should be evaluated. In addition, we validated the quality of the rooftop area dataset for different city tiers using two types of sampling areas, as shown in Fig. 5. Each submap corresponds to a ground extent of 1 km², and the indicators for evaluating the rooftop extraction results are given beneath it. We also visualized the elements TN (True Negative), TP (True Positive), FN (False Negative), and FP (False Positive) of the confusion matrix. In this way, the commission and omission errors in the extraction results are clearly indicated by FP (in blue) and FN (in orange), respectively. The rooftop extraction in the built-up area is better than that in the confusing area, which explains why we used a stratified sampling strategy when collecting samples. In general, the extraction results adequately delineate the characteristics of the rooftops and are consistent with the GES images.
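A per-pixel map of the four confusion-matrix elements, of the kind visualized in Fig. 5, can be derived from a predicted mask and a ground truth mask as sketched below. The integer codes are arbitrary choices made for the illustration; assigning display colors to them is left to the plotting step.

```python
import numpy as np

# Arbitrary integer codes for the four confusion-matrix elements.
TN, TP, FN, FP = 0, 1, 2, 3


def error_map(pred, truth) -> np.ndarray:
    """Label every pixel as TN, TP, FN (omission) or FP (commission)."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    out = np.full(pred.shape, TN, dtype=np.uint8)
    out[pred & truth] = TP
    out[~pred & truth] = FN      # rooftop in the ground truth, missed by the model
    out[pred & ~truth] = FP      # predicted rooftop with no rooftop in the ground truth
    return out


if __name__ == "__main__":
    predicted = [[1, 1, 0], [0, 0, 0]]
    reference = [[1, 0, 0], [1, 0, 0]]
    print(error_map(predicted, reference))
```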
Our dataset also includes small, sparsely distributed, and irregularly structured rooftops in confusing areas, which indicates that the rooftop area dataset generated by the proposed framework captures fine details well and effectively avoids interference from complex background information.
Examples of the rooftop area dataset for different city tiers are shown in Fig. 6. The extracted rooftop areas exhibit clustering, which is consistent with the spatial morphology and boundaries of the cities. Therefore, the spatial distribution of rooftop areas helps in understanding the urbanization level and urban planning needs of cities. For example, Shanghai is a high-density mega-city in China, and its flat and vast plains and numerous waterways and ports favor the establishment of its polycentric urban spatial system. The concept of intensive development has also resulted in a more compact urban space in Shanghai. In contrast, the eastern and northern parts of Harbin are mountainous and hilly, and the expansion of the city is limited by natural conditions, so the urban space is loose and the urban area is mainly distributed in the central and western plains.
Specific details of the rooftop area dataset for different city tiers are displayed in Fig. 7, using Beijing, Nanjing, Jilin, and Hengshui as examples. Compared with the GES images, the rooftop extraction results show considerable detail for the various cities, with rooftops effectively distinguished from complex backgrounds. Moreover, areas of both dense and sparse spatial distribution are accurately extracted. The extraction results also show no sign of the stitching problem, owing to the application of the expansion prediction.
Meanwhile, Fig. 7 reveals that the GES images used in the present study were not adequately orthorectified, which partially accounts for the positional offset between the extracted rooftop areas and the ground truth. This offset is significant (approximately 20 m, based on an empirical survey) for a few high-rise buildings, but it is negligible for low- and medium-rise buildings. Therefore, if the rooftop area data are used for estimation at the scale of a major city or a country, such positional offsets will have minimal impact on the results. Nevertheless, the rooftop extraction results obtained in the present study provide high-quality details suitable for supporting many architecture-oriented applications.
The empirical evidence from our study of 90 Chinese cities shows that the proposed method can be quickly generalized at a large scale and is robust in regions with different characteristics. To extend it to regions outside China, we suggest fine-tuning the existing model with new feature samples to enhance its applicability in the target region.

Code availability
The spatial sampling procedure was executed on the ArcGIS Pro platform. The code of the deep learning model is available at https://github.com/ChanceQZ/RoofTopSegmatation. The program is written in Python 3 and relies mainly on the PyTorch, NumPy, and OpenCV packages.

Acknowledgements
We appreciate the detailed suggestions and comments from the anonymous reviewers. We express heartfelt thanks to the other members of the Smart City Sensing and Simulation lab, who undertook data collection and annotation work. This work was supported by the National Natural Science Foundation (Grant 41930648), Joint Fund Project of National Natural Science Foundation of China (Grant U1811464), and National Research Foundation Singapore. The data of this work are licensed and hosted by the National Tibetan Plateau Data Center.

Author contributions
Z.X.Z. investigated the study area and related work, designed the geospatial stratified sampling algorithm, collected and processed the GES images, and wrote the manuscript. Z.Q. designed, trained, and applied the rooftop extraction model, developed the parallel computing programs, simplified the rooftop area dataset, and wrote the manuscript. T.Z. assisted with validation and quality control and reviewed the manuscript. M.C. designed and condensed the experimental ideas, guided the experimental work, reviewed the manuscript, and provided project funding support. K.Z. collected and