Background & Summary

Land use, land cover, and settlement databases are typically remote sensing derived or combined products that have made significant contributions to the scientific study of environmental and human systems, but they are limited in their temporal coverage and may suffer from low classification accuracy and limited thematic depth1,2,3,4. Furthermore, lack of processing infrastructure has created significant obstacles towards advancing our understanding of historical settlement development5,6,7. With increasing data availability and technological advances, large-scale historical-spatial data infrastructures become increasingly feasible and popular in the social and natural sciences8,9. As such, data products like the National Land Cover Dataset (NLCD)1,10,11 or the Global Human Settlement Layer (GHSL)12 typically characterize physical properties of surfaces measured through remotely sensed signals over time but cannot depict thematic details of settlements (e.g., land use classes). No such data exists prior to the 1970s when remote sensing-based earth observation became operational at a global scale. Consequently, researchers are able to evaluate and quantify changes in developed land, the intensity of development, or the proportion of built-up land over a few decades but have a limited understanding of the semantic and functional components of building- and property-related land use and its changes. Furthermore, existing datasets extending farther back in time are typically model-based, of unknown accuracy and of low spatial detail13,14.

Significant advancements in our understanding of rural-urban development can only be made if we are able to capture the underlying spatio-temporal processes that contribute to land change at fine scale. However, to date, significant obstacles in alternative data availability and the computational costs of extracting relevant information15,16,17, the low spatial detail contained in historical records18 and limited geographic coverage19,20,21 have hindered our ability to produce fine resolution data layers that depict different aspects of land development in urban and rural settings over longer time periods and over large spatial extents.

The multi-temporal land use layers described in this article fundamentally differ from previous instalments of land cover/land use data for its attribute richness, temporal extent and fine temporal and spatial resolution. We detail the creation and properties of a novel gridded data product featuring built up land use progression in the United States from 1940 to 2015. This product was created using the Zillow Transaction and Assessment Dataset (ZTRAX), which is a collection of more than 200 million geocoded housing and property-level records, collected from existing cadastral data sources22. These records were rasterized to generate two primary datasets covering the period 1940–2015, a land use majority layer with an annual temporal resolution and class-specific property count layers with a semi-decadal temporal resolution, at a spatial resolution of 250 m for most of the contiguous United States (CONUS), encompassing six different land use classes.

This unique dataset has the potential to transform our understanding of how the compositions of communities and urban centres in the U.S. have developed over 75 years. These data products will be highly useful to researchers in the social and natural sciences, and applicable to studies related to urban development, vulnerability, and natural hazards. While ZTRAX is a proprietary data source, the data derivatives described herein are disseminated as public data to the research community. The historical land use data products are published as part of the Historical Settlement Data Compilation for the U.S. (HISDAC-US)23,24 and will be accessible through Harvard Dataverse (https://dataverse.harvard.edu/dataverse/hisdacus). HISDAC-US has been used in several recent studies on urban development and change25,26,27,28,29,30, landscape change analysis and modelling31, transportation infrastructure analysis32, population modelling33, as well as natural hazard risk assessment34,35.

Input Data and Methods

Semantic aggregation of ZTRAX land use types

The ZTRAX dataset contains information on more than 200 million parcels using over 400 million public records22. Third party providers and internal initiatives were used to collect data from assessor information and publicly available documentation. The attribute richness of this dataset offers unique opportunities to explore land use progression and the built environment through novel and compelling perspectives. Recently, ZTRAX has gained increased popularity in the natural and social sciences36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54.

The presented land use data product contains six thematic classes of the built environment, which represent land use types of built structures. The six thematic classes described in the data presented herein include: agriculture, commercial, industrial, residential-owned, residential-income, and recreational facilities. The six classes used herein represent a subset of the rich land use classification used in ZTRAX (300 + land use classes total) and were chosen for their importance in studying urban dynamics and development55,56,57,58,59. There are 12 general thematic classes contained in ZTRAX; agriculture, commercial, exempt, government, historical, industrial, miscellaneous, private, residential, recreational, transportation, and vacant. Due to the low overall representation and incompleteness of several classes and the importance of the 6 contained in the described data product, 7 of the classes (exempt, government, historical, miscellaneous, private, transportation, and vacant) were omitted from the data and the residential class was subdivided into residential-owned and residential-income. These omissions are reflected in the uncertainty shapefiles and gridded layers we have provided and characterized by the county cumulative sum attribute and grid cell counts. Moreover, we report the subclasses of the included and excluded land use categories and their frequencies60.

The agricultural thematic class in ZTRAX contains 23 subclasses that define agricultural land parcels in greater detail. For our purpose, all non-structural (i.e., not built up) agricultural subclasses were removed from the data prior to processing, keeping all structures such as farms, ranches, miscellaneous structures, and non-residential rural structure improvements. This step ensures that all classes are defined based on built-up structures and not the general use of the land. Examples of excluded agricultural land uses are grazing land, crop land, and other uses that do not describe a physical structure. The reader is directed to the circular histogram60 for a complete breakdown of all 300 + land use types and the frequencies in which they appear in the ZTRAX database.

The commercial sector contains 65 subclasses that range from office and medical buildings to dry cleaners, casinos, and gas stations - no data were removed from this class. For the industrial theme, 44 subclasses are included in the ZTRAX data that differentiate between heavy industrial buildings such as labour camps, quarries, and slaughterhouses as well as lighter industrial facilities such as assembly plants, recycling centres, and loft buildings. The residential (or housing) sector is broken down into two primary categories 1) residential-owned (RO) or residential structures that are owned by a residential account holder who owns the property at the service address of record (https://www.lawinsider.com/dictionary/residential-owner), and 2) residential-income (RI) or residential structures that have been zoned as rented or leased dwellings (i.e., not occupied by the owner)60. The housing sector contains 36 subclasses that describe residential housing, all of which were included in the final gridded product.

Finally, the recreational land use class includes recreational facilities that contain 32 subclasses including bowling alleys, playgrounds, zoos, and dance halls. The land use attribute can have three levels of granularity. At the finest granularity, attributes differentiate between aspects of subclasses such as the quality of duplex housing. For the presented data products, we included attributes limited to the primary thematic classes. Table 1 shows the progression of records for these land use classes since 1940 and the circular histogram60 shows the sub-classes included in each thematic class represented in the data.

Table 1 The cumulative number of ZTRAX property records per land use theme and year.

Gridded surface creation

The ZTRAX database is based on cadastral parcel and tax records obtained from state and county records and contains more than 400 million total records22, of which approximately 200 million have spatial information. This spatial information typically consists of address points or approximate parcel centroid locations. Due to differing reporting practices from county to county there are swaths of the country that are poorly represented, particularly in the Midwest and Louisiana. According to Zillow’s documentation legal transactions of a house are processed by the county recorder’s office, and it is somewhat common for county recorders to not record the address or assessor parcel number (APN) on the legal records. In such cases it is not possible to systematically map these records to the specific parcels involved (https://www.zillow.com/research/ztrax/ztrax-faqs/). The lack of APN or address manifests as areas of no data (e.g Wisconsin & Louisiana) in the dataset presented in herein.

We converted the ZTRAX data (available as CSV data) into a set of relational databases for efficient querying. We extracted property locations, land use and built year attributes and assigned a bi-dimensional spatial index (i.e., a grid cell identifier) to each record, referenced to a 250 m × 250 m grid in Albers Equal Area Conic projection (SR-ORG:7480) (https://spatialreference.org/ref/sr-org/7480/). This grid is consistent with the spatial grid used in other data products of the HISDAC-US. Then, we separated the data into complete records, and records with missing built year or land use attributes (Section 2.3). We rasterized the complete data records to generate grid cell level land use statistics in annual and semi-decadal cross-sections as defined by a built year attribute of each record23,24. Specifically, these records were grouped into spatio-temporal bins (as defined by the grid cell identifier and the built year attribute) and processed to determine the most frequently occurring land use type per grid cell annually from 1940–2015. These summary statistics (i.e., most frequent land use type per grid cell id and year) were then used as input for the rasterization process. We used the Numpy61 and Rasterio62 Python packages to generate gridded surfaces in GeoTiff format. We also used this process to calculate the counts of each primary land use class per grid cell for each semi-decade starting in 1940. The data with missing spatial references and missing attribute values (i.e., built year, land use type) were then used to calculate various uncertainty statistics using the pandas63 and geopandas64, and base Python packages65. Thus, this data product consists of three items: (a) annual predominant (majority class) land use layers, (b) semi-decadal layers measuring the number of properties of each land use class per grid cell, and (c) accompanying uncertainty surfaces.

Creation of uncertainty layers

Uncertainty in the created data consists of several components: (a) ZTRAX data incompleteness due to attribute missingness or missing geolocation (i.e., non-georeferenced records), (b) survivorship bias due to historical land use changes not captured by ZTRAX, and (c) thematic uncertainty in the land use attribute. While the latter two types of uncertainty are attempted to be quantified by means of ancillary data (see Section 3), the ZTRAX attribute incompleteness can be quantified directly and is reported in accompanying datasets as additional county-level and grid-cell level summary statistics. Moreover, ZTRAX suffers from positional inaccuracies due to the approximation of areal parcel units by discrete point locations and related issues50. In earlier work, we quantified and reported positional, thematic and temporal uncertainties in existing settlement layers and provided uncertainty layers hosted in the HISDAC-US repository23,24 (https://dataverse.harvard.edu/dataverse/hisdacus). These uncertainty layers are highly recommended for users interested in applying the historical settlement data products. The uncertainty layers described here focus on data missingness in creating time sequences of gridded land use layers at the county-level and at the grid cell-level.

We calculated the proportion of records with missing land use attributes and/or missing geolocation to quantify the county-level uncertainty. We determined the total count of records in each county and calculated the proportion of records with and without a georeference. Moreover, we cross-tabulated attribute missingness for the built year (“by”) and land use (“lu”) attributes:

  1. 1.

    proportion of records that contained both a land use and year-built value (“by-lu”)

  2. 2.

    proportion of records without both land use and year-built values (“nby-nlu”)

  3. 3.

    proportion of records with land use and no built year (“nby-lu”)

  4. 4.

    proportion of records with a built year and no land use (“by-nlu”)

In order to further characterize the county-level attribute missingness over time, we generated decadal, county-level shapefiles containing the proportion of records with valid year-built attribute, but missing land use attribute per county. This resulted in seven county boundary files, one for each decade within the temporal coverage of our data. For each decade we then calculated proportions of georeferenced records, proportions of records that had both the built year and land use attribute, as well as the proportions of records with missing land use attribute for that decade. Additionally, we created grid cell-level uncertainty layers. To gain a fine-resolution understanding of uncertainty, we generated gridded uncertainty layers using the georeferenced records in the ZTRAX data. Gridded time sequences (decadal) were created using all the georeferenced records that had a value for the built year but no land use attribute to quantify the proportion of missing land use type entries at the grid cell level. These uncertainty layers are recommended for data users to integrate in their analysis to be able to account for varying data quality, both regionally and over time.

Data Records

Historical gridded land use layers

The datasets described in the following sections have been published in the Harvard Dataverse HISDAC-US repository at the following URL https://dataverse.harvard.edu/dataverse/hisdacus66,67,68. The multi-temporal land use surfaces are organized as sequences of georeferenced gridded layers (file names include the year e.g., LU_ThemeMaj_1985) covering most of the built-up areas in CONUS (excluding Hawaii, Alaska, and non-covered counties) with a spatial resolution of 250 m and a temporal resolution of one year for the majority class data product, and 5 years for the class-specific layers. In the main data product, each grid cell value represents the most frequently occurring land use class among all ZTRAX records located within that grid cell, for a given year (for all georeferenced records with both a built year and a land use designation). Additionally, for each individual land use class, we created a time sequence of gridded count layers representing the number of records of that land use class (e.g., industrial) located within a grid cell for each semi- decade starting in 1940. These layers have the land use class and the year included in their file names (e.g., LU_ThemeCount_RO_1975, for residential-owned structures in 1975). These data products cover the time period 1940–2015. We have provided the raster layers in GeoTIFF format with a spatial resolution of 250 m. We aligned these layers to the existing layers in the HISDAC-US to ensure consistency across settlement data products housed in that data compilation. We have published all data in the HISDAC-US repository using the Albers Equal Area Conic projection for the contiguous US (USGS version, SR-ORG:7480).

Figure 1 illustrates various aspects of the land use data package that offer novel perspectives on urban development. The dataset allows the user to understand urban growth in terms of land use change, not only through thematic majority but count surfaces that characterize growth of land use classes over time. The top 3 rows in Fig. 1 show the cumulative counts for commercial, residential-income, and residential-owned land use classes, at three points in time in Houston, Texas. The bottom row in Fig. 1. displays the cumulative counts for all other land uses classes, Agriculture, Industrial, and Recreational. Additionally, we have generated contemporary (i.e., 2016) count surfaces for the thematic classes and accompanying uncertainty surfaces. These 2016 layers also contain those records that lack a built year record and thus represent a more complete picture of more recent land use patterns.

Fig. 1
figure 1

Land use-specific property counts in 1945, 1985, and 2015, for Houston, Texas. The top 3 rows display theme specific counts. The bottom row displays the aggregated counts of the agricultural, industrial, and recreational land use classes.

Uncertainty surfaces

As described above, we have created several uncertainty surfaces in order to provide information on basic data quality aspects. Data completeness and multi-variable processing quickly creates a complex picture of uncertainty. There are two categories of uncertainty layers that we have provided: vector files with multi-temporal data in the attribute table aggregated to the county level (2010-boundaries) (https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html) and multi-temporal gridded uncertainty layers consistent with the main data products.

Seven county-boundary vector layers have been provided, one for each decade. Each layer provides the proportion of records that are georeferenced and the proportion of records that are not georeferenced but located in that county based on the county identifier, for each decade. Additionally, we have provided the proportion of records that have a built year and the proportion of records that have a land use attribute for that county and year for both the georeferenced and non- georeferenced data. Due to the exclusion of poorly represented land use types (i.e. exempt, government, historical buildings) there are counties in the uncertainty surfaces that have more cumulative structures listed in the year-built attribute than listed in the land use attribute column. The instances in which there are more structures for a given year than counted in the land use attribute represent structures with land use types that were omitted from the land use data. We provided an additional decadal gridded uncertainty layer to address the structures that were excluded from the dataset. For each decade we calculated the cumulative number of excluded structures per grid cell. There are 6 thematic classes excluded from the main data product and 12 agricultural sub-classes that were excluded as they did not characterize built up structures. The 6 non-agricultural thematic classes represented by this gridded uncertainty layer are: (1) Exempt, (2) Historical, (3) Miscellaneous, (4) Privately Owned, (5) Transportation, and (6) Vacant. The data user is urged to use those layers, and the detailed land use disaggregation60 to inform their analysis using baseline data qualities and completeness information.

The multi-temporal gridded uncertainty layers for all georeferenced data quantify missingness in land use type entries at the same resolution as the main data product. Each surface represents only those structures that were explicitly geocoded in ZTRAX. There are five attributes that characterize uncertainty in the county level shapefiles: (1) the cumulative sum of all structures contained in ZTRAX for each county and decade, (2) the cumulative sum of all structures containing a land use attribute per county and decade, (3) the cumulative sum of all structures with a built year attribute per county and decade, (4) the proportion of structures containing the land use attribute relative to all structures in the county per decade, and finally (5) the proportion of structures containing the year built attribute relative to all structures in the county per decade. The shapefile variables containing proportions represent the completeness of either the land use or built year attributes. Data users are encouraged to use the uncertainty surfaces provided with the data presented herein and the positional uncertainty layers published in HISDAC-US23,24 to assess data suitability for a given location and to account for inherent positional uncertainty. Below we provide a table (Table 2) describing the files contained in the land use data sets.

Table 2 Technical specifications and access information for the created historical land use datasets.

Technical Validation

ZTRAX is subjected to quality issues that include spatial, temporal, and thematic uncertainties that propagate into the gridded surfaces contained in the HISDAC-US. In part, these uncertainties have been quantified in previous work23,24. For a thorough positional accuracy assessment of the gridded surfaces in HISDAC-US, over time and across the rural-urban continuum, we direct the reader to Uhl et al. (2021a), who report regionally varying levels of positional agreement. Uhl et al. (2021a) provide important insights into the quantity agreement of the ZTRAX-derived grid cell aggregates of built-up records and locations, as compared to building footprint data, census population and housing unit counts. Such reported disagreements also propagate into the land use data layers described herein, and the user of any data products from the HISDAC-US is urged to refer to these validation results to reflect the accuracy of the data appropriately. Based on this validation, it is known that while the completeness in HISDAC-US is acceptable for data layers after 1900, all products derived from ZTRAX will be subject to underestimation due to the difficulties of obtaining structural records from counties that have differing reporting policies, attribute incompleteness and inconsistency, and the dynamic nature of development. The level of underestimation for residential records was assessed in Uhl et al. (2021a) who reported varying levels of incompleteness of records along the rural-urban continuum in comparison to Census housing unit counts.

Herein, we assessed the completeness of land use and year-built attributes in ZTRAX (Section 4.1) and employed three ancillary datasets to quantitatively and qualitatively address uncertainties specific to the land use product. Specifically, we used land use data from volunteered geographic information (i.e., OpenStreetMap, OSM) (https://planet.openstreetmap.org) to assess the agreement with the created (contemporary) land use layers (Section 4.2), and compared our land use layers to remote-sensing-derived land cover/land use (LULC) data from the National Land Cover Database 200169 and 201670, as well as to urban land use classes from the Local Climate Zones (LCZ)71 dataset available for the CONUS (Section 4.3). In addition to that, we used data on building demolitions to quantify effects of survivorship bias, as building replacements or teardowns are not recorded in ZTRAX (Section 4.4). Finally, we used overhead imagery and a visual-analytical approach to assess the visual consistency of buildings at ZTRAX locations for different land use categories (Section 4.5).

Attribute incompleteness

Grid cells with small structural counts and low attribute completeness should be carefully considered as cell value assignment in the main data product was based on the most frequently occurring land use class within the grid cell extent. In such cases, the user is advised to use the land use type count layers in conjunction with uncertainty layers to better understand the underlying reliability of the data. Table 3 summarizes attribute missingness statistics for the land use data product, indicating that over 98% of the ZTRAX records have valid land use information (lower levels are found e.g. in Maine or Iowa, see Fig. 2a), and over 75% of ZTRAX records have valid land use and year built information. The majority of ZTRAX records have valid location information (Fig. 2b). Around 25% of the data are lacking land use and year-built information, and these are located in approximately 400 counties (see Fig. 2c), that can be also identified by the county-level completeness layers (Section 3.2).

Table 3 Cross-tabulation of land use (lu) and year built (by) completeness; “n” indicates missingness, e.g., nby = “no built year”.
Fig. 2
figure 2

Attribute completeness in ZTRAX: Percentage of records per county with a valid (a) land use attribute, (b) location attribute (i.e., latitude and longitude), and (c) year built attribute.

Comparison to OpenStreetMap land use data

While detailed and reliable land use data is sparse, OpenStreetMap (OSM) offers user-generated land use and functional information at the building level. While OSM is not expected to have high completeness in terms of the land use attribute, we assume the reported land use information to be accurate. We generated gridded surfaces, aligned with the HISDAC-US land use data grids, containing the number of buildings of a given land use type in OSM per grid cell, and conducted a cell-level agreement assessment. We mapped the relevant OSM land use types to the land use classification scheme of the presented HISDAC-US land use data. Moreover, due to the sparsity of some land use classes in both datasets, and the potentially large bias introduced by this, we only evaluated the three most frequent land use classes: residential, commercial, and industrial.

Preliminary tests have shown that a considerable amount of building footprints in OSM are lacking the land use attribute and thus, its completeness in OSM appears to be low in certain regions of the CONUS, while the correctness of those attributes that exist is expected to be high. Thus, only Type II errors (i.e., comission errors) in the ZTRAX-derived land use data can be quantified by comparing against the OSM data. For the evaluation of commission error, please refer to Section 4.3 (comparison to remotely-sensed LULC data) and Section 4.4 (Survivorship bias). Note that this assessment was done for the most recent point in time of the presented land use layer series (i.e., 2016) and for contemporary OSM data downloaded in 2021, to keep the temporal gap to a minimum. We carried out these assessments for individual counties as well as across the rural-urban continuum using the Rural-Urban Continuum Codes (RUCCs) created by the U.S. Department of Agriculture (USDA)72,73. RUCCs define nine rural-urban classes, including three metro and six non-metropolitan county designations using criteria of population size, the degree of urbanization and adjacency to a metro area (Table 4).

Table 4 OSM-based agreement assessment using correlations and recall measures across the rural-urban continuum (RUC).

Given these constraints in the OSM reference data, we first extracted all grid cells containing at least one OSM and ZTRAX derived record of the same land use class and assessed the correlations of the grid cell counts, as a measure of quantity agreement. Due to the ZTRAX data structure and spatial generalization effects, these distributions can contain outliers, resulting from large numbers of records in individual grid cells24, and thus, we used Spearman’s rank correlation coefficient for this assessment. Moreover, we calculated the recall (i.e., producer’s accuracy, or sensitivity) of the ZTRAX-derived land use counts with respect to the OSM in order to quantify the omission errors associated with the ZTRAX-derived data. The latter was done based on binarized absence-presence gridded surfaces, using a threshold of at least one record per grid cell, and thus allowing for measuring the Type II error component of the positional agreement between the ZTRAX and OSM derived surfaces. We quantified both, correlations of grid-cell level counts and the spatial agreement (i.e., recall) for each of the three land use classes under test for the whole CONUS, and across the rural-urban continuum, by conducting stratified assessments for grid cells located in counties of each RUCC (Table 4), as well as for each individual county (Fig. 3).

Fig. 3
figure 3

Comparison to building-level land use classes from OpenStreetMap. (a) Distribution of Spearman’s correlation coefficient based on 250 × 250 m grid cell counts of residential, commercial, and industrial records, and (b) distribution of county-level recall values; Panels (c) and (d) show the distributions of county-level correlation and recall, disaggregated for each rural-urban continuum code.

First, we observe positive correlations between building counts of ZTRAX and OSM land use classes across CONUS (>0.33 across all RUCCs for any class), and these correlations are highest for the residential class in highly urban environments (i.e., RUCC 1, c = 0.66). Correlations generally decrease towards more rural settings, where both ZTRAX and OSM completeness can be low. The completeness of ZTRAX land use records appears to decrease from the residential to the commercial and industrial classes, yielding recall values over all RUCCs of up to 0.77, 0.61 and 0.34, respectively. Recall values across the RUC follow similar patterns as the correlations, exhibiting highest values for the residential class in urban settings (RUCC 1, recall = 0.88) and lowest values in rural settings for the industrial class.

While these general patterns illustrate the broad-scale agreement between ZTRAX and OSM based land use data, we observe strong local variations of uncertainty at the county level, as the distributions of Spearman’s rank correlation coefficients and recall measures calculated at the county level suggest (Fig. 3a,b). As the upper tails of these distributions indicate, there is a considerable number of counties that exhibit very high quantity and positional Type II agreement between ZTRAX and OSM. Decomposing the distributions of county-level agreement metrics across the rural-urban continuum, we observe that while the overall agreement metrics in Table 4 decrease from urban towards rural regions, this trend is less visible in the county-level metrics (Fig. 3c,d). For example, a considerable number of rural counties (RUCC 6–9) exhibit high recall values for residential and commercial land use classes. We would like to emphasize that different factors such as spatial, temporal, and semantic inconsistencies between ZTRAX and OSM data, as well as the user-generated nature of the OSM database and associated uncertainty issues affect the presented agreement assessment results, underlining the difficulty in conducting land use data validation in general. However, these results suggest that the surfaces representing contemporary land use are largely coherent to the independently collected and compiled OSM data and thus, represent a reliable and plausible proxy for land use distributions in most regions of the CONUS.

Comparison to remote-sensing-based LULC datasets

We used gridded land cover data from the NLCD in 200169 and 201670, as well as gridded LCZ urban land use (temporally referenced approximately in 2016–2018) available for the CONUS71. We implemented two approaches for this comparison: First, we implemented a record-based approach: We drew a stratified random sample of ZTRAX records, retrieved the land use/climate zone from the underlying NLCD and LCZ grids at the location of each record, and cross-tabulated the land use class of each ZTRAX record and the respective NLCD and LCZ class labels found at the respective location. Specifically, we randomly selected one county for each of the nine RUCCs, within each of the nine U.S. census divisions72,73. We then retrieved the ZTRAX property records within these counties and drew a sample of n = 1,000 records (with replacement) from each of our six land use classes (cf. Table 1) per county. This way, we obtained a sample of N = 486,000 ZTRAX records, located within 81 U.S. counties uniformly distributed across the CONUS, and across the rural-urban continuum, and equally proportioned across the land use classes used herein.

Second, we implemented a raster-based approach: We down-sampled the NLCD gridded surfaces from their native resolution of 30 m and the LCZ data from 100 m into the HISDAC 250 m grid using two resampling techniques: 1) a majority-area rule, and 2) using 1-hot encoding, i.e., creating a binary 250 m gridded surface for each NLCD and LCZ class, encoding the presence of each class with 1, and the absence with 0. This way, we were able to evaluate the correspondence of our land use classes also to underrepresented classes in NLCD and LCZ, which are likely to disappear when using majority-area resampling. Similarly, we created a binary surface in our 250 m grid indicating the presence (1) or absence (0) of records of any of our six land use classes, based on the land-use-specific property count surfaces (cf. Fig. 2). We compared these data layers by cross-tabulating our land use based binary surface with the binary surfaces of each of the LULC classes from NLCD and LCZ, respectively.

We compared NLCD 2016 and LCZ to our 2016 layers, and to minimize the effects of temporal inconsistencies, we compared the NLCD 2001 to our layers referenced in the year 2000. These different strategies (record-based and raster-based) allowed for gaining a relatively unbiased picture of the correspondence between our land use classes and remotely sensed LULC types. The record-based approach evaluates the correspondence between ZTRAX and the LULC datasets without being affected by additional uncertainty induced by the resampling. However, it only evaluates thematic agreement where ZTRAX records are available, disregarding omission errors. The raster-based approach may suffer from additional positional uncertainty due to resampling from 30 m (NLCD) and 100 m (LCZ) resolutions to the target resolution of 250 m but enables the quantification of class-specific omission errors in regions where no ZTRAX records are available.

The record-based comparison (Fig. 4, top part) revealed very similar patterns for NLCD 2001 and NLCD 2016: Highest proportions of income and owned residential ZTRAX records are located in the NLCD classes “Developed, low intensity” and “Developed, medium intensity”, whereas industrial and commercial land uses have highest proportions in “Developed, high intensity”, in particular in urban counties. Agriculturally used properties have highest proportions within “Pasture/Hay” and “Cultivated crops”. Comparing to the local climate zones (Fig. 4, bottom part) shows that the highest proportions of ZTRAX records for most land use classes are located within the “Open low-rise” class, except for the agricultural land use class, which peaks in the “Low plants” and “Dense trees” LCZ classes.

Fig. 4
figure 4

Record-level comparison of ZTRAX land use classes and LULC land use categories, carried out for the full sample, for rural counties (RUCC 6–9) and urban counties (RUCC 1–5). Values are shown in % of the sample of N = 486,000 ZTRAX records used.

Some of these cross-tabulations seem implausible, such as ZTRAX records located in wetlands or open water. It is likely that these are artefacts due to the resampling, and the spatial resolution of the HISDAC-US land use data layers. In the least optimistic scenario, we can consider these mismatches to be commission errors (i.e., ZTRAX reports built-up structures that do not exist). In that case, these commission errors quantifiable by the conducted cross-comparison would sum up to only 4–5% of all ZTRAX records. Here, it is worth noting that commission errors in ZTRAX may also occur due to demolished buildings that have not been deleted or updated (i.e., set to “vacant” land use) in ZTRAX. However, these cases are likely not to exceed 1–2% of all ZTRAX records (see Section 4.4).

The raster-based approach reveals a complementary picture. As shown in Table 5, for most non-settlement and vegetation-dominated NLCD classes, only small area proportions are covered by grid cells containing one or more ZTRAX records. This trend inverts for the settlement-related land cover classes: For example, over 82% of “Developed, low intensity” land cover in 2016 geographically coincides with the land use data described herein. This agreement is lower for the NLCD 2001 data, as grid cells without temporal information are counted as “% not covered by HISDAC”. This trend persists across the two different data resampling techniques. Larger differences in these proportions between majority-based resampling and 1-hot encoding indicate that the land cover classes (e.g., “Developed, low intensity”) are underrepresented and/or spatially scattered and thus, disappear when using majority-based resampling. Moreover, when distinguishing these cross-tabulations in proportions of the HISDAC-covered and not covered area (Table 5, bottom part), we observed that the highest proportion of not covered area is shrubland/scrub (i.e., 23%), and highest proportions of the HISDAC-covered areas are located in “Deciduous Forest” and “Pasture/Hay” (agricultural class, cf. Figure 4), followed by the developed classes.

Table 5 Grid-cell-level comparison of ZTRAX land use classes and NLCD 2001 and 2016 land cover classes.

The raster-based cross-tabulations with LCZ classes show a similar pattern: The majority of the settlement-related and built-up classes (e.g., compact and open high-rise, etc.) are covered by HISDAC-US (Table 6, left part), whereas most vegetation-dominated LCZ classes are not covered. However, we observed some exceptions deviating from this trend: For example, the “Open midrise” class is mostly not covered in HISDAC-US. A reason could be public buildings that are omitted in HISDAC-US24 and are not considered in the land use data presented herein. Moreover, only 10–12% of the grid cells labelled as “Heavy industry” are covered in HISDAC-US. This may be caused by spatial offsets, as industrially used parcels may be very large, but also indicates a relatively poor coverage of industrial land use, which is also in line with observations made when comparing our data to OSM (Section 4.2). Conversely, the highest proportions of HISDAC-US-covered area in LCZ is classified as “Open low-rise”, “Dense trees” or “Low plants”. This is plausible as dense, urban settlements only represent a small portion of the US built environment, and peri-urban and rural settlements as well as agriculturally used structures are typically spatially scattered and thus, as a result of the resampling process, “occupy” a larger proportion of grid cells than built-up properties in dense, urban settings.

Table 6 Grid-cell-level comparison of ZTRAX land use classes and LCZ 2016–2018 urban land use categories.

It is worth noting that due to the different properties of the data compared herein (i.e., discrete locations vs. categorical and density information contained in gridded surfaces), positional uncertainty in both the LULC data (e.g., induced by the registration accuracy of the underlying remote sensing data) and in the ZTRAX data (e.g., using parcel centroids or address points instead of the locations of actual built-up structures) may introduce additional uncertainty in these cross-comparisons. However, the aggregation of the NLCD and LCZ datasets from fine resolutions to the target resolution of 250 m is assumed to mitigate such bias partially.

Assessing survivorship bias in the historical data

Survivorship bias presents a problem that appears in several disciplines and is of particular importance to most types of settlement, land use, or building stock data74,75,76,77,78. This type of bias appears when units, such as a built structure, are removed from the population but are not accounted for in the data. For example, a structure built in 1930 may get remodelled or may get demolished over time79. ZTRAX does not directly account for demolished structures and therefore does not continue to represent structures that no longer exist. The described land use data product suffers from the same limitation in that only surviving buildings are considered without accounting for possible structural losses. To demonstrate and measure the effects of this survivorship bias for Colorado, we used address-level demolition data over 10 years (2008–2017) obtained from the Colorado State archives (https://spl.cde.state.co.us/artemis/heserials/he171017internet/).

We stratified the counties in Colorado by their RUCC, and found that demolitions took place in urban counties at more than 10 times the rate of demolitions in rural counties; out of 28,403 possible demolitions, 26,011 occurred in rather urban counties (i.e., RUCC designations 1–5). We grouped RUCC 1–5 as urban counties and RUCC 6–9 as rural for all analysis using RUC codes. Comparing the total amount of demolitions occurring between 2008 and 2017 to the total number of built structures in 2015, we estimate that approximately 1.1% of Colorado’s building stock was demolished during this time period (thus an average annual rate of 0.11%). At the county scale we found, for both rural and urban counties, that the maximum percentage of demolished building stock did not exceed 2.5% during the 10-year period. As mentioned before, this observation also provides an estimated upper bound of potential commission error in ZTRAX and the derived land use datasets: There may be cases where demolished buildings are not reconstructed, and the demolition is not updated in ZTRAX, leading to a false positive (i.e., commission error). Furthermore, we refer to Uhl et al. (2021) where commission errors of ZTRAX-based settlement layers were quantified, and high levels of precision were observed in contemporary, urban settings, dropping to around 0.7 in rural settings and early time periods.

Moreover, we matched the demolition records to the ZTRAX records based on the address information given in both datasets and assessed the relationships between the demolition year and the year built on record in ZTRAX, separately for urban and rural counties (Fig. 5). The scraped demolition data contained a total of 33,645 addresses of which we were able to match 28,403 records to the ZTRAX data, leaving 5,242 records unmatched. These unmatched records may represent structures that have disappeared completely and are not contained in the ZTRAX data, or they are an artefact of our matching process, which was address-based and thus may be prone to misspelling errors in the addresses. We noted multiple instances in the scraped demolition data that were incorrectly spelled or had inconsistent formatting. From the analysis of this data we observed the following: (1) Most buildings that were demolished do not have a valid year-built attribute, and this proportion is higher in rural than in urban counties. This indicates that missing year-built attributes may be a result of recent teardowns, and possibly reconstruction, and a reporting latency that appears to be higher in rural counties. (2) Only a small percentage of demolished buildings (around 10% in 2008) have a year built > = year demolished. These records represent rebuilding activity that likely caused a replacement of the prior year built and represent the survivorship bias in ZTRAX- derived building age information. (3) Around 20% in 2008, and 50% in 2016 of demolitions did not cause an update of the year built on record in ZTRAX (year built <year demolished). This could imply several things: (a) The buildings were demolished and not replaced (i.e., they “disappeared”), but data records were not updated. This would illustrate an important limitation of our data, i.e., the shrinkage of human settlements cannot be measured. (b) The buildings were demolished and replaced, but the year built was not updated. This scenario would reduce the survivorship bias with respect to building age (i.e., the “original” year built persists); and (c) The buildings were demolished and replaced, and in addition to that, the building function changed: This would be an example of historical land use change not captured in our data. Furthermore, only a very small portion of demolished buildings is labelled as “vacant” in ZTRAX, indicating that most demolitions are followed by immediate reconstruction, or these cases are underreported in ZTRAX, which again would be an example of the inability to capture the shrinkage of built-up land in ZTRAX.

Fig. 5
figure 5

Cross-comparison of ZTRAX records and building demolition records in Colorado. The bar charts show the proportions of demolished buildings in different categories established by comparing demolition year and the year built on record in ZTRAX, separately for ZTRAX records reported as vacant and non-vacant. Urban counties have RUCC 1–5, rural counties have RUCC 6–9.

In conclusion, while we acknowledge the uncertainty due to survivorship bias contained in our data and generated by our modelling approach, it is clear that even in the most conservative scenarios that use verifiable data, survivorship bias would have minimal impact on analytical outcomes. As described above, we assume that land use for a structure was designated at the time the structure was built as this would have been the time that construction records/permits were submitted to the county assessor. Thus, some uncertainty remains unaddressed if buildings were built in parcels that have been re-zoned, and thus their land use designation may have changed, at some point in time. However, different kinds of land use changes over time have different transition likelihoods. We illustrate this in Table 7 to provide a basis for identifying land use classes that may be prone to this type of thematic uncertainty if past land use changes have not been recorded in the database.

Table 7 Estimated likelihoods of land use transitions over time.

Qualitative comparison to overhead imagery

Lastly, we used Bing aerial imagery (https://www.arcgis.com/home/item.html?id=8651e4d585654f6b955564efe44d04e5) to qualitatively assess the relationship between broad-scale patterns of land surface in remotely sensed earth observation data and the different land use classes recorded in ZTRAX. To do so, we randomly selected one location per land use class, for each of the 3,019 covered counties, resulting in a total of 18,114 sample locations. We then obtained the RGB Bing imagery within a bounding box of 100 × 100 meters around each ZTRAX location, and generated a mosaic of these individual images, per land use class. These mosaics are based on a method proposed and employed in Uhl et al.80 which involves the calculation of color moments81 for each image, resulting in a 12-dimensional low-level descriptor summarizing the color content of each image. We then use t-distributed stochastic neighbour embedding (T-SNE)82 to map these 12-dimensional descriptors into a bi-dimensional space. T-SNE arranges data points in the bi-dimensional target space in a way that similar data points are located near each other. We then rectified the resulting 2-d point cloud and visualized each image at its corresponding location in t-SNE space. This method groups similar images together and allows for an integrated visual assessment of large amounts of images (Fig. 6). These visualizations characterize the broad-scale patterns of geographic contexts encountered at the ZTRAX locations per land use class. For example, they illustrate the quantity of vegetation-dominated settings, which are most frequent in the agricultural land use class. Small buildings are commonly found in the agricultural and residential land use classes. Large bright objects represent the (typically flat) roofs of large industrially, commercially, or recreationally used structures and seem to occur commonly at locations of these three land use classes. Note that images containing vegetation only are likely seen due to positional offsets of ZTRAX point locations from the actual building locations within rural (often larger) cadastral parcels50. However, we can assume that these offsets have a minor effect on the data accuracy due to the chosen spatial resolution of 250 × 250 m, as recent multi- scale accuracy assessments have suggested24. Thus, the visual inspection of the t-SNE plots in Fig. 6 reveals plausible matches for most sample locations. This technique could be used to systematically refine larger-scale samples for building level verification to conduct quantitative accuracy assessments, as far as building function can be inferred from overhead imagery.

Fig. 6
figure 6

Visual assessment of Bing overhead imagery collected at the locations of a stratified random sample of ZTRAX records for the six land use classes used herein. (a) residential (owned), (b) residential (income), (c) commercial, (d) industrial, (e) agricultural, and (f) recreational. The images collected at each location per land use class are arranged based on their color similarity, using color moments and t-distributed stochastic neighbour transform (t-SNE). The small patches to the right of each mosaic show exemplary enlargements, providing further detail on the building characteristics at each ZTRAX location. The yellow rectangles show the locations of the enlargements (the upper enlargement corresponds to the upper of the two rectangles per land use class.

In this analytical effort, we used OpenStreetMap, demolition data records, remote-sensing derived land cover classifications and overhead imagery as comparative data sources, being aware that none of these external data sources represent optimal ground truth to evaluate the quality of the created land use layers. While these comparisons do not quantify the uncertainty in historical land use data, they highlight important data quality aspects and properties that help to better understand the completeness and inherent bias in the data product.

Usage Notes

In previous sections we have described our efforts to quantify certain biases that are present in the land use data. However, there are several other limitations that the user should consider when employing the gridded land use datasets. First, ZTRAX relies heavily on county records to populate the land use attributes, and county reporting practices differ from place to place, which may not account for all buildings that exist. Similarly, the implemented land use classification procedure may differ from county to county, which introduces some uncertainty related to the building type. We attempted to mitigate this uncertainty by grouping the 300 + land use types into broad thematic classes e.g., commercial or residential. A significant limitation of this dataset comes from the collection methods used to build the ZTRAX database; public buildings such as universities and low-income housing are generally not represented in the data presented herein. The gridded land use data will thus typically characterize privately owned structures. We recommend users integrate open-source data to capture the presence of public buildings within a given area to attenuate the error introduced by the exclusion of these buildings. Finally, we emphasize that this data was restricted to the land use of physical structures and the thematic classes that have been identified in the literature as significant to urban development. Thus, the data does not account for land uses that are not associated with a structure e.g., cropland or grazing land, and the data excludes other potentially important land use classifications such as tax exempt or governmental structures. As there has historically been a dearth of data that can directly describe structural land use in developed areas, the design of this data product intentionally gives preference to the land use classes identified as drivers of urban development at the expense of other potentially important land use types. Users should be aware of this inherent bias, and we encourage them to utilize the uncertainty layers to estimate the number of excluded structures within the considered study area.