Descriptor : High-resolution reconstruction of the United States human population distribution , 1790 to 2010

Where do people live, and how has this changed over timescales of centuries? High-resolution spatial information on historical human population distribution is of great significance to understand humanenvironment interactions and their temporal dynamics. However, the complex relationship between population distribution and various influencing factors coupled with limited data availability make it a challenge to reconstruct human population distribution over timescales of centuries. This study generated 1-km decadal population maps for the conterminous US from 1790 to 2010 using parsimonious models based on natural suitability, socioeconomic desirability, and inhabitability. Five models of increasing complexity were evaluated. The models were validated with census tract and county subdivision population data in 2000 and were applied to generate five sets of 22 historical population maps from 1790–2010. Separating urban and rural areas and excluding non-inhabitable areas were the most important factors for improving the overall accuracy. The generated gridded population datasets and the production and validation methods are described here.

National Hydrography Dataset (http://nhd.usgs.gov/). Protected areas data (version 1.3), designated to preserve biological diversity and other natural, recreation and cultural uses, were released in 2012 and obtained from National Gap Analysis Program (http://gapanalysis.usgs.gov/padus/). Digital elevation model data (Fig. 2) were released in 2013 and derived from NASA Shuttle Radar Topography Mission Version 3.0 (http://www2.jpl.nasa.gov/srtm/), with a resolution of 90 m. Historical data on urban spatial extents for San Francisco/Sacramento and Baltimore/Washington DC region 36,37 were from the USGS Land Cover Institute (LCI).

Level of spatial units
The above data were used to compute population for each inhabited pixel (k) in each decade (t) from 1790 to 2010 (excluding 1960). The data were variously applied over the following spatial units: urban area (φ), census tract (c 1 ), county subdivision (c 2 ), county (i), census division (δ), and Region (μ). There were 9 census divisions (δ: D1 -D9) and 7 Regions (μ: R1 -R7) with constant boundaries over time, while the number of other spatial units varied with the expansion of human settlements and the modification of administrative boundaries. For example, the number of counties increased from 292 in 1790 to 3109 in 2010. Also, in 2010, there were 7,754 146 pixels (1 km 2 ), 72,271 census tracts, 35,532 county subdivisions, and 3,535 urban areas in the conterminous US.

Population count determination
Decennial census data on total population (P T ) and aggregated urban population (P U ) for each urban area (φ) determined the remaining rural population (P R ) by difference, for each county (i) from 1790 to 2010 (excluding 1960, for which digital urban population data are missing): The number of counties containing urban population increased from 19 in 1790 to 2424 in 2010. For counties with more than one urban area, we obtained the population for individual urban areas from the US Census Bureau. Note a population threshold of 2500 was used for identifying urban areas (https:// www.census.gov/geo/reference/ua/urban-rural-2010.html).

Areal extent delineation
The 2000 census includes full coverage of areal extent (A U ) for all of the 3610 urban areas. However, areal extents are incomplete for 1990 and largely unknown in prior decades. The 1990 NHGIS data cover only 396 large urban areas with population larger than 50,000. Thus, the 2000 data were used as a baseline for Figure 1. The major spatial input data for the population downscaling models, and associated administrative unit boundaries (census divisions D1 to D9, state, county, and urban areas). Un-inhabitable areas include designated protected areas, waterbodies, and highlands with eleviation, z, larger than 3500 m. historical backward projections of urban areal extents. The areal extent of each urban area (φ) in a census division (δ) prior to 2000 was estimated on the basis of a power law scaling relationship with its population P U,φ 35 : where α δ and β δ are the proportionality coefficient and scaling factor for census division δ estimated by fitting log(A U ) and log(P U ) in 2000 according to equation (2). Strong linear relationships between log(A U ) and log(P U ) have been identified based on modern satellite imagery and census data [40][41][42][43] , and also observed in ancient settlements, for example, the Pre-Hispanic settlements in the Basin of Mexico 44 . Our hypothesis was that this scaling was stable and could be used to reconstruct historical settlement patterns. Empirical studies 45 have revealed spatial variation of β values: 0.375 for England and Wales, 0.914 for Japan, 1.38 for China, and larger than 2/3 for US urbanized areas (population larger than 50,000). Spatial variation of β was incorporated here by developing such relationships for each census division, δ. However, the lack of corresponding A U and P U data for urban areas before 2000 precluded further  evaluation of the temporal stationarity of β. Thus, the scaling relationships were considered stable for simplicity, with constant parameter values over time. The extent of rural areas (A R ) within each county (i) was determined as the difference between the county total area (A T ) and the aggregated extents of all urban areas in the county: Two simplifying assumptions were further made to delineate the extent of each urban area over time: 1) urbanization was monotonic, with historically established urban areas still extant in 2000; 2) urban areas developed outward from their center, so regions farthest from the center would urbanize last. The urban extents for each urban area in each decade were delineated by concentrically shrinking inward towards their center moving backward with time from 2000, with the remainder considered as rural areas.

Influence coefficient calculation
Inhabitability, topographic suitability, and socio-economic desirability were the three major factors considered to influence human population distribution. The calculation method for the influence coefficient (w) of each factor for each pixel (k) is explained below. Note the influence coefficients for inhabitability (w 0k ) and topographic suitability (w 1k ) were steady over time, while the influence coefficient for socio-economic desirability (w 2k ) changed with the urbanization process.
Inhabitability. Inhabitable zones were defined here to exclude protected areas, waterbodies larger than 1 km 2 , and areas with elevation higher than 3500 m (Fig. 1). Protected areas were regarded as noninhabitable if their status was defined as "Designated", indicating legal or administrative decree, and if their public access was classified as "Restricted" or "Closed". Note that protected areas are not constant during the time period of the study, and some populations were displaced in the creation of protected areas in the US. But the total number of such displaced persons is relatively low such that we expect the impact on our final results to be minor. Low-population census tracts were also treated as noninhabitable areas. Preliminary analyses revealed that areal weighting resulted in overestimates for low population tracts. This simplification was found to reduce the overall model errors (see Table 1 for the specific cutoff tract population for each division). Inhabitable areas were assumed to be steady over time.
The influence coefficient of inhabitability, w 0k , for pixel k was set as zero for non-inhabitable areas and one for inhabitable areas.
Natural suitability. Elevation has a relatively larger variation and better resolution within counties compared to other natural factors, including temperature and precipitation, and plays an important role in influencing population distribution 46 . We aggregated this detailed elevation data to county level instead of census tract and county subdivision because the latter data are only recently available and are thus used here only for validation purposes. Although such aggregation masks variation within counties, especially for the western mountainous regions, we found that a valid relationship developed from a perspective of regionalization could generally reflect the topographic influence. We tested the relationship between county mean elevation and population density using linear, log, and logistic functions using the R statistical software. We chose elevation over slope to represent topographic suitability because we found that the former had a better linear relationship (larger R 2 ) with population density where i and μ denote county and Region illustrated in Fig. 2, PD i and z i are population density and mean elevation for county i, and b μ and m μ are fitting parameters for Region μ. We calculated the influence coefficient of elevation, w 1k , for pixel k from the following relation where z k is elevation for pixel k. Only inhabitable areas were considered when calculating population density. To minimize the number of model parameters, geographically adjacent census divisions and states with similar m μ values from equation (4) were combined into seven regions (μ: R1 -R7, shown in Fig. 2), that upon integration still retained significant relationships between z i and In (PD i ). Two regions correspond to census divisions (R1 and R2), one includes only one state (R5, Florida), two comprise multiple contiguous states (R3 and R4), and two are composed of one census division plus multiple neighboring states (R6 and R7). No significant relationship was found for Colorado (R0, in Fig. 2), and thus population was not weighted by topographic suitability there. Significant negative linear relationships between z i and In (PD i ) were found for all regions except 100 mozo 300 m in R4 (Fig. 2). The presence of the city of Atlanta at approximately 300 m disrupted the general pattern that was found in the rest of the hot southeastern coastal plain. The slope, m, varied among regions, but all relationships were significant with p-valueo0.001.
Socioeconomic desirability. We considered socioeconomic desirability for urban (U) and rural (R) areas separately. Urban population density decreases with increasing distance from the urban center. This trend has been described simply with an exponential decay model 47 , but with growing attention on fractal cities, an inverse power function has been used more recently [48][49][50] . Here, we used the inverse power function to describe how the influence coefficient of socioeconomic desirability w 2k for urban pixels (k ∈ U) changes across space.
where r kφ is radial distance from the center of urban area φ to the pixel k, and λ δ is density gradient for division δ. We applied the following relation suggested by recent studies 51,52 to link parameter λ δ from socioeconomic desirability with the urban area-population scaling factor β δ from equation (2) Like β δ , λ δ values were calculated for divisions and regarded as constant over time.
For rural areas, proximity to an urban center is advantageous for economic development. The influence coefficient of socioeconomic desirability w 2k for rural pixels (k ∈ R) was determined using a gravity model of market potential 53 where N φ is the number of urban areas that are within the maximum urban influence distance in decade t, D t (r kj ≤ Dt ), corresponding to daily per capita travel range which exponentially increased due to transport technology evolution from 30 m in 1790 to 100 km in 2000 (ref. 54). The gravity model was adopted here, based on its wide application in reflecting the accessibility of urban markets [55][56][57] .

Population mapping models
Five population distribution models of increasing complexity were developed, using the influence coefficients described above (normalized to range between 0 and 1). M1 simply allocated census county population homogeneously within counties. M2 separated urban and rural areas, and then homogeneously allocated urban and rural population within urban and rural areas, respectively. M3 excluded non-inhabitable areas, including waterbodies, protected areas, highlands, and low population census tracts. M4 extended Model 3 with the addition of topographic suitability. M5 added socioeconomic desirability. From M3 to M5, the population distribution maps were obtained by multiplying the population raster by the normalized weighting grid, with each subsequent model contributing an additional influence coefficient to the previous model. The most complete model (M5) included all coefficients as shown in the following equation where Z indicates urban (U) or rural (R) pixel, and the exponents s and d weight the relative importance of topographic suitability and socioeconomic desirability on population mapping. The latter two parameters were calibrated to obtain the highest mapping accuracy with values evaluated between 0.2 and 3 for each division (Tables 2 and 3). Equation 9 was implemented through Model Builder in ArcGIS to produce our historical population datasets. The models we used are publicly available and can be freely downloaded from figshare (Historical Population Models, Data Citation 1).

Accuracy assessment
All five models, M1 -M5, were applied to map historical population from 1790 to 2010 (excluding 1960) (Data Citation 1). The best way to assess the accuracy of population distribution maps is to compare them with census data at a finer level than was used for the model input. While the county level input data were already the finest resolution that was available consistently throughout the temporal range of interest, higher resolution census tract data with full coverage of the conterminous US were available from 1990 to 2010, and these were used as a primary reference here. Note that the spatial resolution of our population products precluded validation for census tracts with area less than 1 km 2 , which represented 12.6% of the tracts and 9.7% of the total population for the conterminous US in 2010. We supplemented the tract analysis with reference data at a coarser level, county subdivision, from 1980 to 2010, to achieve a more comprehensive validation and also to assess the accuracy based on reference data at different geographic scales. We again used only county subdivisions with areas larger than 1 km 2 (98.8% of the county subdivisions, and greater than 99% of the conterminous US population). We generated gridded population data based on county level data and then aggregated population grids by census tracts and county subdivisions to compare them to census data at the corresponding level. We applied mean absolute relative error (MARE) to assess the overall performance of the 2000 population products for the conterminous US and each census division where subscript c is reference data geographic unit (census tract or county subdivision), N c is the number of evaluated units in the conterminous US or census division, M c is modeled population aggregated for unit c, and O c is observed population from census data. The five models were assessed in terms of both accuracy and effectiveness. We used MARE to compare model accuracy, with lower MARE values indicating better model performance. Model effectiveness was determined from the importance of the added influence factors based on the magnitude of MARE reduction compared to the previous model. We made full use of the available fine-resolution census population data and assessed the accuracy of the model-generated historical population maps for the conterminous US based on census tract data from 1990 to 2010 and county subdivision data from 1980 to 2010. Furthermore, we assessed the accuracy of our generated historical urban extents for selected regions. There is currently no historical (urban) land use database for the USA from 1790 to present. However, the USGS LCI 36,37 developed historical urban extents for two fast-growing regions: San Francisco/Sacramento and Baltimore/Washington DC. Those investigators mapped urban land use change and produced maps from 1900 to 1996 for San Francisco/Sacramento and from 1792 to 1992 for Baltimore/Washington DC. Therefore, the historical urban extents for these regions reconstructed in this study for the closest corresponding years were compared to the USGS LCI results to test the validity of the scaling relationships and the urban area delineation method. We used the relative overlap of urban pixels between our results and those from USGS LCI as an accuracy indicator. Note that the minimum population threshold used to define urban areas in the USGS LCI studies was 500 people, but we constrained our comparison to include only the areas with more than 2500 people, to be consistent with our threshold for urban areas.
Finally, we note that accuracy assessments for population products are commonly conducted using partial datasets due to the availability of higher resolution reference data. For example, Gaughan et al. 58 used only four urban areas to validate population products for the entire mainland of China, and Sorichetta et al. 18 used six countries for accuracy assessment of population products for 28 countries in Latin America and the Caribbean.

Technical Validation Population mapping illustration
The spatial details generated by M1 to M5 are illustrated in Fig. 3, with a focus on the northern part of the South Atlantic division (D5), which exhibits significant topographical variation from the coastal plain to the Appalachian Plateau, and includes many cities of different sizes. The census tract (Fig. 3b) and county subdivision data (Fig. 3c) are compared to all five model outputs (Fig. 3d). M1 and M2 produced homogeneous populations within counties, and urban and rural areas, respectively. M3 also produced homogeneous populations within urban and rural areas, but had more low population pixels due to the exclusion of non-inhabitable areas. M4 and M5 generated more spatial details within urban and rural areas for each county. With input from both topographic suitability and socio-economic desirability, M5 produced more detailed information consistent with the census tract population distribution, not only for the less densely populated zone in the northeast but also the high population regions in the southwest.

Population mapping model accuracy
Census tract population data was considered as the primary reference for validation due to its higher resolution, with mean area and population of 120 km 2 and 4310 persons in 2000. Based on our validation using census tract population for year 2000 (Fig. 4a) for the conterminous US, accuracy increased with model complexity from M1 to M5, with MARE decreasing from 6.96 (M1) to 1.54 (M5), indicating M5 as the most accurate model. By census division (δ: D1 -D9), model accuracy showed similar improving trends from M1 to M5, except M2 for D4 and D9. The model performed worst for D2, however, this division also showed improved performance with increased model complexity. Validation using county subdivision population data (mean area and population of 221 km 2 and 7932 persons) resulted in similar overall results, but with higher accuracy, indicated by smaller MARE values for Models 1 to 5 (Fig. 4b). The MARE decreased from M1 (3.11) to M4 (1.22), then increased marginally (1.23) for M5, showing M4 as the most accurate model. Similar trends were found for most divisions, except D2 and D5, with M3 as the most accurate model, and also D3, where model accuracy decreased from M2 (1.14) to M5 (1.22). Note that small or no differences in MARE were found between the most accurate models and M5 for these divisions. D1 and D8 presented worse model performance than the other divisions.

Population mapping model effectiveness
The improved accuracy found above was achieved by the increased complexity from M1 to M5. Models 1 to 5 increased in complexity by adding additional input data. M1 required only one data set (county total population, P T ), M2 required the addition of urban population (P U ) and extent of urban areas (A U ) for a total of 3 variables, M3 added inhabitable area for 4 variables, M4 also included topography and its relative importance (s) for 6 variables, and M5 added socioeconomic desirability and its relative importance (d), culminating in 8 variables.
Census-tract-based validation for the conterminous US (Table 4) showed that M3 and 2 resulted in the largest reduction in MARE compared to their next lower-complexity models: 3.58 and 1.53, respectively, revealing the importance of separating urban and rural areas as well as excluding non-inhabitable areas.

Urban extent dynamics
The historical urban areas were reconstructed based on the oldest available urban area boundaries (2000) using the scaling relationship between urban area and population. We illustrated the urban extents for Baltimore/Washington DC and San Francisco/Sacramento regions from our reconstructed results for the  5). Our simplified reconstruction tended to overestimate urban extents in both regions, and the lower population thresholds used to define urban areas for the USGS LCI studies resulted in more scattered small urban areas than in our results. The assessment of the historical urban areas in two fast-growing regions showed that our reconstructed urban extents reflect the general dynamic pattern of urban areas compared to the USGS LCI reference data.

Usage Notes
The dataset generated here provided ready-to-use historical population maps in the conterminous US from 1790 to 2010 at the resolution of 1 km. Our study showed the validity of applying scaling relationships between urban area and urban population to reconstruct historical urban areas, and the effectiveness of modeling spatio-temporal population distributions using existing data. Our backward projection was not just based on current population information but historical population data obtained from NHGIS, which provided accurate information about human settlement and population density at county level and served as a foundation to further disaggregation of population within administrative units. The high resolution of the input census population data, the separation of urban and rural extents, and use of inhabitability, elevation, and socioeconomic desirability influencing factors all contributed to the good accuracy of the final products.
Our final output included five sets of historical population products from 1790 to 2010 reconstructed based on models M1 to M5. Generally, model accuracy improved with increasing complexity. According to the census-tract-based validation, we proposed M5 as the most accurate and M3 as the most efficient, while the county-subdivision-based validation suggested M4 as the most accurate and M2 as the most efficient, largely because the larger size of county subdivisions masked the effect of additional factors included in more complex models. Therefore, we suggest that users consider their study units when selecting the model products. Also, we suggest applying the efficient models when transferring our approach to other regions, and adopting the most accurate models when directly applying our modelgenerated population products. Additionally, considering the varied model performance among the divisions, we caution users to choose the most appropriate products based on their specific study regions.
When compared to the large task of modeling human population at a continental scale over a time period of hundreds of years, our models include only a small number of parameters, and we thus consider all five models M1 to M5 as parsimonious. Our models used no more than three weighting coefficients, partly determined by the availability of historical geospatial data, while, for comparison, 10 weighting coefficients were used in LandScan 16 . Similar to our method, Landscan 16 and population modeling for Asia 19 and Africa 20,21 also applied multiple ancillary variables and allocated population using associated weights. However, land cover data, which were not available over the long time scales of interest in this study, were the major input for their modeling. Landscan did not provide details on their modeling methods 19 . The population mapping influence coefficients previously developed [19][20][21] were based on fixed values of population density for specific land covers, while those we developed as continuous functions of elevation, urban proximity, and market potential (equations (5), (6) and (8)) supported more variation. Further, our reconstruction was based on the same model structure with consideration of the dynamic trends of the model input parameters over time, including the number of urban areas and urban population, increasing urban influence distance over time, and changes of density gradients within urban areas.
Natural factors other than altitude, including temperature and precipitation, were excluded because of their relatively small variation within most counties. However, altitude was found to be an important factor to account for intra-county natural suitability variability. We further emphasize that our validated population products used the same method for the entire conterminous US, with no significant accuracy differences in the East and West. Another important factor that could be considered more explicitly in extensions to this work is transport networks, given their significant role in influencing human population distribution. Transport networks have evolved over time, shifting from rivers and canals, to roads and railroads, and then to air transport 54 , but there are no currently available comprehensive historical databases of transportation networks (rail lines, roads), precluding their direct inclusion in our models. However, the effect of the transport technology evolution was indirectly reflected in our models through the growth of the daily travel range in the gravity model of market potential. The availability of temporally dynamic data limited the selection of influencing factors. We aimed to apply consistent techniques over time to make the final products appropriate for dynamic analysis.
Our historical population products were generated and validated based on census population data. This data-based approach was therefore subject to the limitation of the currently available census data. First, we did not consider American Indians in our population distribution reconstruction since census  data did not include American Indians prior to 1900. It is still a challenge to collect and confirm population numbers for American Indians due to the sparse and inconsistent record. Second, urban areas are defined based on a population threshold of 2500, however, areas with less than 2500 people in the past could have held a role similar to modern urban areas. Our reconstruction thus might underestimate the attractiveness of smaller urban areas. Third, lack of high resolution historical data constrained a comprehensive validation dating back to 1790. Our historical population products prior to 1980 can be validated when more detailed census data become available. Despite these limitations, the population data from US Census Bureau are the best currently available source for data-driven analyses. We applied the following major assumptions in this study: Monotonic urbanization and homogeneously outward expansion, constant β, and steady inhabitable areas. The first two assumptions were used to reconstruct historical urban areas and the third was applied to determine the influence coefficient of inhabitability. While individual urban development patterns may often deviate from the first assumption (such as linear expansions along rivers or roads), simplifications were necessary to develop efficient models that could be feasibly applied to the large spatial and temporal scale of our study. This assumption could lead to varied impacts on model accuracy for specific cities with different development patterns. However, we found that the dynamic urban extents we generated reflect the general pattern of urban areas based on comparison with measured data available for two fast-growing regions, Baltimore-Washington DC dating back to 1792 and San Francisco/Sacramento dating to 1900 (Fig. 5). For the second assumption of constant β, currently there is no consistent opinion on how β may change over time. Theoretical considerations from a geometric perspective suggested β = 2/3, with dimensions of 3 for population and 2 for area 35 , although there is some controversy on the dimension of population. Temporal analyses based on a model of settlement structure and social networks 44 provided support for the dynamic nature of β: between 2/3 for unstructured settlements and 5/6 for larger and denser settlements with infrastructure networks. However, empirical analysis discovered little variation in exponents for Taiwan at different development phases 41 . Here, we found β = 0.95 for all divisions except Pacific (D9) with β = 0.86 (Fig. 6) based on current urban data, however, a lack of historical urban areas data could not support a dynamic analysis on this exponent and we assumed it to be constant. A smaller β for unstructured settlements 44 would suggest smaller sizes of urban areas and thus a higher population density in earlier times, which might also explain our overestimation of historical urban extents. Regarding our third assumption, the total inhabitable area may have expanded with increasing human settlement pressure and advancing technological development. For example, widespread drainage converted wetlands to croplands and settlements 60 . Thus, our model may overestimate historical inhabitable areas and thus underestimate population density in previous decades. However, data availability limitations necessitated simplifying assumptions, which result in inevitable modelling limitations. Future work could improve these models as more data become available.
The spatially explicit historical population data generated here could facilitate advancing our understanding of coupled human and natural systems. With the arrival of the Anthropocene, the scope, intensity, and rate of changes in human-nature interactions have increased dramatically 3 . It is thus becoming increasingly important to understand the dynamics of human-nature interactions at time scales of decades to centuries 8 . For example, from the perspective of water resources, several studies have quantified the geographic relationship between human settlements and rivers 61,62 , however, it remains elusive how such relationships have changed over time. Our population products could help evaluate dynamic human-nature relationships.