Historical dataset of administrative units with social-economic attributes for Austrian Silesia 1837–1910

Scientists from many disciplines need historical administrative boundaries in order to analyse socio-economic data in space and time. In this paper, we present a set of historical data consisting of administrative unit boundaries and exemplary socio-economic attributes for Austrian Silesia, an historical region located in modern Czechia and Poland. The dataset covers nearly 700 administrative unit boundaries on the level of cadastral or political communes and their subparts and was acquired through manual vectorisation of historical maps (1:28,800) from the period 1837–1841. The local-level units can be easily joined into higher-level divisions such as court or political districts for the period 1837–1910. The data can then be combined with statistical data collected approximately every 10 years for a similar period. Within the quality assessment, the relations between cartographic and census data and their credibility are analysed. The present dataset provides many possibilities for joining a wide range of historical statistical data to better understand various demographic and economic processes based on advanced analyses, e.g., by using GIS.

Measurement(s): administrative region • Socioeconomic Indicator
Technology Type(s): digital curation
Factor Type(s): time period
Sample Characteristic - Environment: anthropogenic environment
Sample Characteristic - Location: Poland • Czech Republic
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12464801


Background & Summary
With progress in the digitalisation of historical documents in libraries and archives, e.g., Österreichische Nationalbibliothek (www.onb.ac.at) and Zemský archiv v Opavě (http://www.archives.cz), the availability of such documents to internet users increases. Although large collections of historical map scans and census and industrial data records, especially from the 19th and beginning of the 20th century, have been made available, the main barrier to their use is still the time- and resource-consuming process of data pre-processing (e.g., georeferencing, vectorisation, optical character recognition) 1. Another major barrier to the inclusion of historical data in current analyses is administrative boundary changes in the regions about which data were collected. To make past data comparable with current data, detailed historical administrative boundary reconstruction is needed, preferably across different time periods 2.
Incorporating the former extent of historical administrative units is critical for organising statistical data both spatially and temporally in order to analyse and visualise it. There is a high demand for such data because it allows the extension of time series to the past and thus improves the analytical possibilities for understanding the consequences of historical, demographic and economic processes 3,4.
Recently, several historical datasets at various spatial scales, ranging from individual units up to whole countries or regions, have been made available [5][6][7]. Sometimes, the data are also available in the form of geoportals with different levels of interactivity, such as A Vision of Britain through Time (www.visionofbritain.org.uk), Social Explorer (www.socialexplorer.com), HGIS Germany (www.hgis-germany.de), Database of the Ukrainian residents born between 1650 and 1920 (www.pra.in.ua), but most often, the acquisition of this data is limited. The scope of spatial and historical studies is still insufficient, especially for Central and Eastern Europe, where interest in the digital humanities is dynamically increasing 8,9 and new projects appear, e.g., HistoGIS (https://histogis.acdh-dev.oeaw.ac.at) or Der Franziszeische Kataster (www.franziszeischerkataster.at). In this region, due to

Methods
Thanks to the development of cartography and census data collection methods in the 19th century, a large amount of high-quality data was collected for Austrian Silesia. Currently, it is possible to join that data to contemporary datasets to extend the temporal series and to conduct long-term analyses by using new tools and formats, including GIS (Fig. 2).
Administrative boundary reconstruction was based on manual vectorisation because the diverse quality of the original maps made automatic image processing very difficult. Although manual vectorisation is time-consuming, it may give better results with historical materials 24. Usually, manual vectorisation serves as the reference for automatic procedures 25. To obtain text data, optical character recognition (OCR) and manual methods were used. The advantage of OCR is fast processing, but the disadvantage is the necessity of verification and corrections, especially with small and blurred font shapes or handwriting (Fig. 3).
Cartographic data sources. As the basic cartographic material, 42 map sheets of the second military survey were used (1837-1841; 1:28,800). The maps are a generalisation of cadastral mapping (1:2,880) from the 1830s, updated with terrain relief information. The Austrian cadastre was founded by Emperor Franz in 1817. It served as a basis of stable spatio-temporal units (ger. Stabiler Kataster) for land taxation 26. The maps were obtained from the War Archive in Vienna in the form of 300 dpi TIFF scans. As auxiliary data, cadastral maps (1:2,880) and indicative sketches (1:1,440) from the 1830s were used; these are available online in digital form from Český úřad zeměměřický a katastrální (ČÚZK) (https://archivnimapy.cuzk.cz/uazk/pohledy/archiv.html) and from Szukaj w Archiwach (www.szukajwarchiwach.gov.pl). They were especially useful in situations where the quality of the topographic maps was very low and the maps were hard to interpret.
Statistical data sources. In the 19th and the beginning of the 20th century, the administrative structure of Austrian Silesia changed several times. An important trigger of the reorganisation of the administrative system was the Spring of the Nations in 1848, with many socio-economic processes following 16. The basic administrative units for which the statistical data were collected were villages, towns and, in some cases, their parts (ger. Orts, Ortschaft), e.g., colonies or suburbs. The above-mentioned units, alone or in aggregation with other similar entities, created cadastral communes (ger. Katastralgemeinden) and political communes (ger. Ortsgemeinden). The cadastral commune was a unit created for tax collection purposes, while political communes were designed as the lowest level administrative and self-government units. In the case of a few towns, such as Troppau, Teschen and Freistadt, only part of the political commune belonged to the cadastral commune. In Austrian Silesia, a cadastral commune usually covered exactly one political commune without its parts, so that the boundaries of both were identical; in 1900 there were 484 such cases. In 100 cases, a cadastral commune covered one political commune together with parts thereof, and in only one case did a cadastral commune cover two political communes. The names of cadastral and political communes generally coincided, though with a few exceptions (e.g., Gurschdorf, Tierlitzko). Political communes, as autonomous units, obtained legal status after the abolition of serfdom in 1848. Earlier, rural political communes were subject to the rule of masters in dominions (lat. dominium). Over a dozen cadastral communes formed a tax district (ger. Steuer Bezirke), while political communes formed the court districts (ger. Gerichts Bezirke). One or more court districts formed political districts (ger. Politische Bezirke) with their own authorities (ger. Bezirkhauptman). Apart from the above-mentioned structure, three towns in the region had their own statutes (ger. Städte mit eigenem Statut): Troppau, Bielitz, and Friedek. Political and court districts, as administrative units, were created in the place of former districts (ger. Kreis) that were dissolved in 1850.
The selection of the time series available in the paper was dependent on the availability of the historical data. In 1857, the first census under the Austrian Monarchy that was not conducted solely for military purposes took place 27. The next editions were carried out approximately every 10 years until 1910. The full results of the census were published both on the political and court district level [28][29][30], with some data categories also available for separate towns, villages or parts thereof [31][32][33][34][35][36]. Statistical publications for towns, villages and their parts differed in the number of separate spatial units and in the thematic coverage in different periods. For instance, for the town of Wisla, data from 1850, 1869, 1880, 1900, and 1910 were available for two units, while for 1890, the data were available for 33 units within the town. Statistical sources do not have a uniform thematic scope. In most cases, the amount of data collected increased with each census, e.g., the number of cattle-related categories increased from 13 in 1857 to 40 in 1910. In some censuses, the definitions of categories were changed. However, in most cases, the sub-categories can be easily joined into more general categories, e.g., the numbers of inhabited and uninhabited buildings into the number of buildings or the number of Roman, Greek and Armenian Catholics into the number of Catholics. The published number of sub-categories and categories is larger for districts than for communes or their parts. For instance, the number of houses for communes in 1880 is summarised, while for the districts, it is presented separately for inhabited, uninhabited and parts of homes.
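The joining of sub-categories into more general categories described above can be sketched in a few lines of Python. This is only an illustration: the field names (`buildings_inhabited`, `catholics_roman`, etc.) and the mapping are hypothetical and are not the attribute names used in the dataset.

```python
from typing import Dict, Tuple

# Hypothetical mapping: general category -> sub-categories that sum to it.
ROLLUP: Dict[str, Tuple[str, ...]] = {
    "buildings": ("buildings_inhabited", "buildings_uninhabited"),
    "catholics": ("catholics_roman", "catholics_greek", "catholics_armenian"),
}


def roll_up(record: Dict[str, int]) -> Dict[str, int]:
    """Sum sub-categories into general categories for one spatial unit.

    A general category is returned only if all of its sub-categories
    are present, so that incomplete sources do not yield partial totals.
    """
    totals: Dict[str, int] = {}
    for category, parts in ROLLUP.items():
        if all(part in record for part in parts):
            totals[category] = sum(record[part] for part in parts)
    return totals


# Invented values for one unit; the Catholic sub-categories are incomplete.
example = roll_up({
    "buildings_inhabited": 120,
    "buildings_uninhabited": 5,
    "catholics_roman": 600,
})
```

Skipping categories with missing parts is a design choice made here so that sources with a narrower thematic scope remain comparable with richer ones only where the totals are well defined.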
Georeferencing. All of the second military map sheets were georeferenced by first-order polynomial transformation in ArcMap by using contemporary World Imagery satellite images (https://www.arcgis.com/home/item.html?id=10df2279f9684e4a9f6a7f08febac2a9) as a source of control points. Additionally, the official administrative unit data from the Czech Geoportal (ČÚZK) (https://geoportal.cuzk.cz) and National Register of Boundaries for Poland (www.gugik.gov.pl/pzgik) were used to collect the proper points. In most cases, the locations of characteristic road crossings, buildings or administrative boundary crossings were used as the control points. Altogether, 1083 points were used, which resulted in RMS values between 12 m and 21 m for 75% of the map sheets. The minimal RMS value was 6.6 m, and the maximal was 27.4 m. These results are comparable to other georeferencing work done for the second military survey in other parts of the Habsburg Empire 37.
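For readers unfamiliar with the procedure, a first-order polynomial (affine) transformation and its RMS error can be sketched as below. This is a minimal pure-Python illustration of the general technique, not the ArcMap implementation; the function names and the four control points are invented for the example, and for real map coordinates the points should be centred first to keep the normal equations well conditioned.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _solve3(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("control points are collinear or too few")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_first_order(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Least-squares fit of X = a0 + a1*x + a2*y and Y = b0 + b1*x + b2*y."""
    basis = [(1.0, x, y) for x, y in src]
    normal = [[sum(b[i] * b[j] for b in basis) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in (0, 1):  # k = 0 fits X, k = 1 fits Y
        rhs = [sum(b[i] * d[k] for b, d in zip(basis, dst)) for i in range(3)]
        coeffs.append(_solve3(normal, rhs))
    return coeffs


def rms_error(src: Sequence[Point], dst: Sequence[Point], coeffs) -> float:
    """Root mean square of the residual distances at the control points."""
    (a0, a1, a2), (b0, b1, b2) = coeffs
    squared = 0.0
    for (x, y), (tx, ty) in zip(src, dst):
        squared += (a0 + a1 * x + a2 * y - tx) ** 2 + (b0 + b1 * x + b2 * y - ty) ** 2
    return math.sqrt(squared / len(src))


# Invented control points: scan coordinates -> map coordinates.
scan = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
world = [(10.0, 20.0), (12.0, 20.0), (10.0, 23.0), (12.0, 23.0)]
coefficients = fit_first_order(scan, world)
residual = rms_error(scan, world, coefficients)
```

With at least three non-collinear control points the six coefficients are determined; additional points, as used for each map sheet here, make the fit overdetermined and the RMS value informative.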
Data acquisition. The manual vectorisation was conducted at a scale of 1:2000–1:5000 (Fig. 3a). During the acquisition, the tools of the ArcMap 10.7 Editor were used. The generalisation of the administrative boundaries on the second military survey versus cadastral mapping was limited. However, distinguishing the boundaries from other line symbols, such as roads, relief or densely settled areas, is not always easy. For both territories of contemporary Czechia and Poland, the use of the current administrative borders from ČÚZK (https://geoportal.cuzk.cz) and the National Register of Boundaries for Poland (www.gugik.gov.pl/pzgik) was very useful. If the current and historical boundaries overlapped, then the manual vectorisation followed the current data (Fig. 3b). If there were some discrepancies between the historical and current boundaries, a detailed inspection of the situation influenced the final decision (e.g., whether the difference resulted from a real change, RMS error, or map distortion). Numerous boundary changes occurred along the historical boundary between the Habsburg Empire and Prussia (Fig. 3c). Currently, it is the Polish-Czech national boundary, with small corrections made in the 1950s. During the reconstructions, historical changes made before the river regulations (Fig. 3d) and the creation of water reservoirs, as well as small, local changes, were considered. For several cadastral communes, the boundaries were acquired directly from cadastral mapping (1:2,880) (Fig. 3e). Finally, for 1900, the statistics cover 585 cadastral and 495 political communes 35.
Additionally, thanks to the indicative sketches and analysis of the origin of the owners of the plots, it was possible to separate parts of communes (e.g., colonies, settlements), which resulted in increasing the spatial resolution of the dataset (Fig. 3f). In this way, for instance, in the Freiwaldau District, the number of basic administrative units increased from 58 cadastral communes to 108 units for 1900. Finally, nearly 700 administrative unit boundaries were reconstructed. In each case, the area of the vectorised units was verified by the statistical data. If statistical publications did not contain socioeconomic data for some parts of a commune, even though we had their geometry, those parts were combined into larger administrative units, i.e. political or cadastral communes, in accordance with the generalised data in the statistical publication. However, this was the case in less than 5% of the units for each of the time periods.
Statistical data used in the study were acquired manually or semi-automatically with the ABBYY FineReader 12 OCR software. For the oldest dataset, only the font shapes and signs unique to Czech, Polish or German were problematic. In some sources, single records were difficult to interpret due to the quality of the materials (e.g., distinguishing between 1 and 4 in some cases).

Data records
The dataset 38 presented in the paper is an open, vector SHP file that is easy to use in most GIS software and easily convertible to other formats. The shared layers are available on the level of communes and their parts and for districts for seven time periods between the 1830s and 1910 (Supplementary File 1). Altogether, 21 layers are available (Table 1). The names of the layers consist of two letters (SA, denoting Austrian Silesia), the years of the time period and spatial reference units: political communes (CmP), cadastral communes (CmC), court districts (DC) and political districts (DP). Communes can be easily combined into more complex units, incl. older administrative units (e.g., dominions before 1848) or districts (Fig. 4), using simple dissolve- or selection-type functions. For instance, the territorial unit codes for court district, political district and statutory town each consist of six numbers separated by slashes, e.g., 01/01/01, and for dominions, the maximum length of the code is 21 numbers separated by slashes (Table 2).
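The layer naming convention can be decoded programmatically, for example as in the following sketch. The regular expression is an assumption derived from the description above and from the file name SA_1837_1850_CmC given in the Usage Notes; in particular, accepting a single year instead of a year range is a guess, and the function names are illustrative only.

```python
import re
from typing import NamedTuple, Tuple

UNIT_TYPES = {
    "CmP": "political communes",
    "CmC": "cadastral communes",
    "DC": "court districts",
    "DP": "political districts",
}

# Assumed pattern: SA_<year>[_<year>]_<unit type>
_LAYER_NAME = re.compile(r"^SA_(\d{4})(?:_(\d{4}))?_(CmP|CmC|DC|DP)$")


class Layer(NamedTuple):
    start_year: int
    end_year: int
    unit_type: str


def parse_layer_name(name: str) -> Layer:
    """Split a layer name into its time period and spatial reference unit."""
    match = _LAYER_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected layer name: {name!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return Layer(start, end, UNIT_TYPES[match.group(3)])


def split_unit_code(code: str) -> Tuple[str, ...]:
    """Split a slash-separated territorial unit code such as 01/01/01."""
    return tuple(code.split("/"))


layer = parse_layer_name("SA_1837_1850_CmC")
levels = split_unit_code("01/01/01")
```

Keeping the code components as strings preserves leading zeros, which matters when the prefixes of a code are used as dissolve keys for higher-level units.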
The spatial data are available with exemplary attributes covering geographical names, the number of buildings, demographic data, land use and cattle statistics in order to show the possibilities for data integration. An important reason for choosing exemplary attribute data was the ability to verify their values based on independent data sources for communes or districts. The selected attributes presented here highlight the potential of the statistics collected by the Habsburg administration 27 .
The first part of the attribute name consists of five characters describing the category and sub-category. The four digits following the underscore '_' define the time period (YY) and the data source level (XX).
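A small helper for splitting such attribute names might look as follows. It assumes that the two time-period digits (YY) precede the two source-level digits (XX), which the description suggests but does not state explicitly; the name `ABCDE_6901` is a made-up placeholder, not a field of the dataset.

```python
import re
from typing import Tuple

# Assumed layout: five category characters, underscore, YY, then XX.
_ATTRIBUTE = re.compile(r"^(.{5})_(\d{2})(\d{2})$")


def parse_attribute_name(name: str) -> Tuple[str, str, str]:
    """Return (category code, time period YY, data source level XX)."""
    match = _ATTRIBUTE.match(name)
    if match is None:
        raise ValueError(f"unexpected attribute name: {name!r}")
    return match.group(1), match.group(2), match.group(3)


parts = parse_attribute_name("ABCDE_6901")  # hypothetical attribute name
```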
In the dataset presented in the paper, 4 name-related (Table 3), 4 house-related (Table 4), 42 demographic-related (Table 5), 9 farm animal-related and 5 land use-related attributes (Table 6) were selected. Commune names are available for 5 time periods: the 1830s, based on a second military survey, 1850 31, 1880 33, 1900 35 and today, according to the official national geoportals (https://geoportal.cuzk.cz; www.geoportal.gov.pl). The names are presented in one, two or three languages (Czech, German, Polish), according to the source information availability. The census from 1880 is the only one that contains the local names in all three languages for almost all the communes covered. Some names of the same units may differ over time, which might be a result of phonetic records or problems with the transliteration of Slavic languages into German. The various versions of the names, however, are preserved here for unambiguous identification with other historical sources. Demographic data for almost all time periods include the population divided by gender.

Technical Validation
Spatial data. The data acquired by manual vectorisation were inspected with the topology tools in ArcMap 10.7 by using the rules Must Not Overlap and Must Not Have Gaps, with a tolerance of 0.001 m. The area of cadastral communes can be verified with census data from 1900 35. Other census data refer to the area of political communes or districts. For 91% of the cadastral communes, the differences between the spatial and census data are between −1% and 1%, and for 97% they are between −3% and 3% (Fig. 5). These results show the high quality of the source data and vectorisation procedure. Small discrepancies may be a result of the data processing in GIS (e.g., georeferencing) and/or slight changes in cadastral communes' boundaries in the period under study (e.g., in Mährisch Pilgersdorf in 1846 or Kopitau in 1868).
The aggregation of communes into political districts confirms small (up to ±0.3%) discrepancies with census data (Table 7).
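The area comparison can be reproduced with a short calculation such as the one below. The sign convention (vectorised area minus census area, relative to the census area) is an assumption, and the four area pairs are invented for illustration; they are not values from the dataset.

```python
from typing import Iterable, List, Tuple


def relative_difference(gis_area: float, census_area: float) -> float:
    """Difference of the vectorised area from the census area, in percent."""
    return 100.0 * (gis_area - census_area) / census_area


def share_within(differences: Iterable[float], tolerance: float) -> float:
    """Percentage of units whose absolute difference is within the tolerance."""
    values: List[float] = list(differences)
    inside = sum(1 for value in values if abs(value) <= tolerance)
    return 100.0 * inside / len(values)


# Invented (vectorised, census) areas in hectares for four units.
pairs: List[Tuple[float, float]] = [
    (100.5, 100.0),
    (98.0, 100.0),
    (200.0, 200.0),
    (104.0, 100.0),
]
diffs = [relative_difference(gis, census) for gis, census in pairs]
within_1 = share_within(diffs, 1.0)
within_3 = share_within(diffs, 3.0)
```

Applied to the cadastral communes of 1900, the two shares correspond to the ±1% and ±3% figures reported above.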
The boundaries vectorised manually were compared to the boundaries created by other authors. Bičik et al. 10,39 created so-called STUs (stable territorial units) and BTUs (basic territorial units) consisting of 1 to 12 cadastral communes for the whole territory of Czechia, including Austrian Silesia (Fig. 6a), for the period 1845-2010. According to this delineation, the number of units in the Czech part of Austrian Silesia was reduced by nearly 40%, from 473 cadastral communes in the 19th century to 285 STUs. The average area of STUs for Austrian Silesia was 1590 ha, while for the cadastral commune, it was 874 ha. In some STUs, the Austrian Silesia communes were joined with those from Moravia or the Silesian part of Prussia. On the one hand, such aggregation allows the integration of data over long periods from various sources, but on the other hand, it lowers the spatial resolution of the data, which may negatively affect the ability to join such data with other historical or contemporary sources. For instance, the current availability of remote sensing data, which are independent from administrative units, makes it possible to compare land use directly to detailed historical boundaries.

Table 6. The farm animal-related and land use-related attributes covered by the dataset.
The vectorised boundaries of cadastral communes aggregated to court districts were compared to the independent reconstruction prepared for the whole territory of the Habsburg Empire in 1910 during the MOSAIC Project (https://censusmosaic.demog.berkeley.edu) (Fig. 6b). This shapefile was created by Helmut Rumpler and Martin Seger 40 and was slightly modified for the Max Planck Institute for Demographic Research and Chair for Geodesy and Geoinformatics, University of Rostock Population History GIS Collection. A total of 1711 court districts (incl. 28 for Austrian Silesia) were created during the project. The direct difference in district area between the MOSAIC Project and the dataset is 21.2%. For statutory towns, this difference is up to 80.9%, while for the rest of the districts, it is 14.1% (min -0.5%, max -54.4%). The generalisation of the boundary lengths between the two projects is equal to 22.
Census data. The credibility of census data from Habsburg statistical institutions has been the subject of many studies confirming their high scientific value as a result of advanced data collection methods 27,41. However, the quality of census publications has so far been less well recognised. In the dataset presented in this paper, the data available for communes were aggregated to the level of court districts for each of the time periods and compared to the sums available for the level of court districts 1) within the same source and 2) with the independent source (Fig. 7). The categories and sub-categories were chosen according to the availability of data for conducting a comparison on two administrative levels.
First, inconsistencies in the data may be verified by the partial, categorical and sub-categorical sums (e.g., the number of men and women should sum to the category 'population together', and the sums of an exemplary category at the court district level should be equal to the value for the superior political district). Second, by analysing the partial sums, it must be determined in which publication the error appeared. Sum accordance in both publications confirms the credibility of both. The above-mentioned methods enabled the removal of errors triggered by the conversion of raster to text during the data acquisition process. The accordance expressed by the median, non-outlier range and the value of extremes confirms that errors in the source materials are rare, regardless of the time period. Sums of exemplary categories and sub-categories aggregated to court districts are equal to the values published for court districts in almost 100% of the cases. For instance, for more than 680 political communes in 1869, there are only 3 errors in the records. On page 1, the data were exchanged for the communes of Alt Bielitz and Batzdorf, and on page 18, in the commune of Nider Hillersdorf, the total population exceeded the sum of men and women. For 1880, no errors were detected in any of the categories published for the communes when comparing the values to district-level publications. Two errors were found for 1890, 7 for 1900 and 3 for 1910.
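The two checks described above, partial sums within a record and commune sums against published district values, can be sketched as follows. The record layout (keys `name`, `district`, `men`, `women`, `total`) and all numbers are hypothetical and serve only to illustrate the logic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def population_mismatches(records: List[dict]) -> List[str]:
    """Names of units where men + women differs from the published total."""
    return [r["name"] for r in records if r["men"] + r["women"] != r["total"]]


def district_mismatches(
    records: List[dict], published: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Districts whose aggregated commune totals differ from the published sum.

    Returns {district: (aggregated value, published value)}.
    """
    sums: Dict[str, int] = defaultdict(int)
    for r in records:
        sums[r["district"]] += r["total"]
    return {
        district: (sums[district], value)
        for district, value in published.items()
        if sums[district] != value
    }


# Invented records: commune B contains a deliberate inconsistency.
communes = [
    {"name": "A", "district": "D1", "men": 100, "women": 110, "total": 210},
    {"name": "B", "district": "D1", "men": 50, "women": 55, "total": 106},
    {"name": "C", "district": "D2", "men": 70, "women": 80, "total": 150},
]
district_totals = {"D1": 315, "D2": 150}

bad_units = population_mismatches(communes)
bad_districts = district_mismatches(communes, district_totals)
```

When both checks flag the same place, as for commune B and district D1 here, comparing the partial sums indicates whether the error sits in the commune-level or the district-level publication, or was introduced during OCR.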

Usage Notes
We shared the data in the open, widely used shapefile format. According to the shapefile format specification, boundary geometries are stored in a direct-access main file with the shp extension. In the shx index file, each record contains the offset of the corresponding main file record from the beginning of the main file. The socio-economic data are stored as a dBASE table, with one record per feature, in a file with the dbf extension. The one-to-one relationship between geometry and attributes is based on the record number: attribute records in the dBASE file are in the same order as records in the main file.

Fig. 6 Comparison of the vectorised cadastral communes from the dataset with STUs (a) and court districts from the MOSAIC Project (b).

(2020) 7:208 | https://doi.org/10.1038/s41597-020-0546-z | www.nature.com/scientificdata
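To make the role of the index file concrete, the sketch below reads an shx file with Python's standard `struct` module. It relies on the layout given in the ESRI shapefile specification (100-byte header, big-endian file code 9994, then 8-byte records holding offset and content length in 16-bit words); the function name and the synthetic two-record index are illustrative only.

```python
import struct
from typing import List, Tuple

HEADER_BYTES = 100  # fixed header size of .shp and .shx files
FILE_CODE = 9994    # big-endian integer stored at byte 0


def read_shx(data: bytes) -> List[Tuple[int, int]]:
    """Return (byte offset, content length in bytes) for each index record.

    The i-th entry points to the i-th record of the main .shp file, which in
    turn corresponds to the i-th row of the .dbf attribute table.
    """
    (file_code,) = struct.unpack_from(">i", data, 0)
    if file_code != FILE_CODE:
        raise ValueError("not a shapefile index")
    (length_words,) = struct.unpack_from(">i", data, 24)  # file length in 16-bit words
    n_records = (length_words * 2 - HEADER_BYTES) // 8
    entries = []
    for i in range(n_records):
        offset, length = struct.unpack_from(">ii", data, HEADER_BYTES + 8 * i)
        entries.append((offset * 2, length * 2))  # convert words to bytes
    return entries


# Synthetic index with two records, built only to exercise the reader.
header = struct.pack(">i", FILE_CODE) + bytes(20) + struct.pack(">i", (100 + 16) // 2) + bytes(72)
index_bytes = header + struct.pack(">iiii", 50, 10, 64, 12)
entries = read_shx(index_bytes)
```

For a real layer, the bytes would be read with `open(path, "rb").read()` from the shx file; in practice GIS libraries handle this transparently, and the sketch only illustrates why the record order must not be changed in one file without the others.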
Examples of mandatory files for shapefile format:
• SA_1837_1850_CmC.shp - main file with the boundary geometries
• SA_1837_1850_CmC.shx - index file with the offsets of the main file records
• SA_1837_1850_CmC.dbf - dBASE table with the socio-economic attributes

Examples of other files for shapefile format:
• SA_1837_1850_CmC.cpg - used to specify the code page (only for .dbf) for identifying the character encoding to be used
• SA_1837_1850_CmC.prj - projection description with text representation of coordinate reference systems
• SA_1837_1850_CmC.sbn and SA_1837_1850_CmC.sbx - a spatial index of the features
• SA_1837_1850_CmC.shp.xml - geospatial metadata in XML format

In the second version of the dataset, each shapefile has its own metadata XML file, based on ISO 19139 Metadata Implementation Specification GML 3.2 standard, where details such as the sources of socio-economic data, with pages or map sheets, are clarified. The shapefiles are stored in 21 compressed ZIP archives. The ZIP archives can be opened directly in the open-source QGIS software (https://qgis.org) or, after unpacking with e.g. 7-Zip (https://www.7-zip.org/) or WinZip (https://www.winzip.com), loaded with the single Add Data command in ArcGIS Desktop or ArcGIS Pro (https://www.arcgis.com).
We ask the reader to refer to Tables 1 to 6 for detailed explanations of the different data layers and fields.
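Before loading a downloaded archive, the presence of the mandatory components can be checked with Python's standard `zipfile` module, as in this sketch. The helper name is hypothetical, and the in-memory archive with an intentionally missing dbf file is built only for demonstration; it does not reflect the contents of the published archives.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Set

MANDATORY = {".shp", ".shx", ".dbf"}


def missing_components(archive, layer: str) -> Set[str]:
    """Return the mandatory shapefile extensions absent for `layer`.

    `archive` may be a path or a binary file-like object holding a ZIP file.
    """
    with zipfile.ZipFile(archive) as zf:
        present = {
            PurePosixPath(name).suffix.lower()
            for name in zf.namelist()
            if PurePosixPath(name).stem == layer
        }
    return MANDATORY - present


# Demonstration archive held in memory; the .dbf file is left out on purpose.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for extension in (".shp", ".shx", ".prj"):
        zf.writestr("SA_1837_1850_CmC" + extension, b"")
missing = missing_components(buffer, "SA_1837_1850_CmC")
```

An empty result means that the geometry, index and attribute files are all present, so the layer can be opened either directly from the ZIP archive or after unpacking.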