MatchingLand, geospatial data testbed for the assessment of matching methods

Xavier, Emerson M. A.; Ariza-López, Francisco J.; Ureña-Cámara, Manuel A.

doi:10.1038/sdata.2017.180

Data Descriptor
Open access
Published: 05 December 2017

Scientific Data volume 4, Article number: 170180 (2017)


Abstract

This article presents datasets prepared with the aim of helping the evaluation of geospatial matching methods for vector data. These datasets were built up from mapping data produced by official Spanish mapping agencies. The testbed supplied encompasses the three geometry types: point, line and area. Initial datasets were submitted to geometric transformations in order to generate synthetic datasets. These transformations represent factors that might influence the performance of geospatial matching methods, like the morphology of linear or areal features, systematic transformations, and random disturbance over initial data. We call our 11 GiB benchmark data ‘MatchingLand’ and we hope it can be useful for the geographic information science research community.

Design Type(s)	data integration objective • modeling and simulation objective
Measurement Type(s)	geographic feature
Technology Type(s)	computational modeling technique
Factor Type(s)	DataTypes
Sample Characteristic(s)	Andalucia Autonomous Community • building • road

Machine-accessible metadata file describing the reported data (ISA-Tab format)

Background & Summary

Geospatial data have become ubiquitous in modern applications. There is a plethora of geospatial data sources generated by different producers, and these sources must be integrated in order to deliver their full power. We are living in a ‘data overload’ scenario, where there are many data providers, each one with its own representation of the geographic reality. Some authors call this situation ‘big geospatial data’^1–3. Another recent trend that can benefit from geospatial data integration is the smart city⁴, which involves the integration of many interdisciplinary fields using geographical knowledge⁵.

Data integration requires finding the correspondences between the geospatial datasets being assessed, a process that we call geospatial data matching⁶. Data matching is not a trivial task, and it has attracted recent research from the Geospatial Information Science (GIScience) community^7–10.

In our previous study⁶ we indicated that the matching research community should create a benchmark dataset for testing new matching methods or measures for geospatial vector data within a homogeneous framework. Such a matching testbed would be a useful tool for comparing different measures and methods, because we found that results may change outside the initial test site. In this context, the main aim of this testbed is to provide a comparison framework for geospatial data matching approaches, which can be useful for the research community or GIS software developers. The testbed supplied is composed of four groups of datasets, of which the first was built from real geospatial data and serves as the basis for the other three groups of synthetic data. The testbed can be regarded as the data needed to carry out a designed experiment on matching methods and measures.

This testbed is composed of four groups of datasets: (1) initial, (2) morphology modified, (3) systematic disturbance, and (4) random disturbance. The initial datasets originate from authoritative mapping at scales 1:25,000 and 1:10,000 produced by Spanish agencies. The other three groups are derived from the initial datasets at scale 1:25,000. Morphology modified datasets are composed of synthetic objects in a given morphology class for linear and areal features. The systematic disturbance group is composed of datasets generated by applying affine transformations to the initial data. The last group (random disturbance) is formed by data displaced by vector fields applied to the initial datasets. Each group comprises datasets for the three geometric primitives (point, line, and area), except for the morphology modified group, which has no point data. Figure 1 illustrates the dataset groups in the matching testbed.

**Figure 1: Dataset groups available in this testbed.**

The value of the testbed provided can be summarized in four items: (1) these datasets can be used as benchmark data for other studies investigating geospatial data matching at the feature level; (2) the development of new similarity measures can benefit from these datasets as comparison sets used to calculate the new ‘distances’ between objects; (3) data quality studies focused on positional quality or completeness can use the datasets to develop new quality evaluation procedures by adopting two corresponding datasets, one as the test data and the other as the reference; and (4) the disturbed data permit assessing the robustness of matching techniques in the presence of controlled perturbations.

The remainder of this article is structured as follows: The next section presents the concepts used to produce the datasets. The following section describes each data record associated with the testbed, as well as the parameters used to generate the synthetic data. The procedures that assure the reliability of the datasets are discussed in the final section.

Methods

The geospatial data matching testbed is composed of four groups of datasets: (1) initial, (2) morphology modified, (3) systematic disturbance, and (4) random disturbance. The initial datasets originate from mapping provided by official mapping agencies of Spain at scales 1:25,000 and 1:10,000. The other three groups are derived from the initial datasets at scale 1:25,000. Morphology modified datasets are composed of synthetic objects in a given morphology class for linear and areal features. The systematic disturbance group is composed of datasets generated by applying affine transformations to the initial data. The last group (random disturbance) is formed by data displaced by vector fields applied to the initial datasets. Each group comprises datasets for the three geometric primitives (point, line, and area), except for the morphology modified group, which has no point data. Each test dataset (scale 1:25,000) is divided into nine regions. Each region can be identified by the first digit of the object identifier (OID), e.g., 1,023 is in the first region, and 9,128 is in the ninth region.
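The OID convention above can be sketched in a few lines. This is a minimal illustration, assuming OIDs are positive integers whose leading decimal digit is the region number; the function name `region_of` is hypothetical and not part of the testbed.

```python
def region_of(oid: int) -> int:
    """Return the region number encoded as the leading digit of an OID."""
    if oid <= 0:
        raise ValueError("OID must be a positive integer")
    return int(str(oid)[0])


print(region_of(1023))  # OID 1,023 from the text
print(region_of(9128))  # OID 9,128 from the text
```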

The following subsections detail the initial datasets and the methods used to prepare each group of datasets.

Initial datasets

The initial datasets in this testbed are formed by test datasets at scale 1:25,000 and reference datasets at scale 1:10,000. Test data originate from the Base Topográfica Nacional 1:25,000 (BTN25), the national mapping provided by the Instituto Geográfico Nacional of Spain¹¹. Reference data originate from the Base Cartográfica de Andalucía 1:10,000 (BCA10), the regional mapping provided by the Instituto de Estadística y Cartografía de Andalucía¹². Test datasets were divided into nine regions (S1-S9) with their corresponding regions in the reference datasets (B1-B9).

For our testbed we selected mapping sheets with different landscapes: coast and mountain, rural and urban. Each region originates from the following 1:25,000 mapping sheets: (S1) 0896-3, (S2) 0896-4, (S3) 1003-4, (S4) 0999-1 east, (S5) 0999-2 west and 0999-4 east, (S6) 0999-1 west, (S7) 0999-3, (S8) 0999-4 west, and (S9) 0999-2 east.

The test data were selected from BTN25 as follows.

Point data were created from the ‘Building’ class by converting areal features to points using their centroids. First we selected buildings with an area less than 1,000 m² and a shape index¹³ less than 1.2. Then for the regions S1-S6 we randomly selected more than 100 objects. For the regions S7-S9 we selected objects from some agglomerated areas in order to approximate an urban environment (more than 100 objects in each area).

Area data were also obtained from the Building class, excluding the objects already selected as point data. After excluding point features, we randomly selected more than 100 objects for the regions S1-S6. For the regions S7-S9 we selected neighbouring objects in order to represent an urban environment. In these cases, more than 250 objects were picked in each dataset.

For the linear data, the first step was to homogenize the road data of the initial datasets (BTN25 and BCA10) using the same topological rules. These procedures avoided ‘broken’ lines and ‘long’ lines. Line data were then selected from BTN25 in this order: (1) motorways, (2) roads, (3) links, and (4) tracks. In order to reach at least 125 objects in each region, some tracks were selected randomly.
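The point-selection filter can be sketched as follows. This is only an illustration under stated assumptions: the text does not define the shape index of reference 13, so the sketch assumes the common compactness ratio perimeter / (2·√(π·area)), which equals 1 for a circle. The function names, the sample buildings and the sample size are all hypothetical.

```python
import math
import random


def area_perimeter_centroid(ring):
    """Shoelace area, perimeter and centroid of a simple, non-degenerate ring.

    `ring` is a list of (x, y) vertices without a repeated closing vertex.
    """
    signed2 = cx = cy = perimeter = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        signed2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
        perimeter += math.hypot(x2 - x1, y2 - y1)
    area = abs(signed2) / 2.0
    centroid = (cx / (3.0 * signed2), cy / (3.0 * signed2))
    return area, perimeter, centroid


def shape_index(area, perimeter):
    # ASSUMPTION: compactness ratio; the paper's reference 13 may define it differently.
    return perimeter / (2.0 * math.sqrt(math.pi * area))


def select_point_candidates(buildings, max_area=1000.0, max_shape_index=1.2):
    """Keep small, compact buildings and reduce each one to its centroid."""
    selected = {}
    for oid, ring in buildings.items():
        area, perimeter, centroid = area_perimeter_centroid(ring)
        if area < max_area and shape_index(area, perimeter) < max_shape_index:
            selected[oid] = centroid
    return selected


# Hypothetical buildings (coordinates in metres), keyed by OID.
buildings = {
    1001: [(0, 0), (10, 0), (10, 10), (0, 10)],   # small square: kept
    1002: [(0, 0), (100, 0), (100, 2), (0, 2)],   # elongated: shape index too high
    1003: [(0, 0), (40, 0), (40, 40), (0, 40)],   # 1,600 m2: too large
}
candidates = select_point_candidates(buildings)

# Random selection for regions S1-S6; the sample size here is illustrative only.
rng = random.Random(42)
sample = rng.sample(sorted(candidates), k=min(100, len(candidates)))
print(candidates, sample)
```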

After these selection procedures, we manually matched the selected objects from the BTN25 test data against the BCA10 reference data. We used regional orthoimagery to resolve any doubts in this task. At the end of this procedure we had 27 sets (nine regions by three geometries) and their correspondences, as represented in Table 1.

Table 1 Regions considered in the initial dataset group and their sizes.

In Table 1, the column Geometry refers to the type of geometry of the dataset. The column Region refers to the name of each region. The column Size represents the number of objects in each region. The Matching pairs columns indicate the number of matching pairs when comparing each test region (S#[PLA]) with each reference region (B#[PLA]). Due to the presence of multiple correspondence cases (1:n and m:n), the number of matching pairs differs from the size of the test dataset. For instance, the matching ‘100, 200:101, 102, 103’ represents six matching pairs: 100:101, 100:102, 100:103, 200:101, 200:102, and 200:103.
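The expansion of an m:n matching into pairs is a Cartesian product of the two OID lists. The sketch below parses the notation used in the example above; it is an assumption that this notation mirrors the supplied matching files, and `expand_matching` is a hypothetical helper name.

```python
from itertools import product


def expand_matching(record: str) -> list[tuple[int, int]]:
    """Expand 'test OIDs:reference OIDs' into individual (test, reference) pairs."""
    test_part, ref_part = record.split(":")
    test_ids = [int(t) for t in test_part.split(",")]
    ref_ids = [int(r) for r in ref_part.split(",")]
    return list(product(test_ids, ref_ids))


pairs = expand_matching("100, 200:101, 102, 103")
print(len(pairs), pairs)
```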

The last step was to move the areas to a generic place in the world, since they no longer represent any real site. To this end we mirrored, rotated or translated the data in order to decharacterize the original site. After that we had all regions in a new compound dataset that we call ‘MatchingLand’ (see Fig. 2).

Morphology modified

The morphology of linear and areal objects is a factor that may affect the performance of geospatial matching procedures. In order to deal with this factor we adopted a roughness classification for lines and developed a complexity classification for areas. Based on these morphology classifications we developed a method for generating synthetic data from some source data for a specific morphology class.

The line roughness classification is based on the road-line classification developed by Ariza-López and García-Balboa¹⁴ where the authors used a back-propagation artificial neural network (BANN) over a moving window. Since we use road data in our experiment, this method seems to be adequate for our purposes. The BANN method defined five established roughness classes for road data: (1) very smooth, (2) smooth, (3) sinuous, with stable directionality, (4) sinuous, with variable directionality, and (5) very sinuous. Figure 3 presents examples of lines classified according to this method.

The area complexity classification developed in this study is designed for building data at small scale. This method is based on two concepts: convexity and Arkin’s turning function¹⁵. We propose a complexity classification for building areas defined in four classes: (1) very simple, (2) simple, (3) complex, and (4) very complex. Class 1 groups the simplest objects, which are the convex ones without holes. Class 2 areas are convex ones with holes, and also objects that resemble a standard shape, such as ‘L’ or ‘U’ objects. More complex objects are classified according to their number of ‘turns’, i.e., the number of times that the external ring changes its turning direction (left or right). Class 3 areas are objects with at most 10 turning changes, while class 4 (very complex) buildings are those that exceed this limit. Figure 4 shows some areas classified according to this method.
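The turning-change count at the heart of this classification can be sketched with cross products of consecutive edges. This is a simplified illustration, not the DQEval implementation: it assumes a simple (non-self-intersecting) exterior ring, treats a ring with zero turning changes as convex, and takes the resemblance to a standard ‘L’ or ‘U’ shape as an input flag (`resembles_standard_shape`, hypothetical) because the text does not specify that test.

```python
import math


def turning_changes(ring):
    """Count how often the ring switches between left and right turns (cyclically)."""
    n = len(ring)
    signs = []
    for i in range(n):
        ax, ay = ring[i - 1]
        bx, by = ring[i]
        cx, cy = ring[(i + 1) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross != 0:  # collinear vertices carry no turn
            signs.append(1 if cross > 0 else -1)
    return sum(1 for i in range(len(signs)) if signs[i] != signs[i - 1])


def complexity_class(ring, has_holes=False, resembles_standard_shape=False):
    """Classes 1-4 as described in the text, with the L/U test supplied by the caller."""
    changes = turning_changes(ring)
    convex = changes == 0
    if convex and not has_holes:
        return 1
    if convex or resembles_standard_shape:
        return 2
    return 3 if changes <= 10 else 4


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
# Six-pointed star: outer radius 2, inner radius 1, turns alternate at every vertex.
star = [
    ((2 if k % 2 == 0 else 1) * math.cos(math.pi * k / 6),
     (2 if k % 2 == 0 else 1) * math.sin(math.pi * k / 6))
    for k in range(12)
]

print(turning_changes(square), turning_changes(l_shape), turning_changes(star))
```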

With the aim of increasing the population of lines and areas in each morphology class, we propose two methods for generating synthetic data by modifying the original sources according to the desired morphology class.

The method for lines works as follows: For each line in the original data, we compare its morphology class with the desired morphology class. If the difference is greater than two, or the object already has the desired class, the object remains unaltered. Otherwise, a smoothing or roughening procedure is applied in order to reach the desired morphology class. The smoothing procedure is a combination of Douglas-Peucker simplification¹⁶ with Gaussian filtering^17,18 of sigma 4. The roughening procedure applies random displacements along internal curves of the line (clockwise or counter-clockwise). These procedures do not affect the first or last points of the manipulated lines. As these procedures do not take neighbouring objects into account, some lines required manual editing in order to maintain the topology. Figure 5a shows an example of how a line, originally classified in class 3, can be flattened to class 1 or roughened to class 5.
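The two building blocks of the smoothing procedure can be sketched in plain Python. This is an illustration only: the text does not state the Douglas-Peucker tolerance, the order in which the two steps are combined, or the unit of sigma, so the sketch assumes a caller-supplied tolerance and a sigma measured in vertex positions along the line. Both functions keep the first and last points fixed, as the text requires.

```python
import math


def _point_segment_distance(p, a, b):
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Drop vertices closer than `tolerance` to the simplified line."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right  # the split vertex appears once


def gaussian_smooth(points, sigma=4.0):
    """Gaussian-weighted average of neighbouring vertices; endpoints stay fixed."""
    n = len(points)
    if n < 3:
        return list(points)
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    smoothed = [points[0]]
    for i in range(1, n - 1):
        wsum = sx = sy = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # clamp indices at the line ends
            wsum += w
            sx += w * points[j][0]
            sy += w * points[j][1]
        smoothed.append((sx / wsum, sy / wsum))
    smoothed.append(points[-1])
    return smoothed


zigzag = [(0, 0), (1, 0.1), (2, -0.1), (3, 0.1), (4, 0)]
simplified = douglas_peucker(zigzag, tolerance=0.5)
smoothed = gaussian_smooth(zigzag, sigma=4.0)
print(simplified)
print(smoothed)
```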

Similar to the line method, the area method also generated synthetic data. The procedure does not affect class 4 areas (very complex). For each area in the original data, we compare its morphology class with the desired morphology class. If there is no difference, the object remains the same. Otherwise, we apply a procedure that randomly disturbs or simplifies the object's geometry in order to achieve the desired morphology class. The disturb procedure raises the complexity of polygons by means of perforating one without holes or creating a 'corner' at a random vertex. The simplify procedure acts over non-convex polygons by removing the corners that least influence the area size. Figure 5b presents an example of an area that in the source data was classified as simple (class 2—'L' shape) that was simplified to class 1 and was disturbed to class 3.

Systematic disturbance

The presence of positional systematic disturbance is a factor that potentially affects the performance of geospatial data matching procedures. The aim of this group of datasets is to identify the influence of intentional systematic perturbations in position over matching procedures. Moreover, these datasets can also be valuable for data quality research studies that investigate positional quality. Our methodology is similar to the study of Mozas-Calvache and Ariza-López¹⁹, where the authors simulated several displacements over original data, such as translations, rotations, and scaling.

We propose generating synthetic data from the original data by applying a set of systematic disturbances represented by means of an affine transformation. This transformation is a composition of translations, rotations, and scaling²⁰. Hence our systematic disturbance method is designed to reflect these three kinds of transformations. The approach requires a set of standard displacements that define the entire process, and it also requires a minimum bounding rectangle (MBR). For each displacement we generate a set of systematic disturbances, namely: (1) translations in eight directions, (2) counter-clockwise and clockwise rotations over three different pivots, and (3) two scaling factors (dilation and shear).

The translations are determined by the standard displacement applied in eight directions, beginning at 0° with increments of 45° (Fig. 6a). The rotation angle is calculated for each dataset, taking into account the standard displacement and half of the MBR's diagonal (Fig. 6b). Using this angle we have six possible rotations: two directions (counter-clockwise and clockwise) by three rotation pivots in relation to the MBR (lower-left, centre, and upper-right) (Fig. 6c). Finally, the scaling factors for dilation and shear are calculated using the relation between half of the MBR's diagonal and the standard displacement, as we can see in Fig. 6d.
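The eight translations and a rotation about an MBR pivot can be sketched as below. Two points are assumptions: directions are measured counter-clockwise from the positive x axis (Fig. 6a may use another convention), and the rotation angle is taken as an input because its exact derivation from the standard displacement and the MBR half-diagonal is given only in Fig. 6b. The function names are hypothetical and unrelated to the AffineGT class.

```python
import math


def translation_vectors(displacement):
    """Eight (dx, dy) vectors of equal length, at 0, 45, ..., 315 degrees."""
    vectors = []
    for k in range(8):
        theta = math.radians(45 * k)
        vectors.append((displacement * math.cos(theta), displacement * math.sin(theta)))
    return vectors


def mbr_pivots(xmin, ymin, xmax, ymax):
    """Lower-left, centre and upper-right pivots of a minimum bounding rectangle."""
    return [(xmin, ymin), ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0), (xmax, ymax)]


def rotate_about(point, pivot, angle_rad):
    """Rotate a point counter-clockwise about a pivot; use a negative angle for clockwise."""
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (pivot[0] + c * dx - s * dy, pivot[1] + s * dx + c * dy)


vectors = translation_vectors(5.0)
pivots = mbr_pivots(0.0, 0.0, 100.0, 50.0)
print(vectors)
print(pivots)
```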

**Figure 6: Systematic disturbance as a function of a standard displacement.**

After determining the translations, rotations and scalings for each standard displacement, in each dataset (its own MBR), these perturbations must be combined in order to create a set of affine transformations that will be used to generate the synthetic perturbed data. The no-disturb configuration (no translation, no rotation, no scaling) is added prior to creating the affine transformations. Then all possible combinations among translations, rotations and scalings are generated.

For instance, if we choose only two distinct standard displacements, this approach is able to generate more than 1,000 different transformations. The number of combinations is calculated as follows, where the leading 1 in each factor is the no-disturb option: combinations = translations × rotations × scalings = (1 + 8×2) × (1 + 6×2) × (1 + 2×2) = 17 × 13 × 5 = 1,105. For each affine transformation, a new synthetic dataset is created.
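The count can be checked with a short calculation. The sketch below simply restates the formula for an arbitrary number of standard displacements; `count_affine_combinations` is a hypothetical name.

```python
def count_affine_combinations(n_displacements: int) -> int:
    """Number of affine transformations, including the no-disturb option in each factor."""
    translations = 1 + 8 * n_displacements  # eight directions per displacement
    rotations = 1 + 6 * n_displacements     # two directions by three pivots
    scalings = 1 + 2 * n_displacements      # dilation and shear
    return translations * rotations * scalings


print(count_affine_combinations(2))
```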

Random disturbance

This last group of datasets acts similarly to the systematic disturbance, but this approach uses random perturbations over original data in order to simulate this random behaviour. Other studies²⁰ adopted random errors in each vertex, including correlated displacement by lines. In this study we propose a new methodology to disturb geospatial data using vector fields created for a given standard displacement.

The key concept of our methodology for random disturbance is the displacement vector field. This vector field works as a ‘force’ field that modifies the geospatial features by acting on their coordinates by means of random displacements.

There are three parameters in this approach: standard displacement, field resolution, and sigma. The standard displacement works as in the systematic disturbance (previous subsection), i.e., it defines the amplitude of the disturbance. The vector field in this method is created according to a regular tessellation of the source data MBR, so we need a field resolution in order to define the cells. Finally, the sigma value represents the internal variation of the standard displacement. For instance, if we use a sigma of 10%, the random displacements will vary ±10% in amplitude in relation to the standard displacement.

The vector field disturbance method works as follows: After defining the parameters, it creates a regular tessellation using the field resolution over the dataset's MBR. This tessellation has at least two additional columns and rows to guarantee that the data border fits inside the vector field (Fig. 7a illustrates an example). Then, using an unaligned systematic pattern²¹, it randomly creates a set of x values (one for each row) and a set of y values (one for each column). These values define the coordinates of each generator of our vector field, one per cell (Fig. 7b). The next step is to define the direction and amplitude of each field generator. The direction is randomly determined, while the amplitude is calculated as a function of the given standard displacement plus a random variation limited by the sigma (σ) parameter (Fig. 7c). In the end we have a vector field with a displacement vector for each cell in the tessellation.
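The construction of the field can be sketched as follows. This is a minimal illustration, not the VectorFieldTransformation class: it assumes uniform random draws, pads the tessellation with exactly one extra cell on each side, and interprets sigma as a fraction (0.1 for 10%). The function name and the `seed` parameter are hypothetical.

```python
import math
import random


def build_vector_field(mbr, displacement, resolution, sigma, seed=None):
    """Return one generator (gx, gy, dx, dy) per cell of a padded regular tessellation."""
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = mbr
    # Two additional columns and rows: one cell of padding on each side of the MBR.
    cols = math.ceil((xmax - xmin) / resolution) + 2
    rows = math.ceil((ymax - ymin) / resolution) + 2
    x0, y0 = xmin - resolution, ymin - resolution
    # Unaligned systematic pattern: one x offset per row, one y offset per column.
    x_offsets = [rng.uniform(0.0, resolution) for _ in range(rows)]
    y_offsets = [rng.uniform(0.0, resolution) for _ in range(cols)]
    field = []
    for r in range(rows):
        for c in range(cols):
            gx = x0 + c * resolution + x_offsets[r]
            gy = y0 + r * resolution + y_offsets[c]
            direction = rng.uniform(0.0, 2.0 * math.pi)
            amplitude = displacement * (1.0 + rng.uniform(-sigma, sigma))
            field.append((gx, gy, amplitude * math.cos(direction), amplitude * math.sin(direction)))
    return field


field = build_vector_field((0.0, 0.0, 100.0, 50.0), displacement=5.0,
                           resolution=25.0, sigma=0.1, seed=1)
print(len(field))
```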

The disturbance vector field is composed of a set of displacement vectors which quantify the disturbance to be applied to a dataset. We propose using this vector field as a geometric transformation over the original dataset. The influence of each vector on a coordinate of the perturbed data is determined by an interpolation function. In this approach we adopt inverse distance weighting (IDW) interpolation with power 2 and a search radius of 2.5 times the resolution, as indicated by Gumiaux et al.²². The synthetic perturbed data are generated for each random vector field created from the given parameters (standard displacement, field resolution, and sigma).
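The interpolation step can be sketched as a plain IDW average of the generators that fall inside the search radius. This is an illustration rather than the SpatialInterpolation class: generators are assumed to be (gx, gy, dx, dy) tuples, a coordinate that coincides with a generator takes that generator's vector, and a coordinate with no generator in range is left undisplaced. These edge-case choices and the function names are assumptions.

```python
import math


def idw_displacement(x, y, generators, resolution, power=2.0, radius_factor=2.5):
    """Interpolate the displacement at (x, y) from generators (gx, gy, dx, dy)."""
    radius = radius_factor * resolution
    wsum = sx = sy = 0.0
    for gx, gy, dx, dy in generators:
        dist = math.hypot(x - gx, y - gy)
        if dist > radius:
            continue  # outside the search radius
        if dist == 0.0:
            return (dx, dy)  # exactly on a generator
        weight = 1.0 / dist ** power
        wsum += weight
        sx += weight * dx
        sy += weight * dy
    if wsum == 0.0:
        return (0.0, 0.0)  # no generator in range
    return (sx / wsum, sy / wsum)


def disturb(coords, generators, resolution):
    """Apply the interpolated displacement to every coordinate of a geometry."""
    moved = []
    for x, y in coords:
        dx, dy = idw_displacement(x, y, generators, resolution)
        moved.append((x + dx, y + dy))
    return moved


generators = [(0.0, 0.0, 1.0, 0.0), (10.0, 0.0, 0.0, 1.0)]
print(idw_displacement(5.0, 0.0, generators, resolution=10.0))
print(disturb([(0.0, 0.0), (5.0, 0.0)], generators, resolution=10.0))
```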

Code availability

TerraLib is an open-source GIS library developed by the Brazilian National Institute for Space Research (INPE)²³. The AffineGT class in TerraLib is used to create the synthetic data in the systematic disturbance group. The TerraLib code is licensed under the GNU Lesser General Public License version 2.1 as published by the Free Software Foundation. We used TerraLib version 4.2.2, which can be found at the TerraLib repository²⁴.

One of the subprojects of TerraLib is TerraOGC, a framework for Web-GIS development that has been used in web services research^25,26. Inside TerraOGC there is a data quality processing module (DQEval) which contains most of the code related to this testbed. The GeometricTools class contains the methods associated with the morphology modified group (classifiers and transforms). The VectorFieldTransformation class uses the SpatialInterpolation class to generate the data for the random disturbance group. TerraOGC code is licensed under the GNU General Public License version 3. We used TerraOGC version 1.2.6, which can be found at its repository²⁷.

Data Records

This section describes the four groups of datasets generated using the methodology described in the previous section. All geospatial data are supplied in the ESRI Shapefile format²⁸. The projection system is UTM zone 28 North with datum WGS-84 (EPSG:32628). The list of matching pairs, composed of object identifiers (OID), is supplied in plain text.