Background & Summary

Location-Based Services (LBS) has become a multibillion-dollar industry that is expected to continue to steadily grow over the upcoming years1. Some of these services include location-based marketing2, authentication3, gaming4, and social networking5, among others. A key enabling technology at the heart of such services is positioning6. However, the de facto standard for positioning, the Global Navigation Satellite System (GNSS), has two major issues that limit the use of LBS. First, the availability and accuracy of GNSS are severely degraded in urban areas due to shadowing and multipath effects7. Second, GNSS chipsets are notorious for being power-hungry, which is problematic for power-constrained devices such as smartphones and smartwatches8. A more energy-efficient approach for positioning is achieved using cellular networks. Yet, the offered accuracy, which is in the order of tens9 to hundreds10 of meters, fails to satisfy the accuracy requirements imposed by many services and applications.

Recently, in an attempt to devise positioning solutions that can yield better performance, researchers have turned their attention to fingerprinting, a positioning technique that has achieved great success in the indoor positioning domain, a domain where GNSS signals are generally unavailable11. Fingerprinting is used to identify spatial locations based on location-dependent measurable features (location fingerprints). These fingerprints can be of different types such as WiFi fingerprints12, Bluetooth fingerprints13, cellular fingerprints14, and magnetic field fingerprints15. From an implementation perspective, the fingerprinting approach is a two-phase process that consists of an offline phase and an online phase. During the offline phase, site surveying is performed by sampling fingerprints of an area of interest at predefined reference points (RPs). Fingerprints are often sampled using a smartphone or a dedicated data acquisition platform. Fingerprints, along with the coordinates at which they were sampled, are stored in a database. The data is then used to train a machine learning algorithm to learn a function that best maps sampled fingerprints to their ground truth coordinates. Afterward, the learned function is utilized during the online phase to infer a user’s coordinates given the fingerprints measured at the user’s location. The process of fingerprinting is visually depicted in Fig. 1.

Fig. 1
figure 1

A graphical representation of the fingerprinting approach for positioning.

Despite its low complexity and ability to produce accurate location estimates, the main drawback of fingerprinting is the laborious and time-consuming site surveying task. This drawback has led many studies to resort to either simulated16 or crowdsourced data17, where the former never fully reflects the real world and the latter may suffer from integrity and consistency problems. The proposal of OutFin aims at addressing these drawbacks by making real-world measurements and reliable ground truth coordinates publicly available. Table 1 summarizes the main aspects of publicly available fingerprinting datasets published since 2014. Compared to these datasets, OutFin combines several features that place it in a unique position:

  • To the best of our knowledge, OutFin is the first multi-modal, outdoor fingerprints dataset to be publicly available.

  • The data was collected using two contemporary smartphones rather than outdated smartphones or custom-built platforms.

  • The data was collected at highly granular RPs with 61 to 183 centimeters (cm) spacing.

  • OutFin not only provides location fingerprints, but it also provides information about the devices that generated them (e.g., the service set identifier of an access point, the communication protocol of a Bluetooth device, and the number of neighboring cells of a serving cell).

  • OutFin is accompanied by an interactive map that provides various information about the collection environment, such as RP coordinates (both ground truth and Global Positioning System (GPS) estimates) and building ground elevations and heights.

Table 1 A comparison of the main aspects of publicly available fingerprinting datasets published since 2014.

In addition to facilitating the research and development of outdoor positioning solutions that are based on the fingerprinting approach, OutFin might spur innovation in other research realms, including but not limited to: machine learning18, Bayesian optimization19, simultaneous localization and mapping20, and map-matching21.

Methods

Data acquisition platform

OutFin was created using two smartphones for data acquisition: Samsung’s Galaxy S10+ (Phone 1) and Google’s Pixel 4 (Phone 2). The former was released in the U.S. market on March 8, 2019, while the latter was released on October 24, 2019. Both smartphones ran on Android 10, released on September 3, 2019. The motivation behind choosing Android-powered smartphones was twofold. First, Android provides application programming interfaces (APIs) that allow for acquiring raw data at the hardware level. Second, Android-powered smartphones account for over 74 of the market share worldwide22. The two smartphones were attached to a tripod head using a dual mount that horizontally separated them by 10 (see Fig. 2 (Site 1)). Both smartphones were in portrait mode. The tripod kept them at a fixed height of 132. The tripod head was adjusted to tilt the smartphones at a 40 degree (°) angle to the vertical plane. The same set of third-party apps used for data collection were installed on both smartphones. These apps, which can be downloaded from the Google Play Store, included: WiFi Analyzer Pro (App 1)23, Bluetooth Scanner Extreme Edition (App 2)24, NetMonitor Pro (App 3)25, and Physics Toolbox Sensor Suite Pro (App 4)26. The apps allowed for conveniently collecting and exporting WiFi, Bluetooth, cellular, and sensor data, respectively.

Fig. 2
figure 2

Pictures of the four sites where data was collected.

Data collection environment

Data collection was performed at the University of Denver’s campus where four separate sites were considered. The motivation behind collecting data at separate sites was to offer diversity. For instance, each site is different in terms of its reference points’ number, arrangement, and spacing. Also, due to different ground elevations and heights of surrounding buildings, each site has different visibility to the GNSS. This is reflected by GPS errors produced at a given site. The mean GPS error was 12.1 meters (m), 11.4 m, 4.3 m, and 12.7 m for the first, second, third, and fourth site, respectively. GPS estimates are provided in OutFin to help researches compare their system’s performance to that obtained by GPS. A description of the data collection sites is provided below:

Site 1:  Site 1 represents a portion of a covered sidewalk next to the east side of the 11.8 high Boettcher Auditorium (see Fig. 2). Site 1 contained 31 RPs arranged in three north-to-south lines (see Fig. 3). The spacing between RPs in each line was fixed at 152.5 and the distance between lines was fixed at 76.25.

Fig. 3
figure 3

An aerial map of the collection environment showing the four collection sites and the 122 RPs. RPs are color-coded according to the date of collection.

Site 2: Site 2 is 245 north of Site 1 and represents a portion of a covered sidewalk next to the north side of the 11.5 high Sie International Relations Complex (see Fig. 2). Site 2 contained 23 RPs arranged in a single east-to-west line (see Fig. 3). The spacing between RPs was fixed at 101.5.

Site 3: Site 3 is 40 south of Site 2 and represents a portion of an open terrace next to the south side of the Sie International Relations Complex (see Fig. 2). Site 3 contains 35 RPs arranged in a seven-column and five-row grid (see Fig. 3). The spacing between column RPs and row RPs were fixed at 61.

Site 4: Site 4 is 288 south of Site 3 and represents a portion of an open sidewalk by the south and west sides of the 13.4 high Seeley Mudd Science Building (see Fig. 2). Site 4 contains 33 RPs arranged in a three-column and eleven-row grid (see Fig. 3). The spacing between column RPs was fixed at 183, while the spacing between row RPs was fixed at 146.5.

Each RP is uniquely identified by an integer (an ID number) that symbolizes its order in the collection campaign. For example, data collection started with RP 1 on November 3, 2019, and ended with RP 122 on November 9, 2019. The ground truth locations of RPs belonging to a site are expressed with respect to a local frame of reference. Additionally, the easting and northing (X,Y) coordinates of all RPs were provided with respect to a global coordinate system (i.e., NAD83(2011)/Colorado Central). This was accomplished with help from the university’s Department of Geography & the Environment and by using a geographic information system software27.

Procedure

Data collection spanned six days (3–5/11/2019 and 7–9/11/2019) and involved four sites with a total of 122 RPs. Due to the fact that rain could severely affect wireless signal measurements, we did not collect any data on rainy days. The RPs surveyed each day are indicated in Fig. 3. The sequence of steps performed during a day of data collection are described below:

Step 1:  Before mounting the smartphones to the tripod, App 4 was launched to collect magnetic field measurements by rotating the smartphones around their X, Y, and Z axes multiple times (see Fig. 4). This process was performed for at least two minutes at a sampling rate of 1 Hertz (Hz). The resultant data was exported as a comma-separated values (CSV) file, named with the smartphone’s name and date (e.g., Phone1_051119.csv). Such data can be used to offset the hard-iron distortion caused by placing the smartphones close to each other. After this process, the smartphones were mounted to the tripod and placed at the RP where data was to be collected.

Fig. 4
figure 4

Illustration of the X, Y, and Z axes relative to a typical smartphone. Figure reproduced from44.

Step 2: App 1 was launched to collect WiFi data, ensuring that at least two WiFi scans were performed along the four cardinal directions by routing the tripod head counterclockwise, 90 at a time. A WiFi scan recorded the received signal strength (RSS) from all access points (APs) in range in addition to information about the APs themselves. Android only supports passive scanning, and the duration of a scan varies depending on the smartphone’s WiFi hardware and firmware. However, Google recently released a restriction that limits the frequency of scans that an app can perform to only four times in a 2-minute period28. This restriction applies to Android 9 and higher. The app reported scan results approximately every 30 seconds for Phone 1 and every 25 seconds for Phone 2. For Site 1 and 4’s RPs, data collection started facing south and ended facing west. For Site 2 and 3’s RPs, data collection started facing west and ended facing north. Collecting data along four directions mitigates the shadowing effect caused by the body of the data collector who is constantly facing the smartphone screens. Scan outcomes were exported as a CSV file, named with the smartphone’s model as a prefix and the RP’s ID as a suffix (e.g., Phone2_WiFi_73.csv).

Step 3:  App 2 was launched to collect Bluetooth data. Android allows active Bluetooth scanning; thus, scans can be triggered by a user-level app. A Bluetooth scan involves an inquiry scan of approximately 12 seconds, followed by a page scan for each discovered device to retrieve its information and the RSS29. The duration of a scan, for both smartphones, took anywhere between 15 and 30 seconds, primarily depending on the number of discoverable devices in the area. As in Step 2, the shadowing effect was accounted for by performing two scans along each cardinal direction. Scan results were exported as a CSV file with a naming convention like that described in Step 2 (e.g., Phone1_Bluetooth_29.csv).

Step 4:  App 3 was launched to collect cellular data. A smartphone’s cellular modem constantly scans the cellular network for cell selection/reselection and handover purposes. Android provides APIs to extract information associated with scans such as Reference Signal Received Power (RSRP) and cell identity information30. The sampling frequency can be set manually and was fixed to 1. As noted in Step 2, the shadowing effect was accounted for by collecting at least fifteen samples along each cardinal direction. Collected data was exported as a CSV file with a naming convention like that described previously (e.g., Phone2_Cellular_14.csv). Moreover, App 3 allowed for collecting GPS data as part of the data record. The GPS readings corresponding to RPs belonging to the same site were extracted and stored under a CSV file named with the site’s name as a prefix and the smartphone’s model and app name as a suffix (e.g., Site1_GPS_Phone1_App3.csv).

Step 5:  App 4 was launched to collect sensor data. A smartphone’s built-in sensors can be classified as either hardware-based, such as the magnetometer and gyroscope, or software-based, such as the gravity and linear acceleration sensors. Android provides APIs for accessing and acquiring raw sensor data at defined rates31. The sampling frequency was set to 1. Although sensor measurements are not subject to the shadowing effect, data was collected along the four cardinal directions to both conform with the survey pattern established above and diversify the dataset since magnetic field strength can vary greatly even within a small area (in the orders of a few centimeters or less)32. At least fifteen samples were collected along each direction, following the same directions described in Step 2. Sensor data was exported as a CSV file with a naming convention like that described previously (e.g., Phone1_Sensors_58.csv). App 4 also allowed for collecting GPS data as part of the data record. As in Step 4, the GPS readings corresponding to RPs belonging to the same site were extracted and stored under a CSV file with a naming convention like that described in Step 4 (e.g., Site3_GPS_Phone2_App4.csv).

Step 6:  The tripod was moved to the next RP and Steps 2–5 were repeated. This process continued until all RPs designated for a given day were surveyed.

Data Records

On April 2, 2020, the OutFin dataset was made publicly available on figshare33. Figure 5 shows the dataset’s file structure and presents an overview of all CSV file types, their field labels, and a data record example. A description of the CSV file types and their field labels is provided below:

  1. I.

    <phone>_WiFi_<RP>.csv contains WiFi data collected by a smartphone via App 1:

    1. SSID: The Service Set IDentifier (i.e., the AP’s network name).

    2. BSSID: The Basic Service Set IDentifier (i.e., the AP’s media access control address (MAC address)) encoded as an integer.

    3. Channel: The channel number that the AP uses for communication.

    4. Width: The bandwidth of the channel in megahertz (MHz); can be 20, 40, or 80 MHz.

    5. Center_Frequency_0: The center frequency of the primary channel in MHz.

    6. Center_Frequency_1: The center frequency of the 40 or 80 MHz-wide channel in MHz. If a 20-MHz channel is used, then Center_Frequency_1 Center_Frequency_0.

    7. Band: The AP’s frequency band in gigahertz (GHz); can be either 2.4 or 5 GHz.

    8. Capabilities: Describes the authentication, key management, and encryption schemes supported by the AP.

    9–17. RSS_0–RSS_8: The Received Signal Strengths in decibel-milliwatts (dBm), with respect to the back-to-back scans.

  2. II.

    <phone>_Bluetooth_<RP>.csv contains Bluetooth data collected by a smartphone via App 2:

    1. Date_Time: The date and time the scan was triggered as YYYY-MM-DD and hh:mm:ss. Denver, Colorado is in the Mountain Time Zone, which is seven hours behind Coordinated Universal Time (UTC-07:00).

    2. New_Device: A binary flag that is set to 1 if the remote Bluetooth device is discovered for the first time at the current RP.

    3. Date_Time_first_seen: The date and time the device was first discovered at the current RP. The date and time formats are as described above.

    4. MAC_address: The device’s MAC address encoded as an integer.

    5. Name: The device’s friendly name.

    6. Manufacturer: The device’s manufacturer name.

    7. Protocol: The Bluetooth protocol that the device uses for communication; can be CLASSIC (Basic Rate/Enhanced Data Rate (BR/EDR)), BLE (Bluetooth Low Energy), or DUAL (BR/EDR + BLE).

    8, 9. Minor_Device_Class, Major_Device_Class: Indicates the device’s minor and major classes, respectively, as specified by the Bluetooth Special Interest Group (SIG)34.

    10–17. Audio, Capturing, Networking, Object_Transfer, Positioning, Telephony, Rendering, Information: Binary flags that are set to 1 if the device is associated with any of the eight service classes specified by the Bluetooth SIG34.

    18. RSS: The Received Signal Strength in dBm.

  3. III.

    <phone>_Cellular_<RP>.csv contains cellular data collected by a smartphone via App 3. It should be noted that the entire collection environment was covered by Long-Term Evolution (LTE) cells. The Public Land Mobile Network (PLMN) identifier is 310410:

    1. Date_Time: The date and time the sample was captured. The date and time formats are as described above.

    2. UMTS_neighbors: The number of neighboring Universal Mobile Telecommunications Service (UMTS) cells.

    3. LTE_neighbors: The number of neighboring LTE cells.

    4. RSRP_strongest: The Reference Signal Received Power, in dBm, corresponding to the strongest neighboring cell, which employs the same technology as the serving cell.

    5. TAC: The Tracking Area Code, which uniquely defines a group of cells within a PLMN.

    6. eNB_ID: The E-UTRAN (Evolved-UMTS Terrestrial Radio Access Network) NodeB IDentifier that is used to uniquely identify an eNB (i.e., a base station in LTE) within a PLMN.

    7. Cell_ID: The Cell IDentifier, which is an internal descriptor for a cell. It can take any value between 0 and 255.

    8. PCI: The Physical Cell Identifier that is used to indicate the physical layer identity of a cell. It can take any value between 0 and 503.

    9. ECI: The E-UTRAN Cell Identifier that is used to uniquely identify a cell within a PLMN. ECI = 256 × eNB_ID + Cell_ID.

    10. Frequency: The downlink frequency band in MHz.

    11. EARFCN: The downlink E-UTRAN Absolute Radio Frequency Channel Number.

    12. TA: The Timing Advance value which ranges from 0 to 1282. A change of 1 in TA corresponds to a 156m round-trip distance35. For example, if TA = 7, then the eNB is located within a 546 radius from the smartphone.

    13. RSRP: The Reference Signal Received Power in dBm.

    14. RSRQ: The Reference Signal Received Quality in decibel (dB).

  4. IV.

    <phone>_Sensors_<RP>.csv contains sensor data collected by a smartphone via App 4:

    1. Time: The time the sample was captured. The time format is as described above.

    2–4. ax, ay, az: The linear acceleration, in meters per second squared (m/s^2), along the smartphone’s X, Y, and Z axes, respectively.

    5–7. wx, wy, wz: The angular velocity, in radian per second (rad/s), around the smartphone’s X, Y, and Z axes, respectively.

    8–10. Bx, By, Bz: The magnetic field strength, in microtesla (μT), along the smartphone’s X, Y, and Z axes, respectively.

    11–13. gFx, gFy, gFz: The g-force measured as the ratio of normal force to gravitational force (FN/Fg), along the smartphone’s X, Y, and Z axes, respectively.

    14–16. Yaw, Pitch, Roll: The angle of rotation, in degrees (°), around the smartphone’s X, Y, and Z axes, respectively.

    17. Pressure: The atmospheric pressure in hectopascal (hPa).

    18. Illuminance: The illuminance in lux (lx).

  5. V.

    <site>_Local.csv contains the local coordinates of RPs belonging to a site. Each site has its own frame of reference and the origins are at RPs 10, 122, 60, and 99 for Sites 1, 2, 3, and 4, respectively.

    1. RP_ID: The Reference Point IDentifier.

    2–4. X, Y, Z: The X, Y, and Z coordinates of the RP in centimeters (cm).

  6. VI.

    <site>_NAD83.csv contains the global coordinates of RPs belonging to a site with respect to the NAD83(2011)/Colorado Central coordinate system.

    1. RP_ID: The Reference Point IDentifier.

    2. X, Y: The X and Y coordinates of the RP in meters (m).

  7. VII.

    <site>_GPS_<phone>_App3.csv contains the GPS coordinates of RPs belonging to a site as computed by the smartphone’s GPS chipset and reported by App 3.

    1. RP_ID: The Reference Point IDentifier.

    2. Date_Time: The date and time the sample was captured. The date and time formats are as described above.

    3,4. Latitude, Longitude: The latitude and longitude coordinates of the RP.

  8. VIII.

    <site>_GPS_<phone>_App4.csv contains the GPS coordinates of RPs belonging to a site as computed by the smartphone’s GPS chipset and reported by App 4.

    1. RP_ID: The Reference Point IDentifier.

    2. Time: The time the sample was captured. The time format is as described above.

    3,4. Latitude, Longitude: The latitude and longitude coordinates of the RP.

  9. IX.

    <phone>_<date>.csv contains sensors data collected by a smartphone via App 3 before the smartphone is mounted to the tripod. Field labels are identical to that described in IV (<phone>_Sensors_<RP>.csv).

Fig. 5
figure 5

Directory tree of the OutFin dataset along with CSV file types and example data records. <phone> {Phone1,Phone2}, <RP> {1,2,...,122}, <site> {Site1,Site2,Site3,Site4}, and <date> {031119,041119,051119,071119,081119,091119}.

Technical Validation

The technical quality of the OutFin dataset was evaluated using experiments that consider two basic requirements that any high-quality dataset should satisfy, i.e., reliability and validity. Additionally, as a demonstration of the dataset’s potential for positioning applications, a number of practical usage examples are presented.

Measurement reliability

A data acquisition platform is said to be reliable if it provides consistent measurements at different points in time. To this end, before the collection campaign, WiFi, Bluetooth, cellular, and sensor data was captured over three different days at the same location. Spearman’s and Kendall’s correlation coefficients were then used to quantify the degree of consistency between temporal measurements for a given phone. Table 2 shows Spearman’s and Kendall’s correlation coefficients for the two smartphones for all possible pairs of days. Given that correlation results are high (i.e., close to the maximum value of 1.0), it can be concluded that the dataset possesses a high degree of reliability.

Table 2 Results of the correlation analysis between the measurements obtained on three different days for Phone 1 and Phone 2.

Measurement validity

A data acquisition platform is said to be valid if it accurately measures what it is intended to measure. In some cases, this requires the presence of theoretically-derived data to compare experimental data against. For example, WiFi RSS values can be computed using a path loss model. An input to the model is the distance between the transmitter and receiver. However, obtaining such inputs is not feasible since the exact location of all APs in the environment needs to be known. In the absence of theoretically-derived data, validity can be assessed by comparing data generated by different sources and checking for consistency. Accordingly, for a given day, Spearman’s and Kendall’s correlation coefficients were used to quantify the degree of consistency between the measurements obtained by the phones. The correlation results for the foregoing three days are shown in Table 3. These results demonstrate high levels of consistency, which attests to the validity of the dataset.

Table 3 Results of the correlation analysis between the measurements obtained from Phone 1 and Phone 2 for three different days.

As graphical evidence of measurement validity, Fig. 6 compares some of the data generated by the smartphones at randomly selected RPs side-by-side. Plots of the same data type exhibit the same profile despite corresponding to two different smartphones. Table 4 reports descriptive statistics of the data collected by each phone with respect to various variables. These statistics are compared against previously reported reference values, where applicable. The statistics displayed in Table 4 further support the validity of the dataset by ruling out the possibility that the dataset contains unrealistic, erratic, or random data.

Fig. 6
figure 6

Visualization of the data collected by Phone 1 and Phone 2 over randomly selected RPs. WiFi, Bluetooth, and cellular data are represented using parallel coordinate plots of the most important features, while sensor data are represented using time plots of magnetic field strength, angle of rotation, atmospheric pressure, and illuminance. All features are normalized between 0 and 1.

Table 4 Descriptive statistics of the OutFin dataset.

Usage Examples

This subsection provides a brief demonstration of some of the application domains that OutFin can be used for. These include fingerprint interpolation, feature extraction, performance evaluation, and signal denoising.

Fingerprint interpolation

Building a fingerprint map is usually required to provide positioning in a continuous fashion. The resolution of a map depends highly on the RP granularity (the higher the RP granularity, the better the map resolution). However, collecting fingerprints at highly granular RPs is time-consuming and labor intensive. Thus, interpolation methods are often employed to calculate the fingerprints between the locations of known fingerprints36. The choice of an interpolation technique is pivotal to the resulting map. For example, Fig. 7 compares the magnetic field maps created for Site 3 by two different interpolation techniques, namely linear and cubic interpolation. Clearly, the resulting maps are not identical, which suggests that a positioning algorithm would exhibit a difference in performance depending on the employed map.

Fig. 7
figure 7

Interpolated magnetic field magnitude of Site 3 using linear interpolation (left) and cubic interpolation (right). The maps were generated using calibrated magnetic field measurements from Phone 1 and Phone 2.

Feature extraction

A WiFi fingerprint has entries for all APs detected in an entire environment, but only a subset of these APs is observed at different locations. This is especially true for large-scale environments. For example, OutFin contains measurements from 1,379 unique APs; however, on average, only 10 of these APs are observed at any given RP. Consequently, feature extraction techniques are often utilized to reduce the dimensionality of the fingerprint space in order to achieve efficient and robust positioning37. Figure 8 compares two dimensionality reduction methods, i.e., the autoencoder and principal component analysis (PCA). The reconstruction cost obtained by the autoencoder is lower than that obtained by PCA. This suggests that the autoencoder is better at compressing the fingerprint space into a lower dimensional representation that comprises the informative content of the fingerprint space.

Fig. 8
figure 8

The 3D codes for 18 WiFi RSS measurements (9 measurements per phone) for 10 randomly selected RPs produced by the autoencoder (left) and PCA (right). MSE: mean squared error; PC: principal component; LV: latent variable.

Performance evaluation

When proposing a new positioning method, the performance of the proposed method is often evaluated against the performance of previously proposed methods. It is often the case that at the heart of many of the methods benchmarked against is a machine learning algorithm, such as k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), Decision Tree, or Naive Bayes38. Therefore, with the purpose of comparing the performance of such algorithms, the positioning problem was casted as a classification task where each RP is treated as a unique class. Various performance metrics were considered, including classification metrics, positioning error, and computational complexity. For the sake of fair comparison, the parameters of each algorithm were fine-tuned using grid search and cross-validation. Evaluation results, shown in Table 5, are reported on the Bluetooth measurements collected from Site 4. The results demonstrate that different algorithms can be ranked differently depending on the chosen performance metric. For example, the best classification accuracy was achieved by RBF SVM, while the lowest mean positioning error was achieved by k-NN.

Table 5 Performance evaluation of commonly used algorithms for positioning with respect to various metrics.

Signal denoising

Signal loss can negatively impact the performance of a positioning system. Thus, denoising techniques are often integrated as a preprocessing step to enhance positioning39. As an example, a denoising autoencoder was utilized as a denoising agent where the feature vector of a cellular fingerprint is corrupted to emulate randomized loss of data. The degree of corruption is controlled by a predefined probability (ploss) where, for example, a ploss of 0.03 indicates a 3 chance of setting a feature to zero. Figure 9 demonstrates the differences in performance between using noisy cellular features and their denoised versions for positioning in Site 2. On average, the use of the denoising step resulted in a 1.43 improvement in accuracy and a 13.25 reduction in positioning error.

Fig. 9
figure 9

Noisy vs. denoised features for positioning. For a given ploss value, the results were generated using 3,111 cellular samples collected by both phones from Site 2. A k-NN algorithm is used for comparison where 60 of the samples were used for training and the remaining 40 for testing.