Home-to-school pedestrian mobility GPS data from a citizen science experiment in the Barcelona area

The analysis of pedestrian GPS datasets is fundamental to advancing the study and design of walkable cities. The highest-resolution GPS data can characterize micro-mobility patterns and pedestrians' micro-motives in relation to a small-scale urban context. Purpose-based recurrent mobility data inside people's neighbourhoods is an important source for these sorts of studies. However, micro-mobility data around people's homes is generally unavailable, and if data exists, it is often not shareable due to privacy issues. Citizen science and its public involvement practices in scientific research are valid options to circumvent these challenges and provide meaningful datasets for walkable cities. The study presents GPS records from single-day home-to-school pedestrian mobility of 10 schools in the Barcelona Metropolitan area (Spain). The research provides pedestrian mobility from an age-homogeneous group of people. The study shares processed records with specific filtering, cleaning, and interpolation procedures that can facilitate and accelerate data usage. Citizen science practices during the whole research process are reported to offer a complete perspective of the data collected.


Background & Summary
Intertwined with sustainability, social justice, and climate change, mobility has moved to the centre of policy-makers' agendas in recent years [1]. In this context, motorized traffic is considered to be one of the main problems our cities face today as it leads to a wide range of social and socio-environmental consequences. Cities are thus taking several actions to favour pedestrians' environments and promote sustainable mobility and liveability [2]. These efforts also motivate a better understanding of urban mobility broadly and more specifically pedestrians' mobility because issues of walkability and pedestrian advocacy are at the heart of this theme [3]. Much research has been done to understand multi-modal transportation in our cities, especially in the last ten years, due to the availability of digital traces [4]. However, although digital traces work well to investigate large-distance movements or origin-destination flows (typical of public transportation or vehicles) [5], they are less precise in investigating the local micro-behaviours around people's homes or in pedestrian areas [6,7,8]. There is an increasing need to get more precise data to better learn about the micromotives of what makes an area more suitable for walking, how different spaces are used to walk, or what activities are more impacted by walking accessibility.
In this sense, participatory research initiatives like citizen science [9], in which people share their own data and their pedestrian behaviour, are an unmatched source of understanding of human mobility patterns [10]. Citizen science broadly refers to the active engagement of the general public in scientific research tasks [9]. Among many other aspects related to responsible research practices or scientific literacy, citizen science can offer a powerful way to collect data that otherwise can be difficult or impossible to obtain [11]. Citizens are indeed recognized for this key effort [12] while favouring new research avenues with open datasets in the context of Computational Social Science [13,14]. However, new ethical challenges arise [15,16] and privacy considerations could be overlooked if no specific actions are undertaken. Mobility data must avoid the identification of participants, for instance by inferring their home address (see Refs. [6,7,8]), and geomasking is required, for instance with the spatial k-anonymity technique [17,18]. In citizen science projects, several further aspects are also raised, and some measures such as those taken by the CitSci.org platform could be required as well [19].
In 2012, some of the authors started running citizen science research initiatives to study pedestrian mobility in urban contexts under the name of Beepath experiments. The first initiative tracked participants via a mobile app while wandering through an open space in one of the largest parks in Barcelona (Spain) [20]. Participants were attending the local science festival, and several stands scattered across the park were attracting their attention. The data collected allowed us to provide a general modelling framework to analyse reactive and context-dependent factors in a peer-reviewed academic journal [20]. Several activities were planned to increase public involvement in the research. They included different scientific/technology literacy activities to explain to participants how their data was collected and what the general purpose of the research was. When the participants finished their exploration, personalized reports were shared with each participant, and an aggregated analysis was publicly presented while preserving the participants' privacy.
The Beepath citizen science project developed a new experiment [21] in the form of an innovative STEAM activity in formal education (Concept and context in Figure 1), like many other citizen science initiatives with schools [22]. The Beepath experiment we report here aimed to further explore walkability in the "City of 15-minutes" [23]. The decision to work with schools was taken when some schools of the Barcelona Metropolitan area (not the ones participating in the Beepath experiment) were designing their own School Paths (Camins Escolars, in Catalan). These initiatives aimed to encourage children and young people to make their daily journey from home to school on foot or by bicycle, and, most importantly, without adult accompaniment. School Paths were considered a starting point for addressing challenges such as the achievement of a more sustainable mobility model and reflection on children's rights in the city. Schools interested in participating in the Beepath experiment thought that the planned citizen science research could contribute to the ongoing discussion on walkability with GPS data and scientific evidence. The schools broadly hypothesized that their students face several obstacles (e.g., absence of pedestrian crossings, short green-light durations at traffic lights, train railways dividing the neighborhood, or narrow side-walks).
Taking an active learning approach [24,25,26], the Beepath citizen science experiment collected the GPS data with mobile phones in a crowdsourced manner comparable to volunteer geographic information efforts [27]. The Beepath citizen science experiment wanted to move beyond the crowd-sourced data collection paradigm and extend participation [27] to many other research phases with co-creation and co-design strategies [28,29]. School participants contributed to the project in different stages (Figure 1, in yellow). As reported in the lower part of Figure 1 (Participatory research process), the co-design phase delivered a protocol and a way to proceed in all research phases. The students became testers of the technology developed and helped run the experiments distributed across ten secondary schools. The students also interpreted the data from their own school group based on their local and situated knowledge and produced data visualizations on their own. They finally presented the results in a public event in front of other school students and municipality representatives. A set of evidence-based recommendations was delivered and merged to reach a wider urban perspective. In Methods, we report these aspects that increase public participation in research and stress the citizen science component that has made it possible to obtain the mobility datasets. Throughout the Methods descriptions, we also include the actions taken to preserve the privacy of the school participants. These measures aim to keep the location data anonymous and to prevent participants from becoming identifiable as natural persons (avoiding the inference of their home addresses). That is also the reason why we only share processed anonymous GPS data and why we did not collect gender or any other socio-demographic trait.
To exemplify the results from the perspective of the school participants, we briefly report the recommendations delivered by three different schools to municipal representatives (see Ref. [30], in Catalan). These recommendations were grounded on the investigations done by the students with maps and statistical data features. The first example: the Ferran Tallada public school from Barcelona is in a hilly but still densely populated neighbourhood (El Carmel) and the school building is almost on top of one of the hills. The students emphasized the lack of traffic signals around the school to facilitate pedestrians' safe journeys. Pedestrians followed inefficient movements. The side-walks in this neighbourhood are narrow and the streets are winding, with poor visibility for car drivers. The students proposed widening the side-walks in well-identified locations and changing the traffic direction of particular streets. The second example: Verdaguer school students mostly belong to the La Ribera neighbourhood. The school is inside one of the largest public parks of Barcelona (Parc de la Ciutadella, 17.42 ha), on the edge of three neighbourhoods. The park also has three entrances and closes at night. Paradoxically, before 8am (when classes start), the only access point to the park (and to the school) was through the door most distant from La Ribera. The data showed that most of the students were making a considerable detour to access the park, as most of the students live in this area. The proposal in this case was simple: open all the doors earlier, especially the one closest to La Ribera. Today, all doors open before 7am. Finally, the third example: the Bellvitge Institute is located in the Bellvitge neighbourhood of L'Hospitalet de Llobregat (the second largest city of the Barcelona metropolitan area). The school is very close to the railways that divide the neighbourhood into two parts. The students do not have many options to cross the railways to reach the school and their trajectories were shown to be highly inefficient. The school, jointly with a neighbourhood association, sent a formal petition to increase the number of points to cross the railways. The students from this school also participated in an international Science and City workshop within the Barcelona Science and City Biennale (February 7, 2019) to share the experience.
As said, purpose-based pedestrian mobility data from a fairly homogeneous age profile and at a micro-level is very difficult to obtain. Thanks to citizen science practices, we here report and share unique mobility data from young students' home-to-school trajectories in the Barcelona metropolitan area. This recurrent path is relatively short in time. It shares commonalities across all participants as they have the same destination or origin and belong to a homogeneous age group. As described in the top right corner of Figure 1 (Human mobility data), GPS data was stored in a server. Datasets were then shared with each participant separately. Datasets were also further processed for scientific research in the way we report here to favour scientific reproducibility and data re-usability (Open datasets in Figure 1). From a broader academic perspective, one might suggest a stochastic model which can reliably describe the movement of the participants [31]. Data can deliver interpretations through appropriate mobility models and contribute to discussions about the most suitable models for pedestrian mobility [32,33,34,35]. Apart from the velocity we show in Methods, it is possible to further characterize pedestrian mobility with other statistical metrics like reorientation angle or tortuosity [34]. Finally, it can also be of interest to correlate some of the statistical metrics with contextual information such as the amount of green or the width of the side-walks, along similar lines to recent publications [6,7,8]. The model parameters could be estimated from the empirical data and used to compare mobility in the schools' surroundings in terms of urban structure, more walkable routes (pedestrian streets and green areas) or climate conditions [36]. One starting point for this analysis could be to study whether participants took the shortest route to school (or home) or instead opted for longer routes to choose safer or more walkable streets [37].

[Figure 1. Panels: Experiment (anonymous GPS data from participants' mobiles, stored in a server); Cleaning (aggregation of participants per school; data cleaning and validation processes); Interpretation (discussion with data, plots and maps within each school).]

Methods
The Universitat de Barcelona Ethics Committee (IRB00003099) has approved this mobility experiment. All participants read and signed the informed consent and parental/legal guardian consent was also sought as a suitable procedure. No privacy issues have been observed to be in conflict with public release of the underlying processed data.

Table 1: List of schools jointly with their code-names (three letters long) and some contextual information. The school city and district are included in the third and fourth columns. Barcelona has up to 10 districts. Viladecans is a smaller city with around 67,000 inhabitants, and that is the reason why we do not report the district in this case. All cities belong to the Barcelona metropolitan area. "Ins." is an abbreviation of "Institut" in Catalan (High School, in English).

Co-designing the experiment and training
An open call for participation to schools from the Barcelona metropolitan area was launched. The call resulted in the participation of 427 students (from 14 to 16 years old) and 31 teachers from 3rd and 4th grade of 10 secondary schools, most of which were public schools as reported in Table 1. Public school teachers received official recognition from the Consorci d'Educació de Barcelona, a recognition that counts for promotion to higher professional ranks.
To start the experiment preparation, two co-design sessions with the involved teachers were held under the guidance and supervision of education experts (Eduscopi), citizen science researchers (OpenSystems, UB), and data scientists (Dribia). Sessions also introduced citizen science concepts and practices. There was also time devoted to providing basic skills for data visualization and analysis of GPS records, and several tools to carry out this task were offered.
The first session with teachers identified shared interests about mobility around schools (3 hours duration). Some logistic aspects related to the mobility experiment were also discussed. Each teacher then transferred the issues discussed to their own students. In-class activities were developed to further discuss, adjust, and validate the joint research to be performed identically by all schools. Some of the questions raised in class were: Which scientific questions could we formulate? What experiment would you like to perform? Initial answers by the students concerned how easy or difficult it was to get to school from home. Students started to talk about identified obstacles based on their personal perceptions. Through their teachers, the students also had the opportunity to express their own views on the participants' profile, the ideal number of participants, or the best time of the day to perform the experiment. Other topics addressed were related to key statistical measures that can characterize mobility and to the impact that results may have at a neighbourhood level and at a city level. The second session with teachers was held to put together the perspectives and ideas shared in class (3 hours duration). It was then agreed that the mobility experiment was to be focused on the paths they followed to reach school in the morning or to leave school in the evening, with special attention to pedestrian mobility and walkability.
After the co-design sessions, teachers and students worked further but independently on the execution plan of the mobility study, including the logistics and the calendar for each of the activities planned. Each school had to finalize the design considering the particularities of its own context. They also anticipated in a more detailed manner the impact of the study on their own neighbourhood and how they could communicate the research results locally. Doubts and questions were resolved via constant email communication and during a visit to each class group (around 25 students each). The visit was made by one professional researcher from OpenSystems, UB. It allowed us to supervise and validate the approach taken by each school. During the visit, the professional researcher made a presentation on citizen science and pedestrian mobility to all students. The presentation also underlined key technological aspects related to GPS data acquisition and to the mobile app functionalities (see below and Figure 2). A final debate was organized on data privacy and open science. The scientific protocol (see below) to guarantee scientific rigour and data quality during the experiment was also revised and discussed while preserving the privacy of school participants.

Data acquisition with the mobile phone app
Each school chose one school day over a two-week period (from November 5 to November 16, 2018). Weather conditions were favourable and allowed us to compare data among different schools. During the days and hours of the experiment, there were no exceptional climate events (e.g., rain or very low temperatures), and the weather was very similar. To preserve the privacy of school participants, the team responsible for running the experiment (some of the authors) did not know exactly which students were collecting their own GPS records. All communication related to the experiment data acquisition was made through the teachers, who acted as local coordinators in each school. All schools and school groups involved were large or very large to avoid personal identity inference from the school information. The school participants involved were from two courses (3rd or 4th grade of ESO, Educación Secundaria Obligatoria, which can be translated as Compulsory Secondary Education) but we did not know which one exactly. The schools had at least two classes per course and each class had 30-35 students. Also, all participants of one school had to perform the experiment on the same day. We only allowed data to be collected within a pre-established narrow time window of less than an hour. The exact time window was set with teachers. For instance, if participants decided to collect data when going to school, we only opened the server that morning and closed down the server a few minutes after the beginning of classes. We activated data collection about 45 minutes before the beginning of classes. A total of 262 students finally participated in the experiment with their own mobile phones (see Table 2). Participation is lower than the number of students involved in the whole research activity (427 students). A relevant number of participating students did not have mobile phones, while others had an old mobile phone which was incompatible with the app version requirements. The schools thus divided research tasks among the students so that, for instance, those who did not have a mobile phone could spend more time analysing data or preparing specific graphics as part of their educational activities related to the project, which were also relevant tasks in the joint citizen science research effort.

Table 2: Number of participants and GPS data records for each school. Columns two and three, respectively, report the number of participants and GPS records considering raw data (Original Data). This information is also reported after the filtering and cleaning process in the last three columns (Processed Data). The last column provides GPS interpolated data (GPS int.), which are GPS records reported every second. The code-name (first column) correspondence to each school is provided in Table 1. An aggregate with all participants and all GPS records is given in the last row.

Scientific protocol for the experiment
Each participant used their own mobile phone following a common protocol. All students participating in the experiment by taking the home-to-school trajectory received the following instructions accompanied by some screenshots (see Figure 2): (4) Check if the app works properly before the experiment. First, check its functioning on your own. Then repeat the checking with your schoolmates on the day that your teacher will announce. (5) In the meantime, report any problem you may have with the app to your teacher.
During the experiment. (1) Click on the Beepath app icon when you start the trajectory (see Figure 2b). After reading and accepting the informed consent, you click "Start the experiment". The GPS will then start to be recorded. Avoid starting the experiment in indoor spaces as GPS will lose precision. To preserve your privacy, do not start recording data in front of your home. Walk between 5 and 150 meters before clicking on the app icon to start the experiment.
(2) Write your nickname following the established protocol (see the section below and Figure 2b). (3) On the next screen, you should be able to read the latitude in degrees, the longitude in degrees, and the GPS data precision in meters (see Figure 2c). You will also see the username you wrote in the previous screen. Leave the app on during the trip. The app collects the GPS data from your trip. You do not need to have the Beepath app visible on your screen; you can use other apps during your trip.
After the experiment. (1) Press "End Experiment" (see Figure 2d). (2) The app will then direct you to a web page where you will have a data report: https://beepath.dribia.com:8080/stats/nickname. (3) To keep the link with you, save the web address in your notebook.
The web address with the personalized data report was unique to each participant. The server was kept active for several months and allowed participants to check their own trajectory via a synthesized and automatic data report. Participants could also download their own data. It could be stored in a csv file (with commas) on their own device. Each row included a timestamp (in YYYY:MM:DD HH:MM:SS format) and the GPS record with its latitude and longitude (in degrees). The report provided a map of the tracked GPS data and basic statistical features that included distance covered, time duration, and mean velocity.
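As an illustration of how such a per-participant file could be read, the following is a minimal sketch. It assumes one record per row in the order timestamp, latitude, longitude and no header row; the column order, the function name `read_trajectory`, and the sample coordinates are assumptions made for this example and are not specified in the text.

```python
import csv
import io
from datetime import datetime

# Timestamp layout stated in the text: YYYY:MM:DD HH:MM:SS
TIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def read_trajectory(text):
    """Parse csv text into a list of (datetime, latitude, longitude) tuples.

    Assumed row layout (not confirmed by the source): timestamp,lat,lon.
    """
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        stamp = datetime.strptime(row[0].strip(), TIME_FORMAT)
        records.append((stamp, float(row[1]), float(row[2])))
    return records


# Made-up sample rows, for illustration only.
sample = (
    "2018:11:05 07:45:01,41.3851,2.1734\n"
    "2018:11:05 07:45:03,41.3852,2.1735\n"
)
trajectory = read_trajectory(sample)
```

Parsing the timestamp into a `datetime` makes the time increments between consecutive records directly available by subtraction.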
Very similar instructions apply to participants performing school-to-home trajectories. The only differences are that participants had to start the experiment inside the school facilities and to click on "End Experiment" between 5 and 150 meters before arriving home. In both cases and for privacy reasons, data from tests was not stored.

Nicknames and school code names
As mentioned in the protocol, each participant received a number assigned by each teacher. Nicknames were used to preserve the privacy and anonymity of the participants. They were asked to add a nickname that contained this number and additional key information: the means of transport (bus, walking, car, scooter, bike, metro, train...), whether they made the trip alone or together with  1.

Supervision, monitoring and support
Before the experiments, data scientists from Dribia monitored and supervised the testing made by each school with some basic statistical analysis of aggregated data, immediately erasing the data collected. There were some errors reported by the participants themselves but always through their teachers. The data scientists also implemented modifications in the app and in the personalized data report based on students' and teachers' feedback. During the experiment, data scientists from Dribia also monitored and supervised the progress of the data recording in real time and did not collect any data that fell outside the pre-established time window or that did not follow the pre-established nicknaming structure. A server was specifically prepared for the experiments, and the CPU capacity was increased during the weeks of the experiments.
After each experiment, following a first revision of all trajectories and some basic cleaning procedures, data scientists from Dribia aggregated the data per school. Every teacher thus received anonymous csv files with all trajectories from their school. The students took the data and reflected on it. Maps and other visualizations were created in a standardized info-graphics format designed by Eduscopi. The template gave space to show plots, explain the results, and deliver recommendations related to pedestrian mobility in each neighborhood [30]. All data manipulation was mediated by teachers, and professional scientists did not have any direct contact with school participants in the data interpretation effort made by each school.

Actionable data to deliver policy recommendations
At the BarcelonActiva Auditorium, a two-hour closing event was held with about 200 students and teachers in the audience. Student representatives for each school showed their own results and shared their own conclusions. A joint report collecting the info-graphics (pdf file in Catalan, shared jointly with the datasets) [30] and the set of policy recommendations were delivered to the Technical Director of Digital Innovation at the City Council of Barcelona and the institute of Barcelona digital city (i.lab). Some of the schools also used the analysis to go to public municipal or district authorities and ask for some very specific actions (generally at an urban micro-level). Some of the results are reported in the previous Background & Summary section.

Filtering, cleaning, and interpolating GPS data
As scientists, we are interested in purpose-based pedestrian mobility, but not all participants followed this type of mobility. Furthermore, in some cases, the data acquisition process appeared incomplete or partially failed for technical reasons. It was thus necessary to carry out a thorough filtering process. For scientific purposes, it is also convenient to have GPS data with periodic timestamps via some interpolation.
Let us, however, first define the following key variables that characterize the mobility of each participant. The distance between consecutive GPS records reads

$$ d(t) = \left| \vec{r}(t + \Delta(t)) - \vec{r}(t) \right|, \qquad (1) $$

where $\Delta(t)$ is the (irregular) time increment between consecutive GPS timestamps and $\vec{r}(t)$ is the GPS two-coordinate vector of a pedestrian position at time $t$ (a given GPS timestamp). We thus define the instantaneous velocity as

$$ v(t) = \frac{d(t)}{\Delta(t)}, \qquad (2) $$

and the total distance covered by one participant as

$$ D = \sum_{t} d(t), \qquad (3) $$

where we sum over all timestamps $\{t\}$ from a trajectory except the last one. Finally, the trajectory duration reads

$$ T = \sum_{t} \Delta(t). \qquad (4) $$

The filtering and cleaning process is divided into five successive steps. First, non-pedestrian users are filtered out by checking the means of transport they use. Second, the GPS paths are displayed on maps, and the invalid ones are removed (those that do not form a well-defined origin-destination trajectory). Third, the remaining trajectories are cleaned of noisy GPS locations at the beginning and/or end of the path, due mostly to GPS connection problems (when participants are still located at schools). This is done by visualizing the routes on maps along with the study of the distances (cf. Eq. (1)) and velocities (cf. Eq. (2)), which typically present outliers at these noisy locations. Fourth, again to preserve participants' privacy, we mask the location data (GPS records) by removing a random amount of records: the first 20-50 seconds of the movement in home-to-school trajectories and the last 20-50 seconds of the movement in school-to-home trajectories. The spatial k-anonymity technique is used to measure the disclosure risk [17]. This privacy-by-design measure is taken in addition to the protocol instruction described above to start the app recording about 5-150 meters after leaving (or before reaching) home. The fifth step consists in interpolating the GPS locations linearly to have all records uniformly separated by one second.
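The variables defined above (distance between consecutive records, instantaneous velocity, total distance, and trajectory duration) can be sketched in a few lines. This is only an illustration: it assumes timestamps already converted to seconds and uses the haversine formula on a spherical Earth to turn latitude/longitude pairs into meters, which is a choice made for this example and not necessarily the procedure used to produce the datasets. The names `haversine_m` and `mobility_metrics` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mobility_metrics(times, lats, lons):
    """Return (d, v, total_distance, duration) for one trajectory.

    times: seconds, strictly increasing; lats/lons: degrees.
    d[i] and v[i] refer to the step from record i to record i + 1, so the
    sums run over all timestamps except the last one.
    """
    n = len(times)
    dt = [times[i + 1] - times[i] for i in range(n - 1)]
    d = [haversine_m(lats[i], lons[i], lats[i + 1], lons[i + 1])
         for i in range(n - 1)]
    v = [step / gap for step, gap in zip(d, dt)]
    return d, v, sum(d), sum(dt)
```

For a trajectory with timestamps `[0, 1, 3]`, the function returns two steps and a duration of 3 s; the irregular gaps are what make the per-step division by the time increment necessary.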
Initial data reported in Table 2 is reduced to 83 participants and 33,222 GPS locations (41,053 after interpolation). The details of the process are reported in the forthcoming sections.

Non-pedestrians removal
The nickname (as described in the previous sections) not only makes sure that data is anonymous, but also includes trip details such as the school code name and the type of transport being used. Since we are exclusively interested in purpose-based pedestrian mobility, we removed from the data set those participants who did not follow this type of mobility. For instance, the participant 2018-11-05_sgv_0601_tren used the train as a transport mode. Additional attention is required for those participants that did not indicate the means of transport. From the velocity statistics, we can infer that some of these participants used a means of transport other than walking. We thus remove trajectories with abnormal average velocities.
We finally filter out a total of 105 non-pedestrian trajectories corresponding to a total number of 83,009 GPS locations.
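A minimal sketch of this first filtering step is given below. Several elements are assumptions made only for illustration: the nickname is taken to be underscore-separated with the transport label as an optional fourth field (as suggested by the single example above), the set of walking labels is hypothetical, and the 2.5 m/s cut-off on the mean velocity is an arbitrary placeholder since the text does not state the threshold used.

```python
# Hypothetical labels that would denote walking; the real label set is not
# given in the text.
WALKING_LABELS = {"walking", "caminant", "peu"}

# Hypothetical cut-off (m/s) for an "abnormal" mean velocity.
MAX_WALKING_SPEED = 2.5


def transport_label(nickname):
    """Return the transport field of a nickname, or None if it is absent.

    Assumed structure: date_school_number[_transport].
    """
    parts = nickname.split("_")
    return parts[3].lower() if len(parts) > 3 else None


def is_pedestrian(nickname, mean_velocity):
    """Keep declared walkers; fall back to mean velocity when undeclared."""
    label = transport_label(nickname)
    if label is not None:
        return label in WALKING_LABELS
    return mean_velocity <= MAX_WALKING_SPEED
```

With these assumptions, `is_pedestrian("2018-11-05_sgv_0601_tren", 1.2)` is rejected because of the declared train label, while a nickname without a transport field is judged on its mean velocity alone.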

Invalid trajectories and non-origin-destination trajectories
We visually explore each trajectory record separately, displaying the GPS locations on maps. Several trajectory records do not follow a clear path, possibly due to problems with the app or with the GPS satellite connection. Invalid trajectories are also due to human error when recording the route (see Figure 3a). These records are removed jointly with those that neither start nor end in one of the schools (see Figure 3b).
This second step has removed a total amount of 7,500 GPS locations, corresponding to 43 participants.

Outliers
When computing time increments ∆(t) between consecutive GPS timestamps, GPS consecutive distances d(t) (cf. Eq. (1)) and instantaneous velocities v(t) (cf. Eq. (2)), it is also possible to detect large outliers. These outliers are frequent in the first GPS records as the GPS activation generally provides some initial noisy records. This is evident when displaying the trajectory of one participant on a map as shown in Figure 4a. Figure 4b shows the corresponding extremely large velocity values, exceeding even 90 m/s. Other outliers are also sometimes detected very close to the end of the trajectory because the participant is inside the school building, where GPS has much less precision or where the mobile is automatically connected to WiFi. GPS locations then accumulate in the same area for a while.
All these outliers are carefully analyzed and removed (see Figure 4c and Figure 4d as an example). Still, there are certain individual trajectories that contain too many outliers, thus making it impossible to create a meaningful origin-destination trajectory once the outliers are removed. In these cases, we discard the whole trajectory.
In total, 31 complete journeys and 34,409 GPS outliers are removed. We therefore end up with 83 individual trajectories and a total amount of 36,091 records.
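The outlier removal described in this section combined map inspection with the study of distances and velocities. As a rough automated first pass, one could flag the noisy records at both ends of a trajectory with a velocity threshold, as in the sketch below. The function name `trim_noisy_ends` and the 3 m/s threshold are assumptions for illustration only; the sketch does not handle the case of records accumulating at low velocity inside the school building, and it is not a substitute for the manual inspection reported here.

```python
def trim_noisy_ends(velocities, v_max=3.0):
    """Return (start, stop) such that records[start:stop] drops noisy ends.

    velocities[i] is the instantaneous velocity (m/s) of the step between
    record i and record i + 1, so there is one more record than velocities.
    v_max is a hypothetical threshold above which a step is treated as noise.
    """
    n_records = len(velocities) + 1

    # Drop leading records while the step that leaves them is too fast.
    start = 0
    while start < len(velocities) and velocities[start] > v_max:
        start += 1

    # Drop trailing records while the step that reaches them is too fast.
    stop = n_records
    while stop - 1 > start and velocities[stop - 2] > v_max:
        stop -= 1

    return start, stop


# Made-up velocities: two noisy steps at the start, one at the end.
start, stop = trim_noisy_ends([92.0, 40.0, 1.4, 1.5, 1.3, 25.0])
```

Returning slice indices rather than a filtered list lets the same cut be applied to the timestamps, latitudes, and longitudes of the trajectory.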

Geomasking and spatial k-anonymity
Location data in the form of GPS data can potentially reveal personal identity (through home address, in our case). We have no means to make sure that participants followed the protocol instruction described above to start the app recording about 5-150 meters from their home (or to finalize the GPS recording 5-150 meters before reaching their home). In the remaining cleaned trajectories, we thus perform a further filtering to keep the anonymity of the participants. We have used the spatial k-anonymity technique [17,18] to mask the location data within a certain urban area. The technique strips away GPS records that could identify the home address as shown in Fig. 5. K-anonymity is here characterized as the number of home units k within a given area. Therefore, 1/k quantifies the disclosure risk, that is, roughly the probability of revealing the correct location of the participant's home. Table 3 provides the urban surface and the number of housing units of a given district (or city) obtained from different reports accessible online [38,39,40]. With this information, we can estimate the density of housing units as shown in the third column in Table 3. The last column is then obtained from the approximate number k of homes within a circle of radius d = v·t. To keep it simple, and as we only want a rough estimation, we take a constant velocity v, identical for all participants. Then, based on the disclosure risk 1/k we are willing to assume, we can finally decide the time period t in seconds that we strip away from individual GPS records at the beginning of the home-to-school trajectories (or at the end for school-to-home trajectories, but always when participants have non-zero velocity). If we approximately take v = 1.5 m/s for all participants (see Data Records Section to check that this is a reasonable choice) and use t = 50 s, we obtain an average disclosure risk ⟨1/k⟩ = 2.72 × 10⁻³, which is of an order of magnitude comparable to that taken by other publications using GPS data [6,7,8]. Table 3 shows the details for each of the districts and cities. We have therefore chosen a random value of t between 20 seconds and 50 seconds for each participant (each trajectory). Table 4 reports the number of GPS records removed with the spatial k-anonymity and how the technique shortens the total duration of the trajectories.

Table 3: Disclosure risk for each district of Barcelona or city. Columns respectively report the district of Barcelona or city where the school is located, the urban surface (km²), the number of housing units, the density of housing units (1/km²), and the 1/k disclosure risk, where k is calculated by multiplying the density of housing units by the surface area of a circle of radius d = v·t, with v = 1.5 m/s and t = 50 s. The averaged disclosure risk is 2.72 × 10⁻³. L'Hospitalet and Viladecans are cities, not districts of Barcelona.
In total, 2,869 GPS records are removed. We therefore end up with 83 individual trajectories and 33,222 GPS records after applying the k-anonymity technique.
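The disclosure-risk estimate described in this section can be sketched in a few lines of Python. The housing density used below is a hypothetical round number rather than a value from Table 3, and the function names are illustrative. The trimming helper simply drops the first t seconds of a trajectory and ignores the non-zero-velocity condition mentioned above, so it is a simplification of the actual procedure.

```python
import math
import random

def disclosure_risk(housing_density_per_km2, v=1.5, t=50.0):
    """Return 1/k for a circle of radius d = v*t (v in m/s, t in s)."""
    d_km = v * t / 1000.0                          # radius in km
    k = housing_density_per_km2 * math.pi * d_km ** 2
    return 1.0 / k

def trim_start(timestamps, t_strip):
    """Return the indices of the records kept after dropping the first t_strip seconds."""
    t0 = timestamps[0]
    return [i for i, t in enumerate(timestamps) if t - t0 >= t_strip]

density = 20000.0                  # housing units per km^2, hypothetical
risk = disclosure_risk(density)    # of the order of 10^-3 for this density

# One stripping time per trajectory, between 20 s and 50 s.
# Drawing it uniformly is an assumption of this sketch.
t_strip = random.uniform(20.0, 50.0)
kept = trim_start(list(range(60)), 30.0)
```

With v = 1.5 m/s and t = 50 s the radius is 75 m, so a density of 20,000 housing units per km² gives k of roughly 350 homes inside the circle.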

Temporal gaps and linear interpolation
The mobile app is designed to collect data uniformly every second (∆(t) = 1 s). However, on some occasions, the trajectory data sets contain larger time intervals between consecutive GPS records. These gaps can be attributed to errors in the GPS connection that cause jumps in the antenna signal, or to pauses of the mobile app. 17% of the data is separated by more than 1 second, but 99.8% of the GPS records are separated by ∆(t) ≤ 4 s. Large temporal gaps are, therefore, statistically irrelevant. Even if temporal gaps are small, scientific analyses of pedestrian micro-mobility may require a constant period between consecutive GPS records, for instance in a study of the instantaneous velocity. To obtain one record every second, we perform a linear interpolation. The number of GPS locations then increases from 33,222 to 41,053 records. This procedure does not have any remarkable effect on the dynamics, as shown in Figure 6. Table 2 shows the number of participants and GPS locations for each school after processing the data and after the linear interpolation procedure.

Table 4: Number of GPS records and average trip duration for each school after spatial k-anonymity. Columns respectively report the district of Barcelona or city, the school code-name, the number of participants, the number of GPS records and the average trip duration in seconds, before and after applying the spatial k-anonymity. The number of GPS records after applying the spatial k-anonymity also includes the % reduction. L'Hospitalet del Llobregat and Viladecans are cities, not districts of Barcelona.
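A minimal sketch of such a resampling to one record per second, using NumPy's `np.interp`, is given below. It assumes timestamps expressed in whole seconds and interpolates latitude and longitude independently. The function name and the toy coordinates are illustrative, and the notebooks in the repository remain the reference for the procedure actually applied.

```python
import numpy as np

def interpolate_to_1s(t, lat, lon):
    """Linearly resample a trajectory so that consecutive records are 1 s apart.

    t:   timestamps in seconds (increasing, assumed to be whole seconds).
    lat: latitudes in degrees; lon: longitudes in degrees.
    """
    t = np.asarray(t, dtype=float)
    t_new = np.arange(t[0], t[-1] + 1.0, 1.0)    # one sample per second
    lat_new = np.interp(t_new, t, lat)
    lon_new = np.interp(t_new, t, lon)
    return t_new, lat_new, lon_new

# Toy trajectory with a 3-second gap between t = 1 s and t = 4 s.
t = [0, 1, 4, 5]
lat = [41.0, 41.1, 41.4, 41.5]
lon = [2.0, 2.1, 2.4, 2.5]
t_new, lat_new, lon_new = interpolate_to_1s(t, lat, lon)
```

The gap is filled with two new records at t = 2 s and t = 3 s, placed on the straight line joining the neighbouring fixes.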
Tables 5 and 6 include the main statistical indicators for each school: the distance travelled, the time spent, and the velocity (after processing the data but before linear interpolation).
[...] for more advanced calculations such as fitting models to the empirical data and matplotlib for plotting purposes. The Python code is organized in different Jupyter notebook files, which contain a detailed description of the study and the code documentation.

Table 5: Distance and duration of the trajectories for each school. Mean distance travelled (cf. Eq. (3)), the shortest and the longest trajectory, the mean time spent (cf. Eq. (4)), and the quickest and the slowest trajectory for each school, after filtering and processing the data (before linear interpolation).

Data Records
The data repository [41] contains the collected and processed datasets, distributed in two different folders.
The processed data folder contains the 83 csv files reporting the participant trajectories after the filtering and cleaning process (i.e., removing non-pedestrian participants, removing invalid and non-origin-destination trajectories, cleaning outliers, and removing the first or last 20-50 seconds of movement). Processed files are saved with the suffix processed (e.g.: 2018-11-05_sgv_0802_peu_processed.csv). Each of the csv files has 10 columns, 7 of which correspond to the raw GPS data collected in the experiment by the Android and iOS platform servers (see Table 7). The other three columns correspond to the time difference between GPS timestamps, the distance between consecutive GPS locations, and the corresponding instantaneous velocity (see Table 8). We remark that ∆(t), d(t) and v(t) are calculated in a time-advanced way (cf. Eqs. (1) and (2)): the time difference of the record at location i is the difference between the timestamps at locations i+1 and i (the same applies to the distance and the velocity). For this reason, the last record (last row) does not have these three values reported. Table 9 shows an example of a processed data table.
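The time-advanced convention can be illustrated with the short Python sketch below. It uses the haversine formula for the distance between consecutive fixes, which is an assumption of this example and may differ from Eq. (1). The function names and the toy coordinates are also illustrative.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def forward_increments(t, lat, lon):
    """Return one (dt, d, v) tuple per record, computed from record i to i+1.

    The last record has no successor, so its three values are None.
    """
    rows = []
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        d = haversine_m(lat[i], lon[i], lat[i + 1], lon[i + 1])
        rows.append((dt, d, d / dt))
    rows.append((None, None, None))
    return rows

# Toy example: three fixes moving east along latitude 41 degrees.
rows = forward_increments([0, 1, 3], [41.0, 41.0, 41.0], [2.0, 2.00001, 2.00003])
```

Each row carries the increment towards the next record, which is why the final row is left empty, as in the processed csv files.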
For convenience and data quality testing, we provide the processed data with and without linear interpolation separately. The folder interpolated data contains the same processed csv files but with linear interpolation. The suffix interpolated is added to each filename instead of processed (e.g.: 2018-11-05_sgv_0802_peu_interpolated.csv). After the linear interpolation, all the records are equally spaced, with one record every second. Therefore, in the interpolated csv files, the column ∆(t) is always 1.0, and the columns d (now the distance covered in 1 second) and v (velocity) provide the same value.

Table 6: Instantaneous velocity for each school. Mean, smallest and largest value of the instantaneous velocity (cf. Eq. (2)) for each school, after filtering and processing the data (before linear interpolation).

Technical Validation
One could question whether linear interpolation can affect any potential pedestrian mobility analysis. We are in fact adding 7,831 new GPS locations, which corresponds to an increase of 23.6% in the processed dataset.
Figure 6 qualitatively compares three statistical analyses that can characterize pedestrian mobility. A first statistical analysis can be made with the probability density p_V(v) of the instantaneous velocity (cf. Eq. (2)). Another probability density, p_U(u), can be obtained in terms of a variable u based on the logarithm of the velocity (see Figure 6b), whose definition involves v_m = E[v(t)] (for each trajectory). It is also possible to compute the mean squared displacement (MSD) and the auto-correlation of the velocities (cf. Eq. (5)), where u_m = E[u(t)]. Both quantities are averaged over the trajectories. Interpolated and non-interpolated data can thus be compared in Figure 6. The probability densities match qualitatively well except for small velocities. This can be attributed to the interpolation, as small time steps are especially relevant in this part of the distribution (see Figure 6a). When the logarithm of the velocities is considered in Figure 6b, this effect is blurred out. For the MSD, the growth with time τ is shifted and the curves are almost parallel in a double logarithmic scale (see Figure 6c). Figure 6d is particularly sensitive to data interpolation, as the interpolation enhances correlation as expected, particularly for short time lags. The computed observables, such as the distance travelled (cf. Eq. (3)) and the time spent (cf. Eq. (4)), remain statistically similar (see Table 10). Furthermore, the average effective speed v_eff = D/T is also analyzed in Table 11. The mean instantaneous velocity seems to decrease slightly after the linear interpolation, showing a shifting effect that was also observed in Figure 6c. The analysis confirms that linear interpolation does not broadly modify the statistical features reported here. Modifications are subtle and, in any case, limited and small.

Table 11: From all participants: mean value (⟨...⟩), standard deviation, quantiles, and minimum and maximum value of the effective speed (defined as the total distance travelled over the time spent, D/T) and the instantaneous velocity before and after linear interpolation.
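For readers who want to repeat this comparison on their own data, the sketch below gives textbook estimators of the mean squared displacement and of a normalized auto-correlation for a single, uniformly sampled trajectory. The normalization by the variance, the use of planar coordinates in metres, and the function names are assumptions of this example and may differ from Eq. (5) and from the repository notebooks.

```python
import numpy as np

def mean_squared_displacement(x, y, max_lag):
    """Time-averaged MSD for lags 1..max_lag (x, y in metres, uniform sampling)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([
        np.mean((x[lag:] - x[:-lag]) ** 2 + (y[lag:] - y[:-lag]) ** 2)
        for lag in range(1, max_lag + 1)
    ])

def autocorrelation(u, max_lag):
    """Auto-correlation of u around its mean, normalized by the variance."""
    u = np.asarray(u, dtype=float)
    du = u - u.mean()
    var = np.mean(du ** 2)
    return np.array([
        np.mean(du[lag:] * du[:-lag]) / var
        for lag in range(1, max_lag + 1)
    ])

# Straight walk at 1.5 m/s sampled every second: MSD grows as (1.5 * lag)^2.
x = 1.5 * np.arange(10)
y = np.zeros(10)
msd = mean_squared_displacement(x, y, 3)

# Alternating series: perfectly anti-correlated at lag 1, correlated at lag 2.
acf = autocorrelation([1, -1, 1, -1, 1, -1], 2)
```

Averaging these per-trajectory curves over all participants, once for the processed files and once for the interpolated ones, would give curves of the kind compared in Figures 6c and 6d.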

Usage Notes
The data collected in the Beepath citizen science experiment represent a rich source of mobility information for the study of people's micro-behavior around neighborhoods. The data report trajectories with a specific origin and destination (home-to-school, or the other way around) from an age-uniform group of participants. Clean pedestrian mobility data can be accessed through the two folders in the repository: processed data, with the processed and cleaned data, and interpolated data, which adds a further step in which the data are interpolated so that all GPS records are uniformly spaced every 1 second.
The Python notebook called Data Processing.ipynb in the GitHub repository (https://github.com/ferranlarroyaub/Beepath-Schools.git) contains a description of the data treatment and clean-up discussed here, as well as the scripts to reproduce all the trajectory representations on maps (Figure 7). Each processed individual trajectory is saved in a new csv file in the processed data folder and, after the linear interpolation, in the interpolated data folder. In addition, the notebook contains the code to calculate the time increment ∆(t) between consecutive timestamps, the distances (d(t) and D, cf. Eq. (1) and Eq. (3)) and the velocities (v(t), cf. Eq. (2)) between consecutive timestamps. These variables are added as new columns to the processed csv file of each participant.
The repository (https://github.com/ferranlarroyaub/Beepath-Schools.git) also contains different Jupyter notebooks with the necessary functions and scripts to study and characterize the participants' movement through statistical patterns.All these files contain a detailed description of the study and an explanation of the code.
In particular, the notebooks called Mean Squared Displacement.ipynb, Instantaneous velocity.ipynb, and Autocorrelation velocities.ipynb contain the scripts to reproduce Figure 6, where the three main statistical features studied are compared with and without linear interpolation for technical validation purposes. These statistical patterns are the mean squared displacement, the probability density function of the instantaneous velocity, and the auto-correlation of the velocities.

Figure 1: Schematic description of the Beepath citizen science initiative. We have combined STEAM education with citizen science situated knowledge co-production to collect GPS data from purpose-based pedestrian mobility. The data files available are reported jointly with the participatory process that has made the whole research possible. The direct participation of students is highlighted in yellow. Grey color highlights the GPS data collected, and open data sets are marked with a thicker frame.
Before the experiment. (1) Download the app. (2) If you have an iPhone, open the App Store and search for the Beepath app (see a screenshot in Figure 2a). If you have an Android, open the Google Play Store and search for the Beepath app. (3) Install the app. (

Figure 2: Mobile app screenshots for the steps that the participant had to follow to complete the experiment. (a) The "Beepath" page in the iPhone App Store. (b) The starting screen, where participants had to accept the terms and conditions with informed consent. The participant also had to insert a nickname ("Usuari") above the horizontal line and finally click on "Començar Experiment" to start the experiment. (c) Real-time recording of the trajectory with some information related to GPS. The screen also gives the option to finish the experiment by clicking on its lowest part. (d) A summary of basic information about the whole trajectory, provided when the experiment is finished. (e) and (f) Two error messages shown as examples. (e) The case in which the participant is not yet able to access the data report because the server is still processing the data. (f) The case in which a participant is not allowed to start the experiment with a nickname that has already been used. All texts are in Catalan, as the students who participated in the experiment attended Catalan schools.

Figure 3: Example of an invalid trajectory. (a) Map visualization of an invalid trajectory from a BEL participant, with only a few GPS locations scattered around the school (yellow star, school). (b) Map visualization of a non-origin-destination trajectory (one that does not go from home to school), from a participant of SGV.

Table 7: Columns of the raw csv files. Description of the 7 columns of the csv files from the original data, collected by the Android/iOS app servers.

Table 8: New columns of the filtered and processed csv files. Description of the 3 new columns.
Column | Description
∆t | Time difference between consecutive timestamps t, in seconds.
d | Distance at time t between consecutive GPS locations, in meters.
v | Instantaneous velocity at time t (distance over time-lapse), in meters/second.

Table 9: Example of a processed and cleaned csv file display. This is a ZAF pedestrian participant, with 220 records.

Table 10: Duration of the trajectories and distance travelled before and after the linear interpolation procedure (cf. Eqs. (4) and (3)). From all participants: mean value (⟨...⟩), standard deviation, quantiles, and minimum and maximum value of the duration of the trajectory and the distance travelled before and after linear interpolation.