Activity of vehicles in the bus rapid transit system Metrobús in Mexico City

In this paper, we analyze a massive dataset with registers of the movement of vehicles in the bus rapid transit system Metrobús in Mexico City from February 2020 to April 2021. With these records and a division of the system into 214 geographical regions (segments), we characterize the vehicles’ activity through the statistical analysis of speeds in each zone. We use the Kullback–Leibler distance to compare the movement of vehicles in each segment and its evolution. The results for the dynamics in different zones are represented as a network where nodes define segments of the system Metrobús and edges describe similarity in the activity of vehicles. Community detection algorithms in this network allow the identification of patterns considering different levels of similarity in the distribution of speeds providing a framework for unsupervised classification of the movement of vehicles. The methods developed in this research are general and can be implemented to describe the activity of different transportation systems with detailed records of the movement of users or vehicles.

The study and understanding of human mobility in cities is an important and challenging problem since more than half of the world population lives in urban areas 1 . Nowadays human mobility can be explored in detail thanks to the digital traces people leave on mobile/digital platforms 2,3 . The identification of patterns in human mobility [4][5][6][7][8][9] is necessary in topics like urban planning, dealing with traffic congestion 10 , the influence of the spatial distribution of a city 1,[11][12][13][14] , the encounters or contacts that emerge 15,16 , among many others 1,2,17 . In these problems, the science of networks with well-established tools and methods to characterize and model complex systems [18][19][20] , provide a valuable framework to study transportation modes and their interactions [21][22][23] .
As one type of transit mode, bus rapid transit (BRT) systems have gained popularity worldwide for providing fast and easy access for citizens to fulfill their transportation needs 24 and have been adopted widely over the world [24][25][26][27] . The merit of the BRT system lies in its ability to provide a high-quality public transit service with limited infrastructure and at relatively low capital and operating cost 25 . The benefits of a typical BRT system consist of dedicated lanes and proper vehicles and stations; such a layout guarantees a significant advantage in terms of operability 26 . In addition, BRT systems stand to significantly decrease personal vehicle mode share 25 and might pull together connecting parts of the city in ways which other systems do not, especially at the level of service and spatial coherence 28 . In many BRT systems, vehicles have a preinstalled global positioning system (GPS) device which helps in collecting the travel time-related data, this information gives a global picture of the system in real time and can be used for improving the overall performance and schedule adherence of the vehicle. In recent works, the availability of trajectory data collected from operational vehicles in transportation systems has made possible the statistical analysis of travel time of vehicles in roadway segments [29][30][31] , the development of mathematical tools for the estimation of travel times and temporal changes in public transport 32,33 , the implementation of techniques to detect patterns in vehicle trajectories [34][35][36] and public traffic congestion estimation by using artificial neural networks [37][38][39][40][41] . However, approaches to systematically analyze information and identify activity patterns in BRT systems are limited; specifically, very little past research in BRT systems focused on the statistical analysis of the speed of vehicles in specific zones of the system.
In contrast, community detection in networks [42][43][44][45][46][47] has been proved as an important tool to detect patterns in different complex systems. For example, in the identification of correlations in financial markets 48,49 , the study of physiological networks 50 , the classification of patents based on their semantic content in technology 51 and pharmaceutics 52 , describing the network of stations in bike-sharing systems 53,54 or the geographical structure of the Twitter communication network at the global scale 55 , just to mention a few examples of the applicability of this method for pattern detection, unsupervised-classification, and data mining.

Results
Global characteristics of the system. The dataset analyzed in this research is part of an effort of Mexico City to have open databases for transportation systems, human mobility and other topics of interest 56 . In particular, for the BRT system Metrobús it is possible to have access to real-time registers with geographical coordinates (longitude, latitude) for positions and speeds of all the active vehicles in the system 57 . We analyze data for 383 days collected from February 2020 to April 2021 (see "Methods" section for a detailed description of the data). From these registers of activity we have the speed of each vehicle in m/s . In this way, we have access to a global picture of the system's activity at specific moments.
In Fig. 1 we illustrate a global analysis of the dataset. In Fig. 1a we show the positions and speeds of the vehicles in March 2nd, 2020 at 13:00 h, the system has 516 active vehicles at this moment. Each vehicle is depicted with a point and the color encodes its speed (see Supplemental Material with a video of the complete data collected for this day). This representation gives a general overview of the data, such as regions with the highest activity, and zones where the speed of the units is higher than the average. On the other hand, considering that each vehicle has a unique ID, we can count the total number of active vehicles in the collected data on a determined time scale. In Fig. 1b, we show the number of active units at the scale of months considering the 15 months covered in this research. In this analysis, we assume that a vehicle is active in the system if at least has one register in the respective month. The results show that the number of vehicles changes significantly, especially due to the modifications in the system introduced in response to the different stages of the COVID-19 pandemic in Mexico City. In this respect, the number of vehicles in February and March of 2020 represents the common operation pre-COVID-19 in Mexico. This number reduced, particularly, from April to August 2020. In the last 8 months of this study, the number of active vehicles increased from the low in May 2020 but not to the same levels observed in the first two months. To complement this part, in Fig. 1c, we analyze the information of the speeds of the vehicles in the whole system in each month. We explore the probability density ρ month (v) of nonnull speeds v for the records of each month in the dataset. In strong contrast with the results for the number of

Vehicle activity in segments.
In addition to the vehicles, the infrastructure of the system Metrobús includes 195 stations where users access this service and are distributed in 7 lines with 225 km exclusive roadways dedicated to buses. We use all the information available for the movement of vehicles to study the operation of the system in different zones of Mexico City. To this end, we divide the system into N = 214 segments defined by polygons that include stations and the lanes that connect them. A simple segment is described by an elongated rectangle defining a specific geographical region that includes the system's roads and stations at two of its ends. In this study, our partition of the systems considers 205 simple segments. In addition, 9 segments are general polygons, located in zones where different lines converge. In Fig. 2a, we present all the segments of the system. In this representation, polygons are sorted according to the geographical coordinates of their geometrical centers starting from the south-west and considering the latitudes (from south to north) as the first variable and the longitudes (from west to east) as the second variable. An index i = 1, 2, . . . , N codified in the color bar denotes the segment number; we maintain the same index in all the following analyses. Once we define the information of polygons representing the segments, we proceed to analyze the activity of vehicles in each one of them. Due to the high volume of data, we use the geographical coordinates of the vehicles to divide the complete dataset into registers associated with each segment. In this manner, we have all the speeds in each segment during the 383 days in our study. The statistical analysis of the speeds v considering the movement of vehicles (i.e. only registers with v > 0 m/s corresponding to 164 867 137 speed values, a 76.7% of the total database) are presented in Fig. 2b, results show the probability density ρ total (v) calculated with the relative frequencies of v in regular bin counts with �v = 0.25 m/s . In this representation, we maintain the same colors that codified the segments in Fig. 2a. We define a maximum speed v max = 20 m/s = 72 km/h , only a 0.0114% of the total database contains registers with v > v max . In addition, our analysis shows that in a high number of segments, most frequent values of speeds are observed in the interval 5 m/s ≤ v ≤ 10 m/s. Also, the values v can be divided into sets considering a defined time window; for example, registers in a particular hour, day, weekday, month, among others. Once we establish a particular partition of the speeds, we have probability densities that we can compare with the total density ρ total (v) in each segment. On a temporal scale of days, we obtain 383 densities ρ day (v) per segment with the information of the movement of vehicles in each day considered in our study. Then, using the Kullback-Leibler distance 58 we calculate the "distance" between each ρ day (v) and ρ total (v) (see the "Methods" section for a discussion about D KL ). In this way, we have 383 values of D KL in each segment comparing daily registers with the respective ρ total (v) in Fig. 2b. for these polygons reveal that in some days the distribution ρ day (v) differs of the respective ρ total (v) , this may be due to modifications in the routes of vehicles. In Fig. 3b we plot the map of the system Metrobús, the colors represent the value D KL / D KL of each segment. This map allows the identification of zones with regularity in the movement of vehicles (small values of ρ(D KLS ) ) and particular segments where average distances D KL > 0.2 show that the daily distributions differ with the total activity captured in ρ total (v) . In Fig. 3c we depict the statistical analysis of the values x = D KL /�D KL � in each segment, the respective probability densities ρ(x) are skew with a high fraction of distances below the average, (i.e. with x < 1).
In addition to the results for the D KL distances, we show in Fig. 3d,e the monthly distribution of speeds ρ month (v) for two particular segments. Our analysis is similar to the presented for the total system in Fig. 1c, but now for the vehicles in segment 10 with �D KL � = 0.296 and segment 212 with �D KL � = 0.050 (the numbers of these particular segments are included in the map in Fig. 3b). With different colors, we represent the 15 months considered in our study. In the case of the segment 10 with higher distance D KL , the values of ρ month (v) suffered deviations with respect with the total data ρ total (v) represented with a dashed line. On the other hand, in the segment 212 with D KL closer to zero, the ρ month (v) remain approximately the same as ρ total (v) for the total data in this segment.
The results in Fig. 3d,e are two particular examples to show how D KL allows identifying variations in the activity of vehicles in a particular segment, similar analyses can be implemented at different scales of time to characterize changes or the regularity in the vehicular activity in a particular region of the system. In Fig. 3, it can be seen that except for the 5 segments discussed above, the activity of vehicles (described by ρ total (v) in each segment) maintains some regularity. This result is important considering the variations in the number of active vehicles in Fig. 1b. A particular case is found in segment 10 that includes stations close to the National Autonomous University of Mexico. Since many of the activities in this university were developed virtually, the number of users in these stations was reduced significantly. As a consequence of the low demand, the movement of vehicles in this segment was restructured as we can see in Fig. 3d.
Network of similarity between segments. Now we compare all the speed probability densities associated with the activity of vehicles in each segment. We consider the complete probability density for all segments i = 1, 2, . . . , 214 , with the records 0 < v ≤ 20m/s in our database, denoted as ρ  Looking for a better understanding of the similarities in all the distributions presented in Fig. 2b, we use the values of the distances in Fig. 4 to define a similarity network. In this representation, each node is associated with a segment of the BRT system, the size of the network is N = 214 and an edge connecting two different nodes i and j is established if D KLS (i, j) ≤ H where H is a given threshold limit to decide if two segments have similar activity. The result is an undirected network with an adjacency matrix A with elements A ij = 1 if 0 < D KLS (i, j) ≤ H and A ij = 0 for D KLS (i, j) > H . By definition, the adjacency matrix considers the diagonal entries A ii = 0 to avoid loops or connections of a node to itself. In the general case, it is hard to have intuition about the values of H to define a similarity network and its choice depends on the particular structure of the dataset explored and the metric used for the distance or relation between two nodes. In this respect, it is convenient to perform the statistical analysis of all the entries D KLS (i, j) presented in Fig. 4. In this representation of ρ(D KLS ) , the area under the curve H 0 ρ(z)dz gives the fraction of edges included in the network with respect to a fully connected graph. The higher the threshold H, the more edges are included in the network. In particular, in our analysis for the BRT system Metrobús, we see that the interval 0 < H ≤ 1 could produce networks with useful information. For H ≫ 1 , a high fraction of the edges are included in the similarity network losing any particular structure at the level of groups of zones. For the readers interested in this part, we refer to the recent work of Rincón et al. 52 , where a network of patents is explored using similar methods with a metric implemented to compare keywords in texts.
In this way, for each value of H we have a network of segments and we can apply standard methods for its analysis. In Fig. 5 we centered our study on the different networks generated and in the largest connected component (LCC), defined by the largest set of connected nodes within the network. In Fig. 5a we present the number of nodes in the LCC as a function of the similarity threshold H in the interval 0.01 ≤ H ≤ 1 . In this representation of the results, we observe the effect of H defining different scales in the similarity between segments. For small values H < 0.02 the similarity network is formed by small size disconnected clusters showing that a reduced dv. Community structure and identification of patterns. In networks, the distribution of edges is not only globally, but also locally inhomogeneous, with high concentrations of edges within special groups of nodes and low connectivity between these groups. This feature in networks is called community structure. Communities (also clusters or modules), are sets of vertices that probably share common properties and/or play similar roles within the network 45 . Community detection endorses the identification of local connectivity patterns and guides the understanding of interactions in a complex structure. In this work, the communities represent groups of segments with similar activity of vehicles, which arose from considering all the information contained in the similarity network for different thresholds H, something not immediately visible if comparing the probability densities of speeds by pairs of segments. We apply modularity-based clustering algorithms 42,43 to analyze the community structure of the LCC of similarity networks generated with H = 0.05 and H = 0.357 and depicted in Fig. 5b,d, the results derived from the community structure are presented in Figs. 6, 7 and Table 1.
In Fig. 6 we explore the Metrobús system using the similarity network with threshold limit H = 0.357 . This network has two communities: C 1 with 108 nodes and C 2 formed by 106 nodes; the community structure is represented in Fig. 6a, other quantities that characterize the subnetworks defining each cluster (average degree k , global clustering C and average lengths of the shortest path connecting different nodes l ) are presented in Table 1(a). In Fig. 6b we show the communities in a map, in which segments in communities C 1 and C 2 are distributed in the whole system. We also present the total density of speeds ρ (i) total (v) for the segments i ∈ C 1 and i ∈ C 2 (see panels in Fig. 6c,d), dashed lines are generated with the registers of v in all the segments representing the total activity in each community. The speed distributions found in the communities C 1 and C 2 present particular features. One of the differences observed is that in C 1 the speeds v ≥ 10 m/s appear in a 30.46% of the non-null records; whereas, only a 9.34% in C 2 fulfill this condition. In this way, the classification of segments through community detection in similarity networks with H = 0.357 establishes a coarse-grained classification with two categories: C 1 for high-speed segments and C 2 for low-speed zones. The average speeds v in each community reported in Table 1(a) also confirm this characteristic observed in both categories, we also include the www.nature.com/scientificreports/ standard deviations σ v of the speed values. In the results in Fig. 6c,d, for ρ (i) total (v) (represented with thin lines for each segment i), some probability densities deviate from the result obtained with the records for the whole community, something that is more marked in the C 1 community. This is because having the threshold H = 0.357 , some differences are allowed in the similarity network.
In Fig. 7, we present the results for the analysis using H = 0.05 , a criterion that requires greater similarity to form a link in the network of segments. In this case, community detection algorithms applied to the LCC allow defining 6 communities C 1 , C 2 , . . . , C 6 that contain at least four segments as we illustrate in Fig. 7a. In the map in Fig. 7b, we see that these categories produce a more varied map, although as H is small, several segments (represented in white) cannot be grouped into a community, being outside the LCC (we also omitted a  For the non-null speeds in the segments in each community v denotes the average speed and σ v the standard deviation, we also report the fraction (as percentage) of speeds with v ≥ 10 m/s.

Community
Segments www.nature.com/scientificreports/ community defined by three segments in the LCC). In Fig. 7c we present the probability density for v with the records of non-null speeds in the segments of each community. In this case, the unsupervised classification of the segments produces more varied results that are reported in Table 1(b). For example, in the analysis of the proportion in which the velocities v ≥ 10 m/s appear, C 2 , C 6 define groups of segments with high speeds where more than 30% of the data meet v ≥ 10 m/s . Communities C 1 , C 4 have a fraction of around 14% for these speeds, and C 3 , C 5 define low-speed zones with less than 10% . This type of classification is also evident for the average velocities v in each community, being the highest v = 8.68 m/s in community C 6 and the lowest v = 4.73 m/s obtained for C 5 . The measures that describe the communities as networks also give us important information; for example, C 1 and C 2 are the subnetworks that have more nodes; however, considering the links, C 1 is much more connected, a fact that is evidenced in the highest average degree and clustering. Finally, the most valuable information is the probability density ρ(v) in Fig. 7c for the data in each community. The obtained distributions have particular characteristics that describe the vehicular movement in each group of segments. All this information and the map in Fig. 7b, help us to understand the global activity of the Metrobús system and its operation since the results obtained with the combination of methods implemented in this research allow us to detect emerging patterns when comparing the activity of the entire system.

Discussion
From the study of the data with the activity of vehicles in the BRT system Metrobús for 383 days and a partition of the regions where these vehicles move defining 214 segments, it is found that this system operates with relative regularity in each zone. In particular, the distributions of speeds in the entire system and in each of the segments are preserved, presenting small variations depending on the day, with some exceptions that also can be detected using the statistical methods implemented for this study. In this way, the speed distribution of each segment is a good reference for the specific behavior of the vehicles in each of the geographical zones defining the segments. The variations in each segment at different temporal scales can be effectively studied using Kullback-Leibler distances.
In addition, the analysis of the Kullback-Leibler distance between speed distributions of all pairs of segments allows the representation of the entire system as a network. In this structure, community detection algorithms allow identifying groups of segments with similar vehicular activity, the number of communities found varies according to parameter H required to define the similarity network. In a case with H = 0.357 two categories are established, one with segments in which speeds with v ≥ 10 m/s are more frequent and another in which these speeds appear less frequently. The analysis of a network with a greater similarity between segments with H = 0.05 gives a classification with more specific characteristics in the speed distributions, in this case the records with v ≥ 10 m/s appear in different proportions in each community. Our findings show that the Metrobús BRT system presents certain regularity in its operation, in the sense that the distribution of speeds of vehicles in 209 segments of the system suffered only small variations even with the reduction of active vehicles implemented  www.nature.com/scientificreports/ due to the COVID-19 pandemic in Mexico City. It seems plausible to associate this regularity with the exclusive lanes in the system and the rules that operators of the vehicles must follow. The statistical methods and the network science approach implemented in this research can be used for the multi-scale study of different transportation systems. In systems like taxis, buses, car-sharing services, in which large amounts of data are available with registers of the movement or quantities associated with vehicles or agents along with their geographic coordinates, this approach can lead to the unsupervised detection of regions with similar activity of vehicles. Other studies can incorporate the statistical analysis of different quantities of interest; for example, schedule adherence of the vehicles, carbon emissions, or user's accessibility to stations. A profound understanding of the vehicle activity and similarities detected in groups of segments can help researchers and transit specialists to draw up strategies tailored to improving operational aspects of the system.

Methods
Dataset description. With the implementation of location-enabled devices on public transportation, a large amount of bus trajectory data is being generated. Since April 2019 all the information of the movement of all active vehicles in the BRT system Metrobús is available under request to the public 57 . The information provided contains the timestamp, vehicles ID, and registers in real-time of the GPS coordinates (longitude, latitude) of each vehicle in the system, their speeds in meters per second (m/s) , and qualitative descriptions of the state of each vehicle or the levels of congestion for example: on time, stopped, among others. Each vehicle updates this information every 30 seconds. The Metrobús system operates from 4:30 to 00:00 h on weekdays (Monday to Friday) and starts operation at 5:00 h on weekends (Saturday, Sunday) and holidays, the description of stations and routes is available to the public in the webpage of Metrobús 59 . By using a code written in Python, we request the data automatically (waiting 30 seconds between requests), an initial treatment of the retrieved records is performed to save the data. We maintained the download of data from February 16th, 2020 to April 8th, 2021. In total, we have data for 383 days with 215 025 258 registers of position and speed of vehicles. On some particular days, the data was not available for download due to maintenance or problems connecting with the server.
Kullback-Leibler distance. The Kullback-Leibler distance is a standard method to calculate the difference between two probability distributions P(z) and Q(z) describing a stochastic variable z 58,60 . This tool is widely used for database comparison. For continuous distributions, this distance is given by 58 Here Q acts as a reference distribution. Also, it is important to emphasize that D KL (P||Q) is not a distance in the sense of a metric since the distance between P and Q is not necessarily the same as between Q and P. Also, from the definition in Eq. (4), it is clear that D KL (P||Q) > 0 and is null when P = Q.
Networks. Symmetric networks with N nodes are described by an adjacency N × N adjacency matrix A with entries 1 if two different nodes are connected and 0 otherwise. An important quantity in the study of networks is the degree of node i given by k i = N l=1 A il , that gives the number of connections to that node. In terms of this quantity we define the average degree as Another measure to characterize the topology of networks is the clustering coefficient 18 . This coefficient C i of the node i, quantifies the fraction of connected neighbors △ i of the node i with respect to the maximum number of these connections given by k i (k i − 1)/2 . In terms of the adjacency matrix we have for k i ≥ 2 otherwise C i = 0 . Here (A 3 ) ii = (AAA) ii = △ i /2 . The average clustering coefficient is given by From the information in the adjacency matrix, different algorithms allow the calculation of the shortest path connecting the nodes i and j, the length l ij with the number of edges in this shortest path is a measure of the distance between two nodes 18 . This information allows defining an average distance l given by In this way, for a particular connected undirected network with N nodes, we can calculate the adjacency matrix A and obtain the global quantities k , C and l that describe this structure.