Using road class as a replacement for predicted motorized traffic flow in spatial network models of cycling

Recent years have seen renewed policy interest in urban cycling due to the negative impacts of motorized traffic, obesity and emissions. Simulating bicycle mode share and flows can help decide where to build new infrastructure for maximum impact, though modelling budgets are limited. The four step model used for vehicles is not typically used for this task as, aside from the expense of use, it is designed around too-large zone sizes and a simplified network. Alternative approaches are based on aggregate statistics or spatial network analysis, the latter being necessary to create a model sufficiently sensitive to infrastructure location, although still requiring considerable modelling effort due to the need to simulate motor vehicle flows in order to account for the effect of motorized traffic in disincentivising cycling. The model presented uses an existing spatial network analysis methodology on an unsimplified network, but simplifies the analysis by substituting explicit prediction of motorized traffic flow with an alternative based on road classification. The method offers a large reduction in modelling effort, but nonetheless gives model correlation with actual cycling flows (R2 = 0.85) broadly comparable to a previous model with motorized traffic fully simulated (R2 = 0.78).


Results
Our best model, model 3, achieves a cross-validated R2 of 0.854 against measured cycle flows and a mean GEH of 1.92 (see Section 4.3 for the definition of GEH). It also achieves a cross-validated fit of R2 = 0.45 against census output area-level mode share data. Model 3 therefore improves on the performance of ref. 18, which achieved R2 = 0.78 in the prediction of measured flows, and equals that study in the prediction of mode share.
A comparison of work required for the different modelling processes is given in Table 1. Note that as Cardiff is a coastal city, this may underestimate the effort required for regional models of inland cities, from which the hinterland extends in all directions. The modelling areas, for example, differ only by a factor of 7 in this study, and the number of network links differs by a factor of 3, as it is the less dense areas which have been excluded from the simpler model (this may not be the case in other applications, e.g. modelling the centre of a large city).
Modelling effort is also contingent on the accuracy of spatial models required in each case. At the time of the study, the OpenStreetMap data often contained topology errors where links would touch or intersect at places other than endpoints, and misclassifications of one-way links. For the spatial network model of motorized flow, it was essential to manually check one-way links, as errors in their encoding could result in e.g. all motor traffic being assigned to one side of a dual carriageway only, causing the empty side to appear attractive for cycling when this is not reflected in real-world conditions. Assignment of road classes, by contrast, was mostly automatic, requiring manual intervention in only 2 cases. Topology errors in both models were fixed automatically by planarization and automatic splitting of lines at intersections. The exceptions are bridges and tunnels ('brunels') which were removed from the data before automatic splitting, but required manual checks at key locations to ensure correct recombination afterwards. This was needed for a larger number of cases in the motorized flow model.
The remainder of this section discusses models 1 and 2, used as stepping stones to achieve the better model 3, and a test of the effectiveness of road class as a predictor of motorized flow.
Model 1 is the initial attempt to use road class to predict cycling and is used for calibration purposes only. It achieves R2 = 0.505 in univariate fit against actual cyclist flow data, an improvement on the simulated motor flow based model of ref. 17, which achieved R2 = 0.49. Figure 1 uses a scatter plot to show the differences in prediction between model 1 and ref. 17. Some modelled cyclist flow has been displaced from road class 5 to class 4, reflecting model 1's disincentivization of travelling on higher road classes, regardless of actual motorized flow. Contrary to this, other cyclist flows appear to be displaced from class 1 to class 2. This is likely because replacing the predicted motorized flow of the class with its median value reduces the deterrent effect of both predicted and actual motorized flow outliers in class 2 (visible in Fig. 5). Such outliers manifest in popular parlance as 'rat runs': local and tertiary roads which are more popular for motorized traffic than their categorization would suggest. Unfortunately, traffic count data is not available to verify this hypothesis. However, the fact that we have achieved an increase in model performance despite ignoring potentially increased actual traffic flow on 'rat runs' suggests a number of possibilities. Firstly, it is possible that the effect is insubstantial compared to improvements in motorized flow predictions elsewhere. Secondly, it is possible that in the case of the current study area, cyclists tend to use such routes in spite of their motorized flow, perhaps because dedicated cycle lanes exist, or because the motorized flow is naturally of low speed, or managed by speed limits and traffic calming measures. Finally, it is possible that such routes entail poor cycling conditions, but no better alternatives exist. Determination of which of these is the case is beyond the scope of the current study.
Figure 2 explores the difference between models in greater detail, by examining how changes in the prediction of motorized traffic affect changes of predictions in cycle traffic. Zone B contains the 'rat runs' discussed above: class 1 and 2 roads which, when we replace predicted motorized flow with road class information, are effectively subject to a substantial reduction in modelled motor traffic, yet exhibit little to no change in predicted cyclist flow. Not only the 'rat runs' , but in fact, the majority of links show only a weak correspondence between the reduction of simulated motor traffic and increase of simulated cyclist flow. This is illustrated by the trend line marked C, with the exceptions being shown in the zones marked A. The reason for this seeming lack of sensitivity to predicted motorized flow is that the choice set of sensible routes through a network is naturally limited to a small number for any given trip; thus, there is scope for considerable change in the modelled cost of the alternative routes, before the cyclist's modelled choice of route changes at all. For the modeller, this is convenient, as the lack of sensitivity (within a reasonable range) of route choice to actual motorized flow helps with our aim of discarding it from the model in favour of road class information.
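The limited sensitivity of route choice to modelled link cost can be illustrated with a toy example. The sketch below is not the paper's model: the four-node network, the link lengths and the multiplier values are all invented for illustration. It applies a distance multiplier to a hypothetical main road and shows that the cheapest route from A to D stays the same over a range of multipliers, switching only once the main road becomes costlier than the quiet alternative.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra over a dict-of-dicts graph; returns (cost, path)."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []


def build_graph(main_road_multiplier):
    """Hypothetical network: A-B-D is a main road (2 x 100 m, penalised),
    A-C-D is a quiet street (2 x 150 m, unpenalised)."""
    main = 100.0 * main_road_multiplier
    quiet = 150.0
    return {
        "A": {"B": main, "C": quiet},
        "B": {"A": main, "D": main},
        "C": {"A": quiet, "D": quiet},
        "D": {"B": main, "C": quiet},
    }


# Chosen route for a range of (invented) deterrence multipliers.
routes = {
    m: shortest_path(build_graph(m), "A", "D")[1]
    for m in (1.0, 1.2, 1.4, 1.6, 2.0, 3.0)
}
for multiplier, route in routes.items():
    print(multiplier, "-".join(route))
```

In this toy case any multiplier below 1.5 yields the main road and any multiplier above 1.5 yields the quiet street, so two models that disagree about the deterrence of the main road, but fall on the same side of that threshold, predict identical routes.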
Model 2 optimizes the fit against measured cycle flows by manual modification of distance multipliers to correct systematic over- or under-prediction of measured flows in each road class (see Section 4.4), improving the univariate fit slightly to R2 = 0.514. Table 2 shows distance multipliers for models 1 and 2; an improved fit was achieved by increasing the distance penalty for higher road classes, in particular for class 6, non-residential dual carriageways. Model 3 (discussed at the start of this section) applies these distance multipliers in a multivariate model to achieve optimal performance with weighting λ (explained in Section 4.3) equal to 0.5.
Lastly, we examine the question of whether road class works for cyclist predictions by virtue of proxying actual motorized flow, by comparing spatial network 17 and road class models for prediction of motorized traffic in Table 3. For the points where vehicle counts were conducted, the road class itself outperforms the simplified spatial network analysis used in that paper as a predictor of actual motorized flow, even taking into account the increased number of parameters (e.g. the sample mean for each road class being used as a parameter in a "model" where all roads are assigned predicted motorized flow based solely on their class). Thus we must consider in discussion the extent to which road class data may simply be a proxy for actual motorized flow.
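The road class "model" of motorized flow described above, in which every road is assigned the sample mean flow of its class, can be sketched as follows. This is a minimal illustration under stated assumptions: the class labels and flow values are invented, the function names are hypothetical, and the Box-Cox transform and AIC comparison used in the paper are omitted, so only an in-sample R2 is computed.

```python
from collections import defaultdict


def class_mean_model(classes, flows):
    """Return the mean observed flow for each road class."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for road_class, flow in zip(classes, flows):
        totals[road_class] += flow
        counts[road_class] += 1
    return {road_class: totals[road_class] / counts[road_class] for road_class in totals}


def r_squared(observed, predicted):
    """Coefficient of determination of predictions against observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Invented example data: road class and counted motor vehicle flow per site.
classes = [1, 1, 2, 2, 4, 4]
flows = [50.0, 90.0, 800.0, 1000.0, 3000.0, 4000.0]

means = class_mean_model(classes, flows)
predicted = [means[c] for c in classes]
r2 = r_squared(flows, predicted)
print(means, round(r2, 3))
```

Each class mean counts as a fitted parameter, which is why the comparison in Table 3 also reports an information criterion rather than R2 alone.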

Discussion
This paper has attempted to improve the transferability of spatial network analysis based cycling transport models by eliminating dependence on a detailed motor vehicle model. We have shown that replacing detailed motorized traffic flow simulation with road class information provides broadly comparable performance, in fact slightly improving on existing literature in the current case. At first glance this is surprising, as we have discarded substantial information; however, several factors serve by way of explanation. Firstly, at the points for which we have motor vehicle information, the defined road class system outperforms the simplified road traffic model used in previous methods as a predictor of motorized traffic flow. Secondly, it is likely that cyclists' perceptions of difficulty are influenced by aspects of road class beyond actual motorized flow; for example, road class proxies speed information, i.e. lower road classes will carry slower moving traffic which is potentially of lesser danger to the cyclist and thus preferred, actual motorized flow notwithstanding. Although we cannot fully disentangle the influence of motorized flow versus road class in this study, the fact that model 1 (based directly on predictions of mean motorized flow for each road class) is slightly outperformed by model 2 (based on further calibration of road classes, in particular increasing the deterrence of all higher road classes except residential dual carriageways) suggests that both factors make a contribution. Thirdly, we note that the realistic option set for route choice between any two points is normally limited; therefore, quite wide variance between different models in deterrence caused by motor vehicle traffic for the same link will often lead to the same ultimate choice of route for the cyclist, provided the modelled deterrence of each link is within sensible limits.
(This should not be confused with the importance of simulating a variety of aversions to motor traffic among cyclists, as shown to be beneficial both by the current paper and ref. 18.)
The performance gain shown here, although gratifying, is of an order of magnitude which could easily be outweighed by variance in results between different data sets covering different urban areas when the model is applied elsewhere. A limitation of the study is its restriction to a single city-scale model, rather than a study of multiple regions. We therefore see our key contribution not as an increase in modelling accuracy, but as a decrease in modelling complexity through removing the requirement for an explicit vehicle model. In the current case, the reduction in modelling effort is substantial; theoretically, the reduction could be very high indeed, e.g. if modelling a small area within a large and dense urban metropolis. This contributes to cycle infrastructure planning by making it easier to apply the spatial network model in new locations.
Should the reason for the success of road class in cycle models be due in large part to its proxying of actual motorized flow, a further limitation materializes, namely that the model should be used with extreme caution when predicting the effect of road reclassification. In these cases, verification that post-intervention road classes will continue to approximately reflect actual motorized flow is essential. However, this is likely an unusual modelling scenario (except in the case of reclassifying to prohibit motorized traffic, in which case zero motorized flow can be assumed and this limitation does not apply). The primary envisaged use of the model is in predicting cyclist flows and mode choice, possibly in the presence of new cycling links and motorized traffic prohibitions, based on an assumption that existing motorized flows remain approximately the same except in locations where prohibitions are introduced.
In reapplication of either model to new areas, recalibration of factors (road traffic deterrence or road class deterrence) against actual cyclist flow and/or area mode share is strongly recommended. This is especially the case in international use: although similar systems of road classification are widespread globally, there are substantial differences in local context. These include, for example, (1) the difference between European-style compact cities and American-style car-oriented cities with large suburbs; (2) the difference between planned grids of regular blocks and organically grown spatial layouts; (3) cultural differences in how cycling is perceived as a mode of transport, and in the awareness and willingness of drivers to afford road space to cyclists. While there is reason to believe that road class remains a useful predictor of cyclist behaviour in these contexts, it is also possible that the distance multipliers applicable in different countries will differ substantially.
The road class model will require verification and possibly adaptation to ensure that the classes used make sense locally: suitability of any road class system will ultimately remain unknown until a model is attempted, but local knowledge on cyclist behaviour will likely be a good predictor of the suitability of the model. Although the model of ref. 17, based on motorized flow, offers in principle a universal standard for international comparison, the cultural differences noted above still mean that the same level of flow can have different effects on behaviour depending on local context, so neither model can be used without appropriate consideration.
Optionally, motorized traffic data can be used as a starting point for road class deterrence factors as in the current study, but in the presence of cyclist data, this may not be necessary (the same can be said for calibration of the more complex motorized spatial network model for which we propose replacement).
The future likely holds numerous potential improvements for models of cycling flow, from better calibration techniques to inclusion of additional factors such as the "safety in numbers" phenomenon 22 , and combination of socio-economic with spatial network models 23 in particular to reflect well-known class and gender imbalances in cycling 13 .

Methodology
Study area. Cardiff, Wales is selected as the study area for this paper. Cardiff's existing traffic-free cycle network is quite fragmented, with only the Taff Trail, a flagship cycle route which connects north and south, acting as a backbone. According to the 2011 Census of England and Wales 24, 3.6% of working residents cycle to work in Cardiff, which leads Wales and is higher than the average for England and Wales. Yet there is a huge gap between Cardiff and the 10 UK cities exhibiting the highest levels of cycling to work. Cardiff Cycling Strategy 2016-2026 25 observes that 52% of car trips in Cardiff are under 5 km and 28% of residents do not cycle now but aspire to in future, revealing large potential for increasing the cycling level. However, annual capital expenditure on cycling infrastructure by Cardiff Council and external funding combined is only £4 per resident, a low investment compared to the internationally renowned cycling cities Amsterdam and Copenhagen, which invest around £18 per resident. A larger investment in expanding the cycle network is expected to assist in realizing this potential.
Data. This paper is based on a spatial network provided by OpenStreetMap (OSM), a public and crowd-sourced mapping system 26. In terms of cycle network coverage, continuity, attributes and recency, ref. 27 found OSM to be a better mapping system than Ordnance Survey (OS). Slope data for the spatial network is taken from Ordnance Survey Terrain 50; this misses small scale changes in height such as those encountered on bridges and underpasses, but captures most terrain effects and has the advantage of being free to use under an OpenData license.
To calibrate the models, two sources of actual cycle flow data were used. The Department for Transport estimates, by a combination of manual and automatic survey and interpolation 28, the annual average daily traffic (AADT) of both motor vehicles and pedal cycles at 107 on-road locations in Cardiff. This is supplemented by cycle flow data from 14 traffic-free locations collected by electronic sensors belonging to Cardiff Council. As the two sources used different methodologies to collect cycle flow data, they are not directly comparable, in particular because the Department for Transport does not take localized weather conditions into account when surveying cycling behaviour. However, both sources are important to the calibration process and thus must be combined. We follow ref. 17 in using a dummy variable to account for data source in the final predicted flow model.
The motor vehicle flow predictions in Cardiff are obtained from the motor vehicle flow sub-model in ref. 17, which has a good correlation (R2 = 0.84) with measured vehicle flows.
Mode share data is taken from a total of 1077 census Output Areas (Office for National Statistics, 2011).
Network analysis. This paper applies the publicly available Spatial Design Network Analysis + (sDNA+) toolkit in ArcGIS 29 . To calibrate the effect of road class in our models 1 and 2, we make use of the simpler models presented in ref. 17 , and to obtain our final results we add in model 3 the extensions of multiple trip purpose, distance decay, heterogeneous cyclist ability and agglomeration detailed in ref. 18 . The remainder of this section summarizes the models in these two papers.
Both of these models make use of spatial network betweenness 30 for predicting flows. Intuitively this can be conceived as simulating the shortest trips from everywhere to everywhere, subject to a definition of distance which reflects cyclist preferences, and a maximum distance for the trip. Although apparently indiscriminate in handling of origins and destinations, the correlation of network density with jobs and homes 31 has the effect that denser areas are modelled as generating more trips. The betweenness approach thus has a history of providing a reasonable fit to vehicle 32,33 and pedestrian 34 data. The formula used for betweenness is

[...]

[...] N, and $R(y, r_{min}, r_{max}, d_{radius})$ is the subset of the network closer to link $y$ than a threshold radius $r_{max}$ but further from $y$ than $r_{min}$, according to the distance metric $d_{radius}$. The $OD(y, z, x, d)$ function defined in Eq. (2) describes the proportion of link $x$ that falls on the shortest path from the middle of link $y$ to the middle of link $z$, with partial contributions for links which form the endpoints of the shortest path 18. This is equivalent to the original definition of betweenness 30 under the assumption that shortest paths are unique, and subject to adaptation for spatial network representation in which, under dual representation 35, [...]

[...] decay; in contrast to ref. 17 these distances are interpreted as adjusted for slope and motorized traffic because we use cyclist distance (Section 4.4, Eq. 9) for $d_{radius}$ as well as $d_{routing}$. The multiple trip/cyclist combinations can also be interpreted as a simulation of non-interacting agents. In modelling terms, this means that multiple betweenness values are computed for each link, based on different values of $d_{routing}$, $d_{radius}$, $r_{min}$, $r_{max}$ and $W(z)$, where [...] The sDNA+ software automatically sets $r_{min}$ and $r_{max}$ given the desired distance bands above.
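The idea of radius-limited betweenness described above can be sketched in a few lines. This is a deliberately simplified illustration and not the sDNA+ formula: it works on nodes rather than link midpoints, uses a single distance metric for both routing and radius, treats the radius bounds as inclusive, gives no partial credit to path endpoints, and assumes shortest paths are unique. The function names and the toy network are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances and predecessor map from source."""
    dist = {source: 0.0}
    prev = {}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                prev[neighbour] = node
                heapq.heappush(queue, (nd, neighbour))
    return dist, prev


def radius_betweenness(graph, rmin, rmax, weight=None):
    """For every origin, route to each destination whose distance lies in
    [rmin, rmax] and credit the destination weight to intermediate nodes."""
    weight = weight or {node: 1.0 for node in graph}
    score = {node: 0.0 for node in graph}
    for origin in graph:
        dist, prev = dijkstra(graph, origin)
        for dest, d in dist.items():
            if dest == origin or not (rmin <= d <= rmax):
                continue
            node = prev[dest]
            while node != origin:
                score[node] += weight[dest]
                node = prev[node]
    return score


# Toy network: a straight street a-b-c-d with 100 m segments.
street = {
    "a": {"b": 100.0},
    "b": {"a": 100.0, "c": 100.0},
    "c": {"b": 100.0, "d": 100.0},
    "d": {"c": 100.0},
}
local = radius_betweenness(street, 0.0, 250.0)  # excludes the 300 m trip a-d
wide = radius_betweenness(street, 0.0, 300.0)   # includes it
print(local, wide)
```

Computing the measure once per distance band and per traffic-aversion setting, as the text describes, amounts to calling such a routine repeatedly with different radii, weights and distance metrics.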
Traffic aversion, and hence $d_{routing}$ and $d_{radius}$, are modified by changing the value of parameter $t$ in Eq. (9). A betweenness value for each distance band is computed for each possible combination of $t \in \{0.4, 0.6, 0.8\}$ with $W(z)$ representing [...]

Table 6. Derivation of road classes used in the study from tags in OpenStreetMap.

[...] where the βs are regression coefficients, and source is a dummy variable set to 0 if the actual flow was recorded by the Department for Transport and 1 if recorded by Cardiff Council. Cross-validated ridge regression is used to handle inherent collinearity and prevent overfit 36,37; models can thus be compared using a cross-validated coefficient of determination (R2). The Box-Cox transform is inappropriate in a multiple regression context and is therefore replaced with a weighting scheme

$$RW(y) = \frac{y^{\lambda}}{y} \qquad (5)$$

where $RW(y)$ is the regression weight for a data point with dependent variable value $y$, and $\lambda$ is a calibration parameter (similar to that in the Box-Cox transform, and unrelated to the regularization parameter $\lambda$ in ridge regression) such that regressing with $\lambda = 1$ minimizes absolute errors while $\lambda = 0$ minimizes relative errors. The actual value of $\lambda$ is chosen so as to minimize the GEH (Geoffrey E. Havers) error statistic popular in transport planning 38, which captures a mixture of absolute and relative error in residuals:

[...]

[...] where the βs are regression coefficients. As mode share data is only available on a zonal basis, the reach variables are averaged over all links within each zone to provide the independent variables for regression.
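The weighting scheme of Eq. (5) and the GEH statistic can be illustrated numerically. The sketch below uses hypothetical function names. The GEH function follows the form widely used in transport planning, GEH = sqrt(2(M − C)² / (M + C)) for a modelled flow M and a counted flow C; whether the paper applies it to hourly or daily flows is not stated in this section, so the units here are an assumption.

```python
import math


def regression_weight(y, lam):
    """Regression weight RW(y) = y**lam / y, following Eq. (5).

    lam = 1 gives every point weight 1 (absolute errors dominate);
    lam = 0 gives weight 1/y (relative errors dominate).
    """
    return y ** lam / y


def geh(modelled, counted):
    """GEH statistic in its commonly used form (assumed, see lead-in)."""
    if modelled + counted == 0:
        return 0.0
    return math.sqrt(2.0 * (modelled - counted) ** 2 / (modelled + counted))


# Weights for a site with an observed flow of 400 under three settings of lam.
weights = {lam: regression_weight(400.0, lam) for lam in (0.0, 0.5, 1.0)}
print(weights)

# A 10-vehicle error matters more at a quiet site than at a busy one.
print(geh(110.0, 100.0), geh(1010.0, 1000.0))
```

With λ = 0.5, the value reported for model 3, a site's weight is proportional to 1/sqrt(y), which sits between the absolute-error and relative-error extremes.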
Definition of distance. The cycling models of betweenness and network density are both based on a cycling distance metric which accounts for the effect of slope, levels of motorized traffic and straightness on the distance perceived by the cyclist. Ref. 17 begins with the findings of ref. 11 [...] and AADT is the predicted annual average daily flow of motorized vehicles on the link. The cycling distance is measured as a round trip and it is assumed that a cyclist adopts the same route for both outward and return [...]

Table 7. Summary of distribution of simulated Annual Average Daily Traffic (AADT in vehicles/hour) across different road classes.

Road class
7      90     54   9352   9024  11556   9798
6     521     98   7403   4819  10377   8698
5     144     15   6358   3016   8958   4385
4     948     83   3414   2257   3762   2253
3     443     48   2296   1273   1856   1108
2    2585    280    918    368    792    267
1   18102   1208     70     15     75     13
0     526     78      0      0      0      0
Road categorisation. The practice of road classification is pervasive in modern transport planning, and hence ubiquitous in higher income, as well as widespread in middle-income, countries worldwide. The UK Department for Transport defines five types of road which are broadly comparable to those used in other countries: motorways, A roads, B roads, classified unnumbered and unclassified 39. We reviewed these categories within the study area to determine whether we believed them to capture sufficient details of the urban environment for our purpose of replacing predicted traffic flow in the models of refs. 17,18. Of particular concern was that A roads in the UK can be both major and minor arterials and, separately, can be built with either single or dual carriageway design. Furthermore, the cycling characteristics of dual carriageway A roads differ substantially depending on whether or not they are fronted by residential properties. Figure 3 shows an example, contrasting a residential dual carriageway bordered by pedestrian sidewalks and joined by private driveways, with a speed limit of 40 mph, with a non-residential dual carriageway which is functionally similar to a motorway, with a variety of speed limits up to 70 mph. To capture these differences in the cycling environment, we define three road classes extracted from A roads: residential single carriageway, residential dual carriageway and non-residential dual carriageway. The remainder of the Department for Transport's classes were considered adequate for our purpose. Defined road classes with general definitions, functions and features, and the associated conversion to the UK standard, are set out in Table 4.

Table 3. Comparison of models for predicting motorized vehicle (not cyclist) flow. We exclude traffic free paths and include motorways to give a total of n = 107 data points for this test only. To match the methodology of ref. 17, counts and predictions are Box-Cox transformed prior to computing R2 and Akaike Information Criterion (AIC), but GEH is computed on raw traffic counts. See Section 4.3 for the definition of GEH.

Having defined these road classes, it is also necessary to define the mapping through which they are extracted from OSM, based on OSM's defined highway types. Table 5 shows possible values for the 'highway' tag in OpenStreetMap. For instance, trunk usually refers to a dual carriageway A road; primary refers to a single carriageway A road; secondary refers to a B road; and tertiary refers to a classified unnumbered road. In scenarios where a link is actually a single carriageway but is classified as trunk or as a dual carriageway, another attribute, 'oneway', is used to ensure that single and dual carriageways are correctly differentiated. For lower level road types, information from OSM tends to be detailed and needs to be consolidated to match the defined road classes, or to be excluded when it is not relevant to cyclists. For instance, living_street and residential are both classified as local roads, while bridleway and track can be excluded as they do not appear within Cardiff city limits. Table 6 shows the derivation of our road classes from OSM data and Fig. 4 the resulting road categorisation in Cardiff.
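The tag-based derivation described above can be sketched as a small lookup. This is an illustration only and does not reproduce Table 6: the class labels are hypothetical strings rather than the paper's numbered classes, the motorway entry and the rule that a trunk link tagged oneway=yes forms one side of a dual carriageway are assumptions about how the 'oneway' attribute is applied, and the residential versus non-residential distinction for dual carriageways is not attempted.

```python
# Hypothetical labels; the paper's Table 6 defines the actual road classes.
HIGHWAY_TO_CLASS = {
    "motorway": "motorway",
    "primary": "A road, single carriageway",
    "secondary": "B road",
    "tertiary": "classified unnumbered",
    "residential": "local road",
    "living_street": "local road",
}

# Highway types the text says can be excluded within Cardiff city limits.
EXCLUDED = {"bridleway", "track"}


def road_class(tags):
    """Map a dict of OSM tags to a road class label, or None if excluded/unknown."""
    highway = tags.get("highway")
    if highway is None or highway in EXCLUDED:
        return None
    if highway == "trunk":
        # Assumption: each side of a dual carriageway is mapped as a one-way link.
        if tags.get("oneway") == "yes":
            return "A road, dual carriageway"
        return "A road, single carriageway"
    return HIGHWAY_TO_CLASS.get(highway)


examples = [
    {"highway": "trunk", "oneway": "yes"},
    {"highway": "trunk"},
    {"highway": "living_street"},
    {"highway": "track"},
]
for tags in examples:
    print(tags, "->", road_class(tags))
```

A lookup of this kind covers the automatic part of the assignment; the handful of cases needing manual intervention mentioned in the Results would be handled as overrides on individual links.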
We use the vehicle sub-model of ref. 17 to estimate AADT on each link. As with previous literature 33,40, this is based on angular betweenness, i.e. the definition of distance is cumulative angular change, thus preferring routes with the least change of direction, whether at junctions or on links. Such routes usually have priority and thus to some extent proxy shortest travel time. A range of trip distances from 10 to 30 km is tested, and the best fit to actual motorized flow is picked for use in predicting AADT. Table 7 and Fig. 5 show the distribution of simulated AADT across road classes. Noting (i) the presence of AADT outliers within each road class, and (ii) that cyclists