## Introduction

Recent years have seen renewed policy interest in urban cycling due to increasing realisation of the negative impacts of motor traffic, obesity and emissions1. Some cities that are well known for their cycling infrastructure, such as Amsterdam and Copenhagen, lead the world in cycling levels, with 40% of trips completed by bicycle2. Meanwhile, others such as London, New York City and Paris are investing in infrastructure or adopting pro-cycling policies3,4. However, with limited resources, it is crucial to ensure that the money is well spent. Thus, a common question when urban planners attempt to build a bicycle-friendly environment is: where should cycling infrastructure be implemented for maximum effect? The economic argument is often the most persuasive to policymakers, and is underpinned by the switch of transport mode from motor vehicle to bicycle: fit people save health services money. Simulation of cyclist mode share is thus of great importance.

Aggregate statistical approaches based on spatial factors and demographics have been successful at predicting overall levels of cycling5,6,7,8,9. Another possibility is to model potential rather than predictions, where potential is defined as current travel demand over distances short enough to be cycled, whether or not such demand is currently fulfilled by cycling10. These models are valuable for identifying potential at a coarse spatial level, but once that has been established, a different model is needed to predict the effect of spatially detailed infrastructure changes. Any such model must determine whether a proposed infrastructure change lies on a route that, post-change, will actually be used; hence models must incorporate cyclist route choice11,12,13,14.

Motorized transport is typically simulated by the four-step model15: trip generation, trip distribution, mode choice and route choice. Ref. 16 outlines reasons why this approach has not simply been extended to active travel modelling. Most crucial from a cycling perspective is that practical deployments of the four-step model are typically (i) geared towards use on a simplified road network, and (ii) use a zonal approach when predicting trips (i.e. from residential zones to business zones). The simplified network arises because accurate vehicle modelling requires iterative assignment to determine the equilibrium state in the presence of congestion, as well as junction timing models, both of which complicate analysis, so it is beneficial to simplify road networks by removing minor streets which play little role in actual motorized flow patterns. The zonal approach arises because demographic data is usually only available at zonal level. In modelling cycling, however, the zonal approach misses detailed consideration of trips that fall within a single zone, along with minor roads which may be preferred by cyclists. A further limitation of the four-step model is the exclusion of long-term effects of changing accessibility on land use: such feedbacks are of importance to active travel models, e.g. in residential location self-selection16. Finally, the budget for modelling cycling is typically much smaller than that available for motorized traffic models.

To address these issues, ref. 17 simplified the route choice model of ref. 11 and combined it with spatial network analysis to model cyclist flows, risk and mode share. This model made the simplifying assumption that cyclists travel from everywhere to everywhere subject to a maximum trip distance. Later work18 managed to discard these assumptions, in their place incorporating agglomeration effects, multiple trip purposes, heterogeneous preferences of different classes of cyclist, and the deterring effects of traffic and slope on mode share, to obtain a cross-validated fit with coefficient of determination R2 = 0.78 between modelled and measured cyclist flows. In the latter model, both mode and route choice are based on “cyclist-adjusted distance” i.e. distance with penalties applied for slope, turns, and level of predicted motorized traffic flow on each individual link within the network. Similar models of the pedestrian mode have also been produced19.

## Results

Our best model, model 3, achieves cross-validated R2 with measured cycle flows of 0.854 and mean GEH of 1.92 (see Section 4.3 for the definition of GEH). It also achieves a cross-validated fit of R2 = 0.45 against census output area-level mode share data. Model 3, therefore, offers an improvement on the performance of ref. 18 which achieved R2 = 0.78 in the prediction of measured flows, and equals that study in the prediction of mode share.

A comparison of work required for the different modelling processes is given in Table 1. Note that as Cardiff is a coastal city, this may underestimate the efforts of regional models in inland cities from which hinterland extends in all directions. The modelling areas, for example, differ only by a factor of 7 in this study; and the number of network links differ by a factor of 3 as it is the less dense areas which have been excluded from the simpler model (this may not be the case in other applications e.g. modelling the centre of a large city).

Modelling effort is also contingent on the accuracy of spatial models required in each case. At the time of the study, the OpenStreetMap data often contained topology errors where links would touch or intersect at places other than endpoints, and misclassifications of one-way links. For the spatial network model of motorized flow, it was essential to manually check one-way links, as errors in their encoding could result in e.g. all motor traffic being assigned to one side of a dual carriageway only, causing the empty side to appear attractive for cycling when this is not reflected in real-world conditions. Assignment of road classes, by contrast, was mostly automatic, requiring manual intervention in only 2 cases. Topology errors in both models were fixed automatically by planarization and automatic splitting of lines at intersections. The exceptions are bridges and tunnels (‘brunels’) which were removed from the data before automatic splitting, but required manual checks at key locations to ensure correct recombination afterwards. This was needed for a larger number of cases in the motorized flow model.

The remainder of this section discusses models 1 and 2, used as stepping stones to achieve the better model 3, and a test of the effectiveness of road class as a predictor of motorized flow.

Model 1 is the initial attempt to use road class to predict cycling, and is used for calibration purposes only, achieving R2 = 0.505 in univariate fit against actual cyclist flow data, an improvement on the simulated motor flow based model of ref. 17 which achieved R2 = 0.49. Figure 1 uses a scatter plot to show the differences in prediction between model 1 and ref. 17. Some modelled cyclist flow has been displaced from road classes 5 to 4, reflecting model 1’s disincentivization of travelling on higher road classes, regardless of actual motorized flow. Contrary to this, other cyclist flows appear to be displaced from class 1 to 2. This is likely because replacing the predicted motorized flow of the class with its median value reduces the deterrent effect of both predicted and actual motorized flow outliers in class 2 (visible in Fig. 5). Such outliers manifest in popular parlance as ‘rat runs’: local and tertiary roads which are more popular for motorized traffic than their categorization would suggest. Unfortunately, traffic count data is not available to verify this hypothesis. However, the fact that we have achieved an increase in model performance despite ignoring potentially increased actual traffic flow on ‘rat runs’ suggests a number of possibilities. Firstly, it is possible that the effect is insubstantial compared to improvements in motorized flow predictions elsewhere. Secondly, it is possible that in the case of the current study area, cyclists tend to use such routes in spite of their motorized flow, perhaps because dedicated cycle lanes exist, or because the motorized flow is naturally of low speed, or managed by speed limits and traffic calming measures. Finally, it is possible that such routes entail poor cycling conditions, but no better alternatives exist. Determination of which of these is the case is beyond the scope of the current study.

Figure 2 explores the difference between models in greater detail, by examining how changes in the prediction of motorized traffic affect changes of predictions in cycle traffic. Zone B contains the ‘rat runs’ discussed above: class 1 and 2 roads which, when we replace predicted motorized flow with road class information, are effectively subject to a substantial reduction in modelled motor traffic, yet exhibit little to no change in predicted cyclist flow. Not only the ‘rat runs’, but in fact, the majority of links show only a weak correspondence between the reduction of simulated motor traffic and increase of simulated cyclist flow. This is illustrated by the trend line marked C, with the exceptions being shown in the zones marked A. The reason for this seeming lack of sensitivity to predicted motorized flow is that the choice set of sensible routes through a network is naturally limited to a small number for any given trip; thus, there is scope for considerable change in the modelled cost of the alternative routes, before the cyclist’s modelled choice of route changes at all. For the modeller, this is convenient, as the lack of sensitivity (within a reasonable range) of route choice to actual motorized flow helps with our aim of discarding it from the model in favour of road class information.

Model 2 optimizes the fit against measured cycle flows by manual modification of distance multipliers to correct systematic over/under-prediction of measured flows in each road class (see Section 4.4), improving the univariate fit slightly to R2 = 0.514. Table 2 shows distance multipliers for models 1 and 2; an improved fit was achieved by increasing the distance penalty for higher road classes, especially class 6, non-residential dual carriageways. Model 3 (discussed at the start of this section) applies these distance multipliers in a multivariate model to achieve optimal performance with weighting λ (explained in Section 4.3) equal to 0.5.

Lastly, we examine the question of whether road class works for cyclist predictions by virtue of proxying actual motorized flow, by comparing spatial network17 and road class models for prediction of motorized traffic in Table 3. For the points where vehicle counts were conducted, the road class itself outperforms the simplified spatial network analysis used in that paper as a predictor of actual motorized flow, even taking into account the increased number of parameters (e.g. the sample mean for each road class being used as a parameter in a “model” where all roads are assigned predicted motorized flow based solely on their class). Thus we must consider in discussion the extent to which road class data may simply be a proxy for actual motorized flow.

## Discussion

The performance gain shown here, although gratifying, is of an order of magnitude which could easily be outweighed by variance in results between different data sets covering different urban areas, when the model is applied elsewhere. A limitation of the study is its restriction to a single city-scale model, rather than a study of multiple regions. We therefore see our key contribution, not as an increase in modelling accuracy, but a decrease in modelling complexity through ditching the requirement for an explicit vehicle model. In the current case, the reduction in modelling effort is substantial; theoretically, the reduction could be very high indeed, e.g. if modelling a small area within a large and dense urban metropolis. This contributes to cycle infrastructure planning by making it easier to apply the spatial network model in new locations.

Should the reason for the success of road class in cycle models be due in large part to its proxying of actual motorized flow, a further limitation materializes, namely that the model should be used with extreme caution when predicting the effect of road reclassification. In these cases, verification that post-intervention road classes will continue to approximately reflect actual motorized flow is essential. However, this is likely an unusual modelling scenario (except in the case of reclassifying to prohibit motorized traffic, in which case zero motorized flow can be assumed and this limitation does not apply). The primary envisaged use of the model is in predicting cyclist flows and mode choice, possibly in the presence of new cycling links and motorized traffic prohibitions, based on an assumption that existing motorized flows remain approximately the same except in locations where prohibitions are introduced.

In reapplication of either model to new areas, recalibration of factors (road traffic deterrence or road class deterrence) against actual cyclist flow and/or area mode share is strongly recommended. This is especially the case in international use: although similar systems of road classification are widespread globally, there are substantial differences in local context. These include, for example, (1) the difference between European-style compact cities versus American-style car-oriented cities with large suburbs; (2) the difference between planned grids of regular blocks versus organically grown spatial layouts; (3) cultural differences in how cycling is perceived as a mode of transport, awareness and willingness of drivers to afford road space to cyclists. While there is reason to believe that road class remains a useful predictor of cyclist behaviour in these contexts, it is also possible that the distance multipliers applicable in different countries will differ substantially. The road class model will require verification and possibly adaptation to ensure that the classes used make sense locally: suitability of any road class system will ultimately remain unknown until a model is attempted, but local knowledge on cyclist behaviour will likely be a good predictor of the suitability of the model. Although ref. 17’s model based on motorized flow offers in principle a universal standard for international comparison, the cultural differences noted above still mean that the same level of flow can have different effects on behaviour depending on local context, so neither model can be used without appropriate consideration.

Optionally, motorized traffic data can be used as a starting point for road class deterrence factors as in the current study, but in the presence of cyclist data, this may not be necessary (the same can be said for calibration of the more complex motorized spatial network model for which we propose replacement).

The future likely holds numerous potential improvements for models of cycling flow, from better calibration techniques to inclusion of additional factors such as the “safety in numbers” phenomenon22, and combination of socio-economic with spatial network models23 in particular to reflect well-known class and gender imbalances in cycling13.

## Methodology

### Study area

Cardiff, Wales is selected as the study area for this paper. Cardiff’s existing traffic-free cycle network is quite fragmented, with only the Taff Trail, a flagship cycle route which connects north and south, acting as a backbone. According to the 2011 Census of England and Wales24, 3.6% of working residents in Cardiff cycle to work, which is the highest level in Wales and above the average for England and Wales. Yet, there is a huge gap between Cardiff and the 10 UK cities exhibiting the highest levels of cycling to work. The Cardiff Cycling Strategy 2016-2026 (ref. 25) observes that 52% of car trips in Cardiff are under 5 km and 28% of residents do not cycle now but aspire to in future, revealing large potential for increasing the cycling level. However, annual capital expenditure on cycling infrastructure by Cardiff Council and external funding combined is only £4 per resident, a low investment compared to internationally renowned cycling cities Amsterdam and Copenhagen, which invest around £18 per resident. A larger investment in expanding the cycle network is expected to assist in realizing this potential.

### Data

This paper is based on a spatial network provided by OpenStreetMap (OSM), a public and crowdsourced mapping system26. In terms of cycle network coverage, continuity, attributes and recency, ref. 27 found OSM to be a better mapping system than Ordnance Survey (OS). Slope data for the spatial network is taken from Ordnance Survey Terrain 50; this misses small-scale changes in height such as those encountered on bridges/underpasses, but captures most terrain effects and has the advantage of being free to use under an OpenData licence.

To calibrate the models, two sources of actual cycle flow data were used. The Department for Transport estimate, by combination of manual and automatic survey and interpolation28, the annual average daily traffic (AADT) of both motor vehicles and pedal cycles at 107 on-road locations in Cardiff. This is supplemented by cycle flow data from 14 traffic-free locations collected by electronic sensors belonging to Cardiff Council. As both sources used different methodologies to collect cycle flow data, they are not directly comparable, in particular due to the Department for Transport not taking localized weather conditions into account when surveying cycling behaviour. However, both sources are important to the calibration process and thus must be combined. We follow ref. 17 in using a dummy variable to account for data source in the final predicted flow model.

The motor vehicle flow predictions in Cardiff are obtained from the motor vehicle flow sub-model in ref. 17, which has a good correlation (R2 = 0.84) with measured vehicle flows.

Mode share data is taken from a total of 1077 census Output Areas (Office for National Statistics, 2011).

### Network analysis

This paper applies the publicly available Spatial Design Network Analysis+ (sDNA+) toolkit in ArcGIS (ref. 29). To calibrate the effect of road class in our models 1 and 2, we make use of the simpler models presented in ref. 17, and to obtain our final results we add in model 3 the extensions of multiple trip purpose, distance decay, heterogeneous cyclist ability and agglomeration detailed in ref. 18. The remainder of this section summarizes the models in these two papers.

Both of these models make use of spatial network betweenness30 for predicting flows. Intuitively this can be conceived as simulating the shortest trips from everywhere to everywhere, subject to a definition of distance which reflects cyclist preferences, and a maximum distance for the trip. Although apparently indiscriminate in handling of origins and destinations, the correlation of network density with jobs and homes31 has the effect that denser areas are modelled as generating more trips. The betweenness approach thus has a history of providing a reasonable fit to vehicle32,33 and pedestrian34 data. The formula used for betweenness is

$$\text{Betweenness}(x,rmin,rmax,d_{routing},d_{radius})=\sum_{y\in N}\,\sum_{z\in R(y,rmin,rmax,d_{radius})}OD(y,z,x,d_{routing})\,W(z)$$
(1)
$$OD(y,z,x,d_{routing})=\begin{cases}1 & \text{if } x \text{ is on the shortest path from } y \text{ to } z \text{ as defined by metric } d_{routing}\\ 1/2 & \text{if } x=y\ne z \text{ or } x=z\ne y\\ 1/3 & \text{if } x=y=z\\ 0 & \text{otherwise}\end{cases}$$
(2)

Reference 17 and our models 1 and 2 use network-Euclidean distance for $d_{radius}$, set rmin = 0, rmax = 3 km and W(z) = 1, and for $d_{routing}$ use the definition of cyclist distance outlined in Section 4.4, Eq. (9) below (a Euclidean network distance adjusted for slope or motorized traffic). Variables are normalized using a Box-Cox transform prior to regression.
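
As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes radius-limited betweenness on a toy graph. It is not the sDNA+ implementation. It makes several simplifying assumptions that are not stated in the text: graph nodes stand in for network links, a single metric serves as both the routing and the radius metric, the radius bounds are treated as inclusive, and the endpoint cases of Eq. (2) take precedence over the "on the shortest path" case. The function names and the toy network are hypothetical.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra: distance and predecessor of every node reachable from source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def betweenness(graph, rmin, rmax, weight=None):
    """Sum OD(y, z, x) * W(z) over origins y and destinations z within the radius."""
    weight = weight or {n: 1.0 for n in graph}  # W(z) = 1 by default
    result = {n: 0.0 for n in graph}
    for y in graph:
        dist, prev = shortest_paths(graph, y)
        for z, d in dist.items():
            if not (rmin <= d <= rmax):
                continue  # z lies outside R(y, rmin, rmax)
            w = weight[z]
            if y == z:
                result[y] += w / 3  # case x = y = z
                continue
            result[y] += w / 2  # case x = y != z
            result[z] += w / 2  # case x = z != y
            x = prev[z]
            while x != y:  # interior nodes of the shortest path
                result[x] += w
                x = prev[x]
    return result


# Toy network: a path A - B - C with unit (already adjusted) distances.
toy = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
values = betweenness(toy, rmin=0.0, rmax=2.0)
```

In this toy case the middle node B collects the largest value, because it lies on every trip between A and C in addition to its own endpoint contributions.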

Reference 18 and our model 3 augment the “everywhere to everywhere” assumption with a variety of different trip purposes: trips to each network link, extra trips to each link within the city centre (as defined by a threshold of urban density – this can also be interpreted as incorporating agglomeration effects), and trips to recreational cycling facilities. Each of these is duplicated for cyclist classes of varying confidence, i.e. varying aversion to motor traffic, and disaggregated within various distance bands (3, 5, 8, 11, 15 and 20 km round trips) to account for distance decay; in contrast to ref. 17 these distances are interpreted as adjusted for slope and motorized traffic because we use cyclist distance (Section 4.4, Eq. (9)) for $d_{radius}$ as well as $d_{routing}$. The multiple trip/cyclist combinations can also be interpreted as a simulation of non-interacting agents. In modelling terms, this means that multiple betweenness values are computed for each link, based on different values of $d_{routing}$, $d_{radius}$, rmin, rmax and W(z), where

$$W(z)=\begin{cases}1 & \text{if } z \text{ is a destination of interest}\\ 0 & \text{otherwise}\end{cases}$$
(3)

The sDNA+ software automatically sets rmin and rmax given the desired distance bands above. Traffic aversion and hence $d_{routing}$ and $d_{radius}$ are modified by changing the value of parameter t in Eq. (9). A betweenness value for each distance band is computed for each possible combination of t = {0.4,0.6,0.8} with W(z) representing {everywhere, city centre, recreational facilities} respectively. The multiple betweenness values are used as independent variables in a linear regression to predict cyclist flows using the sDNA Learn tool:

$$\text{flow}=\beta_{0}+\beta_{source}\,\text{source}+\beta_{1}\,\text{betweenness}_{1}+\beta_{2}\,\text{betweenness}_{2}+\ldots$$
(4)

where the βs are regression coefficients, and source is a dummy variable set to 0 if the actual flow was recorded by the Department for Transport and 1 if recorded by Cardiff Council.

Cross-validated ridge regression is used to handle inherent collinearity and prevent overfit36,37; models can thus be compared using a cross-validated coefficient of determination (R2). The Box-Cox transform is inappropriate in a multiple regression context and is therefore replaced with a weighting scheme

$$RW(y)={y}^{\lambda }/y$$
(5)

where RW(y) is the regression weight for a data point with dependent variable value y, and λ is a calibration parameter (similar to that in the Box-Cox transform, and unrelated to the regularization parameter λ in ridge regression) such that regressing with λ = 1 minimizes absolute errors while λ = 0 minimizes relative errors. The actual value of λ is chosen so as to minimize the GEH (Geoffrey E. Havers) error statistic popular in transport planning38, which captures a mixture of absolute and relative error in residuals:

$${\rm{GEH}}=\sqrt{2{({\rm{x}}-{\rm{y}})}^{2}/({\rm{x}}+{\rm{y}})}$$
(6)
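
Eqs. (5) and (6) can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that x and y in Eq. (6) denote a modelled and a measured flow for the same location (the usual reading of GEH, though the symbols are not defined here), that a pair of zero flows scores zero, and that a model is summarized by the mean GEH over all count sites. The function names are hypothetical.

```python
import math


def regression_weight(y, lam):
    """Eq. (5): RW(y) = y**lam / y, for a strictly positive flow y."""
    return y ** lam / y


def geh(modelled, measured):
    """Eq. (6): GEH statistic for one modelled/measured flow pair."""
    total = modelled + measured
    if total == 0:
        return 0.0  # assumption: two zero flows agree perfectly
    return math.sqrt(2 * (modelled - measured) ** 2 / total)


def mean_geh(modelled_flows, measured_flows):
    """Average GEH over paired flows, used here as a single summary score."""
    pairs = list(zip(modelled_flows, measured_flows))
    return sum(geh(m, c) for m, c in pairs) / len(pairs)
```

With λ = 1 every observation receives weight 1, whereas with λ = 0 an observation of flow y receives weight 1/y, so low-flow sites count relatively more. Candidate values of λ could then be compared by refitting the regression for each and keeping the one with the lowest mean GEH.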

To predict mode share, ref. 18 and our model 3 calibrate a multivariate model based on network reach within all the distance bands, trip purposes and for all the cyclist types outlined above, where

$$Reach\,(x,rmin,rmax,{d}_{radius})=\sum _{y\in R(x,rmin,rmax,{d}_{radius})}W(y)$$
(7)
$$\text{journey to work mode share}=\beta_{0}+\beta_{1}\,\text{Reach}_{1}+\beta_{2}\,\text{Reach}_{2}+\ldots$$
(8)

where the βs are regression coefficients. As mode share data is only available on a zonal basis, the reach variables are averaged over all links within each zone to provide the independent variables for regression.
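
The reach measure of Eq. (7) and the zonal averaging described above can be sketched as follows. This is a minimal illustration rather than the sDNA+ implementation: graph nodes stand in for links, the radius bounds are assumed to be inclusive (so a link counts itself when rmin = 0), and the zone assignment and function names are hypothetical.

```python
import heapq
from collections import defaultdict


def network_distances(graph, source):
    """Dijkstra distances from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def reach(graph, x, rmin, rmax, weight):
    """Eq. (7): total weight W(y) of nodes y within the radius band around x."""
    dist = network_distances(graph, x)
    return sum(weight[y] for y, d in dist.items() if rmin <= d <= rmax)


def zonal_mean_reach(graph, zone_of, rmin, rmax, weight):
    """Average reach over all nodes assigned to each zone."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x in graph:
        totals[zone_of[x]] += reach(graph, x, rmin, rmax, weight)
        counts[zone_of[x]] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Toy network: a path A - B - C - D with unit distances, split into two zones.
toy = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
unit_weight = {n: 1.0 for n in toy}
zones = {"A": "north", "B": "north", "C": "south", "D": "south"}
zonal = zonal_mean_reach(toy, zones, 0.0, 1.0, unit_weight)
```

Each zonal average would then supply one independent variable in Eq. (8), with one such variable per combination of distance band, trip purpose and cyclist type.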

### Definition of distance

The cycling models of betweenness and network density are both based on a cycling distance metric which accounts for the effect of slope, levels of motorized traffic and straightness on the distance perceived by the cyclist. Ref. 17 begins with the findings of ref. 11, simplifying and recalibrating to arrive at the definition outlined in Eqs. (9)-(11):

$$\begin{array}{c}\text{cyclist distance}=\text{Euclidean network distance}\times \text{slopefac}^{s}\times \text{trafficfac}^{t}\\ +\,\text{cumulative angular change}\times \frac{67.2}{90}\times a\end{array}$$
(9)

where

$$\text{slopefac}=\begin{cases}1.000 & \text{if slope} < 2\% \\ 1.371 & \text{if } 2\% < \text{slope} < 4\% \\ 2.203 & \text{if } 4\% < \text{slope} < 6\% \\ 4.239 & \text{if slope} > 6\% \end{cases}$$
(10)
$$\text{trafficfac}=0.84\,e^{\frac{AADT}{1000}}$$
(11)

and AADT is the predicted annual average daily flow of motorized vehicles on the link. The cycling distance is measured as a round trip and it is assumed that a cyclist adopts the same route for both outward and return journey. Calibration in that paper is achieved by varying the parameters a, s and t, with the best fit on the Cardiff data set given by a = 0.2, s = 2, t = 0.04.
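
Eqs. (9)-(11) translate directly into code. The sketch below is illustrative and rests on assumptions not fixed by the text: slope is given as a non-negative percentage, cumulative angular change is in degrees (suggested by the 67.2/90 factor), and slopes falling exactly on a band boundary (2%, 4%, 6%), which Eq. (10) leaves undefined, are assigned to the steeper band. The function names are hypothetical; the default parameters are the calibrated values a = 0.2, s = 2, t = 0.04 quoted above.

```python
import math


def slopefac(slope_percent):
    """Eq. (10): slope penalty by gradient band (boundaries go to the steeper band)."""
    if slope_percent < 2:
        return 1.000
    if slope_percent < 4:
        return 1.371
    if slope_percent < 6:
        return 2.203
    return 4.239


def trafficfac(aadt):
    """Eq. (11): traffic penalty from predicted motorized AADT on the link."""
    return 0.84 * math.exp(aadt / 1000)


def cyclist_distance(length, slope_percent, aadt, angular_change_deg,
                     a=0.2, s=2, t=0.04):
    """Eq. (9): perceived distance for one link, in the units of `length`."""
    straight_part = length * slopefac(slope_percent) ** s * trafficfac(aadt) ** t
    turning_part = angular_change_deg * (67.2 / 90) * a
    return straight_part + turning_part
```

For example, under these assumptions a flat 100 m link with no motor traffic and no turning has a cyclist distance of 100 × 0.84^0.04, slightly under 100 m, and a 90 degree turn adds 67.2 × 0.2 = 13.44 m.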

Motor traffic enters the definition of distance in Eq. (11). For the present study, we replace this with a trafficfac defined for each road class. In model 1 this is defined as per Eq. (11) albeit replacing individual simulated AADT for each link, with a length-weighted median simulated AADT for the road class within the smaller cyclist network model (i.e. excluding the larger network model used to predict motorized flow in ref. 17). We use these values as starting points for further optimization of the model parameters, with the endpoint of optimization being model 2. Optimization was conducted by manual adjustment of parameters to correct systematic over/underprediction of cyclist flows per road class: e.g. non-residential dual carriageways had lower actual cyclist flow than predicted, so their trafficfac was increased, etc. Finally, we take the trafficfac parameters derived in our model 2 and apply them to replace trafficfac in the methodology of ref. 18 (described in more detail in section 4.3 above), giving our best predictions of cyclist flow and mode share in model 3.
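
The model 1 starting values described above can be sketched as follows: each road class receives a single trafficfac computed from the length-weighted median of simulated AADT over its links. This is an illustration under stated assumptions: the weighted median is taken as the smallest AADT value at which cumulative link length reaches half of the class total (other conventions exist), and the input format, function names and toy figures are hypothetical rather than taken from the study data.

```python
import math
from collections import defaultdict


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("weighted_median needs at least one value")


def trafficfac_by_class(links):
    """Map road class -> trafficfac, using Eq. (11) on the class median AADT.

    `links` is an iterable of (road_class, length, simulated_aadt) tuples.
    """
    aadt_by_class = defaultdict(list)
    length_by_class = defaultdict(list)
    for road_class, length, aadt in links:
        aadt_by_class[road_class].append(aadt)
        length_by_class[road_class].append(length)
    return {
        road_class: 0.84 * math.exp(
            weighted_median(aadt_by_class[road_class], length_by_class[road_class]) / 1000
        )
        for road_class in aadt_by_class
    }


# Hypothetical links: (road class, length in metres, simulated AADT).
toy_links = [(1, 100.0, 200), (1, 300.0, 500), (2, 50.0, 4000)]
factors = trafficfac_by_class(toy_links)
```

The resulting per-class factors would then be adjusted by hand, as in the step from model 1 to model 2, wherever a road class shows systematic over- or under-prediction of cyclist flow.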