Learning naturalistic driving environment with statistical realism

For simulation to be an effective tool for the development and testing of autonomous vehicles, the simulator must be able to produce realistic safety-critical scenarios with distribution-level accuracy. However, due to the high dimensionality of real-world driving environments and the rarity of long-tail safety-critical events, how to achieve statistical realism in simulation is a long-standing problem. In this paper, we develop NeuralNDE, a deep learning-based framework to learn multi-agent interaction behavior from vehicle trajectory data, and propose a conflict critic model and a safety mapping network to refine the generation process of safety-critical events, following real-world occurring frequencies and patterns. The results show that NeuralNDE can achieve both accurate safety-critical driving statistics (e.g., crash rate/type/severity and near-miss statistics, etc.) and normal driving statistics (e.g., vehicle speed/distance/yielding behavior distributions, etc.), as demonstrated in the simulation of urban driving environments. To the best of our knowledge, this is the first time that a simulation model can reproduce the real-world driving environment with statistical realism, particularly for safety-critical situations.


Supplementary information for experiments
The Ann Arbor roundabout dataset (abbreviated as AA dataset) was collected at the State St.-W Ellsworth Rd. intersection in Ann Arbor, Michigan. A roadside camera-based perception system [1,2] is deployed for real-time traffic object detection, localization, and tracking to collect vehicle trajectory data. For training purposes, we used data collected on May 2nd, 2021, from 10:00 to 17:00, including around 17,000 road users. For each vehicle, the data includes its position and heading information at 2.5 Hz. We excluded frames that involve pedestrians, cyclists, and trailers, since only a few frames include these agents and the data size is too limited for training. It should be noted that the proposed method can handle diverse road users (e.g., pedestrians) and model their interactions if the data is sufficient. For validation purposes, we used crash data from large-scale trajectories and police crash reports [3] to obtain ground-truth safety-critical event statistics (e.g., crash rate and crash type distribution).

The rounD dataset [4] is collected at several different locations with high-accuracy tracking of around 13,000 road users at high frequency (25 Hz). We chose the roundabout with the most data, which is located at Neuweiler, Aachen. Similar to the AA dataset, we also exclude frames that involve pedestrians, cyclists, and trailers for training.

b. Experiment settings
In this study, we assume all vehicles have an identical size of 3.6 meters in length and 1.8 meters in width. Note that the proposed method can be easily extended to handle different vehicle sizes by incorporating length and width in the input data. We run approximately 15,000 hours of simulation to record data and generate simulated statistics to validate the statistical realism of NeuralNDE, where all data are used for calculating crash-related metrics and 100 hours of data are used for other metrics.

c. Evaluation metrics
The Ann Arbor roundabout ground-truth crash rate is obtained based on data from August to mid-November 2021, covering around 75 days from 7:00 to 19:00. There were 14 crashes in this roundabout with a total vehicle travel distance of $1.16 \times 10^{5}$ kilometers. Therefore, the empirical ground-truth crash rate is $1.21 \times 10^{-4}$ crash/km. The ground-truth crash type and crash severity distributions are queried from the Michigan Traffic Crash Facts [3] dataset, whose data comes directly from police crash reports. We use data from 2016-2020, during which there were a total of 520 crashes at this roundabout, with the crash type distribution shown in Supplementary Fig. 1b. When we calibrate NeuralNDE, we consider the crash types that account for more than 5% of crashes, which are angle, rear-end, and sideswipe crashes. For the crash severity, we use the worst injury of all involved occupants in the crash as the ground truth. Of the 520 crashes, 498 were non-injury crashes, 22 involved minor injuries, and none were serious or fatal. We have no access to the ground-truth crash rate, crash type distribution, and crash severity data of the rounD roundabout.

To determine the crash type of a simulated collision, we follow the definition of the National Highway Traffic Safety Administration [5] and consider the state of the colliding vehicles at the crash moment. Specifically, we consider the relative position and relative heading of the two colliding vehicles. There are four potential relative positions of two colliding vehicles, i.e., front, left, right, and rear, as shown in Supplementary Fig. 1c. The relative heading of two vehicles is between 0 and 180 degrees, where 0 degrees means the two vehicles are heading in the same direction and 180 degrees means they are heading in opposite directions. We define a crash as rear-end if the relative position is rear or front and the relative heading is smaller than 40 degrees. A sideswipe crash is when the relative position is left or right and the relative heading is smaller than 30 degrees or greater than 150 degrees. A head-on crash is when the relative position is front and the relative heading is greater than 90 degrees. Other crashes are considered angle crashes.

To calculate the change in velocity (Delta-V) based on the conservation of momentum, the collision is assumed to be perfectly inelastic and the vehicles are assumed to have the same mass. Therefore, the change in velocity can be obtained from the difference between the impact speed vector and the separation speed vector. For example, consider a rear-end crash where the front vehicle is initially stationary and the rear vehicle is traveling at 30 mph. Then, the separation speed of the two vehicles will be 15 mph, and the Delta-V of both vehicles is 15 mph. Many existing studies have investigated the relationship between Delta-V and occupant injury level, and we follow the thresholds they report [6] to measure the crash severity. Specifically, in side impact crashes (e.g., angle crash), there is no injury if Delta-V is smaller than 8 mph, minor injury if Delta-V is between 8 and 14 mph, serious injury if Delta-V is between 14 and 24 mph, and fatal injury if Delta-V is greater than 24 mph. For frontal impact crashes (e.g., rear-end crash), the corresponding ranges are 0 to 11 mph for no injury, 11 to 23 mph for minor injury, 23 to 34 mph for serious injury, and above 34 mph for fatal injury.
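The crash type rules, the Delta-V calculation, and the severity thresholds described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, values that fall exactly on a threshold are assigned to the more severe level (an assumption, since the text does not specify boundary handling), and the caller chooses whether a crash counts as a side or frontal impact because the text gives only one example of each.

```python
import bisect
import math

POSITIONS = {"front", "rear", "left", "right"}
LEVELS = ("no injury", "minor injury", "serious injury", "fatal injury")
# Delta-V thresholds in mph, taken from the text above.
THRESHOLDS_MPH = {"side": (8.0, 14.0, 24.0), "frontal": (11.0, 23.0, 34.0)}


def classify_crash(relative_position, relative_heading_deg):
    """Return the crash type from the state of two vehicles at the crash moment.

    relative_heading_deg lies in [0, 180]: 0 means same direction,
    180 means opposite directions.
    """
    if relative_position not in POSITIONS:
        raise ValueError(f"unknown relative position: {relative_position}")
    if not 0.0 <= relative_heading_deg <= 180.0:
        raise ValueError("relative heading must be within [0, 180] degrees")
    h = relative_heading_deg
    if relative_position in ("front", "rear") and h < 40.0:
        return "rear-end"
    if relative_position in ("left", "right") and (h < 30.0 or h > 150.0):
        return "sideswipe"
    if relative_position == "front" and h > 90.0:
        return "head-on"
    return "angle"


def delta_v(v1, v2):
    """Delta-V magnitudes for a perfectly inelastic, equal-mass collision.

    v1 and v2 are (vx, vy) impact velocity vectors. With equal masses,
    conservation of momentum gives a common separation velocity equal to
    the mean of the two impact velocities.
    """
    sep = ((v1[0] + v2[0]) / 2.0, (v1[1] + v2[1]) / 2.0)
    dv1 = math.hypot(sep[0] - v1[0], sep[1] - v1[1])
    dv2 = math.hypot(sep[0] - v2[0], sep[1] - v2[1])
    return dv1, dv2


def severity(delta_v_mph, impact):
    """Map Delta-V (mph) to an injury level; impact is "side" or "frontal".

    Assumption: a value exactly on a threshold goes to the higher level.
    """
    thresholds = THRESHOLDS_MPH[impact]
    return LEVELS[bisect.bisect_right(thresholds, delta_v_mph)]


# Worked example from the text: rear vehicle at 30 mph hits a stationary one.
dv_rear, dv_front = delta_v((30.0, 0.0), (0.0, 0.0))
print(classify_crash("rear", 10.0), dv_rear, dv_front, severity(dv_rear, "frontal"))
```

For the rear-end example in the text, both vehicles receive a Delta-V of 15 mph, which falls in the 11 to 23 mph range for frontal impacts.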

d. SUMO simulation settings
We compare the proposed method with SUMO [7], a widely used simulation platform for microscopic traffic behaviors. The car-following acceleration is

$$\dot{v}_t = a_{\max}\left[1 - \left(\frac{v_t}{v_0}\right)^{\delta} - \left(\frac{s^{\star}(v_t, \Delta v_t)}{s_t}\right)^{2}\right]$$

where $a_{\max}$ is the maximum acceleration, $v_0$ is the desired speed, $v_t$ and $s_t$ are the velocity and the bumper-to-bumper range at the current time step $t$, $\delta$ is an exponent parameter, and $s^{\star}$ is the desired bumper-to-bumper range, which can be calculated as

$$s^{\star}(v_t, \Delta v_t) = s_0 + v_t T + \frac{v_t \, \Delta v_t}{2\sqrt{a_{\max} b}}$$

where $s_0$ is the minimum range at standstill, $T$ is the desired time headway, $\Delta v_t$ is the speed difference, and $b$ is the comfortable deceleration. The model parameters are set based on the data and common practice.

We train the safety mapping network using the RMSprop [10] optimizer. We set the batch size to 64 and the learning rate to 0.0001. The learning rate is multiplied by 0.3 every 600 epochs. The training took around 20 days on a desktop with an Intel i7-10700F CPU and an NVIDIA 3070 GPU, for a total of 3,000 training epochs. To cover all potential safety-critical patterns, we randomly sampled the vehicle states as input, and their ground truth is generated with a rule-based model. When two vehicles are going to collide, we push them apart by setting a repulsive force between them. The force is projected onto the heading direction of each vehicle and rectifies their states until they are no longer colliding with each other, as illustrated in Supplementary Fig. 2. Each vehicle is considered to be 3.8 meters in length and 2.0 meters in width when training the safety mapper, which includes a 0.2-meter buffer compared to the real size. Note that we do not modify the heading of each vehicle and only rectify the position, which is similar to guiding the vehicle to decelerate or accelerate in safety-critical situations to avoid a crash. Instead of directly predicting the rectified states, we train the mapper to generate the residual between the ground truth and the input. The rectification is performed frame by frame. The mean absolute error between the predicted position residual and the ground-truth residual is used as the loss function. Since safety-critical situations rarely happen, the residual may follow a sparse pattern where most of the values are close to zero. Therefore, when generating the training data, we balance the ratio between the activated and non-activated output by using heuristic sampling, where in each frame the first 80% of vehicle states are uniformly sampled and the remaining 20% are sampled from the neighborhood of existing vehicles. We generate 240,000 random frames for each training epoch. During the training phase, the number of tokens (vehicles) is fixed at 32 for batch-wise training efficiency.

When training the behavior modeling network, we freeze the safety mapping network. Both the behavior modeling network and the discriminator are updated jointly using the RMSprop [10] optimizer. The batch size is set to 32 and the learning rate is set to 0.0001, multiplied by 0.3 every 300 epochs. We set the training token size to 32. When there are fewer than 32 vehicles in the road network, fake vehicle states will be used to pad [...] absolute error between the predicted states (predicted x and y coordinates and heading) and ground-truth states at the next 5 steps. The adversarial loss is calculated using the BCEWithLogitsLoss following the general setting of generative adversarial training.
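The two learning-rate schedules stated above (start at 0.0001 and multiply by 0.3 every 600 epochs for the safety mapping network, every 300 epochs for the behavior modeling network) can be written out as a short sketch. The function name is hypothetical, and the assumption that the first decay is applied at the start of epoch 600 (counting from zero) is ours; this is the same rule that a step scheduler such as `torch.optim.lr_scheduler.StepLR` follows, though the text does not say which scheduler was used.

```python
def step_decay_lr(epoch, base_lr=1e-4, factor=0.3, step=600):
    """Learning rate at a zero-indexed epoch under step decay.

    The rate is multiplied by `factor` once every `step` epochs.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * factor ** (epoch // step)


# Safety mapping network: decay every 600 epochs over 3,000 epochs.
for epoch in (0, 600, 1200, 1800, 2400, 2999):
    print(f"safety mapper   epoch {epoch:4d}: lr = {step_decay_lr(epoch):.3e}")

# Behavior modeling network: decay every 300 epochs.
for epoch in (0, 300, 600, 900):
    print(f"behavior model  epoch {epoch:4d}: lr = {step_decay_lr(epoch, step=300):.3e}")
```

Under these assumptions the safety mapping network finishes its 3,000 epochs after four decays, at a learning rate of 0.0001 × 0.3⁴.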

f. Model scalability experiment settings
We use SUMO [7] to generate vehicle trajectory data and use it as the ground truth of NDE for training and validation. We collect around five hours of data to train NeuralNDE models for the intersection and roundabout scenarios. The training settings are the same as in the previous experiments using the AA and rounD datasets. At inference time, the intersection and roundabout areas are controlled by NeuralNDE, and the transition area in between is controlled by model-based methods. Since there is no crash in NDE (i.e., SUMO), the acceptance probability of the conflict critic module is set to zero. We also apply the safety mapping rule that was used to train the safety mapping network to the whole simulation environment, including the transition area, to further guarantee the safety of the simulation. We simulate the whole network for around 100 hours and collect the data to validate the performance. The simulation resolutions and metric definitions are the same as in previous experiments. The PET is collected for vehicles within the roundabout circle and intersection. The instantaneous speed for the intersection scenario is collected for all vehicles in the area.

To investigate the effects of the behavior network backbone, a two-layer Multilayer Perceptron (MLP) (hidden dimension equal to 256) with batch normalization layers and the ReLU activation function is compared. The main difference between the MLP and the Transformer is that the Transformer utilizes the self-attention mechanism in its architectural design. By formulating the road agents as individual tokens, the self-attention mechanism in the Transformer is naturally capable of characterizing inter-token interaction between agents. The result shows that the performance of the Transformer backbone is significantly better in all metrics compared to the MLP backbone.
We also compared the Long Short-Term Memory (LSTM) network (two layers, hidden dimension equal to 256) as the backbone architecture. To model the interactions between agents, the LSTM module is embedded in a Seq-to-Seq framework [11] as a recurrent unit. This design allows the network to handle the interactions among all input agents instead of only historical ones. The results show that the Transformer can achieve better performance in modeling vehicle interactions.

We also examine the importance of the safety mapping network. The conflict critic module is not applicable without the safety mapping network. The results show that, without it, the crash rate is unrealistic and several orders of magnitude higher than the ground truth. This result validates the performance of the proposed safety mapping network, which significantly reduces the modeling error in safety-critical situations. The model exhibits good performance in other metrics since it does not consider safety performance and only optimizes to imitate normal driving behaviors.

Finally, we demonstrated the significance of the conflict critic module and the adversarial training. Without the conflict critic module, we cannot control the generation process of safety-critical events and therefore cannot obtain an accurate crash rate and crash type distribution. Without adversarial training, we found that the accuracy of the crash rate and crash type distribution degrades. Table 1 demonstrates that the proposed model exhibits the best overall performance considering all evaluation metrics.

[Table 1. Distribution distances are reported as the Hellinger distance and the KL-divergence.]
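The two distribution distances named in Table 1 can be computed for binned statistics (for example, a histogram of simulated speeds against a histogram of observed speeds) as in the sketch below. It is a generic illustration rather than the evaluation code behind the table: the function names, the natural-log base, and the small floor `eps` applied to empty bins before computing the KL-divergence are all assumptions, and the binning used in the paper is not stated in this section.

```python
import math


def normalize(counts):
    """Turn non-negative bin counts into a probability distribution."""
    total = float(sum(counts))
    if total <= 0.0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2.0)


def kl_divergence(p, q, eps=1e-12):
    """KL-divergence D(p || q) in nats.

    Assumption: every bin is floored at eps and both distributions are
    renormalized, so that empty bins do not produce infinities.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    p = normalize([max(a, eps) for a in p])
    q = normalize([max(b, eps) for b in q])
    return sum(a * math.log(a / b) for a, b in zip(p, q))


# Toy histograms standing in for simulated and ground-truth statistics.
simulated = normalize([12, 30, 40, 18])
observed = normalize([10, 32, 38, 20])
print(hellinger(simulated, observed), kl_divergence(simulated, observed))
```

Both measures are zero for identical distributions. The Hellinger distance is symmetric and bounded by one, while the KL-divergence is asymmetric, so the order of the two arguments matters.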

b. NeuralNDE results of rounD dataset
The NeuralNDE results using the rounD dataset are shown in Supplementary Fig. 3. The results show that NeuralNDE can achieve statistical realism and significantly outperform existing methods. We only show normal driving behavior statistics since the safety-critical driving behavior ground truth, e.g., crash and near-miss data, is unavailable.