Crash and disengagement data of autonomous vehicles on public roads in California

Autonomous Vehicles (AVs) are being widely tested on public roads in several countries such as the USA, Canada, France, Germany, and Australia. For the transparent deployment of AVs in California, the California Department of Motor Vehicles (CA DMV) commissioned AV manufacturers to draft and publish reports on disengagements and crashes. These reports must be processed before any statistical analysis, which is cumbersome and time-consuming. Our dataset presents the processed disengagement data from 2014 to 2019, crash data till the 10th of March 2020 and supplementary road network and land-use data extracted from OpenStreetMap. Primary data are manually assessed and converted into an easily processed format. Our processed data will be advantageous to the research community and enable accelerated research in this domain. For example, the data can be utilised to discern trends in disengagement, observe the distribution of disengagement causes, and investigate the contributory factors of the crashes. Such investigations can subsequently improve the reporting protocols and make policies and laws for the smooth deployment of this disruptive technology. Measurement(s) autonomous vehicles • disengagement • crashes • reaction times Technology Type(s) manual entry • OpenStreetMap • Google Maps Factor Type(s) temporal interval Sample Characteristic - Location State of California Measurement(s) autonomous vehicles • disengagement • crashes • reaction times Technology Type(s) manual entry • OpenStreetMap • Google Maps Factor Type(s) temporal interval Sample Characteristic - Location State of California Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.15049227

of testing), and cause (environmental, another road user, hardware or software discrepancy, and path planning discrepancy). The study also found that as the maturity of the testing increases by a month, the probability of automated disengagement increases by 0.014. Another study developed a Bayesian latent class model that identified six classes of collision patterns 13 . The authors pointed out the necessity of more advanced and robust collision narrative reporting for better analysis of the critical events. More recently, machine learning techniques were used to evaluate the crash severity of AVs 14 .
As evident, more complex models are being developed lately based on the CA DMV reports. However, these reports must be processed before conducting any statistical analysis, which is cumbersome and time-consuming. Furthermore, new crashes and disengagements are happening with more test vehicles on the road, and new data are appended regularly to the CA DMV dataset. The exclusion of the latest data used in the previous studies can result in premature findings 7 and require constant scrutiny. Additionally, the advancement of this technology may change the currently observed trends 15 . Since the crash data are considered over a prolonged period, the time-varying explanatory variables may vary significantly. Neglecting latent within period variation may result in the loss of crucial explanatory variables. This loss of information by using discrete-time intervals can institute error in model estimation because of unobserved heterogeneity 16,17 , and hence studies using this data are required to be updated in the light of new information available.
In this context, we present the processed disengagement data from 2014 to 2019, crash data till the 10th of March 2020 and supplementary road network and land-use data of the AV testing locations in California. For the preparation of the current dataset, primary data in the form of crash and disengagement reports were requested from CA DMV. The primary data are extracted and organised into the processable format for potential reuse. Furthermore, OpenStreetMap and Google maps were used to extract supplementary data, which could be utilised to expand the research around AVs' safety like in a previous study 11 , preparing guidelines and governance for AV deployment on public roads.
Specifically, the research community, government organisations, and public forums active in AV-related safety studies can benefit from this processed dataset. A comprehensive reporting protocol is necessary for the well-ordered deployment of AVs on public roads in any area. Henceforth, research can be conducted using our dataset towards identifying the shortcomings and fallibility of the reporting protocols used by CA DMV and assist government organisations around the world to devise better protocol systems for transparent deployment of AVs on public roads. For example, a limitation of the current CA DMV reports was pointed out by a previous study 4 , which stated to include the categories of disengagements in the reporting protocol by analysing the categorisation of disengagements into proposed macro and micro categories. Macro-level and spatial models of AV crashes can also be estimated using the supplementary data provided by our spatial database. They can be utilised to assess the hotspot crash/disengagement locations. Furthermore, the rate of disengagements can be affected by expanding the testing area as new and unfamiliar roads can generate new challenges for the AV brain. It is recommended to include testing area details in the reports. Also, new studies are advised to take into account the effect of spatial features. Eventually, managing a database that can be directly used without further processing or organisation could help accelerate the research and promote consistent terminologies, which is the contribution of this dataset. The processed data can bring consistency in the studies since different authors might use different appellations and can be readily imported to modelling and analysis software, thereby facilitating reproducible research.

Methods
Our processed dataset contains disengagement data from 2014 to 2019 and crash data until the 10 th of March 2020. The disengagement primary data are reported by CA DMV at the end of every year, and the crash primary data are continuously updated. There are four kinds of data in our repository, viz. disengagement data, crash data, street network data and land use data. The procedure of their extraction is presented in the following sections.
Disengagement and crash data. Primary data, i.e. disengagement and crash reports, were requested from CA DMV 18 . The reports are in pdf files and scanned copies of handwritten text. Authors manually scrutinised each report and converted them into excel and comma-separated files that can be processed by statistical tools like Stata, Nlogit and can be imported directly using scripting languages like Python or R for further analysis. A consistent data entry convention was discussed and developed considering different tools and scripting languages. The process of converting the reports into a processable format was time-consuming and cumbersome. More than three months was invested in going through reports and processing them. Automation of extraction of data from reports was infeasible since some manufacturers are submitting handwritten reports.

CA DMV data reports
Sensor data (KITTI, Waymo, etc.) Microsimulation and driving simulator data • Constitutes field data, and hence the system is exposed to unprecedented scenarios.
• Constitutes field data, and hence the system is exposed to unprecedented scenarios.
• Constitutes artificially generated or simulated data, and hence the system is not exposed to unprecedented scenarios.
• Includes the crash and disengagement reports.
• Includes sensor data (camera and lidar data).
• Includes trajectory information of AVs along with other vehicles.
• Contains information about the critical events (disengagements and crashes).
• Do not contain information about the critical events.
• Contains information about critical events (excluding the ones caused by perception error). www.nature.com/scientificdata www.nature.com/scientificdata/ Additionally, some reports have subjective responses and hence required manual data entry and an expert's elucidation. As a result, the data preparation required extensive manual hours investment. Furthermore, an extensive discussion was required to deduce the features that are not explicitly reported but can affect the crash, such as location characteristics (intersection, parking protocol and traffic light) and vehicle type. The latitude and longitude of every crash location were extracted from Google maps, and the "street view" feature of Google maps was used to discern location characteristics. The car model of the non-AVs involved in the crash mentioned in the reports was used to discern the "vehicle type" and resources like Wikipedia and manufacturers' websites were used. Some of the key descriptive statistics of the crash data are presented in Table 2. Through the availability of this processed data, research groups worldwide could bypass these cumbersome steps and accelerate their research.
Street network and land use data. Python was used in the extraction of both street network and land use data. GeoPandas 19 , OSMnx 20 , and Pandas 21 modules were used to assist with data manipulation. The geographical coordinates specified by the user (geocoded from crash data) are used to extract the road network (edges and nodes) using the OSMnx module. Custom filters are used as additional arguments to limit the results of the process (i.e., only returning edges and nodes, and not additional map features like pedestrian paths and building structures). Any missing features (due to incomplete data from OpenStreetMap) are assigned default values as appropriate (such as speed limits). Edges with multiple classifications are assigned only the highest classification (e.g., if an edge is both "primary" and "residential", assigned only "primary" attribute). Any non-one-way roads are adjusted so that there are two edges for each of those roads (one for each travel direction). The street network data is in the form of shapefiles which can be opened using any geographic information system (GIS) tool such as QGIS or ArcGIS. Finally, the land-use data, in the form of points of interest (POI) such as restaurants, clinics, banks, etc. are also extracted from OpenStreetMap using OSMnx, by specifying the same coordinates as were used for network extraction. Extraction of spatial information took additional 20 hours.

Data records
This section discusses the organisation of the processed data in the figshare repository 22 . Table 3 outlines the structure of the repository. There are four folders in the repository. DMV Crash report (Secondary Data). The folder includes a Microsoft Excel file "DMV Crash Data.xlsx", which contains three sheets. The Crash datasheet contains the processed information from the original reports Other party's vehicle status (Kinematics of non-AV) Vehicle type (Indicator of non-AV vehicle's size) Categorical =1 (two-wheelers, i.e. bicycle, scooter); =2 (sub-compact/compact cars); these are identified as having an inside volume between 2.4m3 and 3.1m3 by combining cargo and passenger volume. =3 (mid-size cars) these are identified as having an interior volume between 3.1m3 and 3.4m3. =4 (trucks/buses).
Intersection Type (Type of intersection given that crash happened at an intersection) Categorical Relative Velocity Continuous Relative velocity of the vehicles involved in the crash at the time of impact.
Reported only by a few manufacturers (176 out of 259 crashes reported relative speed) www.nature.com/scientificdata www.nature.com/scientificdata/ of CA DMV (primary data) and is manually inputted into this excel file. README sheet contains the variable definitions, and the Variable sheet contains the summary of the data. These excel files contain the yearly disengagement data extracted from the CA DMV reports (primary data) in the years 2014-2019. In each of these files, there are several tabs, of which the first one is a "README" tab that provides a brief description of the corresponding excel file and the data layout. The second tab in these files provides the summary of disengagement data within each year, including the miles driven and the number of manual and automatic disengagements by each company in each month of that year. The rest of the tabs in the Excel files provide individual disengagement records (not always reported by all companies), including the reason for disengagement and in some cases, the reaction time to take over after disengagement. For example, the 2016_ DISENGAGEMENT file has six other tabs where the manufacturers have recorded individual disengagements. Similarly, 2017_DISENGAGEMENT and 2019_DISENGAGEMENT files have eight and thirty-five additional tabs, respectively, that have information for individual manufacturers. The 2018_DISENGAGEMENT file has 35 additional tabs containing information of individual manufacturers (Apple has three subfiles, and UATC (Uber) has 2, due to their large dataset).    Table 3. Data Repository Structure. Note: Detailed descriptions can be found in the "ReadMe" files in repository 22 .

Fig. 1 Land Use Data (Points of Interest)
. "Land use" is the term used to describe the human use of land. It represents the economic and cultural activities (e.g., agricultural, residential, industrial, mining, and recreational uses) that are practiced at a given place 24  www.nature.com/scientificdata www.nature.com/scientificdata/ interest of these two areas, including the spatial coordinates. Figure 1 shows the map of Area 1 along with a legend for all the extracted POIs. Figure 2 represents a sub-area within Fig. 1 for better distinguishment of the different POIs.  www.nature.com/scientificdata www.nature.com/scientificdata/

Street Network Data (Supplementary Data).
This folder contains two sub-folders: Area1(San Francisco), Area2(San Jose) and Crash Data, which include the street network shapefiles (in the form of nodes and edges) for Area 1 and Area 2, respectively. The shapefile format is a geospatial vector data format for storing geometric location and associated attribute information relevant for geographic information system (GIS) software. Edges represent the physical streets, and the attribute table of edges includes information on the number of lanes, speed limit (in mph), the street name, type of road (freeway, primary, residential, etc.), and edge length. Nodes represent the intersection of edges, and their attribute table includes information on whether the node is a signalised or a priority sign intersection. Most of the crash records in the primary data provided by CA DMV did not include the spatial coordinates of the crash location. Instead, they include a description of the streets where the crash happened. Google Maps was used to geocode the text and convert it into spatial coordinates. The crash report provides a brief description of the crash. This includes the direction in which the vehicle is moving, approaching the intersection or placed at the intersection. Vehicle 1 movement provided information about the state of car (moving = 1, stationary = 0). The exact location of the crash was discerned using street information and a brief description. In case the report mentions "approaching intersection" or "at the intersection", the geocode of intersection is provided. Otherwise, the midpoint of the street is geocoded. Figures 3 and 4 show the different crash locations superimposed on the road networks of Areas 1 and 2, respectively.

technical Validation
The reports are manually processed, annotated, and verified by the expertise. For instance, the reports have some subjective information, such as the seriousness of the injury and the location of the crash. First, one of the authors extracted the relevant data manually. Then the same task was performed by another author independently. It is important to note that these two authors did not have any direct connection with each other during the data extraction process. Once both the datasets were obtained, it was found that they were mostly consistent and had less than 2% contradicting data entries. Such records were then finally checked and corrected by a discussion among all the authors. At each of these steps, authors avoided using automated pipelines and used expert human intervention to register the implicit data.
Additionally, previously published studies were replicated with our processed crash and disengagement data, and the results were found to be consistent. The replicated results can be found in a recent study published by the authors 14 . The disengagement trends with appended data presented in our recently published study 14 are coherent with a previous study 4 . The crash typology distribution and imbalance in injury outcome in our study 14 are also coherent with another previous study 11 . These validation methods are discussed in detail below. www.nature.com/scientificdata www.nature.com/scientificdata/ 1. For the automated validation, data were imported to different software and scripting languages. The spatial data were imported to QGIS software and the csv files were imported to Stata and Python. Frequently used operations such as filtering, summarising, and grouping were performed. The importing and operation databases have been successful, and there are no missing data. Further, the ranges of each variable column were checked. For instance, there are four categories of "Vehicle type" defined. The summarisation of "Vehicle type" column was found to be consistent with the protocol used. 2. A further validation was explicitly done for spatial data. Spatial data points were plotted on the maps to verify that the spatial information/points fall within the test area. Figures 3 and 4 verify that all the points lie inside the testing area. 3. To validate our data, we compared our results with those from previous studies: • Percentage of rear-end crashes A previous study 6 reported that around 62% of crashes were rear-ended. Another study 11 which analysed data until the year 2018, reported 63% of crashes as rear-ended. A more recently published study 14 that used our dataset reported 61.78% (160 out of 259) rear-ended crashes. Consistent numbers reported corroborates validation of our dataset, as shown in Fig. 5. This figure can be compared with Vehicle #1 (AV) panel of Figure 7 of the previous study 6 (refer to the following URL: https://doi. org/10.1371/journal.pone.0184952.g007).
• Distribution of crash severity A previous study 11 reported 89% CAV involved crashes are property damage only (PDO) crashes. PDO crashes translate to no injury crashes. A similar percentage (85%) was reported with extended data (our dataset) in our recently published study 14 . 4. Lastly, the crash records reported by a previous study 6 (refer to the following URL: https://doi.org/10.1371/ journal.pone.0184952.t001) were compared with our dataset. Table 4 presents a detailed comparison. There were differences in the column names between the two datasets. For example, if our data reported Vehicle1 Status = 1 (Moving) and Vehicle1 Status = 1 (Moving), the study 22 reported the Status of vehicles and relative direction = Moving/Moving. Furthermore, there are minor discrepancies in subjective columns, such as 'injury reporting' caused by the ambiguity of reporting protocol. Nevertheless, the interpretation of data was mostly consistent and thus, substantiate the validation of our dataset.

Limitations and Future Work
CA DMV reports are the best publicly available data containing information about critical events (crashes and disengagements) involving AVs. Nevertheless, the reports (and, therefore, our processed data) have few limitations. The reports (primary data) consist of a few fields like 'damage severity' , which are assessed subjectively and may vary based on the assessor. Furthermore, the information of safety operation drivers is missing in the reports. Psychological attributes such as the risk attitude of the drivers could play a crucial part in discerning the details of safety operation 23 but are not available in the reports. A previous study proposed a distinction between exogenous (factors outside the control of the driver and/or manufacturer, e.g., weather, infrastructure condition, outside traffic) and endogenous (factors can be acted upon) contributory factors of disengagement as an additional mandatory category in the reporting protocol 4 . However, this data lacks the difference within possible controllability of the factor, which can be crucial in discerning trends. Lastly, there is no data about conflicts.

Fig. 5
Crash type reported using our dataset. The percentage of rear-end crashes reported is 61.8%. A previous study 6 reported that around 62% of crashes were rear-ended. Another study 11 which analysed data until the year 2018, reported 63% of crashes as rear-ended. A more recently published study 14 that used our dataset reported 61.78% (160 out of 259) rear-ended crashes. This figure can be compared with Vehicle #1 (AV) panel of Figure  7 of the previous study 6 (refer to the URL: https://doi.org/10.1371/journal.pone.0184952.g007). These reported figures regarding the rear-end crashes were consistent with our dataset and thus reinforce the validity of our dataset.

Usage Notes
The excel sheets can be opened using any spreadsheet or workbook software. The shapefiles can be opened using any geographic information system tool such as QGIS or ArcGIS. The data can be directly imported to statistical or programming tools such as Stata, Nlogit and Python for performing analysis. We recommend that users review the sample reports in the database for further clarification. The disengagement data can be utilised to discern the trends as performed in some previous studies 4,8 . Manufacturer-specific information like reaction time can be used for investigation as presented in a previous study 5 . The crash data can be utilised in two significant ways. First, descriptive statistics and identification of trends as done by some past studies 6 . Second, crash data can be used for exploratory investigation as performed by a few studies 11,13 . New fields are appended in the reports, and auxiliary data (spatial data, vehicle type etc.) can be utilised to develop more comprehensive models.

Code availability
Primary data (crash and disengagement data) extraction does not utilise any automated pipelines or scripts due to implicit and subjective data in the original reports. Experts manually analysed each report to register the data into processable formats. GeoPandas 19 , OSMnx 20 , and Pandas 21 Python custom modules are used for the extraction of secondary data. Code documentation for GeoPandas (https://github.com/geopandas/geopandas) and OSMnx (https://github.com/gboeing/osmnx) can be found in the corresponding GitHub repositories. A more elaborative procedure is explained in the Methods section. acknowledgements Authors want to thank Jason Wang, PhD student at UNSW Sydney, for his contribution to the supplementary data collection. Authors acknowledge the Australian Research Council Grant LP160101021 and the School of Civil and Environmental Engineering at the University of New South Wales, Sydney, for the resources and support that made this research possible. Finally, authors thank the two anonymous reviewers and the Editorial team members for their helpful comments that significantly improved the quality of the paper and the data.