Background & Summary

Autonomous Vehicles (AVs) are being widely tested on public roads in several countries around the world. Although AVs are becoming more competent due to advancements in sensing and navigation technologies, safety stands as the biggest challenge in adopting this disruptive technology1. Currently, there are two major categories of “field” datasets that are publicly available; the crash and disengagement reports by the California Department of Motor Vehicles (CA DMV) and the extensive sensor (lidar and camera) datasets like KITTI2, Waymo3, etc. The latter category of datasets is proficient for accelerating research in computer vision, localisation, and behaviour cloning but lacks information about the critical instances (crash and disengagements). On the other hand, CA DMV disengagement and crash reports are the foremost publicly available datasets containing information about the critical instances. These critical events data are beneficial for laying a roadmap for the deployment of AVs, which includes reporting protocols, legal frameworks, and infrastructure maintenance plans. Table 1 presents the different categories of datasets used by researchers to evaluate the safety and performance of AVs.

Table 1 Different Categories of Datasets Available for Analysing AVs.

For the transparent deployment of AVs in California, the CA DMV commissioned AV manufacturers to draft and publish reports on disengagements and crashes. As a result, several researchers used the CA DMV reports to evaluate AVs’ safety-critical events. Most of the earlier studies focussed on discerning the trends in disengagements and crashes involving AVs4,5,6,7,8,9. Recently, a few studies have used these reports for developing statistical models. For example, one such study developed a nested logit model using three distinct outcomes: (1) no disengagement, (2) disengagement with no crash, and (3) disengagement with a crash10. The study also estimated endogenous switching models and deduced that disengagements lessen over time, and factors related to AV systems and other roadway participants elevate the tendency of disengagement without a crash. In another study, driving mode (autonomous vs manual mode), collision location, roadside parking protocol, rear-end collision, and one-way road were found to be significantly influencing crash severity11.

Another study developed fixed and random parameters binary logistic regression models to evaluate disengagement initiation12. The study emphasised that unfitting interpretations are possible if the random parameter approach is not used. Significant variables included location (highway or street), maturity of testing (months of testing), and cause (environmental, another road user, hardware or software discrepancy, and path planning discrepancy). The study also found that as the maturity of the testing increases by a month, the probability of automated disengagement increases by 0.014. Another study developed a Bayesian latent class model that identified six classes of collision patterns13. The authors pointed out the necessity of more advanced and robust collision narrative reporting for better analysis of the critical events. More recently, machine learning techniques were used to evaluate the crash severity of AVs14.

As evident, more complex models are being developed lately based on the CA DMV reports. However, these reports must be processed before conducting any statistical analysis, which is cumbersome and time-consuming. Furthermore, new crashes and disengagements are happening with more test vehicles on the road, and new data are appended regularly to the CA DMV dataset. The exclusion of the latest data used in the previous studies can result in premature findings7 and require constant scrutiny. Additionally, the advancement of this technology may change the currently observed trends15. Since the crash data are considered over a prolonged period, the time-varying explanatory variables may vary significantly. Neglecting latent within period variation may result in the loss of crucial explanatory variables. This loss of information by using discrete-time intervals can institute error in model estimation because of unobserved heterogeneity16,17, and hence studies using this data are required to be updated in the light of new information available.

In this context, we present the processed disengagement data from 2014 to 2019, crash data till the 10th of March 2020 and supplementary road network and land-use data of the AV testing locations in California. For the preparation of the current dataset, primary data in the form of crash and disengagement reports were requested from CA DMV. The primary data are extracted and organised into the processable format for potential reuse. Furthermore, OpenStreetMap and Google maps were used to extract supplementary data, which could be utilised to expand the research around AVs’ safety like in a previous study11, preparing guidelines and governance for AV deployment on public roads.

Specifically, the research community, government organisations, and public forums active in AV-related safety studies can benefit from this processed dataset. A comprehensive reporting protocol is necessary for the well-ordered deployment of AVs on public roads in any area. Henceforth, research can be conducted using our dataset towards identifying the shortcomings and fallibility of the reporting protocols used by CA DMV and assist government organisations around the world to devise better protocol systems for transparent deployment of AVs on public roads. For example, a limitation of the current CA DMV reports was pointed out by a previous study4, which stated to include the categories of disengagements in the reporting protocol by analysing the categorisation of disengagements into proposed macro and micro categories. Macro-level and spatial models of AV crashes can also be estimated using the supplementary data provided by our spatial database. They can be utilised to assess the hotspot crash/disengagement locations. Furthermore, the rate of disengagements can be affected by expanding the testing area as new and unfamiliar roads can generate new challenges for the AV brain. It is recommended to include testing area details in the reports. Also, new studies are advised to take into account the effect of spatial features. Eventually, managing a database that can be directly used without further processing or organisation could help accelerate the research and promote consistent terminologies, which is the contribution of this dataset. The processed data can bring consistency in the studies since different authors might use different appellations and can be readily imported to modelling and analysis software, thereby facilitating reproducible research.

Methods

Our processed dataset contains disengagement data from 2014 to 2019 and crash data until the 10th of March 2020. The disengagement primary data are reported by CA DMV at the end of every year, and the crash primary data are continuously updated. There are four kinds of data in our repository, viz. disengagement data, crash data, street network data and land use data. The procedure of their extraction is presented in the following sections.

Disengagement and crash data

Primary data, i.e. disengagement and crash reports, were requested from CA DMV18. The reports are in pdf files and scanned copies of handwritten text. Authors manually scrutinised each report and converted them into excel and comma-separated files that can be processed by statistical tools like Stata, Nlogit and can be imported directly using scripting languages like Python or R for further analysis. A consistent data entry convention was discussed and developed considering different tools and scripting languages. The process of converting the reports into a processable format was time-consuming and cumbersome. More than three months was invested in going through reports and processing them. Automation of extraction of data from reports was infeasible since some manufacturers are submitting handwritten reports.

Additionally, some reports have subjective responses and hence required manual data entry and an expert’s elucidation. As a result, the data preparation required extensive manual hours investment. Furthermore, an extensive discussion was required to deduce the features that are not explicitly reported but can affect the crash, such as location characteristics (intersection, parking protocol and traffic light) and vehicle type. The latitude and longitude of every crash location were extracted from Google maps, and the “street view” feature of Google maps was used to discern location characteristics. The car model of the non-AVs involved in the crash mentioned in the reports was used to discern the “vehicle type” and resources like Wikipedia and manufacturers’ websites were used. Some of the key descriptive statistics of the crash data are presented in Table 2. Through the availability of this processed data, research groups worldwide could bypass these cumbersome steps and accelerate their research.

Table 2 Variable Definitions and Summary Statistics of Crashes from Our Dataset22.

Street network and land use data

Python was used in the extraction of both street network and land use data. GeoPandas19, OSMnx20, and Pandas21 modules were used to assist with data manipulation. The geographical coordinates specified by the user (geocoded from crash data) are used to extract the road network (edges and nodes) using the OSMnx module. Custom filters are used as additional arguments to limit the results of the process (i.e., only returning edges and nodes, and not additional map features like pedestrian paths and building structures). Any missing features (due to incomplete data from OpenStreetMap) are assigned default values as appropriate (such as speed limits). Edges with multiple classifications are assigned only the highest classification (e.g., if an edge is both “primary” and “residential”, assigned only “primary” attribute). Any non-one-way roads are adjusted so that there are two edges for each of those roads (one for each travel direction). The street network data is in the form of shapefiles which can be opened using any geographic information system (GIS) tool such as QGIS or ArcGIS. Finally, the land-use data, in the form of points of interest (POI) such as restaurants, clinics, banks, etc. are also extracted from OpenStreetMap using OSMnx, by specifying the same coordinates as were used for network extraction. Extraction of spatial information took additional 20 hours.

Data Records

This section discusses the organisation of the processed data in the figshare repository22. Table 3 outlines the structure of the repository. There are four folders in the repository.

Table 3 Data Repository Structure.

DMV Crash Report (Secondary Data)

The folder includes a Microsoft Excel file “DMV Crash Data.xlsx”, which contains three sheets. The Crash datasheet contains the processed information from the original reports of CA DMV (primary data) and is manually inputted into this excel file. README sheet contains the variable definitions, and the Variable sheet contains the summary of the data.

DMV Disengagement Report (Secondary Data)

The DMV Disengagement Report folder contains seven Microsoft Excel files, and their description is provided below.

  • 2014_DISENGAGEMENT, 2015_DISENGAGEMENT, 2016_DISENGAGEMENT, 2017_DISENGAGEMENT, 2018_DISENGAGEMENT and, 2019_DISENGAGEMENT

    These excel files contain the yearly disengagement data extracted from the CA DMV reports (primary data) in the years 2014–2019. In each of these files, there are several tabs, of which the first one is a “README” tab that provides a brief description of the corresponding excel file and the data layout. The second tab in these files provides the summary of disengagement data within each year, including the miles driven and the number of manual and automatic disengagements by each company in each month of that year. The rest of the tabs in the Excel files provide individual disengagement records (not always reported by all companies), including the reason for disengagement and in some cases, the reaction time to take over after disengagement. For example, the 2016_DISENGAGEMENT file has six other tabs where the manufacturers have recorded individual disengagements. Similarly, 2017_DISENGAGEMENT and 2019_DISENGAGEMENT files have eight and thirty-five additional tabs, respectively, that have information for individual manufacturers. The 2018_DISENGAGEMENT file has 35 additional tabs containing information of individual manufacturers (Apple has three subfiles, and UATC (Uber) has 2, due to their large dataset).

  • COMBINED_DISENGAGEMENT_SUMMARY

This excel file is a combined summary of all the aforementioned excel files, where the file displays the monthly miles driven, number of automatic and manual disengagements reported by each company from 2014 to 2019.

Land Use (Supplementary Data)

This folder contains two comma-separated value (CSV) files; poi_List Area 1(San Francisco) and poi_List Area2 (San Jose), consisting of the information about the points of interest of these two areas, including the spatial coordinates. Figure 1 shows the map of Area 1 along with a legend for all the extracted POIs. Figure 2 represents a sub-area within Fig. 1 for better distinguishment of the different POIs.

Fig. 1
figure 1

Land Use Data (Points of Interest). “Land use” is the term used to describe the human use of land. It represents the economic and cultural activities (e.g., agricultural, residential, industrial, mining, and recreational uses) that are practiced at a given place24. The land-use data, in the form of points of interest (POI) is presented in this figure. This figure is an illustration of land use data for area 1 (San Francisco). Each point in the figure represents the land use characteristics. The description of the coloured points\legends is presented in the list next to the figure.

Fig. 2
figure 2

Land Use Data for a Sub-Area within Area 1. This figure represents the land use characteristics of a sub-area within Area 1. The different POIs can be better distinguished with the help of this figure.

Street Network Data (Supplementary Data)

This folder contains two sub-folders: Area1(San Francisco), Area2(San Jose) and Crash Data, which include the street network shapefiles (in the form of nodes and edges) for Area 1 and Area 2, respectively. The shapefile format is a geospatial vector data format for storing geometric location and associated attribute information relevant for geographic information system (GIS) software. Edges represent the physical streets, and the attribute table of edges includes information on the number of lanes, speed limit (in mph), the street name, type of road (freeway, primary, residential, etc.), and edge length. Nodes represent the intersection of edges, and their attribute table includes information on whether the node is a signalised or a priority sign intersection. Most of the crash records in the primary data provided by CA DMV did not include the spatial coordinates of the crash location. Instead, they include a description of the streets where the crash happened. Google Maps was used to geocode the text and convert it into spatial coordinates. The crash report provides a brief description of the crash. This includes the direction in which the vehicle is moving, approaching the intersection or placed at the intersection. Vehicle 1 movement provided information about the state of car (moving = 1, stationary = 0). The exact location of the crash was discerned using street information and a brief description. In case the report mentions “approaching intersection” or “at the intersection”, the geocode of intersection is provided. Otherwise, the midpoint of the street is geocoded. Figures 3 and 4 show the different crash locations superimposed on the road networks of Areas 1 and 2, respectively.

Fig. 3
figure 3

San Francisco (Area 1) Crash Locations Superimposed on the Road Network. The location (using latitude and longitude coordinates) of every crash is mapped into the testing area. Each point (red star) on this figure exhibits a crash location on the map for area 1, superimposed on the road network (the blue lines).

Fig. 4
figure 4

San Jose (Area 2) Crash Locations Superimposed on the Road Network. Like the Fig. 3, each point of Fig. 4 indicates a crash location on the map for area 2.

Technical Validation

The reports are manually processed, annotated, and verified by the expertise. For instance, the reports have some subjective information, such as the seriousness of the injury and the location of the crash. First, one of the authors extracted the relevant data manually. Then the same task was performed by another author independently. It is important to note that these two authors did not have any direct connection with each other during the data extraction process. Once both the datasets were obtained, it was found that they were mostly consistent and had less than 2% contradicting data entries. Such records were then finally checked and corrected by a discussion among all the authors. At each of these steps, authors avoided using automated pipelines and used expert human intervention to register the implicit data.

Additionally, previously published studies were replicated with our processed crash and disengagement data, and the results were found to be consistent. The replicated results can be found in a recent study published by the authors14. The disengagement trends with appended data presented in our recently published study14 are coherent with a previous study4. The crash typology distribution and imbalance in injury outcome in our study14 are also coherent with another previous study11. These validation methods are discussed in detail below.

  1. 1.

    For the automated validation, data were imported to different software and scripting languages. The spatial data were imported to QGIS software and the csv files were imported to Stata and Python. Frequently used operations such as filtering, summarising, and grouping were performed. The importing and operation databases have been successful, and there are no missing data. Further, the ranges of each variable column were checked. For instance, there are four categories of “Vehicle type” defined. The summarisation of “Vehicle type” column was found to be consistent with the protocol used.

  2. 2.

    A further validation was explicitly done for spatial data. Spatial data points were plotted on the maps to verify that the spatial information/points fall within the test area. Figures 3 and 4 verify that all the points lie inside the testing area.

  3. 3.

    To validate our data, we compared our results with those from previous studies:

    • Percentage of rear-end crashes

      A previous study6 reported that around 62% of crashes were rear-ended. Another study11 which analysed data until the year 2018, reported 63% of crashes as rear-ended. A more recently published study14 that used our dataset reported 61.78% (160 out of 259) rear-ended crashes. Consistent numbers reported corroborates validation of our dataset, as shown in Fig. 5. This figure can be compared with Vehicle #1 (AV) panel of Figure 7 of the previous study6 (refer to the following URL: https://doi.org/10.1371/journal.pone.0184952.g007).

      Fig. 5
      figure 5

      Crash type reported using our dataset. The percentage of rear-end crashes reported is 61.8%. A previous study6 reported that around 62% of crashes were rear-ended. Another study11 which analysed data until the year 2018, reported 63% of crashes as rear-ended. A more recently published study14 that used our dataset reported 61.78% (160 out of 259) rear-ended crashes. This figure can be compared with Vehicle #1 (AV) panel of Figure 7 of the previous study6 (refer to the URL: https://doi.org/10.1371/journal.pone.0184952.g007). These reported figures regarding the rear-end crashes were consistent with our dataset and thus reinforce the validity of our dataset.

    • Distribution of crash severity

    A previous study11 reported 89% CAV involved crashes are property damage only (PDO) crashes. PDO crashes translate to no injury crashes. A similar percentage (85%) was reported with extended data (our dataset) in our recently published study14.

  4. 4.

    Lastly, the crash records reported by a previous study6 (refer to the following URL: https://doi.org/10.1371/journal.pone.0184952.t001) were compared with our dataset. Table 4 presents a detailed comparison. There were differences in the column names between the two datasets. For example, if our data reported Vehicle1 Status = 1 (Moving) and Vehicle1 Status = 1 (Moving), the study22 reported the Status of vehicles and relative direction = Moving/Moving. Furthermore, there are minor discrepancies in subjective columns, such as ‘injury reporting’ caused by the ambiguity of reporting protocol. Nevertheless, the interpretation of data was mostly consistent and thus, substantiate the validation of our dataset.

    Table 4 Comparison between our dataset and a previous study6.

Limitations and Future Work

CA DMV reports are the best publicly available data containing information about critical events (crashes and disengagements) involving AVs. Nevertheless, the reports (and, therefore, our processed data) have few limitations. The reports (primary data) consist of a few fields like ‘damage severity’, which are assessed subjectively and may vary based on the assessor. Furthermore, the information of safety operation drivers is missing in the reports. Psychological attributes such as the risk attitude of the drivers could play a crucial part in discerning the details of safety operation23 but are not available in the reports. A previous study proposed a distinction between exogenous (factors outside the control of the driver and/or manufacturer, e.g., weather, infrastructure condition, outside traffic) and endogenous (factors can be acted upon) contributory factors of disengagement as an additional mandatory category in the reporting protocol4. However, this data lacks the difference within possible controllability of the factor, which can be crucial in discerning trends. Lastly, there is no data about conflicts. Authors believe that for a better understanding of the development of AVs, sensor data along with driver’s and road characteristics at the time of the crash should be made available to the government and researchers.

Usage Notes

The excel sheets can be opened using any spreadsheet or workbook software. The shapefiles can be opened using any geographic information system tool such as QGIS or ArcGIS. The data can be directly imported to statistical or programming tools such as Stata, Nlogit and Python for performing analysis. We recommend that users review the sample reports in the database for further clarification.

The disengagement data can be utilised to discern the trends as performed in some previous studies4,8. Manufacturer-specific information like reaction time can be used for investigation as presented in a previous study5. The crash data can be utilised in two significant ways. First, descriptive statistics and identification of trends as done by some past studies6. Second, crash data can be used for exploratory investigation as performed by a few studies11,13. New fields are appended in the reports, and auxiliary data (spatial data, vehicle type etc.) can be utilised to develop more comprehensive models.