A comprehensive database of the geographic spread of past human Ebola outbreaks

Ebola is a zoonotic filovirus that has the potential to cause outbreaks of variable magnitude in human populations. This database collates our existing knowledge of all known human outbreaks of Ebola for the first time by extracting details of their suspected zoonotic origin and subsequent human-to-human spread from a range of published and non-published sources. In total, 22 unique Ebola outbreaks were identified, composed of 117 unique geographic transmission clusters. Details of the index case and geographic spread of secondary and imported cases were recorded as well as summaries of patient numbers and case fatality rates. A brief text summary describing suspected routes and means of spread for each outbreak was also included. While we cannot yet include the ongoing Guinea and DRC outbreaks until they are over, these data and compiled maps can be used to gain an improved understanding of the initial spread of past Ebola outbreaks and help evaluate surveillance and control guidelines for limiting the spread of future epidemics.

Ebola is a zoonotic filovirus that has the potential to cause outbreaks of variable magnitude in human populations. This database collates our existing knowledge of all known human outbreaks of Ebola for the first time by extracting details of their suspected zoonotic origin and subsequent human-to-human spread from a range of published and non-published sources. In total, 22 unique Ebola outbreaks were identified, composed of 117 unique geographic transmission clusters. Details of the index case and geographic spread of secondary and imported cases were recorded as well as summaries of patient numbers and case fatality rates. A brief text summary describing suspected routes and means of spread for each outbreak was also included. While we cannot yet include the ongoing Guinea and DRC outbreaks until they are over, these data and compiled maps can be used to gain an improved understanding of the initial spread of past Ebola outbreaks and help evaluate surveillance and control guidelines for limiting the spread of future epidemics.

Background & Summary
The genus Ebolavirus belongs to the family Filoviridae and contains five known species to date that vary in their distribution, reservoir hosts and their pathogenicity to humans. With the exception of Reston ebolavirus which has only shown pathogenicity among primates and porcids, all of these have shown some capacity to spill over from their natural reservoirs and cause human cases 1,2 . While only a single human case of Tai Forest ebolavirus has been documented 3 , the remaining three species (Zaire ebolavirus, Sudan ebolavirus and Bundibugyo ebolavirus) are all known to permit human to human transmission after the initial suspected zoonotic transfer resulting in outbreaks of different sizes, geographic extents and case fatality rates ( Fig. 1) 4,5 . Which reservoir species are responsible for maintaining Ebolavirus transmission between outbreaks is not well understood 6 , but several candidate species have been identified. In Gabon three species of bats (Hypsignathus monstrosus, Epomops franqueti and Myonycteris torquata) were found to be infected with Ebola virus 7 and some human outbreaks have been directly linked to bat exposure 8 . While it is increasingly clear that gorillas (Gorilla gorilla) and chimpanzees (Pan troglodytes) are dead end hosts for the virus, infection in populations of these species is frequently found and they still present a risk of animal to human transmission 3,[9][10][11][12][13][14] . For many outbreaks it has been difficult to definitively identify the source of human Ebola index cases, but activities that bring humans into close contact with the blood of mammals through activities such as hunting and the bushmeat trade are common to many of the index cases 8,[14][15][16] .
Following the initial suspected zoonotic transfer, secondary transmission can result from close contact between infectious individuals or corpses and other humans, usually through exposure to infectious bodily fluids 2 . Due to the close degree of contact required for secondary transmission, certain specific community activities are commonly associated with hotspots of secondary transmission such as family home care, traditional burial practices that involve washing the corpse, or healthcare settings where sufficient protective measures are not in place 15,17,18 . These focal transmission events combined with an incubation period of 5-9 days mean transmission can often be observed in waves of cases 2,18 in both space and time.
Spread of cases over longer distances is often associated with treatment seeking that draws people from rural villages that typify the index case locations to big urban centres with central medical facilities ( Supplementary Figures 1-22). While this mostly involves domestic land travel 19 , some instances of international importation by air travel have been documented 18 . Following travel of an infectious individual, either secondary clusters of Ebola cases will occur, or transmission will be interrupted by control methods such as quarantine and patient contact tracing 20,21 . Due to the variable rate of progression of symptoms of Ebola virus disease (EVD) (onset of Ebola Haemorrhagic fever can range from 2-21 days 2 ), case fatality rates can vary significantly depending on a number of factors associated with Ebola virus pathogenesis and the quality and timing of symptomatic care. The complex interaction between surveillance, control, treatment seeking, patient-contact rates and their combined effects on the dynamics of transmission dictate the spread, magnitude and case fatality rate of an Ebola outbreak.
This database collates existing knowledge on the geographic spread of past Ebola outbreaks in a standardised format that allows the dynamics of different outbreaks to be compared. Procedures for data abstraction are outlined and each outbreak is summarised with a map and brief text description. These data will be useful for conducting spatial analyses of Ebola outbreak spread. We include every outbreak preceding the atypical 2013 Guinea epidemic which has spread further and faster than any previous epidemic. Once the current Guinea and Democratic Republic of the Congo (DRC) outbreaks are over, this database will be updated to include the same standardised data fields for these contemporary outbreaks. Periodic updates to include any additional Ebola outbreaks will also ensure this resource has on-going relevance in Ebola spread analyses. In particular, a comparison between the Guinea 2013 outbreak and historical outbreaks will allow an evaluation of surveillance and control guidelines in terms of their appropriateness for mitigating the spread of future Ebola outbreaks of variable magnitude. In the meantime, it is hoped that these data will support research into EVD epidemiology which can be brought to bear on the current outbreak.

Data collection
Tables listing proven outbreaks of Ebola virus, sourced from the scientific literature 5 and from health reporting organisations 22 , were used to coordinate initial searches of the formal scientific literature using Web of Science and PubMed for each specific outbreak. Relevant papers were abstracted and, where possible, outbreak-specific epidemiological surveys were sourced. The citations in these references were obtained in order to reconstruct the outbreak in detail and extract a range of epidemiological data relating to geographic spread, case and fatality numbers.
In this analysis we excluded ongoing outbreaks meaning that the current Guinea and DRC spread data are not yet included. When this outbreak is over and data has been assimilated we will update this database to incorporate the Guinea and DRC outbreaks using the same procedures described below, which will allow a comparison with historical outbreaks of Ebola.

Definitions of index, secondary and imported cases
Index cases were defined as any human infection resulting from interaction with non-human sources of the disease. Among the sources examined, index cases were identified based on reported interaction with suspected zoonotic reservoirs or hosts, such as non-human mammals during hunting trips 8,[14][15][16]23 . In cases where a mode of suspected zoonotic transmission could not be established, the first reported case was assumed to be the index case. Any cases arising from existing human infections are considered as secondary infections. Cases reported after the index cases were assumed to be secondary cases unless they were accompanied by specific details of likely exposure to a zoonotic reservoir or non-human host. If a case was reported in an area where no further cases occurred and no continued transmission was documented, these were termed imported cases.

Procedure for geo-positioning
For each index, secondary or imported case cluster that could be linked to a unique geographic location we performed a range of procedures similar to methods employed elsewhere 24,25 to assign geographic coordinates. For index and secondary cases these locations were representative of the site of suspected zoonotic transfer or human-to-human transmission respectively. Index cases were geopositioned as the location where exposure to the suspected zoonotic reservoir was likely to be highest. For example hunters who butchered carcases on hunting expeditions were geopositioned as a polygon covering the area of the hunting trip not a point at the hunters' homes. In contrast, if an individual purchased bushmeat from a local market for preparation and consumption at home, the home of the individual was georeferenced as the index case, as it was considered the location of highest exposure to the suspected zoonotic reservoir. For imported cases these locations were representative of all locations that infected patients travelled to, but did not cause onward transmission. For the purpose of this analysis we excluded international imported cases that did not cause autochthonous secondary transmission as, in most cases, they represented diagnosed foreign workers or expatriates who were evacuated for specialised treatment 3,10 . In such circumstances appropriate preventative measures are employed meaning that risk of onward transmission is negligible.
Each occurrence was reported to the highest degree of spatial resolution available based upon the information provided, as long as they could be categorised into one of: index, secondary or imported cases. This ranged from point locations (indicative of a precise location, such as a village), to areas, termed polygon locations, which correspond approximately to administrative regions or custom digitised areas based on site descriptors within the primary articles. Administrative regions were defined as classified by the Food and Agriculture Organization's Global Administrative Unit Layers (GAUL) coding 26 . These classify national boundaries as admin0 units, states or provinces as admin1 units and districts as admin2 units. By classifying Ebola occurrences as polygons we were able to represent the geographic uncertainty around the exact location of Ebola transmission which could have occurred anywhere within the defined region. For towns, the coordinates of the centre were recorded, unless a specific part of the town (or an explicit latitude and longitude) was described. Coordinates for point locations were extracted using Google Earth (version 7.1.2). If the area concerned could not be assigned to a finer resolution than 5 km × 5 km, it was entered as polygon rather than point data. If specified regions could not be linked to an admin2 or admin1 unit, custom polygons were digitised using site descriptions in the text articles. For imprecise descriptors e.g., '15 km from the town' with no direction specified or 'cases occurred on a north-south road between village X and Y', circular polygons were digitised based on radius distances given or extreme points that defined the geographic limits of transmission. These circular polygons could be trimmed if their area included admin1 or admin2 administrative regions which reported no EVD patients. Some articles referred to 'healthcare districts' that did not correspond to admin1 or admin2 units, but were definable based on maps presented in the primary literature that were digitised or were available on the map sharing website IKI (www.ikimap.com) 27 . For index cases that referred to suspected zoonotic transfer in specific forests or game reserves, polygons were drawn based on the specified park or forest geographic boundary as shown in Google Earth. Two exceptional cases were present. In the first, for outbreak 13, the index case transmission site description merely mentioned a case being reported as 'near the town of Mbandza'. In this instance a circular polygon was defined with radius of half the distance to the next specified location of transmission (7.5 km) 15 . In the second, for outbreak 9, two locations described in the primary literature could not be located, but were described as 'near to the town of Booue' 18 . As a result the same procedure was undertaken and a radius of 30 km was defined around the village of Booue. All digitising was performed using ArcGIS 10.1 28 .

Key outbreak metrics recorded
The total number of cases for each outbreak was obtained from the most recent primary source (Table 1 (available online only)). Cases included both clinically suspected and laboratory confirmed cases at the point of care, or diagnosed retrospectively. The number of people who died with a suspected or confirmed diagnosis of EVD was also recorded. These data were spatially disaggregated as much as possible from the information given in the text to give measures of spatial variation in case fatality rate within an outbreak.
Outbreak start and end time were defined by reports of the first (index) and last cases respectively. For each secondary and imported case that occurred in a novel location, the source of importation was recorded where reported. When possible this was reconciled with the date of the first secondary or imported case in each cluster to determine geographic spread order. We only included confirmed sources of origin, as determined by detailed patient histories, not suspected sources and no differentiation of spread order to two new clusters from the same source was made unless both new clusters documented the date of their first secondary case. If two secondary case clusters had a confirmed source but no differentiation of order we assigned them the same spread order. Similarly if a secondary case cluster could have come from two specified sources both were included as sources of spread. Due to a lack of epidemiological investigation at the time, or through information being lost in the reporting chain, the spread order of many secondary and imported case clusters was partially complete or missing altogether (Table 2).
For each outbreak the above details were combined with additional information from the original articles on method and timing of spread to construct the text descriptions accompanying the spread maps in Supplementary Figures 1-22 (selected examples in Figs 2,3,4). Spread is defined as movement of an infected individual from the location of infection to a transmission-free area. For example an individual who was likely infected from the suspected zoonotic reservoir on a hunting expedition in the forest then caused secondary infections back in their local village, would qualify as having spread the virus from the forest to the village. By contrast if the first reported case occurred in a village where subsequent secondary cases then occurred, no spread would have been recorded as index and secondary transmission occur in the same location.  Selected examples are presented in Figs. 2,3,4. In the first, initial focal infection jumped from village to village, as well as longer distance dispersal primarily through treatment seeking of infected patients (Fig. 2). In the second (Fig. 3) more limited outbreak, cases occurred in two main clusters, with one containing working camps surrounding the site of suspected zoonotic transfer and the other villages and a local hospital in the vicinity of Makokou. In the final example outbreak (Fig. 4) secondary transmission spread radially out from the village of Balimba following treatment seeking of a faith healer. Similar to outbreak 1, infected patients travelled to local and national healthcare centres for treatment, in some cases causing secondary transmission at great distances from the index case.
In the text descriptions, methods, sources and order of spread were described in a standardised manner, where information allowed. A brief summary of the evolution of the outbreak over time and the breakdown of total number of cases, deaths and case fatality rate over the course of the entire outbreak is also given. The first reported cases of Sudan Ebola virus were in three workers at a cotton factory in Nzara, in close proximity to three game reserves. The method of acquisition was unknown. The first secondary cases arose in Nzara infecting a total of 67 people who were primarily family members of the factory workers. Further secondary transmission clusters emerged in Maridi following spread from Nzara due to seeking treatment, after which further cases occurred in Juba due to patients who were referred. Additional cases from Maridi were also referred directly to Juba making the source of infection in Juba difficult to identify. Secondary transmission also emerged in Tembura due to a patient seeking family care, although the source of this infection is unknown. Imported cases from Juba to Khartoum and from Nzara to Omdurman were also reported following a patient seeking treatment and a referral for diagnosis respectively (see inset). The principal mode of transmission in this outbreak was initially familial, although in Maridi secondary transmission arose through nosocomial transmission. Seeking of treatment was the principal cause of geographic spread. The first index case became ill on the 27 June 1976 before the first secondary cases in July and subsequent secondary transmission clusters from August to October. Cases peaked in September (138 cases, 65 deaths). The final case was reported on 25 November 1976. Imported cases in Omdurman and Khartoum were reported in August and September, respectively. Overall, 284 cases were reported with 151 deaths giving a CFR of 53.2%. This figure varied in different locations: Nzara (67,31,46%), Maridi (213,116,55%), Tembura (3,3,100%), Juba (1,1,100%). Arrows indicate order of spread. Where spread order is known, numbers are indicative of the order of spread. Arrows sharing the same number indicate that it was not possible to distinguish which spread happened first. A wider-scale map showing the locations of all the outbreaks since 1976 is also available in Supplementary Fig. 23 and a table detailing which type of data were obtained from which source for each outbreak is given in Table 1 (available online only) 3,8,10,[13][14][15][16][17][18]20,[29][30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][46] .

Data Records
The data from this analysis are summarised in two types of data format (Data Citation 1). First a data table details unique geographic locations of Ebola occurrence, including information on type of transmission, location, spread, timing and case number. These geographic locations are grouped into individual outbreaks (n = 22) and summary statistics on timings and case and death numbers are given for each outbreak. Second, geographic information files are provided that match the information presented in the data table to explicit geographic areas. These are available in a variety of formats that can be read by geographic information system (GIS) applications. Information from these two file types were used to make the outbreak summary maps and text in Supplementary Figures 1-22 (selected examples in Figs. 2,3,4).

Data table of unique Ebola virus transmission locations
The table includes the following fields, detailed below. The value 'NA' was entered if information was unknown, unreported or indeterminable. The term 'occurrence' refers to Ebola transmission locations that are either unique in the geographic location or in their type of transmission (index, secondary or imported). Each row in the table represents a unique Ebola occurrence. A group of occurrences make up contained 'outbreaks' and fields with the prefix 'OB' summarise various metrics related to the entire outbreak that each occurrence belongs to.
UNIQ_ID: A unique identification number for each occurrence at which index, secondary or imported cases of EVD have occurred at unique geographic locations (n = 117). NAME: Text description of the point or polygon that defines the location of the occurrence. COUNTRY: The country where the majority of cases occurred in each outbreak. VIRUS: The Ebola virus species of each outbreak. SOURCE_2: The unique identification number of the occurrence where the first EVD patient came from. An occurrence may have more than one source if infected patients came from more than one source but it is unknown which triggered secondary transmission.
SOURCE_3: The unique identification number of the occurrence where the first EVD patient came from. An occurrence may have more than one source if infected patients came from more than one source but it is unknown which triggered secondary transmission.

END_YEAR:
Year of last reported case in the occurrence. REP_CASE: The total number of cases (suspected or confirmed) reported over the course of the outbreak, but only within the occurrence.
REP_DEATH: The total number of deaths (suspected or confirmed) reported over the course of the outbreak, but only within the occurrence.
OB_ID OB_DEATH: The total number of deaths (suspected or confirmed) reported over the course of the outbreak in all areas.

Geographic information files
The geographic information files include index, secondary and imported cases linked to geographic locations. All fields match those in the data table. Unknown, unreported or indeterminable data are represented by the character combination 'NA'.

Technical Validation
For each outbreak, all relevant text articles were used to confirm or reach a consensus on the likely result. Where differing results were found (e.g., total case count), figures from primary research articles took preference over review, summary articles or periodical epidemiological reports.
The extracted geographic and epidemiological summary information was cross-checked by at least two different researchers to ensure accuracy.
All point and polygon locations were checked in ArcGIS 10.1 28 against national and subnational boundaries to ensure their location matched the text descriptions. A gridded raster file from the Global Lakes and Wetlands Database 47 giving the locations of rivers and lakes was also included to check points and polygons fell on land rather than water. Any points that fell in water were moved to the nearest land pixel.

Usage Notes
The data presented here can be used in combination with spatial and temporal meteorological and socioeconomic information to generate hypotheses about the factors that may be important in the emergence of index cases and the spread of secondary and imported cases of EVD.
Pigott et al. combined data on Ebola index cases with a suite of environmental information in a species distribution model to map the zoonotic niche of Ebola transmission across Africa 48 . Matching more varied and finer scale local information at the sites of the index cases presented here could help develop our understanding of the complex process of Ebola virus emergence and the risk posed by certain human activities and land use patterns.
Using the secondary and imported case data, it would be possible to model and investigate causes of spread of human Ebola outbreaks. Understanding spread of the pathogen in past outbreaks may aid control of the current outbreak. A comparison between the spread of historical outbreaks and the spread of the current ongoing outbreak once it is over will be useful for informing Ebola surveillance and control guidelines to minimise the size and burden of these sporadic zoonoses.
Finally, this dataset can be used to investigate how the rate, extent and the environment in which Ebola outbreaks spread relates to important outbreak measures such as the total number of cases and case fatality rate. Understanding what distinguishes brief, geographically limited and low mortality Ebola outbreaks from those that impose a much higher public health burden will be important for informing future surveillance, control and treatment efforts 49 . This dataset provides the most comprehensive collection of standardised data on Ebola outbreak spread currently available and will be an important resource for these uses. These records will also be updated periodically with data from ongoing and future Ebola outbreaks to ensure the ongoing relevance of this database.