A database of geopositioned Middle East Respiratory Syndrome Coronavirus occurrences.

As a World Health Organization Research and Development Blueprint priority pathogen, there is a need to better understand the geographic distribution of Middle East Respiratory Syndrome Coronavirus (MERS-CoV) and its potential to infect mammals and humans. This database documents cases of MERS-CoV globally, with specific attention paid to zoonotic transmission. An initial literature search was conducted in PubMed, Web of Science, and Scopus; after screening articles according to the inclusion/exclusion criteria, a total of 208 sources were selected for extraction and geo-positioning. Each MERS-CoV occurrence was assigned one of the following classifications based upon published contextual information: index, unspecified, secondary, mammal, environmental, or imported. In total, this database is comprised of 861 unique geo-positioned MERS-CoV occurrences. The purpose of this article is to share a collated MERS-CoV database and extraction protocol that can be utilized in future mapping efforts for both MERS-CoV and other infectious diseases. More broadly, it may also provide useful data for the development of targeted MERS-CoV surveillance, which would prove invaluable in preventing future zoonotic spillover.


Background & Summary
Middle East Respiratory Syndrome Coronavirus (MERS-CoV) emerged as a global health concern in 2012 when the first human case was documented in Saudi Arabia 1 . Now listed as one of the WHO Research and Development Blueprint priority pathogens, cases have been reported in 27 countries across four continents 2 . Imported cases into non-endemic countries such as France, Great Britain, the United States, and South Korea have caused secondary cases [3][4][5] , thus highlighting the potential for MERS-CoV to spread far beyond the countries where index cases originate. Reports in animals suggest that viral circulation could be far more widespread than suggested by human cases alone [6][7][8] .
To help prevent future incidence of MERS-CoV, public health officials can focus on mitigating zoonotic transfer; however, in order to do this effectively, additional research is needed to determine where spillover could occur between mammals and humans. Previous literature reviews have looked at healthcare-associated outbreaks 9 , importation events resulting in secondary cases 10,11 , occurrences among dromedary camels 12,13 , or to summarize current knowledge and knowledge gaps of MERS-CoV 14,15 . This database seeks fill gaps in literature and build upon existing notification data by enhancing the geographic resolution of MERS-CoV data and providing occurrences of both mammal and environmental detections in addition to human cases. This information can help inform epidemiological models and targeted disease surveillance, both of which play important roles in strengthening global health security. Knowledge of the geographic extent of disease transmission allows stakeholders to develop appropriate emergency response and preparedness activities (https://www.jeealliance.org/ global-health-security-and-ihr-implementation/joint-external-evaluation-jee/), inform policy for livestock trade and quarantine, determine appropriate demand for future vaccines (http://cepi.net/mission) and decide where to deliver them. Additionally, targeted disease surveillance will provide healthcare workers with updated lists of

Methods
The methods and protocols summarized below have been adapted from previously published literature extraction processes [18][19][20][21][22] , and provide additional context surrounding our systematic data collection from published reports of MERS-CoV. Data collection. We identified published reports of MERS-CoV by searching PubMed, Web of Science, and Scopus with the following terms: "Middle Eastern Respiratory Syndrome", "Middle East Respiratory Syndrome", "MERSCoV", and "MERS". The initial search was for all articles published about MERS-CoV prior to April 30, 2017, and was subsequently updated to February 22, 2018. These searches were conducted through the University of Washington Libraries' institutional database subscriptions. We searched the Web of Science Web of Science Core Collection (the subscribed edition includes Science Citation Index Expanded, 1900-present; Social Sciences Citation Index, 1975-present; Arts & Humanities Citation Index, 1975-present; Emerging Sources Citation Index, 2015-present). We searched the standard Scopus database and the standard, freely available PubMed database; these products have a single version that is consistent across institutional subscriptions or access points.
In total, this search returned 7,301 related abstracts, which were collated into a database before a title-abstract screening was manually conducted (Fig. 1. Flowchart). Articles were removed if they did not contain an occurrence of MERS-CoV; for example, vaccine development research or coronavirus proteomic analyses. Non-English articles were flagged for further review and brought into the full text screening stage. The accompanying supplementary file highlight the title and abstract screening process and the inclusion and exclusion criteria.
Full text review was conducted on 1,083 sources. To meet the inclusion criteria, articles must have contained both of the following items: 1) a detection of MERS-CoV from humans, animals, or environmental sources, and 2) MERS-CoV occurrences tagged with spatial information. Additionally, extractors attempted to prospectively  manually remove articles containing duplicate occurrences that were already extracted in the dataset. Extractors only prospectively manually removed articles if it was clear the articles contained data we were confident had already been extracted and had high-quality data. We excluded 885 sources based on full text review. In addition, we reviewed citations and retroactively added relevant articles to our database if they were not already included. We retroactively added and subsequently marked ten articles for extraction using this process. In total, we extracted 208 peer-reviewed sources reporting detection of MERS-CoV that included geographic and relevant epidemiological metadata.
Geo-positioning of data. Google Maps or ArcGIS 23 was used to manually extract location information at the highest resolution available from individual articles. We evaluated spatial information as either points or polygons. The geography was defined as a point if the location of transmission was reported to have occurred within a 5 × 5 km area. Point data are represented by a specific latitude and longitude. A point references an area smaller than 5 × 5 km in order to be compatible with the typical 5 × 5 km resolution of satellite imagery used for global analyses. The geography was defined as a polygon if the location of transmission was less clear, but known to have occurred in a general area (e.g. a province), or the location of transmission occurred within an area greater than 5 × 5 km (e.g. a large city). We used contextual information to determine location in instances where the author's spelling of a location differed from Google Maps or ArcGIS. Maps provided by authors were digitized using ArcGIS.
We used three different types of polygons: known administrative boundaries, buffers, and custom polygons. Relevant administrative units were sourced from the Global Administrative Unit Layers curated by the Food and Agricultural Organization of the UN 24 for known administrative boundaries of governorates, districts, or regions, and paired with the occurrence record. Buffers were created to encompass areas in cities and regions without corresponding administrative units. To ensure that buffers encompassed the entirety of the area of interest, Google Maps was used to determine the required radius. In areas with unspecified boundaries (e.g. Table Mountain National Park and the border region between Saudi Arabia and UAE) ArcGIS was used to generate custom polygons, which were assigned a unique code within a defined shapefile for ease of re-identification.
• index: Any human infection of MERS-CoV resulting after direct contact with an animal and no reported contact with a confirmed MERS-CoV case or healthcare setting. • unspecified: Cases that lacked sufficient epidemiological evidence to classify them as any other status (e.g. serosurvey studies). • NA: Non-applicable field; case was not a patient (e.g. mammal) • secondary: Defined as any cases resulting from contact with known human infections. Cases reported after the index case can be assumed to be secondary cases unless accompanied by specific details of likely independent exposure to an animal reservoir. • import: Cases that were brought into a non-endemic country after transmission occurred elsewhere.
• zoonotic: Transmission occurred from an animal to a human.
• direct: Only relevant for human-to-human transmission.
• unspecified: Lacked sufficient epidemiological evidence to classify a human case as zoonotic or direct.
• animal-to-animal: Transmission occurred from an animal to another animal. 19. clinical: Describes whether the MERS-CoV occurrence demonstrated clinical signs of infection. Denoted by yes, no, or unknown.
• yes: Clinical signs of infection were present/reported. Clinical signs among humans may range from mild (e.g. fever, cough) to severe (e.g. pneumonia, kidney failure). Clinical signs among camels include nasal discharge. • no: Clinical signs of infection were not present/reported. • unknown: Subject(s) may or may not have been demonstrating clinical signs of infection. For example, some authors did not explicitly mention symptoms, but individuals reportedly sought medical care. Another example being when a diagnostic serosurvey was conducted during an ongoing outbreak. The term "unknown" was used when articles lacked sufficient evidence for extractors to definitively label as "yes" or "no".
20. diagnostic: Describes the class of diagnostic method that was used. PCR, serology, or reported. 21. diagnostic_note: More detailed information related to the specific test used (e.g. rk39, IgG, or IgM serology). 22. serosurvey: Describes the context if serological testing was used.
• diagnostic: testing of symptomatic patients.
• exploratory: historic exposure determined among healthy asymptomatic individuals. 23. country: ISO3 code for country in which the case occurred. 24. origin: Open-ended field to provide more details on the specific in-country location of MERS-CoV case. 25. problem_geography: This field was utilized if the MERS-CoV case was reported in a location that could cause uncertainty when determining exact geographic occurrence (e.g. hospital, abattoir). 26. lat: Latitude measured in decimal degrees. 27. long: Longitude measured in decimal degrees. 28. latlong_source: The source from which latitude and longitude were derived. 29. loc_confidence: States the level of confidence that researchers had when assigning a geographic location to the MERS-CoV case (good or bad). An answer of 'good' meant the article stated clearly that the case occurred in a specific geographic location and no assumptions were required on part of the researcher. An answer of 'bad' meant the article did not clearly state the specific geographic location of the MERS-CoV case, but the researcher was able to infer the location of occurrence. The field SITE_NOTES was utilized to detail the logic behind researchers' decisions when inference was required. 30. shape_type: The geographic shape type assigned to the MERS-CoV occurrence (point or polygon). 31. poly_type: If the MERS-CoV occurrence was assigned a shape_type of polygon, was it admin (GAUL), custom, or buffer? 32. buffer_radius: If a MERS-CoV occurrence was assigned a buffer, what is the radius in km? 33. gaul_year_or_custom_shapefile: File path used to reach the necessary shape file in ArcGIS. Users of this dataset can find custom shapefiles created for this dataset at: https://cloud.ihme.washington.edu/index. php/s/DGoyKYqnbjG54F2/download 34. poly_id: A standardized and unique identifier assigned to each GAUL shapefile. 35. poly_field: Which type of polygon was used to geo-position the occurrence? (e.g. if admin1 polygon was used, enter ADM1_CODE) 36. site_notes: Miscellaneous notes regarding the site of occurrence. 37. month_start: Month that the occurrence(s) began. If the article provided a specific month of illness onset, the month was assigned a number from 1-12 (1 = January, 2 = February, etc.). If the article did not provide a specific month of illness onset, then researchers assigned a value of 'NA' .

month_end:
Month that the occurrence(s) ended, defined as the date a patient tested negative for MERS-CoV. If the article provided a specific month for recovery, the month was assigned a number from 1-12 (1 = January, 2 = February, etc.). If the article did not provide a specific month of symptom onset, then researchers assigned a value of 'NA' . 39. year_start: Year that the occurrence(s) began. If the year of illness onset was not provided in the article, the IHME standard was used: (year_start = publication year -3).

year_end:
Year that the occurrence(s) ended. If the article did not provide a specific year for recovery, the IHME standard was used: (year_end = publication year -1). 41. year_accuracy: If years were reported, this field was assigned a value of '0' . If assumptions were required, this field was assigned a value of '1' .

technical Validation
All data extracted from the original search (October 2012 to April 30, 2017) was reviewed independently by a second individual to check for accuracy. Challenging extractions from the updated search (May 1, 2017 to February 22, 2018) were selected for group review during bi-weekly team meetings. Upon extraction completion, all data were checked to ensure they fell on land and within the correct country.
While the protocol implemented above was designed to reduce the amount of subjective decisions made by extractors, total elimination was not possible. Wherever a subjective decision had to be made, the extractor utilized the various notes fields in order to document the logic behind decisions. These decisions were subsequently reviewed by other extractors.

Usage Notes
The techniques described here can be applied to collect and curate datasets for other infectious diseases, as has been previously demonstrated with dengue 20 and leishmaniasis 18 . Additionally, since these data were collected independently through published reports of MERS-CoV occurrence, they may be used to build upon existing notification data 26,27 . Our ability to capture occurrences in this dataset is contingent on the data contained within published literature. Therefore, this dataset does not represent a total count of all cases. Instead, this dataset's value lies within its geo-precision. Data were extracted with a focus on obtaining the highest resolution possible. These data may be merged with other datasets, such as WHO 26 or OIE 27 surveillance records, and are intended to complement, not replace, these resources. Together, published reports and notification data can provide a more comprehensive snapshot of current disease extent and at-risk locations.
An important consideration, whether using the literature data alone, or in combination with other databases, is the potential for duplication. Various pieces of metadata can be used to evaluate where potential duplicates could lie, such as common date fields (month_start, month_end, year_start, year_end) or consistent geographic details (lat, long, poly_id, shape_type) or shared epidemiological tags (patient_type). Researchers may wish to consider further steps, such as fuzzy matching of geographic data (e.g. matching a point with an overlapping buffer) or temporal data (e.g. matching a precise month with an overlapping month interval). We acknowledge this duplicate-removal process will not catch all matching records, but it will likely catch several. We recommend Occurrences are layered from top to bottom in the following order: Index (green), Unspecified (orange), Mammal (yellow), Import (blue), Secondary (purple). Points were plotted using their assigned latitudes and longitudes, and shape files were created for polygons. Buffers were also plotted using assigned latitudes and longitudes, after which each buffer's custom radius was drawn. Higher resolution geographies (points, buffers, governorates) were plotted on top of lower resolution geographies (countries, regions).
www.nature.com/scientificdata www.nature.com/scientificdata/ this approach because it will allow researchers to remove several duplicates without erroneously deleting any two occurrences that are truly unique (i.e. not duplicates). Essentially, we recommend a sensitive approach above a more specific approach, as the latter simply risks culling too many records that aren't actually duplicates.
When merging with other databases, consistency in metadata tagging is essential. For the WHO Disease Outbreak News data feed 26,27 for instance, nomenclature for case definitions is slightly different, with WHO definitions of "Community Acquired" and "Not Reported" comparable to "Index" and "Unspecified" respectively. In addition, it is important to recognize what information is beyond the scope of these additional databases. Again, when comparing to the WHO dataset, it is important to recognize that serologically positive cases do not meet the case definition used in the WHO database. These adjustments need to be identified on a dataset-to-dataset basis. among cases tagged as Index or unspecified. Occurrences tagged as Index are coloured green, those tagged as unspecified are coloured orange. Points were plotted using their assigned latitudes and longitudes, and shape files were created for polygons. Buffers were also plotted using assigned latitudes and longitudes, after which each buffer's custom radius was drawn. Higher resolution geographies (points, buffers, governorates) were plotted on top of lower resolution geographies (countries, regions). Points were plotted using their assigned latitudes and longitudes, and shape files were created for polygons. Buffers were also plotted using assigned latitudes and longitudes, after which each buffer's custom radius was drawn. Higher resolution geographies (points, buffers, governorates) were plotted on top of lower resolution geographies (countries, regions).
www.nature.com/scientificdata www.nature.com/scientificdata/ This database can be combined with other covariates (e.g. satellite imagery) to produce environmental suitability models of MERS-CoV infection risk and potential spillover on both global and regional scales as achieved with other exemplar datasets [28][29][30][31] . This information can be useful in resource allocation aimed at improving disease surveillance and contribute towards a better understanding of the factors facilitating continued emergence of index cases.
The addition of sampling techniques and prevalence data may improve this dataset. Researchers were ultimately unable to add these data due to inconsistencies in the way literature reported sampling techniques and prevalence date by geography. An attempt to extract these data using the current approach would have led to sporadic inclusion of this information and would not have been comprehensive for the entire dataset. Moving forward, we recommend authors report sampling technique and prevalence data at the highest resolution geography possible, as seen in Miguel et al. 32 . We encourage continued presentation of paired epidemiological and geographic metadata that would allow for more detailed analysis in the future.
This database may also be utilized in clinical settings to provide an evidence-base for diagnoses when used in conjunction with patient travel histories. Additionally, it can be used to identify geographies for surveillance, particularly areas where MERS-CoV has been documented in animals but not humans (e.g. Ethiopia and Nigeria). Identifying locations for surveillance will, in turn, inform global health security. While models will increase the resolution at which these questions can be addressed, datasets such as this provide an initial baseline.
A major limitation of this database is the potential for sampling bias, which stems from higher frequency of disease reporting within countries where there exists strong healthcare infrastructure and reporting systems. This database does not attempt to account for such biases, which must be addressed in subsequent modelling activities where such biases are of consequence. Similarly, another limitation is potential duplicate documentation of singular occurrences. This can happen when the same occurrence is assigned different geographies (e.g. point, polygon) in multiple publications. Even though extractors made efforts to prospectively manually identify duplicate occurrences, this was challenging because the process relied upon papers providing sufficient details for extractors to determine a duplicate occurrence (e.g. geography, patient demographics, dates of occurrence, diagnostic methods, etc.). However, the majority of papers did not report such details for each occurrence. In those instances, it was impossible for extractors to discern whether occurrences may have been duplicates from a previous artic le. Even case studies inconsistently reported patient details and demographic information. These are some examples of challenges faced by extractors when we attempted to identify duplicates. Without sufficient contextual clues, extractors lacked evidence to determine duplicity and thus likely extracted some unique occurrences more than once. Despite efforts to remove duplicate occurrences from the database, it is possible that some remain.
Geographic uncertainty is similarly problematic for analyses such as this. In some cases, polygons, as opposed to points, are utilised as a geographic frame of reference, reflecting the uncertainty in geotagging in the articles themselves. For some occurrences, there is a strong assumption that the geography listed corresponds to the site of infection. While the use of 5 km × 5 km as the minimum geographical unit allows for some leeway in this precision, it is possible that even with the point data (often corresponding to household clusters) these may not map directly with true infection sites. This must be considered in any subsequent geospatial analysis.
Finally, this database represents a time-bounded survey of the literature. While all efforts were made to be comprehensive within this period, articles, and therefore data, will continue to be published. Efforts to streamline ongoing collection processes are still to be fully realized 33 . Regardless, we hope that this dataset provides a solid baseline for further iteration.