Arboviral disease record data - Dengue and Chikungunya, Brazil, 2013–2020

Arboviruses are one of the main categories of Neglected Tropical Diseases (NTDs), and Dengue and Chikungunya are the most common among them. Arboviruses mainly affect tropical countries, and Brazil has the largest absolute number of cases in Latin America. This work presents a unified data set with clinical, sociodemographic, and laboratory data on patients with confirmed Dengue or Chikungunya, as well as patients for whom infection with these diseases was ruled out. The data is based on case notifications submitted to the Brazilian Information System for Notifiable Diseases (in Portuguese, Sistema de Informação de Agravo de Notificação, SINAN) from 2013 to 2020. The original data set comprised 13,421,230 records and 118 attributes. After pre-processing, a final data set of 7,632,542 records and 56 attributes was generated. The data presented in this work will assist researchers in investigating antecedents of arbovirus emergence and transmission in general, and of Dengue and Chikungunya in particular. Furthermore, it can be used to train and test machine learning models for differential diagnosis and multi-class classification. Measurement(s): clinical data. Technology Type(s): interview.


Background & Summary
Arboviral diseases are a global health concern due to their rapid geographic spread. These diseases are transmitted by arthropod vectors such as the mosquitoes Aedes aegypti and Aedes albopictus. The viruses involved, known as arboviruses, are more commonly found in tropical countries whose climates favour viral amplification and transmission 1 . Among these diseases, Dengue, Chikungunya, Yellow Fever, and, more recently, Zika are the most prominent due to their relatively high case numbers. Over the past thirty years, the spread and impact of these diseases on public health have increased dramatically 2 . Furthermore, there is evidence that COVID-19 intervention measures, such as lockdowns, have contributed to an increase in arbovirus cases 3 .
The spread of Dengue in recent decades has been dramatic. In 2019, the WHO Region of the Americas recorded the highest number of Dengue cases in its history 4 . Brazil has the highest absolute number of cases of Dengue and Chikungunya worldwide 5,6 . These two diseases are the most common arboviral diseases in the country, and both reached historical peaks in recent years. For example, reported cases and deaths due to Dengue reached a peak of 2,248,570 cases and 840 deaths in 2019 5 . In 2016, there were 558,542 reported cases of Chikungunya in Brazil, the highest number reported to date 6 .
The correct diagnosis of arboviruses is a significant challenge. According to the Pan American Health Organization (PAHO) 5,6 , only about half of reported cases are confirmed, with the remainder being treated as suspected cases. This is due to the concurrent circulation of these diseases and the strong similarity between the symptoms of Dengue and Chikungunya, which makes clinical diagnosis difficult. In the absence of point-of-care virus-specific testing, even experienced and well-trained physicians may misdiagnose an arbovirus infection due to the similarity in symptoms 7 . Rapid tests, especially for Dengue, are effective in confirming the disease but only up to the fifth day post-infection. After this period, such tests have a high error rate, so laboratory tests are required. Unfortunately, laboratory testing requires technical equipment that is not widely available throughout Brazil. In addition, laboratory testing is also subject to misdiagnosis due to co-infection and cross-reaction among the various arboviruses found in the country 8 . Such misdiagnosis can result in a wide range of negative outcomes, including inadequate or inappropriate treatment. Indeed, despite arboviruses being notifiable diseases in Brazil and the public sector being the primary health service provider for over 70% of the population, relatively few confirmatory tests are carried out 7 . According to the Brazilian Ministry of Health 9 , "only approximately 23% were tested in reference laboratories".
Given that Brazil is hyper-endemic for arboviruses, the amount of patient data collected is very large. For example, almost 1.5 million cases of Dengue were reported to the Brazilian Information System for Notifiable Diseases (in Portuguese, Sistema de Informação de Agravo de Notificação, SINAN) in 2020. As such, this represents a significant source of information both for epidemiological analysis and for training and optimising machine learning models for health purposes. The objective of this work is to make available a Brazilian national data set with clinical, laboratory, and socio-demographic data on confirmed, discarded, and inconclusive cases of Dengue and Chikungunya so that this data can be used for future research, such as the development of machine learning models that help to classify these patients correctly. A high-level epidemiological analysis of the data set is also presented.

Methods
The data was collected from the Brazilian Information System for Notifiable Diseases, Sistema de Informação de Agravo de Notificação (SINAN) http://portalsinan.saude.gov.br/. The data set is from a public data repository and, according to current Brazilian laws, there is no need for ethics committee approval. SINAN collates case notification data for diseases present on the national list of compulsory notification of diseases, injuries and public health events https://bvsms.saude.gov.br/bvs/saudelegis/gm/2020/prt0264_19_02_2020.html. This includes Dengue and Chikungunya. The data contains notifications of Dengue and Chikungunya cases that occurred in Brazil, including all 26 states and the Federal District (Brasília), between 2013 and 2020. Dengue-related data contains clinical data (pre-existing symptoms and comorbidities), laboratory tests performed, and socio-demographic data for each case. With the exception of one hundred records, Chikungunya-related data contains only socio-demographic data. No explanation of why only one hundred Chikungunya records contain clinical and laboratory test data was provided with the data. It is possible that these cases were treated as suspected cases of Dengue and only later confirmed as cases of Chikungunya; however, this has not been confirmed. These cases are included in the data set summary in Table 6. For both diseases, no individually identifiable health information is made available in the data set.

Table 1. Attributes removed after preprocessing.

Figure 1 presents the preprocessing steps used for cleaning the data set. First, the SINAN data from all states were aggregated, resulting in 13,421,230 notifications and 118 attributes. The records were divided into three distinct groups by the CLASSI_FIN attribute:

• Dengue: Patients with confirmed Dengue;
• Chikungunya: Patients with confirmed Chikungunya; and
• Discarded/Inconclusive: Patients who tested negative or inconclusive for Dengue or Chikungunya following laboratory tests.
Only notifications that were (a) confirmed or (b) discarded/inconclusive following clinical diagnosis were selected. For confirmation criteria, we used the Brazilian Ministry of Health (MS) definitions, which can be found here: https://bvsms.saude.gov.br/bvs/publicacoes/diretriz_nacionais_prevencao_controle_dengue.pdf. After this step, the attribute used for the filter (CRITERIO) was also removed, since it then contained only a single value. The attribute TP_NOT identifies the type of notification generated. As all notifications are of the "Individual" type, the TP_NOT attribute has the same value for all records. Attributes that had more than 60% null data or that were not in the original data dictionary were also removed. Remaining null fields were filled with the default value, "not informed", as per the data dictionary. Categorical data was also transformed to numerical data. Table 1 shows all the attributes removed during preprocessing.
At the end of the process, the data set consisted of 4,307,513 records for Dengue, 325,000 records for Chikungunya, and 3,000,029 records for the Discarded/Inconclusive category, giving 7,632,542 records in total.
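The column-level cleaning steps described above (dropping attributes that are absent from the data dictionary, dropping attributes with more than 60% null values, and filling the remaining nulls with the "not informed" default) can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code, which is available at the repository given under Code availability. The function names, the representation of records as dictionaries, and the numeric code used for "not informed" are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Set

Record = Dict[str, Optional[Any]]

NULL_THRESHOLD = 0.60  # from the text: attributes with more than 60% nulls are dropped
NOT_INFORMED = 9       # ASSUMPTION: hypothetical code standing in for "not informed"


def null_fraction(records: List[Record], attribute: str) -> float:
    """Share of records in which `attribute` is missing or None."""
    missing = sum(1 for r in records if r.get(attribute) is None)
    return missing / len(records)


def preprocess(records: List[Record], dictionary: Set[str]) -> List[Record]:
    """Drop undocumented or sparse attributes, then fill the remaining nulls.

    `dictionary` is the set of attribute names listed in the data dictionary.
    """
    # Collect every attribute name that appears in at least one record.
    attributes = {name for r in records for name in r}
    # Keep attributes that are documented and not too sparse.
    kept = sorted(
        name for name in attributes
        if name in dictionary and null_fraction(records, name) <= NULL_THRESHOLD
    )
    # Rebuild each record with the kept attributes, replacing nulls.
    return [
        {name: (r.get(name) if r.get(name) is not None else NOT_INFORMED)
         for name in kept}
        for r in records
    ]
```

The threshold is applied per attribute across all records, so an attribute that is well populated for one disease but empty for another may still be dropped, depending on the overall share of nulls.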

Data Records
The processed data set, as well as the raw data, is available in Mendeley Data 10 and can be found via the link https://data.mendeley.com/datasets/2d3kr8zynf/4. Figure 2 presents the number of records in the data set by category (Dengue, Chikungunya, Discarded/Inconclusive) in Brazil from 2013 to 2020. Dengue infections in 2013, 2015, 2016, and 2019 were comparatively high 11,12 . In 2017, there was a drop in confirmed cases of both Dengue and Chikungunya in the country to similar levels for both diseases (120,753 cases of Dengue and 113,087 cases of Chikungunya).

Figure 3 shows the age structure of the cases reported in this data set, divided into three categories: young people, adults and the elderly. The youth category includes individuals up to 19 years of age. The adult category is for individuals aged between 20 and 59 years. Finally, the elderly category comprises individuals aged 60 and over. In every year, the highest incidence of Dengue, Chikungunya or Inconclusive cases is in the adult category.

Figures 4, 5 and 6 present heat maps of the number of Dengue, Chikungunya and discarded/inconclusive cases, respectively, by state and year. In these figures, the more intense the color, the greater the number of cases of each disease. Most Dengue cases (Fig. 4) occurred in the Southeast and Midwest of the country, more specifically in the states of Minas Gerais (MG), Goiás (GO) and São Paulo (SP). In 2015, SP had the highest number of cases of Dengue in a single state, with more than 360,000 reported cases. This could reflect its population size and density.
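The three age categories used for Figure 3 can be expressed as a small helper, which is convenient when reproducing the age structure from the raw ages. The sketch below is illustrative only. It assumes that ages are given in completed years and that every age below the adult lower bound of 20 belongs to the youth category; the function names and the (year, age) input format are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

ADULT_MIN_AGE = 20    # from the text: adults are aged 20 to 59
ELDERLY_MIN_AGE = 60  # from the text: the elderly are aged 60 and over


def age_band(age_years: int) -> str:
    """Map an age in completed years to 'youth', 'adult' or 'elderly'."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < ADULT_MIN_AGE:
        return "youth"  # ASSUMPTION: everyone below the adult band is youth
    if age_years < ELDERLY_MIN_AGE:
        return "adult"
    return "elderly"


def count_by_year_and_band(cases: Iterable[Tuple[int, int]]) -> Counter:
    """Count cases per (notification year, age band).

    `cases` yields hypothetical (year, age_years) pairs.
    """
    return Counter((year, age_band(age)) for year, age in cases)
```

Counting by (year, band) pairs gives the values needed for a grouped bar chart per year; the same pattern with (state, year) keys would give the cell values of a state-by-year heat map.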
Chikungunya emerged in the Americas in 2013 13 . Following the reporting of the first locally transmitted Chikungunya infection in Brazil in September 2014, the disease rapidly spread across Brazil 13 . Consistent with this timeline, the data set includes Chikungunya data for the years 2015 to 2020. Figure 5 illustrates the spread of Chikungunya in Brazil from the confirmation of initial autochthonous cases in Ceará (CE) in the Northeast to a major outbreak in Rio de Janeiro in 2018 and 2019. Figure 6 shows discarded/inconclusive cases of Dengue and Chikungunya. Firstly, the number of cases is high in the states of CE and Pernambuco (PE) from 2015 to 2017, most likely reflecting the emergence of Chikungunya and associated difficulties in diagnosing the disease accurately 14 . This data raises questions regarding the quality of the surveillance system in these areas. For example, greater numbers of discarded/inconclusive cases in certain areas may indicate that the health and surveillance infrastructure in these areas is weaker than in other states. Secondly, similar to Dengue, most of these cases are located in the states of SP and MG. Indeed, SP is the state with the highest number of cases in 2015, 2016 and 2019.
The final data set is composed of 56 attributes that are grouped according to Fig. 7 and are detailed in Tables 2, 3, 4 and 5. Demographic, epidemiological and clinical (symptoms, signs and comorbidities) data were grouped as resource-limited attributes as per Lee et al. 15 . The data set does not specify the equipment used. Laboratory attributes (serological) and others are grouped as well-resourced attributes because they require specific equipment to be performed.
Socio-demographic data (Table 2) includes age, sex, gestational age, race, and area of residence, amongst others.
Symptoms relate to specific physical features which can indicate the existence of a disease. As per Table 3, the data set contains 13 symptoms.
Comorbidities are preexisting conditions in the patient. Table 4 presents the clinical data with information about comorbidities. Table 5 presents the attributes for laboratory data. This data comprises results from serological and other tests. It also contains data on whether the patient was hospitalised as well as the final patient classification.
The general and disease baseline characteristics are shown in Table 6. Baseline characteristics show an overall mean (SD) age over 30 years and a predominance of women for each arboviral disease. Fever (37.3%), headache (34.5%), and myalgia (34%) were the most frequent symptoms. It is important to highlight that, for confirmed cases of Chikungunya, the absence of symptoms in the records directly affects the overall percentage of these symptoms.
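The effect of records without symptom data on the overall symptom percentages is a matter of which denominator is used. The sketch below contrasts the two choices on toy values that are not taken from the data set. It assumes, for illustration only, that an unrecorded symptom is represented as None; the function names and the "fever" key are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[bool]]


def overall_rate(records: List[Record], symptom: str) -> float:
    """Share of ALL records in which the symptom is recorded as present.

    Assumes `records` is not empty.
    """
    present = sum(1 for r in records if r.get(symptom) is True)
    return present / len(records)


def rate_among_recorded(records: List[Record], symptom: str) -> float:
    """Share of records WITH a recorded value in which the symptom is present.

    Assumes at least one record has a recorded value.
    """
    recorded = [r for r in records if r.get(symptom) is not None]
    present = sum(1 for r in recorded if r[symptom] is True)
    return present / len(recorded)


# Toy data: two records with clinical data and two records in which the
# symptom was not recorded (as for most Chikungunya notifications).
toy = [
    {"fever": True},
    {"fever": False},
    {"fever": None},
    {"fever": None},
]
print(overall_rate(toy, "fever"))         # 0.25
print(rate_among_recorded(toy, "fever"))  # 0.5
```

In this toy case the same symptom appears in 25% of all records but in 50% of the records that carry clinical data, which is the kind of dilution described above.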

Technical Validation
All data presented in this work can be corroborated by reports published by the Ministry of Health of Brazil.

[...] serotypes and the origins of imported cases. In López et al. 17 , the Dengue outbreak in Santa Fé, Argentina was investigated. This city has a temperate climate and experienced an increase in Dengue cases and virus circulation from 2009. Santa Fé experienced the largest outbreak in Argentina to date. The intention of the authors of both papers was to support further research in understanding the factors and patterns of arbovirus emergence and transmission.
In line with Robert et al. 16 and López et al. 17 , the data set presented in this work expands the data available to researchers on the emergence and transmission of two arboviruses, Dengue and Chikungunya. In this way, it complements these works and is a step towards the potential international arbovirus data set suggested by Robert et al. 16 .
Arboviruses are hyperendemic in Brazil. The social, environmental and climate conditions in Brazil, combined with disordered urban growth and population migration, have escalated the public health risk presented by arboviruses. The COVID-19 pandemic and prolonged economic crisis are hampering efforts to control negative outcomes from these diseases 3 . These factors make it difficult to combat and prevent these diseases in the country, as well as to understand how the viruses behave and spread. Although complete data is not available for all arboviruses, the data presented here can help in the fight against Dengue and Chikungunya, and assist in addressing misdiagnosis as experienced during the Zika epidemic in 2015 14 . For example, it can provide data to develop (low cost) decision support tools for the differential diagnosis of these diseases. In particular, this data may be used as both training and test data sets for machine learning and deep learning models for binary and multi-class classification and prediction.
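When the data is used as training and test sets for multi-class classification, the three categories differ greatly in size, so a split that preserves the class proportions is a common choice. The following is a minimal sketch of such a stratified split using only the Python standard library. It is one possible approach, not a procedure prescribed by this work; the function name, the default test fraction and the seed are assumptions, and `label_key` stands for whichever attribute holds the class label (for example the final classification).

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def stratified_split(
    records: List[Record],
    label_key: str,
    test_fraction: float = 0.2,  # ASSUMPTION: illustrative default
    seed: int = 0,               # fixed seed so the split is reproducible
) -> Tuple[List[Record], List[Record]]:
    """Split records into (train, test), keeping class proportions similar."""
    rng = random.Random(seed)

    # Group the records by class label.
    by_class: Dict[Any, List[Record]] = defaultdict(list)
    for record in records:
        by_class[record[label_key]].append(record)

    train: List[Record] = []
    test: List[Record] = []
    # Shuffle within each class and send the same fraction to the test set.
    for label in sorted(by_class):
        group = list(by_class[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Because the shuffle happens inside each class, a minority class such as Chikungunya is represented in the test set in roughly the same proportion as in the full data. The labels are sorted only to make the output order deterministic, so they need to be mutually comparable (for example, all strings or all integers).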

Code availability
The code used to pre-process the data set presented in this paper is available at: https://github.com/dotlab-brazil/arbovirus-dataset-brazil.