Background & Summary

Arboviral diseases are a global health concern due to their rapid geographic spread. These diseases are caused by arboviruses, which are transmitted by arthropod vectors such as the mosquitoes Aedes aegypti and Aedes albopictus. Arboviruses are more commonly found in tropical countries whose climates favour viral amplification and transmission1. Among these diseases, Dengue, Chikungunya, Yellow Fever, and, more recently, Zika are the most prominent because of their relatively high case numbers. Over the past thirty years, the spread and impact of these diseases on public health have increased dramatically2. Furthermore, there is evidence that COVID-19 intervention measures, such as lockdowns, have contributed to an increase in arbovirus cases3.

The spread of Dengue in recent decades has been dramatic. In 2019, the WHO Region of the Americas recorded the highest number of Dengue cases in history4. Brazil has the highest absolute number of Dengue and Chikungunya cases worldwide5,6. These two diseases are the most common arboviral diseases in the country, and both reached historical peaks in recent years. For example, in 2019 Dengue peaked at 2,248,570 reported cases and 840 deaths5. In 2016, Brazil reported 558,542 cases of Chikungunya, the highest number to date6.

The correct diagnosis of arboviral infections is a significant challenge. According to the Pan American Health Organization (PAHO)5,6, only about half of reported cases are confirmed, with the remainder being treated as suspected cases. This is due to the co-circulation of these diseases and the high similarity between the symptoms of Dengue and Chikungunya, which makes clinical diagnosis difficult. In the absence of point-of-care virus-specific testing, even experienced and well-trained physicians may misdiagnose an arbovirus infection7. Rapid tests, especially for Dengue, are effective in confirming the disease but only up to the fifth day post-infection. After this period, such tests have a high error rate, so laboratory tests are required. Unfortunately, laboratory testing requires technical equipment that is not widely available throughout Brazil. In addition, laboratory testing is itself subject to misdiagnosis due to co-infection and cross-reaction among the various arboviruses found in the country8. Such misdiagnosis can result in a wide range of negative outcomes, including inadequate or inappropriate treatment. Indeed, despite arboviral diseases being notifiable in Brazil and the public sector being the primary health service provider for over 70% of the population, relatively few confirmatory tests are carried out7. According to the Brazilian Ministry of Health9, “only approximately 23% were tested in reference laboratories”.

Given that Brazil is hyper-endemic for arboviruses, the amount of patient data collected is very large. For example, almost 1.5 million cases of Dengue were reported to the Brazilian Information System for Notifiable Diseases (Sistema de Informação de Agravo de Notificação, SINAN) in 2020. This represents a significant source of information both for epidemiological analysis and for training and optimising machine learning models for health purposes. The objective of this work is to make available a Brazilian national data set with clinical, laboratory, and socio-demographic data on confirmed, discarded, and inconclusive cases of Dengue and Chikungunya, so that this data can be used for future research, such as the development of machine learning models that help to classify these patients correctly. A high-level epidemiological analysis of the data set is also presented.

Methods

The data was collected from the Brazilian Information System for Notifiable Diseases, Sistema de Informação de Agravo de Notificação (SINAN), http://portalsinan.saude.gov.br/. The data set comes from a public data repository and, according to current Brazilian law, does not require ethics committee approval. SINAN collates case notification data for diseases on the national list of compulsory notification of diseases, injuries and public health events, https://bvsms.saude.gov.br/bvs/saudelegis/gm/2020/prt0264_19_02_2020.html. This includes Dengue and Chikungunya. The data contains notifications of Dengue and Chikungunya cases that occurred in Brazil, including all 26 states and the Federal District (Brasília), between 2013 and 2020. Dengue-related data contains clinical data (pre-existing symptoms and comorbidities), laboratory tests performed, and socio-demographic data for each case. With the exception of one hundred records, Chikungunya-related data contains only socio-demographic data. No explanation of why only one hundred Chikungunya records contain clinical and laboratory test data was provided with the data. It is possible that these cases were treated as suspected cases of Dengue and only later confirmed as cases of Chikungunya; however, this has not been confirmed. These cases are included in the data set summary in Table 6. For both data sets, no individually identifiable health information is made available.

Figure 1 presents the preprocessing steps used for cleaning the data set. First, the SINAN data from all states was aggregated, resulting in 13,421,230 notifications and 118 attributes. The records were then divided into three groups according to the CLASSI_FIN attribute:

  • Dengue: Patients with confirmed Dengue;

  • Chikungunya: Patients with confirmed Chikungunya; and

  • Discarded/Inconclusive: Patients who tested negative or inconclusive for Dengue or Chikungunya following laboratory tests.
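The grouping step above can be sketched as follows. This is a minimal illustration rather than the authors' code: the CLASSI_FIN codes in the mapping are hypothetical placeholders and would need to be replaced with the values given in the SINAN data dictionary.

```python
from collections import defaultdict

# Hypothetical CLASSI_FIN codes (assumption); replace them with the values
# listed in the SINAN data dictionary before using this on real data.
CLASS_GROUPS = {
    "1": "Dengue",
    "2": "Chikungunya",
    "3": "Discarded/Inconclusive",
}


def group_by_final_classification(records):
    """Split notification records into the three groups by CLASSI_FIN."""
    groups = defaultdict(list)
    for record in records:
        label = CLASS_GROUPS.get(record.get("CLASSI_FIN"))
        if label is not None:  # records with any other code are left out
            groups[label].append(record)
    return dict(groups)


# Toy records for illustration only.
sample = [
    {"ID": 1, "CLASSI_FIN": "1"},
    {"ID": 2, "CLASSI_FIN": "3"},
    {"ID": 3, "CLASSI_FIN": "2"},
    {"ID": 4, "CLASSI_FIN": "1"},
    {"ID": 5, "CLASSI_FIN": None},
]
grouped = group_by_final_classification(sample)
for label, rows in grouped.items():
    print(label, len(rows))
```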

Fig. 1 Pre-processing steps performed to build the final data set.

Only notifications that were (a) confirmed or (b) discarded/inconclusive following clinical diagnosis were selected. For the confirmation criteria, we used the Brazilian Ministry of Health (MS) definitions, available at https://bvsms.saude.gov.br/bvs/publicacoes/diretriz_nacionais_prevencao_controle_dengue.pdf. After this step, the attribute used for the filter (CRITERIO) was removed, since it then contained only a single value. The attribute TP_NOT identifies the type of notification generated. As all notifications are of the “Individual” type, TP_NOT has the same value for all records and therefore carries no information. Attributes that had more than 60% null data or that were not in the original data dictionary were also removed. Null fields in the remaining attributes were filled with the default value, “not informed”, as per the data dictionary. Categorical data was then transformed into numerical data. Table 1 lists all the attributes removed during preprocessing.
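The attribute-level cleaning described above might look like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: all attribute names except TP_NOT are made up, dropping every single-valued attribute is one reading of the text, and the integer encoding scheme is an assumption because the text does not specify how categories were mapped to numbers.

```python
NULL_THRESHOLD = 0.60           # attributes with more than 60% nulls are dropped
DEFAULT_VALUE = "not informed"  # fill value named in the text


def clean(records, dictionary_attributes):
    """Drop unusable attributes, then fill the remaining null fields."""
    if not records:
        return []
    n = len(records)
    kept = []
    for attr in dictionary_attributes:
        values = [r.get(attr) for r in records]
        null_share = sum(v is None for v in values) / n
        is_constant = len(set(values)) == 1  # e.g. TP_NOT
        if null_share <= NULL_THRESHOLD and not is_constant:
            kept.append(attr)
    # Attributes absent from the data dictionary are never copied over.
    return [
        {a: (r.get(a) if r.get(a) is not None else DEFAULT_VALUE) for a in kept}
        for r in records
    ]


def encode(records):
    """Map each attribute's categories to integer codes (assumed scheme)."""
    codebook = {}
    encoded = []
    for r in records:
        row = {}
        for attr, value in r.items():
            codes = codebook.setdefault(attr, {})
            row[attr] = codes.setdefault(value, len(codes))
        encoded.append(row)
    return encoded, codebook


# Toy records; SYMPTOM_A, SEX, RARE_ATTR and EXTRA_ATTR are hypothetical names.
raw = [
    {"TP_NOT": "2", "SYMPTOM_A": "1", "SEX": "F", "RARE_ATTR": None, "EXTRA_ATTR": "x"},
    {"TP_NOT": "2", "SYMPTOM_A": None, "SEX": "M", "RARE_ATTR": None, "EXTRA_ATTR": "y"},
    {"TP_NOT": "2", "SYMPTOM_A": "2", "SEX": "F", "RARE_ATTR": "1", "EXTRA_ATTR": "z"},
]
dictionary = ["TP_NOT", "SYMPTOM_A", "SEX", "RARE_ATTR"]

cleaned = clean(raw, dictionary)
encoded, codebook = encode(cleaned)
print(cleaned)
print(encoded)
```

Returning the codebook alongside the encoded rows keeps the mapping reversible, which helps when results need to be reported in the original categories.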

Table 1 Attributes removed after preprocessing.

At the end of the process, the data set consisted of 4,307,513 records for Dengue, 325,000 records for Chikungunya, and 2,100,029 records for the Discarded/Inconclusive category.
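For readers planning classification experiments, the class balance implied by these counts can be derived directly. The short sketch below uses only the figures quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the text.
raw_notifications = 13_421_230
final_counts = {
    "Dengue": 4_307_513,
    "Chikungunya": 325_000,
    "Discarded/Inconclusive": 2_100_029,
}

total = sum(final_counts.values())
retained_share = total / raw_notifications
class_shares = {label: count / total for label, count in final_counts.items()}

print(f"final records: {total:,} ({retained_share:.1%} of raw notifications)")
for label, share in class_shares.items():
    print(f"{label}: {share:.1%}")
```

Chikungunya accounts for under 5% of the final records, so the classes are markedly imbalanced.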

Data Records

The processed data set, as well as the raw data, is available in Mendeley Data10 at https://data.mendeley.com/datasets/2d3kr8zynf/4. Figure 2 presents the number of records in the data set by category (Dengue, Chikungunya, Discarded/Inconclusive) in Brazil from 2013 to 2020. Dengue case numbers in 2013, 2015, 2016, and 2019 were comparatively high11,12. In 2017, confirmed cases of both Dengue and Chikungunya dropped to similar levels (120,753 cases of Dengue and 113,087 cases of Chikungunya).

Fig. 2 Number of records in the data set by category (Dengue, Chikungunya, Discarded/Inconclusive) in Brazil per year.

Figure 3 shows the age structure of the cases reported in this data set, divided into three categories: young people, adults and the elderly. The youth category includes individuals up to 18 years of age. The adult category covers individuals aged between 20 and 59 years. Finally, the elderly category comprises individuals aged 60 and over. In every year, the highest incidence of Dengue, Chikungunya or Inconclusive cases is in the adult category.
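The age categories can be expressed as a small helper. This sketch follows the boundaries exactly as stated in the text; the function name and category labels are illustrative, and ages that fall outside the stated ranges are returned as None rather than guessed.

```python
def age_category(age_years):
    """Return the age category described in the text, or None if not covered."""
    if age_years is None or age_years < 0:
        return None
    if age_years <= 18:
        return "young"
    if 20 <= age_years <= 59:
        return "adult"
    if age_years >= 60:
        return "elderly"
    return None  # ages between the stated ranges (e.g. 19) are left unassigned


for age in (5, 18, 19, 20, 59, 60, 85):
    print(age, age_category(age))
```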

Fig. 3 Age structure of individuals in cases of Dengue, Chikungunya and Inconclusive.

Figures 4–6 present heat maps of the number of Dengue, Chikungunya and discarded/inconclusive cases, respectively, by state and year. In these figures, the more intense the color, the greater the number of cases. Most Dengue cases (Fig. 4) occurred in the Southeast and Midwest of the country, more specifically in the states of Minas Gerais (MG), Goiás (GO) and São Paulo (SP). In 2015, SP had the highest number of Dengue cases in a single state, with more than 360,000 reported cases. This could reflect its population size and density.
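The state-by-year counts behind such heat maps can be computed with a simple tally. The sketch below is illustrative only: the field names STATE and YEAR are placeholders and are not the attribute names used in the data set.

```python
from collections import Counter


def count_by_state_and_year(records, state_key="STATE", year_key="YEAR"):
    """Tally records per (state, year) pair."""
    return Counter((r[state_key], r[year_key]) for r in records)


def to_matrix(counts):
    """Arrange the tallies as rows of states and columns of years."""
    states = sorted({state for state, _ in counts})
    years = sorted({year for _, year in counts})
    matrix = [[counts.get((s, y), 0) for y in years] for s in states]
    return states, years, matrix


# Toy records for illustration only.
sample = [
    {"STATE": "SP", "YEAR": 2015},
    {"STATE": "SP", "YEAR": 2015},
    {"STATE": "MG", "YEAR": 2016},
    {"STATE": "GO", "YEAR": 2015},
]
states, years, matrix = to_matrix(count_by_state_and_year(sample))
for state, row in zip(states, matrix):
    print(state, dict(zip(years, row)))
```

The resulting matrix can be passed to any plotting library to draw the heat map, with one such matrix per case category.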

Fig. 4 Occurrence of confirmed cases of Dengue by Brazilian state.
Fig. 5 Occurrence of confirmed cases of Chikungunya by Brazilian state.
Fig. 6 Occurrence of discarded/inconclusive cases of Dengue and Chikungunya by Brazilian state.

Chikungunya emerged in 2013 in the Americas13. Following the reporting of the first locally transmitted Chikungunya infection in Brazil in September 2014, the disease spread rapidly across the country13. Consistent with this timeline, the data set includes Chikungunya data for the years 2015 to 2020. Figure 5 illustrates the spread of Chikungunya in Brazil from the confirmation of initial autochthonous cases in Ceará (CE) in the Northeast to a major outbreak in Rio de Janeiro in 2018 and 2019.

Figure 6 shows discarded/inconclusive cases of Dengue and Chikungunya. Firstly, the number of cases is high in the states of CE and Pernambuco (PE) from 2015 to 2017, most likely reflecting the emergence of Chikungunya and the associated difficulties in diagnosing the disease accurately14. This data raises questions regarding the quality of the surveillance system in these areas. For example, greater numbers of discarded/inconclusive cases in certain areas may indicate that the health and surveillance infrastructure there is weaker than in other states. Secondly, as with Dengue, most cases in this category are located in the states of SP and MG. Indeed, SP is the state with the highest number of cases in 2015, 2016 and 2019.

The final data set is composed of 56 attributes, which are grouped according to Fig. 7 and detailed in Tables 2, 3, 4 and 5. Demographic, epidemiological and clinical (symptoms, signs and comorbidities) data were grouped as resource-limited attributes as per Lee et al.15. Laboratory attributes (serological) and others are grouped as well-resourced attributes because they require specific equipment to be performed. The specific equipment used is not recorded in the data set.

Fig. 7 Attributes in the final data set.

Table 2 Socio-demographic data.
Table 3 Clinical data – Symptoms.
Table 4 Clinical data – Comorbidities.
Table 5 Laboratory data.

Socio-demographic data (Table 2) includes age, sex, gestational age, race, and area of residence, amongst others.

Symptoms relate to specific physical features which can indicate the existence of a disease. As per Table 3, the data set contains 13 symptoms.

Comorbidities are preexisting conditions in the patient. Table 4 presents the clinical data with information about comorbidities.

Table 5 presents the attributes for laboratory data. This data comprises results from serological and other tests. It also contains data on whether the patient was hospitalised as well as the final patient classification.

The general and disease baseline characteristics are shown in Table 6. Baseline characteristics show an overall mean (SD) age over 30 years and a predominance of women for each arboviral disease. Fever (37.3%), headache (34.5%), and myalgia (34%) were the most frequent symptoms. It is important to highlight that, because most records of confirmed Chikungunya cases contain no symptom data, their inclusion directly lowers the overall percentage of each symptom.
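The effect of records without symptom data on the overall percentages can be illustrated with a small calculation. This is a sketch under assumptions: the field name `fever` and the values `"yes"`/`"no"` are hypothetical, and None stands for a record with no symptom information.

```python
def prevalence(records, symptom, present="yes"):
    """Share of records reporting a symptom, computed over all records and
    over only those records in which the symptom field is filled in."""
    informed = [r for r in records if r.get(symptom) is not None]
    n_present = sum(r[symptom] == present for r in informed)
    overall = n_present / len(records) if records else 0.0
    among_informed = n_present / len(informed) if informed else 0.0
    return overall, among_informed


# Toy records: the last one mimics a record with no symptom data.
sample = [
    {"fever": "yes"},
    {"fever": "yes"},
    {"fever": "no"},
    {"fever": None},
]
overall, among_informed = prevalence(sample, "fever")
print(f"overall: {overall:.1%}, among records with symptom data: {among_informed:.1%}")
```

Reporting both denominators makes it clear how much of a low overall percentage is due to missing symptom data rather than to the absence of the symptom.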

Table 6 General and disease baseline characteristics.

Technical Validation

All data presented in this work can be corroborated by reports published by the Ministry of Health of Brazil.

Usage Note

Robert et al.16 discuss the emergence of Dengue and related arboviruses (Zika and Chikungunya) in Córdoba, Argentina, and present a data set with records relating to the transmission of Dengue, Chikungunya and Zika. This data set comprises data from 2009 to 2018, including known data on circulating dengue virus (DENV) serotypes and the origins of imported cases. López et al.17 investigated the Dengue outbreak in Santa Fé, Argentina. This city has a temperate climate and experienced an increase in Dengue cases and virus circulation from 2009. Santa Fé experienced the largest outbreak in Argentina to date. The intention of the authors of both papers was to support further research into the factors and patterns of arbovirus emergence and transmission.

In line with Robert et al.16 and López et al.17, the data set presented in this work expands the data available to researchers on the emergence and transmission of two arboviruses, Dengue and Chikungunya. It therefore complements these works and advances progress towards a potential international arbovirus data set, as suggested by Robert et al.16.

Arboviruses are hyperendemic in Brazil. The social, environmental and climate conditions in Brazil, combined with disordered urban growth and population migration, have escalated the public health risk presented by arboviruses. The COVID-19 pandemic and prolonged economic crisis are hampering efforts to control negative outcomes from these diseases3. These factors make it difficult to combat and prevent these diseases in the country, as well as to understand how the viruses behave and spread. Although complete data is not available for all arboviruses, the data presented here can help in the fight against Dengue and Chikungunya, and assist in addressing misdiagnosis as experienced during the 2015 Zika epidemic14. For example, it can provide data to develop low-cost decision support tools for the differential diagnosis of these diseases. In particular, this data may be used as both training and test data sets for machine learning and deep learning models for binary and multi-class classification and prediction.
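One way to derive training and test sets from the three categories is a stratified split, which keeps the class proportions the same in both sets. The sketch below is only an illustration: stratification is a choice made here rather than a procedure described in this work, and the `label` field name and toy records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, label_key, test_fraction=0.2, seed=0):
    """Split records into train and test sets, class by class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)
    train, test = [], []
    for rows in by_label.values():
        rows = rows[:]  # copy so the caller's list order is untouched
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Toy multi-class data: 10 Dengue, 5 Chikungunya, 5 Discarded/Inconclusive.
sample = (
    [{"id": i, "label": "Dengue"} for i in range(10)]
    + [{"id": i, "label": "Chikungunya"} for i in range(10, 15)]
    + [{"id": i, "label": "Discarded/Inconclusive"} for i in range(15, 20)]
)
train, test = stratified_split(sample, "label")
print(len(train), len(test))
print(Counter(r["label"] for r in test))
```

For a binary task, the labels can be remapped before splitting, for example by keeping only the Dengue and Chikungunya records.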