Background & Summary

Tuberculosis (TB) is an airborne infectious disease caused by the bacillus Mycobacterium tuberculosis; globally, it is the second largest cause of morbidity and mortality by an infectious agent1,2. Historically, there has been a significant global effort to reduce the death rate of tuberculosis. However, these efforts have been compromised by the COVID-19 pandemic. Brazil has one of the highest incidences of tuberculosis worldwide and is among the 22 countries considered by the World Health Organization (WHO) as having a high burden of tuberculosis3,4. In 2019, Brazil registered 96,000 cases of the disease, with a mortality rate of 7.00%4.

The elimination of tuberculosis is a global priority, as evidenced by its inclusion in the Sustainable Development Goals. Central to reducing the transmission of TB, and ultimately to eliminating it, are the early identification of TB-infected patients, the application of infection-control measures, and early enrollment in treatment5. To this end, WHO has called for intensified research and innovation to improve early diagnosis, provide shorter and more effective treatment regimens, improve prevention, and partner on cross-sectoral actions5.

The clinical management of tuberculosis relies on the medical assessment of clinical and diagnostic information. Data on relapse, co-infection, and severity can be crucial when deciding on procedures such as pharmacological and clinical interventions. Timely intervention is vital to control the spread of the disease and to improve the patient’s prognosis and ultimate outcome. However, predicting a patient’s prognosis is a complex task, as tuberculosis has different treatment outcomes depending on the type of TB6. Answering the WHO call for innovation in early diagnosis, extant literature has proposed the application of artificial intelligence techniques, such as machine learning and deep learning models, to support the speed and efficacy of tuberculosis treatment decision-making, and specifically prognosis.

The Brazilian Information System for Notifiable Diseases (Sistema de Informação de Agravos de Notificação or SINAN) from the Brazilian Ministry of Health collects and stores data on each reported case of a notifiable disease in Brazil. This data is routinely generated by the Epidemiological Surveillance System. SINAN has a database with socio-demographic, clinical, and laboratory data on suspected tuberculosis cases that can be used to generate multiple analyses for public health planning and the assessment of disease prognosis. However, most machine learning and deep learning models applied in the literature for the treatment of tuberculosis require labeled data, that is, data that contain information about the class being predicted. This work presents an extension of the SINAN database that includes outcome data (i.e. “CURED” or “DIED”) for the period January 2001 to April 2020. The availability of such data enables researchers to create training and test data sets, and to use this data to build, evaluate, and optimise machine learning models to support the prognosis of tuberculosis in patients. Other outcomes regarding treatment adherence and relapses are also available and can be assessed. A high-level epidemiological analysis of the data set is also presented.


The original data was collected from the Information System for Notifiable Diseases (Sistema de Informação de Agravos de Notificação7) for the period from January 2001 to April 2020, and includes data from all 26 Brazilian states and the Federal District (Brasília). It contains socio-demographic, clinical and laboratory data about patients who were diagnosed with tuberculosis. While the SINAN-TB database is public, certain data is labeled sensitive and is protected by Brazil’s General Law for the Protection of Personal Data (Lei Geral de Proteção de Dados Pessoais or LGPD). Such sensitive data is only available upon request to SINAN’s ethics committee. The data used in this research does not contain any such sensitive information.

The SINAN data set was cleaned using a variety of preprocessing techniques as outlined in Fig. 1. The original data set comprised 1,712,205 records and 88 attributes. Following preprocessing, 748,106 rows and 50 fields were removed resulting in a final preprocessed data set of 964,099 records and 38 attributes.

Fig. 1 Pre-processing steps performed to build the final data set.

Tables 1 to 4 list all the attributes removed during preprocessing. The preprocessing comprised the following steps: removal of columns consisting primarily of empty values (‘NaN’); removal of attributes whose names start with ‘ID’; removal of attributes whose names start with ‘DT’, with the exception of ‘DT_NOTIFIC’ and ‘DT_NASC’; removal of attributes irrelevant to the tuberculosis context (such as ‘BENEF_GOV’, ‘TRANSF’, ‘NU_LOTE’ and ‘NU_TELEFON’); replacement of the remaining ‘NaN’ values with 9 (others), since step two did not eliminate all ‘NaN’ values; removal of rows whose ‘SITUA_ENCE’ attribute had a value other than ‘1’ (CURED class) or ‘3’ (DIED class); removal of rows with ‘NaN’ values in ‘DT_NOTIFIC’, ‘DT_ENCERRA’ or ‘DT_NASC’; calculation of the number of days each patient spent in treatment from ‘DT_NOTIFIC’ and ‘DT_ENCERRA’, stored in a new attribute called ‘DIAS_EM_TRATAMENTO’; and removal of attributes at the authors’ discretion following analysis, as well as of duplicate data and attributes.

Table 1 Attributes removed from original SINAN-TB database - Reason for removal: more than 65.00% of records are null.
Table 2 Attributes removed from original SINAN-TB database - Reason for removal: outside the socio-demographic, clinical and/or laboratory context.
Table 3 Attributes removed from original SINAN-TB database - Reason for removal: removed by authors’ discretion/analysis.
Table 4 Attributes removed from original SINAN-TB database - Reason for removal: Removed for other reasons.
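The preprocessing steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that records are dictionaries of strings, that dates are ISO-formatted (YYYY-MM-DD), that outcome codes are stored as the strings '1' and '3', and that date completeness is checked before missing values are recoded. The 65% null threshold is taken from Table 1. The sample records and the 'ID_MUNICIP' and 'HIV' column names are used for illustration only, and duplicate removal and the discretionary removals are omitted.

```python
from datetime import date

NULL_THRESHOLD = 0.65                      # Table 1: more than 65% of records null
KEPT_DATES = {"DT_NOTIFIC", "DT_NASC"}     # 'DT' attributes that are retained
IRRELEVANT = {"BENEF_GOV", "TRANSF", "NU_LOTE", "NU_TELEFON"}
REQUIRED_DATES = ("DT_NOTIFIC", "DT_ENCERRA", "DT_NASC")
OUTCOMES = {"1", "3"}                      # SITUA_ENCE: 1 = CURED, 3 = DIED


def is_null(value):
    """Treat None, empty strings and the literal 'NaN' as missing."""
    return value is None or str(value).strip() in ("", "NaN")


def preprocess(rows):
    """Apply the described cleaning steps to a list of dict records."""
    if not rows:
        return []

    # Decide which columns to drop (mostly null, 'ID*', 'DT*', irrelevant).
    dropped = set()
    for col in rows[0]:
        null_share = sum(is_null(r.get(col)) for r in rows) / len(rows)
        if (null_share > NULL_THRESHOLD
                or col.startswith("ID")
                or (col.startswith("DT") and col not in KEPT_DATES)
                or col in IRRELEVANT):
            dropped.add(col)

    cleaned = []
    for r in rows:
        # Keep only records labelled CURED or DIED.
        if r.get("SITUA_ENCE") not in OUTCOMES:
            continue
        # Drop records with a missing notification, closure or birth date.
        if any(is_null(r.get(c)) for c in REQUIRED_DATES):
            continue
        start = date.fromisoformat(r["DT_NOTIFIC"])
        end = date.fromisoformat(r["DT_ENCERRA"])
        # Recode remaining missing values as 9 (others) and drop columns.
        out = {c: ("9" if is_null(v) else v)
               for c, v in r.items() if c not in dropped}
        out["DIAS_EM_TRATAMENTO"] = (end - start).days
        cleaned.append(out)
    return cleaned


# Hypothetical records, for illustration only.
sample = [
    {"ID_MUNICIP": "330455", "DT_NOTIFIC": "2019-01-10", "DT_ENCERRA": "2019-07-10",
     "DT_NASC": "1980-05-02", "SITUA_ENCE": "1", "NU_LOTE": "7", "HIV": "NaN"},
    {"ID_MUNICIP": "355030", "DT_NOTIFIC": "2018-03-01", "DT_ENCERRA": "2018-09-01",
     "DT_NASC": "1975-11-20", "SITUA_ENCE": "2", "NU_LOTE": "8", "HIV": "2"},
    {"ID_MUNICIP": "130260", "DT_NOTIFIC": "2017-06-15", "DT_ENCERRA": "NaN",
     "DT_NASC": "1990-02-28", "SITUA_ENCE": "3", "NU_LOTE": "9", "HIV": "1"},
]
cleaned = preprocess(sample)
```

In this sketch only the first sample record survives: the second has an outcome other than CURED or DIED, and the third lacks a closure date. For the full data set, a data-frame library would likely be more practical than lists of dictionaries.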

Data Records

The original and preprocessed data set, as well as the English data dictionary, are available at the Mendeley Data repository and can be accessed via the link (

Figure 2 presents the number of records in the data set by year and by prognosis (records labelled as CURED and DIED) in Brazil between January 2001 and April 2020. It is important to note that the year 2020 has relatively fewer records as the data set only includes records up to April 2020. In addition, SINAN notifications were adversely affected by the COVID-19 pandemic2. The highest number of DIED cases was in 2017 (3,099) and the highest number of CURED cases was in 2018 (61,839).

Fig. 2 Records in the data set by year and by prognosis (records labelled as CURED and DIED).

Figure 3 presents the number of records in the data set by age group and by treatment outcome (records labelled as CURED and DIED). Most cases of tuberculosis are among patients 20 to 60 years old, with the highest number of CURED (412,723) in the 20 to 40 age group, and the highest number of DIED (14,349) between 40 and 60 years old.

Fig. 3 Records in the data set by age group and by treatment outcomes (records labelled as CURED and DIED).
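A tabulation by age group and outcome of this kind could be reproduced along the following lines. This is a sketch under stated assumptions rather than the procedure used to produce the figure: age is taken as completed years between ‘DT_NASC’ and ‘DT_NOTIFIC’, dates are assumed to be ISO-formatted, and the 20-year bands with an open-ended top group (and their boundary handling) are an illustrative choice.

```python
from collections import Counter
from datetime import date

OUTCOME_LABELS = {"1": "CURED", "3": "DIED"}   # SITUA_ENCE codes


def age_in_years(born, on):
    """Completed years between a birth date and a reference date."""
    return on.year - born.year - ((on.month, on.day) < (born.month, born.day))


def age_group(age, width=20, top=60):
    """Map an age to a band such as '20-40'; ages >= top fall in e.g. '60+'."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width}"


def count_by_age_group(rows):
    """Count records per (age group, outcome label)."""
    counts = Counter()
    for r in rows:
        born = date.fromisoformat(r["DT_NASC"])
        notified = date.fromisoformat(r["DT_NOTIFIC"])
        group = age_group(age_in_years(born, notified))
        counts[(group, OUTCOME_LABELS[r["SITUA_ENCE"]])] += 1
    return counts


# Hypothetical records, for illustration only.
records = [
    {"DT_NASC": "1980-05-02", "DT_NOTIFIC": "2019-01-10", "SITUA_ENCE": "1"},
    {"DT_NASC": "1960-01-01", "DT_NOTIFIC": "2010-06-01", "SITUA_ENCE": "3"},
    {"DT_NASC": "1940-03-15", "DT_NOTIFIC": "2015-03-15", "SITUA_ENCE": "1"},
]
age_counts = count_by_age_group(records)
```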

Figure 4 presents heat maps of the cases of tuberculosis by Brazilian region between January 2001 and April 2020, while Fig. 5 shows the cases of DIED by region in the same period. The Southeast region, comprising the states of São Paulo (SP), Minas Gerais (MG), Espírito Santo (ES), and Rio de Janeiro (RJ), had the highest incidence of tuberculosis with 345,491 cases (records labelled as CURED and DIED); it also had the highest number of deaths (14,215) over the 19 years. With 51,878 cases, the Midwest region had the lowest number of tuberculosis cases and the lowest number of deaths (1,697). The state with the highest number of tuberculosis cases was Rio de Janeiro (RJ), with 168,495 cases and 7,912 deaths. The state with the lowest incidence of tuberculosis was Roraima (RR), in the North region, with 2,413 cases of TB. The state with the lowest number of deaths was Amapá (AP), with 61 registered deaths (Table 5).

Fig. 4 Confirmed cases of tuberculosis by Brazilian region between January 2001 and April 2020.

Fig. 5 Deaths by tuberculosis by Brazilian region between January 2001 and April 2020.

Table 5 Socio-demographic data.

The final data set had 39 attributes grouped into three categories: socio-demographic (as presented in Table 5), clinical, and laboratory, based on9,10. As can be seen in Fig. 6, clinical data was further categorised into comorbidities, drugs, and other.

Fig. 6 High-level attribute categorisation in the final data set.

Table 6 shows the attributes grouped as clinical data for comorbidities such as diabetes, AIDS and others. Drugs administered to patients during tuberculosis treatment were grouped as clinical data as per Table 7.

Table 6 Clinical data – Comorbidities.
Table 7 Clinical data – Drugs.

Only two clinical attributes were labelled “Other” as per Table 8: the clinical form of tuberculosis (labelled as “FORMA”) and the type of health unit admission for the patient (labelled as “TRATAMENTO”), which takes the values new case, recurrence, re-entry after abandonment, don’t know, transfer and post-death.

Table 8 Clinical data – Other.

The laboratory attributes were generated from the results of tests performed in the laboratory, such as X-ray, HIV serology and the tuberculin skin test, and were grouped as shown in Table 9.

Table 9 Laboratory data.

Supplementary Table 1 lists all the attributes described, with their characteristics. Males had the highest number of records labelled as CURED and as DIED; females had a mortality rate almost three times lower than men (26.40%). Only 6.00% of tuberculosis cases had an AIDS-associated disease and 6.80% of patients tested positive for HIV. The most widely administered drugs were Rifampicin and Isoniazid, each taken in 67.00% of CURED cases, although 50.20% of patients who died from the disease also took these drugs. The drugs with low administration rates were Streptomycin and Ethionamide, taken by only 0.80% and 0.90% of all patients, respectively. The pulmonary clinical form of tuberculosis represents 84.60% of all cases. Patients who died from tuberculosis spent an average of 56 days in treatment, while those cured spent an average of 211 days in treatment.
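Summary figures of this kind can be derived from the preprocessed records with simple grouped aggregations. The sketch below is illustrative only: it defines the share of deaths within a group as DIED / (CURED + DIED), which may differ from the mortality rate definition used above, and the ‘CS_SEXO’ column name and the sample values are assumptions.

```python
from collections import defaultdict
from statistics import mean

DIED = "3"   # SITUA_ENCE code for the DIED class


def died_share_by(rows, attribute):
    """Share of DIED records within each value of the given attribute."""
    totals = defaultdict(int)
    died = defaultdict(int)
    for r in rows:
        key = r[attribute]
        totals[key] += 1
        died[key] += r["SITUA_ENCE"] == DIED
    return {key: died[key] / totals[key] for key in totals}


def mean_days_by_outcome(rows):
    """Average DIAS_EM_TRATAMENTO for each outcome code."""
    days = defaultdict(list)
    for r in rows:
        days[r["SITUA_ENCE"]].append(r["DIAS_EM_TRATAMENTO"])
    return {outcome: mean(values) for outcome, values in days.items()}


# Hypothetical records, for illustration only.
records = [
    {"CS_SEXO": "M", "SITUA_ENCE": "1", "DIAS_EM_TRATAMENTO": 200},
    {"CS_SEXO": "M", "SITUA_ENCE": "3", "DIAS_EM_TRATAMENTO": 50},
    {"CS_SEXO": "F", "SITUA_ENCE": "1", "DIAS_EM_TRATAMENTO": 220},
    {"CS_SEXO": "F", "SITUA_ENCE": "1", "DIAS_EM_TRATAMENTO": 210},
]
shares = died_share_by(records, "CS_SEXO")
avg_days = mean_days_by_outcome(records)
```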

Technical Validation

All data presented in this work can be corroborated by reports published by the Brazilian Ministry of Health.

Usage Notes

This data set can serve as a basis for researchers to develop, evaluate, and optimise machine learning and deep learning models that predict treatment outcomes and support health professionals in the diagnosis, prognosis, treatment and control of tuberculosis. As a result, the burden on already overstretched health systems and economies, particularly those in disadvantaged regions around the world, can be reduced by accelerating recovery. Furthermore, making the data available enables researchers worldwide to carry out individual patient data meta-analyses and thereby generate more robust evidence for clinical practice and public health.
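As an illustration of how training and test sets might be drawn from the labelled records, the following sketch performs a stratified split on ‘SITUA_ENCE’, so that both sets keep roughly the same proportion of CURED and DIED records. Stratification is suggested here because DIED records are far less frequent than CURED records; the 80/20 ratio, the fixed seed and the function name are assumptions, not part of the published data set.

```python
import random
from collections import defaultdict


def stratified_split(rows, label="SITUA_ENCE", test_fraction=0.2, seed=0):
    """Split records into train and test sets, preserving class proportions."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[label]].append(r)

    train, test = [], []
    for members in by_class.values():
        members = members[:]           # copy so the input order is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical, imbalanced sample: 90 CURED ('1') and 10 DIED ('3') records.
records = [{"SITUA_ENCE": "1"} for _ in range(90)] + \
          [{"SITUA_ENCE": "3"} for _ in range(10)]
train, test = stratified_split(records)
```

With the hypothetical sample, the test set receives 18 CURED and 2 DIED records, mirroring the 9:1 ratio of the full sample.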