Data-driven identification of communities with high levels of tuberculosis infection in the Democratic Republic of Congo

When poverty or unequal access to health services disrupts the diagnosis and treatment of tuberculosis, marginalized communities not only bear the burden of preventable deaths, but also suffer the dramatic consequences of a disease that impairs one’s ability to access education and earn a minimal income. Unfortunately, these pockets often go unrecognized in the data collected for national tuberculosis reports, as localized hotspots are diluted in aggregated reports focusing on notified cases. Such a system is therefore profoundly inadequate for identifying these marginalized groups, which urgently require adapted interventions. We computed an estimated incidence-rate map for the South-Kivu province of the Democratic Republic of Congo, a province of 5.8 million inhabitants, leveraging available data including notified incidence, level of access to health care and exposure to identifiable risk factors. These estimates were validated in a prospective multi-centric study. We demonstrate that combining different sources of openly available data makes it possible to precisely identify the pockets of the population that bear the largest share of the burden of disease. We could precisely identify areas with a predicted annual incidence higher than 1%, a value three times higher than the national estimates. While hosting only 2.5% of the total population, these areas were estimated to account for 23.5% of the actual tuberculosis cases of the province. The bacteriological results obtained from systematic screenings strongly correlated with the estimated incidence (r = 0.86), and much less with the incidence reported by epidemiological reports (r = 0.77), highlighting the inadequacy of these reports when used alone to guide disease control programs.

www.nature.com/scientificreports/ access to health services outside screening campaigns will lead to very limited additional cases found. Additionally, it will typically impact staff motivation and require human and financial resources that are difficult to scale 1-6. Driven by the need to achieve a significant yield, guidelines typically recommend systematic screening in households of TB patients and among people living with HIV [7][8][9]. Although these indications are well proven to be useful, restricting systematic screening to these very limited groups will structurally miss the opportunity to detect the majority of TB cases, in particular in the context of the structural under-diagnosis of TB and HIV such as experienced in the DRC [10][11][12]. It has further been described that within households of active TB cases, the source of new infections is very often different from the presumed index case 13. These observations illustrate the need to extend systematic screening beyond the currently recommended perimeter, in particular to areas with very high incidence of the disease 14.
These recommendations are in practice inapplicable, as countries lack tools to identify pockets of the population where a high level of under-notification hides dramatic incidence rates of TB. We develop a tool to precisely identify the pockets of the population where the majority of TB cases are undetected. In this context, such a tool would guide highly efficient ACF interventions and avoid over- or underutilization of community workers and medical resources 15.

Methods
We introduce a new approach to ACF planning that is twofold: it combines a data-driven detection of high-incidence pockets of the disease with a digital assessment of individual risk as triage.
Firstly, we collect openly available datasets that describe environmental characteristics such as the population density, the presence of the local health care system or the proximity to mining sites. A data-driven prediction algorithm combines these datasets with the information from the local TB reports to precisely identify localized pockets of very high circulation of TB, defined as an incidence rate above 1000/100,000 (1%). These calculations are represented on a map of the South-Kivu province of the Democratic Republic of Congo (DRC) in Fig. 1. In this map, color codes represent the estimated incidence rate for active pulmonary tuberculosis. The map illustrates the significant variations in the predicted incidence within the province: while the vast majority of the surface of the province shows predicted incidence rates below 0.1%, only very limited areas show a risk above 1%. In South-Kivu, the share of the population living in high-risk zones, with a predicted incidence rate above 1%, is only about 2.5%. This same population is expected to host more than 20% of active TB cases.
Secondly, in the communities highlighted by the estimated incidence rate, an ACF intervention is performed with the aid of a digitally supported questionnaire and the MediScout application stack as triage.
Study setting and study design. We performed a multi-centric prospective study in the South-Kivu province of the DRC, a region bordering Rwanda and Burundi and facing a high burden of TB, partly due to over 20 years of conflict and population displacement. Like most eastern provinces of the DRC, South-Kivu has important artisanal and industrial mining sectors. The province hosts a population of 5.8 million inhabitants divided into 34 health zones. In total, 113 health facilities provide basic TB diagnostic and treatment services.
We evaluated two main elements in this study. The first is the map of predicted incidence rate, which represents a risk measure for the local communities. The major outcome is the ability to separate communities with a high rate of TB (> 1%) from other communities not systematically eligible for ACF interventions as per the WHO criteria. To do so, we included 11 urban, semi-urban and rural communities distributed across 11 health zones of the province. These communities were chosen to cover a wide range of predicted TB incidence, from 0.1 to 2.3% (see Table 1). Some zones at high predicted risk, such as Itombwe and Minembwe, had to be excluded due to ongoing insecurity. We included remote communities such as Matili and Lugushwa, which were only accessible by local plane combined with 2 days of travel by motorcycle. Table 2 provides detailed information on the notified incidence rate (the number of cases reported by the local health facilities divided by the population covered by those facilities) and the predicted incidence rate for each of these communities. In this table, we observe substantial differences between the notified and the predicted incidence rates, the predicted rate being up to ten times higher than the notified rate in the same area. An example is the city of Shabunda, a rural and mining area with very low coverage of health structures.
The second element under evaluation is the individual risk assessment tool, the MediScout mobile app with its built-in questionnaire (see Fig. 2). The questionnaire, based on a weighted evaluation of TB-related individual risks such as symptoms, exposure and personal history, acts as a digitally supported triage system.
We partnered with a local organization experienced in TB ACF and trained a project coordinator and 20 research assistants involved in community-outreach activities in the use of the MediScout applications. The training covered the initiation of ACF missions, individual screenings and referral of patients to local health clinics, and was held in Bukavu on 20 and 21 March 2019. These screeners, originating from the provincial capital Bukavu, visited all study areas, where they were accompanied by local community health workers to facilitate collaboration with the community. Data collection took place from March 2019 to January 2020.
Relative incidence rate prediction. We gathered openly available data linked to the risk of TB to compute a relative risk level for each precise location. The data sources and their use in our model are described below. All datasets were retrieved in January 2019.
Population distribution and density information was extracted from the Worldpop project 16,17, an initiative that provides an estimate of the population density with a resolution of approximately 100 m. These estimates result from combining geospatial datasets with available aggregated count data 18. Additional information on the location of urban and residential environments was gathered from the OpenStreetMap project 19.
We then combined the demographic data with the most granular level of epidemiological surveillance, which in the DRC consists of the Health Zone quarterly notification reports. These were used to build a baseline distribution of the TB cases. For this baseline distribution, we assumed that areas with low population density have a lower TB incidence rate than higher-density areas. Locations with a population density lower than 10 people per square kilometer were ignored in this distribution model.
Table 2. Distribution of reported and estimated incidence in the communities covered by this work.
We further used the location and type of health facilities extracted from the Global Healthsites Mapping Project 20 in collaboration with OpenStreetMap 19. We used the distance between each community and the nearest health facility to estimate the under-detection of TB, which is correlated with the difficulty each community has in accessing health services, assuming that "far" from health facilities, 50% of the real TB cases are missed. Furthermore, the type of health facility (local clinics versus hospitals) was also used in the risk estimation. Finally, silica exposure, in particular when linked to mining activities, is a risk factor for TB [21][22][23]. Since the South-Kivu province has an important mining activity, we integrated into the risk computation the location and size of mines, accessible through the IPIS Research project 24 and OpenStreetMap 19. The predicted risk of TB was correlated with the presence of mines in the same geographical areas. We assumed that in mining environments, the incidence rate of TB cannot be lower than 0.5% or 500/100,000.
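The two modelling assumptions stated above (50% of real cases missed far from health facilities, and a floor of 0.5% in mining environments) can be illustrated with a minimal sketch. How "far" is decided and how the adjustments are combined in the actual model are not specified in the text, so the boolean flags, the function name and the order of operations below are assumptions made purely for illustration.

```python
def adjust_incidence(notified_rate: float,
                     is_far_from_facility: bool,
                     is_mining_area: bool,
                     missed_fraction: float = 0.5,
                     mining_floor: float = 0.005) -> float:
    """Return an adjusted annual incidence rate (as a fraction of the population).

    Hypothetical helper: the flags stand in for whatever distance and
    mining criteria the real model uses.
    """
    rate = notified_rate
    if is_far_from_facility:
        # If a fraction of real cases is missed, then
        # notified = real * (1 - missed_fraction), so invert that relation.
        rate = rate / (1.0 - missed_fraction)
    if is_mining_area:
        # Assumed floor: incidence in mining environments is at least 0.5%.
        rate = max(rate, mining_floor)
    return rate


# A remote, non-mining community notifying 0.1% would be corrected to 0.2%.
remote = adjust_incidence(0.001, is_far_from_facility=True, is_mining_area=False)
# A well-served mining community notifying 0.1% would be raised to the 0.5% floor.
mining = adjust_incidence(0.001, is_far_from_facility=False, is_mining_area=True)
print(remote, mining)
```

Keeping both corrections as explicit parameters makes it easy to test how sensitive the resulting map is to the 50% and 0.5% assumptions.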
Individual risk prediction. The main goal of the individual assessment questionnaire was to yield a positivity rate greater than 10% among people identified as being at high risk for active TB.
To build this individual assessment tool, we took into consideration previous observations highlighting that individual risk for TB infection can be extrapolated from a combination of elements including symptoms and the type of exposure to TB 15.
Our questionnaire for evaluating the individual risk for TB was therefore based on the presence or absence of several TB-related symptoms and the precise context of any exposure to TB. The elements captured and the weight given to each element are described in Table 3. The survey generates a total score in the range 0-20. In this study, all people with a score higher than or equal to 4 and those with a cough (regardless of the total score) were referred for a laboratory test.
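The referral rule described above is simple enough to express directly. The sketch below assumes the total score has already been computed from the weights of Table 3 (which are not reproduced here); the function name and signature are hypothetical.

```python
def should_refer(total_score: int, has_cough: bool, threshold: int = 4) -> bool:
    """Decide whether a participant is referred for a laboratory test.

    Rule from the text: refer if the score is >= 4, or if the person
    has a cough regardless of the total score.
    """
    if not 0 <= total_score <= 20:
        raise ValueError("the questionnaire score must lie in the range 0-20")
    return has_cough or total_score >= threshold


print(should_refer(4, has_cough=False))  # score at the threshold
print(should_refer(3, has_cough=False))  # below the threshold, no cough
print(should_refer(0, has_cough=True))   # cough overrides a low score
```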
MediScout solution. The MediScout web application (developed by Savics, Belgium) was used as an integrated end-user interface to visualize the incidence prediction map and plan geographically localized ACF campaigns. The MediScout mobile application (Savics, Belgium) was used to guide community-based health workers within the restricted area of interest and to receive the ACF missions to perform from the remote web application. Further, the app was used to go through the individual questionnaires when in direct contact with the study participants, to automatically compute the individual risk and to send the results of these questionnaires back to the remote server for later analysis (see Fig. 2A: mobile application mission interface; Fig. 2B: mobile application questionnaire interface). The MediScout mobile app uploads information to the MediScout web interface when internet or 3G connectivity is available. This system allows full traceability of all community-based screening events, including GPS location, demographics of the population covered and individual scores.
Based on these reports, aggregated statistics and geographic locations are reported automatically and in real time (Fig. 2C: web app interface with example of data visualization tools and intervention mapping at the bottom).
Cases distribution from population density. The cases reported by the local health system are disaggregated according to the local population density, following a form inspired by the equilibrium solution of a simple compartmental model (SIS) of an endemic infected population 25:
where p0 represents the minimal population density at which the endemic infected population survives, and p1 accounts for the relation between the population density and the transmission parameter of the disease. A normalization is applied in order to recover the same aggregated figures. We choose p0 = 10 inhabitants per square km and p1 = 1000 inhabitants per square km.
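The normalization step can be sketched independently of the exact density-dependent expression. In the sketch below, `weight` is a caller-supplied function giving a relative incidence per inhabitant as a function of density; it is a stand-in, not the expression used in this work, and the uniform weight in the example is chosen only so that the arithmetic is easy to follow. Cells below the minimal density of 10 inhabitants per square km receive no cases, and the remaining shares are rescaled so that they sum to the zone total.

```python
from typing import Callable, List, Sequence

P0 = 10.0  # minimal density (inhabitants per km^2) quoted in the text


def disaggregate_cases(zone_cases: float,
                       densities: Sequence[float],
                       areas_km2: Sequence[float],
                       weight: Callable[[float], float]) -> List[float]:
    """Split a health-zone case count across grid cells (hypothetical helper)."""
    raw = []
    for density, area in zip(densities, areas_km2):
        if density < P0:
            raw.append(0.0)  # cells below the minimal density are ignored
        else:
            # relative incidence per inhabitant times inhabitants in the cell
            raw.append(weight(density) * density * area)
    total = sum(raw)
    if total == 0:
        return [0.0] * len(raw)
    # normalization: the cell values add up to the reported zone total
    return [zone_cases * r / total for r in raw]


# Illustrative input only: three 1 km^2 cells and a uniform placeholder weight.
cells = disaggregate_cases(100, [5, 100, 300], [1, 1, 1], weight=lambda d: 1.0)
print(cells, sum(cells))
```

Passing the weight as a function keeps the normalization logic testable on its own and lets any density-dependent form be plugged in later.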

Ethical approval. The study and its protocols have been approved by the Comité Institutionnel d'Ethique de la Santé of the Université Catholique de Bukavu with reference number UCB/CIES/NC/07/2019.

Informed consent. All subjects participating in this study (or their legal guardians) gave their informed consent. All methods were carried out in accordance with the relevant guidelines and regulations.

Results
Performance of the relative incidence rate prediction tool. We used the individual risk-assessment questionnaire to evaluate the performance of the incidence rate prediction algorithm. This questionnaire, based on a combination of TB-related symptoms, personal exposure to TB and personal history of TB, was considered a good proxy for estimating the level of circulation of TB in a community (see Table 3). Overall, in the 11 study sites, 13,841 people agreed to respond to the individual risk-assessment questionnaire. Of these, 8322 (60.1%) questionnaires originated from areas with a low predicted incidence rate (< 1%) and 5519 (39.9%) from areas with a high predicted incidence rate (> 1%) (see Table 4). In areas with low predicted incidence, 55.5% of the responders had a low risk (score < 4) and 44.5% had a high risk for TB (score ≥ 4). In comparison, in areas with higher predicted incidence, 32.5% of the responders had a low risk (score < 4) and 67.5% had a high risk for TB (score ≥ 4).
The individual risk score ranges between 0 and 20, as it is computed from a variety of questions, including questions on the presence of several TB-related symptoms and on different elements reflecting the intensity of exposure to TB. We looked at the distribution of these scores in the high and low predicted incidence areas.
The median and interquartile range of the risk score in areas of low and high predicted incidence were 3 (2-6) and 5 (3-9) respectively, highlighting the different, although partly overlapping, distribution of scores in the two populations (p < 0.001). Figure 3 shows the distribution of the risk score in the two types of settings. Figure 3 (top) illustrates how the individual risk score in the two populations at high and low risk follows different distributions. It also illustrates the high dispersion of risk scores within these areas, reflecting the heterogeneity of individual risk in each community: the individual score median as well as the confidence intervals increase in areas at higher risk.
Performance of the individual-risk assessment questionnaire. We used Ziehl-Neelsen microscopy as a reference method to evaluate the performance of the individual-risk assessment questionnaire. A method with higher sensitivity such as GeneXpert would have been preferred but, unfortunately, it was not available to the local health care system. In total, 1153 laboratory tests were performed. Screeners were trained to suggest a laboratory test only to people presenting with a cough (regardless of the total score) or having an individual risk score ≥ 4. Among the performed tests, 112 individuals were diagnosed with laboratory-confirmed pulmonary TB (9.7% positivity). Unfortunately, some laboratory outcomes from a few remote facilities could not be linked to the corresponding questionnaire and were excluded from this analysis (for Matili, Lugungu-Katchungu and Shabunda we only know the aggregate figure of 449 negative tests). The positivity rate was 8.8% (n = 55/626) for those with a score ≥ 4 and 12.3% for those with a score ≥ 8 (n = 48/389).
No positive laboratory results were reported among people with a score lower than 4, regardless of the presence of cough (21 lab tests). The proportion of positive smears increased together with the risk score as illustrated in Fig. 4, supporting the effectiveness of the mobile questionnaire as triage.
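As a quick arithmetic check, the positivity rates quoted above follow directly from the reported counts. The helper name below is hypothetical; the counts are those given in the text.

```python
def positivity_rate(positives: int, tested: int) -> float:
    """Percentage of laboratory tests that were positive."""
    if tested <= 0:
        raise ValueError("tested must be a positive number")
    return 100.0 * positives / tested


overall = positivity_rate(112, 1153)    # all laboratory tests
score_ge_4 = positivity_rate(55, 626)   # linked tests with score >= 4
score_ge_8 = positivity_rate(48, 389)   # linked tests with score >= 8
print(f"{overall:.1f}% {score_ge_4:.1f}% {score_ge_8:.1f}%")
```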
Performance of the integrated prediction as a policy-decision tool for ACF disease control programs. To assess the performance of MediScout as a technical tool that can be used by TB-program managers and community-outreach organizations to optimize the efficiency of ACF interventions, we compared the predicted incidence rate with the reported incidence of bacteriologically-confirmed pulmonary cases.
Although the populations originating from areas with high risk (> 1% predicted incidence) and areas with low risk were of similar size, of the 112 individuals diagnosed with bacteriologically-confirmed pulmonary TB, 91 (81.25%) originated from the zones predicted to be at very high risk and 21 (18.75%) originated from zones predicted to have an incidence rate < 1%. Figure 5 (right) shows a strong association (r = 0.861) between the predicted incidence rate and the proportion of bacteriologically-confirmed pulmonary TB (B+ TB) cases identified, with a nearly linear correlation between the predicted incidence and the observed yield of the ACF interventions (fitted line with a slope of 0.95). In comparison, Fig. 5 (left) illustrates a much weaker correlation between the historically notified cases and the actual incidence observed in this study, associated with the heterogeneous level of under-detection of TB.
Figure 5. Correlation of the predicted incidence with the measured incidence within the sampled population (right). Note the high value of the correlation coefficient r. The parameter a represents the slope of the fitted line. For comparison, the correlation of the incidence extracted from the health system reports with the measured incidence within the sampled population (left). Note in this case the lower correlation coefficient and the slope of the fitted line (the measured incidence exceeds ten times the incidence of reported cases).

Choice of individual risk threshold. In the present prospective study, the community health workers referred individuals to a laboratory for a microscopy test if their individual-risk score reached at least the conservative threshold value of 4. The choice of such a conservative value was dictated by the specific testing aims of the current study. We computed the sensitivity and specificity of the survey for different thresholds (see Fig. 6). The chosen threshold has a sensitivity of 1.0 (no screening with an individual score below 4 was confirmed positive for TB in the laboratory) and a specificity of 0.46. A threshold value of 6 would retain a high sensitivity of 0.98 while increasing the specificity to 0.67 (and decreasing the number needed to test in the laboratory to find a positive case). Applying a threshold of 6 would have saved 2692 (36%) tests, while only 1 case (~0%) would have been missed.
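The threshold comparison above amounts to recomputing sensitivity and specificity for each candidate cut-off. A minimal sketch follows; the function name is hypothetical and the eight score/result pairs are invented toy values used only to show the mechanics, not study data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(scores: Sequence[int],
                            lab_positive: Sequence[bool],
                            threshold: int) -> Tuple[float, float]:
    """Sensitivity and specificity of the rule 'score >= threshold'."""
    tp = fn = tn = fp = 0
    for score, positive in zip(scores, lab_positive):
        flagged = score >= threshold
        if positive and flagged:
            tp += 1
        elif positive:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data (invented for illustration): questionnaire scores and lab results.
toy_scores = [2, 3, 5, 7, 9, 4, 6, 10]
toy_results = [False, False, False, True, True, False, False, True]
for t in (4, 6):
    print(t, sensitivity_specificity(toy_scores, toy_results, t))
```

On this toy input, raising the threshold from 4 to 6 leaves sensitivity unchanged while specificity rises, which is the same qualitative trade-off the text describes.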

Discussion
One of the major limitations of this study is that the diagnosis of TB relied solely on a series of three Ziehl-Neelsen microscopy tests per patient, the only method available to the local health system. This technique probably underestimated the level of active TB disease, as it is known to have limited sensitivity and would therefore miss a proportion of paucibacillary infections. The real incidence may thus have been higher than reported in this study. Despite this limitation, we believe that performing this evaluation in a real-life setting, which includes the technical limitations experienced by health workers, is probably more informative than a study that would create an artificial bias by including technologies currently inaccessible to the vast majority of the population. Notably, we could show that using data-driven predictions and setting a threshold at a predicted incidence of 1% allows prioritizing ACF interventions in well-circumscribed pockets of the population where over 80% of the cases found through ACF reside. This illustrates that in settings with uncontrolled transmission, the majority of people experiencing active TB disease will not access the health system unless they are actively identified and supported by outreach interventions.
Determining and disentangling the real factors that trigger the spread of the disease in each case may be an unachievable task. However, our prediction captures the most important factors, which in this framework include poverty or difficulty accessing health care 12 (two highly entangled factors) and proximity to mining activities [21][22][23]. The efficiency of this data-driven approach emerges from the comparison with similar ACF activities (in mining communities). In particular, in locations predicted to be at high risk, the number needed to screen was 61 (5519 screenings/91 positive cases), while the number needed to test was as low as 9 (770 tested/91 positive cases). The same statistics reported for similar settings in other works are considerably higher (110 and 25 [23]; 151 and 12 [10]).
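The two efficiency figures quoted above can be reproduced from the reported counts. The sketch assumes the ratios are rounded up to the next whole person, which matches both values in the text; the helper name is hypothetical.

```python
import math


def number_needed(total: int, positive_cases: int) -> int:
    """People to screen (or test) per positive case found, rounded up."""
    if positive_cases <= 0:
        raise ValueError("at least one positive case is required")
    return math.ceil(total / positive_cases)


nns = number_needed(5519, 91)  # screenings per positive case in high-risk areas
nnt = number_needed(770, 91)   # laboratory tests per positive case
print(nns, nnt)
```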
Another major consequence of these findings is that historical notification reports may fail to sample the disease incidence uniformly. Planning and prioritization of TB control interventions should therefore not rely on those reports alone. Otherwise, the decision process would contribute to underestimating the actual level of disease transmission in the most vulnerable pockets of the population, which have no or limited access to the health system. This study highlights that these historical notification reports should be combined with demographic, geographical and social data in order to optimally inform public health authorities and funders about where to prioritize the implementation of complementary interventions.
Current notification-based epidemiological surveillance approaches have structurally failed to quantify the level of tuberculosis under-detection. The roots of this problem are multiple and include the fact that populations affected by a high burden of tuberculosis are concomitantly affected by a multitude of other poverty-related problems, in particular limited access to preventive and curative health services. In this study, we demonstrate that this major problem, which prevents any TB-eradication program from achieving its objective, can be overcome by analyzing TB notification data together with other publicly available data such as population density, environmental risk factors for tuberculosis and socio-economic indicators. These results support a One Health approach to tuberculosis control, which aims to integrate the patient and their disease in a broader context.
Figure 6. Expected sensitivity and specificity if the threshold to refer subjects to the lab is changed to a different value. The vertical lines correspond to thresholds of 4 and 6. A threshold of 4 may be too conservative; in other contexts, one can safely use 6.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.