A role for community-level socioeconomic indicators in targeting tuberculosis screening interventions

Tuberculosis screening programs commonly target areas with high case notification rates. However, this may exacerbate disparities by excluding areas that already face barriers to accessing diagnostic services. We compared historic case notification rates, demographic, and socioeconomic indicators as predictors of neighborhood-level tuberculosis screening yield during a mobile screening program in 74 neighborhoods in Lima, Peru. We used logistic regression and Classification and Regression Tree (CART) analysis to identify predictors of screening yield. During February 7, 2019–February 6, 2020, the program screened 29,619 people and diagnosed 147 tuberculosis cases. Historic case notification rate was not associated with screening yield in any analysis. In regression analysis, screening yield decreased as the percent of vehicle ownership increased (odds ratio [OR]: 0.76 per 10% increase in vehicle ownership; 95% confidence interval [CI]: 0.58–0.99). CART analysis identified the percent of blender ownership (≤ 83.1% vs > 83.1%; OR: 1.7; 95% CI: 1.2–2.6) and the percent of TB patients with a prior tuberculosis episode (> 10.6% vs ≤ 10.6%; OR: 3.6; 95% CI: 1.0–12.7) as optimal predictors of screening yield. Overall, socioeconomic indicators were better predictors of tuberculosis screening yield than historic case notification rates. Considering community-level socioeconomic characteristics could help identify high-yield locations for screening interventions.

www.nature.com/scientificreports/ notification rates because of a low TB burden or because the population faces economic and other social barriers to accessing care 13-15 . Thus, it is possible that people who might most benefit from community-based screening interventions may live in areas that would not be prioritized if interventions were only targeted to places with high case notification rates 16 . Additionally, because individuals who attend community-based screening programs may be different from those who seek care at local public health facilities, historic case notification rates from the health facilities might not useful for targeting community-based screening programs. We aimed to assess the utility of using historic case notification rates to target TB active case-finding activities and determine whether other neighborhood characteristics might be more useful. Using data from a communitybased screening program in Lima, Peru, we evaluated whether the neighborhoods where the greatest percentages of screened individuals were diagnosed with TB were the same neighborhoods with the highest case notification rates in previous years. In addition, we explored whether neighborhood-level demographic or socioeconomic indicators predicted the yield of TB diagnoses among screened individuals better than historic case notification rates.

Methods
Study design. We conducted an exploratory analysis-with an aim of informing future work-to assess whether historic case notification rates and other neighborhood characteristics could predict how many of the individuals screened by a community-based screening program would be diagnosed with TB. Our target population was all individuals who attended the community-based screening program, regardless of where they were screened or what motivated them to be screened. Our geographic unit of analysis was residential neighborhoods, a smaller unit than health facility catchment areas; this smaller unit is more relevant to the planning of community-based screening programs, which are most likely to reach people in the immediate vicinity.

Study population.
Peru is a middle-income country with an estimated TB incidence of 119 per 100,000 population 13 . Our study focuses on residents of the contiguous catchment areas of eight primary-level public health facilities in the Carabayllo district, Lima, Peru; this area had a population of about 212,000 in the 2017 census 17 . We excluded the catchment areas of four other health facilities in the less urbanized periphery of Carabayllo because key predictor data were unavailable. While TB screening and treatment are free in Peru, people nevertheless face both direct and indirect costs for seeking care, which present barriers to timely diagnosis 18 . TB screening program. Starting in February 2019, a community-based TB screening program was implemented in three contiguous districts of north Lima, including Carabayllo 19 . The screening program involved mobile screening units offering free chest radiography, regardless of the presence of symptoms. If the chest radiograph was abnormal, individuals underwent a physical examination and were asked to provide a sputum sample for rapid testing with GeneXpert MTB/RIF (Cepheid, Sunnyvale, CA). Individuals were diagnosed based on either a positive GeneXpert MTB/RIF result or by a physician based on clinical and radiologic evidence. All individuals diagnosed with TB were referred for treatment to their local health facility.
Mobile units were open to the public and stationed in high-traffic areas such as parks, markets, transport terminals, and outside health facilities (which tend to be in centrally located areas). To promote awareness of the screening program, a structured community engagement strategy was implemented prior to the arrival of the mobile screening unit in each community, including the incorporation of popular opinion leaders and a multimedia campaign of videos, audio vignettes, flyers, posters, community murals and jingles 20 . Prior to launch, the implementation team consulted with community leaders to define boundaries corresponding to local definitions of Carabayllo neighborhoods, which were then mapped. Seventy-four neighborhoods ranged from 0.04 to 4.36 km 2 in area. All people attending the screening program were asked to indicate what neighborhood they lived in, aided by the maps and staff familiar with the area. Data sources. Outcome data were obtained from the TB screening program for people screened February 7, 2019-February 6, 2020. We restricted analysis to residents of the geographic area of interest, including those who were screened at any mobile unit site. Neighborhood-level predictor data were obtained from the 2017 census and TB treatment registers. Details of data sources and processing can be found in the Supplementary Material. Statistical analysis. Our outcome of interest was neighborhood "screening yield, " defined as the proportion of screened neighborhood residents who were diagnosed with TB. We used two different analytic approaches that have complementary strengths and limitations in handling the continuous neighborhood-level predictor data: binomial logistic regression and Classification and Regression Tree (CART) analysis. Binomial logistic regression is an established approach with well-characterized methods for testing model assumptions to assure validity of the resulting model. However, ascertaining relationships between the outcome and continuous predictor variables is difficult unless meaningful thresholds for categorizing predictors are determined a priori. In contrast, CART is a nonparametric method that uses recursive partitioning to search through all potential predictors and cutoff values to categorize predictors, identifying the most important predictors and their optimum predictive thresholds. However, in CART analysis there are no established methods to estimate certainty that are analogous to confidence intervals, and CART may identify predictor thresholds that are not programmatically useful.
Approach 1: logistic regression analysis. We used binomial logistic regression to assess univariable associations between neighborhood-level characteristics and the odds that a person will be diagnosed with TB given www.nature.com/scientificreports/ that they live in a neighborhood with certain characteristics. The outcome of interest was the screening yield. Neighborhood-level predictors were assessed as continuous values since we had no way to determine a priori meaningful categorizations for these demographic and socioeconomic predictors. We ran model diagnostics, including Pearson, deviance, standardized, and likelihood residuals, Cook's distance (D), and DFBETA, for key variables and further explored any observations that were identified to be highly influential based on the Cook's D. For any neighborhood identified as a potential influential outlier, we conducted sensitivity analyses with the neighborhood included and excluded to assess whether its inclusion in the analysis created or strengthened associations. If it did, we removed the neighborhood from the primary analyses and presented sensitivity analyses demonstrating the impact of removing these outlier neighborhoods. We also conducted a sensitivity analysis with the outcome of bacteriologically confirmed TB. Details of sensitivity analyses can be found in Supplementary Material. Additionally, we ran spatial dependence diagnostics using the Lagrange Multiplier lag and error tests to assess whether the models should include a spatial autocorrelation term. Logistic regression analyses were performed with SAS V9.4 (SAS Institute, Cary, North Carolina, USA).
Approach 2: classification and regression tree analysis. We conducted two neighborhood-level CART analyses with screening yield as the outcome. Approach 2a treated screening yield as a continuous outcome, allowing the CART process to define the final outcome categories based on where it split the outcome variable. Approach 2b treated screening yield as a categorical outcome, defining the top 15 neighborhoods with the highest screening yield as "high yield" and the rest as "low yield, " and then allowing the CART process to predict which yield category each neighborhood would fall into. The models were weighted by the number of residents screened from each neighborhood. We ranked and selected the primary node and assessed the relevance of each variable in the final model. Measures of predictive importance-determined by computing the improvement measure attributable to each variable in its role as a surrogate to the primary split-were assigned to each potential predictor, entailing both marginal and interaction effects involving this variable. The data sets were split into increasingly homogenous sub-groups, using least squares method to split nodes and add smaller daughter nodes to the tree. Maximal trees were generated and pruned based on relative misclassification costs, complexity, and parsimony. Ten-fold cross-validation was performed, in which the data set was randomly split into learning and test sets. CART analysis was then applied to determine model performance and predictive accuracy in these test sets, removing the need for a validation data set.
For the final derived trees from Approaches 2a and 2b, we assessed the utility of the predictors and thresholds identified by the analyses by converting these node values into categorical predictors, which we then used in a binomial logistic regression analysis. This produced odds ratios (OR) and 95% confidence intervals (CIs) for ease of interpretation for those more familiar with association statistics and effect sizes. Sensitivity analyses were performed to assess the impact of any outlier neighborhoods, the inclusion of mathematically related variables, and restriction to bacteriologically confirmed cases (Supplementary Material). CART analysis was run using Salford Systems Data Mining and Predictive Analytics Software version 8.0 (Salford Systems, San Diego, California, USA).

Ethics approval.
This study was conducted in accordance with the U.S. Health and Human Services regulations for the protection of human subjects (HHS 45CFR 46). Informed consent was not required, as the Mass General Brigham Institutional Review Board determined that the study constituted exempt human subjects research (protocol 2019P002416).

Results
During the analytic period, the mobile screening units screened 29,619 residents from the 74 neighborhoods (14.0% of the population) and diagnosed 147 TB cases, of which 125 (85.0%) were bacteriologically confirmed. The median TB screening yield from the screening program per neighborhood was 0.4% (interquartile range [IQR]: 0.0-0.8%; range: 0.0-2.3%, plus a single neighborhood with a much higher yield of 12.0%) (Fig. 1). The summary statistics of neighborhood characteristics are reported in Table 1.
Approach 1: logistic regression results. During assessment of model diagnostics, we identified three potentially influential observations by Cook's D. We ran sensitivity analyses, removing each of the potentially influential observations one by one, and found that removal of one of the three neighborhoods substantially reduced the strength of several of the observed associations. This observation was a neighborhood with a 12% TB screening yield; 3 individuals were diagnosed with TB disease out of only 25 people screened. Due to this observation's large impact on the results of the regression model, we excluded it from our primary analyses to avoid associations driven by outlier values. Thus, all primary analyses include 73 neighborhoods, and sensitivity analyses were run with all 74 neighborhoods (Supplementary Material). Additionally, no spatial dependence was observed through the Lagrange Multiplier lag and error tests, so a spatial autocorrelation term was not applied to the models.
The primary logistic regression analysis identified that the percent of households in the neighborhood that own a vehicle was strongly inversely associated with a TB diagnosis (OR: 0.76 per 10% increase in vehicle ownership; 95% CI: 0.58-0.99; P = 0.044) (  (Table 3). Fourteen out of 22 (64%) considered socioeconomic indicators were included in the 15 most important variables list, while only one out of the 16 (6%) epidemiologic or sociodemographic indicators-historic TB case notification rate amongst those greater than 44 years old-was included. The primary node identified was the percent of households in a neighborhood that own a blender (Fig. 2). Neighborhoods in which ≤ 83.1% households owned a blender had a higher mean TB screening yield (mean: 0.6%, SD: 0.3%) than neighborhoods where > 83.1% of households owned a blender (mean: 0.3%, SD: 0.2%). Using this cutoff to define a categorical predictor in logistic regression found that people living in the neighborhoods with less blender ownership had 1.7 (95% CI: 1.2-2.6; P = 0.008) times the odds of TB compared to people living in neighborhoods with higher blender ownership.
Among neighborhoods in which ≤ 83.1% households owned a blender, those in which > 55.1% of household owned a sound system had a higher mean TB yield (mean: 1.0%, SD: 0.1%) as compared to neighborhoods where ≤ 55.1% of households owned a sound system (mean: 0.5%, SD: 0.3%). Using this cutoff to define a categorical predictor in logistic regression, we found that amongst people living in neighborhoods where ≤ 83.1 household owned a blender, those in the neighborhoods with more sound system ownership had 1.8 (95% CI: 1.1-3.0; P = 0.016) times the odds of TB compared to those living in neighborhoods with less sound system ownership.
Approach 2b treated the outcome of TB screening yield as a categorical variable and categorized the 15 neighborhoods with the highest screening yield as "high yield. " Across these 15 neighborhoods, the average screening yield was 1.2% (SD: 0.4%), compared to 0.2% (SD: 0.3%) in the other 58 neighborhoods. We identified the top 15 most important variables for predicting high-yield neighborhoods (Table 4). Six out of 22 (27%) considered socioeconomic indicators were included in the 15 most important variables list, while nine out of the 16 (56%) epidemiologic or sociodemographic indicators were included.
The primary and only node identified in the best produced tree was the percent of TB patients with a prior TB episode (Fig. 3). Greater than 10.6% of TB patients with a prior episode of TB led to the model identifying 16 neighborhoods as having a high TB screening yield, whereas 10.6% or less identified 57 neighborhoods of low TB screening yield. The positive predictive value of using the cutoff of 10.6% of historic TB patients with a  www.nature.com/scientificreports/ www.nature.com/scientificreports/ predictor variable which we subjected to logistic regression; there we found that people living in neighborhoods with greater than 10.6% of TB patients who had a prior TB episode had 3.6 (95% CI: 1.0-12.7; P = 0.041) times the odds of TB as compared to those living in neighborhoods with 10.6% or less of TB patients with a prior TB episode. For both CART approaches, sensitivity analyses including the outlier neighborhood produced similar results to the primary analysis, with the same predictors identified as being most important (Supplementary Material). In the sensitivity analysis restricting to bacteriologically confirmed cases, approach 2a (continuous outcome) produced similar results as the primary analysis. Approach 2b (categorical outcome) identified the same three most important predictors as the primary analysis; the other predictors identified as important differed somewhat in the sensitivity analysis, with greater representation of socioeconomic predictors compared to the primary analysis.

Discussion
In our study, we found that historic case notification rates were not good predictors of the yield of TB diagnoses among residents of communities served by a mobile TB screening program in Lima, Peru. In two different analytic approaches that treated screening yield as a continuous outcome, the best predictors of yield of TB diagnosis were socioeconomic indicators. This was true both when predictors were treated as continuous variables and when they were assessed for optimal partitioning via CART. Epidemiologic predictors were more useful www.nature.com/scientificreports/ for predicting screening yield in a CART analysis that treated the screening yield outcome as binary, although negative predictive value was far better than positive predictive value. While our analyses did not identify a single consensus set of indicators for predicting neighborhoods with high screening yields, our findings highlight the utility of considering community-level socioeconomic characteristics-rather than only historic case notification rates-when geographically targeting screening interventions. While the association between TB and poverty is well established 21 , our findings do not simply imply that screening yields are higher in poorer neighborhoods. Although government agencies may establish income-based definitions of poverty for policy purposes, the experience of poverty is multidimensional and heterogenous 22 . Household-level poverty in Peru and other countries is best predicted by a combination of characteristics, including education, employment, housing, and ownership of certain items 23,24 . Our analysis found higher TB screening yields in neighborhoods with lower levels of vehicle and blender ownership, consistent with larger proportions of that neighborhood's residents living in poverty. However, none of our analyses identified factors related to education, employment, or housing as useful predictors. Moreover, in combination with blender ownership, we found that sound system ownership had the opposite association with TB screening yield than would be expected. Thus, the predictors identified in our analysis may be related to living in poverty, but may also reflect differences among disadvantaged communities that we are unable to explore further given the data available.
It is important to note that our outcome of screening yield is not the same as the TB prevalence in the community, as people who attend community-based screening units are not necessarily representative of the neighborhood in which they reside 25,26 . However, from a program planning perspective, screening yield is an important indicator because it can help programs launching new screening initiatives to prioritize areas where www.nature.com/scientificreports/ screening activities may have greater impact 27 . In addition, screening yield for a community-based program is a potential indicator of unmet need for diagnostic services, so it is meaningful even if it is not correlated with prevalence. Similarly, an association of historic case notification rates with TB screening yield may not have been observed because people attending the screening units may be fundamentally different than those who choose to present to a health facility to get diagnosed. Further, both of these groups may not be representative of the population characteristics reported in the 2017 census, which could explain the lack of association observed with many census-derived variables. A strength of this study is the use of CART analysis, which results in easily interpretable decision trees for use in clinical practice 28,29 . CART identified previously concealed associations 30,31 , as observed when predictors included as continuous variables did not have an association with the outcome, but were associated when included as categorical variables using the thresholds identified by CART. These results can complement the results of other statistical methods; CART does not provide an analogue to a confidence interval to quantify or support the validity of the findings, but the observed thresholds can be subjected to standard hypothesis testing via regression analysis where the validity can be determined 32 . Other benefits of using CART analysis are that: it is a non-parametric method so no distributional assumptions are needed 28,29 , there is no need a priori identify hypotheses about relationships between potential predictors and the outcome, and it can overcome missing data through the use of surrogate measures 33 .
Our study was limited to the potential predictors available in the census and those routinely collected in the treatment registers. Other neighborhood-level characteristics, such as local infrastructure or accessibility to health services, may be better predictors of screening yield than those we assessed. Paper-based records and irregular address systems in many neighborhoods also limited the amount of data that could feasibly be collected on TB epidemiology indicators at the neighborhood level. Our decision to use yield as an outcome also did not take into account the varying coverage of the screening program in different neighborhoods. Additionally, our analytic population included only 147 TB cases diagnosed via a mobile screening campaign. While this corresponds to a high overall screening yield of 1 case per every 201 people screened, the number of outcome events was small for an analysis across 74 neighborhoods. This could have reduced our ability to detect associations. Due to the low absolute number of historic cases in a given neighborhood per year, we calculated a five-year average case notification rate for each neighborhood to address the potential variability over time that is not due to true changes in disease prevalence, which is in line with other studies who also aggregate cases over time to avoid issues with small case counts 34 . However, this prevented us from assessing changes in case notifications over time as a potential predictor. Finally, multiple analytic approaches were purposely used due to the complementary strengths and limitations of the methods in handling the continuous neighborhood-level predictor data. However, the use of multiple approaches led to different results, suggesting more work is needed to understand how each or both together may optimally be used to inform the optimization of community-based active case-finding for TB.
In conclusion, bringing mobile TB screening services to communities affected by poverty helps overcome barriers to accessing care. Socioeconomically disadvantaged communities may disproportionately benefit from screening interventions even if routine surveillance data does not suggest a disproportionate TB burden. Because barriers to accessing TB services can lead to underdiagnosis, limiting screening interventions to areas with known high TB burdens may exacerbate existing disparities in access to diagnostic services. Further analyses of casefinding activities should be undertaken at the neighborhood level to identify additional and better predictors of screening yield in communities.