The spatial epidemiology of sickle-cell anaemia in India

Sickle-cell anaemia (SCA) is a neglected chronic disorder of increasing global health importance, with India estimated to have the second highest burden of the disease. In the country, SCA is particularly prevalent in scheduled populations, which comprise the most socioeconomically disadvantaged communities. We compiled a geodatabase of a substantial number of SCA surveys carried out in India over the last decade. Using generalised additive models and bootstrapping methods, we generated the first India-specific model-based map of sickle-cell allele frequency which accounts for the district-level distribution of scheduled and non-scheduled populations. Where possible, we derived state- and district-level estimates of the number of SCA newborns in 2020 in the two groups. Through the inclusion of an additional 158 data points and 1.3 million individuals, we considerably increased the amount of data in our mapping evidence-base compared to previous studies. Highest predicted frequencies of up to 10% spanned central India, whilst a hotspot of ~12% was observed in Jammu and Kashmir. Evidence was heavily biased towards scheduled populations and remained limited for non-scheduled populations, which can lead to considerable uncertainties in newborn estimates at national and state level. This has important implications for health policy and planning. By taking population composition into account, we have generated maps and estimates that better reflect the complex epidemiology of SCA in India and in turn provide more reliable estimates of its burden in the vast country. This work was supported by European Union’s Seventh Framework Programme (FP7//2007–2013)/European Research Council [268904 – DIVERSITY]; and the Newton-Bhabha Fund [227756052 to CH]


Library assembly
Online searches of two major bibliographic databases, PubMed (http://www.pubmed.gov) and Scopus (http://www.scopus.com) were performed to identify published surveys of β S allele frequency carried out in India since 2010. To ensure consistency with the search strategy used in Piel et al. (2013), 1 the same keyword string was used: "sickle cell" or "haemoglobin S" or "hemoglobin S" or "Hb S". We also included the search term "India" to make it country-specific and limited our search to between 2010 and 2016. Our final search on 14 October 2016 yielded 502 references across all three databases, which were imported to a bibliographic management software (Endnote X7.7.1, Thomson Reuters, Carlsbad, CA, USA). Twenty-four duplicates were identified and manually removed, leaving a total of 478 references to be reviewed.
Unpublished sources of data and references available only in the local literature (n = 84) were identified through our UK-India collaboration, supported by the Newton Bhabha Fund, and reviewed in the same manner. Finally, the dataset used by Piel et al. (2013) 1 in their global sickle-cell mapping study was clipped to India and merged. All additional sources from which data were abstracted are listed in the Supplementary Bibliography S1. Sources included in the earlier database are listed in Piel et al. (2013).

Data inclusion criteria and abstraction process
The identified sources were reviewed in detail by one of the authors (CH) using a specific set of inclusion criteria, which were outlined in a pre-defined protocol. This ensured objectivity and consistency in the data abstraction process. A random sample of 10% of the references were reviewed and checked by a second author (FP), and the survey details extracted by the two authors compared.

Survey representativeness
Only surveys that were conducted among representative population samples were included in the database. Two key considerations when determining the representativeness of a sample were: (i) the source population from which the sample was taken, e.g. pregnant women, newborns, community, etc., and (ii) the sampling methodology used, e.g. random, consecutive, unselected or universal. All family-based studies as well as those conducted amongst patients (e.g. malaria patients or general outpatients) or suspected cases of haemoglobinopathies were deemed to be inherently biased and thus immediately excluded. However, surveys that took place amongst newborns, pregnant women, children and adolescents or in the community, although unbiased in terms of their source population, were scrutinised for their sampling methodology. Those with stated or suspected bias in their selection method were excluded. For instance, studies that took place in the community but which required voluntary participation without any prior education or awareness campaign were excluded due to a potential bias, for example towards individuals with an affected family member.
Ethnicity and social grouping Previous efforts to map β S have excluded surveys that target specific ethnic groups due to their lack of representativeness of the general population. However, given the unique population structure of India in the form of highly endogamous population groups 2 and the well-known heterogeneities in β S frequency across ethnic and social groups, 3 were assumed to belong to the GC category. Subsequently, the data were split into two subsets: (i) surveys carried out in scheduled populations (i.e. STs and/or SCs), and (ii) surveys carried out in non-scheduled population (i.e. OBCs and/or GCs). Those surveys that did not specify the ethnic origin or social grouping of the study sample or included a mixture of scheduled and non-scheduled individuals were entered into the database but excluded from our mapping analysis.

Survey diagnostic methods
There are several methods available for the diagnosis of SCT and/or SCD. [5][6][7] These include the sickling test, solubility test, haemoglobin electrophoresis and high performance liquid chromatography (HPLC). The sickling test is based on the principle that affected red blood cells become sickle-shaped when exposed to a low oxygen tension. A blood sample is mixed with sodium metabisulphite (an oxygen scavenger) and examined under a microscope. By contrast, in a solubility test, the red blood cells are lysed to release the haemoglobin into the blood plasma. Normal haemoglobin is soluble in the blood plasma, whilst sickle haemoglobin is not and forms small crystals that cause the blood plasma to become turbid. 8 In India, the most commonly used methods are sickling test and HPLC, depending on the resources available. 6 Given the good reliability of most diagnostic methods for β S , 5 no strict inclusion criteria with regards to diagnostics were applied. However, key information regarding the precise diagnostic algorithm used in a survey was recorded. This included the presence or absence of an initial screening stage and the type of confirmatory diagnostic test used. Where multiple confirmatory diagnostic tests were used, the most reliable methodusually HPLC or, in some cases, DNA sequencingwas recorded. This information is available on request.
Sample size and β S allele frequency An important prerequisite to the inclusion of a survey in the database was the clear reporting of both the sample size and the number of AS individuals identified. When provided, the number of SCA individuals and any compound heterozygotes (e.g. co-inheritance of β S with βthalassaemia or with another structural variant such as β E ) were also recorded. Occasionally, surveys directly reported the frequency of the β S allele without a breakdown of the different genotypes. These were also included, but only if sample size was provided.
Genotype counts were used to calculate the allele frequency of β S , a key input for our model, using the formula: where q is the β S allele frequency, nSCA and nSCT are the number of β S homozygotes and heterozygotes or compound heterozygotes, respectively, and N is the size of the study sample.
A simplifying assumption of this calculation is that the populations are at Hardy-Weinberg equilibrium (HWE). 9,10 Sample size varied enormously between surveys. We did not place any constraints on sample size, but rather accounted for it in the statistical model by weighting the contribution of each survey to the fitted model according to its size. As a result, the contributions of very small samples were down-weighted whilst those of larger samples were strengthened.

Georeferencing
We applied an inclusion criterion of spatial specificity, whereby only those surveys that could be georeferenced to at least the second administrative level were included. In India, this corresponds to the district level (https://censusindia.gov.in). Various geopositioning gazetteers were used to identify the latitude and longitude of all included surveys as precisely as possible.
Where a survey described a specific sampling site, the geographic coordinates of that site were used. For surveys only reporting the district in which they were conducted, the centroid of the district was identified and recorded. In instances where the presented survey data came from multiple sites, the centroid between all the survey sites was determined using the Geographic Information Systems (GIS) software, ArcGIS Desktop (ArcMap 10.4.1, ESRI Inc., Redlands, CA, USA). No information on the spatial extent of the surveys was recorded, as this information was rarely available.
An advantage of our modelling approach, which is described in detail in Supplementary Information S2, is that all spatial duplicates could be retained, thereby maximising the size of the evidence-base for the mapping model. were carried out in two of these states (Maharashtra and Chhattisgarh). Nevertheless, this reflects the areas of highest concern with regards to SCA. 4,6 The total number of individuals sampled was 1 300 719 with 88.1% of these sampled in these four states. However, the majority of the surveys were carried out in community samples (n = 133).
Twenty-seven surveys lacked information on their target population group (pregnant women, newborns, etc.), but did not indicate any potential bias and so were retained. Scheduled tribes were the most studied group, following by scheduled castes and other backward classes.
Some heterogeneity in the diagnostic algorithm used was also observed. However, the vast majority of the surveys used HPLC as the confirmatory diagnostic test, with less than half of these incorporating an additional screening stage.

Overview of generalised additive models (GAMs)
To generate a continuous map of β S allele frequency in India, we employed a generalised additive modelling approach. Generalised additive models (GAMs) comprise a collection of nonparametric regression techniques performing regression inference on complex, non-linear, relationships between a given response variable and multiple predictor variables. 11 The advantage of using GAMs over more common linear models is that they can create extremely flexible functional relationships while effectively regularising the objective function -that is, they automatically trade off over/under fitting to provide the best estimation of the underlying predictive patterns. 12 In addition, unlike more black box methods, such as gradient boosting machines, 13 the models are easily interpretable due to the contribution of each independent variable to the prediction being explicitly encoded.
For a given response ∈ and associated predictors ∈ , the GAM estimates an unknown function = ( ) through an additive series of smooth functions, i.e.
( ( )) = + 1 ( 1 ) + 2 ( 2 ) + ⋯ + ( ) + ⋯ + where E(Y) denotes the expected value of the response and g(Y) denotes the link function that allows for the modelling of non-Gaussian likelihoods. 11 The terms s1(x1),…,sn(xn) denote smooth, nonparametric thin-plate spline functions. GAMs can also be generalised to multiple dimensions to model the interactions between two predictor variables such as spatial coordinates. The general principles of generalised additive modelling have been described previously by Hastie and Tibshirani (1987) 14 and West (2012). 15

Model covariates: malaria
Whilst the malaria hypothesis is widely accepted and supported by high-quality and varied evidence, 16 the geographical relationship between sickle-cell and historical malaria remains unresolved in the Indian context. This is presumably due to a myriad of other factors influencing the distribution of β S here, including population structure, consanguinity and migration. To explore this relationship using our GAM approach, we sourced two maps of the pre-control distribution of malaria in India (Supplementary Figure S3). The first was a global map produced by a team of researchers in the 1960s and represents the distribution of malaria when it was at its highest, circa 1900 (Supplementary Figure S3a). 17 The authors synthesised information and data on multiple malariometric indices from a range of sources and combined this with expert opinion as well as temperature and rainfall data to generate the map. To suit the purposes of our study, the map, which was previously used by Piel et al. (2010), was clipped to India in ArcGIS Desktop. The second map was produced by Sir Patrick Hehir in 1927 and is specific to India (Supplementary Figure S3b). 18 The map was previously digitised by Dr Katherine Battle from the Malaria Atlas Project, who shared it with us for this study. Whilst there is little information on the evidence-base upon which the map was generated, to our knowledge it is the only available pre-control map of malaria specifically for India. Coupled with the additional detail provided by the map compared to that from Lysenko et al. (1968), this justified its inclusion in our analysis.
To gain insight into the relationship between sickle-cell in India and present-day malaria, we also obtained a contemporary map of malaria from the Malaria Atlas Project map repository (MAP, www.map.ox.ac.uk) (Supplementary Figure S3c). Generated using a Bayesian geostatistical model that incorporates data from parasite rate surveys as well as from environmental covariates, the map shows the mean value of the range of age-standardised The final covariate explored in our analysis includes geographical location, given by latitude and longitude in decimal degrees (see Supplementary Information S1).

Model selection
The GAM selection process requires consideration of two key model features. First, the covariates to include in the final model must be selected, and a decision about which covariates to include as smoothing terms made. Second, given a model structure, the degree of smoothness for each smoothing term must be optimally balanced such that under-or over-smoothing is avoided.
Prior to the model selection process, the data were split into two subsets: (i) those surveys that were carried out in scheduled populations, and (ii) those that were carried out in non-scheduled populations. Model selection was then carried for each of the two datasets separately. A backward stepwise approach was used to select the final model for each dataset. The starting model included smooth terms for latitude and longitude (both separately and in combination), historical malaria (Lysenko et al., 1968), contemporary malaria, NTL and accessibility. Cubic regression splines were used for all smooth terms except the two-dimensional geographical term, for which we used thin-plate regression splines. Each survey's contribution to the model was weighted according to its survey size.
The variables associated with the highest p-values were successively removed until the generalized cross-validation criterion (GCV, a computationally efficient generalisation of standard out-of-sample predictive performance), Akaike Information Criterion (AIC) and mean square error (MSE) of the model were minimised. 11 The deviance explained and R 2 were also examined. Finally, smoothing parameters, which determine the degree of smoothness of the predictive functions, were estimated from the data via restricted maximum likelihood (REML).

Interpolation to generate two continuous β S allele frequency surfaces
The

Variability in model predictions
The predictive output of the GAM of course varies between the bootstrapped samples For spatial predictions, this variability varies geographically and is a function of the quality, quantity and sample size of the available data. In addition, heterogeneity in the observed frequencies of β S can affect the level of variability in the model's predictions. To quantify the variability in the behaviour of the GAM predicting β S allele frequency, we calculated the 95% CI of the probability distribution of predicted β S allele frequency for each 10km x 10km pixel.
All analyses were performed using R 3.3.2. Full scripts of the code are available on request.

The final maps
Predicted maps for scheduled and non-scheduled populations were generated separately and are shown in Supplementary Figures S5 and S6. These were paired, together with district-level data on the proportion of scheduled and non-scheduled groups in the population, to generate a composite β S allele frequency map. This is shown in Figure 2b of the main text and displays the median value of the 2500 bootstrapped predictions generated for each 10km x 10km pixel.
The 95% CI of the bootstrapped predictions for each pixel provides a measure of how consistently the GAM performs and is shown in Supplementary Figure S7. Variability in our predictions is greatest where data are present but sparse (e.g. in northeastern Assam) and/or where there is considerable heterogeneity in the observed prevalence (e.g. along the border of Karnataka and Tamil Nadu). In these areas, variability in the model's predictions was at least 50%. High variability of up to 12% was also observed in Jammu & Kashmir, Punjab, northeastern Madhya Pradesh, West Bengal and the eastern part of Odisha, extending down into the northern districts of Andhra Pradesh and Telangana. There was low variability (<5%) across most of the central belt of high β S frequency. However, variability was also low wherever data were completely absent. This is due to the absence of data preventing higher β S allele frequencies from being predicted.