A new study in which Paola Sebastiani and co-workers apply a computer-assisted method to determine the combined effects of multiple candidate gene single nucleotide polymorphisms (SNPs) on stroke risk in sickle cell anemia (SCA) could be a model for how bioinformatic SNP analysis will be used for risk assessment for complex diseases in the future.

The completion of the human genome project, coupled with technological advances in large-scale genotyping and bioinformatics tools, has prompted a resurgence in the use of disease association studies to detect major susceptibility genes. The three million SNP sites identified so far exemplify the allelic complexity of the human genome and serve as markers for large association studies, so providing a valuable resource for investigating the genetic basis of many complex human diseases.

Although SCA is considered a ‘monogenic’ disease, arising from a single point mutation in the β-globin gene, it is characterized by considerable clinical heterogeneity. Stroke is a particularly devastating manifestation of SCA, afflicting 11% of children before the age of 20 years.1 As only a fraction of SCA patients develop stroke, environmental and genetic modifiers beyond the sickle gene mutation must account for this phenotypic variability.

Several limited association studies that attempted to identify single SNPs within individual stroke susceptibility genes have produced inconsistent results. This inconsistency has cast doubt on the contribution of potentially relevant genetic modifiers of stroke risk in SCA.2, 3, 4, 5, 6, 7 However, because complex phenotypes such as stroke presumably arise from multiple interacting genes located throughout the genome, the optimal approach would be to search for sets of markers in different genes and analyze these markers jointly, rather than individually. Most statistical approaches evaluate the effects of individual SNPs one at a time; if a significant disease association is found, the SNP is then considered to be near or within a susceptibility gene.8 When large numbers of SNPs are tested simultaneously against a single patient phenotype, however, a true association may be indistinguishable from a false one produced by chance alone (i.e., a Type I error).9 Furthermore, such a marker-by-marker approach ignores the multigenic nature of complex diseases and fails to account for possible interactions between susceptibility genes.
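The multiple-testing problem can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the study: it assumes that each SNP is tested independently at a conventional significance level of 0.05 (real SNPs are often correlated, so this is a simplification) and uses the 108 SNPs surveyed in the study discussed here. The function names are hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests,
    assuming no true association exists for any of them."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / num_tests


NUM_SNPS = 108  # number of SNPs surveyed in the study
fwer = familywise_error_rate(NUM_SNPS)
threshold = bonferroni_threshold(NUM_SNPS)

print(f"Chance of at least one spurious hit: {fwer:.3f}")
print(f"Bonferroni per-SNP threshold: {threshold:.5f}")
```

Under these assumptions a spurious association is almost certain to appear somewhere among the markers, while a correction as strict as Bonferroni's makes the modest effect of any single SNP hard to detect. Both observations motivate analyzing the markers jointly.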

Sebastiani and colleagues offer a viable alternative approach to disease association analyses that uses a machine-learning method derived from the field of artificial intelligence. This approach, based on Bayesian networks, allows one to inferentially explore previously undetermined relationships among genetic and clinical variables, and describe these relationships, once identified. The model integrates the relationships between multiple SNPs, clinical variables and phenotype, and so overcomes many of the limitations of current statistical approaches to disease association studies.

To identify SNPs contributing to stroke risk in SCA, the authors surveyed 108 SNPs distributed over 39 candidate genes in an unselected population of 1398 African-American adults with SCA. They used the Bayesian network algorithm to analyze genotyping results from 92 subjects with documented clinical stroke and 1306 subjects without stroke. An overall ‘dependency network’ was derived from the joint probabilities of the interdependent relationships between SNPs, clinical variables and stroke.
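How a network of this kind turns marker values into a stroke probability can be shown with a toy example. The sketch below is not the authors' model: it uses the simplest possible network structure, in which stroke status is the only parent of each marker, and applies Bayes' theorem. Only the prior (92 strokes among 1398 subjects) comes from the study; the marker names `snp_a` and `snp_b`, the HbF categories and every conditional probability are hypothetical values chosen for illustration.

```python
from math import prod

# Prior probability of stroke, from the cohort counts reported in the study.
PRIOR_STROKE = 92 / 1398

# HYPOTHETICAL conditional probability tables: P(marker value | stroke status).
# The marker names and all numbers below are invented for illustration.
CPT = {
    "snp_a": {True: {"risk": 0.60, "other": 0.40},
              False: {"risk": 0.30, "other": 0.70}},
    "snp_b": {True: {"risk": 0.50, "other": 0.50},
              False: {"risk": 0.25, "other": 0.75}},
    "hbf": {True: {"low": 0.70, "high": 0.30},
            False: {"low": 0.45, "high": 0.55}},
}


def posterior_stroke(evidence: dict[str, str]) -> float:
    """P(stroke | evidence), with markers conditionally independent given stroke."""
    with_stroke = PRIOR_STROKE * prod(
        CPT[marker][True][value] for marker, value in evidence.items()
    )
    without_stroke = (1.0 - PRIOR_STROKE) * prod(
        CPT[marker][False][value] for marker, value in evidence.items()
    )
    return with_stroke / (with_stroke + without_stroke)


high_risk = posterior_stroke({"snp_a": "risk", "snp_b": "risk", "hbf": "low"})
low_risk = posterior_stroke({"snp_a": "other", "snp_b": "other", "hbf": "high"})
print(f"prior {PRIOR_STROKE:.3f}, high-risk {high_risk:.3f}, low-risk {low_risk:.3f}")
```

Each marker shifts the probability only modestly in this toy model, but the shifts multiply, which is the intuition behind combining many weak markers. A full dependency network would also let markers depend on one another, which this sketch deliberately omits.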

In all, 25 SNPs among 11 biologically plausible candidate genes, including those that encode BMP6, TGFBR3, SELP and CSF2, in combination with HbF (fetal hemoglobin) level, were found to directly modulate stroke risk. Interestingly, HbF level was not found to be independently associated with stroke in previous analyses of the same study cohort.1 Another nine genes were found to be associated with stroke via interactions with direct markers.

The model was also used to predict stroke risk based on the combined effects of the genetic and clinical markers identified in the network. Although the individual contribution of each SNP was modest, the simultaneous effect of all 25 identified SNPs and their interaction with clinical variables predicted stroke risk with an accuracy of 98.5%. The results of this analysis were validated in a separate population of 114 individuals with SCA, including seven with reported stroke and 107 without stroke. The Bayesian network model correctly predicted the presence of stroke in 100% of the subjects with stroke and the absence of stroke in 98% of those without. Its overall predictive accuracy of 98.2% compares with 88%, in the same set of individuals, for a logistic regression model that identified only five of the 25 SNPs found in the Bayesian network model. However, while the ability of the Bayesian network algorithm to predict stroke in this study appears impressive, the validation of this model was based on a small sample that included only seven stroke patients.
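The weight of this caveat can be checked from the validation counts. The sketch below rests on an inference rather than on reported figures: the 98% of 107 stroke-free subjects is taken to mean 105 correct predictions, which reproduces the overall accuracy of 98.2%. It then computes the exact (Clopper-Pearson) lower confidence bound for sensitivity, which has a closed form when every case is detected. The variable and function names are illustrative.

```python
# Validation counts. The stroke counts are reported in the study; the split of
# the 107 stroke-free subjects into 105 correct and 2 incorrect is INFERRED
# from the reported 98% and the overall accuracy of 98.2%.
TRUE_POS, FALSE_NEG = 7, 0
TRUE_NEG, FALSE_POS = 105, 2

total = TRUE_POS + FALSE_NEG + TRUE_NEG + FALSE_POS
sensitivity = TRUE_POS / (TRUE_POS + FALSE_NEG)
specificity = TRUE_NEG / (TRUE_NEG + FALSE_POS)
accuracy = (TRUE_POS + TRUE_NEG) / total


def exact_lower_bound_all_successes(n: int, confidence: float = 0.95) -> float:
    """Clopper-Pearson lower bound when all n trials succeed:
    the proportion p that solves p**n = (1 - confidence) / 2."""
    tail = (1.0 - confidence) / 2.0
    return tail ** (1.0 / n)


sens_lower = exact_lower_bound_all_successes(TRUE_POS + FALSE_NEG)

print(f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}, "
      f"accuracy {accuracy:.1%}")
print(f"95% lower bound on sensitivity: {sens_lower:.1%}")
```

With only seven stroke cases, a perfect observed sensitivity is still compatible with a true sensitivity of roughly 59%, so the headline figure should be read with that uncertainty in mind.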

The limited availability of large numbers of well-characterized cases and controls represents another challenge in the study of genetic markers in SCA. By using previously collected samples and clinical data from a representative national cohort of individuals with SCA, this study exemplifies how biological sample repositories linked to clinical databases may be efficiently used to successfully perform large disease association studies. However, because cerebrovascular disease in SCA is heterogeneous, manifested as ischemic stroke, intracranial hemorrhage and silent infarction, rigorous phenotypic characterization of cases and controls is imperative.

The lack of available clinical and neuroimaging data needed for optimal phenotypic classification limited this study. Despite these limitations, Sebastiani and co-workers have powerfully demonstrated that multiple SNP sites from different genes over distant parts of the genome are better at identifying overt stroke in SCA than any single SNP or previously identified clinical variable alone. Their results highlight the combined influence of several candidate susceptibility genes on stroke and suggest biological pathways to be explored in future mechanistic studies.

The potential utility of the Bayesian network algorithm is illustrated by the model's ability to determine accurately the relative genetic and clinical effects on stroke risk, find the most probable combination of genetic variants leading to stroke and predict an individual's odds for developing stroke given his/her genotypic profile. As more candidate SNPs and clinical markers are identified, this predictive algorithm will undoubtedly become an invaluable tool in genetic association studies aimed at identifying disease susceptibility genes. The complex interactions modeled through this approach might ultimately translate into clinical benefit through early identification and targeted intervention in those individuals at greatest risk for a particular disease phenotype such as stroke. The computer may well replace the clinician in determining stroke risk, but it will be left to the clinician to apply this information in caring for the patient.