Original Article

The Pharmacogenomics Journal (2007) 7, 180–189. doi:10.1038/sj.tpj.6500414; published online 12 September 2006

Use of pairwise marker combination and recursive partitioning in a pharmacogenetic genome-wide scan

L L Warren1, A R Hughes1, E H Lai1, D V Zaykin2, S A Haneline1, A T Bansal3, A W Wooster1, W R Spreen1, J E Hernandez1, T R Scott1, A D Roses1 and M Mosteller1 on behalf of the CNA30027 and CNA30032 study teams

  1. GlaxoSmithKline, RTP, NC, USA
  2. National Institute of Environmental Health Sciences, RTP, NC, USA
  3. GlaxoSmithKline, Harlow, Essex, UK

Correspondence: Dr M Mosteller, Genetic Analysis, Genetic Research, GlaxoSmithKline, Five Moore Drive, RTP, NC 27709, USA. E-mail: mike.m.mosteller@gsk.com

Received 28 February 2006; Revised 13 July 2006; Accepted 24 July 2006; Published online 12 September 2006.

Abstract



The objective of pharmacogenetic research is to identify a genetic marker, or a set of genetic markers, that can predict how a given person will respond to a given medicine. To search for such marker combinations that are predictive of adverse drug events, we have developed and applied two complementary methods to a pharmacogenetic study of the hypersensitivity reaction (HSR) associated with treatment with abacavir, a medicine that is used to treat HIV-infected patients. Our results show that both of these methods can be used to uncover potentially useful predictive marker combinations. The pairwise marker combination method yielded a collection of marker pairs that featured a spectrum of sensitivities and specificities. Recursive partitioning results led to the genetic delineation of multiple risk categories, including those with extremely high and extremely low risk of HSR. These methods can be readily applied in pharmacogenetic candidate gene studies as well as in genome-wide scans.


Keywords: pairwise marker combinations, sensitivity, specificity, recursive partitioning, candidate genes, genome scan

Introduction



The first objective of a pharmacogenetic study of an adverse drug reaction is to demonstrate a statistical association between the adverse event and one or more genetic markers.1, 2, 3 Although such a finding is in itself noteworthy, an additional important question is whether a predictive assay based on the genetic marker(s) would be clinically useful.4, 5, 6 A number of somewhat subjective factors must be considered in answering this question including the seriousness of the adverse event, the frequency of its occurrence, the severity of the disease or condition being treated and the effectiveness of alternative therapies, if available.7, 8, 9 In addition, there are standard, objective performance measures for clinical diagnostic tests.

Usually, clinical tests are designed to distinguish people with a disease, 'cases', from people who are free from such disease, 'controls'. After the test is administered, each individual's result is determined to be either 'positive' or 'negative'. If the individuals who experience an adverse drug reaction are considered to be cases and those who do not experience an adverse event are considered to be controls, then one measure of test performance, sensitivity, is defined as the percentage of cases that test positive. A companion measure, specificity, is defined as the percentage of controls that test negative.

Ideally, for a genetics-based predictive assay to be useful, both its sensitivity and specificity need to be as close to 100% as possible. For example, if a genetic marker predictive of an adverse drug reaction has high sensitivity but low specificity, many with positive assays will actually not be at risk for the adverse event. If those with positive assays were denied treatment then many such individuals would not reap the potential benefits of the medicine although in truth they would not be at risk for the adverse event. On the other hand, if a genetic marker exhibited high specificity but low sensitivity, then among those with negative assays there may be a number of individuals who are at risk of the adverse event. What might be considered adequate levels of sensitivity and specificity will depend on the particular medicine and adverse event being evaluated.

Sensitivity and specificity can be used to compare the potential usefulness of several clinical assays. To understand what impact a genetics-based assay might have in clinical practice, estimation of its predictive values is helpful. In the context of a genetics-based assay for an adverse drug reaction, positive predictive value (PPV) can be defined as the percentage of individuals, among all those who are positive for the assay, who experience an adverse event. Similarly, negative predictive value (NPV) is the percentage of individuals, among all who are negative for the assay, who do not have the adverse drug reaction. These measures can be used to answer important questions posed by patients to their physicians. 'If my assay result is negative, indicating it is safe for me to take the medicine, what is the probability that I will have an adverse event anyway?' The answer is 100% minus the NPV. 'If my assay result is positive, what is the chance that I could take the medicine and not have the adverse event?' The answer is 100% minus the PPV. The PPV and NPV of a genetics-based assay can be estimated if the prevalence of the adverse event and the sensitivity and specificity of the assay are known.
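The relationship among these measures can be sketched in a few lines of Python. This is a minimal illustration of Bayes' theorem, not code from the study: the function names are hypothetical, and the example inputs pair the sensitivity and specificity reported later in this article for HLA-B*5701 with an assumed 5% prevalence of the adverse event.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the counts of a 2 x 2 table.

    tp, fn: cases that test positive / negative.
    tn, fp: controls that test negative / positive.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence, using Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Illustrative inputs only: the 5% prevalence is an assumption.
    ppv, npv = predictive_values(0.564, 0.991, 0.05)
    print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Because the predictive values depend on prevalence, the same assay can look quite different in populations where the adverse event is rarer or more common.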

Because adverse drug reactions are probably the result of multiple genetic – as well as environmental – factors, it may well be that to achieve desirable test characteristics many genetic assays will have to be based on multiple genetic markers rather than on a single marker. Therefore, a significant challenge for pharmacogenetic researchers is to identify and apply useful statistical methods for finding such predictive marker combinations. This paper describes two approaches for identifying combinations and how those approaches were applied to data from a genome-wide scan for markers potentially predictive of hypersensitivity reaction (HSR) in HIV-1-infected patients following treatment with abacavir (ABC).

ABC is an effective antiretroviral drug used to treat HIV-1 infection. Approximately 2–9% of patients treated with ABC develop an HSR that in rare cases has proved fatal.10, 11, 12 To identify genetic markers that are associated with the HSR, candidate gene and genome scan approaches were employed.13, 14, 15 The studies compared the frequency of genetic variants in subjects who developed presumed HSR with those who did not. Thirty-eight markers were found to be associated (P<0.05) with HSR in whites in two retrospective case–control studies. Among these replicated markers, HLA-B*5701 possessed the highest performance characteristics, with a sensitivity of 56.4% and a specificity of 99.1%. This marker was also reported to be strongly associated with ABC HSR by Mallal et al.16 and by Hughes et al.17 Although specificity was quite high, sensitivity was only moderate. As a result, marker combination analyses were conducted in an effort to identify a marker set with sufficient sensitivity and specificity to be clinically useful.

The two approaches used were pairwise marker combination and recursive partitioning (RP).18, 19 Detailed descriptions of these approaches are provided in the Methods section. The pairwise marker combination approach was used to consider genotype combinations for all marker pairs and determine whether the sensitivity and specificity of a two-marker combination represented an improvement over the characteristics of the contributing markers. RP, a data mining procedure that has the capability to uncover statistical interaction among large numbers of variables, was used to evaluate combinations of three or more markers with respect to their usefulness in estimating HSR risk. These methods can be readily applied not only in genome-wide scans but also in pharmacogenetic candidate gene studies.

Results



Pairwise marker combination analysis

The analyzed data set consisted of 118 cases and 231 controls, and 38 replicated markers. A positive two-marker combination was constructed based on 'higher risk' genotypes (the genotype that is more prevalent in cases than in controls) from single-marker analysis. Three logical combinations, 'AND', 'OR' and 'EXCLUSIVE OR', were used to combine genotypes for a marker pair (see Statistical methods section for definitions). Seven hundred and three (703) marker pairs (all the possible pairs among the 38 replicated markers) were evaluated for their association with HSR. For each pair of markers, the configuration with the lowest P-value was identified. Sensitivity and specificity estimates for these marker pairs were compared to those of individual markers, with the objective of identifying a marker pair that had better sensitivity and specificity than any of the individual markers alone.
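The count of 703 marker pairs is simply the number of unordered pairs among 38 markers. A short check, with the markers represented by hypothetical integer indices:

```python
from itertools import combinations

n_markers = 38  # replicated markers, as stated in the text
marker_pairs = list(combinations(range(n_markers), 2))

# Each unordered pair is evaluated under three logical configurations
# ('AND', 'OR', 'Exclusive OR'), so three 2 x 2 tables per pair.
n_pairs = len(marker_pairs)
n_tables = 3 * n_pairs
print(n_pairs, n_tables)
```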

Among replicated markers, HLA-B*5701 exhibited the strongest association with HSR, with a specificity of 99.1% and a sensitivity of 56.4%. Figure 1 plots the sensitivity vs the specificity for HLA-B*5701 (large square), the 37 other individually replicated markers (small squares), the combination of HLA-B*5701 and a heat shock protein marker (HSPA1L) previously reported by Martin et al.20 (large circle) and the other 702 marker pairs (small circles).

Figure 1. Plot of sensitivity vs specificity: CNA30032 white replication analysis population.

The plot illustrates that individual markers and the marker pairs feature a range of sensitivities and specificities; however, they displayed an inverse relationship. That is, markers and marker pairs with high sensitivity tended to have low specificity and vice versa. The plot is somewhat asymmetric in that there appeared to be a higher proportion of marker pairs that showed high specificity (>95%) and low/moderate sensitivity (20–60%). Among all 703 marker pairs, 355 marker pairs exhibited higher sensitivity and 45 marker pairs showed higher specificity than HLA-B*5701 alone, but no marker pair had better sensitivity and specificity than HLA-B*5701.

The HLA-B*5701–HSPA1L marker combination has been reported to have a sensitivity of 94.4% and a specificity of 99.6% in a study of 18 cases and 230 controls from 248 ABC-treated patients in Western Australia.20 In the analysis of the much larger CNA30032 white replication data set (115 cases and 227 controls), sensitivity and specificity of this marker pair were 47.8% and 99.6%, respectively (Table 1).

The specificities observed for the HLA-B*5701–HSPA1L combination were the same for the two studies, but the sensitivity observed in the GlaxoSmithKline (GSK) CNA30032 study was only about half of that observed in the study of Martin et al.,20 suggesting the possibility of important differences between the two study populations. Of particular note are the differences in the assignment of case status between the two studies. Case ascertainment in the Western Australia cohort was made by a single physician in a geographically limited patient population at a single clinical center. The original clinical assessments were subsequently supplemented with ex vivo lymphocyte stimulation and ABC skin patch testing. This process differs from the assignment of case status in the GSK studies, which was performed retrospectively and without skin patch testing. Furthermore, participants in the GSK pharmacogenetic studies were recruited at 142 clinical centers in 12 different countries. Another explanation for the differences in the performance characteristics of HLA-B*5701 between the two study populations may be imprecision in physicians' identification of ABC HSR and in the inclusion of subjects into GSK's case populations. In several double-blind clinical trials that included ABC, the incidence of clinically suspected ABC HSR was 2–3% in the treatment arms that did not include ABC, suggesting that even experienced HIV clinicians over-report ABC HSR in the clinical trial setting. These data suggest that GSK's collection of ABC HSR cases for its pharmacogenetic research may include 'cases' who did not experience ABC HSR. Inclusion of such 'non-cases' would compromise GSK's ability to identify pharmacogenetic markers associated with ABC HSR and would degrade the performance characteristics of any markers that are identified, including but not limited to HLA-B*5701.

Recursive partitioning

One thousand random trees were generated using data from 349 white subjects, including 118 cases and 231 controls (see the Methods section for descriptions of the RP methodology and the subject samples). The performance characteristics for the five most predictive RP trees are summarized in Table 2.

None of the trees produced performance characteristics with both high sensitivity and high specificity. However, all trees, except Tree III, resulted in performance characteristics slightly better than HLA-B*5701 alone, which had a sensitivity of 56.4% and a specificity of 99.1%.

As illustrated in Figure 2, the four markers included in Tree I partitioned subjects into six terminal nodes. These four markers were (1) marker no. 10019338, a polymorphism that maps to multiple locations, including one in intron 1 of the tumor necrosis factor-alpha gene, (2) HLA-B*5701 (marker no. 2791186), (3) chromosome 9 marker no. 4072881 (within the FLJ31810 gene for a neuronal leucine rich protein) and (4) chromosome 6 marker no. 3854120 (a marker with no established gene association that lies distant from the HLA-B chromosomal region of chromosome 6). The combination of genotypes that corresponded to each of these terminal nodes was designated as either 'positive' – suggesting subjects included in that node were at increased risk for experiencing HSR – or 'negative' – implying that subjects included in that node were at decreased risk for HSR. In this way, a 2 × 2 contingency table could be created with subjects cross-classified as 'cases' or 'controls' and 'positive' or 'negative' to assess how well these combinations could discriminate between cases and controls. For Tree I, the resulting composite sensitivity was 57.1% and the composite specificity was 99.1%.

Figure 2. Tree I. RP analysis of HSR in the CNA30032 white replication data set. See the Statistical methods section for a general description of RP trees. The topmost node is called the 'root node' and represents all the observations in the data set. For all nodes, n=number of observations in the node, u=the mean value of the dependent variable (in this analysis, the proportion of cases in the node), and N### is the node identifier. For non-root nodes, the marker identifier and the genotype(s) that characterize the node are displayed. For each node that has been partitioned, the P-value (p) for the indicated partition is displayed.

By assuming an HSR rate of 5% in white patients, the frequencies of the marker genotype combinations represented by the nodes of this tree were estimated. Also estimated was the risk of HSR for individuals who would be members of each terminal node (as described in the Methods section). These estimates are shown in Table 3, sorted by whether the node was terminal and then by increasing risk of HSR.

If these results were validated through replication in an independent sample and applied to a population of white ABC-treated patients, it is estimated that 17.4% of patients would be assigned to a group with a 0.2% risk of HSR. An additional 75.0% (63.6+11.4%) would be assigned to a group with a 2.7% risk of HSR. The remaining 3.6% of the patients would have an HSR risk of 21.3%, or higher, including 2.5% of all patients whose estimated risk would be 100%. In contrast to traditional diagnostic tests that typically classify patients into one of two groups, these RP results identified several genetically characterized groups with associated HSR risk ranging from very low to very high.

Discussion



When using combinations of markers to identify a potentially useful diagnostic assay, we have applied both pairwise marker combination and RP approaches. These two methods are complementary to each other. In essence, the pairwise marker combination method can be thought of as a breadth-first approach. It tries to find an assay based on two-locus marker genotype combinations that can be applied to the whole population. In contrast, RP can be thought of as a depth-first approach. It works by dividing the data into groups and consequently could be the foundation for group-specific predictive assays.

The analysis strategy we took was to perform marker combination analyses using only the replicated markers. We analyzed one data set to discover markers that were associated with presumed HSR and used a second independent data set to verify the initial association. Similar approaches have been adopted in other large-scale genetic association studies.21, 22, 23 We believe that the replication of initial findings in a second independent data set is crucial to the validation of association findings.24, 25, 26, 27 This is especially important for marker combination analyses as the number of false-positive findings would increase exponentially with the number of markers considered. Methods designed to search for pure genetic interactions have been proposed.28, 29, 30 However, we have chosen to look for combinations among markers that have demonstrable association with HSR rather than consider all available markers. Because we used only replicated markers in our analyses, it is possible we could have missed markers that do not display a strong effect on their own, but are nonetheless associated with the trait through interaction with one or more other markers. RP is a useful tool for discovering and exploring interactions among genetic markers. In addition, recent work by Bastone et al.31 has shown that RP is a more general method for the detection of high-order genotype–phenotype association. Their work has demonstrated that MDR (multifactor-dimensionality reduction),28 a method specializing in detecting gene–gene interactions, is in fact a special case of RP.

For the pairwise marker combination method, one can calculate bounds on the characteristics of a composite assay given the test characteristics of two individual markers. When a marker with high sensitivity and relatively low specificity is combined with another marker with high specificity and relatively low sensitivity, the composite assay based on these two markers would not be able to achieve high sensitivity and high specificity. In the case when the 'AND' logic combination is used, the sensitivity of the composite assay is upper bounded by the lower sensitivity of the two markers. Likewise, when two markers are combined with the 'OR', or 'Exclusive OR' logic operator, the upper bound of the specificity of the composite assay would be the lower specificity of the two markers.
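The 'AND' and 'OR' bounds follow from elementary probability, since the probability of a joint event cannot exceed the probability of either component and the probability of a union cannot fall below it. A short sketch, where H_1 and H_2 (notation introduced here for illustration) denote carrying the higher risk genotype at markers 1 and 2, and Se and Sp denote sensitivity and specificity; the 'Exclusive OR' case is not covered by this sketch:

```latex
\begin{align*}
\mathrm{Se}_{\mathrm{AND}} &= P(H_1 \cap H_2 \mid \text{case})
  \le \min(\mathrm{Se}_1, \mathrm{Se}_2), \\
\mathrm{Sp}_{\mathrm{AND}} &= 1 - P(H_1 \cap H_2 \mid \text{control})
  \ge \max(\mathrm{Sp}_1, \mathrm{Sp}_2), \\
\mathrm{Se}_{\mathrm{OR}} &= P(H_1 \cup H_2 \mid \text{case})
  \ge \max(\mathrm{Se}_1, \mathrm{Se}_2), \\
\mathrm{Sp}_{\mathrm{OR}} &= P(\bar{H}_1 \cap \bar{H}_2 \mid \text{control})
  \le \min(\mathrm{Sp}_1, \mathrm{Sp}_2).
\end{align*}
```

Read together, the four lines show the trade-off: 'AND' can only raise specificity at the cost of sensitivity, and 'OR' can only raise sensitivity at the cost of specificity.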

RP was developed expressly to explore interactions among a set of predictor variables. Because of this, it is very useful for exploring epistasis (gene–gene interaction) among a large set of genetic markers. Interaction between two marker genotypes means that when both are present in the same individual, a unique effect is seen, such as an increased susceptibility to an adverse event. The end result of combining marker genotypes in this way will be a composite assay whose sensitivity will be no greater than the lowest among the markers contributing to the combination. The composite specificity of such combinations will be no less than the maximum among the individual specificities. Thus, when an RP tree is transformed into a two-category classification system, as we have done, it will tend to feature a composite sensitivity that is similar to that of the least sensitive contributing marker and a composite specificity similar to that of the most specific contributing marker. Unfortunately, it is not possible to end up with a composite sensitivity and specificity that both exceed the sensitivity and specificity of any given marker in the combination.

Although RP may not result in an enhanced two-category classification system for pharmacogenetic application, it does present another potentially useful option. As shown in Table 3, RP can identify subsets of a patient population for which the estimated risk may be extremely low – in the case of prediction of adverse events, a protective effect – or very high. Although such an approach may not be able to categorize all patients into extreme categories, it may be able to do so for a substantial portion of the patient population. It is acknowledged that, if clinically implemented, the interpretation of results from this approach could be challenging. Nonetheless, the clinical usefulness of this type of assessment should be discussed and considered because for some adverse events it may not be possible to completely dichotomize an entire population into high- and low-risk groups.

As with any genetic finding, the need for replication of any marker combination results cannot be overemphasized. This is especially true in the context of attempting to develop an assay predictive of an adverse drug reaction.

In our study of ABC HSR, there was no combination of marker genotypes that exhibited a sensitivity and a specificity greater than those of HLA-B*5701 alone. However, the two approaches we have applied to these data represent practical strategies that can be employed to uncover useful marker combinations in pharmacogenetic studies. There is an ongoing need to investigate and assess these and other methods designed to discover predictive genetic marker combinations.


Materials and methods


Two retrospective case–control studies, CNA30027 and CNA30032, were conducted to investigate genetic polymorphisms in HIV-1-infected subjects who developed presumed HSR following treatment with ABC. In both CNA30027 and CNA30032, the diagnosis of hypersensitivity to ABC was made by the investigator or treating clinician and then reviewed by a GSK project physician for consistency with the agreed definition of a presumed hypersensitivity case. HIV-infected subjects who tolerated ABC for at least 6 weeks without evidence of an HSR were enrolled as controls.

Participants in clinical studies are requested to designate their ethnic origin as 'Black', 'White', 'Asian', or 'Other'. Approximately 73% of the subjects in CNA30027 and CNA30032 classified themselves as 'White'. Studying a sample that is ethnically heterogeneous can lead to the appearance of false-positive genetic associations. To avoid this possibility, we have analyzed subjects of different ethnic origins separately. This paper summarizes analyses conducted using data from the white subjects, which constituted the largest of the ethnic groups.

Subject data from the two studies contributed to 'discovery' sample sets on which genome scans were conducted. From CNA30027, 130 white subjects contributed to discovery sets. Of these subjects, 121 (93%) were male and their median age was 42 years. Contributing to discovery sets from CNA30032 were 368 white subjects, 307 (83%) of whom were male and whose median age was 42 years. A 'replication' set of 349 white CNA30032 subjects (76% male, median age 42 years) was used to reassess the associations found using the discovery sets. The cases in the replication set satisfied the 'restrictive' case definition because they were designated as definite or probable cases during a second review of their medical records by GSK physicians and none had ever been treated with a non-nucleoside reverse transcriptase inhibitor, a drug class known to cause skin reactions similar to those that characterize ABC HSR. None of the 349 subjects in the replication data set was included in any of the discovery sets.

Genetic markers

Genome-wide scans were conducted including high-density genotyping of approximately 105,000 single nucleotide polymorphisms (SNPs) in subjects from CNA30027 (n=60, 42 cases, 18 controls) and CNA30032 (n=210, 99 cases, 104 controls) and approximately 1.7 million SNPs in pooled samples from CNA30032 (n=369, 200 cases, 169 controls). From these analyses, 1659 markers were selected for assay and analysis in a single, larger, sample set from CNA30027 (n=177, 71 cases, 106 controls) and CNA30032 (n=499, 263 cases, 236 controls). Among the 1659 markers, 814 markers were found to be statistically associated with HSR in white subjects (P<0.05 in genotypic or allelic association analysis). The 814 markers were then evaluated in an independent set of samples (the replication set) reserved from white subjects in CNA30032 (n=349, 118 cases, 231 controls) and, when evaluated for association with HSR, the results for 32 of them were statistically significant.

The 32 markers resulting from replication of genome scan discoveries, plus six markers found during candidate gene studies, were chosen as potential contributors to polygenic or epistatic effects leading to susceptibility to ABC HSR. When choosing the marker set for combination analysis, a decision was made to only use markers showing association in both the discovery and replication subject sets.

Statistical methods

Pairwise marker combination analysis

When markers were analyzed individually, the genotype (or pair of genotypes) that resulted in the largest association statistic and that was more frequent in cases than in controls was determined and referred to as the 'higher risk' genotype. The other genotypes (or genotype) were referred to as the 'lower risk' genotype. The idea behind the pairwise marker combination approach is to systematically assess combinations of the risk groups from two markers. The goal is to identify marker combinations that may possess better sensitivity and specificity than individual markers alone. For two markers, there are four possible combinations of these risk groups, as depicted in Table 4.

Of the ways that the higher and lower risk genotypes can be configured into positive and negative groups, the logic combinations 'AND', 'OR' and 'Exclusive OR' were considered.

  • 'AND' Positive=Higher risk genotype for marker 1 and marker 2
    Negative=Lower risk genotype for marker 1 or marker 2
  • 'OR' Positive=Higher risk genotype for marker 1 or marker 2 (or both)
    Negative=Lower risk genotype for marker 1 and marker 2
  • 'Exclusive OR' Positive=Higher risk genotype for marker 1 or marker 2 (but not both)
    Negative=Lower risk genotype for marker 1 and marker 2 or
    Higher risk genotype for marker 1 and marker 2

Thus, for each marker pair, three contingency tables were evaluated and the configuration with the lowest P-value for association with HSR was identified and summarized. In genetics terms, 'AND' would correspond to genetic interaction, while 'OR' and 'Exclusive OR' would correspond to genetic heterogeneity.
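The procedure for a single marker pair can be sketched as follows. This is an illustrative outline rather than the analysis code used in the study: the function names and the data layout (parallel lists of booleans) are hypothetical, and the Pearson chi-square test without continuity correction is an assumption, since the text does not name the test used for the pairwise tables.

```python
import math


def chi2_p_value(a, b, c, d):
    """Pearson chi-square P-value (1 df) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


RULES = {
    "AND": lambda h1, h2: h1 and h2,
    "OR": lambda h1, h2: h1 or h2,
    "Exclusive OR": lambda h1, h2: h1 != h2,
}


def evaluate_pair(high1, high2, is_case):
    """Score the three configurations for one marker pair.

    high1, high2: per-subject booleans, True if the subject carries the
    higher risk genotype at marker 1 / marker 2.
    is_case: per-subject booleans, True for cases, False for controls.
    Returns (rule, sensitivity, specificity, p_value) for the configuration
    with the lowest P-value.
    """
    results = []
    for name, rule in RULES.items():
        positive = [rule(h1, h2) for h1, h2 in zip(high1, high2)]
        tp = sum(p and c for p, c in zip(positive, is_case))
        fn = sum((not p) and c for p, c in zip(positive, is_case))
        fp = sum(p and (not c) for p, c in zip(positive, is_case))
        tn = sum((not p) and (not c) for p, c in zip(positive, is_case))
        p_value = chi2_p_value(tp, fn, fp, tn)
        results.append((name, tp / (tp + fn), tn / (tn + fp), p_value))
    return min(results, key=lambda r: r[3])
```

Note that the P-value alone does not encode the direction of association, so in practice the selected configuration would also be checked to confirm that positives are enriched among cases.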

Recursive partitioning

RP is a data mining tool for automatic identification of interactions and homogeneous groups through repeated (recursive) application of a statistical test to the data. The test contrasts the value of a dependent variable (Y) among two or more classes of one of the predictor variables (Xi). As applied here, the dependent variable, Y, was the binary classification of each subject as a case or control, the predictor variables were the 38 replicated markers, and the classes of each marker were the higher risk and lower risk genotypes. The statistical test applied was based on Pearson's χ² statistic. If the test result indicates that the distribution of the dependent variable differs between the classes of, say, X1, then the data set is partitioned into two subsets. The procedure is repeated within each of the subsets defined by the classes of X1. If, within one of these subsets, another predictor variable, say, X2, could be used to partition the data, the test for association between Y and X2 would be conditional on the values of X1. If X1 and X2 jointly influence the distribution of Y, then the association of X2 and Y could be highly significant. When it is possible to further partition the data, the procedure is repeated using a third predictor variable, and so on, until the sample size is exhausted or no dependence is established between the distribution of Y and any predictor. HelixTree software32 (Golden Helix Inc., Bozeman, MT, USA) was used to perform RP analysis. Sample output of an RP analysis is shown in Figure 3.
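The recursion described above can be sketched in plain Python. This is a simplified greedy version that always splits on the most significant marker; it is not the HelixTree implementation. The data layout (one dict per subject), the function names, and the `min_size` stopping rule are assumptions made for illustration, while the 0.10 threshold mirrors the partitioning condition stated in this section.

```python
import math


def chi2_p_value(a, b, c, d):
    """Pearson chi-square P-value (1 df) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return math.erfc(math.sqrt(chi2 / 2.0))


def grow_tree(subjects, markers, alpha=0.10, min_size=2):
    """Recursively partition subjects on binary markers.

    subjects: list of dicts with key "case" (1 or 0) and one boolean per
    marker name (True = higher risk genotype).
    Returns a nested dict; nodes without a "marker" key are terminal.
    """
    n = len(subjects)
    node = {"n": n, "u": sum(s["case"] for s in subjects) / n}
    best = None
    for m in markers:
        hi = [s for s in subjects if s[m]]
        lo = [s for s in subjects if not s[m]]
        if len(hi) < min_size or len(lo) < min_size:
            continue
        a = sum(s["case"] for s in hi)  # cases with the higher risk genotype
        c = sum(s["case"] for s in lo)  # cases with the lower risk genotype
        p = chi2_p_value(a, len(hi) - a, c, len(lo) - c)
        if best is None or p < best[0]:
            best = (p, m, hi, lo)
    if best is not None and best[0] <= alpha:
        p, m, hi, lo = best
        rest = [x for x in markers if x != m]
        # Tests in the daughter nodes are conditional on the split just made.
        node.update(marker=m, p=p,
                    high=grow_tree(hi, rest, alpha, min_size),
                    low=grow_tree(lo, rest, alpha, min_size))
    return node
```

Each terminal node of the returned tree corresponds to one combination of marker genotypes, obtained by following the `high` and `low` branches back to the root.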

Figure 3. An example of an RP tree. Each box is called a 'node' and represents a subset of the observations in the entire data set. For each node, several descriptive statistics are displayed, including the number of observations (n) and the mean value of the dependent variable (u). When cases are coded as '1' and controls as '0', the node mean is equal to the proportion of cases in the node. The topmost node is called the 'root node' and represents all the observations in the data set. For each node that has been partitioned, the P-value (p) for the indicated partition is displayed. Each partitioned ('parent') node is connected to its 'daughter nodes' by family pedigree-like lines. At the top of each daughter node, the variable (in this example, genetic marker ID numbers) used to partition the data and the variable value that defines the node are shown. Unless there are missing data for the partitioning variable, the sum of the observations in the daughter nodes equals the number of observations in the parent node. The nodes that are not partitioned are called the 'terminal nodes'. Each terminal node can be characterized by a unique combination of marker genotypes. Node N22 consists of the 56 subjects with genotype C_T or C_C for marker 3911370 and who carry one or two 5701+ alleles for marker 2791186. These subjects had a mean response of 0.98, which was considerably higher than the overall mean of 0.34.

The software is capable of performing analyses using an automatic algorithm in which the most significant predictor variable is used to partition each node, as well as creating 'user-guided' trees in which the splitting variable, Xi, can be specified by the user at all levels of data partitioning. However, its ability to generate multiple random trees was the primary feature used in this analysis. In this mode, partitioning of a particular node is carried out using a predictor that is randomly selected among those which result in a significant test statistic. Evaluation of many such 'random' trees generated in this fashion leads to an understanding of which predictor variables are interacting, or are correlated, with each other. RP modeling was conducted under the following conditions: (1) the data were partitioned only if the associated P-value was 0.10 or less, (2) 1000 random trees were generated, and (3) each predictor was randomly selected from among the 10 most significant predictors.
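The random selection step under conditions (1) and (3) might look like the following sketch. The function name and arguments are hypothetical and the details of how HelixTree draws its predictors are not given in the text, so this only illustrates the stated rule: keep predictors with a P-value of 0.10 or less, restrict to the 10 most significant, and pick one at random.

```python
import random


def pick_split(p_values, alpha=0.10, top_k=10, rng=random):
    """Pick a splitting predictor at random from the eligible candidates.

    p_values: dict mapping predictor name to its P-value at the current node.
    Returns None when no predictor meets the threshold, in which case the
    node is left as a terminal node.
    """
    eligible = sorted((p, name) for name, p in p_values.items() if p <= alpha)
    candidates = [name for _, name in eligible[:top_k]]
    return rng.choice(candidates) if candidates else None
```

Repeating the tree-growing procedure with this selection rule, once per tree, would yield a collection of random trees that can then be ranked.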

A measure of how well a set of predictors partitions a data set begins with calculating the sampling variance for each of the terminal nodes. The square root of the weighted average of the terminal node sampling variances is referred to as the tree's root mean square (RMS) error. The most predictive among a set of randomly generated trees are those with lowest RMS error. After identifying trees of interest – those with low RMS error – each was evaluated for use as the basis for a predictive assay. The genotype combination characterizing a terminal node was considered 'positive' (indicating an increased risk of HSR) if the proportion of cases was greater than the proportion of cases in the root node and 'negative' if otherwise. A 2 × 2 table reflecting the number of cases and controls that would be classified as positive and negative was constructed and the corresponding test characteristics and predictive values were calculated. Table 5 shows how the tree depicted in Figure 3 would be summarized.
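Both steps can be sketched from a list of terminal nodes. The sketch assumes that each node is summarized by its size n and case proportion u, that the within-node variance of the 0/1 outcome is taken as u(1 - u), and that node size is the weight; the exact variance estimator and weighting used by the software are not stated in the text, and the function names are hypothetical.

```python
import math


def rms_error(terminal_nodes):
    """Square root of the size-weighted average of within-node variances.

    terminal_nodes: list of (n, u) pairs, where n is the node size and u is
    the proportion of cases in the node.
    """
    total = sum(n for n, _ in terminal_nodes)
    weighted = sum(n * u * (1.0 - u) for n, u in terminal_nodes)
    return math.sqrt(weighted / total)


def tree_as_assay(terminal_nodes, root_u):
    """Collapse terminal nodes into a 2 x 2 table.

    A node is 'positive' if its case proportion exceeds that of the root
    node, and 'negative' otherwise. Returns (sensitivity, specificity).
    """
    tp = fn = fp = tn = 0.0
    for n, u in terminal_nodes:
        cases, controls = n * u, n * (1.0 - u)
        if u > root_u:
            tp += cases
            fp += controls
        else:
            fn += cases
            tn += controls
    return tp / (tp + fn), tn / (tn + fp)
```

For a made-up tree with terminal nodes `[(50, 0.9), (100, 0.2), (50, 0.1)]` and a root case proportion of 0.35, only the first node is called positive, so the composite assay inherits its high specificity but misses the cases that fall in the other two nodes.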

Although cases and controls were not sampled at random, the probability that an individual will develop HSR, given his or her multilocus genotype, can be calculated using the RP tree results, as shown by Zaykin and Young.33 The individuals who comprise a given node in an RP tree are those who carry a specific combination of marker genotypes. The combination is defined by tracing the node back up through the tree to the root node. If the nodes along such a branch are numbered from zero beginning with the root node, then the HSR risk for individuals in the ith node is given by

[Equation not reproduced in this text version.]

where ω is the HSR prevalence among all treated patients, n_i is the number of individuals comprising node i, u_i is the proportion of individuals in node i who are cases, m_i is the number of individuals comprising the sister node, and v_i is the proportion of individuals in the sister node who are cases.
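Because the equation itself is not available in this text, the expression below is a reconstruction from Bayes' theorem using the symbols just defined. It should be read as one form consistent with those definitions, not necessarily the form printed in the article or in Zaykin and Young.33 The symbols R_i, a_j and b_j are introduced here for illustration: a_j is the fraction of the parent node's cases that fall into node j, and b_j is the corresponding fraction of the parent node's controls.

```latex
\[
R_i \;=\;
\frac{\omega \prod_{j=1}^{i} a_j}
     {\omega \prod_{j=1}^{i} a_j \;+\; (1-\omega) \prod_{j=1}^{i} b_j},
\qquad
a_j = \frac{n_j u_j}{n_j u_j + m_j v_j},
\quad
b_j = \frac{n_j (1-u_j)}{n_j (1-u_j) + m_j (1-v_j)}
\]
```

For the root node (i = 0) both products are empty, so the expression reduces to the prevalence ω, as expected.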

For example, for the tree pictured in Figure 2, the HSR risk for node N22 is calculated by indexing node N2 as node 1 (i=1) and node N22 as node 2 (i=2), and assuming that the HSR prevalence among treated patients (ω) is 0.05.

[Worked equation not reproduced in this text version.]
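The node counts from Figure 2 are not available in this text, so the calculation can only be illustrated with hypothetical numbers. The sketch below applies Bayes' theorem along the path from the root to a second-level node, using the symbols n, u, m and v as defined above; the function name and all counts are assumptions.

```python
def node_risk(path, prevalence):
    """Estimate the risk of the adverse event for the last node on a path.

    path: list of (n, u, m, v) tuples, one per split from the root down,
          where n, u describe the node on the path and m, v its sister.
    prevalence: assumed prevalence of the event among treated patients.
    """
    case_frac = 1.0      # estimated P(genotype combination | case)
    control_frac = 1.0   # estimated P(genotype combination | control)
    for n, u, m, v in path:
        case_frac *= (n * u) / (n * u + m * v)
        control_frac *= (n * (1 - u)) / (n * (1 - u) + m * (1 - v))
    numerator = prevalence * case_frac
    return numerator / (numerator + (1 - prevalence) * control_frac)

# Hypothetical tree: root holds 100 cases and 100 controls.
# Split 1: node 1 has 60 cases / 20 controls, its sister 40 / 80.
# Split 2: node 2 has 27 cases / 3 controls, its sister 33 / 17.
path_to_node_2 = [(80, 0.75, 120, 40 / 120),
                  (30, 0.90, 50, 0.66)]
risk = node_risk(path_to_node_2, prevalence=0.05)
```

With these made-up counts, 27% of cases but only 3% of controls reach node 2, so the estimated risk in that node is several times the assumed 5% prevalence.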

The population frequency of the marker genotype combination defined by a specific RP node can also be estimated from a non-random sample of cases and controls. Let N and u denote, respectively, the sample size and the proportion of cases in the root node and, as above, let n_i and u_i denote the sample size and proportion of cases in any other node in the tree. The population frequency for a node can be estimated as

[Equation not reproduced in this text version.]

where ω is defined as above.
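Since the equation is not available here, the following sketch uses one estimator that is consistent with the definitions of N, u, n_i, u_i and ω: the frequency of the node among cases and among controls is weighted by ω and 1 − ω, giving (n_i/N)[ω u_i/u + (1 − ω)(1 − u_i)/(1 − u)]. This form is an assumption reconstructed from the text, and the counts are hypothetical.

```python
def node_population_frequency(n_i, u_i, N, u, prevalence):
    """Estimate how common a node's genotype combination is in the population.

    n_i, u_i: size of the node and proportion of cases in it.
    N, u:     size of the root node and proportion of cases in it.
    prevalence: assumed prevalence of the event among treated patients.
    """
    freq_in_cases = (n_i * u_i) / (N * u)                # share of all cases
    freq_in_controls = (n_i * (1 - u_i)) / (N * (1 - u)) # share of all controls
    return prevalence * freq_in_cases + (1 - prevalence) * freq_in_controls

# Hypothetical node: 30 individuals, 90% cases, in a sample of 200 (50% cases).
freq = node_population_frequency(n_i=30, u_i=0.9, N=200, u=0.5, prevalence=0.05)
```

In this example the node makes up 15% of the case-control sample but an estimated 4.2% of treated patients, because cases are heavily over-represented in the sample.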



Duality of Interest

None declared.



  1. Lindpaintner K. Pharmacogenetics and the future of medical practice. J Mol Med 2003; 81(3): 141–153.
  2. Roses AD. Pharmacogenetics and drug development: the path to safer and more effective drugs. Nat Rev Genet 2004; 5(9): 645–656.
  3. Schmith VD, Campbell DA, Sehgal S, Anderson WH, Burns DK, Middleton LT et al. Pharmacogenetics and disease genetics of complex diseases. Cell Mol Life Sci 2003; 60(8): 1636–1646.
  4. Evans WE, Relling MV. Moving towards individualized medicine with pharmacogenomics. Nature 2004; 429(6990): 464–468.
  5. Goldstein DB. Pharmacogenetics in the laboratory and the clinic. N Engl J Med 2003; 348(6): 553–556.
  6. Guzey C, Spigset O. Genotyping as a tool to predict adverse drug reactions. Curr Top Med Chem 2004; 4(13): 1411–1421.
  7. Hosford DA, Lai EH, Riley JH, Xu CF, Danoff TM, Roses AD. Pharmacogenetics to predict drug-related adverse events. Toxicol Pathol 2004; 32(Suppl 1): 9–112.
  8. Pirmohamed M, Park BK. Genetic susceptibility to adverse drug reactions. Trends Pharmacol Sci 2001; 22(6): 298–305.
  9. Hall ST, Abbott N, Schmith G, Brazell C. Pharmacogenetics in drug development: regulatory and clinical considerations. Drug Dev Res 2004; 62: 102–111.
  10. Cutrell AG, Hernandez JE, Fleming JW, Edwards MT, Moore MA, Brothers CH et al. Updated clinical risk factor analysis of suspected hypersensitivity reactions to abacavir. Ann Pharmacother 2004; 38(12): 2171–2172.
  11. Hetherington S, McGuirk S, Powell G, Cutrell A, Naderer O, Spreen B et al. Hypersensitivity reactions during therapy with the nucleoside reverse transcriptase inhibitor abacavir. Clin Ther 2001; 23(10): 1603–1614.
  12. Cutrell A, Hernandez JE, Edwards M, Fleming J, Brothers C, Powell W et al. Clinical risk factors for hypersensitivity reactions to abacavir: retrospective analysis of over 8000 subjects receiving abacavir in 34 clinical trials. 43rd Interscience Conference on Antimicrobial Agents and Chemotherapy, 2003, Chicago, IL USA, Abstract H-2013, 2005.
  13. Hughes AR, Mosteller M, Bansal AT, Davies K, Haneline SA, Lai EH et al. Association of genetic variations in HLA-B region with hypersensitivity to abacavir in some, but not all, populations. Pharmacogenomics 2004; 5(2): 203–211.
  14. Hughes AR, Mosteller M, Warren LL, Gatherum A, Scott T, Spreen WR. Key findings from the analysis of candidate gene markers to two retrospective, case–control studies to investigate genetic polymorphisms in HIV Infected subjects who developed hypersensitivity following treatment with abacavir. 2003 GlaxoSmithKline Report RJ2003/00003/00.
  15. Hughes AR, Haneline S, Hernandez JE, Mosteller M, Scott T, Warren LL et al. Key findings from the analysis of candidate gene markers and genome-wide single nucleotide polymorphisms (SNPs) from two retrospective, case–control studies to investigate genetic polymorphisms in HIV infected subjects who developed hypersensitivity following treatment with abacavir. 2004 GlaxoSmithKline Report RJ2004/00004/00.
  16. Mallal S, Nolan D, Witt C, Masel G, Martin AM, Moore C et al. Association between presence of HLA-B*5701, HLA-DR7, and HLA-DQ3 and hypersensitivity to HIV-1 reverse-transcriptase inhibitor abacavir. Lancet 2002; 359(9308): 727–732.
  17. Hughes DA, Vilar FJ, Ward CC, Alfirevic A, Park BK, Pirmohamed M. Cost-effectiveness analysis of HLA B*5701 genotyping in preventing abacavir hypersensitivity. Pharmacogenetics 2004; 14(6): 335–342.
  18. Hawkins DM, Young SS, Rusinko AI. Analysis of large structure activity data set using recursive partitioning. Quant Struc Act-Relat 1997; 16: 296–302.
  19. Young SS, Ge N. Recursive partitioning analysis of complex disease pharmacogenetic studies. I. Motivation and overview. Pharmacogenomics 2005; 6(1): 65–75.
  20. Martin AM, Nolan D, Gaudieri S, Almeida CA, Nolan R, James I et al. Predisposition to abacavir hypersensitivity conferred by HLA-B*5701 and a haplotypic Hsp70-Hom variant. Proc Natl Acad Sci USA 2004; 101(12): 4180–4185.
  21. Lin MT, Storer B, Martin PJ, Tseng LH, Gooley T, Chen PJ et al. Relation of an interleukin-10 promoter polymorphism to graft-versus-host disease and survival after hematopoietic-cell transplantation. N Engl J Med 2003; 349(23): 2201–2210.
  22. Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T et al. Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat Genet 2002; 32(4): 650–654.
  23. Kammerer S, Roth RB, Reneland R, Marnellos G, Hoyal CR, Markward NJ et al. Large-scale association study identifies ICAM gene region as breast and prostate cancer susceptibility locus. Cancer Res 2004; 64(24): 8906–8910.
  24. Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG. Replication validity of genetic association studies. Nat Genet 2001; 29(3): 306–309.
  25. Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 2003; 33(2): 177–182.
  26. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med 2002; 4(2): 45–61.
  27. Vieland VJ. The replication requirement. Nat Genet 2001; 29(3): 244–245.
  28. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 2001; 69(1): 138–147.
  29. Jannot AS, Essioux L, Reese MG, Clerget-Darpoux F. Improved use of SNP information to detect the role of genes. Genet Epidemiol 2003; 25(2): 158–167.
  30. Nelson MR, Kardia SL, Ferrell RE, Sing CF. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res 2001; 11(3): 458–470.
  31. Bastone L, Reilly M, Rader DJ, Foulkes AS. MDR and PRP: a comparison of methods for high-order genotype-phenotype associations. Hum Hered 2004; 58(2): 82–92.
  32. Lambert CG. HelixTree® Genetics Analysis Software. Golden Helix, Inc. http://www.goldenhelix.com, 2005.
  33. Zaykin DV, Young SS. Large recursive partitioning analysis of complex disease pharmacogenetic studies. II. Statistical considerations. Pharmacogenomics 2005; 6(1): 77–89.


This research would not have been possible without the participation of literally thousands of individuals. We thank the patients and the healthcare providers who generously contributed to these studies as well as the dedicated GSK staff who facilitated the conduct of these studies and provided critical laboratory and bioinformatic support. For a list of all the participating investigators, refer to the supplementary information. This work was supported in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences.

Supplementary Information accompanies the paper on The Pharmacogenomics Journal website (http://www.nature.com/tpj)


