Original Article

Bone Marrow Transplantation (2012) 47, 217–226; doi:10.1038/bmt.2011.56; published online 28 March 2011


Identification by random forest method of HLA class I amino acid substitutions associated with lower survival at day 100 in unrelated donor hematopoietic cell transplantation

S R Marino1, S Lin2, M Maiers3, M Haagenson4, S Spellman3, J P Klein5, T A Binkowski6, S J Lee7 and K van Besien8

  1. Department of Pathology, University of Chicago Medical Center, Chicago, IL, USA
  2. Department of Health Studies, University of Chicago, Chicago, IL, USA
  3. National Marrow Donor Program, Minneapolis, MN, USA
  4. Statistical Center for International Blood and Marrow Transplant Research, Minneapolis, MN, USA
  5. Medical College of Wisconsin, Milwaukee, WI, USA
  6. Center for Structural Genomics of Infectious Diseases and Midwest Center for Structural Genomics, Argonne National Laboratory, Argonne, IL, USA
  7. Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
  8. Department of Medicine, University of Chicago Medical Center, Chicago, IL, USA

Correspondence: Dr SR Marino, Department of Pathology, University of Chicago Medical Center, MC 0006, 5841 South Maryland Avenue, Chicago, IL 60637-1470, USA. E-mail: smarino@bsd.uchicago.edu

Received 6 April 2010; Revised 13 December 2010; Accepted 26 January 2011

Abstract



The identification of important amino acid substitutions associated with low survival in hematopoietic cell transplantation (HCT) is hampered by the large number of observed substitutions compared with the small number of patients available for analysis. Random forest analysis is designed to address these limitations. We studied 2107 HCT recipients with good or intermediate risk hematological malignancies to identify HLA class I amino acid substitutions associated with reduced survival at day 100 post transplant. Random forest analysis and traditional univariate and multivariate analyses were used. Random forest analysis identified amino acid substitutions in 33 positions that were associated with reduced 100 day survival, including HLA-A 9, 43, 62, 63, 76, 77, 95, 97, 114, 116, 152, 156, 166 and 167; HLA-B 97, 109, 116 and 156; and HLA-C 6, 9, 11, 14, 21, 66, 77, 80, 95, 97, 99, 116, 156, 163 and 173. Of these, 13 had been previously reported by other investigators using classical biostatistical approaches. Using the same data set, traditional multivariate logistic regression identified only five amino acid substitutions associated with lower day 100 survival. Random forest analysis is a novel statistical methodology for analysis of HLA mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods.


Keywords: random forest analysis; HLA matching; amino acid substitutions; unrelated donor; hematopoietic cell transplantation

Introduction



Unrelated donor hematopoietic cell transplantation (HCT) is an established treatment option for patients with hematological malignancies who lack an HLA-identical sibling. Approximately 70% of unrelated donor transplants in 2009 facilitated by the US National Marrow Donor Program (NMDP) used donors who were HLA matched with the recipient; the other 30% had at least one HLA mismatch. HLA mismatches are a major barrier to successful long-term outcome in HCT; even a single Ag or allele mismatch has a significant effect on graft survival and particularly on incidence and severity of GvHD.1, 2, 3, 4, 5 Although the molecular basis of allorecognition in GvHD and cellular graft rejection is not completely understood,6, 7 isolated reports have shown that a single amino acid substitution between mismatched HLA alleles at a critical location can have an important role in acute GvHD8 and graft rejection.9 However, long-term survival after HCT is likely influenced not by a single mismatch but by multiple interacting mismatches as well as by patient and donor clinical characteristics and biological factors.

Mismatched Ags and alleles differ in the number, type and location of mismatched amino acids on the structure of the HLA molecule. Some substitutions may alter the peptide binding capability of the HLA molecule, whereas others may be irrelevant. It is likely that substitutions on the HLA molecules with altered peptide binding capacity that affect T-cell allorecognition underlie the varying clinical severity of GvHD and transplant outcomes associated with HLA-mismatched transplantation. Studies focused on the identification of amino acid substitutions associated with adverse outcomes are scarce10, 11 and in conflict with functional studies.12, 13 Furthermore, these studies used traditional statistical techniques which have a limited ability to simultaneously analyze the effect of a large number of unordered categorical risk factors, side-chain variability at each amino acid position, and their potential interactions.

The purpose of this study was to identify HLA amino acid substitutions that are associated with lower survival at day 100 post transplant (D100S) using a novel statistical methodology referred to as random forest analysis.14, 15 Random forest analysis is a computationally intensive method that uses a recursive partitioning algorithm to build individual prediction trees from randomly sampled subsets of data. It automatically accounts for interactions among a large number of potential predictors of HCT outcome.16 Although random forest analysis has not been used to analyze HLA data in unrelated transplantation before, this type of analysis has been shown to be extremely powerful and robust in the analysis of data sets with a ‘large p and small n’, data sets where the number of predictor variables (p) is large, but the number of cases (n) is relatively small. In comparative analysis of discrimination methods for gene array expression data, it has consistently been shown to be superior or at least equivalent to other methods.17, 18, 19


Patients and methods


The study was based on a data set of 3855 patient–donor pairs facilitated by the NMDP between 1988 and 2003. All surviving recipients included in this data set were retrospectively contacted and provided informed consent for participation in the NMDP research program. Approximately 4% of surviving patients did not provide consent for research. To adjust for the potential bias introduced by exclusion of non-consenting surviving patients, a sampling process randomly excluded approximately the same percentage of deceased patients using a biased coin randomization with exclusion probabilities based on characteristics associated with not providing consent for use of the data in survivors.2 The final study population consisted of 2107 patients with good or intermediate risk hematological malignancies who underwent allogeneic HCT from HLA-matched or single HLA class I allele or Ag-mismatched unrelated donors. Good risk was defined as AML and ALL in first CR, CML in first chronic phase, and myelodysplastic syndrome subtype refractory anemia. Intermediate risk was defined as AML and ALL in second or subsequent CR or in first relapse, and CML in accelerated phase or second chronic phase. Patients with high-risk disease were excluded from the analysis in order to better examine the relationship between amino acid substitutions and survival.
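
The biased coin exclusion step can be illustrated with a minimal Python sketch. It is not the procedure used by the NMDP: the function name, the record fields and the exclusion probabilities below are hypothetical and were chosen only so that the average exclusion rate is close to 4%.

```python
import random


def biased_coin_exclude(records, exclusion_prob, seed=0):
    """Keep each record unless a biased coin, weighted by a record-specific
    exclusion probability, says it should be dropped."""
    rng = random.Random(seed)
    kept = []
    for rec in records:
        p = exclusion_prob(rec)      # probability of excluding this record
        if rng.random() >= p:        # random() is in [0, 1), so p=0 always keeps
            kept.append(rec)
    return kept


# Hypothetical deceased-patient records with one made-up characteristic.
deceased = [{"id": i, "group": "a" if i % 2 else "b"} for i in range(1000)]


def hypothetical_prob(rec):
    # Assumed probabilities (about 4% on average), not values from the study.
    return 0.05 if rec["group"] == "a" else 0.03


kept = biased_coin_exclude(deceased, hypothetical_prob)
print(len(deceased) - len(kept), "records excluded")
```

In practice the exclusion probabilities would be estimated from the characteristics of surviving patients who declined consent, as described above.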

High-resolution HLA typing was performed for HLA-A, B, C, DRB1, DQA1, DQB1, DPA1 and DPB1 on all donor–recipient pairs as previously described.2 However, in this study only HLA-A, B, C and DRB1 were considered in the definition of HLA matching, based on the results of Lee et al.2

To avoid confounding effects of HLA mismatches in the graft-vs-host and host-vs-graft directions, donors and recipients that were homozygous at an HLA class I locus (n=91) were excluded from analysis. Donor–recipient pairs with more than one mismatch in HLA-A, B, C and DRB1 or those mismatched at HLA-DRB1 were also excluded. There were 1507 donor–recipient pairs who were matched at HLA-A, B, C and DRB1 (referred to as the matched group) and 600 donor–recipient pairs with only one allele or Ag mismatch at HLA-A, B or C (referred to as the mismatched group). The frequency distribution of the 600 mismatched donor–recipient pairs at HLA-A, B and C is 179 (29.8%), 88 (14.7%) and 333 (55.5%), respectively.

Data sources

The Center for International Blood and Marrow Transplant Research is a research affiliation of the International Bone Marrow Transplant Registry, Autologous Blood and Marrow Transplant Registry and the NMDP established in 2004 that comprises a voluntary working group of more than 450 transplantation centers worldwide that contribute detailed data on consecutive allogeneic and autologous HCT to a Statistical Center at the Medical College of Wisconsin in Milwaukee and the NMDP Coordinating Center in Minneapolis. Participating centers are required to report all transplants consecutively; compliance is monitored by on-site audits. Patients are followed longitudinally with yearly follow-up. Computerized checks for discrepancies, physicians’ review of submitted data and on-site audits of participating centers ensure data quality. Observational studies conducted by the Center for International Blood and Marrow Transplant Research are carried out in compliance with the Privacy Rule as a Public Health Authority, and in compliance with all applicable federal regulations pertaining to the protection of human research participants as determined by continuous review of the Institutional Review Boards of the NMDP and the Medical College of Wisconsin since 1985.

Amino acid substitution assignment

Amino acid substitutions were assigned by comparing the amino acid sequences of the mismatched alleles carried by the donor and the recipient using the International Immunogenetics Project/HLA database (http://www.ebi.ac.uk/imgt/hla) accessed in July 2007. Polymorphic amino acid positions were identified by position number and type. The observed mismatches between patient and donor were recorded by position number and the two different amino acids. The majority (~80%) of the HLA alleles in the International Immunogenetics Project HLA database are defined on the basis of partial sequence, in which a portion of the exonic nucleotides is not described. For this study, we restricted the analysis to exons 2–3 for class I alleles and exon 2 for class II alleles, wherein the majority of the alleles are fully characterized. To address the few instances in which the reference sequence definition is incomplete within these exons, we used a simple imputation method to fill in the sequence with that of the most similar fully characterized allele. The similarity measure used was the Hamming distance, or the minimum number of nucleotide differences.
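
The assignment and imputation steps can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names, the `*` gap symbol, the choice to compare only positions that are known in the partial sequence, and the toy sequences are all assumptions. Mismatches are written in the position-recipient:donor notation used in this article (for example 156-L:W).

```python
def hamming(a, b, gap=None):
    """Number of aligned positions at which two sequences differ.
    Positions where either sequence carries the gap symbol are skipped."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        x != y
        for x, y in zip(a, b)
        if gap is None or (x != gap and y != gap)
    )


def impute(partial, complete_refs, gap="*"):
    """Fill unknown positions of a partial sequence from the fully
    characterized reference with the smallest Hamming distance."""
    closest = min(complete_refs, key=lambda ref: hamming(partial, ref, gap))
    return "".join(r if p == gap else p for p, r in zip(partial, closest))


def substitutions(recipient, donor, first_position=1):
    """List mismatched positions as 'position-recipient:donor'."""
    if len(recipient) != len(donor):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{pos}-{r}:{d}"
        for pos, (r, d) in enumerate(zip(recipient, donor), start=first_position)
        if r != d
    ]


# Toy sequences, not real HLA alleles.
print(impute("AC*T", ["TTTT", "ACGT"]))
print(substitutions("GSHSL", "GSHSW", first_position=152))
```

Ties between equally distant reference alleles are resolved here by taking the first one in the list; the source does not state how such ties were handled.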

Statistical analyses

Random forest analysis

Random forest analysis was used to identify amino acid substitutions associated with the primary end point of survival to day 100, accounting for clinical and transplant characteristics and other simultaneous amino acid substitutions present. Because random forest analysis has not been used before in HCT studies, we provide a brief description of the method and its functional properties.

Random forest is a tree-based method for classification developed by Breiman14 that uses an ensemble of classification or decision trees. Using a recursive-partitioning algorithm, each classification tree is built based on a bootstrap sample of the training data. Some records will be included more than once in the sample, and others will not appear at all. Generally, about two-thirds of the records will be included in each bootstrap sample of the training data set, and one-third will be left out. The left out records are used to provide an ongoing dynamic assessment of model performance, similar to repeated cross-validation. In addition, a random subset of the available predictor variables is used to determine the best partition of the data at each node of each individual tree building process. This doubly random process produces a collection of substantially different trees. Together, the resulting decision trees form the forest that represents the final ensemble tree model, in which each decision tree votes for the result and the majority wins.
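
The "about two-thirds" figure follows from sampling with replacement: a given record is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows. The Python sketch below illustrates this, together with the majority vote; the function names are ours and the example votes are made up.

```python
import random
from collections import Counter


def in_bag_fraction(n):
    """Expected fraction of n records that appear at least once in a
    bootstrap sample of size n drawn with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_indices(n, rng):
    """Draw n record indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


n = 2107                                  # donor-patient pairs in this study
rng = random.Random(0)
sample = bootstrap_indices(n, rng)
out_of_bag = set(range(n)) - set(sample)  # records left out of this sample

print(round(in_bag_fraction(n), 3))       # expected in-bag fraction
print(len(out_of_bag) / n)                # observed left-out fraction
print(majority_vote(["alive", "dead", "alive"]))
```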

In contrast to traditional multivariate modeling, the random forest analysis can account for inter-relationships among all potential predictors including highly multilevel unordered categorical covariates in building a tree-based predictive model. Unlike traditional univariate and multivariate logistic regression analysis, random forest analysis has the capability to analyze large training data sets with hundreds or even thousands of input variables. The two-part randomness (random subset of patients, random subset of variables) used by the random forest method has been shown to deliver considerable robustness to noise, outliers, and over-fitting, when compared with a single tree classifier. Random forest analysis was carried out using the random forest software, version 1.0 (Salford Systems, San Diego, CA, USA).

Four patient–donor clinical characteristics (age, disease type, disease status, donor–recipient gender match) identified as associated with day 100 survival in preliminary analyses and 127 amino acid substitution position variables at HLA-A, B or C constituted the set of eligible predictors in the random forest analysis. We built a random forest model based on a collection of 500 classification trees, with each individual tree built from a bootstrap sample of the original 2107 donor–patient pairs. At each non-terminal node of a growing tree, a set of 15 predictors randomly selected from the total of 131 predictors was used to determine the best split of the node. Results for each potential variable are expressed as a 0–100 ranking of variable importance, with higher scores indicating greater predictive ability. In contrast to traditional univariate and multivariate modeling, confidence intervals and P-values are not available.
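
The settings described above can be written down as a short Python sketch. The predictor names are hypothetical placeholders, and the rescaling function assumes that the 0–100 ranking is obtained by expressing each raw importance relative to the top-ranked variable; the source does not state how the software computes the scale.

```python
import random

N_TREES = 500      # classification trees in the forest
MTRY = 15          # predictors drawn at random at each non-terminal node

CLINICAL = ["age", "disease_type", "disease_status", "gender_match"]
# Hypothetical names for the 127 substitution position variables.
POSITIONS = [f"position_var_{i}" for i in range(1, 128)]
PREDICTORS = CLINICAL + POSITIONS


def candidate_predictors(all_predictors, mtry, rng):
    """Predictors considered for the split at a single node
    (drawn without replacement)."""
    return rng.sample(all_predictors, mtry)


def rescale_importance(raw_scores):
    """Assumed convention: top-ranked variable scores 100, others are
    expressed relative to it."""
    top = max(raw_scores.values())
    if top <= 0:
        raise ValueError("at least one raw score must be positive")
    return {name: 100.0 * score / top for name, score in raw_scores.items()}


rng = random.Random(0)
print(len(PREDICTORS))                               # total eligible predictors
print(candidate_predictors(PREDICTORS, MTRY, rng)[:3])
```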

Traditional univariate and multivariate analysis

Traditional univariate and multivariate analyses were carried out to compare the results obtained by the random forest analysis with those obtained from more common statistical approaches using the same data set. For the univariate approach, each mismatched type by position subgroup was compared with the HLA-matched group using a binary indicator variable in a multiple logistic regression model with adjustment for patient risk factors. Because of multiple testing, indicator variables with a more stringent P-value of less than or equal to 0.005 were considered as statistically significant, indicating that the death rate by day 100 of the specific mismatched type by position subgroup is different from that of the matched group.

For the traditional multivariate logistic regression model, the potential differential effects of substitution type were ignored and the model tested the effect of any amino acid substitution within each position (mismatch vs match regardless of type). An initial screening was conducted by testing the effect of each amino acid substitution position separately at the 5% significance level in a logistic regression model with adjustment for the significant patient risk factors (age, disease type, disease stage and donor–recipient gender match). Then, based on the amino acid substitution position variables that were significant in the initial screening, a final model was built using a forward stepwise regression procedure with a 5% significance level as the variable entry or deletion criterion. This final model allowed for the identification of interactive effects among multiple amino acid substitution positions but could not evaluate types of substitutions or their interactions because the model cannot accommodate the large number of indicator variables necessary to code all possible substitution types and their interactions among combinations of substitution positions.

Results



Patient characteristics

Patient characteristics are summarized in Table 1 for the HLA-mismatched and -matched groups, respectively. There were significant differences between the groups with respect to age, disease type, disease stage, conditioning regimen and GvHD prophylaxis at the 5% significance level. However, after Bonferroni adjustment for multiple comparisons to reduce the possibility of false positive results, only age and disease stage remained significant at the 5% level. The day 100 survival was 79% for the HLA-matched group and 69% for the HLA-mismatched group, P<0.001.
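
For readers unfamiliar with the Bonferroni adjustment, the Python sketch below shows the rule: with m comparisons, each P-value is tested against 0.05/m instead of 0.05. The variable names and P-values are hypothetical and are not taken from Table 1; the number of comparisons actually adjusted for is not stated in the text.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the adjusted threshold alpha / m and the names of the
    comparisons whose P-value does not exceed it."""
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one comparison is required")
    threshold = alpha / m
    kept = sorted(name for name, p in p_values.items() if p <= threshold)
    return threshold, kept


# Hypothetical P-values for five baseline comparisons.
hypothetical = {
    "var_a": 0.0004,
    "var_b": 0.03,
    "var_c": 0.002,
    "var_d": 0.04,
    "var_e": 0.02,
}

threshold, kept = bonferroni_significant(hypothetical)
print(threshold, kept)
```

In this made-up example all five comparisons are significant at the unadjusted 5% level, but only two remain significant after adjustment.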

Distribution of amino acid substitutions positions and types

From the 600 donor–recipient pairs that had one HLA-A, B or C amino acid mismatch and were DRB1 matched, 371 had Ag mismatches and 229 had allele mismatches as defined by the NMDP.2 HLA-A, B and C sequences each had up to a total length of 181 amino acids. Amino acid substitutions were identified in 50 positions in HLA-A, 44 positions in HLA-B and 33 positions in HLA-C, for a total of 127 mismatched amino acid positions. Most mismatched positions have multiple mismatch types, hence a total of 389 amino acid substitutions were identified for the 127 positions (an average of 3.1 types per amino acid substitution position), Table 2.

Amino acid substitutions identified by the random forest analysis

Four patient variables (age, disease stage, disease type and gender match) and 33 of the 127 amino acid substitution positions were assigned an importance score of 2.9 or higher (on a scale of 0–100) by random forest analysis and identified as predictors of death at day 100 post transplant, Table 3. A cutoff value of 2.9 for the importance score on a scale of 0–100 was established to include the most important overlapping amino acid substitutions across the different HLA class I loci. The criterion used for selection of the most important positions was to include all 13 previously identified amino acid substitutions as well as any new position (n=20) with an importance score higher than a previously identified position. Amino acid substitutions using this definition were HLA-A 9, 43, 62, 63, 76, 77, 95, 97, 114, 116, 152, 156, 166 and 167; HLA-B 97, 109, 116 and 156; and HLA-C 6, 9, 11, 14, 21, 66, 77, 80, 95, 97, 99, 116, 156, 163 and 173, Figure 1. Table 3 shows a ranking of these amino acid substitutions by the strength of the importance score received on random forest analysis, and also summarizes previous reports in the literature.
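
The selection rule can be read as follows: the cutoff is the lowest importance score among the previously reported positions, and every position scoring at or above it is retained. A minimal Python sketch under that reading is given below; the position labels and scores are hypothetical and do not reproduce Table 3.

```python
def select_positions(scores, previously_reported):
    """Return the cutoff (lowest score among previously reported positions)
    and all positions at or above it, ranked by decreasing score."""
    cutoff = min(scores[p] for p in previously_reported)
    selected = sorted(
        (p for p, s in scores.items() if s >= cutoff),
        key=lambda p: -scores[p],
    )
    return cutoff, selected


# Hypothetical importance scores on the 0-100 scale.
scores = {"known_1": 40.0, "known_2": 2.9, "new_1": 5.0, "new_2": 1.0}
cutoff, selected = select_positions(scores, {"known_1", "known_2"})
print(cutoff, selected)
```

Here the new position scoring above the lowest previously reported position is kept, and the one scoring below it is dropped.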

Figure 1.

Representative HLA molecules with non-permissive amino acid substitutions identified using random forest analysis. The residues are shown as mismatch groupings. (a) HLA-A, B and C positions 97, 116 and 156. (b) HLA-A and C positions 9, 77 and 95. (c) HLA-A 43, 62, 63, 76, 114, 152, 166 and 167. (d) HLA-B position 109. (e) HLA-C positions 6, 11, 14, 21, 66, 80, 99, 163 and 173. The mismatches are found on the α-1 and α-2 domains, with the majority occurring in the peptide-binding groove.


Most frequent HLA class I mismatches accounting for amino acid substitutions exhibiting the highest importance scores

The most frequent HLA class I mismatches for each of the 33 amino acid substitutions identified by random forest with high importance scores and their frequencies are listed in Table 4. Table 5 shows the most common HLA class I mismatches for each locus that correspond to the amino acid substitutions with high importance scores. The most common HLA mismatches in relation with these amino acids for each class I locus are HLA-A*02:01/02:05, HLA-B*35:01/35:03 and HLA-C*01:02/02:02, Table 5. The percentages were calculated based on all mismatches at a particular locus as the denominator. Only HLA mismatches with a frequency of greater than or equal to 10 were included. However, if no HLA mismatches with a frequency of greater than or equal to 10 were available, the highest available frequency was included in the table.

Traditional univariate analysis of amino acid substitutions adjusting for clinical variables

Table 6 lists all 13 amino acid substitution subgroups with greater than 10 patients and with significantly greater death rates by day 100 (P<0.005 in two-sided test) as compared with the HLA-matched group (1507 donor–recipient pairs) in univariate analysis adjusting for clinical variables. For the HLA-A-mismatched group, only one amino acid substitution position and type, 156-L:W (recipient:donor), was identified. No amino acid substitutions associated with worse outcome were identified for the HLA-B-mismatched group. This may be due in part to the fact that there are only 88 (14.7%) HLA-mismatched donor–recipient pairs with HLA-B mismatches. In all, 12 amino acid substitutions were identified in the HLA-C-mismatched group. A total of seven different amino acid substitutions are on the α-1 domain, in seven different positions and five amino acid substitutions are located on the α-2 domain, in four different positions.

Traditional multivariate analysis of amino acid substitution positions adjusting for clinical variables

We first tested if a single amino acid substitution position (regardless of substitution type) was associated with death by day 100 after adjustment for important patient risk factors. Using a 5% significance level, we identified the following substitution positions: HLA-A 9, 17; HLA-B 109 and 116; and HLA-C 6, 9, 11, 14, 16, 21, 24, 49, 77, 80, 97, 99, 114, 116, 156, 163. With a more stringent 0.5% significance level only the following 10 HLA-C positions: 9, 11, 21, 77, 80, 97, 99, 116, 156 and 163 were identified. Of these 10 HLA-C positions, 9 positions (except 163) were already identified by univariate analysis that tested the effect of substitution type at each substitution position, Table 6. It can be seen that multivariate analysis identified four additional substitution positions at the 0.5% significance level. This indicates that in addition to identifying more informative substitution type effect, testing the differential effect of substitution type at each substitution position is also a more powerful approach to identify substitution positions. Holding patient risk factors in the model, we used a forward stepwise procedure with a 5% significance level for entry into and removal from the model to select the most important amino acid substitution positions from the initially identified positions. We found that HLA-A positions 17, 73, 166, HLA-B position 116 and HLA-C position 116 were the only amino acid substitution positions simultaneously associated with outcome, Table 7.

HLA-DQ and DP matching status was also analyzed. DQ matching status was not associated with survival rate at day 100 (P=0.33) but DP matching status was (P=0.005). These results indicate that there is no linkage effect of the class I mismatches with DQA1 or DQB1 disparities. There was no survival difference between patient–donor pairs that had one HLA class I Ag or allele mismatch (P=0.66).

Discussion



Several large studies using standard multivariable modeling have established the importance of molecular matching at HLA-A, B, C and DRB1 for the outcome of HCT.1, 2, 3, 4, 5 It is estimated that on average, every additional mismatch is associated with a 10% decrement in survival after adult unrelated donor transplantation for good risk patients.2 But it is equally clear that many patients, particularly minorities, lack matched unrelated donors20 and suitable mismatched donors need to be identified to offer transplants to these patients. The effect of HLA mismatching on GvHD, relapse, and TRM is mediated by amino acid substitutions, several of which can be found in most mismatched alleles. In this study, we have identified 33 amino acid substitution locations that are associated with survival at day 100 post transplant. Some of these locations, 97, 116 and 156, were present in all three HLA class I loci. Substitution locations 9, 77 and 95 were present on HLA-A- and HLA-C-mismatched Ags or alleles. Some locations were only identified on mismatched Ags or alleles of a single locus: HLA-A 43, 62, 63, 76, 114, 152, 166 and 167; HLA-B 109; and HLA-C 6, 11, 14, 21, 66, 80, 99, 163 and 173. The majority of the important amino acid substitutions identified in this study as associated with survival to day 100 are located on the α-1 or the α-2 domains of the peptide-binding site, encoded by exons 2 and 3, respectively, and are predicted to directly affect T-cell allorecognition.21, 22, 23 The most common HLA mismatches associated with these amino acids are HLA-A*02:01/02:05, 02:01/02:06, 03:01/03:02, 01:01/11:01, 02:01/68:01 and 24:02/24:03; HLA-B*35:01/35:03 and 35:01/35:08; and HLA-C*01:02/02:02, 04:01/16:01, 05:01/07:04, 14:02/15:02, 03:03/04:01, 07:01/12:03, 06:02/07:01, 01:02/03:03, 01:02/15:02, 03:04/07:02 and 02:02/15:02.

The identification of amino acid substitutions that are associated with a higher than average risk of failure in HCT, the so-called non-permissive amino acid substitutions, represents a first step toward the ultimate goal of identifying acceptable mismatches that could be used in the clinical setting for selection of suitable mismatched unrelated donors for patients lacking HLA-identical donors. However, additional studies using different data sets as well as functional studies are necessary to confirm these findings prior to clinical implementation of these results.

Initial insights into the importance of specific amino acid substitutions were based on identification of individual patients and isolation of cytotoxic T-cell clones directed against HLA subtypes absent in the donor.8, 9, 24 Using a large data set, Ferrara et al.10 reported in 2001 that substitutions at position 116 of class I molecules increase risk for acute GvHD and TRM. However, they did not attempt to distinguish the effects of substitutions in HLA-A, HLA-B or HLA-C.10 Recently, Kawase et al.11 have reported non-permissive HLA mismatches associated with acute GvHD in HCT patients from the Japan Marrow Donor Program. In contrast to our study, the study population of Kawase et al.11 comprised recipients with heterogeneous diagnoses and disease stages, and donor–recipient pairs with mismatches at multiple HLA loci. They conducted a traditional multivariate analysis to evaluate the effect of HLA one-locus allele mismatch on acute GvHD while adjusting for clinical factors (disease, treatment and patient-related predictors) as well as mismatch status in other loci.11 They found four non-permissive mismatches in HLA-A, one in HLA-B, seven in HLA-C, one in DRB1, one mismatch associated with DRB1-DQB1 and two in HLA-DPB1.11 A similar model was used to analyze the effect of each amino acid substitution type on each position separately. However, they did not adjust for multiple amino acid substitutions that commonly occur within a single HLA mismatch.11 They found two non-permissive amino acid substitutions at HLA-A, positions 9 and 116, and six non-permissive amino acid substitutions at HLA-C positions 9, 77, 80, 99, 116 and 156.11 More recently, the same group has published an analysis of HLA mismatches that predict for relapse and overlap minimally with the mismatches associated with acute GvHD.25 Functional studies have also been reported;12, 13 however, their results are in conflict with the reports of Ferrara et al.10 and Kawase et al.11 and only include a small number of cases.

Our analysis differed from Kawase et al.11 in several ways. First, we used a different end point, namely death by day 100, and restricted our analysis to patients with good or intermediate risk leukemia. By focusing the analysis on a more restricted and hence more homogeneous study population, we hypothesized that we would reduce variability due to disease variables and increase the power to detect variables that predict for GvHD. Second, we used a new statistical method, random forest analysis, which has not been previously applied in HCT but which has several advantages over more conventional analysis methods as demonstrated by our results. Using random forest analysis, we confirmed all non-permissive amino acid substitutions identified by Kawase et al.11 as well as the few amino acid substitutions reported by other investigators.8, 9, 10, 24 Although random forest analysis does not validate the interpretation of substitutions as permissive vs non-permissive and does not provide a P-value, the fact that we were able to identify these previously reported non-permissive amino acid substitutions by random forest and not by traditional multivariate analysis in our data set supports the observation in other fields that random forest provides greater data analytic power. Furthermore, in addition to the 8 amino acid substitutions identified by Kawase et al.,11 we identified another 25 that had similar or higher importance scores in the random forest analysis. Future studies in different patient populations are required to confirm the importance of these amino acid substitutions in HCT. However, for the patient who needs an HCT today from an HLA-mismatched donor, the evolving literature suggests that using a donor who is mismatched with the recipient at positions 116 or 156 at any of the HLA class I loci, at position 9 at HLA-A or HLA-C, and at position 99 at HLA-C may increase the risk for early death and other adverse outcomes.

A number of limitations of this study should also be mentioned. Although there were some notable commonalities, the three separate analytic techniques we applied to the same data set identified different sets of clinical variables and amino acid substitutions associated with survival at day 100, highlighting the need for independent validation in multiple data sets and using multiple approaches. Also, we chose survival at day 100 as our primary end point, as it is objective and likely most closely associated with acute GvHD. However, further studies should be conducted to investigate amino acid substitutions that have their maximal association with other outcomes and to determine permissive amino acid substitutions. Our analysis identified associations between amino acid substitutions and survival at day 100, but we cannot confirm the biological importance. Only well-designed functional studies will show if the specific amino acid substitutions identified affect T-cell allorecognition or function or if they are markers for other critical factors causing increased mortality. Other biological factors that affect HLA amino acid mismatches and T-cell allorecognition in HCT, such as the shape of the T-cell receptor repertoire, have not been investigated in this study. Finally, although most of these amino acid locations have been identified in other studies, we acknowledge that some of these amino acid substitution locations may only be a marker of a specific allele mismatch instead of a truly important location that has an effect on survival.

In conclusion, using random forest to analyze the largest currently available data set of HCTs, we were able to confirm 13 previously identified class I amino acid substitutions and to identify 20 additional novel class I amino acid substitutions that are predictors of survival at day 100. Random forest analysis presents a robust statistical methodology for the analysis of HLA mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods. On the basis of these results, random forest analysis may prove an equally valuable tool to evaluate other transplant outcomes of interest.


Conflict of interest

The authors declare no conflict of interest.

References



References

  1. Flomenberg N, Baxter-Lowe LA, Confer D, Fernandez-Vina M, Filipovich A, Horowitz M et al. Impact of HLA class I and class II high-resolution matching on outcomes of unrelated donor bone marrow transplantation: HLA-C mismatching is associated with a strong adverse effect on transplantation outcome. Blood 2004; 104: 1923–1930.
  2. Lee SJ, Klein J, Haagenson M, Baxter-Lowe LA, Confer DL, Eapen M et al. High-resolution donor-recipient HLA matching contributes to the success of unrelated donor marrow transplantation. Blood 2007; 110: 4576–4583.
  3. Shaw BE. The clinical implications of HLA mismatches in unrelated donor haematopoietic cell transplantation. Int J Immunogenet 2008; 35: 367–374.
  4. Hauzenberger D, Schaffer M, Ringdén O, Hassan Z, Omazic B, Mattsson J et al. Outcome of haematopoietic stem cell transplantation in patients transplanted with matched unrelated donors vs allele-mismatched donors: a single centre study. Tissue Antigens 2008; 72: 549–558.
  5. Petersdorf EW. Optimal HLA matching in hematopoietic cell transplantation. Curr Opin Immunol 2008; 20: 588–593.
  6. Whitelegg A, Barber LD. The structural basis of T-cell allorecognition. Tissue Antigens 2004; 63: 101–108.
  7. Archbold JK, Ely LK, Kjer-Nielsen L, Burrows SR, Rossjohn J, McCluskey J et al. T cell allorecognition and MHC restriction-A case of Jekyll and Hyde. Mol Immunol 2008; 45: 583–598.
  8. Keever CA, Leong N, Cunningham I. HLA-B44-directed cytotoxic T cells associated with acute graft-versus-host disease following unrelated bone marrow transplantation. Bone Marrow Transplant 1994; 14: 137–145.
  9. Fleischhauer K, Kernan NA, O’Reilly RJ, Dupont B, Yang SY. Bone marrow-allograft rejection by T lymphocytes recognizing a single amino acid difference in HLA-B44. N Engl J Med 1990; 323: 1818–1822.
  10. Ferrara GB, Bacigalupo A, Lamparelli T, Lanino E, Delfino L, Morabito A et al. Bone marrow transplantation from unrelated donors: the impact of mismatches with substitutions at position 116 of the human leukocyte antigen class I heavy chain. Blood 2001; 98: 3150–3155.
  11. Kawase T, Morishima Y, Matsuo K, Kashiwase K, Inoko H, Saji H et al. High-risk HLA allele mismatch combinations responsible for severe acute graft-versus-host disease and implication for its molecular mechanism. Blood 2007; 110: 2235–2241.
  12. Heemskerk MB, Roelen DL, Dankers MK, van Rood JJ, Claas FH, Doxiadis II et al. Allogeneic MHC class I molecules with numerous sequence differences do not elicit CTL response. Human Immunol 2005; 66: 969–976.
  13. Heemskerk MB, Cornelissen JJ, Roelen DL, van Rood JJ, Claas FH, Doxiadis II et al. Highly diverged MHC class I mismatches are acceptable for haematopoietic stem cell transplantation. Bone Marrow Transplant 2007; 40: 193–200.
  14. Breiman L. Random Forests. Machine Learning J 2001; 45: 5–32.
  15. Breiman L. Statistical modeling: the two cultures. Statist Sci 2001; 16: 199–231.
  16. Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet 2004; 5: 32.
  17. Díaz-Uriarte R, de Andrés SA. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7: 3.
  18. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002; 97: 77–87.
  19. Lee JW, Lee JB, Park M, Song SH. An extensive evaluation of recent classification tools applied to microarray data. Comput Stat Data Anal 2005; 48: 869–885.
  20. Dew A, Collins D, Artz A, Rich E, Stock W, Swanson K et al. Paucity of HLA-identical unrelated donors for African-Americans with hematologic malignancies: the need for new donor options. Biol Blood Marrow Transplant 2008; 14: 938–941.
  21. Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC. Structure of the human class I histocompatibility antigen, HLA-A2. Nature 1987; 329: 506–512.
  22. Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC. The foreign antigen binding site and T cell recognition regions of class I histocompatibility antigens. Nature 1987; 329: 512–518.
  23. Bjorkman PJ, Strominger JL, Wiley DC. Crystallization and X-ray diffraction studies on the histocompatibility antigens HLA-A2 and HLA-A28 from human cell membranes. J Mol Biol 1985; 186: 205–210.
  24. Burrows SR, Khanna R, Burrows JM, Moss DJ. An alloresponse in humans is dominated by cytotoxic T lymphocytes (CTL) cross-reactive with a single Epstein-Barr virus CTL epitope: implications for graft-versus-host disease. J Exp Med 1994; 179: 1155–1161.
  25. Kawase T, Matsuo K, Kashiwase K, Inoko H, Saji H, Ogawa S et al. HLA mismatch combinations associated with decreased risk of relapse: Implications for molecular mechanism. Blood 2009; 113: 2851–2858.


Acknowledgements

We thank Theodore Karrison, PhD, for statistical support. This study was supported by the University of Chicago Cancer Research Center, Chicago, Illinois (Fund-6-33573 (SRM)). The Center for International Blood and Marrow Transplant Research is supported by Public Health Service Grant/Cooperative Agreement U24-CA76518 from the National Cancer Institute (NCI), the National Heart, Lung and Blood Institute (NHLBI) and the National Institute of Allergy and Infectious Diseases; a Grant/Cooperative Agreement 5U01HL069294 from NHLBI and NCI; a contract HHSH234200637015C with Health Resources and Services Administration (DHHS); two Grants N00014-06-1-0704 and N00014-08-1-0058 from the Office of Naval Research; and grants from AABB; Aetna; American Society for Blood and Marrow Transplantation; Amgen Inc.; Anonymous donation to the Medical College of Wisconsin; Astellas Pharma US Inc.; Baxter International Inc.; Bayer HealthCare Pharmaceuticals; Be the Match Foundation; Biogen IDEC; BioMarin Pharmaceutical Inc.; Biovitrum AB; BloodCenter of Wisconsin; Blue Cross and Blue Shield Association; Bone Marrow Foundation; Canadian Blood and Marrow Transplant Group; CaridianBCT; Celgene Corporation; CellGenix, GmbH; Centers for Disease Control and Prevention; Children's Leukemia Research Association; ClinImmune Labs; CTI Clinical Trial and Consulting Services; Cubist Pharmaceuticals; Cylex Inc.; CytoTherm; DOR BioPharma Inc.; Dynal Biotech, an Invitrogen Company; Eisai Inc.; Enzon Pharmaceuticals Inc.; European Group for Blood and Marrow Transplantation; Gamida Cell Ltd.; GE Healthcare; Genentech Inc.; Genzyme Corporation; Histogenetics Inc.; HKS Medical Information Systems; Hospira Inc.; Infectious Diseases Society of America; Kiadis Pharma; Kirin Brewery Co. Ltd.; The Leukemia and Lymphoma Society; Merck and Company; The Medical College of Wisconsin; MGI Pharma Inc.; Michigan Community Blood Centers; Millennium Pharmaceuticals Inc.; Miller Pharmacal Group; Milliman USA Inc.; Miltenyi Biotec Inc.; National Marrow Donor Program; Nature Publishing Group; New York Blood Center; Novartis Oncology; Oncology Nursing Society; Osiris Therapeutics Inc.; Otsuka America Pharmaceutical Inc.; Pall Life Sciences; PDL BioPharma Inc; Pfizer Inc; Pharmion Corporation; Saladax Biomedical Inc.; Schering Corporation; Society for Healthcare Epidemiology of America; StemCyte Inc.; StemSoft Software Inc.; Sysmex America Inc.; Teva Pharmaceutical Industries; THERAKOS Inc.; Thermogenesis Corporation; Vidacare Corporation; Vion Pharmaceuticals Inc.; ViraCor Laboratories; ViroPharma Inc.; and Wellpoint Inc. The views expressed in this article do not reflect the official policy or position of the National Institutes of Health, the Department of the Navy, the Department of Defense, or any other agency of the US Government.

Author contributions: SRM conceptualized the study, interpreted the results and wrote the manuscript; SRM and SL designed the study; SL performed the univariate, multivariate and random forest analyses; MM prepared the amino acid database for analysis; MH prepared the data for statistical analysis; SS and SJL contributed ideas and made significant contributions to the writing of the manuscript; JK performed multivariate analysis; TAB prepared the figure; KVB provided overall advice and guidance. All authors reviewed the manuscript.