Main

Recent large-scale whole-exome sequencing (WES) studies have revealed the existence and relatively high frequency of somatic changes in HLA class I genes in head and neck cancer, squamous cell lung cancer, stomach adenocarcinoma and diffuse large B-cell lymphoma1,2,3,4,5. The HLA locus, located on chromosome 6, is among the most polymorphic regions of the human genome, with thousands of documented alleles for each gene6. The class I alleles are critical mediators of the cytotoxic T-cell response, presenting cellular peptides on the cell surface in a form that can be recognized by the T-cell receptor7,8. The finding of an elevated somatic mutation rate in HLA genes has strongly implicated HLA dysfunction as a possible mechanism of immune evasion in the development and progression of certain cancers1,2,3,4,5.

Each individual expresses six major histocompatibility complex (MHC) class I alleles, encoded by three genes (HLA-A, HLA-B and HLA-C) located on the two homologous copies of chromosome 6. Conventional determination of HLA type is performed using serology- and/or PCR-based methods that are labor-intensive and time-consuming9,10,11. Several protocols have recently been proposed for HLA-targeted multiplexed PCR coupled with next-generation sequencing, but by design, they provide information restricted to HLA alleles, and not the rest of the genome12,13,14,15,16. Theoretically, HLA typing information should be directly extractable from WES data, an increasingly available and cost-effective approach for the comprehensive analysis of genome-wide somatic alterations. The human reference genome, however, has a single sequence for each HLA gene and would likely misrepresent the true alleles in the individual, thereby causing suboptimal alignments. In addition, the HLA genes are GC-rich and therefore typically suffer from lower sequencing coverage due to lower efficiency in capture and amplification, and increased sequencing errors that further reduce the alignment rates. Consequently, to accurately detect somatic mutations in the HLA genes, one needs to first accurately align all reads originating from this region in both the tumor and matched normal samples and only then to apply somatic mutation detection tools. We also surmised that conventional alignment and mutation detection methods, which do not focus dedicated attention on this highly polymorphic region, would be prone to errors.

To this end, we developed the algorithm Polysolver (polymorphic loci resolver), which enables high-precision HLA typing even while using relatively low-coverage WES data, and a subsequent mutation detection pipeline that uses the inferred alleles as a basis for high-fidelity detection of mutations in HLA genes. By analyzing WES data from 7,930 cancer patients, we demonstrate high sensitivity and specificity of our method in detecting HLA somatic mutations. Further characterization suggests a functional impact of these mutations on this biologically important and complex locus.

Results

Inference of class I HLA alleles using Polysolver

To develop Polysolver, we put together a training set of data from eight chronic lymphocytic leukemia (CLL) patients for whom WES data as well as conventional PCR-based HLA typing were available17 (Supplementary Table 1). We first confirmed the expected poor coverage and inverse correlation between GC content and coverage in HLA genes in this set (Supplementary Fig. 1). We reasoned that coverage at these highly polymorphic regions can be substantially improved by ensuring retrieval of true HLA reads that failed to align to the canonical reference, followed by alignment to a library of all known HLA alleles. These alignments could then be used for subsequent computational inference of the individual's HLA type. Thus, Polysolver consists of the following steps: (i) improved retrieval and alignment of HLA reads; and (ii) inference of the HLA alleles using a two-step Bayesian classification approach (Fig. 1a, Supplementary Notes 1 and 2, and Supplementary Software). In brief, we increased the precision of the alignment by first selecting reads from the WES data that potentially originated from the HLA region (Supplementary Fig. 2), aligning them to a full-length genomic library of all known HLA alleles based on the IMGT (ImMunoGeneTics)/HLA database18 (Online Methods) using a precise alignment method (Novoalign), and keeping all best-scoring alignments for each read to use in subsequent steps. Inference of the two alleles for each HLA gene was based on a Bayesian calculation that takes into account the base qualities of aligned reads, observed insert sizes, as well as the ethnicity-dependent prior probabilities of each allele12,19 (Supplementary Table 2).

Figure 1: Development and validation of Polysolver for inference of MHC class I type.

(a) Schematic of the Polysolver algorithm. (b) Comparative performance of Polysolver and other previously reported algorithms20,41,42,43,44 by library size (error bars correspond to s.d.) using the following performance criteria: (i) sensitivity, the proportion of all true allele species that are correctly identified by the algorithm; (ii) precision, the probability that an inferred allele species is correct; (iii) accuracy, the fraction of total number of alleles that are correctly called; and (iv) homozygosity success rate, the fraction of all homozygous cases that are correctly inferred.

For validation, we applied Polysolver to WES data from an independent set of 253 HapMap samples with known HLA genotypes (Supplementary Tables 3 and 4). We observed that Polysolver achieved an overall mean sensitivity of 97% (83% of samples had all allele species correctly identified), an overall mean precision of 98.8% (93.6% of samples had no incorrectly identified allele species), a mean overall accuracy of 97% (83% of samples had all alleles correctly called) and a 100% homozygosity success rate (83 of 83 homozygous cases correctly identified) in HLA typing at the protein-coding level. Compared to other recently reported algorithms for inference of HLA type directly from WES data, Polysolver outperformed four of five other tools and performed comparably to the recently described OptiType tool20 (Fig. 1b, Supplementary Table 4 and Supplementary Note 3). To accommodate future use of Polysolver for samples from individuals of unknown ethnic origin, we developed a principal components (PC)-based method for exome-based ethnicity inference (Online Methods and Supplementary Fig. 3), which can be used before analysis by Polysolver to ensure maximal typing accuracy.
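The four performance criteria defined in the Fig. 1b legend can be illustrated with a minimal Python sketch. The function name `typing_metrics`, the dictionary layout (gene mapped to a pair of allele names) and the choice to compute the metrics for a single sample are assumptions made for illustration; how the study aggregated values across samples is not specified here.

```python
from collections import Counter

def typing_metrics(truth, calls):
    """Compare inferred HLA alleles against known alleles for one sample.

    truth, calls: dicts mapping a gene name to a pair of allele names
    (hypothetical layout chosen for this sketch).
    """
    true_species = called_species = shared_species = 0
    total_alleles = correct_alleles = 0
    homozygous = homozygous_correct = 0
    for gene, true_pair in truth.items():
        called_pair = calls[gene]
        true_set, called_set = set(true_pair), set(called_pair)
        # "Allele species" are counted as distinct alleles per gene.
        true_species += len(true_set)
        called_species += len(called_set)
        shared_species += len(true_set & called_set)
        # Accuracy counts alleles with multiplicity (two per gene).
        total_alleles += len(true_pair)
        correct_alleles += sum((Counter(true_pair) & Counter(called_pair)).values())
        if len(true_set) == 1:
            homozygous += 1
            homozygous_correct += int(called_set == true_set)
    return {
        "sensitivity": shared_species / true_species,
        "precision": shared_species / called_species,
        "accuracy": correct_alleles / total_alleles,
        "homozygosity_success": homozygous_correct / homozygous if homozygous else None,
    }
```

Counting species as sets and alleles as multisets is what separates sensitivity from accuracy in this sketch: a homozygous gene contributes one species but two alleles.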

Detection of somatic mutations within the HLA region

A standard approach for detection of somatic mutations is to first align both tumor and normal reads to the reference genome and then scan the genome and identify mutational events observed in the tumor but not in the matched normal (e.g., as implemented in MuTect21). We reasoned that the accurate detection of individual native HLA type using germline data by Polysolver could substantially improve alignment of reads (in both tumor and normal samples) and hence improve the sensitivity and specificity of somatic mutation calling within the HLA region (Fig. 2a). In this setting, the inferred allele species for each HLA gene would serve as patient-specific reference 'chromosomes' against which preselected HLA reads from the tumor and germline samples are aligned separately, followed by standard mutation calling. We therefore built an analysis pipeline to call somatic mutations in the HLA genes that includes the following steps: (i) ethnicity detection using the normal sample; (ii) inference of HLA type by applying Polysolver on the normal sample (although other highly accurate HLA typing tools could also be used); (iii) re-alignment of the HLA reads in both tumor and normal samples to the inferred HLA alleles while filtering out likely erroneous alignments (Online Methods); (iv) application of standard tools to detect somatic mutations (MuTect21 and Strelka22) by comparing the re-aligned tumor and normal HLA reads.

Figure 2: Polysolver for the detection of somatic mutations in MHC class I alleles across cancers.

(a) Schema for detection of somatic changes in HLA genes using Polysolver. Mutation detection algorithms MuTect21 and Strelka22 were incorporated for calling point mutations and indels, respectively, following MHC class I typing of the germline by Polysolver. (b) Comparison of somatic HLA mutations identified by TCGA (yellow) across cancers using standard approaches to those identified by Polysolver (black) (n = 2,545). Green: mutations found in common between the two data sets. (c) Number of HLA mutations and the percentage of samples bearing HLA mutations per cancer type identified by TCGA and Polysolver. (d) Validation of mutations using RNA-seq and long-read sequencing. RNA-seq–based validation was restricted to 49 samples with HLA point mutations (missense, nonsense, non-stop, splice site) identified by exome analysis and with available RNA-seq data. Long-read sequencing was performed on HLA alleles from 18 samples with available DNA material (Online Methods)27.

To test this approach, we initially assembled a data set of 2,545 cases of matched tumor and germline DNA spanning 12 tumor types: 10 from The Cancer Genome Atlas project (TCGA) and 2 from separate genomic studies focusing on CLL and melanoma. Fifty-nine HLA gene somatic mutations were previously detected using standard methods (Supplementary Note 4) and reported as part of a pan-cancer analysis effort23 (Online Methods)17,24. On reanalysis of these cases with our Polysolver-based mutation detection pipeline, we detected 36 of 59 (61%) previously reported HLA mutations, as well as 37 novel somatic HLA mutations; in total, we detected 73 mutations in 64 of 2,545 cases (Fig. 2b,c and Supplementary Tables 5–7). Manual review of all HLA mutation events using IGV25 suggested that 9 of 23 mutations identified exclusively by TCGA were true events, of which 6 were just below the detection limit of our pipeline and were identified once we slightly relaxed the read-filtering criteria used before mutation calling (Supplementary Table 8 and Supplementary Note 5).

When available, we examined matched RNA-sequencing data and sought orthogonal evidence of expression of the somatically mutated HLA allele that was detected by WES (indel calls were excluded from this analysis owing to low reliability of indel alignment and detection by RNA-seq26). A mutation was considered validated if there were at least two alternate allele-bearing reads in the RNA-seq data for well-powered sites (Online Methods). In total, we could evaluate RNA-seq data for 49 of 96 mutations, including 10 that were exclusively reported by TCGA, 17 detected only by our pipeline and 22 that were detected by both. We observed a high rate of RNA-seq-based validation of missense, nonsense and splice-site mutations in the set of 22 mutations found in common (8 of 8, 8 of 11, and 2 of 3 events, respectively; Fig. 2d and Supplementary Table 9). We likewise observed high rates of validation for events identified exclusively by the Polysolver-based mutation detection pipeline (7 of 9, 5 of 6, and 2 of 2 events, respectively). By contrast, only 2 of 10 mutations uniquely identified by TCGA were validated using RNA-seq.
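The RNA-seq validation rule above can be written as a small decision function. This is a sketch only: the requirement of at least two alternate allele-bearing reads comes from the text, whereas the `min_depth` parameter is a hypothetical stand-in for the power criterion, which is defined in the Online Methods and not reproduced here.

```python
def rnaseq_validation(alt_reads, depth, min_alt_reads=2, min_depth=10):
    """Classify one exome-called point mutation using RNA-seq read counts.

    Returns None when the site is treated as not well powered, otherwise
    True (validated) or False (not validated).
    min_depth is a hypothetical placeholder for the study's power criterion.
    """
    if depth < min_depth:
        return None
    return alt_reads >= min_alt_reads
```

Returning `None` for underpowered sites keeps them out of both the validated and the non-validated tallies, mirroring the restriction of the analysis to well-powered sites.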

We further performed experimental validation of inferred mutation calls through direct targeted sequencing of HLA-A and HLA-B alleles of 18 TCGA samples identified as bearing HLA mutations for which DNA material was available (Online Methods)27. Six of these 18 samples did not have adequate coverage at the site of mutation and were removed from the analysis owing to lack of sufficient power for mutation detection (Online Methods). Of the remaining 12 mutations, this analysis confirmed all 11 of 11 HLA mutations that were inferred by the Polysolver-based mutation detection pipeline (5 identified by TCGA also; 6 identified exclusively by Polysolver), whereas the sole mutation identified exclusively by TCGA was not validated (Fig. 2d and Supplementary Table 10). Altogether, these results demonstrate that the Polysolver-based approach is both a sensitive and specific somatic mutation–detection strategy within the highly polymorphic HLA loci.

Patterns of somatic HLA mutation across tumor types

We extended our analysis of Polysolver-based mutation detection to a total of 7,930 TCGA tumor-normal pairs (including the original collection of 2,545 and 5,385 additional cases). In total, we detected 298 somatic HLA mutations in 266 of 7,930 (3.3%) individuals (Supplementary Tables 11 and 12). The median allele fraction across somatic changes was 33% (interquartile range: 16–58%), suggesting that most of these mutations are heterozygous (Supplementary Fig. 4a).

Among the cancer types, we observed differences in frequency, localization and types of somatic HLA mutations (Fig. 3). In addition to finding HLA mutations occurring significantly in head and neck (HLA-A, HLA-B), lung squamous (HLA-A) and stomach (HLA-B) cancer as previously reported, we further identified HLA-A (FDR, q = 2.3 × 10⁻⁸) and HLA-B (FDR, q = 3.9 × 10⁻⁷) to be significantly mutated in colon adenocarcinoma. By contrast, CLL (n = 128) and liver cancer (n = 202) entirely lacked HLA mutations, and only single mutations were detected in glioblastoma (n = 390) and thyroid cancer (n = 486). Of the 298 HLA mutations, 214 (71.8%) fell in 64 recurrent positions (i.e., amino acids that were mutated in at least two instances). The recurrent sites were distributed across the HLA gene, with a median of 2 mutated cases per recurrent site (range 2–24) (Fig. 3, bottom, Supplementary Table 13 and Supplementary Fig. 4b,c).
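The definition of a recurrent position (an amino acid mutated in at least two instances) amounts to a simple tally. The sketch below assumes each mutation is reduced to a hashable site key such as a (gene, position) tuple; that keying and the function name `recurrent_sites` are illustrative choices, not taken from the study.

```python
from collections import Counter

def recurrent_sites(mutation_sites, min_cases=2):
    """Return recurrent sites and the fraction of mutations that fall in them.

    mutation_sites: list with one hashable site key per observed mutation
    (for example (gene, amino_acid_position); an assumed representation).
    """
    counts = Counter(mutation_sites)
    recurrent = {site: n for site, n in counts.items() if n >= min_cases}
    in_recurrent = sum(recurrent.values())
    fraction = in_recurrent / len(mutation_sites) if mutation_sites else 0.0
    return recurrent, fraction
```

Applied to the full call set, the second return value corresponds to the kind of proportion reported above (214 of 298 mutations).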

Figure 3: Distribution of HLA mutations across cancers and across functional domains and tumor types.

Top, distribution of potential loss-of-function events, including out-of-frame and nonsense mutations. The histogram summarizes the number of events identified at each position. Central panel, pattern of mutations detected in each tumor type. Bottom, recurrent events; recurrent positions (with disease, allele group) with frequency ≥5 cases/recurrent site are shown. Bladder (BLCA), breast (BRCA), cervical squamous (CESC), colon adenocarcinoma (COAD), head and neck squamous (HNSC), lower-grade glioma (LGG), lung adenocarcinoma (LUAD), lung squamous (LUSC), prostate adenocarcinoma (PRAD), rectum adenocarcinoma (READ), melanoma (SKCM), stomach adenocarcinoma (STAD), thyroid (THCA), endometrial (UCEC).

Somatic class I HLA mutations are likely positively selected

Alterations highly likely to have a functional effect, including loss-of-function events (nonsense, frameshift indels, splice site), were significantly enriched in HLA mutations compared to non-HLA mutations (Fig. 4a, chi-squared test P < 2.2 × 10⁻¹⁶). We also observed that whereas loss-of-function mutations occurred in all functional domains of the HLA molecule, they demonstrated a strong preference for the N-terminal end in the leader peptide sequence (P = 0.0038), which would likely result in a completely nonfunctional protein (Supplementary Fig. 4d). The highest frequency of mutations localized to exon 4 (118 mutations, 39.6%), which encodes the α3 domain of the HLA protein that binds to the CD8 co-receptor of T cells28 (Fig. 4b). Abrogation of this function could lead to a loss of T-cell recognition and thereby a loss of immune reactivity. The second-highest frequency of mutations occurred in exon 3 (56 mutations, 18.8%), followed by exon 2 (49 mutations, 16.4%); exons 2 and 3 encode the α1 and α2 peptide-binding domains of the HLA molecule, respectively, which conventionally bind 9- and 10-mer peptides for antigen presentation29.

Figure 4: Distribution of MHC class I mutations and evidence of positive functional selection.

(a) Comparison of the spectrum of mutations in non-HLA genes and HLA genes. The ratio of the number of mutations of a particular type to the number of silent mutations is compared between the non-HLA and HLA genes for all mutation types (chi-squared test, P < 2.2 × 10⁻¹⁶). Ins., insertion; del., deletion. (b) Distribution of HLA mutations across exons. (c) Mutations in HLA positions that are in actual physical contact with the peptide (contact residues). Left panel, the relative orientation of a 9-mer peptide with respect to the HLA and T-cell molecules. Positions 2 and 9 constitute the primary anchors, whereas position 6 forms the secondary anchor with HLA. The remaining positions interact with the T-cell molecule. Right panel, the nine amino acids of the peptide and their corresponding HLA contact residues are indicated along the rows (green, HLA-interacting anchor positions; blue, T-cell-interacting positions). The histogram depicts the frequency of observed HLA mutations in contact residues corresponding to each peptide position29. (d) Killer lymphocyte effector genes are more highly expressed in tumors exhibiting MHC class I mutation. Unbiased statistical analysis was employed to find genes more highly expressed in tumors harboring a mutation in an MHC class I allele. Heatmap displays the color-coded expression ratio of medians (HLA-mutant vs. nonmutant samples) for genes (columns) in each cancer type (rows), excluding cancer types with fewer than three instances of HLA mutation in the cohort. *P < 0.05 and **P < 0.0005 indicate the significance of the association for the given gene in the given cancer type according to a one-sided Wilcoxon rank-sum test (null hypothesis: expression is not greater in the mutants). Cytolytic activity (geometric mean of GZMA and PRF1 expression) is included as though it were a gene. The depicted genes are those for which expression in MHC class I–mutated tumors was most significantly elevated across cancers (unadjusted P < 10⁻¹⁰ combined by Fisher's method, Supplementary Table 15).

Analysis of the position of the mutated residues within exons 2 and 3 in relationship to their predicted interaction with binding peptide29 further strongly suggests alteration of immune function by these somatic HLA mutations (Supplementary Table 14). The two major anchor grooves in the HLA molecule bind to positions 2 and 9, respectively, of the peptide, and a mutation in either groove would be expected to profoundly affect the biochemical stability of the MHC-peptide complex29. A secondary anchor groove that interacts primarily with the sixth amino acid of the peptide lies between the two primary anchor grooves30. Overall, 28.6% of mutations (30 of 105) in the peptide binding domains were in residues that come in contact with the peptide and 80% (24 of 30) of these were in positions that comprised one of the two primary anchor grooves (Fig. 4c).

We hypothesized that loss-of-function HLA mutations would more likely arise in the presence of selective pressure imposed by the host immune response against the tumor. A growing body of studies has shown that higher mutational burdens in cancers give rise to a higher load of mutation-derived immunogenic epitopes and that immune responses against these are associated with clinical benefit31. These immune responses are presumably driven by the presentation of tumor-derived epitopes by antigen-presenting cells to stimulate effector lymphocyte responses. Consistent with the idea that a tumor would evolve in a manner to escape recognition and destruction by tumor-directed T or natural killer (NK) cells, we detected an association between the presence of HLA somatic mutations and tumor expression signatures of effector lymphocyte infiltration, as recently defined32 (Supplementary Table 15 and Fig. 4d). Although putative loss-of-function somatic mutations in tumor HLA genes could lead to a decrease in the presentation of immunogenic epitopes by the tumor cell and evasion of immunologic targeting, these same mutations would not affect the ability of nontumor, host antigen-presenting cells to ingest and present tumor antigens to T cells, thereby stimulating immune infiltration. To further examine this idea, we analyzed the expression of 18,000 genes in matched RNA-seq data from 4,512 samples across 11 tumor types and found the strongest associations in 6 of 11 cancer types (stomach, endometrial, cervical, head and neck, colorectal and glioma), suggesting that reduced MHC class I activity may be particularly important for driving immune escape in these tumor types. From this unbiased analysis, the most significantly enriched genes were interferon gamma (IFNG), T-cell attractive chemokines (CXCL9, CXCL10, CXCL11), lytic molecules (GZMA, GZMH, PRF1, GNLY), as well as the “Cytolytic Activity” metagene (analyzed previously as a measure of anti-tumor T/NK cell activity32). These results suggest that acquisition of HLA mutations without abrogation of expression may provide a complementary immunosurveillance escape mechanism in which potential destruction of the tumor by T cells and NK cells is precluded.
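Two quantities used in this analysis, the cytolytic activity metagene (geometric mean of GZMA and PRF1 expression) and the ratio of medians between HLA-mutant and nonmutant samples, are defined in the Fig. 4d legend. A minimal sketch follows; the function names are illustrative, and the expression units and any normalization applied in the study are not assumed.

```python
from math import sqrt
from statistics import median

def cytolytic_activity(gzma, prf1):
    """Geometric mean of GZMA and PRF1 expression for one sample."""
    return sqrt(gzma * prf1)

def ratio_of_medians(mutant_values, nonmutant_values):
    """Median expression in HLA-mutant samples divided by that in nonmutant samples."""
    return median(mutant_values) / median(nonmutant_values)
```

A ratio above 1 indicates higher median expression in the HLA-mutant group, which is the direction examined by the one-sided test described in the legend.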

Discussion

Immune evasion is a critical process in tumor biology and is enabled by several mechanisms including immune-editing33, downregulation of HLA expression34, secretion of immunosuppressive mediators35 and expression of proteins that modulate immune checkpoints36. Most recently, somatic mutation of HLA genes was revealed to be a significantly frequent process in some tumor types4. Improved sensitivity and accuracy of somatic HLA mutation detection could better characterize this already strongly implicated mechanism of immune evasion across cancers. We therefore created Polysolver, a model-based algorithm for accurate inference of HLA typing information from germline exome-capture data, which enables more sensitive and specific detection of somatic HLA mutations compared to standard techniques reliant on alignment to the canonical reference genome.

We have demonstrated that Polysolver infers HLA-type information with 97% sensitivity and 98.8% precision from exome-capture sequencing data and is among the best-performing tools for the analysis of HLA loci from WES data. Indeed, different typing tools, or a combination thereof, may be used for optimizing different aspects of HLA mutation detection performance. For example, a consensus approach that uses only allele species commonly identified by multiple tools as a basis for mutation detection would favor increased specificity at the cost of sensitivity. The better performance of HLA mutation detection was assessed to be primarily due to the use of inferred alleles as reference and the employment of stringent criteria for filtering aligned reads before mutation calling. We estimate an increase in sensitivity from 58.8% to 94.1% and specificity from 20% to 53.3% over standard methods, based on validation of point mutations in RNA-seq data. An expected limitation of Polysolver is its restriction to identification of known alleles, but future versions may be augmented by an assembly-driven module that would enable discovery of novel HLA alleles, and by representing a wider range of ethnic groups. Polysolver and other available HLA typing tools that can be used with WES are also not yet suitable for clinical use, where much higher accuracy (>99.9%) is required. However, the Polysolver-based mutation detection pipeline can still be used effectively for detecting somatic changes in HLA genes once experimentally determined HLA typing information is available.

In this study, we performed a comprehensive characterization of HLA mutations in 7,930 samples across 20 different tumor types. We have shown that, in comparison to previous studies, the HLA mutational spectrum elucidated by our analysis contains substantially fewer false positives and includes additional somatic mutations. Several biologic insights emerged from our analysis. First, we identified colon adenocarcinoma to be significantly affected by somatic mutation in class I HLA genes in addition to head and neck, lung squamous and stomach cancer, thus further supporting HLA mutation as a common oncogenic mechanism. In contrast, other cancers such as glioblastoma, ovarian cancer and CLL largely lacked mutations in HLA genes. Second, several characteristics of the identified nonsynonymous mutations suggest that they functionally affect antigen presentation. We identified 29 sites across the HLA genes that were recurrently mutated in at least three cases, and 35 sites mutated in two cases, suggesting positive selection at these positions. We further noted a significant enrichment in loss-of-function events in the HLA genes, such as frameshifting indels, nonsense and splice-site mutations. These events would be expected to abrogate HLA class I surface expression on tumors37,38,39, thereby affecting antigen presentation to immune cells. We determined that the majority of the detected mutations map to regions critical for antigen presentation. More than a third of the mutations (39.6%) were in exon 4, which encodes the α3 domain of the MHC class I molecule that binds to the CD8 co-receptor on T cells28. Mutations in this domain have been previously shown to abrogate binding to CD8 (ref. 40). Exons 2 and 3 harbored 35.2% of the mutations; these exons encode the surfaces that present peptides to immune cells. We found evidence that exon 2 and 3 HLA mutations preferentially localized to residues critical for anchoring peptide to the MHC binding grooves, and would be expected to interfere with the fundamental process of antigen presentation29,30.

Finally, we observed a strong association between effector lymphocyte gene expression signatures and HLA mutations, which is consistent with the hypothesis that somatic changes in these genes are a plausible immune escape mechanism that arises in response to increased cytolytic activity in several tumor types. However, additional experiments are required to better understand this mechanism.

Improvements in massively parallel sequencing technologies are now enabling increased coverage and longer read lengths, which should further help Polysolver in resolving somatic changes in HLA regions. Further efforts will be focused on extending the methodology to other data modalities including RNA-seq and whole genome sequencing. In addition to enabling better detection of HLA mutations, accurate HLA typing by Polysolver can also be used to study germline associations of HLA alleles in diseases, such as autoimmune diseases and cancer. It could be used prospectively for preliminary screening for matches for allogeneic organ transplantation. Finally, as described here, Polysolver can be potentially extended to extract sequence and mutation information from other polymorphic regions in the genome such as MHC class II, nonclassical MHC alleles, TAP1 and TAP2 genes, and MIC-A and MIC-B ligands, and hence is a generally applicable analysis framework to address these otherwise challenging loci.

Methods

Polysolver is freely available for noncommercial use at http://www.broadinstitute.org/cancer/cga/polysolver and in Supplementary Software.

WES data.

All samples were obtained under Institutional Review Board approval and with documented informed consent. A complete list of TCGA samples is given in Supplementary Table 11. Mutational spectra of CLL17,45 and melanoma24 have previously been reported, whereas mutation lists for lung squamous carcinoma (LUSC), lung adenocarcinoma (LUAD), bladder (BLCA), head and neck (HNSC), colon (COAD) and rectum (READ), glioblastoma (GBM), ovarian (OV), uterine corpus endometrial carcinoma (UCEC) and breast (BRCA) were obtained from the Sage Bionetworks' Synapse resource (http://www.synapse.org/#!SYNAPSE:syn1729383). For a subset of CLL patients (N = 8), HLA typing was performed by molecular typing (Tissue Typing Laboratory, Brigham and Women's Hospital, Boston), and these cases were used as a training set for the Polysolver algorithm (Supplementary Table 1). The validation set comprised 253 samples from 183 distinct individuals (47 Caucasian, 50 Blacks, 41 Chinese and 45 Japanese individuals) that had both exome data and experimentally determined HLA type information12 (http://www.1000genomes.org/).

Polysolver allele database creation.

To maximally retrieve true HLA reads, we constructed a full-length genomic reference library of known HLA alleles (6,597 unique entries) based on the Multiple Sequence Alignment (MSA) files provided in the IMGT database (v3.10; http://www.ebi.ac.uk/ipd/imgt/hla/), similar to the approach described in Erlich et al.12. We first used the cDNA file to impute exons in an incompletely sequenced allele by using a reference allele that had protein-level identity with the allele in question, as was evident by concordance of 4-digit nomenclature. If no such reference allele was available, we set as reference an allele that derived from the same allele group, as was evident by concordance of 2-digit nomenclature. In cases where there were multiple such possibilities for choosing the reference allele, we chose the first listed allele in the MSA. A similar approach was used to impute the missing components of the sequences listed in the genomic (gDNA) MSA file. Finally, the full-length genomic sequence of each allele was imputed by assembling exons from the cDNA imputation step and introns from the gDNA imputation step.
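The rule for choosing a reference allele during imputation (4-digit concordance first, then 2-digit concordance, taking the first listed allele in the MSA when several qualify) can be sketched as follows. The sketch assumes colon-delimited allele names such as `A*02:01:01:01`, where the first two fields correspond to the 4-digit name and the first field to the 2-digit allele group; the function names and the list-based input are hypothetical.

```python
def name_prefix(allele, n_fields):
    """Truncate an allele name to its first n colon-delimited fields."""
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:n_fields])

def choose_reference(incomplete_allele, complete_alleles_in_msa_order):
    """Pick the reference allele used to impute missing sequence.

    Tries 4-digit concordance (two fields) first, then 2-digit concordance
    (one field); within each tier the first allele listed in the MSA wins.
    Returns None if no allele from the same allele group is available.
    """
    for n_fields in (2, 1):
        key = name_prefix(incomplete_allele, n_fields)
        for candidate in complete_alleles_in_msa_order:
            if candidate != incomplete_allele and name_prefix(candidate, n_fields) == key:
                return candidate
    return None
```

Keeping the candidates in MSA order makes the tie-break ("first listed allele") fall out of the iteration order, with no explicit sorting step.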

Ethnicity inference and prior probability estimation.

4-digit allele frequencies for different ethnicities were calculated by taking a sample-size weighted average of all relevant population studies in the Allele Frequency Net Database (http://www.allelefrequencies.net/).
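A sample-size weighted average of this kind can be written in a few lines. The input layout, a list of (frequency, sample size) pairs with one entry per population study, is an assumption made for illustration.

```python
def weighted_allele_frequency(studies):
    """Sample-size weighted mean of an allele's frequency across studies.

    studies: list of (frequency, sample_size) tuples, one per population
    study of the relevant ethnicity (assumed layout).
    """
    total_n = sum(n for _, n in studies)
    if total_n == 0:
        raise ValueError("at least one study with a nonzero sample size is required")
    return sum(freq * n for freq, n in studies) / total_n
```

For example, frequencies of 0.10 in a study of 100 individuals and 0.20 in a study of 300 give (0.10 × 100 + 0.20 × 300) / 400 = 0.175, so larger studies pull the estimate toward their own value.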

A rapid principal components analysis (PCA)-based method was developed to infer ethnicity for samples of unknown racial origin (Kiezun et al., unpublished data). Exome data for samples of known (self-described) ethnicity from the 1000 Genomes and HapMap projects (n = 1,398, with 911 Caucasians, 375 Blacks, 54 Asians and 58 South Asians) were genotyped at a predefined set of 5,845 loci chosen based on considerations related to known linkage disequilibrium between different loci, representation on population genotyping platforms and consistency between genome releases46. A PCA revealed distinct segregation of Caucasian, Black, Asian and South Asian samples in the 2-dimensional space defined by the first two principal components. Any new sample of unknown ethnicity can now be projected into this space and its Euclidean distance from the cluster centroids can be computed. Ethnicity is inferred based on the cluster of minimal distance from the sample projection.
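The projection and nearest-centroid assignment can be sketched in plain Python. The per-locus means, the loadings of the first two principal components and the centroid coordinates would come from the reference panel; in this sketch they are simply passed in, and the function names and the numeric genotype coding are assumptions.

```python
from math import dist

def project(genotypes, locus_means, loadings):
    """Project one sample onto principal components.

    genotypes: numeric genotype per locus (coding is an assumption).
    locus_means: per-locus means from the reference panel.
    loadings: one weight vector per principal component.
    """
    centered = [g - m for g, m in zip(genotypes, locus_means)]
    return tuple(sum(c * w for c, w in zip(centered, axis)) for axis in loadings)

def infer_ethnicity(point, centroids):
    """Return the label of the centroid closest to the projected sample."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))
```

Because only distances to four centroids are compared, the assignment step is inexpensive once the projection has been computed.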

Allele inference.

The posterior probability calculations for alleles corresponding to each HLA gene (A, B or C) are performed separately as described below:

Let

NA ≡ # alleles corresponding to the HLA gene

N ≡ # reads aligning to at least one allele

Nm ≡ # reads aligning to allele am

NT ≡ # reads in the sequencing run

fm ≡ population-based prior probability of allele m

rk1 ≡ first read of read pair rk

rk2 ≡ second read of read pair rk

dk ≡ insert length of read pair rk

lk1 ≡ length of first read of read pair rk

lk2 ≡ length of second read of read pair rk

qi ≡ Phred-like quality of sequenced base i

ei ≡ probability that the sequenced base i is an error

The quality scores of the alignment were used to build a model for the sequencing process. Let us say that a given read pair rk does in fact derive from an allele am and their sequence relationship allowing for miscalls in the sequencing process is accurately captured in the alignment. Let YAi, YCi, YGi and YTi denote random variables corresponding to observing bases A, C, G and T respectively at position i in read pair rk in its alignment to allele am. Then

where

Let D denote a random variable for the observed insert length of a paired read in the sequencing run based on alignment to the complete genome. For a given read pair rk, the empirical insert size distribution can be used to estimate the probability of observing the insert length dk as

Assuming positional independence of quality scores, and independence of generated reads and their insert sizes, the probability of observing rk given allele am is then

where sk corresponds to the lowest theoretical probability achievable for read pair r'k with perfect base qualities and segment lengths equal to those of rk. Since 93 is the maximum achievable base quality under Illumina 1.8+ format, sk is computed as

The posterior probability of allele am using all reads that align to it is given by

Log transformation of the above equation yields

Note that terms that are constant across all alleles can be ignored. The first allele is inferred as the one that maximizes the posterior probability.
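As a rough illustration of this scoring scheme, the sketch below computes a log-posterior score per allele and picks the maximum. Several pieces are assumptions rather than the published formulas: the standard Phred relation e = 10^(-q/10), a miscall being equally likely to produce any of the three other bases, and the normalizer s_k being passed in by the caller as `log_s` (its exact form is not reproduced here). All function names and the toy data are hypothetical.

```python
import math

def base_error_prob(q):
    """Standard Phred relation between quality q and error probability (assumed)."""
    return 10 ** (-q / 10)

def read_pair_log_likelihood(read, ref, quals, insert_prob, log_s=0.0):
    """Log-probability of a read pair given an allele, assuming independent positions.

    read, ref   : aligned base strings of equal length (both mates concatenated)
    quals       : Phred qualities (> 0), one per base
    insert_prob : empirical probability of the observed insert length
    log_s       : log of the per-pair normalizer s_k, supplied by the caller
    """
    total = math.log(insert_prob) - log_s
    for base, ref_base, q in zip(read, ref, quals):
        e = base_error_prob(q)
        # Assumption: an erroneous call is uniform over the 3 alternative bases.
        total += math.log(1 - e) if base == ref_base else math.log(e / 3)
    return total

def allele_log_score(prior, pair_log_likelihoods):
    """Log prior plus the summed log-likelihoods of the read pairs aligned to the allele."""
    return math.log(prior) + sum(pair_log_likelihoods)

def first_allele(scores):
    """Allele with the maximal log-posterior score."""
    return max(scores, key=scores.get)

# Toy example: one read pair, two candidate alleles differing at the last base.
quals = [30, 30, 30, 30]
scores = {
    "allele_x": allele_log_score(0.1, [read_pair_log_likelihood("ACGT", "ACGT", quals, 0.01)]),
    "allele_y": allele_log_score(0.2, [read_pair_log_likelihood("ACGT", "ACGA", quals, 0.01)]),
}
print(first_allele(scores))
```

Working in log space avoids numerical underflow when the likelihoods of many read pairs are multiplied together.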

To infer the second allele, we had to account for the fact that different alleles, including the winning allele, are very similar to each other. We therefore weighted reads aligning to multiple alleles using a heuristic strategy. For a given allele am, the likelihood lmk of a read rk that also mapped to the winning allele aw with likelihood lwk was weighted by a factor equal to lmk/(lmk + lwk). Consequently, reads mapping exclusively to am with respect to aw were assigned a weight of 1. The read insert size and allele prior probability components were preserved from the first allele inference step. The second winner at each locus was identified as the allele with the maximal reevaluated score.
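The weighting factor itself can be sketched as follows. The sketch takes log-likelihoods as input so that very small likelihoods do not underflow; the function name `shared_read_weight` is hypothetical, and how the weight is then combined with the score is not shown because the text does not spell it out.

```python
import math

def shared_read_weight(log_l_m, log_l_w=None):
    """Weight l_m / (l_m + l_w) for a read, computed from log-likelihoods.

    log_l_m : log-likelihood of the read under candidate allele a_m
    log_l_w : log-likelihood under the winning allele a_w, or None if the
              read did not map to the winning allele (weight is then 1)
    """
    if log_l_w is None:
        return 1.0
    diff = log_l_w - log_l_m
    # l_m / (l_m + l_w) = 1 / (1 + exp(diff)); branch to keep exp() from overflowing.
    if diff >= 0:
        ex = math.exp(-diff)
        return ex / (1.0 + ex)
    return 1.0 / (1.0 + math.exp(diff))

# A read three times more likely under a_m than under a_w gets weight 0.75.
print(shared_read_weight(math.log(0.3), math.log(0.1)))
```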

Pre- and post-processing steps for HLA mutation detection.

Prior to detection of somatic changes using MuTect and Strelka by comparison of tumor and normal HLA reads aligned to Polysolver-inferred HLA alleles, the following changes and filters were implemented: (i) the NotPrimaryAlignment bit flag was cleared in all alignments, as several reads mapped to multiple alleles; (ii) mapping quality was set to a nonzero value (70) for all reads; (iii) alignments in which the two mates did not align to the same reference allele were discarded; and (iv) alignments where at least one mate had more than one mutation, insertion or deletion event compared to the reference allele were discarded. Soft-clipping of the reads was not allowed during the alignment. Alleles with multiple detected somatic changes were removed from the analysis. In cases where both inferred alleles were identical in the region of the detected somatic mutation, the mutation was assigned to the more common allele in the population. All somatic events were visualized using IGV (MuTect: 'KEEP' entries in call_stats file, Strelka: All entries in all.somatic.indels.vcf file) and the ones that passed manual review were further annotated for the gene compartment (intron, exon, splice site) and protein change. Splice sites were defined as the set of splice consensus sequence positions that had a bit score of at least 1 in either the human major/U2 or human minor/U12 introns at the exon/intron boundaries (9 positions at the 5′ splice donor end of the intron including the ultimate base in the upstream exon, and 2 positions at the 3′ splice acceptor end of the intron)47.
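Filters (i) to (iv) can be sketched on simplified alignment records. This is an illustration only: each mate is represented as a plain dictionary with hypothetical fields `flag`, `mapq`, `ref` (the allele it aligned to) and `n_events` (the count of mismatch, insertion and deletion events), rather than as a record from any particular BAM library. The value 0x100 is the SAM flag bit for a "not primary" alignment.

```python
NOT_PRIMARY = 0x100  # SAM flag bit marking a "not primary" (secondary) alignment

def preprocess_pair(mate1, mate2, mapq=70, max_events=1):
    """Apply filters (i)-(iv) to one read pair.

    Returns cleaned copies of both mates, or None if the pair is discarded.
    """
    # (iii) both mates must align to the same reference allele
    if mate1["ref"] != mate2["ref"]:
        return None
    # (iv) neither mate may carry more than one event relative to the allele
    if mate1["n_events"] > max_events or mate2["n_events"] > max_events:
        return None
    cleaned = []
    for mate in (mate1, mate2):
        rec = dict(mate)
        rec["flag"] &= ~NOT_PRIMARY  # (i) clear the not-primary bit
        rec["mapq"] = mapq           # (ii) set a fixed nonzero mapping quality
        cleaned.append(rec)
    return tuple(cleaned)

# Hypothetical pair: both mates on the same allele, one flagged as not primary.
m1 = {"flag": 0x1 | 0x100, "mapq": 0, "ref": "allele_x", "n_events": 1}
m2 = {"flag": 0x1, "mapq": 0, "ref": "allele_x", "n_events": 0}
print(preprocess_pair(m1, m2))
```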

Validation of somatic HLA mutations by RNA-seq evaluation.

The MutationValidator tool (data not shown) was used for orthogonal confirmation of mutations in RNA-seq data. A mutation was considered validated in RNA-seq if at least two reads supported it. To determine the power to detect a mutation at a given site, we first modeled the distribution of the allelic fraction of the mutation based on the exome data as a Beta(a+1, r+1) distribution, where a is the number of reads bearing the alternate allele and r is the number of reads bearing the reference allele at the site of the mutation. Then, given the total number of reads aligning at the position in the RNA-seq data (N), power was calculated as the probability of detecting at least two reads bearing the alternate allele in the RNA-seq data (assuming the mutation has the same underlying allele fraction as in the DNA) using the beta-binomial distribution Beta-Binom(N, a+1, r+1), that is,

$$\text{power} = P(X \ge 2) = 1 - P(X = 0) - P(X = 1), \qquad X \sim \text{Beta-Binom}(N,\, a+1,\, r+1)$$

A threshold of 80% power was used to consider a site to be powered to detect the mutation in the RNA-seq data. Sites that had less than 80% power were removed from the analysis.
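The power calculation can be sketched with the standard library alone. This is an independent illustration of the beta-binomial computation described above, not the MutationValidator code; the function names are hypothetical.

```python
import math

def log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def betabinom_pmf(k, n, alpha, beta):
    """P(X = k) for X ~ Beta-Binom(n, alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

def detection_power(n_rna, alt_dna, ref_dna, min_reads=2):
    """Probability of seeing at least `min_reads` alternate reads among n_rna RNA-seq reads.

    alt_dna, ref_dna : alternate and reference read counts in the exome data,
                       giving a Beta(alt_dna + 1, ref_dna + 1) allele-fraction model
    """
    alpha, beta = alt_dna + 1, ref_dna + 1
    upper = min(min_reads, n_rna + 1)
    p_below = sum(betabinom_pmf(k, n_rna, alpha, beta) for k in range(upper))
    return 1.0 - p_below

# Hypothetical site: 10 alternate and 10 reference reads in DNA, 50 reads in RNA-seq.
power = detection_power(50, 10, 10)
print(power, power >= 0.8)
```

Using `math.lgamma` keeps the binomial coefficient and Beta-function terms in log space, which avoids overflow at high read depth.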

Standard HLA typing.

Standard HLA typing was performed at the Brigham and Women's Hospital Tissue Typing Laboratory using a combination of sequence-specific oligonucleotide probe (SSO) and sequence-specific primer (SSP) techniques. Genomic DNA samples were initially typed using locus-specific LabType SSO kits (One Lambda Inc.) and analyzed using a Luminex 200. Loci for which there was more than one common well-documented (CWD) allele were subsequently resolved by PCR-SSP kits (One Lambda Inc. and Life Technologies) and analyzed using gel electrophoresis.

Validation of inferred somatic HLA mutations by targeted long sequencing of HLA-A and -B.

HLA-A and HLA-B amplification of TCGA samples. HLA locus-specific amplification of HLA-A and HLA-B sequences was performed separately using HGSgo-AmpX kits from GenDX (Utrecht, Netherlands). Briefly, for each sample, 100 ng of genomic DNA was mixed with 1 μl of AmpX primer (GenDX), 1.25 μl dNTP mix (Qiagen), 2.5 μl LongRange PCR Buffer (Qiagen) and 0.4 μl LongRange PCR Enzyme (Qiagen), and nuclease-free water was added to a final volume of 25 μl per reaction. Samples were then placed in a thermal cycler and PCR was performed using the following conditions: initial denaturation at 95 °C for 3 min, followed by 35 cycles of 95 °C for 15 s, 65 °C for 30 s and 68 °C for 6 min, followed by a final incubation at 68 °C for 10 min. All PCR reactions were then purified using Agencourt AMPureXP beads, according to the manufacturer's protocol (Beckman Coulter). Following AMPureXP purification, the concentrations of the amplification products (3.1–3.4 kb) were confirmed by Quant-iT (Life Technologies), and the sizes were confirmed using an Agilent Bioanalyzer DNA 7500 kit.

Library construction and long sequencing. SMRTbell DNA template libraries were prepared from the HLA-A and HLA-B amplicons, according to the manufacturer's suggested protocol (5 kb Template Preparation and Sequencing, Pacific Biosciences). Briefly, equimolar pools of HLA-A and HLA-B amplicons were prepared for each sample. Pooled amplicons were then end repaired and ligated to barcoded SMRTbell adapters. Following the addition of barcoded SMRTbell adapters, all samples were pooled and exonuclease treated according to the manufacturer's suggested protocol. Pooled, barcoded libraries were then purified using AMPure PB beads (Pacific Biosciences) and quantified using an Agilent Bioanalyzer DNA 7500 kit. Pooled samples were sequenced in SMRTCells with a Pacific Biosciences RSII instrument using the P6 DNA/Polymerase Binding Kit in conjunction with the DNA Sequencing Reagent 4.0. Barcoded subreads were analyzed using the SMRT Analysis (version 2.3.0) Long Amplicon Analysis (LAA) protocol.

Analysis. We confirmed the accuracy of the Pacific Biosciences-based long sequencing approach by testing six samples from normal volunteers with known HLA typing (performed at the BWH Tissue Typing Laboratory using a combination of SSO and SSP techniques, see above), wherein we observed 100% concordance between the two approaches. The LAA phased consensus fastq sequences and HLA typing for each sample were derived using a set of publicly available analysis tools (https://github.com/bnbowman/HlaTools). In total, data were generated from 28 samples corresponding to 18 different mutations (10 tumor/normal pairs and 8 tumor-only cases). The median number of subreads generated per sample was 20,120 (range: 7,464–40,990). For validation of Polysolver-predicted mutations, the subreads from the corresponding samples were split into contiguous 76-mers, aligned to alleles comprising the inferred HLA type for the individual using Novoalign (http://www.novocraft.com/) and visualized using IGV. Only reads that had no more than one somatic event of the same type (mismatch, insertion, deletion) as the mutation being assessed were retained. After filtering, the median number of 76-mer reads mapping to the allele predicted to have the mutation was 1,046 (range: 9–3,860). Power was calculated using the MutationValidator tool as described above, and a threshold of 80% power was used in evaluating the mutations.
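Splitting a long subread into contiguous 76-mers can be sketched as below. The sketch assumes non-overlapping, consecutive chunks and drops any trailing fragment shorter than 76 bases; the text does not state how such remainders were handled, and the function name is hypothetical.

```python
def split_into_kmers(seq, k=76):
    """Split a sequence into consecutive, non-overlapping k-mers.

    Assumption: a trailing remainder shorter than k is discarded.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

# A hypothetical 200-base subread yields two 76-mers; the last 48 bases are dropped.
subread = "ACGT" * 50
chunks = split_into_kmers(subread)
print(len(chunks), [len(c) for c in chunks])
```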

Identifying changes in gene expression associated with nonsilent MHC class I mutation.

Gene expression data were obtained and processed as described32. In short, “Level_3” gene-level data were obtained from GDAC Firehose (http://gdac.broadinstitute.org/). Read counts were tallied per gene symbol and divided by the gene symbol's maximum transcript length (as defined by UCSC Genome Browser's table “knownIsoforms” (hg19 version)). For each sample, these values were rescaled to sum to a total of one million, such that expression estimates may be interpreted as transcripts per million (TPM).
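The length normalization and rescaling can be sketched for a single sample as follows. The gene names, counts and transcript lengths are made up for illustration, and the function name `to_tpm` is hypothetical.

```python
def to_tpm(read_counts, max_transcript_lengths):
    """Convert per-gene read counts for one sample to TPM.

    Each count is divided by the gene's maximum transcript length, and the
    resulting values are rescaled so that they sum to one million.
    """
    rates = {gene: read_counts[gene] / max_transcript_lengths[gene] for gene in read_counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

# Hypothetical sample with three genes.
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 50}
lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 250}
tpm = to_tpm(counts, lengths)
print(tpm)
```

In this toy sample, GENE_A and GENE_B receive the same TPM value because their read counts scale with their lengths.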

For each gene (of 18,000 quantified pan-cancer), a one-sided Wilcoxon rank-sum test was applied to determine whether the mutants (those samples nonsilently mutated in any of the six HLA alleles) demonstrated significantly higher expression than the nonmutants. In performing this rank-based test, random tie breaks were applied when two samples exhibited identical gene expression. Note that in addition to the 18,000 genes tested, “cytolytic activity” (defined previously as the geometric mean of GZMA and PRF1 expression32) was also included. This process was executed separately per tumor type and excluded tumor types for which the count of mutated samples with available expression data was fewer than three (which excluded glioblastoma, CLL, kidney clear cell cancer, liver cancer, ovarian cancer, prostate cancer, melanoma and thyroid cancer). This resulted in a matrix of P-values (11 tumor types by 18,000 genes). Fisher's method was applied to each gene to assess its overall significance across the 11 tumor types. Per-cancer and pan-cancer P-values are presented (Supplementary Table 15). Effect sizes (estimated by taking the ratio of median expression in the mutants to median expression in the nonmutants) for top genes (defined as those with unadjusted P < 10⁻¹⁰) are depicted in the form of a heatmap (Fig. 4d). For this heatmap, row and column orderings reflect hierarchical clustering (on the basis of the effect size variable), though dendrograms are not shown.
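Combining per-tumor-type P-values for one gene with Fisher's method can be sketched as follows. Fisher's statistic, -2 Σ ln p_i, follows a chi-square distribution with 2k degrees of freedom under the null hypothesis when the k tests are independent; for an even number of degrees of freedom the tail probability has a closed form, so no statistics library is needed. The function name and the example P-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P-values (each > 0) with Fisher's method.

    Uses the closed-form chi-square survival function for 2k degrees of freedom:
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, where x = -2 * sum(ln p_i).
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # x / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # (x/2)^i / i!, built up incrementally
        total += term
    return math.exp(-half_stat) * total

# Hypothetical per-tumor-type P-values for a single gene.
per_cancer_p = [0.01, 0.20, 0.03, 0.50]
print(fisher_combined_p(per_cancer_p))
```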

This entire process was repeated, but we reversed the directionality of the one-sided Wilcoxon rank-sum tests in order to identify genes with lower expression in HLA mutants. Per-cancer and pan-cancer P-values for this analysis are presented in Supplementary Table 16, and the effect size heatmap appears as Supplementary Figure 5.

Accession codes.

dbGaP: phs000178.