Identifying immunodominant T cell epitopes remains a significant challenge in the context of infectious disease, autoimmunity, and immuno-oncology. To address the challenge of antigen discovery, we developed a quantitative proteomic approach that enabled unbiased identification of major histocompatibility complex class II (MHCII)–associated peptide epitopes and biochemical features of antigenicity. On the basis of these data, we trained a deep neural network model for genome-scale predictions of immunodominant MHCII-restricted epitopes. We named this model bacteria originated T cell antigen (BOTA) predictor. In validation studies, BOTA accurately predicted novel CD4 T cell epitopes derived from the model pathogen Listeria monocytogenes and the commensal microorganism Muribaculum intestinale. To conclusively define immunodominant T cell epitopes predicted by BOTA, we developed a high-throughput approach to screen DNA-encoded peptide–MHCII libraries for functional recognition by T cell receptors identified from single-cell RNA sequencing. Collectively, these studies provide a framework for defining the immunodominance landscape across a broad range of immune pathologies.


Canonical CD4 T cell responses are restricted to peptide antigens presented by major histocompatibility complex class II (MHCII), which in turn is dictated by sequence-specific interactions between the peptide backbone and the MHCII binding groove1,2. Degeneracy in this interaction allows for presentation of a broad spectrum of peptide antigens and promotes diverse responses to potential antigens. However, CD4 T cell responses tend to be constrained to a limited set of immunodominant epitopes, even when the pool of available peptide epitopes is not limiting. Thus, the T cell response must be balanced with regard to the magnitude of the response to any given epitope (to maximize efficacy) and specificity toward multiple epitopes (to combat epitope escape). Despite intensive investigation, the factors controlling immunodominance and antigenicity are incompletely understood.

At the level of antigen-presenting cells, several lysosomal pathways contribute to antigen processing and epitope selection3. In this context, numerous thiol reductases4 and proteases5,6 perform redundant functions in converting native proteins into MHCII ligands. The sequence- and structure-specific preferences of these enzymes for their substrates can bias which peptides are ultimately processed and loaded onto the MHCII. Furthermore, the kinetic parameters of the interaction of peptides with MHCII (peptide–MHCII complex) impact the stability of the complex such that weak binding peptides are replaced by means of human leukocyte antigen DM (HLA-DM) editing7,8. Taken together, the complexity of antigen processing and limited availability of model immunodominant antigens have posed a barrier to understanding the biochemical features of antigenicity.

Defining immunodominant CD4 T cell antigens has benefited from the development of unbiased approaches for epitope discovery. In this context, autologous self-epitopes have been identified by immunoaffinity purification of MHCII-associated peptides followed by Edman degradation sequencing or mass spectrometry9,10,11,12,13,14,15,16,17. Later advances in mass spectrometry allowed for the identification of tumor antigens18 and autoantigens19,20. While detection of MHCII-associated peptides derived from extracellular proteins has been achieved21, identifying epitopes from complex microorganisms remains challenging. Previous successes identified commensal T cell epitopes from within antigens that are recognized by immunoglobulins22,23,24. Here, we develop a technology for CD4 antigen discovery and demonstrate its ability to systematically define the immunodominance hierarchy of antigens and discover novel bacterial epitopes based on genomic sequence.


MHCII peptidomics in primary murine dendritic cells defines the biochemical features of antigenicity

To define the features of antigenicity, we developed a proteomics-based approach to identify MHCII-associated peptides in murine bone marrow–derived dendritic cells (BMDCs) from C57BL/6J mice expressing I-Ab. In these experiments, mature peptide–MHCII complexes were immunoprecipitated from primary BMDCs with monoclonal Y3P antibody25. Associated peptides from biological replicates were acid-eluted, chemically labeled with isobaric tags for relative and absolute quantitation (iTRAQ) to enable relative quantification, and analyzed by liquid chromatography–tandem mass spectrometry (LC-MS/MS) (Fig. 1a). This sensitive approach identified 3,671 unique peptides derived from 1,088 autologous murine proteins (Fig. 1b and Supplementary Table 1).

Fig. 1: MHCII peptidomics in primary murine dendritic cells results in more than 3,600 distinct peptide identifications and defines the I-Ab-binding motif.

a, Experimental workflow for immunopurification and sequencing of MHCII-associated peptides from murine dendritic cells. Peptide–MHCII complexes were immunopurified from WT and Atg16l1/− cells. Associated peptides were then acid-eluted, labeled with iTRAQ 4-plex reagents, desalted with SCX and C18, and analyzed using high-resolution LC-MS/MS. HCD, higher-energy collisional dissociation. b, A database search strategy for peptide–MHCII sequencing. All MS/MS spectra were searched against a database containing mouse proteins using the Spectrum Mill software with a ‘no enzyme’ specificity. Mouse peptides were validated using a 1% FDR cutoff, and the total number of peptides quantified across all samples was reported. c, The I-Ab-binding motif was derived from endogenous mouse peptides bound to MHCII. The heatmap color coding represents the frequency of each amino acid at each position.

Having generated an expansive catalog of the MHCII immunopeptidome in murine BMDCs, we sought to identify the key biochemical features associated with antigenicity. Initially, we derived the optimal I-Ab-binding motif from primary sequence features in autologous murine peptides (Fig. 1c). The resulting 9-mer core peptide sequence resembles previous predictions based on in vitro binding kinetics between synthetic peptide libraries and purified I-Ab26. In addition, we detected a strong preference for proline in the P4 position, as suggested by previous work27,28.
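The per-position amino acid frequencies summarized in the motif heatmap (Fig. 1c) can be illustrated with a minimal sketch. It assumes that the 9-mer cores have already been aligned; the function name `position_frequencies` and the toy sequences are hypothetical and are not data from this study.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(cores):
    """Return {position (1-based): {amino acid: frequency}} for aligned cores."""
    if not cores:
        raise ValueError("need at least one core sequence")
    length = len(cores[0])
    if any(len(core) != length for core in cores):
        raise ValueError("all cores must have the same length")
    matrix = {}
    for pos in range(length):
        counts = Counter(core[pos] for core in cores)
        matrix[pos + 1] = {aa: counts.get(aa, 0) / len(cores) for aa in AMINO_ACIDS}
    return matrix

# Hypothetical aligned 9-mer cores (illustrative only).
toy_cores = ["IAAPAVSAA", "VSAPGAEAK", "YEKPSLTAG", "FAQAAETSV"]
freqs = position_frequencies(toy_cores)
print(freqs[4]["P"])  # fraction of toy cores with proline at P4
```

In this toy set, three of the four cores carry proline at P4, so the P4 column is dominated by proline in the same way a real heatmap column would highlight an enriched residue.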

Autophagy shapes the MHCII immunopeptidome

The primary sequence of antigenic peptides confers binding to MHCII but does not represent the only factor governing antigenicity. Given that lysosomal pathways contribute to antigen processing and epitope selection, we hypothesized that autophagy shapes the immunopeptidome by dictating which antigens gain access to the lysosomal compartment. The core autophagy protein, autophagy-related protein 16-1 (ATG16L1), is required for macroautophagy and xenophagy, which divert the cytosolic cargo to the lysosomes for disposal. To selectively perturb autophagy in antigen-presenting cells, we generated Atg16l1f/f × CD11c-Cre mice (Atg16l1/−). MHCII peptidomics experiments demonstrated that Atg16l1/− BMDCs were severely impaired in their ability to present peptides from organelle-derived and cytosolic proteins (Fig. 2a,b). In contrast, Atg16l1/− cells displayed elevated levels of lysosome- and endosome-derived peptides compared to wild-type (WT) BMDCs (Fig. 2b). These findings are consistent with the known role for ATG16L1 in macroautophagy and highlight the degree to which lysosomal trafficking impacts the spectrum of peptides presented by the MHCII.

Fig. 2: Antigen processing pathways and epitope features revealed by MHCII peptidomics.

a, Deficiency in the autophagy protein ATG16L1 skews the spectrum of MHCII-associated peptides. MHCII-bound peptides quantified in Atg16l1/− dendritic cells relative to WT. Replicate (rep) samples were compared based on a log2 fold change (FC) between mouse strains. Each dot represents a unique peptide sequence. Peptides that were observed to be significantly upregulated or downregulated are shown in red, while peptide measurements that were not reproducible across both biological replicates are shown in cyan. Dot plot axes: log2 FC. Histogram axes: number of distinct peptides. n = 2 biologically independent samples per genotype in a single experiment. Reproducible replicates (95% limits of agreement of a Bland–Altman plot) were subjected to a moderated t-test to assess statistical significance. b, Abundance and subcellular sources of MHCII-associated peptides derived from Atg16l1/− and WT dendritic cells. c, Epitope mapping relative to domain structure of endogenous antigens indicates preferential presentation of epitopes derived from the luminal/extracellular domains of transmembrane proteins and epitopes positioned between structurally defined domains. d, Immunodominant epitope prediction with BOTA. Workflow of BOTA with input as genome and output as a binding score. The upper panel shows the extraction of candidate peptides; the lower panel shows the deep neural network core of the BOTA algorithm to assign a binding score to each candidate peptide. To extract candidate peptides, predicted genes from the input genome are processed by HMMTOP, Pfam domain search using HMMER version 3.1b2, and PSORTb version 3.0.2 to define various features, which are later integrated; candidate peptides are selected based on criteria previously described. To encode each candidate peptide with length l, amino acids are first encoded using a b-bit vector at random; thus, the total amino acid space can be represented by a 20 × b binary matrix. During the encoding process, a window of length k slides through the input peptide, thus forming l − k + 1 windows for an input peptide with length l. This process is repeated for all d detectors, forming a d-HMM model scoring matrix X. Matrix X then goes through the regular rectify-max pool-neural network prediction route to generate a final output score, f, for each input peptide.
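The encoding and scoring route described in panel d can be sketched in a few lines. This is an illustration under stated assumptions, not the BOTA implementation: the detector weights are random stand-ins for the pretrained binder models, a single logistic unit stands in for the neural network layer, and X is assumed to have one row per detector and one column per window. The example peptide is the LLO sequence used for the tetramers in Fig. 5f.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_encoding(b, seed=0):
    """Assign each amino acid a random b-bit vector (a 20 x b binary matrix)."""
    rng = random.Random(seed)
    return {aa: [rng.randint(0, 1) for _ in range(b)] for aa in AMINO_ACIDS}

def scan(peptide, detectors, encoding, k):
    """Score each of the l - k + 1 windows with each of the d detectors."""
    encoded = [encoding[aa] for aa in peptide]
    n_windows = len(peptide) - k + 1
    X = []
    for det in detectors:  # det is a k x b weight matrix (hypothetical values)
        row = []
        for start in range(n_windows):
            window = encoded[start:start + k]
            row.append(sum(w * x
                           for w_row, x_row in zip(det, window)
                           for w, x in zip(w_row, x_row)))
        X.append(row)
    return X

def rectify_and_pool(X):
    """Rectify (clip negatives to 0), then max-pool each detector's row."""
    return [max(max(v, 0.0) for v in row) for row in X]

def score(peptide, detectors, encoding, k, out_weights, out_bias=0.0):
    """Map pooled features to a score f in (0, 1) with one logistic unit."""
    pooled = rectify_and_pool(scan(peptide, detectors, encoding, k))
    z = sum(w * p for w, p in zip(out_weights, pooled)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))

b, k, d = 5, 9, 3  # illustrative sizes, not the values used in the study
rng = random.Random(1)
encoding = random_encoding(b)
detectors = [[[rng.uniform(-1, 1) for _ in range(b)] for _ in range(k)]
             for _ in range(d)]
out_weights = [rng.uniform(-1, 1) for _ in range(d)]

peptide = "NEKYAQAYPNVS"  # l = 12, so l - k + 1 = 4 windows
X = scan(peptide, detectors, encoding, k)
f = score(peptide, detectors, encoding, k, out_weights)
```

Max pooling over windows makes the score independent of where the best-matching 9-mer sits within the peptide, which suits nested peptides of varying length.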

Given the abundance of MHCII-associated peptides derived from endolysosomal proteins, we queried these sequences to define additional features that may impart antigenicity. Using the high-resolution I-Ab-binding motif as a reference sequence (Fig. 1c), we scanned the endolysosomal proteins identified by proteomics for consensus. As expected, we could retrospectively predict peptides that we identified by mass spectrometry; however, we found many instances of consensus sites that were not presented on the MHCII. In the vast majority of these cases, consensus sites located within structured protein domains were not detected by proteomics, whereas those located in unstructured interdomain regions were (Supplementary Fig. 1a). For example, MHCII peptidomics identified one peptide derived from the murine cation-independent mannose-6-phosphate receptor (CIM6PR). This peptide is located between two CIM6PR domains. In contrast, we found three additional peptides conforming to the I-Ab-binding consensus motif that were located within CIM6PR domains and were not detected as processed peptides bound to the MHCII (Fig. 2c). Across the dataset, epitopes preferentially originated from interdomain regions of >20 amino acids or protein domains >30 amino acids (Supplementary Fig. 1b). Thus, epitope accessibility in the context of native protein structure is an important factor that impacts MHCII binding and protease processing.

Training a deep neural network on MHCII peptidomics data

Having demonstrated that MHCII peptidomics can identify antigenic features conferred by lysosomal processing, we leveraged this dataset to devise a neural network-based algorithm for epitope prediction. The concept of epitope accessibility, together with MHCII affinity, adds to the complexity of formulating predictions for immunodominant epitopes. Therefore, we explored the use of a deep neural network-based algorithm to predict these epitopes using peptidomics data as a training set. We developed a model, the bacteria originated T cell antigen (BOTA) predictor, which generates a list of candidate peptides using information including protein cellular location, transmembrane structure (if applicable), and domain distribution (Fig. 2d). BOTA requires only an annotated genome input to extract amino acid sequences from genes. It then relies on outputs from three algorithms: HMMTOP, PSORTb, and HMMER search against Pfam. First, BOTA identifies secreted and cell wall proteins29. Second, it masks the intracellular regions and transmembrane domains of cell wall proteins (with an eight-amino-acid flanking buffer)30. Third, it excludes regions that fall within small domains or between a series of adjacent domains (inaccessible, compact folding)31. The metric used for domain mapping includes the distance to the upstream/downstream domains, the density of flanking domains, and the domain size. Finally, candidate protein regions are piped into the deep neural network predictor to estimate their probabilities of being MHCII binders (Fig. 2d). We note that this pipeline is modular and can be used to generate allele-specific models in both human and animal models.
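The masking step can be illustrated with a small sketch. It assumes that the intracellular and transmembrane intervals have already been obtained (in BOTA they come from the upstream tools; here they are hand-written, hypothetical coordinates). Applying the eight-residue buffer to every masked interval and requiring at least nine residues per candidate region are assumptions of this sketch, as is the function name `candidate_regions`.

```python
def candidate_regions(seq_len, masked, buffer=8, min_len=9):
    """Return 0-based, half-open (start, end) regions that survive masking.

    masked  : list of (start, end) intervals to hide (e.g. transmembrane)
    buffer  : residues hidden on each side of every masked interval
    min_len : shortest region kept (9 matches the core length; an assumption)
    """
    keep = [True] * seq_len
    for start, end in masked:
        lo = max(0, start - buffer)
        hi = min(seq_len, end + buffer)
        for i in range(lo, hi):
            keep[i] = False

    regions, region_start = [], None
    # The trailing False closes a region that runs to the end of the protein.
    for i, flag in enumerate(keep + [False]):
        if flag and region_start is None:
            region_start = i
        elif not flag and region_start is not None:
            if i - region_start >= min_len:
                regions.append((region_start, i))
            region_start = None
    return regions

# Hypothetical 100-residue protein: intracellular 0-20, transmembrane 20-40.
regions_a = candidate_regions(100, [(0, 20), (20, 40)])
# Hypothetical 60-residue protein with one transmembrane helix at 30-50.
regions_b = candidate_regions(60, [(30, 50)])
print(regions_a, regions_b)
```

The two-residue tail left after the helix in the second protein is dropped because it is shorter than a 9-mer core, leaving only the N-terminal region as a candidate.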

The core algorithm of BOTA is a deep neural network, which employs a sparse representation of input peptides and multiple pretrained binder models to generate a feature map. The feature map includes many nonlinear summarizations of peptide features that are used to make predictions. The output from the deep neural network is a score (f) that serves as an indicator of MHCII binding. In the case of mouse I-Ab, we used MHCII peptidomics data to train the BOTA model. The model was trained 30 times with randomly sampled parameters under a threefold cross-validation scheme. The optimal parameter calibration was selected based on the receiver operating characteristic curves and areas under the curve (AUCs).
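The selection scheme (30 random parameter draws, each evaluated by threefold cross-validation and ranked by AUC) can be sketched as follows. This is a generic illustration, not the authors' training code: the stratified fold assignment, the use of the mean AUC across folds as the selection criterion, and the toy one-parameter scorer that stands in for the network are all assumptions.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, k, rng):
    """Deal shuffled indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def random_search(X, y, fit, sample_params, n_trials=30, k=3, seed=0):
    """Return (best_params, best_mean_auc) over n_trials random draws."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, rng)
    best_params, best_auc = None, -1.0
    for _ in range(n_trials):
        params = sample_params(rng)
        fold_aucs = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            predict = fit([X[i] for i in train], [y[i] for i in train], params)
            fold_aucs.append(auc([y[i] for i in held_out],
                                 [predict(X[i]) for i in held_out]))
        mean_auc = sum(fold_aucs) / k
        if mean_auc > best_auc:
            best_params, best_auc = params, mean_auc
    return best_params, best_auc

# Toy stand-in for the network: the score is weight * x, with no real learning.
def toy_fit(train_X, train_y, params):
    return lambda x: params["weight"] * x

def toy_sample(rng):
    return {"weight": rng.uniform(-1.0, 1.0)}

X = [i / 20 for i in range(20)]
y = [1 if x > 0.5 else 0 for x in X]
best_params, best_auc = random_search(X, y, toy_fit, toy_sample)
```

In the toy problem any positive weight ranks all positives above all negatives, so the search settles on a positive weight with a mean AUC of 1.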

The advantages of BOTA over current methods are that BOTA: (i) requires only the whole genome, not polypeptides, as input; (ii) considers epitope accessibility by focusing the search on extracellular regions of proteins and epitope location relative to protein domain organization; (iii) is trained on peptidomics data from antigen-presenting cells, which are significantly more accurate than in vitro peptide binding data; and (iv) easily scales to thousands of genomes for different alleles. Thus, BOTA is capable of predicting immunogenic epitopes from medically relevant bacterial pathogens.

Validation of BOTA epitope predictions for Listeria with MHCII peptidomics

To validate BOTA, we tested its performance in predicting MHCII-restricted epitopes from Listeria monocytogenes, one of the most common foodborne pathogens associated with a relatively high mortality rate32. Despite the epidemiological burden of listeriosis, the number of experimentally validated CD4 T cell epitopes encoded within the Listeria genome is limited. Thus, we sought to identify MHCII-associated peptides in BMDCs exposed to live Listeria for 10 min or 6 h before immunoprecipitation of mature peptide–MHCII complexes. Associated peptides from biological replicates of the 10 min and 6 h time points were analyzed using LC-MS/MS (Fig. 3a,b). These experiments discovered 48 unique peptides derived from exogenous Listeria proteins. Twenty-nine of these peptides represented nested sets derived from seven unique proteins (Supplementary Table 2). The second-most enriched peptide was the previously identified immunodominant epitope from listeriolysin O (LLO190–205), while the remaining peptides represented novel candidate antigens. Notably, all of the detected peptides were derived from secreted proteins or cell wall proteins. These observations highlight the importance of epitope accessibility for antigen presentation.

Fig. 3: Validation of BOTA epitope predictions with MHCII peptidomics.

a, Experimental workflow for immunopurification and sequencing of MHCII-associated peptides from murine dendritic cells. Peptide–MHCII complexes were immunopurified from WT cells after a 10-min or 6-h Listeria treatment. Associated peptides were then acid-eluted, labeled with iTRAQ 4-plex reagents, desalted with SCX and C18, and analyzed using high-resolution LC-MS/MS. b, MHCII-bound peptides (mouse and Listeria) detected before and after Listeria exposure. Biological replicates (rep) were compared based on the log2 FC between time 10 min and 6 h after exposure to bacteria. Each dot represents a unique peptide sequence. Peptides that were observed to be significantly upregulated or downregulated are shown in red, while peptide measurements that were not reproducible across both biological replicates are shown in cyan. Dot plot axes: log2 FC. Histogram axes: number of distinct peptides. n = 2 biologically independent samples per treatment in a single experiment. Reproducible replicates (95% limits of agreement of a Bland–Altman plot) were subjected to a moderated t-test to assess statistical significance. c, Predictions for Listeria epitopes were made using the deep neural network core of BOTA. d, The BOTA model pretraining accuracy plateaus after 200 epochs in cross-validation. The model was trained using the mouse peptides captured in BMDCs infected with Listeria (blue line); in contrast, the same model trained solely on IEDB data reached a plateau at approximately 70%, signifying a 15% gap in accuracy. e, Comparison of predictions for Listeria epitopes in proteins identified by proteomics. Peptides are split into categories based on the protein’s subcellular localization using PSORTb. f, BOTA was used to predict epitopes for the human inflammatory bowel disease risk allele (HLA-DRB1*01:03) that were annotated in IEDB as validated by MHCII binding assays or T cell reactivity assays. 
Full-length protein sequences associated with these epitopes were used as input for the BOTA upstream modules. The output of the BOTA upstream modules (trimmed protein sequences) was used as input for epitope prediction using NetMHCIIpan with HLA-DRB1*01:03 specificity. This approach yielded 76% accuracy (Venn diagram BOTA predictions overlapping with validated IEDB epitopes) for predicting IEDB validated epitopes, with 12 epitopes missed by BOTA. In contrast, NetMHCIIpan predictions on full-length proteins without using the BOTA upstream modules performed with 22% accuracy (Venn diagram NetMHCIIpan predictions overlapping with validated IEDB epitopes); 39 epitopes were missed by NetMHCIIpan.

We next compared Listeria peptides identified by MHCII peptidomics with BOTA predictions. Nine out of 17 BOTA-predicted epitopes were validated by MHCII peptidomics (Supplementary Table 2). Of the 35 unique Listeria peptides identified by proteomics with an adjusted P < 0.05, BOTA-predicted epitopes were present in 28. We compared the prediction accuracy of the BOTA model between training with two datasets: peptides we captured in murine dendritic cells (‘peptidomics training’); and MHCII-associated peptides from the Immune Epitope Database33 (‘IEDB training’) (Fig. 3c,d). Both BOTA pretraining accuracies plateaued after 200 epochs, but BOTA training with our peptidomics data increased prediction accuracy over training with IEDB data by 15%. Notably, BOTA showed strong improvement for mouse MHCII alleles compared to current prediction methods, including state-of-the-art NetMHCIIpan26 (Fig. 3e). To test BOTA predictions for human HLA alleles, we focused on the inflammatory bowel disease risk allele HLA-DRB1*01:0334. We first searched IEDB (see URLs) for HLA-DRB1*01:03-restricted epitopes validated by MHCII binding assays or T cell reactivity assays. We then used full-length protein sequences associated with these epitopes as input for the BOTA upstream modules (HMMTOP and Pfam) and used the output as input for epitope prediction using NetMHCIIpan with HLA-DRB1*01:03 specificity. This approach yielded 76% accuracy (38 out of 50) for predicting IEDB-validated epitopes, with 12 missed epitopes from BOTA (Fig. 3f). In contrast, NetMHCIIpan predictions on full-length proteins without using BOTA upstream modules performed with 22% accuracy (11 out of 50), while 39 epitopes were missed by NetMHCIIpan (Fig. 3f).
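The two accuracy figures follow directly from the counts given above: each is the fraction of the 50 IEDB-validated epitopes that a method recovered, with the remainder counted as missed. A short check of that arithmetic (the helper name `recovered_fraction` is hypothetical):

```python
def recovered_fraction(n_recovered, n_validated):
    """Fraction of validated epitopes recovered, plus the number missed."""
    if not 0 <= n_recovered <= n_validated or n_validated == 0:
        raise ValueError("counts must satisfy 0 <= recovered <= validated > 0")
    return n_recovered / n_validated, n_validated - n_recovered

bota_fraction, bota_missed = recovered_fraction(38, 50)
netmhc_fraction, netmhc_missed = recovered_fraction(11, 50)
print(f"BOTA modules + NetMHCIIpan: {bota_fraction:.0%}, missed {bota_missed}")
print(f"NetMHCIIpan alone: {netmhc_fraction:.0%}, missed {netmhc_missed}")
```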

The improvement in prediction accuracy achieved by BOTA signifies successful application of a deep neural network to a complex biomedical problem. Previous efforts using traditional neural networks or hidden Markov models were limited in their ability to extract highly abstract features, leading to insufficient insights into epitope prediction35. In contrast, BOTA was designed to predict CD4 T cell epitopes for virtually any MHCII allele and any antigen source, including commensal microbes, pathogens, autoantigens, and tumor antigens.

BOTA and MHCII peptidomics predict immunodominance of Listeria epitopes in vivo

Given that MHCII peptidomics and BOTA successfully identified Listeria peptides and the features associated with efficient antigen presentation, we sought to measure T cell responses to these epitopes in vivo. To quantify the CD4 T cell response to the top eight candidate Listeria epitopes, mice were killed 7 days after intraperitoneal infection with Listeria. T cells were restimulated in vitro with synthetic peptides, and interferon-γ (IFN-γ) responses were enumerated by enzyme-linked immunospot (ELISPOT) assay. Robust responses were detected against four of the eight candidate epitopes (lmo0202, lmo2558, lmo2185, lmo0135) (Fig. 4a). The remaining four epitopes elicited weaker responses near the limit of detection for ELISPOT, suggesting subdominance (Fig. 4b). Notably, the T cell response to these eight candidate epitopes correlated remarkably well with their abundances detected by MHCII peptidomics (Fig. 4c). To determine if the abundance of Listeria antigens drives immunogenicity, we evaluated previously published microarray data measuring RNA expression of Listeria genes after infection of macrophages36. Accordingly, Listeria antigen expression correlated modestly with our measurement of the corresponding T cell response in ELISPOT assays (Fig. 4d), although low expression did not appear to preclude antigenicity (Fig. 4e).

Fig. 4: BOTA and MHCII peptidomics accurately predict immunodominance in vivo.

a, Epitope mapping and domain structure of Listeria antigens indicate preferential presentation of surface-exposed and secreted proteins. b, Immunodominance of epitopes was determined by infecting mice with Listeria. At day 7, splenocytes were collected and restimulated with the indicated peptides for quantification of the T cell response by IFN-γ ELISPOT. Data represent the mean number of spots per 1 × 10^5 CD4 T cells ± s.d. for n = 6 mice. c–e, Immunodominance in vivo (IFN-γ ELISPOT data from Fig. 4b) correlates with the FC of Listeria peptides quantified by MHCII peptidomics (data from Fig. 3b), and to a lesser extent, with mRNA expression of the corresponding peptide-encoding genes in Listeria derived from infected macrophages (previously published microarray36).

Integration of T cell phenotype with T cell receptor (TCR) specificity in the Listeria response

Having demonstrated the utility of BOTA for predicting bacterial epitopes, we next sought to rigorously characterize the antigen-specific T cell response to Listeria. Previous approaches for defining immunodominance, such as the ELISPOT assay, rely on ranking candidate epitopes based on the magnitudes of the T cell responses they elicit in vivo. However, such approaches require prior knowledge of TCR specificity and T cell phenotype (cytokine profile). To address these limitations, we developed a technique for single-cell analysis of T cell phenotypes matched to their corresponding TCR sequences. This TCR sequencing (TCR-seq) approach allows for simultaneous enumeration of T cell clonal frequency and cytokine profiles (Supplementary Fig. 2).

Toward this end, we infected mice with Listeria and fluorescence-activated cell-sorted single CD4 T cells for transcriptomics and TCR-seq. Based on unbiased clustering of transcriptional profiles, T cells partitioned into four distinct subsets (Fig. 5a,b). To identify the activated effector T cell (Teff) cluster, we generated a per-cell Teff score that was based on the gene expression signature derived from ImmGen datasets comparing CD4 Teff splenocytes (day 8, after lymphocytic choriomeningitis virus infection) versus naïve CD4 T splenocytes37. These analyses identified cluster 2 as being enriched for Teff cells expressing signature activation genes (Ccl5, Nkg7, Ikzf1), genes that regulate endoplasmic reticulum homeostasis (Calr, Pdia3), and genes that control cellular metabolism (Acly, Akr1a1) (Fig. 5a). Among these T cells, the TCR repertoire was remarkably diverse with a maximal clonal frequency of ~3% for conventional T cells. The most abundant conventional TCR (Trav14-2Traj25|Trbv5Trbj2-7) was detected exclusively in cells residing in Teff cluster 2. By contrast, natural killer T cells expressing the invariant α chain Trav11Traj18 were found scattered throughout the clusters (Fig. 5c and Supplementary Table 3).
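Clonal frequency in this setting is the share of sequenced cells that carry a given paired α/β clonotype. A minimal sketch of that tally is shown below; the cell list is a toy example built around the dominant clonotype named above, with placeholder names for all other clones, and the function name `clonal_frequencies` is hypothetical.

```python
from collections import Counter

def clonal_frequencies(cells):
    """Return [(clonotype, frequency)] sorted from most to least abundant.

    cells: one (alpha_chain, beta_chain) tuple per sequenced cell.
    """
    if not cells:
        raise ValueError("no cells provided")
    counts = Counter(cells)
    total = len(cells)
    return [(clone, n / total) for clone, n in counts.most_common()]

# Toy repertoire of 100 cells: 3 share the dominant clonotype from the text,
# the other 97 each carry a unique placeholder clonotype.
dominant = ("Trav14-2Traj25", "Trbv5Trbj2-7")
cells = [dominant] * 3 + [(f"alpha_{i}", f"beta_{i}") for i in range(97)]

ranked = clonal_frequencies(cells)
top_clone, top_freq = ranked[0]
print(top_clone, f"{top_freq:.0%}")
```

Pairing the chains before counting matters: two cells that share an α chain but differ in β chain are treated as distinct clonotypes.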

Fig. 5: Single-cell RNA-seq integrates T cell phenotype with TCR repertoire in the Listeria response.

Mice were inoculated with Listeria by intraperitoneal injection on days 0 and 11. On day 18, FSChi CD4+ CD8− B220− MHCII− T cells were FACS-sorted from spleens for single-cell RNA-seq and TCR-seq. Sorted T cells were derived from two mice, sequenced separately, and combined for analysis. a, Violin plots displaying the Teff signature score derived from the ImmGen datasets. Each dot represents a single cell classified by clusters defined by t-distributed stochastic neighbor embedding (tSNE). n = 1,920 cells from 2 mice. The limits of the violin plot capture the minima and maxima. b, The tSNE plot derived from the T cell transcriptomes identifies distinct cell states that cluster according to unique signatures. Each dot represents a single cell that is color-coded according to gene signature. n = 1,920 cells from 2 mice. c, Circos plots of the linkages between the TCR α chain CDR3 and TCR β chain CDR3. The ribbons link two chains with thickness proportional to the number of corresponding TCR pairs observed. Dominant TCR clones are labeled according to TCR gene segment usage. d, Specifying antigen reactivity by screening TCRs for reactivity with Listeria epitopes predicted by BOTA. HEK 293T cells were transfected to express peptide epitopes fused in-frame with the I-Ab β chain bearing CD3ζ cytoplasmic domains. BW5147_CD4-28 cells were transduced to express chimeric single-chain TCRs bearing the transmembrane and cytoplasmic domains of CD3ζ. In this coculture system, cognate antigen recognition results in T cell activation characterized by the production of IL-2. e, The most abundant TCR identified in mice infected with Listeria (lmo_R6) was screened for reactivity against Listeria epitopes as described earlier. IL-2 was detected in culture supernatant by cytometric bead array. As controls, OT2 TCR (reactive with Ova) and LLO_118 TCR (reactive with LLO) were included. f, HEK 293T cells were transfected with constructs encoding single-chain TCRs. Cells were analyzed by FACS for the expression of TCRβ and binding to LLO I-Ab tetramers (NEKYAQAYPNVS I-Ab). Data represent a single experiment.

Using TCR-seq and whole-transcriptome data (Supplementary Table 4), we prioritized the most abundant TCRs to test for reactivity against BOTA predictions (Fig. 5c). In designing a screening modality, we took into consideration the low-affinity interaction between TCRs and peptide antigens presented on MHCII. This interaction occurs between multiple TCRs and peptide–MHCII complexes within lipid bilayers on adjacent cells. The avidity and kinetics of engagement/disengagement elicit a TCR signaling cascade that amplifies the input signal. Therefore, we designed a heterologous expression system in which TCR-negative BW5147-CD4-CD28 cells are transduced to express single-chain TCRα (scTCRα) and scTCRβ fused to the cytoplasmic tail of CD3ζ. Functional TCRαβ proteins in BW5147-CD4-CD28 cells engage cognate peptide–MHCII on antigen-presenting cells to initiate a TCR signaling response that results in the expression of interleukin-2 (IL-2). As a source of surrogate antigen-presenting cells, we used human embryonic kidney 293T (HEK 293T) cells transfected to express I-Abα-CD3ζ and candidate peptide antigens fused to I-Abβ-CD3ζ. For these experiments, we selected the most abundant TCR clone from TCR-seq to test for reactivity with four Listeria peptide–MHCII complexes predicted by BOTA (Fig. 5d). As a positive control, we demonstrated that the OT2 TCR expressed in BW5147-CD4-CD28 cells reacted with ovalbumin peptide in the context of I-Ab by inducing IL-2 secretion (Fig. 5e). Similarly, a previously identified TCR (LLO_118)38 reacted robustly with LLO (Fig. 5e). The top TCR candidate identified by TCR-seq also reacted with LLO, according to induction of IL-2 and binding to LLO-I-Ab tetramers (Fig. 5e,f and Supplementary Fig. 3). Taken together, these experiments establish the feasibility of integrating population-level TCR repertoires with antigen specificity by screening individual TCRs for reactivity against candidate antigens predicted in silico.

Identification of commensal epitopes

Having validated BOTA epitope predictions for a common pathogenic bacterium, we sought to explore the complex relationship between commensal microbes and host adaptive immunity. In this context, the intestinal microbiome fine-tunes inflammatory thresholds, primes innate immune effector function, and shapes the adaptive immune response through selection and tolerization of lymphocytes39. Thus, the dynamic role of the microbiome in immune education impacts organ systems throughout the body, and in turn, many disease states40. We reasoned that because the host adaptive immune system continuously interacts with the microbiome, monitoring the magnitude and nature of the T cell response to specific commensal antigens can reveal the health status of the immune system. In the state of health, the adaptive immune system maintains tolerance to local gut antigen exposure and protection from systemic infection. The sheer number of bacterial species and diversity of protein-coding elements in the microbiome represents an enormous search space. Toward identifying immunogenic commensals, we developed ‘serum immunoglobulin commensal capture and sequencing’ (SICC-seq). Mice were administered a course of dextran sodium sulfate (DSS) to induce barrier breach and allowed to recover for 7 days. Serum was then collected from mice and incubated with stool to opsonize commensals. Immunoglobulin G (IgG)-positive microbes were enriched by selection with Pierce Protein A/G magnetic beads (Thermo Fisher Scientific) and analyzed along with total stool by 16S ribosomal RNA (rRNA) sequencing (Supplementary Table 5). Using this approach, we demonstrated that induction of colitis with DSS, an epithelial injury model, elicits a systemic T cell–dependent IgG response that preferentially targets bacteria in the order Bacteroidales (Fig. 6a). 
In contrast, Akkermansia bacteria evade this response, likely due to their ability to induce T cell tolerance and IgA responses under homeostatic conditions (before colitis) (Fig. 6a). Having established the immunogenicity of Bacteroidales bacteria, we employed BOTA to predict T cell epitopes from the dominant species inhabiting mice41 (Fig. 6b) and identified a highly abundant epitope in a SusC-like protein that is conserved across the Bacteroidales order and often duplicated within species (Fig. 6b). To determine if T cells recognize this SusC epitope in vivo, we collected splenocytes from a naive mouse and stimulated them with SusC peptide in vitro. Importantly, the SusC peptide induced IL-10 production by T cells, indicating a homeostatic relationship between host T cells and Bacteroidales bacteria in a normal healthy mouse (Fig. 6b). We hypothesize that these interactions extend to many other commensals, and that tumultuous relationships between T cells and commensals typify immune pathologies and autoimmunity.

Fig. 6: Computational prediction and validation of a dominant commensal antigen.

a, Mice were administered DSS to induce colitis before analysis by SICC-seq. At day 14, serum was collected and incubated with stool to allow binding of IgG with commensals. IgG-positive and IgG-negative fractions were separated with magnetic beads covalently attached to Pierce Protein A/G. The immunogenicity of Bacteroidales bacteria was demonstrated by the IgG reactivity score (relative abundance in IgG-positive minus IgG-negative fractions) derived from 16S rRNA sequencing. b, BOTA identified SusC, a highly represented epitope within and across the Bacteroidales genome, including the murine commensal Muribaculum intestinale. Splenocytes from naive mice were collected and stimulated in vitro with SusC peptide or DMSO control. Cytokines were measured 24 h later by cytometric bead array. Data represent the mean cytokine concentration ± s.d. for n = 7 mice. **P < 0.0001 as determined by an unpaired, two-tailed Student’s t-test.


Host–microbe interactions cooperatively influence the specificity and diversity of the T cell response. A deeper understanding of this relationship requires developing new approaches for unbiased antigen discovery, defining features of antigenicity, and elucidating host pathways that underlie preferential selection of these features. Toward these objectives, we report a highly quantitative adaptation of MHCII peptidomics. By recapitulating the interaction between primary murine BMDCs and live L. monocytogenes, we identified four dominant CD4 T cell epitopes and established their immunodominance hierarchy in vivo. In addition, MHCII peptidomics identified over 3,600 autologous mouse peptides, which provided an exceptionally detailed view of lysosomal function.

As a consequence of deep profiling of the MHCII immunopeptidome, we generated a rich dataset for identifying features associated with antigen processing and developed BOTA as a predictive model, incorporating several important attributes of immunodominant epitopes revealed by proteomics. Based on observations from MHCII peptidomics, optimal epitopes tend to (i) derive from secreted proteins or extracellular regions of cell wall proteins in bacteria, (ii) have a primary sequence structure that conforms to a defined MHCII binding motif, (iii) be located more than eight amino acids away from transmembrane domains, and (iv) have tertiary structure characteristics that promote accessibility to the MHCII binding groove and the enzymes required for proteolytic processing. Importantly, all of the features identified by peptidomics are readily identifiable at the level of DNA sequence; therefore, the only input users need to run a BOTA immunodominance analysis is an annotated genome. Consequently, BOTA enables large-scale mapping of the immunodominant epitope landscape for any bacterial species or collection of species, such as the human microbiome. Taken together, we demonstrate the utility of MHCII peptidomics for training a deep neural network that specifically identifies candidate epitopes with key features associated with immunodominance and antigenicity. Furthermore, MHCII peptidomics serves as a powerful tool for unbiased discovery of complex pathogen antigens and for the interrogation of host pathways underlying human disease.

Inherent to antigen discovery is the significant challenge of validating epitope predictions. Conclusive validation of epitope immunogenicity requires demonstration that a measurable T cell response is elicited in vivo. To address this challenge, we developed an approach for identifying and integrating TCR repertoire, phenotype, and antigen reactivity. Coupling TCR-seq with whole-transcriptome profiling at the single-cell level enabled assignment of transcriptional phenotypes to individual TCRs. While we employed this approach to identify TCRs associated with CD4 Teff cells derived from Listeria-infected mice, it is applicable to any immune cell type that can be defined at the level of the transcriptome, including Treg or TH17. Such determinations are challenging in T cell hybridomas because the primary T cell phenotype is not preserved after immortalization. Moreover, generating T cell hybridomas is an inefficient process that is further biased by chromosome loss, drug selection, and screening for antigen reactivity. In contrast, TCR-seq is comparatively efficient, which is an important consideration in the context of limited T cell input from precious clinical specimens. A small tissue biopsy is sufficient to generate a permanent archive of TCR sequences matched with transcriptional profiles that can be used to screen defined TCRs for reactivity against candidate epitopes. Thus, immunodominance can be unambiguously defined in the context of epitopes that elicit the strongest T cell responses, as determined by clonal frequency and/or absolute numbers within a relevant tissue or organ.

Here, we develop a technology with broad utility for antigen discovery. BOTA is capable of predicting CD4 T cell epitopes from multiple sources, including genomes or transcriptomes derived from pathogens, the microbiome, allergens, tumors, and tissue biopsies (Supplementary Fig. 4). By coupling BOTA predictions with TCR-seq from matching specimens, we enable generation of DNA-encoded epitope and TCR libraries that can be screened to define antigen specificity and quantify immunodominance. The screening platform we developed can be implemented in arrayed format (well-based screening of individual TCRs by peptide–MHCII antigens) or in a pooled format where cells expressing TCR libraries are cocultured with cells expressing peptide–MHCII libraries. For example, we have performed TCR fingerprinting by coculturing stimulator cells expressing the OT2 TCR with responder cells displaying a library of ovalbumin peptide mutants expressed as a fusion protein with the MHCII ectodomain linked to the CD3ζ cytoplasmic domain. After fluorescence-activated cell sorting (FACS) of activated responder cells (4-1BB+) that had engaged in a productive interaction with stimulator cells, we sequenced the Ova-MHCIIζ libraries to recover the known cognate peptide ligand for the OT2 TCR (Supplementary Fig. 5). In this context, it is possible to simultaneously screen many epitopes for reactivity with a given TCR.

With new approaches to effectively predict T cell epitopes and validate TCR reactivity, it is possible to identify antigenic determinants in bacterial pathogens and even complex communities, such as the intestinal microbiome. We identified an I-Ab-restricted epitope in a SusC-like protein derived from Bacteroidales bacteria that is associated with IL-10-producing T cells derived from mouse spleen. Notably, these experiments identified systemically circulating T cells with reactivity to a benign commensal species. Our findings highlight the intimate relationship between commensals and the host adaptive immune system. In this context, previous studies in mouse models have demonstrated that a systemic T cell–dependent IgG response to commensals is cross-protective against pathogen infection42. Similarly, IgG reactivity with commensals has been shown to be widespread in healthy humans and qualitatively perturbed in the context of autoimmunity43,44. While our understanding of host–commensal mutualism is in its infancy, future research stands to reveal how this relationship promotes homeostatic immunoregulation or precipitates immune dysfunction. New approaches to antigen discovery offer new opportunities for biomedical research, such as tracking antigen-specific immune responses in clinical studies, vaccine design, and understanding the host–microbiome relationship in cancer, autoimmunity, and inflammatory disease.


Immune Epitope Database and Analysis Resource, www.iedb.org; UniProt, https://www.uniprot.org/; IMGT reference set, http://www.imgt.org/vquest/refseqh.html; Seurat package, http://www.satijalab.org/seurat.


Immunoaffinity purification of MHCII complexes

Bone marrow was collected from C57BL/6J (WT) and Atg16l1f/f × CD11c-Cre mice45. BMDCs were differentiated from bone marrow for seven days in antibiotic-free DMEM supplemented with 20% fetal bovine serum (FBS) and granulocyte-macrophage colony-stimulating factor (2% conditioned TOPO medium made in-house). TOPO cells expressing recombinant murine GM-CSF were a generous gift from H. Virgin (Washington University School of Medicine). All cell lines were tested monthly for Mycoplasma contamination. Functional testing and titration of GM-CSF conditioned media in BMDC differentiation cultures were performed as supportive authentication of cell line identity. In parallel, L. monocytogenes strain EGDe (ATCC) was cultured overnight in brain heart infusion medium, washed in PBS, and cocultured with BMDCs at a multiplicity of infection of approximately 100:1 for 30 min. Tissue culture dishes containing cocultures were washed in PBS and cultured for an additional 10 min or 6 h in dendritic cell media containing gentamicin (30 μg ml−1). BMDCs were then collected by scraping, washed in PBS, and lysed in 1% Tergitol-type NP-40 detergent, 4 mM MgCl2, 6 μg ml−1 DNase I from bovine pancreas (Sigma-Aldrich), and PBS pH 7.4 at 250 × 10^6 cells per 4 ml lysis buffer per sample. Clarified lysates were subjected to immunoprecipitation overnight at 4 °C with gentle rotation. Immunoprecipitation was performed with GE Healthcare NHS Mag Sepharose (Thermo Fisher Scientific) covalently coupled to anti-H2-IA (clone Y3P25). Each sample contained 130 μl packed beads and approximately 650 μg antibody. Beads were then washed twice in PBS containing 0.1% Tergitol-type NP-40 and three times in PBS.

Peptide–MHCII elution and desalting

Peptides were eluted from MHCII complexes and desalted on in-house-built Empore C18 StageTips (3M) as described previously46. Sample loading, washes, and elution were performed on a tabletop centrifuge at a maximum speed of 2,000–3,500g. Briefly, StageTips were equilibrated with 2 × 100 μl washes of methanol, 2 × 50 μl washes of 50% acetonitrile (ACN)/0.1% formic acid (FA), and 2 × 100 μl washes of 1% FA. In a tube, the dried beads from MHCII-associated peptide immunoprecipitation were thawed at 4 °C, reconstituted in 50 μl 3% ACN/5% FA, and loaded onto StageTips. The beads were washed with 50 μl 1% FA, and the peptides were further eluted using two rounds of 5-min incubations in 10% acetic acid. The combined wash and elution volumes were loaded onto StageTips. The tubes containing the immunoprecipitated beads were washed again with 50 μl 1% FA, and this volume was also loaded onto StageTips. Peptides were washed twice on the StageTips with 100 μl 1% FA. Peptides were eluted using a step gradient of 20 μl 20% ACN/0.1% FA, 20 μl 40% ACN/0.1% FA, and 20 μl 60% ACN/0.1% FA. Step elutions were combined and dried to completion.

iTRAQ 4-plex labeling for quantitative proteomics

Quantitative proteomics was performed as described previously47. Briefly, each peptide mixture was reconstituted in 20 μl dissolution buffer and labeled with 0.5 units (40 μl of 80 μl iTRAQ 4-plex reagent in ethanol) of iTRAQ 4-plex reagent for 1 h at room temperature (~22 °C). Excess reagent was quenched with 5 μl Tris hydrochloride for 30 min at room temperature. The iTRAQ 4-plexes were combined, dried to completion, and acidified by adding 1% FA (150 μl). Samples were desalted on in-house-built Empore SCX-C18 StageTips (3M) as described previously46. Peptides were eluted from the SCX-C18 StageTips with two pH cuts (5.5, 11); the remaining 20% ACN was diluted with 1% FA. Samples were then loaded onto in-house-built Empore C18 StageTips, desalted, and dried to completion as described earlier.

Peptide–MHCII sequencing by tandem mass spectrometry

All nano-LC-electrospray ionization-MS/MS analyses employed the same LC separation conditions described here. Samples were chromatographically separated using a Proxeon Easy-nLC 1000 (Thermo Fisher Scientific) fitted with a PicoFrit (New Objective) 75-μm inner diameter capillary with a 10-μm emitter, packed under pressure to ~20 cm with C18 Reprosil beads (1.9 μm particle size, 200 Å pore size; Dr. Maisch GmbH) and heated at 50 °C during separation. Samples were reconstituted in 9 μl 3% ACN/5% FA, and 3 μl (~100 × 10^6 cell equivalents) was injected for analysis. Peptides were eluted with a linear gradient from 7 to 30% of Buffer B (0.1% FA/90% ACN) over 82 min, 30–90% Buffer B over 6 min, and then held at 90% Buffer B for 15 min at 200 nl min−1 (Buffer A: 0.1% FA/3% ACN) to yield ~11 s peak widths. During data-dependent acquisition, eluted peptides were introduced into a Q Exactive HF Hybrid Quadrupole-Orbitrap Mass Spectrometer (Thermo Fisher Scientific) equipped with a nanoelectrospray source (James A. Hill Instrument Services) at 2.15 kV. A full-scan MS was acquired at a resolution of 60,000 from 300 to 1,800 m/z (AGC target: 1e6; 20 ms maximum ion time). Each full scan was followed by the top 15 data-dependent MS/MS scans at a resolution of 15,000, using an isolation width of 1.7 m/z with a 0.3 m/z offset, a fixed first mass at 100 m/z, a collision energy of 29 eV, an AGC target of 5e4, and a maximum injection time of 100 ms. An isolation offset of 0.3 m/z was used so that doubly charged precursor isotope distributions would be centered in the isolation window. Because some MHCII-associated peptides are short (< 15 amino acids), the monoisotopic peak is nearly always the tallest peak in the isotope cluster, and the MS acquisition software places the tallest isotopic peak in the center of the isolation window in the absence of a specified offset. Dynamic exclusion was enabled with a repeat count of 1 and an exclusion duration of 10 s. Charge state screening was enabled along with monoisotopic precursor selection using Peptide Match Preferred to prevent triggering of MS/MS on precursor ions with a charge state of 1, > 6, or unassigned.

Interpretation of MS/MS data

Mass spectra were interpreted using the Spectrum Mill software package version 5.1 pre-release (Agilent Technologies). MS/MS spectra were excluded from searching if they did not have a precursor MH+ in the range of 600–4,000, had a precursor charge > 5, or had fewer than 5 detected peaks. Merging of similar spectra with the same precursor m/z acquired in the same chromatographic peak was disabled. High-resolution MS/MS spectra were searched against a UniProt database containing reference proteome sequences (including isoforms and excluding fragments) from human and mouse (41,157 entries), with a set of common laboratory contaminant proteins (150 sequences) to yield a total of 41,307 redundant sequences. The sequences were downloaded from the UniProt website (see URLs) in April 2013.
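The exclusion criteria above can be expressed as a simple predicate. The following is a minimal sketch only: the `Spectrum` record and the function name are hypothetical and are not part of Spectrum Mill; only the three thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    """Hypothetical minimal record for one MS/MS spectrum."""
    precursor_mh: float  # precursor MH+ mass
    charge: int          # assigned precursor charge
    n_peaks: int         # number of detected fragment peaks


def passes_prefilter(s: Spectrum) -> bool:
    """Keep a spectrum only if it meets all three criteria stated in the text."""
    return (
        600.0 <= s.precursor_mh <= 4000.0  # precursor MH+ within 600-4,000
        and s.charge <= 5                  # exclude precursor charge > 5
        and s.n_peaks >= 5                 # exclude spectra with fewer than 5 peaks
    )
```

For example, `passes_prefilter(Spectrum(1500.0, 2, 40))` keeps the spectrum, whereas a spectrum with a precursor MH+ of 500 is discarded.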

Before both search rounds, all MS/MS had to pass the spectral quality filter with a sequence tag length > 3, that is, a minimum of four masses separated by the in-chain mass of an amino acid. In the first round search, all spectra were searched using a ‘no enzyme’ specificity, fixed modification of cysteine as unmodified, fixed modification of partial iTRAQ labeling, variable modifications (oxidized methionine, deamidation, N-terminal acetylation), a precursor mass tolerance of ±10 ppm, product mass tolerance of ± 20 ppm, and a minimum matched peak intensity of 50%. Peptide spectrum matches (PSMs) for individual spectra were automatically designated as confidently assigned using the Spectrum Mill autovalidation module to apply target–decoy based false discovery rate (FDR) estimation at the PSM level to set scoring threshold criteria. Peptide autovalidation was performed with an auto threshold strategy using a minimum sequence length of 7, automatic variable range precursor mass filtering, and score and delta rank1-rank2 score thresholds optimized across all LC-MS/MS runs. This yielded a PSM level FDR estimate for precursor charges 1 through 7 of <1.0% for each precursor charge state. In the second-round search, all remaining spectra that were not confidently identified in the first round were searched using these parameters against a Reference Sequence database containing L. monocytogenes strain EGDe reference protein sequences (2,867) downloaded in December 2014. An additional round of FDR thresholding as described earlier was applied to PSMs from the second-round search to estimate FDR by species (mouse versus L. monocytogenes EGDe). The combined PSMs from each round had a peptide level FDR < 2.0%. Only L. monocytogenes EGDe peptides that did not overlap with human peptides were reported.
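To illustrate the general idea behind target–decoy FDR thresholding at the PSM level, the sketch below picks the most permissive score cutoff whose estimated FDR stays within a limit. It is a generic illustration, not the Spectrum Mill autovalidation procedure: the function name, the input format, and the estimator (decoys divided by targets at or above the cutoff) are assumptions, and score ties are not handled specially.

```python
def score_threshold_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff with estimated FDR <= max_fdr, or None.

    psms: iterable of (score, is_decoy) pairs; higher scores are better.
    The FDR at a cutoff is estimated as decoys / targets among PSMs
    scoring at or above that cutoff (an assumed, commonly used estimator).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Record the cutoff whenever the running estimate is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            best_cutoff = score
    return best_cutoff
```

In practice the cutoff would be determined separately for each precursor charge state, as described above; the sketch handles a single group of PSMs.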

Listeria infection and ELISPOT assay

The L. monocytogenes EGDe strain was grown to log phase in brain heart infusion, washed in PBS, and used to inoculate mice by intraperitoneal injection with a dose of 1 × 10^4 CFU in 200 μl per mouse. Seven days after infection, spleens were collected, red blood cells were lysed, and single-cell suspensions were replated at 2.5 × 10^5 cells in 200 μl per well on 96-well ELISPOT plates (Merck Millipore) precoated with anti-IFN-γ (BD Biosciences) at a concentration of 5 μg ml−1 in PBS. Peptides were added at a concentration of 100 nM to restimulate splenocytes overnight. After stimulation, plates were washed, blocked, and incubated with biotinylated anti-IFN-γ at a concentration of 2 μg ml−1 and streptavidin-AKP (BD Biosciences) at a dilution of 1:1,000. After 60 min, plates were washed and developed with 3-amino-9-ethylcarbazole (Sigma-Aldrich). IFN-γ-secreting cells were scanned and enumerated using an ImmunoSpot S6 ENTRY Analyzer ELISPOT reader (Cellular Technology Limited). All samples were run in triplicate. The following peptides were synthesized by New England Peptide: lmo2185_293–312 ADFRYVFDTAKATAASSYPG; lmo0202_189–204 WNEKYAQAYPNVSAKI; lmo0135_150–169 VDDTTVKFTLPTVAPAFENT; lmo2558_533–553 APGQETQHYYGLPVADSAIDR; lmo2360_289–306 GGINQAYTGSTALSDGLN; lmo0186_285–303 GTKEKVVATPVSNVSTSSA; lmo0582a_26–46 STVVVEAGDTLWGIAQSKGTT; and lmo0582b_12–25 IAVTAFAAPTIASA. To calculate the frequency of CD4+ cells in each splenocyte sample from infected mice, an aliquot of cells from each mouse was stained for flow cytometric analysis; 1 × 10^6 cells were incubated with mouse BD Fc block, clone 2.4G2 (RUO; BD Biosciences) in PBS/FBS for 20 min at 4 °C. Cells were then washed and stained with fluorescein isothiocyanate-conjugated anti-mouse CD4 antibody (BioLegend) for 20 min at 4 °C. Fluorescently labeled cells were acquired on the FACSVerse 8 Color flow cytometer (BD Biosciences) and analyzed with the FlowJo analysis software (FlowJo LLC version 10.4.0).

Cytokine assays

Splenocytes from 8–12-week-old C57BL/6 mice were isolated and cultured in complete media (DMEM with 10% FBS, 1% L-glutamine, 2.5% sodium bicarbonate, and 1% penicillin-streptomycin) at a concentration of 1 × 10^7 cells ml−1. Cells from each well were pulsed with a final concentration of 10 μM of the SusC 14-mer peptide (VLKDASAAAIYGSR) or a vehicle control (dimethylsulfoxide) for 24 h at 37 °C. Supernatants were collected and cytokine production was measured by cytometric bead array (Flex Set; BD Biosciences) for IFN-γ (cat no. 558296), IL-17A (cat no. 560283), and IL-10 (cat no. 558300).

BOTA algorithm architecture

BOTA starts with an input genome and the associated genome annotation in GFF3 format. It first extracts the amino acid sequences of the protein-coding genes and then performs the following analysis: (1) domain identification using HMMScan function of HMMER version 3.1b248 against Pfam49; (2) cellular localization prediction using PSORTb version 3.0.229 with default settings; and (3) transmembrane topology prediction by HMMTOP30. This information is then integrated to generate a list of candidate peptides following three criteria: (1) the protein should be located in the outer membrane, cell wall (if applicable), or extracellular space; (2) if located in the outer membrane or cell wall, only the outfacing part of the protein will be considered; and (3) the peptide should present sufficient accessibility as decided by three rules: (a) it cannot be fewer than eight amino acids away from the anchoring domain in the cell wall or outer membrane; (b) it cannot be located in domains shorter than 30 amino acids because small domains are usually tightly folded; and (c) it cannot be flanked by two domains that are fewer than 20 amino acids apart. These criteria were based on observations about the vast majority of Listeria epitopes captured by MHCII peptidomics. These candidate peptides are then scored by the deep neural network as described in the next section. For each peptide classified as a candidate epitope, BOTA further validates it with the motif score, as defined previously, and a randomized score. For the motif score validation, it calculates the motif score of all the 9-mers within the peptide and requires the maximum to be larger than 5 × 10^−11; for the randomized score, BOTA shuffles the amino acids in the 9-mer and calculates the motif score for each of the randomized 9-mers. The motif score of the original 9-mer must rank in the top 30% of all randomized 9-mers.
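The two post hoc checks (minimum motif score and randomized score) can be sketched as follows. This is an illustration under stated assumptions rather than the BOTA implementation: `motif_score` is a caller-supplied stand-in for the motif score defined previously, the randomized score is applied to the best-scoring 9-mer, and the number of shuffles (`n_shuffles`) and the tie-handling rule are hypothetical choices, since the text does not specify them.

```python
import random


def best_9mer(peptide, motif_score):
    """Return the 9-mer window of the peptide with the highest motif score."""
    windows = [peptide[i:i + 9] for i in range(len(peptide) - 8)]
    return max(windows, key=motif_score)


def passes_motif_checks(peptide, motif_score, min_score=5e-11,
                        n_shuffles=1000, top_fraction=0.30, seed=0):
    """Apply the minimum-score check and the shuffled-9-mer rank check.

    motif_score: callable mapping a 9-mer string to a score (assumed).
    n_shuffles and seed are illustrative values, not taken from the text.
    """
    if len(peptide) < 9:
        return False
    core = best_9mer(peptide, motif_score)
    original = motif_score(core)
    if original <= min_score:  # maximum 9-mer score must exceed the cutoff
        return False
    rng = random.Random(seed)
    residues = list(core)
    better = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if motif_score("".join(residues)) > original:
            better += 1
    # The original 9-mer must rank within the top fraction of shuffled 9-mers.
    return better / n_shuffles <= top_fraction
```

Fixing the random seed makes the shuffling reproducible, which is convenient when the same genome is analyzed repeatedly.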

Deep neural network for MHCII binding prediction

We employed a deep neural network scheme to develop an MHCII binding prediction. In brief, we first encoded every amino acid into a b-dimensional binary vector, with half of its elements being 1 and the rest being 0, chosen at random. Therefore, given a peptide with length l longer than k, it is first converted to an l×b descriptor matrix S, in which Sij = 1 if the binary vector of the i-th amino acid has a 1 at index j, otherwise Sij = 0. The matrix S is then normalized by row sum to become S′ such that

$$S'_{ij} = \frac{S_{ij}}{\sum_{j'=1}^{b} S_{ij'}}$$

S′ is then convolved into an (l – k+1)×d matrix X, where d is the number of pretrained motif network models within the overall model and k is the length of such binding cores. Xij represents the score of motif network model j aligned to position i. The cohorts of motif network models are arranged in a d×k×b array H, with Hijk being the coefficient of the i-th motif network model at the j-th position of the binding core for the k-th element of the amino acid encoding (Fig. 2d). This sequence conversion and convolution setup is similar to the model developed by Alipanahi et al.50.

With the convoluted matrix X, we filter it with a max-rectified linear unit layer; then, the rectified matrix Y is fed into a max pooling stage to be transformed into d-dimensional vector Z, in which

$$z_j = \max\left(Y_{1j}, Y_{2j}, \ldots, Y_{nj}\right)$$

This d-dimensional vector Z is then used as input for a neural network with the maxout dropout model. Z is then used as the input for a standard output layer for the final prediction calculation.
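The encoding, normalization, convolution, rectification, and max pooling steps can be sketched in plain Python. This is a minimal illustration, not the BOTA code: the function names are hypothetical, the motif array `motifs` (shape d×k×b) is supplied by the caller rather than pretrained, and the maxout and output layers are omitted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_binary_codes(b, seed=0):
    """Assign each amino acid a b-dimensional binary code, half ones at random."""
    rng = random.Random(seed)
    codes = {}
    for aa in AMINO_ACIDS:
        ones = set(rng.sample(range(b), b // 2))
        codes[aa] = [1.0 if j in ones else 0.0 for j in range(b)]
    return codes


def encode(peptide, codes):
    """Build the l x b matrix S and normalize each row by its sum (S')."""
    rows = [codes[aa] for aa in peptide]
    return [[v / sum(row) for v in row] for row in rows]


def convolve(s_norm, motifs):
    """Slide each k x b motif along S' to obtain the (l - k + 1) x d matrix X."""
    k = len(motifs[0])
    b = len(s_norm[0])
    n = len(s_norm) - k + 1
    return [
        [sum(h[p][q] * s_norm[i + p][q] for p in range(k) for q in range(b))
         for h in motifs]
        for i in range(n)
    ]


def relu_maxpool(x):
    """Rectify X (Y = max(0, X)) and take the column-wise maximum (vector Z)."""
    d = len(x[0])
    return [max(max(0.0, row[j]) for row in x) for j in range(d)]
```

For a 12-residue peptide with k = 9 and d = 3 motif models, `convolve` returns a 4×3 matrix and `relu_maxpool` reduces it to a 3-element vector, one value per motif model.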

The goal is to minimize the prediction error as measured by the 1-norm distance of all the peptides. Back propagation with a stochastic gradient descent method using mini-batch size of 64 was used to reach the optimal weights. To train the weights, the mouse epitope peptides were used as a true positive training set; we also added in silico true negative peptides by randomly surveying an equal number of the peptides that are not part of the peptidomics data readout. We first randomly constructed 100 replicates of the threefold cross-validation datasets using the epitopes and the in silico true negative sequences. For each replicate, BOTA’s initial state weights were assigned randomly and then trained until performance plateaued (<0.1% improvement in ten iterations). The validation’s average AUC was used as a measure of model fitness; the one with the highest average AUC among the 100 replicates was chosen to generate the final optimal parameters by using all mouse peptides (Fig. 3c). Then, the optimal model constructed in the previous step was used to predict the unique Listeria peptides; NetMHCIIpan was used to predict the affinity of the same set of peptides with default settings.
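The replicate construction and stopping rule can be sketched as below. This is an illustration only: the function names, the fold assignment scheme, and the reading of the plateau rule (relative gain below 0.1% over the last ten iterations) are assumptions, and the network training itself is not shown.

```python
import random


def threefold_replicates(positives, negatives, n_replicates=100, seed=0):
    """Yield n_replicates random threefold partitions of the labeled peptides.

    positives: epitope peptides (label 1); negatives: in silico negative
    peptides (label 0). Each replicate is a list of three folds.
    """
    rng = random.Random(seed)
    labeled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    for _ in range(n_replicates):
        shuffled = labeled[:]
        rng.shuffle(shuffled)
        yield [shuffled[i::3] for i in range(3)]


def has_plateaued(history, window=10, min_rel_gain=0.001):
    """Return True when performance improved by < 0.1% over the last window.

    history: performance values per iteration, higher is better (assumed).
    """
    if len(history) <= window:
        return False
    old, new = history[-window - 1], history[-1]
    return (new - old) < min_rel_gain * abs(old)
```

Each replicate would be trained from random initial weights until `has_plateaued` returns True, and the replicate with the highest average validation AUC would be retained.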

TCR-seq and 5′ digital gene expression (DGE)

Single T cells from mice were sorted into 384-well plates containing capture buffer, UltraPure bovine serum albumin (0.5 mg ml−1; MCLAB) and 2 μM template switch oligo (TSO; 384 unique TSOs, 1 unique TSO per well; Integrated DNA Technologies). Oligo sequences can be found in Supplementary Table 6. A reverse transcription reaction was performed with a polydT oligo and Maxima Reverse Transcriptase (Thermo Fisher Scientific). Each unique TSO was composed of a 5′ Biotin, a truncated Illumina adaptor, a seven-base-pair unique molecular identifier, and three riboguanosines at the 3′ end. The 5′ end of the polydT oligo comprised a 5′ Biotin and a truncated Illumina adaptor. Following reverse transcription, the reactions were treated with Exonuclease I (NEB); then, in-well whole-transcriptome amplification (WTA) was performed with Herculase II (Agilent). The primers in the WTA targeted the two truncated Illumina adaptor sequences used in the TSO and in the polydT oligo from the reverse transcriptase reaction. After WTA, all 384 reactions were pooled and purified over a single Zymoclean Gel DNA Recovery column (Zymo Research Corporation). The volume of DNA was reduced by using Agencourt AMPure XP (Beckman Coulter) beads after column elution.

To capture the TCRα and TCRβ chain sequences, 0.5–1 ng of the WTA product was used in two separate PCR reactions (PCR1) using Herculase II. The TCRα reaction used a primer to target the TCRα constant region (CGGCACATTGATTTGGGAGT) and a primer to target the truncated Illumina adaptor in the TSO. The TCRβ reaction used a primer to target the TCRβ constant region (CTTGCCATTCACCCACCAGC) and a primer to target the truncated Illumina adaptor in the TSO. The DNA from TCRα PCR1 and TCRβ PCR1 reactions was isolated by using Agencourt AMPure XP beads and used separately in a second set of PCR reactions (PCR2). PCR2 further targeted the TCR regions of interest by nesting the TCR-specific primer within the region already captured. Additionally, PCR2 served to add Illumina adaptors for sequencing to the 5′ and 3′ ends of the targeted regions. The primer targeting the TCR for TCRα PCR2 reaction contained the sequence CACAGCAGGTTCTGGGTTCTGGATG with an Illumina P5 sequencing adaptor appending the 5′ end. The primer targeting the TCR for TCRβ PCR2 reaction contained CAAGGAGACCTTGGGTGGAGTCACA with an Illumina P5 sequencing adaptor appending the 5′ end. The second primer used in both PCR2 reactions targets the TSO-labeled end and includes an Illumina P7 sequencing adaptor and eight-base-pair sample barcode. These two reactions were then cleaned using Agencourt AMPure XP beads; the amplicons generated were selected and isolated by Zymoclean Gel DNA Recovery column after the PCR products were run on a 2% E-Gel EX agarose gel (Thermo Fisher Scientific). TCRα and TCRβ amplicons were pooled at equimolar concentrations for sequencing. A final concentration of 14 nM with 10% PhiX Control v3 (Illumina) was sequenced with a 600-cycle MiSeq kit (Illumina) by paired-end reads of 305 and 305 cycles and one adapter read.

To generate a 5′ DGE library, the WTA product (0.6–1 ng) was tagmented for 10 min using the Nextera XT Library Prep Kit (Illumina) followed by a 14-cycle PCR to add adapters and amplify the fragmented library. The 14-cycle PCR incorporates the provided Nextera XT i5 adapters (Illumina) and a custom universal primer that targets the TSO-labeled end and includes an Illumina P7 sequencing adapter and eight-base-pair sample barcode. The tagmentation reaction was cleaned with Agencourt AMPure XP beads and the library size was selected by Zymoclean Gel DNA Recovery column after the library was run on a 2% E-Gel EX agarose gel. The library, at a final concentration of 2 pM with 10% PhiX Control v3, was sequenced with a 75-cycle NextSeq 500/550 kit v2 (Illumina) by paired-end reads of 46 and 21 cycles and one adapter read.

The TCR-seq reads were first deconvoluted according to well identifier and unique molecular identifier (UMI). We required that at least six reads were detected for each UMI. For each UMI, we determined the consensus sequence at each nucleotide position for all reads associated with that UMI. Consensus sequences for each UMI were mapped by Basic Local Alignment Search Tool to the TCR database (IMGT reference set; see URLs). The best hit for each UMI-associated TCR consensus sequence (representing TCR V and J gene segments) was quantified based on read counts and tabulated in association with each distinct UMI on a per-well basis.
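The per-UMI consensus step can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name and input format are hypothetical, reads are assumed to be already aligned to a common start, and truncating to the shortest read is an assumption made to keep the example short.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_reads=6):
    """Return {(well, umi): consensus sequence} for UMIs with enough reads.

    reads: iterable of (well, umi, sequence) tuples. UMIs supported by
    fewer than min_reads reads are dropped, as described in the text.
    """
    grouped = defaultdict(list)
    for well, umi, seq in reads:
        grouped[(well, umi)].append(seq)

    consensus = {}
    for key, seqs in grouped.items():
        if len(seqs) < min_reads:
            continue
        length = min(len(s) for s in seqs)  # assumption: trim to shortest read
        consensus[key] = "".join(
            # Most frequent nucleotide at each position across all reads.
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus
```

The resulting consensus sequences would then be aligned to the IMGT reference set to assign V and J gene segments.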

The 5′ DGE RNA-seq data were analyzed using an in-house data analysis pipeline. Reads were aligned to the mm10 genome using bwa51. The downstream analysis was carried out using the Seurat package (see URLs). The dataset was filtered to remove genes that were expressed in < 10% of cells and cells that expressed transcripts mapping to fewer than 500 unique genes. Further, to remove doublet cells, cells that displayed more heterogeneity than the 90th quantile (1,846 unique genes) were filtered out. Highly variable, highly expressed genes (log(σ/μ) > 0.75 and log(μ) > 1.5) were identified from a mean-variance dispersion analysis. These genes were then used for clustering the single cells using the shared nearest neighbor (SNN)-Cliq algorithm, a graph-based method that maps the cells onto a k-nearest neighbor graph and then finds ‘cliques’ of cells expressing similar genes52,53.
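The cell and gene filters can be sketched in plain Python. This is an illustration rather than the Seurat-based pipeline: the function name and the dictionary input format are hypothetical, the gene filter is interpreted as a minimum fraction of cells, and applying the cell filter before the gene filter is an assumption, since the order is not stated.

```python
from collections import Counter


def filter_cells_and_genes(counts, min_genes=500, max_genes=1846,
                           min_cell_fraction=0.10):
    """Filter a {cell: {gene: count}} table using the thresholds in the text.

    Cells with fewer than min_genes detected genes are removed, as are
    putative doublets with more than max_genes. Genes detected in fewer
    than min_cell_fraction of the remaining cells are then removed.
    """
    kept_cells = {
        cell: genes for cell, genes in counts.items()
        if min_genes <= sum(1 for v in genes.values() if v > 0) <= max_genes
    }
    n_cells = len(kept_cells)
    detected = Counter(
        gene for genes in kept_cells.values()
        for gene, v in genes.items() if v > 0
    )
    kept_genes = {g for g, n in detected.items()
                  if n >= min_cell_fraction * n_cells}
    return {
        cell: {g: v for g, v in genes.items() if g in kept_genes}
        for cell, genes in kept_cells.items()
    }
```

The default thresholds mirror the values reported above; smaller values can be passed for testing on toy data.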

TCR screening

TCR-negative BW5147 cells were obtained from ATCC. The stable subline BW_4–28 was produced by the introduction of murine CD4 and CD28 by means of lentiviral transduction. The open reading frame (ORF)-encoding CD4-P2A-CD28 was synthesized (Integrated DNA Technologies) and cloned in place of spCas9-P2A-BlastR into pXPR_BRD101 (Genetic Perturbation Platform; Broad Institute) by Gibson assembly. The stable subline BW_B7/4 was produced by the introduction of a chimeric murine CD86 cytoplasmic domain fused in-frame with CD4 transmembrane and cytoplasmic domains by means of lentiviral transduction. The ORF-encoding CD86/CD4 was synthesized (Integrated DNA Technologies) and cloned in place of spCas9-P2A-BlastR into pXPR_BRD101 (Genetic Perturbation Platform; Broad Institute) by Gibson assembly.

The scTCRz_v3 vector was derived from pLX_TRC307 (Genetic Perturbation Platform; Broad Institute) by replacing the stuffer sequence with the following codon-optimized ORF: mTrbc1-hCD3zeta(transmembrane and cytoplasmic domains)-P2A-hIgKleader-mTrac-hCD3zeta(transmembrane and cytoplasmic domains). scTCRs were synthesized as gBlocks (Integrated DNA Technologies) comprising Trav-Traj-linker(3xG4SGGGG)-Trbv-Trbj and cloned by Gibson assembly into NheI and BsrGI sites upstream of TRBC1 in the scTCRz_v3 vector. The MHCIIz_v2 vector was derived from pLX_TRC307 (Genetic Perturbation Platform; Broad Institute) by replacing the stuffer sequence with the following sequence: H2Ab1leader-XmaI-linker(3xG4S)-H2Ab1(extracellular and transmembrane domains)-hCD3zeta(cytoplasmic domain)-P2A-H2Aa(extracellular and transmembrane domains)-hCD3zeta(cytoplasmic domain). Peptide epitope-encoding sequences were synthesized as ultramers (Integrated DNA Technologies) and cloned by Gibson assembly into the XmaI site in the MHCIIz_v2 vector. scTCRs were introduced into BW_B7/4 by lentiviral transduction and maintained under puromycin selection (3 μg ml−1). To define the antigen specificity of TCRs, BW_4–28 cells expressing scTCRs were cocultured with HEK 293T cells transfected with pMHCII and CD86/CD4. Cells were combined at a 1:1 ratio in 96-well flat-bottom plates (total 100,000 cells per well). Culture supernatants were collected after 18 h and IL-2 was measured by cytokine bead array (BD Biosciences).

Serum immunoglobulin commensal capture and sequencing

Mice were administered 2% DSS (MP Biomedicals) dissolved in drinking water for 7 days, while control mice were administered untreated drinking water. Mice receiving DSS were then given untreated drinking water for the next 7 days, while the control group remained on untreated water. Mice were killed on day 14. Blood was collected from each mouse by cardiac puncture. Stool was collected from each mouse by flushing the colon with PBS. Serum was collected from each blood sample and pooled according to experimental group (DSS treatment or water). The stool was homogenized in PBS, pooled according to experimental treatment group, and filtered through 70 μm filters. The remaining debris was removed from the stool suspension by centrifugation. Stool suspensions were incubated with serum overnight at 4 °C. After centrifugation, pellets isolated from the overnight incubations were incubated with Pierce Protein A/G magnetic beads and PBS/0.5% BSA for 2 h at 4 °C. The beads were isolated from the supernatant with a magnet and washed with PBS. DNA was isolated from bacteria bound to the beads using a QIAamp DNA mini kit (QIAGEN) and rLysozyme Solution (Merck Millipore), following the manufacturer’s instructions for the isolation of genomic DNA from Gram-positive bacteria. Amplification and 16S rRNA gene library preparation were performed as previously described54,55, and libraries were sequenced on the Illumina MiSeq platform.


Before infection, mice were bred in specific pathogen-free facilities at the Massachusetts General Hospital, Boston, MA, and transferred to biosafety level 2 housing when infected with L. monocytogenes. All animal studies were conducted in compliance with ethical regulations and were approved by the Institutional Animal Care and Use Committee at Massachusetts General Hospital. Atg16l1f/f mice were generated as previously described45 and bred with mice expressing Cre recombinase under the control of the CD11c promoter (CD11c-Cre) (The Jackson Laboratory). All mice were maintained on food and water ad libitum, used between 8 and 12 weeks of age, and age- and sex-matched for each experiment.

Statistical analysis

For data analysis, iTRAQ 4-plex ratios for the two biological replicates were filtered to retain only those deemed reproducible, as described previously56. Replicates were considered reproducible if they fell within the 95% limits of agreement of a Bland–Altman plot57. Reproducible replicates were then subjected to a moderated t-test to assess statistical significance58. To generate the I-Ab-binding motif, all mouse peptides were aligned with MultAlin59 and visualized with WebLogo. The frequency of each amino acid at each position of the 9-mer core was used to generate a position weight matrix. I-Ab binding scores for Listeria peptides were computed from this matrix as the product of the amino acid frequencies at each position of the 9-mer core (Supplementary Table 1). Epitope mapping and protein domain overlay were performed with the National Center for Biotechnology Information’s Conserved Domain Database31. Subcellular localization was predicted with PSORTb version 3.0.2 (ref. 29) for antigenic Listeria proteins and with COMPARTMENTS60 for mouse proteins. Transmembrane topology was predicted with HMMTOP30.
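
The position weight matrix scoring described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (build_frequency_matrix, score_core, best_core_score) are hypothetical, the cores are assumed to be pre-aligned 9-mers, no pseudocounts are applied (so a residue never observed at a position gives a score of zero), and the sliding-window choice of the best 9-mer within a longer peptide is an assumption rather than something stated in the Methods.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CORE_LENGTH = 9


def build_frequency_matrix(cores):
    """Return one dict per core position mapping amino acid -> observed frequency."""
    if not cores:
        raise ValueError("at least one 9-mer core is required")
    if any(len(core) != CORE_LENGTH for core in cores):
        raise ValueError("every core must be exactly 9 residues long")
    matrix = []
    for position in range(CORE_LENGTH):
        counts = Counter(core[position] for core in cores)
        matrix.append({aa: counts[aa] / len(cores) for aa in AMINO_ACIDS})
    return matrix


def score_core(core, matrix):
    """Score a 9-mer as the product of its position-specific frequencies."""
    if len(core) != CORE_LENGTH:
        raise ValueError("core must be exactly 9 residues long")
    score = 1.0
    for position, residue in enumerate(core):
        # Residues absent from the matrix (unseen or non-standard) contribute 0.
        score *= matrix[position].get(residue, 0.0)
    return score


def best_core_score(peptide, matrix):
    """Assumption: score every 9-mer window of a longer peptide and keep the best."""
    if len(peptide) < CORE_LENGTH:
        raise ValueError("peptide is shorter than the 9-mer core")
    windows = (peptide[i:i + CORE_LENGTH]
               for i in range(len(peptide) - CORE_LENGTH + 1))
    return max(score_core(window, matrix) for window in windows)
```

Because the score is a plain product of frequencies, it falls quickly with each disfavoured position. In practice one might sum log-frequencies with pseudocounts to avoid zeros and underflow, but that would be a departure from the scoring described here.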

Reporting summary

Further information on research design can be found in the Nature Research Reporting Summary linked to this article.

Code availability

All source code for BOTA is available at https://bitbucket.org/luo-chengwei/bota.

Data availability

Source data are available for Figs. 1, 2, 5, and 7 and can be found in the Supplementary Information. There are no restrictions on source data availability. Data for Fig. 7 can be accessed through GEO accession GSE117166.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


  1. Babbitt, B. P., Allen, P. M., Matsueda, G., Haber, E. & Unanue, E. R. Binding of immunogenic peptides to Ia histocompatibility molecules. Nature 317, 359–361 (1985).
  2. Stern, L. J. et al. Crystal structure of the human class II MHC protein HLA-DR1 complexed with an influenza virus peptide. Nature 368, 215–221 (1994).
  3. Kim, A. & Sadegh-Nasseri, S. Determinants of immunodominance for CD4 T cells. Curr. Opin. Immunol. 34, 9–15 (2015).
  4. Arunachalam, B., Phan, U. T., Geuze, H. J. & Cresswell, P. Enzymatic reduction of disulfide bonds in lysosomes: characterization of a gamma-interferon-inducible lysosomal thiol reductase (GILT). Proc. Natl Acad. Sci. USA 97, 745–750 (2000).
  5. Hsieh, C. S., deRoos, P., Honey, K., Beers, C. & Rudensky, A. Y. A role for cathepsin L and cathepsin S in peptide generation for MHC class II presentation. J. Immunol. 168, 2618–2625 (2002).
  6. Hsing, L. C. & Rudensky, A. Y. The lysosomal cysteine proteases in MHC class II antigen presentation. Immunol. Rev. 207, 229–241 (2005).
  7. Miyazaki, T. et al. Mice lacking H2-M complexes, enigmatic elements of the MHC class II peptide-loading pathway. Cell 84, 531–541 (1996).
  8. Schulze, M. S. & Wucherpfennig, K. W. The mechanism of HLA-DM induced peptide exchange in the MHC class II antigen presentation pathway. Curr. Opin. Immunol. 24, 105–111 (2012).
  9. Rudensky, A. Y., Preston-Hurlburt, P., Hong, S. C., Barlow, A. & Janeway, C. A. Jr. Sequence analysis of peptides bound to MHC class II molecules. Nature 353, 622–627 (1991).
  10. Hunt, D. F. et al. Peptides presented to the immune system by the murine class II major histocompatibility complex molecule I-Ad. Science 256, 1817–1820 (1992).
  11. Chicz, R. M. et al. Predominant naturally processed peptides bound to HLA-DR1 are derived from MHC-related molecules and are heterogeneous in size. Nature 358, 764–768 (1992).
  12. Chicz, R. M. et al. Specificity and promiscuity among naturally processed peptides bound to HLA-DR alleles. J. Exp. Med. 178, 27–47 (1993).
  13. Sette, A. et al. Invariant chain peptides in most HLA-DR molecules of an antigen-processing mutant. Science 258, 1801–1804 (1992).
  14. Lippolis, J. D. et al. Analysis of MHC class II antigen processing by quantitation of peptides that constitute nested sets. J. Immunol. 169, 5089–5097 (2002).
  15. Sofron, A., Ritz, D., Neri, D. & Fugmann, T. High-resolution analysis of the murine MHC class II immunopeptidome. Eur. J. Immunol. 46, 319–328 (2016).
  16. Mommen, G. P. et al. Sampling from the proteome to the human leukocyte antigen-DR (HLA-DR) ligandome proceeds via high specificity. Mol. Cell. Proteomics 15, 1412–1423 (2016).
  17. Dongre, A. R. et al. In vivo MHC class II presentation of cytosolic proteins revealed by rapid automated tandem mass spectrometry and functional analyses. Eur. J. Immunol. 31, 1485–1494 (2001).
  18. Depontieu, F. R. et al. Identification of tumor-associated, MHC class II-restricted phosphopeptides as targets for immunotherapy. Proc. Natl Acad. Sci. USA 106, 12073–12078 (2009).
  19. Suri, A., Walters, J. J., Rohrs, H. W., Gross, M. L. & Unanue, E. R. First signature of islet beta-cell-derived naturally processed peptides selected by diabetogenic class II MHC molecules. J. Immunol. 180, 3849–3856 (2008).
  20. Seamons, A. et al. Competition between two MHC binding registers in a single peptide processed from myelin basic protein influences tolerance and susceptibility to autoimmunity. J. Exp. Med. 197, 1391–1397 (2003).
  21. Nelson, C. A., Roof, R. W., McCourt, D. W. & Unanue, E. R. Identification of the naturally processed form of hen egg white lysozyme bound to the murine major histocompatibility complex class II molecule I-Ak. Proc. Natl Acad. Sci. USA 89, 7380–7383 (1992).
  22. Brandwein, S. L. et al. Spontaneously colitic C3H/HeJBir mice demonstrate selective antibody reactivity to antigens of the enteric bacterial flora. J. Immunol. 159, 44–52 (1997).
  23. Lodes, M. J. et al. Bacterial flagellin is a dominant antigen in Crohn disease. J. Clin. Invest. 113, 1296–1306 (2004).
  24. Cong, Y., Feng, T., Fujihashi, K., Schoeb, T. R. & Elson, C. O. A dominant, coordinated T regulatory cell-IgA response to the intestinal microbiota. Proc. Natl Acad. Sci. USA 106, 19256–19261 (2009).
  25. Janeway, C. A. Jr et al. Monoclonal antibodies specific for Ia glycoproteins raised by immunization with activated T cells: possible role of T cell-bound Ia antigens as targets of immunoregulatory T cells. J. Immunol. 132, 662–667 (1984).
  26. Andreatta, M., Schafer-Nielsen, C., Lund, O., Buus, S. & Nielsen, M. NNAlign: a web-based prediction method allowing non-expert end-user discovery of sequence motifs in quantitative peptide data. PLoS One 6, e26781 (2011).
  27. Zhu, Y., Rudensky, A. Y., Corper, A. L., Teyton, L. & Wilson, I. A. Crystal structure of MHC class II I-Ab in complex with a human CLIP peptide: prediction of an I-Ab peptide-binding motif. J. Mol. Biol. 326, 1157–1174 (2003).
  28. Liu, X. et al. Alternate interactions define the binding of peptides to the MHC molecule IA(b). Proc. Natl Acad. Sci. USA 99, 8820–8825 (2002).
  29. Yu, N. Y. et al. PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics 26, 1608–1615 (2010).
  30. Tusnády, G. E. & Simon, I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol. 283, 489–506 (1998).
  31. Marchler-Bauer, A. et al. CDD: NCBI’s conserved domain database. Nucleic Acids Res. 43, D222–D226 (2015).
  32. Scallan, E. et al. Foodborne illness acquired in the United States: major pathogens. Emerg. Infect. Dis. 17, 7–15 (2011).
  33. Vita, R. et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 43, D405–D412 (2015).
  34. Goyette, P. et al. High-density mapping of the MHC identifies a shared role for HLA-DRB1*01:03 in inflammatory bowel diseases and heterozygous advantage in ulcerative colitis. Nat. Genet. 47, 172–179 (2015).
  35. Wang, P. et al. A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach. PLoS Comput. Biol. 4, e1000048 (2008).
  36. Chatterjee, S. S. et al. Intracellular gene expression profile of Listeria monocytogenes. Infect. Immun. 74, 1323–1338 (2006).
  37. Heng, T. S. et al. The Immunological Genome Project: networks of gene expression in immune cells. Nat. Immunol. 9, 1091–1094 (2008).
  38. Weber, K. S. et al. Distinct CD4+ helper T cells involved in primary and secondary responses to infection. Proc. Natl Acad. Sci. USA 109, 9511–9516 (2012).
  39. Palm, N. W., de Zoete, M. R. & Flavell, R. A. Immune-microbiota interactions in health and disease. Clin. Immunol. 159, 122–127 (2015).
  40. Hall, A. B., Tolonen, A. C. & Xavier, R. J. Human genetic variation and the gut microbiome in disease. Nat. Rev. Genet. 18, 690–699 (2017).
  41. Ormerod, K. L. et al. Genomic characterization of the uncultured Bacteroidales family S24-7 inhabiting the guts of homeothermic animals. Microbiome 4, 36 (2016).
  42. Zeng, M. Y. et al. Gut microbiota-induced immunoglobulin G controls systemic infection by symbiotic bacteria and pathogens. Immunity 44, 647–658 (2016).
  43. Christmann, B. S. et al. Human seroreactivity to gut microbiota antigens. J. Allergy Clin. Immunol. 136, 1378–1386.e1–5 (2015).
  44. Stoll, M. L. et al. Altered microbiota associated with abnormal humoral immune responses to commensal organisms in enthesitis-related arthritis. Arthritis Res. Ther. 16, 486 (2014).
  45. Conway, K. L. et al. Atg16l1 is required for autophagy in intestinal epithelial cells and protection of mice from Salmonella infection. Gastroenterology 145, 1347–1357 (2013).
  46. Rappsilber, J., Mann, M. & Ishihama, Y. Protocol for micro-purification, enrichment, pre-fractionation and storage of peptides for proteomics using StageTips. Nat. Protoc. 2, 1896–1906 (2007).
  47. Mertins, P. et al. Ischemia in tumors induces early and sustained phosphorylation changes in stress kinase pathways but does not affect global protein levels. Mol. Cell. Proteomics 13, 1690–1704 (2014).
  48. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
  49. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
  50. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
  51. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  52. McDavid, A. et al. Modeling bi-modality improves characterization of cell cycle on gene expression in single cells. PLoS Comput. Biol. 10, e1003696 (2014).
  53. Xu, C. & Su, Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31, 1974–1980 (2015).
  54. Caporaso, J. G. et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl Acad. Sci. USA 108 (Suppl 1), 4516–4522 (2011).
  55. Caporaso, J. G. et al. Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J. 6, 1621–1624 (2012).
  56. Krönke, J. et al. Lenalidomide induces ubiquitination and degradation of CK1α in del(5q) MDS. Nature 523, 183–188 (2015).
  57. Bland, J. M. & Altman, D. G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1, 307–310 (1986).
  58. Smyth, G. K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, Article3 (2004).
  59. Corpet, F. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881–10890 (1988).
  60. Binder, J. X. et al. COMPARTMENTS: unification and visualization of protein subcellular localization evidence. Database (Oxford) 2014, bau012 (2014).



We thank H. Vlamakis, T. Reimels, and I. Latorre for scientific input, J. Gracias for technical assistance, and P. Rogers for the FACS work. This work was supported by funding from The Leona M. and Harry B. Helmsley Charitable Trust, National Institutes of Health grants DK043351, AI109725, AT009708, and DK092405, and the Juvenile Diabetes Research Fund to R.J.X.

Author information

Author notes

  1. These authors contributed equally to this work: Daniel B. Graham, Chengwei Luo.


  1. Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA

    Daniel B. Graham, Chengwei Luo, Daniel J. O’Connell, Ariel Lefkovith, Eric M. Brown, Moran Yassour, Mukund Varma, Jennifer G. Abelin, Guadalupe J. Jasso, Caline G. Matar, Steven A. Carr & Ramnik J. Xavier

  2. Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

    Daniel B. Graham, Chengwei Luo, Kara L. Conway & Ramnik J. Xavier

  3. Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

    Daniel B. Graham & Ramnik J. Xavier

  4. Center for Microbiome Informatics and Therapeutics, Massachusetts Institute of Technology, Cambridge, MA, USA

    Daniel B. Graham & Ramnik J. Xavier

  5. Center for Computational and Integrative Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

    Chengwei Luo, Kara L. Conway & Ramnik J. Xavier

  6. Immunology Program, Harvard Medical School, Boston, MA, USA

    Guadalupe J. Jasso




D.B.G., C.L., and R.J.X. conceptualized the study. D.B.G., C.L., J.G.A., K.L.C., and S.A.C. constructed the study methodology. C.L. and M.Y. managed the software used in the study. C.L., M.V., and J.G.A. undertook the formal analysis of the data. D.B.G., J.G.A., C.G.M., A.L., G.J.J., E.M.B., D.J.O., and K.L.C. undertook the investigation. S.A.C. managed the resources. D.B.G. wrote the original manuscript draft. D.B.G. and R.J.X. supervised the study. R.J.X. acquired the funding for the study.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Daniel B. Graham or Ramnik J. Xavier.

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–5 and Supplementary Table 2

  2. Reporting Summary

  3. Supplementary Table 1

    MHCII peptidomics

  4. Supplementary Table 3

    TCR pairing from Listeria-infected mice

  5. Supplementary Table 4

    Single-cell RNA-seq in T cells from Listeria-infected mice

  6. Supplementary Table 5

    16S rRNA sequencing from SICC-seq

  7. Supplementary Table 6

    TCR-seq and 5ʹ-DGE oligonucleotides
