Deep sequencing of blood and gut T-cell receptor β-chains reveals gluten-induced immune signatures in celiac disease

Celiac disease (CD) patients mount an abnormal immune response to gluten. T-cell receptor (TCR) repertoires directed to some immunodominant gluten peptides have previously been described, but the global immune response to in vivo gluten exposure in CD has not been systematically investigated yet. Here, we characterized signatures associated with gluten directed immune activity and identified gluten-induced T-cell clonotypes from total blood and gut TCR repertoires in an unbiased manner using immunosequencing. CD patient total TCR repertoires showed increased overlap and substantially altered TRBV-gene usage in both blood and gut samples, and increased diversity in the gut during gluten exposure. Using differential abundance analysis, we identified gluten-induced clonotypes in each patient that were composed of a large private and an important public component. Hierarchical clustering of public clonotypes associated with dietary gluten exposure identified subsets of highly similar clonotypes, the most proliferative of which showing significant enrichment for the motif ASS[LF]R[SW][TD][DT][TE][QA][YF] in PBMC repertoires. These results show that CD-associated clonotypes can be identified and that common gluten associated immune response features can be characterized in vivo from total repertoires, with potential use in disease stratification and monitoring.

The differential abundance (DA) analysis was done on clonotypes that are detected in at least two different individuals across all three repertoire types. There were 11834 such clonotypes (which we refer to as public clonotypes) out of the 251259 sequenced in all three experiments. For DA analysis, we used subsets of the data from the clonotype abundance matrix, where the rows are the public clonotypes and the columns are selected samples of interest for DA analysis depending on the focus of our analysis (For instance, for DA analysis on PBMC repertoires, we select only the CD PBMC samples).

Data transformation and normalization
TCRB Repertoires show high range in clonal abundance values with few clonotypes having high frequency and majority having low frequency, and abundance dependent frequency variation. To improve differential abundance detection and for better interpretation, we add a pseudo-count of one to all clonotype abundances and log-transform the selected data matrix to base 2. This data is then normalized using quantile normalization.

Differential abundance testing
Once we had obtain the normalized data matrix, we perform two class rank product analyses using RankProd 1 with P-values < 0.01 considered significant. The Rank product method identifies differentially abundant features (clonotypes in our data) between groups of samples using a rank product (RP) statistic for fold changes observed in replicate experiments. Briefly, it ranks clonotypes according to their fold change between the two groups/conditions. It computes a rank product for each clonotype (the geometric mean of the ranks for the clonotype). It is unlikely that a clonotype has a top ranking fold change consistently in replicate experiments by chance. Using a permutation test, the method estimates the probability of the RP happening by chance for each clonotype. Clonotypes with smallest RP, that result from consistent top ranking in replicate experiments unlikely to be obtained by chance, are then more likely to be enriched or deriched clonotypes of interest (depending on the ranking order of the fold changes). We selected clonotypes with p-values < 0.01 for further filtering for differentially abundance (DA).

Differential abundant clonotype filtering
Due to the combined effects of very high noise in the data pertaining to the nature of the data itself, low number of samples and high number of tests, we did not reach significance after multiple testing corrections in the Rank Product analyses. Thus, to help reduce possible false positives in our Rank product results, we performed filtering of the clonotypes with pvalues < 0.01 using sequential forward feature selection (from the most significant to the least). In this step, we selected the first n clonotypes with a mean "leave-one-out" cross validation error of ≤ 0.25 when used in a random forest classification model for prediction of repertoire's gluten exposure status. This was performed for the list of clonotypes that had pvalues < 0.01 from the Rank product enrichment analysis, and separately for those with pvalues < 0.01 from the derichment analysis. Clonotypes that passed this filtering were considered differentially abundant (DA) as a result of gluten exposure.

Identifying DA clonotypes of interest
Further filtering of the DA clonotypes was done by hierarchical clustering to identify the most interesting DA clonotypes. Hierarchical clustering of differentially abundant (DA) clonotypes was performed using a modified Levenshtein distance (LD) measure to estimate pairwise clonotype similarity by taking into account differences in sequence length and physicochemical properties. The physicochemical properties were evaluated using hydrophobicity according to the Kyte and Doolittle scale 3 , acidity according to the isoelectric point(PI) and molecular mass(Da) of the amino acids constituting the clonotype CDR3 sequences. For each amino acid clonotype, we used the mean property (i.e acidity, hydrophobicity or molecular mass) of the amino acids making up the clonotype to represent the clonotype's overall propensity for that property. Similarity between two clonotype AA sequences was then estimated using LD, modified by their difference in length, mean acidity, hydrophobicity, and weight as follows: The pairwise similarity matrix using this measure was then hierarchically clustered; the dendrogram was cut dynamically 4 and the resulting clusters were assessed for average fold change values. The cluster containing clonotypes with the highest average fold change was considered the 'top cluster'. Clonotypes in the top cluster may be considered the most likely candidates driving the immune response to gluten. V gene usage was assessed for DA clonotypes. MEME 5 was used to search for over-represented sequence motifs for clonotypes in each cluster. .

4
Supplementary Tables: Table S1. Summary of TCRB CDR3 sequences obtained from samples. The columns represent the three sub-projects (1=whole PBMC TCR repertoires, 2=sorted CD4+IFNg+ cell TCR repertoires, 3=gut biopsy TCR repertoires). Standard deviations are given in parentheses.  Table S2. The number of differentially abundant clonotypes for each subject estimating the number of T cells dynamic upon gluten intake. These represent putative gluten-induced clonotypes when comparison was made between total repertoires of a subject before and after treatment or putative-gluten specific clonotypes when comparison was made between CD4+IFNg+ sorted and unsorted repertoires. a Results from public DA analysis (approach two) report public clonotypes that show significant differential abundance across individuals on average (not necessarily in every individual) and enables detection of clonotypes that appear in one condition (see Figure 4), whereas DA analysis performed between the two repertoires from each individual imposes the criteria that clonotypes must be shared in the two repertoires (i.e., it misses clonotypes that were absent in either gluten exposed or unexposed conditions). Thus, only a subset of the Public DA clones was available for DA analysis in each individual (since all are not shared between two repertoires of each individual). In addition, some of those that are in both repertoires may not necessarily be differentially abundant in the particular individual. Figure S1. Differential abundance analysis work flow: the work flow for differential abundance analysis of public clonotypes before vs after gluten exposure. The differentially abundant (DA) public clonotypes can further be filtered by selecting the cluster with the highest average fold change after hierarchical clustering of the DA clonotypes Perform rank-based comparison of clonotype frequency/abundance across sample conditions (using two class Rank Product analysis): P-value < 0.01

Supplementary Figures:
Filter out possible false positives (using Sequential forward feature selection): · Build random forest models for repertoire condition classification (gluten exposed versus unexposed) using selected clonotypes (step-wise inclusion of clonotypes from the most significant to least) · Leave one out cross validation(CV) to estimate accuracy of each model · Cutoff : last clonotype when CV Error ≥ 0.25; the clonotypes that together identify repertoire conditions with ≥ 75% accuracy are DA · Hierarchical clustering of DA clonotypes; selecting Top cluster · Overlap with known gluten reactive clonotypes · Motif finding (using MEME) Differential Abundance Analysis (DA)

Condition 1 Condition 2
Data preprocessing · Filter out clones not seen in at least two different individuals · Add 1 to all clonotype frequency counts · Log-Transform data · Normalize data (Quantile) . Overlap between DA clonotypes in each individual: the number of overlapping enriched clonotypes between patients was low in A) CDPBMC repertoires B) CDGUT repertoires. C) CD005 and CD039 did not share any of their significantly enriched glutenspecific clones in their sorted CD4+IFNg+ repertoires. D) There was also low overlap between public enriched clonotypes detected in patient PBMC and GUT. E & F) There was low overlap of DA clones detected from DA analysis of unsorted repertoires (post-gluten challenge vs pre-challenge), and those detected from sorted vs unsorted repertoire comparison (sorted vs post-challenge, sorted vs pre-challenge) for subjects CD005 and CD039.

CD005Wst
CD039Wst CD PBMC HC PBMC CD GUT FIGURE S5. Public differentially de-enriched CD PBMC clonotypes: Red color intensity indicates prevalence level; clonotypes are hierarchically clustered using a modified Levenshtein distance that took physicochemical properties into account. The cluster indicated with the green bar has the highest average fold change decrease during gluten exposure. The donut plot on the right shows the TCRBV genes used by all DA clonotypes in CD patient pre-challenge from the least to the most used. Sample names with "CD", CD patient; d0, day 0; d6 day 6; W, wheat challenged PBMC sample. FIGURE S6. Public differentially enriched GUT clonotypes: Red color intensity indicates prevalence level; clonotypes are hierarchically clustered using a modified Levenshtein distance that took physicochemical properties into account. The cluster indicated with the green bar has the highest average fold change increase during gluten exposure. The donut plot on the right shows the TCRBV genes used by all DA clonotypes in CD patient during active disease from the least to the most used. Healthy control samples are included for comparison. Sample names with GBgfd, gut biopsy samples after GFD treatment; GBact, gut biopsy samples from active disease; CD, CD patient FIGURE S7. Public differentially enriched healthy control PBMC clonotypes: Red color intensity indicates prevalence level; clonotypes are hierarchically clustered using a modified Levenshtein distance that took physicochemical properties into account. The cluster indicated with the green bar has the highest average fold change increase during gluten exposure. The donut plot on the right shows the TCRBV genes used by all DA clonotypes in the two healthy controls following oral gluten challenge. Every other clonotype is shown on the second column for better readability.
FIGURE S8. Public differentially de-enriched healthy control PBMC clonotypes: Red color intensity indicates prevalence level; clonotypes are hierarchically clustered using a modified Levenshtein distance that took physicochemical properties into account. The cluster indicated with the green bar has the highest average fold change decrease during gluten exposure. The donut plot on the right shows the TCRBV genes used by all DA clonotypes in the two healthy controls following oral pre-gluten challenge. Every other clonotype is shown on the second column for better readability. FIGURE S10. Nucleotide to amino acid ratio to compare convergent recombination: convergent recombination is higher in DA clonotypes that are observed in the set of public clones both in gluten unexposed repertoires and in gluten exposed repertoires (with higher increase in the latter). (A) In gluten exposed repertoires, PBMC (two-tailed Wilcoxon Signed Rank Test p = 0.125; paired t-test, p = 0.016). (B) In gluten exposed repertoires, Gut (twotailed Wilcoxon Signed Rank Test p = 0.0625; paired t-test, p = 0.0066). (C) Healthy controls, the same trend is observed. Overall, the public subset of the DA clonotypes of each subject show higher convergent recombination in gluten unexposed repertoires compared to the private subset, and this trend is increased in gluten exposed repertoires. Horizontal bars indicate median. , is observed in CD patient total PBMC repertoires on day 6. This is observed in the treated gut repertoires instead and not during active disease in the gut. It is also observed in the wheat challenged sorted samples. The plots show values calculated from total reads of the repertoires. Horizontal bars indicate median. gate), and sorted for IFNg secreting CD4+ T cells (upper middle plots, right upper quadrant). CCR9 staining of the gated lymphocytes from P1 demonstrated distinct populations with high and low CCR9 staining intensity (lower middle plots).The gate separating these populations was used to differentiate CCR9+ versus CCR9-CD4+ IFNg+ cells (shown in the rightmost contour plots).(c) The CD4+ IFNg secreting population was enriched in CCR9+ expression, indicating that the majority were gut-homing. Mean ± SD is shown. The total CD4+ IFNg secreting population was submitted to TCR analysis to assess immune features in the gluten peptide-specific repertoire.