# Profiling of the TCRβ repertoire in non-model species using high-throughput sequencing

## Abstract

In recent years, immune repertoire profiling with high-throughput sequencing (HTS) has advanced our understanding of adaptive immunity. However, fast progress in the field applied mostly to human and mouse research, with only few studies devoted to other model vertebrates. We present the first in-depth characterization of the T-cell receptor (TCR) repertoire in a non-model mammal (bank vole, Myodes glareolus), widely used in ecological and evolutionary research. We used RNA from spleens, 5′RACE and HTS to describe V and J segments of TCRβ, qualitatively characterize preferential V–J segment usage and CDR3 length distribution. Overall orthology to murine genes was preserved, with 11 J and 37 V genes found in voles (although 3 V genes lacked a close orthologue). Further, we implemented unique molecular identifiers for quantitative analysis of CDR3 repertoire with stringent error correction. A conservative, lower bound estimation of the TCRβ repertoire was similar to that found for mice (1.7–2.3 × 105 clonotypes). We hope that by providing an easy-to-follow molecular protocol and on-line bioinformatics tools that do not require reference sequences (AmpliTCR and AmpliCDR3), we will encourage HTS immune repertoire profiling in other non-model vertebrates, thus opening new research avenues in e.g. comparative immunology, ecology and evolutionary biology.

## Introduction

The central role of T cells in the immune response, both humoral and cytotoxic, is mediated by membrane, heterodimeric T-cell receptors (TCRs). In general, majority of circulating T cells bear receptors formed by α and β chains (however this varies between tissues and species, with frequencies ranging from 90–99% in human, mice or rat to 60–85% in some ruminants, pigs or chickens)1. The main function of αβ TCRs is recognition of foreign peptides in the context of the major histocompatibility complex (MHC)2.

The variability of TCRs, necessary to ascertain specificity and robustness for interaction with different antigens, is generated similarly to that of immunoglobulins–that is, by a random, somatic recombination of different, germline DNA segments (Fig. 1). During somatic recombination, V (variable) and J (joining) segments form the variable domain of the α chain, whereas the equivalent domain of the β chain is composed of the recombination of three segments: V, J, and D (diversity). In the murine genome, there are approximately 100 V and 60 J segments of α chain, and 35 V, 12 J, and 2 D segments of the β chain; however, up to one third of them are pseudogenes or otherwise non-functional3. The most variable fragment of a TCR – the Complementarity Determining Region 3 (CDR3) – is responsible for recognition of the peptide–MHC complex. CDR3 encompasses the V(D)J junction, where insertion and/or deletion of random nucleotides between the rearranged segments (N-diversity regions) increases diversity over that introduced by recombination between V, D, and J genes. As a result, each new T cell is provided with a potentially unique receptor, generating an extremely diverse TCR repertoire2. The functional TCR repertoire is shaped during maturation in the thymus, where processes of positive and negative selection assure MHC restriction and central tolerance of T cells, respectively (reviewed in Klein et al.)4. The size of an individual αβ TCR repertoire of mature lymphocytes has still not been precisely determined even in humans, with estimates differing by an order of magnitude5,6 from 2.5 × 107 to 1 × 108. In the house mouse (Mus musculus), the number of unique TCR αβ pairings was estimated at 1.9 × 106, with the number of unique TCRβ chain types estimated to reach 5–8 × 105 clonotypes7.

The composition and size of the TCR repertoire is crucial for a successful immune response8,9,10; however, its enormous diversity has long impeded in-depth study of an individual TCR’s collections at the genomic level. Indirect techniques, such as CDR3 spectratyping11, allowed a general overview of TCR repertoire dynamics, but a direct exploration of sequence diversity was only possible with exceedingly time-consuming and costly cloning and Sanger sequencing (reviewed in Six et al.)12. Consequently, profiling of immune repertoires was limited to a few model species of medical or veterinary importance. However, in the past decade, high-throughput sequencing (HTS) has revolutionized and prompted direct, affordable, and efficient sequencing of immune repertoires13,14. The accuracy of TCR repertoire characterization (focusing on the short, but extremely variable, CDR3 region) has further been improved by using 5′ Rapid Amplification of cDNA Ends (5′ RACE)15, which alleviates the amplification bias introduced by a multiplexed PCR16,17. Further integration of unique molecular identifiers (UMIs)18 into the 5′ RACE protocol allows efficient correction of PCR and sequencing errors19. The sequence-level resolution of HTS studies has already granted new insights into various aspects of adaptive immunity, such as immune response to infection20, autoimmune diseases21, repertoire changes with age6, or cancer progression and persistence22,23. Despite the potential offered by HTS, few attempts were made to transfer these advances to species other than human and mouse. Rare examples of comprehensive analysis of highly variable TCR repertoires with HTS include zebrafish (Danio rerio)24 and rhesus monkey (Macaca mulatta)25,26–both of which are well-established models in medical and biological research. Still, these rare studies have revealed an astonishing variation in the TCR repertoires among vertebrates, with D. rerio repertoire being four to five orders of magnitude smaller than estimates for humans.

Adaptation of HTS immune repertoire profiling to studies of non-model species would open new research prospects in diverse fields of biology, including comparative immunology, ecology and evolutionary biology. For example, information on multiple non-model vertebrates is necessary for comparative immunologists to make generalisations about evolution of the TCR diversity. It would enable testing of long-held hypotheses about the evolution of the vertebrate immune systems, such as those assuming trade-offs between the MHC copy number and the TCR repertoire size27,28. It would also allow following an adaptive immune response during infection or upon vaccination in species of ecological, rather than commercial, value (e.g. endangered species, bioindicators). However, lack of universal molecular protocols and the level of bioinformatics sophistication necessary for an analysis of immune repertoires generated with HTS can easily discourage prospective researchers.

In the present study, we applied recent developments in TCR repertoire sequencing to a non-model mammal (the bank vole, Myodes glareolus), which is nonetheless often used in ecological and evolutionary research. We chose αβ T cells, as the most prevalent T cell subset. Importantly, αβ T cells recognise antigens in the context of the MHC, so ability to study αβ TCR repertoires is a pre-requisite to study the aforementioned trade-offs associated with individual MHC diversity. In the future, however, an analogical approach can be used to study other T cell populations, e.g. γδ T cells (lacking the MHC restriction and bridging the adaptive and innate-like responses)29. Within the αβ TCR subset studied, we focused on the TCRβ chain because it is more variable due to the presence of D segment within its CDR3, and thus likely serves as a better representation of the extent of the individual TCR diversity than α chain.

We took advantage of the fact that the 5′ RACE technique does not require prior knowledge of the V and J segments in the species of interest. Necessary primers within a constant region of the TCRβ were designed based on de novo assembled transcriptomes–an approach previously tested using highly complex system of duplicated MHC genes30. We used RNA isolated form spleens of bank voles from a genetically diverse laboratory colony (see Materials and Methods) as a starting material for sequencing library preparation. We adopted molecular protocols available for human and mouse15 (Supplementary File S1) to amplify TCRβ chains transcripts containing full CDR3 region, as well as the entire V and J segments. These sequences were subjected to detailed qualitative analysis (description of V and J genes, V–J segment usage, CDR3 length distribution) using our newly developed software: AmpliTCR and AmpliCDR3. We further used a modified molecular protocol incorporating UMIs18,19, to allow stringent sequencing error correction (also implemented in AmpliCDR3), necessary for quantitative estimation of the TCR repertoire size. Our tools are freely available on-line as an easy to navigate web server. The tools are a part of AmpliSAT suite (see Materials and Methods), which already comprises AmpliSAS31, an increasingly popular tool for multi-locus MHC genotyping in non-model species.

Our focal species, the bank vole, is a small rodent of the Cricetidae family, widespread in western Europe and northern Asia. It has been extensively studied in various subjects, ranging from adaptive radiation32,33, genetic basis of adaptation34, and response to selection35 to processes shaping sexual36 and behavioural traits37. Primary interest in its immunity was driven by its role as a reservoir of a pathogenic Puumala hantavirus (PUUV)38,39, which causes a mild form of haemorrhagic fever with renal syndrome in humans. To date, a number of components of the bank vole’s immune system have been studied, such as the MHC30,40,41,42 and Toll-like receptors43,44, but sequencing and profiling of immune repertoires, such as TCRs, has never been attempted. Thus, our work adds a crucial element to the immunogenic resources available for the bank vole and opens new experimental avenues in research involving this species.

## Results

### Qualitative description of the bank vole TCRβ repertoire

Using 5′ RACE and Illumina paired-end 300-bp MiSeq sequencing (Supplementary Fig. S1), we obtained a full variable region of TCRβ chains from seven bank voles, from which V and J segments were extracted. In this section, we described phylogenetic comparison of these segments with mouse genes, which also served as a basis for naming of the bank vole loci. Furthermore, we analysed V–J segment usage and CDR3 length distribution. At this stage, we did not analyse D segment sequence or usage, because we cannot confidently infer its germline sequence from recombined CDR3 region (N-diversity regions introduced too much uncertainty).

#### V and J segment identification and naming

Bank voles and mice diverged ca. 20–40 million years ago45,46, yet orthologous genomic murine segments could be identified for most J and V bank vole segments (Figs 2 and 3). Because the bank vole neither has a high-quality annotated reference genome nor are TCR germline gene segments available in the Genbank or International ImMunoGeneTics information system® (IMGT), we assigned J and V variants extracted from TCRβ transcripts to provisional loci based on their assignment to distinct clades in a phylogenetic tree. Additionally, orthology and similarity (Supplementary Figs S2 and S3) to the mouse references from the IMGT database3 (Supplementary Table S1) further supported recognition of these clades as loci. Bank vole genes were named BV_TRBV or BV_TRBJ, with the numbering of each locus corresponding to the murine orthologue. For a few V segments clear orthologue was absent (Fig. 3), and these putative loci were numbered as 40–42 (numbers not present among murine genes). With only a few exceptions, there were between one and two variants (hereafter, alleles) belonging to each phylogenetic clade (hereafter, locus) present per individual (Supplementary Figs S4 and S5). Such a pattern is expected in a diploid organism and further justifies treating clades as separate loci.

We identified 25 distinct V-segment subgroups (genes that share at least 75% identity at the nucleotide level - as defined by IMGT®-ONTHOLOGY)47. Most of them comprised only one locus; however, two subgroups, BV_TRBV12 and BV_TRBV13, contained multiple loci. Similar expansion took place with the murine orthologue subgroups TRBV12 and TRBV13, which have three loci each. Overall, in the bank vole, we found 37 V putative loci (with 1–5 alleles per locus, summing to a total of 116 alleles across loci), compared to 21–22 functional loci in the mouse48. In the case of the J segment, we identified 11 loci within two subgroups–one with five loci and the other with six loci. This number exactly matches the number of functional J loci in the mouse48. We identified 1–3 alleles per locus, with a total of 17 alleles across loci.

We compared TCRβ V segments retrieved in the present work (from directly sequenced transcripts of the bank vole TCRβ) against V segments extracted from a recently assembled draft bank vole genome. The latter had been downloaded from the online database Vgenerepertoire.org49, containing automatically extracted V genes from whole-genome assemblies (WGA) for over 200 vertebrate species. Only 17 TCRβ V genes were extracted from the bank vole draft genome, compared to 37 putative loci found herein. Comparison of amino acid sequences (Supplementary Fig. S6) showed that the following subgroups of BV_TRBV genes are missing from the WGA: 1, 2, 5, 12, 13, 20, 26, 40, and 42. All of these, except for 40 and 42, have clear murine orthologues, with subgroups 12 and 13 containing multiple duplicated loci. The reason for this discrepancy most likely is the rather poor quality of the bank vole draft genome (scaffolds: 367 242; contigs: 2 834 384; contig N50: 1 591; L50: 365 195), which did not allow successful recovery of all functional genes. All extracted genes from WGA are, however, present in our dataset, with the exception of V49. This locus is likely non-functional and not transcribed, similarly as its murine orthologue TRBV21 (Supplementary Fig. S6), known for alterations in the splicing sites3.

#### V-J segment usage and CDR3 length distribution

To avoid possible biases introduced by differences in receptor expression between cells, both features were measured by the number of identified, unique nucleotide variants, rather than abundance of the recovered transcripts.

Based on the assignment of the V and J segments of the bank vole to the provisional loci, we analysed preferential usage of V and J segments in the rearranged TCRβ transcripts. In all seven bank voles, the dominant gene segments were BV_TRBV23, BV_TRBV24, BV_TRBV13–1 (Fig. 4a), and BV_TRBJ1–1, BV_TRBJ1–4 (Fig. 4c). Accordingly, combinations of the above segments formed the most favoured V–J pairs, although BV_TRBV23/BV_TRBJ1–4 was the only pairing consistently ranked in the top 1% in all the individuals. Other most common combinations, found in the top 1% of the highest frequency pairs, were BV_TRBV13–1/BV_TRBJ1–4 and BV_TRBV23/BV_TRBJ1–1. A comparison of preferred V–J pairs among individuals revealed both between-individual similarities and differences. Figure 5 shows an example of contrasting patterns of V–J gene usage. Individual s02 is characterized by a strong bias toward pairings involving V segment 23 with different J segments from subgroup 1 (BV_TRBJ1–1, BV_TRBJ1–2, and BV_TRBJ1–4) and a very scarce use of J segments from subgroup 2 in all combinations. In contrast, individual s03 has a more even usage of all J segments, including those from subgroup 2. A comparison of all seven individuals is shown in Supplementary Fig. S7.

The CDR3 lengths had a bell-shaped distribution and ranged from 5 to 18 amino acids (15–54 nt). The median length equals 12 amino acids, 71% of unique CDR3 sequences had a length of 11–13 amino acids, and 90% of CDR3s had length ranging from 10 to 14 amino acids (Fig. 4e). In silico spectratypes with V and J segments are shown in Supplementary Fig. S8. The global mean length of CDR3s was 11.91 amino acids (weighed by read count), but mean lengths of CDR3s formed by different V and J segments showed consistent differences (Fig. 4b,d). For example, V segments BV_TRBV01, BV_TRBV20, and BV_TRBV30 tend to form shorter CDR3s (means: 10.37, 11.17, and 11.02, respectively), whereas BV_TRBV02 and BV_TRBV05 slightly longer ones (mean 12.56, for both). Similarly, J segments BV_TRBJ1–1, BV_TRBJ1–2, and BV_TRBJ2–1 form shorter than average CDR3s (mean: 11.35, 11.37, and 11.11, respectively).

### Quantitative analysis of the bank vole TCRβ repertoire

For three additional individuals, a modified 5′ RACE protocol integrating UMIs and Illumina paired-end 150-bp HiSeq sequencing (Supplementary Fig. S1) was used to estimate the TCRβ repertoire size. The repertoire size (i.e. number of unique CDR3 variants per individual) was estimated using the Chao2 estimator, based on UMI error-corrected, unique CDR3s from four independent replicated samples per individual.

#### Repertoire size estimation

The mean number of observed, unique (non-redundant) TCRβ CDR3 sequences per amplicon subsampled to 1.5 mln reads was 1.3 × 105 (SD = 0.1 × 105) for nucleotide sequences and 1.1 × 105 (SD = 0.1 × 105) for amino acid sequences (Supplementary Table S2). The mean total number of observed, unique CDR3 sequences per individual (adding up the four replicates) was 1.8 × 105 (range: 1.5–2 × 105) nucleotide sequences and 1.5 × 105 amino acid sequences (range: 1.2–1.6 × 105).

A major obstacle in repertoire-size estimation in mammals is the inability to capture the entire repertoire, even if the library from a given sample can be exhaustively sequenced50,51. This caveat–inherent to repertoire sampling and revealed with exhaustive sequencing experiments–results from the fact that the diversity found in any aliquot of cells will typically reflect a fraction of the total repertoire14,50. For that reason, richness estimators–traditionally used in ecology to tackle “unseen species” problem–are applied to calculate total clonotype diversity. They allow to estimate, from given samples, how many unobserved species (or in this case–clonotypes) are present in an individual. Following other authors6,52 we used a non-parametric, incidence-based estimator Chao253 to estimate the lower boundary of TCRβ repertoire size. The mean of the lower-bound estimate was 2.1 × 105 (range 1.7–2.3 × 105) different nucleotide CDR3 sequences (Supplementary Table S3), which indicates that in four samples we directly observed up to 86% splenic TCRβ CDR3 sequences.

In our size-estimation protocol, we sub-sampled amplicons to control for differences in the sequencing depth. The 1.5 mln raw reads sufficed to retrieve most of the diversity present in each amplicon and in the individual (across all replicated amplicons). For example, for the highest-coverage individual (s08, mean: 3.5 mln raw reads per replicate), with sub-sampling to 1.5 mln, the unique number of observed CDR3s was 1.4 × 105 nucleotide sequences per replicate and, altogether, 1.9 × 105 for the individual (when a total of 6 mln reads were analysed). Doubling the sequencing effort increased the number of unique CDR3s to 1.7 × 105 per replicate, and to a total of 2.1 × 105 for the individual (when all 13.9 mln reads were analysed). The Chao2 estimator was robust to those variations, as the estimates of CDR3 repertoire size increased only by 8%–from 2.3 × 105 (at 1.5 mln reads per replicate) to 2.5 × 105 (with ±3.5 mln reads per replicate).

#### Private versus public repertoire

The vast majority of CDR3 nucleotide sequences represented private TCRβ repertoire–that is, these sequences were not present in any other individual. On average, only 3.7% of CDR3 nucleotide sequences was shared between any two individuals, and 1.3% was present in all three individuals, thus representing a public repertoire (Supplementary Fig. S9). In contrast, comparison of common CDR3 nucleotide sequence between replicates from the same individual showed that, on average, 77% of sequences was shared between any two replicates, and 47% was present in all four replicates. The number of shared amino acid CDR3 sequences between any two individuals was, on average, 14.3% (SD: 1%). Further, 7.5% CDR3 amino acid sequence was present in all the individuals (Supplementary Fig. S9; Supplementary Table S3).

## Discussion

We present, to the best of our knowledge, the first study to analyse in detail TCR repertoire of a non-model species, using HTS and state-of-the-art error-correction protocols. We described V and J gene segments in the bank vole, analysed their phylogenetic relationship to model murine TCRs, and also characterized basic features of its TCRβ repertoire, such as CDR3 length distribution, V–J gene usage, and a conservative estimation of the TCRβ repertoire size. We provide this detailed descriptive analysis as an example of a successful immune repertoire profiling in a species that lacks sufficient genomic resources permitting access to well-established molecular protocols and associated bioinformatics tools.

Overall, the orthology of bank vole V and J segments to mouse genes was conserved, as was the number of V and J segments. The conservation of TCR structure, manifested in the long branches of the phylogenetic trees, may result from an obligatory interaction of TCRs with MHC during antigen recognition. Olivieri et al.49 drew a similar conclusion from a comparison of V genes automatically extracted from WGAs of the mouse and rat. Their analysis showed that the phylogeny of V segments of TCRs had been characterized by much longer branches, compared to V segments of structurally similar–yet neither MHC nor otherwise restricted–immunoglobulin receptors. A constrained binding interface of co-evolving germline-encoded TCR loops and conserved features shared by all (otherwise polymorphic) MHC alleles should preserve orthologous relationship between genomic TCR segments over large evolutionary time scales54,55.

Despite overall conservation, a few bank vole V loci did not have a clear murine orthologue. One reason for the unclear orthology could be deletion of some genes in one of the species. For example, the bank vole BV_TRBV41 gene misses the mouse orthologue, but it closely groups with the TRBV10 gene, which is pseudogenised in mice. This suggests that genes belonging to this clade might have become non-functional in mice, and some may have been eventually deleted. Another reason could be expansion after the diversification of species. For example, TRBV12 and TRBV13 groups have multiple genes both in voles and mice. These duplicated genes group by species, which suggests that they have expanded independently, after the split. A database collecting numerous V genes extracted from WGAs (Vgenerepertoire.org49) shows an expansion of TCRβ V segments in many species from the Cricetidae family, with more than 30 genes found in prairie vole (Microtus ochrogaster) and North American deer mouse (Peromyscus maniculatus). But, we note that the number of genes recovered from a WGA may strongly depend on the quality of the assembly, and should be treated with caution. The WGA-based method may substantially underestimate the true diversity and expansion of V genes. In case of the bank vole, more than a half of the TCRβ V genes were missing from the automatically extracted V genes from a WGA, compared to our direct sequencing of TCRβ transcripts.

Bell-shaped CDR3 length distribution, preferential use of particular V and J segments, and unequal frequencies of V–J pairings that we found are all well-documented features of the TCR repertoires of other vertebrates11,24,56,57. Between-individual similarities in V–J segment usage and pairing may stem from chromatin conformation, physical proximity of germline segments, and/or recombinatorial bias58,59,60. Observed differences might reflect T cell expansion during immunization with different antigens57, however, standardized and controlled laboratory conditions under which the voles were kept render differential antigen exposure unlikely. Alternatively, individual skewing of TCR repertoires may be linked to the genetic heterogeneity of the animals. The breeding colony from which our samples came from was started from a large number of wild-caught animals. Due to appropriate breeding strategy32, the colony retained much of the original genetic diversity. For example, our previous study using animals from this colony showed a high between-individual variation in MHC genes30. Thymic selection, which depends upon individual MHC composition, may therefore be, at least to some extent, responsible for the observed between-individual differences in TCR repertoires61,62,63.

A lower bound of the TCRβ repertoire size in the bank vole was conservatively estimated at 2.1 × 105 unique CDR3 types, which is comparable to the estimate for the mouse, 5–8 × 105 (based on Sanger sequencing of a single CDR3 length band and interpolation over the sizes of all bands7). Repertoire size reflects overall diversity, without taking into account abundance. While in some cases information about the number of cells bearing particular TCR might be more interesting, for example if clonal expansions are of interest, such abundance data are harder to confidently obtain when RNA is used as staring genetic material. Although presence of UMIs allows for differentiation between single cDNA templates (alleviating issues of e.g. unequal PCR amplification), still two identical cDNA variants with different UMIs may either represent multiple transcripts from one T cell or transcripts from different T cells with identical TCRs (e.g. expanded clones or cases of identical/convergent recombination).

Less than 4% of nucleotide CDR3 sequences was shared between any two individuals, and only 1.3% was present in all three individuals, which is similar to the numbers reported for human50. The number of shared amino acid sequences was almost four times greater, and is of the same order of magnitude as found for human and mice50,64. We note that, although the number of observed sequences was not far from the estimated repertoire (on average, 86%), we do not have full individual repertoires and the number of shared clonotypes may be underestimated. However, it was previously shown that the frequency of a given TCR type (within-sample abundance or presence in multiple aliquots from a single individual) predicts sharing of this clonotype between individuals65. Therefore, while more exhaustive repertoire sequencing should identify less frequent variants, it would be unlikely to significantly increase the size of the shared and public repertoire.

Taken together, our work broadens knowledge of the immune system of the bank vole–a well-established species in ecological and evolutionary research, and an emerging, non-murine model in the field of eco-immunology and studies of zoonosis66,67. We showed that an informative analysis of an individual TCR repertoire is possible with partial information on the genomic composition of the TCR locus (e.g., we did not have full sequences of genomic segments prior to somatic recombination and had only provisional assignment to the actual loci for V and J segments). In the future, the methodology developed and tested herein can be used to study immune response in nature and in experiments involving this species.

More importantly, our TCR profiling workflow can be easily adapted to other non-model vertebrates that lack genetic resources such as well-annotated reference genomes or V/J segment listings in bioinformatics databases. It does not require dedicated molecular reagents and kits - it uses universal 5′RACE kits and requires only one species-specific set of primers in the constant region of the TCR molecule. The primers can be easily designed based on e.g., the de novo draft genome or transcriptome assemblies of this region, which are relatively easy and straightforward as compared to assemblies of highly duplicated and diversified V and J segments. Such obtained TCR sequences can be readily analysed with our new, publicly available, on-line AmpliTCR and AmpliCDR3 programs. They extract V and J segments and the CDR3 region based on conserved amino acid residues, and, unlike other published tools68,69, they do not rely on availability of genomic references of V/D/J segments. Therefore, the promise of “no more non model species” in the HTS era70 can be finally fulfilled for the challenging task of immune repertoire sequencing. We hope that our approach will become particularly useful in ecological and evolutionary research, where organisms of interest rarely belong to the exclusive club of well-established model species. Yet, there is great potential in such studies–from enhanced comparative immunology and opportunity to explore long-standing evolutionary hypotheses to capability of tracking adaptive immune response and disease progression in vertebrate species of ecological importance.

## Materials and Methods

### Primer design

The bank vole is a non-model organism, with limited genetic information available. To design nested primers for 5′ RACE, we used de novo assembled transcriptomes from spleens of seven animals. For a detailed description of the dataset and the assembly strategy, see Migalska et al.30 Briefly, transcriptomes were de novo assembled with Trinity71, and contigs highly similar to the first exon of the constant regions of TCRβ Mus musculus sequences (as extracted from Genbank records: AH002088.2, AH002089.2, and X03574.1) were retrieved. Based on these contigs, three nested primers in the 3′ end of the constant region were designed: MyglTCRb_1: TGATCTCTGCTTCTGATG, MyglTCRb_2: GATGGCTCAAACAGGGTGACC, and MyglTCRb_3: GGACTCACCTTGCTCAGATCCT. The full list of primers and adaptors used in the study is given in the Supplementary Table S4, and a schematic illustration of the primers and adaptor location is presented in the Supplementary Fig. S1.

### Experimental animals and RNA extraction

The animals in the study were acquired from a laboratory colony of bank voles maintained at the Institute for Environmental Sciences (Jagiellonian University, Kraków, Poland). The colony was set up from wild animals captured in Niepolomice Forest near Kraków (southern Poland), for a large experimental evolution study modelling adaptive radiation32. The colony consists of three artificial selection directions and an unselected control. The voles from all selection directions are kept under standardised conditions, except for short-term measurements. Spleens from randomly chosen, healthy individuals (n = 10, all coming from the same selection direction: ability to maintain weight on herbivorous diet) were collected during necropsy, in accordance with internationally recognized guidelines for research on animals approved by the Krakow Ethical Committee for Experiments on Animals. Information on the animals’ sex, age at death, as well as on its assignment to the qualitative or quantitative part of the study is given in Supplementary Table S5. Fragmented spleens (4–6 fragments depending on spleen size) were preserved in RNAlater (Sigma-Aldrich), and homogenized piece by piece with FastPrep® (MP). Total RNA was extracted using RNAzol® RT (Sigma-Aldrich) according to the manufacturer’s instructions, and eluted in 50 µl RNase-free water. For qualitative description, seven bank voles were used (one spleen fragment per individual). For quantitative analysis of three additional bank voles, aliquots (20 µl) of extracts from each spleen fragment were pooled and purified on NEXTflex™ Poly(A) Beads (Bioo Scientific®) to remove abundant rRNA and reduce sample volume (elution in 15 µl).

### 5′ RACE-based library construction and Illumina sequencing

For qualitative description of the bank vole TCRβ repertoire, we used the 5′ RACE method, based on the procedure described by Mamedov and colleagues15, but murine-specific primers were replaced with bank vole-specific ones (see Primer design section). cDNA synthesis was conducted using the 5′ template-switching procedure, followed by nested PCR amplification steps, described in detail in Mamedov et al.15, and summarized in the Supplementary Fig. S1. For the first step (cDNA synthesis), the Mint-2 cDNA synthesis kit (Evrogen) containing the 5′ oligo adaptor PlugOligo-1 was used, with custom-made MyglTCRb_1 primer (1 µM) and 2 µg total RNA for each sample. Subsequent PCRs were conducted using the high-fidelity Encyclo polymerase mix (Evrogen) according to the manufacturer’s instructions. In the first PCR, 1 µl of the first strand cDNA was used as a template, with Smart20 oligonucleotide and nested MyglTCRb_2 primer (Supplementary Table S4). In the second PCR, 1 µl of the product of the first-round reaction was used as a template, with Step_1 oligonucleotide and MyglTCRb_3 nested primer (Supplementary Table S4). PCR conditions for the first PCR were as follow: 1-min denaturation at 95 °C, 20 cycles of 95 °C for 20 s, 65 °C for 20 s, and 72 °C for 50 s, with a final elongation for 3 min at 72 °C. Conditions of second PCR were the same, except for the cycles (their number was reduced to 10) and final elongation prolonged to 5 min. PCR products were run on 1.5% agarose gel, and bands of desired size (~600 bp) were excised using QIAquick Gel Extraction Kit (QIAGEN) according to manufacturer’s protocol. The final library was constructed with NEBNext® DNA Library Prep Master Mix Set for Illumina® (NEB). Paired-end (PE), 300-bp sequencing was performed at the Institute of Environmental Sciences, Jagiellonian University, Krakow (Poland), with the MiSeq Reagent Kit v3 on a MiSeq sequencer (Illumina). The sequencing yielded more than 20 mln raw reads (mean per sample: 2.9 mln, SD: 0.2 mln; Supplementary Table S6).

To facilitate use of this molecular protocol, we enclose it as a Supplementary File S1.

### Qualitative description of the bank vole TCRβ repertoire

The TCRβ repertoire for seven individuals (s01–s07) was analysed and described, as detailed in the following steps: (i) pre-processing of paired-end Illumina reads; (ii) read sub-sampling, extraction of V and J segments and retrieval of consensus V and J variants by AmpliTCR; (iii) phylogenetic analysis and naming of the V and J variants; and (iv) description of the V–J segment usage and CDR3 length distribution with AmpliCDR3.

All bioinformatics tools used in the following sections (AmpliMERGE73, AmpliCLEAN, AmpliTCR, and AmpliCDR3) belong to the AmpliSAT suite (Amplicon Sequencing Analysis Tools) available at: http://evobiolab.biol.amu.edu.pl/amplisat/.

#### Pre-processing of Illumina data

Overlapping paired-end reads from the 5′ RACE-based library (MiSeq 2 × 300 bp run) were merged with AmpliMERGE–a tool based on FLASH74–using default parameters. Subset of 2 million randomly chosen, merged reads was filtered with AmpliCLEAN, removing reads with lower average Phred quality score (<30) and those not matching the last 12 nucleotides of the MyglTCRb_3 primer (Supplementary Table S4) at the beginning of the constant TCRβ region. On average, 1.8 mln reads per sample passed the pre-processing step (Supplementary Table S6).

#### Read sub-sampling, TCRβ V and J segments extraction, and retrieval of consensus V and J variants by AmpliTCR

To analyse a comparable number of TCRβ sequences for each individual, 1 mln random reads per sample were selected for further processing; 95% of these reads had lengths between 325 and 536 bps and covered the full variable region of the TCRβ, including the rearranged and spliced V(D)J segments. Identification of V and J segments was performed automatically by AmpliTCR. Although partial sequence of D segments was extracted by the software, it was neither included in further analysis nor described, as too much uncertainty about its germline sequences is introduced by N-diversity regions. The AmpliTCR uses protein patterns in the PROSITE format75 to identify highly conserved residue positions in the translated sequence. To extract TCRβ V segment (TRBV) the following pattern was used: x(5)-Q-x-P-x(14)-C-x(10,11)-W-Y-x(39,42)-[LM]-x(14)-C-x; and for TCRβ J segment (TRBJ): x(4)-G-x-G-x(2)-L-x-[VI]-x. Patterns were composed of conserved amino acids previously described in the literature76,77 as well as the multiple alignment of human and mouse TRBV and TRBJ sequences found in the IMGT database78. On average, 91% of the analysed reads (range: 87–94%) contained both protein patterns. AmpliTCR discards sequences containing unspliced introns, frameshift indels, and premature termination codons (PTCs). Approximately 41% of TCRβ sequences with recognized protein patterns were later discarded for those reasons.

Extracted TRBV and TRBJ sequences were de-replicated by AmpliTCR into unique, non-redundant variants and ordered by sequencing depths. Next, TRBV variants with frequency less than 0.01% were removed to discard artefacts, such as sequencing errors, PCR-based substitutions, and chimeras. For TRBJ, the frequency threshold was increased to 0.1%. Both frequency values are based on the minimum percentage of rare alleles for each TCRβ segment found in human50. Further, to remove the remaining artefacts, variants were clustered with the CD-HIT-EST79 algorithm (implemented in AmpliTCR), with a minimum identity threshold of 95% for TRBV and 85% for TRBJ. These thresholds were established based on sequence similarity between the TRBV and TRBJ loci in human and mouse80. In addition, to prevent merging true but similar variants, sequences with within-cluster frequencies above 20% were moved to form a new cluster. Finally, major variants within each cluster were selected as putative TRBV and TRBJ alleles. All alleles are available in the Supplementary Files S2 and S3.

#### Phylogenetic analysis and naming of the V and J segments

No genomic information about segments composing bank vole TCRβ was available prior to the analysis. Provisional V and J loci identification was conducted based on formation of distinct clades in phylogeny by variants (alleles) from 7 individuals, and further supported by both orthology (co-occurrence of bank vole alleles and murine reference within the same phylogenetic clade) and sequence similarity to the mouse references from the IMGT database3 (the full list of the murine reference sequences with accession numbers is available in a Supplementary Table S1, all retrieved bank vole alleles are in the Supplementary Files S2 and S3). The orthology was assessed from phylogenetic trees constructed separately for all V and J alleles in MEGA681. The neighbour-joining trees were constructed with the evolutionary distances computed using the Maximum Composite Likelihood method, with a bootstrap of 1000 replicates (Supplementary Figs S2 and S3). In case of J segments, several branches were poorly resolved, with low bootstrap values. This is most likely caused by the short length of sequences (<25 bp), which resulted from trimming them at the border of the CDR3 region. However, visual inspection of an alignment of the bank vole sequences to the murine reference allowed confident loci assignment. We named bank vole loci (clades of putative alleles) with a “BV” prefix (BV_TRBV for V segments and BV_TRBJ for J segments), while keeping the numbering of the murine orthologues. Alleles within each locus were numbered arbitrarily. A few V segment sequences did not have a close murine ortholog (or a highly similar sequence); therefore, to avoid confusion, their clusters were named with BV_TRBV4X (as none of the murine V segments has a number of 40 or higher). Supplementary phylogenetic analysis was conducted for V segments, with the addition of genes downloaded from an online database Vgenerepertoire.org49 (accessed: May 20, 2017), that stores V genes automatically extracted from whole-genome assemblies. V genes of the bank vole had been extracted from a draft genome with GenBank assembly ID: LIPI00000000.1.

#### V–J segment usage and CDR3 length distribution

After the V and J segment loci were identified and named, we assessed V–J segment usage and CDR3 length distribution in the seven bank vole individuals. Pre-processed Illumina reads (see above) were re-analysed with AmpliCDR3 tool, with the reference BV_TRBV and BV_TRBJ segments obtained in the previous steps. For each individual V and J segment, sequences are locally aligned with BOWTIE282 (–sensitive-local -k 2) to the BV_TRBV/J references. Particular BV_TRBV and BV_TRBJ loci were annotated if they had aligned with more than 90% identity to any of the reference sequences provided.

The CDR3 region is defined as the sequence between the last conserved Cys in the V segment and one residue (3 nt) before the first conserved Gly in the J segment. AmpliCDR3 extracts the CDR3 region by searching for a TG[TC][GA][CG] motif located at the 3′ end of the V segment and a ([GA][CG]) motif marking the 5′ beginning of the CDR3 region. The 5′ end of the CDR3 is fixed at 31-bp upstream from the 5′ end of the constant TCRβ region (sequence of the MyglTCRb_3 primer, Supplementary Table S4, Supplementary Fig. 1). AmpliCDR3 identified the correct protein pattern for CDR3 regions in approximately 55% of the pre-processed reads (44–63%). Finally, the AmpliCDR3 filters putatively non-functional CDR3 sequences (shorter than 15 bp and longer than 54 bp, containing frameshift indels and PTCs). After filtering, on average, 85% of the reads were retained (75–89%). Although the remaining sequences likely still contain substitution errors introduced during PCR or sequencing, they should not significantly affect the descriptive characterization of the TCRβ repertoire (i.e. V–J segment usage and CDR3 length distribution). Qualitative statistics were calculated and visualized with R83.

### Quantitative analysis of the bank vole TCRβ repertoire

For three additional individuals (s08–s10), a quantitative analysis of the TCRβ CDR3 repertoire was conducted with a UMI-based error-correction strategy. Data were analysed with the tools AmpliMERGE, AmpliCLEAN, and AmpliCDR3 in the following steps: (i) pre-processing of paired-end Illumina reads; (ii) sub-sampling, identification of CDR3 sequences and filtering; (iii) UMI-based CDR3 error correction; (iv) repertoire size estimation; and (v) private versus public repertoire comparison.

#### Pre-processing of Illumina data

Non-overlapping, paired-end reads from Illumina HiSeq 2500 (2 × 150 bp) were concatenated with AmpliMERGE and filtered with AmpliCLEAN to remove reads with average Phred quality score lower than 30. The reads did not cover the whole variable region of the TCRβ chain (Supplementary Fig. S1), but retrieved the UMI, the entire CDR3 and J segment, as well as the partial V segment.

#### Read sub-sampling, CDR3 region identification, and filtering with AmpliCDR3

To control for differences in sequencing depth that might influence the number of retrieved variants, we subsampled pre-processed reads up to 1.5 mln. Rarefaction curves analysis showed that, at this point, the number of new CDR3 sequences found within an amplicon started to plateau, such that our sub-samples likely captured most of the CDR3 diversity within an amplicon (Supplementary Fig. S10).

CDR3 region identification and filtering was undertaken with AmpliCDR3, as described in the subsection on V-J segment usage and CDR3 length distribution. Conserved protein motives delimiting CDR3 were found in 95% (SD = 3%) of the reads; on average, 72% of them (SD = 5%) passed the filtering criteria (Supplementary Table S2).

#### UMI-based CDR3 error correction

For reliable estimation of the number of unique TCRβ CDR3 variants within an amplicon, we conducted UMI-based error correction. As UMIs are added to molecules during cDNA synthesis, all HTS reads sharing identical UMIs should originate from the same, amplified cDNA template. Any substitution or indel among reads with identical UMI can, therefore, be identified as an artefact. At the same time, any two sequences tagged with different UMIs can be inferred to represent true, biological variants even if they differ by as little as a single nucleotide substitution19. Our error-correction algorithm, implemented in AmpliCDR3 tool, is based on the principles of the MIGEC (molecular identifier groups–based error correction) strategy developed by Shugay and colleagues19, but it is simplified to increase speed and modified to account for lower sequencing depths. Nonetheless, AmpliCDR3 retrieves highly comparable TCR repertoires, both in terms of total numbers of observed CDR3s and calculated diversity indices. This is shown by application of AmpliCDR3 to a murine dataset (generated with library preparation protocol incorporating UMIs, as in our study) and analysed in detail using MIGEC19 and MiXCR69 by Izraelson and colleagues84. Results of both approaches where very similar (see Supplementary Table S7) and highly correlated (correlation coefficient >0.99).

To infer the correct sequence of a CDR3, AmpliCDR3 groups reads with identical UMIs into molecular identifier groups (MIGs)19. As the majority of errors arise during late PCR cycles or sequencing, they are expected to represent the minority of reads within a MIG, and a consensus sequence is inferred to be the true CDR3 sequence. However, early-stage PCR errors, presence of PCR chimeras, and so on can produce ambiguous, “mosaic” MIGs. For that reason, the AmpliCDR3 inspects each MIG and discards those in which half of the reads fails to pass the similarity threshold (2-bp substitution) to the most abundant sequence (thereby 0.6% of reads were discarded). If a MIG contained only two reads, they had to be identical to remain in the analysis as a true sequence. MIGs that contained only one read were discarded (29%).

Because CDR3s with different UMIs might derive from one T cell (multiple mRNA transcripts), or from different T cells bearing an identical CDR3, we kept only unique (non-redundant) CDR3 variants in further analysis, and we did not analyse the abundance data. Note, however, that AmpliCDR3 provides information on abundance of each CDR3 variant (both the number of sequencing reads and MIGs representing this variant). Lastly, CDR3 nucleotide sequences were translated into amino acid sequences.

#### Repertoire size estimation

To estimate the lower bound of individual TCRβ diversity (repertoire size) we chose a non-parametric, incidence-based Chao2 estimator53. Chao2 does not require assumption of a priori clonotype distributions, and it had been previously applied to TCR repertoire estimation6,52. In contrast to abundance-based estimators, which use absolute counts of T cells bearing the given CDR3 (clonotype), Chao2 uses presence–absence data across replicates. We preferred to avoid abundance-based estimators as it is difficult to infer absolute T cell numbers (i.e. abundance of a given clonotype) by counting the observed mRNA transcripts. The reason is that a differentiation between identical CDR3 sequences extracted from mRNAs coming from a single T cell and those from different cells but bearing identical CDR3s (i.e. expanded clones or instances of convergent recombination) is impossible. For calculations, we used the following equation (1):

$${S}_{Chao2}=\,{S}_{Obs}+\,(\frac{m-1}{m})\frac{{q}_{1}^{2}}{2{q}_{2}}$$
(1)

where SObs is the total number of CDR3 unique sequences observed across all replicates for a given individual, q1 is the number of CDR3 sequences present only in one replicate, q2 is the number of CDR3 sequences present in two replicates, and m is the number of replicates (m = 4). Calculation of CDR3 sequences (both at nucleotide and amino acid level) present in one, two, three or all four replicates for each individual was conducted automatically by AmpliCDR3.

#### Private versus public repertoire

To compare TCR diversity among individuals, we first pooled unique CDR3 sequences from four replicates for each individual and removed redundant sequences. The resulting repertoire was compared among individuals–both at the nucleotide and amino acid levels. We further defined a CDR3 sequence as “private” if it was present in only one bank vole in our analysis, and as “public” if it was shared by all three individuals. Calculations were conducted with AmpliCDR3.

#### Data accessibility

Illumina MiSeq and HiSeq raw reads (see the Materials and Methods section) are deposited in FASTQ format in the European Nucleotide Archive (ENA), with the study accession PRJEB22487. V and J segment consensus alleles are available in the Supplementary Information (Supplementary Files S2–S3) associated with this article. Bioinformatics tools AmpliTCR, AmpiCDR3, AmpliMERGE, and AmpliCLEAN are available for online use or download at our laboratory server: http://evobiolab.biol.amu.edu.pl/amplisat/.

## References

1. 1.

Holderness, J., Hedges, J. F., Ramstead, A. & Jutila, M. A. Comparative Biology of γδ T Cell Function in Humans, Mice, and Domestic Animals. Annu. Rev. Anim. Biosci. 1, 99–124 (2013).

2. 2.

Chaplin, D. D. Overview of the immune response. J. Allergy Clin. Immunol. 125, S3–S23 (2010).

3. 3.

Bosc, N. & Lefranc, M. P. The mouse (Mus musculus) T cell receptor beta variable (TRBV), diversity (TRBD) and joining (TRBJ) genes. Exp. Clin. Immunogenet. 17, 216–28 (2000).

4. 4.

Klein, L., Kyewski, B., Allen, P. M. & Hogquist, K. A. Positive and negative selection of the T cell repertoire: what thymocytes see (and don’t see). Nat. Rev. Immunol. 14, 377–91 (2014).

5. 5.

Arstila, T. P. et al. A direct estimate of the human alphabeta T cell receptor diversity. Science 286, 958–961 (1999).

6. 6.

Qi, Q. et al. Diversity and clonal selection in the human T-cell repertoire. Proc. Natl. Acad. Sci. 111, 13139–13144 (2014).

7. 7.

Casrouge, A. et al. Size estimate of the alpha beta TCR repertoire of naive mouse splenocytes. J. Immunol. 164, 5782–7 (2000).

8. 8.

Kuchroo, V. K. et al. A single TCR antagonist peptide inhibits experimental allergic encephalomyelitis mediated by a diverse T cell repertoire. J. Immunol. 153, 3326–36 (1994).

9. 9.

Yager, E. J. et al. Age-associated decline in T cell repertoire diversity leads to holes in the repertoire and impaired immunity to influenza virus. J. Exp. Med. 205, 711–23 (2008).

10. 10.

Cicin-Sain, L. et al. Loss of naive T cells and repertoire constriction predict poor response to vaccination in old primates. J. Immunol. 184, 6739–45 (2010).

11. 11.

Pannetier, C. et al. The sizes of the CDR3 hypervariable regions of the murine T-cell receptor beta chains vary as a function of the recombined germ-line segments. Proc. Natl. Acad. Sci. USA 90, 4319–23 (1993).

12. 12.

Six, A. et al. The past, present, and future of immune repertoire biology - the rise of next-generation repertoire analysis. Front. Immunol. 4, 413 (2013).

13. 13.

Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H. & Holt, Ra Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome Res. 19, 1817–24 (2009).

14. 14.

Benichou, J., Ben-Hamo, R., Louzoun, Y. & Efroni, S. Rep-Seq: Uncovering the immunological repertoire through next-generation sequencing. Immunology 135, 183–191 (2012).

15. 15.

Mamedov, I. Z. et al. Preparing unbiased T-cell receptor and antibody cDNA libraries for the deep next generation sequencing profiling. Front. Immunol. 4, 456 (2013).

16. 16.

He, L. et al. Toward a more accurate view of human B-cell repertoire by next-generation sequencing, unbiased repertoire capture and single-molecule barcoding. Sci. Rep. 4, 6778 (2014).

17. 17.

Liu, X. et al. Systematic Comparative Evaluation of Methods for Investigating the TCRβ Repertoire. PLoS One 11, e0152464 (2016).

18. 18.

Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat. Methods 9, 72–74 (2011).

19. 19.

Shugay, M. et al. Towards error-free profiling of immune repertoires. Nat. Methods 11, 653–5 (2014).

20. 20.

Hou, D., Chen, C., Seely, E. J., Chen, S. & Song, Y. High-Throughput Sequencing-Based Immune Repertoire Study during Infectious Disease. Front. Immunol. 7, 336 (2016).

21. 21.

Maecker, H. T. et al. New tools for classification and monitoring of autoimmune diseases. Nat. Rev. Rheumatol. 8, 317–328 (2012).

22. 22.

Logan, A. C. et al. High-throughput VDJ sequencing for quantification of minimal residual disease in chronic lymphocytic leukemia and immune reconstitution assessment. Proc. Natl. Acad. Sci. 108, 21194–21199 (2011).

23. 23.

Wu, D. et al. High-throughput sequencing detects minimal residual disease in acute T lymphoblastic leukemia. Sci. Transl. Med. 4, 134ra63 (2012).

24. 24.

Covacu, R. et al. System-wide Analysis of the T Cell Response. Cell Rep. 14, 2733–2744 (2016).

25. 25.

Li, Z. et al. Comprehensive analysis of the T-cell receptor beta chain gene in rhesus monkey by high throughput sequencing. Sci. Rep. 5, 10092 (2015).

26. 26.

Fu, L. et al. A comprehensive profiling of T- and B-lymphocyte receptor repertoires from a Chinese-origin rhesus macaque by high-throughput sequencing. PLoS One 12, e0182733 (2017).

27. 27.

Nowak, M. A., Tarczy-Hornoch, K. & Austyn, J. M. The optimal number of major histocompatibility complex molecules in an individual. Proc. Natl. Acad. Sci. 89, 10896–10899 (1992).

28. 28.

Woelfing, B., Traulsen, A., Milinski, M. & Boehm, T. Does intra-individual major histocompatibility complex diversity keep a golden mean? Philos. Trans. R. Soc. Lond. B. Biol. Sci. 364, 117–28 (2009).

29. 29.

Vantourout, P. & Hayday, A. Six-of-the-best: unique contributions of γδ T cells to immunology. Nat. Rev. Immunol. 13, 88–100 (2013).

30. 30.

Migalska, M., Sebastian, A., Konczal, M., Kotlík, P. & Radwan, J. De novo transcriptome assembly facilitates characterisation of fast-evolving gene families, MHC class I in the bank vole (Myodes glareolus). Heredity (Edinb). 118, 348–357 (2017).

31. 31.

Sebastian, A., Herdegen, M., Migalska, M. & Radwan, J. amplisas: a web server for multilocus genotyping using next-generation amplicon sequencing data. Mol. Ecol. Resour. 16, 498–510 (2016).

32. 32.

Sadowska, E. T., Baliga-Klimczyk, K., Chrzaścik, K. M. & Koteja, P. Laboratory model of adaptive radiation: a selection experiment in the bank vole. Physiol. Biochem. Zool. 81, 627–40 (2008).

33. 33.

Sadowska, E. T. et al. Evolution of basal metabolic rate in bank voles from a multidirectional selection experiment. Proceedings. Biol. Sci. 282, 20150025 (2015).

34. 34.

Kotlík, P. et al. Adaptive phylogeography: functional divergence between haemoglobins derived from different glacial refugia in the bank vole. Proc. Biol. Sci. 281 (2014).

35. 35.

Konczal, M. et al. Genomic Response to Selection for Predatory Behavior in a Mammalian Model of Adaptive Radiation. Mol. Biol. Evol. 33, 2429–2440 (2016).

36. 36.

Mokkonen, M. et al. Negative frequency-dependent selection of sexually antagonistic alleles in Myodes glareolus. Science 334, 972–4 (2011).

37. 37.

Lonn, E. et al. Balancing selection maintains polymorphisms at neurogenetic loci in field experiments. Proc. Natl. Acad. Sci. USA 114, 3690–3695 (2017).

38. 38.

Vapalahti, O. et al. Hantavirus Infections in Europe. Lancet Infect. Dis. 3, 653–661 (2003).

39. 39.

Charbonnel, N. et al. Immunogenetic factors affecting susceptibility of humans and rodents to hantaviruses and the clinical course of hantaviral disease in humans. Viruses 6, 2214–41 (2014).

40. 40.

Bryja, J., Galan, M., Charbonnel, N. & Cosson, J. F. Duplication, balancing selection and trans-species evolution explain the high levels of polymorphism of the DQA MHC class II gene in voles (Arvicolinae). Immunogenetics 58, 191–202 (2006).

41. 41.

Axtner, J. & Sommer, S. Gene duplication, allelic diversity, selection processes and adaptive value of MHC class II DRB genes of the bank vole, Clethrionomys glareolus. Immunogenetics 59, 417–426 (2007).

42. 42.

Scherman, K., Råberg, L. & Westerdahl, H. Positive Selection on MHC Class II DRB and DQB Genes in the Bank Vole (Myodes glareolus). J. Mol. Evol. 78, 293–305 (2014).

43. 43.

Tschirren, B. et al. Polymorphisms at the innate immune receptor TLR2 are associated with Borrelia infection in a wild rodent population. Proc. R. Soc. B Biol. Sci. 280, 20130364–20130364 (2013).

44. 44.

Morger, J. et al. Distinct haplotype structure at the innate immune receptor Toll-like receptor 2 across bank vole populations and lineages in Europe. Biol. J. Linn. Soc. 116, 124–133 (2015).

45. 45.

Horner, D. S. et al. Phylogenetic analyses of complete mitochondrial genome sequences suggest a basal divergence of the enigmatic rodent Anomalurus. BMC Evol. Biol. 7, 16 (2007).

46. 46.

Honeycutt R. Rodents (Rodentia). In The Timetree of Life (eds Shedlock, A. & Edwards, S.) 490–494 (Oxford University Press, 2009).

47. 47.

Giudicelli, V. & Lefranc, M. P. Ontology for immunogenetics: the IMGT-ONTOLOGY. Bioinformatics 15, 1047–54 (1999).

48. 48.

Giudicelli, V., Chaume, D. & Lefranc, M.-P. IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res. 33, D256–61 (2005).

49. 49.

Olivieri, D., Faro, J., Von Haeften, B., Sánchez-Espinel, C. & Gambón-Deza, F. An automated algorithm for extracting functional immunologic V-genes from genomes in jawed vertebrates. Immunogenetics 65, 691–702 (2013).

50. 50.

Warren, R. L. et al. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome Res. 21, 790–7 (2011).

51. 51.

Laydon, D. J., Bangham, C. R. M., Asquith, B. & Crm, B. Estimating T-cell repertoire diversity: limitations of classical estimators and a new approach. (2015).

52. 52.

Vanhanen, R. et al. T cell receptor diversity in the human thymus. Mol. Immunol. 76, 116–122 (2016).

53. 53.

Chao, A. Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43, 783–791 (1987).

54. 54.

Scott-Browne, J. P. et al. Evolutionarily conserved features contribute to αβ T cell receptor specificity. Immunity 35, 526–35 (2011).

55. 55.

Garcia, K. Reconciling views on T cell receptor germline bias for MHC. Trends Immunol. 33, 429–436 (2012).

56. 56.

Jores, R. & Meo, T. Few V gene segments dominate the T cell receptor beta-chain repertoire of the human thymus. J. Immunol. 151, 6110–22 (1993).

57. 57.

Miles, J. J., Douek, D. C. & Price, D. A. Bias in the αβ T-cell repertoire: implications for disease pathogenesis and vaccination. Immunol. Cell Biol. 89, 375–87 (2011).

58. 58.

Manfras, B. J., Terjung, D. & Boehm, B. O. Non-productive human TCR beta chain genes represent V-D-J diversity before selection upon function: insight into biased usage of TCRBD and TCRBJ genes and diversity of CDR3 region length. Hum. Immunol. 60, 1090–100 (1999).

59. 59.

Li, H. et al. Recombinatorial biases and convergent recombination determine interindividual TCRβ sharing in murine thymocytes. J. Immunol. 189, 2404–13 (2012).

60. 60.

Ndifon, W. et al. Chromatin conformation governs T-cell receptor J gene segment usage. Proc. Natl. Acad. Sci. 109, 15865–15870 (2012).

61. 61.

Gulwani-Akolkar, B. et al. Do HLA genes play a prominent role in determining T cell receptor V alpha segment usage in humans? J. Immunol. 154, 3843–51 (1995).

62. 62.

Silberman, D. et al. Class II major histocompatibility complex mutant mice to study the germ-line bias of T-cell antigen receptors. Proc. Natl. Acad. Sci. 113, E5608–E5617 (2016).

63. 63.

Sharon, E. et al. Genetic variation in MHC proteins is associated with T cell receptor expression biases. Nat. Genet. 48, 995–1002 (2016).

64. 64.

Madi, A. et al. T-cell receptor repertoires share a restricted set of public and abundant CDR3 sequences that are associated with self-related immunity. Genome Res. 24, 1603–12 (2014).

65. 65.

Quigley, M. F. et al. Convergent recombination shapes the clonotypic landscape of the naive T-cell repertoire. Proc. Natl. Acad. Sci. USA 107, 19414–9 (2010).

66. 66.

Torelli, F. et al. Recombinant IFN-γ from the bank vole Myodes glareolus: a novel tool for research on rodent reservoirs of zoonotic pathogens. Sci. Rep. 8, 2797 (2018).

67. 67.

Jackson, J. A. Immunology in wild nonmodel rodents: an ecological context for studies of health and disease. Parasite Immunol. 37, 220–232 (2015).

68. 68.

Brochet, X., Lefranc, M.-P. & Giudicelli, V. IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res. 36, W503–W508 (2008).

69. 69.

Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).

70. 70.

Dheilly, N. M. et al. No more non-model species: The promise of next generation sequencing for comparative immunology. Dev. Comp. Immunol. 45, 56–66 (2014).

71. 71.

Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).

72. 72.

Menzel, U. et al. Comprehensive Evaluation and Optimization of Amplicon Library Preparation Methods for High-Throughput Antibody Sequencing. PLoS One 9, e96727 (2014).

73. 73.

Biedrzycka, A., Sebastian, A., Migalska, M., Westerdahl, H. & Radwan, J. Testing genotyping strategies for ultra-deep sequencing of a co-amplifying gene family: MHC class I in a passerine bird. Mol. Ecol. Resour. 17, 642–655 (2017).

74. 74.

Magoč, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957–63 (2011).

75. 75.

Sigrist, C. J. A. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3, 265–74 (2002).

76. 76.

Yanagi, Y., Chan, a, Chin, B., Minden, M. & Mak, T. W. Analysis of cDNA clones specific for human T cells and the alpha and beta chains of the T-cell receptor heterodimer from a human T-cell line. Proc. Natl. Acad. Sci. USA 82, 3430–4 (1985).

77. 77.

Malissen, M. et al. Mouse T cell antigen receptor: structure and organization of constant and joining gene segments encoding the beta polypeptide. Cell 37, 1101–10 (1984).

78. 78.

Robinson, J. et al. The IMGT/HLA database. Nucleic Acids Res. 41, D1222–7 (2013).

79. 79.

Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–9 (2006).

80. 80.

Glusman, G. et al. Comparative genomics of the human and mouse T cell receptor loci. Immunity 15, 337–49 (2001).

81. 81.

Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol. 30, 2725–9 (2013).

82. 82.

Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–9 (2012).

83. 83.

R Core Team. R: A Language and Environment for Statistical Computing. (2017).

84. 84.

Izraelson, M. et al. Comparative analysis of murine T-cell receptor repertoires. Immunology 153, 133–144 (2018).

## Acknowledgements

The authors sincerely thank P. Koteja for the donation of bank vole spleens. The authors also thank K. Dudek for running the Illumina MiSeq sequencing, O. Britanova for providing murine TCR dataset for software performance comparison and M. Konczal for comments on an earlier version of the manuscript. This work was funded by the National Science Centre (NCN) grant UMO-2013/08/A/NZ8/00153 awarded to J. Radwan, and supported by NCN PhD scholarship UMO-2017/24/T/NZ8/00088 awarded to M. Migalska.

## Author information

M.M. and J.R. designed the experiments, M.M. conducted the experiments, A.S. wrote the AmpliSAT suite, A.S. and M.M. conducted the bioinformatics analysis, and M.M., A.S. and J.R. analysed results and wrote the manuscript.

Correspondence to Magdalena Migalska.

## Ethics declarations

### Competing Interests

The authors declare no competing interests.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions