RNA sequencing identifies clonal structure of T-cell repertoires in patients with adult T-cell leukemia/lymphoma

The diversity of T-cell receptor (TCR) repertoires, as generated by somatic DNA rearrangements, is central to immune system function. High-throughput sequencing technologies now allow examination of antigen receptor repertoires at single-nucleotide and, more recently, single-cell resolution. The TCR repertoire can be altered in the context of infections, malignancies or immunological disorders. Here we examined the diversity of TCR clonality and its association with pathogenesis and prognosis in adult T-cell leukemia/lymphoma (ATL), a malignancy caused by infection with human T-cell leukemia virus type-1 (HTLV-1). We analyzed 62 sets of high-throughput RNA sequencing data from 59 samples of HTLV-1−infected individuals—asymptomatic carriers (ACs), smoldering, chronic, acute and lymphoma ATL subtypes—and three uninfected controls to evaluate TCR distribution. Based on these TCR profiles, CD4-positive cells and ACs showed polyclonal patterns, whereas ATL patients showed oligo- or monoclonal patterns (with 446 average clonotypes across samples). Expression of TCRα and TCRβ genes in the dominant clone differed among the samples. ACs, CD4-positive samples and smoldering patients showed significantly higher TCR diversity compared with chronic, acute and lymphoma subtypes. CDR3 sequence length distribution, amino acid conservation and gene usage variability for ATL patients resembled those of peripheral blood cells from ACs and healthy donors. Thus, determining monoclonal architecture and clonal diversity by RNA sequencing might be useful for prognostic purposes and for personalizing ATL diagnosis and assessment of treatments.

Our criteria for classifying samples in four different groups are based on a relative comparison of the expression of TRB and TRA. In a given sample, a dominant clone is a clone of which the largest relative and absolute read counts of TRB or TRA were more than 10% and 100 reads, respectively. We classified these patterns as being either (I) monoclonal, with a single dominant clone, or (II) non-monoclonal. Note that TRB and TRA are expressed concurrently in the same cell, and thus the monoclonal pattern was further divided into three groups: (I)-Group 1: TRB >> TRA; (I)-Group 2: TRB << TRA; (I)-Group 3: TRB  TRA. The non-monoclonal pattern made up Group 4, which contained the rest of the samples.  Supplementary Table-1:The sample information including ID abbreviations, TRA and TRB rearrangements across analysed samples   sample gender age  sub-type  type  PVL  TRA-nu  TRA-AA TRAV  TRAD  TRAJ  TRA-len TRB-nu  TRB-AA TRBV  TRBD  TRBJ  TRB-len  clonotypes reads*  Gini-Simpson  ATL05 M  60 Acute  aggressive 47.296 TGTGCTCTTAAGAACCAGGGAGGAAAGCTTATCTTC CALKNQGGKLIF TRAV9-2 .  TRAJ23  12 TGTGCCAGCAGACAAGATCCGGGACTAGCGGGAGCAGATACGCAGTATTTT CASRQDPGLAGADTQYF TRBV4-2 TRBD2  TRBJ2-3  17  583  15545 0.5333584  ATL11 F  72 Acute  aggressive 141.74 TGTGCTTTCATGGATGCTGGCAACAACCGTAAGCTGATTTGG CAFMDAGNNRKLIW TRAV38-1 .  TRAJ38  14 TGTGCCAGCAGCTTAAATCAGGACACTGAAGCTTTCTTT CASSLNQDTEAFF TRBV11-3 .  TRBJ1-1  13  81  6948 0.4667009  ATL14 F  60 Acute  aggressive 114.13 TGTGCAGAGAGTAGCCAGGGAGGAAAGCTTATCTTC CAESSQGGKLIF TRAV5  .  TRAJ23  12 TGTGCCAGCAGCACCGGACAGGGGACTGAAGCTTTCTTT CASSTGQGTEAFF TRBV11-3 TRBD1  TRBJ1-1  13  148  Overview of the analysis pipeline: Each biological sample of blood or tissue will typically have thousands or millions of T/B-cell receptors (TCRs/BCRs) [1,2]. This has prevented a global analysis of the full TCR/BCR repertoire using conventional DNA sequencing. The objective of a typical analysis pipeline within this context is to detect as completely as possible the arranged TCR/BCR gene sequences and to accurately determine their abundance within a sample [1].
Compared with conventional multiplex PCR, RNA-seq decreases the PCR bias that may result from different efficiencies of primers for different V and J genes [2].
Because clinical samples (blood or tissue) are often available in limited amounts, it is often not possible to split a sample for separate transcriptome and TCR/BCR profiling; extracting TCR/BCR transcripts that are present in bulk RNA-seq data is therefore highly advantageous.
Given that whole RNA-seq data contain only a small fraction of TCR/BCR reads (one per 10 5 −10 7 reads) depending on the degree of immune cell infiltration/expansion, a reliable tool that enables sensitive and accurate extraction of clonotypes is necessary [1].
In the current study, we started from whole-transcriptome sequencing data prepared without any initial enrichment as a starting input in FASTQ format. In general, RNA-seq has a wide variety of applications, scientists plan experiments and optimize analysis workflows depending on their research goal [3]. In the current manuscript, we took advantage of an alternative workflow that enables simultaneous analysis and quantification of TCR and BCR profiles in a single high throughput sequencing assay.
Currently MiXCR [1] is the most appropriate software for this kind of analysis [4].
MiXCR is a well-accepted tool for extracting BCR and TCR clonotypes from raw NGS data. MiXCR allows millions of reads to be accurately and rapidly assigned to their respective variable V and J gene segments. MiXCR accepts either FASTA or FASTQ files as input. There is no limitation in the maximum size of the input file (the number of sequencing reads), and thus there is no need for adjusting the input size. MiXCR is distributed as a standalone program that works with the command line. A quality control procedure includes filtering and error correction steps that are built into the analysis pipeline. The ram requirement is 2 Gb. Average time of analysis for 10 8 reads is about 4 hours. There is no dependency on external software. MiXCR can analyze the TCR/BCR repertoire from different species including human, mouse and rat [1].

Data source origin: Genomic or transcriptomic data are acceptable for analysis by
MiXCR [1]. In TCR/BCR profiling, using RNA (transcriptomic data) as the starting material is preferable to using DNA (genomic data), because RNA contains the final TCR/BCR transcripts. A uniform quality and quantity of starting material (RNA) also ensures a positive analysis outcome [2]. MiXCR, by default, assumes that the source of data is transcriptomic. MiXCR accepts sequencing input of different lengths (50-bp, 75-bp, 100-bp; paired end or single end), although 100-bp paired-end sequencing libraries can result in better yields (i.e., a higher number of clonotypes having fully matched CDR3s without mismatches or indels) [1].

Alignment:
The FASTQ files need to be processed to result in meaningful biological information. This requires several stages of raw data processing. The first stage is to match the sequences to the known genomic reference (assign each TCR/BCR to its germline component gene sections). MiXCR performs this stage with its built-in alignment algorithm without requiring external alignment tools. MiXCR uses the KAligner algorithm, which is a particular version of the K-mer algorithm [1,4].
MiXCR aligns the input data to the genomic sequences of V, D and J genes and filters out non-aligned reads before collecting the aligned reads.
Assembly: The assemble command builds clonotypes from alignments obtained with the built-in alignment algorithm. MiXCR assembles the reads containing fragments of CDR3. TCR sequences that contain defined V and J genes but do not fully cover the ends of the CDR3 region are extended using germline sequences.
Error correction: Error correction is a critical processing step that ensures the reliability of the final output. The common approach of trimming low-quality sequencing reads (based on the Phred quality score) is not suitable for the high-throughput analysis of TCR and BCR profiles. Because this approach typically removes only errors produced during base-calling, such an unsupervised quality filtration method can potentially be biased with respect to the removal of particular TCR/BCR rearrangements over others [5]. Therefore, for this kind of analysis, MiXCR uses a sophisticated error-editing approach based on clustering, through which MiXCR identifies similar sets of sequences and then absorbs the rarer member of each cluster into the more common ones. With this approach, MiXCR corrects the artificial diversity resulting from PCR and sequencing errors [1,4,6]. The default minimal value for the sequencing quality score is 20; nucleotides with a score of <20 are considered as "bad".
If a sequencing read contains at least one bad nucleotide within the target gene region, it is left out of the initial assembly stage and is further processed by the mapper [1]. For further information, please refer to the MiXCR documentation in https://mixcr.readthedocs.io/en/master/.
Reporting the results: MiXCR reports the list of identified clonotypes, their abundance and VDJ gene composition. It can export all clonotypes as well as clonotypes for a specific immunological chain. The tab-delimited output of MiXCR is parsable as input for the tcR and VDJ tools.
Accuracy and efficiency of analysis: RNA-seq TCR profiling is quantitative for clonotypes that make up >0.1% of the overall TCR repertoire. MiXCR can efficiently extract 90−100 TRB CDR3 reads per million unique reads. By analyzing 3  10 7 reads from 500 sorted T-cells, MiXCR can identify ~350−450 distinct clonotypes. This shows that MiXCR can accomplish almost complete repertoire extraction. In addition, software testing of MiXCR with data generated in silico has revealed a high extraction efficiency with no false positive clones observed [1].
The appropriate sequencing depth for each study depends on its goals. Deep sequencing is preferable for studies that aim to carry out an extensive analysis of the repertoire of cohorts and populations, whereas sequencing at a lower depth might be preferable for identification of already known or abundant clonotypes [2].

Terminology:
TCR profile: The sum of all TCRs present on the T cells of one individual.
Clonotype: A clonotype is a unique nucleotide sequence that arises during the gene rearrangement process for that receptor.
Spectratype: A histogram of clonotypes binned by CDR3 length and variable segment.

Comparing the Gini-Simpson index
Supplementary