Somatic mutations in lymphocytes in patients with immune-mediated aplastic anemia

The prevalence and functional impact of somatic mutations in nonleukemic T cells is not well characterized, although clonal T-cell expansions are common. In immune-mediated aplastic anemia (AA), cytotoxic T-cell expansions are shown to participate in disease pathogenesis. We investigated the mutation profiles of T cells in AA by a custom panel of 2533 genes. We sequenced CD4+ and CD8+ T cells of 24 AA patients and compared the results to 20 healthy controls and whole-exome sequencing of 37 patients with AA. Somatic variants were common both in patients and healthy controls but enriched to AA patients’ CD8+ T cells, which accumulated most mutations on JAK-STAT and MAPK pathways. Mutation burden was associated with CD8+ T-cell clonality, assessed by T-cell receptor beta sequencing. To understand the effect of mutations, we performed single-cell sequencing of AA patients carrying STAT3 or other mutations in CD8+ T cells. STAT3 mutated clone was cytotoxic, clearly distinguishable from other CD8+ T cells, and attenuated by successful immunosuppressive treatment. Our results suggest that somatic mutations in T cells are common, associate with clonality, and can alter T-cell phenotype, warranting further investigation of their role in the pathogenesis of AA.


Sample preparation and immunogene panel sequencing
Mononuclear cells (MNC) were enriched from peripheral blood or bone marrow samples with Ficoll density gradient centrifugation (Ficoll-Paque PLUS, GE Healthcare). CD8+ and CD4+ T cell fractions were separated via positive selection by immunomagnetic bead sorting (Miltenyi Biotec) according to manufacturer's instructions or by flow cytometry (FACSAria II, Beckton Dickinson).

Identification of potentially pathogenic somatic variants
Variant identification was performed with Genome Analysis Toolkit (GATK) and involved an initial variant discovery on individual samples and a subsequent genotyping of all variants across all samples. Briefly, pre-processing of sequencing data, read mapping to GRCh38 reference, and discovery of short somatic variants followed our previous procedure (2,3). Variants were discovered in tumor-only mode without a paired normal as well as by pairing samples with their matched counterparts (i.e. CD4+ datasets paired with CD8+ datasets and vice versa and CD3+ datasets paired with CD3neg datasets and vice versa). MuTect2 was used to make variant calls.
Recurrent variant calling artifacts were in exome sequencing analyses filtered using a panel of normals (PON) created from the exome data of 24 healthy unrelated Finnish individuals and in panel sequencing analyses using a panel of normals created from immunogene panel healthy T cell samples. For the genotyping analysis, variant calls passing filters and supported by ten or more reads were aggregated across samples and supplemented by a set known STAT3 lost-of-function variants 4 and hotspot variants (detected ≥30 samples and constituting ≥1% of the listed mutations of the each gene in COSMIC v90 (4)) in genes recurrently mutated in AA (5)  Sequencing coverage was computed using Samtools (7) depth and by restricting coverage computation to panel regions and/or known RefSeq exons.

Somatic mutation burden analysis
As variants seen recurrently in healthy T cell samples were used to filter variants, an adjusted somatic mutation burden was computed for each sample by taking into account only singleton variants passing the filtering process. Adjusted somatic mutation burden was calculated separately for LP variants and fraction-specific variants in CD4+ and CD8+ T cells of each patient. Total number of variants was divided by the number of exonic bases with a depth ≥30 per sample. In group comparisons, outliers were excluded by removing samples whose adjusted mutation burden was more than 1.5 fold above 3 rd or below 1 st quartile.
Variant effect and conservation results were aggregated using a majority rule (i.e. the variant was deemed as damaging if more than 50% of predictions categorized it as "damaging", "pathogenic", "possibly pathogenic", "medium" or "high").

Mutational signature analysis
All synonymous and non-synonymous variants passing the filtering process were used in mutational signature analysis. Identification of mutational signatures was done using the deconstructSigs18 software with default parameters and using cancer profiles downloaded from the COSMIC web site on September 2017. Function mapSeqlevels from the package GenomeInfoDb was used to convert EnsEMBL chromosome nomenclature to UCSC nomenclature.

Amplicon validation
Mutations found in the immunopanel sequencing (Supplementary table S4) were validated by amplicon sequencing as previously described (21)  were washed with 2ml of PBS-BSA buffer (300g/5min), resuspended to RPMI and kept on ice before and after sorting. Sorting was performed with BD Influx Cell sorter with gating strategy as presented in Figure S2. 300,000 target cells were collected to Protein Lo-Bind tubes (Eppendorf, cat. no 0030108). After sorting, single-cell samples were partitioned using a Chromium Controller (10X Genomics) and scRNA-seq and TCRαβ-libraries were prepared using Chromium Single Cell and sample index PCR (14 and 9 cycles, respectively). All libraries were sequenced using NovaSeq 6000 system (Illumina), S1 flowcell with the following read length configuration for gene expression libraries: Read1=26, i7=8, i5=0, Read2=91. Length configurations used for TCRenriched libraries: Read1=150, i7=8, i5=0, Read2=150. The raw data was processed using Cell Ranger 3.0.1 pipelines. 10X Genomix pipelines "cellranger mkfastq" was used to produce FASTQ (raw data) files, "cellranger count" to perform alignment, filtering and UMI counting for the 5' gene expression data and "cellranger vdj" to perform V(D)J sequence assembly and paired clonotype calling for the V(D)J data.

Single-cell RNA-sequencing data analysis
Cells were subject to quality control. Cells with a high amount of mitochondrial transcripts (>15% of all UMI counts) or ribosomal transcripts (>50%), cells with less than 100 genes or over 4500 genes expressed, cells expressing a low or high (<25% or >60%) amount of house-keeping genes, or cells with a low or high read depth (<500 or >30 000) were excluded. Quality control measures are illustrated in Figure S3 for AA-4 and Figure S4 for AA-3. In total, 7 822 -15 651 cells per sample were captured.
To overcome batch-effect, we used a recently described probabilistic framework to overcome different nuisance factors of variation in an unsupervised manner with deep generative modelling as described elsewhere (22). Briefly, the transcriptome of each cell is encoded through a nonlinear transformation into a low-dimensional, batch corrected latent embedding. The latent embedding was then used for graph-based clustering implemented in Seurat              (attached as an excel file)