To the Editor:
High-throughput sequencing is gaining importance in adaptive immunity studies, demanding efficient software solutions for immunoglobulin (IG) and T-cell receptor profiling [1]. Here we report MiXCR (available at http://mixcr.milaboratory.com/ and https://github.com/milaboratory/mixcr/), a universal framework that processes big immunome data from raw sequences to quantitated clonotypes. MiXCR efficiently handles paired- and single-end reads, considers sequence quality, corrects PCR errors and identifies germline hypermutations. The software supports both partial- and full-length profiling and employs all available RNA or DNA information, including sequences upstream of V and downstream of J gene segments (Fig. 1, Supplementary Note 1 and Supplementary Table 1).
In contrast with previous software [2–5], MiXCR employs an advanced alignment algorithm that processes tens of millions of reads within minutes, with accurate alignment of gene segments even in a severely hypermutated context (Supplementary Note 2 and Supplementary Tables 2–6). In paired-end sequencing analysis, MiXCR aligns both reads and aggregates information from both alignments to achieve high V and J gene assignment accuracy. It handles mismatches and indels and thus is suitable even for sequences with many errors and hypermutations. MiXCR employs a built-in library of reference germline V, D, J and C gene sequences for human and mouse based on corresponding loci from GenBank [6].
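The idea of pooling evidence from both mates of a read pair can be illustrated with a minimal Python sketch. This is not MiXCR's alignment or scoring code: the function name `aggregate_paired_hits`, the score dictionaries and the simple summation rule are assumptions made purely for illustration.

```python
def aggregate_paired_hits(hits_r1, hits_r2):
    """Combine per-gene alignment scores from the two mates of a read pair.

    hits_r1, hits_r2: dicts mapping a gene name to an alignment score
    (hypothetical inputs; any scoring scheme where higher is better works).
    Returns (best_gene, combined_scores); best_gene is None if no hits exist.
    """
    combined = dict(hits_r1)
    for gene, score in hits_r2.items():
        # A gene supported by both mates accumulates evidence from each.
        combined[gene] = combined.get(gene, 0) + score
    if not combined:
        return None, combined
    # Iterate over sorted names so that ties are broken deterministically.
    best_gene = max(sorted(combined), key=combined.get)
    return best_gene, combined


# Read 1 alone slightly favours IGHV1-69; read 2 favours IGHV1-2.
r1 = {"IGHV1-69": 120, "IGHV1-2": 118}
r2 = {"IGHV1-2": 90, "IGHV1-69": 60}
best, scores = aggregate_paired_hits(r1, r2)
```

In this toy case the first mate on its own gives a near-tie, and the second mate resolves the assignment, which is the benefit of aggregating both alignments.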
MiXCR further assembles identical and homologous reads into clonotypes, correcting for PCR and sequencing errors using a heuristic multilayer clustering. Additionally, it rescues low-quality reads by mapping them to previously assembled high-quality clonotypes [7] to preserve maximal quantitative information (Supplementary Note 3). The Illumina MiSeq platform currently allows for deep full-length IG repertoire profiling with ~20 million long paired-end reads. MiXCR captures all complementarity-determining regions (CDRs) and framework regions of immune genes and permits the assembly of full-length clonotypes. It also offers the flexibility to analyze partial-length data, allowing users, for example, to group reads into clonotypes on the basis of CDR1 and CDR3 or of CDR3 sequence only. MiXCR produces report files with overall run statistics. A clonotype list with detailed information as well as intermediate alignment results can be exported to tab-delimited text files.
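The assembly steps described above (grouping identical reads, absorbing likely error variants, and rescuing low-quality reads) can be sketched in Python as follows. This is a simplified illustration and not MiXCR's actual clustering: the function names, the Hamming-distance criterion and the thresholds `min_quality`, `max_mismatches` and `max_ratio` are all assumptions chosen for the example.

```python
from collections import Counter


def is_close(a, b, max_mismatches):
    """True if a and b have equal length and differ in at most max_mismatches positions."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


def assemble_clonotypes(reads, min_quality=20, max_mismatches=1, max_ratio=0.1):
    """Group reads into clonotypes by CDR3 sequence.

    reads: iterable of (cdr3_sequence, lowest_base_quality) pairs (hypothetical input).
    Returns a dict mapping each clonotype CDR3 sequence to its read count.
    """
    reads = list(reads)
    # Layer 1: identical high-quality sequences form the initial clonotypes.
    counts = Counter(seq for seq, qual in reads if qual >= min_quality)

    # Layer 2: visit clonotypes from largest to smallest; a rare sequence that is
    # close to a much larger clonotype is treated as a PCR or sequencing error.
    clonotypes = {}
    for seq in sorted(counts, key=lambda s: (-counts[s], s)):
        parent = next(
            (p for p in clonotypes
             if is_close(seq, p, max_mismatches)
             and counts[seq] <= max_ratio * clonotypes[p]),
            None,
        )
        if parent is None:
            clonotypes[seq] = counts[seq]
        else:
            clonotypes[parent] += counts[seq]

    # Rescue: a low-quality read is counted only if it maps to exactly one clonotype.
    for seq, qual in reads:
        if qual < min_quality:
            hits = [p for p in clonotypes if is_close(seq, p, max_mismatches)]
            if len(hits) == 1:
                clonotypes[hits[0]] += 1
    return clonotypes


example_reads = (
    [("CASSLGQ", 30)] * 20   # abundant clonotype
    + [("CASSLGR", 30)]      # single read, one mismatch: likely an error variant
    + [("CASSXGQ", 5)]       # low-quality read, rescued by mapping
    + [("CAWTTT", 30)] * 3   # unrelated clonotype
)
result = assemble_clonotypes(example_reads)
```

Ambiguous low-quality reads (those matching more than one clonotype) are discarded in this sketch rather than assigned arbitrarily, so that rescue adds quantitative information without inflating any single clonotype.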
MiXCR was tested against real data as well as synthetically generated data with allele frequencies, immunoglobulin somatic hypermutation, and PCR and sequencing error rates closely resembling those of real data. The extraction efficiency and accuracy were comparable to or better than those of existing tools in the field, and the execution on IG data was between two and four orders of magnitude faster (Supplementary Tables 2–6 and Supplementary Notes 4 and 5). The software has a simple-to-use command-line interface, is easy to install using cross-platform binaries, works with different data formats and handles output from various sequencing platforms. MiXCR is free for scientific and nonprofit use.
1. Nat. Immunol. 15, 118–127 (2014).
2. Nat. Methods 10, 813–814 (2013).
3. Bioinformatics 29, 542–550 (2013).
4. Nucleic Acids Res. 41, W34–W40 (2013).
5. Nat. Commun. 4, 2333 (2013).
6. Nucleic Acids Res. 41, D36–D42 (2013).
7. Eur. J. Immunol. 42, 3073–3083 (2012).
This work was supported by the Russian Science Foundation (project no. 14-14-00533).
Supplementary information: Supplementary Tables 1–6 and Supplementary Notes 1–5.