A potent adaptive immune system is fundamentally reliant upon the generation of a diverse repertoire of B-lymphocyte antigen receptors (BCRs, the membrane-bound form of antibodies expressed on the surface of B cells). BCRs are assembled by somatic recombination of a large number of immunoglobulin gene segments (Fig. 1), and the repertoire of BCRs expressed in any given individual is continuously shaped by exposure to exogenous antigens and endogenous host factors. Existing mechanisms for BCR diversification can yield an astronomical number of possible BCRs (in theory, >10^13 in humans)1,2; this number exceeds the total number of B lymphocytes in the human body (1–2 × 10^11) (ref. 3). Because of labor and cost considerations, it is impractical to analyze such a diverse BCR repertoire using traditional Sanger sequencing. However, high-throughput sequencing of immunoglobulin genes (Ig-seq, a term coined by Andrew Fire, Stanford University) has allowed us to determine antibody gene repertoires at an unprecedented depth. The information gained by Ig-seq is proving invaluable for understanding antibody responses in health and disease and for diagnostic purposes. In addition, Ig-seq can be combined with other techniques, including expression and isolation of antigen-specific antibodies, sequencing of multiple RNAs from single cells4, and proteomic analyses of antibodies in blood or secretions, to help elucidate the properties of antibodies that mediate protection against infectious diseases or, alternatively, that mediate autoimmune responses. In this Review we describe the experimental approaches and technical challenges related to high-throughput antibody gene sequencing, as well as the ways in which Ig-seq might be applied to advance our understanding of immunology and to address unmet clinical needs related to infectious diseases, immune dysregulation and cancer.

Figure 1: Antibody structure and sequence diversification mechanisms.

(a) Schematic of IgG structure. In the top chains, domains encoded from germline V, D, J and C segments are indicated. Nontemplated N-nucleotides are shown in red. These top chains delineate the 5′ to 3′ genetic composition of the antibody. In the bottom chains, framework (FR) and complementarity-determining regions (CDRs) are indicated. These bottom chains delineate the N-terminal to C-terminal protein sequence. Dashed lines denote disulfide bonds. (b) Key steps in antibody diversification. The primary antibody heavy chain repertoire is created predominantly by the somatic recombination of variable (V), diversity (D) and joining (J) gene segments, and by the random nontemplated addition of N-nucleotides. The antigen-binding site of a heavy chain is formed by the juxtaposition of the hypervariable complementarity-determining regions (CDR-H1, H2 and H3) and the framework 3 region (FR3). After productive IgH rearrangement, recombination of the light chain (IgL) ensues, and the heterodimeric pairing of H and L chains forms the complete antibody of the IgM isotype that is expressed on the surface of a newly formed immature B cell. Eμ: IgM intronic enhancer; Sμ: tandem repeats critical for class-switch recombination. Numbers in parentheses refer to estimates of human germline VH, DH and JH segments.

Generation of the antibody repertoire

Antibodies are produced by a developmentally ordered series of somatic gene rearrangement events that occur exclusively in developing B cells and continue throughout the life of an organism. Antibodies consist of heavy (μ, α, γ, δ, ɛ) and light chains (κ, λ), which are linked by disulfide bonds. The intact antibody contains variable and constant domains (Fig. 1a). Antigen binding occurs in the variable domain, which is generated by recombination of a finite set of tandemly arranged variable (V), diversity (D) and joining (J) germline gene segments (Fig. 1b). This process, called VDJ recombination, often results in the addition and deletion of nucleotides at the junctions between ligated gene segments (Fig. 1b). More specifically, DNA exonucleases can trim the ends of the gene segments, and DNA polymerases and transferases can randomly insert templated palindromic or nontemplated nucleotides, respectively.

During B-cell development, immunoglobulin heavy (IgH) chain gene recombination typically occurs before immunoglobulin light (IgL) chain gene recombination. If both IgH and IgL genes are productively rearranged, the fully assembled antibody heterodimer is expressed on the surface of the B cell. In B cells bearing productively rearranged antibodies, the process of allelic exclusion (and locus exclusion in the case of IgL) ensures that each B cell expresses a single antibody5. After passage through several developmental checkpoints, newly generated mature IgM+IgD+ B cells form the naive B cell (and, therefore, naive antibody) repertoire. Most of the diversity in the naive antibody repertoire is concentrated at the site of IgH VDJ gene segment ligation, also known as the IgH complementarity-determining region 3 (CDR-H3) (Fig. 1b). Because of the combinatorial and nontemplated nature of the mechanisms that generate the CDR-H3, it is the most diverse component in terms of length and sequence of the antibody H-chain repertoire and is a principal determinant of antibody specificity6,7. Nonetheless, there are instances where antigen specificity is dictated solely or predominantly by the L chain.

When a B cell encounters antigen in an environment that provides requisite co-stimulatory signals and T-cell help, BCR stimulation induces B-cell proliferation. This process, known as B-cell clonal expansion, occurs primarily in highly organized areas of secondary lymphoid organs (e.g., spleen, lymph nodes and Peyer's patches8,9) referred to as germinal centers (Fig. 2). Clonal expansion is followed by somatic hypermutation of the variable domains of antibodies; this is mediated by activation-induced cytidine deaminase. B cells expressing BCRs bearing somatic mutations that increase affinity for antigen outcompete other B cells for access to antigen. As a result, the B cells bearing the highest-affinity antibodies undergo preferential expansion and survival, a process referred to as affinity maturation. Somatic hypermutation also results in sequence diversification of the CDR-H1 and CDR-H2 hypervariable regions and of the framework 3 (FR3) region, which was proposed to constitute a fourth hypervariable region of the antibody H chain10. Activation-induced cytidine deaminase also mediates class-switch recombination, which generates antibodies bearing different constant regions. B cells expressing somatically mutated, high-antigen-affinity BCRs can differentiate into long-lived memory B cells, capable of mediating rapid recall responses to the same antigen, or into terminally differentiated plasma cells; the latter downregulate BCR expression, establish residency in the bone marrow and gut lamina propria (and, to a lesser degree, in secondary lymphoid tissues), and secrete protective antibodies at extremely high rates estimated at 10,000–20,000 antibody molecules per second11. Antibody production by long-lived plasma cells in the bone marrow is postulated to proceed for very long times, possibly throughout the entire lifetime of the organism.

Figure 2: Key steps in the development of antigen-specific B cells.

The steps of normal B-cell differentiation and diversification of the antibody repertoire are indicated in black text. Normal B cells are generated in the bone marrow, migrate to the periphery and, following developmental checkpoint selection, comprise the population of IgM+IgD+ mature naive B cells. When these cells are activated by cognate antigen in the presence of T-cell help, they enter a germinal center (GC) reaction where they rapidly proliferate; this results in clonal expansion and subsequent somatic hypermutation catalyzed by activation-induced cytidine deaminase. B cells bearing antibodies with high affinity for cognate antigen and that survive the GC reaction can undergo class-switch recombination to IgG, IgA or IgE isotypes and ultimately differentiate into memory B cells, antibody-secreting plasmablasts or plasma cells. After subsequent encounter with the same cognate antigen, memory B cells can proliferate or differentiate directly into antibody-secreting cells. Steps that proceed abnormally, leading to the development of human B-cell leukemias and lymphomas, are indicated in red text. ALL, acute lymphoblastic leukemia; CLL, chronic lymphocytic leukemia; MCL, mantle cell lymphoma; GC-DLBCL, germinal center diffuse large B cell lymphoma; FL, follicular lymphoma; ABC-DLBCL, activated B cell–like DLBCL; MGUS, monoclonal gammopathy of undetermined significance; MM, multiple myeloma. B-cell malignancies that have not been analyzed extensively by high-throughput sequencing are shown in parentheses.

Diversity in the primary antibody repertoire (before exogenous antigen exposure) stems from the allelic diversity in immunoglobulin gene segments, combinatorial diversity introduced during somatic recombination, junctional diversity caused by the imprecision of the recombination process, pairing of IgH and IgL polypeptide chains, and receptor editing, wherein the existing V-gene segment is replaced with another (Fig. 1). In addition, VH replacement, a process resulting from the presence of a cryptic recombination signal sequence in FR3, might influence as much as 5–12% of the human primary B-cell antibody repertoire12. Diversification of the post-antigen-stimulation secondary antibody repertoire stems from somatic hypermutation and class-switch recombination.

Organism age also influences the antibody repertoire13,14. During early ontogeny, the mammalian adult B-cell repertoire is generated in a predictable developmentally programmed fashion, whereas in advanced age humoral immune responsiveness deteriorates; this phenomenon is referred to as immunosenescence and is thought to be attributable in part to a progressive restriction of the antibody repertoire. For example, among the elderly there is an increased prevalence of autoantibodies and, at the serological level, an increased amount of either a single or a small number of serum immunoglobulins that are produced at a high level by benign outgrowths of one or more plasma cell clones15,16.

Low-throughput analysis of the antibody repertoire

In the 1990s, Sanger sequencing enabled the determination of IgH and IgL VDJ recombinants (hereafter simply referred to as V genes) in typically up to a few hundred B cells per experiment17,18,19. Subsequent studies began to clone immunoglobulin genes from single B cells following limiting dilution, and to express and functionally characterize the cloned antibody proteins, thus enabling the interrogation of antibody specificity. This advance proved invaluable for the isolation of antibodies relevant to disease, especially the isolation of pathogen-neutralizing antibodies20. B-cell immortalization (and subsequent sequencing of the V genes in immortalized B clones) provided an alternative route for the expression of small numbers of antibodies21,22,23.

The ability to interrogate the antibodies encoded by small numbers of B cells has yielded numerous important immunological insights. For example, Wardemann et al.24 first used single-cell cloning and antibody expression to demonstrate that a fraction of newly generated B cells in the human bone marrow express self- and polyreactive antibodies and that their development is regulated at two independent self-tolerance checkpoints. Subsequent studies from our laboratory and from others elucidated mechanisms and features of antibodies associated with autoimmune diseases or immunodeficiency syndromes25,26,27,28. B-cell cloning techniques also enabled the isolation of antibodies able to neutralize numerous clinically important pathogens including severe acute respiratory syndrome coronavirus (SARS-CoV), influenza and HIV-1, among many others29,30,31,32,33,34,35. Understanding the mechanisms that lead to the elicitation of neutralizing antibodies is, in turn, helping researchers design more effective vaccines20,36,37,38. Nonetheless, one key limitation of low-throughput B-cell cloning studies has been that they provide only a glimpse of a minuscule slice of the full antibody repertoire.

High-throughput sequencing of the antibody repertoire

Compared with Sanger sequencing, Ig-seq can provide a much broader picture of the antibody repertoire (Fig. 3a). Although conceptually simple, for its proper application, Ig-seq demands thoughtful consideration of experimental design, a detailed understanding of the sources of DNA sequence and quantification errors, the ability to delineate which VH genes are paired with which VL genes in each single B cell, and the use of appropriate data mining and visualization tools to make biological sense of the large amounts of information generated in such experiments.

Figure 3: Methods for high-throughput sequencing of the Ig sequence repertoire.

(a) Schematic of steps involved in high-throughput sequencing of Ig genes from bulk B-cell populations or B-cell subsets sorted according to expression of indicated cell-surface markers. Either genomic DNA (gDNA) or mRNA can be used as template, and the choice of template influences the number and location of primers used for subsequent PCR amplification. gDNA amplification is performed using primers complementary to the rearranged V-region gene (VDJ recombinant); amplification of cDNA is performed either using a 5′ primer pool complementary to the leader peptides or FR1s of V-gene segments, and a single 3′ CH1 (or Cκ,Cλ if amplifying light chain genes) primer, or alternatively by 5′ RACE. Although throughput is high, in bulk analysis information regarding which VH and VL chains were paired in the same cell is lost, as cells are lysed in bulk and VH and VL genes are amplified in separate reactions. (b) Schematic of single-cell immunoglobulin repertoire sequencing methods, which preserve information about endogenous VH:VL pairs. Left panel: B-cell lysis and mRNA capture in picoliter well arrays. Middle panel: single-cell PCR following limited B-cell dilution and amplification using barcoded primers. Right panel: microfluidic barcoding of VH and VL cDNAs. Ig, immunoglobulin.

Experimental design.

The first consideration is the source of B cells. Most human antibody sequencing studies have used B cells from peripheral blood because blood is one of the few readily accessible sources of B cells in humans (tonsils are another). However, it is estimated that only 2% of the 1–2 × 10^11 B cells in the human body are present in peripheral blood, compared with almost 28% in lymph nodes, 23% in the spleen and on mucosal surfaces, and 17% in the red bone marrow3. Thus, the antibody repertoire in peripheral B cells provides only a narrow view of the humoral response to antigen challenge.

Second, it is important to consider whether to use genomic DNA (gDNA) or mRNA for immunoglobulin sequencing analyses (Fig. 3a). The choice between gDNA and mRNA depends on the question being asked. Sequencing gDNA facilitates estimation of the clonality of a given Ig sequence (in other words, the number of B cells expressing that antibody) because the number of sequence reads will, in general, be proportional to the number of gDNA template molecules (assuming no primer biases, as discussed below). On the other hand, using mRNA as a template can provide an estimate of the relative expression level of various immunoglobulin sequences in the repertoire. However, because immunoglobulin transcription varies dramatically (up to 100-fold) between naive B cells and plasma cells (Fig. 2)39, using unsorted bulk B cells from peripheral blood as the source of mRNA makes it challenging to deduce cellular clonal frequencies. When unsorted peripheral blood mononuclear cells are used as a source of mRNA, the degree of somatic hypermutation in sequencing data sets may be used to distinguish V-gene transcripts derived from antigen-experienced versus naive B cells40. Alternatively, cell-surface markers can be used to sort B-cell subsets of interest before mRNA isolation41,42. Further complicating comparisons between repertoires from different studies, recent work43,44,45,46 has unveiled an unexpectedly high degree of polymorphism in the human IGH locus. Understanding the degree of genetic variation is critical for assigning VDJ segment usage, estimating somatic hypermutation and comparing antibody responses among individuals.
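The contrast between gDNA and mRNA templates described above can be illustrated with a short calculation. The sketch below is not taken from any cited study; the cell counts, the one-template-per-cell assumption for gDNA and the function name are hypothetical, and reads are assumed to scale with template molecules without primer or PCR bias.

```python
def expected_read_fractions(cell_counts, templates_per_cell):
    """Return the expected fraction of reads contributed by each clone.

    Idealized assumption: reads are proportional to the number of
    template molecules (no primer or PCR bias).
    """
    templates = {
        clone: cell_counts[clone] * templates_per_cell[clone]
        for clone in cell_counts
    }
    total = sum(templates.values())
    return {clone: count / total for clone, count in templates.items()}


# Hypothetical sample: 99 naive B cells of one clone, 1 plasma cell of another.
cells = {"naive_clone": 99, "plasma_clone": 1}

# gDNA: assume a single rearranged template per cell.
gdna = expected_read_fractions(cells, {"naive_clone": 1, "plasma_clone": 1})

# mRNA: assume the plasma cell carries 100-fold more Ig transcripts.
mrna = expected_read_fractions(cells, {"naive_clone": 1, "plasma_clone": 100})
```

Under these assumed numbers the plasma-cell clone accounts for 1% of gDNA-derived reads but roughly half of mRNA-derived reads, which is why read counts from unsorted mRNA are a poor proxy for cellular clonal frequency.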

Another experimental design consideration relates to the large number of B cells in humans (2–4 × 10^9 B cells in the blood alone). The minimal sampling depth needed to cover the antibody repertoire to an extent sufficient to answer a particular biological question needs to be considered. Clearly the sequencing depth must be greater than the number of B cells in the sample (which means that enumeration of B cells by fluorescence-activated cell sorting is essential). Further, the required sequencing depth is also dependent on the approach used to minimize PCR and base-calling errors (see below).
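One way to reason about sampling depth is a simple binomial model, sketched below. This model is an assumption introduced here for illustration, not a method from the cited work: it treats cells as sampled independently and at random, and the function names are hypothetical.

```python
import math


def detection_probability(clone_fraction, cells_sampled):
    """Probability of sampling at least one cell of a clone.

    Assumes independent random sampling; clone_fraction is the fraction
    of B cells in the source population that belong to the clone.
    """
    # 1 - (1 - f)^n, computed with log1p/expm1 for numerical stability.
    return -math.expm1(cells_sampled * math.log1p(-clone_fraction))


def cells_needed(clone_fraction, target_probability):
    """Smallest number of sampled cells giving the target detection probability."""
    exact = math.log1p(-target_probability) / math.log1p(-clone_fraction)
    return math.ceil(exact)


# Hypothetical question: how many cells must be sampled to see a clone
# present at one in a million B cells with 95% probability?
n_cells = cells_needed(1e-6, 0.95)
```

Under this model, detecting rare clones requires sampling several multiples of the inverse clone frequency, and the number of reads must then exceed the number of cells sampled.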

Distinguishing error from true biological variation.

There are two sources of sequence errors: those that arise from sample preparation (reverse transcription and PCR) and those inherent to the DNA sequencing platform.

Relative clonal frequencies can be estimated from gDNA by preparing and sequencing libraries from multiple aliquots from the same sample (technical replicates)41,47. In this approach, each aliquot is used to generate a barcoded library, and the resulting libraries from different aliquots are sequenced. The barcode prevents artifacts due to the contamination of one library with DNA from a different aliquot. Because antibody genes found in multiple libraries can arise only if the respective clonal B cells are present in each of the starting aliquots, this method can reveal clonal expansions41,47. However, using gDNA as the template is not without complications. First, amplification of VDJ segments from gDNA necessitates the use of primer sets that anneal to all the individual germline V-gene segments (Fig. 3a). Another drawback of gDNA-derived antibody gene libraries is that they contain productive and nonproductive VDJ rearrangements (the latter are also present in cDNA libraries but at a much lower frequency due to the decay of nonsense RNAs)48,49. Lastly, the lower concentration of template in gDNA necessitates a greater number of PCR cycles; this increases error frequencies and further confounds quantification.
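The replicate-based logic described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the cited studies: sequences are compared as exact strings, and the function name and threshold are hypothetical.

```python
from collections import Counter


def sequences_in_multiple_aliquots(libraries, min_libraries=2):
    """Return sequences observed in at least `min_libraries` aliquot libraries.

    `libraries` is a list of iterables, one per independently prepared
    aliquot. A sequence is counted once per library regardless of its
    read count, so PCR duplicates within a library do not inflate the result.
    """
    library_counts = Counter()
    for library in libraries:
        library_counts.update(set(library))
    return {seq for seq, n in library_counts.items() if n >= min_libraries}


# Toy data: three aliquots from the same hypothetical sample.
aliquots = [
    ["CARDYW", "CARDYW", "CAKGGW"],
    ["CARDYW", "CTRSSW"],
    ["CAKGGW", "CARDYW"],
]
expanded = sequences_in_multiple_aliquots(aliquots)
```

In practice, sequences would first be grouped into clonotypes that tolerate sequencing error rather than matched exactly.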

Using mRNA as the starting material enables amplification with reverse transcription and 5′RACE (5′ rapid amplification of cDNA ends) with 3′ primers that anneal to the constant region of IgH or IgL, thus circumventing the need for complex V-gene-specific primer sets (Fig. 3a). If starting with mRNA, errors introduced by reverse transcriptase can be minimized using commercial high-fidelity retroviral reverse transcriptases or thermostable group II intron reverse transcriptases50.

Regardless of whether gDNA or cDNA is used as the template, PCR introduces amplification artifacts owing to the differential amplification of some DNA templates over others (even in 5′RACE), base misincorporation and template switching; the latter results in chimeras from the joining of fragments encoded by two or more template DNAs. Nucleotide misincorporation by PCR cannot generally be distinguished from most types of base-calling errors introduced during sequencing, but the latter generally occur at higher frequency and hence they are a greater concern. Chimeras resulting from template switching generate sequences that either cannot be assigned to a germline V-gene segment by standard VDJ identification algorithms, or are interpreted as having an artifactually very high rate of somatic hypermutation. The presence of chimeras makes discernment of true VH gene replacement events particularly challenging. However, the fact that gene replacement by definition has to involve recombination to upstream VH gene segments in the chromosome (because downstream VH gene segments will have been deleted during the primary recombination event) can be used to assist the identification of true VH replacement events. Quantification biases introduced by PCR may be minimized by amplifying and then sequencing the same sample with two or more different primer sets, followed by informatics comparison of the similarities in the respective datasets51; another possible solution is using emulsion PCR to sequester individual DNA molecules for amplification52.

Several of the complications underlying DNA sequencing error have been recently reviewed53,54,55,56,57 so here we focus only on the issues most pertinent to Ig-seq, and on recent technological advances aimed at increasing sequencing accuracy of antibody repertoires. Sequencing errors relate to the particular DNA sequencing technology used in each experiment and encompass incorrect base assignment, insertion/deletions (collectively known as indels) and ambiguous base calls. Pyrosequencing-based technologies (Roche 454 and IonTorrent) are dominated by indels, whereas dye-labeled reversible terminator technology (Illumina HiSeq and MiSeq) is dominated by substitution errors55,58,59,60,61. Indels generated by pyrosequencing methods arise at frequencies around 5 × 10^-3 and can be computationally accounted for with various degrees of success56,62,63. According to a recent analysis, the frequency of base substitutions varies from 0.3% to 0.9%, depending on the platform61. Overall, the Illumina platform is most suitable for Ig-seq applications because of its combination of relatively low base-calling error rates and relatively low cost.
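To put the per-base substitution frequencies quoted above in perspective, the following sketch estimates the fraction of reads expected to be entirely error-free. The read length of 350 nucleotides is an assumption chosen for illustration, and errors are treated as independent across positions, which is a simplification.

```python
def error_free_fraction(per_base_error_rate, read_length):
    """Expected fraction of reads with no substitution errors.

    Assumes errors occur independently at each position with the same rate.
    """
    return (1.0 - per_base_error_rate) ** read_length


READ_LENGTH = 350  # assumed V-gene amplicon length, for illustration only

for rate in (0.003, 0.009):  # the 0.3% to 0.9% range quoted in the text
    fraction = error_free_fraction(rate, READ_LENGTH)
    print(f"per-base error {rate:.3f}: error-free fraction {fraction:.3f}")
```

Under these assumptions, most reads of this length would carry at least one substitution even at the lower error rate, which motivates the error-correction strategies discussed next.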

It should be noted that the importance of sequencing errors depends on the objective of the antibody repertoire sequencing experiment. For example, if the objective is to generate CDR-H3 length distribution statistics or V-J segment use statistics, then sequencing errors are less of a problem, as meaningful information can be obtained using clustering algorithms that group together highly homologous sequences and minimize the effects of sequencing error51,64,65,66,67. For other applications where sequence accuracy is critical, a number of recently developed approaches can help reduce sequencing errors. For example, up to tenfold higher sequencing accuracy can be attained using circularized V-gene DNA that is sequenced multiple times using the PacBio platform68; however, this accuracy comes at a higher cost and much lower throughput. Various DNA barcoding techniques, whereby a nucleotide barcode is appended to each DNA template molecule before PCR amplification, can also improve sequencing accuracy69,70,71. Barcoding enables more accurate quantification of the DNA template molecules in the original library (by counting barcodes instead of sequence reads), and error correction (by generating consensus reads having the same barcode). In a recent application of barcoding to Ig-seq, the Quake laboratory72 achieved 95% identity in the V-gene repertoire (for sequences present at >5 reads) between two technical replicates. Another method73 that relies on barcoding and sequencing both strands can achieve very high accuracy (<10^-8 error rate), although at the expense of throughput.
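The two uses of molecular barcodes described above, template counting and consensus error correction, can be sketched as follows. This is a simplified illustration rather than any published pipeline: it uses a per-position majority vote, requires reads within a barcode group to have equal length (so indels are not handled), and the function name and `min_reads` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(tagged_reads, min_reads=2):
    """Collapse reads sharing a molecular barcode into one consensus read.

    `tagged_reads` is an iterable of (barcode, sequence) pairs. Barcode
    groups with fewer than `min_reads` reads, or with reads of unequal
    length, are skipped in this sketch.
    """
    groups = defaultdict(list)
    for barcode, sequence in tagged_reads:
        groups[barcode].append(sequence)

    consensus = {}
    for barcode, reads in groups.items():
        if len(reads) < min_reads:
            continue
        if len({len(read) for read in reads}) != 1:
            continue  # indel handling is out of scope here
        # Majority vote at each position across the reads in the group.
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*reads)
        )
    return consensus


# Toy data: barcode AAT has one read with a substitution error (G -> T).
reads = [
    ("AAT", "ACGT"), ("AAT", "ACGT"), ("AAT", "ACTT"),
    ("GGC", "TTGA"), ("GGC", "TTGA"),
    ("CCA", "ACGT"),
]
corrected = consensus_by_barcode(reads)
template_count = len(corrected)  # counts barcodes, not reads
```

Counting barcodes rather than reads makes the estimate of template abundance insensitive to unequal PCR amplification, at the cost of discarding barcodes with too few reads to form a consensus.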

Identifying endogenous VH:VL pairs.

As discussed above, until recently, native VH:VL pairs could be identified only after single-cell cloning by limiting dilution and Sanger sequencing of the individual VH and VL genes. This process is inherently low throughput, expensive (due to high reagent usage) and yields a very limited set of antibody sequences. A modest increase in single-cell cloning throughput (from 500 to 2,000 B cells) was achieved using overlap extension PCR to produce single VH:VL amplicons (as opposed to separate VH and VL cDNAs) for Sanger sequencing74. In a study from the Quake laboratory, Weinstein et al.4 sorted 200 mouse B cells into microwells, followed by quantitative RT-PCR on a microfluidic chip using primers specific for genes of interest that also contained sequencing adapters, thus enabling the correlation between expression of multiple genes and the degree of somatic hypermutation in the antibody heavy and light chain genes.

Additional approaches for sequencing endogenous VH:VL pairs were more recently developed in the Wardemann and Georgiou laboratories. Busse et al.75 used a two-dimensional, bar-coded primer matrix to combine single-cell VH and VL gene amplification with high-throughput sequencing; this increases throughput up to a total of 50,000 individual B cells (Fig. 3b). Importantly, this approach also enabled the facile cloning of IgH and Igκ,λ sequences into expression vectors for further antibody characterization. DeKosky et al.76 developed a VH:VL pairing technology that relies on sequestering single B cells into subnanoliter volume wells, lysing the cells, capturing RNA on poly-dT beads and generating amplicons encoding linked VH:VL segments by emulsion overlap extension PCR (Fig. 3b). Yields of up to 6–7 × 10^3 unique VH:VL pairs from 7 × 10^4 activated memory B cells in a one-day experiment, with >96% validated pairing accuracy, have been reported. This method was adapted to detect co-expression of V genes with transcription factors of interest (e.g., BLIMP1) in antibody-secreting cells (B.J. DeKosky, personal communication).

A much higher-throughput method (>2 × 10^6 B cells per experiment) under development in our laboratory relies on cell encapsulation within controlled microdroplet diameter emulsions (an important consideration because reverse transcriptase is inhibited in droplets of <5 nl volume) (unpublished data). Alternatively, single cells have been encapsulated in water-in-oil emulsion together with uniquely barcoded beads using a microfluidic device. After reverse transcription and PCR, every RNA molecule originating from a single cell is effectively loaded onto the uniquely barcoded beads. PCR products are sequenced on the Illumina platform and correctly paired VH and VL sequences are identified by virtue of their shared barcode77 (Fig. 3c). Given the rapid pace of technology advancement, it may be expected that in the very near future all antibody repertoire analyses will report natively paired VH and VL genes.
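The barcode-based pairing step described above amounts to grouping reads by their cell barcode. The sketch below illustrates that idea only; the input format, the chain labels and the rule of discarding ambiguous barcodes are assumptions made here, not details of the cited method.

```python
from collections import defaultdict


def pair_by_barcode(reads):
    """Pair VH and VL sequences that share a cell barcode.

    `reads` is an iterable of (barcode, chain, sequence) tuples, where
    chain must be "VH" or "VL". Barcodes with zero or multiple distinct
    sequences for either chain are treated as ambiguous and dropped.
    """
    by_cell = defaultdict(lambda: {"VH": set(), "VL": set()})
    for barcode, chain, sequence in reads:
        by_cell[barcode][chain].add(sequence)

    pairs = {}
    for barcode, chains in by_cell.items():
        if len(chains["VH"]) == 1 and len(chains["VL"]) == 1:
            (vh,) = chains["VH"]
            (vl,) = chains["VL"]
            pairs[barcode] = (vh, vl)
    return pairs


# Toy data: cell2 lacks a light chain, cell3 has two heavy chains
# (for example, two cells captured in one droplet).
toy_reads = [
    ("cell1", "VH", "VH_a"), ("cell1", "VL", "VL_a"), ("cell1", "VH", "VH_a"),
    ("cell2", "VH", "VH_b"),
    ("cell3", "VH", "VH_c"), ("cell3", "VH", "VH_d"), ("cell3", "VL", "VL_c"),
]
paired = pair_by_barcode(toy_reads)
```

A real analysis would first collapse reads that differ only by sequencing errors, since otherwise a single erroneous read would make a barcode appear ambiguous.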

Bioinformatic analysis of antibody sequences.

Several established methods for VDJ assignment and CDR-H3 identification are available (e.g., IMGT/V-Quest, IgBLAST or iHMMune-align) and faster, more precise algorithms continue to be developed54,57,78,79. Estimation of the size of the antibody repertoire can be accomplished by rarefaction analysis, maximum entropy and Poisson log-normal distribution models51,80,81. There are also many approaches for clustering V genes likely to have originated from the expansion and somatic hypermutation of a single B cell encoding an unmutated antibody precursor57. However, clustering becomes particularly problematic as the sequencing depth and repertoire diversity increases. Likewise, inference of VH evolution (i.e., how somatically mutated V genes arise from precursor sequences initially present in the naive primary repertoire) is a major challenge67,82. Finally, it should be noted that researchers working in this area generally use custom-made bioinformatics pipelines. The lack of standardization and shared computational resources makes it extremely difficult to carry out meta-analyses of published data generated in different laboratories; this is a key issue that will need to be addressed as the field matures. Interchangeable data formats deposited in a central database, validated open-source algorithms for data analysis and standards for Ig-seq experimental description analogous to the minimum information about a microarray experiment83 are critically needed84.
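As an illustration of the rarefaction analysis mentioned above, the sketch below counts unique sequences in random subsamples of increasing size. It is a minimal example under simplifying assumptions: sequences are compared as exact strings, a single random shuffle is used (real analyses typically average over many), and the function name is hypothetical.

```python
import random


def rarefaction_curve(reads, depths, seed=0):
    """Return (depth, unique sequence count) for nested random subsamples.

    Reads are shuffled once with a fixed seed and the first `depth`
    reads are taken for each requested depth, so the subsamples are
    nested and the curve cannot decrease.
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return [
        (depth, len(set(shuffled[:depth])))
        for depth in depths
        if depth <= len(shuffled)
    ]


# Toy repertoire: one dominant clone plus a tail of rare sequences.
toy_reads = ["clone_A"] * 50 + [f"rare_{i}" for i in range(50)]
curve = rarefaction_curve(toy_reads, depths=[10, 25, 50, 100])
```

A curve that is still rising at the full sequencing depth suggests that the repertoire has not been sampled to saturation.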

Applications of antibody repertoire sequencing

Ig-seq is finding a wide range of basic and applied immunology applications.

Antibody discovery.

Screening of large combinatorial libraries by ribosomal, phage, bacterial or yeast display is widely employed for the isolation of antibodies capable of binding virtually any ligand85,86,87. Combinatorial libraries are typically generated by the random pairing of very large (typically >10^7 each) ensembles of VH and VL genes isolated from mammals (naive or immune libraries, depending on whether the animal was immunized); alternatively one can use synthetic libraries in which a single, or a small set, of VH and VL genes are diversified by mutagenesis of the CDRs. Screening involves sequential rounds of binding to antigen, a process that progressively restricts the diversity of the library to very few antibody clones with the requisite affinity and specificity. We and others have used high-throughput sequencing as a means of evaluating the initial diversity of antibodies encoded in a library64,88,89 and to determine how this diversity declines as binders to antigen are progressively enriched89,90,91,92. During library screening, antibody diversity is reduced as antigen-binding clones are enriched over the background of unrelated antibodies in the starting library. However, the expression of some antibodies having high affinity often adversely affects the growth of the cells that encode them (antibodies with poor affinity are equally likely to have an adverse effect on cell growth, but such clones are of no interest in library screening). As a result, the respective antibody genes are enriched during early rounds of screening for antigen binding but are then progressively depleted because they are outcompeted by faster growing cells expressing lower quality antibodies87,90,92,93. High-throughput sequencing of libraries following one or two rounds of selection for antigen binding has therefore been used to rescue clones encoding high-affinity antibodies that could not have been discovered otherwise.

Three different approaches that exploit antibody repertoire sequencing analysis for antibody isolation directly from animals or humans, without library screening, have been developed. First, in work from the Georgiou laboratory, Reddy et al.94 observed that in mice the antibody repertoire encoded by CD138+ antibody-secreting B cells in the bone marrow (bone marrow plasma cells) becomes highly polarized seven days after secondary immunization with antigen, with the most abundant VH and VL genes representing 2–30% of all V-gene sequences. To discover antigen-specific antibodies, Reddy et al.94 paired VH and VL genes based on their relative rank frequency in the respective repertoires and thus obtained antibodies with nanomolar affinity for antigen. Subsequently, Saggy et al.95 showed that VH genes present at high frequencies in the splenic V-gene repertoire of immunized mice encode antibodies capable of binding the immunizing antigen. However, the types of bone marrow or spleen samples used in these mouse analyses are rarely available from humans.
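The rank-frequency pairing idea described above for Reddy et al. can be sketched as follows. This is an illustrative reduction of the concept, not the authors' implementation: sequences are counted as exact strings, and the function name and `top_n` parameter are hypothetical.

```python
from collections import Counter


def pair_by_rank(vh_reads, vl_reads, top_n=5):
    """Pair the most abundant VH and VL sequences by frequency rank.

    The most frequent VH is paired with the most frequent VL, the second
    with the second, and so on, up to `top_n` pairs. Ties in read count
    are resolved by first appearance, which is arbitrary.
    """
    vh_ranked = [seq for seq, _ in Counter(vh_reads).most_common(top_n)]
    vl_ranked = [seq for seq, _ in Counter(vl_reads).most_common(top_n)]
    return list(zip(vh_ranked, vl_ranked))


# Toy read lists from separately amplified heavy and light chain libraries.
vh = ["VH_1"] * 30 + ["VH_2"] * 12 + ["VH_3"] * 3
vl = ["VL_x"] * 25 + ["VL_y"] * 10 + ["VL_z"] * 2
candidate_pairs = pair_by_rank(vh, vl, top_n=2)
```

Such pairing is only plausible when the repertoire is highly polarized, as in the plasma cell populations described above; candidate pairs still require experimental validation of antigen binding.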

A different approach, pioneered by Shapiro, Kwong and coworkers, builds on the observation that the co-evolution of heavy and light chains in broadly neutralizing HIV-1–specific antibodies is reflected in the matching topology of VH and VL phylogenetic trees as gleaned from Ig-seq63,96. VH and VL genes within matching branches of the respective phylogenetic trees, that is, displaying similar patterns of mutation accumulation, were paired to yield novel broadly neutralizing antibodies. This method is particularly useful for the identification of antibodies with high somatic hypermutation levels (such as those elicited by persistent infection) where a deep phylogenetic tree can be constructed.

Ig-seq has also enabled the proteomic identification of antibodies from biological fluids. Even though the main effector function of B cells is the secretion of antibodies into blood or mucosal and respiratory epithelia, the composition of the antibody repertoire in these bodily fluids remained elusive. We and others realized that shotgun liquid chromatography–tandem mass spectrometry (LC-MS/MS) proteomic identification of the antibodies that comprise the humoral response requires a matched personal antibody sequence database, obtained by Ig-seq, to interpret the MS/MS spectra (Fig. 4)97,98,99. In this manner, Polakiewicz and coworkers98,99 identified high-affinity, antigen-specific antibodies from immunized animals, from a human vaccinated against hepatitis B and from a cytomegalovirus-infected volunteer. In more recent studies, we determined the monoclonal antibody repertoire that comprises the serum polyclonal response in rabbits and in humans following vaccination (unpublished data and ref. 66). The advent of antibody discovery by proteomic mining of the serum antibody response has in turn opened the way for the isolation of biologically relevant antibodies from convalescent patients. Finding the serum antibodies that were responsible for the resolution of disease states in patients is likely to be of great relevance for drug discovery because such antibodies will have already been established to be of therapeutic value.

Figure 4: Deconvoluting the serum antibody repertoire.

B cells from peripheral blood or other tissues are sorted and subjected to high-throughput immunoglobulin V-gene sequencing, resulting in generation of a personal antibody sequence database. Antigen-specific antibodies from serum are then isolated by affinity chromatography, digested into peptides, and subjected to LC-MS/MS analysis. The MS/MS data are interpreted using the antibody sequence database, thereby allowing identification of CDRH3-derived peptides and the genes encoding the repertoire of antigen-specific antibodies in the serum. Ig, immunoglobulin.
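The database-matching step in Figure 4 can be illustrated with a minimal Python sketch. It assumes that the MS/MS spectra have already been interpreted into peptide strings by a search engine (that scoring step is not shown), applies a simplified trypsin rule without missed cleavages, and uses hypothetical function names and toy sequences that do not come from any published pipeline. A peptide is treated as informative only if it maps to a single database entry and overlaps that entry's CDR-H3.

```python
from collections import defaultdict


def tryptic_digest(protein, min_len=6):
    """Simplified in silico trypsin digest: cleave after K or R unless the
    next residue is P; no missed cleavages are modeled."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def build_peptide_index(database):
    """Map each theoretical peptide to the set of clones that could yield it.
    `database` is {clone_id: (vh_protein, cdrh3)} (hypothetical layout)."""
    index = defaultdict(set)
    for clone_id, (vh, _cdrh3) in database.items():
        for pep in tryptic_digest(vh):
            index[pep].add(clone_id)
    return index


def identify_clones(observed_peptides, database):
    """Return {clone_id: [peptides]} for observed peptides that are unique to
    one clone and overlap its CDR-H3."""
    index = build_peptide_index(database)
    hits = defaultdict(list)
    for pep in observed_peptides:
        clones = index.get(pep, set())
        if len(clones) != 1:
            continue  # unmatched, or shared between clones: not informative
        (clone_id,) = clones
        vh, cdrh3 = database[clone_id]
        pep_start = vh.find(pep)
        cdr_start = vh.find(cdrh3)
        if cdr_start < 0:
            continue  # annotated CDR-H3 not found in the VH sequence
        # Two intervals overlap when each starts before the other ends.
        if pep_start < cdr_start + len(cdrh3) and cdr_start < pep_start + len(pep):
            hits[clone_id].append(pep)
    return dict(hits)


# Toy database: two clones sharing a framework but differing in CDR-H3.
toy_db = {
    "clone_A": ("LQMNSLRAEDTAVYYCARDGYSSGWYFDYWGQGTLVTVSS", "ARDGYSSGWYFDY"),
    "clone_B": ("LQMNSLRAEDTAVYYCARGLTGYYYGMDVWGQGTTVTVSS", "ARGLTGYYYGMDV"),
}
toy_observed = ["AEDTAVYYCAR", "DGYSSGWYFDYWGQGTLVTVSS", "LQMNSLR"]
print(identify_clones(toy_observed, toy_db))
```

In this toy case the framework peptides are shared by both clones and are therefore discarded, which mirrors why CDR-H3-derived peptides carry most of the identifying information.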

Understanding immune repertoire development.

The use of Ig-seq to map out the global antibody repertoire was first accomplished by Quake and coworkers in the zebrafish, which exhibit the basic features of adaptive immunity but have fewer possible VDJ combinations than mice or humans51,100. Ig-seq has subsequently been used to analyze the repertoires of a variety of other species, including mice, chickens and cattle51,66,68,94,101,102,103,104,105,106,107.

More importantly, however, Ig-seq is providing unprecedented insights into the mechanisms responsible for shaping the human naive BCR repertoire. As noted above, Ig-seq analysis of the VH repertoire revealed the high degree of allelic diversity in the human immunoglobulin locus43,45,108. Analysis by our laboratory and by others of germline VH, Vκ and Vλ segment usage and frequencies of recombination between particular V-D and D-J segments in the naive BCR repertoire revealed a marked skewing that in turn shapes the repertoire in mature, antigen-experienced B cells7,43,47,49,64,102,109,110,111,112. A single D-segment reading frame is overwhelmingly preferred in VDJ recombination, inversions of D segments are extremely rare113 and the CDR-H3 region exhibits universal categorical constraints with respect to amino acid composition, average hydrophobicity, charge distribution and length. Antibodies with long or charged CDR-H3s are of great clinical interest because, although they are more likely to be autoreactive, they often enable binding to occluded sites on pathogens and mediate pathogen neutralization. In the human repertoire, long CDR-H3s arise by D-D joining (which occurs at frequencies from 0.125% to >0.5% in the naive repertoire but is reduced in antigen-experienced B cells) and by extensive nucleotide addition combined with preferential usage of longer D segments and of the longer J6 segment in the germline47,49,114,115. The conservation of hydrophobicity, which results in selection for a neutral CDR-H3 and against hydropathically extreme (charged or hydrophobic) antibody-binding sites, is a common theme across vertebrate evolution7,116. Collectively, these findings indicate that the CDR-H3 loop in humans explores vast sequence space, but within boundaries40,53,64,102,105.
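The CDR-H3 constraints discussed above (length, average hydrophobicity and charge) are straightforward to compute from amino acid sequences. The following Python sketch is illustrative only: it uses the Kyte-Doolittle hydropathy scale as one common choice of hydrophobicity measure, approximates net charge by counting K and R as +1 and D and E as -1 (ignoring histidine and the termini), and the function name and example sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def cdrh3_properties(seq):
    """Return length, mean hydropathy (GRAVY) and approximate net charge
    of a CDR-H3 amino acid sequence."""
    seq = seq.upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError(f"not a plain amino acid sequence: {seq!r}")
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return {
        "length": len(seq),
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq),
        # Crude approximation: ignores His, termini and local pKa shifts.
        "net_charge": positive - negative,
    }


# Toy CDR-H3 sequences (illustrative, not taken from any dataset).
for cdrh3 in ["ARDGYSSGWYFDY", "ARGLTGYYYGMDV", "ARDRRKKEYYYYYGMDV"]:
    props = cdrh3_properties(cdrh3)
    print(cdrh3, props["length"], round(props["gravy"], 2), props["net_charge"])
```

Applied to a whole repertoire, the distributions of these three values give a compact view of how tightly CDR-H3 composition is bounded.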

Recent studies suggest determinism in VDJ recombination frequencies in the primary antibody repertoire. VDJ segment usage in the antibody repertoire of monozygotic twins is indistinguishable, suggesting that it is determined by genetic factors40. Nonetheless, it is important to note that diversification in the CDR-H3 results in a highly private repertoire that shows very little overlap among individuals. Further, the human naive antibody repertoire that developed in immunocompromised mice engrafted with human CD34+ hematopoietic stem cells105 showed V-J usage patterns and, even more strikingly, checkpoint depletion of antibodies displaying autoreactivity signatures similar to those in humans24.
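One simple way to compare segment usage between two individuals is to tabulate V-J pairing frequencies and measure the distance between the two distributions. The sketch below uses the Jensen-Shannon divergence (base 2, so values range from 0 for identical usage to 1 for non-overlapping usage). This metric is our illustrative choice and is not necessarily the statistic used in the cited studies; the function names and toy gene calls are likewise assumptions.

```python
from collections import Counter
from math import log2


def usage_frequencies(vj_calls):
    """Convert a list of (V gene, J gene) calls into relative frequencies."""
    counts = Counter(vj_calls)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency dicts."""
    jsd = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = (pk + qk) / 2  # mixture distribution
        if pk > 0:
            jsd += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            jsd += 0.5 * qk * log2(qk / mk)
    return jsd


# Toy V-J calls for two hypothetical donors.
donor_1 = [("IGHV3-23", "IGHJ4")] * 6 + [("IGHV1-69", "IGHJ6")] * 4
donor_2 = [("IGHV3-23", "IGHJ4")] * 5 + [("IGHV1-69", "IGHJ6")] * 5
print(jensen_shannon(usage_frequencies(donor_1), usage_frequencies(donor_2)))
```

Under the observation described above, pairs of genetically identical individuals would be expected to yield values close to zero for segment usage even though their CDR-H3 sequences barely overlap.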

Important insights are also being gleaned from sequencing of antibody repertoires encoded by B cells at different anatomical sites. One particularly noteworthy study analyzed the murine intestinal IgA repertoire117. Another study of mouse mucosal B-cell repertoires in the intestinal lamina propria raised the intriguing possibility that, contrary to long-standing dogma, the bone marrow might not be the exclusive site of adult B-cell development. This was because B cells in the gut expressed Rag-1 recombinase, displayed a pre-B or immature B phenotype and, importantly, encoded a distinctive VL repertoire; these features are consistent with B-cell development and selection in situ in the lamina propria118. The IgG VH repertoire in the cerebrospinal fluid has also been examined119 and found to differ in somatic hypermutation from that in the peripheral blood, further suggesting that B-cell maturation might have occurred independently in the central nervous system and in the periphery.

Several studies have attempted to estimate the total diversity of the human antibody repertoire by Ig-seq of peripheral B cells40,47,102. This task is very challenging owing to the large size of the antibody repertoire, variable sampling depth, transcriptional differences among B-cell subsets, sequencing errors and, last but not least, the fact that peripheral blood contains <2% of all B cells3. The overwhelming majority of published studies105,120 as well as our unpublished data are consistent with the notion that the VH gene repertoire is highly private (unique to an individual), although a small number of CDR-H3s appear to be shared (in other words, they are stereotypical or public) among different individuals. In contrast, a sizable fraction of shared sequences are found in light chains, owing to the lower diversity of the VL gene repertoires121,122. It is intriguing that shared IgL chain genes also tend to be the most abundantly expressed123.
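The distinction between private and public (stereotypical) sequences can be made concrete with a short Python sketch. It assumes that each donor's repertoire has already been reduced to a list of CDR-H3 amino acid sequences and that sharing is defined by exact sequence identity, which is a simplification; the function names, the `min_donors` parameter and the toy data are hypothetical.

```python
from collections import defaultdict


def public_clonotypes(repertoires, min_donors=2):
    """Return {cdrh3: [donors]} for sequences seen in at least `min_donors`
    donors. `repertoires` is {donor_id: [cdrh3, ...]}."""
    donors_per_seq = defaultdict(set)
    for donor, cdrh3s in repertoires.items():
        for seq in set(cdrh3s):  # count each sequence once per donor
            donors_per_seq[seq].add(donor)
    return {
        seq: sorted(donors)
        for seq, donors in donors_per_seq.items()
        if len(donors) >= min_donors
    }


def private_fraction(repertoires, donor):
    """Fraction of a donor's unique CDR-H3s found in no other donor."""
    own = set(repertoires[donor])
    if not own:
        return 0.0
    others = set()
    for other_donor, cdrh3s in repertoires.items():
        if other_donor != donor:
            others.update(cdrh3s)
    return len(own - others) / len(own)


# Toy repertoires for three hypothetical donors.
toy = {
    "donor_1": ["ARDGYSSGWYFDY", "ARGLTGYYYGMDV", "ARDGYSSGWYFDY"],
    "donor_2": ["ARGLTGYYYGMDV", "ARTTVTTFDY"],
    "donor_3": ["ARWELLRFDP"],
}
print(public_clonotypes(toy))
print({d: private_fraction(toy, d) for d in toy})
```

Because peripheral blood samples capture only a small fraction of all B cells, overlap measured this way is a lower bound on true sharing and depends strongly on sequencing depth.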

Infectious diseases.

Ig-seq is also providing insights into the adaptive immune responses elicited by pathogen challenge or vaccination. Pathogen challenge can affect the BCR repertoire of responding B cells as well as the naive repertoire. Many pathogens produce superantigens, which are proteins that bind to certain antibody V domains, resulting in BCR cross-linking and subsequent B-cell deletion. Predictably, superantigen exposure results in a skewed naive antibody repertoire103. Surprisingly, however, depletion of V genes bound by superantigen was not observed in the naive repertoire of transgenic mice constitutively expressing superantigen48. Interestingly, a skewed naive B-cell antibody repertoire was also reported for patients with chronically evolving hepatitis C infection124. Changes in the overall antibody repertoire are also evident following vaccination or infection120,125,126,127. Notably, Boyd, Fire and coworkers125 observed convergent antibody signatures (stereotyped CDR-H3 sequences) in patients experiencing acute dengue infection. This observation raises the possibility that Ig-seq aimed at detecting stereotypical responses may be used as a diagnostic tool for predicting infectious disease severity.

Another exciting potential application of Ig-seq is the identification of V genes clonally related to those encoded by protective antibodies isolated from an individual; one can then infer the antibody lineages that led to the evolution of these protective antibodies, starting from an unmutated common ancestor IgH germline sequence. This approach is particularly relevant to understanding the evolution of broadly neutralizing antibodies (bNAbs) during infection with rapidly evolving viral pathogens such as influenza and HIV-1 (refs. 67, 96, 127, 128, 129, 130). Time-ordered sequencing of evolving virus populations and antibody responses in the same host, together with isolation of bNAbs, is helping delineate the dynamic between adaptive immune responses that exert selective pressure on the virus and the emergence of viral escape mutations131. Tracing the evolutionary paths that lead to the generation of bNAbs is also critical for the design of immunogens and vaccination schedules that will elicit an immune response by first activating naive, germline, antibody-expressing B cells and then steering B-cell clonal selection toward an affinity maturation pathway that leads to the production of bNAbs36,38. Multidonor analysis of VRC01-class anti-HIV-1 bNAbs confirmed that the elicitation of such antibodies from a single ancestor B cell indeed occurs in multiple individuals132. Ig-seq is also useful for evaluating how innate immune responses elicited by adjuvants, such as toll-like receptor 4 (TLR4) or TLR7/8 agonists, affect the diversity of the antibody response and possibly antibody functionality133.

Ig-seq might also be applied to answer the long-standing question of why people in certain age groups, usually the elderly, exhibit higher susceptibility to infectious disease and/or are less well-protected by vaccination. For example, by analyzing antibody lineage structure, isotype and mutational load in the VH repertoire of volunteers in various age groups before and after influenza vaccination, our laboratory detected a higher IgM mutational load before vaccination and a lower degree of repertoire diversity after vaccination in elderly individuals120. Another study also detected a smaller degree of IgM and IgA CDR-H3 diversification in elderly individuals before and after receiving the influenza or the 23-valent pneumococcal vaccine126.

Immune dysregulation.

We expect that deep sequencing of the antibody repertoire in patients with autoimmunity or primary immunodeficiency will provide important mechanistic insights that in turn may guide the development of appropriate therapies. However, very few analyses of antibody repertoires in patients with immune dysregulation have been published thus far; we believe this dearth reflects the very recent development of Ig-seq. That said, a few studies have been reported. In multiple sclerosis, the cerebrospinal fluid VH gene repertoire is biased and shows strong evidence of B-cell activation119. It will be interesting to determine whether the activated B cells in the cerebrospinal fluid are reactive towards KIR4.1, a recently discovered dominant antigen in multiple sclerosis134. In idiopathic IgG4-associated cholangitis (an autoimmune disease associated with abnormal levels of IgG4 in serum), Ig-seq revealed the presence of large clonal IgG4 expansions in affected tissues and peripheral blood; these clonal expansions disappeared after corticosteroid treatment, indicating that IgG4 clonality is a distinguishing feature of the disease and that its determination therefore constitutes a useful tool for differential diagnosis135.

Cancer.


B-cell leukemias, lymphomas and multiple myeloma are malignancies that arise at different stages of B-cell development (Fig. 2). As such, BCRs on malignant B cells constitute a biomarker for the abundance of the malignant cell population. Ig-seq of the V-gene repertoire in peripheral B cells, bone marrow samples, tumors and even blood-borne free DNA has been used both for disease detection and for delineating the degree to which antibody evolution and diversification in malignant cells correlates with disease progression or relapse47,58,136,137,138,139,140. For example, Ig-seq of V genes in peripheral blood facilitated detection of cancerous cells and minimal residual disease following treatment of B-cell chronic lymphocytic leukemia (CLL), the most common leukemia in adults47,58,137. The use of Ig-seq to determine the V-gene repertoire en masse and detect the presence of antibodies encoded by leukemic clones (whose V-gene sequences are established from analyses of cancer cell samples before initiation of treatment) circumvents the need to develop a personalized PCR assay for the CLL clonotype in each patient to determine whether relapse has occurred. Ig-seq was also used to detect minimal residual disease in pediatric patients with B-cell acute lymphoblastic leukemia (B-ALL)138 and as a marker of non-Hodgkin's lymphoma136.
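The logic of clonotype-based residual disease monitoring can be sketched in a few lines of Python. The example assumes that the leukemic CDR-H3 nucleotide sequences were recorded before treatment and that each post-treatment read has already been trimmed to its CDR-H3; the function names and the `max_mismatches` tolerance for sequencing errors are hypothetical illustration choices, not parameters from the cited studies.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def leukemic_clone_frequency(sample_cdr3s, leukemic_cdr3s, max_mismatches=1):
    """Fraction of post-treatment CDR-H3 reads matching any known leukemic
    clonotype, allowing a small number of substitutions."""
    if not sample_cdr3s:
        return 0.0
    hits = 0
    for read in sample_cdr3s:
        for ref in leukemic_cdr3s:
            # Only equal-length sequences are compared (no indel handling).
            if len(read) == len(ref) and hamming(read, ref) <= max_mismatches:
                hits += 1
                break  # count each read at most once
    return hits / len(sample_cdr3s)


# Toy data: one pre-treatment leukemic clonotype and ten follow-up reads.
leukemic = {"TGTGCGAGA"}
follow_up = ["TGTGCGAGA"] * 2 + ["TGTGCGAGT"] + ["TGTACCACC"] * 7
print(leukemic_clone_frequency(follow_up, leukemic))
```

In practice the achievable sensitivity is bounded by the number of B-cell templates sequenced, so a frequency of zero indicates only that the clone was below the detection limit of that sample.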

Ig-seq also revealed that B-ALL patients display various degrees of clonotypic diversity, which arises predominantly from VH gene replacement and appears to be related to relapse frequency138,139. A small degree of B-cell clonal heterogeneity was also observed in CLL (with the degree varying based on whether the disease originated from unmutated or somatically hypermutated B cells)58,137. In contrast, the malignant clonotype in multiple myeloma, a disease which arises from terminally differentiated plasma cells in the bone marrow (Fig. 2) that lack active mechanisms of antibody diversification, displayed little evidence of heterogeneity140. Of note, in CLL and other hematologic malignancies, the same (stereotypic) CDR-H3 is detected in many patients141. The wider application of high-throughput sequencing for the detection of malignancy-stereotypic clonotypes, which may be present at a low frequency in subjects who do not yet display clinical disease, may prove to be a useful early diagnostic tool. Lastly, Ig-seq may also be applied to understand gammopathies, which are conditions that result in abnormally high levels of antibodies in serum; these are more prevalent in the elderly and can develop into multiple myeloma15,16.


The humoral immune system has evolved to encode an astonishing diversity of antibodies that collectively comprise the antibody repertoire and provide a potent arsenal of recognition reagents (or anticipatory receptors) that can recognize virtually any organic macromolecule of biological significance. The B lymphocyte and its 'anticipatory' receptor, a term used to emphasize the ability of the naive repertoire to bind biologically relevant antigens, are as ancient as the last common vertebrate ancestor (500 million years ago)142. The extraordinary complexity of the vertebrate adaptive immune system has been likened to the Titan brothers Epimetheus (hindsight) and Prometheus (foresight) of Greek mythology143. The germline conservation of antibody genes carries the imprint of ancient adaptations to pathogens humans were exposed to during evolution and, hence, is Epimethean in scope; on the other hand, the immunologic foresight represented by the repertoire encoded by antigen-experienced, mature B cells within an individual enables adaptations to future pathogen challenges and is a Promethean and anticipatory feature. The advent of high-throughput DNA sequencing has enabled the determination of the antibody gene repertoire at a depth that was inconceivable even half a dozen years ago. We can begin to decipher both the Epimethean and the Promethean compartments of humoral immunity and how each is shaped by the other. Technologies to improve sequence precision and data analysis are being developed at a breakneck pace, reshaping our understanding of many important aspects of B-cell immunology and increasingly affecting clinical diagnosis, antibody drug discovery and vaccine development. However, realizing the full impact of Ig-seq will require the implementation of standards for experimental annotation and data analysis, as well as the creation of databases facilitating deposition and sharing of these important data.