Immunoglobulin heavy- and light-chain genes rearrange early in B-cell ontogeny. The rearranged heavy-chain VDJ gene is formed by the recombination of genes selected from three sets of germline genes: variable (immunoglobulin heavy-chain variable, IGHV), diversity (IGHD) and joining (IGHJ).1 Additional diversity is introduced following clonal selection, when point mutations are introduced into the immunoglobulin genes by the process of somatic hypermutation,1 and such mutations complicate the identification of the germline genes that contribute to rearranged VDJ genes. The unequivocal identification of these genes is becoming an issue of increasing importance, because analyses of immunoglobulin gene sequences have recently assumed significance in clinical decision-making.
The number of mutations in immunoglobulin genes is an important prognostic indicator for patients with chronic lymphocytic leukemia (CLL).2, 3 Mutation analysis requires that the germline IGHV genes of CLL immunoglobulin sequences be identified, but misidentification can result from errors in the reported repertoire of germline IGHV genes.4, 5 It has reasonably been argued that the most comprehensive database should be used by researchers seeking to identify the genetic elements that make up a rearranged gene, as the absence of germline sequences in a database could be a significant source of errors.5 The ImMunoGeneTics (IMGT) database of human heavy-chain immunoglobulin sequences is the most comprehensive collection available,6 and includes 226 IGHV genes and alleles, excluding pseudogenes.
We have recently reviewed the human IGHD7 and IGHJ8 genes that make up the IMGT IGHD and IGHJ repertoires.9 In this study, we reconsider the reported IGHV gene repertoire by analyzing features of the reported germline alleles, and by an analysis of the apparent usage of different germline IGHV genes in a database of 4718 rearranged VDJ gene sequences.
Results
Analysis of original reports of germline sequences
The reported germline IGHV genes of the IMGT repertoire6 were investigated to determine the likelihood that each allele had been correctly reported. Features that cast doubt on the credibility of reported sequences are presented in Supplementary Table S1. The references that originally reported the germline genes of the IMGT repertoire were reviewed, and 69% of the 226 sequences were either unpublished (6/226) or had only been reported by a single published study (149/226). Analysis of the references showed that many of these studies examined genetic material from a single individual. An individual can have at most two allelic variants of any gene, yet it is now apparent that multiple allelic variants have often been reported from a single donor. For example, seven of the eight reported alleles of IGHV3-15 were derived from a single individual,10 as were 16 of the 19 reported alleles of IGHV3-30.11 A total of 52 reported germline sequences are derived from five studies of this kind (Table 1). A total of 47 of these sequences have never been confirmed by additional reports, and only a handful of highly mutated sequences in the database of 4718 rearranged VDJ genes aligned to these reported alleles.
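The two-allele argument can be expressed as a simple screen of the original reports. The following is a minimal sketch rather than the procedure used in this study: the record layout, the study and donor labels, and the allele names are all hypothetical placeholders, and only the allele counts for IGHV3-15 and IGHV3-30 are taken from the text. It also assumes one gene copy per chromosome and so ignores gene duplication.

```python
from collections import defaultdict

def flag_excess_alleles(records, max_alleles=2):
    """Return {(gene, study, donor): n_alleles} for every donor reported to
    carry more alleles of one gene than a diploid individual can hold."""
    alleles = defaultdict(set)
    for gene, allele, study, donor in records:
        alleles[(gene, study, donor)].add(allele)
    return {key: len(found) for key, found in alleles.items()
            if len(found) > max_alleles}

# Placeholder records: (gene, allele label, study label, donor label).
records = [("IGHV3-15", f"a{i}", "ref10", "donor1") for i in range(7)]
records += [("IGHV3-30", f"b{i}", "ref11", "donor1") for i in range(16)]
records += [("IGHV1-2", "x1", "refX", "donor1"),
            ("IGHV1-2", "x2", "refX", "donor1")]

flagged = flag_excess_alleles(records)
for (gene, study, donor), n in sorted(flagged.items()):
    print(f"{gene}: {n} alleles reported from {donor} in {study}")
```

A report that passes this screen is not thereby confirmed; the screen only identifies reports that cannot all be correct.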
Table 1 - Multiple alleles of human immunoglobulin IGHV genes that have been reported from a single individual.
A total of 24 alleles can be questioned as the studies that reported them also reported multiple allelic variants of another gene, from a single individual, casting doubt on the quality of all sequences in these reports.10, 11, 12, 13, 14 A further 12 questionable IGHV2-5 and IGHV2-70 alleles appear to be derived from a single individual,15, 16 but this could not be ascertained with absolute certainty. Details of this analysis are included in Supplementary Table S1.
There were 31 germline alleles with truncated 3' ends (13.7%), and 19 alleles with truncated 5' ends (8.4%). Seven reported alleles were truncated at both their 3' and 5' ends. Nucleotides were missing from the sequences reported for IGHV4-31*05 and IGHV4-61*04, while IGHV1-2*03 and IGHV4-59*08 contained ambiguities. IGHV3-30*08 was identified from cDNA rather than from genomic DNA, and could therefore have been amplified from a mutated sequence.
Analysis of IGHV genes in rearranged VDJ sequences
Unless an allele is present in the population at a very low frequency, it can be expected that alignments to the allele will be seen in a large database of rearranged VDJ sequences. Analysis of 4718 VDJ sequences produced alignments to only 128 of the 226 previously reported IGHV germline alleles. Fifteen of the 128 alleles each aligned to only one sequence in the VDJ sequence database, and a further 38 alleles were identified in fewer than five sequences. Evidence that some of these alleles were also originally reported as a result of sequencing errors comes from a consideration of the number of mismatches seen between sequences in the VDJ sequence database and these reported germline sequences. For example, none of the 15 alleles that were each seen just once in the database of 4718 VDJ sequences was represented by an unmutated sequence. Overall, however, 23% of the VDJ sequences were unmutated. Details of the mutation analysis are included in Supplementary Table S1.
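The weight of this observation can be illustrated with a short calculation. This is an illustrative sketch only, and it rests on two assumptions that are not made in the text: that the 15 single alignments are independent, and that each would be unmutated with the same probability as the data set as a whole (23%).

```python
# Fraction of unmutated sequences in the whole VDJ data set (from the text).
p_unmutated = 0.23
# Number of alleles that each aligned to exactly one VDJ sequence (from the text).
n_singletons = 15

# Probability that every one of the 15 sequences carries at least one mismatch,
# assuming independence and a shared 23% chance of being unmutated.
p_all_mutated = (1 - p_unmutated) ** n_singletons
print(f"P(all {n_singletons} mutated) = {p_all_mutated:.3f}")
```

Under these assumptions the chance of seeing no unmutated sequence among the 15 is about 2%, which is consistent with the suggestion that some of the apparent mutations are errors in the reported germline sequences.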
Classification of reported alleles
After considering all of the features associated with the alleles, including features that support and challenge the existence of reported alleles, a five-level classification system was developed, as presented in Table 2. Details of the classification of all alleles are presented in Supplementary Table S1. Only 44 Level 1 alleles were identified. These alleles unquestionably exist and they are sufficiently different to other alleles to allow their confident identification, even in highly mutated VDJ genes. Thirty-eight sequences were designated as Level 2 sequences. These sequences also unquestionably exist. There is a danger, however, of alignments to these sequences being in error, because of similarities between these alleles and other common alleles. A total of 40 sequences were classified as Level 3 or Level 4 sequences, for although there are reasons to doubt the existence of the alleles, there is insufficient evidence to remove them from the available repertoire. The number of times germline genes of different levels were seen in the database of VDJ sequences is shown in Figure 1. The classification of alleles was made with little reference to the frequency of alignments in the VDJ sequence database. The validity of the classification system is supported by the fact that 56% (46/82) of Levels 1 and 2 alleles were commonly seen in the database (≥20 alignments per allele), while 78% (81/104) of the Level 5 sequences were never seen in the database of 4718 VDJ sequences. The 104 Level 5 alleles that we believe should be removed from the expressed repertoire are shown in Table 3.
Figure 1. Percentage of alleles, ranked from functionally rare (0 alignments) to common (≥20 alignments), as identified in a database of 4718 rearranged VDJ sequences. Data are shown separately for alleles of different classification Levels (L1–L5).
Identification of unreported polymorphisms
Analysis of the database of VDJ sequences showed that some alleles that were frequently seen in the database of rearranged sequences showed unexpected patterns of mutations. The χ² tests showed a highly significant absence of unmutated sequences among rearrangements to IGHV2-5*01 (4 unmutated sequences among 83 rearrangements), IGHV3-11*03 (0 of 24), IGHV3-64*05 (0 of 19) and IGHV4-39*06 (0 of 84). Analysis of these rearrangements led to the identification of shared 'mismatches' which are likely to have arisen because of sequencing errors in the reported germline sequences for IGHV3-11*03, IGHV3-64*05 and IGHV4-39*06. This led to the inference of the putative alleles IGHV3-11*p04, IGHV3-11*p05, IGHV3-64*p06, IGHV3-64*p07 and IGHV4-39*p07. The identification of four unmutated IGHV2-5*01 sequences suggests that this allele exists, but the relative rarity of unmutated sequences among the 83 IGHV2-5*01 alignments led to the identification of the putative allele IGHV2-5*p11. This sequence was formed by extending IGHV2-5*10 with the final seven nucleotides of IGHV2-5*01. The identification of this putative allele led to a further review of all short alleles, and to the identification of the putative alleles IGHV1-18*p03, IGHV2-70*p14, IGHV3-43*p03 and IGHV4-34*p14. These putative alleles represent extended versions of the reported alleles IGHV1-18*02, IGHV2-70*04, IGHV3-43*02 and IGHV4-34*03. Finally, a wider review of apparent mutations using multiple sequence alignment led to the inference of the putative alleles IGHV3-49*p04 and IGHV3-49*p05. The 12 putative alleles are presented in Supplementary Figure S1, and the results of a realignment of sequences against an expanded germline repertoire that included the 12 putative alleles are shown in Supplementary Table S1.
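A test of this kind can be sketched as follows. This is a simplified illustration and not the analysis reported here: it collapses the comparison to two categories (unmutated versus mutated), whereas the Methods describe a comparison of the frequencies of different levels of mutation, so the statistics it produces need not match those of the study. The expected proportion of 23% is the overall figure given above; the function name is hypothetical.

```python
import math

def chi2_unmutated(n_unmutated, n_total, p_expected=0.23):
    """Chi-square goodness-of-fit test (1 degree of freedom) comparing the
    observed unmutated/mutated counts for one allele with the counts expected
    from the overall proportion of unmutated sequences."""
    observed = [n_unmutated, n_total - n_unmutated]
    expected = [n_total * p_expected, n_total * (1 - p_expected)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Counts taken from the text: (unmutated sequences, total rearrangements).
for allele, k, n in [("IGHV2-5*01", 4, 83), ("IGHV3-11*03", 0, 24),
                     ("IGHV3-64*05", 0, 19), ("IGHV4-39*06", 0, 84)]:
    stat, p = chi2_unmutated(k, n)
    print(f"{allele}: chi2 = {stat:.2f}, p = {p:.2g}")
```

For alleles with few alignments the expected count of unmutated sequences falls below five, where the chi-square approximation is unreliable and an exact binomial test would be a more cautious choice.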
In order to search for possible new alleles, we carried out genomic IGHV sequencing by PCR from buccal swabs of six individuals. PCR products were cloned and inserts from 71 clones were sequenced. The existence of three of the putative alleles, IGHV3-49*p04, IGHV3-49*p05 and IGHV4-39*p07, was confirmed. We therefore propose that these sequences be recognized as IGHV3-49*04 (Level 2), IGHV3-49*05 (Level 2) and IGHV4-39*07 (Level 1). In addition, the existence of two alleles (IGHV3-23*03 and IGHV3-73*02) that had each only been reported once before was confirmed. The remaining sequences that were produced were all either pseudogenes or functional Level 1 and 2 sequences.
Discussion
Reported immunoglobulin heavy-chain gene sequences are overwhelmingly the product of early studies, principally between 1987 and 1995,17 while over the last 10 years, only two supposedly functional alleles have been reported.18, 19 Early workers had little understanding of the organization of the immunoglobulin locus, other than knowing that it contained multiple highly similar genes. Typically, degenerate PCR primers were used to amplify sequences from biological samples obtained from a handful of subjects. It is therefore not surprising that these studies report many highly similar sequences.
Now that the human heavy-chain genes have been systematically named,20 and now that genomic sequencing has confirmed the general features of the locus,21 it is possible to reexamine early reports and to infer that many sequences reported in these studies contain nucleotide errors. This is most obviously seen in studies that reported multiple highly similar sequences from a single individual. The possibility that some or all of these sequences could include Taq polymerase-mediated PCR errors was explicitly acknowledged in many of these early studies.11, 12, 15 In addition to PCR errors, apparent differences were sometimes introduced through the use of coding region primers. This is true, for example, for the allelic 'variants' IGHV2-70*02, IGHV2-70*03, IGHV2-70*06 and IGHV2-70*07, which all share primer-mediated differences to the common IGHV2-70*01 allele at their 3' ends.12 The amplification of chimeric sequences22 and errors introduced during manual data entry are likely to explain other sequences.
Although the initial objective of this study was to determine the likelihood that germline sequences could have been reported in error, the analysis of a large database of rearranged VDJ genes also allowed us to investigate unreported polymorphisms. A total of 12 putative polymorphisms were identified.
Many alleles that were reported in the early 1990s were amplified by PCR, using primers that were based upon sequences at the 3' end of the FR3 region of the genes.12, 23, 24, 25 Consequently, many of these sequences are not full length. Five of the putative alleles reported here were produced by extending apparently truncated sequences. Seven other alleles were identified from the overabundance of particular mismatches within rearranged VDJ gene alignments. Three of these sequences were subsequently identified by genomic screening. Two of the sequences, IGHV3-49*04 and IGHV3-49*05, have also been highlighted as likely polymorphisms (humIGHV106 and humIGHV121) in the VBASE2 repertoire.26 This automatically generated repertoire identifies IGHV sequences from both genomic and rearranged gene sequences. humIGHV106 was identified from the whole genome shotgun sequence database.27 humIGHV121 was identified in the high throughput genomic sequence database.
Not only does the identification of three new polymorphisms give credence to our bioinformatic analysis, but it also suggests that relatively few common alleles remain to be reported. On the basis of the number of alignments seen to the new and to previously reported alleles, we estimate that the remaining putative polymorphisms identified in this study are present in between about 5% (IGHV4-34*p14) and 45% (IGHV3-64*p07) of the population (data not shown), and therefore believe the existence of these alleles should eventually be confirmed. Polymorphisms that remain to be identified are likely to be either rare alleles of commonly expressed genes or more common alleles of rarely expressed genes. They will, therefore, only very rarely lead to the misidentification of rearranged genes.
A motivation for this study was the need to improve the identification of mutations within rearranged VDJ genes, for the extent of mutation in immunoglobulin genes is an important prognostic indicator for patients with CLL.2, 3 Poor prognosis is associated with relatively unmutated sequences, while a better prognosis is associated with more highly mutated sequences. The accurate determination of mutation numbers is therefore critical. This will be aided by both the refinement of the reported repertoire and the classification system described in this study. This study shows that it is unlikely that analysis of CLL sequences would be compromised by the existence of unreported polymorphisms. The reported IGHV gene repertoire is essentially complete, and at least within those ethnic populations whose sequences dominate gene sequence databases, the expression of unreported polymorphisms in rearranged VDJ genes is likely to be very rare.
Methods
Compilation of immunoglobulin sequence databases
A database of IGHV germline genes was first compiled from the IMGT IGHV gene database (http://imgt.cines.fr/),6 and was last updated from this site on 3 August 2007. Pseudogenes that were identified from the IMGT annotations were not included, but the eight open reading frames (ORFs) were retained. These ORFs are defined by IMGT as genes whose coding region has an ORF, but which cannot be transcribed, translated or folded correctly,28 because of apparently defective splicing sites, recombination signals or regulatory elements.
The publications that reported each germline gene were identified from IMGT annotations, and publication details were included in the database. Each publication was reviewed, with special attention being paid to primers used, and to the number of individuals whose DNA was amplified in each study.
The germline genes were analyzed by multiple sequence alignment, using ClustalW (Accuracy)29 via the platform Angis Biomanager (www.angis.org.au). By reference to other alleles of the same gene, alleles were identified and recorded if they were apparently truncated by nine or more nucleotides at their 3' and/or 5' ends. Germline sequences that were identical to another sequence at all but a single nucleotide, and germline sequences with missing nucleotides or ambiguities were also noted.
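The screening criteria in this paragraph can be sketched on sequences that have already been aligned. This is a minimal illustration under stated assumptions, not the procedure applied with ClustalW: it assumes that all alleles of a gene are held as equal-length strings in which missing end nucleotides appear as '-' characters, and the function names are hypothetical. The threshold of nine nucleotides is taken from the text.

```python
def end_truncation(aligned, gap="-"):
    """Return (missing_5prime, missing_3prime): the number of gap characters
    at each end of an aligned germline sequence."""
    five = len(aligned) - len(aligned.lstrip(gap))
    three = len(aligned) - len(aligned.rstrip(gap))
    return five, three

def is_truncated(aligned, min_missing=9):
    """True if nine or more nucleotides are missing from either end."""
    five, three = end_truncation(aligned)
    return five >= min_missing or three >= min_missing

def differs_by_one(a, b):
    """True if two aligned sequences are identical at all but one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

# Toy sequences, not real IGHV alleles.
full = "ACGT" * 10
short = full[:-10] + "-" * 10
variant = full[:5] + "T" + full[6:]

print(end_truncation(short), is_truncated(short), differs_by_one(full, variant))
```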
A database of rearranged VDJ sequences was compiled from the European Molecular Biology Laboratory database (http://www.ebi.ac.uk/embl/).30 Where sets of clonally related sequences were found, on the basis of shared IGHV, IGHD and IGHJ genes, N nucleotides and junction length, only the least mutated sequence was retained. Duplicate and incomplete sequences, as well as sequences containing ambiguities were also removed from the database. The final data set was made up of 4718 VDJ sequences.
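The removal of clonally related sequences can be sketched as a grouping step. This is an illustrative sketch only: the record type and its field names are hypothetical, and exact matching of N nucleotides is a simplifying assumption, since the text does not state how closely these had to agree for sequences to be judged clonally related.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rearrangement:
    seq_id: str
    ighv: str
    ighd: str
    ighj: str
    n_nucleotides: str
    junction_length: int
    mutations: int

def keep_least_mutated(rearrangements):
    """Group sequences that share IGHV, IGHD and IGHJ genes, N nucleotides and
    junction length, and keep only the least mutated member of each group."""
    best = {}
    for r in rearrangements:
        key = (r.ighv, r.ighd, r.ighj, r.n_nucleotides, r.junction_length)
        if key not in best or r.mutations < best[key].mutations:
            best[key] = r
    return list(best.values())

# Toy records with placeholder identifiers and N nucleotides.
data = [
    Rearrangement("s1", "IGHV3-23*01", "IGHD3-10*01", "IGHJ4*02", "GGTC", 45, 12),
    Rearrangement("s2", "IGHV3-23*01", "IGHD3-10*01", "IGHJ4*02", "GGTC", 45, 3),
    Rearrangement("s3", "IGHV4-34*01", "IGHD2-2*01", "IGHJ6*02", "AC", 51, 0),
]
retained = keep_least_mutated(data)
print([r.seq_id for r in retained])
```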
Sequence analysis of rearranged IGHV genes
Rearranged VDJ sequences were aligned to the reported germline repertoire using the iHMMune-align program (www.emi.unsw.edu.au/~ihmmune),31 and using scoring matrices that are appropriate for 90, 95 and 99% identity between the input VDJ sequence and its germline IGHV pair.32 Alignments were repeated after extending all short alleles using the 3' end nucleotides of the most commonly expressed allele of each of these IGHV genes. Where discrepancies were noted between alignments to a reported allele and to its extended version, the alignments were manually analyzed to determine whether or not a truncated version of the allele could have been reported.
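The extension of short alleles can be sketched as a string operation. This is a simplified illustration rather than the procedure used with iHMMune-align: it assumes that the short allele and the reference allele begin at the same 5' position and differ by no insertions or deletions, and the function name and toy sequences are hypothetical.

```python
def extend_short_allele(short_seq, reference_seq):
    """Append to a truncated allele the 3' nucleotides of a full-length
    reference allele (for example, the most commonly expressed allele of the
    same gene) that lie beyond the end of the truncated sequence."""
    if len(short_seq) >= len(reference_seq):
        return short_seq
    return short_seq + reference_seq[len(short_seq):]

# Toy sequences, not real IGHV alleles: the short allele differs from the
# reference at one position and lacks its final three nucleotides.
reference = "ACGTTCGGA"
short = "ACGTAC"
print(extend_short_allele(short, reference))
```

The putative allele IGHV2-5*p11 described in the Results is an extension of this kind, formed from IGHV2-5*10 and the final seven nucleotides of IGHV2-5*01.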
The number of mutations in each rearranged sequence was noted from the output of the program, and the frequencies with which different levels of mutation were seen in alignments to each germline gene were then compared to the overall distribution of mutations in the data set by χ² test. Where the number of sequences that aligned to a particular gene/allele included an unexpectedly low number of unmutated sequences, additional analysis was performed, using multiple sequence alignment. Where shared mismatches were commonly seen at a particular position within a sequence, unreported polymorphisms were inferred. These putative alleles were named by the inclusion of the descriptor 'p' in their allele name, for example IGHV3-49*p04.
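The search for shared mismatches can be sketched as a tally over aligned sequences. This is a minimal illustration, not the multiple sequence alignment analysis described above: it assumes that each rearranged sequence has already been aligned to the germline allele without gaps, and the function name and the reporting threshold of 50% are hypothetical choices made for the example.

```python
from collections import Counter

def shared_mismatches(germline, rearranged_seqs, min_fraction=0.5):
    """Return {(position, nucleotide): fraction} for every mismatch to the
    germline allele that recurs in at least min_fraction of the sequences.
    Positions are 0-based offsets into the germline sequence."""
    counts = Counter()
    for seq in rearranged_seqs:
        for pos, (g, s) in enumerate(zip(germline, seq)):
            if g != s:
                counts[(pos, s)] += 1
    n = len(rearranged_seqs)
    return {key: c / n for key, c in counts.items() if c / n >= min_fraction}

# Toy sequences, not real IGHV alleles.
germline = "ACGTACGT"
aligned = ["ACGTTCGT", "ACGTTCGT", "ACGTTCGA", "ACGTACGT"]
print(shared_mismatches(germline, aligned))
```

A mismatch that recurs at one position in most alignments is more readily explained by an error or unreported polymorphism in the germline sequence than by independent somatic mutation, whereas sporadic mismatches fall below the threshold.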
DNA isolation and amplification
In order to confirm the existence of putative alleles, genomic screening was undertaken, with a focus on genes of the IGHV3 and IGHV4 families, as the putative alleles from these families were collectively responsible for most of the IGHV gene alignments to the putative alleles in the database of rearranged VDJ sequences. Buccal smears were collected from six volunteers, with the approval of the UNSW Human Research Ethics Committee. DNA was extracted and IGHV sequences were amplified as previously described,33 using Pfu polymerase and the following family-specific forward and reverse primers: 5'-ATGGAGTTTGGGCT(T,G)AGCT-3' (IGHV3 forward primer), 5'-(A,C)TG(A,G)C(C,T)TCCCCTC(A,G)CT(C,G)TG-3' (IGHV3 reverse primer), 5'-CTGTTCACAGGGGTCCTGTC-3' (IGHV4 forward primer) and 5'-ACTCACCTCCCCTCACTGTG-3' (IGHV4 reverse primer). PCR products were then cloned and sequenced as previously described32 and the IGHV gene sequences were then aligned against the germline sequence database using iHMMune-align.31
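The bracketed positions in the IGHV3 primers denote degenerate bases, so each primer represents a pool of oligonucleotides. The following sketch enumerates that pool; the function name is hypothetical, and the parser assumes only the '(X,Y)' notation used above.

```python
import itertools
import re

def expand_primer(primer):
    """Expand '(A,C)'-style degenerate positions into every concrete
    oligonucleotide sequence represented by the primer."""
    tokens = re.findall(r"\(([ACGT,]+)\)|([ACGT])", primer)
    choices = [group.split(",") if group else [base] for group, base in tokens]
    return ["".join(combo) for combo in itertools.product(*choices)]

# Primer sequences as given in the text, written 5' to 3'.
ighv3_forward = "ATGGAGTTTGGGCT(T,G)AGCT"
ighv3_reverse = "(A,C)TG(A,G)C(C,T)TCCCCTC(A,G)CT(C,G)TG"

forward_pool = expand_primer(ighv3_forward)
reverse_pool = expand_primer(ighv3_reverse)
print(len(forward_pool), len(reverse_pool))
```

The forward primer has one two-fold degenerate position and the reverse primer has five, so the pools contain 2 and 32 sequences, respectively.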
References
1. Jung D, Giallourakis C, Mostoslavsky R, Alt FW. Mechanism and control of V(D)J recombination at the immunoglobulin heavy chain locus. Annu Rev Immunol 2006; 24: 541–570.
2. Damle RN, Wasil T, Fais F, Ghiotto F, Valetto A, Allen SL et al. Ig V gene mutation status and CD38 expression as novel prognostic indicators in chronic lymphocytic leukemia. Blood 1999; 94: 1840–1847.
3. Hamblin TJ, Davis Z, Gardiner A, Oscier DG, Stevenson FK. Unmutated Ig V(H) genes are associated with a more aggressive form of chronic lymphocytic leukemia. Blood 1999; 94: 1848–1854.
4. Lane BS, Mensah AA, Lin K, Pettitt AR, Sherrington PD. Analysis of VH gene sequences using two web-based immunogenetics resources gives different results, but the affinity maturation status of chronic lymphocytic leukaemia clones as assessed from either of the resulting data sets has no prognostic significance. Leukemia 2005; 19: 741–749.
5. Pekova S, Baran-Marszak F, Schwarz J, Matoska V. Mutated or non-mutated? Which database to choose when determining the IgVH hypermutation status in chronic lymphocytic leukemia? Haematologica 2006; 91: ELT01.
6. Pallares N, Lefebvre S, Contet V, Matsuda F, Lefranc MP. The human immunoglobulin heavy variable genes. Exp Clin Immunogenet 1999; 16: 36–60.
7. Lee CEH, Gaëta B, Malming HR, Bain ME, Sewell WA, Collins AM. Reconsidering the human immunoglobulin heavy chain locus. 1. An evaluation of the expressed human IGHD gene repertoire. Immunogenetics 2006; 57: 917–925.
8. Lee CEH, Jackson KJL, Sewell WA, Collins AM. Use of IGHJ and IGHD gene mutations in analysis of immunoglobulin sequences for the prognosis of chronic lymphocytic leukemia. Leuk Res 2007; 31: 1247–1252.
9. Ruiz M, Pallares N, Contet V, Barbie V, Lefranc MP. The human immunoglobulin heavy diversity (IGHD) and joining (IGHJ) segments. Exp Clin Immunogenet 1999; 16: 173–184.
10. Adderson EE, Azmi FH, Wilson PM, Shackelford PG, Carroll WL. The human VH3b gene subfamily is highly polymorphic. J Immunol 1993; 151: 800–809.
11. Olee T, Yang PM, Siminovitch KA, Olsen NJ, Hillson J, Wu J et al. Molecular basis of an autoantibody-associated restriction fragment length polymorphism that confers susceptibility to autoimmune diseases. J Clin Invest 1991; 88: 193–203.
12. Campbell MJ, Zelenetz AD, Levy S, Levy R. Use of family specific leader region primers for PCR amplification of the human heavy chain variable region gene repertoire. Mol Immunol 1992; 29: 193–203.
13. Weng NP, Snyder JG, Yu-Lee LY, Marcus DM. Polymorphism of human immunoglobulin VH4 germ-line genes. Eur J Immunol 1992; 22: 1075–1082.
14. van Es JH, Heutink M, Aanstoot H, Logtenberg T. Sequence analysis of members of the human Ig VH4 gene family derived from a single VH locus. Identification of novel germ-line members. J Immunol 1992; 149: 492–497.
15. Andris JS, Brodeur BR, Capra JD. Molecular characterisation of human antibodies to bacterial antigens: utilization of the less frequently expressed VH2 and VH6 heavy chain variable region gene families. Mol Immunol 1993; 30: 1601–1616.
16. Cook GP, Tomlinson IM, Walter G, Riethman H, Carter NP, Buluwela L et al. A map of the human immunoglobulin VH locus completed by analysis of the telomeric region of chromosome 14q. Nat Genet 1994; 7: 162–168.
17. Cook GP, Tomlinson IM. The human immunoglobulin VH repertoire. Immunol Today 1995; 16: 237–242.
18. Ohm-Laursen L, Larsen SR, Barington T. Identification of two new alleles, IGHV3-23*04 and IGHJ6*04, and the complete sequence of the IGHV3-h pseudogene in the human immunoglobulin locus and their prevalences in Danish Caucasians. Immunogenetics 2005; 57: 621–627.
19. Romo-Gonzalez T, Morales-Montor J, Rodriguez-Dorantes M, Vargas-Madrazo E. Novel substitution polymorphisms of human immunoglobulin VH genes in Mexicans. Hum Immunol 2005; 66: 732–740.
20. Lefranc MP. Nomenclature of the human immunoglobulin heavy (IGH) genes. Exp Clin Immunogenet 2001; 18: 100–116.
21. Matsuda F, Ishii K, Bourvagnet P, Kuma K, Hayashida H, Miyata T et al. The complete nucleotide sequence of the human immunoglobulin heavy chain variable region locus. J Exp Med 1998; 188: 2151–2162.
22. Meyerhans A, Vartanian JP, Wain-Hobson S. DNA recombination during PCR. Nucleic Acids Res 1990; 18: 1687–1691.
23. Friedman DF, Cho EA, Goldman J, Carmack CE, Besa EC, Hardy RR et al. The role of clonal selection in the pathogenesis of an autoreactive human B cell lymphoma. J Exp Med 1991; 174: 525–537.
24. Tomlinson IM, Walter G, Marks JD, Llewelyn MB, Winter G. The repertoire of human germline VH sequences reveals about fifty groups of VH segments with different hypervariable loops. J Mol Biol 1992; 227: 776–798.
25. Victor KD, Pascual V, Lefvert AK, Capra JD. Human anti-acetylcholine receptor antibodies use variable gene segments analogous to those used in autoantibodies of various specificities. Mol Immunol 1992; 29: 1501–1506.
26. Retter I, Althaus HH, Munch R, Muller W. VBASE2, an integrative V gene database. Nucleic Acids Res 2005; 33: D671–D674.
27. Istrail S, Sutton GG, Florea L, Halpern AL, Mobarry CM, Lippert R et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci USA 2004; 101: 1916–1921.
28. Giudicelli V, Lefranc MP. Ontology for immunogenetics: the IMGT-ONTOLOGY. Bioinformatics 1999; 15: 1047–1054.
29. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994; 22: 4673–4680.
30. Kulikova T, Akhtar R, Aldebert P, Althorpe N, Andersson M, Baldwin A et al. EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res 2007; 35: D16–D20.
31. Gaëta B, Malming HR, Jackson KJL, Bain ME, Wilson P, Collins AM. iHMMune-align: hidden Markov model-based alignment and identification of germline segments in immunoglobulin gene sequences. Bioinformatics 2007; 23: 1580–1587.
32. States DJ, Wish W, Altschul SF. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: A Companion to Methods in Enzymology 1991; 3: 66–70.
33. Dahlke I, Nott DJ, Ruhno J, Sewell WA, Collins AM. Antigen selection in the IgE response of allergic and non-allergic individuals. J Allergy Clin Immunol 2006; 117: 1477–1483.
Acknowledgements
This study was supported by a grant from the National Health and Medical Research Council.
Supplementary Information accompanies the paper on Immunology and Cell Biology website (http://www.nature.com/icb)

