Current state of diversity in genomics studies

Genomic studies have mainly used samples from individuals of European ancestry, at the expense of learning from the largest and most genetically diverse populations. For example, 78% of individuals included in genome-wide association studies (GWAS) reported in the GWAS Catalog through January 2019 are of European descent1, while Asian populations account for 59.5% of the world population based on the Population Reference Bureau’s World Population Data Sheet. Though this is partially due to inadequate sampling of non-European populations, researchers also tend to exclude data from minority groups when conducting statistical analyses2, even when diverse datasets are available. The limited inclusion of samples from diverse populations hinders the equitable advancement of genomic medicine as a result of persistent uncertainty with respect to the genetic etiology of disease across populations, as well as differential rates of adverse drug events, treatment outcomes and other health disparities.

In recent years there has been an increased awareness of the limited generalizability of findings across populations and the benefits for the discovery and interpretation of gene–trait associations brought about by the inclusion of diverse populations in genomic studies. This has motivated the inclusion of diverse, multiethnic populations in large-scale genomic studies. For example, whole-genome sequencing in individuals of African descent3 and whole-exome sequencing in a southern African population4 have improved understanding of genetic variation in under-represented populations. Additional efforts have been made to establish reference genome datasets for research in diverse populations; these include the GenomeAsia 100K Project, Human Heredity and Health in Africa (H3Africa) initiative, Taiwan Biobank, Population Architecture Using Genomics and Epidemiology (PAGE) Consortium, Trans-Omics for Precision Medicine (TOPMed) program, Clinical Sequencing Evidence-Generating Research (CSER) consortium, Human Genome Reference Program (HGRP) and All of Us Research Program. However, the field of immunogenomics, especially that related to adaptive immune receptors, has yet to benefit from a similar growth in diversity.

The need for diversity in immunogenomics

Central to immunity are the repertoires of T cell receptors (TCRs), immunoglobulins, human leukocyte antigens (HLAs) and killer cell immunoglobulin-like receptors (KIRs). Thus, analyses of the loci that encode these molecules are critical to immunogenomics studies.

T cells and B cells recognize antigens through their TCRs and immunoglobulins, which are formed through the process of V(D)J (variable, diversity and joining region) recombination. Capturing the vast diversity of recombined, expressed TCR and immunoglobulin repertoires was not possible until the development of high-throughput sequencing techniques in the late 2000s. Freeman and colleagues employed 5′ rapid amplification of cDNA ends (5′ RACE) PCR to amplify TCR cDNA and to characterize TCR repertoires5. Weinstein and colleagues sequenced an antibody repertoire in zebrafish6 in 2009, creating the foundation of adaptive immune receptor repertoire sequencing (AIRR-seq) technologies. In 2010, Boyd and colleagues applied AIRR-seq to human immunoglobulins7. Since then, studies including AIRR-seq have seen exponential growth, and findings from these studies have shaped our understanding of human immune repertoires in different settings8.

AIRR-seq analysis and other immunogenomics studies offer new opportunities to deepen our understanding of the immune system in the context of a variety of human diseases, including infectious diseases9, cancer10, autoimmune conditions11 and neurodegenerative diseases12. Furthermore, AIRR-seq data provides information on expression profiles; germline V, D and J gene usage; complementarity-determining region (CDR) diversity; and, in the case of immunoglobulin repertoires, somatic hypermutation levels. There have been extensive efforts to explore the genetic diversity of the HLA and KIR systems13,14, and the knowledge gained from these efforts could be integrated into the AIRR-seq studies of TCR and immunoglobulin germline gene diversity.
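As a rough illustration of two of the summaries mentioned above, the following minimal sketch computes V gene usage and a simple CDR3 diversity measure from a handful of toy records. It assumes records that use the AIRR Community Rearrangement field names `v_call` and `junction_aa`; the sequences, the helper function names and the choice of Shannon entropy as the diversity measure are illustrative assumptions, not a method taken from the studies cited here.

```python
from collections import Counter
import math

# Toy records using AIRR Rearrangement-style field names (v_call, junction_aa).
# The values are invented for illustration only.
rearrangements = [
    {"v_call": "IGHV1-69*01", "junction_aa": "CARDYW"},
    {"v_call": "IGHV1-69*01", "junction_aa": "CARDYW"},
    {"v_call": "IGHV3-23*01", "junction_aa": "CAKGGW"},
    {"v_call": "IGHV4-34*01", "junction_aa": "CARSSW"},
]


def gene_usage(records, field="v_call"):
    """Return the fraction of records assigned to each gene (allele suffix dropped)."""
    counts = Counter(r[field].split("*")[0] for r in records)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def shannon_diversity(records, field="junction_aa"):
    """Return the Shannon entropy (in nats) of the distribution of distinct sequences."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


usage = gene_usage(rearrangements)
diversity = shannon_diversity(rearrangements)
print(usage)
print(round(diversity, 4))
```

In practice such summaries would be computed on full repertoires after quality filtering, and comparisons between individuals or populations would need to account for sequencing depth.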

As in the field of genomics, greater diversity in immunogenomics research has the potential to enable the discovery of novel genetic traits associated with immune system phenotypes that are common or different across populations. While evidence for extensive diversity in germline TCR and immunoglobulin genes has been reported in the human population15,16, most AIRR-seq studies of T and B cell receptor repertoires have been conducted in individuals of European descent, leaving other populations under-represented17. Exclusion of non-European populations in genomics research limits our understanding of how pathogens have exerted selective pressures on immune-related genes in populations living in different environments, and thus on infectious disease manifestation18.

Germline gene diversity and databases

A critical step in AIRR-seq studies is germline gene assignment, which requires reliable and comprehensive databases of germline V(D)J alleles representing different populations. So far, such databases are lacking because the genetic regions encoding these genes have been exceptionally challenging to characterize at the genomic level. Not only do these loci contain a mixture of functional genes and pseudogenes with high similarity, but they are also characterized by considerable structural variation, with deletions and duplications occurring at high frequency in different populations. Given the complexity of the TCR and immunoglobulin genomic loci and deficits in existing germline databases, the determination of immune receptor germline gene usage from bulk RNA-seq or whole-genome sequencing is often inaccurate. Efforts to improve germline databases are therefore critical for improved coverage of diversity in immune repertoire analysis. Computational methods to infer germline TCR and immunoglobulin genes from AIRR-seq data are expected to accelerate these efforts19,20,21,22,23,24 (Table 1). Comparisons are also needed between results obtained from methods for inferring germline gene variants from AIRR-seq repertoires25 and from direct sequencing of genomic DNA15, such as the sequencing and assembly of large-insert clones (for example, bacterial artificial chromosome (BAC) and fosmid clones)16 and, more recently, whole-genome sequencing and targeted long-read sequencing26.
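To make the idea of germline gene assignment concrete, the sketch below assigns each read to its closest allele in a toy reference set and flags recurrent sequences that match no known allele exactly. Everything in it is a simplifying assumption made for illustration: the allele names and sequences are invented, reads are taken to be pre-aligned and of equal length, and Hamming distance with a fixed support threshold stands in for the statistical models used by real inference tools, which must also separate germline polymorphism from somatic hypermutation and sequencing error. It does not reproduce the method of any tool listed in Table 1.

```python
from collections import Counter

# Hypothetical toy germline database; real allele sequences are far longer.
germline_db = {
    "IGHV_toy*01": "ACGTACGTAC",
    "IGHV_toy*02": "ACGTTCGTAC",
}


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_allele(read, db):
    """Return (allele_name, mismatches) for the closest allele in the database."""
    return min(
        ((name, hamming(read, seq)) for name, seq in db.items()),
        key=lambda pair: pair[1],
    )


def candidate_novel_alleles(reads, db, min_support=3):
    """Return sequences that differ from their closest allele and recur often enough."""
    candidates = {}
    for seq, n in Counter(reads).items():
        allele, mismatches = assign_allele(seq, db)
        if mismatches > 0 and n >= min_support:
            candidates[seq] = (allele, mismatches, n)
    return candidates


# Five exact matches, four copies of a recurrent variant, one likely error.
reads = ["ACGTACGTAC"] * 5 + ["ACGTACGTAT"] * 4 + ["ACGAACGTAC"]
candidates = candidate_novel_alleles(reads, germline_db)
print(candidates)
```

The example also shows why an incomplete database matters: a read from an unlisted allele is still assigned to its nearest listed neighbour, so the gap goes unnoticed unless mismatches are examined explicitly.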

Table 1 Tools for inference of germline TCR and immunoglobulin genes from AIRR-seq data

The most widely used reference database for immunogenomics data, the international ImMunoGeneTics information system (IMGT)27, has been a valuable resource. However, it lacks a comprehensive set of human TCR and immunoglobulin alleles representing diverse populations worldwide. Further uncertainty stems from descriptions of sample populations in databases being based on geography or self-identified race and/or ethnicity of study subjects, rather than genetic ancestry. As a result, we have a limited understanding of population-level TCR and immunoglobulin germline gene variation. However, progress is being made.

The AIRR Community (AIRR-C) is an international community of bioinformaticians and immunogeneticists that has been formed to develop standards and protocols to promote sharing and common analysis approaches for AIRR-seq data, including the AIRR Data Commons28. As a means to enrich available germline gene sets, the AIRR-C established the Inferred Allele Review Committee (IARC) to review and curate new immunoglobulin or TCR germline genes inferred from AIRR-seq data. Its work is underpinned by the Open Germline Receptor Database, which provides submission and review workflows. IARC-affirmed sequences are published in this database, together with supporting evidence. VDJbase was also recently launched as a public database that allows users to access population-level immunoglobulin and TCR germline data, including reports and summary statistics on germline genes, alleles, single nucleotide and structural variants, and haplotypes of interest derived from AIRR-seq and genomic sequencing data. It currently contains AIRR-seq data from 421 human donors, representing 724 immunoglobulin heavy chain gene alleles. The integration of TCR datasets is in progress. Together these initiatives will help pave the way for the development of approaches that extend germline curation efforts to include more data types and ultimately ensure that population-level metadata can be more effectively captured and leveraged.

Recommendations for the immunology community

The immunology community should make targeted efforts to include non-European populations in AIRR-seq and other immunogenomics studies. Already, AIRR-seq studies in more diverse populations have uncovered evidence for extensive genetic heterogeneity. For example, in a study of South Africans with HIV, Scheepers and colleagues discovered many immunoglobulin heavy chain variable (IGHV) alleles that were not represented in IMGT15, information of relevance to HIV vaccine design aimed at germline-targeting immunogens29. In a study in the Papua New Guinea population, 1 new IGHV gene and 16 IGHV allelic variants were identified from AIRR-seq data30. These discoveries of alleles indicate the need for further population-based AIRR-seq datasets and the identification and validation of the presence of new alleles so that they can be added to public databases. It will be critical to conduct studies in various human populations if we are to fully understand how AIRR-seq can be leveraged to make improvements in a wide range of applications, including vaccine design.

Further, we suggest that extant open AIRR-seq datasets could be used to augment immunoglobulin and TCR germline databases and inform AIRR-seq and other immunogenomics studies across diverse populations. It may be possible in the future to use AIRR-seq data to infer genetic ancestry, but such bioinformatics methods are yet to be developed and thus the utility of genetic ancestry in this field has yet to be demonstrated. Conclusions about new germline variants discovered through non-targeted sequencing data, including RNA-seq based on short read sequences, should be drawn with caution owing to the complexity of the adaptive immune receptor loci26, as described above. New methodologies and computational approaches should be developed to facilitate the inclusion of diverse population datasets into existing databases, with the aim of enhancing our knowledge base to reflect the genomic immunological diversity of populations around the globe. Such enriched databases would provide researchers with baseline resources to design and implement the next generation of personalized and precision immunodiagnostics and therapeutics31.

At the current stage of the global COVID-19 pandemic, many vaccine trials and programs are underway worldwide, offering opportunities to investigate the role of genetic factors in vaccine-mediated immune responses. Such investigations will require careful study designs to effectively address potential confounding factors such as environmental, economic and social determinants of health that systematically differ between populations defined by self-identified measures of diversity and that are correlated with continental-level ancestry32. Incomplete representation of diverse populations limits our capacity to address the impact of genetics on clinical phenotypes, and ideally this should be investigated alongside non-genetic risk factors for disease. Different genetic variants in an etiologic pathway modify the clinical presentation of disease, and these effects can differ by genomic background33. Specific immunoglobulin germline genes, and in some cases alleles, have been found to be preferentially used in the response to pathogens, suggesting a degree of convergence in the antibody response, as observed for influenza9, HIV-134, Zika virus35 and SARS-CoV-236. Therefore, in addition to environmental factors, genetic variability in immune genes is likely to drive differential effects in vaccine effectiveness and infection outcomes17.

Our interdisciplinary group consists of leading researchers from 17 regions, including the United States, Canada, Norway, France, Sweden, the United Kingdom, Russia, Saudi Arabia, Israel, South Africa, Nigeria, Chile, Peru, China, Japan, Taiwan and French Polynesia, who share concerns about the lack of diversity in immunogenomics and embrace a need to tackle these challenges. As an interdisciplinary group with expertise in biomedical and translational research, population and public health genetics, health disparities, computational biology and immunogenomics, we wish to raise awareness about the value of including diverse populations in AIRR-seq and immunogenomics research.