Main

In a medical context, we investigate human genomes to explain, anticipate, or mitigate their effects on the affiliated phenotypes. Researchers focus on collections of data about groups, trends, and mechanisms, but health care workers need to take the knowledge gleaned back to the individual. Whether for research or for clinical intervention, the questions may be driven primarily by phenotype or by genotype. Phenotype-driven research begins with a cohort of individuals who share characteristics, and commonality is sought among their genetic variants. Genotype-driven research ascertains individuals according to particular genetic variants and then documents the associated phenotypes.1 In a clinical context, phenotype-driven investigation is for diagnosis. A trait or condition brings to medical attention an individual whose genome may then be assayed for evidence of a particular genomic variant to confirm a suspected diagnosis, or scanned for evidence of anything unusual and assessed for the likelihood of a causal relationship. In contrast, a genotype-driven clinical investigation, such as a family study or population screening, characterizes a genotype to anticipate the possible phenotypic outcome (and perhaps to intervene).

In the first 50 or so years of clinically applied genetics, traditional karyotype analysis has been a global genomic assay with limited (albeit improving) resolution, whereas other laboratory investigations have been relatively targeted in nature, typically interrogating one genetic locus at a time. With recent technologic developments, the cytogenetic and molecular approaches are merging into one that is global in scope, but with high sensitivity and resolution. Not only is it becoming feasible, indeed practical, to scan the entire genome simultaneously in search of particular genetic flags, but the unity of the data will eventually allow a comprehensive interpretation of the genomic findings.

Two recent technologies are rapidly changing our entire approach to studying the human genome. Microarrays in various forms, some relatively targeted and others with genome-wide capacity, have been developed for comparative genomic hybridization and detection of chromosome imbalance, or for single nucleotide genotype analysis. During the same time, rapidly evolving DNA sequencing methods produced the first two genome sequences, each from a single individual, published in 2007 (ref. 2) and 2008 (ref. 3). These were each accomplished at a fraction of the expense of the Human Genome Project reference sequence, and ongoing cost improvements are ushering in the era of the personal genome and the means to directly assay genomic variation.

Perhaps the most striking finding to emerge from these new technologies has been the extent of interindividual variation accounted for, not by single base-pair differences such as single nucleotide polymorphisms (SNPs) or rare mutations, but by structural variants involving larger segments of DNA.4–7 These include both balanced rearrangements (inversions and translocations) and copy number variants (CNVs).7–9 The genome is neither as binary nor as static as we might have surmised; rather, it can be dynamic, with plenty of iteration and absence. Some variants in this class are structurally simple, but others are complex, and the CNVs can reflect either loss or gain of genetic material relative to a designated reference genome.

Of particular interest is that these structural variants are associated with a full spectrum of phenotypic outcomes, from unrecognizable or inconsequential through to those that may be incompatible with life (Fig. 1). Researchers are documenting these variant genomic sites at an exponential pace, which we anticipate will approach an asymptote within the next 5 years, at least with respect to the polymorphic variants. The concomitant activity is to catalogue the nature and extent of human variation associated with each of these variant loci, an activity that is likely to be ongoing indefinitely. For this genotypic information to be useful, particularly in a clinical context, it needs to be related to phenotypic outcomes, and to that end, we have barely scratched the surface. This area of investigation, in the realm that is intermediate between microscopic chromosome analysis and gene mutation assays, is already revealing both genotypes and phenotypes that can be far more complex than those associated with classical cytogenetic or Mendelian traits. From that complexity, however, are likely to emerge explanations not only for overtly maladaptive syndromes, disorders, and diseases but also for adaptive traits, variable responses and susceptibilities, common and complex traits, subtle individual distinguishing features or idiosyncrasies, and the opportunity to accommodate changing environments.

Fig 1

CNV characteristics and frequency. Conceptual curves show projected frequencies in the population for SNPs and point mutations (dashed gray) and CNVs (blue) with different characteristic associations. Some examples of traits that correspond to the different groupings are described in Table 1 and Figure 2. Although real data would not currently follow these curves, we anticipate that with time, the thorough identification of associations will show that CNVs follow the exhibited trend. These curves will also be strongly affected by environmental conditions and relationships with other variants in each genome. A penetrance curve (red) is shown for the CNVs. This is a relative curve, where 100% penetrance is indicated by a height equal to the CNV frequency curve. Two recent studies caught our interest, exemplifying how CNVs previously designated to be “benign” might move to the “adaptive trait” or “neutral trait” groupings, involving α-amylase104 and testosterone metabolism.105 Two other new studies provide examples of rare recurrent susceptibility CNVs found in schizophrenia106 and novel mechanisms for germ-line CNV effects in cancer predisposition.107

THE GENOTYPIC SPECTRUM

Array technologies and whole genome sequencing are finally drawing our focus to the kind of variation that is intermediate in size, completing the spectrum between single base variants (mutations or SNPs) and microscopically visible aneuploidies or heteromorphisms. Most of the present discussion will pertain to CNVs, which seem to be the more prevalent form of structural variation,2,10–12 though currently accessible methods detect the quantitative variants more readily than balanced translocations or inversions.13

CNVs have been defined operationally as involving segments of DNA that are 1 kb or greater in size,7 though this limit is somewhat arbitrary from a functional perspective. Smaller variants, such as minute insertions and deletions or variable number of tandem repeats are excluded from the working definition and discussion, but are recognized as part of the full genotypic spectrum.13
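
The operational threshold above can be written as a small classifier. This is a minimal sketch only: the function name, the category labels, and the decision to lump everything below 1 kb into a single "small variant" class are illustrative assumptions, not an established nomenclature.

```python
CNV_MIN_SIZE_BP = 1_000  # operational 1 kb threshold from the working definition


def classify_by_size(length_bp: int, balanced: bool = False) -> str:
    """Label a variant by size and dosage; the labels are illustrative only."""
    if length_bp < 1:
        raise ValueError("length_bp must be a positive number of base pairs")
    if length_bp < CNV_MIN_SIZE_BP:
        # Minute insertions/deletions and similar variants fall outside the
        # working CNV definition but remain part of the genotypic spectrum.
        return "small variant (below CNV threshold)"
    if balanced:
        # No net loss or gain of material (e.g., inversion, translocation).
        return "balanced structural variant"
    return "CNV"
```

Because the 1 kb limit is somewhat arbitrary from a functional perspective, keeping it as a named constant makes it easy to explore how a different cutoff would change the classification.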

Many genomic structural variants characterized to date have been associated with structures called “segmental duplications” or “low-copy repeats”: segments that predispose to genomic rearrangement during meiosis by nonallelic homologous recombination (NAHR).14 Because these sequences are vulnerable, the resultant rearrangements tend to recur, creating clusters of variants with common endpoints. Other rearrangements that are not in association with such duplicated elements are more randomly distributed and nonrecurrent. Two mechanisms have been proposed to explain the latter: nonhomologous end joining15 and replication fork stalling and template switching.16 Recent higher-resolution data are contradictory as to the predominant mechanism mediating the majority of genomic imbalances. For example, two studies estimated that only 9% (ref. 17) or 14% (ref. 18) of structural breakpoints fall within repetitive sequences, suggesting that nonrecurrent mechanisms predominate, whereas another study (ref. 10) demonstrated that 47% of breakpoints follow NAHR rules. Some of these differences can be attributed to ascertainment biases in the technologies and the size of variants being assayed, but more data will be required before we fully understand the genesis of structural variation. There are only a few primary reports assessing new mutation rates for CNVs.6,19 Emerging observations suggest a locus-specific rate of 1.6 × 10⁻⁶ to 1.2 × 10⁻⁴, which is several orders of magnitude greater than the per-nucleotide rate observed for SNPs.20,21 Moreover, it seems that most CNV gains are local duplication events, but new studies of both human and Drosophila also demonstrate transposition events.10,22

The larger CNVs (>50 kb) described in recent population surveys seem skewed toward rare variants.6 Their distribution also seems to be nonrandom, with more in the subtelomeric and centromeric regions of chromosomes.23,24 Overall, however, the spectrum of structural variation is extensive. In terms of phenotypic impact, the location of these variants in relation to genes is particularly germane. They may occur anywhere, but are more common in regions devoid of genes, known as “gene deserts.” Some comprise multigene segments that are deleted, duplicated or moved; others involve segments contained within functional genes; yet others are in nongene segments that nonetheless have a regulatory role on gene function. When genes are involved, impact of the variant will be contingent upon the function(s) of these genes. “Essential” genes are less likely to be tolerant of any disruption, and de novo variants that affect them may face strong selection25 (Fig. 1). The functions of “disease-associated” genes are sufficiently important that their disruption or copy number change may lead to a clinically-recognizable phenotype. Other genes can have more subtle effects on phenotype and fitness, being more robust or more discretionary, and it is the genetic elements at this end of the spectrum that seem to have the greatest relationship with CNVs and other structural variants.

We can classify the genomic structural variants according to form. “Balanced” rearrangements involve no loss or gain of genetic material, and include intrachromosomal and interchromosomal translocations and inversions. These are not detectable by current array-based methods, but are revealed by direct comparison of genome sequences or cytogenomic approaches. Some ostensibly balanced rearrangements in individuals with a clinical phenotype have, on closer scrutiny with the higher resolution methods, been found to comprise subtle deletions or duplications at their breakpoints, or to be associated with additional changes elsewhere in the genome26–29 (Fig. 2). Truly balanced rearrangements, even when they have no functional effect in a carrier, can, nonetheless, create genomic instability for future generations.30–33

Fig 2

CNV and phenotypic complexity in autism spectrum disorder (ASD) pedigrees. The size of each de novo or inherited CNV is shown below each family member. Arrows identify probands. ASD cases have filled symbols (gray denotes developmental delay but not a definitive ASD diagnosis); open symbols denote individuals who do not have ASD. (A) ASD probands may carry multiple de novo events including those overlapping genes known to be associated with ASD (SHANK3)73 or (B) large de novo rearrangements accompanied by smaller de novo events at different loci. (C) Male probands may inherit chromosome X deletions (PTCHD1) from female carriers, unmasking an identical CNV in fraternal siblings with variable expression of ASD. (D) Recurrent CNV gains and losses (representing the reciprocal events) in unrelated probands (at 16p11.2) may be inherited from non-ASD parents or (E, F) present as de novo events.

The remaining structural variants are “unbalanced” with respect to DNA content and are called CNVs. These can involve a relative loss or gain (deletion or replication) of genetic material. Methods to detect CNVs include comparative and directly quantitative array screening strategies, sequence analysis, and site-focused assays such as quantitative polymerase chain reaction (qPCR) and fluorescence in situ hybridization (FISH). The form for an individual structural variant can be as simple as a segmental deletion, or a highly complex genomic rearrangement involving multiple elements.6,10,34,35 (For detailed descriptions of variant classes, see refs. 6 and 34.) Collectively, there is also diverse complexity for the loci at which these events occur, since a given variant region may demonstrate overlapping but nonidentical rearrangements when genomes of different individuals are compared, and CNVs can be multiallelic. A challenge going forward, particularly for complex traits and diseases, will be to determine how structural variants and single nucleotide variants might interact, looking at mutation rates and linkage disequilibrium, and to develop new models to extract these data.36,37
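
One way to quantify whether two rearrangements at the same locus are "overlapping but nonidentical" is a reciprocal-overlap fraction. The sketch below assumes half-open coordinates on the same chromosome; the function names and the 0.5 default threshold are illustrative assumptions (a convention often used in practice), not a rule stated in this review.

```python
def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Fraction of the shared segment relative to the longer-penalized variant.

    Coordinates are half-open [start, end) on the same chromosome.
    Returns the smaller of (shared / length of a) and (shared / length of b).
    """
    if a_end <= a_start or b_end <= b_start:
        raise ValueError("each interval must have end > start")
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    if shared == 0:
        return 0.0
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


def same_variant(a: tuple, b: tuple, threshold: float = 0.5) -> bool:
    """Treat two (start, end) calls as the same event above an assumed threshold."""
    return reciprocal_overlap(a[0], a[1], b[0], b[1]) >= threshold
```

A small deletion nested inside a much larger one scores low under this measure, which is the intended behavior: the two events share sequence but may differ substantially in gene content and phenotypic consequence.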

Table 1 classifies some structural variants according to their genotypic form and features, from deletion CNVs through balanced rearrangements to CNVs with large relative gains of material. Clinically relevant illustrative examples are listed for each. Other good reviews on this topic have been published.7–9,38–40

Table 1 Spectra of CNV genotypes and phenotypes

PHENOTYPIC SPECTRUM

The observable qualities of an organism comprise its phenotype. As with individual variation directed by single nucleotide variants, the phenotypic impact related to structural variants in the genome can be so severe as to cause embryonic lethality, or at the other end of the spectrum, to have little or no discernible outcome. In between, they can be associated with degrees of dysfunction, which, beyond a certain threshold, are called “disease,”41,42 though that threshold can sometimes be moved by clinical interventions. Traits may also be relatively adaptive or maladaptive in different environmental contexts. From a clinical perspective, structural variants can be the basis for severely disabling syndromes or diseases, for single-gene disorders and those involving large chromosomal segments. Their impact is increasingly being recognized, however, in the more quantitative traits, where they can have somewhat incremental effects on phenotype and fitness. They are anticipated to be even more important for predisposition to common threats to health, such as heart disease, diabetes, cancer, or dementia, particularly in those with apparently complex etiology. Much of structural variation is not gene- or disease-associated and has become widely dispersed in the absence of selective pressure (Fig. 1). It is becoming clear that these variants are important contributors to traits that not only create a state of disease or health, but influence quality of life and simple human differences.

The earlier phenotype-driven research has detected more genomic deletions than duplications—a bias probably due to the typically milder phenotype associated with gain of genetic material.24 The corollary, however, is less selection pressure, and the relative abundance of CNV gains is becoming apparent with genotype-driven approaches.

TECHNOLOGY AND DATABASES

Both array-based and sequencing technologies are evolving quickly to adapt to the recognition of CNVs and other structural variants as important genomic elements to be ascertained, documented, and interpreted.13 Initially, there has been a detection bias in favor of medium-to-large and noncomplex variants. The genome-wide arrays are designed for breadth of detection and have limited ability to resolve endpoints of variant sequences with precision, or to determine whether variants are exactly the same or overlapping. Targeted arrays, and more labor-intensive approaches such as qPCR or FISH, can add information to allow more detailed interpretation. Repetitive elements are inherently challenging for DNA sequencing, and more variant regions are being discovered as gaps in the reference sequences are gradually conquered. The higher-density arrays and higher-throughput sequencing will also increase detection of variant regions that are smaller, more complex, and more difficult to interpret. A particular challenge for relating specified genotypes to phenotypes is that the high-throughput array technologies reveal relative copy-number differences but have limited ability to resolve absolute copy number—a matter particularly relevant to multiallelic loci.9,13 Further, they ascertain a diploid genotype and do not directly discern the component haploid variant alleles. Finally, the yet limited ability to resolve CNV breakpoints will, for some time to come, compromise their interpretation, particularly for predictive purposes.
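
The distinction between relative and absolute copy number can be made concrete with the log2 ratio that comparative arrays conceptually report. The following is an idealized sketch that ignores noise, normalization, and probe effects; the function name is hypothetical, and excluding zero-copy (homozygous deletion) inputs is a simplifying assumption.

```python
import math


def log2_ratio(test_copies: int, reference_copies: int) -> float:
    """Idealized relative signal for a test genome versus a reference genome."""
    if test_copies <= 0 or reference_copies <= 0:
        # A zero-copy state has no finite log ratio; excluded in this sketch.
        raise ValueError("copy numbers must be positive for a log ratio")
    return math.log2(test_copies / reference_copies)


# A single-copy gain and a single-copy loss against a diploid reference.
gain = log2_ratio(3, 2)
loss = log2_ratio(1, 2)

# At a multiallelic locus, 6 copies versus a 4-copy reference yields the
# same relative signal as 3 versus 2, so absolute copy number is not recovered.
ambiguous = log2_ratio(6, 4)
```

Here `gain` and `ambiguous` are identical, which illustrates why a relative measurement alone cannot distinguish the two underlying genotypes when the reference itself varies in copy number.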

The Database of Genomic Variants (DGV) (http://projects.tcag.ca/variation/) was established to catalogue genomic variation from human control samples, as a support for research correlating genomic variation with phenotypes.4,43 It is important to keep in mind that it derives from individuals deemed to be “healthy controls,” but the amount of phenotypic documentation is limited. A control subject for a cancer study, for example, may not have been assessed for health status with respect to blood pressure. Health is not static, and the status of a research participant could change. The DGV comprises structural variants not known to cause overt disease, but does not necessarily exclude alterations associated with complex, variable, mild, or late-onset phenotypes. The database is an essential research tool, but caution is needed in its use for prediction of health outcomes.

Databases such as Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER) (https://decipher.sanger.ac.uk/) and others (reviewed in ref. 44) are intended to marry clinical phenotypic descriptions with data about structural variation. Currently, such databases house, primarily, information on highly penetrant variants that cause overt phenotypes such as dysmorphic syndromes and cognitive impairment. As the field moves from examining the role of structural variants in rare, highly penetrant disorders to that in common and complex traits and disease, the overlap of content in “control” and “disease” databases such as DGV and DECIPHER, respectively, will increase. Moreover, as depicted in Figure 1 and Figure 2D, some structural variants previously annotated as benign or neutral in their effect will be reclassified as predisposing, risk factors, or partially penetrant alleles.9,34,45

MODELING RELATIONSHIPS BETWEEN STRUCTURAL VARIANT GENOTYPE AND PHENOTYPE

Some structural variants influence single genes and behave as simple Mendelian traits, and others merge with the realm of traditional cytogenetics. They are coming to the fore, however, as contributors to the “everything else” category from textbook genetics—that of complex traits. CNVs underlie common variation that may be selectively advantageous, neutral, or detrimental in different contexts. To understand the relationships to phenotype, our thinking and analysis will need to evolve from models with simple, linear, binary, and discontinuous concepts to those that are complex, networked, multifocal, and continuous.25 CNVs will be responsible for complex additive and/or epistatic effects and for buffering. More elements will have individually small incremental effects, and be associated, not only with threshold traits, but also with those that are continuously variable or quantitative.

The structural variants can impact gene function8,35 (or not46) in several ways. They can create functional loss through deletion or disruption of one or more genes, behaving as dominant or recessive alleles according to the cellular function of the impacted gene product(s). They may cause disruption of a regulatory element with any number of possible positive and negative sequelae, including imprinting and differential allelic gene expression.47 Replication of genes may increase the protein product, or buffer the impact of other genetic variation.41 Rearrangements can have position effects on gene expression37 by separating genes from their regulatory elements or putting them into a different genomic context with new epigenetic factors. They may also generate novel fusion products.

A number of features of the variant genotype will be relevant to the concomitant phenotype:

  1. The location of the structural variant with respect to genes or regulatory regions.

  2. Dosage characteristics of the variant—whether there is a loss or gain of genetic material.

  3. When functional genes are impacted, the dosage-sensitivity of the related gene. Proteins involved in complexes are more likely to require dosage balance for optimal function.

  4. Extent of the variant—involving dosage effects on one or on multiple genes.

  5. Cellular role of the impacted gene product.
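
Several of the features listed above (location, dosage direction, and extent) can be derived mechanically from coordinates. The sketch below is a minimal illustration under stated assumptions: the `Interval` type, the `describe_variant` function, and the output labels are hypothetical, and the gene annotation is supplied by the caller rather than taken from any particular database. Dosage sensitivity and cellular role are deliberately left out because they require functional knowledge that coordinates cannot supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two intervals (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def describe_variant(variant: Interval, copy_change: int, genes: dict) -> dict:
    """Summarize location, dosage direction, and extent for one CNV.

    copy_change is the signed change in copy number relative to the reference;
    genes maps a gene name to its Interval (caller-supplied annotation).
    """
    if copy_change == 0:
        raise ValueError("a CNV must change copy number")
    hit = sorted(name for name, g in genes.items() if overlap_bp(variant, g) > 0)
    if len(hit) > 1:
        extent = "multigene"
    elif hit:
        extent = "single gene"
    else:
        extent = "no gene"
    return {
        "dosage": "gain" if copy_change > 0 else "loss",
        "genes_overlapped": hit,
        "extent": extent,
    }
```

A variant reported as "no gene" here may still disrupt a regulatory element, so this summary is a starting point for interpretation rather than a prediction of phenotype.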

Phenotypes associated with particular variant regions can be relatively consistent, or highly variable. Consistency may reflect involvement of a single gene, but could also be due to a multigene segmental variant passed on from a common ancestor. Alternatively, concordance is often the result of recurrent rearrangements driven by predisposing genomic sequences, such as nearby segmental duplications, or a balanced variant such as an inversion. Phenotypic variability can have many causes. A syndrome or disorder, for example, might be defined by its core gene(s), but the extent of the CNV and its encompassing of nearby genes may influence the phenotype. Most importantly, the overall genomic context in which a given structural variant functions, and the environmental variables, will be different for each individual in which the variant is found.

In Figure 2, we present results from our group's studies of CNVs in individuals with autism spectrum disorder (ASD).27 Albeit still simplistic, the data begin to reveal some of the complexities to be considered when attempting to make proper genotype-to-phenotype associations.48 For example, with higher-resolution arrays, multiple de novo CNV events may be identified in individual samples (Fig. 2, A and B). CNVs can be unmasked, depending on their position and context in the genome, which might influence expressivity and penetrance (Fig. 2, C and D). Gains and losses at the same locus can lead to overlapping phenotypes, with variable penetrance and potential contributions from other risk alleles (Fig. 2, D-F). Among individuals with de novo structural variants in our ASD cohort, more than 10% had two or more variants, cautioning against assigning independent causation to all de novo structural variants observed.27 As expected for a common, complex disorder with potentially numerous contributing loci, in the family illustrated in Figure 2, F, the CNV deletion is detected in only one of the two ASD sibs. Although the 16p11.2 CNV may be more prevalent in autism families than among control subjects,27,49,50 the genomic characteristics demonstrated in ASD families in Figure 2, D-F suggest that this variant is neither necessary nor sufficient to cause ASD. We need to consider additional independent potential risk factors, including those that are genetic, epigenetic, gender-related, environmental, or stochastic in origin. A recent comparison of CNVs between monozygotic twins51 also draws attention to the prevalence of somatic events that can create mosaicism for structural variants, with possible contribution to a variety of phenotypes.

IMPLICATIONS IN THE APPLICATION TO HEALTH CARE

Genotyping arrays are already well established as front-line research tools, and are rapidly being integrated into mainstream medical practice, and, more recently, into consumer genomics. Whole genome sequencing is likely but a few years behind, and all of these approaches already generate far more genomic data than we can translate or interpret. Untargeted investigations, in particular, will yield huge amounts of data about variants—concerning both groups and individuals—that may not be interpretable for some time, but will still be on the table.52 In a research context, the question of how to manage incidental findings is but one looming dilemma being anticipated and contemplated by lawyers and ethicists.53 It is exciting to watch the resurgence of discovery as newly-recognized CNVs open investigative paths to issues that had been stalled. Even when CNVs are infrequent contributors to a particular phenotype, the rare cases are beginning to draw attention to relevant genes for further investigation of nucleotide variation (see e.g., Ref. 54), and on to functional studies. As research findings are turned into applications for medical practice, we must keep in mind that statistical inferences about groups and populations are not the same as implications for individual risk. Our ability to use these genotypic data to explain a phenotype (i.e., diagnosis) will be far ahead of our ability to predict outcomes, and this will be particularly true as the structural variants allow access to the massive realm of common and complex traits and disease.

At present, it is very difficult to know whether a given de novo variant is pathologic, and some of the reason for this is our still rudimentary knowledge of mutation rates in different classes of structural variants.21 It is also difficult to know whether an inherited variant is necessarily benign in a particular genomic environment. Efforts to document and catalogue genotypic data in relation to phenotypic information from thousands of phenotype-classified individuals and controls should eventually make it possible to do so. As our focus evolves from primarily gene-specific investigations to what will eventually be routine whole-genome analysis, we will be driven to scrutinize individual phenotypes more closely. Some who are classified in research protocols as being unaffected by a particular trait or disease may, upon retrospective evaluation, be found to carry subtle signs, and such observations will help in understanding the spectrum of variation associated with particular structural variants.

As we come to recognize the extent of structural variation, it is making us aware of the degree to which the genome is fluid and unstable, both through germlines and in somatic events. Knowledge of this area of human variation is both filling in gaps and reminding us of the extent of complexity in biological systems. This should keep us cognizant of the opportunities to make diagnostic or predictive errors through erroneous assumptions, and the further we become dependent upon interactive and computer-based interpretations, the more such risks will emerge.

If the process of documenting and cataloguing the complex genomic variants and rearrangements is challenging, it is an almost daunting task to do the same systematically for phenotypic traits. Nomenclature needs to be standardized,55 and means found to accommodate subjectivity and the fact that health status is not static, among other complexities.44 Relational databases can then go forward, connecting observed variation in genomes to phenotypic outcomes, to allow a knowledge base for application to health care.

As personalized medicine becomes more common, interpreting the amount of information potentially available for an individual could quickly overwhelm. We are still inclined to look at one locus or variant at a time, and manually interpret the observations in isolation. This approach will continue to be appropriate for many of the genetic variants described to date, that have individually significant impact on phenotypes. As the bulk of information from variants is brought forward, however, particularly from CNVs, more and more will we recognize those with small incremental effects, and complex interactions with other elements. The holistic opportunities offered by genome-wide assays will gradually be realized, facilitated by tools with which to interpret complex networks of genomic interactions and to account for epigenetic and nongenetic factors, such as time, place, environment, and experience.

There will be an expanding role for professionals trained in such interpretation, as research delivers into the arena of applications for individual health care. Today's clinical molecular geneticists and cytogeneticists will merge skills and acquire completely new ones to fill this niche. There will be an enhanced role for counselors to communicate the interpreted information to the individuals, families or communities who will be impacted.52,56 They will be particularly challenged with issues of complexity, subtlety, and uncertainty at the same time as a sheer volume increase in demand for their services. The delivery of information about human variation will be as much an art as a science for a long time to come.

We are reminded of Charles Scriver's wise insight that, “genetic variation itself is normal; it is dis-ease only when we experience it as illness. The professional will understand the process underlying the disease; the healer will alter the perception of illness.”42 In contemplation, we add that these emerging studies of genomic structural variation will bring the professional to the phenotype and the healer to the genome, in pursuit of common answers to their respective questions.