Introduction

Our current understanding of many fundamental mechanisms underlying mammalian physiology has been profoundly influenced by discoveries in human genetics. Pioneers in human genetics such as Victor McKusick at Johns Hopkins and Charles Dent at University College in London spent their entire careers developing theoretical and technological methods for tracking the transmission of heritable traits and linking them to gene defects that cause disease. In turn, these discoveries revealed how normal pathways control complex physiological processes. Over the last few decades, exponential improvements in the speed and accuracy of DNA sequencing, coupled with increasingly sophisticated mathematical approaches for annotating gene networks, have revolutionized the field of human genetics and made these once time-consuming approaches accessible to most investigators. Indeed, at the time we are writing this article, high-quality DNA sequences of entire genomes can be obtained commercially in just a few days for US $3 500 (http://www.genome.gov). Consequently, disease-causing gene defects can be identified in a matter of a few weeks by most biomedical researchers, provided that they can access appropriate bioinformatics expertise. Moreover, parallel advances in pathway analysis algorithms, together with open access to both chemical and pharmaceutical libraries, have for the first time enabled informed therapeutic targeting of disease-causing gene pathways. In this perspective, we discuss some of these technological advances and describe how they have enabled the identification of the molecular defects underlying two rare bone diseases.

NextGen genome sequencing

Few advances have impacted biology more than innovations in DNA sequencing. In 2001, sequencing a human genome cost $100 million and took years to complete. Today, a human genome can be sequenced for $3 500 in a matter of days (http://www.genome.gov). The remarkable acceleration in our ability to sequence DNA has made genome and exome sequencing routine (the exome is the 1.5% of the genome consisting of protein-coding exons) and has significantly increased our ability to identify variants responsible for Mendelian diseases. This is especially evident in the bone field, where over the last three years Next-Generation Sequencing (NGS) has been used to identify new genes for osteogenesis imperfecta (OI) [as examples (1–5)], early-onset osteoporosis (6,7) and other skeletal dysplasias (8–10).

In addition to generating a reference genome sequence, the Human Genome Project created immense interest and investment in the creation of NGS platforms to sequence DNA faster and at a lower cost (11). NGS platforms have enabled the transition from sequencing DNA fragments one at a time to sequencing fragments in a massively parallel manner. To illustrate this idea, it is useful to provide an overview of Illumina sequencing by synthesis, one of the most widely used NGS platforms (12). Illumina sequencing works by generating hundreds of millions of clonal DNA populations (clusters) attached to a solid surface. Fluorescently labeled nucleotides are then incorporated one base at a time into the clonal fragments within a cluster. The addition of a particular base to a subset of clusters is recorded by high-resolution imaging. A different base is then added and imaged. This is repeated for multiple cycles until “reads” of 100–200 bases of sequence are generated for each of the millions of fragment clusters. Currently, the Illumina HiSeq2500 sequencer can generate 600 Gigabase pairs (Gbp) of sequence (200 human genome equivalents) every 11 days. There are also a number of other commercial or soon-to-be commercial NGS platforms. Examples include Ion Torrent semiconductor sequencing (13), Pacific Biosciences' single molecule real time (SMRT) sequencing (14) and nanopore-based DNA sequencing by Oxford Nanopore Technologies (15). While these technologies are less mature than Illumina, each has the potential to improve sequence quality, throughput and cost.

Disease gene identification and prioritization using exome sequencing

Mutations underlying single-gene disorders have traditionally been identified using candidate gene screening or linkage mapping/positional cloning strategies in pedigrees (16). However, these approaches require either prior biological information regarding the disease or large families. In contrast, whole-genome or exome sequencing is capable of discovering disease genes in an unbiased manner (17). For single-gene disorders, exome sequencing is more efficient than whole-genome sequencing because it focuses on the regions where variants are most likely to have functional significance (1.5% of the genome), thereby permitting the high coverage (>60X) that is needed to confidently identify variants. Also, the majority (>85%) of single-gene disorders are due to coding mutations (18). In an exome sequencing experiment, the coding regions of a genome are “captured” by hybridizing fragmented genomic DNA to a library of exonic DNA oligos. The captured sequences are then sequenced using an NGS platform (17).

An extremely active area of research in bioinformatics has focused on the processing and analysis of NGS sequence data (19). This has led to a proliferation of commercial and open-access software tools for sequence data analysis. A recent review surveyed 205 tools for various aspects of whole genome/exome sequence analysis, which gives a sense of the intense interest in this area (20). Here, we provide a general overview of a typical exome sequence analysis pipeline (Figure 1).

Figure 1. Typical exome sequence analysis pipeline.

As discussed above, DNA from the individuals to be sequenced is subjected to exome capture. The samples are then sequenced, with a typical yield in the range of 50–90 million reads (100 bp) or 5–9 Gbp of sequence per sample. The first step in the analysis is to exclude low-quality and duplicate reads. The next step is to align the remaining sequence reads to the human reference genome. In principle, aligning reads is straightforward, but in practice there are a number of parameters to consider, such as the number of mismatches to allow and whether to include reads that map equally well to multiple sites in the human genome. After excluding low-quality reads and reads not mapping to the exome, a typical yield is 3–6 Gbp of sequence or 50–100 X coverage of the 50 megabase pair (Mbp) exome. The end result is a list of sequence reads, their mapping locations and a score reflecting the quality of each alignment.
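The relationship between read yield and coverage described above can be made concrete with a short calculation. The following is a minimal sketch, assuming mean coverage is simply usable sequenced bases divided by target size; the function name and the `on_target_fraction` parameter are illustrative and not part of any particular pipeline.

```python
def mean_coverage(num_reads: int, read_length_bp: int, target_size_bp: int,
                  on_target_fraction: float = 1.0) -> float:
    """Mean depth of coverage = usable sequenced bases / size of the target.

    on_target_fraction is a hypothetical knob for the share of reads that
    survive quality filtering and map to the exome.
    """
    usable_bases = num_reads * read_length_bp * on_target_fraction
    return usable_bases / target_size_bp


EXOME_SIZE_BP = 50_000_000  # 50 Mbp exome, as quoted in the text

# Raw yield quoted in the text: 50-90 million reads of 100 bp each.
raw_low_bp = 50_000_000 * 100   # 5 Gbp
raw_high_bp = 90_000_000 * 100  # 9 Gbp

# 30 million usable 100 bp reads correspond to 3 Gbp of post-filter sequence.
coverage_3gbp = mean_coverage(30_000_000, 100, EXOME_SIZE_BP)
print(f"Raw yield: {raw_low_bp / 1e9:.0f}-{raw_high_bp / 1e9:.0f} Gbp")
print(f"Mean coverage from 3 Gbp over a 50 Mbp exome: {coverage_3gbp:.0f}X")
```

This simple division ignores uneven capture efficiency across exons, so the depth at any individual position can differ substantially from the mean.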

The next step is to take the list of alignments and identify variants. In this step, variant positions in the genome, relative to a reference, are identified by comparing the “pileup” of base calls for every position in the exome. Homozygous variants are those in which all reads mapping to a position possess the non-reference allele, whereas heterozygous variants have the alternative allele in approximately 50% of reads. Most variants are single-nucleotide changes, although commonly used software tools can also identify short insertions and deletions (INDELS). It has recently been demonstrated that comparing relative sequence depth (the number of reads mapping to a location) between samples can also identify copy number changes across exons (due to larger scale INDELS or structural variants) (21). One challenge in calling variants is that systematic sequencing errors or misalignments can result in false positive variant calls. However, this is becoming less of an issue as sequencing data quality and alignment algorithms improve.
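The pileup logic described above can be illustrated with a deliberately naive genotype caller. This is a sketch only: the depth cutoff and allele-fraction thresholds are assumed values chosen for illustration, and production variant callers use statistical models that also account for base and mapping quality.

```python
from collections import Counter


def call_genotype(pileup: str, ref_base: str, min_depth: int = 10,
                  het_min: float = 0.2, hom_min: float = 0.8) -> str:
    """Classify one position from the string of base calls covering it.

    min_depth, het_min and hom_min are illustrative assumptions, not
    values taken from any specific variant caller.
    """
    bases = pileup.upper()
    depth = len(bases)
    if depth < min_depth:
        return "no_call"  # too few reads to call confidently

    counts = Counter(bases)
    alt_counts = {b: n for b, n in counts.items() if b != ref_base.upper()}
    if not alt_counts:
        return "hom_ref"

    # Consider only the most frequent non-reference base at this position.
    alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    alt_fraction = alt_n / depth

    if alt_fraction >= hom_min:
        return f"hom_alt:{alt_base}"
    if alt_fraction >= het_min:
        return f"het:{alt_base}"
    return "hom_ref"  # a few stray alternate bases are treated as errors


print(call_genotype("C" * 10 + "T" * 10, "C"))  # half the reads carry T
print(call_genotype("T" * 20, "C"))             # all reads carry T
print(call_genotype("C" * 19 + "T", "C"))       # one stray T among 20 reads
```

The tolerance band around 50% reflects the fact that sampling noise means a true heterozygote rarely shows exactly half of its reads carrying the alternative allele.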

Once variants are called, they need to be annotated. Common annotations include location in the human genome, functional class (e.g. missense or nonsense variant), frequency in large population-based sequencing project data sets [such as the NHLBI exome sequencing project (22)], level of evolutionary conservation and in silico functional predictions [e.g. SIFT (23) and Polyphen (24) for missense variants], among many others.

The exome of any given individual will possess several hundred putatively functional variants (25). Since the goal is to identify the single causal mutation, filtering strategies are critical for narrowing the list. Filtering can often reduce hundreds of candidate variants to the disease-causing variant, although this is highly dependent on the amount and quality of data used for filtering. It is often useful to begin by filtering on annotation. For example, synonymous variants (coding variants that do not result in an amino acid substitution) and intronic variants that do not affect splicing are unlikely to be disease causing. Intronic variants are identified in exome sequencing because capture probes often extend into introns, both to ensure high coverage of the beginnings and ends of exons and to capture mutations that affect splicing. Excluding common variants or variants observed in normal cohorts is also informative, since variants causing rare Mendelian forms of disease will by definition be rare. In most cases, sequence data from parents or other relatives provide the single most informative filter. For example, if evidence exists that the disease is recessive, then parental variant information can be used to exclude variants that do not fit a recessive mode of inheritance or that are homozygous in both the proband and an unaffected sibling. It has also been demonstrated by our group (see below) and others (26,27) that for genetically heterogeneous diseases such as OI, network-based approaches that functionally link novel genes to known disease genes are an effective filtering strategy.
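The filtering steps described above (annotation, population frequency and recessive inheritance) can be sketched as a small pipeline. Everything in this example is hypothetical: the `Variant` fields, the gene names, the genotype strings ("0/1" for heterozygous, "1/1" for homozygous alternate) and the 1% frequency cutoff are assumptions made for illustration, and the variants are toy data rather than results from any study.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    functional_class: str   # e.g. "missense", "nonsense", "synonymous", "intronic"
    population_freq: float  # allele frequency in a reference cohort
    proband: str            # genotypes: "0/0", "0/1" or "1/1"
    mother: str
    father: str
    sibling: str            # unaffected sibling


# Classes treated here as unlikely to be disease causing (illustrative choice).
EXCLUDED_CLASSES = {"synonymous", "intronic"}


def fits_recessive(v: Variant) -> bool:
    """Homozygous proband, carrier parents, unaffected sibling not homozygous."""
    return (v.proband == "1/1" and v.mother == "0/1"
            and v.father == "0/1" and v.sibling != "1/1")


def filter_candidates(variants, max_freq: float = 0.01):
    """Apply annotation, frequency and inheritance filters in turn."""
    kept = []
    for v in variants:
        if v.functional_class in EXCLUDED_CLASSES:
            continue  # annotation filter
        if v.population_freq > max_freq:
            continue  # too common to cause a rare Mendelian disease
        if not fits_recessive(v):
            continue  # inconsistent with recessive inheritance
        kept.append(v)
    return kept


toy_variants = [
    Variant("GENE_A", "missense", 0.0, "1/1", "0/1", "0/1", "0/1"),
    Variant("GENE_B", "synonymous", 0.0, "1/1", "0/1", "0/1", "0/0"),
    Variant("GENE_C", "missense", 0.12, "1/1", "0/1", "0/1", "0/0"),
    Variant("GENE_D", "nonsense", 0.001, "1/1", "0/1", "0/1", "1/1"),
    Variant("GENE_E", "missense", 0.0, "0/1", "0/1", "0/0", "0/0"),
]

candidates = filter_candidates(toy_variants)
print([v.gene for v in candidates])
```

Only GENE_A survives in this toy set: the others are removed for being synonymous, common, homozygous in the unaffected sibling, or heterozygous in the proband, respectively.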

Identification of new genes causing osteogenesis imperfecta

The widespread availability and reduced cost of NGS and bioinformatics have armed the broader research community with the tools necessary to identify disease-causing gene pathways. In bone research, gene discovery has been particularly active in patients with rare bone disorders, such as OI, that are caused by mutations in single genes.

OI comprises a group of heritable connective tissue disorders that result in bone fragility and susceptibility to fracture, bone deformity and growth deficiency. Most cases of OI are due to dominant mutations in the type I collagen genes COL1A1 or COL1A2, which alter the structure or quantity of type I collagen and cause a skeletal phenotype that ranges from subclinical to lethal. Other, rarer forms of OI arise from gene mutations that affect proteins involved in collagen processing [i.e., prolyl hydroxylation, intracellular transport, or matrix incorporation (28)]. Indeed, the elucidation of the gene pathways underlying inherited forms of OI has uncovered the different molecular events that are required for normal collagen processing (28). However, it is clear from case reports in the literature that there are many inherited forms of OI in which the genetic defect is unknown. These cases represent a fertile area for the application of NGS. Here we describe two recent examples that illustrate how these powerful approaches have been exploited to identify the genetic basis for rare cases of OI.

WNT1

Wnt signaling is a well-established regulator of bone development and homeostasis (29). Much of the initial evidence implicating Wnt signaling in the skeleton came from discoveries that mutations in the WNT coreceptor LRP5 (30,31,32) and the WNT inhibitor SOST caused altered bone mass (33,34,35). More recently, genome-wide association studies have identified common polymorphisms in or near a number of genes involved in WNT signaling (LRP5, SOST, CTNNB1, WNT16, etc.) that are associated with bone mass in the general population (36). Now a series of recent NGS studies have implicated WNTs in OI through the discovery of disease-causing mutations in WNT1 (6,7,37,38).

Four independent studies recently reported the discovery of multiple mutations in WNT1 in individuals with OI and early-onset osteoporosis. Common to all studies was the use of NGS to identify the initial mutations in WNT1. In the work by Fahiminiya et al., 148 individuals with OI type IV were investigated (38). Sanger sequencing identified mutations in either COL1A1 or COL1A2 in 134 of the subjects. Mutations in other known OI genes were identified in six others. The remaining eight probands were subjected to exome sequencing. In four individuals from three families, homozygous mutations were identified in WNT1. In a similar study, Pyott et al. identified four OI families where the affected probands carried homozygous mutations in WNT1 (37).

Keupp et al. used exome sequencing to identify a 1 bp homozygous duplication in WNT1 in three affected individuals from a large consanguineous Turkish family (7). Follow-up capillary sequencing in 11 families identified four additional homozygous mutations in WNT1. In the same study, exome sequencing in a family with early-onset osteoporosis revealed a distinct heterozygous variant in WNT1 that segregated with the disease. Functional analysis revealed that WNT1 was expressed in differentiating osteoblasts and its expression increased as a function of osteoblast maturation. Also, in contrast to wild-type WNT1, overexpression of WNT1 constructs possessing three of the identified mutations did not induce canonical WNT signaling.

In the last study, Laine et al. investigated the basis of autosomal dominant early-onset osteoporosis in a large multigenerational pedigree (6). Linkage analysis in 10 affected and 6 unaffected family members mapped the mutation to a 25.5 Mbp region on Chromosome 12. Targeted NGS of the region revealed a single novel variant in WNT1. Similar to the work of Keupp et al., the authors also investigated the genetic basis of OI in a family with two affected individuals. As in the other studies, exome sequencing was used to discover a homozygous variant in WNT1. The authors also demonstrated that, in contrast to wild-type WNT1, overexpression of the two mutants did not induce canonical WNT signaling as measured by reporter assays and the expression of WNT target genes. Moreover, overexpression of the mutant constructs in MC3T3 osteoblasts reduced the capacity of cells to form mineralized nodules relative to wild-type WNT1. The authors also used lineage tracing with Wnt1-cre transgenic and RosamT/mG reporter mice to show that Wnt1 is expressed in a subset of osteocytes in subchondral and cortical bone.

PEDF and BRIL

Studies from the Clemens lab examining the role of vascularization of bone had identified several genes that impinged on angiogenic regulatory pathways (39). Among these was Serpinf1, which encodes Pigment Epithelium-Derived Factor (PEDF), a protein known to exert potent anti-angiogenic activity in several vascular beds (40,41). The protein was originally isolated from the epithelium of the developing retina, where it is believed to coordinate proliferation and differentiation of the epithelial cells (40,42,43). PEDF is ubiquitously expressed in both human and mouse (44) and has been implicated in cell cycle control (45), fat metabolism (46), and tumorigenicity (47).

In the course of our studies on the role of PEDF in bone, two studies described inactivating mutations in the SERPINF1 gene as the cause of OI type VI (48,49). Type VI OI is distinct from other forms of the disease in that the afflicted subjects display an osteomalacia-like phenotype characterized by thickened osteoid and delayed mineralization. In addition, bisphosphonates are generally less effective in treating type VI than other subtypes of OI (50). The discovery of SERPINF1 as the genetic basis for OI type VI came as a surprise to the field because PEDF had no obvious connection to bone. To establish a role for PEDF in bone we characterized the skeletal features of a mouse with unrestricted loss of PEDF (51). These mice exhibit skeletal features resembling those seen in patients with OI type VI including increased unmineralized osteoid.

We next searched for cases of unexplained OI with familial inheritance in which PEDF production was compromised. In one case seen by Joan Marini at the NIH, a young girl born to normal parents presented with severe OI and was diagnosed with OI type VI on the basis of histomorphometry. The proband has severe OI with relative macrocephaly, extreme short stature (at 25 years of age, her length is at the 50th percentile for a 28-month-old girl), barrel chest and scoliosis. Histomorphometric analysis performed by Dr. Francis Glorieux revealed a histological picture diagnostic of OI type VI, including accumulation of unmineralized osteoid and a “fish scale” appearance of the matrix under polarized light. Osteoblasts isolated from the patient produced little if any PEDF, as would be expected in OI type VI (52).

Exome sequencing was performed by Hudson Alpha on DNA extracted from blood from the parents, the proband (isolated OI case) and an unaffected sibling. The proband tested negative for mutations in known OI genes, including SERPINF1. Exome sequencing of all four family members identified a total of 22 407 high-quality variants (SNPs and INDELS). Variants that were potentially causal were identified using a set of discrete filtering steps. We first identified variants with familial genotype patterns consistent with a recessive mode of inheritance or that were only observed in the proband (putative de novo variants). Variants with a frequency >1% in the population were then discarded, as these were unlikely to cause OI. Lastly, we manually inspected each variant. These analyses eliminated all but 18 variants (5 recessive and 13 de novo). Based on the observation that PEDF is secreted at very low levels by cultured osteoblasts of the proband (Figure 2A), we hypothesized that the mutant gene interacted with PEDF. It has been shown that genes that physically interact are often highly co-expressed. Therefore, we reasoned that one of the 18 variant genes would be connected to SERPINF1 in a bone co-expression network. To test this prediction, we utilized a mouse co-expression network for bone generated in the Farber lab (53). Of the 18, Ifitm5 was the only variant gene strongly co-expressed in bone with Serpinf1 (Figure 2B), suggesting that it was causal.
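The idea of prioritizing candidate genes by their co-expression with a known disease gene can be sketched as follows. This is not the network method used in the study: it substitutes a plain Pearson correlation for a full co-expression network, and the expression profiles and gene labels (CAND_1 to CAND_3) are synthetic placeholders invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def rank_by_coexpression(anchor_profile, candidate_profiles):
    """Rank candidates by absolute correlation with the anchor gene."""
    scores = {gene: pearson(anchor_profile, profile)
              for gene, profile in candidate_profiles.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Synthetic expression of the anchor (known disease gene) across five samples.
anchor = [1.0, 2.0, 3.0, 4.0, 5.0]

# Synthetic profiles for hypothetical candidate genes.
candidates = {
    "CAND_1": [2.1, 3.9, 6.2, 8.0, 9.9],  # tracks the anchor closely
    "CAND_2": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related
    "CAND_3": [3.0, 3.0, 3.0, 3.0, 3.0],  # flat, uninformative
}

ranked = rank_by_coexpression(anchor, candidates)
for gene, r in ranked:
    print(f"{gene}: r = {r:.3f}")
```

A candidate that stands out from the rest of the list, as CAND_1 does here, would be the natural one to follow up by targeted sequencing and functional studies.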

Figure 2. Identification of a dominant mutation in IFITM5 in severe OI Type VI. A) Western blot for PEDF of conditioned medium from normal (Control) and patient (S40L) osteoblasts. B) Network illustrating the strong co-expression relationships between genes known to cause OI and IFITM5 (outlined in yellow). C) Sanger sequencing confirmed the variant. The variant serine residue of BRIL (highlighted in green) is evolutionarily conserved. D) Predicted structure of the IFITM5/Bril protein (Pierre Moffatt, personal communication), which contains one transmembrane domain and an intracellular domain. The S40L mutation is located in the intracellular domain.

The candidate causal IFITM5 variant is located at 299 372 bp on chromosome 11. The variant is a C to T transition (c.119C>T) that causes a serine to leucine (p.S40L) substitution in bone-restricted ifitm-like protein (BRIL; the protein encoded by IFITM5). Sanger sequencing confirmed that the mutation was de novo (Figure 2C). The variant serine residue is evolutionarily conserved (Figure 2C). Ifitm5, also known as fragilis4, is a member of the mouse fragilis family, which consists of 5 genes and at least 3 pseudogenes (54). Ifitm proteins have evolved diverse roles, including the control of cell proliferation, promotion of homotypic cell adhesion, protection against viral infection, and facilitation of germ cell development. Examination of the gene structures, chromosomal location and tissue distribution of the different members of this gene family indicates that Ifitm5/Bril has diverged more recently to serve different functions. Bril is highly expressed by osteoblasts and enhances matrix mineralization by osteoblasts in vitro (55). However, precisely how Bril affects bone mineralization, and how the different Bril mutations affect bone in OI, is still unclear.

Soon after our findings, a different mutation in the IFITM5 gene was described as the cause of autosomal dominant OI type V (3,4). Patients with OI type V have skeletal abnormalities distinct from those seen in OI Type VI (56), which were found to be caused by a recurrent mutation in the 5′-UTR of IFITM5 resulting in a mature Bril protein with 5 additional amino acids at its N terminus (Figure 2D). Thus, in a matter of a few months, two new proteins that have no obvious connection to type I collagen synthesis or processing were linked to two different forms of dominantly inherited OI. These exciting findings strongly suggest that both PEDF and Bril are critical components of a novel pathway required for normal bone matrix production and mineralization. Studies to test this possibility are currently underway.

The studies described above illustrate the power of NGS to identify mutations that underlie rare, heritable skeletal diseases. It is important to note that, with the exception of the large family segregating early-onset osteoporosis, traditional linkage mapping approaches would not have been useful because of the isolated nature and small size of each independent family. By identifying how the defective gene product causes a genetic disorder, one can expeditiously identify critical molecular pathways that in some cases can be immediately targeted for therapeutic intervention. This is not a new idea, but because of the vast improvements in the speed of the methodologies for finding defective genes, together with the exponential reduction in the cost of sequencing, this strategy is now available to far more investigators who have access to patient populations with rare genetic disorders.

The challenge of functional validation of genetic determinants

To understand the influence of a rare mutation on a skeletal phenotype it is usually necessary to perform functional studies in vivo. Such studies have firmly linked several OI-associated gene mutations to a specific molecular process in collagen production, assembly or intracellular trafficking and secretion (28). However, although NGS has made mutation detection relatively straightforward, it is often unclear whether the identified sequence variant is the cause of the presenting phenotype. Although in vitro analysis with cell lines from patients can help in this regard, it is clear that a more comprehensive investigation at the tissue, organ or whole-organism level is required. One such effort is the International Knockout Mouse Consortium, which aims to mutate all protein-coding genes in the mouse and is providing resources to many laboratories that are studying the effects of loss-of-function alleles in different organs including the skeleton. As more gain-of-function and dominant-negative rare-disease-causing mutations are identified (e.g. IFITM5/BRIL), there will be a need for knock-in models to recapitulate these diseases. The development and wide adoption of genome-editing tools such as zinc finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs) and clustered regularly interspaced short palindromic repeats (CRISPR) provide a way to rapidly generate knock-in and other precise genome editing events (57). In addition to providing information on pathways and modifying factors, many of these mouse models will facilitate investigation of treatment modalities for these disorders. Such large-scale phenotyping efforts will significantly challenge the research community to devise effective ways to capture, analyze and disseminate the genomic and phenotypic data.

Summary

In this perspective, we have attempted to illustrate how modern DNA sequencing and bioinformatics can be applied in a contemporary research lab setting to identify the genetic defects underlying rare bone disorders caused by mutations in single genes. In the cases illustrated here, the genes responsible for causing OI (Wnt1 and Ifitm5) were not previously known to function in the skeleton, but their linkage to a rare bone disorder prompts a new line of discovery research that will probe the normal function of the gene product in bone. Indeed, this type of approach has already led to discovery of SOST (ENSG00000167941) and LRP5 (ENSG00000162337) as critical regulators of Wnt signaling in bone and resulted in development of new drugs to stimulate bone formation (58).

A recent estimate suggests that approximately 50% of the predicted 7 000 rare monogenic diseases have already been identified, and that most of the remaining disease-causing genes will be identified within the next ten years. The immediate beneficiaries of these discoveries will be the patients and families affected by rare disease-causing genes. The availability of this new body of genetic information will vastly improve clinical diagnosis and optimize treatment of both rare and more common bone disorders. Parallel translational studies informed by improved knowledge of bone-controlling gene networks will clarify fundamental biological mechanisms in bone biology. We anticipate that many investigators in the bone field will participate in the incipient decade of discovery.