Introduction

Gene discovery strategies based on exome sequencing (ES) and whole-genome sequencing that are agnostic to both known biology and mapping data provide powerful alternatives to conventional approaches to gene identification. Since their introduction in 2010, ES and whole-genome sequencing–based strategies have proven to be disruptive technologies that have rapidly accelerated the pace of discovery of genes underlying Mendelian phenotypes.1 For example, the rate of reported gene discovery increased from an average of ~166 per year between 2005 and 2009 to ~236 per year between 2010 and 2014, or an increase of 40% (i.e., ~70 additional reports) per year.1 However, this increase in reported discoveries is more modest than we, and perhaps others, anticipated.

Among the myriad factors limiting ES and whole-genome sequencing–based gene discovery, one key challenge is the lack of infrastructure for (i) large-scale standardized phenotypic delineation and comparison of families with Mendelian conditions and (ii) open sharing of sequence data, candidate genes, and putative causal variants between investigators and clinicians. These limitations often result in the identification of a putative causal variant or several high-priority candidate variants in an individual who has an unexplained phenotype (i.e., the causal gene is unknown), requiring extensive functional experimentation to establish a causal relationship.2,3 This issue frequently manifests in clinical settings as the reporting of a variant of unknown (or uncertain) significance (VUS).4,5 By contrast, identification of novel putative causal variants in the same gene in two or more families with the same or similar phenotype strongly supports a causal relationship independent of functional studies.3 To this end, sharing phenotypic and genetic information among investigators and clinicians in order to find multiple families with putatively pathogenic variants in the same gene is a straightforward approach to establishing a causal relationship. This is the rationale for developing infrastructure for large-scale release of combined genotype–to–structured phenotype data (e.g., Geno2MP (http://geno2mp.gs.washington.edu)1), structured phenotype matching6 (e.g., PhenomeCentral (http://phenomecentral.org)), gene matching (e.g., GeneMatcher (http://genematcher.org)7), and variant matching (e.g., GenomeConnect (http://genomeconnect.org),8 Decipher/DDD (http://decipher.sanger.ac.uk)9,10), which are being coordinated via efforts such as Matchmaker Exchange (http://matchmakerexchange.org).11

There exist few, if any, formal resources for phenotype or gene matching that meet the needs of families that are motivated to identify individuals with similar phenotypes or VUSs in the same gene. Instead, families have turned to the Internet and social media as a way to share experiences and knowledge with other families and researchers in an effort to leverage more fully the diagnostic potential of clinical genetic testing. Such efforts at Internet-driven patient finding12 led, for example, to the widely publicized delineation of a novel disorder of glycosylation caused by loss-of-function variants in NGLY1.13,14

Inspired by this success,15 the parents of a child (family A) with developmental delay, hypotonia, and multiple minor anomalies, in whom clinical ES16 identified de novo VUSs in lysine (K)-specific demethylase 1A (KDM1A; OMIM 609132) and ankyrin repeat domain-containing protein 11 (ANKRD11; OMIM 611192), established a website, Milo’s Journey (http://milosjourney.com), a Twitter account, and a Facebook page to publicize these findings. Their goal was to identify other families with similarly affected children and/or VUSs in the same gene(s) and to recruit researchers to study their child’s condition. Their efforts were successful and within 5 days led to the identification of another family (Figure 1, family B; Supplementary Figure S1 online) who had a child with similar clinical characteristics and a de novo VUS in KDM1A (Figure 2). Family A also contacted, via e-mail, various research groups in the United States that are investigating the genetic basis of developmental delay. A member of one of these groups made family A aware of a publication in which a de novo VUS in KDM1A was reported in a child with severe nonsyndromic intellectual disability and unaffected parents (family C).17 Literature searches by the diagnostic laboratory and clinicians for families A and B, and by family A itself, had failed to identify family C, in part because the information on VUSs identified in the cohort, including the one in KDM1A, was listed only in the supplementary materials of the article.

Figure 1

Phenotypic characteristics of children with a mutation in KDM1A. All three individuals (a–c) with a mutation in KDM1A share a prominent forehead, slightly arched eyebrows, elongated palpebral fissures, a wide nasal bridge, thin lips, and widely spaced teeth. Case identifiers correspond to those in Table 1, where a detailed description of the phenotype of each person is provided. C-1 and C-2 are pictures of the same child at 3 years 8 months and 8 years of age, respectively.

Figure 2

Genomic structure of KDM1A, predicted KDM1A protein, and spectrum of mutations that cause developmental delay. (a) KDM1A comprises 21 exons, including protein-coding (blue) exons and noncoding (orange) exons. Lines with attached dots indicate the approximate locations of the three different de novo variants that we report to underlie developmental delay. The color of each dot reflects the domain/subdomain containing the corresponding mutated residue. (b) Protein domain structure of KDM1A. KDM1A has three domains, SWIRM (pink), amine-oxidase domain (AOD; blue and teal), and Tower (yellow), as well as an unstructured N-terminal flexible region and C-terminal tail (gray). The AOD comprises two subdomains: the flavin adenine dinucleotide (FAD)-binding and substrate-binding functional subdomains. The active site cavity of KDM1A is within the substrate-binding subdomain and is required for KDM1A to demethylate H3K4me1/2 and repress transcription. Both the Tower and SWIRM domains have been shown to be necessary for the catalysis of histone demethylation by KDM1A.

Subsequently, the parents in family A (P.L. and K.M.P.) sought out the expertise of investigators at the University of Washington Center for Mendelian Genomics (UW-CMG) to help delineate the condition, confirm that variants in KDM1A were likely to be causal, and report the findings to the human genetics community at large. This experience with gene discovery for a Mendelian condition via social networking prompted the design and preliminary development of a Web-based portal (MyGene2) that will be accessible via the UW-CMG and through which families can submit phenotypic information and sequence data (e.g., variant call format and .bam files) to be warehoused and made accessible to researchers worldwide in order to facilitate more universal Internet-driven patient finding.

Materials and Methods

Studies were approved by the University of Washington and the University of Zurich institutional review boards, and consent to publish photographs was obtained. For two of the three families (families A and B; Figure 1; Table 1; Supplementary Figure S1 online), clinical ES was performed at GeneDx (Gaithersburg, MD) using the Agilent SureSelect XT2 All Exon V4 target. Both families requested .bam files from GeneDx and, upon receipt, each family transferred the files to the UW-CMG, where they were reprocessed using a standard pipeline as previously described.18 Reads were aligned to a human reference (hg19) using the Burrows-Wheeler Aligner 0.6.2. All aligned read data were subjected to (i) removal of duplicate reads (Picard MarkDuplicates version 1.70), (ii) insertion/deletion realignment (GATK IndelRealigner version 1.6-11-g3b2fab9), and (iii) base quality recalibration (GATK TableRecalibration version 1.6-11-g3b2fab9). Variant detection and genotyping were performed using GATK UnifiedGenotyper (version 1.6-11-g3b2fab9). Variant data for each sample were flagged using the filtration walker (GATK) to mark sites that were of lower quality and potential false positives (e.g., strand bias ≥−0.1, quality scores ≤50) or that showed allelic imbalance (ABHet >0.75), long homopolymer runs (>4), and/or low quality by depth (<5).

Table 1 Mutations and clinical findings of individuals with KDM1A mutations

Variants with an alternate allele frequency >0.005 in the Exome Variant Server (NHLBI Exome Sequencing Project ESP6500; http://evs.gs.washington.edu/EVS), the 1000 Genomes Project, or the Exome Aggregation Consortium (ExAC; http://exac.broadinstitute.org), or >0.05 in an internal exome database of ~700 individuals, were excluded before analysis. In addition, variants that were flagged as low quality or potential false positives (quality score ≤30, long homopolymer run >5, low quality by depth <5, within a cluster of single-nucleotide polymorphisms) also were excluded from analysis. Variants that were flagged only by the strand bias filter flag (strand bias >−0.10) were included in further analyses because the strand bias flag was previously applied to valid variants. Variants were annotated with the SeattleSeq138 Annotation Server (http://snp.gs.washington.edu), and variants for which the only functional prediction label was any one of “intergenic,” “coding-synonymous,” “utr,” “near-gene,” or “intron” were excluded. Individual genotypes with a depth <4 or genotype quality <20 were treated as missing in analysis. The code to generate Figure 3 is available at: http://dx.doi.org/10.6084/m9.figshare.1537555.
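The exclusion rules in the preceding paragraph can be sketched as a pair of small predicates. This is only an illustration under stated assumptions, not the pipeline actually used: the `Variant` record, its field names, and the flag name `StrandBias` are hypothetical, and "the only functional prediction label" is interpreted here as "every label attached to the variant is in the excluded set." Only the numeric thresholds and annotation labels are taken from the text.

```python
from dataclasses import dataclass

# Thresholds and labels quoted in the text above.
MAX_PUBLIC_AF = 0.005        # ESP6500, 1000 Genomes, ExAC
MAX_INTERNAL_AF = 0.05       # internal exome database (~700 individuals)
MIN_DEPTH = 4                # genotypes below this depth are treated as missing
MIN_GENOTYPE_QUALITY = 20    # genotypes below this quality are treated as missing
EXCLUDED_ANNOTATIONS = {"intergenic", "coding-synonymous", "utr", "near-gene", "intron"}

# Hypothetical flag name; the real filter label is not given in the text.
STRAND_BIAS_FLAG = "StrandBias"


@dataclass
class Variant:
    """Hypothetical record holding only what the filters need."""
    public_afs: dict      # database name -> alternate allele frequency
    internal_af: float    # frequency in the internal exome database
    filter_flags: set     # quality flags attached upstream
    annotations: set      # functional prediction labels


def keep_variant(v: Variant) -> bool:
    """Return True if the variant survives the exclusion rules described above."""
    if any(af > MAX_PUBLIC_AF for af in v.public_afs.values()):
        return False
    if v.internal_af > MAX_INTERNAL_AF:
        return False
    # A variant flagged only by strand bias is retained; any other flag excludes it.
    if v.filter_flags - {STRAND_BIAS_FLAG}:
        return False
    # Exclude when every functional label is one of the excluded categories.
    if v.annotations and v.annotations <= EXCLUDED_ANNOTATIONS:
        return False
    return True


def genotype_is_missing(depth: int, genotype_quality: int) -> bool:
    """Individual genotypes with depth <4 or genotype quality <20 count as missing."""
    return depth < MIN_DEPTH or genotype_quality < MIN_GENOTYPE_QUALITY
```

Keeping the thresholds as named constants makes it easy to see which values come from the text and to vary them when re-analyzing data.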

Figure 3

Effects of an increasing number of trios sequenced and specificity of phenotype on the power to detect significant association between putative mutations and phenotype. Assuming a de novo missense rate of 3.46 × 10⁻⁵/chromosome, as increasing numbers of trios (x axis) are tested by exome sequencing, the power to detect a significant association (ranges of possible P values are represented by different shades of grey; darker indicates smaller and more significant P values) between de novo variants in a gene and the phenotype of interest increases. In addition, as the specificity of the phenotype of interest increases, the proportion of individuals tested who have the phenotype (y axis) naturally decreases, also resulting in increased power. A small decrease (from 60% to 50%) in the proportion of individuals who have the phenotype of interest can increase power more than sequencing 10,000 additional trios.

Results

Analysis of variants from ES under a de novo mutation model confirmed the presence of a different de novo variant in KDM1A (RefSeq NM_001009999.2) in each of families A and B (Table 1; Figures 1 and 2). A complete phenotypic description, facial photographs, and published variant information from family C were shared with the UW-CMG by the corresponding author (A.R.). All three children with de novo KDM1A variants had similar, albeit nonspecific, clinical findings (Table 1), including similar facial features, global developmental delay, and hypotonia (Figure 1; Table 1). In particular, all three individuals have a prominent forehead, slightly arched eyebrows, elongated palpebral fissures, a wide nasal bridge, thin lips, and widely spaced teeth (Figure 1).
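As an illustration of what a de novo mutation model typically tests in a parent–child trio, the sketch below flags a site when the child is heterozygous and both parents are homozygous for the reference allele. The exact criteria used in this study are not spelled out in the text, so the rule, the function name, and the encoding of genotypes as alternate-allele counts are assumptions made for illustration.

```python
from typing import Optional


def is_de_novo_candidate(child: Optional[int],
                         mother: Optional[int],
                         father: Optional[int]) -> bool:
    """Flag a site as a de novo candidate in a trio.

    Genotypes are alternate-allele counts (0, 1, or 2); None means the
    genotype is missing (for example, after depth or quality masking).
    Assumed rule: child heterozygous, both parents homozygous reference.
    """
    if child is None or mother is None or father is None:
        return False  # cannot call de novo status without all three genotypes
    return child == 1 and mother == 0 and father == 0


# Child carries one alternate allele that neither parent carries.
print(is_de_novo_candidate(1, 0, 0))  # True
```

Treating missing genotypes as non-candidates is a conservative choice: a parent with insufficient coverage could carry the allele undetected.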

All three variants in KDM1A are missense variants that were predicted to be deleterious (minimum Polyphen-2 HumVar score of 0.962), result in amino acid substitutions of highly conserved amino acid residues (minimum Genomic Evolutionary Rate Profiling score = 5.72) in KDM1A, and have high combined annotation-dependent depletion scores suggestive of dominant mutations (minimum combined annotation-dependent depletion score of 27.2) (Table 1). Moreover, KDM1A is in the top 2% of evolutionarily constrained genes (i.e., genes that are intolerant to functional variation), and this set of genes is enriched for genes that are known to underlie dominant Mendelian phenotypes.19 None of the three variants were found in over 71,000 control exomes comprising the ESP6500, 1000 Genomes Project phase I (November 2010 release), or Exome Aggregation Consortium (20 October 2014 release) databases, nor the internal databases (>1,400 chromosomes). No rare variants in KDM1A were present in individuals included in Geno2MP version 1.0 (ref. 1) who had a similar phenotype.

Discussion

Function of KDM1A and delineating a new disorder

Tunovic et al.16 hypothesized that the phenotype of the proband in family A might result from the combined effects of the de novo variant, c.2353T>C [p.(Tyr785His)], in KDM1A and a second de novo variant, c.2606_2608delAGA [p.(Lys869del)], in ANKRD11, suggesting that the child was affected by two Mendelian conditions: a Kabuki syndrome–like phenotype caused by the variant in KDM1A and KBG syndrome (OMIM 158050) caused by the variant in ANKRD11. This hypothesis was motivated, in part, by the presence of physical findings that did not overlap with features observed in Kabuki syndrome (OMIM PS147920). However, comparison with two additional persons with de novo mutations in KDM1A reveals that many of these features seem to be shared among all three. This suggests that mutations in KDM1A cause a condition that has phenotypic overlap with Kabuki syndrome but is nonetheless distinct.

Additional evidence suggests that the c.2606_2608delAGA [p.(Lys869del)] variant in ANKRD11 in family A does not cause KBG syndrome. Excluding microdeletions or large chromosomal deletions, the vast majority of ANKRD11 variants that underlie KBG syndrome are frameshifts or nonsense mutations that are predicted to result in a truncated protein or nonsense-mediated decay.20,21,22 By contrast, only four missense or small deletion/duplication mutations have been reported as causing KBG syndrome.20,21,22 These findings, combined with the observations that ANKRD11 is not highly conserved (it has only 79% identity with its mouse ortholog21) and is highly polymorphic among the general population,23 suggest that only a small subset of missense variants found in ANKRD11 result in KBG syndrome. Moreover, the c.2606_2608delAGA [p.(Lys869del)] variant is not predicted to be pathogenic by CADD version 1.0 (a Phred-scaled score of 13.03 is well below the score of 25 observed for the majority of mutations that cause autosomal-dominant conditions). Finally, although macrodontia of the upper incisors, which is considered a hallmark feature of KBG syndrome, is often not observed until adult teeth emerge, the proband in family A has normal dentition.20,22,24

KDM1A is a histone demethylase that has been extensively studied in vitro and in model organisms, and has been shown to play diverse and key roles in regulating gene expression during development.25 Homozygous knockout of Kdm1a in mice is lethal during early embryogenesis.26 Kdm1a is involved in repression of neuronal genes in non-neuronal cells,27,28 and during the perinatal period, alternative splicing of KDM1A results in expression of two neuron-specific isoforms that regulate neurite maturation.29 In mice, proper skeletal muscle differentiation requires Kdm1a to demethylate myogenic promoters,30 which may explain the discovery of heart defects in mice that are homozygous for a hypomorphic Kdm1a allele.31 Interestingly, mice that are heterozygous for a Kdm1a deletion are apparently normal and fertile,26 suggesting that haploinsufficiency may not result in an obvious defect. Nevertheless, it remains to be seen whether the phenotypes we report to be caused by variants in KDM1A are the result of a loss or gain of function. An additional intriguing observation is that all three mutations alter residues in the amine-oxidase domain (Figure 2b), which comprises flavin adenine dinucleotide-binding and substrate-binding functional subdomains.32 The active site cavity of KDM1A is formed by the substrate-binding subdomain32 and is required for KDM1A to demethylate H3K4me1/2 and repress transcription.28

Scaling up gene discovery by social networking to tackle the n-of-1 problem

The discovery that variants in KDM1A underlie a distinctive and previously unrecognized Mendelian condition is the result of social networking by the family of an affected child with another family and several research groups. This approach consisted of establishing a website that included a comprehensive description of the proband’s symptoms and medical history using both lay and medical terminology and reports of putative pathogenic variants identified via ES, as well as a linked blog, Twitter account, and Facebook page. Exposure to the public-at-large via common social media such as Facebook and Twitter is a strategy that leverages sites that are familiar to many families. Technology-savvy families also are capitalizing on existing searchable information platforms by editing entries for conditions described on Wikipedia, setting Google alerts for symptoms and rare variants, purchasing Google AdWords, and using Google Analytics to identify pockets of researcher and patient activity.12 In this case, only 5 days after launching their website, family A received an e-mail from family B describing “her son, along with a picture of him that showed the remarkable resemblance between the two boys; he looked like he could be [proband A]’s brother” (P. Lorentzen, personal communication).

The direct-to-consumer genetic testing movement, in particular genetic ancestry testing, has made it routine to use the Internet and social media to research genetic relationships and the meaning of genetic information, including variants associated with disease.33 Building on the global reach of the Internet, online social networking also is increasingly leveraged by communities of people and families with rare diseases to connect families, enabling them to share their experiences, provider/researcher relationships, genetic knowledge, and strategies for advocating for their children.34,35,36 Indeed, one of the important benefits of social networking is the ability of families to share information, including sequence data, directly with researchers in the hope of garnering more interest and making collaboration more convenient and cost-effective. Accordingly, online social networking is increasing the role that families play in stimulating, coordinating, and supporting research.

To support and facilitate the efforts of families toward discovering the genetic basis of their condition, we developed an online portal, the Repository for Mendelian Disorders Family Portal (RMD-FP), for families to submit and subsequently share phenotypic information and ES and whole-genome sequencing data. The RMD-FP is a point of entry into the human genetics community for families who seek to cast the widest net in recruiting researchers to work on their condition. The RMD-FP provides information to families about research, facilitates family decisions about preferences for how their data may be used, and guides families through the process of directly submitting their phenotypic information and genomic data. The RMD-FP will eventually enable the collection of detailed self-reported phenotype/trait information via structured data entry and will enable families to receive results, if available, via My46, a Web-based tool for managing the return of genetic test results.

Once data are deposited in the RMD-FP, the phenotypic information will be curated and structured37 for submission to PhenomeCentral/Matchmaker Exchange. Genomic data will be reanalyzed, and all variants found by either prior diagnostic sequencing or reanalysis to be segregating under the appropriate inheritance model(s) (i.e., candidate variants) will be entered, along with the structured phenotypic data, into a database. If a small number of candidate genes are identified, the genes also will be submitted to GeneMatcher/Matchmaker Exchange. However, many families consist of a single affected individual with no contributing family history, leading to analysis under all possible standard inheritance models (e.g., homozygous recessive, compound heterozygous, and de novo). This results in a large list of candidate variants/genes that are not appropriate for submission to GeneMatcher and currently leaves families with no way to efficiently share candidate variants with other interested families, clinicians, diagnostic laboratories, and researchers. The combination of structured phenotype information and sharing of all candidate variants should increase data consistency and thus the probability of a match.

To help address this gap, we are developing MyGene2 (beta release projected for early 2016), a public Web-based tool that enables searches of candidate variants/genes that are linked to phenotypic profiles of persons and families in the RMD-FP. Users of MyGene2 will be able to search for candidate variants matching a gene, inheritance model, and/or phenotypic trait or profile. If a user identifies a variant of interest, they can register with the site by creating an account, which enables them to contact the submitter(s) for further information. Registration is required to protect the confidentiality of sample submitters, to track matches, and to survey matched users about subsequent discoveries and publication. Tracking outcomes also helps to ensure that families benefit from their participation. Families will be able to use sample submission to publicize their candidate variants/genes, make their data available to the community via an editable “family page,” and participate more fully in gene discovery efforts without requiring a high level of technical knowledge. Clinicians and researchers also will be able to search for additional families with mutations in the same gene for gene discovery and delineation of new conditions; diagnostic laboratories will be able to search for additional cases to assist in the interpretation of VUSs. Ensuing manuscripts using/describing matches made through MyGene2 will only be required to acknowledge the site and its sources of support. Indeed, we envision MyGene2 as a resource to empower families, clinicians, and investigators to delineate new Mendelian conditions largely independent of UW-CMG so as to accelerate the rate of gene discovery for Mendelian conditions. With broad participation from the human genetics community, MyGene2 has the potential to greatly facilitate overcoming the “n-of-1 problem.”

The scenario we report in which VUSs in the same candidate gene led to the ascertainment of several persons with overlapping clinical features and the delineation of a distinct syndrome is likely to become an increasingly common strategy for discovering genes for Mendelian conditions. Identification of three independent families in which each person with a de novo variant in the same gene has the same condition meets existing guidelines for causality of Mendelian disorders.1,3 Nevertheless, confidence would be gained by assigning a P value for this observation,38 but doing so is difficult in the absence of greater sharing of detailed phenotype information linked to ES data from a large number of independent cases. For example, developmental delay is perhaps the most common phenotype found in individuals that undergo clinical ES, comprising 64% of cases tested in one recent survey.39 Therefore, if we estimate that roughly 10,000 trios have been analyzed for de novo variants via clinical ES, ~6,400 are predicted to have had developmental delay.39 Using the Fisher exact test for independence between the presence of de novo variants in KDM1A and developmental delay, the P value for identifying three individuals with de novo variants in KDM1A and developmental delay and no individuals with de novo variants in KDM1A without developmental delay is only 0.5576. This P value is not significant because of poor statistical power to distinguish between persons with mutations in different genes who are broadly described as having the same common, nonspecific condition.
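The P value quoted above can be checked with a short calculation. The sketch below is an independent, standard-library implementation of a two-sided Fisher exact test, not the authors' own code; the 2 × 2 table (3 carriers with developmental delay, 0 carriers without, 6,397 non-carriers with, and 3,600 non-carriers without) is constructed from the 10,000-trio and 64% figures in the text, and the function name is chosen for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1 = a + b          # e.g., individuals carrying a de novo variant
    col1 = a + c          # e.g., individuals with the phenotype
    total = a + b + c + d

    lowest = max(0, row1 + col1 - total)
    highest = min(row1, col1)

    def weight(x: int) -> int:
        # Unnormalized hypergeometric probability; exact integer arithmetic.
        return comb(col1, x) * comb(total - col1, row1 - x)

    observed = weight(a)
    as_extreme = sum(weight(x) for x in range(lowest, highest + 1)
                     if weight(x) <= observed)
    return as_extreme / comb(total, row1)


# Rows: carriers / non-carriers of a de novo KDM1A variant.
# Columns: with / without developmental delay (6,400 of 10,000 trios with delay).
p_value = fisher_exact_two_sided(3, 0, 6397, 3600)
print(round(p_value, 4))  # 0.5576
```

With only three carriers, even the most favorable outcome (all three having developmental delay) is expected by chance about a quarter of the time, which is why the test cannot reach significance for such a common phenotype.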

Power can be improved by increasing the sample size of trios tested, using additional phenotypic details to increase specificity about the phenotype tested, or some combination thereof; increasing the specificity of the phenotype of interest is much more efficient at improving power (Figure 3). For example, if we consider a gene with a de novo mutation rate of 3.46 × 10⁻⁵ per chromosome (i.e., the mean predicted de novo mutation rate for missense mutations in the top ~5% of highly evolutionarily constrained genes in a recent study19), even if 100,000 trios are tested, the best P value that could be obtained as long as 64% have developmental delay is 0.094. By contrast, if we increase the specificity of the phenotype of interest and thus reduce the fraction of those same 100,000 cases with the phenotype to 50%, the P value drops to 0.016. While increasing both sample size and phenotypic specificity is ideal, a quick and effective way to maximize the power of existing data sets is to make deep, structured phenotypic data linked to genotype data publicly available and accessible via tools like MyGene2 and others in order to enable statistically rigorous assessment of similar Mendelian gene discoveries.3,38

Mendelian gene discovery has traditionally been organized with the clinician-researcher as the central hub, around which families are solicited, experiments performed, results reported in manuscripts, and data shared. Social networking promotes a more egalitarian network in which families also act as nodes, independently sharing phenotypic information, genetic data, and results with other families and researchers.14 Yet this can be a labor-intensive, inefficient, and expensive endeavor that requires some technical expertise to maximize the effort. At its full potential, MyGene2 can serve as an organizing node for families, providing to them convenient and free access to data from a large number of other families and investigators. It should be noted that MyGene2 is but one new tool to facilitate social networking and data sharing among people who are interested in rare diseases. We expect—and indeed encourage—the development of other strategies and solutions40 toward the same goals. One outcome of this model is a somewhat diminished role for both clinicians and investigators as the central organizing node but greater empowerment of families, and we predict a greater rate of discovery of genes and newly delineated Mendelian conditions.

In summary, social networking among families led to the recognition, if not the frank discovery, that de novo variants in KDM1A cause a newly delineated condition characterized by developmental delay, hypotonia, and characteristic facial features. Coupled with the fairly narrow range of phenotypic variation observed in the affected individuals described herein, this finding suggests that mutations in KDM1A might also explain some cases of apparently isolated intellectual disability. Developing infrastructure that enables families to share phenotypic information and genetic data at scale would empower many more families worldwide, and we predict this will accelerate the pace of gene discovery for Mendelian conditions. The rapid translation of these discoveries into diagnostic tests and new starting points for repurposing or developing therapeutics would, in turn, improve the overall care of families with rare diseases.

Disclosure

M.J.B., H.K.T., and J.-H.Y. have a patent application pending on My46. The other authors declare no conflict of interest.