Massively parallel identification of functionally consequential noncoding genetic variants in undiagnosed rare disease patients

Clinical whole genome sequencing has enabled the discovery of potentially pathogenic noncoding variants in the genomes of rare disease patients with a prior history of negative genetic testing. However, interpreting the functional consequences of noncoding variants and distinguishing those that contribute to disease etiology remains a challenge. Here we address this challenge by experimentally profiling the functional consequences of rare noncoding variants detected in a cohort of undiagnosed rare disease patients at scale using a massively parallel reporter assay. We demonstrate that this approach successfully identifies rare noncoding variants that alter the regulatory capacity of genomic sequences. In addition, we describe an integrative analysis that utilizes genomic features alongside patient clinical data to further prioritize candidate variants with an increased likelihood of pathogenicity. This work represents an important step towards establishing a framework for the functional interpretation of clinically detected noncoding variants.

Here, we describe the implementation of an MPRA-based strategy for characterizing the functional consequences of rare noncoding variants detected directly from whole genome sequencing of rare disease patients with a history of negative genetic testing. We profiled > 3000 genomic regions of interest and identified hundreds that exhibit significant regulatory activity. Furthermore, we illuminated > 100 rare noncoding variants that significantly alter the regulatory capacity of these genomic regions. Importantly, the functional consequences of these variants could not have been predicted in the absence of experimental data.
In addition to profiling the impact of rare noncoding variants on regulatory activity we outline an integrative approach for prioritizing candidate variants with an increased likelihood of pathogenicity. We incorporate transcription factor binding profiles at variant sites alongside clinical data from each proband to strengthen phenotypic associations between variant-containing regulatory elements and clinical presentations. Using this method we uncover several consequential variants that occur within genomic contexts that may contribute to disease etiology. In summary, this study provides a roadmap for the implementation of MPRAs in a clinical setting and introduces an important tool for improving the interpretation of noncoding variants in rare disease genomes.

Results
Identification of rare noncoding genetic variants in a cohort of undiagnosed rare disease patients. To explore the impact of rare noncoding genetic variants on the regulatory potential of genomic sequences we first identified rare variants in a cohort of undiagnosed rare disease patients. We analyzed the genomes of 111 exome negative probands, an even mix of females and males with a median age of 7 years at the time of analysis (Fig. 1a,b). The patient cohort presented with a wide variety of clinical manifestations spanning all organ systems (Fig. 1c). Among the most prevalent clinical features were anomalies of the nervous system, the musculoskeletal system, and the cardiovascular system (Fig. 1c).
For this study we focused exclusively on rare noncoding single nucleotide variants occurring within 100 kb of an annotated transcription start site (TSS). Variants were designated as rare if they had a gnomAD minor allele frequency < 0.001 (internal allelic frequencies were used when gnomAD data was not available) and were detected using at least two independent sequencing technologies. Family trio sequencing was available for all probands in this study, allowing us to further distinguish de novo rare variants from those inherited from a parent. We observed a median of 14 de novo rare variants per proband within 100 kb of a TSS and a total of 1958 de novo variants within this genomic window across all 111 probands (Fig. 1d, Table S1). For comparison we also selected inherited rare variants from a subset of probands. Inherited variants were selected at a similar frequency, a median of 15.5 variants per proband within 100 kb of a TSS, for a total of 1101 inherited rare variants (Fig. 1d, Table S1). In the probands profiled in this study we found that de novo rare variants were detected throughout the ± 100 kb genomic window surrounding TSSs and were slightly more prevalent in genomic regions near TSSs (Fig. 1e).
Profiling the regulatory capacity of genomic sequences using a massively parallel reporter assay. To systematically evaluate the regulatory potential of rare variant-containing genomic sequences we designed a massively parallel reporter assay (MPRA). For our MPRA we designed a library of sequences composed of 100 nucleotides of genomic information centered around the chromosomal position of each rare noncoding variant we detected (Fig. 2a, Table S1). We designed four analogous sequences for each genomic region of interest to represent all possible nucleotides in the variant position (Fig. 2a, Table S1). This comprehensive sequence design can facilitate the identification of genomic positions where specific variants impact regulatory potential as opposed to positions where any deviation from the reference genome has regulatory consequences. These sequences were used to design an oligonucleotide library with each oligo containing a genomic sequence of interest as well as a unique 12 nucleotide barcode sequence. We assigned each genomic sequence of interest 5 independent barcodes resulting in an overall library size of 61,180 oligos (3059 variants × 4 sequence analogs per variant × 5 barcodes per sequence) (Table S1).
We next constructed an MPRA plasmid library using a two-step cloning approach. First, the oligo library was cloned into an empty plasmid (no reporter gene) backbone (Fig. 2b). In the second cloning step a reporter cassette containing an SV40 promoter followed by GFP was inserted between the genomic sequence of interest and its associated barcode sequence (Fig. 2b). In the final MPRA library pool each plasmid encodes a uniquely barcoded GFP transcript that permits the association of its expression with the upstream genomic sequence of interest (Fig. 2b). Following library construction we performed targeted sequencing of the barcode containing region of the plasmid pool and confirmed that all of the genomic sequences of interest were represented in the MPRA plasmid library (Fig. 2c, Table S2).
To measure the regulatory potential of the genomic sequences in our MPRA library we transfected the plasmid pool into HEK293T cells. Given the wide array of clinical manifestations in the patient cohort, we reasoned that potentially pathogenic noncoding variants would exhibit measurable effects on reporter expression in most cellular contexts. We chose HEK293T as a cellular system due to the cell type's robust growth properties and high transfection efficiency. Cells were harvested 24 h after transfection and targeted RNA-Seq libraries were generated by amplifying the barcode-containing region of the expressed GFP reporter transcripts. We detected robust expression of reporter transcripts associated with each of the genomic sequences of interest in our MPRA library (Fig. 2c, Table S2). Moreover, we observed high concordance in reporter expression associated with each respective sequence of interest across biological replicates (Fig. 2d).
In the MPRA experiment the abundance of each barcode in the RNA-Seq library serves as a proxy for the regulatory potential of the associated genomic sequence of interest. To establish a baseline for regulatory activity we first profiled MPRA expression for library elements representing the reference genome. After normalizing    www.nature.com/scientificreports/ sequences in our MPRA library (Fig. 2f). Sequences displaying significant regulatory activity were distributed evenly between genomic regions corresponding to inherited or de novo variants identified in our patient cohort (Fig. 2e,f). Moreover, we found that the regulatory activity of these sequences was split evenly between activation and repression of reporter expression (Fig. 2e). Interestingly, regulatory activity in the MPRA showed no correlation with the distance between the genomic sequence of interest and its nearest TSS in the genome (Fig. 2g). Altogether, these results demonstrate that the MPRA is a robust method for profiling the regulatory activity of genomic sequences. We next searched for MPRA library element features that might be associated with regulatory activity. In general, we observed uniform regulatory activity from genomic sequences representing variant-containing regions across all probands in our patient cohort (Fig. S1a). Furthermore, we detected uniform activity across the sequence analogs we designed for each variant (Fig. S1b). Expression in the MPRA was not influenced by the identity of the nucleotide in the variant position (Fig. S1c). Likewise, expression was similar across sequences corresponding to genomic regions containing inherited or de novo variants (Fig. S1d). Lastly, genomic features such as chromatin accessibility at the variant site were not predictive of regulatory activity (Fig. S1e). These data highlight the ability of the MPRA to discern sequences with regulatory capacity in the absence of predictive features.
Quantifying the impact of genetic variants on the regulatory capacity of genomic sequences. To evaluate the impact of rare genetic variants on the regulatory capacity of the genomic sequences we compared expression between corresponding reference and variant alleles represented in our MPRA library. We found that ~ 4.5% of the variants profiled had a significant impact on regulatory activity ( Fig. 3a,b, Table S3). Consequential variants were distributed evenly between genomic regions harboring inherited or de novo variants (Fig. 3b). Furthermore, there was no relationship between the class of base mutation (transition or transversion) and the effect of the variant on regulatory activity (Fig. S2a). Similar to the basal regulatory activity detected from reference alleles, we observed no correlation between the functional impact of a variant and its distance from the nearest TSS in the genome (Fig. 3c).
Overall, we identified 91 de novo variants that significantly impacted the regulatory potential of genomic sequences (Fig. 3d, Table S3). We detected 51 de novo variants that resulted in significant decreases in reporter expression and 40 de novo variants resulting in significant increases in reporter expression. Importantly, increases in reporter expression appeared to result from a loss of repression observed by the reference allele (Fig. 3d). Analysis of MPRA expression from all possible alleles corresponding to de novo variants that significantly alter reporter expression revealed that most of the regulatory activity was associated with reference or variant alleles and that other alternative alleles had minimal regulatory potential (Fig. 3d,e). These results demonstrate that the MPRA is a powerful tool for identifying genetic variants that impact the regulatory capacity of genomic sequences.
Integrative prediction of rare variant pathogenicity. To illuminate rare variants with a higher likelihood of pathogenicity we integrated human phenotype ontology (HPO) terms into our MPRA analysis. Briefly, HPO provides a standardized terminology for phenotypic features associated with human disease 17 . As a resource, HPO also documents associations between HPO terms and genes with demonstrated links to annotated phenotypes. We reasoned that a rare variant occurring within a regulatory element could result in phenotypes similar to those associated with transcription factors bound to the element. To test this hypothesis we utilized publicly available ChIP-seq data to assess transcription factor binding in the genomic regions profiled by our MPRA library. We subsequently evaluated the overlap in HPO terms associated with each transcription factor and the terms associated with probands harboring variants at binding sites for each factor (based on matching HPO accession numbers). We identified 145 variants (61 inherited and 84 de novo) for which there was significant overlap in HPO terms between the proband and at least one transcription factor bound to the variant-containing site (Fig. 4a, Table S4). Interestingly, variants that had significant overlap in HPO terms with bound transcription factors as well as a significant impact on regulatory activity in our MPRA were almost exclusively de novo (Fig. 4a). More specifically, we identified 8 variants within binding sites for transcription factors with significant HPO term overlap that impacted regulatory activity by a |z-score| > 2 in our MPRA, 7 of which were de novo.
We next evaluated whether transcription factor binding profiles can be predictive of the phenotypic consequences of noncoding variants. We selected a candidate de novo variant on chromosome 22 detected in a proband that had significant HPO term overlap with several factors that bind to the variant site (Fig. 4b). Expression analysis in the MPRA demonstrated that the reference allele corresponding to this variant exhibited repressive features that were absent with the variant allele (Fig. 4c). In agreement with this observation, one of the factors that binds to this site (KDM5B) is a lysine demethylase known to play a role in transcriptional repression 18 . Based on this relationship we predicted that inhibiting the function of KDM5B and perturbing the putative regulatory activity of the genomic region harboring the variant would have overlapping consequences.
To explore a potential association between KDM5B and the variant-containing genomic region we inhibited expression of KDM5B using CRISPR interference (CRISPRi) and profiled the downstream effects using transcriptome-wide expression analysis. In response to KDM5B inhibition we detected significant expression changes in 312 genes (220 upregulated, 92 downregulated) (Fig. 4d, Table S5). In parallel to KDM5B inhibition we used CRISPRi to interfere with the potential regulatory functions of the variant-containing genomic region which resulted in significant expression changes in 451 genes (268 upregulated, 183 downregulated) (Fig. 4b,e, Table S5). Importantly, the large number of differentially expressed genes in response to variant site perturbation indicates that the genomic region harboring the variant is a functional regulatory element. We observed www.nature.com/scientificreports/   www.nature.com/scientificreports/ significant overlap in the genes that were differentially expressed in response to KDM5B inhibition and variant site perturbation suggesting the presence of regulatory interactions between the transcription factor and the variant site (Fig. 4f). These results are consistent with the overlap in phenotypes (HPO terms) associated with   www.nature.com/scientificreports/ KDM5B and the proband harboring the variant and support the use of transcription factor binding profiles for predicting the consequences of noncoding variants. Overall, we identified 8 rare variants (7 de novo, 1 inherited) that exhibited a significant impact on reporter expression (|z-score| > 2) in our MPRA and occurred in a genomic region bound by transcription factor(s) associated with HPO terms that overlapped the respective proband (Fig. 4g). Our MPRA expression data revealed that 6 of these variants were loss-of-function (decreased expression) and 2 variants were gain-of-function (loss of repression). Although the 7 de novo variants that ultimately met our stringent prioritization criteria represent just 0.36% (7/1958) of the total de novo variants in the MPRA library, those variants represent 8.33% of the de novo variants that occurred in genomic regions bound by transcription factor(s) with overlapping HPO terms. This observation suggests that incorporating transcription factor binding profiles into the design of future MPRA libraries will likely lead to more efficient identification of high priority variants. Altogether, our results demonstrate that the MPRA, in combination with patient clinical data and complementary genomic data, is an effective method for illuminating rare noncoding variants that warrant detailed investigation.

Discussion
A growing body of evidence has demonstrated that most disease-associated variants occur in noncoding regions of the genome [19][20][21] . Although several computational tools have been developed for predicting the pathogenicity of noncoding variants, classifications from these tools exhibit low concordance and no single approach has been adopted as standard practice in the clinical interpretation of noncoding variants 12 . For rare noncoding variants that have not been previously detected or characterized, experimental evidence is nearly essential to evaluate the functional impact of the variant prior to consideration for involvement in disease. Here, we provide a scalable solution to this challenge by applying high-throughput functional genomic technologies to assist with the interpretation of clinically detected noncoding variants. We constructed an MPRA to directly profile the functional consequences of thousands of rare noncoding variants detected in the genomes of undiagnosed rare disease patients. Furthermore, we have outlined an integrative prioritization strategy that incorporates patient clinical data along with publicly available genomic data to pinpoint noncoding variants with an increased likelihood of pathogenicity.
Our MPRA-based strategy enabled us to systematically profile the functional consequences of 3059 clinically detected rare noncoding variants and our analysis pipeline led to the prioritization of 8 noncoding variants for follow up study. Although the overall yield of prioritized variants in this study was relatively low, our data have revealed features of the MPRA design that can be adapted to significantly improve the effectiveness of future screening campaigns. For example, we found that nearly all the noncoding variants prioritized using our approach were de novo. Furthermore, profiling all possible alleles at each variant site did not generate insight that would have a meaningful impact on variant interpretation. By focusing future MPRA library designs exclusively on de novo variants, profiling only reference or variant alleles, and further restricting libraries to genomic regions bound by transcription factors associated with relevant HPO terms we estimate that a 30-fold increase in the volume of prioritized variants could be achieved. Such an improvement would significantly increase the number of pathogenic variants uncovered using this approach.
The variants we profiled in this study were identified in rare disease patients that exhibited a variety of clinical manifestations spanning multiple organ systems suggesting that the effects of potentially pathogenic variants are not restricted to specific cellular contexts. However, this is unlikely to be the case for all patients that undergo WGS and defined cellular models may be required to profile variants associated with isolated clinical presentations. Aside from cellular context, there are additional aspects of MPRA design that are likely to impact experimental results. While the MPRA we describe in this study was performed using episomal expression vectors, lentiviral-based MPRA systems that integrate into the genome have been developed 22,23 . Chromosomally integrated MPRAs may be required to profile variants with functional consequences that are dependent on chromatin context.
As with all experimental methods, the MPRA approach we describe here has the potential to yield both false positive and false negative results. For example, our MPRA design profiles variant activity at a fixed short distance from the promoter of a reporter gene but variants often occur in genomic locations that are distant from genes that they may regulate. Increased distances between variants and their potential regulatory targets within the genome may impact their actual functional consequences. Alternatively, variants may reside in regions of the genome that are functionally inert (e.g., lack regulatory interactions) and alterations in the regulatory potential of these variant-containing sequences may not influence genome activity in their native genomic context. The potential for false positive results may have contributed to the equal distribution of consequential variants observed across the 100 kb windows surrounding TSSs that were profiled in this study. More specifically, a subset of the variants we profiled may not influence expression at more distal TSSs in the genome despite their influence on reporter expression using an MPRA strategy.
Most of the variants profiled in this study did not have an impact on reporter expression in our MPRA. However, a subset of these variants may have regulatory consequences in their native genomic contexts. For example, variants that exhibit their effects in concert with additional genomic features (e.g., DNA methylation, chromatin modifications) are unlikely to influence reporter expression in an MPRA. As mentioned previously, the regulatory effects of some variants may be restricted to specific cellular contexts. Although our MPRA approach was able to identify many variants that alter the regulatory capacity of genomic sequences, our experimental strategy was limited to HEK293T cells. We cannot exclude the possibility that many of the variants that lacked regulatory consequences in this study could have significant effects if profiled in alternative cellular models.
Among the most critical elements of the approach we outline in this study is direct access to patient clinical data. This information is essential for establishing phenotypic associations between genomic regions that harbor www.nature.com/scientificreports/ noncoding variants and clinical features of disease. The utility of patient data when predicting the phenotypic consequences of noncoding variants further supports the implementation of our strategy directly within a clinical setting. Although the approach we describe here is intended to facilitate the clinical interpretation of noncoding variants at the level of individual patients, data sharing across the broader research community is of fundamental importance to the process of rare disease diagnosis. Importantly, any functional data obtained when profiling a rare disease genome has the potential to aid in the interpretation of unrelated disease genomes that harbor variants in the same chromosomal positions or in proximal genomic regions. In summary, this study provides proof of principle for the application of an MPRA-based strategy to assist with the clinical interpretation of noncoding variants identified in disease genomes. We anticipate that such an approach will have a positive impact on the diagnostic rates achieved by clinical WGS.

Methods
Patient consent. Patients profiled in this study are participants in the Genomic Answers for Kids (GA4K) program at the Children's Mercy Research Institute. Informed written consent was obtained upon enrollment into the GA4K program.
Selection of candidate rare variants. Single nucleotide variants were called using DRAGEN 3.6.3 on GRCh38 for probands and parents enrolled in the Genomic Answers for Kids initiative at Children's Mercy Hospital 24 . The following criteria were used to select candidate variants from probands: (a) gnomAD minor allele frequency < 0.001 (internal allelic frequencies were used when gnomAD data was not available), (b) variant detection using at least two sequencing technologies (whole genome sequencing, whole genome bisulfite sequencing, 10 × linked-read sequencing, or PacBio long-read sequencing), and (c) variants occurring within 100 kb of an annotated transcription start site based on NCBI RefSeq annotations. Variants were excluded from consideration if: (a) observed in more than one proband, (b) previously reported as causative of a disorder, or (c) predicted to impact protein-coding gene function through nonsense mutation, frameshift mutation, disruption of a stop codon, loss of translation initiation, or interference with splice donor/acceptor sequences. Candidate rare variants were categorized as "inherited" if observed in sequencing reads from parental sequencing. Candidate rare variants were categorized as "de novo" if no evidence of the variant was present in parental sequencing data. Variant coordinates were lifted over to GRCh37 for genomic analyses.
MPRA library cloning. The MPRA oligo library was amplified using Q5 High-Fidelity DNA Polymerase (NEB). The amplified library was size-selected using Agencourt AMPureXP beads (Beckman Coulter). The oligo library was then inserted into an empty MPRA vector by golden gate assembly using BsaI (NEB) and T4 Ligase (NEB). The resulting library was purified using isopropanol precipitation (15 µL elute) then expanded by electroporation into 5 vials (3 µL ligation/vial) of One Shot TOP10 Electrocomp E. coli (ThermoFisher). The plasmid library was isolated using the Plasmid Plus Maxi kit (QIAGEN). The MPRA reporter gene (SV40 promoter followed by GFP) was then incorporated into the plasmid library by golden gate assembly using Esp3I (NEB) and T4 Ligase (NEB). The final MPRA plasmid library was purified, expanded, and isolated as described previously. Primers and PCR conditions are listed in Supplemental Table S6. Single guide RNA cloning. Forward and reverse-complement single-stranded oligonucleotide inserts containing 5′ BsmBI sites followed by single guide RNA sequences were purchased from IDT. Oligos were mixed at equimolar concentrations in NEB Buffer 3.1 and annealed using thermal cycler conditions listed in Supplemental Table S6. Following annealing of complementary oligonucleotide inserts, inserts were ligated into BsmBI-digested pXPR_050 vector (Addgene cat. no. 96925) using the Quick Ligation Kit (NEB) according to manufacturer's protocol and the ligation product was transformed into Stbl3 chemically competent E. coli cells (Invitrogen) via heat shock. Sequence-confirmed clones were cultured and pDNA extracted and purified using the Plasmid Plus Midi Kit (Qiagen). www.nature.com/scientificreports/ CRISPR interference. HEK293T cells stably-expressing dCas9-KRAB were plated in 6-well dishes at a density of 2 × 10 5 cells/well. Cells were transfected 24 h after plating with an equimolar pool of 3 plasmids encoding sgRNAs targeting a common gene or variant. Cells were transfected using the Lipofectamine 3000 kit (Invitrogen) according to manufacturer directions. Briefly, a mixture of 5 µL P3000 reagent, 833 ng of each of the 3 sgRNA-containing pXPR_050 vectors targeting the same gene or variant, and 125 µL Opti-MEM was added to a mixture of 7.5 µL of Lipofectamine 3000 and 125 µL of Opti-MEM. The mixture was allowed to incubate at room temperature for 15 min, after which it was added to cells in a drop-wise manner. RNA was isolated 72 h post-transfection using the RNEasy mini kit (Qiagen) as per the manufacturer's instructions.