## Main

Collectively, rare diseases affect 1 in 20 people1, but fewer than half of the approximately 10,000 cataloged rare diseases have a resolved genetic etiology2. Standardized genome sequencing (GS) of large, phenotypically diverse collections of patients with rare diseases enables etiological discovery across a wide range of pathologies3,4,5 while boosting genetic diagnostic rates for patients. The 100,000 Genomes Project (100KGP), the largest GS study of patients with rare diseases to date, sequenced 34,523 UK National Health Service patients with rare diseases and 43,016 of their unaffected relatives. The linked genetic and phenotypic data of 100KGP participants were then made available to researchers through a web portal called the Genomics England Research Environment. The scale and complexity of such large GS datasets and the hierarchical nature of patient phenotype coding6 induce numerous bioinformatic and statistical challenges. Most importantly, the full genotype data from GS studies of tens of thousands of individuals are typically stored in unmodifiable files many terabytes in size, leading to high storage and processing costs. Recently developed frameworks, such as Hail7 and OpenCGA8, afford greater flexibility. However, they are designed to capture genotypes for variants across the full minor allele frequency (MAF) spectrum, from rare (MAF < 0.1%) to common (MAF > 5%) variants. To accommodate large numbers of genotypes, they depend on distributed storage systems and require numerous software packages, hindering deployment. We developed a database schema, the ‘Rareservoir’, for working with rare variant genotypes and patient phenotypes flexibly and efficiently. We deployed a Rareservoir only 5.5 GB in size of 100KGP data and applied the Bayesian statistical method BeviMed9 to identify genetic associations between coding genes and each of the 269 rare disease classes assigned to patients by clinicians. Of the previously unknown associations that we identified, we followed up the most plausible subset in confirmatory analytical and experimental work.

## Results

### The Rareservoir

Relational databases (RDBs) provide a unified, centralized structure for storing, querying and modifying data of multiple underlying types. In principle, an RDB could provide a convenient foundation for the analysis of genotypes, variants, genes, participants and statistical results, but they cannot accommodate tables of the scale required to store exome or genome-wide genotypes in a moderately sized cohort. An RDB can, however, accommodate a sparse representation of genotypes corresponding to rare variants only, which encompass almost all variants having a large effect on rare disease risk. We developed an RDB schema, the Rareservoir, and complementary build procedure for the analysis of rare diseases, which by default, stores genotypes corresponding to variants for which all population-specific MAFs are likely to be <0.1%. This reduces the number of stored genotypes in a large study by about 99% (Extended Data Fig. 1). The Rareservoir encodes variants as 64-bit integers (‘RSVR IDs’) (Extended Data Fig. 2), which can represent 99.3% of variants encountered in practice without loss of information. RSVR IDs occupy a single column and increase numerically with respect to genomic position, allowing fast location-based queries within a simple database structure. To support the build process of a Rareservoir, we developed a complementary software package called ‘rsvr’ (Extended Data Figs. 2 and 3). The package includes tools to annotate variants with MAF information from control databases (for example, gnomAD10), pathogenicity scores (for example, combined annotation-dependent depletion (CADD) scores11) and predicted Sequence Ontology (SO)12 consequences with respect to a set of transcripts. We use a 64-bit integer (‘CSQ ID’) to record the consequences for interacting variant/transcript pairs, where each bit encodes one of the possible consequences, ordered by severity. Encoding the consequences in this way is efficient and enables succinct queries that threshold or sort based on severity of impact. The Rareservoir also contains a table with genetically derived data for each sample (including ancestry, sex and membership of a maximal set of unrelated participants) and a table of ‘case sets’ storing the rare disease classes assigned to each participant.

### BeviMed infers 241 known and 19 unknown genetic associations

We built a Rareservoir, 5.5 GB in size, containing 11.9 million rare exonic and splicing single-nucleotide variants (SNVs) and short insertions or deletions (indels) affecting canonical transcripts of protein-coding genes in Ensembl v.104 (ref. 13) from a merged variant call format file (VCF) containing genotype calls for 77,539 participants, including 29,741 probands, in the Rare Diseases Main Programme of the 100KGP (Data Release v.13) (Extended Data Fig. 4). During enrollment to the 100KGP, expert clinicians used the clinical characteristics of each affected participant to assign them to one or more of 220 ‘Specific Diseases’. The Specific Diseases are hierarchically arranged into 88 ‘Disease Sub Groups’, each of which belongs to 1 of 20 ‘Disease Groups’. Whereas the eligibility criteria for many Specific Diseases aligned to the same or closely related rare diseases, for others such as ‘Intellectual disability’, the criteria were broader and encompassed diverse genetic etiologies. We generated 269 analytical case sets corresponding to all distinct Specific Diseases and Disease Sub Groups, ranging in size from 5,809 to one proband, and stored them in the Rareservoir (Fig. 1a and Extended Data Figs. 5 and 6). We included these two levels of the phenotyping hierarchy to account for heterogeneity in presentation or diagnosis among cases sharing the same genetic etiology, with the aim of boosting power to identify statistical genetic associations.

Using the Bayesian statistical method BeviMed9, we obtained a posterior probability of association (PPA) between each of the 19,663 protein-coding genes and each of the 269 rare disease classes. BeviMed computes posterior probabilities over a baseline model of no association and competing association models, each of which assumes a particular mode of inheritance (MOI; dominant or recessive) and consequence class of etiological variant (in this study, high impact, moderate impact or 5′ untranslated region (UTR)). The PPA is obtained by summing the posterior probabilities over all association models. The association model with the greatest posterior probability (the modal model) determines the inferred MOI and class of etiological variant. Conditional on an association model, BeviMed models the pathogenicity of each included rare variant. In the model, participants with at least one pathogenic allele (under a dominant MOI) or at least as many pathogenic alleles as the ploidy (under a recessive MOI) have a pathogenic configuration of alleles, which determines their risk of case status. For each rare disease class, we selected a set of unrelated cases based on pedigree information provided by the 100KGP and compared them with participants not in the case set who belonged to different pedigrees and to a maximal set of unrelated participants, also provided by the 100KGP. To account for correlation between case sets, we only recorded the association for each gene having the highest PPA within a given Disease Group. Using a significance threshold of PPA > 0.95, we identified 260 significant associations, 241 of which were documented by the PanelApp gene panel database14, an expert-curated and annotated resource containing gene lists with high, medium or low levels of prior supporting evidence of causality for rare diseases (Fig. 1b). Of the 241 known associations that we identified, 43 (17.8%) were with Disease Sub Groups. For example, within each of the nine known genes associated with the Disease Sub Group ‘Posterior segment abnormalities’, the set of cases explained by variants with a conditional posterior probability of pathogenicity > 0.8 comprised participants encompassing multiple Specific Diseases (Extended Data Fig. 7). This demonstrates that participants with different Specific Diseases belonging to the same Disease Sub Group sometimes share defects in the same gene, which confirms that treating Disease Sub Groups, not just Specific Diseases, as case sets boosts statistical power.

Of the 241 associations identified as previously known according to PanelApp, 237 (98.3%) had an inferred MOI that was consistent with the MOIs listed for the relevant gene. Of these, the consistent MOI was found in the matched panel (223 associations), in the notes for the matched panel (5 associations) or in the MOIs listed for an alternative relevant panel (9 associations) in PanelApp (Source Data Fig. 1). This provided independent evidence that the genetic associations we labeled as known (without reference to MOI information) are genuinely supported by evidence in the literature, further demonstrating the accuracy of BeviMed’s inference. Of the four known associations with an inferred MOI that was incongruous with PanelApp, two had supporting evidence for the inferred MOI in the literature that was absent from PanelApp: EDA with dominant ‘Ectodermal dysplasia without a known gene mutation’15 and AICDA with dominant ‘Primary immunodeficiency’16. The two associations with an MOI that was unsupported in the literature were between UCHL1 and dominant ‘Inherited optic neuropathies’ and between SLC39A8 and dominant ‘Intellectual disability’.

Among 5,253 of the probands included in our analysis, the table of clinically reported variants available from the 100KGP Rare Diseases Main Programme at the time of this study comprised 4,907 distinct variants that had been classified as pathogenic or likely pathogenic in 1,863 genes. For 855 of these genes, etiological variants had been reported for only one family, suggesting that many genes that are etiological in the 100KGP are not identifiable by statistical association. Nevertheless, across the 260 associations identified by BeviMed, 2,536 distinct rare variants had a posterior probability of pathogenicity > 0.8 conditional on the modal model and were observed as part of a pathogenic configuration of alleles in a case (Source Data Fig. 1). Interestingly, among the subset of 2,485 variants contributing to the 241 known associations, only 1,604 featured in the table of clinically reported variants.

We found 19 previously unidentified genetic associations. To select a shortlist for further investigation, we assigned a plausibility score (range 0–3) based on three sources of additional evidence (Table 1). First, we considered evidence of purifying selection from gnomAD v.2.1.1. Any dominant associations with high-impact variants in a gene having a probability of loss-of-function intolerance (pLI) >0.9 or with moderate-impact variants in a gene having a Z score >2 were considered to be supported by population genetic metrics of purifying selection. To avoid disadvantaging recessive associations, which are unlikely to leave a detectable signature of purifying selection in gnomAD even if genuine, they were considered to be supported by default. Second, we considered cosegregation data: any association for which variants having a posterior probability of pathogenicity conditional on the modal model >0.8 tracked with case status in at least three additional family members and for which no affected relatives lacked the pertinent variants were considered to be supported by cosegregation. Third, we performed a comprehensive review of the literature for each gene and made a subjective assessment of whether an association was supported by biological function or previously known disease associations for related genes. In total, three genetic associations had a plausibility score of three and were, therefore, investigated further by gathering additional experimental evidence and looking for replication in other sequenced rare disease collections.

### Variants in ERG are responsible for primary lymphoedema

BeviMed identified a dominant genetic association between high-impact variants in ERG and the Specific Disease ‘Primary lymphoedema’, a group of genetic conditions caused by abnormal development of lymphatic vessels or failure of lymphatic function17,18. Three such variants were responsible for the high PPA, with locations ranging from codon 182 to 463 on the canonical Ensembl transcript ENST00000288319.12. One of the probands had two unaffected parents without the variant allele—one sequenced by the 100KGP and the other by Sanger sequencing—suggesting that the truncating heterozygous variant had appeared de novo. A participant in a fourth family who had been enrolled to the 100KGP for an unrelated condition also carried a predicted loss-of-function variant in ERG. Upon manual chart review, this participant had features associated with this unrelated condition but additional features consistent with primary lymphoedema, providing internal replication within the discovery cohort (Fig. 2a).

ERG encodes a critical transcriptional regulator of blood vessel endothelial cell gene expression19 that is essential for normal vascular development20. However, little is known about the contribution of ERG to lymphatic development or how primary lymphoedema could arise from loss-of-function ERG variants that affect different parts of the ERG protein (Fig. 2c). Total cellular expression of ERG detected by real-time quantitative polymerase chain reaction (PCR) in purified RNA and by immunoblotting of protein extracts was the same in primary human dermal lymphatic endothelial cell (HDLECs) as human umbilical vein endothelial cell (HUVEC) (Fig. 2d,e, respectively). Moreover, immunofluorescence microscopy of cultured HDLECs showed that ERG expression colocalized with the lymphatic endothelial cell nuclear marker PROX1 (Fig. 2f), a finding confirmed in vivo by immunostaining whole mounts of ear skin from mice at 3 weeks after birth (Fig. 2g). The positions of the p.S182Afs*22 and p.T224Rfs*15 variants suggest nonsense-mediated decay and haploinsufficiency as a possible disease mechanism. The other two variants, however, are located in the final exon of ERG and may, therefore, evade nonsense-mediated decay. We studied both types of variant in more detail to explore potential disease mechanisms. In HEK293 cells, which do not express endogenous ERG, overexpression of wild-type ERG cDNA recapitulated the nuclear expression pattern observed in the HDLEC and mouse ear skin models. However, overexpression of ERG mutant cDNAs resulted in mislocalization of ERG outside of the nucleus, in the cytosol (Fig. 2h,i and Extended Data Fig. 8), preventing it from binding to DNA and exerting its function as a transcription factor21. Together, these data confirm high levels of ERG expression within the nuclei of the lymphatic endothelium consistent with a transcription regulatory function during lymphangiogenesis. They also suggest that in the primary lymphoedema cases, defective lymphangiogenesis may result from reduced ERG availability in the nucleus because of either haploinsufficiency resulting from nonsense-mediated decay or mislocalization.

### Variants in PMEPA1 result in Loeys–Dietz syndrome

BeviMed identified a dominant genetic association between high-impact variants in PMEPA1 and the Specific Disease ‘Familial thoracic aortic aneurysm disease’ (FTAAD). The variant with the highest conditional probability of pathogenicity was an insertion of one cytosine within a seven-cytosine stretch in the last exon of the canonical Ensembl transcript ENST00000341744.8. This variant, which is predicted to induce a p.S209Qfs*3 frameshift, was observed in three FTAAD pedigrees of European ancestry in the 100KGP discovery cohort. We replicated the association in three additional collections of cases. First, the same variant was identified independently in eight affected members of three pedigrees of Japanese ancestry in a separate Japanese patient group. Second, a single-cytosine deletion within the same polycytosine stretch as the previous variant, and encoding p.S209Afs*61, was found in an FTAAD case enrolled in a separate collection of 2,793 participants in the 100KGP Pilot Programme. Lastly, we identified a family in Belgium wherein the affected members carried a 5-bp deletion in the same stretch of polycytosines inducing a frameshift two residues upstream of the other two variants (p.P207Qfs*3).

All pedigrees exhibited dominant inheritance of aortic aneurysm disease with incomplete penetrance and skeletal features including pectus deformity, scoliosis and arachnodactyly with complete penetrance, which cosegregated with the respective variants in genotyped participants (Fig. 3a). To assess whether PMEPA1 families affected by FTAAD form a phenotypically distinct subgroup, we analyzed the Human Phenotype Ontology (HPO) terms assigned to the 593 FTAAD families in both programs of the 100KGP. Using a permutation-based method22,23 based on the semantic similarity measure of Resnik et al.24, we found that the four 100KGP PMEPA1 families were significantly more similar to each other than to other FTAAD families chosen at random (P = 5.7 × 10−3). To characterize the PMEPA1 phenotype in greater detail, we compared the prevalence of each of the HPO terms in the minimal set of terms present in at least three of the four families with the prevalence in the other FTAAD families. We identified four HPO terms related to the musculoskeletal system that were significantly enriched (Fig. 3b), echoing the phenotypic characteristics of the syndromic aortopathy Loeys–Dietz syndrome25,26.

To understand the molecular mechanisms underlying this defect, we examined the protein–protein interactions27 for PMEPA1 and the complete set of high-confidence genes in the ‘Thoracic aortic aneurysm or dissection’ PanelApp panel. PMEPA1 encodes a negative regulator of transforming growth factor-β (TGFβ) signaling28, a pathway previously implicated in multiple aortopathies, including Loeys–Dietz syndrome29. The genes underlying known forms of Loeys–Dietz syndrome encode part of a tightly interacting subgroup of proteins in the TGFβ pathway, in which there is a direct interaction between the proteins encoded by SMAD2, SMAD3 and PMEPA1 (Fig. 3c). As the two candidate variants occur in the last exon of the transcript, they are likely to evade nonsense-mediated decay30. However, their truncating effects are predicted to remove a PPxY interaction motif while leaving the SMAD interaction motif intact (Fig. 3d), possibly affecting binding between PMEPA1 and SMAD2/3 and altering TGFβ signaling through a gain-of-function mechanism.

### Variants in GPR156 lead to recessive congenital hearing loss

BeviMed identified a recessive genetic association between high-impact variants in GPR156 and the Specific Disease ‘Congenital hearing impairment’. Two high-impact variants in GPR156 were responsible for the strong evidence of association: a 1-bp insertion predicting p.S207Vfs*113 and a 1-bp insertion predicting p.P718Lfs*86 with respect to the canonical Ensembl transcript ENST00000464295.6. One family contained two affected siblings who were both homozygous for the p.S207Vfs*113 variant inherited from heterozygous parents. In a second family, there were also two affected siblings, in this case compound heterozygous for the same p.S207Vfs*113 variant that was maternally inherited and a different p.P718Lfs*86 variant that was paternal. Using GeneMatcher31, we identified a third pedigree from Saudi Arabia with biallelic truncating variants in GPR156. This consanguineous pedigree contained four siblings with hearing impairment, all of whom were homozygous for a variant predicting p.S642Afs*162 (Fig. 4a). The eight affected individuals in these three families all had congenital nonsyndromic bilateral sensorineural hearing loss (see Extended Data Fig. 9 for illustrative audiograms).

GPR156 encodes probable G protein-coupled receptor 156, which has sequence homology to the class C GABAB receptors32. Although previously designated as an orphan receptor, GPR156 has recently been identified as a critical regulator of stereocilia orientation on hair cells of the auditory epithelium and other mechanosensory tissues33. Its expression is highly restricted to hair cells in the inner ear34. Disruption of stereocilia is a common pathogenic mechanism underlying many human Mendelian hearing loss disorders35, and the overexpression of GPR156 in hair cells relative to surrounding cells was commensurate with the overexpression of the 21 genes currently implicated in hearing impairment having a Gene Ontology term relating to stereocilia (Fig. 4b). By immunostaining of the Corti and vestibular system from wild-type mice, we found that GPR156 strongly colocalizes with actin at the apical surface of the outer and inner hair cells of the organ of Corti (Fig. 4c).

The p.S207Vfs*113 variant is located in the sixth of 10 exons of GPR156 and therefore, predicts absent expression through nonsense-mediated decay of the GPR156 mRNA. In contrast, the p.S642Afs*162 and p.P718Lfs*86 variants both occur within the final GPR156 exon and likely result in expression of abnormal GPR156 with an altered amino acid sequence and premature truncation of the cytoplasmic tail (Fig. 4d). To determine the effect of the variants on protein expression, we transfected Cos7 cells, which do not express GPR156 endogenously, with constructs containing cDNAs for wild-type GPR156 or GPR156 containing each of the three mutant alleles, tagged with a green fluorescent protein (GFP) reporter. While cells transfected with wild-type sequence expressed GPR156–GFP fusion protein robustly, cells transfected with the mutant constructs either did not express the protein appreciably or exhibited markedly reduced expression, suggesting that all three of the truncated proteins are degraded (Fig. 4e). These data suggest that the biallelic chain truncating variants in GPR156 cause a congenital hearing loss by preventing expression of GPR156 protein, thereby disrupting stereocilia formation in the auditory epithelium.

## Discussion

The standardization of GS within a health care system, together with powerful frameworks for genetic and phenotypic data processing and statistical analysis, promises to advance the resolution of the remaining unknown etiologies of rare diseases. We have developed a lightweight and easily deployable RDB, the Rareservoir, for genetic analysis of rare diseases using approaches such as BeviMed. In one unified analysis, we identified 260 associations, of which 241 had been published previously in a body of work spanning several decades of genetics research. Our results give an upper bound on the false discovery rate of 7.3%. In contrast, a recent analysis of 57,000 samples in the 100KGP reported 249 known and 579 previously unidentified associations36, giving an upper bound on the false discovery rate of 70%, which suggests that our analytical approach has a greater specificity for a given sensitivity. The associations spanned 86 disease classes across a wide range of organ systems. Interestingly, only 64% of the variants contributing substantially to the known associations were present in the table of clinically reported variants available at the time of this study. This suggests that, as cohorts grow larger, the results of statistical inference could help guide the clinical reporting process. The case sets we used in our genetic association analysis were based on the formal disease classifications used by the 100KGP. Some of the case sets, such as ‘Intellectual disability’ (5,529 probands), are particularly large and likely to be highly genetically heterogeneous, potentially limiting the power of our analyses. Careful partitioning of heterogeneous case sets using individual-level HPO terms6 has the potential to boost power. Of the 19 previously unidentified associations, we shortlisted, replicated and validated three. These three etiologies involve genes that had not previously been implicated in any of these human diseases. The remaining 16 associations include further plausible hypotheses. For example, LRRC7, which we identified to be associated with intellectual disability, encodes a brain-specific protein in postsynaptic densities37, and LRRC7-deficient mice exhibit a neurobehavioral phenotype38. USP33, which we found to be associated with early-onset hypertension, encodes a deubiquitinating enzyme implicated in regulating expression of the β2-adrenergic receptor regulation39. These and other candidates will require replication and validation before they can be considered causative genes.

The present study has several limitations. First, approximately 82% of the participants in the 100KGP are of European ancestry. While this percentage is in line with the proportion of residents in England and Wales reporting their ethnic group as white in the 2011 UK census (86%), its large magnitude constrains power to identify causative variants specific to other ancestry groups. Second, of the 260 case sets analyzed, 28 contained fewer than five probands, limiting power to identify the causes of the corresponding disease classes and highlighting the need for continued enrollment of patients with ultra-rare disorders. Third, we have only considered SNVs and indels in coding genes. The exploration of structural variants and of rare variation in noncoding genes and in regulatory elements of the genome may help identify further etiologies. Lastly, we focused our attention on monogenic models of rare disorders, even though the genetic etiologies of certain rare diseases may be polygenic. In addition, important variation in clinical presentation of monogenic disorders may be explained by polygenic effects. These limitations point toward multiple promising avenues of future research to uncover the remaining unknown genetic determinants of rare diseases.

## Methods

### Motivation for developing a sparse RDB

Computational approaches for discovering the etiologies of rare diseases typically depend on the analysis of a heterogeneous set of files, each of which can be very large and follow a distinct convention. Genotypes, for example, are ordinarily stored in VCFs containing data for one sample or for multiple samples. In the latter case, the data are usually distributed in files covering many different ‘chunks’ of the reference genome. Variant-level information, such as consequence predictions or pathogenicity scores, is typically encoded in strings that require extensive parsing to decode, either from within the VCFs containing the genotypes or in separate files. Modifying genotype or annotation files (for example, to incorporate newly generated data) requires rewriting files in their entirety. Phenotype data, pedigree data and the results of statistical inference are stored in a further set of files. Consequently, analyses are often burdensome to conduct and prone to error. Frameworks, such as Hail7 and OpenCGA8, afford greater flexibility, but they depend on the centrally organized deployment of a distributed storage system, hindering usability and portability.

For each of the 269 rare disease classes (Extended Data Figs. 5 and 6), we applied the BeviMed9 association test to rare variants extracted from the Rareservoir database in each of the 19,663 canonical transcripts belonging to a gene with a ‘protein_coding’ biotype. The analysis was carried out using R 3.6.2, making use of functionality from packages Matrix 1.2–18, dplyr 0.8.5, bit64 0.9–7, bit 1.1–14, DBI 1.1.0, RSQLite 2.1.4 and BeviMed 5.7. The case set for a given disease class and gene was constructed by selecting one case from each pedigree containing at least one person affected with the disease class. For the purposes of the association analysis, participants were labeled ‘explained’ by a given gene if they had variants in that gene classified as ‘pathogenic_variant’ or ‘likely_pathogenic_variant’ in the ‘gmc_exit_questionnaire’ table in the Genomics England Research Environment. To boost power, we used this information to reassign cases that were explained by variants in a different gene to the control group.

RDBs are widely used, mature technologies, well known for their speed, reliability, flexibility, structure and extensibility. In the context of rare diseases, an RDB can in principle render the modification, combination and addition of data on samples, variants, genes and other entities efficient, reliable and straightforward to implement using a single query language. Unfortunately, the performance of RDBs degrades substantially when the number of records in a table reaches several billion, and the number of genotypes in a cohort the size of the 100KGP easily surpasses this threshold. However, the MAFs of pathogenic variants with strong effects on rare disease risk are typically kept below 1/1,000 by negative selection, and the proportion of nonhomozygous reference genotypes for variants within that MAF stratum is only about 1% of the total (Extended Data Fig. 1). Consequently, it is possible to construct a compact RDB that includes virtually all the pathogenic variants even in a large cohort such as the 100KGP. This provides an opportunity for exploiting the benefits of a single unified RDB containing the nonhomozygous genotypes of rare variants upon which to conduct the entirety of the etiological discovery process. Furthermore, it provides a natural foundation for developing web applications for the multidisciplinary review of genetic, phenotypic, statistical and other data.

### Encoding RSVR IDs

SNVs and indels may be encoded as 64-bit integers called RSVR IDs. To compute an RSVR ID for a given variant, the following expression is evaluated:

$$c \times 2^{58} + p \times 2^{30} + \left| r \right| \times 2^{24} + \left| a \right| \times 2^{18} + \mathop {\sum}\limits_{{\mathrm{i}} = 1}^{\left| A \right|} {A_{\mathrm{i}} \times 4^{{\mathrm{i}} - 1}} ,$$

where c is the chromosome number (using 23, 24 and 25, respectively, to represent chromosomes X, Y and MT); p is the position; and |r| and |a| are the lengths of the reference and alternate alleles, respectively. A is a sequence identical to the alternate allele, a, when its length is less than 10 and otherwise, equal to the first five followed by the last four elements of a. In the summation, nucleotides are assigned values A = 0, C = 1, G = 2 and T = 3. The expression evaluates to integers that can be represented using 63 bits, setting the most significant bit to zero when encoding as 64-bit integers. The chromosome, position, reference, alternate allele lengths and alternate allele bases are thereby encoded, respectively, by the subsequent 5, 28, 6, 6 and 18 bits (with 2 bits per base for the alternate allele). This procedure and its inverse are implemented in the ‘rsvr enc’ and ‘rsvr dec’ programs, respectively. The reference and alternate alleles of input variants are first normalized by removing any redundant identical sequence from the starts and then, the ends. The proportion of variants in gnomAD 3.0 weighted by allele count that can be encoded losslessly is 99.3%, while 99.8% can be represented by a distinct RSVR ID. The full variant information corresponding to any encountered ambiguous RSVR ID may be stored in full in a additional table. Structural variants that can be represented by a position and length may also be encoded using distinct 64-bit RSVR IDs alongside SNVs and indels by setting the most significant bit to one and subsequently, encoding the type of structural variant using 2 bits (deletion 0, duplication 1, inversion 2, insertion 3), the chromosome using 5 bits (as done for SNVs and indels), and the start and length consecutively using 28 bits.

### Genetic association analysis of 100KGP data

We constructed a Rareservoir in the Genomics England Research Environment containing the PASSing49 variants in the merged VCF of 77,539 consented participants in the 100KGP Rare Diseases Programme. This Rareservoir only included variants with a PMAF > 0 according to GnomAD v.3.0, an internal MAF < 0.002 and at least one predicted consequence on a canonical transcript in Ensembl v.104. Variants with a greater MAF are unlikely to be highly penetrant for diseases eligible for inclusion in the 100KGP and are likely to have, at most, small effects on risk, making them challenging to validate. Variants with a median genotype quality <35 and SNVs with a CADD Phred score <10 were also excluded from the analyses.

Using BeviMed, we performed a Bayesian comparison of a baseline model of no association and each of six association models defined by an MOI and a class of etiological variant.

1. (1)

No association (prior probability 0.99)

2. (2)

Dominant association with ‘high’-impact variants having a PMAF ≥ 2 (that is, corresponding to a target MAF < 0.01%; prior probability 0.002475)

3. (3)

Dominant association with ‘moderate’-impact variants having a PMAF ≥ 2 (prior probability 0.002475)

4. (4)

Dominant association with ‘5′ UTR’ variants having a PMAF ≥ 2 (prior probability 0.00005)

5. (5)

Recessive association with ‘high’-impact variants having a PMAF ≥ 1 (that is, corresponding to a target MAF < 0.1%; prior probability 0.002475)

6. (6)

Recessive association with ‘moderate’-impact variants having a PMAF ≥ 1 (prior probability 0.002475)

7. (7)

Recessive association with 5′ UTR variants having a PMAF ≥ 1 (prior probability 0.00005)

Thus, the overall prior probability of association was 0.01, and there was an equal prior probability of dominant and recessive inheritance. The PPA was the sum of the posterior probabilities of models 2–7. We imposed a stricter PMAF threshold under a dominant MOI than under a recessive MOI because ceteris paribus, dominant variants are under stronger negative selection than recessive variants. The three groups of variants were selected as follows.

• 5′ UTR variants: those with a 5_prime_UTR_variant consequence

• High-impact variants: those with any consequence amongst start_lost, stop_lost, frameshift_variant, stop_gained, splice_donor_variant or splice_acceptor_variant, excluding variants with a ‘low-confidence’ LOFTEE score10

• Moderate-impact variants: those with any consequence amongst start_lost, stop_lost, frameshift_variant, stop_gained, splice_donor_variant or splice_acceptor_variant, missense_variant or inframe_deletion

The rationale for embedding variants from the high-impact class in the moderate-impact class is that both types of variant are capable of inducing a loss of function. The prior on the probability that a modeled rare variant is pathogenic, conditional on either the association model mediated by 5′ UTR variants or the association model mediated by moderate-impact variants, was set to Beta(2,8). This encodes a prior conditional expectation that 20% of rare variants are pathogenic, which is well suited to missense and 5′ UTR variants. However, we specified a distribution with a greater mean for the high-impact models. Specifically, the prior on the probability that a modeled high-impact variant is pathogenic was set to Beta(3,1), which reflects a prior conditional expectation that 75% of rare variants are pathogenic because loss-of-function variants tend to be functionally equivalent to each other. BeviMed reports the posterior probability that each variant is pathogenic conditional on the MOI and the class of etiological variant. The methodology is described in further detail in the original BeviMed publication9.

We applied the following postprocessing of BeviMed results with a PPA > 0.95.

• We reran BeviMed including all samples (that is, with relatives of cases and controls). Associations for which the analysis with all samples caused the PPA to fall below 0.9 were filtered out due to conflicting evidence for the association within families.

• We reran BeviMed after removing variants absent from affected relatives of the cases. Associations for which this removal caused the PPA to drop below 0.25 were filtered out because they depended on variants that were not shared by affected cases within families.

• To guard against false positives due to incorrect pedigree data, population structure or cryptic relatedness, we applied the following algorithm. We obtained the distribution of the number of rare variants in the Rareservoir shared by pairs of individuals within each assigned ancestry in the 100KGP. The top percentile in each of these distributions was used to indicate potential relatedness between participants in the same population. We reran BeviMed after removing cases so as to ensure that no more than one case from any set of potentially related cases sharing a variant was included in the analysis. Associations for which this analysis caused the PPA to fall below 0.25 were filtered out.

To account for correlation between case sets, for each gene, we removed all but the most strongly associated disease class within each disease group before reporting the 260 associations remaining. Without the postprocessing, the number of reported associations would have been 302. Conditional on the modal model underlying each of the 260 associations, we recorded the variants with a posterior probability of pathogenicity >0.8 accounting for at least one case in the 100KGP.

### PanelApp annotation

Significant associations were colored according to PanelApp14 (Fig. 1b) evidence levels for panel–gene relations (green for high evidence, amber for moderate evidence and red for low evidence) for panels of type ‘Rare Disease 100K’, which are organized hierarchically by Disease Sub Group and Disease Group, or of type ‘GMS Rare Disease’. Given an association between a gene and a case set (corresponding either to a Specific Disease or to a Disease Sub Group), we searched for panels that contained the gene and had the same name as the case set (ignoring case). If such a match was not found, we searched for panels that contained the gene and that belonged to a Disease Sub Group with the same name as the Disease Sub Group of the case set. When this matching rule generated multiple matches, we selected the panel(s) with the highest evidence. If multiple panels still remained, we selected the panel with the smallest number of genes. Associations for which no matching panel in PanelApp could be found were inspected manually to assess whether PanelApp contained an alternative suitable panel (marked with an asterisk in Fig. 1b).

### Shortlisting previously unidentified genetic associations for validation

Several sources of independent evidence were used to shortlist significant associations for validation. For each source, a score of one was awarded if the evidence was supportive and zero otherwise. Scores were then added over the different sources and used to rank the associations. Associations for which at least three sources of evidence were supportive were taken forward for further investigation. The sources of evidence and qualifying criteria for being considered supportive are listed below. Note that here we refer to variants that had a probability of pathogenicity >0.8 conditional on the modal model as ‘probably pathogenic’.

• Counting cosegregating pedigree members. The pedigrees harboring pathogenic configurations of probably pathogenic alleles were checked for cosegregation between genotype and affection status. This evidence counted as supportive for associations for which all such pedigrees demonstrated cosegregation and there were at least three additional relatives who had not been included in the association analysis but for whom there was cosegregation. Note that Binary Alignment and Map (BAM) files for the affected members of pedigrees who were called homozygous reference for probably pathogenic variants were checked for evidence of mosaicism to guard against the possibility that they were falsely portraying a lack of cosegregation.

• pLI and Z scores. pLI and Z scores for depletion of missense variants were obtained from the gnomAD v.2.2.1 browser10. pLI > 0.9 for associations in which high-impact variants were most strongly associated was counted as supportive, whilst Z scores >2 for associations in which moderate-impact variants were most strongly associated were counted as supportive.

• Recessive association. Population genetic metrics of purifying selection (pLI scores and Z scores) are sensitive to depletion of high-impact variants and missense variants, respectively. They are, therefore, useful measures to corroborate dominant associations. However, these metrics have low sensitivity to identify the signatures of selection against recessive diseases because isolated pathogenic variants in heterozygous form do not lead to a reduction in reproductive fitness. To avoid disadvantaging recessive associations identified by BeviMed, they were assigned a contribution of one point to the score.

• Literature review. A comprehensive literature review assessing the gene’s role (if any) in biological processes relevant to the disease, other diseases and a survey of model organisms was undertaken and determined to be either supportive or not.

### ERG: primary endothelial cell culture

Single-donor primary HDLECs (Promocell) were cultured in Endothelial Cell Growth Medium MV2 (Promocell). Pooled donor HUVECs (Lonza) were grown in Endothelial Cell Growth Media-2 (Lonza). HUVECs and HDLECs were grown on 1% (vol/vol) gelatin and used between passages 3 and 5.

### ERG: real-time PCR

HUVECs and HDLECs were grown to confluency in a pregelatinized six-well dish. Total RNA was isolated using the RNeasy Mini Kit (Qiagen), and 1 µg of total RNA was transcribed into cDNA using Superscript III Reverse Transcriptase (Thermo Fisher Scientific). Quantitative real-time PCR was performed using PerfCTa SYBR Green FastMix (Quanta Biosciences) on a Bio-Rad CFX96 System. Gene expression values of ERG in HUVECs and HDLECs were normalized to GAPDH expression and compared using the ΔΔCT method. The following oligonucleotides were used: ERG, 5′-GGAGTGGGCGGTGAAAGA-3′ and 5′-AAGGATGTCGGCGTTGTAGC-3′; GAPDH, 5′-CAAGGTCATCCATGACAACTTTG-3′ and 5′-GGGCCATCCACAGTCTTCTG-3′.

### ERG: immunoblotting analysis

Immunoblotting was performed according to standard conditions. Proteins were labeled with the following primary antibodies: rabbit anti-human ERG antibody (1:1,000; ab133264; Abcam) and mouse anti-human GAPDH (1:10,000; MAB374; Millipore). Primary antibodies were detected using fluorescently labeled secondary antibodies: goat anti-rabbit IgG DyLight 680 and goat anti-mouse IgG Dylight 800 (Thermo Scientific). Detection of fluorescence intensity was performed using an Odyssey CLx imaging system (Li-COR Biosciences, Lincoln) and Odyssey v.4 software.

### ERG: immunofluorescence analysis of endothelial cells and mouse tissues

Confluent cultures of HUVECs and HDLECs were fixed with 4% (wt/vol) paraformaldehyde for 15 min and permeabilized with 0.5% (vol/vol) Triton X-100 before incubation with 3% BSA (wt/vol) in phosphate buffered saline (PBS) containing the following primary antibodies: goat anti-human PROX1 antibody (1:100; AF2727; R&D Systems), rabbit anti-human ERG antibody (1:100; ab92513; Abcam) and mouse anti-human VE-cadherin (1:100; 555661; BD Biosciences). Secondary antibody incubation was carried out in 3% BSA (wt/vol) in PBS using the following antibodies: donkey anti-goat IgG Alexa Fluor-488 (1:1,000; A-11055), donkey anti-rabbit IgG Alexa Fluor-555 (1:1,000; A-31572) and donkey anti-mouse Alexa Fluor-594 (1:1,000; A-21203). All secondary antibodies were from Thermo Fisher Scientific. Nuclei were visualized using DAPI (Sigma-Aldrich). Confocal microscopy was carried out on a Carl Zeiss LSM 780 confocal laser scanning microscope with Zen 3.2 software. All animal experiments were conducted with ethical approval from Imperial College London under UK Home Office Project Licence PEDBB1586 in compliance with the UK Animals (Scientific Procedures) Act of 1986. Ear tissue was collected from euthanized 3-week-old male and female C57BL/6J mice and fixed in 4% (wt/vol) paraformaldehyde at room temperature for 2 h. Tissue was then washed with PBS followed by a blocking and permeabilization step using 3% (wt/vol) milk in phosphate-buffered saline solution containing 0.3% (vol/vol) Triton X-100 (PBST) for 1 h at room temperature. The following primary antibodies were used for immunofluorescence staining: goat anti-human PROX1 antibody (1:100; AF2727; R&D Systems) and rabbit anti-human ERG antibody (1:100; ab92513; Abcam). Primary antibodies were incubated at 4 °C overnight in 3% (wt/vol) milk in PBST. The following day, tissues were washed three times with PBST over the course of 2 h at room temperature. Tissues were incubated with secondary antibodies at room temperature for 2 h in 3% milk (wt/vol) in PBST. Primary antibodies were detected using fluorescently labeled secondary antibodies: donkey anti-goat IgG Alexa Fluor-488 (1:400; A-11055; Thermo Fisher Scientific) and donkey anti-rabbit IgG Alexa Fluor-555 (A-31572; Thermo Fisher Scientific). Stained samples were mounted onto glass slides using Fluoromount G (Thermo Fisher Scientific). Images were acquired using a Zeiss LSM 780 confocal laser scanning microscope with Zen v.3.2 software. All confocal images represent maximum intensity projection of Z stacks of single tiles.

### ERG: subcloning and overexpression in HEK293 cells

We subcloned ERG (ENST00000288319.12) from HUVECs into the mammalian expression vector pcDNA3.1 (Thermo Fisher). ERG variants were generated by site-directed mutagenesis using the Quikchange Lightning kit (Agilent) using the wild-type ERG cDNA as the template. Expression of wild-type and mutant ERG was carried out using polyethylenimine (Sigma-Aldrich) transfection reagent in HEK293 cells grown in Dulbecco’s Modified Eagle Medium (DMEM) (Thermo Fisher) with 10% (vol/vol) FBS. After 24 h, cells were fixed with 4% (wt/vol) paraformaldehyde for 15 min and permeabilized with 0.5% (vol/vol) Triton X-100 before incubation with 3% BSA (wt/vol) in PBS containing mouse monoclonal anti-ERG antibody (1:100; sc-376293; Santa Cruz Biotechnology). Secondary antibody incubation was carried out in 3% BSA (wt/vol) in PBS using donkey anti-mouse Alexa Fluor-488 (1:1,000; A-21202; Thermo Fisher). Nuclei were visualized using DAPI (Sigma-Aldrich). Confocal microscopy was carried out on a Carl Zeiss LSM780 confocal laser scanning microscope with Zen 3.2 software.

### ERG: estimation of nuclear and nonnuclear ERG in HEK293 cells

Each image was read into a pair of channel-specific 1,024 × 1,024 matrices in R v.4.2.1 using the readCzi function from the readCzi R package v.0.2.0. A pixel was declared to contain a nuclear region if the intensity in the blue channel exceeded 60% of the 95th percentile of blue intensities across all pixels above background (identified as exceeding 1.35 × 10−2 by visual inspection of bimodal intensity histograms). A pixel was declared to contain ERG if the intensity in the green channel exceeded 30% of the 95th percentile of the green intensities within the pixels previously declared to be nuclear. To fill in intranuclear gaps, any nonnuclear pixels adjacent to at least five nuclear pixels were declared nuclear. The estimated proportion of ERG that was cytosolic in an image was set to the number of ERG pixels that did not overlap nuclear pixels divided by the number of ERG pixels.

### GPR156: western blots

We subcloned GPR156 from human brain cDNA into EGFP-N2 vector. The three mutant GPR156 constructs were generated by mutagenesis using the QuickChange kit (Stratagene) and wild-type GPR156–GFP as a template. For expression analysis, the wild type and mutant constructs were transfected in COS7 cells grown in DMEM (Gibco) with 10% FBS. Transfections were performed with Lipofectamine 2000 reagent (Life Technologies). Cells were harvested 48 h after transfection; lysed in buffer containing 1% 3-[(3-cholamidopropyl)dimethylammonio]-1-propane sulfonate (CHAPS), 100 mM NaCl and 25 mM N-2-hydroxyethylpiperazine-N-2-ethane sulfonic acid (HEPES), pH 7.4; and clarified by centrifugation at 18,407g. Lysates (20 μg) were run on a 4–20% sodium dodecyl sulfate–polyacrylamide gel electrophoresis (SDS-PAGE) gel. The membrane was blocked with 5% milk, incubated with anti-GPR156 (1:200) and developed with horseradish peroxidase (HRP)-conjugated secondary (sheep anti-rabbit) antibody (1:1,000). Comparable loading was checked by stripping and reprobing the blots with anti-GAPDH (1:500) antibody (Santa Cruz Biotechnology).

### GPR156: whole-mount immunostaining of GPR156 in mouse inner ears

All the animal work was approved by the University of Maryland, Baltimore Institutional Animal Care and use Committee (IACUC 420002). Inner ears were dissected from C57BL/6J mice with a postnatal age of 10 days and fixed in 4% paraformaldehyde in PBS overnight. For whole-mount immunostaining, the cochleae were microdissected and were subjected to blocking for 1 h with 10% normal goat serum in PBS containing 0.25% Triton X-100, followed by overnight incubation at 4 °C with anti-GPR156 antibody (1:200; PA5-23857; Thermo Fisher) in 3% normal goat serum with PBS. F-actin was decorated using phalloidin (1:300). Confocal images were acquired from a Zeiss LSM710 confocal microscope, and images were processed using ImageJ v.1.53t software.

### Ethics

The 100,000 Genomes project was approved by East of England–Cambridge Central Research Ethics Committee ref:20/EE/0035. Only participants who provided written informed consent for their data to be used for research were included in the analyses. The study at the University of Maryland was approved by the institutional review board (RAC no. 2100001), and written informed consent was obtained by clinicians at King Faisal Hospital in Saudi Arabia from the participating individuals. The study of the Japanese ancestry pedigrees bearing PMEPA1 truncating alleles was approved by the Institutional Review Board of the National Cerebral and Cardiovascular Centre (M14-020) and Sakakibara Heart Institute (16–035), and written informed consent was obtained from the participating individuals.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.