Disrupting upstream translation in mRNAs leads to loss-of-function associated with human disease

Ribosome-profiling has uncovered pervasive translation in 5’UTRs, however the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5’UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate new gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals respectively, and demonstrate their impact on gene expression in human cells. Our results establish new mechanisms relating uORF variation to loss-of-function of downstream genes, and demonstrate that translated uORFs are genetically constrained regulatory elements in 40% of human genes.


INTRODUCTION
The classic view of information processing in the cell by gene expression occurs through transcription and translation. This basic flow is often complicated by regulatory elements which confer additional stages of processing and control. In particular, upstream open reading frames (uORFs) are segments of 5'UTR mRNA sequences that can initiate and terminate translation upstream of protein-coding start codons. Specific uORFs are known to control gene expression by tuning translation rates of downstream protein-coding sequences, and potential uORFs have been identified in ~50% of all human protein-coding genes 1,2 . Previous analyses of large-scale population data have shown that genetic variants creating new uORFs are rare, suggesting that these variants are subjected to strong negative selection due to their capacity to cause pathogenic loss-of-function of associated proteins 2,3 . In contrast, less is known about the impact of genetic variation affecting translated uORFs. Recent untargeted ribosome-profiling experiments have revealed striking evidence of active translation at thousands of uORFs throughout the genome, but the biological significance of this phenomenon remains unresolved.
Here we use translated uORFs mapped through ribosome-profiling experiments, and a deep catalogue of human genetic variation to characterize patterns of selection acting on single nucleotide variants (SNVs) in translated uORF sequences. We assess evidence for the functional importance of translation at uORFs, and explore possible phenotypic consequences associated with genetic variation in these sequences. Using the allele frequency spectrum of SNVs from 71,702 whole genome sequences in gnomAD, we find that SNVs that introduce new stop codons, or create stronger translation termination signals in uORFs are under strong selective constraints within 5'UTRs. We propose that these variants are under selective pressure because they disrupt translation initiation at downstream protein-coding sequences.
We then utilize the Penn Medicine Biobank (PMBB) to discover new, robust disease-gene associations using these predicted uORF-disrupting variants, and replicate these associations in the UK Biobank (UKB), and by gene burden tests that aggregate rare protein-coding loss-of-function variants. Finally we validate the impact of these uORF variants on gene expression for our top phenome-wide significant associations with diabetes and anxiety disorders in 5'UTRs from PMVK and VPS53 respectively. These data demonstrate that SNVs in translated uORFs that create new stop codons, or strengthen existing stop codons can contribute to disease pathology by changing gene expression.

Variants introducing new stop codons in uORFs are under strong negative selection
Translation initiation is the rate-limiting step controlling post-transcriptional gene expression 4 , and rates of translation initiation can significantly impact mRNA stability [5][6][7][8][9] . Cap-dependent translation initiation begins when the 40s ribosomal subunit encounters a start codon as it scans along the 5'UTR. At the start codon, 40s subunit acquires the 60s subunit with other translation initiation factors and peptide synthesis begins. Scanning ribosomes encountering uORFs may prematurely initiate translation in the 5'UTR; if this occurs, upon reaching the uORF termination codon the ribosome may dissociate from the mRNA transcript, or the 40s subunit may resume scanning after the 60s subunit is lost. Resumption of scanning leads to translation of downstream reading frames only if the necessary translation initiation factors are reacquired by the 40s subunit before reaching the downstream start codon. Thus, the spatial combination of uORFs and protein-coding start codons can produce different effects on expression of the downstream gene. Since elongating ribosomes must translate uORFs before they reinitiate translation at the CDS, we hypothesized that genetic variants introducing new stop codons in translated uORFs could impede downstream translation initiation. Since these variants interrupt translation without affecting the coding sequence directly, we term them upstream termination codons (UTCs) to distinguish them from premature termination codons within protein-coding sequences.
To estimate the deleteriousness of UTC-introducing SNVs, we assessed their frequency spectrum in gnomAD using the Mutability-Adjusted Proportion of Singletons (MAPS) metric.
MAPS compares the strength of selection acting against different classes of functional variation by determining the relative enrichment for rare singleton (one sequenced allele) variants within the gnomAD database, adjusted for local mutation rates (see Methods ). More deleterious groups of SNVs -including premature termination codons and essential splice site mutations -show greater enrichment in singletons in gnomAD, and consequently have higher MAPS scores. MAPS has previously been used to assess patterns of selective pressures acting on different classes of variation in both protein-coding and non-coding regions of the genome 3,10-14 .
Using translated uORFs from 4392 genes identified by deep ribosome profiling of two human cell lines ( Suppl. Fig. 1 ) 15 , we mapped genetic variation from 71,702 whole-genome sequences in gnomAD (version 3) 11 . We identified the subset of SNVs creating UTCs by selecting those which mutated uORF codons to either TGA, TAG, or TAA in the mapped uORF reading frame ( Figure 1a ). We calculated MAPS scores for these UTC-creating SNVs, finding that they are under strong negative selection within 5'UTRs, comparable to that of missense mutations in protein-coding regions of the genome ( Figure 1b- Fig. 1 ). Intriguingly, MAPS scores were highest for variants predicted to introduce strong (TAA) stop codons that are less susceptible to translational read-through [16][17][18] . The strong selective pressure to remove UTC-creating SNVs implies that these variants are also more likely to have functional biological consequences.
Crucially, we find that the enrichment in rare singleton variants appears to depend on the relationship between the new UTC-creating mutation and the CDS start codon. A subset of translated uORFs are known to overlap out-of-frame with their respective downstream coding sequences. These uORFs can repress translation of the downstream CDS by obscuring the protein-coding start codon 19,20 . Since uORF-mediated repression is contingent on the uORF overlapping out-of-frame with the CDS start codon, we hypothesized that UTC-introducing variants maintaining the uORF-CDS overlap would not exhibit similar MAPS scores to those that create new stop codons upstream of the CDS start. Indeed MAPS scores for UTCs that abolish the uORF-CDS overlap are significantly higher compared to all uORF-variants (P = 0.038), while MAPS scores for UTCs that preserve the overlap between the uORF reading frame and downstream CDS are indistinguishable from all uORF SNVs (P=0.4984, Figure 1b-iii ). These findings are consistent with the hypothesis that selective pressures against UTC-introducing variants are restricted to those that have the capacity to disrupt downstream translation initiation since the enrichment in singletons is observed only when a UTC is introduced proximal to the CDS start codon.

Translated uORFs use weak stop codons
Stop codons have different translation termination efficiencies in both prokaryotes and eukaryotes, with the hierarchy following the general pattern of TAA > TAG > TGA 16,21,22 . Given the observed selection against UTCs in translated uORFs, and in particular against TAA-introducing variants, we next asked whether stop codon usage by translated uORFs is distinct from the background distribution of TGA, TAG, and TAA trinucleotides in 5'UTRs. To perform this comparison, we determined the relative frequency that TGA, TAG, or TAA trinucleotide sequences appeared within non-translated 5'UTR sequences, and compared this frequency to the distribution of stop codons used in translated uORFs. To further control for the possibility that translated-uORF containing UTRs might have significantly different background nucleotide distributions, we also assessed the relative frequency of TGA, TAG, or TAA trinucleotides from uORF-containing UTRs with translated uORF sequences excluded.
Strikingly, we find that translated uORF stop codons are significantly depleted of TAAs compared to background UTR distributions ( Figure 1c ) Given the depletion of TAA-stop codons in translated uORFs, we next asked whether variants changing weaker stop codons (TGA, TAG) to TAA were also enriched for singletons. Compared to synonymous and missense variation within the protein-coding genome, we find that the MAPS metric for stop-strengthening variants is significantly higher ( Figure 1b-ii ). This difference remained significant compared to uORF variants matched by trinucleotide context, indicating that this effect is specific to uORF stop codons (P=0.012, Figure 1b-ii ). Given that TAA codons can facilitate greater termination efficiency and more rapid ribosomal dissociation from mRNAs compared to TAG and TGA codons 16,23,24 , these results are consistent with the possibility that stronger stop codons in uORFs can also increase the efficiency of translation termination in the 5'UTR. Thus, like UTCs, stronger stop codons in uORFs may be disfavored because they decrease the probability that ribosomes reinitiate translation at downstream coding sequences.

Upstream open reading frames are not under selection to maintain amino acid identity
Multiple transcriptome-wide ribosome profiling studies have proposed that some uORFs can encode functional micropeptides with important cellular roles 15,26,27 . This has fostered significant interest in the possibility that translated, non-canonical ORFs represent an overlooked class of potentially functional micropeptides with biological functions independent of the downstream protein-coding sequences 27,28 . While previous genome-wide approaches to assess uORF coding potential use metrics measuring evolutionary conservation across species, using human variation data directly allows us to capture potential lineage-specific constraints that may have been absent from cross-species analyses. Moreover, the pattern of constraint against UTC-introducing variants might also reflect selection to preserve putative micropeptide functionality. To address this possibility, we asked whether uORFs broadly exhibit similar constraints against missense variation as known protein-coding regions of the genome that would imply peptide functionality. We compared MAPS scores for predicted missense versus synonymous variants in translated uORFs to those in canonical protein-coding regions of the genome ( Figure 2a ). The frequency spectrum for missense variants in uORFs were significantly lower than that of missense variants in canonical protein-coding regions of the genome, and not significantly higher than MAPS scores for synonymous variants in translated uORFs (P=0.7118, Figure 2a-iv ). These results indicate that selection to maintain amino acid identity in uORF-encoded micropeptides is weak compared to canonical protein-coding sequences. As an additional control, we computed MAPS scores for predicted missense and synonymous SNVs in 693, 1188, and 276 translated non-canonical ORFs (ncORFs) mapped by ribosome profiling in 3'UTRs (dORFs), long-noncoding RNAs, and pseudogenes respectively, as these sequences are not thought to broadly encode for functional peptides. Similar to uORFs, predicted missense variants in these additional ncORFs were not significantly higher than predicted synonymous variants by MAPS score (dORFs P=0.3532; lncRNAs P=0.7777, pseudogenes P=0.4523 Since many translated uORFs are short, we asked whether longer uORFs might exhibit greater selection against missense variants compared to shorter uORFs. To test this possibility, we divided uORFs into long sequences >118 codons comprising the top 25% longest mapped uORFs, and short uORFs <118 codons in length. MAPS scores for missense variants in long versus short uORFs yielded no evidence of significant constraint acting on amino-acid changing variants compared to synonymous SNVs (long uORFs P=0.178, short uORFs P=0.9628, Figure   2a-v ).
Surprisingly, we observed that MAPS scores for both synonymous and missense variants in translated uORFs deviated significantly from all 5'UTR variation, implying that uORF variants are generally under a heightened degree of negative selection compared to synonymous variants in protein-coding sequences ( Figure 2a-iv ). The absence of similar effects for variants in dORFs, lncRNAs, or translated pseudogenes implies that this enrichment in singletons is unique to translated uORFs. One possibility is that synonymous variation in uORFs reflect selective pressures to maintain transitional efficiency by preserving codon optimality. Messenger RNAs that are enriched with more optimal codons are both more stable, and more efficiently translated by ribosomes 29  To test whether mutations in translated uORFs are constrained to maintain codon optimality, we asked if MAPS scores for mutations predicted to decrease codon optimality differed from those that increased codon optimality ( Figure 2b ). Using experimentally-determined codon-stability coefficients (CSCs) 35 , we matched each uORF SNV with its predicted consequence to codon optimality, and compared MAPS scores for optimality-increasing versus optimality-decreasing SNVs. As expected, SNVs increasing codon optimality were indistinguishable from all 5'UTR variants (P=0.1929, Figure 2c ). In contrast, variants predicted to decrease codon optimality had significantly higher MAPS scores (P<0.001), although the magnitude of this difference is moderate compared to UTC-introducing variants ( Figure 1b ). This effect remained significant regardless of whether variants were predicted to cause synonymous or missense mutations (P=0.0125 for synonymous; P=0.009 for missense), and was notably absent for translated ORFs in 3'UTRs, lncRNAs, and pseudogenes ( Figure 2c, Suppl. Figure 2 ). Furthermore, this pattern of increased constraint against optimality-decreasing mutations was robust to the use of CSC scores derived from alternative experimental approaches across several cell lines ( Suppl.

uORF start codons are conserved and under strong selective pressure
The finding of heightened selection against translation-interrupting variants in uORFs raises the question of why translated uORFs continue to persist in a large fraction of human genes.
Evidence that uORF-CDS organization, and the strength of uORF repression is strongly conserved across vertebrates, suggests that translation at uORFs is maintained to regulate downstream translation initiation 19 . To provide further genetic evidence that translation at uORFs is maintained by selection, we asked whether allele frequencies for variants affecting uORF start codons also exhibited strong selection to maintain their capacity for translation initiation. Using the MAPS metric, and genome-wide phyloP scores, we evaluated patterns of variation affecting uORF start codons. Since many translated uORFs begin with non-canonical start codons ( Figure 3a-i ), we distinguish between variants maintaining the start context by affecting the first position of the NTG trinucleotide from those that disrupt translation initiation by mutating the last two nucleotides in the uORF start codon ( Figure 3a-ii ). As expected, start-maintaining variants are no more enriched for singletons in gnomAD compared to synonymous protein coding variants. In contrast, start-disrupting SNVs are enriched for singletons at a level comparable to that of protein-coding missense SNVs, and SNVs introducing UTCs ( Figure 3b ). The heightened pressure to maintain translational initiation at uORF start codons is similarly reflected in phyloP scores for uORF start-disrupting genomic positions compared to distance-matched UTR controls (P < 0.001), and uORF-matched controls (P < 0.001, Figure 3c ). These data show that translation initiation at uORFs is evolutionarily constrained in humans, and are consistent with previous reports that uORF start codons are frequently conserved across species.
Taken together, our analyses of genetic variation in gnomAD show enrichment for rare allele frequencies in the frequency spectra of uORF start-disrupting, UTC-introducing, and stop-strengthening variants. Results from our analyses indicate that these classes of variation are under a heightened degree of negative selection, and imply that processes of translation initiation, elongation, and termination at translated uORFs are maintained by selective pressure.

uORF-disrupting variants associate genes with new disease phenotypes
The heightened MAPS score for UTC-introducing variants suggests that they are also likely to be functional. To explore the possibility that uORF UTC and stop-strengthening variants might contribute functionally to human disease susceptibility, we performed a phenome-wide association study (PheWAS) of predicted uORF-disrupting variants using the Penn Medicine Biobank (PMBB) -a large academic biobank with exome sequencing linked to EHR data for 10,900 individuals 36 .
Using exome sequencing from the PMBB, we identified heterozygous and homozygous individuals carrying mutations in uORFs which introduce upstream termination codons and stop-strengthening mutations. For the latter class, we focused on variants that introduced TAA stop codons, as the heightened MAPS score for such variants implied these mutations would be most deleterious. Filtering for variants with at least 5 heterozygous carriers with high-quality genotype, we identified 10 variants matching the above criteria (6 stop Table 1 ).

Replication of novel associations in UK Biobank
Out of six novel associations reaching FDR < 0.1, two showed P < 0.05 in UK Biobank (UKB) with consistent direction of effect, validating and further strengthening the significance of our previous results. Of these replicated associations, a stop-strengthening variant in BCL2L13 reached study-wide significance. For the remaining putative novel associations, the VPS53 uORF stop-strengthening variant did not replicate, although the direction of effect is consistent with results from the PMBB. For non-replicated associations, variants in NALCN and SHMT2 could not be replicated because there were fewer than 20 cases in the UKB cohort, and MOAP1 could not be replicated because this variant was absent from the UKB. Overall, single-variant analysis in the UK Biobank confirmed novel associations in PMVK and BCL2L13 respectively.

Disease-associated uORF variants change expression
To elucidate the possible biological consequences of UTC and stop-strengthening mutations, we selected our top two PheWAS association signals for functional assessment. To determine if these variants could affect gene expression, we measured the expression of a set of dual-luciferase reporters in HEK293T cells for PMVK , and VPS53 uORF variants. We compared the expression of the wild-type 5'UTR sequence for PMVK and VPS53 cloned upstream of a Firefly Luciferase ORF to two variant sequences -one with the uORF start codon removed, and a second sequence with the PheWAS-significant stop-strengthening mutation inserted. For VPS53 , we also tested the effect of a mutation changing a tryptophan TGG codon to a TAG upstream termination codon ( Figure 5b ). Across all constructs, we observed a significant reduction in expression of the downstream ORF when the PheWAS-significant stop-strengthening mutation was introduced ( Figure 5 ). Introducing a new UTC in the 5'UTR uORF of VPS53 also significantly reduced reporter gene expression relative to the wild-type sequence. Similar results were obtained from assays performed in HeLa cells ( Suppl. Fig. 9 ).
In all the tested constructs, UTC and stop-strengthening variants decreased relative Firefly expression. These data are consistent with the hypothesis that UTC or stop-strengthening variants are under negative selection because they decrease the probability of translational initiation at downstream coding sequences. These results are congruent with our genetic analysis, and imply that UTC-introducing and stop-strengthening variants represent a new class functional variation in 5'UTRs capable of causing loss-of-function of downstream coding genes.

Replication of novel associations by loss-of-function gene-burden studies
Results from reporter-gene experiments showed that UTCs and stop-strengthening variants could decrease expression of the downstream protein for PMVK , and VPS53 . Our findings implied that uORF UTC and stop-strengthening variants cause phenotypic consequences through potential loss-of-function of the downstream protein-coding gene. To further validate this hypothesis, we performed a gene burden test by aggregating rare loss-of-function protein-coding variants in the PMBB and UKB for each significant uORF-PheWAS association.
These studies could confirm that predicted loss-of-function in the protein coding sequence of the uORF-regulated gene causes the same phenotype as the uORF UTC or stop-strengthening variants. Indeed, similar loss-of-function gene burden approaches using rare protein-coding variants have successfully been applied to identify both known and new gene-disease associations in studies that utilized these two datasets 36,39 .
Of six PheWAS-significant associations uncovered in our discovery analysis (FDR<0.1), two associations were replicated by an independent loss-of-function gene burden test in either the UKB or PMBB. The associations between PMVK and diabetes, and SHMT2 and diseases of the salivary gland, were replicated in the UKB and PMBB respectively ( PMVK P=0.00727, SHMT2 P=0.005515, Table 1 ). Although no significant LOF-burden association for PMVK was replicated in the PMBB, predicted loss-of-function of PMVK was nominally associated with impaired fasting glucose (P=0.0235). A second uORF-disease association was replicated for NALCN and the parent PheCode of disorders of plasma protein metabolism in the UKB (P=0.0264).
Gene-disease associations for BCL2L13 could not be replicated in either the PMBB or UKB due to lack of carriers for predicted loss-of-function variants. Ultimately this analysis confirmed that loss-of-function gene burden tests using protein-coding variants are associated with the same phenotype for two uORF stop-strengthening mutations. This evidence of allelic heterogeneity for these phenotypes further strengthens the likelihood that uORF stop-strengthening variants can cause loss-of-function of downstream protein-coding genes.  HMGCR activity contributes to diabetes pathogenesis. Our data is the first to establish a putative link between PMVK and diabetes. Given the shared involvement of PMVK and HMGCR genes in the mevalonate pathway, it is possible that genetic variants in both these genes confer an increased risk of diabetes through a similar mechanism, however further studies will be needed to further elucidate the precise relationship between PMVK and diabetes.

DISCUSSION
A limitation of our analysis is that we cannot directly assess the impact of additional factors on uORF-mediated translational regulation. Specifically the strength of the uORF start codon 53 , intercistronic distance between the uORF stop codon and downstream coding gene 54 , and additional contributions of secondary structure in the 5'UTR have all been shown to impact uORF regulatory functions previously 55 . The contributions of these factors to uORF-mediated translational regulation can be highly context specific, and dissecting these differences in regulation remains an interesting challenge for future studies.
Finally, we note that being a hospital-based biobank, participants in the PMBB are generally less healthy than the general population. The relative enrichment in diseased individuals in the PMBB may account for why few associations discovered in our analysis of the PMBB are replicated in the UKB which is a healthy, population-based biobank. Indeed we were unable to test for an association for two of the six PMBB associations due to an inadequate number of individuals having the phenotype in UKB. As hospital-based biobanks become more prevalent these unreplicated associations should be revisited and confirmed.
Understanding and interpreting the impact of noncoding genetic variation is a fundamental challenge in biology. Many mutations affecting uORFs are known to cause disease 56       *T1D: type 1 diabetes, T2D: type 2 diabetes, UTC, upstream termination codon, ** Associations reaching FDR < 0.1.

Annotation of translated non-canonical open reading frames
Non-canonical ORF (ncORF) annotations encompassing 5'UTR ORFs (uORFs), 3'UTR ORFs (dORFs), long-noncoding RNA ORFs (lncRNA) and pseudogene ORFs were retrieved from Supplementary File 1 from Ji et al. 15 . These ncORFs were mapped by ribosome-profiling in human BJ fibroblasts and MCF10A breast epithelial cells using the RibORF algorithm. Using the final set of genomic coordinates for ncORFs identified in this study, we converted these coordinates to match hg38 annotations using the UCSC LiftOver executable ( https://genome.ucsc.edu/cgi-bin/hgLiftOver ). Out of 10,007 distinct non-canonical uORFs mapped in the original study, 27 whose length changed after conversion (N = 5 uORFs, 4 dORFs, 16 lncRNA ORFs, 2 pseudogenes) were excluded from subsequent analyses. Each Refseq mRNA ID for each ORF-associated RNA transcript was annotated to its associated Ensembl transcript ID using the BioMart database v86 annotations. The first three nucleotides of each ORF were used as start codons for downstream analyses. The final three nucleotides of each ORF were used as stop codons for downstream analyses.

Quality filtering and annotation of variants from gnomAD version 3
Variants from gnomAD 3 release were downloaded from the gnomAD browser website (https://gnomad.broadinstitute.org/downloads). A set of high-confidence variants were obtained by removing those failing the Filter column (Filter != PASS) from the gnomAD version 3 vcf files using bcftools (version 1.9), and those falling in low complexity regions (lcr != 1). This set of variants was used for all downstream analyses. We additionally removed variants where the total observed allele number was at less than 80% of the maximum number of sequenced alleles to control for differences in sequencing depth in the gnomAD WGS dataset. The remaining set of high-confidence variants was overlapped with genomic coordinates for annotated ncORFs, 5'UTR sequences, and annotated protein-coding sequences using bedtools gene models obtained from Ensembl. VEP consequences were further filtered to only include the predicted consequence for the canonical Ensembl transcript as determined in 11 .

Positional constraint analysis using variants from gnomAD
For the positional constraint analysis we applied the MAPS metric to each variant set. We developed a MAPS model following previous methods 11 . The set of synonymous protein-coding variants are used as a baseline measurement for neutral selection, and the proportion of singletons in a variant class are adjusted for differences in mutation rates due to local sequence context 10,11 . We trained our model by regressing the observed proportion of singleton-synonymous variants for each trinucleotide context within protein-coding regions of the genome using previously published context-dependent mutation rates derived from intergenic noncoding regions of the genome 11 . Since negative selection prevents deleterious mutations from becoming common in human populations, more deleterious mutationsincluding those disrupting essential splice sites or introducing premature termination codonsare also more enriched for singletons compared to neutral variants.
MAPS scores for a given set of variants are calculated as described previously 3 Optimality decreasing variants were defined as any variant which decreased the CSC score for the encoded codon, and optimality increasing variants were defined as any variant which increased the CSC score for the encoded codon.
Confidence intervals for MAPS scores were calculated using bootstrapping as described 3 . For each set of n variants used to compute a MAPS score, we select n variants randomly with replacement and recalculate MAPS scores. This is repeated over 10,000 permutations and the 5th and 95th percentiles of the MAPS scores distribution are used as confidence intervals.
P-values for differences in MAPS scores were determined by calculating the proportion of bootstrapped MAPS scores from an experimental group of variants that were larger than those from the control group 3 .

Genetic sequencing
This PMBB study dataset included a subset of 11,451 individuals in the PMBB who have undergone WES. For each individual, we extracted DNA from stored buffy coats and then obtained exome sequences generated by the Regeneron Genetics Center (Tarrytown, NY).
These sequences were mapped to GRCh37 as previously described 36   For UKB, we used the provided ICD-10 disease diagnosis codes for replication studies, and individuals were determined to have a certain disease phenotype if they had one or more encounters for the corresponding ICD diagnosis given the lack of individuals with more than two encounters per diagnosis, while phenotypic controls consisted of individuals who never had the ICD code. Individuals under control exclusion criteria based on PheWAS phenotype mapping protocols were not considered in statistical analyses.

Association studies
A phenome-wide association study (PheWAS) approach was used to determine the phenotypes associated with 5' UTR variants predicted to create new TAA UTCs, or strengthen existing uORF stop sites and carried by individuals in PMBB for the discovery experiment 37 . Each disease phenotype was tested for association with each uORF variant using a logistic regression model adjusted for age, age 2 , sex, and the first ten principal components (PCs) of genetic ancestry. We used an additive genetic model to collapse variants per gene via an extension of the fixed threshold approach 64 . Given the high percentage of individuals of African ancestry present in the discovery PMBB cohort, association analyses were performed separately in European (N=8198) and African (N=2172) genetic ancestries and combined with inverse variance weighted meta-analysis. Only 5' UTR variants with at least five total alternate alleles in PMBB were selected for univariate PheWAS analyses in the discovery phase while variants with greater than half of the genotypes annotated as missing due to low quality were excluded. This resulted in a final set of N=10 variants. Our association analyses considered only disease phenotypes with at least 20 cases, leading to the interrogation of 800 total Phecodes.
All association analyses were completed using R version 3.3.1 (Vienna, Austria).
We evaluated the robustness of significant uORF-phenotype associations in the same PMBB discovery cohort by aggregating pLOF and predicted deleterious missense variants in each uORF's corresponding gene into a 'gene burden' for hypothesis-driven association with the significant phenotype from discovery. Only gene burdens with at least five total alternate alleles in PMBB were selected for replication studies. All gene burden association studies in PMBB were based on a logistic regression model adjusted for age, age 2 , sex, and the first 10 PCs of genetic ancestry.
Additionally, we replicated our findings in UKB for significant uORF associations in the PMBB discovery using 1) hypothesis-driven univariate association studies for the same uORF variants and 2) hypothesis-driven gene burden collapsing pLOF and predicted missense variants for the corresponding genes. Only uORF variants and gene burdens with at least five total alternate alleles in PMBB were selected for replication studies. Association statistics were calculated similarly to PMBB, such that each disease phenotype was tested for association with each gene burden or single variant using a logistic regression model adjusted for age, age 2 , sex, and the first 10 PCs of genetic ancestry. Replication significance was defined using a P-value threshold of 0.05. All association analyses for PMBB and UK Biobank completed using R version 3.6.1.

Construction of expression vectors
The test plasmids used a modified pGL4.12[luc2CP] (Promega) vector backbone where the control of expression of the Firefly ORF was modified by the addition of an upstream CMV promoter. The modified pGL4.12 vector was linearized using Bgl-II and MreI restriction sites.
Hybrid 5'UTR fragments containing the entire 5'UTR sequence and the first 91 nucleotides of the Luc2 Firefly ORF were produced by gBlock synthesis and received from Integrated DNA Technologies using sequences in Suppl. Table 2 . Test plasmids were constructed by sub-cloning these hybrid 5'UTR sequences for PMVK, VPS53, and BCL2L13 into the modified pGL4.12 vector to preserve the uORF-CDS relationship for each construct. Correct fragment insertion was verified for each engineered construct by sanger sequencing. For PMVK and BCL2L13, the entire annotated 5'UTR sequence was used. For VPS53, because of a G-rich sequence in the 5'UTR upstream of the uORF complicated synthesis of the gene's entire 5'UTR fragment, we removed the first 75 nucleotides of the annotated 5'UTR sequence. Construct assembly was accomplished using the NEB Hi-Fi assembly protocol following manufacturer's instructions.