Joint mouse–human phenome-wide association to test gene function and disease risk

Phenome-wide association is a novel reverse genetic strategy to analyze genome-to-phenome relations in human clinical cohorts. Here we test this approach using a large murine population segregating for ∼5 million sequence variants, and we compare our results to those extracted from a matched analysis of gene variants in a large human cohort. For the mouse cohort, we amassed a deep and broad open-access phenome consisting of ∼4,500 metabolic, physiological, pharmacological and behavioural traits, and more than 90 independent expression quantitative trait locus (QTL), transcriptome, proteome, metagenome and metabolome data sets—by far the largest coherent phenome for any experimental cohort (www.genenetwork.org). We tested downstream effects of subsets of variants and discovered several novel associations, including a missense mutation in fumarate hydratase that controls variation in the mitochondrial unfolded protein response in both mouse and Caenorhabditis elegans, and missense mutations in Col6a5 that underlies variation in bone mineral density in both mouse and human.


Supplementary Fig. 5. Splice site variant in Hcfc1r1
(a) A SNP (A/G; chr17 at 23,674,533 bp) at splice acceptor site is indicated by a vertical arrow. The reference gene structure is shown in dark red, whereas the mutated isoform with Ensembl annotation (Ensembl gene ID: ENSMUSG00000023904) is shown in dark blue. Two dotted lines represent start positions of exon 2 in reference gene structure and mutated isoforms, respectively. (b) DNA sequences in exon2 of the reference gene but not in mutated isoform are shown in peach pink. Partial common sequences are shown in dark red. (c) Protein sequence conserved among other mammals.

Supplementary Fig. 6. Association analysis for a frameshift variant in Pcm1
(a) A 1-bp deletion (T/-; chr8: at 41,309,537 bp) in Pcm1 exon 24 was identified, supported by an mRNA isoform (accession: CF893233). The frameshift occurs in exon 24 in Pcm1. The deleted base is indicated by a vertical arrow. The normal amino acid sequence is shown in light purple and blue boxes. (b) Pcm1 is statistically differentially expression in hippocampus (p = 0.03) and whole brain (p = 0.02), with higher expression from B alleles. Significance was assessed by twotailed Students t-test, assuming unequal variances with two biological replicates (i.e. male and female). (c) Pearson correlation between Pcm1 mRNA expression in hippocampus (Probe set: 1436908_at) and locomotor activity after paraoxon organophosphate (OP) acetylcholinesterase inhibitor response (GN ID 10571) in BXD family. The mRNA expression in hippocampus was measured using Affymetrix M430 2.0 microarrays. The numbers above the dots represent BXD strain name. Two parental strains, B6 and D2, are highlighted in blue and brown, respectively. (d) Human phenome scan for association of Pcm1 (rs201598166) across BioVU. Each point represents the -log10(p value) of a single SNP-phenotype association. The horizontal red dotted line represents p = 0.01.
5′ UTR variants are of particular interest because they may control mRNA transcription, stability, or translational efficiency 2 . Seventy-four ncSNPs with highimpact scores (>0.9) located in the 5′ UTR potentially alter the binding affinity of transcription factors (Supplementary Data 9). One of the SNPs is in a repressor element 1-silencing transcription factor (REST) binding motif; 25 SNPs are in transcription factor binding sites from the OREGO database; and 39 are in bidirectional promoter regions for co-regulated genes.

SNP validation
We resequenced all 47 nonsense variants with PCR products and a subset of 59 randomly selected missense variants to assess the false positive rates (FPR).
We validated 42 nonsense variants and 57 missense variants-an FPR of 13% and 3.38% respectively (Fig. 2e). Although the FPR of these variants is significantly higher than that of SNPs in general, most of these coding SNPs are true variants with potentially substantial effects on mRNA stability and protein function.
We selected 53 out of 69 splice-site variants for further validation by Sanger resequencing, and 52 were confirmed as true variants, indicating that the false positive rate was only ~2% (1/53). In addition, 26 of the ncSNPs are predicted to alter processed miRNA sequence (Supplementary Data 10).

Small insertions and deletions
Small insertions and deletions (indels) are more frequent than large indels (Supplementary Fig. 2a and 2b). Most small indels-63% of all deletions and 64% of all insertions-are 1 to 3 bp in length (Supplementary Fig. 2a and 2b).
Single base pair indels account for 41% of small deletions and 41% of small insertions. Like SNPs, small indels are unevenly distributed due to the complex history of the parental strains and the retention of long intervals that are almost identical by descent. Of these, 31 gave reliable PCR products and 26 frameshift mutations were validated, indicating a false positive rate of ~16% (5/31).

Large insertions and deletions
We detected 9,347 large indels using the BreakDancer program and Illumina sequence data, and 9,201 large indels using the ABI SOLiD indel pipeline. Only  Fig. 2a and 2b). A total of 162 large indels spanned exons.
The false positive rate for large deletions is high 3 . To overcome this, we used two different detection approaches: unmapped/split-read (Pindel) and anomalously mapped read pairs (BreakDancer and the ABI large indel pipeline).
We detected 716 large indels that were identified using both of these approaches. We also performed PCR-based validation of 20 predicted large deletions and 20 predicted insertions. A total of 32 indels were validated, a ~20% false positive rate for detecting large indels.

Effect of noncoding SNPs on the structure of miRNA
In order to identify potential SNPs in miRNA encoding genes in mouse genome,

Missense variants confirmed by mass spectrometry-based proteomics data
To characterize missense variants supported by mass spectrometry-based proteomics data from the B6 and D2 strains, we created a customized protein database containing all mouse proteins from SwissProt/trEMBL and proteins with 11,355 missense variants. A total of 189,101 MS/MS spectra from 10 fractions were searched against this database using SEQUEST (version 27 revision 13). A total of 49,639 unique peptides and 6,732 protein groups were identified at an FDR of 1%. Of these, 79 unique peptides harboring missense variants were identified (Supplementary Data 8).

Summary of expression phenome
A total of 84 expression datasets were used for calculating the number of mRNA assays in Fig. 1b, including 28 tissues (Supplementary Data 2). Among these, 16 datasets highlighted in grey were further used for molecular phenome scan analysis. These datasets are accessible from our website www.genenetwork.org.

PheWAS analysis using GeneNetwork
The PheWAS analysis can be performed in GeneNetwork (www.GeneNetwork.org) by simply following these three steps. We use an produce large output tables with as many as 20,000 endophenotypes. However, these can be sorted easily by the position of the peak association score (click on column that is labeled Max LRS Location). When correctly sorted, this table will highlight those candidate mRNAs that presumably map to sequence variants at the Entpd2 locus.
Among the 69 splice-site ncSNPs, an acceptor site mutation within intron 1 of Hcfc1r1 (host cell factor C1 regulator) was of particular interest ( Supplementary Fig. 5a). This mutation (A/G; Chr 17 at 23,674,533 bp) disrupts a cryptic AG splice acceptor site of the B-type isoform (Supplementary Fig. 5a) and creates an alternative D isoform that trims away four amino acids (SLSP) on the N-terminus of exon 2 (Supplementary Fig. 5b). The shorter D isoform is relatively highly conserved among mammals, including humans (Supplementary  Fig. 6a). The mutated isoform is supported by an mRNA isoform (accession: CF893233). Pcm1 is highly expressed in several CNS tissues, including hippocampus and striatum, and has higher expression from B alleles (Supplementary Fig. 6b). SNP rs3696853 within Pcm1 is strongly linked to expression of Pcm1 in hippocampus (q = 1.3×10 -11 ; Supplementary Fig. 6c).
Pcm1 is thought to be downstream of mutations in the schizophrenia susceptibility candidate gene Disc1 4 and may be involved in regulation of protein kinase C activity 5 and to protein kinase calcium/calmodulin-dependent kinase II beta localization to the centrosome during dendrite patterning 6 . Several studies suggest a role for Pcm1 variation in CNS development 7 , schizophrenia 8 , and Huntington's disease 9,10 . As such, Pcm1 may be relatively novel but is also a key intracellular signaling modulator that critically influences behavior and brain function. Matched analysis of the BioVU cohort shows that rs7009117 in PCM1 is associated with psychiatric disorders, including schizophrenia (p = 6.3×10 -4 ), psychosis (p = 8.6×10 -4 ), and autism (p = 2.5×10 -2 ) (Supplementary Fig. 6d).