DNA sequence-level analyses reveal potential phenotypic modifiers in a large family with psychiatric disorders

Psychiatric disorders are a group of genetically related diseases with highly polygenic architectures. Genome-wide association analyses have made substantial progress towards understanding the genetic architecture of these disorders. More recently, exome- and whole-genome sequencing of cases and families have identified rare, high penetrant variants that provide direct functional insight. There remains, however, a gap in the heritability explained by these complementary approaches. To understand how multiple genetic variants combine to modify both severity and penetrance of a highly penetrant variant, we sequenced 48 whole genomes from a family with a high loading of psychiatric disorder linked to a balanced chromosomal translocation. The (1;11)(q42;q14.3) translocation directly disrupts three genes: DISC1, DISC2, DISC1FP and has been linked to multiple brain imaging and neurocognitive outcomes in the family. Using DNA sequence-level linkage analysis, functional annotation and population-based association, we identified common and rare variants in GRM5 (minor allele frequency (MAF) > 0.05), PDE4D (MAF > 0.2) and CNTN5 (MAF < 0.01) that may help explain the individual differences in phenotypic expression in the family. We suggest that whole-genome sequencing in large families will improve the understanding of the combined effects of the rare and common sequence variation underlying psychiatric phenotypes.


Supplementary Tables
Supplementary Table 1 GATK summary table  17 Supplementary Table 2 Phenotype models split by translocation status 17 Supplementary Table 3 Genes in linkage regions. a) RefSeq genes and associated GO terms for experimentally confirmed biological processes, most significant dbGAP mental health/behavioural phenotypes and Ensembl phenotype annotations. b) Associated dbGAP mental health and behaviour traits p < 1 x 10 -5 17 Supplementary Table 4 Summary of haplotype segregation in the t(1;11) family 18 Supplementary Table 5 Phenotype prediction  19 Supplementary Table 6 Two-point LOD scores for the haplotype regions: a) chr1q; b) chr11q1; c) chr11q2; d) chr5q and e) Regions LOD>2 19 Supplementary Table 7 Variant Effect Predictor annotation 20 Supplementary Table 8

Diagnoses and phenotypic models in the family
Psychiatric diagnoses of the t(1;11) family members were established by consensus between two psychiatrists (D.B. and A.W., the latter being blind to karyotype status 1 ) in accordance with DSM-IV criteria using the Structured Clinical Interview for DSM-IV (SCID), hospital records and general practitioner records. Psychosis in this context was defined the presence of one or more of the A detailed description of the study was given and written informed consent was obtained from all individuals prior to participation.

WGS sequencing and variant calling i) Library construction and sequencing
For the first 10 samples, 1 µg of DNA was sheared with the S2 Series Covaris and libraries were prepared with the Illumina TruSeq paired end adapter kit. For the remaining 39 samples, 100 ng aliquots of genomic DNA were fragmented using the S2 Series Covaris (with Illumina recommended settings) and paired end sequencing libraries were prepared using the Illumina TruSeq Nano LT library protocol. The gDNA was sheared to ~250 bp fragments, and size selection was performed using AMPure magnetic beads (Agencourt). The fragments were then end repaired, adenylated, and Illumina adapters were ligated. Ten cycles of PCR enrichment were performed. Library quality was assessed using an Agilent Bioanalyzer HS chip and library quantity was determined using a Qubit dsDNA HS assay kit. Each sample was sequenced over 3 lanes of a HiSeq2000, using a paired end 101bp run. Basecalls were generated using the Illumina Real Time Analysis (RTA) software. The binary files containing sequences and quality scores were transferred to a shared HPCC running Linux for further processing. The Illumina bcl2fastq software was then used to convert the files to Fastq format, and only reads passing the standard Illumina quality filter were included in the output files.

ii) Mapping and SNV calling in GATK Unified Genotyper
Reads were aligned to the human reference genome (UCSC hg19) with the BWA program 2 , allowing two mismatches in a 30 base pair seed. SAMtools 3 was used to process the alignments, including sorting, merging, and indexing. Duplicate read pairs (reads with identical outer coordinates) were removed with Picard (http:// http://broadinstitute.github.io/picard/) to counteract PCR artifacts generated during the sample preparation. Bamtools 4 was used to filter for properly paired reads.
Picard CollectMultipleMetrics and GCbias was used to determine insert size and quality stats for each library. SNPs and small indels were called with GATK v2.4.9 [5][6][7] . Local realignment around indels and base quality score recalibration was performed for each library. SNVs were called using the Unified Genotyper in multi-sample joint calling mode in order to achieve consistent calling across samples.
GATK Variant Quality Score Recalibration (VQSR) modeling was employed, and only variants passing all metrics were used for downstream analyses.
Variants were checked for Mendelian segregation using Pedcheck in Merlin 8 . Multiallelic variants and those with one or more Mendelian errors were removed from the analysis.

iii) Data QC
Data for each genome was assessed for overall quality and coverage. The percentage of reads properly paired, rate of PCR duplicates, library complexity, mean library insert size, mean coverage and GC bias were examined. Ts/Tv ratios of the SNVs called by Unified Genotyper for each individual and a comparison of SNV calls to dbSNP 129 were determined using GATKVariantEval. A check of sample sex was performed for each sample by comparing heterozygosity on the X chromosome and coverage of the X and Y chromosomes to the reported sex. Males were expected to have few heterozygous sites on chrX and to have similar coverage levels on the X and Y chromosomes, while females were expected to have many heterozygous calls on chrX, higher chrX coverage and little coverage of the Y chromosome.
We performed a kinship analysis in order to assess the degree of relatedness amongst samples and to rule out any sample mislabelling. We used the software KING 9 to compute the coefficient of relatedness from the VCF file using PASS variants.
To confirm the population of samples, PCA analysis was performed using Peddy 0.3.2 10 .

iv) Copy Number Variation (CNV)
An adaptation of a CNV detection tool developed by Yoon et al. 11  and those where the heterozygotes did not match the sequencing data (N=7; 3 Pass and 4 non-Pass).
The identified variants have been deposited in dbSNP (accession numbers ss2137543799 and ss3343271435-ss3353031818).

Linkage analysis
The linkage analysis tool SOLAR (version 8.1.1) 13,14 was used to perform both two-point and multipoint variance component linkage analysis. Minor allele frequencies were taken from the 1000 Genomes Phase 3 for Europeans (EUR) data set 15 . LOD adjustment was performed using the LODadj command in SOLAR 13,14 to correct deviation of the data from multivariate normality. This command corrects the potential inflation of observed LOD scores by regressing observed LOD scores on simulated LOD scores, generated by a simulation of a normally distributed trait (under a null hypothesis of no linkage) over 10,000 permutations 14,16 . All LOD scores presented are adjusted LOD scores.
Identity by descent matrices (IBDs) and multipoint IBDs were calculated for two-point and multipoint linkage analyses respectively 13 . To maximise CPU usage, the IBD step of SOLAR was parallelised across multiple cores using a custom script. individuals. In addition, due to the size of the region and number of SNPs, the variants in the chr5q were also filtered to exclude variants that did not achieve an OddsRatio (OR) ≥ 1.5 in association with the model F phenotypes. Two-point linkage analysis was also performed on variants under the nominally significant multipoint linkage peaks (LODs ≥ 2) on chr1p, chr2p, chr3q, chr4q and chr16.
These variants were also selected based on the QC parameters described above, with the variants under the chr1p, 3q, 4q and chr16p peaks being further restricted to those that could not be imputed using the 1000 Genomes Phase 3 reference panel. This was again due to the large number of SNPs in these regions.
We calculated the maximum theoretical two-point LODs (mtLODs) for the pedigree, drawn from the 48 whole genome-sequenced individuals for each phenotypic model. These are presented in Table 1.
These LODs represent the highest LOD scores achievable in this pedigree assuming perfect segregation of a phenotype-associated allele with a MAF of 0.00001 (the lowest value accepted by SOLAR), showing the maximum power in the family for each phenotype model.
For the multipoint analyses, random sets of variants were selected across the autosomal chromosomes (chr1-12 separated into p and q arms) from the GATK PASS filtered variants from the t(1;11) family WGS data. These variants were chosen using the following additional QC parameters: an observed heterozygosity >0.2 and <1; no Mendelian errors; Hardy-Weinberg equilibrium (HWE) p>0.001; and present on at least 2 chromosomes. Three independent datasets were selected for multipoint analysis and combined to give an approximate density of three variants per mega-base (Mb).

Haplotype phasing
Regions of the genome with genome-wide significant or suggestive multipoint LOD scores (LOD≥3.

Variant annotation
The genes that lie under the multipoint linkage peaks were identified using the UCSC genome browser (GRCh37/hg19 assembly) refGene MySQL reports all SNPs which are significant cis-eQTLs (p-value < 0.05; either directly or via a proxy SNP), the genes with significantly altered expression levels, as well as the most significant gene(s) and its corresponding p-value.

UK Population-based cohorts: GS:SFHS & UKB
GS:SFHS genotyping, imputation and quality control is detailed in a previous publication 19  The frequency of the phenotype-linked haplotypes in the Scottish population was calculated in Haploview 21 using the genotype data from GS:SFHS.

Region-wide association analyses
Affective disorder was defined as a diagnosis of MDD or BD in GS:SFHS 22 or probable BD or MDD in UKB 23 . Further phenotypes used in GS:SFHS are described in 22,24 .
For association analyses, sample specific covariates were applied as follows: GS:SFHS. Imputed data was analysed using ProbAbel 25 . All analyses were adjusted for age, sex and 6 PCs to account for population stratification.
UKB. Imputed data was analysed using SNPTest 26 . All analyses were adjusted for centre, array and batch as random effects, and age, sex and 15 informative principal components as fixed effects (PCs; UKB Data Dictionary items #22009.01 to #22009.15) to take account of batch effects and possible population stratification.
Additional mood and cognitive measures included the General health questionnaire 28 (GHQ28) 27 , the mood disorder questionnaire (MDQ) 28 and schizotypal personality (SPQ-B) 29 questionnaire. The personality traits of neuroticism and extraversion were assessed using the Eysenck personality questionnaire-revised short form. 30  Analyses of age of onset for individuals with a diagnosis of affective disorder retained both chr11q2 and chr16p in the minimal model (p=0.018). These results suggest that in addition to the translocation and associated haplotypes (chr1q, chr11q1 and chr11q2), the t(1;11) family carries additional risk haplotypes that increase the risk of Psychosis (chr5q, chr2p and chr3q) and affective disorder (chr11q2) and may modify the age of onset of illness (chr11q2 and chr16p). Supplementary Table 5.

Two-point linkage analyses
Supplementary Figure 5a  Full results from the two-point linkage analyses are given in Supplementary To assess the effects of non-coding variants within the phenotype-associated haplotypes, 340 variants that achieved a LOD of greater than half the maximum theoretical LOD were assessed for potential functional effects in Encode data from RegulomeDB 17 and brain cis-eQTL data from Braineac 18  Functional annotation of SNV with LODs> 2 are given in Supplementary Table 6e and Supplementary   Table 8f. Within the chr4q deletion region multiple SNV are associated with the expression of genes in the region, most prominently rs6822066, shown to affect the expression of FSTL5, NPY1R, TKTL2, NAF1, and NPY5R (best p = 0.00077 for FSTL5 in the thalamus).

GRM5, PDE4D and CNTN5
GRM5 maps within the chr11q1 haplotype, close to the translocation breakpoint 36 , and encodes a glutamate receptor previously associated with psychiatric disorders [37][38][39] . This protein is involved in the regulation of neural network activity and synaptic plasticity. Glutamatergic neurotransmission is involved in most aspects of normal brain function and can be perturbed in many neuropathological conditions. Multiple non-coding variants in the intron and downstream region of GRM5 were predicted to alter expression of GRM5 and other genes in the region in multiple brain regions. These PDE4D maps within the chr5q haplotype and encodes a cAMP-specific 3',5'-cyclic phosphodiesterase and degrades cAMP, which acts as a signal transduction molecule. This protein is known to interact with DISC1, the protein product of the gene directly disrupted by the translocation, and may indicate a further hit on the same neurodevelopmental signalling pathway. Association analysis in the Scottish population suggest that variants in this gene are associated with MDD and Mill Hill vocabulary a test of crystallised cognitive ability thought to reflect premorbid IQ.
CNTN5 is the only gene in the chr11q2 haplotype, and its protein is involved in the formation of axon connections in the developing nervous system. CNTN5, and the closely related genes CNTN4 and

Supplementary Tables
Supplementary Table 1 GATK summary table   Supplementary      These data combine: prediction of regulatory function from Regulome DB; annotation of whether the SNP is a brain specific eQTL (as annotated by BRAINEAC), including the tissue that shows most significant association; LOD scores (and associated model), haplotype allele, major and minor allele frequencies.