Whole-genome sequencing reveals new insights into age-related hearing loss: cumulative effects, pleiotropy and the role of selection

Age-related hearing loss (ARHL) is the most common sensory disorder in the elderly. Although not directly life threatening, it contributes to loss of autonomy and is associated with anxiety, depression and cognitive decline. To search for genetic risk factors underlying ARHL, a large whole-genome sequencing (WGS) approach has been carried out in a cohort of 212 cases and controls, both older than 50 years to select genes characterized by a burden of variants specific to cases or controls. Accordingly, the total variation load per gene was compared and two groups were detected: 375 genes more variable in cases and 371 more variable in controls. In both cases, Gene Ontology analysis showed that the largest enrichment for biological processes (fold > 5, p-value = 0.042) was the “sensory perception of sound”, suggesting cumulative genetic effects were involved. Replication confirmed 141 genes, while additional analysis based on natural selection led to a prioritization of 21 genes. The majority of them (20 out of 21) showed positive expression in mouse cochlea cDNA and were associated with two functional pathways. Among them, two genes were previously associated with hearing (CSMD1 and PTRPD) and re-sequenced in a large Italian cohort of ARHL patients (N = 389). Results led to the identification of six coding variants not detected in cases so far, suggesting a possible protective role, which requires investigation. In conclusion, we show that this multistep strategy (WGS, selection, expression, pathway analysis and targeted re-sequencing) can provide major insights into the molecular characterization of complex diseases such as ARHL.


Introduction
Complex diseases are caused by several genetic and environmental factors, many of which still need further investigation [1]. Moreover, in contrast to monogenic disorders where causative genetic variants have near complete penetrance, complex diseases need specific approaches to be further investigated. The difficulty of exploring complex disease genetic models, lies in controlling the environmental components on one hand and accounting for different unknown genetic hypotheses on the other, such as pleiotropy, cumulative effects, gene-gene interactions etc. [2].
Age-related hearing loss (ARHL) or presbycusis is a very common complex disease leading to hearing deficits in the elderly. Due to the ageing population in our society, ARHL has reached to a 30% incidence among people over 50 years of age [3], becoming a major health care problem worldwide. ARHL significantly decreases the autonomy of affected people and also contributes to their anxiety, depression, and cognitive decline [4,5]. The hearing loss is progressive, starting from high frequencies and subsequently affecting medium and lower frequencies as well. The cause is heterogeneous consisting of both environmental and genetic factors, and, although we have a reasonable understanding of the environmental risk factors, we know little about the genetic factors [6].
In particular, ARHL is characterized by an uncertain heritability and a higher genetic heterogeneity compared with other complex traits (i.e., diabetes, rheumatoid arthritis). Thus, it is not surprising that, up to now, there was a dearth of gene discovery despite some published Genome Wide Association studies (GWAS) [7,8]. Moreover, as genetic data become available on a large scale, pleiotropy increasingly becomes an interesting topic for complex and frequent diseases such as ARHL. Variability in complex diseases, such as ARHL, might be due to epigenetic changes during hair cell development and regeneration. To date, the possible involvement of epigenetic changes is still largely unknown and future studies are needed to shed light on this mechanism.
Recently, the possibility of analyzing whole-genome sequencing (WGS) and whole-exome sequencing (WES) data at population level has been highlighted [9]. However, it was shown that GWAS studies lose in power due to the allele frequency spectrum targeted by sequencing with realistic sample sizes [10]. Hence, it is necessary to explore new methodologies and approaches when dealing with WGS and WES data.
For the present work, WGS data for ARHL cases and controls from villages located in North-Eastern Italy (total of 156 subjects) were available, as well as individuals from an independent cohort of North-Western Italy, which was used for replication (56 subjects). Given the high complexity and genetic heterogeneity of the phenotype, we decided to explore a new scenario under the hypothesis that several variants contributing to the disease have a cumulative effect. To achieve this goal, we defined a pipeline characterized by the following steps: (a) identification of a burden of variants in WGS data significantly different among cases and controls, (b) natural selection analysis of genes in order to detect genes under selection, which can directly or indirectly influence the phenotype, (c) expression studies in the inner ear and (d) pathway analysis. Finally, to further validate our approach, target re-sequencing screening was carried out in an additional cohort of 389 ARHL patients for two genes previously reported as associated with hearing function in literature.

Subjects
A WGS cohort of 378 individuals from the Friuli Venezia Giulia region in North-Eastern Italy was selected strictly for ARHL cases and controls based on the audiometric phenotype. A total of 156 subjects were considered for the discovery set: 78 cases and 78 controls matched by sex and age (mean age cases = 69, controls = 65; percentage of female cases = 60%, controls = 63%). Following the same selection criteria, a replication set of 56 subjects (28 cases and 28 controls) from an independent cohort located in North-Western Italy, Piemonte region (original cohort size n = 424) was enrolled, together with a completely independent cohort of 389 cases coming from several locations in Italy. Detailed demographic information about all cohorts is reported in Supplementary Table 1. Audiometric tests, a clinical examination and a lifestyle questionnaire were carried out for each subject. Considering that ARHL mainly affects hearing function at high frequencies, 4 and 8 kHz were measured for each subject and the average value was computed (pure tone average high (PTAH)). To avoid non-genetic variations in the hearing phenotype (e.g., monolateral hearing loss), the best hearing ear was considered for each individual. Subjects with history of occupational risk or pathologies related to ARHL were excluded from the study. Cases were defined as people >50 years old having PTAH ≥ 40, whereas controls were subjects >50 years old with PTAH ≤ 25. Blood samples were collected and DNA was extracted using a phenol-chloroform extraction procedure. Considering possible substructure in the genetic background of our samples, the overall genetic diversity (He) and inbreeding coefficient (Finbred) were estimated for cases and controls using PLINK v1.07 [11]. The Mann-Whitney test was used to evaluate the differences in these two statistics between the two groups.

Whole-genome sequencing
Low coverage WGS data were generated using Illumina technology (Genome Analyzer and HiSeq 2000) at the Wellcome Trust Sanger Institute, Cambridge (UK) and Beijing Genomics Institute, Shenzhen (China). Data coverage ranged from 4× to 10×. Multi-sample genotype calling was performed using Samtools mpileup (v. 1.2) [12] and Variant Quality Score Recalibrator (VQSR) filtering was applied to the raw call data with GATK v.3.3 Varian-tRecalibrator module [13], separately for single nucleotide variants (SNVs) and insertions/deletions (INDELs). The For each population, the lowest Variant Quality Score Log Odds Ratio (VQSLOD) threshold was determined by the output produced by VariantRecalibrator to select the best cut-off in terms of specificity and sensitivity of the trained model. For single nucleotide polymorphisms (SNPs), the selected minimum VQSLOD value was −15.0283 (99.80% truth sensitivity threshold). As INDEL calling and alignment is still more prone to error, a conservative approach was preferred, selecting a sensitivity threshold of 95%. The filter has been applied using GATK's Apply Recalibration module.
To improve the quality of the raw genotyping results based on the low coverage sequencing data, several genotype refinement steps on the filtered call set were performed: (1) BEAGLEv4.r1230 [14] was used to assign posterior probabilities to all remaining genotypes. (2) SHAPEITv2 to phase all genotype calls [15] and (3) IMPUTEv2 [16] to perform internal imputation in order to correct for genotyping errors. Information about the ancestral allele and allele frequencies were retrieved from dbSNP build 138. The data are available through the European Genome/Phenome Archive (EGA) under the accession number EGAD00001002014 (https://www.ebi.ac.uk/ega/datasets/ EGAD00001002014).

Data analysis
The complete workflow described below is summarized in Fig. 1.

Outlier genes analysis
Following the standard VCF file format, a variant was called if the alternative allele was present in at least one subject from each group (cases and controls). The definition of a gene was based on the Variant Effect Predictor v.74 [17] annotations and the number of variants was computed for each gene separately for cases and controls. The number of variants in controls was regressed with a linear regression against the number of variants in cases. Outlier genes were detected by selecting outlier residuals from the regression, defined as being >6 inter-quartile distances from the first and third quartile. For those genes, we separately analyzed the upper and lower outliers as being significantly more mutated in cases than in controls (Group A) and vice versa (Group B).
Statistical overrepresentation tests were carried out using the web tool PANTHER version 10.0 [18] with the following Gene Ontology (GO) and PANTHER annotation data sets: pathways, molecular function, biological process, cellular component and protein class. Bonferroni correction for multiple testing was applied. Results were compared with 100 random overrepresentation tests on the same number of randomly selected genes, in order to determine any false-positive recurrent enrichment.
The same outlier detection protocol was conducted in the replication cohort.

Variants classification and distribution in cases and controls
To further investigate which variants were accumulated in the outlier genes in cases and controls, respectively, we annotated all the variants using ANNOVAR [19]. For simplicity, we divided variants in "exonic" and "other", grouping together the following annotations: intronic, upstream, downstream, UTR3, UTR5, non-coding RNA and splicing. We then compared variants present only in cases and only in controls vs. those that were shared.

Analyses of natural selection
The list of replicated genes was investigated for evidence of possible adaptation applying a selection scan based on PCA as described in Duforet-Frebourg et al. [20].  Fig. 1 Workflow chart. The figure summarizes the workflow followed and described in the article Considering that the analyzed cohorts were coming from Northern Italy villages, only SNPs that were differentiated along the European-Asian axis of variation (African, European and East Asian populations from 1000 Genome data) were considered in this study (cut-off a false discovery rate (FDR) of 0.1). Moreover, the derived allele frequency in the European and Asian population groups for each significant SNP was calculated.

WGS Comparison between cases and controls
For all the analyses, the functions implemented in the R package "pcadapt" have been used [21].
For each gene found under putative selection, we searched for any previous associations with other traits in the GWAS catalog [22] (available at: www.ebi.ac.uk/gwas. Accessed 30/08/2016), and we estimated the level of lossof-function (LoF) intolerance (pLI) from the Exac database [23]. The closer pLI is to one, the higher the LoF intolerance, that is, a gene with pLI > = 0.9 could be considered as extremely LoF intolerant.

Pathway analysis
Genes under positive selection from the previous steps underwent identification of molecular network interactions and pathway analysis using the Ingenuity Pathway Analysis (IPA) tool by Ingenuity Systems (Redwood City, California, USA; http://www.ingenuity.com). Only direct interactions were taken into account and scores were computed for constructed networks based on the likelihood of the genes being connected together by random chance (significant threshold: score >2, with >99% confidence).

Gene expression analysis
The expression of selected genes was tested in the inner ear of CD-1 mice. Total RNA was extracted from the whole cochleae of 2-month-old wild-type CD-1 mice using Direct-zol RNA MiniPrep Kit (Zymo Research). RNA was quantified using Nanodrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA). Complementary DNAs were generated from 2 μg of total RNA using the Transcriptor First Strand cDNA Synthesis Kit (Roche) according to the manufacturer's protocol. Complementary DNAs (cDNAs) were used for semiquantitative RT-PCR (sqRT-PCR). In all, 4 μl of cDNA were used for PCR amplification. Gene-specific primers spanning exon-exon boundaries were designed (primer sequence available upon request). β-Actin primers were used as internal control. PCR reactions were optimized to 95°C for 2 min, 30 amplification cycles at 95°C for 30 s, 60°C for 30 s, 72°C for 4 s and a final extension of 1 min at 72°C using Kapa HiFi HotStart ReadyMix PCR kit (Kapa Biosystem, Cape town, South Africa). Amplified products were resolved on 2% agarose gels and visualized by ethidium bromide staining.

Discovery set analysis
Cases and controls analyzed in our cohort were genetically homogeneous because no differences were found in genome-wide observed heterozygosity (Mann-Whitney p-value > 0.13) and inbreeding coefficient (Mann-Whitney p-value > 0.52).
In order to estimate differences in numbers of variants per gene between cases and controls, a linear regression analysis was performed. As reported in Table 1, the number of variants varied greatly among genes. The linear regression showed great consistency, with data following the diagonal line almost perfectly (regression line coefficient = 0.9938 and R 2 = 0.9992). After selecting outliers using stringent criteria, a total of 746 genes were shortlisted with significantly more variants in cases than in controls (Group A, total of 375 genes) or significantly less variants in cases than controls (Group B, total of 371 genes). Complete lists of genes are reported in Supplementary Table 2. In Group A, overrepresentation tests of GO highlighted 45 significant biological/molecular processes for which our list of genes was enriched (Fig. 2). Regarding Group B, a total of 65 processes were significantly enriched, 18 of which were overlapping with those from Group A (Supplementary Table 3 and Fig. 3). Interestingly, in both groups, the "sensory perception of sound" biological function category was the most enriched among all categories (fold enrichment > 5).

Variants classification and distribution in cases and controls
Among the 746 outlier genes, we compared the distributions of exonic (i.e., coding) variants with the non-coding ones. A Wilcoxon test showed a significant increase of the proportion of exonic variants detected in cases only compared with the proportion of exonic variants in the set of shared variants in Group A (p~2e-10, Fig. 4a), whereas the proportions of non-coding variants were equally distributed (p > 0.05, Fig. 4b). On the other hand, Group B genes were not enriched for exonic variants detected only in controls (Figs. 4c, d).

Replication
The same protocol was used to replicate data in a completely independent cohort with another Italian geographic area (i.e., to overcome any confounding factors due to population structure or relatedness) [24]. Overall, 141 genes (57 out of 375 genes from Group A and 84 out of 371 from Group B) have been replicated (see Supplementary  Table 2). In particular, 28,353 out of 149,798 variants detected in the 57 replicated genes in Group A were only present in cases. The vast majority of them (96 %) were intronic or intergenic variants, whereas the remaining 4% (1003 variants) were annotated as upstream, downstream, UTR3, UTR5, ncRNA or exonic variants (i.e., on average 7.5 alleles in each patient, and 6 alleles per gene).
Regarding the 84 genes from Group B, 39,065 out of 224,384 variants were mainly intronic or intergenic (91%) and only present in controls. The remaining 9% of functionally relevant variants were present on average with 30 alleles per subject and 6 alleles per gene. Moreover, we found three genes with homozygous variants in cases only: KCNIP4, MAST, and TIAM1. Interestingly, the patient carrying the TIAM1 homozygous variant is a 66 y.o. ARHL male showing a severe hearing loss at PTAH (102.5 dB). The variant is an exonic variant, rs34882418 (NM_003253.2:c.326C>T, NP_003244.2:p.(Thr109Ile)) predicted to be deleterious by SIFT, not present in the ClinVar database and extremely rare in the 1000 Genomes database (minor allele frequency = 0.005).

Analyses of natural selection
Analysis of natural selection has been carried out to test if any of the 141 replicated genes could have been selected for having a major role on phenotypes directly or indirectly related to hearing. Moreover, we searched for evidence of possible adaptation along the European-East Asian axis. Twenty-one genes out of 141 (15%) were found to be under selection (13 from Group B and 8 from Group A) (see Table 2). Assessing LoF intolerance, a parameter to evaluate genes' sensitivity to highly damaging variants, we identified two genes from Group A and seven genes from Group B that showed a probability of LoF intolerance (pLI) > 0.9, which means that they did not tolerate the presence of LoF alleles (see Table 2). Interestingly, all genes found to be under selection were also present in the GWAS catalog (i.e., already associated with a specific phenotype). One very promising candidate gene is SDK1, which has been already associated with vestibular phenotypes (i.e., motion sickness) and it also seemed to be very important for synapse formation/function and inner ear development ( Table 2).

Pathway analysis of genes under selection
Finally, the 21 genes under selection have been investigated using IPA software in order to verify if they belong to any known pathway. Results showed two pathways, each including 10 focus molecules from our list of genes under selection. In particular, the first pathway (cell death and survival, nervous system development and function, tissue morphology, score = 24) contains the genes ATP8A2 and PDE4D that have been already reported as expressed in the inner ear or involved in the hearing system [25,26].

Expression analysis
Expression analysis of the 21 strongly suggestive genes under selection was performed in the inner ear of 2-monthold mice. mRNA levels were normalized relative to the βactin (Fig. 5). Interestingly, data analysis revealed positive expression for 20 genes (all but Dgki), suggesting a possible role in the auditory system. In order to broadly evaluate expression localization in the mouse inner ear, three publicly available resources were consulted: SHIELD, MGI and Gear (web resources). Results confirmed some level of expression for all the genes analyzed and all but three were expressed in hair cells of the cochlea. The Dgki gene, not detected in our experiments, was reported as weakly expressed in surrounding cells of the cochlea and of the vestibular system, as well as in auditory neurons. Detailed information is provided in Supplementary Table 4.

TRS ARHL cohort
Two of the genes (CSMD1 and PTPRD) from Group B and detected in the second IPA were previously described in two studies performed by our group, focusing on GWAS and replication in European and Asian samples. After statistical validation, both of them were investigated by immunocytochemistry in mouse cochlea and showed distinctive expression patterns [27] strongly suggesting a functional involvement in the auditory system. Based on this previous knowledge and experience in investigating these genes, we further prioritized them for performing target re-sequencing in a very large independent cohort of 389 ARHL cases. We focused on the 10 exonic variants (five for each gene) that we already detected only in controls in our discovery step. Six of them (three out of five for each gene, nonsynonymous and never reported in the 1000 Genomes and ESP databases) were not present in this large cohort (Table 3), thus suggesting a possible protective role to be further investigated.

Discussion
ARHL is a complex and heterogeneous disorder. Despite all the efforts made in gene discovery, little is still known about the genetic risk factors underlying the disease itself. In this light, identifying cumulative and polygenic effects involved in this disease is a big challenge. In this study, we conducted a WGS analysis in a large cohort of ARHL cases and healthy controls to identify genes and variants involved in ARHL. In particular, we showed that the strength of our approach relies on a novel strategy to query WGS data that is not based on association analysis. This strategy allowed us to identify genes enriched in several relevant biological and molecular processes such as "sensory perception of sound" (i.e., strictly connected with the phenotype investigated) or "neurological system development". These findings might play an important role in ARHL and hearing function, considering that the hearing system is connected to the brain via the cochlear nerve, and most of the hearing genes are expressed in the brain [29]. Given that genes were selected for a differential burden of variants between the two groups (cases or controls), the presence of functional enrichment is evidence of polygenic and cumulative effects involved in ARHL. Accordingly, we demonstrated that case-specific exonic variants were significantly enriched, further suggesting that an accumulation of coding variants might contribute to the pathogenicity of a complex disease such as ARHL. Of course in the future, attention should be paid to regulatory elements, which might also contribute to the phenotype similar to what has been described for other complex diseases [30]. Considering the difficulties underlying genetic replication studies (e.g., due to different population structure), the high replication rate of our strategy (i.e.,19%) further confirms its validity. The majority of replicated genes do not seem to be under natural selection, which means they accumulated variants in a neutral or relaxed fashion. This is in line with other genes associated with complex traits, which display an excess of negative selection [31]. This finding is also consistent with data showing that the hearing phenotype and related genes, such as Connexins, are affected by both assortative mating and relaxed purifying selection [32][33][34]. Nevertheless, a subset of these genes (21 out of 141) was estimated to be under natural selection. They were already present in the GWAS catalog, thus being associated with other phenotypes (i.e., not hearing) suggesting the presence of pleiotropy. Expression studies revealed that most of these genes were expressed in the inner ear.
The usefulness of our approach is further corroborated by the following examples. First, genes under selection in Group A should likely be those having stronger effects. Accordingly, a predicted deleterious homozygous variant in TIAM1 gene was detected in a patient, whose audiometric phenotype was the severest among all cases investigated. Most likely, the presence of this allele might be enough to explain the phenotype in this patient resembling a monogenic form of hearing loss. This agrees with recent results showing a continuum of effect sizes ranging from extremely rare variants driving Mendelian disorders to common variants with small effects involved in complex forms of the phenotype [35]. Moreover, TIAM1 is involved in regulating tissue polarity of hair cells by interacting with the LIS1 gene [36] and it also shows a high LoF intolerance (pLI = 1) further suggesting a possible pathogenic role. SNPs with higher DAF in SNPs with higher derived allele frequency in, Probability of LOF intolerance (pLi) probability of loss-of-function intolerance: we consider pLI > = 0.9 as an extremely LoF intolerant set of genes, GWAS catalog association phenotype previously associated with the gene by means of gwas The second example is related to genes under selection from Group B. Two of them, CSMD1 [37,38] and PTPRD [39], were previously associated with other phenotypes. More recently, they were associated with the hearing function [8,27] and replicated in a trans-ethnic cohort. Expression profiles in mouse cochlea by immunocytochemistry showed striking patterns suggesting a role in hearing function [27]. Building on this previous knowledge, a TRS study in a series of ARHL patients is reported identifying six variants with a possible protective role for ARHL and confirming the pleiotropic effect of genes under selection.
Regarding genotype-phenotype correlation, further functional experiments (i.e., immunohistochemistry and confocal experiments) to determine the exact localization and the role of the gene/proteins identified will possibly lead to a specific correlation between the audiometric profile and the relative gene. All these results highlight the complexity of the hearing phenotype, its polygenic nature, and the likely effects of genetic variation in the affected patients.

Conclusions
Following the hypothesis of cumulative effects of several variants contributing to the disease, we identified a list of candidate genes for ARHL by applying a new strategy based on different complementary approaches using WGS data. In conclusion, this study provides a list of 21 candidate genes, identified due to a differential mutational burden in ARHL cases and controls, with evidence of natural selection and expression in mouse cochlea. The molecular screening of these candidates in larger cohorts of cases and controls could be included in future NGS approaches for the identification of genetic risk factors and their individual contribution to the etiology of ARHL and to further understand how they interact with each other. Moreover, strong evidence of a possible role in the disease was obtained for two genes (i.e., CSMD1 and PTPRD) and a series of variants detected in their sequences. These findings underline the need for additional functional studies of both CSMD1 and PTPRD genes/variants (e.g., in animal models) to definitively prove their role in ARHL. Identifying genetics risk factors for ARHL and improving appropriate interventions will be a worthwhile goal for future research and results presented in this article will be a valuable starting point.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval The study was reviewed and approved by the Ethics Committee of the Burlo Garofolo Children's Hospital in Trieste (Italy).
Informed consent Written informed consent was obtained from each participant in the study and all the research was carried out according to the ethical standards defined by the Helsinki declaration.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/.