Exploring rare and low-frequency variants in the Saguenay–Lac-Saint-Jean population identified genes associated with asthma and allergy traits

The Saguenay–Lac-Saint-Jean (SLSJ) region is located in northeastern Quebec and is known for its unique demographic history and founder effect. As founder populations are enriched with population-specific variants, we characterized the variants distribution in SLSJ and compared it with four European populations (Finnish, Sweden, United Kingdom and France), of which the Finnish population is another founder population. Targeted sequencing of the coding and non-coding immune regulatory regions of the SLSJ asthma familial cohort and the four European populations were performed. Rare and low-frequency coding and non-coding regulatory variants identified in the SLSJ population were then investigated for variant- and gene-level associations with asthma and allergy-related traits (eosinophil percentage, immunoglobulin (Ig) E levels and lung function). Our data showed that (1) rare or deleterious variants were not enriched in the two founder populations as compared with the three non-founder European populations; (2) a larger proportion of founder population-specific variants occurred with higher frequencies; and (3) low-frequency variants appeared to be more deleterious. Furthermore, a rare variant, rs1386931, located in the 3ʹ-UTR of CXCR6 and intron of FYCO1 was found to be associated with eosinophil percentage. Gene-based analyses identified NRP2, MRPL44 and SERPINE2 to be associated with various asthma and allergy-related traits. Our study demonstrated the usefulness of using a founder population to identify new genes associated with asthma and allergy-related traits; thus better understand the genes and pathways implicated in pathophysiology.


Introduction
The Saguenay-Lac-Saint-Jean (SLSJ) region is located in northeastern Quebec and is known for its unique demographic history and founder effect, characterized by several population bottlenecks followed by rapid expansion [1].
Founder populations, like SLSJ, are usually less diverse at the environmental and genetic levels [2], and harbor larger identical DNA segment than outbred populations [3,4]. An advantage of studying founder populations is the possibility to investigate variants, which occurred in higher frequencies as compared with the general population, due to either genetic drift or weaker selection against deleterious alleles [5][6][7][8]. These effects can lead to overcome selection disadvantage and allow alleles to reach higher frequencies in founder populations [9][10][11][12] such as the French-Canadian [13] and Finnish populations [14]. These characteristics make founder populations well suited to study the impact of rare and low-frequency variants in complex traits, such as asthma [11,15,16]. Recent studies showed the smaller but important contribution of rare and low-frequency variants in the genetic basis of complex traits [17,18]. Despite the major contribution of common variants, investigations of rare and low-frequency variants in founder populations would provide additional and insightful information about the underlying biological mechanism of traits development.
Rare and low-frequency variants have been previously studied to better understand the genetic basis of complex traits such as asthma and allergic diseases [19][20][21]. Although these variants do not explain a large part of the missing heritability they contribute nevertheless [19]. Rare variants exploration identified previously associated genes (ex: GSDMB [13], IL33 [15]) and new ones (ex: GRASP [13]), highlighting the importance of studying their role in the context of asthma and allergy-related traits.
In this study, we set out to identify new variants and genes associated with various traits related to asthma and allergy in hope to better understand the biological mechanisms and pathways of these diseases. We aimed to take an advantage of the well-characterized SLSJ cohort and its availabilities of comprehensive phenotypes related to asthma and allergy [1]. Our first objective is to characterize the variants distribution in the SLSJ founder population and to compare it with four European populations (i.e., Finland (FINN), Sweden (SWE), United Kingdom (UK) and France (FR)). Of the four populations, the Finnish is another founder population. As SLSJ is part of the French-Canadian population, we wanted to follow-up on previous results and see if we could observe the same enrichment of deleterious variants in this population [13]. Our second objective is to explore the functional consequences of identified rare and low-frequency variants and their respective genes in asthma and allergy-related traits in the SLSJ asthma familial cohort [1]. Given that asthma and allergies are characterized by multiple subphenotypes and endotypes, we selected the quantitative traits related to both the immunological and clinical aspects for investigations: serum immunoglobulin (Ig) E levels, eosinophil percentage and lung function.

Samples
We sequenced samples from 149 parent-child trios (447 samples) from the SLSJ asthma familial cohort [1]. For the characterization of variants distribution, of the 447 samples, 93 non-related parents were selected and paired by mean coverage to avoid bias with 93 non-related samples from each of the four study populations (FINN, FR, SWE and UK) for a total of 465 samples. All studies received ethic approbations from their respective ethic committees.
To analyze the impact of rare variants on the phenotypes of lung function (i.e., forced vital capacity (FVC), forced expiratory volume in 1 s (FEV 1 ) and Tiffeneau-Pinelli index (FEV 1 /FVC)), serum IgE levels and eosinophil percentage, the sequenced data from the 149 sequenced trios were used to infer and impute genotypes for the rest of the cohort (1214 individuals from 254 families [1]; see Inference and imputation section; see Table 1 for Clinical description and for Recruitment details). Lung function, serum IgE levels and differential white blood cell counts of the SLSJ cohort have been measured and published previously [1]. The study was approved by the Centre Intégré Universitaire de Santé et de Services Sociaux du SLSJ ethics committee. All subjects gave informed consent.

Capture and sequencing
Samples from the 149 trios from the SLSJ asthma familial cohort and 93 non-related individuals from each of the four European populations were all sequenced using a custom capture panel developed by our group [22], followed by next-generation sequencing. This custom capture panel covers around 3% of the genome, including coding and non-coding immune regulatory regions [22] (see Morin et al. [22] for details about panel description, capture and sequencing) because most of the variants associated with complex traits are located in the non-coding region of the genome [23]. This approach allowed us to assess the impacts of rare and low-frequency variants in a costeffective manner. Moreover, the use of next-generation sequencing was better suited in this context compared with genotyping chips or linkage analysis due to the small effect size and low-frequency of the variants we explored, as well as the low heritability of the traits [24]. To remove any potentially related individuals in the paired 465 subjects (93 subjects from each of the five populations), we performed an identity-by-descent estimation using the method of moments [25] and a principal component analysis using the SNPRelate R package [26] (Supplementary Figure 1A). We also used heterozygous/homozygous proportion to identify and remove outliers and their paired samples (Supplementary Figure 1B, C and D). A total of 380 samples (76 samples per population) were analyzed for variants characterization at the end. Four hundred forty samples from the SLSJ trios were kept for the second part of the study.

Variant calling and filtering
We aligned reads to the Genome Reference Consortium Human genome build 37 (GRCh37) using bwa 0.7.6a and we called variants using HaplotypeCaller v3.2 (GATK). We performed merge calling for all samples in the five populations and the 149 trios independently. Variants included in both sample sets met these criteria: (1) included within the targeted regions, (2) biallelic sites, (3) dp > 10× and gq > 35 in at least 90% of the samples. We removed the human leukocyte antigen region as it will be analyzed independently. We compared the sequenced data with genotyped data obtained for the 149 trios (447 individuals; see Genotyping section) as a quality control using heterozygous and biallelic variants that were both in captured regions and on the genotyping chip. Seven samples were removed due to a concordance of <95%. We also used the comparison between sequenced and genotyped data to set our filtering cut-off to read depth (dp) > 10 × , Allergy is defined as one positive skin prick testing (wheal diameter ≥ 3 mm at 10 min). The allergy phenotype is available for 1193 subjects, 445 trio members, 147 probands, 296 parents and 106 siblings j Cell type profiles are available for 967 subjects, 418 trios members, 137 probands, 283 parents and 98 siblings genotyping quality (gq) > 35; an accuracy and a sensitivity of > 95% were observed. We assessed Mendelian errors using VCFtools [27] and a parent/proband pair was excluded due to high error rate. Mendelian errors in the remaining samples were replaced by missing values. Sequences were inferred and imputed from the sequenced trios samples for the rest of the cohort (see Inference and imputation sections). Functional annotation was performed using SNPeff [28] and selectively constrained variants were identified with the Genomic Evolutionary Rate Profiling (GERP++ score) [29].
Genotyping DNA extraction, genotyping (Illumina 610K Quad Array) and variant filtering of the 1214 individuals from the SLSJ asthma familial cohort have been described previously [1,30]. These samples include the 149 trios sequenced. Genotyping data were used for quality cut-offs assessment, inference and imputation of the sequence for the entire cohort.

Inference and imputation
Genotype phasing was performed using SHAPEIT v2 and duoHMM [31][32][33] to consider familial structures. Prephasing was done using the trios (440 samples) using merged sequencing and genotyping data, as well as on the whole cohort using only genotyping data (1214 samples). We inferred the sequence in the non-sequenced siblings who were part of the same families as the trios. Chromosomes were separated by breakpoints that were identified using duoHMM and NUCFAMTOOLS [34]. We were able to reassemble the sequence in the siblings using informative markers (heterozygous in one parent and homozygous in the other) and we inferred on average 98.72% (96.89-99.33%) of the sequence at an accuracy of 99.67% (99.39-99.86%). We used the parental haplotypes (294 samples) as the reference panel to reduce the number of duplicated haplotypes and we imputed the sequence using IMPUTE2 [35] in the whole cohort. Missing genotypes from the inference were added using the imputed data. A total of 112,154 variants were imputed with a mean accuracy of 99.70% (98.55-99.98%) and retained variants with imputation quality of > 0.8 for a total of 112,083 variants. The imputation accuracy was measured by comparing imputed probands to their sequenced data.

DNA methylation
We measured DNA methylation levels on a subset of individuals from the SLSJ asthma cohort using whole blood (167 samples) and isolated eosinophils from blood (24 samples). Eosinophil cells isolation was performed as described in Ferland et al. [36] and DNA extraction and sodium bisulfite conversion were described in Liang et al. [37]. Methylation levels were assessed using the Infinium HumanMethylation450 BeadChip array (Illumina, San Diego, CA, USA). Normalization steps were described in Morin et al. [38].

Statistical analyses
To assess variants distribution differences across the five populations, the variant proportions or ratios were compared using a chi-square test combined with Cramer's V ( > 0.15 for 4 degrees of freedom) and an analysis of variance (ANOVA) followed by Tukey's test to assess the differences of per sample distribution. p-Values reported for the post hoc tests are adjusted for false discovery rate (chisquare) or with Bonferroni correction (ANOVA).
We performed single-variant analyses on low-frequency variants (0.01 < minor allele frequency (MAF) < 0.05) and which were not common in the UK10K [39] or 1000 Genomes Project (1KGP) [40] for a total of 15,294 variants (significance threshold set to p < 3.3e-6). For the genebased test, we combined rare and low-frequency variants per gene, including its surrounding 20 kb (including only immune DNase I Hypersensitivity Sites (DHSs)) and only included gene regions that had at least two variants for a total of 14,646 (significance threshold set to p < 3.4e-6 or false discovery rate 5%).
We explored variants associated with the five investigated phenotypes using EPACTS software (Efficient and Parallelizable Association Container Toolbox: http:// genome.sph.umich.edu/wiki/EPACTS). We performed a mixed model association called EMMAX (Efficient Mixed Model Association eXpedited) that accounts for sample structure (population structure and relatedness (kinship coefficient)). EMMAX supports single-variant association tests and different burden tests (combined multivariate and collapsing (CMC) [41] and Sequence kernel association test (SKAT) [42] methods). The two burden tests were used because of their different assumptions: CMC performs a multivariate test on the variant counts in each gene region [41], whereas SKAT combines the score of each variant in the gene regions assuming their independence, thus allowing them to go in different directions of effect [42]. Sex and age were used as covariates, as well as height for lung function assessment.
To determine if DNA methylation levels at genes associated with asthma-related traits were associated with serum levels, FEV 1 /FVC and asthma using whole blood (167 samples) and blood isolated eosinophils (24 samples), we applied a robust linear regression model including age and sex as covariates, as well as cell type composition for whole blood. To assess if results were obtained by chance, we performed 1000 permutations and determined if we get the same number of CpGs with p < 0.05.

Founder populations are enriched with populationspecific low-frequency variants
For the first objective of characterizing variants distribution, after removal of related individuals and outliers 76 samples from the SLSJ cohort were paired with 76 samples from each of the four other European populations (a total of 380 individuals). All five populations had a mean coverage of 28 × (18-52 × ) ( Table 2). A total of 192,227 variants (178,613 single-nucleotide variants (SNVs) and 13,614 indels) were called according to the aforementioned criteria and summarized in Table 2. Given the small number of samples being analyzed in this study, the rare variant spectrum only includes singletons, which are rare variants seen only one time. We observed a smaller number of variants in both founder populations (total number of variants: 111,669 (SLSJ), 108,485 (FINN), 120,473 (FR), 114,468 (SWE) and 116,937 (UK); Table 2) and was reflected in the mean number of singletons per sample for each population (Supplementary Table 1 and Supplementary Figure 2). A smaller proportion of singletons was observed in the SLSJ (21.6%) and FINN (19.4%) populations compared with other populations (28%, 24% and 25% for FR, SWE and UK respectively; Fig. 1a). Similar proportions of low-frequency variants were observed across each population (Fig. 1a). We observed a higher number of low-frequency variants in the founder populations when the mean number of variants per sample were compared with the non-founder populations (ANOVA p < 2e-16; Fig. 1b).
For population-specific variants, which were only observed in one population, higher proportions of low-frequency variants were found in the SLSJ and FINN populations (chisquare p < 2e-16, Fig. 1c) and higher mean numbers per sample per population than the non-founder populations (ANOVA p < 2e-16, Fig. 1d) compared with the smaller proportion observed previously in public databases (chisquare p < 2e-16; Supplementary Figure 3B). For rare population-specific variants, we observed smaller proportion in SLSJ and FINN than the non-founder populations (chi-square p < 2e-16, Fig. 1c). For functional variants in the SLSJ and Finnish founder populations, although not significant, we observed a larger proportion of nonsynonymous, loss-of-function (LoF) and GERP++ > 4 lowfrequency variants (Supplementary Figures 4 and 5 and  Supplementary Tables 2 and 3), as well as a higher nonsynonymous/synonymous ratio trend and GERP++ > 4/ GERP++ < 2 in the low-frequency variants of the founder populations compared with the others (Supplementary Figure 6A and 7A). The trend was significant when looking at the mean ratios of variants per sample for each population (ANOVA p = 3.72e-10 and p = 1.68e-9; Supplementary Figure 6C and 7C). We also looked at the enrichment of deleterious variants in the founder populations as compared with the non-founder populations ( Supplementary Figure 8). SLSJ has significant excess of nonsynonymous lowfrequency variants as compared with FR and UK (chisquare p = 8.3e-3 and p = 0.019; respectively, Supplementary Figure 8A) and excess low-frequency GERP++ > 4 variants than FR and UK (chi-square p = 2.9e-3 and p = 0.048; respectively, Supplementary Figure 8C). Similar patterns were observed for FINN; FINN has excess low-frequency nonsynonymous variants than FR (chisquare p = 0.048, Supplementary Figure 8B) and excess low-frequency GERP++ > 4 variants than FR and UK (chisquare p = 5.6e-4 and p = 0.014, respectively; Supplementary Figure 8D). Similar patterns were observed for population-specific variants (Supplementary Figure 9). No difference was observed for the average GERP++ score per sample distribution among populations (Supplementary  Figure 10). However, greater average GERP++ scores in the low-frequency variants were observed in SLSJ and FINN as compared with FR and the UK (ANOVA p = 1.8e-7, Supplementary Figure 11A). Overall, we observed a higher proportion of population-specific variants reaching higher frequencies in the founder populations. We also observed a tendency of enrichment for more deleterious variants in the founder population, especially in the lowfrequency spectrum.
Low-frequency variant located in CXCR6/FYCO1 is associated with eosinophil percentage in the SLSJ founder population We assessed the individual effect of the coding and noncoding low-frequency (0.01 < MAF < 0.05) that were not common in the UK10K [39] or 1KGP [40]. A total of 15,294 SNVs (Bonferroni threshold: p < 3.26e-6) were analyzed for association with five asthma-related traits in To assess significance, chi-square test (p < 0.05) and Cramer's V ( > 0.15) were performed for a and c and ANOVA followed by Tukey were performed for b and d. a = significantly different from FINN, b = significantly different from SLSJ, c = significantly different from FINN and SLSJ. In c, SWE is also significantly different from UK and FR. SLSJ Saguenay-Lac-Saint-Jean, FINN Finland, FR France, SWE Sweden, UK United Kingdom the SLSJ cohort. We observed a significant association (p < 3.26e-6) with eosinophil percentage for SNV rs1386931:C > T (p = 1.77e-6, effect size = 1.534), located in the 3ʹ-UTR of CXCR6 and in the intron of the FYCO1 genes (Table 3 and Fig. 2c). We also observed another SNV reaching suggestive significance (p < 1e-5) with serum IgE levels and it is located in the intron of the NRP2 gene (rs849558:T > C, p = 4.79e-6, effect size = −1.243; Table 3 and Fig. 2a). No variants were identified to be associated with lung function. Both SNVs were also found in 1KGP and UK10K. The SNV associated with eosinophil percentage (rs1386931:C > T) has a higher MAF in SLSJ (0.043) compared with the one observed in 1KGP (0.021) and UK10K (0.019). The other variant (rs849558:T > C) had slightly higher frequency in SLSJ (0.019) compared with UK10K (0.011).
NRP2 and MRPL44 are associated with serum IgE levels and eosinophil percentage in the SLSJ cohort, respectively For the gene-based test, we combined rare and lowfrequency variants per gene, including its surrounding 20 kb (including only immune DHSs) and only included gene regions that had at least two variants for a total of 14,646 genes (p < 3.4e-6 or FDR 5%). Two genes were significantly associated with serum IgE levels (NRP2, p = 3.16e-6; Fig. 3a and Table 4) and eosinophil percentage (MRPL44; p = 2.97e-6; Fig. 3c and Table 4). One SNV lead each association: a rare one for MRPL44 (rs76568361: T > G) and a low-frequency one for NRP2 (rs849558:T > C; Table 4). The latter almost reached significance in the single-variant association test (p = 4.79e-6, Table 3). Moreover, we identified four marginally associated genes (p < 1e-5, Supplementary Table 4). SHMT1 and SMCR8 were marginally associated with eosinophil percentage (p = 2.97e-6 and p = 6.21e-6; Fig. 3c) with the rare nonsynonymous variant rs79875842:A > G being the lead SNV for both genes (Supplementary Table 4); CCDC126 and CLK2P were marginally associated with FEV 1 /FVC (p = 4.62e-6; Supplementary Figure 12A) with rare nonsynonymous variants (rs73077128:G > A and rs146336907:C > T) for the pseudogene CLK2P (Supplementary Table 4).

DNA methylation in associated genes
To support our associations observed from single variants and gene-based association test, we performed DNA methylation analyses of CpG located ± 20 kb from the associated genes (Supplementary  Table 5).

Discussion
In our study, we characterized the variants distribution of rare and low-frequency variants in the SLSJ population, the enrichment of deleterious variants and their impacts on asthma and allergy-related traits. In terms of the variants distribution, overall observations in FINN were consistent with the published findings of another Finnish population genetic study [14]; higher proportions of population-specific rare and low-frequency variants and fewer rare variants as compared with non-founder populations. Unlike the published findings [14], no differences in proportions were detected in FINN when variants were stratified by function; it is likely due to small sample sizes. Similar to the findings of another French-Canadian population genetic study [13], a higher proportion of low-frequency variants was observed compared with non-founder populations. However, contrary to the same study [13], we observed a smaller proportion of rare population-specific variants in SLSJ and no difference in the proportion of low-frequency nonsynonymous variants as compared with non-founder populations; the latter likely due to a small sample size. However, a larger proportion of population-specific variants reached higher frequencies in the founder populations, reflecting the genetic drift associated with them. Our results are consistent with previous studies in terms of enrichments of deleterious variants in founder populations, especially in the low-frequency MAF minor allele frequency, SE standard error, IgE immunoglobulin E, UTR untranslated region, NRP2 neuropilin 2, CXCR6 C-X-C motif chemokine receptor 6 a p-Value in bold is those reaching the significance threshold of 3.3e-6 spectrum of variants [13,14]. Various factors may contribute to the differences in the findings between our study and others. The strength of our assessment resides in comparing two founder and three non-founder populations with identical experimental procedures and samples paired based on mean coverage. By investigating low-frequency variants in SLSJ, new genes have been identified for their associations with asthma or allergy-related traits. Using variant-based analyses, FYCO1, CXCR6 and NRP2 shed new light to mechanisms influencing allergy. Variant rs1386931:C > T is located in an intron of FYCO1 and 3ʹ-UTR of the CXCR6.
FYCO1 encodes for a protein that plays a role in the transport of autophagic vesicles [43], and autophagy and its genes were associated with asthma [44][45][46][47][48]. CXCR6 encodes for a chemokine receptor expressed on the surface of multiple immune cell types and was associated with asthma and Th2 inflammation in the lung [49,50]. Variant rs849558:T > C is located in an intron of NRP2. It is a transmembrane receptor implicated in multiple processes, including antigen presentation, phagocytosis and cell-cell interaction [51]. This gene was also associated in the genebased test; being led by the same associated SNV and was supported by DNA methylation association. Given that Fig. 2 Manhattan and qqplot plot for low-frequency single-variant association test for a serum IgE levels (lambda = 1.01, with associated qqplot (b)); and c eosinophil percentage (lambda = 1.08, with associated qqplot (d)). Significance cut-off are shown in blue (p < 1e-5) and red (Bonferroni, p < 3.3e-6) serum IgE levels and eosinophil percentage are associated with various allergic diseases, we tested variants rs849558: T > C and rs1386931:C > T for associations with additional disease phenotypes: asthma, atopy, allergic asthma, rhinitis and atopic dermatitis (Supplementary Table 6). NRP2 variant rs849558:T > C was marginally associated with atopy and allergic asthma. CXCR6/FYCO1 variant rs1386931:C > T was associated with atopic dermatitis. These findings point at plausible disease mechanisms of allergic diseases. Using a gene-based approach, there are now suggestive roles of MRPL44 and SERPINE2 in mediating eosinophil percentage. MRPL44 is implicated in protein synthesis in mitochondria and that mitochondria play an important role in eosinophil apoptosis and survival [52]. Moreover, the lead SNV for MRPL44 is located in the promoter region of the SERPINE2 gene, for which the association was suggestive (p = 1.47e-5). SERPINE2 is a serine protease inhibitor and is a known susceptibility gene for chronic obstructive pulmonary disease [53], emphysema [54] and asthma [55]. Our findings suggest a novel role of SER-PINE2 in eosinophils.

Conclusion
In this study, we showed that founder populations are enriched with deleterious low-frequency variants and that they present population-specific variants with higher frequencies. By examining rare and low-frequency variants for their associations with asthma and allergy-related traits in a founder population, we identified new genes worthy of further study. One of the lead SNV in the gene-based test was specific to the SLSJ population highlighting the importance of using sequencing data in founder population to identify new genes associated with complex traits. Other SNV also presented marginally higher frequency compared with the European populations. We also demonstrate the importance of addressing the non-coding regions of the genome by using sequencing studies, as three of the variants identified were non-coding and located in immune regulatory regions. Overall, we showed the usefulness of using a well-described founder population and the importance of assessing non-coding regions to better decipher the genetic basis of complex traits.