Genomics of Egyptian Healthy Volunteers: The EHVol Study

Comprehensive genomic databases offer unprecedented opportunities towards effective tailored strategies for the prevention and treatment of disease. The integration of genomic and phenotypic data from diverse ethnic populations is also key to advancements in precision medicine and novel diagnostic technologies. Current reference genomic databases, however, are not representative of the global human population, making variant interpretation challenging and uncertain, especially in underrepresented populations such as the North African population. To address this, a study of 391 Egyptian healthy volunteers (EHVols) was initiated as a milestone towards establishing the 1000 Egyptian Genomes project.


INTRODUCTION
Cardiovascular disease is a major cause of death and disability worldwide (Murray & Lopez, 2017; WHO, 2017), and its prevalence continues to increase in low- and middle-income countries toward epidemic proportions (Roth et al., 2017; Yusuf et al., 2014). Effective tailored strategies for prevention and treatment depend on a thorough understanding of the mechanisms involved in specific populations. The rapid evolution of genomic and personalised precision medicine offers unprecedented opportunities in this regard (Manolio et al., 2009; O'Donnell & Nabel, 2011). These, however, are critically dependent on defining the genetic landscape of different populations, their individuals and the relation to their dynamic phenotype (Lau & Wu, 2018; Leopold & Loscalzo, 2018). In-depth information is lacking in populations which need it most.

Figure 1. Workflow of the EHVol study. Study participants (1) were recruited from the general population via announcements (brochures, flyers, public events); (2) completed a questionnaire on demographic data, family and clinical history; (3) underwent detailed cardiovascular phenotyping and blood sampling. (4) All data were recorded and managed on a local REDCap database. Rounds of exclusion: the first round of exclusion was based on the demographic and general health questionnaire, as described under 'study protocol and data collection'. The second round of exclusion was based on detailed cardiovascular phenotyping, as described under 'cardiovascular phenotyping' in the methods section.

Cardiovascular Phenotyping
All individuals underwent detailed cardiovascular phenotyping including clinical examination, 12-lead electrocardiogram (ECG) and cardiovascular magnetic resonance (CMR) (Figure 1). CMR was performed with a 1.5 T scanner (Siemens Magnetom Aera, Erlangen, Germany) using retrospective ECG triggering to capture the heart during the cardiac cycle. Steady State Free Precession (SSFP) end-expiratory breath-hold cine images were acquired in the short-axis orientation covering the whole heart. Standard parameters were: repetition time/echo time, 3.6/1.8 ms; SENSE factor, 2; flip angle, 60°; section thickness, 8 mm; no slice gap; matrix, 160 × 256; field of view, 300 mm; pixel size, 1.6 × 1.6 mm; number of phases, 30; and phase percentage, 67%. For future comparison with specific disease-based sub-cohorts, phase contrast images were acquired at different aortic and levels for flow mapping. At the same acquisition levels, 4D flow was performed to assess flow patterns. T1 mapping was performed at base, mid and apical heart levels for fibrosis assessment. 3D tagging acquisition was done at base, mid and apical levels for strain assessment. Detailed structural and functional analysis of the CMR acquisitions was performed retrospectively using dedicated post-processing and in-house software. Following phenotyping, a second round of exclusions, on the basis of specific cardiovascular diagnostic criteria, was applied (Table 1). Isolated apical noncompaction with normal ECG and normal CMR was not excluded.

*Cutoff thresholds were determined as 3 standard deviations from the mean to be more inclusive of extremes, especially in the absence of normal CMR reference values specific to the Egyptian population (Kawel-Boehm et al., 2015).
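The 3-standard-deviation cutoff rule described above can be illustrated with a minimal sketch. The function names and the reference mean and SD used here are hypothetical placeholders, not values or code from the study.

```python
def sd_cutoffs(mean, sd, k=3.0):
    """Return (lower, upper) cutoffs lying k standard deviations from the mean."""
    if sd < 0:
        raise ValueError("standard deviation must be non-negative")
    return mean - k * sd, mean + k * sd


def is_outside(value, mean, sd, k=3.0):
    """True if a measurement falls outside the k-SD reference interval."""
    lower, upper = sd_cutoffs(mean, sd, k)
    return value < lower or value > upper


# Hypothetical reference values for an arbitrary CMR measurement (not study data).
REF_MEAN, REF_SD = 100.0, 10.0
print(sd_cutoffs(REF_MEAN, REF_SD))
print(is_outside(135.0, REF_MEAN, REF_SD))
```

A wider multiplier (3 SD rather than the more common 2 SD) keeps more extreme but plausibly normal measurements inside the reference interval, which matches the stated aim of being inclusive.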

Sample Collection and Biobanking
Twenty millilitres of whole venous blood was withdrawn from each participant for laboratory testing (Hemoglobin A1c (HbA1c) and Troponin I), serum/plasma isolation and DNA extraction. For DNA extraction, blood samples were transferred to K3EDTA tubes to avoid clotting. Blood samples were stored at 4 °C (max. 5 days) prior to DNA extraction. DNA was extracted using the Wizard® Genomic DNA Purification Kit (Promega, Catalog No. A1620) according to the manufacturer's instructions. The concentration of a 1 μl DNA sample was determined using the NanoDrop 2000 (Thermo Scientific) spectrophotometer. The 260/280 and 260/230 nm absorbance ratios were used to assess DNA quality. All samples were stored centrally in the AHC Biobank.

Comparison of rare variation between EHVol and Caucasian HVol (CHVol) controls
A cohort of 1,028 Caucasian healthy volunteers (CHVols), also sequenced using the TruSight Cardio Sequencing Kit for ICC, was analysed (Schafer et al., 2017; Pua et al., 2016). The CHVols were recruited prospectively via advertisement for the UK Digital Heart Project at Imperial College London. All volunteers underwent CMR to confirm the absence of cardiac disease. The frequency of rare variation in the selected CM genes was compared between the EHVol and CHVol cohorts. The threshold maximum credible population allele frequencies (AF) were defined as ≤8.4 × 10⁻⁵ and ≤4.0 × 10⁻⁵ for DCM and HCM, respectively. Variants were defined as rare if the filtering allele frequency (FAF) was at or below these thresholds across all gnomAD populations (popmax FAF) (Whiffin et al., 2017). The frequency of rare variants per gene in the EHVol/CHVol cohorts was calculated by counting the number of rare variants per gene and dividing this by the cohort size (EHVols: n=391, CHVols: n=1,028).
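The filtering and per-gene frequency calculation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the tuple layout, the choice to treat variants absent from gnomAD as rare, and the example calls are all assumptions. Only the two thresholds and the cohort size come from the text.

```python
from collections import Counter

# Thresholds quoted in the text (upper limits on popmax FAF).
MAX_CREDIBLE_AF = {"DCM": 8.4e-5, "HCM": 4.0e-5}


def is_rare(popmax_faf, disease):
    """A variant counts as rare if its popmax FAF does not exceed the threshold.

    Assumption: variants absent from gnomAD are passed as None and kept as rare.
    """
    return popmax_faf is None or popmax_faf <= MAX_CREDIBLE_AF[disease]


def rare_variant_frequency(calls, disease, cohort_size):
    """Per-gene frequency: number of rare variant calls divided by cohort size.

    `calls` is an iterable of (gene, popmax_faf) tuples, one per variant
    observed in a participant.
    """
    counts = Counter(gene for gene, faf in calls if is_rare(faf, disease))
    return {gene: n / cohort_size for gene, n in counts.items()}


# Hypothetical calls, for illustration only (not study data).
example_calls = [
    ("TTN", 1.0e-5),
    ("TTN", None),    # absent from gnomAD
    ("TTN", 3.0e-4),  # too common, filtered out
    ("MYH7", 2.0e-5),
]
print(rare_variant_frequency(example_calls, "DCM", cohort_size=391))
```

Dividing by cohort size rather than by the number of variants makes the EHVol (n=391) and CHVol (n=1,028) figures directly comparable despite the difference in cohort size.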

Characteristics of the EHVol Study Population
Of the 440 recruited individuals, 40 met the first-round exclusion criteria and were therefore excluded. The remaining individuals underwent CMR (n=400) and ECG (n=349) screening (Table 2). Based on the second round of exclusion described above, nine individuals were excluded from the cohort (Supplementary Table S2). A final cohort of 391 EHVols was sequenced. The baseline characteristics of the study population are summarized in Table 3. The study population comprised 166 females (42.5%) and 225 males (57.5%). The mean age was 33.2 years (SD 9.5).
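As a quick consistency check, the cohort attrition and sex percentages reported above can be recomputed from the stated counts. The variable names and the `pct` helper are illustrative only.

```python
recruited = 440
excluded_round1 = 40  # questionnaire-based exclusions
excluded_round2 = 9   # phenotyping-based exclusions

phenotyped = recruited - excluded_round1
final_cohort = phenotyped - excluded_round2

females, males = 166, 225


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(phenotyped, final_cohort)
print(pct(females, final_cohort), pct(males, final_cohort))
```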

Representation of EHVol genetic variation in gnomAD
A total of 2,040 CM variants were identified in the EHVol cohort (Supplementary Table S3). To assess the representation of these variants in the gnomAD dataset, we plotted the observed allele counts (ACs) in gnomAD against the ACs in EHVol for all variants identified in our cohort. These EHVol variants were also binned by AC to report the proportion that was captured in gnomAD (Figure 2A). Of the 2,040 EHVol CM variants, 1,544 (75.7%) were captured in gnomAD, whereas 496 (24.3%) were absent from gnomAD (Table 4). The majority of non-gnomAD variants (n=335) fell into AC bin 1. The remaining non-gnomAD variants constituted <10% of each AC bin (Figure 2A). Non-gnomAD variants were predominantly missense (29.8%) and "other" (56.5%) (Figure 2B). Of the 496 non-gnomAD variants identified in the EHVol cohort, 11.3% were present in the Greater Middle East (GME) Variome (Scott et al., 2016). The proportion of non-gnomAD variants was significantly higher in the EHVol cohort than in the CHVol cohort (24.3% vs 18.3%, respectively; Fisher's exact test, p < 1.183 × 10⁻⁷) (Table 4).
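The binning step described above can be sketched as follows. The bin edges, function names and the example (cohort AC, gnomAD AC) pairs are assumptions made for illustration; only the overall counts (2,040 variants, 1,544 captured) come from the text.

```python
from collections import defaultdict


def ac_bin(ac):
    """Assign a cohort allele count to a bin; these bin edges are assumptions."""
    if ac == 1:
        return "1"
    if ac <= 5:
        return "2-5"
    if ac <= 10:
        return "6-10"
    return ">10"


def capture_by_bin(variants):
    """Proportion of variants found in gnomAD within each cohort AC bin.

    `variants` is an iterable of (cohort_ac, gnomad_ac) pairs, where a
    gnomad_ac of 0 means the variant is absent from gnomAD.
    """
    totals = defaultdict(int)
    captured = defaultdict(int)
    for cohort_ac, gnomad_ac in variants:
        b = ac_bin(cohort_ac)
        totals[b] += 1
        if gnomad_ac > 0:
            captured[b] += 1
    return {b: captured[b] / totals[b] for b in totals}


# Overall proportions recomputed from the counts reported in the text.
total, in_gnomad = 2040, 1544
absent = total - in_gnomad
print(absent, round(100 * in_gnomad / total, 1), round(100 * absent / total, 1))

# Hypothetical (cohort AC, gnomAD AC) pairs, for illustration only.
toy = [(1, 0), (1, 12), (1, 0), (3, 40), (3, 0), (20, 5000)]
print(capture_by_bin(toy))
```

Binning by cohort AC separates singletons, which are expected to be missing from any reference dataset fairly often, from recurrent variants whose absence from gnomAD is more informative about underrepresentation.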

High frequency of rare variation in the EHVol cohort
The proportion of EHVol and CHVol controls with rare variants in DCM (popmax FAF ≤8.4 × 10⁻⁵) and HCM (popmax FAF ≤4.0 × 10⁻⁵) genes was calculated (Figure 3). In both cohorts, TTN, DSP, RBM20, MYH7 and SCN5A accounted for the majority of rare variation in DCM genes (Figure 3A). MYBPC3, MYH7 and CACNA1C accounted for the majority of rare variation in HCM genes in both cohorts (Figure 3B). Overall, the frequency of rare variants was higher among EHVols than among CHVols. The proportion of controls with LoF variants in CM genes was almost the same in the EHVol (2.3%) and CHVol (2.33%) cohorts.

Comparison of the distribution of rare TTN and MYH7 variants between the different control cohorts
We investigated the distribution of variants in TTN and MYH7, as they accounted for relatively high frequencies of rare variation among DCM and HCM genes, respectively. We restricted the analysis to rare LoF variants in TTN and to missense and LoF variants in MYH7 (Figure 4), as these may be putatively pathogenic. TTN consists of four regions, the Z-disk and the I-, A- and M-bands, each of which has a distinct function (Roberts et al., 2015). Of the four TTN regions, the I-band has the lowest expression level in the myocardium (i.e. percentage spliced in (PSI) <0.9), as I-band exons are variably spliced in different isoforms (Roberts et al., 2015). Across all cohorts, the majority of variants were located in exons that are not constitutively expressed in the myocardium (Figure 4A). One LoF variant (p.Ser34842ProfsTer9) and six variants (p.Glu35478Ter, p.Gln16235Ter, p.Gln15575Ter, c.44015-1G>T, p.Pro4353GlnfsTer14 and p.Gln3243Ter), identified in the EHVol and CHVol cohorts, respectively, resided in cardiac constitutive exons. These variants would have been interpreted as "likely pathogenic" (LP) had they been identified in individuals presenting with DCM (Roberts et al., 2015). The LoF variant p.Ser34842ProfsTer9 identified in the EHVol cohort is absent from gnomAD and from the CHVol controls. The EHVol carrying this LoF variant did not show a clinical DCM phenotype, but ECG analysis revealed sinus tachycardia.
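The exon-expression filter described above (LoF variants in exons with PSI >0.9) can be sketched as follows. The class, function names, variant labels and PSI values are hypothetical placeholders; the text does not report per-variant PSI values, and only the 0.9 cut-off is taken from it.

```python
from dataclasses import dataclass

PSI_CONSTITUTIVE = 0.9  # exons with PSI above this are treated as constitutive


@dataclass(frozen=True)
class TtnVariant:
    name: str
    consequence: str  # e.g. "LoF" or "missense"
    exon_psi: float   # percentage spliced in, as a fraction between 0 and 1
    cohort: str


def in_constitutive_exon(variant):
    """True if the variant's exon is constitutively expressed in the heart."""
    return variant.exon_psi > PSI_CONSTITUTIVE


def flag_candidates(variants):
    """Names of LoF variants in constitutive exons, grouped by cohort."""
    flagged = {}
    for v in variants:
        if v.consequence == "LoF" and in_constitutive_exon(v):
            flagged.setdefault(v.cohort, []).append(v.name)
    return flagged


# Placeholder records with made-up PSI values (not study data).
toy_variants = [
    TtnVariant("variant_1", "LoF", 0.98, "EHVol"),
    TtnVariant("variant_2", "LoF", 0.35, "EHVol"),
    TtnVariant("variant_3", "LoF", 1.00, "CHVol"),
    TtnVariant("variant_4", "missense", 0.99, "CHVol"),
]
print(flag_candidates(toy_variants))
```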

DISCUSSION
The availability of large-scale population genetic databases, such as gnomAD, has opened up new possibilities in identifying putative disease-causing variants. Current datasets, however, do not adequately represent genetic variation in the MENA region, which hinders accurate variant interpretation in the respective populations. Several recent studies address this shortcoming. Scott et al., for instance, studied rare genetic variation in the Greater Middle East region, including North East and North West Africa (Scott et al., 2016). However, individuals recruited in that study were from the general population (self-reported healthy volunteers) and thus may not necessarily be free of disease. To our knowledge, our study is the first of its kind in the region to include high-coverage sequencing data from a clinically phenotyped healthy cohort. We evaluated 440 self-declared healthy individuals, of whom 49 (11%) were excluded because they did not meet the first-round inclusion criteria (9%) or showed cardiac abnormalities (2%). The remaining healthy individuals, the EHVols (n=391), were genetically characterised using the ICC gene panel. Our analyses revealed that 24.3% of EHVol variants were not captured in gnomAD, the most widely used reference dataset at present. Of these non-gnomAD variants, 11.3% were captured in the GME database, confirming the underrepresentation of this genetic variation in current large-scale datasets.
In addition, we sought to compare genetic variation between our EHVols and an ethnically distinct population that is generally represented in gnomAD, the CHVol cohort. Our analysis showed that the proportion of non-gnomAD variants identified in the EHVol cohort was significantly higher than in the CHVol cohort. Furthermore, we examined the frequency of putative disease-causing variants in the EHVol and CHVol cohorts across the CM genes. The analysis showed that the proportion of controls with rare variation in the majority of DCM genes was higher in the EHVol cohort than in the CHVol cohort. Rare variation in MYH7 and MYBPC3, the key genetic contributors to HCM, was also higher in the EHVol cohort. These findings highlight the importance of studying the prevalence of putative disease-causing variants in large-scale, ethnicity-specific cohorts in order to confirm their pathogenicity in the respective populations. For example, a novel, non-gnomAD LoF TTN variant (p.Ser34842ProfsTer9) was identified in a 33-year-old EHVol. This variant lies in the M-band, in a constitutively expressed exon (PSI >0.9), a region that affects cardiac remodeling in DCM (Roberts et al., 2015; LeWinter & Granzier, 2013). Had this LoF variant been identified in a well-defined DCM cohort, it would have been interpreted as "likely pathogenic" (LP). Thus, it would be valuable to follow up EHVols carrying putative disease-causing variants to confirm their corresponding phenotypic manifestations. Integrating information from our EHVol cohort into genomic datasets and variant classification support tools, such as CardioClassifier, may affect the classification of detected variants.
Our analysis has provided preliminary insights into genetic variation in the underrepresented Egyptian population. This study constitutes an initial milestone towards developing a large-scale dataset comprising healthy EHVols from across the country. The development of an ancestry-specific genetic and phenotypic database will help address current issues with variant interpretation, deepen our understanding of African diversity and provide an opportunity for variant discovery. The expanded dataset will also directly aid in distinguishing between incidental and medically actionable variants and thus enhance diagnostic and treatment strategies. Beyond the ethnicity-specific aspects, this study provides valuable molecular and phenotypic data and a regional biobank. A limitation of this study is that the allele frequency thresholds used in our analysis may not accurately define rare variants in CM genes in our population, because these thresholds were pre-computed by Whiffin et al. (2017) based on genetic findings from a Caucasian CM cohort. In addition, future cohort studies might show that there are pathogenic variants above these thresholds, e.g. due to a founder effect. However, using different frequency thresholds in different populations would introduce an artifactual difference in rare variant burdens, as it would vary each population's definition of rare. A suitable future approach is to adjust filtering allele frequencies (FAF) whenever data from new populations are integrated into a large dataset.
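To make the dependence of these thresholds on cohort-derived inputs concrete, the sketch below assumes the dominant-inheritance form of the maximum credible allele frequency calculation from Whiffin et al. (2017): prevalence multiplied by the maximum allelic contribution, divided by twice the penetrance. The HCM input values are illustrative assumptions that are not stated in this text; they were chosen because they reproduce the 4.0 × 10⁻⁵ threshold quoted in the methods. The function name is hypothetical.

```python
def max_credible_af(prevalence, max_allelic_contribution, penetrance):
    """Maximum credible population allele frequency for a dominant disorder.

    The factor of 2 converts a per-individual prevalence into a
    per-chromosome frequency.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    return prevalence * max_allelic_contribution / (2 * penetrance)


# Illustrative HCM inputs (assumptions, not values given in the text).
hcm_threshold = max_credible_af(
    prevalence=1 / 500,
    max_allelic_contribution=0.02,
    penetrance=0.5,
)
print(f"{hcm_threshold:.1e}")
```

Because prevalence, allelic contribution and penetrance are each estimated from disease cohorts, a founder variant that accounts for a larger share of cases in one population would raise the maximum allelic contribution and hence the threshold appropriate for that population.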