Introduction

Heart failure (HF) is a complex clinical syndrome characterized by abnormal cardiac structure and function that leads to reduced cardiac output and elevated filling pressures at rest or with exertion1. Although, there is increasing evidence that the risk and course of HF depend on genetic predispositions2, genome wide association studies (GWAS) have identified only a handful of genetic variants associated with it. For instance, the chromosome region 9p21 includes several highly replicated genetic variants associated with HF risk factors (e.g. NC_000009.11:g.22096055A > G and NC_000009.11:g.22124477A > G)3.

For better understanding heritability of HF, some studies combine results of multiple cohorts and involve more samples in the analysis through meta-analysis4,5. One of the largest studies on African-American population identified four variants associated with left ventricular mass and left ventricular internal diastolic diameter respectively using Echocardiography6. A meta-analysis on 5 cohorts of individuals with European ancestry identified five genetic loci harboring common variants associated with left ventricular diastolic dimensions and aortic root size7. More recently, a meta-analysis of a large set of samples including 73,518 individuals identified 32 novel loci associated with electrocardiographic markers of hypertrophy as an important and independent risk factor for the development of heart failure8.

Taking advantage of dozens-to-hundreds of traits measured on each study participant creates opportunities to obtain insights into the biology of HF, and consequently reduces morbidity, and economic burden of HF. Multi-trait analysis is toward this aim and increases the statistical power9,10,11,12. Although there are many studies on multi-trait approaches, applications of those methods have recently received increased attention e.g.13,14,15. A limitation of those methods is their complexity due to the large number of parameters in the model.

To reduce the complexity of the multi-trait approaches due to the number of parameters, we integrated a Bayesian network16 with a Bayesian multi-trait polygenic mixed model while setting G-Wishart prior17 on inverse of relatedness matrix and called it Integrative Bayesian Multi-Trait (IBMT) approach. Using IBMT approach, we conducted an analysis to identify genomic variants influencing 10 echocardiographic traits related to cardiac structure and function from Atherosclerosis Risk in Communities (ARIC) study18. We have genotype data of 7810 European American individuals from baseline measurement while 3387 of these individuals have phenotype records in Visit 5. The phenotype data are also recorded for 1265 new participants at visit 5 who do not have baseline genotype. Thus, we had three sets of data, including (i) individuals with only genotype data, (ii) individuals with only phenotype data, and (iii) individuals with both genotype and phenotype data. We incorporated information from all these three sets into the analysis to improve the statistical power, prevent overfitting, and avoid using data multiple times. The details are provided in the Methods section. These steps ultimately improve reliability and generalizability of the results.

After data preparation, we applied the IBMT method over the whole exome sequence data to investigate genomic and cardiac trait relationships. We identified 3 genetic variants (NC_000009.11:g.116346115C> A, NC_000017.10:g.7802658C > T, NC_000017.10: g.73897977C > T) in genes RGS3, CHD3, and MRPL38 with significant impact on the cardiac traits. The variant in RGS3 gene (NC_000009.11:g.116346115C> A) showed pleiotropic gene action on vertical mass index, left ventricular volume index and maximal left atrial anterior-posterior diameter). RGS3 can inhibit TGF-beta signaling associated with left ventricle dilation and systolic dysfunction and codes for GTPase-activating protein that inhibits G-protein-mediated signal transduction19,20. CHD3 encodes a protein with a chromatin organization modifier domain and a SNF2-related helicase/ATPase domain21. MRPL38 produces a mitochondrial ribosomal protein, involved in the synthesis of proteins within the mitochondrion22,23.

Materials and Methods

Study population

Echocardiographic and genomic data were collected on a subset of the ARIC study, a biracial longitudinal cohort of 15,792 middle-aged individuals who were randomly sampled from four US sites (Forsyth County, NC; Jackson, MS; suburbs of Minneapolis, MN; and Washington County, MD) and have been measured for risk factor traits related to health and chronic diseases. A detailed description of the ARIC study design and methods have been published elsewhere18,24. The data presented here includes 7810 European American individuals with baseline genotype available on dbGAP (https://www.ncbi.nlm.nih.gov/gap), accession number phs000090.v5.p1. A subset of individuals with genotype data, 3387 out of 7810 individuals, has phenotype records at visit 5. The phenotype data described in the following subsection are also recorded for 1265 new participants at visit 5 who do not have baseline genotype. Figure 1 visualized study population with phenotype and genotype records through Venn diagram.

Figure 1
figure 1

Venn diagram of study population with phenotype and genotype records.

Echocardiographic methods and measurements

Echocardiograms were obtained from participants at visit 5 using a standardized protocol as recommended by the American Society of Echocardiography. Images were digitally transferred to the Cardiovascular Imaging Core Laboratory at Brigham and Women’s Hospital, Boston, MA, for offline analysis. The intra-observer variability (coefficient of variation and interclass correlation) for key echocardiographic measures has been previously published25. Images were obtained in the parasternal long- and short-axis and apical 2- and 4-chamber views. Primary measures of the traits such as left ventricular (LV) dimensions, volumes, and wall thickness; left atrial (LA) dimensions, volumes, and areas were made in triplicate from the 2-dimensional views in accordance with the recommendations of the American Society of Echocardiography26. This study includes 10 cardiac structure and function tabulated in Table 1. LV mass was calculated from LV linear dimension and indexed to body surface area. Relative wall thickness was calculated using the posterior wall thickness and LV end-diastolic dimension. LA volumes were measured by methods of disks using apical 4- and 3- chamber views. LV volumes were calculated from the apical 4- and 2- chamber views utilized the modified Simpson method.

Table 1 Cardiac structural and functional traits under study.

Whole exome sequencing

Whole exome sequencing was performed on samples with the Illumina HiSeq platform. Mercury pipeline is applied for variant calling27. The reads are mapped into the Genome Reference Consortium Human Build 37 (GRCh37) sequence. Low-quality variants are filtered if they were outside the exon capture regions, belonged to multi-allelic sites, had missing rate >20% and had mean depth of coverage >500-fold. In addition, highly significant departures from race-specific Hardy-Weinberg equilibrium (P-value < 5e-6) are excluded from the data.

Statistical methods

The IBMT as an integrative Bayesian approach takes into account the underlying relationship among multiple traits and seeks for their significant association with genetic variants through multi-trait polygenic mixed model as the following

$${{\boldsymbol{Y}}}_{{\rm{n}}{\rm{q}}\times 1}={{\boldsymbol{\mu }}}_{{\rm{n}}{\rm{q}}\times 1}+({{\boldsymbol{X}}}_{{\rm{n}}\times {\rm{p}}}\otimes {{\boldsymbol{I}}}_{{\rm{q}}}){{\boldsymbol{\beta }}}_{{\rm{p}}{\rm{q}}\times 1}+{{\boldsymbol{U}}}_{{\rm{n}}{\rm{q}}\times 1}+{\epsilon }_{{\rm{q}}{\rm{n}}\times 1},\epsilon \sim N({\bf{0}},{\rm{\Sigma }})$$
(1)

Each entry of \({\boldsymbol{Y}}={\{{y}_{ij}\}}_{\begin{array}{c}i=1,\ldots ,q\\ j=1,\ldots ,n\end{array}}\) represents ith trait recorded for jth individual, βik is the effect of kth genomic variants on ith trait in coefficient vector \({\boldsymbol{\beta }}={\{{\beta }_{ik}\}}_{\begin{array}{c}i=1,\ldots ,q\\ k=1,\ldots ,p\end{array}}\), and vector \({\boldsymbol{U}}={\{{u}_{ij}\}}_{\begin{array}{c}i=1,\ldots ,q\\ j=1,\ldots ,n\end{array}}\) includes random effects corresponding to each sample  where U~N(0, Ψ) and Ψ is a relatedness matrix. The inverse of relatedness matrix (Ψ−1) represents the conditional dependence of samples between and within traits.

In large scale settings, Ψ imposes numerous parameters into the model1 and slows down the analysis. To overcome this limitation, we proposed to integrate the Bayesian network and the multi-trait polygenetic mixed model since Bayesian network represents the conditional dependence of the traits through graphical representation. Therefore, the IBMT first builds the Bayesian network among the traits of interest and then set the parameters in Ψ−1 to zero if their corresponding edges in Bayesian network are missing. To impose this sparsity to the model in posteriori, we set the G-Wishart prior on Ψ−1. The G-Wishart places no probability mass on zero entries of Ψ−1 and accordingly it reduces the number of parameters of the model and optimizes the performance and computational time of Gibbs sampling scheme. Further details of the model and the Gibbs sampling scheme are provided in the Supplementary, statistical methods section, although in the following, we presented general guidline on the IBMT application.

Adjusting covariates

There is a broad consensus on analytic techniques for covariate adjustment to discover genomic variants associated with traits of interest independently of the correlated covariates, and to improve statistical power by gaining precision. The set of covariates typically include principal components of individual genotypes to account for population structure, and correlated environmental or demographic factors such as gender and age. However, some of those covariates may not have significant impact on the traits and adjusting for those covariates simply leads to a loss of power and cause variance inflation of the effects. Therefore, instead of adjusting each trait for all routine covariates, we first investigated the effect of covariates on the traits using 1265 individuals with only phenotypic record. This not only prevents decreasing the statistical precision or contaminating data due to adjusting for unrelated covariates but also avoids using data twice which prevents overfitting.

Applying the Integrative Bayesian Multi-Trait approach (IBMT)

We first identified underlying relationships of the 10 cardiac functional and structural traits tabulated in Table 1 via application of Bayesian networks. We calculated hamming distance to measure structural similarities of networks based on different statistical significant levels (0.005, 0.01, 0.05) and found 0.05 as the best significant level to construct the network in our analysis28,29,30. The identified network, which is also supported by clinical background knowledge, revealed the sparsity level of the relationships. Figure 2 displays the network, where the nodes are the traits and the edges represent a significant relationship between the two corresponding traits after excluding the effect of the other traits. Clustering approaches31,32,33 can be also applied for the same purpose, although they may estimate more connections among the traits. To avoid overfitting, we built the Bayesian network using the subset of individuals with only phenotype records.

Figure 2
figure 2

The Bayesian network over the cardiac traits. The colors correspond to the degree of connectivity of each trait; darker color means greater connectivity.

Using the identified Bayesian network over the traits, we implemented the IBMT method and analyzed the whole exome sequencing data including 260,688 variants based on sliding window. Each window included 100 variants with a step size of 25, such that each variant appears in 4 windows. If a variant is selected in all 4 windows, we reported the variant as the most promising variant. The parameters of the model were updated at each iteration of Markov Chain Monte Carlo (MCMC) algorithm and estimated with posterior means, which calculate optimal point estimators under square error loss, after 200 burn-in period.

To identify genomic variants significantly associated with the traits, we calculated 98% credible interval34,35, (qL, qU), for the effects of all the variants to test a null hypothesis of no genomic effects. The endpoints of the intervals correspond to quantile (q) of the empirical distribution of the MCMC drawn from the marginal posterior distribution of genetic effects. A desired degree of precision for the endpoints of intervals is achieved by running a number of iterations until

$$P(|{q}_{i}^{L}-\,{q}_{i-1}^{L}|) < \zeta \,{\rm{\& }}\,P(|{q}_{i}^{U}-\,{q}_{i-1}^{U}|) < \zeta ,$$

where qL and qU are lower and upper quantile respectively and i represents the number of iterations. We set \({\rm{\zeta }}\) to 0.01 as a small value.

Results

Since the IBMT method is based on a linear polygenic mixed model, we first tested the normality of the traits as the main assumption. Except the ejection fraction and the maximal left atrial anterior-posterior diameter that are normally distributed, the other traits were transformed to normal using log transformation. The histograms of the traits after winsorization, standardization, and log transformation are represented in Supplementary Figure 1.

We then investigated the effects of gender, age, ever-smoked, body mass index (BMI), hypertension, systolic and diastolic blood pressure on the traits. Among them gender and BMI showed highly significant relationships (P-value < 1e-8) with all traits except the mean LV Wall Thickness (LV-WT) and the ejection fraction, which they were relatively less significant with P-values 0.05 and 0.005 respectively. Hypertension also showed significant effects (P-value < 1e-6) with all traits except the ejection fraction. We obtained these results on the set of individuals without genotype data to avoid the use of data twice. To generalize the results to the set of interest (individuals with both genotype and phenotype records), we compared BMI distribution, gender ratio (Female/Male), and ratio of (with/without) hypertension in the two sets. We observed that in both sets, the distribution of BMI is similar (Figure 3); gender ratios 1.302 and 1.387 showed almost the same proportion of female to male; and hypertension ratios 2.502 and 2.579 also showed almost the same proportion of individuals with and without hypertension. Therefore, we adjusted the traits for BMI, gender, and hypertension, in addition to the first 10 PCs from population stratification analysis.

Figure 3
figure 3

Histogram of BMI. Right: individuals without genotype data. Left: individuals with genotype data.

Applying IBMT, we first identified Bayesian network among 10 traits listed in Table 1. As shown in Figure 2, underlying relationships among the traits are sparse. Therefore, we do not need to consider all pairwise connectivity in the analysis. Incorporating this result into multi-trait mixed model, we reduced the number of parameters in the model and consequently increased the power of genotype-phenotype association identification (Supplementary Statistical methods). We could identify 3 genetic variants with significant impact on 4 cardiac traits using a 98% credible interval (Table 2 and Figure 4). Minor allele frequencies (MAF) in Table 2 show the identified variants are rare. The estimated effects of the identified rare variants are reported in Table 3.

Table 2 Selected genomic variants related to the traits, using a 98% Bayesian credible interval.
Figure 4
figure 4

Identified genetic pathway to cardiac and structure and function using IBMT.

Table 3 Estimated effect (Est-Eff) and standard deviation (SD-Eff) of the identified genes with significant effect on the traits.

Figure 5 shows empirical distributions of LV-MI (red/blue) for individuals with reference/alternative allele of one of the identified variants (NC_000017.10:g.73897977C > T). The noticeable shift of the distribution for individuals with mutation is observable. Supplementary Figure 2 shows levels of the traits for individuals with identified rare mutation over distribution of the traits.

Figure 5
figure 5

Empirical Distributions of log (LV-MI) for individuals with reference/alternate alleles in NC_000017.10:g.73897977C > T variant.

One of the identified variants located in chromosome 9 (NC_000009.11:g.116346115C > A) showed pleotropic effect on LV-MI, LA-VI and Max-LA-APD. Among 14 individuals (5 Females and 9 Males) having rare mutation in RGS3 (NC_000009.11:g.116346115C > A) with pleiotropic action, two of the females and one male showed different patterns with other individuals, such that their level of LV-MI, LA-VI and Max-LA-APD are not greater than third quartiles (Supplementary Figure 2). This suggests that the rare mutation identified in RGS3 may have higher impact on males. In addition, pleiotropic effect of this variant may represent different functions of the RGS3 gene.

According to the Variant Effect Predictor (VEP) analysis36, this variant is most likely a missense mutation in RGS3 (regulator of G protein signaling 3) gene  that could yield a codon change from ACC to AAC replacing the amino acid Threonine (Thr) by an Asparagine (Asn) in the protein sequence. RGS genes encode GTPase activating proteins (GAPs) and down-regulate G protein signaling. More details about the results from the VEP analysis is provided in Supplementary Table 1.

Another identified variant with impact on PLAx-IST is in chromosome 17 (NC_000017.10:g.7802658C > T). Individuals with T allele of this rare variant, including 2 Females and 8 Males, all have PLAx-IST level greater than third quartile of the trait (Supplementary Figure 2).

The variant NC_000017.10:g.7802658C > T is intronic to Chromatin Helicase DNA Binding Protein 3 (CHD3) gene based on VEP analysis (Supplementary Table 2). CHD3 belongs to a family of genes coding for proteins that bear CHROMO (chromatin organization modifier) and SNF2-related helicase/ATPase domains. CHD3 protein is one of the components of the Mi-2/NuRD (histone deacetylase) complex that participates in the remodeling of chromatin structure via histone deacetylation.

The third identified genetic variant influencing LV-MI is located in chromosome 17, gene MRPL38 (NC_000017.10:g.73897977C > T). All 8 male individuals with rare mutation in MRPL38 have LV-MI level greater than third quartiles of distribution (Supplementary Figure 2). However, the 2 female individuals do not show any different pattern.

The variant NC_000017.10:g.73897977C > T can be in a non-coding exon or be a missense mutation of MRPL38 gene depending on splicing. The missense mutation of CGG into CAG codon causes the substitution of amino acid Arginine (Arg) by a Glutamine (Gln) concluded. Gln is a non-charged amino acid and smaller than Arg with putatively less capacity to create hydrogen bonds and favorable electrostatic interactions than Arg (see Supplementary Table 3 for VEP results). MRPL38 encodes mitochondrial ribosomal proteins (MRP). The MRP family stabilizes mitochondrial ribosome (mitoribosome) and are responsible for the mitochondrial translation of 13 protein components of the Oxidative Phosphorylation (OXPHOS) gene complex in the mitochondrial DNA23.

Discussion

Single trait analysis did not identify any genetic variants with significant impact on the 10 considered traits of cardiac structure and function in European American individuals from ARIC study. Therefore, we integrated Bayesian network and Bayesian multi-trait approach to improve the performance of the analysis. This Integrative Bayesian Multi-Trait (IBMT) approach provides a sparse relatedness matrix and, eventually, more precise estimates of parameters in the model. Utilizing the IBMT method to increase the power of identification, in addition to carefully adjusting for covariates to avoid data contamination, and choosing the appropriate transformation function for each trait, we identified three significant genetic variants. These variants located in RGS3, CHD3, and MRLP38 genes are rare, hence a high statistical power is required to detect the association with the trait(s)37,38,39,40. Among those, the variant NC_000009.11:g.116346115C > A in RGS3 showed pleiotropic action on vertical mass index, left vertical volume index,  maximal left atrial anterior-posterior diameter (Figure 4).

The variant NC_000009.11:g.116346115C > A is in the exon of gene RGS3 (Regulator of G protein signaling 3) which belongs to RGS family. RGS family codes for proteins that act as GAPs and down-regulate G protein signaling. Many studies have proven that RGS gene expression is highly regulated in myocardium41,42,43. Quantitative messenger RNA (mRNA) analysis revealed that RGS3 is most highly expressed in human heart44,45,46. The N-terminus of RGS3 can inhibit TGFβ induced differentiation of pulmonary fibroblasts which is associated with left ventricular dilation and systolic dysfunction19,20,47. The identified variant in RGS3 was associated with pleotropic effect on LV mass (LV-MI) and measures of left atrial size (LA-VI and Max-LA-APD). The left atrial size may reflect the cumulative effects of increased LV filling pressure and diastolic function48 and is a predictor of heart failure, ischemic stroke, and death. Thus, genetic variants contributing to abnormalities of LV mass and worsened diastolic function would be expected to potentially be associated with LA size.

As missense mutation, NC_000009.11:g.116346115C > A could yield a codon change from ACC to AAC, which replaces a Threonine (Thr) to an Asparagine (Asn) in the protein amino acid sequence. The change from Thr to Asn does not alter side chain electrostatic charge because of both amino acids being electrostatically neutral. This could eventually affect hydrogen bond pattern since Asn has an extra hydrogen bond donor group (NH2). However, it is difficult to evaluate the final effect in the protein three-dimensional structure since there may be alternative spliced transcripts. If this mutation happened at the interaction interface between RGS3 and Gα subunit, it could eventually affect GTPase activity.

The variant NC_000017.10:g.7802658C > T that is significantly associated with parasternal long axis interventricular septum thickness (PLAx-IST) is in an intron of the CHD3 gene, which codes for the Chromatin Helicase DNA Binding Protein 3 as a part of the chromatin structure remodeling complex. This is a potential splicing variant that could affect the rate of mature mRNA synthesis, and ultimately impact gene transcription.

The other identified variant NC_000017.10:g.73897977C > T is within MRPL38 gene, a member of Mitochondrial ribosomal proteins (MRP) family that are part of the large subunit of the mitochondrial ribosome. As missense mutation, the change of CGG into CAG codon causes the substitution of amino acid Arginine (Arg) by a Glutamine (Gln). Gln is a non-charged amino acid and smaller than Arg with putatively less capacity to create hydrogen bonds and favorable electrostatic interactions than Arg. If the amino acid mutation takes place at the interface between MRPL38 and mitochondrial ribosome, it could decrease binding affinity and destabilize mitochondrial ribosome. This in turn may reduce ribosomal protein synthesis levels and affect the oxidative phosphorylation pathway.

Overall, this study suggests that rare mutation might provide a better understanding of genetic impact on cardiovascular structure and resulting in remodeling cardiovascular disease and/or heart failure, although further studies are required.