Association of protein function-altering variants with cardiometabolic traits: the strong heart study

Clinical and biomarker phenotypic associations for carriers of protein function-altering variants may help to elucidate gene function and health effects in populations. We genotyped 1127 Strong Heart Family Study participants for protein function-altering single nucleotide variants (SNV) and indels selected from a low coverage whole exome sequencing of American Indians. We tested the association of each SNV/indel with 35 cardiometabolic traits. Among 1206 variants (average minor allele count = 20, range of 1 to 1064), ~ 43% were not present in publicly available repositories. We identified seven SNV-trait significant associations including a missense SNV at ABCA10 (rs779392624, p = 8 × 10–9) associated with fasting triglycerides, which gene product is involved in macrophage lipid homeostasis. Among non-diabetic individuals, missense SNVs at four genes were associated with fasting insulin adjusted for BMI (PHIL, chr6:79,650,711, p = 2.1 × 10–6; TRPM3, rs760461668, p = 5 × 10–8; SPTY2D1, rs756851199, p = 1.6 × 10–8; and TSPO, rs566547284, p = 2.4 × 10–6). PHIL encoded protein is involved in pancreatic β-cell proliferation and survival, and TRPM3 protein mediates calcium signaling in pancreatic β-cells in response to glucose. A genetic risk score combining increasing insulin risk alleles of these four genes was associated with 53% (95% confidence interval 1.09, 2.15) increased odds of incident diabetes and 83% (95% confidence interval 1.35, 2.48) increased odds of impaired fasting glucose at follow-up. Our study uncovered novel gene-trait associations through the study of protein-coding variants and demonstrates the advantages of association screenings targeting diverse and high-risk populations to study variants absent in publicly available repositories.

www.nature.com/scientificreports/ to electronic medical records of patients from a large health care organization, the DiscovEHR study identified novel associations of heterozygous LOF variants in CSF2RB with blood cell counts (basophil and eosinophil), LOF variants in EGLN1 associated with hematocrit and hemoglobin, and deleterious missense variants in G6PC associated with triglyceride levels 4 . Studies in ancestrally distinct populations have also shown that a 2-step strategy that combines sequencing data of a subset of samples with subsequent genotyping in a large cohort can be an optimal way to maximize power while retaining experimental costs 5,6 . Low coverage sequencing has been shown to uncover novel variants in less studied populations 7 . American Indians have a high burden of cardiometabolic diseases and may harbor rare coding variants that contribute to this risk. Building upon our ongoing investigation of exonic variation in American Indians using WES, we recently genotyped WES-identified single nucleotide variants (SNV) and small insertions/deletions (indels) with predicted protein-altering function in 1,127 Strong Heart Family Study (SHFS) participants. Approximately 43% of these genotyped variants are currently not present in publicly available repositories and are likely specific to American Indians. The goal of this study is to assess the clinical and biomarker phenotypic associations for carriers of these SNVs in American Indians. The identification of genes for specific phenotypes may provide insights into disease mechanisms and this knowledge could be applied to overall human populations including American Indians.

Material and methods
Population and phenotypes. The Strong Heart Study (SHS) is a population-based study of cardiovascular disease in American Indians recruited from tribes in Arizona, Oklahoma, and South and North Dakota 8 . The SHFS is a family component of the SHS, which examined 3776 members in 94 multigenerational families 9 . The first SHFS full family exam (2001-5, SHS Phase 4, baseline visit for this study) consisted of a personal interview, a physical exam and laboratory tests. A re-exam from 2006 to 2010 (Phase 5) had > 91% retention rate and measures were similar to the first exam. During the clinical visits, various categories of phenotypes were obtained including standardized physical measures (anthropometrics, blood pressure) and clinical data (diabetes, hypertension, medication use). A 12-h fasting serum and a spot urine sample were collected for laboratory biomarkers (complete blood cell count, serum lipids, liver and kidney function serum biomarkers, and metabolic biomarkers such glucose, insulin and HbA1c). DNA was extracted for genetic studies 8 . Pedigree relationships and identity-by-descent (IBD) sharing were estimated as previously described 10 . One tribe withdrew its consent to participate in future investigations and was not included in this analysis. The study was approved by the institutional review boards (IRBs) at each field center, and all participants gave written informed consent.
This study used existing data from a case-control study of chronic kidney disease (n = 555) and controls (n = 572) which included SHFS participants from two clinical centers (Oklahoma and the Dakotas) selected from 24 pedigrees. Cases were defined by a self-reported kidney failure (dialysis or transplant, n = 28), an eGFR less than 60 ml/min/1.73 m 2 (n = 233) and/or urine albumin to creatinine ratio (UACR) ≥ 30 mg/g in any of the two clinical visits (n = 322, including n = 123 with both low eGFR and high UACR). Individuals without chronic kidney disease at two clinical visits and age > 40 years were selected as controls based on an eGFR > 80 ml/ min/1.73 m 2 , and an UACR < 30 mg/g. Phenotype definitions are shown in Table S1. Briefly, for participants using anti-hypertensive medications, we added 15 and 10 mmHg to their measured systolic and diastolic blood pressures, respectively. Estimated glomerular filtration rate (eGFR) was calculated using the serum creatinine-based Chronic Kidney Disease Epidemiology Consortium equation. LDL cholesterol (LDL-C) was estimated by the Friedewald formula for samples with triglycerides < 400 mg/dl and individuals were not taken statins at the time of lipid measures. For analyses of fasting glucose and insulin, we excluded individuals with diabetes. HOMA-IR (mmol/L) was calculated among non-diabetic individuals using the equation: fasting insulin in mU/L * fasting glucose in mmol/L)/22.5 11 . Incident diabetes was defined by a new-onset fasting glucose > 126 mg/dL (7.0 mmol/L) and/or use of diabetic medications at follow-up. Incident impaired fasting glucose was defined by a new fasting glucose between 100 mg/dL (5.6 mmol/L) and 125 mg/dL (6.9 mmol/L) at follow-up. A reference group was selected of participants with fasting glucose < 100 mg/dL (5.6 mmol/L) at baseline and follow-up. Molecular data: low pass WES and Amerindian custom genotyping panel. SNVs/indels tested in this study were selected from a low coverage WES data of 94 distantly related SHFS participants, selected to maximize the diversity across founders to identify the genetic variability in this population (given lack of publicly available reference panels for American Indians). Participants for the WES were selected from pedigrees with large number of descendants and were not ascertained for a disease or trait. WES used Illumina TruSeq Custom Amplicon assay which captured > 200,000 exons in > 20,000 genes, resulting in ~ 62 Mb of targeted genomic regions, and high quality and genome coverage (mean call rate = 0.98, mean transition-transversion = 2.5, mean coverage at 10x = 80%).
We selected 2709 variants (SNVs/indels) for genotyping through an Illumina custom panel. Criteria for variant selection were: (1) observed in at least two individuals, (2) not present in publicly available databases (dbSNP, Exome Sequencing Project [ESP], 1000 Genomes Project) at the time of variant selection (2015), and (3) predictive functionality based on Genome Variant Server (frameshift, splice-3, splice-5, stop-gain of function, stop-loss of function, and missense variants). We also included some variants within 3' or 5' UTR or introns to complete the custom iSelect Illumina panel. Among variants genotyped, 144 failed manufacturing, 1357 were homozygous in our samples, and two were excluded due to call rates < 95%. The final sample for this study included 1127 individuals and 1206 SNVs/indels, and there was no overlap of participants with WES and those genotyped with the custom Illumina panel. , gnomAD), and disease-related annotations (ClinVar). This annotation was used to assess the potential impact of the variants in protein function and to identify SNVs/indels that are novel, i.e., not present in the repositories listed above at the time of the annotation. We assigned variants as deleterious if there was an agreement among more than 3 different annotation tools as proposed by the American College of Medical Genetics (ACMG) for a supporting level of pathogenic classification by computational prediction for nonsynonymous and LOF variants 18 .
Statistical analyses. The main goal of analyses was to identify gene-phenotype associations for exonic variants while accounting for the case-control sampling and confounders. Traits were preprocessed through inverse normal transformation or outlier removal as needed. For a trait without transformation, observations more than 5 standard deviations away from the sample mean were set as outliers, with their corresponding values set to missing. No outliers were removed for traits that underwent inverse-normal transformation. Large pedigrees were split into families with no more than 33 members, by copying a child of a family and his/her genotype but not using the phenotype data 19 , as required for analyses of large-pedigree data using Merlin 20 .
We performed linear mixed model association analyses for each SNV/indel to account for family relatedness, which was implemented using the Merlin software 19 . We used additive genetic models and adjusted for age, sex, case-control status, and the first 10 genetic principal components estimated from a genome-wide genotype panel. For analyses of fasting insulin and glucose, we additionally adjusted for body mass index (BMI), reported as insulin adjBMI and glucose adjBMI . Only variants with a minimum minor allele count (MAC) of 10 were tested in association analyses.
Given the genetic correlation among SNVs/indels due to linkage disequilibrium, we used a p-value threshold for significance of < 5.5 × 10 -6 , which accounts for 9,122 effective independent tests. The number of tests was calculated based on the extended Simes method 21 , part of the GATES method to calculate the effective number of independent tests 22 .
In secondary analyses, we combined genotypes of four insulin adjBMI -related SNVs using an unweighted genetic risk score that sums the fasting insulin adjBMI increasing risk alleles for each participant and tested their association with incident diabetes and impaired fasting glucose at follow-up visit (Phase 5).

Validation of associations. Replication was assessed in two cohorts of American Indians living in Arizona
who had undergone WES performed by Regeneron Genetics Center (Tarrytown, New York). One cohort with WES data is part of a community-based study of Pima Indians (N = 6809) and the other cohort represents Urban Indians living in Phoenix Arizona (N = 850). Some variants were either not identified or had < 10 copies of the alternative allele in replication studies. Serum creatinine was not available in the replication cohorts. Therefore, two variants were tested for replication: rs779392624 for triglycerides and rs760461668 for fasting insulin adjBMI .
We also performed look ups for variants using publicly available data from the Type 2 Diabetes Knowledge Portal (https:// t2d. hugea mp. org/), which included two pre-print whole genome sequencing (WGS) publications from the Trans-Omics for Precision Medicine (TOPMed) project on fasting insulin adjBMI and Type 2 diabetes, respectively 23,24 , and the Metabolic Diseases Knowledge Portal (https:// hugea mp. org/) for variants and genes related to our lipids and creatinine findings. Additional evidence for plausibility was obtained through experimental studies including genetic knockout animal studies.
Ethnic statement. The study has been carried out in accordance with Declaration of Helsinki. The study was approved by the Institutional Review Boards of the participating Institutions (MedStar Research Institute, University of Oklahoma Health Science Center, Aberdeen Area IRB), and by the participating American Indian tribes 8,9 . All participants gave informed consent for genetic studies.

Results
The study design is shown in Fig. 1, clinical and biomarker phenotypes in Table S1, and participant characteristics in Table S2.
By querying the genotyped SNVs/indels in publicly available databases, 518 SNVs/indels were not present in dbSNP, including 339 variants also not present in gnomAD exome as per June/2019. These variants were considered novel. Most of the SNVs present in gnomAD had higher allele frequency in our sample compared www.nature.com/scientificreports/ to other populations ( Figure S1). The annotation of novel SNVs was missense SNVs/indels (n = 228), frameshift substitutions (n = 73), stop-gain or stop-loss variants (n = 8), and splice acceptor/donor (n = 11). All genotyped indels were novel. Of novel missense SNVs/indels, 44 (12%) were predicted deleterious by MetaSVM and 252 (66%) by FATHMM-MKL. All variants predicted deleterious by MetaSVM were also predicted deleterious by FATHMM-MKL (Table S3). In summary, our genotyped exome variants are composed of mostly low frequency and rare variants, likely American Indian-specific and enriched for predicted functionality.
Association results with 35 phenotypes. and lower uric acid (p = 0.03) among overall participants, although findings did not reach the multiple-testing significance threshold. The SNV at ACACA explained 1.87% of serum creatinine variance and was nominally associated with increased eGFR (p = 0.02).
Insulin-based genetic risk score and incident diabetes and impaired fasting glucose. Using an unweighted genetic risk score, we examined patterns of associations for the four insulin-related SNVs in relation to development of type 2 diabetes and impaired fasting glucose at follow-up. Individuals carried 0 to 4 insulinincreasing risk alleles from PHIL, TRPM3, SPTY2D1 and TSPO. Incident diabetes and impaired fasting glucose were obtained from a mean 5.3 years (standard deviation 1.1) from SHFS baseline visit. Among participants with normal fasting glucose at baseline, each added risk allele was associated with 53% odds of developing diabetes (p = 0.015) and 83% odds of developing impaired fasting glucose (p < 0.0001) at follow-up in models adjusted for age, sex, center, case-control status and principal components ( Table 2). The association with incident diabetes was attenuated with further adjustment for BMI (p = 0.05), but the association with incident impaired fasting glucose was unchanged by BMI adjustments (p = 0.0001). The genetic risk score was strongly associated with increased log-transformed HOMA-IR at baseline visit among participants without diabetes in models adjusted for age, sex and case-control status (N = 793, p < 0.001). Table 1. Main association results for variant-trait significant findings. For nonsynonymous rare variants and LOF variants, functional prediction algorithms were used to classify a SNV as deleterious based on agreement for at least three algorithms of prediction methods (see methods and Table S3). All SNVs listed in Table 1 had a CADD Phred score > 10-20, which is considered deleterious, except for rs756851199. Models adjusted for age, sex, center, and the first 10 principal components of ancestry. Fasting insulin was tested among nondiabetic individuals in models additionally adjusted for BMI. Amino acid change provided by the Variant Effect Predictor tool. N total number of participants. N/A, not available. Note three SNVs are not present in a publicly available database and lack rs#. Significance threshold p = 4.9 × 10 -6 is based on number of SNVs and phenotypes tested.  Table 2. Association of insulin-related SNV genetic risk score with incident diabetes and impaired fasting glucose. All analyses are adjusted for age, sex, center, case-control status, principal components (Model 1) and additional adjustments for BMI (Model 2). C.I. confidence interval, N number. All outcomes were obtained at follow-up visit. Genetic risk score was calculated by the unweighted sum of increasing insulin risk alleles of the SNVs (chr6:79,650,711, rs760461668, rs756851199, rs566547284).

Discussion
This study identified associations of several predicted deleterious rare and low frequency exonic variants with cardiometabolic biomarkers and clinical traits in American Indians. These findings include seven gene-trait significant associations for lipids, glucose/insulin and kidney traits. Several genes identified have not been previously reported in genome-wide association studies for these traits, although the evidence for their biological function is supported by experimental studies (Table 3). For example, the ABCA10 gene identified in association with lower fasting triglycerides is a cholesterol-responsive gene and encoded protein is involved in macrophage lipid homeostasis. Two recent studies have reported association of variants at the ABCA10 loci with lipid traits including an intergenic variant (rs12453914) associated with triglycerides (p = 1.67 × 10 -6 ) although findings were not genome-wide significant 23,24 . Genes identified in this study could be prioritized to uncover functional rare variants for these cardiometabolic traits. Among the four genes identified for fasting insulin among non-diabetic individuals, TRPM3 is expressed in insulinoma and pancreatic β-cells, and the protein is involved in calcium signaling in pancreatic β-cells in response to glucose stimuli [25][26][27][28] . TRPM3 channel activation has been shown to be inhibited by thiazolidinediones antidiabetic drugs such as pioglitazone and troglitazone, which are peroxisome proliferator-activated receptor (PPAR)gamma agonists 29,30 . PHIP encodes a protein that interacts with insulin receptor substrate 1 (IRS-1) and is involved in β-cell proliferation and survival 31 . Mice lacking PHIP develop hypoglycemia 32 . TSPO is involved in mitochondrial cellular metabolism and conditional tspo knockout mice manifest chronic hyperglycemia 33 . We were able to replicate the SPTY2D1 SNV for fasting insulin adjBMI in the TOPMed data, but not the other SNVs which were rare or not present in datasets. However, we identified some evidence to support associations for our gene-traits through look-ups of gene-based findings and low p-value SNVs for our traits within the identified genes.
Among the four genes (PHIP, TRPM3, SPTY2D1 and TSPO) associated with fasting insulin and combined into a genetic risk score, we showed that carriers of insulin increasing risk alleles had higher odds of developing diabetes and impaired fasting glucose at follow-up. These findings and the association of the genetic risk score with the HOMA-IR support insulin resistance as a mechanism for development of diabetes and impaired fasting glucose in our population. The attenuation of the association of the genetic risk score with incident diabetes when adjusting for BMI suggests mediation by obesity. In a randomized clinical trial, the PPARgamma agonist pioglitazone has shown to reduce the risk of diabetes among individuals with impaired glucose tolerance 34 . American Indians have a high prevalence of both type 2 diabetes and impaired glucose tolerance, and one could speculate that carriers of the TRPM3 SNV may benefit from using preferentially these medications for diabetes prevention. Therefore, the study of exonic variants can uncover not only biological relationships for gene-traits not previously reported in genome-wide association studies but also provide potential clinical applications for gene findings in high-risk populations such as ours.
An important aspect of this project is the study of American Indian-specific variants. We have shown that 1/3 of the variants assessed in this study are still not available in repositories. This includes all identified indels. The remaining variants are found in low frequency in Hispanics in the gnomAD exome variant database, given some Hispanics have Amerindian admixture 35 . These variants are not included in commercial GWAS genotyping panels. Therefore, they have not been previously queried for disease risk in large consortia of complex traits. The low coverage WES used to identify Amerindian SNVs has the advantage of low cost and capturing most of low frequency/common variants in our data, given variant reference panels are not available for our population. Studies have shown that low coverage WES identified variants perform well in association studies without an excess of false positive, although this strategy may have missed some variants 36,37 . This study exemplifies a Table 3. Supporting evidence for genes and associated traits. For replication of gene-trait associations, see Table S6. The encoded protein selectively interacts with the IRS-1, and IRS-1 has a central role in the downstream effects of insulin and insulin-like growth factor-1 42 PHIP controls β-cell proliferation and survival 31 . Phip mutant mice have postnatal growth deficit and develop hypoglycemia 32 . PHIP rare SNVs associated with childhood obesity, insulin resistance and repression of pro-opio melanocortin 38

TRPM3
Fasting insulin (adjusted for BMI) Transient receptor potential melastatin 3 (TRPM3) channels are non-selective cation channels that are expressed in insulinoma cells and pancreatic β-cells, and are important for cellular calcium signaling and homeostasis. TRPM3 mediates calcium signaling in pancreatic β-cells in response to glucose stimuli, supporting its role in pancreatic β-cells function 25,28 Trpm3-deficient mice do not show alterations in resting blood glucose levels in agreement with our findings 43  www.nature.com/scientificreports/ major advantage of leveraging WES findings to select predicted functional variants for association screenings in genetically less well-characterized populations. Our strategy for variant selection was driven by the current lack of reference panels for American Indians. We selected predicted functional variants from a WES performed in a subset of American Indian participants of the SHFS. We focused on variants that were not present in public repositories at the time of genotyping and then built a customized panel to genotype them in a larger sample of American Indians. This strategy offered some challenges including a large number of variants without an alternative allele (40%) at genotyping due to their low frequency in the studied population, and limitations for replication of variants. The genetic risk score results likely overestimated the effect as they were applied to the discovery sample. American Indians are characterized by distinct cultural and linguistic features, and separated by large geographic distances allowing for genetic variation between groups to have occurred. Our study included American Indians from tribes in the Dakotas and Oklahoma, but not Southwestern tribes who were used for replication. Given these challenges, we believe that the best approach to validate our findings will be the functional characterization of our variants in experimental models, and future target search for LOF exonic variants in the genes that we identified in this study.
In support to this strategy and a potential role of PHIP in insulin resistance, a recent study identified an excess of very rare predicted deleterious variants at PHIP in childhood severe obese individuals compared to controls, with some PHIP carriers showing insulin resistance and early type 2 diabetes 38 . Functional in vitro experiments supported a role of PHIP in human energy homeostasis through transcriptional regulation of central melanocortin signaling pathways 38 . Our participants carrying the PHIP SNV A allele had similar BMI than non-carriers (29.2 [standard deviation 5.5] and 31.4 [6.7] kg/m 2 for genotypes AG and GG, respectively, p = 0.18) and all analyses were adjusted for BMI.
In summary, this study of predicted functional Amerindian-specific exome variants identified seven gene-trait associations and uncovered potential new biological mechanisms and clinical implications for genes not previously reported to be associated with cardiometabolic traits. Our results add to the literature of exonic variants associated with cardiometabolic traits in American Indians.

Data availability
The Strong Heart Study and the SHFS 9 data is available through dbGaP Study Accession: phs000580.v1.p1 and upon request from the https:// stron ghear tstudy. org/. The summary data are included in the online supplemental files.