Introduction

Family-based linkage analysis has largely been supplanted by genome-wide association studies, often using unrelated samples, following the limited success of linkage when applied to complex traits. Family-based analyses, however, have inherent strengths, which complement other approaches for identification of contributors to complex phenotypes.1, 2 Such analyses may be especially applicable to identifying low-frequency (minor allele frequency (MAF) 0.01–0.05) to rare (MAF<0.01) alleles with high impact.3, 4, 5, 6, 7, 8 We have implemented approaches in parallel, which use simple two-point linkage analysis and conventional association analysis to search for genetic variants with meaningful contributions to phenotypic variance of traits. Two-point linkage analysis considers each variant independently, unlike multipoint analysis, which integrates the information from multiple variants simultaneously. Therefore, two-point linkage does not have the same issues with inflation because of linkage disequilibrium (LD) between markers and can be used to test putatively impactful variants for linkage directly. The combined two-point linkage and association approach has the advantage of being able to directly align single-nucleotide polymorphism (SNP) results for the two analyses, pinpointing variants that show evidence of both linkage and association at the single SNP level. In prior studies, this has been applied to exome chip data, thus focusing on coding variants9 and characteristics of a functional SNP.10

Evaluation of association in the context of linkage has an extensive history,11, 12, 13 with association typically used to determine whether genetic variants residing under the linkage peak explain the observed signal. We have observed that instances of strong linkage and association together at a single locus (e.g. APOE with ApoB levels, CETP (cholesterol ester transfer protein) with high-density lipoprotein (HDL) levels, ADIPOQ with adiponectin levels)9, 10 represent variants or loci that have a striking impact on phenotype, reflected as explanation of a high proportion of the variance of the trait (3–60%). We have also observed this across a range of minor allele frequencies (1–45%), indicating that this approach can be informative for a full range of genetic variation. Other groups have used combined metrics of linkage and association to identify variants with large impact;11 however, that is a project currently undergoing evaluation separate from these analyses.
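To illustrate why both rare and common variants can account for a meaningful share of trait variance, the following is a minimal sketch assuming a purely additive model with Hardy-Weinberg proportions. It is not the calculation used in this study; the function name and the example values are illustrative assumptions.

```python
def variance_explained(maf, beta_sd):
    """Fraction of trait variance explained by one biallelic SNP.

    Assumes an additive model under Hardy-Weinberg proportions, with
    beta_sd the per-allele effect in phenotypic standard deviation units.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must lie between 0 and 0.5")
    return 2.0 * maf * (1.0 - maf) * beta_sd ** 2


# Hypothetical examples: a rare allele with a large effect and a common
# allele with a smaller effect can explain similar fractions of variance.
rare = variance_explained(0.01, 1.5)
common = variance_explained(0.45, 0.3)
print(f"rare: {rare:.4f}, common: {common:.4f}")
```

Under these assumptions, a variant with MAF of 0.01 needs a per-allele effect well above one standard deviation to explain a few percent of the variance, which is the kind of high-impact variant the family-based approach targets.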

Here we have investigated the performance of these approaches in a contemporary genetic data set consisting of comprehensive genome-wide and exome chip data encompassing 1.6 million SNPs in 90 Hispanic families from the Insulin Resistance Atherosclerosis Family Study (IRASFS). Based on our prior work and recent evidence for the existence of high impact noncoding variants,14 we hypothesize this family-based method is applicable to the search for such variants.

Materials and methods

Samples and phenotype data

The samples used in this study are from the Hispanic cohorts of the IRASFS.15 Briefly, subjects were ascertained on the basis of large family size in San Luis Valley, Colorado and San Antonio, Texas. The sample consisted of 1425 individuals from 90 families, who were extensively phenotyped, including a frequently sampled intravenous glucose tolerance test, measures of blood lipids and inflammatory markers, anthropometric measures, as well as fat deposition measures by computed tomography and dual X-ray absorptiometry scans. Institutional Review Board approval was obtained at all clinical and analysis sites, and all participants provided informed consent.

Genotype data

SNP genotype data from three genotyping chips were used. Illumina OmniExpress and Illumina Omni1S chips were genotyped as part of the Genetics Underlying Diabetes in Hispanics (GUARDIAN) Consortium (N=1034 and 1038, respectively),16 and the Illumina HumanExome Beadchip was genotyped on a larger subset (N=1414)9 of the full IRASFS Hispanic cohorts. Genotyping of the Illumina HumanExome BeadChip v.1.0 (N=552) and v.1.1 (N=862) was performed at the Wake Forest Center for Genomics and Personalized Medicine Research, whereas the Illumina HumanOmniExpress BeadChip and Illumina Omni1S BeadChip were genotyped at the core genotyping laboratory at Cedars-Sinai Medical Center (Los Angeles, CA, USA). All genotypes were called separately by genotyping array using GenomeStudio (Illumina, San Diego, CA, USA). Sample and autosomal SNP call rates were ≥0.98 (>0.99 SNP call rates for the OmniExpress and Omni1S chips), and Exome Chip SNPs with poor cluster separation (<0.35) were excluded. All data sets independently underwent Mendelian error checking using PedCheck17 to detect genotypes discordant in families for Mendelian inheritance, with resolution by removing all inconsistent genotypes. The total number of unique SNPs available for analysis following quality control was as follows: 81 559 from the Exome Chip, 668 758 from OmniExpress and 920 823 from the Omni1S chip, for a total of 1 671 140 SNPs.
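The idea behind Mendelian error checking can be sketched for the simplest case of a single parent-offspring trio. This is only an illustration: PedCheck works on full pedigrees with more elaborate algorithms, and the function name and genotype encoding below are assumptions made for the example.

```python
def is_mendelian_consistent(father, mother, child):
    """Check one biallelic SNP in a father-mother-child trio.

    Each genotype is a pair of alleles, e.g. ("A", "G"). The child is
    consistent if one allele can come from the father and the other
    from the mother.
    """
    first, second = child
    return (first in father and second in mother) or (
        second in father and first in mother
    )


# Hypothetical genotypes: the second trio is discordant because neither
# parent carries the G allele seen in the child.
print(is_mendelian_consistent(("A", "A"), ("A", "G"), ("A", "G")))  # consistent
print(is_mendelian_consistent(("A", "A"), ("A", "A"), ("A", "G")))  # discordant
```

In the study, genotypes flagged as inconsistent were removed rather than corrected.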

Imputation to the 1000 Genomes integrated reference panel (version 2) was performed using genotypes and samples from the OmniExpress data set (N=634K genotypes and 1034 individuals) using SHAPEIT18 for phasing and IMPUTE219 for imputation.

Analyses

SNPs were evaluated for both two-point family-based linkage and single SNP association using Sequential Oligogenic Linkage Analysis Routines (SOLAR)20 separately by genotyping platform. Both analyses used age, sex, body mass index (BMI) and study center as covariates. All phenotypes evaluated were transformed to approximate normality of the residuals if necessary (Supplementary Table 1). Additionally, because of the high impact of a low-frequency variant known to influence adiponectin levels in this population,3, 10 presence of the variant encoding the G45R missense mutation in ADIPOQ (rs200573126) was included as a covariate for analyses involving adiponectin. Visceral adipose tissue (VAT) area, visceral-to-subcutaneous tissue ratio (VSR), waist circumference and waist-to-hip ratio were run both with and without BMI as a covariate. However, subcutaneous adipose tissue area, percent body fat and body adiposity index were not adjusted for BMI. All association analyses included three admixture proportions as covariates. Existing admixture proportion estimates were available from previously genotyped exome chip data; estimates were computed by maximum-likelihood estimation of individual ancestries in ADMIXTURE21 assuming five ancestral populations (K=5) from exome chip-wide SNP data after pruning for LD to produce admixture estimates for the greatest number of samples. Of the five variables considered, three variables were selected as representing the variation in these Hispanic samples, as inclusion of additional postulated ancestral populations began isolating individual pedigrees.
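For orientation when comparing LOD scores with association P-values, a LOD score can be mapped to an approximate pointwise P-value. The sketch below assumes the usual asymptotic result for variance-components linkage, in which 2 ln(10) × LOD follows a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom. This conversion is an assumption added for illustration; it is not reported as part of the analyses, and the function name is hypothetical.

```python
import math


def lod_to_pointwise_p(lod):
    """Approximate pointwise P-value for a variance-components LOD score.

    Assumes 2*ln(10)*LOD ~ 0.5*chi2(0) + 0.5*chi2(1) under the null.
    """
    if lod <= 0:
        return 1.0
    chi_square = 2.0 * math.log(10.0) * lod
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(chi_square / 2.0))


for lod in (3.0, 4.0, 5.0):
    print(f"LOD {lod:.1f} -> P ~ {lod_to_pointwise_p(lod):.2e}")
```

Under this assumption a LOD score of 3 corresponds to a pointwise P-value of roughly 1 × 10−4, which gives a sense of scale when linkage and association results are aligned at the same SNP.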

For validation of performance, genotypes imputed to the 1000 Genomes panel were also evaluated for linkage (and association) in two regions, which were selected for their linkage regions as well as being phenotypically of particular interest to our group: chromosome 1 for acute insulin response (AIR) to glucose and chromosome 7 for insulin sensitivity index (SI). Best guess genotypes from the imputed data were used in the linkage analysis because methods that account for imputation uncertainty have not been developed for linkage. These analyses used the same covariates as previously mentioned.

Results

The aim of this analysis was to test the utility of carrying out a combined linkage and association analysis in a contemporary data set made up of genome-wide association studies (GWAS) (Illumina OmniExpress and Omni1S) and exome chip data encompassing over 1.6 million SNPs. The combined performance was evaluated for a total of 50 quantitative traits from 7 phenotypic groups: glucose homeostasis, adiposity, lipids, biomarkers, hypertension, liver enzymes and liver fat, in 90 families from the IRASFS with an average family size of 15.4 individuals. Overall, 83 557 000 LOD (logarithm of the odds) scores and association P-values were calculated across the three genotyping sets.
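The totals quoted above can be reproduced from the per-array SNP counts given in the genotype data section. The short check below assumes that every post-quality-control SNP was tested against each of the 50 traits; the variable names are illustrative.

```python
# Post-QC SNP counts per array, as reported in the genotype data section.
snps_per_array = {
    "Exome Chip": 81_559,
    "OmniExpress": 668_758,
    "Omni1S": 920_823,
}
n_traits = 50

total_snps = sum(snps_per_array.values())
# One LOD score and one association P-value per SNP-trait pair.
total_tests = total_snps * n_traits

print(total_snps)   # 1671140
print(total_tests)  # 83557000
```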

Characteristics of the samples and genotyping are summarized in Table 1. The sample consisted of 1418 individuals from 90 families. Specifically, for the smallest genotyped sample (OmniExpress), sample sizes ranged from 786 (percent body fat) to 1034 (AIR), although larger sample sizes were available for SNPs present on the exome chip (up to 1256 for fibrinogen and albumin/creatinine ratio). Across all phenotypes, there were 9214 LOD scores ≥3, 845 ≥4 and 89 ≥5. Of the variants with LOD scores ≥5.0, 27 were linked to tumor necrosis factor-α (TNFα) receptor 2 levels, 13 to HDL levels, 24 to AIR, 13 to G45R-adjusted adiponectin levels and 3 to BMI-adjusted VSR. While a detailed summary of each trait analysis is impractical, following on our earlier observations,9, 10 we have initially focused on the patterns visible in linkage analysis followed by relating these results to association analysis results. In this report, we evaluated linkage and association with 50 cardiometabolic phenotypes (see Supplementary Table 1 for complete listing). Selected phenotypes, namely TNFα receptor 2 levels, HDL levels, AIR, adiponectin levels (adjusted for G45R, a high impact mutation identified previously in these samples3, 10) and VSR are summarized in Table 1. Overall, 12 phenotypes (from four phenotype groups: glucose homeostasis, lipids, adiposity and biomarkers) were represented in this category of LOD scores >5.0 (results are summarized in Table 2), where highest LOD scores are grouped by phenotype and chromosome. A complete summary of LOD scores >5 is presented in Supplementary Table 2.

Table 1 Demographic characteristics of the IRASFS Hispanic samples with selected phenotypes
Table 2 Summary of linkage results for phenotypes with at least one variant with LOD>4

Evaluation of loci with high LOD scores

The overall maximal LOD score of 6.49 was observed with rs12956744 with the biomarker TNFα receptor 2 levels (Table 3 and Figure 1a). This SNP is located in intron 1 (nearer the 5′ end) of LAMA1 (laminin subunit alpha-1 gene) on chromosome 18. Of note, three additional intronic variants in LAMA1 were also linked to TNFα receptor 2 levels with LOD>6, and nine SNPs overall were linked with LOD>3 (Table 3). Notably, one SNP (rs28569884) was also associated with TNFα receptor 2 levels (P-value=5.9 × 10−4; LOD=1.06). The variant rs28569884 (in intron 56) is distal to the striking linkage signal (146 kb apart), although there was another LOD score over 4 (rs4395154; LOD=4.47) just 13 kb away at the 3′ end of the LAMA1 gene (intron 62). LAMA1 is a very large gene, with 63 exons and 245 SNPs analyzed. Of these, 11 (4.4%) had nominally significant association (P-value<0.05) with TNFα receptor 2 levels. Comparatively, 9 variants had LOD scores >3 (3.7%) and 23 variants had LOD >1 (9.4%).

Table 3 Selected LAMA1 results with TNFα receptor 2 protein levels (LOD>1 and/or P-value <0.01)
Figure 1

Opposed plots showing LOD (logarithm of the odds) scores from the two-point linkage (upper portion) and log-transformed P-values for association (lower portion) results across all arrays for (a) tumor necrosis factor-α (TNFα) receptor 2 levels, (b) acute insulin response (AIR) (note the broad linkage peak on chromosome 1, and the strong linkage also on chromosome 6), (c) insulin sensitivity index (SI) (of particular note are the signals on chromosomes 7 and 12) and (d) low-density lipoprotein (LDL) levels (note the signals on chromosome 4, contributed by LPHN3, and chromosome 19, which represents the APOE locus, evaluated in our previous publication with apolipoprotein B levels). A full color version of this figure is available at the Journal of Human Genetics journal online.

A major focus of our laboratory is identifying genetic contributors to metabolic measures of glucose homeostasis. The top linkage result of LOD=6.47 (Table 4) for AIR was rs28479408, an intronic variant located in SYCP2L (synaptonemal complex protein 2-like gene) on chromosome 6 (Figure 1b). Although this variant was not associated with AIR (P-value=0.71), six other SNPs in this gene were also linked (rs4713044, LOD=6.10; rs12190237, LOD=5.58; rs12214063, LOD=3.58; rs1767771, LOD=3.42; rs2153159, LOD=3.31; rs1632103, LOD=3.15) but not associated (P-values >0.5) (Table 4).

Table 4 Chromosome 6 AIR linkage peak with linked (LOD>3) and/or associated (P-value <0.05) variants

Strikingly, chromosome 1 had a broad linkage peak for AIR, with a maximal LOD score of 6.37 (rs2252384) in the region between FAM163A and TOR1AIP2 (located at ~179 Mb; 1q25.2; Figure 1b and Table 5). Chromosome 1 has a long history of linkage to diabetes, making this result all the more interesting.22, 23, 24, 25 Here, variants with LOD scores >3 spanned much of the proximal q arm of the chromosome, with the most concentrated linkage peak residing between 156 and 187 Mb, a region encompassing 357 RefSeq genes (1q22-31.1). Restricting attention to the LOD-1 interval around the peak narrowed the region substantially, to 1.57 Mb. Of the 343 variants within this region with LOD scores >3, 73 had P-values <0.05, with the best association signal occurring at rs6426957 (Chr1: 165 988 336; P-value=6.34 × 10−4, LOD=3.09, MAF=0.441; Supplementary Table 3). Notably, many variants within RASAL2 (RAS protein activator like 2 gene) showed nominal evidence of association (0.05>P-value>1.42 × 10−3) in addition to linkage (N=45 of 46 linked (LOD>3) SNPs; Tables 5 and 6). LOD scores at this gene ranged from 3.00 to 5.38.
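One simple way to delimit a LOD-1 region from two-point results is to take the span of all markers whose LOD score lies within one unit of the peak. The sketch below is a simplified illustration under that assumption and may differ from how the 1.57 Mb interval was actually derived; the function name and the toy positions and scores are invented for the example.

```python
def lod_drop_interval(markers, drop=1.0):
    """Return (start, end, peak_lod) for markers within `drop` of the peak.

    `markers` is a list of (position_mb, lod) pairs from two-point linkage.
    """
    if not markers:
        raise ValueError("no markers supplied")
    peak_lod = max(lod for _, lod in markers)
    kept = [pos for pos, lod in markers if lod >= peak_lod - drop]
    return min(kept), max(kept), peak_lod


# Toy data (hypothetical positions in Mb and LOD scores).
toy = [(10.2, 3.1), (10.9, 5.6), (11.0, 6.4), (11.8, 5.5), (13.0, 4.2)]
start, end, peak = lod_drop_interval(toy)
print(f"peak LOD {peak}, interval {start}-{end} Mb")
```

Because two-point LOD scores vary from marker to marker with allele frequency and informativeness, an interval defined this way is best read as a rough guide rather than a formal support interval.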

Table 5 Broad linkage region on chromosome 1 with acute insulin response: variants with LOD >4.5
Table 6 Variants with LOD score >4 and P-value <0.005

Additional linkage results of interest include regions on chromosomes 7 and 12, which were linked to insulin sensitivity index (SI). Although these regions did not reach the magnitude seen for TNFα receptor 2 and AIR, the consistency of linkage in the region is compelling. On chromosome 7, the highest LOD score (5.11) was seen with rs1024591, an intergenic SNP over 300 kb from the nearest gene (a long intergenic noncoding RNA, LINC01372) (Supplementary Table 4). The linkage signal on chromosome 12 is made up of two distinct peaks (Figure 1c), one at ~53 Mb and the second at ~105 Mb (Supplementary Table 5). The LOD scores seen here are not as striking in magnitude (max LOD for each peak 4.27–4.28), but the clustering of LOD scores >3 into tight peaks is notable (Supplementary Table 5). The first peak consists of 14 variants with LOD scores >3, from 50.6 to 54.5 Mb, with multiple variants in KRT8 (keratin 8 gene) and ESPL1 (extra spindle pole bodies like 1, separase gene) showing evidence for linkage, as well as single variants at the proximal end of the peak in LIMA1 (LIM domain and actin binding 1 gene), DIP2B (disco-interacting protein 2 homolog B gene) and SLC4A8 (solute carrier family 4, sodium bicarbonate cotransporter, member 8 gene). There was no evidence for association among linked variants at this linkage peak, although other, unlinked variants in the region showed nominal association (Supplementary Table 5).

The second linkage peak resides from 101 to 109 Mb on chromosome 12, and included 21 linked variants, which represented multiple signals from CHST11 (carbohydrate (chondroitin 4) sulfotransferase 11 gene), ACACB (acetyl-CoA carboxylase beta gene) and FOXN4 (forkhead box N4 gene), in addition to intergenic variants and genes implicated by a single variant, such as CMKLR1 (chemerin chemokine-like receptor 1 gene) (Supplementary Table 5). One of these linked variants showed nominal evidence of association, with a P-value of 5.50 × 10−3 (rs11114094 in SVOP (SV2-related protein gene); Table 6 and Supplementary Tables 3 and 5), although like the prior peak, other unlinked variants in the linkage region also demonstrated evidence of association.

Variants with evidence of both linkage and association

Using the linkage results as a search tool and prioritizing those with any evidence of association identified 1076 variants with P-values <0.05 as well as a LOD score ≥3 (Supplementary Table 3). Twenty-seven variants were associated with P<0.005, as well as having a LOD score >4 (Table 6). NFIB was the primary gene implicated under a linkage peak with TNFα receptor 2 levels on chromosome 9, where there was also evidence of nominal association (P-values on the order of 2 × 10−4; Figure 1a and Supplementary Table 6). NFIB, which encodes nuclear factor I/B, is represented by 293 SNPs (135 from OmniExpress, 157 from Omni1S, 1 from exome chip), 289 of which were located in introns. Only one coding variant in this gene was polymorphic in the exome chip data set; this SNP (rs114558598; I24F) was neither linked (LOD=−0.005) nor associated (P-value=0.08). Ten common variants (0.27<MAF<0.49) within this gene (all intronic) had LOD scores >3. Overall, 68 NFIB variants had LOD scores >1, and 24 had LOD scores >2.

LPHN3 on chromosome 4 was a strong signal for LDL levels, with two intronic variants being both linked and associated (rs2343249; LOD=4.30; P-value=1.00 × 10−5 and rs9312078, LOD=3.02; P-value=8.20 × 10−5; Table 7 and Figure 1d). Both the linkage and association signals were confined to the gene region, with strong LD (r2>0.8) between the two top SNPs. There was further support throughout the gene-encoding region for both modest linkage and association with diminishing LD (Supplementary Figure 1). The strongest association result among LOD scores ≥3 was with fibrinogen levels; rs1131878 from the OmniExpress chip, LOD=3.08 and P-value=1.99 × 10−6 (Supplementary Table 3). This SNP was located within the UGT2B4 gene, which encodes UDP glucuronosyltransferase 2 family polypeptide B4.
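The prioritization used in this section, aligning the linkage and association result for each SNP and keeping those that pass both thresholds, can be sketched as a simple filter. The LOD scores and P-values below are taken from the text; the record layout and function name are illustrative assumptions, not the pipeline used in the study.

```python
# Per-SNP results quoted in the text (LOD from two-point linkage,
# p from single-SNP association).
results = [
    {"snp": "rs2343249", "trait": "LDL", "lod": 4.30, "p": 1.00e-5},
    {"snp": "rs9312078", "trait": "LDL", "lod": 3.02, "p": 8.20e-5},
    {"snp": "rs1131878", "trait": "fibrinogen", "lod": 3.08, "p": 1.99e-6},
    {"snp": "rs28479408", "trait": "AIR", "lod": 6.47, "p": 0.71},
    {"snp": "rs28569884", "trait": "TNFa receptor 2", "lod": 1.06, "p": 5.9e-4},
]


def linked_and_associated(rows, min_lod, max_p, inclusive=True):
    """Return SNP IDs passing both the LOD and the P-value threshold."""
    keep = []
    for row in rows:
        lod_ok = row["lod"] >= min_lod if inclusive else row["lod"] > min_lod
        if lod_ok and row["p"] < max_p:
            keep.append(row["snp"])
    return keep


# Nominal tier: LOD >= 3 and P < 0.05; stricter tier: LOD > 4 and P < 0.005.
nominal = linked_and_associated(results, 3.0, 0.05)
strict = linked_and_associated(results, 4.0, 0.005, inclusive=False)
print(nominal)
print(strict)
```

Among these examples, the strongly linked SYCP2L variant (rs28479408) drops out for lack of association and the associated LAMA1 variant (rs28569884) drops out for lack of linkage, whereas the LPHN3 variant rs2343249 passes both tiers.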

Table 7 LPHN3 linkage and association with LDL levels

Discussion

This study evaluated the utility of combining two-point linkage with association analysis in a data set comprised of array-based SNP genotyping totaling 1.6 million noncoding and coding variants in a family-based sample of Hispanics with extensive phenotype information. The aim of the study was to evaluate whether GWAS data in the context of linkage adds insight into the genetic origins of cardiometabolic traits, while using association analysis as a follow-up to determine likely candidate loci. This builds upon our prior evaluation of combined linkage and association using exome chip data in this cohort.9 Large-scale linkage analysis of SNP genotyping has been uncommon for complex phenotypes recently. To this end, we evaluated 50 phenotypes (46 distinct traits) related to glucose homeostasis, lipids, blood pressure, adiposity, liver fat and enzymes, and biomarkers. Given the breadth of genotypic data and the number of phenotypes, the results are extensive, but some noteworthy observations can be made. Broadly speaking, we believe the markedly denser genotypic data set reveals many insights into the genetic bases of the traits such as TNFα receptor 2, AIR and SI when compared with our prior study using the more limited data from the exome chip.

Relatively dense genotyping data provide visual evidence of linkage similar to conventional multipoint methods. In addition, while exome chip analysis primarily targets models where functional variants are exonic, the GWAS data sets can potentially address other models such as high impact noncoding variants, especially through linkage analysis. Here we have observed few examples where evidence for both linkage and association is apparent. An example is LPHN3 (Table 7 and Supplementary Figure 1), where LOD scores reached 4.30 with a P-value of 1.00 × 10−5, suggesting a true impact on LDL levels. Given the actual low density of coverage in GWAS data sets, which are designed to cover genomic regions through LD relationships, truly causal variants are unlikely to be captured by chance. The ultimate test of whether this approach will be successful will require whole-genome sequencing data. Overall, these results show that incorporating two-point linkage and association analyses can identify meaningful signals that impact cardiometabolic traits, often in the absence of striking association alone. These conclusions are consistent with our prior work9, 10 in which we have shown that linkage evidence can be relatively strong, but association evidence only appears when the functional variant is also captured. The latter is unlikely in a GWAS data set. For these reasons, our main focus was on regions with evidence of linkage based on both the power of linkage methods and the 'far-sighted' ability of linkage to identify genetic relationships.4, 5, 6, 7, 9, 10

As noted above, several genomic regions had relatively strong evidence of linkage, but limited association results. Based on our logic, this would suggest the possibility of underlying, as yet unidentified functional variants. Thus, for the strongest linkage with TNFα receptor 2 levels (LOD=6.49), we would hypothesize that one or more high impact noncoding variants lie within the linkage region. LAMA1 is similar to LAMA5, which has previously been related to TNFRSF1B expression,26 making it plausible for LAMA1 to be related to TNFα receptor 2 levels.

Analysis of traits of interest to our laboratory (AIR, SI) also resulted in notable linkage peaks. It is tempting to scan these linked regions for biologically relevant genes. Genes located under a broad AIR linkage region on chromosome 1 (Figure 1b and Table 5) included FAM163A, also known as neuroblastoma-derived secretory protein (NDSP), TOR1AIP2 and RASAL2. FAM163A (aka NDSP) has been associated in methylation analysis for borderline personality disorder27 with overexpression observed in neuroblastoma.28, 29 TOR1AIP2 encodes torsin A-interacting protein 2, which is involved in the nuclear envelope.30, 31 Mutations in TOR1AIP1 have been shown to cause muscular dystrophy.32 RASAL2 (RAS protein activator like 2) has been implicated as an obesity susceptibility gene in both Chinese33 and Mexican populations,34 as well as having a role in the susceptibility of many cancers, including liver,35 thyroid,36 ovarian,37 breast37, 38 and lung.39

Genes under the SI linkage peaks also included interesting candidates. On chromosome 12, the most relevant gene with linkage in the distal linkage peak was CMKLR1 (chemerin chemokine-like receptor 1), which is believed to have a role in glucose homeostasis,40, 41, 42 obesity41, 43, 44 and diabetes development.45 Of note, a strong association signal (P-value=1 × 10−7) was also seen within this linkage peak in WSCD2 (WSC domain containing 2; 100 Mb from CMKLR1) (Figure 1c).

Additional genes included LIMA1 (LIM domain and actin binding 1, also known as EPLIN and SREBP3), a tumor suppressor; DIP2B (disco-interacting protein 2 homolog B), replicated as a susceptibility locus for colorectal cancer;46 and SLC4A8, a sodium bicarbonate transporter, which may have a role in regulation of blood pressure with some variants in this gene having been previously implicated.47, 48 Further, KRT8 (keratin 8, type II), which is overexpressed in human liver disease, resides under the linkage peak on 12q.49 The linkage region on chromosome 7 contained only one putative gene, LOC102723427, about which there is no known information.

The most intriguing signal lies in LPHN3 and was both linked and associated with LDL levels at two separate variants. This gene encodes latrophilin 3 (recently renamed as ADGRL3;50 adhesion G-protein-coupled receptor L3), which is related to latrotoxin, the toxin produced by the black widow spider.51 There is evidence suggesting a role for latrophilin 3 (among other latrophilins) in binding to fibronectin leucine-rich transmembrane (FLRT) family members, which has been shown to promote the development of glutamatergic synapses.52, 53 Additionally, genetic variants in LPHN3 have been associated reproducibly with attention deficit hyperactivity disorder and other psychiatric conditions.54, 55, 56 LPHN3 is also being investigated as a pharmacogenetic target.57 Despite the lack of biological evidence directly supporting the link between LPHN3 variants and LDL cholesterol levels, cholesterol is crucially important in the brain, and further study may elucidate a mechanism by which genetic variants in LPHN3 impact plasma LDL levels.

We previously reported CETP linkage and association with HDL levels in exome chip data from this Hispanic sample.9 Linkage of CETP in this data set was stronger, with LOD scores of up to 5.43, an increase of 1.14 over the previous top signal (Table 6 and Supplementary Table 2). The addition of GWAS data implicated additional linked variants (LOD>5, N=4) proximal to the coding region, perhaps complicating interpretation of the functional impact of this linkage result.

Here we assessed the impact of SNP density to provide insight into linkage relationships with the conclusion that dense SNP maps do reveal additional insight. We have extended this query further by evaluation of imputed genotype data in regions of particular interest because of evidence of strong linkage with glucose homeostasis-related phenotypes. Three regions were selected based on substantial linkage evidence and a particular interest in glucose homeostasis: chromosome 1 with AIR and chromosomes 7 and 12 with SI. Utilization of imputed data increases the number of markers capturing the region by 22-fold (18 411 directly genotyped markers, 406 K imputed markers). The maximal LOD score from the imputed AIR region was 6.45 at rs2252384 (the same SNP implicated in the directly genotyped data; Supplementary Figure 2). The slight increase in LOD score (from 6.37 to 6.45) can likely be attributed to more complete information following imputation of missing genotypes. For chromosome 7 with SI, a new best SNP, rs2530421, had the maximum LOD score of 5.53 (compared with the prior best LOD of 5.11 at rs1024591). The imputed best SNP lies very near the original linkage peak, providing little additional guidance in refining the causal variant(s), given the high degree of correlation between the top-linked SNPs (r2=0.937). Evaluation of another linked region (chromosome 12 with SI) showed only modest increases in linkage signals, as could be expected because the information carried by these imputed markers is wholly derived from the genotyped markers, which had already been informative. Thus, inclusion of imputed genotypes marginally improved the maximal LOD scores when evaluated in this small number of examples. However, the improvements did not further refine the regions of interest (Supplementary Figure 2).

In conclusion, we have built upon our previous analysis of combined two-point linkage and association9 and evaluated utility of the approach in a data set comprised of comprehensive genome-wide array-based SNP genotypes. As seen previously, there were few examples in these data where linkage and association both provided striking evidence at the same locus, which, based on our prior analysis,10 would implicate a likely ungenotyped causal variant. However, the GWAS plus exome chip design identified multiple additional regions of linkage, which were not seen in exome chip analysis alone. Positive, strong evidence of association with SNPs was not observed, suggesting that functional variants, if they are indeed captured by the linkage signal, have not been identified. To truly test the broad utility of this approach, whole-genome sequencing data will be necessary, which will incorporate the full spectrum of variant frequencies.