Introduction

Loss and functional impairment of skeletal muscle is a common disorder affecting millions of people worldwide, especially the elderly. Its most severe outcome is a predisposition to sarcopenia. It is also related to a series of other diseases and health problems, such as osteoporosis (MIM 166710), fracture, impaired protein balance, dyslipidemia (MIM 151660), obesity (MIM 601665), insulin resistance, overall frailty and increased mortality.1, 2 Skeletal muscle is commonly characterized by lean body mass (LBM), which is the single best predictor for sarcopenia. LBM is highly heritable, with estimated heritability ranging from 52 to 84%.3, 4, 5 However, only a few genes for LBM have emerged so far,6 leaving most of the genetic background of LBM unknown.

Traditional association analysis has focused largely on single nucleotide polymorphisms (SNPs). This assessment of SNP variation has proven fruitful; hundreds of common variants have been found to be associated with diseases such as obesity, osteoporosis, type 2 diabetes and immunological disease.7 However, recent studies have shown that another type of genomic variation, copy number variations (CNVs), also has a significant role in common diseases, and that CNVs are likely to be present at reasonably high frequencies in the population. Recent data imply that CNVs account for up to 4 Mb of genetic differences, whereas SNP variation accounts for only 2.5 Mb.8 The widespread distribution of CNVs across the genome makes them an important type of genetic variation for identifying disease-associated genetic loci. Many diseases have been found to be associated with copy number (CN) changes, including osteoporosis,9 lupus glomerulonephritis,10 autism,11 and HIV infection and progression.12 Therefore, investigation of CNVs should help to unravel the genetic basis of complex diseases and phenotypes. Nonetheless, to the best of our knowledge, no CNV-based association study on LBM has been reported, and it is largely unknown whether CNVs underlie the variation of LBM. In this study, we report a CNV-based genome-wide association study (GWAS) to identify genetic loci influencing LBM variation.

Materials and methods

Study subjects

The study sample consisted of 1627 (802 males and 825 females) unrelated Chinese-Han subjects living in the cities of Xi’an/Changsha and their neighboring areas. The study was approved by the local institutional review board. After signing an informed consent form, all subjects received assistance in completing a structured questionnaire covering anthropometric variables, lifestyle, diet, family information and medical history.

Phenotyping

The cohort was recruited for studies aimed at searching for genes underlying body composition (bone mass, fat mass and lean mass). Body composition was measured using a dual-energy X-ray absorptiometry scanner Hologic QDR 4500W (Hologic Inc., Bedford, MA, USA), following the manufacturer's protocol. A dual-energy X-ray absorptiometry scan can accurately measure total body and regional bone mass, fat mass and fat-free mass. Lean mass is calculated by subtracting bone mass from fat-free mass.13, 14 After removal of all metal objects, each subject lay on a bed and was scanned from head to toe. Whole body composition and body composition at sub-regions, such as head, trunk and limbs, were measured by the dual-energy X-ray absorptiometry scanner.

To ensure the quality of the collected data, all scans were conducted, reviewed and analyzed by a clinical expert. Body weight, height and age were obtained at the same visit. In this study, lean mass of the four limbs, trunk and whole body was analyzed as the main phenotypes.

Genome-wide genotyping and quality controls (QC)

Genomic DNA was extracted from peripheral blood leukocytes using standard protocols. The Genome-Wide Human SNP Array 6.0 (Affymetrix Inc., Santa Clara, CA, USA), which includes 906 600 SNPs and 940 000 CN probes, was used to genotype each subject, according to the Affymetrix protocol. Briefly, 250 ng of genomic DNA was digested with the restriction enzymes NspI and StyI. Digested DNA was adaptor ligated and PCR amplified for each sample. Fragmented PCR products were then labeled with biotin, denatured and hybridized to the arrays. Arrays were then washed and stained using phycoerythrin on an Affymetrix Fluidics Station, and scanned using the GeneChip Scanner 3000 7G to quantify fluorescence intensities (Affymetrix Inc.). Data management and analyses were conducted using the Affymetrix Genotyping Command Console. The Affymetrix contrast QC threshold was set at the default value of greater than 0.4 for sample QC. The final average contrast QC across the entire sample reached a high level of 2.62.

Assessment of genetic background

The method of genomic control implemented in the STRUCTURE 2.2 program15 was used to detect possible population stratification of the study sample. For the structure analysis, 2000 SNPs were randomly selected across the genome for clustering of all the subjects. The program uses a Markov chain Monte Carlo algorithm to cluster individuals into different cryptic subpopulations based on multilocus genotype data. Potential substructure was estimated under the a priori assumption of K=2 discrete subpopulations. To cross-validate the results, we also conducted principal component analysis on the selected genotypes using EIGENSTRAT.16 The calculated principal components can be used to correct for potential population stratification in subsequent association analyses.

CNV determination

CNVs were identified using the CANARY algorithm implemented in the Birdsuite software (Affymetrix Inc.),17 which utilized a previously defined CNV map based on HapMap samples.17 In order to generate results with high confidence, we conducted QC filtering both at the sample level and the CNV level, according to the previously reported methods.17

First, for the sample-level QC, we used three quality metrics reported by the Birdseye method to evaluate the initial 1627 subjects for quality in CN genotyping. The following procedures were adopted: (1) we removed any sample whose estimated CN deviated by more than three s.d. values from the average estimate of CN, which was approximately two copies at the genome-wide level; (2) we calculated the variability in CN and SNP probe intensities, each standardized per chromosome, and removed any sample more than three s.d. values above the genome-wide average of these estimates; (3) we removed any sample in which more than two chromosomes failed any of these three metrics, that is, a deviation of more than three s.d. values in estimated CN, or excessive CN or SNP probe variability for the chromosome.

Second, we conducted QC filtering at the CNV level. Out of the initial 1280 CNVs, we discarded (1) any CNVs in which more than 5% of the copy calls were uncertain (confidence score >0.1) or missing, and (2) any CNVs with the frequency of major variant greater than 99%. The filtering procedure resulted in 603 CNVs available for subsequent association analyses.
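The two CNV-level filters above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the function name `passes_cnv_qc`, the input layout (one CN call and one confidence score per subject, with `None` for a missing call) and the choice to compute the major-variant frequency over non-missing calls are all assumptions.

```python
from collections import Counter

def passes_cnv_qc(calls, confidences,
                  conf_cutoff=0.1, max_uncertain=0.05, max_major_freq=0.99):
    """Return True if a CNV survives both CNV-level filters.

    calls       -- copy-number call per subject (int), or None if missing
    confidences -- confidence score per subject; > conf_cutoff counts as uncertain
    (hypothetical input layout, chosen for illustration only)
    """
    n = len(calls)
    # Filter 1: more than 5% of calls uncertain or missing -> discard.
    bad = sum(1 for c, s in zip(calls, confidences)
              if c is None or s > conf_cutoff)
    if bad / n > max_uncertain:
        return False
    # Filter 2: major variant more frequent than 99% -> discard.
    called = [c for c in calls if c is not None]
    if not called:
        return False
    major_freq = Counter(called).most_common(1)[0][1] / len(called)
    return major_freq <= max_major_freq

# Toy example: 100 subjects, 90 with CN=2 and 10 with CN=3, all calls confident.
print(passes_cnv_qc([2] * 90 + [3] * 10, [0.0] * 100))
```

Applied to every CNV in turn, a filter of this kind reduces the initial set to the CNVs carried forward to the association analysis.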

Statistical analyses

Lean mass at the following seven sites was analyzed: left and right arms, left and right legs, subtotal of limbs, trunk and whole body. Each phenotype was adjusted by age, gender, and the first two principal components calculated from the 2000 selected SNPs. Residual phenotypes were normalized by inverse quantile of the standard normal distribution, which imposes a standard normal distribution on the phenotype to be analyzed. Covariate adjustment and phenotype normalization were performed with Minitab (Minitab Inc., State College, PA, USA).
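The normalization step can be illustrated with a rank-based inverse normal transformation. The sketch below is an assumption about the general technique rather than a reproduction of the Minitab procedure: the function name `inverse_normal_transform`, the handling of ties by average rank and the rank offset (r − 0.5)/n are illustrative choices that the text does not specify.

```python
from statistics import NormalDist

def inverse_normal_transform(values):
    """Map values to standard normal quantiles according to their ranks.

    Ties receive the average of their ranks; the quantile for rank r is
    taken at (r - 0.5) / n (an assumed offset, not stated in the text).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()  # standard normal: mean 0, s.d. 1
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]

# Toy residuals; the middle-ranked value maps to a quantile of zero.
z = inverse_normal_transform([3.0, 1.0, 2.0])
print(z)
```

In the analysis described above, the input would be the residuals remaining after adjustment for age, gender and the first two principal components.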

Association between lean mass and each CNV was tested with a linear regression model using PLINK.18 In brief, CNVs were treated as predictors for lean mass. The PLINK input genotype file set includes three files: a family file, a map file and a gvar file. The family file and map file describe individuals and variants, and the gvar file describes the paternal and maternal origins of the derived CNs. Each row in a family file represents one subject with the following fields separated by a tab: family id, subject id, father id, mother id, sex and phenotype. Each row in a map file represents one variant with the following fields: chromosome, CNV id, genetic distance and start physical position. Each row in a gvar file has seven fields: family id, subject id, CNV id, first CN, first dosage, second CN and second dosage. Here, first CN and first dosage are the allele inherited from the first parent and its dosage, and likewise for the second CN and second dosage.

The association test was run on this file set, and the output file lists P-values for all CNVs.
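As a rough illustration of treating a CNV as a predictor, the following Python sketch fits an ordinary least-squares line of a normalized phenotype on total copy number and reports the slope, intercept and proportion of variance explained. It is not the PLINK implementation: the function name `regress_on_copy_number`, the use of total copy number as a single numeric predictor and the toy data are assumptions, and the P-value calculation is omitted.

```python
def regress_on_copy_number(copy_numbers, phenotype):
    """Simple least-squares regression of phenotype on copy number.

    Returns (slope, intercept, r_squared). Total copy number per subject
    is used as the predictor (an illustrative assumption).
    """
    n = len(copy_numbers)
    mean_x = sum(copy_numbers) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in copy_numbers)
    syy = sum((y - mean_y) ** 2 for y in phenotype)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(copy_numbers, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of phenotypic variance explained
    return slope, intercept, r_squared

# Toy data (not study data): phenotype decreases with copy number.
cn = [2, 3, 4, 2, 3, 4]
y = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0]
print(regress_on_copy_number(cn, y))
```

A negative slope in such a fit corresponds to lower lean mass with each additional copy, and the R-squared value corresponds to the proportion of phenotypic variance attributed to the CNV.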

We adopted the strict Bonferroni correction to account for multiple testing. Raw P-values were multiplied by the product of the number of CNVs (603) and the number of phenotypes (7), that is, 4221 tests. Significant results were declared at the nominal level of 0.05 after correction, corresponding to a genome-wide significance level of 1.18E-5 for raw P-values.
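The threshold arithmetic can be checked directly. The short Python sketch below uses only the numbers given in the text; the names `bonferroni_adjust`, `N_CNVS`, `N_PHENOTYPES` and `ALPHA` are illustrative.

```python
N_CNVS = 603        # CNVs passing QC
N_PHENOTYPES = 7    # lean mass sites analyzed
ALPHA = 0.05        # nominal significance level after correction

n_tests = N_CNVS * N_PHENOTYPES
threshold = ALPHA / n_tests  # raw P-value needed for genome-wide significance

def bonferroni_adjust(p_raw, n=n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_raw * n)

print(n_tests)              # 4221
print(f"{threshold:.2e}")   # 1.18e-05
```

A raw P-value is therefore genome-wide significant exactly when its corrected value, `bonferroni_adjust(p_raw)`, is below 0.05.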

Results

Basic characteristics of the sample are summarized in Table 1. The STRUCTURE program clustered all subjects into one single homogeneous population (see Supplementary Figure S1). The estimated inflation factor (λ) from the association analyses was 1.02, below the level that typically indicates population stratification. Together, these results indicate that population stratification is unlikely to be present in the studied sample.
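For reference, an inflation factor of this kind is commonly computed as the median of the observed 1-d.f. chi-square statistics divided by the median expected under the null hypothesis (about 0.455). The Python sketch below follows that common definition; it is an assumption that the study computed λ this way, and the function name `genomic_inflation_factor` is illustrative.

```python
from statistics import NormalDist, median

def genomic_inflation_factor(p_values):
    """Median-based genomic inflation factor from two-sided P-values.

    Each P-value (0 < p <= 1) is converted to a 1-d.f. chi-square
    statistic via the squared standard normal quantile.
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.75) ** 2  # median of chi-square(1), ~0.455
    return median(chi2) / expected_median

# Toy example: evenly spaced P-values, as expected under the null.
p_null = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation_factor(p_null), 3))
```

Values of λ close to 1 indicate that the distribution of test statistics is close to its null expectation.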

Table 1 Basic characteristics of the study sample

One CNV, CNV2073, reached the genome-wide significance level, with a raw P-value of 6.22E-7 (Bonferroni corrected P-value=0.002) for lean mass at the right arm (R-arm). Figure 1 displays the Manhattan plot of the genome-wide scan for this phenotype. The association between this CNV and lean mass was also nominally significant at most other sites, though none of them achieved the genome-wide significance level (Table 2). Table 2 also lists the association results of the top 10 CNVs ranked according to P-values for lean mass at R-arm.

Figure 1

Manhattan plot of lean mass at R-arm. The y axis represents –log10P, and the x axis represents the start physical position of each CNV along the chromosomes. The plot displays P-values of lean mass at the right arm for all 603 CNVs. The line shows the threshold for the genome-wide significance level. The figure shows that CNV2073 is significant at the genome-wide level of 1.18E-5.

Table 2 P-values of the top 10 CNVs

CNV2073 spans 28 377 089 to 28 536 721 bp (NCBI build 36.3) on chromosome 15q13.3. Three CN states exist in the sample: CN=2, 3 and 4, with frequencies 0.06, 0.13 and 0.81, respectively. Compared with subjects with two copies (normal diploid), subjects with three copies had 6.9% lower mean lean mass at R-arm, and subjects with four copies had 11.2% lower mean lean mass at R-arm (Figure 2). Linear regression analysis showed that CNV2073 at 15q13.3 accounted for 1.0% of the total lean mass variance at R-arm.

Figure 2

Mean R-arm lean mass values at different copy numbers for CNV2073. Copy numbers of 2, 3 and 4 exist for CNV2073. The y axis represents the mean lean mass at the right arm for a particular copy number. Error bars denote standard error.

Two genes, gremlin1 and chrfam7a, are located in the 15q13.3 region in which CNV2073 lies. Of these, gremlin1 is of particular interest. It is a candidate gene for LBM supported by both molecular function studies19, 20, 21 and previous genetic linkage studies.22, 23 However, no previous association study has linked this gene to LBM variation.

Discussion

Lean mass has considerable heritability. Although previous SNP-based association studies have identified several candidate genes,6 the vast majority of the genetic mechanisms of lean mass remain unclear, and some of them may reside in CNVs. To the best of our knowledge, this is the first GWAS between lean mass and CNVs in the Chinese population. We identified a candidate genomic region, 15q13.3, and an associated gene, gremlin1, at the genome-wide significance level. Notably, this region was also indicated to be important for lean mass variation in our two previous linkage studies.22, 23 The first study performed a large-scale whole genome linkage scan for lean mass involving 4498 individuals from 451 Caucasian families. The most pronounced linkage signal was found at 15q13.3, with a LOD score of 4.86.22 The second genome-wide linkage scan of 434 Caucasian pedigrees gave a suggestive linkage signal at this region, with a LOD score of 2.72.23 The current study is the first to suggest that this region is also associated with lean mass in the Chinese population.

The associated gene gremlin1 and CNV2073 are about 2 Mb apart. The gene is a member of the bone morphogenetic protein (BMP) antagonist family. It was first cloned from a Xenopus ovarian library for its axial patterning activities. The human gremlin1 gene encodes a glycosylated homodimeric peptide of 28 kDa with 184 amino acids.24 As gremlin1 is an antagonist of BMP, its regulation is essential for mesoderm induction, establishment of dorsoventral polarity, ectodermal differentiation, somite formation and myogenesis induction.24

It is well known that the reduction of lean mass with aging is caused by the atrophy of type II myofibers.25, 26 The capacity to generate new myonuclei for myofiber repair, growth or replacement depends on the persistence of skeletal satellite cells.27, 28 The proliferation and differentiation of skeletal satellite cells are activated by the expression of the myogenic regulatory factor MyoD, which, in turn, is downregulated by BMP4. Gremlin1, as an antagonist of BMP4, therefore promotes the expression of MyoD, and the generation and repair of lean mass.29, 30 Figure 3 illustrates the regulatory pathway through which gremlin1 may affect lean mass.

Figure 3

Hypothesized functional mechanism linking gremlin1 to lean mass. In this scheme, gremlin1 antagonizes the activity of bone morphogenetic protein 4 (BMP4), which inhibits the expression of MyoD. MyoD is a protein with a key role in upregulating muscle differentiation by stimulating the activity of skeletal muscle satellite cells. As a result, myoblast formation is activated, and the levels of skeletal muscle and lean mass increase.

We previously performed another CNV-based GWAS with the same sample pool and analytical approach.9 The CNVs identified in that study were successfully validated by real-time PCR. Therefore, although the CNV identified here was not further validated by real-time PCR, it is likely to be reliable.

In conclusion, we have conducted a GWAS between CNVs and lean mass in the Chinese population, and identified a significantly associated CNV in the region 15q13.3, which also contains the candidate gene gremlin1. Our study strengthens the understanding of the genetic determinants underlying sarcopenia-related phenotypes, and provides a basis for further functional studies.