Background & Summary

Overweight and obesity in children are a public health problem that has raised concern worldwide1. Childhood obesity is characterized by an expansion of the adipose tissue (AT)1 and plays an important role in the development of cardiometabolic alterations during early adulthood, thereby increasing morbidity and mortality2. According to twin and family studies, around 40–70% of the interindividual variability in body mass index (BMI) has been attributed to genetic factors3,4,5. Despite this, known single-nucleotide polymorphisms (SNPs) explain <2% of BMI variation6, a phenomenon termed ‘missing heritability’. Potential sources explaining this missing heritability include epigenetic components, the existence of low frequency and rare variants as well as the presence of X chromosome genetic variation.

Analysis in current genetic association studies is usually focused on autosomal variants while the sex chromosomes, and specially the X chromosome, are often neglected. Among the reasons, it highlights a lower gene density on the X chromosome, a lower coverage of the region in current genotyping platforms and a number of technical hurdles including complications in genotype calling, imputation and selection of test statistics7. According to a previous report, only 242 out of all 743 GWAS conducted from 2005 to 2011 considered the X chromosome in their analyses7. The proportion was similar when only family-based GWAS were considered. There is therefore a lack of available public datasets including X chromosome genotype data for analysis. On the other hand, although several X chromosome-specific statistical tests and guidelines have now become available, there is also a lack of readily available implementations and user-friendly apps incorporating them for routine analysis8,9.

The majority of the technical hurdles faced when analysing X chromosomal data rise from two of its main particularities. The first one is the fact of women having two allele copies while males having only one. As a consequence, if males are included in the analysis, special caution must be taken. Particularly, the study design process should be performed carefully, trying to maintain a balanced female/male ratio across experimental conditions. Otherwise, many available statistical tests will suffer from type I errors as soon as sex-specific allele frequencies occur, which is typically observed for a great number of variants. Other problems derived from an unbalanced sex ratio in the study sample include problems during the genotype calling process, as the signal intensities obtained from standard array genotyping platforms will be always lower in males than for females (who carry two alleles). The second uniqueness motivating X-chromosome specific analyses lies in the X chromosome inactivation (XCI) process, through which most of the cells of females express only one X chromosome allele in order to compensate the genetic dosage with regard to males. Before selecting a particular statistical approach, it should be mandatory to carefully investigate the concrete XCI model to assume for a gene in a particular tissue. Depending on the XCI model assumed, we should proceed one way or another during the selection of the test statistics. These and other particularities must be addressed as long as X chromosomal data are included into genetic studies.

In relation to obesity, only a few studies have reported association with markers on the X chromosome. One of the most remarkable findings involves the tenomodulin (TNMD) gene, a Xq22 located locus encoding a type II transmembrane glycoprotein. First time associated with adult obesity at the genetic level10, its presence in adult human AT has been demonstrated showing higher expression in obesity and lower expression after diet-induced weight loss. Regarding children population, our research group found that TNMD expression was five fold-times up-regulated in visceral adipose tissue (VAT) of children with obesity, compared with their normal-weight counterparts (Gene Expression Omnibus GSE9624)11,12. Recently, we have reported new associations between TNMD SNPs and childhood obesity and metabolic alterations in a Spanish children population13. Interestingly, our study has been the first to analyse and detect associations between X chromosome TNMD genetic variants and obesity in a children cohort. Similarly, SNPs in the SLC6A14 gene, also located in the X chromosome, have shown evidence of association with obesity14. As a whole, these TNMD and SLC6A14 reports support the fact that X chromosome genetic variants may be not only useful early life risk indicators of obesity but also an interesting source of missing heritability13.

Given the lack of public available genetic datasets incorporating X chromosome genetic variants and the still prevalent statistical hurdles that make the X chromosome a difficult region to be tested in functional genetics, we here aimed: (1) to make public and describe a dataset incorporating X chromosome genotype data from a children cohort13,15, and (2) to outline a whole implementation of the special X chromosome analytic process in genetics. The presented research dataset includes X-chromosomal SNP data (mapping the genes TNMD and SLC6A14) from a children cohort composed of 915 normal-weight, overweight and obese subjects. Some topics covered in this paper include dataset sharing and description, explanation of sample design, genotype calling, quality control, and test statistics selection procedures. Additionally, a short section explaining and interpreting findings obtained after analysing the dataset with a specific X chromosome analytic approach is presented.

Methods

Experimental design and study population

These methods are an expanded version of descriptions in our related work and general characteristics of the dataset have been previously described13. Briefly, in this case-control multicentre study, 915 Spanish children (438 males and 477 females) were recruited from three national health institutions: Lozano Blesa University Clinical Hospital, Santiago de Compostela University Clinical Hospital and Reina Sofía University Clinical Hospital. According to specific X-chromosomal analytic requirements, the female/male ratio of the study sample was perfectly balanced.

Childhood obesity status was defined according to the International Obesity Task Force (IOTF) reference for children16 which is based on the application, on children population, of the widely used cut-off points of BMI for adults (25 and 30 kg/m2, for overweight and obesity respectively). Particularly, these criteria constitute a range of age and sex specific cut-off points for children that have been extracted from solid percentile tables constructed on 97876 boys and 94851 girls ranging from 2 to 18 years. After the application these specific cut-off points, the dataset was composed of 480 children in the obesity group, 177 in the overweight group and 258 in the normal-BMI group. Children were allocated into two experimental conditions according to their obesity status; the affected group (cases) composed of both children with obesity or overweight and the control group composed of normal-weight children. An unbalanced female/male ratio across cases and controls has been proven to heavily affect the power of some specific X chromosome association tests17. In our study, a balanced female/male ratio was maintained across each experimental condition (122/136 in controls and 355/302 in cases) (Fig. 1).

Fig 1
figure 1

Study design and characteristics. (a) Experimental workflow used to generate and analyse the data. (b) Genomic context of selected markers; light blue boxes represent exons, while the connecting lines are introns. Abbreviations; rs, reference SNP; UTR, untranslated region.

Inclusion criteria were European-Caucasian heritage and the absence of congenital metabolic diseases. Otherwise, the exclusion criteria were non-European Caucasian heritage, the presence of congenital metabolic diseases (e.g., diabetes or hyperlipidemia), undernutrition, and the use of medication that alters blood pressure, glucose or lipid metabolism.

Ethical statement

All procedures in the study were conducted in accordance with the Declaration of Helsinki (Edinburgh 2000 revised), and followed the recommendations of both the Good Clinical Practice of the CEE (Document 111/3976/88 July 1990) and the legally enforced Spanish regulation for clinical investigation of human beings (RD 223/04 about clinical trials). The Ethics Committees on Human Research of all participant institutions approved all experiments and analyses with registration Code: “2011/198”. All parents or guardians provided written informed consent and children gave their assent.

DNA extraction, processing and analysis

The presented dataset consists on genotype data for eight target SNPs mapping the X-chromosomal genes TNMD and SLC6A14 in the study population. Details regarding SNP selection and molecular analyses are briefly covered here since they have already been fully detailed in our previous work13. On the contrary, we pay special attention in the explanation of X chromosomal particularities, data description as well as in the summarization of each data analysis and processing step.

Seven SNPs located at the TNMD locus and one located at the SLC6A14 were selected for genotyping analysis. Genomic DNA was extracted from peripheral white blood cells using two automated kits, the Qiamp DNA Investigator Kit for coagulated samples and the Qiamp DNA Mini & Blood Mini Kit for non-coagulated samples (QIAgen Systems, Inc., Valencia, CA, USA). All extractions were purified using the DNA Clean and Concentrator kit from Zymo Research (Zymo Research, Irvine, CA, USA). Genotyping was performed by TaqMan allelic discrimination assay using the QuantStudio 12 K Flex Real-Time PCR System (Thermo Fisher Scientific, Waltham, MA, USA). Given the X-chromosomal location, it is recommendable to analyse females and males in separate plates during the genotyping process or, at least, maintain a balanced female/male ratio by plate.

Once genotyping was accomplished, we checked candidate SNPs for sex-specific allele frequencies, which can induce type I errors in some statistical X-chromosome analyses (especially in the case of unbalanced designs). Tested by means of the Fisher exact test, all SNPs in the TNMD showed no significant P-values and thus equal allele frequencies across sex groups (Table 1). On the contrary, the SNP in the SLC6A14 did not (P = 0.01). This fact should be taken into consideration when selecting an appropriate test for high-level statistical analyses unless a balanced sex ratio across experimental conditions is presented in the population (which is our case). Information regarding minor allele frequencies (MAFs) stratified by experimental condition for all candidate markers is presented in (Table 2). Linkage disequilibrium (LD) status of the TNMD gene was studied using the Haploview Software separately in males and females13,18.

Table 1 Allele frequencies in the whole study population and each sex group.
Table 2 Allele frequencies in the study population stratified by experimental condition.

Data Records

The complete research dataset (genotype and phenotype data) has been uploaded into the European Genome-Phenome archive (EGA). The work can be found online with the title “X chromosomal genetic variants are associated with childhood obesity” or with the identifier EGAS00001002738 (2018)15. Online data are sorted and presented according to obesity status; the affected group (cases) composed of both children with obesity or overweight (EGA reference EGAD00010001482 (2018)) and the control group composed of normal-weight children (EGA reference EGAD00010001481 (2018)). Three files by-experimental condition (a total of six) are available online (.bed, bim and fam files). The bed files contain raw genotype data while the bim files describe information relative to target SNPs (chromosome number, SNP identifier, genetic distance in morgans (set as 0 for all markers), base-pair position and coding alleles). Instead, the fam files contain information relative to subjects (sample identifiers, family and paternal identifiers (here set as 0), sex (1 for males and 2 for females) and experimental group (1 for control and 2 for cases)). All presented formats can be easily readable in PLINK 1.9 software using the –bfile command option and further transformed into a more standard file format with the –dosage option19.

The complete data set in the current study complies with the requirements of the EGA archive. Detailed information about each sample and shared data files is presented in Online-only Tables 1 and 2, and Supplementary File 1. Specifically, DOI and descriptions for each shared file are provided in the Online-only Table 2.

Technical Validation

X chromosome particularities

Before introducing further steps, we here list two issues making the X chromosome a difficult region for genetic analyses. These particularities will determine important decisions related to genotype calling, data imputation and statistical analysis. It is important to note, however, that all here-described particularities are only applicable to those X chromosomal loci outside the pseudo-autosomal region of the X chromosome (which is the case of TNMD and SLC6A14).

The first noticeable uniqueness of the X chromosome is the fact of women having two allele copies while males having only one. As a result, while females can present the standard three possible allele combinations (AA, AB and BB), males are homozygous and have only two distinct possible genotypes (A- and B-). For this reason, standard autosomal association tests, such as the Cochran-Armitage trend test20,21, are not immediately applicable to X chromosome data. The second particularity affecting the X-chromosome analysis lies in the X chromosome inactivation (XCI) process, through which the transcription from one of the two X chromosome copies in female mammalian cells is silenced in order to balance the expression dosage between XX females and XY males. XCI is, however, incomplete in humans: with up to one-third of the X-chromosomal genes escaping from this silencing epigenetic mechanism. The degree of ‘escape’ from inactivation has been reported to strongly vary between genes, tissues and individuals22,23, with three possible scenarios at the gene level: complete XCI, partial XCI or total escape from XCI24,25. Depending on the XCI model assumed for a certain gene, we should proceed one way or another during the selection of the test statistics (see section ‘High-Level Analysis: Statistical Analysis’ for further details). The assumption of a particular XCI model is therefore a process that must be performed carefully.

Until date, the extent to which XCI is shared between cells and tissues remains poorly characterized and there is a lack of standardized criteria nor well-established databases to check if a gene escapes or not from XCI in a concrete situation. In order to do so, an exhaustive search in PUBMED and other scientific databases should be performed looking for particular studies supporting a certain XCI hypothesis. Currently, the most similar resource to a standardized database on this regard is the initiative carried out by the Genotype-Tissue Expression (GTEx) consortium9 in 2017, which describes a systematic survey of XCI, integrating over 5500 transcriptomes from 449 Individuals, spanning 29 tissues from the GTEx (v6p release) and 940 single-cell transcriptomes, combined with genomic sequence data. Particularly, they show that XCI at 683 X-chromosomal genes is generally uniform across human tissues and that incomplete XCI affects at least 23% of X-chromosomal genes. Overall, this work presents an updated catalogue of XCI across human tissues which may be of great utility during the selection of a particular XCI model for a gene. Other available resources also include the work of Slavney et al.26, which gathers the main XCI insights from previous studies on X-chromosome gene expression datasets.

By way of example, we here illustrate the whole process followed for the identification of the optimal XCI model in the case of TNMD. First, we interrogated the Slanvey et al. (2016) work26, where no evidence of escape from XCI was reported. In order to get more information about this fact, we further studied in detail the three works summarized in the Slanvey et al.26 paper. The first work on which the paper is based is a study from Carrel et al.22, in which we could not identify any probe covering the TNMD region. Instead, a few surrounding regions were mapped; among which the SRPX2, ZD89B07 and the SYTL4 reported escaping from the XCI process. In spite of it, this study was based on a fibroblast cell model and thus not applicable to our adipose tissue context. Regarding the second revised article27, again, there were not available probes covering TNMD. Thus, neither conclusions nor new information could be extracted. In relation to the third included article28, we were not able to find any table or supplemental material showing an output list of the analysed regions. Next, we investigated the well-established work from the GTEx consortium9 and found that the XCI status of the TNMD region remains catalogued as unknown (Supplementary Tables S2 and S13 of this paper). As a complementary approach, we performed a search in PUBMEP looking for individual studies focused on the gene expression status of TNMD from different sexes. As a result, we found a work reporting higher basal expression of TNMD in women than in men29, which could indicate that TNMD escapes from the XCI.

Taking all this into consideration and given the lack of agreement, both possibilities (‘escape from XCI’ and ‘XCI’) should be tested in the case of TNMD. A searching process like this is highly recommendable to be done for any X chromosome locus before the selection of a particular statistical approach.

Raw data processing

The primary step of the data analysis consisted on the extraction of genotype calls from fluorescence array data and the construction of work data files for data manipulation and analysis. Details regarding the exact procedure for genotype calling, which is an important procedure in X-chromosomal analyses, are listed below (‘Genotype Calling’ section).

Once we obtained genotype calls for the 915 individuals, we generated standard format files (.ped and .map) transforming the ThermoFisher cloud-derived outputs from long to wide format using an own script in R environment30. Finally, data were imported into PLINK 1.9 software19 and converted into binary format files using the –make-bed flag. These binary formats (.bed, .bim and .fam) are a more compact representation of the data that saves space and speeds up subsequent analyses.

Genotype calling

This is the first step of any primary genotype analysis and consists on the extraction of genotype calls from fluorescence array data at the SNP and individual level. Along with the test statistics selection procedure, the genotype calling process is an analytical step heavily affected by X chromosome particularities. Specifically, the main X chromosome uniqueness affecting this process is the dosage imbalance between males and females. Since males carry only one X allele, signal intensities obtained from the Real-Time PCR System are lower in males than for females and thus a correction should be implemented. On this matter, calling algorithms which apply different models to male and female samples (e.g. Illuminus and CRLMM) have been proven to generally perform better than methods which do not (e.g. GenCall and GenoSNP)31.

Here, we employed the Applied Biosystems qPCR app module (Thermo Fisher Cloud software) and the autocalling method for genotype calling. According to literature recommendations, the sex information for each sample was supplied to the software and genotype calling was performed separately in both sexes. In this regard, although genotyped plates did not consist on only boys or girls, the balanced sex ratio of our population (477 females and 438 males) favoured a better performance of the algorithm. Five signal clusters were identified (three in the case of females and two in the case of males). Then, sex information and scatter of the clusters were used to call the genotypes (AA, AB and BB for females, and A- and B- for males). Since the employed software also allows the option of applying user-definable boundaries for data analysis, those samples classified as undetermined by the autocalling method were recalled using the manual option. A set of controls were used to deduce these questionable genotype calls. Outliers were omitted from the analysis.

Data QC

Prior to high-level statistical analyses, the quality control (QC) process is an important step in any genetic analysis and especially in the X-chromosome analysis. Specific QC guidelines for X chromosome genotype data have been previously reviewed in detail8. All these criteria can help us to detect genotype errors or not reliable SNPs which should be excluded from analysis.

Here, the whole QC process was implemented in PLINK 1.9 software19. According to literature, two criteria concerning missing frequency were employed (the sex-specific missing frequency and the differential missingness between sexes)32,33. As genotype calling was performed separately in males and females (that is, no heterozygote calls in males were allowed), the proportion of heterozygote calls in males, proposed as a filter criterion by Ling and Ziegler et al.32,33, was not considered in our QC process. All SNPs (with exception of the rs11798018 and the rs2073163 from the TNMD gene) passed the recommended missing frequency filter in females (<=2%) (Table 3). On the other hand, none SNP passed the filter in males. Regarding the differential missingness test, the SNPs (rs11798018, rs4828037 and rs2073163) from the TNMD and the rs2011162 from the SLC6A14, passed the recommended filter (P ≥ 10−7). The other SNPs, instead, evidenced a marked differential missingness between sex groups. This test was performed in PLINK software using the flag “test-missing” and replacing the phenotype column of the ped file by the sex information (Table 3).

Table 3 Missing frequency quality control (QC) in our selected markers stratified by sex.

Regarding additional MAF quality checks, all SNPs showed appropriated frequencies >1% by sex groups (Table 1). When analysing the Hardy Weinberg equilibrium (HWE) in girls belonging to the normal-BMI group, all SNPs reported proper values (P ≥ 10−4) (Table 4). According to this QC process, we ensured that there were not important genotyping errors and that our genetic data were reliable for further analyses.

Table 4 Genotype counts and Hardy-Weinberg test statistics for each SNP in the female group.

On this point, it is important to note that since genotyping array technologies are not specially designed for sexual chromosomes, quality is always hoped to be lower on X chromosome genetic variants compared to autosomal data.

High-level analysis: statistical analysis

As we previously mentioned, most of available test statistics for performing genetic association analyses have been designed for autosomal variants and thus they are not applicable to X chromosome data (especially when dealing with mixed-sex samples). In these cases, testing for association on the X chromosome raises unique challenges that have motivated the development of X-specific statistical tests in the literature34,35. Association tests on the X chromosome should incorporate into their models not only the fact of dosage imbalance between males and females but also, depending on the analysed locus, a specific XCI model. Some of available approaches include:

  • Clayton Tests (2008)34. Clayton tests are two X chromosome specific versions of the common autosomal tests that explicitly account for the XCI process and allow the inclusion of males and females together. In the case of different allele frequencies in males and females, Clayton statistics have inflated type I error frequencies. These tests are available in the R package snpMatrix36 with the names:

    • S134: It is analogous to a Cochran-Armitage trend test of a combined male and female genotype contingency table; it follows a Chi² distribution on one degree freedom (df) under the null hypothesis.

    • S234: It is analogous to a Pearson’s Chi² test on 2 df of a combined male and female genotype contingency table, it follows a Chi² distribution on 2 df under the null hypothesis.

  • Zheng tests (2007)35. They are a set of six different statistics that apply to the same SNP and from which a minimum P-value is computed, needing to be adjusted according to the correlation between the test statistics. Zheng et al.35 showed that the optimal choice of statistic among the six tests depends on whether HWE holds at the locus and whether males and females have the same risk allele. For example, in the case there is departure from HWE in females, the Zheng (Z2mfG) test has been presented a good choice. For further information regarding test statistic selection, we recommend to read works8,35. Of note is that the Zheng’s tests do not explicitly account for the XCI process.

As previously mentioned, an unbalanced female/male ratio between cases and control would affect the relative power of both Zheng and Clayton statistics. If combined with sex-specific allele frequencies, these tests will suffer from increased type I errors.

  • Traditional methods easily implementable in PLINK 1.9 or R environment:

    • Ignore males entirely and analyse female data using conventional autosomal tests (a genotypic-based Cochran-Armitage trend test or an allele-based Chi² by Pearson with 1 df). The problem related to this approach is that we are missing all data from male subjects and therefore losing statistical power. The Cochran-Armitage trend test is the default test employed when a naive analysis of X chromosome data is run in PLINK using the flag –model19. Regarding males, an allele-based test accounting for the number of A- and B- alleles between experimental conditions should be employed apart.

    • Linear or logistic regression analyses on all the samples adjusting by sex. This approach further has the advantage of adjusting the model by covariates of interest. Here, if we assume that the locus of interest escapes from XCI, females should be coded as 0, 1, or 2, according to 0, 1, or 2 number of SNP risk alleles, and males should be coded as 0 or 1 according to 0 or 1 allele copies. On the contrary, if XCI is assumed to occur, females should be coded as 0, 1, or 2, according to 0, 1, or 2 number of SNP risk alleles, and males should be coded as 0 or 2 according to 0 or 1 allele copies. By default, the application of the “–dosage” flag to X chromosome input data files (.bed, bim and fam) in PLINK will produce a codification which assumes escape from XCI. For XCI to be considered, new allele code numbers should be manually replaced in male samples with a standard text editor (e.g: gedit software).

In general, the selection of the most suitable test among the presented choices will depend on three different criteria; the XCI model assumed for the locus of interest, deviation from HWE of analysed markers and the existence of sex-specific allele frequencies in the study population, which would be a substantial problem in the case of an unbalanced female/male ratio. Regarding XCI, if inactivation is assumed to occur, then either the Clayton’s statistics or regression models (with males coded as 0 and 2 (for 0 and 1 risk allele, respectively)) would be the tests of choice. On the contrary, in the case of a locus ‘escaping’ from XCI, Zheng’s tests or regression models (with males coded as 0 and 1 (for 0 and 1 risk allele, respectively)) should be employed. In the case of sex-specific allele frequencies, independently of the XCI assumed model, the Zheng’s test (Z2mfG) has been presented a better choice over the Clayton approach. On the other hand, in the case of an adjustment for covariates is required, only regression models can be applied. Of note is that most of the test statistics and analysis considerations covered here are available to implement in the command-line toolset XWAS developed by Keinan A. and collaborators37,38,39.

Although for the analysis of our dataset both possibilities (‘escape from XCI’ and ‘XCI’) were tested in the original work13, we here only present results under the XCI assumption. As we have previously seen, selected markers in our sample did not exhibit HWE deviations nor sex-specific allele frequencies. Moreover, the female/male ratio was balanced across experimental groups. For these reasons, and following published recommendations8,17,34,40, Clayton test was here selected to perform the main statistical analysis. According to an in silico simulation work, the Clayton’s S1 statistic has shown the best performance among all X-specific introduced tests across a wide range of disease models, sex ratios and allele frequencies40. Moreover, it allows the inclusion of females and males together, increasing thereby the statistical power.

In Table 5, results derived from the application of Clayton’s S1 and S2 statistics to three different continuous phenotypes of the population are presented. All these phenotype data have also been shared and are available in the metadata file (Online-only Table 1). The implementation of this process was performed in R, using the snpStats R package and the code have been shared online41. All reported associations in our previous work13 were here replicated under XCI assumption. These findings support therefore a good performance of the Clayton statistics as well as ensure the reliability of the present dataset.

Table 5 Association between X chromosome SNPs and HOMA-IR, Glucose and BMI z-score in our dataset.

In conclusion, we here share a genetic dataset and present a whole implementation of the special X chromosome analytic process in genetics. Altogether, the pipeline and the shared data will allow researchers to get familiar with the X chromosome particularities and should encourage them to include X chromosome into their genetic studies. Closing this gap is crucial to elucidate the genetic background of complex diseases, especially of those with sex-specific features.