X chromosome genetic data in a Spanish children cohort, dataset description and analysis pipeline

X chromosome genetic variation has been proposed as a potential source of missing heritability for many complex diseases, including obesity. Currently, publicly available genetic datasets that incorporate X chromosome genotype data are scarce. Although several X chromosome-specific statistics have been developed, readily available implementations for routine analysis are also lacking. Here, we aimed (1) to make public and describe a dataset incorporating phenotype and X chromosome genotype data from a cohort of 915 normal-weight, overweight and obese children, and (2) to describe in depth a complete implementation of the X chromosome-specific analytic process in genetics. Datasets and pipelines like this help researchers become familiar with the steps in which the X chromosome requires special attention, and may raise awareness of the importance of this genomic region.

Methods
Experimental design and study population. These methods are an expanded version of the descriptions in our related work, and the general characteristics of the dataset have been described previously 13. Briefly, in this multicentre case-control study, 915 Spanish children (438 males and 477 females) were recruited from three national health institutions: Lozano Blesa University Clinical Hospital, Santiago de Compostela University Clinical Hospital and Reina Sofía University Clinical Hospital. In line with the specific requirements of X-chromosomal analyses, the female/male ratio of the study sample was kept approximately balanced.
Childhood obesity status was defined according to the International Obesity Task Force (IOTF) reference for children 16, which extends to children the widely used adult BMI cut-off points (25 and 30 kg/m² for overweight and obesity, respectively). Specifically, these criteria comprise a range of age- and sex-specific cut-off points for children, derived from percentile tables constructed on 97876 boys and 94851 girls aged 2 to 18 years. After applying these cut-off points, the dataset comprised 480 children in the obesity group, 177 in the overweight group and 258 in the normal-BMI group. Children were allocated to two experimental conditions according to their obesity status: the affected group (cases), composed of children with obesity or overweight, and the control group, composed of normal-weight children. An unbalanced female/male ratio across cases and controls has been shown to heavily affect the power of some X chromosome-specific association tests 17. In our study, a balanced female/male ratio was maintained within each experimental condition (122/136 in controls and 355/302 in cases) (Fig. 1).
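The group sizes and sex ratios quoted above can be checked with a few lines of arithmetic. The following is only a sketch in Python (not the language of the shared scripts); the counts are those given in the text, and the variable and function names are illustrative.

```python
# Counts reported in the text, as (females, males) per experimental condition.
counts = {"controls": (122, 136), "cases": (355, 302)}


def female_male_ratio(females, males):
    """Return the female/male ratio for one group."""
    return females / males


ratios = {group: female_male_ratio(f, m) for group, (f, m) in counts.items()}
group_sizes = {group: f + m for group, (f, m) in counts.items()}
total_females = sum(f for f, _ in counts.values())
total_males = sum(m for _, m in counts.values())

for group, ratio in ratios.items():
    print(f"{group}: n = {group_sizes[group]}, female/male ratio = {ratio:.2f}")
print(f"total: {total_females} females, {total_males} males")
```

The totals recover the 477 females and 438 males of the whole cohort, and the 258 controls and 657 cases (480 with obesity plus 177 with overweight).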
Inclusion criteria were European-Caucasian heritage and the absence of congenital metabolic diseases. Exclusion criteria were non-European Caucasian heritage, the presence of congenital metabolic diseases (e.g., diabetes or hyperlipidemia), undernutrition, and the use of medication that alters blood pressure, glucose or lipid metabolism.
www.nature.com/scientificdata
DNA extraction, processing and analysis. The presented dataset consists of genotype data for eight target SNPs mapping to the X-chromosomal genes TNMD and SLC6A14 in the study population. Details regarding SNP selection and molecular analyses are covered only briefly here, since they have already been fully detailed in our previous work 13. Instead, we pay special attention to explaining the particularities of the X chromosome, describing the data, and summarizing each data analysis and processing step.
Seven SNPs located at the TNMD locus and one located at the SLC6A14 locus were selected for genotyping. Genomic DNA was extracted from peripheral white blood cells using two automated kits, the QIAamp DNA Investigator Kit for coagulated samples and the QIAamp DNA Mini & Blood Mini Kit for non-coagulated samples (QIAgen Systems, Inc., Valencia, CA, USA). All extractions were purified using the DNA Clean and Concentrator kit from Zymo Research (Zymo Research, Irvine, CA, USA). Genotyping was performed by TaqMan allelic discrimination assay using the QuantStudio 12K Flex Real-Time PCR System (Thermo Fisher Scientific, Waltham, MA, USA). Given the X-chromosomal location, it is advisable to analyse females and males on separate plates during genotyping or, at least, to maintain a balanced female/male ratio on each plate.
Once genotyping was completed, we checked the candidate SNPs for sex-specific allele frequencies, which can induce type I errors in some X-chromosome statistical analyses (especially in unbalanced designs). Using the Fisher exact test, none of the SNPs in TNMD showed a significant P-value, indicating no evidence of different allele frequencies between sex groups (Table 1). In contrast, the SNP in SLC6A14 did show a difference (P = 0.01). This should be taken into account when selecting an appropriate test for high-level statistical analyses, unless the sex ratio is balanced across experimental conditions (as is the case in our population). Information regarding minor allele frequencies (MAFs) stratified by experimental condition for all candidate
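The check for sex-specific allele frequencies described above amounts to a 2×2 table of allele counts (allele A or B, by sex) in which each female contributes two X alleles and each male one. The following is a minimal Python sketch under stated assumptions: the toy genotypes are invented, the function names are illustrative, and the two-sided P-value sums all tables at most as probable as the observed one, which is one common convention and not necessarily the one used by the software behind Table 1.

```python
from math import comb


def allele_counts(genotypes, sex):
    """Count A and B alleles for one sex.

    Female genotypes are 'AA', 'AB' or 'BB'; male genotypes are 'A' or 'B',
    because males carry a single X chromosome.
    """
    expected_length = 2 if sex == "F" else 1
    assert all(len(g) == expected_length for g in genotypes)
    a = sum(g.count("A") for g in genotypes)
    b = sum(g.count("B") for g in genotypes)
    return a, b


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact P-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_observed = prob(a)
    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    # Small tolerance so that ties with the observed table are included.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Invented toy data, for illustration only.
females = ["AA", "AB", "BB", "AB"]
males = ["A", "B", "A"]
female_a, female_b = allele_counts(females, "F")
male_a, male_b = allele_counts(males, "M")
p_value = fisher_exact_2x2(female_a, female_b, male_a, male_b)
print(f"P = {p_value:.3f}")
```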

Data Records
The complete research dataset (genotype and phenotype data) has been uploaded to the European Genome-Phenome Archive (EGA). The work can be found online under the title "X chromosomal genetic variants are associated with childhood obesity" or with the identifier EGAS00001002738 (2018) 15. Online data are sorted and presented according to obesity status: the affected group (cases), composed of children with obesity or overweight (EGA reference EGAD00010001482 (2018)), and the control group, composed of normal-weight children (EGA reference EGAD00010001481 (2018)). Three files per experimental condition (six in total) are available online (.bed, .bim and .fam files). The .bed files contain the raw genotype data, while the .bim files describe the target SNPs (chromosome number, SNP identifier, genetic distance in morgans (set to 0 for all markers), base-pair position and coding alleles). The .fam files contain information on the subjects (sample identifiers, family and paternal identifiers (here set to 0), sex (1 for males and 2 for females) and experimental group (1 for controls and 2 for cases)). All these formats can be read directly in PLINK 1.9 using the --bfile command option and further transformed into a more standard file format with the --dosage option 19.
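To make the column layout of the .fam and .bim files concrete, the sketch below parses one line of each in Python. It assumes the standard whitespace-separated PLINK 1.9 column order (six columns per line in both files); the example lines, sample identifiers and base-pair position are invented, and the record and function names are illustrative only.

```python
from typing import NamedTuple

SEX_CODES = {"1": "male", "2": "female"}
GROUP_CODES = {"1": "control", "2": "case"}


class FamRecord(NamedTuple):
    family_id: str
    sample_id: str
    paternal_id: str
    maternal_id: str
    sex: str      # decoded from 1 (male) / 2 (female)
    group: str    # decoded from 1 (control) / 2 (case)


class BimRecord(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: float  # morgans, set to 0 in this dataset
    position: int            # base-pair position
    allele1: str
    allele2: str


def parse_fam_line(line):
    """Parse one whitespace-separated .fam line into a FamRecord."""
    fid, iid, pid, mid, sex, group = line.split()
    return FamRecord(fid, iid, pid, mid,
                     SEX_CODES.get(sex, "unknown"),
                     GROUP_CODES.get(group, "missing"))


def parse_bim_line(line):
    """Parse one whitespace-separated .bim line into a BimRecord."""
    chrom, snp, dist, pos, a1, a2 = line.split()
    return BimRecord(chrom, snp, float(dist), int(pos), a1, a2)


# Invented example lines; identifiers, position and alleles are placeholders.
fam = parse_fam_line("0 SAMPLE_001 0 0 2 1")
bim = parse_bim_line("23 rs2073163 0 100000 A G")
print(fam)
print(bim)
```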
The complete data set in the current study complies with the requirements of the EGA archive. Detailed information about each sample and shared data file is presented in Online-only Tables 1 and 2, and Supplementary File 1. Specifically, the DOI and description of each shared file are provided in Online-only Table 2.

Technical Validation
X chromosome particularities. Before introducing further steps, we list two issues that make the X chromosome a difficult region for genetic analyses. These particularities determine important decisions related to genotype calling, data imputation and statistical analysis. It is important to note, however, that all the particularities described here apply only to X-chromosomal loci outside the pseudo-autosomal region of the X chromosome (which is the case for TNMD and SLC6A14).
The first distinctive feature of the X chromosome is that females carry two allele copies while males carry only one. As a result, while females can present the three standard allele combinations (AA, AB and BB), males are hemizygous and have only two possible genotypes (A- and B-). For this reason, standard autosomal association tests, such as the Cochran-Armitage trend test 20,21, are not immediately applicable to X chromosome data. The second particularity affecting X-chromosome analysis is the X chromosome inactivation (XCI) process, through which transcription from one of the two X chromosome copies in female mammalian cells is silenced in order to balance the expression dosage between XX females and XY males. XCI is, however, incomplete in humans, with up to one-third of X-chromosomal genes escaping this epigenetic silencing mechanism. The degree of 'escape' from inactivation has been reported to vary strongly between genes, tissues and individuals 22,23, with three possible scenarios at the gene level: complete XCI, partial XCI or total escape from XCI 24,25. The XCI model assumed for a given gene determines which test statistics are appropriate (see section 'High-Level Analysis: Statistical Analysis' for further details). The assumption of a particular XCI model must therefore be made carefully.
To date, the extent to which XCI is shared between cells and tissues remains poorly characterized, and neither standardized criteria nor well-established databases exist for checking whether a gene escapes XCI in a specific context. Answering this question therefore requires an exhaustive search of PubMed and other scientific databases for studies supporting a particular XCI hypothesis. Currently, the closest resource to a standardized database in this regard is the initiative carried out by the Genotype-Tissue Expression (GTEx) consortium 9 in 2017, which describes a systematic survey of XCI integrating over 5500 transcriptomes from 449 individuals, spanning 29 tissues from GTEx (v6p release), and 940 single-cell transcriptomes, combined with genomic sequence data. In particular, they show that XCI at 683 X-chromosomal genes is generally uniform across human tissues and that incomplete XCI affects at least 23% of X-chromosomal genes. Overall, this work presents an updated catalogue of XCI across human tissues, which may be of great utility when selecting an XCI model for a gene. Other available resources include the work of Slavney et al. 26, which gathers the main XCI insights from previous studies on X-chromosome gene expression datasets.
By way of example, we illustrate the whole process followed to identify the optimal XCI model in the case of TNMD. First, we consulted the work of Slavney et al. (2016) 26, where no evidence of escape from XCI was reported. To obtain more information, we then studied in detail the three works summarized in the Slavney et al. 26 paper. The first is a study by Carrel et al. 22, in which we could not identify any probe covering the TNMD region. Instead, a few surrounding regions were mapped, among which SRPX2, ZD89B07 and SYTL4 were reported to escape XCI. However, this study was based on a fibroblast cell model and is thus not applicable to our adipose tissue context. In the second article 27, again, no probes covering TNMD were available, so neither conclusions nor new information could be extracted. In the third article 28, we were unable to find any table or supplemental material listing the analysed regions. Next, we examined the well-established work of the GTEx consortium 9 and found that the XCI status of the TNMD region remains catalogued as unknown (Supplementary Tables S2 and S13 of that paper). As a complementary approach, we searched PubMed for individual studies on TNMD gene expression in each sex. We found one work reporting higher basal expression of TNMD in women than in men 29, which could indicate that TNMD escapes XCI.
Taking all this into consideration, and given the lack of agreement, both possibilities ('escape from XCI' and 'XCI') should be tested in the case of TNMD. We strongly recommend carrying out a search of this kind for any X chromosome locus before selecting a particular statistical approach.
Raw data processing. The primary step of the data analysis consisted of extracting genotype calls from fluorescence array data and constructing working data files for data manipulation and analysis. Details regarding the exact procedure for genotype calling, which is an important step in X-chromosomal analyses, are given below ('Genotype Calling' section).
Once we had obtained genotype calls for the 915 individuals, we generated standard format files (.ped and .map) by transforming the ThermoFisher cloud-derived outputs from long to wide format with a custom script in the R environment 30. Finally, data were imported into PLINK 1.9 software 19 and converted into binary format files using the --make-bed flag. These binary formats (.bed, .bim and .fam) are a more compact representation of the data that saves space and speeds up subsequent analyses.
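The long-to-wide transformation mentioned above can be illustrated as follows. This is a Python sketch and not the shared R script: it assumes a long table with the three columns NCBI_SNP_Reference, Sample_ID and Genotype_Call, treats each genotype call as an opaque string, and fills absent calls with a placeholder ("0 0" is used here only as an assumed missing-genotype marker). The toy rows and sample identifiers are invented.

```python
import csv
import io


def long_to_wide(rows, missing="0 0"):
    """Pivot long genotype records into one list of calls per sample.

    rows: iterable of dicts with keys NCBI_SNP_Reference, Sample_ID and
    Genotype_Call. Returns (snps, wide), where snps is the column order
    and wide maps each sample to its calls in that order.
    """
    snps = []
    calls_by_sample = {}
    for row in rows:
        snp = row["NCBI_SNP_Reference"]
        if snp not in snps:
            snps.append(snp)  # keep SNPs in order of first appearance
        calls_by_sample.setdefault(row["Sample_ID"], {})[snp] = row["Genotype_Call"]
    wide = {sample: [calls.get(snp, missing) for snp in snps]
            for sample, calls in calls_by_sample.items()}
    return snps, wide


# Invented toy input in long format.
toy_long = """NCBI_SNP_Reference,Sample_ID,Genotype_Call
rs2073163,S1,A/G
rs2073163,S2,G/G
rs4828037,S1,T/T
"""
snps, wide = long_to_wide(csv.DictReader(io.StringIO(toy_long)))
for sample, calls in wide.items():
    print(sample, *calls, sep="\t")
```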
Genotype calling. This is the first step of any primary genotype analysis and consists of extracting genotype calls from fluorescence array data at the SNP and individual level. Along with the selection of test statistics, genotype calling is an analytical step heavily affected by the particularities of the X chromosome.
Specifically, the main X chromosome feature affecting this process is the dosage imbalance between males and females. Since males carry only one X allele, the signal intensities obtained from the Real-Time PCR System are lower in males than in females, and a correction should therefore be applied. In this respect, calling algorithms that apply different models to male and female samples (e.g. Illuminus and CRLMM) have been shown to generally perform better than methods that do not (e.g. GenCall and GenoSNP) 31.
Here, we employed the Applied Biosystems qPCR app module (Thermo Fisher Cloud software) and the autocalling method for genotype calling. Following literature recommendations, the sex of each sample was supplied to the software and genotype calling was performed separately in each sex. Although the genotyped plates did not consist of only boys or only girls, the balanced sex ratio of our population (477 females and 438 males) favoured a better performance of the algorithm. Five signal clusters were identified (three for females and two for males). Sex information and the scatter of the clusters were then used to call the genotypes (AA, AB and BB for females, and A- and B- for males). Since the software also allows user-definable boundaries for data analysis, samples classified as undetermined by the autocalling method were recalled using the manual option. A set of controls was used to resolve these questionable genotype calls. Outliers were omitted from the analysis.
Data QC. Prior to high-level statistical analyses, quality control (QC) is an important step in any genetic analysis, and especially in X-chromosome analysis. Specific QC guidelines for X chromosome genotype data have been reviewed in detail previously 8. These criteria help to detect genotyping errors and unreliable SNPs, which should be excluded from the analysis.
Here, the whole QC process was implemented in PLINK 1.9 software 19. Following the literature, two criteria concerning missing frequency were employed (the sex-specific missing frequency and the differential missingness between sexes) 32,33. As genotype calling was performed separately in males and females (that is, no heterozygote calls were allowed in males), the proportion of heterozygote calls in males, proposed as a filter criterion by Ling and Ziegler et al. 32,33, was not considered in our QC process. All SNPs (with the exception of rs11798018 and rs2073163 from the TNMD gene) passed the recommended missing frequency filter in females (≤2%) (Table 3). In contrast, no SNP passed the filter in males. Regarding the differential missingness test, three SNPs from TNMD (rs11798018, rs4828037 and rs2073163) and rs2011162 from SLC6A14 passed the recommended filter (P ≥ 10⁻⁷). The remaining SNPs showed marked differential missingness between sex groups. This test was performed in PLINK using the flag "--test-missing" after replacing the phenotype column of the ped file with the sex information (Table 3).
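The sex-specific missing frequency filter can be sketched as below. This Python example is illustrative only and is not the PLINK procedure used in the study: the genotype calls are invented, None marks a missing call, and applying the same 2% threshold to both sexes is an assumption made for the sake of the example.

```python
def missing_rate_by_sex(calls, sexes):
    """Return the fraction of missing calls (None) for each sex."""
    totals = {}
    missing = {}
    for call, sex in zip(calls, sexes):
        totals[sex] = totals.get(sex, 0) + 1
        missing[sex] = missing.get(sex, 0) + (call is None)
    return {sex: missing[sex] / totals[sex] for sex in totals}


def passes_missingness_filter(rate, threshold=0.02):
    """A SNP passes when its missing frequency is at or below the threshold."""
    return rate <= threshold


# Invented toy data for one SNP: 100 females (1 missing), 10 males (1 missing).
calls = ["AA"] * 99 + [None] + ["A"] * 9 + [None]
sexes = ["F"] * 100 + ["M"] * 10
rates = missing_rate_by_sex(calls, sexes)
for sex, rate in rates.items():
    status = "pass" if passes_missingness_filter(rate) else "fail"
    print(f"{sex}: missing = {rate:.1%} ({status})")
```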
Regarding additional MAF quality checks, all SNPs showed appropriate frequencies (>1%) in each sex group (Table 1). When Hardy-Weinberg equilibrium (HWE) was analysed in girls belonging to the normal-BMI group, all SNPs showed acceptable values (P ≥ 10⁻⁴) (Table 4). This QC process indicated that there were no major genotyping errors and that our genetic data were reliable for further analyses.
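Because only females carry two X alleles, HWE is assessed in females (here, control girls). The sketch below uses a 1-df chi-square goodness-of-fit test in Python, which is an assumption chosen for brevity; the P-values in Table 4 come from PLINK and may be based on a different (for example, exact) test. The genotype counts are invented.

```python
from math import erfc, sqrt


def hwe_chi2_females(n_aa, n_ab, n_bb):
    """Chi-square HWE test (1 df) from female genotype counts.

    Returns (chi2, p_value). Uses the identity that the upper tail of a
    chi-square distribution with 1 df at x equals erfc(sqrt(x / 2)).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p in (0.0, 1.0):
        raise ValueError("monomorphic marker: HWE test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, erfc(sqrt(chi2 / 2))


# Invented counts: the first set is in perfect equilibrium, the second has
# no heterozygotes at all.
chi2_ok, p_ok = hwe_chi2_females(25, 50, 25)
chi2_bad, p_bad = hwe_chi2_females(50, 0, 50)
print(f"equilibrium: chi2 = {chi2_ok:.2f}, P = {p_ok:.3g}")
print(f"no heterozygotes: chi2 = {chi2_bad:.2f}, P = {p_bad:.3g}")
```

A marker would be flagged under the threshold quoted in the text when its P-value falls below 10⁻⁴.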
At this point, it is important to note that, since genotyping array technologies are not specifically designed for sex chromosomes, data quality is generally expected to be lower for X chromosome variants than for autosomal variants.
High-level analysis: statistical analysis. As mentioned previously, most of the available test statistics for genetic association analyses have been designed for autosomal variants and are thus not applicable to X chromosome data (especially when dealing with mixed-sex samples). In these cases, testing for association on the X chromosome raises unique challenges that have motivated the development of X-specific statistical tests in the literature 34,35. Association tests on the X chromosome should incorporate into their models not only the dosage imbalance between males and females but also, depending on the analysed locus, a specific XCI model. Some of the available approaches include:
• Clayton tests (2008) 34. Clayton tests are two X chromosome-specific versions of the common autosomal tests that explicitly account for the XCI process and allow males and females to be included together. When allele frequencies differ between males and females, Clayton statistics have inflated type I error rates. These tests are available in the R package snpMatrix 36 under the names:
  • S1 34: It is analogous to a Cochran-Armitage trend test of a combined male and female genotype contingency table; it follows a Chi² distribution on one degree of freedom (df) under the null hypothesis.
  • S2 34: It is analogous to a Pearson's Chi² test on 2 df of a combined male and female genotype contingency table; it follows a Chi² distribution on 2 df under the null hypothesis.
• Zheng tests (2007) 35. These are a set of six different statistics that apply to the same SNP and from which a minimum P-value is computed, which needs to be adjusted according to the correlation between the test statistics. Zheng et al. 35 showed that the optimal choice among the six statistics depends on whether HWE holds at the locus and whether males and females have the same risk allele. For example, when there is departure from HWE in females, the Zheng Z²_mfG test has been presented as a good choice. For further information regarding test statistic selection, we recommend works 8,35. Of note, Zheng's tests do not explicitly account for the XCI process.
As previously mentioned, an unbalanced female/male ratio between cases and controls would affect the relative power of both Zheng and Clayton statistics. If combined with sex-specific allele frequencies, these tests will suffer from increased type I errors.
• Traditional methods easily implementable in PLINK 1.9 or the R environment:
  • Ignore males entirely and analyse female data using conventional autosomal tests (a genotype-based Cochran-Armitage trend test or an allele-based Pearson's Chi² test with 1 df). The drawback of this approach is that all data from male subjects are discarded and statistical power is therefore lost. The Cochran-Armitage trend test is the default test employed when a naive analysis of X chromosome data is run in PLINK using the flag --model 19. For males, an allele-based test comparing the number of A- and B- alleles between experimental conditions should be run separately.
  • Linear or logistic regression analyses on all the samples, adjusting by sex. This approach has the further advantage of allowing adjustment of the model for covariates of interest. Here, if we assume that the locus of interest escapes XCI, females should be coded as 0, 1 or 2 according to the number of SNP risk alleles, and males should be coded as 0 or 1 according to 0 or 1 allele copies. Conversely, if XCI is assumed to occur, females should be coded as 0, 1 or 2 according to the number of SNP risk alleles, and males should be coded as 0 or 2 according to 0 or 1 allele copies. By default, applying the "--dosage" flag to X chromosome input data files (.bed, .bim and .fam) in PLINK produces a coding that assumes escape from XCI. For XCI to be considered, the allele codes of male samples should be replaced manually with a standard text editor (e.g. gedit).
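The two coding schemes for regression described in the list above differ only in how males are coded. A minimal Python sketch follows; the function name and the "F"/"M" sex labels are illustrative assumptions.

```python
def code_genotype(risk_allele_copies, sex, xci_assumed):
    """Return the regression code for one individual.

    Females: 0, 1 or 2 according to the number of risk alleles (both models).
    Males:   0 or 1 under escape from XCI; 0 or 2 when XCI is assumed.
    """
    if sex == "F":
        if risk_allele_copies not in (0, 1, 2):
            raise ValueError("females carry 0, 1 or 2 copies")
        return risk_allele_copies
    if sex == "M":
        if risk_allele_copies not in (0, 1):
            raise ValueError("males carry 0 or 1 copy")
        return 2 * risk_allele_copies if xci_assumed else risk_allele_copies
    raise ValueError(f"unknown sex label: {sex!r}")


# A male carrying the risk allele under each model.
print(code_genotype(1, "M", xci_assumed=True))   # coded as 2
print(code_genotype(1, "M", xci_assumed=False))  # coded as 1
```

Under the XCI model, a hemizygous male carrier is thus treated as equivalent to a homozygous female carrier, whereas under the escape model he is treated as equivalent to a heterozygous female.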
In general, the selection of the most suitable test among the presented choices depends on three criteria: the XCI model assumed for the locus of interest, deviation from HWE of the analysed markers, and the existence of sex-specific allele frequencies in the study population, which would be a substantial problem in the case of an unbalanced female/male ratio. Regarding XCI, if inactivation is assumed to occur, then either Clayton's statistics or regression models (with males coded as 0 and 2 for 0 and 1 risk allele, respectively) would be the tests of choice. Conversely, for a locus 'escaping' XCI, Zheng's tests or regression models (with males coded as 0 and 1 for 0 and 1 risk allele, respectively) should be employed. In the case of sex-specific allele frequencies, regardless of the XCI model assumed, Zheng's Z²_mfG test has been presented as a better choice than the Clayton approach. Finally, if adjustment for covariates is required, only regression models can be applied. Of note, most of the test statistics and analysis considerations covered here can be implemented with the command-line toolset XWAS developed by Keinan A. and collaborators 37-39.
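The selection rules in the preceding paragraph can be summarized as a small decision function. The Python sketch below is one possible reading of those rules, not a validated procedure: the function name, argument names and returned labels are illustrative, and the HWE criterion is left out because the text gives no single rule for it.

```python
def candidate_tests(xci_assumed, sex_specific_af=False, needs_covariates=False):
    """Return candidate association tests for an X-chromosomal locus."""
    male_coding = "0/2" if xci_assumed else "0/1"
    regression = f"regression adjusted by sex (males coded {male_coding})"
    if needs_covariates:
        # Only regression models allow adjustment for covariates.
        return [regression]
    if sex_specific_af:
        # Zheng's Z2_mfG is preferred over Clayton regardless of the XCI model.
        return ["Zheng Z2_mfG test", regression]
    if xci_assumed:
        return ["Clayton S1/S2 tests", regression]
    return ["Zheng tests", regression]


# Situation described for the main analysis of this dataset: XCI assumed,
# no covariates, and no sex-specific allele frequencies taken into account.
print(candidate_tests(xci_assumed=True))
```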
Although both possibilities ('escape from XCI' and 'XCI') were tested for our dataset in the original work 13, we present here only the results under the XCI assumption. As we have seen, the selected markers in our sample did not exhibit HWE deviations or sex-specific allele frequencies. Moreover, the female/male ratio was balanced across experimental groups. For these reasons, and following published recommendations 8,17,34,40, the Clayton test was selected here for the main statistical analysis. According to an in silico simulation study, Clayton's S1 statistic has shown the best performance among all the X-specific tests introduced above across a wide range of disease models, sex ratios and allele frequencies 40. Moreover, it allows females and males to be included together, thereby increasing statistical power.

Table 5. Association between X chromosome SNPs and HOMA-IR, Glucose and BMI z-score in our dataset. SNPs in bold showed statistically significant associations with the presented phenotypes under Clayton statistics. This test explicitly accounts for random X-inactivation and allows females and males to be included together, thereby increasing statistical power. The P.1df and Chi.squared.1.df columns correspond to Clayton S1 statistic results, while P.2df and Chi.squared.2.df correspond to the Clayton S2 statistic. Abbreviations: SNP, Single Nucleotide Polymorphism; N, number of subjects included in the analysis; HOMA-IR, homeostasis model assessment for insulin resistance; BMI z-score, body mass index adjusted by sex and age.

Table 5 presents the results of applying Clayton's S1 and S2 statistics to three different continuous phenotypes of the population. All these phenotype data have also been shared and are available in the metadata file (Online-only Table 1). The analysis was implemented in R using the snpStats R package, and the code has been shared online 41. All associations reported in our previous work 13 were replicated here under the XCI assumption. These findings therefore support a good performance of the Clayton statistics and the reliability of the present dataset.
In conclusion, we share here a genetic dataset and present a complete implementation of the X chromosome-specific analytic process in genetics. Together, the pipeline and the shared data will allow researchers to become familiar with the particularities of the X chromosome and should encourage them to include the X chromosome in their genetic studies. Closing this gap is crucial to elucidate the genetic background of complex diseases, especially those with sex-specific features.

Code availability
All custom R code employed in this work has been shared online in a GitHub repository (10.5281/zenodo.2578182) 41. Two short scripts are available online: "script_from_long_to_wide.r" and "Clayton_analysis_code.r".
The first one (named "script_from_long_to_wide.r") is a short script designed for loading a genetic dataset (genotype calls) derived from OpenArray technology and transforming it into a handy-format file, which can be further imported into PLINK software. Basically, this script carries out a dataset manipulation and transformation from long to wide format. In order to run the script, users will need an input file derived from OpenArray technology containing information in the long format arranged into three columns (NCBI_SNP_Reference, Sample_ID and Genotype_Call).
The second script shared (named "Clayton_analysis_code.r") gathers functions and R commands required for the application of the X-chromosome specific statistical tests developed by Clayton and collaborators 34,36 (see section 'High-Level Analysis: Statistical Analysis' for further details).