YY1 binding association with sex-biased transcription revealed through X-linked transcript levels and allelic binding analyses

Sex differences in susceptibility and progression have been reported in numerous diseases. Female cells have two copies of the X chromosome with X-chromosome inactivation imparting mono-allelic gene silencing for dosage compensation. However, a subset of genes, named escapees, escape silencing and are transcribed bi-allelically resulting in sexual dimorphism. Here we conducted in silico analyses of the sexes using human datasets to gain perspectives into such regulation. We identified transcription start sites of escapees (escTSSs) based on higher transcription levels in female cells using FANTOM5 CAGE data. Significant over-representations of YY1 transcription factor binding motif and ChIP-seq peaks around escTSSs highlighted its positive association with escapees. Furthermore, YY1 occupancy is significantly biased towards the inactive X (Xi) at long non-coding RNA loci that are frequent contacts of Xi-specific superloops. Our study suggests a role for YY1 in transcriptional activity on Xi in general through sequence-specific binding, and its involvement at superloop anchors.


S1
Sex labeling from the Random Forest classification and the list of outliers. Table A and B list the classified sex from labelled and unlabelled samples, respectively.
In Table A, additional information about outlier samples is provided. These entries are distinguished by the column 'Label==pred' being set to FALSE. The 'Female.votes' and 'Male.votes' columns show the proportion of support for each sex within the Random Forest classification process. As XIST expression is particularly important for sex classification, the 'sumExpXIST' column reports the sum of normalized tags per million of all TSSs nearest to XIST from FANTOM5 CAGE data for reference. Table. Counts of female and male FANTOM 5 CAGE samples associated with cell ontology terms. S3 Table. Differential transcription analyses of TSSs between sexes using FANTOM5 CAGE data. Table A lists all TSSs on chrX in a decreasing order of significance of assessed differential transcription between sexes. Table B lists the TSSs on autosomes with Bonferroni-corrected p-values under 0.05 in a decreasing order of significance in differential transcription between sexes. The autosomal TSS list is significantly overrepresented with repetitive elements, for which alignment biases could be contributing to the difference (see S1 File).  TSSs of two subject genes (S), a bi-allelically transcribed escapee (Bi-E) and XIST comparing female and male cells. The bottom track shows the sex comparison with data above or below the horizontal line indicating higher level in male or female, respectively. In general, 'Bi-E' TSSs exhibit higher transcription level in female than male cells, while having similar DNAm levels between sexes. 'S' TSSs share similar transcription levels between sexes, but higher DNAm level in female cells. The XIST TSS is mainly expressed in female cells, and has higher DNAm level in male cells. Figure (C) illustrates the ChIP-seq and input reads at TSSs of the same genes in female and male cells. In general, the overall ChIP-seq read depth of a positive regulator in female cells is higher at 'Bi-E' than 'S' TSSs due to its bi-allelic activity. With only one chrX in male cells, such difference is not expected. Figure (D) illustrates the compilation of allelic read counts at where heterozygous sites are present. Xa-biased binding at 'S' (H1), relatively balanced binding at 'Bi-E' (H2), and Xi-biased binding at XIST (H3) are expected for a positive regulator. Heterozygous sites (such as H4) not within the binding peak of the positive regulatory are not informative. Regions without a heterozygous site are missing data.  Figure (A) shows the ratio of DDX3X-AS to ACTIN as detected by qPCR. The somatic cell hybrid cell lines AHA-11aB1 and t60-12 contain the mouse genome along with a human active X chromosome. T11-4Aaz5, t75-2maz34-4a and t86-B1maz1b-3a contain a mouse genome along with a human inactive X chromosome and between one and ten other human chromosomes but not including an active X chromosome. The mouse parental cell line, tsA1S9-az31b was used as a control to ensure that the DDX3X-AS primers were human specific, while the ACTIN primers were not human specific. The Y-axis is two to the power of the CT difference between DDX3X-AS and ACTIN for each cell line. Figure   axis) range from 0 (unmethylated) to 1 (fully methylated). The three TSSs are most proximal to the following genes (from top to bottom): XIST, an escapee (ZFX) and a subject gene (HMGB3). Each square represents a sample for the cancer dataset. Red or blue color represents a female or male sample, respectively. Each violin plot in gray lines shows the distributions of beta values for each sex at each probe. Plots (B) and (C) show MA plots for chrX probes and autosomal probes on chr7 between sexes, respectively. Each dot represents a probe from the array. M (difference) on y-axis is the logged differential methylation value between sexes, and A (magnitude) on x-axis is the logged average methylation value (as indicated in Methods). The fitted robust regression line is represented in gray, with the corresponding function and correlation reported. Green and red colors in plot (B) represent probes nearest to escapees and subject genes previously reported in Cotton et al. 2015. Gold and gray colors represent probes nearest to XIST and genes not in either three categories. (D) Violin plots showing the distributions of DNA methylation similarity scores between sexes for probes within 50bps of escTSSs and nondifferentially transcribed (nonDT) TSSs on chrX. The similarity score of DNA methylation on y-axis is the residual of M as a function of A on chrX. Only TSSs with at least one probe within 50bps were plotted, and for those TSSs within 50bps of multiple probes, the average similarity scores of probes were obtained. The p-value from the Wilcoxon test is reported. Table. Motif over-representation around escTSSs reported by the CAGEdoPOSSUM tool. Table A and B list all 478 motifs for vertebrates from JASPAR 2016 with the corresponding Fisher scores per motif compared to two background sets, respectively: escTSSs_bg and nonDT. Table. TF ChIP-seq peak over-representation testing around escTSSs. The table lists the significance of over-representation for each ChIP-seq data compared to both X_bg and Auto_bg backgrounds. The table is ordered according to the significance compared to X_bg.

S4 Fig. Motifs and ENCODE ChIP-seq peaks of YY1, Myc and CTCF around top escTSSs and superloop-associated lncRNAs.
Figures show 500bps up-and down-stream regions for seven of the top escapees and heterozygous sites with significant Xi-biased YY1 binding in the four lncRNAs. Only motifs with a minimum score of 85% are shown.

S6 Table. TF ChIP-seq peak over-representation testing around TSSs of escapees from X;autosome translocation studies.
The table lists the significance of over-representation for each ChIP-seq data comparing TSSs of escapees to TSSs of subjects reported in the literature. The table is ordered according to the significance.

S7 Table. Allelic reads in YY1 ChIP-seq on Xi and Xa at heterozygous sites within YY1 peaks in the GM12878 cell line.
The table listed 67 heterozygous sites in GM12878 that were within uniformly processed YY1 ChIP-seq peaks from UCSC. Allelic read counts reported were sum of reads from the duplicated data generated by HudsonAlpha Institute for Biotechnology. Table. Allelic reads at heterozygous sites within merged TF peaks in all ChIP-seq data available in the GM12878 cell line.

S5 Fig. Combined allelic imbalance of all 1321 heterozygous sites within merged TF peaks on chrX in GM12878.
Scatter plot showing the log2 ratios of total number of reads on Xa relative to Xi at 1321 heterozygous sites within merged TF peaks from a total of 101 ChIP-seq and DNase I data sets generated by the ENCODE consortium from the GM12878 cell line. Where replicates were provided, reads at Xi and Xa were summed separately. Datasets, represented as squares, are arranged in increasing order of log2 ratio. The dashed line displays the baseline where there is no overall allelic imbalance. The names of outlying data sets, those exceeding ±2 standard deviations from the mean, are labeled. Each square/data is colored in quartiles of the total read counts at all 1321 heterozygous sites.

S9 Table. Significance of allelic imbalance from Fisher's exact test and FDR correction.
The table lists pairs of heterozygous sites and datasets with significant allelic imbalance. Positive or negative log2Odds values indicate Xa-or Xi-biased occupancy of the data at the corresponding heterozygous site, respectively.

Cross-reactivity bias of differentially transcribed TSSs and differentially methylated probes on the autosomes
In addition to chrX, sex-specific expression has been previously identified on autosomes in human pancreatic islets, primary T cells, peripheral blood, and brain 1-4 . Applying our differential transcription analysis to autosomes, we found 48 TSSs that were significantly differentially transcribed between sexes on the autosomes using CAGE data (Bonferronicorrected p-values ≤ 0.05; Supplementary Table S3B). However, these autosomal TSSs were significantly enriched with repetitive elements (21 out of 61 as opposed to 10 out of 103 TSSs on chrX without the repetitive TSS filter; one-sided Fisher's Exact Test p= 0.0001).
The GTEx study also reported differentially expressed autosomal genes between sexes 5 , and the only common gene we found was FRG1B. An overlap between the nearest genes to both differentially transcribed TSSs and differential methylated probes across four cancer types (see methods) revealed two autosomal genes, FRG1B and PSMA6. However, probes for the two genes were confounded with cross-reactivity reported in Price et al. 6 . The PSMA6 probes aligned to the sex chromosomes, whereas FRG1B probes can align to multiple places on autosomes. The significant overlap with repetitive elements and cross-reactivity collectively revealed potential bias in conducting sex analysis on the autosomes.
Cross-reaction and multiple aligning hits to the genome of probes in Illumina 450k DNA methylation arrays reported in the literature 6-8 can induce bias to sex analyses, especially when autosomal TSSs or probes align to sex chromosomes. Due to cross-reactivity of probes and the significant enrichment in overlapping repetitive elements, we found our CAGE and DNA methylation analyses on autosomes to be strongly biased. Despite previous success in identifying sex-specific expression on autosomes with relevant functions, the differentially transcribed autosomes TSSs we identified using CAGE datasets significantly overlapped with repetitive elements or had cross-reactivity to multiple places in the genome when compared to differentially methylated probes. We note that the successful studies focused on specific cell types, whereas our model was geared to identify differentially expressed autosomal gene across cell types. This is consistent with the previous report of substantial tissue specificity of sex-biased autosomal genes 3 . Interaction terms between sex and cell category covariates can be introduced to identify cell type-specific sex differences given sufficient number of samples.
Overall Xa-biased occupancy in the female GM12878 cell line