Landscape of allele-specific transcription factor binding in the human genome

Abramov, Sergey; Boytsov, Alexandr; Bykova, Daria; Penzar, Dmitry D.; Yevshin, Ivan; Kolmykov, Semyon K.; Fridman, Marina V.; Favorov, Alexander V.; Vorontsov, Ilya E.; Baulin, Eugene; Kolpakov, Fedor; Makeev, Vsevolod J.; Kulakovskiy, Ivan V.

doi:10.1038/s41467-021-23007-0

Published: 12 May 2021

Nature Communications volume 12, Article number: 2751 (2021)

Abstract

Sequence variants in gene regulatory regions alter gene expression and contribute to phenotypes of individual cells and the whole organism, including disease susceptibility and progression. Single-nucleotide variants in enhancers or promoters may affect gene transcription by altering transcription factor binding sites. Differential transcription factor binding in heterozygous genomic loci provides a natural source of information on such regulatory variants. We present a novel approach to call the allele-specific transcription factor binding events at single-nucleotide variants in ChIP-Seq data, taking into account the joint contribution of aneuploidy and local copy number variation, that is estimated directly from variant calls. We have conducted a meta-analysis of more than 7 thousand ChIP-Seq experiments and assembled the database of allele-specific binding events listing more than half a million entries at nearly 270 thousand single-nucleotide polymorphisms for several hundred human transcription factors and cell types. These polymorphisms are enriched for associations with phenotypes of medical relevance and often overlap eQTLs, making candidates for causality by linking variants with molecular mechanisms. Specifically, there is a special class of switching sites, where different transcription factors preferably bind alternative alleles, thus revealing allele-specific rewiring of molecular circuitry.

Introduction

Sequence variants located in noncoding genome regions attract increasing attention from researchers due to their frequent association with various traits, including predisposition to diseases^1,2. Single-nucleotide variants (SNVs) in gene regulatory regions may affect gene expression³ by altering binding sites of transcription factors (TFs) in gene promoters and enhancers and, consequently, the efficiency of transcription⁴.

On the one hand, parallel reporter assays allow massive assessment of variants in terms of gene expression alteration^5,6 but do not reveal particular TFs involved. On the other hand, there are multiple ways to assess if a single-nucleotide substitution changes TF-binding affinity, from detailed measurements of the TF affinity landscape in vitro^7,8 to conventional experiments on individual sequence variants^9,10 and computational modeling^11,12,13. However, it is not trivial to utilize these data for annotating SNV effects at the genome-wide scale in a cell type-specific manner.

The functional effect of single-nucleotide substitutions can be studied in heterozygous chromosome loci, where TFs differentially bind to sites in homologous chromosomes with alternative SNV alleles. Reliable evidence comes from modern in vivo methods based on chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq). ChIP-Seq provides a deep read coverage of TF-binding regions, and non-perfect alignments of reads often carry single-nucleotide mismatches arising from heterozygous sites. Statistical biases between the numbers of mapped reads containing alternative SNV alleles reveal the so-called allele-specific binding events^1,14 (ASB, Fig. 1a).

**Fig. 1: A scheme of allele-specific binding events, an overview of the ADASTRA pipeline, and its application to ChIP-Seq data.**

Chromatin accessibility often serves as a proxy for the regulatory activity of a genomic region¹⁵. Massive assessment of allele-specific chromatin accessibility in more than 100 cell types¹⁶ reported more than 60 thousand significantly imbalanced sites. Yet, so far, only 10–20 thousand ASBs were reported per study (Supplementary Table 1), and the potentially vast landscape of allele-specific TF binding remains mostly unexplored.

Reliable identification of ASBs (the ASB calling) requires high read coverage at potential sites, which results either from deep sequencing of individual ChIP-Seq libraries or from data aggregation across multiple experiments. Reprocessed ChIP-Seq data for hundreds of TFs and cell types are available in databases such as GTRD¹⁷ and ReMap¹⁸, opening a way to an integrative meta-analysis, which could yield raw statistical power to detect cell type- and TF-specific ASBs.

Straightforward meta-analysis of the ASBs has two major limitations. First, many ChIP-Seq data sets are obtained in aneuploid cell lines, and copy-number variants (CNVs) are common even for normal diploid cells. Both the chromosome multiplication and local CNVs affect the expected read coverage of the respective genomic regions¹⁹ and bring about imbalanced read counts at SNVs, possibly generating false-positive ASB calls (Fig. 1a). There exist strategies to reduce this bias (Supplementary Table 2), in particular, the known CNV regions can be filtered out²⁰ or predicted from a computational analysis of the corresponding genomic DNA^21,22 (which is often used as the ChIP-Seq control sample) and incorporated in statistical criteria when evaluating the potential ASB calls¹⁹. However, in many published experiments, the input DNA data control was omitted in favor of other controls, such as preimmune IgG, or had a limited sequencing depth making it useless for CNV predictions. Furthermore, currently, there are no systematic data on global (chromosome duplications) and local (CNVs) structural variations across all cell types with public ChIP-Seq data on TFs. Even when the external data on structural variation are available for particular cells, it is not guaranteed that the same estimates would be valid for ChIP-Seq data obtained elsewhere, since long-cultivated immortalized cell lines might keep accumulating unreported differences in genome dosage across chromosomes²³.

The second major problem in ASB calling is the so-called reference read mapping bias^21,24. Standard read alignment tools generally map more reads to the alleles present in the reference genome assembly, as such mapping has lower or no mismatch penalties. To account for the reference read mapping bias, an ideal scenario involves mapping to individually reconstructed genomes^22,25 or computational simulations²⁰ that provide estimates of mapping probabilities to alternative alleles separately for each SNV (see Supplementary Table 2 for an overview). Yet, these solutions are not applicable to premade read alignments (which are usually obtained with a simple reference genome) and hardly applicable to understudied cell types or particular samples that do not provide enough data to reconstruct an individual genome.

In this work, we present a novel framework for ASB calling from existing read alignments or premade variant calls, accounting for the allelic dosage of aneuploidy and CNVs, and read mapping bias. With this framework, we have performed a comprehensive meta-analysis to identify ASBs in the human ChIP-Seq data from the GTRD database¹⁷. The database of Allelic Dosage-corrected Allele-Specific human Transcription factor binding sites (ADASTRA, http://adastra.autosome.ru) provides ASB events across 674 human TFs (including epigenetic factors) and 337 cell types. We demonstrate that the single-nucleotide polymorphisms (SNPs) with ASBs often overlap expression quantitative trait loci (eQTLs) and exhibit associations with various normal and pathologic traits. A comparison of data for multiple TFs highlights the cases where different TFs preferentially bind to different alleles, i.e., when a single-nucleotide substitution can change an entry point of the involved regulatory pathway. Finally, we discuss selected cases where the ASB at SNPs reveals molecular mechanisms of associations between SNPs and important medical phenotypes.

Results

We present a reproducible workflow for ASB calling and meta-analysis across human TFs and cell types (Fig. 1b). First, the variants are called from premade ChIP-Seq read alignments against the hg38 genome assembly. Next, the variant calls are filtered by excluding homozygous and low-covered variants (<5 reads supporting any of two alleles), as well as variants absent from the dbSNP²⁶ common subset (as putative de novo point mutations). The filtered SNVs from related ChIP-Seq data sets (sharing the cell type and particular wet lab) are used to identify the cell type features (aneuploidy and CNVs). A total set of variants is used to assess the global read mapping bias that is used as the basis for statistical model parametrization. Finally, ASB calling is performed separately for each ChIP-Seq experiment, and the resulting allele read bias P values are aggregated using the George–Mudholkar’s method²⁷ for each SNV, either at the TF level (across ChIP-Seq data for a selected TF from all cell types) or the cell type level (across ChIP-Seq data for a selected cell type for all TFs).
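The variant filtering step can be sketched as follows. This is a minimal illustration and not the ADASTRA code: the `VariantCall` record, its field names, and the reading of the coverage rule as "at least 5 reads on each allele" are assumptions made for the example.

```python
from dataclasses import dataclass

MIN_ALLELE_READS = 5  # threshold from the text, applied to each allele (assumption)


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical record for one variant call in one ChIP-Seq data set."""
    rs_id: str          # dbSNP identifier ("" if the variant has none)
    ref_reads: int      # reads supporting the reference allele
    alt_reads: int      # reads supporting the alternative allele
    heterozygous: bool  # genotype reported by the variant caller


def keep_variant(v, common_ids):
    """Return True if the call passes the three filters described above."""
    if not v.heterozygous:
        return False  # homozygous calls carry no allelic information
    if min(v.ref_reads, v.alt_reads) < MIN_ALLELE_READS:
        return False  # low-covered variant
    return v.rs_id in common_ids  # drop putative de novo point mutations


common = {"rs1", "rs2", "rs3"}  # stand-in for the dbSNP common subset
calls = [
    VariantCall("rs1", 12, 9, True),   # passes all filters
    VariantCall("rs2", 30, 3, True),   # too few reads on the Alt allele
    VariantCall("rs3", 25, 0, False),  # homozygous
    VariantCall("", 10, 10, True),     # absent from dbSNP common
]
kept = [v.rs_id for v in calls if keep_variant(v, common)]
```

Only the first call survives in this toy input; the other three each fail one filter.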

We used the workflow to process 7669 ChIP-Seq read alignments from GTRD covering 1025 human TFs and 566 cell types, and detected more than 200 thousand ASBs at more than 200 thousand SNPs for various TFs and 300 thousand ASBs for cell types passing the Benjamini–Hochberg (FDR) adjusted P value threshold of 0.05 (see Fig. 1c, d for an overview). Reaching these numbers has become possible because of the large volume of the starting data (the filtered list of considered variant calls contained more than 54 million entries) and the advanced statistical framework that we describe below. An overview of the processed data sets and variant calls per TF and cell type is shown in Supplementary Fig. 1.

Estimating background allelic dosage (BAD) from single-nucleotide variant calls

ASB is assessed against expected relative frequencies of reads supporting alternative alleles of a particular SNV in a particular genomic region. Assuming there was no read mapping bias, these expected frequencies would be mostly determined by the copy number of the respective genomic segments. In this study, we estimated the joint effect of local copy-number variation and global chromosome ploidy from the read counts at SNV calls, taking into account that the background for ASB calling is defined by the expected relative frequencies of the read counts supporting alternative alleles rather than by absolute allelic copy numbers.

We introduce BAD as the ratio of the major to minor allele dosage in the particular genomic segment, which depends on chromosome structural variants and aneuploidy. BAD can be estimated from the number of reads mapped at each allelic variant and does not require haplotype phasing. For example, if a particular genomic region has the same copy number of both alleles, e.g., 1:1 (diploid), 2:2, or 3:3, then it has BAD = 1, i.e., the expected ratio of reads mapped to alternative alleles on a heterozygous SNV is 1. All triploid regions have BAD = 2, and the expected allelic reads ratio is either 2 or ½. In general, if BAD of a particular region is known, then the expected frequencies of reads supporting alternative alleles are 1/(BAD + 1) and BAD/(BAD + 1).
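The relation between allelic copy numbers, BAD, and the expected read frequencies can be written down directly. The sketch below uses exact fractions; the function names are illustrative and not part of the published pipeline.

```python
from fractions import Fraction


def bad_from_copy_numbers(copies_a, copies_b):
    """BAD is the ratio of the major to the minor allelic copy number."""
    major, minor = max(copies_a, copies_b), min(copies_a, copies_b)
    if minor <= 0:
        raise ValueError("both alleles must be present at a heterozygous site")
    return Fraction(major, minor)


def expected_allele_fractions(bad):
    """Expected read fractions for the minor and major allele: 1/(BAD+1), BAD/(BAD+1)."""
    bad = Fraction(bad)
    return 1 / (bad + 1), bad / (bad + 1)


diploid = bad_from_copy_numbers(1, 1)    # 1:1 gives BAD = 1
triploid = bad_from_copy_numbers(2, 1)   # 2:1 gives BAD = 2
print(expected_allele_fractions(triploid))
```

Balanced regions such as 2:2 or 3:3 also map to BAD = 1, which is why BAD rather than the absolute copy number defines the background for ASB calling.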

Importantly, accounting for BAD provides an answer to the question of the necessity of overdispersion in the statistical evaluation of ASBs^19,22. In fact, a large portion of overdispersion of read counts disappears once the variant calls are segregated according to BADs of the respective genomic segments (see “Methods”).

BAD calling with Bayesian changepoint identification

In this study, we present a novel method for reconstructing a genome-wide BAD map of a given cell type. The idea is to find genomic regions with approximately stable BAD using the read counts at SNV calls. Assuming that both differential chromatin accessibility and sequence-specific TF binding affect only a minor fraction of variants, the read counts for most of the SNVs must be close to equilibrium and thus provide imprecise but multiple measurements of BAD.

We have developed a Bayesian changepoint identification algorithm, which (1) segments the genomic sequence into regions of the constant BAD using dynamic programming to maximize the marginal likelihood and then (2) assigns BAD with the maximal posterior to each segment (see “Methods”). An additional preprocessing employs distances between neighboring SNVs to exclude long deletions and centromeric regions from BAD estimation. The BAD caller in action is illustrated in Fig. 2a for two chromosomes using ENCODE K562 data (see the segmentation map of the complete genome with multiple deletions in Supplementary Fig. 2).
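To make the two steps concrete, here is a deliberately simplified sketch of segmentation by dynamic programming followed by per-segment BAD assignment. It is not the published algorithm: the reduced set of candidate BAD values, the uniform prior over them, the 50/50 binomial mixture used as the per-SNV likelihood, and the absence of an explicit prior on the number of boundaries are all assumptions made for illustration.

```python
import math

BAD_STATES = (1.0, 2.0, 3.0)  # hypothetical, reduced set of candidate BAD values


def snv_loglik(ref, alt, bad):
    """Log-likelihood of one SNV under an equal mixture of both allelic orientations.

    The binomial coefficient is dropped because it does not depend on BAD.
    """
    p = 1.0 / (bad + 1.0)
    a = ref * math.log(p) + alt * math.log(1.0 - p)
    b = alt * math.log(p) + ref * math.log(1.0 - p)
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def segment_bad(counts, states=BAD_STATES):
    """Split ordered (ref, alt) counts into half-open segments (start, end, bad)."""
    n = len(counts)
    # prefix[k][i]: summed log-likelihood of SNVs 0..i-1 under states[k]
    prefix = [[0.0] * (n + 1) for _ in states]
    for k, bad in enumerate(states):
        for i, (ref, alt) in enumerate(counts):
            prefix[k][i + 1] = prefix[k][i] + snv_loglik(ref, alt, bad)
    log_prior = -math.log(len(states))  # uniform prior over BAD states

    def segment_logs(i, j):
        return [prefix[k][j] - prefix[k][i] for k in range(len(states))]

    # Step 1: maximize the summed log marginal likelihood over all segmentations.
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + log_prior + logsumexp(segment_logs(i, j))
            if score > best[j]:
                best[j], back[j] = score, i

    # Step 2: trace back and give each segment its maximum-posterior BAD.
    segments = []
    j = n
    while j > 0:
        i = back[j]
        logs = segment_logs(i, j)
        segments.append((i, j, states[logs.index(max(logs))]))
        j = i
    return segments[::-1]


# Toy chromosome: 30 balanced SNVs followed by 30 SNVs with a 2:1 imbalance.
counts = [(20, 20)] * 30 + [(30, 15), (15, 30)] * 15
segments = segment_bad(counts)
```

In this sketch the marginalization over BAD states acts as a built-in penalty against extra boundaries, since every additional segment pays the prior once more. The quadratic running time is acceptable for a toy input but would need pruning or block-wise processing on real chromosomes.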

**Fig. 2: Bayesian changepoint identification allows reconstructing reliable genome-wide maps of background allelic dosage from single-nucleotide variant calls.**

We performed the BAD calling for 2556 groups of variant calls, where each group consisted of calls obtained from ChIP-Seq alignments for a particular cell type and GEO series or ENCODE biosample ID (i.e., for K562 cells of different studies, the BAD calling was performed independently). In BAD calling, recurrent SNVs sharing dbSNP IDs and found in different data sets within the same group were considered as independent observations. To systematically assess the reliability of the resulting BAD maps, we compared the predicted BADs at all SNVs with the ground truth BADs estimated from COSMIC²⁸ CNV data for 76 matched cell types, with K562 and MCF7 being the most represented. For K562 and multiple other cell types, the Kendall τ_b rank correlation was consistently better for joint data sets with higher numbers of SNVs (Fig. 2b), which justifies the usage of read counts at SNVs as point measurements of BAD.
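For readers unfamiliar with the statistic, Kendall's τ_b is a rank correlation that corrects for ties, which matters here because BAD values are drawn from a small discrete set. The following plain-Python sketch computes it for paired per-SNV values; the quadratic loop is for clarity only, and the variable names are illustrative.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences, with tie correction."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one sequence is constant")
    return (concordant - discordant) / denom


predicted_bad = [1, 1, 2, 2, 3]  # toy BAD calls at five SNVs
reference_bad = [1, 2, 2, 2, 3]  # toy reference values at the same SNVs
tau = kendall_tau_b(predicted_bad, reference_bad)
```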

Genome structural variations are the most likely yet not the only reason for unbalanced allelic dosage in a particular genomic region. In our case, the agreement of BAD and COSMIC copy-number maps confirms the validity of BAD estimates. However, even suboptimal agreement between a BAD map and the copy-number profile is not a problem as long as the allelic dosage is estimated correctly.

In particular, we found that BAD maps of MCF7 agreed poorly with COSMIC regardless of the number of SNVs in the data set. To clarify the issue, we processed external deep genomic sequencing data for MCF7 with the ADASTRA pipeline (see “Methods”). The resulting BAD map from these data was not dependent on the ChIP procedure but agreed reasonably well with the MCF7 BAD maps from ChIP-enriched data sets, thus validating the ChIP-Seq-based BAD maps for MCF7 cells (see Supplementary Fig. 2).

Of note, the ChIP-independent BAD map for MCF7 still rather poorly agreed with the COSMIC copy numbers markup (SNP-level Kendall τ_b ~0.2), suggesting that for MCF7 the latter is likely an inadequate proxy for the actual BAD. We have no ultimate explanation for this observation but would like to remark that MCF7 was found among the most unstable cell types²⁹, which probably leads to discrepancies between exact CNV profiles and BAD estimates for cells originating from different studies.

Additionally, we have analyzed microarray-based CNV estimates for major cell types²⁹, including 13 cell types matching across these data, COSMIC, and our study (Supplementary Fig. 3). Interestingly, when compared to COSMIC, those data showed a higher correlation for MCF7 than for K562 cells. To a varying degree, such discrepancies can be observed for other cell types. Thus, uncritical use of copy-number profiles obtained with different methods from different data sources as estimates of BAD may reduce the reliability of called ASBs, a disadvantage that is avoided by estimating BAD directly from ChIP-Seq data.

As a separate test, we used the predicted BAD maps as multiple binary classifiers for different BAD values using SNP calls across all cell types. With the COSMIC data as the ground truth, we plotted a receiver operating characteristic (ROC) and a precision-recall curve (PRC) for each BAD (Fig. 2c, d). For the most widespread BADs (1–3) covering more than 90% of candidate SNVs (Supplementary Fig. 3), we reached an area under the curve of >0.83 for ROC and 0.66–0.85 for PRC (Supplementary Table 3), supporting the reliability of the predicted BAD maps.

With BAD maps at hand, we segregated the variant calls from all data sets by BAD and by fixed read coverage either at reference or alternative alleles. Then, for each such set of SNVs, we fitted the background distribution as a mixture of two negative binomial distributions with BAD-determined p parameters (see “Methods”). ASBs were called independently for the reference (Ref-ASB) and the alternative (Alt-ASB) allele using separately fit background distributions for the fixed read counts at alternative and reference alleles, respectively, thus accounting for general read mapping bias.
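The scoring of a single SNV against such a background can be illustrated with a small sketch. It assumes that, for a fixed read count on one allele, the count on the tested allele follows a two-component negative binomial mixture whose success probabilities are 1/(BAD + 1) and BAD/(BAD + 1). The mixture weight of 0.5 is a placeholder, and the fitted parameters and any further adjustments described in “Methods” are omitted.

```python
import math


def nb_pmf(x, r, q):
    """P(X = x) for x reads on the tested allele seen alongside r reads on the fixed
    allele, when each read supports the tested allele with probability q."""
    log_p = (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
             + x * math.log(q) + r * math.log(1.0 - q))
    return math.exp(log_p)


def mixture_tail_pvalue(x_obs, r_fixed, bad, weight=0.5):
    """Right-tail P(X >= x_obs) under the two-component mixture (weight is hypothetical)."""
    q_low, q_high = 1.0 / (bad + 1.0), bad / (bad + 1.0)
    below = sum(
        weight * nb_pmf(x, r_fixed, q_high) + (1.0 - weight) * nb_pmf(x, r_fixed, q_low)
        for x in range(x_obs)
    )
    return max(0.0, 1.0 - below)


# 40 reads on the tested allele against 10 on the other one:
p_if_balanced = mixture_tail_pvalue(40, 10, bad=1)  # surprising when BAD = 1
p_if_bad_3 = mixture_tail_pvalue(40, 10, bad=3)     # far less surprising when BAD = 3
```

The same 40:10 read split is strong evidence of ASB in a balanced region but close to the expectation in a region with BAD = 3, which is the reason for conditioning the background on BAD.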

Overview of the ADASTRA database

The results of the ASB calling are provided in the ADASTRA database (the database of Allelic Dosage-corrected Allele-Specific human Transcription factor binding sites). In ADASTRA, each dbSNP ID can have several ASB entries for different TFs or cell types. ADASTRA consists of two parts. The first part (TF-ASB, 233290 ASBs at 147909 SNPs) contains ASBs obtained by aggregation of individual P values for each TF over cell types. The listed ASBs passed multiple testing correction (P < 0.05 after Benjamini–Hochberg adjustment for the number of tested ASBs). P value estimation (see below), aggregation, and multiple testing correction were performed separately for ASBs with preferred binding to the reference (Ref-ASB) and alternative (Alt-ASB) alleles, and for each TF. The other part of the database (CT-ASB, 351967 ASBs at 252469 SNPs) contains a similar aggregation of individual ASBs over TFs for each cell type.
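The Benjamini–Hochberg adjustment used for the final ASB lists is a standard procedure; a compact reference implementation is given below for orientation. The input P values are made up, and in the setting described above the function would be applied separately to each TF (or cell type) and to Ref-ASB and Alt-ASB candidates.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


aggregated_p = [0.01, 0.04, 0.03, 0.005]  # toy aggregated P values for four SNVs
fdr = benjamini_hochberg(aggregated_p)
significant = [p < 0.05 for p in fdr]
```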

TFs and cell types were unequally represented in the source data. Thus, the numbers of the resulting ASB calls were also biased toward most studied cell types and TFs (Fig. 3a, b), with the top contributions from CTCF for TFs and K562 for cell types. However, the top 8 TFs and top 5 cell types covered only half of ASB calls (for cell types) or less than a half of ASB calls (for TFs); thus, the produced data on ASB events is diverse across different samples.

**Fig. 3: An overview of the ADASTRA ASBs and their genomic localization.**

Next, we assessed how ASBs and candidate SNVs are distributed in different genomic regions (Fig. 3c). Compared to all SNVs and tested candidate ASB sites, the significant ASBs were enriched in enhancers (~4x more than expected from the number of SNVs for which there were candidate ASBs, Fisher’s exact test P < 10⁻³⁰⁰) and promoters (~3x more than expected, P < 10⁻³⁰⁰). We consider this observation consistent with both the actual location of functional TF-binding sites and deeper coverage of the actual TF-binding regions with ChIP-Seq reads. In fact, ASBs are likely to cluster at the scale of the typical ChIP-Seq peak width, as revealed by the distribution of pairwise distances between SNVs with and without ASBs, which has a bimodal shape (Supplementary Fig. 4). This effect is likely caused by peak-scale clustering of ChIP-Seq reads allowing for higher sensitivity of both SNP calling and ASB calling in the vicinity of ChIP-Seq peak summits.

We also compared the SNPs listed in ADASTRA with those of the previous ASB collections (Supplementary Fig. 5). ADASTRA includes ASBs at 38% of dbSNP SNPs reported as ASBs in AlleleDB²², at 44% and 57% of those in the collections published in refs. ^20,30, and at 64% of those in ref. ¹⁹. We additionally assembled a reproducible ASB set consisting of 2039 SNPs with ASBs detected in any two of those four ASB sets and found that ADASTRA included 1573 (77%) of the respective SNPs. Of note, taken pairwise, these four existing ASB data sets also overlap each other poorly (see Supplementary Table 4), suggesting that the major fraction of ASBs is non-reproducible between studies and arises either from particular ChIP-Seq data sets or from unique procedures of different ASB calling pipelines.

To study in detail why ADASTRA failed to capture ASBs found in other studies, we used the set of ASB SNPs identified by one of the most advanced methods for ASB calling, BaalChIP¹⁹. An ASB event could be missed at the SNP calling stage, could fail to pass the read coverage thresholds, or could fail to pass the significance threshold for the FDR-corrected P value estimated against BAD. To assess the contribution of different stages of our pipeline to ASB calling sensitivity, we performed a stage-by-stage analysis of the underlying SNP set (see Fig. 3d). It turned out that the fraction of BaalChIP ASB SNPs recovered by ADASTRA differed between cell types, with most ASBs recovered for the cell types with the deepest sequencing coverage.

On the one hand, the basic coverage filters significantly reduced the number of SNPs under consideration, resulting in a major loss in the fraction of recovered ASBs. On the other hand, we did not observe critical effects from any of the subsequent stages. For all cell types, the number of BaalChIP ASB SNPs recovered by ADASTRA decreased monotonically, suggesting that there was no particular bottleneck defining the sensitivity of the whole pipeline except the basic coverage filters. As more sites were recovered for cell types with better coverage, one can predict that the difference between ASB calling pipelines will decrease as more ChIP-Seq data become available for analysis.

In general, there is an overlap between ADASTRA ASBs and the existing data on regulatory SNPs, including sites of allele-specific DNA accessibility¹⁶ and reporter assay quantitative trait loci⁶ (Supplementary Fig. 5), but the vast majority of ADASTRA data are novel.

Given the diversity of assessed TFs, it became possible to systematically compare SNVs carrying TF ASBs and identify the pairs of TFs preferring to share ASBs (Supplementary Fig. 6). Indeed, hundreds of TF pairs are significantly enriched for common ASBs (one-tailed Fisher’s exact test P value from 0.05 to 10⁻³⁰⁰ upon correction for multiple tested TF pairs with the Benjamini–Hochberg procedure). As a rule, shared ASBs were not related to interacting TFs (considering protein–protein interactions from STRING-db³¹). However, there was a systematic overlap between ASBs for chromatin-interacting epigenetic factors and related proteins, suggesting that many of the shared events are “passengers” in regions of allele-specific chromatin accessibility with TFs bound only to the accessible chromosome. Still, some interacting proteins (such as CTCF-RAD21) strongly prefer to share ASBs, and the same holds for particular composite elements of binding sites such as AR-FOXA1³².
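The enrichment test for a single TF pair can be sketched with an exact hypergeometric tail. The 2×2 layout below (SNVs with ASBs of both TFs, of only one of them, or of neither) and the choice of SNVs tested for both TFs as the background are assumptions made for illustration; the counts are invented.

```python
from math import comb


def fisher_enrichment_pvalue(both, only_a, only_b, neither):
    """One-tailed Fisher's exact test: probability of at least `both` shared SNVs."""
    n = both + only_a + only_b + neither
    with_a = both + only_a  # SNVs carrying an ASB of TF A
    with_b = both + only_b  # SNVs carrying an ASB of TF B
    upper = min(with_a, with_b)
    tail = sum(
        comb(with_a, k) * comb(n - with_a, with_b - k)
        for k in range(both, upper + 1)
    )
    return tail / comb(n, with_b)


# Toy table: 3 shared ASB SNVs, 1 specific to each TF, 3 with no ASB for either.
p_shared = fisher_enrichment_pvalue(both=3, only_a=1, only_b=1, neither=3)
```

With many TF pairs tested, the resulting P values would then go through a multiple testing correction, as stated in the text.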

Motif annotation is concordant with ASB calls

For TFs specifically interacting with DNA, it is possible to perform computational annotation of ASBs with TF-recognized sequence motifs³³. When a strong binding site overlaps an ASB SNP and the alternating alleles directly change the key nucleotides in the TF-binding DNA sequence, this SNP likely leads to different TF-binding affinity at the sites on homologous chromosomes, which directly produces the ChIP-Seq allelic imbalance. We call such events “driver” ASBs to distinguish them from side effects of piggyback TF binding and chromosome-specific local chromatin accessibility, which are examples of “passenger” ASBs. Motif annotation highlights the driver ASBs and allows comparing the observed ASB effect (the allelic imbalance) with the effect predicted by sequence analysis (the difference in binding specificity reflected in the motif prediction scores), providing an independent evaluation of the reliability of ASB calls.

An ASB was considered as overlapping the TF motif occurrence if the TF position weight matrix (PWM) scored a hit with P ≤ 0.0005 for any of the two alleles. The log ratio of P values corresponding to PWM hits at alternative alleles was used as an approximation of the TF affinity fold change (FC). Fig. 4a compares the ASB significance (X-axis, signed log₁₀ FDR; the sign set positive for Alt-ASBs and negative for Ref-ASBs) with the log ratio of motif hits P values (Y-axis) for 218 TFs having at least 1 ASB within a motif hit. Predominantly, at heterozygous sites, alleles with more specific motif hits are covered with more ChIP-Seq reads, revealing the prevalence of motif-concordant ASB events (blue dots in Fig. 4a). Such concordance persists for more than 80% of SNVs with ASB allelic imbalance FDR < 5%, growing with decreasing ASB FDR and saturating at about 90% of SNVs (Fig. 4b). At 5% FDR, good motif concordance stands for many TFs, as illustrated by the top 10 TFs with the highest number of motif hits at ASBs (Fig. 4c). Importantly, even at larger FDR, there are more concordant than discordant ASBs.
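The concordance check described above can be expressed as a few lines of code. The sketch uses the P ≤ 0.0005 motif hit threshold from the text; the sign convention of the fold change (positive when the Alt allele has the stronger hit), the function names, and the omission of any additional fold change cutoff are assumptions.

```python
import math

MOTIF_P_THRESHOLD = 0.0005  # hit threshold from the text, applied to the better allele


def motif_fold_change(p_ref, p_alt):
    """log2(P_ref / P_alt): positive when the Alt allele has the stronger motif hit."""
    return math.log2(p_ref / p_alt)


def classify_asb(p_ref, p_alt, preferred_allele):
    """Label an ASB as concordant or discordant with the motif prediction.

    preferred_allele is "ref" or "alt": the allele covered by more ChIP-Seq reads
    after accounting for BAD.
    """
    if min(p_ref, p_alt) > MOTIF_P_THRESHOLD:
        return "no motif hit"
    fc = motif_fold_change(p_ref, p_alt)
    if fc == 0:
        return "neutral"
    motif_prefers = "alt" if fc > 0 else "ref"
    return "concordant" if motif_prefers == preferred_allele else "discordant"


label = classify_asb(p_ref=1e-3, p_alt=1e-5, preferred_allele="alt")
```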

**Fig. 4: Motif annotation of SNPs agrees with TF-ASB calls.**

Yet, for ~10–20% of SNVs, the motif hit odds ratios are discordant with the allelic imbalance (corrected for BAD), that is, more reads are attracted to the weaker motif hit (red dots in Fig. 4a and red bars in Fig. 4c). We believe that in such cases the allelic imbalance arises from other contributors (allele-specific chromatin accessibility or indirect TF binding), which override the sequence-specific TF affinity. Also, we use the motif prediction scores as a proxy for the TF-binding affinity, and it is possible that the observed limited discordance partly reflects the imperfection of the motif models used.

To quantify ASB allelic imbalance for BAD other than one, we defined the ASB effect size (ES) as follows (see “Methods” for details). For individual SNV (SNV in a single data set):

$$\begin{aligned}
\mathrm{ES}_{\mathrm{Ref}} &= \log_2\left(\frac{C_{\mathrm{Ref}}}{E\left(C_{\mathrm{Ref}} \mid C_{\mathrm{Alt}}\right)}\right), \\
\mathrm{ES}_{\mathrm{Alt}} &= \log_2\left(\frac{C_{\mathrm{Alt}}}{E\left(C_{\mathrm{Alt}} \mid C_{\mathrm{Ref}}\right)}\right)
\end{aligned}$$

Here C_Ref and C_Alt are the read counts at the Ref and Alt alleles, and E is the expectation. For BAD = 1: ES_Ref ≈ log₂(C_Ref/C_Alt).

The aggregated ES of an ASB is calculated as a weighted mean of ES values for the same allele for SNVs aggregated at the same genome position over TFs or cell types, with weights equal to negative logarithms of individual P values, separately for each of the alleles.
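A short sketch of the aggregation step is given below. It covers only the BAD = 1 approximation of the individual ES stated above and the weighted mean; the conditional expectation needed for other BAD values is described in “Methods” and is not reproduced here. Function names and input values are illustrative.

```python
import math


def effect_size_bad1(count_tested, count_other):
    """Approximate ES of the tested allele in a BAD = 1 region: log2 of the count ratio."""
    return math.log2(count_tested / count_other)


def aggregated_effect_size(es_values, pvalues):
    """Weighted mean of per-data-set ES values with weights -log(P)."""
    weights = [-math.log10(p) for p in pvalues]
    total = sum(weights)
    if total == 0:
        raise ValueError("all P values equal 1, so the weights vanish")
    return sum(w * es for w, es in zip(weights, es_values)) / total


# Two data sets for the same SNV and allele: a weak and a strong observation.
es_per_dataset = [1.0, 3.0]
p_per_dataset = [0.1, 0.001]
es_aggregated = aggregated_effect_size(es_per_dataset, p_per_dataset)
```

Because the weights enter as a ratio, the base of the logarithm does not affect the result; base 10 is used here only for readability.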

BAD-corrected estimates of the ASB ES allow visualizing the magnitude of allelic imbalance at different positions of significant motif hits. For this purpose, we introduce a staveplot (Fig. 4d and Supplementary Fig. 7) that is partitioned into sections corresponding to the motif positions, and each section is a stave of four strings denoting the minor allele. Individual ASBs are shown as the beads on the staves, with the major allele encoded with color, following the palette of the motif logo diagram that is shown underneath. As an illustrative example, we use ASBs of the CEBPB TF (Fig. 4d). Here, the first string from the left denotes A as the minor allele found in the first position of CEBPB motif hits. The string carries multiple beads, each of which is the major allele of a particular heterozygous SNV within an ASB site of CEBPB. The position of a bead on the Y-axis shows the ASB ES in log-scale. The most conserved motif positions 3-7-9-10 are almost unicolor, with the major allele usually being the same as the consensus letter in the motif (hence the beads on the strings depicting minor alleles mostly share the color of the preferred major allele). Lowly conserved positions (e.g., 1 or 12) allow for more options with various pairwise combinations of alleles (i.e., with minor allele strings carrying the beads of all four possible colors). Of note, the staveplot reveals a clear pattern where beads found in core motif positions are located generally higher, i.e., marking a greater ES for heterozygous variants at conserved motif positions. This agrees with the commonly accepted view that substitutions in the core motif positions bring about larger changes in TF-binding affinity.

Particularly for CEBPB, position 6 is of special interest: it displays frequent T/C ASBs with C being the major allele. These cytosines belong to the core CG pair which is prone to spontaneous deamination. The produced mismatches are then protected from repair through enhanced CEBPB binding resulting in mutation fixation³⁴. Such ASBs, on the one hand, confirm frequent mutagenesis of CEBPB binding sites, and, on the other hand, suggest the action of purifying selection that stabilizes such sites as heterozygous variants. The staveplots for other TFs are shown in Supplementary Fig. 7.

Machine learning predicts ASBs from sequence analysis and chromatin accessibility

With previously published ASB sets of smaller volumes, it was possible to predict ASB from chromatin properties and a sequence analysis²⁰. To assess to what degree this holds for ADASTRA data, we applied machine learning with a random forest model³⁵ atop experimentally determined allele-specific chromatin DNase accessibility data¹⁶, predicted allele-specific chromatin profile from DeepSEA¹¹, and sequence motif hits (Supplementary Table 5).

A generic classification problem (ASBs vs non-ASBs) can be formalized in two subtasks: (1) general assessment, i.e., to predict if an SNV makes the ASB for any of the TFs or in any of the cell types, and (2) TF- and cell type-specific assessment, i.e., to predict if an SNV makes the ASB for the particular TF or in the particular cell type. Models for both subtasks were trained and validated using multiple single-chromosome hold-outs: iteratively for each of 22 autosomes, one autosome was selected for validation, and 21 other autosomes were used for training. At each iteration, the model performance was estimated at the held-out autosome, and the resulting ROC and PRC were averaged.
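The single-chromosome hold-out scheme can be sketched as a generator of training and validation splits. The record layout (chromosome name, feature vector, label) is hypothetical, and model fitting and metric averaging are left out.

```python
AUTOSOMES = [f"chr{i}" for i in range(1, 23)]


def leave_one_autosome_out(records):
    """Yield (held_out, train, validation) for each of the 22 autosomes.

    records: iterable of (chromosome, features, label) tuples (hypothetical layout).
    Records on sex chromosomes or other contigs are never used.
    """
    records = list(records)
    autosome_set = set(AUTOSOMES)
    for held_out in AUTOSOMES:
        train = [r for r in records if r[0] in autosome_set and r[0] != held_out]
        validation = [r for r in records if r[0] == held_out]
        yield held_out, train, validation


toy_records = [
    ("chr1", [0.2, 1.0], 1),
    ("chr1", [0.1, 0.0], 0),
    ("chr2", [0.7, 1.0], 1),
    ("chrX", [0.5, 0.0], 0),
]
folds = list(leave_one_autosome_out(toy_records))
```

Splitting by whole chromosomes keeps neighboring SNVs, which tend to share ChIP-Seq peaks and chromatin context, from ending up on both sides of a split.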

For the first subtask, the performance at TF and cell type ASBs was 0.74 and 0.73 for the area under the receiver operating characteristic (auROC), and 0.44 and 0.56 for the area under the precision-recall curve (auPRC), respectively (see the plots in Supplementary Fig. 8). For the second subtask, we used the top 10 TFs and top 10 cell types with the highest numbers of ASBs, and a dedicated model was trained for each TF and each cell type (Supplementary Table 6 and Supplementary Fig. 8). The quality of the models was different for different TFs and cell types, with the highest auROC of 0.72 and 0.81 for CTCF (of TFs) and HepG2 (of cell types), and the highest auPRC of 0.35 and 0.64 for CTCF and A549. Of note, RAD21 ASBs were also predicted with very high reliability, as they are often located at the same variants as CTCF ASB.

Analysis of the feature importance (Supplementary Fig. 8) showed that all models utilized signals from the final layer of the DeepSEA neural network that was specifically designed to distinguish regulatory SNVs. Of note, among multiple DeepSEA features, those for the matched cell types were automatically prioritized. In agreement with previous studies^16,20, the models also obtained useful information from the experimental DNase-Seq data, and the data on allelic imbalance were generally more important than the basic read coverage. In the case of ASBs of particular TFs, motif-based features further facilitated distinguishing ASBs from non-ASBs. We expect that the same framework can allow further improvement of ASB prediction when supplied with additional chromatin accessibility and allelic imbalance data from matched cell types and with improved models of TF-binding sites.

Disease-associated SNPs and eQTLs are enriched with ASBs

To assess if ASB facilitates the identification of functional regulatory sequence alterations, we annotated the ASB-carrying SNVs using data from several databases on phenotype–genotype associations: NHGRI-EBI GWAS catalog³⁶, ClinVar³⁷, PheWAS³⁸, and BROAD fine-mapping catalog of causal autoimmune disease variants³⁹. With these data, we counted the number of known associations per SNP, considering SNVs of several classes: low-covered SNVs not tested for ASB (non-candidate sites with the maximal read coverage across experiments below 20); candidate sites from the data sets of a single TF that do or do not exhibit ASB; candidate sites from the data sets for two or more TFs that, again, exhibit ASB or do not; and finally, regulator-switching ASBs, where different TFs prefer to bind alternative alleles, e.g., in different cell types. All variants were segregated into classes with regard to known associations: no known associations, a single association, and multiple associations.

We have found that the share of ASB variants with genetic associations was consistently higher than expected by chance (Fig. 5a), which makes ASBs good candidates in the search for causal SNVs. In particular, the odds ratio between the observed and expected SNP numbers was especially high for TF-switching ASBs, although only 1.5% of such ASBs were involved in two or more known GWAS associations. For many variants, there are no known associations with “macro-phenotypes,” as provided by GWAS studies, but there are data on molecular phenotypes like variations in mRNA levels. In fact, the effect of the so-called eQTLs⁴⁰ can be explained by the alteration of TF-binding affinity that is revealed by ASB. Using the same classification of SNVs as above, we tested ASB and non-ASB SNVs for overlaps with GTEx⁴¹ eQTLs and observed the same pattern as for phenotype associations, with the strongest enrichment of ASBs for which different TFs preferably bind alternative alleles (Fig. 5b). The enrichment also grew stronger with the number of genes whose mRNA levels were associated with the variant. The same effect holds for multi-cell type ASBs (Supplementary Fig. 9).

**Fig. 5: ASBs are enriched with pathologic phenotype associations and eQTLs.**

More than 80% of ASB SNVs with alternative alleles preferably bound by different TFs overlap eQTLs in at least one cell type, whereas 10% of such ASB SNVs overlap eQTLs targeting ten or more genes. A large fraction of genes of medical relevance from the ClinVar catalog³⁷ was found among protein-coding genes associated with ASB eQTLs (twofold enrichment as compared to random expectation, Fisher’s exact test P ~10⁻⁴⁹). Of note, as many as 90% of genes of medical relevance in ClinVar are eQTL targets of ASB SNVs, and this constitutes 30% of all target genes of ASB eQTLs.

It is not trivial to measure the reliability of ASB identification due to difficulty in assembling a highly reliable “ground truth” set of ASBs, that is necessary to compute standard performance measures based on true and false positives/negatives. For instance, only synthetic data were used for benchmarking purposes in the original BaalChIP paper¹⁹. On the other hand, despite difficulties in the direct evaluation of ASB calling performance, it is possible to estimate implicitly the “regulatory potential” of particular SNPs from functionally related data. We performed a comparison of ASB calls between ADASTRA, BaalChIP, and Shi et al. data comparing ASBs with GTEx eQTLs (Fig. 5c and Supplementary Fig. 9). The level of eQTL support for ADASTRA ASBs turned out to be comparable to that of the BaalChIP ASB set, with Shi et al. data close behind.

We also studied the association of GWAS-tested phenotypes with all candidate SNVs, not necessarily significant ASBs, found in TF-binding regions. To this end, we performed a general enrichment analysis for SNPs found in ChIP-Seq data of particular TFs within linkage disequilibrium blocks (LD-islands identified in⁴²) using Fisher’s exact test (see “Methods”). Thus we identified TFs for which phenotype-associated SNVs were enriched within TF-binding regions (Supplementary Fig. 9). For a number of TFs such association with phenotypes was reported in other studies. The examples include FOXA1 (involved in prostate development⁴³ and in our case, found associated with prostate cancer), IKZF1 (for which the protein damaging mutations are associated with leukemia), STAT1 (involved in the development of systemic lupus erythematosus⁴⁴), and others. Practically in all cases one or several of the associated SNVs also acted as ASBs of the respective TF, providing strong candidates for causality.

To illustrate how the functional role of regulatory SNPs can be highlighted with ASB data, we present several case studies. First, there is rs3761376 (G > A) that serves as a Ref-ASB for ESR1, which was already confirmed by electrophoretic mobility shift assay⁴⁵. rs3761376 is located in the TFF1 gene promoter and was shown to reduce TFF1 expression through altered ESR1 binding, suggesting a molecular mechanism of the increased risk of gastric cancer⁴⁵.

Next, there is rs17293632 (C > T) that serves as a Ref-ASB for 25 different TFs and was previously reported to affect the chromatin accessibility in the adjacent region⁴⁶. rs17293632 is associated with Crohn’s disease. This SNP is located in SMAD3 intron and overlaps an eQTL targeting SMAD3, AAGAB, and PIAS1 genes⁴¹. Interestingly, a variant of SMAD3 is also associated with Crohn’s disease, particularly, with increased risk of repeated surgery and shorter relapse⁴⁷. Among the TFs displaying ASBs, there are JUN/FOS proteins with the ASB-concordant motif annotation. The AP1 pioneer complex of JUN/FOS likely serves as a “driver” for changes both in gene expression and chromatin accessibility, and is likely to cause ASB of all 25 TFs.

Apart from multi-TF ASBs which are linked to local chromatin changes, non-trivial cases can be found among TF-switching ASBs. For example, SNP rs58726213 is associated with psoriasis and is ASB of CREB1 (reference allele preference, concordant with motif) and JUN (alternative allele preference). rs58726213 is located in the STX4 intron or upstream region depending on a transcript variant. STX4 is significantly downregulated in psoriasis⁴⁸, and, according to GTEx, rs58726213 serves as an eQTL of STX4 and HSD3B7; the latter is also reported as psoriasis susceptibility locus⁴⁹.

Another example is SNP rs11257655 that is associated with type 2 diabetes⁵⁰. rs11257655 is reported to be located in the CDC123 regulatory region and exhibits ASB of FOXA1 (alternative allele preference, concordant with the sequence motif), ESR1 (reference allele preference), and three other TFs (SPI1, STAT1, and SMC3). According to UniProt⁵¹, FOXA1 is involved in liver and pancreas development, and in glucose homeostasis. At the same time, polymorphisms in the ESR1 gene are associated with type 2 diabetes and with fasting plasma glucose^52,53.

Thus, ASBs highlight the cases where phenotype–genotype associations arise with different mechanisms, either from protein structure variation, or due to altered gene expression caused by nucleotide substitutions in the gene regulatory region.

Discussion

The functional annotation of noncoding variants remains a challenge in modern human genetics. Phenotype-associated SNPs found in GWAS are usually located in extensive linkage disequilibrium blocks, and reliable selection of causal variants cannot be done purely by statistical means. Additional data for the identification of causal variants come from functional genomics. In particular, an important class of causal variants consists of regulatory SNVs affecting gene transcription. For those variants, there are various approaches, e.g., parallel reporter assays, to obtain high-throughput data on molecular events caused by particular nucleotide substitution. Another common strategy is to check if a variant of interest falls into a known gene regulatory region detected by chromatin immunoprecipitation or chromatin accessibility assay followed by deep sequencing. By assessing the allele specificity, it is possible to further profit from these data through direct estimation of the effect that a particular allele has on the binding of relevant regulatory proteins or chromatin accessibility.

In this meta-study, for each SNV, we integrated the data by considering a TF bound to SNV in different cell types or a cell type and different TFs bound to the same SNV. Surprisingly, ASB identification through data aggregation had better sensitivity than standard ChIP-Seq peak calling at the level of individual data sets. Particularly, in GTRD, the ChIP-Seq peak calls were gathered from four different tools (MACS, SISSRs, GEM, and PICS), but only 85–90% of significant ASBs were detected within peak calls (199,819 of 233,290 and 324,890 of 351,965 for TF-centric and cell type-centric aggregation), suggesting that up to 15% of ASBs could be lost if the ASB calling was restricted to the peak calls only.

Each particular ASB can either be a “driver” directly altering TF-binding affinity, or a “passenger” with differential binding resulting from differential chromatin accessibility (in turn, caused by some neighboring SNVs), or a protein–protein interaction with the causal TF. In terms of machine learning, we expected the TF ASBs to provide an easier prediction target since they could be mostly determined by the sequence motif of the respective TF. However, as found, the percentage of “passenger” ASBs is rather large (e.g., 24,662 out of 27,233 CTCF ASBs lack significant CTCF motif hits), and the TF-specific models showed a limited ASB prediction quality. Further surprise came from cell type-specific models which displayed a notably higher performance. We interpret these data as follows: the cell type ASBs are easier to predict by learning a small set of cell type-specific master regulators, while passenger TF-level ASBs are very diverse, as coming from data aggregation of many different cell types with varying cell type-specific features such as key TFs.

ASB events should be distinguished from other sources of allelic imbalance such as aneuploidy and local CNVs, which can imitate ASB by varying the allelic dosage. Commonly used cell types are often aneuploid: K562 and MCF7 cells are triploid on average, and 59 of 121 cell types overlapping between ADASTRA and COSMIC also have median copy number above 2. The ADASTRA pipeline, to our knowledge, includes the first control-free approach to reconstruct the genomic map of BAD directly from SNP calls and to use this map as a baseline for detecting genuine allelic imbalance. Despite a multitude of available software for ASB calling, there has been no approach suitable for the uniform analysis of diverse existing data. Thus, when developing ADASTRA, the intention was to be able to process and include most of the data including non-replicated experiments, data sets lacking genomic input controls, or with the controls sequenced at low coverage, at the expense of general sensitivity achieved at particular data sets. Further on, such a pipeline might be applicable to other sequencing data that allow allele specificity, e.g., analyses of allele-specific expression or chromatin accessibility. With matched cell types, BAD-corrected data on allele-specific chromatin accessibility will also allow for better classification of driver and passenger ASBs and better application of machine learning techniques.

Our collection of ASB events per se is also useful for other research areas involving TF-DNA interactions. First, ASBs provide unique in vivo data on differential TF binding and can be used for testing the predictive power of computational models for precise recognition of TF-binding sites³³. Second, the TF binding not only affects transcript abundance, but also affects RNA splicing, localization, and stability^54,55. Thus, ASBs may affect other levels of gene expression, particularly, the mRNA posttranscriptional modification: out of 65 RNAe-QTLs reported in⁵⁶, 4 are listed as ASBs in ADASTRA.

Last but not least, ADASTRA reports hundreds of TF-switching ASBs, where alternative alleles are preferably bound by different TFs. This possibility has been discussed previously⁵⁷ but, to our knowledge, we are the first to report the genome-wide inventory of such events. Importantly, the respective SNVs exhibit the highest enrichment with phenotype associations. Probably these sites serve varying and allele-dependent molecular circuits. A particularly interesting example is rs28372852 located in the G elongation factor mitochondrial 1 (GFM1) gene promoter. According to ADASTRA, rs28372852 serves as the Alt-ASB of CREB1 and Ref-ASB of MXI1, and in both cases, the allelic imbalance is concordant with the respective binding motifs. Also, according to GTEx⁴¹, GFM1 is the target of rs28372852 eQTL. According to UniProt⁵¹, CREB1 is a transcriptional activator, while MXI1 is a transcriptional repressor, suggesting that ASB can directly switch the gene expression activity. At the same time, UniProt reports four amino acid substitutions in GFM1 that are associated with combined oxidative phosphorylation deficiency. Interestingly, according to ClinVar³⁷, this SNP is benign in regard to combined oxidative phosphorylation deficiency; in this case we speculate that ASB data might facilitate reevaluating the variants’ functional roles and pathogenic potential. We believe that further analysis of TF-switching ASBs in the scope of metabolic and regulatory pathway alterations will provide valuable insights into molecular mechanisms underlying particular normal and pathologic traits.

Methods

Variant calling from GTRD alignments

We used 7669 premade short read alignments against hg38 genome assembly produced with bowtie2⁵⁸ and stored in the GTRD¹⁷ database. PICARD was used for deduplication, followed by GATK base quality recalibration. Next, the variants were called with GATK HaplotypeCaller, with dbSNP²⁶ (common variant set of the build 151) for annotation. The resulting variant calls were filtered to meet the following requirements: (1) an SNV must be biallelic and heterozygous (GATK annotation GT = 0/1); (2) an SNV must have read coverage ≥ 5 at both the reference and alternative alleles; (3) an SNV must be listed as an SNP in the dbSNP 151 common set. Of note, we considered all eligible SNVs as candidate ASB, not necessarily located within ChIP-Seq peak calls.
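The three filtering requirements can be expressed compactly. The following Python sketch is only an illustration: the `VariantCall` record, its field names, and the toy identifiers are hypothetical and do not reproduce the data structures of the ADASTRA pipeline.

```python
from dataclasses import dataclass

MIN_ALLELE_READS = 5  # per-allele coverage threshold stated in the text


@dataclass
class VariantCall:
    rs_id: str          # dbSNP identifier (toy values below)
    genotype: str       # GATK GT annotation, e.g. "0/1"
    n_alt_alleles: int  # number of alternative alleles reported at the site
    ref_reads: int
    alt_reads: int


def is_candidate(call, common_snp_ids):
    """Apply requirements (1)-(3): biallelic heterozygous, covered, common SNP."""
    return (
        call.n_alt_alleles == 1
        and call.genotype == "0/1"
        and call.ref_reads >= MIN_ALLELE_READS
        and call.alt_reads >= MIN_ALLELE_READS
        and call.rs_id in common_snp_ids
    )


common = {"rs_toy_1", "rs_toy_2"}  # stand-in for the dbSNP common set
calls = [
    VariantCall("rs_toy_1", "0/1", 1, 12, 9),   # passes all filters
    VariantCall("rs_toy_2", "0/1", 1, 30, 4),   # too few Alt reads
    VariantCall("rs_toy_3", "0/1", 1, 8, 8),    # not in the common set
    VariantCall("rs_toy_1", "1/1", 1, 0, 20),   # homozygous
]
kept = [c.rs_id for c in calls if is_candidate(c, common)]
```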

We restricted ourselves to variants from the dbSNP common subset for the following reasons: (1) allelic read counts at de novo mutations reflect the composition of the cell population (i.e., the fraction of cells carrying the mutation) rather than the local copy-number ratio or ASB; (2) de novo point mutations within particular copies of duplicated segments (considering, e.g., chromosome duplications) will exhibit allelic imbalance (e.g., in a tetraploid region with 2:2 ratio of allelic reads at SNPs, de novo mutations will likely exhibit the ratio of 1:3) and may lead to false-positive ASB calls.

Accounting for BAD

The observed distribution of ChIP-Seq allelic read counts on heterozygous SNVs significantly depends on aneuploidy and the CNV profile of the cells (Fig. 6a, b). The modes of distribution correspond to the most represented copy number, e.g., the distribution is bimodal for mostly triploid K562 cells, Fig. 6b. However, the mixture of two Binomial distributions poorly approximates the data, showing a significant overdispersion. To systematically reduce the overdispersion from local CNVs and aneuploidy, we reconstructed the genome-wide BAD maps from read counts at the heterozygous variants (see below). The distributions of the allelic read counts at SNVs segregated by BAD show a notably reduced overdispersion (Fig. 6c, d).

**Fig. 6: Distribution of read counts at SNVs significantly depends on background allelic dosage.**

BAD calling with Bayesian changepoint identification

To construct genome-wide BAD maps from filtered heterozygous SNV calls, we developed a novel algorithm, the BAD caller by Bayesian changepoint identification (BABACHI).

At the first stage, BABACHI divides the chromosomes into smaller sub-chromosome regions by detecting centromeric regions, long deletions, loss of heterozygosity regions, and other regions depleted of SNVs. At this stage, only the distances between neighboring SNVs are taken into account and long gaps are marked. The sub-chromosome regions with <3 SNVs or chromosomes with <100 SNVs are removed. Next, BABACHI finds a set of changepoints in each sub-chromosome region that further divide it into smaller segments of stable BAD. The optimal changepoints are chosen to maximize the marginal likelihood of observing the experimental distribution of allelic read counts at the SNVs, given that a region-specific (yet unknown) BAD persists in each region enclosed between neighboring changepoints. Finally, a particular BAD is assigned to each segment according to the maximum posterior.
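The first stage can be pictured with a simplified Python sketch. A single fixed `max_gap` threshold is an assumption made for illustration only; the actual gap criterion of BABACHI is described in the Supplementary Methods, and the function name is hypothetical.

```python
def split_by_gaps(positions, max_gap, min_region_snvs=3, min_chrom_snvs=100):
    """Split sorted SNV positions of one chromosome at long gaps.

    Chromosomes with fewer than `min_chrom_snvs` SNVs and regions with
    fewer than `min_region_snvs` SNVs are dropped, as stated in the text.
    `max_gap` is a hypothetical fixed threshold.
    """
    if not positions or len(positions) < min_chrom_snvs:
        return []
    regions, current = [], [positions[0]]
    for prev, pos in zip(positions, positions[1:]):
        if pos - prev > max_gap:  # long gap: close the current region
            regions.append(current)
            current = []
        current.append(pos)
    regions.append(current)
    return [r for r in regions if len(r) >= min_region_snvs]


# Toy chromosome; min_chrom_snvs is lowered so that the example is not empty.
toy_positions = [100, 200, 300, 5_000_000, 5_000_100,
                 9_000_000, 9_000_050, 9_000_100]
toy_regions = split_by_gaps(toy_positions, max_gap=1_000_000, min_chrom_snvs=1)
```

Here the middle region holds only two SNVs and is discarded, leaving two sub-chromosome regions for the changepoint search.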

The likelihood is calculated for the statistic x = min(C_Ref, C_Alt), assuming C_Ref to be distributed according to the truncated Binomial distribution ~TruncatedBinom(n, p) given that C_Ref + C_Alt = n, the number of reads overlapping the variant; the number of successes k is limited to 5 ≤ k ≤ n−5 (the read coverage filter), and p is either 1/(BAD + 1) or BAD/(BAD + 1), matching one of the expected allelic read frequencies.
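To make the likelihood concrete, the sketch below evaluates the truncated Binomial model for a segment under a given BAD. Combining the two possible values of p with equal weights is our simplifying assumption for this example (the exact treatment is given in the Supplementary Methods), and all function names are illustrative.

```python
from math import comb, log

MIN_READS = 5  # per-allele read coverage filter


def truncated_binom_pmf(k, n, p, lo=MIN_READS):
    """P(K = k | lo <= K <= n - lo) for K ~ Binomial(n, p)."""
    if k < lo or k > n - lo:
        return 0.0
    norm = sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(lo, n - lo + 1))
    return comb(n, k) * p**k * (1 - p)**(n - k) / norm


def snv_log_likelihood(c_ref, c_alt, bad):
    """Log-likelihood of one SNV; p is 1/(BAD+1) or BAD/(BAD+1).

    The equal-weight mixture over the two values of p is an assumption.
    """
    if min(c_ref, c_alt) < MIN_READS:
        raise ValueError("SNV does not pass the read coverage filter")
    n = c_ref + c_alt
    p = 1.0 / (bad + 1.0)
    lik = 0.5 * (truncated_binom_pmf(c_ref, n, p)
                 + truncated_binom_pmf(c_ref, n, 1.0 - p))
    return log(lik)


def segment_log_likelihood(counts, bad):
    """Sum of per-SNV log-likelihoods over (C_Ref, C_Alt) pairs."""
    return sum(snv_log_likelihood(r, a, bad) for r, a in counts)


balanced = [(10, 11), (15, 14), (9, 10), (20, 19)]  # toy counts near 1:1
skewed = [(10, 21), (30, 14), (9, 19), (40, 19)]    # toy counts near 1:2
```

With these toy counts, the balanced segment is better explained by BAD = 1 and the skewed one by BAD = 2.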

BAD of each segment is selected from the discrete set {1, 4/3, 3/2, 2, 5/2, 3, 4, 5, 6}, considering that the total copy number of a particular genomic region rarely exceeds 7. The prior distribution of BAD is assumed to be a discrete uniform, with the support being the same discrete set as above (non-informative prior). Details and mathematical substantiation of the algorithm are provided in the Supplementary Methods.
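The discrete set can be recovered by a short enumeration, assuming that BAD is the ratio of the major to the minor allele copy number, that both alleles are present (heterozygous sites), and that the total copy number does not exceed 7. The variable names are illustrative.

```python
from fractions import Fraction

MAX_TOTAL_COPIES = 7  # upper limit on the total copy number from the text

# All major/minor copy-number ratios with minor >= 1 and major + minor <= 7.
bad_values = sorted({
    Fraction(major, minor)
    for minor in range(1, MAX_TOTAL_COPIES)
    for major in range(minor, MAX_TOTAL_COPIES - minor + 1)
})

# Non-informative prior: equal probability for every admissible BAD.
uniform_prior = {bad: Fraction(1, len(bad_values)) for bad in bad_values}
```

The enumeration yields nine values, each of which receives a prior probability of 1/9.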

Practical BAD calling with the ADASTRA pipeline

To provide better genome coverage and robust BAD estimates, we merged the sets of variant calls from ChIP-Seq data sets produced in the same laboratory for the same cell type and in the same series (i.e., sharing either ENCODE biosample or GEO GSE ID). Different SNVs at the same genome position (either originating from different data sets or with different alternative alleles) were considered as independent observations. For each data set, chromosomes with < 100 SNVs were excluded from BAD calling and further analysis.

To assess the reliability of the BAD maps, for each BAD, we separately estimated ROC and PRC. Here we considered the BAD maps as binary classifiers of SNVs according to BAD, with COSMIC CNV data as the ground truth. To plot a curve for BAD = x, the following prediction score was used:

S = L(BAD = x) − max_y≠x L(BAD = y), where L denotes the log-likelihood of the segment containing the SNV to have the specified BAD (Fig. 2c, d).
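As a small illustration, the score S can be computed from a table of per-BAD segment log-likelihoods. The dictionary layout and the numbers are invented for the example.

```python
def bad_score(log_likelihoods, x):
    """S = L(BAD = x) - max over y != x of L(BAD = y)."""
    best_other = max(ll for bad, ll in log_likelihoods.items() if bad != x)
    return log_likelihoods[x] - best_other


# Made-up log-likelihoods of one segment under three BAD hypotheses.
toy_ll = {1: -10.0, 2: -12.5, 3: -20.0}
toy_scores = {bad: bad_score(toy_ll, bad) for bad in toy_ll}
```

A positive score means that the tested BAD is the most likely one for the segment; sweeping a threshold over S gives the ROC and PRC for that BAD.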

Construction of an independent BAD map for MCF7 cells

The paired-end reads of MCF7 deep genome sequencing (SRA accession SRR8652105) were aligned to hg38 genome assembly using bowtie2 with default settings. Overall, 28,278,026 (2.5%) of a total of 1,136,666,560 paired reads were marked as duplicates, 112,323,925 (9.9%) were filtered by GATK filter by mapping quality ≥ 10, leaving 996,064,609 reads for SNP calling. A total of 3,969,250 SNPs was reported by GATK HaplotypeCaller, among which 1,427,492 SNPs were annotated as heterozygous, passed the basic ADASTRA filter (≥5 reads on each allele), and were used to produce the independent reference MCF7 BAD map with BABACHI⁶⁹.

ASB calling with the Negative Binomial mixture model

To account for mapping bias, we fitted separate Negative Binomial mixture models for the scoring of Ref- and Alt-ASBs. For each BAD and each fixed read count at Ref- and Alt- alleles, we obtained separate fits using SNVs from all available data sets.

For every fixed read count value at a particular allele, we approximated the distribution of read counts mapped to the other allele as a mixture of two Negative Binomial distributions. The model estimates the number of successes x (the number of reads mapped to the selected allele) given the number of failures r (the number of reads mapped to the second allele) in the series of Bernoulli trials with probability of success p (for the first distribution in the mixture) or 1 − p (for the second distribution in the mixture). The following holds for scoring Ref-ASBs at fixed Alt-allele read counts:

$$\begin{aligned}{C}_{{\rm{Ref}}}\,|\,{\rm{fixed}}\ {C}_{{\rm{Alt}}} &\sim (1-w)\times {\rm{NegativeBinomial}}(r,p)+w\times {\rm{NegativeBinomial}}(r,1-p)\\ P({C}_{{\rm{Ref}}}=x\,|\,{\rm{fixed}}\ {C}_{{\rm{Alt}}}=m,\ {C}_{{\rm{Ref}}}\ge 5) &= \binom{x+r-1}{x}\left((1-w)\times {(1-p)}^{r}\times {p}^{x}+w\times {(1-p)}^{x}\times {p}^{r}\right)/A\\ A &= 1-P\left({C}_{{\rm{Ref}}} < 5\,|\,{\rm{fixed}}\ {C}_{{\rm{Alt}}}=m\right)\end{aligned}\tag{1}$$

where p and 1 − p were fixed to reflect the expected frequencies of allelic reads, namely, 1/(BAD + 1) and BAD/(BAD + 1). The parameters r (number of failures) and w (weights of distributions in the mixture) were fitted with L-BFGS-B algorithm from scipy.optimize⁵⁹ package to maximize the model likelihood iteratively with boundaries r > 0 and 0 ≤ w ≤ 1, assigning initial values of r = m (number of reads on the fixed allele) and w = 0.5, respectively. A is the normalization coefficient (necessary due to truncation) corresponding to allelic reads cutoff of 5. The goodness of fit was assessed by root mean square error of approximation (RMSEA⁶⁰, Supplementary Fig. 11). Low-quality fits with RMSEA > 0.05 were discarded, fixing the parameters at r = m and w = 1, thereby penalizing the statistical significance of ASB at such SNVs, as fitted r is systematically lower than m (Supplementary Fig. 12). Of note, the values of r for distribution of reference allele read counts (with fixed alt-allele read counts) were systematically higher than those for alternative allele read counts (with fixed Ref-allele read counts), thus balancing the reference mapping bias. The obtained fitted models were used for statistical evaluation of ASB for alternative and reference alleles independently, with one-tailed tests. Examples of fits for BAD = 1 and 2 are shown in Fig. 6e, f, with RMSEA < 0.02 for the fixed Ref/Alt read counts of 10.
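The truncated mixture and the resulting one-tailed test can be sketched in plain Python. No parameter fitting is performed here: the values of r and w below are made up, and taking the upper tail P(C_Ref ≥ x) for the one-tailed P value is our reading of the text. Function names are illustrative.

```python
from math import exp, lgamma, log

CUTOFF = 5  # allelic read count cutoff


def nb_pmf(x, r, p):
    """Negative Binomial: x successes before r failures, success prob. p."""
    log_coef = lgamma(x + r) - lgamma(x + 1) - lgamma(r)  # C(x+r-1, x), real r
    return exp(log_coef + r * log(1 - p) + x * log(p))


def mixture_pmf(x, r, p, w):
    """Untruncated two-component mixture."""
    return (1 - w) * nb_pmf(x, r, p) + w * nb_pmf(x, r, 1 - p)


def normalization(r, p, w):
    """A = 1 - P(C < CUTOFF), the coefficient required by truncation."""
    return 1.0 - sum(mixture_pmf(k, r, p, w) for k in range(CUTOFF))


def truncated_pmf(x, r, p, w):
    """P(C = x | C >= CUTOFF) under the mixture."""
    return 0.0 if x < CUTOFF else mixture_pmf(x, r, p, w) / normalization(r, p, w)


def upper_tail_p_value(x, r, p, w):
    """One-tailed P(C >= x | C >= CUTOFF); the tail direction is an assumption."""
    below = sum(mixture_pmf(k, r, p, w) for k in range(CUTOFF, x))
    return max(0.0, 1.0 - below / normalization(r, p, w))


bad = 2
p = 1.0 / (bad + 1.0)   # expected allelic frequency for BAD = 2
r, w = 10.0, 0.5        # made-up parameters (initial values from the text)
```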

Aggregation of ASB P values from individual data sets

For each ChIP-Seq read alignment (except control data), we performed the ASB calling. Next, the SNVs were grouped by a particular TF (across cell types) or by a particular cell type (across TFs). A group of SNVs with the same position and alternative alleles was considered as an ASB candidate if at least one of the SNVs passed a total coverage threshold ≥ 20. Next, for each ASB candidate, we performed logit aggregation of individual ASB P values²⁷, independently for Ref-ASB and Alt-ASB. Individual P values of 1 were excluded from aggregation, and if none were left, the aggregated P value for an SNV was set to 1.
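The aggregation step can be illustrated with the test statistic of the logit method. The sketch below follows the standard Mudholkar–George form, in which the scaled sum of logits is compared with Student's t distribution with 5k + 4 degrees of freedom; the final P value lookup is omitted to keep the example free of external dependencies, and the function name is hypothetical.

```python
from math import log, pi, sqrt


def logit_statistic(p_values):
    """Return (scaled statistic, degrees of freedom) or None.

    P values equal to 1 are excluded, as in the text; None signals that
    nothing is left and the aggregated P value should be set to 1.
    Inputs are assumed to lie in (0, 1].
    """
    kept = [p for p in p_values if p < 1.0]
    if not kept:
        return None
    k = len(kept)
    total = -sum(log(p / (1.0 - p)) for p in kept)
    dof = 5 * k + 4
    scale = sqrt(3.0 * dof / (k * pi ** 2 * (dof - 2)))
    return total * scale, dof


symmetric = logit_statistic([0.01, 0.99])    # statistic close to 0
consistent = logit_statistic([0.01, 0.02, 0.03])
```

A statistic near zero corresponds to an aggregated P value near 0.5, whereas consistently small individual P values give a large positive statistic.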

Logit aggregation is the method of choice, as it has two advantages. First, compared to Fisher’s method, it cancels out symmetrical P values like 0.01 and 0.99 to 0.5. Second, the pattern of evidence is not known in advance: significant ASB P values can arise both from a small number of strongly imbalanced SNVs in deeply sequenced data sets and from a large number of weakly imbalanced SNVs in data sets with low or medium coverage. Compared to the similar Stouffer’s method, the logit aggregation is less sensitive to the extreme P values and can be considered a robust choice⁶¹. The resulting aggregated P values were FDR corrected (Benjamini–Hochberg adjustment) for multiple tested SNVs separately for each TF and each cell type. SNVs passing 0.05 FDR for either Ref or Alt-allele were considered ASB.
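For reference, the Benjamini–Hochberg adjustment applied to a list of aggregated P values can be written in a few lines of Python. This is a generic textbook implementation with toy input, not an excerpt from the ADASTRA code.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.5]
toy_fdr = benjamini_hochberg(toy_p)
is_asb = [q <= 0.05 for q in toy_fdr]  # 0.05 FDR threshold from the text
```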

ASB effect size estimation

We define the ES separately for reference allele ASB (ES_Ref) and alternative allele ASB (ES_Alt) as the log ratio of the observed number of reads to the expected number. To account for BAD and mapping bias, we use fitted Negative Binomial mixture at the fixed allele read counts:

$$\begin{aligned}{\rm{ES}}_{{\rm{Ref}}} &= {\rm{log}}_{2}\left({C}_{{\rm{Ref}}}/E\left({C}_{{\rm{Ref}}}|{C}_{{\rm{Alt}}}\right)\right),\\ {\rm{ES}}_{{\rm{Alt}}} &= {\rm{log}}_{2}\left({C}_{{\rm{Alt}}}/E\left({C}_{{\rm{Alt}}}|{C}_{{\rm{Ref}}}\right)\right)\end{aligned}\tag{2}$$

In the basic case of BAD = 1, the ES can be approximated as the log ratio of read counts, taking into account that the expectation bias due to the truncation is relatively small and r is close to the read count on the fixed allele: ES_Ref ≈ log₂(C_Ref /C_Alt).

In the case of BAD > 1, the same assumptions lead to the following estimation of the ES:

$${\rm{log}}_{2}({C}_{{\rm{Ref}}}/({\rm{BAD}}\times {C}_{{\rm{Alt}}}))\lesssim {\rm{ES}}_{{\rm{Ref}}}\lesssim {\rm{log}}_{2}({C}_{{\rm{Ref}}}\times {\rm{BAD}}/{C}_{{\rm{Alt}}})\tag{3}$$

This holds due to the fact that for fixed BAD, C_Ref expectation is either C_Alt × BAD or C_Alt/BAD, depending on a haplotype. Therefore, the expectation of C_Ref according to the Negative Binomial mixture model is approximately w × C_Alt × BAD + (1 − w) × C_Alt/BAD.
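A short numerical check of this approximation is given below. It uses the approximate mixture expectation stated above, ignores the truncation bias, and takes r equal to the fixed read count; the function names and the toy read counts are illustrative.

```python
from math import isclose, log2


def expected_ref_count(c_alt, bad, w):
    """Approximate E(C_Ref | C_Alt) = w*C_Alt*BAD + (1 - w)*C_Alt/BAD."""
    return w * c_alt * bad + (1.0 - w) * c_alt / bad


def es_ref(c_ref, c_alt, bad, w):
    """Effect size of the reference allele as a log2 observed/expected ratio."""
    return log2(c_ref / expected_ref_count(c_alt, bad, w))


c_ref, c_alt, bad = 30, 10, 2              # toy read counts in a BAD = 2 segment
lower = log2(c_ref / (bad * c_alt))        # reached at w = 1
upper = log2(c_ref * bad / c_alt)          # reached at w = 0
es_mid = es_ref(c_ref, c_alt, bad, w=0.5)  # falls between the two bounds
```

For BAD = 1 both bounds coincide with log₂(C_Ref/C_Alt), whatever the value of w.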

The final ASB ES is estimated for SNVs with aggregated significance either across TFs or across cell types. The ES value is calculated as a weighted average of ES of individual SNVs in aggregation, with weights assigned as negative logarithms of individual P values. ES is not assigned if all individual P values are equal to 1.
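The weighted averaging can be sketched as follows. The base of the logarithm is not essential, since it cancels in the ratio; base 10 is used here as an arbitrary choice, and the function name is hypothetical.

```python
from math import log10


def aggregated_effect_size(es_values, p_values):
    """Weighted mean of individual ES values with weights -log10(P).

    Returns None when every individual P value equals 1 (all weights
    are zero), in which case no ES is assigned.
    """
    weights = [-log10(p) for p in p_values]
    total = sum(weights)
    if total == 0:
        return None
    return sum(es * w for es, w in zip(es_values, weights)) / total


toy_es = aggregated_effect_size([1.0, 3.0], [0.1, 0.001])  # weights 1 and 3
no_es = aggregated_effect_size([0.5, -0.2], [1.0, 1.0])
```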

SNV and ASB annotation

Genomic annotation

To annotate SNVs according to their genomic location (Fig. 3c), we started with mapping SNVs to FANTOM5 enhancers and promoters⁶². The remaining SNVs were annotated with ChIPseeker⁶³ with a hierarchical assignment of the following categories: promoter (≤1 kb), promoter (1–2 kb), promoter (2–3 kb), 5′UTR, 3′UTR, Exon, Intron, Downstream, Intergenic. For clarity, promoter (≤1 kb) and 5′UTR categories were both tagged as “promoter”; promoter (1–2 kb) and promoter (2–3 kb) were both tagged as “upstream.”

Sequence motif analysis of ASBs

For TF ASBs, we annotated the corresponding SNVs with sequence motif hits of the respective TFs. To this end, we used models from HOCOMOCO v11 core collection⁶⁴ and SPRY-SARUS⁶⁵ for motif finding. The top-scoring motif hit was taken considering both Ref and Alt alleles, and, at this fixed position, the “motif FC” was calculated as the log₂-ratio of motif P values at the reference and alternative variants so that the positive FC corresponded to the preference of the alternative allele.

To analyze the ASB motif concordance (Fig. 4), we considered the ASB SNVs (min(FDR_Ref, FDR_Alt) ≤ 0.05) that overlapped the predicted TF-binding site (min(motif P value_Ref, motif P value_Alt) ≤ 0.0005) and had |FC| ≥ 2. We defined the motif concordance/discordance as a match/mismatch of the signs of FC and ΔFDR = log₁₀(FDR_Ref) − log₁₀(FDR_Alt), so that a positive ΔFDR, like a positive FC, indicates preference for the alternative allele.
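The concordance rule can be summarized in a few lines of Python. The sketch compares the allele preferred by the motif with the allele preferred by ASB; the function names and the toy numbers are illustrative assumptions.

```python
from math import log2


def motif_fc(motif_p_ref, motif_p_alt):
    """log2 ratio of motif P values; positive when the Alt allele fits better."""
    return log2(motif_p_ref / motif_p_alt)


def motif_concordance(motif_p_ref, motif_p_alt, fdr_ref, fdr_alt,
                      fdr_cut=0.05, motif_cut=0.0005, min_abs_fc=2.0):
    """Return 'concordant', 'discordant', or None if the SNV is not eligible."""
    if min(fdr_ref, fdr_alt) > fdr_cut:
        return None  # not a significant ASB
    if min(motif_p_ref, motif_p_alt) > motif_cut:
        return None  # no predicted binding site
    fc = motif_fc(motif_p_ref, motif_p_alt)
    if abs(fc) < min_abs_fc:
        return None  # motif scores barely differ between the alleles
    motif_prefers_alt = fc > 0
    asb_prefers_alt = fdr_alt < fdr_ref
    return "concordant" if motif_prefers_alt == asb_prefers_alt else "discordant"


# Toy example: the motif and the ASB both favor the alternative allele.
toy_call = motif_concordance(1e-3, 1e-5, fdr_ref=1.0, fdr_alt=1e-4)
```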

Annotation of ASBs with phenotype associations

To assess enrichment of ASBs within phenotype-associated SNPs, we used the data from four different SNP-phenotype associations databases, namely: (1) NHGRI-EBI GWAS catalog³⁶, release 8/27/2019 with EFO mappings⁶⁶ used to group phenotypes by their parent terms for Supplementary Fig. 9; (2) ClinVar catalog³⁷, release 9/05/2019 (entries with “likely pathogenic,” “pathogenic,” or “risk factor” clinical significance); (3) PheWAS catalog³⁸; (4) BROAD fine-mapping catalog of causal autoimmune disease variants³⁹. All entries were systematized in the form of triples <dbSNP ID, phenotype, database>. Next, the entries were annotated with the TF- or cell type-ASB data.

To evaluate TF-phenotype associations in detail, we used NHGRI-EBI GWAS catalog and the following pipeline:

(1) We filtered out TFs with fewer than two candidate ASBs, and phenotypes associated with fewer than two SNPs, resulting in 765 TFs and 2688 phenotypes suitable for the analysis. For each TF, we considered all SNPs with candidate ASBs passing the coverage thresholds.
(2) For each pair of a TF and a phenotype, we calculated the odds ratio and the P value of the one-tailed Fisher’s exact test on SNPs with candidate ASBs considering two binary features: whether the SNP is associated with the phenotype, and whether the SNP is included in ASB candidates of the particular TF. The superset of SNPs was collected independently for each TF by gathering SNPs with candidate ASBs for all TFs but only from LD blocks⁴² containing either TF-specific SNPs or phenotype-associated SNPs. The P values were then FDR corrected for multiple tested TFs separately for each phenotype.
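The test in step (2) operates on a 2 × 2 contingency table. A dependency-free Python sketch of the one-tailed Fisher’s exact test and the odds ratio is given below; the table layout, the function names, and the counts are illustrative.

```python
from math import comb, inf


def fisher_exact_greater(a, b, c, d):
    """One-tailed P(X >= a) for the table [[a, b], [c, d]] with fixed margins.

    In the setting above, a could count SNPs that are both associated with
    the phenotype and candidate ASBs of the TF.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favorable = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    return favorable / comb(n, col1)


def odds_ratio(a, b, c, d):
    """Sample odds ratio; infinite when b*c is zero."""
    return (a * d) / (b * c) if b * c else inf


toy_p = fisher_exact_greater(3, 1, 1, 3)  # 17/70
toy_or = odds_ratio(3, 1, 1, 3)           # 9.0
```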

Analysis of eQTLs and eQTL target genes

To analyze an overlap between ASBs and eQTLs, we used significant <variant, gene> pairs from GTEx⁴¹ (release V8).

To evaluate ASB-driven eQTL target genes’ associations with medical phenotypes, a one-tailed Fisher’s exact test was performed on the enrichment of protein-coding genes of medical relevance (6026 genes found linked with entries with “pathogenic,” “likely pathogenic,” or “risk factor” clinical significance in ClinVar catalog³⁷) among eQTL target genes of ASB SNPs (16,865 protein-coding genes according to GTEx), considering all human protein-coding genes from GENCODE⁶⁷ (v35, 19,929 gene symbols) as the background set.

ASB prediction with machine learning

In our work, we used a standard software implementation of the random forest model from the scikit-learn package. The number of estimators was set to 500 and the other parameters were defaults. Three feature types were used (Supplementary Table 4): allele-specific chromatin DNase accessibility, synthetic data from neurons from the last layer of the DeepSEA¹¹, and HOCOMOCO motif predictions obtained with SPRY-SARUS⁶⁵. As a global set of SNVs, we used 231,355 dbSNP IDs overlapping between ADASTRA and Maurano et al.¹⁶ data, which provided allele-specific DNase accessibility. For the general model, we used SNVs with ASBs for any of TFs or in any of cell types as members of the positive class, and the remaining set of candidate SNVs as members of the negative class. For TF- and cell type-specific assessment, we defined ASB and non-ASB SNVs for a particular TF or in a particular cell type as the positive and negative class, respectively.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The complete data on ASBs across TFs and cell types described in this study are available in the release 1.6.10-Soos of the ADASTRA database (http://adastra.autosome.ru/) and provided online: http://adastra.autosome.ru/soos/, the generated BAD maps and the list of ChIP-Seq data sets are available at http://adastra.autosome.ru/soos/downloads. The reprocessed ChIP-Seq peaks and metadata are available in the GTRD database: http://gtrd.biouml.org.

Code availability

The ADASTRA pipeline is available at GitHub: https://github.com/autosome-ru/ADASTRA-pipeline⁶⁸. BABACHI segmentation software is available at GitHub: https://github.com/autosome-ru/BABACHI⁶⁹. The code for machine learning analysis is available at GitHub: https://github.com/autosome-ru/ASB-ML⁷⁰. The SPRY-SARUS motif scanner is available at GitHub: https://github.com/autosome-ru/sarus⁶⁵.

References

Ponomarenko, J. V. et al. rSNP_Guide: an integrated database-tools system for studying SNPs and site-directed mutations in transcription factor binding sites. Hum. Mutat. 20, 239–248 (2002).
Cavalli, M. et al. Allele-specific transcription factor binding to common and rare variants associated with disease and gene expression. Hum. Genet. 135, 485–497 (2016).
PCAWG Drivers and Functional Interpretation Working Group et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature 578, 102–111 (2020).
Deplancke, B., Alpern, D. & Gardeux, V. The genetics of transcription factor DNA binding variation. Cell 166, 538–554 (2016).
Penzar, D. D. et al. What do neighbors tell about you: the local context of cis-regulatory modules complicates prediction of regulatory variants. Front. Genet. 10, 1078 (2019).
van Arensbergen, J. et al. High-throughput identification of human SNPs affecting regulatory element activity. Nat. Genet. 51, 1160–1169 (2019).
Bulyk, M. L. Protein binding microarrays for the characterization of DNA–protein interactions. in Analytics of Protein–DNA Interactions (ed. Seitz, H.) Vol. 104, 65–85 (Springer Berlin Heidelberg, 2006).
Rockel, S., Geertz, M. & Maerkl, S. J. MITOMI: A microfluidic platform for in vitro characterization of transcription factor–DNA interaction. in Gene Regulatory Networks (eds. Deplancke, B. & Gheldof, N.) Vol. 786, 97–114 (Humana Press, 2012).
Korneev, K. V. et al. Minor C allele of the SNP rs7873784 associated with rheumatoid arthritis and type-2 diabetes mellitus binds PU.1 and enhances TLR4 expression. Biochim. Biophys. Acta 1866, 165626 (2020).
Putlyaeva, L. V. et al. Potential markers of autoimmune diseases, alleles rs115662534(T) and rs548231435(C), disrupt the binding of transcription factors STAT1 and EBF1 to the regulatory elements of human CD40 gene. Biochemistry 83, 1534–1542 (2018).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).
Quang, D., Chen, Y. & Xie, X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 761–763 (2015).
McDaniell, R. et al. Heritable individual-specific and allele-specific chromatin signatures in humans. Science 328, 235–239 (2010).
Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
Maurano, M. T. et al. Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nat. Genet. 47, 1393–1401 (2015).
Yevshin, I., Sharipov, R., Kolmykov, S., Kondrakhin, Y. & Kolpakov, F. GTRD: a database on gene transcription regulation—2019 update. Nucleic Acids Res. 47, D100–D105 (2019).
Chèneby, J. et al. ReMap 2020: a database of regulatory regions from an integrative analysis of human and arabidopsis DNA-binding sequencing experiments. Nucleic Acids Res. gkz945 https://doi.org/10.1093/nar/gkz945 (2019).
de Santiago, I. et al. BaalChIP: Bayesian analysis of allele-specific transcription factor binding in cancer genomes. Genome Biol. 18, 39 (2017).
Shi, W., Fornes, O., Mathelier, A. & Wasserman, W. W. Evaluating the impact of single nucleotide variants on transcription factor binding. Nucleic Acids Res. gkw691 https://doi.org/10.1093/nar/gkw691 (2016).
Rozowsky, J. et al. AlleleSeq: analysis of allele‐specific expression and binding in a network framework. Mol. Syst. Biol. 7, 522 (2011).
Chen, J. et al. A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals. Nat. Commun. 7, 11101 (2016).
Liu, Y. et al. Multi-omic measurements of heterogeneity in HeLa cells across laboratories. Nat. Biotechnol. 37, 314–322 (2019).
Degner, J. F. et al. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25, 3207–3212 (2009).
Wei, Y., Li, X., Wang, Q. & Ji, H. iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets. BMC Genomics 13, 681 (2012).
Sherry, S. T. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
George, E. O. & Mudholkar, G. S. On the convolution of logistic random variables. Metrika 30, 1–13 (1983).
Tate, J. G. et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 47, D941–D947 (2019).
Varma, S., Pommier, Y., Sunshine, M., Weinstein, J. N. & Reinhold, W. C. High resolution copy number variation data in the NCI-60 cancer cell lines from whole genome microarrays accessible through CellMiner. PLoS ONE 9, e92047 (2014).
Cavalli, M. et al. Allele specific chromatin signals, 3D interactions, and motif predictions for immune and B cell related diseases. Sci. Rep. 9, 2695 (2019).
Szklarczyk, D. et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).
Wang, D. et al. Reprogramming transcription by distinct classes of enhancers functionally defined by eRNA. Nature 474, 390–394 (2011).
Wagih, O., Merico, D., Delong, A. & Frey, B. J. Allele-specific transcription factor binding as a benchmark for assessing variant impact predictors. https://doi.org/10.1101/253427 (2018).
Ershova, A. S. et al. Enhanced C/EBPs binding to C>T mismatches facilitates fixation of CpG mutations. https://doi.org/10.1101/2020.06.11.146175 (2020).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).
Farh, K. K. -H. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015).
Brem, R. B. Genetic dissection of transcriptional regulation in budding yeast. Science 296, 752–755 (2002).
Lonsdale, J. et al. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Berisa, T. & Pickrell, J. K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics btv546 https://doi.org/10.1093/bioinformatics/btv546 (2015).
Pomerantz, M. M. et al. Prostate cancer reactivates developmental epigenomic programs during metastatic progression. Nat. Genet. 52, 790–799 (2020).
Aue, A. et al. Elevated STAT1 expression but not phosphorylation in lupus B cells correlates with disease activity and increased plasmablast susceptibility. Rheumatology keaa187 https://doi.org/10.1093/rheumatology/keaa187 (2020).
Wang, W. et al. A functional polymorphism in TFF1 promoter is associated with the risk and prognosis of gastric cancer: a functional polymorphism in TFF1 promoter. Int. J. Cancer 142, 1805–1816 (2018).
Gate, R. E. et al. Genetic determinants of co-accessible chromatin regions in activated T cells across humans. Nat. Genet. 50, 1140–1150 (2018).
Fowler, S. A. et al. SMAD3 gene variant is a risk factor for recurrent surgery in patients with Crohn’s disease. J. Crohns Colitis 8, 845–851 (2014).
AlFadhli, S., Al-Zufairi, A. A. M., Nizam, R., AlSaffar, H. A. & Al-Mutairi, N. De-regulation of diabetic regulatory genes in psoriasis: deciphering the unsolved riddle. Gene 593, 110–116 (2016).
Collaborative Association Study of Psoriasis (CASP) et al. Identification of 15 new psoriasis susceptibility loci highlights the role of innate immunity. Nat. Genet. 44, 1341–1348 (2012).
Carayol, J. et al. Genetic susceptibility determines β-cell function and fasting glycemia trajectories throughout childhood: a 12-year cohort study (EarlyBird 76). Diabetes Care 43, 653–660 (2020).
Consortium, T. U. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
Dahlman, I. et al. Estrogen receptor alpha gene variants associate with type 2 diabetes and fasting plasma glucose. Pharmacogenet Genomics 18, 967–975 (2008).
Zhao, L. et al. Estrogen receptor 1 gene polymorphisms are associated with metabolic syndrome in postmenopausal women in China. BMC Endocr. Disord. 18, 65 (2018).
Bellofatto, V. & Wilusz, J. Transcription and mRNA stability: parental guidance suggested. Cell 147, 1438–1439 (2011).
Zid, B. M. & O’Shea, E. K. Promoter sequences direct cytoplasmic localization and translation of mRNAs during starvation in yeast. Nature 514, 117–121 (2014).
Belkadi, A. et al. Identification of genetic variants controlling RNA editing and their effect on RNA structure stabilization. Eur. J. Hum. Genet. https://doi.org/10.1038/s41431-020-0688-7 (2020).
Ameur, A., Rada-Iglesias, A., Komorowski, J. & Wadelius, C. Identification of candidate regulatory SNPs by combination of transcription-factor-binding site prediction, SNP genotyping and haploChIP. Nucleic Acids Res. 37, e85–e85 (2009).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Browne, M. W. & Cudeck, R. Alternative ways of assessing model fit. Sociol. Methods Res. 21, 230–258 (1992).
Loughin, T. M. A systematic comparison of methods for combining p-values from independent tests. Comput. Stat. Data Anal. 47, 467–485 (2004).
The FANTOM consortium et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 16, 22 (2015).
Yu, G., Wang, L. -G. & He, Q. -Y. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).
Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
Denisenko, N., Kulakovskiy, I. & Vorontsov, I. autosome-ru/sarus: SPRY-SARUS v2.0.2. (Zenodo, 2020). https://doi.org/10.5281/ZENODO.4015924.
Malone, J. et al. Modeling sample variables with an experimental factor ontology. Bioinformatics 26, 1112–1118 (2010).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Abramov, S. & Boytsov, A. autosome-ru/ADASTRA-pipeline: release-Soos (Zenodo, 2020). https://doi.org/10.5281/zenodo.4008546.
Abramov, S. & Boytsov, A. autosome-ru/BABACHI: release 1.3.7 (Zenodo, 2020). https://doi.org/10.5281/ZENODO.4008544.
Penzar, D. autosome-ru/ASB-ML: ASB-ML (Zenodo, 2020). https://doi.org/10.5281/ZENODO.4043865.

Acknowledgements

We thank the organizers and members of the GRECO consortium for the series of workshops (held under European Union COST Action CA15205 GREEKC, coordinator Martin Kuiper) which provided a fruitful networking and discussion platform for ideas of this study. We personally thank Denis Litvinov for help in GTRD metadata processing and Evgenia Serebrova for help in paper preparation. This study was supported by RFBR grant 18-34-20024 to I.V.K. (basic ADASTRA pipeline), RSF grant 20-74-10075 to I.V.K. (machine learning and additional analysis), and RSF grant 19-14-00295 to F.K. (GTRD data extraction).

Author information

These authors contributed equally: Sergey Abramov, Alexandr Boytsov.

Authors and Affiliations

Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
Sergey Abramov, Alexandr Boytsov, Dmitry D. Penzar, Ilya E. Vorontsov & Ivan V. Kulakovskiy
Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
Sergey Abramov, Alexandr Boytsov, Dmitry D. Penzar, Marina V. Fridman, Alexander V. Favorov, Ilya E. Vorontsov, Vsevolod J. Makeev & Ivan V. Kulakovskiy
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
Sergey Abramov, Alexandr Boytsov, Dmitry D. Penzar, Eugene Baulin & Vsevolod J. Makeev
Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
Daria Bykova & Dmitry D. Penzar
Federal Research Center for Information and Computational Technologies, Novosibirsk, Russia
Ivan Yevshin, Semyon K. Kolmykov & Fedor Kolpakov
Sirius University of Science and Technology, Sochi, Russia
Ivan Yevshin, Semyon K. Kolmykov & Fedor Kolpakov
BIOSOFT.RU LLC, Novosibirsk, Russia
Ivan Yevshin, Semyon K. Kolmykov & Fedor Kolpakov
Johns Hopkins University School of Medicine, Baltimore, MD, USA
Alexander V. Favorov
Institute of Mathematical Problems of Biology RAS—The Branch of Keldysh Institute of Applied Mathematics of Russian Academy of Sciences, Pushchino, Russia
Eugene Baulin
State Research Institute of Genetics and Selection of Industrial Microorganisms of the National Research Center Kurchatov Institute, Moscow, Russia
Vsevolod J. Makeev
Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
Vsevolod J. Makeev & Ivan V. Kulakovskiy

Contributions

S.A. and A.B. developed the computational framework and database; S.A., A.B., and I.E.V. developed the website; D.B., E.B., I.E.V., A.V.F., and M.V.F. performed the functional annotation and motif annotation of ASBs; D.B. and D.D.P. performed the machine learning analysis; I.Y., S.K.K., and F.K. established the GTRD alignments processing; V.J.M. and I.V.K. designed and supervised the study. All authors participated in the paper preparation.

Corresponding authors

Correspondence to Vsevolod J. Makeev or Ivan V. Kulakovskiy.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Bart Deplancke and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Abramov, S., Boytsov, A., Bykova, D. et al. Landscape of allele-specific transcription factor binding in the human genome. Nat Commun 12, 2751 (2021). https://doi.org/10.1038/s41467-021-23007-0

Received: 15 October 2020
Accepted: 12 April 2021
Published: 12 May 2021
DOI: https://doi.org/10.1038/s41467-021-23007-0
