The ENCODE Blacklist: Identification of Problematic Regions of the Genome

Functional genomics assays based on high-throughput sequencing greatly expand our ability to understand the genome. Here, we define the ENCODE blacklist- a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The removal of the ENCODE blacklist is an essential quality measure when analyzing functional genomics data.

www.nature.com/scientificreports www.nature.com/scientificreports/ normalized, and the median signal is selected (See methods). This defines a comprehensive and cell-type agnostic signal across the genome that is unaffected by high signal from a particular cell-line (eg. CNVs) or low signal due to differential processing of input data. Regions with read depths or multi-mapping read rates in the top 1% are considered likely artifacts (Supplemental Fig. 1). In all cases, the mitochondrial DNA and any reads mapping to these sequences are pre-filtered from analysis and are considered part of the blacklist.
A blacklist was built for the human, mouse, worm, and fly genomes using all reads from input samples. In each organism, only a small portion of the genome was flagged as containing artifact sequence signal (Figs 1 and 2a). However, these regions were enriched for ChIP-seq reads in ChIP experiment for transcription factors (Fig. 1) and are particularly enriched for reads and peaks from lower quality experiments. In fact, ENCODE uses this as a quality control metric with some experiments having up to 87% of reads falling into blacklisted regions 5 . In Fig. 1 we show the distribution of all input reads mapped across chromosome 1, where reads mapping to blacklist regions are represented in red. The signal at blacklist regions are extremely high even though they account for a small fraction of the mappable chromosome (Fig. 1b). For example, this represented 582 million of 2.5 billion uniquely aligning reads mapping to blacklisted regions in the human ENCODE ChIP-seq data in hg19. These findings emphasize the extreme nature of these artifact regions and highlight the importance of filtering these regions to avoid incorrect biological conclusions.
We investigated the underlying characteristics of our automated hg19 blacklisted regions and their agreement with previously published lists 6 , which included our manually curated hg19 blacklist (DAC) (Supplemental Fig. 2). Though satellite repeats were used in the original ENCODE blacklists, they represent a small portion of the automated hg19 blacklist and are generally repeated in the genome annotation (Supplemental Fig. 2c). Additionally, they are not uniquely mapped with the algorithms used which prevents alignments to these regions. The automated hg19 and DAC blacklists detect the vast majority of regions flagged by a similar and complementary technique 6 (Supplemental Fig. 2b). Of the regions unique to the automated hg19 blacklist, all cover gaps in the genome assembly (Supplemental Fig. 2c), lending evidence that these regions of the genome are incomplete in the hg19 assembly. Indeed, a large number of these regions were patched in the next iteration of hg19 or dropped in the GRCh38 assembly (Supplemental Fig. 2c). Almost all of the flagged regions also included nuclear mitochondrial DNA segments (NUMTs, Supplemental Fig. 2c), a criterion that was overlooked in the initial manual blacklists. There are many mitochondrial genomes in comparison to the singular nuclear genome, leading to a high read depth of NUMTs. Additionally, NUMT sequences are scattered throughout the genome, contributing www.nature.com/scientificreports www.nature.com/scientificreports/ to overrepresentation of NUMTs in the input sequence. For these reasons, it is critical to include these sequences in the blacklist. A majority of the regions that were flagged by the DAC blacklist but missed in the automated hg19 blacklist were defined by repeatmasker class annotations as Satellite repeats (Supplemental Fig. 2d). While many of these repeat regions contain anomalous signal, those that were excluded from the automated hg19 blacklist do not show high signal and are uniquely mapped in hg19. None of the regions unique to the DAC blacklist were patched or removed in the new assembly.
We next sought to characterize the differences between regions blacklisted from the automated pipeline in hg19 compared to the blacklist from the hg38 genome assembly. Generally, the same classes of regions are enriched in both assemblies. Many regions do not overlap due to assembly differences such as the expanded centromere and satellite sequences that are features of the hg38 assembly as well as fixed/new gaps that vary in both builds (Fig. 2b). A large portion of the differences occurs at low-mappability (Low Map.) annotations which represent short repetitive elements in the genome assembly that are poorly mappable and as a consequence do not map well between assemblies (Fig. 2b). Overall, these differences lead to the conclusion that the major differences between blacklists are due to underlying changes in the sequence assemblies. Consequently, this lends to the hypothesis that the driving factor behind artifact regions in the genome are due to issues in the assembly rather than other factors.
Finally, to demonstrate the artificial correlation created by these peak regions, we performed a correlation analysis of the ENCODE peak regions in the human genome using a blacklist-filtered set as well as an unfiltered set of peaks (Fig. 2c). In the unfiltered set of peaks, blacklist regions sequester a large portion of ChIP-seq reads, leading to an illusion of high correlation of these regions with others. After blacklist filtering, the correlation structure is more distinct. As a specific example, the correlation of REST (a known repressor) auto correlates with other TFs (most of them activating) without the filtering of the blacklist. The removal of the blacklist regions removes spurious correlations seen with REST has essentially disappeared as would be expected given the known biology. We highlight the clear case of removing the noise in REST correlation, but the same standard holds true for the remaining factors in Fig. 2c. The ENCODE blacklists have been used to filter all of the ChIP-seq data from the ENCODE project and improvements in data from the application of the blacklist to these data are a key evaluation metric used by the consortium. For a complete list of artifact effects on peaks from all ChIP experiments used in ENCODE, we have provided a reference to the ENCODE quality control metric spreadsheet 5 . Furthermore, another detailed analysis of the detrimental effects of not excluding these artifact regions has been previously described 2 . Biological validations of the most robust signal regions will likely result in testing of these artifact regions, potentially resulting in incorrect biological conclusions. Therefore, identification of these regions and subsequent filtering lead to more accurate and stable results across experiments.
The method implemented here requires a significant amount of input sequencing data from different sources in order to generate an accurate blacklist. For our analyses, we use all available ENCODE ChIP-seq input datasets to estimate the genomic regions that have these artifact properties, and the use of multiple cell-types is important to avoid blacklisting regions that are specific to a single cell-type or tissue. We also caution that the blacklists are specific to each genome assembly and a lift over from an old assembly is not meaningful or valid. Finally, those studying genes in unmappable regions of the genome will find their data filtered by the blacklist. These regions account for ~3% of human protein coding genes that have previously been shown to be unmappable using short-read technologies 7 .  www.nature.com/scientificreports www.nature.com/scientificreports/ We present a resource of genomic regions that should be identified and either filtered from study or analyzed independently for better understanding as to their potential regulatory function. It is important to note that we do not propose a single blacklist that can encapsulate error defined across all NGS based assays. The presented blacklist shows high concordance between chromatin-based filtering (DNase/ATAC-DAC blacklist) and ChIP-seq input based filtering. This is not surprising given that the input DNA for ChIP-seq has been shown to be a proxy for lightly digested open chromatin assays 8 . However, the same criteria cannot be applied for whole genome sequencing (WGS) filtering and RNA-seq filtering. WGS filtering does not result in poorer annotations if there are higher read depths in regions, and therefore this method would be counterproductive to genome assembly. In the case of RNA-seq, more cell-type specific corrections for copy number are appropriate as there is virtually no overlap of coding regions with existing blacklist regions. The method presented is employed by the ENCODE project, as well as many other established analysis pipelines, and allows for a noise filtering on DNase-seq, ATAC-seq, and ChIP-seq datasets to help improve the accuracy of studies using these data. The removal of blacklists differs from the typical removal of signals from duplicate reads since these regions are problematic across different cell types and individual experiments. Ultimately, the removal of blacklists should be integrated within genomic assay analysis pipelines that incorporate high-throughput sequencing in order to assess biologically relevant and true signals.

Methods
Selection of input datasets. All data were acquired from the ENCODE Data Coordination Center. Using a previously published perl script (https://github.com/Boyle-Lab/ENCODE-API-Apps) 9 , we queried the ENCODE DCC API for unfiltered bam files labeled "input" that were released for the correct genome assembly. In the case of humans, these bam files were merged based on ENCODE assigned donor accession numbers to collapse data by cell type or individual. This was performed using 'samtools sort' to first sort all samples, 'samtools merge' to merge, and finally 'samtools index' to generate a new index of the resulting bam files 10 .
Generation of mappability data. The Umap tool was used to identify all positions on both strands of a target genome for which reads of a desired length starting at that position are uniquely mappable 11 .
Building the blacklist. For each input dataset (or merged input for human) from ENCODE, the number of reads per mappable base and the number of multimapping reads per million reads is calculated for each bin of 1 kb with 100 bp overlap across all chromosomes. The values across bins are then quantile normalized and a standard value at the 50% quantile is selected to represent each bin. This threshold was selected to avoid high signal outliers from individual cell types (for example, from copy number variants) and to avoid low signal from failed or incorrectly labeled input datasets. The standard values across the genome are then flagged if they are in the top 0.1% of signal for either read depth or mappability. Neighboring regions are merged if they maintain a signal in the top 1% of all signal or if they have no signal due to no mappability in the genome and any flagged regions within 20 kb were combined. This generates contiguous regions of abnormal signal across the genome.