Chromosomal rearrangement (CR) events have been implicated in many tumor and non-tumor human diseases. CR events lead to their associated diseases by disrupting gene and protein structures. Also, they can lead to diseases through changes in chromosomal 3D structure and gene expression. In this study, we search for CR-associated diseases potentially caused by chromosomal 3D structure alteration by integrating Hi-C and ChIP-seq data. Our algorithm rediscovers experimentally verified disease-associated CRs (polydactyly diseases) that alter gene expression by disrupting chromosome 3D structure. Interestingly, we find that intellectual disability may be a candidate disease caused by 3D chromosome structure alteration. We also develop a Web server (3Disease Browser, http://3dgb.cbi.pku.edu.cn/disease/) for integrating and visualizing disease-associated CR events and chromosomal 3D structure.
Chromosomal rearrangement (CR) events are observed in many tumor and non-tumor human diseases1,2. A CR event occurs as a consequence of double-strand breaks (DSBs) of DNA, followed by abnormal rejoining of non-homologous ends, producing a new chromosomal arrangement3. Alternatively, a CR event results from crossing-over between repetitive DNA sequences during cell division4. CR events are categorized into different types: deletion, duplication, insertion, inversion, translocation and ring chromosome. The information of CR events and their associated diseases and clinical symptoms is available in databases such as ClinVar5 and dbCRID6.
CRs have functional consequences in gene regulation and expression. CRs may result in the formation of a fusion gene from which a fusion transcript and protein is generated. An example is the BCR-ABL gene in chronic myeloid leukemia (CML)7. The characteristic genetic abnormality of CML is a reciprocal translocation between the long arms of chromosomes 9 and 227. The molecular consequence of this translocation is the generation of the fusion protein BCR-ABL, a constitutively activated tyrosine kinase. A drug specifically targeting this fusion protein proved effective for CML patients8.
Chromosomes in the nucleus fold into highly complex three-dimensional structures that are dynamically regulated9,10,11,12. 3C-based techniques such as 5C, Hi-C and ChIA-PET capture genome-wide spatial interaction between chromatins at a high resolution13,14,15. The generated data result in deeper insights into the 3D genome and its functional consequences. For example, the interactions between gene promoters and remote enhancers are facilitated by loop structures between them15. The mammalian genome is organized into ~1 Mb topologically associated domains (TADs) which are functional units of gene replication and transcription16,17. When CRs alter chromosomes, they may also alter chromosomal 3D structure18,19. Figure 1a shows a model of TAD variation caused by CRs. In condition 1, TAD A and TAD B are insulated by a TAD boundary, which is often occupied by the CTCF protein. The enhancer in TAD B cannot interact and activate the gene in TAD A. However, when the TAD boundary is deleted by a CR, TAD A and TAD B merge into a single TAD C. In this condition, the enhancer and the gene can interact with each other, leading to gene activation. A de novo TAD boundary may also be generated by duplication or inversion events. Therefore, the structural variations of TADs can influence gene expression18.
In recent years, advances in DNA microarray and sequencing technologies have led to accurate identification of CR events and chromosomal 3D structures in large scale5,13. Combining these data sets may provide us more insights on the pathogenicity of CR events that affect chromosomal 3D structure. In this study, we predicted disease-associated CR events that may alter chromosomal 3D structure by integrating Hi-C and ChIP-seq data sets. We also developed a Web server for easily exploring the relationship between disease-associated CR events and chromosome 3D structures of multiple human cell types.
Predicting disease-associated CR events that affect TADs
Based on previous studies, abnormal 3D chromosome structure may be a consequence of CR events leading to diseases. When CR events affect TAD boundaries, they alter the interactions between genes and enhancers, leading to abnormal expression of genes (Fig. 1a). A recent study shows that CR events cause polydactyly diseases through altering 3D chromosomal organization18. A model for pathogenicity of CR events is shown in Fig. 1b. Wild-type chromosome conformation shows the structure of three TADs, separated by boundary elements. The activity of an enhancer (E) is restricted to gene 1 (G1) located inside TAD B. In an inversion event, the enhancer is moved out of TAD B and placed near gene 2 (G2), and the boundary is now on the right side of E. This results in interaction between E and G2, but prevents the original interaction between E and G1. In a duplication event, E is placed next to the duplicated G2 within a newly created TAD, resulting in their interaction and abnormal expression of G2 (Fig. 1b). Lastly, if a deletion event removes the boundary B1 and parts of TAD A and TAD B, the two TADs merge and E is able to interact with both G1 and G2, resulting in abnormal expression of G2.
Based on the above model, we calculate an SI (Structure Influence) score, which quantifies the degree that a disease-associated CR alters 3D chromosome structure (Fig. 1c). We inspect two features from these types of CR events: inversion, duplication and deletion. First, a CR region contains at least one TAD boundary, so the CR probably disrupts the structure of neighboring TADs. Second, there must be at least one predicted enhancer near the CR breakpoints or the TAD boundary, so the enhancer may mis-regulate the expression of nearby genes in disrupted TADs. To quantitatively evaluate these two features, for each CR, we calculate an Insulation Score measuring the strength of TAD boundaries in the CR15,20,21, and an Enhancer Score measuring the abundance of nearby enhancers (see Methods, Fig. 1c). The SI score is given by SI = Percentile (Insulations Score) ∗ Percentile (Enhancer Score), measuring the degree that a disease-associated CR may alter gene expression through changing 3D chromosome structure.
Top predicted disease-associated CR events that affect TADs
We collect 157,003 disease-associated CR events from ClinVar (http://www.clinvar.com)5 and 84 additional CRs by manual search in PubMed. We filter the data by genome loci, CR region length and proximity to TAD boundaries in the hESC (human embryonic stem cell) cell line (Fig. 2a), and obtain 5369 disease-associated CR events for the next analysis. The median length of these CR regions is about 800 kb (Fig. 2b). Considering the similar size of the filtered CRs and TADs (median size is 1.1 Mb in the hESC cell line), these CRs are likely to influence the chromosome 3D structure at the TAD level. The number of TAD boundaries in the hESC cell line overlapping with a CR region is mostly one or two (Fig. 2c).
Using the hESC cell line’s Hi-C and ChIP-seq data, we predict 261 CR events (top 5% SI score, FDR = 0.03, Fig. 2d) that may disrupt TAD boundaries and mis-regulate gene expression (black dots in Fig. 2e, top 20 in Table 1). The FDR is 0.03 and 0.04 for top 5% and 10% predicted CRs (Fig. 2d). There is only a moderate correlation between SI Score and CR length (The Pearson correlation is 0.24). We also use the sum of percentile (Insulation Score) and percentile (Enhancer Score) as a score to predict TAD-altering CR events, and the top 5% of predicted CRs are mostly similar (99% overlapping). The SI score of a CR associated with polydactyly diseases is 0.72, and it is ranked within top 4% among all filtered disease-associated CRs (red dot in Fig. 2e). For four of the five CRs from the Lupianez et al. study18 that are associated with limb developmental disorders, the SI scores are significant (Table 2). Therefore, our algorithm rediscovers experimentally verified disease-associated CRs that alter gene expression by disrupting chromosome 3D structure.
Although the hESC cell line is a suitable surrogate for the 3D genome of key cell types responsible for developmental diseases, we ask how different cell types affect our prediction. We expand analysis by using Hi-C and ChIP-seq data of six cell lines (GM12878, K562, hESC, HUVEC, HMEC and NHEK) representing different cell types. The top 5% predicted CRs in different cell lines are mostly cell line specific (Fig. 2f). Considering the differences and conservation of TADs and enhancers among cell types, it is essential to match a disease and its related cell types in our analysis. The conservation of the 3D structures (loop level) of different human cell lines is about 55% ~ 75% 15, and the conservation of enhancer can be as low as 30% between HeLa and K562 22. However, Hi-C data are currently not available for many cell types, and we do not find suitable Hi-C data for some diseases. To reduce the influence of mismatched cell types and a disease, we average the SI score of six different cell lines to calculate a mean SI score to screen disease-associated CRs. If the mean SI score is high and the FDR is significant, the TAD boundary and enhancer is more conserved among cell types and the result is less likely to be affected by mismatched cell types and disease.
Visualizing the relationship between TADs and CR events
We make several types of graphs to help us understand how CR events may lead to diseases via altering 3D chromosome structure. The first graph contains detail information about TADs, including Hi-C contact matrix, TAD boundaries16,21, ChIP-seq data and gene information (Fig. 3a). The second graph shows the 3D model of the chromosome structure surrounding a CR event (Fig. 3b). The third graph is the 3D model of the chromosome structure overlaid with additional genomic annotations (Fig. 3c, H3K4me1). These graphs help us to understand how CR affects 3D chromosomal structure and the distribution of enhancer on the 3D structure. Here we show two examples using known and predicted TAD-affecting CRs. For a CR of a patient with polydactyly disease18, the SI score (hESC) and mean SI score are both significant: 0.72 (FDR = 1.1%) and 0.23 (FDR = 1.1%), respectively. In this patient, there is a 926 kb duplication (red bar in Fig. 3a) in chromosome 2. This duplication causes an enhancer (blue box) ectopically interact with the nearby gene IHH, leading to high expression of IHH18. A specific spatial and temporal expression of IHH within the developing limb buds is essential for accurate digit outgrowth and correct digit number23. The 3D models of the TAD regions surrounding the duplicated region show that the duplication affects both TAD A and TAD B (Fig. 3b,c).
Among our top predicted CR events, one is associated with intellectual disability2. The SI score (hESC) and mean SI score of this CR are both significant: 0.40 (FDR = 4.7%) and 0.11 (FDR = 4.7%), respectively. In this patient of intellectual disability, there is a 1.6 Mb deletion (chr7: 70,257,735 - 719,093,76, red bar in Fig. 4a)24. The deleted genes (MIR3914-1, WBSCR17 and CALN1) are not known to relate to intellectual disability. At one side of the deletion is the gene AUTS2 (Autism susceptibility candidate 2, 7q11.22). The deletion in this patient does not affect the coding region of AUTS2 (Fig. 4a). Nevertheless, previous studies show that genomic rearrangements involving AUTS2 are associated with autism and intellectual disability2. In some patients CR events alter AUTS2’s coding region, and in other patients disruption of intergenic regions may drive the phenotype2. In particular, abnormal expression of the AUTS2 may cause phenotypes of intellectual disability25. Since AUTS2 locates within one TAD (Fig. 4a), when the deletion removes the TAD boundaries near its one end, nearby enhancer elements (blue box) may alter the expression of AUTS2. Hence we propose that the deletion may affect the expression of AUTS2 through changing the local 3D chromosome structure (Fig. 4b,c). In both examples, the TAD structures and enhancers are largely conserved among different cell types (Fig. 5), leading to significant mean SI scores.
Web server for the 3D structure of disease-associated CR events
We implement a constraint-based modeling that converts Hi-C data to 3D chromosome models based on the ShRec3D algorithm26. It uses a shortest-path method to convert Hi-C contact frequencies to relative distances between chromosome bins, and then uses the multi-dimensional scaling (MDS) method to reconstruct the three-dimensional coordinates of the bins that fit to their pair-wise distances. By using this workflow, we convert the Hi-C interaction matrix surrounding a CR region to 3D models (Figs 3b and 4b).
To display and share the integrated data and prediction results, we developed a Web server for querying and visualizing disease-associated CR events and 3D genome data (3Disease Browser, http://3dgb.cbi.pku.edu.cn/disease/). The visualization of the disease-associated CR and Hi-C data is user-friendly: (1) Type in an interested gene name or disease name in the search panel, and the records containing the query texts will be displayed below the search box (Fig. 6a). (2) Click the “View” link for a particular item in the list and it will generate a new page with detailed TAD and 3D model information for the particular item (Fig. 6c). (3) The default cell line is hESC, and click a different cell line name to view the corresponding data for that cell line. (4) Use mouse to interactively rotate and zoom the 3D structure, and click different ChIP-seq data buttons to view the corresponding epigenetic information overlaid on the 3D structure.
CR events are frequently associated with cancer and developmental diseases. In addition to their roles in disrupting coding regions, recent studies found that CRs may alter gene expression by affecting chromosomal structures. In this study, we developed a method to predict disease-associated CRs that may influence chromosomal 3D structure using Hi-C and ChIP-seq data. Our method rediscovers experimentally validated disease-causing CRs in the polydactyly diseases that alter gene expression by disrupting chromosome 3D structure18. Among our top predicted CR events, intellectual disability is a candidate disease caused by TAD-affecting CRs. This new discovery provides important clues about the causes of intellectual disability. Follow-up validations may compare the gene expression level of normal people and patients with CR in affected cell types, and use 4C and CRISPR experiments to inspect the impact of CRs on chromatin interactions.
In addition to TADs, chromatin loops are identified by 3D genome data15. Mammalian genomes are partitioned into larger TAD domains that contain smaller chromatin loops linking promoters and enhancers27,28. Capture Hi-C is used to investigate the interactions between autoimmune diseases-associated SNPs and genes29. In the future, we will expand our method to predict chromatin loops that are disrupted by disease-associated CRs or SNPs such as those in cancer30. Other ChIP-seq markers such as those indicating promoters may also be added to the scoring algorithm.
Although the 3D structures of different cell lines are conserved15, the enhancer patterns in different cell lines are cell line-specific22, causing differential TAD-disrupting capacity of a CR in different cell types. Therefore, TAD-affecting CR prediction should match a disease and its related cell types. We use mean SI score to mitigate this issue. Recently the chromatin interaction maps of 98 cell lines were predicted from histone modification data31. For future work, we will add these data to our model and Web server to better match diseases and 3D genome data.
We develop 3Disease Browser (http://3dgb.cbi.pku.edu.cn/disease/) for querying and visualizing human chromosomal 3D structure and disease-associated CRs. Existing 3D genome browsers can display Hi-C contact matrices (Hi-C Data Browser, Dekker lab), overlay annotation tracks such as ChIP-seq data and genes with the Hi-C heatmaps (Juicebox, Lieberman- Aiden lab), and visualize 3D genome structure (GMOL). Recently, the HiView Web server combines the GWAS variants with Hi-C data32. Our 3Disease Browser is the first disease-centric 3D genome browser that integrates Hi-C data, annotation tracks and 3D structure overlaid with epigenetic information (Table 3 for a comparison of 3D genome browsers).
In summary, integrating data of chromosomal 3D structure and disease-associated CRs provides important clues about disease pathogenicity. We made a first attempt to systematically correlate these data types to prioritize disease-related genomic alterations that may contribute to disease initiation and progression through 3D genome reorganization. Our methods and Web server are easily expandable to integrate other data types with 3D genome data, such as copy number alterations in cancer and disease-associated loci from GWAS studies.
We used high-resolution Hi-C data from Gene Expression Omnibus (GEO) with the accession number GSE63525 (GM12878, K562, HMEC, HUVEC and NHEK)15 and GSE35156 (hESC)16. The current version of 3Disease Browser contains these six human cell types (http://genome.ucsc.edu/ENCODE/cellTypes.html). GM12878 is a lymphoblastoid cell line produced from the blood of a female with European ancestry. K562 is an immortalized cell line from a female patient with chronic myelogenous leukemia (CML). hESC is an embryonic stem cell line from a normal male. HMEC is a mammary epithelial cell line from a normal person. HUVEC is an umbilical vein endothelial cell line from a normal person. NHEK is an epidermal keratinocyte cell line from the skin of a normal person. The ChIP-seq data of the above cell lines come from the ENCODE project (https://www.encodeproject.org). The CR data come from ClinVar (Release 2015-12, http://www.ncbi.nlm.nih.gov/clinvar/) and manual search of PubMed.
Calculation of Structure Influence score
We calculated the Insulation score15,20,21 of each chromosomal bin using the 10 kb resolution Hi-C data. DI (Directionality Index) and HS (HMM State) were calculated using the Dixon method16. The Insulation Score of a CR region is the highest score of the TAD boundaries identified in the CR. The enhancer data come from the Broad ChromHMM track in the UCSC Table Browser (http://genome.ucsc.edu). We only used the strong enhancer (state 4 and state 5) to calculate the Enhancer Score. For a CR event, we calculated the sum of enhancer base pairs in each of the three 400 kb regions surrounding the two CR breakpoints and the TAD boundary with the highest insulation score TAD boundary as the Enhancer Score. Finally we calculated the SI score as Percentile (Insulation Score) * Percentile (Enhancer Score), with percentile relative to the respective scores of other CR regions. Alternatively, we also calculate the sum of enhancer base pairs of the whole CR region normalized by CR length as the Enhancer Score (Figure S1a). The Pearson’s correlation of the two methods’ SI score is high (0.87, Figure S1b). We conclude that both methods have their merits.
To calculate the FDR of top 5% scores, we permuted corresponding relationship between Insulation Score and Enhancer Score of the CR regions 1000 times. Each time, we record the number of SI scores larger than 0.68 (the smallest score of top 5%). The ratio between the mean number of SI scores > 0.68 in permuted data and the number of top 5% CRs is the estimated FDR. The minimum FDR of the six cell lines is regard as the FDR of mean SI score.
Web server implementation
How to cite this article: Li, R. et al. 3Disease Browser: A Web server for integrating 3D genome and disease-associated chromosome rearrangement data. Sci. Rep. 6, 34651; doi: 10.1038/srep34651 (2016).
We thank Jian Wang and Yujie Sun for helpful discussion, Yuxuan Zheng, Qing Chen and Qiang Shi for disease data collection and analysis, and Lei Kong for Web server assistance. We thank Yang Chen, Zhihua Zhang and reviewers for critical comments on the manuscript. The work is supported by grants from Peking-Tsinghua Center for Life Sciences.