Introduction

Balanced chromosome abnormalities (BCAs) are a class of structural variations that change the orientation and/or localization of a chromosome segment without gain or loss of genomic material. Although most carriers are phenotypically normal, around 6% of them are thought to have an associated disease phenotype [1], commonly presenting with multiple congenital abnormalities and/or intellectual disability [2]. Precise breakpoint mapping is crucial for genotype-phenotype correlations and for the annotation of disease-associated genes [3], and can provide information about mutational signatures at junction points [4]. In females with a balanced X-autosome translocation, the normal X chromosome is preferentially inactivated. Thus, disruption of a gene on the rearranged X chromosome may cause complete absence of functional copies of that gene [5]. In this study, we evaluated ten female patients with balanced X-autosome translocations and an abnormal phenotype. We sequenced their junction points at base-pair resolution and inferred the mechanisms of formation in each case. We demonstrate that nucleotide-resolution characterization of breakpoints in BCAs with one of the breakpoints at heterochromatic and highly repetitive regions (e.g., centromeres or short arms of acrocentric chromosomes) is feasible by combining cytogenomic methods and short-read sequencing.

Materials and methods

Enrollment

We studied ten female patients with reciprocal balanced X-autosome translocations. Eight female controls, with ages ranging from 7 to 51 years, were selected for gene expression investigation. All samples used in this study were collected after appropriate written informed consent and approval of the local ethics committee (CONEP 36019314.9.0000.5505, CEP 0028/2015). Of note, clinical and cytogenomic evaluation for patients 1, 4, 5, 6, 7, 8, and 10 had previously been reported [6,7,8,9]. However, most rearrangements had not been mapped to sequence resolution, except for patients 6 and 8, who were previously described in clinical studies [8, 9]. The clinical description of all patients, the molecular characterization of their rearrangements and their expression profiles are described in detail in Supplementary Case Reports.

Banding, molecular cytogenetic evaluation and X-inactivation pattern evaluation

G-banding karyotyping at 550-band resolution was performed on lymphocyte cultures. DNA samples were isolated from peripheral blood using the Gentra Puregene Kit (Qiagen-Sciences). Chromosomal microarrays were performed using the Affymetrix Genome-Wide Human SNP-Array 6.0 (Affymetrix Inc.) or Human Genome CGH array 4 × 44k (Agilent Technologies). X-chromosome inactivation studies were performed by Human Androgen Receptor Assay, replication banding with BrdU and/or 5-ethynyl-2′-deoxyuridine (EdU) incorporation assay [10].

Breakpoint mapping

Two strategies were used to map the breakpoints: array painting followed by WGS, and array-CGH in a family member carrier of the same rearrangement but in unbalanced form.

Array painting, FISH, and whole genome sequencing

Breakpoints were ascertained by microdissection and amplification of the derivative chromosomes followed by array-CGH as previously described [11]. FISH using BAC (bacterial artificial chromosome) probes, which mapped across the breakpoints, was applied to 12 out of 16 breakpoints to validate their locations (Supplementary Table S1). To reach a higher resolution, the breakpoints were then addressed by shallow WGS. For the WGS, 2 µg of genomic DNA was sheared using Covaris, and 550 bp fragments were targeted to maximize the physical coverage. The sequencing library was prepared using the Tru-Seq DNA PCR-free Sample Prep Kit (Illumina) to avoid PCR duplicates, and 100 bp paired-end reads were sequenced on the HiSeq 2500 platform (Illumina). Sequencing control software, real-time analysis software, and CASAVA software v1.8.2 (Illumina) were used for image analysis and base calling. Burrows-Wheeler Aligner (BWA-MEM) [11] with default parameters was used to map the data to hg38. We reached 4.8× coverage and a sequence yield of 15.3 Gb on average. The mean insert size was 606 bp, alignment rate 97.8%, proper pair coverage 94.5%, duplicate rate 4.8%, and chimera rate 0.3%. The BAM file (Binary Alignment/Map format) was submitted to BreakDancerMax version 1.4.4 analysis [12] using the -t option to detect interchromosomal junctions. Based on the breakpoint localization according to array painting and/or BreakDancerMax analysis, BAM files were filtered for aligned reads adjacent to the breakpoints. Filtered BAM files were visualized in Integrative Genomics Viewer (IGV) [13] in search of chimeric inserts and split-reads. Patient 6’s rearrangement was evaluated only by WGS (Table 2).
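The filtering step described above can be illustrated with a minimal sketch. It assumes the alignments have been exported as SAM text lines and flags two kinds of evidence inside a window around a mapped breakpoint: read pairs whose mate aligns to another chromosome (chimeric inserts) and reads with a long soft-clipped end (candidate split-reads). The function name, the `min_clip` threshold, and the window coordinates are illustrative assumptions, not part of the pipeline used in this study.

```python
import re
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    read_name: str
    kind: str        # "chimeric_insert" or "split_read"
    chrom: str
    pos: int
    mate_chrom: str


def candidates_near_breakpoint(sam_lines: Iterable[str], chrom: str,
                               start: int, end: int,
                               min_clip: int = 20) -> List[Candidate]:
    """Return reads in chrom:start-end that hint at a junction.

    min_clip is a hypothetical threshold: the minimum soft-clip length
    (CIGAR 'S') for a read to count as a candidate split-read.
    """
    found = []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, rname, pos = fields[0], fields[2], int(fields[3])
        cigar, rnext = fields[5], fields[6]
        if rname != chrom or not (start <= pos <= end):
            continue
        if rnext not in ("=", "*", rname):
            # Mate aligned to a different chromosome: chimeric insert.
            found.append(Candidate(qname, "chimeric_insert", rname, pos, rnext))
        elif any(int(n) >= min_clip for n in re.findall(r"(\d+)S", cigar)):
            # Long soft clip: the read may span the junction itself.
            found.append(Candidate(qname, "split_read", rname, pos, rname))
    return found
```

The output of such a filter would still need visual inspection, as was done here in IGV, because soft clips and discordant mates also arise from mapping artifacts in repetitive sequence.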

Array-CGH and sequencing

Patient 6 transmitted her rearrangement as an unbalanced translocation to her daughter. The child’s rearrangement was characterized by chromosomal microarray analysis (CytoScan® 750K Array, Affymetrix, and a customized 8 × 60K slide, Agilent Technologies) as previously described [9].

Junction point sequencing

The junction points were further characterized by Sanger sequencing. Different sets of primers were designed around the breakpoints and long-range PCR was performed as previously described [14]. Primers are detailed in Supplementary Table S2. To amplify junctions between two non-heterochromatic regions, the primers were designed for the sequence flanking the breakpoints according to the reference genome. However, four patients presented one breakpoint in a non-heterochromatic region and the other breakpoint in a heterochromatic region, including reference genome gaps, e.g., 9q12 and 21p13, or in regions with highly repetitive sequence, e.g., the chromosome 9 centromere and 21p11.1. To design primers flanking the breakpoints in these four heterochromatic regions, we used IGV to visualize the BAM files from the WGS, inspecting the reads mapped around the breakpoint in the non-heterochromatic region. We selected the reads that were part of chimeric inserts and searched for the sequence of their mates in the raw sequencing data (FASTQ files). We manually de novo assembled the sequence of the heterochromatic breakpoint region by combining the sequences of these reads obtained from the FASTQ files. The primers were designed based on this de novo assembly to amplify the junction point. The strategy used to Sanger sequence the junction point in these heterochromatic regions is schematically described in Fig. 1.
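The assembly in this study was done manually. As an illustration of the underlying idea only, the sketch below merges a handful of mate reads by repeatedly joining the pair with the longest exact suffix-prefix overlap. It is a simplified greedy approach under stated assumptions: all reads are already on the same strand, overlaps are exact (no sequencing errors), and `min_len` is a hypothetical minimum overlap. The function names are illustrative.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_len: int = 10) -> List[str]:
    """Merge reads greedily by best exact overlap; returns the remaining contigs."""
    contigs = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:                    # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
```

In highly repetitive sequence a greedy merge can join reads from different repeat copies, which is one reason a manually curated assembly followed by PCR and Sanger validation, as described above, remains necessary.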

Fig. 1

Strategy used to Sanger sequence the junction points of translocations with breakpoints at heterochromatic regions. a IGV screenshot of whole genome sequencing paired-end reads of patient 8, visualizing split-reads (dark red) and chimeric inserts (red) in the Xq13.3 breakpoint region. b Numbered reads from panel a were selected and the sequences of their mates were obtained from the FASTQ files to assemble the sequence of the autosomal breakpoint region, which is located at the chromosome 9 centromeric region. Reads in dark blue are the remaining portion of the split-reads 1 and 2 (dark red in panel a) observed in IGV. c At the top, Sanger sequencing electropherogram of the der(X) junction point. The breakpoint is indicated by an arrowhead. Below, alignment of the Sanger sequencing of the der(X) junction point, read 1, read 2, and the read 6’s mate sequences from panel a to chromosomes X and 9 references. Sequences from chromosome X and 9 are in red and blue, respectively. A deletion of 2 bp in patient 8’s chromosome X is indicated in bold in the X-chromosome reference sequence

Chromosomal rearrangements are fully described in Supplementary Case Reports following HGVS nomenclature [15]. All gene disruptions validated by Sanger sequencing were submitted to the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/), as well as their clinical consequences to the patients (accessions SCV000854423 to SCV000854429).

Classification of disrupted genes pathogenicity

The genes disrupted by the breakpoints were classified into three categories: pathogenic, potentially pathogenic, or uncertain clinical significance. We considered a gene pathogenic when it met three criteria: (I) no description of a loss-of-function genetic variant in general population databases, (II) previous association with the patient’s phenotype in at least two other individuals, and (III) clear relation to the development, function or maintenance of the affected tissue or cell type. Potentially pathogenic genes met only two of these criteria, and genes with uncertain clinical significance met one or none of them.
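The three-criteria rule above amounts to counting how many criteria a gene meets. A minimal sketch follows; the function and parameter names are illustrative, and deciding whether each criterion is met still requires the manual curation described in the text.

```python
def classify_disrupted_gene(no_lof_in_population: bool,
                            phenotype_in_two_others: bool,
                            relevant_to_affected_tissue: bool) -> str:
    """Map criteria (I), (II) and (III) to one of the three categories."""
    met = sum([no_lof_in_population,
               phenotype_in_two_others,
               relevant_to_affected_tissue])
    if met == 3:
        return "pathogenic"
    if met == 2:
        return "potentially pathogenic"
    return "uncertain clinical significance"   # one or no criteria met
```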

Prediction of topologically associating domain (TAD) disruption

TADs were assessed in human embryonic stem cells [16] and lymphoblastoid cell lines [17] using the ENCODE browser for Hi-C data visualization. Genes strongly associated with the patient’s clinical features and encompassed by disrupted TADs were considered likely affected by position effect and further investigated. Additionally, positions of predicted regulatory elements were obtained from RegulomeDB chromatin state annotations [18], which are based on the Roadmap Epigenomics Consortium integrative analysis of human epigenomes from different tissues and cell types [19].
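The TAD evaluation here relied on visual inspection in a Hi-C browser. The logic of the question being asked, namely which TAD a breakpoint falls in and which genes share that TAD, can be sketched as a simple interval lookup. The `Interval` type, the function names, and any coordinates passed in are illustrative assumptions; a real analysis would take TAD boundaries from published Hi-C calls.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    name: str
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive


def disrupted_tad(tads: List[Interval], chrom: str, pos: int) -> Optional[Interval]:
    """Return the TAD containing the breakpoint, or None if it lies outside all TADs."""
    for tad in tads:
        if tad.chrom == chrom and tad.start <= pos < tad.end:
            return tad
    return None


def genes_in_tad(genes: List[Interval], tad: Interval) -> List[str]:
    """Names of genes that overlap the TAD on the same chromosome."""
    return [g.name for g in genes
            if g.chrom == tad.chrom and g.start < tad.end and g.end > tad.start]
```

Genes returned by such a lookup are only candidates; in this study they were retained when they were also strongly associated with the patient’s clinical features.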

Reverse transcription quantitative PCR (RT-qPCR)

Whole blood RNA was isolated from the patients and controls using the PAXgene Blood RNA MDx Kit (Qiagen-Sciences). The cDNA was synthesized using the High Capacity cDNA Reverse Transcription kit (Thermo Fisher Scientific). TaqMan assays (Thermo Fisher Scientific) were selected for all the genes disrupted at the breakpoints and for genes potentially affected by position effect. GAPDH and ACTB gene expression were used as internal controls. Assays are detailed in Supplementary Table S3.
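The text does not state how relative expression was computed from the TaqMan data. One common approach, shown here purely as an assumption, is the comparative Ct (2^-ΔΔCt) method: the target Ct is normalized to the mean Ct of the internal controls (here GAPDH and ACTB), and the patient’s normalized value is compared with the mean of the controls. Function names and Ct values are illustrative.

```python
from statistics import mean
from typing import Sequence


def delta_ct(target_ct: float, reference_cts: Sequence[float]) -> float:
    """Target Ct normalized to the mean Ct of the internal control genes."""
    return target_ct - mean(reference_cts)


def relative_expression(patient_target_ct: float,
                        patient_reference_cts: Sequence[float],
                        control_delta_cts: Sequence[float]) -> float:
    """Fold change of the patient versus the control mean (2^-ddCt, assumed method)."""
    ddct = delta_ct(patient_target_ct, patient_reference_cts) - mean(control_delta_cts)
    return 2 ** (-ddct)
```

Under this scheme a value near 1 indicates expression similar to controls, while a target with no amplification in the patient has no defined Ct and would be reported as absent rather than as a fold change.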

Results

G-banding karyotype and chromosomal microarray indicated that all patients presented BCAs, without cryptic genomic imbalances. As expected, X-chromosome inactivation studies showed a skewed X-inactivation pattern, with the normal X-chromosome preferentially inactivated. Patients 1 to 9 had at least one of the two junction points characterized at the nucleotide level. The breakpoint definition for patient 3 is illustrated in Fig. 2, whereas results for the other patients are fully described in the Supplementary Case Reports and summarized in Table 1.

Fig. 2

Patient 3’s breakpoint characterization at the nucleotide level and the rearrangement’s impact on the expression of disrupted genes. a Partial G-banding karyotype and b ideogram of the chromosomes involved in her translocation. c, d Sanger sequencing electropherograms of the der(11) (c) and der(X) (d) junction points. Black dashed box on der(11) shows microhomologies between chromosomes X and 11 breakpoint regions. Black dashed box on der(X) shows nucleotide insertions at the junction point. Below, alignment of the junction points to chromosome X and 11 reference sequences. e RT-qPCR expression levels of APOOL in whole blood of n = 8 Brazilian female controls and patient 3. Note the unmodified APOOL expression in the patient when compared to controls

Table 1 Summary of patients’ phenotypic features and chromosomal rearrangements

Breakpoint mapping

Molecular cytogenetic methods were effective in mapping 12 out of 16 breakpoints with a resolution range of 3.5 to 427.7 kb (Table 2). Given the inability of array-based methods to evaluate heterochromatic and highly repetitive regions, the autosomal breaks of patients 7 to 10 could not be mapped by array painting. The latter, however, mapped the non-heterochromatic breakpoints in these four patients, allowing the identification of inserts spanning the junction points in the further WGS analysis (Fig. 2).

Table 2 Effectiveness of the methods applied to characterize the break-junctions

The analysis of the WGS data with the BreakDancerMax software only detected junctions between non-heterochromatic regions, as it was effective in mapping both breakpoints in patients 1 to 6. BAM file chimeric inserts and/or split-reads were visualized on IGV and the breakpoints at non-heterochromatic regions were mapped with a higher resolution, ranging from 2 to 297 bp (Table 2). On the other hand, none of the four patients with one of the breakpoints in a heterochromatic region (patients 7 to 10) had their chromosomal junctions detected by the BreakDancerMax algorithm (Table 2). The four highly repetitive breakpoint regions could be manually de novo assembled (Fig. 2).

Sanger sequencing validation of the junction points

All 12 junctions between two non-heterochromatic regions were successfully amplified and sequenced (Table 2). For patients 7 to 10, in accordance with the nature of heterochromatic and highly repetitive regions, the BLAT search of sequences from the autosomal breakpoint regions returned pericentromeric regions and short arms of acrocentric chromosomes (Supplementary Table S4). Four out of eight junctions between a heterochromatic and a non-heterochromatic region could be Sanger sequenced (patients 7 to 9, Table 2). Of the four junctions involving heterochromatic regions for which Sanger sequencing was inefficient, two are from patient 10. This patient is a carrier of a translocation between Xp11.4 and 21p13, with the autosomal breakpoint in a gap in the reference genome. The few bases of the amplicons obtained from both of her derivative chromosomes aligned to Xp11.4. However, the electropherogram base calling was interrupted before spanning the interchromosomal junctions, indicating that her junction points were successfully PCR amplified, but the highly repetitive nature of the 21p13 sequence prevented the full sequencing of the amplicon (Supplementary Case Reports).

Mechanisms of chromosomal rearrangement formation

The sequencing of the junction points allowed us to delineate the nature of the disrupted sequences for 16 non-heterochromatic breakpoint regions (Table 3). Sixty-nine percent (11/16) of these breakpoints were located at repeat elements, e.g., LINEs, SINEs, LTRs, and STRs. Five out of the six breakpoints at LINE sequences were in the X-chromosome. Only patients 1 and 3 presented both breakpoints at repeat elements.

Table 3 Junction points’ sequencing, their features and mechanisms of formation

The mutational signatures of the junction points were used to infer the mechanisms of formation (Table 3). Microhomology at the junction points was observed in five patients (1, 3, 4, 6, and 8). In two cases (patients 2 and 5), nucleotide deletions and the insertion of a few base pairs suggested the Non-Homologous End Joining (NHEJ) mechanism [20]. For patient 3, the presence of a duplication of about 200 bases accompanied by microhomology indicated Microhomology-Mediated Break-Induced Replication (MMBIR) [21]. For three patients (1, 4, and 6), either NHEJ or MMBIR could account for the formation of the translocations. For patients 7 to 10 the mechanism of formation could not be determined since their heterochromatic breakpoints could not be confidently mapped.
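Microhomology at a junction means that a short stretch of bases at the join is identical in both parental sequences, so it cannot be assigned to either chromosome. In this study it was read from alignments of the Sanger sequences; the sketch below only illustrates the definition. It assumes both reference segments are given on the strand of the derivative chromosome and that the junction is `ref_a[:break_a] + ref_b[break_b:]`. Names and inputs are illustrative.

```python
def microhomology(ref_a: str, break_a: int, ref_b: str, break_b: int) -> str:
    """Return the ambiguous bases shared by both references at the junction.

    The derivative sequence is assumed to be ref_a[:break_a] + ref_b[break_b:].
    """
    # Extend to the right: bases after the break that are identical in A and B.
    right = 0
    while (break_a + right < len(ref_a) and break_b + right < len(ref_b)
           and ref_a[break_a + right] == ref_b[break_b + right]):
        right += 1
    # Extend to the left: bases before the break that are identical in A and B.
    left = 0
    while (break_a - 1 - left >= 0 and break_b - 1 - left >= 0
           and ref_a[break_a - 1 - left] == ref_b[break_b - 1 - left]):
        left += 1
    return ref_a[break_a - left:break_a] + ref_b[break_b:break_b + right]
```

Because the breakpoint can be placed anywhere within the shared tract, different choices of `break_a` and `break_b` inside it return the same microhomology sequence. Its presence alone does not discriminate between mechanisms, which is why additional features such as inserted, deleted or duplicated bases were considered above.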

Expression and phenotypic outcome of disrupted genes

The disruption of six coding sequences and one promoter region was observed at the breakpoints in six out of ten patients (Table 1). Five genes/promoters were disrupted at the X-chromosomal breakpoint: NEXMIF, ZDHHC15, AMMECR1, APOOL, and IL1RAPL1 (Table 1). The first three genes showed complete absence of expression in the respective patients (see Supplementary Case Reports for details). Two autosomal genes were found to be disrupted: ZNF611 and NEDD4L. The former presented reduced expression and the latter, higher expression in the patient (Table 1, Supplementary Case Reports). Patient 1 presented fusion genes formed at the junction points with the theoretical fusion transcripts being in phase. The experimental evidence for these fusion genes and fusion transcripts is detailed in Supplementary Case Reports.

The functional impairments of the NEXMIF, IL1RAPL1, and AMMECR1 genes were considered pathogenic. NEDD4L gene disruption was considered potentially pathogenic, and ZNF611, APOOL, and ZDHHC15 were classified as genes with uncertain clinical significance. The interpretation of the pathogenicity of the disrupted genes is fully described in Supplementary Case Reports. Four subjects (patients 5, 6, 9, and 10) presented no gene or promoter disruption at the breakpoints, suggesting that more complex mechanisms are responsible for their phenotypes. None of the four patients with premature ovarian failure had a disrupted gene that could be considered a candidate for their ovarian phenotype.

Disruption of TADs

In four patients, genes harbored in the disrupted TAD met the criteria to be considered potentially affected by position effect. None of these patients has gene disruptions that explain their phenotypic outcomes. In patient 10, the X-chromosome breakpoint is localized ~20 kb downstream of the TSPAN7 (MIM# 300096) gene sequence (Fig. 3a), a well-known neurodevelopmental gene [22] involved in excitatory synapse development [23]. TSPAN7 gene expression was abrogated in whole blood from patient 10 (Fig. 3b), suggesting that her neurodevelopmental delay was caused by TSPAN7 loss of function via position effect. Concordantly, the TAD disrupted by patient 10’s X-chromosome breakpoint harbors TSPAN7 and predicted enhancers active in brain and blood cells. In patient 6, the chromosome 4 breakpoint is 19 kb downstream of NKX3-2 (MIM# 602183) and disrupted the TAD that contains this gene and predicted chondrocyte enhancers. NKX3-2 is a homeobox gene that plays a role in the development of the skeleton, including cranial bones [24]. Abnormal expression of this gene has been previously associated with oculo-auriculo-vertebral spectrum [25], a condition present in patient 6. The NKX3-2 gene is not expressed in whole blood.

Fig. 3

Predicted position effect on genes neighboring the breakpoints. a Hi-C data visualization from ENCODE [17] of the X-chromosome 37.1-41.2 Mb (hg19) region showing the TAD (dashed line) that harbors patient 10’s X-chromosome breakpoint (black bar). This breakpoint is 20 kb downstream of the TSPAN7 gene (arrow). b RT-qPCR expression levels of TSPAN7 in whole blood of n = 8 Brazilian female controls and patient 10. Note the absence of TSPAN7 expression in the patient. c Hi-C data visualization from ENCODE [18] of the X-chromosome 72.1-77.6 Mb (hg19) region showing the TAD (dashed line) that harbors the X-chromosomal breakpoints of patients 4 and 7 (black bars) and a ~850 kb duplication (gray bar) previously described [31], and the TAD (dashed line) that harbors the FGF16 gene (arrow)

The other patients with position effect as a likely pathogenic mechanism present with ovarian failure. Patient 3’s X-chromosome breakpoint disrupts a TAD that harbors the POF1B (MIM# 300603) gene, which encodes a cytoplasmic protein involved in regulating epithelial polarity and in the formation of epithelial permeability barriers [26]. Point mutations in POF1B have been associated with primary amenorrhea [27]. In patient 5, the chromosome 2 breakpoint is 123 kb upstream of the FHL2 (MIM# 602633) gene, which is highly expressed in ovarian granulosa cells and related to DAX1 upregulation, required for ovarian development [28]. The POF1B and FHL2 gene expression study in whole blood showed no difference in expression level between patients and controls (data not shown), which does not exclude long-range regulatory changes, since the chromatin topology and interactions might differ between blood and ovarian cells.

Two other translocations associated with premature ovarian failure (patients 4 and 7) did not match the criteria that we established to indicate “position effect”; however, their X-chromosome breakpoint disrupted the same TAD, which also harbors a ~850 kb duplication previously associated with premature ovarian failure [29] (Fig. 3c). The neighboring TAD harbors FGF16, which acts as a key regulator in early oocyte development in teleost fishes, tilapia and chicken [30,31,32], is expressed in the human ovary and promotes cell proliferation and invasion in ovarian cancer [33]. The FGF16 gene is not expressed in whole blood.

Discussion

We performed a high-resolution breakpoint characterization on a highly selected and rare cohort: molecularly confirmed balanced X-autosome translocations in female patients with phenotypic alterations. Illumina short-read sequencing is considered the current gold-standard method for routine genome-wide breakpoint mapping in structural variants [34]. However, the alignment of repeated DNA sequences has always been a technical limitation. To enrich the library for the breakpoint regions, with DNA segments that harbor the junction points, a PCR-based library preparation is required, which affects the coverage and the alignment quality of the highly repetitive and GC-rich genomic regions [2]. In our study, the use of a PCR-free library preparation method to perform the shallow WGS, with no targeted genomic capture, was crucial to address breakpoints in heterochromatic regions with minimized costs. Future dissemination of long-read sequencing might improve the ability to characterize highly repetitive DNA segments of the human genome [35]; however, these methods are still not implemented in routine diagnosis.

Currently, standard sequencing-based methods are ineffective for high-resolution breakpoint definition in BCAs with one of the chromosomal breaks at highly repetitive regions (e.g., centromeres or the short arm of acrocentric chromosomes), regardless of the coverage, tool or algorithm used for the breakpoint calling, since these DNA segments cannot be confidently mapped. It is expected that ~7–8% of BCAs’ breakpoints are inaccessible by short-read sequencing [36]. Patients with these inaccessible breakpoints are frequently removed from high-resolution breakpoint mapping studies on subjects with BCA, leading to biased conclusions about the frequency of formation mechanisms of the rearrangements and the genomic nature of the DNA sequences that they disrupt. Even though our cohort has four out of ten subjects with breakpoints at heterochromatic regions, we were able to identify all 20 breakpoint regions. Additionally, the Sanger sequencing validation was efficient for most of the junction points analyzed. These analyses demonstrate the feasibility of identifying breakpoints at highly repetitive DNA segments, including gaps in the reference genome, using a cost-effective approach. The combination of cytogenomic methods with low-coverage WGS was crucial for characterizing junction points involving heterochromatic regions, since in these cases the tracking of WGS chimeric inserts and split-reads was based on the array painting results.

The junction point sequencing provided insights into the formation mechanism of these chromosomal rearrangements. The mutational signatures of NHEJ, the major mechanism in BCAs’ formation [37], were compatible with the junction points of six patients. For non-recurrent translocations associated with human congenital anomalies, it has been described that breakpoint regions are enriched for repeat elements [36]. In our study, 11 out of 16 non-heterochromatic breakpoints occurred at repeat elements. LINE sequences were the most frequent repeat element disrupted in our cohort and four out of nine Sanger-validated X-chromosome breakpoints disrupted a LINE element. This pattern may indicate a possible role of LINE sequences in the mutational mechanism of rearrangements involving the X-chromosome or may simply reflect the X-chromosome composition. Further studies are required to clarify the relation between LINE sequences and the formation of chromosomal rearrangements involving the X-chromosome.

It has been demonstrated that there is an enrichment of genes related to developmental disorders at the breakpoints of BCAs associated with congenital abnormalities [36]. In our cohort, two genes (NEXMIF and IL1RAPL1) were determined as causative of intellectual disability [6] and one gene (NEDD4L) is potentially associated with neurodevelopmental disorders. In the four patients with no gene disruption at their breakpoints, more complex pathogenic mechanisms await clarification, such as disruption of regulatory interactions, enhancer adoption, and modification of the chromatin landscape, among others. In six patients, position effect affecting genes neighboring the breakpoints was identified as a putative pathogenic mechanism, and the regulatory interactions that act on these candidate genes (TSPAN7, NKX3-2, POF1B, FHL2, and FGF16) require further functional analysis. Four of these six patients present an ovarian failure phenotype and, concordantly, premature ovarian failure has been associated with position effect in females with balanced X-autosome translocations [38]. It is worth mentioning that high-resolution chromosomal architecture data are only publicly available for a limited variety of human cell types and tissues, which does not include the developing ovary. During the development of the female gonad, the reorganization of the chromatin contacts caused by these chromosomal rearrangements might alter gene expression and function.

In this work, six out of ten patients had their breakpoints revised after the detailed molecular breakpoint definition. This emphasizes the importance of breakpoint mapping at nucleotide resolution in balanced chromosomal rearrangements to reveal the pathogenic mechanism and improve the genotype-phenotype correlation. This study demonstrates the feasibility of molecularly characterizing breakpoints at heterochromatic regions by combining cytogenomic methods and short-read next-generation sequencing in a cost-effective manner. This characterization of BCAs with chromosomal breaks at highly repetitive DNA segments can provide new insights into their mechanisms of formation and the properties of the genomic regions disrupted in these rearrangements, and can reveal new genes associated with human diseases.

Web resources

1000 Genomes: http://www.internationalgenome.org/. DECIPHER: https://decipher.sanger.ac.uk/. DGV (Database of Genomic Variants): http://dgv.tcag.ca/dgv/app/home. ENCODE (Encyclopedia of DNA Elements): https://www.encodeproject.org/. ExAC (The Exome Aggregation Consortium): http://exac.broadinstitute.org/. GO (Gene Ontology): http://www.geneontology.org/. GTEx (Genotype-Tissue Expression): http://www.gtexportal.org/. IMPC (International Mouse Phenotyping Consortium): http://www.mousephenotype.org/. OMIM (Online Mendelian Inheritance in Man): http://www.omim.org/. UCSC Blat Search: https://genome.ucsc.edu/cgi-bin/hgBlat