INTRODUCTION

Structural variations (SVs) have been recognized as an important cause of intellectual disability and multiple congenital abnormalities (ID/MCA) for many years.1, 2 An SV is defined as a difference in the DNA copy-number, orientation or location of relatively large genomic segments (typically >1 kb)3 and may include deletions, duplications, insertions, inversions and translocations.4 Genomic microarrays have been instrumental for the identification of one type of SV being submicroscopic copy-number variants (CNVs) in patients with idiopathic ID and congenital anomalies (reviewed in Vissers et al5). The diagnostic yield in studies using genomic microarrays for these patients is around 14–18%,6, 7, 8, 9, 10 which is a major improvement compared with conventional karyotyping. The genetic cause, however, remains elusive in a large proportion of patients with ID/MCA, and it is generally assumed that these patients’ genomes harbor hitherto undetected genomic alterations.

In the last few years, next-generation sequencing (NGS) has emerged as a very powerful technology and has led to the identification of the causal gene for many rare Mendelian disorders.11, 12, 13 In this study, paired-end mapping or mate pair sequencing was used. In comparison with conventional paired-end sequencing, the whole genome can be interrogated for structural variations with less sequence reads, while reaching the same physical coverage. Further, mate pair sequencing facilitates mapping across small repetitive regions because of its longer insert sizes.14 For this technique, genomic DNA is fragmented into preset fragment lengths (=insert size, eg, 3 kb) of which the ends (=mates) are sequenced by paired-end sequencing. This technology allows the discrimination between concordant mates (reads that map 3 kb from each other on the reference genome with correct orientation) and discordant mates (=reads that map closer or further than 3 kb and/or with incorrect orientation). In this way, structural variations, both balanced as well as unbalanced, can be detected. Korbel et al15 were the first to use NGS to map structural variations in the human genome, and several other groups have used this technique to finemap the breakpoints of specific structural aberrations (mostly apparently balanced chromosomal aberrations) in patients with ID/MCA.16, 17, 18, 19, 20, 21, 22, 23 Therefore, the technique has proven its usability in characterizing individual SVs. Here we describe the first systematic comparison between mate pair sequencing, genomic microarrays and karyotyping in a large cohort of ID/MCA patients referred to our diagnostic departments.

Our aim was threefold: first, we determined whether mate pair sequencing enables the identification of all previously detected balanced, unbalanced and complex chromosomal aberrations. Second, we explored the additional clinical value of mate pair sequencing in determining the precise structure of pathogenic SVs and the effects on underlying genes. Third, we evaluated whether the high resolution of mate pair sequencing could lead to an improved diagnostic yield.

MATERIALS AND METHODS

Collection of patients

DNA samples from patients with ID and/or congenital anomalies were collected from the genetic centers of Ghent (Belgium), Antwerp (Belgium) and Utrecht (The Netherlands). From five patients, the parent samples were also collected and investigated with mate pair sequencing. Informed consent was obtained from parents of all the patients.

Chromosome analysis

Analysis of G-banded metaphase chromosomes was performed on short-term lymphocyte cultures using standard procedures. For fluorescent in situ hybridization (FISH), probes were labeled with SpectrumGreen or SpectrumOrange with the nick translation kit (Abbott Molecular, Ottignies, Belgium) according to the manufacturer’s instructions. FISH was performed as previously described.24

Array CGH

Copy-number profiling was performed using 105 K (amadid#019015) or 180 K (amadid#023363) Human Genome CGH Microarray slides from Agilent Technologies (Santa Clara, CA, USA) following the manufacturer’s protocols.

Data analysis and visualization was done with our in-house developed webtool Vivar (http://www.medgen.ugent.be/vivar/; Sante et al, in preparation). DNA CNVs were identified by circular binary segmentation.25 Interpretation of CNV data was performed as described in Buysse et al.7

Mate pair sequencing

Illumina library preparation

Illumina libraries were made with the Illumina 2–5 kb mate pair protocol v2 with minor modifications (Supplementary Methods).

Samples were pooled into groups of 4. Each pool was sequenced (2 × 50 bp) on a single lane of the HiSeq2000 (Illumina, San Diego, CA, USA).

SOLiD library preparation

SOLiD mate pair libraries were generated according to the SOLiDv4 and SOLiD5500 long mate pair library preparation manuals (Life Technologies, Carlsbad, CA, USA). Libraries were sequenced on one quadrant of a SOLiDv4 sequencer or one lane of a SOLiD5500xl sequencer.

Analysis of mate pair data

Mapping of the data was done using Stampy26 for the Illumina data and BWA27 for the SOLiD data. Only uniquely mapped reads were further analyzed and all duplicate reads were removed (Supplementary Table S1).

Cluster analysis

Cluster analysis of discordant mate pairs was performed using an in-house developed script (Vivar, Sante et al in preparation).

DOC analysis

Coverage analysis was performed using CNV-Seq.28 As a reference pool, experiments were grouped according to GC-bias for normalization.

Filtering strategies

To differentiate between possible pathogenic aberrations and benign SVs, several filtering steps were introduced. An overview of the different filtering steps used in the cluster, coverage and array CGH analysis is given in Figure 1.

Figure 1
figure 1

Flowchart of the analysis of the array CGH, cluster and coverage analysis data. Two main filtering steps were used: comparison with DGV followed by a comparison of the remaining aberrations to the aberrations detected in the patient pool (#). Stampy and BWA were, respectively, used for Illumina and SOLiD data (*).

A more detailed description of the cluster analysis, depth of coverage (DOC) analysis and filtering strategies is given in the Supplementary Methods. Sequencing files were deposited to the European Nucleotides Archive (http://www.ebi.ac.uk/ena) under accession number PRJEB4453. Sequencing data from patients 10, 11 and 13 were already deposited under accession number ERP001438.

Sanger sequencing of the breakpoints and qPCR

For the remaining aberrations after cluster analysis and filter steps 1 and 2 in the patients with a normal array profile, validation was performed by PCR amplification and capillary sequencing. Other selected breakpoints were also confirmed in this way (Supplementary Table S3). In the patients with a normal array profile, the remaining aberrations after DOC analysis and filter steps 1 and 2 affecting coding regions were validated by quantitative PCR (qPCR). More information can be found in the Supplementary Methods.

RESULTS

A systematic comparison was made between mate pair sequencing versus array CGH and karyotyping in a cohort of 50 ID/MCA patients who were referred to our diagnostic departments. This cohort was selected to represent the variety of SVs found in ID/MCA patients and contains 21 patients with deletions or duplications (recurrent or nonrecurrent), 6 patients carrying an apparently balanced aberration, 3 patients with a complex chromosomal rearrangement and 20 patients without a causal aberration (Tables 1 and 2, and Supplementary Table S2). For five patients, the parents were included in the analysis allowing the immediate identification of inherited and de novo aberrations.

Table 1 Overview of the detected aberrations in 30 patients with ID/MCA
Table 2 Genomic coordinates (in GRCh37) of identified clusters on chromosome 21 (patient 29) and chromosome 18 (patient 30)

Mate pair sequencing detects aberrations previously detected by array CGH and karyotyping

Our first aim was to investigate whether mate pair sequencing was able to detect the aberrations previously detected by array CGH and karyotyping. The mate pair data of the 50 ID/MCA patients and 10 parents were analyzed using DOC analysis and by clustering of the discordant mate pairs (see Materials and Methods and Supplementary Materials and Methods).

In a first analysis, we evaluated the ability of mate pair sequencing to detect copy-number changes based on DOC measurements. To assess the resolution at which copy-number changes can be detected using our mate pair data, we calculated the distribution of window sizes across the genome (in nucleotides) using 250 reads per window (Supplementary Figure S2A), yielding an average window size between 5.7 and 20.5 kb, depending on the amount of sequencing reads generated per sample (Supplementary Figure S2B). On the basis of these estimates, the resolution of DOC analysis is comparable with or higher than the 180 K CGH arrays, with an average probe spacing of 13 kb, used in this study.

Our patient cohort included 31 possibly relevant DNA copy-number changes (12 deletions and 19 duplications ranging in size from 66 kb to 8.1 Mb) divided over 22 patients (patients 1–21, patient 28). Using DOC analysis of the mate pair data of these 22 patients, we identified between 9 and 217 copy-number changes per sample (Supplementary Table S2). Next, we applied two filtering strategies to identify private possible pathogenic rearrangements among the predicted SVs in each of the patient genomes based on the DOC analysis. In the first step, we removed all variants that were overlapping with variants in the Database of Genomic Variants (DGV; Figure 1). This resulted in a reduction of the possible pathogenic aberrations by 75%. The second cross-sample filtering step involved comparison with other patients in our cohort (as explained in the Supplementary Materials and Methods), which further reduced the data sets to an average of 15 ‘private’ variants per patient (Supplementary Table S2). Based on this analysis scheme, we could readily identify all 31 copy-number changes in the 22 patients but no additional causal events were detected.

We tested the use of healthy parents as a control in family-trio-based detection of copy-number changes for patients 10 and 13, which carry de novo duplications. For each of the patients, we determined the overlapping copy-number changes that were found with respect to both parents. We found that only the expected duplications could be detected, showing that parents are optimal controls for the detection of de novo copy-number changes (Supplementary Figure S3). Overall, we conclude that the DOC signature of mate pair data is a robust alternative for array CGH for the clinical detection of copy-number changes.

As a second approach, we used clustering of anomalous read pairs to identify the known aberrations in our patients. We performed the same filtering procedures as described above and detected an average of four private variants per patient. In 18 patients, we were able detect the previously identified aberrations (both balanced as well as unbalanced) using cluster analysis (patients 2–4, 6–7, 9–13, 22–24 and 26–30). In 12 patients, cluster analysis failed to detect the aberration (ie, 8 recurrent and 4 nonrecurrent aberrations; Table 1) because of flanking segmental duplications (patients 1, 5 and 14–21) or other repetitive regions such as centromeric satellite repeats near the breakpoints (patient 25). In these cases, the flanking repeats impede the unambiguous mapping of short sequencing reads and hence the proper localization of the exact breakpoints.

Altogether, all the SVs that were previously detected by either karyotyping or array CGH in the patients from our cohort were readily detected in the mate pair data as based on clustering of discordant pairs and/or DOC analysis (Table 1) with the exception of a balanced whole-arm translocation (patient 25).

Mate pair sequencing defines the precise structure of SVs previously found by array CGH or karyotyping and reveals underlying gene defects

Although array CGH can efficiently detect regions with DNA copy-number changes, it does not provide the precise breakpoint junctions that underlie such changes. This information could be important in determining whether an SV is pathogenic or not. For all rearrangements that were identified based on clustering of discordant mate pairs, we delineated the molecular structure of the rearrangements. This included the genomic architecture of 15 duplications in 9 patients (patients 2–3, 7, 9–13 and 28). In 10 cases, the duplicated segment resulted from non-inverted tandem duplications. This was confirmed in two patients by capillary sequencing (Supplementary Table S3). In patient 9, two seemingly independent duplications on chromosome band 4q34.1 were detected by array CGH. Cluster analysis revealed the complex nature of this rearrangement, pinpointing one single event (Figure 2a). In patient 28, array CGH revealed the presence of three duplications on the long arm of the X chromosome. Mate pair sequencing allowed the direct reconstruction of the genomic architecture of this complex chromosomal rearrangement (Figure 2b), which was confirmed by FISH analysis (Supplementary Figure S4).

Figure 2
figure 2

Genomic nature of the duplications observed in patients 9 and 28. (a) Cluster analysis revealed the position of the duplications on chromosome band 4q34.1 in patient 9. (b) The duplications on chromosome X in patient 28 lead to a complex rearrangement, which was resolved by cluster analysis. A microhomology of 2 bp was observed at the first boundary and at the second Alu elements with a similarity of 93% were noted. Clusters of aberrant mates are indicated at the bottom as blue arrows linked together by the dashed lines. The direction of the arrows indicates the orientation of the reads (both experiments were done on the HiSeq).

Mate pair sequencing further enabled the refinement of five out of six apparently balanced aberrations (patients 22–24 and 26–27). None of these aberrations showed evidence for gains or losses near the breakpoints after array CGH and DOC analysis. Breakpoints were sequenced by capillary sequencing, revealing microhomology (2–3 bp) and insertions (2–13 bp) at the breakpoints suggestive for nonhomologous end repair as the primary mechanism (Supplementary Table S3). Mate pair sequencing revealed a complex chromosomal rearrangement in patient 27 instead of an apparently balanced translocation, as was apparent based on conventional chromosome analysis. This complex aberration consists of four breakpoints, involving an inversion on chromosome 14 and a translocation between chromosome 2 and 14 (Figure 3).

Figure 3
figure 3

Complex rearrangement in patient 27. Cluster analysis reveals a complex chromosomal rearrangement disrupting the NRXN3 and MAPK9 gene.

In patient 29 and 30, a chromothripsis-like event was observed with array CGH involving, respectively, chromosome 21 and chromosome 18.29 Figure 4 gives an overview of the results for all the detection techniques (ie, conventional karyotyping (panels a and b), array CGH (panels c and d) and mate pair sequencing (panels e and f)) nicely showing that cluster analysis revealed many breakpoint junctions between remote genomic regions on chromosome 21 (patient 29) and chromosome 18 (patient 30; Table 2).

Figure 4
figure 4

Chromothripsis-like events in patients 29 and 30. (a, b) Karyogram of patient 29 (a) and patient 30 (b). (c, d) array CGH profile of chromosome 21 (patient 29, c) and chromosome 18 (patient 30, d). (e, f) Cluster and coverage profiles of the rearranged chromosomes. Aberrant clusters are depicted as red arches. Segmental duplications on these chromosomes are shown at the bottom of each profile (ie, segdups).

In 14 patients, cluster analysis enabled the identification of a gene that was disrupted by a balanced rearrangement or a duplication event (Table 1).

No additional pathogenic SVs detected by mate pair sequencing in the ID/MCA patients in our cohort

The third aim of this study was to evaluate the possibility to identify clinically relevant genomic rearrangements beyond the resolution of routine array CGH analysis. To this purpose, we included 20 patients with a ‘chromosomal phenotype’ (indicating abnormalities in multiple organs and severe mental retardation) without causal aberration and a normal karyotype. Using a combination of cluster analysis and DOC measurements SVs could be detected. In this context, it is important to define the boundaries (resolution) with which SVs can be detected using the mate pair technology that we use here.

The detection limits of mate pair cluster and DOC analysis are mainly influenced by two factors. First, the mate pair library insert size determines the resolution for detection of SVs based on cluster analysis. In our experiments, we used an insert size of 3 kb, which provides a good balance between resolution and the ability to detect breakpoints across repetitive regions.14, 30 We estimated the resolution that can be reached based on clustering analysis by calculating the distribution of the sizes of deletions predicted by our analysis (Supplementary Figure S5). This shows that the majority of deletions is smaller than 10 kb and the lower detection limit is 1 kb. The human genome contains many more small deletions (<1 kb) than large deletions.31 Thus, a major fraction of small deletions remains undetected using read-pair analysis of libraries with 3 kb insert size as used here.

A second factor that primarily determines the resolution of detection for DOC analysis is the overall amount of sequencing reads generated for a sample.31 As described above, we estimate that the resolution for detection of copy-number changes in our data set varies between 5.7 and 20.5 kb depending on overall amount of sequencing reads generated for a sample.

Finally, we should note that a substantial fraction of the genome is refractory to sequencing using short reads, such as (sub)telomeric and centromeric regions. We calculated the distribution of physical genomic coverage across 7.9 Mb of non-repeat masked genomic sequences and we found that >95–100% of the sequences are covered at more than 1x and 80–100% are covered at more than 20x for each of the samples examined (Supplementary Figure S6).

With these resolution and coverage parameters for mate pair sequencing in mind, we examined the SV calls in the 20 patients without array CGH diagnosis. We found that >75% of these aberrations were also present in the DGV or in at least four other patients from our cohort. For the remainder of the rearrangements detected with cluster or DOC analysis, respectively, PCR across the breakpoint junctions in the patients and their respective parents or qPCR was performed to find out whether any of these rearrangements had occurred de novo. Although some of the rearrangements could not be confirmed by conventional PCR or qPCR (possibly false positive), the other rearrangements were inherited from one of the parents. We therefore conclude that mate pair sequencing did not reveal any de novo SVs in the samples beyond the resolution of array CGH and thus led to the same diagnostic results as the array CGH profiling.

DISCUSSION

SVs are an important cause of ID and congenital malformations. However, the current diagnostic tools to detect SVs have limited resolution (karyotyping) or cannot detect copy neutral rearrangements (array CGH) implying that improved technologies may have benefits. Previous work has shown that paired-end mapping or mate pair sequencing has unprecedented resolution for the detection of SV breakpoints of balanced genomic rearrangements in patients with ID/MCA and may thus be a valuable tool for diagnostic implementation.16, 17, 18, 19, 20, 21, 22, 23 Here we made a systematic comparison between mate pair sequencing versus array CGH and karyotyping. As a first important conclusion from this work, we demonstrate that all types of pathogenic SVs previously found with array CGH or karyotyping could also be found using a combined strength of the cluster and DOC signatures of mate pair sequencing. Even when breakpoints reside in repetitive regions (centromere or segmental duplications), DOC analysis could still reveal the breakpoint.

A major strength of mate pair sequencing is in the elucidation of the precise genomic architecture of the SVs (Table 1). For example, cluster analysis showed that most duplications are in tandem with a head-to-tail orientation. In the two patients where the duplications were not tandem, the precise genomic locations of the duplicated segments could be defined (Figure 2). In 14 patients, cluster analysis enabled the identification of a gene that was disrupted by a balanced rearrangement or a duplication event. In five of these cases, the disrupted gene could explain the phenotype of the patient (Table 1). Especially in patient 27, mate pair sequencing was of intrinsic value. The t(2;14) translocation in this patient disrupts the NRXN3 gene, but this does not explain the cardiac defects in patient 27. The second aberrant cluster on chromosome 14 due to an inversion disrupts the MAP3K9 gene, which is an interesting candidate gene for the cardiac phenotype observed in the patient.32 Without mate pair sequencing, this crucial information would not have been revealed. The enhancement in resolution of mate pair sequencing and hence the ability of directly pinpointing disease genes has great diagnostic value.

Both array CGH analysis and mate pair sequencing may be hampered in accurately assessing genomic regions with highly repetitive regions such as centromeric and telomeric regions. For example, recurrent aberrations (surrounded by low copy repeats) were only detected by coverage analysis and not by cluster analysis. Balanced whole chromosome arm translocations cannot be detected because of the repetitive nature of the centromeric breakpoints; these aberrations, however, have no immediate causal relationship to the phenotype of the patient.

The resolution of mate pair sequencing is merely dependent on the insert size and the amount of sequence reads. In our analysis, we noted big differences between different samples and between different runs, resulting in practical resolutions of 20.5 kb down to 5.7 kb for DOC analysis and a lower detection limit of 1 kb for cluster analysis. In 2011, Cooper et al33 compared CNVs in 15 767 children with ID and various congenital defects to 8329 adult controls, and concluded that in 14.2% of these children the disorder is caused by CNVs>400 kb. On the basis of this, Vermeesch et al34 suggested that arrays should aim to detect at least any imbalance larger than 500 kb, which is definitely the case for the arrays used in this study (13 kb). Still a large fraction of ID/MCA cases remains without a conclusive genetic diagnosis. We show that mate pair sequencing at the resolution used here did not enhance the diagnostic yield for these undiagnosed patients. This finding should fuel further efforts in order to search for smaller de novo coding and noncoding mutations as the underlying cause of the disease in a subset of these patients by means of exome or whole-genome sequencing.11, 35, 36 Other approaches, such as improvements of mate pair protocols, higher sequence coverage, combinations of different library insert sizes or screening of larger patient cohorts could possibly also lead to higher detection rates.

Altogether, we made a systematic comparison between mate pair sequencing and array CGH/karyotyping for the genetic diagnosis of patients with ID/MCA. We demonstrate that mate pair sequencing enables the rapid identification and delineation of structural variants and has added value for the identification of disease genes in these patients. Further improvements in sequencing throughput will allow the identification of the whole spectrum of genomic mutations from single nucleotide changes up to large SVs by means of (paired-end) whole-genome sequencing, making sequencing a holistic detection platform.