Mapping and phasing of structural variation in patient genomes using nanopore sequencing

Despite improvements in genomics technology, the detection of structural variants (SVs) from short-read sequencing still poses challenges, particularly for complex variation. Here we analyse the genomes of two patients with congenital abnormalities using the MinION nanopore sequencer and a novel computational pipeline—NanoSV. We demonstrate that nanopore long reads are superior to short reads with regard to detection of de novo chromothripsis rearrangements. The long reads also enable efficient phasing of genetic variations, which we leveraged to determine the parental origin of all de novo chromothripsis breakpoints and to resolve the structure of these complex rearrangements. Additionally, genome-wide surveillance of inherited SVs reveals novel variants, missed in short-read data sets, a large proportion of which are retrotransposon insertions. We provide a first exploration of patient genome sequencing with a nanopore sequencer and demonstrate the value of long-read sequencing in mapping and phasing of SVs for both clinical and research applications.


Supplementary Figure 3. Distributions of the percentages of reads mapped by LAST.
For each sequencing run the percentage of mapped reads was calculated. The boxplots indicate the distribution of the percentages of mapped reads for multiple runs, stratified by data type (R7, R9_pass, R9_fail, R9_rapid and R9.4). R7, R9 and R9.4 represent different nanopore sequencing chemistries for MinION. Pass and fail indicates reads that were classified as either 'pass' or 'fail' following Metrichor basecalling. 2D indicates consensus reads generated from a template and complement read of a DNA duplex. 1D template and complement indicate reads derived from only one of the two strands (template or complement) of a DNA duplex. 'Rapid' means data from a rapid nanopore library preparation.

Supplementary Figure 6. Genomic GC content versus error rate in R9.4 MinION sequencing data.
A set of 1,064,470 randomly generated genomic positions (excluding polymorphic sites) were sampled from chromosome 1. For each of these positions the fraction of reads with deletion errors, insertion errors and mismatches was determined, based on MinION data from Patient2 (R9.4). In addition, the GC content of the reference genome was calculated based on a window of 10bp at each examined genomic position. a The fraction of deletion errors, b insertion errors, c mismatches and d matches to the reference genome are depicted (y-axis) as a function of genomic GC content (x-axis). For deletion errors, a linear regression model shows a statistically significant dependency of the error rate on the GC content (p < 10^-16). The estimated coefficient, as change of error fraction per percent of GC-content, is 0.0072 (std. error = 0.0007).

Supplementary Figure 7. Coverage distribution for sequencing data from Patient1 and Patient2.
Coverage distributions were generated by calculating the coverage for 1,000,000 random genomic positions, excluding positions in the gap table downloaded from the UCSC genome browser (GRCh37 gaps in golden path).

Supplementary Figure 8. Average coverage distribution as a function of GC-content for MinION and Illumina sequencing data of Patient1 and Patient2.
Panels a and b show statistics of depth of coverage for Illumina data (red) and MinION data (green) for Patient1 and Patient2 respectively. Panel c shows statistics of depth of coverage for Illumina data (red), MinION nanopore "pass" data (green) and MinION nanopore "fail" data (dark yellow) of Patient1. Panel d shows the GC content distribution across our randomly sampled intervals. The average coverages across 100,000 randomly sampled 5kb genomic intervals were used in each panel. Average coverage outliers, defined as 6 or more interquartile distances away from the median, were discarded for each technology respectively. The remaining data were normalized to N(0,1), to account for different genome-wide sequencing average coverage and binned by GC-content. A linear regression model shows a statistically significant dependency of coverage depth on the GC content expressed as percentage, for both technologies (p < 10^-16). The estimated coefficients, as number of standard deviations of change, per percentage of GC-content, are -0.094 (std. error = 0.0004) and -0.029 (std. error = 0.0004) for the Illumina and MinION data of Patient1 respectively (panel a non-binned data ). Conversely the estimated coefficients for Patient2 are 0.033 (std. error = 0.0004) and -0.018 (std. error = 0.0004) for the Illumina and MinION nanopore data respectively (panel b non-binned data).

Supplementary Figure 9. K-mer distribution in MinION sequencing data of Patient1.
Plotted observed (MinION data) versus expected (GRCh37 reference genome) relative k-mer frequencies for 4-mers ( top ), 5-mers ( middle ) and 6-mers ( bottom ). The expected kmer frequencies are computed from the relative frequency of each kmer on the reference genome primary assembly for each k-mer size. The MinION data k-mer frequency was similarly computed, across all MinION reads, further stratified by "pass" ( middle ) or "fail" ( right ) read status. The "All" ( left ) represents the aggregate "pass" and "fail" MinION data. Figure 10. Overview of NanoSV algorithm. NanoSV uses LAST mapping output for discovery of SVs. In a first step candidate breakpoint junctions are detected using split read mappings. Candidate breakpoint junctions are subsequently clustered across multiple reads based on the overlap of junction coordinates and orientation. Clusters of breakpoint junctions are reported as SVs in VCF format. The tool is available on github: https://github.com/mroosmalen/nanosv . Figure 11. Detection of different SV types by NanoSV. NanoSV detects most types of breakpoints junctions with the exception of insertions consisting of unmapped repeat elements which are longer than the nanopore read lengths, e.g. LINE insertions may be missed if the read length falls below the typical length of LINE elements (~6kb). Genomic coordinates of mapped segments are indicated by s1 / s2 (start of segments) and e1 / e2 (end of segments). Gaps within reads represent unmapped segments, which may result from repeat insertions or complex variations. Deletions are discerned from insertions if the gap length is smaller than the distance between the joined genomic positions ( s2-e1 which represents SV size for variants other than insertions). Figure 12. Recall-precision curve for SV calling performance on simulated nanopore data. Breakpoints (501) were simulated on reference chr1 and based on the resulting chromosomal sequence nanopore reads were simulated using NanoSim 1 . SV calling using Lumpy 2 , Sniffles 3 and NanoSV was performed on subsets of the simulated nanopore reads to estimate the effect of read coverage. The recall (true positives/true positives + false negatives) and precision (true positives/true positives + false positives) was calculated for each call set, without any additional post-calling filters being applied.

Supplementary Figure 13. Structure of two complex breakpoint-junctions in Patient2 chromothripsis.
Long-insert mate-pair sequencing was previously used to study the chromothripsis in Patient2 4 . The long-insert size of these mate-pair libraries hampers detection of short chromosomal segments, because the short sequence reads can jump over the short segments and only reveal the connection between the segments flanking these short segments. In the upper panel, an 80bp segment from chr9 is depicted, which was identified using nanopore reads and confirmed by Sanger sequencing. The lower panel highlights two adjacent short genomic segments -both from chr9 -that were missed by the long-insert mate-pair sequencing, but detected by nanopore reads and subsequent PCR and Sanger sequencing.

Supplementary Figure 14. The effect of subsampling the MinION sequencing data on chromothripsis breakpoint-junction detection.
Nanopore sequencing reads were subsampled from 10% to 90% of the original data and each subsampled dataset was analyzed using NanoSV to determine the fraction of known chromothriptic breakpoint-junctions that could be detected. Below a coverage of~14x (Patient1), the fraction of detected breakpoint junctions drops below 1. Order and orientation of chromosomal regions involved in the chromothripsis rearrangements of Patient1 is depicted by colored lines with arrowheads. The resulting chromosomal configuration is based on overlapping nanopore reads derived from the paternal haplotype of Patient1. Nanopore reads that are instrumental for segment connectivity are indicated by black bars. The coverage track has been generated from all paternal reads mapping to the respective chromosomal segments. The order and orientation of the joined chromosomal segments matches the chromothripsis structure that is described in Figure 3 .

Supplementary Figure 17. Contig structure produced by Miniasm assembly of chromothripsis regions in Patient1.
Order and orientation of chromosomal regions involved in the chromothripsis rearrangements of Patient1 is depicted by colored lines with arrowheads and was obtained as for Figure 3 . The structure for two chromothriptic regions, containing three genomic segments each, was supported by contiguous sequences (contigs) resulting from Miniasm 5 assembly of nanopore reads, excluding reads that were assigned to the maternal haplotype. Both contigs (utg000068l and utg000063l) support part of the structure of derivative chromosome 2. The black arrows indicate the positions and orientations of contig segments mapped to the human reference genome (GRCh37). The illustrated ROC curve is obtained from 100 cross-validation random forest training runs (split 90%-10% for training-testing) from the total set of 354 true positive and 300 true negative SVs from the NA12878 sample. The chosen, optimal operating point has a precision of 82% at a recall rate of 75%.

Supplementary Figure 20. Heatmap showing the overlap of SV calls between different callers and SV datasets.
We used the NanoSV SV call set of Patient1 and Patient2 as a basis for intersection with SV call sets generated from Illumina data, using six different tools. Additionally, we used two tools for detection of SVs in the Nanopore data from Patient1 and Patient2. Finally, we intersected the NanoSV calls with the 1000 Genomes phase 3 consensus calls 6 . a Heatmap showing overlaps of 6,616 NanoSV SVs predicted as true positive by a random forest classifier ( Methods ). b Heatmap showing overlaps of the initial call set consisting of 15,369 candidate NanoSV SVs, following filtering for SVs that overlap homopolymers and tandem repeats ( Methods ). Figure 21. GC bias of nanopore specific SVs. GC content distributions across 500 base-pair windows around the high confidence set of SV calls that are detected in both Illumina and MinION nanopore data (red) and nanopore data only (blue). The average GC content in the regions where an SV is detected only in the nanopore data is 1.4% higher than the the average GC content where an SV is detected in both Illumina and MinION nanopore data (Welch two sample t-test: p-value = 1.8e-13, 95% CI = 1.0 -1.8).

Supplementary Figure 22. Patient1 and Patient2 cumulative distributions of SVs.
We plotted numbers of SV calls across SV types ( a and b ) and across SV annotations ( c and d ), after random forest filtering. a shows the histogram of SV type across both patients, subsetted for the "Illumina and nanopore" data and "nanopore" only data. b shows the SV type distribution for the same subsets as a . c shows the annotations distribution, by class, for all deletions detected in both nanopore and Illumina data. d shows the annotations distribution, by class, for all insertions detected in both nanopore and Illumina data. Figure 23. Nanopore read phase support. The plot show the distribution (density) of the percentage p of SNVs per read supporting the read phase of each nanopore read covering at least 20 phase-informative SNVs. The percentage p is defined as SNV supp /SNV total , where SNV supp is the number of phase-informative SNVs that support the read phase and SNV total is the total number of phase-informative SNVs covered by the nanopore read. Figure 24. Phasing-score distribution for nanopore reads from Patient1. For each nanopore read a phasing-score S was calculated (x-axis, Methods). The plot shows the distribution of phasing scores ( S ) for nanopore reads overlapping 1 to 10 phase-informative SNVs. If the phasing score S is positive, the read is assigned to the paternal haplotype, while for a negative value of S the read is assigned to the maternal haplotype.

Supplementary Figure 25. Alignment differences between BWA MEM and LAST.
Two examples ( a and b ) of how BWA and LAST segment the same read differently at alignment. Each whole read is depicted in blue. For each caller, the two grey/black lines depict how the read is split into two segments at alignment. The black line depicts the part of the read that is aligned and the grey parts depict the clipped parts of the read, for each segment respectively. Both these examples show how bwa splits reads into (at least slightly) overlapping segments, which impair our ability to evaluate candidate breakpoints. Only the read from example b contributes to a non HOM_REF SV call in our dataset.

Supplementary Figure 26. Distribution of random forest feature values in the NA12878 data.
Distribution of random forest feature values, within the NA12878 training data, for true positives (green) and false positives (red) respectively. P-values are derived from a two-sided unpaired wilcoxon test.

Supplementary Figure 27. Distribution of random forest feature values across samples.
Distribution of random forest feature values across all SV calls (after filtering for homopolymers and simple repeats) within NA12878 (green), Patient1 (red) and Patient2 (blue). The feature distribution of the training data (NA12878) is compared to the feature distribution of the two test samples, Patient1 and Patient2 using a wilcoxon paired test and the two p-values are reported in each feature plot; p-value p1 (comparing NA12878 and Patient1 distributions) and p-value p2 (comparing NA12878 and Patient2 distributions).