LinkedSV for detection of mosaic structural variants from linked-read exome and genome sequencing data

Linked-read sequencing provides long-range information on short-read sequencing data by barcoding reads originating from the same DNA molecule, and can improve detection and breakpoint identification for structural variants (SVs). Here we present LinkedSV for SV detection on linked-read sequencing data. LinkedSV considers barcode overlapping and enriched fragment endpoints as signals to detect large SVs, while it leverages read depth, paired-end signals and local assembly to detect small SVs. Benchmarking studies demonstrate that LinkedSV outperforms existing tools, especially on exome data and on somatic SVs with low variant allele frequencies. We demonstrate clinical cases where LinkedSV identifies disease-causal SVs from linked-read exome sequencing data missed by conventional exome sequencing, and show examples where LinkedSV identifies SVs missed by high-coverage long-read sequencing. In summary, LinkedSV can detect SVs missed by conventional short-read and long-read sequencing approaches, and may resolve negative cases from clinical genome/exome sequencing studies.


Illustration of two types of evidence near SV breakpoints 92
Two types of evidence may be introduced while a genomic rearrangement happens: 1) reads 93 from one HMW DNA molecule which spans the breakpoint being mapped to two genomic 94 locations and 2) reads from two distant genome locations that get mapped to adjacent positions. 95 Both types of evidence can be used for SV detection. 96 First, we describe the signals of type 1 evidence. After reads mapping, the original HMW DNA 97 molecules can be computationally reconstructed from the sequenced short reads using their 98 barcodes and mapping positions. In order to distinguish them from the physical DNA molecules, 99 we use fragments to refer to the computationally reconstructed DNA molecules. A fragment has 100 a left-most mapping position, which we call L-endpoint, and a right-most mapping position, 101 which we call R-endpoint. As a result of genomic rearrangement, reads from one breakpoint-102 spanning HMW DNA molecule would be mapped to two different genome loci on the reference 103 genome. This split-molecule event has two consequences: 1) observing two separate fragments 104 sharing the same barcode; 2) each of the two fragment has one endpoint close to the true 105 breakpoints. Therefore, in a typical linked-read WGS data set, multiple split-molecule events could be captured and we would usually observe multiple shared barcodes between two distant 107 genome loci and multiple fragment endpoints near the breakpoints. 108 To illustrate this, Figure 1a shows the split-molecule events of a deletion, where breakpoints 1 109 and 2 are marked by red arrows. Multiple fragment endpoints are enriched near the two 110 breakpoints of a large deletion. This can be observed in deletions with minimal size of about 5-111 10 kb. Figure 1b and Supplementary Figures 1-3 show the patterns of enriched fragment 112 endpoints that are introduced by different types of SVs. As an example, Figure 1c shows the 113 number of fragment endpoints in a 5-kb sliding window near two deletion breakpoints, based on 114 a 35X coverage linked-read WGS data generated from the NA12878 genome (genome of a 115 female individual extensively sequenced by multiple platforms). At the breakpoints, the number 116 of fragment endpoints in the 5-kb sliding window is more than 100 and is five times more than 117 normal regions, forming peaks in the figure. 118 Since the fragments can be paired according to their barcodes, we can also observe fragment 119 endpoints of this deletion in a two-dimensional view. As shown in Figure 1d, each dot indicates 120 two endpoints from a pair of fragments which share the same barcode. The x-value of the dot is 121 the position of the first fragment's R-endpoint and the y-value of the dot is the position of the 122 second fragment's L-endpoint. The bottom panel and right panel in Figure 1d shows number of 123 dots that are projected to the x-axis and y-axis. Similar with the one-dimensional plot (Figure 1c), of the two-dimensional plot is cleaner than the one-dimensional plot since the fragments that do 126 not share barcodes are excluded. Therefore, the two-dimensional plot is more useful when the 127 variant allele frequency (VAF) is very low and there are only a few supporting fragments. 128 Next, we describe the signals of type 2 evidence. The barcodes between two nearby genome 129 locations is highly similar because the two locations are spanned by almost the same set of input 130 HMW DNA molecules. However, due to the genome rearrangement, the reads mapped to the left 131 side and right side of a breakpoint may originate from different locations of the alternative 132 genome and thus have different barcodes (Figure 1e). Dropped barcode similarity between two 133 nearby loci therefore indicates an SV breakpoint. LinkedSV detects this type of evidence by a 134 twin-window method, which uses two adjacent sliding windows to scan the genome and find 135 regions where the barcode similarity between the two nearby window regions is significantly 136 decreased. Figure 1f illustrates an inversion breakpoint detected by LinkedSV from the 137 NA12878 genome. The change of barcode similarity was plotted and a peak was formed at the 138 breakpoint. After searching for the two types of evidence, LinkedSV combines the candidate SV 139 regions and quantifies the evidence using a novel probabilistic model. The breakpoints are 140 further refined using short-read information, including discordant read pairs and split-reads. 141 142

Performance evaluation on simulated WGS data
To assess LinkedSV's performance, we simulated a 35X linked-read WGS data set with 1,175 144 SVs inserted using LRSIM 13 (see Methods for details). The breakpoints of the simulated SVs 145 were designed to be located in repeat regions, since we found that LinkedSV and other available 146 SV callers performed very well when the breakpoints were located in non-repeat regions, and 147 thus we set to test the performances of all the SV callers under more challenging situations. This 148 makes sense because SV breakpoints are more likely to be in repeat regions 7,8,9 , and because 149 these situations represent those that are difficult to be addressed by conventional short-read 150 sequencing approaches. 151 The simulated reads were aligned to the reference genome using the Longranger 12 package 152 provided by 10X Genomics. The Longranger pipeline internally uses the Lariat aligner 16 , which 153 was designed for the alignment of linked reads. SV calling was performed using LinkedSV as 154 well as three other available SV callers designed for linked-read sequencing: Longranger, 155 GROC-SVs 14 and NAIBR 15 . Two widely used short-read SV callers (Delly 3 and Lumpy 17 ) were 156 also used. portion of the false negative calls by GROC-SVs represents duplications that are smaller than 163 twice the fragment length. For large duplications, the reads of the alternative allele are separated 164 by a large gap so that we can observe two sets of fragments with the same set of barcodes, which 165 indicate an SV (Supplementary Figure 4a). If the duplication is not large enough, the reads will 166 be probably clustered into one fragment (Supplementary Figure 4b). Even in this case, we can 167 observe enriched fragment endpoints near the duplication breakpoints in LinkedSV. As an 168 example, Figure 2b shows the endpoint signal of a missed duplication call by GROC-SVs. The 169 supporting fragments of this duplication is shown in Figure 2c Somatic SVs are commonly found in cancer genomes 18,19,20 . However, due to the high 178 heterogeneity of genomic alteration in cancer genomes, somatic SVs often have low (as opposed 179 to ~50% in a germline genome) VAF and thus are more difficult to detect by SV callers designed Recalls, precisions and F1 values of the six SV callers were evaluated on both data sets ( Figure  182 3a, Figure 3b). When the VAF was 20%, the recall of LinkedSV (0.803) was much higher than 183 that of Longranger (0.306), GROC-SVs (0.324) and NAIBR (0.679) The F1 score of LinkedSV 184 (0.855) was also the highest among all the SV callers. When the VAF was 10%, LinkedSV still 185 had a recall of 0.761, which was 72% higher than the second best SV caller NAIBR. Longranger 186 detected 17% of the SVs while GROC-SVs almost completely failed to detect the SVs. The 187 recall rates of Delly and Lumpy were 0.28 and 0.72, respectively, indicating that some of the 188 SVs can be detected even without barcode information. These observations confirmed that other 189 SV callers were mainly designed for germline genomes and had substantial difficulty in 190 detecting SVs with somatic mosaicism. However, due to the combination of barcode overlapping 191 and enriched fragment endpoints in our statistical model (see Methods for details), LinkedSV 192 was able to achieve a good performance even when VAF was very low. We manually checked 193 the barcode overlapping evidence of some SV calls using the Loupe software developed by 10X 194 Genomics. Figure 3c shows an inversion that was missed by Longranger, and NAIBR but 195 detected by LinkedSV (at VAF of 10%). Although the variant frequency is low, the overlapped 196 barcodes between the two inversion breakpoints can be clearly visualized (in the black circles) in 197 the figure. Figure 3d shows the supporting fragments of the inversion detected by LinkedSV. 198 Each horizontal line represent two fragments that share the same barcode and support the SV. 199 These results suggest that the manufacturer-provided software tool has limitations for SV 200 detection, despite its strong functionality in visualization.
To test the performance of LinkedSV on the detection of disease casual SVs, we simulated one 202 germline and two somatic (VAF = 10% and 20%) linked-read WGS data sets with 51 203 deletions/duplications that were known to cause CNV (copy number variation) syndromes 204 Recently, the Genome in a Bottle (GIAB) Consortium released a benchmark call set for the 212 evaluation of germline SV detection 21 . The benchmark set was based on the HG002 genome and 213 was generated from integrating multiple SV calling methods from multiple sequencing platforms 214 including 10X Genomics sequencing and PacBio long-read sequencing. The current GIAB call 215 set only contains insertions and deletions. Since LinkedSV and the other three linked-read SV 216 callers cannot detect insertions, we only benchmarked the performance to detect deletions using 217 this benchmarking data set.
LinkedSV uses different strategies to detect deletions of different sizes. For deletions that are 219 more than 10 kb, LinkedSV uses the two types of evidence from barcode signals as described 220 above; for deletions that are within 1 -10 kb, LinkedSV uses a combination of read depth and 221 paired-end signals, with additional consideration of local haplotypes; for detection of SVs that 222 are less than 1kb, LinkedSV uses a local assembly-based method. Specifically, we modified the 223 FermiKit 22 de novo assembly pipeline to be a local assembler to improve speed and reduce the 224 complexity of the assembly graph (see Method for details). 225 Supplementary Figure 7a showed the performance on detection of deletions that were more than 226 10 kb. The recall and F1 score of LinkedSV was the highest among the seven methods. The four 227 linked-read SV callers performed better than the three short-read SV callers in terms of F1 score. 228 The performance on detection of deletions that within 1 -10 kb were shown in Supplementary 229 Figure 7b. The performance of LinkedSV was similar to Longranger, which also provided an 230 algorithm to detect small deletions. NAIBR and GROC-SVs did not perform well because they 231 were not designed to detect small events including small deletions. For deletions that were less 232 than 1 kb, LinkedSV (with modified FermiKit) performed best (recall = 0.48, F1 score = 0.64), it 233 detected more calls than the original de novo assembly version (recall = 0.43, F1 score = 0.60), 234 indicating that local assembly reduced the complexity of the assembly graph and improved the 235 performance. NAIBR, GROC-SVs and Lumpy did not perform well on deletions of this scale. However, by combining linked-read sequencing with WES capture platforms, it is possible to 245 alleviate this problem, and significantly improve the sensitivity of SV detection using WES. 246 To evaluate SV detection on linked-read WES data, we simulated a 40X coverage linked-read 247 WES data set with 1,160 heterozygous SVs (see Methods for details). 44.3% of the breakpoints 248 were not in exon regions. SV calling was performed using the six SV callers. As shown in Figure  249 4a, LinkedSV had the highest recall (0.79) and highest F1 score (0.86). In terms of the balanced 250 accuracy (F1 score), NAIBR was the second best caller, followed by GROC-SVs. 251 We analyzed false negative calls of the second best SV caller NAIBR. NAIBR tends to miss 252 some SV events that have shared barcodes but lack short-read support. For example, Figure 4b of capture regions. Breakpoint 1 (chr1:172545561) was 768 bp away from the nearest capture 255 region and breakpoint 2 (chr1: 173504265) was 392 bp away from the nearest capture region. 256 Unfortunately, no discordant read pairs that support the deletion could be found. However, 257 shared barcodes between the two breakpoints were clearly indicated by the Loupe software 258 ( Figure 4b). In addition, LinkedSV also detected 28 pairs of fragments that share the same 259 barcodes and support the SV. These fragments were plotted in Figure 4c. Although no short-read 260 support was found, the SV type could be determined using the pattern of enriched fragment 261 endpoints shown in Figure 1b. In this SV event, R-endpoints were highly enriched for the first 262 set of fragments and L-endpoints were highly enriched for second set of fragments. Thus, the SV 263 type was predicted as deletion. 264 265

Detection of F8 inversion from clinical WES data 266
We also tested the performance of LinkedSV on several clinical samples with linked-read WES 267 data. First, we applied LinkedSV on a WES sample of a male individual with Hemophilia A. 268 Previous experiments had shown that the patient had type I inversion of the F8 gene, where the 269 two breakpoints resided in intronic/intergenic regions, thus the inversion and its breakpoints 270 cannot be inferred from conventional WES. The F8 gene is located in Xq28. The intron 22 of F8 271 gene contains a GC-rich sequence (named int22h-1) that is duplicated at two positions towards 272 the Xq-telomere (int22h-2 and int22h-3). Int22h-2 has the same direction with int22h-1 while int22h-3 has the inverted direction. The type I inversion is induced by the recombination 274 between int22h-1 and int22h-3 23, 24 (Figure 5a). BLAST alignment of int22h-1 and int22h-3 275 showed that the two sequences had 99.88% identity. Since the breakpoints were located in two 276 segmental duplications with nearly identical sequences, the inversion is undetectable by 277 conventional short-read sequencing. Delly 3 and Lumpy 17 failed to detect the inversion from the 278 linked-read WES data (results were shown in Supplementary Tables 1-2). 279 Longranger, GROC-SVs, NAIBR and LinkedSV were also used to detect SVs from this sample. 280 None of the first three methods detected this inversion (results were shown in Supplementary 281 Tables 3-5), although the overlapping barcodes can be visualized using the Loupe software 282 ( Figure 5b). However, LinkedSV successfully detected this inversion by combining two types of 283 evidence. As described above, barcode similarity between two nearby regions are very high but 284 drops suddenly at the breakpoints. Figure 5c shows the suddenly drop of barcode similarity at the 285 two breakpoints. Each dot in the figure represents the reciprocal of the barcode similarity 286 between its left 40 kb window and right 40 kb window thus the Y value of the dots are reversely 287 related to the barcode similarity and positively related to the probability of being a breakpoint. 288 The barcode similarities are lowest at the two breakpoints and thus form two peaks in the figure  289 (marked by red arrow). In addition, LinkedSV also identified the supporting fragments of the SV 290 using type 1 evidence ( Figure 5d). The predicted breakpoint positions are consistent with the

Detection of mosaic NF1 deletion from clinical WES data 294
Another linked-read WES sample was from an individual who was clinically diagnosed with 295 Neurofibromatosis type 1. Neurofibromatosis type 1 is caused by mutations in the NF1 gene on 296 chromosome 17q11.2, which encodes neurofibromin, a GTPase activating protein that has a role 297 in the regulation of RAS signaling. Since standard genetic testing techniques including cDNA 298 sequencing and multiplex ligation-dependent probe amplification (MLPA) revealed no 299 constitutional or mosaic pathogenic mutation in this patient, we hypothesized that this patient 300 may carry an SV affecting the NF1 gene that escapes the detection by the applied standard 301 techniques. To evaluate LinkedSV, we utilized the 10X Genomics Chromium platform to 302 generate linked-read WES data to confirm and resolve the mutation. SV detection was conducted 303 using the four linked-read SV callers as well as Delly and Lumpy. Longranger detected 304 overlapped barcodes between exon 54 of the NF1 gene and intron 3 of RAB11FIP4. However, 305 the SV type was unknown and no supporting read pairs or split-reads were found. GROC-SVs, 306 NAIBR, Delly and Lumpy failed to detect this SV (Supplementary Tables 6-9). As shown in 307 Figure 6a, LinkedSV detected 16 fragment pairs that may support a deletion spanning the region 308 of chr17:29684175-29822527. In addition, a discordant read pair spanning the two breakpoints 309 were found (Figure 6b), which gave further evidence supporting the deletion. The breakpoints 310 were estimated from this discordant read pair and thus the resolution is a few hundred base pairs. In Figure 6, each colored line represent a reconstructed fragment, and ~13% of the fragments 312 belong to the variant allele, indicating the somatic mosaicism of this deletion. The right 313 breakpoint was within an AluJr sequence masked by repeat masker, which may explain why the 314 deletion was difficult to be detected by conventional methods. 315 In comparison, the clinical lab used massive parallel sequencing (TruSightCancer panel on a 316 MiSeq platform (Illumina)) and successfully revealed in exon 54 a transition of NF1 sequences 317 into a non-NF1 derived sequence. This sequence transition at NF1 position c.7886_7887 was 318 present in 8% of the reads covering this site in germline DNA of the patient. Analysis of the 319 reads displaying the aberrant sequence in exon 54 showed that the non-NF1 derived sequence 320 was part of an Alu element that matched best a sequence in intron 3 of the RAB11FIP4 gene 321 located 138 kb downstream of NF1 exon 54. These results suggested a low-level (~8%) 322 mosaicism of a deletion encompassing the region intervening between NF1:c.7886 and 323 RAB11FIP4:c.337-22216, so that the true deletion spans chr17:29684367-29822453, which is 324 very similar to our estimated breakpoints from LinkedSV above. In summary, our analysis on 325 two clinical samples with F8 inversion and NF1 deletion demonstrated the unique advantage of 326 linked-read sequencing in confirming and resolving structural variants in repetitive regions and 327 challenging situations. 328 We previously reported the de novo genome assembly of a Chinese individual (HX1) 25 . This 331 genome was sequenced deeply at 103X coverage using PacBio long-read sequencing. Recently, 332 the developers of SMRT-SV 10, 26 reported the SV calls of HX1 detected from the PacBio data. 333 Additionally, we have also generated a 37X linked-read WGS dataset on HX1. Therefore, in the 334 current study, we detected SVs from the linked-read data using LinkedSV and compared the SV 335 calls detected by LinkedSV and SMRT-SV. The SMRT-SV call set has 17 large deletions 336 (≥10kb), all of which were detected by LinkedSV. In addition, LinkedSV detected another 46 337 large deletions, which were missed by SMRT-SV. To validate these deletion calls, we mapped 338 the PacBio reads of HX1 to GRCh38 reference genome using minimap2 27 , and manually 339 examined all the SV affected regions in both PacBio data and linked-read data, using the 340 Integrative Genomics Viewer (IGV) 28 and the Loupe software tool. We classify a deletion as a 341 true deletion if there are decreased read depth in the deletion region and clear boundaries at the 342 breakpoints. After the manual inspection, we found that among the 46 deletions that are only 343 detected by LinkedSV, 34 of them have clear evidence of deletion in the WGS data; 10 of them 344 are complex SV events that need to be fully resolved; and 2 of them are false positive events. 345 Figure 7a-c showed an example of a deletion that were detected by LinkedSV but missed by 346 SMRT-SV. This is a 45 kb deletion located in chr2:110395971-110441346. A deletion pattern 347 was clearly indicated by the Loupe software tool (Figure 7a). After examine the PacBio reads, 348 we were able to found clipped reads at the breakpoint positions (Figure 7b, 7c). However, for 349 most of the clipped reads, the clipped sequences were aligned to the hs38d1 decoy sequence, except for 5 reads with clipped sequence > 7 kb. Analysis of the 5 reads revealed that the two 351 breakpoints in chr2 were not directly joined. There was a 6 kb insertion in between. The inserted 352 sequence was from hs38d1 ( LinkedSV. In addition, none of these duplications could be detected by Sniffles 11 , another widely 376 used long-read SV caller. After comparing with the segmental duplication database 29 , we found 377 that 182 of the 193 duplication calls (94.3%) were located in large segmental duplication regions. 378 Both long reads and linked reads could not be reliably mapped in these regions. As an example, 379 we plotted the read depth distribution in the region around a 25 kb duplication call of SMRT-SV.

Discussion
In this study, we present LinkedSV, a novel open source algorithm for structural variant 388 detection from linked-read sequencing data. We assessed the performance of LinkedSV on three 389 simulated data sets and two real data sets. By incorporating the two types of evidence as outlined 390 below, LinkedSV outperforms all existing linked-read SV callers including Longranger,  SVs and NAIBR on both WGS and WES data sets. caller that use type 2 evidence to detect breakpoints. Type 2 evidence is independent to type 1 402 evidence, and gives additional confidence to identify the breakpoints. In addition, type 2 403 evidence can be detected locally, which means we can detect a weird genomic location without 404 looking at the barcodes of the other genomic locations. This is particularly useful in two breakpoint is detectable and the other breakpoint located in a region where there is little coverage 407 within 50 kb, which is often the case in target region sequencing. As LinkedSV incorporates two 408 types of evidence from barcodes, and performs local assembly to detect small deletions, the 409 computation time of LinkedSV is longer than NAIBR, but shorter than GROC-SVs and 410 Longranger (Supplementary Figure 13). 411 In recent years, WES has been widely used to identify disease causal variants for patients with 412 suspected genetic diseases in clinical settings. Identification of SVs from WES data sets are more 413 challenging because the SV breakpoints may not be in the capture regions and thus there would 414 be little coverage at the breakpoints. Linked-read sequencing increases the chance of resolving 415 such type of SVs by providing long-range information. As well as there are a few capture regions 416 nearby, the fragments can still be reconstructed and type 1 and type 2 evidence can still be 417 observed. Our statistical models for both type 1 evidence and type 2 evidence were designed to 418 handle both WGS and WES data sets. GROC-SVs uses a local-assembly method to verify the SV 419 call, which requires sufficient coverage at the breakpoints. By using these two types of evidence, 420 LinkedSV can be less relied on short-read information (e.g. pair-end reads and split-reads). We 421 demonstrated that LinkedSV has better recall and balanced accuracy (F1 score) on the simulated 422 WES data set and can detect SVs even when the breakpoints were not located in capture regions 423 and have no short-read support. In addition, LinkedSV is also the only SV caller that clearly 424 detected the F8 intron 22 inversion and NF1 deletion from the clinical WES data sets.
Linked-read sequencing has several advantages over traditional short-read sequencing on the 426 purpose of SV detection. First, the human genome is highly repetitive. Previous studies have 427 shown that SVs are closely related to repeats and many SVs are directly mediated by 428 homologous recombination between repeats 30 . In traditional short-read sequencing, if the 429 breakpoint falls in a repeat region, the supporting reads would be multi-mapped and thus the SV 430 cannot be confidently identified. However, this type of SVs are detectable by linked-read 431 sequencing when the HMW DNA molecules span the repeat region. We can observe type 1 and 432 type 2 evidence in the non-repeat region nearby. In our benchmarking, LinkedSV detected more 433 SVs than Delly and Lumpy, especially when the VAF is low. Secondly, SVs are undetectable 434 from traditional short-read sequencing if there is little coverage at the breakpoints, which is often 435 the case in WES data sets. As described above, this type of SV can also be resolved by linked-436 read sequencing and LinkedSV. Third, linked-read sequencing requires less coverage for 437 detection of SVs with low variant allele frequencies. In linked-read sequencing data, short read 438 pairs are sparsely and randomly distributed along the HMW DNA molecule. In a typical linked-439 read WGS data set, the average distance between two read pairs derived from the same HMW 440 DNA molecule is about 1000 bp and each HMW DNA molecule only has a short-read coverage 441 of about 0.2X. Therefore, there are about 150 HMW DNA molecules (reconstructed fragments) 442 covering a genomic location of 30X depth. An SV of 10% VAF will has15 supporting fragment 443 pairs in a 30X depth location in linked-read WGS data set, which is sufficient to be detected by LinkedSV. However, an SV of 10% VAF will only has 3 supporting read pairs in a 30X depth 445 location in traditional short read WGS, which makes the detection more challenging. 446 Linked-read sequencing also has several advantages over long-read sequencing in terms of SV 447 detection. The fragment length of linked-reads (typically 50-100 kb) is longer than the read 448 length of regular long-read sequencing (typically 20-30 kb). Therefore, linked-read sequencing 449 has unique advantages for detection of large SVs. In our study, LinkedSV detected several large 450 SVs that were missed in the long-read SV call set. We also showed that the sequencing error (13-451 15%) of long-read sequencing technologies potentially had a negative effect on reads mapping 452 and subsequent SV calling (Supplementary Figure 11). In terms of library preparation, Linked-453 read sequencing only requires 1 ng input DNA, which is two orders of magnitude smaller than 454 what is needed by long-read sequencing. Therefore, disease samples of very low DNA amount 455 can be easily sequenced by linked-read sequencing. In addition, SNPs, indels and SVs can be 456 detected from linked-read sequencing simultaneously. 457 LinkedSV may have limitations on detection of SVs in large segmental duplication regions, 458 where the linked reads have low mapping qualities. SMRT-SV was able to find 194 large 459 duplications in the HX1 genome, which were not detected by LinkedSV and Sniffles, two 460 alignment-based SV callers. SMRT-SV detects SVs using an assembly-based approach. During 461 the assembly process, the assembly contigs were error corrected and polished by the PacBio 462 reads. Therefore, the assembly contigs are potentially more accurate and longer than each of the raw reads. Thus, it is possible for SMRT-SV to detect SVs in these large segmental duplication 464

regions. 465
The linked-read technology provides strong evidence to detect large SVs, but it provides little 466 additional evidence to detect small SVs. Therefore, LinkedSV has limited power to detect small 467 SVs such as small duplications and inversions. However, based on our analysis of SV size 468 distribution, large SVs are associated with diseases such as cancers and CNV syndromes 469 (Supplementary Note 2). Therefore, we expect that linked-read technology can help resolve 470 disease associated SVs. Similar to the existing linked-read SV callers, LinkedSV currently does 471 not handle insertions and repeat expansions. As a future direction, we plan to detect novel 472 sequence insertions using type 2 evidence, since this type of SV also cause a decrease of barcode 473 similarity between nearby regions and can be detected by the twin-window method. The exact 474 insertion sequence may then be inferred from the assembly of all the reads that share barcodes 475 with the candidate breakpoint. LinkedSV currently already supports local assembly to detect 476 deletions, but it has not been parameterized and optimized to be combined with type 2 evidence 477 for detection of insertions. 478 In summary, we present LinkedSV, a novel SV caller for linked-read sequencing. LinkedSV 479 outperformed current existing SV callers, especially for identifying SVs with low allele 480 frequency or identifying SVs from target region sequencing such as linked-read WES. We expect that LinkedSV will facilitate the detection of SVs from linked-read sequencing data and help 482 solve negative cases from conventional short-read sequencing. 483

Breakpoint detection from type 1 evidence 485
First, LinkedSV reconstructs the original long DNA fragments from the reads using mapping 486 positions and barcode information. All mapped reads are partitioned according to the barcode 487 and sorted by mapping position. We define gap distance as the distance between two nearest 488 reads with the same barcode. Two nearby reads are considered from the same long DNA 489 fragment if they have the same barcode and their gap distance is less than a certain distance G. G 490 is determined using two steps. First, we use G = 50 kb (the same as Zheng et.al 12 ) to group the 491 reads into fragments. This value is suitable for detection of large SVs. However, it may be too 492 large for detection of SVs that are smaller than 50 kb. Therefore, we calculate the empirical 493 distribution of intra-fragment gap distance, which is the distance of two nearby reads that are 494 grouped in one fragment. The empirical distribution of intra-fragment gap distance is calculated 495 from all the fragments, and we assign G as the 99 th percentile of this distribution. G is a fixed 496 number for all fragments and is usually between 5-15 kb, depending on the data set. Fragments 497 with a gap distance larger than G potentially span a breakpoint and will be separated to two 500 In non-SV regions, all the reads from the same HMW DNA molecule would be reconstructed 501 into a single DNA fragment. The reads from the breakpoint-spanning HMW DNA molecule will 502 be mapped to two different positions in the genome. As illustrated in the Result section, this 503 split-molecule event has two consequences: 1) observing two fragments sharing the same 504 barcode; 2) each of the two fragment has one endpoint close to the breakpoints. Therefore, we 505 could observe enriched fragment endpoints near the breakpoints, in both one-dimensional view 506 ( Figure 1c) and two-dimensional view (Figure 1d). The type of the endpoints (L-endpoint or R-507 endpoint) that enriched near the breakpoints depends on the type of SV (Figure 1b). The two-508 dimensional view has less background noise because the fragments that do not share barcodes 509 and thus do not support the SVs are excluded. Therefore, we detect the enriched endpoints in the 510 two-dimensional view. 511 We now describe how we detect the type 1 evidence of deletion calls, but the method can be 512 applied to other types of SVs. We define fragment pair to be two fragments sharing the same 513 barcode. Let b 1 , b 2 be the positions of the two breakpoint candidates (assuming b 1 < b 2 ). Let n be 514 the number of fragment pairs that may support the SV between b 1 and b 2 . Let F i1 , F i2 denote the 515 i th fragment pair that support the SV. Let B(F) denote the barcode of fragment F. Therefore, we 516 have: Let L(F) denote the L-endpoint position (i.e., left-most position) of fragment F, R(F) denote the 519 R-endpoint position (i.e., right-most position) of fragment F. Since this is a deletion and b 1 < b 2 , 520 R(F i1 ) is the position on F i1 that is closest to b 1 and L(F i2 ) is the position on F i2 that is closest to 521 b 2 (Supplementary Figure 14a). The distance between the fragment endpoint and its 522 corresponding breakpoint should be within gap distance distribution (explained in 523 Supplementary Figure 15). Therefore, for almost all (99% × 99%) of the fragment pairs, we have: 524 As described above, G is the 99 th percentile of the empirical distribution of intra-fragment gap 526 distance. 527 If we regard (R(F i1 ), L(F i2 )) as a point in a two-dimensional plane, according to equation (2), for 528 almost all (98.01%) of the fragment pairs (F i1 , F i2 ), ((R(F i1 ), L(F i2 )) is restricted in a G × G 529 square region with the point (b 1, b 2 ) being a vertex (Supplementary Figure 14b). 530 We used a graph-based method to fast group the points into clusters and find square regions 531 where the numbers of points were more than expected. First, every possible pair of endpoints 532 (R(F 1 ), L(F 2 )) meeting B(F 1 ) = B(F 2 ) formed a point in the two-dimensional plane. Each point 533 indicated a pair of fragments that share the same barcode. For example, if 10 fragments share the 534 same barcode, C 10 2 pairs of endpoints will be generated. A point/pair of endpoints may or may 535 not support an SV because there are two possible reasons for observing two fragments sharing the same barcode: 1) the two fragments originated from two different HMW DNA molecules but 537 were dispersed into the same droplet partition and received the same barcode; 2) the two 538 fragments originated from the same HMW DNA molecule but the reads were reconstructed into 539 two fragments due to an SV. The points are sparsely distributed in the two-dimensional plane 540 and it is highly unlikely to observe multiple points in a specific region. Next, a k-d tree (k = 2) 541 was constructed, of which each node stores the (X, Y) coordinates of one point. A k-d tree is a 542 binary tree that enable fast query of nearby nodes. Therefore, we could quickly find all pairs of 543 points within a certain distance. Any two points (x 1 , y 1 ) and (x 2 , y 2 ) were grouped into one cluster 544 if |x 1 -x 2 | < G and |y 1 -y 2 | < G. For each cluster, if the number of points in the cluster was more 545 than a user-defined threshold (default: 5), it was considered as a potential region of enriched 546 fragment endpoints. If the points in the cluster were not within a G × G square region, we used a 547 G × G moving square to find a square region where the points are best enriched. Theoretically, 548 the best enriched square region should contain 98.01% (0.99 × 0.99) of the points, according to 549 equation (2). The predicted breakpoints were the X and Y coordinates of the right-bottom vertex 550 of the square. The points in the square region were subjected to a statistical test describe below. 551 552

Quantification of type 1 evidence 553
Let n be the number of points in the square region. Each point corresponds to a pair of fragment 554 predicted breakpoint. Equation (1) and (2) hold for all the fragment pairs F i1 , F i2 (i = 1, 2, 3, …, 556 n). We then test the null hypothesis that there is no SV between b 1 and b 2 . 557 First, we test the hypothesis that the n fragment pairs F i1 , F i2 have originated from different DNA 558 molecules, but coincidently received the same barcode. Here we define two fragments F a and F b 559 as an independent fragment pair if F a and F b share the same barcode but have originated from 560 different DNA molecules. Thus, R(F a ) and L(F b ) are independent variables. All the fragment 561 pairs that do not support SVs are independent fragment pairs. It is reasonable to assume the 562 generation of HMW DNA molecules from chromosomal DNA is a random process thus both 563 R(F a ) and L(F b ) are uniformly distributed across the chromosome. Therefore, the point ((R(F a ), 564 L(F b )) is equal likely to be in any place in the two-dimensional plane. Technically, we connect 565 all the chromosomes in a head-to-tail order so that both intra-chromosomal events and inter-566 chromosomal can be analyzed at the same time. Observing at least n independent fragment pairs 567 meeting equation (2) is equivalent to the event that observing at least n points ((R(F i1 ), L(F i2 )) 568 located in a squared region with an area of G 2 on the two-dimensional plane. The probability of 569 this event is: 570 where Binomial_pmf is the probability mass function of binomial distribution; L is the total 572 length of the genome (also the side length of the two-dimensional plane); N ifp is the total number 573 of independent fragment pairs.
Since we are doing multiple hypothesis testing in the data set, the probability need to be adjusted. 575 We reject the hypothesis if p adjusted1 < p threshold . p threshold is 10 -5 by default. 577 Next, we test the hypothesis that fragment pairs F i1 , F i2 (i = 1, 2, 3, …, n) have originated from 578 the same DNA molecule, but no reads were sequenced in the gap between R(F i1 ) and L(F i2 ). Let 579 g i denote the length of the gap between F i1 and F i2 , g denote the mean of g i , and we have: If g is too large such that the probability of no reads being generated is smaller than a threshold, 583 we can reject this hypothesis. 584 Similar to the model described by 10X Genomics 12 , we assume the read generation on a DNA 585 molecule is a Poisson process with constant rate λ across the genome. Let r be the number of 586 reads generated in a region of length g, then r ~ Pois (λg). Let P gap (g) denote the probability of no 587 read being generated in length g, we have: 588 Therefore, the gap length g i follows Exponential distribution: g i ~ Exp (λ). Recalling that 1) the 590 Exponential distribution with rate parameter λ is a Gamma distribution with shape parameter 1 and rate parameter λ; 2) the sum of n independent random variables from Gamma (1, λ) is a 592 Gamma random variable from Gamma (n, λ), we have: 593 ~ Gamma (n, λ), (8) 594 n ~ Gamma (n, nλ), (9) 595 Therefore, the probability that observing n gap regions with mean length equal to or larger than g 596 is: 597 where Gamma_cdf is the cumulative distribution function of Gamma distribution. 599 Since we are doing multiple hypothesis testing in the data set, the probability need to be adjusted. 600 where N rp is the total number of read pairs. 602 We reject the hypothesis if p adjusted2 < p threshold . p threshold is set as 10 -5 by default. If both p adjusted1 603 and p adjusted2 are less than p threshold , we accept the hypothesis that the SV is true. For each 604 candidate SV, we report a confidence score for type 1 evidence as: 605

Breakpoint detection from type 2 evidence 608
Barcode similarity between two nearby regions is very high because the reads originate from 609 almost the same set of HMW DNA molecules. However, at the SV breakpoint, the aligned reads 610 from the left side and right side may have originated from different locations in the alternative 611 genome. Thus, the barcode similarity between the left side and right side of the breakpoint are 612 dramatically reduced (as described in the Result section and shown in Figure 1e-f). To detect this, 613 LinkedSV uses two adjacent sliding windows (twin windows, moving 100 bp) to scan the 614 genome and calculate the barcode similarity between the twin windows. The window length can 615 be specified by user. By default, it is G for WGS data sets and 40 kb for WES data sets. 616 The barcode similarity can be simply calculated as the fraction of shared barcodes. This method 617 is suitable for WGS, where the coverage is continuous and uniform. But it does not perform well 618 for WES, where the numbers of reads in the sliding windows vary a lot due to capture bias and 619 the length of capture regions. Therefore, we use a model that considering the variation of 620 sequencing depth and capture region positions. The barcode similarity is calculated as: 621 S = The confidence score of type 2 evidence is:

Combination of both types of evidence 643
Type 1 evidence gives pairs of endpoints that indicate two genomic positions are joined in the 644 alternative genome. Type 2 evidence gives genomic positions where the barcodes suddenly 645 changed, regardless of which genomic position can be joined. Therefore, type 1 and type 2 646 evidence are independent. The candidate breakpoints detected from type 2 evidence were 647 searched against the candidate breakpoint pairs detected from type 1 evidence so that the calls 648 were merged. The combined confidence score is: 649 Combined score = Confidence score 1 + Confidence score 2a + Confidence score 2b, (16)  650 where Confidence score 1 is the confidence score calculated from type 1 evidence (equation 12); 651 Confidence score 2a and Confidence score 2b are the confidence scores of the two breakpoints 652 calculated from type 2 evidence (equation 14). 653 654

Refining breakpoints using short-read information 655
For large SV events, we search for discordant read-pairs and clipped reads that are within 10 kb 656 to the predicted breakpoint pairs by the above approach. We use a graph-based approach that is the clipped reads that can be mapped to the both breakpoints, and the map direction matches the 659 SV type. If both discordant read-pairs and split-reads are found to support the SV, we use the 660 breakpoints inferred by split-reads as the final breakpoint position. 661 662 Detection of small deletions that are within 50 bp -10 kb 663 We use a 1Mb moving window (with 0.1 Mb overlapping) to scan the genome. For each window, 664 all the aligned reads (including phased and un-phased reads) were extracted and were assembled 665 by the FermiKit pipeline. Regions with extreme high coverage (more than 20-fold of average 666 coverage) were skipped. The resulting contigs were mapped back to the 1 Mb reference sequence 667 of the moving window using bwa-mem and deletions were called from the aligned contigs if the 668 alignments were unique within the 1 Mb moving window. The local assembly based process 669 mainly contribute to the detection of deletions within 50-1000 bp. To detect deletions that are 670 larger than 1 kb and might be missed by the assembly-based process, we use a 500 bp moving 671 window (with no overlapping) to find candidate regions where the read depth of either haplotype 672 is less than 10% of the average depth of the haplotype. Next we extract all the read pairs of this 673 haplotype and test if the mean insert size of these read pairs is significantly larger than the mean 674 value of the whole genome, assuming the average insert size of n read pairs follows normal 675 distribution: N(μ，σ 2 /n), where μ and σ are the mean and standard deviation of the insert size of 676 the whole genome.
We use a read depth based method to detect deletions that are larger than 1 kb and lack read pair 678 support. If there are m consecutive windows where the read depths are less than 10% of the 679 average depth, we assume the read depth of each window is independent, and calculate the p 680 value using the simple equation: p = (a/b) m , where b is the total number moving windows and a 681 is the total number of moving windows where the read depths are less than 10% of the average 682 depth. A deletion is called if p < 10 -10 . 683 684

Generation of simulated linked-read WGS data set 685
The linked reads were simulated by LRSIM, which can generate linked-reads from a given 686 FASTA file containing the genome sequences. We generated a diploid FASTA file based on 687 hg19 reference genome with SNPs and SVs inserted. The purpose of inserting SNPs was to 688 mimic real data. The generation of the diploid FASTA file is described below. First, we inserted 689 SNPs to hg19 using vcf2diploid 31 . The inserted SNPs were from the gold standard SNP call set 690 (v.3.3.2) of NA12878 genome 32 . The vcf2diploid software generated two FASTA files, each of 691 which was a pseudohaplotype (paternal or maternal) with the phased SNPs inserted. Next, we 692 insert SVs into the paternal FASTA file using our custom script. The breakpoints were located in 693 the repetitive regions in hg19 and the distance between the two breakpoints were in the range of 694 50 kb to 10 Mb. In total, we simulated 351 deletions, 386 duplications, 353 inversions and 85 695 translocations, all of which were in the paternal copy and were heterozygous SVs. We then concatenate the paternal and maternal FAST file into a single FASTA file and simulated linked-697 reads using LRSIM. To mimic real data, the barcode sequences and molecule length distribution 698 used for simulation were from the NA12878 whole-genome data set released by 10X Genomics. 699 The number of read pairs was set to 360 million so that a 35X coverage data set was generated. 700 The genome coordinates of simulated SVs was shown in Supplementary Data 1. The size 701 distribution of the simulated SVs was shown in Supplementary Figure 16a. 702 703

Generation of WGS data set with low VAF 704
In cancer samples or mosaic samples, the total DNA is a mix of a small portion of variant alleles 705 and a large portion of normal alleles. To simulate the WGS data sets with low variant 706 frequencies, we used the same paternal and maternal FASTA file described above but the 707 combined FASTA file contained multiple copies of the normal allele (the maternal FASTA) and 708 only one copy of the variant allele (the paternal FASTA). For example, to simulate a WGS data 709 set with VAF of 20%, four copies of the maternal FASTA and one copy of the paternal FASTA 710 were combined. The linked reads were simulated using LRSIM with the same parameters and a 711 35X coverage data set was generated. 712 To test the performance of LinkedSV on the detection of disease casual SVs, we downloaded a 715 list of expert-curated deletions and duplications that were known to cause CNV syndromes 716 involved in developmental disorders. This list was downloaded from the DECIPHER database, 717 and contained 67 CNV syndromes. Some syndromes were affected by CNV events in the same 718 region. After removing redundant syndromes, we got 51 CNV events (Supplementary Table 10). 719 Based on the 51 CNV events we simulated a germline WGS data set and two mosaic WGS data 720 sets (VAF = 10% and 20%) using the same method described above. 721 722

Generation of simulated linked-read WES data set 723
To generate the linked-read WES data set, we first generate a 100X linked-read WGS data set 724 and then down-sample it to be a WES data set. Generation of the simulated linked-read WGS 725 data set with SNPs and SVs inserted was similar to the method described above. In total, we 726 inserted 1160 heterozygous SVs. The SV breakpoints were randomly selected from regions that SVs. The number of inserted SVs in the simulated WES data set was slightly smaller than that in 731 the simulated WGS data set because the SV breakpoints were designed to reside within 2000 bp 732 of an exon. The simulated reads were generated using LRSIM and were mapped to hg19 reference genome using the Longranger pipeline (default settings). The phased bam generated by 734 Longranger was down-sampled to be a simulated WES data set. To mimic real WES data set, we 735 used the coverage distribution of the linked-read WES data set of NA12878 genome (released by 736 10X Genomics) to guide the down-sampling process. We bin the genome into 10 bp windows 737 and calculate number of reads mapped to each window (left mapping positions were used) in 738 NA12878 linked-read WES data. The simulated WES data set was generated by sampling reads 739 from the 100X WGS data according to number of reads mapped to the same 10 bp window in the 740