Designed hybrids facilitate efficient generation of high-resolution linkage maps

In sequencing eukaryotic genomes, linkage maps are indispensable for building scaffolds with which to assemble and/or to validate chromosomes. However, current approaches to construct linkage maps are limited by marker density and cost-effectiveness, especially for wild organisms. We have now devised a new strategy based on artificially generated hybrid organisms to acquire ultra high-density genomic markers at lower cost and build highly accurate linkage maps. Using this method, linkage maps and draft sequences for two species of pufferfish were obtained simultaneously. We anticipate that the method will accelerate genomic analysis of sexually reproducing organisms.

technologies remains challenging. Accordingly, methods such as Irys genome mapping (BioNano Genomics Inc.), Hi-C, Chicago library (Dovetail) and linked reads (10× Genomics) have been developed to generate longer scaffolds and improve assembly (Burton et al. 2013; Putnam et al. 2016; Bickhart et al. 2017; Weisenfeld et al. 2017). Nevertheless, genetic linkage analysis would likely remain essential to reconstruct most eukaryotic genomes, even if sequencing methods continue to advance. Indeed, linkage maps are used not only to assemble contigs/scaffolds, but also to compare and evaluate the assembled sequences.

Construction of conventional linkage maps based on microsatellite markers is generally costly and time-consuming, and the maps are of low resolution and genomic coverage due to the limited number of markers, usually several hundred to several thousand (Kai et al. 2011). Several new methods to construct linkage maps and assemble contigs/scaffolds have also been developed, including SNP arrays (Bodenes et al. 2016), RAD-seq (Arora et al. 2017), and genotyping-by-sequencing (Nunes et al. 2017). However, SNP arrays require SNP data and hundreds of arrays prepared in advance, and therefore may not be suitable for de novo genome analysis. In contrast tens

respectively. Because the size of the genome is doubled (782 Mb) in hybrids, coverage for each was 1.8 fold on average.

Reference genome assemblies
The N50 for the de novo T. rubripes genome that we obtained (de-novo-contigs-Tr) was 13.0 kb (Table 2A), which is 66 times shorter than that of the previous version of the T. rubripes genome (FUGU4), for which the N50 was 858 kb. Therefore, FUGU4 was used as a reference in further analysis. For T. stictonotus, the genome that we obtained (de-novo-contigs-Ts), with N50 of 15.9 kb, was used as a reference because no other genome sequence is available.

compared the assembly with FUGU5, the current version of the T. rubripes genome, which was constructed by ordering FUGU4 contigs/scaffolds according to a linkage map of 1,222 microsatellites. We regarded the FUGU4 plus de-novo-contigs-Ts as a haploid reference sequence for a hybrid individual. FUGU4 and the de-novo-contigs-Ts genome were 98.1 % homologous over aligned sequences throughout the genome. We then mapped each read of the hybrid individual to the FUGU4 plus de-novo-contigs-Ts genome using BWA mem (Li and Durbin 2010), and extracted those in which a continuous stretch of 90 bp or more (out of 100 bp) was mapped exclusively and uniquely. The average mapping rate after this process was 82.8 %. Subsequently, SNPs were called in GATK UnifiedGenotyper (McKenna et al. 2010). We note that SNPs were called even if present in only one read, a call that is generally discarded as a sequencing error in conventional analysis.

Extraction of SNP markers from T. rubripes
similar between the two species and thus excluded from linkage analysis, if even a single T. stictonotus read was mapped to T. rubripes (Fig. 2A). As a result, only 1,293,394 SNPs remained for further processing. To minimize potential errors, reads mapped with a depth more than 4 times the average depth were then removed.
Furthermore, we extracted SNPs confirmed to be heterozygous in the paternal genome, as every SNP found in paternal sequences in hybrid fry should also be found in the diploid genome of the father. After these steps, 606,110 markers were available for analysis. SNPs found in less than 10 % of samples were excluded, along with those for which minor alleles were found in less than 30 % of samples (major alleles found in more than 70 % of samples). SNPs heterozygous in more than 20 % of individuals were suspected to be multi-copy genes, and were also filtered out. Finally, SNPs that appeared to be a crossover breakpoint in a scaffold/contig in more than 10 % of individuals were regarded as noise and were excluded. Ultimately, 442,723 markers were used in linkage analysis, for an average density of one per 903 bp.

Because the problem of ordering and orienting scaffolds is known to be NP-hard (Pop et al. 2004; Tang et al. 2015), a heuristic approach was applied. Briefly, each hybrid fry was phased from the beginning of a scaffold, and phase changes in the middle of the scaffold were listed as break points. After end-to-end phase patterns were collected from 188 individuals, scaffolds were sorted in descending order of length. Scaffolds with the most similar phase patterns to FUGU4 were then connected to FUGU4 scaffolds at both ends. This process was repeated for the resulting connected scaffolds, until no test scaffold was found for which the similarity to a target scaffold was above the threshold value. If the phases at both ends of the newly connected scaffold were the same, the orientation of the scaffold could not be determined, and the scaffold was either filled with Ns (any of A, C, G, T bases) along its length, or was used without determining the orientation. Both approaches were used in this study.

Using 442,723 SNP markers, we phased all 188 T. 
rubripes genomes, as well as 4,513 of 7,213 scaffolds in FUGU4, such that phase was obtained for 95.7 % of all nucleotides in the genome. Orientation was determined for 834 scaffolds representing 74.8 % of nucleotides in the genome.
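
The two quantities on which the elongation heuristic relies, break points within a scaffold and the similarity of phase patterns between scaffold ends, can be illustrated with a minimal Python sketch. It is not the SELDLA implementation. The function names break_points and concordance, the 0/1 encoding of the two paternal haplotypes, the use of None for untyped markers, and the folding of the rate to max(r, 1 - r) are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# Assumed encoding: 0 or 1 for the two paternal haplotypes,
# None where low coverage left a marker (or scaffold end) untyped.
Phase = Optional[int]


def break_points(phases: Sequence[Phase]) -> List[int]:
    """Indices at which the phase switches along one scaffold in one individual.

    Untyped markers are skipped, so a switch is reported at the first typed
    marker whose phase differs from the previous typed marker.
    """
    switches: List[int] = []
    last: Phase = None
    for i, p in enumerate(phases):
        if p is None:
            continue
        if last is not None and p != last:
            switches.append(i)
        last = p
    return switches


def concordance(end_a: Sequence[Phase], end_b: Sequence[Phase]) -> float:
    """Folded fraction of individuals with matching phase at two scaffold ends.

    Each sequence holds one phase per individual. Individuals untyped at
    either end are ignored. Assumption: the 0/1 labels are arbitrary per
    scaffold, so the raw rate r is folded to max(r, 1 - r), giving about 0.5
    for unlinked ends and 1.0 for ends with no crossover between them.
    """
    pairs = [(a, b) for a, b in zip(end_a, end_b)
             if a is not None and b is not None]
    if not pairs:
        return 0.5  # assumption: no shared data is treated as uninformative
    r = sum(a == b for a, b in pairs) / len(pairs)
    return max(r, 1.0 - r)
```

Under these assumptions, a test scaffold would be joined to a target scaffold only when the concordance between the facing ends exceeds the chosen threshold, and an individual with an empty break_points list for a scaffold contributes the same phase at both of its ends.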

Comparison of FUGU5 and SELDLA-extended FUGU4
As listed in Table 2B, extension by SELDLA increased the N50 in FUGU4 by 17 fold, such that it was now longer than that of FUGU5, which was generated by extending FUGU4 using conventional linkage analysis. A Circos plot (Fig. 3) confirmed that the genomic organization in SELDLA-extended FUGU4 is consistent with that of FUGU5 over 22 chromosomes, except at one site. Dot plots comparing all scaffolds between FUGU5 and SELDLA-extended FUGU4 are shown in Fig. 4 and Supplementary Fig. 1, and indicate that alignment was mostly successful throughout the entire genome. In addition, we found that many small scaffolds/contigs unmapped in the former were aligned at both ends to chromosomes in the latter (Fig. 4 and Supplementary Fig. 1), as clearly shown in a zoomed-in view of chromosome 1, a representative example. Indeed, several small scaffolds/contigs were incorporated and ordered into chromosome 1 in SELDLA-extended FUGU4, but not in FUGU5. This result implies that microsatellite markers near telomeric regions were inadequate in FUGU5, and highlights the large difference in marker density between microsatellites in FUGU5 and non-selected SNPs in this study. Several genomic inversions and translocations were also observed, suggesting structural polymorphisms between T. rubripes individuals. We note that these may cause erroneous ordering in SELDLA-extended FUGU4, FUGU5, or both. Finally, comparison of physical and linkage distance (Supplementary Fig. 2) indicated that linkage distance tended to be longer around telomeres, consistent with a previous study (Kai et al. 2011).
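
For readers unfamiliar with the statistic used to compare these assemblies, N50 is commonly defined as the length L such that scaffolds of length L or longer together contain at least half of all assembled bases. The following Python sketch computes it under that definition. The function name n50 and the handling of an empty input are illustrative assumptions, and assembly tools may differ in how they treat ties and gaps.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold/contig lengths.

    Lengths are accumulated from longest to shortest, and the length at which
    the running total first reaches half of the total is returned.
    An empty input yields 0 (an assumption made for this sketch).
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

With scaffold lengths 2, 2, 2, 3, 3, 4, 8 and 8 (total 32), the two longest scaffolds already reach half of the total, so the function returns 8.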

Scaffold elongation of the T. stictonotus genome by SELDLA
We also attempted to extend T. stictonotus contigs by SELDLA. These contigs were obtained from a T. stictonotus female in a single round of paired-end sequencing on Illumina HiSeq2000, followed by assembly in CLC Genomics Workbench. The N50 for this genome was 15 kb. Although it is generally very difficult to perform linkage analysis using very short and numerous scaffolds/contigs, such analysis was possible for T. stictonotus due to the very high density of SNP markers.

was also determined for 1,097 scaffolds, covering 7.7 % of all nucleotides. The remaining 9,212 scaffolds/contigs were mapped to linkage groups, but in unknown orientations, mainly because there were no crossover points in the genome corresponding to short scaffolds/contigs in any of the 188 individuals. In some cases, this effect might be due to lack of real SNPs. Surprisingly, the maximum scaffold length in the SELDLA-extended genome was 15.9 Mbp, which is 68 % of the maximum scaffold length in FUGU5.

As T. rubripes and T. stictonotus belong to the same genus, the genome organization of both was expected to be similar. Indeed, a Circos plot of SELDLA-extended T. stictonotus and FUGU5 showed not only synteny, but also highly conserved sequence organization (Fig. 5). In addition, whole-genome dot plots (Fig. 6 and Supplementary

We initially speculated that whole-genome sequencing would be suitable to type SNPs in double haploid and/or haploid organisms because a single read would be sufficient. We note that SNPs are generally useful as polymorphic markers not only because they are easy to type, but also because they are very common. In addition, we can compensate for missing data at a site in one individual because the phase is continuous except around crossover points, of which there were, on average, 1.3 per chromosome 1 in T. rubripes and 3.6 per chromosome 1 in T. stictonotus. 
In contrast, double haploid and/or haploid organisms are unsuitable for typing microsatellite polymorphic markers by conventional methods because they contain only half of the information compared with diploids. Moreover, we speculated that if we can trace the origin of each read in an organism to either of its parents, linkage analysis for the father and mother can be performed independently. This process is essentially equivalent to simultaneous construction of linkage maps for two lines of double haploid/haploid organisms. Therefore, it should also be possible to generate linkage maps even from low-coverage sequencing data. In support of this strategy, we also developed a new analytical pipeline, SELDLA, to generate high-resolution linkage maps, and, in parallel, to extend scaffolds/contigs based on sparse, incomplete SNP data obtained from low-coverage genome data.

Following this approach, we mated two interfertile species, T. rubripes and T. stictonotus, to generate hybrid individuals with genes easily traceable to the parents. We succeeded in generating a new linkage map for T. rubripes from 442,723 SNPs in 188 hybrid individuals. These markers far outnumber the crossover points in the sperm of the father, which were estimated to number several thousand. Therefore, the resolution of the linkage map was limited not by the number of markers, but by the number of crossover sites. Indeed, the number of SNP markers used in this analysis is higher than that of any other combination of markers and typing methods. We also succeeded in generating a draft genome of higher quality than the current draft (FUGU5). For example, we mapped 95.7 % of scaffolds to the genome, of which 80 % were oriented, including those that were not previously located. In addition, we generated a linkage map for T. stictonotus for the first time, even with a very short N50 of 15 kb. 
In this case, we mapped 22.9 % of sequences to the genome. Furthermore, we show in Fig. 4 that the genome we obtained can be extended,

Supplementary Fig. 4), respectively.

Of note, the reference sequence for SNP mapping was generated de novo by only one round of paired sequencing. However, combining a series of mate-pair and/or paired-end sequence data will generate a much longer N50, and enable precise and easy location of most longer scaffolds/contigs, as was achieved in T. rubripes.

PacBio and Oxford Nanopore sequencing have recently been demonstrated to generate much longer contiguous sequences. Several new techniques to locate sequences in the genome, including 10× Genomics and Irys, have also been established.
These techniques also generate high-quality de novo genome sequences, and we anticipate that our strategy will prove very useful when combined with these new methods. More importantly, our method is based on genetic mapping, whereas others are based on physical mapping. We note that regardless of advances in physical methods, a genetic map would still be required to compare with and evaluate physical maps, as well as for various types of genetic analysis.

To save cost and labor, several genotyping methods based on a small number of polymorphic sites have been developed, including RAD-seq and genotyping-by-sequencing. If an inbred line is obtained by rearing over successive generations, the inbred individuals are essentially clones of the same haplotype, which can then be easily determined. However, this approach is limited to model organisms such as mice and zebrafish that can be inbred. In addition, there seems to be no actual example of this approach, because it is time-consuming. Similarly, generating double haploids/haploids is often difficult, and may also limit applicability. In contrast, our method is based on mating interfertile species, and therefore should be more widely applicable. In our experience, torafugu embryos before hatching already provide sufficient DNA for genotyping, suggesting that hatching is not even necessary, which would further broaden the mating possibilities.

Thus, our method is one of the most effective and promising approaches for obtaining genomes from organisms that produce many eggs or seeds in one generation, such as fish, mollusks, amphibians, insects, and plants. In addition, we may even be able to apply our method to viviparous organisms via in vitro fertilization and/or differentiation of induced pluripotent cells into germ cells. Accordingly, we are now applying or planning to apply our method to obtain genomes from various organisms.

T. rubripes and T. 
stictonotus adults were also sequenced to generate reference sequences with which to identify the source of each raw read in hybrid fry, and to call and map SNPs. Raw data were obtained from one round of sequencing on HiSeq 2500. For T. rubripes, the stored sperm was used as the source of paternal genomic DNA. For T. stictonotus, a second individual as described above was used as the source of a mother-equivalent genome. Reads were assembled in CLC Assembly Cell 5.0.3 (QIAGEN), with an insert size of 100-800 bp and default values for all other parameters.

SELDLA.
We developed SELDLA, a novel data processing pipeline to construct a linkage map from genomic hybrids. SELDLA is available at http://www.suikou.fs.a.u-tokyo.ac.jp/software/SELDLA/, and is illustrated in Fig. 2

We excluded regions in the combined genome to which reads from both T. stictonotus and T. rubripes were mapped. We also excluded regions to which reads were mapped with more than four times the average depth. Each contig/scaffold was then phased, and linkage was measured by the concordance rate of phases from all samples. In particular, the phases at two contigs/scaffolds were fully concordant when the contigs/scaffolds were completely linked, i.e., without crossover points, in 188 hybrids. In contrast, the concordance rate was close to 50 % for two contigs/scaffolds that were not linked. We note that satisfactory scaffold elongation was achieved by setting the concordance rate