From single nuclei to whole genome assemblies

A large proportion of Earth's biodiversity constitutes organisms that cannot be cultured, have cryptic life-cycles and/or live submerged within their substrates1–4. Genomic data are key to unravel both their identity and function5. The development of metagenomic methods6,7 and the advent of single cell sequencing8–10 have revolutionized the study of life and function of cryptic organisms by upending the need for large and pure biological material, and allowing generation of genomic data from complex or limited environmental samples. Genome assemblies from metagenomic data have so far been restricted to organisms with small genomes, such as bacteria11, archaea12 and certain eukaryotes13. On the other hand, single cell technologies have allowed the targeting of unicellular organisms, attaining a better resolution than metagenomics8,9,14–16, moreover, it has allowed the genomic study of cells from complex organisms one cell at a time17,18. However, single cell genomics are not easily applied to multicellular organisms formed by consortia of diverse taxa, and the generation of specific workflows for sequencing and data analysis is needed to expand genomic research to the entire tree of life, including sponges19, lichens3,20, intracellular parasites21,22, and plant endophytes23,24. Among the most important plant endophytes are the obligate mutualistic symbionts, arbuscular mycorrhizal (AM) fungi, that pose an additional challenge with their multinucleate coenocytic mycelia25. Here, the development of a novel single nuclei sequencing and assembly workflow is reported. This workflow allows, for the first time, the generation of reference genome assemblies from large scale, unbiased sorted, and sequenced AM fungal nuclei circumventing tedious, and often impossible, culturing efforts. This method opens infinite possibilities for studies of evolution and adaptation in these important plant symbionts and demonstrates that reference genomes can be generated from complex non-model organisms by isolating only a handful of their nuclei.

A large proportion of Earth's biodiversity constitutes organisms that cannot be 23 cultured, have cryptic life-cycles and/or live submerged within their substrates [1][2][3][4] . 24 Genomic data are key to unravel both their identity and function 5 . The development 25 of metagenomic methods 6,7 and the advent of single cell sequencing 8-10 have 26 revolutionized the study of life and function of cryptic organisms by upending the 27 need for large and pure biological material, and allowing generation of genomic data 28 from complex or limited environmental samples. Genome assemblies from 29 metagenomic data have so far been restricted to organisms with small genomes, such 30 as bacteria 11 , archaea 12 and certain eukaryotes 13 . On the other hand, single cell 31 technologies have allowed the targeting of unicellular organisms, attaining a better 32

Main text 51
AM fungi is a group of diverse obligate symbionts that have colonized root cells and 52 formed mycelial networks in soil since plants first colonized land [25][26][27] . Their entire 53 life-cycle is completed underground and they propagate with multinuclear asexual 54 spores 27 (Figure 1). Genomic research on AM fungi has been hampered by technical 55 challenges involving isolation and culturing, and accordingly, few species have been 56 successfully sequenced. To date, the reference genomes of only few species that can 57 be grown in axenic culture, i.e., Rhizophagus irregularis, R. clarus, R. diaphanus, R. 58 cerebriforme, Gigaspora rosea and Diversispora epigaea, have been published 28-33 . 59 representation of a spore, containing nuclei, lipid vesicles and endosymbiotic bacteria. 68 The hyphae have very reduced compartmentalization with incomplete septa and 69 nuclei appear to move freely. 70 A method was developed in which genomic fungal DNA can be obtained, free of 71 plant and microbial DNA, directly from individual nuclei of multinucleate spores. In 72 brief, spores from a trap culture fungal strain of Claroideoglomus clarodeium/C. 73 luteum (SA101) were obtained from the INVAM pot culture collection. An initial trial 74 to sort AM nuclei was carried out using pools of spores in order to assess the optimal 75 settings. Cleaned spores were crushed vigorously, and the solution was stained and 76 analyzed by Fluorescence-Activated Cell Sorting (FACS), recording level of 77 fluorescence as a measure of DNA content, and light scattering as proxy for size and 78 particle granularity (Figure 2 a-h). A distinct cloud of particles was observed above 79 the background in the scatter plot (Figure 2h, inside the blue box) which, after 80 microscopy and PCR verification tests with fungal and bacterial specific primers, was 81 confirmed to consist of intact biological structures containing mostly fungal DNA 82 (Figure S1-S3, Table S1). Hence, we concluded that these particles were fungal nuclei 83 and restricted future sorting to this window. Thereafter, individual nuclei from a 84 single spore of the same strain were sorted into wells of a 96-well plate ( Figure S4, 85 Table S2) and whole genome amplified (WGA) using multiple displacement 86 amplification (MDA; Figure 2 i-j). The amplified DNA was scored for pure fungal 87 origin by parallel amplification of rDNA barcode regions for both fungi and bacteria 88 Figure S5). Twenty-four amplified nuclei samples, confirmed to contain 89 only fungi ( Figure S4, Table S3, S4), were sequenced with Illumina HiSeq X ( Figure  90 2l). Further, the MinION Nanopore-based sequencing device (Oxford Nanopore 91 Technologies, ONT, UK) was used to obtain long read sequences for amplified DNA 92 from multiple (5-100) nuclei separated from a pool of 30 spores of the same strain 93 ( Figure 2 i-k, m). 94 Figure 2. From a soil sample to AM fungal genome assemblies. a) Whole inoculum 96 from the culture collection INVAM is blended with water and (b) poured into a set of 97 sieves, the material stuck in the 38 µm sieve is placed into a (c) tube that contains a 98 solution of 60% sucrose, then centrifuged for 1 min. The supernatant is again run 99 through a 38 µm sieve and washed with water. d) The sieve content is placed in a 100 Petri dish for the spores to be manually picked using a glass pipette. e) After cleaning 101 the spores with ddH 2 O, these are placed one-by-one into tubes and crushed with a 102 pestle. f) The DNA from a broken spore is stained with SYBR Green, giving a strong 103 fluorescent signal for the nuclei, and lighter for the background, organelles and (o). For assembly 3 reads from all nuclei are combined before normalization and then 119 assembled with SPADES 35 (q). Individual nuclei assemblies from method 1 and 2 are 120 assembled together using Lingon 36 (p). Nanopore data is assembled with Canu 37 (r), 121 polished with Pilon 38 using the Illumina raw-reads and used to scaffold the three 122 generated assemblies using Chromosemble, of Satsuma 39 (s). 123 124 Three customized assembly workflows were developed in order to evaluate assembly 125 quality in the light of coverage bias introduced by WGA, which is the biggest 126 challenge when assembling sequence data from amplified single nuclei. The MDA 127 method, however, has an advantage over PCR-based methods in that it produces 128 longer fragments of DNA with a lower error rate, and that the coverage bias is 129 For the first two assembly workflows individual nuclei assemblies were generated and 131 subsequently combined to generate a consensus assembly using the workflow 132 manager Lingon 36 (Figure 2p), which consists of a motif-distance based long 133 sequence overlap finder that merge sequences based on mutual maximal overlaps. In 134 the first assembly workflow, raw Illumina reads were assembled using MaSuRCA 34 135 ( Figure 2n) resulting in 24 assemblies, ranging in size from 14 to 69 Mbp (Tables S5). 136 In order to overcome MDA generated differences in coverage across the genome the 137 second assembly workflow normalized raw reads to maximum 100X before assembly 138 using SPADES 35 (Figure 2o), generating 24 assemblies ranging in size from 11 to 50 139 Mbp (Table S5). A third assembly was created using SPADES 35 after combining raw 140 reads from 24 nuclei followed by normalization to 100X (Figure 2q). One full 141 assembly with 24 nuclei was generated from each workflow and subsequently 142 scaffolded with a Nanopore assembly built with Canu 37 (Figure 2r combinations of two to twelve nuclei and one random combination for 13-23 nuclei. 147 BUSCO 42 , assembly size and N50 was used to compare these to full and single nuclei 148 assemblies. 149

Results 151
The different assembly workflows resulted in assemblies that vary in sizes, 152 fragmentation and completeness (Table 1). Based on BUSCO analyses, workflow 3 153 generates the most complete assembly, with 89% for assembly 3n, compared to 2n at 154 80%, and 1n at 78% (Table 1). Of the core single copy genes identified by BUSCO, 155 few were fragmented or duplicated in assembly 3n indicating that the set of 14,600 156 predicted genes is likely to be complete and a close representation of the genetic 157 diversity in this strain (Table 1). This number is lower than the number of genes 158 found in other sequenced AM fungi such as R. irregularis 28 and R. clarus 31 , also 159 lower than those predicted in assemblies 1n and 2n (Table 1). Interestingly, assembly 160 normalization is expected to disproportionally reduce high coverage genomic 166 sequences such as repeat elements, and collapsing those regions when assembling. 167 Note that this effect of normalization is eluded in assembly workflow 2, in which 168 nuclei are normalized and assembled individually; repetitive regions will collapse but 169 in different parts of the genome, ending up represented in the final assembly when 170 combined. In contrast, workflow 1 is based on non-normalized reads. Due to uneven 171 coverage, this workflow assembles less of the genome (an average of 55% of the raw 172 reads align to the individual nuclei assemblies, as opposed to 96% of the reads 173 mapping to the normalized individual nuclei assemblies (Table S5)) but generate 174 contigs well supported by high coverage. Combining these incomplete assemblies 175 from single nuclei using Lingon, generates an accurate assembly 1 comparable to 176 assembly 3 with a better representation of repeats (Table 1). 177 Combinations of increasing number (1-24) of random nuclei were produced for all the 178 assembly workflows in order to evaluate the number of nuclei needed to produce a 179 good final assembly. As shown in figure 3, single nuclei assemblies are most 180 complete when using normalized workflows (2 and 3), with an average of 40% 181 BUSCO estimated completeness. Interestingly, there is an increasing number of gene 182 duplications among the complete genes as more single nuclei assemblies are 183 combined for method 2 compared to method 1 (Figure 3a-b). Higher amount of gene 184 duplications was confirmed by locating known single copy genes in all assemblies 185 (Table S6). The duplications in workflow 2 are likely generated because read 186 normalization allows for assembly of regions with low coverage that are prone to 187 errors and prevents contigs from being properly assembled by the workflow manager 188    Summary statistics for different number of assembled nuclei (1-24) using three different assembly methods. BUSCO estimates of completeness for a) method 1: raw reads of individual nuclei assembled using Masurca, consensus assembly using Lingon b) method 2: normalised reads of individual nuclei assembled using SPADES, consensus assembly using Lingon and c) method 3: reads from individual nuclei are pooled and normalised before assembly with SPADES. Percentage of single copy core genes detected as single copy (S: grey), duplicated (D: light grey) or fragmented (F: black). Average of 3-6 replicate assemblies up to 12 nuclei with error bars indicating SEM. In d) assembly size (dashed lines) and N50 (solid lines) for the there methods 1 (black), 2 (grey) and 3 (light grey).

Nuclei extraction and sorting 246
After spore extraction from soil, individual spores were placed in 30 µl ddH2O in 1.5 247 ml Eppendorf tubes. One tube with 15 spores was used to establish the sorting 248 window. An amount of 50 µl 1x PBS was added to each tube before crushing the 249 spores using a sterile pestle. DNA was stained by adding 1 µl of 200x SYBR Green I envelope of 1 at 700 to 1200 events per second. Thus, if a particle fitting within the 259 sorting window passes by the laser together with another particle, these would be 260 discarded. Particles from region R1, assumed to be nuclei ( Figure S4), were sorted 261 individually into 96 well plates containing 1 µl 1x PBS/well, groups of 5 particles 262 were collected for positive control, and empty wells were kept as negative control 263 (Table S2). 264

Whole Genome Amplification 265
Sorted nuclei were lysed and neutralized followed by whole genome amplification 266 using Phi29 and MDA as described by Rinke et al., 2014 45 (Table S3). From the samples that scored positive for 306 presence of fungi, 24 undiluted samples were selected for sequencing and the DNA 307 amount was measured using Qubit (Brand, country) after addition of 30 µl ddH2O 308 (Table S4). much lower (100,000 and 106,000 reads respectively). 371

Computational analysis, assembly and annotation 372
The quality of the Illumina reads was assessed with FastQC 48 . Genome size 373 estimation was done for each paired raw-reads from individual nuclei with SGA-374 PreQC 49 . Contamination was assessed with Kraken 50 in some of the raw-reads. CG 375 content was computed using the NBIS-UtilityCode 51 toolbox. 376 Assembly workflow 1: Individual assemblies for each of the 24 nuclei was done using 377 MaSuRCA 34 using default options. The resulting assemblies were iteratively merged 378 using Lingon 36 , which computed overlaps based on the spacing of sequence motifs 379 (CATG, CTAG, GTAC, GATC, TATA, ATAT, and GC), and merged contigs based 380 on pairwise maximal extensions. Each motif was iterated over ten times. Three 381 versions of the assembly were generated when contigs smaller than <500, <1000 and 382 <2000 were removed from the individual assemblies prior to Lingon. 383 Assembly workflow 2: Each set of reads was normalized using bbnorm of BBMap 52 384 v. 38.08 with a target average depth of 100x. Normalized data were assembled 385 individually into 24 assemblies using SPADES 35 , and a consensus assembly was 386 generated with Lingon 36 , with the same sequence motifs as for assembly 1. 387 Assembly workflow 3: The 24 datasets were combined and normalized with bbnorm 388 of BBMap 52 v. 38.08 with a target average depth of 100x, and posteriorly assembled 389 using SPADES 35 . 390 Nanopore assembly: Nanopore reads were assembled using Canu 37 v.1.7-86da76b, 391 this specific beta version made it possible to assemble a difficult dataset like ours, 392 with highly uneven coverage across the genome. An assembly was created using 393 default settings together with the known information (genomeSize=117m -Nanopore-394 raw). The resulting individual assembly was polished with three rounds of Pilon 38 395 v.1.22 using the raw Illumina reads from the 24 nuclei mapped with Bowtie2 53 . 396 The contigs of the final assemblies from single nuclei were scaffolded with the 397 Nanopore assembly using Chromosemble from the Satsuma package 39 . 398 Comparative assembly analysis 400 to the low BUSCO statistics shown in the study 61 , and that could have negatively 426 affected the annotation quality. 427 Protein and gene names were assigned to the gene predictions using a BLASTx 55 428 v2.6.0 search of predicted mRNAs against the UniProt/Swiss-Prot 59 database with 429 default e-value parameters (1x10-5). The ANNotation Information Extractor, Annie 62 , 430 was used to extract BLAST matches and to reconcile them with the gene predictions. 431 432 Sequences, assemblies and annotation can be found in the BioProject: PRJNA528883 433