Building de novo reference genome assemblies of complex eukaryotic microorganisms from single nuclei

The advent of novel sequencing techniques has unraveled a tremendous diversity on Earth. Genomic data allow us to understand ecology and function of organisms that we would not otherwise know existed. However, major methodological challenges remain, in particular for multicellular organisms with large genomes. Arbuscular mycorrhizal (AM) fungi are important plant symbionts with cryptic and complex multicellular life cycles, thus representing a suitable model system for method development. Here, we report a novel method for large scale, unbiased nuclear sorting, sequencing, and de novo assembling of AM fungal genomes. After comparative analyses of three assembly workflows we discuss how sequence data from single nuclei can best be used for different downstream analyses such as phylogenomics and comparative genomics of single nuclei. Based on analysis of completeness, we conclude that comprehensive de novo genome assemblies can be produced from six to seven nuclei. The method is highly applicable for a broad range of taxa, and will greatly improve our ability to study multicellular eukaryotes with complex life cycles.

Supplementary information, 9 figures and 6 tables 8 9 Figure S1. Spores of C. claroideum/C. luteum (SA101) 10 Figure S2. Scatter plots from the FACS sorting during method development 11 Figure S3. Agarose gel of PCR products of particles in R1, R2-R4, R4 12 Figure S4. Scatter plot of particles from one single crushed spore 13 Figure S5. PCR scoring for fungi and bacteria in amplified single nuclei 14 Figure S6. Summary statistics for assembled of 1-24 nuclei in assembly workflows 1 15 Figure S7. Original gel image corresponding to figure S3 16 Figure S8. Original gel image corresponding to the upper part of figure S5 17 Figure S9. Original gel image corresponding to the lower part of figure S5 18 19  Table S1. Number of particles collected 20 Table S2. 96 well plate layout for sorting. 21 Table S3. Summary of PCR scoring for fungi and bacteria 22 Table S4. DNA concentrations of 24 sorted and amplified nuclei 23 Table S5. Single nuclei assembly statistics for the assembly workflow 1 and 2 24 Table S6. Presence of the single copy genes EF1 and RPB1 in the three assemblies 25 26 Developing the protocol of single nuclei extraction 27 Broken spores were observed in a microscope equipped with a fluorescent filter to 28 assess amount, release efficiency and size of the nuclei. Spores were placed on a slide 29 and the excess water was removed with a pipette. A small drop of Gold Slow Fade + 30 DAPI was added and the glass cover was placed pressing gently with the back of a 31 pencil, to crush open the spore without destroying it. Crushed spores were visualized 32 under fluorescent microscope. ( Figure S1, Original pictures in OSF Repository 44 ). 33 34 Figure S1. Spores of C. claroideum/C. luteum (SA101) at 20X (a) and 40X (b) seen in 35 the microscope after staining with DAPI in order to visualize the nuclei and assess their 36 size.

38
For the sorting of the nuclei, two tubes with 15 and 25 spores of the isolate respectively, 39 previously extracted and cleaned were crushed in 200 μl of ddH2O, and 200 μl of 1X 40 PBS was added together with 1.25 μl of 200X SYBR Green. After 37 min staining, 100 41 μl of 1X PBS was added to increase the volume before running the sample in the FACS.

42
FSC on the 488 nm laser was used as trigger, and a 530±20 nm (530/40) band pass filter 43 for fluorescence signal detection. The best resolution of distinctive populations of 44 particles in the scatter plot was obtained when plotting level of fluorescence against the 45 SSC representing granularity ( Figure S2). 46 Three regions were defined in the scatter plot: R1 (a distinctive line with strong  47 fluorescence), R4 (a distinctive community lower fluorescence and lower SSC values 48 than R1), and R2 minus R4 (an undifferentiated area between R1 and the background 49 fluorescence) ( Figure S2). Particles appearing in those areas were sorted 50 simultaneously into three different tubes (Table S1). Additionally, two tubes were filled 51 with particles from all the areas in each sorting round, to be used as a positive control. 52 53 Figure S2. Scatter plots from the FACS sorting of particles from 15 (a, b) or 25 (c, d) 54 crushed spores using Forward Scatter (FSC) as trigger. On the x axis, SSC, on the y 55 axis, the SYBR Green fluorescence. The low-fluorescence part of the scatter plots 56 represents background fluorescence in the spore content, that is either not stained or 57 very lightly stained. a) Scatter plot of from the first quick run identifying the areas to 58 sort from. b) Scatter plot of actual sorting into pools (R1, R2, R4) of particles from the 59 15 spores (Table S1). c) Scatter plot of sorting into pools (R1, R2, R4) of particles from 60 the 25 spores (Table S1). d) Scatter plot when sorting only the area R1 from 25 crushed 61 spores (Table S1).

63
DNA was extracted from each of the tubes using the Genomic DNA from Plant kit 64 (Macherey-Nagel, Germany) in order to obtain clean DNA that could be tested for 65 fungal presence. The protocol was adapted to the characteristics of the samples having 66 low amounts of material and no cell biological material to homogenize, and 67 encompased the following: step 1, 125 μl of buffer MC1 and 2.5 μl of RNAse A; step 68 2 was eliminated, but a short centrifuge of 30 s at 5800 xg was added to bring down the 69 evaporates; step 3, 7.5 μl magnetic beads and 100 μl MC2; step 4, 140 μl MC3; step 5, 70 140 μl MC4; step 6, 125 μl of 80% ethanol; step 7, 125 μl MC5; step 8, 30 μl MC6. 71 Nanodrop was used to determine the amount of DNA in the samples after the DNA 72 extraction (Table S1) we concluded that fungal nuclei were most frequently appearing in the region R1 of the 90 scatter plot, as these samples generally showed stronger amplification of the fungal 91 barcode region ( Figure S3). 92 93 Figure S3. Agarose gel showing the PCR products for particles collected from the 94 scatter plot regions R1, R2-R4, R4 as well as a pool of all regions as positive control 95 sorted from C. claroideum /C. luteum (SA101). Gel electrophoresis was done using a 96 1.5% agarose gel, run for 80 min at 70 V. Thermo Scientific GeneRuler 50 bp DNA 97 Ladder was used. For easier visualization, samples that belong to other samples have 98 been covered, no further modification was done, for original image see Figure S7.

100
Nuclei extraction and sorting 101 102 Figure S4. To the left, scatter plot of particles from one single crushed spore from which 103 the R1 region was sorted into wells. Two right panels, the particles that were sorted and 104 were further tested for fungi/bacteria. Every particle can be traced back from the plate 105 to the exact sorting event. Colors represent the obtained results observed in Figure S5:  From the sorted particles nuclei were selected for sequencing when the fungal barcode 116 region amplified but not the bacterial region ( Figure S5, Table S3). The DNA 117 concentration of the 24 selected nuclei was estimated using Qubit (Invitrogen, Austria) 118 (Table S4). 119 120 Figure S5. PCR amplification of sorted and MDA single nuclei for a) Fungi (positive 121 control Agaricus bisporus, negative control ddH2O) and b) Bacterial (positive control 122 Legionella, negative control ddH2O  133 Raw read quality control 134 In the OSF repository for the project 44 all quality control results for the raw reads are 135 deposited, as well as the qualitative and quantitative assessments of the assemblies. 136 The standard of Minimum Information about a Single Amplified Genome (MISAG) 137 developed by the Genomic Standards Consortium (GSC) 64 was used as a guideline to 138 assess the quality of the assemblies (Table S5). The highest quality obtained in this 139 study is a high-quality draft, considering that the final assembly still contains gaps 140 spanning repetitive regions. Exploring the effect of short contigs in assembly workflow 141 1 and found that contigs <2000bp tend to inflate genome size due to duplications 142 ( Figure S6, Table S6). 143 144 Table S5. Single nuclei assembly statistics for the assembly workflow 1 (raw reads 145 assembled with MaSuRCA 34 ) and 2 (normalized reads assembled with SPADES 35 ).

146
Assembly size (Mb), number of contigs, N50, size of largest contig and % of raw 147 reads covered by the individual assembly is presented for each nucleus (1-24).

149
Assembly workflow 1 (raw reads + MaSuRCA) Assembly workflow 2 (normalized reads + SPADES) 150 Figure S6. Summary statistics for different number of assembled nuclei (1-24) using 151 assembly workflows 1 based on raw reads of individual nuclei assembled using 152 Masurca, consensus assembly using Lingon. BUSCO estimates of completeness for a) 153 workflow 1 contigs >500bp, b) workflow 1 contigs >1000bp, and c) workflow 1 contigs 154 >2000bp. The later is the best option of the three, presented as method 1 in the main 155 text. Percentage of single copy core genes detected as single copy (S: grey), duplicated 156 (D: light grey) or fragmented (F: black). Average of 3-6 replicate assemblies up to 12 157 nuclei with error bars indicating SEM. In d) assembly size (dashed lines) and N50 (solid 158 lines) for the there methods 1 >2000 bp (black), >10000 bp (grey) and >500 bp (light 159 grey). 160 Supplementary figure x Summary statistics for different number of assembled nuclei (1-24) using assembly methods 1 based on raw reads of individual nuclei assembled using Masurca, consensus assembly using Lingon. BUSCO estimates of completeness for a) method 1 contigs >500bp, b) method 1 contigs >1000bp, and c) method 1 contigs >2000bp. The later is the best option of the three, presented as method 1 in the main text. Percentage of single copy core genes detected as single copy (S: grey), duplicated (D: light grey) or fragmented (F: black). Average of 3-6 replicate assemblies up to 12 nuclei with error bars indicating SEM. In d) assembly size (dashed lines) and N50 (solid lines) for the there methods 1 >2000 bp (black), >10000 bp (grey) and >500 bp (light grey).  Table S6. Presence of the single copy genes EF1 and RPB1 in the generated assemblies. 161 Present as single copy in both assembly methods 1 and 3, but not in assembly method 162 2. Contigs and regions where the gene is found are shown on the