Diversity and evolution of the emerging Pandoraviridae family

With DNA genomes reaching 2.5 Mb packed in particles of bacterium-like shape and dimension, the first two Acanthamoeba-infecting pandoraviruses remained up to now the most complex viruses since their discovery in 2013. Our isolation of three new strains from distant locations and environments is now used to perform the first comparative genomics analysis of the emerging worldwide-distributed Pandoraviridae family. Thorough annotation of the genomes combining transcriptomic, proteomic, and bioinformatic analyses reveals many non-coding transcripts and significantly reduces the former set of predicted protein-coding genes. Here we show that the pandoraviruses exhibit an open pan-genome, the enormous size of which is not adequately explained by gene duplications or horizontal transfers. As most of the strain-specific genes have no extant homolog and exhibit statistical features comparable to intergenic regions, we suggest that de novo gene creation could contribute to the evolution of the giant pandoravirus genomes.

For P. macleodensis, cell lysis was observed after several passaging of the culture medium on fresh cells. The culture medium was thus recovered and centrifuged 5 min at 500 x g to remove debris. The supernatant was centrifuged 5 min at 16,000 x g and the pellet resuspended in 100 µl PBS supplemented with antibiotics and used to infect fresh Acanthamoeba cells.
For P. quescus, both the supernatant and the resuspended pellet produced cell lysis. In both cases, the culture medium was recovered and centrifuged 5 min at 500 g. The supernatant was recovered and centrifuged 30 min at 14,000g and the pellet was resuspended in 100 µl PBS supplemented with antibiotics which were then used to infected fresh Acanthamoeba cells. As for P. neocaledonia, and P. macleodensis, visible particles resembling pandoraviruses were visible in the culture media after cell lysis.

Virus cloning and purification
For each newly isolated pandoravirus, we cloned it 1 and amplified the clone prior purification, DNA extraction, proteome analysis, and cell cycle characterization by transcriptomic and electron microscopy.
Pandoraviruses were purified after cell lysis by centrifuging the culture media for 5 min at 500 × g to remove the cellular debris and the virus was pelleted by centrifugation at 6,800 × g for 45 min. The viral pellet was then resuspended, washed twice in PBS buffer, layered on a discontinuous sucrose gradient [50%/60%/70%/80% (wt/vol)], and centrifuged at 5,000 × g for 45 min. The viral layer was recovered, washed twice in PBS buffer, and stored at 4 °C.

Synchronous infections for TEM observations
For each time point, 3.5x10 7 A. castellanii -adherent cells in 40 ml culture medium were infected with each pandoravirus with a MOI ranging from 50 to 100 for synchronization. After 1h of infection at 32°C, cells were washed 3 times with 30mL of PPYG to eliminate the excess of viruses. For each infection time (every hour from 1h to 15h pi), 2.5 ml were recovered for inclusions and the rest of the cells were centrifuged for 10min at 5,000 x g. The collected 2.5 ml A. castellanii cells were fixed by adding an equal volume of PBS with 5% glutaraldehyde and incubating 1 h at room temperature. Cells are than centrifuged 10 min at 5,000 x g and the cell pellets were resuspended in 1 ml PBS with 2.5% glutaraldehyde, incubated at least 1 h at 4 °C, and washed twice in PBS buffer prior coating in agarose and embedding in Epon resin. Each pellet was mixed with 2% low melting agarose and centrifuged to obtain small flanges of approximately 1 mm 3 containing the sample coated with agarose. These samples were then prepared using the OTO (osmiumthiocarbohydrazide-osmium) method: 1h fixation in 2% osmium tetroxide with 1.5% potassium ferrocyanide, 20 minutes in 1% thiocarbohydrazide, 30 minutes in 2% osmium tetroxide, overnight incubation in 1% uranyl acetate, 30 minutes in lead aspartate, dehydration in increasing ethanol concentrations (50%, 70%, 90%, and 100% ethanol), and embedding in Epon-812. Ultrathin sections of 90 nm were observed using a FEI Tecnai G2 operating at 200 kV. The infectious cycle of the different pandoravirus strains were then scrutinized for potential differences.

mRNA preparation
A fraction of the different time points of infected cells used for the TEM study were pooled to retrieve the poly(A) + fraction of RNA and access the information on pandoravirus genes structure. RNA were extracted from the A. castellanii cells using the RNeasy Midi kit (catalog no. 75144; Qiagen) using the manufacturer's protocol. Briefly, the cell pellets were resuspended in 4 mL of RLT buffer supplemented with 0.1% β-mercaptoethanol and disrupted by subsequent −80 °C freezing and thawing at 37 °C for 10 min. For each pandoravirus infection, the total RNA were eluted with two successive additions of ∼200 µL of RNase-free water and quantified on the nanodrop spectrophotometer (Thermo Scientific). Two successive poly(A) enrichments were performed (Life Technologies, Dynabeads oligodT 25 ) in order to completely remove ribosomal RNA.

DNA extraction for sequencing
DNA was extracted from 2 × 109 purified particles. They were first centrifuged 3 min at 16,000 x g and the pellet was resuspended in 300 µl ultra-pure water to which 500 µl of CTAB buffer (20 g/L CTAB, 1,4 M NaCl, 100 mM Tris-HCl, 20 mM Na2EDTA, pH 8,0), 20 mg/mL Proteinase K and 3 μL de DTT 1 M were added. The sample was incubated 1 and half hour at 65°C and an additional 10 min after adding 20 mg/mL RNase A. 500 μL chloroform were then added and the sample was vortexed 30 sec prior centrifugation10 min at 16,000 g. The upper aqueous layer was recovered and centrifuged 5 min at 16,000 x g at 4°C. The supernatant is then mixed with an equal volume of chloroform, vortexed and centrifuged 5 min at 16,000 x g at 4°C. The aqueous layer was again recovered and mixed with 2 volumes of precipitation buffer (5 g/L CTAB, 40 mM NaCl, pH 8.0) and gently mixed. The sample was left at room temperature 1 h prior 5 min centrifugation at 16,000 g. The pellet was gently dissolved in 350 µl 1.2 M NaCl, 350 µl chloroform were added, the solution was vortexed few sec and centrifuged 5 min at 16,000 g at 4°C. The aqueous layer was again recovered and 0.6 volumes isopropanol were added and gently mixed until the DNA precipitated. The sample was centrifuged 10 min at room temperature at 16,000 g and the pellet was gently washed with 500 µl 70% ethanol. After cenfrifugation 10 min at room temperature at 16,000 x g, the pellet was left to dry prior being resuspended in 20 µl ultra-pure water.

Proteomic analysis
For P. salinus, P. dulcis, P. neocaledonia and P. quercus the virions produced from each clone were purified on Cesium chloride gradient. Proteins were extracted using a standard protocol 2 starting from 10 9 virions. Two biological replicates were produced for P. dulcis starting with two independent productions. For the 2 other pandoraviruses, independant triplicates were produced starting from the same pandoravirus production. Characterization of the different clones and technical replicates were performed for P. dulcis. Peptides and proteins were identified and quantified as previously described 1 using MaxQuant software 3 (version 1.5.5.1) in independent runs for each virus. Spectra were searched against the corresponding pandoravirus protein sequences as well as A. castellanii protein sequences, and the frequently observed contaminants database embedded in MaxQuant. Trypsin was chosen as the enzyme and two missed cleavages were allowed. Peptide modifications selected during the searches were carbamidomethylation (C, fixed), acetyl (protein Nter, variable), and oxidation (M, variable). Minimum peptide length was set to 7 aa. The 'match between run' option was activated. Maximum false discovery rates were set to 0.01 at peptide and protein levels. Intensity-based absolute quantification (iBAQ) 4 values were calculated from MS intensities of 'unique+razor' peptides. For each pandoravirus, iBAQ values were column-wise normalized and the protein ranking in particle was deducted from to the sum of the normalized iBAQ values per protein. For characterization of particle protein contents, only proteins identified by MS/MS in at least 2 replicates and quantified in all three replicates were taken into account for further analyses.

Genome sequencing and assembly
Pandoravirus neocaledonia genome was sequenced using 2 SMRT cells of the Pacbio sequencing technology. Sequencing reads were pre-assembled using the HGAP workflow 5 from the SMRT analysis framework version 2.3.0 with default parameters, resulting in 38,592 corrected reads. The same workflow was then used to perform the final polished assembly resulting in a contig of 2,003,191 nt. Read coverage was uniform except for a region of 10 kb at one extremity of the genome showing a large increase in coverage (60x vs 1950x). This is characteristic of terminal repeats as was showed in previously published pandoraviruses' genomes 2,6 .
The same approach was used to sequence the Pandoravirus quercus genome with 1 SMRT cell resulting in 72,468 corrected reads assembled in a single contig of 2,077,288 nt. Again, while read coverage was uniform in most of the genome, a peak was found in a 40 kb region at one extremity (400x vs 650x read coverage).
Pandoravirus macleodensis was sequenced using the Illumina MiSeq technology with large insert (5-8 kb) mate pair sequences. After read filtration for low quality sequenced bases and adapter cleaning, the dataset contained a total of 3,492,652 mate pair reads of 122 nt on average. The SPAdes assembler version 3.8.0 7 was used with the defaults parameters except for the "careful" option and the following set of kmers: 21, 41, 61, 81, 91, 101, 121, 127. A single contig of 1,838,258 nt was obtained. Read coverage was uniform except for a 30kb-long region at the end of the contig showing a higher coverage (220x vs 460x).

Stringent genome sequence annotation
For the 4 viruses for which we had transcriptomic and proteomic data available (i.e. P. salinus, P. dulcis, P. quercus and P. neocaledonia; see Table 1) we sought to stringently predict protein-coding genes. As summarized in Supplementary Fig. 2 the pipeline takes into account RNA-seq data, Mass spectroscopy data of the particles as well as protein conservation data.
RNA-seq transcriptomic data from pooled time-course infection data points were exploited as follows. We first mapped the stranded RNA-seq reads to the viral reference genomes using Tophat version 2.0.12 12 with the following parameters: i=20, I=1500, b2-verysensitive and no-discordant. We then performed the assembly of the RNA-seq reads into transcripts using Trinity 13 with the RNA-seq reads alignments (ie "genome-guided"). The following options were used: jaccard_clip, trimmomatic and genome_guided_max_intron=1500. Finally assembled transcripts were mapped to the reference genome using the PASA pipeline 14 .
Peptides from the proteomic data of the virions particles were identified using MaxQuant 3 on a very permissive definition of ORFs. The reference ORF database was made using EMBOSS getorf allowing overlapping ORFs on the six frames. Once identified, significant peptides were mapped to the genomic coordinates using the CDSmapper script 15 .
Protein conservation among the pandoraviruses was exploited as follows. We performed a protein clustering of the proteins predicted ab initio by GenemarkS 8 in all 6 Pandoravirus genomes using OrthoMCL 16 with default parameters. Excluding singletons we then aligned the resulting protein clusters with mafft-linsi 17 and built an HMM profile for each alignment using Hmmer 18 . Finally HMM profiles were aligned to the previously described permissive ORF database (based on EMBOSS getorf, see previous paragraph) using Hmmer. Matching regions were mapped back to the genomic coordinates using the CDSmapper script 15 .
Subsequently, transcriptomic data, proteomic data and protein conservation data were combined in order to choose the best gene models using EvidenceModeler (EVM) 19 . The following weights were used to score the different gene models from the gene finders: Braker prediction=5, GenemarkS prediction=3, GenemarkS-T=5, EMBOSS getorf=1, and the external evidences: virion peptide=25, HMM match=2 and transcript alignment=10. After visual inspection of the gene models picked up by EVM we realized that a handful of probably genuine protein-coding genes were missed. We thus added potential proteincoding genes as long as: i) they showed transcription evidence (aligned transcript) and ii) an ORF >50 amino acids was predicted in the cognate strand. Finally, untranslated regions (UTRs) of predicted protein coding genes were defined using PASA 14 and the transcripts alignment.
For the two genomes for which we had no transcriptomic and proteomic data (i.e. P. inopinatum and P. macleodensis; see Table 1), we first trained an Augustus 20 model for gene prediction based on the stringently predicted genes in the other pandoraviruses (see above). We then predicted genes using Augustus 20 , GenemarkS 8 and EMBOSS getorf 9 . EVM 19 was then used to choose the best gene models with the following weights: Augustus prediction=5, GenemarkS prediction=3, EMBOSS getorf=1, and the external evidence of protein conservation HMM match=2.
Functional annotation of protein-coding genes was performed using a combination of sequence similarity searches (Blastp 21 evalue < 1e-5) and protein motifs detections (Interproscan 22 ).
Non-coding RNA (ncRNA) genes were predicted for the viruses for which we had transcriptomic data (see Table 1). We defined ncRNAs as genomic region that exhibit transcription (ie transcript alignment) and no predicted ORF in the cognate strand. In order to separate potential transcriptional noise from genuine ncRNA transcripts, we defined a transcript as ncRNA only if the RNA-seq signal was of high enough. The threshold for RNAseq signal was defined as follows: we collected the RNA-seq signal corresponding to the genomic regions of the peptides identified from the proteomic data of the virions. We then used a Gaussian mixture model to separate this RNA-seq signal from the one corresponding to the opposite strand (considered as potential transcriptional noise). Candidate ncRNA transcripts were finally classified based on this model. tRNAs were predicted using tRNAscan-SE 23 with default parameters.

Protein clustering
Clustering of pandoravirus protein-coding genes was performed using BlastP 21 with default parameters and Evalue < 1e-5, Bit-score>50 and filtering=F, followed by an MCL clustering 24 with an inflation parameter=1.5. A similar clustering was also applied on individual genomes of several giant viruses.

Genome rearrangements
Genomic rearrangements were estimated using Mauve 25 with default parameters in combination with the R package GenoPlotR 26 .
Syntenic regions between pandoraviruses were also identified using the MScanX toolkit 27 .

Codon Adaptation Index
Codon adaptation index was computed using the "cai" tool from the EMBOSS package 9 . Reference codon usage was calculated on the most expressed protein coding genes in A. castellanii (see Supplementary dataset S3).

Phylogenetic analyses
Phylogenetic trees were computed for each protein-coding gene cluster (aligned with Mcoffee 28 ) using the protocol described in 29 . Orthologous gene pairs were defined using the Species Overlap algorithm from the ETE toolkit 30 .
Estimation of synonymous (dS), nonsynonymous (dN) substitution rates and dN/dS ratios were calculated using the YN00 method from the PAML package 31 . In order to avoid saturation and outliers we only considered dN/dS if the number of substitutions per synonymous site < 2, the number of nonsynonymous substitutions > 1 and the number of substitutions > 1.
The phylogenetic tree of the pandoraviruses was computed from the codon super alignment of the 1:1 orthologous genes present in all viruses using PhyML 32 with the JC69 substitution model. Estimation of the dN/dS ratios of each branch was computed using the CodeML method from the PAML package 31 through the ETE toolkit 30 . The dN/dS ratio from the two branches (see Fig. 3) were found to be significantly different (pvalue< 1e-5 between models M0 and b_free).
The cladistic tree was computed using the Neighbour-Joining method from the presence/absence matrix of 5375 clusters derived from an OrthoMCL 16 clustering. The distance was the one used in 33 . Support values were estimated using bootstrap resampling (n = 10000). Besides the pandoravirus genomes, the following genomic sequences were included in the analysis: A maximum likelihood phylogenetic tree based on the DNA polymerase B sequence contained in each of these genomes is shown in Supplementary Fig. 14. A protein sequence multiple alignment was performed with MAFFT using default parameters 17 . We then used Prottest to estimate the best substitution model which was defined as the LG+I+G model. Finally PhyML 32 was used with the following parameters to compute the final phylogenetic tree: -c 4 -m LG -v e -a e -o lr -f d -d aa.
The analysis of potential horizontal gene transfer was performed on P. salinus genes that exhibit a significant BlastP match (Evalue<1e-5) against the NR database. Significant matches of each query were aligned using Mcoffee 28 , trees were computed using PhyML 32 and analysed manually to identify HGT direction when possible. Supplementary Fig. 1. Nuclei in infected A. castellanii cells. Late stage of P. neocaledonia (ab) (scale bar is 1 µm) or P. salinus (c) infectious cycle (scale bar is 2 µm). Maturing particles (arrows) co-exist with a nucleus-like compartment limited by a membrane. Mature particles in vacuoles are also visible (arrow-heads). Supplementary Fig. 2. Diagram of the stringent gene annotation pipeline. Ab initio gene prediction (blue), transcriptomic data (red), proteomic data (green) and protein conservation data (purple) are combined to accurately predict protein-coding genes. Supplementary Fig. 4. Taxonomic distribution of protein-coding genes best matches for the various pandoravirus strains. Searches were done using BlastP (Evalue < 1e-5) against the NR database excluding hits corresponding to the pandoraviruses.  Supplementary Fig. 5. Protein-coding and lncRNA gene mapping using RNA-seq data. The genomic region of Pandoravirus neocaledonia between 882,000 and 909,000 is displayed. Protein-coding genes are shown in the green panel. The coding region is colored in red for genes encoded in the forward strand and in blue for genes encoded in the reverse strand. UTRs are shown in gray. Non-coding RNAs from the forward strand (in pink) and reverse strand (in light blue) are shown in the purple panel. Strand-specific RNA-seq read density from the forward (red) and reverse strand (blue) is shown in the orange panel.