Empirical validation of viral quasispecies assembly algorithms: state-of-the-art and challenges

Next generation sequencing (NGS) is superseding Sanger technology for analysing intra-host viral populations, in terms of genome length and resolution. We introduce two new empirical validation data sets and test the available viral population assembly software. Two intra-host viral population ‘quasispecies’ samples (type-1 human immunodeficiency and hepatitis C virus) were Sanger-sequenced, and plasmid clone mixtures at controlled proportions were shotgun-sequenced using Roche's 454 sequencing platform. The performance of different assemblers was compared in terms of phylogenetic clustering and recombination with the Sanger clones. Phylogenetic clustering showed that all assemblers captured a proportion of the most divergent lineages, but none were able to provide a high precision/recall tradeoff. Estimated variant frequencies mildly correlated with the original. Given the limitations of currently available algorithms identified by our empirical validation, the development and exploitation of additional data sets is needed, in order to establish an efficient framework for viral population reconstruction using NGS.

N ext generation sequencing (NGS) is becoming an established experimental framework for genomics and clinical diagnostics 1 . NGS use covers a broad spectrum of applications, including de novo sequencing, resequencing, and metagenomics 2 . NGS has been employed successfully to characterize highly variable pathogens in different contexts [3][4][5][6][7][8][9][10] . The combination of huge coverage, deep resolution, primer-or shotgun-based sequencing, along with the possibility to analyze multiple samples in a single run, at a reasonable cost, makes NGS a valuable resource for diagnostic purposes 11,12 .
Several challenges have arisen concerning the raw NGS data processing, namely mapping to an appropriate reference sequence, error detection/correction 13,14 , single nucleotide variants (SNV) identification, and genome assembly methods 15 . In addition to sample processing and library design, problems for computational assembly and SNV detection of long haploid/diploid genomes (several thousands of kilo-bases and above) include genome coverage, repeated regions, and genome rearrangements. Distinct complexities arise with viral populations or socalled 'quasispecies' 16,17 . Quasispecies are composed of different proportions of distinct variants, some at very low frequency, that are often subject to recombination. One supposes here that the experimental procedures, including polymerase chain reaction amplification, minimize the chance to generate in vitro recombinants. When processing a quasispecies with NGS, it is vital to correctly recognize minority SNV which can be close to the sequencing error rate. Another challenge is to correctly assemble whole-gene or whole-genome variants, along with their proportions, without generating spurious in silico recombinants.
The current approach for analyzing a whole-gene or whole-genome quasispecies relies on clonal Sanger sequencing. While clonal sequencing has the advantage of an almost perfect variant reconstruction, given an appropriate primer design, the approach is not readily automatable, and cumbersome in terms of bench-work. In principle, NGS will overcome Sanger's limitations. A key step will be when available SNV detection and assembly algorithms are sufficiently reliable. Approaches for SNV analysis that can identify minority variants have been devised 8,[18][19][20] . Several methodologies for quasispecies assembly, all based on reference mapping and overlap-graph paradigm, have been also introduced 21-30 , along with software implementations, tested both on simulated and quasi-empirical data, such as pooled inter-host samples 18 or artificially created mixtures 19 . Performance of quasispecies reconstruction algorithms depend on many biological factors: the efficacy of error correction methods, overlap lengths, assembly heuristics, including the quasispecies diversity and of the extent of heterogeneity. It is of vital importance to assess the efficacy of quasispecies assembly methods with real data sets in order to test their use for diagnostics in a clinical context. This work introduces two new empirical validation data sets and tests quasispecies assembly algorithms. From the plasma samples of two patients infected with type-1 human immunodeficiency or hepatitis C viruses (HIV-1, HCV), using Sanger sequencing, plasmids isolated from transformed bacterial cultures were mixed at controlled proportions, and then the mixtures were shotgun-sequenced with the Roche's 454 sequencing platform. Performance of four different assembly algorithms, including QuRe, PredictHaplo, Geneious TM de novo, and ShoRAH (see Beerenwinkel et al. 26 for a comprehensive review) were evaluated in terms of phylogenetic clustering with the original Sanger clones, precision/recall indices, diversity and recombination analysis.
The Roche's 454 platform (GS FLX Titanium XLR70) yielded 156,655 reads for the HCV and 153,670 for the HIV-1 plasmid mixture, sequenced on two separate lanes. The average (st.dev.) read length was 363 (109) for HCV and 346 (98) for HIV. Using QuRe, 27,912 (17.82%) and 53,930 (35.09%) reads were successfully mapped (and trimmed) to the HCV and HIV-1 reference sequences, respectively. The length ratios between the sequenced regions and the plasmid were , 16% for HCV and , 39% for HIV-1. The average (st.dev.) coverage of each mapped base was 9,553 (1,881) for HCV and 9,565 (4,216) for HIV. All the reads were also mapped to the plasmid reference sequence and a random subset of 70,000 reads was retained from both experiments to assess sequencing error rates.
Error rate assessment and correction. Out of the two 70,000 random read sets selected from both experiments, 49,866 (71.24%) of the HCV and 62,303 (89.00%) of the HIV-1 mixture were retained when mapped (and trimmed) to the plasmid reference. Note, the observed differences in plasmid mapping and sequence recovery are because two different insertion points were used, whilst the same linear plasmid reference sequence was used for both mappings.
Using the error correction method embedded in QuRe, the overall error was decreased by 85.29% (6.80 fold) in the HCV sample and by 69.77% (3.54 fold) in the HIV-1 sample. With ShoRAH, the decrease was 85.96% (7.12 fold) for HCV and 83.95% (6.66 fold) for HIV-1. Table 1 reports error rates for the two data sets and correction methods, overall and stratified by error type (insertions, deletions, mismatches and homopolymeric regions).
Quasispecies assembly and evaluation of reconstruction performance. QuRe, PredictHaplo, ShoRAH and Geneious TM de novo were applied to both NGS data sets, using the corresponding HIV-1/HCV reference sequences. QuRe used the homopolymeric and nonhomopolymeric error rates as estimated from mapping the reads to the plasmid reference (default parameters for the rest). The other programs were run with default parameters.
QuRe, PredictHaplo, ShoRAH and Geneious TM de novo yielded 9, 4, 200, and 81 distinct variants respectively for the HCV data set, and 9, 4, 1,247, and 159 variants for HIV-1 data set. Tables 2 and 3 report precision/recall performance of each reconstruction algorithm, population characteristics and recombination tests. QuRe and PredictHaplo exhibited the highest precision (between 0.55 and 1.00), whilst ShoRAH had the highest recall (0.3 to 0.5). Reconstructed variants were classified as 'correct' when clustering with original Sanger clones in a phylogenetic tree at $ 75% bootstrap support (see Methods). Of note, when inferring a phylogeny from the sole Sanger data set, the proportion of highly-supported nodes (.75% bootstrap) was 43% and 50%, for HCV and HIV-1, respectively (Figure 1, upper panels). Therefore placement for some variants in the tree was ambiguous (i.e. some variants were probably too similar to distinguish).
Detection of recombination, using the pairwise homoplasy index test (PHI-test), was marginally significant in the HCV Sanger data set (P 5 0.043), and highly significant in the HIV-1 data set (P , 0.0001). Recombination networks are shown in Figure 1 (lower panels). QuRe and PredictHaplo yielded the lowest number of in silico recombinants, showing a moderate decrease of the PHI-test P-value. In terms of variant frequency estimation, all methods correlated significantly with the original distributions, with Pearson's linear correlation ranging from 0.3 to 0.6 (P , 0.0001).    Figure 3. Figure 4 shows trees for the two data sets (left and right panels for HCV and HIV-1, respectively) inferred by combining all variants from all algorithms and original Sanger sequences. Although reconstruction performance was not optimal in most cases, the same -overall-population structure can be found across all trees.

Discussion
This work evaluated the performance of viral quasispecies assembly algorithms in an empirical setting. Comparisons were performed using both an HCV and HIV-1 data sets, sequencing plasmid clones with Sanger technology and mixing them at known proportions. Plasmid mixtures were shotgun-sequenced using Roche's 454 platform.
The estimated error rates were similar in the HIV-1 and HCV samples. The overall error prevalence was consistent with previous estimates 10,13,14 , as well as the higher prevalence of errors in homopolymeric regions, and the higher prevalence of insertions and deletions as compared to mismatches (although the ratio varied across previous studies). ShoRAH provided a better error reduction as compared to QuRe. Note that the error correction was evaluated only in terms of removing mismatches or indels from the plasmid reference sequence, and not on the ability to distinguish true signal from false signal, especially in presence of minority changes. As a consequence, error correction should be evaluated also on data sets where true mutations exist and assess precision/recall rates at different frequency thresholds. Additionally, the read filtering, mapping, and error counting was carried out with QuRe, which potentially gives QuRe an advantage in the error correction phase. This is because QuRe performed error correction after mapping, while ShoRAH corrected reads before the mapping. Then ShoRAH reads were re-mapped with QuRe (not enabling its error correction module) in order to obtain comparable output.
All of the quasispecies assembly methods published to-date have performed robustly in simulations and empirical experiments 18,23,31,32 . This study is the first to consider actual intra-patient quasispecies, sequenced with Sanger, mixed at controlled proportions and then processed using NGS. Previously, validation on real data had been performed using inter-patient samples 18,23 . A recent work also used a controlled proportion approach to assess performance of error correction algorithms, but an artificial quasispecies was generated 19 . Here, both for HCV and HIV-1 data sets, all assemblers were able to capture a proportion of the most divergent lineages as indicated by the phylogenetic analysis. PredictHaplo was the most conservative algorithm, with the least number of variants reported, while ShoRAH yielded the highest number. QuRe and PredictHaplo exhibited the highest precision, whilst ShoRAH had the highest recall. QuRe showed the best precision/recall tradeoff. Estimated variant frequencies mildly correlated with the original values. In silico recombination was more pronounced in the output by ShoRAH and Geneious TM de novo. Geneious TM de novo was the least effective assembler in terms of precision and recall, but when considering only long contiguous sequences its performance increased substantially. Note that originally Geneious TM de novo has not been designed to assemble a quasispecies. In regards to the parameter set-up of assemblers, PredictHaplo did not require complex tuning and results were similar across different runs. ShoRAH may have benefited from a parameter optimisation (especially the sliding window sizes, but also number of iterations and the hyper-parameter a), but the search space would have required a considerable computational burden. QuRe required only a few parameters (error rates, estimated from the plasmid data, and number of iterations) and the rest was auto-optimised during the process. For all variants reconstructed by the algorithms, the cutoff for clustering with a Sanger clone was set to 75%, without a fixed threshold on nucleotide differences. Results may change when the cutoff or distance criteria are altered.
To determine the performance of reconstruction algorithms, one has to evaluate their output in different contexts. Performance of programs can be also subject to error patterns proper of NGS machinery. Besides the experimental yield, i.e. read length and coverage, factors to consider include: the number of variants in the population, the variant prevalence, the average diversity, the rate of heterogeneity, and the recombination signal. We look forward to testing methodological approaches for quasispecies assembly that specifically incorporate recombination, such as the recent Quasi-Recomb, which currently performs only local reconstruction 33 .
Evaluation of quasispecies assemblers in different scenarios, using both simulated and empirical data, will be an important step in characterizing the clinical relevance of viral populations in infected individuals.

Methods
Sample preparation and processing. Plasmids containing HIV-1 env gp120 V1-V5, or genotype 1 HCV envelope E1-E2 sequences, derived from virus in the plasma of infected patients, previously published by Gray et al. 34 and Ho et al. 35 , isolated from transformed bacterial cultures, were mixed at controlled decreasing proportions (from 42% to 0.1%, see the Results section). Sample preparation and processing procedures were based on the previous work of Zagordi et al. 18 . The cloning kit used was Invitrogen TOPOH TA CloningH Kit (with pCRH2.1-TOPOH Vector) with One ShotH Mach1 TM T1 Phage-Resistant Chemically Competent E. coli (http:// products.invitrogen.com/ivgn/product/K451020).
Study design and sample usage for viral genotyping was approved by University of Florida Institutional Review Board. NGS was performed at University of Florida using the Roche's GS FLX Titanium XLR70 platform. Raw NGS data are available for free upon formal request to the authors.
Mapping to a reference and quasispecies assembly. The HCV reference used for read mapping was the H77 (Genbank NC_004102.1), encompassing 669 bases of the two envelope regions E1-E2 (positions 1,315-1,984 relative to the full-genome numbering). The HIV-1 reference sequence was the HXB2 (Genbank K03455) env gp160 (positions 6,225-8,792 relative to the full-genome numbering).
Reads were mapped to each reference using QuRe 32 , a multi-threaded, platformindependent software designed for read mapping, error correction and quasispecies assembly. QuRe uses the JAligner (http://jaligner.sourceforge.net/) implementation of the Smith-Waterman-Gotoh local alignment algorithm 36 for mapping reads and discards those that do not exhibit an alignment score sufficiently high as compared to a quasi-random score distribution, according to a z-test as discussed by Bacro and Comet 37 . For this study, gap open and extension penalties of 23 and 0.3 were chosen, respectively. Reads that spanned both a part of the viral insert and the plasmid were trimmed by the local alignment and retained if the statistical test was below the chosen threshold (P , 0.01 by default, corrected with the Benjamini-Hochberg procedure). Therefore, a read could be mapped to both the plasmid reference and the viral reference.
Error correction was performed using QuRe 32 and ShoRAH 30 . The former implemented the method by Wang et al. 8 , which assumed a Poisson distribution of errors parameterized differently in homopolymeric and non-homopolymeric regions. A position in the reference sequence was defined as homopolymeric if the substring generated by elongating three bases to the left and three bases to the right of that position contained at least three consecutive bases of the same type. ShoRAH implemented a Bayesian probabilistic clustering over sliding windows across the reference sequence, correcting reads by assigning each one to the closest cluster 25 . Errors were calculated overall and stratified by type (insertion, deletions, mismatches, and homopolymeric vs. non-homopolymeric region).
Phylogenetic, recombination analysis and performance assessment. Reconstructed variants were aligned with the original Sanger sequences and a phylogenetic/ recombination analysis was carried out using the MEGA software v5.05 38 and SplitsTree v4 39 , as follows. In MEGA, a multiple alignment was obtained using MUSCLE 40 , then an optimal nucleotide substitution model was selected by minimizing the AIC, and finally bootstrapped (n 5 500) neighbor-joining trees were inferred. The multiple alignment was used also as input for the recombination analysis in SplitsTree, using the PHI-test by Bruen et al. 41 , and generating phylogenetic networks with the bootstrapped (n 5 500) neighbor-net algorithm.
By looking at the inferred phylogenetic trees, an assembled variant that was clustering with an original Sanger clone with $ 75% of bootstrap support was considered as a correct reconstruction. If more than one reconstructed variant was clustering with one or more original Sanger clones, this was counted as a unique correct reconstruction. A minimal nucleotide difference threshold was not fixed. This criterion was chosen to account for the fact that assemblers may output variants of different lengths and a subsequent distances obtained from the multiple alignments may be biased towards longer/shorter sequences. Precision (# of correctly reconstructed variants/# total number of reconstructed variants) and recall (# of correctly reconstructed variants/# number of original variants) were used as performance indicators, as well as the P-value of recombination as obtained by the PHI-test.