Zika virus evolution and spread in the Americas

Metsky, Hayden C.; Matranga, Christian B.; Wohl, Shirlee; Schaffner, Stephen F.; Freije, Catherine A.; Winnicki, Sarah M.; West, Kendra; Qu, James; Baniecki, Mary Lynn; Gladden-Young, Adrianne; Lin, Aaron E.; Tomkins-Tinch, Christopher H.; Ye, Simon H.; Park, Daniel J.; Luo, Cynthia Y.; Barnes, Kayla G.; Shah, Rickey R.; Chak, Bridget; Barbosa-Lima, Giselle; Delatorre, Edson; Vieira, Yasmine R.; Paul, Lauren M.; Tan, Amanda L.; Barcellona, Carolyn M.; Porcelli, Mario C.; Vasquez, Chalmers; Cannons, Andrew C.; Cone, Marshall R.; Hogan, Kelly N.; Kopp, Edgar W.; Anzinger, Joshua J.; Garcia, Kimberly F.; Parham, Leda A.; Ramírez, Rosa M. Gélvez; Montoya, Maria C. Miranda; Rojas, Diana P.; Brown, Catherine M.; Hennigan, Scott; Sabina, Brandon; Scotland, Sarah; Gangavarapu, Karthik; Grubaugh, Nathan D.; Oliveira, Glenn; Robles-Sikisaka, Refugio; Rambaut, Andrew; Gehrke, Lee; Smole, Sandra; Halloran, M. Elizabeth; Villar, Luis; Mattar, Salim; Lorenzana, Ivette; Cerbino-Neto, Jose; Valim, Clarissa; Degrave, Wim; Bozza, Patricia T.; Gnirke, Andreas; Andersen, Kristian G.; Isern, Sharon; Michael, Scott F.; Bozza, Fernando A.; Souza, Thiago M. L.; Bosch, Irene; Yozwiak, Nathan L.; MacInnis, Bronwyn L.; Sabeti, Pardis C.

doi:10.1038/nature22402

Letter
Published: 24 May 2017

Nature volume 546, pages 411–415 (2017)

Abstract

Although the recent Zika virus (ZIKV) epidemic in the Americas and its link to birth defects have attracted a great deal of attention^1,2, much remains unknown about ZIKV disease epidemiology and ZIKV evolution, in part owing to a lack of genomic data. Here we address this gap in knowledge by using multiple sequencing approaches to generate 110 ZIKV genomes from clinical and mosquito samples from 10 countries and territories, greatly expanding the observed viral genetic diversity from this outbreak. We analysed the timing and patterns of introductions into distinct geographic regions; our phylogenetic evidence suggests rapid expansion of the outbreak in Brazil and multiple introductions of outbreak strains into Puerto Rico, Honduras, Colombia, other Caribbean islands, and the continental United States. We find that ZIKV circulated undetected in multiple regions for many months before the first locally transmitted cases were confirmed, highlighting the importance of surveillance of viral infections. We identify mutations with possible functional implications for ZIKV biology and pathogenesis, as well as those that might be relevant to the effectiveness of diagnostic tests.

Main

Since its introduction into the Americas, mosquito-borne ZIKV (family: Flaviviridae) has spread rapidly, causing hundreds of thousands of cases of ZIKV disease, as well as ZIKV congenital syndrome and probably other neurological complications^1,2,3. Phylogenetic analysis of ZIKV can reveal the trajectory of the outbreak and detect mutations that may be associated with new disease phenotypes or affect molecular diagnostics. Despite the 70 years since its discovery and the scale of the recent outbreak, however, fewer than 100 ZIKV genomes have been sequenced directly from clinical samples. This is due in part to technical challenges posed by low viral loads (for example, these are often orders of magnitude lower than in Ebola virus or dengue virus infection^4,5,6), and by loss of RNA integrity in samples collected and stored without sequencing in mind. Culturing the virus increases the material available for sequencing but can result in genetic variation that is not representative of the original clinical sample.

We sought to gain a deeper understanding of the viral populations underpinning the ZIKV epidemic by extensive genome sequencing of the virus directly from samples collected as part of ongoing surveillance. We initially pursued unbiased metagenomic sequencing to capture both ZIKV and other viruses known to be co-circulating with ZIKV⁵. In most of the 38 samples examined by this approach there was insufficient ZIKV RNA for genome assembly, but the metagenomic data remained valuable for verifying results from other methods. These data also revealed sequences from other viruses, including 41 likely novel viral sequence fragments in mosquito pools (Extended Data Table 1). In one patient we detected no ZIKV sequence but did assemble a complete genome from dengue virus (type 1), one of the viruses that co-circulates with and presents similarly to ZIKV⁷.

To capture sufficient ZIKV content for genome assembly, we turned to two targeted approaches for enrichment before sequencing: multiplex PCR amplification⁸ and hybrid capture⁹. We sequenced and assembled complete or partial genomes from 110 samples from across the epidemic, out of 229 attempted (221 clinical samples from confirmed and possible ZIKV disease cases and eight mosquito pools; Table 1, Supplementary Table 1). This dataset, which we used for further analysis, includes 110 genomes produced using multiplex PCR amplification (amplicon sequencing) and a subset of 37 genomes produced using hybrid capture (out of 66 attempted). Because these approaches amplify any contaminant ZIKV content, we relied heavily on negative controls to detect artefactual sequence, and we established stringent, method-specific thresholds on coverage and completeness for calling high-confidence ZIKV assemblies (Fig. 1a). Completeness and coverage for these genomes are shown in Fig. 1b, c; the median fraction of the genome with unambiguous base calls was 93%. Per-base discordance between genomes produced by the two methods was 0.017% across the genome, 0.15% at polymorphic positions, and 2.2% for minor allele base calls. Concordance of within-sample variants is shown in more detail in Fig. 1d–f. Patient sample type (urine, serum, or plasma) made no significant difference to sequencing success in our study (Extended Data Fig. 1).

Table 1 Samples and genomes by region

**Figure 1: Sequence data from clinical and mosquito samples.**

To investigate the spread of ZIKV in the Americas we performed a phylogenetic analysis of the 110 genomes from our dataset, together with 64 published genomes available on NCBI GenBank and in refs 10 and 11 (Fig. 2a). Our reconstructed phylogeny (Fig. 2b), which is based on a molecular clock (Extended Data Fig. 2), is consistent with the outbreak having originated in Brazil¹²: Brazil ZIKV genomes appear on all deep branches of the tree, and their most recent common ancestor is the root of the entire tree. We estimate the date of that common ancestor to have been in early 2014 (95% credible interval (CI) August 2013 to July 2014). The shape of the tree near the root remains uncertain (that is, the nodes have low posterior probabilities) because there are too few mutations to clearly distinguish the branches. This pattern suggests rapid early spread of the outbreak, consistent with the introduction of a new virus to an immunologically naive population. ZIKV genomes from Colombia (n = 10), Honduras (n = 18), and Puerto Rico (n = 3) cluster within distinct, well-supported clades. We also observed a clade consisting entirely of genomes from patients who contracted ZIKV in one of three Caribbean countries (the Dominican Republic, Jamaica, and Haiti) or the continental United States, containing 30 of 32 genomes from the Dominican Republic and 19 of 20 from the continental United States. We estimated the within-outbreak substitution rate to be 1.15 × 10⁻³ substitutions per site per year (95% CI (9.78 × 10⁻⁴, 1.33 × 10⁻³)), similar to prior estimates for this outbreak¹². This is 1.3–5 times higher than reported rates for other flaviviruses¹³, but is measured over a short sampling period, and therefore may include a higher proportion of mildly deleterious mutations that have not yet been removed through purifying selection.
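
As a rough illustration of what this rate implies, the sketch below converts the per-site estimate into an expected number of substitutions per genome per year. The genome length of 10,807 nt is an assumption (approximately the length of the KX197192.1 reference), and the function name is hypothetical.

```python
def substitutions_per_genome_per_year(rate_per_site: float, genome_length: int) -> float:
    """Expected substitutions per genome per year for a given per-site rate."""
    return rate_per_site * genome_length

# Assumption: approximate ZIKV genome length in nucleotides.
GENOME_LENGTH = 10_807

# Point estimate and 95% CI bounds of the within-outbreak rate reported in the text.
point_estimate = substitutions_per_genome_per_year(1.15e-3, GENOME_LENGTH)
ci_low = substitutions_per_genome_per_year(9.78e-4, GENOME_LENGTH)
ci_high = substitutions_per_genome_per_year(1.33e-3, GENOME_LENGTH)

print(f"{point_estimate:.1f} substitutions/genome/year "
      f"(95% CI {ci_low:.1f} to {ci_high:.1f})")
```

Under these assumptions the rate corresponds to roughly a dozen substitutions per genome per year, which is consistent with there being too few mutations to resolve branches that diverged within a few weeks of one another.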

**Figure 2: Zika virus spread throughout the Americas.**

Determining when ZIKV arrived in specific regions helps to elucidate the spread of the outbreak and track rising incidence of possible complications of ZIKV infection. The majority of the ZIKV genomes from our study fall into four major clades from different geographic regions, for which we estimated a likely date for ZIKV arrival. In each case, the date was months earlier than the first confirmed, locally transmitted case, indicating ongoing local circulation of ZIKV before its detection. In Puerto Rico, the estimated date was 4.5 months earlier than the first confirmed local case¹⁴; it was 8 months earlier in Honduras¹⁵, 5.5 months earlier in Colombia¹⁶, and 9 months earlier for the Caribbean–continental US clade¹⁷. In each case, the arrival date represents the estimated time to the most recent common ancestor (tMRCA) for the corresponding clade in our phylogeny (Fig. 2c; see Extended Data Fig. 3 and Extended Data Table 2 for details). Similar temporal gaps between the tMRCA of local transmission chains and the earliest detected cases were seen when chikungunya virus emerged in the Americas¹⁸. We also observed evidence for several introductions of ZIKV into the continental United States, and found that sequences from mosquito and human samples collected in Florida cluster together, consistent with the finding of local ZIKV transmission in Florida in ref. 11.

Principal component analysis (PCA) is consistent with the phylogenetic observations (Fig. 2d). It shows tight clustering among ZIKV genomes from the continental United States, the Dominican Republic, and Jamaica. ZIKV genomes from Brazil and Colombia are similar and distinct from genomes sampled in other countries. ZIKV genomes from Honduras form a third cluster that also contains genomes from Guatemala or El Salvador. The PCA results show no clear stratification of ZIKV within Brazil.

Genetic variation can provide important insights into ZIKV biology and pathogenesis and can reveal potentially functional changes in the virus. We observed 1,030 mutations in the complete dataset, and they were well distributed across the genome (Fig. 3a). Any effect of these mutations cannot be determined from these data; however, the most likely candidates for functional mutations would be among the 202 nonsynonymous mutations (Supplementary Table 2) and the 32 mutations in the 5′ and 3′ untranslated regions (UTRs). Adaptive mutations are more likely to be found at high frequency or to be seen multiple times, although both effects can also occur by chance. We observed five positions with nonsynonymous mutations at more than 5% minor allele frequency that occurred on two or more branches of the tree (Fig. 3b); two of these (at positions 4,287 and 8,991) occurred together and might represent incorrect placement of a Brazil branch in the tree. The remaining three are more likely to represent multiple nonsynonymous mutations; one (at 9,240) appears to involve nonsynonymous mutations to two different alleles.

**Figure 3: Geographic and genomic distribution of Zika virus variation.**

To assess the possible biological significance of these mutations, we looked for evidence of selection in the ZIKV genome. Viral surface glycoproteins are known targets of positive selection, and mutations in these proteins can confer adaptation to new vectors¹⁹ or aid immune escape^20,21. We therefore searched for an excess of nonsynonymous mutations in the ZIKV envelope glycoprotein (E). However, the nonsynonymous substitution rate in E proved to be similar to that in the rest of the coding region (Fig. 3c, left); moreover, amino acid changes were significantly more conservative in that region than elsewhere (Fig. 3c, middle and right). Any diversifying selection occurring in the surface protein thus appears to be operating under selective constraint. We also found evidence for purifying selection in the ZIKV 3′ UTR (Fig. 3d, Supplementary Table 3), which is important for viral replication²².

While the transition-to-transversion ratio (6.98) was within the range seen in other viruses²³, we observed a considerably higher frequency of C-to-T and T-to-C substitutions than other transitions (Fig. 3d, Extended Data Fig. 4, Supplementary Table 3). This enrichment was apparent both in the genome as a whole and at fourfold degenerate sites, where selection pressure is minimal. Many processes could contribute to this conspicuous mutation pattern, including mutational bias of the ZIKV RNA-dependent RNA polymerase, host RNA editing enzymes (for example, APOBECs, ADARs) acting upon viral RNA, and chemical deamination, but further investigation is required to determine the cause of this phenomenon.
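
The transition-to-transversion ratio and the per-type transition counts can be tallied as in the following minimal sketch. The substitution list is toy data, not the study's observations, and the function names are hypothetical.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ts_tv_ratio(substitutions):
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in substitutions if is_transition(ref, alt))
    transversions = len(substitutions) - transitions
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions

def transition_counts(substitutions):
    """Count each transition type, e.g. 'C>T', to look for enrichment."""
    return Counter(f"{ref}>{alt}" for ref, alt in substitutions
                   if is_transition(ref, alt))

# Toy data (assumption): seven transitions and one transversion.
toy_substitutions = ([("C", "T")] * 3 + [("T", "C")] * 2
                     + [("A", "G"), ("G", "A"), ("A", "C")])
toy_ratio = ts_tv_ratio(toy_substitutions)
toy_counts = transition_counts(toy_substitutions)
```

Restricting the input list to substitutions at fourfold degenerate sites would give the corresponding tally where selection pressure is minimal.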

Mismatches between PCR assays and viral sequence are a potential source of poor diagnostic performance in this outbreak²⁴. To assess the potential influence of ongoing viral evolution on diagnostic function, we compared eight published qRT–PCR-based primer/probe sets to our data. We found numerous sites at which the probe or primer did not match an allele found among the 174 ZIKV genomes from the current dataset (Fig. 3e). In most cases, the discordant allele was shared by all outbreak samples, presumably because it was present in the Asian lineage that entered the Americas. These mismatches could affect all uses of the diagnostic assay in the outbreak. We also found mismatches from new mutations that occurred after ZIKV entry into the Americas. Most of these were present in less than 10% of samples, although one was seen in 29%. These observations suggest that genome evolution has not caused widespread degradation of diagnostic performance during the course of the outbreak, but that mutations continue to accumulate and ongoing monitoring is needed.
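
A comparison of this kind can be sketched as a per-position mismatch frequency between a primer and the corresponding region of each genome. The sketch assumes a non-degenerate primer and target regions that are already aligned and given in the primer's orientation; the sequences are toy data and the function name is hypothetical.

```python
def mismatch_frequencies(primer, aligned_targets):
    """Return, per primer position, the fraction of targets that differ from the primer.

    Assumes each target has the same length as the primer and the same
    orientation. Ambiguous target bases (e.g. N) are ignored; a position
    with no called bases yields None.
    """
    frequencies = []
    for i, primer_base in enumerate(primer):
        called = [t[i] for t in aligned_targets if t[i] in "ACGT"]
        if not called:
            frequencies.append(None)
            continue
        mismatches = sum(1 for base in called if base != primer_base)
        frequencies.append(mismatches / len(called))
    return frequencies

# Hypothetical 4-nt primer and four toy target regions (not real assay data).
toy_primer = "ACGT"
toy_targets = ["ACGT", "ACGA", "ACNT", "TCGT"]
toy_frequencies = mismatch_frequencies(toy_primer, toy_targets)
```

A frequency near 1 at a position would correspond to a discordant allele shared by all outbreak samples, whereas a low but non-zero frequency would correspond to a mutation that arose after entry into the Americas.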

Analysis of within-host viral genetic diversity can reveal important information for understanding virus–host interactions and viral transmission. However, accurately identifying these variants in low-titre clinical samples is challenging, and further complicated by potential artefacts associated with enrichment before sequencing. To investigate whether we could reliably detect within-host ZIKV variants in our data, we identified within-host variants in a cultured ZIKV isolate used as a positive control throughout our study, and found that both amplicon sequencing and hybrid capture data produced concordant and replicable variant calls (Fig. 1d). In clinical and mosquito samples, hybrid capture within-host variants were noisier but contained a reliable subset: although most variants were not validated by the other sequencing method or by a technical replicate, those at high frequency were always replicable, as were those that passed a previously described filter²⁵ (Fig. 1e, f, Extended Data Table 3). Within this high confidence set we looked for variants that were shared between samples as a clue to transmission patterns, but there were too few variants to draw any meaningful conclusions. By contrast, within-host variants identified in amplicon sequencing data were unreliable at all frequencies (Fig. 1f, Extended Data Table 3), suggesting that further technical development is needed before amplicon sequencing can be used to study within-host variation in ZIKV and other clinical samples with low viral titres.

Sequencing low-titre viruses such as ZIKV directly from clinical samples presents several challenges that are likely to have contributed to the paucity of genomes available from the current outbreak. While the development of technical and analytical methods will surely continue, we note that factors upstream in the process, including collection site and cohort, were strong predictors of sequencing success in our study (Extended Data Fig. 1). This finding highlights the importance of continuing development and implementation of best practices for sample handling, without disrupting standard clinical workflows, for wider adoption of genome surveillance during outbreaks. Additional sequencing, however challenging, remains critical to ongoing investigation of ZIKV biology and pathogenesis. Together with refs 10 and 11, this study advances both technological and collaborative strategies for genome surveillance in the face of unexpected outbreak challenges.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

Ethics statement

The clinical studies from which samples were obtained were evaluated and approved by the relevant Institutional Review Boards/Ethics Review Committees at Hospital General de la Plaza de la Salud (Santo Domingo, Dominican Republic), University of the West Indies (Kingston, Jamaica), Universidad Nacional Autónoma de Honduras (Tegucigalpa, Honduras), Oswaldo Cruz Foundation (Rio de Janeiro, Brazil), Centro de Investigaciones Epidemiologicas, Universidad Industrial de Santander (Bucaramanga, Colombia), Massachusetts Department of Public Health (Jamaica Plain, Massachusetts), and Florida Department of Health (Tallahassee, Florida). Informed consent was obtained from all participants enrolled in studies at Hospital General de la Plaza de la Salud, Universidad Nacional Autónoma de Honduras, Oswaldo Cruz Foundation, and Universidad Industrial de Santander. IRBs at the University of the West Indies, Massachusetts Department of Public Health, and Florida Department of Health granted waivers of consent because this research, which used leftover clinical diagnostic samples, involved no more than minimal risk. Harvard University and Massachusetts Institute of Technology (MIT) Institutional Review Boards/Ethics Review Committees provided approval for sequencing and secondary analysis of samples collected by the aforementioned institutions.

Sample collections and study subjects

Patients with suspected ZIKV infection (including high-risk travellers) were enrolled through study protocols at multiple aforementioned collection sites. Clinical samples (including blood, urine, cerebrospinal fluid, and saliva) were obtained from suspected or confirmed ZIKV cases and from high-risk travellers. De-identified information about study participants and other sample metadata are reported in Supplementary Table 1.

Viral RNA isolation

RNA was isolated following the manufacturer’s standard operating protocol for 0.14–1-ml samples³² using the QIAamp Viral RNA Minikit (Qiagen), except that in some cases 0.1 M final concentration of β-mercaptoethanol (as a reducing agent) or 40 μg/ml final concentration of linear acrylamide (Ambion) (as a carrier) were added to AVL buffer before inactivation. Extracted RNA was resuspended in AVE buffer or nuclease-free water. In some cases, viral samples were concentrated using Vivaspin-500 centrifugal concentrators (Sigma-Aldrich) before inactivation and extraction. In these cases, 0.84 ml of sample was concentrated to 0.14 ml by passing through a 30-kDa filter and discarding the flow-through.

Carrier RNA and host rRNA depletion

In a subset of human samples, carrier poly(rA) RNA and host rRNA were depleted from RNA samples using RNase H selective depletion^9,33. In brief, oligo d(T) (40 nt long) and/or DNA probes complementary to human rRNA were hybridized to the sample RNA. The sample was then treated with 15 units Hybridase (Epicentre) for 30 min at 45 °C. The complementary DNA probes were removed by treating each reaction with an RNase-free DNase (Qiagen) according to the manufacturer’s protocol. Following depletion, samples were purified using 1.8× volume AMPure RNAclean beads (Beckman Coulter Genomics) and eluted into 10 μl water for cDNA synthesis.

Illumina library construction and sequencing

cDNA synthesis was performed as described in previously published RNA-seq methods⁹. To track potential cross-contamination, 50 fg synthetic RNA (gift from M. Salit, NIST) was spiked into samples using unique RNA for each individual ZIKV sample. ZIKV negative control cDNA libraries were prepared from water, human K-562 total RNA (Ambion), or EBOV (KY425633.1) seed stock; ZIKV positive controls were prepared from ZIKV Senegal (isolate HD78788) or ZIKV Pernambuco (isolate PE243; KX197192.1) seed stock. The dual index Accel-NGS 2S Plus DNA Library Kit (Swift Biosciences) was used for library preparation. Approximately half of the cDNA product was used for library construction, and indexed libraries were generated using 18 cycles of PCR. Each individual sample was indexed with a unique barcode. Libraries were pooled at equal molarity and sequenced on the Illumina HiSeq 2500 or MiSeq (paired-end reads) platforms.

Amplicon-based cDNA synthesis and library construction

ZIKV amplicons were prepared as described^8,11, similarly to ‘RNA jackhammering’ for preparing low-input viral samples for sequencing³⁴, with slight modifications. After PCR amplification, each amplicon pool was quantified on a 2200 Tapestation (Agilent Technologies) using High Sensitivity D1000 ScreenTape (Agilent Technologies). Two microlitres of a 1:10 dilution of the amplicon cDNA was loaded and the concentration of the 350–550-bp fragments was calculated. The cDNA concentration, as reported by the Tapestation, was highly predictive of sequencing outcome (that is, whether a sample passed genome assembly thresholds) (Extended Data Fig. 5). cDNA from each of the two amplicon pools was mixed equally (10–25 ng each) and libraries were prepared using the dual index Accel-NGS 2S Plus DNA Library Kit (Swift Biosciences) according to the manufacturer’s protocol. Libraries were indexed with a unique barcode using seven cycles of PCR, pooled equally and sequenced on the Illumina MiSeq (250-bp paired-end reads) platform. Primer sequences were removed by hard trimming the first 30 bases for each insert read before analysis.
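
The hard-trimming step can be illustrated with a minimal sketch that removes a fixed number of leading bases, together with their quality scores, from a read. The function name and the example read are hypothetical; this is not the code used in the study.

```python
def hard_trim(sequence: str, qualities: str, n_bases: int = 30):
    """Drop the first n_bases from a read and from its quality string.

    Reads shorter than n_bases become empty and would typically be discarded.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality strings must have equal length")
    return sequence[n_bases:], qualities[n_bases:]

# Toy read (assumption): 30 primer-derived bases followed by 4 insert bases.
toy_sequence = "A" * 30 + "ACGT"
toy_qualities = "I" * 34
trimmed_sequence, trimmed_qualities = hard_trim(toy_sequence, toy_qualities)
```

Trimming a fixed length rather than matching each primer sequence is simple and robust to primer mismatches, at the cost of discarding some genuine insert sequence when a primer is shorter than 30 bases.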

Zika virus hybrid capture

Virus hybrid capture was performed as previously described⁹. Probes were created to target ZIKV and chikungunya virus (CHIKV). Candidate probes were created by tiling across publicly available sequences for ZIKV and CHIKV on NCBI GenBank³⁵. Probes were selected from among these candidate probes to minimize the number used while maintaining coverage of the observed diversity of the viruses. Alternating universal adapters were added to allow two separate PCR amplifications, each consisting of non-overlapping probes. (To download probe sequences, see Supplementary Information.)

The probes were synthesized on a 12k array (CustomArray). The synthesized oligos were amplified by two separate emulsion PCR reactions with primers containing T7 RNA polymerase promoter. Biotinylated baits were in vitro transcribed (MEGAshortscript, Ambion) and added to prepared ZIKV libraries. The baits and libraries were hybridized overnight (~16 h), captured on streptavidin beads, washed, and re-amplified by PCR using the Illumina adaptor sequences. Capture libraries were then pooled and sequenced. In some cases, a second round of hybrid capture was performed on PCR-amplified capture libraries to further enrich the ZIKV content of sequencing libraries (Extended Data Fig. 6). In the main text, ‘hybrid capture’ refers to a combination of hybrid capture sequencing data and data from the same libraries without capture (unbiased), unless explicitly distinguished.

Genome assembly

We assembled reads from all sequencing methods into genomes using viral-ngs v1.13.3 (refs 36, 37). We taxonomically filtered reads from amplicon sequencing against a ZIKV reference, KU321639.1. We filtered reads from other approaches against the list of accessions provided in the Supplementary Information. To compute results on individual replicates, we de novo assembled these and scaffolded against KU321639.1. To obtain final genomes for analysis, we pooled data from multiple replicates of a sample, de novo assembled, and scaffolded against KX197192.1. For all assemblies, we set the viral-ngs ‘assembly_min_length_fraction_of_reference’ and ‘assembly_min_unambig’ parameters to 0.01. For amplicon sequencing data, unambiguous base calls required at least 90% of reads to agree in order to call that allele (‘major_cutoff’ = 0.9); for hybrid capture data, we used the default threshold of 50%. We modified viral-ngs so that calls to GATK’s UnifiedGenotyper set ‘min_indel_count_for_genotyping’ to 2.
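
The effect of the two agreement thresholds can be illustrated with a simplified consensus caller. This is a sketch of the general idea only and not the viral-ngs implementation: the function name is hypothetical, sites that fail the threshold are reported as N rather than as IUPAC ambiguity codes, and whether the comparison is inclusive at exactly the cutoff is an assumption.

```python
from collections import Counter

def call_consensus_base(read_bases, major_cutoff):
    """Call the most common base if its read fraction reaches major_cutoff, else N.

    read_bases: iterable of bases observed at one site across reads.
    Non-ACGT observations are ignored. Ties are broken arbitrarily.
    """
    counts = Counter(b for b in read_bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N"
    base, count = counts.most_common(1)[0]
    # Assumption: the threshold is inclusive (>=).
    return base if count / total >= major_cutoff else "N"

# Toy pileup (assumption): 8 reads support A and 2 support G.
toy_pileup = "A" * 8 + "G" * 2
amplicon_call = call_consensus_base(toy_pileup, 0.9)  # stricter amplicon threshold
capture_call = call_consensus_base(toy_pileup, 0.5)   # default-style threshold
```

In this toy pileup, the same site is called as A under the 50% threshold but left ambiguous under the 90% threshold, which shows how the stricter setting trades completeness for confidence in amplicon data.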

At three sites with insertions or deletions (indels) in the consensus genome CDS, we corrected the genome using Sanger sequencing of the RT–PCR product (namely, at 3,447 in the genome for sample DOM_2016_BB-0085-SER; at 5,469 in BRA_2016_FC-DQ12D1-PLA; and at 6,516–6,564 in BRA_2016_FC-DQ107D1-URI, coordinates as in KX197192.1). At other indels in the consensus genome CDS, we replaced the indel with ambiguity.

Depth-of-coverage values from amplicon sequencing include read duplicates. In all other cases, we removed duplicates with viral-ngs.

Identification of non-ZIKV viruses in samples by unbiased sequencing

Using Kraken v0.10.6³⁸ in viral-ngs, we built a database that included its default ‘full’ database (which incorporates all bacterial and viral whole genomes from RefSeq³⁹ as of October 2015). Additionally, we included the whole human genome (hg38), genomes from PlasmoDB⁴⁰, sequences covering mosquito genomes (Aedes aegypti, Aedes albopictus, Anopheles albimanus, Anopheles quadrimaculatus, Culex quinquefasciatus, and the outgroup Drosophila melanogaster) from GenBank³⁵, protozoa and fungi whole genomes from RefSeq, SILVA LTP 16S rRNA sequences⁴¹, and all sequences from NCBI’s viral accession list⁴² (as of October 2015) for viral taxa that have human as a host. (To download the database, see Supplementary Information.)

For each sample, we ran Kraken on data from unbiased sequencing replicates (not including hybrid capture data) and searched its output reports for viral taxa with more than 100 reported reads. We manually filtered the results, removing ZIKV, bacteriophages, and known laboratory contaminants. For each sample and its associated taxa, we assembled genomes using viral-ngs as described above; the results are in Extended Data Table 1a. We used the following genomes for taxonomically filtering reads and as the reference for assembly: KJ741267.1 (cell fusing agent virus), AY292384.1 (deformed wing virus), NC_001477.1 (dengue virus type 1) and LC164349.1 (JC polyomavirus). When reporting sequence identity of an assembly to its taxon, we used BLASTN⁴³ to determine the identity between the sequence and the reference used for its assembly.

To focus on metagenomics of mosquito pools (Extended Data Table 1b), we considered unbiased sequencing data from eight mosquito pools (not including hybrid capture data). We first ran the depletion pipeline of viral-ngs on raw data and then ran the viral-ngs Trinity⁴⁴ assembly pipeline on the depleted reads to assemble them into contigs. We pooled contigs from all mosquito pool samples and identified all duplicate contigs with sequence identity >95% using CD-HIT⁴⁵. Additionally, we used predicted coding sequences from Prodigal 2.6.3 (ref. 46) to identify duplicate protein sequences at >95% identity. We classified contigs using BLASTN⁴³ against nt and BLASTX⁴³ against nr (as of February 2017) and discarded all contigs with an E value greater than 1 × 10⁻⁴. We define viral contigs as contigs that hit a viral sequence, and we manually removed all reverse-transcriptase-like contigs owing to their similarity to retrotransposon elements within the Aedes aegypti genome. We categorized viral contigs with less than 80% amino acid identity to their best hit as likely novel viral contigs. Supplementary Table 4 lists the unique viral contigs we found, their best hit, and information scoring the hit.
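
The automated part of this classification (the E value cutoff and the 80% amino acid identity criterion) can be sketched as below. The record type, field names, and category labels are hypothetical, and the manual removal of reverse-transcriptase-like contigs is not modelled.

```python
from dataclasses import dataclass

MAX_E_VALUE = 1e-4          # contigs with a larger E value are discarded
NOVELTY_IDENTITY = 80.0     # percent amino acid identity to the best hit

@dataclass
class BestHit:
    contig_id: str
    e_value: float
    is_viral: bool                 # whether the best hit is a viral sequence
    amino_acid_identity: float     # percent identity to the best hit

def classify_contig(hit: BestHit) -> str:
    """Assign a contig to one of four illustrative categories."""
    if hit.e_value > MAX_E_VALUE:
        return "discarded"
    if not hit.is_viral:
        return "non-viral"
    if hit.amino_acid_identity < NOVELTY_IDENTITY:
        return "likely novel viral"
    return "known viral"

# Toy hits (assumption): values chosen only to exercise each branch.
toy_hits = [
    BestHit("contig_1", 1e-30, True, 62.5),
    BestHit("contig_2", 1e-50, True, 97.0),
    BestHit("contig_3", 0.5, True, 40.0),
    BestHit("contig_4", 1e-20, False, 90.0),
]
toy_labels = [classify_contig(h) for h in toy_hits]
```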

Relationship between metadata and sequencing outcome

To determine whether available sample metadata are predictive of sequencing outcome, we tested the following variables: sample collection site, patient gender, patient age, sample type, and the number of days between symptom onset and sample collection (collection interval). To describe sequencing outcome of a sample S, we used the following response variable Y_S:

Y_S = mean({I(R) × (number of unambiguous bases in R) : R is an amplicon sequencing replicate of S}), where I(R) = 1 if the median depth of coverage of R is ≥ 275 and I(R) = 0 otherwise.

This value is listed in Supplementary Table 1 under ‘Dependent variable used in regression on metadata’. We excluded the saliva, cerebrospinal fluid, and whole blood sample types owing to sample number (n = 1), and also excluded mosquito pool samples and rows with missing values. We excluded samples from one collection site (prefix JAM_2016_WI-) because most had missing values. We treated samples with type ‘Plasma EDTA’ as having type ‘Plasma’. We treated the collection interval variable as categorical (0–1, 2–3, 4–6, and 7+ days).
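
The response variable Y_S defined above can be computed as in this minimal sketch. The input representation (a list of per-replicate tuples) and the function name are assumptions made for illustration.

```python
from statistics import mean

MIN_MEDIAN_DEPTH = 275  # depth threshold used in the indicator I(R)

def response_variable(replicates):
    """Compute Y_S for one sample.

    replicates: list of (unambiguous_bases, median_depth) tuples, one per
    amplicon sequencing replicate R of the sample S. A replicate contributes
    its unambiguous base count only if its median depth is at least 275.
    """
    contributions = [
        unambiguous if depth >= MIN_MEDIAN_DEPTH else 0
        for unambiguous, depth in replicates
    ]
    return mean(contributions)

# Toy sample (assumption): one replicate above and one below the depth threshold.
toy_y = response_variable([(10_000, 300), (8_000, 100)])
```

Because low-depth replicates contribute zero rather than being dropped, a sample with one good replicate and one failed replicate scores half as much as a sample with two good replicates.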

With a single model we underfit the zero counts, possibly because many zeros (samples without a replicate that passed ZIKV assembly) are truly ZIKV-negative. We thus view the data as coming from two processes: one determining whether a sample is ZIKV-positive or ZIKV-negative, and another that determines, among the observed passing samples, how much of a ZIKV genome we are able to sequence. We modelled the first process, predicting whether a sample is passing, with logistic regression (in R using GLM⁴⁷ with binomial family and logit link); here, the observed passing samples are the samples S for which Y_S ≥ 2,500. For the second, we performed a beta regression, using only the observed passing samples, of Y_S divided by ZIKV genome length on the predictor variables. We implemented this in R using the betareg package⁴⁸ and transformed fractions from the closed unit interval to the open unit interval as the authors suggest.
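
Beta regression requires responses strictly between 0 and 1. A minimal sketch of the adjustment follows, assuming the transformation meant is the one described in the betareg documentation, (y(n − 1) + 0.5)/n with n the number of observations. The function name is hypothetical.

```python
def squeeze_to_open_interval(fractions):
    """Map values in [0, 1] into (0, 1) using (y * (n - 1) + 0.5) / n.

    Assumption: this is the transformation referred to in the text; n is the
    number of observations being modelled.
    """
    n = len(fractions)
    if n == 0:
        raise ValueError("at least one observation is required")
    return [(y * (n - 1) + 0.5) / n for y in fractions]

# Toy fractions of the genome sequenced (assumption), including both endpoints.
toy_squeezed = squeeze_to_open_interval([0.0, 0.5, 1.0])
```

The transformation shrinks every value slightly towards 0.5, so the endpoints 0 and 1 become valid inputs while the ordering of observations is preserved.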

To test the significance of predictor variables, we used a likelihood ratio test. For variable X_i we compared a full model (with all predictors) against a model that used all predictors except X_i. The results of these tests are shown in Extended Data Fig. 1a, d. We explored the effects of sample type and collection interval on obtaining a passing assembly in Extended Data Fig. 1b, c, respectively. Error bars are 95% confidence intervals derived from binomial distributions. We explored the effects of these same two variables on Y_S (in passing samples only) in Extended Data Fig. 1e, f.
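
The likelihood ratio test for a single predictor can be sketched as follows, assuming the usual asymptotic chi-square reference distribution with degrees of freedom equal to the number of parameters dropped from the full model. The log-likelihood values are hypothetical, and the chi-square tail probability is computed from closed-form expressions for integer degrees of freedom so that the example needs only the standard library.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper tail probability of a chi-square distribution with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)                  # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))       # Q(1/2, z)
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_full: float, loglik_reduced: float, df_diff: int):
    """Return (statistic, p-value) comparing nested full and reduced models."""
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return statistic, chi2_sf(statistic, df_diff)

# Hypothetical log-likelihoods: dropping a predictor that uses 2 parameters.
toy_statistic, toy_p_value = likelihood_ratio_test(-100.0, -103.0, 2)
```

A categorical predictor such as the collection interval contributes one parameter per non-reference level, so its test uses more than one degree of freedom.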

Criteria for pooling across replicates

We attempted to sequence one or more replicates of each sample and attempted to assemble a genome from each replicate. We discarded data from any replicates whose assembly showed high sequence similarity, in any part of the genome, to our assembly of the genome in a sample consisting of an African (Senegal) lineage (strain HD78788) of ZIKV. We used this sample as a positive control throughout this study, and considered its presence in the assembly of a clinical or mosquito pool sample to be evidence of contamination. Similarly, we discarded data from four replicates belonging to samples from the Dominican Republic because they yielded assemblies that were unexpectedly identical or highly similar to our assembly of the ZIKV isolate PE243 genome, another positive control used in this study. We also discarded data from replicates that showed evidence of contamination, at the RNA stage, by the baits used in hybrid capture; we detected these by looking for adapters that were added to these probes for amplification.

For amplicon sequencing, we considered an assembly of a replicate to be ‘passing’ if it contained at least 2,500 unambiguous base calls and had a median depth of coverage of at least 275× over its unambiguous bases (depth includes duplicate reads). For the unbiased and hybrid capture approaches, we considered an assembly of a replicate ‘passing’ if it contained at least 4,000 unambiguous base calls. For each approach, the unambiguous base threshold was based on an observed density of negative controls below the threshold (Fig. 1a). For amplicon sequencing assemblies, we added a coverage depth threshold because coverage depth was roughly binary across replicates, with negative controls falling in the lower class. On the basis of these thresholds, 0 of 99 negative controls used throughout our sequencing runs yielded passing assemblies and 32 of 32 positive controls yielded passing assemblies.
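The thresholds above can be expressed as a small Python check. This is a sketch for illustration only; the method labels and function name are hypothetical, and only the numeric thresholds come from the text.

```python
# Minimum unambiguous base calls per approach (values from the text).
MIN_UNAMBIGUOUS_BASES = {"amplicon": 2500, "unbiased": 4000, "hybrid_capture": 4000}
# Median depth threshold applied to amplicon sequencing assemblies only.
AMPLICON_MIN_MEDIAN_DEPTH = 275


def is_passing(method, unambiguous_bases, median_depth=None):
    """Return True if a replicate assembly meets the thresholds for its method."""
    if method not in MIN_UNAMBIGUOUS_BASES:
        raise ValueError(f"unknown method: {method}")
    if unambiguous_bases < MIN_UNAMBIGUOUS_BASES[method]:
        return False
    if method == "amplicon":
        # Depth here includes duplicate reads, as in the text.
        return median_depth is not None and median_depth >= AMPLICON_MIN_MEDIAN_DEPTH
    return True
```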

We considered a sample to have a passing assembly if any of its replicates, by either method, yielded an assembly that passed the above thresholds. For each sample with at least one passing assembly, we pooled read data across all of its replicates, including replicates with assemblies that did not pass the assembly thresholds. When data were available from both amplicon sequencing and unbiased/hybrid capture approaches, we pooled amplicon sequencing data separately from data produced by the unbiased and hybrid capture approaches, the latter two of which were pooled together (henceforth, the ‘hybrid capture’ pool). We then assembled a genome from each set of pooled data. When assemblies on pooled data were available from both approaches, we selected for downstream analysis the assembly from the hybrid capture approach if it had at least 10,267 unambiguous base calls (95% of the reference genome used, GenBank accession KX197192.1); when this condition was not met, we selected the one that had more unambiguous base calls.
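The rule for choosing between pooled assemblies can be sketched as below. The function name and return labels are hypothetical, and the handling of an exact tie (here resolved in favour of hybrid capture) is an assumption, since the text does not specify it.

```python
# 95% of the reference genome length, as stated in the text.
HYBRID_CAPTURE_PREFERRED_MIN = 10267


def select_assembly(amplicon_bases, hybrid_capture_bases):
    """Pick which pooled assembly to use downstream.

    Arguments are counts of unambiguous base calls, or None when no
    assembly is available from that approach.
    """
    if amplicon_bases is None and hybrid_capture_bases is None:
        raise ValueError("no assembly available")
    if hybrid_capture_bases is None:
        return "amplicon"
    if amplicon_bases is None:
        return "hybrid_capture"
    if hybrid_capture_bases >= HYBRID_CAPTURE_PREFERRED_MIN:
        return "hybrid_capture"
    # Otherwise take the assembly with more unambiguous calls
    # (tie goes to hybrid capture: an assumption).
    if hybrid_capture_bases >= amplicon_bases:
        return "hybrid_capture"
    return "amplicon"
```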

The number of ZIKV genomes publicly available before this study was the result of an NCBI GenBank³⁵ search for ZIKV in February 2017. We filtered out any sequences with length <4,000 nt, excluded sequences that are being published as part of this study or in refs 10, 11, excluded sequences from non-human hosts, and excluded sequences labelled as having been passaged. We counted fewer than 100 sequences, the precise number depending on details of the count.

Visualization of coverage depth across genomes

For amplicon sequencing data, we plotted coverage across the 110 samples that yielded a passing assembly by amplicon sequencing (Fig. 1b). With viral-ngs, we aligned depleted reads to the reference sequence KX197192.1 using the novoalign aligner with options ‘-r Random -l 40 -g 40 -x 20 -t 100 -k’. Because of the nature of amplicon sequencing, duplicates were not identified or removed. We binarized depth at each nucleotide position, showing red if depth of coverage was at least 100×. Rows (samples) are hierarchically clustered to ease visualization.

For hybrid capture sequencing data, we plotted depth of coverage across the 37 samples that yielded a passing assembly (Fig. 1c). We aligned reads as described above for amplicon sequencing data, except we removed duplicates. For each sample, we calculated the depth of coverage at each nucleotide position. We then scaled the values for each sample so that each would have a mean depth of 1.0. At each nucleotide position, we calculated the median depth across the samples, as well as the 20th and 80th percentiles. We plotted the mean of each of these metrics within a 200-nt sliding window.
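The normalization and smoothing steps for the hybrid capture plot can be sketched in Python as follows. This is an illustration under stated assumptions: the percentile convention (linear interpolation) and the window alignment (each window covers consecutive positions, with no padding at the ends) are not given in the text, and all function names are hypothetical.

```python
import statistics


def percentile(values, pct):
    """Percentile by linear interpolation between order statistics (assumed convention)."""
    ordered = sorted(values)
    rank = pct / 100 * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def sliding_mean(values, window):
    """Mean of each run of `window` consecutive positions."""
    return [statistics.fmean(values[i:i + window])
            for i in range(len(values) - window + 1)]


def normalized_depth_summary(depths_by_sample, window=200):
    """Scale each sample to mean depth 1.0, then summarize across samples.

    depths_by_sample holds one list of per-position depths per sample,
    all of equal length and with non-zero mean.
    """
    scaled = []
    for depths in depths_by_sample:
        mean_depth = statistics.fmean(depths)
        scaled.append([d / mean_depth for d in depths])
    columns = list(zip(*scaled))  # one tuple of scaled depths per position
    summary = {
        "p20": [percentile(col, 20) for col in columns],
        "median": [statistics.median(col) for col in columns],
        "p80": [percentile(col, 80) for col in columns],
    }
    return {name: sliding_mean(vals, window) for name, vals in summary.items()}
```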

Multiple sequence alignments

We aligned ZIKV consensus genomes using MAFFT v7.221 (ref. 49) with the following parameters: ‘--maxiterate 1000 --ep 0.123 --localpair’.

In Supplementary Data, we provide sequences and alignments used in analyses.

Analysis of within- and between-sample variants

To measure overall per-base discordance between consensus genomes produced by amplicon sequencing and hybrid capture, we considered all sites at which base calls were made in both the amplicon sequencing and hybrid capture consensus genomes of a sample, and we calculated the fraction in which the bases were not in agreement. To measure discordance at polymorphic sites, we searched for positions with a polymorphism in all genomes generated in this study that we selected for downstream analysis (see ‘Criteria for pooling across replicates’ for choosing among the amplicon sequencing and hybrid capture genome when both are available). We then looked at these positions in genomes that were available from both methods, and we calculated the fraction in which the alleles were not in agreement.

To measure discordance at minor alleles, we searched for minor alleles in all genomes generated in this study that we selected for downstream analysis. We then looked at all sites at which there was a minor allele and for which genomes from both methods were available, and we calculated the fraction in which the alleles were not in agreement. For these calculations, we tolerated partial ambiguity (for example, ‘Y’ is concordant with ‘T’). If one genome had full ambiguity (‘N’) at a position and the other genome had an indel, we counted the site as discordant; otherwise, if one genome had full ambiguity, we did not count the site.
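The per-site comparison rules can be illustrated with the sketch below. It assumes that the two consensus genomes are already aligned to equal length, that indels are represented by ‘-’, and that two partially ambiguous calls are concordant when their sets of possible bases overlap; these representations and all names are hypothetical.

```python
# IUPAC nucleotide codes and the bases each can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def site_status(base_a, base_b):
    """Classify one aligned site as 'concordant', 'discordant', or 'skip'."""
    a, b = base_a.upper(), base_b.upper()
    if a == "N" or b == "N":
        other = b if a == "N" else a
        # Full ambiguity against an indel counts as discordant; otherwise skip.
        return "discordant" if other == "-" else "skip"
    if a == "-" or b == "-":
        return "concordant" if a == b else "discordant"
    return "concordant" if set(IUPAC[a]) & set(IUPAC[b]) else "discordant"


def discordance(seq_a, seq_b):
    """Fraction of counted sites that disagree, or None if no site is counted."""
    statuses = [site_status(a, b) for a, b in zip(seq_a, seq_b)]
    counted = [s for s in statuses if s != "skip"]
    if not counted:
        return None
    return counted.count("discordant") / len(counted)
```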

After assembling genomes, we identified within-sample variants by running V-Phaser 2.0 via viral-ngs³⁷ on all pooled reads mapping to each sample assembly. When determining per-library allele counts at each variant position, we modified viral-ngs to require a minimum base (Phred) quality score of 30 for all bases, discard anomalous read pairs, and use per-base alignment quality (BAQ) in its calls to SAMtools⁵⁰ mpileup. This is particularly helpful for filtering spurious amplicon sequencing variants because all generated reads start and end at a limited number of positions (owing to the pre-determined tiling of amplicons across the genome). Because amplicon sequencing libraries were sequenced using 250-bp paired-end reads, bases near the middle of the ~450-nt amplicons fall at the end of both paired reads, where quality scores drop and incorrect base calls are more likely. To determine the overall frequency of each variant in a sample, we summed allele counts (calculated using SAMtools⁵⁰ mpileup via viral-ngs) across libraries.

When comparing variant frequencies between amplicon sequencing and hybrid capture replicates of the PE243 positive control (seven technical replicates for each method; Fig. 1d), we included only positions at which the mean (pooled) frequency across replicates within at least one method was ≥1%. When comparing allele frequencies between replicate libraries, we restricted the sample set to only samples with a passing assembly in both methods, and included only samples with two or more replicates. By contrast, when comparing alleles across methods, we included samples that have a passing assembly by either method, with any number of replicates. For these comparisons, we included only positions with a minor variant; that is, positions for which both libraries/methods had an allele at 100% were removed, even if the single allele differed between the two libraries/methods. Additionally, we considered any allele with frequency <1% as not found (0%).

When comparing allele frequencies across methods: let f_a and f_hc be frequencies in amplicon sequencing and hybrid capture, respectively. If both are non-zero, we included an allele only if the read depth at its position was ≥1/min(f_a, f_hc) in both methods, and if depth at the position was at least 100× for hybrid capture and 275× for amplicon sequencing. If f_a = 0, we required a read depth of max(1/f_hc, 275) at the position in the amplicon sequencing method; similarly, if f_hc = 0 we required a read depth of max(1/f_a, 100) at the position in the hybrid capture method. This was to eliminate lack of coverage as a reason for discrepancy between two methods. When comparing allele frequencies across sequencing replicates within a method, we imposed only a minimum read depth (275× for amplicon sequencing and 100× for hybrid capture), but required this depth in both libraries. In samples with more than two replicates, we considered only the two replicates with the highest depth at each variant position.
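The cross-method depth requirement can be written as a short Python function. This is a sketch that follows the wording above literally: when one frequency is zero, the only condition applied is the one stated for the method in which the allele was not found. Frequencies are given as fractions (for example, 0.05 for 5%), and the function name is hypothetical.

```python
AMPLICON_MIN_DEPTH = 275
HYBRID_CAPTURE_MIN_DEPTH = 100


def passes_cross_method_depth(f_a, f_hc, depth_a, depth_hc):
    """Decide whether an allele is included in the cross-method comparison.

    f_a and f_hc are allele frequencies (fractions) in amplicon sequencing
    and hybrid capture; depth_a and depth_hc are read depths at the position.
    """
    if f_a == 0 and f_hc == 0:
        raise ValueError("at least one frequency must be non-zero")
    if f_a > 0 and f_hc > 0:
        needed = 1 / min(f_a, f_hc)
        return (depth_a >= max(needed, AMPLICON_MIN_DEPTH)
                and depth_hc >= max(needed, HYBRID_CAPTURE_MIN_DEPTH))
    if f_a == 0:
        # Allele seen only by hybrid capture: check amplicon depth.
        return depth_a >= max(1 / f_hc, AMPLICON_MIN_DEPTH)
    # Allele seen only by amplicon sequencing: check hybrid capture depth.
    return depth_hc >= max(1 / f_a, HYBRID_CAPTURE_MIN_DEPTH)
```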

We considered allele frequencies from hybrid capture sequencing ‘verified’ if they passed the strand bias and frequency filters described in ref. 25, with the exception that we imposed a minimum allele frequency of 1% and allowed a variant identified in only one library if its frequency was ≥5%. In Fig. 1f and Extended Data Table 3, we considered variants ‘validated’ if they were present at ≥1% frequency in both libraries or methods. When comparing two libraries for a given method M (amplicon sequencing or hybrid capture): the proportion unvalidated is the fraction, among all variants in M at ≥1% frequency in at least one library, of the variants that are at ≥1% frequency in exactly one of the two libraries. Similarly, when comparing methods: the proportion unvalidated for a method M is the fraction, among all variants at ≥1% frequency in M, of the variants that are at ≥1% frequency in M and <1% frequency in the other method.
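The two definitions of the proportion unvalidated can be made concrete with the following sketch. Frequencies are fractions, each variant is represented as a pair of frequencies, and the function names and example values are hypothetical.

```python
THRESHOLD = 0.01  # the 1% frequency cut-off from the text


def unvalidated_between_libraries(pairs):
    """pairs: (frequency in library 1, frequency in library 2) per variant.

    Among variants at >=1% in at least one library, return the fraction
    that are at >=1% in exactly one of the two.
    """
    present = [(a, b) for a, b in pairs if a >= THRESHOLD or b >= THRESHOLD]
    if not present:
        return None
    only_one = sum(1 for a, b in present if (a >= THRESHOLD) != (b >= THRESHOLD))
    return only_one / len(present)


def unvalidated_for_method(pairs):
    """pairs: (frequency in method M, frequency in the other method) per variant.

    Among variants at >=1% in M, return the fraction at <1% in the other method.
    """
    in_m = [(m, other) for m, other in pairs if m >= THRESHOLD]
    if not in_m:
        return None
    return sum(1 for _, other in in_m if other < THRESHOLD) / len(in_m)


# Hypothetical variant frequencies.
example = [(0.05, 0.04), (0.02, 0.0), (0.0, 0.03), (0.005, 0.0)]
```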

We called SNPs on the aligned genomes using Geneious version 9.1.7 (ref. 51). We converted all fully or partially ambiguous calls, which are treated by Geneious as variants, into missing data. We then removed all sites that were no longer polymorphic from the SNP set and re-calculated allele frequencies. A nonsynonymous mutation is shown on the tree (Fig. 3b) if it includes an allele that is nonsynonymous relative to the ancestral state (see ‘Molecular clock phylogenetics and ancestral state reconstruction’ section below) and has a minor allele frequency of >5%; all occurrences of nonsynonymous alleles are shown. (Two mutations, at positions 2,853 and 7,229, had nominal derived allele frequencies over 95%; in both cases, the ‘ancestral’ allele was seen only in a small clade within the tree, suggesting that the ancestral allele was incorrectly assigned. These are not shown.) We placed mutations at a node such that the node leads only to samples with the mutation or with no call at that site. Uncertainty in placement occurs when a sample lacks a base call for the corresponding mutation; in this case, we placed the mutation on the most recent branch for which we have available data. We also used this ancestral ZIKV state to count the frequency of each type of substitution over various regions of the ZIKV genome, per number of available bases in each region (Fig. 3d and Supplementary Table 3).

We quantified the effect of nonsynonymous mutations using the original BLOSUM62 scoring matrix for amino acids⁵², in which positive scores indicate conservative amino acid changes and negative scores unlikely or extreme substitutions. We assessed statistical significance for equality of proportions by χ² test (Fig. 3c, middle), and for difference of means by two-sample t-test with Welch–Satterthwaite approximation of d.f. (Fig. 3c, right). Error bars are 95% confidence intervals derived from binomial distributions (Fig. 3c, left and middle; Fig. 3d) or Student’s t distributions (Fig. 3c, right).
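For reference, the t statistic and the Welch–Satterthwaite degrees of freedom for two samples can be computed as in the sketch below. The function name and example scores are hypothetical, and the P value (which requires the t distribution) is left out.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Return (t statistic, Welch-Satterthwaite d.f.) for two samples.

    Each sample needs at least two values and the samples must not both
    have zero variance.
    """
    n1, n2 = len(sample1), len(sample2)
    # Squared standard errors of the two sample means.
    v1 = statistics.variance(sample1) / n1
    v2 = statistics.variance(sample2) / n2
    t = (statistics.fmean(sample1) - statistics.fmean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical substitution scores for two groups of mutations.
t_stat, dof = welch_t([1, 2, 3], [2, 3, 4])
```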

Maximum likelihood estimation and root-to-tip regression

We generated a maximum likelihood tree using a multiple sequence alignment that included genomes generated in this study, as well as a selection of other available sequences from the Americas, Southeast Asia, and the Pacific. The sequences are listed in Supplementary Information. We ran PhyML⁵³ with the GTR substitution model and 4 gamma substitution rate categories; for the tree search operation, we used ‘BEST’ (best of NNI and SPR). In FigTree v1.4.2 (ref. 54), we rooted the tree on the oldest sequence used as input (GenBank accession EU545988.1).

We used TempEst v1.5 (ref. 55), which selects the best-fitting root with a residual mean squared function, to estimate root-to-tip distances. We regressed these distances on sample dates in R with the lm function⁴⁷. The relationship between root-to-tip divergence and sample dates (Extended Data Fig. 2) supports the use of a molecular clock analysis in this study.
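A root-to-tip regression of this kind amounts to ordinary least squares of distance on date, as in the Python sketch below. Under the usual interpretation, which is assumed here rather than taken from the text, the slope approximates the substitution rate and the x-intercept approximates the date of the root. The function name and data are hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    Returns (slope, intercept, x_intercept, r_squared); requires at least
    two distinct dates and non-constant distances.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    x_intercept = -intercept / slope
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, x_intercept, r_squared


# Hypothetical decimal dates and distances (substitutions per site).
slope, intercept, x_intercept, r_squared = root_to_tip_regression(
    [2015.0, 2016.0, 2017.0], [0.002, 0.003, 0.004]
)
```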

In Supplementary Data, we provide the output of PhyML, as well as the dates and distances used for root-to-tip regression.

Molecular clock phylogenetics and ancestral state reconstruction

For molecular clock phylogenetics, we made a multiple sequence alignment from the genomes generated in this study combined with a selection of other available sequences from the Americas. We did not use sequences from outside the outbreak in the Americas. Among ZIKV genomes published and publicly available on NCBI GenBank³⁵, we selected 32 from the Americas that had at least 7,000 unambiguous bases, were not labelled as having been passaged more than once, and had location metadata. We also used 32 genomes from Brazil published in ref. 10 that met the same criteria. The sequences are listed in Supplementary Information.

We used BEAST v1.8.4 to perform molecular clock analyses⁵⁶. We used sampled tip dates to handle inexact dates⁵⁷. Because of sparse data in non-coding regions, we used only the CDS as input. We used the SRD06 substitution model on the CDS, which uses HKY with gamma site heterogeneity and partitions codons into two partitions (positions (1+2) and 3)⁵⁸. To perform model selection, we tested three coalescent tree priors: a constant-size population, an exponential growth population, and a Bayesian Skyline tree prior (ten groups, piecewise-constant model)⁵⁹. For each tree prior, we tested two clock models: a strict clock and an uncorrelated relaxed clock with log-normal distribution (UCLN)⁶⁰. In each case, we set the molecular clock rate to use a continuous time Markov chain rate reference prior⁶¹. For all six combinations of models, we performed path-sampling (PS) and stepping-stone sampling (SS) to estimate marginal likelihood^62,63. We sampled for 100 path steps with a chain length of 1 million, with power posteriors determined from evenly spaced quantiles of a Beta(alpha = 0.3; 1.0) distribution. The Skyline tree prior provided a better fit than the two other (baseline) tree priors (Extended Data Table 2), so we used this tree prior for all further analyses. Using a constant or exponential tree prior, a relaxed clock provides a better model fit, as shown by the log Bayes factor when comparing the two clock models. Using a Skyline tree prior, the log Bayes factor comparing a strict and relaxed clock is smaller than it is using the other tree priors, and it is similar to the variability between estimated log marginal likelihood from PS and SS methods. We chose to use a relaxed clock for further analyses, but we also report key findings using a strict clock.
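Two quantities in this model comparison have simple closed forms, sketched below. For a Beta(α, 1) distribution the quantile at probability p is p^(1/α), so evenly spaced quantiles give the power posterior schedule directly; the log Bayes factor is the difference of two log marginal likelihoods. The function names are hypothetical, and the exact number and ordering of powers used internally by BEAST may differ from this sketch.

```python
def power_posterior_schedule(n_steps, alpha=0.3):
    """Powers from evenly spaced quantiles of Beta(alpha, 1.0).

    Returns n_steps + 1 values running from 1.0 (posterior) down to
    0.0 (prior); including both end points is an assumption.
    """
    return [(k / n_steps) ** (1.0 / alpha) for k in range(n_steps, -1, -1)]


def log_bayes_factor(log_marginal_a, log_marginal_b):
    """Log Bayes factor in favour of model A over model B."""
    return log_marginal_a - log_marginal_b


schedule = power_posterior_schedule(100)
```

Because 1/α is larger than 1, most powers fall close to 0, where the power posterior changes fastest.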

For the tree and tMRCA estimates in Fig. 2, as well as the clock rate reported in the main text, we ran BEAST with 400 million MCMC steps using the SRD06 substitution model, Skyline tree prior, and relaxed clock model. We extracted clock rate and tMRCA estimates, and their distributions, with Tracer v1.6.0 and identified the maximum clade credibility (MCC) tree using TreeAnnotator v1.8.4. We visualized the tree in FigTree v1.4.2 (ref. 54). The reported credible intervals around estimates are 95% highest posterior density (HPD) intervals. When reporting substitution rate from a relaxed clock model, we give the mean rate (mean of the rates of each branch weighted by the time length of the branch). Additionally, for the tMRCA estimates in Fig. 2c with a strict clock, we ran BEAST with the same specifications (also with 400 million steps) except using a strict clock model. The resulting data are also used in the more comprehensive comparison shown in Extended Data Fig. 3.
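The mean rate under a relaxed clock, as defined above, is a time-weighted average over branches. A minimal Python sketch, with a hypothetical function name and values:

```python
def mean_clock_rate(branches):
    """Mean of branch rates weighted by branch time length.

    branches: iterable of (rate, time_length) pairs, one per branch.
    """
    branches = list(branches)
    total_time = sum(length for _, length in branches)
    return sum(rate * length for rate, length in branches) / total_time


# Two hypothetical branches: (substitutions/site/year, years).
example_rate = mean_clock_rate([(1e-3, 1.0), (2e-3, 3.0)])
```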

For the data with an outgroup in Extended Data Fig. 3, we ran BEAST as specified above (with strict and relaxed clock models), except with 100 million steps and with outgroup sequences in the input alignment. The outgroup sequences were the same as those used to make the maximum likelihood tree (see Supplementary Information). For the data excluding sample DOM_2016_MA-WGS16-020-SER in Extended Data Fig. 3, we ran BEAST as specified above (with strict and relaxed clocks), except we removed the sequence of this sample from the input and ran 100 million steps.

We used BEAST v1.8.4 to estimate transition and transversion rates within the CDS and non-coding regions. The model was the same as above except that we used the Yang96 substitution model on the CDS, which uses GTR with gamma site heterogeneity and partitions codons into three partitions⁶⁴; for the non-coding regions, we used a GTR substitution model with gamma site heterogeneity and no codon partitioning. There were four partitions in total: one for each codon position and another for the non-coding region (5′ and 3′ UTRs combined). We ran this for 200 million steps. At each sampled step of the MCMC, we calculated substitution rates for each partition using the overall substitution rate, the relative substitution rate of the partition, the relative rates of substitutions in the partition, and base frequencies. In Extended Data Fig. 4, we plot the means of these rates over the steps; the error bars shown are 95% HPD intervals of the rates over the steps.

We used BEAST v1.8.4 to reconstruct ancestral state at the root of the tree using CDS and non-coding regions. The model was the same as above except that, on the CDS, we used the HKY substitution model with gamma site heterogeneity and codons partitioned into three partitions (one per codon position). On the non-coding regions we used the same substitution model without codon partitioning. We ran this for 50 million steps and used TreeAnnotator v1.8.4 to find the state with the MCC tree. We selected the ancestral state corresponding to this state.

In all BEAST runs, we discarded the first 10% of states from each run as burn-in.

In Supplementary Data, we provide BEAST input (XML) and output files. We also provide the sequence of the reconstructed ancestral state.

Principal component analysis

We carried out principal component analysis using the R package FactoMineR⁶⁵. We imputed missing data with the package missMDA⁶⁶ and we show the results in Fig. 2d.

Diagnostic assay assessment

We extracted primer and probe sequences from eight published RT–qPCR assays^26,27,28,29,30,31 and aligned them to our ZIKV genomes using Geneious version 9.1.7 (ref. 51). We then tabulated matches and mismatches to the diagnostic sequence for all outbreak genomes, allowing multiple bases to match where the diagnostic primer and/or probe sequence contained nucleotide ambiguity codes (Fig. 3e).

Data availability

Sequence data that support findings of this study have been deposited in NCBI GenBank³⁵ under BioProject accession PRJNA344504. Zika virus genomes have accession numbers KY014295–KY014327 and KY785409–KY785485. The dengue virus type 1 genome sequenced in this study has accession number KY829115. See Supplementary Table 1 for a mapping of sample names to accession numbers.

Accession codes

Primary accessions

BioProject

PRJNA344504

References

World Health Organization. Zika situation report: Zika virus, Microcephaly and Guillain–Barré syndrome. http://who.int/emergencies/zika-virus/situation-report/2-february-2017/en/ (2017)
Reynolds, M. R. et al. Vital signs: update on Zika virus-associated birth defects and evaluation of all U.S. infants with congenital Zika virus exposure—U.S. Zika pregnancy registry, 2016. MMWR Morb. Mortal. Wkly. Rep. 66, 366–373 (2017)
de Vigilância em Saúde, S. Protocolo de Vigilância e Resposta à Ocorrência de Microcefalia (Ministério da Saúde Brasília, 2016)
Schieffelin, J. S. et al. Clinical illness and outcomes in patients with Ebola in Sierra Leone. N. Engl. J. Med. 371, 2092–2100 (2014)
Sardi, S. I. et al. Coinfections of Zika and chikungunya viruses in Bahia, Brazil, identified by metagenomic next-generation sequencing. J. Clin. Microbiol. 54, 2348–2353 (2016)
Martina, B. E. E., Koraka, P. & Osterhaus, A. D. M. E. Dengue virus pathogenesis: an integrated view. Clin. Microbiol. Rev. 22, 564–581 (2009)
Fauci, A. S. & Morens, D. M. Zika virus in the Americas—yet another Arbovirus threat. N. Engl. J. Med. 374, 601–604 (2016)
Quick, J. et al. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nat. Protocols http://dx.doi.org/10.1038/nprot.2017.066 (2017)
Matranga, C. B. et al. Enhanced methods for unbiased deep sequencing of Lassa and Ebola RNA viruses from clinical and biological samples. Genome Biol. 15, 519 (2014)
Faria, N. R. et al. Establishment and cryptic transmission of Zika virus in Brazil and the Americas. Nature http://dx.doi.org/10.1038/nature22401 (2017)
Grubaugh, N. D. et al. Genomic epidemiology reveals multiple introductions of Zika virus into the United States. Nature http://dx.doi.org/10.1038/nature22400 (2017)
Faria, N. R. et al. Zika virus in the Americas: early epidemiological and genetic findings. Science 352, 345–349 (2016)
Sall, A. A. et al. Yellow fever virus exhibits slower evolutionary dynamics than dengue virus. J. Virol. 84, 765–772 (2010)
Centers for Disease Control and Prevention. First case of Zika virus reported in Puerto Rico. https://www.cdc.gov/media/releases/2015/s1231-zika.html (2015)
Pan American Health Organization. Zika: Epidemiological Report Honduras. http://www2.paho.org/hq/index.php?option=com_docman&task=doc_view&gid=35137&Itemid=270 (2017)
Pan American Health Organization. Epidemiological Update: Zika Virus Infection. http://www2.paho.org/hq/index.php?option=com_docman&task=doc_view&gid=32021&Itemid=270 (2015)
Pan American Health Organization. Zika: Epidemiological Report Dominican Republic. http://www2.paho.org/hq/index.php?option=com_docman&task=doc_view&gid=35103&Itemid=270 (2017)
Nunes, M. R. T. et al. Emergence and potential for spread of chikungunya virus in Brazil. BMC Med. 13, 102 (2015)
Tsetsarkin, K. A., Vanlandingham, D. L., McGee, C. E. & Higgs, S. A single mutation in chikungunya virus affects vector specificity and epidemic potential. PLoS Pathog. 3, e201 (2007)
Piantadosi, A. et al. HIV-1 evolution in gag and env is highly correlated but exhibits different relationships with viral load and the immune response. AIDS 23, 579–587 (2009)
Villabona-Arenas, C. J. et al. Dengue virus type 3 adaptive changes during epidemics in São Jose de Rio Preto, Brazil, 2006–2007. PLoS One 8, e63496 (2013)
Brinton, M. A. & Basu, M. Functions of the 3′ and 5′ genome RNA regions of members of the genus Flavivirus. Virus Res. 206, 108–119 (2015)
Duchêne, S., Ho, S. Y. W. & Holmes, E. C. Declining transition/transversion ratios through time reveal limitations to the accuracy of nucleotide substitution models. BMC Evol. Biol. 15, 36 (2015)
Corman, V. M. et al. Clinical comparison, standardization and optimization of Zika virus molecular detection. Bull. World Health Organ. (2016)
Gire, S. K. et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 345, 1369–1372 (2014)
Pyke, A. T. et al. Imported zika virus infection from the Cook islands into Australia, 2014. PLoS Curr. http://dx.doi.org/10.1371/currents.outbreaks.4635a54dbffba2156fb2fd76dc49f65e (2014)
Lanciotti, R. S. et al. Genetic and serologic properties of Zika virus associated with an epidemic, Yap State, Micronesia, 2007. Emerg. Infect. Dis. 14, 1232–1239 (2008)
Faye, O. et al. One-step RT–PCR for detection of Zika virus. J. Clin. Virol. 43, 96–101 (2008)
Faye, O. et al. Quantitative real-time PCR detection of Zika virus and evaluation with field-caught mosquitoes. Virol. J. 10, 311 (2013)
Balm, M. N. D. et al. A diagnostic polymerase chain reaction assay for Zika virus. J. Med. Virol. 84, 1501–1505 (2012)
Tappe, D. et al. First case of laboratory-confirmed Zika virus infection imported into Europe, November 2013. Euro Surveill. 19, 20685 (2014)
U.S. Food and Drug Administration. Zika virus response updates from FDA. https://www.fda.gov/EmergencyPreparedness/Counterterrorism/MedicalCountermeasures/MCMIssues/ucm485199.htm#eua (2017)
Morlan, J. D., Qu, K. & Sinicropi, D. V. Selective depletion of rRNA enables whole transcriptome profiling of archival fixed tissue. PLoS One 7, e42882 (2012)
Worobey, M. et al. 1970s and ‘Patient 0’ HIV-1 genomes illuminate early HIV/AIDS history in North America. Nature 539, 98–101 (2016)
Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Sayers, E. W. GenBank. Nucleic Acids Res. 44, D67–D72 (2016)
Park, D. J. et al. Ebola virus epidemiology, transmission, and evolution during seven months in Sierra Leone. Cell 161, 1516–1526 (2015)
Tomkins-Tinch, C. et al. Broad Institute viral-ngs: v1.13.3. https://github.com/broadinstitute/viral-ngs/releases/tag/v1.13.3 (2016)
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014)
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016)
Aurrecoechea, C. et al. PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res. 37, D539–D543 (2009)
Yarza, P. et al. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst. Appl. Microbiol. 31, 241–250 (2008)
Brister, J. R., Ako-Adjei, D., Bao, Y. & Blinkova, O. NCBI viral genomes resource. Nucleic Acids Res. 43, D571–D577 (2015)
NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 44, D7–D19 (2016)
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011)
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012)
Hyatt, D., LoCascio, P. F., Hauser, L. J. & Uberbacher, E. C. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28, 2223–2230 (2012)
Core Team, R. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2016)
Cribari-Neto, F. & Zeileis, A. Beta regression in R. J. Stat. Softw. 34, 1–24 (2010)
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013)
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009)
Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–1649 (2012)
Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915–10919 (1992)
Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010)
Rambaut, A. FigTree. Version 1.4.2 (Inst. Evol. Biol., Univ. Edinburgh, 2014)
Rambaut, A., Lam, T. T., Max Carvalho, L. & Pybus, O. G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evol. 2, vew007 (2016)
Drummond, A. J., Suchard, M. A., Xie, D. & Rambaut, A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29, 1969–1973 (2012)
Shapiro, B. et al. A Bayesian phylogenetic method to estimate unknown sequence ages. Mol. Biol. Evol. 28, 879–887 (2011)
Shapiro, B., Rambaut, A. & Drummond, A. J. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol. Biol. Evol. 23, 7–9 (2006)
Drummond, A. J., Rambaut, A., Shapiro, B. & Pybus, O. G. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 22, 1185–1192 (2005)
Drummond, A. J., Ho, S. Y. W., Phillips, M. J. & Rambaut, A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4, e88 (2006)
Ferreira, M. A. R. & Suchard, M. A. Bayesian analysis of elapsed times in continuous-time Markov chains. Can. J. Stat. 36, 355–368 (2008)
Baele, G. et al. Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Mol. Biol. Evol. 29, 2157–2167 (2012)
Baele, G., Li, W. L. S., Drummond, A. J., Suchard, M. A. & Lemey, P. Accurate model selection of relaxed molecular clocks in bayesian phylogenetics. Mol. Biol. Evol. 30, 239–243 (2013)
Yang, Z. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42, 587–596 (1996)
Lê, S., Josse, J. & Husson, F. FactoMineR: an R package for multivariate analysis. J. Stat. Softw. 25, 1–18 (2008)
Josse, J. & Husson, F. missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31 (2016)
Gourinat, A.-C., O’Connor, O., Calvez, E., Goarant, C. & Dupont-Rouzeyrol, M. Detection of Zika virus in urine. Emerg. Infect. Dis. J. 21, 84 (2015)
Paz-Bailey, G. et al. Persistence of Zika virus in body fluids—preliminary report. N. Engl. J. Med. https://doi.org/10.1056/NEJMoa1613108 (2017)

Acknowledgements

We thank M. and L. Benioff for their vision and support; L. Brown, E. Lee, M. Giovanni, J. Levin-Allerhand and E. S. Lander for support and guidance; M. Schleicher, E. Lipscomb, A. Felix, A. Saltzman, and S. Donnelly for assistance with IRB and ethics processes; E. Mair, L. Nogelo and E. Carmean for legal counsel; T. Mason and the Broad Institute Genomics Platform for sequencing support; A. Matthews, S. Chapman, D. Neafsey, and B. Birren for management and guidance; O. Pybus and ZiBRA Project colleagues for sharing data before publication; D. Olson, E. Asturias, M. Salit, and E. Simon-Loriere for sharing samples and reagents; and E. Holmes, G. Bello, R. Tewhey, A. Piantadosi, C. Edwards and the Sabeti Laboratory for discussions and reading of the manuscript. We are indebted to Zika patients and clinical teams for making this work possible. Funding was provided by: Marc and Lynne Benioff (P.C.S.); NIH NIAID U19AI110818 (Broad Institute); Howard Hughes Medical Institute (P.C.S.); Harvard University Burke Global Health Fellowship (P.C.S.); Broad Institute BroadNext10 program (A.G. and P.C.S.); AWS Cloud Credits for Research (P.C.S.); Conselho Nacional de Desenvolvimento Científico e Tecnológico (440909/2016-3) and Fundação de Amparo a Pesquisa do Estado do Rio de Janeiro (E-26/201.320/2016, E-26/201.332/2016, E-26/010.000194/2015) (P.T.B. and F.A.B.); NIH NIAID 1R01AI099210 (S.I. and S.F.M.); MIDAS-National Institute of General Medical Sciences U54GM111274 (M.E.H. and D.P.R.); NIH NIAID AI100190 (I.B. and L.G.); AEDES Network (I.B.) 
and Colombian Science, Technology and Innovation Fund of Sistema General de Regalías-BPIN 2013000100011 (L.V., R.M.G.R., M.C.M.M., and I.B.); ASTMH Shope Fellowship (K.G.B.); NSF DGE 1144152 (A.E.L.); PNPD/CAPES Postdoctoral Fellowship (E.D.); Fulbright-Colciencias Doctoral Scholarship (D.P.R.); NIH training grant 5T32AI007244-33 (N.D.G.); EU under grant agreements 278433-PREDEMICS and 643476-COMPARE (A.R.); and NIH NCATS CTSA UL1TR001114, NIH NIAID contract HHSN272201400048C, The Ray Thomas Foundation, and Pew Biomedical Scholarship (K.G.A.).

Author information

Hayden C. Metsky, Christian B. Matranga, Shirlee Wohl and Stephen F. Schaffner: These authors contributed equally to this work.
Kristian G. Andersen, Sharon Isern, Scott F. Michael, Fernando A. Bozza, Thiago M. L. Souza, Irene Bosch, Nathan L. Yozwiak, Bronwyn L. MacInnis and Pardis C. Sabeti: These authors jointly supervised this work.

Authors and Affiliations

Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
Hayden C. Metsky, Christian B. Matranga, Shirlee Wohl, Stephen F. Schaffner, Catherine A. Freije, Sarah M. Winnicki, Kendra West, James Qu, Mary Lynn Baniecki, Adrianne Gladden-Young, Aaron E. Lin, Christopher H. Tomkins-Tinch, Simon H. Ye, Daniel J. Park, Cynthia Y. Luo, Kayla G. Barnes, Rickey R. Shah, Bridget Chak, Andreas Gnirke, Nathan L. Yozwiak, Bronwyn L. MacInnis & Pardis C. Sabeti
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Hayden C. Metsky
Department of Organismic and Evolutionary Biology, Center for Systems Biology, Harvard University, Cambridge, Massachusetts, USA
Shirlee Wohl, Stephen F. Schaffner, Catherine A. Freije, Aaron E. Lin, Cynthia Y. Luo, Kayla G. Barnes, Bridget Chak, Nathan L. Yozwiak & Pardis C. Sabeti
Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
Stephen F. Schaffner, Kayla G. Barnes, Clarissa Valim, Bronwyn L. MacInnis & Pardis C. Sabeti
Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Simon H. Ye
Harvard University Extension School, Cambridge, Massachusetts, USA
Rickey R. Shah
National Institute of Infectious Diseases Evandro Chagas, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, Rio de Janeiro, Brazil
Giselle Barbosa-Lima, Yasmine R. Vieira, Jose Cerbino-Neto & Fernando A. Bozza
Laboratório de AIDS e Imunologia Molecular, Instituto Oswaldo Cruz, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
Edson Delatorre
Department of Biological Sciences, College of Arts and Sciences, Florida Gulf Coast University, Fort Myers, Florida, USA
Lauren M. Paul, Amanda L. Tan, Carolyn M. Barcellona, Sharon Isern & Scott F. Michael
Miami-Dade County Mosquito Control, Miami, Florida, USA
Mario C. Porcelli & Chalmers Vasquez
Division of Disease Control and Health Protection, Florida Department of Health, Bureau of Public Health Laboratories, Tampa, Florida, USA
Andrew C. Cannons, Marshall R. Cone, Kelly N. Hogan & Edgar W. Kopp
Department of Microbiology, The University of the West Indies, Mona, Kingston, Jamaica
Joshua J. Anzinger
Instituto de Investigacion en Microbiologia, Universidad Nacional Autónoma de Honduras, Tegucigalpa, Honduras
Kimberly F. Garcia, Leda A. Parham & Ivette Lorenzana
Grupo de Epidemiología Clínica, Universidad Industrial de Santander, Bucaramanga, Colombia
Rosa M. Gélvez Ramírez, Maria C. Miranda Montoya & Luis Villar
Department of Epidemiology, College of Public Health and Health Professions, University of Florida, Gainesville, Florida, USA
Diana P. Rojas
Massachusetts Department of Public Health, Jamaica Plain, Massachusetts, USA
Catherine M. Brown, Scott Hennigan, Brandon Sabina, Sarah Scotland & Sandra Smole
Department of Immunology and Microbial Science, The Scripps Research Institute, La Jolla, California, USA
Karthik Gangavarapu, Nathan D. Grubaugh, Refugio Robles-Sikisaka & Kristian G. Andersen
Scripps Translational Science Institute, La Jolla, California, USA
Glenn Oliveira & Kristian G. Andersen
Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, EH9 3FL, UK
Andrew Rambaut
Fogarty International Center, National Institutes of Health, Bethesda, 20892, Maryland, USA
Andrew Rambaut
Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Lee Gehrke & Irene Bosch
Department of Microbiology and Immunobiology, Harvard Medical School, Boston, Massachusetts, USA
Lee Gehrke
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
M. Elizabeth Halloran
Department of Biostatistics, University of Washington, Seattle, Washington, USA
M. Elizabeth Halloran
Institute for Tropical Biology Research, Universidad de Córdoba, Montería, Córdoba, Colombia
Salim Mattar
Department of Osteopathic Medical Specialties, Michigan State University, East Lansing, Michigan, USA
Clarissa Valim
FIOCRUZ, Instituto Oswaldo Cruz, Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, Rio de Janeiro, Brazil
Wim Degrave
Laboratório de Imunofarmacologia, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Rio de Janeiro, Brazil
Patricia T. Bozza
Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, USA
Kristian G. Andersen
D’Or Institute for Research and Education, Rio de Janeiro, Brazil
Fernando A. Bozza
National Institute for Science and Technology on Innovation on Neglected Diseases, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
Thiago M. L. Souza
Center for Technological Development in Health, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
Thiago M. L. Souza
Howard Hughes Medical Institute, Chevy Chase, Maryland, USA
Pardis C. Sabeti

Contributions

C.B.M., S.W., C.A.F., S.M.W., K.W., J.Q., M.L.B., A.G.-Y., C.Y.L., R.R.S., G.B.-L., Y.R.V., L.M.P., A.L.T., C.M.Ba., M.C.P., C.Vas., A.C.C., M.R.C., K.N.H., E.W.K., J.J.A., K.F.G., L.A.P., R.M.G.R., M.C.M.M., C.M.Br., S.H., B.S., S.Sc., K.G., G.O., R.R.-S., and I.B. performed laboratory experiments and prepared samples for sequencing. H.C.M., C.B.M., C.A.F., S.M.W., K.W., J.Q., M.L.B., C.Y.L., A.G.-Y., N.D.G., A.G., and K.G.A. developed methods for ZIKV detection, targeted enrichment, and/or sequencing library preparation. H.C.M., C.B.M., S.W., S.F.S., M.L.B., A.E.L., C.H.T.-T., S.H.Y., D.J.P., E.D., A.R., T.M.L.S., I.B., and B.L.M. performed sequence assembly, curation, and/or data analyses. S.Sm., L.V., S.M., I.L., S.I., S.F.M., and F.A.B. led clinical studies and/or study sites. K.G.B., B.C., D.P.R., N.D.G., L.G., M.E.H., A.R., A.G., J.C.-N., C.Val., W.D., P.T.B., K.G.A., S.I., S.F.M., F.A.B., T.M.L.S., and I.B. provided critical insights and guidance. H.C.M., C.B.M., T.M.L.S., N.L.Y., B.L.M., and P.C.S. oversaw study design and management. H.C.M., C.B.M., S.W., S.F.S., A.E.L., N.L.Y., B.L.M., and P.C.S. drafted the manuscript. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Thiago M. L. Souza or Bronwyn L. MacInnis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Reviewer Information: Nature thanks K. St George, A. Wilder-Smith and M. Worobey for their contribution to the peer review of this work.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Figure 1 Relationship between metadata and sequencing outcome.

Analysis of possible predictors of sequencing outcome: the site where a sample was collected, patient gender, patient age, sample type, and collection interval. a, Prediction of whether a sample will pass assembly thresholds by sequencing. Rows show results of likelihood ratio tests on each predictor by omitting the variable from a full model that contains all predictors. Sample site and patient gender improve model fit, but sample type and collection interval do not. b, Proportion of samples that pass assembly thresholds by sequencing, divided by sample type, across six sample sites. c, Same as b, but divided by collection interval. d, Prediction of the genome fraction identified, using samples that passed assembly thresholds. Rows show results of likelihood ratio tests, as in a. Collection interval improves the model, but sample type does not. e, Sequencing outcome for each sample, divided by sample type, across six sample sites. f, Same as e, but divided by collection interval. Samples collected seven or more days after symptom onset produced, on average, the fewest unambiguous bases, though these observations are based on a limited number of data points. While the sample site variable accounts for differences in cohort composition, the observed effects of gender and collection interval might be due to confounders in composition that span multiple cohorts. These results illustrate the effects of variables on sequencing outcome for the samples in this study; they are not indicative of ZIKV titre more generally. Other studies^67,68 have analysed the impact of sample type and collection interval on ZIKV detection, sometimes with differing results.
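The likelihood ratio tests in panels a and d compare a full model against one with a single predictor omitted. The following is a minimal sketch of that idea, not the authors' analysis: it assumes a single two-level predictor (for example, gender) and a binary outcome (pass or fail), so the full model has one pass rate per group and the reduced model has one shared rate. The function names and the counts are hypothetical.

```python
import math

def binom_loglik(passes, total, p):
    """Log-likelihood of `passes` successes in `total` trials (constant term dropped)."""
    # Clamp p away from 0 and 1 so that log() is always defined.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return passes * math.log(p) + (total - passes) * math.log(1 - p)

def likelihood_ratio_test(group_a, group_b):
    """Test whether a two-level predictor improves fit; each group is (passes, total).

    Returns (statistic, p_value) using a chi-square distribution with 1 degree of freedom.
    """
    (ka, na), (kb, nb) = group_a, group_b
    # Reduced model: the predictor is omitted, so both groups share one pass rate.
    p_shared = (ka + kb) / (na + nb)
    ll_reduced = binom_loglik(ka, na, p_shared) + binom_loglik(kb, nb, p_shared)
    # Full model: each group gets its own maximum-likelihood pass rate.
    ll_full = binom_loglik(ka, na, ka / na) + binom_loglik(kb, nb, kb / nb)
    stat = max(0.0, 2.0 * (ll_full - ll_reduced))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: 18 of 20 samples pass in one group, 5 of 20 in the other.
stat, p_value = likelihood_ratio_test((18, 20), (5, 20))
print(f"LRT statistic = {stat:.2f}, p = {p_value:.2g}")
```

A predictor with more levels, or a model with several predictors at once as in the figure, would need a regression fit and a chi-square distribution with more degrees of freedom.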

Extended Data Figure 2 Maximum likelihood tree and root-to-tip regression.

a, Maximum likelihood tree. Tips are coloured by sample source location. Labelled tips indicate genomes generated in this study; all other coloured tips are other publicly available genomes from the outbreak in the Americas. Grey tips are genomes from ZIKV cases in Southeast Asia and the Pacific. b, Linear regression of root-to-tip divergence on dates. The substitution rate for the full tree, indicated by the slope of the black regression line, is similar to rates of Asian lineage ZIKV estimated by molecular clock analyses¹². The substitution rate for sequences within the Americas outbreak only, indicated by the slope of the green regression line, is similar to rates estimated by BEAST (1.15 × 10⁻³; 95% CI (9.78 × 10⁻⁴, 1.33 × 10⁻³)) for this dataset.
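The regression in panel b can be illustrated with ordinary least squares: the slope is the substitution rate and the x-intercept is a rough estimate of the date of the root. This is a sketch under the assumption of a plain unweighted fit; the function name and the synthetic data points are hypothetical and are not values from the study.

```python
def root_to_tip_regression(dates, divergences):
    """Least-squares fit of divergence = rate * date + intercept.

    Returns (rate, intercept, root_date), where root_date is the x-intercept.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(divergences) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, divergences))
    rate = sxy / sxx                    # substitutions per site per year
    intercept = mean_y - rate * mean_x
    root_date = -intercept / rate       # date at which divergence extrapolates to zero
    return rate, intercept, root_date

# Synthetic, noise-free example: tips sampled at decimal-year dates.
dates = [2015.0, 2015.5, 2016.0, 2016.5]
divergences = [1e-3 * (d - 2013.5) for d in dates]
rate, intercept, root_date = root_to_tip_regression(dates, divergences)
print(rate, root_date)
```

Because tips share ancestry, the points are not independent, so a fit like this is usually treated as an exploratory check of clock-like signal rather than a formal rate estimate.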

Extended Data Figure 3 Substitution rate and tMRCA distributions.

a, Posterior density of the substitution rate. Shown with and without the use of sequences (outgroup) from outside the Americas. b–e, Posterior density of the date of the most recent common ancestor (MRCA) of sequences in four regions corresponding to those in Fig. 2c. Shown with and without the use of outgroup sequences. The use of outgroup sequences has little effect on estimates of these dates. f, Posterior density of the date of the MRCA of sequences in a clade consisting of samples from the Caribbean and continental United States. Shown with and without the sequence of DOM_2016_MA-WGS16-020-SER, a sample from the Dominican Republic that has only 3,037 unambiguous bases; this is the most ancestral sequence in the clade and its presence affects the tMRCA. In all panels, all densities are shown as observed with a relaxed clock model and with a strict clock model.

Extended Data Figure 4 Substitution rates estimated with BEAST.

Substitution rates estimated in three codon positions and non-coding regions (5′ and 3′ UTRs). Transversions are shown in grey and transitions are coloured by transition type. Plotted values show the mean of rates calculated at each sampled Markov chain Monte Carlo (MCMC) step of a BEAST run. These calculated rates provide additional evidence for the observed high C-to-T and T-to-C transition rates shown in Fig. 3d.

Extended Data Figure 5 cDNA concentration of amplicon primer pools predicts sequencing outcome.

cDNA concentration of amplicon pools (as measured by Agilent 2200 Tapestation) is highly predictive of amplicon sequencing outcome. On each axis, 1 + primer pool concentration is plotted on a log scale. Each point is a technical replicate of a sample and colours denote observed sequencing outcome of the replicate. If a replicate is predicted to be passing when at least one primer pool concentration is ≥0.8 ng μl⁻¹, then sensitivity is 98.71% and specificity is 90.34%. An accurate predictor of sequencing success early in the sample processing workflow can save resources.
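The prediction rule in this caption, and the way sensitivity and specificity follow from it, can be sketched as below. The 0.8 ng μl⁻¹ cutoff comes from the caption; the function names and the example replicates are hypothetical.

```python
THRESHOLD_NG_PER_UL = 0.8  # cutoff quoted in the caption

def predict_passing(pool_concentrations, threshold=THRESHOLD_NG_PER_UL):
    """Predict a replicate as passing if at least one primer pool meets the cutoff."""
    return any(c >= threshold for c in pool_concentrations)

def sensitivity_specificity(replicates):
    """replicates: iterable of (pool_concentrations, actually_passed) pairs."""
    tp = fp = tn = fn = 0
    for pools, passed in replicates:
        predicted = predict_passing(pools)
        if predicted and passed:
            tp += 1
        elif predicted and not passed:
            fp += 1
        elif not predicted and passed:
            fn += 1
        else:
            tn += 1
    # Sensitivity: fraction of truly passing replicates that were predicted to pass.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of truly failing replicates that were predicted to fail.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical replicates: (concentrations of the two primer pools, observed outcome).
example = [
    ((1.2, 0.3), True),
    ((0.9, 2.0), True),
    ((0.1, 0.2), True),
    ((0.0, 0.1), False),
    ((0.85, 0.0), False),
    ((0.2, 0.79), False),
]
print(sensitivity_specificity(example))
```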

Extended Data Figure 6 Evaluating multiple rounds of Zika virus hybrid capture.

Genome assembly statistics of samples before hybrid capture (grey), and after one (blue) or two (red) rounds of hybrid capture. Nine individual libraries (eight unique samples) were sequenced all three ways, had more than one million raw reads in each method, and generated at least one passing assembly. Raw reads from each method were downsampled to the same number of raw reads (8.5 million) before genomes were assembled. a, Per cent of the genome identified, as measured by number of unambiguous bases. b, Median sequencing depth of ZIKV genomes, taken over the assembled regions.
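Two steps in this comparison lend themselves to a short illustration: downsampling each method to the same number of raw reads, and taking the median depth over assembled positions only. The sketch below assumes uniform random sampling without replacement and a per-position flag marking assembled bases; the caption does not state how either step was implemented, and the function names are hypothetical.

```python
import random
import statistics

def downsample(reads, target, seed=0):
    """Uniformly sample `target` reads without replacement; keep all reads if fewer exist."""
    if len(reads) <= target:
        return list(reads)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(reads, target)

def median_depth_over_assembled(depths, assembled):
    """Median per-position depth, restricted to positions flagged as assembled."""
    covered = [d for d, ok in zip(depths, assembled) if ok]
    return statistics.median(covered) if covered else 0

# Toy example: 100 read identifiers downsampled to 10, and a five-position genome.
reads = [f"read{i}" for i in range(100)]
subset = downsample(reads, 10)
depth = median_depth_over_assembled([0, 5, 7, 100, 3], [False, True, True, True, False])
print(len(subset), depth)
```

Downsampling to a common read count keeps differences in sequencing effort from being mistaken for differences between methods.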

Extended Data Table 1 Viruses other than Zika uncovered by unbiased sequencing

Extended Data Table 2 Model selection for BEAST analyses

Extended Data Table 3 Within-sample variant validation between and within sequencing methods

Related audio

Reporter Kerri Smith speaks to the researchers who traced Zika’s spread across the Americas

Supplementary information

Supplementary Information

This file contains Supplementary Text, including links to publicly available data used in the analyses and listings of accession numbers of sequences used. (PDF 108 kb)

Supplementary Table 1

This table contains information on the 229 samples for which sequencing was attempted, including the 110 whose genomes were analyzed. It provides GenBank accessions, sequencing outcome, and metadata on the samples. (XLSX 57 kb)

Supplementary Table 2

This table lists observed nonsynonymous SNPs across the data used for SNP analysis. It includes frequency and count of ancestral and derived alleles at each position, as well as amino acid changes caused by each SNP. (XLSX 24 kb)

Supplementary Table 3

This table gives substitution rates across the 174 genomes analyzed (110 of which were sequenced as part of this study). It includes observed mutations per available base (used in Fig. 3d), as well as substitution rates estimated by BEAST (used in Extended Data Fig. 4). (XLSX 46 kb)

Supplementary Table 4

This table lists unique viral contigs assembled from 8 mosquito pools. It includes the best hit of each contig according to a BLASTN/BLASTX search and information scoring the hit. (XLSX 46 kb)

Supplementary Data

This zipped file contains sequences, alignments, BEAST input and output files, and root-to-tip data used in analyses. See README.txt for details. The Supplementary Information file list was corrected on 25 May 2017 to correct the order of the files and to include a missing file. (ZIP 35000 kb)

About this article

Cite this article

Metsky, H., Matranga, C., Wohl, S. et al. Zika virus evolution and spread in the Americas. Nature 546, 411–415 (2017). https://doi.org/10.1038/nature22402

Received: 01 March 2017
Accepted: 02 May 2017
Published: 24 May 2017
Issue Date: 15 June 2017
DOI: https://doi.org/10.1038/nature22402
