Introduction

Identifying the centre and date of emergence of plant viruses is important for planning control measures1,2,3,4. There have been such studies of plant viruses with single- and double-stranded DNA genomes, including begomoviruses and mastreviruses in the family Geminiviridae 5, 6 and cauliflower mosaic virus (CaMV) in the family Caulimoviridae 7. Similar analyses have been conducted for plant viruses with RNA genomes, such as cucumber mosaic virus (CMV) in the family Bromoviridae 8, and turnip mosaic virus (TuMV)9,10,11 and potato virus Y12 in the family Potyviridae. Genetic data for such studies are scarce for most viruses, except orthomyxoviruses and lentiviruses, and most studies have used partial genome sequences or a single gene. In contrast, studies of BK13, influenza14 and Ebola15 viruses have been carried out using around 150 whole-genome sequences. For plant viruses, the largest study using whole-genome sequences has probably been that of 353 isolates of maize streak virus strain A (genome length ~2700 nucleotides) in the family Geminiviridae 16.

One of the largest genera of plant RNA viruses is Potyvirus. It contains 90% of the species of the family Potyviridae 17. Potyviruses infect a wide range of monocotyledonous and dicotyledonous plant species4. They are spread by aphids in a non-persistent manner, and also in seed and infected living plant materials. They have flexuous filamentous particles 700–750 nm long, each of which contains a single copy of the genome. The genome is a single-stranded, positive-sense RNA molecule of approximately 10,000 nucleotides (nt). It has one major open reading frame (ORF), which is translated into one large polyprotein, and a small overlapping ORF18. The polyprotein is autocatalytically hydrolysed into at least ten proteins4, 17.

In the potyvirus phylogenetic tree, TuMV clusters with narcissus, scallion and yam viruses to form the TuMV phylogenetic group4. TuMV, from the genus Potyvirus, is one of the best-studied plant-infecting RNA viruses in terms of its evolution. TuMV damages most domestic brassica crops in modern agriculture. These plants were developed from wild brassica plants by plant breeders during the expansion of agriculture. Previous studies have shown that this virus originated from wild orchids in Europe10 and then spread among species of wild and domestic Brassicaceae plants, from the Mediterranean region, including South-east Europe, the Middle East and Central Asia9, 19, 20, to other parts of the world including East Asia21,22,23, Oceania11 and the Americas. There are two reports of TuMV from Middle Eastern countries24, 25. However, these studies reported only two whole-genome sequences, leaving considerable uncertainty about the population structure and diversity of the virus in the region.

In this study, we collected 179 TuMV isolates in Turkey, Greece and Iran over two decades, mostly from brassica hosts, and determined their genome sequences. This region is thought to be the centre of emergence and spread of this virus, and the region in which it adapted to agricultural crops. We estimated the evolutionary rate and timescale of this virus using synonymous sites8 and inferred its phylodynamic history using a combined data set of 417 novel and published genome sequences. These analyses reveal the present geographical structure of TuMV populations in and around the centre of TuMV emergence. Our study possibly represents the largest evolutionary study of an RNA plant virus, set in the context of the agricultural development of its hosts.

Results

Sample collection, virus isolation and pathogenicity

A total of 179 TuMV isolates were collected from agricultural crops and wild plants: 43 in Greece, 77 in Iran and 59 in Turkey (Fig. 1 and Supplementary Table S1). All of the Greek, Iranian and Turkish isolates infected Brassica juncea cv. Hakarashina and Brassica rapa cv. Hakatasuwari plants. However, few infected Brassica oleracea var. capitata cvs. Ryozan 2-go and Shinsei. Many did not infect Japanese radish (Raphanus sativus cvs. Akimasari and Taibyo-soubutori), but infected Chinese radish (R. sativus cv. Everest). Many Greek isolates did not infect radish; most of these isolates were of the [B] host-infecting type. Only a few infected radishes, perhaps because few radishes are grown in Greece (N. Katis, personal observation). In fact, we were not able to find diseased radish in Greece, and 32 out of 43 (74%) Greek isolates were [B] host-infecting type, eight were [B(R)] and none was [BR]. It was noticeable that in Turkey, which is a neighbour of Greece, we were able to find both radish crops and wild radish. Only 26 out of 59 (44%) Turkish isolates were [B] host-infecting type, 27 were [B(R)] and five were [BR]. The Iranian population was similar: 24 out of 77 (31%) were [B] host-infecting type, 39 were [B(R)] and 13 were [BR].
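The proportions quoted above can be recomputed from the stated counts. The short Python sketch below does only that arithmetic; the dictionary layout and the function name are illustrative choices, not part of the study's workflow.

```python
# Counts of host-infecting types as stated in the text. Isolates whose type
# is not stated make up the remainder of each national total.
counts = {
    "Greece": {"total": 43, "B": 32, "B(R)": 8, "BR": 0},
    "Turkey": {"total": 59, "B": 26, "B(R)": 27, "BR": 5},
    "Iran": {"total": 77, "B": 24, "B(R)": 39, "BR": 13},
}


def type_percentages(entry):
    """Return the rounded percentage of each host-infecting type."""
    total = entry["total"]
    return {k: round(100 * v / total) for k, v in entry.items() if k != "total"}


percentages = {country: type_percentages(entry) for country, entry in counts.items()}
for country, pct in percentages.items():
    print(country, pct)
```

The [B] percentages returned for Greece, Turkey and Iran match the 74%, 44% and 31% given in the text.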

Figure 1

Map showing the provenance of the turnip mosaic virus isolates from Greece, Turkey and Iran. Dots on the map correspond to the isolates listed in Supplementary Table S1 (http://www.freemap.jp/about_use_map.html).

Molecular characteristics and recombination analyses

We analysed the 179 sequences reported here, along with 238 whole-genome TuMV sequences obtained from online sequence databases. The publicly available data included three sequences from Greek isolates and two each from Iran and Turkey19, 24, 25. The 179 newly sequenced genomes had lengths of 9792–9798 nt (excluding 5′-end 35 nt primer sequences). The regions encoding the protein 1 (P1), helper-component proteinase protein (HC-Pro), protein 3 (P3), pretty interesting Potyviridae ORF (PIPO), 6 kDa 1 protein, cylindrical inclusion protein, 6 kDa 2 protein, genome-linked viral protein (VPg), nuclear inclusion a-proteinase protein (NIa-Pro), nuclear inclusion b protein (NIb) and coat protein (CP) had respective lengths of 1086, 1374, 1065, 177, 156, 1932–1935, 159, 573–576, 729, 1551 and 864–867 nt. The 3′ non-coding regions (NCRs) were 207–209 nt in length. All of the motifs reported for different potyvirus-encoded proteins were found.

Twenty-one unequivocal recombination sites were found in the genomes of 186 Greek, Iranian and Turkish isolates (Fig. 2 and Supplementary Table S2). Only one recombination type pattern, seen in a GRC 27 isolate genome with world-B3 × Asian-BR parents, had been found in an earlier study19; the other 40 recombination type patterns found among the sequences from these three countries were novel. The commonest recombination patterns in Greek genomes were intralineage recombinants of basal-B or world-B parents. In the Iranian population, most recombinants were intralineage and had Iranian subgroup parents, but the interlineage recombinants of world-B and Asian-BR parents were also widely distributed. In Turkey, most recombinants were intralineage recombinants and had basal-B or world-B parents, as in the Greek population; however, the recombination patterns differed between the two countries.

Figure 2

Recombination map of turnip mosaic virus genomes of the isolates from Greece, Iran and Turkey. The estimated nucleotide positions of the recombination sites and those in parentheses are shown relative to the 5′ end of the genome using the numbering of the aligned sequences used in the present study and the UK 1 isolate (Jenner et al.57). Vertical arrows and lines show estimated recombination sites (listed in Supplementary Table S2). The clear (bold line) and tentative (thin line) recombination sites identified in the present study are listed separately.

Phylogenetic analyses

A phylogenetic network was inferred using Neighbor-Net26 from the concatenated 5′ NCR, polyprotein and 3′ NCR sequences (Supplementary Fig. S1). The isolates from Greece fell into the ‘basal-B group and recombinants’ and ‘world-B group and recombinants’ clusters. The isolates from Turkey fell into ‘basal-B group and recombinants’, ‘Asian-BR group and recombinants’ and ‘world-B group and recombinant’ clusters. The isolates from Iran fell into several clusters, not only the ‘Iranian group and recombinant’ cluster, but also ‘basal-B group and recombinants’ and ‘Asian-BR group and recombinants’ clusters. Thus, isolates from all three countries fell into the ‘basal-B group and recombinants’ cluster, where they clustered with Italian isolates. None of the isolates from Greece, Iran or Turkey clustered with the ‘Orchis group’.

We inferred maximum-likelihood phylogenetic trees from the polyprotein-encoding (major ORF) sequences of the non-recombinants (Fig. 3) and from three regions that contained no recombination cross-over points in any sequence: HC-Pro* (nt 1460–2494, numbers corresponding to the positions in original UK 1 genome; partial HC-Pro), P3* (nt 2591–3463; partial P3) and NIb* (nt 7208–8068; partial NIb) (see Yasaka et al.11). Trees for these three regions were estimated using 420, 410 and 423 non-recombinant sequences, respectively (data not shown). These partitioned most of the sequences into the same five major genetic groups that were reported previously10, 11 (the Orchis, basal-B, basal-BR, Asian-BR and world-B groups) and a new Iranian group. The basal-B group further split into basal-B1 and B2 subgroups and the world-B group split into the world-B1, B2 and B3 subgroups, as found in an earlier study11. In the present study, many non-recombinant sequences in the TuMV population were found for the first time.

Figure 3

Maximum-likelihood tree inferred from the major open reading frame sequences of turnip mosaic virus. Only non-recombinant sequences were used. Numbers at each node indicate bootstrap percentages based on 1000 pseudoreplicates. The scale bar indicates 0.1 substitutions per site. The genomic sequences of the isolates of narcissus late season yellows virus (NLSYV), narcissus yellow stripe virus (NYSV), Japanese yam mosaic virus (JYMV), and scallion mosaic virus (ScaMV) were used as outgroup taxa. Details of the isolates are given in Supplementary Table S1.

Time of TuMV emergence

We found that there was little saturation across the TuMV protein-coding sequences in our data sets, based on our analyses of the aligned ORF sequences using the Iss statistic in DAMBE27. The estimates of Iss were significantly lower than Iss.c for all data sets, and were 4–5 times lower for major ORF, HC-Pro*, P3*, and NIb* sequences.

Using a Bayesian phylogenetic approach, we estimated the evolutionary rates and timescales for the complete major ORF, HC-Pro*, P3* and NIb* regions. We found 106 non-recombinants in this study, so we used these to provide ORF sequences for analysis. The HC-Pro*, P3* and NIb* regions were shorter than major ORF sequences, but many more sequences were available (329, 369 and 351, respectively).

Based on a comparison of marginal likelihoods, the constant-size demographic model was the best supported for all four proteins (Table 1). An uncorrelated exponential relaxed-clock model28 provided a better fit than the strict-clock model, indicating the presence of rate variation among lineages. All data sets passed date-randomization tests for temporal structure8,9,10, 29, 30.
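As a rough illustration of how such a comparison works, the log Bayes factor between two models is the difference of their log marginal likelihoods. The Python sketch below uses invented log marginal likelihoods (placeholders, not values from Table 1) and hypothetical variable names.

```python
# Hypothetical log marginal likelihoods; placeholders, NOT values from this study.
log_marginal_likelihood = {
    "constant size": -52310.4,
    "exponential growth": -52315.9,
    "Bayesian skyline": -52313.1,
}

# The best-supported model has the highest log marginal likelihood.
best_model = max(log_marginal_likelihood, key=log_marginal_likelihood.get)

# ln(Bayes factor) of the best model against each alternative model.
log_bayes_factors = {
    model: log_marginal_likelihood[best_model] - value
    for model, value in log_marginal_likelihood.items()
    if model != best_model
}

print(best_model)
for model, log_bf in log_bayes_factors.items():
    print(f"ln BF (best vs {model}): {log_bf:.1f}")
```

A positive ln Bayes factor favours the first model of the pair; how large a value counts as decisive depends on the interpretive scale adopted.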

Table 1 Estimates of nucleotide substitution rate and time to the most recent common ancestor for turnip mosaic virus.

The time to the most recent common ancestor (TMRCA) for each of the four protein-coding regions (major ORF, HC-Pro*, P3* and NIb*) was estimated using all sites and found to be 998 years on average (Table 1a). Mean estimates from individual protein-coding regions ranged from 1201 [95% credibility interval (CI) 468–2150] years for the major ORF to 758 (95% CI 274–1548) years for P3*. The estimates had overlapping 95% CIs for rates and TMRCAs. However, the 1.6-fold range of estimated mean TMRCAs compromised their ability to distinguish among historical events that might have influenced TuMV evolution.
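The fold range and the interval overlap mentioned above follow directly from the two extreme estimates quoted in the text. The Python sketch below reproduces that arithmetic; the helper name `intervals_overlap` is an illustrative choice.

```python
# Mean TMRCA (years) and 95% credibility interval bounds quoted in the text.
estimates = {
    "major ORF": (1201, 468, 2150),
    "P3*": (758, 274, 1548),
}


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


means = [mean for mean, _, _ in estimates.values()]
fold_range = max(means) / min(means)

orf, p3 = estimates["major ORF"], estimates["P3*"]
overlap = intervals_overlap(orf[1:], p3[1:])

print(f"fold range of means: {fold_range:.1f}")
print(f"95% CIs overlap: {overlap}")
```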

We also checked whether using only the synonymous sites in the sequences decreased the variability of the results (Table 1b). The effect of limiting the analysis to the synonymous sites is to minimize the influence of purifying selection, which can otherwise lead to an underestimation of TMRCAs when sampling dates are used for calibration31, 32. The synonymous sites from these four proteins passed the date-randomization test. Bayesian maximum-clade-credibility (MCC) chronograms of major ORF, HC-Pro*, P3* and NIb* were inferred from synonymous sites (Fig. 4 and Supplementary Fig. S2). The TMRCA of the major ORF region was 1570 (95% CI 521–3430) years, and those of three shorter protein-coding regions were 1059–1178 years (95% CI 549–1867 years). The ranges of the estimates were similar to those from the whole sequences (Table 1).

Figure 4

Bayesian maximum-clade-credibility chronogram inferred from the polyprotein-coding region of turnip mosaic virus genomes. The tree was estimated from the major open reading frame (ORF) sequences of 106 non-recombinant isolates. Details of the region are given in the Methods. Horizontal blue bars represent the 95% credibility intervals of estimates of node ages. The bar graph shows the root state posterior probabilities for each location. Grey bars show the probabilities obtained with 10 randomizations of the tip locations. Time is given in years before present (2012).

Geographical spread of TuMV

The likely routes of TuMV dissemination within and into Turkey, Greece and Iran were assessed using a Bayesian phylogeographical analysis33, based on the non-recombinant sequences of the major ORF, HC-Pro*, P3* and NIb* regions (Figs 4 and 5 and Supplementary Figs S2 and S3; Supplementary Table S3). The major ORF data set comprised sequences from non-recombinant isolates only, whereas the HC-Pro*, P3* and NIb* data sets comprised regions free of recombination cross-over points from both non-recombinant and recombinant isolates. The partial-genome data sets contained at least three times as many sequences as the ORF data set, but the optimal trade-off between sequence length and number remains unclear. Additionally, the major ORF data set yields evidence of the dissemination of non-recombinants, whereas the HC-Pro*, P3* and NIb* data sets provide evidence about the dissemination of both non-recombinants and partial-genome sequences in recombinants.

Figure 5

Plausible historical dissemination pathways of turnip mosaic virus inferred using the major open reading frame (ORF) and partial protein 3 (P3*) sequences using non-recombinant sequences. Details of the regions of (a) major ORF and (b) P3* are given in the Methods. Dissemination routes are only shown for the Middle East, and only when supported by a Bayes factor >10. The dissemination pathways for basal-B1 + 2, Iranian 1 + 2, basal-BR, Asian-BR and world-B1 + 2 + 3 group (subgroup) isolates are shown (https://www.mapbox.com/about/maps/).

We investigated the routes of spread for each TuMV phylogenetic group or subgroup. In the partial-genome data sets, TuMV seems to have circulated not only within each country (Greece, Iran and Turkey), but also between Turkey and Greece, between Turkey and Iran, and between Turkey and Italy. The last of these is also supported by the major ORF data set. Spread from Germany to Turkey was found in the major ORF data, but not in HC-Pro*, P3* or NIb*; the isolates from orchids and Allium sp. plants were found in Germany and spread earlier than elsewhere (298 years ago, 95% CI 121–429). By contrast, the world-B group isolates spread in all three countries and the UK was also involved. The results were confirmed in the maximum-likelihood and Bayesian trees of the major ORF, HC-Pro*, P3* and NIb* regions (Figs 3 and 4 and Supplementary Fig. S2). In addition, most of these disseminations were supported by the results of the MigraPhyla program34 using the same data set of major ORFs including only non-recombinants (Fig. 6). Spread was seen between Turkey and Greece, between Turkey and Iran, and between Europe and these countries. The results of all analyses supported the conclusion that TuMV had entered the Middle East from the west and had progressively spread eastwards.

Figure 6

Predicted dissemination events between Middle Eastern countries and other countries using the major open reading frame sequences of turnip mosaic virus. Lines indicate dissemination events between and within pairs of countries and cities, with colors indicating the source state. The colors of the inner and outer circles show the source and sink cities. Only the dissemination pathways for (a) basal-B1 + 2, (b) Iranian 1 + 2, (c) basal-BR and (d) world-B1 + 2 + 3 group (subgroup) isolates are shown. Narrow links indicate dissemination events that are not statistically significant. Bold links indicate dissemination events with P < 0.05.

Timescale of TuMV groups and recombination

In the analyses of both all sites and synonymous sites in the major ORF coding region, the basal-B group seemed to be the oldest lineage of TuMV (excluding the Orchis group). For instance, the TMRCA of the basal-B group by synonymous site analysis in the ORF coding regions was 389 (95% CI 200–676) years. The basal-B group was placed as the sister lineage to other brassica-infecting phylogenetic groups10, as seen in the maximum-likelihood and Bayesian ORF trees (Figs 3 and 4). We also estimated the diversity and extent of negative selection on the sequences in each phylogenetic group. The basal-B group had the greatest diversity, although the strength of selection was similar across all phylogenetic groups.

We estimated the ages of recombination events using the method described by Visser et al.12 and Yasaka et al.11. Recombinant sequences were split into their separate regions and realigned using gaps. For example, a recombinant with two ‘parents’ was split into two regions and the empty sites were filled with gaps. In this way, a recombinant sequence becomes two non-recombinant sequences, each with missing data. The oldest recombination event that was detected occurred 188 (95% CI 153–222) years ago in Turkey and produced an intralineage recombinant of basal-B2 parents (Table 2). The six oldest recombination events were all intralineage recombinants of basal-B2 subgroup parents in Greece and Turkey. The three oldest recombination cross-over points were located at nt 6222 in VPg, nt 7120 in NIa-Pro and nt 8963 in CP coding regions.
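The splitting step described above can be sketched in a few lines of Python. This illustrates the idea only and is not the code used in the cited method; the function name, the 0-based breakpoint convention and the toy sequence are assumptions.

```python
def split_recombinant(seq, breakpoints, gap="-"):
    """Split an aligned recombinant sequence at the given cross-over points.

    breakpoints are 0-based positions at which a new parental region starts
    (a hypothetical convention for this sketch). Returns one full-length
    sequence per region, with every other site replaced by a gap character.
    """
    edges = [0] + sorted(breakpoints) + [len(seq)]
    pieces = []
    for start, end in zip(edges[:-1], edges[1:]):
        padded = gap * start + seq[start:end] + gap * (len(seq) - end)
        pieces.append(padded)
    return pieces


# Toy 12-nt "genome" with a single cross-over point at position 5.
parts = split_recombinant("ACGTACGTACGT", [5])
for part in parts:
    print(part)
```

Each returned sequence keeps the original alignment length, so the pieces can be added to the alignment as separate taxa with missing data.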

Table 2 Estimates of the timing of recombination events of turnip mosaic virus in Greece, Iran and Turkey.

Discussion

We have reported the most detailed evolutionary study of isolates from the centre of emergence of a plant virus, based on a global sample of more than 400 whole-genome sequences of TuMV (approximately 10,000 nt). We previously reported genetic analyses of the TuMV populations in Europe, East Asia and Oceania. However, the centre of emergence is thought to be in the populations of the Middle East, which remain largely uncharacterized. Our earlier studies10, 19 reported that approximately 75% of isolates from TuMV populations are recombinants. Therefore, to resolve the evolutionary history of this virus, we must analyse non-recombinants, especially from its centre of emergence. The many non-recombinants that we have identified in this study have allowed us to resolve the possible dissemination routes of this virus, along with the genetic changes that occurred as it adapted to new hosts and moved to other parts of the world.

In this study, we found a sixth phylogenetic group of TuMVs, the Iranian group. This adds to the previously described Orchis, basal-B, basal-BR, Asian-BR and world-B groups. The Orchis group consists of isolates from Europe (Germany) and is probably the original lineage of TuMV. The basal-B group is probably the sister lineage to the remaining groups and splits into two subgroups: the basal-B1 subgroup, which consists of isolates from Europe (Italy and Greece); and the basal-B2 subgroup, which consists of Middle East isolates (Turkey and Iran). Although each subgroup is geographically restricted, the basal-B1 subgroup seems to be the oldest modern subgroup, because many basal-B2 group isolates were recombinants whereas basal-B1 isolates were not (Fig. 2). The isolate ASP from Allium sp. is resolved as the sister lineage to all basal-B isolates. Further sampling of TuMV lineages, particularly from monocotyledonous plants, is needed to determine the history of adaptation that led to the divergence into the Orchis lineage and the brassica-infecting lineage.

We were unable to find TuMV in wild orchids in Greece, Turkey and Iran in the present study. However, ORF trees inferred using non-recombinant sequences still indicated that TuMV infecting brassicas might have originated from ancestral populations in wild orchids, Orchis militaris, O. morio and O. simia 10. Although the wild orchids were collected in Northern Germany, it is unclear whether the wild orchids were infected with TuMV-OMs10 in Germany or in other European countries. This is because various species of wild orchids are widely distributed in European countries and, as they are bulbs, they are often transported by plant collectors, and both the Orchis-infecting and Allium-infecting isolates came from an orchid collection in Gatersleben, Germany (Supplementary Table S1). This country is probably the source of TuMV basal-B group and might be the site of origin of TuMV. The virus might then have spread to Italy and Greece, and infected wild brassicas, and from there to Turkey and Iran. Denser sampling of the TuMV lineages in these groups will shed further light on these questions.

At or after the emergence of basal-B, TuMV spread to Iran and split into two subgroups. Because TuMV has not yet been collected from the neighbouring countries of Afghanistan, Iraq and Pakistan, we cannot yet tell whether the Iranian groups are unique to Iran.

The BEL 1 isolate collected from Rorippa nasturtium-aquaticum (watercress) was placed as the sister lineage to all other world-B isolates. Rorippa is perennial and thought to have originated in Europe and Central Asia. However, the trees do not tell us the origin of world-B group, hence more isolates of the group need to be collected to answer this question.

Non-recombinant isolates from the Asian-BR group that infect R. sativus (radish) were previously found in China19. In this study, however, some Asian-BR non-recombinants were found in Turkey. Hence, the Asian-BR group might have originated in Turkey (Fig. 3), which is also considered to be one of the origins of wild radish (R. raphanistrum). In fact, we saw many wild radish plants in the fields along the shore of the Aegean Sea during our collecting trips (K. Ohshima and S. Korkmaz, personal observation). However, our Bayesian phylogeographic analyses only found that the Asian-BR subgroup spread in Iran and from Turkey to Iran (Supplementary Table S3), and thence to southern Asia, where radish is one of the major crops and important for Asian cuisine.

Another group of isolates that infect radish belong to the basal-BR group. This modern group of isolates possibly originated in Italy, given the phylogenetic distribution of Italian lineages within the group. Other isolates have been found in Germany, Iran and Japan. No non-recombinants of the basal-BR group have been found in China, so we are unsure whether the dissemination route of this group to East Asia is the same as that of the Asian-BR group; more samples of TuMV from Central Asian countries are needed to answer this question. The basal-BR and Asian-BR populations might have spread to the east in plant material carried along the Northern or Southern Silk roads, an ancient network of trade routes between the Mediterranean and East Asia. However, our analyses indicate that different TuMV populations seem to have spread individually to different parts of the world.

Our estimation of the evolutionary and phylogeographic timescale was based on complete sequences as well as the synonymous sites. This approach was also previously used for estimating TMRCAs for CMV8. There are small differences between our two estimates of the evolutionary timescale. The mean TMRCA estimates from the synonymous sites were less variable than those from the complete sequences (Table 1). The longer sequences of the major ORF might give us a more reliable estimate of the TMRCA. However, the shorter sequences of HC-Pro*, P3* and NIb* yielded consistent estimates of the TMRCAs, and these three regions had three times as many isolates as the major ORF.

If the ancestors of the present TuMV populations depended on agricultural practices for their maintenance and spread, such as the collection and transport of TuMV-infected seed, then the estimated TMRCAs set limits on when brassicas were adopted as agricultural crops. The emergence of the brassica-infecting group corresponds well with the periods of territorial expansion of the Ottoman Empire in Greece, Turkey and Iran, and the spread of agriculture to the world.

Methods

Virus isolates and host tests

The brassica crop-producing areas of Greece, Iran and Turkey were surveyed during the growing seasons of 1993–2012. All of the collected plant leaves were tested by double-antibody sandwich enzyme-linked immunosorbent assay (DAS-ELISA)35 using the antiserum to TuMV9. The virus isolates were found in fields as well as in home gardens. In Turkey, wild Raphanus plants are common, and diseased plants were relatively easy to find. Thus, 27 Turkish isolates were collected from wild plants (R. raphanistrum) and crops (R. sativus), 25 isolates from brassicas, and six from other species of Brassicaceae. The Greek samples included 27 from Brassica spp., 15 from other Brassicaceae plants and four from Allium spp. In Iran, we were able to collect many brassica plants throughout the country, but not from the border regions because of the armed conflict occurring there. Details of the TuMV isolates, their place of origin, original host plant, year of collection, host-infecting type, accession numbers, and references are shown in Supplementary Table S1.

All of the isolates were sap-inoculated to Chenopodium quinoa plants using 0.01 M potassium phosphate buffer (PPB) (pH 7.0) and serially cloned through single lesions at least three times using chlorotic local lesions that appeared approximately 10 days after inoculation. The biological cloning step is important because TuMV isolates were often co-infected with CMV and/or CaMV, and some plants contained a mixture of different TuMV isolates. Hence, there is a possibility that artefactual recombination events would be detected in the sequence data from uncloned isolates. Biologically purified TuMV isolates were propagated in Nicotiana benthamiana and B. rapa cv. Hakatasuwari (turnip) plants. Plants infected systemically with each of the TuMV isolates were homogenized in 0.01 M PPB (pH 7.0), and the isolates were mechanically inoculated to young brassica plants, as described by Nguyen et al.10. Inoculated plants were kept at 25 °C for at least four weeks in a glasshouse at Saga University.

Sequencing and alignment

We determined the full genomic sequences of 179 TuMV isolates collected in Greece, Iran and Turkey. The viral RNAs were extracted from TuMV-infected N. benthamiana or turnip leaves using Isogen (Nippon Gene, Japan). The RNAs were reverse transcribed by PrimeScript Moloney murine leukemia virus reverse transcriptase (Takara Bio, Japan) and amplified using high-fidelity Platinum™ Pfx DNA polymerase (Invitrogen, USA). The products obtained by reverse transcription and polymerase chain reaction (RT-PCR) were separated by electrophoresis in agarose gels and purified using the QIAquick Gel Extraction Kit (Qiagen K. K., Japan).

Sequences from each isolate were determined using three or four overlapping independent RT-PCR products to cover the complete genome. To ensure that they were from the same genome and were not from different components of a genome mixture, the sequences of the RT-PCR products of adjacent regions of the genome overlapped by 200–350 nt. Each RT-PCR product was sequenced by primer walking in both directions using a BigDye Terminator v3.1 Cycle Sequencing Ready Reaction kit (Life Technologies, USA) and Applied Biosystems 310 and 3130 Genetic Analyzers. Sequence data were assembled using BioEdit v5.0.9 36.

We assembled a data set of 417 genome sequences (Supplementary Table S1), comprising the 179 sequences determined in this study and 238 published sequences from online databases (collected in September 2015). The genomic sequences of the isolates of narcissus late season yellows virus (NLSYV; accession numbers JQ326210, JX156421 and NC_023628), narcissus yellow stripe virus (NYSV; JQ395042, JQ911732 and NC_011541), Japanese yam mosaic virus (JYMV; AB016500, KJ789140 and NC_000947) and scallion mosaic virus (ScaMV; NC_003399) were used as outgroup taxa because those viruses are members of the TuMV phylogenetic group.

The nucleotide sequences of the polyprotein-encoding regions were aligned using TRANSALIGN (kindly supplied by Georg Weiller), guided by an alignment of their encoded amino acid sequences made using CLUSTAL_X2 37. The aligned nucleotides were then reassembled to form whole-genome sequences by adding the aligned 5′ and 3′ NCRs. This produced sequences of 9051 nt that excluded the 35 nucleotides that were used as primers for RT-PCR amplification.

Recombination analyses

Putative recombination breakpoints in all sequences were identified using RDP38, GENECONV39, BOOTSCAN40, MAXCHI41, CHIMAERA42 and SISCAN43 programs, implemented in the RDP4 package44, and also the original SISCAN v2 43 program. Each of the identified sites was examined individually, and a phylogenetic approach was used to verify the parent/donor assignments made using the RDP4 package44. These analyses were done using default settings for the different detection programs and a Bonferroni-corrected P-value cut-off of 0.01.
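For readers unfamiliar with the correction, a Bonferroni adjustment multiplies each raw P-value by the number of tests performed (capped at 1) before comparing it with the cut-off. The Python sketch below is a generic illustration with made-up P-values. It assumes that the number of tests equals the number of P-values supplied, which is a simplification, and it does not describe the internal procedure of the RDP4 package.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Flag raw P-values that stay below alpha after Bonferroni correction.

    Assumption for this sketch: the number of tests is len(p_values).
    """
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) < alpha for p in p_values]


# Hypothetical raw P-values for four candidate recombination signals.
raw_p_values = [1e-9, 4e-4, 0.02, 0.5]
flags = bonferroni_significant(raw_p_values)
print(flags)
```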

We tested for recombination in our data set of 417 genome sequences. Having examined all sites with an associated P-value of < 10⁻⁶ (i.e., the most likely recombination sites), we retained the intralineage recombinants (parents from the same major group lineage) and removed the interlineage recombinants (i.e., those with parents from different major lineages). The identified recombination sites were treated as missing data in subsequent analyses. All isolates that had been identified as likely recombinants by the programs in RDP4, supported by three different methods with an associated P-value of < 10⁻⁶, were rechecked using the original SISCAN program. We checked 50 nt slices of all sequences for evidence of recombination using these programs. These analyses also determined which non-recombinant sequences had regions that were closest to those of the recombinant sequences and hence indicated the lineages that were likely to have provided those regions of the recombinant genomes. For convenience, we called these the ‘parental isolates’ of the recombinants. Finally, TuMV sequences were also aligned without outgroup sequences, producing sequences of 9693 nt for full genome RNA. We checked these for evidence of recombination using the programs described above.

Estimation of substitution rates and divergence times

The phylogenetic relationships of the aligned full and partial genomic sequences were inferred using the Neighbor-Net method in SPLITSTREE v4.11.3 26 and maximum likelihood in PhyML v3 45. For the ML analysis, we used the general time-reversible (GTR) model of nucleotide substitution with rate variation among sites modeled using a gamma distribution and a proportion of invariable sites (GTR + I + G). This model was selected using jModelTest2 45, 46. Branch support was evaluated by bootstrap analysis based on 1000 pseudoreplicates. The inferred trees were displayed using TreeView47.

The degree of mutational saturation in the aligned ORF sequences was evaluated using the Iss statistic in DAMBE27. BEAST v1.8.248 was used to estimate the evolutionary rate and timescale of TuMV populations. Analyses were first based on sequences of the complete major ORF of the genomes (nt 131–9622, corresponding to the positions in the original TuMV-UK 1 isolate genome). Recombinant sequences were discarded from the ORF dataset (see Supplementary Table S2). The sampling times of the sequences were used to calibrate the molecular clock.

Bayes factors were used to select the best-fitting clock model and coalescent tree prior for each data set. We compared strict and relaxed (uncorrelated exponential and uncorrelated lognormal) clock models28, as well as five demographic models (constant population size, expansion growth, exponential growth, logistic growth and the Bayesian skyline plot). Posterior distributions of parameters, including the tree, were estimated by Markov chain Monte Carlo (MCMC) sampling. Samples were drawn every 10⁴ MCMC steps over a total of 10⁸ steps, with the first 10% of samples discarded as burn-in. Sufficient sampling from the posterior and convergence to the stationary distribution were checked using the diagnostic software Tracer v1.6 (http://tree.bio.ed.ac.uk/software/tracer/). Bayesian maximum-clade-credibility trees were generated with software included in the BEAST package.
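The sampling scheme above implies a fixed number of posterior samples, and model comparison by Bayes factors reduces to a difference of log marginal likelihoods. The short sketch below works through both calculations. The function names and the log marginal likelihood values are hypothetical, and the sample count ignores whether the initial state is also logged.

```python
def retained_samples(total_steps, sample_every, burnin_fraction):
    """Return (samples drawn, samples discarded as burn-in, samples kept)."""
    n_samples = total_steps // sample_every
    n_burnin = round(n_samples * burnin_fraction)
    return n_samples, n_burnin, n_samples - n_burnin


def log_bayes_factor(log_ml_model1, log_ml_model2):
    """Log Bayes factor of model 1 over model 2; positive values favour model 1."""
    return log_ml_model1 - log_ml_model2


# Chain length, sampling interval and burn-in as stated in the text
drawn, discarded, kept = retained_samples(10**8, 10**4, 0.10)

# Hypothetical log marginal likelihoods for two competing clock models
lbf = log_bayes_factor(-100.0, -105.0)
```

Under these settings, 10,000 samples are drawn per chain, of which 1,000 are discarded and 9,000 are kept.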

For reliable rate estimates from time-structured sequence data, the range of sampling times needs to be wide enough to allow an appreciable amount of genetic change to occur49, 50. We checked the temporal signal in our data sets by comparing our rate estimates with those from ten date-randomized replicates. We used two different criteria to test for temporal structure, as described previously29, 30. According to the standard criterion, 95% CIs of date-randomized replicates should not overlap with the mean estimate from the original data set. A more conservative criterion, proposed by Duchêne et al.30, checks for overlap between the 95% CIs of the estimates from the date-randomized replicates and the original data set.
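The two criteria can be expressed as simple interval checks. The sketch below assumes that each date-randomized replicate is summarized by the lower and upper bounds of its 95% CI. The function names and the numerical rates are hypothetical and are not results from this study.

```python
def passes_standard(original_mean, randomized_cis):
    """Standard criterion: the mean estimate from the original data lies
    outside the 95% CI of every date-randomized replicate."""
    return all(not (lo <= original_mean <= hi) for lo, hi in randomized_cis)


def passes_conservative(original_ci, randomized_cis):
    """Conservative criterion: the 95% CI from the original data does not
    overlap the 95% CI of any date-randomized replicate."""
    o_lo, o_hi = original_ci
    return all(hi < o_lo or lo > o_hi for lo, hi in randomized_cis)


# Hypothetical rate estimates (substitutions/site/year)
original_mean = 1.0e-3
original_ci = (0.8e-3, 1.2e-3)
randomized = [(1.0e-5, 7.0e-4), (2.0e-5, 9.0e-4)]

standard_ok = passes_standard(original_mean, randomized)
conservative_ok = passes_conservative(original_ci, randomized)
```

In this hypothetical case the data pass the standard criterion but fail the conservative one, because the second replicate's interval overlaps the interval from the original data.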

BEAST analyses were also done using the synonymous sites of TuMV polyprotein-encoding sequences. A simple pairwise sliding-window method, DnDscan51, was used to identify codons in the alignments that had not evolved or had evolved non-synonymously. These codons were removed using SEQSPLIT v1.0 (written and provided by the late John Armstrong, http://192.55.98.146/_resources/e-texts/blobs/SeqSplit.ZIP). After silent sites were chosen from each protein region, those sequences were concatenated to produce 6078 nt sequences. The resulting sequences of the synonymous sites (300 to 6078 nt) of the major ORF, HC-Pro*, P3* and NIb* regions were 64%, 57%, 34% and 61%, respectively, of the length of each complete protein-coding sequence (Table 1b). Ratios of non-synonymous (dN) to synonymous (dS) substitutions (dN/dS) were calculated using MEGA752.

The spatial population dynamics of TuMV through time were inferred in BEAST using a diffusion model with discrete location states33. This approach uses a model that describes the spatial spread of TuMV lineages throughout their demographic history. The most important diffusions between pairs of locations can be identified using Bayes factors53. We produced a graphical animation of the estimated spatio-temporal movements of TuMV lineages using SPREAD v.1.0.654 and Google Earth (http://www.google.com/earth).

The program MigraPhyla34 was used to infer the dissemination pathways of the virus. To estimate the reliability of the predicted dissemination events, a Monte Carlo simulation of 10 000 trials was performed by randomizing the character states of the leaf nodes while retaining the tree topologies. The sparse false discovery rate (sFDR) correction was used to account for multiple comparisons. Only the dissemination events with P < 0.05 and greater than the sFDR cut-off were considered significant. The dissemination pathways were represented using Circos55, 56 and marked on a map.
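The randomization step can be illustrated with a generic permutation test in which leaf states are shuffled among the tips while the tree is left unchanged. This is a simplified sketch and does not reproduce MigraPhyla's internal statistic or its sFDR correction. The `statistic` callable, the cherry-based example statistic, the leaf names and the add-one P-value estimator are all assumptions made for illustration.

```python
import random


def permutation_p_value(leaf_states, statistic, trials=10_000, seed=0):
    """Estimate a one-sided P-value by shuffling states among leaves.

    leaf_states: dict mapping leaf name -> location state.
    statistic:   callable taking such a dict and returning a number
                 (hypothetical; computed on a fixed tree topology).
    """
    rng = random.Random(seed)
    leaves = list(leaf_states)
    states = [leaf_states[leaf] for leaf in leaves]
    observed = statistic(leaf_states)
    hits = 0
    for _ in range(trials):
        rng.shuffle(states)  # randomize states; topology is untouched
        if statistic(dict(zip(leaves, states))) >= observed:
            hits += 1
    # Add-one estimator avoids reporting a P-value of exactly zero
    return (hits + 1) / (trials + 1)


# Hypothetical fixed topology, summarized as pairs of sister leaves
cherries = [("t1", "t2"), ("t3", "t4")]


def same_state_cherries(states):
    """Count sister pairs whose two leaves share a location state."""
    return sum(states[a] == states[b] for a, b in cherries)


observed_states = {"t1": "X", "t2": "X", "t3": "Y", "t4": "Y"}
p = permutation_p_value(observed_states, same_state_cherries, trials=1000)
```

A small P-value indicates that the observed geographic clustering is rarely matched when locations are assigned to tips at random.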