Metagenomics detection and characterisation of viruses in faecal samples from Australian wild birds

We present an optimised metagenomics method for detection and characterisation of all virus types including single and double stranded DNA/RNA and enveloped and non-enveloped viruses. Initial evaluation included both spiked and non-spiked bird faecal samples as well as non-spiked human faecal samples. From the non-spiked bird samples (Australian Muscovy duck and Pacific black ducks) we detected 21 viruses, and we also present a summary of a few viruses detected in human faecal samples. We then present a detailed analysis of selected virus sequences in the avian samples that were somewhat similar to known viruses, and had good quality (Q20 or higher) and quantity of next-generation sequencing reads, and was of interest from a virological point of view, for example, avian coronavirus and avian paramyxovirus 6. Some of these viruses were closely related to known viruses while others were more distantly related with 70% or less identity to currently known/sequenced viruses. Besides detecting viruses, the technique also allowed the characterisation of host mitochondrial DNA present and thus identifying host species, while ribosomal RNA sequences provided insight into the “ribosomal activity microbiome”; of gut parasites; and of food eaten such as plants or insects, which we correlated to non-avian host associated viruses.

infecting viruses while we also detected one leech and one dragonfly virus, likely derived from food eaten by the ducks. While several possible bacteriophages were identified in this sample, we only present results for one bacteriophage as an example, the Enterobacteria phage N4 like virus, which is not known to integrate into its host's DNA, and therefore the reads identified being most likely from true virus particles [ Table 2].
Partial genome and evolutionary analysis of selected viruses. Among the 21 viruses found in the MUD and MAD faecal samples, a more in-depth analysis of the assembled sequences of 10 viruses was performed. This included both RNA and DNA viruses that had areas of good coverage and good quality NGS reads as described in the methods. The consensus sequences generated have been submitted to NCBI. Description of the sequences on the representative phylogenetic trees of three avian viruses from MAD (avian paramyxovirus 6, avian deltacoronavirus and avian adenovirus) and one virus from MUD (segment 1 of Hubei chryso-like virus 1) faecal samples are given in Table 3, and others are given in the Supplementary Material 2.
In the MAD sample. Avian paramyxovirus 6 (APMV6). The MAD sample had many high-quality reads (Q32 or higher) that could be mapped to avian paramyxovirus type 6 (APMV-6 AB759118), a ssRNA virus belonging to the family Paramyxoviridae, with individual reads being 95-98% identical to AB759118 (APMV-6 red-necked stint-Japan-2008) 15 . We assembled several areas of good coverage including 459 nucleotides of the polymerase gene (coverage , 1839 nucleotides covering the complete hemagglutinin-neuraminidase (HN) gene (coverage 5-103) and 651 nucleotides of the fusion (F) gene (coverage [2][3][4][5][6][7][8]. For the polymerase gene, the similarity was 447/459 (97.3%) and 444/459 (96.7%) to AB759118 (APMV-6 red-necked stint-Japan-2008) 15 and GQ406232 (APMV-6 duck-Italy-07) 16 , but only around 75% similar to the other known lineage of APMV-6, the so-called Hong Kong 1977 lineage 17 [ Fig. 2 and Table 4]. For the HN gene, the similarity was 1787/1839 (97.1%) and 1785/1839 (97.0%) to AB759118 (APMV-6 red-necked stint-Japan-2008) and GQ406232 (APMV-6 duck-Italy-07), but only around 71% similar to the other known lineage of APMV-6 [ Fig. 3 and Table 5]. For the fusion gene, the similarity was 631/651 (96.9%) and 630/651 (96.7%) to AB759118 (APMV-6 red-necked stint-Japan-2008) and GQ406232 (APMV-6 duck-Italy-07), but only around 73% similar to the other known lineage of APMV-6 [ Fig. 4 and Table 6]. Thus, the APMV-6 present in our sample from the Australian juvenile Pacific black ducks (MAD) was closely related to the red-necked stint-Japan-2008 and duck-Italy-07 lineage of APMV-6 and only distantly related to the other major lineage of APMV-6. Furthermore, our Australian APMV-6 also has an additional basic amino acid around the fusion protein cleavage site, REPR-L rather than PEPR-L, characteristic for the Japan/Italy 2007/2008 lineage of APMV-6 17 . APMV6 sequences covering other regions of the APMV6 genome similarly found that the virus belonged to the Japan-Italy lineage [Figures S1.1-S1. 23 and Table S1.1-S1. 23  The virus family and its characteristics are shown. We identified these viruses using the reference sequences mentioned in column 4 to which the NGS reads were mapped. We found how much the reads of our viruses are identical to the closest virus from the NCBI dataset using MEGA 6 or 7 software. However, for individual reads generated by NGS, we used BLASTN to identify the closest virus from the NCBI dataset. Their likely source of origin in our sample was determined using the NGS reads generated and correlating them from literature.   -paramyxovirus-6large-polymerase-protein-459nt-Q32-C-9-192-2016   MAD-APMV6-Pol-459nt  9-192  459  32  MH000419   3 and 5   MAD-Avian-paramyxovirus-6hemagglutinin-neuraminidase-1839nt-Q32-C-5-103_2016   MAD-APMV6-HN-1839nt  5-103  1839  32  MH000415   4 and 6  MAD-Avian-paramyxovirus-6-fusionprotein-651nt-Q32-C-2-8-2016  MAD-APMV6-FP-651nt  2-8  651  32  Two large contigs of 14822 and 8322 nucleotides and two smaller contigs spanning parts of the deltacoronavirus Orf1ab replicase gene and the S1-S2 (Spike) and E, M and NS6 genes were obtained from the assem-blerSPAdes analysis, and the NGS reads mapped to these contigs using the TMAP assembler to determine the coverage depth. From further analysis, we initially selected an area of 1131 nucleotides (coverage 290-1700) from the Spike gene and 2121 nucleotides (coverage 39-1266) from the Orf1ab gene which overlaps with NSP7, NSP8, NSP9 and NSP10. These sequences were then analysed using MEGA and aligned with ClustalW by codons. For the selected Spike gene segment alignment, the deltacoronavirus sequence had a 738/1131 nucleotides ( We then expanded the analysis to larger areas of these mapped contigs (Q value at 20 or above and coverage of at least 5 and for some areas up to 6400), dividing areas into open reading frames (to avoid stop codons and thus making Clustal W codon based alignment possible). For all selected areas, the Wigeon coronavirus JQ065058 was the closest related sequence, and the percentage identities to the Wigeon and other related deltacoronavirus sequences at the nucleotide and amino acid levels are shown in Table 7. Selected phylogenetic trees are shown in Virus related to goose adenovirus 4 (GoA4) and duck adenovirus 2 (DuA2). A dsDNA virus belonging to the family Adenoviridae and approximately 75-86% identical to goose adenovirus 4 (JF510462) from Hungary 20 and duck adenovirus 2 (KR135164) from China 21 was found in the MAD faecal sample. The assembled sequences were analysed further including a 279 nucleotides area of the IVa2 protein coding region (Q20 or higher, coverage 4-69), a 177 nucleotides area of the IU34_gp11 gene (Q20 or higher, coverage 21-261) and a 114 nucleotides area of the IU34_gp21 gene (Q32 or higher, coverage . We found the similarity of assembled sequences to be 229/279 nucleotides (82%) match to the GoA4 and 210/279 nucleotides (75.26%) match to the DuA2 [ Fig. 8 Table 3. Description of the sequences on the representative phylogenetic trees generated for four selected viruses analysed using MEGA 6 or 7 software. The table gives details of the sequences used for phylogenetic analysis by the Maximum Likelihood method of APMV6, DCoV, AV and HCLV1. The first half of the table for each virus provides details to identify the sample from which the virus was isolated, protein encoded by the consensus sequences that are being analysed, the length of the consensus sequences and gene being analysed, minimum mapping quality threshold used in IGV for the generation of the consensus sequences, the coverage of the consensus sequences from IGV, the year of collection of the sample and the NCBI accession number assigned to the particular consensus sequence. Corresponding short names have been used in the phylogenetic trees which contain the sample, virus, protein and the number of nucleotides. The second half of the table for each virus provides the details of the sequences that were used for the comparative molecular phylogenetic analysis. They were found to be the most closely related sequences to the consensus sequences generated using BLASTN. Corresponding short names have been used in the phylogenetic trees which contain the NCBI accession number, virus and the country of collection. The collection date is given as retrieved from the corresponding NCBI nucleotide reference dataset which together with the country of collection can provide an insight into the possible evolution of the virus through the years. Rotavirus G. The faecal sample from the Pacific black ducks had good quality individual reads (Q32 or higher) with 80-89% identity to Rotavirus G chicken segment 2 (NC021580) and 4 (JQ920005) (Note: Designated as segment 4 as per the submitted NCBI reference sequence, but appear to be VP3 and possibly segment 3) isolated from Germany 22 and Rotavirus G pigeon segment 1 and 8 (KC876010 and KC876006 respectively) isolated from China 23 . Rotavirus G is a segmented, dsRNA virus belonging to the Reoviridae family. For this virus, good coverage of 3 or higher with a mapping quality threshold of 20 or higher was not obtained. To the best of our knowledge, these sequences are the first assembled reads of a Rotavirus G from Australia. Hence, to undergo a phylogenetic analysis of the virus, the mapping quality threshold was dropped to 10 or higher to get coverage of 3 and above. Three regions each from segment 1, 3/4 and 8 with individual areas being a 230-nucleotides  The evolutionary history was inferred by using the Maximum Likelihood method based on the Tamura 3-parameter model 70 . The tree with the highest log likelihood (−5515.26) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 54.95% sites). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 8 nucleotide sequences. Codon positions included were 1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 1842 positions in the final dataset. Evolutionary analyses were conducted in MEGA7 63 .

MAD-APMV6-FP-650nt
AB759118-APMV6-JP 20    71 . The tree with the highest log likelihood (−1837.17) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.4985)). The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 59.29% sites). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 8 nucleotide sequences. Codon positions included were 1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 651 positions in the final dataset. Evolutionary analyses were conducted in MEGA7 63 .  The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 1.4644)). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 9 nucleotide sequences. All positions containing gaps and missing data were eliminated. There were a total of 10137 positions in the final dataset. Evolutionary analyses were conducted in MEGA7 63 .    Table 9. Estimates of Evolutionary Divergence between Sequences. The number of base differences between sequences are shown. The analysis involved 9 nucleotide sequences. Codon positions included were 1st+2nd+ 3rd+Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 2076 positions in the final dataset. Evolutionary analyses were conducted in MEGA7. Hubei reo-like virus 7-like virus. From the MUD faecal sample, virus sequences with similarity to the Hubei reo-like virus 7, an unclassified dsRNA virus was assembled. We were able to generate good quality reads (Q20 or higher) distributed throughout the genome. As this virus is a relatively newly characterised virus from mosquitoes in China 1 and not closely related to any other hitherto sequenced viruses, our sequences only resemble that from China. From the polymerase gene of the virus, partial gene sequences that were assembled from the NGS dataset including a 240-nucleotides area (Q20 or higher, coverage 3-11), a 184-nucleotides area (Q32 or higher, coverage 3-6) and a 117-nucleotides area (Q32 or higher, coverage 4-5) were analysed. We found that the     Table 11. Estimates of Evolutionary Divergence between Sequences. The number of base differences between sequences are shown. The analysis involved 5 nucleotide sequences. Codon positions included were 1st+2nd+ 3rd+Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 279 positions in the final dataset. Evolutionary analyses were conducted in MEGA7.    although recombination among strains appears to be part of its ancestry. The representative phylogenetic trees are shown in Supplementary material 2 Figures S11.3-S11.6 and Tables S11.3-S11.6. From the contigs, a few larger contigs were selected that by BLASTN alignment contained sequences with a high degree of identity to human viruses or known bacteriophages. Contigs and closely related sequences in the NCBI databases were then used to map back the NGS reads using the TMAP assembler. Assembled BAM files were then analysed using IGV and areas with reasonable coverage and a Q value at 20 or higher selected for further analysis of consensus sequences by using BLASTN and BLASTX.
From sample ST4 a sequence of 909 nucleotides in length was assembled with a coverage of 3-52 reads and a close nucleotide match to rotavirus sequences in the NCBI databases, e.g. 908-909/909 nucleotides match (99.9-100%) to GU565088, the VP4 gene of a reassortant rotavirus vaccine strain (USA RotaTeq) 30 and 880/909 nucleotides match (96.8%) to U53923, the VP4 gene from a bovine rotavirus isolated in the USA 31 more than 20 years ago. It should be noted that the rotavirus sequence in our sample was almost certainly from the vaccine virus although at one position the GU565088 sequence in NCBI reportedly has a G nucleotide while our sequence had 12 reads (55%) with a G in this position and 10 reads with an A indicating a mixed nucleotide at this position  We also did phylogenetic inferences of these sequence datasets by Maximum Likelihood (ML) phylogenetic analyses using the MEGA 6/7 software, and the representative phylogenetic trees are shown in Supplementary material 2 Figures S11.7-S11.8 and Tables S11.7-S11.8.
Additional information extracted from the NGS reads. In addition to virus enrichment, the optimised method also, (i) preserves a minor amount of host mitochondria/mitochondrial DNA, and, (ii) enriches for other small stable structures resistant or partially resistant to nucleases, such as, e.g. ribosomes. Thus, further scrutiny of the obtained NGS reads outside looking for viruses, can provide valuable information.
Species sampled. In the faecal samples from the Muscovy duck (Cairina moschata) there were good quality reads with a coverage of 23-66 and a 114/114 (100%) nucleotides match to the Cairina moschata mitochondrial COX1 gene (FJ808630) 33 35 and only around 88% to Cairina moschata. Consequently, the results "mined" from the NGS results supported the species of the birds sampled noting that Pacific black ducks (Anas superciliosa) are genotypically similar to the mallard (Anas platyrhynchos) or the bird that we sampled could be a cross-breed 36 .
Food eaten by the birds sampled and possible correlation to non-host viruses detected. As this was not the focus of this particular study, only a few examples are summarised; however, further mining of the data is likely to provide further insight into how much information can be extracted in order to get the full value of generated reads.
Consensus reads from the Cairina moschata samples showed a meager number of reads with matches of e.g. 139/139 (100%) and 176/177 ((99.4%) nucleotides match to Culex (quinquefasciatus) partial 28 S RNA (KY087523), while no such matches could be found in the Pacific black ducks sample that in contrast had many reads mapping to the 28 S RNA of tapeworms and other cestodes with, e.g. 167/167 (100%) nucleotides matches. Other reads could also be mapped to the 28 S ribosomal RNA of plants such as, e.g. triticum, sorghum, Oryza and mays. The finding of a low number of reads mapping to the ribosomal RNA of Culex mosquitoes is consistent with the abundant finding of Culex mosquito viruses in the Cairina moschata (MUD) sample and as no reads could be mapped to the Culex mitochondrial COX1 gene, indicate that the mosquito-associated viruses detected in the Cairina moschata faecal sample came from mosquitoes being eaten and going through the intestinal tract (thus completely digesting any mitochondria/mitochondrial DNA while not completely digesting ribosomes) rather than from potential mosquito contamination of the faecal sample before collection.
The "ribosomal activity microbiome". The analysis of the NGS data using the ThermoFisher Ion Reporter microbiome software provided insight into the active bacterial communities present in collected faecal samples although did not necessarily correlate with results obtained by a conventional DNA extraction 16SrRNA microbiome diversity profiling (data not shown). Furthermore, an overall analysis of repeated samples indicated that while detection of viruses was very robust over time, the "ribosomal activity microbiome" is susceptible to freeze-thawing with the ratio of reads mapped to Gram-negative bacteria (particularly phylum Bacteroidetes) decreasing when compared to Gram-positive bacteria (particularly phylum Firmicutes).

Discussion
The optimum protocol for the metagenomics detection and characterisation of viruses from a given sample should allow detection of all types of viruses, including single and double stranded RNA/DNA, circular, linear or segmented genome viruses; enveloped and non-enveloped viruses; and small and large viruses present. To establish such a protocol for the enrichment, detection and characterisation of a maximum number of viruses in complex biological samples such as faecal samples, we used a method previously described by others on a mock community called NetoVIR as the starting point and tested different enrichment techniques using various biophysical methods 6,37,38 . The use of marker viruses representing the different virus types enabled us to evaluate and optimise a metagenomics method suitable for detecting and characterising viruses from complex biological samples, in this study faecal samples.
The main aim was to minimise host and non-virus RNA/DNA and maintain as much virus RNA/DNA as possible to optimise subsequent random amplification of nucleic acids followed by NGS detection and characterisation. The method which maintained a maximum number of viruses and obtained high-quality and a higher number of NGS reads were found to be variation D; equivalent to the NetoVIR method with an added ultracentrifugation step to further concentrate and enrich for particles resistant to nucleases. However, variation C, or NetoVIR 6 itself, also produced good quality NGS reads for the viruses, although in lower amounts compared to Scientific RepoRtS | (2018) 8:8686 | DOI:10.1038/s41598-018-26851-1 variation D. Careful homogenization was significant for the homogeneous distribution of virus particles in the samples. Initial centrifugation and filtration using the 0.8 µm PES filter as described in NetoVIR removed larger particles while virus particles and some small particles such as ribosomes and some mitochondria were not filtered out and were subsequently spun down and concentrated using ultracentrifugation and thus enriching the sample. Harsh nuclease treatment ensured the reduction of free nucleic acids and other less protected nucleic acids. Finally, the nucleic acid extraction method did allow the extraction of genomes of both DNA and RNA viruses from the sample.
Although detergent treatment may enhance enrichment of certain viruses, it was found to be disadvantageous for some enveloped viruses and thus not incorporated in the final protocol. We assume that the detergent disrupted the structure of the enveloped, non-inner capsid containing virus particles causing the nuclease to digest the virus nucleic acids. Filtration using 0.45 µm filters might also have negatively affected the virus enrichment process, as we were not able to produce many good reads for the marker viruses BTV, BVDV and IBV. Both our filtration results using 0.45 µm filtration and the detergent results, although in the NetoVIR study 6 the authors used chloroform, correlate well with how the mock communities reacted to these enrichment techniques.
Unlike NetoVIR 6 , we used the SeqPlex RNA amplification kit instead of the Whole Transcriptome Amplification Kit 2 (WTA2, Sigma) for cDNA synthesis, amplification and primer removal. While library preparation and NGS was carried out using Nextera XT DNA Library Preparation kit (Illumina) and HiSeq ™ 2500 platform (Illumina) in NetoVIR 6 , we used Ion Plus Fragment Library Kit (Thermo Fisher Scientific) and Ion Torrent S5XL System (Thermo Fisher Scientific). However, a recent study has shown that the selection of different sequencing platforms and corresponding library preparation methods have a minimal impact on viral metagenomes 39 .
While we detected and characterised 9 avian host associated viruses in the MAD faecal sample, it is noteworthy that no avian viruses were detected in the MUD faecal sample. We speculate that this is because of the age of the bird (a mature older bird of unknown age as compared to the MAD samples being from juveniles) and it, despite being a wild bird living outside, was from an urban-like environment with less interaction with many other wild birds.
Avian paramyxovirus 6 (APMV6) was first isolated in 1977 from a healthy duck at a domestic duck farm in Hong Kong and considered as the prototype strain for the entire serotype 40 . The virus seems to be non-pathogenic to chickens but can cause mild respiratory disease and a drop in egg production in turkeys 41 . While only distantly related to the Hong Kong lineage of APMV6 mentioned above, the APMV6 detected in the MAD faecal sample is around 95-98% identical to another lineage of APMV6 isolated from Japan 15 and Italy 16 in the year 2008 and 2007, respectively. Interestingly, the detection of the genetically similar APMV6 in Japan was from a red-necked stint (Calidris ruficolis), a migratory bird species which breeds in the Arctic, but migrates in the non-breeding season as far south as New Zealand and Southern Australia including the location from which the APMV6 from this study was collected. The fusion protein of the paramyxoviruses directs the membrane fusion 42 and paramyxoviruses where the fusion protein cleavage site does not possess the furin recognition site (RXK/RR-F) are usually limited to the respiratory tract, and hence the sequence at this site contributes to the virulence of these viruses 42,43 . The sequences of the fusion protein in our APMV6 from the MAD faecal sample has an additional basic amino acid around the fusion protein cleavage site (REPR-L) characteristic for the Japan/Italy 2007/2008 lineage as compared to the Hong Kong lineage of APMV-6 17 and hence any further changes at this site could potentially make the virus highly virulent.
Gammacoronaviruses and deltacoronaviruses both from the family Coronaviridae, have been previously detected in wild birds in other parts of the world 44 . Avian coronaviruses frequently cause severe infections in poultry in many countries 45 . Coronaviruses have also been associated with inter-species spill over with recent examples being Severe acute respiratory syndrome coronavirus and Middle Eastern respiratory syndrome coronavirus. The gammacoronavirus detected in the MAD faecal sample was more closely related to the duck gammacoronavirus from China in 2014 18 than to the other gammacoronavirus sequences available. While we observed more than 96% identity in the Orf1ab polyprotein gene, the Spike gene identity was only 89-92% to the duck gammacoronavirus. For the deltacoronavirus detected in the MAD sample, the closest relative was the wigeon deltacoronavirus from Hong Kong collected in 2008 19 . The Spike protein of the coronaviruses mediates the entry of the virus particles into cells, and mutations in or recombination of this region are important for host species specificity and potential cross-species transmissibility of the virus 42,43 . These are the first wild bird coronaviruses from Australia to have their spike genes sequenced and further analysis to understand the relationship between wild bird coronaviruses is needed. A detailed study of the prevalence of coronaviruses among Australian wild birds (including the MAD sample) using PCR amplification of a small fragment of the Orf1ab gene with a detailed description and phylogenetic analysis of a small fragment of around 300 nucleotides is described elsewhere 46 . However, while that study did detect both gamma and deltacoronaviruses, the PCRs included were not able to detect coinfections (i.e. birds having both a gamma and a deltacoronavirus) while it is worth mentioning that such coinfection can be readily detected using our metagenomics method as evident from the NGS data generated in the present study.
In the MAD faecal sample, we also characterised two DNA viruses, one a virus related to DuA2 and/or GoA4 virus and another virus related to an adeno-associated (parvo) virus. The consensus sequences of the adenovirus that we assembled were related (75-86%) to both the DuA2 and GoA4 virus. The DuA2 (KR135164) was characterised in China in the year 2014 and found to be closely related to the GoA4 (JF510462) isolated from Hungary 20 . Avian adenoviruses have great variability in their pathogenesis and virulence. The goose adenovirus 4 has been associated with hepatitis and hydropericardium 20 . The virus related to adeno-associated (parvo) virus from the Pacific black ducks was somewhat related to a duck adeno-associated virus from China in 2015. It is currently unclear whether the adeno-associated (parvo) virus detected in the MAD sample can replicate without helper virus or not, however, we consider it likely that the adenovirus (related to the DuA2 and GoA4 mentioned above) Scientific RepoRtS | (2018) 8:8686 | DOI:10.1038/s41598-018-26851-1 detected in the same sample, may act as a helper virus for enhanced replication as adeno-associated parvoviruses often need a helper virus for its replication 47 . As both viruses are present simultaneously and possibly enhance replication of the parvovirus present, this co-infection may be associated with disease in the host although this is currently speculative and will require further studies before any conclusion can be reached as to its significance in regards to the health of infected birds.
Rotavirus G has also been previously found in birds 48 . Although infection with this particular serotype has not been considered fatal, infection from another rotavirus serotype such as Rotavirus A can be lethal to the birds in extreme cases 49,50 . Even though rotavirus infections in avian species are associated with diarrhoea, cross-species transmission of the avian rotaviruses is usually very rare 49 . To the best of our knowledge, our sequences are the first assembled reads of a Rotavirus G from Australia and from a duck sample, which could explain why we had to drop the mapping quality threshold to 10 to get a coverage of 3 and above for the generation of consensus sequences. Nevertheless, even then, we were only able to generate consensus sequences for only segments 1, 3/4 and 8. We believe the sequences for one segment that matched to segment 4 of the NCBI reference (JQ920005) isolated from Germany 22 is in reality segment 3 as it encodes for the VP3 protein of the rotavirus, which universally is accepted to be part of segment 3. Nevertheless, it appears that the rotavirus detected in the MAD sample is a new rotavirus and may have included reassortment in its evolutionary history as we were only able to map 3 of the gene segments and they, in turn, mapped most closely to different reference sequences.
Other avian host associated viruses we identified in the Pacific black ducks sample (MAD) are related to avian megrivirus, avian encephalomyelitis virus and avian calicivirus. Among them, the avian megrivirus and avian encephalomyelitis virus belong to the virus family Picornaviridae. The avian viruses in this virus family have been associated with different diseases such as duck viral hepatitis, turkey viral hepatitis and avian encephalomyelitis 49,51 . Avian encephalomyelitis virus is a highly infective virus to bird species such as chickens, partridge, turkey, quail, guineafowl and pheasants. They can severely affect the neuro-system and has high mortality rates 49 . However, the pathogenicity of avian calicivirus is still unclear although it has been isolated from chickens and turkeys having various diseases like the runting and stunting syndrome and enteric diseases 52,53 . The actual pathogenicity, if any, of the related viruses here detected at very low levels in the MAD sample is currently unknown.
The non-avian host associated viruses that were characterised in both MUD and MAD faecal samples were likely part of the food eaten by the ducks. For example, for the mosquito viruses detected in our MUD faecal sample, our detection of partial 28 S RNA of Culex mosquito in the sample supports this speculation. Ngewotan virus (Nam Dinh like virus), Hubei chryso-like virus 1, Culex Negev-like virus 3 (Biggie/Goutanap virus-like) and a virus related to Hubei reo-like virus 7 detected and characterised in the MUD sample are recently identified in mosquitoes 1,24 . All the mosquito viruses, except for the virus related to Hubei reo-like virus 7 (which was first characterised in China 1 ) are nearly 100% identical to the viruses detected in Western Australia 24 , approximately 3000 km from Geelong, Victoria where the MUD faecal sample was collected. These viruses being RNA viruses, which can have high mutation rates due to the low polymerase fidelity 54 , a nearly 100% identity is rather interesting. The mode of travel of these viruses could be from birds that have eaten the infected mosquitoes and travel long distances and then could be depositing these viruses in their faeces. We also cannot rule out that the infected mosquitoes may be taken from one region to another by human means such as aircrafts, trains, vehicles or by the wind. Independent of mode, the close identity of these mosquito viruses from distant locations indicate that such viruses may be easily dispersed over great distances.
Virus related to invertebrate iridescent virus 30 has previously been found in moths 55 and viruses related to Hubei picorna-like virus 19 and Hubei picorna-like virus 51, here detected and characterised in the MAD sample, were recently identified in leech and dragonfly, respectively 1 . Presence of plant viruses such as a virus related to black-grass cryptic like virus 2 and a virus related to Hordeum vulgare endornavirus is also likely to be from the food eaten. Only a single long good quality read with high similarity to the Israeli acute paralysis virus was detected. This virus may cause devastating infection in bees (colony collapse) when associated with varroa mite infestations. This virus has previously been detected in Australian bees, however, as Australia is free of varroa mites, is not thought to cause any disease in bees here 56,57 . Nevertheless, the finding in the MUD sample likely suggests that it could be from a bee eaten by the Muscovy duck.
Some of the other viruses identified in the duck samples belonging to the virus families Picornaviridae, Adenoviridae, Parvoviridae, Partitiviridae, Endornaviridae, Myoviridae, Iridoviridae and Calciviridae were only around 70-80% similar to the closet sequences in the database (see Tables 1 and 2). A closer examination of these related/like viruses may lead to the identification and full characterisation.
Birds may serve as hosts, carriers or transporters of parasites and pathogens 58 . Birds that migrate through the East Asian-Australasian flyway may transmit viruses to and from Australia. The similarity of some of our sequences of the avian and non-avian host associated viruses to the viruses that were characterised in e.g. Japan and China supports this theory. For example, the APMV6 phylogenies suggest that despite Australian ducks not migrating, they still are part of the global circulation of duck viruses, and that migrating shorebirds may likely be the vector for such viruses entering and exiting from Australia. It is also noteworthy to mention that during the BLASTN search, except for the Hubei chryso-like virus 1, Ngewotan virus (Nam Dinh like virus) and Culex Negev-like virus 3 very recently described in Western Australia, none of the top matched sequences were to similar strains or related viruses from Australia. Hence to the best of our knowledge we are the first to provide virus sequences from Australia for avian paramyxovirus 6, avian adeno-associated parvovirus and avian calicivirus.
The optimised virus enrichment metagenomics method presented here was able to not only provide information on viruses within the gut of a bird, but also identify the species of the bird, indicate dietary preferences and identify bacterial populations present from a fresh faecal sample. It is therefore a non-invasive technique that could have application to ecological and wildlife health surveys, and can be applied to many different species, including humans. Furthermore, the technique could be applied to different biological samples including saliva and tissue samples (data not shown) and be of use to medical and veterinary diagnostics. The "ribosomal activity microbiome" that was revealed using the optimised method may provide valuable information about the composition of live and active bacteria in faecal samples. Ribosomal RNA is partially protected from nuclease activity 59 , and as the optimised method allows ribosomes to be enriched together with the virus particles, the ribosomal profile of the faecal microbiome can also be investigated. These results provide a indication of the "ribosomal activity microbiome", i.e. the microbiome of active bacteria in the final deposited feces as active bacteria have a much higher number of ribosomes per bacteria (up to 70000 ribosomes or more per bacteria) than resting or dead bacteria (7000 or fewer ribosomes per bacteria) 60,61 . Although the results based on this method cannot necessarily be directly compared to microbial assessment by 16 S DNA diversity profiling, understanding of the active bacteria may be important 62 . The decrease that was observed in the Bacteroidetes to Firmicutes ratio with repeated freeze-thaw cycles may also be useful to assess the prior freeze-thaw/storage history of a sample.
Finally, we also would like to mention, that in addition to the bird and human faecal samples described here, the optimized method has been tested on dog faecal and saliva samples and on pig lung/tracheal swab samples with good results (data not shown) indicating that the method may have broad applicability for different sample types and species.

Materials and Methods
Samples and marker viruses for spiking samples. Marker viruses for spiking samples (Table S1 of Supplementary material 1) were obtained from the CSIRO-Australian Animal Health Laboratory and the Elizabeth Macarthur Agricultural Institute. All virus stocks were stored at −80 °C.
Human stool samples collected from 3 children aged 3 months (sample ST4), 18 months (ST5) and 4 years (ST6) from Geelong were also used to test the optimised method. The study involving these samples here were performed in accordance with relevant guidelines and regulations and was deemed negligible or low risk by the Barwon Health Human Research Ethics Committee and therefore exempt from full committee review (project no: 17.119).
The MUD faecal sample was selected for spiking with the marker viruses. A non-spiked MUD sample acted as control for the optimisation of the best protocol for metagenomics of viruses. All faecal samples were then processed according to the optimised protocol without any spiking. The detailed steps of the final optimised protocol are given in the supplementary material (Supplementary material 3). However, the steps in the protocol are briefly described below.
Real-time polymerase chain reaction (PCR) assays for marker viruses. We initially optimised real-time PCR assays for the marker viruses to detect and quantify them during the virome detection and characterisation protocol development. Virus-specific quantitative assays were either developed or retrieved from the literature (see Table S2 of Supplementary material 1 for details), and quantity required to spike the faecal samples assessed based on these assays. All quantitative assays for RNA viruses were carried out using Power SYBR ® Green RNA-to-CT ™ 1-Step kit (Thermofisher Scientific) using the following conditions: 48  Virus enrichment from samples. Removing other impurities such as host cells, bacteria, food particles and free nucleic acids from the faecal samples and enriching for virus particles followed by nucleic acid extraction is the first step in metagenomics of viruses. For this purpose, we tested different enrichment techniques using various biophysical methods 6,37,38 . Six method variations from A to F with a different combination of virus enrichment techniques using various biophysical methods were carried out (Fig. 1) on the spiked MUD faecal sample. We took the already tested protocol called the NetoVIR 6 as our starting method for virus enrichment and designated it as 'variation C' . We used the MUD faecal sample and incorporated two additional potential enrichment steps to this method, being detergent treatment (variation A & B) and ultracentrifugation (variation A, D & F) and also tested filtration using 0.45 µm filters (variation E & F) as described, but not recommended as optimum on their mock community, by the authors of NetoVIR 6 . This enabled us to evaluate, compare and develop an optimised protocol for virus community analysis from a biological sample. The combination of biophysical enrichment techniques that maintained the maximum number and levels of marker viruses was selected to proceed for cDNA synthesis, amplification, library preparation and NGS. The method which then gave the maximum number and obtained high-quality NGS reads of the marker viruses was then selected as the final method to conduct metagenomics of project samples for the detection of viruses. Sample preparation. A 1:10 dilution of the bird and human faecal samples were prepared using sterile phosphate buffered saline (PBS). The samples were vortexed thoroughly prior to downstream processing.
Homogenization (During method optimisation stage). A volume of 2 mL of the faecal samples were taken for variations involving ultracentrifugation, while for other variations only 500 µL was taken to maximise handling efficiency and protect virus nucleic acid during extraction. Homogenization was carried out using TissueLyser II homogenizer (Qiagen) at 25 Hz for 2 min.
Homogenization (After method optimisation). A volume of 2 mL of the faecal samples were taken for homogenization. After homogenization, the sample was divided into two, one tube containing 1725µL which was processed according to Variation D and another containing 250 µL which was processed according to Variation C.
Detergent Treatment. To the homogenised sample of variation A and B, 0.25% (vol/vol) Nonidet P-40 detergent was added. It was centrifuged at 10,000 rpm for 30 min at 4 °C, and the supernatant was collected and processed downstream.
Centrifugation. The samples were centrifuged at 17,000 g for 3 min at room temperature, and the supernatant was collected for downstream processing.
Filtration. The samples were filtered through 0.8 µm polyethersulfone (PES) spin filters at 17,000 g for 1 min or until all of the sample had passed through the filter. The filtrate was taken and processed downstream.
During method optimisation stage, the spike MUD faecal sample under variation E and F were then filtered through 0.45 µm PES syringe filters, and the filtrate processed downstream.
Ultracentrifugation. The samples of variation A, D and F were then ultracentrifuged using a Beckman Coulter Airfuge Ultracentrifuge at 178,000 g for 1 hour (30 psi for 1 hour) at room temperature. The supernatant was discarded, and the pellet was suspended in 130 µL of sterile 1XPBS.
Virus nucleic acid extraction. Virus nucleic acids were extracted using the QIAamp Viral RNA Mini Kit (Qiagen), importantly without adding any carrier RNA. Both DNA and RNA virus nucleic acids were extracted using this kit. The quantity of the isolated nucleic acids was determined using Nanodrop and in selected cases a Bioanalyzer, however, the quantity of extracted nucleic acids was as expected found to be very low after the enrichment and nuclease treatment process.
Next-generation sequencing. Virus nucleic acids extracted from the spiked MUD faecal sample that undergone variations C, D, E and F were processed downstream as these methods showed presence or detection of all spiked marker viruses. Other samples were also processed as described below. Extracted virus nucleic acids were subjected to cDNA synthesis, amplification and primer removal using SeqPlex RNA Amplification Kit (Sigma) from all the samples according to the kit instructions. An initial RNA denaturation step of 95 °C for 3 min followed by placing the sample in −20 °C cold ethanol was carried out just before cDNA synthesis to ensure the denaturation of the nucleic acids of any double stranded RNA viruses. The quantity and quality of the amplified product were checked using the Bioanalyzer.
Library preparation was carried out using Ion Plus Fragment Library Kit (Thermo Fisher Scientific), Ion Xpress Barcode Adapters 1-96 Kit (Thermo Fisher Scientific), Agencourt AMPure XP kit (Beckman Coulter) and Ion Library TaqMan ™ Quantitation Kit (Thermo Fisher Scientific). A volume of 20 µL of the cleaned product was used for end repair using the end repair enzyme in Ion Plus Fragment Library Kit. The sample was then purified using Agencourt AMPure XP kit at a 1.8X sample to bead ratio and eluted in 25 µL of Low TE buffer. Following elution, barcoded libraries were prepared using Ion Plus Fragment Library Kit and Ion Xpress Barcode Adapters 1-96 Kit as per the kit instructions. A 1.5X sample to bead ratio of the Agencourt AMPure Kit was used to purify the library. Then it was eluted in 20 µL of Low TE buffer. Quantification of unamplified libraries was performed as per the protocol from Ion Library TaqMan ™ Quantitation Kit using QuantStudio ™ 6 Flex Real-Time PCR System. Based on the concentration estimated, the dilution that resulted in a concentration of ~100 pM was then calculated and made. Libraries were then pooled prior to loading onto Ion 530 Chips using the Ion Chef Instrument. Following template preparation, the chips were run on the Ion Torrent S5XL System (Thermo Fisher Scientific) as per company protocols. NGS and associated reactions were performed at the Geelong Centre for Emerging Infectious Diseases (GCEID), Geelong, Victoria, Australia. We ran NGS for the spiked MUD faecal sample for the variations C to F only once and generated around 3.4 million reads. Approximately 4 million reads were generated for the non-spiked MUD faecal samples on the first run during which it partly acted as a control, and 7 million reads generated for the MAD faecal sample. For further analysis, the MUD and MAD faecal samples were rerun, and about 9.5 and 7.1 million reads were generated respectively. However, NGS for the human samples were run only once as it was a demonstration of the optimised protocol only and about 7.1 million reads were generated. Next generation sequence analyses. Although the NGS data provided an insight into the virome of the bird faecal samples, we narrowed our focus to virus sequences that were somewhat similar to already known viruses, in most cases had good quality and quantity NGS reads, could serve as examples of the utility of the method, and/or were of interest from a virological point of view. All available virus sequences were downloaded from the NCBI GenBank genetic sequence database (May 2017), and a local BLAST database created. Bacterial plasmid sequences were downloaded from the NCBI Reference Sequence Database (RefSeq) repository and 16 S, 18 S, 23 S 25 S, 5 S and 5.8 S ribosomal sequences were downloaded from GenBank. These were used to create plasmid and ribosomal BLAST databases respectively. Virus sequences from the NCBI RefSeq database were downloaded and queried against the plasmid and ribosomal RNA databases using a megablast query and an e-value cut-off of 1 × 10 −10 . Any virus sequences which matched either plasmid or ribosomal sequences were removed, and the remaining virus sequences used to create a reference sequence virus BLAST database. The BLAST package version 2.6.0 from NCBI was used to create the local databases and perform the database queries.
A BLASTN query was performed on the reads from each sample against the initial virus database with an e-value cut-off score of 1 × 10 −3 . A BLASTN query against the virus reference sequence database was performed with an e-value cut-off of 1 × 10 −10 . A TBLASTX query against the virus reference sequence database was performed with an e-value cut-off of 1 × 10 −10 . BLAST query results files were converted into spreadsheet files, sorted by virus matches, and a list of potential virus targets created for each sample. The list was then manually inspected to identify viruses of interest.
Mapping of sequence reads was performed against reference genomes of the viruses of interest using TMAP plugin on the Ion Torrent server. The full or partial consensus sequences of the viruses were obtained from TMAP using Integrative Genomics Viewer software (IGV) (Broad Institute, MA, USA). Consensus sequences were generated from mapped sequences with a mapping quality score of 20 or higher and a coverage depth of at least 3.
AssemblerSPAdes 5.6.0 plugin on the Ion Torrent Server was used to generate contigs from the sequence reads of each sample. The contigs were queried against the virus reference sequence database using BLASTN and TBLASTX to identify virus sequences as described above. Contigs were then used as references in TMAP and trimmed to regions with a mapping quality of 20 or higher and a coverage depth of at least 3. The contigs of deltacoronavirus from the MAD faecal sample and selected viruses from the human samples were then selected for more in-depth analysis using MEGA software 63  Phylogenetic analysis of selected virus sequences. Among the identified viruses, 10 viruses which provided good quality and coverage of the NGS reads mapped to NCBI virus genomes as references and also covered the different virus genome types (dsRNA, ssRNA, dsDNA and ssDNA viruses) were selected for more detailed analysis. For these viruses, NGS reads were further analysed using MEGA software 63 as previously described in our laboratory 64,65 . To understand phylogenetic relationships among the virus sequences, they were compared to the NCBI database using nucleotide BLAST (Basic Local Alignment Search Tool) 66,67 and aligned using Clustal-W 68 and some cases using Clustal-W (Codons) or Muscle using the MEGA 6 or 7 software 63,69 . Then the sequences were manually trimmed to put them in frame. Maximum Likelihood phylogenetic analyses were conducted using the MEGA 6 or 7 software 63,69 , and the best model for generating a phylogenetic tree was determined (data not shown). The robustness of different nodes was assessed by bootstrap analysis using 1000 replicates for nucleotide alignments and 500 replicates for amino acid alignments. The distance between the branches based on the number of nucleotide differences was calculated using the same software.