Genomic analysis of phylogroup D Escherichia coli strains using novel de-novo reference-based guided assembly

Escherichia coli are highly diverse bacteria with different pathogenic types, serotypes and phylogenetic types/phylotypes. In recent years, infections with E. coli have increased worldwide and so has the emergence of antibiotic resistant strains. In the present study we have assembled, annotated and analysed genome sequences of three strains of the phylogroup D of E. coli. These strains were isolated from the river Yamuna, a prominent anthropogenic urban river of northern India. These strains showed varied antibiotic susceptibilities, one was susceptible to all the antibiotics tested except ampicillin while of the other two, one was multi-β-lactam resistant and the other was multi-drug resistant (resistant to multiple β-lactams, fluoroquinolones and kanamycin). The short-sequence reads were assembled into contigs using the de-novo approach and further, scaffolding of contigs was performed by using the best reference genome for a particular isolate which resulted in a significant increase in the N50 value of each assembly. The bioinformatics assembly approach used in this study could be easily applied to study other bacterial genomes.


Background & Summary
Escherichia coli are highly diverse bacteria with different pathogenic types, serotypes and phylogenetic types/ phylotypes 1,2 .They were earlier believed to be the inhabitants of the lower intestinal tract of human beings and warm-blooded animals which reached the environment through the faecal discharge and waste-water treatment plants 3 .However, recent studies have shown that besides host, E. coli can successfully survive in various ecological niches like soil, water and plantations [3][4][5][6][7][8] .Genomic and phylogenetic studies have identified divergent lineages of E. coli which can survive in various ecological niches, well-adapted to a non-host lifestyle 9 .Commensal strains of E. coli (gut-associated, non-pathogenic) mostly belong to phylogroups A and B1 while the pathogenic strains belong to phylogroups B2 and D 10 .
Investigating the bacterial population inhabiting the local water bodies is important because it reflects the bacterial clones perpetuating in the human population of the adjoining areas.Thus, E. coli from water bodies have been extensively investigated by several researchers and concerns over public-health risks associated with such contaminated water have been raised [11][12][13][14][15][16][17][18][19][20][21][22] .Moreover, water bodies can also act as genetic reactors in a manner similar to the host intestine, facilitating genetic exchange of virulence and antimicrobial resistance genes among different bacteria as these are often associated with mobile genetic elements like plasmids, transposons, insertion sequences etc 20,23,24 .
As per the latest records (retrieved in October 2022) 226,541 genome sequences of E. coli were present in the EnteroBase, of which 646 sequences were from strains isolated from India Supplementary Table 1.Deducing the phylogroups of these strains was difficult as the EnteroBase records did not specify their phylogroups and most of the original research papers describing these strains did not divulge their phylogroups.Thus, to the best of our knowledge this is the first report describing the genomic features of three E. coli phylogroup D strains isolated from India.The strains E. coli IP9, E. coli KKA and E. coli IPE were earlier isolated from the river Yamuna, a major anthropogenic urban river of India 25,26 .The 22 km stretch of the river that traverses through the National Capital Region of India from which these strains were isolated, receives effluents from industries, hospitals, local population etc.The antibiotic susceptibilities/resistance of these strains were determined by the standard disk diffusion assay and confirmed by gene sequencing of the relevant genes, and the results have been published 25,26 .Among these strains, E. coli IP9 was susceptible to all antibiotics tested, except ampicillin, E. coli KKA was a multi-β-lactam resistant strain 25,26 and E. coli IPE was a multidrug resistant strain (resistant to multiple β-lactams, fluoroquinolones and kanamycin) 25,26 .
Despite the presence of a large number of genome sequences in the databases, it is very difficult to select the most appropriate genome as the template for the assembly.Hence, in the present study we have adopted a novel approach to assemble the whole genomes of these strains by combining both de-novo and reference-based approaches.Named as de-novo reference-based guided assembly, this approach helps the user in selecting the best reference genome from the public databases for a better-read assembly/contig scaffolding.Since these isolates showed varied antibiotic susceptibilities, we have also annotated their genomes to predict/identify genes for antimicrobial resistance and virulence factors.Also, we have provided Gene Ontology (GO) annotations that will be useful in analysing the role of individual genes.

Methods
DNa isolation, library preparation and sequencing.The three E. coli strains preserved as glycerol stocks (50% v/v) in a −80 °C deep refrigerator in our laboratory were revived by overnight incubation in LB broth at 37 °C, 200 rpm.Bacteria were grown up to the exponential phase (OD 600 = 0.8) and harvested by centrifugation at 8000 rpm for 8 minutes at 4 °C.Genomic DNA was extracted using Nucleospin Microbial DNA kit (Macherey Nagel, Germany) and quantified by Qubit 4 fluorometer (Thermo Fisher Scientific, USA).These strains were isolated from the river Yamuna traversing through the National Capital Region of Delhi and identified/characterized by standard microbiological methods described earlier 25 .The DNA sequencing library was prepared by QIASeq FX DNA Library Kit (Qiagen, USA).Quantitative and qualitative library QC was done by Qubit 4 fluorometer (Thermo Fisher Scientific, USA) and TapeStation 4150 (Agilent technologies, USA), respectively.The libraries were sequenced on NovaSeq.6000 (Illumina, USA) using 2 × 150 bp paired end reads with an insert size of 250-350 bp.The raw data statistics is shown in Table 1.
Selection of closest reference for scaffolding.The trimmed short paired-end reads obtained after Trimmomatic run were used to calculate the mash value against the 1875 E. coli complete genomes, downloaded  from the NCBI database.Using plentyofbugs (https://github.com/nickp60/plentyofbugs)we identified the best reference genome for each strain.Then the trimmed reads of each strain were aligned against these respective best reference genomes using Bowtie 2 31 and the percentage of alignment was calculated.The best reference genomes selected by the plentyofbugs for the three E. coli strains were different.For IP9 it was NZ_CP025859.1 (identity 77.60%), for KKA it was NZ_CP018206.1 (identity 96.34%) and for IPE it was NZ_CP046396.1 (85.39%).
Closest reference-based scaffolding.De-novo assembled contigs of each isolate were scaffolded using their respective identified reference genome to fill the gaps between the assembled contigs using the Align-graph 32 tool.The number of contigs generated after scaffolding were 124, 75 and 137 for IP9, KKA and IPE, respectively.
The N 50 values were assessed using QUAST (v5.0.2) which were 137,460 bp, 283,877 bp and 295,668 bp for E. coli strains IP9, KKA and IPE, respectively.The complete flow chart of the methodology is shown in Fig. 1.Further, BUSCO 33 analysis revealed that the genomes showed 100% gene completeness Fig. 2.

Genome annotations.
The assembled genomes were annotated using the prokaryotic genome annotation pipeline Prokka 34   Sequence type (ST), phylotyping, virulence factors and antimicrobial resistance genes identification.To identify the sequence type (ST), phylotyping of each strain was performed using EzClermont 37 (v 0.7.0) and serotyping was done using ECTyper 38 .Virulence factors were checked using VFDB 39 (Virulence Factors Database) using BLASTn (parameters: identity = 50%, query coverage = 50%, e-value = 1e-6).VFDB revealed the presence of genes encoding for virulence factors like adhesion, effector delivery system, exotoxin, invasion, mobility and nutritional/metabolic factors.Mobile genetic elements were found to be present in all the E. coli strains while genes for biofilm formation were found only in multi-β-lactam resistant and multidrug resistant strains.Several antimicrobial resistance genes were present only in multi-β-lactam resistant and multidrug resistant strains (KKA and IPE) but absent in the ampicillin resistant strain (IP9) as discerned using antibiotic resistance gene databases ResFinderdb 40 at BLASTn parameters, identity = 75%, e-value = 1e-6.The presence of bacterial drug efflux pump genes was checked using BacEffluxPred 41 .The results have been summarized in Supplementary Table 2.

Data records
DNA sequencing data was submitted to NCBI Sequence Read Archive (SRA) database under the IDs: SRR21887619 42 , SRR21887618 43 and SRR21887617 44 and are associated with the NCBI BioProject accession number PRJNA890036.The genome assemblies are available under NCBI GenBank accession number GCA_026183935.1 45 , GCA_026183955.1 46 and GCA_026183915.1 47 .The genome annotation and gene ontology files are publicly available at Figshare 48 .
technical Validation optimization.Improvement in the N 50 values of the assemblies after reference-based scaffolding strongly indicated the benefits of our approach.To find the least identity percentage between the sample reads and the genome sequence in the database till which a genome sequence may be used as a template for scaffolding, we determined the N 50 value of the assembled genomes before and after scaffolding at different identity percentages.
For this, we calculated the Mash values between the sample reads and all the E. coli genomes available in the NCBI genome database using Mash tool 49 (v2.3) at a default kmer size of 21.To reduce the time required to scaffold a large number of assembled genomes, the Mash value was normalized in the range of 0-1.Starting from 0, a sliding cut-off of 0.05 was used to select the genomic scaffold.Based on this, 16, 12 and 13 genomes were selected as references for IP9, KKA and IPE, respectively.We also aligned the sample reads with the selected E. coli genomes to calculate the % of alignment using Bowtie 2 and SAMtools 50 (--flagstat).Initially, de-novo assembly of all the three sample reads were performed using Unicycler and then scaffolding was done using AlignGraph with all the selected genomes as reference.The N 50 value before and after scaffolding was calculated using QUAST.Our results

Fig. 2
Quality assessment of the assembled genomes.BUSCO analysis showed 100% gene completeness in genomes of all the three strains, with no fragmented or missing gene orthologs.
revealed that as the Mash value (i.e., distance) between the genome investigated and the references increased, the % of alignment and N 50 both decreased 51 Fig. 3a-c.We also found that when sample reads were aligned with at least 75% of the reference genome, the N 50 value after scaffolding was roughly equal to the de-novo N 50 value 51 Fig. 3d.This suggests that selection of the reference for scaffolding should be based on two considerations: (a) at least 75% of the sample reads should be aligned to the reference genome and (b) among all the available genomes in the database, the reference genome should have minimum Mash value.

Benchmarking.
To benchmark the performance of our approach vis-à-vis de-novo and reference-based assembly, we compared the N 50 values, contig numbers and contig distribution of the assemblies obtained by all the three methods.Reference-based assembly was performed by aligning trimmed pair-end reads to the reference genome using Bowtie 2. On the basis of aligned reads, a consensus sequence was created using Medaka v1.3.3 (https://github.com/nanoporetech/medaka). The consensus sequences were considered as an assembled genome.In all the three genomes, the highest N 50 values were discerned with the de-novo reference-based guided assembly approach i.e., 295,668 bp for IPE, 283,877 bp for KKA and 137,460 bp for IP9.The N 50 values with the de-novo approach were 252,171 bp for IPE; 178,741 bp for KKA and 129,988 bp for IP9.The reference-based assembly gave the lowest N 50 values i.e., 240,470 bp for IPE; 267,493 bp for KKA and 112,332 bp for IP9.
In context to the contig numbers, no correlation was observed between the number of contigs and the approach used for assembly.The number of contigs generated after de-novo reference-based guided assembly approach were 137 for IPE, 75 for KKA were and 124 for IP9.With the de-novo approach, the number of contigs were 183 for IPE, 100 for KKA and 139 for IP9.With the reference-based approach, the number of contigs were 85 for IPE, 74 for KKA and 461 for IP9.
Analysis of the contig lengths indicated that the length of the contigs with de-novo reference-based guided assembly approach was greater than the other two assembly approaches.Interestingly, with increase in the percentage identity of the investigated genome with the reference genome, lengths of the contigs also increased Fig. 4. Overall, results of the benchmarking indicated that de-novo reference-based guided assembly approach gave a higher N 50 value and contig length.
Quality assessment (3 C's) of the genome assemblies.Contiguity.Contiguity assessment of the three E. coli genome assemblies with our approach in comparison with the de-novo and reference-based assembly was done on the basis of the N 50 values.The N 50 value of each assembly was calculated using QUAST (QUality ASsessment Tool).The N 50 values of the E. coli strains IP9, KKA and IPE using the de-novo approach were 129,988 bp, 178,741 bp and 252,171 bp, respectively.The N 50 values of the E. coli strains IP9, KKA and IPE using reference-based approach were 112,332 bp, 267,493 bp and 240,470 bp, respectively.While, the N 50 value of the E. coli strains IP9, KKA and IPE using de-novo reference-based guided approach were 137,460 bp, 283,877 bp and 295,668 bp, respectively.The maximum increment in N 50 was recorded with the de-novo reference-based guided approach Supplementary Table 3.
Correctness.The correctness of the assembly was assessed by remapping the trimmed pair-end reads on the assembled genomes with all the three approaches using short read aligner Bowtie 2. The percentage (%) of alignment of reads with the de-novo assembly approach was 96.65% for IPE, 97.15% for KKA and 93.11% for IP9.The percentage (%) of alignment of the reads with the reference-based assembly approach was 84.43% for IPE, 96.05% for KKA and 77.75% for IP9.The percentage (%) of alignment of reads with the de-novo reference-based guided assembly was 96.88% for IPE, 97.47% for KKA and 93.41% for IP9.The correctness of the assembly was found to be maximum with the de-novo reference-based guided approach Supplementary Table 3.
Completeness.Assessment of completeness of genome assemblies of the three E. coli strains was done by BUSCO (Benchmarking Universal Single-copy Ortholog) using lineage dataset bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124) and enterobacterales_odb10 (Creation date: 2021-02-23, number of genomes: 212, number of BUSCOs: 440).The assessment of BUSCO is based on the number of single-copy orthologs that are shared among the newly assembled genome and higher taxonomic groups.A high fraction of complete BUSCOs indicates completeness of assembly while a high proportion of missing and fragmented BUSCOs indicates incompleteness of the assembly.Using bacteria_odb10 124 BUSCOs and with enterobacterales_odb10, 440 BUSCOs were discerned in the three genomes assemblies and none was found to be fragmented or missing with the de-novo reference-based guided approach.Similarly, none of the genome assemblies was fragmented or missing using the de-novo approach.Though IP9 and KKA genome assemblies were complete, IPE assembly was found to be fragmented with the reference-based approach Supplementary Table 3.This indicated that the de-novo reference-based guided and de-novo approaches give complete genome assemblies than the reference-based approach.The evaluation of completeness of the 16S rRNA genes of all the strains by the three assembly approaches by BLAST against the 16S rRNA database (NCBI) revealed ≥99% alignment in the E. coli 16S rRNA sequences present in the database and the assemblies.The technical checks for assessing the quality of the assembled genomes like, N 50 value, BUSCO, % of read alignment and reference genome selection using plentyofbugs suggests that genome assembly using the de-novo reference-based guided assembly is better than de-novo assembly and reference-based assemblies.
The contig distribution pattern based upon the contig number and the length of each contig suggested that the selection of approach for genome assembly should be based upon two considerations: (a) when the identity between the reference and the reads is less than 75%, both de-novo and de-novo reference-based guided assembly approaches are good, because both give similar N 50 values.However, the reference-based approach is unsuitable in these cases, as it gives a distorted assembly with a very high contig number and very low N 50 value.(b) when the identity between the reference and the reads is more than or equal to 75%, de-novo reference-based guided assembly is better than de-novo and reference-based assemblies because it gives better contiguity (N 50 values), completeness and correctness of the assembly Fig. 4.

Usage Notes
The assembly method was built using the well-known genome assemblers and tools hence, can be easily integrated in any genome analysis workflow.Further, all the software commands have been compiled under one wrapper code, making it convenient and user-friendly for analysing prokaryotic genomes.We believe that the methods described here would help the scientific community in faster/better assembly using short read sequencing.

Fig. 3
Fig. 3 (a-c) Depiction of N 50 values (blue colour) and percentage alignment (pink colour) of sequencing reads and reference genomes vis-à-vis Mash values discerned through the de-novo reference-based guided approach for E. coli strains (a) E. coli IP9 (b) E. coli KKA (c) E. coli IPE; (d) composite depiction of comparative N 50 values of the three E. coli strains using de-novo reference-based guided assembly (magenta dots) and de-novo assembly (green dots).The orange dot represents the N 50 value of the assembly when 75% of the sequence reads were aligned with reference genomes.

Fig. 4
Fig. 4 Contig distribution pattern plot depicts the contig length distribution, contig number and N 50 value after de-novo, reference-based and de-novo reference-based guided approach for genome assemblies of (a) E. coli IPE (b) E. coli KKA (c) E. coli IP9.

Table 1 .
Sequencing raw data statistics of E. coli strains.

Table 2 .
E. coli strains genome assembly statistics.