Nanopore metatranscriptomics reveals cryptic catfish species as potential Shigella flexneri vectors in Kenya

Bacteria in the Shigella genus remain a major cause of dysentery in sub-Saharan Africa, and annually cause an estimated 600,000 deaths worldwide. Shigella is spread by contaminated food and water, and this study highlights how wild-caught food, in the form of freshwater catfish, can act as a vector for Shigella flexneri in Southern Kenya. A metatranscriptomic approach was used to identify the presence of Shigella flexneri in catfish which had been caught for consumption from the Galana river. Nanopore sequencing was shown to be a simple and effective method for detecting Shigella flexneri and could represent a potential new tool in the detection and prevention of this deadly pathogen. Rather than the presence/absence results of more traditional testing methods, the use of metatranscriptomics highlighted how primarily one SOS response gene was being transcribed, suggesting the bacteria may be dormant in the catfish. Additionally, COI sequencing of the vector catfish revealed they likely represent a cryptic species. Morphological assignment suggested the fish were widehead catfish Clarotes laticeps, which range across Africa, but the COI sequences from the Kenyan fish are distinctly different from C. laticeps sequenced in West Africa.

Zoonotic pathogens pose an ongoing threat to human society, as acutely highlighted by SARS-CoV-2, the causative agent of COVID-19 1 . Generally, zoonotic viruses originate in mammalian or avian hosts, with no known zoonotic viruses originating in fish 2 . However, fish can be sources of a number of bacterial and parasitic diseases 3,4 . The primary pathogens from fish that pose a risk to humans are bacterial, with pathogens such as Salmonella sp., Campylobacter sp., Escherichia coli, Listeria monocytogenes and Yersinia sp. being responsible for most foodborne outbreaks from fish worldwide 5 . It is also likely that undescribed zoonotic pathogens are circulating in wild fish, with developing countries in the tropics predicted to be the highest risk areas for future zoonoses to emerge 6 . Therefore, in this paper, a pilot study for an early warning system using a shotgun sequencing approach was carried out to determine whether any pathogens could be identified from wild-caught widehead catfish Clarotes laticeps found in the Galana river in South East Kenya. Clarotes laticeps are an important protein source for local communities and are routinely harvested directly from the river. The species is a tropical, freshwater catfish in the family Claroteidae 7 , which is present in all Nilo-Sudanese basins 8 in addition to East Africa and the Nile 9 .
A number of bacterial pathogens are known to be associated with the consumption of catfish species (both farmed and wild caught), such as Salmonella, Campylobacter, E. coli, L. monocytogenes, Staphylococcus aureus and Vibrio 3 . Some bacteria, such as Salmonella [10][11][12] and Campylobacter 13 , are thought to be introduced to aquaculture ponds through the faeces of birds and wildlife, and faecal matter from ruminants is known to contaminate ponds with pathogenic E. coli 3 . Additionally, Salmonella has been detected in wild catfish 14 and Grimontia hollisae (formerly known as Vibrio hollisae) was implicated as the cause of septicaemia in a man who had eaten wild-caught catfish in the USA 15 . In Sub-Saharan Africa, wild-caught fish have been shown to be sources of bacterial enteropathogens, with one example in the Central African Republic specifically linking the presence of Shiga toxin-producing E. coli in wild fish and river water to the upstream slaughter of zebu cattle Bos indicus 16,17 . Similar to E. coli, Shigella spp. are also known to be introduced to water sources from anthropogenic sources such as untreated sewage and agricultural runoff 18 . Outbreaks of shigellosis due to Shigella spp. from contaminated water are known throughout Sub-Saharan Africa 19 , in addition to Asia 18,20 . Additionally, antimicrobial resistance genes have been detected in Shigella spp. in waterways, likely due to pollution in water sources from industry, veterinary medications and human medical treatment 18 .
The detection of bacterial pathogens, both from fish and other sources, has increasingly been based on polymerase chain reaction (PCR) approaches over the past two decades 21 . One significant drawback of this approach is that it can only be used to test for known pathogens, with multiple presence/absence tests needed to check for all potential pathogens. Second and third generation sequencing platforms such as Illumina's MiSeq and Oxford Nanopore's MinION, however, allow for shotgun sequencing approaches which can be used to sequence all DNA present in a sample, potentially revealing most microbial life present in one test, saving time and allowing for the rapid discovery of novel disease-causing pathogens. Third generation nanopore sequencing has been used in a number of shotgun metagenomic studies to identify pathogens, such as diagnosing the cause of infections in orthopaedic implants 22 , identifying oral bacteriophages 23 and characterising the microbiomes of preterm human babies 24 . The approach has also been used to identify waterborne pathogens, with Reddington et al. 25 showing how wastewater influents increased the abundance of Arcobacter when compared to cleaner parts of the Havelse river in Denmark.
In this study, a shotgun transcriptomics approach was used in an attempt to identify pathogens found in wild widehead catfish in the Galana river in South-East Kenya. In addition to shotgun sequencing, the cytochrome oxidase I (COI) gene of the fish was sequenced using Sanger sequencing, in order to confirm the species of catfish found in the river. Although the widehead catfish in the Galana river is currently classified as C. laticeps, there remains taxonomic uncertainty as to whether this population of catfish are in fact C. laticeps 26 .

Methods
Sample collection and sequencing. Four catfish samples were collected between the 7th and 13th of August, 2019 from three points along the Galana river (Fig. 1). The deceased catfish, which had been caught using a baited hook and line, were gifted by local scouts and samples were taken from remains prior to preparation as food. The catfish sampled were identified as widehead catfish based on morphological characteristics. Fish were dissected in a sterile metal tray, with utensils sterilised by ethanol and flamed between samples. Single muscle and heart samples were taken from each fish, but in the process of dissecting the fish the intestines were significantly nicked, meaning the heart samples were likely contaminated with intestinal material. Tissue samples were placed in RNAlater (Invitrogen, USA) in 2 mL O-ring tubes and kept at ambient temperature in the field for up to eleven days, and later stored at −80 °C.
The four heart samples were homogenised using a TissueLyser (Qiagen), and RNA was then extracted using a QIAamp Viral RNA Mini Kit (Qiagen) via a QiaCube extraction machine (Qiagen). The RNA was converted
Nanopore data analysis. Guppy v.3.2.10 (Oxford Nanopore Technologies) was used to base-call the output from the MinION sequencing run. Porechop (https://github.com/rrwick/Porechop) was used to remove the adaptor sequences from the MinION sequence data. Porechop was also originally used to sort samples by barcodes, but as 215,028 reads (89.44% of reads) were left unassigned, and due to the low read depth, it was decided to pool the reads for all four catfish samples for further analysis.
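As a rough illustration of the pooling decision described above, the proportion of reads left without a barcode can be recomputed from the counts reported in this study (215,028 unassigned reads out of the 240,429 reads remaining after adaptor removal, as given in the Results). The cut-off `max_unassigned_fraction` is a hypothetical parameter introduced purely for illustration; no formal threshold is stated in the text.

```python
def unassigned_fraction(unassigned_reads: int, total_reads: int) -> float:
    """Return the fraction of reads that could not be assigned to a barcode."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return unassigned_reads / total_reads


def should_pool(unassigned_reads: int, total_reads: int,
                max_unassigned_fraction: float = 0.5) -> bool:
    """Suggest pooling when too many reads lack a barcode.

    The default cut-off of 0.5 is an assumption for this sketch,
    not a value taken from the study.
    """
    return unassigned_fraction(unassigned_reads, total_reads) > max_unassigned_fraction


# Read counts reported in the paper.
fraction = unassigned_fraction(215_028, 240_429)
print(f"Unassigned reads: {fraction:.2%}")
print("Pool samples:", should_pool(215_028, 240_429))
```

With most reads unassigned, any reasonable cut-off would favour pooling, which is consistent with the choice made here.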
In order to first assess the microbial organisms present, the fastq reads were processed using the What's In My Pot (WIMP) workflow on Oxford Nanopore's EPI2ME platform (https://epi2me.nanoporetech.com/, Oxford Nanopore Technologies). Following the WIMP analysis, a reference fasta database was generated using an existing genome for North American yellow catfish Tachysurus fulvidraco (Genbank Assembly Accession: GCA_003724035.1) to act as a proxy for the host genome. Additionally, a genome for the parasite Schistosoma haematobium (Genbank Assembly Accession: GCA_000699445.2), the 12,642 existing viral sequences from ray-finned fish Actinopterygii hosts available on the NCBI virus database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) and the genomes for the 23 bacterial and fungal species identified through WIMP (Table 1) were added to the reference fasta file. The fastq reads were mapped against the reference fasta database using the NanoPipe web server (http://www.bioinformatics.uni-muenster.de/tools/nanopipe2/index.hbi?) 27 . The consensus sequences for mapped reads were then subject to a BLASTn search (https://blast.ncbi.nlm.nih.gov/Blast.cgi), and consensus sequences which did not correspond to any existing sequences were removed from the final result.
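The composite reference described above amounts to concatenating several FASTA sources into one file. A minimal sketch of that step is given below, assuming each source is available as FASTA text; the function name `merge_fasta`, the `label|` header prefix and the toy sequences are all hypothetical, and the input format actually required by NanoPipe is not specified in the text.

```python
from typing import Dict


def merge_fasta(sources: Dict[str, str]) -> str:
    """Concatenate FASTA texts, tagging each header with its source label.

    `sources` maps a label (e.g. "host", "bacteria") to FASTA-formatted text.
    The label prefix is an illustrative convention so that mapped reads can
    later be grouped by category; it is not part of the published workflow.
    """
    merged_lines = []
    for label, fasta_text in sources.items():
        for raw_line in fasta_text.splitlines():
            line = raw_line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                merged_lines.append(f">{label}|{line[1:]}")
            else:
                merged_lines.append(line)
    return "\n".join(merged_lines) + "\n"


# Toy sequences only; the real inputs would be the genomes listed in the text.
reference = merge_fasta({
    "host": ">scaffold_1\nACGTACGT\n",
    "bacteria": ">chromosome\nGGCCTTAA\n",
})
print(reference, end="")
```

Tagging headers by source is one simple way to tally mapped reads per category afterwards.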
Sequencing and analysing the COI of the host. A separate DNA extraction was performed on the catfish muscle samples collected, with samples being incubated at 56 °C for 2 h in a mix of 12 µl of proteinase K (20 mg/ml) and 400 µl of 10% Chelex solution, followed by 15 min at 99 °C. Samples were then centrifuged for 1 min at max speed (20,817 × g) and 150 µl of DNA supernatant was placed in a new 1.5 mL Eppendorf tube. A  29 , and the consensus sequences were initially analysed using BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Following this, the corresponding aligned sequences from the top 100 BLAST hits were downloaded in fasta format and added to an alignment containing the four sequences for the Galana widehead catfish. The alignment file was then uploaded to the IQ-TREE web server for model selection (http://iqtree.cibiv.univie.ac.at/) 30 , and the web server of the RAxML program (https://raxml-ng.vital-it.ch) 31 was used to generate an ML tree, with 100 bootstrap replicates performed.

Results
MinION sequence analysis. Following adaptor removal, 240,429 reads remained, of which 68,908 were successfully mapped to one of the reference sequences using NanoPipe 27 . Of the 68,908 mapped reads, 59,849 mapped to the proxy host (Fig. 2), 9,031 mapped to bacterial sequences, 22 mapped to fungal sequences, and 6 mapped to viral sequences. The majority of the 9,031 bacterial reads (99.06%, 8,938 reads) (Fig. 2) mapped to the genome of Shigella flexneri, specifically to a predicted response regulator (Genbank Accession: CP045522.1, position 4,265,154-4,265,537) which shares protein homology with the two component response regulator gene DpiA in Shigella boydii 32 . Further investigation of the viral and fungal mapped reads revealed these to be ambiguous, and no bacteriophages were detected.
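The read counts above can be tallied with a few lines of code, which is a convenient check that the categories sum to the reported number of mapped reads. The sketch below uses only the counts stated in this section; the variable and function names are illustrative.

```python
# Mapped read counts per category, as reported in this section.
mapped_reads = {
    "proxy host": 59_849,
    "bacterial": 9_031,
    "fungal": 22,
    "viral": 6,
}
reads_after_adaptor_removal = 240_429


def share(count: int, total: int) -> float:
    """Return `count` as a percentage of `total`."""
    return 100.0 * count / total


total_mapped = sum(mapped_reads.values())

for category, count in mapped_reads.items():
    print(f"{category:>10}: {count:>6} reads "
          f"({share(count, total_mapped):.2f}% of mapped reads)")

print(f"Total mapped: {total_mapped} of {reads_after_adaptor_removal} reads "
      f"({share(total_mapped, reads_after_adaptor_removal):.2f}%)")
```

The four categories sum to 68,908, matching the total number of mapped reads reported.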
Host COI analysis. Clean, unambiguous Sanger sequences were generated for all four widehead catfish samples, with all samples sharing the same haplotype. A BLAST search of the consensus sequence revealed the closest match to be a sequence for Bathybagrus tetranema (Accession number: HG803463) 33 at 93.75% similarity. The consensus sequence showed 91.89% similarity to an existing Clarotes laticeps sequence on GenBank (Accession number: HG803491) 33 .
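For readers unfamiliar with how a similarity figure such as 93.75% arises, the sketch below computes percent identity between two pre-aligned sequences as identical columns divided by alignment length. This is a simplified stand-in for the value BLAST reports, and the two sequences are toy examples, not the COI sequences generated in this study.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned sequences of equal length.

    Identity is counted as identical, non-gap columns divided by the
    total alignment length. This is a simplification for illustration.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1 for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Toy 16-column alignment with a single mismatch in the final column.
query = "ACGTACGTACGTACGT"
subject = "ACGTACGTACGTACGA"
print(f"{percent_identity(query, subject):.2f}% identity")
```

In the toy alignment, 15 of 16 columns match, so one mismatch in a short alignment is already enough to give 93.75%; real COI comparisons span several hundred bases.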
The maximum likelihood tree (Fig. 3) showed that whilst the host catfish sequence is not closely related to any existing catfish sequences, it does cluster within the family Claroteidae. The Galana widehead catfish show a deep split from the widehead catfish sequenced from Nigeria (Fig. 3), and branch off near the base of the Claroteidae cluster.

Discussion
Metagenomic approaches using second generation sequencing platforms have been used previously to identify foodborne pathogens directly from a source animal 34 ; however, it would appear that this study represents the first time a metagenomic approach using nanopore sequencing has been used to detect a foodborne bacterial pathogen directly from an animal. Previous studies have shown the effectiveness of MinION's nanopore sequencer as a tool for sequencing foodborne bacterial pathogens [35][36][37] , but these studies have relied on bacteria that were isolated from a food source and then cultured prior to sequencing. It is estimated that approximately 99% of prokaryotes are unable to be cultured in the laboratory with currently available methods 38 , which highlights the importance of developing diagnostic workflows which are culture independent, such as the approach outlined in this study. The identification of Shigella flexneri in the catfish in this study highlights the potential of nanopore metagenomics as a relatively simple and effective way to detect human pathogens in one test.
The causative agent of human shigellosis, Shigella is a genus of gram-negative bacteria that cause diarrhoea and dysentery, being a major cause of moderate-to-severe diarrhoea in sub-Saharan Africa 39 . Annually, Shigella is estimated to cause 600,000 deaths from 80 to 165 million cases worldwide 40 . In the developing world, S. flexneri is the predominant cause of shigellosis, with S. sonnei being the predominant species in industrialised countries 40 . The faecal-oral route is the primary way by which Shigella spreads, with transmission also documented via contaminated food, drinking water and flies 41 . The S. flexneri gene that the majority of the reads mapped to, DpiA, has been shown to interfere with plasmid maintenance when overexpressed, inducing the SOS response 42 , which is a bacterial response that promotes dormancy in unfavourable environmental conditions 43 . The overabundance of the DpiA gene could imply that the bacteria were in a dormant state, but as other genes known to be involved in the SOS response were not detected with this approach, this cannot be confirmed. It may be the case that DpiA is expressed at much higher levels than the other genes in the SOS response, and a deeper sequencing effort might reveal further genes being expressed. Additionally, the overexpression of DpiA has been shown to be part of the bacterial defence following exposure to β-lactam antibiotics in E. coli 44 , which could imply the presence of antimicrobial resistance as seen with Shigella spp. in other water sources 18 .
It is unclear how the catfish became carriers of S. flexneri, so it can only be speculated how the bacteria may have been introduced to the fish based on the known biology of the pathogen. Although the Shigella RNA was extracted from the heart tissue, we believe it was in fact present in the intestinal tract of the catfish. Whilst the fish were dissected in a sterilised tray, clean gloves were worn and utensils were flamed before each dissection, the intestines were significantly nicked in the process of initially opening the fish, meaning the heart samples were likely contaminated with DNA from material within the intestines. The primary hosts of S. flexneri are humans and primates 45 , but it has also been detected in rabbits 46 , cattle 47 , pigs 48 and chicken 45 . Previous studies have also shown that bacterial pathogens can be introduced to fish in water bodies via the faeces of cattle, birds and wild animals 3,[10][11][12][13]16 . This could imply that S. flexneri was introduced to the river either via the large herds of cattle which are brought to the river to drink, or through the various wild animals that use the river, and was subsequently ingested in sediment by the bottom-feeding catfish. Species such as hippopotamus and elephant are known to be sources of other bacterial diseases, such as anthrax, brucellosis, tetanus and salmonellosis (hippopotamus) 49 and tuberculosis (elephant) 50 . In addition to the potential animal sources listed, a number of villages located approximately 150 km upstream of the sampling site, on the western side of Tsavo East National Park, are suspected to be primary sources of untreated sewage into the Galana river (John Byrne, University College Dublin, pers. comm.).
The sequence analysis of the host catfish COI also revealed that it was not the species Clarotes laticeps as suggested based on morphology, but the phylogeny does suggest the fish belongs in the family Claroteidae. The species C. laticeps is found naturally in West Africa, the Nilo-Sudanese basin and East Africa 8,26,51,52 , and the other C. laticeps sequence included in the analysis (Fig. 3) was collected in the Anambra river in Nigeria 53 , which is ~4000 km away from the Galana river, and the two rivers are separated by at least two major watersheds. According to both Okeyo 54 and Seegers et al. 26 , and based on morphology, the only member of the family Claroteidae found in the Galana river is C. laticeps; however, Seegers et al. 26 added the caveat "the taxonomic status of the Kenyan populations is uncertain and needs detailed study". This would suggest that the fish examined in this study were in fact the fish that to date have been classified as C. laticeps based on morphology, but the genetic data have now revealed that this population of catfish is likely a cryptic species. Based on these findings we tentatively suggest a new species name Clarotes kambare, or Kenyan widehead catfish, with "kambare" being the Swahili word for catfish, although further in-depth morphological and genetic work will be needed to clarify the unique taxonomic status of this population of fish. Further studies are also required to determine the full geographic range of this species and its conservation status.
In conclusion, this study has highlighted how a shotgun metatranscriptomic approach can be used to identify human pathogens in wild-caught fish, and how it may highlight the transmission potential of enteropathogens in non-host animals. Further research is needed to explore the potential benefits of this approach over the presence/absence results of conventional PCR assays or 16S sequencing, as while we did detect one gene involved in the bacterial SOS response, multiple other genes involved in that response were not detected. Furthermore, the potential identification of the host as a cryptic species highlights the need for further populations of wild-harvested fish to be characterised genetically, as unknowingly managing a species complex as one species may lead to severe threats to local cryptic species, in addition to masking potential disease risks in the cryptic species.

Data availability
The raw fastq sequences generated during the MinION sequencing are available on NCBI GenBank under the accession number PRJNA785244, and the fasta sequences for the catfish COI sequences are available under the GenBank accession numbers OM176588-OM176591.