The phylogenomics of CRISPR-Cas system and revelation of its features in Salmonella

Salmonellae display intricate evolutionary patterns, comprising over 2500 serovars with diverse pathogenic profiles. The acquisition and/or exchange of various virulence factors influences this evolutionary framework. To gain insight into the evolution of Salmonella in association with the CRISPR-Cas genes, we performed phylogenetic surveillance across strains of 22 Salmonella serovars. The strains differed in their CRISPR1-leader and cas operon features, sorting into two main clades, CRISPR1-STY/cas-STY and CRISPR1-STM/cas-STM, comprising mainly typhoidal and non-typhoidal Salmonella serovars, respectively. With respect to the CRISPR1 leader and cas operon, serovars of each clade were more closely related to members of other genera than to serovars of the other clade. This suggests that the CRISPR1/Cas region could have been acquired through a horizontal gene transfer event, owing to the presence of mobile genetic elements flanking the CRISPR1 array. Comparison of the CRISPR and cas phenograms with that of multilocus sequence typing (MLST) suggests differential evolution of the CRISPR/Cas system. In contrast to broad-host-range serovars, host-specific serovars harbor fewer spacers. Mapping of protospacer sources suggested a partial correlation of spacer content with the habitat diversity of the serovars. Some serovars that inhabit similar environments or infect similar hosts, such as Enteritidis and Typhimurium, shared hardly any protospacer sources.


[Table body not recoverable. Surviving entries: Serovar Anatum; Serovar Paratyphi A; USDA-ARS-USMARC-1175; CP007483.2. Surviving host-range labels: broad host range; host-restricted (primates); cold-blooded animals, mainly reptiles, and sheep; host-restricted (primates, not frequent), known to infect other animals such as pigs; host-restricted (poultry and other avians); host-restricted (poultry).]

Supplementary Table S3. Name and accession number of the whole-genome sequence of each strain analysed in the study.
The CRISPR1 leader sequences of these strains were used for the phylogenetic analysis. The abbreviations are the key to the figure.

Figure S3 (panels a, c)

Figure S12 (panels a, b)

Figure S13 (column headers: Max, Total, Query, E, Per.)

CRISPR loci data collection in correct orientation:
Our study comprises 133 strains belonging to two species, S. bongori and S. enterica, including 22 serovars and three subspecies (supplementary table S1). These strains were originally isolated from multiple sources, including primates, poultry, swine, cattle, food specimens, and the natural environment (GenBank database). The complete genome sequences of all these annotated strains were obtained from the GenBank database. Only experimentally validated sequences were considered, to ensure the legitimacy of the data being used. The CRISPR loci were identified in two steps. First, the annotation and orientation of each CRISPR array were retrieved from the online CRISPR-Cas++ database 19 . Second, the upstream and downstream regions of these arrays were aligned with the leader sequences previously reported by 19 to determine the correct sequence orientation of the CRISPR array. The arrays were then classified as CRISPR1 or CRISPR2 after verifying the leader sequence and its position with respect to the cas operon 20 .
Most strains of S. enterica subsp. enterica had both the CRISPR1 and CRISPR2 arrays. However, all the analyzed strains of S. enterica subsp. enterica serovar Heidelberg, a few strains of serovar Typhimurium, and one strain of serovar Tennessee are reported to harbor more than two CRISPR arrays 19 . Our analysis instead showed that the CRISPR1 array of serovars Typhimurium and Heidelberg was divided into two parts by a stretch of 74 nucleotides consisting of two truncated spacers and a direct repeat (DR) (supplementary fig. S1A). The two parts of the CRISPR1 array, when concatenated, aligned well with the intact CRISPR1 array of other strains of serovar Typhimurium. Similarly, the CRISPR1 array of serovar Tennessee strain (str.) CFSAN070645 was divided into three parts (containing 19, 24, and 16 spacers) and the CRISPR2 array into two parts (containing 10 and 11 spacers), because mutated DRs rendered a stretch of 91 bp undetectable as part of the CRISPR array. Therefore, we treated the concatenated form of each of these CRISPR arrays as a single unit for further analysis. Our analysis also indicated the occurrence of a CRISPR1 array with two spacers in each of the serovars Dublin, Gallinarum, Pullorum, and Gallinarum/Pullorum. However, none of these CRISPR arrays was described as valid in the CRISPR-Cas++ database, and the CRISPRCasFinder software assigned them 27 bp long DRs and 34 bp long spacer sequences. Likewise, the CRISPR2 arrays of serovar Typhi and serovar Pullorum str. S06004 identified through our analysis were not detected by this database. The CRISPR2 array of serovar Typhi possessed only one erratic spacer, and that of serovar Pullorum str. S06004 had two spacers. We included all these strains and their respective CRISPR-Cas systems in our analysis.
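The leader-based orientation check described in this section can be illustrated with a minimal Python sketch. It is not the procedure used in the study, which relied on alignment against previously reported leader sequences. The function names, the mismatch threshold, and the ungapped sliding-window comparison are all assumptions made for illustration. The idea is that a correctly oriented array has the leader in its upstream flank, whereas a reverse-annotated array has the reverse complement of the leader in its downstream flank.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def best_mismatches(leader, region):
    """Smallest number of mismatches between the leader and any ungapped
    window of the same length in region; None if region is too short."""
    n = len(leader)
    if len(region) < n:
        return None
    return min(
        sum(a != b for a, b in zip(leader, region[i:i + n]))
        for i in range(len(region) - n + 1)
    )


def orient_array(upstream, array, downstream, leader, max_mismatches=2):
    """Return (strand, array in leader-to-trailer orientation), or None
    when no leader-like sequence is found in either flank.
    max_mismatches is an illustrative threshold, not a value from the study."""
    fwd = best_mismatches(leader, upstream)
    rev = best_mismatches(leader, reverse_complement(downstream))
    candidates = []
    if fwd is not None and fwd <= max_mismatches:
        candidates.append((fwd, "+", array))
    if rev is not None and rev <= max_mismatches:
        candidates.append((rev, "-", reverse_complement(array)))
    if not candidates:
        return None
    _, strand, oriented = min(candidates, key=lambda c: c[0])
    return strand, oriented
```

An ungapped comparison is the simplest choice here. A real leader search would more plausibly use a gapped aligner, since leaders can carry indels.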

Protospacer analysis:
The spacer sequences were extracted from the CRISPR-Cas++ database. The data for all strains belonging to a given serovar were combined, and a unique spacer set was created for that serovar. The spacer sequences were then uploaded to the CRISPRTarget tool 20 . The Genbank Phage, RefSeq-Plasmid, and IMGVR databases were selected as the target databases. The parameters for the initial BLAST screen in CRISPRTarget were kept at their defaults. The output gave the accession numbers of the protospacer sources corresponding to these spacers. Hits from Genbank Phage and RefSeq-Plasmid carried NCBI accession numbers, whereas hits from the IMGVR database carried accession numbers of the IMG/VR viral resource. The accession numbers of the protospacer hits were matched across serovars using a customized bash script, and a heat map was created from these matches.
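The pooling and matching steps above can be sketched in Python as follows. The study used a customized bash script that is not reproduced here, so this is only an illustrative equivalent. The function names and the input layout (dictionaries keyed by strain or serovar) are assumptions. The second function returns, for every pair of serovars, the number of protospacer-source accession numbers they share, which is the kind of matrix a heat map could be drawn from.

```python
from collections import defaultdict


def unique_spacers_by_serovar(strain_spacers, strain_serovar):
    """Pool the spacers of all strains of a serovar into one unique set.
    strain_spacers: {strain: [spacer sequence, ...]}
    strain_serovar: {strain: serovar name}
    Sequences are upper-cased so that case differences do not create
    spurious duplicates."""
    pooled = defaultdict(set)
    for strain, spacers in strain_spacers.items():
        pooled[strain_serovar[strain]].update(s.upper() for s in spacers)
    return dict(pooled)


def shared_source_counts(serovar_hits):
    """serovar_hits: {serovar: iterable of protospacer-source accessions}.
    Returns (sorted serovar names, square matrix of shared-accession counts);
    the diagonal holds each serovar's own number of distinct sources."""
    serovars = sorted(serovar_hits)
    matrix = [
        [len(set(serovar_hits[a]) & set(serovar_hits[b])) for b in serovars]
        for a in serovars
    ]
    return serovars, matrix
```

A call such as `shared_source_counts({"Enteritidis": hits_e, "Typhimurium": hits_t})`, where `hits_e` and `hits_t` are hypothetical accession sets parsed from the CRISPRTarget output, would yield a 2 x 2 matrix whose off-diagonal entries give the number of shared sources.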

Mapping of 236 spacer- and Cascade-binding sites obtained by ChIP-seq of Cas5:
The ChIP analysis performed by Stringer et al. 21 revealed 236 binding sites. We mapped these sites onto the complete genome (CP001363) available at NCBI and extracted the genes present at these sites. The functions of these genes were checked using UniProt. The genes with a role in virulence (supported by the literature) are marked in bold in Table S8.
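The site-to-gene mapping can be sketched as a simple coordinate lookup. This is a minimal Python illustration, not the procedure used in the study. It assumes that each binding site is a single genome coordinate and that gene annotations are available as (start, end, name) tuples, for example parsed from the GenBank record. The function name and the optional `flank` parameter are hypothetical.

```python
def genes_at_sites(sites, genes, flank=0):
    """Map each binding-site coordinate to the genes that contain it.
    sites: iterable of integer genome positions (assumed format)
    genes: iterable of (start, end, name) tuples with start <= end
    flank: optional number of bases by which each gene is extended on
           both sides, to also catch sites just outside a gene
    Returns {position: [gene names]}; intergenic sites map to []."""
    hits = {}
    for pos in sites:
        hits[pos] = [
            name
            for start, end, name in genes
            if start - flank <= pos <= end + flank
        ]
    return hits
```

A linear scan per site is adequate for a few hundred sites against a bacterial genome. An interval tree would only matter at much larger scale.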