Comparative analysis, distribution, and characterization of microsatellites in Orf virus genome

Genome-wide in-silico identification of microsatellites or simple sequence repeats (SSRs) in the Orf virus (ORFV), the causative agent of contagious ecthyma has been carried out to investigate the type, distribution and its potential role in the genome evolution. We have investigated eleven ORFV strains, which resulted in the presence of 1,036–1,181 microsatellites per strain. The further screening revealed the presence of 83–107 compound SSRs (cSSRs) per genome. Our analysis indicates the dinucleotide (76.9%) repeats to be the most abundant, followed by trinucleotide (17.7%), mononucleotide (4.9%), tetranucleotide (0.4%) and hexanucleotide (0.2%) repeats. The Relative Abundance (RA) and Relative Density (RD) of these SSRs varied between 7.6–8.4 and 53.0–59.5 bp/kb, respectively. While in the case of cSSRs, the RA and RD ranged from 0.6–0.8 and 12.1–17.0 bp/kb, respectively. Regression analysis of all parameters like the incident of SSRs, RA, and RD significantly correlated with the GC content. But in a case of genome size, except incident SSRs, all other parameters were non-significantly correlated. Nearly all cSSRs were composed of two microsatellites, which showed no biasedness to a particular motif. Motif duplication pattern, such as, (C)-x-(C), (TG)-x-(TG), (AT)-x-(AT), (TC)- x-(TC) and self-complementary motifs, such as (GC)-x-(CG), (TC)-x-(AG), (GT)-x-(CA) and (TC)-x-(AG) were observed in the cSSRs. Finally, in-silico polymorphism was assessed, followed by in-vitro validation using PCR analysis and sequencing. The thirteen polymorphic SSR markers developed in this study were further characterized by mapping with the sequence present in the database. The results of the present study indicate that these SSRs could be a useful tool for identification, analysis of genetic diversity, and understanding the evolutionary status of the virus.


Materials and methods
Genome sequences. The publicly available eleven complete genome sequences of ORFV isolates obtained from the NCBI database (www.ncbi.nlm.nih.gov) were used for genome-wide in-silico microsatellites analysis. To compare genomic sequences of different lengths, we calculated the Relative Density (RD) and Relative Abundance (RA) values. RD is defined as the total length (bp) contributed by each microsatellite per kilobase (kb) of sequence analyzed whereas; RA is the number of microsatellites present per kb of the genome (kb). Among all the strains, we have chosen OV-SA00 (Acc. number: AY386264) as the reference to evaluate the polymorphism of microsatellites through in-silico approach as well as the development of SSRs for Indian origin ORFV (Table 1).

Microsatellites identification, investigation, and statistical analysis.
For identification of perfect mono, di, tri, tetra, penta, hexa as well as compound microsatellites, IMEx software 37  www.nature.com/scientificreports/ lites from genomes were extracted using the ' Advance-Mode' of IMEx using the parameters previously used for RNA viruses 38,39 and DNA viruses 40 . The parameters used were as follows: type of repeat: perfect; repeat size: all; minimum repeat number: 6, 3, 3, 3, 3, 3 for mono, di, tri, tetra, penta and hexanucleotide repeats, respectively. The maximum distance allowed between any two SSRs (dMAX) is 10 nucleotides. Other parameters were used as default. Compound microsatellites (cSSRs) were not standardized in order to determine real composition. . Tissue samples in the form of scabs from four suspected goats were collected at both infective and recovery/convalescent phase and simultaneously treated for wounds with 2% boro glycerine and parenteral application of Enrofloxacin @ 5 mg/kg IM (Fig. 1). About 5 g of tissue samples were collected from each animal and subsequently dissolved in phosphate-buffered saline (PBS, pH 7.2) added with antibiotics and antifungal supplements in a labeled sterile tube. The homogenized samples were then treated with tissue lysis buffer containing proteinase K, and the mixture was incubated at 56 °C overnight. Finally, the mixture was passed through a column, and DNA was purified from the column by using the standard phenol-chloroform method as described by Sambrook et al. 43 and stored at − 20 °C until further use. The suspected samples collected during this outbreak produced the expected PCR-amplified fragment size of 140 bp using ORFV specific primers orf1 and orf2 44 having nucleotide sequences Orf1: 5′-CGC AGA CGT GGC TGA GTA CGT-3′ and Orf2: 5′-TGA GCT GGT TGG CGC TGT CCT-3′, which confirmed the presence of the virus.

Multiple sequence alignment and identification of polymorphic
Development of polymorphic SSRs. The polymorphic microsatellites identified through in-silico approach were further validated through in-vitro approach using ORFV positive clinical sample. Motifs located within defined flanking regions were PCR amplified using specially designed SSR-PCR primer pairs by Prim-er3Plus web tool (https ://www.bioin forma tics.nl/cgi-bin/prime r3plu s/prime r3plu s.cgi/). The primer length was kept between 18 and 22 bp with product size in the range of 130-200 bp. For proper annealing to the template DNA, the annealing temperature was adjusted between 54 and 61 °C. The thermal cycling conditions for all genes were as follows: initial denaturation step at 95 °C for 5 min, with 35 cycles of denaturation at 95 °C for 50 s, with varying annealing temperature for each set of primers (55-61 °C) and extension step at 72 °C for 90 s with a final extension at 72 °C for 7 min. PCR amplification was performed in a Thermal Cycler system 2,720 (Applied Biosystems, USA) ( Table 2).
The amplified products were resolved by electrophoresis in a 3% agarose gel. The PCR amplified products, stained with ethidium bromide, were visualized and photographed using a Gel Doc™ XR + System with Image Lab™ Software (Bio-Rad®). Subsequently, the amplified products were purified using QIAquick® purification kit (QIAGEN, USA) and the purified fragments were sent for sequencing using 3100 ABI sequencer (Applied Biosystems, USA) as described by Sanger et al. 45 . All sequences obtained were analyzed and verified twice in each direction.
Sequencing data analysis and phylogenetic tree construction. The sequencing results of the developed SSR markers were aligned by using discontiguous-MegaBLAST to identify specific regions among the reads (microsatellites) within the ORFV genome 46 . Next, the sequencing results were subjected to the BLASTx analysis, which compares translational products of the nucleotide query sequence to protein databases (https :// www.ncbi. nlm.nih.gov). A concatenated phylogenetic tree was constructed using the bootstrap consensus tree

Distribution of SSRs and cSSRs in ORFV genome.
Our study revealed a large number of SSRs scattered throughout the ORFV genomes varying from 1,036 to 1,181 in number with an average of 1,092 per genome. The RA and RD ranged from 7.6-8.4 and 53.0-59.5, respectively, in the analyzed ORFV genomes. However, previous reports in other DNA viruses such as human papillomaviruses (HPVs), the RA and RD ranged from 3.6-8.3 and 23.9-59.1 47 . In the case of Herpesviruses, RA and RD occurred to be 4.1-13.3 and 26.9-102.9 48 . On examining the SSR unit size classes, dinucleotide repeats were found to be most abundant (76.9%), followed by trinucleotide (17.7%) and mononucleotide repeats (4.9%) in all the genomes. Tetranucleotide and hexanucleotide repeats were least in number and represented 0.4% and 0.2% within the ORFV genome, respectively. There were no SSRs with pentanucleotide repeats observed in the ORFV genome. Approximately 90% and 10% of microsatellite motifs were distributed within coding and noncoding regions. Among the noncoding region, 4.8% are present in the UTR, while 5.4% in the intergenic regions, where functional protein and hypothetical protein occupied 68.8% and 21%, respectively. The genome-wide scan revealed the presence of 83-107 cSSRs, with an average of 93 occurrences per genome. In the case of compound microsatellites, the cal- Approximately 89.5% and 10.5% of microsatellite motifs were distributed within coding or non-coding regions, respectively. Among the non-coding region, 5.0% were represented in the UTR while 5.5% in the intergenic region, where functional protein and hypothetical protein occupied 60.7% and 28.8%, respectively ( Figure S2). The percentage of individual microsatellites being part of compound microsatellite (cSSR%) ranged from 7.9 to 9.0 (Table 1). Based on dMAX value, the maximum distance between any two adjacent microsatellites and if the distance separating two microsatellites is less than or equivalent to dMAX, than microsatellites are classified as cSSRs 49 . To determine the impact of dMAX, all the studied genome sequences were chosen to determine the variability of cSSRs with increasing dMAX. The value of dMAX was set between 10 and 100 by Microsatellite Identification Search Analysis (MISA) 50 . Our analysis revealed an overall increase in the number of cSSRs with higher dMAX value and attained a plateau (Fig. 2).
Motif complexity of compound microsatellites. Compound microsatellites (cSSRs) are composed of two or more adjacent individual microsatellites. Generally, cSSR having the pattern like, m1-xn-m2, m1-xn-m2-xn-m3 are considered as '2-microsatellite' and '3-microsatellite' , respectively 49 . Majority of cSSRs were composed of two motifs, followed by tri, tetra, and penta-motifs (Supplementary file 1). Interestingly, two long stretches of cSSR were composed of identical motifs repeated 12 times, which was exclusively found in the genome of AY386264. The CTG-CAG compound microsatellite composed of self-complementary motifs has been proposed to be created by recombination 51 . However, our study showed no such compound microsatellites which contained self-complementary motifs, suggesting that these compound microsatellites were not likely to be derived from recombination. Motifs exhibiting the form [m1]n-xn-[m2]n can be termed as SSR-couples and are represented the maximum time in the genome. In this study, SSR couples, such as (CG)-x-(GC), (GC)-

Identification of polymorphic microsatellites through in silico approach. For a polymorphic
microsatellite, the length of the repeat block should be non-identical with that of the other sequences in the database, and this length difference must be a multiple of the repeat unit 19,30,52 . For the identification of a polymorphic microsatellite, eleven strains of ORFV were used, where (AY386264) acted as the reference. A total thirteen number of polymorphic microsatellites were observed; among these, two were observed within the hypothetical protein, three in the intergenic regions, and rest eight in the protein-coding/genic regions. The polymorphic genic region containing the microsatellites encodes several important proteins such as Ankyrin repeat protein (ANK protein), DNA-binding phosphoprotein, virion core protein, Granulocyte-macrophage colony-stimulating factor (GM-CSF), Interleukin 10 protein (IL-10), Putative serine/threonine-protein kinase protein (Table 2, Figure S3). The Circos map provides a clear vision regarding the SSRs and cSSRs distribution and other related details in ORFV (OV-SA00) genome (Fig. 4).

Development and characterization of SSR markers.
All clinical samples collected during the outbreak were found to be positive for ORFV tested by producing the desired PCR amplicon size of 140 bp (Fig. 5).
We chose all thirteen polymorphic markers to validate in-vitro. Hence, PCR was set with each primer sets to amplify the DNA isolated from a positive clinical sample. The SSR name, primer sequences, expected size, targeted motif, functional region, protein motif position, gene, ORF number, and annealing temperature, were summarized in Table 2. All the SSR markers produced reliable and reproducible PCR products with the expected molecular size (Fig. 6).
The amplified SSRs were further characterized by sequencing, mapping with the GenBank database through BLASTn and BLASTx. The results of BLASTn alignment revealed a 100% of query coverage and a high identity percentage (91-100%) between the respective sequencing product and their equivalent genes from the published OV-SA00 isolate genome sequence. The results of BLASTx alignment revealed various degrees of query coverage (38-96%) and a high identity percentage (91-100%) with their equivalent amino acid sequences (Table 3). www.nature.com/scientificreports/ The concatenated phylogenetic tree showed the ORFV of our study closely related to Chinese isolate (MG712417) (Fig. 7). We observed the presence of 2-3 alleles within ORFV genomes.

Discussion
Microsatellites, otherwise known as short tandem repeats (STRs), or a variable number of tandem repeats (VNTRs) are being used to discriminate various viruses, such as human cytomegalovirus (hCMV) 22 67 due to its polymorphic in nature. To get the insight into the microsatellite in ORFV, we have employed a comparative genomics approach for development and characterization through in-silico and in-vitro analysis and validated our findings using samples collected from the recent Orf outbreak for the first time.
The specific parameters, such as its incidence, RA and RD of SSR and cSSR in ORFV genomes, show abundance variation as compared to their genome size and GC content due to the heterogeneity of ORFVs. Until now, limited full-length ORF genomes exist in the database. Based on our analysis, we observed little variation in RA and RD in ORFV. However, in other viruses such as HPVs 47 and Herpesviruses 48 , higher variation in RA and RD were reported. The large variation with the parameters was not observed in ORFV, probably due to the lack of enough size difference in the genome. However, a limited number of complete genome sequences are available for this virus, in comparison to HPV and herpesviruses, which act as a constraint to get the optimal range. Correlation analysis confirmed that incidence of both SSRs and cSSRs, RA of cSSRs were dependent on   www.nature.com/scientificreports/ genome size, but independent of GC content, which was similar to that of HPV 47 , but opposite to HIV 68 , potexvirus, carlavirus, and tobamovirus [69][70][71] . The distribution of microsatellite in the viral genome is pathogen-specific rather than host-specific. The increase of cSSR is predominant when dMAX approaches 10-90 bp and further decreases with the increase of dmax (Fig. 2). This may be due to the occurrence of SSR in the overlapping regions of increasing dMAX. The ORFV genomes have more SSR within coding regions than non-coding regions in comparison with other DNA virus, such as herpes simplex virus. This might be due to higher relaxed selection pressure on coding regions in comparison to the non-coding region in the respective virus. The cSSRs percentages of ORFV ranges from 7.9 to 9.0%, which is lower in comparison to HIV-1, 0-24.2% 68 , Geminivirus, 0-27.2% 72 , Herpesvirus, 8.1-33.3% 48 . Generally, the number of compound microsatellites decreases with an increase in complexity 73 . Moreover, the lack of sufficient genomic resources from diverse geographical locations may contribute to a stagnant range of cSSRs%. In ORFV, 22.1% of cSSRs were composed of similar motifs, probably contributed by genome duplication. Some study suggests that genome duplication may be helpful for the repeat tendency mechanism 74 , which promotes the expansion of genome size such as yeast 75,76 .
In ORFV genomes, the poly A/T repeats were significantly more prevalent than poly G/C repeats, similar to eukaryotic and prokaryotic genomes 1,2 . The presence of mononucleotide repeats in Mengovirus and Encephalomyocarditis virus affect virus growth in murine cell culture 77 . In the case of ORFV, its significance needs further validation. In this study, we also observed the microsatellites having polymorphism in poly A/T (ORF117), poly C/G (ORF121), within the important immune-regulatory genes, such as in GM-CSF and ANK protein, respectively (Supplementary file 2). GM-CSF secreted by a variety of cell types triggers neutrophil, monocyte, and eosinophil myelopoiesis and stimulate early events in immune responses, controlling the differentiation and function of antigen-presenting dendritic cells. IL-2 is a T-cell-derived lymphokine that stimulates T-cell and NK cell activation and proliferation and activated-B-cell proliferation 78,79 . ANK protein leads to the down-regulation of hypoxia-induced factor (HIF) activity and regulates energy metabolism, angiogenesis, the apoptotic cascade, the NF-kB signaling pathway, and cell cycle regulation 80 . The functional effects of this polymorphism in these regions require further investigations.
Dinucleotide CG/GC is more prevalent in most of the ORFV genomes, similar to that of DNA viruses such as HPVs 47 , Caulimoviruses, Geminiviruses 52,81 . CG/GC repeat could form Z-conformation or other alternative secondary DNA to facilitate the recombination activity 82 . In our study, the polymorphism within dinucleotide (AC/CA)3 and (CG/CG)3 observed within the hypothetical protein. Dinucleotide repeats have the highest slippage rate as compared to any other type of repeats 81 . Among 257 viral genomes examined in a published study, the highest number of dinucleotide SSRs were found when compared to the other types 83 . Dinucleotide repeats are also speculated to be recombination hot spots 84,85 . In this study, the presence of higher di-nucleotide repeats over tri-nucleotide repeats suggests a possible role of hosts in the evolution of di-nucleotide repeats within poxvirus genomes. Inconsistency frequency of SSRs in different accession of the same virus may be attributed to instability because of a higher slippage rate 86 .
Trinucleotide motif ATA/TAA/AAT or ATT/TTA/TAT were most prevalent in most genomes of poxvirus whereas in other DNA virus GAG/AGA was most prevalent in HPVs and AAG/GAA in caulimoviruses. The higher density of trinucleotide repeats was observed compared to any other repeat type within coding regions of eukaryotic and prokaryotic genomes 32 . Interestingly, dynamic mutations within trinucleotide repeats responsible for the development of some diseases in humans 87 , as well as viral enzymes that interfere pathogenicity of Influenza virus 88 . Our study revealed the presence of trinucleotide CGC/GCG and GCC/GGC repeats to be most prevalent than others. The trinucleotide polymorphism was observed in some immunoregulatory genes such as ANK protein (GGC/GCC) 3 (ORF008), IL-10 protein (AGT/ACT) 3 (ORF127) and structural genes virion core protein (GAG/CTC) 3 , Putative serine/threonine-protein kinase (CGC/GCG) 3 , which needs further functional evaluation. www.nature.com/scientificreports/ Three polymorphic SSRs such as (A/T) 7 , (T/A) 6 , (AGT TAC / GTA ACT ) 3 were observed within non-coding regions. The microsatellite present within the non-coding reasons was evolutionarily neutral and can be utilized as an excellent molecular marker 30 . Finally, we have characterized, those polymorphic markers present at nongenic as well as coding (genic) regions. These genic microsatellites, however, may provide adaptive variation important to viral evolution and genetic variability, perhaps similar to the functionally important mononucleotide runs found in VSV 34 and RSV 36 and virulence of avian influenza virus encephalo-myocarditis virus 89,90 . It is noteworthy to mention that, recently, the microsatellite present in HSV-1 glycoprotein coding region US4 was useful for strain differentiation 30 . The concatenated tree, which was constructed utilizing sequence information of characterized markers, confirmed that the ORFV of the present study closely related to Chinese isolate (MG712417). Our previous report, as well as several other studies, observed a similar pattern of relationship 18,91 . We speculate that trans-boundary and cross-species transfer of ORFV isolates could have resulted in this, as India is geographically adjacent to China. It is interesting to observe the presence of a number of the alleles (2-3) within ORFV genomes indicates the existence of polymorphism within microsatellites, which could act as a useful tool to estimate the diversity 61 . Using a single repeated mononucleotide was able to follow the dynamics of transmission of a human adenovirus during an epidemic 63 . Therefore, microsatellites constitute a potentially powerful tool for epidemiological studies of the transmission routes and evolution of ORFV and other related poxviruses. This study provides an important new type of molecular markers useful to investigate questions not only related to epidemiology but also for deciphering the diversity of the virus. However, the characterized microsatellites of the present study are not biased to the particular strain, which indicates the presence of recombinant strains circulating within the Indian subcontinent. This information is not concrete, which requires validation by several whole-genome sequence analysis of ORFV isolates from Indian origin. So far, our understanding of the functional and evolutionary role of microsatellites in ORFV biology is limited, which needs further in-depth evaluation and possible implementation.
In conclusion, the study of microsatellites in ORFV genome is a key step towards better understanding the nature, function, and evolutionary biology of the species. Our preliminary results can be considered as a useful tool for ORFV strain demarcation, diversity estimation, and evolutionary analysis. Our next plan is to characterize several ORFV strain complete genome from Indian origin through next-generation sequencing to get a better insight into genome organization, development of a suitable multiplex panel, which can be utilized as an effective tool for virus identification, genotyping and evolutionary analysis of the respective virus.