Uncovering natural allelic and structural variants of OsCENH3 gene by targeted resequencing and in silico mining in genus Oryza

Plant breeding efforts to boost rice productivity have focused on developing a haploid development pipeline. CENH3 gene has emerged as a leading player that can be manipulated to engineer haploid induction system. Currently, allele mining for the OsCENH3 gene was done by PCR-based resequencing of 33 wild species accessions of genus Oryza and in silico mining of alleles from pre-existing data. We have identified and characterized CENH3 variants in genus Oryza. Our results indicated that the majority CENH3 alleles present in the Oryza gene pool carry synonymous substitutions. A few non-synonymous substitutions occur in the N-terminal Tail domain (NTT). SNP A/G at position 69 was found in accessions of AA genome and non-AA genome species. Phylogenetic analysis revealed that non-synonymous substitutions carrying alleles follow pre-determined evolutionary patterns. O. longistaminata accessions carry SNPs in four codons along with indels in introns 3 and 6. Fifteen haplotypes were mined from our panel; representative mutant alleles exhibited structural variations upon modeling. Structural analysis indicated that more than one structural variant may be exhibited by different accessions of single species (Oryza barthii). NTT allelic mutants, though not directly implicated in HI, may show variable interactions. HI and interactive behavior could be ascertained in future investigations.


Results
In silico analysis elucidates that the OsCENH3 gene has no other paralogs and is under purifying selection within genus oryza. The tBLASTn analysis revealed significant hits on Chromosome 5 in the region 24070193 -24067694 which coincided with the OsCENH3 coordinates ( Supplementary Fig. S1, Supplementary Table S1) 18 . Similar results were seen upon carrying out Nucleotide BLAST (Supplementary  Table S2) 18 . The lack of significant hits on any other part of the genome assembly established that the rice genome, in all likelihood, does not have any paralogs for the CENH3 gene.
Selecton analysis of Oryza genus CENH3 CDS sequences revealed the protein to be under purifying selection ( Fig. 1a) 19 . Varying degree of purifying selection was observed among 164 residues of rice CENH3 protein.
Contrastingly, the analysis based on a query set of 98 CDS sequences, many residues of NTT showed positive selection (Fig. 1b). This implied that the CENH3 protein sequences are highly conserved within the Oryza genus, while exhibiting some variation in the NTT region among different genera. This could be explained by the fact that different species and genera have a propensity to incorporate certain residues at higher rates than others. NTT is known to be more gapped/unalignable while HFD is highly conserved 15 . This is also in conference with results obtained earlier by other workers 17 . Allelic variation in the wild species accessions for the gene OsCENH3. Amplification  For the eight accessions where the complete gene sequence could not be obtained, the sequence region pertaining to HFD was analyzed. SNPs observed in exonic regions and major indels found in introns are listed in Table 2. No indels were found in exons.
Notably, various SNPs were found across genus Oryza irrespective of species e.g., SNP A/G at position 69 was found in accessions of AA genome species (O. glaberrima, O barthii, O. nivara, O rufipogon, O. longistaminata, O. glumaepatula and O. meridionalis) as well as non-AA genome species (O. australiensis) (Fig. 2a). Some accessions of the progenitor species exhibited no SNP at this position compared to OsCENH3 which indicated that the actual progenitor of the present-day cultivated rice might have carried this version of the gene. It is interesting to note that synonymous SNPs seen in O. glaberrima are also present O. barthii (Table 2). Non-synonymous SNPs were found to occur in NTT region only (Fig. 2b) (Table 2). Therefore, different species exhibit different alleles at nucleotide level but amino acid changes are few and limited to NTT. www.nature.com/scientificreports/ Haplotype structure varied mostly amongst species but sometimes also within the species. Within the species, these differences were attributable to at most one or two SNPs, that in most cases did not lead to any amino acid changes. Thus, different nucleotide haplotypes could be reduced to a single protein sequence. For instance, haplotypes H8 and H10 translate to identical protein sequence. Similar is the case with H12, H13 and H14.

Species specific variation in
Phylogenetic relationships of OsCENH3 gene among diverse wild species germplasm of rice. Upon phylogenetic analysis of complete CENH3 sequences, the accessions clustered into four major clades (Fig. 4). Upon analysis limited to just the Histone Fold Domain (HFD), two major clades were formed (Fig. 5)    www.nature.com/scientificreports/ residues. Ten haplotypes were seen in Oryza genus upon aligning and analyzing entire protein length (Table 4). In these ten haplotypes, the major differences arise from changes in NTT, while HFD in many variants is identical. Upon limiting the analysis to just the HFD, five variant types were observed (Table 4). It is noteworthy that O. meyeriana carries the CENH3 variant (A0A6G1EGA2) with most changes (ten) in the HFD region when compared to the reference protein Q6T367 (Oryza sativa ssp japonica), which can be attributed to the fact that

Indel intron 6
OsCENH3 GAA  GGC  ACG  GGT  AAG  AGG  GCA  ACC  TCT  TGC  GCC  GCC  ATC  --H1  GAG  GGG  ACG  GGT  AAG  AGG  GCA  ACT  TCT  TGC  GCC  GCC  ATC  --H2  GAG  GGG  ACG  GGG  AAG  AGG  GCA  ACC  TCT  TGC  GCC  GCC  ATC  --H3  GAG  GGC  ACG  GGT  AAG  AGG  GCA  ACT  TCT  TGC  GCC  GCC  ATC  --H4  GAG  GGG  ACG  GGG  AAG  AGG  GCA  ACT  TCT  TGC  GCC  GCC  ATC  --H5  GAG  GGG  GCG  GGT  AAG  AGG  GCA  ACT  TCT  TGC  GCC  GCC  ATC  --H6  GAG  GGT  ACG  GGT  AAG  AGG  GCA  ACT  TCT  TGC  GCC  GCC  ATC  --H7  GAG  GGT  ACG  GGT  AAG  AGG  GCG  ACC  TCT  TGC  GCC  GCC  ATC  --H8  GAG  GGC  ACG  GGT  GAG  AGG  GCA  ACT  TCT  TGC  GCC  GCC  ATC  --H9  GAG  GGC  ACG  GGT  AAG  AGG  GCG  ACT  TCT  TGC  GCC  GCC  ATC  --H10  GAA  GGC  ACG  GGT  GAG  AGG  GCA  ACT  TCT  TGC  GCC  GCC  ATC --  www.nature.com/scientificreports/ it belongs to the tertiary gene pool. Interestingly, eight out of these ten SNPs were not observed in any other species. The most common changes (P70S and L117I) appear to always co-exist and were observed only in the CC genome. Another set of amino acid substitutions (I131L and M152I) also co-exist and were seen to occur in BB, DD as well EE genome. AA genome species do not exhibit any changes in the HFD region, indicating that the CENH3 has remained functionally conserved in the primary gene-pool of the genus Oryza. Curiously, the O. barthii CENH3 sequence on alignment with other AA-genome CENH3 sequences was revealed to be missing exon 2 completely. Resequencing of O. barthii CENH3 alleles from our panel revealed one accession to carry a SNP in beginning of exon 2 which we inferred to lead to an amino acid substitution. This anomaly is not seen in any other species including O glaberrima which is closely related to O. barthii. We found synonymous as well non-synonymous SNPs upon resequencing CENH3 gene in primary gene pool (AA-genome species) of genus Oryza. Additionally, sequencing of the single O. australienesis (EE) accession did not yield any SNPs. In contrast, the in-silico mining of the protein sequences belonging to species from secondary and tertiary gene pools exhibited several amino acid changes specific to genome groups. Most allotetraploid species belonging to the O. officianalis complex (secondary gene pool) exhibit at most two changes in HFD. It could also be inferred that various allotetraploid species carry the same CENH3 complement E.g., In O. alta  Mining allelic variants in HDRA and SNPSeek panels. Our analysis of HDRA and SNPseek panels returned a total of nine OsCENH3 variants (Supplementary Table S5) [20][21][22] . The HDRA variant allele (rs# 5655822) was found to have a synonymous effect. Of the eight variants mined from SNPseek, only one (rs#175193154) occuring in exon 6, was found to have non synonymous effect (Cysteine to Serine). Of the remaining seven, six were intronic in nature while one was placed in 3′ UTR. rs#175193154 was employed for homology modelling and alignment.   Fig. S5a-g). The OsCENH3 secondary structure presents six helices with one helix present in NTT. The haplotype variants H5 and H8 also have six helices but also carry a two-residue strand not seen in the reference. H12 variant has six helices in the regions comparable to the reference but these helices differ in length, a similar case is presented by rs#175193154 ( Supplementary  Fig. S5h) To better understand structural differences, 3D modelling was carried out. The human nucleosome structure containing the histone variant H3.2 -3AV1_A exhibited highest percent identity and was employed as template for homology modelling of the reference (Supplementary Fig. S6). Five models were built for each sequence respectively and the model with minimum DOPE score (Discrete Optimized Protein Energy) was selected (Supplementary Table S6). Protein structures for OsCENH3 and of variant representatives H5, H10, H12 as well as Hδ, Hζ and Hθ (Fig. 7) as well as rs#175193154 were superimposed (Supplementary Fig. S7) and analyzed for structural differences. All models showed > 90% residues in most favored region (Supplementary Figs. S8, S9 and S10). The protein structure of OsCENH3 displayed 4 helices (αN Helix, α1 Helix, α2 Helix and α3 Helix) and two loops (loop1 and loop2), loop1and α2 Helix define CATD ( Supplementary Fig. S11), characteristic of CENH3 structure as studied by various workers in different species 15,[23][24][25][26][27] .
The comparison of 3D structures showed that H5 and H10 are very similar to OsCENH3 with RMSD (Root mean square deviation) of 2.704 and 8.330 respectively (Supplementary Table S7). No major changes in the overall architecture of proteins were observed. The position 71 in H5 (red) was part of helix while it is not in OsCENH3 (green) (Supplementary Fig. S12). All the helix and loops in H12 and H10 were comparable to OsCENH3. Gly 37 to Gly 54 was deleted (disordered region) in Hδ (O. barthii) but still all helices and loops are conserved. In Hθ (O. officinalis), Thr 38 was absent (purple) and an extra triplet of TAA amino acids is found at position 48-50, still the loops and helices were conserved (Fig. 8). In O. meyeriana (Hζ), the helix regions were identical but some variation existed in the loops, e.g., Gly 27 to Pro 31 region does not occur as loop in OsCENH3. Like Hθ, Thr38 was missing in Hζ. An extra loop spanning tripeptide Trp 52 to Ala 54 was also seen in Hζ. Although, the wild relatives exhibit natural variants of OsCENH3 but the mutations (SNPs or Deletions) fail to disrupt the major structural components. Additionally, the rs#175193154 exhibited complete superimposition with reference OsCENH3 (DOPE score: − 10105.18555). 15,28,29 . A corresponding resource does not exist for rice CENH3 mutants. Most of the allelic variation at any given locus is expected to occur in the wild germplasm and not the crop itself, due to the unavoidable loss of variation during domestication [30][31][32] . In the current study, we have scanned the Oryza germplasm accessions for CENH3 variants with simultaneous in silico analysis. This study underlines that the OsCENH3 gene is a highly conserved, single copy gene. Contrastingly,

Wild germplasm presents synonymous as well as non-synonymous variations.
To maximize chances of identifying variant alleles, the allele mining subset was selected using "P" procedure 35 ; while also considering the original geographical distribution of the accessions. Despite having distinctive morphology, lifecycle duration and pollination system, the amino acid sequences of CENH3 in the Oryza accessions are highly similar. Synonymous changes outnumber non-synonymous variations at the level of the DNA. The absence of non-synonymous changes across HFD in CENH3s of genus Oryza is explained by earlier findings wherein multiple sequence alignment has shown that HFD amino acid sequence is similar even among species belonging to different genera 15 . The in-silico analyses of the OsCENH3 gene carried out as part of the present study also indicated in the same direction, in coherence with results obtained in rye 34 .
On aligning Arabidopsis thaliana and rice CENH3 protein sequences, various motifs were seen to be conserved between these two species (Supplementary Fig. S13) which was in coherence with the present investigation as well as previous findings 15 . This could be attributed to the fact that CENH3 like other histones, has certain specific motifs that are indispensable to its function. The MARTKH and PGTVAL motifs, that mark the beginning of NTT and HFD respectively in CENH3 protein and thus define its identity, are seen invariably in all species' sequences used for alignment by Kuppu et al. 15 . While mining alleles for OsCENH3 in wild species germplasm of Oryza genus, not all accessions could be amplified uniformly. Similar problems were faced by other workers while mining alleles for specific genes from Oryza germplasm 36,37 . Transferability of the primer pairs between species belonging to the same genus, is not always 100% as observed in Passiflora and Allium 38,39 .
L117I was one of the most common single-point amino acid change observed on in silico analysis of Oryza genus CENH3 protein sequences. Its corresponding mutant in A. thaliana, L130F/ L130I, has been reported to successfully complement cenh3 null mutant with haploid inducing abilities (HI, 4.8%) 29 . Arabidopsis CENH3 mutants P82S (corresponding to P70S) and A127V (corresponding to L114Y) also exhibit HI capabilities 15 . Notably, Arabidopsis CENH3 variant with A127T is non-complementing 15,28 .
A total of 157 single-point amino acid substitutions (arising from nucleotide transitions) are possible in OsCENH3-HFD (unpublished data), but only a handful of these were seen to exist naturally in Oryza relatives and none in closely related-AA genome species. The AA genome species might still be carrying SNPs that are not reflected upon translation as seen on resequencing a subset of accessions for OsCENH3. Similar studies in banana (Musa spp.) and carrot (Daucus spp.) also noted highly similar CENH3s between different species within the genus 40,41 . Our structural analyses point out that these natural variants do not exhibit major variations in conformation.
Various studies have etched out specific roles played by certain important residues found in NTT region of CENH3 in different species [42][43][44] . Arginine residues seem to play important role in kinetochore recruitment and subsequent interactions 42,43 . Lysine residue in HFD and Serine residue in NTT have also been implicated in controlling certain functions/behaviours 44,45 . In our analysis, we came across variants in which arginine and lysine residues had been replaced by other amino acids. These variants can be analyzed further for their interactions during kinetochore recruitment. Another allelic variant discovered from SNPseek panel has a serine residue replacing cysteine, and this might also modulate HI behavior of the accessions carrying these.

Phylogenetic implications of resequencing and in silico analysis. Phylogenetic relationships
among different species of genus Oryza have been studied based on entire genomes, chloroplast genome and individual genes by various workers previously. Analysis of various nuclear genes established O. meridionalis to be most divergent among AA genome species 46,47 . Current study found CENH3 alleles of O.longistaminata to be most divergent from O. sativa reference and other species. This could be explained by presence of two long intronic indels in CENH3 sequences pertaining to O.longistaminata. This pattern is also represented to an extent by 23S rRNA of these species (Fig. 6). Interestingly, O. australiensis (EE genome) HFD sequence clusters with O. nivara and O. rufipgon. This is understandable considering that the 23S rRNA originating from O. australiensis also exhibited the same pattern, in addition to being closest to cultivated species. This can also be explained by the fact that Adh1 sequences sourced from various diploid species of genus Oryza were found to be identical 47 . www.nature.com/scientificreports/ Currently, we sampled eight accessions of O. rufipogon and five accessions of O. nivara that capture the geographical diversity of the two species. Phylogenetic analyses show that they are scattered in two subclades and overlap with each other as well as O. sativa reference, which elucidates their close genetic relationships. On both phylogenetic trees, O. rufipogon and O. nivara accessions do not cluster distinctly at the species level. This is identical to findings of Zhu and Ge 47 but in contrast to results obtained by Wambugu et al. 48 who observed that even O. rufipogon accessions segregate into "Asian" and "Australian" clusters. Currently, the African species O. barthii and O. glaberrima form a distinct cluster which was also seen upon analysis of chloroplast sequences 48 .
Currently sequenced variant alleles from Oryza species germplasm exhibit amino acid changes in region of N-terminal tail domain. Histone fold domain does not exhibit any functional SNPs and thus any variants that might be useful towards haploid induction pipeline were not present in our allele mining panel. The single copy OsCENH3 gene could be easily manipulated for creating mutants that can then be tested for HI trait. The variants found in wild species germplasm along with other OsCENH3 mutants sourced from our mutant garden are under analysis in our lab for their HI capabilities.

Materials and methods
Preliminary in silico characterization of OsCENH3 gene. Nucleotide and protein sequences of CENH3 were retrieved from NCBI (https:// www. ncbi. nlm. nih. gov/) and UniProt (https:// www. unipr ot. org), respectively. Of 894 nucleotide entries found on NCBI, the complete coding sequences and protein sequences were retrieved for 286 entries only. Further, 111 sequences (out of 419 returned upon keyword search) were downloaded from UniProtKB. Redundancy was removed from both nucleotide and protein sequence files using CD-HIT suite with identity cut-off of 0.99, leaving behind 107 nucleotide, 106 CDS derived protein sequences and 89 protein sequences from UniProtKB. The non-redundant nucleotide and protein sequences were subjected to domain analysis using CD search at NCBI server. Any sequences with domains other than pfam00125 (histone-specific domain) were not considered further. The two protein sequence files were combined and redundancy check completed again, returning 144 non-redundant protein sequences. Thus, 144 protein and 107 nucleotide sequences were used for investigating any OsCENH3 paralogs. The sequences were BLAST searched 18 (BLAST + version was used; 49 ) against the complete rice genome assembly version 7.0 50 , retrieved from the RGAP database (ftp:// ftp. plant biolo gy. msu. edu/ pub/ data/ Eukar yotic_ Proje cts/o_ sativa/ annot ation_ dbs/ pseud omole cules/ versi on_7. 0/ all. dir/). Standalone BLASTn and tBLASTn searches were performed using the rice genome assembly as database and the CENH3 CDS and protein FASTA files as the query sequences. In a separate analysis, effect of amino acid substitution in the CENH3 sequences was analysed using Selecton server (http:// selec ton. tau. ac. il/) 19 . , O. australienesis (n = 1) was used for the present study. These accessions were initially procured from International Rice Research Institute (IRRI), Philippines and National Rice Research Institute (NRRI), Cuttack, India and are being actively maintained at Punjab Agricultural University, Ludhiana. Standard agronomic practices were followed to raise the crop, as reported previously 51 . Voucher specimens for all accessions used currently have been deposited in the herbarium of National Bureau of Plant Genetic Resources (NBPGR), New Delhi, India. These accessions are part of the global and national germplasm collections maintained by IRRI (identifiable by IRGC accession numbers) and NRRI (identifiable by CR accession numbers) and can be accessed through these institutions as well as NBPGR. Initial identification of these materials was carried out by scientists of IRRI and NRRI undertaking germplasm collection. Experimental research and field studies on these accessions including the collection of plant material, were in accordance with the relevant institutional, national, and international guidelines and legislation. Since the plant material has been maintained by PAU, India, permissions regarding the collection of seed specimens were not required.

Natural allelic diversity for
Primer designing and PCR amplification. Genomic DNA was isolated using previously reported protocol 36,52 . The complete sequence of chromosome 5 was retrieved and aligned with gene sequence obtained from RGAP database (RGAP Locus ID: LOC_Os05g41080, http:// rice. plant biolo gy. msu. edu/ index. shtml) using offline BLAST and further trimmed to obtain gene sequence flanked by additional 267 bp before the start codon and 467 bp after stop codon, yielding a sequence of 2900 bases. This sequence was used to design three sets of overlapping primer pairs (Supplementary Table S3). Supplementary Figs. S1, S4 show structure of the gene and Supplementary Fig. S2 shows position of primers along the length of the gene. PCR was performed in a 30 μl reaction mix containing 0.25 μl TaKaRa Ex-Taq DNA polymerase, 1 μl of genomic DNA (100 ng/μl), 3 μl of 10X TaKaRa Ex-Taq buffer, 3 μl of dNTPs (1 mM), 1.5 μl each of forward and reverse primers (5 μM), and 19.75 μl Nuclease Free Water. The thermal cycling conditions were as follows: an initial denaturation at 94 °C for 5 min; 35 cycles of 1 min denaturation at 94 °C followed by 45 s annealing at 55 °C and 1 min extension at 72 °C; and a final 7 min extension at 72 °C. Detailed protocols may be found in 52 .
Sequencing of OsCENH3 gene in selected accessions. 5 μl PCR product for each sample was electrophoresed on the ethidium bromide stained 1.0% agarose gel along with 1 kb plus ladder (Thermo Scientific Generuler) to estimate the DNA fragment size. The Wizard® SV PCR Clean-Up System (Promega, USA) as per the manufacturer's protocol was followed to purify the DNA fragments from remaining 25 μl PCR product. The nucleotide sequence information of the PCR products was generated as described previously 51  www.nature.com/scientificreports/ Analysis of the generated nucleotide sequences. For comparative sequence analysis, contigs were assembled from individual reads produced by overlapping primers, using DNA Baser Assembler v5.15.0., to generate the contiguous sequence of OsCENH3 alleles from selected genotypes. ClustalX 2.0.11 was employed to align contig sequences individually with the reference (RGAP Locus ID: LOC_Os05g41080) and to trim the sequences at both ends to retain only the genic portion (i.e., from ATG to TGA). Sequences were also trimmed to retain just HFD or individual exons. Pairwise alignments and multiple sequence alignments of the trimmed sequences with annotated version of the reference were carried out using Geneious Prime version 2021.1.1. Based on these alignments, SNPs and haplotypes were predicted. The detected SNPs were then manually curated by analyzing and comparing chromatogram files to the Geneious alignment files and DNA Baser contig files. Effects of detected SNPs in terms of coded amino acids, were also visualized using Geneious Prime version 2021.1.1.
Phylogenetic analysis. The MEGA (version X) software was used to generate the phylogenetic trees using multiple sequence alignment file 53 . The evolutionary distances were computed using the Neighbor Joining Method with 1,000 bootstraps using the Kimura2-parameter model.
Additionally, SNP calling was carried out on SNPSeek and HDRA panels in the region limited to OsCENH3 coordinates. Effects of detected SNPs in terms of coded amino acids, were also visualized using Geneious Prime version 2021.1.1.
Structure analysis of variants. Representative variant protein sequences along with OsCENH3 reference were subjected to secondary structure prediction using PsiPred 54,55 . Further, OsCENH3 protein sequence was subjected to blastp search against the PDB database at NCBI (https:// blast. ncbi. nlm. nih. gov/ Blast. cgi) for identification of a suitable template. Homology modeling of 3D structures for OsCENH3 as well as variant representatives H5, H10, H12 as well as Hδ, Hζ and Hθ, and rs#175193154 was done using Modeller 10.1 56 Quality of structures was accessed through the SAVES server version 6.0 (https:// saves. mbi. ucla. edu/) and Ramachandran plot was made for the selected models to assess the stereo-chemical properties. The structures were aligned with OsCENH3 using PyMOL version 2.5.0. OsCENH3 was also aligned with 3av1.