Analysis of oat seed transcriptome with regards to proteins involved in celiac disease

Oat (Avena sativa L.) is considered to be a healthy food. In contrast to other grain crops, oat is high in protein, lipids, dietary fiber, and antioxidants, and it uniquely contains avenanthramides. Whether it can also be consumed by people suffering from celiac disease is still unresolved. The main aim of this study was to extract and sequence genes for potentially harmful avenins, globulins, and α-amylase/trypsin inhibitors in six oat varieties and to establish their variability using PacBio sequencing of enriched libraries. The results were compared with sequences of the genes already present in databases. In total, 21 avenin, 75 globulin, and 25 α-amylase/trypsin inhibitor genes were identified and mapped to the hexaploid oat chromosomes. In all three gene families, only marginal sequence differences were found between the oat varieties within the individual genes. Avenin epitopes were found in all four types of avenin genes, which occurred in all oat varieties tested in this study. However, the number of avenin genes was nearly four times lower than that of globulin genes, and at the protein level avenins formed only 10% of storage proteins. Therefore, whether oat is safe for people with celiac disease is a question of threshold values.

www.nature.com/scientificreports/
Sequence analysis of avenin protein sequences revealed four avenin groups called A, B, C1, and C2 15,16 with a molecular structure analogous to other prolamins. The protein sequences contain three conserved regions interspersed by two repetitive regions with lower proline and glutamine content than other prolamins, especially wheat α- and γ-gliadins 10 . Thus, it can be expected that avenins are more easily hydrolyzed by duodenal enzymes than other prolamins and that their consumption should therefore be safe for CD patients 9 . However, recent studies cast doubt on the safety of oats, associating oat consumption in some CD patients with inflammatory symptoms 5,17 . Even setting aside possible contamination by wheat, barley, or rye, contamination-free oat varieties differ in their capacity to induce an inflammatory response in CD patients 18 . Moreover, other compounds in oat seeds, such as α-amylase/trypsin inhibitors (AATI), can also contribute to intestinal inflammation via activation of innate immune pathways 2,19,20 . On the one hand, oats can form a healthy, nutritious, fibre-rich, and safe complement to the gluten-free diet 9 ; on the other hand, their consumption may present a risk for CD patients.
The main aims of this study were (1) to identify all avenin, globulin, and α-amylase/trypsin inhibitor genes present in selected oat varieties and to map them on the oat chromosomes; and (2) to investigate differences in avenin, globulin, and α-amylase/trypsin inhibitor composition in six oat varieties with various levels of reactivity. As oat is an allohexaploid (2n = 6x = 42) and its genome is large and complex, with 2C = 25.7 pg 21 , we employed NGS technology to explore the oat seed globulin- and avenin-enriched transcriptome. The seed transcriptome has already been investigated by Illumina technology, but only with respect to the avenanthramide and tocol pathways 22 . To the best of our knowledge, this is the first study that uses an NGS approach to investigate oat avenins, globulins, and α-amylase/trypsin inhibitors.
All six varieties were planted in a field experiment. Developing hulled seeds were collected at 21 days post anthesis (DPA) because avenin expression levels have been shown to peak between 20 and 28 DPA 10 . All samples were frozen in liquid nitrogen and stored at −80 °C until RNA extraction.

RNA extraction.
Pooled samples of approximately 20 seeds each were used for RNA extraction using the TRIzol method (Invitrogen, Carlsbad, CA, US) according to the manufacturer's instructions. RNA was then purified with the RNeasy Plus Mini Kit (Qiagen, Hilden, Germany) following the standard protocol and treated with RNase-free DNase I (Qiagen, Hilden, Germany). The quality and integrity of the RNA were determined electrophoretically and spectrophotometrically with a GeneQuant Pro spectrophotometer (Biochrom, Cambridge, UK).
Avenin, globulin, and α-amylase/trypsin inhibitor transcriptome enrichment. To enrich target sequences, the first strand of cDNA was created using reverse transcriptase and oligo dT. The second strand was then synthesized using biotinylated probes specific to avenins or globulins (Table 1). cDNAs with ligated probes were selected using streptavidin bound to magnetic beads (New England Biolabs, Ipswich, MA). Target sequences were then amplified by PCR with Avena primers (Table 1) and Pfu proofreading polymerase (Fermentas, Lithuania). PCR products were purified using the QIAquick PCR Purification Kit (Qiagen, Hilden, Germany). The avenin-, globulin-, and α-amylase/trypsin inhibitor-enriched sequences were pooled for each oat variety. This process was repeated three times for each oat variety separately and the respective three subsamples
Table 1. Primers, oligos and probes used in the study.

Research involving plants.
The authors confirm that all methods were performed in accordance with the relevant guidelines and regulations. Oat seeds were provided by breeders collaborating on the project QK1810102.

Results
To compare differences in avenin, globulin, and α-amylase/trypsin inhibitor composition among the oat varieties, six target-enriched DNA libraries derived from RNA samples were prepared. The enrichment was performed by hybridization with specific degenerate probes, selection of the captured molecules, and their amplification. Samples were sequenced using the PacBio platform in a SMRT Cell. After removal of the primer adaptor sequences and low-quality reads, 436,000 single-end reads with an average length of 1027 nt were obtained. The results are provided in six separate files, each containing from 61,329 to 95,485 sequences (NCBI Sequence Read Archive: PRJNA774959).
All sequences were mapped to the oat reference (PepsiCo OT3098), merged, and extracted using SAMtools. After duplicates were skipped, 56,394 sequences were obtained in total. These sequences were searched and annotated with blastx (E-value threshold 1E−10) against the SwissProt database. Matches were found for 38,867 sequences (68.9%). Functional classification of the matched sequences was performed through gene ontology categorization. An ontology annotation was found for 32,925 of the sequences (58%). In total, 14,031 sequences were annotated as avenin, globulin, gliadin, or glutelin. After merging overlapping loci, 213 contigs were defined in a bed file and used to prepare a target oat reference for avenins, globulins, and α-amylase/trypsin inhibitors (Reference_OT3098.fasta; Supplementary File S1).
Sequences of all samples were mapped against the developed reference (Fig. 1), and consensus sequences were extracted through a variant calling process. The presence of sequences from the oat samples at each of the 213 loci was checked in the Integrative Genomics Viewer, and the loci were re-annotated (Supplementary Table S1). Out of the 213 loci, 21 genes for avenins, 87 genes for globulins, and 23 genes for α-amylase/trypsin inhibitors were found (Supplementary Table S1).
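The step of merging overlapping loci into contigs can be illustrated with a minimal Python sketch. It assumes BED-style half-open intervals and treats touching intervals as overlapping; the function name `merge_loci` and the example coordinates are hypothetical and are not taken from the study, which does not state which tool performed the merge.

```python
from collections import defaultdict


def merge_loci(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals.

    Coordinates are assumed to be BED-style (0-based, half-open).
    Returns a list of merged intervals sorted by chromosome and start.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                # First interval on this chromosome opens a new locus.
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps (or touches) the open locus: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: close the open locus and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical mapped-read loci (illustrative coordinates only).
example = [
    ("chr1D", 100, 500),
    ("chr1D", 450, 900),
    ("chr3D", 10, 60),
    ("chr1D", 2000, 2500),
]
contigs = merge_loci(example)
for chrom, start, end in contigs:
    print(f"{chrom}\t{start}\t{end}")
```

Each printed line has the three-column layout of a minimal bed file, which is the form in which the 213 contigs are described as being defined.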
Avenins. Of the 21 contigs identified in this study as avenins, one contained the sequence of an avenin promoter and three contained only fragments of avenin genes or pseudogenes with internal stop codons. These sequences were therefore omitted from the subsequent analyses. Seventeen contigs of OT3098 were aligned together with sequences available in public databases. Five clusters were identified (Fig. 1; Supplementary Figure S1). The corresponding contigs of all samples were then aligned, and the resulting phylogenetic tree (Fig. 2) shows six groups. The two most genetically distant groups are labelled D. The other groups comprise avenins of types A, B, C1, and C2. The dendrogram also shows a low level of sequence diversity between oat varieties within genes. All varieties had the same gene composition, except that 'Ebene', 'Jim', 'Poncho', and 'Sirene' did not contain the avenin B gene on chromosome 3D (chr3D-479782203). Further, only 'Poncho' contained four additional gene fragments (chr3D-450843917; chr3D-479670106; chrUn-5633586; chrUn-15655427) (Fig. 1; Supplementary …). Four avenin genes were mapped to chromosome 1D in the order C1-B-C2-A. Nine genes were situated on chromosome 3D, in the order D-fragment-D at greater distances from each other and then as a cassette promoter-C1-B-C2-C2-A. The remaining eight genes stayed unmapped even when the reference sequence OT3098 v2 was used; they were situated on chromosome Un as a cassette in the order fragment-D-C1-B-C2-C2-fragment-A (Fig. 1). The length of the avenin proteins ranged from 160 to 265 amino acids, with an average of 223 amino acids including the 19 amino acids of the signal peptide. The most abundant amino acids in all avenins were proline (8.2-11.0%) and glutamine (24.1-34.9%). Leucine was most abundant in C1 avenins (15.3-17.4%) (Table 2).
The structure of avenin genes consisted of seven domains, as proposed by Anderson (2014): a signal peptide, three conserved domains, two variable domains with a high representation of proline, glutamine, and leucine residues in irregular repetitions, and a C-terminal domain. All avenin genes contained eight cysteine residues, with the exception of A group avenins, which had a ninth cysteine in their C-terminal domains.
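Amino acid percentages such as the proline, glutamine, and leucine ranges reported above can be derived from a translated sequence with a short calculation. The following is a minimal Python sketch; the function name `aa_composition` and the toy peptide are hypothetical, and it is not stated in the text whether the signal peptide was included when the published values were computed.

```python
from collections import Counter


def aa_composition(protein):
    """Return the percentage of each amino acid in a one-letter protein sequence.

    A trailing '*' (stop codon symbol) is ignored if present.
    """
    seq = protein.strip().upper().rstrip("*")
    if not seq:
        raise ValueError("empty protein sequence")
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


# Toy peptide for illustration only; not a real avenin sequence.
toy = "PQQLPQQLQQ"
composition = aa_composition(toy)
for aa in sorted(composition, key=composition.get, reverse=True):
    print(f"{aa}: {composition[aa]:.1f}%")
```

Applying the same function to every translated avenin and taking the minimum and maximum per avenin group would give ranges of the kind summarized in Table 2.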
Four celiac disease T-cell HLA-DQ2.5 immunoreactive epitopes 28 were identified in avenin sequences. The epitope HLA-DQ2.5-ave-1a (PYPEQQEPF) was found in avenins of the C2 group, only once per protein, and therefore in five molecules per oat haploid genome. The epitope HLA-DQ2.5-ave-1b (PYPEQEQPF) was predominant in C1 avenins as PYPEQQQPF, once per protein, occurring in three molecules per oat haploid genome. The epitope HLA-DQ2.5-ave-1c (PYPEQEQPI) was identified in B avenins as PYPEQQQPI, in three molecules per oat haplotype in 'Atego' and 'Dalimil' and in only two molecules per oat haplotype in the remaining varieties. The epitope HLA-DQ2.5-ave-2 (PYPEQQPF) was specific to A avenins and occurred in three molecules per oat haplotype. In addition, the 9-mer sequence PFVQQQQPF, formerly known as the Av-γ9B epitope, was located downstream of the HLA-DQ2.5 epitopes in the first repetitive region, only in the C2 avenins and usually in two or three repetitions per C2 avenin gene. Thus, 10-15 copies of the 9-mer PFVQQQQPF occurred per haploid oat genome.
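Counting epitope occurrences per protein, as described above, amounts to an exact substring search over the translated sequences. The sketch below is a minimal Python illustration: the motif strings are the sequence variants quoted in the text, while the dictionary labels, the function name `find_motifs`, and the toy protein are assumptions made for the example and do not reproduce the authors' workflow.

```python
# Sequence variants as quoted in the text; the labels are illustrative.
MOTIFS = {
    "ave-1a (C2 avenins)": "PYPEQQEPF",
    "ave-1b variant (C1 avenins)": "PYPEQQQPF",
    "ave-1c variant (B avenins)": "PYPEQQQPI",
    "ave-2 (A avenins)": "PYPEQQPF",
    "Av-gamma9B 9-mer": "PFVQQQQPF",
}


def find_motifs(protein, motifs):
    """Return 0-based start positions of every exact motif match.

    Overlapping matches are reported because the search resumes one
    residue after each hit.
    """
    hits = {}
    for name, motif in motifs.items():
        positions = []
        start = protein.find(motif)
        while start != -1:
            positions.append(start)
            start = protein.find(motif, start + 1)
        hits[name] = positions
    return hits


# Hypothetical toy sequence, not a real avenin.
toy_protein = "MKTPYPEQQEPFAAPFVQQQQPFGGPFVQQQQPF"
hits = find_motifs(toy_protein, MOTIFS)
for name, positions in hits.items():
    print(f"{name}: {len(positions)} match(es) at {positions}")
```

Summing the per-protein counts over all genes of a variety would give per-genome totals of the kind reported in this paragraph.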
Globulins. The globulin sequences of the reference OT3098 were aligned with globulin genes published by Anderson 29 and a phylogenetic tree was constructed (Fig. 3). Four main groups can be seen in the dendrogram. The first group (A) involves globulin genes of the types Glo-1-Glo-8, the second group (B) contains globulin genes Glo-10-Glo-14, the third cluster (C) is formed of the globulin genes Glo-15 and Glo-16, and the fourth cluster (D) involves globulin genes Glo-9 and Glo-17-Glo-24. As with avenins, there was a low level of sequence diversity between oat varieties within genes (Supplementary Figure S2). The majority of globulin sequences were mapped to chromosome 3D (26), followed by chromosomes 1A (16), 1D (14), and 7A (9); eight genes have not been mapped (chrUn) (Fig. 1; Supplementary File S1). Globulin genes contained nine domains: a signal domain, four exons, three introns, and a C-terminal domain. Besides the introns, the most variable part was exon 3, between introns 2 and 3, which had a relatively high frequency of glutamine. The second most variable sequence of oat globulins was the C-terminal domain, which varied in both sequence and length.
α-Amylase/trypsin inhibitors. In total, 25 α-amylase/trypsin inhibitor (AATI) genes were identified within the six oat varieties, 14 as whole genes and 11 as gene fragments or pseudogenes. Most of them were localized on chromosome 4 (10 genes and 3 fragments): five genes and one fragment on chromosome 4A, four genes and one fragment on chromosome 4C, and one gene and one fragment on chromosome 4D. Further, one gene and four fragments were mapped to chromosome 7C, and three genes and one fragment to chromosome 1D (Fig. 1).
The AATI sequences of the six oat varieties and of the reference OT3098 were aligned together with AATI genes published by Gazza et al. 19 and a phylogenetic tree was constructed (Fig. 4). Five main groups can be seen in the dendrogram. The first group (A) involves AATI-1.* genes, the second group (B) contains AATI-2.* genes, the third cluster (C) is formed of AATI-3.* genes, and the last two clusters involve non-classified AATI genes (Fig. 4, Supplementary Table S1). As with avenins, there was a low level of sequence diversity between oat varieties within the genes. The length of AATI genes ranged from 432 bp (AATI-3.2 on chromosome 4C) to 696 bp (AATI-? on chromosome 7C). None of the analysed oat varieties contained all AATI genes: 17 genes were found in 'Dalimil' and 'Sirene', 18 in 'Poncho', 19 in 'Atego' and 'Ebene', and 21 in 'Jim'.

Discussion
The prerequisite for this study was to find as many avenin, globulin, and α-amylase/trypsin inhibitor genes as possible in the selected oat varieties. However, sequencing the whole oat genome or transcriptome would have been too expensive. We therefore chose to sequence enriched transcriptome libraries of oat seeds collected at the peak of target gene expression according to Real et al. 10 . The enrichment consisted of amplifying the targeted genes selected using degenerate probes specific to avenins, globulins, and AATIs. The results showed that some DNA fragments and non-targeted genes were also selected and sequenced. Of the 56,394 sequences, only 14,031 (24.9%) were annotated as the targeted genes. This is probably due to the use of degenerate probes, which increase the likelihood of non-specific hybridization.
Avenins. In this study, 14 avenin genes, six pseudogenes or fragments with internal stop codons, and one avenin promoter were identified. Cluster analysis showed five groups named according to Real et al. 10 : avenin-A, B, C1, C2, and an additional group named D. Avenin-D genes differ from the other four groups in amino acid composition, and therefore they could be regarded as avenin pseudogenes. To the best of our knowledge, this is the highest number of avenin genes found so far. Londono et al. 15 identified ten genes in 'Gigant'. Anderson 29 assembled avenin ESTs and obtained nine genes in 'Dancer'. Avenin sequences were mapped to the reference PepsiCo OT3098. All genes were localized in a kind of cassette in three regions: on chromosome 1D, on chromosome 3D, and on an unknown chromosome, as even version 2 of the reference did not allow all loci to be assigned to chromosomes. Each cassette of avenin genes contained one gene copy each of avenin-A, B, and C1, and one or two copies of avenin-C2. Except for the cassette on chromosome 1D, the cassettes were accompanied by pseudogenes and avenin gene fragments. Therefore, it is likely that each oat variety contained at least 12-14 genes, or even more through additional gene duplication 15 . Tanner et al. 30 indicated a higher number of avenins in the oat genome at the protein level by combining several methods of protein extraction. Interestingly, five groups of avenin proteins were identified that could be assigned, according to their epitopes, to avenin groups A, B, C1, C2, and D 30 .
Only marginal sequence differences were found between the oat varieties within the individual genes; it is therefore unlikely that differences in avenin genes at the DNA level could be used for oat variety identification. It is questionable whether identification is possible even at the protein level 10 . This low diversity can be explained by the more recent evolutionary history of oat compared to other cereals 29 .
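The on-target fraction quoted above, together with the annotation rates reported in the Results, follows directly from the sequence counts. A minimal Python check is sketched below; the counts are those given in the text, while the variable and function names are chosen for illustration.

```python
# Sequence counts as reported in the text.
TOTAL_SEQUENCES = 56_394      # non-duplicate mapped sequences
SWISSPROT_MATCHES = 38_867    # sequences with a blastx match
GO_ANNOTATED = 32_925         # sequences with a gene ontology annotation
ON_TARGET = 14_031            # annotated as avenin, globulin, gliadin, or glutelin


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


rates = {
    "SwissProt match": percent(SWISSPROT_MATCHES, TOTAL_SEQUENCES),
    "GO annotation": percent(GO_ANNOTATED, TOTAL_SEQUENCES),
    "on-target": percent(ON_TARGET, TOTAL_SEQUENCES),
}
for label, value in rates.items():
    print(f"{label}: {value}%")
```

The on-target value of about one quarter is the figure the discussion attributes to non-specific hybridization of the degenerate probes.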
The structure of avenin proteins A, B, C1, and C2 corresponded to already published results 15,29 , as did the content and positions of cysteine residues. Avenin sequences of groups B and C presented eight cysteine residues, whereas avenins of group A showed nine. It is likely that this additional cysteine forms intermolecular disulphide bonds, producing polymers similar to wheat glutenins 10,31 . Amino acid composition differed from the work of Real et al. 10 only in the content of alanine in avenins A (7.98% on average in this work and 4.87% in Real et al. 10 ) and B (6.60% on average in this work and 4.14% in Real et al. 10 ), of leucine in avenins B (11.55% on average in this work and 7.53% in Real et al. 10 ), and of glutamine in avenins A (25.81% on average in this work and 22.24% in Real et al. 10 ) and C (32.54% on average in this work and 29.25% in Real et al. 10 ). In contrast, the average proline and glutamine content was lower than in 'Dancer' 29 . Compared to wheat gliadins and glutenins, avenins had a lower proline and glutamine content, which could lead to lower celiac toxicity relative to wheat prolamins. The avenin-specific T-cell epitopes DQ2.5-ave-1a, DQ2.5-ave-1b, DQ2.5-ave-1c, and DQ2.5-ave-2 5,28 occurred only once per protein; therefore, there were 12-14 epitope sequences per Avena sativa haplotype. When compared to the composition of wheat, barley, and rye epitopes (Supplementary Figure S3), avenin epitopes grouped together with glutenin epitopes DQ2.5-glut-L1 and DQ2.5-glut-L2, gliadin epitopes DQ2.5-glia-1a, DQ2.5-glia-1b, and DQ2.5-glia-1(2), hordein epitope DQ2.5-hor1, and secalin epitope DQ2.5-sec1, but with low bootstrap support. Avenin epitope similarity to gliadin, glutenin, hordein, and secalin epitopes expressed per each amino acid incidence was: P(0. Moreover, Ellis et al. 32 successively replaced all amino acids of the epitope DQ2-α-II with alanine residues and studied the reactivity of the resulting epitopes. They found that the replacement of any position within the 9-mer led to a significant decrease in reactivity. Therefore, the precise epitope sequence is important. In oat, there are only four amino acids (marked above in bold) that bring avenin epitopes close to the immunoreactive epitopes of wheat, barley, and rye. Moreover, unlike wheat, barley, and rye, where prolamins are the main storage proteins and constitute 60-80% of the total grain protein content, oat prolamins (avenins) account for around 10% 33 . Hardy et al. 34 found low rates of T-cell activation after oat consumption of 100 g per day. They suggested that the doses of oats commonly consumed were insufficient to cause clinical relapse, which supported the safety of oats demonstrated in long-term feeding studies 34 . It is therefore possible to hypothesise that oat could be safe for most people with celiac disease, but that there is a cohort who are likely to be sensitive to any dose of oat gluten. For them, the strict gluten-free diet will be the only option.
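The alanine-substitution design attributed to Ellis et al. above, in which each position of a 9-mer is replaced in turn, can be enumerated programmatically. The following minimal Python sketch generates the single-substitution variants of a peptide; the function name `alanine_scan` is hypothetical, and the input is the avenin sequence PYPEQEQPF quoted earlier in the text, used purely as an example rather than the DQ2-α-II epitope that was actually studied.

```python
def alanine_scan(peptide):
    """List single alanine substitutions of a peptide.

    Returns (position, original_residue, variant) tuples with 1-based
    positions. Positions that are already alanine are skipped because
    the substitution would leave the peptide unchanged.
    """
    variants = []
    for index, residue in enumerate(peptide):
        if residue == "A":
            continue
        variant = peptide[:index] + "A" + peptide[index + 1:]
        variants.append((index + 1, residue, variant))
    return variants


# Example input: an avenin epitope sequence quoted in the text.
scan = alanine_scan("PYPEQEQPF")
for position, residue, variant in scan:
    print(f"{residue}{position}A\t{variant}")
```

Each variant differs from the parent peptide at exactly one residue, which is the property that lets a loss of reactivity be attributed to a single position.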
Globulins. In total, 75 globulin genes were identified and mapped. Although a different number of genes was found in each of the six varieties, it was around two times more than was found in 'Dancer' 29 . The ratio between the number of globulin and avenin genes corresponded to the proportion of globulin (80%) and avenin (10-15%) protein content 35 . Cluster analysis confirmed the existence of two main and two other groups (Fig. 3; 29 ). Assignment to clusters did not correspond to the position of the gene on a chromosome in the oat genome but rather to the type of globulin gene, Glo-1-Glo-24 29 . Several genes showed differences from the already published globulins. Little sequence diversity was found between oat varieties, which differed mainly in the presence or absence of individual genes; however, an absence may be a false negative result despite the use of libraries enriched for oat storage proteins. The structure of globulin genes was in accordance with those identified in 'Dancer' 29 . The only difference was that three introns were identified in the genes when the transcript sequences were mapped to the genomic oat reference OT3098. The first intron is 116 to 128 bp in length, the second is the most variable in length (112-137 bp), and the third is the shortest (104 bp). Although some regions of the globulins are relatively rich in proline and glutamine, no epitope-like sequences have been identified. Oat protein, consisting mainly of globulins, plays a significant positive role in controlling the blood glucose response by slowing the gastric emptying rate, promoting the secretion of insulin, and affecting the digestibility of starch 36 . The high protein content (12-17%) and the good ratio between globulins and avenins make oat a good nutritional ingredient for both animals and humans, favourable even for individuals with celiac disease.
α-Amylase/trypsin inhibitors. The third potentially harmful protein family is the AATIs, which together with lipid transfer and seed storage proteins belong to the AATI-LTSS protein superfamily, which is unique to higher plants. Proteins in this family are known to play important roles in defending plants from insects and pathogens, in lipid transport between intracellular membranes, and in nutrient storage. They have also been identified as allergens in humans 37 .
Of the 25 AATI genes, 14 were sequenced from start to stop codon. They belonged to three families named AATI-1, AATI-2, and AATI-3, with contrasting primary structures, molecular weights, and isoelectric points 19 . Unlike 'Donata', where four genes per AATI gene family were identified 19 , two genes of AATI-1, three genes of AATI-2, and five genes of AATI-3 were found in this study.
Oats synthesize AATI proteins as precursors that undergo specific cleavage of a leader peptide of 25-28 amino acid residues and then accumulate on the starch granules, together with more abundant vromindoline proteins. In this study, genes for vromindoline and puroindoline proteins were also found by BLAST among the mapped contigs, which suggests homology among all the studied proteins. However, a detailed analysis goes beyond the aim of this study. Tanner et al. 30 also found that the 18 peaks of an HPLC analysis of extracted oat grain proteins contained avenins and gliadin-like avenins accompanied by a large number of AATIs, vromindolines, enzymes, and metabolic proteins. The number of AATIs detected in 50% ethanol purified protein was in the same range as that of the avenins 30 .
Oat AATIs showed a high level of homology with α-amylase/trypsin inhibitors from wheat and barley, which have been associated with human diseases including food allergies and baker's asthma. As with avenins and globulins, higher variability was found between individual genes than between the six oat varieties and the oat reference OT3098.

Conclusion
To the best of our knowledge, this is the first study in which such a high number of avenin, globulin, and α-amylase/trypsin inhibitor genes were identified and also mapped to the hexaploid oat chromosomes via a unique procedure of enriched library preparation and bioinformatics. The number of genes corresponded to the ratio of grain storage protein content. Oat prolamins (avenins) formed only 10% of storage proteins and, moreover, had a lower proline content than wheat prolamins. Therefore, whether oat is less toxic to people with celiac disease is a question of dose and depends mainly on patient susceptibility. It is likely that oat could be safe for most people suffering from celiac disease. For those who are sensitive to any dose of oat gluten, the strict gluten-free diet will be the only option. A low level of variability was identified in the sequences of avenin, globulin, and AATI genes between the six oat varieties with different immunoreactivity declared at the beginning of the study. Therefore, we can conclude that, at the level of gene sequences, none of the six oat varieties is more suitable for patients with CD than the others. The reason for the variability found at the protein level should be further investigated at the level of posttranscriptional and posttranslational modifications.

Data availability
DNA sequencing data were deposited in the NCBI Sequence Read Archive under accession PRJNA774959. Analysed data are included in this published article, especially in its supplementary files. The oat seed material collected within this study is available from the corresponding author on reasonable request.