Drought-responsive genes in tomato: meta-analysis of gene expression using machine learning

Plants have diverse molecular mechanisms to protect themselves from biotic and abiotic stressors and adapt to changing environments. To uncover the genetic potential of plants, it is crucial to understand how they adapt to adverse conditions by analyzing their genomic data. We analyzed RNA-Seq data from different tomato genotypes, tissue types, and drought durations. We used a time series scale to identify early and late drought-responsive gene modules and applied a machine learning method to identify the best responsive genes to drought. We demonstrated six candidate genes of tomato viz. Fasciclin-like arabinogalactan protein 2 (FLA2), Amino acid transporter family protein (ASCT), Arginine decarboxylase 1 (ADC1), Protein NRT1/PTR family 7.3 (NPF7.3), BAG family molecular chaperone regulator 5 (BAG5) and Dicer-like 2b (DCL2b) were responsive to drought. We constructed gene association networks to identify their potential interactors and found them drought-responsive. The identified candidate genes can help to explore the adaptation of tomato plants to drought. Furthermore, these candidate genes can have far-reaching implications for molecular breeding and genome editing in tomatoes, providing insights into the molecular mechanisms that underlie drought adaptation. This research underscores the importance of the genetic basis of plant adaptation, particularly in changing climates and growing populations.


Introduction
The growing global population and the worsening effects of climate change present a formidable challenge to the eld of agriculture.To meet the increasing demand for food, agricultural scientists must nd ways to increase production despite the adverse effects of climate change.One of the major factors that can reduce crop yields is water scarcity, which is a common problem in many regions, such as South Asia, where prolonged summers lead to low water availability and moisture content in the soil (Hayat et al., 2008).The prolonged stress can severely diminish plant growth and productivity, and cause the accumulation of reactive oxygen species (ROS) in the plant.Lower carbon xation due to stomata closure can lead to reduced NADP + regeneration through the Calvin cycle, which, when coupled with changes in chlorophyll production, can result in an increased production of ROS in water-stressed plants (Biehler et al., 1996;Zhang et al., 2022).The accumulation of ROS can lead to a cascade of harmful effects, including peroxidation of lipids, protein oxidation, enzymatic activity inhibition, oxidative damage to nucleic acids, and ultimately, cell death.These consequences can have serious implications for agriculture, as they can ultimately lead to decreased crop yields and food shortages.Therefore, nding effective ways to mitigate the impact of water scarcity on crops is crucial to ensure food security in the face of climate change.
Tomatoes are a widely cultivated crop throughout the world, originating in southern America as a cultivated solanaceous plant known as Lycopersicon esculentum.Despite its origins, tomatoes are now produced and consumed worldwide, with demand slightly outstripping production (Bai et al., 2007).Due to their relatively short life cycle, easy cultivation, and simple genetics with lower gene duplication, tomatoes are considered to be a standard model of vegetables (Bergougnoux, 2014).Molecular studies of tomatoes can also provide valuable insights into other related species (Kimura & Sinha, 2008).While tomatoes are economically important, there is still much to learn about their molecular responses to various abiotic stresses, such as water stress.
Next-generation sequencing technologies have rapidly become the preferred method for characterizing and quantifying entire genomes.One powerful application of these high-throughput sequencing methods is RNA sequencing (RNA-Seq), which allows researchers to gain insight into the transcriptome of cells.
Compared to previous approaches such as Sanger sequencing and microarray-based methods, RNA-Seq provides higher resolution and greater sensitivity for characterizing the dynamic nature of the transcriptome.In addition to quantifying gene expression, RNA-Seq data can facilitate the discovery of novel transcripts, identi cation of alternative splicing events, and detection of allele-speci c expression.Recent advances in the RNA-Seq work ow, including improvements in sample preparation, library construction, and data analysis, have enabled researchers to uncover even more of the functional complexity of transcription (Kukurba & Montgomery, 2015).
The analysis of RNA-Seq data is a signi cant challenge due to its complex and high-dimensional nature.
However, in recent years, machine learning has emerged as a successful approach in various biological contexts.Machine learning algorithms have demonstrated impressive e ciency in handling large datasets that possess characteristics such as noise, high dimensionality, and incompleteness.One notable advantage is that these algorithms require minimal assumptions about the underlying probability distributions and generation methods of the data.In contrast to traditional statistical approaches that focus on inference, machine learning methods prioritize prediction, providing greater exibility to accommodate diverse data characteristics.Machine learning plays a crucial role in the analysis of biological data, particularly RNA-Seq data, due to its capability to handle nonlinearity.Linear models often fall short of capturing the intricate relationships present in biological datasets.Conversely, machine learning algorithms excel at capturing and utilizing the complex patterns inherent in such data.This ability enhances our understanding and interpretation of RNA-Seq data, enabling more accurate predictions and deeper insights into biological phenomena (Mahood et al., 2020).
In this research, we gathered transcriptomics data from various tomato genotypes subjected to different drought conditions.The experiment encompassed diverse genotypes and included six time points representing varying durations of drought, along with corresponding control groups.Our analysis focused on identifying genes that exhibited differential expression, allowing us to capture both early and late responsive genes to drought.To achieve this, we employed machine learning models, systematically exploring the dataset to uncover the genes most affected by drought in tomato plants.Using the identi ed genes, we constructed a protein-protein interaction (PPI) network and determined their targets as drought-responsive.Additionally, we observed that the chromosomal positions of these genes overlapped with previously reported drought-responsive Quantitative Trait Loci (QTLs) in tomato.
The study has also investigated differentially expressed genes and enriched Gene Ontology (GO) categories in tomatoes subjected to prolonged drought stress and subsequent rehydration.These studies revealed that down-regulated genes after drought stress were enriched in GO categories related to photosynthesis and cell proliferation, while upregulated genes belonged to GO categories more directly associated with stress responses (Iovieno et al., 2016;Alam et al., 2010;Kosová et al., 2011).However, a comprehensive examination of genetic responses in tomato plants to drought stress is still lacking.Therefore, the primary objective of our study was to employ machine learning techniques to investigate the responsive genes of tomato plants under prolonged water de cit conditions.Through our systematic approach, we aimed to identify the genes speci cally responsive to this particular stress condition in plants.

Differentially expressed gene patterns
The expression of genes can vary over time in response to different stimuli, and the identi cation of DEGs can provide insights into the underlying biological processes.In this study, the expression of genes in a time series scale was analyzed, and the results were visualized in Figure 2.
The DEGs were compared with their individual controls, and the analysis revealed interesting patterns (Figure 3 and Figure 4).Speci cally, it was observed that the most upregulated genes were found in the later stages of stress, speci cally at 120 hours post-treatment (Figure 3).This nding suggests that the cellular stress response becomes more pronounced as the duration of the stress increases.Interestingly, some genes were also found to be upregulated in the early stages of stress and overlapped with those found in the later stages.This observation implies that the cellular stress response may involve immediate and delayed mechanisms.And in the medium stages of water de cit, a few genes were found to be signi cantly upregulated.Although the number of DEGs was lower in this stage, it is important to note that these genes may play important roles in the early stages of the stress response.This study provides valuable insights into the temporal dynamics of gene expression in response to stress.The identi cation of DEGs at different time points highlights the importance of considering the duration of stress when studying cellular responses.Additionally, the overlapping DEGs observed across different stages suggest that the cellular stress response involves a complex interplay of immediate and delayed mechanisms.

Gene ontology enrichment analysis
Gene ontology enrichment analysis is a powerful tool used to interpret the biological function of differentially expressed genes (DEGs) in a high-throughput manner.In this study, the functions of DEGs were predicted from GO and the results were visualized in Figure 5.The analysis revealed that a large proportion of DEGs (approximately 600 genes) were involved in organic substance biosynthetic processes and organonitrogen compounds.This nding suggests that the cellular stress response involves the production and modi cation of organic compounds, which play important roles in cellular metabolism and signaling pathways.Moreover, approximately 500 genes were found to be related to stress, indicating that the cellular stress response involves the activation of various stress response pathways.This nding is in agreement with previous studies that have shown that stress response genes are upregulated in response to various environmental stimuli.Additionally, more than 400 genes were found to respond to chemicals, highlighting the importance of chemical signaling in cellular responses.These genes may be involved in the synthesis, transport, and degradation of chemical signals, or they may play roles in downstream signaling pathways that are activated by chemical stimuli.Furthermore, approximately 200 genes were found to be related to translation and responded to osmotic stress.This nding suggests that osmotic stress may affect translation processes, which play important roles in the synthesis of proteins that are involved in cellular responses to stress.In summary, the gene ontology enrichment analysis conducted in this study provides valuable insights into the biological function of DEGs in response to stress.The identi cation of genes involved in various metabolic, signaling, and stress response pathways highlights the complexity of the cellular stress response and emphasizes the importance of considering multiple biological processes when studying stress responses.

Machine learning performance
In this study, we utilized gene expression values as features to train classi cation models using the XGBoost in R. XGBoost is a popular implementation of the gradient boosting algorithm, which combines multiple weak learners, such as shallow trees, into a strong one.To evaluate the performance of our models, we split our dataset into training and test sets.The accuracy of the models was measured using the area under the curve ("auc") value, which is a commonly used metric for evaluating classi cation models.The results of our analysis, shown in Figure 6, demonstrate that the accuracy of the training and test sets are quite close to each other, indicating that our models are well-tted.Furthermore, we observed that the accuracy of the models increased over iterations, suggesting that the models learned from the data and were able to improve their performance as they were trained with more data.The use of the XGBoost in our study is particularly bene cial due to its ability to handle high-dimensional datasets, such as those generated from gene expression studies.By incorporating multiple weak learners, the algorithm can effectively capture the complex relationships between gene expression values and biological outcomes, such as disease states or cellular responses to stress.In summary, our results demonstrate the effectiveness of XGBoost in classifying biological samples based on gene expression values.The close agreement between the accuracies of the training and test sets suggests that our models are robust and generalizable.The use of XGBoost has enabled us to extract meaningful insights from high-dimensional gene expression datasets, and its exibility makes it a valuable tool for future studies in this area.

Feature importance
In this study, we employed a robust approach to determine the most in uential genes that contribute to the classi cation of samples based on their gene expression values.Following the training and validation of our classi cation model using the XGBoost algorithm, we extracted the features with their corresponding importance scores.The importance score of a feature re ects its contribution to the classi cation of samples and thus provides valuable insights into the underlying biological mechanisms.Among the top six genes identi ed as candidate genes, we found that FLA2 gene had the highest importance score (Figure 7).This observation suggests that the expression of FLA2 gene is a critical factor in determining the classi cation of samples in our study.FLA2 gene encodes for fasciclin-like arabinogalactan protein 2, which is a cell surface protein involved in various cellular processes such as cell adhesion, growth, and differentiation.The high importance score of FLA2 gene in our analysis may indicate its involvement in the cellular stress response and thus may have potential implications in the development of stress-tolerant crops.It is worth noting that the other candidate genes identi ed in our analysis also play important roles in various cellular processes.Further investigation of these genes could provide valuable insights into the underlying biological mechanisms involved in stress responses.
In conclusion, the identi cation of FLA2 gene as the most important gene in our study highlights its potential signi cance in stress responses.The use of the XGBoost algorithm has enabled us to extract meaningful insights from high-dimensional gene expression datasets, and its exibility makes it a valuable tool for future studies in this area.

Protein-protein network
To gain a better understanding of the biological functions and interactions of the candidate genes identi ed in our study, we constructed protein-protein interaction networks for each gene using publicly available databases.The resulting network for each candidate gene is shown in Figure 8, with nodes representing genes and edges representing interactions between genes.In these networks, we observed that the candidate genes identi ed in our study were represented by the color red, indicating their high importance in abiotic stress response.We also observed that the targets of these candidate genes showed abiotic stress-related functions, suggesting their involvement in the cellular stress response.The protein-protein interaction networks provide valuable insights into the potential pathways and mechanisms involved in the cellular stress response.By identifying the interactions between genes, we can better understand the complex interactions and functions of genes in the network.This information could be further used to develop targeted interventions for improving stress tolerance in crops.
Overall, the protein-protein interaction networks of the candidate genes identi ed in our study highlight their potential signi cance in stress responses and provide a foundation for further investigation into their roles in stress tolerance (Supplementary 2).

Candidate genes overlapped with drought QTLs of Tomato
In this study, we aimed to explore the potential role of three candidate genes, FLA2, ASCT, and NPF7.3, in the response of tomato plants to drought stress.To investigate this, we compared the location of these genes with the previously identi ed QTLs associated with drought-related traits in tomato by Diouf et al. (2018).
Our analysis revealed that all three candidate genes overlapped with QTLs that were previously associated with drought stress, suggesting their potential involvement in the plant's response to water de cit conditions (Table 1).Speci cally, the QTLs RIP, SSC, NFr, and FW were identi ed under drought stress for the traits of time to ripe, soluble solid content, number of fruit, and fruit weight, respectively.
The importance of our ndings lies in the potential for identifying speci c genes that contribute to the drought stress response of tomato plants.A better understanding of the genetic mechanisms involved in this response is crucial for developing crop varieties that can withstand environmental stressors such as water scarcity.
Our study suggests that FLA2, ASCT, and NPF7.3 could be promising candidate genes for further investigation in the context of drought stress in tomato plants.Future studies could focus on elucidating the speci c molecular pathways through which these genes affect the plant's response to water de cit conditions.This could lead to the development of more targeted breeding programs that utilize these genes to develop drought-tolerant tomato varieties.

Discussion
The abstract of our study highlights the crucial importance of gaining a thorough understanding of the genetic composition of tomato plants to improve their ability to withstand drought stress, especially given the signi cant global demand for this staple crop.To achieve this objective, we employed state-of-the-art machine learning techniques to identify the most responsive genes associated with prolonged water scarcity in tomato plants.Our investigation uncovered six promising candidate genes, with FLA2 emerging as the top gene, indicating its potential as a critical target for enhancing drought resistance in tomatoes.Moreover, the gene ontology enrichment analysis conducted in our study demonstrated that the functions of the identi ed candidate genes were closely linked to drought stress.These ndings are consistent with previous research (Iovieno, et al. 2016), reinforcing the notion that these candidate genes play a pivotal role in imparting drought tolerance in tomato plants.Our study signi cantly advances our knowledge of the molecular mechanisms underlying the response of tomato plants to drought stress.The identi cation of FLA2 and other candidate genes, coupled with the functional analysis of their gene ontology, lays a robust foundation for future research focused on developing drought-resistant tomato varieties.Such research can help mitigate the negative impact of water scarcity on global food production, making our study highly pertinent and timely.
The ndings of this study demonstrate that speci c genes were upregulated in later stages of stress, while others overlapped with early stages, aligning with previous research on plant response to drought stress (Atkinson et al., 2013;López-Galiano et al., 2019).These results underscore the complexity of the molecular mechanisms involved in plant responses to stress and emphasize the need for a comprehensive understanding of these mechanisms that encompass multiple genetic pathways.Additionally, the identi cation of these genes on different chromosomes suggests that they may have distinct modes of action in responding to stress, underscoring the signi cance of investigating multiple genes and their interactions to gain a deeper understanding of the fundamental molecular processes.By shedding light on the intricate mechanisms that govern plant response to drought stress, this study contributes to advancing our knowledge in the eld and lays a foundation for future research aimed at developing drought-tolerant plant varieties.This article emphasizes the critical importance of identifying candidate genes involved in the response to abiotic stress, particularly drought stress, and understanding their interactions using protein-protein interaction networks.Our study focused on six genes (FLA2, ASCT, DCL2b, ADC1, NPF7.3, and BAG5), and the constructed protein-protein interaction networks offer valuable insights into the potential pathways and mechanisms involved in drought responses.This research provides a foundation for developing targeted interventions to enhance stress tolerance in crops.The study highlights the signi cant role of the identi ed candidate genes in the response to abiotic stress, especially drought stress, with these genes exhibiting functions related to the cellular stress response.Our ndings are consistent with previous research (Muhammad et al., 2019), which identi ed various genes, including SlMAPK3, involved in drought stress responses in tomatoes.The overexpression of SlMAPK3 improved the tolerance of tomato plants to drought stress by regulating the expression of genes involved in drought stress responses.
Therefore, our study makes a valuable contribution to ongoing research on drought stress tolerance in tomatoes.The identi cation of candidate genes involved in drought stress response and the construction of protein-protein interaction networks lay a foundation for further investigation into the molecular mechanisms involved in stress tolerance.These ndings could be leveraged to develop new strategies for improving crop productivity and sustainability in the face of drought stress and other environmental stressors, ultimately bene ting global food security (Zhou et al., 2007;Kosmala et al., 2012).
The study has provided compelling evidence that FLA2, ASCT, and NPF7.3 genes overlap with QTLs associated with drought-related traits in tomato plants.In a recent study by Verinico et al. (2022), FLA2 was identi ed as drought-responsive, which further con rms the importance of this gene in the plant's response to water de cit conditions.
Developing crops that are more resilient to environmental stressors like water scarcity has become an urgent priority, given the growing global population and climate change challenges facing agriculture.By understanding the molecular pathways through which these candidate genes affect the plant's response to drought stress, researchers could develop targeted breeding programs to develop drought-tolerant tomato varieties.
The integration of bioinformatics and computational methods, particularly machine learning, has greatly advanced the eld of plant stress response research.The study presented in this paper is a remarkable example of how such approaches can provide novel insights into the genetic mechanisms underlying plant responses to environmental stress (Auer et al., 2010).By identifying candidate genes and regulatory elements involved in stress response, machine learning can help develop crop breeding programs that enhance stress tolerance, ultimately contributing to addressing food security challenges (Rico-Chávez et al., 2022; Veronico et al., 2022).
The use of machine learning allows for the analysis of large amounts of genomic data and the identi cation of complex gene networks involved in stress response.This approach has enabled researchers to identify critical genetic targets and pathways that can be manipulated to enhance stress tolerance in crops.Such insights can be leveraged to develop crop varieties that can withstand harsh environmental conditions, including drought stress, which is one of the most signi cant challenges facing global food production (Rico-Chávez et al., 2022; Lopez et al., 2020).Furthermore, machine learning can also help identify and prioritize the most promising candidate genes and regulatory elements for further experimental validation.This approach can signi cantly accelerate the breeding process by reducing the time and resources required to develop new crop varieties.Therefore, the integration of bioinformatics and computational methods, particularly machine learning, represents a powerful approach to identifying the genetic mechanisms underlying plant responses to environmental stress.
Such approaches have the potential to make signi cant contributions to addressing food security challenges, particularly in areas where drought stress is a signi cant constraint on crop production.The use of such methods can ultimately lead to the development of drought-resistant crop varieties, which is critical for ensuring global food security in the face of climate change and other environmental challenges.Therefore, the integration of bioinformatics and computational methods is a promising tool for future research aimed at developing crops that can withstand environmental stressors and ultimately improve global food production.

RNA Seq data
For drought stress, RNA Seq data of tomatoes were extracted from Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/).The data was collected using the following keywords in GEO: "Solanum lycopersicum"[Organism] OR Tomato [All Fields]) AND ("drought response"[MeSH Terms] OR Drought stress[All Fields]) (Figure 1).RNA Seq data were downloaded by SRAToolkit.Each treatment has its corresponding controls.Data were sorted according to time series points of 3, 6, 48, 72, 96, and 120 hours of treatment respectively.There was also data for recovery samples (Supplementary 1).

Mapping
Reads were mapped against the coding sequence (CDS) of the Heinz 1706 genome.The latest annotation (ITAG4.1,https://solgenomics.net/) of the Heinz 1706 genome was used.Kallisto was used for mapping.34688 genes were found after mapping.The abundance le contained transcript id, length, effective length, estimated counts, and TPM (Transcript per million).Transcript id estimated counts and TPM values of all samples were gathered in one tab-separated value le for further analysis.The estimated counts of all samples per time point were averaged and labeled as a single time point.

Gene expression analysis
Differentially expressed genes (DEGs) were analyzed using the Bioconductor edgeR package in R. Estimated counts value of all samples were taken to analyze differentially expressed genes.Data were normalized with library size.Dispersions of samples were calculated with glm approaches of the edgeR package by comparing the treatments with their respective controls.Exact tests were carried out to verify the signi cance of change gene expression and their log fold change values.Multiple testing corrections were done with the topTags function.It ranks genes by p-value.The threshold of considering differentially expressed genes was False discovery rate (FDR)<0.01.For upregulated genes, the logFC threshold was >1, and for downregulated genes, it was < -1.Visualizing of gene expression was done with heatmap using the Complexheatmap Bioconductor package of R. Gene expression was shown with genotype-speci c, tissue-speci c, stress-speci c, heat level-speci c, and sample-speci c using row Annotation, bottom Annotation, and top Annotation functions of Complex heatmap package in R. Intersection and total set of genes of samples was visualized with an upset plot using R for both upregulated genes and downregulated genes.Some selected transcription factors from diversi ed families were collected in literature which are said to be drought-responsive genes.The expressions of that transcription factor were also analyzed across time series scales by comparing them with their respective controls.

Gene ontology analysis
Gene ontology enrichment analysis of signi cant genes of samples was carried out with the topGO package in R. The genome annotation format (GAF, ITAG4.1)le of tomato was collected from (Carolyn, 2021, https://doi.org/10.25739/zh2v-4p15)and a map le was created with the bash script.The map le contained two columns with tabs separated where the rst column was GO terms and the second column was genes to respective GO terms with space-separated.The functions were considered only those who were involved in biological processes.Fisher's classic test was done to nd signi cant levels of GO terms and their functions.The top 15 functions were selected as thresholds.

Candidate selection
We used a tree model with gradient boosting, XGBoost in R to train and test the models.We split the data for each species into training (80%) and testing (20%) sets.We used ve-fold internal cross-validation to select the optimized hyperparameters.We tuned "nrounds" (number of trees), "colsample_bytree" (the proportion of features for constructing each tree), "subsamples" (the portion of training data samples for training each additional tree), and "eta" (shrinkage of feature weights to make the boosting process more conservative and prevent over tting) in an XGBoost: classi cation model.We used the XGBoost-generated feature importance score that indicates how useful each feature was in the construction of the model.

Protein-protein interaction
To nd out the likely interactions of candidate genes the STRING ((https://string-db.org) tool was used (Karimizadeh et al., 2019).For the search interface, parameters were set to full network type, con dence score >0.4, and more than 10 interactors.In the organism interface, it was restricted to Solanum lycopersicum.In the networks graph, the colored nodes display the proteins and the edges represent the interactions.

Conclusion
In this study, we demonstrated that some genes of tomato were signi cantly upregulated in later stages of water stress, while some other genes were also found in the early stages and overlapped with later stages.Our results suggest that some genes play distinct roles in the plant's response to drought stress.Therefore, a comprehensive understanding of these genes is crucial for developing strategies to improve crop resistance to stress.Furthermore, our machine learning approach resulted in identi cation of six candidate genes, with FLA2 having the highest importance score, provides a starting point for further research into the speci c functions and mechanisms of the genes in the context of drought stress.The gene ontology enrichment analysis predicted that the functions of candidate genes are related to drought stress, indicating that they may play crucial roles in the plant's ability to cope with environmental stress.Our study suggests that FLA2, ASCT, and NPF7.3 could be promising candidate genes for further investigation in the context of drought stress in tomato plants.Overall, this study highlights the importance of understanding the genetic makeup of tomato plants to develop strategies to increase stress resistance and fruit quality.The ndings of this study shed light on the genetic makeup of tomato plants and their response to prolonged water de cit, and environmental stress that can signi cantly impact crop yield and quality.Candidate genes and their importance scores for drought response.The importance scores were obtained from machine learning models.

Figures
Figures

Figure 1 Flow
Figure 1

Figure 5 Functions
Figure 5

Table 1 .
Drought stress QTLs and overlapped candidate genes.