Introduction

The impact of global warming on crop productivity is alarming and presumed to decrease global crop yield by 1.5% per decade. The effects of high-temperature (HT) stress are detrimental to plant growth and development, physiological processes, and crop yield per se1. At the cellular level, basic stresses, such as temperature, drought, and salinity, result in cell injury due to osmotic and oxidative stresses. Being immobile, plants respond through a variety of adaptive, avoidance, and/or acclimation mechanisms to mitigate HT stress. These responses include the activation of various physiological and biochemical processes, antioxidant defences, and metabolite synthesis pathways. Similarly, several genetic components, such as structural and regulatory genes perform essential roles in HT stress alleviation.

Heat shock proteins (HSPs) are molecular chaperones that execute crucial functions in response to HT stress. These proteins respond by the folding, accumulation, and degradation of other protective proteins when cells are exposed to HT stress2. The expression of these HSP-coding genes is regulated by a group of DNA-binding transcription factors, known as heat shock factors (HSFs). Therefore, HSFs are the primary regulators of the HT stress-responsive gene expression pathway, which operates by modulating a cascade of signal transduction networks3. Structurally, HSF proteins comprise an N-terminal conserved DNA-binding domain (DBD) of helix-turn-helix motifs that specifically interact with the heat shock elements (HSEs) of HSP gene promoters. Adjacent to DBD exists the oligomerization domain (OD) with the characteristic heptad hydrophobic repeat (HR-A/B) motif. Variations in the amino acid residues of HR-A/B motifs and the distance between the DBD and the OD facilitate the grouping of HSF proteins. Plant HSFs are grouped together within HSFA, B, and C with further sub-groups existing within the respective groups. In addition, HSF proteins also comprise a C-terminal activation domain (CTAD) with an AHA motif, often nuclear localization (NLS) and nuclear export signals (NES)4. The intra-HSF protein domain interactions usually regulate the activation and cellular localization of HSF proteins. Under natural conditions, the HSFA monomer, containing one C-terminal and three N-terminal leucine zipper repeats, is suppressed by an association with HSPs to inactivate this protein in the cytosol. A bi-partite NLS sequence flanking the N-terminal zippers confers nuclear localization. An interaction between the N- and C-terminal zippers in the HSFA monomer masks the NLS sequence. In addition, the interaction of HR-A/B motifs maintains HSFA in a monomeric form and negatively regulates CTAD under normal conditions. 
Upon HT stress, several proteins in the cell misfold; HSPs bind to these misfolded proteins and dissociate from HSFA. This dissociation allows HSFA to form trimers, expose the NLS sequence and translocate to the nucleus to trigger transcription. The DBD of the trimeric HSF recognizes at least three copies of a typical penta-nucleotide sequence, 5′-nGAAn-3′, in the HSE to regulate HSP transcription3. With sequenced genomes now available for many plants, a large number of HSFs have been characterized in several plant species, and their putative roles have been predicted through gene expression studies3. Genome-wide analyses of HSF genes in various plants have revealed their regulatory roles not only in HT stress but also in other abiotic stress responses. This finding emphasizes their possible involvement in a complex crosstalk among the different stress response pathways4. Hence, HSFs are excellent candidate genes for genetic engineering, gene editing or breeding of climate-resilient crops. A thorough genome-scale delineation of these key factors in the target species is indispensable before they can be harnessed in any genetic improvement programme.
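As an illustration of the HSE recognition rule described above, the following minimal sketch scans a promoter sequence for three contiguous penta-nucleotide units. It assumes the canonical alternating arrangement of nGAAn and its inverted repeat nTTCn (nGAAnnTTCnnGAAn or nTTCnnGAAnnTTCn) and reports perfect matches only; the function name `find_hse` and the example sequence are hypothetical and are not part of this study's pipeline.

```python
import re

# Assumption: a minimal HSE is three contiguous 5-nt units that alternate
# between nGAAn and its inverted repeat nTTCn. Only perfect matches are found.
# The lookahead (?=...) lets overlapping sites be reported.
HSE_PATTERN = re.compile(
    r"(?=("
    r"[ACGT]GAA[ACGT]{2}TTC[ACGT]{2}GAA[ACGT]"
    r"|"
    r"[ACGT]TTC[ACGT]{2}GAA[ACGT]{2}TTC[ACGT]"
    r"))"
)


def find_hse(sequence):
    """Return (0-based start, 15-nt site) for every perfect three-unit HSE."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in HSE_PATTERN.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical promoter fragment used only for demonstration.
    for start, site in find_hse("CCAGAACGTTCTAGAACGG"):
        print(start, site)
```

Real HSEs often tolerate mismatches and longer arrays of units, so a production scan would typically use a position weight matrix rather than a strict regular expression.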

Flax (Linum usitatissimum L.) is an important global cash crop producing seed oil (linseed) and bast tissue-derived fibre (linen) as economic products. For various reasons, there is a renewed interest in the cultivation and advanced scientific study of flax. Seed oil from flax is an abundant source of alpha-linolenic acid (ALA) and omega-3 fatty acid. It serves as an exceptional food, feed and industrial feedstock for several purposes. In addition, the cellulosic stem fibre serves as a source for fine textile-grade fabric, the geotextile industry, the composite industry, and the paper and pulp industry5,6. Since flax is a rabi season crop, HT stress is one of the major limiting factors of flax cultivation, especially at the terminal stages. A cold temperature over an extended period is essential for fibre maturation. Thus, the adaptability of elite fibre flax genotypes to warmer climates is extremely poor. The HT stress of 40 °C, over five to seven days, affects pollen viability, boll formation, and seed setting7,8. However, the genetically variable superior alleles of HSPs and HSFs can be harnessed to breed flax varieties with an enhanced capacity to adapt to warm climatic conditions. A comprehensive analysis of the HSF genes from the flax genome9 is thus apt for the genetic improvement of flax with an enhanced resilience to adverse climatic conditions. In this study, we identified and characterized HSFs from the flax genome. The characterization included phylogeny, evolutionary time, and gene expression analysis in tissues and with HT stress treatment. We also identified guide RNA (gRNA) sequences from the LusHSFs to be used in functional studies and genetic improvement through gene editing with the aim of obtaining minimal off-target genomic effects.

Results

HSF gene identification in the flax genome and their sequence features

A search with the Hidden Markov Model (HMM)-based Pfam profile for HSFs (PF00447) in the genome of L. usitatissimum (cv. CDC Bethune) hosted in the Phytozome database produced 40 sequences. The individual HSF protein sequences were further validated by scanning against the Pfam-A database at an E-value threshold of 10⁻³ and against the Simple Modular Architecture Research Tool (SMART) web server for the presence of the characteristic HSF-DBD and coiled-coil structures. Finally, the 40 putative HSF protein sequences were analysed in the HEATSTER database, revealing six loci (Lus10005925, Lus10016634, Lus10022546, Lus10026819, Lus10029852, and Lus10038874) with incomplete domains that are essential for classification as HSF proteins (Supplementary Table S1). These six loci were removed from further analysis because they lacked the essential ‘coiled coil’ oligomeric domain (HR-A/B region), which functions through trimerization upon HT response. The remaining 34 HSF sequences, which contained the characteristic DBD, HR-A, and HR-B motifs, were named LushsfA1a to LushsfC1b based on their classification in the HEATSTER database (Table 1). Other domains, such as NLS, NES, activator motifs (AHA), and the tetrapeptide repressor domain (RD), were also located on the LusHSF proteins. As per Table 1, the length of the LusHSF genes and their CDS ranged from 912 bp (LushsfA4c) to 3585 bp (LushsfA1d) and 273 bp (LushsfB5b) to 1473 bp (LushsfA4d), respectively. The amino acid sequence length of the LusHSF proteins varied from 200 (LushsfB5b) to 822 (LushsfB1a) amino acids. The molecular weight (Mw) and isoelectric points (pI) of the LusHSF proteins ranged from 23.19 (LushsfB5b) to 55.15 (LushsfA4d) kDa and 4.78 (LushsfA8b) to 9.32 (LushsfB5a), respectively. The GRAVY score of each LusHSF protein was negative, ranging from −0.995 to −0.499, indicating that these proteins are highly polar molecules. 
Subcellular localization predictions of the LusHSF proteins based on the k-nearest neighbour classifier of the WoLF PSORT program showed that most of these proteins are localized in the nucleus.
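The GRAVY (grand average of hydropathy) score reported in this section is, by definition, the mean Kyte-Doolittle hydropathy value over all residues of a protein, so a negative value indicates a predominantly hydrophilic sequence. The sketch below shows one way to compute it; the function name `gravy` and the example peptides are hypothetical, and the tool actually used for Table 1 is not stated in this section.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Mean hydropathy of a protein; non-standard symbols are skipped."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


if __name__ == "__main__":
    # Hypothetical peptides: one hydrophobic, one strongly basic and polar.
    print(round(gravy("AAAA"), 3))
    print(round(gravy("RK"), 3))
```

Skipping non-standard symbols (for example `X` or `*`) is a design choice of this sketch; other tools may raise an error instead.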

Table 1 Features of LusHSF genes and proteins in the flax genome.

Chromosomal distribution and gene duplications

The genomic coordinates of the LusHSF genes on the scaffolds and flax chromosomes10 allowed us to estimate the physical location of these genes. Except for one gene, we found all LusHSF genes were randomly distributed on 14 out of the 15 flax chromosomes (Fig. 1). LushsfA7c, which is located on scaffold 87, remains unmapped because the entire scaffold has yet to be mapped on any chromosome. Not a single LusHSF gene was mapped on chromosome 5, while chromosome 8 consists of a maximum of five LusHSF genes. Four chromosomes, viz. 2, 4, 6 and 13, consisted of one LusHSF gene each. Gene expansion by duplication of the LusHSFs was checked through sequence homology analysis and their distribution patterns on the chromosomes. These analyses disclosed that twelve LusHSF genes have homologous gene pairs with >70% sequence identity and >90% query coverage. Eleven LusHSF genes have their duplicate counterparts (paralogues) distributed in separate chromosomes, while one pair, viz. LushsfA1c and LushsfA1a, are located on the same chromosome (Fig. 1). To further investigate whether this interspersed pattern of gene duplication resulted from segmental gene duplications, we compared LusHSF genes and their adjacent genomic regions using the GEvo tool of the CoGe database. Most of the putative LusHSF paralogues and their adjacent regions evolved because of local genomic rearrangements or microcolinearity (see Supplementary Fig. S1). This result indicates that segmental duplication played a significant part in the expansion of the LusHSF genes.

Figure 1

Chromosomal locations and duplication of LusHSF genes. Each bar represents the flax chromosome with the chromosome number shown above the bars. Chromosomal lengths are represented in Mbp. All 34 LusHSF genes are mapped on 14 out of the 15 flax chromosomes. The numbers on the left side of the chromosomes represent their physical positions in Mbp from top to bottom. Putative paralogous LusHSF genes are depicted through connected lines.

Phylogenetic relationships of LusHSFs

Using multiple sequence alignment with Arabidopsis thaliana (AtHSF) and Oryza sativa (OsHSF) HSF proteins, the LusHSF proteins were classified into diverse groups, and a Maximum Likelihood (ML) tree with the highest log likelihood score (−4581.19) was constructed (Fig. 2). The best amino acid substitution model was Jones-Taylor-Thornton (JTT), which had the lowest Bayesian Information Criterion (BIC) score of 14963.17. As per the phylogenetic tree, the LusHSFs were clustered into three broad groups, A, B, and C, and a total of 13 sub-groups, A1, A2, A3, A4, A6, A7, A8, A9, B1, B2, B4, B5 and C1, according to the HSF proteins grouped in clusters. These groupings were supported by high bootstrap values (>90%). The placement of all LusHSFs in the phylogenetic tree corroborated the classifications obtained from the HEATSTER database (Table 1). None of the LusHSF proteins clustered in the A5 or C2 sub-groups. Two LusHSF proteins, LushsfB5a and LushsfB5b, clustered separately in the B5 sub-group, which has no AtHSF or OsHSF members. Sub-group A1 comprised the most LusHSF proteins (five), followed by sub-group A4 (four). In the comparative phylogenetic analysis of HSFs with other plant species from the Malpighiales order and other fibre crops (such as cotton and jute), LusHSFs were grouped distinctly from the other proteins (see Supplementary Fig. S2).

Figure 2

Phylogenetic clustering of LusHSFs, AtHSFs and OsHSFs. The phylogenetic tree was inferred using the Maximum Likelihood (ML) method and the JTT + G + I matrix-based model in MEGA-X. Domain-centric alignment of amino acid sequences from the DBD and OD domains was performed using the MUSCLE algorithm with a maximum of 16 iterations. Thirty-four LusHSF, 21 AtHSF and 33 OsHSF proteins were clustered into 3 broad classes, A, B, and C, with 15 sub-classes within them. Sub-groups marked in grey did not contain any LusHSFs. Bootstrap support values of >50% are shown on the nodes.

Organization of gene structure

The gene structure pattern, including the exon and introns on the LusHSF genes, was analysed by comparing the respective coding sequence and genomic sequences (Fig. 3). Introns were found in all 34 LusHSF genes, which ranged from one to three. The pattern of occurrence, position and length of the introns were found similar among the LusHSFs grouped under different sub-categories. The closely associated members of the same HSF group shared similar intron numbers and lengths, except for the LushsfA4d and LushsfB2d. A maximum of three introns was observed in the LushsfA4d sequence. The longest intron sequence (2.263 kbp) was found in LushsfB1c, followed by LushsfA1d (2.169 kbp). The smallest intron length, 76 bp, was observed in the LushsfB2d gene. In all LusHSF genes, an intron sequence was observed within the HSF-DBD, thus splitting the domain into two. The splicing phase class of all the introns within the HSF-DBD was observed as ‘0’ (i.e., between two codons resulting in unchanged frames or intact codon), except in LushsfA4d, where an intron splicing phase in one of the three introns was observed as ‘1’ (i.e., splitting codons between the first and second nucleotides). The presence of a single intron within the HSF-DBD region is one of the common features of plant HSFs that might have a possible role in mediating alternate splicing in genes that encode diverse protein products.
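The splicing phase classes mentioned above follow directly from the coding length that precedes each intron: phase 0 falls between two codons, phase 1 after the first nucleotide of a codon, and phase 2 after the second. A minimal sketch of that arithmetic is given below; the function name `intron_phases` and the exon lengths are hypothetical and do not correspond to any particular LusHSF gene.

```python
def intron_phases(cds_exon_lengths):
    """Return the phase (0, 1 or 2) of each intron in a gene.

    cds_exon_lengths: coding lengths (bp) of the exons in transcript order,
    counted from the start codon. The intron after exon i has phase equal to
    the cumulative coding length up to exon i, modulo 3.
    """
    phases = []
    cumulative = 0
    # The last exon is not followed by an intron, so it is excluded.
    for length in cds_exon_lengths[:-1]:
        cumulative += length
        phases.append(cumulative % 3)
    return phases


if __name__ == "__main__":
    # Hypothetical three-exon gene: first intron in phase 0, second in phase 1.
    print(intron_phases([300, 601, 200]))
```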

Figure 3

Gene structure showing the distribution of exons and introns of LusHSF genes. A phylogenetic ML tree rectangular diagram of LusHSFs genes is shown on the left. The lengths of the boxes and lines were scaled based on gene length. Blue bars represent exons, while thin black lines indicate introns. The green bars denote the position of the HSF DBD on exons. The numbers indicate splicing phases of LusHSF genes: 0 for phase 0 and 1 for phase 1.

Conserved protein domain and motif predictions

A systematic examination of all 34 LusHSF protein sequences revealed positions and sequences of discrete conserved motifs and domains (Table 2 and Fig. 4). Six types of conserved domains, DBD, OD (HR-A/B), NLS, NES, AHA, and RD, were identified within the LusHSF protein sequences. Except in one protein (LushsfB2d), the DBD was found at the N-terminus of all the LusHSF proteins, followed by the HR-A/B motif and other conserved motifs. In LushsfB2d, RD and NLS motifs precede the DBD. This finding indicated that the DBD and OD, comprising the HR-A/B motif, are the highly conserved domains on LusHSF proteins, followed by the NLS and NES domains. The NLS and NES domains, which are responsible for translocating HSF proteins to the nucleus, were found on most of the LusHSFs either individually or together, except in three proteins, LushsfB1c, LushsfB5b, and LushsfB1a (Table 2). A thorough look at the multiple sequence alignments of the DBD revealed a highly structured domain of conserved motifs that forms three bundles of alpha helices (α1, α2 and α3) and four antiparallel beta strands (see Supplementary Fig. S3a). However, minor differences in the DBD length and amino acid sequence insertions were observed in the DBD alignment, notably in the LushsfA2b sequence. Compared to the amino acid sequence alignment of the DBD, the HR-A/B motif, which forms a coiled-coil structure, was found to be more variable (see Supplementary Fig. S3b). Typically, HR-A and HR-B are two conserved motifs with sequence inserts between them. The HR-A motif was absent or partial in six of the 34 LusHSFs, while ten LusHSFs consisted of partial or no HR-B motifs.

Table 2 Conserved domains and motifs on LusHSF proteins.
Figure 4

Distribution of conserved domains of the LusHSF proteins. A phylogenetic ML tree rectangular diagram of LusHSFs is shown on the left. Proteins with DBD and OD (HR-A/B motif) were scaled according to their lengths. Domain and motif legends are provided below the protein-length scale. For the detailed positions of the domains and motifs, see Table 2.

DNA interaction interface predictions on LusHSF proteins

Identification of protein-protein interaction sites and protein-DNA binding sites on the LusHSF proteins through the PredictProtein server showed a change in the number and location of the active sites (see Supplementary Fig. S4a). Except in LushsfA3a, all LusHSF amino acid sequences were predicted to have these macro-molecular interaction sites. Twenty-two out of the 34 LusHSFs consisted of polynucleotide binding sites. The diversity of the DNA contact points or active DNA binding sites on the LusHSF proteins was further analysed utilizing the protein model-based server TFmodeller, which revealed that most of the LusHSF proteins have DNA contact sites in the N-ring, i.e., purine or pyrimidine through six amino acid interface residues (Table 3). These DNA contact sites were predicted from a matrix of homologous interface contacts by comparing structurally related protein-DNA complexes. The six amino acid residues included serine (S), glutamine (Q), asparagine (N), threonine (T) and two arginine (R) residues. These residues are conserved in all the contact sites, except for LushsfA8b and LushsfA8a, where threonine (T) is replaced with isoleucine (I). The only notable diversity of these DNA contact sites is generally owing to the positional variance of these six amino acid residues in the protein sequence, typically residing between 60 and 221 amino acids from the N-terminus. With our findings, the template human heat shock factor protein model 5d5v_B chain was compared to reveal protein-DNA interface sites on most of the LusHSFs (see Supplementary Fig. S4b). Four LusHSF proteins found no homologous templates to model the protein-DNA interface. The specificity, which represents the evolutionary proportion of sequence-specific contacts for the complex, was almost comparable, 0.26–0.27 (except in four non-homologous LusHSFs, 0.04), in all the LusHSFs, but the level of entropy varied from 0.73 to 1.00.

Table 3 Details of DNA binding site predictions on LusHSF proteins.

Orthologues of LusHSFs, syntenic relationships and divergence time

Putative orthologues of LusHSF genes were predicted using the reciprocal protein blast approach through the crb-blast and OrthoFinder software. HSF proteins from three related plant species, namely Populus trichocarpa, Ricinus communis, and Manihot esculenta, and from three additional plant species in which the HSF genes are well characterized, namely A. thaliana, Vitis vinifera, and Glycine max, were compared to LusHSF proteins. The crb-blast analysis showed that 31 LusHSF proteins matched 87 unique HSF hits. OrthoFinder placed the 34 LusHSF proteins into eleven orthogroups and matched them to 140 HSF hits. Of the 34 LusHSF proteins, thirty-one (91.2%) were consistent in both programmes and had orthologues in at least one of these six species. A maximum of 17 LusHSF orthologues was related to both P. trichocarpa HSFs (36.2%) and M. esculenta HSFs (43.6%), while a minimum of nine orthologues (47.4%) was related to the V. vinifera HSFs (Supplementary Table S2). The synteny map of the above LusHSF orthologous genes revealed that these genes are conserved and are randomly distributed across most of the chromosomes of the orthologous species (Fig. 5). To determine the evolutionary status of the putative LusHSF gene paralogues and orthologues, the ratio of substitution rates of non-synonymous (dN) versus synonymous (dS) sites was computed for each gene pair. The dN/dS ratios computed for all the putative paralogues and orthologues varied from 0.0065 (LushsfA3a-Glyma.09G190600.1) to 0.6022 (LushsfA3b-LushsfA3a). The overall distribution of the dN/dS ratios is presented in Fig. 6a and Supplementary Table S2. The average and median dN/dS ratios were lowest (0.096 and 0.105, respectively) for the putative LusHSF and Arabidopsis HSF orthologues and highest (0.268 and 0.234, respectively) for the putative LusHSF paralogues (Supplementary Table S2). 
In general, the dN/dS ratio was <1.0, indicating that these duplicated genes are under negative or purifying selection pressure. The dN/dS ratios were further used to predict gene duplication times in terms of million years ago (MYA) for each of these putative paralogous and orthologous gene pairs (Supplementary Table S2). The gene duplication of LusHSFs (average ~12.5 MYA, median ~10.6 MYA) was a more recent event than the divergence of the orthologues (Fig. 6b). The most recent duplication was estimated at ~6.5 MYA (LushsfA7a-LushsfA7b) and the oldest at ~24.5 MYA (LushsfC1a-LushsfC1b). The median divergence time of LusHSFs from their P. trichocarpa orthologues was the most recent (~186.2 MYA), while the earliest predicted divergence was from the Arabidopsis orthologues (~259.7 MYA). Among the LusHSF orthologues analysed, five gene pairs, viz. LushsfB1a-AT4G36990.1, LushsfA1a-Glyma.09G206600.2, LushsfC1b-Glyma.09G190600.1, LushsfA2a-29912.m005526, and LushsfB2a-30147.m014282, showed dS values > 10 and yielded predicted divergence times dating back >1000 MYA.
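The selection and dating arithmetic discussed above can be sketched as follows. The exact dating formula and substitution rate are not given in this section, so the sketch assumes the widely used relation T = dS / (2λ), where λ is the synonymous substitution rate per site per year; the default rate of 6.1 × 10⁻⁹ and the function names are assumptions for illustration only, not the values used for Supplementary Table S2.

```python
def selection_regime(dn, ds):
    """Return (dN/dS, label) for one gene pair."""
    if ds <= 0:
        raise ValueError("dS must be positive to form a ratio")
    ratio = dn / ds
    if ratio < 1.0:
        label = "purifying"
    elif ratio > 1.0:
        label = "positive"
    else:
        label = "neutral"
    return ratio, label


def divergence_time_mya(ds, rate=6.1e-9):
    """Estimate divergence time in MYA as T = dS / (2 * rate).

    rate is an ASSUMED synonymous substitution rate per site per year;
    replace it with a lineage-appropriate estimate before real use.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ds / (2.0 * rate) / 1e6


if __name__ == "__main__":
    # Hypothetical dN and dS values for a single duplicated pair.
    ratio, label = selection_regime(0.05, 0.5)
    print(round(ratio, 3), label)
    print(round(divergence_time_mya(0.122), 1))
```

Because the estimate scales linearly with dS, saturated synonymous sites (very large dS) inflate the predicted time, which is one reason such pairs are usually interpreted with caution.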

Figure 5

Syntenic relationships among putative orthologues and LusHSF genes. The syntenic relations of LusHSF genes to Arabidopsis, soybean, cassava, poplar, grape and castor were plotted using CIRCOS v0.69-5. The chromosomal positions of the syntenic HSF gene pairs are represented with red links. LusHSF genes are labeled on flax chromosomes, and the chromosome numbers are mentioned in the karyotype chords.

Figure 6

Box and whisker plots showing the comparative distribution of (a) the ratio of non-synonymous to synonymous substitution rates (dN/dS) and (b) the estimated time of gene duplication (MYA) in putative paralogues and orthologues of LusHSFs. The top of the box (coloured region) represents the 3rd quartile (Q3), while the bottom of the box (white region) represents the 1st quartile (Q1). The ends of the whiskers represent the maximum and minimum values within 1.5 times the interquartile range above Q3 and below Q1, respectively. The maximum and minimum outlier values are represented as open circles and star symbols, respectively. In (b), the detailed distribution of gene duplication times (in MYA) of putative LusHSF paralogous gene pairs is shown separately below the comparative figure.

Cis-acting element localization on LusHSF promoters

Since the promoter of a gene often consists of cis-acting regulatory elements that confer its functional specificity, we analysed the distribution of cis-elements in the 1000 bp upstream promoter sequence of LusHSF genes. First, our analysis with the TSSP program Softberry showed that four out of 34 LusHSF promoters comprised unverified bases, thus restricting their lengths to less than 1000 bp for the analysis. Putative promoter positions based on the transcription start site (TSS) were predicted in a total of 24 (70.6%) LusHSF upstream sequences (Table 4). Four of these sequences showed more than one putative TSS position. The location of the putative TATA box sequences in the 14–38 bp region upstream of the TSS was predicted in 23 out of 24 LusHSF promoters. Enhancer elements were predicted in 14 of the LusHSF promoters, of which two promoters consisted of more than one enhancer element. Next, our analysis of the distribution of cis-acting regulatory elements in the promoter sequences of LusHSFs demonstrated the existence of various regulatory elements related to the abiotic-stress response (Table 4). These elements include ABRE (abscisic acid responsive element), CCAAT-box, DRE/CRT/CBF (dehydration-responsive element/C-repeat/C-repeat binding factors), HSE, LTRE (low-temperature response element), MBS (MYB-binding site), and PRECONSCRHSP70 (plastid response element in the promoters of heat shock protein 70A). Although the software programs PlantCARE and PLACE both predicted abiotic stress-related regulatory elements, in a majority of the LusHSF promoters, both programs varied in the number of predicted elements. PlantCARE predicted a smaller number of elements compared to PLACE. In agreement with the PlantCARE program, a considerable number of LusHSF promoters were found to consist of HSE and LTRE, which are linked to impart tolerance to high and low temperatures. 
In addition, a significant number of elements of ABRE, MBS, and TC-rich repeats were also located, which are likely to be induced under dehydration stress. Each of two LusHSF promoters consisted of DRE and CCAAT box elements; the former is responsible for dehydration stress tolerance, and the latter is involved in interactions with an HSE element to enhance heat shock promoter activity. The program PLACE predicted a considerable number of MYB/MYC transcription factor-binding sites ranging from zero to 31, followed by CRT/DRE/CBF and LTRE. A significant number of cis-elements associated with heat shock protein 70 (HSP70) were also located on the LusHSF promoters by the program PLACE. Altogether, the above results show that the LusHSF promoters are enriched with numerous potential cis-acting elements related to the abiotic-stress response.

Table 4 Details of promoter analysis in the 1000 bp upstream sequence of the LusHSF genes. Position of first nucleotide of putative TSS/TATA box/Enhancer from the 5’ end of the upstream sequence analyzed and not from the start codon. NP- No prediction; PRECON70#- PRECONSCRHSP70.

Gene expression dynamics of LusHSFs in different tissues

A homology search of LusHSF genes against the microarray data (Accession no. GSE21868) revealed only nine high-quality unigene hits (>95% identity). The fewer number of LusHSF hits to the microarray data could be attributed to the expressed sequence tags (ESTs) of the Hermes cultivar used to develop the array rather than the flax genome of CDC Bethune. Nonetheless, these nine LusHSF genes revealed a differential gene expression pattern in different flax tissues (Fig. 7a). On a closer look, the LushsfB1a, which belongs to the B1 group, was found to have higher gene expression in most of these tissues, while LushsfA7c was expressed at low levels. HSF genes from the B1 group are heat inducible and are known for their role in repressing other HSF genes under non-heat conditions. Interestingly, the LushsfA9b gene, which belongs to the A9 group, was less abundant in all tissues but was highly expressed in the late embryo developmental stages. HSF genes from A9 groups are known for their involvement in seed development. Similarly, in another microarray dataset (GSE61311), eight LusHSFs exhibited differential expression patterns in inner and outer stem tissues at the vegetative stage of the wild and mutant genotypes (Fig. 7b). Compared to the microarray data of LusHSFs, the differentially expressed transcriptome resources from the shoot apex of the flax variety CDC Bethune (GSE80718) showed higher hits of 27 LusHSF genes. Twelve of these LusHSF genes showed differential expression patterns in the apical and basal tissues (Fig. 7c). Five LusHSF genes were expressed in abundance and three genes showed low expression in both tissues. However, four LusHSF genes showed contrasting expression patterns in these tissues. The hierarchical clustering of the LusHSF genes in all the above digital gene expression analyses was found in accordance with their expression patterns. 
From the above digital gene expression analysis, we speculate that the majority of LusHSF genes differ in their expression patterns in various flax tissues and growth stages.

Figure 7

Heat map and hierarchical clustering of digital gene expression of LusHSF genes in different flax tissues. (a) LusHSF corresponding gene IDs were derived from the microarray data under GEO accession no. GSE21868. The mean of RMA-normalized, averaged gene-level signal intensity (log2) values were plotted using the Heatmap Illustrator (HemI v.1.0). Tissue includes SIV: stem inner tissue from the vegetative stage; SOV: stem outer tissue from the vegetative stage; root; leaf; SIGC: inner stem at the green capsule stage; SOGC: outer stem at the green capsule stage; and embryo at 10, 20, and 40 days post flowering. (b) The normalized signal intensity values of the LusHSF genes derived from the transcriptome data under GEO accession no. GSE61311 is plotted as a heat map. Digital samples include WT-SIV: inner stem tissue from the vegetative stage of wild-type plants; mut-SIV: inner stem tissue from the vegetative stage of lignified bast fibre mutant plants; WT-SOV: outer stem tissue from the vegetative stage of wild type plant; and mut-SOV: outer stem tissue from the vegetative stage of lignified bast fibre mutant plant. (c) Heat map generated for the LusHSFs derived from RNA-seq data (Accession no. GSE80718) using the log2 transformed average FPKM values. In all heat map plots, the coloured bars shown on the right represent their expression levels.

Expression pattern of LusHSF genes under HT stress

We examined the expression pattern of the LusHSF genes under HT stress in two fibre flax cultivars, European Viking and Indian JRF-2, by measuring mRNA abundance in the shoot apex of 30-day-old control and HT-stressed (40 °C for 12 h) seedlings. In a preliminary screening, twelve LusHSF genes produced clear and consistent bands of the expected size in both control and HT-stressed samples. The remaining LusHSF genes either showed inconsistent presence/absence of bands or produced non-specific amplicons (data not shown). The RT-qPCR analysis of the twelve LusHSF genes revealed differential expression patterns in the control and HT-stressed plants (Fig. 8). Interestingly, in control JRF-2, the expression of a majority of the LusHSF genes was elevated compared to that in the other samples (0.82 to 34.2-fold). In contrast, most of the LusHSF genes were down-regulated in the HT stress-treated JRF-2 (0.44 to 15.33-fold) compared to those in the JRF-2 control plants. However, LushsfA1c and LushsfA1a were reasonably up-regulated in HT stress-exposed JRF-2. In HT-stressed Viking, the LushsfA7a and LushsfB2b genes were significantly up-regulated compared to those in control Viking and HT-stressed JRF-2 plants. Two genes, LushsfA1b and LushsfB4a, showed non-significant gene expression changes in all the samples compared to those in control Viking. Altogether, these differential expression patterns suggest that these genes may be functionally relevant in the HT stress response in a genotype-dependent manner.

Figure 8

Relative quantification (RT-qPCR) of selected LusHSF genes between HT-treated and control plants of the JRF-2 and Viking cultivars. (a) The upper panel shows the effect of HT stress treatment at 40 °C for 12 h on the flax cultivars, which were used for total RNA extraction. (b) The relative gene expression fold changes (2^−ΔΔCt) of twelve selected LusHSF genes are represented as bar diagrams. The Ct values of each sample-HSF gene combination were normalized using the reference gene ETIF3E and calibrated with the Ct values of the Viking control to estimate the 2^−ΔΔCt values. The statistical significance of the expression values is represented by ‘ns’ as non-significant and ‘*’ as significant at p < 0.05 with Bonferroni’s multiple comparisons test. One asterisk (*) represents adjusted P values between 0.01 and 0.05, two asterisks (**) represent adjusted P values between 0.001 and 0.01, and so on. RNA samples include Vik_C: Viking under control conditions; Vik_H: Viking under HT stress conditions; JRF-2_C: JRF-2 under control conditions; JRF-2_H: JRF-2 under HT stress conditions.
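The fold-change calculation summarized in this caption can be sketched as below: each target Ct is first normalized to the reference gene in the same sample (ΔCt), then calibrated against the ΔCt of the calibrator sample (ΔΔCt), and the fold change is 2 raised to −ΔΔCt. The function name `fold_change` and all Ct values are hypothetical; the sketch also assumes roughly 100% amplification efficiency, which the 2^−ΔΔCt method requires.

```python
def fold_change(ct_target_sample, ct_ref_sample,
                ct_target_calibrator, ct_ref_calibrator):
    """Relative expression by the 2^-ddCt method.

    ct_ref_*: Ct of the reference gene (ETIF3E in this study).
    *_calibrator: Ct values from the calibrator sample (Viking control here).
    """
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical Ct values: the target crosses threshold two cycles earlier
    # in the treated sample, relative to the reference gene.
    print(fold_change(22.0, 18.0, 24.0, 18.0))
```

By construction the calibrator sample itself always yields a fold change of 1, so values above 1 indicate up-regulation and values below 1 indicate down-regulation relative to the Viking control.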

Prediction of CRISPR/Cas9 guide sequences with minimum off-target effects in the flax genome

We screened the LusHSF genes using the online CRISPOR tool to identify unique 20 bp gRNA sequences for each LusHSF gene. These gRNA sequences, which will serve as a resource for clustered regularly interspaced short palindromic repeats/CRISPR-associated 9 (CRISPR/Cas9)-based gene editing or functional studies, were compared and aligned to the L. usitatissimum genome. The gRNA sequences with the highest specificity and those located within 12 bp adjacent to the protospacer adjacent motif (PAM) sequence (the ‘seed region’) of the gRNA were considered for assessing minimum off-target effects. The total number of gRNA predictions for each LusHSF ranged from 74 to 223 with the 3′ PAM sequence NGG (where N = A/T/G/C). At a specificity score >50 (cutoff for high specificity), the number of gRNA sequences ranged from nine to 157. The gRNA sequences with the highest specificity score and minimum off-target effects are listed in Supplementary Table S3. Most of these gRNA sequences had few off-target hits, ranging from 0 to 21 at the whole-genome level, which may arise from 2 to 4 nucleotide mismatches. None of the gRNA sequences was predicted to produce off-target effects (with up to four nucleotide mismatches) within the seed region, i.e., within the 12 bp adjacent to the PAM sequence. Forward and reverse primers were predicted for cloning and expression of all the LusHSF gRNA sequences using the T7 RNA polymerase-based system in the popular gene editing vector DR274 (Addgene plasmid # 42250). Specific restriction enzyme sequences were also predicted within the gRNA sequence at three bp 5′ to the PAM to facilitate the screening of mutation events induced by the gRNA in CRISPR experiments. Oligonucleotides with barcodes and corresponding sequencing primers to generate lentiviral saturation mutagenesis screens with the LusHSF genes are also shown in Supplementary Table S3.
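The target-site definition used above (a 20 bp protospacer immediately followed by a 3′ NGG PAM, with the 12 bp nearest the PAM treated as the seed region) can be illustrated with a short scan of both strands. This is only a sketch of the site-enumeration step, not of CRISPOR's specificity scoring or off-target search; the function names and the example sequence are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# 20-nt protospacer followed by an NGG PAM; lookahead reports overlapping sites.
_SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_grna_candidates(sequence):
    """List candidate SpCas9 target sites on both strands.

    'start' is 0-based on the scanned strand; for strand '-', it is the
    offset within the reverse-complemented sequence.
    """
    sequence = sequence.upper()
    hits = []
    for strand, strand_seq in (("+", sequence),
                               ("-", reverse_complement(sequence))):
        for match in _SITE.finditer(strand_seq):
            protospacer, pam = match.group(1), match.group(2)
            hits.append({
                "strand": strand,
                "start": match.start(),
                "protospacer": protospacer,
                "pam": pam,
                # Seed region: the 12 nt of the protospacer nearest the PAM.
                "seed": protospacer[-12:],
            })
    return hits


if __name__ == "__main__":
    # Hypothetical 23-nt fragment containing a single forward-strand site.
    for hit in find_grna_candidates("ACGTACGTACGTACGTACGTTGG"):
        print(hit)
```

Candidates enumerated this way would still need genome-wide off-target evaluation, which is the part handled by CRISPOR in this study.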

Discussion

Past studies on fibre flax indicate that both low-temperature and HT stress, even in the absence of drought, are critical to flax growth and reproduction7,8. Seed germination, flowering, and seed setting in flax are optimal at temperatures between 16 °C and 25 °C. In a simulated experiment, HT during the initial growth phase, followed by a low temperature in the intermediate phase and HT during the late growth stages, was observed to be the most favourable condition for fibre flax growth11. However, exposure to more than 40 °C for a stretch of five days during flowering was detrimental, reducing seed yield and fibre quality7. Partial to complete necrosis of the ovules was the crucial limiting factor in poor seed setting due to HT stress in flax. A prolonged period of HT stress also forces the plant to undergo compensatory flowering8. This information warrants prioritized research on the genetic improvement of flax, especially fibre types, for terminal HT stress tolerance. In the long run, these findings will facilitate the acclimation of superior fibre quality flax genotypes to diverse climatic conditions.

Among various genetic components, HSFs and HSPs play significant roles in responding to HT stress in plants. The former gene group acts as a regulatory partner in the functioning of the latter group, which serves as chaperones12. A fundamental knowledge of the interplay between these two key genetic factors is crucial before designing a genetic improvement strategy for HT stress tolerance in any plant. Although the genome sequence of flax has been available for the past few years9, the characterization of HSFs in flax has remained obscure until now. The present study revealed 34 true HSF sequences distributed across 14 of the 15 flax chromosomes. A cumulative analysis of flax and other representatives from the order Malpighiales and commercial fibre crops, whose genome sequences are available, revealed a diverse HSF family size. Our count of 34 non-redundant complete LusHSFs is higher than those of Ricinus sp. (18) and Corchorus sp. (18), but lower than those of Gossypium raimondii (57), Salix purpurea (48), P. trichocarpa (47), and M. esculenta (39) (http://planttfdb.cbi.pku.edu.cn). Considering that whole genome duplication (WGD) events during ancient polyploidization and lineage-specific duplications are crucial factors for speciation and the expansion of gene families13,14, variations in the number of HSFs in flax and related plants shed light on how this gene family has co-evolved. From WGD time estimates in Malpighiales, it is clear that flax has undergone two rounds of genome duplication: an earlier duplication occurring ~20–40 MYA and a more recent genome duplication at 5–9 MYA compared to the other plants analysed15,16. Intraspecies synteny analysis revealed that many of the LusHSF genes in the flax genome constitute part of syntenic blocks that still support their WGD origin. Genome duplication events, together with gene gains or losses, might have contributed to the diversity of the HSF gene family17.
Through orthologue identification and dN/dS substitution-based homology analysis of the LusHSFs, we could predict that the divergence of HSFs in other related plant species occurred much earlier than that in flax, perhaps during an ancient polyploidization event. Therefore, most of the putative LusHSF paralogues that co-evolved with the recent flax genome duplication event (5–9 MYA) might also correspond to diverse gene structures and functions.

In the present study, we describe a comprehensive characterization of the LusHSF genes and amino acid sequences to identify their important domains and motifs. All 34 selected LusHSF proteins comprised conserved characteristic domains, such as DBD, HR-A/B regions, NLS, NES, and CTAD, thus qualifying these proteins as true HSF proteins. Since promoter regions are enriched with specialized cis-acting regulatory elements that also specify their putative functions18, we queried the promoter regions of the LusHSFs. The results revealed that the LusHSF promoters are enriched with a variety of regulatory elements related to abiotic stress tolerance, including the HSE and LTRE, which confer gene expression in response to high- and low-temperature conditions, respectively. From our digital gene expression analysis, using microarray data from the National Center for Biotechnology Information (NCBI) database, evidence of the differential expression of the LusHSF genes was detected in different tissues. Transcriptional analysis of twelve LusHSF genes was also performed in two different fibre flax cultivars, Viking and JRF-2, under control and HT stress conditions. Interestingly, the analysis revealed that the abundance of the majority of the LusHSF mRNAs is significantly higher in control JRF-2 (up to 34.2-fold) than in control Viking. This difference may explain the better adaptability of the Indian JRF-2 cultivar to the hot and humid conditions of India compared to that of European Viking. In a few LusHSFs, a fold change in gene expression of up to 16.74 was also observed in the HT-stressed Viking and of up to 15.33 in the HT-stressed JRF-2. Overall, we noticed that the endogenous expression of LusHSFs in control JRF-2 was higher than that in the HT-stressed JRF-2. One possible reason for the down-regulation of these LusHSFs in HT-stressed JRF-2 plants is the prolonged HT stress treatment of 12 h.
A similar down-regulation of the HSF genes under HT shock treatment was recorded in plants3. All this information suggests that the LusHSFs might produce a differential response in different flax genotypes regarding HT stress and can be selected as candidate gene resources for functional studies and genetic improvements. Genetic engineering using candidate HSF genes was reported to impart enhanced thermotolerance to crop plants, such as in wheat19.

From our analysis of LusHSFs, we assume that these proteins perform different functions in various tissues and in HT stress-induced responses. Since multifunctional HSF proteins have roles in various abiotic-stress tolerance responses3,20, the involvement of LusHSFs in regulating other traits cannot be ignored. Therefore, a comprehensive functional analysis of the LusHSF genes is a prerequisite before harnessing these molecules in any genetic improvement programme. For functional studies on LusHSFs, we were interested in determining the active macro-molecular binding sites on the LusHSF proteins. Computational predictions of the active sites for protein-DNA interactions can considerably reduce the cost and time of functional assays by providing a preliminary functional annotation. These computational predictions can be made using information from amino acid sequences or related protein structure models21. Through both the homology model and an amino acid sequence-guided approach, our analysis of the nucleoprotein interaction sites on the interface of LusHSFs aided us in identifying the active amino acid residues that might affect their functions, such as the HT stress response.

Genome editing using the CRISPR/Cas9 system is rapidly emerging as a tool for targeted gene knockout in plants, enabling precise and rapid functional analysis22,23. Nevertheless, the success of CRISPR/Cas9 technology depends primarily on the specificity of the gRNA sequences designed to perform targeted gene knockouts. CRISPOR is a simple web tool that permits users to design gRNAs for genome-wide CRISPR and saturation screens with information on the possible off-target effects on the genome of interest24. Our screen for gRNA sequences specific to the LusHSF genes produced sequences with minimum possible off-target effects, specifically within sequences adjacent to the PAM. This tool also predicted the gRNA cloning strategy for maximum functionality of the CRISPR system. Genome-wide analyses of a few important gene families, such as the genes controlling fatty acid biosynthesis25, chalcone synthase26, β-galactosidases27, cinnamyl alcohol dehydrogenase (CAD) genes28, NBS-LRR29, aquaporin30, pectinmethylesterases (PME) and pectinmethylesterase inhibitors (PMEI)31, UDP glycosyltransferase (UGT)32 and the dirigent protein family33, have been carried out in the flax genome. However, designing gRNA sequences for functional analysis of these gene families has not been attempted in any of these studies. Our present study, in addition to identifying active protein and DNA binding sites on LusHSF proteins, predicted and designed a number of gRNA sequences with the least off-target effects for functional analysis of the candidate LusHSF genes.

In summary, we identified 34 LusHSF genes with specific DNA and amino acid sequence features, plotted these genes on flax chromosomes and phylogenetically reconfirmed their classification into three broad groups and 13 subgroups based on their protein domains. In terms of evolutionary gene family expansion, the putative LusHSF paralogues were estimated to have arisen from a more recent gene duplication event than the orthologues. Functional predictions were based on various abiotic stress-related cis-acting elements detected in the promoter regions of the LusHSFs, the dynamics of digital gene expression patterns in different tissues, and the quantitative expression patterns of these genes under control and HT stress conditions. One of the key outcomes of the present study is the design of gRNAs for individual LusHSFs to promote further functional studies of this important gene family. However, a systematic analysis of gene expression under different temperatures and at different time intervals is imperative to assign a specific role to each candidate LusHSF gene before utilizing this gene resource in the genetic improvement of fibre flax for HT stress tolerance.

Methods

Retrieval and characterization of HSF sequences

Genomic, coding, protein and promoter sequences of the HSF gene family with conserved DBDs (Pfam ID: PF00447) from the L. usitatissimum (cv. CDC Bethune) genome were retrieved from the Phytozome database v12.1 (https://phytozome.jgi.doe.gov/pz/portal.html) using the BioMart tool. The protein sequences were confirmed using the batch search tool of the Hidden Markov Models (HMMs) in the Pfam 31.0 database34 with an E-value threshold of 10^−3 and the Simple Modular Architecture Research Tool (SMART) for HSF DBD35. The putative LusHSF genes and their detailed classifications were further checked against the HEATSTER platform20 (http://www.cibiv.at/services/hsf/) and analysed with MARCOIL (http://toolkit.tuebingen.mpg.de/marcoil) to determine the presence of coiled-coil structures. The isoelectric point (pI) and molecular weight of the LusHSF proteins were estimated using the Compute pI/Mw tool of ExPASy36 (http://www.expasy.org/). The grand average of hydropathy (GRAVY) score, which is based on the hydropathy of all the amino acids of a protein molecule and indicates whether a protein is polar or non-polar in nature37, was estimated using the GRAVY calculator (http://www.gravy-calculator.de/). The subcellular localization of the LusHSF proteins was predicted using WoLF PSORT38 (http://www.genscript.com/wolf-psort.html).
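
As an illustration of the GRAVY score described above, the following minimal Python sketch averages Kyte-Doolittle hydropathy values over a protein sequence. It is not the code behind the GRAVY calculator web tool; the function name `gravy` and the choice to skip non-standard residues are assumptions made for this example.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean hydropathy over all standard residues of a protein.

    Assumption for this sketch: characters that are not one of the 20
    standard amino acids (e.g. 'X' or '*') are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


if __name__ == "__main__":
    # Hypothetical peptide, used only to show the call.
    print(round(gravy("MKRSEDLLV"), 3))
```

A negative score suggests an overall hydrophilic (polar) protein, whereas a positive score suggests a hydrophobic (non-polar) one.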

Chromosomal mapping and analysis of gene duplication

The genomic coordinates of the 34 LusHSF genes were mapped on the flax chromosomes10 using the software Graphical Geno Typing v2.0 (GGT 2.0)39 and MapChart v2.340. The paralogous relationships of the LusHSF genes were identified according to their duplication patterns (tandem or block) using conditional reciprocal blast (crb-blast)41 with a stringent E-value of 1.0e−50. The paralogous partners were identified based on query coverage >90% and percentage of identical matches >70%. Patterns of genome duplication among the putative LusHSF paralogues and their adjacent genomic regions were analysed using the GEvo tool of the CoGe database (https://genomevolution.org/coge/) and the (B)LastZ: Large Regions algorithm42.
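
The paralogue filtering criteria above can be sketched in Python as follows. This is only an illustration of the thresholds stated in the text (E-value of 1.0e−50, query coverage >90%, identity >70%); the `Hit` record, its field names, and the gene identifiers in the usage example are hypothetical and do not reproduce the crb-blast output format.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A simplified pairwise alignment record (hypothetical layout)."""
    query: str
    subject: str
    identity: float        # percentage of identical matches
    query_coverage: float  # percentage of the query covered by the alignment
    evalue: float


def filter_paralogues(hits: List[Hit],
                      min_identity: float = 70.0,
                      min_coverage: float = 90.0,
                      max_evalue: float = 1.0e-50) -> List[Hit]:
    """Keep non-self hits that pass the identity, coverage and E-value cutoffs."""
    kept = []
    for hit in hits:
        if hit.query == hit.subject:
            continue  # a gene is not its own paralogue
        if (hit.identity > min_identity
                and hit.query_coverage > min_coverage
                and hit.evalue <= max_evalue):
            kept.append(hit)
    return kept


if __name__ == "__main__":
    # Invented example records; identifiers are placeholders.
    example = [
        Hit("geneA", "geneA", 100.0, 100.0, 0.0),
        Hit("geneA", "geneB", 92.5, 98.0, 1e-120),
        Hit("geneA", "geneC", 65.0, 95.0, 1e-60),
    ]
    for hit in filter_paralogues(example):
        print(hit.query, hit.subject)
```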

Multiple sequence alignments, phylogeny, and classification

The amino acid sequences of the conserved DBD and OD with HR-A/B motifs of LusHSFs were deduced from the HEATSTER platform20,43 for multiple sequence alignments using the online Clustal Omega tool of EMBL-EBI44 (http://www.ebi.ac.uk/Tools/msa/clustalo/) and visualized using BoxShade v3.21 (http://www.ch.embnet.org/software/BOX_form.html). For phylogenetic tree reconstruction and to reconfirm the classification of the LusHSFs, the amino acid sequences from the start of the DBD to the end of the OD of flax, Arabidopsis (dicot model plant), and rice (monocot) HSFs were retrieved for multiple sequence alignments. The alignment was performed using the MUSCLE algorithm with 16 maximum iterations in the MEGA-X software45. The phylogenetic tree was inferred using the Maximum Likelihood (ML) method and the Jones-Taylor-Thornton (JTT) matrix-based amino acid substitution model46. The best model was selected using the model selection tool in MEGA-X, based on the lowest Bayesian Information Criterion (BIC) score. A discrete gamma distribution was chosen to model evolutionary rate differences among sites [16 categories (+G, parameter = 1.6109)]. The initial trees for the heuristic search were obtained automatically by applying the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with the superior log likelihood value. All positions with less than 95% site coverage were eliminated, i.e., fewer than 5% alignment gaps, missing data, and ambiguous sequences were allowed at any position (partial deletion option). The test of phylogeny was conducted using 1000 bootstrap replications. All other parameters of phylogenetic tree reconstruction were kept at their defaults. Using similar parameters, a phylogenetic ML tree was also reconstructed using the HSFs from closely related sequenced plants of the order Malpighiales and commercial fibre crops, such as cotton and Corchorus spp.

Gene structures, protein domain distributions and DNA-binding site predictions

The exon/intron organization and splicing phases of the LusHSF genes were derived by aligning the corresponding CDS and genome FASTA sequences in the Gene Structure Display Server (GSDS2.0) (gsds.cbi.pku.edu.cn/) programme47. The LusHSF DBD coordinates on the protein and the phylogenetic tree in Newick format were used as inputs to display the gene structures. The distribution of protein domains, such as DBD, OD (HR-A/B), NLS and NES, on the LusHSF amino acid sequences was determined from the online HSF prediction tool of the HEATSTER platform20,43, and Interproscan48. The conserved domains and motifs were visualized using the Illustrator for Biological Sciences (IBS) v.1.0.349 (http://ibs.biocuckoo.org/). Additionally, the LusHSF protein sequences were scanned to predict protein-protein and protein-DNA binding interfaces using a FASTA sequence search approach in the online open PredictProtein server50 (https://open.predictprotein.org/) and the comparative model-based TFmodeller web server51 (http://maya.ccg.unam.mx/~tfmodell/).

Orthologue identification, synteny mapping, and evolutionary analysis

Putative orthologues of LusHSF genes were identified from Arabidopsis, poplar (P. trichocarpa), castor bean (R. communis), cassava (M. esculenta), soybean (G. max), and grape (V. vinifera) HSFs, which were derived from the plant transcription factor database v.3.0 (PlantTFDB)52 (http://planttfdb.cbi.pku.edu.cn/index.php). The crb-blast program41 at an E-value of 1.0e−50 was employed for this purpose. The top query and subject BLAST hits were filtered in Microsoft Excel using >70% identity and >90% query and subject coverage. Orthologues were also inferred using OrthoFinder v.1.1.853 and compared with the crb-blast output. Only the consistent putative orthologues were used for synteny mapping. The corresponding genomic coordinates of the putative orthologous gene pairs were derived from the respective genomes in the Phytozome database v.12.1, and the orthologous relationships were visualized using CIRCOS v0.69-554. The dN/dS estimation of putative LusHSF homologue sequences (both orthologues and paralogues) was conducted using PAL2NAL (http://www.bork.embl.de/pal2nal/) with the codeml program in PAML55. The evolutionary time (T), or likely gene duplication event, of the HSF genes was calculated in million years ago (MYA) using a synonymous mutation rate of λ substitutions per synonymous site per year. dN/dS ratios of <1, >1, and 1 indicate negative (purifying) selection, positive selection, and neutral evolution, respectively.
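
The dating and selection criteria above can be illustrated with a short Python sketch. It assumes the commonly used relation T = Ks/(2λ), which the text does not state explicitly, and the value of λ in the usage example is a placeholder rather than the rate used in this study. The function names are also hypothetical.

```python
def divergence_time_mya(ks: float, rate: float) -> float:
    """Estimate divergence time in million years ago (MYA).

    Assumes T = Ks / (2 * rate), where `ks` is the number of synonymous
    substitutions per synonymous site and `rate` is the synonymous mutation
    rate (substitutions per synonymous site per year).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ks / (2.0 * rate) / 1.0e6


def selection_mode(dn: float, ds: float, tol: float = 1e-9) -> str:
    """Classify a gene pair by its dN/dS ratio."""
    if ds <= 0:
        raise ValueError("dS must be positive to form a ratio")
    ratio = dn / ds
    if abs(ratio - 1.0) <= tol:
        return "neutral"
    return "purifying" if ratio < 1.0 else "positive"


if __name__ == "__main__":
    # Placeholder inputs: Ks = 0.2 and an illustrative rate of 1.0e-8.
    print(divergence_time_mya(0.2, 1.0e-8))
    print(selection_mode(dn=0.1, ds=0.5))
```

With these placeholder inputs, the estimate is about 10 MYA and the pair is classified as being under purifying selection.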

Cis-acting regulatory element identifications

To predict the putative promoter region based on the transcription start site of plants (TSSP) in 1000 bp sequences upstream of LusHSF genes, the TSSP online program of SoftBerry (http://www.softberry.com/berry.phtml) was used. The unverified string of bases from the putative promoters was removed from the analysis. The cis-acting regulatory elements were searched on the putative promoter regions of LusHSF genes using plant cis-acting regulatory DNA elements (PLACE; https://www.hsls.pitt.edu/obrc/index.php?page=URL1100876009) and plant cis-acting regulatory elements (PlantCARE; http://bioinformatics.psb.ugent.be/webtools/plantcare/html/) databases56,57. Both positive and negative promoter DNA strands were subjected to a cis-element search. Only the abiotic stress-related regulatory elements were retrieved.

Digital gene expression analysis

Gene expression data (microarray and transcriptome sequencing) from different flax tissues and developmental stages were downloaded from the NCBI Gene Expression Omnibus (GEO) repository to analyse the digital gene expression of LusHSF genes. Microarray data (accession number GSE21868) from inner and outer stems, embryos, leaves, and roots58 were subjected to a homology-based (blastn) similarity search against the LusHSF sequences with an E-value cutoff of 1.0e−50. The best LusHSF-aligned unigene sequences (≥95% identity) were used to derive log2 values from the robust multi-array average (RMA) values of the microarray data. The mean log2 values for different tissues are represented by a heatmap diagram. Similarly, other microarray data, such as GSE61311 (unpublished), with inner and outer stem tissues from wild-type and mutant (lignified bast fibre mutant 1) plants, and the RNA-seq data for the flax shoot apex (GSE80718)59, were searched, and the corresponding fragments per kilobase of transcript per million mapped reads (FPKM) values were log2 transformed before plotting in heatmaps. All heatmaps were generated using the Heatmap Illustrator (HemI v.1.0)60, and clustering was performed using the hierarchical method with average linkage and the Euclidean distance similarity metric.

Plant samples, HT stress treatment, and RT-qPCR analysis

Two different winter fibre flax cultivars, European Viking and Indian JRF-2 (Tiara), were grown from seed under controlled glass-house conditions. The former cultivar was a French introduction, while the latter was a variety released for the conditions in India. Our initial field observation showed Viking to be a heat-susceptible cultivar (deformed inflorescence, poor flowering, and seed setting) compared to JRF-2. Total RNA was extracted from the shoot apex tissues of 30-day-old control and HT-stressed (40 °C for 12 h in a plant growth chamber) flax seedlings using TRIzol reagent (Invitrogen, Thermo Fisher Scientific, Inc., USA) according to the manufacturer’s instructions. Approximately 5 μg/mL of DNaseI-treated total RNA was reverse transcribed using the SuperScript III First Strand cDNA synthesis system (Invitrogen Inc., USA) to generate cDNA according to the manufacturer’s protocol. Gene-specific (LusHSF) and reference gene (eukaryotic translation initiation factor, ETIF3E) primers61 were designed using the QuantPrime62 tool and synthesized at Eurofins Genomics India Private Limited, India (Supplementary Table S4). The RT-qPCR analysis was performed using PowerUp™ SYBR Green Master Mix (Applied Biosystems, Inc., USA) on a CFX Connect Real-Time PCR Detection System (Bio-Rad, Inc., USA). Each qPCR reaction (20 μL) consisted of 10 μL SYBR Green mix, 4 μL cDNA template (120 ng), and 1.0 μL of a 10 μM solution of each of the forward and reverse primers. The PCR cycling programme consisted of 50 °C for 2 min and 95 °C for 5 min, followed by 40 cycles of 94 °C for 10 s, 55 °C for 20 s, and 68 °C for 30 s. To analyse the specificity of the amplicons, a melting curve analysis was performed at 95 °C for 30 s and 65 °C for 30 s, followed by ramping up to 95 °C with a 0.5 °C increment per cycle. For each sample, three technical replicates were run to minimize PCR artefacts.
The relative expression of each selected gene was calculated from the averaged cycle threshold (Ct) values, normalized against the reference gene and finally calibrated against the control RNA sample from the Viking cultivar. The relative quantification values (2^−ΔΔCt)63 were plotted as fold changes in gene expression for all the samples using the trial version of the GraphPad Prism software (https://www.graphpad.com/scientific-software/prism/). One-way ANOVA followed by Bonferroni’s multiple comparisons test was employed to test statistical significance at p = 0.05.
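
The relative quantification step can be sketched in Python as follows. This is a minimal illustration of the 2^−ΔΔCt calculation as described above, with technical replicates averaged before normalization; the function name, argument names, and all Ct values are invented for the example and are not data from this study.

```python
from statistics import mean
from typing import Sequence


def fold_change(target_ct: Sequence[float],
                reference_ct: Sequence[float],
                calibrator_target_ct: Sequence[float],
                calibrator_reference_ct: Sequence[float]) -> float:
    """Return the 2^-ddCt fold change of a sample relative to a calibrator.

    dCt  = mean Ct(target gene) - mean Ct(reference gene), per sample
    ddCt = dCt(sample) - dCt(calibrator)
    """
    delta_ct_sample = mean(target_ct) - mean(reference_ct)
    delta_ct_calibrator = mean(calibrator_target_ct) - mean(calibrator_reference_ct)
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Invented Ct values for three technical replicates per reaction.
    stressed = fold_change(
        target_ct=[24.1, 23.9, 24.0],             # target gene, treated sample
        reference_ct=[20.0, 20.1, 19.9],          # reference gene, treated sample
        calibrator_target_ct=[26.0, 26.1, 25.9],  # target gene, calibrator
        calibrator_reference_ct=[20.0, 20.0, 20.0],
    )
    print(round(stressed, 2))
```

By construction, the calibrator sample itself always has a fold change of 1, so every other sample is read as expression relative to that control.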

Predictions of guide RNA sequences for gene editing and off-target effects

The web server CRISPOR v4.424 (http://crispor.tefor.net/) was employed to predict efficient gRNA sequences for CRISPR/Cas9-based gene editing experiments in LusHSFs. The genomic DNA sequences of LusHSFs were used as input sequences for the identification of unique 20 bp target gRNA sequences against the L. usitatissimum genome of Phytozome v.9. LusHSF genes longer than 2000 bp were scanned only up to 2000 bp from the 5′ translational start site, as per the requirement of the tool. To facilitate the use of the popular CRISPR/Cas9 system employing the Streptococcus pyogenes Cas9 nuclease, the corresponding NGG trinucleotide was selected as the protospacer adjacent motif (PAM). The gRNA sequences for the respective LusHSFs were chosen based on the highest specificity score and the fewest probable off-target cleavage sites, especially within the 12 bp region adjacent to the PAM, known as the ‘seed region’. Up to four nucleotide mismatches were allowed when predicting potential off-target effects. Oligonucleotides for lentiviral saturation mutagenesis screening were also identified along with specific barcodes. In a saturating mutagenesis experiment, a target region of the genome is altered with many guides to create as many DNA edits as possible, followed by mutant phenotyping. The corresponding Illumina sequencing primers were also designed for each LusHSF with the Illumina adapters TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG and GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG to validate the gene sequence modifications.
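
To make the target-site definition concrete, the following Python sketch enumerates 20 bp protospacers that are immediately followed by an NGG PAM on either strand and extracts the 12 bp PAM-proximal seed region. It is a simplified illustration only: it does not reproduce CRISPOR's specificity scoring or genome-wide off-target search, and the function names and example sequence are hypothetical.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_guides(sequence: str, guide_len: int = 20) -> List[Tuple[str, int, str, str]]:
    """List (strand, start, protospacer, PAM) for every site followed by NGG.

    `start` is 0-based and, for the '-' strand, refers to the
    reverse-complemented sequence. Windows containing characters other
    than A, C, G or T are skipped.
    """
    guides = []
    seq = sequence.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - guide_len - 3 + 1):
            window = s[i:i + guide_len + 3]
            pam = window[guide_len:]
            if pam[1:] == "GG" and set(window) <= set("ACGT"):
                guides.append((strand, i, window[:guide_len], pam))
    return guides


def seed_region(protospacer: str, seed_len: int = 12) -> str:
    """Return the PAM-proximal bases (the 3' end of the protospacer)."""
    return protospacer[-seed_len:]


if __name__ == "__main__":
    # Invented sequence, used only to show the call.
    example = "ATGGCTAGCTAGGATCCGATCGATTGGCATCGATCGGTACCAGG"
    for strand, start, guide, pam in find_guides(example):
        print(strand, start, guide, pam, seed_region(guide))
```

Candidate guides found in this way would still need to be scored against the whole genome, for example with CRISPOR, before any of them is selected.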