Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# Expression of a human cDNA in moss results in spliced mRNAs and fragmentary protein isoforms

## Abstract

Production of biopharmaceuticals relies on the expression of mammalian cDNAs in host organisms. Here we show that the expression of a human cDNA in the moss Physcomitrium patens generates the expected full-length and four additional transcripts due to unexpected splicing. This mRNA splicing results in non-functional protein isoforms, cellular misallocation of the proteins and low product yields. We integrated these results together with the results of our analysis of all 32,926 protein-encoding Physcomitrella genes and their 87,533 annotated transcripts in a web application, physCO, for automatized optimization. A thus optimized cDNA results in about twelve times more protein, which correctly localizes to the ER. An analysis of codon preferences of different production hosts suggests that similar effects occur also in non-plant hosts. We anticipate that the use of our methodology will prevent so far undetected mRNA heterosplicing resulting in maximized functional protein amounts for basic biology and biotechnology.

## Introduction

The majority of eukaryotic genes comprise multiple exons interspersed by noncoding introns, the number and length of which vary between species. The human genome size (GRCh38.p13, Genome Reference Consortium Human Build 38 patch release 13) is 3,099,706,400 bp compromising 20,444 protein-encoding genes, with 8.8 exons and 7.8 introns per gene on average1. Generally, human genes have many short exons (average 50 codons) separated by long introns (up to >10 kbp)2. It is widely accepted that such an exon–intron gene structure emerged during evolution and lead to complex gene regulation and diversification of multicellular eukaryotes2,3. Introns that may contain regulatory elements affecting gene expression are removed from pre-mRNAs by RNA splicing. Subsequently, exons are ligated together to generate functional mature mRNAs4. Over 30 years of research on pre-mRNA splicing elucidated an underlying complex regulation, which includes exon versus intron recognition5, co-transcriptional splicing6, alternative splicing (AS)7, exonic and intronic splicing enhancers and repressors8,9, and mRNA export from the nucleus10.

AS is a fundamental regulatory process that contributes to proteome expansion11. AS can produce multiple mRNAs from the same gene through the preference of splice sites during pre-mRNA splicing. It is modulated based on sequence motifs in the pre-mRNA, the interactions between RNA-binding proteins and splice sites, different cell types, and environmental signals12. Splicing is carried out by a large ribonucleoprotein (RNP) complex, the spliceosome, which brings selected exons together by two transesterification reactions13. During splicing, uridine (U)-rich small nuclear RNPs (snRNPs) together with non-snRNP splicing factors, and serine/arginine-rich (SR) proteins participate in the recognition of 5′- and 3′-splice sites as well as the branch site14. The core of the spliceosome is highly conserved in all well-characterized eukaryotes3,15. The key requirement for splicing is the recognition of a donor splice site by the U1 snRNP16. On the other hand, the acceptor splice site, downstream of the polypyrimidine tract, is recognized by the U5 snRNP17. Another important splicing element containing the reactive adenosine, the branch point signal, is located upstream of the acceptor splice signal and is important for the formation of the lariat-like intermediate structure17. While the majority of the introns have “GT” at the donor splice site and “AG” at the acceptor splice site, in rare cases introns can have unusual dinucleotides: for instance “AT” at the donor site and “AC” at the acceptor site18. This class of introns is spliced out of the pre-mRNA by the minor spliceosome U1219.

Besides its role in proteome expansion, AS controls transcript levels by generating unstable mRNA isoforms that may activate nonsense-mediated mRNA decay (NMD)20. Generation of alternative functional mRNAs due to the introduction of premature termination codons (PTC), intron retention, and alternative use of the 5′- and 3′-splice sites causes changes in the localization, stability, and/or function of a protein. While in humans more than 95% of genes encode alternatively spliced mRNAs21, in an early study 6556 out of 39,106 predicted genes showed AS in the plant Physcomitrium patens (moss Physcomitrella) with a strong bias toward intron retention22. Many PTC-containing alternatively spliced transcripts are NMD targets in Physcomitrella23. Furthermore, about half of the AS events altered CDSs and resulted in 2380 distinct proteins. In these studies, analysis of AS relied on expressed sequence tags (ESTs) and the alignment of ESTs and cDNA sequences to the genome. The percentage of AS events in P. patens is estimated as 58%, but large-scale and/or tissue-dependent profiling of AS in moss by the use of improved sequencing methods may result in a higher percentage24.

Two models of splice-site recognition exist25: In the intron-definition model, U1 binds to the upstream donor splice site and U2AF/U2 to the downstream acceptor splice site and branch site of the same intron26,27. The length of introns, which probably remain short under evolutionary selection3, limits the efficiency of splicing. In lower eukaryotes where introns are shorter, this mechanism might be dominant15. In the exon-definition model, the splicing machinery recognizes splice sites flanking the same exon28,29. This is the major mechanism in higher eukaryotes, where introns are generally longer. Due to these differences, the correct splicing of transgene transcripts has to be considered to maximize the production of a functional protein. For instance, it was necessary to remove a cryptic intron in the coding sequence (CDS) of green fluorescent protein (GFP) from the jellyfish Aequorea victoria to function in the flowering plant Arabidopsis thaliana30. Similarly, after several attempts to express Bacillus thuringiensis toxin genes in plants31, it was found that the expression of cry genes in tobacco protoplasts resulted in cry mRNAs, which were unstable and/or not efficiently made32.

The use of plants as alternative pharmaceutical protein production hosts has been on the rise for two decades, and some plant-made products have reached the market33. However, the percentage of plant-based biopharmaceuticals in the pharmaceutical market is still minute, mainly because of lower yields compared to prokaryotic and mammalian systems. Nevertheless, they offer alternative solutions in niche areas. Production of blood-clotting factors could be such a niche33. The human blood-clotting factor IX (FIX), a member of the intrinsic pathway of coagulation, is required for normal hemostasis and its absence or abnormal levels causes hemophilia B34. Current treatment is restricted to protein replacement therapy and factor concentrates costs are between $100,000 and$200,000 per patient and year35. Thus, there is a need for an economic production of FIX, which makes us attempting to produce it in the moss system. The moss P. patens, a model organism for evolutionary and functional genomics approaches, is an established host for the production of complex recombinant biopharmaceuticals and has proven its suitability for large-scale production36,37. The first moss-made drug candidate, moss-aGal for enzyme replacement therapy of Fabry disease, has successfully completed phase I clinical trials38. In addition, a variety of other potential biopharmaceuticals is being produced in moss39,40,41.

Here, we report the expression of a human blood-clotting FIX-encoding cDNA in P. patens. We show that specific sequence motifs in the FIX mRNA were recognized by the splicing machinery as donor and acceptor sites and produced several different transcripts, which resulted in different protein isoforms. We named this new phenomenon, to the best of our knowledge, heterosplicing, and were able to prevent it by changing the splice sites and optimizing codons without affecting the FIX aa sequence (optiFIX). Expression of optiFIX resulted in full-length transcripts only, and a 12-fold increase of protein accumulation. In addition, we found heterosplicing also in transgenic moss lines expressing the CDS of human factor H (FH) CDS. By optimizing the sequence, we generated optiFH, which showed no heterosplicing and yielded significantly enhanced FH protein abundances in vivo. We suggest that our methodology of CDS optimization will provide a better quantification of transcript levels, and precise detection, purification, and quantification of functional proteins. It may reduce the bioenergetic costs of heterologous cDNA expression by preventing the formation of non-functional products and maximizing functional recombinant protein production, major bottlenecks in plant-based systems.

## Results

### Predicted microRNA activity is inhibited by mutating microRNA-binding sites

MicroRNAs can regulate gene expression at the posttranscriptional level, either via target mRNA degradation or translational inhibition. The possible actions of P. patens microRNAs on the RNA sequence for human FIX (NCBI reference NM_000133) were investigated using psRNATarget (http://plantgrn.noble.org/v1_psRNATarget/). Thus, two microRNAs were identified (miR1028c-3p and miR1044-3p), that might interfere with FIX expression in this moss. The respective binding sites on the FIX RNA sequences were modified in the DNA construct without a change in the FIX aa sequence. The potential binding site of ppt-miR1028c-3p, which might inhibit translation, was changed into ATCATGTGAGCCAGCTGTACCC between positions 495 and 516. The binding site of ppt-miR1044-3p, which might lead to cleavage of the FIX RNA, was mutated into AGGGTAAGTACGGCATCTAT between positions 1304 and 1323. The sequence of microRNAs and their binding sites as well as the maximum energy to unpair the target site (UPE) and expectation value are shown in Supplementary Table 1.

### Splicing of FIX mRNA occurs in transgenic plants and in transiently transfected cells

The native FIX signal peptide was replaced by the moss aspartic proteinase signal peptide of PpAP1 to target the protein to the secretory pathway for its proper posttranslational modifications42. After transformation of the parental moss line Δxt/ft41 with the human FIX-encoding, cDNA-based expression construct that comprises the complete FIX CDS of 1401 bp without any introns; 49 transgenic moss lines were regenerated and screened. The presence and integration of the full-length FIX CDS in the moss genome was confirmed via PCR (Supplementary Fig. 1). In order to validate the completeness of the FIX transcripts, RT-PCR was performed with primers spanning the complete CDS. In addition to the full-length transcript, several smaller products occurred (Fig. 1). In cDNA from the juvenile moss protonemal tissue, PCR products of discrete sizes were detected ranging from ~700 to 1370 bp (Fig. 1a). Sequencing of the PCR products revealed that five of the observed bands were FIX-specific products with a length of 1371, 1240, 895, 769, and 691 bp, respectively. Apparently, unexpected splicing of the FIX mRNA produced the smaller aberrant transcripts. To analyze this phenomenon across different tissues, we repeated the same procedure with RNA isolated from adult plants (gametophores). FIX transcripts detected in gametophores were consistent with those from protonema (Fig. 1b).

The sequencing of different PCR products revealed that specific motifs along the human FIX mRNA were recognized as splice signals in moss. The spliced transcripts were identical in all 49 transgenic lines and in 2 different tissues. Hence, we conclude that different transcripts were exclusively derived from unexpected splicing of the mRNA transcribed from the complete FIX CDS, and not by a partial integration of the FIX expression construct into the moss genome. PCRs from genomic DNA support this conclusion (Supplementary Fig. 1).

In order to scrutinize the presence of spliced FIX CDS variants in transiently transfected cells (protoplasts), total RNA was extracted from cells 14 days after transformation. After RT-PCR, FIX CDS variants with varying lengths occurred. Like in the RT-PCR from transgenic lines, 5 FIX CDS variants (1 full-length and 4 aberrant spliced FIX transcripts) were confirmed in these protoplasts (Fig. 1c). Unedited gel images showing the full lanes can be found in Supplementary Fig. 2. Thus, we found an unexpected splicing of an mRNA transcribed from a human full-length cDNA in moss, a phenomenon we henceforth name heterosplicing.

### Spliced FIX transcripts and splice-site analysis

Splice junctions within the FIX mRNA were revealed by sequencing. The splicing of the primary transcripts from human FIX cDNA in moss caused the production of four alternative products (Fig. 2a): i. deletion of a 44 aa long domain in the part coding for the heavy chain (Fig. 2a.II), ii. large deletion in the part coding for the light chain (115 aa) together with the absence observed in the heavy chain (Fig. 2a.III), iii. nearly complete absence of the sequence coding for the light chain and parts of the activation peptide domain (157 aa) and the heavy chain (Fig. 2a.IV), iv. nearly complete lack of the sequence encoding the light chain, complete absence of the activation peptide domain (186 aa) and the sequence representing 41 aa of the heavy chain (Fig. 2a.V). The heavy chain harbors a serine protease domain, which is responsible for the activation of Factor X in the presence of Factor VIII, calcium, and phospholipid surfaces following removal of the activation peptide. Any deletion in this domain will interfere with the activity of the FIX protein. The light chain consists of the Gla domain followed by two EGF domains. Any loss in these domains will adversely affect the properties of FIX. Therefore, any alterations caused by the heterosplicing will result in the loss of FIX function.

The GT-AG splicing rule applies for almost all eukaryotic genes43 and especially P. patens accepts control elements originally optimized for mammalian expression systems without a need for adapting these elements44. Here, we inferred four different motifs for acceptor and donor sites based on sequencing data. All FIX heterosplice sites followed the GT-AG rule. Moreover, the data revealed that a CAGGT motif is present in the exon-intron junction, i.e. the donor site (Fig. 3a). In the intron-exon junction, the acceptor site, the motif CAG is conserved in 3 out of 4 cases, but the adjacent two nucleotides do not follow any trend.

To analyze whether donor and acceptor sites and their neighboring nucleotides are conserved in P. patens, the genomic vicinity of all splice sites of all 87,533 annotated transcripts corresponding to the 32,926 protein-encoding genes of the current P. patens genome release v3.345 were retrieved (Fig. 3b, c). In alignment with our experimental findings, an in silico analysis of these transcripts showed that the CAG|GT motif is the most abundant one in both donor and acceptor sites with a fraction of 23% and 18%, respectively. The full list of splice sites is compiled in Supplementary Tables 2 and 3.

Subsequently, we analyzed the FIX CDS with a moss splice-site prediction tool46, but it largely failed to predict the experimentally verified splicing motifs. This might be due to the fact that this tool has been developed several years ago using only 368 donor and acceptor sites46 and has not been updated since then.

### Detection of truncated mossFIX protein variants and validation by mass spectrometry

The analyses at RNA level suggested that P. patens would produce a full-length FIX protein and four aberrant fragmentary FIX isoforms. To validate this inference, the culture supernatants of transiently transfected cells were precipitated and analyzed with a polyclonal anti-FIX antibody via immunoblot after reduction and alkylation (Fig. 4a). Full-length FIX was expected at a molecular mass of around 52 kDa and spliced FIX variants were calculated based on the heterospliced transcripts with molecular masses of 47, 34, 29, and 27 kDa, respectively. Observed bands on the immunoblot could be attributed to these isoforms together with a band at around 40 kDa (potentially a degradation product) and a band at around 70 kDa (potentially dimers of fragmentary protein isoforms). In addition, we used the other half of the supernatants for mass-spectrometric (MS) determination of FIX-variant peptide sequences. For the MS-database search, protein models for FIX variants derived from the sequenced splice variants were employed.

We detected the peptide VVGGEDAKPGEHNIEETEHTEQKR (Supplementary Fig. 3) in MS analysis after trypsin digestion, which resulted from splicing in the CDS for the heavy chain that led to the loss of 44 aa between the peptides VVGGEDAKP and GEHNIEETEHTEQKR (Fig. 4b; variants II, III, and IV in Fig. 2a). This result confirmed the presence of a predicted FIX variant caused by heterosplicing on an additional level. Unedited blot showing the full lanes can be found in Supplementary Fig. 4.

### Prevention of heterosplicing by CDS optimization

To analyze the effectiveness of mutating donor and acceptor sites on heterosplicing of the FIX mRNA, we created two different FIX CDS versions. In the first one, aspFIX, the codons neighboring the detected splice junctions were modified. P. patens was transiently transfected with the resulting aspFIX plasmid. On day 14, the cells were collected and RNA was isolated. RT-PCR and sequencing of the PCR products revealed that there were only two bands representing FIX variants: the full-length FIX CDS and a shorter variant (~700 bp) (Fig. 5a, Unedited gel image in Supplementary Fig. 2). The donor and acceptor sites of the shorter variant were identified (Fig. 5b) and this shorter variant was lacking the nucleotides encoding the light chain, AP, and parts of the heavy chain (Fig. 5c). Moreover, splicing would cause a frameshift mutation resulting in early stop codons. Although such nonsense mRNAs may be degraded via NMD, not all nonsense mRNAs undergo NMD, which was reported e.g. for P. patens23 and A. thaliana47,48. Despite the fact that splicing and translation take place in different cellular compartments we next performed a complete codon optimization of the FIX CDS in the aspFIX construct according to previous studies49,50 resulting in the optiFIX sequence with an overall increased GC content. Although translatability of a particular RNA via codon usage is not expected to regulate whether it can be spliced or not, we aimed to change the recognition of parts of human CDS regions as intronic regions by the moss spliceosome machinery. Therefore, we aimed at enhancing the GC content due to considerable differences between plant and animal exon/intron definitions. This was used to transiently transfect moss cells. Subsequent RT-PCR analysis and sequencing revealed that the full-length CDS of FIX was the only specific band; formation of any heterospliced FIX variant was not detected (Fig. 5d, Unedited gel image in Supplementary Fig. 2). The presence of exclusively the full-length CDS was confirmed on protein level via immunoblot analysis (Fig. 5e). An unedited blot showing the full lanes can be found in Supplementary Fig. 4.

Next, we compared the codon usage bias in P. patens to biases of other pharmaceutical production platforms. The codon usage frequencies of under- and overrepresented codons for Cys, Glu, Phe, His, Lys, Asn, Gln, and Tyr in P. patens were compared with Spodoptera frugiperda (insect cells), Homo sapiens (HEK cells), Cricetulus griseus (CHO cells), Nicotiana tabacum (tobacco) and Oryza sativa (rice). For instance, the overrepresented codon for Cys, TGC, in P. patens (9.8 in thousands) is also an overrepresented codon in S. frugiperda, H. sapiens, C. griseus, and O. sativa. On the other hand, this codon is underrepresented in N. tabacum and TGT is preferred over TGC for Cys. The synonymous codon usage bias in P. patens is similar to codon usage biases in S. frugiperda, H. sapiens, C. griseus (CHO cells), and O. sativa (rice), but different from N. tabacum (tobacco) (Table 1).

### Comparison of protein levels of FIX and optiFIX in vivo

In order to analyze whether the protein amounts of FIX and optiFIX show any difference in live moss cells, we cloned two fusion constructs, FIX-Citrine, and optiFIX-Citrine. Subsequently, we transiently transfected moss protoplasts with these constructs, and performed confocal laser scanning microscopy (CLSM) on day 3 after transformation. A visual analysis of the images revealed that the FIX-Citrine signal shows a diffuse localization pattern, which might be due to intracellular mislocalization of aberrant protein isoforms (Fig. 6a). In contrast, optiFIX-Citrine exhibited a complex network-like organization, indicating a proper localization of the protein in the ER (Fig. 6b). Furthermore, the optiFIX-Citrine signal intensity was higher than that of FIX-Citrine. Comparison of mean voxel intensities calculated for FIX-Citrine and for optiFIX-Citrine (3.88 and 45.22, respectively) shows a nearly 12-fold higher fluorescence intensity after the CDS optimization, which had prevented undesired heterosplicing (Fig. 6c). The source images (Z-stacks) for Fig. 6c can be found in Supplementary Data 1 and the results of the quantification is available in Supplementary Table 4.

### Automatic CDS optimization of transgenes

In order to automatize the process of P. patens-specific CDS optimization, we compiled our findings into a JavaScript-based web application: physCO, and made it available at www.plant-biotech.uni-freiburg.de. physCO takes as input CDSs in FASTA format, inspects them codon-wise, substitutes codons where possible according to our findings on codon usage (Table 2), and finally outputs them in FASTA format. In addition, physCO checks the input for consistent lengths, valid start, and stop codons, and internal, premature stop codons and raises warnings in case it detects irregularities.

### Heterosplicing of FH mRNA in transgenic plants

In order to test whether heterosplicing is a common phenomenon in P. patens upon transgene expression, we analyzed already existing recombinant human complement FH-producing moss lines40. Our RT-PCR analysis revealed that expression of the human FH CDS in moss generated 2 transcript variants, the full-length FH transcript of 3639 bp and a 1215 bp FH variant, which could lead to a truncated FH isoform with a molecular mass of 45.6 kDa (Supplementary Data 2). Therefore, we generated with the help of physCO an optimized version of hFH, and named it optiFH. These changes affected 386 of 1213 codons, and increased the GC content from 33.9% to 49.8% (Supplementary Data 3). Amino acid sequences of FH and FH isoform and nucleic acid sequences of FH and optiFH can be found in Supplementary Data 4 and 5, respectively. To investigate the effects of these alterations in live moss cells, we cloned two fusion constructs, FH-Citrine and optiFH-Citrine. Subsequently, we transiently transfected moss protoplasts with these constructs, and performed CLSM on day 3 after transformation. A visual analysis of the images acquired for FH and optiFH revealed that while an FH-Citrine signal was not detectable, optiFH-Citrine signal intensity was high and filled the secretory pathway (Supplementary Fig. 5; Supplementary Data 6). Analysis on FH showed that heterosplicing is not an effect specific for FIX, but a general phenomenon in moss. Therefore, transgene sequences have to be modified using physCO prior to transformation for proper transcription and translation.

## Discussion

Heterologous gene expression is of fundamental importance for basic biology and for industrial production. In order to optimize transgene transcription and translation as well as transcript stability, many factors have to be taken into account. Engineering of upstream and downstream regulatory sequences, replacing original signal peptides with host-suitable ones and optimizing codons according to host codon-usage pattern can be used to improve protein production. In addition, microRNAs can affect protein production, either by microRNA-directed cleavage of target mRNAs or by translational inhibition51,52. To eliminate such a potential adverse effect of Physcomitrella microRNAs on the expression of the human gene, predicted microRNA-binding sites in the FIX CDS were mutated without altering the aa sequence.

Splicing/AS is another important factor that alters expression rates by the generation of various transcripts, which create different protein isoforms. Generation of these protein isoforms can affect fundamental characteristics of the protein such as function, activity, intracellular localization, interaction with other molecules, and stability. Moreover, splicing can create a transcript with an early stop codon, which may be degraded by NMD. Thus, it can be hypothesized that modulation of splicing, a neglected point in transgene expression, can improve protein production. For instance, the expression of GFP from the jellyfish Aequorea victoria in A. thaliana failed due to aberrant mRNA processing30. In another example, the expression of bacterial cry IA(b) in tobacco resulted in unstable and/or not efficiently made transcripts31. To avoid AS, generally cDNAs derived from already spliced mRNAs are introduced into the genome of the production host.

We introduced the human FIX cDNA under the control of the PpActin5 promoter and CaMV 35S terminator into the Physcomitrella genome. RNA was isolated from protonemata of transgenic lines and RT-PCR analysis revealed unexpectedly that a fraction of human FIX mRNA was spliced into four variants in addition to the unspliced full-length transcript, resulting in five mRNAs with different lengths derived from one cDNA. According to RT-PCR analyses, this phenomenon occurred not only in juvenile protonema, but also in adult moss plants. We sequenced these mRNAs and surprisingly found that reading frames had not changed in the four smaller variants in comparison to the mature full-length transcript. This may have happened by a pure coincidence. Alternatively, we may not have been able to detect such putative minor transcripts because they were either too low abundant for RT-PCR, or they have been eliminated by NMD.

The outcome of this heterosplicing was detectable on four levels, by RT-PCR from mRNAs, immunoblot, mass spectrometry of the protein isoforms, and in vivo by confocal imaging showing deviant localization of Citrine reporter fusions. These findings were also reflected in RT-PCR analysis from transiently transfected moss cells. Our observations revealed that heterosplicing of the human FIX mRNA in moss is consistent across different tissues and not dependent on developmental stages. Furthermore, the consistency of heterosplicing in episomally (transiently) FIX-expressing cells indicates that heterosplicing of the FIX mRNA in transgenic lines was neither due to illegitimate or partial integration of the construct into the moss genome nor a locus-specific effect. Moreover, results of PCRs from genomic DNA of transgenic lines supported this conclusion. Unlike previously reported cases, in which expression of full-length GFP transcript and fluorescence were not detectable30 or expression of cry mRNAs led to generation of unstable and/or not efficiently made transcripts32, we report here to the best of our knowledge for the first time heterosplicing of a human cDNA in a plant. It caused the generation of various stable FIX mRNA variants and protein isoforms in addition to the complete transcript and protein. Therefore, a possible occurrence of heterosplicing should be analyzed on RNA level, even if a full-length protein can be detected in heterologous cells.

All splice junctions in the FIX CDS were determined by sequencing. In addition, we identified consensus motifs of donor and acceptor splice sites in the latest P. patens genome version45 using all 87,533 annotated transcripts that originate from all 32,926 protein-coding genes, and compared them with our experimentally verified motifs. This revealed that P. patens, like most eukaryotes, follows the GT-AG splicing rule. The same was true for the cryptic additional introns recognized in the FIX mRNA; obviously, the P. patens spliceosome interprets parts of the FIX CDS as intronic sequences. Presumably, moss cells recognize motifs in the human FIX CDS as cis exonic and intronic splicing enhancers. Previous work by Marquez et al. revealed some of the exitrons, exonic introns, that have ancestral exonic origin and follow GT/CAG boundaries like some of our heterosplice sites but little is known about exonic and intronic regulatory sequences in plants as well as driving force for intronization and extronization53,54. Moreover, plant introns are different from animal introns in terms of UA- or U-richness, which is crucial for splicing efficiency55. Due to these characteristic differences, it is likely that the GC content can affect the characteristics of splice site recognition15. The GC content is correlated with many features including gene density56, intron length57, meiotic recombination58, and gene expression59. It varies among species, and even along chromosomes60. It was shown before that the average GC content in P. patens CDSs is 50%46. The GC content of the human FIX CDS recognized as cryptic additional introns ranged from 38% to 40%. This remained the same after exclusively mutating the splice sites’ environment in aspFIX; even though mutations in the splicing motifs decreased the number of transcripts from five to two, the GC content of the cryptic alternative intron was still only 40%. The shorter transcript was not in frame, but still detectable by RT-PCR. Therefore, we created the fully CDS-optimized plasmid optiFIX to check the effect of overall codon optimization on pre-mRNA splicing. A closer comparison of the aspFIX and optiFIX constructs revealed that there was no change in the donor and acceptor sites of the newly emerged cryptic intron. There was, however, an increase of the GC content within the cryptic intron sequence. The GC contents of the sequences previously defined as introns in the FIX and aspFIX constructs increased to 46–50% in optiFIX. Hence, although splicing and translation occurs in different cellular compartments, we conclude that the increase of GC content introduced by the overall codon optimization together with mutating splice sites’ environment prevented the heterosplicing, the mis-interpretation of any sequence as intron by the moss spliceosome. As a result, a nearly 12-fold increase in protein amounts was achieved in transiently transfected moss protoplasts in vivo. As heterosplicing did not cause detectable frameshift mutations, fluorescence signals obtained from the FIX-Citrine construct were the sum of all FIX protein isoforms, full-length and fragmentary, fused to Citrine. Based on this inference and proteins, which were harvested from transient transformation with the FIX construct and detected on the immunoblot (Fig. 4), we conclude that the prevention of heterosplicing from the optiFIX construct has improved full-length protein amounts in vivo even more than 12-fold.

Two mechanisms, namely boosted translation rates and the prevention of mRNA degradation via NMD, may contribute to this result. In addition, the protein from the optiFIX construct was solely detectable in the ER, the compartment of choice for correct posttranslational modifications and subsequent secretion of the protein. A similar phenomenon was observed in the expression of another transgene in Physcomitrella. One example of heterosplicing and its prevention by our methodology is the expression of human complement FH. FH is an important regulator of the alternative pathway in the human complement system. Currently, a recombinant FH is not available on the market, although it has potential as a biopharmaceutical in the treatment of severe human diseases like atypical hemolytic uremic syndrome (aHUS), age-related macular degeneration (AMD) or C3 glomerulopathies. The recombinant production of FH, devoid of plant-specific N-linked sugar residues, in moss resulted in a range of promising biological activities40. Moreover, moss-derived FH is currently being examined for use in Covid-19 treatment (www.elevabiologics.com). RT-PCR analysis in FH-producing stable moss lines revealed that the expression of human FH CDS generated 2 transcript variants, the full-length FH transcript of 3639 bp and a 1215 bp FH variant that could lead to the production of an FH isoform with a calculated molecular mass of 45.6 kDa (Supplementary Data 3). We optimized the FH CDS using our bioinformatics tool physCO, described here, to analyze whether the protein amounts of FH and optiFH show any difference in live moss cells. A visual analysis of the images acquired for FH and optiFH from transiently transfected moss cells revealed that while FH-Citrine signal intensity is not detectable, optiFH-Citrine signal intensity is high (Supplementary Fig. 5). Thus, an optimization of the currently used CDS of human FH in plants is advantageous for plant-based production.

These findings suggest that the splicing of heterologous mRNAs should be taken into account on a routine basis. In addition, our analysis of synonymous codon usage bias suggests that our methodology to prevent heterosplicing can be directly implemented in organisms, which follow similar codon usage patterns as P. patens, such as O. sativa, insect or mammalian cells. Our methodology might still be vital in organisms that have different codon usage patterns, such as the species of the genus Nicotiana, e.g., N. tabacum and N. benthamiana. Mutating donor and acceptor sites and their neighboring nucleotides together with replacing codons ending with A or T to G or C may prevent the generation of heterosplice variants.

Why should splice site mutation and codon optimization be employed at the same time rather than codon optimization alone? Although differences between GC-rich and GC-poor genes were not reported in moss, analyses of various plant species can explain why codon optimization might not be sufficient. The GC content of GC-poor CDS in A. thaliana, soybean, pea, tobacco, tomato, and potato ranges from 40–46% while GC content of introns is 26–31%61. A similar trend is visible in GC-rich genes: 46–49% in CDS and 29–31% in introns. On the other hand, in maize, the GC content is higher: 56% in GC-poor CDS and 67% in GC-rich CDS. The GC content of introns in maize is also elevated: 40% in GC-poor genes and 48% in GC-rich genes61. This shows that the intron/exon definition is not solely dependent on the GC content. Hence, increasing the GC content of cryptic introns to a certain level without mutating splice site motifs might not be sufficient to prevent aberrant mRNA processing.

In general, the analysis of spliced transcripts is beneficial for at least five reasons. (i) It facilitates the reliable quantification of gene expression by designing primers that differentiate between full-length transcripts and shorter variants. (ii) It enables the precise detection and quantification of heterologous proteins by antibodies that differentiate functional isoforms from non-functional ones. (iii) It helps to better purify proteins either by specific antibodies or by size exclusion chromatography. (iv) Optimized cDNAs may result in better detectable reporter proteins in vivo. (vi) Aberrant protein isoforms due to heterosplicing may show mislocalization in cells of the production host and thus may contribute to poorly modified glycoproteins. Prevention of heterosplice variants eventually will improve recombinant protein production and decrease downstream processing costs significantly. Hence, plant-based systems can become an alternative to traditional production platforms. Moreover, it may have implications for basic biology as well, because here the use of heterologous reporter constructs is of vital importance. For these purposes, we developed physCO, an automated P. patens-specific codon-optimization tool, which allows the optimization of transgene expression by the prevention of undesired mRNA splicing. Moreover, our analysis of codon usage patterns indicates that this tool can also be used for insect (S. frugiperda), HEK (H. sapiens), and CHO (C. griseus) cells as well as rice (O. sativa). Besides that, we cannot exclude the possibility, that the phenomenon of mRNA splicing, here described as heterosplicing, is a novel gene regulatory mechanism occurring in eukaryotes in general.

## Materials and methods

### Design of the expression vectors pFIX, aspFIX, and optiFIX

The coding DNA sequence for human FIX without the native signal peptide was synthesized as N-terminal translational fusion to the signal-peptide CDS of PpAP1 (Pp3c5_19520V3.1)42 by GeneArt (Thermo Fisher Scientific, Waltham, MA, USA). It was cloned into the pAct5-MFHR1 plasmid41 to generate the final FIX expression vector, pFIX. Its expression is driven by the promoter of the PpActin5 gene62 and the CaMV 35S terminator.

Nucleotides involved in splicing were replaced without a change of the FIX aa sequence in the aspFIX plasmid which was synthesized by GeneArt. For codon optimization, two sources, codon usage bias calculated by Hiss et al.49 and Kazusa codon usage database50 (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=145481), were used. We decided to exchange the eight underrepresented codons for Cys, Glu, Phe, His, Lys, Asn, Gln, and Tyr. Underrepresented codons were replaced with the overrepresented alternative codons (Table 2) via physCO. The optiFIX CDS was synthesized by GeneArt. See Supplementary Fig. 6 for a multiple sequence alignment63 of FIX, aspFIX, and optiFIX (FIX, aspFIX, and optiFIX sequences are available in Supplementary Data 7.).

The plasmids are available via the International Moss Stock Center IMSC (www.moss-stock-center.org/en/) under the accession numbers P1705, P1811, P1949, P1804, and P1812 for pFIX, pFIX_Citrine, aspFIX, optiFIX, and pOptiFIX_Citrine, respectively.

### Design of the expression vectors pFH and optiFH

The previously generated FH expression plasmid pFH40 contains the same promoter and terminator as pFIX. It was used in transient expression assays. The optiFH CDS was deduced via physCO and synthesized by GeneArt. It was cloned into the same vector as FH. See Supplementary Data 5 for a sequence alignment of FH and optiFH. The plasmids are available via the IMSC under the accession numbers P1837, and P1838, for pFH and optiFH, respectively.

### Transformation and screening of transgenic plants

The Physcomitrella Δxt/ft line, a double knockout line in which the α1,3 fucosyltransferase and the β1,2 xylosyltransferase genes had been abolished64 (IMSC accession number 40828), was cultured in liquid Knop medium (pH 4.5) one week prior to transformation as described previously65 to obtain cells (protoplasts). Protoplast transformation was performed as described66 using 50 µg of linearized pFIX DNA per transformation. In addition to the generation of transgenic lines, mossFIX was produced transiently by an upscaled transformation protocol based on the proportion of 0.7–1 µg DNA of pFIX for 1×105 cells. Stable transfectants were selected on solidified Knop medium containing 25 µg/mL hygromycin as described65. Cells transiently expressing mossFIX were grown in special regeneration medium (7.5 mM MgCl2×6H2O, 2.6 mM MES, 200 mM D(-)-Mannitol, 62.6 mM CaCl2×2H2O, 68.4 mM NaCl, 2.8 mM D(+)-Glucose monohydrate, 5 mM KCl; pH: 5.6; osmolarity: ~550–570 mOsm) and harvested for RNA and protein analyses 2 weeks after transformation. To analyze whether splicing of FIX was prevented on RNA level, WT cells were also transfected with the aspFIX and optiFIX constructs, respectively, and harvested for RNA isolation 2 weeks after transformation.

### Plant cell culture

After selection on solidified Knop medium, transgenic lines were cultured under standard conditions in liquid Knop medium67. They were subcultured periodically by disrupting the tissue with an Ultra-Turrax (IKA, Staufen, Germany) at a rotational speed of 16,000–18,000/min for 1 min and transfer to fresh medium.

### Molecular characterization of transgenic lines

Following the hygromycin selection process, stable lines were characterized by the presence of the FIX transcript. To validate the presence of the transgene in the moss genome, 80–100 mg (fresh weight) moss material from protonema cultures were disrupted with the use of a stainless steel ball in TissueLyser II (Qiagen, Hilden, Germany) for 1 min with an impulse frequency set to 30. Genomic DNA was extracted using GeneJET Genomic DNA Purification Kit (Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s instructions. The lines F18, F33, F61, and F130 were chosen for closer analysis and stored in the International Moss Stock Center (IMSC Nos. 40874, 40875, 40876, 40877). Transgene presence and completeness were verified with a standard PCR with the primers FIXfwdB (5′-GGAGTGTTCGATTGGCTTTCT-3′) and FIXrevB (5′-TGGTGGTGGTGGGTTAGTTT-3′) that amplifies almost the full-length FIX CDS. For transgene expression analysis via reverse transcription (RT)-PCR, 50–100 mg (fresh weight) moss material from protonema cultures were disrupted with the use of a stainless steel ball in TissueLyser II for 2 min with an impulse frequency set to 30. The material was resuspended in 1 mL TRIzol® Reagent (Thermo Fisher Scientific), and RNA was isolated according to the manufacturer’s instructions. For analysis of gametophores, 5–6 gametophores were collected and RNA isolation was performed as described above. Six micrograms of DNase-treated RNA were used for first-strand synthesis with Superscript Reverse Transcriptase III (Thermo Fisher Scientific). The quality of cDNA was controlled with a standard PCR with the primers c45for (5′-GGTTGGTCATGGGTTGCG-3′) and c45rev (5′-GAGGTCAACTGTCTCGCC-3′) corresponding to the gene coding for the ribosomal protein L21. Transgene expression was verified with the primers FIXfwdB and FIXrevB that amplifies almost the full-length FIX CDS. PCR reactions were performed with Phusion high-fidelity DNA polymerase (Thermo Fisher Scientific). The PCR products were examined by standard agarose gel electrophoresis with visualization of DNA by ethidium bromide fluorescence.

### Molecular characterization of transiently transfected cells

Following the transformation, cells were grown in regeneration medium for 2 weeks68. Transiently transfected cells (nearly 3.6 million cells for FIX and aspFIX; 7.2 million cells for optiFIX) were used for transgene expression analysis via RT-PCR. RNA was isolated using the TRIzol® Reagent and RNA was purified through spin columns of Direct-zol™ RNA MicroPrep kit (Zymo Research, Irvine, California, USA) according to the manufacturer’s instructions. After DNase-treatment and first-strand synthesis (for the cDNA synthesis: 0.4 µg RNA were used for FIX and aspFIX; 2 µg RNA were used for optiFIX), the quality of cDNA was controlled with primers c45for and c45rev. Transgene expression was verified again with primers FIXfwdB and FIXrevB.

### Analysis of FIX cDNA products

Amplified products were excised from the gel and purified using QIAEX II Gel Extraction Kit (Qiagen). Sequencing was done by Eurofins Genomics (GATC Sanger sequencing service). In the sequencing reactions, in addition to FIXfwdB and FIXrevB, FIX6deepseq1F (5′-ATGGGGGCATCGAGGAGTGTTCGAT-3′), FIX6deepseq1R (5′-TTCCCCAGCCACTTACATAGCCAGA-3′), and FIXseq6 (5′-CACCAACAACCCGAGTGAAG-3′) were used. The remaining purified products were separately cloned into the pJET vector using the CloneJET PCR Cloning Kit (Thermo Fisher Scientific) according to the manufacturer’s instructions. Clones with aberrant insert sizes were selected for plasmid isolation with the GeneJet Plasmid Miniprep kit (Thermo Fisher Scientific). Sequencing was done as described above.

### Splicing motif search

The most recent release of the Physcomitrella v3.3 genome annotation45, which is available from https://phytozome.jgi.doe.gov/, was used. The splice sites of all 87,533 annotated transcripts from all 32,926 protein-coding genes were used as anchors to extract surrounding genomic sequences with a fixed window size of ±9 bp. Consensus sequence logos of all extracted donor and acceptor sites were created using R and the package ggseqlogo69,70.

### Codon usage bias among different species

Synonymous codon usage biases in P. patens, S. frugiperda, H. sapiens, C. griseus, O. sativa, and N. tabacum were obtained from the Codon Usage Database (https://www.kazusa.or.jp/codon/) and biased codons in moss were compared with the codon usage pattern of other species.

### Confocal microscopy imaging and image analysis

All images were taken with a Leica TCS SP8 microscope (Leica Microsystems, Wetzlar, Germany) using HCX PL APO 63x/1.40 oil objective with a zoom factor of 4.5. For the excitation of Citrine, an WL laser was applied at 2% with an excitation wavelength of 514 nm. The voxel sizes were 0.080 μm on the X–Y dimensions and 0.300 μm on the Z dimension. The pinhole was adjusted to 1 Airy Unit (95.5 μm).

The image processing consists of the following steps: (i) denoising, (ii) edge enhancement, (iii) Richardson–Lucy restoration, (iv) local intensity equalization, (v) segmentation. In the first step, a median filter with a window of (3,3,3) was applied to the images in order to remove the salt-and-pepper noise. In the second step, the denoised images were subjected to an unsharp-mask operation71,72, which involves blurring the image with a Gaussian filter, subtracting the blurred image from the input image, and adding the resulting difference back to the input image, a process that sharpens the edges in the images. The edge-enhanced images were then subjected to the Richardson–Lucy restoration algorithm, as implemented in the scikit-image package73 for three iterations, assuming an averaging filter in a window of (3,5,5) as point spread function (here the aim was to smooth the image without losing the thin structures, rather than a true deblurring). The code for the Richardson–Lucy algorithm was adapted from scikit-image package73 using Python programming language. Subsequently, a spatial intensity equalization step was implemented on the images using a home-built algorithm. In short, each image was divided into boxes with dimensions of (7,7,7) voxels. The voxel values within these boxes were rescaled by multiplying the values in each box with weights that are calculated based on the skewness value corresponding to the same box. This step corrected for the intensity gradients in the image in order to minimize the loss of some low-intensity foreground voxels. In the last step, the images were segmented by applying an adaptive thresholding method, where the thresholds were determined locally by applying Otsu’s method to local windows of (7,19,19) voxels throughout the image volume. The code for the adaptive Otsu thresholding algorithm was developed based on the Otsu threshold function of the scikit-image package73 and the Numba package74. To suppress oversegmentation of noisy background areas, a low global threshold was also applied by first calculating Otsu’s threshold for the whole image and multiplying it with 0.66. The adaptive thresholding operation yielded the binary masks that specified the foreground voxels, which in turn were used to calculate the mean voxel intensity for each original image. Three replicates were used for analysis of each of the FIX-Citrine and optiFIX-Citrine constructs. The same is true for FH-Citrine and optiFH-Citrine but the quantification was not possible due to undetectable/very low signals in FH-Citrine images. All six FIX images were processed and quantified in a single run of a Python script by using the exact same parameters to avoid any possible bias. Other packages used to develop the code are NumPy73,75, Pandas76 and SciPy77. The code used for image processing and quantification is deposited on from www.plant-biotech.uni-freiburg.de. The procedure is shown in Supplementary Fig. 7.

### Recombinant protein extraction and detection

Proteins from the culture supernatant were precipitated as described78. After air drying, the precipitate was resuspended in 50 mM HEPES, 2% SDS (pH 7.5) and disulfide bonds within proteins were reduced with 25 mM dithiothreitol (DTT), followed by alkylation of the thiol groups with 60 mM iodoacetamide. For Western blot analyses, 7.5% SDS polyacrylamide gel electrophoresis (SDS-PAGE Ready Gel® Tris-HCl Precast Gels, BioRad, CA, USA) was run at 100 V for 1 h 30 min and blotted to polyvinylidene fluoride (PVDF) membranes (Amersham Hybond P 0.45 µm pore size PVDF blotting membrane, GE Healthcare) in a Trans-Blot® SD Semi-Dry Transfer Cell (BioRad) for 2 h with 1.8 mA/cm2 membrane. The membrane was blocked for 1 h at room temperature in TBS containing 4% Amersham ECL Prime Blocking Reagent (GE Healthcare) and 0.1% Tween 20. It was then incubated with polyclonal anti-FIX antibody (F0652, SIGMA) (1:3000) overnight at 4 °C. Afterwards, the membrane was washed three times with TBS containing 0.1% Tween 20 and incubated with horseradish peroxidase-linked anti-rabbit IgG (NA934, GE Healthcare) at a dilution of 1:10,000 for 1 h. The blot was washed again and proteins were detected using the Amersham ECL Prime Western Blotting Detection Reagent (GE Healthcare) according to the manufacturer’s instructions.

### Sample preparation and MS analysis

The reduced and alkylated protein extracts were subjected to SDS-PAGE. Appropriate gel bands showing FIX fractions were excised and processed for in-gel digestion: Following the destaining of the gel bands with 30% acetonitrile (ACN), they were treated with 100% ACN and dried completely in a vacuum concentrator. Tryptic in-gel digestion of gel slices was performed overnight at 37 °C in 50 mM ammonium bicarbonate using 0.1 µg trypsin (Promega) per gel band. Afterwards, peptides were extracted from the gel with 5% formic acid (FA). Analyses were performed using the UltiMate 3000 RSLCnano system (Dionex LC Packings/Thermo Fisher Scientific) coupled online to a QExactive Plus instrument (Thermo Fisher Scientific). The UHPLC systems were equipped with a C18-precolumn (Ø: 0.3 mm, 5 mm; PepMap, Thermo Fisher Scientific) and an Acclaim® PepMap analytical column (ID: 75 m, 500 mm, 2 m, 100 Å, Dionex LC Packings/Thermo Fisher Scientific).

MS analyses were performed using a binary solvent system consisting of 0.1% FA (solvent A, “A”) and 0.1% FA/86% ACN (solvent B, “B”). Samples were washed and concentrated on a C18 precolumn with 0.1% TFA for 5 min before switching the column in line with the analytical column. Peptide separation was performed applying a 45-min gradient at a flow rate of 250 nL/min. Peptide samples were eluted with a gradient of 4–40% B in 30 min and 40–95% B in 5 min. After each gradient, the analytical column was washed with 95% B for 5 min and re-equilibrated for 15 min with 4% B. The MS instrument was externally calibrated using standard compounds and equipped with a nanoelectrospray ion source and distal coated SilicaTips (FS360–20-10-D, New Objective, Woburn, MA, USA), and MS/MS analyses were performed on multiply charged peptide ions. The measurement was performed using UltiMate 3000 RSLCnano system (Dionex LC Packings/Thermo Fisher Scientific, Dreieich, Germany) coupled online to a QExactive Plus mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) and mass spectrometer was operated in the data-dependent mode to automatically switch between MS (max. of 1 × 10 ions) and MS/MS. Each MS scan was followed by a maximum of 12 MS/MS scans using HCD with a normalized collision energy of 35%. The mass range for MS was m/z = 375–1,700 and resolution was set to 70,000. MS parameters were as follows: spray voltage 1.5 kV; ion transfer tube temperature 200 °C. Raw data were analyzed using Mascot Distiller V2.5.1.0 (Matrix Science, USA) and the peak lists were searched with Mascot V2.6.0 against an in-house database containing all Physcomitrella V3.3 protein models79 as well as the human FIX sequence (complete and alternative ones based on splicing). The peptide mass tolerance was set to 8 ppm and the fragment mass tolerance was set to 0.02 Da. Carbamidomethylation (C, +57.021464 Da) was specified as fixed modification. Variable modifications were Gln->pyro-Glu (-17.026549 Da), oxidation/hydroxylation (M, P, +15.994915 Da) and deamidation (N, +0.984016 Da). A total of two missed cleavages was allowed as well as semitryptic peptides. Search results were loaded into Scaffold4 software (Version 4.11.1, Proteome Software Inc., Portland, OR).

### Statistics and reproducibility

The sample size was decided based on prior experience. Experiments were not blinded and information regarding the number of replicates/samples and statistical analyses are described in the corresponding figure and table legends.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

All data generated or analyzed during this study are included in this article and its supplementary information. Microscopy images can be found in Supplementary Data 1 and 6. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE80 partner repository with the dataset identifier PXD026651 and 10.6019/PXD026651.

## Code availability

The JavaScript source code of our codon optimization tool physCO has been deposited at Zenodo81. An interactive version of physCO as well as the code used for the image processing and quantification are available at www.plant-biotech.uni-freiburg.de.

## References

1. 1.

Sakharkar, M. K., Chow, V. T. K. & Kangueane, P. Distributions of exons and introns in the human genome. Silico Biol. 4, 387–393 (2004).

2. 2.

Georgomanolis, T., Sofiadis, K. & Papantonis, A. Cutting a long intron short: recursive splicing and its implications. Front. Physiol. 7, 598 (2016).

3. 3.

Rogozin, I. B., Carmel, L., Csuros, M. & Koonin, E. V. Origin and evolution of spliceosomal introns. Biol. Direct 7, 11 (2012).

4. 4.

Papasaikas, P. & Valcárcel, J. The spliceosome: the ultimate RNA chaperone and sculptor. Trends Biochem. Sci. 41, 33–45 (2016).

5. 5.

Berget, S. M. Exon recognition in vertebrate splicing. J. Biol. Chem. 270, 2411–2414 (1995).

6. 6.

Goldstrohm, A. C., Greenleaf, A. L. & Garcia-Blanco, M. A. Co-transcriptional splicing of pre-messenger RNAs: Considerations for the mechanism of alternative splicing. Gene 277, 31–47 (2001).

7. 7.

Dredge, B. K., Polydorides, A. D. & Darnell, R. B. The splice of life: alternative splicing and neurological disease. Nat. Rev. Neurosci. 2, 43–50 (2001).

8. 8.

Shai, O., Morris, Q. D., Blencowe, B. J. & Frey, B. J. Inferring global levels of alternative splicing isoforms using a generative model of microarray data. Bioinformatics 22, 606–613 (2006).

9. 9.

Nissim-Rafinia, M. & Kerem, B. Splicing regulation as a potential genetic modifier. Trends Genet. 18, 123–127 (2002).

10. 10.

Yu, J. et al. Minimal introns are not ‘junk’. Genome Res. 12, 1185–1189 (2002).

11. 11.

Nilsen, T. W. & Graveley, B. R. Expansion of the eukaryotic proteome by alternative splicing. Nature 463, 457–463 (2010).

12. 12.

Sanchez, L. & Sánchez, L. Sex-determining mechanisms in insects. Int. J. Dev. Biol. 52, 837–856 (2008).

13. 13.

Shang, X., Cao, Y. & Ma, L. Alternative splicing in plant genes: a means of regulating the environmental fitness of plants. Int. J. Mol. Sci. 18, 432 (2017).

14. 14.

Deckert, J. et al. Protein composition and electron microscopy structure of affinity-purified human spliceosomal B complexes isolated under physiological conditions. Mol. Cell. Biol. 26, 5528–5543 (2006).

15. 15.

Amit, M. et al. Differential GC content between exons and introns establishes distinct strategies of splice-site recognition. Cell Rep. 1, 543–556 (2012).

16. 16.

Carmel, I., Tal, S., Vig, I. & Ast, G. Comparative analysis detects dependencies among the 5′ splice-site positions. RNA 10, 828–840 (2004).

17. 17.

Chiara, M. D., Palandjian, L., Kramer, R. F. & Reed, R. Evidence that U5 snRNP recognizes the 3’ splice site for catalytic step II in mammals. EMBO J. 16, 4746–4759 (1997).

18. 18.

Hall, S. L. & Padgett, R. A. Conserved sequences in a class of rare eukaryotic nuclear introns with non-consensus splice sites. J. Mol. Biol. 239, 357–365 (1994).

19. 19.

López, M. D., Alm Rosenblad, M. & Samuelsson, T. Computational screen for spliceosomal RNA genes aids in defining the phylogenetic distribution of major and minor spliceosomal components. Nucleic Acids Res. 36, 3001–3010 (2008).

20. 20.

Syed, N. H., Kalyna, M., Marquez, Y., Barta, A. & Brown, J. W. S. Alternative splicing in plants - coming of age. Trends Plant Sci. 17, 616–623 (2012).

21. 21.

Pan, Q., Shai, O., Lee, L. J., Frey, B. J. & Blencowe, B. J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415 (2008).

22. 22.

Zimmer, A. D. et al. Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions. BMC Genom. 14, 498 (2013).

23. 23.

Lloyd, J. P. B. et al. The loss of SMG1 causes defects in quality control pathways in Physcomitrella patens. Nucleic Acids Res. 46, 5822–5836 (2018).

24. 24.

Melo, J. P., Kalyna, M. & Duque, P. Current challenges in studying alternative splicing in plants: The case of Physcomitrella patens SR Proteins. Front. Plant Sci. 11, 286 (2020).

25. 25.

Schellenberg, M. J., Ritchie, D. B. & MacMillan, A. M. Pre-mRNA splicing: a complex picture in higher definition. Trends Biochem. Sci. 33, 243–246 (2008).

26. 26.

Maniatis, T. & Reed, R. An extensive network of coupling among gene expression machines. Nature 416, 499–506 (2002).

27. 27.

Reed, R. Initial splice-site recognition and pairing during pre-mRNA splicing. Curr. Opin. Genet. Dev. 6, 215–220 (1996).

28. 28.

Ram, O. & Ast, G. SR proteins: a foot on the exon before the transition from intron to exon definition. Trends Genet. 23, 5–7 (2007).

29. 29.

Gelfman, S. et al. Changes in exon-intron structure during vertebrate evolution affect the splicing pattern of exons. Genome Res. 22, 35–50 (2012).

30. 30.

Haseloff, J., Siemering, K. R., Prasher, D. C. & Hodge, S. Removal of a cryptic intron and subcellular localization of green fluorescent protein are required to mark transgenic Arabidopsis plants brightly. Proc. Natl Acad. Sci. USA 94, 2122–2127 (1997).

31. 31.

Diehn S. H., De Rocher E. J. & Green P. J. in Genetic Engineering (Principles and Methods). Vol 18 (ed. Setlow J. K.) (Springer, Boston, MA, 1996).

32. 32.

van Aarssen, R. et al. cry IA(b) transcript formation in tobacco is inefficient. Plant Mol. Biol. 28, 513–524 (1995).

33. 33.

Top, O., Geisen, U., Decker, E. L. & Reski, R. Critical evaluation of strategies for the production of blood coagulation factors in plant-based systems. Front. Plant Sci. 10, 261 (2019).

34. 34.

Kurachi, K., Furukawa, M., Yao, S. N. & Kurachi, S. Biology of factor IX. Hematol. Oncol. Clin. N. Am. 6, 991–997 (1992).

35. 35.

Chen, C. X., Baker, J. R. & Nichol, M. B. Economic burden of illness among persons with hemophilia b from HUGS Vb: examining the association of severity and treatment regimens with costs and annual bleed rates. Value Health 20, 1074–1082 (2017).

36. 36.

Decker, E. L. & Reski, R. Mosses in biotechnology. Curr. Opin. Biotechnol. 61, 21–27 (2020).

37. 37.

Decker, E. L. & Reski, R. Glycoprotein production in moss bioreactors. Plant Cell Rep. 31, 453–460 (2012).

38. 38.

Hennermann, J. B. et al. Pharmacokinetics, pharmacodynamics, and safety of moss-aGalactosidase A in patients with Fabry disease. J. Inherit. Metab. Dis. 42, 527–533 (2019).

39. 39.

Reski, R., Parsons, J. & Decker, E. L. Moss-made pharmaceuticals: from bench to bedside. Plant Biotechnol. J. 13, 1191–1198 (2015).

40. 40.

Michelfelder, S. et al. Moss-produced, glycosylation-optimized human factor H for therapeutic application in complement disorders. J. Am. Soc. Nephrol. 28, 1462–1474 (2017).

41. 41.

Top, O. et al. Recombinant production of MFHR1, a novel synthetic multitarget complement inhibitor, in moss bioreactors. Front. Plant Sci. 10, 260 (2019).

42. 42.

Schaaf, A., Reski, R. & Decker, E. L. A novel aspartic proteinase is targeted to the secretory pathway and to the vacuole in the moss Physcomitrella patens. Eur. J. Cell Biol. 83, 145–152 (2004).

43. 43.

Pucker, B. & Brockington, S. F. Genome-wide analyses supported by RNA-Seq reveal non-canonical splice sites in plant genomes. BMC Genomics. 19, 980 (2018).

44. 44.

Gitzinger, M., Parsons, J., Reski, R. & Fussenegger, M. Functional cross-kingdom conservation of mammalian and moss (Physcomitrella patens) transcription, translation and secretion machineries. Plant Biotechnol. J. 7, 73–86 (2009).

45. 45.

Lang, D. et al. The Physcomitrella patens chromosome-scale assembly reveals moss genome structure and evolution. Plant J. 93, 515–533 (2018).

46. 46.

Rensing, S. A., Fritzowsky, D., Lang, D. & Reski, R. Protein encoding genes in an ancient plant: Analysis of codon usage, retained genes and splice sites in a moss, Physcomitrella patens. BMC Genomics 6, 292 (2005).

47. 47.

Kalyna, M. et al. Alternative splicing and nonsense-mediated decay modulate expression of important regulatory genes in Arabidopsis. Nucleic Acids Res 40, 2454–2469 (2012).

48. 48.

Drechsel, G. et al. Nonsense-mediated decay of alternative precursor mRNA splicing variants is a major determinant of the Arabidopsis steady state transcriptome. Plant Cell 25, 3726–3742 (2013).

49. 49.

Hiss, M. et al. Combination of the endogenous lhcsr1 promoter and codon usage optimization boosts protein expression in the moss Physcomitrella patens. Front. Plant Sci. 8, 1842 (2017).

50. 50.

Nakamura, Y., Gojobori, T. & Ikemura, T. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 28, 292 (2000).

51. 51.

Khraiwesh, B. et al. Transcriptional control of gene expression by microRNAs. Cell 140, 111–122 (2010).

52. 52.

Khraiwesh, B., Ossowski, S., Weigel, D., Reski, R. & Frank, W. Specific gene silencing by artificial microRNAs in Physcomitrella patens: An alternative to targeted gene knockouts. Plant Physiol. 148, 684–693 (2008).

53. 53.

Latijnhouwers, M. J., Pairoba, C. F., Brendel, V., Walbot, V. & Carle-Urioste, J. C. Test of the combinatorial model of intron recognition in a native maize gene. Plant Mol. Biol. 41, 637–644 (1999).

54. 54.

Marquez, Y., Höpfler, M., Ayatollahi, Z., Barta, A. & Kalyna, M. Unmasking alternative splicing inside protein-coding exons defines exitrons and their role in proteome plasticity. Genome Res. 25, 995–1007 (2015).

55. 55.

Reddy, A. S. N. Alternative splicing of pre-messenger RNAs in plants in the genomic era. Annu. Rev. Plant Biol. 58, 267–294 (2007).

56. 56.

Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

57. 57.

Duret, L., Mouchiroud, D. & Gautier, C. Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J. Mol. Evol. 40, 308–317 (1995).

58. 58.

Kong, A. et al. A high-resolution recombination map of the human genome. Nat. Genet. 31, 241–247 (2002).

59. 59.

Kudla, G., Lipinski, L., Caffin, F., Helwak, A. & Zylicz, M. High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol. 4, e180 (2006).

60. 60.

Meunier, J. & Duret, L. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 21, 984–990 (2004).

61. 61.

Carels, N. & Bernardi, G. Two classes of genes in plants. Genetics 154, 1819–1825 (2000).

62. 62.

Weise, A. et al. Use of Physcomitrella patens actin 5′ regions for high transgene expression: Importance of 5′ introns. Appl. Microbiol. Biotechnol. 70, 337–345 (2006).

63. 63.

Bodenhofer, U., Bonatesta, E., Horejš-Kainrath, C. & Hochreiter, S. Msa: an R package for multiple sequence alignment. Bioinformatics 31, 3997–3999 (2015).

64. 64.

Koprivova, A. et al. Targeted knockouts of Physcomitrella lacking plant-specific immunogenic N-glycans. Plant Biotechnol. J. 2, 517–523 (2004).

65. 65.

Frank, W., Decker, E. L. & Reski, R. Molecular tools to study Physcomitrella patens. Plant Biol. (Stuttg.)7, 220–227 (2005).

66. 66.

Strepp, R., Scholz, S., Kruse, S., Speth, V. & Reski, R. Plant nuclear gene knockout reveals a role in plastid division for the homolog of the bacterial cell division protein FtsZ, an ancestral tubulin. Proc. Natl Acad. Sci. 95, 4368–4373 (1998).

67. 67.

Decker, E. L., Wiedemann, G. & Reski, R. Gene targeting for precision glyco-engineering: production of biopharmaceuticals devoid of plant-typical glycosylation in moss bioreactors. Methods in Mol. Biol. 1321, 213–224 (2015).

68. 68.

Schween, G., Fleig, S. & Reski, R. High-throughput-PCR screen of 15,000 transgenic Physcomitrella plants. Plant Mol. Biol. Rep. 160, 209–212 (2002).

69. 69.

R Development CoreTeam (2017) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

70. 70.

Wagih, O. Ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).

71. 71.

Maack, I. & Neitzel, U. in Computer Assisted Radiology/Computergestützte Radiologie (eds. Lemke, H. U. et al.) 109–114 (Springer Berlin Heidelberg, 1991).

72. 72.

Critchley, H. D. et al. Human cingulate cortex and autonomic control: Converging neuroimaging and clinical evidence. Brain 126, 2139–2152 (2003).

73. 73.

van der Walt, S. et al. scikit-image: image processing in Python. PeerJ 2, e453 (2014).

74. 74.

Lam, S. K., Pitrou, A. & Seibert, S. Numba. in Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC - LLVM ’15. 1–6 (ACM Press, 2015).

75. 75.

Oliphant, T. & Millma, J. K. A guide to NumPy. (Trelgol Publishing, 2006).

76. 76.

McKinney, W. Data structures for statistical computing in python. Proc. 9th Python Sci. Confer. (2010).

77. 77.

Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

78. 78.

Hoernstein, S. N. W. et al. Host cell proteome of Physcomitrella patens harbors proteases and protease inhibitors under bioproduction conditions. J. Proteome Res. 17, 3749–3760 (2018).

79. 79.

Lang, D. et al. The Physcomitrella patens chromosome-scale assembly reveals moss genome structure and evolution. Plant J. 93, 515–533 (2018).

80. 80.

Perez-Riverol, Y. et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 47, D442–D450 (2019).

81. 81.

van Gessel, N. physCO - The Physcomitrella codon optimization tool for CDS sequences. Zenodo https://doi.org/10.5281/zenodo.4905544 (2021).

## Acknowledgements

We thank Bettina Warscheid for the possibility to use the QExactive Plus mass spectrometer and Anne Katrin Prowse for language editing. This work was supported by the Excellence Initiative of the German Federal and State Governments (GSC-4 to O.T., EXC-294 and EXC-2193/1 – 390951807 to R.R.); and a grant from the German Federal Ministry of Education and Research (BMBF 0313852C to R.R.).

## Funding

Open Access funding enabled and organized by Projekt DEAL.

## Author information

Authors

### Contributions

O.T., E.L.D. and R.R. designed research; O.T., S.W.L.M., B.Ö., and N.v.G. performed research and together with S.N.W.H analyzed data; O.T., E.L.D., and R.R. wrote the paper.

### Corresponding author

Correspondence to Ralf Reski.

## Ethics declarations

### Competing interests

The authors declare the following competing interests: All authors are inventors of patents and patent applications related to the production of recombinant proteins. R.R. is a founder of Greenovation Biotech (now eleva GmbH). He currently serves as advisory board member of this company. Eleva develops and markets moss-based biopharmaceuticals.

Peer review information Communications Biology thanks Andreas Schiermeyer and the anonymous reviewers for their contribution to the peer review of this work. Primary handling editors: Caitlin Karniski and Jung-Eun Lee.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Reprints and Permissions

Top, O., Milferstaedt, S.W.L., van Gessel, N. et al. Expression of a human cDNA in moss results in spliced mRNAs and fragmentary protein isoforms. Commun Biol 4, 964 (2021). https://doi.org/10.1038/s42003-021-02486-3

• Accepted:

• Published:

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.