Expression of a human cDNA in moss results in spliced mRNAs and fragmentary protein isoforms

Production of biopharmaceuticals relies on the expression of mammalian cDNAs in host organisms. Here we show that the expression of a human cDNA in the moss Physcomitrium patens generates the expected full-length and four additional transcripts due to unexpected splicing. This mRNA splicing results in non-functional protein isoforms, cellular misallocation of the proteins and low product yields. We integrated these results together with the results of our analysis of all 32,926 protein-encoding Physcomitrella genes and their 87,533 annotated transcripts in a web application, physCO, for automatized optimization. A thus optimized cDNA results in about twelve times more protein, which correctly localizes to the ER. An analysis of codon preferences of different production hosts suggests that similar effects occur also in non-plant hosts. We anticipate that the use of our methodology will prevent so far undetected mRNA heterosplicing resulting in maximized functional protein amounts for basic biology and biotechnology.

T he majority of eukaryotic genes comprise multiple exons interspersed by noncoding introns, the number and length of which vary between species. The human genome size (GRCh38.p13, Genome Reference Consortium Human Build 38 patch release 13) is 3,099,706,400 bp compromising 20,444 protein-encoding genes, with 8.8 exons and 7.8 introns per gene on average 1 . Generally, human genes have many short exons (average 50 codons) separated by long introns (up to >10 kbp) 2 . It is widely accepted that such an exon-intron gene structure emerged during evolution and lead to complex gene regulation and diversification of multicellular eukaryotes 2,3 . Introns that may contain regulatory elements affecting gene expression are removed from pre-mRNAs by RNA splicing. Subsequently, exons are ligated together to generate functional mature mRNAs 4 . Over 30 years of research on pre-mRNA splicing elucidated an underlying complex regulation, which includes exon versus intron recognition 5 , co-transcriptional splicing 6 , alternative splicing (AS) 7 , exonic and intronic splicing enhancers and repressors 8,9 , and mRNA export from the nucleus 10 .
AS is a fundamental regulatory process that contributes to proteome expansion 11 . AS can produce multiple mRNAs from the same gene through the preference of splice sites during pre-mRNA splicing. It is modulated based on sequence motifs in the pre-mRNA, the interactions between RNA-binding proteins and splice sites, different cell types, and environmental signals 12 . Splicing is carried out by a large ribonucleoprotein (RNP) complex, the spliceosome, which brings selected exons together by two transesterification reactions 13 . During splicing, uridine (U)-rich small nuclear RNPs (snRNPs) together with non-snRNP splicing factors, and serine/arginine-rich (SR) proteins participate in the recognition of 5′-and 3′-splice sites as well as the branch site 14 . The core of the spliceosome is highly conserved in all wellcharacterized eukaryotes 3,15 . The key requirement for splicing is the recognition of a donor splice site by the U1 snRNP 16 . On the other hand, the acceptor splice site, downstream of the polypyrimidine tract, is recognized by the U5 snRNP 17 . Another important splicing element containing the reactive adenosine, the branch point signal, is located upstream of the acceptor splice signal and is important for the formation of the lariat-like intermediate structure 17 . While the majority of the introns have "GT" at the donor splice site and "AG" at the acceptor splice site, in rare cases introns can have unusual dinucleotides: for instance "AT" at the donor site and "AC" at the acceptor site 18 . This class of introns is spliced out of the pre-mRNA by the minor spliceosome U12 19 .
Besides its role in proteome expansion, AS controls transcript levels by generating unstable mRNA isoforms that may activate nonsense-mediated mRNA decay (NMD) 20 . Generation of alternative functional mRNAs due to the introduction of premature termination codons (PTC), intron retention, and alternative use of the 5′-and 3′-splice sites causes changes in the localization, stability, and/or function of a protein. While in humans more than 95% of genes encode alternatively spliced mRNAs 21 , in an early study 6556 out of 39,106 predicted genes showed AS in the plant Physcomitrium patens (moss Physcomitrella) with a strong bias toward intron retention 22 . Many PTCcontaining alternatively spliced transcripts are NMD targets in Physcomitrella 23 . Furthermore, about half of the AS events altered CDSs and resulted in 2380 distinct proteins. In these studies, analysis of AS relied on expressed sequence tags (ESTs) and the alignment of ESTs and cDNA sequences to the genome. The percentage of AS events in P. patens is estimated as 58%, but large-scale and/or tissue-dependent profiling of AS in moss by the use of improved sequencing methods may result in a higher percentage 24 .
Two models of splice-site recognition exist 25 : In the introndefinition model, U1 binds to the upstream donor splice site and U2AF/U2 to the downstream acceptor splice site and branch site of the same intron 26,27 . The length of introns, which probably remain short under evolutionary selection 3 , limits the efficiency of splicing. In lower eukaryotes where introns are shorter, this mechanism might be dominant 15 . In the exon-definition model, the splicing machinery recognizes splice sites flanking the same exon 28,29 . This is the major mechanism in higher eukaryotes, where introns are generally longer. Due to these differences, the correct splicing of transgene transcripts has to be considered to maximize the production of a functional protein. For instance, it was necessary to remove a cryptic intron in the coding sequence (CDS) of green fluorescent protein (GFP) from the jellyfish Aequorea victoria to function in the flowering plant Arabidopsis thaliana 30 . Similarly, after several attempts to express Bacillus thuringiensis toxin genes in plants 31 , it was found that the expression of cry genes in tobacco protoplasts resulted in cry mRNAs, which were unstable and/or not efficiently made 32 .
The use of plants as alternative pharmaceutical protein production hosts has been on the rise for two decades, and some plant-made products have reached the market 33 . However, the percentage of plant-based biopharmaceuticals in the pharmaceutical market is still minute, mainly because of lower yields compared to prokaryotic and mammalian systems. Nevertheless, they offer alternative solutions in niche areas. Production of blood-clotting factors could be such a niche 33 . The human bloodclotting factor IX (FIX), a member of the intrinsic pathway of coagulation, is required for normal hemostasis and its absence or abnormal levels causes hemophilia B 34 . Current treatment is restricted to protein replacement therapy and factor concentrates costs are between $100,000 and $200,000 per patient and year 35 . Thus, there is a need for an economic production of FIX, which makes us attempting to produce it in the moss system. The moss P. patens, a model organism for evolutionary and functional genomics approaches, is an established host for the production of complex recombinant biopharmaceuticals and has proven its suitability for large-scale production 36,37 . The first moss-made drug candidate, moss-aGal for enzyme replacement therapy of Fabry disease, has successfully completed phase I clinical trials 38 . In addition, a variety of other potential biopharmaceuticals is being produced in moss [39][40][41] .
Here, we report the expression of a human blood-clotting FIXencoding cDNA in P. patens. We show that specific sequence motifs in the FIX mRNA were recognized by the splicing machinery as donor and acceptor sites and produced several different transcripts, which resulted in different protein isoforms. We named this new phenomenon, to the best of our knowledge, heterosplicing, and were able to prevent it by changing the splice sites and optimizing codons without affecting the FIX aa sequence (optiFIX). Expression of optiFIX resulted in full-length transcripts only, and a 12-fold increase of protein accumulation. In addition, we found heterosplicing also in transgenic moss lines expressing the CDS of human factor H (FH) CDS. By optimizing the sequence, we generated optiFH, which showed no heterosplicing and yielded significantly enhanced FH protein abundances in vivo. We suggest that our methodology of CDS optimization will provide a better quantification of transcript levels, and precise detection, purification, and quantification of functional proteins. It may reduce the bioenergetic costs of heterologous cDNA expression by preventing the formation of nonfunctional products and maximizing functional recombinant protein production, major bottlenecks in plant-based systems. at the posttranscriptional level, either via target mRNA degradation or translational inhibition. The possible actions of P. patens microRNAs on the RNA sequence for human FIX (NCBI reference NM_000133) were investigated using psRNATarget (http:// plantgrn.noble.org/v1_psRNATarget/). Thus, two microRNAs were identified (miR1028c-3p and miR1044-3p), that might interfere with FIX expression in this moss. The respective binding sites on the FIX RNA sequences were modified in the DNA construct without a change in the FIX aa sequence. The potential binding site of ppt-miR1028c-3p, which might inhibit translation, was changed into ATCATGTGAGCCAGCTGTACCC between positions 495 and 516. The binding site of ppt-miR1044-3p, which might lead to cleavage of the FIX RNA, was mutated into AGGGTAAG-TACGGCATCTAT between positions 1304 and 1323. The sequence of microRNAs and their binding sites as well as the maximum energy to unpair the target site (UPE) and expectation value are shown in Supplementary Table 1.
Splicing of FIX mRNA occurs in transgenic plants and in transiently transfected cells. The native FIX signal peptide was replaced by the moss aspartic proteinase signal peptide of PpAP1 to target the protein to the secretory pathway for its proper posttranslational modifications 42 . After transformation of the parental moss line Δxt/ft 41 with the human FIX-encoding, cDNA-based expression construct that comprises the complete FIX CDS of 1401 bp without any introns; 49 transgenic moss lines were regenerated and screened. The presence and integration of the full-length FIX CDS in the moss genome was confirmed via PCR ( Supplementary Fig. 1). In order to validate the completeness of the FIX transcripts, RT-PCR was performed with primers spanning the complete CDS. In addition to the full-length transcript, several smaller products occurred (Fig. 1). In cDNA from the juvenile moss protonemal tissue, PCR products of discrete sizes were detected ranging from~700 to 1370 bp (Fig. 1a). Sequencing of the PCR products revealed that five of the observed bands were FIX-specific products with a length of 1371, 1240, 895, 769, and 691 bp, respectively. Apparently, unexpected splicing of the FIX mRNA produced the smaller aberrant transcripts. To analyze this phenomenon across different tissues, we repeated the same procedure with RNA isolated from adult plants (gametophores). FIX transcripts detected in gametophores were consistent with those from protonema (Fig. 1b).
The sequencing of different PCR products revealed that specific motifs along the human FIX mRNA were recognized as splice signals in moss. The spliced transcripts were identical in all 49 transgenic lines and in 2 different tissues. Hence, we conclude that different transcripts were exclusively derived from unexpected splicing of the mRNA transcribed from the complete FIX CDS, and not by a partial integration of the FIX expression construct into the moss genome. PCRs from genomic DNA support this conclusion ( Supplementary Fig. 1).
In order to scrutinize the presence of spliced FIX CDS variants in transiently transfected cells (protoplasts), total RNA was extracted from cells 14 days after transformation. After RT-PCR, FIX CDS variants with varying lengths occurred. Like in the RT-PCR from transgenic lines, 5 FIX CDS variants (1 full-length and 4 aberrant spliced FIX transcripts) were confirmed in these protoplasts (Fig. 1c). Unedited gel images showing the full lanes can be found in Supplementary Fig. 2. Thus, we found an unexpected splicing of an mRNA transcribed from a human full-length cDNA in moss, a phenomenon we henceforth name heterosplicing.
Spliced FIX transcripts and splice-site analysis. Splice junctions within the FIX mRNA were revealed by sequencing. The splicing of the primary transcripts from human FIX cDNA in moss caused the production of four alternative products (Fig. 2a): i. deletion of a 44 aa long domain in the part coding for the heavy chain ( Fig. 2a.II), ii. large deletion in the part coding for the light chain (115 aa) together with the absence observed in the heavy chain ( Fig. 2a.III), iii. nearly complete absence of the sequence coding for the light chain and parts of the activation peptide domain (157 aa) and the heavy chain ( Fig. 2a.IV), iv. nearly complete lack of the sequence encoding the light chain, complete absence of the activation peptide domain (186 aa) and the sequence representing 41 aa of the heavy chain ( Fig. 2a. V). The heavy chain harbors a serine protease domain, which is responsible for the activation of Factor X in the presence of Factor VIII, calcium, and phospholipid surfaces following removal of the activation peptide. Any deletion in this domain will interfere with the activity of the FIX protein. The light chain consists of the Gla domain followed by two EGF domains. Any loss in these domains will adversely affect the properties of FIX. Therefore, any alterations caused by the heterosplicing will result in the loss of FIX function.
The GT-AG splicing rule applies for almost all eukaryotic genes 43 and especially P. patens accepts control elements originally optimized for mammalian expression systems without a need for adapting these elements 44 . Here, we inferred four different motifs for acceptor and donor sites based on sequencing data. All FIX heterosplice sites followed the GT-AG rule. Moreover, the data revealed that a CAGGT motif is present in the exon-intron  junction, i.e. the donor site (Fig. 3a). In the intron-exon junction, the acceptor site, the motif CAG is conserved in 3 out of 4 cases, but the adjacent two nucleotides do not follow any trend.
To analyze whether donor and acceptor sites and their neighboring nucleotides are conserved in P. patens, the genomic vicinity of all splice sites of all 87,533 annotated transcripts corresponding to the 32,926 protein-encoding genes of the current P. patens genome release v3.3 45 were retrieved (Fig. 3b, c). In alignment with our experimental findings, an in silico analysis of these transcripts showed that the CAG|GT motif is the most abundant one in both donor and acceptor sites with a fraction of 23% and 18%, respectively. The full list of splice sites is compiled in Supplementary Tables 2 and 3.
Subsequently, we analyzed the FIX CDS with a moss splice-site prediction tool 46 , but it largely failed to predict the experimentally verified splicing motifs. This might be due to the fact that this tool has been developed several years ago using only 368 donor and acceptor sites 46 and has not been updated since then.
Detection of truncated mossFIX protein variants and validation by mass spectrometry. The analyses at RNA level suggested that P. patens would produce a full-length FIX protein and four aberrant fragmentary FIX isoforms. To validate this inference, the culture supernatants of transiently transfected cells were precipitated and analyzed with a polyclonal anti-FIX antibody via immunoblot after reduction and alkylation (Fig. 4a). Full-length FIX was expected at a molecular mass of around 52 kDa and spliced FIX variants were calculated based on the heterospliced transcripts with molecular masses of 47, 34, 29, and 27 kDa, respectively. Observed bands on the immunoblot could be attributed to these isoforms together with a band at around 40 kDa (potentially a degradation product) and a band at around 70 kDa (potentially dimers of fragmentary protein isoforms). In addition, we used the other half of the supernatants for massspectrometric (MS) determination of FIX-variant peptide sequences. For the MS-database search, protein models for FIX variants derived from the sequenced splice variants were employed.
We detected the peptide VVGGEDAKPGEHNIEETEH-TEQKR ( Supplementary Fig. 3) in MS analysis after trypsin digestion, which resulted from splicing in the CDS for the heavy chain that led to the loss of 44 aa between the peptides VVGGEDAKP and GEHNIEETEHTEQKR ( Fig. 4b; variants II, Prevention of heterosplicing by CDS optimization. To analyze the effectiveness of mutating donor and acceptor sites on heterosplicing of the FIX mRNA, we created two different FIX CDS versions. In the first one, aspFIX, the codons neighboring the detected splice junctions were modified. P. patens was transiently transfected with the resulting aspFIX plasmid. On day 14, the cells were collected and RNA was isolated. RT-PCR and sequencing of the PCR products revealed that there were only two bands representing FIX variants: the full-length FIX CDS and a shorter variant (~700 bp) (Fig. 5a, Unedited gel image in Supplementary Fig. 2). The donor and acceptor sites of the shorter variant were identified (Fig. 5b) and this shorter variant was lacking the nucleotides encoding the light chain, AP, and parts of the heavy chain (Fig. 5c). Moreover, splicing would cause a frameshift mutation resulting in early stop codons. Although such nonsense mRNAs may be degraded via NMD, not all nonsense mRNAs undergo NMD, which was reported e.g. for P. patens 23 and A. thaliana 47,48 . Despite the fact that splicing and translation take place in different cellular compartments we next performed a complete codon optimization of the FIX CDS in the aspFIX construct according to previous studies 49,50 resulting in the optiFIX sequence with an overall increased GC content. Although translatability of a particular RNA via codon usage is not expected to regulate whether it can be spliced or not, we aimed to change the recognition of parts of human CDS regions as intronic regions by the moss spliceosome machinery. Therefore, we aimed at enhancing the GC content due to considerable differences between plant and animal exon/intron definitions. This was used to transiently transfect moss cells. Subsequent RT-PCR analysis and sequencing revealed that the full-length CDS of FIX was the only specific band; formation of any heterospliced FIX variant was not detected (Fig. 5d, Unedited gel image in Supplementary  Fig. 2). The presence of exclusively the full-length CDS was confirmed on protein level via immunoblot analysis (Fig. 5e). An unedited blot showing the full lanes can be found in Supplementary Fig. 4.
Next, we compared the codon usage bias in P. patens to biases of other pharmaceutical production platforms. The codon usage frequencies of under-and overrepresented codons for Cys, Glu, Phe, His, Lys, Asn, Gln, and Tyr in P. patens were compared with Spodoptera frugiperda (insect cells), Homo sapiens (HEK cells), Cricetulus griseus (CHO cells), Nicotiana tabacum (tobacco) and Oryza sativa (rice). For instance, the overrepresented codon for Cys, TGC, in P. patens (9.8 in thousands) is also an overrepresented codon in S. frugiperda, H. sapiens, C. griseus, and O.
sativa. On the other hand, this codon is underrepresented in N. tabacum and TGT is preferred over TGC for Cys. The synonymous codon usage bias in P. patens is similar to codon usage biases in S. frugiperda, H. sapiens, C. griseus (CHO cells), and O. sativa (rice), but different from N. tabacum (tobacco) ( Table 1).
Comparison of protein levels of FIX and optiFIX in vivo. In order to analyze whether the protein amounts of FIX and optiFIX show any difference in live moss cells, we cloned two fusion constructs, FIX-Citrine, and optiFIX-Citrine. Subsequently, we transiently transfected moss protoplasts with these constructs, and performed confocal laser scanning microscopy (CLSM) on day 3 after transformation. A visual analysis of the images revealed that the FIX-Citrine signal shows a diffuse localization pattern, which might be due to intracellular mislocalization of aberrant protein isoforms (Fig. 6a). In contrast, optiFIX-Citrine exhibited a complex network-like organization, indicating a proper localization of the protein in the ER (Fig. 6b). Furthermore, the optiFIX-Citrine signal intensity was higher than that of FIX-Citrine. Comparison of mean voxel intensities calculated for FIX-Citrine and for optiFIX-Citrine (3.88 and 45.22, respectively) shows a nearly 12-fold higher fluorescence intensity after the CDS optimization, which had prevented undesired heterosplicing (Fig. 6c). The source images (Z-stacks) for Fig. 6c can be found in Supplementary Data 1 and the results of the quantification is available in Supplementary Table 4.
Automatic CDS optimization of transgenes. In order to automatize the process of P. patens-specific CDS optimization, we compiled our findings into a JavaScript-based web application: physCO, and made it available at www.plant-biotech.uni-freiburg.de. physCO takes as input CDSs in FASTA format, inspects them codon-wise, substitutes codons where possible according to our findings on codon usage (Table 2), and finally outputs them in FASTA format. In addition, physCO checks the input for consistent lengths, valid start, and stop codons, and internal, premature stop codons and raises warnings in case it detects irregularities.
Heterosplicing of FH mRNA in transgenic plants. In order to test whether heterosplicing is a common phenomenon in P. patens upon transgene expression, we analyzed already existing recombinant human complement FH-producing moss lines 40 . Our RT-PCR analysis revealed that expression of the human FH CDS in moss generated 2 transcript variants, the fulllength FH transcript of 3639 bp and a 1215 bp FH variant, which could lead to a truncated FH isoform with a molecular mass of 45.6 kDa (Supplementary Data 2). Therefore, we generated with the help of physCO an optimized version of hFH, and named it optiFH. These changes affected 386 of 1213 codons, and   analysis of the images acquired for FH and optiFH revealed that while an FH-Citrine signal was not detectable, optiFH-Citrine signal intensity was high and filled the secretory pathway (Supplementary Fig. 5; Supplementary Data 6). Analysis on FH showed that heterosplicing is not an effect specific for FIX, but a general phenomenon in moss. Therefore, transgene sequences have to be modified using physCO prior to transformation for proper transcription and translation.

Discussion
Heterologous gene expression is of fundamental importance for basic biology and for industrial production. In order to optimize transgene transcription and translation as well as transcript stability, many factors have to be taken into account. Engineering of  upstream and downstream regulatory sequences, replacing original signal peptides with host-suitable ones and optimizing codons according to host codon-usage pattern can be used to improve protein production. In addition, microRNAs can affect protein production, either by microRNA-directed cleavage of target mRNAs or by translational inhibition 51,52 . To eliminate such a potential adverse effect of Physcomitrella microRNAs on the expression of the human gene, predicted microRNA-binding sites in the FIX CDS were mutated without altering the aa sequence. Splicing/AS is another important factor that alters expression rates by the generation of various transcripts, which create different protein isoforms. Generation of these protein isoforms can affect fundamental characteristics of the protein such as function, activity, intracellular localization, interaction with other molecules, and stability. Moreover, splicing can create a transcript with an early stop codon, which may be degraded by NMD. Thus, it can be hypothesized that modulation of splicing, a neglected point in transgene expression, can improve protein production. For instance, the expression of GFP from the jellyfish Aequorea victoria in A. thaliana failed due to aberrant mRNA processing 30 . In another example, the expression of bacterial cry IA(b) in tobacco resulted in unstable and/or not efficiently made transcripts 31 . To avoid AS, generally cDNAs derived from already spliced mRNAs are introduced into the genome of the production host.
We introduced the human FIX cDNA under the control of the PpActin5 promoter and CaMV 35S terminator into the Physcomitrella genome. RNA was isolated from protonemata of transgenic lines and RT-PCR analysis revealed unexpectedly that a fraction of human FIX mRNA was spliced into four variants in addition to the unspliced full-length transcript, resulting in five mRNAs with different lengths derived from one cDNA. According to RT-PCR analyses, this phenomenon occurred not only in juvenile protonema, but also in adult moss plants. We sequenced these mRNAs and surprisingly found that reading frames had not changed in the four smaller variants in comparison to the mature full-length transcript. This may have happened by a pure coincidence. Alternatively, we may not have been able to detect such putative minor transcripts because they were either too low abundant for RT-PCR, or they have been eliminated by NMD.
The outcome of this heterosplicing was detectable on four levels, by RT-PCR from mRNAs, immunoblot, mass spectrometry of the protein isoforms, and in vivo by confocal imaging showing deviant localization of Citrine reporter fusions. These findings were also reflected in RT-PCR analysis from transiently transfected moss cells. Our observations revealed that heterosplicing of the human FIX mRNA in moss is consistent across different tissues and not dependent on developmental stages. Furthermore, the consistency of heterosplicing in episomally (transiently) FIX-expressing cells indicates that heterosplicing of the FIX mRNA in transgenic lines was neither due to illegitimate or partial integration of the construct into the moss genome nor a locus-specific effect. Moreover, results of PCRs from genomic DNA of transgenic lines supported this conclusion. Unlike previously reported cases, in which expression of full-length GFP transcript and fluorescence were not detectable 30 or expression of cry mRNAs led to generation of unstable and/or not efficiently made transcripts 32 , we report here to the best of our knowledge for the first time heterosplicing of a human cDNA in a plant. It caused the generation of various stable FIX mRNA variants and protein isoforms in addition to the complete transcript and protein. Therefore, a possible occurrence of heterosplicing should be analyzed on RNA level, even if a full-length protein can be detected in heterologous cells.
All splice junctions in the FIX CDS were determined by sequencing. In addition, we identified consensus motifs of donor and acceptor splice sites in the latest P. patens genome version 45 using all 87,533 annotated transcripts that originate from all 32,926 protein-coding genes, and compared them with our experimentally verified motifs. This revealed that P. patens, like most eukaryotes, follows the GT-AG splicing rule. The same was true for the cryptic additional introns recognized in the FIX mRNA; obviously, the P. patens spliceosome interprets parts of the FIX CDS as intronic sequences. Presumably, moss cells recognize motifs in the human FIX CDS as cis exonic and intronic splicing enhancers. Previous work by Marquez et al. revealed some of the exitrons, exonic introns, that have ancestral exonic origin and follow GT/CAG boundaries like some of our heterosplice sites but little is known about exonic and intronic regulatory sequences in plants as well as driving force for intronization and extronization 53,54 . Moreover, plant introns are different from animal introns in terms of UA-or U-richness, which is crucial for splicing efficiency 55 . Due to these characteristic differences, it is likely that the GC content can affect the characteristics of splice site recognition 15 . The GC content is correlated with many features including gene density 56 , intron length 57 , meiotic recombination 58 , and gene expression 59 . It varies among species, and even along chromosomes 60 . It was shown before that the average GC content in P. patens CDSs is 50% 46 . The GC content of the human FIX CDS recognized as cryptic additional introns ranged from 38% to 40%. This remained the same after exclusively mutating the splice sites' environment in aspFIX; even though mutations in the splicing motifs decreased the number of transcripts from five to two, the GC content of the cryptic alternative intron was still only 40%. The shorter transcript was not in frame, but still detectable by RT-PCR. Therefore, we created the fully CDS-optimized plasmid optiFIX to check the effect of overall codon optimization on pre-mRNA splicing. A closer comparison of the aspFIX and optiFIX constructs revealed that there was no change in the donor and acceptor sites of the newly emerged cryptic intron. There was, however, an increase of the GC content within the cryptic intron sequence. The GC contents of the sequences previously defined as introns in the FIX and aspFIX constructs increased to 46-50% in optiFIX. Hence, although splicing and translation occurs in different cellular compartments, we conclude that the increase of GC content introduced by the overall codon optimization together with mutating splice sites' environment prevented the heterosplicing, the mis-interpretation of any sequence as intron by the moss spliceosome. As a result, a nearly 12-fold increase in protein amounts was achieved in transiently transfected moss protoplasts in vivo. As heterosplicing did not cause detectable frameshift mutations, fluorescence signals obtained from the FIX-Citrine construct were the sum of all FIX protein isoforms, fulllength and fragmentary, fused to Citrine. Based on this inference and proteins, which were harvested from transient transformation with the FIX construct and detected on the immunoblot (Fig. 4), we conclude that the prevention of heterosplicing from the optiFIX construct has improved full-length protein amounts in vivo even more than 12-fold.
Two mechanisms, namely boosted translation rates and the prevention of mRNA degradation via NMD, may contribute to this result. In addition, the protein from the optiFIX construct was solely detectable in the ER, the compartment of choice for correct posttranslational modifications and subsequent secretion of the protein. A similar phenomenon was observed in the expression of another transgene in Physcomitrella. One example of heterosplicing and its prevention by our methodology is the expression of human complement FH. FH is an important regulator of the alternative pathway in the human complement system. Currently, a recombinant FH is not available on the market, although it has potential as a biopharmaceutical in the treatment of severe human diseases like atypical hemolytic uremic syndrome (aHUS), age-related macular degeneration (AMD) or C3 glomerulopathies. The recombinant production of FH, devoid of plant-specific N-linked sugar residues, in moss resulted in a range of promising biological activities 40 . Moreover, mossderived FH is currently being examined for use in Covid-19 treatment (www.elevabiologics.com). RT-PCR analysis in FHproducing stable moss lines revealed that the expression of human FH CDS generated 2 transcript variants, the full-length FH transcript of 3639 bp and a 1215 bp FH variant that could lead to the production of an FH isoform with a calculated molecular mass of 45.6 kDa (Supplementary Data 3). We optimized the FH CDS using our bioinformatics tool physCO, described here, to analyze whether the protein amounts of FH and optiFH show any difference in live moss cells. A visual analysis of the images acquired for FH and optiFH from transiently transfected moss cells revealed that while FH-Citrine signal intensity is not detectable, optiFH-Citrine signal intensity is high ( Supplementary  Fig. 5). Thus, an optimization of the currently used CDS of human FH in plants is advantageous for plant-based production.
These findings suggest that the splicing of heterologous mRNAs should be taken into account on a routine basis. In addition, our analysis of synonymous codon usage bias suggests that our methodology to prevent heterosplicing can be directly implemented in organisms, which follow similar codon usage patterns as P. patens, such as O. sativa, insect or mammalian cells. Our methodology might still be vital in organisms that have different codon usage patterns, such as the species of the genus Nicotiana, e.g., N. tabacum and N. benthamiana. Mutating donor and acceptor sites and their neighboring nucleotides together with replacing codons ending with A or T to G or C may prevent the generation of heterosplice variants.
Why should splice site mutation and codon optimization be employed at the same time rather than codon optimization alone? Although differences between GC-rich and GC-poor genes were not reported in moss, analyses of various plant species can explain why codon optimization might not be sufficient. The GC content of GC-poor CDS in A. thaliana, soybean, pea, tobacco, tomato, and potato ranges from 40-46% while GC content of introns is 26-31% 61 . A similar trend is visible in GC-rich genes: 46-49% in CDS and 29-31% in introns. On the other hand, in maize, the GC content is higher: 56% in GC-poor CDS and 67% in GC-rich CDS. The GC content of introns in maize is also elevated: 40% in GC-poor genes and 48% in GC-rich genes 61 . This shows that the intron/exon definition is not solely dependent on the GC content. Hence, increasing the GC content of cryptic introns to a certain level without mutating splice site motifs might not be sufficient to prevent aberrant mRNA processing.
In general, the analysis of spliced transcripts is beneficial for at least five reasons. (i) It facilitates the reliable quantification of gene expression by designing primers that differentiate between full-length transcripts and shorter variants. (ii) It enables the precise detection and quantification of heterologous proteins by antibodies that differentiate functional isoforms from nonfunctional ones. (iii) It helps to better purify proteins either by specific antibodies or by size exclusion chromatography. (iv) Optimized cDNAs may result in better detectable reporter proteins in vivo. (vi) Aberrant protein isoforms due to heterosplicing may show mislocalization in cells of the production host and thus may contribute to poorly modified glycoproteins. Prevention of heterosplice variants eventually will improve recombinant protein production and decrease downstream processing costs significantly. Hence, plant-based systems can become an alternative to traditional production platforms. Moreover, it may have implications for basic biology as well, because here the use of heterologous reporter constructs is of vital importance. For these purposes, we developed physCO, an automated P. patens-specific codon-optimization tool, which allows the optimization of transgene expression by the prevention of undesired mRNA splicing. Moreover, our analysis of codon usage patterns indicates that this tool can also be used for insect (S. frugiperda), HEK (H. sapiens), and CHO (C. griseus) cells as well as rice (O. sativa). Besides that, we cannot exclude the possibility, that the phenomenon of mRNA splicing, here described as heterosplicing, is a novel gene regulatory mechanism occurring in eukaryotes in general.

Materials and methods
Design of the expression vectors pFIX, aspFIX, and optiFIX. The coding DNA sequence for human FIX without the native signal peptide was synthesized as N-terminal translational fusion to the signal-peptide CDS of PpAP1 (Pp3c5_19520V3.1) 42 by GeneArt (Thermo Fisher Scientific, Waltham, MA, USA). It was cloned into the pAct5-MFHR1 plasmid 41 to generate the final FIX expression vector, pFIX. Its expression is driven by the promoter of the PpActin5 gene 62 and the CaMV 35S terminator.
Nucleotides involved in splicing were replaced without a change of the FIX aa sequence in the aspFIX plasmid which was synthesized by GeneArt. For codon optimization, two sources, codon usage bias calculated by Hiss et al. 49 and Kazusa codon usage database 50 (http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi? species=145481), were used. We decided to exchange the eight underrepresented codons for Cys, Glu, Phe, His, Lys, Asn, Gln, and Tyr. Underrepresented codons were replaced with the overrepresented alternative codons ( Table 2) via physCO. The optiFIX CDS was synthesized by GeneArt. See Supplementary Fig. 6 for a multiple sequence alignment 63 of FIX, aspFIX, and optiFIX (FIX, aspFIX, and optiFIX sequences are available in Supplementary Data 7.).
Design of the expression vectors pFH and optiFH. The previously generated FH expression plasmid pFH 40 contains the same promoter and terminator as pFIX. It was used in transient expression assays. The optiFH CDS was deduced via physCO and synthesized by GeneArt. It was cloned into the same vector as FH. See Supplementary Data 5 for a sequence alignment of FH and optiFH. The plasmids are available via the IMSC under the accession numbers P1837, and P1838, for pFH and optiFH, respectively.
Transformation and screening of transgenic plants. The Physcomitrella Δxt/ft line, a double knockout line in which the α1,3 fucosyltransferase and the β1,2 xylosyltransferase genes had been abolished 64 (IMSC accession number 40828), was cultured in liquid Knop medium (pH 4.5) one week prior to transformation as described previously 65 to obtain cells (protoplasts). Protoplast transformation was performed as described 66 using 50 µg of linearized pFIX DNA per transformation. In addition to the generation of transgenic lines, mossFIX was produced transiently by an upscaled transformation protocol based on the proportion of 0.7-1 µg DNA of pFIX for 1×10 5 cells. Stable transfectants were selected on solidified Knop medium containing 25 µg/mL hygromycin as described 65 . Cells transiently expressing mossFIX were grown in special regeneration medium (7. were chosen for closer analysis and stored in the International Moss Stock Center (IMSC Nos. 40874,40875,40876,40877). Transgene presence and completeness were verified with a standard PCR with the primers FIXfwdB (5′-GG AGTGTTCGATTGGCTTTCT-3′) and FIXrevB (5′-TGGTGGTGGTGGGTTAG TTT-3′) that amplifies almost the full-length FIX CDS. For transgene expression analysis via reverse transcription (RT)-PCR, 50-100 mg (fresh weight) moss material from protonema cultures were disrupted with the use of a stainless steel ball in TissueLyser II for 2 min with an impulse frequency set to 30. The material was resuspended in 1 mL TRIzol ® Reagent (Thermo Fisher Scientific), and RNA was isolated according to the manufacturer's instructions. For analysis of gametophores, 5-6 gametophores were collected and RNA isolation was performed as described above. Six micrograms of DNase-treated RNA were used for first-strand synthesis with Superscript Reverse Transcriptase III (Thermo Fisher Scientific). The quality of cDNA was controlled with a standard PCR with the primers c45for (5′-GGTTGGTCATGGGTTGCG-3′) and c45rev (5′-GAGGTCAACTGTCTC GCC-3′) corresponding to the gene coding for the ribosomal protein L21. Transgene expression was verified with the primers FIXfwdB and FIXrevB that amplifies almost the full-length FIX CDS. PCR reactions were performed with Phusion highfidelity DNA polymerase (Thermo Fisher Scientific). The PCR products were examined by standard agarose gel electrophoresis with visualization of DNA by ethidium bromide fluorescence.
Molecular characterization of transiently transfected cells. Following the transformation, cells were grown in regeneration medium for 2 weeks 68 . Transiently transfected cells (nearly 3.6 million cells for FIX and aspFIX; 7.2 million cells for optiFIX) were used for transgene expression analysis via RT-PCR. RNA was isolated using the TRIzol ® Reagent and RNA was purified through spin columns of Direct-zol™ RNA MicroPrep kit (Zymo Research, Irvine, California, USA) according to the manufacturer's instructions. After DNase-treatment and firststrand synthesis (for the cDNA synthesis: 0.4 µg RNA were used for FIX and aspFIX; 2 µg RNA were used for optiFIX), the quality of cDNA was controlled with primers c45for and c45rev. Transgene expression was verified again with primers FIXfwdB and FIXrevB.
Analysis of FIX cDNA products. Amplified products were excised from the gel and purified using QIAEX II Gel Extraction Kit (Qiagen). Sequencing was done by Eurofins Genomics (GATC Sanger sequencing service). In the sequencing reactions, in addition to FIXfwdB and FIXrevB, FIX6deepseq1F (5′-ATGGGGGCATCGAGG AGTGTTCGAT-3′), FIX6deepseq1R (5′-TTCCCCAGCCACTTACATAGCCAGA-3′), and FIXseq6 (5′-CACCAACAACCCGAGTGAAG-3′) were used. The remaining purified products were separately cloned into the pJET vector using the CloneJET PCR Cloning Kit (Thermo Fisher Scientific) according to the manufacturer's instructions. Clones with aberrant insert sizes were selected for plasmid isolation with the GeneJet Plasmid Miniprep kit (Thermo Fisher Scientific). Sequencing was done as described above.
Splicing motif search. The most recent release of the Physcomitrella v3.3 genome annotation 45 , which is available from https://phytozome.jgi.doe.gov/, was used. The splice sites of all 87,533 annotated transcripts from all 32,926 protein-coding genes were used as anchors to extract surrounding genomic sequences with a fixed window size of ±9 bp. Consensus sequence logos of all extracted donor and acceptor sites were created using R and the package ggseqlogo 69,70 .
Codon usage bias among different species. Synonymous codon usage biases in P. patens, S. frugiperda, H. sapiens, C. griseus, O. sativa, and N. tabacum were obtained from the Codon Usage Database (https://www.kazusa.or.jp/codon/) and biased codons in moss were compared with the codon usage pattern of other species.
Confocal microscopy imaging and image analysis. All images were taken with a Leica TCS SP8 microscope (Leica Microsystems, Wetzlar, Germany) using HCX PL APO 63x/1.40 oil objective with a zoom factor of 4.5. For the excitation of Citrine, an WL laser was applied at 2% with an excitation wavelength of 514 nm. The voxel sizes were 0.080 μm on the X-Y dimensions and 0.300 μm on the Z dimension. The pinhole was adjusted to 1 Airy Unit (95.5 μm).
The image processing consists of the following steps: (i) denoising, (ii) edge enhancement, (iii) Richardson-Lucy restoration, (iv) local intensity equalization, (v) segmentation. In the first step, a median filter with a window of (3,3,3) was applied to the images in order to remove the salt-and-pepper noise. In the second step, the denoised images were subjected to an unsharp-mask operation 71,72 , which involves blurring the image with a Gaussian filter, subtracting the blurred image from the input image, and adding the resulting difference back to the input image, a process that sharpens the edges in the images. The edge-enhanced images were then subjected to the Richardson-Lucy restoration algorithm, as implemented in the scikit-image package 73 for three iterations, assuming an averaging filter in a window of (3,5,5) as point spread function (here the aim was to smooth the image without losing the thin structures, rather than a true deblurring). The code for the Richardson-Lucy algorithm was adapted from scikit-image package 73 using Python programming language. Subsequently, a spatial intensity equalization step was implemented on the images using a home-built algorithm. In short, each image was divided into boxes with dimensions of (7,7,7) voxels. The voxel values within these boxes were rescaled by multiplying the values in each box with weights that are calculated based on the skewness value corresponding to the same box. This step corrected for the intensity gradients in the image in order to minimize the loss of some low-intensity foreground voxels. In the last step, the images were segmented by applying an adaptive thresholding method, where the thresholds were determined locally by applying Otsu's method to local windows of (7,19,19) voxels throughout the image volume. The code for the adaptive Otsu thresholding algorithm was developed based on the Otsu threshold function of the scikit-image package 73 and the Numba package 74 . To suppress oversegmentation of noisy background areas, a low global threshold was also applied by first calculating Otsu's threshold for the whole image and multiplying it with 0.66. The adaptive thresholding operation yielded the binary masks that specified the foreground voxels, which in turn were used to calculate the mean voxel intensity for each original image. Three replicates were used for analysis of each of the FIX-Citrine and optiFIX-Citrine constructs. The same is true for FH-Citrine and optiFH-Citrine but the quantification was not possible due to undetectable/very low signals in FH-Citrine images. All six FIX images were processed and quantified in a single run of a Python script by using the exact same parameters to avoid any possible bias. Other packages used to develop the code are NumPy 73,75 , Pandas 76 and SciPy 77 . The code used for image processing and quantification is deposited on from www.plant-biotech.uni-freiburg.de. The procedure is shown in Supplementary Fig. 7.
Recombinant protein extraction and detection. Proteins from the culture supernatant were precipitated as described 78 . After air drying, the precipitate was resuspended in 50 mM HEPES, 2% SDS (pH 7.5) and disulfide bonds within proteins were reduced with 25 mM dithiothreitol (DTT), followed by alkylation of the thiol groups with 60 mM iodoacetamide. For Western blot analyses, 7.5% SDS polyacrylamide gel electrophoresis (SDS-PAGE Ready Gel ® Tris-HCl Precast Gels, BioRad, CA, USA) was run at 100 V for 1 h 30 min and blotted to polyvinylidene fluoride (PVDF) membranes (Amersham Hybond P 0.45 µm pore size PVDF blotting membrane, GE Healthcare) in a Trans-Blot ® SD Semi-Dry Transfer Cell (BioRad) for 2 h with 1.8 mA/cm 2 membrane. The membrane was blocked for 1 h at room temperature in TBS containing 4% Amersham ECL Prime Blocking Reagent (GE Healthcare) and 0.1% Tween 20. It was then incubated with polyclonal anti-FIX antibody (F0652, SIGMA) (1:3000) overnight at 4°C. Afterwards, the membrane was washed three times with TBS containing 0.1% Tween 20 and incubated with horseradish peroxidase-linked anti-rabbit IgG (NA934, GE Healthcare) at a dilution of 1:10,000 for 1 h. The blot was washed again and proteins were detected using the Amersham ECL Prime Western Blotting Detection Reagent (GE Healthcare) according to the manufacturer's instructions.
Sample preparation and MS analysis. The reduced and alkylated protein extracts were subjected to SDS-PAGE. Appropriate gel bands showing FIX fractions were excised and processed for in-gel digestion: Following the destaining of the gel bands with 30% acetonitrile (ACN), they were treated with 100% ACN and dried completely in a vacuum concentrator. Tryptic in-gel digestion of gel slices was performed overnight at 37°C in 50 mM ammonium bicarbonate using 0.1 µg trypsin (Promega) per gel band. Afterwards, peptides were extracted from the gel with 5% formic acid (FA). Analyses were performed using the UltiMate 3000 RSLCnano system (Dionex LC Packings/Thermo Fisher Scientific) coupled online to a QExactive Plus instrument (Thermo Fisher Scientific). The UHPLC systems were equipped with a C18-precolumn (Ø: 0.3 mm, 5 mm; PepMap, Thermo Fisher Scientific) and an Acclaim ® PepMap analytical column (ID: 75 m, 500 mm, 2 m, 100 Å, Dionex LC Packings/Thermo Fisher Scientific).
MS analyses were performed using a binary solvent system consisting of 0.1% FA (solvent A, "A") and 0.1% FA/86% ACN (solvent B, "B"). Samples were washed and concentrated on a C18 precolumn with 0.1% TFA for 5 min before switching the column in line with the analytical column. Peptide separation was performed applying a 45-min gradient at a flow rate of 250 nL/min. Peptide samples were eluted with a gradient of 4-40% B in 30 min and 40-95% B in 5 min. After each gradient, the analytical column was washed with 95% B for 5 min and reequilibrated for 15 min with 4% B. The MS instrument was externally calibrated using standard compounds and equipped with a nanoelectrospray ion source and distal coated SilicaTips (FS360-20-10-D, New Objective, Woburn, MA, USA), and MS/MS analyses were performed on multiply charged peptide ions. The measurement was performed using UltiMate 3000 RSLCnano system (Dionex LC Packings/Thermo Fisher Scientific, Dreieich, Germany) coupled online to a QExactive Plus mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) and mass spectrometer was operated in the data-dependent mode to automatically switch between MS (max. of 1 × 10 ions) and MS/MS. Each MS scan was followed by a maximum of 12 MS/MS scans using HCD with a normalized collision energy of 35%. The mass range for MS was m/z = 375-1,700 and resolution was set to 70,000. MS parameters were as follows: spray voltage 1.5 kV; ion transfer tube temperature 200°C. Raw data were analyzed using Mascot Distiller V2.5.1.0 (Matrix Science, USA) and the peak lists were searched with Mascot V2.6.0 against an in-house database containing all Physcomitrella V3.3 protein models 79 as well as the human FIX sequence (complete and alternative ones based on splicing). The peptide mass tolerance was set to 8 ppm and the fragment mass tolerance was set to 0.02 Da. Carbamidomethylation (C, +57.021464 Da) was specified as fixed modification. Variable modifications were Gln->pyro-Glu (-17.026549 Da), oxidation/hydroxylation (M, P, +15.994915 Da) and deamidation (N, +0.984016 Da). A total of two missed cleavages was allowed as well as semitryptic peptides. Search results were loaded into Scaffold4 software (Version 4.11.1, Proteome Software Inc., Portland, OR).
Statistics and reproducibility. The sample size was decided based on prior experience. Experiments were not blinded and information regarding the number of replicates/samples and statistical analyses are described in the corresponding figure and table legends.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data generated or analyzed during this study are included in this article and its supplementary information. Microscopy images can be found in Supplementary Data 1 and 6. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE 80 partner repository with the dataset identifier PXD026651 and 10.6019/PXD026651.