Recurrent RNA edits in human preimplantation potentially enhance maternal mRNA clearance

Posttranscriptional modification plays an important role in key embryonic processes. Adenosine-to-inosine RNA editing, a common example of such modifications, is widespread in human adult tissues and has various functional impacts and clinical consequences. However, whether it persists in a consistent pattern in most human embryos, and whether it supports embryonic development, are poorly understood. To address this problem, we compiled the largest human embryonic editome from 2,071 transcriptomes and identified thousands of recurrent embryonic edits (>=50% chances of occurring in a given stage) for each early developmental stage. We found that these recurrent edits prefer exons consistently across stages, tend to target genes related to DNA replication, and undergo organized loss in abnormal embryos and embryos from elder mothers. In particular, these recurrent edits are likely to enhance maternal mRNA clearance, a possible mechanism of which could be introducing more microRNA binding sites to the 3’-untranslated regions of clearance targets. This study suggests a potentially important, if not indispensable, role of RNA editing in key human embryonic processes such as maternal mRNA clearance; the identified editome can aid further investigations.

plicates. Finally, we mapped the coordinates of these alignments back onto the hg38 assembly, and kept only alignments with mapping quality scores >= 20 on chromosomes 1-22, X, or Y.

Variant calling with GATK
We used Genome Analysis Toolkit (GATK) version 3.6.0 6 to realign the reads at indels with IndelRealigner, and then recalibrated base quality scores using BaseRecalibrator with the dbSNP 7 vcf file (version 151) downloaded from https://ftp.ncbi.

Filter against unreliable variants
Steps (1-9) for first-round per-sample filtering (with relevant Perl script in the GitHub repository marked when available): 1. (https://github.com/gao-lab/HERE/blob/master/scripts/S15_1__get_ sample_RNA_editing_sites_v3/step07__apply_complex_filter/complex_ filter_1/Convert_VCF.pl) Starting from the GATK-called VCF file, we filtered variant sites ('sites' for short) for those on autosomes, chromosome X, or chromosome Y with the FILTER being "PASS" and at least one read covered.
3. (https://github.com/gao-lab/HERE/blob/master/scripts/S15_1__get_ sample_RNA_editing_sites_v3/step07__apply_complex_filter/complex_ filter_1/threads_rmMismatch.pl and https://github.com/gao-lab/ HERE/blob/master/scripts/S15_1__get_sample_RNA_editing_sites_ v3/step07__apply_complex_filter/complex_filter_1/Remove_mismatch_ first6bp.pl) For each site, we re-counted the number of mismatch reads on this sites based on all realigned reads (as produced by GATK IndelRealigner) whose mismatched base at this site was located outside the first 5-prime 6 bp of the read and had a base quality (as recorded in the realigned bam file) no less than 58. If no reads with mismatched bases were available for a given site after re-counting, this site was discarded. In addition, we discarded sites with two or more different types of mismatch bases (i.e., sites that are potential polymorphisms).
8. (https://github.com/gao-lab/HERE/blob/master/scripts/S15_1__get_ sample_RNA_editing_sites_v3/step07__apply_complex_filter/complex_ filter_1/nonAlu_filter_new.pl) We picked the output of Step (7), and used UCSC Genome Browser RepeatMasker track to separate these sites into those that were on non-Alu repeat elements, and those that were not on repeat elements. No sites were discarded in this step. 9. We used "bcftools concat -allow-overlaps" to combine the sites called on Alu elements from Step (4), sites on non-Alu repeat elements from Step (8), and sites not on repeat elements from Step (8). No sites were discarded in this step.
Step (10) We used "bcftools isec -collapse all" to identify all those genomic variants from the following worldwide population studies that overlapped with any of the sites discovered in any samples in Steps (1)(2)(3)(4)(5)(6)(7)(8)(9) We then used "bcftools isec -collapse all" to identify all those variant sites discovered in any samples in Steps (1-9) that overlapped with any of these genomic variants, and discarded these variant sites.
Step (11) Starting with the variant sites kept in Step (10), we further kept those sites that met either of the following: (a) this site was located in an Alu element, and had >=10 reads covered, and had a >=10% mismatch frequency; (b) this site was not located in an Alu element, and had >=10 reads covered, and had a >=10% mismatch frequency, and had >=3 reads supporting the mismatch.
Step (12) Starting with the variant sites kept in Step (11), we further kept those sites that met either of the following: (a) this site was located in an Alu element; (b) this site was not located in an Alu element, and was observed in at least two normal samples (or two abnormal samples) of the same stage (see Supplementary Data 2 for the details of each stage).
Step (13) Starting with the variant sites kept in Step (11) and their SnpEff annotation (run with parameters "snpEff ann -lof" and the GENCODE v32 annotation), we kept those variants annotated with either the A>G (i.e., A-to-G) SnpEff event only, or both the A>G and the T>C (i.e., T-to-C) SnpEff event. The former can be regarded as stranddefinite, while the latter can be regarded as strand-ambiguous. 7 We deliberately kept the latter type of variant, which appeared in at least two transcripts of different orientations (i.e., a variant reported by SnpEff to induce an A-to-G shift in one transcript and a thymine-to-cytosine shift in another), because a previously well-studied A-to-I RNA editing event found to recode tyrosine to cysteine in the BLCAP (bladder cancer-associated protein) gene 12 , persist in human embryonic stem cells and multiple human tissues 13 , and promote cell proliferation 14 is this type of variants.

Marking of unsequenced editing sites
Due to the inherent technical noise of RNA-seq that occurs with low cell content, including a high drop-out rate 15 , the labeling of edits in unsequenced regions as nonexistent would be obviously biased. We thus marked all editing sites with zero read coverage in each sample as "unsequenced" rather than "nonexistent" or "undetected"; sites with at least one read covered were not altered in this step. RNA editing events. Thus, the direct application of preexisting RNA editing identification pipelines, which are generally more stringent than genomic variant calling approaches 1 to these datasets, should yield a ratio of genomic variant-overlapping edits comparable to those obtained with bulk datasets.
Nevertheless, we validated our adapted pipeline using paired single-cell DNA-/RNA-seq datasets for the A375 cell line 17 (we did not use the dataset 18 used by Ding et al. 16 because it is not publicly available). For each A375 cell with both DNA and RNA sequenced, we downloaded the raw reads and applied our pipeline with the following modifications: • We used Zachariadis et al.'s read preprocessing strategy 17 (https://github. com/EngeLab/DNTRseq).
• Whereas we applied all filters to RNA-seq data to obtain identified editing events, for DNA-Seq we stopped at the raw variant calling results generated by GATK and treated them as the ground truth for genomic variants.
• Due to the low sequencing depth of these samples, we adjusted the read coverage filter. Specifically, we filtered for Alu edits with at least two reads covered and an editing level of at least 0.1, and for non-Alu edits additionally with at least two reads with mismatches.
Comparison of the edits in each cell with its genomic variants revealed that all A375 cells with at least one edit identified had a zero ratio of genomic variant-overlapping 9 edits after, but not before, our filtering (Fig. 1d). This result supported the validity of our pipeline.

Supplementary Note 4: A preliminary case study on MBSgaining REEs on a given gene
To investigate potential key targets of MBS-gaining REEs, we surveyed genes that have been shown to undergo maternal mRNA clearance, and examined whether they were targeted by any MBS-gaining REEs. We found that the SUV39H2 gene, a known target of maternal mRNA clearance during human embryogenesis 20  (f) Cytoplasmic exosome cofactor: SKIV2L (RNA helicase), TTC37 (whose yeast ortholog is Ski3), WDR61 (whose yeast ortholog is Ski8), HBS1L (isoform 3; whose yeast ortholog is Ski7) We found that multiple REEs target EXOSC6 in oocyte (GV), oocyte (MII), and zygote, and a single REE targets CNOT6 in 2-cell (Supplementary Figure 34). While the function of EXOSC6 has been rarely studied, the mouse ortholog of EXOSC10 in the RNA exosome degradation pathway has been found to be required for the growthto-maturation transition in oocytes 24 and eight-cell embryo/morula transition 25 ; therefore, it is also possible that A-to-I editing on EXOSC6 is a preferred regulation on the RNA exosome degradation pathway by oocytes (GV), oocytes (MII), and/or zygotes.
On the other hand, the mouse ortholog of CNOT6 gene has been found to regulate deadenylation of mRNAs in mouse oocyte growth and maturation 26 , and the mouse CCR4-NOT complex itself is also found to be involved in regulating 2-cell-specific genes 27 , suggesting the possibility that A-to-I editing on CNOT6 might be preferred in the 2-cell stage as well.
Because the CCR4-NOT complex is known to target 3'-UTR of mRNAs with certain sequence motifs 28 , we also examined whether REEs could lead to gain or loss of such motifs. Specifically, we examined the Pumilio-response element (PRE) UGUA-NAUW (where N is any nucleotide and W is either A or U) 29 , and the AU-rich element (ARE) UUAUUUAUU 30 . In addition, we also examined the hnRNP A1 and A2/B1 binding site UAASUUAU (where S is either C or G) as discovered in 31 . We found that REEs in normal early stages did alter 6 motifs for 5 genes (Supplementary However, the sample size from which the conclusion is derived is too limited (no more 16 than 5 for each type of oocytes), and a further examination with a large sample size of each type of oocytes is needed to confirm whether this is a truly biological signal. from young to older mothers, but this difference was not statistically significant (Supplementary Figure 39). Further examination of these samples revealed that one from and older mother (GSM2514781) had a poor maturation rate relative to the moderate rates observed in other oocytes from older mothers (Supplementary Table 1 in 32 ). We clustered all of the samples using a Manhattan distance based on all REE-matching edits of maternal genes only, and found that this sample appeared more like an oocyte from a young mother (Supplementary Figure 40). Thus, we excluded GSM2514781 (and GSM2514773, an oocyte from a young mother that clustered more closely with oocytes from older mothers) from the set of oocyte (GV) samples included in the analysis whose results are illustrated in Supplementary Figure 38b Alu and non-Alu sites separately. The proportion is defined as "the union of stranddefinite A-to-G variants and strand-ambiguous A-to-G/T-to-C variants" (see Step (13) of Supplementary Note 1) to all variants. Symbols in boxplots follow the definition by "geom_boxplot" of the R package "ggplot2" 33 Figure 10) displayed a statistical significance in the deviation of 3'-UTR-or-intron edit ratio of the overlap edits from that of the unique edits, as shown here, such deviation might partially arise from the fact that they were also identified with a higher sequence coverage (see Supplementary Figure 12). In addition, the drop of statistical significance in 8-cells and morula stages might be a consequence of multiple factors, including but not limited to the loss of a stable editing pattern indicated by REE upon entry into the 8-cell stage (Fig. 2b), the end of maternal mRNA clearance 34 , and the wave of zygotic genomic activation 35 in the 8-cell stage. Only stages with both types of samples were shown. All p-values were from the chi-square test (with the null hypothe-sis being that whether an edit is a 3'-UTR-or-intron edit is independent of whether this edit is shared by both or normal-/abnormal-only) and Benjamini-Hochberg-adjusted.
See Supplementary Data 28 (the source data behind Supplementary Figures 10 and 11) for the number of edits in each group in each chi-square test.  Figure 20, the starting stage is the stage whose observation of REE on each gene was used to split the genes into "Has REE in the starting stage" and "Does not have REE in the starting stage", and the "subsequent stage" is the stage that is after the starting stage along the development timeline.
In addition, for each subplot only genes with FPKM > 0.1 at the starting stage was considered. The number shown below each boxplot is the median of all FPKM medians (without the transformation of log 10 (FPKM median + 1 × 10 −4 )) for that boxplot.
Here, the starting stage is oocyte (GV). Symbols in boxplots follow the definition by "geom_boxplot" of the R package "ggplot2" 33  For each stage pair, only genes with median FPKM > 0.001 in the current stage were plotted. The p-value shown in each scatterplot is the Benjamini-Hochberg-adjusted (BH-adjusted) p-value for two-sided correlation test (computed by R's cor.test(method = "spearman")), with the null hypothesis being that the correlation is zero. The 95% confidence interval (CI) for this test was computed using the ci.spear function of the R package statpsych. The trend of association is visualized by linear regression (with orange line denoting the fitted line and grey shadow denoting its 95% confidence interval of the slope of the regression, where the null hypothesis for the linear regression test is that the slope is zero).
hibited a more positive net change in MBS counts due to REEs than did baseline but not did other maternal genes. All p-values were unadjusted and were derived from one-tailed, unpaired Wilcoxon's Rank Sum test, with the alternative hypothesis being that the median value for the left ("decay at 8-cell"; see the subsection "Annotation of maternal genes and targets of maternal mRNA clearance" in Methods for more details about "decay at 8-cell" and "others") group would be greater than that for the right (others/baseline) group. The baseline group is just a value of 0; we plotted it here for the sake of visual clarity. The unpaired test between "decay at 8-cell" and "others" involves 17,060 and 17,755 pairs of (gene, sample) for "decay at 8-cell" and "others", respectively, and its estimated (pseudo)median and 95 percent confidence interval re- Sum tests, with the alternative hypothesis being that the median value for the left (old mother/AG/PG) group would be less than that for the right (young mother/control/BI (biparental embryos)) group. See Supplementary Data 29 for description of sample size, difference in location (the effect size for Wilcoxon's Rank Sum test), and its 95% confidence interval for each of these tests. Symbols in boxplots follow the definition by "geom_boxplot" of the R package "ggplot2" 33  location (the effect size for Wilcoxon's Rank Sum test), and its 95% confidence interval for each of these tests. Symbols in boxplots follow the definition by "geom_boxplot" of the R package "ggplot2" 33 : the inner thick line indicates the median; the lower and upper boundaries (or hinges) of the box indicate the first and third quartiles (i.e., 25% and 75% quantiles), respectively; the upper whisker extends from the hinge to the largest value no further than 1.5 * inter-quartile range (the third quartile minus the first quartile), the lower whisker extends from the hinge to the smallest value at most 1.5 * inter-quartile range of the hinge, and data beyond the end of the whiskers are the outlier points, which are plotted individually.
illustrated in Supplementary Figure 37, but conducted with other datasets containing data on abnormal early embryos (the GSE133854 dataset). All p-values were unadjusted and based on one-tailed, unpaired Wilcoxon's Rank Sum test, with the alternative hypothesis being that the left (abnormal) group is greater than the right (control) group. AG, androgenetic; PG, parthenogenetic; BI, biparental (i.e., normal).
See Supplementary Data 32 for description of sample size, difference in location (the effect size for Wilcoxon's Rank Sum test), and its 95% confidence interval for each of these tests (except for the tests in the "decay at 8-cell" vs. "8-cell" group, where all points are of the same value (possibly partly due to the depletion of such genes in the 8-cell stage) and the difference in location and the confidence interval cannot be estimated). Symbols in boxplots follow the definition by "geom_boxplot" of the R package "ggplot2" 33