Pre-pro is a fast pre-processor for single-particle cryo-EM by enhancing 2D classification

2D classification plays a pivotal role in analyzing single-particle cryo-electron microscopy images. Here, we introduce a simple and lossless pre-processor that incorporates a fast dimension-reduction de-noiser (2SDR) to enhance 2D classification. We apply this 2SDR pre-processor ahead of representative classification algorithms such as RELION and ISAC and compare performance with and without it. Tests on multiple experimental cryo-EM datasets show that the pre-processor makes classification faster, improves the yield of good particles, and increases the number of class-average images available for generating better initial models. On the highly heterogeneous nanodisc-embedded TRPV1 dataset, a 3D reconstruction workflow that starts from an initial model built from class-average images shows that the pre-processor improves the final resolution to 2.82 Å, close to 0.9 Nyquist. These findings and analyses suggest that the 2SDR pre-processor, at minimal cost, is widely applicable for boosting 2D classification; we envision its generalization to accommodate neural-network de-noisers.

In this section, we evaluated the effect of de-noising on aligning and grouping similar particle images.
There was no iteration involved in this experiment. To perform the test, we chose a widely used dataset of the E. coli 70S ribosome [1][2][3][4][5][6][7], downloaded from the RELION classification benchmark dataset as described in the Methods. This dataset contains a total of 10,000 particles in a box of 130 × 130 pixels. In this study, we use only the first 5,000 particles, which have a translation factor (EF-G) bound to the ribosome. For simplicity, we term this set "70S ribosome" throughout this article.
We first generated a de-noised set of the 5,000 70S ribosome particles using the 2SDR de-noising method. Some of the de-noised particles are displayed in the workflow illustration (Figure 1). For the control experiment, we randomly picked five particles as references (column (a) of Supplementary Figure 1). For each reference, we searched the set for the most similar particles, using a MATLAB program re-coded from FRM2D [8], a fast alignment algorithm, to perform the alignment during the search. For each candidate particle, we recorded the best alignment parameters: the rotation angle and the translation shifts in the x- and y-directions. For each reference, we selected the 20 most similar particles, applied the recorded alignment parameters to each of them, and computed their average. The results are displayed in column (b) of Supplementary Figure 1. We then repeated this procedure with the de-noised set. In brief, the de-noised particles replace the original particles during the search, but the resulting alignment parameters are applied to the original particles when calculating the aligned average, as shown in column (c) in Supplementary

grouping alike particles
As the matching set of the 20 most similar particles found with de-noising does not overlap well with the set found without de-noising, we carried out a further study using simulated data to measure the occurrence of true positives. The simulated noisy images of the 70S ribosome were prepared as follows.
First, the 3D structure was calculated from the 5,000 experimental images using cryoSPARC. A total of 50 distinct 2D images of 130 × 130 pixels were then generated by projecting the 3D structure of the 70S ribosome in equally spaced (angle-wise) orientations. Secondly, 5,000 images were generated from these 50 projections by making 100 copies of each projection; copy i (i = 1, ..., 100) was rotated by 3.6 × i degrees. Thirdly, each image was convolved with an electron microscopy contrast transfer function (CTF) randomly sampled from a set of 50 CTF values. Finally, i.i.d. Gaussian noise N(0, σ²) was added such that the signal-to-noise ratio (SNR) equals 0.01. To measure how often truly alike particles are grouped together, we repeated the above-mentioned procedure 10 times. For this simulated dataset, the average number of correct particles among the top 100 was 3.5 without and 11.2 with 2SDR de-noising, corresponding to true-positive frequencies of 3.5% and 11.2%, respectively (Supplementary Figure 2). With the SNR increased to 0.05, the corresponding frequencies increase to 61.5% and 94.4% (Supplementary Figure 3). With the defocus lowered to the range of 1.0 to 1.5 μm, these figures drop to 51.3% and 93.7% (Supplementary Figure 4).
These results show that 2SDR de-noising helps group truly alike particles, especially when the SNR or the defocus is lower.
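The noise-addition and scoring steps of the simulation above can be sketched in Python as follows. This is a minimal illustration, not the code used in this study: it assumes NumPy, assumes that the SNR is defined as signal variance divided by noise variance, and uses hypothetical helper names (`rotation_angles`, `add_gaussian_noise`, `true_positive_frequency`). Projection, image rotation, and CTF convolution are omitted, and "correct" is taken to mean "generated from the same projection as the reference".

```python
import numpy as np


def rotation_angles(n_copies=100, step_deg=3.6):
    """Rotation angles 3.6 * i degrees for i = 1..n_copies."""
    return [step_deg * i for i in range(1, n_copies + 1)]


def add_gaussian_noise(image, snr, rng):
    """Add i.i.d. N(0, sigma^2) noise so that var(image) / sigma^2 == snr.

    Assumption: SNR means signal variance over noise variance.
    """
    sigma = np.sqrt(image.var() / snr)
    return image + rng.normal(0.0, sigma, size=image.shape)


def true_positive_frequency(scores, labels, ref_label, k=100):
    """Fraction of the k best-scoring images that share the reference's label."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(labels)[top_k] == ref_label))


rng = np.random.default_rng(0)
clean = rng.random((130, 130))  # stand-in for a CTF-convolved projection
noisy = add_gaussian_noise(clean, snr=0.01, rng=rng)
empirical_snr = clean.var() / (noisy - clean).var()

# Toy ranking: four images, two of which come from the reference's
# projection (label 0); the two best scores both belong to label 0.
toy_frequency = true_positive_frequency(
    scores=[0.9, 0.1, 0.8, 0.7], labels=[0, 1, 0, 1], ref_label=0, k=2
)
```

With 100 copies per projection and k = 100, a frequency of 0.035 corresponds to the 3.5 correct particles reported above.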

A.2 Supplementary Materials for RELION Analysis
In this section, we document the statistics output by the classification programs, which we use to evaluate performance objectively. As these statistics are associated with each class, we use histograms to plot the number of classes against each statistic.
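As a small illustration of such a histogram, the following Python sketch counts classes per bin of a per-class statistic. The function name `class_histogram`, the bin width, and the example values are hypothetical and are not taken from the programs' output.

```python
def class_histogram(values, bin_width):
    """Count classes per bin; bin b covers [b * bin_width, (b + 1) * bin_width)."""
    counts = {}
    for value in values:
        bin_index = int(value // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical per-class values of one statistic (one number per class).
per_class = [0.4, 1.2, 1.9, 2.5, 0.7, 1.1]
hist = class_histogram(per_class, bin_width=1.0)
```

Comparing two such histograms, one from the plain classifier and one from its pre-processed counterpart, shows whether the classes shift toward the better end of the statistic.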
To compare RELION with P-RELION, we use two statistics that represent the alignment errors [9]: "rlnAccuracyRotations" and "rlnAccuracyTranslations". In brief, in each iteration the program selects a random subset of images from each class and assumes that the orientations assigned to these images in the previous iteration are correct. The program then modifies the orientation of each image in small steps until the ratio of the posterior probability of the modified orientation to that of the true orientation falls below 0.01. The averages of the corresponding rotation and translation steps are recorded in "rlnAccuracyRotations" and "rlnAccuracyTranslations", respectively. These step values are expected to be small if the orientations can be distinguished reliably, so they can be regarded as the alignment errors of RELION: the lower, the better. We use these values as the guideline for selecting good class averages in this work.
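The idea behind these accuracy estimates can be illustrated with a one-dimensional toy example. The sketch below is not RELION's implementation: it assumes white Gaussian noise with standard deviation `sigma`, a noise-free observation, and translation in whole-sample steps on a ring. The function name `steps_until_distinguishable` and the test signal are hypothetical.

```python
import math


def steps_until_distinguishable(reference, sigma, threshold=0.01):
    """Smallest circular shift whose probability ratio drops below threshold.

    Assumption: white Gaussian noise with standard deviation sigma and a
    noise-free observation equal to `reference`, so the ratio of the shifted
    to the true hypothesis is exp(-||shifted - reference||^2 / (2 sigma^2)).
    """
    n = len(reference)
    for shift in range(1, n):
        shifted = reference[-shift:] + reference[:-shift]
        sq_dist = sum((a - b) ** 2 for a, b in zip(reference, shifted))
        if math.exp(-sq_dist / (2.0 * sigma ** 2)) < threshold:
            return shift
    return None  # no shift is distinguishable at this noise level


# Hypothetical 1D "class average": a Gaussian bump on a ring of 64 samples.
bump = [math.exp(-((i - 32) / 4.0) ** 2) for i in range(64)]
low_noise = steps_until_distinguishable(bump, sigma=0.1)
high_noise = steps_until_distinguishable(bump, sigma=0.5)
```

In this toy setting, the noisier case needs a larger shift before the two hypotheses can be told apart, which mirrors why a smaller accuracy value indicates a more reliable alignment.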

A.3 Supplementary Materials for ISAC Analysis
For ISAC, there are also two statistics associated with each class, as defined in [10]: the pixel error and the class size. Both can be used to compare ISAC with P-ISAC. First, the pixel error is a statistic computed from the alignment parameters obtained in independent rounds of within-class alignment, the so-called "stability test"; it decreases as the class becomes more homogeneous. Supplementary Figure 8 shows that with P-ISAC the pixel error is lower and the number of classes with small pixel error increases. Secondly, the class size indicates how many images pass the stability test; it likewise reflects the performance of the clustering step and is larger when a class is more homogeneous. Supplementary Figure 9 shows that P-ISAC yields fewer small classes than ISAC.
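To give a feel for how alignment parameters from two independent runs can be turned into an error in pixels, the following Python sketch computes the root-mean-square displacement of points on a circle of a given radius. This is a simplified stand-in and not necessarily the formula defined in [10]; the function name `displacement_error` and the parameter layout `(angle_deg, shift_x, shift_y)` are assumptions made for illustration.

```python
import math


def displacement_error(params_a, params_b, radius):
    """RMS displacement (in pixels) of points on a circle of the given radius.

    params_* = (angle_deg, shift_x, shift_y) from two independent alignments.
    A rotation difference moves every point on the circle by the chord length
    2 * radius * sin(d_theta / 2); averaged over the circle, the cross term
    with the shift difference vanishes, leaving a root-sum-square.
    """
    d_theta = math.radians(params_a[0] - params_b[0])
    chord = 2.0 * radius * math.sin(d_theta / 2.0)
    dx = params_a[1] - params_b[1]
    dy = params_a[2] - params_b[2]
    return math.sqrt(chord ** 2 + dx ** 2 + dy ** 2)


# Two runs that agree exactly give zero error; a pure (3, 4) pixel shift
# difference gives an error of 5 pixels.
same = displacement_error((10.0, 1.0, 2.0), (10.0, 1.0, 2.0), radius=50.0)
shifted = displacement_error((0.0, 3.0, 4.0), (0.0, 0.0, 0.0), radius=50.0)
```

Under this reading, an image whose parameters agree across runs has a small error, and a class made of such images would count as stable.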

A.4 Supplementary Materials for CL2D Analysis
We further demonstrate our method on a popular classification algorithm called CL2D. CL2D [2] belongs to the category of classification algorithms based on K-means. It uses divisive hierarchical clustering and a probabilistic similarity metric based on correntropy to reduce the influence of noise and to avoid unbalanced classes. When pre-processing is added to CL2D, we abbreviate the procedure as P-CL2D, in the same way as in the main article.
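For readers unfamiliar with correntropy, the sketch below computes the generic Gaussian-kernel sample correntropy between two images given as flat lists of pixel values. It is an illustration of the general measure, not necessarily the exact variant used inside CL2D; the function name `correntropy` and the kernel width `sigma` are assumptions.

```python
import math


def correntropy(x, y, sigma):
    """Mean Gaussian-kernel similarity of paired pixel values (1.0 if x == y)."""
    if len(x) != len(y):
        raise ValueError("images must have the same number of pixels")
    total = sum(
        math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for a, b in zip(x, y)
    )
    return total / len(x)


# One extreme outlier pixel barely changes the correntropy, whereas it
# dominates the mean squared error (100.0 here).
image = [0.0] * 100
with_outlier = [100.0] + [0.0] * 99
similarity = correntropy(image, with_outlier, sigma=1.0)
mse = sum((a - b) ** 2 for a, b in zip(image, with_outlier)) / len(image)
```

Because each pixel contributes at most 1/N to the score, a few heavily corrupted pixels cannot dominate the comparison, which is the sense in which such a metric reduces the influence of noise.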

B Supplementary Figures
Supplementary Figure 1: Pilot tests of the effect of 2SDR on grouping and aligning 70S ribosome particles. Column (a) shows five cryo-EM images of the 70S ribosome randomly picked from the 5,000 cryo-EM images and used as references; note that they are not centered but are masked by a circle for the purpose of rotation in the alignment process. Column (b) shows the aligned average of the 20 particles that most resemble their reference in (a). Column (c) shows the aligned average of the 20 particles whose de-noised images most resemble their de-noised reference (the de-noised images are not shown). Column (d) shows the projections of the structure of the 70S ribosome that correspond to the views in Column (a).
Supplementary Figure 2: Pilot test with simulated data of the 70S ribosome, where SNR = 0.01 and the defocus is randomly sampled between 1.5 μm and 2.0 μm. Columns (1) to (5) run from left to right. Column (1) shows five simulated cryo-EM images used as the references, which are produced from the projection images in Column (5) by applying the CTF and adding noise. Column (2) shows the average of the 100 particles that most resemble the reference in (1). Column (3) is the same as (2), but the average is taken over the 100 most similar particles found after 2SDR de-noising is applied. Column (4) shows the noise-free versions of the images in Column (1) as the ground truth, which are produced from the projection images in Column (5) by applying the CTF. Column (5) shows five projection images of the 70S ribosome.
Supplementary Figure 3: Pilot test with simulated data of the 70S ribosome, where SNR = 0.05 and the defocus is randomly sampled between 1.5 μm and 2.0 μm. Columns (1) to (5) run from left to right. Column (1) shows five simulated cryo-EM images used as the references, which are produced from the projection images in Column (5) by applying the CTF and adding noise. Column (2) shows the average of the 100 particles that most resemble the reference in (1). Column (3) is the same as (2), but the average is taken over the 100 most similar particles found after 2SDR de-noising is applied. Column (4) shows the noise-free versions of the images in Column (1) as the ground truth, which are produced from the projection images in Column (5) by applying the CTF. Column (5) shows five projection images of the 70S ribosome.
Supplementary Figure 4: Pilot test with simulated data of the 70S ribosome, where SNR = 0.05 and the defocus is randomly sampled between 1.0 μm and 1.5 μm. Columns (1) to (5) run from left to right. Column (1) shows five simulated cryo-EM images used as the references, which are produced from the projection images in Column (5) by applying the CTF and adding noise. Column (2) shows the average of the 100 particles that most resemble the reference in (1). Column (3) is the same as (2), but the average is taken over the 100 most similar particles found after 2SDR de-noising is applied. Column (4) shows the noise-free versions of the images in Column (1) as the ground truth, which are produced from the projection images in Column (5) by applying the CTF. Column (5) shows five projection images of the 70S ribosome.

C Supplementary Tables
Supplementary