SPHIRE-crYOLO: A fast and well-centering automated particle picker for cryo-EM

Selecting particles from digital micrographs is an essential step in single particle electron cryomicroscopy (cryo-EM). Since manual selection is a tedious and time-consuming process, many automatic particle pickers have been developed. However, they have problems especially with non-ideal datasets. Here, we present a novel automated particle picking software called crYOLO, which is based on the deep learning object detection system “You Only Look Once” (YOLO). After training the network with 500 – 2,500 particles per dataset, it automatically selects the particles with high accuracy and precision reaching a speed of up to six micrographs per second. Importantly, we show that crYOLO can be trained to select previously unseen datasets paving the way for completely automated “on-the-fly” cryo-EM data pre-processing during data acquisition. CrYOLO is available as standalone program under http://sphire.mpg.de/ and will be part of the image processing workflow in SPHIRE.


Introduction
In recent years, single particle electron cryomicroscopy (cryo-EM) has become one of the most important and versatile methods for investigating the structure and function of biological macromolecules. In single particle cryo-EM, many images of identical but randomly oriented macromolecular particles are selected from raw cryo-EM micrographs, then digitally aligned, classified, averaged, and back-projected to obtain a three-dimensional structure of the protein.
Since usually more than 100,000 particles have to be selected for a near-atomic cryo-EM structure, many automatic particle picking procedures, often based on heuristic approaches, have recently been developed [1][2][3][4][5][6]. A popular selection approach, called template matching, cross-correlates the micrographs with pre-calculated templates of the particle of interest. However, this procedure is error-prone and the parameter optimization is often complicated. Although it works with optimal data, it often fails when dealing with non-ideal datasets where some of the particles overlap or are degraded and the background of the micrographs is contaminated with ice. In these cases, either the number of selected particles decreases or the number of false positives increases tremendously. The last resort is then often manual selection of particles, which is laborious and time-intensive.
To solve this problem two particle selection programs 7,8 have been recently developed, employing deep convolutional neural networks (CNN). CNNs are extremely successful in processing data with a grid-like topology 9 . For images this is undoubtedly the state-of-the-art method for pattern recognition and object detection.
Similar to learning the weights and biases of specific neurons in a multi-layer perceptron 10 , a CNN learns the elements of convolutional kernels. A convolution is an operation which calculates the weighted sum of the local neighborhood at a specific position in an image. The weights are the elements of the kernel which extracts specific local features of an image, such as corners or edges. In convolutional neural networks, several layers of convolutions are stacked, where the output of one layer is the input of the next layer. This enables CNNs to learn hierarchies of features.
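The weighted-sum operation described above can be illustrated with a minimal sketch. The function name `convolve2d_valid`, the toy image and the hand-written kernel are assumptions made for illustration only; in a CNN the kernel elements would be learned, and, as is common in deep learning libraries, the kernel is applied without flipping.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` and return the weighted sums.

    Only positions where the kernel fits completely inside the image
    are evaluated (no padding).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the local neighborhood anchored at (r, c).
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written horizontal-difference kernel (illustrative, not learned).
edge_kernel = [[-1, 1]]

response = convolve2d_valid(image, edge_kernel)
print(response)
```

The response is non-zero only where the kernel straddles the edge, which is the sense in which a kernel "extracts" a local feature.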
Common modern object detection systems, like the particle selection tools by Wang et al. and Zhu et al. 7,8 , employ a specific classifier for an object and evaluate it at every position. Here, they slide a window over the micrograph to crop out single regions, pass it through a CNN network and finally classify the extracted region. The confidence of classification is transferred into a map and the object positions are estimated by finding the local maxima in this map. Using this approach, it is possible to handle more challenging datasets. However, as it classifies many cut out regions independently, it comes with a high computational cost. Moreover, since the classifier only sees the windowed region, it is not able to learn the larger context of a particle (e.g. to not pick regions near ice contamination).
In 2016, Joseph Redmon introduced the 'You Only Look Once' (YOLO) framework 11 as an alternative to the sliding window approach and reformulated the classification problem into a regression task, where the exact position of the particle is predicted by a network. In contrast to the sliding window approach, the YOLO framework needs only a single pass of the full image instead of multiple passes of cropped-out regions. To do so, the YOLO approach splits the image into a grid and predicts for each grid cell whether it contains the center of a particle's bounding box.
If that is the case, it applies regression to predict the relative position of the particle center inside the cell as well as the width and height of the bounding box ( Figure 1a).
This procedure makes the detection pipeline rather simple and reduces the number of convolutions tremendously, which renders it fast but still accurate. During training, the YOLO approach only requires labeling positive examples, whereas the sliding window approach also needs the background and contamination to be labeled as negative examples. Moreover, since the network sees the complete image, it is able to learn the larger context and provides therefore an excellent framework to detect single particles in electron micrographs reliably and fast ( Figure 1b).
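The grid-cell formulation can be made concrete with a small sketch. The helper names `encode_center` and `decode_center` are hypothetical, and the 1024-pixel image with a 64x64 grid is used only as an example; the sketch shows how a particle center maps to a grid cell plus a relative offset inside that cell, which is the quantity the regression predicts.

```python
def encode_center(x, y, image_size, grid_size):
    """Map a particle center (in pixels) to its grid cell and in-cell offset.

    Returns (col, row, dx, dy), where dx and dy lie in [0, 1).
    """
    cell = image_size / grid_size  # edge length of one grid cell in pixels
    col, row = int(x // cell), int(y // cell)
    dx, dy = x / cell - col, y / cell - row
    return col, row, dx, dy


def decode_center(col, row, dx, dy, image_size, grid_size):
    """Invert encode_center: recover pixel coordinates from a prediction."""
    cell = image_size / grid_size
    return (col + dx) * cell, (row + dy) * cell


encoded = encode_center(100, 40, image_size=1024, grid_size=64)
decoded = decode_center(*encoded, image_size=1024, grid_size=64)
print(encoded, decoded)
```

Only the cell that contains the center is responsible for the particle, so positive labels are sufficient for training under this scheme.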
Here, we present the novel single particle selection procedure crYOLO which utilizes the YOLO algorithm to select single particles in cryo-EM micrographs. We evaluated our procedure on three recently published high resolution cryo-EM datasets.
The results demonstrate that crYOLO is able to accurately and precisely select single particles in micrographs of varying quality with a speed of up to six micrographs per second on a single GPU. This leads to a tremendous acceleration of particle selection, and significantly improves not only the quality of the datasets but also the final structure. Furthermore we show that crYOLO, trained on multiple datasets, is able to select previously unseen particles.

Convolutional Neural Network
The program crYOLO builds upon a Python-based open source implementation 12 of YOLO and uses the deep learning library Keras 13 . Beyond the basic implementation we added support for MRC micrographs, single channel data and EMAN1 14 box files. We added a graphical tool to read and create box files for training data generation or visualization of the results in a user-friendly manner ( Figure 2) and made changes in the original neural network architecture of YOLO 15 .
The YOLO network consists of 22 convolutional and five max-pooling layers.
In addition, it contains a passthrough layer to exploit fine-grained features, and the network is followed by a (1x1) convolutional layer that is used as a detection layer. YOLO uses a relatively coarse grid, which could lead to problems when detecting smaller particles as each grid cell can only detect a single particle. To tackle this, we removed the max-pooling layer after the 13th convolutional layer. Thus, the adapted network downsamples the input image only by a factor of 16 instead of 32 as in the original YOLO, resulting in a finer effective grid of 64x64 cells for an input image size of 1024x1024. By default, crYOLO bins the micrographs automatically to 1024x1024 pixels to reduce memory consumption and increase the detection speed. The finer grid is suitable for detecting smaller particles. Finally, we removed the last six convolutional layers and the passthrough layer, and placed a dropout layer in front of the detection layer to reduce overfitting (Table 1).
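The relation between the number of max-pooling layers and the grid size can be checked with a few lines of arithmetic. This sketch assumes that every max-pooling layer halves each image dimension (2x2 tiles) and that no other layer changes the spatial size; the function name `grid_cells` is hypothetical.

```python
def grid_cells(image_size, num_pooling_layers, pool=2):
    """Return (cells per side, downsampling factor) after repeated pooling."""
    factor = pool ** num_pooling_layers
    return image_size // factor, factor


# Original YOLO: five max-pooling layers.
yolo_1024 = grid_cells(1024, 5)
# Adapted network: one max-pooling layer removed, four remain.
cryolo_1024 = grid_cells(1024, 4)
cryolo_512 = grid_cells(512, 4)

print(yolo_1024, cryolo_1024, cryolo_512)
```

Under these assumptions the adapted network yields a 64x64 grid on a 1024x1024 input, twice as fine per side as the original network.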
To increase the amount of training data and to reduce overfitting, we augmented our datasets for each training step of the crYOLO network by a random selection of the following methods: flipping, blurring, adding noise and random contrast changes.
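A minimal sketch of such a random augmentation step is given below. It is not the crYOLO implementation: the function names, the noise and contrast bounds, the reading of "flipping" as a reversal of the row order, and the use of a blur window that includes the central pixel are all assumptions. Only the blur mask size range of 2 to 7 is taken from the method description. In a real pipeline the particle box coordinates would have to be flipped together with the image.

```python
import random
import statistics


def flip(image, rng):
    """Mirror along the horizontal axis, read here as reversing the rows."""
    return image[::-1]


def average_blur(image, rng):
    """Replace each pixel by the mean of a k x k window, k drawn from 2..7."""
    k = rng.randint(2, 7)
    h, w = len(image), len(image[0])
    lo, hi = -(k // 2), k - k // 2  # window offsets [lo, hi)
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            # The window is clipped at the image border.
            vals = [image[i][j]
                    for i in range(max(0, r + lo), min(h, r + hi))
                    for j in range(max(0, c + lo), min(w, c + hi))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def add_noise(image, rng, max_sigma=0.1):
    """Add Gaussian noise; max_sigma is an assumed upper bound."""
    sigma = rng.uniform(0.0, max_sigma)
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]


def change_contrast(image, rng, low=0.5, high=1.5):
    """Scale pixel values around the median; low/high are assumed bounds."""
    median = statistics.median(p for row in image for p in row)
    factor = rng.uniform(low, high)
    return [[(p - median) * factor + median for p in row] for row in image]


METHODS = [flip, average_blur, add_noise, change_contrast]


def augment(image, rng):
    """Apply a random subset of the methods in random order.

    The input image is not modified; a new nested list is returned.
    """
    chosen = rng.sample(METHODS, k=rng.randint(1, len(METHODS)))
    for method in chosen:
        image = method(image, rng)
    return image


rng = random.Random(0)
micrograph = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
augmented = augment(micrograph, rng)
```

Because a different subset is drawn every time an image is passed through the network, the same micrograph contributes many slightly different training examples.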

Test datasets
We validated the performance of our program on three different cryo-EM datasets which are publicly available in the electron microscopy public image archive (EMPIAR) and produced high-resolution structures. These include the TcdA1 toxin subunit from Photorhabdus luminescens 16  particles of TcdA1 to obtain a reconstruction at a resolution of 3.5 Å 16 . Due to the large molecular weight of the specimen and its characteristic shape, particles are clearly discernible in the micrographs although a carbon support film was used. Therefore, the selection of the particles is straightforward in this case. We chose this test dataset because of its small size, since the quality and number of selected particles will likely influence the quality of the final reconstruction.
In the case of NOMPC, which was reconstituted in nanodiscs 17 , the overall shape of the particle, sample concentration, and limited contrast make it difficult to accurately select the particles, even though it has a molecular weight of 754 kDa.
Furthermore, the density of the nanodisc is significantly higher than the density of the extramembrane domain of the protein, thus the center of mass is not located at the center of the particle but shifted towards the nanodisc density. A challenge for the selection program is therefore to accurately detect the center of the particles and avoid selection of empty nanodiscs. We therefore chose NOMPC as the second test dataset.
Prx3 has a molecular weight of 257 kDa and shows a characteristic ring-like shape. The dataset is one of the first near-atomic resolution datasets recorded using a Volta phase plate 18 (VPP). The VPP introduces an additional phase shift in the unscattered beam. This increases the phase contrast considerably and boosts the signal-to-noise ratio in the low-resolution range, which makes the structural analysis of low molecular weight complexes at high resolution possible and has revolutionized the cryo-EM field 20 . The VPP, however, enhances not only the contrast of the particles of interest, but also the contrast of all weak phase objects, including smaller particles (impurities, contamination, dissociated particle components) that would otherwise not be visible in conventional cryo-EM. This poses new complications with regard to particle selection, especially for automated particle picking procedures that cross-correlate raw EM images with templates. We therefore chose Prx3 as the third test dataset.

Training of crYOLO
To train crYOLO, we manually selected initial training datasets. Depending on the density of particles, the heterogeneity of the background, and the variation of defocus, more or fewer micrographs are needed. We found that 500 to 2,500 particles from at least five micrographs were sufficient to properly train the networks for the three datasets used in this study. To determine the influence of the size of the training dataset on the selection quality, we trained the crYOLO network on subsets of the manually picked particle dataset. We decreased the size of the subsets systematically and evaluated the results by calculating the precision and recall scores 21 . The selection performance was excellent when using as few as 400 TcdA1 particles from four micrographs (Figure 3a). However, when fewer particles were used, such as 200, the training dataset was not large and representative enough to enable a precise selection (Figure 3a). Interestingly, training with 400 particles from four micrographs worked better than with 600 particles from six micrographs (Figure 3a).
The reason for this unexpected behavior is the fact that four different defocus groups were used for the recording of the TcdA1 dataset. Therefore, selection of six micrographs instead of four results in an overrepresentation of certain defoci. However, this effect only appears when working with very small training datasets. In general, the user should make sure that the selection of micrographs covers the full defocus range when training crYOLO on manually picked particles.
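For reference, precision and recall follow from the counts of true positives, false positives and false negatives. The sketch below uses the standard definitions; the function name and the example counts are illustrative and are not taken from the evaluation reported here.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from pick counts.

    precision: fraction of picked boxes that are real particles.
    recall:    fraction of real particles that were picked.
    """
    picked = true_positives + false_positives
    real = true_positives + false_negatives
    precision = true_positives / picked if picked else 0.0
    recall = true_positives / real if real else 0.0
    return precision, recall


# Illustrative counts: 90 correct picks, 10 wrong picks, 30 missed particles.
precision, recall = precision_recall(90, 10, 30)
print(precision, recall)
```

A picker that selects very few, very certain particles reaches high precision at low recall, which is why both values are reported together.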
We furthermore compared crYOLO with the original YOLO network using the full training dataset comprising 1,100 TcdA1 particles from 10 micrographs (Figure 3b). At an input image size of 1024x1024 the performance of the networks is very similar. However, for input images with the size of 512x512, crYOLO with an AUC of 0.71 is superior to YOLO, which reaches only a very low AUC value of 0.31 in this case (Figure 3b). Therefore, it is possible to train crYOLO on GPUs with less memory.
However, the 1024x1024 input image size is preferable due to the overall better performance ( Figure 3b). Importantly, for Prx3 crYOLO significantly outperformed YOLO using a training dataset of 2,500 particles from 5 micrographs (Figure 3c), clearly highlighting the power of crYOLO for small particles.
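How the AUC values were computed is not stated in this section. As a sketch under the assumption that AUC denotes the area under the precision-recall curve and that the trapezoidal rule is an acceptable approximation, it could be calculated as follows; the function name and the sample points are illustrative only.

```python
def area_under_curve(recalls, precisions):
    """Trapezoidal area under a precision-recall curve."""
    points = sorted(zip(recalls, precisions))  # order by increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Illustrative curve: perfect precision up to recall 0.5, then a linear drop.
auc = area_under_curve([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
print(auc)
```

Each point of such a curve corresponds to one confidence threshold applied to the picked boxes.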
To quantify how well the particles are centered, we calculated the mean intersection over union (IOU) value of manually selected particles versus the automatically picked boxes using crYOLO. The IOU is defined as ratio of the intersecting area of two boxes and the area of their union and is a common measure for the localization accuracy. Picked particles with an IOU higher than 0.6 were classified as true positives. The mean IOU for TcdA1 and Prx3 was 0.88 ± 0.009 and 0.80 ± 0.004, respectively. This indicates a high localization accuracy of crYOLO. Since it was difficult to determine the proper center of each NOMPC particle even when selected manually, we did not calculate an IOU for this dataset.
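The IOU definition and the 0.6 threshold can be sketched as follows. The box layout `(x, y, width, height)` with `(x, y)` as a box corner and the function names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(picked_box, reference_box, threshold=0.6):
    """A pick counts as a true positive if its IOU exceeds the threshold."""
    return iou(picked_box, reference_box) > threshold


reference = (0, 0, 100, 100)
well_centered = (10, 0, 100, 100)   # shifted by 10% of the box size
badly_centered = (50, 0, 100, 100)  # shifted by half the box size

print(iou(reference, well_centered), iou(reference, badly_centered))
```

For equally sized boxes the IOU drops quickly with the shift between the centers, so a high mean IOU directly reflects well-centered picks.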
Besides this classical evaluation, we additionally calculated 2-D classes using the iterative stable alignment and clustering approach ISAC 22 (Figure 6e). Interestingly, a large fraction of the particles was discarded during refinement, both by the original authors (97.6%) and by us (87.7%). In our case, this is not due to the quality of the particle selection.
Indeed, 310,123 of our picks were included into the stably aligned images after ISAC.
Instead, we deleted most of the particles found in the preferred top orientation, which represents roughly half the dataset. The remaining discarded images correspond to correct picks in regions where the protein forms clusters. In those cases, while the particles were correctly identified by crYOLO, the particles were not used for single particle reconstruction.

Computational efficiency
We used a desktop computer equipped with a NVIDIA Geforce GTX 1080 graphics card with 8 GB memory and an Intel Core i7 6900k to train the networks and select the particles. The time needed for training the networks was 13 to 25 minutes (Figure 7a).
Whereas the same time was needed for training crYOLO on the TcdA1 and NOMPC datasets, the training time was shorter for Prx3.
The selection process was very fast for all three datasets (Figure 7b).

Generalization to unseen datasets
Ideally, crYOLO would specifically recognize and select particles that it has not seen before, paving the way for fully automated particle selection. In order to reach this level of generalization, we trained the network with a combination of 34,813 manually picked particles from 331 micrographs derived from two VPP and nine conventional cryo-EM datasets, including the Prx3 dataset described above, the T20S proteasome 24 (EMPIAR 10025) and nine other datasets from different projects that are currently being processed in our research group. The molecular weight of the complexes from the respective datasets ranges from 64 kDa to 1.1 MDa.
Using this generalized crYOLO network, we automatically selected particles of TcdA1 (EMPIAR 10089), an ATP-synthase 25 (EMPIAR 10023), an in-house ribosome and an influenza hemagglutinin dataset 26 (EMPIAR 10097) (Figure 8a-d). Although crYOLO had never seen these particles before, it performed very well in selecting them specifically while avoiding ice contamination and particles surrounded by contamination (Figure 8c). This clearly shows that crYOLO is able to learn general features for identifying particles in the context of the micrograph.
To assess the quality of the generalized crYOLO network, we compared its performance selecting TcdA1 with that of the network that was directly trained on this protein (see above). As expected, the AUC and localization accuracy are better for the directly trained network (Figure 8e). However, the generalized crYOLO network selected a similar number of particles (Figure 4d); the AUC of 0.748 (Figure 8e) and the IOU of 0.77 show that its performance is of sufficient quality to select a good set of particles. Indeed, the TcdA1 particles selected by the generalized crYOLO network resulted in a reconstruction of similar resolution and quality as the particles from the TcdA1-trained crYOLO (Figure 4f). We expect that increasing the number of particles on different micrographs from different complexes to train the network will further expand the capabilities of crYOLO.

Discussion
Here, we present crYOLO, a novel automated particle-picking procedure for cryo-EM.
crYOLO employs a state-of-the-art deep learning object detection system which contains a 16-layer convolutional neural network. The excellent performance of crYOLO on several state-of-the-art direct detector datasets reflects the effectiveness of the program and its efficiency in detecting good particles at an accuracy comparable to manual particle selection. This underlines its potential to become a crucial component in a streamlined automated workflow for single particle analysis and thus to eliminate one of the few remaining bottlenecks.
To close the other remaining gaps in the workflow between the electron microscope and data processing, such as evaluation, drift and CTF correction, file conversion and transfer, our lab has developed an on-the-fly processing pipeline called TranSPHIRE, which is documented and freely available on www.sphire.mpg.de (manuscript in preparation). The TranSPHIRE pipeline includes crYOLO.
The crYOLO package is available as a standalone program and will also soon be integrated into the SPHIRE image processing workflow. In addition to the command line interface, crYOLO provides an easy-to-use graphical tool (Figure 2) that makes it easier for users to generate training data and evaluate the results. CrYOLO outputs the coordinate files in EMAN .box-file format which can be easily imported into all available software packages for single particle analysis. CrYOLO picks micrographs rapidly on a standard GPU. If a trained network is available, particle picking can also be performed In comparison to template-based particle picking approaches, crYOLO is not prone to template bias and therefore the danger of picking "Einstein from noise" 22 is significantly reduced. For the presented datasets, optimal results were obtained after training the network on 5 to 10 micrographs. However, the particles in the training dataset must be representative of those of the full dataset.
Importantly, crYOLO can be trained to select previously unseen datasets. Taken together, crYOLO enables rapid automated on-the-fly particle selection at the precision of manual picking without bias from reference templates. Furthermore, increasing the size of the training database will result in a generalized particle picker that can be used without a template or human intervention to select particles on most single particle cryo-EM projects within a few minutes.

Acknowledgments
We would like to thank Amir Apelbaum, Evelyn Schubert, and Philine Hagel for providing datasets for optimization purposes and Tanvir R. Tapu

Data availability
The training datasets for this study are available from the corresponding author upon reasonable request. CrYOLO is available for download under http://sphire.mpg.de.

CrYOLO architecture and training
CrYOLO trains a deep convolutional neural network (CNN) for automated particle selection. A typical CNN consists of multiple convolutional and pooling layers, and is characterized by the depth of the network, which is the number of convolutional In typical CNNs max-pooling layers are inserted between some of the convolutional layers. Max-pooling layers divide the input image into equal sized tiles (e.g. 2x2) which are then used to calculate a condensed feature map. Therefore for each tile a cell is created, the maximum for the tile is computed and inserted into the cell.
This leads to a reduced dimensionality of the feature map, and makes the network more memory-efficient and robust against small perturbations in the pixel values.
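The tile-wise maximum described above can be written in a few lines. This is an illustrative sketch with non-overlapping tiles; the function name `max_pool` and the example feature map are assumptions.

```python
def max_pool(feature_map, tile=2):
    """Condense a feature map by taking the maximum of each tile x tile block.

    Rows or columns that do not fill a complete tile are ignored.
    """
    rows = len(feature_map) // tile
    cols = len(feature_map[0]) // tile
    return [
        [
            max(
                feature_map[r * tile + i][c * tile + j]
                for i in range(tile)
                for j in range(tile)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool(feature_map)
print(pooled)
```

The 4x4 map is reduced to 2x2, and small shifts of a strong response within a tile do not change the output, which is the robustness mentioned above.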
The general crYOLO network is summarized in Table 1. The network was trained using backpropagation with the ADAM optimizer 30 .
Backpropagation applies the chain-rule to compute the gradient values in every layer.
The gradient determines how the kernel elements in each convolutional layer should be updated to get a lower loss. The optimizer determines how the gradient of the loss is used to update the network parameters. For YOLO, the loss function that is minimized in that way is given by: The confidence that a cell contains a particle is , .
The first term of the loss function penalizes bad localization of particle boxes. The second term penalizes inaccurate estimates for the width and height of the boxes. The third term tries to increase the confidence for cells with a particle inside. The last term decreases the confidence of those cells containing no particle center. The loss function slightly differs from the one used in Redmon et al. 11 since we have only a single class to predict and also only one reference box (anchor box in Redmon et al. 11 ).
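As a hedged sketch only, a loss with the four terms described above can be written in the sum-of-squares form of the original YOLO publication 11 , reduced to a single class and a single reference box. All symbols are assumptions introduced here for illustration and are not necessarily the authors' notation: S is the number of grid cells per side, the indicator is 1 for cells that contain (obj) or do not contain (noobj) a particle center, hats denote network predictions, C is the confidence, and the lambda weights balance the terms. Whether crYOLO keeps the square roots on width and height, and which weights it uses, cannot be inferred from the description.

```latex
\begin{aligned}
L = {} & \lambda_{\text{coord}} \sum_{i=1}^{S^2} \mathbb{1}_i^{\text{obj}}
         \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
     & + \lambda_{\text{coord}} \sum_{i=1}^{S^2} \mathbb{1}_i^{\text{obj}}
         \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
              + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
     & + \lambda_{\text{obj}} \sum_{i=1}^{S^2} \mathbb{1}_i^{\text{obj}}
         \left( C_i - \hat{C}_i \right)^2 \\
     & + \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \mathbb{1}_i^{\text{noobj}}
         \left( C_i - \hat{C}_i \right)^2
\end{aligned}
```

The four lines correspond, in order, to the localization term, the box-size term, the confidence term for cells with a particle, and the confidence term for cells without a particle center.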

Data Augmentation
During training each image is augmented before passing it through the network. This means that it is slightly altered by random selection methods instead of passing the original image through the network. As each image is passed multiple times through the network, it is randomly modified in different ways. This helps the network to reduce overfitting and also the amount of training data needed. The applied methods are:
• Gaussian blurring: A random standard deviation between 0 and 3 is selected and then a corresponding filter mask is created. This mask is then convolved with the input image.
• Average blurring: A random mask size between 2 and 7 is chosen. This mask is shifted over the image. At each position the central element is replaced with the mean values of its neighbors.
• Flip: The image is mirrored along the horizontal axis.
• Noise: Gaussian noise with randomly selected standard deviation is added to the image.
• Dropout: Randomly replaces 1 to 10% of the image pixels with the image mean value.
• Contrast normalization: The contrast is changed by subtracting the median pixel value from each pixel, multiplying the result by a random constant and finally adding the median value again.

Figure 1. a) During training and later picking, the image is spatially downsampled by the CNN to a small grid. Then crYOLO predicts for each grid cell if it contains the center of a particle. If this is the case, it estimates the relative position of the particle center inside the grid cell as well as the width and height of the particle box. If not, the grid cell is classified as background. Since the network "sees" the complete micrograph it learns the context of the particle. b) CrYOLO uses the trained CNN to select particles from the full dataset. Since every micrograph is only processed once by the network, this procedure is very fast and outperforms the sliding window approach; crYOLO selects up to six micrographs per second.
Figure 4. Red boxes indicate the particles selected by Gauss-Boxer (a), crYOLO (b) or the generalized crYOLO network (c), respectively. Scale bar, 50 nm. d) Summary of particle selection and structural analysis of the three datasets. All datasets were processed using the same workflow in SPHIRE. e) Representative reference-free 2-D class averages of TcdA1 obtained using the ISAC and Beautifier tools (SPHIRE) from particles picked using crYOLO. Scale bar, 10 nm. f) Fourier shell correlation (FSC) curves of the 3-D reconstructions calculated from the particles selected in crYOLO and Gauss-Boxer. The FSC 0.143 between the independently refined and masked half-maps indicates resolutions of ~3.4 and ~3.5 Å, respectively. g) The final density map of TcdA1 obtained from particles picked by crYOLO is shown from the top and side, and is colored by subunit. The reconstruction using particles from the generalized crYOLO network is indistinguishable.
Figure 8. [...] ribosome (in-house dataset) (c) and influenza hemagglutinin (EMPIAR 10097) (d) by crYOLO using the generalized network. None of the datasets was included in the set used for training the generalized crYOLO network. As the YOLO approach enables the network to learn the context of an object, crYOLO did not select particles enclosed in or overlaid by ice blobs (c, white arrows). Scale bars, 40 nm. e) Precision-recall curves for TcdA1 picked with either a network directly trained on the TcdA1 dataset (black) or a generalized model trained on several datasets but not on TcdA1 (grey).