TranSPHIRE: automated and feedback-optimized on-the-fly processing for cryo-EM

Single particle cryo-EM requires full automation to allow high-throughput structure determination. Although software packages exist in which parts of the cryo-EM pipeline are automated, a complete solution that offers reliable on-the-fly processing and results in high-resolution structures does not exist. Here we present TranSPHIRE, a software package for fully automated processing of cryo-EM datasets during data acquisition. TranSPHIRE transfers data from the microscope, automatically applies the common pre-processing steps, picks particles, and performs 2D clustering and 3D refinement in parallel with image recording. Importantly, TranSPHIRE introduces a machine learning-based feedback loop that re-trains its picking model to adapt to any given data set live during processing. This elegant approach enables TranSPHIRE to process data more effectively, producing high-quality particle stacks. TranSPHIRE collects and displays all metrics and microscope settings to allow users to quickly evaluate data during acquisition. TranSPHIRE can run on a single workstation and also includes the automated processing of filaments.

runtime to only pick pore state particles (d). Output values can be plotted as either scatter plots or histograms, as shown here for the total number of particles (f) and the overall drift per micrograph (g), respectively. This live monitoring enables an early evaluation of data quality during data acquisition and provides initial information about the protein structure.

Supplementary Figure 2. TranSPHIRE workflow.
Flow chart of the TranSPHIRE pipeline depicting sequentially executed processes below each other and parallel running processes next to each other.
The workflow is highly adaptable, allowing, for example, the binning of super-resolution data during motion correction and CTF estimation. All inputs from the microscope, outputs from the involved processes, and additional statistics produced by TranSPHIRE are monitored and presented live in the GUI. If specified, the processes "Copy to work", "Copy to backup" and "Copy to HDD" create a copy of the results of each individual step on a workstation/cluster, a backup server, or an external hard drive, respectively.

Supplementary Figure 3. Benchmarks and timings for on-the-fly processing.
(a) Plot illustrating the dependence of on-the-fly processing on the average number of particles per micrograph (x-axis), the number of micrographs collected per hour (y-axis), and the percentage of "good" particles (displayed for 25%, 50%, 75% and 100%). TranSPHIRE can keep up with most routinely used data acquisition settings (green). For data collection with, e.g., AFIS or a very dense particle distribution, the hardware used in this paper might only allow on-the-fly processing up to 2D results (yellow) or motion correction (blue). Presented data are extrapolated from timings of the TRPC4 data processed in this manuscript. Example: Supposing 25% of particles end up in classes labeled "good" by [...]
Processing of the data only starts once the feedback loop has finished (after ~7 hours), and TranSPHIRE requires time to catch up in order to process data on-the-fly. Dashed lines (black) indicate the time points at which processing has caught up for 2D classification (green) and 3D refinement (blue) and starts running on-the-fly. Transparent lines indicate run times excluding the feedback loop. Timings are based on default settings (particle thresholds of 20,000 (2D) and 40,000 (3D) per batch).

Supplementary Figure 4. Choosing an optimal particle threshold for the feedback.
To simulate a challenging data set, 90% of the picked particle coordinates were replaced by random coordinates on the micrograph prior to 2D classification. (a) Feedback performance measured as the total number of "good" particles (blue), the total running time of the feedback loop (grey), and the resolution of the final 3D reconstruction (green) as a function of the chosen particle threshold. Data points for a particle threshold of 5,000 are missing because the number of particles was insufficient for 2D clustering in round 1 (see also b). While both the resolution and the number of good particles vary little, the running time increases considerably with an increased particle threshold. Thus, one has to balance speed and stability when choosing a particle threshold for the feedback. (b) Representative 2D classes using a particle threshold of 5,000, illustrating that the number of particles is insufficient to generate high-resolution classes. (c) Evaluation of the picking performance as a function of the particle threshold. With a particle threshold of 10,000, the percentage of "good" particles fluctuates more strongly and converges to a lower value than with higher particle thresholds.
As the performance is similar for particle thresholds of 20,000 and 40,000, a default value of 20,000 was chosen due to its lower run time.
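
The simulated challenging data set described for Supplementary Figure 4 (90% of the picked coordinates replaced by random positions on the micrograph) can be illustrated with a minimal sketch. This is not the authors' script: the function name `corrupt_coordinates`, the 4096 x 4096 micrograph size, the uniform sampling, and the fixed seed are all assumptions made for illustration, and the caption does not state whether the replacement was done per micrograph or across the whole data set.

```python
import random


def corrupt_coordinates(coords, width, height, fraction=0.9, seed=0):
    """Replace a fraction of (x, y) picks with uniformly random positions.

    Hypothetical helper: 'width' and 'height' are the micrograph
    dimensions in pixels; 'fraction' is the share of picks to replace
    (0.9 corresponds to the 90% quoted in the caption).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_replace = round(len(coords) * fraction)
    # Choose which picks to overwrite, without repetition.
    replace_idx = set(rng.sample(range(len(coords)), n_replace))
    corrupted = []
    for i, xy in enumerate(coords):
        if i in replace_idx:
            # Random position anywhere on the micrograph.
            corrupted.append((rng.uniform(0, width), rng.uniform(0, height)))
        else:
            corrupted.append(xy)  # original pick is kept unchanged
    return corrupted


# Example with 100 made-up picks on an assumed 4096 x 4096 micrograph.
picks = [(10 * i, 10 * i) for i in range(100)]
noisy = corrupt_coordinates(picks, width=4096, height=4096)
unchanged = sum(1 for a, b in zip(picks, noisy) if a == b)
```

In this example `unchanged` counts the picks that survive, which is 10 of 100 for a fraction of 0.9. A real test would apply the same idea to the picker's coordinate files before 2D classification.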
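
The quantities named for Supplementary Figure 3 (particles per micrograph, micrographs per hour, percentage of "good" particles, and batch thresholds of 20,000 and 40,000) can be related with simple arithmetic. The sketch below only estimates how long data acquisition takes to accumulate one batch; it does not model compute time and is not the extrapolation used in the paper. The function name and the example rates are hypothetical, and it is an assumption here that the 2D batch counts picked particles while the 3D batch counts "good" particles.

```python
def hours_to_fill_batch(particles_per_micrograph, micrographs_per_hour,
                        batch_threshold, good_fraction=1.0):
    """Hours of acquisition needed to accumulate one batch of particles.

    Hypothetical helper: 'good_fraction' scales the particle rate to the
    share of particles that end up in classes labeled "good".
    """
    rate = particles_per_micrograph * micrographs_per_hour * good_fraction
    if rate <= 0:
        raise ValueError("particle rate must be positive")
    return batch_threshold / rate


# Made-up acquisition settings: 100 particles per micrograph,
# 200 micrographs per hour, 25% "good" particles.
hours_2d = hours_to_fill_batch(100, 200, batch_threshold=20_000)
hours_3d = hours_to_fill_batch(100, 200, batch_threshold=40_000,
                               good_fraction=0.25)
```

With these made-up settings, a 2D batch of 20,000 picked particles fills in 1 hour, whereas a 3D batch of 40,000 "good" particles takes 8 hours. This shows why a low percentage of "good" particles delays the start of 3D refinement.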