An encryption–decryption framework for validating single-particle imaging

We propose an encryption–decryption framework for validating diffraction intensity volumes reconstructed using single-particle imaging (SPI) with X-ray free-electron lasers (XFELs) when the ground truth volume is absent. This conceptual framework exploits each reconstructed volume's ability to decipher latent variables (e.g. orientations) of unseen sentinel diffraction patterns. Using this framework, we quantify novel measures of orientation disconcurrence, inconsistency, and disagreement between the decryptions by two independently reconstructed volumes. We also study how these measures can be used to define data sufficiency and its relation to spatial resolution, and the practical consequences of focusing XFEL pulses to smaller foci. This conceptual framework overcomes critical ambiguities in using Fourier Shell Correlation (FSC) as a validation measure for SPI. Finally, we show how this encryption–decryption framework naturally leads to an information-theoretic reformulation of the resolving power of XFEL-SPI, which we hope will lead to principled frameworks for experiment and instrument design.

become more misoriented. Consequently, the 'noise terms' between two independently reconstructed volumes (see Eq. (3) in 31) become correlated. Hence the FSC measure, which is invariant to isotropic filtering, can paradoxically report better resolutions when the orientation uncertainty of patterns increases. Second, the threshold criterion for determining resolution is controversial even in the cryo-electron microscopy community 31,32. This criterion demonstrably depends on the speckle sampling ratio (i.e. size of the real-space support) and the symmetry of the particle, and assumes additive noise 31. Unfortunately, there are still prominent violations of these criteria 33. Third, to compute the FSC between two 3D volumes, their relative orientations must be accurately determined.
To circumvent some of these issues with FSC, we propose examining the source of correlations between two independently reconstructed volumes: the 'disconcurrence', inconsistency, and disagreement between how these volumes orient individual patterns. A similar orientation-based approach to validation was explored by Tegze and Bortel 34, who proposed using the fraction of patterns that are well-oriented to validate intensity reconstructions. However, the so-called C-factor that they proposed for validation considered only orientation precision, not accuracy or reproducibility.
It can be useful to recast the XFEL-SPI validation problem in information-theoretic terms. Indeed, information theory has been insightful for SPI 35 as well as coherent diffraction imaging 36,37. In fact, the half-bit criterion for FSC in cryo-electron microscopy 31 established a connection between spatial resolution and information theory. There, however, the half-bit criterion merely referred to when the signal-to-noise ratio of an idealized noisy channel attained a value of $\sqrt{2} - 1$. What this signal-to-noise ratio means for resolving spatial features within an object remains unclear.
Looking farther back, Shannon's original proof of the noisy channel theorem was based on a straightforward encoding-decoding scheme 38. Below we show how Shannon's scheme can be explicitly constructed for the orientation determination problem in SPI. Doing so allows us to validate reconstructions using an orientation resolution that can be directly related to the mutual information of the SPI experiment.
An SPI reconstruction is similar to probabilistic symmetric-key cryptography, where plaintext messages are encrypted into ciphertexts using a correct key plus a randomness scheme. Because of this randomness, the same plaintext message can produce different ciphertexts.
The analogous messages in an XFEL-SPI experiment are the hidden orientations of illuminated single particles 39 . The experimental setup itself can be viewed as a cipher algorithm that encrypts these messages as noisy two-dimensional (2D) diffraction patterns. When these orientations (messages) are properly decrypted, the full three-dimensional (3D) diffraction volume of the target particle can be recovered.
The conundrum for SPI, however, is that these orientations are best decrypted using the ground truth 3D diffraction volume. Hence, reconstructing this diffraction volume can be viewed as 'cracking' (i.e. guessing) the correct symmetric key in probabilistic cryptography. Figure 3 shows the similarities between SPI validation and key-cracking in cryptography, with the following correspondence:
• correct key ↔ ground truth 3D diffraction intensities;
• encryption cipher ↔ SPI experiment;
• decryption cipher ↔ orientation inference scheme.

Figure 1. Schematic of how orientations are encoded in XFEL-SPI. A diffraction pattern of a scatterer collected on a detector ($K_t$, where $t$ labels the pixels on the detector) is an Ewald tomogram $W_{Qt}$ through the 3D diffraction volume $W$. When this scatterer undergoes an active random 3D rotation about its own original reference frame, this is equivalent to a passive rotation of said Ewald tomogram in the opposite sense (i.e. the inverse rotation). Throughout the rest of the paper, we parametrize this rotation with unit quaternions $Q \equiv \Omega(Q)$ (primer on unit quaternions in Supplementary Appendix).

Figure 2. Fourier shell correlation (FSC) reports improved resolution despite increased orientational blurring. Two disjoint SPI datasets were simulated, A and B, each with 5000 patterns. (A) The FSC was calculated for all pairs of reconstructions from the same dataset and with the same orientation blurring $\delta\theta$ (blue curve). Diffraction volumes were reconstructed from each dataset by interpolating each pattern back into ten random orientations near the true one. The true variance of these orientations is denoted $\delta\theta^2$, which is proportional to the degree of deliberate orientation blurring. The orientation disconcurrence proposed in this paper, $\Delta\theta$ (red curve), was computed using a third smaller sentinel dataset (1000 patterns) not used in the reconstructions.

Figure 3. Analogy between 'key-cracking' in cryptography (text in upper rows) and validation for single particle imaging (text in lower rows).

constraints or independent measurements. Such external validations, however, are not always possible in SPI, especially when resolving novel structural forms. We know that a correct key must decipher each ciphertext into a unique message. However, this uniqueness alone is insufficient to determine correctness, since wrong keys given to a deterministic cipher can yield unique but wrong decipherments. An example of this occurs when a recovered key overfits to a set of ciphertexts. Nevertheless, we can exploit this uniqueness requirement to design a scheme that detects if at least one of two candidate keys is incorrect.
Suppose we are given two disjoint sets of ciphertexts ($\{K_A\}$, $\{K_B\}$) that are encrypted by the same solution key $W_T$. We can independently recover two keys ($W_A$, $W_B$), one from each set of ciphertexts. Disagreements between how these two keys decipher a third hidden set of ciphertexts $\{K_S\}$ betray the incorrectness of at least one of these two keys. If the first two sets of ciphertexts are sufficiently large and randomly chosen, then both candidate keys are likely incorrect.
Owing to the randomness in probabilistic encryption, it is practically impossible to guarantee a perfectly accurate key given only a finite number of noisy ciphertexts. Analogously, we cannot perfectly recover the ground truth SPI diffraction volume only from a finite number of noisy, incomplete photon patterns. Consequently, any pair of recovered keys must differ measurably from each other. This difference quantifies the decryption precision of these keys, which is the lower bound of their decryption accuracies.
Returning to SPI data analysis, we wish to find the difference in how two independently reconstructed volumes $W_A$ and $W_B$ decrypt the orientations of a third disjoint set of sentinel photon patterns, $\{K_S\}$. This difference in decryption increases if the disagreement between $W_A$ and $W_B$ increases. More importantly, it also increases as either volume departs farther from the hidden ground truth volume $W_T$. We refer to this difference as the orientation disconcurrence between these two volumes.
Defining the framework in Fig. 3 requires well-defined encryption and decryption procedures. In an XFEL-SPI experiment, the encryption is described by how an illuminated scatterer at a certain orientation generates a noisy photon pattern (Fig. 1). In a Bayesian framework, the probability that a scatterer's specific orientation ($Q$) is encrypted as a particular photon pattern ($K$) is termed the data likelihood. Inversely, the probability that a pattern $K$ will be decrypted as a particular orientation $Q$ is its equivalent orientation posterior distribution (OPD).
This encryption of orientation information into a photon pattern is governed by the physics of photon-particle interaction, wavefront propagation, and photon measurement on the detector. Under ideal XFEL-SPI experimental conditions the photon pattern $K_t$ is a Poisson sample from an Ewald tomogram, $W_{Qt}$, of a particle at orientation $Q$ (Fig. 1). This idealization allows an explicit formulation of the likelihood (see Eq. (10)), and hence the OPD. Additionally, one might consider factors such as extraneous photon scattering sources, non-linear detector artefacts, and the local fluence of the XFEL pulses each particle randomly encounters. Such non-Poissonian OPDs were shown to be effective in different XFEL-SPI experiments 13,19,39. More generally, there is an infinite number of alternatives to the Poissonian OPD that could be used to decrypt particle orientation from photon patterns. Exploring the efficacy of these myriad alternatives is beyond the scope of this paper.
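For concreteness, the Poissonian likelihood and the resulting OPD described above can be written out as follows. This is a sketch that assumes independent Poisson statistics at every detector pixel and a uniform prior over a discrete set of orientations $Q'$; it is intended to be consistent with the description of Eq. (10), not a reproduction of it.

```latex
P(K \mid Q, W) = \prod_t \frac{W_{Qt}^{K_t}\, e^{-W_{Qt}}}{K_t!},
\qquad
P(Q \mid K, W) = \frac{P(K \mid Q, W)}{\sum_{Q'} P(K \mid Q', W)}
```

Under these assumptions the factorials $K_t!$ cancel in the posterior, so only the terms that depend on the candidate orientation need to be evaluated.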
The encryption-decryption framework that validates two intensity reconstructions ($W_A$, $W_B$) in Fig. 3 is indifferent to the algorithms that were used to reconstruct $W_A$ and $W_B$. While the Poissonian OPD chosen in this paper was also used in the original EMC algorithm to infer the orientations of photon patterns 8, here this OPD is used to decrypt orientations for validating the 3D intensity volumes $W_A$ and $W_B$, which could be reconstructed with algorithms other than EMC. Since our validation occurs after $W_A$ and $W_B$ are separately reconstructed, it does not add any computational overhead during their reconstructions.
The OPD that most accurately describes the experiment should be used both to reconstruct and to validate reconstructions. Hence it is unsurprising that the OPDs used in both situations are identical.
Finally, since the validation framework in Fig. 3 compares the ability of two volumes $W_A$ and $W_B$ to decrypt orientations, we are essentially comparing their OPDs from decrypting the orientations of a set of sentinel patterns. To compare these OPDs, we evaluate their convolutions in orientation space to produce what we call angular displacement distributions (ADD). The orientation disconcurrence between $W_A$ and $W_B$ is then extracted from this ADD. The procedure to compute the orientation disconcurrence given $W_A$ and $W_B$ is outlined below.
1. Partition the XFEL-SPI photon patterns $\{K\}$ into three disjoint sets: two larger and equally sized sets, $\{K_A\}$ and $\{K_B\}$, for reconstructions; and a third, smaller set of unseen sentinel patterns $\{K_S\}$ to measure orientation disconcurrence.
2. Using any algorithm, reconstruct two 3D intensities from the two larger sets of patterns: $\{K_A\} \to W_A$, and $\{K_B\} \to W_B$.
3. For each sentinel pattern $K_S$, compute the OPD of the reconstructed volumes $W_A$ and $W_B$. This is the probability that $K_S$ corresponds to the Ewald sphere section of orientation $\Omega$ in each reconstructed volume (i.e. $P(\Omega_A | K_S, W_A)$ and $P(\Omega_B | K_S, W_B)$). This step creates $2|\{K_S\}|$ distributions, two for each sentinel pattern, where $|\{K_S\}|$ is the number of sentinel patterns used.
4. Next, compute the angular displacement distribution (ADD, defined in Eq. (13)) of the sentinel patterns from the OPDs of $W_A$ and $W_B$. The ADD for each sentinel pattern $K_S$ (the red or blue distribution in Fig. 4) is essentially a convolution of OPD$_A$ and OPD$_B$ over the space of relative orientations between $W_A$ and $W_B$. If OPD$_A$ and OPD$_B$ were delta functions, then this convolution peaks at the relative orientation between $W_A$ and $W_B$. The ADD$_{AB}$ (the grey distribution in Fig. 4), which is the normalized sum of these convolutions for all sentinel patterns (Eq. (14)), is the distribution of relative orientations between $W_A$ and $W_B$ as 'measured by' $\{K_S\}$.
5. Finally, from the ADD of all the sentinel patterns between the volumes $W_A$ and $W_B$, estimate their orientation disconcurrence.
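The five steps above can be sketched on a deliberately simple one-dimensional toy, in which the orientation is a cyclic shift of a ring of intensities rather than a rotation in SO(3). Everything in this sketch is an illustrative assumption: the ring intensities `w_true`, the function names, and in particular the stand-in `reconstruct`, which averages patterns using their true shifts instead of running EMC or any other reconstruction algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 72                                   # orientation samples on a ring
phi = 2 * np.pi * np.arange(n_bins) / n_bins
w_true = 3.0 * (2.0 + np.cos(phi) + 0.5 * np.sin(3 * phi))  # toy ground truth, always > 0

def simulate(n_patterns):
    """Poisson patterns of the ring intensities at random hidden shifts."""
    shifts = rng.integers(0, n_bins, size=n_patterns)
    means = np.stack([np.roll(w_true, s) for s in shifts])
    return rng.poisson(means), shifts

def reconstruct(patterns, shifts):
    """Stand-in reconstruction: average patterns rolled back by their TRUE shifts."""
    aligned = np.stack([np.roll(k, -s) for k, s in zip(patterns, shifts)])
    return np.maximum(aligned.mean(axis=0), 1e-3)   # floor avoids log(0)

def opd(patterns, w):
    """Poissonian posterior P(shift | K, W), assuming a uniform prior over shifts."""
    log_w = np.stack([np.roll(np.log(w), s) for s in range(n_bins)])  # (shift, pixel)
    log_like = patterns @ log_w.T             # sum_t K_t log W_{t-s}; sum_t W is shift-invariant
    log_like -= log_like.max(axis=1, keepdims=True)
    post = np.exp(log_like)
    return post / post.sum(axis=1, keepdims=True)

def angular_displacement_distribution(opd_a, opd_b):
    """Normalized sum over patterns of the cross-correlation of two OPDs."""
    corr = np.array([(opd_a * np.roll(opd_b, -d, axis=1)).sum() for d in range(n_bins)])
    return corr / corr.sum()

def rms_width(dist):
    """RMS angular width of a distribution on the ring about its circular mean."""
    mean = np.angle(np.sum(dist * np.exp(1j * phi)))
    offsets = np.angle(np.exp(1j * (phi - mean)))   # wrapped to (-pi, pi]
    return float(np.sqrt(np.sum(dist * offsets**2)))

# Step 1: three disjoint sets. Step 2: two independent stand-in reconstructions.
k_a, s_a = simulate(2000)
k_b, s_b = simulate(2000)
k_s, _ = simulate(200)                        # sentinels, never used to build w_a or w_b
w_a, w_b = reconstruct(k_a, s_a), reconstruct(k_b, s_b)

# Steps 3 to 5: OPDs of the sentinels, their ADD, and the width of that ADD.
add_ab = angular_displacement_distribution(opd(k_s, w_a), opd(k_s, w_b))
dtheta_c = rms_width(add_ab)
print(f"toy orientation disconcurrence: {dtheta_c:.3f} rad")
```

Calling `angular_displacement_distribution(opd(k_s, w_a), opd(k_s, w_a))` with the same model twice gives the toy analogue of the autocorrelation used for the inconsistency measure.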

Results
Measures of orientation uncertainties. The orientation disconcurrence between two independently reconstructed volumes comprises two aspects: inconsistency and disagreement. By the cryptographic analogy, the first aspect characterizes how consistently each volume separately decrypts the orientations of sentinel patterns; the second aspect describes how often the decryptions of two (or more) volumes mutually agree. These concepts are illustrated in Fig. 5, and defined below. In the following numerical simulations, we use the disconcurrence between independent reconstructions from the same scatterer to estimate the lower bound of their correctness. Recall that this procedure requires partitioning a set of photon patterns into three disjoint sets ($\{K_A\}$, $\{K_B\}$, $\{K_S\}$). We reconstruct two 3D intensities from the first two sets ($W_A$ and $W_B$ respectively), while the last sentinel set is reserved for validation. Unlike an actual experiment, the true solution intensities $W_T$ that generated these patterns are known in these simulations, and will provide useful insights. Given these definitions, let us consider the different orientation measures available at the end of the procedure outlined at the end of the introduction.

1. Measure of orientation disconcurrence: $\Delta\theta_c$ (see Eq. (17)) is computed from the width of the angular displacement distribution (ADD) between intensities $W_A$ and $W_B$ that are independently reconstructed from two disjoint sets of patterns. $\Delta\theta_c$ measures the difference between the orientations of specific sentinel patterns within $W_A$ and $W_B$ that remains even after aligning the centroids of these two distributions (i.e. the overall orientations of $W_A$ and $W_B$).
2. Measure of average orientation inconsistency: $\Delta\theta_i$ is the root-mean-squared (RMS) angular width of the autocorrelation of $W_A$'s and $W_B$'s orientation posterior distribution (OPD), which is equivalent to repeating the intensity model labels in Eq. (18). In Fig. 4, the angular widths of the blue and red points show the orientation inconsistency in decrypting the orientations of two sentinel patterns ($K_1$ and $K_2$). The RMS of $\Delta\theta_c(W_A, W_A)$ and $\Delta\theta_c(W_B, W_B)$ is used to approximate the angular width (red or blue distribution) in Fig. 4, because it is expensive to calculate the inconsistency between $W_A$ and $W_B$ for each sentinel pattern, and because it is a good approximation when the OPD is assumed to be a Gaussian distribution (see the "A one-dimensional (1D) model" section for details). Thus $\Delta\theta_i$ simply averages this width over all sentinel patterns and both reconstructions $W_A$ and $W_B$.
3. Measure of orientation disagreement: $\Delta\theta_a$ is the angular displacement between reconstructions $W_A$ and $W_B$ that is due neither to an overall rotation between the two volumes nor to the angular width $\Delta\theta_i$ of the OPD. The "A one-dimensional (1D) model" section illustrates this relation in more detail.
4. Measure of orientation inconsistency given the ground truth: $\Delta\theta_i^*$ measures the angular width of the OPD in determining the patterns' orientations given the ground truth $W_T$. With enough patterns in $\{K_A\}$ and $\{K_B\}$, such that $W_A$ and $W_B$ do not over-fit to their respective photon patterns, we expect $\Delta\theta_i \geq \Delta\theta_i^*$.
5. Measure of orientation disconcurrence with ground truth: $\Delta\theta_c^*$ is the angular width of the ADD between the reconstructed and ground truth intensity volumes ($W_A$ vs $W_T$ respectively). Notice that $\Delta\theta_c$ is identical to $\Delta\theta_c^*$ if we replace $W_B \to W_T$. Hence, $\Delta\theta_c^*$ is essentially the orientation disconcurrence between $W_A$ and the ground truth.
6. Measure of average orientation disconcurrence with ground truth: $\langle\Delta\theta_c^*\rangle$ is the average angular width of the ADDs between the reconstructed and the ground truth intensity volumes ($W_A$, $W_B$ vs $W_T$ respectively). If only two volumes were reconstructed, $W_A$ and $W_B$, then $\langle\Delta\theta_c^*\rangle$ represents the average orientation disconcurrence against the ground truth.

Figure 4. … (12). The opacities of these disks are proportional to the value of the ADD at these quaternions. The blue and red disks represent the ADDs for two specific sentinel patterns respectively. The yellow disk shows the average overall rotation $Q_{BA}$ as defined in Eq. (16).

Figure 5. The orientation disconcurrence for two sentinel patterns ($K_1$ in blue, and $K_2$ in orange) consists of two parts: the inconsistency with which each model orients sentinel patterns (disk spanned by dashed-dotted radii), and the disagreement between how different models orient these patterns (disk spanned by dashed radii). These aspects are affected by the photon counts per pattern ($N$) and the number of patterns ($M_{\text{data}}$) respectively.
Factors that influence disconcurrence. Many experimental factors influence the orientation disconcurrence of an SPI intensity reconstruction, including: incident photon fluence, number of photon patterns from single particles, resolution and sampling of each pattern, amount of missing detector data (i.e. beamstop, gaps in compound detectors, inactive pixels), extent of photon background (i.e. from particles' incoherent scattering or stray light sources), and degree of structural heterogeneity between particles in the ensemble. The choice of algorithms and their parameters used to reconstruct the intensities also plays an important role. Furthermore, the symmetries of the scatterer itself can also affect how the ADD is interpreted (see Fig. 9 and "Methods" section).
In this section, we focus on three of these factors: the average number of photons per pattern $N$, the fineness of orientation space sampling by reconstruction algorithms, and the number of patterns $M_{\text{data}}$. In each scenario studied below, we simulated diffraction patterns of a small 105 kDa protein (PDB code 4ZW6 42) under experimental conditions modeled after those at the Tender X-ray endstation at the Linac Coherent Light Source (see Table 1). We then used the EMC algorithm to reconstruct two independent 3D volumes from the disjoint sets $\{K_A\}$ and $\{K_B\}$, each with $M_{\text{data}}$ patterns. For each test condition, a single set of 1000 sentinel patterns $\{K_S\}$ was reserved to evaluate the six types of $\Delta\theta$ listed above. The user should choose the number of sentinel patterns such that the uncertainty of the orientation disconcurrence is acceptably small. Another consideration is whether the range of SO(3) orientations is adequately covered by randomly oriented sentinel patterns (see "Sentinel pattern coverage in the SO(3) orientation space" section).
The average number of photons per diffraction pattern ($N$) is directly related to the mutual information for inferring latent parameters (e.g. orientations) as well as the particle's structure 8. $N$ depends on the brightness of the X-ray beam, the size of the X-ray focus (i.e. beam intensity), as well as the relative alignment between particle and X-ray beams. In general, all six types of $\Delta\theta$ fall when $N$ increases in Fig. 6. Simply put, more photons per pattern reduce orientation disagreement and inconsistency, hence disconcurrence. Additionally, the orientation disconcurrence between $W_A$ and $W_B$ falls with their respective disconcurrences with the ground truth $W_T$. This correspondence is consistent with the fact that uniqueness is a necessary condition for correctness (i.e. 'precision ≤ accuracy').

How finely orientations are sampled in XFEL-SPI reconstruction algorithms impacts the quality of reconstructed results 8. Recall that this sampling fineness is different from the adaptive refinement scheme for the OPD and ADD (Eq. (12)): the former pertains to the reconstruction algorithm, while the latter evaluates the reconstructed results. Fig. 6 shows that a higher sampling level in the EMC reconstruction algorithm generally reduces all alignment uncertainties $\Delta\theta$. While the various forms of $\Delta\theta$ have a noticeable spread at $n = 8$ orientation sampling, this spread reduces significantly when the sampling fineness is increased to $n = 13$. Numerically, we found the average angular separation between the quasi-uniform unit quaternion samples to be 0.161 and 0.099 radians respectively. This figure complements the information-theoretic heuristic for deciding sampling sufficiency in 8. With sufficient sampling, Fig. 6 shows that the orientation disconcurrence is dominated by the orientation inconsistency rather than orientation disagreement.

Table 1. Range of parameters used to simulate XFEL-SPI photon patterns of a 105 kDa protein (PDB code 4ZW6) in this paper. We assume an incident beam energy of 3 mJ and a transmission efficiency of 20%; a binned detector is used for computational efficiency.

| Parameter | Value |
| --- | --- |
| Photon wavelength (Å) | 3.4 |
| Detector distance (mm) | 300 |
| Detector pixel size (mm) | 1.2 |
| Detector size (pixel) | 100 × 100 |
| Beamstop radius (pixel) | 10 |
| Photon fluence (photons µm$^{-2}$) | $10^{13}$ to 5 5 × $10^{13}$ |
| Focal area (µm$^2$) | $0.33^2$ to $0.15^2$ |

Figure 6. Effects of incident photon counts per pattern and sampling fineness of the latent orientation space. Each data point compares two 3D intensity reconstructions with 5000 photon patterns (solid lines), or each one of them with a ground truth 3D intensity volume (dashed lines). The rotation group is sampled with refinement levels $n = 8$ or $n = 13$. As the average photon count per pattern increases, all varieties of angular uncertainties specified in the "Measures of orientation uncertainties" section decrease. The uncertainties involving the ground truth (*-superscript, dashed lines here) are typically lower than those with only the reconstructed volumes (solid lines). Finer orientation sampling reduces all orientation uncertainties. Furthermore, orientation disconcurrence ($\Delta\theta_c$, red) is dominated by inconsistency ($\Delta\theta_i$, blue) as orientation disagreement ($\Delta\theta_a$, yellow) is suppressed.

In an SPI experiment the number of SPI patterns, $M_{\text{data}}$, is a product of the fraction of particles that are illuminated by X-ray pulses (i.e. hit-rate), the pulse repetition rate, and the total experiment time.
One intuitively expects that reconstructions improve with larger $M_{\text{data}}$, which Fig. 7 confirms. The intrinsic orientation inconsistency of each reconstruction, $\Delta\theta_i$, falls with more patterns (blue curve). The orientation disconcurrence $\Delta\theta_c$ likewise falls with more patterns.
We found in Fig. 7 that $\Delta\theta_c$ and $\Delta\theta_i$ both decrease numerically with the number of patterns as $\alpha M_{\text{data}}^{-\beta} + \Delta\theta_i^*$, where $\alpha$ is a multiplicative constant, $\beta$ is a real positive number, and $\Delta\theta_i^*$ is the angular width of the OPD given the patterns $\{K_S\}$ and the ground truth model. Although $\Delta\theta_c \to \Delta\theta_i^*$ as $M_{\text{data}} \to \infty$, we can only assert that the reconstructed pairs of models ($W_A$ and $W_B$) are closer to each other, but not whether either is close to the ground truth $W_T$. The former is evident from the ratio of orientation disagreement against disconcurrence, $\Delta\theta_a^2/\Delta\theta_c^2$ (gray dots in Fig. 7): increasing $M_{\text{data}}$ eliminates orientation disagreements ($\Delta\theta_a$) between two independent reconstructions faster than intrinsic inconsistency ($\Delta\theta_i$). Using Eq. (2) and the fitted forms in Fig. 7, this vanishing of the orientation disagreement becomes clear if we assume $\beta_c < \beta_i$ and $\gamma_c \approx \gamma_i = \gamma$. When $M_{\text{data}}$ approaches infinity, $\Delta\theta_a$ approaches 0. Simply put, as $M_{\text{data}}$ increases, independently reconstructed volumes become more unique but not necessarily more correct.
Relating $\Delta\theta$ to spatial resolution. The 3D speckles in the reconstructed diffraction volume whose angular widths are smaller than or comparable to $\Delta\theta_c$ will lose contrast, hence spatial resolution. Let us denote the full angular width of these 3D speckles as $2\Delta\theta_{\text{sp}}(q)$ at spatial frequency $q$. Naturally, the resolutions of reconstructions become orientation-limited at the frequencies where $\Delta\theta_{\text{sp}}(q)$ approaches the width of the OPD, which is about $\Delta\theta_c/\sqrt{2}$ ("A one-dimensional (1D) model" section).

We caution that the previous paragraph suggests an inequality rather than a strict equality between spatial resolution and orientation disconcurrence. To understand why, consider how Fig. 8 shows that reconstructions can have an orientation disconcurrence smaller than the angular width of a single pixel at the edge of the detector, $\Delta\theta_{\text{pix}}$. This situation occurs with a very high average number of photons per pattern ($N \gg 1$), abundant patterns ($M_{\text{data}} \gg 1$), and sufficiently fine sampling of the rotation group during reconstructions (Fig. 6). Thus, the dynamic range and contrast of the reconstructed 3D diffraction speckles are high up to the detector's maximum captured resolution ($q_{\max}$), which allows us to distinguish arbitrarily small angular variations between actual diffraction patterns. We must remember that the reconstructed diffraction volume $W$ does not explicitly contain spatial information beyond the maximum spatial resolution $q_{\max}$. So even if $\Delta\theta_c \ll \Delta\theta_{\text{pix}}$, we can only say that spatial resolution is not orientation-limited. Perhaps with additional priors about the structure of the particle (e.g. known sequence, similar known structure, atomicity, etc.) it might be possible to extend the resolution beyond $q_{\max}$. But such extensions are beyond the scope of this discussion.
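The statement that the OPD width is about $\Delta\theta_c/\sqrt{2}$ can be checked numerically in one dimension. The sketch below assumes two Gaussian OPDs of equal width for a single sentinel pattern (the centres 0.10 and 0.13 rad and the width 0.05 rad are made-up values) and cross-correlates them to form a 1D analogue of the ADD.

```python
import numpy as np

theta = np.linspace(-1.0, 1.0, 4001)     # 1D orientation axis in radians
dx = theta[1] - theta[0]

def gaussian(mu, sigma):
    """Unit-area Gaussian sampled on the theta grid."""
    g = np.exp(-0.5 * ((theta - mu) / sigma) ** 2)
    return g / (g.sum() * dx)

sigma_opd = 0.05                          # hypothetical OPD width
opd_a = gaussian(0.10, sigma_opd)         # how W_A orients one sentinel pattern
opd_b = gaussian(0.13, sigma_opd)         # how W_B orients the same pattern

# ADD(d) = sum over theta of OPD_A(theta) * OPD_B(theta + d)
add = np.correlate(opd_b, opd_a, mode="full") * dx
lags = (np.arange(add.size) - (theta.size - 1)) * dx

norm = add.sum()
mean_lag = (add * lags).sum() / norm      # relative displacement of the two OPDs
width = np.sqrt((add * (lags - mean_lag) ** 2).sum() / norm)
print(f"ADD width / OPD width = {width / sigma_opd:.4f}")   # close to sqrt(2)
```

The ADD is centred on the displacement between the two OPDs (0.03 rad here) and is wider than either OPD by a factor of $\sqrt{2}$, which is the usual result for the correlation of two equal-width Gaussians.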
It should now be clear that orientation disconcurrence relates to how effectively one can resolve the orientation of an average SPI photon pattern. From this section, it should also be clear that spatial resolution can be limited by large orientation disconcurrences. More concretely, consider Fig. 8, which simulates an XFEL-SPI experiment of a 105 kDa protein at the Tender X-ray endstation at LCLS (Table 1). To resolve this protein to 10 nm resolution without significant orientation blurring requires more than 5000 patterns, each with more than 600 photons. However, it is premature to define spatial resolution only in terms of orientation disconcurrence, especially since a decryption scheme for the spatial resolution (similar to Fig. 3) is absent. Such detailed discussions are deferred to future studies.

Data sufficiency and mutual information. The question 'how many patterns are sufficient?' frequently arises in XFEL-SPI experiments. The answer determines whether a proposed experiment is 'feasible', as well as how many different samples to inject during the precious dozens of hours of XFEL beamtime allocated to each user group. Orientation disconcurrence can be used to define data sufficiency: the data are sufficient when the number of patterns gives a disconcurrence smaller than the angular width of speckles at a target resolution $q_{\text{target}}$. If the ADD peak in Fig. 4 were compact and locally Gaussian ("A one-dimensional (1D) model" section), this condition means that approximately 74% ($2\sigma$ criterion) of the oriented sentinel patterns should intersect their target 3D speckle at resolution $q_{\text{target}}$.
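One reading of the 74% figure, offered here as an assumption since the text does not spell it out, is that it is the probability mass of an isotropic three-dimensional Gaussian (three rotational degrees of freedom near the ADD peak) inside a radius of $2\sigma$. The helper name `fraction_within` is hypothetical.

```python
import math

def fraction_within(k_sigma: float) -> float:
    """Probability mass of an isotropic 3D Gaussian inside radius k_sigma * sigma."""
    k = k_sigma
    return math.erf(k / math.sqrt(2)) - math.sqrt(2 / math.pi) * k * math.exp(-k * k / 2)

print(f"{fraction_within(2.0):.3f}")   # 0.739
```

For comparison, the same $2\sigma$ radius encloses about 95% of a one-dimensional Gaussian, so the lower figure is consistent with a three-parameter orientation space.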
With the disconcurrence target defined, we can extrapolate data sufficiency with bootstrapping. Given $M_{\text{data}}$ total patterns, one can compute $\Delta\theta_c(M_{\text{data}})$ for pairs of models reconstructed from random, non-overlapping, equal subsets of the full dataset, similar to the data points in Fig. 7. Repeating this procedure via a simple bootstrapping scheme gives the orientation disconcurrence curves in Fig. 7. These curves fit reasonably well to a lifted power law, $\Delta\theta_c = \alpha_c M_{\text{data}}^{-\beta_c} + \gamma_c$. The shrinking error bars on $\Delta\theta_c$ from bootstrapping with increasing $M_{\text{data}}$ in Fig. 7 suggest that this fit requires sufficiently many patterns to be robust.
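A lifted power law of this form can be fitted with a few lines of NumPy. The sketch below is one possible approach, not the fitting procedure used for Fig. 7: because the model is linear in $\alpha_c$ and $\gamma_c$ once $\beta_c$ is fixed, it scans $\beta_c$ on a grid and solves a linear least-squares problem at each value. The data points are synthetic and made up for illustration.

```python
import numpy as np

def fit_lifted_power_law(m, dtheta, betas=np.linspace(0.05, 2.0, 391)):
    """Least-squares fit of dtheta = alpha * m**(-beta) + gamma.

    For a fixed beta the model is linear in (alpha, gamma), so beta is scanned
    on a grid and the best linear solution is kept.
    """
    best = None
    for beta in betas:
        design = np.column_stack([m ** (-beta), np.ones_like(m)])
        coef, *_ = np.linalg.lstsq(design, dtheta, rcond=None)
        resid = np.sum((design @ coef - dtheta) ** 2)
        if best is None or resid < best[0]:
            best = (resid, coef[0], beta, coef[1])
    _, alpha, beta, gamma = best
    return alpha, beta, gamma

# Synthetic disconcurrence curve (NOT data from the paper).
rng = np.random.default_rng(1)
m_data = np.array([500, 1000, 2000, 3000, 5000, 7500, 10000], dtype=float)
dtheta_c = 4.0 * m_data ** (-0.6) + 0.03 + rng.normal(0.0, 1e-4, m_data.size)

alpha_c, beta_c, gamma_c = fit_lifted_power_law(m_data, dtheta_c)
print(f"alpha_c={alpha_c:.2f}, beta_c={beta_c:.3f}, gamma_c={gamma_c:.4f}")
```

In practice each point would carry a bootstrap error bar, and a weighted fit would be the natural extension of this sketch.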
Owing to various constraints, only a finite number of XFEL-SPI patterns are collected each time (say $M_{\text{exp}}$). Maximizing signal-averaging in a reconstruction logically requires the input from all collected patterns. Yet the two independent reconstructions in this framework (Fig. 3) each see a little less than half of the full dataset ($< M_{\text{exp}}/2$). Fortunately, the lifted power law fit in Fig. 7 allows us to extrapolate the orientation disconcurrence between a pair of hypothetical independent 3D reconstructions that each used all patterns in an XFEL-SPI dataset. Specifically, if $\Delta\theta_c(M_{\text{data}} \leq M_{\text{exp}}/2)$ were computed between pairs of reconstructed volumes each using up to $M_{\text{exp}}/2$ patterns, then the fit extrapolates to $\Delta\theta_c(M_{\text{exp}}) = \alpha_c M_{\text{exp}}^{-\beta_c} + \gamma_c$. A similar extrapolation from bootstrapped reconstructions was proposed to define spatial resolution in cryo-electron microscopy 43.
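This lifted power law also helps us extrapolate to a second scenario. Should the target orientation disconcurrence be the angular width of a single pixel at the edge of the detector, $\Delta\theta_c = \Delta\theta_{\text{pix}}(q_{\max})$, then $\gamma_c < \Delta\theta_{\text{pix}}(q_{\max})$ is required. If this requirement is satisfied, then the number of patterns needed to reach this target satisfies $\log M_{\text{data}} = \frac{1}{\beta_c} \log\left[\alpha_c / (\Delta\theta_{\text{pix}}(q_{\max}) - \gamma_c)\right]$, i.e. $M_{\text{data}} = \left[\alpha_c / (\Delta\theta_{\text{pix}}(q_{\max}) - \gamma_c)\right]^{1/\beta_c}$.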
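Inverting the lifted power law for the number of patterns is a one-line calculation. The sketch below uses hypothetical fit parameters and a hypothetical target; the function name `patterns_needed` is not from the source.

```python
import math

def patterns_needed(alpha_c, beta_c, gamma_c, dtheta_target):
    """Solve alpha_c * M**(-beta_c) + gamma_c = dtheta_target for M."""
    if dtheta_target <= gamma_c:
        # Target lies at or below the asymptotic floor gamma_c: no finite dataset reaches it.
        return math.inf
    return (alpha_c / (dtheta_target - gamma_c)) ** (1.0 / beta_c)

# Hypothetical fit parameters and target, in radians.
m_required = patterns_needed(alpha_c=4.0, beta_c=0.6, gamma_c=0.03, dtheta_target=0.04)
print(f"patterns per reconstruction: {m_required:.0f}")
```

The early return makes the requirement $\gamma_c < \Delta\theta_{\text{pix}}(q_{\max})$ explicit: a target at or below the floor cannot be reached by collecting more patterns.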
The lifted power law form of $\Delta\theta_c = \alpha_c M_{\text{data}}^{-\beta_c} + \gamma_c$ in Fig. 7 allows us to parametrize data sufficiency in an information-theoretic sense. Essentially, the mutual information here can be defined as the reduction in the entropy of orienting an average sentinel pattern given a set of $M_{\text{data}}$ photon patterns $\{K\}$. Ignoring factors of order unity, this mutual information is approximately given by Eq. (8), assuming $M_{\text{data}} \gg 1$.
Equation (8) contains two intuitive results. First, this mutual information is bounded from above by its value when the solution intensities are known: $\log\left[2\pi^2/(\Delta\theta_i^*)^3\right]$. This upper bound can be viewed as the SPI channel capacity for decrypting orientations, and is computed in the same manner as the mutual information $I(K, \Omega)|_W$ in 8. Second, the mutual information for decrypting orientations increases with the number of patterns. This assumes that $\alpha_c/\Delta\theta_i^* > 0$ and $\beta_c > 0$, which are manifest in Fig. 7. Furthermore, $\beta_c > 0.5$ in Fig. 7, which is better than one would expect if patterns were mutually independent (i.e. $\beta_c = 0$). This 'co-dependence' arises because additional patterns can improve the reconstructed volumes, which in turn help earlier patterns distribute their photons more precisely into orientation classes.
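The upper bound quoted above is easy to evaluate numerically. In the sketch below the logarithm is taken in base 2 so that the result is in bits, which is an assumption about units. The second function substitutes the fitted $\Delta\theta_c(M_{\text{data}})$ into the same expression; this is only one form consistent with the two properties stated above, not a reproduction of Eq. (8). All parameter values are hypothetical.

```python
import math

def orientation_capacity_bits(dtheta_i_star):
    """Upper bound log2[2*pi^2 / dtheta^3] on the orientation information per pattern."""
    return math.log2(2 * math.pi**2 / dtheta_i_star**3)

def mutual_information_bits(m_data, alpha_c, beta_c, dtheta_i_star):
    """ASSUMED form: the capacity expression evaluated at the fitted dtheta_c(m_data)."""
    dtheta_c = alpha_c * m_data ** (-beta_c) + dtheta_i_star
    return math.log2(2 * math.pi**2 / dtheta_c**3)

capacity = orientation_capacity_bits(0.05)            # hypothetical width in radians
for m in (1_000, 10_000, 100_000):
    info = mutual_information_bits(m, alpha_c=4.0, beta_c=0.6, dtheta_i_star=0.05)
    print(f"M_data={m:>7}: {info:5.2f} bits (capacity {capacity:.2f} bits)")
```

Because the width enters as a cube, halving $\Delta\theta_i^*$ raises this bound by exactly three bits.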

Focal spot size affects hit rate and orientation disconcurrence. The linear size of the XFEL focus $L_{\text{focus}}$ is a critical parameter in an SPI experiment (see Table 1). This choice of focus size can be paraphrased simply: given a fixed total number of photons per XFEL pulse, would it be better to 'distribute' them into more patterns with fewer photons each, or fewer patterns with more photons each? Whereas a larger focus can dramatically increase the odds of illuminating randomly injected particles, it also drastically decreases the number of scattered photons should a particle be illuminated ($N$). These odds, also known as the 'hit-rate', are effectively $M_{\text{data}}$ per unit time. In fact, $N \propto L_{\text{focus}}^{-2}$ while $M_{\text{data}}/\text{time} \propto L_{\text{focus}}^{2}$. In this hypothetical scenario, the total number of photons measured per unit time ($N M_{\text{data}}/\text{time}$) remains constant regardless of $L_{\text{focus}}$. Suppose that in either case, you had enough patterns to adequately sample different views of the scatterer, and were perfectly able to detect particle hits against background scatter/noise. The same ambivalence to the focus size then appears again in the simple signal-to-noise ratio (SNR) described in 8, where $M_{\text{rot}}$ is the number of rotation samples used to reconstruct the intensity volumes $W_A$ and $W_B$. This SNR is motivated by a simple distribution of photons across a limited number of Ewald tomograms, and has been used to indicate data sufficiency in the orientation space 9.
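The scaling argument above can be made concrete with a short calculation. The two focus sizes are taken from Table 1; the reference photon count, hit rate, beamtime, and number of rotation samples are made-up values. The SNR proxy $\sqrt{N M_{\text{data}}/M_{\text{rot}}}$ is an assumed Poisson-style form that matches the verbal description (photons shared among a limited number of tomograms) and is not quoted from the source.

```python
import math

def scale_with_focus(l_focus, l_ref, n_ref, hit_rate_ref):
    """Scale photons per pattern (N ~ L^-2) and hit rate (~ L^2) from a reference focus."""
    ratio = l_focus / l_ref
    return n_ref / ratio**2, hit_rate_ref * ratio**2

def snr_proxy(n, m_data, m_rot):
    """ASSUMED form sqrt(N * M_data / M_rot); depends only on the total photon count."""
    return math.sqrt(n * m_data / m_rot)

l_ref, n_ref, rate_ref = 0.33, 100.0, 10.0    # um, photons/pattern, hits/s (last two made up)
seconds, m_rot = 3600.0, 25_000               # hypothetical beamtime and rotation samples

results = []
for l_focus in (0.33, 0.15):                  # focus sizes from Table 1, in um
    n, rate = scale_with_focus(l_focus, l_ref, n_ref, rate_ref)
    m_data = rate * seconds
    results.append((n, m_data, snr_proxy(n, m_data, m_rot)))
    print(f"L={l_focus} um: N={n:.0f}, M_data={m_data:.0f}, SNR proxy={results[-1][2]:.2f}")
```

Both focus sizes give the same total photon count and therefore the same proxy, which is the ambivalence described above.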
The discussion above may lead one to believe that there is no ideal focus size. However, if we again use a smaller orientation disconcurrence Δθ_c to quantify when things are 'better', the preference is to reduce L_focus. Notice that nearly doubling the average number of photons per pattern (N = 355 to N = 622 given M_data = 5000) in Fig. 6 reduces both Δθ_c and Δθ_i more than doubling the number of patterns (M_data = 5000 to M_data = 10000 given N = 355) in Fig. 7. The total number of photons in all patterns is approximately equal in both cases. Yet doubling the average number of photons per pattern substantially improves the asymptotic orientation inconsistency (i.e. Δθ_i* falls).

Discussion
In summary, we propose an encryption-decryption approach to validate 3D intensity volumes reconstructed in XFEL-SPI. This validation is based on the volumes' ability to decrypt the orientations of sentinel patterns unused in these reconstructions. While these volumes can be reconstructed by any algorithmic means, they must strictly adhere to the data independence scheme laid out in Fig. 3. This scheme can be generalized to validate other latent information inferred within the full dataset (e.g. unmeasured local photon fluence, structural class, etc). Realistic simulations of SPI experiments show that this approach can validate reconstructions in a principled, information-theoretic manner. Our approach relates the challenging question of data sufficiency intuitively to key experimental variables such as the number of measured photon patterns and the nominal incident photon intensity. Furthermore, the various forms of decrypting (orientation) uncertainties shown here can be interpreted as disconcurrence, disagreement, and inconsistency in how confidently the latent variables are inferred. These interpretations give a more informative and comprehensive view of the validation exercise.
Whereas there have been studies of the expected scattered photon signals from biomolecules in idealized XFEL-SPI scenarios 44,45, systematic studies of how well these signals can be integrated into a 3D diffraction volume despite missing information are still sorely lacking. Our results show that the complex considerations that contribute to data sufficiency in XFEL-SPI can be fitted as simple parameters (e.g. α, β, γ). Relating these parameters to basic properties of the target scatterer (e.g. mass, radius of gyration, etc), experimental conditions (e.g. beam intensity, photon wavelength, background scattering, etc), and the choice of reconstruction algorithm will be useful for experiment design and planning. An extension of our encryption-decryption approach can be used to define and validate the spatial resolution of XFEL-SPI and cryo-electron microscopy reconstructions. In principle, the resolving power of an imaging instrument should be the reduction in uncertainty of locating spatial features within the sample. Re-framing this uncertainty reduction in the encryption-decryption framework of Fig. 3 may give rise to more interpretable notions of spatial resolution. An information-theoretic formulation of this conceptual framework, similar to Eq. (8), also naturally accounts for external priors for localizing spatial features.
Ultimately, our encryption-decryption approach demonstrably overcomes the difficulties of using FSC as a validation measure for XFEL-SPI, in spite of FSC's popularity 13,16,18–29. The data throughput from XFELs will rapidly increase because of higher pulse repetition rates 46 and more efficient sample injection techniques. This trend inevitably creates a larger data load, which in turn increases our reliance on statistical techniques to assign confidence to de novo structural reconstructions. Such confidence is especially important when imaging structural ensembles with considerable flexibility or other structural variations. Despite the specificity of our validation routine to orientations, the encryption-decryption framework proposed in Fig. 3 can be readily generalized to test the reproducibility of claims of novel reconstructed structures. Such tests, we believe, are central to illuminating our path towards novel structural insights as we navigate through the photon-limited world of XFEL-SPI.

Methods
Sampling orientations. A scatterer can take on an infinite number of possible 3D orientations. In practice these orientations Q are discretely sampled at angular divisions smaller than the intrinsic angular precision of the patterns (see "Relating Δθ to spatial resolution" section). We adopt a quasi-uniform sampling scheme based on 8, which adaptively refines the 600-cell polytope with refinement parameter n. In this scheme the number of orientation samples scales like n³, while their angular spacing shrinks like 1/n.
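To give a feel for how quickly the sampling grows with the refinement parameter, the sketch below tabulates the number of samples and their approximate spacing. The closed forms used here, 10(5n³ + n) samples and a spacing of roughly 0.944/n radians, are the ones we recall being quoted for this refinement scheme in 8; they are assumptions here and should be verified against that reference. The function names are hypothetical.

```python
def num_rotation_samples(n):
    """ASSUMED count for the refined 600-cell: 10 * (5 n^3 + n)."""
    return 10 * (5 * n**3 + n)

def angular_spacing(n):
    """ASSUMED approximate spacing between neighbouring samples, in radians."""
    return 0.944 / n

for n in (1, 2, 4, 8):
    # Doubling n multiplies the sample count by roughly 8 and halves the spacing.
    print(n, num_rotation_samples(n), round(angular_spacing(n), 4))
```

Under these assumptions, halving the angular spacing costs about eight times as many orientation samples, which is the n³ scaling noted above.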

Orientation posterior distribution (OPD) of sentinel patterns.
The orientation posterior distribution (OPD) of a particular sentinel pattern K_S defines the probability of orienting it within a specific 3D diffraction volume W. This OPD, written here as P(Q | K_S, W), can be inferred from the likelihood P(K_S | Q, W) using Bayes' theorem, where the prior distribution of orientations, P(Q), is uniform unless the specimens have a known orientation bias. Because the space of orientations is only quasi-uniformly sampled by unit quaternions in our discretization scheme, we replace P(Q) with the numerically computed non-uniform weights w(Q) 9. Note that this OPD can be computed even if K_S did not in fact originate from W: such a computation will naturally yield highly uncertain orientations of K_S.
We presume that the likelihood of detecting a sentinel pattern K_S (comprising pixels indexed by t) from the Ewald tomogram at orientation Q of volume W (see Fig. 1), assuming perfect detection and no background photon sources, is a product of Poisson probabilities over the pixels t (Eq. (11)).
This likelihood can be replaced if the true detection statistics depart from this Poissonian form.
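The Bayesian step described above can be sketched numerically as follows. This is a minimal illustration, not the implementation used for the results in this paper: the array layout (one row of expected tomogram intensities per orientation sample), the function name, and the toy numbers are assumptions. The factorial term of the Poisson likelihood is omitted because it does not depend on Q and cancels on normalization.

```python
import numpy as np

def orientation_posterior(pattern, tomograms, weights):
    """Posterior over orientation samples for one photon pattern.

    pattern   : (T,) photon counts at pixels t
    tomograms : (M_rot, T) expected intensities of each Ewald tomogram (> 0)
    weights   : (M_rot,) prior weights w(Q) of the orientation samples
    """
    pattern = np.asarray(pattern, dtype=float)
    tomograms = np.asarray(tomograms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Poisson log-likelihood up to a Q-independent constant (-sum_t log K_t!).
    log_like = np.log(tomograms) @ pattern - tomograms.sum(axis=1)
    log_post = log_like + np.log(weights)
    log_post -= log_post.max()  # avoid underflow before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Toy example: three orientation samples, three pixels.
toy_tomograms = np.array([[1.0, 5.0, 1.0],
                          [5.0, 1.0, 1.0],
                          [1.0, 1.0, 5.0]])
toy_posterior = orientation_posterior([5, 1, 1], toy_tomograms, np.ones(3))
print(toy_posterior)
```

Working with log-probabilities matters in practice because the likelihoods of patterns with hundreds of photons can differ by many orders of magnitude between orientations.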
Often the posterior and likelihood in Eqs. (10) and (11) for a converged intensity volume are significant only for a relatively small set of orientations. For a given pattern K_S, we represent this set of important orientations by their corresponding important unit quaternions {Q | K_S} (written in boldface). For computational efficiency, only the probabilities at {Q | K_S} are recorded; those at other quaternions are safely set to zero.
For sufficient orientation coverage, we require these important quaternions to capture at least 99% of the total posterior distribution. To implement this, all patterns' posterior distributions are first sampled by a unit quaternion set {Q | n} using the 600-cell quaternion sampling strategy 8, where n is the sampling refinement level. For each pattern K_S, we identify the smallest set of important quaternions {Q | K_S, n}_min ⊂ {Q | n} that captures this fraction of the posterior, i.e. Σ_{Q ∈ {Q | K_S, n}_min} P(Q | K_S, W) ≥ 0.99 (Eq. (12)). We then increase n until this set comprises at least 100 important quaternions for every K_S: |{Q | K_S, n}_min| ≥ 100. To be concise, we omit the subscript 'min' in subsequent formulae.
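The selection of important quaternions can be sketched as below. This is an illustrative sketch under stated assumptions: `posterior_at_level` is a hypothetical callable that returns one pattern's posterior over the orientation samples at refinement level n, and the function names and the upper limit on n are ours.

```python
import numpy as np

def important_indices(posterior, mass=0.99):
    """Indices of the smallest set of samples capturing `mass` of the posterior."""
    posterior = np.asarray(posterior, dtype=float)
    order = np.argsort(posterior)[::-1]        # most probable samples first
    cumulative = np.cumsum(posterior[order])
    target = mass * cumulative[-1]
    count = int(np.searchsorted(cumulative, target)) + 1
    return order[:count]

def refine_until_covered(posterior_at_level, n_start=1, min_count=100, n_max=20):
    """Increase n until the important set has at least `min_count` members."""
    for n in range(n_start, n_max + 1):
        kept = important_indices(posterior_at_level(n))
        if len(kept) >= min_count:
            return n, kept
    raise RuntimeError("no refinement level up to n_max gave enough samples")
```

For a full dataset the refinement level would be the largest n returned over all sentinel patterns, since the criterion must hold for every K_S.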

Angular displacement distribution (ADD) between two reconstructed volumes.
Returning to our cryptography analogy, our next step is to compare how two diffraction volumes decrypt the orientations of a set of sentinel patterns. Three key considerations stand out here. First, the orientation of a noisy sentinel pattern is described by a probability distribution (i.e. OPD) rather than a point estimate. Second, W_A and W_B would almost always differ by an overall mutual 3D rotation Q_BA because each volume is typically randomly initialized to avoid reconstruction biases. Hence, the sentinel OPDs for W_A and W_B would also be displaced by Q_BA. Third, we must average the OPDs for different sentinel patterns to obtain a robust estimate of the orientation disconcurrence between W_A and W_B. These considerations are captured in the angular displacement distribution (ADD) between W_A and W_B. The ADD allows us to compare the OPDs of a single sentinel pattern (K_S) given W_A and W_B without having to pre-align them in the space of possible orientations. Mathematically, the ADD for a single sentinel pattern K_S is the outer product (or convolution) of its two OPDs given W_A and W_B, computed over their respective sets of important unit quaternions (Eq. (13)). Here Q_BA = Q_B Q_A^−1 represents the possible relative orientations between the reconstructed volumes W_A and W_B over the two sets of important quaternions {Q_A | K_S} and {Q_B | K_S} as defined in Eq. (12). Since Q_BA depends on the sentinel pattern K_S, the ADD in Eq. (13) may be different for different K_S. Averaging the ADD over the set of all sentinel patterns {K_S} gives the ADD in Eq. (14). Given the noise in the diffraction patterns, we expect variations in the decrypted orientations of sentinel patterns. To quantify this variation, an average of the ADD must first be established.
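The single-pattern ADD described above can be sketched as a weighted set of relative quaternions. This is a minimal illustration and not the code used in this work: the quaternion component order (w, x, y, z), the function names, and the normalization of the weights are assumptions.

```python
import numpy as np

def quat_multiply(p, q):
    """Hamilton product of quaternions (w, x, y, z); broadcasts over leading axes."""
    pw, px, py, pz = np.moveaxis(np.asarray(p, dtype=float), -1, 0)
    qw, qx, qy, qz = np.moveaxis(np.asarray(q, dtype=float), -1, 0)
    return np.stack([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ], axis=-1)

def quat_inverse(q):
    """Inverse of unit quaternions, i.e. their conjugates."""
    return np.asarray(q, dtype=float) * np.array([1.0, -1.0, -1.0, -1.0])

def angular_displacement_distribution(quats_a, probs_a, quats_b, probs_b):
    """All relative rotations Q_B Q_A^-1 with the product of their OPD weights."""
    quats_a = np.asarray(quats_a, dtype=float)
    quats_b = np.asarray(quats_b, dtype=float)
    rel = quat_multiply(quats_b[:, None, :], quat_inverse(quats_a)[None, :, :])
    weights = np.asarray(probs_b)[:, None] * np.asarray(probs_a)[None, :]
    return rel.reshape(-1, 4), (weights / weights.sum()).ravel()
```

Averaging over sentinel patterns then amounts to concatenating the samples of all patterns, with each pattern's weights normalized as above so that no single pattern dominates.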
When the reconstructed volumes W_A and W_B are similar, the ADDs of their many sentinel patterns tend to cluster around the average unit quaternion Q̄_AB in orientation space. This overall rotation Q̄_AB is not a mere linear average of the unit quaternions that sample the ADD, since such an average may not have unit length and hence may not correspond to a 3D spatial rotation. To define Q̄_AB, let us first consider the relative rotation between Q_BA and a presumptive average overall rotation Q. This relative rotation can be written as a quaternion multiplication whose result is a four-component vector of the form (cos(θ/2), n sin(θ/2)), where n and θ are respectively the axis and magnitude of this relative rotation. The magnitude of this relative rotation, θ(Q_BA, Q), vanishes as Q approaches Q_BA.
We define the average overall rotation Q̄_BA of an ADD between W_A and W_B as that which minimizes the average θ against all the rotation samples of the ADDs for the set of sentinel patterns. Specifically, the average overall rotation is the unit quaternion Q that minimizes the angular variance ⟨θ²⟩, i.e. the ADD-weighted average of θ²(Q_BA, Q) over all samples Q_BA, and the orientation disconcurrence Δθ_c is the square root of this minimum angular variance. A special case here is when W_A and W_B are identical. In this case, Q̄_BA = (1, 0, 0, 0), which is the identity quaternion.
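A hedged numerical sketch of this definition is given below. The rotation angle between two unit quaternions is obtained from their four-vector dot product, which equals the scalar part of their relative rotation. For the average we swap in the dominant-eigenvector quaternion average, a common stand-in that minimizes the weighted sum of sin²(θ/2) rather than of θ²; it therefore only approximates the minimizer described above when the ADD is tightly clustered, and a full implementation would refine from this starting point. All function names are hypothetical.

```python
import numpy as np

def rotation_angle(quats, ref):
    """Rotation angle (rad) between unit quaternions and a reference.

    abs() identifies Q with -Q, which describe the same 3D rotation.
    """
    dots = np.abs(np.asarray(quats, dtype=float) @ np.asarray(ref, dtype=float))
    return 2.0 * np.arccos(np.clip(dots, 0.0, 1.0))

def angular_variance(quats, weights, ref):
    """Weighted mean of theta^2 about the reference rotation."""
    theta = rotation_angle(quats, ref)
    return float(np.sum(weights * theta**2) / np.sum(weights))

def average_rotation(quats, weights):
    """Approximate average: dominant eigenvector of sum_i w_i q_i q_i^T."""
    quats = np.asarray(quats, dtype=float)
    scatter = (quats * np.asarray(weights, dtype=float)[:, None]).T @ quats
    _, eigvecs = np.linalg.eigh(scatter)  # eigenvalues in ascending order
    q = eigvecs[:, -1]
    return q if q[0] >= 0 else -q

def disconcurrence(quats, weights):
    """Return (approximate average rotation, RMS angle about it)."""
    ref = average_rotation(quats, weights)
    return ref, float(np.sqrt(angular_variance(quats, weights, ref)))
```

Because the eigenvector is always a unit vector, this average corresponds to a valid rotation, unlike a plain linear average of the quaternion components.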
Resolving ambiguities from centro-symmetric diffraction volumes. To obtain the most compact ADD (Eq. (14)), we must eliminate trivial symmetries in the diffraction patterns that broaden the ADD. One such example is the centro-symmetry of 3D diffraction intensities from optically thin samples, whose scattering density distribution is effectively real-valued. Consequently, at sufficiently low resolutions any two-dimensional diffraction pattern is similar to itself after a 180° in-plane rotation about the scattering experiment's optical axis (ẑ). Each such photon pattern K should have similar posterior probabilities to occur at either rotation Q or QQ_z, i.e. P(Q | K, W) ≈ P(QQ_z | K, W), where the in-plane rotation about the z-axis is Q_z = (0, 0, 0, 1). This two-fold ambiguity, plus the fact that Q_z is its own inverse, means that in the ADD either relative rotation Q_BA or Q′_BA = Q_B Q_z (Q_A)^−1 could occur in Eq. (14). Hence, for each ADD sample we check the angular closeness of both Q_BA and Q′_BA to the ADD's average unit quaternion Q̄_BA, and keep the one that is closer. This essentially replaces the θ expression in Eq. (18) with the smaller of θ(Q_BA, Q̄_BA) and θ(Q′_BA, Q̄_BA) (Eq. (20)).

Symmetric scatterers give rise to multiple clusters in the ADD (Fig. 9). Examples of such symmetries include icosahedral viral capsids 13 and octahedral nanoparticles 18. The multiplicity of these clusters arises because each pattern could be oriented at different and/or multiple locations of the symmetry orbit within the diffraction volume. As Fig. 9 shows, should this symmetry be known, we can compute a single orientation disconcurrence by first folding these multiple symmetry-related peaks in the ADD into its fundamental domain. We emphasize that this folding can be done even if this symmetry were not imposed during the reconstructions of W_A and W_B. Figure 9 illustrates ADD folding for a particle with chiral octahedral symmetry (O). The reconstructed diffraction intensities of this particle (W_A and W_B) have 24 rotational symmetries (i.e. a symmetry group of order 24). Symmetry-related orientations share the same Ewald tomogram, hence the same orientation posterior probability. Recall that the ADD comprises the joint product of the OPDs for K_S to be oriented at Q_A and Q_B within W_A and W_B respectively. We see this multiplicity of the ADD in Fig. 9b (main text), which contains 48 clusters owing to the unit quaternions double-covering SO(3). The number of clusters does not increase even if we include the symmetry operations of W_A, assuming W_A and W_B are similar, for the same reason that randomly oriented sentinel patterns in an asymmetric volume still produce a 2-clustered ADD (only one branch is plotted in Fig. 4).

Figure 9. Collapsing the ADD of 500 sentinel patterns for a scatterer, whose diffraction volume is centrosymmetric and has octahedral symmetry, into the fundamental domain (A–D). Starting clockwise from (A), which shows a projection of the ADD onto two components of each quaternion (Q = (Q_0, Q_1, Q_2, Q_3)), we collapsed the points related by centro-symmetry (since the 2D patterns have sufficiently low resolution) to obtain a sharper distribution in (B). The red disk throughout the panels represents the average quaternion Q̄_AB of the ADD. In (C), we rotate the ADD such that Q̄_AB = (1, 0, 0, 0) for clarity. The histogram of the ADD versus Q_0, shown above panel (C), can sometimes reveal the flavor of symmetry in W. Finally, using the particle's known symmetry group operations we can fold the ADD into the fundamental domain in (D).
For each sentinel pattern K_S, we can fold each important unit quaternion Q_BA in its ADD into the fundamental domain by exhaustively searching for the symmetry operation in {Q_O} (together with the in-plane rotation, either (1, 0, 0, 0) or (0, 0, 0, 1)) that minimizes the angular variance in Eq. (24). Here, Q is the presumptive average relative rotation between W_A and W_B, similar to that in Eq. (16). As in Eq. (20), we also minimize over each pattern's in-plane inversion. Therefore, the optimal relative rotation (Q̄_BA) and canonical realignment (Q̄_OB) are found by minimizing the total angular variance, weighted over all important unit quaternions for all sentinel patterns in the ADD (Eq. (25)). To recapitulate, the orientation disconcurrence between two symmetric volumes W_A and W_B is defined by the minimum of Eq. (25). This computation involves separate optimizations: we iteratively refine the presumptive Q_BA and Q_OB towards their optimal values Q̄_BA and Q̄_OB by minimizing Eq. (25); and, for each presumptive Q_BA and Q_OB, we find the symmetry operation in {Q_O} for each sentinel pattern that minimizes the quantity in Eq. (24), as well as the most compatible in-plane rotation for each sentinel pattern ("Resolving ambiguities from centro-symmetric diffraction volumes" section). The results of these completed optimizations are used to fold the ADD into the fundamental domain in Fig. 9.
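The exhaustive search over symmetry operations and in-plane inversions can be sketched as follows. Several choices here are assumptions made only to keep the illustration small: the symmetry operations act by left multiplication on Q_B, the volume is taken to be already in its canonical frame (so the realignment Q_OB is omitted), and a hypothetical four-fold group about the x-axis stands in for the octahedral group. Function and variable names are ours.

```python
import itertools
import numpy as np

IDENTITY = np.array([1.0, 0.0, 0.0, 0.0])
Q_Z = np.array([0.0, 0.0, 0.0, 1.0])  # 180-degree in-plane rotation about z

def qmul(p, q):
    """Hamilton product of two quaternions (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

def qinv(q):
    """Inverse (conjugate) of a unit quaternion."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def angle_between(q, ref):
    """Rotation angle (rad) between two unit quaternions, identifying Q with -Q."""
    return 2.0 * np.arccos(min(1.0, abs(float(np.dot(q, ref)))))

def axis_rotation(axis, angle):
    """Unit quaternion for a rotation by `angle` about `axis`."""
    axis = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

def fold_relative_rotation(q_a, q_b, ref, symmetry_ops):
    """Pick the symmetry- and inversion-equivalent Q_BA closest to `ref`."""
    candidates = (
        qmul(qmul(q_o, qmul(q_b, flip)), qinv(q_a))
        for q_o, flip in itertools.product(symmetry_ops, (IDENTITY, Q_Z))
    )
    return min(candidates, key=lambda q: angle_between(q, ref))

# Hypothetical four-fold symmetry about x, used only for illustration.
C4_X = [axis_rotation([1, 0, 0], k * np.pi / 2) for k in range(4)]
```

In an iterative scheme, `ref` would be the current presumptive average rotation, updated after every pass of folding until the total angular variance stops decreasing.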
We note that one can discover the symmetry of W_A using a special case of the ADD, namely that of a volume with itself (i.e. W_A = W_B). This 'self-ADD' will be similar to Fig. 9c (main text) since there is no relative rotation between W_A and itself. Because the first component of every unit quaternion in a symmetry group is independent of the choice of canonical axis, we may deduce W_A's symmetry group from the number and positions of the clusters in the Q_0 histogram of its 'self-ADD' (panel above Fig. 9c (main text)).
A one-dimensional (1D) model. Here, we show the relation between the orientation disconcurrence, the disagreement (misalignment of the centers of the ADDs), and the inconsistency (the size of each ADD) with a one-dimensional (1D) rotation analogy, as opposed to the full 3D rotation version in Fig. 4.
The unit quaternion Q that describes rotation about a 1D ring is a real number θ ∈ [0, 2π). Suppose that the two OPDs (of reconstructed models W_A and W_B) that comprise the ADDs for a set of sentinel patterns {K_S} are mostly constrained within a small segment of this 1D ring. Let us further suppose that their ADD over {K_S} can be approximated by a local Gaussian distribution within this angular segment. We denote the 1D ADD averaged over all sentinel patterns {K_S} as P(Q | {K_S}) ≡ P(Q | {K_S}, W_A, W_B). For a single sentinel pattern K_S, we denote the mean of its ADD, P(Q | K_S) (blue or red distribution in Fig. 4), as Q(K_S), and its variance as Δθ²(K_S). Hence the mean and variance of the ADD for the entire set of sentinel patterns {K_S} are equivalent to the overall orientation, Q({K_S}), and the square of the orientation disconcurrence, Δθ_c²({K_S}), defined in Eqs. (17) and (18) respectively. The difference between the squares of the disconcurrence, Δθ_c({K_S}), and the inconsistency, √⟨Δθ²(K)⟩_{K∈{K_S}}, equals the mean-square distance between the per-pattern means Q(K_S), K_S ∈ {K_S}, and the overall mean Q({K_S}); the corresponding RMS distance can be thought of as the disagreement, Δθ_a(W_A, W_B), between reconstructions W_A and W_B. This relation follows from splitting the variance of the averaged ADD into its within-pattern and between-pattern parts.

The variance of the OPD, δ², quantifies how well we can identify the orientation of a given pattern. For a pixel at q in this pattern, we cannot decide whether this pixel belongs to a diffraction speckle near its most likely orientation if the speckle's radius θ_sp(q) is larger than δ. Strictly, if we want a 74% confidence interval, then we should have θ_sp(q) ≤ 2δ. It should be noted that the confidence level for 2σ is 74% instead of 95% since the OPD is a 3D Gaussian distribution, even though we simplified the derivation above with a 1D Gaussian distribution. Computing δ directly is computationally expensive, but it can be easily inferred from Δθ_i as δ ≈ Δθ_i/√2 if the Gaussian assumption discussed above is used.
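Two numerical statements in the passage above can be checked with a short sketch. The first function pair compares the variance of an equal-weight mixture of 1D Gaussians (a stand-in for the averaged ADD) against the sum of the squared inconsistency and squared disagreement. The second evaluates the probability that an isotropic 3D Gaussian variable falls within a radius of two standard deviations. The per-pattern means and widths are made-up values, and the function names are hypothetical.

```python
import math
import numpy as np

def decompose_1d(means, sigmas):
    """Squared inconsistency and squared disagreement for equal-weight patterns."""
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    inconsistency_sq = float(np.mean(sigmas**2))                 # within-pattern
    disagreement_sq = float(np.mean((means - means.mean())**2))  # between-pattern
    return inconsistency_sq, disagreement_sq

def averaged_add_variance(means, sigmas, half_width=1.0, num=20001):
    """Variance of the equal-weight Gaussian mixture, summed on a fine grid."""
    x = np.linspace(-half_width, half_width, num)
    pdf = np.zeros_like(x)
    for m, s in zip(means, sigmas):
        pdf += np.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    pdf /= pdf.sum()  # normalize on the grid
    mean = np.sum(x * pdf)
    return float(np.sum((x - mean) ** 2 * pdf))

def prob_within_radius_3d(k):
    """P(r <= k * sigma) for an isotropic 3D Gaussian."""
    return math.erf(k / math.sqrt(2)) - math.sqrt(2 / math.pi) * k * math.exp(-0.5 * k * k)

toy_means, toy_sigmas = [-0.05, 0.02, 0.06], [0.03, 0.04, 0.05]  # made-up values (rad)
incons_sq, disagree_sq = decompose_1d(toy_means, toy_sigmas)
disconcurrence_sq = averaged_add_variance(toy_means, toy_sigmas)
print(disconcurrence_sq, incons_sq + disagree_sq)
print(prob_within_radius_3d(2.0))
```

The two printed variances agree to numerical precision, and the 3D probability within two standard deviations comes out close to the 74% quoted above rather than the familiar 95% of the 1D case.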
Moreover, to be more cautious about the conclusion, we use Δθ_c in place of Δθ_i in Eq. (7).

Sentinel pattern coverage in the SO(3) orientation space. Comparing a sentinel pattern to a diffraction intensity volume results in the former's OPD. This OPD covers a certain region in the SO(3) orientation space. The volume of this region should scale with the width of the OPD, which can be estimated as Δθ_i/√2 as mentioned in Eq. (28). If we crudely partition these OPDs with boxes whose average edge length is twice the average OPD width, then the average volume covered by an OPD is (2Δθ_i/√2)³. Given that Δθ_i = 0.24 when the number of patterns diverges (the yellow asymptote in Fig. 7), we need at least π²/(2Δθ_i/√2)³ OPDs to cover the whole SO(3) space, where π² is the total volume of SO(3).
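The box-counting estimate above is simple enough to evaluate directly. The sketch below assumes that Δθ_i = 0.24 is expressed in radians, which is our assumption rather than something stated in this passage; the function name is hypothetical.

```python
import math

def min_opds_to_cover(delta_theta_i, total_volume=math.pi**2):
    """Number of OPD-sized boxes needed to tile the orientation space.

    Each box has edge 2 * (delta_theta_i / sqrt(2)), i.e. twice the OPD width.
    """
    box_edge = 2.0 * delta_theta_i / math.sqrt(2.0)
    return total_volume / box_edge**3

# Assumes the asymptotic inconsistency of Fig. 7 is given in radians.
print(min_opds_to_cover(0.24))
```

Under this assumption the estimate is a few hundred OPDs, and it grows as the inverse cube of Δθ_i, so halving the inconsistency requires about eight times as many sentinel patterns for the same coverage.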