Diffraction data from aerosolized Coliphage PR772 virus particles imaged with the Linac Coherent Light Source

Single Particle Imaging (SPI) with intense coherent X-ray pulses from X-ray free-electron lasers (XFELs) has the potential to produce molecular structures without the need for crystallization or freezing. Here we present a dataset of 285,944 diffraction patterns from aerosolized Coliphage PR772 virus particles injected into the femtosecond X-ray pulses of the Linac Coherent Light Source (LCLS). Additional exposures with background information are also deposited. The diffraction data were collected at the Atomic, Molecular and Optical Science Instrument (AMO) of the LCLS in 4 experimental beam times during a period of four years. The photon energy was either 1.2 or 1.7 keV and the pulse energy was between 2 and 4 mJ in a focal spot of about 1.3 μm x 1.7 μm full width at half maximum (FWHM). The X-ray laser pulses captured the particles in random orientations. The data offer insight into aerosolised virus particles in the gas phase, contain information relevant to improving experimental parameters, and provide a basis for developing algorithms for image analysis and reconstruction.


Methods
In single particle diffraction-before-destruction imaging experiments 8 , a sample, usually biological, is introduced into the focus of an XFEL beam where the X-ray fluence is high enough to destroy the sample with each pulse; however, the pulse duration is so short that this does not happen before a 2D diffraction pattern is formed. For samples that are small and non-crystalline, such as individual viruses or biomolecules, the scattered signal containing structural information is weak and often in a photon counting regime. However, using a continuously replenished stream of identical particles in random orientations, a 3D diffraction volume with sufficient signal-to-noise for structure determination can be composed from the individual measurements, provided the particle orientations can be determined and sufficient diffraction patterns have been measured. The 3D diffraction volume has a higher resolution than any given single diffraction pattern and can be inverted to form a real space representation of the average particle. Details of the methods used for sample preparation, sample delivery, instrumentation, and preliminary data analysis are described below.
Sample preparation. PR772 bacteriophage growth and purification were performed as previously described 6 . For completeness, we provide a brief overview of the process here. The samples were grown overnight in E. coli, then cultured onto hard agar plates and incubated overnight at 37 °C. The samples were then scraped from the plates, placed in a storage buffer consisting of 50 mM Tris, 100 mM NaCl, 1 mM MgSO4, 1 mM EDTA at a pH of 8.0, and incubated on a rocker overnight at 4 °C. The mixture was centrifuged at 8,000 g for 30 min to remove the agar and cell debris. The supernatant was then collected and filtered through a 0.2 μm filter. Viral particles were separated from the solution by PEG precipitation with PEG 8000 (9% w/v PEG + 5.8% w/v NaCl) and left to mix overnight on a rocker at 4 °C. After mixing, the precipitate was centrifuged for 90 min at 8,000 g at 4 °C to pellet the virus. The viral pellet was then suspended in the storage buffer. The sample was then purified on a Capto-Q anion exchange column using FPLC and eluted with NaCl (typical concentrations 750 to 900 mM). Just prior to sample injection, the PR772 virus particles were transferred from the storage buffer into a volatile ammonium acetate buffer (250 mM, pH 7.5) using PD10 desalting columns (GE Healthcare). Verification of the sample was conducted using electron microscopy and nanoparticle tracking analysis as shown in Fig. 1.
Sample delivery. For all datasets described here, PR772 bacteriophage was aerosolized using gas dynamic virtual nozzles (GDVN) 9,10 with helium as the nebuliser gas. For amo87215, amo06516, and amo11416 a glass GDVN nozzle was used (ground and polished with an outer diameter of 1.0 mm and an initial inner diameter of 0.78 mm). The glass GDVN nozzles were melted to create a much smaller inner diameter of order 15 to 20 μm. For amox34117 the nozzle was 3D printed via 2-photon polymerization photo-lithography with a Nanoscribe Professional GT printer 11 . These 3D printed nozzles (shown in Fig. 2) had an asymmetric "syringe tip" design featuring an elliptical liquid orifice with minor/major axis diameters of 23 μm and 68 μm, respectively, and an exit gas aperture of 60 μm. The virus particles were then passed through a differentially pumped skimmer that was used for pressure reduction (from atmospheric to typically 60 to 300 Pa at the exit of the skimmers). The skimmer is needed for the proper use of the particle focusing system and to limit the maximum sample chamber pressure to 4 × 10⁻³ Pa. The chamber pressure limit is required to reduce the background scattering from the carrier gas and to protect the detector from thermal drift and high voltage arcing. The samples were then focused into the sample chamber's interaction point of the X-ray instrument using an aerodynamic lens stack injector 4,5 .
Instrumentation. All four experiments were conducted at the LAMP endstation of the AMO instrument at the LCLS [12][13][14] . A schematic of these experiments is shown in Fig. 3. The instrument uses a pair of boron carbide coated Kirkpatrick-Baez (KB) mirrors capable of focusing the FEL beam to a nominal 1.5 μm diameter focal spot. Wavefront sensor measurements taken in 2017 show the focused X-ray beam to be nearly Gaussian in shape with a FWHM of 1.3 μm × 1.7 μm (vertical × horizontal).
Shot by shot X-ray pulse energies were measured with gas monitors 15 located upstream of the AMO optics. Measured pulse energies varied between 2 and 4 mJ per pulse and are included in the metadata for each diffraction image. It is noted that the X-ray optical transport system of the AMO instrument is not perfect and has been measured to be ~40% efficient. Background scatter, from the upstream optics and residual gas in the chamber, was reduced using a beveled silicon nitride 4-jaw slit followed by two motorized 1 mm × 1 mm opening silicon nitride apertures used to reduce scatter from the 4-jaw slit. The 4-jaw slit was located ~20 cm upstream of the focus and the two apertures were located ~15 cm and ~7 cm respectively upstream of the focus. Additionally, adjustable rolled B4C slits were used 2.0 m upstream of the KB mirrors to define the entrance aperture of the focusing system (not shown in Fig. 3).
Fig. 1 Sample verification of PR772 used in AMOX34117. (a) Nanoparticle tracking analysis conducted on PR772 to determine concentration and size. The first and dominant peak is at 82 nm, with a concentration of (2.4 ± 0.09) × 10⁸ particles/ml. The standard error is shown in blue. Note: the sample was diluted by 10⁴ to allow for a more accurate peak determination. (b) Negative stained transmission electron microscopy image of PR772. (c) Cryogenic transmission electron microscopy imaging of PR772 using a Krios electron microscope.
Initial alignment of the aerodynamic lens injector to the focal spot position was performed using the beamline alignment laser and a retractable alignment pin coated in a powdered phosphor to directly align the center of the injector with the X-ray focus. The injector was positioned 3 mm above the X-ray focus. Lateral scans of the injector were conducted for each experiment to optimize hit rates. The focus of the particle stream was found to be approximately 100 μm (full width at half maximum) with variation in focal spot size depending on inlet and chamber pressures.
The samples exiting the aerodynamic lens injector and entering the X-ray interaction region of the instrument are in random orientations and also enter the interaction point at random time intervals, as the aerodynamic lens does not align the particles in any particular orientation. Because the sample delivery focus was far wider than that of the X-ray pulses (as illustrated in the inset of Fig. 3), the majority of X-ray pulses miss the sample and do not interact with any particles. The LCLS provides 120 equally spaced X-ray pulses per second and typically ~1% of these will intersect with a sample, depending on the sample concentration, GDVN and skimmer operating conditions. Diffracted X-rays are collected downstream of the interaction point on two 512 × 1024 pixel pnCCD panels 16,17 . The detector consists of two panels which are movable jointly along the X-ray beam axis, Z, and the two panels can also be moved independently vertically, Y, with respect to the horizontal gap between the two detector panels. When no particle is present in the X-ray focus the measured intensity corresponds to instrument background due to scatter from residual gas, slits, and so on; however, when a sample particle interacts with the XFEL beam a coherent diffraction pattern is additionally measured on the detector. The position of both panels and the camera length of the detector from the interaction region were determined using the known diffraction of silver behenate prior to each experiment. An example of such a calibration is shown in Fig. 4.
An X-ray photon energy of 1.7 keV (0.73 nm) was used for most of the experiments reported here, except for during runs 38-58 of the amo87215 experiment where an X-ray photon energy of 1.2 keV (1.03 nm) was used (other runs in amo87215 were at 1.7 keV).
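The wavelengths quoted in parentheses follow from the photon energies through λ = hc/E. A minimal sketch of that conversion is given here as a check on the quoted values; the function and constant names are illustrative and are not part of any LCLS software.

```python
# Planck constant times the speed of light, in keV*nm (rounded CODATA value).
HC_KEV_NM = 1.23984198


def photon_wavelength_nm(energy_kev: float) -> float:
    """Return the photon wavelength in nm for a photon energy given in keV."""
    if energy_kev <= 0:
        raise ValueError("photon energy must be positive")
    return HC_KEV_NM / energy_kev


# The two photon energies used in the experiments described in the text.
for energy in (1.7, 1.2):
    print(f"{energy} keV -> {photon_wavelength_nm(energy):.2f} nm")
```

Rounded to two decimal places, this reproduces the 0.73 nm and 1.03 nm values stated in the text.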
Both the detector distance and the detector gap size have been optimized for the measurement of high resolution data throughout the experiments. The detector distance and the detector edge resolution for each experiment can be found in Table 1. Notice that, in amo11416, for runs 55 and 56, the gap size is different from the previous runs to reach a higher edge resolution of 2.8 nm.
Data processing. The pnCCD detector is an integrating detector that reads out the deposited charge incident on each pixel in analog-to-digital units (ADUs). Photon counting detectors cannot be used for this type of experiment due to the arrival of multiple photons in an individual pixel within the space of a few femtoseconds 18 . However, integrating detectors (such as the pnCCD) can still achieve single photon sensitivity under certain conditions. A series of corrections and calibrations are required in order to convert the data from ADUs to photon counts per pixel. In this report, we use psana, an LCLS software framework 19,20 , to retrieve the data, obtain the detector pixel positions, mask for bad pixels and apply various corrections to convert the ADUs into photon counts.
Corrections applied to the pnCCD data include (in order) pedestal subtraction, common-mode correction and gain correction followed by conversion to photon counts. As each photon strikes a given pixel, an electron cloud is generated in the substrate of the detector panel, with the number of electrons being proportional to the number of incident photons, the photon energy and the degree of charge sharing between neighbouring pixels. This current is then integrated to form the ADU count for that pixel. In Fig. 5 we show a histogram of the measured ADU counts from silicon fluorescence (Kα = 1.74 keV) after pedestal and common-mode correction (i.e. subtraction of average CCD dark current and voltage offsets). The modal ADU values corresponding to zero, one and two incident photons are situated at the peaks of the three Gaussian profiles (black dashed lines) with values of 0, 134 and 268 ADUs, respectively, for a gain setting of 4. The spread in the ADU values about these peaks is due to the stochastic nature of the pedestal, gain and charge sharing processes. Thus, simple division of ADUs by the mean ADUs-per-photon yields poor photon conversion. We used a psana built-in function 19,20 (detector.photons) to convert the ADUs into photon counts for each pixel, which accounts for charge sharing and incident photon energy.
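To illustrate why per-pixel division by the mean ADUs-per-photon converts poorly, the toy sketch below compares per-pixel rounding with pooling the charge of neighbouring pixels first. It is not the algorithm implemented by the psana detector.photons function; the function names and the three-pixel event are hypothetical, and only the 134 ADU one-photon value is taken from the text.

```python
ADU_PER_PHOTON = 134  # one-photon peak at gain setting 4, as quoted in the text


def naive_photons(adu_values):
    """Per-pixel conversion: round each pixel's ADU / ADUs-per-photon."""
    return [max(0, round(adu / ADU_PER_PHOTON)) for adu in adu_values]


def pooled_photons(adu_values):
    """Cluster conversion: pool the charge of neighbouring pixels, then round."""
    return max(0, round(sum(adu_values) / ADU_PER_PHOTON))


# Hypothetical event: the charge of one photon shared by three adjacent pixels.
shared_event = [45, 45, 44]

print("per-pixel rounding:", sum(naive_photons(shared_event)))
print("pooled charge:     ", pooled_photons(shared_event))
```

In this example the per-pixel conversion loses the photon entirely because each pixel falls below half of the one-photon value, whereas pooling the shared charge recovers it.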
Hit rates in these experiments were typically ~1% as previously mentioned. Hits are defined as frames containing discernible diffraction from the sample, which are identified as frames with significantly elevated diffraction intensity. This process is accomplished using the program psocake 19,20 . First, one designs a mask for each run defining bad regions, usually blocking the zeroth order diffraction fringe, pixels too far away from the diffraction center and other "bad" regions in the detector where there is significant instrument scattering or there are readout issues with specific pixels. The total photon number in the remaining region is calculated, and then patterns are sorted according to the total photon counts per frame as shown in Figs. 6, 7, 8, and 9. The threshold at which to stop accepting frames is then determined by inspection of individual data frames from high intensity to lower intensity. Below a certain number of photons in the region of interest, the diffraction fringes are no longer visible. When diffraction fringes are no longer visible by eye, the image is considered to contain too little data to be classified as a hit and is classified as empty or blank for preliminary processing. Frames with higher total photon counts than that value are considered hits and retained for subsequent analysis.
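The thresholding step described above can be sketched as follows. This is an illustration of the idea only, not the psocake implementation; the function names, the 2 × 2 toy frames and the threshold value are hypothetical, and in practice the threshold was chosen by visual inspection.

```python
def total_photons(frame, mask):
    """Sum photon counts over the pixels where the mask is 1 (good pixels)."""
    return sum(
        photons
        for frame_row, mask_row in zip(frame, mask)
        for photons, good in zip(frame_row, mask_row)
        if good
    )


def find_hits(frames, mask, threshold):
    """Return (frame index, total photons) for hits, brightest frame first."""
    totals = [(i, total_photons(frame, mask)) for i, frame in enumerate(frames)]
    hits = [(i, total) for i, total in totals if total > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy data: the top-left pixel is masked out (e.g. the central fringe).
mask = [[0, 1],
        [1, 1]]
frames = [
    [[9, 0], [1, 0]],  # bright only in the masked pixel -> not a hit
    [[0, 3], [2, 4]],  # 9 photons in the region of interest
    [[5, 1], [2, 2]],  # 5 photons in the region of interest
]

print(find_hits(frames, mask, threshold=4))
```

Sorting the accepted frames by total photon count mirrors the inspection order from high to low intensity used when setting the threshold.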
Not all the patterns retained above are valid diffraction patterns from a single PR772 virus particle. These patterns are further classified manually to separate the single-hit patterns from those arising from clusters of PR772 virus particles. This clustering occurs when two or more PR772 virus particles are contained in a single aerosolization droplet, causing the viruses to stick together. A trade-off between higher isolated-particle hit rates and a higher number of clusters is observed, because increasing the hit rate usually requires changing the sample concentration or GDVN conditions in a direction that also increases the probability of multiple particles occupying one aerosolization droplet. It is acknowledged that this analysis process is influenced by human bias; however, it is relatively straightforward to distinguish good single-hit patterns from the others for PR772 particles when the intensity is high enough, because the pseudo-icosahedral symmetry of the PR772 shell produces a distinct diffraction pattern at low diffraction angles.

Fig. 5
Calibration of pnCCD detector for ADUs per photon using silicon fluorescence (Kα = 1.74 keV) during the amo06516 experiment. Shown is a histogram of the average number of ADUs and the average number of pixels per image giving the ADU value averaged over 10,000 data frames/readouts. The fluence in the calibration was kept low so there was less than one 2 photon event per collected frame. The 1 photon peak was found to be 134 ADUs with a width of σ = 9.7 ADUs, while the 2 photon peak was found to be 268 ADUs with a width of σ = 15 ADUs. It is noted that there is a significant number of pixels with ADU values between 0 and 1 photon. These events are due to charge sharing between pixels. This happens when a photon strikes close enough to the edge of a pixel that the resulting electron cloud is shared between pixels.

Data Records
We provide access to the experiment data, both in the native file format used by the LCLS and in the CXI file format 21 . The LCLS stores beamtime data in the XTC format, which is optimised for sequential reading and writing. The XTC files contain the unprocessed "raw" detector data and metadata for every event in the selected experiment runs. Instructions for extracting data from XTC formatted files can be found at the LCLS data analysis website: https://confluence.slac.stanford.edu/display/PSDM/LCLS+Data+Analysis. The CXI format is based on the popular HDF5 format, which is a self-describing container for multidimensional data structures. The CXI format can be understood as simply a set of conventions for storing scientific data relating to coherent X-ray imaging in an HDF5 file. The CXI files contain the processed and selected diffraction patterns following version 1.6 of the standard, as shown in Fig. 10. There is one CXI file per experiment. The data corresponding to the nth selected experiment run is stored in a separate "entry" /entry_n; for example, the data for run 90 of the AMO06516 experiment is stored in /entry_1 of the file amo06516.cxi, since this is the first run that has been selected from that experiment.
Fig. 9 Histograms and typical single hits for experiment AMOX34117. (a) The histogram of the total photon counts of the single hit patterns in this experiment. (b-e) Each is a random pattern selected from the 1st, 3rd, 5th and 7th column in the histogram. The boundary is colored with the same color as that of the corresponding column. Single hit patterns are rendered with the matplotlib.pyplot.imshow function with color map "jet" and vmax = 4. Before rendering, the photon count patterns are first down-sampled 4-by-4.
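The /entry_n convention for the CXI files can be sketched as a small lookup from run number to HDF5 path. The helper below assumes that entries are numbered in ascending order of the selected run numbers, which is consistent with the run 90 example but is not stated explicitly; the function name and the example run list are hypothetical, and the default dataset name is the diffraction dataset given in the file description.

```python
def entry_path(selected_runs, run, dataset="data_1/data"):
    """Return the HDF5 path of `dataset` for `run` within one experiment file.

    Assumes entries are numbered from 1 in ascending run order (an assumption).
    Raises ValueError if the run is not among the selected runs.
    """
    ordered = sorted(selected_runs)
    entry_number = ordered.index(run) + 1
    return f"/entry_{entry_number}/{dataset}"


# Hypothetical selection of runs from one experiment, starting at run 90.
example_runs = [90, 91, 94]

print(entry_path(example_runs, 90))
print(entry_path(example_runs, 94, dataset="data_1/mask"))
```

The resulting path can then be passed to any HDF5 reader to load the frames for a given run.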
Single  106   101   12    60    22    475   128   70    189   200   29    67    300
Total   1122  984   217   902   379   6850  1938  1009  1396  2723  289   900   3238
Run     106   107   108   109   111   113   114   116   117   118   119   121   122
Single  74    481   484   409   461   3     376   487   438   406   375   432   410
Total   708   4681  4711  4155  4088  26    3028  3759  3592  3404  3022  2945  3364
Run     123   124   126   127   128   129   132   133   137   138   143
Single  355   385   355   350   369   13    395   201   0     6     9
Total   3373  2705  2511  4009  3786  287   2716  1681  1     26    71
AMO11416
Run     38    42    44    45    46    47    48    49    50    55    56
Single  1     1     1     0     0     0     6     83    119   128   135
Total   964   257   368   117   121   3     190   1232  1294  2336  1324
AMOX34117
Run     130   131   132   133   134   135   136   141   147   148   149   150   151
Single  18    19    19    2     4     25    1     0     0     0     0     1     0
Total   379   507   521   108   280   1598  126   111   460   494   165   1570  1044
Run     152   153   154   155   156   157   158   159   160   163   164   165   168
Single  0     0     0     0     0     0     0     0     0     0     0     0     0
Total   194   351   376   750   1437  61    231   94
Fig. 11 Pseudo SAXS patterns for six different configurations; (first and third rows) pseudo 1D SAXS profile, with the x-axis scaled to resolution in nm, and the y-axis in arbitrary units. (second and fourth rows) 2D summed SAXS patterns from single-hits after mapping the detector panels to x-y coordinates in the laboratory frame. Note: the red circles are to show the center of the pattern and the tile locations and not resolution. As all of the images are of the same size PR772 virus capsid the resolution of the diffraction speckle fringes is an indication of the camera length and hence resolution.
(2020) 7:404 | https://doi.org/10.1038/s41597-020-00745-2 | www.nature.com/scientificdata
The pnCCD detector 16 used to collect these data is composed of 2 panels, as stated above, with two readout electronic back-ends per panel (each containing 4 analogue to digital converters). Each readout is composed of a 2D pixel array of shape 512 × 512. In the stack format, the recorded image data are presented in an array with a shape of (4, 512, 512). In this 3D array, the first index is the index of the electronic readout, and the last two are the indexes of a specific pixel in that panel. When one would like to represent the actual spatial arrangement of the pixels with a 2D array, one can use psana functions to assemble arrays in the stack format and obtain the corresponding array in the 2D format. Alternatively, one can use the corner_positions and basis_vectors datasets to determine the x and y coordinates of each pixel, as documented in the CXIDB file description. In the CXI file, this diffraction data (after conversion to photon counts) is stored in the data set /entry_n/data_1/data, which is an N × 4 × 512 × 512 unsigned 16 bit integer dataset, where N is the number of frames in the experiment run.
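As a sketch of how pixel coordinates might be derived from corner_positions and basis_vectors, the example below assumes that each 512 × 512 readout has one corner position and two basis vectors (one step along the slow-scan index and one along the fast-scan index). That layout, the function name and all numerical values are assumptions made for illustration; the CXIDB file description remains the authority on the actual array shapes and units.

```python
def pixel_position(corner, slow_vector, fast_vector, i, j):
    """Position of pixel (i, j) in one readout: corner + i*slow + j*fast."""
    return tuple(
        c + i * s + j * f
        for c, s, f in zip(corner, slow_vector, fast_vector)
    )


# Hypothetical geometry for a single readout, in micrometres for readability.
corner = (-38400, 1000, 0)  # position of pixel (0, 0)
slow_vector = (0, 75, 0)    # step per slow-scan index (assumed 75 um pixels)
fast_vector = (75, 0, 0)    # step per fast-scan index

# In the stack format a pixel is addressed as (readout, i, j); here we locate
# pixel (i=2, j=3) of the readout described by the geometry above.
print(pixel_position(corner, slow_vector, fast_vector, 2, 3))
```

Applying the same function to every (i, j) of each of the four readouts yields the x and y coordinates needed to place the stack-format data in a 2D laboratory-frame image.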
In addition to the diffraction data, the datasets energy, pulse_energy and pulse_length contain the X-ray pulse properties, basis_vectors and corner_positions contain the detector geometry, mask contains the detector mask and tags contains the image classification labels (1 if the diffraction was deemed to have originated from an isolated PR772 particle and 0 otherwise). For a detailed explanation of these datasets, see the version 1.6 format description 21 .
Data access. All datasets described above are deposited in the Coherent X-ray Imaging Data Bank (CXIDB) 21 in the CXIDB data format 7 .
Data statistics. The run number range, total hit number, single hit number and the single hit to total hit number ratio are summarized in Table 1. The hit threshold (the number of measured photons required for a frame to be classified as a "hit") for amox34117 was set to a lower value than for the other experiments, which causes the drastic drop in the single to total hit number ratio.
The detailed distribution of total hits and single hits during each run is summarized in Table 2.
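As a worked example of the single hit to total hit ratio, the sketch below sums the per-run counts for AMO11416 as transcribed from the per-run table. The function name is illustrative, and the ratio reported in Table 1 may be defined over a different selection of runs.

```python
# Per-run hit counts for AMO11416, transcribed from the per-run table.
runs = [38, 42, 44, 45, 46, 47, 48, 49, 50, 55, 56]
singles = [1, 1, 1, 0, 0, 0, 6, 83, 119, 128, 135]
totals = [964, 257, 368, 117, 121, 3, 190, 1232, 1294, 2336, 1324]


def single_hit_ratio(single_counts, total_counts):
    """Fraction of all hits that were classified as single hits."""
    total = sum(total_counts)
    if total == 0:
        raise ValueError("no hits recorded")
    return sum(single_counts) / total


print("single hits:", sum(singles))
print("total hits: ", sum(totals))
print(f"ratio:       {single_hit_ratio(singles, totals):.3f}")

# Per-run ratios show how strongly the single hit yield varies between runs.
for run, single, total in zip(runs, singles, totals):
    print(run, f"{single / total:.3f}")
```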

Technical Validation
As a measure of the reliability of the datasets, all single-hits from each experiment were summed to form pseudo small angle X-ray scattering (SAXS) patterns (see the first and third rows of Fig. 11). These SAXS patterns are calculated as a function of resolution, accounting for the missing diffraction data and changing detector distance in each dataset, so that one can compare the SAXS profiles across the 6 groups of data. The second and fourth rows of Fig. 11 show the 2D summed images corresponding to each of the 1D pseudo SAXS profiles. In these summed patterns background and detector artifacts are observable. It is noted that for amo87215 one of the panels had an issue with the readout electronics so that two of the analogue to digital converters read out at different gain levels. For amo06516 there was a gap in the scatter shield of the second aperture, resulting in an increased level of beamline background signal in the unshielded area, located on the side of the detector (upper part of the image). For amo11416 an analogue to digital converter readout gain issue, similar to that in amo87215, is also observed. Additionally, after run 55 one can observe the increase in the gap of the detector to allow one of the panels to obtain higher resolution. For amox34117 the center four analogue to digital converter readouts on one of the panels were not operational.
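A pseudo SAXS profile of this kind can be sketched as an azimuthal average over unmasked pixels, with the detector radius converted to resolution through d = λ / (2 sin(θ/2)), where θ is the scattering angle. The sketch below is an illustration under stated assumptions and not the code used to produce Fig. 11; the function names, the 3 × 3 toy image, the 75 μm pixel size and the 0.13 m detector distance are all hypothetical.

```python
import math


def radial_profile(image, mask, center, n_bins, r_max):
    """Average unmasked pixels in n_bins radial bins out to r_max (pixels)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    center_y, center_x = center
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if not mask[y][x]:
                continue  # skip bad or missing pixels
            radius = math.hypot(y - center_y, x - center_x)
            bin_index = int(radius / r_max * n_bins)
            if bin_index < n_bins:
                sums[bin_index] += value
                counts[bin_index] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


def resolution_nm(radius_px, pixel_size_m, distance_m, wavelength_nm):
    """Full-period resolution d = lambda / (2 sin(theta / 2)) at a radius."""
    if radius_px <= 0:
        raise ValueError("radius must be positive")
    theta = math.atan(radius_px * pixel_size_m / distance_m)
    return wavelength_nm / (2.0 * math.sin(theta / 2.0))


# Toy summed pattern: bright center, weaker ring, all pixels valid.
image = [[1, 2, 1],
         [2, 10, 2],
         [1, 2, 1]]
mask = [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]

print(radial_profile(image, mask, center=(1, 1), n_bins=2, r_max=2))
print(f"{resolution_nm(100, 75e-6, 0.13, 0.73):.1f} nm at 100 pixels")
```

Expressing each profile against resolution rather than pixel radius is what allows datasets recorded at different detector distances to be compared on a common axis.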

Usage Notes
The dataset contains the recorded data during the experiment in both XTC and CXIDB formats. The dataset also contains a set of pre-selected hits and metadata as described in this paper. XTC files are the native format of LCLS and can be read using analysis frameworks provided by the LCLS (see https://confluence.slac.stanford.edu/display/PSDM/LCLS+Data+Analysis).