An open-source, end-to-end workflow for multidimensional photoemission spectroscopy

Characterization of the electronic band structure of solid state materials is routinely performed using photoemission spectroscopy. Recent advancements in short-wavelength light sources and electron detectors give rise to multidimensional photoemission spectroscopy, allowing parallel measurements of the electron spectral function simultaneously in energy, two momentum components and additional physical parameters with single-event detection capability. Efficient processing of the photoelectron event streams at a rate of up to tens of megabytes per second will enable rapid band mapping for materials characterization. We describe an open-source workflow that allows user interaction with billion-count single-electron events in photoemission band mapping experiments, compatible with beamlines at 3rd and 4rd generation light sources and table-top laser-based setups. The workflow offers an end-to-end recipe from distributed operations on single-event data to structured formats for downstream scientific tasks and storage to materials science database integration. Both the workflow and processed data can be archived for reuse, providing the infrastructure for documenting the provenance and lineage of photoemission data for future high-throughput experiments.


Introduction
Many disciplines in the natural sciences are increasingly dealing with densely sampled multidimensional datasets. The scientific workflows to obtain and process them are becoming increasingly complex due to the provenance and structure of the data and the information needed to be extracted and analyzed 1,2 . In materials science and condensed matter physics, various spectroscopic and structural characterization techniques produce experimental data of distinct formats and characteristics. Their creation and understanding require customized processing and analysis pipelines designed by specialists in the respective fields. The growing incentive for building experimental materials science databases 3 that complement established theoretical counterparts 4 calls for open-source and reusable workflows for data processing 5,6 that transform raw data to shareable formats for downstream query, analysis and comparison by non-specialists of the experimental techniques 7,8 . Among the various properties associated with materials, the electronic band structure (EBS) of condensed matter systems is of vital importance to the understanding of their electronic properties in and out of equilibrium. Multidimensional photoemission spectroscopy (MPES) [9][10][11] is an emerging technique that bears the potential of high-throughput EBS characterization through band mapping experiments and holds promise as an enabling technology for building experimental EBS databases, where data integration requires traceable knowledge of the processing steps between the archived and the raw format. Here we present an open-source workflow that focuses on band mapping data from MPES. In the following, we briefly introduce the technology of MPES and the associated data processing, before providing details on the workflow from raw data to database integration.

Results
Workflow description. The workflow schematic shown in Fig. 1 starts with raw single-event data from measurements. The data are (i) binned in a distributed fashion in the measurement coordinates, including each of the photoelectrons' position on the detector (X, Y), its TOF, a digital encoder (ENC) axis, and others, if more than four dimensions are acquired in parallel. The binned histogram is (ii) used to estimate the numerical transforms for distortion correction and axis calibration. Next, these transforms are (iii) applied to the raw single-event data to convert the measurement coordinates to the physical axes, (k x , k y , E, t pp ) and others for higher dimensions (see also Fig. 2). Finally, the single-event data are (iv) binned in the transformed grid to yield 3D, 3D + t or other higher-dimensional data with the correct axis values. The outcome may be exported in different formats for storage, visualization and downstream analysis.
Tasks and software infrastructure. Processing billion-count single-event data requires user interaction for data checking and distributed processing to reduce the time consumption. The general tasks in the workflow include the transformation of data streams to multidimensional histograms, artifact correction and axis calibration. These operations can be efficiently decomposed into column-wise operations of the distributed dataframe format offered by the dask package 37 in Python. While the use of dask dataframes provides the common foundation for interactivity with single events in hextof-processor and mpes, they distinguish themselves by the experimental requirements.
At large-scale facilities, experiments often record a large number of machine parameters that need to be stored, though only a small number of relevant parameters are needed for downstream processing. Therefore, the hextof-processor package includes a parameter sampling step to retrieve intermediate tabulated data in the Apache Parquet format (https://parquet.apache.org/), a column-based data structure optimized for computational efficiency. This approach reduces the processing overhead in searching through the raw data files every time when data are queried during the subsequent processing. As an open-source project, other beamtime-specific functionalities are added by users to the existing framework at every new experimental run. The mpes package adapts to the much simpler file structure produced at table-top experimental setups and makes direct use of the HDF5 raw data. It comes with added functionalities motivated by the existing issues encountered in data acquisition and downstream processing such as axis calibration, masking, alignment and different forms of artifact correction. The softwares come with detailed documentation and examples online for users to gain familiarity (see Code availability).
Artifact correction. Artifacts in MPES data come from mechanical imperfections, stray fields (electric and magnetic), uncertainties in the alignment of the sample, light beams and the multistage electron-optic lens systems as well as the data digitization process. Minimizing and correcting instrumental imperfections plays an important role in the validity of downstream analysis. We carry out artifact correction sequentially at the level of single photoelectron events or the data hypervolume obtained from multidimensional histogramming (see Fig. 2). The outcomes are illustrated using the correction of (1) digitization artifact (see Fig. 3) and (2) spherical timing aberration artifact (see Fig. 4), with technical details in Methods.

Axis calibration.
To transform the measurement axes of the DLD into physically relevant axes for electronic band mapping, calibrations are required, as shown in Fig. 2  (2) Single-event data acquisition is monitored and controlled by the measurement controller computer. The raw data are first streamed and stored onto a hard drive in HDF5 format (.h5) and subsequently processed in the workflow through (3) file reduction (optional), (4)(6) distributed binning, (5) artifact correction and axis calibrations, carried out at the singleevent or the binned data levels. At the end of the workflow, other data formats are generated (such as HDF5, MAT or TIFF) for (7) storage, visualization or downstream analysis for extracting relevant physical parameters. Critical parameters within the workflow may be exported (as workflow parameters files), shared and reused for processing other datasets.
www.nature.com/scientificdata www.nature.com/scientificdata/ position) with the corresponding scales in data. They are applied either to the binned data hypervolume, or to the single-electron events raw data individually in a distributed fashion before binning. Details on the calibration data transforms are provided in Methods.
Data storage and format. The simplistic form of the output data hypervolume derived from single-electron events includes non-negative scalar values of the photoemission intensity and the calibrated real-valued axes coordinates, including k x , k y , E, and other parameter dependencies such as the pump-probe time delay t pp . These values are exported as HDF5, MAT or TIFF, with the metadata included as attributes of the files.
Workflow archiving and reuse. Computational workflows are valued by their reproducibility 38 . Archiving and sharing the workflow parameters among users of the beamlines or facilities allow comparison between experimental runs and reuse for the simultaneous benefits of machine diagnostics and experimental efficiency. To achieve this, we store critical parameters generated within the workflow in a separate file as workflow parameters (see Fig. 1) during each step, including the numerical values used in binning, the intermediate parameters and coefficients of the correction and calibration functions, etc. They can be reused when loading into the processing of other datasets. Data visualization. The adaptation of established scientific visualization methods in the physical sciences 39,40 to band mapping data should incorporate the requirements and knowledge of the data characteristics in this field of research. The band mapping data in 3D (multi-megavoxel) and 3D + t (multi-gigavoxel) include the inherent symmetries from the electronic band structure of the material, but the intensity modulations in the photoemission process 41 , dynamics and sample condition disrupt the original symmetry. The overall goal is to emphasize the features of interest while exploiting the symmetry to simplify the visualization (see Methods). The output files from the processing pipeline are compatible with open-source visualization software such as matplotlib 42 , ParaView 39 and Blender 43 .

Downstream analysis integration.
Typical photoemission data analysis involves extracting electronic band structure parameters, physical coupling constants and lifetimes via fitting of lineshapes 16 or dynamical models 44 , often carried out specific to the material under study. At the end of our distributed workflow, the data size is on the order of a few to tens of gigabytes, which can be directly loaded into memory on users' local machines for downstream data analysis with custom routines.
Experimental metadata. The metadata of the data files have a tree structure and contain information of the experimental setting, parameters of the pulsed light source, the detector and the sample under study. A list of top-level metadata parameters is presented in Table 1. A full and current list of all metadata parameters, including the top-level parameters and their constituent lower-level parameters, along with their definitions, units and values, is provided in Supplementary Tables 1-4. For database integration, an accompanying data parser (parser-mpes, see Code availability) for MPES data has been written in accordance with existing standards 45 for computational materials science in NOMAD 8 , featuring an electronic version of the metadata parameter list in the file mpes.nomadmetainfo.json online. The metadata parameter list and the data parser are versioned and are updated based on the corresponding changes in the data structure for photoemission spectroscopy experiments. The existing WSe 2 photoemission data have been integrated into the experimental section of the materials science database NOMAD (see Data availability).  Comparison of the W4f spectra at the center and on the edge of the detector plane. The energy spectra are extracted from the corresponding regions, marked by the dots in the same blue and red colors, respectively, in (c). The white stripes crossing at the detector center block the exposed edges of the four-quadrant detector quadrants. (d). The uncorrected and corrected radial-averaged peak TOF positions for the W4f 7/2 core level.

Category name Description
General parameters Descriptive information of the experiment and facility www.nature.com/scientificdata www.nature.com/scientificdata/

Discussion
We have designed and implemented an open-source, end-to-end workflow for processing single-event data produced in multidimensional photoemission spectroscopy, linking to downstream tasks, providing guidelines and software for integrating processed data into the NOMAD experimental materials science database. The distributed processing takes full advantage of the single-event data streams directly accessible from the TOF delay-line detector for event-wise correction and calibration and converts the raw events to the calibrated data hypervolume for project-specific downstream analysis. The functionalities within the workflow are publicly accessible through the software packages we have developed (hextof-processor 30 and mpes 31 ). The processing workflow is archived at each step of operation and the processed data may be integrated into experimental database with user-specified metadata. The methods described here are applicable to all existing types of multidimensional photoemission band mapping measurements beyond the static and time-dependent settings described here.
Our end-to-end workflow from raw data to processed data to database integration provides a fast-track and all-in-one solution to the demands for open experimental data and reproducible research in the materials science community 7,8 . The public repositories for the software packages are the foundations for phased future extension and integration with existing analytical tools in the photoemission spectroscopy community. The modular structure of the packages introduced here allows targeted upgrades by both temporary and dedicated maintainers and users. Casting the workflow in the Python programming environment provides the foundation for convenient incorporation of existing image processing and machine-learning resources 46 for further exploration and understanding of the band mapping datasets, which contain rich information owing to the complex nature of the photoemission process 16,18 . This is especially beneficial for broader adoption of photoemission since the interpretation of photoemission data is often linked to the observed or extracted outstanding features such as local intensity extrema, dispersion kinks and satellites, lineshape parameters and pattern symmetry 16 , therefore, the access to experimental data and the potential integration with existing electronic structure-related software 5,47-49 will facilitate method developments and the direct comparison between experimental results and theoretical band structure calculations within the same programming platform.

Sample preparation.
Single-crystalline samples of 2D bulk WSe 2 (2H stacking) were purchased from HQ Graphene. Crystals of size around 5 mm × 5 mm × 1 mm were used directly for the measurements. To prepare a clean surface by cleaving, we attached a cleaving pin upright to the sample surface using conducting epoxy (EPOTEC H20) outside the vacuum chamber and removed the pin by mechanical force in ultrahigh vacuum.
Photoemission experiments. The measurements were conducted using the HEXTOF instrument 24 at the DESY FLASH PG-2 beamline 50 with the free-electron laser (FEL) as well as a laboratory source 21 with a METIS electron momentum microscope (SPECS METIS 1000) installed at the FHI. In the measurements at FLASH, the FEL was tuned to 36.5 eV (or 34.0 nm) and 109 eV (or 11.4 nm), the optical pump pulse had a center wavelength of 775 nm. The measurements at the FHI used a 21.7 eV home-built extreme UV source based on high harmonic generation in Ar gas driven by an optical parametric chirped-pulse amplifier operating at 500 kHz repetition rate 51 . The optical pump pulse is centered at 800 nm. In both FEL and laboratory experiments, the near-infrared light pulses promote the electronic population at the K and K′ high-symmetry points (corresponding to K and ′ K points, respectively, in the projected Brillouin zone obtained from photoemission, as shown in Fig. 5) in momentum space to the excited states via direct optical transitions. The nonequilibrium electronic dynamics are probed via valence and conduction band photoemission 35 as well as core-level photoemission 36 , using s-polarized extreme UV and soft X-ray probe pulses, respectively. Digitization artifact. The time-to-digital converter (TDC) outputs digitized data according to the binning width of the on-board electronics. Data conversion from one digitized format to another in a rebinning process often creates a picket fence-like effect (see Fig. 3). This phenomenon originates from the incommensurate bin size in the two rounds of sampling processes (binning and rebinning). To solve the problem, one introduces a slight amount of uniformly distributed noise, with an amplitude equal to half of the original bin size, to the single-event values when carrying out the bin counts. This is similar to the histogram jittering (or dithering) technique 52,53 used in statistical visualization and computer graphics. Mathematically, the uniformly distributed noise U(0,1) bounded in the range [0,1] is added before binning to a univariate data stream, S = {S i } via, here, w b is the bin width. For binning of multivariate data streams, such as the detector X position (or k x ), Y position (or k y ), and the photoelectron TOF (or E), we adopt the same approach individually for each dimension. The effect of jittering in reducing the digitization artifact is demonstrated in Fig. 3.

Spherical timing aberration.
Electrons entering the TOF tube at different lateral positions travel through different path lengths to reach the detector, which is the origin of the spherical timing aberration as illustrated in Fig. 4. The lateral position-dependent time delay may be expressed as, where r is the radial distance from the center of the DLD and TOF 0 is the TOF normalization constant. For a typical field-free region length of d∼1 m in the TOF tube and a DLD screen radius of r = 50 mm, Δ ≈ . × − TOF/TOF 1 25 10 0 3 . Assuming TOF 0 = 0.5 μs, the spherical timing aberration in TOF scale is www.nature.com/scientificdata www.nature.com/scientificdata/ Δ ≈ . TOF 062 sph ns, which is larger than the DLD's temporal resolution of ∼ 0.15 ns. The effect of the spherical timing aberration is visible for a few eV energy range with fine bins but quite small on a large energy range. To illustrate this effect, we use the W 4f core-level data presented in Fig. 4b. For every (X, Y) position on the detector the peak of W 4f 7/2 was fitted with a Voigt profile and the peak positions are shown in Fig. 4c. As the spectra from deep core levels typically do not show dispersion, the deviation from fitting corresponds to the spherical timing aberration of the electron optics. In order to compensate for the spherical timing aberration, we first transform the data from Cartesian to the polar coordinates (see Fig. 4c), and then fit the radial-averaged peak position to a polynomial function of the radius, The fitting results together with the corrected radial distribution are presented in Fig. 4d. Symmetry distortion. Photoemission patterns in the (k x , k y ) plane (i.e. an energy slice) may exhibit distorted symmetry due to the influence of various factors from the instrument, the sample and the experimental geometry on the trajectory of low-energy photoelectrons. Correction of the symmetry distortion yet preserving the intensity features requires the use of symmetry-related landmarks to solve for the symmetrization coordinate transform in the framework of nonrigid image registration 54 . In typical situations with an excellent electron lens alignment, the energy dependence of the momentum distortion within the focused phase space volume covering an energy range of several eV is negligible, so the same coordinate transform can be applied to all energy slices in the volumetric data (including both valence and conduction bands) or simultaneously to all single events. www.nature.com/scientificdata www.nature.com/scientificdata/ Other single-experiment artifacts. (1) Momentum center shift: The momentum center of the emergent photoelectrons travelling through the electron-optic system may experience an energy-dependent shift owing to the slight misalignment in the system or the influence of stray fields. Correction of the center shift requires an energy-dependent center alignment of energy slices. The shift along the energy (or TOF) axis may be estimated using phase correlation 55 or mutual information-based 56 sequential image registration methods, in which the series of energy slices are treated as an image sequence. In a well-shielded and well-aligned electron-optic lens system, generally, the momentum center shift is negligible in the focused photoelectron energy range. (2) Space-charge effect (SCE): The secondary photoelectron clouds originating from the probe and pump pulses cause a "doming effect" of the photoemission intensity distribution around the momentum center of the band structure. This is especially visible in systems with a clear Fermi edge 9,11 or non-dispersing shallow core levels, which may be used as references for calibrating the parameters used for the flattening transform by applying a momentum-dependent shift Δ k k TOF ( , ) x y sc in the TOF (or the calibrated energy) coordinate of the single-event data.
Momentum calibration. The scaling factors for momentum calibration are computed by comparing the positions of known high symmetry points in the band structure with their corresponding locations in an energy slice. Suppose A and B are two high symmetry points identifiable (e.g. as local extrema) from the experimental data with pixel positions (X A , Y A ) and (X B , Y B ), and momentum positions, (k x A , k y A ) and (k x B , k y B ), respectively. We calculate the pixel-to-momentum scaling ratios, f X and f Y , along the X (column) and Y (row) directions of a 2D k-space image, respectively. Then, the momentum coordinate (k x , k y ) at each pixel position (X, Y) may be derived.
Energy calibration. The calibration requires a set of band mapping data measured at different bias voltages (applied between the material sample and the ground), usually sampled with a spacing of 0.5 V in a range of ± 3-5 V around the normally applied bias voltage for a particular sample. The calibration proceeds by finding the TOF feature (e.g. local extrema) correspondences in the 1D energy distribution curves (EDCs) at different biases using the dynamic time warping algorithm 57 . The transformation from the TOF to the photoelectron energy E is approximated as a polynomial function, The approximation is sufficiently accurate within a range of ∼ 20 eV, sufficient to cover the entire valence band and some low-lying conduction bands of typical materials. The polynomial coefficients are determined using nonlinear least squares by solving Δ ⋅ = Δ a T E , in which = ... a a a ( , , ) T 1 2 is the coefficient vector while the constant offset a 0 is determined by manual alignment to an energy reference, such as the Fermi level or valence band maximum. The vector ΔE and the matrix ΔT contain, respectively, the pairwise differences of the bias voltages and the polynomial terms of differential TOF values. To calibrate a large energy range including multiple core levels, a piecewise polynomial may be used 11 .
Pump-probe delay calibration. The time origin ("time zero") in time-resolved photoemission spectroscopy, i.e. the temporal overlap of pump and probe pulses, is determined by fitting of a characteristic trace extracted from the data. Since the readings of the digital encoder (see Fig. 2) are sampled linearly, equally-spaced pump-probe delays are directly convertible from the readings using linear interpolation, given the boundary values of the translation stage positions and the corresponding delay times. For unequally-spaced delays, a delay marker is first added to each data point as a separate column after data acquisition to group together the encoder reading ranges that correspond to the specific time delays. The data binning is carried out over the delay marker column instead of the equally-sampled encoder readings.
Visualization strategies. We discuss here three methods for the display of volumetric band mapping data, which are, at the same time, the basis for visualizing 3D + t data with time as an animated axis. (1) The orthoslice representation includes orthogonal 2D planes selected in specific regions in the 3D volume 39 , which highlights specific slices deep within the data less visible in a volumetrically rendered view (see Fig. 5a). Along this line, we have developed a software package, 4Dview 58 , to explore 4D data using simultaneously linked orthoslices, which also features contrast adjustment and data integration within a hypervolume of interest. (2) The band-path plot (see Fig. 5b) is a 2D representation of the 3D band mapping volume generated by combining a series of 2D cuts along selected momentum paths (or k-paths) traversing a list of so-called high-symmetry points 59,60 . This representation captures the largest dispersion within the band structure. For volumetric data, the same path may be sampled from all the full energy range to produce the plot shown in Fig. 5b. The analysis and visualization modules in the mpes package include functionalities to compose customized band-path plots. (3) The cut-out view (see Fig. 5c) exposes a specific part of interest in the volumetric data, while not losing the rest 39 . The analysis module in the mpes package provides ways to generate precise cut-outs using position landmarks (e.g. high-symmetry points labelled in Fig. 5) and inequalities.