# An open-source, end-to-end workflow for multidimensional photoemission spectroscopy

## Abstract

Characterization of the electronic band structure of solid state materials is routinely performed using photoemission spectroscopy. Recent advancements in short-wavelength light sources and electron detectors give rise to multidimensional photoemission spectroscopy, allowing parallel measurements of the electron spectral function simultaneously in energy, two momentum components and additional physical parameters with single-event detection capability. Efficient processing of the photoelectron event streams at a rate of up to tens of megabytes per second will enable rapid band mapping for materials characterization. We describe an open-source workflow that allows user interaction with billion-count single-electron events in photoemission band mapping experiments, compatible with beamlines at 3rd and 4th generation light sources and table-top laser-based setups. The workflow offers an end-to-end recipe from distributed operations on single-event data to structured formats for downstream scientific tasks and storage to materials science database integration. Both the workflow and processed data can be archived for reuse, providing the infrastructure for documenting the provenance and lineage of photoemission data for future high-throughput experiments.

## Introduction

Many disciplines in the natural sciences are increasingly dealing with densely sampled multidimensional datasets. The scientific workflows to obtain and process them are becoming increasingly complex due to the provenance and structure of the data and the information needed to be extracted and analyzed1,2. In materials science and condensed matter physics, various spectroscopic and structural characterization techniques produce experimental data of distinct formats and characteristics. Their creation and understanding require customized processing and analysis pipelines designed by specialists in the respective fields. The growing incentive for building experimental materials science databases3 that complement established theoretical counterparts4 calls for open-source and reusable workflows for data processing5,6 that transform raw data to shareable formats for downstream query, analysis and comparison by non-specialists of the experimental techniques7,8. Among the various properties associated with materials, the electronic band structure (EBS) of condensed matter systems is of vital importance to the understanding of their electronic properties in and out of equilibrium. Multidimensional photoemission spectroscopy (MPES)9,10,11 is an emerging technique that bears the potential of high-throughput EBS characterization through band mapping experiments and holds promise as an enabling technology for building experimental EBS databases, where data integration requires traceable knowledge of the processing steps between the archived and the raw format. Here we present an open-source workflow that focuses on band mapping data from MPES. In the following, we briefly introduce the technology of MPES and the associated data processing, before providing details on the workflow from raw data to database integration.

MPES, also called momentum microscopy (MM), is born out of the recent integration of time-of-flight (TOF) electron spectrometers with delay-line detectors (DLDs) and improved electron-optic lens designs12,13,14,15. Compared with the earlier generations of angle-resolved photoemission spectroscopy (ARPES)16,17 using hemispherical analyzers to measure the 2D energy-momentum distribution of the photoemitted electrons18, MPES is capable of recording single-electron events simultaneously sorted into the (kx, ky, E) coordinates (E: electron energy, kx, ky: parallel momentum components) in band mapping experiments, obviating the need for scanning across sample orientations and subsequent data merging as is the case for similar experiments using a single hemispherical analyzer. Operation of the TOF DLD in MPES requires a pulsed photon source and is directly compatible with 3rd and 4th generation light sources19 as well as laboratory-based table-top setups20,21,22,23, harnessing their high repetition rates in the range of multi-kilohertz to megahertz to drastically improve the detection speed and efficiency. Mapping of the 3D band structure with sufficient signal-to-noise ratio (SNR) may be achieved on the timescale of minutes. The technological convergence opens up the possibilities to record 3D datasets in dependence of one or more additional parameters, such as spatial location I(x, y, kx, ky, E), probe photon energy, I(kx, ky, E, kz)10, spin-polarization, I(kx, ky, E, S)9, or pump-probe time in time-resolved MM, I(kx, ky, E, t)24 within a reasonable time frame.

From the data perspective, the pulsed sources with high repetition rates generate densely sampled data at rates of multiple megabytes per second (MB/s), which has brought about challenges in data processing and management compared with conventional ARPES experiments. The raw data in MPES are single photoelectron events registered by the DLD and the physical quantities related to the detected events are streamed in parallel to the storage files in a hierarchical file format (e.g. HDF525). A typical dataset involves $$10^{7}$$–$$10^{10}$$ detected events with a total size of up to a few hundred gigabytes (GBs), depending on the number of coordinates measured (3D or 4D) and the required SNR. Unlike the large 2D or 3D image-based datasets, such as those obtained in various forms of optical26,27 and electron microscopy techniques28,29, processing and conversion of tabulated single-event data requires additional steps of statistical computing for conversion into standard images. This motivates the current workflow development for efficient data processing and analysis. In data processing and calibration, experiments performed at different facilities share similar procedures going from the raw events to the multidimensional hypervolume with calibrated axes, which is the basis for archiving and downstream analysis. To maintain reproducibility for the particular data source characteristics, data structure and processing procedure, we have summarized the workflow (see Fig. 1) into two open-source software packages (hextof-processor30 and mpes31), with similar design principles for coping with large-scale facility and table-top experiments, respectively. The core of our approach includes distributed statistical processing at the single-event level using parameters calibrated and determined from preprocessed volumetric datasets, which enables effective instrument diagnostics, artifact correction, and sample condition monitoring.
The algorithms involved balance physical knowledge and existing methods in image processing and computer vision. The workflow is illustrated next with data obtained at some of the electron momentum microscopes currently in operation, such as the HEXTOF (high energy X-ray time-of-flight) measurement system24 at the free-electron laser source FLASH32 at DESY, and the table-top high harmonic generation-based setup at the Fritz Haber Institute (FHI)21 involving a commercial TOF and DLD (METIS 1000, SPECS GmbH). We use the material example of tungsten diselenide (WSe2) measured at both experimental setups to demonstrate the workflow execution, because in momentum space, the patent features of WSe2 band structure and the nonequilibrium dynamics initiated by optical excitation of WSe2 have been thoroughly studied in the past (see Methods)24,33,34,35,36. We expect the workflow described here to serve as a blueprint for upcoming software platforms in similar setups to be installed in other facilities or laboratories worldwide.

## Results

### Workflow description

The workflow schematic shown in Fig. 1 starts with raw single-event data from measurements. The data are (i) binned in a distributed fashion in the measurement coordinates, including each of the photoelectrons’ position on the detector (X, Y), its TOF, a digital encoder (ENC) axis, and others, if more than four dimensions are acquired in parallel. The binned histogram is (ii) used to estimate the numerical transforms for distortion correction and axis calibration. Next, these transforms are (iii) applied to the raw single-event data to convert the measurement coordinates to the physical axes, (kx, ky, E, tpp) and others for higher dimensions (see also Fig. 2). Finally, the single-event data are (iv) binned in the transformed grid to yield 3D, 3D + t or other higher-dimensional data with the correct axis values. The outcome may be exported in different formats for storage, visualization and downstream analysis.

Processing billion-count single-event data requires user interaction for data checking and distributed processing to reduce the time consumption. The general tasks in the workflow include the transformation of data streams to multidimensional histograms, artifact correction and axis calibration. These operations can be efficiently decomposed into column-wise operations of the distributed dataframe format offered by the dask package37 in Python. While the use of dask dataframes provides the common foundation for interactivity with single events in hextof-processor and mpes, they distinguish themselves by the experimental requirements.
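The decomposition into partition-wise operations can be illustrated with a minimal sketch. It uses plain NumPy chunks as a stand-in for the partitions of a distributed dataframe, and it is not the implementation found in hextof-processor or mpes. The function name `bin_events_in_chunks`, the chunk size and the synthetic event columns are assumptions made for illustration. The point it shows is that partial histograms are additive, so each partition can be binned independently and the results summed.

```python
import numpy as np

def bin_events_in_chunks(events, bins, ranges, chunk_size=100_000):
    """Histogram an (N, D) array of single events chunk by chunk.

    Each chunk stands in for one partition of a distributed dataframe.
    Partial histograms are additive, so they can be summed in any order.
    """
    total = np.zeros(bins, dtype=np.int64)
    for start in range(0, len(events), chunk_size):
        chunk = events[start:start + chunk_size]
        partial, _ = np.histogramdd(chunk, bins=bins, range=ranges)
        total += partial.astype(np.int64)
    return total

rng = np.random.default_rng(0)
n_events = 500_000
# Synthetic (hypothetical) measurement coordinates: detector X, Y and TOF
events = np.column_stack([
    rng.uniform(0, 1024, n_events),
    rng.uniform(0, 1024, n_events),
    rng.uniform(0, 1000, n_events),
])
bins = (32, 32, 50)
ranges = [(0, 1024), (0, 1024), (0, 1000)]

hist_chunked = bin_events_in_chunks(events, bins, ranges)
# Reference: bin all events in a single call
hist_direct, _ = np.histogramdd(events, bins=bins, range=ranges)
```

Because the summed result equals the single-call histogram, the same pattern carries over to any framework that maps a binning function over partitions and reduces the outputs by addition.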

At large-scale facilities, experiments often record a large number of machine parameters that need to be stored, though only a small number of relevant parameters are needed for downstream processing. Therefore, the hextof-processor package includes a parameter sampling step to retrieve intermediate tabulated data in the Apache Parquet format (https://parquet.apache.org/), a column-based data structure optimized for computational efficiency. This approach reduces the processing overhead of searching through the raw data files every time data are queried during the subsequent processing. As an open-source project, other beamtime-specific functionalities are added by users to the existing framework at every new experimental run. The mpes package adapts to the much simpler file structure produced at table-top experimental setups and makes direct use of the HDF5 raw data. It comes with added functionalities motivated by the existing issues encountered in data acquisition and downstream processing such as axis calibration, masking, alignment and different forms of artifact correction. Both software packages come with detailed documentation and examples online for users to gain familiarity (see Code availability).

### Artifact correction

Artifacts in MPES data come from mechanical imperfections, stray fields (electric and magnetic), uncertainties in the alignment of the sample, light beams and the multistage electron-optic lens systems as well as the data digitization process. Minimizing and correcting instrumental imperfections plays an important role in the validity of downstream analysis. We carry out artifact correction sequentially at the level of single photoelectron events or the data hypervolume obtained from multidimensional histogramming (see Fig. 2). The outcomes are illustrated using the correction of (1) digitization artifact (see Fig. 3) and (2) spherical timing aberration artifact (see Fig. 4), with technical details in Methods.

### Axis calibration

To transform the measurement axes of the DLD into physically relevant axes for electronic band mapping, calibrations are required, as shown in Fig. 2. The calibration functions are constructed with parameters derived from comparing physical knowledge of the materials (e.g. Brillouin zone size, Fermi level position) with the corresponding scales in data. They are applied either to the binned data hypervolume, or to the single-electron events raw data individually in a distributed fashion before binning. Details on the calibration data transforms are provided in Methods.

### Data storage and format

The simplest form of the output data hypervolume derived from single-electron events includes non-negative scalar values of the photoemission intensity and the calibrated real-valued axes coordinates, including kx, ky, E, and other parameter dependencies such as the pump-probe time delay tpp. These values are exported as HDF5, MAT or TIFF, with the metadata included as attributes of the files.

### Workflow archiving and reuse

Computational workflows are valued by their reproducibility38. Archiving and sharing the workflow parameters among users of the beamlines or facilities allow comparison between experimental runs and reuse for the simultaneous benefits of machine diagnostics and experimental efficiency. To achieve this, we store critical parameters generated within the workflow in a separate file as workflow parameters (see Fig. 1) during each step, including the numerical values used in binning, the intermediate parameters and coefficients of the correction and calibration functions, etc. They can be reused when loading into the processing of other datasets.
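As a rough illustration of storing workflow parameters in a separate file, the sketch below writes a nested dictionary to JSON and reads it back. The file format, the key names (`binning`, `energy_calibration`, `momentum_calibration`) and all numerical values are hypothetical; the text does not specify how the packages serialize these parameters.

```python
import json
import os
import tempfile

# Hypothetical parameter names and values, chosen only for illustration
workflow_params = {
    "binning": {"axes": ["X", "Y", "TOF"], "bins": [256, 256, 500]},
    "energy_calibration": {"poly_coeffs": [0.0, -0.05, 1.2e-5]},
    "momentum_calibration": {"f_X": 0.0105, "f_Y": 0.0105},
}

def save_params(params, path):
    """Write the workflow parameters to a human-readable JSON file."""
    with open(path, "w") as f:
        json.dump(params, f, indent=2)

def load_params(path):
    """Read workflow parameters back for reuse on another dataset."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "workflow_params.json")
save_params(workflow_params, path)
reloaded = load_params(path)
```

A plain-text format keeps the parameters easy to inspect and to place under version control alongside the processed data.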

### Data visualization

The adaptation of established scientific visualization methods in the physical sciences39,40 to band mapping data should incorporate the requirements and knowledge of the data characteristics in this field of research. The band mapping data in 3D (multi-megavoxel) and 3D + t (multi-gigavoxel) include the inherent symmetries from the electronic band structure of the material, but the intensity modulations in the photoemission process41, dynamics and sample condition disrupt the original symmetry. The overall goal is to emphasize the features of interest while exploiting the symmetry to simplify the visualization (see Methods). The output files from the processing pipeline are compatible with open-source visualization software such as matplotlib42, ParaView39 and Blender43.

### Downstream analysis integration

Typical photoemission data analysis involves extracting electronic band structure parameters, physical coupling constants and lifetimes via fitting of lineshapes16 or dynamical models44, often carried out specific to the material under study. At the end of our distributed workflow, the data size is on the order of a few to tens of gigabytes, which can be directly loaded into memory on users’ local machines for downstream data analysis with custom routines.

The metadata of the data files have a tree structure and contain information of the experimental setting, parameters of the pulsed light source, the detector and the sample under study. A list of top-level metadata parameters is presented in Table 1. A full and current list of all metadata parameters, including the top-level parameters and their constituent lower-level parameters, along with their definitions, units and values, is provided in Supplementary Tables 1–4. For database integration, an accompanying data parser (parser-mpes, see Code availability) for MPES data has been written in accordance with existing standards45 for computational materials science in NOMAD8, featuring an electronic version of the metadata parameter list in the file mpes.nomadmetainfo.json online. The metadata parameter list and the data parser are versioned and are updated based on the corresponding changes in the data structure for photoemission spectroscopy experiments. The existing WSe2 photoemission data have been integrated into the experimental section of the materials science database NOMAD (see Data availability).

## Discussion

We have designed and implemented an open-source, end-to-end workflow for processing single-event data produced in multidimensional photoemission spectroscopy, linking to downstream tasks and providing guidelines and software for integrating processed data into the NOMAD experimental materials science database. The distributed processing takes full advantage of the single-event data streams directly accessible from the TOF delay-line detector for event-wise correction and calibration and converts the raw events to the calibrated data hypervolume for project-specific downstream analysis. The functionalities within the workflow are publicly accessible through the software packages we have developed (hextof-processor30 and mpes31). The processing workflow is archived at each step of operation and the processed data may be integrated into an experimental database with user-specified metadata. The methods described here are applicable to all existing types of multidimensional photoemission band mapping measurements beyond the static and time-dependent settings described here.

Our end-to-end workflow from raw data to processed data to database integration provides a fast-track and all-in-one solution to the demands for open experimental data and reproducible research in the materials science community7,8. The public repositories for the software packages are the foundations for phased future extension and integration with existing analytical tools in the photoemission spectroscopy community. The modular structure of the packages introduced here allows targeted upgrades by both temporary and dedicated maintainers and users. Casting the workflow in the Python programming environment provides the foundation for convenient incorporation of existing image processing and machine-learning resources46 for further exploration and understanding of the band mapping datasets, which contain rich information owing to the complex nature of the photoemission process16,18. This is especially beneficial for broader adoption of photoemission, since the interpretation of photoemission data is often linked to the observed or extracted outstanding features such as local intensity extrema, dispersion kinks and satellites, lineshape parameters and pattern symmetry16. Access to experimental data and the potential integration with existing electronic structure-related software5,47,48,49 will therefore facilitate method developments and the direct comparison between experimental results and theoretical band structure calculations within the same programming platform.

## Methods

### Sample preparation

Single-crystalline samples of 2D bulk WSe2 (2H stacking) were purchased from HQ Graphene. Crystals of size around 5 mm × 5 mm × 1 mm were used directly for the measurements. To prepare a clean surface by cleaving, we attached a cleaving pin upright to the sample surface using conducting epoxy (EPOTEC H20) outside the vacuum chamber and removed the pin by mechanical force in ultrahigh vacuum.

### Photoemission experiments

The measurements were conducted using the HEXTOF instrument24 at the DESY FLASH PG-2 beamline50 with the free-electron laser (FEL) as well as a laboratory source21 with a METIS electron momentum microscope (SPECS METIS 1000) installed at the FHI. In the measurements at FLASH, the FEL was tuned to 36.5 eV (or 34.0 nm) and 109 eV (or 11.4 nm), and the optical pump pulse had a center wavelength of 775 nm. The measurements at the FHI used a 21.7 eV home-built extreme UV source based on high harmonic generation in Ar gas driven by an optical parametric chirped-pulse amplifier operating at 500 kHz repetition rate51. The optical pump pulse was centered at 800 nm. In both FEL and laboratory experiments, the near-infrared light pulses promote the electronic population at the K and K′ high-symmetry points (corresponding to $$\bar{{\rm{K}}}$$ and $$\bar{{\rm{K}}{\prime} }$$ points, respectively, in the projected Brillouin zone obtained from photoemission, as shown in Fig. 5) in momentum space to the excited states via direct optical transitions. The nonequilibrium electronic dynamics are probed via valence and conduction band photoemission35 as well as core-level photoemission36, using s-polarized extreme UV and soft X-ray probe pulses, respectively.

### Digitization artifact

The time-to-digital converter (TDC) outputs digitized data according to the binning width of the on-board electronics. Data conversion from one digitized format to another in a rebinning process often creates a picket fence-like effect (see Fig. 3). This phenomenon originates from the incommensurate bin size in the two rounds of sampling processes (binning and rebinning). To solve the problem, one introduces a slight amount of uniformly distributed noise, with an amplitude equal to half of the original bin size, to the single-event values when carrying out the bin counts. This is similar to the histogram jittering (or dithering) technique52,53 used in statistical visualization and computer graphics. Mathematically, the uniformly distributed noise U(0,1) bounded in the range [0,1] is added before binning to a univariate data stream, S = {Si} via,

$$S{{\prime} }_{i}={S}_{i}+\frac{{w}_{b}}{2}\times U(0,1).$$
(1)

Here, wb is the bin width. For binning of multivariate data streams, such as the detector X position (or kx), Y position (or ky), and the photoelectron TOF (or E), we adopt the same approach individually for each dimension. The effect of jittering in reducing the digitization artifact is demonstrated in Fig. 3.
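A minimal numerical sketch of Eq. (1) is given below. It follows the equation as written, adding noise drawn from U(0, 1) scaled by half the bin width. The function name `jitter`, the synthetic integer-valued stream and the rebinning step of 0.7 are assumptions chosen only to make the two grids incommensurate; they do not come from the packages.

```python
import numpy as np

def jitter(values, bin_width, rng):
    """Apply Eq. (1): S'_i = S_i + (w_b / 2) * U(0, 1)."""
    noise = rng.uniform(0.0, 1.0, size=values.shape)
    return values + 0.5 * bin_width * noise

rng = np.random.default_rng(1)
w_b = 1.0  # original digitization step (arbitrary units)

# Synthetic digitized stream: integer multiples of w_b
digitized = rng.integers(0, 100, size=200_000).astype(float) * w_b
jittered = jitter(digitized, w_b, rng)

# Rebin on a grid whose step (0.7) is incommensurate with w_b
edges = np.arange(0.0, 100.0 + 0.7, 0.7)
counts_raw, _ = np.histogram(digitized, bins=edges)
counts_jit, _ = np.histogram(jittered, bins=edges)

# Bin-to-bin scatter before and after jittering
scatter_raw = counts_raw.std()
scatter_jit = counts_jit.std()
```

Without jittering, some of the new bins contain no digitized value at all, which produces the picket-fence pattern. Comparing `scatter_raw` with `scatter_jit` gives a quick check of how much the added noise smooths the rebinned counts. For multivariate streams, the same function would be applied to each column with its own bin width.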

### Spherical timing aberration

Electrons entering the TOF tube at different lateral positions travel through different path lengths to reach the detector, which is the origin of the spherical timing aberration as illustrated in Fig. 4. The lateral position-dependent time delay may be expressed as,

$$\Delta {{\rm{TOF}}}_{{\rm{sph}}}(r)=(\sqrt{1+{r}^{2}/{d}^{2}}-1){{\rm{TOF}}}_{0},$$
(2)

where r is the radial distance from the center of the DLD, d is the length of the field-free region in the TOF tube and TOF0 is the TOF normalization constant. For a typical field-free region length of d ≈ 1 m and a DLD screen radius of r = 50 mm, $$\Delta {\rm{TOF}}/{{\rm{TOF}}}_{0}\approx 1.25\,\times \,1{0}^{-3}$$. Assuming TOF0 = 0.5 μs, the spherical timing aberration in TOF scale is $$\Delta {{\rm{TOF}}}_{{\rm{sph}}}\approx 0.62$$ ns, which is larger than the DLD’s temporal resolution of 0.15 ns. The effect of the spherical timing aberration is visible for a few eV energy range with fine bins but quite small on a large energy range. To illustrate this effect, we use the W 4f core-level data presented in Fig. 4b. For every (X, Y) position on the detector the peak of W 4f7/2 was fitted with a Voigt profile and the peak positions are shown in Fig. 4c. As the spectra from deep core levels typically do not show dispersion, the deviation from fitting corresponds to the spherical timing aberration of the electron optics. In order to compensate for the spherical timing aberration, we first transform the data from Cartesian to the polar coordinates (see Fig. 4c), and then fit the radial-averaged peak position to a polynomial function of the radius,

$$\Delta {{\rm{TOF}}}_{{\rm{sph}}}(r)=\frac{{r}^{2}{{\rm{TOF}}}_{0}}{2{d}^{2}}-\frac{{r}^{4}{{\rm{TOF}}}_{{\rm{0}}}}{8{d}^{4}}+{\rm{O}}({r}^{6}).$$
(3)

The fitting results together with the corrected radial distribution are presented in Fig. 4d.
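The numbers quoted for Eq. (2) and the truncated expansion in Eq. (3) can be checked with a short sketch. It uses the analytic coefficients of Eq. (3) rather than coefficients fitted to measured peak positions, so it only illustrates the functional form. The function names and the event-wise helper `correct_tof` are hypothetical.

```python
import numpy as np

def dtof_exact(r, d, tof0):
    """Eq. (2): spherical timing aberration at radius r."""
    return (np.sqrt(1.0 + r**2 / d**2) - 1.0) * tof0

def dtof_series(r, d, tof0):
    """Eq. (3): series expansion truncated after the r^4 term."""
    return tof0 * (r**2 / (2 * d**2) - r**4 / (8 * d**4))

d = 1.0         # field-free region length in m
tof0 = 0.5e-6   # TOF normalization constant in s
r_edge = 0.05   # DLD screen radius in m

# Relative delay at the detector edge and its value in ns
ratio = dtof_exact(r_edge, d, 1.0)
delay_ns = dtof_exact(r_edge, d, tof0) * 1e9

def correct_tof(tof, x, y, center, d, tof0):
    """Subtract the radial delay from per-event TOF values.

    x, y and center are detector positions in the same length unit as d.
    """
    r = np.hypot(x - center[0], y - center[1])
    return tof - dtof_series(r, d, tof0)

# An event at the detector center is left unchanged by the correction
tof_center = correct_tof(np.array([0.5e-6]), np.array([0.0]),
                         np.array([0.0]), (0.0, 0.0), d, tof0)
```

For the stated geometry the two expressions agree to well below the 0.15 ns detector resolution, which is consistent with using a low-order polynomial in r for the fit.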

### Symmetry distortion

Photoemission patterns in the (kx, ky) plane (i.e. an energy slice) may exhibit distorted symmetry due to the influence of various factors from the instrument, the sample and the experimental geometry on the trajectory of low-energy photoelectrons. Correction of the symmetry distortion yet preserving the intensity features requires the use of symmetry-related landmarks to solve for the symmetrization coordinate transform in the framework of nonrigid image registration54. In typical situations with an excellent electron lens alignment, the energy dependence of the momentum distortion within the focused phase space volume covering an energy range of several eV is negligible, so the same coordinate transform can be applied to all energy slices in the volumetric data (including both valence and conduction bands) or simultaneously to all single events.

### Other single-experiment artifacts

(1) Momentum center shift: The momentum center of the emergent photoelectrons travelling through the electron-optic system may experience an energy-dependent shift owing to the slight misalignment in the system or the influence of stray fields. Correction of the center shift requires an energy-dependent center alignment of energy slices. The shift along the energy (or TOF) axis may be estimated using phase correlation55 or mutual information-based56 sequential image registration methods, in which the series of energy slices are treated as an image sequence. In a well-shielded and well-aligned electron-optic lens system, generally, the momentum center shift is negligible in the focused photoelectron energy range.

(2) Space-charge effect (SCE): The secondary photoelectron clouds originating from the probe and pump pulses cause a “doming effect” of the photoemission intensity distribution around the momentum center of the band structure. This is especially visible in systems with a clear Fermi edge9,11 or non-dispersing shallow core levels, which may be used as references for calibrating the parameters used for the flattening transform by applying a momentum-dependent shift $$\Delta {{\rm{TOF}}}_{{\rm{sc}}}({k}_{x},{k}_{y})$$ in the TOF (or the calibrated energy) coordinate of the single-event data.
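The phase-correlation estimate mentioned for the momentum center shift can be sketched for a single pair of energy slices as follows. This is a generic, integer-pixel version written with NumPy FFTs and tested on a synthetic image with a known circular shift. It is not taken from the packages, and the function name `phase_correlation_shift` is an assumption.

```python
import numpy as np

def phase_correlation_shift(reference, moving):
    """Integer-pixel (row, col) shift that maps `reference` onto `moving`."""
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moving)
    # Normalized cross-power spectrum keeps only the phase difference
    cross = f_mov * np.conj(f_ref)
    cross /= np.abs(cross) + 1e-12
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak positions beyond half the image size correspond to negative shifts
    shift = [p if p <= n // 2 else p - n for p, n in zip(peak, corr.shape)]
    return tuple(int(s) for s in shift)

rng = np.random.default_rng(2)
# Synthetic stand-in for an energy slice in the (kx, ky) plane
slice_ref = rng.random((64, 64))
# The same slice displaced by 3 rows and -5 columns (circular shift)
slice_shifted = np.roll(slice_ref, (3, -5), axis=(0, 1))

estimated = phase_correlation_shift(slice_ref, slice_shifted)
```

Applied sequentially to neighbouring energy slices, the estimated shifts give the energy-dependent center displacement that a correction step would then undo. Real data would typically need windowing and sub-pixel refinement, which are omitted here.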

### Momentum calibration

The scaling factors for momentum calibration are computed by comparing the positions of known high symmetry points in the band structure with their corresponding locations in an energy slice. Suppose A and B are two high symmetry points identifiable (e.g. as local extrema) from the experimental data with pixel positions (XA, YA) and (XB, YB), and momentum positions, ($${k}_{x}^{A}$$, $${k}_{y}^{A}$$) and ($${k}_{x}^{B}$$, $${k}_{y}^{B}$$), respectively. We calculate the pixel-to-momentum scaling ratios, fX and fY, along the X (column) and Y (row) directions of a 2D k-space image, respectively. Then, the momentum coordinate (kx, ky) at each pixel position (X, Y) may be derived.

$${f}_{D}=\left({k}_{d}^{A}-{k}_{d}^{B}\right)/\left({D}_{A}-{D}_{B}\right)$$
(4)
$${k}_{d}={f}_{D}\times (D-{D}_{A})\quad (D,d=X,x\ {\rm{or}}\ Y,y)$$
(5)
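Equations (4) and (5) amount to a two-point linear scaling, sketched below. The pixel and momentum values are hypothetical and do not correspond to a measured Brillouin zone. Point A is placed at the zone center with zero momentum, so that Eq. (5), which measures momentum relative to A, directly yields absolute coordinates. The two points must differ in both X and Y for both ratios to be defined.

```python
import numpy as np

def scaling_ratio(k_a, k_b, pix_a, pix_b):
    """Eq. (4): f_D = (k_d^A - k_d^B) / (D_A - D_B)."""
    return (k_a - k_b) / (pix_a - pix_b)

def pixel_to_momentum(pix, pix_a, f):
    """Eq. (5): k_d = f_D * (D - D_A)."""
    return f * (np.asarray(pix, dtype=float) - pix_a)

# Hypothetical landmarks: A at the zone center, B at another symmetry point
X_A, Y_A = 256.0, 256.0   # pixel position of A
X_B, Y_B = 406.0, 356.0   # pixel position of B
kx_A, ky_A = 0.0, 0.0     # momentum of A (1/Angstrom)
kx_B, ky_B = 1.2, 0.8     # momentum of B (1/Angstrom)

f_X = scaling_ratio(kx_A, kx_B, X_A, X_B)
f_Y = scaling_ratio(ky_A, ky_B, Y_A, Y_B)

# Calibrated axes for a 512 x 512 energy slice
kx_axis = pixel_to_momentum(np.arange(512), X_A, f_X)
ky_axis = pixel_to_momentum(np.arange(512), Y_A, f_Y)
```

The same two functions can be applied to the X and Y columns of the single-event data before binning, instead of to the axes of an already binned slice.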

### Energy calibration

The calibration requires a set of band mapping data measured at different bias voltages (applied between the material sample and the ground), usually sampled with a spacing of 0.5 V in a range of ± 3–5 V around the normally applied bias voltage for a particular sample. The calibration proceeds by finding the TOF feature (e.g. local extrema) correspondences in the 1D energy distribution curves (EDCs) at different biases using the dynamic time warping algorithm57. The transformation from the TOF to the photoelectron energy E is approximated as a polynomial function,

$$E({\rm{TOF}})=\mathop{\sum }\limits_{i=0}^{n}{a}_{i}{{\rm{TOF}}}^{i}$$
(6)

The approximation is sufficiently accurate within a range of 20 eV, which covers the entire valence band and some low-lying conduction bands of typical materials. The polynomial coefficients are determined using nonlinear least squares by solving $$\Delta T\cdot {\boldsymbol{a}}=\Delta E$$, in which $${\boldsymbol{a}}={({a}_{1},{a}_{2},...)}^{T}$$ is the coefficient vector while the constant offset a0 is determined by manual alignment to an energy reference, such as the Fermi level or valence band maximum. The vector ΔE and the matrix ΔT contain, respectively, the pairwise differences of the bias voltages and the polynomial terms of differential TOF values. To calibrate a large energy range including multiple core levels, a piecewise polynomial may be used11.
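The structure of the system ΔT · a = ΔE can be illustrated with a small synthetic example. Two simplifications are assumed here and are not taken from the text: the system is solved with ordinary linear least squares (it is linear in the coefficients), and a change in bias voltage is taken to shift the feature energy by the same amount with a positive sign. The function name `fit_energy_poly` and all numerical values are hypothetical, and the feature-matching step by dynamic time warping is skipped.

```python
import numpy as np
from itertools import combinations

def fit_energy_poly(tof_features, bias_voltages, order=2):
    """Solve dT . a = dE for a = (a_1, ..., a_n) by linear least squares.

    tof_features[i] is the TOF position of one spectral feature measured
    at bias_voltages[i]. The constant offset a_0 is not determined here.
    """
    rows, rhs = [], []
    for i, j in combinations(range(len(tof_features)), 2):
        # Pairwise differences of the polynomial terms TOF^k, k = 1..order
        rows.append([tof_features[i]**k - tof_features[j]**k
                     for k in range(1, order + 1)])
        # Pairwise differences of the bias voltages
        rhs.append(bias_voltages[i] - bias_voltages[j])
    coeffs, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return coeffs

# Synthetic ground truth: E(TOF) = a_0 + a_1 * TOF + a_2 * TOF^2
true_coeffs = np.array([-1.0, 0.005])
tof_features = np.linspace(60.0, 70.0, 9)   # arbitrary TOF units
# Bias values consistent with the assumed polynomial, up to a constant
bias_voltages = (true_coeffs[0] * tof_features
                 + true_coeffs[1] * tof_features**2 + 43.0)

coeffs = fit_energy_poly(tof_features, bias_voltages, order=2)
```

Taking pairwise differences removes the unknown constant from the system, which is why a_0 has to be fixed separately by alignment to an energy reference.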

### Pump-probe delay calibration

The time origin (“time zero”) in time-resolved photoemission spectroscopy, i.e. the temporal overlap of pump and probe pulses, is determined by fitting of a characteristic trace extracted from the data. Since the readings of the digital encoder (see Fig. 2) are sampled linearly, equally-spaced pump-probe delays are directly convertible from the readings using linear interpolation, given the boundary values of the translation stage positions and the corresponding delay times. For unequally-spaced delays, a delay marker is first added to each data point as a separate column after data acquisition to group together the encoder reading ranges that correspond to the specific time delays. The data binning is carried out over the delay marker column instead of the equally-sampled encoder readings.
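Both cases can be sketched in a few lines. The encoder range, delay boundaries and reading ranges are hypothetical values, and the function names are assumptions. For equally-spaced delays, the boundary values define a linear map; for unequally-spaced delays, each event receives an integer marker identifying the encoder range it falls into.

```python
import numpy as np

def encoder_to_delay(enc, enc_bounds, delay_bounds):
    """Linear map from encoder readings to pump-probe delay.

    enc_bounds must be given in increasing order.
    """
    return np.interp(enc, enc_bounds, delay_bounds)

def delay_marker(enc, range_edges):
    """Index of the encoder range (0-based) that each reading falls into."""
    return np.digitize(enc, range_edges) - 1

# Equally-spaced case: stage positions 1000..5000 span -1 ps to +3 ps
enc_readings = np.array([1000.0, 2000.0, 5000.0])
delays_ps = encoder_to_delay(enc_readings, (1000.0, 5000.0), (-1.0, 3.0))

# Unequally-spaced case: three encoder ranges, each standing for one delay
range_edges = np.array([1000.0, 2000.0, 3000.0, 5000.0])
markers = delay_marker(np.array([1100.0, 2500.0, 4900.0]), range_edges)
```

In the second case the marker would be stored as an extra column of the single-event table and used as the binning axis in place of the raw encoder readings.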

### Visualization strategies

We discuss here three methods for the display of volumetric band mapping data, which are, at the same time, the basis for visualizing 3D + t data with time as an animated axis. (1) The orthoslice representation includes orthogonal 2D planes selected in specific regions in the 3D volume39, which highlights specific slices deep within the data that are less visible in a volumetrically rendered view (see Fig. 5a). Along this line, we have developed a software package, 4Dview58, to explore 4D data using simultaneously linked orthoslices, which also features contrast adjustment and data integration within a hypervolume of interest. (2) The band-path plot (see Fig. 5b) is a 2D representation of the 3D band mapping volume generated by combining a series of 2D cuts along selected momentum paths (or k-paths) traversing a list of so-called high-symmetry points59,60. This representation captures the largest dispersion within the band structure. For volumetric data, the same path may be sampled over the full energy range to produce the plot shown in Fig. 5b. The analysis and visualization modules in the mpes package include functionalities to compose customized band-path plots. (3) The cut-out view (see Fig. 5c) exposes a specific part of interest in the volumetric data, while not losing the rest39. The analysis module in the mpes package provides ways to generate precise cut-outs using position landmarks (e.g. high-symmetry points labelled in Fig. 5) and inequalities.

## Data availability

The single-event photoemission data used for demonstrating the workflow are available on the Zenodo platform at https://doi.org/10.5281/zenodo.2704787 (valence and conduction band photoemission at FEL)61, https://doi.org/10.5281/zenodo.3945432 (core-level photoemission at FEL)62 and https://doi.org/10.5281/zenodo.3987303 (valence band photoemission from laboratory setup)63. The preprocessed data are being integrated into the NOMAD database in the domain for experimental materials science data accessible at https://nomad-lab.eu/prod/rae/gui/search?domain=ems.

## Code availability

The code, including documentation and examples in Jupyter notebooks for implementing the data transformations in the workflow, is available as hextof-processor (https://github.com/momentoscope/hextof-processor)30 and mpes (https://github.com/mpes-kit/mpes)31. The parser for integrating preprocessed experimental data into the NOMAD database is available as parser-mpes (https://gitlab.mpcdf.mpg.de/rpx/parser-mpes)64.

## References

1. Pruneau, C. Data Analysis Techniques for Physical Scientists (Cambridge University Press, 2017).

2. Deelman, E. et al. The future of scientific workflows. The International Journal of High Performance Computing Applications 32, 159–175 (2018).

3. Zakutayev, A. et al. An open experimental database for exploring inorganic materials. Scientific Data 5, 180053 (2018).

4. Himanen, L., Geurts, A., Foster, A. S. & Rinke, P. Data-Driven Materials Science: Status, Challenges, and Perspectives. Advanced Science 1900808 (2019).

5. Pizzi, G., Togo, A. & Kozinsky, B. Provenance, workflows, and crystallographic tools in materials science: AiiDA, spglib, and seekpath. MRS Bulletin 43, 696–702 (2018).

6. Perkel, J. M. Workflow systems turn raw data into scientific knowledge. Nature 573, 149–150 (2019).

7. Hill, J. et al. Materials science with large-scale data and informatics: Unlocking new opportunities. MRS Bulletin 41, 399–409 (2016).

8. Draxl, C. & Scheffler, M. NOMAD: The FAIR concept for big data-driven materials science. MRS Bulletin 43, 676–682 (2018).

9. Schönhense, G., Medjanik, K. & Elmers, H.-J. Space-, time- and spin-resolved photoemission. Journal of Electron Spectroscopy and Related Phenomena 200, 94–118 (2015).

10. Medjanik, K. et al. Direct 3D mapping of the Fermi surface and Fermi velocity. Nature Materials 16, 615–621 (2017).

11. Schönhense, B. et al. Multidimensional photoemission spectroscopy—the space-charge limit. New Journal of Physics 20, 033004 (2018).

12. Krömker, B. et al. Development of a momentum microscope for time resolved band structure imaging. Review of Scientific Instruments 79, 053702 (2008).

13. Ovsyannikov, R. et al. Principles and operation of a new type of electron spectrometer – ArTOF. Journal of Electron Spectroscopy and Related Phenomena 191, 92–103 (2013).

14. Damm, A. et al. Application of a time-of-flight spectrometer with delay-line detector for time- and angle-resolved two-photon photoemission. Journal of Electron Spectroscopy and Related Phenomena 202, 74–80 (2015).

15. Tusche, C., Krasyuk, A. & Kirschner, J. Spin resolved bandstructure imaging with a high resolution momentum microscope. Ultramicroscopy 159, 520–529 (2015).

16. Damascelli, A., Hussain, Z. & Shen, Z.-X. Angle-resolved photoemission studies of the cuprate superconductors. Reviews of Modern Physics 75, 473–541 (2003).

17. Yang, H. et al. Visualizing electronic structures of quantum materials by angle-resolved photoemission spectroscopy. Nature Reviews Materials 3, 341–353 (2018).

18. Suga, S. & Sekiyama, A. Photoelectron Spectroscopy: Bulk and Surface Electronic Structures (Springer, 2014).

19. Couprie, M. New generation of light sources: Present and future. Journal of Electron Spectroscopy and Related Phenomena 196, 3–13 (2014).

20. Chiang, C.-T. et al. Boosting laboratory photoelectron spectroscopy by megahertz high-order harmonics. New Journal of Physics 17, 013035 (2015).

21. Puppin, M. et al. Time- and angle-resolved photoemission spectroscopy of solids in the extreme ultraviolet at 500 kHz repetition rate. Review of Scientific Instruments 90, 023104 (2019).

22. Corder, C. et al. Ultrafast extreme ultraviolet photoemission without space charge. Structural Dynamics 5, 054301 (2018).

23. Buss, J. H. et al. A setup for extreme-ultraviolet ultrafast angle-resolved photoelectron spectroscopy at 50-kHz repetition rate. Review of Scientific Instruments 90, 023105 (2019).

24. Kutnyakhov, D. et al. Time- and momentum-resolved photoemission studies using time-of-flight momentum microscopy at a free-electron laser. Review of Scientific Instruments 91, 013109 (2020).

25. Folk, M., Heber, G., Koziol, Q., Pourmal, E. & Robinson, D. An overview of the HDF5 technology suite and its applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases - AD ’11, 36–47 (ACM Press, New York, New York, USA, 2011).

26. Weiler, N. C., Collman, F., Vogelstein, J. T., Burns, R. & Smith, S. J. Synaptic molecular imaging in spared and deprived columns of mouse barrel cortex with array tomography. Scientific Data 1, 140046 (2014).

27. Ker, D. F. E. et al. Phase contrast time-lapse microscopy datasets with automated and manual cell tracking annotations. Scientific Data 5, 180237 (2018).

28. Levin, B. D. et al. Nanomaterial datasets to advance tomography in scanning transmission electron microscopy. Scientific Data 3, 160041 (2016).

29. Aversa, R., Modarres, M. H., Cozzini, S., Ciancio, R. & Chiusole, A. The first annotated set of scanning electron microscopy images for nanoscience. Scientific Data 5, 180172 (2018).

30. Acremann, Y. et al. hextof-processor. https://github.com/momentoscope/hextof-processor (2020).

31. Xian, R. P. & Rettig, L. mpes. https://github.com/mpes-kit/mpes (2020).

32. Ackermann, W. et al. Operation of a free-electron laser from the extreme ultraviolet to the water window. Nature Photonics 1, 336–342 (2007).

33. Riley, J. M. et al. Direct observation of spin-polarized bulk bands in an inversion-symmetric semiconductor. Nature Physics 10, 835–839 (2014).

34. Shallenberger, J. R. 2D tungsten diselenide analyzed by XPS. Surface Science Spectra 25, 014001 (2018).

35. Bertoni, R. et al. Generation and Evolution of Spin-, Valley-, and Layer-Polarized Excited Carriers in Inversion-Symmetric WSe2. Physical Review Letters 117, 277201 (2016).

36. Dendzik, M. et al. Observation of an Excitonic Mott Transition Through Ultrafast Core-cum-Conduction Photoemission Spectroscopy. Physical Review Letters 125, 096401 (2020).

38. Stodden, V. et al. Enhancing reproducibility for computational methods. Science 354, 1240–1241 (2016).

39. Hansen, C. D. & Johnson, C. R. (eds.) The Visualization Handbook (Elsevier Butterworth-Heinemann, 2005).

40. Lipşa, D. R. et al. Visualization for the Physical Sciences. Computer Graphics Forum 31, 2317–2347 (2012).

41. Moser, S. An experimentalist’s guide to the matrix element in angle resolved photoemission. Journal of Electron Spectroscopy and Related Phenomena 214, 29–52 (2017).

42. Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 90–95 (2007).

43. Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam http://www.blender.org (2018).

44. Weinelt, M. Time-resolved two-photon photoemission from metal surfaces. Journal of Physics: Condensed Matter 14, R1099–R1141 (2002).

45. Ghiringhelli, L. M. et al. Towards efficient data exchange and sharing for big-data driven materials science: metadata and data formats. npj Computational Materials 3, 46 (2017).

46. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559, 547–555 (2018).

47. Ong, S. P. et al. Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science 68, 314–319 (2013).

48. Hjorth Larsen, A. et al. The atomic simulation environment—a Python library for working with atoms. Journal of Physics: Condensed Matter 29, 273002 (2017).

49. Ganose, A. M., Jackson, A. J. & Scanlon, D. O. sumo: Command-line tools for plotting and analysis of periodic ab initio calculations. Journal of Open Source Software 3, 717 (2018).

50. Gerasimova, N., Dziarzhytski, S. & Feldhaus, J. The monochromator beamline at FLASH: performance, capabilities and upgrade plans. Journal of Modern Optics 58, 1480–1485 (2011).

51. Puppin, M. et al. 500 kHz OPCPA delivering tunable sub-20 fs pulses with 15 W average power based on an all-ytterbium laser. Optics Express 23, 1491 (2015).

52. Chambers, J. M., Cleveland, W. S., Kleiner, B. & Tukey, P. A. Graphical Methods for Data Analysis (Wadsworth International Group, 1983).

53. Novo, D. & Wood, J. Flow cytometry histograms: Transformations, resolution, and display. Cytometry Part A 73A, 685–692 (2008).

54. Xian, R. P., Rettig, L. & Ernstorfer, R. Symmetry-guided nonrigid registration: The case for distortion correction in multidimensional photoemission spectroscopy. Ultramicroscopy 202, 133–139 (2019).

55. Guizar-Sicairos, M., Thurman, S. T. & Fienup, J. R. Efficient subpixel image registration algorithms. Optics Letters 33, 156 (2008).

56. Viola, P. & Wells, W. M. Alignment by Maximisation of Mutual Information. International Journal of Computer Vision 24, 137–154 (1997).

57. Salvador, S. & Chan, P. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11, 561–580 (2007).

58. Dendzik, M. mdendzik/4Dview 1.0. Zenodo https://doi.org/10.5281/zenodo.3360817 (2019).

59. Setyawan, W. & Curtarolo, S. High-throughput electronic band structure calculations: Challenges and tools. Computational Materials Science 49, 299–312 (2010).

60. Hinuma, Y., Pizzi, G., Kumagai, Y., Oba, F. & Tanaka, I. Band structure diagram paths based on crystallography. Computational Materials Science 128, 140–184 (2017).

61. Xian, R. P. et al. Multidimensional photoemission spectra of tungsten diselenide. Zenodo https://doi.org/10.5281/zenodo.2704787 (2020).

62. Dendzik, M. et al. Time-resolved core-level photoemission data of tungsten diselenide. Zenodo https://doi.org/10.5281/zenodo.3945432 (2020).

63. Xian, R. P. et al. Datasets for the computational workflow of multidimensional photoemission spectroscopy. Zenodo https://doi.org/10.5281/zenodo.3987303 (2020).

64. Xian, R. P. & Scheidgen, M. parser-mpes. https://gitlab.mpcdf.mpg.de/rpx/parser-mpes (2019).

## Acknowledgements

We thank G. Schönhense for support on the photoelectron detector, S. Grunewald, S. Schülke and G. Schnapka for support on the computing infrastructures. We thank G. Brenner, H. Redlin and S. Dziarzhytski at FLASH, DESY, and H. Meyer and S. Gieschen from the University of Hamburg for beamline and instrumentation support. The work was partially supported by BiGmax, the Max Planck Society’s Research Network on Big-Data-Driven Materials-Science, the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant No. ERC-2015-CoG-682843), and the German Research Foundation (DFG) through the Emmy Noether program under grant number RE 3977/1 and the SFB/TRR 227 “Ultrafast Spin Dynamics” (projects A09 and B07). D. Kutnyakhov, M. Heber and W. Wurth acknowledge funding by the DFG within the framework of the Collaborative Research Centre SFB 925 - 170620586 (project B2). F. Pressacco acknowledges funding from the excellence cluster EXC 1074 “The Hamburg Centre for Ultrafast Imaging - Structure, Dynamics and Control of Matter at the Atomic Scale” of the DFG. S. Y. Agustsson and J. Demsar acknowledge the financial support by the DFG in the framework of the Collaborative Research Centre SFB TRR 173 - 268565370 (project A5). D. Curcio and P. Hofmann acknowledge funding from VILLUM FONDEN via the Centre of Excellence for Dirac Materials (Grant No. 11744). T. Pincelli thanks the Alexander von Humboldt Foundation for financial support. Open Access funding enabled and organized by Projekt DEAL.

## Author information

### Contributions

Y.A., K.B., S.Y.A., D.C., R.P.X. and M.D. wrote the hextof-processor package. R.P.X. and L.R. wrote the mpes package. D.K., Y.A., F.P., R.P.X., S.Y.A., D.C., M.D., M.H., S.D., P.H., L.R., R.E. and W.W. participated in the experiments at the FLASH PG-2 beamline using the HEXTOF instrument in Hamburg. S.D. and L.R. conducted the experiment at the Fritz Haber Institute using the METIS electron momentum microscope. R.P.X., M.D., L.R., R.E., M.S. and T.P. constructed the metadata format; R.P.X. and M.S. implemented it in parser-mpes. R.P.X. wrote the initial manuscript with contributions from M.D. and Y.A. All authors contributed to discussions to bring the manuscript to its final form.

### Corresponding authors

Correspondence to R. Patrick Xian, Laurenz Rettig or Ralph Ernstorfer.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xian, R.P., Acremann, Y., Agustsson, S.Y. et al. An open-source, end-to-end workflow for multidimensional photoemission spectroscopy. Sci Data 7, 442 (2020). https://doi.org/10.1038/s41597-020-00769-8

• DOI: https://doi.org/10.1038/s41597-020-00769-8