## Main

Direct observation of the nucleation and growth of a new phase is the solid-state equivalent of the molecular movie in chemistry1. However, this is challenging to realize in solids due to the range of length- and timescales involved, with dynamics occurring from the atomic scale up to the macroscale and from femtoseconds to microseconds. Remarkable progress has been made in understanding atomic-scale dynamics on the ultrafast time scale by using diffraction-based probes, which can reveal atomic dynamics beyond the mean response2 or dynamics of the average structure over nanoscale regions3. However, electronic probes are needed to understand the functionality of these states. This is crucial in quantum materials, where transient states can be created by light that have electronic properties not found in equilibrium4,5,6,7,8,9. In many cases, transient states are believed to be heterogeneous at the nanoscale10,11,12,13,14,15, both because of the inhomogeneous excitation profiles generated by the pump beam in the depth of the material and due to the intrinsic heterogeneity of many quantum materials. A key challenge to understanding these phases is therefore to isolate the photo-induced state and to directly probe its properties at the nanoscale. Resonant coherent diffraction from electronically ordered states has been used to infer the statistical properties of a domain16,17,18, but real-space images have not been produced. Here we use time- and energy-resolved coherent resonant soft X-ray imaging to observe the ultrafast insulator–metal phase transition in the prototypical quantum material vanadium dioxide (VO2), over a macroscopic area with sub-50 nm spatial resolution and 150 fs time resolution, returning full spectroscopic information on transient states at the nanoscale.

The light-induced phase transition in VO2 has been particularly influential in our understanding of optically driven quantum materials. At room temperature, VO2 is in a monoclinic insulating (M1) phase characterized by dimerized pairs of vanadium ions. Light can break these dimers and drive the transition on the ultrafast timescale to the high-temperature rutile metallic (R) phase. The study of this transition has driven the development of multiple new techniques: it was the first solid–solid transition to be tracked by time-resolved X-ray diffraction19, while ultrafast X-ray absorption techniques were first used to understand its electronic nature20,21. The transition to the rutile phase proceeds by disordering of the vanadium pairs on a sub-100 fs timescale2, with the bandgap collapsing on a similar timescale22. However, it remains an open question whether the bandgap collapses before, or because of, the structural transition to the rutile phase.

In addition to these homogeneous effects, heterogeneity in the transient state is believed to play a key role in the dynamics8,9,10,23,24,25. Although the lattice and electronic properties change on the ultrafast timescale, analysis of the terahertz (THz) conductivity suggests that the rutile metallic phase locally nucleates and grows on a timescale of tens10 to hundreds25 of picoseconds. In addition, electron diffraction results suggest that an additional, metastable heterogeneous monoclinic metallic phase can form and persist for hundreds of picoseconds8,23 or even microseconds9. This phase is distinct from the transient state that occurs within the first 100 fs described above and is stabilized by a structural distortion that preserves the monoclinic symmetry. However, the existence of non-equilibrium phases remains debated26,27 because a direct measurement of the metastable monoclinic metallic state has not been made.

To address the role of nanoscale heterogeneity and phase separation, we use time- and spectrally resolved resonant soft X-ray coherent imaging at the Pohang Accelerator Laboratory X-ray Free Electron Laser (PAL-XFEL)28,29,30,31 to image the light-induced phase transition on the ultrafast timescale with nanometre spatial resolution. The power of this technique lies in the fact that it is a wide-field imaging technique that can exploit resonant X-ray spectroscopy both to provide a contrast mechanism between phases32 and to enable the extraction of quantitative spectral information to aid phase identification on the nanoscale33. We report time-resolved imaging using two modes of operation, Fourier transform holography (FTH) and coherent diffractive imaging (CDI). In FTH, scattering patterns are inverted directly through the use of a fast Fourier transform32 and require a single exposure. This enables rapid data collection but comes at the expense of losing the absolute values of the complex transmission. CDI, conversely, uses multiple exposures to increase the dynamic range of detected scattering patterns, with images obtained via iterative phase retrieval algorithms that yield the quantitative absolute transmission of the sample33.

Figure 1a shows a multi-energy FTH image of VO2 recorded at the vanadium L edge (517 eV, red channel) and oxygen K edge (529.5 eV and 531.25 eV, blue and green channel, respectively) at 325 K, a temperature at which the insulating and metallic phases coexist32. At these energies, the absorption coefficient shows large changes depending on the phase of the material, with the 529.5 eV signal showing a decrease in transmission and the 531.25 eV signal an increase when the system changes from M1 to R34,35. These changes can be used to provide contrast in imaging. The red–green–blue image of Fig. 1a shows the full topography of the sample, which comprises a range of crystallite sizes and grain boundaries (white). These boundaries are known to pin the position of the metallic domains32, and in larger crystallites, clear domain coexistence of M1 (purple) and rutile metallic (green) phases can be seen. This pinning is vital as it ensures the initial domain structure recovers after photoexcitation (Supplementary Note 1 and Extended Data Figs. 1 and 2).

In Fig. 1b, we further verify the domain assignment by performing a temperature-dependent measurement. By subtracting the green and blue channels from the red–green–blue image, we can remove the temperature-independent topography and highlight the domain structure, as the transmission at 529.5 eV (blue channel) decreases in the metallic phase, while the transmission at 531.25 eV increases (green channel). Here we clearly see the R phase nucleating and forming a stripe state with M1, which is common in nanocrystals32.

We now examine the dynamics of the heterogeneous state at 325 K, to observe the disappearance of the M1 phase, the growth of the R phase and/or the nucleation of new transient phases. We excite the system with 24 mJ cm⁻², 800 nm pulses and perform time-resolved FTH imaging at 529.5 eV photon energy. At this fluence, we are in the saturation regime, where increasing fluence no longer results in substantial changes, but the initial domain structure still recovers between photo-excitation events (Supplementary Note 1 and Extended Data Figs. 1 and 2). Large changes are found all over the sample, and in Fig. 1c, we focus on two regions of interest that are cuts across approximately 50-nm-wide rutile metallic domains surrounded by the M1 phase, which appear as dips in the image intensity. In both cases, the contrast between the metal and insulating states is strongly, but not completely, lost within the first 150 fs, after which any further substantial changes are only observed on the hundreds of picoseconds timescale.

We next examine the spatial dependence of the dynamics more closely. Figure 2a shows the pump-induced changes in the domain structure across the full field of view measured at 529.5 eV, relative to the state of the sample at −8.5 ps. Regions of increased transmission are shown in red, with decreased transmission in blue. Changes are observed across the entire sample with nanoscale texturing, but the spatial dependence of the pattern is roughly independent of pump–probe delay after the initial changes. A comparison with the static domain pattern (Fig. 2b) shows that the regions where the transmission decreased correspond to regions that began in the M1 phase, as expected for a transition to the R phase. However, regions that began in the R phase show an increase in transmission of similar magnitude to the changes seen in M1. This is unexpected because, although there may be excited state effects in R, these should be much weaker than the changes of M1 to R (ref. 26).

Instead, the changes observed at the R phase are mainly an artefact, resulting from the loss of the d.c. component in the FTH images, which can cause correlated dynamics across the whole image (Supplementary Note 2 and Extended Data Fig. 3). Therefore, to identify how many unique processes are occurring spatially, we perform a principal component analysis of the dynamic images. This process breaks down the transmission dynamics T into a series of ‘eigen’ spatial, $$A_i(x,y)$$, and temporal, $$f_i(t)$$, functions of the form $$T(x,y,t) = \sum_i A_i(x,y)\,f_i(t)$$.
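The decomposition above can be illustrated with a singular value decomposition of a synthetic image stack. This is a minimal sketch, not the analysis code used here: the function name `decompose`, the made-up temporal profile and the random spatial map are all assumptions, and no mean subtraction is applied, so that a static background would appear as its own component.

```python
import numpy as np

def decompose(stack):
    """Split a (n_t, ny, nx) image stack into temporal and spatial components.

    Returns the singular values, the temporal functions f_i(t) (columns,
    scaled by their singular value) and the spatial maps A_i(x, y).
    """
    n_t, ny, nx = stack.shape
    flat = stack.reshape(n_t, ny * nx)           # one row per time delay
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    temporal = u * s                             # column i is f_i(t)
    spatial = vt.reshape(-1, ny, nx)             # entry i is A_i(x, y)
    return s, temporal, spatial

# Synthetic, perfectly separable data: T(x, y, t) = A(x, y) f(t).
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 20.0, 40)                               # delay in ps (hypothetical)
f = 1.0 - 0.05 * (t > 0) * (1.0 - np.exp(-t / 5.0))           # made-up time trace
A = rng.normal(size=(16, 16))                                 # made-up spatial map
stack = f[:, None, None] * A[None, :, :]

s, temporal, spatial = decompose(stack)
rank_one = temporal[:, 0][:, None, None] * spatial[0][None, :, :]
print("ratio of second to first singular value:", s[1] / s[0])
```

For separable data only the first singular value is significant, and the first component alone reproduces the stack; additional significant components would indicate spatially varying dynamics.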

For times up to 20 ps, we find that only a single principal component is needed to describe the dynamical evolution of the images, meaning all regions in space evolve with the same temporal dynamic and the transmitted intensity can be represented as $$T(x,y,t) = A(x,y)\,f(t)$$. Only when data beyond approximately 100 ps are included in the analysis are additional terms needed (Extended Data Figs. 4–6 and Supplementary Notes 3 and 4). The spatial and temporal response of the initial dynamics are plotted in Fig. 2b. The spatial pattern shows a correlation with the initial domain structure, demonstrating that the observed dynamics are indeed the result of the M1 regions of the sample switching to the metallic R phase and that no other local dynamics occur. A fit to the time trace for this process reveals two time constants: the first, a near-resolution-limited 203 ± 18 fs fall time, is consistent with the ultrafast nature of the structural2 and electronic21,22 changes during the M1 to R phase transition, while the second, 4.98 ± 0.04 ps, is much slower. Critically, as only one principal component is needed to describe the dynamics, these two timescales occur in identical regions of the sample. This secondary picosecond time constant has been seen in multiple experiments, which have shown it to be fluence dependent8,10,23,36,37. However, the interpretation of this time constant has been debated. In some cases, it has been taken as evidence for a non-equilibrium, monoclinic metallic phase8,23, while others have interpreted it as nucleation and growth of the metallic phase10. In the former case, the fast timescale is attributed to regions of the sample that undergo a direct transition from the M1 phase to the R phase, while the slower time constant is attributed to regions of the sample that transition from M1 to monoclinic metallic phases. This contrasts with our observations, which show both time constants occur in all regions of the sample. Similarly, in the nucleation and growth picture, the fast timescale should only be observed at the initial nucleation site; in addition, domain growth would require multiple principal components to describe. Therefore, both these scenarios are at odds with the data presented here, and thus we require an alternative description of the ultrafast phase transition.

To better understand the nature of the transient state formed after the picosecond evolution, we use spectrally resolved CDI to recover the full spectrum of the newly switched regions33. We acquire a hyperspectral image by scanning the probe wavelength, with images taken at 31 photon energies across the oxygen K edge at a delay of 20 ps after photoexcitation (Methods). The resulting spectrally integrated image is shown in Fig. 3a, which shows that the sample is remarkably homogeneous after excitation. However, as already noted in Fig. 1, markers of the initial phase are still observed at key energies. To elucidate the origin of this effect, we extract the transient spectra from all regions of the sample that were initially insulating and compare them with those that were initially metallic (Fig. 3b), enabling us to understand how the transient metallic state differs from the thermal state. The resulting spectra, and the differential, are shown in Fig. 3c,d.

Both regions are remarkably similar, and the resulting difference, less than 1% at 530.5 eV, is much smaller than the changes found from the insulator–metal transition, which are of order 10% (ref. 33), showing that the regions that were initially M1 have switched to the R phase. While the differences are smaller than the overall error on either spectrum independently (Fig. 3c), the primary source of noise is fluctuations in the X-ray intensity, which are highly correlated across the sample and thus do not affect the differential spectra. The uncertainties on the difference spectra are reported independently in Fig. 3d and are of order 0.1% (Methods and ref. 33). The spectral signatures observed are not consistent with thermal differences in the metallic phase26, and, instead, we suggest that these differences result from strain generated during the phase transition. In equilibrium, the volume of VO2 increases by approximately 1% between the M1 and R phase38, but, on the ultrafast timescale, dimerization is lost without volume change, as the volume expansion can only occur after strain propagation. As a result, the photo-generated metallic regions are more strained than regions that were initially metallic.

A strain discontinuity at a surface leads to a strain wave39,40. The finite penetration depth of the pump causes the film to partially transform to the R phase, starting from the surface–vacuum interface. Fast moving, short-lived strain waves will be launched from the vacuum/sample and photo-generated R/M1 phase interfaces into the out-of-plane direction. As the excitation fluence is increased, the R phase will be switched deeper into the material, and the acoustic dynamics will speed up due to the higher speed of sound in the R phase10. When uniformly switched to the rutile phase at high fluence, strain waves will cross our sample in around 7 ps, consistent with our observed timescale (Supplementary Note 5 and Extended Data Fig. 7). The predicted speed increase with fluence is consistent with previous measurements23. In-plane strain relaxes more slowly because of the length scales involved (microns versus nanometres) and may be responsible for the spatially dependent dynamics observed at the hundreds of picosecond timescale.

The picture that emerges for the phase transition is shown in Fig. 3e. Diffraction measurements have shown that photoexcitation first breaks the vanadium dimers within approximately 100 fs2, resulting in a strained rutile phase. Because of the anisotropy of the initial M1 phase, the degeneracy between the rutile a and b axis is broken, giving an orthorhombic structure. After approximately 10 ps, the out-of-plane strain relaxes, but the in-plane strain remains, preserving the orthorhombic nature and leading to the long-lived state shown in Fig. 3c,d. The tetragonal structure is only reached after hundreds of picoseconds when the in-plane strain can relax.

To test this hypothesis, we have simulated the X-ray absorption spectra (XAS) of both the equilibrium rutile phase and the 20 ps orthorhombically strained configuration (Methods). The resulting spectra and differences are also plotted in Fig. 3 and show remarkable agreement with the experimental data. Qualitatively, the effect in both simulation and experiment is that the perturbed spectra are shifted in energy with respect to the initial state, but without changing their shape. Such a global shift, or ‘shear’, results in the difference spectra resembling the derivative of the spectra with respect to energy. The signs of the shifts are consistent with our interpretation, with the strained and switched spectra both moving to lower energies than the unstrained and unswitched spectra. There is also good agreement quantitatively at the π* state, where the two difference spectra match within a factor of 2. At the σ* state, the quantitative agreement is worse, although the qualitative agreement remains good. This results from the fact that all functionals used here overestimate the strength of the σ* state in the initial state compared with our data, which in turn influences the magnitude of the difference (Extended Data Fig. 8).
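The statement that a rigid energy shift produces a derivative-like difference spectrum follows from a first-order Taylor expansion, S(E − δ) − S(E) ≈ −δ dS/dE, and can be checked numerically. The sketch below uses a made-up two-peak spectrum (peak positions, widths and the 20 meV shift are illustrative assumptions, not measured or simulated values).

```python
import numpy as np

energy = np.linspace(525.0, 540.0, 1501)   # photon energy grid in eV

def spectrum(e):
    """Hypothetical two-peak absorption spectrum (arbitrary units)."""
    peak_1 = np.exp(-0.5 * ((e - 529.5) / 0.6) ** 2)
    peak_2 = 0.8 * np.exp(-0.5 * ((e - 532.0) / 1.0) ** 2)
    return peak_1 + peak_2

shift = -0.02   # eV; a negative value moves features to lower energy

# Exact difference between the rigidly shifted and the original spectrum.
difference = spectrum(energy - shift) - spectrum(energy)

# First-order prediction: minus the shift times the energy derivative.
derivative_estimate = -shift * np.gradient(spectrum(energy), energy)

relative_error = np.max(np.abs(difference - derivative_estimate)) / np.max(np.abs(difference))
print("largest deviation relative to the peak difference:", relative_error)
```

The agreement degrades as the shift becomes comparable to the width of the spectral features, where higher-order terms in the expansion matter.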

The implications of these results are twofold. First, we find no evidence for a heterogeneous monoclinic metallic phase for the fluence measured here. The strained orthorhombic state is qualitatively different from previously proposed out-of-equilibrium monoclinic phases8,9,23,41, which result from decoupled internal degrees of freedom and neglect strain, but strain effects may explain the long-lived diffraction features previously associated with them8,9,23. In addition, the lack of nucleation and growth of the rutile metallic phase suggests that previously observed dynamics of the THz conductivity could result from the fact that strained VO2 has a higher resistivity than the fully relaxed state10,25.

We note that we cannot conclusively rule out the existence of the monoclinic metallic state in a different parameter range, as the proposed volume fraction generated is fluence dependent. While we are within the fluence regime in which a monoclinic metallic state has previously been claimed to exist8,9,23, our fluence is higher than the fluence that is claimed to produce the maximum volume fraction of monoclinic metal23. Furthermore, our own measurements have shown that the fluence transition threshold in VO2 is strongly sample and geometry dependent26, and thus systematic fluence-dependent studies will be essential to definitively settle this question. In addition, although nucleation and growth dynamics have not been observed here, they may occur closer to the critical fluence, which is at much lower fluences than used here. As thin film samples show a spatially dependent critical temperature (Tc)32,33 and the threshold fluence is known to correlate with Tc (ref. 26), one would expect that close to threshold excitation will only switch nanoscale regions with the lowest Tc. This local switching will produce complex strain fields that could drive complex cooperative effects42, which could now be directly visualized. Furthermore, as the 0.25 eV of X-ray bandwidth used here corresponds to a transform limited pulse duration of less than 20 fs, decoupling of the bandgap changes from structural dynamics could be resolved during the sub-100 fs initial evolution of the system2. Our approach can easily be extended to characterize filament growth in electric fields, which will enable spectroscopy of field-induced states driven by quasi-d.c. fields or THz pulses9,43; thus, heterogeneous transient states in quantum materials can now be explored at the nanoscale.
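The transform-limit estimate quoted above can be reproduced from the time–bandwidth product. The sketch assumes a Gaussian pulse and treats the 0.25 eV bandwidth as a full-width at half-maximum; both are assumptions, as the pulse shape is not specified in the text, and the function name is hypothetical.

```python
import math

PLANCK_EV_FS = 4.135667696   # Planck constant in eV fs

def transform_limit_fwhm_fs(bandwidth_ev):
    """Shortest FWHM duration (fs) of a Gaussian pulse with the given FWHM bandwidth (eV)."""
    time_bandwidth_product = 2.0 * math.log(2.0) / math.pi   # about 0.441 for Gaussians
    return time_bandwidth_product * PLANCK_EV_FS / bandwidth_ev

duration_fs = transform_limit_fwhm_fs(0.25)
print(f"transform-limited duration: {duration_fs:.1f} fs")
```

Under these assumptions the result is well below 20 fs, in line with the bound stated above; other pulse shapes give a different numerical prefactor.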

## Methods

### Sample preparation and measurements

Samples consisted of 75-nm-thick layers of VO2 prepared on silicon nitride membranes by pulsed laser deposition and subsequent annealing, as widely used to prepare high-quality VO2 films8,9,32,33,44. A [Cr(5 nm)/Au(55 nm)]20 multilayer (~1.1 µm thickness) was deposited on the opposite side, and a focused ion beam was used to mill the mask structure. A 2-µm-diameter aperture was milled to define the field of view, along with five 50- to 90-nm-diameter reference apertures 5 µm away from the central aperture. Crystals grow with the c axis preferentially in plane.

Pump–probe experiments were performed at the FTH end station of the soft X-ray spectroscopy and scattering beamline at the PAL-XFEL operating at 60 Hz repetition rate. The X-ray polarization was perpendicular to the rutile c axis. Time-resolved images were taken by alternating between positive and negative time delays to ensure that the initial domain structure did not change. Each image averages 9,000 X-ray free electron laser (XFEL) shots. The XFEL beam was focused to a 50 μm × 50 μm spot size at the sample position using a Kirkpatrick–Baez mirror pair. To prevent the sample from being damaged, the XFEL was attenuated to have ~1.2 × 10⁸ photons per pulse. Scattering patterns were captured by an in-vacuum charge-coupled device detector (PI-MTE 2048b, Teledyne Princeton Instruments) cooled to −40 °C and read out at 100 kHz with 2 × 2 binning. The charge-coupled device was placed 300 mm downstream from the sample, with a metallic filter (100 nm parylene/100 nm aluminium from Luxel) and 1000-µm-thick epoxy beamstop placed directly before the sensor. Optical pump-only backgrounds were subtracted from each image before either iterative reconstruction or holographic inversion. Laser pulses with a central wavelength of 800 nm were focused to a spot size of 200 µm full-width at half-maximum (FWHM) at the sample with a small (~1°) crossing angle relative to the normal incidence X-rays. The overall temporal resolution was around 150 fs, limited by the relative timing jitter of the optical and X-ray beams. Both beams impinged on the sample VO2 side first to reduce the potential impact of plasmonic effects in the Cr/Au multilayer introducing additional inhomogeneity3.

### FTH and CDI

In FTH, a beam block can be added to the scattering field to block the intense low-momentum scatter, allowing single exposures to capture the high-momentum scattering needed to observe the nanoscale domains. However, the presence of the beam block in the central part of the Fourier plane removes the d.c. component and acts as a high-pass filter. As a result, the absolute values of the complex transmission of the sample are lost. In CDI, multiple exposures are combined to improve dynamic range, enabling better sampling of both low- and high-angular scattering. Although, in principle, the beam block can be removed to capture the d.c. component, it is not necessary for CDI to reconstruct the d.c. level if the X-ray beam has sufficient coherence, which is the case at the FEL. Instead, multiple exposures with the beam block in place were sufficient to produce good CDI results via iterative phase retrieval algorithms33.
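The holographic inversion and the effect of losing the d.c. component can be illustrated on a toy problem. In this sketch the object, the single point-like reference and its position are all hypothetical, and the beam block is idealized as removing only the zero-frequency pixel; a real beam block covers a finite low-momentum region and the references have finite size.

```python
import numpy as np

n = 128
rng = np.random.default_rng(1)

# Hypothetical real-valued object inside a small field of view.
obj = np.zeros((n, n))
obj[60:68, 58:70] = rng.uniform(0.5, 1.0, size=(8, 12))

# Exit wave: object plus a point-like reference well separated from it.
r0 = (100, 64)
wave = obj.copy()
wave[r0] = 1.0

# The detector records only the intensity of the far-field pattern.
hologram = np.abs(np.fft.fft2(wave)) ** 2

def invert(holo):
    """Inverse FFT of the hologram, shifted so the object-reference term sits on the object."""
    autocorrelation = np.fft.ifft2(holo)
    return np.roll(autocorrelation, shift=r0, axis=(0, 1)).real

full = invert(hologram)

# Idealized beam block: remove the zero-frequency (d.c.) pixel only.
blocked = hologram.copy()
blocked[0, 0] = 0.0
filtered = invert(blocked)

roi = (slice(60, 68), slice(58, 70))
offset = hologram[0, 0] / n**2   # constant level removed by blocking d.c.
print("max reconstruction error without beam block:", np.abs(full[roi] - obj[roi]).max())
```

With the full hologram the object is recovered directly in the region of interest. With the d.c. pixel blocked the contrast between pixels is unchanged but every pixel is offset by the same constant, which is why absolute transmission values are not available from FTH in this configuration.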

### Principal component analysis and data fitting

The dynamics of the real space images were analysed using principal component (PC) analysis. First, all negative time delays (probe before pump images) from each time trace were grouped and decomposed into PCs; as these images should be constant over time, the amplitude of the second PC was used to define a noise threshold for the pump–probe measurements. Then each of the two time traces (0–1 ps and 0–20 ps) was decomposed into PCs, and any PC with an amplitude below the noise threshold was discarded. Only two PCs were found to be significant for each time trace: one with a spatial pattern resembling the negative time delay structure with a weak structure in time and one approximately constant at negative time delays and exhibiting a sharp drop at temporal overlap. We identified the first as the static background, with dynamics resulting only from fluctuations in the overall XFEL brightness, and the second as the dynamics of the phase transition. We replaced the first component with its time-averaged mean value and kept the second. We note that including the other, weaker PCs barely affects the time traces; therefore, we ascribed all time dynamics to the second principal component (Extended Data Fig. 5). The remaining time dynamics can be described as a purely real function. Including images taken at longer time delays (100 ps, 250 ps or 500 ps) breaks this simple description, and additional PCs are required to describe the dynamics, corresponding to the onset of spatial dynamics (Extended Data Fig. 4).

While the short and long time traces cannot be directly combined because of a change in the initial domain structure, which we attribute to a random fluctuation in the XFEL intensity introducing irreversible changes, examination of the PC corresponding to the dynamics shows they are the same in both cases (Extended Data Fig. 6). As such, we fit the time-dependent signal, S, of both traces simultaneously with a double exponential decay of the form $$S(t) = H(t)\left( A\exp\left( -\frac{t}{\tau_1} \right) + B\exp\left( -\frac{t}{\tau_2} \right) + C \right) + 1$$, where H(t) is the Heaviside step function, with identical time constants (τ₁, τ₂) but freely varying amplitudes (A, B, C) and convolved with the time resolution (150 fs FWHM). Error bars on the fit represent the standard deviation of the fit assuming a standard deviation of 0.03% on the data, obtained from examining the pre-time-zero data. Uncertainty in the time resolution, pump fluctuations and penetration depth mismatch are not explicitly considered, but increase uncertainty at the femtosecond timescale.
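The fit model can be written in closed form, because the convolution of a step-gated exponential with a Gaussian is an exponential multiplied by a complementary error function. The sketch below evaluates this model with times in picoseconds; the amplitudes are illustrative assumptions, the function names are hypothetical, and the least-squares fitting step itself is omitted.

```python
import math
import numpy as np

FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))
_erfc = np.vectorize(math.erfc)   # element-wise complementary error function

def step_conv(t, sigma):
    """Heaviside step convolved with a unit-area Gaussian of width sigma."""
    return 0.5 * _erfc(-t / (sigma * math.sqrt(2.0)))

def exp_conv(t, tau, sigma):
    """H(t) exp(-t / tau) convolved with a unit-area Gaussian of width sigma."""
    exponent = sigma**2 / (2.0 * tau**2) - t / tau
    argument = (sigma / tau - t / sigma) / math.sqrt(2.0)
    # Clamp the exponent so that very negative delays give 0 rather than inf * 0.
    return 0.5 * np.exp(np.minimum(exponent, 700.0)) * _erfc(argument)

def signal(t, a, b, c, tau1, tau2, fwhm=0.150):
    """Double-exponential response S(t), broadened by the time resolution (ps)."""
    t = np.asarray(t, dtype=float)
    sigma = fwhm * FWHM_TO_SIGMA
    return (a * exp_conv(t, tau1, sigma)
            + b * exp_conv(t, tau2, sigma)
            + c * step_conv(t, sigma)
            + 1.0)

# Time constants from the text; amplitudes are made up for illustration.
delays = np.linspace(-2.0, 20.0, 441)
trace = signal(delays, a=-0.01, b=-0.02, c=-0.03, tau1=0.203, tau2=4.98)
```

Fitting two traces simultaneously with shared time constants would pass one parameter vector holding τ₁, τ₂ and a separate (A, B, C) triple per trace to a least-squares routine.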

### Reconstruction of the transient phase

Full amplitude and phase images of the transient phase were recovered using partially coherent iterative phase reconstruction algorithms as described in ref. 33. Long and short exposures were combined to provide high dynamic range images at 31 photon energies across the oxygen K edge, which were each used to reconstruct real space images independently. The object constraint was determined from the known mask geometry and the FTH reconstructions; a data constraint mask was used to allow the reconstruction to freely vary in regions where the beam block masked the detector or where additional background light or detector damage was present, with the additional constraint that the blocked low-q response obeyed circular symmetry. The properties of the transient state were determined by comparing the average spectral response of all sample regions that began in the R phase and those that were switched from the M1 phase. These regions were determined by combining FTH images at positive and negative time delays at three different photon energies; only regions where all three photon energies agreed were preserved for further analysis. These regions also agree clearly with a PC analysis of the spectrogram, although the degree of sample inhomogeneity prevented a conventional clustering analysis.
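The interplay of the object constraint and the data constraint mask can be illustrated with a generic error-reduction loop. This is a minimal sketch of textbook alternating projections on synthetic data, not the partially coherent algorithm of ref. 33: it assumes full coherence, omits the circular-symmetry condition on the blocked region, and all names and array sizes are hypothetical.

```python
import numpy as np

def error_reduction(modulus, support, measured, n_iter=200, seed=0):
    """Alternate between the Fourier-modulus and support constraints.

    modulus  : measured Fourier amplitudes (values in unmeasured pixels are ignored)
    support  : boolean object constraint in real space
    measured : boolean data constraint mask; False where the pattern floats freely
    """
    rng = np.random.default_rng(seed)
    estimate = rng.uniform(size=support.shape) * support   # random start inside support
    errors = []
    for _ in range(n_iter):
        spectrum = np.fft.fft2(estimate)
        # Mismatch to the data, evaluated on measured pixels only.
        errors.append(np.linalg.norm((np.abs(spectrum) - modulus)[measured]))
        phase = np.exp(1j * np.angle(spectrum))
        # Impose measured amplitudes; keep the current guess where data are missing.
        constrained = np.where(measured, modulus * phase, spectrum)
        # Impose the object constraint in real space.
        estimate = np.fft.ifft2(constrained) * support
    return estimate, np.array(errors)

# Synthetic complex-valued object inside a square aperture.
n = 64
rng = np.random.default_rng(3)
support = np.zeros((n, n), dtype=bool)
support[24:40, 24:40] = True
amplitude = rng.uniform(0.5, 1.0, (n, n))
phase = rng.uniform(-0.3, 0.3, (n, n))
truth = support * amplitude * np.exp(1j * phase)

modulus = np.abs(np.fft.fft2(truth))
q = np.fft.fftfreq(n) * n
measured = np.hypot(q[:, None], q[None, :]) > 2.0   # low-q pixels hidden by a beam block

estimate, errors = error_reduction(modulus, support, measured)
print("data mismatch, first and last iteration:", errors[0], errors[-1])
```

The data mismatch of error reduction cannot increase from one iteration to the next, but the algorithm can stagnate, which is one reason practical reconstructions use more elaborate update rules and additional constraints.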

### XAS simulations

XAS simulations were performed using density functional theory (DFT) and the supercell core–hole45 method as implemented in the Vienna Ab initio Simulation Package46 with projector-augmented wave47 pseudopotentials. For all calculations, supercells containing 16 VO2 formula units (48 atoms) were employed, and 800 bands were included in the core–hole calculations. The plane-wave cut-off was 600 eV, and a 4 × 4 × 6 k-point mesh was used. X-ray absorption was approximated by the imaginary part of the frequency-dependent dielectric function, which was convolved with a Gaussian function with a FWHM of 0.3 eV and a Lorentzian function that linearly increases in width from 0.0 to 0.7 eV between 520 eV and 540 eV to simulate the instrument and lifetime broadening. The simulated spectra of the undistorted VO2 were manually aligned with experiment, and the same energy shift was applied to the distorted structures. We compared three DFT methods, LDA, PBE and PBE+U with a rotationally invariant Hubbard-U correction of 3.1 eV for the vanadium d bands. The three methods did not show any notable differences for the oxygen K edge (Extended Data Fig. 8). Reported simulations are for X-ray polarization perpendicular to the rutile c axis and were based on LDA and are in good agreement with previously published measurements for VO2 (refs. 33,48,49), although the amplitude of the σ* state at 532 eV can be seen to vary between different measurements and calculations. The undistorted rutile structure was calculated with lattice parameters aR = bR = 4.5546 Å and cR = 2.8514 Å, reported in ref. 50. The orthorhombic phase is calculated by projecting the monoclinic axes onto their rutile counterparts while removing the dimerization and tilt distortions, so that the vanadium ions are evenly spaced along each axis, that is, the vanadium ions are located at (0, 0, 0) and (0.5, 0.5, 0.5) in fractional coordinates of the unit cell, as in the rutile phase. This orthorhombic structure has lattice parameters aO = bM = 4.5378 Å, bO = cM sin βM = 4.5322 Å, and cO = aM/2 = 2.8759 Å, where aM, bM, cM and βM are the monoclinic unit cell lengths and angles reported in ref. 51. At the measurement time of 20 ps, the out-of-plane strain has relaxed, so we finally set the orthorhombic bO axis, which lies out of plane, to the rutile value to give aO = 4.5378 Å, bO = 4.5546 Å, and cO = 2.8759 Å.
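The size of the distortion implied by these lattice parameters can be made explicit with a short calculation. The sketch below uses only the values quoted above; the variable names are illustrative, and the strain is defined relative to the undistorted rutile cell.

```python
# Lattice parameters in angstroms, as quoted in the text.
a_rutile, c_rutile = 4.5546, 2.8514           # undistorted rutile (a = b)
a_orth, b_orth, c_orth = 4.5378, 4.5546, 2.8759   # strained cell used at 20 ps

# Fractional strain of each orthorhombic axis relative to rutile.
strain = {
    "a": (a_orth - a_rutile) / a_rutile,
    "b": (b_orth - a_rutile) / a_rutile,      # out of plane, set to the rutile value
    "c": (c_orth - c_rutile) / c_rutile,
}

# Fractional difference in unit-cell volume relative to rutile.
volume_change = (a_orth * b_orth * c_orth) / (a_rutile * a_rutile * c_rutile) - 1.0

for axis, value in strain.items():
    print(f"strain along {axis}: {100 * value:+.2f}%")
print(f"volume difference: {100 * volume_change:+.2f}%")
```

The in-plane axes are strained by less than 1% with opposite signs, while the relaxed out-of-plane axis carries no strain, which is the anisotropy that distinguishes the orthorhombic configuration from the tetragonal rutile cell.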