## Introduction

The strong interaction between Rydberg atoms and microwave (MW) fields that results from their high polarizability means that the Rydberg atom is a candidate medium for MW field measurement, e.g., using electromagnetically induced absorption1, electromagnetically induced transparency (EIT)2,3 and the Autler–Townes effect3,4,5,6. The amplitudes7,8,9,10, phases10,11 and frequencies9,10 of MW fields can then be measured with high sensitivity. Based on this measurement sensitivity for MW fields, the Rydberg atom has been used in communications7,8,12,13 and radar14 as an atom-based radio receiver. In the communications field, the Rydberg atom replaces the traditional antenna and offers superior performance, including sub-wavelength size, high sensitivity, International System of Units (SI) traceability to Planck’s constant, high dynamic range, self-calibration and an operating range that spans from MHz to THz frequencies7,9,10,15,16. One application is analogue communications, e.g., real-time recording and reconstruction of audio signals13. Another application is digital communications, e.g., phase-shift keying and quadrature amplitude modulation7,8,12. The channel capacity of MW-based communications is limited by the standard-quantum-limited phase uncertainty7. Furthermore, a continuously tunable radio-frequency carrier has been realized based on Rydberg atoms17, thus paving the way for concurrent multichannel communications. Detection and decoding of multifrequency MW fields are highly important in communications for acceleration of information transmission and improved bandwidth efficiency. Additionally, recognition of multifrequency MW fields enables simultaneous detection of multiple targets with different velocities from the multifrequency spectrum induced by the Doppler effect. However, because of the sensitivity of Rydberg atoms, noise is superimposed on the message, meaning that the message cannot be recovered efficiently. Additionally, it is difficult to generalize and scale the band-pass filters to enable demultiplexing of multifrequency signals with more carriers16.

To solve these problems, we use a deep learning model for its accurate signal prediction capability and its outstanding ability to recognize complex information from noisy data without use of complex circuits. The deep learning model updates the weights via backpropagation and then extracts features from massive data without human intervention or prior knowledge of physics and the experimental system. Because of these advantages, physicists have constructed complex neural networks to complete numerous tasks, including far-field subwavelength acoustic imaging18, value estimation of a stochastic magnetic field19, vortex light recognition20,21, demultiplexing of an orbital angular momentum beam22,23 and automatic control of experiments24,25,26,27,28,29.

Here, we demonstrate a deep learning enhanced Rydberg receiver for frequency-division multiplexed digital communication. In our experiment, the Rydberg atoms act as a sensitive antenna and a mixer to receive multifrequency MW signals and extract information9,11,12. The modulated signal frequency is reduced from several gigahertz to several kilohertz via the interaction between the Rydberg atoms and the MWs, thus allowing the information to be extracted using simple apparatus. These interference signals are then fed into a well-trained deep learning model to retrieve the messages. The deep learning model extracts the multifrequency MW signal phases, even without knowing anything about the Lindblad master equation, which describes the interactions between atoms and light beams in an open system theoretically. The solution of the master equation is often complex because the higher-order terms and the noises from the environment and from among the atoms are taken into consideration. However, the deep learning model is robust to the noise because of its generalization ability, which takes advantage of the sensitivity of the Rydberg atoms while also reducing the impact of the noise that results from this sensitivity. Our deep learning model is scalable, allowing it to recognize the information carried by more than 20 MW bins. Additionally, when the training is complete, the deep learning model extracts the phases more rapidly than via direct solution of the master equation.

## Results

### Setup

We adapt a two-photon Rydberg-EIT scheme to excite atoms from a ground state to a Rydberg state. A probe field drives the atomic transition $$|5{S}_{1/2},F=2\rangle \to |5{P}_{1/2},F^{\prime} =3\rangle$$ and a coupling light couples the transition $$|5{P}_{1/2},F^{\prime} =3\rangle \to |51{D}_{3/2}\rangle$$ in rubidium 85, as shown in Fig. 1a. Multifrequency MW fields drive a radio-frequency (RF) transition between the two different Rydberg states $$|51{D}_{3/2}\rangle$$ and $$|50{F}_{5/2}\rangle$$. The energy difference between these states is 17.62 GHz. The multifrequency MW fields consist of multiple MW bins (more than three bins) with frequency differences of several kilohertz from the resonance frequency. The amplitudes, frequencies, and phases of the multiple MW bins can be adjusted individually (further details are provided in the “Methods” section). The detunings of the probe, coupling and MW fields are Δp, Δc and Δs, respectively. The Rabi frequencies of the probe, coupling and MW fields are Ωp, Ωc and Ωs, respectively. The experimental setup is depicted in Fig. 1b. We use MW fields to drive the Rydberg states constantly, producing modulated EIT spectra, i.e., the probe transmission spectra, as shown in the inset of Fig. 1b. The phases of the MW fields correlate with the modulated EIT spectra and can be recovered from these spectra with the aid of deep learning. Specifically, the probe transmission spectra are fed into a well-trained deep learning model that consists of a one-dimensional convolution layer (1D CNN), a bi-directional long–short-term memory layer (Bi-LSTM) and a dense layer to extract the phases of the MW fields. Figure 1c–e shows these components of the neural network (further details are presented in the “Methods” section). Finally, the bin phases are recovered and the data are read out.

### Frequency-division multiplexed signal encoding and receiving

In the experiments, we use a four-bin frequency-division multiplexing (FDM) MW signal for demonstration, where one of the four MW bins is used as the reference bin. The relative phase differences between the reference bin and the other bins are modulated by the message signal. Specifically, for the four-bin MW signal,

$$E= \;{A}_{1}\cos [({\omega }_{0}+{\omega }_{1})t+{\varphi }_{1}]+{A}_{2}\cos [({\omega }_{0}+{\omega }_{2})t+{\varphi }_{2}]\\ +{A}_{3}\cos [({\omega }_{0}+{\omega }_{3})t+{\varphi }_{3}]+{A}_{4}\cos [({\omega }_{0}+{\omega }_{4})t+{\varphi }_{4}],$$

where ω0 is the resonant frequency, ω1,2,3,4 are the relative frequencies, the carrier frequencies are (ω0 + ω1)/2π = 17.62 GHz − 3 kHz, (ω0 + ω2)/2π = 17.62 GHz − 1 kHz, (ω0 + ω3)/2π = 17.62 GHz + 1 kHz and (ω0 + ω4)/2π = 17.62 GHz + 3 kHz, the frequency difference between two frequency-adjacent bins is Δf = 2 kHz, the message signal is φ1,2,3 = 0 or π, standing for 3 bits (0 or 1), and the reference phase is φ4 = 0 (which remains unchanged). The phase list $$\left({\varphi }_{1},\,{\varphi }_{2},\,{\varphi }_{3},\,{\varphi }_{4}\right)$$ is a bit string for time t0. By varying the phases φ1,2,3 with time, we then obtain the FDM signal for binary phase-shift keying (2PSK). Additionally, the amplitudes of the four bins are set to 0.1A4 = A1,2,3 to solve the problem that results from the nonlinearity of the atom, where the probe transmission spectra of two different bit strings, e.g. (0, 0, π, 0) and (0, π, 0, 0), would otherwise be the same (further details are presented in the “Methods” section). By increasing the frequency difference Δf, we can obtain higher information transmission rates. For four bins with Δf = 2 kHz, the information transmission rate is nb × Δf = (4 − 1) × 2 × 10³ bps = 6 kbps, where nb is the number of bits. In the experiments, disturbances originate from the environment and atomic collisions. Because of the sensitivity of Rydberg atoms to MW fields, the resulting noise submerges our signal. To use the sensitivity of the Rydberg atoms and simultaneously minimize the effects of noise, the deep learning model is used to extract the relative phases $$\left({\varphi }_{1},\,{\varphi }_{2},\,{\varphi }_{3}\right)$$.
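The encoding described above can be sketched numerically. The following is a minimal illustration, assuming arbitrary amplitude units and working at baseband (the common 17.62 GHz carrier is removed); the function names `encode_bits`, `baseband_envelope` and `bit_rate` are hypothetical and are not taken from the experimental code.

```python
import numpy as np

# Offsets of the four bins from the 17.62 GHz resonance, in Hz (values from the text).
OFFSETS_HZ = np.array([-3e3, -1e3, 1e3, 3e3])
# Relative amplitudes in arbitrary units: A1 = A2 = A3 = 0.1 * A4, reference bin last.
AMPLITUDES = np.array([0.1, 0.1, 0.1, 1.0])


def encode_bits(bits):
    """Map three message bits to the phase list (phi1, phi2, phi3, phi4 = 0)."""
    bits = np.asarray(bits, dtype=float)
    return np.append(np.pi * bits, 0.0)


def baseband_envelope(bits, t):
    """Slowly varying amplitude of the summed bins for one bit string."""
    phases = encode_bits(bits)
    # Complex sum of the bins after removing the common carrier exp(i*omega0*t).
    arg = 2 * np.pi * OFFSETS_HZ[:, None] * t[None, :] + phases[:, None]
    return np.abs((AMPLITUDES[:, None] * np.exp(1j * arg)).sum(axis=0))


def bit_rate(n_bins, delta_f_hz):
    """Rate quoted in the text: (number of bins - 1) * delta_f, in bit/s."""
    return (n_bins - 1) * delta_f_hz


t = np.arange(1000) * 1e-6                 # 1 ms window sampled every 1 us
env_a = baseband_envelope([0, 0, 1], t)    # phases (0, 0, pi, 0)
env_b = baseband_envelope([0, 1, 0], t)    # phases (0, pi, 0, 0)
print(bit_rate(4, 2e3))                    # 6000.0 bit/s for four bins
print(np.max(np.abs(env_a - env_b)))       # nonzero: the two envelopes differ
```

With the strong reference bin, the two bit strings quoted in the text give different field envelopes in this toy model. The sketch covers only the field envelope and does not model the atomic response.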

### Deep learning

To improve the robustness and speed of our receiver, we use a deep learning model to decode the probe transmission signal. The complete encoding and decoding process is illustrated in Fig. 2a. The Rydberg antenna receives the FDM-2PSK signal and down-converts this signal into the probe transmission spectrum. The information is then retrieved from the spectrum using the deep learning model. The precondition is that the different bit strings correspond to distinct probe spectra; this is resolved by setting 0.1A4 = A1,2,3, as discussed earlier. Then, we combine the 1D CNN layer, the Bi-LSTM layer and the dense layer to form the deep learning model (see the “Methods” section for further details)30,31. One of the reasons for using the 1D CNN layer and the Bi-LSTM layer is that the data sequences are long, which means that prediction of the phases $${{{{{\boldsymbol{\varphi }}}}}}=\left({\varphi }_{1},\,{\varphi }_{2},\,{\varphi }_{3},\,0\right)$$ from the spectrum is a regression task and requires a long-term memory for our model. Another reason is to combine the convolution layer’s speed with the sequential sensitivity of the Bi-LSTM layer32. The input sequence is first processed by the 1D CNN to extract the features, meaning that a long sequence is converted into a shorter sequence with higher-order features. This process is visualized to show how the deep learning model treats the transmission spectrum; more details are presented in the Supplementary Materials. The shorter sequence is then fed into the Bi-LSTM layer and resized by the dense layer to match the label size (see the “Methods” section for further details). 
Specifically, the probe spectrum $${{{{{\bf{T}}}}}}=\left\{{T}_{0},\,{T}_{\tau },\,{T}_{2\tau },\,\cdots \,,\,{T}_{{{{{{\rm{i}}}}}}\cdot \tau }\cdots \,,\,{T}_{{{{{{\rm{N}}}}}}\cdot \tau }\right\}$$ and the corresponding phases $${{{{{\boldsymbol{\varphi }}}}}}=\left({\varphi }_{1},\,{\varphi }_{2},\,{\varphi }_{3},\,{\varphi }_{4}=0\right)$$ are collected to form the data set, where Tiτ is the ith data point of a probe spectrum and the fourth bit φ4 = 0 is the reference bit. Both the spectra and the phases are 1D vectors with dimensions of N + 1 and 4, respectively. These independent, identically distributed data {{T}, {φ}} are fed into our model as a data set. By shuffling this data set and splitting it into three sets, i.e., a training set, a validation set and a test set, we train our model on the training set (feeding both the waveforms and labels {{T}, {φ}}), validate, and test our model on the validation and test sets, respectively (by feeding waveforms without labels and comparing the predictions with ground truth labels). The validation set is used to determine whether there is either overfitting or underfitting during training. Finally, the performance (i.e., accuracy) of the model is estimated by predicting the test set.

The performance of our deep learning model is affected by the number of training epochs and by the sizes of the training and validation sets. The training curves for different training sets and validation sets are shown in Fig. 2b, c. Initially, our model performs well on the training set only, implying overfitting. The curves then converge (dashed line) and our model performs well on both the training set and the validation set. The sudden jump in the loss curve in Fig. 2c is caused by the change in the learning rate (further details are presented in the “Methods” section). Use of more training and validation data causes the curves to converge more quickly. The deep learning model performs well after this few-sample training. In Fig. 2d, we show a confusion matrix for prediction of a uniformly distributed test set, which demonstrates an accuracy of 99.38%.

### Comparison between deep learning method and the master equation

In our case, the master equation that we employ is the commonly used one, which does not consider the noise spectrum. The accuracies of the deep learning model and of the master equation fitting on noisy data are different. Figure 4 shows the accuracies obtained by the two methods. The deep learning model is trained on a training set without additional noise and tested on a test set with additional white noise whose standard deviation is σ (the transmission spectra with noise are given in the Supplementary Materials). Here, for simplicity, the data set is composed of the transmission spectra for four MW bins only (one of which is the reference bin) and the frequency difference between adjacent bins is Δf = 2 kHz. The result of the master equation is obtained on the same test set as that of the deep learning model. The deep learning method outperforms the fitting of the master equation on the noisy data set.

Apart from its robustness to noise, the deep learning model performs well when the transmission rate is increased by increasing the number of MW bins or the frequency difference Δf, whereas it is difficult to retrieve the messages with high accuracy using the master equation. Specifically, to increase the bandwidth efficiency and the transmission rate, the number of MW bins used to carry the messages must be increased, but the information is still recognizable because of the scalability of the deep learning model. For 20 MW bins, the number of bits is (20 − 1) with one reference bit, giving a $$\left(20-1\right)\times 2\,{{{{{\rm{kbps}}}}}}=38\,{{{{{\rm{kbps}}}}}}$$ transmission rate. The number of combinations of these bits is $${2}^{19}$$, which increases exponentially as the number of MW bins increases. Here, for demonstration purposes, only the first 3 bits of the total of 19 bits carry the messages and the other bits, including the reference, are set to 0. To show how well our model performs, we train, validate and test the model on this new data set without varying the other parameters, with the exception of the training epochs of our model. The loss curves for training and validation are shown in Fig. 5a. A confusion matrix for epoch 78 is shown in Fig. 5b. The model performs well on this new test set, which was sampled uniformly from eight categories, with an accuracy of 100%. Another method that can be used to increase the information transmission rate involves increasing the frequency difference. In our case, the frequency difference is increased from Δf = 2 kHz to Δf = 200 kHz. The transmission rate increases correspondingly, from (4 − 1) × 2 kbps = 6 kbps to (4 − 1) × 200 kbps = 0.6 Mbps. To detect the high-speed signal, the DD bandwidth is increased, which inevitably leads to increased noise. After the model is trained on this new data set, the training and validation loss curves are as shown in Fig. 5c. A confusion matrix for epoch 83 is shown in Fig. 5d. Increasing the number of training epochs allows the model to perform well on this new data set, with an accuracy of 98.83% on a uniformly sampled test set.

To compare the performances of the deep learning model and the master equation, we fitted the probe spectra for 20 bins with a frequency difference Δf = 2 kHz and four bins with a frequency difference Δf = 200 kHz by solving the master equation without considering the higher-order terms and the effects of noise. In each case, 160 probe spectra were fitted that were sampled uniformly from every category. The prediction results are shown in Fig. 5e, f. The prediction accuracy of the master equation is lower than that of the deep learning model. In our case, increasing the number of bins has a greater impact on the fitting accuracy than increasing the DD bandwidth for high-speed signals. The prediction accuracy for a 20-bin carrier with frequency difference Δf = 2 kHz is 20.63%, which is close to the accuracy of random guessing, i.e., 1/8 = 12.5%. This implies that there is a disadvantage that comes from the fitting method itself, i.e., it can easily become trapped by local minima. Some type of prior knowledge is required to overcome this disadvantage, e.g., provision of the initial values of the phases before fitting. In contrast, the deep learning model is data driven and does not require any prior knowledge. The local minima problem of deep learning can be overcome using some well-known techniques, including learning rate scheduling and design of a more effective optimizer32. Additionally, the accuracy difference for the 200-kHz-difference MW bins between the deep learning model and the master equation means that the deep learning model is more robust to noise. Furthermore, the prediction time for the master equation is 25 s per spectrum, while the time for the deep learning model is 1.6 ms per spectrum. The master equation is solved using the “FindFit” function in Mathematica 11.1 with both “AccuracyGoal” and “PrecisionGoal” left at their default values, while the deep learning code is written in Python 3.7.6. Both codes are run on the same computer with an NVIDIA GTX 1650 GPU and an Intel Core i7-9750H CPU.

Another method to decode the signal is available that uses an in-phase and quadrature (I–Q) demodulator or a lock-in amplifier7,12. However, the carrier frequency must be given when decoding the signal in this case. Additionally, for multiple MW bins, numerous bandpass filters are required. The deep learning method is thus much more convenient.
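For comparison, the I–Q approach can be sketched as follows. This is a minimal illustration, assuming that the detected signal is proportional to the linearised beat envelope of Eq. (3) in the “Methods” section, that the beat frequencies are known in advance, and that the record contains an integer number of beat periods; the function name `iq_demodulate` and the toy signal are hypothetical.

```python
import numpy as np


def iq_demodulate(signal, t, beat_hz):
    """Return the phase (rad) of the component of `signal` at `beat_hz`."""
    ref_i = np.cos(2 * np.pi * beat_hz * t)   # in-phase reference
    ref_q = np.sin(2 * np.pi * beat_hz * t)   # quadrature reference
    i_comp = 2 * np.mean(signal * ref_i)
    q_comp = 2 * np.mean(signal * ref_q)
    # For signal = A*cos(W*t + theta): i_comp = A*cos(theta), q_comp = -A*sin(theta).
    return np.arctan2(-q_comp, i_comp)


t = np.arange(1000) * 1e-6            # 1 ms record, 1 us sampling step
beats_hz = [6e3, 4e3, 2e3]            # f4 - f1, f4 - f2, f4 - f3 for Delta f = 2 kHz
phases = [0.0, np.pi, 0.0]            # phi1, phi2, phi3 (reference phi4 = 0)

# Toy beat signal: A4 + sum_i A_i * cos[(w4 - w_i) t + (phi4 - phi_i)], arbitrary units.
signal = 1.0 + sum(0.1 * np.cos(2 * np.pi * f * t - p)
                   for f, p in zip(beats_hz, phases))

# A recovered phase near +/- pi decodes to bit 1, a phase near 0 to bit 0.
decoded = [int(abs(iq_demodulate(signal, t, f)) > np.pi / 2) for f in beats_hz]
print(decoded)  # [0, 1, 0]
```

Each bin needs its own reference frequency here, which is the practical limitation noted above.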

## Discussion

We report a Rydberg receiver enhanced via deep learning to detect multifrequency MW fields. The results show that the deep learning enhanced Rydberg mixer receives and decodes multifrequency MW fields efficiently; these fields are often difficult to decode using theoretical methods. Using the deep learning model, the Rydberg receiver is robust to noise induced by the environment and atomic collisions and is immune to the distortion that results from the limited bandwidths of the Rydberg atoms (from dipole-dipole interactions and the EIT pumping rate, as studied in ref. 7) for high-speed signals (Δf = 200 kHz). In addition to increasing the transmission speed of the signals, further increments in the information transmission rate are achieved by using more bins, which is feasible because of the scalability of our model. Besides the transmission rate, this deep learning enhanced Rydberg system is promising for use in studies of channel capacity limitations. Because spectra that are difficult for humans to recognize as a result of noise and distortion are distinguishable when using the deep learning model, Rydberg systems enhanced by deep learning could take steps toward the realization of the capacity limit proposed in ref. 34. To obtain high performance (i.e. high signal-to-noise ratio, information transmission rate, channel capacity and accuracy), the number of training epochs must be extended and the training set must be enlarged.

## Methods

### Generation and calibration of MW fields

The MW fields used in our experiments were synthesized by the signal generator (1465F-V from Ceyear) and a frequency horn. Each bin in the multifrequency MW field is tunable in terms of frequency, amplitude and phase. The RF source operates in the range from DC to 40 GHz. The frequency horn is located close to the Rb cell. We used an antenna and a spectrum analyser (4024F from Ceyear) to receive the MW fields and then calibrated the amplitudes of the MW fields at the centre of the Rb cell.

The probe transmission spectrum in the time domain when Δp = 0, Δc = 0 and Δs = 0 reflects the interference among the multifrequency MW bins, which results from the beat frequencies of the bins that occur through the interaction between the atoms and light. The Rydberg atoms receive the MW bins by acting as an antenna and a mixer9,11,12. After reception by the atoms, the frequency spectrum of the probe transmission shows that we can obtain the frequency differential signal from the probe transmission spectrum. This represents an application of our atoms to reduce the modulated signal frequency (from gigahertz to kilohertz magnitude), which allows the signal to be received and decoded using simple apparatus. In our experiment, more than 20 frequency bins can be added to the atoms, for which the dynamic range is greater than 30 dBm. The amplitudes, phases and frequencies of these bins can be tuned individually. When the bandwidth is increased to detect a signal with a larger frequency difference Δf, more noise is involved, but this noise is suppressed by the deep learning model. In other words, the signal can be recognized using the deep learning model when the information transmission rate is increased by raising the frequency difference Δf. These bins are used to send FDM-PSK signals in the “Frequency-division multiplexed signal encoding and receiving” section of the main text.

### Master equation

The Lindblad master equation is given as follows: $${{{{{\rm{d}}}}}}\rho /{{{{{\rm{d}}}}}}t=-i\left[H,\rho \right]/\hslash +L/\hslash$$, where ρ is the density matrix of the atomic ensemble and $$H={\sum }_{k}H[{\rho }^{(k)}]$$ is the atom–light interaction Hamiltonian when summed over all the single-atom Hamiltonians using the rotating wave approximation. This Hamiltonian has the following matrix form:

$$H=\hslash \left(\begin{array}{llll}0&-\frac{{{{\Omega }}}_{{{{{{\rm{p}}}}}}}}{2}&0&0\\ -\frac{{{{\Omega }}}_{{{{{{\rm{p}}}}}}}}{2}&{{{\Delta }}}_{{{{{{\rm{p}}}}}}}&-\frac{{{{\Omega }}}_{{{{{{\rm{c}}}}}}}}{2}&0\\ 0&-\frac{{{{\Omega }}}_{{{{{{\rm{c}}}}}}}}{2}&{{{\Delta }}}_{{{{{{\rm{c}}}}}}}+{{{\Delta }}}_{{{{{{\rm{p}}}}}}}&-\frac{{{{\Omega }}}_{{{{{{\rm{s}}}}}}}(t)}{2}\\ 0&0&-\frac{{{{\Omega }}}_{{{{{{\rm{s}}}}}}}(t)}{2}&{{{\Delta }}}_{{{{{{\rm{c}}}}}}}+{{{\Delta }}}_{{{{{{\rm{p}}}}}}}+{{{\Delta }}}_{{{{{{\rm{s}}}}}}}\end{array}\right),$$
(1)

where for the MW signal $$E={A}_{1}\cos [\left({\omega }_{0}+{\omega }_{1}\right)t+{\varphi }_{1}]+{A}_{2}\cos [\left({\omega }_{0}+{\omega }_{2}\right)t+{\varphi }_{2}]+{A}_{3}\cos [\left({\omega }_{0}+{\omega }_{3}\right)t+{\varphi }_{3}]+{A}_{4}\cos [\left({\omega }_{0}+{\omega }_{4}\right)t+{\varphi }_{4}]$$, we have the Rabi frequency $${{{\Omega }}}_{{{{{{\rm{s}}}}}}}(t)=\sqrt{{E}_{1}^{2}+{E}_{2}^{2}}$$, where $${E}_{1}={A}_{1}\sin [{\omega }_{1}t+{\varphi }_{1}]+{A}_{2}\sin [{\omega }_{2}t+{\varphi }_{2}]+{A}_{3}\sin [{\omega }_{3}t+{\varphi }_{3}]+{A}_{4}\sin [{\omega }_{4}t+{\varphi }_{4}]$$ and $${E}_{2}={A}_{1}\cos [{\omega }_{1}t+{\varphi }_{1}]+{A}_{2}\cos [{\omega }_{2}t+{\varphi }_{2}]+{A}_{3}\cos [{\omega }_{3}t+{\varphi }_{3}]+{A}_{4}\cos [{\omega }_{4}t+{\varphi }_{4}]$$. The Rabi frequency can be derived as follows:

$$E =\mathop{\sum }\limits_{i=1}^{4}{A}_{i}\cos \left[\left({\omega }_{0}+{\omega }_{i}\right)t+{\varphi }_{i}\right]\\ =\sqrt{{E}_{1}^{2}+{E}_{2}^{2}}\cos \left({\omega }_{0}t+\arctan \frac{\mathop{\sum }\nolimits_{i = 1}^{4}{A}_{i}\sin \left({\omega }_{i}t+{\varphi }_{i}\right)}{\mathop{\sum }\nolimits_{i = 1}^{4}{A}_{i}\cos \left({\omega }_{i}t+{\varphi }_{i}\right)}\right),$$
(2)

where the second term (which resonates with the energy levels of the Rydberg atoms) induces the normal EIT spectrum and the first term modulates that spectrum. In the interaction between the atoms and the MW fields, the atoms act as a mixer such that the output signal frequency (ω1, ω2, ω3) is less than the input signal frequency (ω0 + ω1, ω0 + ω2, ω0 + ω3). The modulation signal’s nonlinearity is reduced by setting the reference and increasing its amplitude as shown in Eq. (3), which is a precondition for recognition of these phases via deep learning.

$$\sqrt{{E}_{1}^{2}+{E}_{2}^{2}} \\ \approx {A}_{4}\sqrt{1+2\mathop{\sum }\limits_{i=1}^{3}\frac{{A}_{i}}{{A}_{4}}\cos \left[\left({\omega }_{4}-{\omega }_{i}\right)t+\left({\varphi }_{4}-{\varphi }_{i}\right)\right]}\\ \approx {A}_{4}+\mathop{\sum }\limits_{i=1}^{3}{A}_{i}\cos \left[\left({\omega }_{4}-{\omega }_{i}\right)t+\left({\varphi }_{4}-{\varphi }_{i}\right)\right],$$
(3)

where the condition for the approximations on the second line and the third line is A4 ≫ A1,2,3.
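The quality of the linearisation in Eq. (3) can be checked numerically. The following sketch, assuming arbitrary amplitude units and one example phase list, compares the exact envelope with the linear approximation for a strong reference bin (0.1A4 = A1,2,3) and for equal amplitudes; the function names are hypothetical.

```python
import numpy as np

t = np.arange(1000) * 1e-6                              # 1 ms, 1 us step
omega = 2 * np.pi * np.array([-3e3, -1e3, 1e3, 3e3])    # relative angular frequencies
phi = np.array([0.0, np.pi, 0.0, 0.0])                  # example phase list, phi4 = 0


def envelope_exact(amps):
    """sqrt(E1^2 + E2^2) with E1, E2 as defined in the text."""
    arg = omega[:, None] * t[None, :] + phi[:, None]
    e1 = (amps[:, None] * np.sin(arg)).sum(axis=0)
    e2 = (amps[:, None] * np.cos(arg)).sum(axis=0)
    return np.sqrt(e1 ** 2 + e2 ** 2)


def envelope_linear(amps):
    """Last line of Eq. (3): A4 + sum_i A_i cos[(w4 - w_i) t + (phi4 - phi_i)]."""
    out = np.full(t.shape, amps[3])
    for i in range(3):
        out = out + amps[i] * np.cos((omega[3] - omega[i]) * t + (phi[3] - phi[i]))
    return out


weak = np.array([0.1, 0.1, 0.1, 1.0])    # strong reference bin
equal = np.array([1.0, 1.0, 1.0, 1.0])   # no dominant reference bin

err_weak = np.max(np.abs(envelope_exact(weak) - envelope_linear(weak)))
err_equal = np.max(np.abs(envelope_exact(equal) - envelope_linear(equal)))
print(err_weak, err_equal)
```

In this toy check, the deviation stays at the level of a few percent of A4 for the strong reference bin and becomes of order A4 or larger for equal amplitudes, which illustrates why the reference amplitude is increased.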

The Lindblad superoperator $$L={\sum }_{k}L[{\rho }^{(k)}]$$ is composed of single-atom superoperators, where $$L[{\rho }^{(k)}]$$ represents the Lindbladian and has the following form: $$\frac{L[{\rho }^{(k)}]}{\hslash }=-\frac{1}{2}{\sum }_{m}\left({C}_{m}^{\dagger }{C}_{m}\rho +\rho {C}_{m}^{\dagger }{C}_{m}\right)+{\sum }_{m}{C}_{m}\rho {C}_{m}^{\dagger }$$ where $${C}_{1}=\sqrt{{{{\Gamma }}}_{{{{{{\rm{e}}}}}}}}\left|g\right\rangle \left\langle e\right|$$, $${C}_{2}=\sqrt{{{{\Gamma }}}_{{{{{{\rm{r}}}}}}}}\left|e\right\rangle \left\langle r\right|$$ and $${C}_{3}=\sqrt{{{{\Gamma }}}_{{{{{{\rm{s}}}}}}}}\left|r\right\rangle \left\langle s\right|$$ are collapse operators that stand for the decays from state $$\left|e\right\rangle$$ to state $$\left|g\right\rangle$$, from state $$\left|r\right\rangle$$ to state $$\left|e\right\rangle$$ and from state $$\left|s\right\rangle$$ to state $$\left|r\right\rangle$$ with rates Γe, Γr and Γs, respectively. Because we are only concerned with the steady state here, i.e. t → ∞, the Lindblad master equation can be solved using dρ/dt = 0. The complex susceptibility of the EIT medium has the form $$\chi (v)=({\mu }_{ge}^{2}/{\epsilon }_{0}){\rho }_{eg}$$, where ρeg is the element of the density matrix solved using the master equation. The spectrum of the EIT medium can be obtained from the susceptibility using $$T \sim {e}^{-{{{{{\rm{Im}}}}}}[\chi ]}.$$
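The steady-state condition dρ/dt = 0 can be sketched for a single atom as follows. This is a minimal illustration with ħ = 1 that treats Ωs as a fixed value (a quasi-static snapshot of Ωs(t)). All numerical parameters are hypothetical and in arbitrary units, and the function names are not taken from the fitting code used in this work.

```python
import numpy as np


def hamiltonian(omega_p, omega_c, omega_s, delta_p=0.0, delta_c=0.0, delta_s=0.0):
    """Four-level Hamiltonian of Eq. (1) with hbar = 1 (basis order g, e, r, s)."""
    h = np.zeros((4, 4), dtype=complex)
    h[0, 1] = h[1, 0] = -omega_p / 2
    h[1, 2] = h[2, 1] = -omega_c / 2
    h[2, 3] = h[3, 2] = -omega_s / 2
    h[1, 1] = delta_p
    h[2, 2] = delta_c + delta_p
    h[3, 3] = delta_c + delta_p + delta_s
    return h


def collapse_operators(gamma_e, gamma_r, gamma_s):
    """C1 = sqrt(G_e)|g><e|, C2 = sqrt(G_r)|e><r|, C3 = sqrt(G_s)|r><s|."""
    ops = []
    for (lower, upper), gamma in zip([(0, 1), (1, 2), (2, 3)],
                                     [gamma_e, gamma_r, gamma_s]):
        c = np.zeros((4, 4), dtype=complex)
        c[lower, upper] = np.sqrt(gamma)
        ops.append(c)
    return ops


def liouvillian(h, c_ops):
    """Matrix M with d vec(rho)/dt = M vec(rho), using row-major vectorisation."""
    eye = np.eye(h.shape[0])
    m = -1j * (np.kron(h, eye) - np.kron(eye, h.T))
    for c in c_ops:
        cdc = c.conj().T @ c
        m += np.kron(c, c.conj()) - 0.5 * np.kron(cdc, eye) - 0.5 * np.kron(eye, cdc.T)
    return m


def steady_state(h, c_ops):
    """Solve M vec(rho) = 0 together with Tr(rho) = 1 in the least-squares sense."""
    n = h.shape[0]
    a = np.vstack([liouvillian(h, c_ops), np.eye(n).reshape(1, -1)])
    b = np.zeros(n * n + 1, dtype=complex)
    b[-1] = 1.0
    vec, *_ = np.linalg.lstsq(a, b, rcond=None)
    return vec.reshape(n, n)


# Hypothetical decay rates and Rabi frequencies (arbitrary units).
c_ops = collapse_operators(gamma_e=6.0, gamma_r=0.01, gamma_s=0.01)
rho_probe_only = steady_state(hamiltonian(0.5, 0.0, 0.0), c_ops)
rho_eit = steady_state(hamiltonian(0.5, 5.0, 0.0), c_ops)
rho_mw = steady_state(hamiltonian(0.5, 5.0, 2.0), c_ops)

# Im(rho_eg) is proportional to the probe absorption, so T ~ exp(-const * Im(rho_eg)).
for name, rho in [("probe only", rho_probe_only), ("EIT", rho_eit), ("EIT + MW", rho_mw)]:
    print(name, rho[1, 0].imag)
```

In this sketch the coupling field strongly suppresses the resonant absorption (the EIT window), and the value of Im(ρeg) then depends on Ωs, which is the mechanism that maps the MW envelope onto the probe transmission.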

### Deep learning layers

Our deep learning model consists of a 1D CNN layer, a Bi-LSTM layer and a dense layer. The mathematical sketches for these layers are given as follows.

The 1D CNN layer is illustrated in Fig. 1c. The input signal convolutes the kernel in the following form:

$$f\otimes g=\mathop{\sum }\limits_{m=0}^{N-1}{f}_{m}{g}_{(n-m)}.$$
(4)

where f represents the input data, g is the convolution kernel, m is the input data index and n is the kernel index. The 1D CNN extracts the higher-order features from the input data to reduce the lengths of the sequences fed into the Bi-LSTM layer. Before flowing into the Bi-LSTM layer, the data pass through the batch normalization layer, the ReLU activation layer and the max-pooling layer, in that sequence. For a mini-batch $${{{{{\mathcal{B}}}}}}=\left\{{x}_{1\cdots m}\right\}$$, the output from the batch normalization layer is yi = BNγ,β(xi) and the learning parameters are γ and β40. The update rules for the batch normalization layer are:

$${\mu }_{{{{{{\mathcal{B}}}}}}}\leftarrow \frac{1}{m}\mathop{\sum }\limits_{i=1}^{m}{x}_{i},$$
(5)
$${\sigma }_{{{{{{\mathcal{B}}}}}}}^{2}\leftarrow \frac{1}{m}\mathop{\sum }\limits_{i=1}^{m}{\left({x}_{i}-{\mu }_{{{{{{\mathcal{B}}}}}}}\right)}^{2},$$
(6)
$${\hat{x}}_{i}\leftarrow \frac{{x}_{i}-{\mu }_{{{{{{\mathcal{B}}}}}}}}{\sqrt{{\sigma }_{{{{{{\mathcal{B}}}}}}}^{2}+\epsilon}},$$
(7)
$${y}_{i}\leftarrow \gamma {\hat{x}}_{i}+\beta \equiv B{N}_{\gamma ,\beta }({x}_{i}),$$
(8)

where Eqs. (5) and (6) evaluate the mean and the variance of the mini-batch, respectively; the data are normalized using the mean and the variance in Eq. (7) and the results are then scaled and shifted in Eq. (8). The training is accelerated using the batch normalization layer and the overfitting is also weakened by this layer. The output then passes through the ReLU activation layer. The activation function of this layer is $${f}_{{{{{{\rm{ReLU}}}}}}}(x)=\max (x,0)$$. The vanishing gradient problem is diminished by this activation function. Next, the inputs are downsampled in a max-pooling layer30.
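The chain of operations that precedes the Bi-LSTM layer (convolution, batch normalization, ReLU and max pooling) can be sketched with NumPy as follows. This is an illustration of Eqs. (4)–(8) only, not the Keras implementation; the kernel values, ϵ, the pool size and the batch of random inputs are hypothetical.

```python
import numpy as np


def conv1d_valid(x, kernel):
    """Discrete convolution of Eq. (4), keeping only fully overlapping positions."""
    return np.convolve(x, kernel, mode="valid")


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Eqs. (5)-(8): normalise over the mini-batch axis, then scale and shift."""
    mu = batch.mean(axis=0)                       # Eq. (5)
    var = batch.var(axis=0)                       # Eq. (6)
    x_hat = (batch - mu) / np.sqrt(var + eps)     # Eq. (7)
    return gamma * x_hat + beta                   # Eq. (8)


def relu(x):
    """f_ReLU(x) = max(x, 0), applied element-wise."""
    return np.maximum(x, 0.0)


def max_pool1d(x, pool=2):
    """Downsample the last axis by taking the maximum over windows of length `pool`."""
    n = (x.shape[-1] // pool) * pool              # drop an incomplete trailing window
    return x[..., :n].reshape(*x.shape[:-1], -1, pool).max(axis=-1)


rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 1000))                # 8 toy spectra of length 1000
kernel = np.array([0.25, 0.5, 0.25])              # hypothetical smoothing kernel

features = np.stack([conv1d_valid(x, kernel) for x in batch])
pooled = max_pool1d(relu(batch_norm(features)), pool=2)
print(features.shape, pooled.shape)               # (8, 998) (8, 499)
```

Note that `np.convolve` flips the kernel as in Eq. (4), whereas many deep learning frameworks compute a cross-correlation; for learned kernels the two conventions are equivalent up to a reversal of the kernel.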

The LSTM layer and an LSTM cell are shown schematically in Figs. 1d and 6a, respectively. The equations for the LSTM are shown as Eqs. (9)–(14)32,41. At a time t, the input xt and two internal states Ct−1 and ht−1 are fed into the LSTM cell. The first thing to be decided by the LSTM cell is whether or not to forget in Eq. (9), which outputs a number between 0 and 1 that represents retaining or forgetting. Next, an input gate (Eq. (10)) decides which values are to be updated from a vector of new candidate values created using Eq. (11). The new value is then added to the cell state and the old value is forgotten in Eq. (12). Finally, the cell decides what to output using Eqs. (13) and (14).

$${f}_{t}=\sigma \left({W}_{f}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right),$$
(9)
$${i}_{t}=\sigma \left({W}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right),$$
(10)
$${\tilde{C}}_{t}=\tanh \left({W}_{C}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{C}\right),$$
(11)
$${C}_{t}={f}_{t}\times {C}_{t-1}+{i}_{t}\times {\tilde{C}}_{t},$$
(12)
$${o}_{t}=\sigma \left({W}_{o}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right),$$
(13)
$${h}_{t}={o}_{t}\times \tanh \left({C}_{t}\right),$$
(14)

where $$\sigma (x)=1/(1+{e}^{-x})$$ is the sigmoid function. The sigmoid and $$\tanh$$ functions are applied in an element-wise manner. The LSTM is followed by a time-reversed LSTM to constitute a Bi-LSTM layer that improves the memory for long sequences.
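Equations (9)–(14) translate directly into a few lines of NumPy. The sketch below is illustrative only: the weights are random rather than trained, the hidden size and sequence length are hypothetical, and the Bi-LSTM output is formed here by concatenating the final hidden states of a forward pass and a time-reversed pass, which is one common convention.

```python
import numpy as np


def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell update following Eqs. (9)-(14)."""
    z = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])          # Eq. (9): forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])          # Eq. (10): input gate
    c_tilde = np.tanh(p["W_C"] @ z + p["b_C"])      # Eq. (11): candidate values
    c_t = f_t * c_prev + i_t * c_tilde              # Eq. (12): new cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])          # Eq. (13): output gate
    h_t = o_t * np.tanh(c_t)                        # Eq. (14): new hidden state
    return h_t, c_t


def run_lstm(seq, p, n_hidden):
    """Feed a sequence of shape (time, features) through the cell; return last h."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, p)
    return h


def bi_lstm(seq, p_forward, p_backward, n_hidden):
    """Concatenate the outputs of a forward LSTM and a time-reversed LSTM."""
    return np.concatenate([run_lstm(seq, p_forward, n_hidden),
                           run_lstm(seq[::-1], p_backward, n_hidden)])


def init_params(n_in, n_hidden, rng):
    """Random (untrained) weights and zero biases, for illustration only."""
    p = {}
    for gate in ("f", "i", "C", "o"):
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    return p


rng = np.random.default_rng(0)
seq = rng.normal(size=(50, 1))                      # 50 time steps, 1 feature
out = bi_lstm(seq, init_params(1, 8, rng), init_params(1, 8, rng), n_hidden=8)
print(out.shape)                                    # (16,)
```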

The dense layer and a neuron are drawn in Figs. 1e and 6b, respectively, and the corresponding equations are

$${{{{{\bf{a}}}}}}={{{{{\bf{w}}}}}}\cdot {{{{{\bf{x}}}}}}+b,$$
(15)
$${{{{{\bf{y}}}}}}=g({{{{{\bf{a}}}}}}),$$
(16)

where w is the vector of weights, b is the bias, x represents the input data, $$g(a)=1/(1+{e}^{-a})$$ is the sigmoid activation function used to limit the output values to between 0 and 1, and y is the output. The dense layer resizes the shape of the data obtained from the Bi-LSTM to match the size of the label.
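As a short illustration of Eqs. (15) and (16), the sketch below applies a dense layer with four output neurons and then thresholds the outputs at 0.5 to read out bits. The weight matrix stacks one weight vector per neuron; the weights, biases, input size and threshold are hypothetical choices made for this example.

```python
import numpy as np


def dense_sigmoid(x, w, b):
    """Eqs. (15)-(16) for a layer of neurons: y = g(w . x + b) with sigmoid g."""
    a = w @ x + b
    return 1.0 / (1.0 + np.exp(-a))


def to_bits(y, threshold=0.5):
    """Round the outputs in (0, 1) to bits; the threshold is an assumption."""
    return (np.asarray(y) > threshold).astype(int)


x = np.ones(16)                                # e.g. a 16-element Bi-LSTM output
w = np.zeros((4, 16))                          # hypothetical weights, one row per neuron
b = np.array([-5.0, 5.0, -5.0, -5.0])          # hypothetical biases
y = dense_sigmoid(x, w, b)
print(to_bits(y))                              # [0 1 0 0]
```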

The training consists of both forward and backward propagation. A batch of probe spectra propagates through the 1D CNN layer, the Bi-LSTM layer, and dense layer during the forward training process. The differentiable loss function is then calculated. In our case, the differentiable loss function is the mean squared error (MSE) between the predictions and the ground truth, which is used widely in the regression task32. The equation for the MSE is

$${L}_{{{{{{\rm{MSE}}}}}}}=\frac{1}{m\cdot n}\mathop{\sum }\limits_{j=1}^{n}\mathop{\sum }\limits_{i=1}^{m}{\left({\varphi }_{i,j}-f({T}_{i,j})\right)}^{2},$$
(17)

where m is the number of data points in one spectrum, n is the mini-batch size, φi,j is the ground truth and f(Ti,j) is the model prediction. In backpropagation, the trainable weights of each layer are updated based on the learning rate and the derivative of the MSE loss function with respect to the weights to minimize the loss LMSE, such that

$$W\leftarrow W-\eta \frac{\partial {L}_{{{{{{\rm{MSE}}}}}}}}{\partial W},$$
(18)

where η is the learning rate and W is the trainable weight for each layer. The weights of each layer are then updated according to the RMSprop optimizer42.
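The loss of Eq. (17) and a weight update in the spirit of Eq. (18) can be sketched on a toy linear regression problem. The RMSprop rule below is the commonly quoted form, in which the gradient is divided by a running root-mean-square of past gradients; the decay rate `rho`, the constant `eps`, the learning rate and the toy data are assumptions and are not the settings used in this work.

```python
import numpy as np


def mse_loss(pred, target):
    """Mean squared error between predictions and ground truth, as in Eq. (17)."""
    return np.mean((target - pred) ** 2)


def rmsprop_step(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-7):
    """One RMSprop update: scale the gradient step of Eq. (18) per weight."""
    cache = rho * cache + (1.0 - rho) * grad ** 2     # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache


# Toy problem: recover w_true from noiseless linear data.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
w_true = np.array([0.5, -1.0, 0.25, 0.0, 1.0])
y = x @ w_true

w = np.zeros(5)
cache = np.zeros(5)
initial_loss = mse_loss(x @ w, y)
for _ in range(500):
    grad = 2.0 * x.T @ (x @ w - y) / y.size           # dL_MSE/dw for the linear model
    w, cache = rmsprop_step(w, grad, cache, lr=1e-2)
final_loss = mse_loss(x @ w, y)
print(initial_loss, final_loss)
```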

The network is implemented using the Keras 2.3.1 framework on Python 3.6.11 (ref. 30). All weights are initialized with the Keras default. The hyper-parameters of the deep learning model (including the convolution kernel length, the number of hidden variables and the learning rate) are tuned using Optuna43.

### Deep learning pipeline

To obtain better fitting results, the data are scaled based on their maximum and minimum values, i.e., $${T}_{i}^{\prime} =({T}_{i}-\min(T))/(\max(T)-\min(T))$$. The labels are encoded in dense vectors with four elements rather than in one-hot encoding vectors to save space32. Each of these elements is either 0 or 1, representing the relative phase 0 or π of each bin, respectively.
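A minimal sketch of this preprocessing is given below; the function names are hypothetical. For three message bits, a one-hot label would need 2³ = 8 entries per spectrum, whereas the dense label uses four.

```python
import numpy as np


def min_max_scale(spectrum):
    """Scale one probe spectrum to [0, 1]: T'_i = (T_i - min T) / (max T - min T)."""
    s = np.asarray(spectrum, dtype=float)
    return (s - s.min()) / (s.max() - s.min())


def phases_to_label(phases):
    """Dense label: 0 for phase 0 and 1 for phase pi, one element per bin."""
    return np.rint(np.asarray(phases, dtype=float) / np.pi).astype(int)


def label_to_phases(label):
    """Inverse mapping from a dense label back to the phase list."""
    return np.pi * np.asarray(label, dtype=float)


print(min_max_scale([2.0, 4.0, 6.0]))               # [0.  0.5 1. ]
print(phases_to_label([0.0, np.pi, np.pi, 0.0]))    # [0 1 1 0]
```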

A one-dimensional convolution layer (1D CNN), a bidirectional long–short-term memory layer (Bi-LSTM) and a dense layer are used in our deep learning model. The deep learning model structure is shown in Fig. 7. The data size for the input layer is given in the form (batch size, length of probe spectrum, number of features). The batch size is 64 in our case. Because the duration of the spectrum ranges from t = 0 to t = 0.999 ms with a time difference of τ = 1 μs, the spectrum length is 1000. For a 1D input, the number of features is 1. Therefore, the data size for the input layer is (64, 1000, 1).

During training of this model, fourfold cross-validation is used to make efficient use of the limited amount of training data. The data set is split as shown in Fig. 8. First, the data set is split into two parts. The first is the test set (red), which remains untouched during training. The second (purple) is used to train the model. In the cross-validation process, the remaining data set (purple) is copied four times and each copy is divided equally into four parts. In each copy, a different part serves as the validation set (green) and the others are used as the training set (blue). Four models are trained on the different training sets and validation sets. Then the best model is chosen according to the validation set and is tested on the test set. After splitting, the training set, the validation set, and the test set all remain unchanged. In every epoch, each model iterates over the training set only once; no new data are drawn between epochs.
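The split can be sketched with index arrays as follows. The total number of samples and the test fraction are hypothetical values chosen for illustration, and the function names are not taken from the original code.

```python
import numpy as np


def split_test(n_samples, test_fraction, rng):
    """Shuffle the sample indices and hold out a test set that is never trained on."""
    idx = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return idx[n_test:], idx[:n_test]               # (remaining, test)


def k_fold_indices(indices, k=4):
    """Yield (train, validation) index pairs; each fold is the validation set once."""
    folds = np.array_split(indices, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]


rng = np.random.default_rng(0)
rest, test = split_test(100, 0.2, rng)              # hypothetical sizes
splits = list(k_fold_indices(rest, k=4))
for train, val in splits:
    print(len(train), len(val), len(test))          # 60 20 20 for each of the four folds
```

Each of the four models would be trained on one `train` array and monitored on the matching `val` array, with `test` used only for the final evaluation.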

The computational graph is cleared before each training sequence to prevent leakage of the validation data. Gaussian noise (where the mean is 0 and the standard deviation is 0.5) is added to the training data to increase the robustness of the proposed model. In addition, the learning rate is adjusted during training to jump out of the local minimum, which results in the jump in Fig. 2c in the main text. The initial learning rate is 0.001. If the loss (mean-square error) of the validation set does not decrease over 10 epochs, the learning rate is multiplied by 0.1. The RMSprop optimizer is used to update the weight of each layer during training42.
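The noise augmentation and the learning rate adjustment described above can be sketched in plain Python. The class below is a hypothetical stand-in for a reduce-on-plateau rule and is not the callback used in the original code; only the initial learning rate (0.001), the factor (0.1), the patience (10 epochs) and the noise parameters are taken from the text.

```python
import numpy as np


def add_gaussian_noise(batch, rng, std=0.5):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return batch + rng.normal(0.0, std, size=batch.shape)


class PlateauSchedule:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def update(self, val_loss):
        """Call once per epoch with the validation loss; returns the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr


rng = np.random.default_rng(0)
batch = np.zeros((64, 1000))                    # one mini-batch of toy spectra
noisy = add_gaussian_noise(batch, rng)

schedule = PlateauSchedule()
rates = [schedule.update(1.0) for _ in range(11)]   # validation loss stuck at 1.0
print(rates[0], rates[-1])                      # rate drops by 10x after 10 stalled epochs
```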

The bidirectional LSTM layer can be replaced with the well-known self-attention layer to improve the memory of our proposed model further44. However, this would require more training time and increased GPU memory. The current model has been able to meet our requirements to date.