Advancing theoretical understanding and practical performance of signal processing for nonlinear optical communications through machine learning

In long-haul optical communication systems, compensating nonlinear effects through digital signal processing (DSP) is difficult due to intractable interactions between Kerr nonlinearity, chromatic dispersion (CD) and amplified spontaneous emission (ASE) noise from inline amplifiers. Optimizing the standard digital back propagation (DBP) as a deep neural network (DNN) with interleaving linear and nonlinear operations for fiber nonlinearity compensation was shown to improve transmission performance in idealized simulation environments. Here, we extend such concepts to practical single-channel and polarization division multiplexed wavelength division multiplexed experiments. We show improved performance compared to state-of-the-art DSP algorithms. Additionally, the optimized DNN-based DBP parameters exhibit a mathematical structure that guides us to further analyze the noise statistics of fiber nonlinearity compensation. This machine learning-inspired analysis reveals that ASE noise and incomplete CD compensation of the Kerr nonlinear term produce extra distortions that accumulate along the DBP stages. Therefore, the best DSP should balance between suppressing these distortions and inverting the fiber propagation effects, and this trade-off shifts across different DBP stages in a quantifiable manner. Instead of the common ‘black-box’ approach to intractable problems, our work shows how machine learning can be a complementary tool to human analytical thinking and help advance theoretical understanding in disciplines such as optics.

Optical communications are the backbone of all forms of information technology infrastructure in our modern society. As global Internet traffic grows by 60% per year [1], research breakthroughs in optical communication speeds are much needed to meet future connectivity demands. Fiber Kerr nonlinearity has long been the fundamental bottleneck for long-haul optical communications. While signal propagation dynamics in nonlinear optical fibers are well known and governed by the nonlinear Schrödinger equation (NLSE), the interactions between fiber nonlinearity-induced self-phase modulation (SPM), CD, and inline optical amplifier noise prove to be very difficult to statistically characterize and compensate. The best and most common DSP to date for fiber nonlinearity compensation is the class of digital back-propagation (DBP) [2] algorithms and its variants [3-5], which provide reasonable performance gain (measured as the improvement in bit error ratio (BER), or the corresponding quality (Q) factor, relative to the case without nonlinearity compensation) in single-channel single-polarization systems. In practical polarization-division-multiplexed (PDM) wavelength-division-multiplexed (WDM) environments, however, single-channel DBP algorithms show negligible improvements due to interchannel cross-phase modulation (XPM) effects [6,7]. In addition, as different wavelength channels might have traveled through different links in a mesh network before arriving at the same receiver, interchannel nonlinear effects cannot be fully extracted from the received signals. This significantly reduces the effectiveness of any joint-channel DBP that includes all the signals from neighboring channels and attempts to compensate both SPM and XPM effects. Consequently, DBP is not widely deployed in commercial transceivers at present.
Fundamentally, the interplay between CD, SPM, and amplified spontaneous emission (ASE) noise from inline amplifiers renders the system intractable and makes it difficult to derive optimal DSP algorithms.
Machine learning (ML) has recently gained a lot of attention as a powerful tool in science and engineering problems that are virtually impossible to formulate explicitly. In optics, ML has been applied to enhance the resolution of microscopy [8], identify anthrax spores through quantitative phase imaging (QPI) [9], and predict Internet network traffic [10]. ML has also been applied to fiber nonlinearity compensation in optical communications. Various ML techniques, such as expectation maximization (EM) [11], support vector machines (SVM) [12,13], and message-passing algorithms [5], were studied, but they show meaningful gains only for dispersion-managed links or OFDM signals, neither of which is a default choice of technology in current long-haul digital coherent systems. For single-carrier systems, Kamalov et al. [14] conducted a field-trial demonstration using neural networks with information symbol triplets as inputs, but the performance is inferior to standard DBP. On the other hand, Häger and Pfister [15][16][17] considered the linear and nonlinear steps of DBP as a deep neural network (DNN) and presented preliminary simulation studies for single-channel single-polarization systems. However, practical transmission impairments, such as laser-phase noise, laser-frequency offsets, polarization, and WDM effects, have not been studied. In addition, many ML applications in optical communications are impressive "black-box" data-driven models with unparalleled performance, but they contribute little additional insight into the problem concerned.
Here, we show a unique example of how ML can not only produce readily implementable algorithms advancing system performance, but also complement analytical thinking to develop deeper mathematical insights into fiber nonlinearity compensation. We first demonstrate performance gain in realistic experimental settings using the deep neural network-based digital backpropagation (DNN-based DBP) architecture [15][16][17]. To do so, we take into account the time series and dispersive nature of the received signals, and appropriately integrate the DNN with other essential non-ML DSP blocks. We show that a low-complexity implementation of such DNN-based DBP demonstrates a 0.9-dB Q-factor gain compared with optimal DBP performance with arbitrary complexity for single-channel 28-GBaud 16-QAM systems. We further extend DNN-based DBP to PDM and WDM systems with dynamic polarization-state estimation. Low-complexity DNN-based DBP demonstrates a Q-factor gain of 0.6 dB and 0.25 dB over arbitrarily complex DBP for single-channel PDM and WDM PDM 28-GBaud 16-QAM systems, respectively. It should be emphasized that the DNN-based DBP is a single-channel DSP algorithm that provides performance gain in WDM environments, which serves as a key stepping stone toward practically implementable nonlinear compensation algorithms in a WDM environment. In addition, the optimized DNN-based DBP configurations reveal subtle mathematical structures that guide us to analyze the interplay between CD, nonlinearity, and noise. Such machine-learning-inspired analysis leads to a deeper insight that the optimal DSP should balance between compensating transmission impairments and additional distortions generated by the DBP itself. This is in contrast with typical ML applications in optical communications, which propose high-performance algorithms and bypass the need to further analyze the system at hand.
The work is an example of the emerging area of interpretable machine learning 18 in the ML community where qualitative and human-understandable insights are gained from examining optimized ML configurations, which in turn help advance the theoretical understandings of the field concerned.

Results
Digital back propagation. Let $E(z,t)$ at $z = 0$ be the electric field of a signal at the transmitter. In the simplest form, signal propagation in optical fibers is described by the stochastic scalar nonlinear Schrödinger equation (NLSE)

$$\frac{\partial E(z,t)}{\partial z} = (D + N)\,E(z,t) + n(z,t), \qquad D = -\frac{\alpha}{2} - \frac{j\beta_2}{2}\frac{\partial^2}{\partial t^2}, \qquad N = j\gamma\,|E(z,t)|^2,$$

where $n(z,t)$ is the distributed ASE noise from inline amplifiers, which can be modeled as additive white Gaussian noise (AWGN) with zero mean and autocorrelation $\mathbb{E}[n(z,t)\,n^*(z',t')] = \sigma^2\,\delta(t - t')\,\delta(z - z')$, where $\delta$ is the Dirac delta function. In this formulation, $D$ and $N$ are the linear and nonlinear operators, respectively, and $\alpha$, $\beta_2$, and $\gamma$ denote the attenuation, group velocity dispersion, and fiber nonlinearity coefficients.
Signal propagation in optical fibers can be numerically simulated using the split-step Fourier method (SSFM), which interleaves the effect of CD/loss and nonlinear-phase rotation over a small length $\Delta z$ of fiber. By separately applying the effect of CD/loss and nonlinearity to the signal, each step is analytically tractable. With digital coherent receivers, DBP is the standard technique to compensate fiber nonlinearity and is shown in Fig. 1. The DBP algorithm is a cascade of CD compensation filters $D^{-1}$ with transfer function $H(\omega) = e^{-j\omega^2\beta_2 L_s/2}$, where $L_s$ is the step size of the DBP. The nonlinear-phase derotation operation $N^{-1}$ is defined by $\sigma_k(x) = x\,e^{-j\gamma L_{\mathrm{eff}}\,\xi_k |x|^2}$, where $L_{\mathrm{eff}} = (1 - e^{-\alpha L_s})/\alpha$ is the effective length and $\xi_k$ is a scaling factor. The operators $D^{-1}$ and $N^{-1}$ attempt to undo the linear and nonlinear effects of the NLSE during fiber transmission.
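The two DBP operators described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names, the fiber parameter values, and the random test waveform are assumptions, and per-span power rescaling (amplifier gain) is deliberately left out.

```python
import numpy as np

def cd_compensation(x, beta2, step_len, sample_rate):
    """Linear step D^-1: multiply the spectrum by H(w) = exp(-j w^2 beta2 L_s / 2)."""
    omega = 2 * np.pi * np.fft.fftfreq(x.size, d=1.0 / sample_rate)
    h = np.exp(-1j * omega**2 * beta2 * step_len / 2)
    return np.fft.ifft(np.fft.fft(x) * h)

def nonlinear_derotation(x, gamma, alpha, step_len, xi):
    """Nonlinear step N^-1: x * exp(-j gamma L_eff xi |x|^2)."""
    l_eff = (1 - np.exp(-alpha * step_len)) / alpha
    return x * np.exp(-1j * gamma * l_eff * xi * np.abs(x) ** 2)

def dbp(x, n_steps, beta2, gamma, alpha, step_len, sample_rate, xi):
    """Cascade of n_steps (D^-1, N^-1) pairs using the same xi at every step."""
    for _ in range(n_steps):
        x = cd_compensation(x, beta2, step_len, sample_rate)
        x = nonlinear_derotation(x, gamma, alpha, step_len, xi)
    return x

# Illustrative values only (assumed, roughly standard single-mode fiber).
BETA2 = -21.7e-27        # group velocity dispersion, s^2/m
GAMMA = 1.3e-3           # nonlinearity coefficient, 1/(W m)
ALPHA = 0.2 / 4.343e3    # attenuation, 1/m (0.2 dB/km)

rng = np.random.default_rng(0)
rx = np.sqrt(1e-3 / 2) * (rng.normal(size=1024) + 1j * rng.normal(size=1024))
out = dbp(rx, n_steps=8, beta2=BETA2, gamma=GAMMA, alpha=ALPHA,
          step_len=100e3, sample_rate=56e9, xi=0.7)
```

Both steps are energy preserving (an all-pass filter and a pure phase rotation), which is a convenient sanity check when experimenting with the step size or with $\xi_k$.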
In standard DBP implementations, $\xi_k$ is the same for each stage of the DBP and is empirically optimized using brute-force approaches. The optimal value $\xi_{\mathrm{opt}}$ depends on the noise level and the dispersion map, among other factors. The choice of the step size $L_s$ is a trade-off between complexity and performance, and is well documented in the literature [2]. DBP has since been extended to polarization-multiplexed transmissions [19], stochastic DBP [5], and joint-channel DBP [20] for WDM transmissions, and numerous simplification techniques have been proposed. Unfortunately, as of today, DBP is still relatively complex [3,4,21,22], and for WDM systems, single-channel DBP provides only marginal performance improvements.
With the advent of machine learning, and especially deep learning, in recent years, one can view the familiar DBP algorithm through the lens of deep learning. Specifically, the interleaving linear and nonlinear steps of DBP can be seen as the linear and nonlinear operations of a multilayer neural network as shown in Fig. 2, in which the input is the received signal sample and the output is the estimated symbol sequence [15]. In this case, the nonlinear operation is the nonlinear-phase derotation. For a communication channel with linear effects, such as loss and chromatic dispersion, the linear operators resemble the effect of linear filters, which implies that the weight matrices $W_k$ are Toeplitz matrices. In this case, all the filter taps $s_k = [s_{k1}\; s_{k2}\; s_{k3}\; \cdots]$ and $\xi_k$ become parameters that can be optimized by machine-learning techniques.
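The statement that a linear filtering layer corresponds to a Toeplitz weight matrix can be checked numerically. The sketch below is illustrative only: `toeplitz_from_taps` is a hypothetical helper, and the block sizes are arbitrary small numbers chosen for readability.

```python
import numpy as np

def toeplitz_from_taps(taps, n_in):
    """Build W so that W @ x equals the 'valid' convolution of x with taps."""
    taps = np.asarray(taps, dtype=complex)
    n_taps = taps.size
    n_out = n_in - n_taps + 1
    w = np.zeros((n_out, n_in), dtype=complex)
    for i in range(n_out):
        # Each row holds the reversed taps, shifted one column per output sample,
        # so every diagonal of W is constant (the Toeplitz property).
        w[i, i:i + n_taps] = taps[::-1]
    return w

rng = np.random.default_rng(1)
taps = rng.normal(size=3) + 1j * rng.normal(size=3)
x = rng.normal(size=8) + 1j * rng.normal(size=8)
W = toeplitz_from_taps(taps, x.size)
layer_output = W @ x                        # "neural network" view of the linear step
filter_output = np.convolve(x, taps, mode="valid")  # "FIR filter" view of the same step
```

Because the whole matrix is determined by the tap vector, training the layer means updating only the taps $s_k$, not a full dense weight matrix.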
Training of DNN-based DBP. While the concept of DNN-based DBP has been studied in the literature [15,16], we propose several necessary modifications to enable performance gain in practical transmission experiments using DNN-based DBP. The input to the DNN-based DBP is derived from the coherently detected signal with a sampling rate of 2 samples/symbol. As our random symbols are generated by 25 repetitions of a pseudo random bit sequence (PRBS) with a period of 65,536 symbols, we use the first 32,768 symbols (65,536 samples) for training and 25 copies of the other 32,768 symbols for testing, to avoid repeating training data in the testing set. The testing set contains a total of 819,200 symbols (3,276,800 bits for 16-QAM signals) for BER and Q-factor calculation. We adapt the DNN model to our application by taking into account the time series and dispersive nature of the received signals: we use adjacent input vectors and training mini batches with overlapping signals, and use overlap-and-save [23] in the linear step of our implementation. In particular, we divide the 65,536 training samples into blocks of neural network inputs and jointly tune the input vector size, the linear filter length, and the initial and final learning rates of the AdaBound optimizer (to be described in more detail below). The results show that 121 taps and 128 samples per input, i.e., 512 input vectors in total, are optimal settings across most experimental setups studied in our work. Given the amount of CD-induced pulse broadening in the transmission link, we append 60 neighboring samples to the two ends of each input to appropriately incorporate pulse-broadening effects in the overlap-and-save method, thus resulting in 512 input vectors of length 248 samples into the neural network as shown in Fig. 3. The number of layers in the deep neural network is equivalent to the number of DBP steps.
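The block segmentation described above (65,536 training samples, 128 samples per input, 60 neighboring samples appended on each side) can be sketched as follows. The helper name and the zero padding at the two ends of the record are assumptions made for illustration; the source does not say how the record edges are handled.

```python
import numpy as np

def make_overlapping_inputs(samples, block_len=128, pad=60):
    """Split a record into blocks of block_len samples, each extended by
    `pad` neighboring samples on both sides (overlap-and-save style)."""
    n_blocks = samples.size // block_len
    core = samples[:n_blocks * block_len]
    # Assumption: the record edges are zero padded.
    zeros = np.zeros(pad, dtype=samples.dtype)
    padded = np.concatenate([zeros, core, zeros])
    width = block_len + 2 * pad
    return np.stack([padded[i * block_len:i * block_len + width]
                     for i in range(n_blocks)])

rng = np.random.default_rng(2)
train = rng.normal(size=65536) + 1j * rng.normal(size=65536)
inputs = make_overlapping_inputs(train)   # shape (512, 248)
```

With 121 taps, a "valid" convolution over a 248-sample input returns exactly 128 samples, which is why 60 = (121 - 1)/2 neighbors per side are sufficient for one filtering stage.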
The loss function, or objective function, we chose to optimize is the error vector magnitude (EVM), defined as the mean squared magnitude of the error between the network output and the transmitted symbols, which is closely related to the mean-squared error (MSE). It should be noted that the signal-to-noise ratio (SNR) is essentially proportional to 1/EVM or 1/MSE and can in principle also be used as the loss function. We use EVM/MSE as the cost function as they are common in both the optical communications and machine-learning communities [24,25]. Typically, optimization of neural networks uses real-valued parameters, while communication signals are complex-valued. Therefore, we first convert all computations into real multiplications and additions so that the optimization task is supported by common deep-learning frameworks like TensorFlow [26]. Standard DBP configurations are used as the initializations of the deep neural network, i.e., a CD compensation filter as the linear step with $\xi_k = \xi_{\mathrm{opt}}$ for all $k$. The batch size is optimized to 16. Since the correct overlapping waveform is needed for overlap-and-save in each linear step, we append 6 neighboring inputs before and after each batch of 16, so that the overlapping parts outside the batch are updated along with each batch at each step. This arrangement is similar to having overlapping data in neighboring mini batches. The whole batch of 6 + 16 + 6 = 28 inputs is processed together, while the output of the middle 16 vectors is used to calculate the cost function and update the parameters of the neural network as shown in Fig. 3.
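The conversion of complex-valued operations into real multiplications and additions can be illustrated as below. This is a sketch under assumptions: the function names are hypothetical, the constant `c` stands for the product $\gamma L_{\mathrm{eff}}\,\xi_k$, and the loss is written as a plain mean-squared error on the real and imaginary parts.

```python
import numpy as np

def complex_mul_real(a_re, a_im, b_re, b_im):
    """(a_re + j a_im)(b_re + j b_im) using real multiplications and additions."""
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re

def derotate_real(x_re, x_im, c):
    """Real-valued form of x * exp(-j c |x|^2), with c standing for gamma*L_eff*xi_k."""
    phase = c * (x_re ** 2 + x_im ** 2)
    cos_p, sin_p = np.cos(phase), np.sin(phase)
    return x_re * cos_p + x_im * sin_p, x_im * cos_p - x_re * sin_p

def mse_loss(est_re, est_im, ref_re, ref_im):
    """Mean squared error magnitude between estimate and reference symbols."""
    return np.mean((est_re - ref_re) ** 2 + (est_im - ref_im) ** 2)

rng = np.random.default_rng(3)
x = rng.normal(size=16) + 1j * rng.normal(size=16)
w = rng.normal(size=16) + 1j * rng.normal(size=16)
prod_re, prod_im = complex_mul_real(x.real, x.imag, w.real, w.imag)
rot_re, rot_im = derotate_real(x.real, x.imag, 0.3)
```

Written this way, every parameter seen by the optimizer is real, so automatic differentiation in a standard framework applies without special complex-gradient handling.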
We first pass the input into standard DSP blocks, such as CD compensation or DBP, followed by the laser-frequency-offset compensation (FOE) using the periodogram of the 4th power of the signal, and by carrier-phase estimation (CPE), as shown in Fig. 4. At the testing phase, DBP is replaced by the trained DNN-based DBP and the CPE output is used for BER calculation. For training the neural network, the Adam [29] optimizer is a popular choice, thanks to its rapid convergence compared with stochastic gradient descent (SGD), and it was adopted by Häger and Pfister [15]. However, we observed in our work that the optimized parameter settings using Adam are highly specific to each training dataset and generalize poorly to different experimental setups.

For this reason, we chose the recently proposed AdaBound [30] optimizer, which outperforms Adam in convergence stability, neural network performance, and generalizability to different datasets. This is achieved by specifying the desired initial and final learning rates so that the actual learning rate in the AdaBound optimizer is bounded and transitions smoothly between these two values. In addition, we extended the neural network model of Häger and Pfister [15] to PDM systems, conducted PDM and WDM experiments, appropriately integrated the neural network with non-ML laser and polarization impairment compensation algorithms, and highlight how the optimized neural network configurations in turn help advance theoretical understanding of nonlinear signal-noise interactions during propagation and receiver signal processing, as will be shown next.
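The idea of a bounded, smoothly converging learning rate can be sketched as follows. The bound schedule below follows our understanding of the published AdaBound formulation and should be read as an assumption; the value `gamma=1e-3` and the learning rates are illustrative and are not the settings used in the experiments.

```python
def adabound_bounds(step, final_lr, gamma=1e-3):
    """Lower and upper learning-rate bounds at a given step (step >= 1).
    Both bounds converge to final_lr as the step count grows."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * step))
    return lower, upper

def clipped_step_size(adam_step_size, step, final_lr, gamma=1e-3):
    """Clip an Adam-style per-parameter step size into the current bounds."""
    lower, upper = adabound_bounds(step, final_lr, gamma)
    return min(max(adam_step_size, lower), upper)

# Early in training the bounds are loose (Adam-like behavior) ...
early = adabound_bounds(1, final_lr=0.1)
# ... and late in training they pinch toward final_lr (SGD-like behavior).
late = adabound_bounds(10**7, final_lr=0.1)
```

The practical consequence is that the first updates behave like Adam, while the last updates use an almost constant step size, which is one plausible reason for the more stable convergence reported above.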
DNN-based DBP for single-channel transmissions. Experiments are conducted to determine the effectiveness of DNN-based DBP in practice. The experimental setup is shown in Fig. 5. At the transmitter side, a 92-GSa/s arbitrary waveform generator (AWG) is used to generate 28-GBaud 16-QAM symbols shaped by a squared raised cosine filter with a roll-off factor of 0.2. The electrical waveforms first go through SHF-807 high-bandwidth electrical amplifiers and then drive an I/Q modulator to modulate the optical signal. The modulated optical waveform is amplified and launched into the fiber link. A flat-top optical filter with a 3-dB bandwidth of 4 nm is used in each span to suppress the out-of-band amplified spontaneous emission (ASE) noise and maximize the optical signal-to-noise ratio (OSNR). An NKT Koheras ADJUSTIK fiber laser with a linewidth of around 100 Hz is used at both the transmitter and receiver sides. Since each span length is different, and our erbium-doped fiber amplifiers (EDFAs) have a minimum gain of 20 dB (enough to compensate the loss of approximately 100 km of fiber), we fix all EDFA gains to 20 dB and use the tunable attenuator integrated in each EDFA to ensure proper equalization of the fiber loss incurred over all the spans. After 815-km transmission and polarization alignment between the signal and the local oscillator (LO) using a polarization controller, the optical waveform is coherently detected and sampled by an 80-GSa/s digital oscilloscope with 33-GHz electrical bandwidth. The sampled signals are then processed by offline DSP whose structure is shown in Fig. 5. The received signal is first resampled to two samples/symbol and digitally filtered to remove out-of-band noise before DBP/DNN-based DBP is applied. The Constant Modulus Algorithm (CMA) is used to compensate the residual linear distortion, followed by downsampling to one sample/symbol and CPE for laser-phase noise compensation.
Figure 6a plots the comparison of Q factors (calculated from the bit error ratio (BER) through $Q = 20\log_{10}\left(\sqrt{2}\,\mathrm{erfc}^{-1}(2\,\mathrm{BER})\right)$) as a function of signal-launched power using DNN-based DBP and DBP with different step sizes, together with CD compensation (CDC) only. We consider 50 steps-per-span (StPS)-DBP as the benchmark for optimal DBP performance with arbitrary complexity, as no further improvements are obtained beyond that.

Fig. 6 (caption, partial): While the optimized phase responses are close to quadratic, resembling CD compensation, the amplitude spectra are "M"-shaped. The optimized $\xi_k$ is not the same for each step, and exhibits a "U"-shaped structure. Source data are provided as a Source Data file.

The amplitude responses exhibit an "M"-shaped feature that becomes more apparent at later stages of the DNN-based DBP. Furthermore, the optimal $\xi_k$ learnt by machine learning are not constant across $k$ and are not equal to $\xi_{\mathrm{opt}}$ in general. Rather, the $\xi_k$ have a "U"-shaped structure so that the nonlinear-phase derotation is larger in the middle stages. Finally, it should be noted that similar performance gains and features are obtained when using external cavity lasers (ECL) with a linewidth of ~100 kHz at the transmitter and receiver by modifying the cost function to that of radius-directed equalization (RDE) [31].
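The BER-to-Q conversion quoted above can be evaluated with the Python standard library alone, using the identity $\sqrt{2}\,\mathrm{erfc}^{-1}(2\,\mathrm{BER}) = -\Phi^{-1}(\mathrm{BER})$, where $\Phi$ is the standard normal cumulative distribution function. The function name is an illustrative choice, and the sketch assumes 0 < BER < 0.5.

```python
import math
from statistics import NormalDist

def q_factor_db(ber):
    """Q factor in dB from BER: 20*log10(sqrt(2) * erfcinv(2*BER)).
    Valid for 0 < ber < 0.5."""
    # sqrt(2) * erfcinv(2 * ber) equals -inv_cdf(ber) for the standard normal.
    q_linear = -NormalDist().inv_cdf(ber)
    return 20 * math.log10(q_linear)

# A BER corresponding exactly to a linear Q of 3 (the Gaussian tail beyond 3 sigma).
ber_at_q3 = NormalDist().cdf(-3.0)
q_db = q_factor_db(ber_at_q3)   # 20*log10(3), about 9.54 dB
```

Because the mapping is steep at low BER, a Q-factor gain of a few tenths of a dB can correspond to a sizeable relative change in BER.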

Mathematical insights into fiber nonlinearity compensation as interpretable ML. Why are the amplitude responses of the linear filter "M"-shaped and why is the optimal $\xi_k$ "U"-shaped? In most applications, ML models are "black boxes" with excellent predictive power, but it is difficult to determine exactly how they work and why they produce such good results. However, in our case, the "M"-shape and "U"-shape features are clear mathematical structures that strongly suggest certain hidden dynamics of DBP yet to be better understood. Specifically, the "M"-shaped filter indicates that the optimal linear filter exhibits some high-pass feature that becomes more pronounced at later stages of the DNN-based DBP. A plausible explanation for this phenomenon is that there exists an additional undesired term with a "∩"-shaped spectrum that grows with the DNN-based DBP stages and that the "M"-shaped filter tries to compensate. On the other hand, the "U"-shaped $\xi_k$ suggests that the nonlinear-phase derotation is small at the beginning and toward the end of the DBP, but can be larger in the middle stages. This may imply that the beginning and end stages of DBP are more prone to noise and distortions, and hence phase derotation based on instantaneous signal power may not be effective. The overall optimized DNN-based DBP configuration from ML seems to suggest that noise and distortion accumulations play a hidden yet pivotal role in the nature of nonlinear DSP and the effectiveness of DNN-based DBP. We proceed to analyze noise and distortion accumulations in DBP to try to explain the optimal DNN-based DBP configurations discovered by machine learning. For simplicity, we drop the variable $t$, abbreviate $E(z,t)$ as $E_z$, and let the step size $L_s = \Delta z$ in the following analysis. Referring to Fig. 1, the received signal is given by

$$E_L = \mathrm{CD}_{\Delta z}\!\left(E_{L-\Delta z}\, e^{j\gamma\Delta z |E_{L-\Delta z}|^2}\right) + n_L ,$$

so that the first CD compensation step of DBP yields

$$B_{L-\Delta z} = \mathrm{CD}_{-\Delta z}(E_L) = E_{L-\Delta z}\, e^{j\gamma\Delta z |E_{L-\Delta z}|^2} + n'_L ,$$

where $\mathrm{CD}_{\Delta z}(\cdot)$ denotes the effect of CD on a signal over a distance $\Delta z$ and $n'_L = \mathrm{CD}_{-\Delta z}(n_L)$.
This is followed by a phase derotation proportional to $\xi_{L-\Delta z}|B_{L-\Delta z}|^2$. Note that, as opposed to $\xi_k$, which refers to the $k$th step of the DNN-based DBP in the previous section, $\xi_{L-\Delta z}$ here relates to the phase derotation used to estimate the transmitted signal at $z = L - \Delta z$, which is given by

$$\hat{E}_{L-\Delta z} = B_{L-\Delta z}\, e^{-j\gamma\Delta z\,\xi_{L-\Delta z}|B_{L-\Delta z}|^2},$$

and whose expansion involves the noise terms $n''_L = n'_L\, e^{-j\gamma\Delta z|E_{L-\Delta z}|^2}$ and $n'''_L = n'_L\, e^{j\gamma\Delta z\,\xi_{L-\Delta z}\left|E_{L-\Delta z}e^{j\gamma\Delta z|E_{L-\Delta z}|^2} + n'_L\right|^2}$. Since we are interested only in the statistical properties of the estimated transmitted signal, and $n_L$, $n'_L$, $n''_L$, and $n'''_L$ are all AWGN processes with the same statistical distributions, we denote all of them as $n_L$ for simplicity. In this case, the extra distortions arising from fiber nonlinearity and its compensation are $|E_{L-\Delta z}|^2E_{L-\Delta z}$ and $\Re\{E_{L-\Delta z}\,n_L^*\}E_{L-\Delta z}$. Note that for typical pulse shapes, $|E_{L-\Delta z}|^2E_{L-\Delta z}$ will have a "∩"-shaped spectrum due to the triple convolution of the original pulse's spectrum. Also, the amplifier noise $n_L$ at $z = L$ is brought backward to the signal estimate $\hat{E}_{L-\Delta z}$ at $z = L - \Delta z$, so that the accumulated noise after the next CD compensation step is $v_{\Delta z} = \mathrm{CD}_{-\Delta z}(n_L + n_{L-\Delta z})$. Note that the statistics of $v_{\Delta z}$ do not depend on the CD operation, and we also denote $v_0 = n_L$ for notational convenience. The signal estimate $\hat{E}_{L-2\Delta z}$ at $z = L - 2\Delta z$ follows in the same way, and a mathematical pattern emerges from Eq. (6). Continuing with the above derivation yields the DBP estimate $\hat{E}_0$ of the transmitted signal after $K$ steps, where $K$ is the total number of DBP steps. As $K \to \infty$ and $\Delta z \to 0$ for a distributed amplification system with arbitrarily complex DBP, the estimate takes an integral form. In Eq. (8), $v_L$ corresponds to the total ASE noise in the system, and the term $j\gamma\int_0^L \mathrm{CD}_{-z}\!\left(|E_z|^2E_z\right)dz$ corresponds to the nonlinear-phase shift due to the signal and is largely compensated by carrier-phase estimation in practical systems. The other terms, $j\gamma\int_0^L \xi_z\,\mathrm{CD}_{-z}\!\left(|E_z|^2E_z\right)dz$ and $j\gamma\int_0^L 2\xi_z\,\mathrm{CD}_{-z}\!\left(\Re\{E_z v^*_{L-z}\}E_z\right)dz$, are the major nonlinear impairments that degrade transmission performance.
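The origin of the two distortion terms can be made explicit with a first-order expansion of a single DBP step. The following is an illustrative sketch, not necessarily the exact form of the numbered equations in the original derivation. It assumes a lossless step with $\gamma\Delta z|E|^2 \ll 1$, takes $B_{L-\Delta z}$ to be the signal after the first CD compensation, drops terms that are second order in the noise, and absorbs terms that only rotate the phase of the noise into $n_L$, since they do not change its statistics.

```latex
% One DBP step, expanded to first order in \gamma\Delta z (illustrative sketch).
\begin{align}
B_{L-\Delta z} &= e^{j\gamma\Delta z |E_{L-\Delta z}|^2}
                  \left(E_{L-\Delta z} + n''_L\right), \\
|B_{L-\Delta z}|^2 &= |E_{L-\Delta z}|^2
                  + 2\Re\{E_{L-\Delta z}\,(n''_L)^{*}\} + |n''_L|^2, \\
\hat{E}_{L-\Delta z} &= B_{L-\Delta z}\,
                  e^{-j\gamma\Delta z\,\xi_{L-\Delta z}|B_{L-\Delta z}|^2} \\
 &\approx E_{L-\Delta z} + n_L
   + j\gamma\Delta z\,(1-\xi_{L-\Delta z})\,|E_{L-\Delta z}|^2 E_{L-\Delta z}
   - 2j\gamma\Delta z\,\xi_{L-\Delta z}\,
     \Re\{E_{L-\Delta z}\, n_L^{*}\}\,E_{L-\Delta z}.
\end{align}
```

In this approximation, $\xi_{L-\Delta z} = 1$ removes the residual $|E|^2E$ term but maximizes the signal-noise term, while $\xi_{L-\Delta z} = 0$ does the opposite, which is the trade-off that the scaling factor has to balance at every stage.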
As the variances of the noises and distortions are typically used to characterize overall system performance, Fig. 7 shows the simulated variances of $\mathrm{CD}_{-z}\!\left(|E_z|^2E_z\right)$ and $2\,\mathrm{CD}_{-z}\!\left(\Re\{E_z v^*_{L-z}\}E_z\right)$ as a function of $z$, and it can be seen that one of them increases with $z$, while the other decreases with $z$. As $\xi_z$ controls the relative strengths of $\mathrm{CD}_{-z}\!\left(|E_z|^2E_z\right)$ and $2\,\mathrm{CD}_{-z}\!\left(\Re\{E_z v^*_{L-z}\}E_z\right)$ in the overall nonlinear distortions in $\hat{E}_0$, we can now appreciate why $|\xi_z|$ is smaller at the beginning and end of the DNN-based DBP stages and larger in the middle, as shown in Fig. 6d. This can be intuitively interpreted by noting that
1. The nonlinear-phase derotation at each DBP stage is not perfect due to ASE noise, and such imperfections accumulate throughout the whole DBP chain. Imperfections at the early DBP stages (corresponding to $\xi_z$ for $z \sim L$) accumulate the most in the final signal estimate $\hat{E}_0$, and therefore $\xi_z$ should be small for $z \sim L$ to minimize such accumulation.
2. Toward the end of the DBP, the signal amplitude is already heavily distorted and quite different from the original signal due to noise and the accumulation of imperfect compensation from preceding DBP stages. Therefore, the phase derotation at the end of the DBP stages (corresponding to $\xi_z$ for $z \sim 0$) will not be accurate, and hence $\xi_z$ should be small for $z \sim 0$ to prevent overcompensation, which would in turn produce additional distortions.
In addition, since the imperfect phase derotations cannot completely eliminate the nonlinear distortion term $|E_z|^2E_z$, the distortions continue to grow within the DBP stages and accumulate at the end of the algorithm. As $|E_z|^2E_z$ has a "∩"-shaped spectrum, a filter with an inverted spectral shape that partially equalizes the distortions will be more beneficial than the pure CD compensation operation in Eq. (8). Furthermore, since $|E_z|^2E_z$ has three times the bandwidth of $E_z$, one should simply filter out the out-of-band distortions at each DBP stage. This explains why the overall linear filter of DNN-based DBP exhibits the "M"-shaped features depicted in Fig. 6b, and why the shape is more apparent toward the later stages of DNN-based DBP.
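The two spectral properties of $|E_z|^2E_z$ used in this argument (a peaked, "∩"-shaped spectrum and three times the bandwidth of $E_z$) can be reproduced numerically. The sketch below assumes an idealized band-limited signal with a flat spectrum over 41 frequency bins; the helper name and all sizes are illustrative choices.

```python
import numpy as np

def half_bandwidth_bins(x, rel_threshold=1e-9):
    """Largest |frequency bin| whose spectral magnitude exceeds the threshold."""
    spec = np.abs(np.fft.fft(x))
    bins = np.fft.fftfreq(x.size, d=1.0 / x.size)   # signed integer bin indices
    return np.max(np.abs(bins[spec > rel_threshold * spec.max()]))

n, half_bw = 1024, 20
spectrum = np.zeros(n, dtype=complex)
spectrum[np.arange(-half_bw, half_bw + 1)] = 1.0    # flat spectrum on bins -20..20
e = np.fft.ifft(spectrum)                           # band-limited "signal"
cubic = np.abs(e) ** 2 * e                          # third-order distortion term
cubic_spec = np.abs(np.fft.fft(cubic))

signal_bw = half_bandwidth_bins(e)       # 20 bins
distortion_bw = half_bandwidth_bins(cubic)  # 60 bins, i.e. three times wider
```

The distortion spectrum is largest at the band center and decays toward the band edge, so a filter that attenuates the center relative to the edges and rejects everything outside the signal band is a plausible counterpart of the "M"-shaped response.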
The new insights illustrate that the optimal linear filter of DNN-based DBP does not merely equalize CD. It is in fact a trade-off between compensating the CD of the signal and mitigating the third-order nonlinear distortion term $|E_z|^2E_z$ and its accumulation along the DBP stages. Similarly, the optimal nonlinear-phase derotation attempts to strike a balance between reversing the nonlinear phase acquired during propagation and minimizing additional phase noise accumulation along the DBP stages due to corrupted signal power levels. Overall, machine learning reveals that the original design philosophy of DBP as an iterative linear and nonlinear compensation of fiber propagation effects is not complete. Rather, the optimal DSP should undo the nonlinear channel effects, as well as manage additional distortions accumulated within the DSP itself. It should be emphasized that the new insights are inspired by the ML-optimized configurations depicted in Fig. 6. In our work, ML provided directions for theoreticians to work toward deeper analytical insights, which in turn validate the consistency and reliability of the DNN-based DBP, as we are able to interpret the features with concrete mathematical arguments.
NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17516-7 ARTICLE

DNN-based DBP for PDM and WDM transmissions. For PDM transmissions with vector input $\mathbf{E}(z,t) = [E_x(z,t)\;\; E_y(z,t)]^T$, the vector Manakov-PMD equation governs signal propagation. In this formulation, $\Psi = U^{-1}(z)\mathbf{E}$, where $U(z)$ models the random principal states of polarization (PSP) rotation that evolves with $z$. $\beta_0$ and $\beta_1$ are 2 × 2 matrices modeling birefringence (difference in refractive index of signals in the two PSPs) and polarization-mode dispersion (PMD) (difference in the group delay, induced by the group refractive index, of signals in the two PSPs). Random polarization rotations and PMD lead to polarization-dependent nonlinear interactions. Although their effects in nonlinear fiber propagation are well characterized analytically, their unknown and random nature significantly reduces the effectiveness of DBP [19,32]. To address this, the proposed DNN-based DBP framework can be extended to roughly estimate and compensate the polarization rotations at each DBP stage. In particular, we can express the DNN-based DBP in a vector form and append a polarization rotation matrix

$$R_k = \begin{pmatrix} \cos\theta_k\cos\phi_k - j\sin\theta_k\sin\phi_k & -\sin\theta_k\cos\phi_k + j\cos\theta_k\sin\phi_k \\ \sin\theta_k\cos\phi_k + j\cos\theta_k\sin\phi_k & \cos\theta_k\cos\phi_k + j\sin\theta_k\sin\phi_k \end{pmatrix}$$

at the $k$th stage along with the linear and nonlinear operators in each DNN-based DBP step as shown in Fig. 8. Note that $R_k$ is optimized from training data without any prior knowledge of the link PSP. In this case, the nonlinear step $\sigma_k(\cdot)$ will consist of two optimization parameters $\xi_{xx,k} = \xi_{yy,k}$ and $\xi_{xy,k} = \xi_{yx,k}$.

Figure 9a compares the Q factors as a function of signal-launched power for 28-GBaud PDM 16-QAM transmissions over 815 km using CDC only, DNN-based DBP, and DBP with different step sizes. PDM is achieved in experimental settings through a PDM emulator that splits the original signal into two polarizations with equal energy, delays one of the signals to generate delayed signal copies, and recombines them. The DNN-based DBP here is trained on the PDM data. It can be seen from Fig. 9a that 1-StPS DBP can only provide a small gain of 0.7 dB compared with CDC only, and 50-StPS DBP can provide a further gain of 0.5 dB over 1-StPS DBP. However, a mere 1-StPS DNN-based DBP can already produce an extra 0.6-dB gain over 50-StPS DBP, which shows that DNN-based DBP can not only save complexity but also improve transmission performance in polarization-multiplexed systems. The 2-StPS DNN-based DBP can further improve the performance by 0.6 dB over 1-StPS DNN-based DBP. The amplitude spectra and optimized $\xi_{xx,k}$ shown in Fig. 9b, c exhibit the "M"-shaped and "U"-shaped features discussed previously. The phase spectra have the same quadratic shape as the CD compensation filter and are not shown here. The optimized angles $\theta_k$ and $\phi_k$ shown in Fig. 9d do not exhibit any mathematical structures or trends, in agreement with theoretical expectations. It is clear that including $R_k$ in DNN-based DBP provides new dimensions of optimization, and hence DNN-based DBP can overcome the limitations of traditional DBP by compensating polarization-dependent nonlinear effects.
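The polarization rotation matrix $R_k$ appended at each stage can be constructed and checked numerically. The sketch below reads the four entries row by row from the expression for $R_k$; the function name is an illustrative choice, and the angle values are arbitrary. Whatever values the two trainable angles take, the matrix stays unitary with unit determinant, so it changes the polarization state without changing the signal power.

```python
import numpy as np

def rotation_matrix(theta, phi):
    """2x2 polarization rotation matrix R_k parameterized by two angles."""
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(phi), np.sin(phi)
    return np.array([
        [ct * cp - 1j * st * sp, -st * cp + 1j * ct * sp],
        [st * cp + 1j * ct * sp,  ct * cp + 1j * st * sp],
    ])

R = rotation_matrix(theta=0.4, phi=-1.1)
field = np.array([1.0 + 0.5j, -0.3 + 0.2j])   # [E_x, E_y] at one time sample
rotated = R @ field
```

Parameterizing the stage by two angles rather than by four free complex entries keeps the layer lossless by construction, which is one reasonable way to add polarization tracking without letting the optimizer rescale the signal.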
Finally, we investigate DNN-based DBP performance for PDM 16-QAM WDM transmissions with different baud rates. The 5-channel 50-GHz-spaced WDM transmission setup is shown in Fig. 10. The low-linewidth fiber laser is combined with two external cavity lasers (ECL) of around 100-kHz linewidth to generate the odd-numbered channels, while two other ECLs produce the even-numbered channels. The odd and even channels are modulated by two IQ modulators driven by two independent random 16-QAM sequences with baud rates of 28 GBd or 34 GBd. The roll-off factor is 0.2. The signals from the five channels are combined and fed into the PDM emulator to generate delayed signal copies for polarization multiplexing. The link configuration is the same as in the single-channel experiments described previously. At the receiver, the center channel is filtered by a wavelength-selective switch (WSS) with a 3-dB bandwidth of 40 GHz and sampled by an 80-GSa/s sampling scope, followed by offline processing.
The Q factor of the center channel for the five-channel system is shown in Fig. 11a for 28-GBaud transmissions. The DNN-based DBP here is trained on the WDM data, and 262,144 bits are used in the testing set to calculate the BER and Q factor. The performance gain of 1-StPS DBP and 50-StPS DBP over CDC only is around 0.3 dB and 0.6 dB, respectively. On the other hand, 1-StPS and 2-StPS DNN-based DBP provide an additional gain of 0.25 dB and 0.45 dB over 50-StPS DBP and a total gain of 0.85 dB and 1 dB over CDC only, respectively. For 34-GBaud transmissions, the gain of 1-StPS DNN-based DBP over 1-StPS DBP and CDC only is around 0.2 dB and 0.4 dB, respectively. 2-StPS DNN-based DBP further improves the optimal Q factor of 1-StPS DNN-based DBP by 0.15 dB. The amplitude spectra and optimized $\xi_{xx,k}$ shown in Fig. 11c-f display the same "M"-shaped and "U"-shaped features as expected from the analytical insights developed through machine learning. Overall, the results show that DNN-based DBP represents a new design dimension in single-channel DSP algorithms for nonlinearity compensation in WDM systems without compromising computational complexity. It serves as a crucial step forward in improving practical WDM transmission performance.

Discussion
In this paper, we experimentally demonstrate that by mapping the well-known digital back-propagation algorithm onto the interleaving linear and nonlinear operators of a deep neural network, machine-learning techniques can optimize the network parameters and lead to dramatic performance improvements and computational savings. Applying DNN-based DBP to PDM-WDM systems yields sizeable performance improvements compared with other single-channel DSP algorithms, thus achieving a key step in bringing nonlinearity compensation DSP into realistic WDM systems. More importantly, the optimal parameter configurations in turn guided us to analyze the interplay between CD, nonlinearity, and noise more closely, and led to the deeper theoretical insight that the receiver DSP should not exactly invert the linear and nonlinear steps of the fiber propagation effects. Rather, it should balance between compensating transmission impairments and suppressing the additional distortions arising from imperfect linear and nonlinear compensation steps due to inline amplifier noise. As the new analytical insights are inspired by the optimized neural network model, our work shows that machine learning can go beyond conventional thinking to develop deeper theoretical understanding in the field of nonlinear fiber transmissions, in addition to providing algorithms exceeding state-of-the-art performance. Our work serves as an example that machine-learning techniques can not only provide a detour from intractable systems and arrive at intelligent solutions or strategies, but also help elucidate the path toward deeper physical insights and the underlying mathematical structures.

Data availability
The data files are available from the corresponding author upon reasonable request. Source data are provided with this paper.

Code availability
Example codes for the training and testing of the deep neural networks used in the paper are available from the corresponding author upon reasonable request. Source data are provided with this paper.