Neural networks for computing and denoising the continuous nonlinear Fourier spectrum in focusing nonlinear Schrödinger equation

We combine the nonlinear Fourier transform (NFT) signal processing with machine learning methods for solving the direct spectral problem associated with the nonlinear Schrödinger equation. The latter is one of the core nonlinear science models emerging in a range of applications. Our focus is on the unexplored problem of computing the continuous nonlinear Fourier spectrum associated with decaying profiles, using a specially-structured deep neural network which we coined NFT-Net. The Bayesian optimisation is utilised to find the optimal neural network architecture. The benefits of using the NFT-Net as compared to the conventional numerical NFT methods becomes evident when we deal with noise-corrupted signals, where the neural networks-based processing results in effective noise suppression. This advantage becomes more pronounced when the noise level is sufficiently high, and we train the neural network on the noise-corrupted field profiles. The maximum restoration quality corresponds to the case where the signal-to-noise ratio of the training data coincides with that of the validation signals. Finally, we also demonstrate that the NFT b-coefficient important for optical communication applications can be recovered with high accuracy and denoised by the neural network with the same architecture.

(1) i ∂q ∂z + 1 2 ∂ 2 q ∂t 2 + |q| 2 q = 0, www.nature.com/scientificreports/ disturbances 32,33 , imposing limits on the NFDM transmission quality. Thus, in our current work, we analyse the capability of a neural network (NN) to denoise the NF spectra. Similarly to Ref. 34 , in optical transmission applications, the NFT-Net that we consider in this work, is supposed to be integrated into the receiver architecture: it takes in the corrupted signal and yields the "purified" nonlinear spectrum containing the modulated data. Another widespread deviation from idealised model (1) is the non-zero nonuniform gain-loss profile occurring in realistic systems for both lumped 18,22 and distributed 19 amplification schemes. We also mention the effects of polarisation mode dispersion 35,36 , higher-order chromatic dispersion 35 , and component-induced impairments, to itemise just several important sources. All these effects bring about the deviations of the true optical channel from integrable NLSE (1) such that the NF spectrum of the signal at the end of our transmission system can be significantly distorted, which results in the appearance of errors in the transmitted data 20,21,35 . Given that, the machine learning and artificial neural networks (NN) based signal processing methods have recently attracted much attention, as they can effectively render adaptive distortions-resilient signal processing tools, and, thus, using the NNs we can mitigate the impact of detrimental factors mentioned above 37,38 . The first direction in utilising the NNs for NFDM systems consists in applying the additional NN-based processing unit at the receiver to compensate the emerging line impairments and deviations from the ideal model [39][40][41][42][43] . But, despite ensuing transmission quality improvement, this type of NN usage brings about the additional complexity of the receiver. In the other approach, the NFT operation at the receiver is entirely replaced by the NN element. It has been shown that this approach, indeed, results in a considerable improvement of the NFT-based transmission system functioning 31,34,44 . But, despite the benefits rendered by such a NN utilisation, the NNs emulating the NFT operation have so far been mostly used in the NFDM systems operating with solitons only, and the NN structure used there was relatively simple. In the only work related to the continuous NF spectrum recovery 45 , a standard "imageInputLayer" NN (developed originally for hand-written digits recognition) from MATLAB 2019a deep learning toolbox was adapted to process the signals of a special form. Such an approach, evidently, has limited applicability and flexibility and is not optimal neither in terms of the result's quality nor in the complexity of signal processing. In our current work, we demonstrate how this direction can be significantly extended and optimised, presenting and analysing the NN-based NFT modelling for the continuous NF spectrum, and using the special optimisation tools for finding the best NN architecture. We believe that our current research can lay the basis for the development of high-efficiency channel-agnostic NFDM transmission systems. Moreover, in our study, we address the question of recovering not only the NF spectrum r(ξ ) , but also the b-coefficient, so it can be combined with the most efficient NFDM transmission method: the b-modulation.
Finally, we note that, recently, the interest in using the NFT as a signal-processing tool has risen in fields that are not directly relevant to optical transmission. In particular, the NFT was applied in the so-called integrable turbulence to monitor the appearance of coherent structures, such as breathers, solitons, and rogue waves 46,47 , to the optical microresonators regime analysis 48 , to the optical frequency combs characterisation 49 , and to the analysis of laser regimes and the emergence of dissipative coherent nonlinear structures [50][51][52] . The analysis of NFT modes' evolution for such systems often appears to be more informative and convenient than dealing with the conventional Fourier modes. The NFT is also an important tool for the design of fibre Bragg gratings 53,54 . Thus, we believe that the technique presented in this work can have a much wider range of applications than simply being a processing tool in optical communications. To end up, solving nonlinear differential equations itself by using NNs is a fast-growing area with a range of applications in science and engineering [55][56][57] . We hope that our work will also advance knowledge in this emerging field.

Results
In this section, we describe the main results obtained in the process of finding a suitable NN architecture for computing the continuous NF spectrum of a given signal. First, we describe which type of signals we used in training and testing. Next, we discuss the Bayesian optimisation application for our finding the best-performing NN architecture and the respective training procedure. Then, we analyse the output accuracy for the proposed NN architecture and compare it with that produced by a deterministic NFT numerical algorithm. In this paper, for the data generation and "conventional" computation we use the Fast NFT (FNFT) library 58 . At the end of the Results section, we show that the proposed NN architecture can predict not only the scattering coefficient r(ξ ) , but also the NF coefficient b(ξ ) , Eq. (8).
Training data generation. In this work, without loss of generality, we analyse the NF decomposition of the signals having the form of wavelength division multiplexing (WDM) format with random modulation and return-to-zero carrier functions, considered in 59,60 . In the time domain, one (normalised) WDM symbol to decompose is given as the sum of independent subcarriers: where M is a number of WDM channels, ω k is a carrier frequency of the k-th channel, C k corresponds to the digital data in k-th channel, and T defines the symbol interval; f(t) is the carrier support waveform of our returnto-zero pulses. Q in (2) is the normalisation factor that we use to set the required energy for each signal (the total signal energy is calculated according to Eq. (3)). Each C k in (2) is a complex number drawn from the constellation with a particular cardinality, i.e. it is chosen with an equal probability from the finite set of allowed constellation  www.nature.com/scientificreports/ points. For our NF decomposition analysis each time we use a single signal of the form given in Eq. (2). To train the NN, we precomputed 94035 such signals, with C k for each carrier randomly drawn from quadrature phaseshift keying (QPSK) constellations, i.e. the constellations with 4 possible points; the number of optical channels (carriers) in (2) is 15. Then we sampled our signal at equidistant points in time, t m , over the segment of length T, q(t m ) = q m : the number of sample points in each signal representation was 2 10 = 1024 . The normalised symbol interval T was set to unity so that the time step size used was t = 2 −10 (for the explicit normalisations referring to single-mode fibre transmission see, e.g., Ref. 3 ). For generated discretised profile, the reflection coefficient r(ξ ) was identified for 1024 sample points in ξ variable, calculated using the fast numerical NFT method 58 . The parameter ξ for our computations ranged from −π/(4�t) ≈ −804 to π/(4�t) ≈ 804 : this region corresponds to the conventional Fourier spectrum computational bandwidth for the given sampling rate t , up to the scaling factor 2 referring to the linear limit correspondence 14 . Each signal in the dataset was eventually normalised so that its energy E signal = 39.0 . Some of the signals in the initial dataset for this energy contained solitons, but such signals were singled out and removed from the training and validation datasets. The remaining 94,035 signals did not contain solitons, which means that the discrete nonlinear spectrum for each signal is absent, such that these are used in our analysis. We note that although there are no solitons in the signals, we are still operating in the regime where the signal nonlinearity is not negligible, see Methods. The more straightforward way of generating the datasets with desired properties would be to use the inverse NFT routines, but these are much more time-consuming, such that we decided to employ the data-generation approach described below: it also allows us to explicitly control the accuracy of the generation process.
Together with the set of deterministic signals, we generated the signal sets with the addition of uncorrelated Gaussian noise, adding the random value to each sample point. In realistic applications, the source of this noise can be the instrumental imperfections of the transceiver or the effects relevant to inline amplifiecation 9 . The signal-to-noise ratio (SNR) is a traditionally used characteristic for quantifying the level of a noisy corruption: where E signal and E noise are the signal and noise energies, respectively; q m is the m-th signal sample, with N being total number of sampling points, t is the time sample size. For further training, in addition to the set without noise, which had 84632 signals, we used 8 sets of 423160 signals (5 different noise realisations). Each set corresponds to one of the following SNR values: {0, 5, 10, 13, 17, 20, 25, 30} dB. 9 sets of 9403 signals with the corresponding noise levels were left to validate the network performance. Validation data sets were not used in the training process. We note that the NFT in optical communications is tailored for use in long-haul systems, meaning the high levels of noise (low SNR) is most interesting from the application perspectives. However, we also include the results for high SNR levels to analyse the NN functioning peculiarities in detail.
Neural network design and Bayesian optimisation. As mentioned above, the general NF spectrum attributed to a given localised waveform consists of two parts: the discrete spectrum that we do not consider in our current study (our trial pulses do not contain any solitonic component, neither in pure form nor in the noisy case), and the continuous part which is our subject in hand here. The continuous part is retrieved through considering the special Jost solutions (7) to the Zakharov-Shabat problem (6), see Methods. The goal of our work is to demonstrate the fundamental possibility of replacing the direct calculation of NF spectrum through the numerical solution of the Zakharov-Shabat problem (6) with the computations employing specially-designed and trained NNs.
The latter task can be addressed using the encoder-decoder approach, where the encoder transforms the input signal into some intermediate vector representation and, later, the decoder converts this representation into the output signal. We notice that the input and output signals can belong to two different data domains. There are several advantages of this approach, e.g. it is quite flexible, so the encoder and decoder structures can differ to match exactly the "nature" of each signal's domain. With this, we train such NNs in the end-to-end style, so the weights of the encoder and decoder will be trained simultaneously and fit each other. A lot of highly efficient encoder-decoder architectures have been designed up to date, e.g. those can demonstrate an efficiency higher than that of a human brain for some specific tasks 61 . For processing quite long sequences (typically more than 1000 data points), the convolutional NNs (CNN) are often more beneficial than the recurrent NNs (RNN). Also, the CNN allows us to parallelise the computations in an efficient way, which is important in our case. Thus, we argue that the encoder-decoder architectures based on CNNs are most suitable for our data and task, though other NN types may also deserve investigation in latter studies.
As a starting point, we took the WaveNet 62 -based network, which extends the concept of deep CNNs. Models of this type have several advantages, among which we underline the reduction of time required for training the network on long data sequences. However, a significant drawback of this architecture is the requirement to embed a large number of convolutional layers to increase the receptive field. In our work, to increase the effective size of that region, we used convolutions with dilation. This made it possible to exponentially increase the receptive field with the NN depth growth and, therefore, to capture a larger number of data points in the input signal.
The momentous issue in using NNs to perform any nonlinear transformation is the choice of the optimal network architecture. One of the optimisation methods is to enumerate the possible combinations of NN parameters. But even in the case of a relatively small number of layers, the number of hyperparameters can reach several thousand, which makes the optimisation process very time-consuming, if realisable at all. Thus, the search for an optimisation algorithm for such computationally expensive problems can be extremely difficult. However, www.nature.com/scientificreports/ the Bayesian optimisation method 63 is deemed to be one of the most efficient optimisation strategies, and so we employ it in this work to find the optimal hyperparameters distribution for the NFT-Net. The Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set 63,64 . By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, the Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. Thus, it tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (the hyperparameters expected to bring us close to the optimum). An important aspect to note is that the Bayesian optimisation often does not return one specific point in the parameter hyperspace for which the optimised function is minimal. The process converges into some subspace of parameters, where several points can locally minimize the function 63 . A detailed description of hyperparameters tuning can be found in the article 65 where Bayesian optimisation is used to adapt parameters for the synthesis of a digital pre-distortion filter for optical transmitters.
We manipulate the following hyperparameters for the convolutional part of the neural network: the number of convolutional layers, the number of filters, the kernel size, stride, dilation, and the activation function for each layer. We used the activation functions "ReLU", "tanh" and "sigmoid" in the hyperparameters optimisation. After the convolutional part, there are 2 fully connected layers, the second of which has a fixed size (1024, which corresponds to the size of the output vector). The size and activation function of the first fully connected layer was also a hyperparameter for optimization. For the optimisation, we used a dataset without additional noise and employed only the real part of the continuous spectrum for the prediction. After that, the "optimal" architecture (but not weights) is fixed, and is no longer changed to predict the imaginary part of the continuous spectrum or for our operating with the datasets with additional noise. The loss function was optimised for each architecture. We used the mean squared error (MSE) as the loss function, aiming to minimise the MSE between the network output and the target output computed with the conventional NFT method 58 . In training, we employed the Adam (Adaptive Moment Estimation) optimisation algorithm with the learning rate of 1e-4 66 . The learning process of each point in the parameter hyperspace was stopped if the value of the loss function did not decrease over 5000 epochs. We chose this large epoch stopping-criterium number to neutralise the factor of randomness in the learning process, which appears due to the random choice of the initial weights. Additionally, we checked the value of the loss function on the validation set to prevent the overfitting, but for the amount of training data used, the overfitting was not observed. Figure 2 presents the dependence of the MSE value and dependence of its minimum on the Bayesian iteration number. For architectures with more than 20 million training parameters, we set the value of the loss function to 1.0: this explains the upper cut-off limit in the figure. It is apparent from Fig. 2 that the optimisation has identified a subspace where many architectures have approximately the same value of the loss function at the level of 10 −5 . However, there was a point where the value was at the level of 10 −7 . Thus, we took this point (a set of hyperparameters) as the optimal one. After finding the optimal architecture, each NN's weights were trained for different SNR but keeping the same optimal architecture parameters. On average, with the amount of data used, our learning process took 50,000 epochs to reach the minimum for each noise level.
The original signal and NF data for the continuous spectrum are complex-valued functions. Therefore, two networks with the same architecture are to be used for the whole transformation; each identical part is responsible for the computation of either the real or imaginary parts of the resulting arrays, which contain the values of continuous NF spectrum defined in Eq. (10). Figure 3 depicts the schematic for the entire optimised NFT-Net architecture. The convolutional part consists of three layers with 10, 15 and 10 filters. Kernel sizes of the first and third convolutional layers are 10, and for the layer between them, it is 18. As noted above, we took the dilation value for each layer as one of the sought hyperparameters. For NFT-Net, the optimisation gave that the first two layers have dilation 2, stride 1 and "tanh" activation function, and for the third layer, the dilation is 1 with stride 3 and "ReLu" activation. After the CNN part we put the flattening layer, not shown in the figure (but affecting the processing complexity), and two fully-connected layers with 4096 and 1024 neurons. The exemplary picture of how the designed NN works on one signal is given in Fig. 4c. In this figure, we show the results of the NN-based NF spectrum computation for the noiseless case. Already from this figure, we can notice that the result produced by our NN and that obtained from conventional NFT routine 58 are very similar. www.nature.com/scientificreports/ Studying the NFT-Net performance for computing NF spectra of noisy signals. In this section we analyse the NFT-Net performance and the denoising property of the NN. We compare the deviations in the obtained nonlinear spectrum calculated with the NFT-Net and calculated with the conventional NFT applied to the same signal without noise. To quantify the performance rendered by the NFT-Net application with the performance of conventional algorithms applied to noisy signals, we use the following metric: where S is the total number of signals in the validation set, �·� ξ denotes the mean over the spectral interval, {r predicted (ξ )} i and {r actual (ξ )} i correspond to the value of reflection coefficient r(ξ ) computed for the signal number i at point ξ (we compare the quantities for the validation data set). The label "predicted" refers to the result produced by the NFT-Net on the noisy signal, and "actual" marks the r(ξ ) value obtained using the conventional NFT algorithm 58 for the noiseless signal. The relative error η(ξ ) is determined at the point ξ , so we use η(ξ ) ξ to estimate the overall mean of the error for one signal, and use Eq. (4) to evaluate the error for the entire validation dataset. We stress that the metric was chosen in such a way as to take into account even the regions where the value of the spectrum is much less than one. The results of our comparison for r(ξ ) computation using different SNR levels for NFT-Net are presented in Table 1, and are arranged as follows. The first column of the table identifies the SNR value in dB for the validation signals, i.e. the level of noise for the signals which we analyse. The first row of the table displays the SNR values of noisy signals from the training set, i.e. it shows the noise level of the signals on which the NFT-Net was trained. We notice that the case SNR = 30 dB corresponds to almost negligible noise, while for SNR = 0 dB our noise energy is equal to that of our signal, which signifies a very intensive noisy corruption. Thus, each column in the table corresponds to the results produced by the NN trained on the signals with the chosen level of a noisy corruption. The number in each cell shows the averaged metric value, Eq. (4), where for the computation of {r predicted (ξ )} i we used the NFT-Net trained on the signals with SNR values shown in the first row and applied to the validation signals having the SNR values given in the first column. The "Conv. NFT" column shows the error value for the numerical result of the fast NFT method on the signals with added noise, where the respective SNR is presented, again, in the first column. The value of metric (4) corresponding to the conventional NFT method applied to noiseless signals is, obviously, zero: the results provided by the conventional NFT without noise are taken as the true ones. When the NFT-Net produces a less accurate result compared to the conventional NFT applied to the noisy signal, the cell is marked with bold; when the NFT-Net outperforms the conventional NFT www.nature.com/scientificreports/ method, i.e. it successfully purifies our signal from noise, the respective cell is not highlighted (white). Whence, the white area size in each table demonstrates how well the NFT-Net retrieves the nonlinear spectrum for noisecorrupted signals. Table 1 shows the error values for the restoration of r(ξ ) coefficient (10) of a noiseless and noisy-perturbed signals (2), by the NFT-Net architecture given in Fig. 3. The first row in the table corresponds to the noiseless case. It is always marked with bold, which means that the NN cannot provide any better results than the benchmark ones rendered by the conventional fast NFT method used to generate the training data.
However, the values of the error for noise-corrupted signals reveal interesting tendencies. It follows from the table that for the low training noise level (up to 10 dB, columns three through nine), the NFT-Net error is typically lowest for the noiseless validation dataset (second row). Thus, the addition of low noise in the training dataset only degrades the NFT-Net restoration capability, even though this decrease is not significant. This NFT-Net feature can be deemed as the NN's being "confused" by the weak noise in its training in the nonlinear transformation identification. For the most interesting case of high noise, the network works best for samples where the SNR value is the same for the validation and training sets. In such cases, the relative error is about 8-12%, while the error for conventional NFT is at the level of 100-200%. Another fact is that with decreasing noise (rows from bottom to top) in the validation set, the error value remains at approximately the same level after the cell corresponding to the same training and validation noise values. These results confirm that the presented  (c)) is the amplitude for calculated scattering coefficient r(ξ ) associated with the signal from the pane above. The blue line corresponds to the data obtained using the conventional NFT method, the red line corresponds to the NFT-Net result. The difference between the scattering coefficients for signal example calculated by these methods is shown in panel (e). Pane (b): the same plot for complex signal q(t) with the addition of Gaussian noise. The SNR value used is 5 dB. Plot (d) shows the result of calculating the continuous spectrum for the noisy signal using the FNFT method (green line) and using the NFT-Net trained at the same noise level (SNR = 5 dB, red line) and original spectrum (for noiseless case, blue line). For NFT-Net trained with noise, pane (f) below shows the difference between the predicted scattering data for the example of noisy signal and the reflection coefficient calculated by conventional NFT for that signal without noise. www.nature.com/scientificreports/ NN architecture is capable of performing the desired nonlinear transformation, the NFT, and, in addition, it can also work as an effective denoising element when the noise level becomes non-negligible. The examples of original and noise-corrupted signals and the corresponding nonlinear continuous spectra are given in Fig. 4, where we used the NFT-Net for the computations. Figures 4b and d show that when the additional noise distorts the signal, the conventional numerical algorithms naturally produce the noise-distorted nonlinear spectra. Fig. 4e and f show the relative error value η(ξ ) (4) for the continuous spectrum prediction with NFT-Net for the signal without noise (left) and the signal with noise (right), and the reflection coefficient computed for the original signal by the conventional NFT (marked as "Conv. NFT" in the panes' legends). In Fig. 4c, e, the NFT-Net is trained on the dataset without adding noise, and in Fig. 4d, f, the NFT-Net is trained on the dataset with additional noise for SNR = 5 dB. Figure 4c and d show that in the presence of noise, the fast NFT results begin to deviate noticeably from the original (noiseless) values, while the NFT-Net tends to denoise the resulting nonlinear spectrum.

NFT-Net performance for the restoration of NF coefficient b(ξ ) attributed to noisy signals.
In addition to the coefficient r, from the optical communications perspective it is instructive and important to check how the proposed architecture would work to predict the NF coefficient b, Eq. (8). We note that the optical transmission method coined b-modulation 24,26,49 , where we operate with the modulation of the b-coefficient, has proven to be the most efficacious technique among different NFDM methods proposed 12,28 . Moreover, for the practical case when our signal has a finite extent, the continuous part of the NF spectrum can be completely described by the b-coefficient only, because the second NF coefficient a(ξ ) can be calculated from b(ξ ) profile, see Eq. (11) in Methods. Our goal here is to demonstrate that the same NFT-Net structures can be used for the both r(ξ ) and b(ξ ) computation, when the NN is trained on the respective dataset. As the loss function, we now use the MSE build on the b-coefficient samples, and the MSE is also used as our quality metric in the respective tables: The notations are the same as we used in (4): the labels "predicted" and "actual" correspond, respectively, to the result of the NFT-Net applied to noisy signals and the result produced by the conventional NFT routine applied to noiseless signals.
We carried out the analysis of the NFT-Net performance for the restoration of b-coefficient using the same approach as we did in the previous subsection for r(ξ ) . Our results for noise pulses with the different level of noise are summarised in Table 2. We checked that the NFT-Net configurations when applied to the computation and denoising of b(ξ ) revealed the same tendencies for the quality of restoration as we observed in the previous subsection devoted to the reflection coefficient r(ξ ).
A similar situation as was observed for coefficient r(ξ ) , remains in this case. The error is minimal for a noiseless validation set. However, this trend now continues for high noise levels. A similar tendency is observed all over the results: the values above the diagonal vary slightly. The additional observations when dealing the b-coefficient are as follows. An interesting difference from the case relevant to r(ξ ) , is that the metric value (5) in the case of predicting b(ξ ) is less, and the bold region in Table 1 is larger compared to what we see in Table 2 for the b-coefficient. From the results, it is clear that the prediction accuracy is higher for the b-coefficient. It means that our NN generally works more accurately for the restoration of coefficient b(ξ ) than for r(ξ ) . This result can be expected, as the noise-perturbed r(ξ ) contains the noisy contributions from both a(ξ ) and b(ξ ) , www.nature.com/scientificreports/ while the b-coefficient involves only its noisy contribution, and thus gets corrupted less. So in the latter case, the NN has to clean off less noise. Figure 5 summarizes the above and shows the calculation errors (4) and (5) for NFT-Net architecture. The plot actually visualises the values and tendencies from Tables 1 and 2. For both r(ξ ) and b(ξ ) coefficients, the NN outperforms the fast NFT results when the NFT-Net gets trained on the data with additional noise.

Discussion
Our goal in this work was to demonstrate that the NN can be successfully used for performing the NFT operation, in particular, for computing the profile of continuous NF spectrum. Note that our interest was not only the computation of the continuous NF spectrum, i.e. the nonlinear transformation, but the possibility to denoise signal using NNs. We started with the WafeNet-type architecture 62 , which is effectively a deep CNN, and applied Bayesian optimization 67 to find the optimal set of hyperparameters. Initially, we set the task of optimizing the entire architecture, so the hyperparameters were not only the parameters of the layers but also their number. Table 2. Comparison of the NFT-Net performance against the fast conventional NFT in the computation of coefficient b(ξ ). The table presents the results for the NFT-Net architecture from Fig. 3. The values in the cells show the error value (5) for each specific pair of training and validation sets SNR. The bold cells correspond to the cases when the accuracy of the NFT-Net nonlinear spectrum restoration is lower than that of fast NFT, i.e. the NN does not denoise the signal well, while the white cells correspond to the cases when the accuracy of the continuous NF spectrum rendered by the NFT-Net is higher, i.e. the NN effectively denoises the signal.  www.nature.com/scientificreports/ Once again, we emphasize that Bayesian optimization does not always give the "best" set of parameters. It provides a subspace of hyperparameters in which neural networks with such parameters are best trained on the available dataset. Due to the fact that neural networks are universal approximators, any sufficiently complex architecture can be trained for a specific task. We can expect that the optimization process can converge endlessly towards increasing the complexity of the network. However, this is not suitable for our task, where we want to minimize the complexity of the network while improving the accuracy of the work. Therefore, we simultaneously limited the number of trainable parameters in the NN during optimization. In our case, during the optimization process, we found an architecture that gives us the best metric value (4) and we chose it as the desired architecture. Further, the optimization process could converge to another subspace of hyperparameters, but we stick to the point with the minimum value of the loss function.

Conv. NFT
We found that this NN, indeed, can perform the NFT operation and denoise the received NF spectrum: the denoising effect is pronounced at medium to high noise levels. To achieve this effect, several realisations of the noise are needed for the neural network to "understand" the influence of noise on the signal. As expected, denoising is typically best when the training and testing data noise levels coincide, though we observed some deviations from this rule for lower noise levels, where the quality of restoration of the NF spectrum also makes a noticeable contribution in the overall error value. When being trained on different noise levels, the NFT-Net was still able to produce denoising, thus demonstrating the design's flexibility. We have shown that conventional NFT calculation methods give "distorted" results when working added noise. In fact, the "distorted" results are actually correct, but from the nonlinear transformation point of view. But from the application's perspective, we are almost always interested in the denoised signals to reduce the embedded data corruption level. At this place we notice that the exemplary signals that we used for the NFT-Net training/testing, Eq. (2), are, evidently, different from those used in r-of b-modulated NFDM systems. Moreover, the latter are subject to dispersive effects as the NN has to process them at the receiver side after their having passed some distance. To adapt the NFT-Net for the different signals, two possible strategies can be used. The first one is straightforward, where we retrain the NN from scratch using a different dataset. The second strategy can make use of the pretrained NFT-Net model and utilise domain randomisation and adaptation 68,69 . We believe that after the retraining procedure, the NFT-Net (or some of its modifications, if we find that the capacity of the proposed NN architecture is insufficient to account for some complicated real-world effects) should be capable to account for the spurious soliton emergence and involved noise properties taking place in the realistic optical transmission systems.
Finally, we note that the problem of recovering a few solitons from a given pulse utilizing NN has been studied in 31,34,44,70 . However, the NN architectures used in those studies are much simpler as one has to identify and filter only a few solitonic parameters , while in our work we recovered 1024 complex numbers representing the continuous NF spectrum. A larger number of solitary modes was considered in 71 , where, however, only the total number of solitons in the pulse was studied. Potentially, it is interesting to combine the NN developed in our work with the additional module that can deal with soliton parameters restoration: such a hybrid tool would be able to perform the complete NFT decomposition of an arbitrary decaying pulse.
To sum up, we investigated the modelling of the NFT operation associated with the focusing NLSE, using the NN with a special structure, which we coined the NFT-Net. We considered here an almost unexplored case dealing with the computation of the continuous part of the NF spectrum. It was demonstrated that the WaveNet-type NFT-Net structure can satisfactorily perform the task of the NF spectrum computation, and the best-performing architecture was obtained by Bayesian hyperparameters optimisation. Moreover, we showed that the same NFT-Net structure can be used to efficaciously retrieve both the reflection coefficient r(ξ ) and the scattering coefficient b(ξ ) . The most practically important feature of the developed NN-based method is its capability to perform signal denoising. We demonstrated that the NN-based processing can bring about essential improvements in the quality of NF spectrum restoration attributed to noise-perturbed time-domain profiles, compared to the conventional high-accuracy NFT processing method. The advantage in denoising becomes most pronounced at high noise levels, with the maximum restoration quality typically occurring when the SNR of the training data is the same as that of the validation dataset.

Methods
Forward NFT operation for focusing NLSE. The NF spectrum associated to a given pulse q(t) (we drop the dependence of our quantities on z for simplicity) having a finite L 1 norm, is calculated using the solutions of the so-called Zakharov-Shabat spectral problem [2][3][4]6 . The latter is represented by the set of coupled ordinary differential equations written for two auxiliary functions v 1,2 . Our signal to decompose, q(t), enters into this set as an effective potential. We write down the Zakharov-Shabat problem (the focusing NLSE case) as 4 : In Eq. (6), ξ is the (generally complex-valued) spectral parameter which plays the role of conventional Fourier frequency for integrable nonlinear PDEs. The overbar in Eq. (6) and below denotes the complex conjugates of corresponding quantities. To determine the NF spectrum associated with our profile q(t), we need to find the special solution �(t, ξ) of Eq. (6), called Jost function, imposing the special asymptotic condition at the trailing end of the pulse: www.nature.com/scientificreports/ The NF pulse decomposition consists in finding the continuous and discrete components of the NF spectrum associated with the localised signal q(t). The core part of NFT is the calculation of scattering coefficients, a(ξ ) ∈ C and b(ξ ) ∈ C , defined through the Jost solution �(t, ξ) as follows where ξ ∈ R . The scattering coefficients for the focusing NLSE satisfy: The continuous part of NF spectrum is generally defined by the ratio of quantities b and a from (7): where r(ξ ) is often refereed to as the reflection coefficient. r(ξ ) plays the role of the ordinary Fourier spectrum for nonlinear integrable PDEs and converges to the FT of our signal in the low-power (linear) limit; see more direct expressions below. The discrete part of NF spectrum (the solitonic degrees of freedom) consists of the set of complex-valued pairs: {ξ n , c n } , where n numerates the soliton mode, and each ξ n is the (non-degenerate) solution of the equation a(ξ ) = 0 , laying the the upper complex semi-plane of ξ . The second quantity, the so-called norming constants c n , are given (for a sufficiently localised signal 72 ) by: c n = c(ξ n ) = b(ξ n )/a ′ (ξ n ) , with prime meaning the derivative with respect to ξ . The value of ξ n determines the amplitude and frequency of each solitonic component, while c n defines the values of phase and the "centre-of-mass" position of a solitary mode. However, the discrete part of NF spectrum is not addressed in our study; see Refs. 31,34,44 where the solitonic parameters are computed using the NNs.
More exact mathematical details regarding the NF spectrum definition and properties can be found in, e.g., monograph 6 , see also Ref. 72 for a brief mathematical review.
NF spectrum associated with finite-extent signals. In practical applications, we do not typically deal with the signals defined on the whole infinite t-axis, but rather operate with the truncated wave-forms, meaning that q(t) is non-zero only inside the finite interval of t. In this case, the NF spectrum of the signal is completely characterised by the coefficient b(ξ ) from (8), which becomes band-limited, appended with the finite discrete set of solitonic parameters {ξ n , c n } 25,26 . When, in addition, the discrete NF spectrum is absent, as it is in the case considered, the whole NF spectrum can be defined using just b(ξ ) profile 24 , while the coefficient a(ξ ) can be expressed through b(ξ ) in the following way: where the integral in the exponent is understood in the principal value sense. So, in practice, instead of r(ξ ) (10), it is sufficient to compute the b-coefficient, and then find a(ξ ) using Eq. (11). If needed, we then can use both computed quantities to find the reflection coefficient (10). In practice, the b-coefficient is preferable, since when calculating the r(ξ ) , in the case of a value of the a(ξ ) close to zero, the numerical error of the calculation greatly increases. We note that within the b-modulation concept, which has turned out to be the most efficacious NFDM method developed so far, we utilise the b(ξ ) functions as information carriers 24-27 . NF spectrum for the weakly-nonlinear case and threshold for soliton nucleation. Let us assume that the amplitude of our signal is small, say |q(t)| ∼ ε , with ε ≪ 1 . Then, we can derive the following expansions for the NF scattering coefficients 14 : up to ε 2 (the next expansion term ∼ ε 4 ), and up to ε 3 (the next expansion term ∼ ε 5 ). With the accuracy up to ε 4 , we have for the reflection coefficient: So we see that the first linear term in r(ξ ) expansion is simply the conjugated FT of our signal up to the frequency scaling factor. Then, r(ξ ) from Eq. (14) differs from the expression for b(ξ ) , Eq. (13), only by the terms ∼ ε 3 and higher, but the structure of both expressions is the same, and so the NFT-Net with the same structure |a(ξ )| 2 + |b(ξ )| 2 ≡ 1. www.nature.com/scientificreports/ can successfully recover both r(ξ ) and b(ξ ) if we explicitly train it for the recognition of the corresponding quantity. We believe that this also holds for any level of nonlinearity, maybe aside from the case when we are close to the soliton creation threshold and r(ξ ) displays sharp peaks 14 , Fig. 2]. But, in such a special scenario, it looks more efficient to use the NN to recover a(ξ ) and b(ξ ) profiles, as these do not typically display any singular behaviour.
Turning to the question of soliton appearance from a localised profile, the rigorous criterion for our having no embedded solitons can be formulated for single-lobe profiles as 73 : and the deterministic profiles used in our work have a much higher normalised energy. For more involved multi-lobe profiles, the soliton-creation threshold is typically higher, but we still had some profiles that contained solitary components, so we had to eliminate them. When we add noise to our signal that initially contains no solitons, a random modulation typically diminishes the probability of solitons appearance 74,75 . However, we checked out that all randomly perturbed signals used in our study did not contain a solitonic component as well.
To demonstrate the difference between the continuous NFT spectrum and the linear FT spectrum, we calculated (taking into account the necessary transformations and frequency scaling) both spectra for an example signal of the type used in our analysis. As the measure showing the distinction between the conventional Fourier and NF spectra, we use the norm of the difference: |r(ξ ) − r FT (ξ )| , where r FT (ξ ) is given by the first (linear) term in the expansion of r(ξ ) , Eq. (14). Figure 6a shows an example of a nonlinear and conventional Fourier spectrum. The dependence of the difference on the spectral parameter ξ for a typical signal from our testing set is shown in Fig. 6b. The critical decrease of the difference at ξ region below −100 and above 100 occurs because the amplitude of the continuous spectrum at that region also tends to zero. The average maximal difference parameter value over the entire spectrum for all signals from the test dataset is ≈ 9 . This fact allows us to argue that the nonlinear effects are essential for the selected testing signals, despite their containing no solitons. Thus, the accuracy of the NFT-Net allows us to perceive the truly nonlinear effects.
Numerical NFT computation. In our work we used the conventional forward NFT numerical method to generate training and testing data set pairs: the signal and its respective NF spectrum. For the computation of continuous NF spectrum associated with a given profile q(t) (containing no solitons) having the form of Eq. (2), we used the exponential scheme ES4 from the FNFT package 58 (non-fast realisation). It has the accuracy proportional to the fourth power of the time sample size, ∼ (�t) 4 . We note that there exists the fast realisation of the NFT processing with ∼ (�t) 4 accuracy 76 , which can potentially be used for efficient NFT-Net training.
Complexity analysis. One of the important metrics in the development of signal processing tools is the complexity of the processing device, i.e. the number of elementary arithmetic operations that the processing unit employs to reach its goal. Quite often we need to analyse the interplay between the complexity and accuracy of the processing unit. Thus, here we perform the complexity analysis for the NFT-Net.
In our case, we concentrate only on the number of multiplications, since in practical implementation the computational complexity of addition operations is negligible. The number of real multiplications needed for the forward propagation of the model, as introduced in 77 for several types of NN layers, is also used to calculate the computational complexity of the NFT-Net in this paper.
The overall complexity C of the NFT-Net can be presented as the sum of two constituents: the complexity of densely-connected block C dense and the complexity of convolutional block C conv . For the calculation of C dense the same formula as in 77 can be used, where we have n i inputs, n 1 neurons in the hidden layers, and n o outputs, and the complexity is defined as:  www.nature.com/scientificreports/ In the case of the convolution layer, we can change the equation given in 77 to measure the generalised convolutional layer complexity by taking into account the number of filters f and kernel size k, as well as the effect of padding p, stride s, and dilation d. The complexity C conv, layer for one layer when the input shape is [ L in , Q in ], is specified as follows: where Q in denotes a number of channels, L in is a length of signal samples sequence. Therefore, the total complexity of the NFT-Net used in this paper in terms of real multiplications per output sequence (1024 complex valued points) is: where the factor 2 in front appears due to the use of two identical NNs to predict the real and imaginary parts of the continuous NF spectrum. Turning to our optimised architecture, to process 1024 complex signal samples, the following number of multiplication operations for the optimised architecture is required: For comparison, processing a signal consisting of 1024 points using FNFT methods from Ref. 78 requires 3885572 FLOPs (note that this is not the number of multiplications, so the direct comparison with the number from Eq. (19) is somewhat difficult). Generally, for the computation of N points in the NF spectrum from N point in t-domain, the non-fast NFT methods 72 typically require N 2 FLOPs, while the fast methods need N log 2 N FLOPs 58,78 . From this perspective, the complexity of the current NFT-Net corresponds to that of non-fast NFT methods. However, some techniques can be further used to reduce the NN's complexity 79 .