Experimental implementation of a neural network optical channel equalizer in restricted hardware using pruning and quantization

The deployment of artificial neural networks-based optical channel equalizers on edge-computing devices is critically important for the next generation of optical communication systems. However, this is still a highly challenging problem, mainly due to the computational complexity of the artificial neural networks (NNs) required for the efficient equalization of nonlinear optical channels with large dispersion-induced memory. To implement the NN-based optical channel equalizer in hardware, a substantial complexity reduction is needed, while we have to keep an acceptable performance level of the simplified NN model. In this work, we address the complexity reduction problem by applying pruning and quantization techniques to an NN-based optical channel equalizer. We use an exemplary NN architecture, the multi-layer perceptron (MLP), to mitigate the impairments for 30 GBd 1000 km transmission over a standard single-mode fiber, and demonstrate that it is feasible to reduce the equalizer’s memory by up to 87.12%, and its complexity by up to 78.34%, without noticeable performance degradation. In addition to this, we accurately define the computational complexity of a compressed NN-based equalizer in the digital signal processing (DSP) sense. Further, we examine the impact of using hardware with different CPU and GPU features on the power consumption and latency for the compressed equalizer. We also verify the developed technique experimentally, by implementing the reduced NN equalizer on two standard edge-computing hardware units: Raspberry Pi 4 and Nvidia Jetson Nano, which are used to process the data generated via simulating the signal’s propagation down the optical-fiber system.


Results
We develop and experimentally evaluate the performance of a low-complexity NN-based equalizer that can be deployed on resource-constrained hardware and, at the same time, can successfully mitigate nonlinear transmission impairments in a simulated optical communication system. This is achieved by applying the pruning and quantization techniques to the NN 23 , and by studying the optimal trade-off between the complexity of the NN solution and its performance. The obtained results can be split into three main categories.
First, we quantify how complexity reduction techniques affect the performance of the NN model and establish a compression limit for the optimal performance versus complexity trade-off. Second, we analyze the computational complexity of the pruned and quantized NN-based equalizer in terms of DSP. Finally, we experimentally evaluate the impact that the characteristics of the hardware and the NN model have on the signal processing time and energy consumption by deploying the latter on both a Raspberry Pi 4 and a Nvidia Jetson Nano. Now we briefly review the previous results in the field of compression techniques applied to NN-based equalizers in optical links, to underline the novelty of our current approach. The use of these techniques to reduce the complexity of NNs in optical systems is, clearly, not a new concept 25 . However, compression methods have recently gained a new wave of attention due to the question of how realistic the hardware implementation of NN-based equalizers in optical transmission systems is. In a direct detection transmission system, a parallel-pruned NN equalizer for 100-Gbps PAM-4 links was tested experimentally using an enhanced version of the one-shot pruning method 26 , which decreased the resource consumption by 50% without significant performance degradation. When considering coherent optical transmission, the complexity of the so-called learned DBP nonlinearity mitigation method was reduced by pruning the coefficients in the finite impulse response filters 27 (see more technical explanations in "Methods" section below). In that case, using a cascade of three filters, a sparsity level of around 92% can be achieved with a negligible impact on the overall performance. Recently, some advanced techniques for avoiding multiplications in such equalizers using additive powers-of-two quantization were tested 28 . In the latter work, 99% of the weights could be removed using advanced pruning techniques, and only bit-shift operations were required instead of multiplications. However, none of those works deals with the experimental demonstration of a hardware implementation, and our study addresses exactly this problem.
So, unlike previous works, in the current study, we implement the compressed NN-based equalizer for the coherent optical channel in two different hardware platforms: a Raspberry Pi 4 and a Nvidia Jetson Nano. We also evaluate the impact of the compression techniques on the system's latency for each hardware type and study the performance-complexity trade-off. Finally, we carry out an analysis of energy consumption and of the impact that the characteristics of the hardware and the NN model have on it.

Optical communication system and equalizer design.
To address the use of an MLP as an NN-based equalizer, an accurate measurement system for both the inference time and the power consumption, on both a Raspberry Pi and a Nvidia Jetson Nano, has been designed, so that the effects of pruning and quantization on these metrics can be characterized (see "Methods" section below for a detailed explanation). In Refs. 10,14 , the non-compressed MLP post-equalizer was considered, and it was shown that it can successfully compensate for the nonlinearity-induced impairments in a coherent optical communication system. We analyze the equalizer's performance in terms of the standard achieved Q-factor, using simulated data for a dual-polarization, 30 GBd, 64-QAM signal shaped with a root-raised cosine (RRC) filter with 0.1 roll-off, transmitted over 20 × 50 km links of standard single-mode fiber (SSMF). We used the same simulator as described in Refs. 10,29 to generate our training and testing datasets, and the same procedure to train the NN-based equalizer (see "Numerical setup and neural network model" subsection in "Methods" for more details). In our configuration, the NN is placed at the receiver (Rx) side after the Integrated Coherent Receiver (ICR), Analog-Digital Converter (ADC), and DSP block. This last block consists of a matched filter and a linear equalizer. The matched filter is the same RRC filter used in the transmitter. The linear equalizer is composed of a full electronic chromatic dispersion compensation (CDC) stage and a normalization step, see Fig. 1, where $K, K_{DSP} \in \mathbb{C}$ are constants and $x_{h/v}$ is the signal in either the h or the v polarization. No other distortions (related to the components within the transceiver) were considered. For this system, the optimal launch power was −1 dBm, with the Q-factor being close to 7.8, as can be seen in Fig. 2. We then investigated the next three power levels (0 dBm, 1 dBm, and 2 dBm), going towards the higher nonlinear regime, where the task of the NN would be more complicated.
The hyperparameters that define the structure of the NN are obtained using a Bayesian optimizer (BO) 10,30 , where the optimization is carried out with regard to the signal's restoration quality (see "Numerical setup and neural network model" subsection in "Methods"). The resulting optimized MLP has three hidden layers (we did not optimize the number of layers, only the number of neurons and the type of activation function), with 500, 10, and 500 neurons, respectively. (The values 10 and 500 were the minimal and maximal limits within which the BO algorithm searched for the optimal configuration.) The activation function "tanh" was chosen by the optimizer and no bias is employed. The NN takes the downsampled signal (1 sample per symbol) and feeds N = 10 neighboring symbols (the number of taps) into the equalizer to recover the central one. This memory size was defined by the BO procedure. The NN was subjected to pruning and quantization after it had been trained and tested. We analyzed the performance of different NN models depending on their sparsity level; the latter ranged from 20 to 90%, with a 10% increment. The weights and activations are quantized, converting their data type from 32-bit single-precision floating-point (FP32) to 8-bit integer (INT8). The quantization was carried out to enable real-time use of the model as well as its deployment on resource-constrained hardware. The final system is depicted in Fig. 1. The inference process (the signal equalization) was first carried out using an MSI GP76 Leopard personal computer, equipped with an Intel® Core™ i9-10870H processor, 32 GB of RAM and an Nvidia RTX2070 GPU. The results obtained on this computer were used as a benchmark and compared to those obtained on two small single-board computers: a Raspberry Pi 4 and a Nvidia Jetson Nano.
Finally, the NNs were developed using TensorFlow. The pruning and quantization techniques were implemented using the TensorFlow Model Optimization Toolkit-Pruning API and TensorFlow Lite 31 .
Compressing process for neural network equalizers. When designing an NN for a particular purpose, the traditional approach consists in using dense and over-parametrized models, insofar as this often provides good model performance and learning capabilities 32,33 . This is due to the over-parametrization's smoothing effect on the loss function, which benefits the convergence of the gradient descent techniques used to optimize the model 32 . However, some precautions must be taken while training an over-parametrized model, because such models often tend to overfit, and their generalization capability can be degraded 32,34 .
The good performance achieved due to over-parameterization comes at the cost of larger computational and memory resources. This also results in a longer inference time (latency growth) and higher energy consumption. Note that these costs are the consequence of parameter redundancy and a large number of floating-point operations 20,23 . Therefore, the capabilities of high-complexity NN-based equalizers do not translate yet into end-user applications on resource-constrained hardware. Thus, reducing the gap between the algorithmic solutions and the experimental real-world implementations is an increasingly active topic of research. During the past several years, noticeable efforts have been invested in developing techniques that can help to simplify the NNs without significantly decreasing their performance. These techniques are grouped under the term "NNs compression methods", and the most common approaches are: down-sizing the models, factorizing the operators, quantization, parameter sharing or pruning 20,23,24 . When these techniques are applied, the final model typically becomes much less complex, and, therefore, its latency, or the time it takes to make a prediction, decreases, which also results in a lower energy consumption 20 . In this work, we focus on both pruning and quantization for compressing our NN equalizer and quantify a trade-off between complexity reduction and system performance, see "Methods" section for a detailed description of both approaches.
Performance vs. compression trade-off. Firstly, we note that the complexity reduction of the equalizer must not affect its performance drastically, i.e. the system's performance is still required to be within an acceptable range. In Fig. 3a, the Q-factor achieved by the NN equalizer is depicted versus different sparsity values, for three launch power levels: 0 dBm, blue; 1 dBm, red; and 2 dBm, green. The results obtained on the PC, Raspberry Pi, and Nvidia Jetson Nano using the pruned and quantized model are shown using dotted lines and stars. For each of these launch powers, two baselines for the Q-factor are depicted: one corresponds to the level achieved by the uncompressed model, defined by the straight lines, while the other provides the benchmark when we do not employ any NN equalization and use only standard linear chromatic dispersion compensation plus phase/amplitude normalization (LE, linear equalization); the latter levels for the three different launch powers are marked by dotted lines having the appropriate colors. Figure 3b quantifies the impact that each compression technique has on the performance: in that figure, we plotted the Q-factor achieved by the NN equalizer versus different values of sparsity, for the 1 dBm launch power. The blue and red straight lines represent the Q-factor of the original model and the Q-factor achieved by it after being quantized. The dotted lines with asterisks show the performance of a model that has been only pruned (blue), and the performance in the case of both pruning and quantization (red). It is seen that a substantial reduction of the complexity can be achieved without a dramatic degradation of the performance. The sparsity levels at which the fast deterioration of the performance occurs are also clearly seen in this figure. First, it can be observed from Fig. 3a that the quantization and pruning process does not cause a significant performance degradation until a sparsity level equal to 60% is reached, with just a 4% performance reduction. However, when we move to sparsity levels around 90%, the performance is close to the one achieved using a linear equalization (i.e., the Q-factor curves drop to the levels marked with the dashed lines of the same color).
We can conclude that when the levels of sparsity are above 60%, the decrease in the performance is mainly the effect of the quantization process. A nearly 2.5% drop in the Q-factor value has also been observed when quantizing an already pruned model. Once the levels of sparsity are higher than 60%, the reduction in performance due to the quantization accelerates. Moreover, we observe that some degree of sparsification can even improve the model's performance with respect to the unpruned model. This behavior has already been reported in other studies, where it was found to be specific to over-parametrized models. NNs with less complex structures do not show such an increase in performance due to low-sparsity pruning, which makes it impossible for them to achieve such good performance-complexity ratios 32,33,35,36 . Computational complexity analysis. Figure 4 depicts the reduction in the size of the model as well as the model's computational complexity for different sparsity values, after having applied quantization. For the definition of the metrics used to calculate the computational complexity as well as the size of the models, see the subsections "Computational complexity metrics and memory size metrics" in "Methods". Overall, we have achieved an 87.12% reduction in the memory size after pruning 60% of the NN equalizer weights and quantizing the remaining ones. As a consequence, the size of the model went down from 201.4 to 25.9 kilobytes. The model's computational complexity decreases from 75,960,427.38 to 16,447,962 bit operations (BoPs) after applying the same compression strategy, which is a 78.34% reduction (see the explicit definition of BoPs in "Methods" section). We would like to point out once more that sparsity levels of 60% can be reached without a substantial performance loss. 
Therefore, approximately the same high level of performance can be achieved with a model that is significantly less complex than the initial NN structure, which is one of the main findings of our work.
It is worth mentioning the individual impact that quantization and pruning have on the computational complexity of the model. When the computational complexity is calculated for a quantized, but unpruned model, the number of BoPs is equal to 23,321,563. Therefore, if this value is compared with the already mentioned 75,960,427 BoPs for the unpruned and unquantized NN, a reduction in complexity of 69.3% is obtained thanks to quantization. As can be seen in Fig. 4, the remaining gain comes from the pruning technique, and it grows linearly as indicated in Eq. (5).
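The percentages quoted in this subsection can be re-derived from the reported absolute values. The short check below is only an arithmetic sketch: the constants are copied from the text, while the helper name `reduction_percent` is introduced here for illustration and is not part of the original work. Small deviations in the last digit are expected, since the reported sizes are themselves rounded.

```python
# Values quoted in the text (sizes in kilobytes, complexity in BoPs).
SIZE_ORIGINAL_KB = 201.4
SIZE_COMPRESSED_KB = 25.9
BOPS_ORIGINAL = 75_960_427.38
BOPS_QUANTIZED_ONLY = 23_321_563
BOPS_PRUNED_AND_QUANTIZED = 16_447_962


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (1.0 - after / before)


# Memory saving after 60% pruning plus INT8 quantization.
memory_reduction = reduction_percent(SIZE_ORIGINAL_KB, SIZE_COMPRESSED_KB)
# Complexity saving from quantization alone (no pruning).
bops_reduction_quant = reduction_percent(BOPS_ORIGINAL, BOPS_QUANTIZED_ONLY)
# Complexity saving from pruning and quantization together.
bops_reduction_total = reduction_percent(BOPS_ORIGINAL, BOPS_PRUNED_AND_QUANTIZED)

print(f"memory reduction:            {memory_reduction:.2f}%")
print(f"BoPs reduction (quant only): {bops_reduction_quant:.2f}%")
print(f"BoPs reduction (total):      {bops_reduction_total:.2f}%")
```

The difference between the last two figures is the part of the complexity gain attributable to pruning at the 60% sparsity level.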
Online latency evaluation. Numerous deep learning applications are latency-critical, and therefore the inference time must be within the bounds specified by service level objectives. Optical communication applications that employ deep learning techniques are a good example of this. Note that the latency is highly dependent on the NN model implementation and the hardware employed (e.g., FPGA, CPU, GPU). Please refer to "Methods" section for more details on the devices' inference time measurements.
The inference time was measured on the different types of hardware for the quantized model that has had 60% of its weights pruned. Figure 5 shows the latency of the considered NN model before and after quantization. Note that the results are expressed in the way that is most appropriate for the task at hand: latency is defined as the time it takes to process one symbol, averaged over 30 k symbols. With the quantized model, we observe approximately a 56% reduction in latency for all three values of power, when compared to the original model. Pruning is not taken into account here because it does not affect this metric: TensorFlow Lite does not support sparse inference yet, so the algorithm still uses the same amount of cache memory. We also observe that the Raspberry Pi has the longest inference time among our devices. This is in line with the fact that the Raspberry Pi is designed as a low-cost and general-purpose single-board computer 37 . On the other hand, the Nvidia Jetson Nano was developed with GPU capabilities, which makes it more suitable for deep learning applications, allowing us to achieve lower latencies.
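The per-symbol latency metric defined above (total processing time divided by the number of processed symbols) can be sketched as follows. This is a minimal illustration and not the measurement setup used in this work: `equalize` is a hypothetical callable standing in for the deployed model, and averaging over a few repeated runs is an assumption made here to smooth out timer jitter.

```python
import time
from typing import Callable, Sequence


def latency_per_symbol(equalize: Callable[[Sequence], object],
                       symbols: Sequence,
                       repeats: int = 5) -> float:
    """Average wall-clock time, in seconds, spent per processed symbol.

    `equalize` is a placeholder for the inference call; it receives the
    whole batch of symbols at once.
    """
    if len(symbols) == 0 or repeats < 1:
        raise ValueError("need at least one symbol and one repeat")
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        equalize(symbols)
        total += time.perf_counter() - start
    # Mean time per run, then normalized by the number of symbols.
    return total / repeats / len(symbols)


if __name__ == "__main__":
    # Dummy "equalizer" that only copies its input, used as a stand-in.
    dummy_symbols = [complex(i, -i) for i in range(30_000)]
    latency = latency_per_symbol(lambda s: [x for x in s], dummy_symbols)
    print(f"{latency * 1e9:.1f} ns per symbol")
```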
Online energy consumption evaluation. Within the context of edge computing, not only is speed an important factor, but also power efficiency. In this work, the metric used to evaluate the energy consumption and compare the different types of hardware for the coherent optical channel equalization task is the energy per recovered symbol. When using a quantized model with a pruning level of 60%, the average power drawn during inference by the Raspberry Pi 4 and the Nvidia Jetson Nano is 2.98 W ( σ = ±0.012 ) and 3.03 W ( σ = ±0.017 ), respectively. On the other hand, if the original model is employed, there is an increase in power consumption of around 3%, which is congruent with the findings in previous works 23 : during inference, the Raspberry Pi 4 draws 3.06 W ( σ = ±0.011 ) and the Nvidia Jetson Nano 3.13 W ( σ = ±0.015 ). Multiplying these values by the NN processing times per recovered symbol reported in Fig. 5, we obtain the results presented in Fig. 6. We note that the Raspberry Pi has the highest energy consumption per recovered symbol. This is a consequence of the lack of a GPU, which results in longer inference times. Thus, the Nvidia Jetson Nano consumes 33.78% less energy than the Raspberry Pi 4. Regarding pruning and quantization, the use of these techniques allows an energy saving of 56.98% for the Raspberry Pi 4 and a 57.76% saving for the Nvidia Jetson Nano. Note that although TensorFlow Lite does not support sparse inference, and therefore pruning does not help to reduce the inference time, pruning does affect the size of the model. This has a direct effect on the power consumption of the device due to the decrease in the use of resources. In contrast, quantization has a positive effect on both of these parameters thanks to employing lower-precision formats and reducing the size of the model. Therefore, it has a stronger effect on energy consumption. This is reflected in the results presented in this section. Moreover, it is congruent with the findings reported in previous studies 23,38 .
See "Methods" section for more details on the energy consumption measurement.
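The energy-per-symbol metric is the product of the average power and the per-symbol latency. The sketch below combines the power values quoted above with the approximately 56% latency reduction reported for quantization to estimate the relative energy saving. Because the latency figure is rounded and the per-device latencies of Fig. 5 are not reproduced here, the output should only be expected to land close to the reported 56.98% and 57.76%, not to match them exactly; the function names are introduced for illustration.

```python
def energy_per_symbol(power_w: float, latency_s: float) -> float:
    """Energy in joules spent per recovered symbol: E = P * t."""
    return power_w * latency_s


def relative_energy_saving(power_original_w: float,
                           power_compressed_w: float,
                           latency_reduction: float) -> float:
    """Fractional energy saving of the compressed model.

    The original latency cancels out of the ratio, so only the relative
    latency reduction (e.g. 0.56 for 56%) is needed.
    """
    energy_ratio = (power_compressed_w / power_original_w) * (1.0 - latency_reduction)
    return 1.0 - energy_ratio


# Average power values quoted in the text, in watts.
LATENCY_REDUCTION = 0.56  # approximate value quoted for quantization
pi_saving = relative_energy_saving(3.06, 2.98, LATENCY_REDUCTION)
jetson_saving = relative_energy_saving(3.13, 3.03, LATENCY_REDUCTION)

print(f"Raspberry Pi 4:     {100 * pi_saving:.2f}% estimated energy saving")
print(f"Nvidia Jetson Nano: {100 * jetson_saving:.2f}% estimated energy saving")
```

The estimate makes explicit that almost all of the saving comes from the shorter processing time, with the roughly 3% lower power draw contributing only a small correction.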

Discussion
In our work, we investigated how pruning and quantization can be used to reduce the complexity of the hardware implementation of an NN-based channel equalizer in a coherent optical transmission system. With this, we tested the implementation of the designed equalizer experimentally, using a Raspberry Pi 4 and a Nvidia Jetson Nano. Moreover, the effect of using different types of hardware was experimentally characterized by measuring the inference time and energy consumption on both a Raspberry Pi 4 and a Nvidia Jetson Nano. We note, however, that we experimented only with the edge devices, and the data from the communication system were obtained via simulations; but we do not expect that the results regarding the performance vs complexity trade-off achieved thanks to pruning and quantization would seriously differ for a true optical system. It has been demonstrated that the Nvidia Jetson Nano allows 34% faster inference times than the Raspberry Pi, and that, thanks to the quantization process, a 56% inference time reduction can be achieved. Finally, due to the use of pruning and quantization techniques, we achieve 56.98% energy savings for the Raspberry Pi 4 and 57.76% for the Nvidia Jetson Nano; we also observed that the latter device consumes 33.78% less energy.
Overall, our findings demonstrate that the usage of pruning and quantization can be a suitable strategy for the implementation of NN-based equalizers that are efficient in high-speed optical transmission systems when deployed on resource-restricted hardware. We believe that these model compression techniques can be used for the deployment of NN-based equalizers in real optical communication systems, and for the development of novel online optical signal processing tools. We hope that our results can also be of interest to the researchers developing sensing and laser systems, where the application of machine learning for field processing and characterization is a rapidly developing area of research 39 .

Methods
Numerical setup and neural network model. We numerically simulated the dual-polarization (DP) transmission of a single-channel signal at 30 GBd. The signal is pre-shaped with a root-raised cosine (RRC) filter with 0.1 roll-off at a sampling rate of 8 samples per symbol. In addition, the signal modulation format is 64-QAM. We considered the case of transmission over 20 × 50 km links of SMF. The optical signal propagation along the fiber was simulated by solving the Manakov equation via the split-step Fourier method 40 with a resolution of 1 km per step. The considered parameters of the TWC fiber are: the attenuation parameter α = 0.23 dB/km, the dispersion coefficient D = 2.8 ps/(nm × km), and the effective nonlinearity coefficient γ = 2.5 (W × km)$^{-1}$. The SSMF parameters are: α = 0.2 dB/km, D = 17 ps/(nm × km), and γ = 1.2 (W × km)$^{-1}$. Moreover, after each span, an optical amplifier with the noise figure NF = 4.5 dB was placed to fully compensate fiber losses; it also adds amplified spontaneous emission (ASE) noise. At the receiver, a standard Rx-DSP was employed. It consisted of the full electronic chromatic dispersion compensation (CDC) using a frequency-domain equalizer, the application of a matched filter, and the downsampling to the symbol rate. Finally, the received symbols were normalized (by phase and amplitude) to the transmitted ones. In this work, no additional transceiver distortions were taken into account. After the Rx-DSP, the bit error rate (BER) is estimated using the transmitted symbols, received soft symbols, and hard decisions after equalization.
The NN receives as input a tensor with a shape defined by three dimensions: (B, M, 4), where B is the minibatch size, M is the memory size determined by the number of neighbors N as M = 2N + 1, and 4 is the number of features for each symbol, which correspond to the real and imaginary parts of the two polarization components. The NN has to recover the real and imaginary parts of the k-th symbol of one of the polarizations. Therefore, the shape of the NN output batch can be expressed as (B, 2). This task can be treated as a regression or a classification one. This aspect has been considered in previous studies, which stated that the results achieved by regression and classification algorithms are similar but that fewer epochs are needed in the case of regression. Thus, the mean square error (MSE) loss estimator is used in this paper, as it is the standard loss function employed in regression tasks 41 . The loss function is optimized using the Adam algorithm 42 with the default learning rate equal to 0.001. The maximum number of epochs during the training process was 1000, and the training was stopped earlier if the value of the loss function did not change over 150 epochs. After every training epoch, we calculated the BER obtained using the testing dataset. The optimal number of neurons and activation functions in each layer of the NN, as well as the memory (input) of the system, were inferred employing the Bayesian Optimization algorithm (BO). The values tested for the number of neurons were n ∈ [10, 500]. For the activation function, the BO had to choose between "tanh", "ReLU", "sigmoid" and "LeakyReLU". The values tested for the memory (input) of the system were N ∈ [5, 50]. The metric of the BO was the BER: it searched for the hyperparameters that reduced the BER as much as possible on a validation dataset of $2^{17}$ data points. The final solution was the use of "tanh" as the activation function and 500, 10, and 500 neurons for the first, second, and third layer, respectively. The training and test datasets were composed of independently generated symbols of length $2^{18}$ each. To prevent any possible data periodicity and overestimation 43,44 , a pseudo-random bit sequence (PRBS) of order 32 was used to generate those datasets with different random seeds for each of them. The periodicity of the data is, therefore, $2^{12}$ times higher than our training dataset size. For the simulation, the Mersenne twister generator 45 was used with different random seeds. Moreover, the training data was shuffled before being used as an input to the NN. Finally, we note that the need for periodic retraining of the equalizer in a realistic transmission setting may be a point of concern. This issue has already been addressed in previous studies 29 , where it has been demonstrated that using transfer learning can drastically reduce the training time and training data requirements when changes in the transmission setup occur.
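The (B, M, 4) input layout described above can be illustrated with a small windowing routine. The following is a sketch under stated assumptions rather than the preprocessing code of this work: the feature order (real and imaginary parts of the h polarization, then of the v polarization) and the choice to skip symbols that lack N neighbors on either side are assumptions made here, and the function name `build_windows` is hypothetical.

```python
from typing import List, Sequence


def build_windows(x_h: Sequence[complex],
                  x_v: Sequence[complex],
                  n_neighbors: int) -> List[List[List[float]]]:
    """Return a nested list of shape (B, M, 4) with M = 2 * n_neighbors + 1.

    Each window is centered on the symbol to be recovered and holds, for
    every symbol in the window, [Re(x_h), Im(x_h), Re(x_v), Im(x_v)].
    """
    if len(x_h) != len(x_v):
        raise ValueError("both polarizations must have the same length")
    windows = []
    # Only symbols with a full set of neighbors on both sides are used.
    for k in range(n_neighbors, len(x_h) - n_neighbors):
        window = [
            [x_h[i].real, x_h[i].imag, x_v[i].real, x_v[i].imag]
            for i in range(k - n_neighbors, k + n_neighbors + 1)
        ]
        windows.append(window)
    return windows


if __name__ == "__main__":
    h = [complex(i, -i) for i in range(30)]
    v = [complex(2 * i, 1) for i in range(30)]
    batch = build_windows(h, v, n_neighbors=10)
    print(len(batch), len(batch[0]), len(batch[0][0]))
```

With N = 10 each window holds M = 21 symbols, and the corresponding target for each window would be the pair (real, imaginary) of the central transmitted symbol in one polarization, giving the (B, 2) output shape.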
Pruning technique. With pruning, the redundant NN elements can be removed to sparsify the network without significantly limiting its ability to carry out a required task 24,32,46 . Thus, networks with a reduced size and computational complexity are obtained, resulting in lower hardware requirements as well as faster prediction times 23,24 . Furthermore, pruning acts as a regularization technique, improving the model quality by helping to reduce overfitting 32 . Moreover, retraining an already pruned NN can help to escape local loss function minima, which can lead to a better prediction accuracy 24 . Thus, less complex models can often be achieved without a noticeable effect on the NN's performance 32 .
Depending on what is going to be pruned, the sparsification techniques can be classified into two types: model sparsification and ephemeral sparsification 32 . In the first case, the sparsification is permanently applied to the model, while in the second case, the sparsification only takes place during the computing process. In our work, we will use the model sparsification, because of the effects it has on the final NN's computing and memory hardware requirements. Adding to this, the model sparsification can consist in removing not only weights but also larger building blocks, such as neurons, convolutional filters, etc. 32 . Here we apply pruning to just the weights of the network, for the sake of simplicity and as far as it matches the NN structure (the MLP) that is considered.
After having defined what to prune, it is necessary to define when the pruning occurs. Based on this, there are two main types of pruning: static and dynamic 24 . In the static case, the elements are removed from the NN after the training, and in this work, to demonstrate the effect, we use the static pruning variant because of its simplicity.
The static pruning is generally carried out in three steps. First, we decide upon what requires to be pruned. A simple approach to define the pruning objects can be to evaluate the NN's performance with and without particular (pruned) elements. However, this poses scalability problems: we have to evaluate the performance when pruning each particular NN's parameters, and there may be millions of these.
Alternatively, it is possible to select the elements to be removed randomly, which can be done faster 32,47,48 . Following this latter approach, we decided beforehand to prune the weights. Once it has been decided which elements are to be pruned, it is necessary to establish the criteria for how the elements are to be removed from the NN, ensuring that high levels of sparsity are achieved without a significant loss in performance. When pruning the weights of the NN, it is possible to remove them based on different aspects: considering their magnitude (i.e., the weights having values close to zero are to be pruned, with the pruning percentage defined by the sparsity level we aim to achieve), or their similarity (if two weights have a similar value, only one of those is kept); other selection procedures also exist 32,48 . Here, we pick the relatively simple weight pruning strategy based on magnitude. In Fig. 7 we show the impact of pruning our NN equalizer by 40%. When comparing the weight distributions of the original and pruned models, it is clear that the sparsity level defines the number of weights that need to be pruned. Thus, the pruning process starts by removing the smallest weight and continues until the desired sparsity level is reached. Finally, a retraining or fine-tuning phase should be done, to reduce the degradation in the modified NN performance 24 .
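The magnitude criterion described above can be sketched in a few lines. This is an illustrative stand-alone routine, not the TensorFlow Model Optimization implementation used in this work: it operates on a flat list of weights, zeroes the requested fraction of smallest-magnitude entries in one shot, and omits the retraining phase. The function name `magnitude_prune` is introduced here for illustration.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns a new list; the input is left untouched.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered from the smallest to the largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    w = [0.5, -0.01, 0.2, -0.9, 0.05]
    print(magnitude_prune(w, sparsity=0.4))
```

In practice the zeroed positions would be kept fixed by a mask while the surviving weights are fine-tuned, which is the role of the retraining step mentioned in the text.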
When carrying out pruning using the TensorFlow Model Optimization API, it is necessary to define a pruning schedule to control this process by notifying at each step the level at which the layer should be pruned 49 . In this work, the schedule known as Polynomial Decay is employed. The main characteristic of this type of schedule is that a polynomial sparsity function is built. In this case, the power of the function is equal to 3 and the pruning takes place every 50 steps. This means that during the last steps higher ratios of sparsification are employed (i.e., more weights are removed), speeding up the pruning process. On the other hand, if the power of the function were negative, pruning would be slowed down. The model starts with a 0% sparsity and the process takes place during 300 epochs. This is approximately 35% of the number of iterations required for training the original model. It is the objective of future works to optimize the hyperparameters of the pruning process, improve its efficiency and reduce the cost related to a high number of iterations.

[...] processing, the precision of such arithmetic operations is another crucial factor when determining the model's complexity and, therefore, the inference latency, as well as the equalizer's memory and energy requirements 23,50-52 .
The process of approximating a continuous variable with a specified set of discrete values is known as quantization. The number of discrete values will determine the number of bits necessary to represent the data. Thus, when applying this technique in the context of deep learning, the objective is decreasing the numeric precision used to encode the weights and activations of the models, avoiding a noticeable decrease in the NN's performance 20,52 . Using low-precision formats allows us to speed up math-intensive operations, such as convolution and matrix multiplication 52 . On the other hand, the inference (signal processing) time depends not only on the format representation of the digits involved in the mathematical operations but is also affected by transporting the data from memory to the computing elements 23,38 . Moreover, heat is generated during the latter process and, therefore, using a lower-precision representation can result in energy savings 23 . Finally, another benefit of using low-precision formats is that a reduced number of bits is needed to store the data, which reduces the memory footprint and size requirements 23,52 .
FP32 has traditionally been used as the numerical format for encoding weights and activations (outputs of the neurons) in an NN, to take advantage of a wider dynamic range. However, as already mentioned, this results in higher inference times, which is an important factor when real-time signal processing is considered 20 . A variety of alternatives to the FP32 numerical format for representing the NN's elements have been proposed lately, to reduce the inference time as well as to decrease the hardware requirements. For example, it is becoming popular to train NNs in FP16 formats, as these are supported by most deep learning accelerators 20 . On the other hand, math-intensive tensor operations executed on INT8 types can see up to a 16× speed-up compared to the same operations in FP32. Moreover, memory-limited operations could see up to a 4× speed-up compared to the FP32 version 22-24,52 . Therefore, in addition to pruning, we reduce the precision of the weights and activations to further decrease the computational complexity of the equalizer, employing the technique known as integer quantization 52 .
Integer quantization maps a floating-point value $x \in [\alpha, \beta]$ to a $b$-bit integer $x_q \in [\alpha_q, \beta_q]$. This mapping can be defined mathematically using the following formula:
$$x_q = \mathrm{round}\left(\frac{1}{s}\,x + z\right),$$
where $s$ (a positive floating-point number) is known as the scale, and $z$ is the zero point (an integer). The scaling factor divides a range of real values, in this case those within the clipping range $[\alpha, \beta]$, into a number of partitions. Thus, it can be expressed as $s = \frac{\beta - \alpha}{2^{b} - 1}$, where $b$ is the quantization bit width. On the other hand, the zero point can be defined as $z = \frac{\alpha\,(1 - 2^{b})}{\beta - \alpha}$. Therefore, it will be 0 in the case of symmetric quantization. Moreover, the previous mapping has to take into account that if $x$ is outside of the range $[\alpha, \beta]$, then $x_q$ is outside of $[\alpha_q, \beta_q]$. Thus, it is necessary to clip the values when this happens; as a consequence, the mapping formula becomes:
$$x_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{1}{s}\,x + z\right), \alpha_q, \beta_q\right),$$
where the clip function takes the values 24,53 :
$$\mathrm{clip}(x, \alpha_q, \beta_q) = \begin{cases} \alpha_q, & x < \alpha_q, \\ x, & \alpha_q \le x \le \beta_q, \\ \beta_q, & x > \beta_q. \end{cases} \tag{1}$$
Integer quantization can take different forms, depending on the spacing between quantization levels and the symmetry of the clipping range (determined by the value of the zero point $z$) 53 . For the sake of simplicity, in this work, we used symmetric and uniform integer quantization. The quantization process can occur after the training or during it. The first case is known as post-training quantization (PTQ) and the second one as quantization-aware training 22-24 . In PTQ, a trained model has its weights and activations quantized. After this, a small unlabelled calibration set is used to determine the activations' dynamic ranges 23,52-54 . No retraining is needed, which makes this method very popular because of its simplicity and lower data requirements 53,54 .
Nonetheless, when a trained model is directly quantized, this may perturb the trained parameters, moving the model away from the convergence point reached during training with floating-point precision. In other words, PTQ can suffer from accuracy-related issues 53 .
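The clipped integer mapping described above can be illustrated with a minimal sketch. The 8-bit signed range, the clipping range [-1, 1], and all function and variable names are assumptions chosen for illustration; they are not taken from the authors' implementation.

```python
def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def quantize(x, scale, zero_point, q_min, q_max):
    """Map a real value to an integer: clip(round(x / scale + zero_point))."""
    return clip(round(x / scale + zero_point), q_min, q_max)


def dequantize(x_q, scale, zero_point):
    """Approximate inverse of quantize(): map the integer back to a real value."""
    return scale * (x_q - zero_point)


# Assumed example settings (not from the paper): 8-bit signed integers and a
# symmetric clipping range [-1, 1], with the zero point set to 0.
BITS = 8
ALPHA, BETA = -1.0, 1.0
SCALE = (BETA - ALPHA) / (2**BITS - 1)  # s = (beta - alpha) / (2^b - 1)
ZERO_POINT = 0
Q_MIN, Q_MAX = -(2 ** (BITS - 1)), 2 ** (BITS - 1) - 1

samples = [-2.0, 0.0, 0.5, 3.0]  # the first and last lie outside [ALPHA, BETA]
quantized = [quantize(x, SCALE, ZERO_POINT, Q_MIN, Q_MAX) for x in samples]
restored = [dequantize(q, SCALE, ZERO_POINT) for q in quantized]
print(quantized)
print(restored)
```

Values outside the clipping range saturate at the ends of the integer range, while in-range values are recovered up to an error of at most half a quantization step.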
In this work, the quantization is carried out after the training stage, i.e., we use PTQ. The calibration process required to estimate the range, i.e., (min, max), of the activations in the model is done by running a few inferences with a small portion of the test dataset. In our case, it consisted of 100 samples. When using the TensorFlow Lite API, the calibration is carried out automatically, and it is not possible to choose the number of inferences.
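The idea behind the calibration step can be sketched as follows. This is only an illustration of range estimation over a small calibration set, not the procedure that TensorFlow Lite runs internally; the random activations, the layer width of 16, and the helper names are hypothetical.

```python
import random


def calibrate_range(activation_batches):
    """Return the (min, max) observed over all calibration activations."""
    lo = min(min(batch) for batch in activation_batches)
    hi = max(max(batch) for batch in activation_batches)
    return lo, hi


def symmetric_scale(lo, hi, bits=8):
    """Scale for a symmetric clipping range [-m, m], with m = max(|lo|, |hi|)."""
    m = max(abs(lo), abs(hi))
    return (2 * m) / (2**bits - 1)


# Hypothetical stand-in for the activations of one layer, recorded while
# running 100 calibration samples through the trained model.
random.seed(0)
calibration = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(100)]

lo, hi = calibrate_range(calibration)
scale = symmetric_scale(lo, hi)
print(f"range = ({lo:.3f}, {hi:.3f}), scale = {scale:.5f}")
```

A larger calibration set tends to widen the observed range, which increases the scale and therefore the quantization step.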
Computational complexity metrics. Finally, it is important to discuss how to correctly evaluate the computational complexity of such models. In this regard, we quantitatively evaluate the reduction of computational complexity achieved by applying pruning and quantization by calculating the number of bit operations (BoPs) performed during an inference step. The most common operations in an NN are multiply-and-accumulate operations (MACs). These are operations of the form a = a + w × x, which involve three terms: x is the input signal of the neuron, w is the weight, and a is the accumulator variable 55 . Traditionally, the arithmetic complexity of a network has been measured using the number of MAC operations. However, in terms of DSP, the number of BoPs is a more appropriate metric to describe the computational complexity of the model because, for low-precision networks composed of integer operations, it is not possible to measure the computational complexity using FLOPS 22,56 . Thus, in this work, we use BoPs to quantify the complexity of the equalizer. It is important to note that, within the context of optical channel nonlinearity compensation, the complexity of NN-based channel equalizers has traditionally been measured taking into account only the number of multiplications 12,44,57 . Thus, the accumulator contribution was neglected. However, in this work, we aim to have a more general complexity metric and therefore include it in our calculations.
The BoPs measure was first proposed in 56 and defined, for a quantized convolutional layer, as:
$$\mathrm{BoPs}_{\mathrm{conv}} \approx m\,n\,k^{2}\left(b_a b_w + b_a + b_w + \log_2(n k^{2})\right). \tag{2}$$
In Eq. (2), $b_w$ and $b_a$ are the weight and activation bit widths, respectively; $n$ is the number of input channels, $m$ is the number of output channels, and $k$ defines the filter size (i.e., $k \times k$ filters) 58 . Taking into account that a MAC operation takes the form a = a + w × x, it is possible to distinguish two contributions in the equation above: one corresponding to the $n k^{2} \times b_0$ additions, where $b_0 = b_a + b_w + \log_2(n k^{2})$ is the accumulator width in the MAC operations, and the other corresponding to the multiplications, i.e., $n k^{2}\,(b_a b_w)$ 56 .
Equation (2) was further adapted for the case of a dense layer that has been both pruned and quantized 59 . Thus, it is applicable to our case, as the MLP consists of a series of dense layers arranged one after the other:
$$\mathrm{BoPs}_{\mathrm{dense}} = m\,n\left[(1 - f_{p_i})\,b_a b_w + b_a + b_w + \log_2(n)\right]. \tag{3}$$
In Eq. (3), $n$ and $m$ correspond to the number of inputs and outputs, respectively; $b_w$ and $b_a$ are the bit widths of the weights and activations. The additional term, $f_{p_i}$, is the fraction of pruned layer weights, which allows us to take into account the reduction in multiplication operations because of pruning. This is the reason why it only multiplies the term $b_a b_w$ 59 .
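As a worked illustration of the dense-layer expression, the sketch below assumes the form m·n·[(1 − f_p)·b_a·b_w + b_a + b_w + log2(n)], in which pruning scales only the multiplication term. The layer size (128 inputs, 64 outputs) and the function name are hypothetical and do not correspond to the equalizer studied in this work.

```python
import math


def dense_bops(n_in, n_out, b_w, b_a, pruned_fraction=0.0):
    """Bit operations of one pruned and quantized dense layer.

    Assumed form: m * n * [(1 - f_p) * b_a * b_w + b_a + b_w + log2(n)].
    """
    multiplications = (1.0 - pruned_fraction) * b_a * b_w
    accumulations = b_a + b_w + math.log2(n_in)  # accumulator width
    return n_out * n_in * (multiplications + accumulations)


# Hypothetical layer: FP32 unpruned baseline versus INT8 with 60% sparsity.
baseline = dense_bops(n_in=128, n_out=64, b_w=32, b_a=32)
compressed = dense_bops(n_in=128, n_out=64, b_w=8, b_a=8, pruned_fraction=0.6)
print(f"baseline:   {baseline:.0f} BoPs")
print(f"compressed: {compressed:.0f} BoPs")
print(f"reduction:  {100 * (1 - compressed / baseline):.2f} %")
```

Because the accumulator term is not affected by pruning, the relative savings from sparsity alone shrink as the bit widths decrease.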
Therefore, in our case of the MLP with 3 hidden layers, the total number of BoPs is:
$$\mathrm{BoPs}_{\mathrm{MLP}} = \mathrm{BoPs}_{\mathrm{input}} + \sum_{i} \mathrm{BoPs}_{\mathrm{hidden}_i} + \mathrm{BoPs}_{\mathrm{output}}, \tag{4}$$
where $i \in [1,2,3]$, and $\mathrm{BoPs}_{\mathrm{input}}$ and $\mathrm{BoPs}_{\mathrm{output}}$ correspond to the contributions of the input and output layers. Equation (4) can be written in a less compact way as follows:
$$\mathrm{BoPs}_{\mathrm{MLP}} = \left(n_i n_1 b_i + n_1 n_2 b_a + n_2 n_3 b_a + n_3 n_o b_a\right)(1 - f_p)\,b_w + n_i n_1\left(b_i + b_w + \log_2(n_i)\right) + \dots \tag{5}$$
where $n_i$, $n_1$, $n_2$, $n_3$, and $n_o$ are the number of neurons in the input, first, second, third, and output layers, respectively; $b_w$, $b_a$, $b_o$, and $b_i$ are the bit widths of the weights, activations, output, and input, respectively; $f_p$ is the fraction of the weights that have been pruned in a layer, which, in our case, is the same for every layer.
Memory size metrics. In this work, the size of the model is defined as the number of bytes that it occupies in memory. This metric depends directly on the format used to represent the model. In contrast to the traditional formats used in TensorFlow (e.g., .h5, the HDF5 binary data format, and .pb, protobuf), a TensorFlow Lite model is represented in a special efficient portable format identified by the .tflite file extension. This provides two main advantages: a reduced model size and lower inference times. Therefore, the deployment of the NN model on resource-restricted hardware becomes feasible. As a consequence, it would not make sense to compare the models saved in the traditional TensorFlow format with those that have been pruned and quantized as well as converted into TensorFlow Lite. For this reason, and to avoid overestimating the benefits of pruning and quantization, the unpruned and unquantized model was also converted to the .tflite format. To illustrate the implications of this step: the size of the original model in .h5 format is reduced by 96.22% after it is converted to .tflite format, quantized, and pruned (60% sparsity). On the other hand, if the original model has already been converted to .tflite, the size reduction is 87.12%. Based on this, always using the .tflite format instead of the conventional ones may seem to be the best strategy. The main reason for not doing so is that a graph in .tflite format cannot be trained again, as it only sup-
Memory and processor restricted hardware. In many deep learning applications, low power consumption and a reduced inference time are especially desirable. 
Moreover, the use of graphics processing units (GPUs) to attain high performance has cost-related issues that are far from being solved 37,60 . Therefore, small, portable, and low-cost hardware is required to address this problem. As a result, single-board computers have become popular, and Raspberry Pi 4 and Nvidia Jetson Nano are among the most widely used ones 37 . Hence, here we analyse the functioning of our NN-based equalizer on these two popular hardware types.
Power measurement. In this work, together with the latency and accuracy of each model, we also address the power consumption of the NN equalizers implemented on the Nvidia Jetson Nano and the Raspberry Pi 4. The power consumption of both devices can be measured in different ways. The Nvidia Jetson Nano has three onboard sensors located at the power input, at the GPU, and at the CPU. Thus, the precision of the measurements is limited by these sensors. The sensor readings can be obtained automatically using the tegrastats tool, or manually by reading the corresponding sysfs files (sysfs being a pseudo-file system on Linux). With both approaches, the measurements of power, voltage, and current can be readily collected 62 . In contrast, Raspberry Pi 4 has no system that readily provides power consumption figures. Some software-based methods have been developed, as well as some empirical estimations 63 . However, it has been demonstrated that most of these software methods give only an approximation that may not be usable if very precise results are required 63 . On the other hand, the empirical strategy for measuring the power consumption on Raspberry Pi is specific to this type of hardware and cannot be used on Nvidia Jetson Nano.
To compare the power consumption of the equalizer on these two types of hardware, it is more accurate and desirable to use the same method in both of them, to avoid any instrumental bias. In this paper, we developed a platform-agnostic method through the use of a digital USB multimeter. The proposed power consumption measurement system addresses the problem of these devices having no onboard shunt resistors; such an approach allows us to easily measure power with an external energy probe. A schematic of the measurement set-ups is given in Fig. 8.
In the case of Raspberry Pi, the power is supplied through a USB type C port via a 5.1 V-2.5 A power adapter. For Nvidia Jetson Nano, the power can be supplied through a Micro-USB connector using a 5.1 V-2.5 A power adapter or through a Barrel jack using a 5 V-4 A (20 W) power supply. It is possible to change from one configuration to the other by setting a jumper and moving from the 5 W Mode to the 10 W one. To use the same source of power as in Raspberry Pi, the Micro-USB configuration is used. As energy is supplied through a USB connection, it is possible to measure the power using a USB digital multimeter. The model used in this work is the A3-B/A3 manufactured by Innovateking-EU. It records voltage, current, impedance, and power consumption. The input voltage and current ranges are 4.5 V-24 V and 0 A-3 A, respectively. Moreover, we can measure the energy in a range that goes from 0 to 99,999 mWh. The voltage and current measurement resolutions are 0.01 V and 0.001 A, with measurement accuracies of ± 0.2% and ± 0.8%, respectively.
The USB digital multimeter A3-B/A3 comes with the software named UM24C PC Software V1.3, which allows sending the measured data to a computer in real time, as shown in Fig. 8a,b . During the measurement process, no peripherals are connected either to Raspberry Pi or Nvidia Jetson Nano, except for the Ethernet port, which is used for communication over SSH (Fig. 8). Moreover, 25 measurements were taken for each device. In each of them, 100 inferences were run, and the power consumption was averaged over them, not taking into account the power consumed during the initialization phase.
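The averaging step can be sketched as follows, assuming that the multimeter log has already been parsed into (voltage, current) pairs sampled at a fixed interval. The log values, the sampling interval, and the function names are hypothetical; the export format of the multimeter software is not reproduced here.

```python
def mean_power_w(samples, skip=0):
    """Average power in watts from (voltage_V, current_A) samples.

    The first `skip` samples are dropped, e.g. to exclude initialization.
    """
    kept = samples[skip:]
    if not kept:
        raise ValueError("no samples left after skipping")
    return sum(v * i for v, i in kept) / len(kept)


def energy_mwh(samples, interval_s, skip=0):
    """Energy in mWh, assuming a fixed sampling interval in seconds."""
    kept = samples[skip:]
    joules = sum(v * i for v, i in kept) * interval_s
    return joules / 3.6  # 1 mWh = 3.6 J


# Hypothetical log: the first two samples belong to the initialization phase.
log = [(5.10, 0.40), (5.10, 0.42), (5.08, 0.90), (5.07, 0.92), (5.08, 0.91)]
print(f"mean power: {mean_power_w(log, skip=2):.3f} W")
print(f"energy:     {energy_mwh(log, interval_s=1.0, skip=2):.4f} mWh")
```

Since the power is the product of voltage and current, the stated accuracies of ± 0.2% and ± 0.8% combine, to first order, into a worst-case relative uncertainty of about ± 1% on each power sample.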

Inference time measurement.
To evaluate the inference time of each model, no peripherals are connected either to the Raspberry Pi or to the Nvidia Jetson Nano, except for the Ethernet port, which is used to establish communication over the Secure Shell protocol. Moreover, any initialization time (e.g., library loading, data generation, and model weight loading) is ignored because this is a one-time cost that occurs during the device's setup. Furthermore, 25 measurements were taken for each device. In each of them, 100 inferences were run (in each inference, 30k symbols are recovered) and the inference time was averaged, not taking into account the initialization phase.
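A timing loop in the spirit of this procedure might look like the sketch below. The `dummy_equalizer` function is a hypothetical stand-in for the deployed model, and the single untimed warm-up call is an assumption used here to keep one-time costs out of the measurement.

```python
import statistics
import time


def time_inference(infer, batch, runs=25, inferences_per_run=100):
    """Return the mean and standard deviation of the per-inference latency (s).

    Each run times `inferences_per_run` consecutive calls and averages them.
    """
    infer(batch)  # assumed warm-up call, excluded from the timing
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(inferences_per_run):
            infer(batch)
        elapsed = time.perf_counter() - start
        per_run.append(elapsed / inferences_per_run)
    return statistics.mean(per_run), statistics.stdev(per_run)


def dummy_equalizer(symbols):
    """Hypothetical stand-in for the NN equalizer: scales each symbol."""
    return [0.5 * s for s in symbols]


# Smaller run counts than in the text, to keep the example quick.
mean_s, std_s = time_inference(
    dummy_equalizer, [1.0] * 30_000, runs=5, inferences_per_run=10
)
print(f"latency per inference: {mean_s * 1e3:.3f} ms (std {std_s * 1e3:.3f} ms)")
```

Using `time.perf_counter` gives a monotonic, high-resolution clock, which is preferable to wall-clock time for short intervals.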

Data availability
Data underlying the results presented in this paper are not publicly available at this time, but can be obtained from the authors upon request.