Deep-learning-based high-resolution recognition of fractional-spatial-mode-encoded data for free-space optical communications

Structured light with spatial degrees of freedom (DoF) is considered a potential solution to the unprecedented demand for data traffic, but the integer quantization of its mode space limits how far the communication capacity can be improved. We propose a data transmission system using fractional mode encoding and deep-learning decoding. Spatial modes of Bessel-Gaussian beams separated by fractional intervals are employed to represent 8-bit symbols. Data encoded by switching phase holograms are efficiently decoded by a deep-learning classifier that requires only the intensity profile of the transmitted modes. Our results show that the trained model can simultaneously recognize two independent DoF without any mode sorter and precisely detect small differences between fractional modes. Moreover, the proposed scheme successfully achieves image transmission despite its densely packed mode space. This work presents a new approach to realizing higher data rates for advanced optical communication systems.

Fractional modes of BG beams. Phase holograms for generating BG beams are given by 42

Φ_{k_r, m}(r, ϕ) = exp[imϕ + i k_r r], (1)

where m is the azimuthal index that determines the order of the BG beam and the topological charge (TC) associated with the OAM of light, and k_r is the radial wavenumber that determines the spacing of the intensity rings and the non-diffracting distance given by 42

z_max = wk/k_r, (2)

where w is the radius of the input beam and k = 2π/λ is the propagation constant, with λ the wavelength. Various BG modes can be generated by combining the mode indices m and k_r in Eq. (1). Unlike the radial modes, whose mode index is a continuous parameter, a fractional OAM mode is not a single eigenmode but a coherent superposition of all possible integer modes, because the parameter m is discrete 22,24. A spiral phase with a fractional index l is expressed as 23

exp[il(ϕ + φ)] = Σ_m c_m exp(imϕ), (3)

where the complex coefficient c_m of each integer mode is calculated from the orthonormality condition, and φ is a phase-shift parameter that determines the orientation of the phase discontinuity and results in the rotation of the intensity profile; see Fig. 4b. Theoretically, the decomposition of a fractional OAM mode includes all integer modes. However, the optical realization of the modes is restricted by the size of a physical aperture 24 and, above all, most of the energy is distributed among the 8 adjacent modes, e.g., m = −2 to 5 in our case. In fact, the central ring structure of the fractional beams, which provides crucial local features for each mode, consists of these modes. The field distribution produced by the phase hologram of Eq. (1) can be described by the Fresnel diffraction integral 42

u_{k_r, m}(ρ, θ, z) = (k/i2πz) ∫_0^∞ ∫_0^{2π} u_g(r) Φ_{k_r, m}(r, ϕ) exp{(ik/2z)[ρ² + r² − 2ρr cos(ϕ − θ)]} r dϕ dr, (4)

where z is the propagation distance and u_g = exp(−r²/w²) is the field distribution of the incident Gaussian beam. Here, we omitted an overall phase factor for the sake of simplicity and used the integral form of the high-order Bessel function 43.
Thus the resulting fractional BG beam with a radial index k_r and an azimuthal index l is given by

U_{k_r, l}(ρ, θ, z) = Σ_m c_m u_{k_r, m}(ρ, θ, z). (5)

The Gouy phases of the different OAM modes give rise to the unstable evolution of light beams emerging from a fractional spiral phase plate 24. However, after a certain propagation distance, 0.6 z_max in our case, the transverse intensity profile of the mode no longer changes, as shown in Fig. 2. Besides, the fractional BG beam is diffraction-free over this finite propagation distance, which indicates that the propagation properties of ordinary BG beams carry over to Bessel beams carrying fractional OAM. Unlike integer OAM beams with their doughnut shape, the distinctive structures of the fractional OAM modes provide a variety of local features and allow a CNN to easily discriminate differences between adjacent modes.
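As a check on the decomposition above, the integer-mode weights c_m can be computed numerically. The sketch below is our own illustration (the mode range and sample count are arbitrary choices, not from the paper); it projects a fractional spiral phase with l = 1.5 onto integer OAM modes and confirms that most of the energy falls into the 8 adjacent modes:

```python
import numpy as np

def fractional_oam_coefficients(l, m_values, n_phi=20000):
    """Project the fractional spiral phase exp(i*l*phi) onto integer OAM modes
    exp(i*m*phi): c_m = (1/2pi) * integral_0^2pi exp(i*l*phi) exp(-i*m*phi) dphi."""
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    field = np.exp(1j * l * phi)
    # uniform sampling: the mean over the grid approximates (1/2pi) * integral
    return np.array([np.mean(field * np.exp(-1j * m * phi)) for m in m_values])

m = np.arange(-10, 14)                               # integer modes around l = 1.5
power = np.abs(fractional_oam_coefficients(1.5, m)) ** 2
adjacent_power = power[(m >= -2) & (m <= 5)].sum()   # energy in the 8 adjacent modes
```

For l = 1.5 the analytic weights are |c_m|² = sinc²(l − m), so about 95% of the power indeed sits in the 8 modes closest to l.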
To demonstrate the feasibility of densely encoded data transmission via a deep-learning method, we selected a total of 256 modes given by combinations of radial modes from k_r = 7.10 to k_r = 7.80 with spacing Δk_r = 0.10 (8 modes) and OAM modes from l = 1.03 to l = 1.96 with spacing Δl = 0.03 (32 modes). Meanwhile, a blazed phase grating and a surface-correction hologram, constructed from a combination of Zernike polynomials, are additionally included in Eq. (1). The former separates out the desired first diffraction order, and the latter compensates for the phase deformation caused by the spatial inhomogeneity of the SLM.
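For illustration, a phase hologram of the form used above can be sampled on a pixel grid as follows; the grid size, normalized coordinates, and grating period are our own assumptions, not the experimental values, and a real SLM hologram would also include the surface-correction term:

```python
import numpy as np

def bg_phase_hologram(n=400, l=1.5, k_r=7.5, phase_shift=0.0, grating_period=None):
    """Sample the phase l*(phi + shift) + k_r*r on an n x n grid, wrapped to [0, 2pi).

    Coordinates are normalized to [-1, 1]; physical units and the Zernike
    surface correction of the experiment are omitted in this sketch.
    """
    x = np.linspace(-1.0, 1.0, n)
    X, Y = np.meshgrid(x, x)
    r = np.hypot(X, Y)
    # wrapping the azimuth here makes phase_shift set the discontinuity orientation
    phi = np.mod(np.arctan2(Y, X) + phase_shift, 2.0 * np.pi)
    phase = l * phi + k_r * r
    if grating_period is not None:
        phase += 2.0 * np.pi * X / grating_period  # blazed grating -> first order
    return np.mod(phase, 2.0 * np.pi)

holo = bg_phase_hologram(l=1.06, k_r=7.10, phase_shift=0.5 * np.pi)
```

Switching between the 256 symbol holograms then amounts to re-evaluating this function with the (k_r, l) pair of the next symbol.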
Designed network structure and dataset preparation. The designed CNN model comprises two parts, feature extraction and classification, as depicted in Fig. 3. The feature-extraction part is constructed from 5 blocks, each consisting of a convolutional layer and a max-pooling layer. In this experiment, the number of blocks was determined by weighing the number of trainable parameters against model performance.
The number of parameters in the first fully connected (FC) layer is proportional to the pixel size of the input feature maps. Therefore, a model ends up with an inefficiently large number of parameters if the feature maps are not scaled down properly. For this reason, 4- and 5-block models were considered, and the latter showed the lower validation loss despite similar training times. Multiple convolution kernels in each convolutional layer detect various local features of the input images 32. The extracted feature maps are then processed by a nonlinear activation function, which introduces non-linearity into the output and allows the network to learn complex data and provide accurate predictions 33. The size of the activated feature maps is then halved by a max-pooling layer, which allows the model to detect the weak variance coming from the fractional mode separation 41. The features extracted from the last block are passed to the classification part, comprising two FC layers. Finally, a softmax function in the last FC layer yields the probabilities that the transmitted mode belongs to each laser mode, and the mode with the maximum probability is taken as the received mode; see Eq. (7). Detailed information on the CNN is summarized in Table 1, including the hyperparameters, output shapes, and numbers of trainable parameters. Each component in the third column of the table gives the number of convolution kernels, the kernel size, the activation function, and the zero padding. For the FC layers, each component gives the number of hidden units (also referred to as neurons) and the activation function. Additionally, He initialization, which is well suited to networks using the rectified linear unit (ReLU) activation, was selected to initialize the weight parameters 44.
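The parameter-count argument can be made concrete. The sketch below is our own, with hypothetical filter counts and 512 hidden units (not the values from Table 1); it shows that adding a fifth conv/pool block shrinks the first FC layer enough to reduce the total parameter count:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a k x k convolution: weights plus one bias per output channel."""
    return (k * k * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """Parameters of a fully connected layer, including biases."""
    return (n_in + 1) * n_out

def model_params(n_blocks, filters, input_size=100, fc_units=512, n_classes=256):
    """Count trainable parameters of a stack of (3x3 conv + 2x2 max pool) blocks
    followed by two FC layers, as in the 4- vs 5-block comparison."""
    total, c_in, size = 0, 1, input_size
    for f in filters[:n_blocks]:
        total += conv_params(3, c_in, f)
        c_in = f
        size //= 2                    # max pooling halves each spatial dimension
    flat = size * size * c_in         # feature-map pixels feeding the first FC layer
    return total + fc_params(flat, fc_units) + fc_params(fc_units, n_classes)

four = model_params(4, [16, 32, 64, 128])
five = model_params(5, [16, 32, 64, 128, 256])
```

With these assumed filter counts the 5-block model has roughly two thirds the parameters of the 4-block one, because each extra pooling stage quarters the input to the first FC layer.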
The dropout layer placed in front of the first FC layer randomly sets 5% of its neurons to 0 in each training epoch to prevent overfitting 45. It is active only during training and does not affect inference. To train and test the CNN, 51,200 intensity profiles, 200 images per mode, are prepared. These 200 images have different phase shifts from 0.01π to 2.00π. Of the 200 images, the 100 images corresponding to phase shifts φ = 0.02π, 0.04π, …, 2.00π are used for training the model, and the others for testing. Additionally, 20 images per mode with different phase shifts from 0.05π to 1.95π are prepared as a validation set. The role of the validation set is to provide an unbiased evaluation of the model during the training process, while the test set is used to assess the prediction accuracy of the final model. All images are resized from 400 × 400 to 100 × 100 pixels for computational efficiency and preprocessed for supervised learning.
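The phase-shift bookkeeping of this split can be written out explicitly; this is a minimal sketch of the indexing only (image loading and preprocessing are omitted), with shifts expressed in units of π:

```python
import numpy as np

N_MODES = 256
# 200 phase shifts per mode: even-indexed shifts train the model, odd-indexed test it
train_shifts = np.round(np.arange(0.02, 2.001, 0.02), 2)  # 0.02pi, 0.04pi, ..., 2.00pi
test_shifts = np.round(np.arange(0.01, 2.00, 0.02), 2)    # 0.01pi, 0.03pi, ..., 1.99pi
val_shifts = np.round(np.arange(0.05, 1.951, 0.10), 2)    # 0.05pi, 0.15pi, ..., 1.95pi

n_train = N_MODES * len(train_shifts)   # 25,600 training images
n_test = N_MODES * len(test_shifts)     # 25,600 test images
```

Because the training and test shifts interleave, the test set probes orientations the model has never seen, while the validation shifts sit between both.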

Experimental results and discussion
Measurement of generated BG beams. Figure 4a displays captured intensity profiles of fractional BG beams generated with different mode indices. Here, a pseudo-color map, varying smoothly from black to red and then yellow, is used for visual clarity, but the actual images are 8-bit grayscale. Only a few of the 256 spatial modes are presented, to highlight the tiny variations caused by the fractional mode separation. As the radial wavenumber increases, the spacing between intensity rings is reduced: the intensity of the central ring decreases while the outer ring slightly brightens. Meanwhile, a spiral phase with a fractional TC results in an exotic intensity distribution. As the charge increases from l = 1, the high-intensity central ring forms two brighter light petals, and the second ring is transformed into a relatively dark and distorted four-petal structure. The angular position of these intensity peaks depends on the applied phase shift. The lower of the four petals gradually approaches the central ring. Although not included in Fig. 4a, this petal moves further inward and then merges with the central ring, until finally the intensity profile becomes a multiple-ring structure again at l = 2. Using its various convolution kernels, the deep-learning model detects variations in the local intensity distribution caused by the fractional mode separation that are invisible to the eye, and gives a prediction, with high accuracy, based on the decision boundary formed by the training data. Figure 4b depicts the influence of the applied phase shift on the intensity distribution of fractional BG beams. As described above, the intensity profiles of the generated BG beams are rotated counterclockwise by the amount of the phase shift, which means that phase modulation applied to fractional modes is explicitly recognizable.
In other words, unlike canonical integer vortices with rotational symmetry, the broken symmetry of the fractional modes can provide an additional DoF for information coding. This amounts to a new kind of phase-shift keying technique that exploits phase-shifted fractional modes as M-ary symbols.
Training of neural network and performance test. The CNN training and prediction are performed on a commercial GPU system (GPU: RTX 2060; CPU: i7-9750H) using the Keras framework. The parameters of the CNN are trained using the Adam optimizer with batch size 50 for 50 epochs. The Adam optimizer is a stochastic gradient descent method based on adaptive estimation of moments 46. It is computationally efficient, making it suitable for networks with many parameters. The weight parameters for spatial mode recognition are updated to minimize a loss (objective) function during the training process. The loss function used for this multiclass classification problem is the averaged categorical cross-entropy, given by 33,41

L = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{n} t_j^{(i)} log S_j^{(i)}, (6)

where N is the total number of samples, n is the total number of classes, i.e., the 256 spatial modes, and the index j labels each mode. For example, classes 1-32 correspond to modes with k_r = 7.10 and l = 1.03 to l = 1.96, and the next 32 classes correspond to modes with k_r = 7.20 and l = 1.03 to l = 1.96. Meanwhile, t_j^{(i)} is the jth element of a target label vector: a vector of size n with one at the element corresponding to the specific spatial mode and zeros elsewhere. For example, in our experiment the label vector of the spatial mode with k_r = 7.10 and l = 1.03 is {1, 0, …, 0}, and that of the spatial mode with k_r = 7.10 and l = 1.06 is {0, 1, 0, …, 0}. S_j^{(i)} is the output of a softmax activation function, representing the probability that the ith sample belongs to the jth class, and it is given by 33

S_j = exp(y_j) / Σ_{m=1}^{n} exp(y_m), (7)

where y_m is the input of the mth unit and n is the total number of classes as described above. The initial learning rate is 0.001 and is halved whenever the validation loss does not improve for 5 epochs. After the CNN training, the prediction time was measured to be t_p < 4 ms, which is shorter than the switching time of the SLM.
In other words, our CNN model can decode all transmitted data without signal delay, which indicates that real-time prediction of laser spatial modes is possible. The calculation speed is determined by the floating-point operations per second (FLOPS) of the computing system 38. Here, we employed the RTX 2060 of a laptop system; the FLOPS of an RTX 3080, one of the more recent high-end GPUs, is about four times higher, which implies that using a more advanced GPU could shorten the prediction time to sub-millisecond. In other words, mode detection using the proposed model can keep up with devices with fast switching speeds, such as a digital micromirror device (DMD). The DMD is a pixelated device of micromechanical mirrors that are individually switched between "on" and "off" states. The device is used as a binary amplitude mask for beam steering and shaping, and refresh rates of up to a few kilohertz or faster are typical 47. Figure 5 shows the loss and accuracy curves for the training and validation sets as a function of the training epoch. Here the accuracy is defined as the number of correctly classified samples divided by the total number of samples. As shown in Fig. 5a, the model learns the training data well and performs well on new, unseen data without overfitting. Note that the validation set never participates in the training process and is used only to evaluate the model at every epoch. A minimum validation loss of 6.46e−4 was achieved at the 47th epoch, with a validation accuracy of 99.9%. The model and weight parameters that achieved the minimum validation loss were saved and used for inference.
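The softmax and averaged categorical cross-entropy used for training can be written in a few lines of NumPy; this is our own minimal sketch with toy logits, not the authors' training code:

```python
import numpy as np

def softmax(y):
    """S_j = exp(y_j) / sum_m exp(y_m), computed stably by subtracting the max."""
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(targets, probs):
    """Averaged loss: -(1/N) * sum_i sum_j t_j(i) * log S_j(i)."""
    return float(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))

# Toy example with n = 4 classes and N = 2 samples
logits = np.array([[4.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 0.0]])
targets = np.eye(4)[[0, 2]]          # one-hot label vectors for classes 0 and 2
probs = softmax(logits)
loss = categorical_cross_entropy(targets, probs)
```

When the logit of the true class dominates, the per-sample loss approaches −log(1) = 0, which is what the optimizer drives toward during training.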
In addition to the recognition accuracy of the model, computational speed is a critical factor for communication purposes and real-time prediction. For this reason, we investigated the influence of the input pixel size on the training time and model performance. The image set is resized from 400 × 400 pixels to 5 different pixel counts, from 40 × 40 to 120 × 120 pixels, using the open-source computer vision library (OpenCV) and fed to the network for training. As can be seen in Fig. 6a, the training time is proportional to the square of the pixel count, which is reasonable considering that an image is a 2-d matrix. Figure 6b shows the evaluation results obtained with different pixel counts. A minimum loss of 2.78e−3 and a maximum accuracy of 99.0% were achieved at 100 × 100 pixels. Because of a lack of information, the models trained with samples of resolution lower than 100 × 100 pixels show poor performance, i.e., underfitting has occurred. On the other hand, the dataset of 120 × 120 pixels gives rise to overfitting, meaning that the model has learned too many details, such as noise, and may not be able to predict new data in the future. Regularization techniques such as l2 regularization, dropout, and batch normalization can mitigate the overfitting 45,48,49, but using a dataset of 100 × 100 pixels is more efficient for model training and real-time prediction. With the trained model described above, an additional test was implemented to demonstrate the classification performance. Figure 7a shows a part of the confusion matrix obtained from the prepared test dataset. For simplicity, the numbers displayed on the axes represent the spatial modes given by a combination of the radial mode 7.30 and OAM modes from 1.03 to 1.96. The diagonal elements indicate that the transmitted mode is correctly classified.
Only 25 of the 25,600 test images are misclassified, all into adjacent modes, and the measured test accuracy is 99.9%. The experimental results demonstrate that the trained model can simultaneously identify two independent spatial modes without any optical mode sorters such as a fork hologram. Besides, the model accurately recognizes extremely small differences between fractional modes regardless of the applied phase shift, as can be seen in Fig. 7b. This means that the OAM phase-shift keying proposed in the subsection on measurement of generated BG beams could be combined with our scheme to encode more data without degradation of the recognition accuracy. This is a distinct advantage over the optical coordinate transformation method, whose output depends on the extrinsic OAM component of fractional OAM beams and therefore on the orientation angle of the phase discontinuity 30.
Image transmission by using spatial mode encoding. As a proof of concept, 8-bit grayscale images are transmitted, pixel by pixel, through free space. The 8-bit encoding is performed with 256 spatial modes separated by fractional intervals, which demonstrates the possibility of super-dense optical communication assisted by laser spatial modes and deep learning. As shown in Table 2, the 32 OAM modes with the same radial wavenumber are assigned to 32 of the 256 gray levels. For example, the laser spatial modes (k_r = 7.10, l = 1.03) and (k_r = 7.10, l = 1.06) are assigned to 0 and 1, respectively. In other words, the 256 different laser spatial modes act as 256-ary symbols. Image transmission is performed according to the following procedure. First, an image is transformed into a 1-d vector comprising a series of 256-ary numbers, and the data encoding is done by switching the phase holograms corresponding to each value. Start and stop frames specifying the beginning and end of each data stream are appended every 20 pixels, i.e., 20 symbols, to prevent timing errors between the transmitter and receiver. After free-space propagation, the transmitted laser beam is captured by a CCD camera placed at the receiver, and the trained CNN model decodes the data in real time from the received intensity image. Figure 8 shows one result of image transmission performed over the proposed optical link. As a test sample, an Eiffel Tower image of 100 × 150 pixels was sent. Using the proposed 8-bit encoding scheme, a total of 120,000 bits were processed into 15,000 encoding symbols. The measured error rate was 0.05%, with the errors occurring at only 8 pixels, caused by confusion with the nearest OAM or radial modes. A correlation coefficient between the transmitted and received images was higher than 0.99, which implies that the reconstructed image is almost identical to the transmitted one.
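The 8-bit symbol mapping can be sketched as follows; the mode tables mirror the ranges quoted earlier, and the row-major assignment (32 OAM modes per radial mode) follows the example pairs given in the text:

```python
import numpy as np

K_R = np.round(np.arange(7.10, 7.801, 0.10), 2)  # 8 radial modes: 7.10 ... 7.80
ELL = np.round(np.arange(1.03, 1.961, 0.03), 2)  # 32 OAM modes: 1.03 ... 1.96

def encode_pixel(gray):
    """Map an 8-bit gray value (0-255) to its 256-ary (k_r, l) mode pair."""
    return K_R[gray // 32], ELL[gray % 32]

def decode_symbol(k_r, ell):
    """Recover the gray value from a recognized (k_r, l) mode pair."""
    i = int(np.argmin(np.abs(K_R - k_r)))
    j = int(np.argmin(np.abs(ELL - ell)))
    return i * 32 + j
```

The round trip encode → decode recovers every gray value, mirroring how the CNN's class index is mapped back to a pixel value at the receiver.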
Despite the successful image reconstruction with a very high correlation coefficient, wrong predictions caused by adjacent radial modes are not negligible. However, one can reduce these errors through multiple measurements or proper post-processing algorithms 50. For example, a simple way is to conduct the image transmission multiple times and extract either the most frequently occurring value or the average value for each pixel. The image was transmitted four times consecutively to investigate not only the long-term stability but also the error reduction, and the error rate of the image reconstructed by taking the most frequent value for each pixel was zero.
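The most-frequent-value reduction described above can be sketched as follows (our own illustration of the post-processing, not the authors' code):

```python
import numpy as np

def majority_vote(frames):
    """Per-pixel most frequent value across repeated transmissions of one image."""
    stack = np.stack(frames)                       # shape: (n_repeats, H, W)
    out = np.empty(stack.shape[1:], dtype=stack.dtype)
    for idx in np.ndindex(*stack.shape[1:]):
        vals, counts = np.unique(stack[(slice(None),) + idx], return_counts=True)
        out[idx] = vals[np.argmax(counts)]         # most frequent value wins
    return out

# Toy example: two received copies each contain one symbol error
truth = np.array([[1, 2], [3, 4]])
rx1 = truth.copy(); rx1[0, 0] = 9
rx2 = truth.copy(); rx2[1, 1] = 7
rx3 = truth.copy()
recovered = majority_vote([rx1, rx2, rx3])
```

As long as a given pixel is wrong in fewer than half of the repetitions, the vote restores the transmitted value, which matches the zero error rate reported for four consecutive transmissions.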
Despite the impressive performance of the deep-learning-based optical link, some critical issues must be discussed for future practical applications. The first is the transmission speed associated with data encoding and decoding. The encoding speed is determined by the frame rate of the SLM, which limits the symbol rate of the communication link. In the experiment, the hologram switching rate was set to 2.5 Hz, transmitting 20 bits of information per second. Even at the maximum rate of the SLM (60 Hz), the highest achievable data rate is only 480 bits per second. However, the demand for high-speed OAM switching can be addressed by using tunable integrated OAM devices 51,52 or a DMD 53, which can increase the modulation rate to a few tens of kilohertz or faster. Meanwhile, a camera with a frame rate faster than the refresh rate of the modulator is needed to capture and decode the time-varying laser spatial modes, which could be addressed by using a CMOS camera. Assuming that the frame rate of a system made up of a DMD and a CMOS camera is 1 kHz, the transmission speed achievable with 256-ary spatial mode encoding is 8 Kbit/s (1 kHz × 8 bits). However, it is possible to increase the number of bits per symbol by employing more spatial modes and combining the OAM phase-shift keying. Note that we achieved 8-bit encoding in an OAM subspace combined with the radial mode and a deep-learning-based high-resolution recognition method. This indicates that the proposed link can transmit a large amount of data even if the size of the aperture restricts the range of available OAM modes. Besides, the capacity of the communication system can be increased further by combining other photonic DoF (wavelength and polarization) 10. In other words, the speed can be doubled by combining two polarization states, and increased further by adding phase-shift modulation of the fractional modes. The maximum data speed is then estimated to be a few hundred Kbit/s.
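The throughput figures quoted in this paragraph follow from simple bits-per-symbol arithmetic; `n_channels`, modeling independent multiplexed channels such as two polarizations, is our own generalization:

```python
import math

def data_rate_bps(frame_rate_hz, alphabet_size, n_channels=1):
    """Data rate = frame rate x independent channels x log2(symbol alphabet size)."""
    return frame_rate_hz * n_channels * math.log2(alphabet_size)

slm_rate = data_rate_bps(2.5, 256)    # experiment: 2.5 Hz switching -> 20 bit/s
slm_max = data_rate_bps(60, 256)      # SLM limit: 60 Hz -> 480 bit/s
dmd_rate = data_rate_bps(1000, 256)   # assumed 1 kHz DMD + CMOS system -> 8 Kbit/s
dual_pol = data_rate_bps(1000, 256, n_channels=2)  # polarization doubling -> 16 Kbit/s
```

Enlarging the symbol alphabet (more modes, or phase-shift levels per mode) grows the rate only logarithmically, whereas adding independent channels or raising the frame rate scales it linearly.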
Note that an optical link with a single wavelength was considered. This gives a relatively low speed compared to fiber-optic communication systems, but the proposed OAM encoding scheme can improve physical-layer security owing to its inherent characteristics, which do not depend on mathematical or quantum-mechanical encryption methods 5,54. Therefore, the proposed optical link would also be applicable to military communication systems requiring high security. For long-distance outdoor links, atmospheric turbulence distorts the structured phase and intensity and degrades the link performance. Adaptive-optics methods, based on a wavefront sensor 55 or deep learning 37,38, can be employed to compensate for the deteriorating effects of turbulence. In the case of strong atmospheric turbulence, an active method based on ultrashort high-intensity laser filaments can produce a cleared optical channel by opto-mechanically expelling droplets out of the beam area 56. Besides, long-range propagation can be achieved by means of nonlinear self-channeling of high-power laser pulses 57. Compared to ordinary spatial modes with integer mode indices, fractional modes are much more sensitive to external perturbations. We will investigate the transmission performance of fractional mode encoding under different atmospheric turbulence levels and an effective method to mitigate intermodal crosstalk.

Conclusion
In conclusion, we experimentally demonstrated that both radial and azimuthal modes of BG beams, with fractional mode spacing, can be used to transmit information over free space. 8-bit data are densely encoded in 256 spatial modes and successfully decoded by a deep-learning classifier. To achieve this, we first trained the designed neural network, based on the AlexNet architecture, and tested its performance. The recognition accuracy for the different fractional modes was nearly 100%, which demonstrates that the deep-learning decoder can simultaneously identify two independent spatial modes and accurately recognize extremely small differences between adjacent modes. Furthermore, the translational and rotational invariance of the trained CNN provided reliable, stable model performance despite the sensitivity of fractional OAM beams to optical alignment. We then transmitted a 100 × 150 grayscale image, a total of 120,000 bits encoded into 15,000 data symbols by switching phase holograms, and the transmitted data were successfully recovered over the proposed optical link. In addition to the fractional modulation, the explicit phase shift applied to the fractional OAM beams could provide an additional degree of freedom for information coding, achievable without degrading the performance of the proposed method. Challenges such as the modulation rate and atmospheric turbulence remain to be overcome, but the proposed fractional-mode-encoding/deep-learning-decoding scheme provides an effective way to meet the growing demand for data traffic.