## Introduction

The past half-decade has seen unprecedented growth in machine learning with deep neural networks (DNNs). Use of DNNs represents the state-of-the-art in many applications, including large-scale computer vision, natural language processing, and data mining tasks1,2,3. DNNs have also impacted practical technologies such as web search, autonomous vehicles, and financial analysis1. However, DNNs have substantial computational and memory requirements, which greatly limit their training and deployment in resource-constrained (e.g., computation, I/O, and memory bounded) environments. To address these challenges, there has been a significant trend in building high-performance DNNs hardware platforms. While there has been significant progress in advancing customized silicon DNN hardware (ASICs and FPGAs)2,4 to improve computational throughput, scalability, and efficiency, their performance (speed and energy efficiency) are fundamentally limited by the underlying electronic components. Even with the recent progress of integrated analog signal processors in accelerating DNNs systems which focus on accelerating matrix multiplication, such as Vector Matrix Multiplying module (VMM)5, mixed-mode Multiplying-Accumulating unit (MAC)6,7,8, resistive random access memory (RRAM) based MAC9,10,11,12,13, etc., the parallelization are still highly limited. Moreover, they are plagued by the same limitations of electronic components, with additional challenges in the manufacturing and implementation due to issues with device variability10,12.

Recently, there are increasing efforts on optical neural networks and optical computing based DNNs hardware, which bring significant advantages for machine learning systems in terms of their power efficiency, parallelism and computational speed14,15,16,17,18,19,20,21,22,23. Among them, free-space diffractive deep neural networks (D2NNs) , which is based on the light diffraction, feature millions of neurons in each layer interconnected with neurons in neighboring layers. This ultrahigh density and parallelism make this system possess fast and high throughput computing capability. Note that the diffractive propagations controlled by such physical parameters are differentiable, which means that such parameters can be optimized via conventional backpropagation algorithms16,18,19 using autograd mechanism24.

In terms of hardware performance/complexity, one of the significant advantages of D2NNs is that such a platform can be scaled up to millions of artificial neurons. In contrast, the design and DNNs deployment complexity on other optical architectures, e.g., integrated nantophotnics14,25 and silicon photnics23), can dramatically increase. For example, Lin et al.16 experimentally demonstrated various complex functions with an all-optical D2NNs. In conventional DNNs, forward prorogation are computed by generating the feature representation with floating-point weights associated with each neural layer. In D2NNs, such floating-point weights are encoded in the phase of each neuron of diffractive phase masks, which is acquired by and multiplied onto the light wavefunction as it propagates through the neuron. Similar to conventional DNNs, the final output class is predicted based on generating labels according to a given one-hot representation, e.g., the max operation over the output signals of the last diffractive layer observed by detectors. Recently, D2NNs have been further optimized with advanced training algorithms, architectures, and energy efficiency aware training18,19,26, e.g, class-specific differential detector mechanism improves the testing accuracy by 1–3%19,26 improves the robustness of D2NNs inference with data augmentation in training.

## Results and discussion

Figure 1 shows the proposed real-time multi-task diffractive deep neural network (D2NN) architecture. Specifically, in this work, our multi-task D2NN deploy image classification DNN algorithms with two tasks, i.e., classifying MNIST10 dataset and classifying Fashion-MNIST10 dataset. In a single-task D2NN architecture for classification16, the number of opto-electronic detectors positioned at the output of the system has to be equal to the number of classes in the target dataset. The predicted classes are generated similarly as conventional DNNs by selecting the index of the highest probability of the outputs (argmax), i.e., the highest energy value observed by detectors. Moreover, due to the lack of flexibility and reconfigurability of the D2NN layers, deploying DNNs algorithms for N tasks requires physically designing N D2NN systems, which means N times of the D2NN layer fabrications and the use of detectors. Our main goal is to improve the cost efficiency of hardware systems while deploying multiple related ML tasks. Conceptually, the methodologies behind multi-task D2NN architecture and conventional multi-task DNNs are the same, i.e., maximizing the shared knowledge or feature representations in the network between the related tasks27.

Let the D2NN multi-task learning problem over an input space $${\mathcal {X}}$$, a collection of task spaces $${\mathcal {Y}}^n_{n\in [0,N]}$$, and a large dataset including data points $$\{x_i,y_i^1,\ldots ,y_i^N\}_I\in [D]$$, where N is the number of tasks and D is the size of the dataset for each task. The hypothesis for D2NN multi-task learning remains the same as conventional DNNs, which generally yields the following empirical minimization formulation:

\begin{aligned} \min \limits _{\theta ^{share}, {\theta ^{1},\theta ^{2},\ldots ,\theta ^{N}}} \quad \sum ^{N}_{n=1} c^t {\mathcal {L}}(\theta ^{share}, \theta ^{t}) \end{aligned}
(1)

where $${\mathcal {L}}$$ is a loss function that evaluates the overall performance of all tasks. The finalized multi-task D2NN will deploy the mapping, $$f(x,\theta ^{share},\theta ^{n}) : {\mathcal {X}} \rightarrow {\mathcal {Y}}^n$$, where $$\theta ^{share}$$ are shared parameters in the shared diffractive layers between tasks and task-specific parameters $$\theta ^{n}$$ included in multi-task diffractive layers. Specifically, in this work, we design and demonstrate the multi-task D2NN with a two-task D2NN architecture shown in Figure 1. Note that the system includes four shared diffractive layers ($$\theta ^{share}$$) and one multi-task diffractive layer for each of the two tasks. The multi-task mapping function becomes $$f(x,\theta ^{share},\theta ^{1,2}) : {\mathcal {X}} \rightarrow {\mathcal {Y}}^2$$, and can be then decomposed into:

\begin{aligned}&f(x,\theta ^{share},\theta ^{1,2}) = det\left( f^{1}\left( \frac{1}{2} \cdot f^{share}(x,\theta ^{share}),\theta ^{1}\right) + f^{2} \left( \frac{1}{2} \cdot f^{share}(x,\theta ^{share}),\theta ^{2}\right) \right) \end{aligned}
(2)
\begin{aligned}&f^{share}: {\mathcal {X}} \rightarrow (\mathfrak {R}+ \mathfrak {I})^{\in 200\times 200}, \quad f^{1}, f^2: (\mathfrak {R}+ \mathfrak {I})^{\in 200\times 200} \rightarrow (\mathfrak {R}+ \mathfrak {I})^{\in 200\times 200} \end{aligned}
(3)

where $$f^{share}, f^{1},$$ and $$f^2$$ produce mappings in complex number domain that represent light propagation in phase modulated photonics. Specifically, the forward functionality of each diffractive layer and its dimensionality $${\mathbb {R}}^{200 \times 200}$$ remains the same as16. The output $$det \in {\mathbb {R}}^{C \times 1}$$ are the readings from C detectors, where C is the largest number of classes among all tasks; for example, $$C=10$$ for MNIST and Fashion-MNIST. The proposed multi-task D2NN system is constructed by designing six phase modulators based on the optimized phase parameters in the four shared and two multi-task layers (Fig. 1), i.e., $$\theta ^{share},\theta ^{1,2}$$. The phase parameters are optimized with backpropogation with gradient chain-rule applied on each phase modulation and adaptive momentum stochastic gradient descent algorithm (Adam). The design of phase modulators can be done with 3D printing or lithography to form a passive optical network that performs inference as the input light diffracts from the input plane to the output. Alternatively, such diffractive layer models can also be implemented with spatial light modulators (SLMs), which offers the flexibility of reconfiguring the layers with the cost of limiting the throughput and increase of power consumption.

\begin{aligned} {\text {Acc-HW Product}} = 1 \cdot \frac{Acc_{multi}}{Acc_{single}} \cdot \frac{HWCost_{multi}}{HWCost_{single}}; HWCost = \# Detectors. \end{aligned}
(4)

Figure 2 illustrates the proposed approach for producing the classes, which re-use the detectors for two different tasks. Specifically, for the multi-task D2NN evaluated in this work, both MNIST and Fashion-MNIST have ten classes. Thus, all the detectors used for one class can be fully re-utilized for the other. To enable an efficient training process, we use one-hot encodings for representing the classes similarly as the conventional multi-class classification ML models. The novel modeling introduced in this work that enables re-using the detectors is—defining “1” differently in the one-hot representations. As shown in Fig. 2a,b, for the first task MNIST, the one-hot encoding for classes 0–9 are presented, where each bounding box includes energy values observed at the detectors. In which case, “1” in the one-hot encoding is defined as the lowest energy area, such that the label can be generated as argmin(det)—the index of the lowest energy area. Similarly, Fig. 2c,d are the one-hot encodings for classes 0–9 of the second task Fashion-MNIST, where label is the index of the highest energy area, i.e., argmax(det). Therefore, ten detectors can be used to generate the final outputs for two different tasks that share the same number of classes, to gain extra 55% and 50% hardware efficiency of the proposed multi-task D2NN (see Table 1).

Figure 3 includes visualizations of light propagations through multi-task D2NN and the results on the detectors, where the input, internal results after each layer, and output are ordered from left to right. Figure 3a shows one example for classifying MNIST sample, where the output class is correctly predicted (class 7) by returning the index of the lowest energy detector. Figure 3a presents an example for classifying Fashion-MNIST sample, where the output class is correctly predicted (class 8) by returning the index of the highest energy detector.

While building conventional multi-task DNN, it is well known that the robustness of the multi-task DNNs degrades compared to single-task DNNs, for each individual task. Such concerns become more critical in the proposed multi-task D2NN system due to the potential system noise introduced by the fabrication variations, device variations, detector noise, etc. Thus, we comprehensively evaluate the noise impacts for our proposed multi-task D2NN, by considering a wide range of Gaussian noise in detectors and device variations in phase modulators. Details of noise modeling in the proposed systems are discussed in Section Methods (Eqs. 810). Figure 4 includes four sets of experimental results for evaluating the robustness of our system under system noise. Specifically, Fig. 4a evaluates the prediction performance of both tasks under detector noise, where the x-axis shows the $$\sigma$$ of a Gaussian noise vector S/N (Signal to Noise), and the y-axis shows the accuracy. Figure 4b evaluates the accuracy impacts from device variations of phase modulators, where the x-axis shows the phase variations of each optical neuron in the diffractive layer (note that phase value is $$\in [0,2\pi ]$$), and the y-axis shows the accuracy. In practice, detector noise is mostly within 5%, and device variations are mostly up to 0.2 (80% yield). We can see that the prediction performance of the proposed system is resilient to a realistic noise range while considering only one type of noise. Moreover, in Fig. 4c,d, we evaluate the noise impacts for MNIST and Fashion-MNIST, respectively, under both detector noise and device variations. While the accuracy degradations are much more noticeable when both noises become significantly, we observe that the overall performance degradations remain ≤ 1% within the practical noise ranges. In summary, the proposed architecture is practically noise resilient.

In multi-task learning, it is often needed to adjust the weight or importance of different prediction tasks according to the application scenarios. For example, one task could be required to have the highest possible prediction performance while the performance of other tasks are secondary. To enable such biased multi-task learning, the shared representations $$\theta ^{share}$$ need to carefully adjusted. Figure 5 demonstrates the ability to enable such biased multi-task learning using loss regularization techniques. Specifically, we propose to adjust the performance of different tasks using a novel domain-specific regularization function shown in Eq. (5), where $$\lambda _1$$ and $$\lambda _2$$ are used to adjust the task importance, with a modified L2 normalization applied on multi-task layers only. The results with 100 trials of training (with different random seeds for initialization and slightly adjusted learning rate) are included in Fig. 5a. We can see that loss regularization is sufficient to enable biased multi-task learning in the proposed multi-task D2NN architecture, regardless of the initialization and training setups. Moreover, Fig. 5b empirically demonstrates that with even with very large or small regularization factors, the proposed loss regularization will unlikely overfit either of the tasks because of the adjusted L2 norm used in the loss function (Eq. 6). Note that the adjusted L2 normalization only affects the gradients for $$\theta ^{1}$$ and $$\theta ^{2}$$, where $$\lambda _{L2}$$ is the weight of this L2 normalization.

\begin{aligned} {\mathcal {L}}(\theta ^{share}, \theta ^{1,2}) = \underbrace{\lambda _1}_{\mathrm{t}1\ \mathrm{factor}} {\mathcal {L}}_1(\theta ^{share}, \theta ^{1}) + \underbrace{\lambda _2}_{\mathrm{t}2\ \mathrm{factor}} \cdot {\mathcal {L}}_2(\theta ^{share}, \theta ^{2}) + \underbrace{\lambda _{L2} \frac{\lambda _2}{\lambda _1} \cdot ( ({\theta ^{1}})^2 + ({\theta ^{2}})^2)}_{\mathrm{adjusted}\ \mathrm{L}2\ \mathrm{norm}} \end{aligned}
(5)

## Methods

Figure 1 shows the design of the multi-task D2NN architecture. Based on the phase parameters $$\theta ^{share}, \theta ^1$$, and $$\theta ^2$$, there several options to implement the diffractive layers to build the multi-task D2D2NN system. For example, the passive diffractive layers can be manufactured using 3D printing for long-wavelength light (e.g. terahertz) or lithography for short-wavelength light (e.g. near-infrared), and active reconfigurable ones can be implemented using spatial light modulators. A 50–50 beam splitter is used to split the output beam from the last shared diffractive layer into two ideally identical channels for multi-task layers. Coherent light source, such as laser diodes, is use in this system. At the output of two multi-task layers, the electromagnetic vector fields are added together on the detector plane. The generated photocurrent corresponding to the optical intensity of summed vector fields is measured and observed as output labels. Regarding the real-time capability of the proposed system, the proposed architecture performs the same the system proposed in16, where computation is executed at the speed of light and the information is processed on each neuron/pixel of the phase mask is highly parallel. Thus, the time of light flight is negligible and the determination factor for system hardware performance is dependent on the performance of THz detectors. For a detector with operation bandwidth f, the corresponding latency is 1/f and the largest throughput is f frames/s/task. The minimum power requirement for this system is determined by the number of detector, NEP (noise-equivalent-power), and , if we assume the loss and energy consumption associated with phase masks is negligible. In practice, considering a room-temperature VDI detector (https://www.vadiodes.com/en/products/detectors?id=214) operating at $$\sim 0.3\ THz$$ , $$f=\sim 40\ GHz$$, and $$NEP=2.1 pW/\root 2 \of {Hz}$$, the latency of the system will be 25 ps, throughput is $$4\times 10^{10}\ fps/task$$ (frame/second/task), with power consumption 0.42 uW. In addition to mitigate the large cost of detectors, alternative materials can be used, such as graphene. For example, the specific detector performance shown in28 is $$NEP=\sim 80 pW/\root 2 \of {Hz}$$, and $$f=\sim 300\ MHz$$. In which case, the system atency is $$\sim 30 ns$$, such that the throughput is $$3 \times 10^{8}\ fps/task$$ with the estimated minimum power 1.4 uW.

### Training and inference of multi-task D2 NN

The proposed system has been implemented and evaluated using Python (v3.7.6) and Pytorch (v1.6.0). The basic components in the multi-task D2NN PyTorch implementation includes (1) diffractive layer initialization and forward function, (2) beam splitter forward function, (3) detector reading, and (4) final predicted class calculation. First, each layer is composed of one diffractive layer that performs the same phase modulation as16. To enable high-performance training and inference on GPU core, we utilize for complex-to-complex Discrete Fourier Transform in PyTorch (torch.fft) and its inversion (torch.ifft) to mathematically model the same modulation process as16. Beam splitter that evenly splits the light into transmitted light and reflected light is modeled as dividing the complex tensor produced by the shared layers in half. The trainable parameters are the phase parameters in the diffractive layers that modulate the incoming light. While all the forward function components are differentiable, the phase parameters can be simply optimized using automatic differentiation gradient mechanism (autograd). The detector has ten regions and each detector returns the sum of all the pixels observed (Fig. 2). To enable training with two different one-hot representations that allow the system to reuse ten detectors for twenty classes, the loss function is constructed as follows:

\begin{aligned} {\mathcal {L}}&=\lambda _1 \cdot \underbrace{\texttt {MSELoss}(LogSoftmax(f(\theta ^{share}, \theta ^1, {\mathcal {X}}^1), (label^1 + 1)\%2)}_{\mathrm{one-}\mathrm{hot}\ \mathrm{encoding} \ \mathrm{with}\ \mathrm{one} {0''}\ \mathrm{and}\ \mathrm{nine}\ {1s''}} \nonumber \\&\quad + \lambda _2 \cdot \underbrace{\texttt {MSELoss} (LogSoftmax(f(\theta ^{share}, \theta ^2, {\mathcal {X}}^2), label^2)}_{\mathrm{one-}\mathrm{hot}\ \mathrm{encoding}\ \mathrm{with}\ \mathrm{one} \ {1''}\ \mathrm{and}\ \mathrm{nine}\ {0s''}} \nonumber \\&\quad + \underbrace{\lambda _{L2} \cdot \frac{\lambda _2}{\lambda _2} \texttt {L2}(\theta ^1,\theta ^2)}_{\mathrm{L}2\ \mathrm{norm}\ \mathrm{on}\ \mathrm{multi}-\mathrm{task} \ \mathrm{diffractive}\ \mathrm{layers}} \end{aligned}
(6)

The original labels $$label^{1}$$ and $$label^2$$ are represented in conventional one-hot encoding, i.e., one “1” with nine of “0s”, and $$label^{1}$$ has been converted into an one-hot encoding with one “0” and nine “1s”. Note that LogSoftmax function is only used for training the network, and the final predicted classes of the system are produced based on the values obtained at the detectors. With loss function shown in Eq. (6) and the modified one-hot labeling for task 1, the training process optimizes the model to (1) given an input image in class c for task 1 (MNIST), minimize the value observed at (c+1)th detector, as well as maximize the values observed at other detectors; (2) given an input image in class c for task 1 (Fashion-MNIST), maximize the value observed at (c+1)th detector, as well as minimize the values observed at other detectors. Thus, the resulting multi-task model is able to automatically differentiate which task the input image belongs to based on the sum of values observed in the ten detectors, and then generate the predicted class using argmin (argmax) function for MNIST (Fashion-MNIST) task. The gradient updates have been summerized in Eq. (7).

\begin{aligned} \theta ^{share}&= \theta ^{share} - \frac{1}{2} \cdot \eta \frac{\lambda _2}{\lambda _1}(\nabla \theta ^1 + \nabla \theta ^2)\nonumber \\ \theta ^{1'}&= \theta ^{1} - \eta \nabla \theta ^1 - 2\eta \lambda _{L2}|| \theta ^{1} + \theta ^{2}||\nonumber \\ \theta ^{2'}&= \theta ^{2} - \eta \nabla \theta ^2 - 2\eta \lambda _{L2}|| \theta ^{1} + \theta ^{2}|| \end{aligned}
(7)

### System noise modeling

We demonstrate that the proposed system is robust under the noise impacts from the device variations of diffractive layers and the detector noise in our system. Specifically, to include the noise attached to the detector, we generate a Gaussian noise mask $${\mathcal {N}}(\sigma , \mu ) \in {\mathbb {R}}^{200\times 200}$$ with on the top of the detector readings, i.e., each pixel observed at the detector will include a random Gaussian noise. As shown in Fig. 4a, we evaluate our system under multiple Gaussian noises defined with different $$\sigma$$ with $$\mu =0$$. We also evaluated the impacts of $$\mu$$, while we do not observe any noticeable effects on the accuracy for both tasks. This is because increasing $$\mu$$ of a Gaussian noise tensor does not change the ranking of the values observed by the ten detectors, such that it has no effect on the finalized classes generated with argmax or argmin. The forward function for ith task with detector noise is shown in Eq. (8).

\begin{aligned} c^i = \texttt {argmax/argmin}(det(f(\theta ^{share}, \theta ^i, {\mathcal {X}}^i)) + {\mathcal {N}}(\sigma , 0)), \quad i=\{1,2\} \end{aligned}
(8)

We also considered the imperfection of the devices used in the system. With 3D printing or lithography based techniques, the imperfection devices might not implement exactly the phase parameters optimized by the training process. Specifically, we consider the imperfection of the devices that affect the phases randomly under a Gaussian noise. As shown in Fig. 4b, the x-axis shows that the $$\sigma$$ of Gaussian noise that are added to the phase parameters for inference testing. The forward function is described in Eq. (9). Beam splitter noise has also been quantified, where we do not see direct impacts on both tasks (see Fig. 2 in supplementary file SI.pdf).

\begin{aligned}&\theta ^{share}_{{\mathcal {N}}} = (\theta ^{share} +{\mathcal {N}}(\sigma , 0)) \quad \% \quad 2\pi \nonumber \\&\theta ^{i}_{{\mathcal {N}}} = (\theta ^{i} + {\mathcal {N}}(\sigma , 0)) \quad \% \quad 2\pi , \quad i=\{1,2\}\nonumber \\&c^i = \texttt {argmax/argmin}(det(f( \theta ^{share}_{{\mathcal {N}}}, \theta ^i_{{\mathcal {N}}}, {\mathcal {X}}^i_{{\mathcal {N}}}))), \quad i=\{1,2\} \end{aligned}
(9)

Finally, for results shown in Fig. 4c,d, we include both detector noise and device variations in our forward function (Eq. 10):

\begin{aligned}&\theta ^{share}_{{\mathcal {N}}} = (\theta ^{share} +{\mathcal {N}}^1(\sigma ^1, 0)) \quad \% \quad 2\pi \nonumber \\&\theta ^{i}_{{\mathcal {N}}} = (\theta ^{i} + {\mathcal {N}}^1 (\sigma ^1, 0)) \quad \% \quad 2\pi , \quad i=\{1,2\}\nonumber \\&c^i = \texttt {argmax/argmin}(det(f( \theta ^{share}_{{\mathcal {N}}}, \theta ^i_{{\mathcal {N}}}, {\mathcal {X}}^i_{{\mathcal {N}}})) + {\mathcal {N}}^2(\sigma ^2, 0))), \quad i=\{1,2\} \end{aligned}
(10)