Freely scalable and reconfigurable optical hardware for deep learning

As deep neural network (DNN) models grow ever larger, they can achieve higher accuracy and solve more complex problems. This trend has been enabled by an increase in available compute power; however, efforts to continue to scale electronic processors are impeded by the costs of communication, thermal management, power delivery and clocking. To improve scalability, we propose a digital optical neural network (DONN) with intralayer optical interconnects and reconfigurable input values. The path-length-independence of optical energy consumption enables information locality between a transmitter and a large number of arbitrarily arranged receivers, which allows greater flexibility in architecture design to circumvent scaling limitations. In a proof-of-concept experiment, we demonstrate optical multicast in the classification of 500 MNIST images with a 3-layer, fully-connected network. We also analyze the energy consumption of the DONN and find that digital optical data transfer is beneficial over electronics when the spacing of computational units is on the order of >10 μm.


Supplementary Note 2: Crosstalk correction
The bit error rate described in the previous section is mainly attributable to optical crosstalk at the detector due to imperfect lenses and alignment. Since this error is deterministic (as opposed to random fluctuations), it can be compensated by postprocessing. To illustrate this principle, we performed simple crosstalk correction: we multiplied each line of an image detected on the camera by a tridiagonal crosstalk reduction matrix, per equation (1) (where $I_{:n}$ is the corrected line of the camera image).
The crosstalk parameter $\xi$ was estimated to be ∼0.19 and ∼0.18 for the red and blue arms, respectively, from a calibration image of alternating '1's and '0's transmitted by the DMDs. $I_{:n}$ is renormalized after this matrix multiplication. We show the effects of crosstalk reduction in Fig. S2.
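The correction step can be illustrated with a short numerical sketch. The exact entries of the matrix in equation (1) are not reproduced here, so the code assumes a common first-order form (1 on the diagonal, $-\xi$ on the two neighboring diagonals). The clipping of negative values, the peak renormalization, the toy leakage model and all function names are likewise assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def crosstalk_reduction_matrix(n, xi):
    """Tridiagonal n x n matrix: 1 on the diagonal, -xi on both off-diagonals.
    ASSUMPTION: this specific form is illustrative, not taken from equation (1)."""
    return np.eye(n) - xi * (np.eye(n, k=1) + np.eye(n, k=-1))

def correct_line(line, xi):
    """Apply the correction to one camera line, clip negatives, renormalize to peak 1.
    ASSUMPTION: clipping and peak renormalization are illustrative choices."""
    line = np.asarray(line, dtype=float)
    corrected = crosstalk_reduction_matrix(line.size, xi) @ line
    corrected = np.clip(corrected, 0.0, None)
    peak = corrected.max()
    return corrected / peak if peak > 0 else corrected

# Toy calibration line of alternating '1's and '0's.
bits = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
xi = 0.19  # value reported for the red arm

# Hypothetical leakage model: each pixel receives a fraction xi from each neighbor.
leak = np.eye(bits.size) + xi * (np.eye(bits.size, k=1) + np.eye(bits.size, k=-1))
detected = leak @ bits
corrected = correct_line(detected, xi)

print("detected :", np.round(detected, 3))
print("corrected:", np.round(corrected, 3))
```

In this toy model the correction cancels the leakage to first order in $\xi$, so the '0' pixels return close to zero while the '1' pixels stay well above a mid-level threshold.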
To maximize energy efficiency, the final version of this system (with a custom CMOS chip that integrates detection with digital MAC computation) will not perform post-processing. We can use charge-sharing at the detector pixels to implement equation (1) with custom CMOS. Alternatively, we could simply reduce crosstalk by changing the system design. For example, we could space the PEs further apart or shrink the active region of the detectors to improve the ratio of signal at the current pixel to noise from neighboring pixels. Another option is a dual-rail scheme, where two detectors on opposite corners of a MAC unit detect one bit: light is sent to the first detector for a '1' or to the second detector for a '0'. Comparing the charge on the two detectors is more robust to error than thresholding the absolute charge: with sufficient crosstalk, the unlit detector may cross a fixed threshold, but it will not reach a higher charge than its lit neighbor.

Supplementary Note 3: Training and test sets
In our proof-of-concept experiment, we performed inference on 500 images using a two-hidden-layer, fully-connected neural network, where each hidden layer had 100 activations. We used TensorFlow's built-in dataset importer to download the MNIST handwritten digit dataset [1] from the TensorFlow 2 Keras database and took the first 500 images of its test set. Relevant code can be found at https://github.com/alexsludds/Digital-Optical-Neural-Network-Code (GitHub repository of Alexander Sludds, user alexsludds). The model's weights were pre-trained on an NVIDIA K40 GPU using the entire MNIST training set. Categorical crossentropy was used as the loss function. Dropout was applied in each layer to prevent overfitting. Input images were downsized from 28 × 28 to 7 × 7 using bilinear interpolation.
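The downsizing step can be sketched in plain NumPy. This is a minimal illustration of bilinear interpolation, assuming half-pixel-center alignment and no antialiasing; the function name `bilinear_resize` is hypothetical, a random array stands in for an MNIST digit, and the resampling convention used in the experiment may differ.

```python
import numpy as np

def bilinear_resize(image, out_h, out_w):
    """Bilinear resampling of a 2-D array.
    ASSUMPTION: half-pixel-center alignment, no antialiasing."""
    image = np.asarray(image, dtype=float)
    in_h, in_w = image.shape
    # Source coordinates of each output pixel center, kept inside the image.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
    bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
digit = rng.random((28, 28))          # stand-in for one 28 x 28 MNIST image
small = bilinear_resize(digit, 7, 7)  # 7 x 7 downsized image
features = small.reshape(-1)          # 49 values fed to the first layer
print(small.shape, features.size)
```

Flattening the 7 × 7 result gives the 49 input activations of the fully-connected network.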

Supplementary Note 4: Electronic interconnect switching energy in 0 to 1 transitions
The dynamic switching energy of CMOS devices is the amount of energy required to charge the output capacitance of a CMOS gate. In a CMOS inverter, energy is drawn from the supply only for low-to-high transitions on the output. Consider the toy circuit model shown in Fig. S3. On the left is a CMOS inverter, and to its right are the low-to-high and high-to-low transitions, respectively. In the low-to-high transition, the PMOS switches on, connecting the output to the supply rail and charging the load capacitance. In the high-to-low transition, the NMOS already has a sufficient drain-to-source voltage from the charge on the load capacitance, so it can discharge the output without consuming any power from the supply. To summarize, for an output that switches from low to high and back to low again, the PMOS initially turns on, taking $CV_{DD}^2$ of energy from the supply; then the NMOS turns on, discharging $\frac{1}{2}CV_{DD}^2$ from the charged load capacitor (the other $\frac{1}{2}CV_{DD}^2$ is dissipated as heat in the resistance of the charging path).

Figure S3. A demonstration of where dynamic energy consumption goes during switching of a CMOS inverter. The circuit, shown left, consists of a stacked NMOS and PMOS device. During an output low-to-high transition, shown center, charge is deposited on the lumped output capacitance. During an output high-to-low transition, shown right, the charge on the lumped output capacitance is discharged through the NMOS into ground.

Figure S4. A proposed circuit for resetting the receiver lumped capacitor model.
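The energy bookkeeping for a low-to-high transition can be checked numerically. The sketch below models the charging path as an ideal resistor in series with the load capacitor; this is a simplification of the PMOS, and the component values are hypothetical rather than taken from the main text. It integrates the energy drawn from the supply and the energy dissipated as heat.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal integral of samples y over the time grid t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def charging_energy_split(C, V_dd, R, n_steps=200_001, n_tau=30.0):
    """Charge C from 0 V to V_dd through resistance R (idealized pull-up).
    Returns (energy from supply, energy dissipated in R, energy stored on C)."""
    tau = R * C
    t = np.linspace(0.0, n_tau * tau, n_steps)
    v_out = V_dd * (1.0 - np.exp(-t / tau))  # capacitor voltage
    current = (V_dd - v_out) / R             # current drawn from the supply
    e_supply = trapezoid(V_dd * current, t)
    e_heat = trapezoid(current**2 * R, t)
    e_stored = 0.5 * C * v_out[-1] ** 2
    return e_supply, e_heat, e_stored

# Hypothetical values: 1 fF load, 0.8 V supply, 10 kOhm on-resistance.
C, V_dd = 1e-15, 0.8
e_supply, e_heat, e_stored = charging_energy_split(C, V_dd, R=10e3)
print(f"supply: {e_supply / (C * V_dd**2):.4f} x C*V_dd^2")
print(f"heat:   {e_heat / (C * V_dd**2):.4f} x C*V_dd^2")
print(f"stored: {e_stored / (C * V_dd**2):.4f} x C*V_dd^2")
```

The split does not depend on the resistance chosen, which is why the switching energy can be written in terms of $C$ and $V_{DD}$ alone.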

Supplementary Note 5: Resetting a 'receiverless' circuit
There are several circuit methods by which the accumulated charge on the input capacitor can be reset. In the method shown in Fig. S4, we place the NMOS device $\mathrm{NMOS_{Discharge}}$ between the photodetector and ground and drive its gate with an external reference voltage $V_{ref}$. The benefit of this solution is that it consumes no dynamic energy when there is no optical input power. The tradeoff is that it requires additional area on chip and, because it is ratioed logic, careful design to ensure functionality. The width of $\mathrm{NMOS_{Discharge}}$ is set small enough that the accumulated charge on the capacitor generates a voltage high enough to overcome the input threshold of the load (modeled here as a CMOS inverter), but not so small that the device cannot dissipate the charge within a single clock cycle. One problem that arises from receiverless photodetection is that a constant stream of '1's arriving at a photodetector without a strong enough $\mathrm{NMOS_{Discharge}}$ will cause additional charge to slowly build up on the load capacitance. To compensate, we propose a P-N junction diode (Clamp Diode).

Supplementary Note 6: Electronic repeaters
A naive implementation of a repeater is a double inverter. The energy required is $C_T V_{DD}^2$, since in any transition, one inverter makes a low-to-high transition and the other a high-to-low transition. As a result, in any 'flip' of a repeater, one inverter does not consume energy. Using the values in Table 2 of the main text, the cost of a repeater is 0.06 fJ/bit for an output low-to-high transition. Therefore, even in the worst-case scenario where we place a repeater between every multiplier in an array of abutted 8-bit MAC units, the inter-multiplier interconnect energy cost is larger than that of the repeater.

Supplementary Note 7: Shot and thermal noise
In a hypothetical crosstalk-free DONN, the remaining noise sources are thermal (Johnson) and shot noise. To gain insight into whether they would affect classification accuracy, we estimate the ensuing bit error rates (BERs). The detector registers a '1' when $q \geq q_D$ photoelectrons are generated, and a '0' when $q < q_D$, where we assume the threshold charge is set by $q_D = n_p/2$ electrons. Fig. S5 illustrates the probability distributions of the number of photoelectrons, as well as the probabilities that a '0' is received when a '1' is sent (BER$_1$), and vice versa (BER$_0$).

In a receiverless photodetector scheme, thermal noise can be approximated as 'kT/C' noise [2], with:

$$\sigma_V = \sqrt{\frac{k_B T}{C_{det} + C_T}}$$

where $\sigma_V$ is the standard deviation of voltage, $k_B$ is the Boltzmann constant, $T$ is the temperature in Kelvin, $C_{det}$ is the capacitance of the photodetector, and $C_T$ is the capacitance of the inverter. The temperature depends on the quality of heat sinking and proximity to hot spots; from the literature [3], we assume it is in the range $T \in [300, 500]$ K. Using the values from Table 2 of the main text, we find $\sigma_V \approx 5$ to $6$ mV $\ll V_{DD}$. We can further verify whether thermal noise is likely to cause bit errors by approximating the probability distribution due to thermal noise, $p_J(q)$, by a Gaussian. To first order, shot noise will not affect the transmission of '0's (BER$_0$), since the number of transmitted photons is $n_p = 0$; BER$_0$ is therefore set by thermal noise alone. BER$_0$ for different $n_p = 2q_D$ are reported in Table 1.
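The order of magnitude of the thermal noise can be checked with the standard kT/C expression, $\sigma_V = \sqrt{k_B T / C}$, where $C$ is the total capacitance at the detector node. Table 2 of the main text is not reproduced in this note, so the total capacitance (0.2 fF) and supply voltage (0.8 V) below are assumed placeholder values, and the function name is hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def thermal_sigma_v(temperature_k, c_total_f):
    """RMS kT/C noise voltage on a capacitance c_total_f at temperature_k."""
    return math.sqrt(K_B * temperature_k / c_total_f)

C_TOTAL = 0.2e-15  # ASSUMPTION: detector plus inverter capacitance, in farads
V_DD = 0.8         # ASSUMPTION: supply voltage, in volts

for temperature in (300.0, 500.0):
    sigma = thermal_sigma_v(temperature, C_TOTAL)
    print(f"T = {temperature:.0f} K: sigma_V = {sigma * 1e3:.2f} mV "
          f"({sigma / V_DD:.2%} of V_DD)")
```

Under these assumptions the noise voltage stays at the millivolt level across the whole temperature range, roughly two orders of magnitude below the supply voltage.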
We assume shot noise follows a Poissonian probability distribution:

$$p_{shot}(q) = \frac{e^{-n_p}\, n_p^{q}}{q!}$$

where $n_p$ is the number of photons per detector per clock cycle.

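For ease of computation with large $n_p$, we take the natural logarithm:

$$\ln\left(p_{shot}(q)\right) = \ln\left(\frac{e^{-n_p}\, n_p^{q}}{q!}\right) \tag{7}$$

$$= \ln e^{-n_p} + q \ln n_p - \ln(q!) \tag{8}$$

$$= -n_p + q \ln n_p - \sum_{k=1}^{q} \ln k \tag{9}$$

$$\Downarrow$$

$$p_{shot}(q) = \exp\left(-n_p + q \ln n_p - \sum_{k=1}^{q} \ln k\right) \tag{10}$$

Since a '1' is misread whenever $q < q_D$, BER$_1$ due to shot noise is therefore:

$$\mathrm{BER}_1 = \sum_{q < q_D} p_{shot}(q)$$

Results of this computation for various $n_p$ are shown in Table 1.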
Thermal noise will also contribute to BER$_1$; we convolve the probability distributions to find the total bit error rate. From equation (5) in the main text, we find that $n_p \approx 1000$ photons/bit are needed to generate a voltage swing of 0.8 V on the load capacitance; therefore, the expected BER is negligible, per Table 1.
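The shot-noise part of this estimate can be sketched as follows. The code evaluates the Poisson probabilities in the logarithmic domain, as described above, and sums them over all $q < q_D = n_p/2$. It covers shot noise only; the convolution with thermal noise is omitted. The function names are hypothetical, and `math.lgamma(q + 1)` stands in for $\ln(q!)$.

```python
import math

def log_poisson_pmf(q, n_p):
    """ln p_shot(q) = -n_p + q ln(n_p) - ln(q!), valid for n_p > 0."""
    return -n_p + q * math.log(n_p) - math.lgamma(q + 1)

def log10_ber1_shot(n_p):
    """log10 of the probability that fewer than q_D = n_p / 2 photoelectrons
    are generated when a '1' is sent (shot noise only)."""
    q_d = n_p / 2
    q_max = math.ceil(q_d) - 1  # largest integer q with q < q_D
    logs = [log_poisson_pmf(q, n_p) for q in range(q_max + 1)]
    peak = max(logs)
    # Log-sum-exp: factor out the largest term so the sum does not underflow.
    total = peak + math.log(sum(math.exp(value - peak) for value in logs))
    return total / math.log(10)

for n_p in (10, 100, 1000):
    print(f"n_p = {n_p:4d}: log10(BER_1) = {log10_ber1_shot(n_p):.1f}")
```

Working with the logarithm avoids overflow in $n_p^q$ and $q!$ at large photon numbers, which is the motivation given for equations (7) to (10).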

Supplementary Note 8: Scalability of the DONN
In this section, we discuss the scalability of our proposed system. Free-space optical propagation is nearly lossless, so transmission distance will not be limiting in practice. There is, however, a limit to the power an individual source can produce (∼100 mW for a single VCSEL [4]), which would appear to limit the maximum value of $N$ ($N_{max}$). $N_{max}$ can be estimated using the values and equations from the main text at a standard clock rate of 1 GHz. We can then conservatively define a unit cell with a similar layout to Fig. 2c in the main text with $N = 1{,}000$ and $B = 1{,}000$. This unit cell can then be tiled (replicated) to increase the effective size of the array. Unit cells can be synchronized by optically transmitting values to them in the same manner that light sources are fanned out to receivers in the DONN. A tree-style structure distributes data from one central array to each of these unit cells, where the branching points in the tree are linear arrays of O-E-O devices, which receive an optical signal, convert it to an electronic signal and re-emit an amplified optical signal. Such a device could be implemented by connecting a receiverless photodetector to a CMOS repeater to a VCSEL. These devices add energy overhead, since they require an external power source to generate the new, stronger light signal. However, because each level of the tree fans out to many more receivers than the last, the cost of these devices is substantially amortized (by a factor of roughly $1000^P$, where $P$ is the level of the tree). Thus, they do not add significant power overhead. In terms of fabrication of the unit cell, large, densely packed electronic multiplier arrays are already in use today, for example in the Google TPU [5].
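A rough, order-of-magnitude version of the source-power limit can be sketched as below. It is not the estimate from the main text. It assumes an 850 nm emission wavelength (a hypothetical choice), the ∼1000 photons/bit figure from Supplementary Note 7, a '1' transmitted in every clock cycle, and lossless, uniform fan-out. All names are illustrative.

```python
H = 6.62607015e-34  # Planck constant, J*s
C_LIGHT = 2.99792458e8  # speed of light, m/s

def max_fanout(source_power_w, photons_per_bit, clock_hz, wavelength_m):
    """Number of receivers one source can drive if every receiver needs
    photons_per_bit photons in every clock cycle (worst case: all '1's)."""
    photon_energy_j = H * C_LIGHT / wavelength_m
    power_per_receiver_w = photons_per_bit * photon_energy_j * clock_hz
    return source_power_w / power_per_receiver_w

n_max = max_fanout(
    source_power_w=100e-3,  # ~100 mW single VCSEL, as quoted in the text
    photons_per_bit=1000,   # from Supplementary Note 7
    clock_hz=1e9,           # 1 GHz clock, as quoted in the text
    wavelength_m=850e-9,    # ASSUMPTION: hypothetical VCSEL wavelength
)
print(f"Estimated fan-out limit: {n_max:.2e} receivers")
```

Under these assumptions the limit is far above 1,000 receivers, which is consistent with calling a unit cell of $N = 1{,}000$ conservative.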

Supplementary Note 9: Potential latency reduction from DONN architecture
There are several components which could limit the speed of operation of the DONN, namely the modulated sources, photodetectors, and receiver electronics.

Modulated sources
Directly-driven incoherent light sources such as nanoLEDs can achieve bandwidths in excess of 10 GHz, though the current energy efficiency of these devices requires improvement [6]. Coherent sources, such as VCSELs driven by an electronic source, can also achieve bandwidths exceeding 10 GHz [7].

Photodetector bandwidth
The bandwidth of a receiverless photodetector can be limited by one of two factors: the time for carrier removal or RC time constants at the detector. Well-designed photodetectors confine photocarriers between electrical contacts, which allows all carriers to be extracted by carrier drift rather than carrier diffusion, achieving bandwidths close to 100 GHz [8]. The RC time constant of tightly-integrated photodetectors (micrometer-scale wires) is orders of magnitude smaller than the carrier removal time.

Receiver electronics
A final factor which can limit the bandwidth of the system is the speed of driver and receiver electronics. Most modern high-throughput electronic systems have a clock speed limited by thermal constraints [9] of $\sim 10^6$ W/m$^2$. Considering an 8-bit MAC unit with energy $E_{MAC} = 25$ fJ/MAC and area 50 µm$^2$, the thermal limitation on operating speed for densely packed MAC logic with full utilization is ∼2 GHz. One benefit of the DONN architecture is the ability to increase the distance between MAC units to overcome thermal constraints with no additional cost in data transfer. Increasing the pitch of the devices by 3.3× decreases the thermal density by roughly 10×, which allows for operation at up to 20 GHz.
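The thermal clock limit quoted above follows from a one-line power-density balance: the heat generated per unit area, $E_{MAC} f / A$, must not exceed the thermal limit. The sketch below reproduces that arithmetic with the figures given in the text. The function name is hypothetical, and it assumes the area per MAC unit grows with the square of the pitch.

```python
def thermal_clock_limit_hz(power_density_w_m2, e_mac_j, area_m2, pitch_scale=1.0):
    """Highest clock rate at which one MAC per cycle per unit stays within the
    power-density limit. ASSUMPTION: the area per MAC scales as pitch_scale**2."""
    effective_area_m2 = area_m2 * pitch_scale**2
    return power_density_w_m2 * effective_area_m2 / e_mac_j

POWER_DENSITY = 1e6  # W/m^2, thermal limit quoted in the text
E_MAC = 25e-15       # J per MAC
AREA = 50e-12        # m^2 (50 square micrometers)

dense = thermal_clock_limit_hz(POWER_DENSITY, E_MAC, AREA)
spaced = thermal_clock_limit_hz(POWER_DENSITY, E_MAC, AREA, pitch_scale=3.3)
print(f"densely packed: {dense / 1e9:.1f} GHz")
print(f"pitch x3.3:     {spaced / 1e9:.1f} GHz")
```

With quadratic area scaling, a 3.3× pitch increase gives slightly more than a tenfold reduction in thermal density, in line with the ∼20 GHz figure.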