Introduction

A unitary operation is an essential building block of quantum1,2,3 and classical4,5 linear systems because any linear operator can be decomposed into a set of unitary and diagonal operators6. With advances in quantum computations3 and deep learning accelerators7, development of reconfigurable hardware for universal unitary operations has become a topic of intense study. A programmable photonic circuit is one of the most widely used platforms8,9 for unitary operations in optical neural networks4,10, modal decoding5 and quantum computations1,2,3.

The fundamental strategy to realize universal unitary operators is to factorize a target operator in the degree-n unitary group U(n) into diagonal operators and unitary operators of lower-degree groups, such as SU(2)11. These subsystems can be realized with conventional optical elements, such as beam splitters, Mach‒Zehnder interferometers (MZIs) and phase shifters, which constitute a programmable photonic circuit with reconfigurable modulation. Although a mesh composed of these unit elements can perform universal unitary operations, the connectivity inside the mesh is nonunique and poses an optimal design problem for more compact and robust platforms12,13,14,15. To improve upon the original proposal for the mesh topology—the Reck design11—recent approaches have successfully demonstrated advanced arrangements of two-channel subsystems—the Clements design12—and the advantages of utilizing multichannel building blocks—the Saygin design13.

When each channel of the mesh is assigned as a node, a photonic circuit can be interpreted as a graph network16 regardless of the design strategy. Accordingly, it is logical to seek inspiration from network science17 to understand and improve the large-scale mesh topology of high-degree unitary groups, which should inherit intriguing features of complex networks. In this context, one promising issue is the degree distribution describing the differentiated importance of network nodes, which has been studied intensively through the concepts of heavy-tailed distributions, hub nodes and scale-freeness17,18,19,20,21. When multiple decomposition processes are applied to U(n)11,12, a natural question arises: Does every decomposition and its corresponding optical element contribute equally to the designed unitary operation? The answer to this question is of fundamental and practical importance in quantum physics and photonics for devising more advanced hardware architectures applicable to universal quantum evolutions and deep learning accelerators, especially with large-scale photonic circuits. Very recently, the first step toward this answer was achieved in the Reck design with asymmetric architecture, devising a stable self-configuration algorithm for large-scale multiport interferometers22.

In this paper, we reveal that some subsystems are more important than others, even in the highly symmetric architecture of large-scale programmable photonic circuits. By applying various statistical models to programmable photonic circuits targeting universal unitaries, we verify that a type of unit rotation operator has a heavy-tailed distribution. This finding shows the presence of hub optical elements and the Pareto principle in photonic circuits, which enables the development of the pruning technique23 for linear quantum or classical hardware. We demonstrate that the suggested hardware pruning for random unitaries allows for improved fidelity when the elements with noise above a specific threshold are removed. This result provides a design strategy for high fidelity and energy efficiency in large-scale quantum computations and photonic deep learning accelerators.

Results

Rotation operators in programmable photonic circuits

Before applying the statistical analysis to large-scale programmable photonic circuits, we revisit the Clements design12, which is one of the most widely used architectures for universal unitaries. Figure 1 shows a schematic of the photonic circuit for the n × n unitary matrix Un ∈ U(n) obtained from the Clements design. Both the Reck and Clements designs null the off-diagonal elements of Un by sequentially multiplying by the programmable unit operations Tml ∈ U(n) (1 ≤ m ≤ n – 1, 1 ≤ l ≤ n, where m and l are integers). Tml performs the SU(2) operation on the Bloch sphere defined for the mth and (m + 1)th channels, setting the off-diagonal element (l, m) or (m + 1, l) to zero.

Fig. 1: Programmable photonic circuits for universal unitary operators.
figure 1

a Programmable photonic building block of Tml composed of MZIs and phase shifters for the SU(2) operation between the mth and (m + 1)th channels. Red and blue boxes represent the phase shifters for θ and φ, respectively. b, c The rotation operators of Rxm(–π/2)Rzm(–φ) (b) and Rxm(–π/2)Rzm(–2θ) (c), described in Bloch spheres. Black and colored solid lines indicate x-axis and z-axis rotations, respectively. b and c correspond to the parts indicated by blue and red arrows in (a), respectively. d Phase shifters for the diagonal components of Dn. e Schematic diagram of the programmable photonic circuit for U16. The tunability of θ and φ allows for the programming of U16.

To maximally cover the SU(2) group with Tml, reconfigurable and independent control of the amplitude and phase differences between the mth and (m + 1)th channels is necessary8. One of the most popular platforms for Tml utilizes two pairs of a stationary MZI and a tunable phase shifter in one arm8,9,12, which involves two adjustable parameters, θ ∈ [0, π/2] and φ ∈ [0, 2π) (Fig. 1a). While the phase shifts θ and φ correspond to tunable z-axis rotations on the Bloch sphere, the stationary MZIs constitute the –π/2 x-axis rotations (Fig. 1b, c). The unit operator then becomes Tml(θ,φ) = Rxm(–π/2)Rzm(–2θ)Rxm(–π/2)Rzm(–φ), where Ram(ξ) is the ξ-rotation about the a-axis on the m-(m + 1) Bloch sphere, and θ and φ are determined to satisfy nulling of the (l, m) or (m + 1, l) element. The target unitary operator Un is reproduced with multiple Tml operators and the remaining diagonal matrix Dn after nulling, as follows:

$$U_{n}=D_{n}\left[\prod_{\{m,l\}\in S_{n}}T_{m}^{l}(\theta_{m,l},\varphi_{m,l})\right]$$
(1)

where Sn is the ordered sequence of {m, l} pairs determined by the nulling process12 and Dn is realized with phase shifters (Fig. 1d). By redefining Sn, the Clements design employs a highly symmetric arrangement of the MZIs (Fig. 1e), which halves the device footprint and enhances robustness to optical losses compared to the Reck design (see Supplementary Note S1 for the detailed processes).
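For concreteness, the building block and the reconstruction of Eq. (1) can be written in a few lines of numpy. The sketch below uses our own notation rather than Supplementary Code 1, assumes the product over Sn is taken left to right, and ignores global phases (which are absorbed into Dn):

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices generating the SU(2) rotations on the m-(m+1) Bloch sphere
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(axis, xi):
    """R_a(xi) = exp(-i xi sigma_a / 2): xi-rotation about axis a."""
    return expm(-0.5j * xi * axis)

def T_block(theta, phi):
    """2x2 unit cell T(theta, phi) = Rx(-pi/2) Rz(-2 theta) Rx(-pi/2) Rz(-phi)."""
    return rot(X, -np.pi / 2) @ rot(Z, -2 * theta) @ rot(X, -np.pi / 2) @ rot(Z, -phi)

def embed(T2, m, n):
    """Embed the 2x2 block onto channels m and m+1 (1-indexed) of an n-channel mesh."""
    Tn = np.eye(n, dtype=complex)
    Tn[m - 1 : m + 1, m - 1 : m + 1] = T2
    return Tn

def reconstruct(diag_phases, sequence, n):
    """Eq. (1): U_n = D_n [product of T_m^l over the ordered sequence S_n].

    sequence: list of (m, theta, phi) tuples in the order fixed by the nulling."""
    U = np.eye(n, dtype=complex)
    for m, theta, phi in sequence:
        U = U @ embed(T_block(theta, phi), m, n)
    return np.diag(diag_phases) @ U

# Quick check: every T(theta, phi) is unitary
T = T_block(0.3, 1.1)
assert np.allclose(T.conj().T @ T, np.eye(2))
```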

The reconfigurability for universal unitary operators is thus realized with the z-axis rotations Rz obtained from the tunable phase shifts θ and φ. As programmable devices, the noise and power consumption of photonic circuits are determined by the performance of modulating the optical refractive index Δn in the phase shifters and the resulting changes of θ and φ, which scale as ~LΔn, where L is the modulation length. Therefore, the statistical analysis of the two adjustable phases θ and φ is critical in examining the performance of large-scale programmable photonic circuits.

Heavy tails in rotations

Due to the highly symmetric form of the photonic circuit (Fig. 1e), at first glance it may appear reasonable to predict that the building blocks Tml in the circuit have equal importance. Under this presumption, the distributions of θ and φ should be statistically uniform for an ensemble of photonic circuits that generate random unitary operations uniformly distributed in U(n)24. Furthermore, it may also seem reasonable to expect similar distributions for θ and φ, both of which perform z-axis rotations.

However, upon closer inspection, we reveal that those presumptions are invalid. Instead, there are differences in the contributions of individual building blocks as well as between the rotation operators of θ and φ. First, revisiting the nulling process of the Clements design12, we note that each off-diagonal element of Un undergoes differentiated transformations. For example, in nulling the 5 × 5 unitary matrices (Fig. 2a), nulling the (5,1) and (4,1) components results in the T15-transformed 1st and 2nd columns and the T31-transformed 3rd and 4th rows, respectively. Because the nulled off-diagonal elements no longer change, each building block acts on matrix elements that have undergone different numbers of SU(2) transformations; elements nulled earlier undergo fewer transformations (see the extended discussion in Supplementary Note S1).

Fig. 2: Heavy-tailed distributions in unitary photonic circuits.
figure 2

a–d Two origins of the heavy-tailed distributions of the rotation operators: unequal transformations in the nulling process (a) and nonuniform SU(2) rotations (b–d). a An example of the nulling process for U5. Orange and green arrows denote the nulling of the off-diagonal elements with UT and TU, respectively. Red and blue boxes indicate the rotating components for the nulling of the (5,1) and (4,1) components, respectively. Rotated states with T(θ, 0) (b), T(0, φ) (c), and T(θ, φ) (d). Each point in (b) and (c) denotes the transformed state through the corresponding T applied to the uniformly random initial states on the Bloch sphere. The colors in the map in (d) depict the nonuniform density of the transformed states on the Bloch sphere. The initial states in (b) and (c) are obtained with 10 polar grids and 20 azimuthal grids (200 points), while 200 polar grids and 400 azimuthal grids (80,000 points) are used in (d). e Heavy-tailed distributions of θ described by the CCDF. The inset shows the PDF and its fitting. The body and tail are separated with P(θ) = 0.20, referring to the Pareto principle. Red, orange and blue dashed lines show fitting with the power-law, power-law with an exponential cutoff (or truncated power-law) and log-normal distributions, respectively. Red and orange circles indicate the lower limit of the θ-fitting for the power law and power law with an exponential cutoff, respectively. f, g The variations of the power-law estimators for different n: α (f) and θmin (g). Each blue point represents a realization, and orange markers and error bars show the average and root-mean-square error (RMSE) of 100 random realizations at each n, respectively.

These disparate transformations of the matrix elements do not, by themselves, guarantee nontrivial distributions of the phase shifts θ and φ. However, the decomposed form of the building block operation Tml(θ, φ) = Rxm(–π/2)Rzm(–2θ)Rxm(–π/2)Rzm(–φ) results in a nontrivial distribution of θ, which is clearly distinct from that of φ. Figure 2b and c shows the transformations of initial states uniformly distributed along the polar (ξ) and azimuthal (η) axes of the Bloch sphere under Tml(θ, φ = 0) = Rxm(–π/2)Rzm(–2θ)Rxm(–π/2) and Tml(θ = 0, φ) = Rxm(–π)Rzm(–φ), respectively, where the nonzero θ and φ are uniformly sampled from their ranges. Notably, the states transformed by Tml(θ, φ = 0) become nonuniform (Fig. 2b), in sharp contrast to the uniform distribution from Tml(θ = 0, φ) (Fig. 2c). Such a discrepancy originates from the difference between the pure z-axis rotation Rzm(–φ) and the transformed rotation Rxm(–π/2)Rzm(–2θ)Rxm(–π/2) and eventually leads to nonuniformity on the Bloch sphere for Tml(θ, φ) (Fig. 2d). We emphasize that the unequal contributions of each nulling step (Fig. 2a) accumulate the nonuniformity of the θ rotations, which leads to nontrivial statistics in the phase shift design.

To confirm this prediction, we investigate the statistics of θ and φ in realizing programmable photonic circuits that reproduce random unitary operations achieved by uniformly sampling the U(n) group with the Haar measure24. We calculate the probability density functions (PDFs) p(θ) and p(φ) and the complementary cumulative distribution functions (CCDFs) P(θ) and P(φ) for an ensemble of 100 Un realizations at each n. As expected from the uniform distribution with Tml(θ = 0, φ) (Fig. 2c), the distribution of p(φ) is trivially uniform (Supplementary Note S2).
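The Haar-uniform ensemble itself is straightforward to reproduce. Below is a minimal sketch of the standard QR-based sampler (following the recipe of ref. 24) and the empirical CCDF used in the analysis; both are our own illustrative implementations, not taken from Supplementary Code 1:

```python
import numpy as np

def haar_unitary(n, rng=None):
    """Haar-random U(n): QR of a complex Ginibre matrix with the column
    phases fixed so that the distribution is exactly uniform."""
    rng = rng or np.random.default_rng()
    G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    Q, R = np.linalg.qr(G)
    d = np.diag(R)
    return Q * (d / np.abs(d))    # rescales each column by a unit phase

def empirical_ccdf(samples):
    """Return (sorted values, P(theta)), with P(theta) the fraction of samples >= theta."""
    x = np.sort(np.asarray(samples))
    return x, 1.0 - np.arange(len(x)) / len(x)
```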

One of the key findings of this work is the nonuniform distribution of θ. Figure 2e shows an example of the θ distribution for U128, which includes 8128 values. As shown in the linearized plots of the CCDF and PDF on the log-log scale, θ possesses a heavy-tailed distribution17,19,21, indicating that p(θ) decays more slowly with increasing θ than an exponential distribution. For the quantitative analysis, we employ three representative heavy-tailed distribution models19—the power-law, power-law with an exponential cutoff and log-normal distributions—and the exponential distribution model. The models are fitted to the θ dataset of each realization of photonic circuits by analytical or numerical maximization of the model likelihoods19,25 and the Kolmogorov‒Smirnov test26 for the models with lower bounds (see Methods for details). This standard procedure determines the range and shape of the tail of each model for the optimized fitting of a given dataset.

Notably, all the heavy-tailed models provide good fits for large n, showing consistent behaviors of their estimators for each realization, which is a critical condition for model consistency21. For example, the exponent α (Fig. 2f) and the lower bound θmin (Fig. 2g) in the power-law model P(θ) = (θ/θmin)–α+1 converge with increasing n, which demonstrates that the heavy-tailed distribution becomes more apparent in larger-scale programmable photonic circuits. The average of the power-law exponents at n = 128 is αavg = 3.18 for 100 realizations (or 812,800 values of θ), while the lower and upper limits are αmin = 2.75 and αmax = 3.78. Such consistency strongly supports the validity of the power-law model20,21 for describing the distribution of the θ-rotation operators (see Supplementary Notes S3–S5 for the results of the crossover heavy-tailed distributions and the exponential distribution). We note that the averaged lower bound θmin = 0.08π with P(θmin) = 0.24 shows that most of the significant rotations 0.08π ≤ θ ≤ 0.50π come from ~24% of the building blocks, which illustrates the Pareto principle for large-scale programmable photonic circuits.

Hub units and pruning

The observed heavy-tailed distribution of the θ rotation operators signifies that the building blocks Tml with θ in the 'Tail' part of Fig. 2e are more critical than the other building blocks (the 'Body' part of Fig. 2e). In realizing programmable photonic circuits for universal unitary operations (Fig. 1a), many phase shifters in the 'body' of the distribution may be unnecessary because θ ~ 0. On the other hand, the 'tail' phase shifters with large θ values operate as hub units. Because such hub units deliver most of the necessary θ-rotations for realizing Un, we can envisage the application of the pruning technique in computer science23 to photonic hardware.

Figure 3a shows the concept of pruning for programmable photonic circuits. The entire photonic circuit for Un includes n(n – 1)/2 SU(2) building blocks and the same number of θ values. We define the set of sorted θ values for a given photonic circuit as Θn = {θr | 1 ≤ r ≤ n(n – 1)/2}, where the integer r indexes the building blocks in ascending order of θ (θp ≤ θq for p ≤ q), so that θr with larger r represents a more important building block. Pruning the less important ones—body elements—of the photonic circuit is then defined by imposing θr = 0 for 1 ≤ r ≤ σ, where the integer σ determines the degree of pruning: σ = 0 preserves the original circuit and σ = n(n – 1)/2 entirely removes the θ rotations in the circuit. In the hardware implementation, pruning corresponds to leaving out the phase shifters for θ while preserving the symmetry in the MZI arms.
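In code, body pruning reduces to a masking operation on the sorted θ values; a minimal sketch under the convention above:

```python
import numpy as np

def prune_body(thetas, sigma):
    """Set the sigma smallest theta rotations of one circuit to zero.

    thetas: 1D array of the n(n-1)/2 theta values; sigma = 0 keeps the
    original circuit, sigma = len(thetas) removes every theta rotation."""
    pruned = np.array(thetas, dtype=float)
    order = np.argsort(pruned)      # ascending: the 'body' elements come first
    pruned[order[:sigma]] = 0.0
    return pruned
```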

Fig. 3: Pruning is often better than noise.
figure 3

a The concept of pruning in programmable photonic circuits. The phase shifter 2θ of the building block is replaced with an ordinary waveguide, which preserves the symmetry in the MZI arms. b The noisy building block. The phase shifter 2θ is perturbed as 2(θ + δ). c Comparison of the fidelities of the U128 photonic circuits in different groups: pruning body (red line), pruning tail (blue line), noisy body (orange line) and noisy tail (green error bars). The thicknesses of the colored lines and the error bars represent the range of the fidelities between their maxima and minima. The red arrows indicate the pruning thresholds for each case. Two pairs of groups with noisy bodies and noisy tails are shown for δ0 = 0.04π and 0.08π, which correspond to about 2 K–4 K temperature changes in silicon infrared thermo-optical phase shifters27. d Pruning threshold as a function of the noise level δ0 for different degrees of unitary operators. In (c) and (d), 100 random Un realizations are analyzed per value of n and defect ratio.

The refractive index modulation in the phase shifters is responsible for much of the energy consumption and noise generation in programmable photonic circuits8,9. For example, consider a typical thermo-optic phase shifter with a device length of 100 μm27, operating at the telecom wavelength of 1550 nm and based on silicon photonics technology. The amount of thermal noise in the phase evolution is determined by the thermo-optic coefficient of silicon28, dn/dT = 1.8 × 10–4 K–1, and can approach 0.02π per kelvin. This noise may be further exacerbated in larger-scale devices due to increasing thermal crosstalk. Therefore, pruning superfluous phase shifters allows for more energy-efficient and noise-tolerant photonic circuits for reconfigurable unitary operations, provided that the circuit after pruning accurately reproduces unitary operations. To examine the performance of pruning in a practical situation, we prepare three control groups: one group with pruning of the more important building blocks—tail elements—with θr = 0 for n(n − 1)/2 − σ + 1 ≤ r ≤ n(n – 1)/2, and two groups with noisy elements. For the noisy elements, we assume random noise from the phase shifter by assigning the noise δk to the kth original rotation as θk + δk, where δk is drawn from the uniform random distribution u[0, δ0]. For a fair comparison, we construct the groups of noisy elements by replacing the body- or tail-pruned elements in the pruning groups with noisy elements.
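As a back-of-the-envelope check of the thermo-optic noise scale quoted above (our own estimate from the stated parameters), the thermally induced phase shift per kelvin for an L = 100 μm silicon phase shifter at λ = 1550 nm is

$$\delta\varphi=\frac{2\pi L}{\lambda}\frac{dn}{dT}\Delta T=\frac{2\pi\times 100\ \mu\mathrm{m}}{1.55\ \mu\mathrm{m}}\times 1.8\times{10}^{-4}\ \mathrm{K}^{-1}\times 1\ \mathrm{K}\approx 0.073\ \mathrm{rad}\approx 0.023\pi,$$

so the noise amplitudes δ0 = 0.04π and 0.08π indeed correspond to temperature variations of roughly 2 K and 4 K.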

To characterize the precision of the operation of circuits with pruning or noise, we define the fidelity, which quantifies the metric between the original and defective operators29, as follows (see Methods for the derivation):

$$F({U_{n}}^{\mathrm{D}},{U_{n}}^{\mathrm{O}})=\frac{2\,\mathrm{Re}\left(\mathrm{Tr}\left[{({U_{n}}^{\mathrm{D}})}^{\dagger}{U_{n}}^{\mathrm{O}}\right]\right)}{n+\mathrm{Tr}\left({({U_{n}}^{\mathrm{D}})}^{\dagger}{U_{n}}^{\mathrm{D}}\right)},$$
(2)

where UnO and UnD represent the original unitary matrix and its defective (pruned or noisy) counterpart, respectively, and Tr(A) is the trace of the square matrix A. Figure 3c shows the fidelities of each photonic circuit with pruning or noise as a function of the ratio of defective elements, 2σ/[n(n – 1)] in the pruning groups. As expected, the fidelity is preserved much better when the body is pruned instead of the tail. More critical results emerge in comparison with the noisy circuits. When the noise amplitude increases, removing a specific ratio of the 'body' phase shifters can yield higher fidelity than keeping them noisy, whether the noise is imposed on body or tail elements. Such a ratio, called the pruning threshold, increases with the noise level and the scale of the photonic circuit (Fig. 3d). This result shows that there is a substantial restriction on the tolerable noise level in a large-scale programmable photonic circuit. If a phase shifter cannot meet this restriction, it is better to remove it, which increases accuracy and decreases the energy consumed for reconfigurability.
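Eq. (2) maps directly onto a few lines of numpy; a minimal sketch (F = 1 if and only if the defective matrix equals the original):

```python
import numpy as np

def fidelity(U_d, U_o):
    """Fidelity of Eq. (2) between the original unitary U_o and its pruned
    or noisy counterpart U_d (which need not be unitary)."""
    n = U_o.shape[0]
    numerator = 2.0 * np.real(np.trace(U_d.conj().T @ U_o))
    denominator = n + np.real(np.trace(U_d.conj().T @ U_d))
    return numerator / denominator
```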

Universal architecture for pruning

Although the result shown in Fig. 3 demonstrates hub functionality and the advantage of pruning in realizing an individual unitary operator, it is insufficient to apply pruning to programmable photonic circuits for universal unitary operators. This is because the sorted set Θn for pruning varies with the form of a unitary operator. To apply the pruning method for universal unitaries with reconfigurability, it is necessary to construct an adaptable architecture for the pruning process.

Because the position of each building block for nulling a specific off-diagonal element is fixed in an n-degree photonic circuit, the averages of the phase rotations <θm,l> and <φm,l> are well defined in hardware for random unitary operations uniformly sampled from U(n). Figure 4a and b describes the universal architectures defined by <θm,l> and <φm,l>, respectively, for 100 Un realizations with n = 16 and n = 32 (see Supplementary Note S6 for n = 64). As expected from the distinct SU(2) operations of θ and φ (Fig. 2b, c), we observe a spatially inhomogeneous distribution of <θm,l>, in contrast to that of <φm,l>. More specifically, the universal architectures show significant θ-rotation contributions from the building blocks near the boundary of the programmable photonic circuits. Such a consistent distribution allows for a universal sorted set <Θ>n = {<θm,l>r | 1 ≤ r ≤ n(n – 1)/2}, where r indexes the building blocks in ascending order of <θm,l> (<θm,l>p ≤ <θm,l>q for p ≤ q), enabling a pruning process applicable to any unitary operation.
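A sketch of how such a universal pruning order can be assembled from an ensemble, assuming the θ values of every realization are stored position-by-position in a fixed array layout (our own convention):

```python
import numpy as np

def universal_pruning_order(theta_ensemble):
    """theta_ensemble: shape (realizations, n*(n-1)//2), one column per fixed
    (m, l) building-block position of the mesh.

    Returns the block positions sorted by <theta_{m,l}> in ascending order;
    the leading entries are pruned first, independent of the target unitary."""
    mean_theta = theta_ensemble.mean(axis=0)    # <theta_{m,l}> over realizations
    return np.argsort(mean_theta)
```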

Fig. 4: Universal architecture for pruning in reconfigurable unitaries.
figure 4

a, b The averages of <θm,l> (a) and <φm,l> (b) for the photonic circuits of 100 Un realizations with n = 16 and n = 32. We set the upper bound of the color map in (a) to 0.3π for better visibility. c Comparison of the fidelities of the U128 photonic circuits in different groups: pruning body (red line), pruning tail (blue line), noisy body (orange line) and noisy tail (green error bars). The thicknesses of the colored lines and the error bars represent the range of the fidelities between their maxima and minima. The red arrow indicates the pruning threshold. Two pairs of groups with noisy bodies and noisy tails are shown for δ0 = 0.10π and 0.20π, which correspond to about 5 K–10 K temperature changes in silicon infrared thermo-optical phase shifters27. d Pruning threshold as a function of the noise level δ0 for different degrees of unitary operators.

From this guideline, we again apply pruning and add noise to the body and tail elements of the set <Θ>n. As shown in Fig. 4c and 4d, the general tendencies of Fig. 3c and 3d are preserved: the tail is more important than the body, sufficiently noisy elements are better removed than kept, and pruning is more efficient for larger-scale photonic circuits. Although the minimum noise level at which pruning becomes advantageous increases, there is still a pruning threshold that guarantees the advantage of removing θ phase shifters, and this tendency is much more apparent in larger-scale programmable photonic circuits. Notably, the importance of protecting hub elements from noise becomes evident at the strong noise level (the δ0 = 0.20π cases in Fig. 4c). Furthermore, consider one of the state-of-the-art realizations of experimentally demonstrated programmable photonic circuits, which allows for n = 64 matrix multiplications30,31. This realization requires 2016 (= n(n – 1)/2) unit cells composed of 4032 MZIs and 2016 θ- and φ-phase shifters for unitary operations. When we consider thermal noise from a temperature change of about 5 K, Fig. 4d shows that it is more advantageous to remove 50% (or 1008) of the θ-phase shifters in realizing U(64) circuits.

Pruning in photonic deep neural networks

The importance of achieving high-fidelity photonic U(n) operations demonstrated in Fig. 4 has been widely recognized in quantum computing, such as boson sampling1 and the quantum Fourier transform32 in linear-optical quantum computing33. On the other hand, the validity of pruning in deep learning requires further clarification because unitary operations are only part of an entire neural network composed of weight matrices and activation functions4,34. Therefore, we analyze how pruning and added noise in the unitary subparts of weight matrices impact the performance of photonic deep neural networks4.

To focus on the effect of pruning and adding noise, we consider a traditional feedforward neural network34,35 (Fig. 5a) with conventional training36 and evaluation methods. The target problem is the regression of the relationship between the input and output datasets, which are connected through the neural network of Fig. 5a with a specific set of weight matrices {Wp|p = 1, 2, and 3}. The goal of the network training starting from initially random weight matrices is the inference of the test output dataset from the test input, which corresponds to finding a specific form of a deep neural network from random deep neural networks (see Methods for details of the neural network model: architecture, datasets, training, hyperparameters and loss function).

Fig. 5: Pruning in photonic deep neural networks for regression.
figure 5

a The architecture of the deep neural network for analyzing the effect of pruning. Mp (p = 1, 2, 3, 4) is the number of neurons in the pth layer, where M1 = M4 = 16 and M2 = M3 = 32. Ψp = [a1p, a2p, …, aMpp]T and Δp = [δ1p, δ2p, …, δMpp]T are the signal and error column vectors at the pth layer, respectively, where akp and δkp are the signal and error at the kth neuron of each layer (k = 1, 2, …, Mp), respectively. Wp is the weight matrix between the pth and (p + 1)th layers. b, c Learning curves (b) and R-squared accuracies (c) of the model, which are estimated with the test dataset. The colors of the lines denote the unitary photonic circuits in different groups: pruning body (red), pruning tail (blue), noisy body (orange) and noisy tail (green). Noisy body and noisy tail are shown for δ0 = 0.001π. In (b), solid and dashed lines denote 5% (2σ/n(n – 1) = 0.05) and 10% (2σ/n(n – 1) = 0.10) defect ratios, respectively. In (c), dashed lines and error bars represent the mean value and half of the standard deviation of the test dataset accuracies of 1000 realizations. The disconnection of the blue dashed line in (c) denotes the failure of the training due to the divergence of network parameters. All the other parameters for training and calculating loss functions are shown in Methods.

When employing programmable photonic circuits in deep neural networks, the weight matrix W = Wp is realized through the singular value decomposition6 W = UDV†, where U and V† are unitary matrices and D is a diagonal matrix4,8. The unitary- and diagonal-matrix operations can be implemented with the structures in Fig. 1e and Fig. 1d, respectively8, although gain or loss may be necessary for the diagonal operations. Among these sub-operations of W, we apply pruning or noise to the unitary matrices U and V† through the procedure in Fig. 4. Each weight matrix is then replaced with its defective counterpart WD = UDDVD†, where UD and VD† are the pruned or noisy unitary matrices.
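A sketch of the corresponding replacement step; here `make_defective` is a hypothetical placeholder for the pruning or noise operation of Fig. 4 applied to a unitary factor, not a function from Supplementary Code 1:

```python
import numpy as np

def defective_weight(W, make_defective):
    """Replace W = U D V^dagger by W_D = U_D D V_D^dagger with defective
    unitary factors, leaving the singular values untouched."""
    rows, cols = W.shape
    U, s, Vh = np.linalg.svd(W)                # full square unitaries U and Vh
    D = np.zeros((rows, cols), dtype=complex)  # rectangular diagonal matrix
    D[: len(s), : len(s)] = np.diag(s)
    U_d = make_defective(U)                    # prune/noise the theta shifters of U
    V_d = make_defective(Vh.conj().T)          # ... and of V
    return U_d @ D @ V_d.conj().T
```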

Figure 5b and c demonstrates the impact of pruning on the regression problem. The learning curves estimated with the test dataset (Fig. 5b) show that pruning the body (red solid and dashed lines) allows for much more robust network learning than all the other cases of defective unitaries. Noisy cases present relatively unstable learning, especially when the ratio of defective elements increases (solid lines for 5% and dashed lines for 10% defective elements). Notably, pruning the tail—removing the operations of hub elements—results in the complete failure of learning. Such a discrepancy between pruning and adding noise is also apparent in the R-squared regression accuracy (R2) estimated after 300 epochs of training (Fig. 5c), which shows that only the body-pruning case provides an accuracy close to the ideal case without any defects (black solid line). Although the case examined in Fig. 5b, c is a specific regression problem for a specific deep neural network model, the result, in the same vein as Fig. 4, demonstrates the validity of the pruning method in photonic deep neural networks, serving at least as a proof of concept. To ensure reproducibility, we include the codes for Figs. 2–5 in Supplementary Code 1.

Discussion

Due to the mathematical generality of our study, the presented results should be universal for programmable photonic8,9 or superconducting37 processors for reconfigurable unitary operations whenever the unit SU(2) operation is nonuniform on the Bloch sphere and the target degree n is finite. Notably, we observed the excellent fit of the power-law model, the crossover behaviors from exponential to heavy tails in the truncated power-law and log-normal models, and the evident failure of the exponential model for degrees n ≳ 80. It is worth mentioning that the state-of-the-art realization of programmable photonic circuits using MZIs and phase shifters allows for n = 64 matrix multiplications30,31, which is close to the heavy-tailed regime. Therefore, the heavy-tailed features are evident at scales near and beyond the present state-of-the-art degrees (n ~ 102) in deep learning accelerators4,38,39 and noisy intermediate-scale quantum computers3,40,41. The suitable application of the demonstrated pruning method, which allows for leaving out a significant portion of electro-optic modulations in programmable photonic circuits, will become particularly beneficial for the next era of quantum computing and deep learning hardware.

Although we studied the performance of pruning in photonic deep neural networks, various issues still remain for practical realizations. First, although we focused on unwanted phase shifts that may originate from thermal crosstalk, optical absorption from material loss or imperfect device fabrication may be worse in large-scale photonic neural networks, especially when using gain or loss media42. Second, the effect of other forms of nonlinear activation functions43 and network architectures44 on pruning performance should be studied. Finally, the statistical properties of weight matrices depend on problems, model architectures, learning methods and hyperparameters45,46. The resulting unitary matrices can have some biased distributions distinct from random Haar matrices, which will impose the problem- or model-specific properties in pruning performance.

The presence of heavy-tailed distributions in programmable photonic circuits inspires the extension of seminal achievements in probability theory and network science to wave physics. As shown in our study, intriguing features related to heavy-tailed distributions appear in wave platforms, such as the observed Pareto principle in wave physics and the critical role of hub elements in pruning and noise immunity. Regarding the emergence of heavy-tailed distributions in network science18, the phase shift θ, or the rotation Rzm(–2θ), may correspond to the number or weight of the links that a network node possesses. In this analogy, each SU(2) unit corresponds to a network node, and different design methodologies for the U(n) decomposition11,12,13 imply a set of distinct network architectures for the same signal behavior U(n). To complete this analogy between wave physics and network science, we can envisage a network metric that quantifies the connectivity between SU(2) units, which should consider the interference effect, as demonstrated in the network modeling of wave scattering systems47. As the connectivity of integrated wave systems becomes more extensive and complex48, the concepts of complex networks will provide a foundation for design strategies in wave physics.

In conclusion, we demonstrated that some of the unit elements in a large-scale programmable photonic circuit are more important than others, exhibiting a heavy-tailed feature verified with conventional statistical models, i.e., the power-law, power-law with an exponential cutoff and log-normal distributions, with the exponential distribution as a counterexample. The observed heavy-tailed distribution originates from nonuniform rotations on the Bloch sphere, which are ubiquitous in conventional SU(2) units for programmable photonic circuits. The result enables a design strategy—pruning—for high fidelity and energy efficiency, which offers intriguing insight into the design of large-scale photonic structures for classical and quantum devices, as demonstrated in the application to photonic deep neural networks. Further research on devising other forms of SU(2) units, or units of higher degree for the Un factorization, is desirable to alter the observed heavy-tailed distributions.

Methods

Model fitting process

To analyze the θ distributions in an ensemble of programmable photonic circuit realizations, we employ multiple statistical models: power-law, power-law with an exponential cutoff, log-normal and exponential distributions. Each model is defined by a set of model parameters {qs}. To calculate the model parameters for the fitting of a given dataset {θ1, θ2, …, θM}, we employ an analytical or numerical calculation of the maximum likelihood estimators (MLEs)25 from the PDF p(θm;{qs}), which defines the probability of finding θm with the model having the parameters {qs}. First, the probability of obtaining the dataset from the statistical model with the given model parameters {qs} and the PDF p(θm;{qs}) is

$$p(\{\theta_{m}\};\{q_{\mathrm{s}}\})=\prod_{m=1}^{M}p(\theta_{m};\{q_{\mathrm{s}}\}),$$
(3)

which is called the likelihood for the data and model. The model with the higher likelihood then provides the better fit to the dataset {θm}19. Because the employed statistical models have exponential forms, it is conventional to utilize the log-likelihood L:

$$L=\sum_{m=1}^{M}\log p(\theta_{m};\{q_{\mathrm{s}}\}).$$
(4)

The fitting of the model to a given set of data, which requires the calculation of {qs}, then corresponds to the maximization of L with respect to {qs}. Therefore, the MLE is defined as

$$\nabla_{\{q_{\mathrm{s}}\}}L=\nabla_{\{q_{\mathrm{s}}\}}\sum_{m=1}^{M}\log p(\theta_{m};\{q_{\mathrm{s}}\})=0.$$
(5)

Power-law distribution model

In analyzing the heavy-tailed statistics of the θ-rotations, we mainly employ the power-law distribution model17,18,19, which supports the PDF and CCDF, as follows:

$$p(\theta )=\frac{\alpha -1}{{\theta }_{\min }}{\left(\frac{\theta }{{\theta }_{\min }}\right)}^{-\alpha },$$
(6)
$$P(\theta )={\left(\frac{\theta }{{\theta }_{\min }}\right)}^{-\alpha+1},$$
(7)

where α and θmin are the exponent and lower bound of the power-law model, respectively. The model is defined in the range α > 1, and the model parameter set is {qs} = {α}. For a given dataset, the log-likelihood becomes

$$L=M\,\log (\alpha -1)+M(\alpha -1)\log {\theta }_{\min }-\alpha \mathop{\sum }\limits_{m=1}^{M}\log {\theta }_{m}.$$
(8)

The MLE then leads to α, as

$$\alpha=1+M{\left[\mathop{\sum }\limits_{m=1}^{M}\log \left(\frac{{\theta }_{m}}{{\theta }_{\min }}\right)\right]}^{-1}.$$
(9)

We calculate an array of α values using Eq. (9) for all the possible values of θmin, where each pair of α and θmin comprises a candidate power-law model.
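Eq. (9) gives the exponent in closed form for each candidate lower bound; a minimal sketch of this step (our own illustrative implementation):

```python
import numpy as np

def powerlaw_mle(thetas, theta_min):
    """Eq. (9): closed-form MLE of the exponent alpha for the tail theta >= theta_min."""
    tail = thetas[thetas >= theta_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / theta_min))
```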

Power-law model with an exponential cutoff

To obtain a thorough confirmation of the heavy-tailed statistics, we test crossover distributions between a power-law and an exponential distribution. First, we apply the power-law model with an exponential cutoff, which is the truncated version of the original power-law model. The PDF and CCDF of the model are17,19:

$$p(\theta)=\frac{{\lambda_{\mathrm{c}}}^{1-\alpha_{\mathrm{c}}}}{\Gamma(1-\alpha_{\mathrm{c}},\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min})}\,\theta^{-\alpha_{\mathrm{c}}}e^{-\lambda_{\mathrm{c}}\theta},$$
(10)
$$P(\theta)=\frac{\Gamma(1-\alpha_{\mathrm{c}},\lambda_{\mathrm{c}}\theta)}{\Gamma(1-\alpha_{\mathrm{c}},\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min})},$$
(11)

where αc, λc, and θc,min are the power-law exponent, cutoff exponent and the lower bound of the model, respectively, and Γ(s,x) is the upper incomplete gamma function. The model is defined in the range of αc ≥ 0 and λc ≥ 0. The log-likelihood for the dataset {θ1, θ2, …, θM} is

$$L=M(1-\alpha_{\mathrm{c}})\log\lambda_{\mathrm{c}}-M\log\Gamma(1-\alpha_{\mathrm{c}},\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min})-\alpha_{\mathrm{c}}\sum_{m=1}^{M}\log\theta_{m}-\lambda_{\mathrm{c}}\sum_{m=1}^{M}\theta_{m}.$$
(12)

Although the MLE with the model parameters {qs} = {αc, λc} leads to the following relations:

$$\log\lambda_{\mathrm{c}}+\frac{\partial_{\alpha_{\mathrm{c}}}\Gamma(1-\alpha_{\mathrm{c}},\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min})}{\Gamma(1-\alpha_{\mathrm{c}},\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min})}=-\frac{1}{M}\sum_{m=1}^{M}\log\theta_{m},\\ \frac{1-\alpha_{\mathrm{c}}}{\lambda_{\mathrm{c}}}+\frac{\theta_{\mathrm{c},\min}{(\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min})}^{-\alpha_{\mathrm{c}}}e^{-\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min}}}{\Gamma(1-\alpha_{\mathrm{c}},\lambda_{\mathrm{c}}\theta_{\mathrm{c},\min})}=\frac{1}{M}\sum_{m=1}^{M}\theta_{m},$$
(13)

we instead employ the numerical minimization of –L with the constraints αc ≥ 0 and λc ≥ 0 due to the difficulty in handling the analytical derivative of the upper incomplete gamma function. We calculate the pairs of αc and λc for all the possible values of θc,min, where a set of αc, λc, and θc,min comprises a candidate for the model.
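This minimization can be delegated to a generic optimizer. The sketch below uses scipy, extending the upper incomplete gamma function to non-positive first arguments through the standard recurrence Γ(s, x) = [Γ(s + 1, x) − xs e–x]/s; the starting point, the bounds and the neglect of the pole at αc = 1 are our own simplifications:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gamma, gammaincc

def upper_gamma(s, x):
    """Gamma(s, x) for real s, recursing into the s <= 0 region that
    scipy's (regularized) gammaincc does not cover directly."""
    if s > 0:
        return gammaincc(s, x) * gamma(s)
    return (upper_gamma(s + 1.0, x) - x**s * np.exp(-x)) / s

def negative_L(params, tail, theta_min):
    """-L of Eq. (12), with the dataset restricted to theta >= theta_min."""
    a_c, lam_c = params
    M = len(tail)
    return -(M * (1.0 - a_c) * np.log(lam_c)
             - M * np.log(upper_gamma(1.0 - a_c, lam_c * theta_min))
             - a_c * np.sum(np.log(tail))
             - lam_c * np.sum(tail))

def fit_cutoff_powerlaw(thetas, theta_min, x0=(1.5, 1.0)):
    """Numerical MLE of (alpha_c, lambda_c) under the constraints of the text."""
    tail = thetas[thetas >= theta_min]
    result = minimize(negative_L, x0, args=(tail, theta_min),
                      bounds=[(0.0, None), (1e-9, None)])
    return result.x
```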

Log-normal distribution model

To cover the intermediate regime between the power-law and exponential distributions17, we employ another crossover distribution: the log-normal distribution model. The PDF and CCDF of the model are17,19:

$$p(\theta )=\frac{1}{\sigma \theta \sqrt{2\pi }}\exp \left(-\frac{{(\log \theta -\mu )}^{2}}{2{\sigma }^{2}}\right),$$
(14)
$$P(\theta)=\frac{1}{2}\left[1-\mathrm{erf}\left(\frac{\log\theta-\mu}{\sigma\sqrt{2}}\right)\right],$$
(15)

where μ and σ are the mean and standard deviation of log(θ), respectively, and erf is the error function. With the model parameters {qs} = {μ, σ}, the log-likelihood and the MLE relation are shown in Eqs. (16) and (17), respectively, as follows:

$$L=-\mathop{\sum }\limits_{m=1}^{M}\log {\theta }_{m}-M\,\log \sigma -\frac{M}{2}\,\log 2\pi -\mathop{\sum }\limits_{m=1}^{M}\frac{{(\log {\theta }_{m}-\mu )}^{2}}{2{\sigma }^{2}}.$$
(16)
$$\mathop{\sum }\limits_{m=1}^{M}\frac{\log {\theta }_{m}-\mu }{{\sigma }^{2}}=0,\frac{M}{\sigma }=\mathop{\sum }\limits_{m=1}^{M}\frac{{(\log {\theta }_{m}-\mu )}^{2}}{{\sigma }^{3}}.$$
(17)

Instead of utilizing the analytical MLE, we employ numerical minimization of –L with the constraint σ ≥ 0.

Exponential distribution model

For the comparison with models other than heavy-tailed distributions, we test the exponential distribution model17,19,21, which has the following PDF and CCDF:

$$p(\theta)=\lambda_{\mathrm{e}}e^{\lambda_{\mathrm{e}}\theta_{\mathrm{e},\min}}e^{-\lambda_{\mathrm{e}}\theta},$$
(18)
$$P(\theta)=e^{\lambda_{\mathrm{e}}\theta_{\mathrm{e},\min}}e^{-\lambda_{\mathrm{e}}\theta},$$
(19)

where the model parameter is {qs} = {λe}. The log-likelihood and the MLE relation are

$$L=M\log\lambda_{\mathrm{e}}+M\lambda_{\mathrm{e}}\theta_{\mathrm{e},\min}-\lambda_{\mathrm{e}}\sum_{m=1}^{M}\theta_{m},$$
(20)
$$\lambda_{\mathrm{e}}=M{\left[\sum_{m=1}^{M}\theta_{m}-M\theta_{\mathrm{e},\min}\right]}^{-1}.$$
(21)

We calculate an array of λe values using Eq. (21) for all the possible values of θe,min, where each pair of λe and θe,min comprises a candidate for the model.

Kolmogorov‒Smirnov test

In the power-law, power-law with an exponential cutoff and exponential distribution models, we obtain multiple candidates for the models with different values of lower bounds θmin, θc,min and θe,min, respectively. Each candidate of a model supports a distinct range of data for model validity and possesses different values of model parameters {qs}. To extract the optimum model among the candidates, we apply the Kolmogorov‒Smirnov (KS) test19,26. When the CDFs of the dataset and the statistical model are S(θ) and P(θ; θmin, {qs}) for the lower bound parameter θmin, we define the maximum distance D between the data and model distributions as:

$$D=\max_{\theta\ge\theta_{\min}}|S(\theta)-P(\theta;\theta_{\min},\{q_{\mathrm{s}}\})|.$$
(22)

We select θmin and the corresponding {qs} that minimize D, determining the optimum statistical model for each of the power-law, power-law with an exponential cutoff and exponential distribution models. This optimum model has the tail that best fits a given dataset within the definition of each distribution.
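For the power-law case, the whole selection collapses into a short scan; a sketch reusing the closed-form estimator of Eq. (9) (the minimum tail size is our own safeguard, not part of the original procedure):

```python
import numpy as np

def ks_select_powerlaw(thetas, min_tail=10):
    """Scan candidate lower bounds and keep the (theta_min, alpha) pair
    minimizing the KS distance D of Eq. (22)."""
    best = (np.inf, None, None)                          # (D, theta_min, alpha)
    for theta_min in np.unique(thetas):
        tail = np.sort(thetas[thetas >= theta_min])
        if len(tail) < min_tail:
            break                                        # too few points left to fit
        alpha = 1.0 + len(tail) / np.sum(np.log(tail / theta_min))
        S = np.arange(1, len(tail) + 1) / len(tail)      # empirical CDF
        P = 1.0 - (tail / theta_min) ** (1.0 - alpha)    # model CDF, from Eq. (7)
        D = np.max(np.abs(S - P))
        if D < best[0]:
            best = (D, theta_min, alpha)
    return best
```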

Fidelity for unitary matrices

We consider the n × n unitary matrix UnO and its defective counterpart UnD, which may be nonunitary in general. The cost function, or the square of the metric between the matrices, is defined by29:

$$J_{U}=\frac{1}{n^{2}}\sum_{i,j}{\left|{{U_{n}}^{\mathrm{O}}}_{(i,j)}-{{U_{n}}^{\mathrm{D}}}_{(i,j)}\right|}^{2}=\frac{1}{n}+\frac{1}{n^{2}}\mathrm{Tr}\left({({U_{n}}^{\mathrm{D}})}^{\dagger}{U_{n}}^{\mathrm{D}}-2\,\mathrm{Re}\left[{({U_{n}}^{\mathrm{D}})}^{\dagger}{U_{n}}^{\mathrm{O}}\right]\right),$$
(23)

where A(i,j) is the (i,j) matrix component and Tr(A) is the trace of the square matrix A. Because JU ≥ 0, we obtain the relationship:

$$n+\mathrm{Tr}\left({({U_{n}}^{\mathrm{D}})}^{\dagger}{U_{n}}^{\mathrm{D}}\right)\ge 2\,\mathrm{Re}\left(\mathrm{Tr}\left[{({U_{n}}^{\mathrm{D}})}^{\dagger}{U_{n}}^{\mathrm{O}}\right]\right),$$
(24)

where equality is achieved in the defect-free case UnD = UnO. Because the left-hand side of Eq. (24) is positive, normalizing the right-hand side by it yields the fidelity F(UnD, UnO) of Eq. (2) in the main text, which satisfies F ≤ 1 with equality only for UnD = UnO.

Photonic deep neural networks

To analyze the effect of defective unitaries on deep learning, we examine traditional supervised feedforward neural networks with the error backpropagation method34,35. In the forward propagation, the signal column vector of each layer Ψp is updated as:

$${\Psi }^{p+1}={W}^{p}{h}_{p}({\Psi }^{p}),$$
(25)

where hp(Ψp) denotes the application of the activation function to each component of Ψp through the computer-assisted simulation using electro-optic conversion4. We apply the hyperbolic tangent activation function49 in the hidden layers (p = 2 and 3) and the linear activation function in the input (p = 1) and output (p = 4) layers.

The training and test datasets are obtained with the forward propagation of the neural network using predefined weight matrices Wp. The elements of Wp are drawn from the uniform random distribution u[–1/Mp2, 1/Mp2]. The datasets are then generated by drawing the input vectors Ψ1 from u[0, 1] and applying them to the predefined neural network. The training and test datasets consist of 4000 and 1000 pairs of input and output realizations, respectively.

The error backpropagation for network training is defined with the following equation35:

$$\Delta^{p}=h^{\prime}(\Psi^{p})\circ\left[{(W^{p})}^{\mathrm{T}}\Delta^{p+1}\right],$$
(26)

where h′(Ψp) denotes the derivative of the activation function and ∘ is the Hadamard product. We utilize the mean square error (MSE) as the loss function for updating the weight parameters with the output-layer error vector Δ4. The weight matrices are updated with the mini-batch gradient descent (MGD) method50 by dividing the training dataset into four mini-batches. The MGD leads to the following update rules:

$$W^{p}(\tau_{\mathrm{B}}+1,\tau_{\mathrm{E}})=W^{p}(\tau_{\mathrm{B}},\tau_{\mathrm{E}})-\eta\left\langle\left(h^{\prime}(\Psi^{p})\circ\left[{(W^{p})}^{\mathrm{T}}\Delta^{p+1}\right]\right){\left[h(\Psi^{p})\right]}^{\mathrm{T}}\right\rangle,$$
(27)
$$W^{p}(0,\tau_{\mathrm{E}}+1)=W^{p}(4,\tau_{\mathrm{E}}),$$
(28)

where Wp(τB, τE) is the weight matrix between the pth and (p + 1)th layers at the (τE)th epoch after applying τB mini-batches, η = 2 × 106 is the learning rate and ⟨…⟩ is the average of the loss-function gradient over each mini-batch of the training dataset. After applying Eq. (27) for a mini-batch, we apply Eq. (25) and recalculate the MSE loss function before employing the next mini-batch or epoch in Eq. (27). Starting from the initially random Wp with u[–1/Mp2, 1/Mp2], we train the neural network for 300 epochs. At each epoch, the MSE loss function for complex numbers is estimated with the test dataset to obtain the learning curves in Fig. 5b. The R-squared regression estimator is calculated with the test dataset after 300 epochs of training to obtain the accuracy curves in Fig. 5c.
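For reference, a compact sketch of this training cycle (real-valued for brevity and using the standard outer-product form of the weight gradient; the layer sizes and tanh hidden activations follow the text, while the remaining scaffolding is our own):

```python
import numpy as np

def forward(Ws, x):
    """Eq. (25): propagate one sample; returns [Psi^1, ..., Psi^4].
    Input and output layers are linear; hidden layers use tanh."""
    psis = [x]
    for p, W in enumerate(Ws):
        h = np.tanh(psis[-1]) if p > 0 else psis[-1]
        psis.append(W @ h)
    return psis

def backward(Ws, psis, target):
    """Eq. (26): returns [Delta^2, Delta^3, Delta^4] for the MSE loss."""
    deltas = [psis[-1] - target]                  # linear output layer
    for p in range(len(Ws) - 1, 0, -1):
        dh = 1.0 - np.tanh(psis[p]) ** 2          # tanh derivative
        deltas.insert(0, dh * (Ws[p].T @ deltas[0]))
    return deltas

def gradient_step(Ws, psis, deltas, eta):
    """Single-sample version of the update in Eq. (27)."""
    for p in range(len(Ws)):
        h = np.tanh(psis[p]) if p > 0 else psis[p]
        Ws[p] -= eta * np.outer(deltas[p], h)
```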