Meta-neural-network for real-time and passive deep-learning-based object recognition

Analyzing scattered waves to recognize objects is of fundamental significance in wave physics. Recently emerged deep-learning techniques have achieved great success in interpreting wave fields, such as in ultrasound non-destructive testing and disease diagnosis, but conventionally need time-consuming computer postprocessing or bulky diffractive elements. Here we theoretically propose and experimentally demonstrate a purely passive and small-footprint meta-neural-network for real-time recognition of complicated objects by analyzing acoustic scattering. We prove that the meta-neural-network mimics a standard neural network despite its compactness, thanks to the unique capability of its metamaterial unit cells (dubbed meta-neurons) to produce deep-subwavelength phase shifts as training parameters. The resulting device exhibits the “intelligence” to perform desired tasks with the potential to overcome the current limitations, showcased by two distinctive examples: handwritten digit recognition and discerning misaligned orbital-angular-momentum vortices. Our mechanism opens the route to new metamaterial-based deep-learning paradigms and enables conceptual devices that automatically analyze signals, with far-reaching implications for acoustics and related fields.


Supplementary Note 1. Monopole source approximation
Owing to the airborne sound-hard walls in metamaterials, these deep-subwavelength meta-neurons can be regarded as acoustic waveguides with specific phase modulation.
The sound-hard material is thick enough to avoid wave coupling between adjacent unit cells [1], which is totally different from diffractive elements. Based on the Huygens-Fresnel principle, every point of a wavefront can be considered as a center of a secondary disturbance that gives rise to spherical wavelets, and the wavefront at any later instant can be regarded as the envelope of these wavelets [2]. Due to its finite geometric size, each meta-neuron can be treated as a square source on the plane incident wavefront, with its own amplitude and phase, in the process of wave propagation.
Then the acoustic pressure at an arbitrary location in the next layer can be calculated by using the Helmholtz-Kirchhoff formula, where $\mathbf{r} = (x, y, z)$ and $\mathbf{r}' = (x', y', z')$ refer to the source and observation points on two adjacent layers, respectively. Since the exact evaluation of the total field is difficult and requires solutions of the boundary, for the purpose of simplifying our meta-neural-network we make the following assumptions: $|\mathbf{r} - \mathbf{r}'| \gg 1$, where 0 = ′ − +
For a deep-learning neural network, the full connection physically requires that each meta-neuron in a specific layer effectively connects to all the meta-neurons in the neighboring layer, which can be mathematically described by using an ideal monopole source to replace the square meta-neuron in the following training process. Therefore, the directivity

Supplementary Note 2. Forward propagation model and softmax-cross-entropy loss
Due to the deep-subwavelength size of each meta-neuron, it is reasonable to approximately regard it as an ideal point source, as demonstrated in Supplementary Note 1. Then the wave propagation function between an arbitrary spatial location $\mathbf{r}$ and a specific meta-neuron can be expressed as $G(\mathbf{r}, \mathbf{r}^{l}_{m,n})$, where $\mathbf{r}^{l}_{m,n} = (x_m, y_n, z_l)$ refers to the position of the meta-neuron located at the m-th column and n-th row on the l-th layer (m = 1, 2, …, M, n = 1, 2, …, N and l = 1, 2, …, L, with the total row (column) number M (N) and the total layer number L chosen as 28 (28) and 2 here, respectively), $z_l$ is the position of the l-th layer on the z-axis, $k_0 = \omega/c$ is the wave number (with $\omega$ the angular frequency), c is the airborne sound velocity and $j = \sqrt{-1}$. Hence, as shown in Supplementary Figure 2, the input wave of the meta-neuron located at the r-th column and s-th row of the next layer (viz. the (l+1)-th layer) can be expressed as

$$p^{l+1}_{r,s} = \sum_{m=1}^{M} \sum_{n=1}^{N} G\left(\mathbf{r}^{l+1}_{r,s}, \mathbf{r}^{l}_{m,n}\right) t^{l}_{m,n}\, p^{l}_{m,n}, \qquad (5)$$

where $p^{l+1}_{r,s}$ is the pressure of the input wave impinging on the meta-neuron located at $(x_r, y_s, z_{l+1})$, $G(\mathbf{r}^{l+1}_{r,s}, \mathbf{r}^{l}_{m,n})$ is the wave propagation function between the two meta-neurons located at $(x_m, y_n, z_l)$ and $(x_r, y_s, z_{l+1})$ respectively, and $t^{l}_{m,n}$ is the amplitude and phase modulation of the meta-neuron located at $(x_m, y_n, z_l)$, defined as $t^{l}_{m,n} = A^{l}_{m,n} \exp(j\phi^{l}_{m,n})$ with $A^{l}_{m,n}$ and $\phi^{l}_{m,n}$ being the amplitude and phase modulation respectively. To build the analogy between our meta-neural-network with $M \times N$ meta-neurons on each of the L layers and a classic neural network containing $M \times N \times L$ neurons, we rewrite Supplementary Equation 5 into a more concise form.
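The layer-to-layer summation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the kernel `exp(j*k0*R)/R` is an assumed monopole form (the exact propagation function is not reproduced here), and the helper names `propagation`, `forward_layer` and `grid`, the 3 × 3 layer size, the pitch and the layer spacing are all hypothetical choices for illustration.

```python
import cmath
import math

def propagation(src, obs, k0):
    """Assumed monopole kernel exp(j*k0*R)/R between a source and an observation point."""
    R = math.dist(src, obs)
    return cmath.exp(1j * k0 * R) / R

def forward_layer(p_in, phases, src_pts, obs_pts, k0):
    """Input pressure on the next layer: every meta-neuron contributes to every output point."""
    t = [cmath.exp(1j * phi) for phi in phases]  # unit-amplitude, phase-only modulation
    return [
        sum(propagation(s, o, k0) * t_i * p_i for s, t_i, p_i in zip(src_pts, t, p_in))
        for o in obs_pts
    ]

def grid(n, pitch, z):
    """n x n square lattice of meta-neuron centres in the plane at height z."""
    return [(i * pitch, k * pitch, z) for i in range(n) for k in range(n)]

wavelength = 1.0                       # arbitrary length unit (assumption)
k0 = 2 * math.pi / wavelength          # wave number
layer1 = grid(3, 0.1 * wavelength, 0.0)
layer2 = grid(3, 0.1 * wavelength, 2.0 * wavelength)
p1 = [1.0] * len(layer1)               # uniform unit-pressure input
phases = [0.0] * len(layer1)           # untrained layer: no phase shift
p2 = forward_layer(p1, phases, layer1, layer2, k0)
```

In a training loop, `phases` would play the role of the learnable parameters, while the geometry entering `propagation` stays fixed.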
In general, the mean squared error (MSE) is often used for function approximation (viz., regression) problems because of its convenience in mathematical analysis, while the cross-entropy error function is commonly more suitable for the classification problems of interest here, where outputs are interpreted as probabilities [5]. The choice of loss function provides different levels of effective overparameterization, which greatly affects the final performance [6]. Problems with the MSE have also been analyzed and verified elsewhere [7]: "we have already seen that least-squares solutions lack robustness to outliers, and this applies equally to the classification application." Satisfactory results can be achieved only when the loss function is chosen appropriately, so that it accurately specifies the goal of the search.
For a quantitative evaluation of the difference between predicted and ground-truth probabilities, we introduce the cross-entropy loss and minimize the difference between the probabilities predicted by our meta-neural-network and the ground-truth probabilities by adjusting the phase modulation of the meta-neurons during the training process. The cross-entropy loss function [8] is defined in terms of the predicted probabilities and the q-th elements of the label corresponding to this digit. It is apparent that by introducing the softmax layer to optimize the training process, we ensure that the criterion of classification is still the region with the maximum acoustic intensity.
To verify the above conclusions in the specific object-recognizing tasks of our interest, here we take an example to explain why the MSE loss, which treats all classes equally, is less suitable for classification. In handwritten digit recognition, suppose the input digit is '0' (the corresponding label is (1,0,0,0,0,0,0,0,0,0)) and there are two kinds of outputs, one correctly classified and one incorrectly classified. In contrast to the MSE loss, the cross-entropy loss with an extra softmax layer (also called the 'softmax-cross-entropy loss', abbreviated as SCE loss) is more tolerant of the output in incorrect regions and focuses on the correct region. Taking the above as an example, the SCE loss of the correctly classified output is 1.8654, while that of the incorrect one is 2.4388.
The SCE loss thus favors the output with the lower loss, which is also the one that correctly recognizes the input digit. Compared with the MSE loss, the SCE loss focuses more on the correct class and tolerates non-zero intensity in other regions. The authors of Ref. [6] also prove that the softmax-cross-entropy (SCE) loss performs better than the mean squared error (MSE) loss due to the fact that "QL (quadratic loss, the definition is the same with MSE) focuses on fitting all classes, whereas cross-entropy (CE) focuses on only the correct class (associated with the label)".
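The contrast between the two losses can be reproduced with a small Python sketch. The two output vectors below are hypothetical and are not the values behind the losses quoted above; they are chosen only so that the MSE prefers the misclassified output while the softmax-cross-entropy prefers the correctly classified one.

```python
import math

def mse(output, label):
    """Mean squared error over all classes."""
    return sum((o - g) ** 2 for o, g in zip(output, label)) / len(label)

def softmax_cross_entropy(output, label):
    """Cross-entropy between the label and the softmax of the output (log-sum-exp form)."""
    m = max(output)
    log_sum = m + math.log(sum(math.exp(o - m) for o in output))
    return -sum(g * (o - log_sum) for o, g in zip(output, label))

label = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]                 # input digit '0'
# Hypothetical normalized intensities in the ten detection regions:
correct = [0.6, 0.4, 0.4, 0.4, 0.4, 0, 0, 0, 0, 0]     # brightest region is '0'
incorrect = [0.4, 0.6, 0, 0, 0, 0, 0, 0, 0, 0]         # brightest region is '1'

mse_correct, mse_incorrect = mse(correct, label), mse(incorrect, label)
sce_correct = softmax_cross_entropy(correct, label)
sce_incorrect = softmax_cross_entropy(incorrect, label)
```

With these assumed outputs, the residual intensity spread over the wrong regions raises the MSE of the correctly classified output above that of the misclassified one, whereas the SCE loss still ranks the correctly classified output lower.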

Supplementary Note 3. Role of the wave propagation function
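From the above derivation, the forward propagation function between two neighboring layers can be written as

$$\mathbf{P}^{l+1} = \mathbf{G}^{l}\left(\mathbf{t}^{l} \circ \mathbf{P}^{l}\right), \qquad (8)$$

where $\mathbf{P}^{l}$ denotes the input wave of the l-th metasurface, $\mathbf{G}^{l}$ is the wave propagation matrix, $\mathbf{t}^{l} = \mathbf{A}^{l} \exp(j\boldsymbol{\phi}^{l})$ is the amplitude and phase modulation introduced by the meta-neurons at the l-th metasurface, with $\mathbf{A}^{l}$ and $\boldsymbol{\phi}^{l}$ being the amplitude and phase modulations respectively, and '∘' denotes element-wise multiplication. A conventional neural network, by comparison, can be written as

$$\mathbf{x}^{l+1} = \sigma\left(\mathbf{w}^{l}\mathbf{x}^{l} + \mathbf{b}^{l}\right), \qquad (9)$$

where $\sigma$ is the nonlinear activation function, $\mathbf{w}^{l}$ is the weight and $\mathbf{b}^{l}$ is the bias.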
Comparison of Supplementary Equations 8 and 9 clearly reveals the equivalence between our proposed meta-neural-network and a conventional neural network. To be specific, the learnable parameters in our meta-neural-network are the phase modulations provided by the meta-neurons, and the wave propagation function between two neighboring layers of meta-neurons prevents the multi-layered meta-neural-network from degenerating into a monolayer one in physical systems. As a result of such equivalence, one can strictly prove that in our meta-neural-network, each meta-neuron connects to all the meta-neurons on the neighboring layer.
In order to verify the role of the wave propagation matrix, we use a bilayer meta-neural-network as an example and prove that a monolayer meta-neural-network cannot replace a bilayer one with nonzero physical distance between its two layers.

Similar to Supplementary Equation 8, the pressure distribution on the detection plane is
Therefore, in order to find a monolayer meta-neural-network that replaces this two-layer meta-neural-network, we need to seek a specific vector W that makes the following relationship hold
In other words, physically we have to build a single hidden layer with a learnable vector W that perfectly mimics the wave field produced by these two layers and the wave propagation function between them.
To simplify the derivation, we rewrite Supplementary Equation 8 as where $(\cdot)^{T}$ means the transpose of the vector, f(x) is a function that expands a vector

Thus Supplementary Equation 12 can be rewritten as
Similarly, Supplementary Equation 10 can be rewritten as
And Supplementary Equation 11 can be rewritten as
Assuming there exists a specific W such that one can arrive at a relationship, the wave propagation matrix $\mathbf{G}^{1}$ needs to satisfy
Clearly, in most cases there is no value satisfying all the equalities of Supplementary Equation 15 unless the distance between the two hidden layers is 0. When the axial distance between the two layers is zero, the wave propagation matrix $\mathbf{G}^{1}$ is an identity matrix and Supplementary Equation 13 becomes
As a result, only when the distance between two neighboring layers is zero can we find the matrix $\mathbf{W} = \mathbf{t}^{2} \circ \mathbf{t}^{1}$, the element-wise product of the modulations of the two layers, required for the bilayer meta-neural-network to degenerate into a monolayer one (in such a case the two layers in this physical model literally merge into a single layer). As a consequence, the wave propagation matrix prevents the multi-layered meta-neural-network from degenerating into a monolayer meta-neural-network in physical systems and thus plays a significant role in the meta-neural-network.
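The degeneracy argument can be checked numerically on a toy system. The sketch below is not the authors' derivation: it assumes two meta-neurons per layer, a hypothetical 2 × 2 propagation matrix `coupled` standing in for a nonzero layer distance, and it compares the field directly behind the second layer, since the propagation to the detection plane is common to both cases. For each input it computes the element-wise vector W that a monolayer would need in order to reproduce the bilayer field.

```python
import cmath

def hadamard(a, b):
    """Element-wise product of two vectors."""
    return [x * y for x, y in zip(a, b)]

def matvec(G, v):
    """Matrix-vector product."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]

def bilayer(p, t1, G1, t2):
    """Field just behind the second layer: t2 ∘ (G1 (t1 ∘ p))."""
    return hadamard(t2, matvec(G1, hadamard(t1, p)))

def fitted_monolayer(p, t1, G1, t2):
    """Element-wise W that reproduces the bilayer field for this particular input p."""
    return [out / x for out, x in zip(bilayer(p, t1, G1, t2), p)]

t1 = [cmath.exp(0.3j), cmath.exp(1.1j)]    # hypothetical phase modulations, layer 1
t2 = [cmath.exp(-0.7j), cmath.exp(2.0j)]   # hypothetical phase modulations, layer 2
identity = [[1, 0], [0, 1]]                # zero layer distance
coupled = [[1, 0.5j], [0.5j, 1]]           # assumed nonzero-distance propagation matrix
p_a, p_b = [1, 1], [1, -1]                 # two different input waves

W_zero_a = fitted_monolayer(p_a, t1, identity, t2)
W_zero_b = fitted_monolayer(p_b, t1, identity, t2)
W_far_a = fitted_monolayer(p_a, t1, coupled, t2)
W_far_b = fitted_monolayer(p_b, t1, coupled, t2)
```

With the identity matrix, both inputs yield the same W, equal to the element-wise product of `t2` and `t1`. With the coupled matrix, the required W changes with the input, so no single fixed monolayer reproduces the bilayer in this toy case.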
Moreover, the axial distance, which determines the wave propagation function, does not need to be optimized during the training process, since changing it does not appreciably improve the classification performance (see Supplementary Figure 3): the performance remains nearly stable for axial distances varying within a large range above this lower limit, which significantly simplifies the design of our meta-neural-network. Thus, the wave propagation function, which prevents the multi-layered meta-neural-network from degenerating into a monolayer one in physical systems and forms the connection between adjacent layers, is a hyperparameter rather than a learnable parameter. In this work, we do not introduce a nonlinear response to mimic the nonlinear activation function in conventional neural networks. However, by employing programmable active metamaterials [9] or tunable acoustic switches [10], it is possible to realize a nonlinear activation function in our physical model. Such devices, serving as acoustic switches, can be manipulated by secondary sound or AC voltage. By combining the meta-neurons with such switchable acoustic devices controlled by sound or other external energy sources, both the learnable parameters and the nonlinear activation function can be introduced into the system simultaneously. This will be the goal of our future work.

Supplementary Note 4. The comparison between meta-neural-network and diffractive neural network in compact systems
To demonstrate the key importance of the deep-subwavelength nature of meta-neurons, here we train a series of meta-neural-networks for which the thickness of each meta-neuron layer is unchanged but the width of a single building block becomes one half-wavelength (6 cm) and the layer distance is chosen such that the monopole approximation holds. Accordingly, the meta-neuron number in one layer becomes 10 × 10 (100 in total). The training result is shown in Fig. 2(b), which clearly reveals that, despite the subwavelength size of its meta-neurons, this meta-neural-network is far outperformed by a meta-neural-network composed of meta-neurons downscaled to the deep-subwavelength regime. This indicates that the deep-subwavelength structure is vital for the miniaturization of devices and their application to small objects.
In addition, we stress that diffractive elements cannot precisely provide the abruptly changing phase profile required under the mathematical framework proposed in the current work for realizing an ideal passive neural network. As a result, the training results of such a meta-neural-network of half-wavelength-sized neurons, shown in Fig. 2(b), cannot be outperformed by diffractive layers of equal size.
Supplementary Figure S4 shows the desired distribution calculated from the
We further investigate the essential difference between the meta-neural-network and a passive neural network built on diffractive layers, and show typical simulated intensity distributions on the detection plane produced by these two systems with an equal-sized bilayer structure (5 × 5 wavelengths) and the same neuron number (28 × 28), for ten handwritten digits as the objects, as shown in Supplementary Figure 5. We can see that the energy going through the meta-neural-network is accurately redistributed into the expected region corresponding to each handwritten digit. In stark contrast, the output pattern produced by the diffractive neural network is severely blurred and cannot be recognized accurately as the corresponding digit, because diffractive elements cannot provide the required subwavelength distribution of abrupt phase shifts. This clearly indicates the capability of our proposed meta-neural-network to work for devices and objects downsized to scales orders of magnitude smaller than achievable with a diffractive neural network, which cannot ensure production of the arbitrary and discrete phase distribution yielded by the training process, a distribution that is vital for the equivalence between the mathematical neural network model and the practical physical system.

Supplementary Figure 5 | Comparison of the intensity distributions on the detection plane produced by a meta-neural-network and a passive neural network based on diffractive layers. (a-j) show the acoustic intensity distributions of ten handwritten digits on the detection plane produced by the meta-neural-network and the diffractive neural network. The total acoustic intensity has been normalized with respect to the maximal value measured in all the detection regions assigned to the ten digits.

Supplementary Note 5. The design of acoustic meta-neurons
In the current study, we implement the meta-neuron by using a specific kind of acoustic

Supplementary Note 7. Preparation of the input data for the meta-neural-network
In the current study, we choose to demonstrate the unique functionality of our
The schematic diagram of the experimental setup is given in Fig. 1(a). In the experiment, the input sound was generated by a speaker (Beyma
In our proposed mechanism for designing a meta-neural-network, the classification criterion for a specific digit-shaped object is that the total acoustic energy gathered in the detection region assigned to this digit is higher than in the remaining regions. Thanks to this design, there is no need to obtain the fine spatial distribution of acoustic intensity within the detection region, which usually requires a complicated array comprising a large number of sensors or moving a sensor spatially for a point-by-point scan of the acoustic field. Instead, at the receiving end we conveniently measure the total acoustic intensity in each detection region by using a fixed number of sensors. By attaching the sensor to the small throat of a tapered structure with an exponential profile and a square cross-section, which acoustically behaves like a near-reflectionless acoustic energy concentrator, we use a single sensor to realize the energy integration over a specific detection region. In addition, the total sensor number is as small as the number of object classes (equal to the number of detection regions and chosen as 10 here) regardless of the resolution or overall size of the meta-neural-network, which is a unique advantage over conventional active-device-based deep learning mechanisms.
The typical experimental results of the total acoustic intensity measured in all the detection regions for ten specific objects shaped as handwritten digits from 0 to 9 are depicted in Supplementary Figure 9. They verify the passive, real-time and sensor-scanning-free object-recognizing functionality of the fabricated meta-neural-network, which engineers the wavefront of the wave scattered by the object and generates the highest total acoustic intensity in the desired detection region.
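The classification criterion described above, comparing the total intensity integrated over each detection region, can be sketched in a few lines of Python. The 4 × 4 intensity map and the four rectangular regions are hypothetical placeholders, not the measured data or the actual layout of the ten detection regions.

```python
def region_totals(intensity, regions):
    """Sum the intensity over each rectangular region (row0, row1, col0, col1), ends exclusive."""
    return [
        sum(intensity[r][c] for r in range(r0, r1) for c in range(c0, c1))
        for (r0, r1, c0, c1) in regions
    ]

def classify(intensity, regions):
    """Index of the detection region that collects the highest total intensity."""
    totals = region_totals(intensity, regions)
    return max(range(len(totals)), key=totals.__getitem__)

# Hypothetical intensity map on the detection plane and four detection regions.
intensity = [
    [0.1, 0.1, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.2],
    [0.9, 0.8, 0.1, 0.1],
    [0.7, 0.9, 0.2, 0.1],
]
regions = [(0, 2, 0, 2), (0, 2, 2, 4), (2, 4, 0, 2), (2, 4, 2, 4)]
predicted = classify(intensity, regions)
```

Only one number per region is needed for the decision, which mirrors the use of a single energy-integrating sensor per detection region.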

Supplementary Figure 9 | Typical experimental results of accurate object recognition by the fabricated meta-neural-network. (a-j) show the total acoustic intensity measured on the detection plane (right column) for ten specific objects (left column) shaped as handwritten digits from 0 to 9, respectively. The total acoustic intensity has been normalized with respect to the maximal value measured in all the detection regions assigned to the ten digits.

Supplementary Note 9. The influence of experimental error in phase shift of meta-neurons on the classification accuracy
In the measurements, we tested 20 objects shaped as handwritten digits (viz. 2 for each digit), and the experimental results show that the meta-neural-network prototype recognized all the digits accurately, except that the two digits "4" were mistakenly recognized as "3". We believe that such incorrect recognition primarily stems from experimental errors, especially the imperfect fabrication of the metamaterial sample. In the experiment, we fabricated the meta-neural-network prototype via 3D printing, using a machine with a precision of 0.1 mm. Considering the actual size of an individual meta-neuron used in the experiment, such a fabrication error leads to an uncertainty in the phase shift within the range of $\pm\pi/18$, which is sufficiently large to cause an appreciable move of the output focal region from the detection region corresponding to "4" to region "3" when the meta-neural-network interacts with the scattered wave produced by the object. Due to the difficulty of precisely obtaining the actual phase profiles provided by the practical sample of the meta-neural-network, which contains a large number of unit cells (exceeding 1500), we investigate the potential reason for the misclassification of the digit "4" only via numerical simulations. As a result, we numerically prove that when an extra phase shift within
show that our practical meta-neural-network prototype only accurately recognized 1 of all the 7 digit "4" samples, and all the misclassified "4" are recognized as "3" as expected. We therefore conclude that this problem will not affect the effectiveness of our mechanism and can be easily solved by increasing the fabrication precision of our meta-neural-network prototype.

Supplementary Note 10. The recognition of multiplexed orbital angular momentum beams
With the infinite dimensionality of its Hilbert space, orbital angular momentum (OAM) offers the possibility to dramatically improve the capacity of waves as information carriers, which is particularly crucial for acoustic waves that dominate underwater communications [13,14]. However, the de-multiplexing of such spatially multiplexed information carried by many twisted beams with different topological charges (TCs) essentially needs accurate recognition of spatial patterns far more complicated than the above digit-shaped objects can produce. In contrast to the active de-multiplexing mechanisms relying on elaborate sensor arrays and time-consuming postprocessing [13], purely passive paradigms offer a simple and low-cost solution for real-time de-multiplexing without sensor scanning or postprocessing (such as by using Dammann gratings, Q-plates, or metasurfaces [14-16]), but still suffer from uncontrollable spatial locations of the output beams and, in particular, from misalignment between transmitter and receiver, which leads to severe inter-channel crosstalk. Here we demonstrate that our meta-neural-network offers a new mechanism capable of going beyond these fundamental barriers and recognizing TCs in real time from non-aligned OAM beams.
In the current study, we use a four-layer meta-neural-network (101 × 101 × 4 ,
Firstly, we calculate the pressure field of multiplexed OAM beams after they propagate 500 cm and 700 cm in free space respectively, and set these pressure fields as the input of the first layer, as shown in Fig. 4(a). Secondly, the misalignment between the centres of the OAM beams and the meta-neural-network is set to be independent along two orthogonal directions $(r, \theta)$ (viz., along and perpendicular to the radial direction).
Supplementary Figure 11(a) demonstrates the maximal misalignment range allowed by our current design of the meta-neural-network for recognizing three multiplexed OAM beams composed of TCs = +3, ±4 (marked by the green circle), and the pressure distribution of a typical OAM beam with misalignment chosen as ($r = 6\lambda$, $\theta$ = …).
Supplementary Figure 11(b) illustrates the zoom-in view of the pressure distribution in the blue square region in Supplementary Figure 11(a) as the input of the first layer, which is identical to Fig. 4(a) in the manuscript. Last but not least, we use the pressure distribution in complex values as the raw data in training. To make sure that the dataset covers sufficiently large misalignments, the ranges of $r$ and $\theta$ are chosen as $[0, 6\lambda]$ and $[0, 2\pi)$ respectively, which can be regarded as significant given that the side length of the metasurface is about 18λ. Following this procedure, we generate training data with 80000 multiplexed OAM beams and 10000 testing ones.
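The sampling of misaligned inputs can be sketched as follows. This is a simplified illustration, not the procedure used to generate the dataset: each beam is reduced to an idealized vortex phase term exp(j·l·φ) with no radial profile and no free-space propagation, and the function names, the evaluation point and the random seed are hypothetical.

```python
import cmath
import math
import random

def multiplexed_oam_phase_field(x, y, charges, center=(0.0, 0.0)):
    """Sum of idealized vortex terms exp(j*l*phi) about a (possibly shifted) beam centre."""
    phi = math.atan2(y - center[1], x - center[0])
    return sum(cmath.exp(1j * l * phi) for l in charges)

def random_misalignment(max_r, rng):
    """Draw a beam-centre offset with r in [0, max_r] and theta in [0, 2*pi)."""
    r = rng.uniform(0.0, max_r)
    theta = 2.0 * math.pi * rng.random()
    return (r * math.cos(theta), r * math.sin(theta))

wavelength = 1.0                               # arbitrary length unit (assumption)
rng = random.Random(0)                         # fixed seed for reproducibility
center = random_misalignment(6.0 * wavelength, rng)
# Complex-valued sample of three multiplexed beams with TCs +3, +4 and -4.
sample = multiplexed_oam_phase_field(2.0, 1.0, charges=(+3, +4, -4), center=center)
```

Repeating the draw for every training example and evaluating the field on the grid of first-layer meta-neurons would give complex-valued inputs covering the stated misalignment range.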
Supplementary Figure 11 | The misalignment between the centre of OAM beams and the centre of metasurface.
The training process for the recognition of multiplexed OAM beams is similar to that for the recognition of handwritten digits described above, except for the recognition criterion.
In this task, the detection plane is divided into eight regions representing eight modes respectively. Notice that the assignment of detection region locations is entirely free and independent of the TC. Each region has two areas marked by 'Y' and 'N', as shown in Fig. 4(a). The existence of a specific OAM mode is judged by comparing the magnitude of the total sound energy within these two areas in the corresponding region. If sound