Double-deep Q-learning to increase the efficiency of metasurface holograms

We use a double deep Q-learning network (DDQN) to find the material type and geometrical design that give metasurface holograms a high efficiency. The DDQN acts like an intelligent sweep and identified the optimal result among ~5.7 billion states after only 2169 steps. The search covered 23 different material types and various geometrical properties for a three-layer structure. The computed transmission efficiency was 32% for high-quality metasurface holograms, about twice the previously reported value under the same conditions. The resulting structure is transmission-type, polarization-independent, and works in the visible region.


Methods
We can tune the phase of the incident light by changing the geometrical properties of the metasurface. The metasurface consists of nano-antennas with dimensions much smaller than the operating wavelength, so the thin metasurface layer acts as a homogeneous medium whose refractive index n differs from that of the surrounding medium. The function of this thin layer is to apply a small phase delay to the light passing through it 26. The value of n can be controlled by changing the geometrical properties of the nano-antennas, so different structures produce different phase delays. By combining different structures we can therefore create a phase map, i.e., a collection of phase delays from −π to π.
Some authors use non-circular nano-antennas (e.g., V-shaped 25 or rectangular 22) to produce the required phase map. For example, V-shaped nano-antennas can easily be tuned to create the required phase delay by changing the length of the antenna arms or the angle between them. But they only work for a specific polarization. A structure that produces a polarization-independent phase delay should be cylindrical. The problem with cylindrical structures is that, unlike non-circular structures, they can only be tuned by their radius to generate the required phase delay (the thickness must be kept constant because of manufacturing limitations). This makes it very hard to find a structure that generates the whole phase map and at the same time has a high transmission efficiency. Adding the lattice constant and the material type (variables common to all holograms) to the problem makes it impractical for human researchers to check all the possibilities. Here we use an AI code to help us find the optimum structure.
Figure 1. (a) … Table 1. (b) Structure of the DDQN method used to optimize the metasurface hologram. (c) The optimal structure found by the DDQN, with a high efficiency for generating holograms. The DDQN determined that no grating or film is required, by setting the coverage of the grating to zero and the material of the film to glass.
www.nature.com/scientificreports

Structure definition.
Here we try to find the optimal structure type to achieve high efficiency in the visible range for transmission-type holograms. The metasurface structure that we chose is a nano-disk lying on a thin film that lies upon a grating, all on a glass substrate (Fig. 1(a)). Taking this structure as the starting point covers many possibilities. Combining the grating with nano-disks can increase transmission in metallic metamaterials by forming a structure similar to a Fabry-Perot cavity 41. Using a circular shape for the nano-antennas makes the hologram independent of polarization. All geometrical properties (except the disk radius) and material types are found by the DDQN. Importantly, the DDQN decides whether to use the starting structure as it is or to change it (removing the grating, the thin film, or both).
DDQN structure.
DDQN can be used to optimize a physical structure, as described previously 39. Briefly, based on the given and future rewards, the DDQN tries to connect the state of the structure to the action that should be taken (Fig. 1(b)). A DDQN has two neural network models: a main model and an auxiliary model. The auxiliary model is used to update the main model's weights, and the main model is used to predict the actions. Each model had three hidden layers with 24, 48, and 24 neurons and was trained with an Adam optimizer at a learning rate of 0.005 39. A Markov decision process 42 is used to predict the actions.
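As a rough illustration of how the two models interact, the sketch below computes the learning target commonly used in double deep Q-learning: the main model selects the best next action and the auxiliary model evaluates it. This is the textbook form, not necessarily the exact update used in this work. The discount factor `gamma = 0.95`, the function name, and the array values in the usage lines are assumptions made for illustration.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_main_next, q_aux_next, gamma=0.95):
    """Textbook double-DQN targets for a batch of transitions.

    q_main_next and q_aux_next have shape (batch, n_actions) and hold each
    model's Q-value estimates for the next state. gamma is an assumed value.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    q_main_next = np.asarray(q_main_next, dtype=float)
    q_aux_next = np.asarray(q_aux_next, dtype=float)

    # The main model chooses the action ...
    best_actions = np.argmax(q_main_next, axis=1)
    # ... and the auxiliary model scores that choice.
    evaluated = q_aux_next[np.arange(len(rewards)), best_actions]

    # Terminal transitions receive no bootstrapped future value.
    return rewards + gamma * evaluated * (~dones)

# Hypothetical batch of two transitions; the second one is terminal.
targets = double_dqn_targets(
    rewards=[425.0, 700.0],
    dones=[False, True],
    q_main_next=[[1.0, 3.0], [2.0, 0.0]],
    q_aux_next=[[10.0, 20.0], [5.0, 7.0]],
)
```

Separating action selection from action evaluation is what distinguishes the double variant from ordinary deep Q-learning, where a single model does both.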
The model generates its own training data: at the beginning by random guessing, and later, as the code progresses, by taking actions based on what it has learned so far. All of these data are saved in an experience replay. The experience replay is updated continuously as the model progresses (old data are replaced by new data), so the model always trains on recently generated data.
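A fixed-size buffer is one simple way to realize such an experience replay. The following is a minimal sketch; the class name, the capacity, and the tuple layout are assumptions, since the text does not specify them.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay: the oldest entries are dropped first."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when it is full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw a random mini-batch without replacement.
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)

# Hypothetical usage: five transitions pushed into a buffer of capacity 3.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=0.0, next_state=step + 1, done=False)
batch = buffer.sample(2)
```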
An epsilon-greedy method is used to create the initial database. This method determines when guessing should give way to learning. We define an epsilon that starts at 0.95 and decays to 0.1 with a decay rate of 0.995, as shown in Fig. 2. At each step, the code generates a random number. If this number is lower than epsilon, the model chooses the next action randomly (known as exploration); if it is higher, the model predicts the next action from what it has learned so far. Epsilon decays at each step until it reaches 0.1, which ensures that the model always keeps a 10 percent chance of exploration.
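The epsilon schedule and the action choice can be sketched as follows. The start value, floor, and decay rate are taken from the text; the assumption that epsilon is multiplied by the decay rate once per step, as well as all function names, are ours.

```python
import random

EPS_START, EPS_MIN, EPS_DECAY = 0.95, 0.1, 0.995

def epsilon_at(step):
    """Epsilon after `step` multiplicative decays, floored at EPS_MIN."""
    return max(EPS_MIN, EPS_START * EPS_DECAY ** step)

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: random action
    # Exploitation: index of the largest predicted Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)

def first_step_at_floor():
    """First step at which epsilon has decayed to its floor."""
    step = 0
    while epsilon_at(step) > EPS_MIN:
        step += 1
    return step
```

Under the per-step multiplicative assumption, epsilon reaches its floor of 0.1 after roughly 450 steps, which is of the same order as the transition to mostly learned actions reported for Fig. 2.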
At each step, the DDQN changes a geometrical property or material type of the structure. Based on the given feedback from the simulating environment it learns the effect of the change it made, and so it learns how to act better in future. The model consists of three parts: (1) the state of the structure at each step; (2) the action that should be taken to change the geometrical properties of the structure at each step; (3) a reward system that awards or penalizes the model for the action that it chose.
The state of the structure is composed of the geometrical properties and material types of the structure at each step:
• Nano-disk material type (D_M): 23 different materials.
• Thin film material type (F_M): 23 different materials.
We did not include the disk's radius in the parameters, because it is used to evaluate the structure's ability to produce the needed phase map. At each state, a separate loop is performed over the disk's radius from 45 nm to 190 nm, and the phases generated by the different radii are computed and saved. If the phase range generated by the structure is large enough for holographic use, the structure is considered a candidate for optimization by the DDQN. All of these processes are performed in the reward system.
Action definitions.
The next step is to define the actions, which determine what the model should do at each step. To change the material of the disks, film, and grating, we represented 23 materials in a matrix (Table 1). If the model wants to change the material of one of the parts, it simply changes the index into the material matrix for that part. Two actions are defined for changing the material of each part: one to increase the material's matrix index and one to decrease it. The definitions of the actions are shown in Table 2.
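The increase/decrease actions on a material index can be sketched as below. The state is shown with only the two material entries named above (D_M and F_M). The 1-based index range and the clamping at the ends of the material matrix are assumptions, as the text does not say how out-of-range indices are handled.

```python
N_MATERIALS = 23  # number of materials in the material matrix (Table 1)

def apply_material_action(state, part, delta):
    """Return a new state with the material index of `part` shifted by delta.

    delta is +1 (increase index) or -1 (decrease index). Indices are assumed
    to run from 1 to N_MATERIALS and are clamped at both ends (assumption).
    """
    new_state = dict(state)  # leave the original state untouched
    new_index = state[part] + delta
    new_state[part] = min(max(new_index, 1), N_MATERIALS)
    return new_state

# Hypothetical starting state.
state = {"D_M": 19, "F_M": 1}
after_increase = apply_material_action(state, "D_M", +1)
after_decrease = apply_material_action(state, "F_M", -1)
```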
Reward system.
The final step is to define a reward system. It gives feedback to the model at each step, so that the model learns to improve its actions in future steps. We designed the reward system to give the highest feedback to the phase-generating property of the structure and lesser feedback to its efficiency; i.e., the model prioritizes structures that extend the phase map and only then considers their efficiencies. To do this we divided the range of −π to π into six equal parts. The model gets 100 points for each of these parts that a structure generates, so in total it can get 600 points for a structure that generates the whole phase map. The model gets additional points equal to the minimum transmitted power of the structure times 100. For example, a structure that generates four phase parts and has a minimum transmitted power of 0.25 gets 4 × 100 + 0.25 × 100 = 425 points. This way, a structure that generates more phase parts is always preferred over a structure that generates fewer, regardless of efficiency. We chose this scheme because we seek a structure that can generate the whole phase map. The scheme also sets the terminating reward, which the DDQN model requires, to 700, i.e., the full 600 points for the phase map plus 100 points for a minimum transmitted power of 1. A score of 700 means that the model has reached its ideal structure and should stop looking for new structures. The final structure found by the DDQN is shown in Fig. 1(c).
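The scoring scheme described above can be written compactly as follows. This is a sketch under the assumption that a phase part counts as generated when at least one disk radius yields a phase inside it; the function names are illustrative.

```python
import math

N_PARTS = 6  # the range [-pi, pi] is divided into six equal parts

def covered_parts(phases):
    """Count how many of the six phase parts contain at least one phase."""
    width = 2 * math.pi / N_PARTS
    parts = set()
    for phi in phases:
        index = int((phi + math.pi) // width)
        parts.add(min(index, N_PARTS - 1))  # phi == pi falls in the last part
    return len(parts)

def reward(phases, transmissions):
    """100 points per generated phase part plus 100 x minimum transmission."""
    return 100 * covered_parts(phases) + 100 * min(transmissions)

# The worked example from the text: four phase parts, minimum power 0.25.
example = reward([-3.0, -2.0, -1.0, 0.5], [0.25, 0.4, 0.6, 0.3])
```

Because each phase part is worth as much as the entire transmission term, a structure covering one more part always scores at least as high as a more efficient structure covering fewer parts.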
Generating the hologram.
Now we discuss how the found structure can be used to generate holograms.
The first step is to find the phase that the structure generates. This is done by calculating the S-parameter, which gives the generated phase. We obtained the whole phase range of [−π, π] by changing the radius of the disk from 45 nm to 190 nm. The radius affects the phase and amplitude of the S-parameter of the transmitted light (Fig. 3(a)). It also affects the transmission (Fig. 3(b)), which is used to calculate the hologram's efficiency.
To generate the hologram, we need the phase map of our desired image. To find it, we used an algorithm 43 that creates a numerical phase map from a given image. Once we have the phase map, the next step is to construct it with metasurfaces (Fig. 4(a)). The phase map is a matrix of numbers. We replace each phase by its corresponding diameter (Fig. 3(a)); this yields a matrix of diameters from which we can construct an array of metasurfaces that creates the needed phase map, and from which we can also calculate the transmission (Fig. 3(b)). The full phase map contains all the phases. A Fourier transform of the phase map generates the hologram (Fig. 4(b)). This step is done physically with a Fourier transform lens 44 or numerically by applying a Fourier transform to the phase map matrix.
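The numerical route mentioned above can be sketched with a discrete Fourier transform. The sketch assumes a phase-only element of unit amplitude and ignores the per-cylinder transmission; the function name and the 8 × 8 test map are illustrative.

```python
import numpy as np

def reconstruct_hologram(phase_map):
    """Far-field intensity of a phase-only map via a 2D Fourier transform."""
    # Unit-amplitude field carrying the phase map (assumption: phase-only).
    field = np.exp(1j * np.asarray(phase_map, dtype=float))
    # Orthonormal FFT conserves total power; fftshift centers the zero order.
    far_field = np.fft.fftshift(np.fft.fft2(field, norm="ortho"))
    return np.abs(far_field) ** 2

# Sanity check: a flat phase map sends all power into the central pixel.
flat = np.zeros((8, 8))
image = reconstruct_hologram(flat)
```

With the orthonormal normalization the total intensity in the image plane equals the total intensity in the hologram plane, which is convenient when comparing reconstructions from different phase maps.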
To generate the whole phase map, we need all radii from 45 nm to 190 nm. However, fabricating radii with 1 nm precision is impossible in practice, so we chose only some of them. The result is known as an m-level phase map, in which m represents the number of chosen radii. For example, m = 6 means that the phase map has 6 levels or, in other words, only 6 radii. This procedure leads to loss of data and decreases the quality of the recovered image (Fig. 5(a-d)). Only 6 cylinders were used to calculate the final phase map, and since the average transmission power of those cylinders was higher than the average transmission power of all the cylinders, the efficiency of the 6-level phase map is higher in this case.
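An m-level phase map can be sketched as a quantization of each phase to the nearest of m allowed levels. The sketch assumes levels evenly spaced over [−π, π), with π wrapping around to −π; in the actual design the levels follow from the chosen radii, so the even spacing is an assumption.

```python
import math

def quantize_phase(phi, m):
    """Snap a phase in [-pi, pi] to the nearest of m evenly spaced levels."""
    step = 2 * math.pi / m
    # The modulo wraps the level at +pi back onto the one at -pi.
    k = round((phi + math.pi) / step) % m
    return -math.pi + k * step

def quantize_phase_map(phase_map, m):
    """Apply the m-level quantization to every entry of a 2D phase map."""
    return [[quantize_phase(phi, m) for phi in row] for row in phase_map]

# Hypothetical 2 x 2 phase map reduced to 6 levels.
six_level = quantize_phase_map([[0.0, 3.1], [-1.0, 1.2]], m=6)
```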

Results
The simulations were done at 532 nm (green) to be compatible with most recent experimental work in the visible range 21,23,45. The numerical simulations were performed in Lumerical and the machine learning codes were written in Python. A machine with a 16-core 3.40 GHz processor, 64 GB of RAM, and an NVIDIA GTX 1080 Ti GPU with 11 GB of GDDR5X memory was used. Although the number of states was ~5.7 billion, the DDQN found the optimal result in only 2169 steps. The model thus found the result quickly, which might suggest that it found it by random guessing. However, as Fig. 2 shows, after step 400 the model predicts its actions from what it has learned, and only 10 percent of the actions are random guesses. It should be noted that there might be better answers than the one the model found. Depending on how long we let the model run, how good the rival results are, the initial conditions, and the complexity of the problem, it may take more or fewer steps to find the optimum results.
It took a month for coding and running the model. The time is variable for different problems based on their complexity.
The final structure is as follows:
• Nano-disk material type (D_M): 19 (Indium phosphide) (Table 2 …
We can compute the efficiency of the hologram by calculating the average transmitted power from the phase map. We have the transmission for each of the radii (Fig. 3(b)), so by counting how often each radius is used in the phase map and averaging the transmission power accordingly, we can estimate the efficiency of the corresponding phase map (Fig. 5(a)). This is a rough estimate for two reasons. First, it neglects the coupling between adjacent cylinders. Second, each image has its own phase map, so a different arrangement of cylinders is used for each image, which results in different transmission efficiencies. The only way to find the transmission efficiency of a hologram correctly is therefore to fabricate it, and each image will have its own efficiency 21. Nevertheless, this method gives an approximate estimate of the average transmission efficiency, as shown in Fig. 4.
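The counting procedure described above amounts to a weighted average of the per-radius transmission. A minimal sketch follows; the radii and transmission values in the usage lines are made up for illustration and are not simulation results.

```python
from collections import Counter

def estimate_efficiency(radius_map, transmission_by_radius):
    """Average transmission over all cylinders in the map.

    radius_map is a 2D list of disk radii (nm); transmission_by_radius maps
    each radius to its simulated transmission. Coupling between adjacent
    cylinders is ignored, as in the rough estimate described in the text.
    """
    counts = Counter(r for row in radius_map for r in row)
    total = sum(counts.values())
    weighted = sum(n * transmission_by_radius[r] for r, n in counts.items())
    return weighted / total

# Hypothetical values, for illustration only.
radius_map = [[45, 45], [100, 190]]
transmission_by_radius = {45: 0.2, 100: 0.4, 190: 0.6}
efficiency = estimate_efficiency(radius_map, transmission_by_radius)
```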
The computed transmission efficiency was 32% for a high-quality recovered image (Fig. 5(a)). Compared with structures that have the same properties (transmission type, polarization independent, and operating in the visible regime), the transmission efficiency of our proposed structure is about twice that of ref. 21 (17% theoretical, 6% experimental) and much higher than that of other similar work 24 (<1%, experimental). It should be noted that what we computed here is the total transmission efficiency (defined as the ratio of the image intensity to the total power of the incident light), which should not be confused with the diffraction efficiency (defined as the ratio of the image intensity to the total power in the hologram plane 24). For the diffraction efficiency, the source monitor is placed after the hologram (unlike for the transmission efficiency, where it is placed before the hologram), so the diffraction efficiency is much higher than the transmission efficiency, since the effect of the hologram is neglected. A comparison of our hologram's transmission efficiency with some of the previously reported results is shown in Table 3.

Conclusion
Here, we used double deep Q-learning to optimize a hologram structure to increase its efficiency. The DDQN model optimized the geometrical properties and also found the best material types for the structure. The hologram structure reported here is transmission type, works in the visible range and is independent of polarization. The previously reported structures with these properties had a maximum of 17% transmission efficiency, but our AI code could find a structure that had a 32% transmission efficiency while yielding a high-quality output.

Data Availability
All data generated or analysed during this study are included in this published article (and its Supplementary Information files).