Machine learning powered ellipsometry

Ellipsometry is a powerful method for determining both the optical constants and the thickness of thin films. For decades, solving the ill-posed inverse ellipsometric problem has required substantial human-expert intervention, making it essentially a human-in-the-loop trial-and-error process that is not only tedious and time-consuming but also limits the applicability of ellipsometry. Here, we demonstrate a machine-learning-based approach that solves ellipsometric problems unambiguously and fully automatically while showing superior performance. The proposed approach is experimentally validated on a broad range of films spanning metals, semiconductors, and dielectrics. This method is compatible with existing ellipsometers and paves the way toward automatic, rapid, high-throughput optical characterization of films.


I. MORE RESULTS
FIG. S1. Fitting ambiguity in traditional techniques. By fitting only (Ψ, ∆), EP4Model finds multiple sets of (n, κ, d), all of which reproduce (Ψ, ∆) perfectly via the forward functions F and G (dashed lines in the first row; the model-generated spectra are shown as the corresponding solid lines), but none of them generates satisfactory (R, T), as shown in the third row. One must examine the goodness of fit on additional data, i.e., (R, T), in order to choose the best solution, which is d = 35.1 nm here. Such fitting ambiguity has also been reported in the literature; see Ref. [1] and references therein. In contrast, the SUNDIAL analyzes the experimental (Ψ, ∆, R, T) simultaneously to derive (n, κ, d).

II. DETAILS OF THE SUNDIAL METHOD
A. Architecture of the inverse and forward networks

Table I lists the features that compose the input and target of the forward and inverse networks of the SUNDIAL. Specifically, (n_film, κ_film, d_film, θ, n_sub, κ_sub, n_air, κ_air) are the input features and (Ψ, ∆, R, T) the output features of the forward networks; the roles are reversed for the inverse networks. Four neural networks are involved in the SUNDIAL: two (inverse and forward) for (Ψ, ∆) and two for (R, T). Each consists of three identical blocks of U-modules plus additional input and output layers to accommodate the input and target tensors, as shown in Fig. 2 in the main text and Fig. S5 here. Note that the U-module is similar to the U-Net [2] except that a residual-style addition, instead of concatenation, is used to combine features from the two arms.
Neural network architecture is of great importance in many successful applications of deep convolutional neural networks; milestones include the classical LeNet [3], AlexNet [4], ResNet [5], U-Net [2], etc. Novel architectures allow us to train very deep neural networks and achieve remarkable performance. For the inverse and forward networks in the SUNDIAL framework, we propose this "stacked residual U-modules" architecture as the backbone, which may be viewed as a variant of U-Net. Our experiments showed that it outperformed other architectures, including ResNet, by an order of magnitude, consistent with the fact that U-Net and its variants are usually the most effective architectures for dense regression applications.
Since the tasks that both the inverse and forward networks tackle in our case are clearly dense (spectrum) regression, it is not unexpected that the stacked residual U-modules significantly outperformed other architectures.
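The residual-style feature combination described above can be illustrated on plain arrays. The sketch below (not the actual SUNDIAL code; shapes and names are hypothetical) contrasts U-Net-style concatenation, which doubles the channel count, with the U-module's residual addition, which keeps it unchanged:

```python
import numpy as np

# Hypothetical feature maps from the two arms of a U-module,
# shaped (channels, spectral_points).
rng = np.random.default_rng(0)
down_feat = rng.standard_normal((32, 128))  # encoder (down) arm
up_feat = rng.standard_normal((32, 128))    # decoder (up) arm

# U-Net style: concatenate along the channel axis -> channel count doubles,
# so the next layer needs extra parameters to consume the skip path.
concat = np.concatenate([down_feat, up_feat], axis=0)
assert concat.shape == (64, 128)

# U-module style (this work): residual addition -> channel count unchanged.
added = down_feat + up_feat
assert added.shape == (32, 128)
```

Because addition preserves the channel count, the layers following the merge are identical in size to those in a plain encoder, which is one practical difference from the concatenation used in U-Net.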

B. Offline training on simulated data
To acquire domain/ellipsometric knowledge, the SUNDIAL is first trained offline on a large amount of simulated data, which, however, may deviate significantly from real-world experimental data. We applied data augmentation to increase the robustness of the trained model. The protocol is as follows: 1) shift the input left or right randomly by a few wavelength points (no more than 3 in our experiments); 2) multiply the input features by a small random scale (1 + g · N_r), where g is a random number drawn from a standard normal distribution and N_r denotes the noise level (no greater than 10^-3 here).
To clearly illustrate the effect of data augmentation, especially when the training data are insufficient or do not contain the noise to be dealt with, we conducted a series of experiments in which a relatively small training set of 300 samples was enhanced with different amounts of augmented samples, and plotted the accuracy (RMSE vs. N_r) in Fig. S6. When data augmentation was applied, the error was reduced by a significant amount. Heavier augmentation led to a larger reduction in error until a saturation point, beyond which the improvement between ×20 and ×40 augmentation was marginal. This demonstrates that the proposed data augmentation protocol is an effective tool to supplement the training dataset.
For both the inverse and forward networks, training on simulated data was done using the ADAM [6] optimizer with learning rate 10^-4 and weight decay 10^-5. We also applied a cosine annealing strategy to change the learning rate. The batch size was 64, and the maximum number of epochs was set to 2000 with early stopping. The loss function was the mean squared error, which is widely used for regression.
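The cosine annealing schedule mentioned above follows the standard form lr(t) = lr_min + ½ (lr_max − lr_min)(1 + cos(π t / T)). A sketch with the hyper-parameters stated in this section (lr_min = 0 and T = 2000 epochs are assumptions, since the supplement does not state them):

```python
import math

def cosine_annealing_lr(t, t_max, lr_max=1e-4, lr_min=0.0):
    """Learning rate at epoch t of a cosine annealing schedule that
    decays smoothly from lr_max to lr_min over t_max epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / t_max))

# Starts at the base learning rate (1e-4 in this work) ...
assert abs(cosine_annealing_lr(0, 2000) - 1e-4) < 1e-12
# ... and reaches lr_min at the final epoch.
assert cosine_annealing_lr(2000, 2000) == 0.0
```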

C. Online inference on experimental data
To bridge the gap between training on simulated data and inferring on real-world data, we propose a novel iterative inference strategy based on stochastic gradient descent. Unlike conventional inference, which computes outputs by a single forward pass through the neural networks and is doomed to fail in our case, our framework allows the neural modules to continue to adapt online on the real-world data until a satisfactory solution is obtained. The pseudo-code is shown in Algorithm 1. At each iteration t:
(a) Generate a number of random points in the neighborhood of x^(t) and compute the corresponding pseudo-measurements to form D^(t).
(b) Train the forward network on D^(t) for a number of iterations.
(c) Fix the weights Θ_F of the forward network and run one update step of the whole framework using (Ψ, ∆, R, T) as both input and target; only the weights of the inverse networks are updated in this step.
It is worth noting that our method is designed to be built upon modern deep learning infrastructure and can therefore easily benefit, e.g., in speed, from advances in computing hardware and software with very little modification of the method itself.
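The loop of sampling a neighborhood, refitting the forward model, and then updating the estimate can be illustrated with a deliberately simplified, self-contained toy. Here the forward network is replaced by a local linear surrogate refit at every iteration, and the inverse update is a single gradient step on a scalar estimate; the function `true_forward`, the one-dimensional setting, and all hyper-parameters are hypothetical, not the paper's:

```python
import numpy as np

def true_forward(x):
    # Hypothetical stand-in for the physical forward map F.
    return x ** 3 + x

def infer(y_measured, x0=0.5, steps=300, radius=0.1, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(steps):
        # (a) Sample random points around the current estimate and
        #     compute the corresponding pseudo-measurements.
        xs = x + radius * rng.standard_normal(16)
        ys = true_forward(xs)
        # (b) "Train" the forward surrogate on these pairs (linear fit
        #     here, a neural network in the actual method).
        slope, intercept = np.polyfit(xs, ys, 1)
        # (c) Fix the surrogate; run one gradient step on x so that the
        #     surrogate's prediction moves toward the measurement.
        residual = (slope * x + intercept) - y_measured
        x -= lr * 2.0 * residual * slope
    return x

x_true = 1.2
x_hat = infer(true_forward(x_true))
assert abs(x_hat - x_true) < 0.1
```

The key point the toy preserves is that the forward model is continually re-adapted around the current estimate instead of being frozen after offline training, which is what lets the inverse update remain consistent with the local behavior of the measurement.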

D. Robustness of the trained models
We investigate the robustness of the SUNDIAL, i.e., how noise in (Ψ, ∆, R, T) affects its accuracy. Since in the ellipsometric problem one cannot know (n, κ, d) beforehand, we adopt the approach of using synthetic/simulated data. We took 500 groups of (n, κ, d) and the corresponding synthetic (Ψ, ∆, R, T) from the test (sub-)set of the simulated dataset used for offline training, where all (n, κ, d) come from the Palik and Sopra databases. With this setting, we were able to calculate the model accuracy. It is worth noting that these 500 samples were not used for training but for testing the performance of the SUNDIAL on simulated data, as we followed the standard protocol of randomly partitioning the whole simulated dataset (6240 samples) into non-overlapping training (5240 samples), validation (500 samples), and test (500 samples) sets. To simulate data with different levels of noise, (Ψ, ∆, R, T) was multiplied by (1 + g · N_r), where g is a random number drawn from a standard normal distribution and N_r denotes the noise level. We plot the RMSE of (n, κ, d) with respect to N_r in Fig. S7. The SUNDIAL shows good accuracy on simulated data when N_r < 10^-3: about 5 × 10^-3, 3 × 10^-3, and 0.8 nm for the RMSE of n, κ, and d, respectively. On the other hand, as (Ψ, ∆, R, T) becomes too noisy, with N_r of 10^-2, the RMSE of (n, κ, d) increases.
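The perturbation and error metric used in this test can be sketched as follows; the data array is a placeholder, and only the multiplicative noise injection and the RMSE computation mirror the text:

```python
import numpy as np

def perturb(data, noise_level, rng):
    """Multiply the data by (1 + g * N_r) with g ~ N(0, 1),
    as in the robustness test above."""
    g = rng.standard_normal(data.shape)
    return data * (1.0 + g * noise_level)

def rmse(pred, target):
    """Root-mean-square error between prediction and ground truth."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(0)
clean = np.ones(1000)           # placeholder for a clean spectrum
noisy = perturb(clean, 1e-3, rng)
# The injected relative error is on the order of N_r.
assert abs(rmse(noisy, clean) - 1e-3) < 5e-4
```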

E. Noise in experimental data
The experimentally measured (Ψ, ∆) and (R, T) data inevitably contain noise (from, for example, power fluctuations of the light source, dark noise of the detectors of the ellipsometer and spectrometer, etc.). The noise of the measured data for Si is shown in Figs. S8(a) and S8(b). In practice, one may choose γ according to the relative importance, e.g., the noise level, of (Ψ, ∆) and (R, T). However, since both (Ψ, ∆) and (R, T) show quite small noise in our case (indicated by the dashed lines, below 6 × 10^-4), we have no strong evidence to favor either of them. We therefore suggest γ = 0.5 in our study to equilibrate the contributions from (Ψ, ∆) and (R, T), and indeed it showed good results in our experiments (see Fig. 4(c) in the main text).
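Reading the γ weighting above as a convex combination of the two mean-squared-error terms (an assumption; the exact loss is defined in the main text, and the names below are illustrative), it can be sketched as:

```python
import numpy as np

def combined_loss(pred_pd, meas_pd, pred_rt, meas_rt, gamma=0.5):
    """Weight the (Psi, Delta) fit against the (R, T) fit by gamma;
    gamma = 0.5 equilibrates the two contributions."""
    mse_pd = np.mean((pred_pd - meas_pd) ** 2)
    mse_rt = np.mean((pred_rt - meas_rt) ** 2)
    return gamma * mse_pd + (1.0 - gamma) * mse_rt

# With gamma = 0.5 the loss is the plain average of the two MSE terms.
a, b = np.full(4, 2.0), np.zeros(4)
assert combined_loss(a, b, b, b) == 2.0   # 0.5 * 4.0 + 0.5 * 0.0
```

Raising γ toward 1 would favor fitting (Ψ, ∆) at the expense of (R, T), which is why a noise-based choice of γ only makes sense when one channel is clearly less reliable than the other.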