Deep learning in optical metrology: a review

With the advances in scientific foundations and technological implementations, optical metrology has become a versatile problem-solving backbone in manufacturing, fundamental research, and engineering applications, such as quality control, nondestructive testing, experimental mechanics, and biomedicine. In recent years, deep learning, a subfield of machine learning, has emerged as a powerful tool to address problems by learning from data, largely driven by the availability of massive datasets, enhanced computational power, fast data storage, and novel training algorithms for deep neural networks. It is currently attracting increasing interest and extensive attention for its use in the field of optical metrology. Unlike the traditional “physics-based” approach, deep-learning-enabled optical metrology is a “data-driven” approach, which has already provided numerous alternative solutions, with better performance, to many challenging problems in this field. In this review, we present an overview of the current status and the latest progress of deep-learning technologies in the field of optical metrology. We first briefly introduce both traditional image-processing algorithms in optical metrology and the basic concepts of deep learning, followed by a comprehensive review of its applications in various optical metrology tasks, such as fringe denoising, phase retrieval, phase unwrapping, subset correlation, and error compensation. The open challenges faced by the current deep-learning approach in optical metrology are then discussed. Finally, the directions for future research are outlined.

For many phase measuring optical metrology techniques, including optical interferometry, digital holography, electronic speckle pattern interferometry, Moiré profilometry, and fringe projection profilometry, the physical quantities to be measured (such as the surface shape, displacement, strain, roughness, defects, etc.) are directly or indirectly encoded in the phase information of the fringes formed by means of interference or projection. Consequently, phase demodulation, which analyzes the quasi-periodic fringe pattern for the wrapped phase extraction, is the most critical step because the measurement accuracy of these optical metrology techniques depends directly on the phase demodulation accuracy of recorded fringe patterns. How to extract the phase information with the highest accuracy, fastest speed, and full automation remains a research hotspot in the field of optical metrology.
As discussed in the Section "Image processing in optical metrology" of the main text, traditional fringe pattern analysis, or phase demodulation techniques, can be broadly classified into two categories: spatial phase demodulation and temporal phase demodulation:
• Temporal phase demodulation techniques detect the phase distribution from the temporal variation of fringe signals 1. The most well-established of these, the phase-shifting technique, extracts the phase information from multiple phase-shifted fringe patterns, achieving high-resolution pixel-wise phase measurement at the cost of time-sequential data acquisition. Consequently, such approaches are vulnerable to disturbances such as object motion and environmental vibration, which makes them difficult to apply to high-speed measurements.
• Spatial phase demodulation methods, such as Fourier transform (FT) 2, windowed Fourier transform (WFT) 3, and wavelet transform (WT) 4 methods, are capable of estimating the phase distribution from a single fringe pattern, making them insensitive to vibration and motion and appropriate for dynamic measurement. However, they are sensitive to fringe discontinuities, isolated samples, and rich details of the tested surfaces, which prevents them from achieving high-precision and high-resolution phase measurement of complex surfaces. In addition, spatial phase demodulation methods usually have many parameters that must be adjusted to achieve good results, making them difficult to fully automate in practical applications.
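To make the single-frame idea concrete, the following is a minimal one-dimensional sketch of Fourier-transform demodulation. It is an illustration under stated assumptions, not the implementation of references 2 to 4: the function names, the synthetic fringe, the carrier frequency of 16 periods, and the band-pass half-width of 8 bins are all hypothetical choices. The carrier and half-width are exactly the kind of hand-tuned parameters mentioned above.

```python
import cmath
import math

def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short 1D signals)."""
    n = len(signal)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for x in range(n):
            acc += signal[x] * cmath.exp(sign * 2j * math.pi * k * x / n)
        out.append(acc / n if inverse else acc)
    return out

def ft_demodulate(fringe, carrier, half_width):
    """Wrapped phase of a 1D fringe whose carrier frequency is known (assumption)."""
    n = len(fringe)
    spectrum = dft(fringe)
    # Keep only the +1 order lobe around the carrier; zero everything else.
    filtered = [c if abs(k - carrier) <= half_width else 0j
                for k, c in enumerate(spectrum)]
    analytic = dft(filtered, inverse=True)
    # Remove the carrier; cmath.phase returns a value wrapped to [-pi, pi].
    return [cmath.phase(c * cmath.exp(-2j * math.pi * carrier * x / n))
            for x, c in enumerate(analytic)]

# Synthetic test fringe: smooth phase modulation on a 16-period carrier.
n, carrier = 128, 16
true_phase = [0.5 * math.sin(2 * math.pi * x / n) for x in range(n)]
fringe = [127.5 + 127.5 * math.cos(2 * math.pi * carrier * x / n + true_phase[x])
          for x in range(n)]

estimate = ft_demodulate(fringe, carrier, half_width=8)
max_error = max(abs(e - t) for e, t in zip(estimate, true_phase))
print(f"max phase error: {max_error:.2e}")
```

The sketch works well here because the synthetic phase is smooth and band-limited. A discontinuity in the fringe would spread energy outside the retained lobe, which is the failure mode described in the bullet above.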
The goal of this Supplementary Information is to present a simple but representative example of how to apply deep learning to optical metrology. More specifically, you will learn how to build a deep neural network (DNN) with a fully convolutional network architecture (U-Net) and how to train the network for phase demodulation from a single fringe pattern. Instead of adopting an end-to-end learning scheme that directly links the input fringe image to the output phase map, here we choose to predict the numerator and denominator terms of the arctangent function of the phase map from one input fringe pattern [Fig. S1], based on two basic considerations: (1) predicting the phase through the arctangent function bypasses the difficulties associated with reproducing abrupt 2π phase wraps and thus yields a high-quality phase estimate; (2) such a strategy follows the physical model of the traditional phase-shifting method, which removes the influence of the surface reflectivity by the division operation, making the trained model suitable for objects with complex surfaces 5,6.
It will be demonstrated that the developed and trained deep neural network can accomplish the phase demodulation task accurately and efficiently, using only a single fringe pattern. Thus, it combines the single-frame strength of the spatial phase demodulation methods with the high measurement accuracy of the temporal phase demodulation methods.
Source codes for this tutorial are in the "Supplementary code" folder. In order to run the source codes, the user must navigate to the work directory (main folder "Supplementary code"). The main folder contains two Python files (*.py) [Fig. S3]:
• step1 train phase demodulation.py: the source codes for training U-Net Phase;
• step2 test phase demodulation.py: the source codes for testing U-Net Phase.
The sample code provided here was run on a desktop (Linux system) with an Intel Core i7-7800X CPU and a GeForce GTX 1080 Ti GPU (NVIDIA), using Python 3.6 and the deep learning framework Keras 2.2.4 on the TensorFlow 1.10.1 platform (Google). It was run in Spyder 5.0.5, an open-source cross-platform integrated development environment for scientific computing in the Python language.

The datasets for the phase demodulation task have been uploaded to figshare (https://figshare.com/s/f150a36191045e0c1bef) [Fig. S4]. In this example, the data in the following two subfolders are used:
• Input of CNN1: The input data are in this folder and named "LeftImage";
• Output of CNN1: The corresponding ground truth data are in this folder, and the numerator and denominator are named "M" and "D", respectively.

Network architecture design
In principle, any fully convolutional network architecture can be used for the different types of image processing tasks encountered in optical metrology. As discussed in the Subsection "Advantages of invoking deep learning in optical metrology", deep learning subverts the conventional "physics-model-driven" paradigm and opens up the "data-driven" learning-based representation paradigm, which eliminates the need to design different processing flows for specific image processing algorithms based on experience and prior knowledge. By applying different types of training datasets, one specific class of neural network can be trained to perform various types of transformation for different tasks, significantly improving the universality and reducing the complexity of solving new problems. In this regard, we adopt a typical fully convolutional network architecture, U-Net 7, to predict the numerator and denominator terms of the arctangent function of the phase map from one input fringe pattern. Considering the task that U-Net aims at, we call the designed network U-Net Phase here. Its detailed inner architecture is shown in Fig. S5. A 3D tensor of size (H, W, C0) is used as the input of the network, where (H, W) is the size of the input images and C0 is the number of input images, which is one in this case. For each convolutional layer, the kernel size is 3×3 with a convolution stride of one, zero-padding is used to control the spatial size of the output, and the output is a 3D tensor of shape (H, W, C), where C = 64 is the number of filters used in each convolutional layer.
Figure S5. The detailed structure of U-Net Phase.
• In the first path of U-Net Phase, the input is processed by a convolutional layer, followed by a group of four residual blocks and another convolutional layer. Each residual block consists of two sets of convolutional layers activated by the rectified linear unit (ReLU) 8 stacked one above the other, which can solve the degradation of accuracy as the network becomes deeper and ease the training process 9. Also, implementing shortcuts between residual blocks contributes to the convolution stability.
• In the other three paths, the data are down-sampled by the pooling layers by factors of two, four, and eight, respectively, for better feature extraction, and then up-sampled by the upsampling blocks to match the original size. The input data pass through a convolutional layer with ReLU activation. We then use quadruple filters to extract features from the input, providing rich information for the following upsampling.
• The outputs of the four paths are concatenated into a tensor with quadruple the channels. Finally, two channels are generated in the last convolutional layer.
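The size and channel bookkeeping of the four paths described above can be sketched as follows. This is a shape calculation only, not the network itself. It assumes that every path ends with C = 64 feature maps before concatenation and uses the 480 × 640 camera image size as the example input; the function names are hypothetical.

```python
def path_shapes(height, width, pool_factor, filters=64):
    """Track (H, W, C) through one path: pool, convolve, then upsample back."""
    if height % pool_factor or width % pool_factor:
        raise ValueError("image size must be divisible by the pooling factor")
    pooled = (height // pool_factor, width // pool_factor, filters)
    restored = (pooled[0] * pool_factor, pooled[1] * pool_factor, filters)
    return pooled, restored

def unet_phase_output_shape(height, width, pool_factors=(1, 2, 4, 8), filters=64):
    """Shapes after concatenating the four paths and after the last convolution."""
    restored = [path_shapes(height, width, p, filters)[1] for p in pool_factors]
    # All paths must agree spatially before channel-wise concatenation.
    assert all(s[:2] == (height, width) for s in restored)
    concatenated = (height, width, sum(s[2] for s in restored))
    output = (height, width, 2)  # numerator M and denominator D
    return concatenated, output

concatenated, output = unet_phase_output_shape(480, 640)
print(path_shapes(480, 640, 8)[0])  # (60, 80, 64)
print(concatenated)                 # (480, 640, 256)
print(output)                       # (480, 640, 2)
```

One practical consequence is that the image height and width should be divisible by eight, the largest pooling factor, so that the up-sampled paths match the original size exactly.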
Except for the last convolutional layer, which is activated linearly, all other layers use the ReLU as the activation function, i.e., ReLU(x) = max(0, x). Here we use a classic loss function, the mean squared error of the outputs with respect to the ground truth, which is computed as:
$$\mathrm{Loss}(\theta) = \frac{1}{H \times W}\left[\left\| P^{\theta}_{M} - G_{M} \right\|^{2} + \left\| P^{\theta}_{D} - G_{D} \right\|^{2}\right] \tag{S1}$$
where $G_M$ and $G_D$ are the ground truth of the numerator and denominator, and $P^{\theta}_{M}$ and $P^{\theta}_{D}$ are the numerator and denominator predicted by the neural network with the parameter space θ, which includes the weights, biases, and convolutional kernels. During training, the network uses the value of the loss function as a feedback signal to adjust the parameters in θ slightly, in the direction that lowers the loss. To this end, adaptive moment estimation (ADAM) is used to tune the parameters and find the minimum of the loss function.
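The three ingredients named above (ReLU, the mean-squared-error loss over the two outputs, and an ADAM update) can be written out in a few lines. This is a minimal sketch, not the tutorial's Keras code: the per-pixel averaging in `mse_loss` is an assumed normalization, the 2 × 2 maps are made-up numbers, and the scalar problem minimized by `adam_minimize` is a toy stand-in for network training.

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def mse_loss(pred_m, pred_d, truth_m, truth_d):
    """Squared errors of both outputs, averaged over the H x W pixels."""
    height, width = len(truth_m), len(truth_m[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            total += (pred_m[i][j] - truth_m[i][j]) ** 2
            total += (pred_d[i][j] - truth_d[i][j]) ** 2
    return total / (height * width)

def adam_minimize(grad, theta, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar ADAM: bias-corrected moving averages of g and g^2 scale each step."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

truth_m = [[1.0, 2.0], [3.0, 4.0]]
truth_d = [[0.0, 1.0], [0.0, 1.0]]
pred_m = [[1.0, 2.0], [3.0, 5.0]]  # one pixel is off by 1
pred_d = [[0.0, 1.0], [0.0, 1.0]]
print(mse_loss(pred_m, pred_d, truth_m, truth_d))  # 0.25
print(relu(-2.0), relu(1.5))                       # 0.0 1.5

# Toy stand-in for training: minimize (theta - 3)^2, whose gradient is 2(theta - 3).
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(round(theta, 3))
```

In a real network, θ holds millions of parameters and the gradient comes from backpropagation, but each parameter follows the same update rule.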
The codes on lines 188 to 238 in "step1 train phase demodulation.py" [Fig. S6] correspond to the structure of U-Net Phase. Figure S7 explains the "Conv2D" function used to construct a convolution layer, which takes several important parameters:
• num filter conv: the number of filters;
• num conv: the convolution kernel size;
• activation: the activation function;
• padding: the filling type of the convolution, either "valid" or "same";
• kernel initializer: the initializer of the kernel weight matrix.
Users can modify the network structure or replace it with a network of another structure according to their needs.
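The difference between the "valid" and "same" filling types listed above is easiest to see on a tiny example. The following pure-Python convolution is a sketch for illustration, not the Keras "Conv2D" layer: it handles a single channel, a stride of one, and an odd-sized kernel, and like deep learning libraries it computes a cross-correlation (the kernel is not flipped). The function name and test image are hypothetical.

```python
def conv2d(image, kernel, padding="same"):
    """Single-channel, stride-1 convolution with 'same' (zero) or 'valid' padding."""
    kh, kw = len(kernel), len(kernel[0])
    if padding == "same":
        ph, pw = kh // 2, kw // 2  # assumes odd kernel sizes such as 3 x 3
        width = len(image[0]) + 2 * pw
        padded = ([[0.0] * width for _ in range(ph)]
                  + [[0.0] * pw + list(row) + [0.0] * pw for row in image]
                  + [[0.0] * width for _ in range(ph)])
    elif padding == "valid":
        padded = [list(row) for row in image]
    else:
        raise ValueError("padding must be 'same' or 'valid'")
    out_h = len(padded) - kh + 1
    out_w = len(padded[0]) - kw + 1
    return [[sum(padded[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]  # 3 x 3 averaging kernel

same = conv2d(image, box, padding="same")
valid = conv2d(image, box, padding="valid")
print(len(same), len(same[0]))    # 4 4
print(len(valid), len(valid[0]))  # 2 2
```

With "same" padding the output keeps the input size, which is why a network built from 3×3 convolutions can return maps of the same (H, W) as the input fringe image.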

Figure S7. Explanation of the "Conv2D" function of constructing the convolution layer.

Dataset preparation
In order to train the network, a matched dataset of ground truth parameters and corresponding raw fringe patterns should be created. In this tutorial, the dataset was collected by physical experiments on a real fringe projection profilometry system, which includes a LightCrafter 4500Pro projector (912 × 1140 resolution) and a Basler acA640-750um camera (640 × 480 resolution). The camera is equipped with a lens of 12 mm focal length. The distance between the measured object and our system was about one meter. To obtain precise parameter estimates, a multi-step phase-shifting algorithm is used to calculate the ground truth data for our neural networks. The multi-step phase-shifting fringe patterns are generated as
$$I^{p}_{n}(x^{p}, y^{p}) = a + b\cos\left(2\pi f x^{p} - \frac{2\pi n}{N}\right) \tag{S2}$$
where $(x^p, y^p)$ is the pixel coordinate of the projector, and the index n = 0, 1, 2, ..., N − 1 (N is the number of phase-shifting steps). Parameters a, b, and f are the mean value, amplitude, and spatial frequency, respectively. In our experiments, we set a = b = 127.5 (for projection of 8-bit images) and f = 48. The projector projected the generated multi-step phase-shifting fringe patterns onto different measured objects. The camera simultaneously captured the reflected fringe patterns from a different angle and transferred them to our computer. As the performance of the deep neural network largely depends on the quality of the collected training data, we captured 1000 different scenes including simple and complex objects (metal industrial parts, plaster models, paper boxes, etc.) placed in arbitrary postures. For each scene, we recorded multi-step phase-shifting fringe patterns. The captured phase-shifting images can be represented as:
$$I_{n}(x, y) = A(x, y) + B(x, y)\cos\left(\phi(x, y) - \frac{2\pi n}{N}\right) \tag{S3}$$
where $I_n$ represents the (n + 1)th captured image, n = 0, 1, ..., N − 1, (x, y) is the camera pixel coordinate, A is the average intensity map, B is the fringe amplitude map, ϕ is the phase, and 2πn/N is the phase shift.
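As an illustration of the projected patterns, the sketch below generates one row of each phase-shifted fringe pattern with a = b = 127.5 and f = 48. Several points are assumptions rather than details given in the text: the projector coordinate is normalized by the pattern width so that f counts fringe periods across the pattern, the width is taken as 912 pixels, the phase shift enters with a minus sign, and N = 3 is used only to keep the example small.

```python
import math

def fringe_row(n, num_steps, width, freq=48, a=127.5, b=127.5):
    """One row of the n-th phase-shifted pattern (every row is identical)."""
    return [a + b * math.cos(2 * math.pi * freq * x / width
                             - 2 * math.pi * n / num_steps)
            for x in range(width)]

num_steps, width = 3, 912  # assumed step count and pattern width
rows = [fringe_row(n, num_steps, width) for n in range(num_steps)]

# Every value fits in an 8-bit image, and the N shifted patterns average to a.
lowest = min(min(r) for r in rows)
highest = max(max(r) for r in rows)
mean_at_pixel = sum(r[100] for r in rows) / num_steps
print(lowest >= 0.0, highest <= 255.0)  # True True
print(round(mean_at_pixel, 6))          # 127.5
```

With these assumed values, one fringe period spans 912 / 48 = 19 projector pixels.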
Through the standard N-step phase-shifting algorithm, we then calculated the corresponding ground truth data with the least-squares method:
$$\phi(x, y) = \arctan\frac{M(x, y)}{D(x, y)} = \arctan\frac{\sum_{n=0}^{N-1} I_{n}(x, y)\sin(2\pi n/N)}{\sum_{n=0}^{N-1} I_{n}(x, y)\cos(2\pi n/N)} \tag{S4}$$
Through Eq. (S4), we can calculate the numerator term M and denominator term D of the arctangent function. We recommend using a phase-shifting method with a higher number of steps, such as the 12-step phase-shifting method, to obtain higher-quality phase-related information. We then use the first fringe pattern of the three-step phase-shifting images and the corresponding numerator and denominator terms as a set of input and ground truth data for training the neural network. The training datasets have been uploaded to figshare (https://figshare.com/s/f150a36191045e0c1bef). The input data are in the "Input of CNN1" folder and named "LeftImage"; the corresponding ground truth data are in the "Output of CNN1" folder, and the numerator and denominator are named "M" and "D", respectively. Figure S8 shows four representative sets of training data. The first column shows the captured fringe images of the scenes. The second and third columns of Fig. S8 show the ground truth numerator and denominator for the network training. Moreover, when selecting training objects, we suggest choosing objects without very dark or shiny surfaces, to ensure that the captured fringe images have a sufficient signal-to-noise ratio and no saturated points.
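The least-squares calculation of M, D, and the wrapped phase can be checked numerically for a single pixel. The sketch below assumes that the captured intensity follows A + B cos(ϕ − 2πn/N), so the phase shift enters with a minus sign; the function names and the intensity values are hypothetical. It also illustrates why the division removes the surface reflectivity: a bright and a dark surface with the same phase give different M and D but the same arctangent.

```python
import math

def phase_shifting(intensities):
    """Return (M, D, wrapped phase) for one pixel from its N shifted intensities."""
    n_steps = len(intensities)
    m = sum(i_n * math.sin(2 * math.pi * n / n_steps)
            for n, i_n in enumerate(intensities))
    d = sum(i_n * math.cos(2 * math.pi * n / n_steps)
            for n, i_n in enumerate(intensities))
    return m, d, math.atan2(m, d)

def simulate_pixel(phi, n_steps, mean, amplitude):
    """Assumed capture model: I_n = A + B cos(phi - 2 pi n / N) at one pixel."""
    return [mean + amplitude * math.cos(phi - 2 * math.pi * n / n_steps)
            for n in range(n_steps)]

phi_true = 2.0
# Two surfaces with very different reflectivity (A, B) but the same phase.
bright = phase_shifting(simulate_pixel(phi_true, 12, mean=140.0, amplitude=100.0))
dark = phase_shifting(simulate_pixel(phi_true, 12, mean=30.0, amplitude=10.0))
print(round(bright[2], 6), round(dark[2], 6))  # 2.0 2.0
print(round(bright[0] / dark[0], 6))           # 10.0
```

Under this model, M = (N/2) B sin ϕ and D = (N/2) B cos ϕ, so both terms scale with the fringe amplitude B while their ratio does not.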

Figure S8. Four representative sets of input and ground truth data (fringe pattern, ground truth numerator, and ground truth denominator).

Network training
Before being fed into the network, the input fringe images are divided by 255 for normalization, making the learning process easier for the network. In this case, 450 sets of data are used for training and 150 sets are used for verification. The codes on lines 56 to 142 in "step1 train phase demodulation.py" are used to initialize the input and output tensors for training and verification [Fig. S9]. Users can customize the file names of the datasets and the amounts of data used for training and verification, and even substitute different input and output data for other tasks.
Figure S9. Codes used for initializing the tensors for training and verification.
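The normalization and the 450/150 split described above amount to very little code. The following sketch is not taken from "step1 train phase demodulation.py": the function names and the sample numbering are hypothetical, and the split simply takes consecutive samples.

```python
def normalize(image):
    """Scale 8-bit gray levels to the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]

def split_dataset(sample_ids, num_train, num_val):
    """Use the first num_train samples for training and the next num_val for verification."""
    if num_train + num_val > len(sample_ids):
        raise ValueError("not enough samples for the requested split")
    return sample_ids[:num_train], sample_ids[num_train:num_train + num_val]

sample_ids = list(range(1, 601))  # hypothetical sample numbering
train_ids, val_ids = split_dataset(sample_ids, num_train=450, num_val=150)
print(len(train_ids), len(val_ids))  # 450 150
print(normalize([[0, 51, 255]]))     # [[0.0, 0.2, 1.0]]
```

Keeping the verification samples disjoint from the training samples is what makes the verification loss a meaningful check against overfitting.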
The network is then trained. The codes on lines 242 to 275 in "step1 train phase demodulation.py" are used to train the network model [Fig. S10]; they define the loss function, optimizer type, number of training epochs, etc. When the "step1 train phase demodulation.py" file runs successfully in Spyder, the Spyder console prints the network structure table, as shown in Fig. S11. When training starts, the Spyder console prints the time consumption, epoch number, and loss of the training, as shown in Fig. S12.
After 300 epochs, the training loss curve converges, and the network training is completed. The training of 300 epochs takes about 5.67 hours. The training and validation losses and the learned model are output as files [Fig. S13], where "phase demodulation model.h5" is the learned model, and "traloss cnn1.mat" and "valloss cnn1.mat" are the training and verification loss data. The training and verification loss curves are also displayed in Spyder's console.

C. Result analysis
To test the phase demodulation performance of the trained network, we apply it to fringe patterns obtained by measuring different types of samples. By running "step2 test phase demodulation.py" in Spyder, the predicted numerator and denominator terms corresponding to the input fringe images are written to the "test" folder (the output folder can be customized). Users can then apply the arctangent function to obtain the phase map. Figures S14 and S15 show the comparison results of different phase demodulation methods for training data and test data, respectively, where Figs. S14b-d and Figs. S15b-d are the numerator terms and denominator terms predicted from the inputs [Figs. S14a and S15a], and the phase maps calculated from the prediction results, respectively. It can be seen from Figs. S14e-S14g and Figs. S15e-S15g that, compared with the traditional single-frame spatial phase demodulation methods (here we use the FT and WFT methods), whose quality is limited around discontinuities and isolated areas, the deep learning method obtains high-quality phase maps with significantly higher accuracy. We calculate the phase RMSE of the traditional single-frame methods and the deep learning method relative to the phase-shifting method. The quantitative results show that the phase accuracy of the deep learning method is nearly an order of magnitude higher than that of the traditional single-frame methods.
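The two post-processing steps mentioned above, applying the arctangent to the predicted terms and computing the phase RMSE against the phase-shifting result, can be sketched as follows. The text does not say how phase differences near the ±π boundary were handled, so wrapping each difference before squaring is an assumption of this sketch; the function names and the 1 × 2 test "images" are hypothetical.

```python
import math

def phase_map(numerator, denominator):
    """Wrapped phase from the M and D maps, using the four-quadrant arctangent."""
    return [[math.atan2(m, d) for m, d in zip(row_m, row_d)]
            for row_m, row_d in zip(numerator, denominator)]

def wrapped_rmse(phase, reference):
    """RMSE of the phase difference, with each difference wrapped to [-pi, pi]."""
    squared, count = 0.0, 0
    for row_p, row_r in zip(phase, reference):
        for p, r in zip(row_p, row_r):
            diff = math.atan2(math.sin(p - r), math.cos(p - r))
            squared += diff * diff
            count += 1
    return math.sqrt(squared / count)

# Hypothetical 1 x 2 "images": network output vs. phase-shifting ground truth.
predicted = phase_map([[1.0, -0.01]], [[1.0, -1.0]])
reference = phase_map([[1.0, 0.01]], [[1.0, -1.0]])
print(round(wrapped_rmse(predicted, reference), 4))  # 0.0141
```

The second pixel sits on either side of the ±π boundary, so a plain subtraction would report an error close to 2π even though the two phases differ by only about 0.02 rad.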
To compare the results more intuitively, we further unwrap the phases and reconstruct the 3D surface profiles of the measured objects based on the pre-calibrated geometry parameters of the fringe projection system used for acquiring the dataset. It can be seen that the reconstructed result from FT [Figs. S14h and S15h] features many grainy distortions, which are mainly due to the inevitable spectral leakage and overlapping in the frequency domain. Compared with FT, the WFT reconstructed the objects with smoother surfaces but failed to preserve the surface details, especially around the edge regions [Figs. S14i and S15i]. By comparison, the deep learning method yielded the highest-quality 3D reconstructions with high-fidelity recovery of surface details, producing 3D reconstructions [Figs. S14j and S15j] that visually almost reproduce the results [Figs. S14k and S15k] of the multi-step phase-shifting method. Unlike the multi-frame phase-shifting method, however, this high-quality phase demodulation is obtained from only one fringe image as input. Further, unlike the FT and WFT methods, whose performance relies heavily on fine-tuning several parameters, the deep learning method is fully automatic: once the neural network has been trained, it does not require any manual parameter search to optimize its performance.
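The text above does not state which unwrapping algorithm was used. As a simple stand-in that shows what unwrapping means, the sketch below applies Itoh's one-dimensional method, which integrates wrapped phase differences and is valid only when neighboring samples differ by less than π. The function name and the test ramp are hypothetical.

```python
import math

def unwrap(wrapped):
    """Itoh's method: integrate wrapped phase differences along a 1D profile."""
    unwrapped = [wrapped[0]]
    for prev, cur in zip(wrapped, wrapped[1:]):
        step = cur - prev
        # Bring the step back into [-pi, pi] by removing whole multiples of 2*pi.
        step -= 2 * math.pi * round(step / (2 * math.pi))
        unwrapped.append(unwrapped[-1] + step)
    return unwrapped

true_phase = [0.3 * k for k in range(40)]  # ramp from 0 to 11.7 rad
wrapped = [math.atan2(math.sin(p), math.cos(p)) for p in true_phase]
recovered = unwrap(wrapped)
max_error = max(abs(r - t) for r, t in zip(recovered, true_phase))
print(max_error < 1e-9)  # True
```

Surfaces with height discontinuities violate the small-step condition, which is why practical fringe projection systems typically rely on additional information for unwrapping rather than on this spatial approach alone.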