What makes the unsupervised monocular depth estimation (UMDE) model training better

Current computer vision tasks based on deep learning require a huge amount of annotated data for model training and testing, especially dense estimation tasks such as optical flow, segmentation and depth estimation. In practice, manual labeling for dense estimation tasks is very difficult or even impossible, and the scenes in existing datasets are often restricted to a small range, which dramatically limits the development of the community. To overcome this deficiency, we propose a synthetic dataset generation method that yields an expandable dataset without a burdensome manual workload. With this method, we construct a dataset called MineNavi containing first-person-view video footage from an aircraft, matched with accurate ground truth for depth estimation in aircraft navigation applications. We also provide quantitative experiments showing that pre-training on our MineNavi dataset improves the performance of depth estimation models and speeds up their convergence on real-scene data. Since the synthetic dataset has an effect similar to a real-world dataset when training deep models, we finally conduct experiments on MineNavi with unsupervised monocular depth estimation (UMDE) deep learning models to demonstrate the impact of various factors in our dataset, such as lighting conditions and motion mode, aiming to explore what makes the training of this kind of model better.

Xiangtong Wang, Binbin Liang, Menglong Yang & Wei Li*
In recent years, machine learning based depth estimation methods, which rely heavily on labeled datasets, have achieved satisfying performance. However, the scarcity of available labeled data and the high costs of data acquisition and annotation limit the quantity and variety of data available to existing deep learning methods. Although the problem of data shortage can be partly relieved by unsupervised learning methods that need only sparse or even no annotations, ground truth is still needed in experiments to evaluate or test the generalization performance of a model. Thus, it is still of great significance to obtain a sufficient number of images with accurate and dense depth information.
The common data acquisition methods in the real world are not feasible for depth estimation, especially for aircraft visual navigation, because humans cannot manually produce pixel-wise annotations. Building a virtual world to generate synthetic datasets as an intermediate domain with the help of digital simulation technology may be the most feasible way of data generation and labeling at the current stage. Since the newly released synthetic datasets [1][2][3][4] are not flexible enough to suit different needs (e.g., fixed resolution, limited scenes, low data diversity and huge volume), it is difficult to apply them to dense estimation tasks in large-scale environments, especially depth estimation in aircraft navigation. Therefore, in this paper, we propose a simple and expandable synthetic dataset generation method and construct a custom dataset called MineNavi (Fig. 1). This dataset generation method not only solves the problem of the high cost of real-world data acquisition, but also narrows the gap between the training domain and the target domain by customizing the synthetic scene to be similar to the target domain. Besides, unlike conventional studies that adjust models on a fixed dataset to make them close or superior to state-of-the-art methods under certain evaluation metrics, we analyze the influence of changes in the dataset on the models. This is significant because it not only verifies the generalization capability of the models to the environment, but also gives guidance for constructing real-world datasets. In addition, to explore the impact of various dataset factors on depth estimation models, our MineNavi dataset contains dense depth maps and surface normal vectors of objects. This helps us observe the performance of depth estimation models under different dataset factors, such as camera ego-motion, lighting and motion patterns.
Our experiments show that these variations in training sets may significantly affect the performance of the models. Finally, unlike the KITTI dataset 5 applied to autonomous driving, our dataset is mainly oriented to depth estimation of large-scale scenes from an aircraft view, which can not only advance the development of scene 3D reconstruction 6 but also provide training data and testbeds for autonomous aircraft with scene perception 7 . Our contributions are as follows: firstly, we propose an open synthetic dataset generation method and construct MineNavi for large-scale depth estimation applications. Secondly, we design experiments to report the performance of baseline models pre-trained on the MineNavi dataset, and reveal the influence of various dataset factors on depth estimation models. The MineNavi dataset is available on the Kaggle platform 8 .

MineNavi: a synthetic dataset of large scale scenes
Using MineCraft to construct a dataset is not a novel idea in the computer vision community 9 ; here we utilize it for depth estimation in aircraft visual navigation applications. Our data generation method contains four steps: map loading, camera moving path setting, shader and lighting condition setting, and ground-truth acquisition. Figure 2 shows the pipeline of the data generation process.
Not only do environment features, such as the scene structure and lighting conditions, affect the performance of depth estimation models, but particular dynamical parameters, such as moving targets in the environment and the ego-motion of the aircraft, also play important roles in benchmark datasets for model training and evaluation. Accordingly, each image frame of the dataset can be parameterized as

F(n) = G(P⋆(n), M, s, L(t, w)),

where P⋆(n) represents the 6 DoF camera motion paths, n is the quantified timestamp of the path, M is the map of the scene, s is the shader that renders the world, and L(t, w) is the lighting condition, with t the time of day and w the weather conditions.

Scene construction. Although a lot of work 2,10 builds scenes with 3D modeling software such as Blender and Maya, the construction of large-scale 3D scenes is still a relatively time-consuming and laborious task. Besides, limited scene diversity leads to over-fitting of the models. The MineCraft community 11 has extremely rich scene maps, and users can freely build the required scenes to generate specific datasets. Since aircraft navigation always involves large-scale scenes, the negative effects of the jagged features of objects in MineCraft can be ignored.
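As an illustration only (not part of the released MineNavi tooling), the per-frame parameterization above can be sketched in Python; all class and field names here are our own:

```python
from dataclasses import dataclass

@dataclass
class Lighting:
    time_of_day: float   # t: in-game time, e.g. hours in [0, 24)
    weather: str         # w: e.g. "clear", "rain"

@dataclass
class FrameSpec:
    pose: tuple          # P*(n): 6-DoF camera pose (x, y, z, yaw, pitch, roll)
    timestamp: int       # n: quantified timestamp along the path
    map_name: str        # M: the scene map
    shader: str          # s: shader used to render the world
    lighting: Lighting   # L(t, w): lighting condition

def frame_key(spec: FrameSpec) -> str:
    """Build a unique, sortable key for a frame from its parameters."""
    return f"{spec.map_name}/{spec.shader}/{spec.timestamp:06d}"

spec = FrameSpec(pose=(0.0, 64.0, 0.0, 90.0, 0.0, 0.0), timestamp=42,
                 map_name="AudiaCity", shader="sildurs-high",
                 lighting=Lighting(time_of_day=12.0, weather="clear"))
print(frame_key(spec))  # AudiaCity/sildurs-high/000042
```

Grouping all five parameters in one record makes a generated frame fully reproducible from its specification alone.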
In order to increase the diversity of the data, we use different shaders and lighting conditions. The MineNavi dataset combines the time and weather systems with different light and shadow styles to generate data in multiple styles.

(Figure 3: the open virtual world AudiaCity used to build our dataset; users can achieve higher-resolution scenes or buildings by applying plugins that reduce the block size.)

Block-based construction in MineNavi is very simple and flexible. In order to build a more refined scene, users can use plug-ins to adjust the size of the blocks to achieve more complex objects (see Fig. 3).
Camera paths setting. Based on previous studies [12][13][14] , we have found that unsupervised monocular depth estimation methods are very sensitive to camera motion during training.
Therefore, we develop different camera paths and generate corresponding datasets for experiments. Unlike lighting and other factors that can be quantified as a scalar, a moving camera has 6 continuous degrees of freedom.
Therefore, for a training triplet, we propose a quantitative scalar κ, the quasi-axis rate, generate datasets of motion paths according to κ, and analyze the pros and cons of the data under different κ. The quasi-axis rate can be formulated as

κ(n) = |sin φ(n)|,

where φ(n) is the angle at time n between the camera visual axis, calculated from R ∈ SO(3), and the direction of camera translation, and t(n) = [x, y, z] is the camera position vector at time n. When κ = 0, the camera moves parallel to the visual axis; when κ = 1, the camera moves perpendicular to the visual axis.
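A minimal sketch of one plausible computation of the quasi-axis rate, assuming κ is the sine of the angle between the camera's translation direction and its visual (optical) axis; the paper's exact formulation may differ, and this only reproduces the stated endpoints (κ = 0 parallel, κ = 1 perpendicular):

```python
import numpy as np

def quasi_axis_rate(R: np.ndarray, t0: np.ndarray, t1: np.ndarray) -> float:
    """R: 3x3 camera-to-world rotation; t0, t1: consecutive camera positions.
    Assumes the visual axis is the camera's +z direction."""
    axis = R @ np.array([0.0, 0.0, 1.0])       # visual axis in world frame
    step = t1 - t0
    step = step / np.linalg.norm(step)         # unit translation direction
    cos_phi = float(np.clip(np.dot(axis, step), -1.0, 1.0))
    return float(np.sqrt(1.0 - cos_phi ** 2))  # |sin| of the angle

R = np.eye(3)                                   # camera looks along +z
forward = quasi_axis_rate(R, np.zeros(3), np.array([0.0, 0.0, 1.0]))
lateral = quasi_axis_rate(R, np.zeros(3), np.array([1.0, 0.0, 0.0]))
print(round(forward, 3), round(lateral, 3))     # 0.0 1.0
```

Motion along the viewing direction gives κ = 0 and purely lateral motion gives κ = 1, matching the endpoints stated above.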
In MineNavi, we can set the key points manually or automatically using Aperture 15 and obtain the full path matched with the image sequence through an interpolation algorithm (see Fig. 4). The generated paths are highly dynamic, and the pose transformations are much larger than in general real-world data captured by UAVs.
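The key-point-to-full-path step can be sketched as follows; we use simple per-axis linear interpolation here, whereas Aperture-style tools typically use smoother splines, so treat this as an illustration only:

```python
import numpy as np

def interpolate_path(key_frames, key_positions, n_frames):
    """key_frames: sorted frame indices of the key points;
    key_positions: (k, 3) array of positions at those key points.
    Returns an (n_frames, 3) position for every frame index."""
    key_positions = np.asarray(key_positions, dtype=float)
    frames = np.arange(n_frames)
    # Interpolate each spatial axis independently over the frame indices.
    return np.stack([np.interp(frames, key_frames, key_positions[:, a])
                     for a in range(3)], axis=1)

path = interpolate_path([0, 4], [[0, 64, 0], [8, 64, 4]], n_frames=5)
print(path[2])  # midpoint between the two key points: [ 4. 64.  2.]
```

Because every rendered frame then has an exact interpolated pose, the 6 DoF ground-truth trajectory comes for free with the image sequence.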
Moving objects in the scene. Dynamic objects in a practical environment may have a great influence on unsupervised monocular depth estimation methods. Much previous work [16][17][18] has focused on how to remove the negative effects of dynamic objects in monocular depth estimation, but due to the very limited datasets, progress is far from satisfactory. In order to simulate the influence of moving objects in the synthetic dataset on depth estimation, the construction of a scene containing moving objects can be further parameterized as

F(n) = G(P⋆(n), M, s, L(t, w), {P_i(n)}),

where P_i is the path of the i-th moving object in the scene. Each dynamic object can be modeled as a custom shape in Blender or other 3D software, and its path can be set using BlockBuster 19 . Note that we have not yet included moving objects in our proposed dataset due to their negative effects on depth estimation models.

Generating ground-truth annotations. The shader can perform color mapping on the 3D information of the scene, which acts as ground truth, as shown in Fig. 1.
We use the DepthMap rendering plug-in to export the corresponding error-free, pixel-level dense depth maps that match the images in the sequences. In addition, we provide a surface normal rendering plug-in, SurfMap, to support surface normal estimation tasks.
Thus, the dataset construction method proposed in this study can generate a large number of customized datasets at very low cost.
Datasets building. We construct several datasets with MineNavi, as shown in Table 1.

Experiments
In this section, we verify the feasibility and credibility of the MineNavi dataset for training depth estimation models, and explore the impact of dataset variations on unsupervised monocular depth estimation models. We demonstrate that (1) a depth estimation model can improve its generalization through pre-training on MineNavi, and (2) it is worthwhile to examine how various dataset factors influence the models. We prepare monodepth2 20 and its two variants, monodepth2-3D and monodepth2-3Ds, as test models on our proposed datasets. We also present the Sequential Heat-map of Photometric-error Histogram (SHPH) to intuitively verify whether an image sequence is suitable for depth estimation model training.
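The SHPH idea can be sketched as follows: for each consecutive frame pair we histogram the per-pixel photometric error, then stack the histograms over time so the whole sequence can be inspected as a 2-D heat map. The bin count and the error definition (absolute intensity difference) are our assumptions, not the paper's exact settings:

```python
import numpy as np

def shph(frames: np.ndarray, bins: int = 16) -> np.ndarray:
    """frames: (T, H, W) grayscale sequence with values in [0, 1].
    Returns a (T-1, bins) heat map of photometric-error histograms."""
    rows = []
    for a, b in zip(frames[:-1], frames[1:]):
        err = np.abs(a - b).ravel()                      # photometric error
        hist, _ = np.histogram(err, bins=bins, range=(0.0, 1.0))
        rows.append(hist / err.size)                     # normalise per pair
    return np.stack(rows)

rng = np.random.default_rng(0)
seq = rng.random((4, 8, 8))    # toy 4-frame sequence
heat = shph(seq)
print(heat.shape)  # (3, 16)
```

A sequence whose rows are stable and evenly spread over the bins suggests the view-synthesis loss will be well conditioned, which is how the SHPH is read in the experiments below.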
MDE models. Unsupervised monocular depth estimation methods 12-14,21-24 usually contain a single-view depth network and a multi-view pose network to compute the depth. Following the same principle, we use the test model monodepth2 20 and its variants as baselines, shown in Fig. 5. Inspired by spatio-temporal methods in scene understanding 25,26 , the first variant, monodepth2-3D, replaces the encoder with a 3D encoder to make better use of the training frames, enhancing feature richness by extracting temporal features from multiple images 17,27,28 . Moreover, as mentioned in previous work 29 , if there is structural similarity among candidate tasks, it is reasonable to assign a single encoder to extract shared features and recover the required information with task-oriented decoders 20,21 . Thus, as the second variant of monodepth2, monodepth2-3Ds, we use a model with a single encoder that extracts mixed features for both the depth and pose estimation networks.
Apply MineNavi to MDE models. We present two variant models by changing the encoder, and apply them in supervised and unsupervised (monodepth2) training frameworks on MNv1.0, MNv1.1 and MNv1.2. Table 2 shows the results of the models with single-frame and multi-frame input; the models on MN achieve similar or even better results simply by replacing the ResNet18 encoder with 3D-ResNet18. Evidently, depth information is embedded in the multi-frame image sequence, which helps the model recover depth better. The qualitative results are shown in Fig. 6.

Generalization of MineNavi.
We pre-train the models on MineNavi datasets with linear camera moving paths. In order to evaluate the influence of data diversity on performance, we use MNv1.0, MNv1.1 and MNv1.2 for model pre-training. For comparison, we also prepare models trained from scratch and pre-trained on ImageNet 30 . Although a model pre-trained on ImageNet with a classification task differs structurally from a model pre-trained on a similar target task, ImageNet pre-training is still the most popular approach in depth estimation.
Fine-tune on KITTI. We fine-tune monodepth2, monodepth2-3D and monodepth2-3Ds on KITTI for 10 epochs, starting from scratch, ImageNet, MNv1.2 and MNv1.3 pre-training. Note that we have removed the mask mechanism and reduced the number of epochs for simple training without affecting the final conclusions, so the results may differ from the original monodepth2 20 . From Table 3 it can be seen that monodepth2 and monodepth2-3D pre-trained on ImageNet perform better than when pre-trained on MNv1.2 or from scratch, but worse than when pre-trained on MNv1.3. MineNavi thus shows strong generalization capability with respect to KITTI. As mentioned before, MNv1.2 and MNv1.3 differ only in lighting conditions and data volume. Therefore, the diversity of lighting conditions effectively improves the generalization capability of the models.
Compared with the other datasets, monodepth2-3Ds pre-trained on ImageNet has the best performance. This is mainly because excessive noise in KITTI, e.g., moving objects, deteriorates the robustness of the shared encoder, while the large volume of ImageNet data makes the model more robust 31 . Note that although the MineNavi dataset is much smaller than ImageNet, it is competitive with ImageNet for depth estimation model training. The qualitative results shown in Fig. 7 match Table 3. It can be seen that the depth maps produced by the MN-trained model have sharper edges than those of the ImageNet pre-trained model, which also indicates that pre-training on the same task helps the model generalize better to similar tasks. We also provide fine-tuning curves in Fig. 8, which show the generalization value of our MineNavi dataset.

Fine-tune on FPV.
Since there is no ground truth in the FPV dataset, we estimate the distances between models in different domains from the loss values 29 . The closer the migration distance, the better the pre-training dataset generalizes to the target domain.
Comparing the loss curves of the different pre-trained models fine-tuned on FPV in Fig. 9, the MineNavi pre-trained models converge faster than the others. The reason is probably that the MineNavi dataset is closer to the FPV dataset than ImageNet in terms of environment scenes. Moreover, compared with the model pre-trained on ImageNet through a classification task, the MineNavi model pre-trained through depth estimation has learned geometric representations 32 during pre-training, which makes the model converge faster when the target task has structural similarity 29 with the source task. Note that, with the continuous expansion of the dataset, MineNavi can achieve even more satisfactory performance.
Factors that affect the training of MDE. Due to the expandable nature of the MineNavi dataset, we can easily generate customized datasets with different variation factors to avoid over-fitting. This is also a helpful way to discover the impact of dataset factors on the models. Thus we conduct experiments to explore how these factors affect training.

Impact of shaders. The MineNavi dataset can generate rendered image sequences sampled on the same path through different shaders, which allows us to quantitatively evaluate the impact of the synthetic world design and the quality of other rendering parameters on algorithm performance. We apply Sildurs 33 to adjust the image rendering quality and build three training subsets of MNv1.2, indexed Raw, middle-sildurs and high-sildurs.
All of them are captured in an identical scene with linear camera motion, with about 10,000 images collected for each. The only difference among them is the shader setting: Raw is rendered with no shader, middle-sildurs uses Sildurs at a middle setting, and high-sildurs uses the high-performance shader. We apply a randomly initialized encoder to monodepth2 and train it on the three datasets above. We use cross-evaluation on each trained model, i.e., we evaluate every model on all datasets. The quantitative results are shown in Table 4. They show that as the rendering of the training scenes gradually improves, the performance of the depth estimation model improves as well. Besides, compared with a model trained on less-textured data and tested on rendered data, a model trained on rendered data and tested on less-textured data gives a worse result. This is consistent with the fact that rendering quality promotes the robustness of the model during training.
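The cross-evaluation protocol above can be sketched as a small driver: every model trained on one shader setting is evaluated on every dataset, giving an N x N error grid. `train` and `evaluate` are placeholders for the actual training and evaluation routines, with toy stand-ins below:

```python
def cross_evaluate(datasets, train, evaluate):
    """datasets: dict name -> data.
    Returns {(trained_on, tested_on): score} for all pairs."""
    models = {name: train(data) for name, data in datasets.items()}
    return {(tr, te): evaluate(models[tr], datasets[te])
            for tr in datasets for te in datasets}

# Toy stand-ins: a "model" is just the mean of its training data, and
# the "error" is the distance between train and test means.
data = {"raw": [1.0, 2.0], "mid": [2.0, 3.0], "high": [3.0, 4.0]}
mean = lambda xs: sum(xs) / len(xs)
grid = cross_evaluate(data, train=mean,
                      evaluate=lambda m, d: abs(m - mean(d)))
print(grid[("raw", "high")])  # 2.0
```

Reading the grid row-by-row shows how well each training condition transfers, which is exactly the asymmetry discussed above (rendered-to-raw transfers worse than raw-to-rendered).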
Lighting conditions. Previous studies 20,34 show that during depth estimation model training, low-texture areas caused by insufficient lighting or overexposure produce problematic pixels in the depth estimates.
To further explore the impact of lighting conditions on the depth estimation model, we train randomly initialized models on five subsets of MNv1.3 (see Fig. 10) under different lighting conditions: morning, noon, afternoon, night and rainy day. Quantified AbsRel results are shown in Fig. 11. We observe that under morning and noon lighting, the three test models achieve similar results. However, as the lighting in the training data gets dimmer (afternoon, night, rainy), the three models deteriorate significantly. This can be attributed primarily to the fact that adequate lighting makes the colors between pixels more diverse, so the error map is close to a uniform distribution. Note that for the afternoon setting the models' performance drops dramatically, even below night, which has dimmer lighting; we suspect the reason is that the captured images contain too many problematic pixels caused by lens flare, which is strongest in the afternoon compared with the other lighting conditions. SHPH results on the sequences collected under different lighting conditions and camera moving paths are shown in row 3 of Fig. 10. It can be clearly seen that clear lighting conditions yield an even SHPH distribution.
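The AbsRel metric reported in Fig. 11 is the standard absolute relative error used in depth estimation benchmarks: the mean of |prediction - ground truth| / ground truth over valid (positive ground-truth) pixels. A minimal implementation:

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative depth error over valid pixels (gt > 0)."""
    mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 1.8, 4.0])
print(round(abs_rel(pred, gt), 3))  # 0.067
```

Because the error is normalized by ground-truth depth, near and far regions contribute comparably, which matters in the large-scale aerial scenes of MineNavi.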
Impact of motion blur. Camera motion also affects the stability of the SHPH. As shown in Fig. 12, the distribution of the photometric error map gradually evens out as motion blur increases. In our experiments, four datasets with different amounts of motion blur are built. The quantitative results for monodepth2 are shown in Table 6. Motion blur has a great impact on the SHPH; we suspect that adding a certain amount of motion blur to the sequences is an effective way to overcome noise and introduce robustness. This is reflected in the SHPH: appropriate motion blur makes the SHPH more stable, which makes the view synthesis of the depth estimation model easier (see Fig. 12). Table 7 shows the performance of the two variants of monodepth2 on MineNavi datasets with different motion blur. It can be seen that the two models trained on the motion-blurred datasets perform significantly better than those trained on the dataset without blur.
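One simple way such motion-blurred variants could be synthesised (our assumption, not a documented MineNavi parameter) is to replace each frame with the average of a small window of consecutive frames, approximating blur from camera motion during a longer exposure:

```python
import numpy as np

def motion_blur(frames: np.ndarray, window: int = 3) -> np.ndarray:
    """frames: (T, H, W) sequence.
    Returns (T - window + 1, H, W) frames, each the mean of `window`
    consecutive originals, simulating motion blur along the camera path."""
    return np.stack([frames[i:i + window].mean(axis=0)
                     for i in range(len(frames) - window + 1)])

frames = np.arange(5, dtype=float).reshape(5, 1, 1)  # toy 1-pixel video
print(motion_blur(frames).ravel())  # [1. 2. 3.]
```

A larger window smooths high-frequency noise across frames, which is consistent with the stabilising effect on the SHPH discussed above.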
Besides, we also introduce varying lighting conditions into the experiments. As shown in Table 5, for the variant models of monodepth2, the darker the lighting conditions, the worse the performance, which is consistent with the previous analysis.

Table 3. Quantitative results of various MDE models on KITTI with different pre-training datasets. The best results are bolded and the second best are underlined. Since we fine-tune for only 10 epochs and without the masking mechanism, the results differ from the original monodepth2 paper 20 .

Impact of ego-motion variance. The ego-motion of the camera in the video affects depth estimation model training. Due to the continuous nature of camera ego-motion, it is not easy to explore the impact of this factor. In this section, we build three datasets, MNv2.0, MNv2.1 and MNv2.2, which vary in motion mode, corresponding to linear motion (κ₁ = 1), overhead cruising motion (κ₂ = √2/2) and circular motion (κ₃ = 0). The motion speed is controlled by the number of interval frames in each training triplet, and each dataset comes in three velocities v₁, v₂ and v₃. We test the different motion modes with the models, and the quantitative results are shown in Table 8. It can be seen that as κ decreases, the performance of the test models also decreases, and the velocity of the training triplets also significantly affects performance. According to the previous analysis, the likely reason is that training triplets with a larger κ and an appropriate velocity have an even distribution in the SHPH, hence better performance.
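How the sampling interval controls the apparent velocity of a training triplet can be sketched as follows: with a stride k, the triplet (n-k, n, n+k) spans a larger camera motion, behaving like a faster camera. Function and variable names are illustrative:

```python
def build_triplets(n_frames: int, stride: int):
    """Return (prev, centre, next) frame-index triplets with the
    given stride; larger stride == larger apparent camera velocity."""
    return [(n - stride, n, n + stride)
            for n in range(stride, n_frames - stride)]

print(build_triplets(8, stride=1)[:2])  # [(0, 1, 2), (1, 2, 3)]
print(build_triplets(8, stride=3)[0])   # (0, 3, 6)
```

Sampling the same rendered sequence at several strides yields the multiple-velocity variants of each dataset without re-rendering anything.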
Velocity of training sequence. We find that varying the sampling frequency of the training sequence can greatly affect the performance of the model. This matters because if the sampling camera moves faster, the photometric differences between two adjacent frames are larger, making the model more difficult to train. Figure 13 shows the qualitative results of models that vary in training-sequence velocity and encoder.

Discussion
This paper proposes a method for constructing a synthetic dataset that covers a large-scale scene at low cost but with effectively unlimited volume, including surface normals, depth, and the 6 DoF paths of the camera's ego-motion. This dataset generation method provides a solution to the difficulty of data collection in dense estimation tasks.

Table 7. Motion blur test on monodepth2-3D (up) and monodepth2-3Ds (down). The best result in each row is underlined and the optimal result is bolded.

Table 8. Model performance under different ego-motion modes. The best result in each column is underlined and the optimal result is bolded.