Robust pose estimation which guarantees positive depths

In 3D computer vision, the ability to estimate the pose between two cameras under high noise levels while maintaining small reprojection errors reflects the robustness of a pose estimation algorithm. Maintaining the positive depth constraint is a further challenge. Unfortunately, current pose estimation algorithms are often sensitive to noise and outliers and do not always guarantee positive depths. Instead of integrating the positive depth constraint into the algorithm itself, they perform a sign check as a standalone step after execution and simply discard the points with negative depths. Here, from a comprehensive mathematical derivation, we propose a novel pose estimation algorithm that integrates the positive depth constraint into the algorithm itself by estimating the depths directly. The algorithm was competitive in producing small reprojection errors when compared to state-of-the-art algorithms in both synthetic and real-world tests while, most importantly, guaranteeing positive depths.

Pose estimation, which tackles the problem of solving the relative pose between cameras or world coordinate systems, is widely and thoroughly studied in computer vision. It has proven useful in various real-world application scenarios such as Structure from Motion (SfM) 1, Simultaneous Localization and Mapping (SLAM) 2, Light Detection and Ranging (LIDAR) 3,4, autonomous navigation 5,6, Augmented Reality (AR) 7,8, and so on. From a modeling perspective, the input of the pose estimation problem is a set of camera observations, usually given as uncalibrated pixel coordinates or calibrated image coordinates. The output is the set of pose parameters between coordinate systems, specifically the rotation matrix, translation vector, depths, etc. The integral part that connects the input and the output is the pose estimation algorithm. In the literature, pose estimation is often used interchangeably with terms such as Perspective-n-Point (PnP). PnP algorithms are therefore also considered pose estimation algorithms. Over the decades, pose estimation (PnP) algorithms have evolved dramatically in both algorithmic design and performance. According to the nature of the observations, state-of-the-art (SOTA) algorithms can be categorized into 2D-2D and 3D-2D methods, both of which require intrinsic camera information for pose estimation. Representative 2D-2D algorithms include the Eight-point (8-pt) 9, normalized Eight-point (8-pt (norm)) 10, Minimum-eigenvalue (Min-eig) 11, and Five-point (5-pt) 12,13 algorithms. These algorithms accept 2D-2D calibrated image coordinates as inputs. Popular 3D-2D algorithms include EPnP 14, EPnP-GN 15, DLS 16, RPnP 17, LHM 18, DLT 19, OPnP 20, MLPnP 21,22, and (R)EPPnP 23, and more recently SQPnP 24, CPnP 25, Uncertain-PnP or PnP(L) 26, and QPEPs 27. These solvers usually try to achieve certain optimization goals. For example, owing to their linear nature, 8-pt and 8-pt (norm) focus on execution efficiency and are simple and fast; the 5-pt algorithm requires only five correspondence pairs to recover the pose, which extends its applicability to more scenarios; EPnP demonstrates competitive performance at low computational cost; RPnP further improves on EPnP at similar computational cost, excelling in translation estimates; OPnP performs well for all variables, including rotation, translation, and depth, even in the presence of noise; (R)EPPnP incorporates an outlier rejection scheme into the algorithm itself, circumventing the time-consuming Random Sample Consensus (RANSAC) [28][29][30][31][32] procedure, which is frequently used to remove outliers from the input dataset; and finally, MLPnP is a maximum likelihood solution to the PnP problem that includes image observation uncertainties to improve accuracy. Such uncertainties do occur in real-world scenarios, since not all points are measured with equal certainty. For example, sensor imperfections can lead to such uncertainties, resulting in pixel shifts. To an extent, MLPnP can also handle outliers, with an outlier screening process embedded in the algorithm. The more recent SQPnP treats PnP problem

Varying the noise levels
We performed the experiments with zero-mean Gaussian noise whose standard deviation (std) ranged from 1e-7 to 1e-1 in calibrated image coordinates. To cover such a wide noise spectrum, we used a logarithmic x-axis ("semilog" plots). The y-axis was the percentage of good estimates, so the higher the curve, the better the performance. At each noise level, 100 iterations were run. The number of points was ten.
Figures 1, 2, 3 and 4 show the percentages of good rotation, translation, reprojection, and depth estimates of all algorithms under varying noise levels. By "good", we mean that the estimation error is below a set threshold. It can be seen from Fig. 1 that the Pos-dep was among the best; it only degraded after the noise std exceeded 1e-2. The DLT and (R)EPPnP were the worst. Our Pos-dep algorithm was also the best among all 2D-2D algorithms, staying at 100% when the noise was below 1e-2. Beyond that, it also had the slowest degradation rate of the 2D-2D algorithms. The SQPnP outperformed CPnP when the noise std exceeded 1e-2.
For translations, we see from Fig. 2 that the Pos-dep stayed at the 100% line until 3e-3, after which it remained comparable with 3D-2D algorithms such as MLPnP and EPnP. Among the 2D-2D algorithms, our Pos-dep beat all others when the noise was below 3e-3; beyond this level, it remained comparable to the Min-eig while beating the others, including the 8-pt (norm). The SQPnP outperformed CPnP when the noise std exceeded 1e-2.
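The "percentage of good estimates" metric over a log-spaced noise axis can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, `toy_rotation_error` is a made-up stand-in for running a real solver, and one level per decade is an assumed spacing. The 5-degree value is the rotation threshold quoted elsewhere in this study.

```python
import random

def log_spaced_levels(lo_exp=-7, hi_exp=-1, per_decade=1):
    """Noise std values evenly spaced on a log10 axis (1e-7 ... 1e-1 by default)."""
    steps = (hi_exp - lo_exp) * per_decade
    return [10.0 ** (lo_exp + k / per_decade) for k in range(steps + 1)]

def percent_good(errors, threshold):
    """Percentage of trials whose error is strictly below the threshold."""
    good = sum(1 for e in errors if e < threshold)
    return 100.0 * good / len(errors)

def toy_rotation_error(noise_std, rng):
    """Hypothetical stand-in for a solver: the error simply grows with the noise std."""
    return abs(rng.gauss(0.0, 1.0)) * noise_std * 500.0  # degrees, illustrative only

rng = random.Random(0)
levels = log_spaced_levels()
curve = []
for std in levels:
    errors = [toy_rotation_error(std, rng) for _ in range(100)]  # 100 trials per level
    curve.append((std, percent_good(errors, threshold=5.0)))     # 5-degree threshold
```

Plotting `curve` with a logarithmic x-axis gives one line of the kind of plot described above; a real evaluation would replace `toy_rotation_error` with the error of each algorithm against the ground truth.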
From Fig. 3, we see that for reprojections the Pos-dep performed the best of all the algorithms (staying at the 100% line), including the 3D-2D PnP algorithms and the recent SQPnP and CPnP. In fact, reaching and maintaining small reprojection errors across the noise spectrum is one of the main advantages of the Pos-dep algorithm. The SQPnP outperformed CPnP when the noise std exceeded 2e-2 (0.02).
For depth estimates, we see from Fig. 4 that, overall, the 3D-2D algorithms were better than the 2D-2D ones, with DLT and (R)EPPnP being the exceptions. Our Pos-dep again beat all other 2D-2D algorithms, especially at low to medium noise levels. The SQPnP outperformed CPnP when the noise std exceeded 1e-2.
In terms of negative depth instances, we can clearly see from Fig. 5 that, out of 100 tests at each noise level, the Pos-dep stayed flat at the zero-instance line throughout, guaranteeing positive depth estimates. Overall, the SQPnP algorithm had fewer instances of negative depths than the CPnP algorithm.
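For readers unfamiliar with how a negative depth instance is detected after the fact, the following sketch recovers the two depths of a correspondence from an estimated pose and counts sign violations. It illustrates the conventional post-hoc check, not the Pos-dep algorithm itself. The convention X2 = R X1 + t, the homogeneous calibrated coordinates [u, v, 1], and all function names are assumptions made for this example.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_view_depths(R, t, x1, x2):
    """Least-squares depths (d1, d2) solving d2*x2 - d1*(R x1) = t.

    Assumed convention: a 3D point X1 in camera 1 maps to X2 = R X1 + t in
    camera 2, with X1 = d1*x1 and X2 = d2*x2 for calibrated points [u, v, 1].
    """
    a = [-c for c in mat_vec(R, x1)]  # column multiplying d1
    b = list(x2)                      # column multiplying d2
    aa, ab, bb = dot(a, a), dot(a, b), dot(b, b)
    at, bt = dot(a, t), dot(b, t)
    det = aa * bb - ab * ab           # determinant of the 2x2 normal matrix
    if abs(det) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depths are not observable")
    d1 = (bb * at - ab * bt) / det
    d2 = (aa * bt - ab * at) / det
    return d1, d2

def count_negative_depth_points(R, t, pts1, pts2):
    """Number of correspondences with a non-positive depth in either view."""
    count = 0
    for x1, x2 in zip(pts1, pts2):
        d1, d2 = two_view_depths(R, t, x1, x2)
        if d1 <= 0.0 or d2 <= 0.0:
            count += 1
    return count

# A point at X1 = (0.5, 0.2, 4) seen by two cameras related by X2 = X1 + (-1, 0, 0).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x1 = [0.125, 0.05, 1.0]
x2 = [-0.125, 0.05, 1.0]
depths_ok = two_view_depths(identity, [-1.0, 0.0, 0.0], x1, x2)
# Flipping the sign of t (a typical ambiguity) places the point behind both cameras.
depths_flipped = two_view_depths(identity, [1.0, 0.0, 0.0], x1, x2)
```

With the correct translation both depths are positive; with the sign-flipped translation both become negative, which is the kind of instance counted in Fig. 5.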
The computation time of Pos-dep was consistently around 6 ms across the noise spectrum, lower than that of the OPnP algorithm, which was around 10 ms (Fig. 6). The runtime was recorded using MATLAB on a PC with 16 GB RAM and a 6-core Intel CPU with a 2.6 GHz clock frequency.
Owing to the length limit of this paper, more comprehensive results, including tests with smaller thresholds and comparisons within the 3D-2D and 2D-2D algorithm pools only, can be found on pages 194-210 of the thesis 34.
In addition to the percentage of good estimates, mean and median estimation errors are also commonly used metrics in this area, as in the OPnP paper 20. Here, 100 independent runs were again executed at each noise level, and the mean and median errors at each noise level were calculated. The DLS algorithm is excluded from the plots since it is out of bounds. Close-up plots are shown next to the originals. The results are shown in Fig. 7, which includes the mean and median rotation, translation, reprojection, and depth errors.
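One reason to report both statistics is that a single failed run can dominate the mean while leaving the median almost untouched. The sketch below uses made-up error values (not results from this study) to show the effect.

```python
from statistics import mean, median

# Hypothetical rotation errors (degrees) from ten runs; the last run is a gross failure.
errors = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 1.0, 0.9, 1.1, 85.0]

mean_error = mean(errors)      # pulled upward by the single failed run
median_error = median(errors)  # reflects the typical run
```

Here the median stays at the level of a typical run, whereas the mean is several times larger, so the two curves together indicate whether an algorithm fails occasionally or degrades uniformly.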
For rotations (Fig. 7a and b), when the noise std was 1e-2, the median rotation error of Pos-dep was around 5 degrees. For translations (Fig. 7c and d), when the noise std was 1e-2, the median translation error of Pos-dep was around 10%. For reprojections, we can see from Fig. 7e and f that both the mean and median reprojection errors of Pos-dep were among the smallest. For depths, we can see from Fig. 7g and h that when the noise std was below 1e-2, the depth errors were below 30%, which was the threshold used to identify "good" estimates. From Fig. 7, we also found that the OPnP and the SQPnP had similar performance and overall performed the best in the algorithm pool.

Varying the percentage of outliers
We define outliers as data points corrupted by noise with a std of 0.1 (1e-1). The outlier percentages ranged from 10 to 100% in 10% increments. The total number of points (inliers and outliers combined) was 20. Again, 100 tests were independently performed at each outlier rate. While a proportion of the points were assigned as outliers, the remaining points were not corrupted by noise and were considered "clean data".
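The contamination scheme described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function name, the random choice of which points become outliers, and the independent perturbation of both image coordinates are not specified in the text.

```python
import random

def contaminate(points, outlier_rate, outlier_std=0.1, seed=0):
    """Corrupt a fraction of 2D calibrated points with zero-mean Gaussian noise.

    Returns the new point list and the sorted indices of the corrupted points.
    Points that are not selected are left exactly as they were ("clean data").
    """
    rng = random.Random(seed)
    n_out = round(outlier_rate * len(points))
    outlier_idx = set(rng.sample(range(len(points)), n_out))
    noisy = []
    for i, (u, v) in enumerate(points):
        if i in outlier_idx:
            u += rng.gauss(0.0, outlier_std)
            v += rng.gauss(0.0, outlier_std)
        noisy.append((u, v))
    return noisy, sorted(outlier_idx)

# 20 clean points on a small grid, 30% of which are turned into outliers.
clean = [(0.05 * i, 0.02 * i) for i in range(20)]
noisy, idx = contaminate(clean, outlier_rate=0.3)
```

Sweeping `outlier_rate` from 0.1 to 1.0 in steps of 0.1 reproduces the ten contamination levels of this experiment.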
We followed an evaluation method similar to that used when varying the noise levels. We can see from Fig. 8 that the Pos-dep was comparable to the EPnP-GN and better than EPnP. The OPnP and the SQPnP stayed at 100% and overlapped with each other; both were better than the CPnP algorithm. At high outlier rates such as 70%, the Pos-dep still achieved more than 80 good rotation estimates out of 100.
Figure 9 shows that our Pos-dep was again the best among all 2D-2D algorithms. The LHM, OPnP, and SQPnP had similar performance. The SQPnP outperformed CPnP. Overall, the Pos-dep was comparable to EPnP-GN. At a 60% outlier rate, the Pos-dep outperformed the CPnP algorithm. At high outlier rates such as 90% and 100%, the Pos-dep could also compete with the CPnP algorithm.
Figure 10 shows that, for reprojections, our Pos-dep performed exceptionally well among the algorithms, staying at 100%. OPnP, LHM, RPnP, and SQPnP came next. The SQPnP outperformed CPnP. The threshold used to classify a good reprojection was 0.5, measured in calibrated image coordinates.
Figure 11 shows that, for depth errors, the Pos-dep again beat all other 2D-2D algorithms. The LHM, OPnP, and SQPnP were the best among all algorithms. The SQPnP outperformed the CPnP here. The threshold was 30%. Our Pos-dep was the only 2D-2D algorithm comparable with the 3D-2D algorithms. In fact, it beat MLPnP when the outlier rate was beyond 30%. From Fig. 11, it can be seen that our algorithm almost formed a "separating line" between the 3D-2D and 2D-2D algorithms, following the EPnP algorithm.
For negative depth instances, we see from Fig. 12 that the Pos-dep stayed at the bottom of the graph throughout, meaning it recorded zero instances of negative depths. Interestingly, the negative depth instances gradually increased as the outlier rate increased. At a 100% outlier rate, roughly 30 or more instances of negative depths were reported for the SOTA algorithms. Overall, the SQPnP algorithm had fewer instances of negative depths than the CPnP algorithm.
For the computation time, see Fig. 13. In theory, the computation time of Pos-dep should be nearly constant, since the number of points was fixed at 20 and the computational cost of Pos-dep depends only on the number of points. Although the computation time of Pos-dep increased slightly, we argue that this increase was within the margin of reasonable system error. For instance, as the experiments went on, the machine might have undergone thermal throttling, with a reduced clock frequency affecting the CPU's performance. The computation time of Pos-dep ranged approximately from 20 to 40 ms, a difference of 20 ms. The OPnP stayed nearly constant at 10 ms.
As with varying the noise levels, we also evaluated the performance by studying the mean and median estimation errors in rotation, translation, reprojection, and depth. Again, 100 independent runs were executed at each outlier rate (percentage), and the mean and median errors at each outlier rate were calculated. The configurations, such as the noise std, the choice of percentages, and the total number of points, are the same as those used for Figs. 8, 9, 10, 11, 12 and 13. The DLS algorithm is excluded since it is out of bounds. The results are shown in Fig. 14.
From Fig. 14a, we see that for rotations, the mean error of Pos-dep was comparable to that of EPnP-GN, while the median error was among the best and stayed below 20% for all outlier rates. For translations (Fig. 14b), when the outlier rate was below 40%, the mean translation error was below 30% and the median translation error was below 20%. In extreme circumstances such as a 90% outlier rate, the mean rotation, median rotation, mean translation, and median translation errors of Pos-dep were around 20 degrees, 20 degrees, 40%, and 30%, respectively. For reprojections, from Fig. 14c we see that the Pos-dep outperformed the other algorithms across the outlier rates in both mean and median errors. At a 100% outlier rate, the median error of Pos-dep was around 0.1, smaller than that of all other algorithms. For depth estimates, from Fig. 14d we see that the mean errors of Pos-dep were among the best, and the median errors were the best among all 2D-2D algorithms and beat (R)EPPnP. When the outlier rate exceeded 50%, the median depth errors of Pos-dep were comparable to those of the most recent CPnP algorithm, although, admittedly, the recent SQPnP and CPnP algorithms were among the best in each subplot of Fig. 14. In this experiment, the noise std used was 0.1 in calibrated coordinates, which could comfortably transform the raw data into outliers.

Varying the number of points
The 8-pt algorithm requires a minimum of eight points. Hence, we varied the number of points from 8 to 30 in increments of 2. Figures 15, 16, 17 and 18 show the errors of rotation, translation, reprojection, and depth, respectively. We can see from Fig. 15 that when the number of points exceeded 16, the proposed algorithm performed well (the higher the better), and the CPnP performed even better; the SQPnP only reported around 20% of good estimates. For translations we can see from Fig. 16 … better than the SQPnP algorithm, except only at point numbers 24, 28, and 30. The proposed algorithm stayed around 90% in good estimates across all point numbers. The proposed algorithm reported zero instances of negative depths throughout (Fig. 19), while on average (R)EPPnP and DLT reported more than 20 negative instances out of a total of 100 when the point number exceeded 18, more than the other algorithms.
Figure 20 shows the computation time (runtime) of the algorithms versus the number of points. We see that the runtime of our algorithm increased almost "linearly". This is expected given the mathematical nature of our algorithm's D_R matrix; see Eq. (23) in Methods for its definition. The D_R matrix is of dimension 2n × 2n, where n is the number of points. As the number of points increases, the dimension of D_R also increases linearly, resulting in an increase in computation time when calculating the eigenvectors and eigenvalues of the matrix. Note that even when the number of points was 30, the proposed algorithm still reported a runtime below 40 ms. For each point number, 100 runs were independently performed and the runtime averaged over these runs was reported. The reader is also encouraged to refer to pages 212 to 215 of the thesis 34 for more comprehensive results when varying the number of correspondence points.
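To make the growth of the 2n × 2n matrix concrete, the short sketch below tabulates its dimension, number of entries, and storage footprint for a few point counts. The helper name is hypothetical, and dense storage in 8-byte double precision is an assumption; the text does not state how D_R is stored.

```python
def d_r_size(n_points, bytes_per_entry=8):
    """Dimension, entry count, and dense storage (bytes) of a 2n x 2n matrix."""
    dim = 2 * n_points
    entries = dim * dim
    return dim, entries, entries * bytes_per_entry

# Point counts taken from the experiments: 8 to 30, and 100 to 1000.
sizes = {n: d_r_size(n) for n in (8, 30, 100, 1000)}
```

The dimension grows linearly with n (60 for n = 30, 2000 for n = 1000), while the number of entries grows quadratically (3600 versus 4,000,000), which is why the eigen-decomposition becomes the dominant cost for large point sets.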
In real applications, hundreds or even thousands of points may exist. Therefore, we also studied the variation in computation time when the number of points ranged from 100 to 1000 in increments of 100. The runtime was recorded using MATLAB on the same PC with 16 GB RAM and a 6-core Intel CPU with a 2.6 GHz clock frequency. We compared our proposed algorithm with the more performant OPnP algorithm in Fig. 21.
We observe from Fig. 21 that, similar to the case shown in Fig. 20, the runtime of Pos-dep increased with the number of points. As discussed before, an increase in the number of points n causes an increase in the dimension of the matrix D_R, which is of size 2n × 2n. Thus, the eigenvalue and eigenvector computation of D_R becomes more computationally expensive. We also see that the OPnP showed a slight runtime increase, ranging from approximately 0.01 s to over 0.03 s. We speculate that this increase (only about 20 ms) was within the margin of system error caused by the machine heating up, with thermal throttling occurring during the prolonged computation. Again, in Fig. 21 the runtime averaged over 100 independent runs at each point number was reported.
We noticed that the thresholds chosen may be too small (such as 1e-14) for certain scenarios in the above experiments. Therefore, as in the first two scenarios, we also studied the mean and median estimation errors, which are used extensively in the OPnP work 20. Also, in the above experiments, we did not add noise to the data when varying the number of points, whereas in the OPnP work 20, zero-mean Gaussian noise with a standard deviation of 2 pixels was added. Hence, when studying the mean and median errors, we also added zero-mean Gaussian noise with a fixed standard deviation of 0.01 in calibrated image coordinates, which corresponds to noise with a standard deviation of 8 pixels. For each data point, 100 independent runs were performed, and the mean/median errors were reported. In 20, 4 to 15 points with an increment of 1 are used, whereas in our experiments, to meet the requirement of each algorithm and to be consistent with the previous results, 8 to 30 points with an increment of 1 were used. The experimental results are shown in Fig. 22. The 5-pt and DLS were excluded from the graphs since they were out of bounds compared to the other algorithms.
For rotation errors, in Fig. 22a, when the point number reaches 13, the Pos-dep starts to yield a mean rotation error of less than 5 degrees, which is considered the threshold of a good estimate in this study. In Fig. 22b, the median rotation error starts to fall below 5 degrees when the point number reaches 11. For translations, in Fig. 22c, when the point number reaches 13, the Pos-dep starts to yield a mean translation error of less than 10%, which is considered the threshold of a good estimate in this study. In Fig. 22d, the median translation error starts to fall below 10% when the point number reaches 10. For reprojection errors, it can be seen from Fig. 22e and f that the Pos-dep was among the best, and from 14 points onward, the Pos-dep reported an averaged reprojection error of less than 0.02 at each number of points. For depth errors, we see from Fig. 22g that the Pos-dep was better than (R)EPPnP; from Fig. 22h we see that when the point number reaches 15, the Pos-dep starts to yield median depth errors of less than 10%, which is considered the threshold of a "good" estimate in this study. Through Fig. 22, we see that the recent algorithms, the SQPnP and CPnP, both performed well, staying low among the algorithms across all point numbers.
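The stated equivalence between a noise std of 0.01 in calibrated image coordinates and 8 pixels follows from scaling by the focal length expressed in pixels. The sketch below makes this arithmetic explicit. The focal length of 800 pixels is not given in the text; it is inferred here from the 0.01 to 8-pixel correspondence and should be read as an assumption, as are the function names.

```python
def normalized_to_pixel_std(std_normalized, focal_px):
    """Convert a noise std in calibrated (normalized) coordinates to pixels."""
    return std_normalized * focal_px

def pixel_to_normalized_std(std_px, focal_px):
    """Convert a noise std in pixels to calibrated (normalized) coordinates."""
    return std_px / focal_px

FOCAL_PX = 800.0  # assumed; implied by 0.01 normalized units <-> 8 pixels

ours_px = normalized_to_pixel_std(0.01, FOCAL_PX)     # noise level used in this study
opnp_norm = pixel_to_normalized_std(2.0, FOCAL_PX)    # the 2-pixel level, in normalized units
```

Under this assumption, the 2-pixel noise level corresponds to 0.0025 in calibrated coordinates, so the noise used here is four times larger.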

On real-world scenarios
The experiments on real-world data comprise four scenarios. The first and the second use standard datasets available online with ground truth provided 35. The third was conducted on a rigid box image taken with a mirrorless camera (SONY α6000). The fourth was conducted on a satellite mockup image taken with an iPhone rear camera (iPhone SE2). We calibrated both cameras ourselves using MATLAB's camera calibration toolbox and a checkerboard pattern to obtain the intrinsic parameters.

Results on the standard datasets
For the first scenario, we used ten pairs of successive dinosaur images from the dataset. We ran the test on each image pair and calculated the mean and median estimation errors. We removed the points not on the object and removed duplicate points, so that the total number of correspondence points was consistent (around ten) for each pair of images (we increased the number of points in the RANSAC experiment). These correspondences were generated by the SIFT 36 algorithm. We summarized the results in tabular form and marked our algorithm in green. The images used from the Dino dataset can be found in Fig. S1 of the Supplementary Information.
We see from Table 1 that our algorithm reported an average rotation error of 0.5917 degrees, beating all algorithms except DLS, RPnP and OPnP; the median rotation error (0.6038 degrees) was below the 5-degree threshold for good estimates and slightly better than that of the CPnP. The CPnP performed better than SQPnP.
For translations, from Table 2 we see that our Pos-dep reported a translation error of 5.2030%, outperforming all except the DLS, RPnP and OPnP. The median error (4.4400%) was below the 10% threshold for good estimates. This time, the SQPnP performed better than the CPnP in both mean and median errors.
We can see from Table 3 that the reprojection error of our Pos-dep was 0.0001, which is the smallest among all algorithms and matches the OPnP. The median error of Pos-dep (0.0001) was also small compared to the other algorithms. The SQPnP had smaller reprojection errors than the CPnP in both mean and median.
For depth errors, we rounded the results to integers. We can see from Table 4 that both the mean and median errors of our Pos-dep were 4%, larger only than those of the DLS, RPnP and OPnP algorithms and comparable to those of the EPnP-GN. The SQPnP had a smaller mean depth error than the CPnP, while its median error was larger.
In terms of negative depth instances, the experimental data show that, out of ten pairs of images, negative depths occurred in two image pairs for the (R)EPPnP, three image pairs for the 8-pt (norm), four image pairs for the 5-pt, and six image pairs for the DLT and the 8-pt algorithms. All other algorithms, including our Pos-dep, recorded zero negative instances.
For the Dino dataset, the computation time of each algorithm on each image pair is summarized in Table 5. The Pos-dep yielded 0.0066 s (6.6 ms) for both the mean and median computation time on a single image pair, comparable to the SQPnP algorithm, which yielded 6.5 ms for the mean time and 6.8 ms for the median time. The computation time of SQPnP was nearly half that of CPnP.
For the second scenario, we tested the Temple dataset from the same website 35. Please also refer to pages 227-234 of the thesis 34. Recent algorithms such as SQPnP and CPnP were again included, and successive image pairs were again used. We ran the test on each image pair and calculated the mean and median estimation errors. We removed the points not on the object and removed duplicate points, so that the total number of correspondence points was consistent (around ten) for each pair of images (we increased the number of points in the RANSAC experiment). These correspondences were generated by the SIFT 36 algorithm. The results are summarized in tabular form. The images used from the Temple dataset can be found in Fig. S2 of the Supplementary Information.
We see from Table 6 that our algorithm reported an average rotation error of 1.3111 degrees, beating all algorithms except DLS, EPnP-GN and OPnP; the median rotation error (1.0991 degrees) was below the 5-degree threshold for good estimates and better than that of SQPnP. CPnP performed better than SQPnP in both mean and median rotation errors.
For translations, from Table 7 we see that our Pos-dep reported a translation error of 8.9336%, outperforming all except the DLS, EPnP-GN and OPnP. The median error (9.9931%) was below the 10% threshold. The SQPnP performed better than the CPnP in mean error but worse in median error.
We can see from … the best, and better than the CPnP. The SQPnP had smaller reprojection errors than the CPnP in both mean and median.
For depth errors, we rounded the results to integers. We can see from …
In terms of negative depth instances, the experimental data show that, out of ten pairs of images, negative depths occurred in one image pair for the 8-pt (norm) and CPnP, three image pairs for the 8-pt, four image pairs for the 5-pt, eight image pairs for the DLT, and nine image pairs for the (R)EPPnP. All other algorithms, including our Pos-dep, recorded zero negative instances.
For the Temple dataset, the computation time of each algorithm on each image pair is summarized in Table 10. The Pos-dep yielded 0.0040 s (4 ms) for the mean time and 0.0036 s (3.6 ms) for the median time. The Pos-dep took less computation time than both the more recent SQPnP and CPnP algorithms. Again, we found that the computation time of SQPnP was nearly half that of CPnP.

Results on the rigid box image
For the third scenario, we performed the experiment on a rigid box. To get the 3D-2D matches, we first calibrated a SONY α6000 mirrorless camera with a fixed focal length of 30 mm using a checkerboard pattern. Then, we used the same calibrated camera, without refocusing, to capture the image of the rigid box. The original dimension of the image was 6000 (H) by 4000 (V) pixels. We chose ten correspondence points on the surface of the rigid box using MATLAB's "getpts" control point selection function for 2D point selection. Since the ground truth was not known, we transformed the two 3D-2D problems into one 3D-3D problem with known ground truth. The rigid box image with the selected control points (correspondences), as well as the transformation method from 3D-2D to 3D-3D, are presented in Figs. S3 and S4 of the Supplementary Information.
We see from Fig. 24 that the Pos-dep had a good translation error of 6.48%, lower than the 10% threshold and better than the normalized 8-pt, 5-pt, and MLPnP; the LHM, OPnP, and SQPnP had zero errors, better than the CPnP (0.79%).
We can see from Fig. 25 that the reprojection error of our Pos-dep was 0.0275, among the smallest and comparable to EPnP's. The SQPnP and CPnP had similar reprojection errors (0.0116, 0.0125), with the SQPnP performing slightly better.
From Fig. 26, we see that the depth error of Pos-dep was 1.16%, much lower than the 10% threshold for good estimates. The SQPnP (0.02%) performed better than the CPnP (0.13%) in terms of depth errors.
In the rigid box case, no negative depths were reported by any algorithm. In terms of computation time, the 8-pt took 0.0078 s, the 8-pt (norm) 0.0282 s, the Min-eig 0.0417 s, the Pos-dep 0.0100 s, the 5-pt 0.0147 s, the EPnP 0.1177 s, the EPnP-GN 0.0155 s, the DLS 0.0322 s, the RPnP 0.0050 s, the LHM 0.0047 s, the DLT 0.0016 s, the OPnP 0.0340 s, the MLPnP 0.0131 s, the (R)EPPnP 0.0083 s, the SQPnP 0.0026 s, and the CPnP 0.0034 s. The 10 ms computation time of Pos-dep was closest to those of the MLPnP (13.1 ms) and (R)EPPnP (8.3 ms). The SQPnP (2.6 ms) took less computation time than the CPnP (3.4 ms).
Using the estimated results, it is tempting to calculate the 3D reconstructed model of the rigid box and compare it with the original one. This result is included in Fig. S5 of the Supplementary Information.

Results on the satellite mockup image
For the fourth scenario, we performed the experiment on a satellite mockup image. The process of obtaining the 3D-2D matches was similar to that in the rigid box case. The original dimension of the image this time was 4032 (H) by 3024 (V) pixels. The focal length of the iPhone's rear-facing camera was fixed at 3.99 mm throughout the calibration process. Since the ground truth was again not known, we applied the same 3D-2D to 3D-3D transform as that used in the rigid box case. The configuration of the satellite mockup, with the coordinate system setup and the selected correspondence points, is shown in Fig. S6 of the Supplementary Information.
For rotation errors, we see from Fig. 27 that our Pos-dep reported 12.75 degrees. The SQPnP, LHM, and OPnP were at zero errors, better than the CPnP (0.02 degrees).
We see from Fig. 28 that the Pos-dep reported a translation error of 9.45%, lower than the 10% threshold; the LHM, OPnP, and SQPnP had zero errors, better than the CPnP (1.26%).
We can see from Fig. 29 that the reprojection error of our Pos-dep was 0.0240, among the smallest and comparable to EPnP's. The SQPnP and CPnP had similar reprojection errors (0.0162, 0.0166), with the SQPnP performing slightly better.
From Fig. 30, we see that the depth error of Pos-dep was 2.42%, much lower than the 10% threshold for good estimates. The CPnP (0%) was better than the SQPnP (0.05%) in terms of depth errors.
In the satellite mockup case, all algorithms except the 5-pt reported positive depths. In terms of computation time, the 8-pt took 0.0006 s, the 8-pt (norm) 0.0078 s, the Min-eig 0.0498 s, the Pos-dep 0.0103 s, the 5-pt 0.0037 s, the EPnP 0.0103 s, the EPnP-GN 0.0102 s, the DLS 0.0029 s, the RPnP 0.0025 s, the LHM 0.0034 s, the DLT 0.0005 s, the OPnP 0.0199 s, the MLPnP 0.0055 s, the (R)EPPnP 0.0024 s, the SQPnP 0.0023 s, and the CPnP 0.0010 s. The 10.3 ms computation time of Pos-dep was closest to those of the EPnP (10.3 ms) and EPnP-GN (10.2 ms). The CPnP (1 ms) took less computation time than the SQPnP (2.3 ms).

Comparing with the RANSAC algorithm
In the practice of pose estimation using PnP algorithms, outlier detection and rejection strategies such as RANSAC is often applied and only inliers are used for pose estimation, similar to the (R)EPPnP 23 algorithm.
To understand how our proposed algorithm performs comparing to the RANSAC algorithm, we self-coded the RANSAC algorithm to remove the outliers and used only inliers to estimate pose.To achieve a tradeoff between accuracy and efficiency, our RANSAC algorithm takes the Min-eig algorithm to iteratively select the consensus set, then it utilizes our proposed algorithm to estimate the pose using only the inliers.All correspondence points, including both the inliers and outliers, are obtained from the 25th and 26th image of the standard dinosaur dataset 35 using the SIFT algorithm.Figure 31 shows all the correspondences found with straight lines connecting them.334 correspondences were obtained in total, which more reflects real-world scenarios where hundreds of points are usually encountered.
Since the ground truth pose can be directly computed from the dataset, the rotation errors, translation errors, reprojection errors, and depth errors with and without RANSAC can be compared.Without RANSAC, the www.nature.com/scientificreports/rotation error was 12.5368 degrees; the translation error was 13.0664%; the depth error from the first image was 5.0166%, and 5.0268% from the second image; the reprojection error was 0.0032 in calibrated image coordinates; the runtime was 4.1672 s (334 points).With RANSAC, 16 correspondence points were randomly selected for the consensus set, and 100 iterations were used to finalize such set.Therefore, the resultant consensus dataset comprising all inliers has 16 data points.The rotation error was 5.1490 degrees; the translation error was 2.5094%; the depth error from the first image was 4.3327%, and 4.4798% from the second image; the reprojection error was 0.00013 in calibrated image coordinates; the runtime was reduced to 0.0126 s due to the reduced number of points.The runtime of the RANSAC process only was 7.5512 s.Illustrative results are shown in Fig. 32, where the averaged depth error of both images, and the cumulative runtime (RANSAC and algorithm execution combined) are shown.For visualization purposes, the reprojection errors are not shown in Fig. 32 and are only numerically compared.
Using RANSAC before algorithm execution improves the accuracy at the cost of increased cumulative runtime. The accuracy improvement is more obvious in the rotation and translation errors than in the averaged depth errors. Most of the added computation time is consumed by the RANSAC process itself.
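The two quantities plotted in Fig. 32, the depth error averaged over both images and the cumulative runtime, follow directly from the numbers reported above. The short Python sketch below only reproduces that arithmetic; the dictionary keys and the `summarize` helper are illustrative names, not part of the original implementation.

```python
# Values reported in the text (dinosaur dataset, images 25 and 26).
without_ransac = {"depth_err_img1": 5.0166, "depth_err_img2": 5.0268,
                  "ransac_runtime_s": 0.0, "pose_runtime_s": 4.1672}
with_ransac = {"depth_err_img1": 4.3327, "depth_err_img2": 4.4798,
               "ransac_runtime_s": 7.5512, "pose_runtime_s": 0.0126}


def summarize(run):
    """Return (averaged depth error in %, cumulative runtime in s)."""
    avg_depth = (run["depth_err_img1"] + run["depth_err_img2"]) / 2.0
    cumulative = run["ransac_runtime_s"] + run["pose_runtime_s"]
    return avg_depth, cumulative


avg_without, time_without = summarize(without_ransac)
avg_with, time_with = summarize(with_ransac)

print(f"without RANSAC: depth {avg_without:.4f}%, runtime {time_without:.4f} s")
print(f"with RANSAC:    depth {avg_with:.4f}%, runtime {time_with:.4f} s")
```

With these inputs the averaged depth error drops from about 5.02% to about 4.41%, while the cumulative runtime rises from about 4.17 s to about 7.56 s, almost all of it spent in RANSAC.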

Discussion
In this work, we presented a positive depth-guaranteed pose estimation algorithm that estimates the camera pose with performance better than or comparable to the state-of-the-art algorithms in certain aspects. The proposed Pos-dep algorithm guarantees positive depths. It solves for depths with all positive entries directly within the algorithm execution process, instead of calculating them afterwards, which ensures the positivity of the depths even before the pose is estimated. In the presence of noise and outliers, our algorithm showed its robustness to various interfering conditions, while producing small reprojection errors (Figs. 3, 7, 10, 14, 17, 22, 25, 29; Tables 3, 8) and guaranteeing positive depth estimates all the time. The proposed algorithm does not achieve its tolerance to a high percentage of outliers by detecting or removing outliers before computing the pose; it demonstrates this tolerance without any outlier rejection scheme. Of course, outliers can be screened and removed before algorithm execution, and an outlier rejection scheme can be integrated as part of the algorithm. However, here we demonstrate the outlier tolerance of the algorithm itself, without extra outlier-handling techniques, which often take place "outside" the algorithm.
Many PnP algorithms evaluate their performance on degenerate configurations such as planar or quasi-singular configurations. We studied previous works that include these configurations and found that RPnP is robust in both planar and quasi-singular point configurations and can also deal with non-planar configurations; EPnP struggles in co-planar configurations; and LHM assumes co-planar or weak-perspective point configurations, which may limit its wide adoption. For our algorithm, we randomly generated the data points in 3D space for simulation. Thus, they were not intentionally arranged in co-planar or quasi-singular configurations. We also ran simulations in which all points lay on the same plane parallel to the image plane, and found that our algorithm produced inconsistent results only when the points were too close to the camera. When the distance from the camera to the object was at least half the length of the data span, the algorithm always produced consistent results. This finding indicates that our algorithm can handle the co-planar configuration. In the real-world scenarios, the points used in the rigid box case of this paper were deliberately selected to avoid degenerate configurations such as co-planar ones; that is, most of the points were selected to be box corners. In addition, our algorithm was also able to handle the co-planar configuration in the satellite mockup scenario. Thus, in both simulation and real-world testing, our algorithm was able to handle certain degenerate configurations.
We are aware that many of our experiments start from 10 points, whereas many PnP algorithms accept any number of points greater than 3. Because we also test the 5-pt and 8-pt algorithms, which require at least 5 and 8 points respectively, at least 8 points are needed to meet the point number requirement of all the algorithms. Starting from 10 points is consistent with the real-world scenarios of the study, where 10 control points were selected. In the experiment of "varying the number of points", however, we chose the starting number to be 8 (8 to 30 points with an increment of 1).
Mathematically, our Pos-dep (Min-eig-Depths) algorithm finds the depth as an eigenvector of a constructed data matrix. That is, instead of directly solving for the rotation matrix R and translation vector t, it first finds all the "positive" eigenvectors (eigenvectors with all positive entries) of a data matrix, and picks the "positive" eigenvector associated with the smallest eigenvalue to be the depth vector, so that all depth values are guaranteed to be positive. To obtain such a "positive" eigenvector, the algorithm keeps taking randomly generated rotations as new initial guesses until an eigenvector with all positive entries is found. The maximum number of repetitions is a user-defined parameter. During our study, we did not encounter a case where no positive solution could be found. However, to handle such a rare case, we set all entries of the resulting eigenvector to 1, which is a "positive" vector. Hence, the existence of such a "positive" eigenvector is supported by empirical observation rather than by a mathematical proof, which we acknowledge. After the "positive" eigenvector is found, the algorithm uses the Optimal Quaternion Algorithm [41-46] to calculate a new rotation, iteratively compares the current rotation with the new rotation under a set threshold (for example, 5°), and updates R if necessary (see Algorithm 1 in Methods). After the rotation is solved, the algorithm uses the solved R, the solved (positive) depths, and the input data (observations) to calculate the translation vector t.
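The selection step described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the data matrix is assumed to be symmetric (so that `numpy.linalg.eigh` applies), the function name `pick_positive_eigenvector` is hypothetical, and the restart loop over random initial rotations is omitted. Because an eigenvector is only defined up to sign, an all-negative eigenvector is flipped before the positivity check.

```python
import numpy as np


def pick_positive_eigenvector(D):
    """Return (eigenvalue, vector) for the all-positive eigenvector of the
    symmetric matrix D with the smallest eigenvalue. If none exists, fall
    back to a vector of ones, as described in the text."""
    eigenvalues, eigenvectors = np.linalg.eigh(D)  # eigenvalues ascending
    for lam, v in zip(eigenvalues, eigenvectors.T):
        if np.all(v < 0):
            v = -v  # same eigenvector, opposite sign
        if np.all(v > 0):
            return float(lam), v
    return None, np.ones(D.shape[0])


# Toy 3x3 matrix built from known eigenvectors (not a real data matrix):
# the smallest eigenvalue (0.1) has a mixed-sign eigenvector, so the
# all-positive eigenvector with eigenvalue 0.5 should be selected.
s2, s3, s6 = np.sqrt(2.0), np.sqrt(3.0), np.sqrt(6.0)
Q = np.column_stack([[1 / s2, -1 / s2, 0.0],
                     [1 / s3, 1 / s3, 1 / s3],
                     [1 / s6, 1 / s6, -2 / s6]])
D = Q @ np.diag([0.1, 0.5, 2.0]) @ Q.T
lam, depths = pick_positive_eigenvector(D)
print(lam, depths)
```

In this toy case the mixed-sign eigenvector is skipped even though its eigenvalue is smaller, which mirrors the rule of taking the smallest eigenvalue among the "positive" eigenvectors only.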
As for computational cost and efficiency, the Pos-dep/Min-eig-Depths had a runtime of less than 35 ms when the number of points was less than 30 (Fig. 20). In real-world tests, the Pos-dep reported a runtime of about 6 ms (Table 5) and 4 ms (Table 10) on the two standard datasets, and a runtime of about 10 ms on the rigid box and the satellite mockup cases. Since the time-consuming sign check and solution rejection processes have been integrated into the Pos-dep itself, the algorithm handles them automatically, without manual, standalone checks after algorithm execution. Therefore, no runtime is spent on a cheirality check after the pose is solved, because this has already been taken care of during algorithm execution. Although this mechanism may inherently increase the runtime of the algorithm itself, it folds the separate sign-checking process into algorithm execution and avoids the extra workload.
There exist some limitations in terms of accuracy and runtime of the Pos-dep. As for accuracy, we acknowledge that the proposed algorithm is less accurate than certain state-of-the-art (SOTA) algorithms, as reported in Fig. 22. However, in Fig. 22a, when the point number reaches 13, the Pos-dep starts to yield a mean rotation error of less than 5 degrees; in Fig. 22b, when the point number reaches 11, the median rotation error also starts to fall below 5 degrees, which is considered the threshold of a good rotation estimate in this study. In Fig. 22c, when the point number reaches 13, the Pos-dep starts to yield a mean translation error of less than 10%; in Fig. 22d, when the point number reaches 10, the median translation error also starts to fall below 10%, which is considered the threshold of a good translation estimate in this study and also applies to depth estimates. For reprojection errors, it can be seen from Fig. 22e and f that the Pos-dep was among the best, and from 14 points onward, the Pos-dep reported an averaged reprojection error of less than 0.02 at each point number. In fact, one of the strengths of our algorithm is its ability to provide pose estimates with a small reprojection error, which is a crucial factor in pose estimation accuracy. It is also worth noting that the Pos-dep was less accurate than EPnP in Fig. 22, which corresponds to "varying the number of points" and is only one of the many experiments conducted in this work. In the experiment of "varying the percentage of outliers", we can see from Fig. 14 that the Pos-dep outperformed EPnP, on top of other benefits such as much smaller reprojection errors and positive depth estimates. It can be observed from the many experiments conducted in this paper that the Pos-dep may be less performant in one experiment, yet more performant in another, where it even outperforms the most recent algorithms (e.g., Figs. 15, 16, 17, 18, and the Dino and Temple datasets). Being able to consistently generate small reprojection errors with positive depth estimates is a key advantage of the Pos-dep algorithm.
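The "good estimate" thresholds quoted above (5 degrees for rotation, 10% for translation) can be applied as in the following minimal Python sketch. The error definitions are assumptions, since this section does not spell them out: the rotation error is taken as the geodesic angle between the estimated and true rotations, and the translation error as the relative norm of the difference. All function names are illustrative.

```python
import numpy as np


def rotation_error_deg(R_est, R_true):
    """Angle (degrees) of the relative rotation R_est^T R_true."""
    cos_angle = (np.trace(R_est.T @ R_true) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def translation_error_pct(t_est, t_true):
    """Relative translation error in percent (assumed definition)."""
    t_est, t_true = np.asarray(t_est, float), np.asarray(t_true, float)
    return float(100.0 * np.linalg.norm(t_est - t_true) / np.linalg.norm(t_true))


def percent_good(errors, threshold):
    """Percentage of trials whose error is below the threshold."""
    errors = np.asarray(errors, dtype=float)
    return float(100.0 * np.mean(errors < threshold))


# Example: a 10-degree rotation about the z-axis compared with identity.
a = np.radians(10.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a), np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
print(rotation_error_deg(Rz, np.eye(3)))
print(translation_error_pct([1.05, 0.0, 0.0], [1.0, 0.0, 0.0]))
print(percent_good([1.0, 4.0, 6.0, 12.0], threshold=5.0))
```

Running `percent_good` over the 100 trials of one noise level, with a threshold of 5 for rotation errors or 10 for translation errors, gives the kind of "percentage of good estimates" statistic referred to in this study.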
As for runtime, when the number of points n = 30, the Pos-dep indeed costs more time than the other algorithms (Fig. 20). The increase in runtime with the number of points is due to the increase in the dimension of the data matrix D_R, which is of size 2n × 2n. We also believe that the runtime inefficiency is not due to factors such as MATLAB. The OPnP was also originally implemented in MATLAB, with a runtime of around 20 ms. In our experiments, the runtime of the MATLAB implementation of OPnP on our machine was around 8 ms when n < 30 (Fig. 20) and around 20 ms when n ranged from 100 to 1000 (Fig. 21b). There is practically no significant runtime increase when using MATLAB on different machines. Therefore, we think the runtime inefficiency is due to the algorithm itself, which finds the eigenvector corresponding to the minimal eigenvalue of a data matrix of size 2n × 2n. The cost of this eigenvector computation increases as n goes up. We acknowledge that as n increases, the Pos-dep may not be suitable for certain real-time applications; in such cases it can be used as an offline backup solution to verify results, particularly reprojection results. In the near future, we aim to improve the algorithm's efficiency, possibly using parallelized programming or other multi-threaded acceleration techniques.
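As a rough order-of-magnitude illustration of this scaling, assume a dense eigensolver whose cost C grows cubically with the matrix dimension. This is an assumption for illustration only; the solver actually used is not stated here. For the 2n × 2n data matrix this gives:

$$C(n) \propto (2n)^3 = 8n^3, \qquad \frac{C(1000)}{C(30)} = \left(\frac{1000}{30}\right)^3 \approx 3.7 \times 10^{4}.$$

Under this assumption, moving from n = 30 to n = 1000 multiplies the eigen-decomposition work by tens of thousands. Iterative solvers that compute only a single eigenvector can scale more favorably, so the figure should be read as a worst-case sketch rather than a measured ratio.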
There also exist some limitations on the algorithms and datasets used in this study. In terms of algorithms, there exist some recent algorithms such as Uncertain-PnP or PnP(L) 26 and QPEPs 27 . The PnP(L) method integrates the feature uncertainty with "globally convergent PnP(L) solvers, leveraging a complete set of 2D and 3D uncertainties" 26 for pose estimation. The QPEPs method studies many quadratic pose estimation problems (QPEPs), including PnP, hand-eye calibration, point-to-plane registration, etc. 27 , and proposes a "general quaternion-based mathematical model" 27 to unify these QPEPs. Both algorithms have MATLAB implementations. We tried to incorporate the PnP(L) into our study, but encountered difficulties when adapting its "method_list" parameter and "run_pnpl_method()" function to our configuration. We also tried to incorporate the QPEPs method into our study. We managed to replicate some of the test results presented in its paper, but had trouble interpreting the quaternions and relating its implementations in the "test_rel_att.m" and "test_stewart.m" scripts to our configuration. As for other recent algorithms such as CPnP and SQPnP, we found their implementations more straightforward to port into our configuration and managed to incorporate them into this paper. In the near future, we will add more recently proposed algorithms to our comparisons to keep our work up to date, starting with the PnP(L) and the QPEPs, which already have MATLAB implementations.
In terms of datasets, there currently exist more realistic computer vision datasets that are more suitable for real-world applications. Examples of these datasets and benchmarks include the KITTI Odometry Dataset 37 , the Robust Vision Challenge 38 , the ETH3D 2-view stereo benchmark 39 , the Heidelberg HD1K Stereo benchmark 40 , etc. For instance, the KITTI Odometry Dataset includes tasks of stereo, optical flow, visual odometry, 3D object detection, and 3D tracking. The dataset complements established datasets such as Middlebury by providing real-world benchmarks that can better reflect out-of-laboratory scenarios. In our work, however, we tested the algorithms using simulated data and four real-world scenarios. The first and second scenarios are the Middlebury datasets, and the third and fourth scenarios are self-constructed datasets. We acknowledge that, to make the results more convincing, experiments can be conducted with datasets and benchmarks that better reflect real-world scenarios. In the near future, we will expand both the breadth and depth of our work by using these more realistic datasets.
Despite the limitations discussed above, we believe one of the most significant contributions of the proposed Pos-dep algorithm is its neat yet elegant mathematical derivation process. Pos-dep tries to minimize the spatial translation error as a least squares problem. It uses the mathematical fact that the minimum of the least squares . . .

$$\chi^i_2 = \left[\, x^1_2,\ x^2_2,\ x^3_2,\ \ldots,\ x^{i-1}_2,\ (1-n)\,x^i_2,\ x^{i+1}_2,\ \ldots,\ x^n_2 \,\right] \tag{13}$$

. . .

Generating problem instances
For the simulation part, taking "varying the noise levels" as an example, we chose an arbitrary calibration matrix with randomized rotation and translation; we also randomized the 3D positions $X^i_1$ and calculated $X^i_2$ using Eq. (1); we were then able to generate calibrated and uncalibrated (pixel) coordinates, which are the inputs of the various pose estimation algorithms. Noise was added as follows: we added random Gaussian noise with different standard deviations, known as "noise levels", to the calibrated coordinates $x^i_1$, $x^i_2$. For each noise level, 100 tests were performed to calculate the percentage of good estimates at that noise level. In addition, the mean and median estimation errors of the algorithms were also studied.
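A minimal Python sketch of this generation procedure is given below. It is not the authors' code and rests on several assumptions: Eq. (1) is taken to be the rigid motion X_2 = R X_1 + t, the rotation angle is bounded (20 degrees) and the points are placed 4 to 8 units in front of the first camera so that both depths stay positive, and the noise standard deviation is expressed in calibrated units. The names `random_rotation` and `make_instance` are hypothetical.

```python
import numpy as np


def random_rotation(rng, max_angle_deg=20.0):
    """Rotation about a random axis by a bounded random angle (Rodrigues)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def make_instance(n_points=10, noise_std=0.0, seed=0):
    """Generate one two-view problem instance with optional Gaussian noise."""
    rng = np.random.default_rng(seed)
    R = random_rotation(rng)
    t = rng.uniform(-0.5, 0.5, size=3)
    # 3D points in the first camera frame, all in front of the camera.
    X1 = np.column_stack([rng.uniform(-2.0, 2.0, n_points),
                          rng.uniform(-2.0, 2.0, n_points),
                          rng.uniform(4.0, 8.0, n_points)])
    X2 = X1 @ R.T + t                 # assumed form of Eq. (1)
    x1 = X1[:, :2] / X1[:, 2:3]       # calibrated image coordinates
    x2 = X2[:, :2] / X2[:, 2:3]
    # Gaussian noise on the calibrated coordinates ("noise level").
    x1 = x1 + rng.normal(scale=noise_std, size=x1.shape)
    x2 = x2 + rng.normal(scale=noise_std, size=x2.shape)
    return R, t, X1, X2, x1, x2


R, t, X1, X2, x1, x2 = make_instance(n_points=10, noise_std=0.0, seed=1)
print(X1[:, 2].min() > 0, X2[:, 2].min() > 0)
```

Repeating `make_instance` with 100 different seeds per noise level would mirror the 100 tests per level described above.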
For the real-world part, the correspondences were found using the SIFT algorithm for the standard datasets, and were selected manually for the rigid box and the satellite mockup cases. The camera intrinsics of the standard datasets were available, so no calibration was needed. For the rigid box and the satellite mockup cases, we calibrated the cameras ourselves using checkerboard patterns to obtain calibrated image coordinates. For the rigid box image and the satellite mockup image, the 3D positions of the points were manually measured using a ruler, and the 2D coordinates were obtained through MATLAB's "getpts" control point selection and later converted to calibrated image coordinates using the calibration information.

Integrating the SQPnP and CPnP algorithms
The SQPnP algorithm has been implemented in OpenCV and is available through the solvePnP function with the flag option set to SOLVEPNP_SQPNP. We used OpenCV's Python package "cv2" and called the Python code from MATLAB by prefixing the Python expressions with "py". In this way, the Python implementation of SQPnP can be used directly inside MATLAB and compared with the other algorithms.

Conclusion
We proposed a novel camera pose estimation algorithm, the Pos-dep/Min-eig-Depths algorithm. The algorithm was formulated from a comprehensive mathematical derivation and solves for the depths as the eigenvector associated with the minimum eigenvalue of a constructed data matrix. The algorithm was tested along with many other pose estimation algorithms under various noisy conditions in numerous simulated and real-world experiments. The proposed algorithm showed performance better than or comparable to the SOTA in certain experiments, while producing small reprojection errors and guaranteeing positive depths. In the near future, we aim to improve the accuracy and efficiency of the proposed algorithm. We will also incorporate more recently proposed algorithms, as well as more realistic computer vision datasets, into our study.

Figure 7. Mean and median estimation errors when varying the noise levels. (a) Mean rotation error, a close-up is shown to the right; (b) Median rotation error, a close-up is shown to the right; (c) Mean translation error, a close-up is shown to the right; (d) Median translation error, a close-up is shown to the right; (e) Mean reprojection error, a close-up is shown to the right; (f) Median reprojection error, a close-up is shown to the right; (g) Mean depth error, a close-up is shown to the right; (h) Median depth error, a close-up is shown to the right.

Figure 14. Mean and median estimation errors when varying the percentage of outliers. (a) Mean and median rotation errors; (b) Mean and median translation errors; (c) Mean and median reprojection errors; (d) Mean and median depth errors.

Figure 18. Depth errors vs. number of points.

Figure 20. Computation time vs. number of points.

Figure 21. Runtime comparison between Pos-dep and OPnP when larger numbers of points are used. (a) Runtime comparison when the number of points ranges from 100 to 1000; (b) A close-up of the OPnP algorithm; (c) A close-up of the Pos-dep when the number of points equals 100, 200, and 300; (d) A close-up of the Pos-dep when the number of points equals 400, 500, and 600; (e) A close-up of the Pos-dep when the number of points equals 700, 800, 900, and 1000.

Figure 22. Mean and median rotation, translation, reprojection, and depth errors when varying the number of points under fixed noise of 8 pixels. (a) Mean rotation error, a close-up is shown to the right; (b) Median rotation error, a close-up is shown to the right; (c) Mean translation error, a close-up is shown to the right; (d) Median translation error, a close-up is shown to the right; (e) Mean reprojection error, a close-up is shown to the right; (f) Median reprojection error, a close-up is shown to the right; (g) Mean depth error, a close-up is shown to the right; (h) Median depth error, a close-up is shown to the right.

Figure 32. Accuracy and runtime comparison without and with RANSAC applied. With RANSAC applied before algorithm execution, the accuracy improves at the cost of increased cumulative runtime.

that the Pos-dep was only behind RPnP, which stayed above

Table 8 that the reprojection error of our Pos-dep was 0.0005, which was among the smallest ones. It was comparable to SQPnP and better than CPnP. The median error of Pos-dep (0.0003) was also among

Scientific Reports | (2023) 13:22165 | https://doi.org/10.1038/s41598-023-49553-9

Table 9 that both the mean and median errors of our Pos-dep were 4%. The mean error of Pos-dep beat those of both the SQPnP and CPnP. The median error of Pos-dep beat that of SQPnP. The CPnP performed better than the SQPnP in terms of both mean and median depth errors.

Table 1. Rotation errors in degrees (Dino). Significant values are in bold.

Table 2. Translation errors in percentage (Dino). Significant values are in bold.

For rotation errors, we see from Fig. 23 that our Pos-dep was 3.87 degrees, comparable to the 8-pt, normalized 8-pt, and MLPnP. The SQPnP and OPnP were at zero errors, better than the CPnP (0.19 degrees).

Table 3. Reprojection errors in calibrated units (Dino). Significant values are in bold.

Table 4. Depth errors in percentage (Dino). Significant values are in bold.

Table 5. Computation time on each image pair (Dino). Significant values are in bold.

Table 6. Rotation errors in degrees (Temple). Significant values are in bold.

Table 7. Translation errors in percentage (Temple). Significant values are in bold.

Table 8. Reprojection errors in calibrated units (Temple). Significant values are in bold.

Table 9. Depth errors in percentage (Temple). Significant values are in bold.

Table 10. Computation time on each image pair (Temple). Significant values are in bold.