A Motion Free Approach to Dense Depth Estimation in Complex Dynamic Scene
Abstract
Despite the recent success of deep learning methods in per-frame monocular dense depth estimation of rigid scenes, they fail to achieve similar success for complex dynamic scenes, such as MPI Sintel [4]. Moreover, conventional geometric methods that address this problem using a piecewise rigid scene model require a reliable estimate of the motion parameters for each local model, which is difficult to obtain and validate. In this work, we show that, given per-pixel optical flow correspondences between two consecutive frames and a sparse depth prior for the reference frame, we can recover the dense depth map for the successive frames without solving for motion parameters. By assigning a locally rigid structure to the piecewise planar approximation of a dynamic scene that transforms as-rigid-as-possible over frames, we demonstrate that we can bypass the motion estimation step. In essence, our formulation provides a new way to think about and recover the dense depth map of a complex dynamic scene, one that is recursive, incremental and motion free in nature; it can therefore also be integrated with modern neural network frameworks for large-scale depth-estimation applications. Our proposed method does not make any prior assumption about the rigidity of a dynamic scene and, as a result, is applicable to a wide range of scenarios. Experimental results show that our method can effectively provide dense depth maps for the successive/multiple frames of a dynamic scene without using any motion parameters.
1 Introduction
Dense depth estimation of complex dynamic scenes from two consecutive frames has recently gained enormous attention from several industries involved in augmented reality, autonomous driving, movies, etc. Applications such as obstacle detection [21] and robot navigation [20] need reliable depth to develop autonomous systems. Although recent research on this problem has provided some promising theory and results, its success strongly depends on accurate estimates of 3D motion parameters [19, 26].
To our knowledge, almost all the existing geometric solutions to this problem have tried to fit the well-established theory of rigid reconstruction to estimate the per-pixel depth of dynamic scenes from monocular images [24, 19, 26]. Hence, these extensions are intricate to execute and depend heavily on reliable per-object or per-superpixel [1] motion estimates [24, 19, 26]. The main issue with these frameworks is that, even if the depth for the first/reference frame is known, we must solve for per-superpixel or per-object motion to obtain the depth for the next frame. As a result, the composition of their objective function fails to utilize the depth knowledge and therefore does not integrate well into large-scale applications. In this work, we argue that in a dynamic scene, if the depth for the reference frame is known, then it seems “unnecessary or at least undesirable” to estimate motion to recover the dense depth map for the next frame. Therefore, relative motion estimation, usually treated as an essential step for obtaining the depth of a complex dynamic scene, becomes optional given prior knowledge of the depth of the reference frame and the dense optical flow between frames. To endorse our argument, we propose a new motion free approach which is easy to implement and allows users to avoid the complexity associated with optimization on the SE(3) manifold.
We posit that the recent geometric methods to solve this task have been limited by their inherent dependence on the motion parameters. Consequently, we present an alternative method that casts the dynamic scene depth estimation task as a global as-rigid-as-possible (ARAP) optimization problem which is motion-free. Inspired by the prior work [19], we model the dynamic scene as a set of locally planar surfaces. Previous work constrains the movement of each local planar structure based on the homography [23] and its relative motion between frames. In contrast, we propose that the ARAP constraint over a dynamic scene may not need 3D motion parameters, and that its definition based only on the 3D Euclidean distance metric is a sufficient regularization to supply the depth for the next frame. At this point, one may ask “Why ARAP assumption for a dynamic scene?”
Consider a general real-world dynamic scene: the change we observe in the scene between consecutive time frames is not arbitrary, rather it is regular. Hence, if we observe a local transformation closely, it changes rigidly, but the overall transformation that the scene undergoes is non-rigid. Therefore, the assumption that the dynamic scene deforms as rigid as possible seems quite convincing and practically works well for most real-world dynamic scenes.
To use this ARAP model, we first decompose the dynamic scene as a collection of moving planes. We considered K-nearest neighbors per superpixel [1], which is an approximation of a surfel in the projective space, to define our ARAP model. For each superpixel, we choose three points, i.e. an anchor point (center of the plane) and two other non-collinear points. Since the depth for the reference frame is assumed to be known (for at least 3 non-collinear points per superpixel), we can estimate the per-plane normal for the reference frame; to estimate the per-plane normal for the next frame, however, we need the depth of at least 3 non-collinear points per plane in that frame. If per-pixel depth for the reference frame is known, then the ARAP model can be extended to pixel level without any loss of generality. The only reason for such a discrete planar approximation is the computational complexity.
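As a minimal illustration of the neighborhood structure described above, the following Python sketch builds a K-nearest-neighbor graph over superpixel anchor points. It assumes that neighbors are ranked by Euclidean distance between anchor positions in the image plane; the function name `k_nearest_neighbors` and the toy coordinates are hypothetical and are not taken from the paper's implementation.

```python
import math

def k_nearest_neighbors(anchors, k):
    """Return, for each anchor, the indices of its k nearest other anchors.

    anchors: list of (x, y) image coordinates, one per superpixel (toy input).
    Assumption: proximity is measured in the image plane.
    """
    neighbors = []
    for i, (xi, yi) in enumerate(anchors):
        # Distance from anchor i to every other anchor.
        dists = [(math.hypot(xi - xj, yi - yj), j)
                 for j, (xj, yj) in enumerate(anchors) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Four toy anchor points; each is linked to its two nearest anchors.
anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = k_nearest_neighbors(anchors, k=2)
```

The brute-force search is quadratic in the number of superpixels, which is acceptable for roughly a thousand superpixels; a spatial index could replace it for pixel-level graphs.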
Our ARAP model defined over planes does not take into account the depth continuity along the boundaries of the planes. We address this in the subsequent step by solving a depth continuity constrained optimization problem using the TRW-S algorithm [14] (see Fig. 1 for a sample result).
In this work, we make the following contributions:

We propose an approach to estimate the dense depth map of a complex dynamic scene that circumvents explicit parameterization of the inter-frame motion by imposing an as-rigid-as-possible constraint on the depth estimation.

Our algorithm, under the piecewise planar and as-rigid-as-possible assumptions, appropriately encapsulates the behavior of a dynamic scene to estimate per-pixel depth.

Although the formulation is shown to work ideally for the classical case of two consecutive frames, it is incremental in nature and is therefore easy to extend to multiple frames without estimating any 3D motion parameters. Experimental results over multiple frames show the validity of our claim.
2 Related Work and Our Motivation
Recently, numerous papers motivated by the success of neural networks have been published on dense depth estimation of a dynamic scene from images [31, 8, 30, 6]. Notably, none of these works reports results on the MPI Sintel dataset [4]. For brevity, in this paper we limit our discussion to the recent papers that are geometrically motivated, which makes it easier to position our contributions. We also briefly discuss why our formulation can be more beneficial to learning algorithms for this task than other geometric approaches [19, 26].
The motion-free approach to estimating the 3D geometry of a rigid scene introduced by Li [22], and its extension [13] to a single non-rigidly deforming object, are restricted to handling a few sparse points over multiple frames (M view, N point). To the best of our knowledge, two significant classes of work have been proposed in the recent past for estimating the dense depth map of an entire dynamic scene from two consecutive monocular images [24, 19, 26]; however, all of these methods are motion dependent. These works can broadly be classified as (a) object-level motion segmentation approaches and (b) object-level motion segmentation free approaches.
(a) Object-level motion segmentation approach: Ranftl et al. [26] proposed a two/three-staged approach to solve dense monocular depth estimation of a dynamic scene. Given the dense optical flow field, the method first performs an object-level motion segmentation using epipolar geometry [10]. Per-object motion segmentation is then used to perform object-level 3D reconstruction using triangulation [10]. To obtain a scene consistent depth map, ordering and smoothness constraints were employed over a Quickshift superpixel [29] graph to deliver the final result.
(b) Object-level motion segmentation free approach: Kumar et al. [19] argued that “in a general dynamic scene setting, the task of densely segmenting rigidly moving object or parts is not trivial”. They proposed an over-parametrized algorithm to solve this task without using object-specific motion segmentation. The method, dubbed “Superpixel Soup”, showed that under two mild assumptions about the dynamic scene, i.e. (a) the deformation of the scene is locally rigid and globally as rigid as possible and (b) the scene can be approximated by a piecewise planar model, a scale consistent 3D reconstruction of a dynamic scene can be obtained for both frames with higher accuracy. Inspired by the locally rigid assumption, Noraky et al. [24] recently proposed a method that uses optical flow and a depth prior to estimate the pose and 3D reconstruction of a deformable object.
Challenges with such geometric approaches: Although these methods provide a plausible direction to solve this challenging problem, their use in real-world applications is very limited. The major challenge with these approaches is the correct estimation of motion parameters. The method proposed by Ranftl et al. [26] estimates per-object relative rigid motion, which is not a sensible choice if the objects themselves are deforming. On the other hand, methods such as [24, 19] estimate per-superpixel/region relative rigid motion, which is sensitive to the size of the superpixels and the distance of the surfel from the camera.
The point we are trying to make is: given the depth for the reference frame of a dynamic scene, can we correctly estimate the depth for the next frame using the aforementioned approaches? Maybe yes, but then we again have to estimate the relative rigid motion for each object or superpixel, and so on. Inspired by the “as-rigid-as-possible” (ARAP) intuition [19], in this work we show that if we know the depth for the reference frame and the dense optical flow correspondences between consecutive frames, then estimating relative motion is not essential under the locally planar assumption. We can successfully estimate the depth for the next frame by exploiting the as-rigid-as-possible global constraint. The depth estimates obtained using ARAP can further be refined using a boundary depth continuity constraint.
The next concern could be why we are after solving this problem in a motion free way. Keeping in mind the success of deep learning approaches in estimating per-frame dense depth maps, our cost function can directly provide the depth for the next frame of a dynamic scene without any motion estimate. Since the choice of the reference frame and the next frame is relative, it further provides a recursive way to improve the depth estimate over iterations if supplied with appropriate priors. Moreover, our formulation provides the flexibility to solve for depth at a pixel level rather than at an object level or superpixel level, which is hard to realize using motion-based approaches [24, 19, 26]. Nevertheless, to reduce the overall computational cost, we optimize our objective function at the superpixel level.
3 Piecewise Planar Scene Model
Inspired by the recent work on dense depth estimation of a general dynamic scene [19], our model parameterizes the scene as a collection of piecewise planar surfaces, where each local plane is assumed to be moving over frames. The global deformation of the entire scene is assumed to be as rigid as possible. Moreover, we assign the center of each plane (anchor point) to act as a representative for all the points within that plane (see Fig. 2). In addition to the anchor point of each plane, we take two more points from the same plane so that these three points are non-collinear (see Fig. 3). This strategy is used to define our as-rigid-as-possible constraint between the reference frame and the next frame without using any motion parameters. As the depth for the reference frame and the optical flow between the two successive frames are assumed to be known a priori, each local planar region is described using only four parameters (normal and depth) instead of nine [19].
Our model first assigns each pixel of the reference frame to a superpixel using the SLIC algorithm [1], and each of these superpixels then acts as a representative for its 3D plane geometry. Since the global geometry of the dynamic scene is assumed to deform ARAP, we solve for the depth in the next frame under the condition that the transformation each plane undergoes from the reference frame to the next frame is as small as possible. The solution to the ARAP global constraint provides the depth of three points per plane in the next frame, which is used to estimate the normal and depth of the plane. The estimated depth and normal of each plane are then used to calculate the per-pixel depth in the next frame.
Although our algorithm is described for the classical two-frame case, it is easy to extend to the multi-frame case. The energy function we define below is solved in two steps. First, we solve for the depth of each superpixel in the next frame using the as-rigid-as-possible constraint. Due to the piecewise planar approximation of the scene, the overall solution to the depth introduces discontinuities along the plane boundaries. In the second step, to remove the blocky artifacts caused by the discretization of the scene, we smooth the obtained depth along the boundaries of all the estimated 3D planes using TRW-S [14]. If the ARAP cost function is extended to pixel level, then the boundary continuity constraint can be avoided [11]. Nevertheless, over-segmentation of the scene provides a good enough approximation of a dynamic scene and is computationally easy to handle.
3.1 Model overview
Notation: We refer to two consecutive perspective images $\mathcal{I}$, $\mathcal{I}'$ as the reference frame and the next frame, respectively. Vectors are represented by bold lowercase letters, e.g. ‘$\mathbf{x}$’, and matrices are represented by bold uppercase letters, e.g. ‘$\mathbf{X}$’. The 1-norm and 2-norm of a vector are denoted as $|\cdot|_1$ and $\|\cdot\|_2$, respectively.
3.2 AsRigidAsPossible (ARAP)
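The idea of the ARAP constraint is well known in practice and has been widely used for shape modeling and shape manipulation [12]. Recently, Kumar et al. [19] exploited this idea to estimate a scale consistent dense 3D structure of a dynamic scene. Our use of the ARAP constraint is inspired by the idea in [19], i.e. restrict the deformation such that the overall transformation in the scene between frames is as small as possible.
Let $(d_{1i}, d_{1j})$ and $(d_{2i}, d_{2j})$ be the depths of two neighboring 3D points from the reference coordinate in the reference frame and the next frame, respectively. Let $\mathbf{u}_{1i}$, $\mathbf{u}_{1j}$ be their (homogeneous) image coordinates in the reference frame and $\mathbf{u}_{2i}$, $\mathbf{u}_{2j}$ be their image coordinates in the next frame. If ‘$\mathbf{K}$’ denotes the intrinsic camera calibration matrix, then $\tilde{\mathbf{u}}_{1i} = \mathbf{K}^{-1}\mathbf{u}_{1i} / \|\mathbf{K}^{-1}\mathbf{u}_{1i}\|_2$ and $\tilde{\mathbf{u}}_{1j} = \mathbf{K}^{-1}\mathbf{u}_{1j} / \|\mathbf{K}^{-1}\mathbf{u}_{1j}\|_2$ are the unit vectors in the direction of the respective 3D points for the reference frame. Similarly, the corresponding unit vectors in the next frame are denoted by $\tilde{\mathbf{u}}_{2i}$, $\tilde{\mathbf{u}}_{2j}$ (see Fig. 2(a)). Using these notations, we define the ARAP constraint as:
$$E_{arap} = \sum_{i=1}^{3N} \sum_{j \in \mathcal{N}_i^{K}} w_{ij} \, \Big| \, \|d_{1i}\tilde{\mathbf{u}}_{1i} - d_{1j}\tilde{\mathbf{u}}_{1j}\|_2 - \|d_{2i}\tilde{\mathbf{u}}_{2i} - d_{2j}\tilde{\mathbf{u}}_{2j}\|_2 \, \Big|_1 \tag{1}$$
Here, $N$ is the total number of planes used to approximate the scene and $\mathcal{N}_i^{K}$ is the set of ‘$K$’ neighboring planes local to superpixel $i$ (see Fig. 2(b)). $w_{ij}$ is a weight that falls off exponentially with the image distance between the points, i.e. the rigidity constraint is gradually relaxed if the points are far apart in the image space. This constraint encapsulates our idea: the change in the distance of point $i$ relative to its local neighbors in the next frame should be as small as possible. Note that the summation goes over $3N$ rather than $N$ because each plane contributes three points, namely the anchor point and two other non-collinear points.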
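The ARAP constraint described above compares, for every pair of neighboring points, their 3D distance in the reference frame with their 3D distance in the next frame. The Python sketch below evaluates such a cost for a toy configuration. It is a sketch under stated assumptions rather than the authors' implementation: the weight is assumed to be `exp(-beta * image_distance)` (the text only says it falls off exponentially), `beta` is a hypothetical parameter, and the intrinsics are taken as the identity so that pixels are already in normalized coordinates.

```python
import math

def unit_ray(K_inv, u):
    """Back-project pixel u = (x, y) through the inverse intrinsics and normalize."""
    x, y = u
    r = [K_inv[row][0] * x + K_inv[row][1] * y + K_inv[row][2] for row in range(3)]
    n = math.sqrt(sum(c * c for c in r))
    return [c / n for c in r]

def pair_length(d_i, ray_i, d_j, ray_j):
    """Euclidean distance between two 3D points written as depth times unit ray."""
    return math.sqrt(sum((d_i * a - d_j * b) ** 2 for a, b in zip(ray_i, ray_j)))

def arap_energy(d1, rays1, d2, rays2, pixels1, neighbors, beta=1.0):
    """Weighted sum of absolute changes in pairwise 3D distance between frames."""
    energy = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            # Assumed weight form: exponential fall-off with image distance.
            image_dist = math.hypot(pixels1[i][0] - pixels1[j][0],
                                    pixels1[i][1] - pixels1[j][1])
            w = math.exp(-beta * image_dist)
            len_ref = pair_length(d1[i], rays1[i], d1[j], rays1[j])
            len_next = pair_length(d2[i], rays2[i], d2[j], rays2[j])
            energy += w * abs(len_ref - len_next)
    return energy

# Toy setup: identity intrinsics, three points, fully connected neighborhood.
K_inv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixels = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
rays = [unit_ray(K_inv, u) for u in pixels]
neighbors = [[1, 2], [0, 2], [0, 1]]
d_ref = [2.0, 2.1, 1.9]

# An unchanged scene costs nothing; doubling every depth stretches all
# pairwise distances and is therefore penalized.
e_rigid = arap_energy(d_ref, rays, d_ref, rays, pixels, neighbors)
e_scaled = arap_energy(d_ref, rays, [2.0 * d for d in d_ref], rays, pixels, neighbors)
```

Note that the cost only involves depths and viewing rays; no rotation or translation appears anywhere, which is the sense in which the formulation is motion free.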
3.3 Orientation and Shape Regularization
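Solving the ARAP constraint provides us with the depths of three non-collinear points per plane for the next frame. We use these three depth estimates per plane to solve for the plane normal in the next frame. Let the 3D points corresponding to the three depths for superpixel $i$ in the next frame be denoted as $\mathbf{x}_{2i}^{a}$, $\mathbf{x}_{2i}^{p}$ and $\mathbf{x}_{2i}^{q}$, respectively. We estimate the normal in the next frame as:
$$\mathbf{n}_{2i} = \frac{(\mathbf{x}_{2i}^{p} - \mathbf{x}_{2i}^{a}) \times (\mathbf{x}_{2i}^{q} - \mathbf{x}_{2i}^{a})}{\|(\mathbf{x}_{2i}^{p} - \mathbf{x}_{2i}^{a}) \times (\mathbf{x}_{2i}^{q} - \mathbf{x}_{2i}^{a})\|_2} \tag{2}$$
where the superscript ‘$a$’ is used intentionally to denote the anchor point, which is assumed to be at the center of each plane (see Fig. 3). Rewriting Eq. (2) in terms of depth, with each point expressed as its depth $d$ times the unit vector $\tilde{\mathbf{u}}$ along its viewing ray,
$$\mathbf{n}_{2i} = \frac{(d_{2i}^{p}\tilde{\mathbf{u}}_{2i}^{p} - d_{2i}^{a}\tilde{\mathbf{u}}_{2i}^{a}) \times (d_{2i}^{q}\tilde{\mathbf{u}}_{2i}^{q} - d_{2i}^{a}\tilde{\mathbf{u}}_{2i}^{a})}{\|(d_{2i}^{p}\tilde{\mathbf{u}}_{2i}^{p} - d_{2i}^{a}\tilde{\mathbf{u}}_{2i}^{a}) \times (d_{2i}^{q}\tilde{\mathbf{u}}_{2i}^{q} - d_{2i}^{a}\tilde{\mathbf{u}}_{2i}^{a})\|_2} \tag{3}$$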
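The normal estimation step described above can be sketched in a few lines of Python: three depths along three known unit rays give three 3D points, and the normalized cross product of the two edge vectors leaving the anchor point gives the plane normal. The helper names and the toy plane (the plane z = 2 seen from the origin) are illustrative assumptions; the sign of the returned normal depends on the ordering of the two non-anchor points.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def plane_normal(depths, rays):
    """Unit normal of the plane through three points given as depth * unit ray.

    depths, rays: anchor point first, then the two other non-collinear points.
    """
    xa, xp, xq = ([d * c for c in ray] for d, ray in zip(depths, rays))
    n = cross(sub(xp, xa), sub(xq, xa))
    length = math.sqrt(sum(c * c for c in n))
    if length < 1e-12:
        raise ValueError("the three points are (nearly) collinear")
    return [c / length for c in n]

# Toy example: points (0,0,2), (1,0,2) and (0,1,2) lie on the plane z = 2.
s5 = math.sqrt(5.0)
rays = [[0.0, 0.0, 1.0], [1.0 / s5, 0.0, 2.0 / s5], [0.0, 1.0 / s5, 2.0 / s5]]
depths = [2.0, s5, s5]
normal = plane_normal(depths, rays)  # approximately [0, 0, 1]
```

The collinearity check makes explicit why the text insists on three non-collinear points per plane: otherwise the cross product vanishes and the normal is undefined.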
(a) Orientation smoothness constraint: Once we compute the normal $\mathbf{n}_{2i}$ for each plane and the 3D coordinates $\mathbf{x}_{2i}^{a}$ of the anchor point, which lies on the plane, we estimate the depth of the plane as follows
$$d_{2i}^{\pi} = \mathbf{n}_{2i}^{\top}\,\mathbf{x}_{2i}^{a} \tag{4}$$
The computed depth of the plane is then used to solve for the per-pixel depth in the next frame, assuming the intrinsic camera matrix is known [19, 10]. To encourage smoothness in the change of angles between adjacent planes (see Fig. 3), we define the orientation regularization as
(5) 
where the weight of the term is an empirical constant and the penalty is a truncated function with a scalar truncation parameter.
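To make the step from plane parameters to per-pixel depth concrete, the following Python sketch computes the plane offset from a normal and an anchor point, and then intersects a pixel's viewing ray with that plane. It assumes that "depth" means distance along the unit viewing ray and that the plane is written as normal · X = offset; both conventions, and all function names, are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def plane_offset(normal, anchor_point):
    """Offset of the plane normal . X = offset that passes through the anchor."""
    return dot(normal, anchor_point)

def depth_along_ray(normal, offset, ray):
    """Distance along a unit viewing ray at which it meets the plane."""
    denom = dot(normal, ray)
    if abs(denom) < 1e-12:
        raise ValueError("viewing ray is parallel to the plane")
    return offset / denom

# Plane z = 2 with its anchor on the optical axis.
normal = [0.0, 0.0, 1.0]
offset = plane_offset(normal, [0.0, 0.0, 2.0])

# A pixel on the optical axis sees the plane at distance 2; an oblique ray
# (a 3-4-5 triangle) travels further before reaching the same plane.
d_center = depth_along_ray(normal, offset, [0.0, 0.0, 1.0])
d_oblique = depth_along_ray(normal, offset, [0.6, 0.0, 0.8])
```

Applying `depth_along_ray` to every pixel of a superpixel with that superpixel's plane yields the dense, piecewise planar depth map that the later smoothing step refines.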
(b) Shape smoothness constraint: In our representation, the dynamic scene model is approximated by a collection of piecewise planar regions. Hence, the per-pixel depth obtained using Eq. (1) to Eq. (4) may be discontinuous along the boundaries of the planes in 3D (see Fig. 3). To encourage smoothness of the 3D coordinates of adjacent planes along their region of separation, we define the shape smoothness constraint as
(6) 
The constraint is defined over the set of boundary pixels of each superpixel that are shared with the boundary pixels of other superpixels. Its weight takes into account the color consistency of the planes along the boundary points, a weak continuity constraint [3]. Since all the pixels within the same plane are assumed to share the same model, smoothness for the pixels within a plane is not required. As in the orientation regularization, the penalty is a truncated function with a scalar truncation parameter. The overall optimization steps of our method are provided in Algorithm (1).
(7)  
(8)  
4 Experimental Evaluation
We performed the experimental evaluation of our approach on two benchmark datasets, namely MPI Sintel [4] and KITTI [7]. These two datasets conveniently provide a complex and realistic environment to test and compare our dense depth estimation algorithm. We compared the accuracy of our approach against two recent state-of-the-art methods [19, 26] that use a geometric approach to solve dynamic scene dense depth estimation from monocular images. These comparisons are performed using three different dense optical flow estimation algorithms, namely PWC-Net [27], Flow Fields [2] and Full Flow [5]. All depth estimation accuracies are reported using the mean relative error (MRE) metric. Let $d_{est}^{i}$ be the estimated depth and $d_{gt}^{i}$ be the ground-truth depth of point $i$; then MRE is defined as
$$\mathrm{MRE} = \frac{1}{P}\sum_{i=1}^{P} \frac{|d_{est}^{i} - d_{gt}^{i}|}{d_{gt}^{i}} \tag{9}$$
where ‘$P$’ denotes the total number of points. The statistical results for DMDE [26] and Superpixel Soup [19] are taken from their published work for comparison.
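The mean relative error used throughout the evaluation is simple to compute. The short Python sketch below implements the standard definition (absolute depth error divided by ground-truth depth, averaged over all points); the function name and the toy numbers are illustrative only.

```python
def mean_relative_error(estimated, ground_truth):
    """Average of |estimated - ground_truth| / ground_truth over all points."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - g) / g
               for e, g in zip(estimated, ground_truth)) / len(estimated)

# Relative errors are 0, 0.1 and 0.1, so the mean is 0.2 / 3.
mre = mean_relative_error([2.0, 4.5, 9.0], [2.0, 5.0, 10.0])
```

Because each error is normalized by the true depth, a 1 m mistake on a far point counts less than the same mistake on a near point, which suits scenes with a large depth range.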
Implementation Details:
We over-segment the reference frame into 1000-1200 superpixels using the SLIC algorithm [1] to approximate the scene. Almost all of the experiments use fixed values of = 1 and = 20-25. For computing the dense optical flow correspondences between images, we used both traditional methods and deep-learning frameworks, namely PWC-Net [27], Flow Fields [2] and Full Flow [5]. The depth for the reference image is initialized using the MonoDepth [8] model on the KITTI dataset and using the Superpixel Soup algorithm [19] on the MPI Sintel dataset. The reason for this inconsistent choice is that the available deep neural network depth estimation models fail to provide a reasonable depth estimate on the MPI Sintel dataset (see the supplementary material). The proposed optimization is solved in two stages: first, Eq. (7) is optimized using the SQP [25] algorithm, and then Eq. (8) is optimized using the TRW-S [14] algorithm. The choice of optimizer is purely empirical, and the user may choose a different optimization algorithm to solve the same cost function. The algorithm is implemented in C++/MATLAB and takes 10-12 minutes on a commodity desktop computer to provide the results.
The implementation is evaluated under two different experimental settings. In the first setting, given a sparse depth estimate of a dynamic scene for the reference frame (i.e. for three non-collinear points per superpixel), we estimate the per-pixel depth for the next frame. In the second setting, we generalize this idea of two-frame depth estimation to multiple frames by computing the depth estimates over frames. For easy understanding, MATLAB codes showing our idea of ARAP on synthetic examples of a dynamic scene are provided in the supplementary material.
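The text notes that the optimizer is an empirical choice. To illustrate the first optimization stage without depending on any particular solver, the Python sketch below minimizes an unweighted ARAP cost over the next-frame depths using a plain projected descent with forward-difference gradients and backtracking. This is explicitly not SQP and not the authors' code: it is a simple stand-in, the positivity bound `d_min` is an assumed constraint, and the rays that mimic optical flow are made-up toy values.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def pair_length(d_i, ray_i, d_j, ray_j):
    """Euclidean distance between two 3D points written as depth times unit ray."""
    return math.sqrt(sum((d_i * a - d_j * b) ** 2 for a, b in zip(ray_i, ray_j)))

def arap_cost(d2, d1, rays1, rays2, edges):
    """Unweighted ARAP cost: change in pairwise 3D distances between frames."""
    return sum(abs(pair_length(d1[i], rays1[i], d1[j], rays1[j])
                   - pair_length(d2[i], rays2[i], d2[j], rays2[j]))
               for i, j in edges)

def solve_next_depths(d1, rays1, rays2, edges, iters=200, step=0.1,
                      eps=1e-6, d_min=1e-3):
    """Projected descent with backtracking; a stand-in for an SQP solver."""
    d2 = list(d1)  # start from the reference depths
    best = arap_cost(d2, d1, rays1, rays2, edges)
    for _ in range(iters):
        # Forward-difference estimate of the gradient.
        grad = []
        for k in range(len(d2)):
            bumped = list(d2)
            bumped[k] += eps
            grad.append((arap_cost(bumped, d1, rays1, rays2, edges) - best) / eps)
        t, improved = step, False
        while t > 1e-8:
            # Clamp to keep every depth positive (assumed constraint).
            trial = [max(d_min, d - t * g) for d, g in zip(d2, grad)]
            cost = arap_cost(trial, d1, rays1, rays2, edges)
            if cost < best:  # accept only strictly improving steps
                d2, best, improved = trial, cost, True
                break
            t *= 0.5
        if not improved:
            break
    return d2, best

# Toy data: three points whose viewing rays shift slightly between frames.
rays1 = [normalize(v) for v in ([0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.1, 1.0])]
rays2 = [normalize(v) for v in ([0.02, 0.0, 1.0], [0.13, 0.0, 1.0], [0.02, 0.12, 1.0])]
d1 = [2.0, 2.0, 2.0]
edges = [(0, 1), (0, 2), (1, 2)]

initial_cost = arap_cost(d1, d1, rays1, rays2, edges)
d2, final_cost = solve_next_depths(d1, rays1, rays2, edges)
```

Since a step is accepted only when it lowers the cost, the returned cost can never exceed the cost of simply reusing the reference depths. A constrained solver such as SQP would typically converge in far fewer cost evaluations.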
4.1 MPI Sintel
This dataset gives an ideal setting to evaluate depth estimation algorithms for complex dynamic scenes. It contains image sequences with considerable motion and severe illumination change. Moreover, the large number of non-planar scenes and non-rigid deformations makes it a suitable choice to test the piecewise planar assumption. We selected seven sets of scenes from the clean category of this dataset to test our method.
Table 1: Mean relative error (MRE) on MPI Sintel for different optical flow (OF) methods.
OF Methods        DMDE [26]   S. Soup [19]   Ours
PWC Net [27]      -           -              0.1848
Flow Fields [2]   0.2970      0.1669         0.1943
Full Flow [5]     -           0.1933         0.2144
(a) Two-frame results: While testing our algorithm for the two-frame case, the reference frame depth is initialized using the recently proposed Superpixel Soup algorithm [19]. The optical flow between the frames is computed using PWC-Net [27], Flow Fields [2] and Full Flow [5]. Table (1) shows the statistical performance comparison of our method against other geometric approaches. The statistics show that, without any motion estimation, our method performs better than DMDE [26] and stays close to Superpixel Soup [19] (0.1943 versus 0.2970 and 0.1669 MRE, respectively, with Flow Fields). Qualitative results within this setting are shown in Fig. 4.
(b) Multi-frame results: In the multi-frame setting, only the depth for the first frame is initialized. The result obtained for the next frame is then used to estimate the dense depth map of the upcoming frames. Since we are dealing with a dynamic scene, the environment changes gradually and therefore the error starts to accumulate over frames. Fig. 9(a) reflects this propagation of error over frames. Qualitative results over multiple frames are shown in Fig. 5.
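The multi-frame procedure described above is a simple recursion: the depth estimated for one frame becomes the reference depth for the next. The Python sketch below captures only this control flow. The two-frame estimator is passed in as a callback (`estimate_next`, a hypothetical name), and the toy callback used here merely shifts depths so that the recursion can be run end to end.

```python
def propagate_depth(initial_depth, flows, estimate_next):
    """Chain a two-frame depth estimator over a sequence of optical flows.

    initial_depth: depth for the first frame (the only initialized frame).
    flows:         one optical flow per consecutive frame pair.
    estimate_next: callable (reference_depth, flow) -> next-frame depth.
    """
    depths = [initial_depth]
    for flow in flows:
        # The latest estimate serves as the reference for the next pair.
        depths.append(estimate_next(depths[-1], flow))
    return depths

def toy_step(depth, flow):
    """Stand-in estimator: shifts every depth by a scalar 'flow' value."""
    return [d + flow for d in depth]

sequence = propagate_depth([1.0, 2.0], [0.5, 0.5, -1.0], toy_step)
```

Because each estimate feeds the next, any error made at one frame is inherited by all later frames, which is the accumulation behavior reported for the multi-frame experiments.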
4.2 KITTI
The KITTI dataset has emerged as a standard benchmark to evaluate the performance of dense depth estimation algorithms. It contains images of outdoor driving scenes with different lighting conditions and large camera motion. We tested our algorithm on both the KITTI raw data and the KITTI 2015 benchmark. For the KITTI dataset, we used the Monodepth method [8] to initialize the reference frame depth. Dense optical flow correspondences are obtained using the same aforementioned methods. For consistency, the depth estimation error on the KITTI dataset is measured over the same range of 50 meters as in [8].
Two-frame results: The KITTI 2015 scene flow dataset provides pairs of consecutive frames of a dynamic scene to test algorithms. Table (2) provides the depth estimation statistics of our algorithm in comparison to other competing methods. Here, our result with PWC-Net [27] optical flow and Monodepth [8] depth initialization (0.1182 MRE) is slightly better than the competing results. Fig. 6 shows the qualitative results of our approach in comparison to Monodepth [8] for the next frame.
Multi-frame results: To test the performance of our algorithm on the multi-frame KITTI dataset, we used the KITTI raw dataset, specifically the city, residential and road categories. The depth for only the first frame is initialized using the Monodepth deep learned model, and we then estimate the depth for the upcoming frames. Due to the very large displacement in the scene per frame (150 pixels), the error accumulation curve for the KITTI dataset (Fig. 9(b)) is a bit steeper than for MPI Sintel. Fig. 7 and Fig. 9(b) show the qualitative results and the depth error accumulation over frames on the KITTI raw dataset, respectively.
Table 2: Mean relative error (MRE) on KITTI for different optical flow (OF) methods.
OF Methods        DMDE [26]   S. Soup [19]   Ours
PWC Net [27]      -           -              0.1182
Flow Fields [2]   0.1460      0.1268         0.1372
Full Flow [5]     -           0.1437         0.1665
5 Statistical Analysis
Besides the experimental evaluations under the aforementioned initializations, we also conducted other experiments to better understand the behavior of the proposed method. For ease of understanding, these experiments use the synthetic example shown in Fig. 8. MATLAB codes are provided in the supplementary material for reference.
(a) Effect of the number of superpixels: The number of superpixels used to approximate the dynamic scene can affect the performance of our method. A small number of superpixels can give poor depth results, whereas a very large number of superpixels increases the computation time. Fig. 9(c) shows the change in depth estimation accuracy with respect to the number of superpixels. The curve suggests that, for KITTI and MPI Sintel, 1000-1200 superpixels provide a reasonable approximation of the dynamic scenes.
(b) Effect of the number of nearest neighbors K: The number of K-nearest neighbors used to define the local rigidity graph can have a direct effect on the performance of the algorithm. Although the value used in our experiments works well for the tested benchmarks, it is purely an empirical parameter and can be different for a distinct dynamic scene. Fig. 9(d) demonstrates the performance of the algorithm as the value of K changes.
(c) Performance of the algorithm under noisy initialization: This experiment is conducted to study the sensitivity of the method to noisy depth initialization. Fig. 10(a) shows the change in 3D reconstruction accuracy as the noise level varies from 1% to 9%. We introduced Gaussian noise using the randn() MATLAB function, and the results are documented for the example shown in Fig. 8 after repeating the experiment 10 times and averaging. We observe that the results of our algorithm become questionable when the noise level gets high.
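The noise protocol described above (Gaussian perturbation of the initial depth at levels from 1% to 9%, averaged over 10 repetitions) can be sketched in Python as follows. The sketch assumes multiplicative noise, i.e. each depth is scaled by `1 + level * g` with `g` drawn from a standard normal distribution; this noise model is an assumption, since the text does not spell it out. For brevity the sketch measures only the perturbation injected into the initialization; in the actual experiment the noisy depths would be passed to the depth estimator before measuring the error.

```python
import random

def add_relative_gaussian_noise(depths, level, rng):
    """Scale each depth by (1 + level * g), with g drawn from N(0, 1)."""
    return [d * (1.0 + level * rng.gauss(0.0, 1.0)) for d in depths]

def mean_relative_error(estimated, ground_truth):
    return sum(abs(e - g) / g
               for e, g in zip(estimated, ground_truth)) / len(estimated)

def average_error(depths, level, trials=10, seed=0):
    """Mean relative perturbation over repeated noisy draws."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    errors = []
    for _ in range(trials):
        noisy = add_relative_gaussian_noise(depths, level, rng)
        errors.append(mean_relative_error(noisy, depths))
    return sum(errors) / trials

clean = [2.0, 3.0, 5.0, 8.0]
# Noise levels of 1%, 3%, 5%, 7% and 9%.
curve = {level: average_error(clean, level / 100.0) for level in range(1, 10, 2)}
```

Averaging over repeated draws, as in the text, reduces the variance of each point on the curve so that the trend with noise level is easier to read.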
(d) Performance of the algorithm under the restricted isometry constraint with the objective function: While minimizing the ARAP objective function under this constraint, we restrict the convergence trust region of the optimization. The constraint makes the algorithm work extremely well, both in terms of timing and accuracy, if approximate knowledge about the deformation that the scene may undergo is available a priori. Fig. 10(b) shows the 3D reconstruction accuracy as a function of the constraint bound for the example shown in Fig. 8. Clearly, if we have approximate knowledge about the scene transformation, we can get high accuracy in less time. See Fig. 10(d), which illustrates the quick convergence obtained with this constraint under a suitable range of the bound.
(e) Nature of convergence of the proposed ARAP optimization:
1) Without restricted isometry constraint: As-rigid-as-possible minimization alone, without the restricted isometry constraint, is good enough to provide acceptable results. However, it may take a considerable number of iterations to do so. Fig. 10(c) shows the convergence curve.
2) With restricted isometry constraint: Employing an approximate bound on the deformation that the scene may undergo in the next time instance helps fast convergence with similar accuracy. Fig. 10(d) shows that the same accuracy can be achieved in 60-70 iterations.
6 Limitation and Discussion
Even though our method works well for diverse dynamic scenes, there are still a few challenges associated with the formulation. Firstly, a very noisy depth initialization for the reference frame can give unreliable results. Secondly, our method is challenged by the sudden arrival or removal of dynamic subjects in the scene; in such cases, it may need a reinitialization of the reference depth. Lastly, well-known limitations such as occlusion and temporal consistency, especially around regions close to the image boundaries, can also affect the accuracy of our algorithm.
Discussion: In our defense, we would like to state that motion-based approaches to structure from motion are sensitive to noisy data as well. Algorithms like motion averaging [9], M-estimators and random sampling [28] are quite often used to rectify the solution.
(a) Why do we choose a geometric approach to initialize our algorithm on the MPI Sintel dataset? The LKVO network [30] is one of the top performing networks for dense depth estimation on the KITTI dataset. Our implementation of this network on the MPI Sintel dataset gave unsatisfactory results. Qualitative results obtained using this network on the clean class are provided in the supplementary material. The training parameters are also provided for reference.
(b) What do we gain or lose by our motion free approach?
Estimating all kinds of conceivable motion in a complex dynamic scene from images is a challenging task. In that respect, our method provides an alternative way to obtain per-pixel depth without estimating any 3D motion. However, in achieving this we allow gauge freedom between the frames (temporal relations in 3D over frames).
7 Conclusion
Estimating the per-pixel depth of a dynamic scene in which complex motions are prevalent is a challenging task. Quite naturally, previous methods rely on standard motion estimation techniques to solve this problem, which is itself a non-trivial task for a non-rigid scene. In contrast, this paper introduces a new way to look at this problem, which essentially removes motion estimation as a compulsory step. By closely observing the behavior of most real-world dynamic scenes, it can be inferred that they transform locally rigidly and globally as rigid as possible. This observation allows us to propose a motion-free algorithm for dense depth estimation under the piecewise planar approximation of the scene. Results on benchmark datasets show the competence of our 3D motion-free idea.
Acknowledgement. This work is funded in part by the ARC Centre of Excellence for Robotic Vision (CE140100016), ARC Discovery project on 3D computer vision for geospatial localisation (DP190102261), ARC DECRA project DE140100180 and Natural Science Foundation of China (61420106007, 61871325).
References
 [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 34, pages 2274–2282. IEEE, 2012.
 [2] C. Bailer, B. Taetz, and D. Stricker. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In IEEE international Conference on Computer Vision, pages 4015–4023, 2015.
 [3] A. Blake and A. Zisserman. Visual reconstruction. MIT press, 1987.
 [4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.
 [5] Q. Chen and V. Koltun. Full flow: Optical flow estimation by global optimization over regular grids. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4706–4714, 2016.
 [6] R. Garg, B. V. Kumar, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
 [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. In Int. J. Rob. Res., volume 32, pages 1231–1237, Sept. 2013.
 [8] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 7, 2017.
 [9] V. M. Govindu. Motion averaging in 3d reconstruction problems. In Riemannian Computing in Computer Vision. To Appear. Springer, 2015.
 [10] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
 [11] M. Hornáček, F. Besse, J. Kautz, A. Fitzgibbon, and C. Rother. Highly overparameterized optical flow using PatchMatch belief propagation. In European Conference on Computer Vision, pages 220–234. Springer, 2014.
 [12] T. Igarashi, T. Moscovich, and J. F. Hughes. As-rigid-as-possible shape manipulation. In ACM Transactions on Graphics, volume 24, pages 1134–1141. ACM, 2005.
 [13] P. Ji, H. Li, Y. Dai, and I. Reid. “Maximizing rigidity” revisited: A convex programming approach for generic 3d shape reconstruction from multiple perspective views. In ICCV, pages 929–937. IEEE, 2017.
 [14] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.
 [15] S. Kumar. Jumping manifolds: Geometry aware dense non-rigid structure from motion. arXiv preprint arXiv:1902.01077, 2019.
 [16] S. Kumar, A. Cherian, Y. Dai, and H. Li. Scalable dense non-rigid structure-from-motion: A Grassmannian perspective. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [17] S. Kumar, Y. Dai, and H. Li. Spatio-temporal union of subspaces for multi-body non-rigid structure-from-motion. In Pattern Recognition, volume 71, pages 428–443. Elsevier, May 2017.
 [18] S. Kumar, Y. Dai, and H. Li. Multi-body non-rigid structure-from-motion. In International Conference on 3D Vision (3DV), pages 148–156. IEEE, 2016.
 [19] S. Kumar, Y. Dai, and H. Li. Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. In IEEE International Conference on Computer Vision, pages 4649–4657, Oct 2017.
 [20] S. Kumar, A. Dewan, and K. M. Krishna. A Bayes filter based adaptive floor segmentation with homography and appearance cues. In Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing, page 54. ACM, 2012.
 [21] S. Kumar, M. S. Karthik, and K. M. Krishna. Markov random field based small obstacle discovery over images. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 494–500. IEEE, 2014.
 [22] H. Li. Multi-view structure computation without explicitly estimating motion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2777–2784. IEEE, 2010.
 [23] E. Malis and M. Vargas. Deeper understanding of the homography decomposition for vision-based control. PhD thesis, INRIA, 2007.
 [24] J. Noraky and V. Sze. Depth estimation of non-rigid objects for time-of-flight imaging. In IEEE International Conference on Image Processing, pages 2925–2929. IEEE, 2018.
 [25] M. J. Powell. A fast algorithm for nonlinearly constrained optimization calculations. In Numerical analysis, pages 144–157. Springer, 1978.
 [26] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4058–4066, 2016.
 [27] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [28] P. H. Torr and D. W. Murray. The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision, 24(3):271–300, 1997.
 [29] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705–718. Springer, 2008.
 [30] C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
 [31] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 7, 2017.