Abstract
The problems of dense stereo reconstruction and object class segmentation can both be formulated as Random Field labeling problems, in which every pixel in the image is assigned a label corresponding to either its disparity or an object class such as road or building. While these two problems are mutually informative, no attempt has been made to jointly optimize their labelings. In this work we provide a flexible framework, configured via cross-validation, that unifies the two problems. We demonstrate that joint optimization substantially improves performance by resolving ambiguities that would be present in real-world data if the two problems were considered separately. To evaluate our method, we augment the Leuven data set (http://cms.brookes.ac.uk/research/visiongroup/files/Leuven.zip), a stereo video shot from a car driving around the streets of Leuven, with 70 hand-labeled object class and disparity maps. We hope that the release of these annotations will stimulate further work in the challenging domain of street-view analysis. Complete source code is publicly available (http://cms.brookes.ac.uk/staff/Philip-Torr/ale.htm).
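To make the labeling formulation concrete, the following is a minimal sketch of a joint energy on a toy three-pixel chain, in which each pixel receives a pair (object class, disparity). It is not the framework described in this article: the function names, the Potts smoothness penalty, the `joint_cost` coupling table, the numeric costs, and the brute-force minimization are all assumptions chosen for illustration only.

```python
from itertools import product


def joint_energy(labels, unary_class, unary_disp, pair_weight, joint_cost):
    """Energy of a joint labeling on a 1-D chain of pixels (toy model).

    labels: one (class, disparity) tuple per pixel.
    unary_class[i][c]: cost of assigning class c to pixel i.
    unary_disp[i][d]: cost of assigning disparity d to pixel i.
    pair_weight: Potts penalty per disagreement between neighbouring pixels.
    joint_cost[c][d]: hypothetical coupling cost between class c and disparity d.
    """
    energy = 0.0
    # Per-pixel terms: class cost, disparity cost, and the coupling of the two.
    for i, (c, d) in enumerate(labels):
        energy += unary_class[i][c] + unary_disp[i][d] + joint_cost[c][d]
    # Smoothness terms between neighbouring pixels on the chain.
    for (c1, d1), (c2, d2) in zip(labels, labels[1:]):
        energy += pair_weight * ((c1 != c2) + (d1 != d2))
    return energy


def brute_force_minimise(n_pixels, n_classes, n_disps, **terms):
    """Exhaustively search all joint labelings (only feasible for toy sizes)."""
    per_pixel = list(product(range(n_classes), range(n_disps)))
    best = min(
        product(per_pixel, repeat=n_pixels),
        key=lambda lab: joint_energy(list(lab), **terms),
    )
    return list(best)


# Toy data: class 0 = "road", class 1 = "building"; two disparity levels.
terms = dict(
    unary_class=[[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]],  # middle pixel is ambiguous
    unary_disp=[[0.0, 2.0], [0.0, 2.0], [2.0, 0.0]],   # disparity cue is clear
    pair_weight=0.5,
    joint_cost=[[0.0, 1.0], [1.0, 0.0]],               # class c prefers disparity c
)

best = brute_force_minimise(n_pixels=3, n_classes=2, n_disps=2, **terms)
best_energy = joint_energy(best, **terms)
print(best, best_energy)
```

In this toy setup the class costs alone cannot decide the middle pixel, since both classes cost the same there, while the disparity costs together with the coupling term favour one of them. Exhaustive search is used only because the example has 64 possible labelings; it does not scale to real images.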
Cite this article
Ladický, L., Sturgess, P., Russell, C. et al. Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction. Int J Comput Vis 100, 122–133 (2012). https://doi.org/10.1007/s11263-011-0489-0
DOI: https://doi.org/10.1007/s11263-011-0489-0