Generating open world descriptions of video using common sense knowledge in a pattern theory framework

Aakur, Sathyanarayanan; de Souza, Fillipe; Sarkar, Sudeep

doi:10.1090/qam/1530

Authors: Sathyanarayanan N. Aakur, Fillipe DM de Souza and Sudeep Sarkar
Journal: Quart. Appl. Math. 77 (2019), 323-356
MSC (2010): Primary 54C40, 14E20; Secondary 46E25, 20C20
DOI: https://doi.org/10.1090/qam/1530
Published electronically: January 11, 2019
MathSciNet review: 3932962

Abstract: The task of interpreting activities captured in video extends beyond recognizing observed actions and objects. It involves open world reasoning and constructing deep semantic connections that go beyond what is directly observed in the video and annotated in the training data. Prior knowledge plays a big role. Grenander’s canonical pattern theory representation offers an elegant mechanism to capture these semantic connections between what is observed directly in the image and past knowledge in large-scale common sense knowledge bases, such as ConceptNet. We represent interpretations using a connected structure of basic detected (grounded) concepts, such as objects and actions, that are bound by semantics with other background concepts not directly observed, i.e., contextualization cues. Concepts are basic generators and the bonds are defined by the semantic relationships between concepts. Local and global regularity constraints govern these bonds and the overall connection structure. We use an inference engine based on energy minimization, using an efficient Markov Chain Monte Carlo procedure that uses ConceptNet in its move proposals, to find the structures that describe the image content. Using four publicly available large datasets, Charades, Microsoft Visual Description Corpus (MSVD), Breakfast Actions, and CMU Kitchen, we show that the proposed model can generate video interpretations whose quality is comparable to or better than that reported by state-of-the-art approaches, such as different forms of deep learning models, graphical models, and context-free grammars. Apart from the increased performance, the use of encoded common sense knowledge sources alleviates the need for large annotated training datasets and helps tackle, through prior knowledge, any imbalance in the data, which is the bane of current machine learning approaches.
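
The inference idea in the abstract (concepts as generators, semantic relationships as bonds, and a Markov Chain Monte Carlo search for a low-energy structure) can be illustrated with a small toy sketch. This is not the authors' implementation. The concept names, the relatedness weights standing in for ConceptNet, the detector confidences, the per-concept penalty, and the uniform toggle proposal are all assumptions made for illustration; the paper's sampler uses ConceptNet itself to guide its move proposals.

```python
import math
import random

# Hypothetical relatedness weights standing in for ConceptNet edge weights.
# All numbers here are invented for illustration.
RELATEDNESS = {
    frozenset({"pour", "milk"}): 0.9,
    frozenset({"pour", "bowl"}): 0.7,
    frozenset({"milk", "cereal"}): 0.8,
    frozenset({"bowl", "cereal"}): 0.8,
    frozenset({"milk", "breakfast"}): 0.6,
    frozenset({"cereal", "breakfast"}): 0.9,
    frozenset({"pour", "hammer"}): 0.05,
    frozenset({"hammer", "nail"}): 0.9,
}

# Grounded generators: concepts detected in the video, with assumed confidences.
GROUNDED = {"pour": 0.8, "milk": 0.6, "bowl": 0.7}

# Candidate background generators (contextualization cues), not directly observed.
BACKGROUND = ["cereal", "breakfast", "hammer", "nail"]


def energy(background, penalty=1.0):
    """Energy of a configuration: grounded concepts plus chosen background concepts.

    Lower is better. Detector confidences and bond weights lower the energy;
    each background concept pays a fixed penalty, a crude stand-in for a
    global regularity constraint.
    """
    concepts = list(GROUNDED) + sorted(background)
    total = -sum(GROUNDED.values())
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            total -= RELATEDNESS.get(frozenset({a, b}), 0.0)
    return total + penalty * len(background)


def sample_interpretation(steps=2000, temperature=0.5, seed=0):
    """Metropolis sampler over subsets of BACKGROUND; returns the best state seen."""
    rng = random.Random(seed)
    current = frozenset()
    current_e = energy(current)
    best, best_e = current, current_e
    for _ in range(steps):
        # Symmetric proposal: toggle one background concept in or out.
        proposal = current ^ {rng.choice(BACKGROUND)}
        proposal_e = energy(proposal)
        delta = proposal_e - current_e
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = proposal, proposal_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


if __name__ == "__main__":
    best, best_e = sample_interpretation()
    print(sorted(best), round(best_e, 3))
```

In this toy setting, background concepts that bond strongly with the grounded ones lower the energy by more than their penalty, while weakly related concepts do not, so the search tends to keep only the former.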
