Visual cues for view-invariant human action recognition
2017-02-17T04:42:02Z (GMT) by
Human action is a visually complex phenomenon. Visual representation, analysis and recognition of human actions have become a key focus of research in computer vision, artificial intelligence, robotics and other related scientific disciplines. Applications of automated action recognition include, but are not limited to, intelligent health care monitoring, smart homes, content-based video search, animation and entertainment, human-computer interaction and intelligent video surveillance. All of these application areas revolve around a fundamental question: given a human subject doing something in the field of sensory input, what is the person doing? A machine that can answer this question correctly would greatly benefit the development and practical use of computer vision systems. However, machine recognition of human action is a daunting task because of complex motion dynamics, anthropometric variations, occlusion and a strong dependence on camera viewpoint. In this thesis, we exploit rich visual cues from human actions and use them to propose solutions to human action recognition. The important problem of view-invariance under viewpoint variations is taken as a case study. We collect and explore visual cues from geometrical relationships, spatio-temporal patterns and features, frequency-domain signal analysis and contextual associations of actions, and derive action representations from them for machine recognition. Actions are spatio-temporal patterns, and temporal order plays an important role in their interpretation. We therefore explore the invariance of the temporal order of an action during its execution and use it to devise a new view-invariant action recognition approach. We apply an ordering constraint and feature fusion to local spatio-temporal features. These features are the representation of choice for action recognition because of their computational simplicity and their robustness to occlusion and minor viewpoint changes.
We introduce STOPs (spatio-temporal ordered packets), which combine the discriminative characteristics of multiple features for better recognition performance. In addition, we introduce a spatio-temporal ordering constraint that addresses the orderless nature of the bag-of-features framework for action recognition. Furthermore, to deal with the limitations of feature-based approaches, we explore multiple-view geometry, which has helped solve various complex problems in computer vision. We thoroughly study applications of the static and multi-body flow fundamental matrices for relating information across views. We introduce spatio-temporally consistent dense optical flow to avoid explicit manual detection of human body landmark points and explicit point correspondences. We employ a rank constraint to derive novel tracking-free and training-free action similarity measures that hold across viewpoint variations. Next, we observe that, despite the considerable success of geometrical techniques, the computational cost of dense optical flow calculation is a hindrance. We therefore study frequency-domain analysis of action sequences. This leads to the derivation of spatio-temporal correlation filters, which use frequency-domain filtering to provide fast and efficient action recognition. However, in their original form these filters are view-dependent. To achieve view-invariance, we explore view clustering, which extends the frequency-domain techniques to multiple viewpoints. Contextual information is another important cue for interpreting human actions, especially when actions exhibit interactive relationships with their context. Contextual cues become even more crucial when videos are captured in unfavorable conditions such as extremely low-light nighttime scenes. We therefore take night vision as a case study and present contextual action recognition at nighttime.
We find that context enhancement is imperative for reliable action recognition in such a challenging multi-sensor environment, which leads us to develop novel context enhancement techniques for night vision based on multi-sensor image fusion. Extensive experiments are performed on well-known action datasets, and the results are compared with existing action recognition approaches in the literature. The research findings in this thesis encourage the exploitation of spatio-temporal visual cues for deriving novel action recognition approaches and improving their performance.
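
As an illustration of why an ordering constraint matters for a bag-of-features representation, the following minimal Python sketch compares an orderless histogram with a set of histograms kept in temporal order. It is not the STOPs method of the thesis. The visual-word labels, the two toy sequences and the function names `bag_of_features` and `temporally_ordered_bags` are all hypothetical and chosen only for this example.

```python
from collections import Counter


def bag_of_features(sequence):
    """Orderless histogram of visual-word labels (time stamps are ignored)."""
    return Counter(word for _, word in sequence)


def temporally_ordered_bags(sequence, num_segments, duration):
    """One histogram per temporal segment, kept in temporal order."""
    bags = [Counter() for _ in range(num_segments)]
    for t, word in sequence:
        # Map the frame index t to a segment index, clamping the last frame.
        idx = min(int(t * num_segments / duration), num_segments - 1)
        bags[idx][word] += 1
    return bags


# Hypothetical (frame, visual word) pairs for two actions that contain the
# same local features in opposite temporal order.
sit_down = [(0, "stand"), (1, "stand"), (2, "bend"),
            (3, "bend"), (4, "sit"), (5, "sit")]
stand_up = [(0, "sit"), (1, "sit"), (2, "bend"),
            (3, "bend"), (4, "stand"), (5, "stand")]

# The orderless histograms are identical, so the two actions are confused.
same_orderless = bag_of_features(sit_down) == bag_of_features(stand_up)

# Keeping one histogram per temporal segment separates the two actions.
same_ordered = (temporally_ordered_bags(sit_down, 3, 6)
                == temporally_ordered_bags(stand_up, 3, 6))

print(same_orderless, same_ordered)
```

In this toy case the orderless representation cannot tell the two sequences apart, while the temporally segmented one can. The segment count of 3 is an arbitrary choice for the example.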