Viewpoint invariant human activity recognition using pose series

2017-03-22T01:24:08Z (GMT) by Htike, Zaw Zaw
There is growing interest in vision-based human activity recognition, motivated by its promising applications in many domains. Because the camera position is arbitrary in many applications, practical human activity recognition systems should be viewpoint invariant. Nevertheless, the viewpoint issue has been neglected by the vast majority of computer vision researchers because of the inherent difficulty of training systems across all possible viewpoints. Several state-of-the-art activity recognition systems claim to be viewpoint invariant. They can be broadly categorized by their sensory requirements: those requiring multiple synchronized cameras and those requiring only a single uncalibrated camera. While multi-camera systems work well, they are often infeasible or impractical in many domains and applications. Current single-camera systems are either too complex to run in real time or require activity training data from multiple views, which is likewise often infeasible or impractical. Therefore, this thesis proposes a novel generic framework that recognizes and classifies human activities in monocular video from arbitrary viewpoints without requiring activity training data from multiple views. The proposed framework comprises two stages: human pose recognition and human activity recognition. In the pose recognition stage, an ensemble of invariant pose models performs inference on each video frame. Each pose model estimates the probability that the given frame contains the corresponding pose. Over a sequence of frames, the outputs of the invariant pose models collectively form a multivariate time series. The activity recognition stage employs time series analysis to classify activities.
The system has been rigorously tested on a number of standard benchmark datasets and has been found to outperform current state-of-the-art systems in both processing speed and classification accuracy. The framework developed in this thesis, as supported by these results, lays the foundation for monocular viewpoint invariant human activity recognition. Moreover, the framework can be extended and tailored to multiple domains and applications with diverse requirements.
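
The two-stage structure described above can be illustrated with a minimal sketch. It is not the implementation from the thesis: the abstract does not say how the pose models are built or which time series method is used, so the toy pose models (`arms_down`, `arms_up`), the scalar "frames", and the choice of dynamic time warping with a nearest-template rule are all assumptions made purely for illustration.

```python
import math

def pose_series(frames, pose_models):
    """Stage 1: each pose model scores every frame, giving one row of
    probabilities per frame, i.e. a multivariate time series."""
    return [[model(frame) for model in pose_models] for frame in frames]

def dtw_distance(a, b):
    """Dynamic time warping distance between two multivariate series,
    using the Euclidean distance between rows as the local cost.
    (DTW is an assumed stand-in for the thesis's time series analysis.)"""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def classify(series, templates):
    """Stage 2: return the label of the template series closest under DTW."""
    return min(templates, key=lambda t: dtw_distance(series, t[1]))[0]

# Hypothetical pose models: a "frame" here is just an arm height in [0, 1].
arms_down = lambda frame: 1.0 - frame
arms_up = lambda frame: frame
models = [arms_down, arms_up]

# Hypothetical activity templates, one short pose series per activity.
templates = [
    ("raise", pose_series([0.0, 0.5, 1.0], models)),
    ("lower", pose_series([1.0, 0.5, 0.0], models)),
]

query = pose_series([0.0, 0.1, 0.4, 0.9, 1.0], models)
label = classify(query, templates)
print(label)
```

In this sketch the activity classifier only ever sees pose probabilities, never pixels, which mirrors the separation the abstract describes: any viewpoint invariance would have to come from the pose models, while the second stage works on their output alone.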