A Component-based video content representation for action recognition
This paper investigates the challenging problem of action recognition in videos and proposes a new component-based approach for video content representation. Although satisfactory performance for action recognition has already been obtained in certain scenarios, many of the existing solutions require fully-annotated video datasets in which the region of the activity in each frame is specified by a bounding box. Another group of methods requires auxiliary techniques to extract human-related areas in the video frames before actions can be accurately recognized. In this paper, a Weakly-Supervised Learning (WSL) framework is introduced that eliminates the need for per-frame annotations and learns video representations that improve recognition accuracy while also highlighting the activity-related regions within each frame. To this end, two new representation ideas are proposed: one focuses on representing the main components of an action, i.e. actionness regions, and the other on encoding the background context to represent general and holistic cues. A three-stream CNN is developed, which takes the two proposed representations and combines them with a motion-encoding stream. Temporal cues in each of the three streams are modeled through an LSTM, and finally fully-connected neural network layers are used to fuse the streams and produce the final video representation. Experimental results on four challenging datasets demonstrate that the proposed Component-based Multi-stream CNN model (CM-CNN), trained in a WSL setting, outperforms the state-of-the-art in action recognition, including fully-supervised approaches.
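The late-fusion idea in the abstract, three stream representations concatenated and passed through fully-connected layers, can be illustrated with a minimal sketch. This is not the paper's implementation: the CNN stream encoders and the LSTM temporal model are replaced here by a placeholder feature extractor and a simple temporal average, and all names (`encode_stream`, `fuse_streams`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_stream(frames, dim=128):
    # Stand-in for a CNN stream encoder plus temporal modeling.
    # The paper uses an LSTM per stream; here we flatten each frame,
    # keep the first `dim` values, and average over time.
    return frames.reshape(frames.shape[0], -1)[:, :dim].mean(axis=0)

def fuse_streams(actionness, context, motion, n_classes=10):
    # Late fusion: concatenate the three stream representations and
    # apply one fully-connected layer followed by a softmax.
    z = np.concatenate([actionness, context, motion])
    w = rng.standard_normal((n_classes, z.size)) * 0.01  # untrained weights
    logits = w @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy input: 16 frames of 32x32 "features" fed to each stream.
frames = rng.standard_normal((16, 32, 32))
probs = fuse_streams(encode_stream(frames),   # actionness-region stream
                     encode_stream(frames),   # background-context stream
                     encode_stream(frames))   # motion-encoding stream
print(probs.shape)  # (10,) -- one probability per action class
```

In the actual model each stream would see a different input (actionness crops, context, and motion encodings respectively), and the fusion weights would be learned end-to-end rather than drawn at random.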
• Presenting an innovative framework for recognizing human actions without the need for any human bounding box annotations.
• The proposed method moves beyond merely classifying video frames and can estimate regions of interest in each frame.
• All action components are identified, instead of finding a single bounding box in each frame.
• A priority-based approach is proposed that learns how to utilize foreground, background, and motion cues in each activity class.
• State-of-the-art results are obtained on four challenging datasets.
© Copyright 2019 Image and Vision Computing. Elsevier. All rights reserved.
| Keywords: | |
|---|---|
| Notations: | Natural Sciences and Technology |
| Tags: | Algorithm, artificial intelligence, deep learning |
| Published in: | Image and Vision Computing |
| Language: | English |
| Published: | 2019 |
| Online access: | https://doi.org/10.1016/j.imavis.2019.08.009 |
| Volume: | 90 |
| Pages: | 103805 |
| Document type: | Article |
| Level: | high |