Activity recognition results on UCF Sports and Hollywood2

The table above shows the results obtained on the UCF Sports dataset (http://crcv.ucf.edu/data/UCF_Sports_Action.php). We report the recognition rate with respect to the number...


Computational efficiency and parallel implementation

The developed algorithms are computationally efficient, and the compositional processing pipeline is well suited for implementation on massively parallel architectures. Many...


Motion hierarchy structure

Our model comprises three processing stages, as shown in the figure. The task of the lowest stage (layers...


L1: motion features

Layer L1 provides the input to the compositional hierarchy. Motion obtained in L0 is encoded using a small dictionary.
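As an illustration of what dictionary-based encoding of motion can look like, the sketch below quantizes per-pixel flow vectors against a tiny dictionary of prototype directions. The four-atom dictionary and function names are assumptions for illustration only, not the dictionary actually used in the model.

```python
import numpy as np

# Hypothetical 4-atom motion dictionary (prototype flow directions).
# The real L1 dictionary is not specified here; this is an illustrative stand-in.
DICTIONARY = np.array([
    [1.0, 0.0],   # rightward motion
    [-1.0, 0.0],  # leftward motion
    [0.0, 1.0],   # upward motion
    [0.0, -1.0],  # downward motion
])

def encode_motion(flow_vectors):
    """Assign each (dx, dy) flow vector the index of its nearest dictionary atom."""
    # Pairwise squared distances: (N, 1, 2) vs (1, K, 2) -> (N, K)
    d = ((flow_vectors[:, None, :] - DICTIONARY[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

flows = np.array([[0.9, 0.1],    # close to "rightward"
                  [-0.2, -1.1]]) # close to "downward"
codes = encode_motion(flows)     # symbolic codes, one per flow vector
```

Encoding motion as a small set of symbols in this way keeps the input alphabet of the hierarchy compact, which is what makes the compositional layers above L1 tractable.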


Computer vision-based motion analysis

Motion analysis in computer vision has been a challenging research domain for decades. It started in the early 1970s, at that time motivated either by psychophysical studies of early vision in humans and primates, or by various object-tracking applications. The decade that followed was largely driven by the invention of optical flow and the motion field, and by active contour models, or snakes. In the 1990s the condensation algorithm emerged, which, together with various feature and distinctive-point detectors, descriptors, and trackers, triggered a renaissance that has prospered over the last decade.

In the last two decades the majority of publications have addressed the problem of human motion analysis, and in the last decade research on action and activity recognition has prevailed. This largely stems from the fact that the task of activity recognition is highly relevant for a variety of applications, such as automatic analysis of surveillance video streams, automatic video indexing, video summarization, human-computer interfaces, performance analysis in sports, medicine, health care, etc. Most of this work has been recognized and documented in the review papers by Aggarwal and Cai [Aggarwal1999], Gavrila [Gavrila1999], Moeslund et al. [Moeslund2006], Poppe [Poppe2007, Poppe2010], Turaga et al. [Turaga2008], and Aggarwal and Ryoo [Aggarwal2010].

Human activity is a spatiotemporal phenomenon, and when observing humans we inevitably deal with complex and cluttered environments, background motion, occlusions, and in many cases also with camera motion and viewpoint changes. Therefore, fully automatic and robust human activity recognition remains an active topic of research. According to the taxonomy in [Aggarwal2010], approaches can be either single-layer (flat) or hierarchical; the latter can be further classified as statistical (e.g. based on Hidden Markov Models), syntactical (e.g. based on stochastic context-free grammars), or descriptive, modeling activities by temporal, spatial, and relational structures. Current state-of-the-art hierarchical approaches use rather high-level primitives to encode the set of activities from a given application domain, but substantial processing is needed to derive these sophisticated primitives from raw video data. A sample of recent publications only, including ours, is [Perše2009, Ryoo2009, Perše2010, Wo2010, Wang2010, Loy2011, Ryoo2011].

Single-layer approaches try to recognize activities directly from video data. A straightforward way to address the problem of human activity recognition is to observe optical flow through a temporal sequence of images [Efros2003], where optical flow in spatiotemporal volumes was used as a descriptor and nearest-neighbor search was used to classify actions. That approach suffers from several drawbacks: it relies on proper localization and segmentation of persons throughout the video sequence, which is not yet a fully solved problem. Nevertheless, a similar approach, based on histograms of optical flow, which translated the motion into a sequence of symbols, was successfully applied to a problem with a more restricted domain, access control in video-based surveillance [Perš2010], where it can distinguish even between very similar activities. [Yealsin2004] addressed this issue by applying nearest-neighbor clustering in combination with Hidden Markov Models to construct temporal signatures for facial expression detection.
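The descriptor-plus-nearest-neighbor idea above can be sketched in a few lines. This is a minimal toy illustration, not the cited implementations: the flow fields, the 8-bin direction histogram, and the class labels are all assumptions made for the example.

```python
import numpy as np

def flow_histogram(flow, bins=8):
    """Summarize (N, 2) optical-flow vectors as a normalized direction histogram."""
    angles = np.arctan2(flow[:, 1], flow[:, 0])   # flow direction in [-pi, pi]
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def nearest_neighbor(query, train_descs, train_labels):
    """Classify a descriptor by its nearest training descriptor (Euclidean)."""
    dists = np.linalg.norm(train_descs - query, axis=1)
    return train_labels[int(dists.argmin())]

# Toy training data: one action with rightward flow, one with leftward flow.
right = np.tile([1.0, 0.0], (20, 1))
left = np.tile([-1.0, 0.0], (20, 1))
train = np.stack([flow_histogram(right), flow_histogram(left)])

# A noisy, mostly-rightward query sequence.
label = nearest_neighbor(flow_histogram(np.tile([0.9, 0.1], (20, 1))),
                         train, ["walk-right", "walk-left"])
```

In a real system the flow would come from a dense optical-flow estimator computed inside a person-centered spatiotemporal volume, which is exactly where the localization and segmentation difficulties noted above enter.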

Another obvious approach is to use the positions of human limbs through time as input to an appropriate classifier. This can be done directly, as in [Yilmaz2005], where the authors used manually annotated points on the human body to recognize actions. Given a set of properly extracted points, such an algorithm can handle moving cameras and different viewpoints by calculating the multi-view geometry between the two actions. Nevertheless, in this approach the problem is simply shifted to the automatic extraction of useful human body points, a task that is difficult and cannot be considered solved. Positions of human limbs can also be used indirectly, as in [Ramanan2004], where the authors tracked persons in two dimensions and tried to match them against motion sequences from a database of annotated models.

More complex approaches to action recognition are in many cases inspired by similar approaches to object recognition. In [Dollar2005] the researchers extended the concept of sparse spatial descriptors (patches) into the spatiotemporal domain. The spatiotemporal sequence of images was treated as a three-dimensional volume from which, using appropriate spatiotemporal detectors, spatiotemporal cuboids were carved. The cuboid descriptors were treated as a bag of features and clustered, and recognition of activities was based on the occurrence of those clusters in the input video. A similar concept can be used with denser spatiotemporal features, as in [Gilbert2008], where an overcomplete set of spatiotemporal corners was extracted from video sequences, spatiotemporally grouped, and labeled according to relative position within a group. Data mining was used to obtain frequent sets, which were then used for action localization and classification. The use of simpler spatiotemporal features enabled a higher degree of robustness and invariance to scale. Similarly, in [Niebles2008] histograms of brightness gradients were used as spatiotemporal descriptors. The approach resulted in a high number of detections, and together with a bag-of-features approach (so-called "video words") and probabilistic latent semantic analysis it yielded a framework that is robust to temporally changing background clutter and camera movements. To achieve this, only unsupervised learning (clustering) is needed. In [Wang2010] the authors also extracted a number of spatiotemporal interest points and defined a stochastic grammar, together with iterative mining, to obtain a discriminative model for action recognition. Recently, [Zhao2007] proposed a powerful descriptor by extending standard local binary patterns [Ojala2002] to account for spatiotemporal information, encoding temporal textures as spatiotemporal histograms. They reported excellent results on standard temporal texture datasets.
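The "video words" representation shared by several of these methods boils down to quantizing cuboid descriptors against a learned vocabulary and histogramming the resulting word occurrences. The sketch below illustrates only that last step; the toy 2-D descriptors and the three-word vocabulary are assumptions for the example (real vocabularies are learned by clustering, e.g. k-means, over thousands of high-dimensional descriptors).

```python
import numpy as np

# Assumed toy vocabulary of 3 "video words" (cluster centers in descriptor space).
vocabulary = np.array([[0.0, 0.0],
                       [1.0, 1.0],
                       [2.0, 0.0]])

def video_word_histogram(cuboid_descriptors):
    """Quantize each cuboid descriptor to its nearest word; return the word histogram."""
    d = np.linalg.norm(cuboid_descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)                       # word index per cuboid
    hist = np.bincount(words, minlength=len(vocabulary))
    return hist / hist.sum()                       # normalized occurrence histogram

# Four cuboid descriptors extracted from one (hypothetical) video clip.
descs = np.array([[0.1, 0.1], [0.9, 1.1], [1.9, 0.1], [0.0, 0.2]])
h = video_word_histogram(descs)
```

The resulting fixed-length histogram is what gets fed to the classifier (or to probabilistic latent semantic analysis in [Niebles2008]), discarding the spatiotemporal layout of the cuboids, which is precisely the limitation the structured approaches above try to address.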

One of the biggest challenges in action recognition is how to achieve recognition that is invariant to the most common variations in input image sequences. In [Junejo2008] the authors presented an action recognition algorithm that relies on self-similarities in human activities. They use a self-similarity matrix, which can be computed from the positions of human body parts or from lower-level image descriptors, such as HOG descriptors or optical flow. Self-similarity of human activities is preserved across camera views, and therefore the algorithm is robust to significant viewpoint changes.
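A self-similarity matrix is simple to compute once per-frame descriptors are available: entry (i, j) is the distance between the descriptors of frames i and j. The sketch below is a minimal illustration; the 2-D per-frame descriptors are placeholders for the HOG or flow descriptors mentioned above.

```python
import numpy as np

def self_similarity_matrix(frame_descriptors):
    """SSM[i, j] = Euclidean distance between descriptors of frames i and j."""
    diff = frame_descriptors[:, None, :] - frame_descriptors[None, :, :]
    return np.linalg.norm(diff, axis=2)

# Toy per-frame descriptors: frames 0 and 2 are identical (e.g. a repeated pose).
frames = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 0.0]])
ssm = self_similarity_matrix(frames)
```

The matrix is symmetric with a zero diagonal, and repeated poses show up as off-diagonal zeros; it is the pattern of this matrix, rather than the raw descriptors, that stays stable under viewpoint change, which is what [Junejo2008] exploit.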
