Visual Saliency

Example frames of energy volumes computed by our frontend on the Lord of the Rings (Clip 1) video from our Eye-Tracking Movie Database (ETMD). The galloping horse is clearly detected by the luminance STDE (spatio-temporal dominant energy).

Visual attention is a mechanism employed by biological vision systems to select the most salient spatio-temporal regions of a visual stimulus.

The development of computational frameworks that model visual attention is critical for designing human-computer interaction systems, since such frameworks can select only the most important regions from a large amount of visual data before more complex and demanding processing is performed. Attention models can be used directly for movie summarization, by producing video skims, or can constitute a visual frontend for many other applications, such as object and action recognition. Predicting eye fixations over different stimuli is a widely used procedure for analyzing and evaluating visual attention models. Although many databases with eye-tracking data are available, most of them contain only static images, since the first saliency models were based solely on static cues.

Our proposed visual saliency frontend is based on both low-level features, such as intensity, color, and motion, and mid-level cues, namely face detection, and provides a single saliency volume map. The overall process can be seen in Fig. 2.

Additionally, in our research we describe different approaches for predicting eye fixations in movie videos and give preliminary results from our computational framework for visual saliency estimation. From the database analysis we observed that in many cases viewers focused on actors' faces, while their fixations also tended to be center-biased. To model these two effects we used two simple methods (a minimal sketch follows the list):

  • A Gaussian kernel of varying standard deviation centered at the image center
  • The Viola-Jones face detector as a saliency estimator, applied only in frames where faces are present
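
The snippet below is a minimal sketch of these two baselines in Python, using OpenCV's stock Haar cascade for the Viola-Jones detector. The function names and the choice of standard deviation are illustrative assumptions, not the exact settings used in our experiments.

```python
# Minimal sketch: center-bias Gaussian map + Viola-Jones face saliency.
import cv2
import numpy as np

def center_bias_map(height, width, sigma_frac=0.25):
    """Gaussian kernel centered at the image center; the standard deviation
    (here a fraction of the image size) can be varied per experiment."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = sigma_frac * height, sigma_frac * width
    g = np.exp(-0.5 * (((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2))
    return g / g.max()

def face_saliency_map(frame_bgr, cascade):
    """Mark detected face regions as salient; frames without a detection
    yield an all-zero map, i.e., the face cue is used only where faces exist."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    saliency = np.zeros(gray.shape, dtype=np.float32)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 4):
        saliency[y:y + h, x:x + w] = 1.0
    return saliency

# Usage: combine the two cues for one frame (max fusion as an example).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# frame = ...  # a BGR frame from the clip
# combined = np.maximum(center_bias_map(*frame.shape[:2]),
#                       face_saliency_map(frame, cascade))
```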

We have also tried to predict one viewer's fixations from the other viewers' eye fixations, as a reference result.
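
As a rough illustration of this leave-one-viewer-out baseline, the sketch below builds a Gaussian-smoothed fixation density map from the remaining viewers and scores it on the held-out viewer's fixations. The function names, the smoothing sigma, and the use of the NSS metric are assumptions made for illustration.

```python
# Leave-one-viewer-out baseline: other viewers' fixations predict one viewer.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density(fix_points, shape, sigma=20.0):
    """Gaussian-smoothed histogram of (row, col) fixation points."""
    density = np.zeros(shape, dtype=np.float64)
    for r, c in fix_points:
        density[int(r), int(c)] += 1.0
    return gaussian_filter(density, sigma)

def nss(saliency, fix_points):
    """Normalized Scanpath Saliency: mean standardized saliency value
    at the held-out viewer's fixation locations."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(np.mean([s[int(r), int(c)] for r, c in fix_points]))

# Usage: predict viewer i from all other viewers' fixations on one frame.
# prediction = fixation_density(sum(other_viewers, []), frame_shape)
# score = nss(prediction, held_out_viewer)
```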

The existing eye-tracked video databases in most cases contain very short videos with simple semantic content. In our effort to deal with more complex problems, such as movie summarization, we have developed a database with eye-tracking human annotation, which comprises video clips from Hollywood movies that are longer in duration and include more complex semantics. Figure 3 shows example frames of our model's energies computed on the video Lord of the Rings (Clip 1) from our new Eye-Tracking Movie Database (ETMD) (for more information see the Datasets section).

In the first phase the initial RGB video volume is transformed into the Lab color space and split into two streams: luminance and color contrast. For the luminance channel we apply spatio-temporal Gabor filtering, followed by Dominant Energy Selection, while for the color contrast stream we apply a simple lowpass 3D Gaussian filter followed by a center-surround difference. For integrating the results from the Viola-Jones face detector into a final visual attention map we can use either a maximum or a variance-based fusion method.
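
The sketch below condenses this pipeline: Lab conversion, a spatio-temporal Gabor energy for the luminance stream, a lowpass 3D Gaussian plus center-surround difference for the color-contrast stream, and maximum fusion. A single separable quadrature pair stands in for the full filterbank with Dominant Energy Selection, and all parameter values are illustrative assumptions.

```python
# Condensed sketch of the saliency frontend (single filter, not the full bank).
import cv2
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

def gabor_pair(sigma, freq):
    """Quadrature pair (even/odd phase) of 1D Gabor kernels."""
    half = int(np.ceil(3 * sigma))
    t = np.arange(-half, half + 1)
    env = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    return env * np.cos(2 * np.pi * freq * t), env * np.sin(2 * np.pi * freq * t)

def luminance_energy(lum, sigma=2.0, freq=0.25):
    """Spatio-temporal Gabor energy on a [t, y, x] luminance volume:
    filter along each axis with the quadrature pair, sum squared responses."""
    even, odd = gabor_pair(sigma, freq)
    energy = np.zeros_like(lum)
    for axis in range(3):
        e = convolve1d(lum, even, axis=axis, mode="nearest")
        o = convolve1d(lum, odd, axis=axis, mode="nearest")
        energy += e ** 2 + o ** 2
    return energy

def color_contrast(chan, fine=2.0, coarse=8.0):
    """Lowpass 3D Gaussians at two scales, then center-surround difference."""
    return np.abs(gaussian_filter(chan, fine) - gaussian_filter(chan, coarse))

def saliency_volume(frames_bgr):
    """RGB video -> Lab -> luminance/color streams -> max-fused saliency."""
    lab = np.stack([cv2.cvtColor(f, cv2.COLOR_BGR2LAB).astype(np.float64)
                    for f in frames_bgr])            # [t, y, x, 3]
    lum = luminance_energy(lab[..., 0])
    col = color_contrast(lab[..., 1]) + color_contrast(lab[..., 2])
    norm = lambda v: v / (v.max() + 1e-8)
    # The face-detection stream could be fused here in the same way
    # (maximum fusion); a variance-based weighting is the alternative.
    return np.maximum(norm(lum), norm(col))
```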

For more information on Visual Saliency please see Evangelopoulos et al. and:

P. Koutras, A. Katsamanis, and P. Maragos, "Predicting Eyes' Fixations in Movie Videos: Visual Saliency Experiments on a New Eye-Tracking Database," in D. Harris (Ed.): EPCE 2014, LNAI 8532, pp. 183-194, Springer International Publishing, Switzerland, 2014.

Action Recognition

Action recognition algorithms are used to extract semantic information from the video stream.

For more information on Action Recognition please see Georgakis et al.