



One of the main objectives of this research is to evaluate how accurately the developed cognitive and computational models detect salient and important events. Event detection and summarization algorithms can be significantly improved when there is adequate data for training, adaptation and evaluation of their parameters. Detected events are usually evaluated by measuring the similarity or correlation between the system-detected observations and ground-truth data (annotated reference event observations) selected by experienced or trained users.
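As a rough illustration of the correlation-based comparison mentioned above, the sketch below computes the Pearson correlation between a per-frame system saliency score and a per-frame reference annotation. The function name `pearson_correlation`, the example values, and the choice of Pearson correlation itself are assumptions made for illustration; the text does not state which similarity measure was used.

```python
import math
from typing import Sequence


def pearson_correlation(detected: Sequence[float], reference: Sequence[float]) -> float:
    """Pearson correlation between two equally long per-frame sequences.

    Hypothetical helper: `detected` stands for system saliency scores and
    `reference` for ground-truth annotation values (e.g. 0/1 per frame).
    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    if len(detected) != len(reference) or len(detected) < 2:
        raise ValueError("sequences must have the same length (at least 2)")

    n = len(detected)
    mean_d = sum(detected) / n
    mean_r = sum(reference) / n

    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((d - mean_d) * (r - mean_r) for d, r in zip(detected, reference))
    var_d = sum((d - mean_d) ** 2 for d in detected)
    var_r = sum((r - mean_r) ** 2 for r in reference)

    if var_d == 0 or var_r == 0:
        return 0.0
    return cov / math.sqrt(var_d * var_r)


# Illustrative values only, not taken from the database.
system_scores = [0.1, 0.8, 0.7, 0.2]
annotated_salient = [0, 1, 1, 0]
score = pearson_correlation(system_scores, annotated_salient)
```

A value close to 1 would indicate that the system's saliency curve rises and falls together with the annotation, while a value near 0 would indicate no linear relationship.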

Specifically, the first version of the dataset consists of half-hour continuous segments from seven movies (three and a half hours in total), namely: "A Beautiful Mind" (BMI), "Chicago" (CHI), "Crash" (CRA), "The Departed" (DEP), "Gladiator" (GLA), "The Lord of the Rings: The Return of the King" (LOR) and the animated movie "Finding Nemo" (FNE). Oscar-winning movies from various film genres (drama, musical, action, epic, fantasy, animation) were selected to form a systematic, genre-independent database of acclaimed, high-production-quality videos.

The database has been annotated with respect to saliency, specifically monomodal and multimodal saliency, and with semantic annotation, including scene and shot segmentation. For this purpose we used the Anvil video annotation interface, a free video annotation and labeling research tool that offers frame-accurate, hierarchical, multi-layered annotation driven by user-defined annotation schemes.

Data samples for saliency annotations and multimodal features can be found here.

Eye-Tracking Movie Database (ETMD)

We have developed a new eye-tracking movie database (ETMDatabase) comprising video clips from the SEDM Hollywood movie database, which we have enriched with human eye-tracking annotation. Specifically, we cut two short video clips (about 3-3.5 minutes each) from six of the seven movies (CHI, CRA, DEP, FNE, GLA and LOR). We tried to include scenes with high motion and action as well as dialogues. These clips were annotated with eye-tracking data from 10 different people. The volunteers viewed the videos both in grayscale and in color, while an eye-tracking system recorded their eye fixations on the screen.

For the annotation we used the commercial eye-tracking system TM3 provided by EyeTechDS. This device uses a camera with infrared light and provides real-time continuous gaze estimation, defined as fixation points on the screen. The tracker's rate was limited to the video frame rate in order to have one fixation point pair per frame. For the problem of visual attention, a combined fixation (a weighted average of the two eye fixations) is provided: it is the mean of the two if both eyes are found by the eye tracker, or the detected eye's fixation if only one is found. If neither eye is detected, or the fixations lie outside the screen boundaries, the fixation is set to zero. The eye-tracking system also provides some additional measurements, such as pupil and glint positions and pupil diameter.

Fig. 8. Examples of the fixation points at frame no. 500 for each of the 12 movie clips. Green + marks show the fixation points for the color version of each clip, and red * marks show the points for the grayscale version.
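The rule for combining the two eye fixations into one point per frame can be sketched as follows. This is a minimal illustration and not the tracker's own software: the function name `combine_fixations`, the use of `None` for an undetected eye, pixel coordinates with the origin at the top-left corner, the `(0.0, 0.0)` encoding of the "zero value", and applying the screen-boundary check to the combined point are all assumptions.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def combine_fixations(
    left: Optional[Point],
    right: Optional[Point],
    width: int,
    height: int,
) -> Point:
    """Combine per-eye fixations into a single fixation for one frame.

    Assumptions (not stated in the text): an undetected eye is passed as
    None, coordinates are pixels with origin at the top-left corner, and
    an invalid fixation is encoded as (0.0, 0.0).
    """
    detected = [p for p in (left, right) if p is not None]

    # Neither eye detected: the fixation gets a zero value.
    if not detected:
        return (0.0, 0.0)

    # Mean of both eyes, or the single detected eye's fixation.
    x = sum(p[0] for p in detected) / len(detected)
    y = sum(p[1] for p in detected) / len(detected)

    # Fixation outside the screen boundaries: zero value as well.
    if not (0 <= x < width and 0 <= y < height):
        return (0.0, 0.0)
    return (x, y)


# Illustrative calls on a hypothetical 1920x1080 screen.
both_eyes = combine_fixations((100, 200), (110, 210), 1920, 1080)
one_eye = combine_fixations(None, (50, 60), 1920, 1080)
no_eyes = combine_fixations(None, None, 1920, 1080)
```

One consequence of this encoding is that a genuine fixation at the exact screen origin cannot be distinguished from an invalid frame, so downstream code would typically treat zero-valued frames as missing data.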

For more information on the ETMDatabase, please see: P. Koutras, A. Katsamanis, and P. Maragos, Predicting Eyes' Fixations in Movie Videos: Visual Saliency Experiments on a New Eye-Tracking Database, in D. Harris (Ed.): EPCE 2014, LNAI 8532, pp. 183-194, Springer International Publishing Switzerland, 2014.

Cross-modal Semantic Labelling (COSMOROE)

We also labeled the semantics of the audio-visual-text streams, especially as they pertain to the cross-modal integration of semantics. For this purpose we used the COSMOROE framework and investigated agreement and disagreement across modalities. Computational models comparing the performance of various fusion schemes have also been evaluated. Finally, we have started to work on top-down semantic integration models (plot) inspired by relevant work in the literature on narratives.

The corpora used include audiovisual data from movie films and TV travel documentaries. Specifically, two movie films and 15 TV travel documentaries (Greek and English) were selected and annotated/evaluated.

Processing steps for the COSMOROE labeling:

  • Spoken language transcription.
  • Acoustic events annotation.
  • Completion and evaluation of the already annotated acoustic events in TV travel documentaries.
  • Annotation of linguistic, visual and gesture/motion information.
  • Intermodal/crossmodal correlation of the annotated events.

For more information on the COSMOROE framework, please see: K. Pastra, COSMOROE: A Cross-Media Relations Framework for Modelling Multimedia Dialectics, Multimedia Systems Journal, vol. 14(5), pp. 299-323, Springer Verlag, 2008.