The basic problems here are to extract robust multicue features from the audio, visual and text streams, compute their unimodal instantaneous saliencies, fuse them to estimate a multimodal saliency (a step sketched below), incorporate spatiotemporal synchrony cues, and group instantaneous salient keyframes into perceptual events. WP1 includes the following tasks:
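A minimal sketch of the fusion and keyframe-selection steps described above, assuming frame-aligned unimodal saliency curves are already available; the linear weights, normalization and threshold are illustrative placeholders, not the project's actual fusion scheme:

```python
import numpy as np

def fuse_saliency(audio_sal, visual_sal, text_sal, weights=(0.4, 0.4, 0.2)):
    """Weighted linear fusion of frame-aligned unimodal saliency curves."""
    curves = np.vstack([audio_sal, visual_sal, text_sal])
    # Normalize each modality to [0, 1] so no single cue dominates by scale.
    mins = curves.min(axis=1, keepdims=True)
    span = curves.max(axis=1, keepdims=True) - mins
    curves = (curves - mins) / np.where(span > 0, span, 1.0)
    return np.average(curves, axis=0, weights=weights)

def salient_keyframes(multimodal_sal, threshold=0.6):
    """Indices of frames whose fused saliency exceeds a threshold."""
    return np.flatnonzero(multimodal_sal > threshold)

# Toy example: three random saliency curves over 10 frames.
rng = np.random.default_rng(0)
a, v, t = rng.random(10), rng.random(10), rng.random(10)
print(salient_keyframes(fuse_saliency(a, v, t)))
```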
In this WP, we extract selected snippets of semantic information, such as concepts and actions, from the audio, visual and text modalities. We then identify the correspondence between actions in the visual and audio/text streams, i.e., perform cross-modal labeling, using the COSMOROE framework and statistical modeling. Finally, we integrate these actions into events in time and compute their relative significance (semantic saliency) using event segmentation and machine learning algorithms.
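As a minimal sketch of the cross-modal labeling idea, the fragment below pairs visual and audio/text action annotations by temporal overlap; the function names, interval representation and IoU threshold are assumptions for illustration, and the actual COSMOROE-based statistical modeling captures much richer semantic relations than simple overlap:

```python
def interval_iou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def cross_modal_matches(visual_actions, text_actions, min_iou=0.3):
    """Pair visual and audio/text action annotations that overlap strongly in time.

    Each annotation is (label, start, end); returns (visual_label, text_label) pairs.
    """
    matches = []
    for v_label, v_start, v_end in visual_actions:
        for t_label, t_start, t_end in text_actions:
            if interval_iou((v_start, v_end), (t_start, t_end)) >= min_iou:
                matches.append((v_label, t_label))
    return matches

print(cross_modal_matches([("open-door", 1.0, 2.5)], [("opens the door", 0.8, 2.2)]))
```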
The sensory-semantic integration requires fusing two different continuous modalities (audio and vision) with discrete language symbols and semantics extracted from text. Audiovisual fusion yields perceptual micro-events (on the time scale of a few keyframes); the overarching goal is to group these, together with the discrete linguistic and semantic saliency, into stable meso-events (on the time scale of video shots) via a conscious attention process operating over a longer time window. This can be viewed as a framework of heterogeneous hierarchical control. The research in this workpackage is divided into the following tasks.
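A minimal sketch of the grouping step, assuming shot boundaries and micro-event saliencies are already given; assigning micro-events to shots by timestamp and averaging their saliencies is only an illustrative stand-in for the longer-window attention process described above:

```python
from bisect import bisect_right

def group_into_meso_events(micro_events, shot_starts):
    """Group (time, saliency) micro-events into shot-level meso-events.

    shot_starts: sorted shot start times in seconds. Each micro-event is assigned
    to the shot whose interval contains its timestamp, and the shot's meso-event
    saliency is the mean of its micro-event saliencies.
    """
    groups = {}
    for time, saliency in micro_events:
        shot_idx = bisect_right(shot_starts, time) - 1
        groups.setdefault(shot_idx, []).append(saliency)
    return {idx: sum(vals) / len(vals) for idx, vals in groups.items()}

micro = [(0.5, 0.9), (1.2, 0.4), (5.3, 0.7)]   # (time in s, saliency)
shots = [0.0, 4.0, 8.0]                        # shot start times in s
print(group_into_meso_events(micro, shots))
```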
WP4 will provide the test-beds and showcases for testing the computational, perceptual and cognitive ideas explored within the objectives of WPs 1, 2 and 3, integrating the multiple levels of information processing. The selected showcases share the need for cross-integration of multiple modalities and for anthropocentric applications. Today, the vast majority of multimedia content does not come with ratings or semantic annotation. Despite standardization efforts such as MPEG-7, it is predicted that in the near future users will both consume large amounts of multimedia content and produce/gather huge volumes of video through their digital cameras, yet will not have the time to structure and label these data. A technical challenge and exciting application of multimodal analysis is the automatic summarization of video content. Summaries provide the user with a short version of the video that ideally contains the most important information for understanding its content. At the applications level, COGNIMUSE targets the development of integrated computational-cognitive saliency and adaptive attention models for exploiting structured audio-visual-text content in two different video domains: (i) movies and (ii) TV documentaries or news.
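A minimal sketch of saliency-driven summarization, assuming per-shot saliency scores are already available from the earlier workpackages; the greedy duration-budget heuristic and the function names are illustrative assumptions, not the project's summarizer:

```python
def summarize(shots, budget_seconds):
    """Greedy saliency-based summary: pick the most salient shots within a time budget.

    shots: list of (start, end, saliency); returns the selected shots in playback order.
    """
    selected, used = [], 0.0
    for shot in sorted(shots, key=lambda s: s[2], reverse=True):
        duration = shot[1] - shot[0]
        if used + duration <= budget_seconds:
            selected.append(shot)
            used += duration
    return sorted(selected, key=lambda s: s[0])

shots = [(0, 10, 0.2), (10, 25, 0.9), (25, 40, 0.6), (40, 55, 0.8)]
print(summarize(shots, budget_seconds=30))   # keeps the two most salient shots
```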