To estimate word-level saliency scores in multimedia data, we first need to annotate or automatically recognize the spoken language information in the audio stream. In addition, the (automatically or manually annotated) transcripts have to be time-aligned with the audio stream. In this work, we utilize the annotation available in the subtitles of commercial video streams, although the proposed approach can also be applied to the output of an automatic speech recognizer.
- Audio Segmentation using Forced Alignment
Although subtitles provided with commercially released video material are roughly time-aligned with the audio stream, the synchronization is not perfect. To correct time-stamp bias and achieve accurate word-level alignment, we perform forced segmentation on the audio stream using the speech transcript and phone-based acoustic models, i.e., an automatic speech recognition (ASR) system.
- Syntactic Text Tagging
Next, the time-aligned transcripts are analyzed using a shallow syntactic parser that mainly performs part-of-speech (POS) tagging. Text saliency scores are assigned to each word based on the POS tag of that word. The motivation behind this approach is the well-known fact that, on average, some parts of speech convey more information than others.
The most salient POS tags are proper nouns, followed by nouns, noun phrases and adjectives. Verbs can specify semantic restrictions on their pre- and post-arguments, which usually belong to the aforementioned classes. Finally, there is a list of words (often referred to as stop-words) that have very little semantic content.
Fig. 6. POS tagger output and assigned weights for two example sentences from "Lord of the Rings I". Note how proper nouns (PN), e.g., Sauron and Mordor, are very salient and are assigned a score of 1, common nouns (NN) a score of 0.7, noun phrases (NP) and verbs (VBZ, VVG) a score of 0.5, while stop-words (IN) are assigned a score of 0.2.
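The tag-to-weight assignment described above can be illustrated with a minimal sketch. The weights are the ones listed in the caption of Fig. 6; the function name `word_saliency`, the fallback weight for tags not listed in the caption, and the example tagger output are assumptions made for illustration only, not part of the original system.

```python
# Weights as listed in the Fig. 6 caption; tag names follow the caption.
POS_WEIGHTS = {
    "PN": 1.0,   # proper nouns
    "NN": 0.7,   # common nouns
    "NP": 0.5,   # noun phrases
    "VBZ": 0.5,  # verbs
    "VVG": 0.5,  # verbs
    "IN": 0.2,   # stop-words
}

# Assumption: any tag not listed in the caption gets the stop-word weight.
DEFAULT_WEIGHT = 0.2


def word_saliency(tagged_words):
    """Map (word, pos_tag) pairs to (word, saliency_weight) pairs."""
    return [(word, POS_WEIGHTS.get(tag, DEFAULT_WEIGHT))
            for word, tag in tagged_words]


# Hypothetical tagger output, for illustration only.
tagged = [("Sauron", "PN"), ("in", "IN"), ("Mordor", "PN")]
print(word_saliency(tagged))
```

A plain lookup table keeps the scoring transparent and makes it easy to adjust the weight of an individual tag.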
- Text Saliency Estimation
Based on the assignment of frames to words from the forced segmentation procedure and the word saliency scores assigned by the POS tagger, a text saliency value is computed at each frame, yielding a text saliency curve.
For more information on text saliency, see Evangelopoulos et al.
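One way to picture the frame-level computation is the sketch below. It assumes each aligned word is given as a (start time, end time, saliency score) triple, that word boundaries are rounded to the nearest frame, that frames not covered by any word (silence, music) receive a saliency of 0, and that overlapping words resolve to the larger score. These conventions, the function name `text_saliency_curve`, and the default frame rate are illustrative assumptions rather than details taken from the text.

```python
def text_saliency_curve(aligned_words, num_frames, frame_rate=25.0):
    """Build a per-frame text saliency curve.

    aligned_words: iterable of (start_sec, end_sec, score) triples,
                   e.g. word boundaries from forced alignment combined
                   with POS-based word scores.
    num_frames:    total number of frames in the stream.
    frame_rate:    frames per second (assumed value, not from the text).
    """
    # Assumption: frames without any spoken word have zero text saliency.
    curve = [0.0] * num_frames
    for start, end, score in aligned_words:
        # Round word boundaries to the nearest frame index and clamp them.
        first = max(0, round(start * frame_rate))
        last = min(num_frames, round(end * frame_rate))
        for frame in range(first, last):
            # Assumption: if two words overlap, keep the larger score.
            curve[frame] = max(curve[frame], score)
    return curve


# Hypothetical alignment: three words with a short pause before the last.
words = [(0.0, 0.2, 1.0), (0.2, 0.5, 0.2), (0.7, 1.0, 0.7)]
curve = text_saliency_curve(words, num_frames=10, frame_rate=10.0)
print(curve)
```

The resulting curve is piecewise constant: every frame inside a word carries that word's score, so the curve has the same frame resolution as any other frame-level feature stream.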