Audio Saliency

Audio stream waveform (left-top) with audio saliency annotation and spectrogram (left-bottom) using the employed audio analysis parameters (15-ms windows, 7.5-ms overlap). Horizontal lines denote the filterbank (25 filters, 400-Hz bandwidth) central frequencies, i.e., in (3). Dominant modulation features: energy (solid line) and amplitude (dashed line, right-top); frequency (solid line) and frequency of the dominant filter (black dots) (right-bottom). Audio data are 300 frames (12 s) from the film "Chicago", containing music, singing, and dialogue.

Attentional selection is a cognitive mechanism employed by humans and animals for parsing, structuring, and organizing perceptual stimuli. Attention operates in two modes, top-down (task-driven) and bottom-up (stimulus-driven), which control the gating of the processed information (input filtering) and the selective access to neural mechanisms (capacity limitation), for example, working memory. Bottom-up attention, or saliency, is based on the sensory cues of a stimulus captured by its signal-level properties, such as spatial, temporal, and spectral contrast, complexity, or scale.

Saliency defines something that is striking about a specific feature. For example, a frequency tone may be acoustically salient, a voice can be perceivable among environmental sounds, and an audiovisual scene can be biased towards either of the two signals. Feature saliency is the property of a feature to dominate the signal representation while preserving information about the stimulus.

Attention towards such salient events is triggered by changes or contrast in object appearance (texture and shape), motion activity and scene properties (visual events), changes in audio sources, textures or tempo (aural events), and, when available, the relevant transcribed dialogues or spoken narrations (textual events).

We approach saliency computation in an audio stream as a problem of assigning a measure of interest to audio frames, based on spectro-temporal cues.
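As a rough illustration of assigning a measure of interest to audio frames, the sketch below scores each short-time frame by its mean Teager-Kaiser energy and rescales the scores to [0, 1]. This is a simplified stand-in and not the method of the referenced papers: it uses a single broadband energy cue and omits the filterbank and the dominant modulation features (energy, amplitude, frequency) shown in the figure. The function names, the 16-kHz sampling rate, and the synthetic test signal are assumptions made for this example; only the 15-ms window and 7.5-ms overlap come from the caption above.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def frame_saliency(x, fs, win_s=0.015, overlap_s=0.0075):
    """Hypothetical per-frame interest score in [0, 1] (illustrative only).

    Frames are win_s long and overlap by overlap_s; each frame is scored
    by the mean absolute Teager-Kaiser energy of its samples.
    """
    win = int(round(win_s * fs))
    hop = win - int(round(overlap_s * fs))
    if win < 1 or hop < 1 or len(x) < win:
        raise ValueError("signal too short or invalid window/overlap")
    psi = np.abs(teager_energy(x))
    n_frames = 1 + (len(psi) - win) // hop
    energy = np.array([psi[i * hop:i * hop + win].mean() for i in range(n_frames)])
    span = energy.max() - energy.min()
    # Min-max normalization; a constant-energy signal gets an all-zero score.
    return (energy - energy.min()) / span if span > 0 else np.zeros_like(energy)

# Synthetic example (assumed data): faint noise with a loud 2-kHz tone burst.
fs = 16000                                   # assumed sampling rate in Hz
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(fs)           # 1 s of low-level noise
start, stop = 2 * fs // 5, 3 * fs // 5       # burst from 0.4 s to 0.6 s
n = np.arange(start, stop)
x[start:stop] += np.sin(2 * np.pi * 2000 * n / fs)

saliency = frame_saliency(x, fs)
peak_frame = int(np.argmax(saliency))
peak_time = peak_frame * 0.0075              # frame start time in seconds
```

In this toy signal the highest-scoring frame falls inside the tone burst, which is the behavior one would expect from an energy-based cue. A fuller treatment would compute such cues per filterbank channel and track the dominant channel over time.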

For more information on audio saliency, see Evangelopoulos et al. and Zlatintsi et al.