Attentional selection is a cognitive mechanism employed by humans and animals for parsing, structuring, and organizing perceptual stimuli. Attention operates in two modes, top-down (task-driven) and bottom-up (stimulus-driven), which control the gating of the processed information (input filtering) and the selective access to neural mechanisms (capacity limitation) such as working memory. Bottom-up attention, or saliency, is based on the sensory cues of a stimulus as captured by its signal-level properties, such as spatial, temporal, and spectral contrast, complexity, or scale.
Saliency defines something that is striking about a specific feature. For example, a frequency tone may be acoustically salient, a voice can be perceivable among environmental sounds, and an audiovisual scene can be biased towards either of the two signals. Feature saliency is the property of a feature to dominate the signal representation while preserving information about the stimulus.
Attention towards such salient events is triggered by changes or contrast in object appearance (texture and shape), motion activity and scene properties (visual events), changes in audio sources, textures, or tempo (aural events), and, when available, the relevant transcribed dialogues or spoken narrations (textual events).
We approach saliency computation in an audio stream as a problem of assigning a measure of interest to audio frames, based on spectro-temporal cues.
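To make the idea of a per-frame measure of interest concrete, the following is a minimal sketch and not the method of the cited references. It assumes two generic spectro-temporal cues, spectral flux (frame-to-frame spectral change) and frame energy, and combines them with equal weights. The function name `frame_saliency`, the frame and hop sizes, and the weighting are all illustrative assumptions.

```python
import numpy as np

def frame_saliency(signal, frame_len=512, hop=256):
    """Illustrative saliency score in [0, 1] for each audio frame.

    Assumption: saliency is approximated by equal-weight spectral flux
    and frame energy. This is a generic stand-in, not a published model.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - frame_len) // hop

    # Slice the signal into overlapping frames and apply a Hann window.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # Spectral cue: magnitude spectrum of every frame.
    mag = np.abs(np.fft.rfft(frames, axis=1))

    # Temporal cue: spectral flux, the positive spectral change
    # relative to the previous frame (zero for the first frame).
    diff = np.diff(mag, axis=0, prepend=mag[:1])
    flux = np.sqrt((np.maximum(diff, 0.0) ** 2).sum(axis=1))

    # Frame energy of the windowed signal.
    energy = (frames ** 2).sum(axis=1)

    def _norm(x):
        # Min-max normalization; a constant cue maps to all zeros.
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return 0.5 * _norm(flux) + 0.5 * _norm(energy)

# Usage: one second of silence at 8 kHz with a short 440 Hz tone burst.
sr = 8000
t = np.arange(sr) / sr
x = np.zeros(sr)
x[4000:4800] = np.sin(2 * np.pi * 440 * t[4000:4800])
scores = frame_saliency(x)
peak_frame = int(np.argmax(scores))
```

In this toy signal the silent frames score zero and the highest score falls on a frame overlapping the tone burst, which is the behavior one would expect from a contrast-driven measure. Other cues or weightings could be substituted without changing the overall structure.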
For more information on audio saliency, see Evangelopoulos et al. and Zlatintsi et al.