Neurocognitive Mechanisms of Multisensory Perception

Researchers: Tobias Andersen, Toni Auranen, Vasily Klucharev, Christina Krause*, Riikka Möttönen, Ville Ojanen, Mikko Sams, Kaisa Tiippana, Jyrki Tuomainen
* Presently in the Department of Cognitive Science, University of Helsinki

Perception of the world is inherently multisensory: we perceive objects simultaneously with multiple senses. Many objects around us can be both heard and seen, and some can even be felt. A crucial question is how the various unimodal features and their combinations are bound into unitary multisensory percepts. Integrating the sensory information obtained via different modalities makes us more sensitive to even tiny changes in the environment and improves object identification.

One of our main interests is to study the neurocognitive mechanisms of audiovisual speech perception, i.e. to determine how seeing the talking face affects speech perception. Computational models are also being developed and tested to account for the results. The research will enhance understanding of the basic mechanisms of speech perception, and will have applications in communication technology.

In a typical experiment, an observer is presented with audiovisual speech, i.e. speech sounds through headphones or loudspeakers together with a talking face on a computer screen, and the observer reports what he or she heard. One of the main research tools is the McGurk effect. For example, an auditory syllable /pa/ is dubbed onto a visual syllable /ka/. This kind of combination is typically perceived as /ta/ or /ka/, and very infrequently as the original auditory /pa/. The McGurk effect is used to study audiovisual integration since its strength is related to the extent of visual influence in speech perception. To enhance the effect of visual speech, the quality of the acoustic signal is often degraded with noise. Our neurophysiological studies focus mainly on disclosing the neural basis of auditory and visual speech perception. In particular, we aim at elucidating the spatio-temporal orchestration of acoustic and visual speech feature integration in the human brain. The research is described in more detail below.

Attentional influences on audiovisual speech perception. We studied the role of visual attention in audiovisual speech perception by measuring the McGurk effect in two conditions. In the baseline condition, attention was focused on the talking face. In the distracted attention condition, subjects ignored the face and attended to a visual distractor, which was a leaf moving across the face. The McGurk effect was weaker in the latter condition, indicating that visual attention modulated audiovisual speech perception. This modulation may occur at an early, unisensory processing stage, or it may be due to changes at the stage where auditory and visual information is integrated. We investigated this issue by conventional statistical testing, and by fitting Massaro's Fuzzy Logical Model of Perception to the results. The two methods suggested different interpretations, revealing a paradox in the current methods of analysis. See Figure 35 for experimental results.

In another experiment we investigated the role of spatial attention in audiovisual speech perception. When two faces are present - one uttering /k/ and the other /t/ - our auditory perception of an acoustic /p/ changes between /k/ and /t/ as visual attention (not gaze direction) shifts between the two faces. We also monitored eye movements to ensure that subjects fixated between the facial stimuli.


Figure 35: The McGurk effect was stronger when the face was attended (fewer auditory /p/ responses, purple bars). When attention was directed at a leaf floating across the face, the McGurk effect was weaker (more auditory /p/ responses, green bars).

Audiovisual speech perception in children
Young children show less visual influence in speech perception than adults, and we have shown that children reach adult performance between 8 and 14 years of age. We have also investigated audiovisual speech perception in children with learning problems who have deficits in the categorization of auditory speech and in phonological awareness. These children seem to be relatively more influenced by visual speech in audiovisual conditions, even though their lipreading skills may be worse than those of normally learning children. This research is conducted in collaboration with the Auditory Neuroscience Laboratory at Northwestern University, Illinois, USA.

Auditorily and visually induced illusions
Information processing in the auditory and visual modalities interacts in many circumstances. Spatially and temporally coincident acoustic and visual information are often bound together to form multisensory percepts. Shams and coworkers recently reported a multisensory fission illusion in which a single flash is perceived as two flashes when two rapid tone beeps are presented concurrently. The absence of a fusion illusion, in which two flashes would fuse into one when accompanied by a single tone beep, indicated a perceptual rather than cognitive nature of the illusion. We found both fusion and fission illusions using stimuli very similar to those used by Shams et al. By directing subjects' attention to the auditory modality and decreasing the sound intensity to near threshold, we also created a corresponding visually induced auditory illusion.

Modelling of audiovisual speech perception
The above results, as well as others, have been modelled with the Fuzzy Logical Model of Perception (FLMP) developed by Massaro and co-workers, which accounts for them well. The model has also been extended to account for the effect of the auditory signal-to-noise ratio, as illustrated in Fig. 36.


Figure 36: The Fuzzy Logical Model of Perception has been modified to account for the effect of auditory signal-to-noise ratio (-12 to 6 dB) in audiovisual speech perception. The new model accounts for both auditory (top) and audiovisual (bottom) response distributions.
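The FLMP's core integration rule - multiplicative combination of unimodal supports followed by normalization - is simple enough to sketch. The snippet below is an illustrative implementation only: in practice the fuzzy truth values are free parameters estimated by fitting unimodal and bimodal response data, and the numbers used here are invented.

```python
import numpy as np

def flmp_response_probabilities(a, v):
    """Fuzzy Logical Model of Perception: multiplicative integration.

    a, v: fuzzy truth values (0..1) giving the support of the auditory
    and visual stimulus for each response alternative (illustrative
    values; the published model estimates them from response data).
    Returns the predicted probability of each response alternative.
    """
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    support = a * v                  # multiplicative combination
    return support / support.sum()   # relative-goodness (Luce) rule

# Toy example: auditory evidence favours /pa/, visual evidence favours
# /ka/; the integrated percept is shifted towards the visual alternative.
p = flmp_response_probabilities([0.8, 0.1, 0.1],    # /pa/, /ta/, /ka/
                                [0.05, 0.15, 0.8])
```

The normalization step is what gives the model its characteristic super-additive behaviour when one modality is ambiguous and the other is not.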

However, when applied to the illusory flashes discovered by Shams and coworkers, we have shown, using cross-validation, that the FLMP fails when training and test samples are separated. This means that the FLMP is unable to predict subjects' responses to a combination of flashes and beeps given their responses to other combinations of flashes and beeps. This indicates that the model's non-linear interaction term provides it with a hyper-flexibility comparable to that provided by additional free parameters, and hence that the model is over-fitting. We have developed a linear feature-based model which predicts subjects' responses better than the FLMP.
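The separation of training and test samples can be sketched as a leave-one-condition-out loop: fit the model on all stimulus combinations but one, predict the held-out combination, and accumulate the prediction error. Everything below (the condition labels, the baseline predictor) is an illustrative stand-in, not our actual fitting code.

```python
import numpy as np

def loo_prediction_error(conditions, responses, fit_and_predict):
    """Leave-one-condition-out cross-validation (illustrative sketch).

    conditions: labels of the stimulus conditions (e.g. flash/beep combos)
    responses: observed response proportion for each condition
    fit_and_predict: callable(train_indices, test_index) -> prediction,
        standing in for fitting a model (FLMP, linear, ...) on the
        training conditions and predicting the held-out one.
    Returns the root-mean-square error on held-out conditions.
    """
    responses = np.asarray(responses, dtype=float)
    sq_errors = []
    for held_out in range(len(conditions)):
        train = [i for i in range(len(conditions)) if i != held_out]
        pred = fit_and_predict(train, held_out)
        sq_errors.append((pred - responses[held_out]) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))

# Toy usage with a trivial "model" that predicts the training mean.
resp = np.array([0.2, 0.4, 0.6])
baseline = lambda train, test: resp[train].mean()
err = loo_prediction_error(["c1", "c2", "c3"], resp, baseline)
```

A model that merely interpolates its training data (as an over-flexible model can) will show a small fitting error but a large held-out error under this scheme.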

Articulatory influences on speech perception
We have demonstrated that auditory speech percepts are modified when subjects watch their own silent articulation in a mirror while acoustic stimuli are simultaneously presented to their ears. We have also shown that one's own silent articulation alone causes very similar effects. Our results support the idea of a close relation between speech production and perception.

We have found that silent articulation of a vowel influences the subsequent classification of a target vowel continuum by assimilating the ambiguous vowels at the phoneme boundary to the category of the articulated vowel. However, if the context vowel is presented auditorily, a contrast effect is obtained, rendering the two vowels more separate in vowel space.

Audio-visual speech perception and speech mode
A crucial question about speech perception is whether speech is perceived like all other sounds or whether a specialized mechanism is responsible for coding the acoustic signal into phonetic segments. "Speech mode" refers either to a structurally and functionally encapsulated speech module operating selectively on articulatory gestures, or to a perceptual mode focusing on the phonetic cues in the speech signal. Integration of seen speech articulations and heard speech sounds into audiovisual percepts has been suggested to provide evidence for a specialized speech mode. As a striking example of integration, watching conflicting articulation can even change the phonetic identity of an acoustic speech stimulus, as occurs in the McGurk effect. However, there has been no empirical evidence that this type of audio-visual integration occurs only when the acoustic stimuli are perceived as speech and not when they are treated as non-speech. We showed that audio-visual integration of speech is strongly dependent on the subject's expectations about the nature of the acoustic stimuli. In order to integrate acoustic and visible speech, the perceiver has to be in a speech mode.

Neurophysiological mechanisms of audiovisual speech interactions
The monkey premotor cortex contains neurons that are activated both when the monkey performs motor acts and when it sees or hears actions made by others. Such a "mirror-neuron system" in the human brain includes at least the superior temporal sulcus region, Broca's area, and the primary motor cortex. Enhancement of the 30-35-ms somatosensory evoked fields (SEFs) to median nerve stimulation during observation of hand actions suggests that the SI cortex also contributes to the human mirror-neuron system. Our aim was to find out whether the human SI cortex is involved in the perception of acoustic and visual speech, and whether the mouth and hand representation areas of SI react differently to these stimuli. We stimulated the lower lip (tactile pulses) or the median nerves (electric pulses) once every 1.5 s while the subjects either rested, listened to speech, or watched articulatory gestures. SI signals to lip stimuli, peaking at around 55 ms in the left hemisphere, were significantly enhanced while subjects watched speech. Listening to speech did not modulate these signals. The 35-ms SEFs to median nerve stimulation remained stable during watching and listening to speech. Our results show that watching speech modulates the functioning of the speech-specific left-hemisphere SI mouth projection area. These data suggest that watching speech actions, but not listening to them, activates the left SI in a somatotopic manner.

We studied the cortical areas at which acoustic syllables and corresponding visual lip movements are integrated. Neuromagnetic responses to audiovisual syllables were compared with the arithmetic sum of responses to acoustic and visual syllables presented alone. Differences were interpreted to reflect audiovisual interaction. The earliest audiovisual interaction was observed bilaterally in the auditory cortices 150-200 ms after stimulus onset. Later audiovisual interaction was observed in the right superior temporal sulcus (250-600 ms). The results indicate that both sensory-specific and polysensory regions of the human temporal cortex are involved in the integration of acoustic and visual phonetic inputs. The results also imply that the audiovisual interaction in the auditory cortex precedes that in the polysensory cortex.

We studied the interactions in neural processing of auditory and visual speech by recording event-related brain potentials (ERPs). Unisensory (auditory - A and visual - V) and audiovisual (AV) vowels were presented to the subjects. AV vowels were phonetically either congruent (e.g., acoustic and visual /a/) or incongruent (e.g., acoustic /a/ and visual /y/). ERPs to AV stimuli and the sum of the ERPs to A and V stimuli (A+V) were compared. Similar ERPs to AV and A+V were hypothesized to indicate independent processing of A and V stimuli; differences, on the other hand, would suggest AV interactions. Three deflections, the first peaking at about 85 ms after the A stimulus onset, were significantly larger in the ERPs to A+V than in the ERPs to both congruent and incongruent AV stimuli. We suggest that these differences reflect AV interactions in the processing of general, non-phonetic features shared by the acoustic and visual stimuli (spatial location, coincidence in time). The first difference in the ERPs to incongruent and congruent AV vowels peaked at 155 ms after the A stimulus onset. This and the two later differences are suggested to reflect interactions at the phonetic level. The early general AV interactions are suggested to reflect modified activity in the sensory-specific cortices, whereas the later phonetic AV interactions are likely generated in the heteromodal cortices. Thus, our results suggest that sensory-specific and heteromodal brain regions participate in AV speech integration at separate latencies and are sensitive to different features of A and V speech stimuli.
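The additive-model logic used in these analyses - comparing the bimodal response against the sum of the unimodal responses - can be sketched as a simple residual computation. The signals below are synthetic stand-ins for measured ERP or MEG waveforms, used only to show the arithmetic.

```python
import numpy as np

def av_interaction(erp_av, erp_a, erp_v):
    """Additive-model test for audiovisual interaction (illustrative).

    If auditory and visual processing were independent, the response to
    the AV stimulus should equal the sum of the unimodal responses
    (A + V); any residual AV - (A + V) is interpreted as an interaction
    effect. Inputs are 1-D arrays sampled on a common time axis.
    """
    return np.asarray(erp_av) - (np.asarray(erp_a) + np.asarray(erp_v))

# Toy example: with perfectly additive synthetic signals, the residual
# (and hence the inferred interaction) is zero at every time point.
t = np.linspace(0.0, 0.3, 300)      # 0-300 ms time axis (arbitrary)
a = np.sin(40 * t)                  # stand-in auditory response
v = np.cos(25 * t)                  # stand-in visual response
resid = av_interaction(a + v, a, v)
```

In real data the residual is tested statistically against zero at each latency, which is how the 85-ms and 155-ms interaction effects described above were identified.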

Attentional selection of audiovisual stimuli
Our subjects attended to audiovisual stimuli in order to detect targets that differed in duration from the other relevant audiovisual stimuli. We found preliminary evidence for a specific audiovisual selective-attention mechanism that could not be explained by separate auditory and visual attention systems.