Neurocognitive Mechanisms of Multisensory Perception

Researchers: Tobias Andersen, Toni Auranen, Iiro Jääskeläinen, Vasily Klucharev, Riikka Möttönen, Ville Ojanen, Johanna Pekkola, Mikko Sams, Kaisa Tiippana

Effect of preceding audiovisual context on auditory perception

We studied the representations underlying audiovisual integration using a priming paradigm. Audiovisual primes, preceding auditory targets, were either incongruent (auditory /ba/ & visual /va/) or congruent (auditory /va/ & visual /va/, auditory /ba/ & visual /ba/). The targets were /ba/ or /va/. The intensity of the prime’s auditory component was either 50 dB or 60 dB. Identification speed of the target /ba/ was strongly affected by the nature of the prime. The effect of the incongruent audiovisual prime depended on the intensity of its acoustic component. Our results can be explained by assuming that some properties of the visual representation were mapped into the auditory representation.

Modelling of audiovisual speech perception

The above results, as well as others, have been modelled with the Fuzzy Logical Model of Perception (FLMP) developed by Massaro and coworkers, which fits the results well. The FLMP assumes that phonetic categorization occurs independently in audition and vision. The final phonetic percept is modelled as a maximum likelihood integration of the two unimodal percepts. We have extended the FLMP to account for the effect of the audio signal-to-noise ratio, as illustrated in Fig. 43. We have also applied the FLMP to the illusory flashes described above. Here, too, the model fits the data well.
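As an illustration of the integration step, the following minimal sketch uses the commonly cited form of the FLMP rule, in which the unimodal support values for each response category are multiplied and then normalized across categories. The function name and the support values are hypothetical and chosen only for illustration; this is not the extended signal-to-noise model, and it is not fitted to any of the data reported here.

```python
def flmp_integrate(auditory_support, visual_support):
    """Combine unimodal support values for each response category.

    Each argument maps a response category (e.g. "ba", "va") to a
    support value between 0 and 1. Supports are multiplied per category
    and normalized so that the predicted probabilities sum to one.
    """
    products = {
        category: auditory_support[category] * visual_support[category]
        for category in auditory_support
    }
    total = sum(products.values())
    if total == 0:
        raise ValueError("at least one category needs non-zero joint support")
    return {category: value / total for category, value in products.items()}


# Hypothetical supports: audition favours /ba/, vision favours /va/.
probabilities = flmp_integrate(
    {"ba": 0.8, "va": 0.2},
    {"ba": 0.1, "va": 0.9},
)
```

With these illustrative numbers the visual evidence dominates, so the predicted response probability is higher for /va/ than for /ba/.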

We have investigated the role of over-fitting in the FLMP’s good fits. Over-fitting occurs when a model is too general: it then fits the noise in the data rather than the underlying structure.

Figure 43: The Fuzzy Logical Model of Perception has been modified to account for the effect of auditory signal-to-noise ratio (-12 to 6 dB) in audiovisual speech perception. The new model accounts for both auditory (top) and audiovisual (bottom) response distributions.

Over-fitted models tend to fit the data well but perform poorly at predicting new data. Cross-validation exploits this. By fitting the model to only a subset of the data and testing it on another subset, the predictive ability of the model is tested. Cross-validation reveals that the FLMP does indeed over-fit.
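The cross-validation procedure can be sketched generically as follows. This is a minimal illustration, not the analysis code used in the study: the k-fold split, the squared-error measure, and all function names are assumptions, and the mean-only model at the end is a placeholder standing in for whichever model is being evaluated.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and split them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, fit, predict, k=5, seed=0):
    """Return the mean squared prediction error on held-out data.

    data:    list of (x, y) pairs
    fit:     function(training_pairs) -> model
    predict: function(model, x) -> predicted y
    """
    squared_errors = []
    for held_out in k_fold_indices(len(data), k, seed):
        held_out_set = set(held_out)
        # Fit only on the items that are not in the held-out fold.
        training = [data[i] for i in range(len(data)) if i not in held_out_set]
        model = fit(training)
        # Test the fitted model on the items it has not seen.
        for i in held_out:
            x, y = data[i]
            squared_errors.append((predict(model, x) - y) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Placeholder model (assumption): predicts the mean of the training responses.
def fit_mean(training_pairs):
    return sum(y for _, y in training_pairs) / len(training_pairs)


def predict_mean(model, x):
    return model
```

Calling `cross_validate` once per candidate model and comparing the returned errors gives the kind of comparison described above: a model that over-fits can show a small error on the data it was fitted to but a larger error on the held-out subsets.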

We have developed a new model of audiovisual integration termed Linear Feature Integration (LFI). This model assumes that integration of acoustical and visual feature vectors occurs prior to categorization, as a weighted vector sum. We have applied LFI to the illusory flashes described above and obtained poorer fits but better cross-validation results than with the FLMP. This suggests that LFI may be an appropriate model that does not over-fit the data.
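The idea of integrating feature vectors before categorization can be sketched as below. Only the weighted vector sum comes from the description above; the nearest-prototype categorization, the two-dimensional feature space, the weights, and all names are assumptions made purely for illustration.

```python
def integrate_features(acoustic, visual, w_acoustic, w_visual):
    """Weighted vector sum of acoustic and visual feature vectors."""
    if len(acoustic) != len(visual):
        raise ValueError("feature vectors must have the same length")
    return [w_acoustic * a + w_visual * v for a, v in zip(acoustic, visual)]


def categorize(features, prototypes):
    """Assumed categorization stage: pick the nearest category prototype."""
    def squared_distance(prototype):
        return sum((f - p) ** 2 for f, p in zip(features, prototype))

    return min(prototypes, key=lambda label: squared_distance(prototypes[label]))


# Hypothetical prototypes and stimuli in a two-dimensional feature space.
prototypes = {"ba": [0.0, 0.0], "va": [1.0, 1.0]}
acoustic = [0.2, 0.1]  # close to the /ba/ prototype
visual = [0.9, 1.0]    # close to the /va/ prototype

# Integration happens first; categorization operates on the combined vector.
visual_dominant = categorize(
    integrate_features(acoustic, visual, 0.3, 0.7), prototypes
)
acoustic_dominant = categorize(
    integrate_features(acoustic, visual, 0.9, 0.1), prototypes
)
```

In this sketch the category depends on the relative weights of the two modalities: the same pair of stimuli is categorized as /va/ when the visual weight is high and as /ba/ when the acoustic weight is high.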

Processing of audiovisual speech in Broca’s area

We investigated the neural mechanisms underlying the processing of audiovisual phonetic information in humans using functional magnetic resonance imaging (fMRI) (see Fig. 44). Ten healthy volunteers were scanned with a ’clustered volume acquisition’ paradigm at 3T during presentation of phonetically congruent and incongruent audiovisual vowels /a/, /o/, /i/ and /y/. Comparing activations to congruent and incongruent audiovisual vowels enabled us to specifically map the cerebral areas participating in audiovisual speech processing at the phonetic level. Phonetic incongruency (e.g., visual /a/ and auditory /y/), as compared with congruency (e.g., visual and auditory /y/), significantly activated Broca’s area, the prefrontal cortex and the superior parietal lobule in the left hemisphere. In contrast, we did not observe any enhanced activity to phonetically congruent stimulation in comparison to incongruent stimulation. Our results highlight the role of Broca’s area in the processing of audiovisual speech and suggest that it might provide a common representational space for auditory and visual speech.

Figure 44: Across-subjects (N=10) z-statistic maps overlaid on an anatomical template. Congruent audiovisual speech activated the auditory and the visual cortical areas, as well as the inferior frontal, the premotor and the visual-parietal areas bilaterally (upper panel). Incongruent audiovisual speech caused a similar but more extensive pattern of brain activity (middle panel). The difference reached significance in three left hemisphere areas: Broca’s area (BA44/45), superior parietal lobule (BA7) and prefrontal cortex (BA10) (lower panel). In the contrast ’Congruent > Incongruent’ no statistically significant voxels were detected. Activation images were thresholded using clusters determined by voxel-wise Z>3.0 and a cluster significance threshold of p<0.05, corrected for multiple comparisons.

Auditory and visual speech perception activate the speech motor regions

We investigated the neural basis of auditory and visual speech processing using a "clustered volume acquisition" functional magnetic resonance imaging (fMRI) pulse sequence at 3T (see Fig. 45). Areas activated by both auditory and visual vowels were observed in the left insula, Broca’s area, the lateral premotor cortex, and the inferior parietal area, as well as in the right superior temporal gyrus/sulcus. Significantly stronger activation for visual than for auditory speech was observed in the left motor and sensory areas, inferior parietal lobule, posterior cingulate gyrus and visual sensory-specific areas. Significantly stronger activation for auditory speech, in turn, was observed in the left lingual gyrus, the left insula, the anterior cingulate bilaterally and auditory sensory-specific areas. Our results suggest that the speech motor areas provide a common representational space for auditory and visual speech.

Figure 45: Grand-average fMRI activations to auditory and visual phonetic stimulation. Areas responding to both auditory and visual phonetic stimulation included the speech motor areas (depicted in yellow).

Effects of lip-reading in the auditory cortex

How the auditory cortex works is generally less well understood than, e.g., the functions of the visual cortex. Only recently has evidence emerged about active information processing and possible multisensory engagement in the auditory areas. For example, lip-reading is known to activate secondary auditory areas, and, in deaf people, even simple visual stimuli (like moving dots) have been shown to activate "auditory" temporal lobe areas.

Using functional magnetic resonance imaging (fMRI), we studied which areas of the auditory cortex are activated by silent lip-reading, focusing especially on the primary auditory cortex (see Fig. 46). During fMRI scanning the subjects were intermittently shown either a face silently uttering vowels or a still image of the same face.

We found secondary auditory cortex activation by visual speech in all subjects and primary auditory cortex activation in seven out of ten subjects. This suggests that the primary auditory cortex could actually receive visual input, or possibly that its function is modulated by attentional mechanisms (where visual speech cues would "sensitize" the auditory cortex to listening).

Figure 46: Example of three subjects showing activation in the primary auditory cortex by visual speech. The yellow line outlines the brain area accommodating the primary auditory cortex, and the loci of statistically significant activations are marked with red.

In a related study, we used 306-channel magnetoencephalography (MEG) in 8 healthy volunteers to test whether seeing speech modulates the responsiveness of auditory-cortex neurons tuned to phonetic stimuli. Specifically, we hypothesized that seeing a visual articulation causes adaptation of auditory cortex MEG responses to a subsequently presented phonetic sound. Auditory ’test’ stimuli (Finnish vowels /ä/ and /ö/) were preceded (500-ms lag) by auditory (/ä/, /ö/, and the F2-midpoint between /ä/ and /ö/) or visual articulatory (/ä/ and /ö/) ’adaptor’ stimuli. As a separate control, the auditory /ä/ and /ö/ stimuli were presented without the adaptors. The subjects’ task was to behaviorally discriminate between the /ä/ and /ö/ test stimuli. The amplitude of the left-hemisphere N1m response to test stimuli was significantly suppressed with auditory (P<0.001) and visual (P<0.05) adaptors, this effect being significantly greater with the auditory adaptors (P<0.01) (see Fig. 47). These findings suggest that seeing the articulatory gestures of a speaker influences auditory speech perception by modulating the responsiveness of auditory cortex feature-detector neurons tuned to phonetic sound features. This may relate to recent animal studies suggesting that the tuning properties of auditory cortex neurons are modulated by the attentional/motivational state of the organism. The fact that adaptation was significantly greater when auditory rather than visual adaptors preceded the test stimuli can be explained by additional adaptation to acoustic stimulus features.

Figure 47: The effects of auditory and visual adaptor stimuli on auditory cortex N100m responses to subsequently presented auditory phonemes. Auditory phonemes preceding the target phonemes caused a significant decrease in response amplitudes. Visual phonemes (articulations) presented before the auditory phonetic stimuli also caused significant suppression of the auditory responses, although significantly less than that caused by the auditory adaptors.

Processing of sine-wave speech in the human brain

In collaboration with the University of Oxford, we used sine-wave speech (SWS) stimuli to study whether there are speech-specific regions in the human brain. Typically, naïve subjects perceive SWS stimuli as strange (non-speech) sounds. However, when subjects are told that SWS stimuli can be heard as speech, they typically start to perceive the stimuli phonetically. We used fMRI to investigate how subjects’ knowledge of the nature of SWS stimuli affects their neural processing. The experiments were conducted at the FMRIB Centre in Oxford. Our findings suggest that the posterior region of the left superior temporal sulcus may be a speech-specific processing site in the human brain.