Researchers: Janne Kulju, Riikka Möttönen, Jean-Luc Olivés, Pertti Palo, Mikko Sams, Antti Salovaara, Otto Seppälä, Sampsa Toivanen
We are developing an Artificial Person that can communicate and interact with humans via natural communication channels. Over the last two years we have worked on an audio-visual speech synthesizer, shown in Figure 41. The intelligibility of our synthesizer has been evaluated twice, and an appropriate user-interface for controlling the synthesizer has been developed.
|Figure 41: A set of expressions of the Artificial Person, which is based on parametrised facial animation and a synchronised speech synthesizer.|
The first version of the audiovisual synthesizer is a Talking Head, a combination of an acoustical synthesizer (MikroPuhe 4.1 by TimeHouse Ltd) and a dynamic animated facial model. A key aspect of the development is continuous evaluation of the quality of the synthesis, and we have therefore created tools for running the evaluation experiments. Producing appropriate stimuli for the intelligibility experiments requires a high degree of control over the synthesizer, so extra effort has been put into developing its interface. The user interface has been developed in collaboration with Professor Kari-Jouko Räihä's research group at the University of Tampere.
Two intelligibility studies of our synthesizer have been carried out. In the first one, the test corpus consisted of 39 VCV words that were presented under six conditions: natural audiovisual, synthetic audiovisual, natural audio only, synthetic audio only, natural audio + synthetic vision, and synthetic audio + natural vision. Each condition was presented at signal-to-noise ratios (SNRs) of 0, -6, -12 and -18 dB. The subjects were 11 male and 9 female native speakers of Finnish.
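The SNR levels above can be illustrated with a short sketch that scales a masking noise so that the ratio of signal power to noise power, 10·log10(P_signal / P_noise), equals a target value in dB. The function names, the sine-wave stand-in for speech and the Gaussian white-noise masker are assumptions made for illustration only; the noise type and mixing procedure used in the experiments are not described here.

```python
import math
import random


def power(samples):
    """Mean power of a sampled signal."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power equals snr_db, then add.

    Returns the noisy mixture and the scaled noise on its own.
    """
    # Solve 10*log10(P_s / (g^2 * P_n)) = snr_db for the noise gain g.
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(signal, scaled)]
    return mixture, scaled


def measured_snr_db(signal, noise):
    """SNR in dB computed from the separate signal and noise components."""
    return 10 * math.log10(power(signal) / power(noise))


# Hypothetical stand-ins: a sine wave for the speech, white noise for the masker.
rng = random.Random(0)
signal = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

SNRS_DB = [0, -6, -12, -18]  # the four levels used in the first study
results = {}
for snr in SNRS_DB:
    mixture, scaled_noise = mix_at_snr(signal, noise, snr)
    results[snr] = measured_snr_db(signal, scaled_noise)
```

At negative SNRs the noise carries more power than the speech, which is why the visual channel can be expected to matter most at -12 and -18 dB.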
The results for global intelligibility are depicted in Figure 42. The facial animation improved the intelligibility of both the synthetic and the natural acoustic speech. The mean improvement was about 15%, and it was somewhat larger at lower SNRs.
After the first study, the phoneme articulations of the synthesizer were improved and a tongue model was added. Whereas 25% of the synthetic consonant articulations were correctly identified in the first intelligibility study, 33% were correctly identified in the second. The identification of bilabials, labiodentals and incisives improved in particular. Of the vowel articulations, 74% of the natural and 51% of the synthetic ones were correctly identified. The identifications of visual vowel articulations fell into four categories: (a, ä, e), (i), (o, ö) and (u, y). We also presented the stimuli three-dimensionally using a special stereoscopic device, but this had no significant effect on intelligibility. We have improved the visual speech based on the results of the second intelligibility study. The third evaluation will use expert lipreaders, i.e. hearing-impaired persons, as subjects.
We have started the development of a new Artificial Person. The new model will be much more detailed than the previous one, but also scalable. A new, flexible parameterization for changing the facial configuration as well as the visual speech has to be created, and it will be made MPEG-4 SNHC compliant. We have collected an audiovisual speech database, which will be used for extracting the phoneme articulations. Facial emotions will be constructed using the widely used Facial Action Coding System (FACS). To individualize the head, its surface structure has to be easily adjustable, for example according to a photograph of a person. In addition to the acoustical synthesizer currently in use, the facial animation can be connected to other synthesizers. Besides Finnish, the new synthesizer will also speak English.
We have also started the development of an anatomically based head model, which simulates the behaviour of a real human face by modeling the actual physical mechanisms responsible for facial movements. The functionality of the head is based on three components: a skin model, a muscle model and a skull model. We have used Magnetic Resonance Images (MRIs) of a real person to obtain the head geometry. An image segmentation procedure was used to convert the volumetric MRI data into a polygonal representation of the head. The skin model is composed of a two-layered lattice of nodes connected by viscoelastic springs, which imitates real human skin. Muscles are modeled as straight lines extending from the skin to the skull surface. A contracting muscle applies an attractive force, in the direction of its tail, to the node it is attached to. This force is propagated through the spring lattice to all nearby nodes. As a result, the skin around the attachment point of the muscle is displaced towards the other end of the muscle (see Figure 43).
|Figure 43: On the left: the anatomically based model of a real person's face, with geometry obtained from Magnetic Resonance Imaging, shown in its initial form. On the right: the muscles (white lines) at the corners of the mouth have contracted, producing a smile.|
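The skin and muscle mechanism described above can be illustrated with a minimal mass-spring sketch. It is heavily simplified and is not the model itself: it uses a single flat 5 x 5 layer instead of a two-layered lattice on MRI-derived geometry, each spring is a linear spring with a parallel damper standing in for viscoelasticity, the border nodes are pinned, and all constants (`K`, `C`, `MASS`, `DT`, the muscle force and the tail position) are assumed values chosen only to keep the example stable.

```python
import math

N = 5        # the lattice is N x N nodes (assumed size)
REST = 1.0   # spring rest length
K = 20.0     # spring stiffness (assumed)
C = 4.0      # damper coefficient along each spring (assumed)
MASS = 1.0   # node mass (assumed)
DT = 0.01    # integration time step (assumed)


def build_grid(n):
    """Return node positions and the list of springs between grid neighbours."""
    pos = {(i, j): [float(i), float(j)] for i in range(n) for j in range(n)}
    springs = []
    for i in range(n):
        for j in range(n):
            if i + 1 < n:
                springs.append(((i, j), (i + 1, j)))
            if j + 1 < n:
                springs.append(((i, j), (i, j + 1)))
    return pos, springs


def simulate(muscle_node, tail, muscle_force, steps=2000):
    """Pull `muscle_node` towards `tail` and let the lattice settle."""
    pos, springs = build_grid(N)
    vel = {k: [0.0, 0.0] for k in pos}
    fixed = {k for k in pos if k[0] in (0, N - 1) or k[1] in (0, N - 1)}
    for _ in range(steps):
        force = {k: [0.0, 0.0] for k in pos}
        for a, b in springs:
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            length = math.hypot(dx, dy)
            ux, uy = dx / length, dy / length
            # Relative velocity along the spring drives the damper.
            rel = (vel[b][0] - vel[a][0]) * ux + (vel[b][1] - vel[a][1]) * uy
            f = K * (length - REST) + C * rel
            force[a][0] += f * ux
            force[a][1] += f * uy
            force[b][0] -= f * ux
            force[b][1] -= f * uy
        # The contracting muscle attracts its skin node towards the tail.
        dx, dy = tail[0] - pos[muscle_node][0], tail[1] - pos[muscle_node][1]
        dist = math.hypot(dx, dy)
        force[muscle_node][0] += muscle_force * dx / dist
        force[muscle_node][1] += muscle_force * dy / dist
        # Semi-implicit Euler step; pinned border nodes do not move.
        for k in pos:
            if k in fixed:
                continue
            for axis in (0, 1):
                vel[k][axis] += DT * force[k][axis] / MASS
                pos[k][axis] += DT * vel[k][axis]
    return pos


rest, _ = build_grid(N)
deformed = simulate(muscle_node=(2, 2), tail=(6.0, 2.0), muscle_force=2.0)


def displacement(node):
    """Displacement of a node from its rest position as (dx, dy)."""
    return (deformed[node][0] - rest[node][0], deformed[node][1] - rest[node][1])
```

Calling `displacement((2, 2))` gives the shift of the attachment node towards the tail, while `displacement((1, 2))` and `displacement((3, 2))` show the smaller shift that reaches its neighbours through the springs.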