Even though everyday human-to-human communication is usually based on speech, people also use non-verbal information, conveyed through the face and body, to transmit messages or to respond to stimuli from the environment or from other humans. Besides the actual lexical content of these messages, emotions are mostly expressed through facial expressions, body pose and gestures, and paralinguistic elements such as speech intonation.

Virtual communities attempt to establish a metaphor of this communication scheme, utilizing avatars as user proxies. In this case, interaction is usually based on exchanging text messages, possibly extended with text-to-speech transformation. Non-verbal communication is restricted to a few predefined body postures, usually triggered by the user via keystrokes or user interface elements. As a result, users not fluent in English may face severe obstacles in trying to communicate their messages or comprehend what is written. Even when language is not a problem, such means of communication are rather monotonous and often discourage users, especially those who are not technology-savvy. An obvious improvement to this scenario is the introduction of realistic expression synthesis, depicting either the actual expression of the corresponding user or merely picturing the content of the textual communication.

Despite the progress in related research, our intuition of what an expression or emotion actually represents is still based on trying to mimic the way the human mind works when recognizing such an emotion. This means that robust results cannot be obtained without taking into account features such as speech, facial and hand gestures, and body pose. In the case of speech, features can come from both linguistic and paralinguistic analysis, that is, from studying both what users say and how they say it.
Moreover, facial and hand gestures and body pose often convey messages in a much more expressive and definite manner than wording, which can be misleading or ambiguous, especially when users are not visible to each other and can therefore easily conceal their actual emotional state. While a lot of effort has been invested in examining these aspects of human expression individually, recent research has shown that even this approach can benefit from taking multimodal information into account. As a result, Man-Machine Interaction (MMI) systems that utilize multimodal information about users' current emotional state are presently at the forefront of interest of the computer vision and artificial intelligence community. Such interfaces give less technology-aware individuals, as well as handicapped people, the opportunity to use computers more efficiently and thus overcome related fears and preconceptions.

Beyond accessibility, a multimodal system has numerous applications in different aspects of human life. The real-world actions of a human can be transferred into a virtual environment through a representative (avatar), while the virtual world perceives these actions and responds through respective system avatars. An example of this enhanced MMI scheme is the virtual mall: business-to-client communication via the web is still poor, again based on the exchange of textual information via email, whereas what most clients actually look for is a human salesman who would smile at them and adapt to their personal needs. A humanoid embodiment, an avatar, could thus enhance the humane aspects of e-commerce, interactive TV or online advertising applications. Avatars can also be used more extensively in real-time, peer-to-peer multimedia communication, providing enhanced means of expression that are missing from text-based communication.
Avatars can express emotions using human-like expressions and gestures not only during a web chat or teleconference but also during news broadcasts, making the content more attractive since it is delivered in a human-like way. In this paper we propose an efficient approach to facial and body expression synthesis in networked environments, via the tools provided in the MPEG-4 standard. More specifically, we describe an approach to synthesizing facial expressions and body gestures based on real measurements and on universally accepted assumptions about their meaning. These assumptions are grounded in established psychological studies, as well as in empirical analysis of actual video footage from human-computer interaction sessions and human-to-human dialogues. Investigation of the resulting findings leads to a classification scheme that separates body gestures into categories according to whether they actually convey a message, the meaning conveyed, and whether they occur in human-human or human-computer interaction.
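The classification scheme described above distinguishes gestures along three axes: whether a message is conveyed, what that message is, and the interaction setting in which the gesture occurs. A minimal data-structure sketch of such a scheme is given below; the category names and example entries are illustrative assumptions for exposition only, not the actual taxonomy derived from the video footage analysis.

```python
from dataclasses import dataclass
from enum import Enum


class Context(Enum):
    """Interaction setting in which a gesture is observed."""
    HUMAN_HUMAN = "human-human"
    HUMAN_COMPUTER = "human-computer"


@dataclass(frozen=True)
class GestureClass:
    """One category in a gesture classification scheme."""
    name: str
    communicative: bool   # does the gesture actually convey a message?
    meaning: str          # the meaning conveyed, empty if none
    contexts: frozenset   # settings where this category is found


# Illustrative entries only -- the real categories come from the
# paper's empirical analysis, not from this sketch.
TAXONOMY = [
    GestureClass("emblem", True, "culture-specific sign, e.g. thumbs-up",
                 frozenset({Context.HUMAN_HUMAN})),
    GestureClass("adaptor", False, "",
                 frozenset({Context.HUMAN_HUMAN, Context.HUMAN_COMPUTER})),
]


def communicative_in(context):
    """Names of categories that convey a message in the given setting."""
    return [g.name for g in TAXONOMY
            if g.communicative and context in g.contexts]
```

Keeping the three axes as explicit fields lets a synthesis component query, for a given interaction setting, only those gesture classes that carry meaning there.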
K. Karpouzis, A. Raouzaiou, St. Kollias, "'Moving' avatars: Emotion Synthesis in Virtual Worlds", Human-Computer Interaction International 2003, 22-27 June 2003, Crete, Greece, vol. 2, pp. 503-507.