" Development of Feature Representations from Emotionally coded Facial Signals and Speech "

This report is part of the PHYSTA project, which aims to develop an artificial emotion decoding system. The system will use two types of input: visual (specifically facial expression) and acoustic (specifically non-verbal aspects of the speech signal).

PHYSTA will use hybrid technology, i.e. a combination of classical artificial intelligence (AI) computing and neural networks (NNs). Broadly speaking, the classical component allows for the use of known procedures and logical operations, which are suited to language processing. The neural network component allows for learning at various levels, for instance the weights that should be attached to various inputs, adjacencies, and the probabilities of particular events given certain information.

A review of classical methods for face analysis is presented, incorporating normalisation and feature extraction methods such as edge detection, motion estimation and more advanced techniques. Section three explores the use of supervised and unsupervised techniques for static face perception and facial emotion extraction; the architectures covered are principal component analysis, backpropagation learning, local feature analysis, and independent component analysis. In section four the application of multiresolution-based hierarchical neural networks to vision tasks is explored. Finally, section five gives an overview of speech-related emotion understanding using the ASSESS system, which is described in section six, followed by a general conclusion.
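To make the principal component analysis mentioned above concrete, the following is a minimal sketch of the eigenface-style construction: flattened face images are centred and projected onto their leading principal axes, yielding a low-dimensional feature code per image. The function name, toy random data, and dimensions are illustrative, not taken from the PHYSTA implementation.

```python
import numpy as np

def pca_features(images, n_components):
    """Project flattened images onto their leading principal
    components (the classic "eigenfaces" construction)."""
    X = images.reshape(len(images), -1).astype(float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centred data matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]      # (n_components, n_pixels)
    features = Xc @ components.T        # low-dimensional codes per image
    return features, components, mean

# Toy data: ten random 8x8 "faces" reduced to 3 features each.
rng = np.random.default_rng(0)
faces = rng.random((10, 8, 8))
codes, comps, mean = pca_features(faces, n_components=3)
print(codes.shape)  # (10, 3)
```

In a real system the same projection would be applied to normalised face images, and a classifier (e.g. a backpropagation network) would operate on the resulting codes rather than on raw pixels.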

A precondition for satisfactory recognition results based on facial signals is good alignment of the faces to a common viewpoint. Therefore we review general schemes for face detection and normalisation, which could be used to normalise the images from an un-normalised dataset or from real-time video data. We present several template-based approaches for face and expression recognition on static images, some of which were used to study their performance on a publicly available face dataset. To normalise the face images used in the study, we employ our own normalisation algorithm, which has to be expanded to cope with more general viewing positions of faces in real-world scenes. An outlook on the use of multiresolution hierarchical neural networks for vision is also given.
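One common way to align faces to a common viewpoint, sketched below, is to compute a similarity transform (rotation, scale and translation) that maps detected eye centres onto fixed canonical positions. This is a hedged illustration only: the eye coordinates are assumed to come from an earlier detection step, and the canonical target positions are invented for the example, not taken from the normalisation algorithm described in the report.

```python
import numpy as np

def eye_alignment_transform(left_eye, right_eye,
                            target_left=(12.0, 16.0),
                            target_right=(36.0, 16.0)):
    """Similarity transform A, t such that A @ p + t maps the detected
    eye centres onto canonical positions (targets are illustrative)."""
    src = np.array(right_eye) - np.array(left_eye)
    dst = np.array(target_right) - np.array(target_left)
    # Complex-number trick: a 2-D rotation-plus-scale is one complex ratio.
    s = complex(*dst) / complex(*src)
    A = np.array([[s.real, -s.imag],
                  [s.imag,  s.real]])
    t = np.array(target_left) - A @ np.array(left_eye)
    return A, t

# A tilted face: eyes at (20, 30) and (44, 26) in image coordinates.
A, t = eye_alignment_transform((20.0, 30.0), (44.0, 26.0))
print(A @ np.array((20.0, 30.0)) + t)  # ≈ [12. 16.]
print(A @ np.array((44.0, 26.0)) + t)  # ≈ [36. 16.]
```

Applying the resulting transform to the whole image (e.g. by inverse warping) rotates and rescales the face so that templates learned on aligned data remain comparable.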

Since the analysis of emotional speech has already been performed by the ASSESS system, produced by the Queen's University Belfast team, we present the application of the system in a psychological study, followed by a brief description of the system.

Next, the task of extracting features from image sequences is tackled, based on the extraction of motion and three-dimensional information from the images.
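A simple way to recover motion information from consecutive frames, given as a sketch of the general idea rather than the report's actual method, is exhaustive block matching: each small block of the previous frame is searched for in a neighbourhood of the current frame, and the best-matching displacement gives a coarse motion vector. All block sizes, search ranges and data here are illustrative.

```python
import numpy as np

def block_motion(prev, curr, block=4, search=2):
    """For each block in `prev`, find the displacement within
    +/-`search` pixels that minimises the sum of absolute
    differences against `curr`.  Returns a coarse flow field."""
    H, W = prev.shape
    flow = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y, x = by * block, bx * block
            ref = prev[y:y+block, x:x+block]
            best, best_d = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue  # candidate block leaves the frame
                    sad = np.abs(curr[yy:yy+block, xx:xx+block] - ref).sum()
                    if sad < best:
                        best, best_d = sad, (dy, dx)
            flow[by, bx] = best_d
    return flow

# Toy example: shift a random frame one pixel right, recover the motion.
rng = np.random.default_rng(1)
f0 = rng.random((16, 16))
f1 = np.roll(f0, 1, axis=1)
print(block_motion(f0, f1)[1, 1])  # interior blocks report (dy, dx) = (0, 1)
```

Per-block motion vectors of this kind, aggregated over facial regions, are one possible low-level input from which expression dynamics can be derived.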

The above systems need to be combined into a fully working audio-visual emotion recognition system. To complete the feature extraction task, the normalisation procedure has to be extended and made real-time capable. Further, a suitable database is needed, containing multiple sequences of facial expressions with synchronised speech, to develop and test the feature extraction software.