REPORT
REVIEW OF EXISTING TECHNIQUES FOR HUMAN
EMOTION UNDERSTANDING AND
APPLICATIONS IN HUMAN-COMPUTER INTERACTION
October 1998
Report for the TMR PHYSTA project
“Principled Hybrid systems: Theory and Applications”
Research contract FMRX-CT97-0098 (DG 12 – BDCN)
Contents
1 Preliminaries
*1.1 Theoretical issues for an IT-oriented approach to emotion
*1.2 Applications for an IT-oriented approach to emotion
*1.3 Structure
*2 Theoretical approaches to emotion
*2.1 The essence of emotion
*2.2 The natural history of emotional life
*2.3 Options arising from the review
*2.4 Summary
*3 Speech and emotion
*3.1 Underlying issues
*3.2 Speech parameters and specific emotions
*3.3 Computational studies of emotion in speech
*3.4 Automatic extraction of phonetic variables
*3.5 Beyond Banse and Scherer
*4 Faces and Emotion
*4.1 Neurobiology of Recognising Emotional Facial Expressions.
*4.2 Different Types of Approach to Facial Expression Recognition
*4.3 Material
*4.4 Conclusions
*5 Summary and agenda
*5.1 Signal analysis for speech
*5.2 Signal analysis for faces
*5.3 Effective representations for emotion
*5.4 Appropriate intervening variables
*5.5 Acquiring emotion-related information from other sources
*5.6 Integrating evidence
*5.7 Emotion-oriented world representations
*5.8. Material
*5.9 Input mechanisms
*References
*Appendix: Table of speech and emotion……………………………………………..57
1 Preliminaries
This report is part of the PHYSTA project, which aims to develop an artificial emotion decoding system.
PHYSTA will use hybrid technology, i.e. a combination of classical (AI) computing and neural nets. Broadly speaking, the classical component allows for the use of known procedures and logical operations which are suited to language processing. The neural net component allows for learning at various levels, for instance the weights that should be attached to various inputs, adjacencies, and probabilities of particular events given certain information.
The report reviews available information on the nature of emotion and the signs that may be used to detect it. It does not restrict itself to material that translates readily into system specifications. That would be short sighted, because there are background issues in the area which simply are not resolved, and which ought to be clearly recognised as issues rather than prejudged. The report also indicates why hybrid technology is appropriate for an artificial emotion decoding system.
This section prepares the ground by sketching the motivation for adopting this IT-oriented approach to emotion. It considers first the theoretical motivation, from the view of both human sciences and IT, and then the practical applications that can currently be envisaged.
1.1 Theoretical issues for an IT-oriented approach to emotion
Two channels have been distinguished in human interaction. One transmits explicit messages which may be about anything or nothing: the other transmits implicit messages about the speakers themselves. Both linguistics and technology have invested enormous efforts in understanding the first, explicit channel, but the second is much less well understood. Understanding the other party's emotions is one of the key tasks associated with the second, implicit channel.
PHYSTA's approach to that task has two broad kinds of theoretical goal - one primarily to do with consolidating psychological and linguistic analyses of emotion, the other primarily to do with extending the scope of IT.
1.1.1 Consolidating analyses of emotion
The human sciences contain a literature on emotion which is large, but fragmented. The main sources which are relevant to this project are in psychology and linguistics, with some inputs from biological and medical work. Translating abstract proposals into a working model system is a rational way of consolidating that knowledge base. The approach has several attractions, particularly when the system is a hybrid of symbolic and subsymbolic techniques.
First, building an emotion detection system makes it possible to assess how well ideas explain people's general competence at understanding emotion. So long as it is technically impossible to apply that kind of test, theories can only be assessed against their success or failure on selected examples, and that is not necessarily a constructive approach.
Second, model building enforces coherence. At a straightforward level, it provides a motivation to integrate information from sources that tend to be kept separate. It can also have subtler effects, such as showing that apparently meaningful ideas are actually difficult to integrate; or that conjunctions which seem difficult are actually quite possible; or that verbal distinctions and debates actually reduce to very little.
Hybrid systems have a particular attraction in that they offer the prospect of linking two types of element that are prominent in reactions to emotion - articulate verbal descriptions and explanations, and responses that are felt rather than articulated, which it is natural to think of as subsymbolic.
1.1.2 Extending the scope of IT
Developing ways to use the implicit messages that humans transmit is a major challenge for IT.
The approach that PHYSTA adopts addresses another major issue, the emergence of meaning from subsymbolic operations. Intuitively, meanings related to emotion seem to straddle the boundary between the logical, discrete, linguistic representations that classical computing handles neatly (perhaps too neatly to model human cognition well), and the fuzzy, subsymbolic representations that neural nets construct. That makes the domain of emotion a useful testbed for technologies which aim to create a seamless hybrid environment, in which it is possible for something that deserves the name meaning to emerge.
1.2 Applications for an IT-oriented approach to emotion
The PHYSTA project is a fundamental one in the sense that it is not motivated by a particular application. The implicit channel is a major feature of human communication, and if progress is made towards reproducing it, then applications can be expected to follow. However, it is useful to indicate the kinds of application that can easily be foreseen. In particular, that provides a context against which to assess the likely relevance of different theoretical approaches. Obvious possibilities can be summarised under nine headings, beginning with broad categories and then considering more specific applications.
1.2.1 Convergence
It is a feature of human communication that speakers who are in sympathy, or who want to indicate that they are, converge vocally on a range of parameters. Conversely, not to converge conveys a distinct message - roughly, aloofness or indifference. That is the message that is likely to be conveyed by an electronic speaker which always uses a register of controlled neutrality irrespective of the register used by a person interacting with it, and it is liable to interfere with the conduct of business. To vary its own register so that it can converge appropriately, a machine needs some ability to detect the speaker's state.
1.2.2 Interaction between channels
The two channels of human communication interact: the implicit channel tells people 'how to take' what is transmitted through the explicit channel. That becomes particularly critical in the context of full-blown conversation rather than minimal, stereotyped exchanges. There is a growing body of knowledge on the way prosody contributes to that function, and it is reasonable to see it as part of a wider domain linked to speaker state. For example, the same words may be used as a joke, or as a genuine question seeking an answer, or as an aggressive challenge (e.g. "I suppose you think England are going to win the World Cup"). Knowing what is an appropriate continuation of the interaction depends on detecting the register that the speaker is using, and a machine communicator that is unable to tell the difference will have difficulty managing conversation.
1.2.3 Augmenting human judgement
Some of the most immediate applications involve making information about signs of emotion available to a human, who is engaged in making judgements about another person, and who wants to make them more accurately or objectively. The classical example is lie detection. Improving on human performance in that area is a tall order, and it is not a primary goal in this project. However, there are areas where augmentation is a real possibility. Our own work provides two examples. First, some clinical diagnoses depend on detecting vocal signs of emotion, such as the diagnosis of flattened affect in schizophrenia, which is an indicator of poor prognosis and potential hospitalisation. Relying on psychiatrists' unaided judgement in that area may not be optimal, since they are are not necessarily chosen for the sensitivity of their ears. Hence it makes sense to supplement their subjective impressions with relevant objective measures, and there is prima facie evidence that the technology for obtaining relevant measures is within reach. Second, providers of teleconferencing have shown interest in on-screen displays that carry information about participants' emotional states to offset losses of sensitivity that seem to result from the unnaturalness of the medium.
1.2.4 Deconfounding
The vocal signs of emotion occupy what has been called the augmented prosodic domain - a collection of features involving pitch (which is usually equated with fundamental frequency, abbreviated as F0), amplitude, the distribution of energy across the spectrum, and some aspects of timing. Difficulties arise because the same domain carries other types of information, such as information about the stage an interaction is at (preliminary exchanges, business, inviting closure, signing off). It is important to develop ways of using these types of information to negotiate human/computer transactions, and that depends on understanding emotion-related variation well enough to recognise which underlies a particular pattern in the augmented prosodic domain.
1.2.5 Production
There is a good deal of interest in generating voices which have appropriate emotional colouring. There is a duality between that problem and the problem of recognising emotion in speech. In particular, techniques for learning the subtleties of emotional speech may provide a way of generating convincingly emotional speech. Similar points apply to the visual expression of emotion. Generation of synthetic agents which possess convincing expression characteristics is crucial for virtual reality, natural-synthetic imaging and human-computer interaction. A special case arises with compression techniques where there is the possibility of information about emotion being extracted, transmitted, and used to govern resynthesis. Note that resynthesis techniques which fail to transmit information about emotion have the potential to be catastrophically misleading.
1.2.6 Tutoring
An obvious application for emotion-sensitive machines is automatic tutoring. An effective tutor needs to know whether the user is finding examples boring or irritating or intimidating. As voice and camera inputs become more widely used, it becomes realistic to ask how machine tutors could be made sensitive to those issues.
1.2.7 Avoidance
A second obvious type of application involves machines acting as functionaries - personal assistants, information providers, receptionists, etc.. There would be clear advantages if these could recognise when the human they were interacting with was in a state that they were not equipped to handle, and either close the interaction, or hand it over to someone who was equipped to handle it.
1.2.8 Alerting
Related to the avoidance function is providing systems which can alert a user to signs of emotion that call for attention. Alerting may be necessary because the speaker and the person to be alerted are in different places (e.g. an office manager being alerted to problems with an interaction between one of several members of staff and a client, a ward nurse being alerted to a patient in distress) or because the speaker's attention is likely to be focused on other issues so that signs of emotion are overlooked (e.g. a GP talking to a patient who is not presenting their real concern, an academic adviser with a student who has undisclosed problems). The issue may be to alert people to their own emotions (for instance, so that signs of strain that might affect a critical operation or negotiation are picked up before damage is done).
1.2.9 Entertainment
Commercially, the first major application of emotion-related technology may well be in entertainment and games programs which respond to the user's state. There is probably an immense market for 'pets', 'friends', and dolls which respond even crudely to the owner's mood.
The structure of the report is as follows. Section 2 introduces theoretical approaches to emotion. Section 3 refers to emotion-related signals in speech, and the feature analysis techniques which are related to them. Section 4 provides a similar review of emotion-related signals in the face. Section 5 summarises the state of the art and the kinds of development that it is possible to envisage.
2 Theoretical approaches to emotion
Theories of emotion play a key part in setting the goals of a project like PHYSTA. A long established type of account, which it is convenient to call the classical approach, tends to be thought of as the scientific approach. Accepting it points very strongly towards a particular type of goal for PHYSTA. However, the theory of emotion has been through a period of rapid development, and more recent ideas suggest that very different types of goal make at least as much sense.
This section reviews classical and modern ideas on the subject. Its structure reflects a distinction between two main methods of approach.
The classical approach to emotion follows one of the standard methods of science - it sets out to pinpoint the essential core which is characteristic of all emotional states, and whose presence distinguishes them from all others; and then attempts to identify situations where that essential core can be studied in the purest possible form, with as few confounding influences as possible from other variables.
One of the hallmarks of recent research is that it has moved away from the classical preoccupation with identifying an essence of emotion. Instead the emphasis has been on an approach which is more akin to natural history: it has focused on describing the various aspects of emotional life, and finding effective ways of organising them.
Logically, these methods are not in competition. Presumably a mature theory of emotion would both specify fundamental elements and expand on the ways in which they may be expressed. However, that state has not been reached, and a project like PHYSTA needs to decide how much weight it attaches to each approach.
The classical approach, which remains the best known to scientists and engineers, reflects two main ideas about the essence of emotion.
The first idea was introduced by Descartes, who argued that all 'the passions of the soul' are produced by combining a few primary emotions. Descartes identified six. Spinoza reduced those to three, Hobbes recognised seven. Plutchik (1994, p. 58) tabulates a dozen contemporary lists of primary emotions, with the same sort of range of numbers. There are many more candidates (e.g. Oatley and Johnson-Laird give a list of 'primaries' containing three types of love, attachment, caregiving, and sexual).
The second idea was added in the nineteenth century. It is that emotion has particularly close connections with biology. Darwin added one element in the idea that human emotions are continuous with animals', and both have simple biological functions with a relatively direct relationship to survival. Cannon added a second element in the idea that these functions were linked to homeostasis, and that they were controlled by a particular structure in the brain (the hypothalamus).
The classical approach has exerted a powerful influence in many areas, not least in research which deals with the signs of emotion. For instance, influential articles on emotion and speech take it for granted that the material to be studied should express the primary emotions, and the measures to be taken should relate as directly as possible to the physiological events associated with them
. However, research directly concerned with emotion has moved away from the classical model.The most radical alternative is represented by theorists like Averill (1980) and Harre (1986), who advocate a view known as social constructionism. It argues that emotion words are irreducibly complex. They refer to syndromes which involve elements that tend to co-occur often enough to be worth having a compact description for. Unquestionably some of the elements are related to biology. However, others are dispositional - i.e. they relate to the sorts of things people are likely to do in a given state; and others refer to particular social contexts. Attaching labels to certain combinations of elements is a matter of social convention. Its function is to let people predict, evaluate, and explain behaviour, their own and other people's.
A variety of 'cognitive' theories can be regarded as standing between these two extremes. Arnold (1960) defines emotion as a felt tendency towards things appraised as good and away from things appraised as bad. Lazarus (1991) views emotions as responses that occur when 'core relational themes' are registered that prepare and mobilise us to make appropriate responses. Oatley and Johnson-Laird (1987, 1995) argue that emotion sets the brain into particular modes of organisation . Changes of mode are elicited by particular types of event and precipitate transitions to particular types of action.
These ideas will be revisited later. For the time being, the important point is that fundamentally different approaches to the issue of emotion do exist. These suggest different approaches to the task of detecting emotion.
2.2 The natural history of emotional life
Koch's classic overview of psychology came to the conclusion that it was still at the level of attempting to identify relevant variables. That theme runs through recent empirical work on emotions. It reviews evidence, and attempts to identify variables that allow aspects of emotional life to be distinguished and related in satisfactory ways.
Table 1 illustrates the kind of material that the approach has dealt with. It is based on two lists of emotion-related words, one due to Whissell and the other due to Plutchik. The words in Whissell's (shorter) list have numbers in the first two columns (Activation and Evaluation). The lists immediately show the richness of the vocabulary that people commonly use to describe emotion. The research question is: what kind of structure underlies this vocabulary?
The kind of structure that is most often discussed is geometric. Emotions are considered as points in a space with a relatively small number of dimensions. It is obvious that that kind of approach lends itself to implementation, including implementation using artificial neural nets.
Table 1: Emotion words from Whissell and Plutchik
Activ Eval Angle Activ Eval Angle
Accepting 0
Adventurous 4.2 5.9 270.7
Affectionate 4.7 5.4 52.3
Afraid 4.9 3.4 70.3
Aggressive 5.9 2.9 232
Agreeable 4.3 5.2 5
Amazed 5.9 5.5 152
Ambivalent 3.2 4.2 144.7
Amused 4.9 5 321
Angry 4.2 2.7 212
Annoyed 4.4 2.5 200.6
Antagonistic 5.3 2.5 220
Anticipatory 3.9 4.7 257
Anxious 6 2.3 78.3
Apathetic 3 4.3 90
Apprehensive 83.3
Ashamed 3.2 2.3 83.3
Astonished 5.9 4.7 148
Attentive 5.3 4.3 322.4
Awed 156.7
Bashful 2 2.7 74.7
Bewildered 3.1 2.3 140.3
Bitter 6.6 4 186
Boastful 3.7 3 257.3
Bored 2.7 3.2 136
Calm 2.5 5.5 37
Cautious 3.3 4.9 77.7
CheerfuI 5.2 5 25.7
Confused 4.8 3 141.3
Contemptuous 3.8 2.4 192
Content 4.8 5.5 338.3
Contrary 2.9 3.7 184.3
Co-operative 3.1 5.1 340.7
Critical 4.9 2.8 193.7
Curious 5.2 4.2 261
Daring 5.3 4.4 260.1
Defiant 4.4 2.8 230.7
Delighted 4.2 6.4 318.6
Demanding 5.3 4 244
Depressed 4.2 3.1 125.3
Despairing 4.1 2 133
Disagreeable 5 3.7 176.4
Disappointed 5.2 2.4 136.7
Discouraged 4.2 2.9 138
Disgusted 5 3.2 161.3
Disinterested 2.1 2.4 127.3
Disobedient 242.7
Displeased 181.5
Dissatisfied 4.6 2.7 183
Distrustful 3.8 2.8 185
Eager 5 5.1 311
Ecstatic 5.2 5.5 286
Elated 311
Embarrassed 4.4 3.1 75.3
Empty 3.1 3.8 120.3
Enthusiastic 5.1 4.8 313.7
Envious 5.3 2 160.3
Exasperated 239.7
Expectant 257.3
Forlorn 85
Furious 5.6 3.7 221.3
Generous 328
Gleeful 5.3 4.8 307
Gloomy 2.4 3.2 132.7
Greedy 4.9 3.4 249
Grief-stricken 127.3
Grouchy 4.4 2.9 230
Guilty 4 1.1 102.3
Happy 5.3 5.3 323.7
Helpless 3.5 2.8 80
Hesitant 134
Hopeful 4.7 5.2 298
Hopeless 4 3.1 124.7
Hostile 4 1.7 222
Humiliated 84
Impatient 3.4 3.2 230.3
Impulsive 3.1 4.8 255
Indecisive 3.4 2.7 134
Indignant 175
Inquisitive 267.7
Interested 315.7
Intolerant 3.1 2.7 185
Irritated 5.5 3.3 202.3
Jealous 6.1 3.4 184.7
Joyful 5.4 6.1 323.4
LoathfuI 3.5 2.9 193
Lonely 3.9 3.3 88.3
Meek 3 4.3 91
Nervous 5.9 3.1 86
Obedient 3.1 4.7 57.7
Obliging 2.7 3 43.3
Outraged 4.3 3.2 225.3
Panicky 5.4 3.6 67.7
Patient 3.3 3.8 39.7
Pensive 3.2 5 76.7
Perplexed 142.3
Planful 269.7
Pleased 5.3 5.1 328
Possessive 4.7 2.8 247.7
Proud 4.7 5.3 262
Puzzled 2.6 3.8 138
Quarrelsome 4.6 2.6 229.7
Ready 329.3
Receptive 32.3
Reckless 261
Rebellious 5.2 4 237
Rejected 5 2.9 136
Remorseful 3.1 2.2 123.3
Resentful 5.1 3 176.7
Revolted 181.3
Sad 3.8 2.4 108.5
Sarcastic 4.8 2.7 235.3
Satisfied 4.1 4.9 326.7
Scared 66.7
Scornful 5.4 4.9 227
Self-conscious 83.3
Self-controlled 4.4 5.5 326.3
Serene 4.3 4.4 12.3
Shy 72
Sociable 4.8 5.3 296.7
Sorrowful 4.5 3.1 112.7
Stubborn 4.9 3.1 190.4
Submissive 3.4 3.1 73
Surprised 6.5 5.2 146.7
Suspicious 4.4 3 182.7
Sympathetic 3.6 3.2 331.3
Terrified 6.3 3.4 75.7
Timid 65
Tolerant 350.7
Trusting 3.4 5.2 345.3
Unaffectionate 3.6 2.1 227.3
Uncertain 139.3
Uncooperative 191.7
Unfriendly 4.3 1.6 188
Unhappy 129
Unreceptive 170
Unsympathetic 165.6
Vascillating 137.3
Vengeful 186
Watchful 133.3
Wondering 3.3 5.2 249.7
Worried 3.9 2.9 126
Joy Acceptance
Anticipation Fear
Anger Surprise
Disgust Sadness
Figure 1: Plutchik's "Emotion Wheel"
The issue of dimensionality is important for the classical approach. The obvious inference from that approach is that the primary emotions constitute independent dimensions and that the secondary emotions can be understood as points in a space which has axes corresponding to the strength of each primary component. That model does not hold up well in empirical research. The dimensions of emotion terms seem to be of a quite different sort.
The numerical values from Whissell reflect dimensions that emerge from a variety of studies in which subjects are asked to rate emotion words and various techniques are used to establish the dimensionality of the resulting data. To a first approximation, the results seem to occupy two dimensions. Whissell's numbers are empirically derived co-ordinates for the terms she used, in a space with two dimensions, activation and evaluation. Activation is the degree of arousal associated with the term, with terms like patient and cautious (at 3.3) representing a midpoint, surprised and terrified (over 6) representing high activation), and bashful and disinterested (around 2) representing low activation. Evaluation is the degree of pleasantness associated with the term, with guilty (at 1.1) representing the negative extreme, delighted (at 6.6) representing the positive extreme.
The third column in Table 1 represents an observation that also emerges from a number of studies. Emotion terms are not evenly distributed through the space defined by dimensions like Whissell's. Instead they tend to form an approximately circular pattern. A neat summary of that approach is the 'emotion wheel' described by Plutchik (1980), shown in Figure 1. The terms which are shown in the figure can be thought of as representative points of reference. The last column in Table 1 shows empirically derived positions on the circle for all the terms in the list, using an angular measure in which the midline runs from Acceptance (0) to Disgust (180).
Plutchik does not regard the terms in Figure 1 simply as points of reference. He describes them as primary emotions, and emotions away from these cardinal points as secondary emotions formed by combining the neighbouring primaries. The terminology of 'primary emotions' is derived from the classical approach. However, it is not clear that the use of classical terminology represents much more than a gesture to classical theory. It certainly does not retain the natural interpretation of the classical view, that the primary emotions constitute independent dimensions.
The kind of structure that has been described so far involves two dimensions. It is natural to extend the idea of describing emotions in terms of dimensions based on psychological considerations, and a number of other possibilities are worth noting.
Strength Plutchik considers strength as a third basic dimension. The wheel in Figure 1 is actually a section through a cone with stronger emotions above (e.g. rage, loathing, and grief lie above anger, disgust, and sadness) and milder emotions below (e.g. annoyance, boredom, and pensiveness).
Intrusiveness This dimension relates to the interaction between emotion and cognition. It is often said that emotions are a kind of interrupt system which comes into play in extreme situations, over-riding rational control and releasing automated action patterns with high survival value. However, others descriptions imply that coexistence between emotion and rational processes is possible or even necessary. For example, Rado (1969) , a psychoanalyst, suggests that emotions are always active, but they are usually controlled by rational thought, giving rise to states which are derivative and mixed. Tomkins goes a step further: he regards emotion (which he calls 'the affect system') as an element of all action. It is 'the primary motivational system because without its amplification, nothing else matters, and with its amplification, anything else can matter' (1982, p.356).
It seems reasonable to relate these descriptions to a dimension of intrusiveness. At one extreme are emotions where rational control is largely pre-empted. At the other, emotion-related systems supply parameters that are required for rational decision-making, and that cannot be supplied by reason alone. Between is a zone of more or less stable coexistence. Oatley and Johnson-Laird's proposal, in which emotions restructure mental processing, is in a sense a cautionary note on that idea. It indicates that the mixture of emotion and cognition need not be additive.
Duration It is obviously important in practical terms to assess the likely duration or durability of emotional states, but the issue of duration receives surprisingly little systematic attention in the literature. One of the reasons may be that strongly entrenched ideas about emotions tend to suggest that they are intrinsically short term, for instance the idea that emotion is connected with interrupts. Another is that there are distinct terms for emotion-related states which are extended in time, and that leads to their being treated as separate topics. It has been said that moods are emotions that last longer than expected. If a tendency towards a particular emotional state is habitual, it is called a trait. It is not coincidence, though, that traits are often described using the same words as emotions - for instance, 'happy', 'sad', and 'angry' can all be used to describe either. Theory and research tend to focus on transitory emotional states, but it is natural to see the long-term issues of moods and traits as part of the same conceptual domain.
Two other issues can naturally be considered as dimensions on which an observer may want to place a display of emotion, although the literature rarely considers them in the same framework as those which have been mentioned above.
Adaptiveness Evolutionary approaches tend to assume that emotion is intrinsically adaptive. However, one of the main reasons for psychologists' interest in emotion is that it is involved in so much psychopathology. That can be because emotions such as anxiety or aggression are out of control, or because emotions are repressed - i.e. balance is critical, and by no means straightforward.
Genuineness People do simulate emotion - love, happiness, anger - in order to manipulate others. A recurring topics in the area has been how to detect a genuine smile. There is a double market for that kind of work - one side wants to detect falsehood better, the other (actors, politicians, etc) wants to be more convincing.
The model of geometric dimensions applies most naturally to descriptions of the internal states associated with emotions. However, the language of emotion has other aspects which it is more natural to handle in logical terms (geometric descriptions could be applied, of course, but it would be rather forced).
Figure 2 Development of emotional distinctions (after Fox)
1st level
Approach Withdrawal
2nd level
Joy Interest Anger Distress Disgust Fear
3rd level
Pride, Concern, Hostility, Misery, Contempt, Horror
Bliss Responsibility Jealousy Agony Resentment Anxiety
Figure 2 illustrates a modern conception of the way emotions develop. It seems possible that the child begins with a mode of responding along a very basic approach-withdrawal continuum, and gradually develops a refined set of responses. If so, something like the primary emotions may indeed have a kind of priority, but a priority in time rather than a priority in the psychology of the mature adult.
A second logical issue has been described as appraisal. Most accounts of emotion, including classical accounts, recognise that the concept of emotion is bound up with a perceived situation on one hand and a disposition to act on the other. The concept of fear suggests a perception of danger and a disposition to escape: the concept of love suggests a perception of social compatibility and a disposition to strengthen connections; and so on.
Theorists who remain close to the classical approach have developed this point in ways that suggest close links to simple action patterns with clear evolutionary functions. For instance, Table 2 gives the relevant parts of a system proposed by Plutchik in connection with his psychoevolutionary theory. But even in cases like these which are chosen to fit the account, several of the entries seem forced. Joy seems subjectively to be rather subtler than a prelude to mating. As for the list in Table 1, it is difficult to extend this kind of scheme to very many of the entries without a great deal of poetic licence.
Table 2: Extract from "Emotions and their derivatives" (Plutchik, 1980).
Stimulus Cognition Subjective Behaviour
language
Threat Danger Fear Escape
Obstacle Enemy Anger Attack
Potential mate Possess Joy Mate
Loss of valued individual Abandonment Sadness Cry
Member of one's group Friend Acceptance Groom
Unpalatable object Poison Disgust Vomit
New territory What's out there? Expectation Map
Unexpected object What is it? Surprise Stop
Setting out issues in the way that this section has done provides a perspective on classical research. Beginning with questions about the essence of emotion led it to focus on states which are intense, transient, interrupt cognition, are adaptive, and are not simulated. These were supposed to be the elements from which compound states were formed, synchronically rather than diachronically.
Beginning instead with an empirical overview of issues related to emotion suggests that the classical approach is at the very least severely limiting. Extending the metaphor from chemistry, the rest of emotional life no more follows from what we know about those archetypal emotions than the properties of water follow from studying oxygen and hydrogen.
That is the kind of background against which theorists like Averill and Harre argue that "emotions are transitory social roles - that is, institutionalised ways of interpreting and responding to particular classes of situations" (Averill 1986, p. 100). Looking at emotion words quickly suggests that very few can really be thought of as describing a pure quality of feeling with a biologically programmed response associated. As Averill says, they describe syndromes involving issues such as appraisal of the person's situation (e.g. 'bewildered', 'envious'), prediction of behaviour (e.g. 'quarrelsome', 'ambivalent'), and evaluation of the response (e.g. 'defiant', 'stubborn'). Many of the descriptions presuppose particular structures, a point which is often made by citing emotion terms from other cultures that are difficult to translate into our own, such as the Japanese amae (translated as 'sweet dependency') or the medieval accidie(roughly, a sinful inclination to idleness).
Averill does not deny that emotion involves 'behavioural systems that have survived the course of human evolution (e.g., systems related to attachment, aggression, reproduction, etc.)': the point is that 'these systems are rather loosely organised, genetically speaking, and can be transformed and combined in an almost indefinite variety of ways.' (1986, p.101). Extending the earlier analogy with chemistry, one might say that studying emotion without reference to this propensity to transform and combine, and the consequences for behaviour, is rather like studying the halogens meticulously in atomic form, and scrupulously not mentioning their reactivity. It is also about as easy.
Even theorists who stop well short of Averill and Harre are very far from the naive view that an emotion is simply a distinctive feeling tied to a distinctive physiological state. It is widely agreed that an emotion involves a global appraisal and a global propensity to act accordingly - a stripped down, strongly prioritised representation of the world, the self within it, and their potential, which at least has affinities with biologically primitive modes of cognition.
A good deal of the debate around these issues in psychology is properly classified as metaphysical. It assumes that the word emotion refers to a particular type of entity, and tries to discover the essential nature of that entity. In an IT context, there simply seems to be no point becoming involved with the metaphysics. What matters to an artificial emotion decoding system is what affects the way a person is likely to behave. It also matters to able to use descriptors in a way that is compatible with the way that humans do, both in order to be able to learn from humans about the significance of various signs, and in order to be able to convey to humans what is happening.
2.3 Options arising from the review
Reviewing research on emotion makes it clear that the PHYSTA project has a choice of ways to look at the topic. It should not restrict itself unthinkingly.
2.3.1 A classical approach
One option is to accept the classical framework. On that approach, the key task is matching (emotionally marked) linguistic acts and facial casts with changes in underlying biological control systems. That is clearly a difficult problem, but not impossible in principle.
The most direct approach is to probe physiological parameters that are reliably associated with psychophysiological symptoms of psychological primitives - for instance, the resistivity of skin, pupillary diameter, the electromyographic records of muscles that are possibly involved in the emotional responses (e.g. face, arms/hands, abdomen). A system for checking these parameters is the natural basis for a multidimensional feature space defined initially by plotting each feature (on an arbitrary) scale in one space dimension, and eventually by clustering and identifying, in the n-dimensional feature-space, a particular psychophysical counterpart to emotional primitives such as angry, aggressive, nervous, calm, pleased, joyful.
Since this project deals with voice and visual inputs, the direct approach is not relevant. However, voice and vision can be treated as surrogates for physiological recording equipment, in the sense that the initial stages of analysis set out to find features which reflect the relevant physiology. There is some background work on the derivation of relevant features from physiological theory.
From a more systemic point of view, the idea would be to spread an intermediate layer concerned with neurophysiological primitives between the sensory features and the emotional terms. That has the dual benefit of
1) relating primitives to sensory patterns in a way that is (broadly) reminiscent of the subsymbolic way we expect the human brain could do; and
2) managing the primitives in a symbolic way in order to capture how primitives merge to generate emotional states accounted by the mentioned emotional terms.
The first step could be supported by neurophysiological evidence. Specifically, the psychophysical clusters mentioned (call them phenotypical classes) should map into clusters in an analogous multidimensional space, where axes report sensory features coming from speech and image analysis (genotypic).
The second step falls in the scope of symbolic learning, and could represent an explication of the physiological network deputed to the same task in our brain.
That approach puts a premium on using recordings - audio and video - of 'pure' emotional performance (in terms of some set of primary categories). That in itself is a major problem, because material of that kind is not easily available, and there are ethical problems associated with inducing genuine strong emotions. Actors are sometimes used to portray emotions, but there are doubts about the accuracy of the portrayals, and most particularly at the level of physiological primitives (which is why lie detectors tend to use physiological measures - they are harder to control at will than face and voice).
2.3.2 A social constructivist approach
On the other hand, it is quite defensible theoretically to aim towards a system which makes competent use of emotion terms like the ones in Table 1 broadly along the lines envisaged by social constructionists. It is natural to formalise that approach in terms of symbolic descriptors consisting of schemata with a range of predefined slots for attributes of an emotion, slots whose contents would be elaborated though experience. The learning process could begin with a set of predefined schemata, each with a label, or (more fundamentally) with a process of differentiation analogous in a broad sense to the developmental pattern illustrated in Figure 2. The slots for an emotion schema remain to be developed, but Figure 3 illustrates some obvious possibilities.
Figure 3: Illustrative schema for a social constructivist emotion decoder
Emotion name: exasperated
Derivation history: decended from withdrawal -> distress -> hostility
Activation value:
Evaluation value:
Strength:
Intrusiveness value:
Duration:
Adaptiveness value:
Genuineness of instance:
Signs: visual :
auditory:
physiological (perhaps)
Context: obstacle
perceived as avoidable
not avoided
Timecourse: slow development
slow dissipation
Likely actions: withdraw from activity (high probability)
destructive actions (low probability)
Appropriate responses: 1. remove obstacle
2. offer withdrawal
3. self defence
There are several arguments in favour of emphasising the second option.
a. It has the potential to contribute to psychology. The relevance of nonverbal cues to primary emotions has been explored at length (though by no means exhausted), whereas much less is known about using nonverbal cues to identify everyday emotional categories.
b. It addresses the theme of the emergence of meaning. Developing schemata is a matter of constructing representations which support explanations and predictions, not just labelling.
c. It avoids the task of collecting samples of 'pure emotion', which are hard to obtain without heavy (and probably unethical) manipulation.
d. It has more practical potential - because it deals with types of event that actually occur.
e. It is better in tune with contemporary theoretical work on emotion.
2.3.3 Output issues
Associated with these options is a range of outputs that the system might try to generate, from minimal to supersophisticated.
The minimal is, of course, a classification in terms of primary emotions - whichever set is chosen.
Slightly beyond that is making a classification in terms of descriptive terms like Whissell's.
Dimensional descriptions of emotion terms suggest that classification per se may not be the most appropriate goal. It may make more sense to aim at locating the speaker as accurately as possible within a space defined by dimensions of the kinds described in section 2.2.
Matters become more interesting if we take on board the fact that emotion descriptors are dispositional, and look into predicting what the person will do. Closely linked to that is trying to react appropriately. These take us truly into the semantics of emotion. The idea may seem far-fetched, but there are scenarios where they are perfectly possible.
The simplest tack is to use theorists' descriptions of the behaviour that they would expect in connection with a particular emotion. That gives the prospect of setting up a net of predictions which would be triggered by the labels assigned to speech episodes - something that could just about be described as a sense of the meaning associated with an emotional expression, not just a label. The hybrid context means that that could be modified with experience.
Theorists have offered descriptions of action patterns that are of some use in that context. They are typically couched in abstract terms (e.g. from Oatley and Johnson-Laird - Happiness: Continue with plan, modifying if necessary; cooperate; show affection. Anger: try harder; aggress). That points to a deep problem, which is to represent specific actions, things and situations in a way that allows a system to recognise how they might relate to general emotion-related imperatives.
Possibly complementary to that is to develop a training set based on responses by human subjects in which they comment on descriptions of the behaviour they would expect on the basis of a speech extract, and the reactions they think might be relevant.
These kinds of approach give the prospect of a system that could talk in meaningful, albeit probably abstract terms, about the emotions associated with a speech episode.
A more thoroughgoing approach is to have the speaker interacting with a machine - learning to use a computer package, for instance (perhaps, for obvious reasons, training a speech to text system, or using a system with speech commands), or playing a computer game, or a voice version of ELIZA (a program which uses rather simple language processing to simulate a non-directive therapist, and draws users into interactions that can be remarkably intense). What should an intelligent interactive system do if the user is getting frustrated? Bored? Happy? Depressed? How should it find out? And, of course, can it learn to respond better with experience?
Arguably, that is the level where modelling engages seriously with the nature of meaning. It involves developing a representation of the user's disposition, at least implicitly and ideally explicitly, which feeds through into action - e.g. being able to ask 'are you getting bored with this? shall we wrap up the session?'.
As the introduction suggests, many other forms of output are also interesting.
This section has reviewed contemporary research on emotion - theoretical and empirical - with a view to defining the kind of output that an emotion detection system might reasonably aim to generate.
The key theme is that whatever the essence may be, emotion-related attributions are complex and highly structured. Research offers a basis for sketching the kind of structure that seems necessary to capture that kind of attribution. Developing systems which use that kind is a challenging long term goal, and progress towards it would have practical and theoretical implications.
There is a substantial body of literature on emotion and speech, dating back to the 1930s , ,. A number of key review articles summarise the main findings, notably Frick , Scherer and Murray and Arnott . A second body of literature is also relevant to emotion in the broad sense that was considered in the previous section. It deals with the intonational attributes of affective states and attitudes. Key descriptions are contained in Schubiger , Crystal and O'Connor and Arnold.
The material is very diverse methodologically. The relevant speech variables are sometimes measured instrumentally, but they are often described in impressionistic terms, which may be highly subjective. Broadly speaking, studies of emotion tend to be experimental, whereas studies of affective states and attitudes tend to be based on linguists' intuitions, and illustrated by examples of particular intonational patterns across phrases or sentences which it is assumed obviously convey a certain kind of emotion or feeling. A relatively small number of studies lead directly to possible implementations.
The aim of this section is to draw together as much of this material as possible in way that is reasonably systematic, in order to provide a background against which it is possible to make informed judgements about the features an emotion detection system might use.
3.1.1 States considered
The range of emotions studied seems to relate to the tradition in which researchers are working. By and large the experimentalists have focused on 'primary' emotions and on some other 'secondary' emotions. The primary emotions most studied are anger, happiness/joy, sadness fear and disgust: secondary emotions studied are grief/sorrow, affection/tenderness, sarcasm/irony and surprise/astonishment .
In contrast, linguists have used a much wider range of labels to describe affective states or attitudes. Two studies by Schubiger and O'Connor and Arnold, for example, used nearly 300 labels between them. These cover states such as 'abrupt, accusing, affable, affected, affectionate, aggressive, agreeable, airy, amused, angry, animated, annoyed, antagonistic, apologetic, appealing, appreciative, apprehensive, approving, argumentative, arrogant, authoritative, awed...'
In line with the general review of emotion, this section does not restrict itself to the first tradition. If there is a line between emotion proper and affective states or attitudes, then it needs to be identified empirically - it cannot be imposed a priori. It can only be identified empirically on the basis of a broad view which looks at material on both sides of the line. To anticipate, empirical studies which do examine supposed distinctions tend to suggest that they are not clear cut.
3.1.2 Levels of speech
A wide range of speech variables has been considered in relation to the expression of emotion in a broad sense. They can usefully be divided into four broad levels.
3.1.2.1 The continuous acoustic level
Many experimental studies focus on measuring continuous acoustic variables and their correlation with specific emotions (particularly primary and secondary emotions - see above). It is usually assumed that these measures tap paralinguistic properties of speech, that is, properties which are independent of any linguistic message being conveyed.
It is reasonably clear that a number of continuous acoustic variables are relevant - pitch (fundamental frequency- height and range), duration, intensity and spectral make-up. Studies in this mode sometimes manipulate speech instrumentally in order to separate out single acoustic parameters. Listeners are then tested to see if they can identify what emotion is being expressed on the basis of, for example, F0 alone. Results from these experiments suggest that certainly some emotions are carried paralinguistically.
3.1.2.1 The pitch contour level
A second group of experimental studies explore higher order properties of pitch, which tend to be described under the heading 'pitch contour'. Different contour types are generated and presented to listeners and they are asked to rate emotion type, for example on bipolar scales . It is not clear how contour measures relate to the linguistic/paralinguistic distinction. Contour types sometimes relate to categories in a phonological system, but are sometimes unmotivated by any systematic approach to intonation.
There is speculation in the literature that pitch contour type may be more related to attitudinal states than to the primary emotions. Scherer et al. suggested that continuous variables might reflect states of the speaker related to physiological arousal, while the more linguistic variables such as contour tended to signal speaker attitudes with a greater cognitive or attitudinal component. However in a subsequent study where listeners judged the relation of contour type to 2 scales, a cognitive-attitude scale and an arousal scale, they showed that contour type was related to arousal states. Other studies also indicate the relation of contour type to a wide range of emotional states, from primary emotions to attitudinal states.
3.1.2.3 The tone-based level
A more explicitly linguistic tradition focuses on what we will call the tone-based level of description. It tends to be rooted in the British approach to intonation, which describes patterns in terms of intonational phrases or tone groups. Each tone group contains a prominent or nuclear tone (a rising or falling movement or combination or level tone usually on the last stressed syllable of the group). The part that leads up to this nuclear tone is called the head or pretonic, and the part which follows it is the tail.
Studies in this tradition often associate different types of tones and heads with different emotions or attitudes. A relation between tone shape (e.g. rising or falling) and emotion is claimed, as is a relation between the phonetic realisation of a tone (e.g. high rise versus low rise) and emotion. Heads also can take different shapes (e.g. level, falling) and it is claimed that the head or pretonic shape (in conjunction with the tone shape and realisation) can express different emotions. The emotions listed cover a wide spectrum. Sometimes particular patterns are listed simply as emotional or non emotional, sometimes very specific labels are given (e.g. surprise, warmth) though these may be qualified with a comment that the specific label depends on the kinesic accompaniment or on a particular linguistic or grammatical context, e.g. statement, wh-question, yes/no question, command, interjection .
3.1.2.4 The voice quality level
The fourth level of speech related to the expression of emotion is voice quality. This level is discussed by researchers working within both the experimentalist and the linguistic tradition. Many describe voice quality auditorily. Terms often used are tense, harsh and breathy. However there is also research which suggests how auditory qualities may map on to spectral patterns ). Voice quality seems to be described most regularly with reference to the primary emotions.
Clearly there are relationships among the levels described above. For example, continuous spectral variables (see 3.1.2.1 above) relate to voice quality (see 3.1.2.4), and the pitch contours described in the experimental studies (3.1.2.2 above) must relate in some way to the tune patterns arising from different heads and tones (see 3.1.2.3). However links are rarely made in the literature and the result is a somewhat amorphous body of data on the levels of speech relevant to the expression of emotion.
3.1.3 Recurring problems
3.1.3.1 Conceptions of emotion
The first recurring problem is the lack of a coherent and psychologically up-to-date approach to the study of emotion. As the state of the art summary above suggests, there is very little cohesion between the approaches favoured by the linguists and the experimentalists. This is reflected in the very terms they use - in general the experimentalists tend to talk about studying the 'primary emotions', the linguists often talk about attitudinal or affective states (see above). This problem is widely recognised. For example, Ladd et al. , referring to their own study, say - "Perhaps the most important weakness of this study, and indeed of the whole general area of research, is the absence of a widely accepted taxonomy of emotion and attitude. Not only does this make it difficult to state hypotheses and predictions clearly, but (on a more practical level) it makes it difficult to select appropriate labels in designing rating forms" (p.442). Couper-Kuhlen makes a similar point. Contradictory results for some emotions may also relate to the absence of a coherent approach.
Within the experimental tradition, however, there does seem to be a degree of coherence. It is provided by at least broad adherence to the classical approach to emotion - researchers take it for granted that the material to be studied should express the primary emotions, and that the measures to be taken should relate as directly as possible to the physiological events associated with them. However, the problem with this approach is that research directly concerned with emotion has moved away from the classical model (see section 2). In fact modern approaches to emotion would encourage researchers to set up a wide and flexible framework that would pull together and integrate the primary emotions that the experimentalists focus on and the attitudinal states that dominate the linguistic studies.
3.1.3.2 Levels of description
A second problem is that the relationship between paralinguistic and linguistic levels is not at all clear. As the state of the art summary above shows, some researchers work with paralinguistic measures (the continuous acoustic variables described in 3.1.2 above), while others link the expression of emotion to linguistic categories (questions, statements etc - see O'Connor and Arnold): some work within the framework of intonational phonology, others assume it is irrelevant.
Ladd has recently been highlighted the issue of relationships between the paralinguistic and the linguistic. He cites the case of a study of emotion by Scherer et al (1984). There were two experiments in Scherer's study. In the first judges agreed on the emotions of question utterances with the words removed i.e. based on the pitch contour alone. This experiment is used to demonstrate that some of the emotional content of an utterance is indeed non phonological. But in a second experiment Scherer et al. presented the utterances without removing the words and then analysed the results. They found that, even on a crude phonological categorisation, judgements were affected by linguistic categories. For example, yes-no questions with a final fall were rated strongly challenging; rising yes-no questions and falling wh questions were rated high on a scale of agreeableness and politeness, while falling yes-no questions and rising 'wh' questions were rated low on the same scale).
There is a tendency in the literature on emotion and intonation to ignore the relationship of the linguistic and the paralinguistic. Acoustic studies often fail to consider the relationships with linguistic categories, and vice versa. The evidence from studies such as that by Scherer et al. suggests that the relationship between the linguistic and the paralinguistic in the expression of emotion should be addressed. As in the case of establishing a framework for emotion (see above), a broader and more flexible approach may be relevant.
3.1.3.3 Ecological validity
It is a feature of the literature that it tends to deal with highly artificial material. Most studies use actors' voices rather than examples of naturally occurring emotions. Data sets are often limited to short passages expressing emotion (often sentence length).
In particular, studies where listeners make judgements have tended to select or to modify material in a variety of ways designed to avoid presenting verbal cues. Key examples are as follows.
(i) meaningless content - speakers express emotions while reading semantically neutral material and listeners are asked to identify the intended emotion.
Example: Davitz and Davitz had subjects read sections of the alphabet while expressing a series of emotions and it was found that listening subjects could identify intended emotion in majority of cases
(ii) constant content - comparison of the same sentence given by speakers expressing different emotions
Example: Fairbanks recorded amateur actors reading a semantically neutral phrase in a variety of emotions and asked listening subjects to identify the emotions. These studies found that most emotions were identified correctly most of the time.
(iii) obscuring content - either by measuring only specific non verbal properties, or by electronically filtering the speech itself to obliterate word recognition
Example: Starkweather recorded speech from vocal role-playing sessions, and analysed listening subjects' perception of the aggressiveness/pleasantness of the speech when presented in 3 different forms - as a normal recording, as a written transcript only, as a filtered content-free recording. It was found that emotion was better perceived from the filtered content-free speech than from the transcript.
A key reason for these procedures is desire to exercise control over the linguistic content, i.e. hold it steady so that one can separate it off from the paralinguistic. The problem here is that this presupposes a model of intonation and emotion where emotion is carried on the paralinguistic dimension. We have already suggested above that emotion may cut across both paralinguistic and linguistic dimensions. Starting out therefore with preconceived ideas about the dimensions that are relevant risks circularity - if you have a preconceived model, this can lead to data collection that cannot possibly demonstrate the model's limitations.
Another reason for the use of artificial data may be the difficulty of getting real emotional data. That is certainly a problem if we think that the 'real' stuff of emotion is extreme states such as anger, fear, grief, ecstasy etc. However, as we have suggested above, restricting ourselves to those extremes is out of line with modern psychological analyses of emotion, and it is perfectly reasonable to start with emotional states that are much commoner and therefore much easier to observe.
There has been growing recognition that the issue of ecological validity needs to be addressed. An interesting methodology devised by Roach is to use live data from radio and television shows: emotional labels are attached on the basis of listener response. It is a useful adjunct to note that a flexible approach to the scope of emotion and to the levels of speech involved makes it easier to use real data.
3.1.4 Summary
Although the literature on emotion and speech lacks integration, it does offer a substantial body of data. The next section describes this data in more detail and attempts to pull it together.
3.2 Speech parameters and specific emotions
Table 3 is a summary of relationships between emotion and speech parameters from a standard review. It is convenient, but very partial. The Appendix gives a fuller summary covering most of the available material on the speech characteristics of specific emotions. The emotions considered include most of the emotional states which are commonly mentioned, whether or not they are described as primary.
Table 3: emotions and speech parameters (from Murray and Arnott, 1993)
Anger Happiness Sadness Fear Disgust
Rate slightly faster or slightly much very
faster slower slower faster much
slower
Pitch very much slightly very very
ave much higher lower much much
higher higher lower
Pitch much much slightly much slightly
range wider wider narrower wider wider
Intensity higher higher lower normal lower
Voice breathy, breathy, resonant irregular grumble
Quality chest blaring voicing chest
tone tone
Pitch abrupt smooth, down- normal wide,
changes on upward ward down-
stressed inflections inflections ward
syllables terminal
inflects
Articulation tense normal slurring precise normal
In the Appendix table, each emotion is described according to the levels of speech set out in the preceding section (continuous acoustic level, pitch contour level, tone based level, voice quality level). A fifth level called 'other' is included to cover other speech attributes which do not fall easily into any of those categories. The data is taken from a range of studies which are referenced in the table. Where there are blanks under one of the speech categories for a particular emotion, this means that no analysis has been carried out at that level for the emotion concerned. Descriptions under the category 'acoustic' generally mean that the emotion has been shown to contrast with neutral speech. Speech attributes given in bold typeface mean that these appear to be reliable indicators of the emotion, that is, they occur across a number of studies and the data is substantial.
The Appendix table highlights four broad points.
The first point is that a good deal is known about the speech correlates of the primary emotions of anger, happiness, sadness and fear. That is signalled by the degree of bold typeface for these emotions. The speech measures which seem to be reliable indicators of the primary emotions are the continuous acoustic measures, particularly pitch-related measures (range, mean, median, variability), intensity and duration.
The second point is that our knowledge is, nevertheless, patchy and inconclusive. There are a number of pointers to this. First, even within the primary emotions, there are contradictory reports. For example, there is disagreement on duration aspects of anger, happiness and fear - some report longer duration, some report faster speech rate, some report slower speech rate. Second, the large gaps under some headings in the table indicate incomplete knowledge. Third, our knowledge at the level of voice quality is noticeably incomplete. Attributes of voice quality are often mentioned, but they are mostly auditorily judged: it is only occasionally that voice quality is tied to acoustic or physiological measures (Cowie and Douglas-Cowie and Banse and Scherer).
The third point is the lack of integration across the paralinguistic (as represented by the continuous acoustic level) and the linguistic (as represented by the tone-based level). Although the table shows that we know about continuous speech attributes for the primary emotions, it highlights a dearth of knowledge about more linguistically based attributes for these emotions. Conversely, although there is information for linguistically based attributes of non primary emotions, there is little on continuous speech measures. This may be because primary emotions are carried on the continuous acoustic dimension and secondary emotions on a linguistic dimension. But in the absence of evidence we do not know. The table shows some interesting exceptions (see, for example, anger, surprise, boredom ) where the expression of emotion is carried on both dimensions. This suggests that the issue may be worth pursuing for other emotions.
The fourth point is that the primary emotions group together in particular ways on the basis of their speech attributes. Happiness, anger and to some extent fear share many speech attributes on the continuous acoustic level, e.g. increased mean F0 and range. Sadness goes in the opposite direction. Tied into this, but not apparent in the table, is the fact that more linguistically based descriptions implicitly seem to group emotions together into larger categories: general terms such as 'strong', 'intense' 'mild' are used to indicate attitudinal meaning.
3.3 Computational studies of emotion in speech
There are relatively few systems which approach the goal of recognising emotion automatically from a speech input. This section reviews key examples.
3.3.1 ASSESS (Cowie & Douglas-Cowie, various papers)
ASSESS is a system which goes part way towards a computational analysis. Automatic analysis routines generate a highly simplified core representation of the speech signal based on a few 'landmarks' - peaks & troughs in the profiles of pitch and intensity, and boundaries of pauses and fricative bursts. These landmarks can be defined in terms of a few measures. Those measures are then summarised in a standard set of statistics. The result is an automatically generated description of central tendency, spread & centiles for frequency, intensity, & spectral properties.
That kind of feature extraction represents a natural first stage for emotion recognition, but in fact ASSESS has not been used in that way. Instead the measures described above have been used to test for differences between speech styles, many of them at least indirectly related to emotion. The results provide some indication of the kinds of discrimination that could be made on the basis of an ASSESS-type representation.
Emotion-related reactions to deafened speech
A precursor to ASSESS was used in analysis of speech produced by deafened adults. One of the features of this type of speech is that hearers tend to attribute peculiarities to emotion-related speaker characteristics. These evaluative reactions were probed in a questionnaire study, and the programs were used to elicit the relevant speech variables. Table 4 summarises the correlations that were found between emotion attributions and speech features:
Table 4 Emotion attributions and features of deafened people’s speech
Response Speech factors
Judged stability relatively slow change in the lower spectrum.
Judged poise narrow variation in F0 accompanied by wide variation in intensity.
Judged warmth predominance of relatively simple tunes, change occurring in the mid-spectrum rather than at extremes; low level of consonant errors
Competence pattern of changes in the intensity contour.
ASSESS applied to emotion
In a later study, reading passages were used to suggest four emotions - fear, anger, sadness, happiness. All were compared to an emotionally neutral passage. The measures which distinguish the emotionally marked passages from the neutral passage are summed up in Table 5 below.
Table 5 Distinctions between emotional and neutral passages found by ASSESS
Afraid Angry Happy Sad
Spectrum
• midpt & slope + –
Pitch movement
• range + +
• timing + +
Intensity
• marking + + +
• duration + + +
Pausing
• total + +
•variability +
Discriminant analysis using ASSESS
A recent application of an ASSESS-type system illustrates the next natural step towards automatic discrimination. Discriminant analysis was used to construct functions which partition speech samples into types associated with different types of expression. The differences in question were involved with discourse function rather than emotion, though.
3.3.2 Banse and Scherer 1996
Scherer's group have a long record of research on vocal signs of emotion. A key recent paper is conceptually closely related to the ASSESS approach. A systematic battery of measurements was extracted from test utterances. The measures fall into four main blocks, reflecting the consensus of research concerned with the continuous acoustic level. They are summarised below.
Fundamental frequency
mean F0
standard deviation of F0
25th and 75th percentiles of F0
Energy
mean of log-transformed microphone voltage
Speech rate
duration of articulation periods
duration of voiced periods
Spectral measures
long term average spectra of voiced and unvoiced parts of utterances
The emotions considered were hot anger, cold anger, panic fear, anxiety, desperation, sadness, elation, happiness, interest, boredom, shame, pride, disgust, and contempt.
Discriminant analysis was then used to construct functions which partition speech samples into types associated with different types of expression. Classification by discriminant functions was generally of the order of 50% correct - which was broadly comparable with the performance of human judges.
It is natural to take the techniques described by Scherer and Banse as a baseline for emotion detection from speech. They show that automatic detection is a real possibility. The question then is where substantial improvements might be made. That theme is taken up in 3.5 below.
3.4 Automatic extraction of phonetic variables
One of the major tasks facing automatic emotion detection is automatic recovery of relevant features. Large parts of the literature described above consider features which can be identified by human observers, but which have no simple correlate in the acoustic signal.
ASSESS reflects one approach. It uses features which can be derived relatively directly from the acoustic signal - though even there the processing involved is far from trivial, and it depends on human intervention in the case of noisy signals. However, modern signal processing techniques mean that a far wider range of features could in principle be explored. This section begins with a case study which illustrates some of the relevant issues, and then considers the main feature types that are of interest in turn.
3.4.1 Voice stress: a case study
Voice level is one of the obvious indicators of emotion. Banse and Scherer measure it in what is the most obvious way, as a direct function of microphone voltage. However, simple relationships between voltage and voice level only exist under very special circumstances. Normally, microphone voltage depends critically on the distance between the speaker and the microphone, on the direction in which the the speaker is turned, and on reflection and absorption of sound in the environment. Humans provide an existence proof that it is possible to compensate for these effects - they can usually tell the difference between a person whispering nearby and a person shouting far off.
A report by Izzo considers the kinds of solution to this problem that modern engineering makes available. The context is speaker stress, which has direct applications, but what is most relevant to this context is the fact that the problem includes distinguishing loudness-related varieties - soft, neutral, clear, and loud. It uses speech from databases (SUSAS and TIMIT) which consist of short utterances labelled in detail. The following indicators are considered.
Pitch is shown to be a statistical indicator of some speech types (e.g. clear and soft).
The duration of speech sounds can be established because of the labelling, and it is indicative of speech type - particularly the duration of semivowels.
Intensity per se is subject to the confounding factors which have been mentioned above, but the distribution of energy is also an indicator of speech type - for instance, energy shifts towards vowels and away from consonants in loud speech.
There are standard techniques for estimating the cross-section of the vocal tract from particular speech sounds. These show that speech level affects the region in which greatest movement occurs during production of a vowel sound.
The spectral distribution of energy varies with speech effort - it is well known that effortful speech tends to contain relatively greater energy in low and mid spectral bands. That kind of relationship is exploited both in the ASSESS family (anomalous energy distributions are highly characteristic of deafened speakers) and by Banse and Scherer. Izzo examines a number of ways in which the approach can be refined. Wavelet transforms provide a more flexible method of energy decomposition than the Fourier-based techniques used in earlier work. Discrimination is increased by distinguishing the spectra associated with different speech sounds. Time variation in the energy distribution is also more revealing than static slices or averages.
The key message of the study is that intervening variables are central to the area. Voice level itself is an intervening variable - it is an indicator of emotion, but extracting it is a substantial task. Because of their potential relationship to the biology of emotion, intervening variables which refer to physiological states - such as vocal tract configuration - are particularly interesting, and there are techniques which allow them to be recovered. The use of information about speech sounds highlights the relevance of what may be called intervening covariables. Voice level may be a paralinguistic feature, but it is not necessarily optimal to ignore linguistic issues (such as phoneme identity) in the process of recovering it. Stress may also be considered as an intervening variable - a feature which distinguishes certain emotional states from others.
A systematic treatment of intervening variables is at the heart of theoretically satisfying work on emotion. PHYSTA's use of neural nets is highly relevant to the issue. On one hand, it is an attraction of neural nets that they have the potential to allow evidence to drive the emergence of suitable intervening structures. On the other, it is a danger that they may generate weighting patterns which work - particularly in a restricted domain - but which can neither be understood nor extended. PHYSTA's commitment to hybrid structure offers the prospect of avoiding that danger.
3.4.2 Relevant feature types
This section sets out to summarise the kinds of intervening variables that it makes sense to consider extracting from the raw input, and the techniques that are currently available. Because of the project's concern with neural nets, it highlights cases where they have been used or could naturally be.
3.4.2.1 Voice level
This was considered in the previous section.
3.4.2.2 Voice pitch
Voice pitch is certainly a key parameter in the detection of emotion. It is usually assumed for engineering purposes that voice pitch can be equated with F0, the fundamental frequency of the voice - which is usually determined by the rate at which the vocal cords open and close. The connection between perceived voice pitch and F0 is well known to be imperfect. In particular, vowel quality has quite large effects on perceived pitch. However, that issue can probably be ignored for present purposes.
Extracting F0 from recordings is a difficult problem, particularly if recording quality is not ideal. There are many approaches to it, and neural net techniques have been applied. It involves several sub-problems:
detecting the presence of voicing
detecting the 'glottal closure instant'
detecting harmonic structure in a brief episode
detecting short term pitch instabilities (jitter and vibrato)
fitting continuous pitch contours to instantaneous data points (for which ASSESS has a simple but reasonably satisfactory technique).
3.4.2.3 Phrase, word, phoneme and feature boundaries
Detecting boundaries is a major issue in speech processing. The fact that it is difficult is the reason why recognition of connected speech lags far behind recognition of discrete words. The issue arises at different levels.
Phrase / pause boundaries The highest level boundary that is likely to be relevant to PHYSTA is between a vocal phrase and a pause. Quite sophisticated techniques are available to locate pauses. ASSESS uses a method based on combining several types of evidence, and it is reasonably successful. However, the process depends on empirically chosen parameters, and it would be much better to have them set by a learning algorithm - or better still, by a context-sensitive process. As noted above, pause length and variability do seem to be emotionally diagnostic.
Word boundaries Speech rate is emotionally diagnostic, and the obvious way to describe it is in words per minute - which depends on recovering word boundaries. That turns out to be an extremely difficult task, and probably the best solution is to look for other measures of speech rate which lend themselves better to automatic extraction. Finding syllable nuclei is a promising option , and neural net techniques have been applied to it.
Phoneme boundaries The report by Izzo (3.4.1 above) indicates that good use can be made of information about phonemes if they can be identified. That directs attention to a large literature on phoneme recognition, in which neural net techniques are prominent .
Feature boundaries Some features, such as fricative bursts, are easier to detect than phonemes as such. ASSESS has routines for detecting them, and they appear to be emotionally diagnostic.
3.4.2.4 Voice quality
A wide range of phonetic variables contribute to the subjective impression of voice quality . The simplest approach to characterising it is based on spectral properties. The report by Izzo considered above reflects that tradition. A second uses inverse filtering aimed at recovering the glottal waveform (another task where neural net techniques can be used to set key parameters). Voice quality measures which have been directly related to emotion include open-to-closed ratio of the vocal cords, jitter, harmonics-to-noise ratio, and spectral energy distribution.
3.4.2.5 Temporal structure
This heading refers to measures at the pitch contour level (see 3.1.2.1) and related structures in the intensity domain. ASSESS incorporates probably the most systematic treatment of this level that there is.
Most basically, ASSESS divides the pitch contour into simple movements - rises, falls, and level stretches. Describing pitch movement in those terms appears to have some advantages in the description of emotion over first-order descriptions (mean, standard deviation, etc).
The intensity contour is treated in a similar way, and again, descriptions based on intensity movements seem to improve emotion-related discriminations.
ASSESS also incorporates simple measures of tune shape. Portions of pitch contour between adjacent pauses are described in terms of overall slope and curvature (by fitting quadratic curves). This has not been shown to confer any real advantage, but it has not been systematically related to in the contexts where claims about pitch contour are usually raised. The intensity contour is treated in a broadly similar way, also with few obvious benefits.
A natural extension to this kind of description is to consider rhythm. It is recognised that rhythm plays a positive role in speech , but techniques for recovering rhythmic attributes of speech are not well developed. ASSESS incorporated procedures designed to detect excessive rhythmic regularity, but they were not effective.
3.4.2.6 Linguistically determined properties
There is a fundamental reason for considering linguistic content in connection with the detection of emotion. On a surface level, it is easy to confound features which signal emotion and emotion-related states with features which are determined by linguistic rules. The best known example involves questions, which give rise to distinctive pitch contours that could easily be taken as evidence of emotionality if linguistic context is ignored. Some work has been done on the rules for drawing the distinction . Other linguistic contexts which give rise to distinctive pitch contours are turn taking, topic introduction and listing.
It is worth noting that these are contexts that are likely to be quite common in foreseeable interactions with speech competent computers: systematically misinterpreting them as evidence of emotionality would be a non-trivial problem. The only obvious way to avoid confounding in these contexts is to incorporate intervening variables which specify the default linguistic structure and allow the observed speech pattern to be compared with it.
This section attempts to identify the natural directions for research by taking Banse and Scherer as a point of reference, identifying where their approach is incomplete or questionable, and considering how it could be taken forward.
Perhaps the most obvious priority is to extend the range of speech material from which classifications can be made. Ideally speech samples should
• be natural rather than read by actors (who presumably tend to maximise the distinctiveness of emotional expression);
• have verbal content varying as it naturally would rather than held constant (so that potential confounding effects have to be confronted);
• include rather than exclude challenging types of linguistically determined intonation;
• be drawn from a genuine range of speakers, in terms of sex, age, and social background;
• use a range of languages.
A second priority is to extend the range of evidence that may be used in classification, and examine what - if anything - they contribute, particularly in less constrained tasks. Relevant types include
• derived speech parameters of the kinds considered in 3.4.2
• linguistic information of the kinds considered in 3.4.2
• non-speech context - facial, behavioural, etc.
Extending beyond speech information is not gratuitous. It addresses a central issue about speech information in real environments, that is, the way it combines with other sources - particularly when an interpretation that might be natural on the basis of speech alone conflicts with other types of information.
The third priority is to extend the range of responses. Increasing the range of emotion terms considered is one aspect of the issue, but considering how that might be done quickly indicates that it entails deeper changes, since it makes no sense to treat hundreds of emotion terms as independent entities. Relationships among them need to be considered systematically, in several senses.
• They may be located in dimensions of the kinds considered in section 2, in which case it becomes a theoretical priority to consider mappings between those dimensions and dimensions of variation in speech;
• They may be related to intermediate features which characterise a range of possible emotions (e.g. 'stressed', 'positive'), in which case it becomes a theoretical priority to consider mappings between those intermediate features and dimensions of variation in speech.
• They may be linked to possible actions which apply under uncertainty, e.g. "ask whether X", "look out for Y", "just get out fast".
Relationships to non-emotional states are also a very real issue. It is important to register that a husky voice may reflect either passion or a sore throat, and that is a problem for models which propose automatic links between voice parameters and emotional attributions.
Cutting across the last two points is the intuition that the process of attributing emotion is intrinsically context-sensitive. People are able to do a lot with a rather narrow paralinguistic channel because evidence from it feeds into knowledge-rich inference processes, and as a result they can make the right attribution for effects that have many potential causes. That picture may be wrong, but it is plausible enough to suggest that simpler models should not be taken for granted.
There has been considerable interest in human emotion recognition based on facial expressions , starting from Darwin's pioneering work and including extensive studies on face perception during the last twenty years . Facial expression recognition has a broad range of applications, such as in multimedia, the arts, medicine, mental health, stress control in normals; it constitutes a subject of research which can improve our understanding of human communication, including aspects such as how the brain is involved in the process . There are many scientific studies on the latter topic, although which brain areas are involved is o
nly now becoming uncovered. At the same time there is an increasing number of artificial systems being created with the ability to recognise faces , as well as reviews of face recognition . Few, however, have concentrated much on emotions and emotional expression. An important task therefore is to move some of the work being done on face recognition into this area.The techniques being used for face recognition are based on PCA and eigenimages , on feature mapping onto a battery of distinct features, on details of the muscles used in making facial gestures and on optical flow fields over faces . The latter approach is important for the expression recognition task since it can most effectively describe the emotional content. There is still, however, much to be done to make it applicable to the emotion understanding problem.
This section of the report explores methods by which a computer can recognise visually communicated facial actions-expressions. Such methods can contribute to effective human-computer interaction and to applications including multi-media facial image queries as well as face recognition from dynamic imagery. However, the most fundamental problem is how to categorise active and spontaneous facial expressions so as to extract information about the underlying emotional states. As a first step we describe here what work has to be done, what success levels it has had up to now, and at the end of the section what we can try to extract from it for our own project. We begin with a description of the neurobiological knowledge about emotional facial expressions.
4.1 Neurobiology of Recognising Emotional Facial Expressions.
4.1.1 Face Cells in Monkeys
Biological research on face recognition was transformed by the discovery of cells in the primate temporal lobe which are responsive to faces. Several techniques contribute to the identification of sites where neurones are activated specifically by faces. One approach has been to measure single cell activity in a suitable cortical area when the awake monkey is observing a face. Another approach is to remove a specific cortical area and determine if the loss causes deficits in face recognition tasks . The former approach implicated cells in the inferior temporal gyrus and the lower bank of the superior temporal sulcus; these are all in area TE of the inferior temporal cortex (IT). The latter showed that these regions may in particular be coding for the angle of regard, and may be coding for facial expression and bearing. A study of 2600 cells from STS in the rhesus monkey also supported this view . Similar studies cannot be performed in humans, and so we have to turn to indirect methods to search for face-sensitive areas in the human cortex.
4.1.2 Human Face- Sensitive Areas
Loss of brain regions due to war injury or car accidents led to the realisation that the right temporal lobe contains a face-sensitive area. Degeneration of this area can lead to prosopagnosia (the inability to recognise faces), a symptom that can lead to difficulties in social relations. More precise localisation of face regions in the human brain has become possible through the use of non-invasive brain imaging. This has indicated that there are sites in the occipital and temporal regions, especially those nearest the midline, which are most activated during face processing. More specifically a small area in the fusiform gyrus, at the base of the human cortex, is specifically responsive to faces. This is at the junction of the temporal and occipital lobes, and is in the comparable area (TE) to that in the monkey.
4.1.3 Recognition of Emotional Expressions
A number of psychophysical studies have shown that loss of the amygdala, a subcortical nucleus very well connected to many brain areas, causes loss of recognition of anger and fear expressions on human face visual images. Normal volunteers have been shown by fMRI to have increased excitation in the amygdala on viewing expressions of human faces showing mild or strong disgust. This activation even occurs when the subject has no conscious recognition of the face.
The importance of the amygdala for creating a recognition code for fearful expressions has been shown not only for vision. A subject (DR) who had lost her amygdala after surgery for epilepsy was shown to be unable to recognise emotional expression in spoken words. As in the case of visual presentations of emotionally expressive faces to subjects without their amygdala mentioned above, subject DR was especially poor at recognising the emotions used in speaking words with neutral content in themselves
.4.1.4 Coding of Faces in the Brain
It is clear from the above that there are at least two sites where faces are recognised in the brain. One is in the occipital- temporal lobe and involves a system of specific face recognition neurons built from earlier processing of lower order features. The other is the amygdala, and responds most effectively to the expressions of fear or disgust on a human face. The whereabouts of sites coding for pleasurable expressions on the human face is unknown at present. However, it is to be expected to involve the dopaminergic reward system in the limbic part of the brain as well as in the prefrontal cortex, where face sensitive cells are also observed in monkeys (and so expected to be present in man). This is thus the third site of a tripartite coding for faces in the brain.
4.2 Different Types of Approach to Facial Expression Recognition
Approaches to the recognition of facial expressions can be divided into two main categories: static and motion dependent. In static approaches, recognition of a facial expression is performed using a single image of a face. Motion dependent approaches extract temporal information by using at least two instances of a face in the same emotional state. When two instances are used (semi-static approaches), they usually represent the face in its neutral condition and the face at the peak (“apex”) of the expression. Fully dynamic approaches use several frames (generally more than two and less than 15) of video sequences containing a facial expression, which normally lasts 0.5 to 4 seconds. A brief description of the main techniques in each of the above categories is presented in the following.
4.2.1 Static Approaches
The recognition of faces from static pictures is an established subject in computer vision, with clear commercial applications - personal identification systems have already been installed with considerable success. During the last thirty years numerous approaches have been developed, ranging from feature-based techniques to volumetric 3D models. However, face identification is still a challenging task for any recognition system, since the snapshot pictures of the person to be identified can vary significantly from the stored example images in the database.
4.2.1.1 Normalisation
A simple classifier, which only compares pixel intensities or the detected edges of the image with the images in the database, will normally fail to recognise the person correctly. The causes are translations, varying distance, or a different orientation of the head to the camera compared to the example images, or changes in illumination and expression of the face. Further factors, which affect recognition performance, are aging and personal variations like wearing make-up, eyeglasses or a beard.
To account for these variations, strategies have been developed which minimise their effect on the recognition procedure. A general strategy is to normalise the image in a pre-processing step to standard co-ordinates (image registration). This can be achieved by detecting the main facial features (eyes, nose and mouth) and renormalizing the face by translating, rotating and expanding/shrinking it around a virtual central (nodal) point. To account for variations caused by facial expressions, an image warping transformation based on radial basis function has been proposed which decomposes the transformations into linear and radial terms.
4.2.1.2 Person identification
To recognise appropriately normalised faces, three general approaches have been proposed.
The view-based approach attempts to identify the face using the two-dimensional information of the image without recovering the three-dimensional structure of the face. The advantages of this approach are its simplicity and speed, since it includes training directly on the image data. Its disadvantage is the larger network size when multiple examples of each face have to be stored. The view-based approach includes feature-based methods, which extract local features on the face and match them to an internal model of the face, and template-based ones, which measure the global degree of correlation between the image and a set of templates which constitute the face database.
The template-based approach is reinforced by psychological research indicating that the human visual system processes faces to some extent holistically, rather on the basis of the individual features. A popular example of the approach is the linear Eigenface method, which encodes the variation among the images in the database using an ordered list of their principle components . This method has recently been revised, replacing the Euclidian norm type of similarity with a non-linear probabilistic matching method, resulting in improved performance and higher stability against deformations corresponding to facial expression changes.
The second class of methods try to extract the underlying three-dimensional geometry of the face before the recognition step. These techniques have the advantage that they can be extremely accurate, but have the disadvantage that they are often slow, fragile, and usually must be trained by hand . The third class includes dynamic approaches. These usually capture most of the face details, since they can integrate their recognition evidence over time, but they demand fast hardware which often has to operate in real-time.
4.2.1.3 Emotion identification
Complementary to methods which recognise faces are systems which try to extract facial expressions or the gender of a person , ignoring the personal identity of the subjects. Most of the constraints mentioned (in 4.2.1.1) above also apply to expression recognition. Most psychological research on facial expression analysis has been conducted on “mug-shot” pictures that capture the subject’s expression at its “apex” or peak . These pictures allow one to detect the presence of static cues (such as wrinkles) as well as the positions and shapes of facial features. However, few facial expression classification techniques based on static images have been successful. The main cause lies in the dynamic nature of facial emotions, which can be best extracted from image sequences.
Perhaps the most famous static approach was the one proposed by Ekman and Friesen . They produced a system for describing “all visually distinguishable facial movements”, called the Facial Action Coding System (FACS). It was based on the definition of “action units” (AUs) of a face that cause facial movements. Each AU could correspond to several muscles that together generate a certain facial action. In their model, 46 AUs were responsible for expression control and 12 for gaze direction and orientation. For example, the happiness expression was considered to be a combination of “pulling lip corners (AU12+13) and/or mouth opening (AU25+27) with upper lip raising (AU10) and bit of furrow deepening (AU11).” The FACS model has been used to synthesise images of facial expressions, but only limited exploration of its use has been performed in analysis problems . As some muscles give rise to more than one action unit, correspondence between action units and muscle units is only approximate.
It is widely recognised that the lack of temporal and detailed spatial (both local and global) information is a significant limitation of the FACS model , since the use of such “frozen” action descriptions is unsatisfactory for a system developed to code movements. Additionally, the heuristic “dictionary” of facial actions originally developed for FACS-based coding of emotions has proven difficult to adapt to machine recognition of facial expressions .
Another well-known approach is due to Ekman and Friesen , who classified facial signals into three types: static (such as skin color), slowly varying (such as permanent wrinkles), and rapidly varying (such as raising of the eyebrows). The rapid facial signals can be further classified as conveying emotional, emblematic, manipulator, illustrator and regulator messages . Emotional messages include such feelings as sadness, happiness, fear, etc. Emblematic messages describe facial signals as specific non-verbal equivalents of common words or phrases (e.g., an eye- wink). Manipulator messages include self-manipulative movements such as lip biting. Illustrators include actions accompanying and highlighting speech such as raising the eyebrows. Regulators are non-verbal mediators such as nods and smiles.
The results of this approach to deriving universal cues for recognition of the six principal emotions are summarised by Ekman and Friesen (1978) . The cues describe the peak of each expression, and provide a human interpretation of the static appearance of the face. For example, a description such as “brows are raised” means that the viewer’s interpretation of the location of the brows in relation to other facial features indicates that they are not in a neutral state but higher than usual. The viewer uses many cues to deduce such information from the image, such as the appearance of wrinkles in certain parts of the face or the effect of the hypothesis of a high brow on the shape of the eyes (i.e., state of eyelids). Unfortunately, if only static images are considered, humans are currently much better than computers at deriving at such descriptions. However, many of the cues can be computed from motion sequences, brow’s raising being such a case. That makes features computable by motion analysis a natural basis for expression recognition systems.
A recent study describes an ensemble of feed-forward neural networks for the categorical perception of static facial emotions. Their technique is similar to Eigenface approaches, using seven 32x32-pixel blocks from regions of interest in the face (both eyes and mouth) and projecting them onto the principle component space generated from randomly located blocks in the image data set. Each network was trained independently using on-line backpropagation and different input data sets. During recall only the network producing the highest output score was considered. This simple strategy has been shown to generate an expected generalisation rate of 86% on novel individual while humans scored 92% on the same database.
4.2.2 Motion Dependent Approaches
The extraction of emotions from facial expressions has been mainly studied using optical flow techniques, because motion vectors extracted from a facial image sequence contain rich information on facial actions of various expressions. These flow patterns can be used by conventional pattern classification techniques as well as neural networks to recognise the corresponding expressions.
4.2.2.1 Semi-static techniques
Bassili suggested that motion in the image of a face would allow emotions to be identified even with minimal information about the spatial arrangement of features. The subjects of his experiments viewed image sequences in which only white dots on a darkened surface of the face displaying the emotion were visible. The results verified that facial expressions were more accurately recognised from dynamic images than from a single static image. Whereas all expressions were recognised at above chance levels, when based on dynamic images, only happiness and sadness were recognised at above chance levels, when based on static images. Bassili identified principal facial motions that provide powerful cues to recognise facial expressions, but his results did not explicitly associate the motion patterns with specific facial features or muscles. For example, a “surprise” emotion was recognised by an upward motion in the upper part of the face and a downward motion in the lower part of it.
Recovering optical flow
Optical flow has been used to track action units by a series of authors beginning with Mase and Pentland. Optical flow algorithms can be divided into three categories based on image gradient, filtering, and correlation respectively.
Gradient algorithms are usually based on the formulation by Horn and Schunck . Such algorithms face difficulties in highly textured images. In the context of facial deformations, one has to assume that the deformations of the skin are locally smooth in order to use the gradient approach . Filtering approaches use an extended number of images to compute the motion field, based on analysis of the spatial and temporal frequencies of the images. Filtering approaches require the use of many frames for motion estimation, which limits their utilisation to measuring motion associated with expressions. In addition, since facial expressions correspond to non-rigid motion, the filter design needed for detecting and estimating motion can not be easily determined. Correlation approaches compare a linearly filtered intensity value of each image pixel with linearly filtered intensity values arriving, delayed in time, from neighbouring image pixels. Correlation approaches are generally computationally intensive, since some form of exhaustive search is carried out to determine the best estimate of motion. None of these approaches have been extensively tested on non-rigid motion.
Mase: movement of facial muscles
Mase approached facial expression recognition from both the top-down and bottom-up directions. In both cases, the focus was on computing the motion of facial muscles rather than the motion of facial features. Four facial expressions were studied: surprise, anger, happiness, and disgust. The top-down approach assumes that the facial image is divided into muscle units that correspond to the AUs suggested by Ekman and Friesen . Optical flow is computed within rectangles that include these muscle units, which in turn can be related to facial expressions. This approach relies heavily on locating rectangles containing the appropriate muscles, which is a difficult image analysis problem, since muscle units correspond to smooth, featureless surfaces of the face. Furthermore, it builds on a model that is suitable for synthesising facial expressions, but remains untested in analysis of facial expressions. Mase’s bottom-up approach tessellated the area of the face with rectangular regions over which optical flow feature vectors were computed. These vectors were computed over a 15-dimensional space that was based on the mean and variance of the optical flow. Optical flow calculation was averaged within each window to smooth the results over edges. Furthermore, optical flow was treated on a per-frame basis, without considering the time-sequence of frames. Recognition of expressions was performed on k-nearest-neighbour voting rule (k=3 was adopted). Experimental studies considered the expressions of a single subject and the results were compared to the performance of subjects that were asked to classify the displayed emotions. The major limitation of the latter work was that no physical model was employed; facial motion was formulated statically rather than within a dynamic optimal estimation framework. However, the results of this work were sufficiently good to show the usefulness of optical flow for observing facial motion and for tracking action units.
Yacoob and Davis: movement of facial edges
Mase’s work is closely related to the work proposed by Yacoob and Davis since both used optical flow computation for recognising and analysing facial expressions. However, Yacoob chose not to use models for muscle actions as well as not to consider the underlying anatomic and musculature models and actions. In doing so he tackled facial expression analysis based on recovering muscle actions . The potential advantage of techniques which focus on facial edges rather than muscle dynamics is that edges can be computed more easily than muscle parts; the latter are difficult to locate based on the observed deformations of the skin. Of course, there are practical difficulties associated with locating edges as well. In summary, however, edges are more stable than surfaces under projection changes, since the farther the face is from the frontal view, the harder it becomes to detect muscles or AUs. Motion is more accurately computed at image discontinuities than over smooth areas such as muscles. Mapping the observed motion at edges into linguistic descriptions is relatively straightforward. In contrast, analysing muscle actions requires the use of anatomic musculature models. Furthermore, the mapping of expressions into muscle actions is still not well developed; most of the available knowledge is only applicable to synthesis of facial expressions.
In order to develop a dictionary of facial feature actions, Yacoob unified the facial descriptions proposed by Ekman and Friesen and the motion patterns of expression proposed by Bassili . As a result, he arrived at a dictionary composed of motion-based feature descriptions of facial actions. The proposed dictionary was divided into (i) components, (ii) basic actions of these components, and (iii) motion cues. Components were defined qualitatively in relation to the rectangles surrounding facial regions; basic actions were determined by the components’ visible deformations; cues were used to recognise basic actions using optical flow within the regions. Since cues were not mutually exclusive (e.g., the raising of a corner of the mouth can be a by-product of raising the upper or lower lip), a ranking of actions was introduced according to interpretation precedence using the relation “part of ”. Lip actions had higher interpretation precedence than mouth corner actions and whole mouth actions had the highest interpretation precedence. The dictionary allowed conversion of local directional motion patterns within a face region into a linguistic, mid-level representation of facial actions. His goal in constructing such a mid-level representation of facial dynamics was to model spatio-temporal facial actions in a way that could allow addressing a broad range of problems related to facial communications. His mid-level representation can be, for example, used to study emotion, emblematic messages, speech and other facial gestures in a unified way. In addition to basic actions, the mid-level representation included region and co-ordinated actions.
Basic actions within a rectangle surrounding a feature were combined to construct a region action. For example, the simultaneous raising of the upper lip and lowering of the lower lip produced a region action corresponding to “mouth opening”. Region actions that occurred simultaneously at symmetric features (i.e., eyes and eyebrows) could be combined into a co-ordinated action. For example, the raising of the right and left eyebrow produced a “raising brows” co-ordinated action. This definition of multi-level actions served two purposes. The first was notational convenience, i.e. simplified modelling of facial expressions. The second was to represent and verify temporal actions at a level higher than basic actions, yet lower than facial expressions, thus reducing the complexity of reasoning about temporal facial actions. The mid-level representation was computed per frame, thus modelling instantaneous actions; it did not represent the time span of actions. The latter was done by the facial expression recognition component of the system. A temporal consistency procedure was applied to mid-level representations, i.e. co-ordinated, region and basic actions, to filter out errors due to noise or illumination changes.
Implementation of flow-based identification
Let us describe the procedure to implement the above in more detail, since the method is going to be considered next in the dynamic framework as well.
The cues underlying basic actions were computed first from the optical flow field; flow magnitudes were thresholded to reduce the effect of small motions due to noise. A direction label was used to quantify the optical flow vectors into eight principal directions. The vectors were then filtered; both spatially and temporally, so as to improve their coherence and continuity, respectively. The spatial procedure examined the neighbourhood of each point and performed a voting among all neighbours to determine the coherence of its direction label. If the majority of the neighbours of a pixel had a direction that was different from the pixel’s own direction, the latter was accordingly changed. The temporal procedure followed the spatial one; it used a fixed temporal window around each pixel, also changing the flow direction of the pixel if it disagreed with the direction of the majority of the pixels in the window. Statistical analysis of the resulting flow directions within each facial window provided indicators about the general motion patterns that the facial features had undergone. The analysis differed from one feature to another, based on an allowable set of motions. The largest set of motions was associated with the mouth, which had the highest number of degrees of freedom.
The set of mid-level representations was used next by a rule-based recognition system, which included some of the expression descriptions proposed in Ekman and Friesman and Bassili. A temporal procedure was then employed for final recognition of facial expressions, also resolving conflicts between hypothesised expressions. Every facial expression was assumed to consist of three temporal segments: beginning, apex and ending. Since the outward-upward motion of the mouth corners were the principal cues for a smile motion, these were used as features for the temporal classification task. In general, motion computed at the end of a facial action is not the reverse of corresponding motion at its beginning. For example, it takes only a small fraction of a second to begin a surprise expression, while it may take considerably longer to terminate it. Such temporal cues were defined for each facial expression or action. For example, “anger” was characterised by inward lowering motion of the eyebrows and by compaction of the mouth. Such a compaction could be hard to detect from the optical flow due to noise, aperture, or tracking inaccuracies. By measuring the aspect ratio of the window surrounding the mouth during the hypothesised beginning of the expression, compaction in the window size could be detected. If there was evidence that the mouth size was decreased, then the expression was accepted. Similar considerations were used to determine the exact ending of an expression. For example, a “surprise” expression ended when the mouth arrived at a neutral expression, in the sense that the mouth closing motion had stopped and the mouth size was approximately the same as it was before the onset of the expression. However, the overall modelling of expressions using a beginning-apex-ending trilogy suffers limitations due to the fact that expressions can overlap in time. For example, the transition from surprise to happiness may not include a formal ending of the surprise expression.
A model-based method of recovering facial movement
Li, Roivainen and Forchheimer described an approach using the FACS model and analysing facial images for re-synthesis purposes. They did not attempt to classify or reason about facial expressions and actions, although some aspects of their work might potentially contribute to that. The approach was model-based; it assumed that a 3-D mesh was placed on the face and that the depths of points on it were recovered. They derived an algorithm for recovering rigid and non-rigid motion of the face based on a set of frames, and used this computed motion to create an approximation of the original frames. Motion recovery employed a facial modelling approach that used six AUs to represent possible facial expressions. Two methods for computing motion between consecutive images were proposed, in cases of small and large motion. In addition, a closed loop feedback architecture was proposed for facial tracking within a large number of frames. This architecture assumed motion continuity and employed a prediction-correction strategy to handle the motion-tracking problem; both linear and adaptive predictors were used. The main limitation of this work was the lack of detail in motion estimation, as only large predefined areas were observed, and only affine motion was computed within each area.
4.2.2.2 Dynamic techniques
Approaches to extracting facial emotions from image sequences fall into three classes.
Optical flow based approaches
The optical flow based approach uses dense motion fields generated directly from the image data; it tries to map these motion vectors to facial emotions using motion templates which have been extracted by summing over a set of test motion fields . The motion vector at a given point can be computed either through local gradients in space and time at that point, or through cross correlation of the pattern in the neighbourhood of the point between successive frames. Recently, a coarse-to-fine strategy using a wavelet motion model has been proposed which produces good results for facial motion estimation, where the displacement vectors between successive frames can become large . This straightforward approach has the advantage of producing dense motion fields, while being well suited for special purpose hardware. A disadvantage of this approach is the inherent noise of the local estimates of motion vectors, resulting in degradation of recognition performance; it also has a high computational load.
A series of papers by Ohya et al applied a Hidden Markov Model (HMM) to feature vectors. Motivated by the ability of HMMs to deal with time sequences and to provide time scale invariance, as well as by their learning capabilities, they assigned the condition of facial muscles to a hidden state of the model for each expression. In the first of these papers , they used the Wavelet transform to extract features from facial images. A sequence of feature vectors was obtained by applying a wavelet filter, constructing different frequency bands of the image, and averaging the power of these bands in the areas corresponding to the eyes and the mouth. Vector quantization was used for symbolisation, enhanced by a category separation procedure. Experiments for recognising the six universal expressions were carried out and a recognition rate of 84.1% was obtained. In the second paper, continuous output probabilities were assigned to the HMM observations and phase information was added to the feature vectors. Recognition rate was slightly improved, in user independent mode. A recognition rate of 93%, including successful identification of multiple expressions in sequential order, was reported in the last paper .
The feature vectors were obtained in two steps. Velocity vectors were estimated between every two successive frames using an optical flow estimation algorithm. Then a two-dimensional Fourier transform was applied to the velocity vector field at the regions around the eyes and the mouth. The coefficients corresponding to low frequencies were selected to form the feature vectors.
Feature tracking approaches
In the second approach, motion estimates are obtained only for a selected set of prominent features in the scene. Analysis is performed in two steps: first each image frame of a video sequence is processed to detect prominent features, such as edges and corner-like patterns, or high-level patterns like eyes and mouth. The features are then matched between frames to determine their motion. The advantage of this approach is efficiency, due to the great reduction of image data prior to motion analysis; on the other hand, it is not certain that all feature points have been extracted, which may affect the latter emotion recognition stage.
Yacoob, extending his previous work, used dense sequences (30 frames per second) in order to capture expressions over several frames. He focused his attention on the near frontal image projection of the face and on motion associated with the edges of the mouth, eyes, and eyebrows. For example, he could analyse motion of the lips (raised, lowered), of the corners of the mouth (raised, lowered), or of the inner side of the eyebrows (raised, lowered). The overall rigid motion of the head was small between consecutive frames and non-rigid motion resulting from face deformations was spatially bounded. Optical flow was computed at points with high gradient in each frame. The flow computation algorithm (based on a correlation approach proposed by Abdel-Mottaleb et al. ) computed sub-pixel flow assuming that motion between two consecutive images was bounded. The database of image sequences included 32 different faces. Several expressions were recorded for each face, each lasting 15-120 frames at a resolution of 120x160 pixels. Each subject was asked to display the emotions in front of a video camera while minimising his head motion. Nevertheless, subjects inevitably moved their heads during a facial expression. As a result, the optical flow at facial regions was sometimes overwhelmed by the overall rigid motion of the head. The facial expression system easily detected such rigid motion (all facial regions moved in one direction, an event unlikely to be found in facial expressions) and marked the corresponding frames as unusable for analysis.
On a sample of 46 image sequences of the 32 subjects displaying a total of 105 emotions, the system achieved a recognition rate of 86% for smile, 94% for surprise, 92% for anger, 86% for fear, 80% for sadness, and 92% for disgust. Blinking detection success rate was 65%. Some confusion of expressions occurred between the following pairs: fear and surprise, anger and disgust, sadness and surprise, due to inaccurate shape and motion information detection.
It is evident that the richness of facial expressions requires the development of more sophisticated representations and capabilities, at all levels of such a system. At the lowest level, optical flow tracking has to improve so as to process live video that involves rigid, articulated and non-rigid motion of human faces. At the mid-level, the spatial and temporal representation of actions should be enhanced to capture more of the facial behaviour; incorporation of shape to support motion analysis may be useful in refining the representation. At the highest level, more complex models are needed so as to capture the diversity of facial actions observed on a single subject and across subjects.
Rosenblum and Yacoob used an RBF Network architecture to classify facial expressions based on the same features as in Yacoob’s previous work . They proposed a hierarchical approach, which at the highest level identified emotions, at the mid-level determined motion of facial features, and at the lowest level recovered motion directions. Individual emotion networks were trained to recognise the ‘smile’ and ‘surprise’ emotions. Each emotion network was trained using a set of sequences of different subjects. The trained neural network was then tested, providing an overall recognition rate of 76%.
Thalmane et al have also described related techniques using neural networks for facial expression recognition.
Model alignment approaches
The third method, which is one of the most promising ones, aligns a 3D model of the face and head to the image data in order to estimate both object motion and orientation (pose).
A series of publications by Essa and Pentland proposed a method of solving the problem of tracking facial expressions over time. Using a representation which extends the Facial Action Coding System (FACS) of Ekman et al, they were able to recognise expressions by matching spatio-temporal motion-energy templates of the whole face to the motion pattern, achieving substantially greater accuracy than previous systems .
The method was based on optimal estimation of the optical flow, coupled with a geometric and a physical muscle model describing the facial structure. A parametric representation of facial independent muscle action groups and an estimate of facial motion, were used. The first step was to create a facial model that would give adequate information about facial structure. They focused on the work of Platt and Badler , who had created a mesh based on isoparametric triangular shell elements; Waters had also proposed a technique that developed a muscle model in a dynamic framework. Through synthesis of these techniques, they created a dynamic model of the face, describing the elastic nature of facial skin and the anatomical nature of facial muscles. The transformations creating this model and its control parameters provided the necessary features to describe facial expressions through muscle actuation. They also divided each expression into three distinct phases: application, release and relaxation. This assumption forced the feature extraction method to be also divided into three distinct intervals of time and be normalised by the temporal course of the expression. The feature description technique proposed was based on observation of muscle actuation over time. A standard muscle activation feature vector was defined for each of five universal expressions (smile, surprise, anger, disgust, and raise eyebrow). The procedure producing this vector enabled it to take into account both temporal and spatial characteristics. The division of data into three phases with respect to the actual phases of any expression solved speed limitation problems. Normalisation with respect to time was achieved by warping all expressions into fixed periods of time.
The overall recognition rate of the method was about 98%. The recognition of the expressions smile, surprise, disgust and raise brow was faultless (100%) and only in the expression of anger the recognition rate was lower (90%). Unfortunately the experiments were performed in a small database (52 image sequences) and the performance of the method in a larger database should be investigated.
Video sequences containing facial expressions can be downloaded from the MIT Media Lab Perceptual Computing Group ftp server. The ftp site can be reached through the address: ftp://whitechapel.media.mit.edu/pub/DFacs++/ and contains video sequences of approximately 10 frames per expression. Expressions covered are smile, anger, disgust, and surprised. At its present form, the database consists of expressions made by three persons only.
Static face images capturing subject’s expressions at its peak can be found in many web sites, e.g.
http://www.icv.ac.il
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-8/faceimages/faces
ftp://ftp.orl.co.uk:pub/data/orl
A variety of features, models and methods for emotional expression recognition have been developed in the presented literature. The great majority of them deal with the classification of facial expressions in one of the primary emotion categories. Motion-dependent dynamic approaches, which combine feature extraction and physical models give the most promising results. There is, however, a variety of issues, of increased complexity and importance, that need to be further tackled for the expression and emotional recognition task. These include:
·
Investigation of the effectiveness and efficiency of the techniques used for feature extraction.·
Construction of higher order representations that possibly combine neurobiological and psychological findings.·
Coding and enrichment of already known rules through adaptive structures, e.g. neural networks, which have the ability to adapt to the expression mechanisms of specific users.·
Extension of recognition approaches to deal with the whole gamut of emotional states.·
Interweaving of speech and visual input data for taking advantage of the information included in both signals. A crucial requirement for this task is the construction of appropriate audio/visual data bases. This might require a different approach to the target categories, which, as mentioned in this report, instead of comprising primary and possibly secondary emotions, could correspond to emotional states that are usually met in real life.
The reviews above suggest an agenda for developing artificial emotion detection systems. An ideal program would involve co-ordinated treatment of the following issues.
5.1 Signal analysis for speech
• There is prima facie evidence that a wide range of speech features, mostly paralinguistic, have emotional significance (see 3.2, Appendix).
• Work is needed on techniques for extracting these features (see 3.3, 3.4.2).
• Techniques based on neural nets have been extensively used at this level, and could be used more to set parameters within classical algorithms (see 3.4.2).
• There would probably be gains if the extraction process could exploit relevant linguistic information - phonetic (see 3.4.2.3) or syntactic (see 3.4.2.6).
• There is prima facie evidence that a range of facial gestures have emotional significance (see 4.2.1.3, 4.2.2).
• The static approaches which are best known in psychology do not transfer easily to machine vision in real applications (see 4.2.1.3).
• Dynamic approaches have produced promising results, but their psychological basis is largely unexplored, and they have not been tested on a large scale (see 4.2.2.2).
5.3 Effective representations for emotion
• Describing emotion in an exclusive sense (i.e. cases of 'pure' emotion) is very different from describing emotion in an inclusive sense (i.e. emotionality as a pervasive feature of life); and conceptions suggested by the first task do not transfer easily to the second (see 2.2, 2.3.1).
• A range of techniques are potentially relevant to representing emotion in an inclusive sense, including continuous dimensions and schema-like logical structures (see 2.2, 2.3.2).
• Ideally a representation of emotion should not be purely descriptive: it should also concern itself with predicting and/or prescribing actions (see 2.2, 2.3.3, 3.5).
• Ideally representations of emotion should be capable of modification through experience, as developmental and cross-cultural evidence indicate human representations are (see 2.2).
5.4 Appropriate intervening variables
• Human judgements of emotion may proceed via intervening variables - referring to features of speech, facial gestures, and / or speaker state - rather than proceeding directly from the signal (see 3.4.1, 3.5, 4.2.1).
• The ability to describe these intervening variables in symbolic terms opens the way to explaining and reasoning about emotion-related judgements.
• Allowing suitable intervening variables to emerge through experience is a form of a recurring challenge to computational theories of learning.
5.5 Acquiring emotion-related information from other sources
• Contemporary word recognition techniques probably support the detection of words which have strong emotionally loadings in continuous speech.
• Information from behaviour and physical context are certainly relevant to emotional appraisal, and could be obtained in at least some contexts (see 2.3.2).
• Active acquisition of information about emotionality is clearly a possibility to be considered - e.g asking "are you bored with this task?" (see 2.3.2).
• Numerical methods of integrating evidence can generate good identification rates under some circumstances (see 3.3.1, 3.3.2, 4.2.2.2).
• In other circumstances it seems necessary to invoke logical techniques which examine possible explanations for observed effects, and discount them as evidence for X if explanation Y is known to apply (see 3.5) - i.e. inferences are causal, abductive, and cancellable.
Novelists certainly believe that the process of attributing emotion-related states is complex, e.g.
'Potato soup is waiting, whenever you tzaddiks decide you'd rather eat than philosophize.' Her smile belied the scolding tone in her voice.
5.7 Emotion-oriented world representations
• Cognitive theories highlight the connection between attributing an emotion and assessing how a person perceives the world in emotionally significant terms - as an assembly of obstacles, threats, boring, attractive, etc. (see 2.1, 2.3.3).
• Developing schemes which represent the world in emotion-oriented terms is a significant long term task which may lend itself to subsymbolic techniques.
• The task may be related to the well known that the meanings of everyday terms have an affective dimension.
• Some relevant databases already exist, but they have significant limitations. Hence developing suitable collections of emotional material is a priority (see 3.1.3.3, 3.4.1, 4.3).
• Material should be audio-visual, and natural rather than acted.
• It should represent a wide range of emotional behaviour, not simply 'primary' emotions.
• It should cover a wide range of speakers, and preferably cultures.
• Recordings should be accompanied by reference assessments of their emotional content. Traditional linguistic labelling is less critical.
• A long term goal is to obtain 'live' material by developing scenarios which tend to elicit emotional behaviour, and which allow assessments of speaker emotions to govern actions (see 2.3.3).
• There are problems associated with both capture and low-level processing of speech and video signals because of the amount of data involved. Efficient solutions would transform access to appropriate material.