From Data to Design

Peter Wallis

Natural Language Processing Group, University of Sheffield

This is a preprint of an article submitted for consideration in the Journal Applied Artificial Intelligence 2011 [copyright Taylor & Francis]; Applied Artificial Intelligence is available online at: http://www.tandf.co.uk/journals/journal.asp?issn=0883-9514&linktype=44

The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement no. 231868.

 

From chatbots and ECAs on the web, through synthetic characters in computer games and virtual assistants on the desktop, to computers that answer phones, the primary way to create an operational conversational system is for someone to use introspection over log files to decide what he or she would say, and thus what the machine should have said. These systems are far from perfect in an interesting way: they are rarely simply ineffective; they are usually downright annoying [de Angeli 2005]. Why is that? What is it that we are missing about conversational agents, and is there a better way to move from raw data of a form we can collect to the design of better conversational interfaces?

Computer scientists have of course been interested in computers and language from the start, with some early successes. When the research community has looked at dialog systems in the wild, however, results have been disappointing [Walker et al 2002] [Wallis 2008]. Indeed much of the work in the area is aimed, not at the dialog problem itself, but at ways of having Machine Learning solve the problem for us by using, for example, POMDPs [Young 2007]. The data sparsity issue, however, means that in practice these techniques are run over annotated training and test data. These annotation schemes abstract away from the raw data and arguably represent the theoretical content of the field [Hovy 2010]. This is discussed further below. Computer science, looking to the dialog problem itself, generally views language as a means of conveying information, but from Heidegger and Wittgenstein on, we know there is more to language-in-use than informing. The idea of language as action [Searle 1969] highlights its social aspects. If language is to humans what preening is to monkeys [Dunbar 1996], the very act of talking to someone is making and maintaining social relations. Making systems polite is not simply a matter of saying please and thank you; it involves knowing when something can be said, how to say it, and if necessary how to mitigate the damage. What is more, the effort put into the mitigation of a face-threatening act (FTA) is part of the message [Brown and Levinson 1987]. To a greater or lesser extent, conversational agents simulate human behaviour as a social animal and, rather than viewing dialog as primarily information state update [Traum et al 1999] or as a conduit for information [Reddy 1993], we view conversational systems as interactive artifacts in our space.

Previous work

The EC-funded SERA project was set up to collect real data on human-robot interaction. At the time several high profile projects were integrating established technology as spoken language technology demonstrators. We knew there would be severe limitations and our initial aim was to put one of these demonstrators in someone's home and record what happens. We settled on using the Nabaztag ``internet connected rabbit'' [Violet] as an avatar for a smart home. The intention was to have the rabbit sense people and their activities in the same way as a smart home would, but have the rabbit as a focus for communication. Rather than having a disembodied voice for the "intelligence" as in 2001: A Space Odyssey, or on the flight deck of the Enterprise in Star Trek, users get to talk to something and, more importantly, our results would apply to the way people would interact with a classic mobile autonomous robot that actually used on-board sensors.

Figure: One of the SERA rabbits on its stand, as installed in subjects' halls and kitchens.

The ostensible purpose of the SERA rabbits was to encourage exercise among the over 50s. The setup (see figure) could detect when someone was there using its motion detector (PIR) and was given an `exercise plan' that the subject was expecting to follow. If the subject picked up the house-keys in a period when she was expected to go swimming, the rabbit would say ``Are you going swimming? Have a good time.'' When she returned it would say ``Did you have a good time swimming?'' If the subject responded, it would ask if she had stuck to the amount of exercise she had planned and record the amount actually done in a diary. The figure shows the `house keys' being returned to the hook sensor, triggering the `welcome home' script.

The dialog manager for the SERA system was a classic state-based system with simple pattern-action rules to determine what to say next, what actions to perform, and whether to transition to a new state. We used best practice to develop the scripting and employed a talented individual with no training to ``use introspection over log files to decide what he or she would say, and thus what the machine should have said.'' What is sometimes missed is the effort that goes into scripting even simple dialogs. Viewing a beginning-to-end conversation as a decision tree covering the user's choices, having a two-way choice at the root of the tree rather than no choice doubles the amount of effort required to construct the scripts. Adding options at the leaves of the tree is of course far easier, but even then leaving options open tends to percolate back, requiring other changes that have exponential cost. The grand challenge, in a technical sense, is to find ways to reuse script from existing parts of this decision space. Approaches range from no re-use (the flat structure typified by chat-bots), through state-based systems where a state (attempts to) capture a context-free `chunk' that can be safely used anywhere (the SERA approach), to full-blown dialog planning systems such as TRAINS [Allen et al 1995]. But that is another story.
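The state-based, pattern-action style of dialog manager described above can be illustrated with a minimal sketch. This is not the SERA implementation: the state names, the event tokens (`keys_taken`, `keys_returned`), the regular expressions, and the replies other than the two rabbit utterances quoted earlier are all hypothetical, chosen only to show the shape of such a script and why branching near the root multiplies scripting effort.

```python
import math
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Rule:
    pattern: str                      # regular expression tried against the input
    reply: str                        # what the agent says when the pattern matches
    next_state: Optional[str] = None  # state to move to; None means stay put


# Hypothetical script: state names, event tokens and some replies are invented.
SCRIPT: Dict[str, List[Rule]] = {
    "idle": [Rule(r"keys_taken", "Are you going swimming? Have a good time.", "away")],
    "away": [Rule(r"keys_returned", "Did you have a good time swimming?", "home")],
    "home": [
        Rule(r"\b(yes|yeah)\b", "Did you do as much exercise as you planned?", "diary"),
        Rule(r"\bno\b", "Maybe next time.", "idle"),
    ],
    "diary": [Rule(r".+", "Thank you, I have noted that in the diary.", "idle")],
}


def step(state: str, user_input: str) -> Tuple[str, Optional[str]]:
    """Apply the first matching rule; return (new state, reply or None)."""
    for rule in SCRIPT.get(state, []):
        if re.search(rule.pattern, user_input, re.IGNORECASE):
            return (rule.next_state or state), rule.reply
    return state, None  # nothing matched: stay silent and stay in place


def count_paths(choices_per_turn: List[int]) -> int:
    """Number of beginning-to-end conversations in a tree-shaped script."""
    return math.prod(choices_per_turn)
```

Under these assumptions, `count_paths([2, 2, 2])` is twice `count_paths([1, 2, 2])`: a two-way choice at the root doubles the number of complete conversations the script writer has to cover, whereas states that can be re-entered from anywhere (as in `SCRIPT`) are one way of sharing script between paths.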

Things went wrong as we thought they would (discussed below) but we also faced a major technical challenge with speech recognition in a domestic setting. For iteration 1 we had simple yes/no buttons but for the second and third iterations we used `flash cards' that the rabbit could read in order to communicate. The problems with speech recognition and our attempted solutions are detailed elsewhere [Wallis 2011]. Despite this admittedly major technical difficulty, all subjects talked to their rabbit some of the time -- some much more than others -- and all expressed emotion while interacting with it. We installed this set-up in the homes of 6 subjects for 10 days at a time, over three iterations, and collected just over 300 videoed interactions.

Having collected the data, the challenge then was to translate the raw video data into information about the design of better conversational agents. In general the project partners could not reach consensus on how to look at the data -- although it is telling that we all had an opinion on how to improve the system. There were plenty of interesting things to look at in the video data and interviews and, having identified something interesting, we could design a quantitative experiment to collect evidence. Producing papers from the data is not a problem. What was missing was a means of getting the big picture - a means of deciding what really matters to the user. It is one thing to say that a particular conversational system needs to be more ``human-like,'' but some faults are insignificant, others are noticed but ignored, and yet others drive users to despair [de Angeli 2005]. Unless we can build the perfect human-like system, distinguishing faults by their severity is key to designing interactive artifacts based on human behaviour.

As an example of the challenge of moving from data to design, in iteration 1 our talented script writer had the rabbit say `that's good' whenever the subject had done more exercise than planned. This was greeted in the video recordings with ``eye-rolling'', the significance of which, as the rest of the paper will hopefully convince you, is obvious. Our talented expert then attempted to make the rabbit less patronizing but she couldn't see how to do that. This makes sense given the observation by de Angeli et al [de Angeli et al 2001] that machines have very low social status. In the second iteration the rabbit scripts were changed to remove any notion of the rabbit being judgmental -- all assessment was ascribed to some other person or institution such as the research team or the National Health Service. So the question remains: is it possible to create a persuasive machine? For iteration 3, the project consortium decided that our talented expert should try harder. Like so many things about language in use, being more persuasive looks easy, but turning that into instructions for a plastic rabbit might (or might not) be impossible. A methodology for going from video data to a new design would certainly help clarify the issues involved.

Methodology

Looking at the data there are many interesting things to study, but the interesting things are not necessarily the critical ones when it comes to engineering conversational interfaces. What is more, we humans do conversation effortlessly and it is a real challenge to notice what is actually happening. Naively I might think that I am annoyed because the machine misheard me, but perhaps the annoyance comes from the way it was said rather than what was said [Wallis et al 2001]. Is there a better way to study interaction with conversational artifacts?

Human Computer Interaction is of course a well established field with approaches ranging from the strict reductionism often seen in psychology, to the qualitative methods of the social sciences. When it comes to conversational agents, these approaches all have their uses [Wallis et al 2001] [Wallis 2008] [Payr & Wallis] but one cannot help but feel that the real issues are often lost in the detail. In an excellent book on interaction design, Sharp, Rogers and Preece provide a list of reasons why an interface might elicit negative emotional responses [Sharp et al 2007, p189]:

Many of the points are specific to the graphical nature of GUIs but three points are interesting in that they highlight an underlying assumption in HCI as a field, and that is the notion that the user is in control, and the machine is a tool to be wielded. The job of the interface is thus to make clear the consequences of an action, and Sharp et al want a system that does what the user wants, does what they expect, and does not give vague or obtuse error messages. The HCI perspective assumes a passive system and the job of the interface is to make clear the capabilities and use of that system. A human conversational partner however will have strategies that enable repair in the follow-up interaction. In conversation, vagueness might be the price of speed or enable things to be said that otherwise cannot. The vagueness is not a problem because the process is interactive and a human conversational partner can be expected to clarify as required. What is more, humans are inherently proactive and can help out in a timely manner. If a system could recognise when a user is (about to become) frustrated and annoyed then the system could proactively explain why it is not willing or able to do what the user wants, re-align user expectations, or provide more information. The point about conversational systems is that the relationship is an on-going interaction and the system nearly always gets a second chance. Whereas HCI focuses on making the artifact understandable, conversational agents can help out - and indeed are expected to.

Applied psychology is again an established field with a range of techniques. Applied Cognitive Task Analysis [Militello & Hutton 1998] uses semi-structured interview techniques to elicit knowledge from experts in a range of tasks including the writing of training manuals for fire-fighters and the modelling and simulation of fighter pilots. As part of a large project to create an embodied conversational agent that acted as a virtual assistant, Wizard of Oz experiments were run using the automated booking system scenario [Wallis et al 2001]. The wizard, KT, was then treated as the expert and interviewed to elicit her language skills. The interviews roughly followed the Critical Decision Method questions of O'Hare et al [O'Hare et al 1998] which are listed in the Appendix and discussed later in the paper. The conclusion at the time was that KT needed to know far more about politeness and power relations than she did about time or cars. The problem was that interviewing people about their everyday behaviour (as opposed to their expert behaviour) was difficult because, not only are things like politeness just common-sense to the subject, such things are perceived by the interviewee as just common-sense for the interviewer. The interviewee quickly becomes suspicious about why the interviewer is asking "dumb" questions. It seems a more fundamental way of looking at human action is required.

The classic debate in the pursuit of an objective science of human behaviour is between qualitative and quantitative methods. Those with a psychology background will tend to use quantitative methods and report results with statistical significance. The methods include structured interviews and questionnaires, press bars and eye trackers. Formal methodologies that use statistical evidence rely on having a prior hypothesis [Shavelson 1988] and the formation of hypotheses is left to researcher insight and what are often called "fishing trips" over existing data. A positive result from formal quantitative experiments is certainly convincing, but the costs make such an approach difficult to use outside the lab. Qualitative researchers argue that there is another way that is equally convincing and more suited to field work.

Two approaches to Qualitative Analysis

In linguistics, Conversation Analysis [Sacks 1992] in its early days was driven by the notion of "unmotivated looking" and generally attempted to build theory from the bottom up. The result was surprisingly informative in that it highlighted the amazing detail of human-human interaction in conversation. Like more recent approaches such as Grounded Theory [Urquhart 2010], CA has a strong focus on embracing the subjective nature of scientific theory, but claims - quite rightly in my view - that a poor theory will fall very quickly if continually subjected to the evidence of the data. Grounded Theory and several other qualitative methods talk about "working up" a theory by continually (re)looking at the data in light of the theory so far. The result is a hypothesis that can be tested using quantitative methods, but usually the researcher feels the evidence is overwhelming. They may produce a quantitative analysis as well for reporting purposes of course, but the work has been done.

Another approach to studying human behaviour is interesting in that it does not rely on having no theory, but avoids the subjectivity of the scientist by using the theoretical framework of the subjects. With such "ethnomethods", the intention is to explain behaviour, not from an outsider's view, but from the perspective of the "community of practice". This is particularly relevant when the aim is to model a person participating in a community of language users. The observation is that the model needs to capture the reasoning of the subject. This reasoning can be wrong, but that is the reasoning that matters. A community of bees can be (quantitatively) shown to communicate with each other, but a model of a bee communicating needs to capture the available behaviours, actions and activities of the community of bees. Garfinkel's observation [Garfinkel 1967] (in different words) was that, as a bee, one would have direct access to the significance of communicative acts within that community. If a bee does something that is not recognisable, it was not a communicative act. If a bee fails to recognise a communicative act (that others would generally recognise) then that bee is not a member of the community of practice. That is, communication is defined in terms of a community of practice, and the notion of ``direct access'' to the significance of a communicative act is defined in terms of that. The same of course applies to humans. As a member of the community of practice one has direct access to the significance of an act, but as a scientist one ought to be objective. Studying human interaction as a scientist, I need to be careful about my theories about how things work. As a mostly successful human communicator, I do not need to justify my understanding of the communicative acts of other humans. The first challenge is to keep the two types of theory separate. My scientific theory is outside, hopefully objective, and independent of my ability to hold a conversation. My folk theory of what is going on in a conversation is critical to doing conversation. It is ``inside'' the process, and as long as it enables me to participate in communication based activities, the objectivity of the theory is immaterial.

The HCI community do emphasise the need for designers to understand users, but the dominant view has a strong tendency to be an ``outsider's view.'' Sharp et al for example provide a list of solidly academic cognitive models that are expected to shed light on how people will react to a given design, and that can be used to guide the design process. It is argued here that conversational interfaces ought to be designed with an insider's view of human agency. This perhaps explains why amateur developers are so good at scripting agents - there simply is no secret ingredient [Po]. One might have a micro level scientific understanding of human behaviour based on neurons, one might also have a meta level scientific understanding of human behaviour based on dialectics and control of the means of production, one might even have a ``meso'' level [Payr 2010] scientific understanding of human behaviour based on say production systems [Rosenbloom, Laird & Newell 1993], but what we need to capture is the ``folk'' theory at the meso level.

The proposal that we are pursuing at Sheffield is that, in order to simulate human conversational behaviour, we need to capture a suitable folk understanding of events, and that understanding looks much like the essence of a play or novel. With such a model, we will be able to identify the essential as opposed to the incidental features for language based interactive artifacts.

Capturing other people's folk theories

The idea of folk psychology and its status as theory has been discussed at length by Dennett [Dennett 1987], and Garfinkel has made explicit the challenges of collecting such data and championed techniques for studying one's own culture. In the 1970s it came to light that similar ideas were being developed in Soviet psychology, and Vygotsky, it seems, had worked through the notion that theatre and plays capture something essential about the nature of human action. The gist is that plays are interesting to us because they exercise our understanding of other people. Perhaps the way to look at human (understanding of human) action is in terms of actors, roles, scenes, back-drops, theatrical props, audience and so on.

An examination of the computer interface from a Vygotskian perspective has been done before [Laurel 1993] but his legacy in HCI comes primarily through Leontiev and the idea of mediated action [Wertsch 1997]. Human action is mediated by artifacts that have a highly socialized relevance to us - spoons are used in a particular way and multiplication tables are a conceptual tool that can be used to multiply large numbers. What roles can a computer play in socialized mediated action? The artifacts HCI study are props in scenes performed by actors with roles and Action Theory as it is known is an acknowledged part of the HCI repertoire [Sharp et al 2007]. This perspective however does not acknowledge the distinction between an inside and outside view and critics can, quite rightly, question the objectivity of such an approach. Modelling human conversational behaviour on such explicitly ``folk'' understandings is a different matter. The folk reasoning of novels is fabulously about the inner workings of a human mind and definitively subjective in nature. The fact that novels exist at all however suggests there is something shared.

Narrative descriptions

Figure: Mike and the rabbit talking with Peter.

People's reporting of events can differ enormously of course, so why would we expect consistency in reporting of events unfolding in video data of human machine interactions? The key is that our interest is in gross level behaviour. Conversational machines fail grossly rather than in detail. Rather than arguing the point however, an informal demonstration was set up in which some "folk" were presented with a video in which there is trouble with a rabbit. The figure is a still from a video recorded in Peter's office. The recording was not rehearsed, and indeed was not even planned, but was recorded spontaneously when someone pressed the video record button on the set-up under development. In the spirit of CA, the reader is invited to use their own folk understanding of the data and to this purpose, the recording has been made publicly available [PMR].

Two narrators were asked to describe what happens in the recording. To set the scene and suggest a style of writing, they were given an opening paragraph:

"Peter and Mike have been talking in Peter's office where he has a robot rabbit that talks to you and that you can talk to using picture cards."

They were then asked to, independently, finish the story in around 200 words. These are the resulting stories:

Table 1: Two narrative descriptions of the same event.

Narrator 1: It is time to go home so Peter takes his keys from the rabbit. Mike notices this and says "Isn't it supposed to say hello?" Peter is about to say something when the rabbit says: "Hello, are you going out?" Peter replies that he is (using the card and verbally) and the rabbit tells him to have a good time, bye. Mike picks up a card and shows it to the rabbit, but nothing happens. He thinks this make sense as the rabbit has said goodbye but Peter thinks it should work and shows the rabbit another card. Mike sees that he has been showing the cards to the wrong part of the rabbit and gives it another go. Still nothing happens and Mike tries to wake it up with an exaggerated "HELLO!". Peter stops packing his bag and pays attention. Mike tries getting the rabbits attention by waving his hand at it. Still nothing happens. Mike looks enquiringly at Peter as if to ask "what's happening" He says "that's a new one" and goes back to his packing. Mike takes his leave at this point. Peter finishes his packing, and, as he leaves says to the rabbit "You're looking quite broken."

Narrator 2: Peter is about to do something to wake the rabbit up again and as he is about to speak, it says hello. Peter gestures to Mike that it is now talking as expected. Peter presses the video button to record the interaction. Mike laughs as it talks. It asks Peter if he is going out, to which he responds verbally that he is, showing the rabbit the card meaning yes. Seeing Peter's interaction, Mike tries using the cards to interact with the rabbit himself. It does not respond and Mike suggests that this is because it has said goodbye and finished the conversation. Peter tries to reawaken the rabbit with another card. Mike sees that he had put the card in the wrong place. He tries again with a card, after joking that the face card means "I am drunk". Peter laughs. When the rabbit does not respond, Mike says "hello" loudly up to the camera. Peter says he is not sure why there is no response while Mike tries to get a reaction moving his hand in front of the system. They wait to see if anything happens, Mike looking between the rabbit and Peter. When nothing happens, Peter changes topic and they both start to walk away. Mike leaves. As Peter collects some things together, walking past the rabbit, he looks at it. Before leaving the room he says to the rabbit "you're looking quite broken".

Table 2: The third-party common ground.
1. Peter is about to say something and is interrupted by the rabbit
2. the rabbit asks if he is going out, Peter's verbal and card response
3. the rabbit says bye
4. Mike's attempt to use a card and the non-response of the rabbit
5. Mike's explanation (that the rabbit has already said bye)
6. and Peter showing the rabbit another card
7. Mike sees that he has been showing the card to the wrong part of the rabbit and has another go
8. the rabbit does not respond
9. Mike says "Hello" loudly
10. Peter acknowledges it doesn't look right
11. Mike tries again by waving his hand in front of the rabbit
12. no response from the rabbit
13. Mike looks at Peter
14. They give up
15. Mike leaves
16. Peter leaves saying "You're looking quite broken" to the rabbit

There are many differences, and many things were left out entirely. Neither narrator mentioned the rather interesting equipment in the background nor commented on the colour of clothes the participants were wearing. There is no comment on accents or word usage; no comment on grammatical structure or grounding or forward and backward looking function. Whatever it is that the narrators attend to, it is different to the type of thing looked at by those using the popular annotation schemes. It does however seem to be shared, and so the events in Table 2 can be identified as common to both descriptions.

Note that observations of these events are not only shared by the narrators, they will also be "foregrounded" for the participants. That is, Peter and Mike will, to a large extent, observe the same things happening and, what is more, each will assume that his conversational partner, to a large extent, observes the same things. The hypothesis is that this shared background information is the context against which the conversation's utterances are produced. This is not to say that folk theory is true theory, but if we want to simulate human reasoning and engineer better dialog systems then the simulation needs to use the same reasoning. The scientific challenge is to capture it, and to do that in a way that is convincing.

From Data to Design

Narrative descriptions have been used as part of a methodology before, but why they are useful or relevant is apparently not discussed. What is advocated above is that narrative descriptions are used to elicit an inside view of what is going on in dialog, and that the subjectivity of such descriptions is an asset rather than some necessary evil. The intention is to develop a theory (scientific) of other people's theory (folk) of other people's communicative acts. It turns out that formal descriptions of narratives have been done before, primarily as a means of looking at case studies in the study of business processes.

Abell's model of human action

There is the adage that those who do not study history are bound to repeat it, but what does a proper study of history look like and how does it actually relate to future action? Schools of business studies tend to be split between those that look at case-studies in detail and those that look for statistical co-variation. The scientific validity of causal inference from large N samples tends to go unquestioned but how is one meant to draw conclusions from one or two historical examples and make informed decisions? Abell's theory of comparative narratives [Abell 2003] [Abell 2010] is a means of describing and comparing structures of sequential events in which human agency plays a part. What is more, it can be applied to single samples, and provides an explanation of why things happened as they did, and a mechanism for deciding future action. Abell's approach can be seen as a highly formal way of looking at the take-away message of a case-study. The mechanism is to look at the narrative structure, and our hypothesis is that the same mechanism provides the background to a jointly constructed conversation.

Abell's approach is to represent the world as being in a particular state, and human action moves the world to a new state. The human action is seen as partly intentional - that is, a human will act in a way to bring about preferred states of the world, based on beliefs - and partly normative - a human will do this time, what they did last time the same situation occurred.
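The idea of world states linked by human actions can be made concrete with a small data structure. The sketch below is only an illustration of that idea, not Abell's formalism: the class and field names, the state labels, and the assignment of "intentional" or "normative" to each action are assumptions made here, and the example encodes events 2 and 3 of Table 2 in a hypothetical way.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Action:
    actor: str
    description: str
    before: str                 # state of the world the action starts from
    after: str                  # state of the world the action brings about
    basis: str = "intentional"  # "intentional" or "normative" (illustrative label)


def replay(start: str, narrative: List[Action]) -> str:
    """Follow the actions in order and return the final world state."""
    state = start
    for action in narrative:
        if action.before != state:
            raise ValueError(f"{action.description!r} does not follow from {state!r}")
        state = action.after
    return state


# Hypothetical encoding of events 2 and 3 in Table 2; state labels are invented.
NARRATIVE = [
    Action("rabbit", "asks if Peter is going out", "keys taken", "question asked", "normative"),
    Action("Peter", "answers yes verbally and by card", "question asked", "answer given"),
    Action("rabbit", "says bye", "answer given", "conversation closed", "normative"),
]
```

Here `replay("keys taken", NARRATIVE)` walks the chain and ends in the state `"conversation closed"`, while starting from a state that the first action does not follow from raises an error. Representing a narrative as an explicit chain like this is one way to check that a formalized account is at least sequentially coherent.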

Abell's format for a narrative [Abell 2010] consists of:

The first observation from the two narrative descriptions above is that the narratives produced are internally consistent, but do not necessarily refer to the events in the same way. That is, it is not clear we can reassemble a narrative from the events that occur in both descriptions. It seems the stories produced must be treated as whole units in any comparison rather than being disassembled.

A second observation is that, although the notion of states and transitions is, in a formal sense, complete, it seems sequences of state/transitions are also first order objects and can form singular causes in an actor's reasoning. Mike's attempt to interact with the rabbit is motivated by his observation of the entire preceding interaction. Abell does talk at length about levels of description, but it seems our participants are switching levels as they go. The principle however remains sound and we need a way to present, formally, the notion of multiple level descriptions.

Abell provides a theoretical framework for a model of causality that we can use to account for the action in our video data. These accounts are purely descriptive but the observation is that they can be reused.

Plausible accounts and Engineering

In order to make better conversational agents, the argument goes, we need to simulate the way humans make decisions in conversation. Ideally we would have a good model of how people actually make decisions, but the observation is that, when the aim is to engineer a believable simulation of a person, we can simulate the decision processes of fictional characters instead.

Plays and novels exist because they provide plausible accounts of human behaviour. The principle is that plays and theatre work because the characters in them behave in accordance with our expectations. As a human in a conversation, I have no idea of the underlying mechanism my conversational partner is using, but I do have a quite reliable folk psychological model and it is that model we intend to capture.

The accounts of interest are of course not in the video data, but are produced by the narrator. In effect we would be accessing the narrator's head and the data is there to prompt the annotator to apply his or her knowledge to particular scenarios.

Accounts of the action in the video data as written down by the narrators are of course descriptive in that they are written to `fit' past events. The claim - yet to be verified - is that they are also predictive. If Mike wants to use the system, then it would not be surprising if other people also want to try it. If failure to work causes disappointment in Mike, it is likely to also cause it in others. Having a predictive model of events we are well on the way to having prescriptive rules that can be used to drive conversational behaviour.

What do these accounts look like? They are folk theory and as such will be in line with Dennett's notion of an ``intentional stance.'' In detail they will fit with the idea that people do what they believe is in their interests - a fact too trivial to state for a human, but machines need to be told. Using Dennett's example, seeing two children tugging at a teddy bear, we folk know they both want it. In the video, Peter wants to show Mike how the system works; Mike believes the rabbit has finished talking. In implementation a fleshed out account will look much like plans, goals and cues in a Belief, Desire and Intention (BDI) agent architecture [Bratman et al] [Rao & Georgeff]. The BDI architecture was explicitly developed as a model of human reasoning and attempts to balance reactive and deliberative behaviour. It was not intended as a dialog manager but has been used that way by several researchers including Ardissono and Boella [1998], Wallis et al [2001] and Kopp et al [2005]. The problem, as always, is how to get the required data to populate such models, and the proposal here is to have our annotators provide the details for their own narrative. The aim is to have a methodology and tools to help.
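To suggest what "plans, goals and cues" might look like once an account has been fleshed out, here is a deliberately reduced sketch in the spirit of a BDI agent. It is an assumption-laden illustration, not the architecture of any cited system: it omits intentions as committed, persistent plans, and the belief key `rabbit_finished`, the goal label and Mike's two plans are hypothetical readings of the narratives in Table 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Beliefs = Dict[str, bool]


@dataclass
class Plan:
    goal: str                        # the desire this plan serves
    cue: Callable[[Beliefs], bool]   # context in which the plan applies
    action: str                      # what the agent then does or says


def select_plan(beliefs: Beliefs, goals: List[str], plans: List[Plan]) -> Optional[Plan]:
    """Return the first applicable plan for the highest-priority goal."""
    for goal in goals:  # goals are assumed to be listed in priority order
        for plan in plans:
            if plan.goal == goal and plan.cue(beliefs):
                return plan
    return None


# Hypothetical plan library for Mike, loosely following the narratives.
MIKE_PLANS = [
    Plan("try the system", lambda b: not b.get("rabbit_finished", False), "show a card"),
    Plan("try the system", lambda b: b.get("rabbit_finished", False), "say hello loudly"),
]
```

With the belief that the rabbit has finished talking, `select_plan` picks the plan to say hello loudly; without that belief it picks showing a card. The point of the sketch is that the same goal yields different behaviour depending on what the character believes, which is the kind of detail the annotators would be asked to supply.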

Cognitive Task Analysis, and in particular the Critical Decision Method, have been designed explicitly to elicit folk explanations from experts.

Table 3: A method for producing a plausible explanation of events and, from that, script for a conversational agent.
  1. Write a narrative description of events in the video
  2. From your words, list the relevant events and causal links between them
  3. For each link answer the question "How did $E_p^A$ cause $E_q^B$?"
    1. Do you need to assume that agent $A$ has a goal or intention?
    2. Or is $E_q$ simply the normal response to $E_p$?
      1. If so, what else might be expected to bring on $E_q$?
  4. Given agent $A$ has the goal, $g$, that caused $A$ to $E_p$, what else might $A$ have done to (attempt to) achieve $g$?
    1. Are there particular things about the situation that would make $A$ do $E_p$ rather than say $E_r$?
  5. Did $B$, in doing $E^B$, do the expected?
    1. If not, what was $B$'s motivation for doing $E^B$?
    2. Did $A$ figure out what $B$'s motivation was?
    3. What else might $B$ have been trying to do, given she did $E^B$?
  6. What does $A$ think (incorrectly) that $B$ knows, and vice versa?

 

Table 3 provides a preliminary draft for the set of instructions to be given to our annotators. The general approach has been to have the ``annotator'' produce a narrative description of the video in order to capture the essential, while leaving out the detail. The narrative is then formalized as described by Abell to provide events and the links between them. The next step is to have the annotator flesh out those links with questions like those by O'Hare et al (see the Appendix) to elicit the unstated and obvious. In particular, what are the goals of the characters in the narrative, what information does each character have that impinges on the action, and what choices were made.
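To make the instructions in Table 3 concrete, the sketch below shows how a tool might generate the questions of steps 3 to 5 for a single causal link. It is a minimal illustration only: the `Event` class and the function `link_questions` are hypothetical names, and the wording of the questions is a paraphrase of Table 3 rather than a tested instrument.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    id: str
    agent: str
    text: str


def link_questions(p: Event, q: Event) -> List[str]:
    """Paraphrase steps 3 to 5 of Table 3 for the causal link p -> q."""
    a, b = p.agent.capitalize(), q.agent.capitalize()
    return [
        f"How did {p.id} ({a}: {p.text}) cause {q.id} ({b}: {q.text})?",
        f"Do you need to assume that {a} has a goal or intention?",
        f"Or is {q.id} simply the normal response to {p.id}? "
        f"If so, what else might be expected to bring on {q.id}?",
        f"Given the goal that caused {a} to do {p.id}, "
        f"what else might {a} have done to achieve it?",
        f"Did {b}, in doing {q.id}, do the expected? "
        f"If not, what was {b}'s motivation?",
    ]


# Two events taken from Narrative 1.
e0 = Event("E0", "peter", "takes his keys from the rabbit")
e2 = Event("E2", "rabbit", 'says "Hello, are you going out?"')

for question in link_questions(e0, e2):
    print(question)
```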

Finally, having had the annotator work through their story in detail, he or she can be asked to expand on events in the video by exploring ``what if'' scenarios, producing script that agents might have said. Following the CDM approach, the conditions under which an agent might say one thing rather than another can be explored and documented, providing a future conversational agent with our annotator's folk model of not only what to say but also when to say it. This is of course speculative at this stage and will be the subject of future work.

A (partial) walk through

Note that at this point we are engineering new scripts for our conversational agent and we are not aiming at a ground truth about what happened in the video. Table 4 provides a preliminary analysis of events and relations in Narrative 1 presented in XML mark-up. Note that the choice of event to mark up is determined in part by the choice of words used in the initial description, and in part by the need to put in the causal relations. Given we have inter-annotator agreement, the (scientific) theory of Abell says that what makes it a narrative is the causal relations. One might want to have an alternate theory of why narratives hold together but, from the perspective of a science of language and action, our aim with this work is to provide supporting evidence for Abell's theory.

<?xml version="1.0" ?>
<analysis>
It is time to go home so Peter
<event id="E0" agent="peter">
  takes his keys from the rabbit.
</event>
Mike notices this and
<event id="E1" agent="mike" cause="E0">
  says "Isn't it supposed to say hello?"
</event>
Peter is about to
<event id="E3" agent="peter" cause="E1">
  say something
</event>
when the rabbit
<event id="E2" agent="rabbit" cause="E0">
  says: "Hello, are you going out?"
</event>
Peter
<event id="E4" agent="peter" instrument="card and verbally" cause="E2">
  replies that he is (using the card and verbally)
</event>
and the rabbit
<event id="E5" agent="rabbit" cause="E4" recipient="peter">
  tells him to have a good time, bye.
</event>
Mike
<event id="E6" agent="mike" recipient="rabbit">
  picks up a card and shows it to the rabbit, but nothing happens.
</event>
He
<event id="E8" agent="mike" cause="E6">
  thinks this makes sense as the rabbit has said goodbye
</event>
but Peter thinks it should work and
<event id="E9" agent="peter" cause="E6">
  shows the rabbit another card.
</event>
</analysis>

This figure presents the first half of these events graphically, with the agents on the vertical axis and time on the horizontal.

Events in Narrative 1 as a graph.
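The step from mark-up to graph can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. The fragment embedded below is the first half of the Narrative 1 mark-up; the function names `read_events` and `causal_links` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

MARKUP = """<analysis>
It is time to go home so Peter
<event id="E0" agent="peter">takes his keys from the rabbit.</event>
Mike notices this and
<event id="E1" agent="mike" cause="E0">says "Isn't it supposed to say hello?"</event>
Peter is about to
<event id="E3" agent="peter" cause="E1">say something</event>
when the rabbit
<event id="E2" agent="rabbit" cause="E0">says: "Hello, are you going out?"</event>
</analysis>"""


def read_events(markup):
    """Collect each <event> with its agent, stated cause (or None) and text."""
    root = ET.fromstring(markup)
    events = {}
    for node in root.iter("event"):
        events[node.get("id")] = {
            "agent": node.get("agent"),
            "cause": node.get("cause"),
            "text": " ".join((node.text or "").split()),
        }
    return events


def causal_links(events):
    """Return (cause, effect) pairs, i.e. the edges of the narrative graph."""
    return [(e["cause"], eid) for eid, e in events.items() if e["cause"]]


events = read_events(MARKUP)
print(causal_links(events))
```

The edges come out in narrative order, and an event with no `cause` attribute, such as E0, contributes no incoming edge.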

The engineering challenge is to use the annotator's folk theory to produce script. The next step is to explicitly state the cause of events that do not have a mentioned cause, and to flesh out the causal link for those relations in which $B$ does not necessarily follow from $A$. To explain $E0$ and $E6$ requires, for instance, the introduction of motivations on behalf of the characters. This figure provides these causes as boxed labels at the top:

Events and goals, with expectations added.

This is perhaps a controversial move, but recall first that, as members of the community of practice, we have a strong shared understanding of such things -- it is not perfect, but not bad. If we did not understand the speaker's motivations, the text would not make sense. As members of the community of practice, we may be wrong about another's intentions but, as participants in an interaction, any error can be safely ignored or it will get corrected. Second, and more objectively, we are engineering here and are only interested in a plausible explanation.

The dashed arrow and the open state represent an expectation, expressed in the original narrative as ``nothing happens.'' Formally in CA terms, this expectation is the second pair-part of an adjacency pair that did not occur. Once again, as members of the community of practice, we have a good understanding of the notion of the "normal response" to a first pair-part.

Given a state-based description of the action, how might an annotator's workbench elicit data suitable for our implementation? In broad terms the aim is to get actions (things to actually say), events (things heard that are significant to plan choice) and plans (tactics). Given the state description, a workbench might automatically ask questions of the narrator about phenomena in their diagram. For instance $E3$ does not lead to a new event, so what is the relationship between that and $E4$? Is it part of the same plan being used by Peter? If not, did Peter's initial plan fail? In this case, no, he replanned in the light of a new event.
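One way such a workbench might decide what to ask about is sketched below, under the assumption that the narrative has been reduced to a table of event identifiers and their stated causes. The table reproduces the cause attributes from the Narrative 1 mark-up; the function names `uncaused` and `dead_ends` are hypothetical.

```python
# Each event id mapped to its stated cause, or None if no cause was marked up.
NARRATIVE_1 = {
    "E0": None, "E1": "E0", "E3": "E1", "E2": "E0", "E4": "E2",
    "E5": "E4", "E6": None, "E8": "E6", "E9": "E6",
}


def uncaused(events):
    """Events with no stated cause: the narrator is asked for a motivation."""
    return [e for e, cause in events.items() if cause is None]


def dead_ends(events):
    """Events that cause nothing further: did a plan fail or get replaced?"""
    used = set(events.values())
    return [e for e in events if e not in used]


for e in uncaused(NARRATIVE_1):
    print(f"{e} has no stated cause. What was the agent trying to achieve?")
for e in dead_ends(NARRATIVE_1):
    print(f"{e} does not lead to a new event. Did the plan fail, or did the agent replan?")
```

On this data the first check picks out E0 and E6 and the second picks out E3, E5, E8 and E9. The closing events of any fragment will always appear as dead ends, so it is left to the narrator to say which of them, like E3, mark an abandoned or revised plan.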

Having captured the causal links in the narrative, the annotator can now go on to explore the counterfactuals, and this process can once again be in part automated using the thematic roles and CDM-style questions. For instance event $E9$ is ``[peter] shows the rabbit another card.'' A variant on the question in the Appendix would be ``Is there something else that Peter might have done to achieve `demonstrate the rabbit'?''

The aim is a methodology built around the notion of narratives as causal links between events. The hope is that such a methodology will take us one step closer to being able to engineer conversational agents by simply ``cranking the handle'' without need for insight as insight has, so far, failed to produce the goods.

Conclusion

The SERA project set out to collect real human-robot interactions and study them. The data collected is rich, but we did not reach a consensus on the best way to study it, and from a personal perspective the methodologies we did use were unsatisfying. Having used CA in the past to work through data, I found it interesting to observe experts in other fields work through data. From the engineering perspective, the challenge to be addressed is how to identify the critical issues in human-machine conversation rather than the merely interesting ones. The problem is to notice what is going on in something that is very hard not to take for granted. I hope that the above outlines the differences between quantitative methods, both as used and in theory; qualitative methods that embrace the interpretive nature of the process, and how they address it; and the principle behind ethnomethods and how it might relate to the challenges faced when engineering conversational interactive artifacts.

Bibliography

Abell, P. (2003). "The role of rational choice and narrative action theories in sociological theory: the legacy of Coleman's Foundations" Revue française de sociologie, 44(2):255--273

Abell, P. (2010). A case for cases: Comparative narratives in sociological explanation

Allen, J.F., Schubert, L.K., Ferguson, G., Heeman, P., Hwang, C.H., Kato, T., Light, M., Martin, N.G., Miller, B.W., Poesio, M., and Traum, D.R. (1995). "The TRAINS project: A case study in defining a conversational planning agent" Journal of Experimental and Theoretical AI, 7(7):7--48

Ardissono, L and G. Boella (1998) 'An Agent Architecture for NL dialog modeling', Artificial Intelligence: Methodology, Systems and Applications, Springer (LNAI 2256)

Bratman, M.E., D. J. Israel and M. E. Pollack (1988), 'Plans and resource-bound practical reasoning', Computational Intelligence 4, 349--355

Brown, P. and Levinson, S. C. (1987). Politeness: Some Universals in Language Usage Cambridge University Press

de Angeli, A. (2005). 'Stupid computer! abuse and social identity' In Angeli, A. D., Brahnam, S., and Wallis, P., editors, Abuse: the darker side of Human-Computer Interaction (INTERACT '05) Rome.

de Angeli, A., Johnson, G.I., and Coventry, L. (2001). 'The unfriendly user: exploring social reactions to chatterbots' In Helander, K. and Tham, editors, Proceedings of The International Conference on Affective Human Factors Design, London. Asean Academic Press

Dennett, D.C. (1987) The Intentional Stance The MIT Press

Dunbar, R. (1996) Grooming, Gossip, and the Evolution of Language Harvard University Press, Cambridge, MA.

Garfinkel, H. (1967). Studies in Ethnomethodology, Prentice-Hall

Hovy, E. (2010). Injecting linguistics into nlp by annotation. Invited talk, ACL Workshop 6, NLP and Linguistics: Finding the Common Ground.

Kopp, S., L. Gesellensetter, N. Kramer and I. Wachsmuth, (2005) 'A Conversational Agent as Museum Guide - Design and Evaluation of a Real-World Application', 5th International working conference on Intelligent Virtual Characters, http://iva05.unipi.gr/index.html

Laurel, B. (1993). Computers as Theatre. Addison-Wesley Professional

Militello, L. G. and Hutton, R. J. (1998). 'Applied cognitive task analysis (ACTA): a practitioner's toolkit for understanding cognitive task demands'. Ergonomics, 41(11):1618--1641.

O'Hare, D., Wiggins, M., Williams, A., and Wong, W. (1998). 'Cognitive task analyses for decision centred design and training' Ergonomics, 41(11):1698--1718.

Payr, S. (2010). Personal communication.

Payr, S. and Wallis, P. (2011). 'Socially Situated Affective Systems' In P. Petta, C. Pelachaud, R.C.Handbook of Emotion-Oriented Technologies Springer (to appear).

'Peter and Mike have trouble with a rabbit' http://staffwww.dcs.shef.ac.uk/people/P.Wallis/PMRvideo.mov

Rao A., and M. Georgeff (1995) BDI Agents: from Theory to Practice, AAII, TR-56, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.7970

Reddy, M. J. (1993) 'The conduit metaphor: A case of frame conflict in our language about language' In Ortony, A., editor, Metaphor and Thought, Cambridge University Press

Rosenbloom, P.S. and J. E. Laird and A. Newell (1993) The Soar Papers: Readings on Integrated Intelligence MIT Press

Sacks, H. (1992). Lectures on Conversation (edited by G. Jefferson) Blackwell, Oxford.

Searle, J. R. (1969) Speech Acts, an essay in the philosophy of language, CUP

Sharp, H., Rogers, Y., and Preece, J. (2007) Interaction Design: beyond human-computer interaction (2ed) John Wiley and Sons, Chichester, UK.

Shavelson, R. J. (1988) Statistical Reasoning for the Behavioral Sciences, Allyn and Bacon, Inc., 2nd edition.

Traum, D., Bos, J., Cooper, R., Larson, S., Lewin, I., Matheson, C., and Poesio, M. (1999) 'A model of dialogue moves and information state revision' Technical Report D2.1, Human Communication Research Centre, Edinburgh University

Urquhart, C., (2010). 'Putting the theory back into grounded theory: Guidelines for grounded theory studies in information systems' Information Systems Journal, 20(4):357--381.

Violet. The nabaztag. http://www.nabaztag.com/en/index.html

Walker, M., Alex Rudnicky, John Aberdeen, Elizabeth Bratt, John Garofolo, Helen Hastie, Audrey Le, Bryan Pellom, Alex Potamianos, Rebecca Passonneau, Rashmi Prasad, Salim Roukos, Gregory Sanders, Stephanie Seneff and David Stallard (2002). 'DARPA communicator evaluation: Progress from 2000 to 2001' Proceedings of ICSLP 2002, Denver, USA.

Wallis, P. (2001) 'Dialogue modelling for a conversational agent' In Stumptner, M., Corbett, D., and Brooks, M., editors, AI2001: Advances in Artificial Intelligence Springer (LNAI 2256)

Wallis, P. (2008) 'Revisiting the DARPA communicator data using Conversation Analysis', Interaction Studies, 9(3).

Wallis, P. (2011) 'Speech Interfaces for Robots: the state-of-the-art', Computer Speech & Language, (under review)

Wertsch, J. V. (1997) Mind as Action Oxford University Press

Young, S. J. (2007) 'Spoken dialogue management using partially observable markov decision processes', EPSRC Reference: EP/F013930/1.

 

Appendix: O'Hare et al - the revised CDM probes.

Goal specification: What were your specific goals at the various decision points?
Cue identification: What features were you looking at when you formulated your decision?
Expectancy: Were you expecting to make this type of decision during the course of the event?
  Describe how this affected your decision-making process.
Conceptual model: Are there any situations in which your decision would have turned out differently?
  Describe the nature of these situations and the characteristics that would have changed the outcome of your decision.
Influence of uncertainty: At any stage, were you uncertain about either the reliability or the relevance of the information that you had available?
  At any stage, were you uncertain about the appropriateness of the decision?
Information integration: What was the most important piece of information that you used to formulate the decision?
Situation awareness: What information did you have available to you at the time of the decision?
  What information did you have available to you when formulating the decision?
Situation assessment: Did you use all the information available to you when formulating the decision?
  Was there any additional information that you might have used to assist in the formulation of the decision?
Options: Were there any other alternatives available to you other than the decision that you made?
  Why were these alternatives considered inappropriate?
Decision blocking - stress: Was there any stage during the decision-making process in which you found it difficult to process and integrate the information available?
  Describe precisely the nature of this situation.
Basis of choice: Do you think that you could develop a rule, based on your experience, which could assist another person to make the same decision successfully?
  Why/Why not?
Analogy/generalization: Were you at any time reminded of previous experiences in which a similar decision was made?
  Were you at any time reminded of previous experiences in which a different decision was made?