Media Coordination in SmartKom
Norbert Reithinger
Dagstuhl Seminar “Coordination and Fusion in Multimodal Interaction”
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Stuhlsatzenhausweg 3, Geb. 43.1, 66123 Saarbrücken
Tel.: (0681) 302-5346, Email: [email protected], Homepage: …/~bert
Overview
• Situated Delegation-oriented Dialog Paradigm
• More About the System Software
• Media Coordination Issues
• Media Processing: The Data Flow
• Processing the User‘s State
• Media Fusion
• Media Design
• Conclusion
The SmartKom Consortium
[Map of the consortium: partner sites in Saarbrücken, Aachen, Dresden (MediaInterface), Stuttgart (Univ. of Stuttgart), Munich (Univ. of Munich), Erlangen (Univ. of Erlangen), Heidelberg (European Media Lab), Ulm, and Berkeley; main contractor: DFKI Saarbrücken.]
Project Budget: € 25.5 million
Project Duration: 4 years (September 1999 – September 2003)
Situated Delegation-oriented Dialog Paradigm
[Diagram: the user specifies a goal and delegates the task to the personalized interaction agent Smartakus; user and agent cooperate on problems; the agent asks questions, accesses the IT services (Service 1, Service 2, Service 3), and presents the results.]
More About the System
• Modules realized as independent processes
• Not all modules must be present (critical path: speech or graphic input to speech or graphic output)
• (Mostly) independent of display size
• Pool Communication Architecture (PCA) based on PVM for Linux and NT
• Modules know about their I/O pools (see the sketch below)
• Literature:
  – Andreas Klüter, Alassane Ndiaye, Heinz Kirchmann: Verbmobil From a Software Engineering Point of View: System Design and Software Integration. In Wolfgang Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation. Springer, 2000.
• Data exchanged using M3L documents
• All modules and pools are visualized here ... (local link: C:\Documents and Settings\bert\Desktop\SmartKom-Systeminfo\index.html)
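To make the pool communication concrete, here is a minimal publish/subscribe sketch, assuming a hypothetical in-memory broker and invented pool names; the real PCA runs on PVM across Linux and NT and has its own API.

```python
# Minimal sketch of pool-style communication (hypothetical, not the PCA API):
# modules only know the names of their input/output pools; a broker forwards
# every document written to a pool to all modules subscribed to it.
from collections import defaultdict

class PoolBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # pool name -> list of callbacks

    def subscribe(self, pool, callback):
        self.subscribers[pool].append(callback)

    def publish(self, pool, m3l_document):
        for callback in self.subscribers[pool]:
            callback(m3l_document)

broker = PoolBroker()
# The analyser reads from one pool and writes to another; it never addresses
# other modules directly, so modules stay independent processes.
broker.subscribe("speech.words",
                 lambda doc: broker.publish("intention.hypotheses",
                                            f"<intention>{doc}</intention>"))
broker.subscribe("intention.hypotheses", print)
broker.publish("speech.words", "<lattice>...</lattice>")
```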
Media Coordination Issues
• Input:
  – Speech
    • Words
    • Prosody: boundaries, stress, emotion
  – Mimics: neutral, anger
  – Gesture:
    • Touch-free (Public scenario)
    • Touch-sensitive screen
• Output:
  – Display objects
  – Speech
  – Agent: posture, gesture, lip movement
Media Processing: The Data Flow
[Data-flow diagram: speech (words, prosody/emotion), gesture, and mimics (neutral or anger) flow through media fusion and interaction modeling into the dialog core, which maintains user state, domain information, and system state; presentation (media design) produces display objects with reference IDs and locations, speech output, and the agent‘s posture and behaviour.]
The Input/Output Modules
Processing the User‘s State
• User state: neutral and anger
• Recognized using mimics and prosody
• In case of anger, activate the dynamic help in the Dialog Core Engine (see the sketch below)
• Elmar Nöth will hopefully tell you more about this in his talk Modeling the User State – The Role of Emotions
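A minimal sketch of the anger trigger, assuming invented classifier outputs and thresholds; activate_dynamic_help is a hypothetical stand-in for the Dialog Core Engine hook.

```python
# Hedged sketch: combine the two user-state cues named on the slide.
# The threshold and the score scale are assumptions for illustration.
def fuse_user_state(mimics_label: str, prosody_anger_score: float) -> str:
    """Return 'anger' if either channel signals anger, else 'neutral'."""
    if mimics_label == "anger" or prosody_anger_score > 0.5:
        return "anger"
    return "neutral"

def activate_dynamic_help() -> None:
    # hypothetical hook into the Dialog Core Engine
    print("Dialog Core: activating dynamic help")

if fuse_user_state("neutral", 0.7) == "anger":
    activate_dynamic_help()
```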
Media Fusion
Gesture Processing
• Objects on the screen are tagged with IDs
• Gesture input:
  – Natural gestures recognized by SIVIT
  – Touch-sensitive screen
• Gesture recognition:
  – Location
  – Type of gesture: pointing, tarrying, encircling
• Gesture analysis:
  – Reference objects in the display described as XML domain-model (sub-)objects (M3L schemata)
  – Bounding box
  – Output: gesture lattice with hypotheses (sketched below)
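One plausible shape for the entries of that lattice: gesture type, the display object resolved via its tagged ID and bounding box, a confidence score, and the time stamps needed later for fusion. Field names are illustrative assumptions; the real output is an M3L (XML) document.

```python
# Illustrative gesture-lattice entry; field names are assumptions, not M3L.
from dataclasses import dataclass

@dataclass
class GestureHypothesis:
    gesture_type: str   # "pointing", "tarrying", or "encircling"
    object_id: str      # ID of the tagged display object hit by the gesture
    score: float        # recognizer confidence
    t_start: float      # time stamps are required for fusion with speech
    t_end: float

lattice = [
    GestureHypothesis("pointing",   "movie_poster_3", 0.81, 12.40, 12.95),
    GestureHypothesis("encircling", "seat_map_7",     0.42, 12.35, 13.10),
]
best = max(lattice, key=lambda h: h.score)
print(best.gesture_type, "->", best.object_id)   # pointing -> movie_poster_3
```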
Speech Processing
• The speech recognizer produces a word lattice
• Prosody inserts boundary and stress information
• Speech analysis creates intention hypotheses with markers for deictic expressions (see the example below)
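An illustrative intention hypothesis carrying such a deictic marker; the field names are invented, since the real hypotheses are M3L documents following the domain schemata.

```python
# Invented shape of an intention hypothesis with a deictic placeholder.
intention_hypothesis = {
    "act": "request_information",
    "object": {
        "deictic": True,          # marker set by the speech analysis
        "surface": "this movie",  # the deictic expression in the utterance
        "t_start": 12.30,         # word-level time stamps from the lattice
        "t_end": 12.70,
    },
    "score": 0.77,
}

# Media fusion (next slide) only needs to find the slots carrying the marker:
open_slots = [k for k, v in intention_hypothesis.items()
              if isinstance(v, dict) and v.get("deictic")]
print(open_slots)  # ['object']
```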
Media Fusion
• Integrates the gesture hypotheses into the intention hypotheses of the speech analysis
• Information restriction is possible from both media
• Correspondence between gestures and placeholders (deictic expressions/anaphora) in the intention hypothesis is possible, but not necessary
• Necessary: time coordination of gesture and speech information
• Time stamps in ALL M3L documents!!
• Output: sequence of intention hypotheses (a fusion sketch follows below)
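A minimal sketch of the time coordination, reusing the illustrative structures from the two previous slides: a deictic placeholder is bound to the best-scoring gesture whose time span overlaps it. Only the time-stamp requirement comes from the slide; everything else is an assumption.

```python
# Sketch: bind a deictic placeholder to a temporally overlapping gesture.
def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end

def fuse(intention, gesture_lattice):
    slot = intention["object"]
    if not slot.get("deictic"):
        return intention          # correspondence is possible, not necessary
    for g in sorted(gesture_lattice, key=lambda g: -g["score"]):
        if overlaps(slot["t_start"], slot["t_end"], g["t_start"], g["t_end"]):
            slot["referent"] = g["object_id"]   # gesture restricts the slot
            break
    return intention

intention = {"act": "request_information",
             "object": {"deictic": True, "surface": "this movie",
                        "t_start": 12.30, "t_end": 12.70}}
gestures = [{"object_id": "movie_poster_3", "score": 0.81,
             "t_start": 12.40, "t_end": 12.95}]
print(fuse(intention, gestures)["object"]["referent"])   # movie_poster_3
```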
Media Design (Media Fission)
Media Design
• Starts with action planning
• Definition of an abstract presentation goal
• Presentation planner:
  – Selects presentation, style, media, and the agent‘s general behaviour
  – Activates the natural language generator, which in turn activates the speech synthesis; the synthesis returns audio data and a time-stamped phoneme/viseme sequence
• Character animation realizes the agent‘s behaviour
• Synchronized presentation of audio and visual information (see the pipeline sketch below)
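The control flow can be summarized as below; every function is a stub standing in for a module named on the slide (presentation planner, language generator, synthesis, character animation), not the real interfaces.

```python
# Hedged sketch of the presentation pipeline; all bodies are stubs.
def plan_presentation(goal):
    # selects presentation, style, media, and the agent's general behaviour
    return {"goal": goal, "media": ["speech", "display"], "style": "neutral"}

def generate_and_synthesize(plan):
    text = "Here are the results."            # natural language generator
    audio = b"\x00\x01"                       # opaque audio data
    # time-stamped phoneme/viseme sequence returned by the synthesis
    track = [("h", "open", 0.00), ("i", "spread", 0.08), ("r", "round", 0.20)]
    return audio, track

def animate_character(track):
    # character animation realizes lip movement from the viseme track
    return [(t, viseme) for _, viseme, t in track]

def play_synchronized(audio, behaviour):
    for t, viseme in behaviour:               # audio and visuals share a clock
        print(f"{t:4.2f}s lips -> {viseme}")

audio, track = generate_and_synthesize(plan_presentation("show_results"))
play_synchronized(audio, animate_character(track))
```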
Lip Synchronization with Visemes
• Goal: present a speech prompt as naturally as possible
• Visemes: elementary lip positions
• Correspondence of visemes and phonemes
• Examples: [viseme images not reproduced in this transcript]
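Since the example images are not reproduced, the sketch below shows an illustrative phoneme-to-viseme grouping; it is a common rough mapping, not the actual SmartKom table.

```python
# Illustrative many-to-one phoneme -> viseme mapping (assumed groups).
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_teeth",   "v": "lip_teeth",
    "o": "rounded",     "u": "rounded",
    "a": "wide_open",
    "i": "spread",      "e": "spread",
}

def visemes_for(phonemes):
    # unknown phonemes fall back to a neutral lip position
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

print(visemes_for(["b", "a", "m"]))  # ['lips_closed', 'wide_open', 'lips_closed']
```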
Behavioural Schemata
• Goal: Smartakus is always active to signal the state of the system
• Four main states (sketched as a state machine below):
  – Wait for user‘s input
  – User‘s input
  – Processing
  – System presentation
• Current body movements:
  – 9 vital, 2 processing, 9 presentation (5 pointing, 2 movements, 2 face/mouth)
  – About 60 basic movements
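Read as a simple state machine, the four main states could look as follows; the transition events are assumptions, only the states come from the slide.

```python
# Sketch: the four main agent states as a state machine (events invented).
TRANSITIONS = {
    ("wait_for_input", "input_started"):  "user_input",
    ("user_input",     "input_finished"): "processing",
    ("processing",     "result_ready"):   "presentation",
    ("presentation",   "done"):           "wait_for_input",
}

def step(state, event):
    # Smartakus is always in some state, so unknown events keep the state
    return TRANSITIONS.get((state, event), state)

state = "wait_for_input"
for event in ["input_started", "input_finished", "result_ready", "done"]:
    state = step(state, event)
    print(event, "->", state)
```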
Conclusion
• Three implemented systems (Public, Home, Mobile)
• Media coordination implemented
• The „Backbone“ uses declarative knowledge sources and is rather flexible
• A lot remains to be done:
  – Robustness
  – Complex speech expressions
  – Complex gestures (shape and timing)
  – Implementation of all user states
  – ....
• Reuse of modules in other contexts, e.g. in MIAMM