Media Coordination in SmartKom
Norbert Reithinger
Dagstuhl Seminar “Coordination and Fusion in Multimodal Interaction”
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Stuhlsatzenhausweg 3, Geb. 43.1, 66123 Saarbrücken
Tel.: (0681) 302-5346, Email: [email protected], Homepage: …/~bert
Overview
• Situated Delegation-oriented Dialog Paradigm
• More About the System Software
• Media Coordination Issues
• Media Processing: The Data Flow
• Processing the User‘s State
• Media Fusion
• Media Design
• Conclusion
The SmartKom Consortium
[Map of the consortium: partner sites in Saarbrücken, Aachen, Dresden (MediaInterface), Stuttgart (Univ. of Stuttgart), Munich (Univ. of Munich), Erlangen (Univ. of Erlangen), Heidelberg (European Media Lab), Ulm, and Berkeley; main contractor: DFKI Saarbrücken.]
Project Budget: € 25.5 million
Project Duration: 4 years (September 1999 – September 2003)
Situated Delegation-oriented Dialog Paradigm
[Diagram: the user specifies a goal and delegates the task to the personalized interaction agent Smartakus; user and agent cooperate on problems; the agent asks questions, accesses the IT services (Service 1, Service 2, Service 3), and presents the results.]
More About the System
• Modules realized as independent processes
• Not all modules must be present (critical path: speech or graphic input to speech or graphic output)
• (Mostly) independent of display size
• Pool Communication Architecture (PCA) based on PVM for Linux and NT
• Modules know about their I/O pools (see the sketch below)
• Literature:
  – Andreas Klüter, Alassane Ndiaye, Heinz Kirchmann: Verbmobil From a Software Engineering Point of View: System Design and Software Integration. In Wolfgang Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation. Springer, 2000.
• Data exchanged using M3L documents
• All modules and pools are visualized here ... (local link: C:\Documents and Settings\bert\Desktop\SmartKom-Systeminfo\index.html)
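To make the pool communication concrete, here is a minimal publish/subscribe sketch, assuming a hypothetical in-memory broker and invented pool names; the real PCA runs on PVM across Linux and NT and has its own API.

```python
# Minimal sketch of pool-style communication (hypothetical, not the PCA API):
# modules only know the names of their input/output pools; a broker forwards
# every document written to a pool to all modules subscribed to it.
from collections import defaultdict

class PoolBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # pool name -> list of callbacks

    def subscribe(self, pool, callback):
        self.subscribers[pool].append(callback)

    def publish(self, pool, m3l_document):
        for callback in self.subscribers[pool]:
            callback(m3l_document)

broker = PoolBroker()
# The analyser reads from one pool and writes to another; it never addresses
# other modules directly, so modules stay independent processes.
broker.subscribe("speech.words",
                 lambda doc: broker.publish("intention.hypotheses",
                                            f"<intention>{doc}</intention>"))
broker.subscribe("intention.hypotheses", print)
broker.publish("speech.words", "<lattice>...</lattice>")
```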
Media Coordination Issues
• Input:
  – Speech
    • Words
    • Prosody: boundaries, stress, emotion
  – Mimics: neutral, anger
  – Gesture:
    • Touch-free (Public scenario)
    • Touch-sensitive screen
• Output:
  – Display objects
  – Speech
  – Agent: posture, gesture, lip movement
Media Processing: The Data Flow
[Data-flow diagram: speech (words, prosody/emotion), gesture, and mimics (neutral or anger) flow through media fusion and interaction modeling into the dialog core, which maintains user state, domain information, and system state; presentation (media design) produces display objects with reference IDs and locations, speech output, and the agent‘s posture and behaviour.]
The Input/Output Modules
Processing the User‘s State
• User state: neutral and anger
• Recognized using mimics and prosody
• In case of anger, activate the dynamic help in the Dialog Core Engine (see the sketch below)
• Elmar Nöth will hopefully tell you more about this in his talk Modeling the User State – The Role of Emotions
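A minimal sketch of the anger trigger, assuming invented classifier outputs and thresholds; activate_dynamic_help is a hypothetical stand-in for the Dialog Core Engine hook.

```python
# Hedged sketch: combine the two user-state cues named on the slide.
# The threshold and the score scale are assumptions for illustration.
def fuse_user_state(mimics_label: str, prosody_anger_score: float) -> str:
    """Return 'anger' if either channel signals anger, else 'neutral'."""
    if mimics_label == "anger" or prosody_anger_score > 0.5:
        return "anger"
    return "neutral"

def activate_dynamic_help() -> None:
    # hypothetical hook into the Dialog Core Engine
    print("Dialog Core: activating dynamic help")

if fuse_user_state("neutral", 0.7) == "anger":
    activate_dynamic_help()
```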
Media Fusion
Gesture Processing
• Objects on the screen are tagged with IDs
• Gesture input:
  – Natural gestures recognized by SIVIT
  – Touch-sensitive screen
• Gesture recognition:
  – Location
  – Type of gesture: pointing, tarrying, encircling
• Gesture analysis:
  – Reference objects in the display described as XML domain-model (sub-)objects (M3L schemata)
  – Bounding box
  – Output: gesture lattice with hypotheses (sketched below)
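One plausible shape for the entries of that lattice: gesture type, the display object resolved via its tagged ID and bounding box, a confidence score, and the time stamps needed later for fusion. Field names are illustrative assumptions; the real output is an M3L (XML) document.

```python
# Illustrative gesture-lattice entry; field names are assumptions, not M3L.
from dataclasses import dataclass

@dataclass
class GestureHypothesis:
    gesture_type: str   # "pointing", "tarrying", or "encircling"
    object_id: str      # ID of the tagged display object hit by the gesture
    score: float        # recognizer confidence
    t_start: float      # time stamps are required for fusion with speech
    t_end: float

lattice = [
    GestureHypothesis("pointing",   "movie_poster_3", 0.81, 12.40, 12.95),
    GestureHypothesis("encircling", "seat_map_7",     0.42, 12.35, 13.10),
]
best = max(lattice, key=lambda h: h.score)
print(best.gesture_type, "->", best.object_id)   # pointing -> movie_poster_3
```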
Speech Processing
• The speech recognizer produces a word lattice
• Prosody inserts boundary and stress information
• Speech analysis creates intention hypotheses with markers for deictic expressions (see the example below)
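An illustrative intention hypothesis carrying such a deictic marker; the field names are invented, since the real hypotheses are M3L documents following the domain schemata.

```python
# Invented shape of an intention hypothesis with a deictic placeholder.
intention_hypothesis = {
    "act": "request_information",
    "object": {
        "deictic": True,          # marker set by the speech analysis
        "surface": "this movie",  # the deictic expression in the utterance
        "t_start": 12.30,         # word-level time stamps from the lattice
        "t_end": 12.70,
    },
    "score": 0.77,
}

# Media fusion (next slide) only needs to find the slots carrying the marker:
open_slots = [k for k, v in intention_hypothesis.items()
              if isinstance(v, dict) and v.get("deictic")]
print(open_slots)  # ['object']
```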
Media Fusion
• Integrates the gesture hypotheses into the intention hypotheses of the speech analysis
• Information restriction is possible from both media
• Correspondence between gestures and placeholders (deictic expressions/anaphora) in the intention hypothesis is possible, but not necessary
• Necessary: time coordination of gesture and speech information
• Time stamps in ALL M3L documents!!
• Output: sequence of intention hypotheses (a fusion sketch follows below)
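A minimal sketch of the time coordination, reusing the illustrative structures from the two previous slides: a deictic placeholder is bound to the best-scoring gesture whose time span overlaps it. Only the time-stamp requirement comes from the slide; everything else is an assumption.

```python
# Sketch: bind a deictic placeholder to a temporally overlapping gesture.
def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end

def fuse(intention, gesture_lattice):
    slot = intention["object"]
    if not slot.get("deictic"):
        return intention          # correspondence is possible, not necessary
    for g in sorted(gesture_lattice, key=lambda g: -g["score"]):
        if overlaps(slot["t_start"], slot["t_end"], g["t_start"], g["t_end"]):
            slot["referent"] = g["object_id"]   # gesture restricts the slot
            break
    return intention

intention = {"act": "request_information",
             "object": {"deictic": True, "surface": "this movie",
                        "t_start": 12.30, "t_end": 12.70}}
gestures = [{"object_id": "movie_poster_3", "score": 0.81,
             "t_start": 12.40, "t_end": 12.95}]
print(fuse(intention, gestures)["object"]["referent"])   # movie_poster_3
```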
Media Design (Media Fission)
Media Design
• Starts with action planning
• Definition of an abstract presentation goal
• Presentation planner:
  – Selects presentation, style, media, and the agent‘s general behaviour
  – Activates the natural language generator, which in turn activates the speech synthesis; the synthesis returns audio data and a time-stamped phoneme/viseme sequence
• Character animation realizes the agent‘s behaviour
• Synchronized presentation of audio and visual information (see the pipeline sketch below)
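The control flow can be summarized as below; every function is a stub standing in for a module named on the slide (presentation planner, language generator, synthesis, character animation), not the real interfaces.

```python
# Hedged sketch of the presentation pipeline; all bodies are stubs.
def plan_presentation(goal):
    # selects presentation, style, media, and the agent's general behaviour
    return {"goal": goal, "media": ["speech", "display"], "style": "neutral"}

def generate_and_synthesize(plan):
    text = "Here are the results."            # natural language generator
    audio = b"\x00\x01"                       # opaque audio data
    # time-stamped phoneme/viseme sequence returned by the synthesis
    track = [("h", "open", 0.00), ("i", "spread", 0.08), ("r", "round", 0.20)]
    return audio, track

def animate_character(track):
    # character animation realizes lip movement from the viseme track
    return [(t, viseme) for _, viseme, t in track]

def play_synchronized(audio, behaviour):
    for t, viseme in behaviour:               # audio and visuals share a clock
        print(f"{t:4.2f}s lips -> {viseme}")

audio, track = generate_and_synthesize(plan_presentation("show_results"))
play_synchronized(audio, animate_character(track))
```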
Lip Synchronization with Visemes
• Goal: present a speech prompt as naturally as possible
• Visemes: elementary lip positions
• Correspondence of visemes and phonemes
• Examples: [viseme images not reproduced in this transcript]
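Since the example images are not reproduced, the sketch below shows an illustrative phoneme-to-viseme grouping; it is a common rough mapping, not the actual SmartKom table.

```python
# Illustrative many-to-one phoneme -> viseme mapping (assumed groups).
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_teeth",   "v": "lip_teeth",
    "o": "rounded",     "u": "rounded",
    "a": "wide_open",
    "i": "spread",      "e": "spread",
}

def visemes_for(phonemes):
    # unknown phonemes fall back to a neutral lip position
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

print(visemes_for(["b", "a", "m"]))  # ['lips_closed', 'wide_open', 'lips_closed']
```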
Behavioural Schemata
• Goal: Smartakus is always active to signal the state of the system
• Four main states (sketched as a state machine below):
  – Wait for user‘s input
  – User‘s input
  – Processing
  – System presentation
• Current body movements:
  – 9 vital, 2 processing, 9 presentation (5 pointing, 2 movements, 2 face/mouth)
  – About 60 basic movements
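Read as a simple state machine, the four main states could look as follows; the transition events are assumptions, only the states come from the slide.

```python
# Sketch: the four main agent states as a state machine (events invented).
TRANSITIONS = {
    ("wait_for_input", "input_started"):  "user_input",
    ("user_input",     "input_finished"): "processing",
    ("processing",     "result_ready"):   "presentation",
    ("presentation",   "done"):           "wait_for_input",
}

def step(state, event):
    # Smartakus is always in some state, so unknown events keep the state
    return TRANSITIONS.get((state, event), state)

state = "wait_for_input"
for event in ["input_started", "input_finished", "result_ready", "done"]:
    state = step(state, event)
    print(event, "->", state)
```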
Conclusion
• Three implemented systems (Public, Home, Mobile)
• Media coordination implemented
• The „Backbone“ uses declarative knowledge sources and is rather flexible
• A lot remains to be done:
  – Robustness
  – Complex speech expressions
  – Complex gestures (shape and timing)
  – Implementation of all user states
  – ....
• Reuse of modules in other contexts, e.g. in MIAMM