[Seminar] 200904 Seunghyeong Choe

Enhancing Mobile Voice Assistants
with WorldGaze
Sven Mayer, Gierad Laput, Chris Harrison
CHI 2020 Paper

Contents
• Introduction
• Related Work
• System Design
• Exploratory Study
• Implementation
• Evaluation
• Example Uses
• Limitation

Introduction
• Each major smartphone has its own voice assistant
• Apple Siri
• Samsung Bixby
• Google Assistant
• They help us with our daily tasks
• Calculating
• Setting an alarm
• Giving us weather information
• Giving us the opening hours of a specific restaurant
[Figure: Apple Siri and Google Assistant]

Introduction
• Weakness of voice assistants
• Lack of contextual information about the surroundings
• Siri shows a list of possible Starbucks locations even while the user is standing right in front of a Starbucks
• What we need…
• Contextual information about the user’s surroundings
• Richer interaction between the user and the voice assistant

Introduction
What the user says: “When does this place close?”
What the voice assistant receives: “When does Oakland Fashion Optical close?”

Related Work
• Multimodal Interaction
➢ Combining pen and finger input on touchscreens
• Cami et al. "Unimanual Pen+Touch Input Using Variations of Precision Grip Postures." UIST. 2018.
• Hinckley et al. "Pen + touch = new tools." UIST. 2010.
➢ Combining touch and gaze for enhanced selection
• Pfeuffer et al. "Gaze-touch: combining gaze with multi-touch for interaction on the same surface." UIST. 2014.
• Gaze Pointing
➢ Mid-air pointing
• Mayer et al. "Modeling distant pointing for compensating systematic displacements." CHI. 2015.
• Mayer et al. "The effect of offset correction and cursor on mid-air pointing in real and virtual environments." CHI. 2018.
➢ Eye tracking
• Zhai et al. "Manual and gaze input cascaded (MAGIC) pointing." CHI. 1999.
• Geospatial Mobile Interactions
➢ GPS and Wi-Fi localization
• Object Context + Voice Interactions
➢ Gaze and voice combined systems
• Glenn III et al. "Eye-voice-controlled interface." HFES. 1986.
• Koons et al. "Integrating simultaneous input from speech, gaze, and hand gestures." Intelligent multimedia interfaces. 1993.

System Design
• Rear camera
➢ Knowledge about the world
➢ Retrieves more information about the user’s surroundings
➢ Understands the surroundings around the user better
• Front camera
➢ Knowledge about the user’s gaze
➢ What the user is looking at, located within the viewport of the rear camera
• WorldGaze
➢ Extracts where the user is actually looking
➢ Retrieves the place or object (business logos or signage)
➢ Feeds it into a voice assistant as a context-rich inquiry
➢ Extra contextual information is added to the inquiry automatically
WorldGaze makes the overall experience feel more natural to the user
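
The step from "where the user is looking" to "which place or object" can be sketched as a lookup of the gaze point among the objects recognized in the rear-camera image. This is a minimal sketch, not the authors' implementation: the `Detection` structure, the pixel-coordinate convention, and the tie-breaking rule (prefer the box whose center is closest to the gaze point) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    """A recognized object in the rear-camera frame (hypothetical structure)."""
    label: str
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        """True if the pixel (px, py) lies inside the bounding box."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)

    def center_distance_sq(self, px: float, py: float) -> float:
        """Squared distance from the box center to the pixel (px, py)."""
        cx = self.x + self.width / 2
        cy = self.y + self.height / 2
        return (cx - px) ** 2 + (cy - py) ** 2


def select_gaze_target(gaze_px: Tuple[float, float],
                       detections: Sequence[Detection]) -> Optional[Detection]:
    """Pick the detection the gaze point falls into, or None if there is none.

    Assumed tie-break: when boxes overlap, the one whose center is closest
    to the gaze point wins.
    """
    hits = [d for d in detections if d.contains(*gaze_px)]
    if not hits:
        return None
    return min(hits, key=lambda d: d.center_distance_sq(*gaze_px))
```

For example, with one box labeled "Starbucks" and an overlapping box labeled "Oakland Fashion Optical", a gaze point inside both resolves to whichever box center is nearer, and a gaze point outside every box yields `None`, in which case the inquiry would be passed on without extra context.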

Exploratory Study
• Wizard-of-Oz Study
➢ Comparing: Touch, Voice, and WorldGaze
➢ Task: retrieving information (e.g. opening hours, ratings, phone numbers)
➢ Participants: 12 (9 males and 3 females), mean age 25.5 years (SD=3.3)
• Conditions
➢ Touch: use Google Maps to query information
➢ Voice: Wizard-of-Oz voice assistant (triggered by “Hey Siri”) that always returned the correct answer
➢ WorldGaze: the Wizard-of-Oz voice assistant likewise always returned the correct answer
• Feedback
➢ System Usability Scale (SUS), 10 items on a 5-point Likert scale
➢ Raw NASA TLX questionnaire, 6 items on a 21-point Likert scale
➢ Future use desirability, 7-point Likert scale
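
For reference, the two questionnaire scores can be computed as follows. This sketch uses the standard SUS scoring rule (odd items contribute rating minus 1, even items contribute 5 minus rating, and the sum is scaled by 2.5 to a 0 to 100 range) and the usual Raw TLX definition (the unweighted mean of the six subscale ratings). The function names are illustrative, and the slides do not say how the study coded its 21-point TLX ratings, so `raw_tlx` simply averages whatever numeric coding is passed in.

```python
from statistics import mean
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Standard SUS score (0-100) from ten ratings on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS responses must be in the range 1..5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item_number % 2 == 1 else (5 - rating)
    return total * 2.5


def raw_tlx(ratings: Sequence[float]) -> float:
    """Raw (unweighted) NASA TLX: the mean of the six subscale ratings."""
    if len(ratings) != 6:
        raise ValueError("Raw TLX needs exactly 6 subscale ratings")
    return mean(ratings)
```

A participant who answers 3 on every SUS item scores 50, and the best possible pattern (5 on odd items, 1 on even items) scores 100.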

Exploratory Study
• Quantitative feedback
[Figure: three charts of the quantitative results; panel annotations: “Lower is better”, “Lower is better”, “Higher is better”]
➢ SUS: did not reveal any significant difference
➢ TLX: Touch had a significantly higher task load than Voice and WorldGaze
➢ Future Use: no significant difference among the three conditions
➢ WorldGaze requires fewer words to be articulated, so utterance duration is shorter

Exploratory Study
• Qualitative feedback
WorldGaze is faster – or it feels faster anyway – less frustrating
WorldGaze would be useful to have (P5)
Implicit input with WorldGaze would be striking (P9)
WorldGaze offers a lot of potential for future interaction paradigms

Exploratory Study
• Use Scenarios
➢ Asking questions about products in stores or menu items in restaurants
➢ Interacting with smart home objects, such as controlling the TV or lighting
➢ Navigation support in museums
➢ Desktop computer interaction
• New Interactions
➢ Integration into smart glasses was suggested by 6 participants
➢ Camera-equipped smart devices (Facebook Portal, Google Nest Hub)
➢ Comparing multiple objects or places

Implementation
• Platform Selection
➢ iPhone, iOS 13.0
• The only mobile OS permitting front and back cameras to be opened simultaneously
• Tested with iPhone XR
• Head Gaze Ray Casting
➢ Robust face API provided by the Apple ARKit 3 SDK
➢ The forward-facing head vector (GazeVector) is used to extend a ray out from the bridge of the nose
➢ Runs at 30 FPS with ~50 ms of latency on an iPhone XR
• Object Recognition & Segmentation
➢ Apple’s Vision Framework
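
The ray casting step can be illustrated with plain geometry. The sketch below is not ARKit code and not the authors' implementation; it assumes the head position and GazeVector have already been expressed in a coordinate frame centered on the rear camera, with +Z pointing out of the back of the phone into the scene, and that the gazed-at surface sits at a known depth. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) and the function name are hypothetical.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def gaze_point_in_image(head_pos: Vec3, gaze_dir: Vec3, depth_m: float,
                        fx: float, fy: float,
                        cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project the point where the gaze ray reaches the plane Z = depth_m
    into rear-camera pixel coordinates using a pinhole model.

    head_pos: bridge of the nose in the (assumed) rear-camera frame, meters.
              The head is behind the phone, so its Z is typically negative.
    gaze_dir: forward-facing head vector in the same frame.
    Returns None if the ray does not travel toward the scene.
    """
    px, py, pz = head_pos
    dx, dy, dz = gaze_dir
    if depth_m <= 0 or dz <= 0:
        return None  # plane is not in front of the camera, or ray points away
    t = (depth_m - pz) / dz  # ray parameter at which Z equals depth_m
    if t <= 0:
        return None
    x = px + t * dx
    y = py + t * dy
    # Pinhole projection of the 3D point (x, y, depth_m).
    return (fx * x / depth_m + cx, fy * y / depth_m + cy)
```

With the head centered 0.4 m behind the phone and looking straight ahead, the gaze lands on the principal point `(cx, cy)`. In practice the depth of the scene is unknown, which is one reason a real system would need to reason about the whole ray rather than a single plane.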

Implementation
• Voice Assistant Integration
➢ Continuous listening feature on iOS combined with speech-to-text
➢ Listens for “Hey Siri”
➢ Searches the string for ambiguous nouns (e.g. “this”, “that place”) and replaces instances
• Battery Life Implications
➢ Integrated as a background service that wakes upon a voice assistant trigger
➢ Estimated power consumption at ~0.1 mWh per inquiry, using bench equipment
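
The string replacement in the voice assistant integration can be sketched with a regular expression. This is only an illustration under stated assumptions: the list of ambiguous phrases is hypothetical (the slides give just “this” and “that place” as examples), the function name is made up, and a single gaze target is assumed per inquiry.

```python
import re
from typing import Optional

# Hypothetical list of ambiguous phrases. Longer phrases come first so that
# "this place" is matched as a whole rather than as "this" followed by "place".
AMBIGUOUS_NOUNS = re.compile(
    r"\b(this place|that place|this one|that one|this|that)\b",
    re.IGNORECASE,
)


def resolve_query(transcript: str, target_name: Optional[str]) -> str:
    """Replace ambiguous nouns in the transcript with the gazed-at target.

    If no target was recognized, the transcript is returned unchanged.
    """
    if not target_name:
        return transcript
    # A function is used as the replacement so that backslashes in the
    # target name are not interpreted as regex escapes.
    return AMBIGUOUS_NOUNS.sub(lambda _match: target_name, transcript)
```

Applied to the earlier example, `resolve_query("When does this place close?", "Oakland Fashion Optical")` gives “When does Oakland Fashion Optical close?”. A naive pattern like this also rewrites “that” when it is used as a conjunction, and a comparison such as “this... and this” would need one target per mention, so a real system needs more than a plain substitution.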

Evaluation
• Evaluate the tracking and targeting performance
➢ Participants: 12 (9 males and 3 females), mean age 28.9 years (SD=5.8)
➢ Statistically significant influence of distance on error
➢ Horizontal and Vertical accuracy impact on error
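
One geometric intuition for why distance could matter: a fixed angular error in the head-gaze vector translates into a lateral miss on the target that grows with distance, roughly distance × tan(angular error). The sketch below only illustrates that relationship; it is not the analysis from the paper, and the function name and the example values are assumptions.

```python
import math


def lateral_error_m(distance_m: float, angular_error_deg: float) -> float:
    """Lateral miss (meters) on a target plane at distance_m for a given
    angular error of the gaze ray, assuming the plane faces the user."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return distance_m * math.tan(math.radians(angular_error_deg))


# Illustrative values only: the same 5 degree error at 2 m and at 4 m.
near = lateral_error_m(2.0, 5.0)
far = lateral_error_m(4.0, 5.0)
```

Doubling the distance doubles the lateral miss for the same angular error, which is why small storefronts far away are harder to address than large nearby ones.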

Example Uses
• Streetscapes
• “When does this open?”
• “What is the rating for this place?”
• “Make me a reservation for 2 at 7 pm”

Example Uses
• Retail
• “Does this come in any other colors?”
• “Add this to my wishlist”
• “What is the price difference between this... and this.”
• Smart Homes and Offices
• Say “on” to lights or a TV
• Say “Down” to a TV or thermostat

Limitation
• Wider-angle lenses would make more of the world gaze-addressable
• Accuracy of gaze vector
➢ Numerous state-of-the-art gaze-tracking algorithms were tested, but their accuracy was severely lacking for these use cases

Criticism
• User MUST look at the screen while using the voice assistant
➢ Qiaohui Zhang, Atsumi Imamiya, Kentaro Go, and Xiaoyang Mao. 2004. Resolving ambiguities of a gaze and speech interface. In Proceedings of the symposium on Eye tracking research & applications (ETRA ’04).
• Usage pattern of voice assistants
• How does the system detect a restaurant that does not have a logo?
• Accessibility
➢ People with low vision
• Social acceptance and privacy
➢ People may think the user is recording them

Conclusion
WorldGaze to enhance voice assistants
Exploration of the possibilities of WorldGaze
Implementation of WorldGaze
Use cases to showcase enhanced assistants
