3. Introduction
• Each major smartphone has its own voice assistant
• Apple Siri
• Samsung Bixby
• Google Assistant
• Help us with our daily tasks
• Calculating
• Setting an alarm
• Giving us weather information
• Giving us opening hours for a specific restaurant
4. Introduction
• Weakness of voice assistants
• Lack of contextual information about the surroundings
• Siri shows a list of possible Starbucks locations even while the user is standing right in front of one
• What we need…
• Contextual information about the user’s surroundings
• Enable richer interaction between the user and the voice assistant
6. Related Work
• Multimodal Interaction
Combining pen and finger input on touchscreens
• Cami et al. "Unimanual Pen+Touch Input Using Variations of Precision Grip Postures." UIST. 2018.
• Hinckley et al. "Pen + touch = new tools." UIST. 2010.
Combining touch and gaze for enhanced selection
• Pfeuffer et al. "Gaze-touch: combining gaze with multi-touch for interaction on the same surface." UIST. 2014.
• Gaze Pointing
Mid-air pointing
• Mayer et al. "Modeling distant pointing for compensating systematic displacements." CHI. 2015.
• Mayer et al. "The effect of offset correction and cursor on mid-air pointing in real and virtual environments." CHI. 2018.
Eye tracking
• Zhai et al. "Manual and gaze input cascaded (MAGIC) pointing." CHI. 1999.
• Geospatial Mobile Interactions
GPS and Wi-Fi localization
• Object Context + Voice Interactions
Gaze and voice combined systems
• Glenn III et al. "Eye-voice-controlled interface." HFES. 1986.
• Koons et al. "Integrating simultaneous input from speech, gaze, and hand gestures." Intelligent multimedia interfaces. 1993.
8. System Design
• Rear camera
Knowledge about the world
Retrieve more information about the user surroundings
Understand the surroundings around the user better
• Front camera
Knowledge about the user’s gaze
What the user is looking at, as displayed within the viewport of the camera
• WorldGaze
Extract where the user is actually looking
Retrieve the place or object (business logos or signage)
Fed into a voice assistant as a context-rich inquiry
Extra contextual information is added to the inquiry automatically
WorldGaze makes the overall experience more natural feeling to the user
9. Exploratory Study
• Wizard-of-Oz Study
Comparing: Touch, Voice, and WorldGaze
Task: retrieving information (e.g. opening hours, ratings, phone numbers)
Participants: 12 (9 males and 3 females), mean age 25.5 years (SD=3.3)
• Conditions
Touch: use Google Maps to query information
Voice: Wizard-of-Oz voice assistant (triggered by “Hey Siri”) that always returned the correct answer
WorldGaze: the voice assistant similarly returned the answer
• Feedback
System Usability Scale (SUS), 10 items on a 5-point Likert scale
Raw NASA TLX questionnaire, 6 items on a 21-point scale
Future use desirability, 7-point Likert scale
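The two questionnaires above are usually reduced to one number per participant. The sketch below uses the standard SUS scoring rule (odd items contribute response minus 1, even items contribute 5 minus response, and the sum is multiplied by 2.5) and the Raw TLX convention of averaging the six subscales without weighting. The slides do not say how the scores were computed in this study, so treat this as an assumption; the function names and sample responses are made up for illustration.

```python
def sus_score(responses):
    """Standard SUS scoring: ten responses on a 1..5 scale -> score in 0..100."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for item, r in enumerate(responses, start=1):
        if not 1 <= r <= 5:
            raise ValueError("SUS responses must be between 1 and 5")
        # Odd-numbered items are positively worded, even-numbered negatively.
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5


def raw_tlx(ratings):
    """Raw (unweighted) NASA TLX: plain mean of the six subscale ratings.

    Assumption: ratings are left on whatever scale they were collected on
    (here a 21-point scale); no rescaling to 0..100 is applied.
    """
    if len(ratings) != 6:
        raise ValueError("Raw TLX needs exactly 6 subscale ratings")
    return sum(ratings) / 6


# Hypothetical participant, not data from the study.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
print(raw_tlx([5, 3, 4, 6, 7, 5]))
```

Raw TLX skips the pairwise weighting step of the full NASA TLX, which is why a simple mean is enough here.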
10. Exploratory Study
• Quantitative feedback
[Charts omitted; axis notes: “Lower is better”, “Lower is better”, “Higher is better”]
SUS: did not reveal any significant difference
TLX: Touch had a significantly higher task load than Voice and WorldGaze
Future Use: no significant difference among the three conditions
WorldGaze requires fewer words to be articulated, so utterance duration is shorter
11. Exploratory Study
• Qualitative feedback
WorldGaze is faster – or it feels faster anyway – less frustrating
WorldGaze would be useful to have (P5)
Implicit input with WorldGaze would be striking (P9)
WorldGaze offers a lot of potential for future interaction paradigms
12. Exploratory Study
• Use Scenarios
Asking questions about products in stores or menu items in restaurants
Interacting with smart home objects, such as controlling the TV or lighting
Navigation support in museums
Desktop computer interaction
• New Interactions
Integrating into smart glasses is suggested by 6 participants
Camera-equipped smart device (Facebook Portal, Google Nest Hub)
Comparing multiple objects or places
13. Implementation
• Platform Selection
iPhone, iOS 13.0
• The only mobile OS permitting front and back cameras to be opened simultaneously
• Tested with iPhone XR
• Head Gaze Ray Casting
Robust face API provided by the Apple ARKit 3 SDK
The forward-facing head vector (GazeVector) is used to extend a ray out from the bridge of the nose
Runs at 30 FPS with ~50 ms of latency on an iPhone XR
• Object Recognition & Segmentation
Apple’s Vision Framework
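The head gaze ray casting and object recognition steps above can be illustrated with a small geometric sketch. It is not the authors' implementation and does not use ARKit or Vision: it assumes the head position and forward vector have already been expressed in the rear camera's coordinate frame (x right, y down, z forward), that the rear camera follows a simple pinhole model, and that the target sits at a guessed distance. The names `Box`, `project_gaze`, and `pick_target`, the intrinsics, and the bounding boxes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A recognized object as a pixel-space bounding box (hypothetical type)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def project_gaze(origin, direction, depth, fx, fy, cx, cy):
    """Project the point where the gaze ray reaches forward distance `depth`.

    origin, direction: head position and forward vector in the rear camera
    frame (assumption: the front-to-rear camera transform is already applied).
    fx, fy, cx, cy: pinhole intrinsics of the rear camera, in pixels.
    Returns pixel coordinates (u, v), or None if the ray never gets there.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if depth <= 0 or dz <= 0:
        return None  # target plane behind the camera, or ray pointing backwards
    t = (depth - oz) / dz  # ray parameter at which z == depth
    if t <= 0:
        return None
    px, py = ox + t * dx, oy + t * dy
    return (fx * px / depth + cx, fy * py / depth + cy)


def pick_target(point, boxes):
    """Return the smallest box containing the projected gaze point, if any."""
    if point is None:
        return None
    u, v = point
    hits = [b for b in boxes
            if b.x_min <= u <= b.x_max and b.y_min <= v <= b.y_max]
    return min(hits,
               key=lambda b: (b.x_max - b.x_min) * (b.y_max - b.y_min),
               default=None)


# Hypothetical scene: head 0.4 m behind the phone, looking slightly right.
boxes = [Box("cafe", 900, 400, 1100, 700), Box("bank", 200, 300, 600, 700)]
gaze_px = project_gaze((0.0, 0.0, -0.4), (0.1, 0.0, 1.0), 10.0,
                       1000.0, 1000.0, 960.0, 540.0)
target = pick_target(gaze_px, boxes)
print(target.label if target else "no target")
```

Preferring the smallest containing box is one simple way to disambiguate nested detections, such as a logo inside a storefront. A real system would also need to handle the unknown target depth, which this sketch fixes by hand.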
14. Implementation
• Voice Assistant Integration
Continuous listening feature on iOS combined with speech-to-text
Listen for “Hey Siri”
Search the string for ambiguous nouns (e.g. “this”, “that place”) and replace instances with the name of the gazed object
• Battery Life Implications
Integrated as a background service that wakes upon a voice assistant trigger
Estimated power consumption at ~0.1 mWh per inquiry, using bench equipment
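The string search-and-replace step in the voice assistant integration can be sketched as follows. This is an illustrative approximation, not the authors' code: the list of ambiguous phrases, the function name `resolve_query`, and the target name are assumptions, and the real system works on the speech-to-text output on iOS.

```python
import re

# Assumed phrase list; longer phrases come first so that "that place"
# is matched as a whole before the bare "that".
AMBIGUOUS_PHRASES = ["that place", "this place", "this", "that"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in AMBIGUOUS_PHRASES) + r")\b",
    re.IGNORECASE,
)


def resolve_query(query, target_name):
    """Replace ambiguous nouns in a transcribed query with the gazed target.

    If no target was recognized (target_name is None), the query is
    returned unchanged so the assistant can handle it as usual.
    """
    if target_name is None:
        return query
    # A function is used as the replacement so that backslashes in the
    # target name are never interpreted as regex escapes.
    return _PATTERN.sub(lambda match: target_name, query)


print(resolve_query("When does this open?", "Starbucks"))
print(resolve_query("What is the rating for that place?", "Starbucks"))
```

A plain word match like this also rewrites "that" when it is used as a conjunction ("shops that are open"), so a practical version would need some grammatical filtering.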
16. Evaluation
• Evaluate the tracking and targeting performance
Participants: 12 (9 males and 3 females), mean age 28.9 years (SD=5.8)
Statistically significant influence of distance on error
Horizontal and Vertical accuracy impact on error
17. Example Uses
• Streetscapes
• “When does this open?”
• “What is the rating for this place?”
• “Make me a reservation for 2 at 7 pm”
18. Example Uses
• Retail
• “Does this come in any other colors?”
• “Add this to my wishlist”
• “What is the price difference between this... and this.”
• Smart Homes and Offices
• Say “on” while looking at lights or a TV
• Say “down” while looking at a TV or thermostat
19. Limitation
• Wider-angle lenses would make more of the world gaze-addressable
• Accuracy of gaze vector
Numerous state-of-the-art algorithms were tested but fell severely short for these use cases
20. Criticism
• User MUST see the screen while using the voice assistant
Qiaohui Zhang, Atsumi Imamiya, Kentaro Go, and Xiaoyang Mao. 2004. Resolving ambiguities of a gaze and speech interface. In Proceedings of the Symposium on Eye Tracking Research & Applications (ETRA ’04).
• Usage pattern of voice assistant
• How does the system detect a restaurant that has no logo?
• Accessibility
People with low vision
• Social acceptance and privacy
People may think the user is recording them
21. Conclusion
WorldGaze to enhance voice assistants
Exploration of the possibilities of WorldGaze
Implementation of WorldGaze
Use cases to showcase the enhanced assistants