Research & Development




   Text vs. Speech
A Comparison of Tagging Input Modalities 
         for Camera Phones



      Mauro Cherubini, Xavier Anguera, 
     Nuria Oliver, and Rodrigo de Oliveira
people do not want to tag
             their pictures
intro → hypotheses → methodology → results → implications
research question:

 Assuming that users are willing to
 input at least one tag, which input
modality can help the production and
      retrieval of the pictures?


hypothesis 1

   Speech is preferred to text as an
   annotation mechanism on mobile
     phones (objective measure)

Support: 
- Mitchard and Winkles (2002)

hypothesis 1-bis

  Speech annotations are preferred by
users even if this means spending more
 time on the task (subjective measure)

 Support: 
 - Perakakis and Potamianos (2008)

hypothesis 2

  The longer the tag the larger the
  advantage of voice over text for
annotating pictures on mobile phones

Support: 
- Hauptmann and Rudnicky (1990)

hypothesis 3

 Retrieving pictures on mobile phones
with speech is not faster than with text
          (objective measure)

 Support: 
 - Mills et al. (2000)

the user study
   field study (4 weeks)
   controlled experiment: T1 - T2 - T3 - T4

  3 experimental conditions:
         a. Speech only
           b. Text only
      c. Speech and Text

MAMI




features of MAMI
                         

    •  processing is done entirely on the mobile
       phone
    •  speech is not transcribed
    •  to compare the waveforms of the audio tags,
       MAMI uses the Dynamic Time Warping
       algorithm
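
The waveform comparison mentioned above can be illustrated with a minimal sketch of classic Dynamic Time Warping. This is not MAMI's implementation: it assumes each audio tag has already been reduced to a short sequence of numbers (for example, one energy value per frame), and the names `dtw_distance` and `best_match` are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic-programming DTW cost between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between two frames
            # extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def best_match(query, stored_tags):
    """Return the name of the stored tag closest to the query under DTW."""
    return min(stored_tags, key=lambda name: dtw_distance(query, stored_tags[name]))
```

Because the alignment may stretch either sequence, the same tag spoken more slowly can still match its stored recording without any speech transcription.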


task 1: remember the tag
            stimulus
                    retrieval




Pictures taken during the field trial


task 2: remember the context
          stimulus
                      retrieval

      TASK 2
      PICTURE 1

      three little bushes
      Garden
      Tree
      Stairs




task 3: remember the picture
          stimulus
                      retrieval




                      Text
  Audio tags were converted into
    textual tags and vice versa

task 4: remember the sequence
        assignment
                      retrieval

     TASK 4

     Three pictures among
     the oldest and three 
     pictures among the 
     newest.




metrics

     •  time to completion
     •  false positives
     •  retrieval errors


results H1




results H1-bis
 All participants in the BOTH group felt that tagging
 with text was more effective than tagging with voice.

   Voice: 3.33 [0.81], Text: 4.34 [0.81] (Mean [SD])
    1 = completely agree; 5 = completely disagree




results H2




results H3




results H3 - continued
take away 1: 
       speech is not a given

the advantage of audio as an input modality for tagging
       pictures on mobile phones is not a given


                           why?
                  1. retrieval precision
                        2. privacy

take away 2: 
              input mistakes
     we address text input mistakes immediately;
  mistakes in audio recordings, by contrast, are
             addressed less frequently




take away 3: 
                  memory

      speech does not help users memorize the tags




implication 1:
   allow multiple modalities




                       © Pixar, 2008


implication 2:
    enable audio inspection




implication 3: 
enable modality synesthesia




                       © Disney, 1940
Research & Development




              end
              thanks

        martigan@gmail.com
          mauro@tid.es


http://www.i-cherubini.it/mauro/blog/
  http://research.tid.es/multimedia/
