MULTIMODAL DEEP LEARNING
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng

Computer Science Department, Stanford University
Department of Music, Stanford University
Computer Science & Engineering Division, University of Michigan, Ann Arbor
MCGURK EFFECT

In speech recognition, people are known to integrate audio-visual information in order to understand speech.

This was first exemplified by the McGurk effect, where a visual /ga/ paired with an audio /ba/ is perceived as /da/ by most subjects.
AUDIO-VISUAL SPEECH RECOGNITION
FEATURE CHALLENGE

[Diagram: audio and video inputs are turned into features that feed a classifier (e.g. SVM)]
REPRESENTING LIPS

• Can we learn better representations for
  audio/visual speech recognition?

• How can multimodal data (multiple
  sources of input) be used to find better
  features?
UNSUPERVISED FEATURE LEARNING
MULTIMODAL FEATURES
CROSS-MODALITY FEATURE LEARNING
FEATURE LEARNING MODELS
BACKGROUND

Sparse Restricted Boltzmann Machines (RBMs)
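As background, here is a minimal numpy sketch of a sparse RBM trained with one step of contrastive divergence (CD-1). The layer sizes, learning rate, and sparsity target are illustrative, not the settings used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseRBM:
    """Bernoulli-Bernoulli RBM trained with CD-1 plus a sparsity term
    that nudges hidden units toward a low target activation."""
    def __init__(self, n_vis, n_hid, lr=0.1, sparsity=0.05, sparsity_cost=0.1):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)
        self.b_hid = np.zeros(n_hid)
        self.lr, self.sparsity, self.sparsity_cost = lr, sparsity, sparsity_cost

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def cd1_step(self, v0):
        # Positive phase: hidden probabilities and a binary sample.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one reconstruction step.
        v1 = sigmoid(h0_sample @ self.W.T + self.b_vis)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_vis += self.lr * (v0 - v1).mean(axis=0)
        # Sparsity: push mean hidden activation toward the target.
        self.b_hid += self.lr * ((h0 - h1).mean(axis=0)
                                 + self.sparsity_cost * (self.sparsity - h0.mean(axis=0)))

rbm = SparseRBM(n_vis=20, n_hid=10)
data = (rng.random((100, 20)) < 0.3).astype(float)
for _ in range(50):
    rbm.cd1_step(data)
print(rbm.hidden_probs(data).mean())  # mean hidden activation after training
```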
FEATURE LEARNING WITH AUTOENCODERS

[Diagram: separate autoencoders reconstruct the audio input and the video input from learned hidden features]
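One way to read this slide as code: a single-modality autoencoder trained by plain gradient descent on squared reconstruction error. This is a numpy sketch with made-up dimensions and synthetic data, not the network used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One-hidden-layer autoencoder: encode x -> h, decode h -> x_hat,
# trained to minimize mean squared reconstruction error.
n_in, n_hid, lr = 16, 8, 0.5
W1 = 0.1 * rng.standard_normal((n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = 0.1 * rng.standard_normal((n_hid, n_in)); b2 = np.zeros(n_in)

X = rng.random((200, n_in))   # stand-in for audio or video frames
losses = []
for _ in range(300):
    H = sigmoid(X @ W1 + b1)   # encoder
    X_hat = H @ W2 + b2        # linear decoder
    err = X_hat - X
    losses.append((err ** 2).mean())
    # Backprop of the mean squared error through decoder and encoder.
    dX_hat = 2 * err / err.size
    dW2 = H.T @ dX_hat; db2 = dX_hat.sum(axis=0)
    dH = dX_hat @ W2.T * H * (1 - H)
    dW1 = X.T @ dH; db1 = dH.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

assert losses[-1] < losses[0]  # reconstruction error decreases
```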
BIMODAL AUTOENCODER

[Diagram: audio and video inputs feed a shared hidden representation that reconstructs both modalities]
SHALLOW LEARNING

[Diagram: a single layer of hidden units connected to both video and audio inputs]

• Mostly unimodal features learned
BIMODAL AUTOENCODER

[Diagram: the same bimodal autoencoder, but with only the video input presented; both audio and video are reconstructed]

Cross-modality learning: learn better video features by using audio as a cue.
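The objective above — encode video alone, but reconstruct both modalities — can be sketched as follows. Dimensions and data are illustrative (real inputs would be lip-region pixels and spectrogram frames):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_video, n_audio, n_hid, lr = 12, 6, 8, 0.5
W_enc = 0.1 * rng.standard_normal((n_video, n_hid)); b_enc = np.zeros(n_hid)
W_v = 0.1 * rng.standard_normal((n_hid, n_video)); b_v = np.zeros(n_video)
W_a = 0.1 * rng.standard_normal((n_hid, n_audio)); b_a = np.zeros(n_audio)

V = rng.random((150, n_video))   # video frames (input AND target)
A = rng.random((150, n_audio))   # paired audio frames (target only)

losses = []
for _ in range(400):
    H = sigmoid(V @ W_enc + b_enc)             # encode video alone
    V_hat, A_hat = H @ W_v + b_v, H @ W_a + b_a
    eV, eA = V_hat - V, A_hat - A              # reconstruct BOTH modalities
    losses.append((eV ** 2).mean() + (eA ** 2).mean())
    dV, dA = 2 * eV / eV.size, 2 * eA / eA.size
    dH = (dV @ W_v.T + dA @ W_a.T) * H * (1 - H)
    W_v -= lr * (H.T @ dV); b_v -= lr * dV.sum(axis=0)
    W_a -= lr * (H.T @ dA); b_a -= lr * dA.sum(axis=0)
    W_enc -= lr * (V.T @ dH); b_enc -= lr * dH.sum(axis=0)

assert losses[-1] < losses[0]  # the audio target shapes the video encoder
```

The key point is that the gradient flowing back from the audio reconstruction head passes through the video encoder, so audio acts as a cue for the video features.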
CROSS-MODALITY DEEP AUTOENCODER

[Diagram: a deep autoencoder that takes only video input and reconstructs both audio and video from a learned representation]
CROSS-MODALITY DEEP AUTOENCODER

[Diagram: the same architecture with only audio input, reconstructing both modalities]
BIMODAL DEEP AUTOENCODERS

[Diagram: audio and video pathways meet in a shared representation; the audio-side features resemble “phonemes”, the video-side features resemble “visemes” (mouth shapes)]
BIMODAL DEEP AUTOENCODERS

[Diagram: video-only variant — the shared representation is reached from video input alone, via the “viseme” (mouth-shape) features]
BIMODAL DEEP AUTOENCODERS

[Diagram: audio-only variant — the shared representation is reached from audio input alone, via the “phoneme” features]
TRAINING THE BIMODAL DEEP AUTOENCODER

[Diagram: three copies of the network — (audio + video), (audio only), (video only) — each reconstructing both modalities through the shared representation]

• Train a single model to perform all 3 tasks

• Similar in spirit to denoising autoencoders
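The three tasks can be realized as a simple data augmentation: for each paired example, add variants with one modality zeroed out, while the target is always the full (audio, video) pair — as in denoising autoencoders. A small numpy sketch (array shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_training_set(audio, video):
    """Build the three training variants for the bimodal deep autoencoder:
    (audio, video), (zeros, video), (audio, zeros).
    Targets are always the full (audio, video) pair."""
    za, zv = np.zeros_like(audio), np.zeros_like(video)
    inputs = np.vstack([np.hstack([audio, video]),   # both modalities
                        np.hstack([za,    video]),   # video only
                        np.hstack([audio, zv])])     # audio only
    targets = np.tile(np.hstack([audio, video]), (3, 1))
    return inputs, targets

audio = rng.random((5, 4))
video = rng.random((5, 6))
X, Y = make_training_set(audio, video)
print(X.shape, Y.shape)  # (15, 10) (15, 10)
```

A single network trained on `X -> Y` is thus asked to reconstruct both modalities no matter which subset it observes.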
EVALUATIONS
VISUALIZATIONS OF LEARNED FEATURES

[Figure: learned audio (spectrogram) and video features shown at 0 ms, 33 ms, 67 ms, and 100 ms]

Audio (spectrogram) and video features learned over 100 ms windows.
LEARNING SETTINGS

We consider three learning settings: cross-modality learning, multimodal fusion, and shared representation learning.
LIP-READING WITH AVLETTERS

AVLetters:
• 26-way letter classification
• 10 speakers
• 60x80-pixel lip regions

Cross-modality learning (video-only deep autoencoder):

Feature Learning   Supervised Learning   Testing
Audio + Video      Video                 Video
LIP-READING WITH AVLETTERS

Feature Representation                                 Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002)    44.6%
Local Binary Pattern (Zhao & Barnard, 2009)            58.5%
Video-Only Learning (single-modality learning)         54.2%
Our Features (cross-modality learning)                 64.4%
LIP-READING WITH CUAVE

CUAVE:
• 10-way digit classification
• 36 speakers

Cross-modality learning (video-only deep autoencoder):

Feature Learning   Supervised Learning   Testing
Audio + Video      Video                 Video
LIP-READING WITH CUAVE

Feature Representation                              Classification Accuracy
Baseline Preprocessed Video                         58.5%
Video-Only Learning (single-modality learning)      65.4%
Our Features (cross-modality learning)              68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009)   64.0%
Visemic AAM (Papandreou et al., 2009)               83.0%
MULTIMODAL RECOGNITION

CUAVE:
• 10-way digit classification
• 36 speakers

[Diagram: bimodal deep autoencoder with audio and video inputs feeding a shared representation]

Evaluate in clean and noisy audio scenarios; in the clean audio scenario, audio alone performs extremely well.

Feature Learning   Supervised Learning   Testing
Audio + Video      Audio + Video         Audio + Video
MULTIMODAL RECOGNITION

Feature Representation                             Classification Accuracy
                                                   (Noisy Audio at 0 dB SNR)
Audio Features (RBM)                               75.8%
Our Best Video Features                            68.7%
Bimodal Deep Autoencoder                           77.3%
Bimodal Deep Autoencoder + Audio Features (RBM)    82.2%
SHARED REPRESENTATION EVALUATION

Feature Learning   Supervised Learning   Testing
Audio + Video      Audio                 Video

[Diagram: a linear classifier is trained on the shared representation computed from audio, then tested on the shared representation computed from video]
SHARED REPRESENTATION EVALUATION

Method: learned features + Canonical Correlation Analysis

Feature Learning   Supervised Learning   Testing   Accuracy
Audio + Video      Audio                 Video     57.3%
Audio + Video      Video                 Audio     91.7%

[Diagram: a linear classifier trained on one modality’s shared representation is tested on the other modality’s shared representation]
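The Canonical Correlation Analysis step in this evaluation can be sketched in numpy via the standard SVD of the whitened cross-covariance. The data below are synthetic two-view examples sharing a latent signal; shapes and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def cca(X, Y, reg=1e-6):
    """CCA via SVD of the whitened cross-covariance.
    Returns projections Wx, Wy and the canonical correlations."""
    X = X - X.mean(axis=0); Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse square root of a symmetric PSD matrix via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(K)
    return inv_sqrt(Cxx) @ U, inv_sqrt(Cyy) @ Vt.T, s

# Two views of a common 2-d latent signal plus small noise.
z = rng.standard_normal((500, 2))
X = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((500, 5))
Y = z @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((500, 4))
Wx, Wy, corrs = cca(X, Y)
print(corrs[:2])  # leading canonical correlations, close to 1
```

In the evaluation above, the two “views” are the shared representations computed from audio and from video, and the classifier operates in the CCA-aligned space.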
MCGURK EFFECT

A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input   Video Input   Model Predictions
                            /ga/     /ba/     /da/
/ga/          /ga/          82.6%    2.2%     15.2%
/ba/          /ba/          4.4%     89.1%    6.5%
/ba/          /ga/          28.3%    13.0%    58.7%
CONCLUSION

• Applied deep autoencoders to discover features in multimodal data

• Cross-modality learning: we obtained better video features (for lip-reading) by using audio as a cue

• Multimodal feature learning: learned representations that relate across audio and video data
THANK YOU FOR YOUR ATTENTION!

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Último (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 


  • 16. BIMODAL AUTOENCODER. Cross-modality learning: learn better video features by using audio as a cue. [Diagram: video input only, through a hidden representation, to both audio and video reconstructions]
  • 17. CROSS-MODALITY DEEP AUTOENCODER. [Diagram: video input, through stacked hidden layers, to a learned representation, then to both audio and video reconstructions]
  • 18. CROSS-MODALITY DEEP AUTOENCODER. [Diagram: audio input, through stacked hidden layers, to a learned representation, then to both audio and video reconstructions]
  • 19. BIMODAL DEEP AUTOENCODERS. [Diagram: audio input ("phonemes") and video input ("visemes", i.e. mouth shapes), through a shared representation, to both audio and video reconstructions]
  • 20. BIMODAL DEEP AUTOENCODERS. [Diagram: video input only ("visemes", mouth shapes), through the shared representation, to both audio and video reconstructions]
  • 21. BIMODAL DEEP AUTOENCODERS. [Diagram: audio input only ("phonemes"), through the shared representation, to both audio and video reconstructions]
  • 22. TRAINING BIMODAL DEEP AUTOENCODER. Train a single model to perform all three tasks; similar in spirit to denoising autoencoders. [Diagram: three training configurations, with audio + video input, video-only input, and audio-only input, each reconstructing both modalities through the shared representation]
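The three-task training scheme above can be sketched as input masking over a shared-representation autoencoder. The sketch below is a minimal illustration, not the paper's implementation: the dimensions, the single linear encoder and decoders, and the `reconstruction_loss` helper are all hypothetical, and real training would backpropagate this objective through a deeper network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions for the audio/video feature vectors and shared code.
A_DIM, V_DIM, H_DIM = 20, 30, 10

# One linear encoder over the concatenated modalities, plus one decoder per modality.
W_enc = rng.normal(scale=0.1, size=(H_DIM, A_DIM + V_DIM))
W_dec_a = rng.normal(scale=0.1, size=(A_DIM, H_DIM))
W_dec_v = rng.normal(scale=0.1, size=(V_DIM, H_DIM))

def reconstruction_loss(audio, video, use_audio, use_video):
    """Zero out a modality at the input, but always reconstruct BOTH modalities."""
    a_in = audio if use_audio else np.zeros_like(audio)
    v_in = video if use_video else np.zeros_like(video)
    h = np.tanh(W_enc @ np.concatenate([a_in, v_in]))  # shared representation
    a_hat, v_hat = W_dec_a @ h, W_dec_v @ h
    return np.mean((a_hat - audio) ** 2) + np.mean((v_hat - video) ** 2)

audio = rng.normal(size=A_DIM)
video = rng.normal(size=V_DIM)

# The three tasks from the slide: both modalities, video only, audio only.
total_loss = (reconstruction_loss(audio, video, True, True)
              + reconstruction_loss(audio, video, False, True)
              + reconstruction_loss(audio, video, True, False))
print(total_loss)
```

As with denoising autoencoders, the corrupted (here, missing) input forces the shared layer to carry information about both modalities.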
  • 24. VISUALIZATIONS OF LEARNED FEATURES. Audio (spectrogram) and video features learned over 100 ms windows. [Figure: feature visualizations at 0 ms, 33 ms, 67 ms, and 100 ms]
  • 25. LEARNING SETTINGS  We will consider the learning settings shown in Figure 1.
  • 26. LIP-READING WITH AVLETTERS. AVLetters: 26-way letter classification, 10 speakers, 60x80-pixel lip regions; cross-modality learning setting (feature learning: audio + video; supervised learning: video; testing: video). [Diagram: video input, through a learned representation, to both audio and video reconstructions]
  • 27-29. LIP-READING WITH AVLETTERS
     Feature Representation                               Classification Accuracy
     Multiscale Spatial Analysis (Matthews et al., 2002)  44.6%
     Local Binary Pattern (Zhao & Barnard, 2009)          58.5%
     Video-Only Learning (Single Modality Learning)       54.2%
     Our Features (Cross Modality Learning)               64.4%
  • 30. LIP-READING WITH CUAVE. CUAVE: 10-way digit classification, 36 speakers; cross-modality learning setting (feature learning: audio + video; supervised learning: video; testing: video). [Diagram: video input, through a learned representation, to both audio and video reconstructions]
  • 31-33. LIP-READING WITH CUAVE
     Feature Representation                               Classification Accuracy
     Baseline Preprocessed Video                          58.5%
     Video-Only Learning (Single Modality Learning)       65.4%
     Our Features (Cross Modality Learning)               68.7%
     Discrete Cosine Transform (Gurban & Thiran, 2009)    64.0%
     Visemic AAM (Papandreou et al., 2009)                83.0%
  • 34. MULTIMODAL RECOGNITION. CUAVE: 10-way digit classification, 36 speakers. Evaluate in clean and noisy audio scenarios; in the clean audio scenario, audio alone performs extremely well. Setting: feature learning on audio + video; supervised learning on audio + video; testing on audio + video. [Diagram: audio and video inputs, through the shared representation, to both audio and video reconstructions]
  • 35-37. MULTIMODAL RECOGNITION
     Feature Representation                            Classification Accuracy (Noisy Audio at 0 dB SNR)
     Audio Features (RBM)                              75.8%
     Our Best Video Features                           68.7%
     Bimodal Deep Autoencoder                          77.3%
     Bimodal Deep Autoencoder + Audio Features (RBM)   82.2%
  • 38. SHARED REPRESENTATION EVALUATION. Train a linear classifier on the shared representation computed from one modality, then test it on the shared representation computed from the other modality. Setting: feature learning on audio + video; supervised learning on audio; testing on video. [Diagram: audio to shared representation at training time, video to shared representation at testing time]
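The protocol on this slide can be sketched as: fit a linear classifier on the shared representation derived from one modality, then evaluate it on the representation derived from the other. The toy example below is hypothetical scaffolding only; it assumes both modalities are linear views of a shared latent code and uses pseudoinverse "encoders" and a least-squares classifier, none of which come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: both modalities are linear views of a shared latent code z,
# and each modality's "shared encoder" is the pseudoinverse mapping back to z.
n, d_z = 400, 4
z = rng.normal(size=(n, d_z))
labels = (z[:, 0] > 0).astype(int)            # hypothetical 2-way labels

W_a, W_v = rng.normal(size=(d_z, 10)), rng.normal(size=(d_z, 12))
audio, video = z @ W_a, z @ W_v
enc_a, enc_v = np.linalg.pinv(W_a), np.linalg.pinv(W_v)

# Train a least-squares linear classifier on the AUDIO-derived shared reps...
H_train = audio @ enc_a                        # shared representation from audio
w = np.linalg.lstsq(H_train, 2 * labels - 1, rcond=None)[0]

# ...and test it on the VIDEO-derived shared reps (train on one modality,
# test on the other).
H_test = video @ enc_v
acc = np.mean(((H_test @ w) > 0).astype(int) == labels)
print(acc)
```

Because the two encoders map into the same latent space, the classifier transfers across modalities; with unimodal features learned independently per modality, this transfer would fail.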
  • 39. SHARED REPRESENTATION EVALUATION. Method: learned features + Canonical Correlation Analysis.
     Feature Learning   Supervised Learning   Testing   Accuracy
     Audio + Video      Audio                 Video     57.3%
     Audio + Video      Video                 Audio     91.7%
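The method above pairs the learned features with Canonical Correlation Analysis to relate the two views. A minimal CCA can be written with plain NumPy; the sketch below is illustrative only (the regularization constant, dimensions, and synthetic data are made up, and it is not the paper's evaluation code).

```python
import numpy as np

def cca_correlations(X, Y, reg=1e-6):
    """Canonical correlations between two row-sample matrices X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    def inv_sqrt(S):                      # symmetric inverse square root
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    # Whiten each view, then take singular values of the cross-covariance.
    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.clip(np.linalg.svd(M, compute_uv=False), 0.0, 1.0)

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 3))                      # shared latent "content"
audio_feats = z @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))
video_feats = z @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(500, 6))

corrs = cca_correlations(audio_feats, video_feats)
print(corrs)  # leading correlations are high because the views share a latent
```

The leading canonical directions capture exactly the kind of cross-modal structure the shared-representation evaluation measures.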
  • 40-41. MCGURK EFFECT. A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
     Audio Input   Video Input   Model Predictions: /ga/   /ba/    /da/
     /ga/          /ga/                             82.6%   2.2%   15.2%
     /ba/          /ba/                              4.4%  89.1%    6.5%
     /ba/          /ga/                             28.3%  13.0%   58.7%
  • 42. CONCLUSION. Applied deep autoencoders to discover features in multimodal data. Cross-modality learning: we obtained better video features (for lip-reading) using audio as a cue. Multimodal feature learning: learned representations that relate across audio and video data. [Diagrams: cross-modality deep autoencoder with video input, and bimodal deep autoencoder with a shared representation]

Editor's Notes

  1. In this work, I'm going to talk about audio-visual speech recognition and how we can apply deep learning to this multimodal setting. For example, given a small speech segment with video of a person saying letters, can we determine which letter was said from images of the lips together with the audio? How do we integrate these two sources of data? Multimodal learning involves relating information from multiple sources. For example, images and 3-D depth scans are correlated at first order, as depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have non-linear correlations at a "mid-level", as phonemes or visemes; it is difficult to relate raw pixels to audio waveforms or spectrograms. In this paper, we are interested in modeling "mid-level" relationships, so we choose audio-visual speech classification to validate our methods. In particular, we focus on learning representations for speech audio which are coupled with videos of the lips.
  2. So how do we solve this problem? A common machine learning pipeline goes like this: we take the inputs, extract some features, and feed them into our standard ML toolbox (e.g., a classifier). The hardest part is really the features: how we represent the audio and video data for use in our classifier. While for audio the speech community has developed many features, such as MFCCs, which work really well, it is not obvious what features we should use for lips.
  3. So what do state-of-the-art features look like? Engineering these features took a long time. To this end, we address two questions in this work [click]. Furthermore, what is interesting in this problem is the deep question: that audio and video features are only related at a deep level.
  4. Concretely, our task is to convert sequences of lip images into a vector of numbers, and similarly for the audio.
  5. Now that we have multimodal data, one easy approach is to simply concatenate the features; however, simple concatenation fails to model the interactions between the modalities. This is a very limited view of multimodal features. Instead, what we would like to do [click] is to
  6. Find better ways to relate the audio and visual inputs and get features that arise out of relating them together
  7. Next I'm going to describe a different feature learning setting. Suppose that at test time only the lip images are available and you do not get the audio signal, but at training time you have both audio and video: can the audio at training time help you do better at test time, even though you don't have audio at test time? (Lip-reading is not well defined otherwise.) But there are more settings to consider. If our task is only lip reading, i.e. visual speech recognition, an interesting question to ask is: can we improve our lip-reading features if we had audio data?
  8. Let's step back a bit and take a similar but related approach to the problem. What if we learn an autoencoder? But this still has the problem! But wait, now we can do something interesting.
  9. There are different versions of these shallow models, and if you train a model of this form, this is what one usually gets. If you look at the hidden units, it turns out that many hidden units respond to only one modality. So why doesn't this work? We think there are two reasons. In the shallow models, we're trying to relate pixel values to the values in the audio spectrogram; instead, what we expect is for mid-level video features, such as mouth motions, to inform us about the audio content. It turns out that the model learns many unimodal units; the figure shows the connectivity. We think two reasons are possible here: 1) the model is unable to do it (no incentive), and 2) we're actually trying to relate pixel values to values in the audio spectrogram. But this is really difficult; for example, we do not expect a change in one pixel value to inform us about how the audio pitch is changing. Thus, the relations across the modalities are deep, and we really need a deep model to capture them. Review: 1) no incentive and 2) deep.
  10. But this still has the problem! But wait, now we can do something interesting: this model will be trained on clips with both audio and video.
  11. However, the connections between audio and video are (arguably) deep instead of shallow, so ideally we want to extract mid-level features before trying to connect them together. Since audio is really good for speech recognition, the model is going to learn representations that can reconstruct audio, and thus hopefully be good for speech recognition as well.
  12. But what we would like is not to have to train many versions of this model. It turns out that you can unify the separate models.
  13. [pause] The second model we present is the bimodal deep autoencoder. What we want this bimodal deep AE to do is learn representations that relate the audio and video data. Concretely, we want it to learn representations that are robust to the input modality.
  14. Features correspond to mouth motions and are also paired up with the audio spectrogram. The features are generic and are not speaker-specific.
  15. Explain in phases!
  16. Explain in phases
  17. Explain in phases