MediaEval 2016 - BUT Zero-Cost Speech Recognition

Presenter: Miroslav Skácel
BUT Zero-Cost Speech Recognition 2016 System Description. In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016), by Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_48.pdf
Video: https://youtu.be/0pNiLLVTa28
Abstract: This paper describes our work on developing speech recognizers for Vietnamese. It focuses on procedures to prepare provided data precisely. We aim on analysis of the textual transcriptions in particular. Methods to filter out defective data to improve performance of final system are proposed and described in detail. We also propose cleaning of other textual data used for language modeling. Several architectures are investigated to reach both sub-tasks goals. The achieved results are discussed.

BUT Zero-Cost 2016 Speech Recognition
Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke
BUT Speech@FIT
Brno University of Technology
Czech Republic
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands

Overview
Our ideas for this task were:
• to use previous knowledge of low-resource languages
• to use existing systems from Babel¹
• to adapt them to Vietnamese
• to follow zero-cost requirements
We ended up with:
• lots of data processing and cleaning
• 2x LVCSR sub-task systems
• 1x subword sub-task system
• suggestions for further improvements
¹ https://www.iarpa.gov/index.php/research-programs/babel

Data preparation
• audio downsampled from 16kHz to 8kHz
• got Vietnamese alphabet from wordlist
• cleaned other characters (punctuation marks, brackets, etc.)
• numerals expanded to textual form
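
The text-cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and numerals are read digit by digit, which is a simplification (a real expansion would read multi-digit numbers as whole numbers).

```python
import re
import unicodedata

# Vietnamese words for the digits 0-9 (digit-by-digit reading is an assumption).
DIGIT_WORDS = {
    "0": "không", "1": "một", "2": "hai", "3": "ba", "4": "bốn",
    "5": "năm", "6": "sáu", "7": "bảy", "8": "tám", "9": "chín",
}


def alphabet_from_wordlist(words):
    """Collect every character that occurs in the wordlist."""
    return {ch for w in words for ch in unicodedata.normalize("NFC", w.lower())}


def expand_digits(text):
    """Replace each ASCII digit with its spelled-out form."""
    return re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)


def clean_transcript(text, alphabet):
    """Lowercase, expand digits, and drop characters outside the alphabet."""
    text = unicodedata.normalize("NFC", text.lower())
    text = expand_digits(text)
    kept = "".join(ch if ch in alphabet else " " for ch in text)
    return " ".join(kept.split())  # collapse repeated whitespace


wordlist = ["xin", "chào"] + list(DIGIT_WORDS.values())
alphabet = alphabet_from_wordlist(wordlist)
cleaned = clean_transcript("Xin chào (2)!", alphabet)
```

Normalizing to NFC before comparing characters matters for Vietnamese, because the same accented letter can be encoded either precomposed or as a base letter plus combining marks.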

Data segmentation
• audio longer than 1 minute caused problems
• alignment from the first training stages was used
• data split when:
• silence longer than 0.5s occurs
• segment would be longer than 15s
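
A minimal sketch of the splitting rule above, assuming the alignment is available as time-sorted `(label, start, end)` tuples in seconds. The function name and the data layout are hypothetical; only the two thresholds (0.5 s of silence, 15 s maximum length) come from the slide.

```python
def split_segments(words, max_silence=0.5, max_len=15.0):
    """Group aligned words into segments.

    A new segment starts when the silence before a word exceeds
    max_silence, or when adding the word would make the current
    segment longer than max_len.
    """
    segments, current = [], []
    for word in words:
        _, start, end = word
        if current:
            gap = start - current[-1][2]       # silence since the previous word
            would_be = end - current[0][1]     # segment length if the word is added
            if gap > max_silence or would_be > max_len:
                segments.append(current)
                current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


aligned = [("a", 0.0, 0.4), ("b", 0.5, 1.0), ("c", 1.8, 2.2)]
by_silence = split_segments(aligned)

# Ten back-to-back words of 1.9 s each, separated by 0.1 s pauses.
long_run = [(f"w{i}", i * 2.0, i * 2.0 + 1.9) for i in range(10)]
by_length = split_segments(long_run)
```

In the first example the 0.8 s pause before `c` triggers a split; in the second, only the 15 s limit does.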

Language models
3 LMs were created:
1) cleaned transcriptions from train set
2) cleaned subtitles
3) cleaned text from websites
• 3 or more words
• we combined them together
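
The slide does not say how the three LMs were combined. One common option is linear interpolation, sketched below with unigram models for brevity. The interpolation weights, the function names, and the reading of "3 or more words" as a minimum line length are all assumptions made for illustration.

```python
from collections import Counter


def keep_long_lines(lines, min_words=3):
    """Keep only lines with at least min_words words (assumed filter)."""
    return [ln for ln in lines if len(ln.split()) >= min_words]


def unigram_probs(lines):
    """Maximum-likelihood unigram probabilities for one text source."""
    counts = Counter(w for ln in lines for w in ln.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(models, weights):
    """Linearly interpolate several unigram models; weights should sum to 1."""
    vocab = set().union(*models)
    return {
        w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
        for w in vocab
    }


transcripts = keep_long_lines(["a b c", "a b"])
subtitles = keep_long_lines(["a a a d"])
combined = interpolate(
    [unigram_probs(transcripts), unigram_probs(subtitles)],
    weights=[0.5, 0.5],  # hypothetical weights; normally tuned on held-out data
)
```

Because each component is a proper distribution and the weights sum to one, the interpolated model is also a proper distribution.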

LVCSR systems - GMM/DNN²
² Martin Karafiát et al. BUT Neural Network Features for Spontaneous Vietnamese in BABEL. In Proceedings of ICASSP 2014, pages 5659–5663. IEEE Signal Processing Society, 2014.

LVCSR systems - BLSTM³
• 16kHz audio, original transcriptions
• Kaldi/TNet toolkits
³ Martin Karafiát et al. Multilingual BLSTM and Speaker-Specific Vector Adaptation in 2016 BUT Babel System. Accepted at SLT 2016, 2016.

Subword system
• Acoustic Unit Discovery⁴ (AUD) model
• unlabeled data - no transcription needed
• like phone-loop model, fully Bayesian, Dirichlet distribution
• no fixed number of components like in HMM
• can learn complexity of model
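
To illustrate how a model can avoid fixing the number of units in advance, the sketch below draws unit weights from a truncated stick-breaking prior. This assumes a Dirichlet-process-style prior, which the slide only hints at; the function name, the concentration value, and the truncation level are hypothetical, and the code shows only the prior, not the variational inference used for training.

```python
import random


def stick_breaking(alpha, truncation, rng):
    """Sample mixture weights from a truncated stick-breaking prior.

    Each step breaks off a Beta(1, alpha) fraction of the remaining
    stick; the final weight takes whatever is left so the weights
    sum to one.
    """
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, alpha)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)
    return weights


unit_weights = stick_breaking(alpha=1.0, truncation=10, rng=random.Random(0))
```

Under such a prior most of the mass tends to fall on a few units, so units that the data does not support end up with negligible weight. That is one way a model can "learn its complexity".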
⁴ Lucas Ondel et al. Variational Inference for Acoustic Unit Discovery. In Procedia Computer Science, volume 2016, pages 80–86. Elsevier Science, 2016.

Results
LVCSR sub-task, WER:

| System | Devel all | Devel ELSA | Devel Forvo | Devel RhinoSpike | Test all | Test ELSA | Test Forvo | Test RhinoSpike | Test YouTube |
|---|---|---|---|---|---|---|---|---|---|
| Kaldi Baseline | 60.4 | 45.3 | 99.8 | 73.5 | 75.5 | 46.7 | 98.1 | 76.2 | 97.1 |
| P-BUT - Babel Kaldi BLSTM 16kHz | 17.9 | 6.4 | 58.1 | 15.8 | 48.0 | 4.9 | 55.7 | 35.4 | 87.2 |
| L-BUT - Babel Kaldi BLSTM 16kHz - LM tune | 17.6 | 6.2 | 56.4 | 16.9 | 46.3 | 4.6 | 52.6 | 32.2 | 84.7 |
| L-BUT - Babel GMM/DNN 8kHz | 36.1 | 29.7 | 68.5 | 23.4 | 55.7 | 28.0 | 59.3 | 44.9 | 81.4 |

Subword sub-task, NMI:

| System | Devel all | Devel ELSA | Devel Forvo | Devel RhinoSpike | Test all | Test ELSA | Test Forvo | Test RhinoSpike | Test YouTube |
|---|---|---|---|---|---|---|---|---|---|
| Kaldi Baseline | 5.48 | 7.88 | 10.44 | 13.8 | 4.29 | 7.49 | 11.25 | 15.99 | 6.35 |
| P-BUT AUD phone-loop | 5.08 | 6.45 | 8.76 | 14.19 | 4.56 | 5.52 | 9.59 | 18.49 | 7.59 |
• BLSTM outperformed the GMM/DNN system
• exception is unseen data (YouTube test set)
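
For readers unfamiliar with the LVCSR metric, word error rate is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a generic implementation, not the scoring tool used in the evaluation, and it assumes the reported WER values are percentages.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


wer = word_error_rate("a b c d", "a x c")
```

Here one substitution and one deletion against a four-word reference give a WER of 50. Because insertions count as errors, WER can exceed 100.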

Further work
Ideas for future experiments:
• find better way to clean original transcriptions
• methods to detect inappropriate audio/transcriptions
• adding noise to audio for more robust system
• audio reverberation using RIR
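
The last two ideas are standard augmentation techniques and might look like the sketch below. This is an illustration only: the function names are hypothetical, signals are plain lists of float samples, and the noise is assumed to be non-silent and is looped if shorter than the signal.

```python
import math


def add_noise(signal, noise, snr_db):
    """Mix noise into signal so the result has the requested signal-to-noise ratio."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that p_signal / (scale^2 * p_noise) == 10^(snr_db / 10).
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [x + scale * noise[i % len(noise)] for i, x in enumerate(signal)]


def reverberate(signal, rir):
    """Convolve signal with a room impulse response (direct O(n*m) convolution)."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out


noisy = add_noise([1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5], snr_db=0)
echoed = reverberate([1.0, 0.0, 0.0], [1.0, 0.5])
```

For real recordings an FFT-based convolution would be far faster than the direct loop shown here.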

