Human Interface Laboratory
Speech to text adaptation:
Towards an efficient cross-modal distillation
2020. 10. 26, @Interspeech
Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim
Contents
• Motivation
• Task and Dataset
• Related Work
• Method
• Result and Discussion
• Conclusion
1
Motivation
• Text and speech: the two main media of communication
• But text resources >> speech resources
 Why?
• Difficult to control the generation
and storage of the recordings
2
“THIS IS A SPEECH”
Difference in search results for ‘English’ in the ELRA catalog
Motivation
• Pretrained language models
 Mainly developed for text-based systems
• ELMo, BERT, GPTs …
 Based on huge amounts of raw corpora
• Trained with simple but non-task-specific objectives
• Pretrained speech models?
 Recently suggested
• SpeechBERT, Speech-XLNet …
 Why not prevalent?
• Difficulties in problem setting
– What is the correspondence of the tokens?
• Requires far more resources than text data
3
Motivation
• How to leverage pretrained LMs (or the inference thereof) in
speech processing?
 Direct use?
• Only if the ASR output is accurate
 Training LMs with erroneous speech transcriptions?
• Okay, but cannot cover all the possible cases, and requires scripts for various
scenarios
 Distillation?
4
(Hinton et al., 2015)
Task and Dataset
• Task: Spoken language understanding
 Literally – Understanding spoken language?
 In literature – Intent identification and slot filling
 Our hypothesis:
• In either case, abstracted speech data will meet the abstracted representation
of text in semantic pathways
5
Lugosch et al. (2019)
Hemphill et al. (1990)
Allen (1980)
Task and Dataset
• Freely available benchmark!
 Fluent Speech Commands
• 30,043 single-channel 16 kHz audio files
• Each audio file labeled with three slots: action / object / location
• 248 different phrases spoken by 97 speakers (77/10/10 train/valid/test)
• Multi-label classification problem (example below)
 Why Fluent Speech Commands? (as suggested in Lugosch et al., 2019)
• Google Speech Commands:
– Only short keywords, thus not an SLU task
• ATIS
– Not publicly available
• Grabo, Domotica, Patcor
– Free, but only a small number of speakers and phrases
• Snips audio
– Variety of phrases, but less audio
6
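For illustration, one Fluent Speech Commands item pairs an audio clip with a three-slot label; a minimal sketch, with slot values following the dataset's published examples:

```python
# One Fluent Speech Commands example: the audio of the phrase below is
# labeled with three slots, and the intent is their combination.
utterance = "Turn on the lights in the kitchen"
label = {
    "action": "activate",
    "object": "lights",
    "location": "kitchen",
}
```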
Related Work
• ASR-NLU pipelines
 Conventional approaches
 Best if an accurate ASR is guaranteed
 Easier to interpret issues and enhance individual modules
• End-to-end SLU
 Less prone to ASR errors
 Non-textual information might be preserved as well
• Pretrained LMs
 Takes advantage of massive textual knowledge
 High performance, freely available modules
• Knowledge distillation
 Adaptive to various training schemes
 Cross-modal application is possible
7
Related Work
• End-to-end SLU
 Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken
Language Understanding." INTERSPEECH 2019.
9
Related Work
• End-to-end SLU
 Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding." ICASSP 2020.
10
Related Work
• Pretrained LMs
 Transformer architectures
11
Related Work
• End-to-end speech processing + PLM
 Chuang, Yung-Sung, et al. "SpeechBERT: Cross-Modal Pre-Trained Language Model for End-to-End Spoken Question Answering." INTERSPEECH 2020.
12
Related Work
• End-to-end speech processing + KD
 Liu, Yuchen, et al. "End-to-End Speech Translation with Knowledge Distillation." INTERSPEECH 2019.
13
Method
• End-to-end SLU + PLM + Cross-modal KD
14
Method
• End-to-end SLU
 Backbone: Lugosch et al. (2019) (sketch below)
• Phoneme module (SincNet layer)
• Word module
– BiGRU-based, with dropout/pooling
• Intent module
– Subsequent prediction of the three slots
– Also implemented with BiGRU
15
(Ravanelli and Bengio, 2018)
From a previous version of Wang et al. (2020)
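A minimal sketch of the student under the structure above, assuming PyTorch; a plain Conv1d stands in for the SincNet front-end, and all layer sizes are illustrative rather than those of Lugosch et al. (2019):

```python
import torch
import torch.nn as nn

class SLUStudent(nn.Module):
    """Sketch of the phoneme -> word -> intent student. A plain Conv1d
    stands in for the SincNet layer; sizes are illustrative."""
    def __init__(self, n_intent_logits):
        super().__init__()
        # Phoneme module (the SincNet layer in the paper)
        self.phoneme = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=401, stride=160), nn.ReLU())
        # Word module: BiGRU with dropout
        self.word = nn.GRU(64, 128, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(0.5)
        # Intent module: BiGRU; here the three slots are predicted jointly
        self.intent = nn.GRU(256, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, n_intent_logits)

    def forward(self, wav):                  # wav: (batch, samples)
        h = self.phoneme(wav.unsqueeze(1))   # (batch, 64, frames)
        h, _ = self.word(h.transpose(1, 2))  # (batch, frames, 256)
        h, _ = self.intent(self.drop(h))
        return self.head(h.mean(dim=1))      # pool over time -> slot logits
```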
Method
• End-to-end SLU
16
Method
• PLM
 Fine-tuning the pretrained model
• BERT-Base (Devlin et al., 2018)
– Bidirectional encoder representations from Transformers (BERT)
• Hugging Face PyTorch wrapper
17
Method
• PLM
 Fine-tuning with FSC ground-truth transcripts! (sketch below)
18
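A sketch of the teacher fine-tuned as above, using the Hugging Face transformers library; the three classification heads over the pooled [CLS] output and the class counts are illustrative assumptions, not the exact published setup:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SLUTeacher(nn.Module):
    """BERT-Base teacher fine-tuned on FSC ground-truth transcripts.
    One linear head per slot; class counts are placeholders."""
    def __init__(self, n_action=6, n_object=14, n_location=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.heads = nn.ModuleList(
            nn.Linear(hidden, n) for n in (n_action, n_object, n_location))

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output                    # [CLS]-based embedding
        return [head(pooled) for head in self.heads]  # logits per slot

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["turn on the lights in the kitchen"], return_tensors="pt")
action, obj, location = SLUTeacher()(batch["input_ids"],
                                     batch["attention_mask"])
```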
Method
• Cross-modal KD
 Distillation as teacher-student learning
• Loss1 = f(answer, inference_s)
• Loss2 = g(inference_s, inference_t)   (s: student, t: teacher)
• Different input, same task?
– e.g., speech translation
19
Total Loss = Loss1 + Loss2 (sketch below)
Distilled knowledge
(Liu et al., 2019)
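A minimal sketch of this total loss, assuming f is cross-entropy against the ground-truth label and g is an MAE (L1) distance between student and teacher logits; the weight alpha anticipates the scheduling discussed on the next slide:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, target, alpha=1.0):
    """Loss1 + alpha * Loss2 for one slot; the teacher is frozen."""
    loss1 = F.cross_entropy(student_logits, target)             # vs. ground truth
    loss2 = F.l1_loss(student_logits, teacher_logits.detach())  # vs. teacher
    return loss1 + alpha * loss2
```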
Method
• Cross-modal KD
 What determines the loss?
• Who teaches
• How the loss is calculated
– MAE, MSE
• How much the guidance influences
(scheduling; sketch below)
20
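One plausible shape for the scheduled teacher weight, assuming the exponential warm-up-and-decay form the results favor; the peak position and rate are illustrative constants:

```python
import math

def kd_weight(step, peak_step=1000, rate=3e-3):
    """Exponential warm-up to 1.0 at peak_step, then exponential decay.
    Other schedule shapes (e.g., triangular) are compared in the results."""
    if step <= peak_step:
        return math.exp(-rate * (peak_step - step))  # warm-up toward 1.0
    return math.exp(-rate * (step - peak_step))      # decay after the peak

# total_loss = loss1 + kd_weight(step) * loss2
```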
Method
• Cross-modal KD
21
Result and Discussion
• Teacher performance
 GT-based, high performance
 Not encouraging for ASR results
• Why the ASR-NLU baseline is
borrowed (Wang et al., 2020)
• Comparison with the baseline
 Distillation is successful with
flexible teacher influence
 Reaches high performance
with only a simple distillation
 The Professor model does not
necessarily dominate, but the
Hybrid model is effective with
MAE as the loss function
22
Result and Discussion
• Comparison with the baseline (cont’d)
 Better teacher performance does not guarantee high-quality distillation
• Consistent with recent findings in image processing and ASR
distillation
– A tutor might be better than a professor?
 MAE overall better than MSE
• Possible correspondence with SpeechBERT
• Why?
– Different nature of the inputs
– MSE might amplify the gap
and lead to collapse
» Partly observed in
data shortage scenarios
24
(Chuang et al., 2019)
Result and Discussion
• Data shortage scenario
 MSE collapse is more evident
 Scheduling also matters
• Exp. faring better than Tri. and others
shows that
– Warm-up and decay is powerful
– Teacher influence does not
necessarily have to last long
• However, a less mechanical
approach is still anticipated
– e.g., Entropy-based?
 The overall results suggest that
distillation from a fine-tuned LM
helps the student learn information regarding uncertainty that is difficult
to obtain from a speech-only end-to-end system
25
Result and Discussion
• Discussion
 Is this cross-modal or multi-modal?
• Probably both; though the text (either ASR output or GT) comes from the speech, the
formats differ: waveform vs. Unicode
 Is this knowledge sharing?
• Also yes; though we exploit logit-level information, the different aspects of
uncertainty derived from each modality might affect the distillation process,
making it knowledge sharing rather than mere optimization
 To engage paralinguistic properties?
• Further study; frame-level acoustic information can be residual-connected to
compensate for the loss, though this might not leverage much from text-based LMs
26
Conclusion
• Cross-modal distillation works in SLU, even if the teacher's input
modality is explicitly different from the student's
• Simple distillation from a fine-tuned LM helps the student learn
uncertainty information that is not obtainable from speech-only training
• MAE loss is effective in speech-to-text adaptation, possibly with
warm-up and decay scheduling of the KD loss
27
Reference (in order of appearance)
• Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint
arXiv:1503.02531 (2015).
• Allen, James F., and C. Raymond Perrault. "Analyzing intention in utterances." Artificial intelligence 15.3 (1980): 143-178.
• Hemphill, Charles T., John J. Godfrey, and George R. Doddington. "The ATIS spoken language systems pilot corpus."
Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. 1990.
• Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." arXiv preprint
arXiv:1904.03670 (2019).
• Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding." ICASSP
2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
• Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you
need. In Advances in neural information processing systems (pp. 5998-6008).
• Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
• Chuang, Yung-Sung, Chi-Liang Liu, and Hung-Yi Lee. "SpeechBERT: Cross-modal pre-trained language model for end-to-end spoken question answering." arXiv preprint arXiv:1910.11559 (2019).
• Liu, Yuchen, et al. "End-to-end speech translation with knowledge distillation." arXiv preprint arXiv:1904.08075 (2019).
• Ravanelli, Mirco, and Yoshua Bengio. "Speaker recognition from raw waveform with sincnet." 2018 IEEE Spoken Language
Technology Workshop (SLT). IEEE, 2018.
• Wolf, Thomas, et al. "HuggingFace's Transformers: State-of-the-art Natural Language Processing." arXiv preprint arXiv:1910.03771 (2019).
28
Thank you!