Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Beamformer: Method for Target Source Extraction using Magnitude Spectrogram as Reference"
1. Paper introduction:
“Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction using Magnitude Spectrogram as Reference”
R&D Center Tokyo Laboratory 21
Sony Corporation
Copyright 2020 Sony Corporation
Atsuo Hiroe
Slides for the INTERSPEECH 2020 paper reading workshop (held 2020/11/20)
2. Self-introduction
Name: Atsuo Hiroe
Career and main external publications:
1996: Graduated from Tokyo Institute of Technology and joined Sony. Since then, engaged in R&D on signal processing, speech recognition, spoken dialogue, and related areas.
2006: At ICA2006, presented a solution to the permutation problem of independent component analysis (ICA). (Also related to this talk.)
  Title: “Solution of permutation problem in frequency domain ICA, using multivariate probability density functions”
2007: At ICA2007, presented on jointly solving source separation and dereverberation.
  Title: “Blind Vector Deconvolution: Convolutive Mixture Models in Short-Time Fourier Transform Domain”
2009: Explained the 2006 work (including IVA) in an invited paper for the IEICE.
  Title: “Permutation-free frequency-domain independent component analysis” (in Japanese)
2014–2016: Seconded to the National Institute of Information and Communications Technology (NICT); engaged in R&D on cross-lingual (multilingual) spoken dialogue systems.
  Explanatory video: https://www.youtube.com/watch?v=xj1rMEbGICQ
2020: At INTERSPEECH 2020, presented a novel beamformer that can be combined with DNNs. (The paper introduced in this talk.)
  Title: “Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction using Magnitude Spectrogram as Reference”
Side note: at the same conference (ICA2006), two other presentations proposed similar ideas (a coincidence):
  T. Kim, T. Eltoft, and T. W. Lee, “Independent vector analysis: An extension of ICA to multivariate components”
  I. Lee, T. Kim, and T. W. Lee, “Complex fastIVA: A robust maximum likelihood approach of MICA for convolutive BSS”
Because they named their method Independent Vector Analysis (IVA), Hiroe's method is now also referred to as IVA.
Lesson: when you devise a new method, give it a catchy name and promote it actively!
22. Summary
• Proposed the Similarity-and-Independence-Aware Beamformer (SIBF), a new method for target source extraction that uses a reference.
• To achieve “extraction result > reference” (i.e., output quality exceeding the reference), devised a new framework that extends deflation-type independent component analysis (ICA):
  A) It takes into account not only independence but also dependence on the reference.
  B) To express independence, devised two source models: the TV Gaussian and the BS Laplacian.
  C) Derived formulas for obtaining the extraction filter.
• Experiments on the CHiME-3/4 dataset confirmed that “extraction result > reference” is achieved.
Closing words: SIBF straddles the fields of ICA and beamforming (BF); I hope this presentation will further stimulate research in both fields.
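To make the summary above more concrete, here is a rough sketch of how a TV-Gaussian-style extraction filter could be computed for one frequency bin: pre-whiten the mixture (as in deflation-type ICA), weight each frame's covariance by the inverse squared reference magnitude, and take the eigenvector with the smallest eigenvalue. This is not the paper's exact derivation; the function name, the eps flooring, and the projection-back rescaling are my own illustrative choices.

```python
import numpy as np

def sibf_tv_gaussian_bin(X, r, eps=1e-8):
    """Illustrative SIBF-style extraction for one frequency bin (TV Gaussian model).

    X : (mics, frames) complex STFT of the observed mixture at this bin.
    r : (frames,) reference magnitude spectrogram at this bin.
    Returns a (frames,) complex spectrogram of the extracted target.
    """
    M, T = X.shape
    # 1) Pre-whitening: decorrelate the channels, a standard deflation-type ICA step.
    Rx = X @ X.conj().T / T
    d, E = np.linalg.eigh(Rx)
    P = E @ np.diag(1.0 / np.sqrt(np.maximum(d, eps))) @ E.conj().T
    Z = P @ X
    # 2) Weighted covariance: under a time-varying Gaussian source model, the
    #    reference magnitude acts as a time-varying standard deviation, so each
    #    frame is weighted by 1 / r(t)^2.
    w_t = 1.0 / np.maximum(r, eps) ** 2
    C = (Z * w_t) @ Z.conj().T / T
    # 3) The filter is the eigenvector with the SMALLEST eigenvalue, which
    #    minimizes the weighted output power (the negative log-likelihood).
    vals, vecs = np.linalg.eigh(C)
    f = vecs[:, 0]
    y = f.conj() @ Z
    # 4) Rescale by projection back onto microphone 0 (minimal-distortion style).
    g = (X[0] @ y.conj()) / max(np.vdot(y, y).real, eps)
    return g * y
```

With an oracle reference (the true target magnitude), this kind of weighted-covariance filter recovers the target closely; in the actual method the reference would instead come from, e.g., a DNN-based magnitude estimate.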
23. Appendix: correspondence of input/output data across the figures
In each figure, data with the same meaning is drawn in the same color to clarify the correspondence:
target source extraction using a reference (general overview), SIBF outline (workflow), SIBF framework, and the experimental evaluation setup.
24. Appendix: target-source-extraction papers presented at INTERSPEECH 2020 (1/2)
These were presented intensively in the session “Targeted Source Separation”.
Mon-3-11-1 SpEx+: A Complete Time Domain Speaker Extraction Network
Mon-3-11-2 Atss-Net: Target Speaker Separation via Attention-based Neural Network
Mon-3-11-3 Multimodal Target Speech Separation with Voice and Face References
Mon-3-11-4 X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
Mon-3-11-5 Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation
Mon-3-11-6 A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments
Mon-3-11-7 Time-Domain Target-Speaker Speech Separation With Waveform-Based Speaker Embedding
Mon-3-11-8 Listen to What You Want: Neural Network-based Universal Sound Selector
Mon-3-11-9 Crossmodal Sound Retrieval based on Specific Target Co-occurrence Denoted with Weak Labels
Mon-3-11-10 Speaker-Aware Monaural Speech Separation
25. Appendix: target-source-extraction papers presented at INTERSPEECH 2020 (2/2)
Target source extraction also appeared in other sessions.
Mon-1-2-2 Neural Spatio-Temporal Beamformer for Target Speech Separation
Wed-2-5-4 VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Wed-3-8-2 Microphone Array Post-filter for Target Speech Enhancement Without a Prior Information of Point Interferers
Wed-3-8-3 Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction using Magnitude Spectrogram as Reference (the presentation introduced in this talk)
26. References (1/2)
[1] S. J. Chen, A. S. Subramanian, H. Xu, and S. Watanabe, “Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline,” in Proc. INTERSPEECH, pp. 1571–1575, 2018.
[2] J. Du, Q. Wang, T. Gao, Y. Xu, L. Dai, and C. H. Lee, “Robust speech recognition with speech enhanced deep neural networks,” in Proc. INTERSPEECH, 2014.
[3] D. Liu, P. Smaragdis, and M. Kim, “Experiments on deep learning for speech denoising,” in Proc. INTERSPEECH, 2014.
[4] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in Proc. ICASSP, 2018.
[5] Q. Wang et al., “VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking,” in Proc. INTERSPEECH, 2019.
[6] M. Mizumachi and M. Origuchi, “Advanced delay-and-sum beamformer with deep neural network,” in Proc. 22nd Int. Congr. Acoust., 2016.
[7] M. Mizumachi, “Neural Network-based Broadband Beamformer with Less Distortion,” pp. 2760–2764, 2019.
[8] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Comput. Speech Lang., vol. 46, pp. 535–557, 2017.
[9] L. Wang, J. D. Reiss, and A. Cavallaro, “Over-Determined Source Separation and Localization Using Distributed Microphones,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 9, pp. 1569–1584, 2016.
[10] N. Murata, S. Ikeda, and A. Ziehe, “An approach to blind source separation based on temporal structure of speech signals,” Neurocomputing, 2001.
[11] K. Matsuoka, “Minimal distortion principle for blind source separation,” pp. 2138–2143, 2003.
[12] J. X. Mi, “A novel algorithm for independent component analysis with reference and methods for its applications,” PLoS One, vol. 9, no. 5, 2014.
[13] Q. H. Lin, Y. R. Zheng, F. L. Yin, H. Liang, and V. D. Calhoun, “A fast algorithm for one-unit ICA-R,” Inf. Sci., 2007.
[14] M. Castella, S. Rhioui, E. Moreau, and J. C. Pesquet, “Quadratic higher order criteria for iterative blind separation of a MIMO convolutive mixture of sources,” IEEE Trans. Signal Process., vol. 55, no. 1, pp. 218–232, 2007.
[15] L. Gao, N. Zheng, Y. Tian, and J. Zhang, “Target signal extraction method based on enhanced ICA with reference,” Math. Probl. Eng., vol. 2019, 2019.
[16] N. Makishima et al., “Independent Deeply Learned Matrix Analysis for Determined Audio Source Separation,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 10, pp. 1601–1615, 2019.
[17] H. Erdogan, J. Hershey, S. Watanabe, M. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in Proc. INTERSPEECH, pp. 1981–1985, 2016.
[18] Y. Kubo, T. Nakatani, M. Delcroix, K. Kinoshita, and S. Araki, “Mask-based MVDR Beamformer for Noisy Multisource Environments: Introduction of Time-varying Spatial Covariance Model,” in Proc. ICASSP, pp. 6855–6859, 2019.
[19] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. ICASSP, pp. 196–200, 2016.
[20] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,” in Proc. ASRU 2015, pp. 444–451, 2016.
27. References (2/2)
[21] A. Hyvärinen, J. Karhunen, and E. Oja, “ICA by Minimization of Mutual Information,” in Independent Component Analysis, 2003.
[22] A. Hyvärinen, J. Karhunen, and E. Oja, “ICA by Maximum Likelihood Estimation,” in Independent Component Analysis, 2003.
[23] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Trans. Audio Speech Lang. Process., 2016.
[24] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 9, pp. 1652–1664, 2016.
[25] A. Hiroe, “Solution of permutation problem in frequency domain ICA, using multivariate probability density functions,” Lect. Notes Comput. Sci., vol. 3889, pp. 601–608, 2006.
[26] T. Kim, T. Eltoft, and T. W. Lee, “Independent vector analysis: An extension of ICA to multivariate components,” Lect. Notes Comput. Sci., vol. 3889, pp. 165–172, 2006.
[27] I. Lee, T. Kim, and T. W. Lee, “Complex fastIVA: A robust maximum likelihood approach of MICA for convolutive BSS,” Lect. Notes Comput. Sci., vol. 3889, pp. 625–632, 2006.
[28] T. Kim, H. T. Attias, S. Y. Lee, and T. W. Lee, “Blind source separation exploiting higher-order frequency dependencies,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 1, pp. 70–79, 2007.
[29] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in Proc. WASPAA, pp. 189–192, 2011.
[30] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Proc. ASRU 2015, 2016.
28. SONY is a registered trademark of Sony Corporation.
Names of Sony products and services are the registered trademarks and/or trademarks of Sony Corporation or its Group companies.
Other company names and product names are registered trademarks and/or trademarks of the respective companies.