2. • MIR
‣ 2017 DJ music
classification
‣ ...
‣ ...
• ^_^
‣
- ML for Music
- Automated feature engineering using RL
ICASSP 2018 😆
3. Music Classification, Why?
• classification “representation learning”
‣ Music streaming service
?
‣ content-based recommendation
• Music streaming service
‣ !
4. End-to-end Music Classification
• History of E2E Music Classification Models
‣ E2E ?
• Interpretability of E2E Music Classification Models
5. • Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms (2017)
Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim and Juhan Nam,
Sound and Music Computing Conference (SMC), 2017.
‣ End-to-end approach to music classification
‣ Frequency
• Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms (2018)
Taejun Kim, Jongpil Lee and Juhan Nam,
IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018.
‣ CNN architecture
‣ Loudness
12. Convolutional Filters Decouple Time & Frequency Resolution
• Conv. filters avoid the time & frequency resolution trade-off
• Convolution resolution:
‣ Time resolution is set by the stride
- Stride↓ time resolution↑
‣ Frequency resolution is set by the number of filters
- #filters↑ frequency resolution↑
‣ Stride and #filters set time & frequency resolution independently
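A minimal sketch of this decoupling, assuming a plain NumPy strided 1D convolution with random (untrained) filters. The function `strided_conv1d` and all sizes are illustrative and not taken from the slides. The stride alone sets the number of output time steps, and the number of filters alone sets the number of output channels:

```python
import numpy as np

def strided_conv1d(x, filters, stride):
    """Valid 1D convolution. x: (n_samples,), filters: (n_filters, filter_size)."""
    filter_size = filters.shape[1]
    n_frames = (len(x) - filter_size) // stride + 1
    # Gather every frame the filters slide over: (n_frames, filter_size).
    idx = stride * np.arange(n_frames)[:, None] + np.arange(filter_size)[None, :]
    return x[idx] @ filters.T  # (n_frames, n_filters) = (time, channel)

rng = np.random.default_rng(0)
x = rng.normal(size=729)  # hypothetical short waveform

coarse_time = strided_conv1d(x, rng.normal(size=(16, 9)), stride=9)
fine_time = strided_conv1d(x, rng.normal(size=(16, 9)), stride=3)   # smaller stride
fine_freq = strided_conv1d(x, rng.normal(size=(64, 9)), stride=9)   # more filters
print(coarse_time.shape, fine_time.shape, fine_freq.shape)
# (81, 16) (241, 16) (81, 64)
```

Lowering the stride changes only the time axis, and adding filters changes only the channel axis.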
13. Frame-level Raw Waveform Model
• 2014: E2E music classification
‣ But it did not outperform the spectrogram model
- , E2E music classification
• The STFT is replaced with a 1D strided conv. layer
‣ The 1D conv. output takes the place of the spectrogram as the mid-level representation
• The strided conv. output has the same hyperparameters as a spectrogram
‣ Filter size (= window size of STFT)
‣ Stride (= hop size of STFT)
[Figure: spectrogram input vs. waveform input with one 1D conv. layer, each followed by the net.]
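The correspondence between a strided conv. layer and the STFT can be checked numerically. The sketch below is an illustration under stated assumptions: the filters are fixed to Hann-windowed Fourier bases (a learned layer is not constrained this way), and the helper names and sizes are hypothetical. With filter size equal to the window size and stride equal to the hop size, the conv. output reproduces the STFT magnitude:

```python
import numpy as np

def frame_signal(x, size, hop):
    """Slice x into overlapping frames: (n_frames, size)."""
    n_frames = (len(x) - size) // hop + 1
    idx = hop * np.arange(n_frames)[:, None] + np.arange(size)[None, :]
    return x[idx]

def stft_magnitude(x, window_size, hop_size):
    frames = frame_signal(x, window_size, hop_size) * np.hanning(window_size)
    return np.abs(np.fft.rfft(frames, axis=1))

def fourier_filters(filter_size):
    """Hann-windowed cosine and sine filters, one pair per frequency bin."""
    n = np.arange(filter_size)
    k = np.arange(filter_size // 2 + 1)[:, None]
    window = np.hanning(filter_size)
    real = window * np.cos(2 * np.pi * k * n / filter_size)
    imag = -window * np.sin(2 * np.pi * k * n / filter_size)
    return real, imag

def strided_conv1d(x, filters, stride):
    """Valid strided 1D convolution: (n_frames, n_filters)."""
    return frame_signal(x, filters.shape[1], stride) @ filters.T

rng = np.random.default_rng(0)
x = rng.normal(size=4096)          # hypothetical waveform
window_size, hop_size = 512, 256   # illustrative STFT settings

real, imag = fourier_filters(window_size)               # filter size = window size
conv_mag = np.hypot(strided_conv1d(x, real, hop_size),  # stride = hop size
                    strided_conv1d(x, imag, hop_size))
stft_mag = stft_magnitude(x, window_size, hop_size)
print(conv_mag.shape, np.allclose(conv_mag, stft_mag))
```

In a frame-level raw waveform model the filters are learned rather than fixed, so the layer can settle on a representation other than the Fourier one.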
17. Sample-level Raw Waveform Model
• E2E model spectrogram
• E2E music classification model
Comparison with frame-level models
Comparison with state-of-the-art models
21. SE Block
• From SENet (winner of the 2017 ImageNet challenge)
• Motivation:
‣ Model the relationships between channels and recalibrate the channels
- Channels are frequency-like
- Rescale (recalibrate) each channel by a weight in the range 0~1
[Figure: SE block. Layers: Conv1D, BatchNorm, relu, MaxPool, GlobalAvgPool, FC, relu, FC, sigmoid, Scale. Shapes: T×C, 1×C, 1×αC, 1×C, T×C, T×C.]
[Chart: AUC on MagnaTagATune. Labels: Basic, Res-2, SE. Values: 0.9083, 0.9061, 0.9055. Annotation: basic block, 1.08.]
22. SE Block for Image (2D Conv.)
Squeeze operation:
• Aggregate spatial dimensions
• Produce channel-wise statistics
Excitation operation:
• Using the statistics, learn channel relationships
• Produce weight for each channel
Global spatial information for each channel
Excitations (range 0~1): weight for each channel
Reweight each channel using the weights
23. SE Block for Audio (1D Conv.)
[Figure: feature map with axes time and channel (or frequency-like); global temporal statistics for each channel.]
Squeeze operation:
• Aggregate temporal dimensions
• Produce frequency-wise statistics
Excitation operation:
• Using the statistics, learn frequency relationships
• Produce weight for each frequency
Excitations (range 0~1): weight for each frequency
Reweight each frequency using the weights
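The squeeze, excitation, and reweighting steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the weights are random rather than trained, `se_block_1d` is a hypothetical helper, and the sizes (T = 243, C = 8, α = 2) are arbitrary.

```python
import numpy as np

def se_block_1d(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on a (T, C) feature map.

    w1: (C, hidden), b1: (hidden,), w2: (hidden, C), b2: (C,)
    """
    squeezed = x.mean(axis=0)                       # squeeze over time: (C,)
    hidden = np.maximum(squeezed @ w1 + b1, 0.0)    # FC + relu: (hidden,)
    excitation = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))  # FC + sigmoid: (C,), in 0~1
    return x * excitation, excitation               # scale every time step: (T, C)

rng = np.random.default_rng(0)
T, C, alpha = 243, 8, 2          # illustrative sizes; hidden width is alpha * C
x = rng.normal(size=(T, C))
w1 = rng.normal(size=(C, alpha * C))
b1 = np.zeros(alpha * C)
w2 = rng.normal(size=(alpha * C, C))
b2 = np.zeros(C)

scaled, excitation = se_block_1d(x, w1, b1, w2, b2)
print(scaled.shape, excitation.shape)
```

The same weight is applied to a channel at every time step, which is what makes the excitation a per-channel (frequency-like) gain.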
24. SE Block for Audio (1D Conv.)
[Figure: SE block. Layers: Conv1D, BatchNorm, relu, MaxPool, GlobalAvgPool, FC, relu, FC, sigmoid, Scale. Shapes: T×C, 1×C, 1×αC, 1×C, T×C, T×C.]
25. Difference with Original SE Block
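• The original SENet reduces the dimension in the FC layer
‣ 𝑟: reduction ratio
• Here, the FC layer amplifies it instead
‣ 𝜶: amplifying ratio
• The original SENet reduces by a factor of 16, whereas here it is amplified by a factor of 16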
‣ Audio channel
?
[Figure: SE block. Layers: Conv1D, BatchNorm, relu, MaxPool, GlobalAvgPool, FC, relu, FC, sigmoid, Scale. Shapes: T×C, 1×C, 1×αC, 1×C, T×C, T×C.]
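To make the difference concrete, the sketch below compares the hidden width and parameter count of the two FC layers under a reduction ratio and under an amplifying ratio. The values C = 128, r = 16 and α = 16 are assumptions chosen for illustration, and `se_fc_params` is a hypothetical helper.

```python
def se_fc_params(n_channels, hidden_units):
    """Weights and biases of the two FC layers: C -> hidden -> C."""
    first = n_channels * hidden_units + hidden_units
    second = hidden_units * n_channels + n_channels
    return first + second

C = 128                     # assumed channel count
reduced_hidden = C // 16    # reduction: C / r with r = 16
amplified_hidden = C * 16   # amplification: alpha * C with alpha = 16

print(reduced_hidden, se_fc_params(C, reduced_hidden))      # 8 2184
print(amplified_hidden, se_fc_params(C, amplified_hidden))  # 2048 526464
```

With so few channels, a reduced bottleneck leaves only a handful of hidden units, whereas amplification gives the excitation path much more capacity at the cost of more parameters.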
32. Interpretability of Deep Learning
•
• interpretability
‣ vision
‣ Audio
• Weapons of Math Destruction ( )
— , “ ”
• ,
‣ ( )
‣ “ ?” “ ?” ...
33. SampleCNN Filter Visualization
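• Visualize the frequency of the signal that each channel responds to
• Channels are sorted by frequency
• In upper layers, the filters are spaced nonlinearly (e.g. mel-scale)
‣ i.e. like piano keys
• Upper layers concentrate on low frequencies
[Figure: filter visualization; axes: sorted channel index, frequency (0~11 kHz).]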
34. Back-propagating from a Channel to the Input
Filter Viz. Process Example: Channel 3 of Layer 1
• Initialize the input randomly (random noise)
• Backprop. to the input
• STFT
[Figure: network stack. Raw waveform 59049 × 1, strided conv 19683 × 128, convolutional block × 9 with outputs 6561 × 128, 2187 × 128, 729 × 256, 243 × 256, 81 × 256; the result is plotted as time vs. frequency.]
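The process can be sketched for a single channel of a first strided conv. layer. This is a simplified illustration under several assumptions: the filter is a hand-made Hann-windowed 2 kHz cosine instead of a trained one, the sampling rate is taken to be 22050 Hz, the objective is the mean squared activation under a unit-norm input, and one FFT over the whole input stands in for the STFT. All names are hypothetical.

```python
import numpy as np

SR = 22050  # assumed sampling rate

def forward(x, w, stride):
    """Strided valid 1D convolution (cross-correlation) with one filter."""
    return np.correlate(x, w, mode="valid")[::stride]

def backward(grad_y, w, stride, n_samples):
    """Gradient of the conv. output w.r.t. the input (backprop. to the input)."""
    upsampled = np.zeros(n_samples - len(w) + 1)
    upsampled[::stride] = grad_y
    return np.convolve(upsampled, w, mode="full")

def visualize_channel(w, stride=3, n_samples=2048, steps=200, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_samples)        # initialize input with random noise
    x /= np.linalg.norm(x)
    for _ in range(steps):
        y = forward(x, w, stride)
        grad_y = 2.0 * y / len(y)         # d(mean(y**2)) / dy
        grad_x = backward(grad_y, w, stride, n_samples)
        x = x + lr * grad_x / (np.linalg.norm(grad_x) + 1e-12)  # gradient ascent
        x /= np.linalg.norm(x)            # keep the input at unit norm
    spectrum = np.abs(np.fft.rfft(x))     # frequency content of the optimized input
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / SR)
    return x, freqs[np.argmax(spectrum)]

t = np.arange(64) / SR
toy_filter = np.hanning(64) * np.cos(2 * np.pi * 2000.0 * t)  # hypothetical filter
optimized_input, peak_hz = visualize_channel(toy_filter)
print(round(peak_hz))
```

The optimized input concentrates its energy near the frequency the filter is tuned to, which is the quantity plotted per channel in the filter visualization.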
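36. Excitation Visualization
• Visualize the channel excitations of each SE block per tag
• Mid blocks perform general signal processing; the last block performs discriminative signal processing
• The first block's excitations differ by tag!
‣ Loudness
[Figure: excitation vs. sorted channel index.]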
38. Analysis of the First Excitation
[Figure: excitations for the 50 tags; axes: sorted channel index, excitation.]
39. Analysis of the First Excitation
• Each point is an audio segment
• Each line is a linear regression line
• The SE block appears to normalize loudness
• But ,
[Figure panels: Average of 128 Channels, Most Positive Channel, Most Negative Channel, Most Neutral Channel, Least Regression Error Channel; axes: loudness, excitation.]
40. Analysis of the First Excitation
• Linear regression for each channel
• Most channels normalize loudness (negative slope)
‣ #negative = 109
‣ #positive = 19
• Loudness excitation
?
[Figure axes: loudness, excitation.]
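The per-channel analysis amounts to fitting one regression line per channel and counting the sign of each slope. The sketch below uses synthetic data with five channels purely for illustration; the loudness scale, slopes, and noise level are assumptions, and none of the numbers come from the slides.

```python
import numpy as np

def excitation_slopes(loudness, excitations):
    """Slope of excitation vs. loudness for every channel.

    loudness: (n_segments,), excitations: (n_segments, n_channels)
    """
    # np.polyfit fits every column of a 2D y at once; row 0 holds the slopes.
    slopes, _intercepts = np.polyfit(loudness, excitations, deg=1)
    return slopes

rng = np.random.default_rng(0)
loudness = rng.uniform(-40.0, 0.0, size=200)   # synthetic loudness per audio segment
true_slopes = np.array([-0.010, -0.005, 0.008, -0.002, 0.004])  # synthetic
excitations = (0.5 + (loudness[:, None] + 20.0) * true_slopes
               + rng.normal(scale=0.01, size=(200, 5)))

slopes = excitation_slopes(loudness, excitations)
n_negative = int((slopes < 0).sum())
n_positive = int((slopes > 0).sum())
print(n_negative, n_positive)  # 3 2
```

A channel with a negative slope lowers its excitation as the segment gets louder, which is the sense in which such a channel normalizes loudness.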
41. Variation of Excitations Increases with Loudness
• audio segment excitation channel
• segment ( )
• The louder the segment, the more the excitations vary