Context-based modeling of audio signals toward information retrieval

Contextual modeling of audio signals
toward information retrieval
Samuel Kim Ph.D.
Given Zone, LLC
allthatsignal@gmail.com
http://allthatsignal.com
All that signal; for the people, by the people, of the people © Given Zone, LLC

Audio Information Retrieval

Open Challenges
Heterogeneous …

Proposed approach
Context-based approach rather than content-based

Hypothesis
▪ An acoustic scene consists of a set of acoustic topics.
▪ Each acoustic topic has a probability distribution over acoustic words.
[Figure: an acoustic scene, its acoustic topics, and the acoustic words (signal characteristics) under each topic]

Problem formulation
Ball-picking problem (a.k.a. urn problem): Multinomial
What if the number of balls, i.e., …, is random?
Conjugate prior of the multinomial: Dirichlet distribution
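The urn setup above can be made concrete with a small simulation. This is a minimal sketch, not the author's code: the three ball colors, the Dirichlet parameters in `alpha`, and the helper names are all assumptions chosen for illustration. The urn composition is first drawn from a Dirichlet distribution, then balls are picked from that urn.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_multinomial(n, probs, rng):
    """Pick n balls with replacement and count how often each color appears."""
    counts = [0] * len(probs)
    for color in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[color] += 1
    return counts

rng = random.Random(0)
alpha = [2.0, 1.0, 0.5]                      # hypothetical Dirichlet parameters, 3 colors
urn = sample_dirichlet(alpha, rng)           # random urn composition
counts = sample_multinomial(10, urn, rng)    # 10 picks from that urn
print(urn, counts)
```

Because the Dirichlet is conjugate to the multinomial, the posterior over the urn composition after seeing `counts` is again a Dirichlet, with parameters `alpha[i] + counts[i]`.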

Latent Dirichlet Allocation (LDA)
▪ Graphical representation of LDA
[Figure: graphical model with nodes labeled Dirichlet parameter, topic distributions, topic, word, word distribution given topic, and Dirichlet parameter]
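The graphical model can be read as a generative recipe: draw a topic distribution for the document, then for each word draw a topic and a word given that topic. The following is a rough sketch of that standard LDA generative story. The sizes, the symmetric Dirichlet parameters, and the function names are assumptions for illustration, not values from the slides.

```python
import random

def dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, topic_word, rng):
    """Generate one document (here: one acoustic scene) under the LDA model."""
    theta = dirichlet(alpha, rng)            # topic distribution of this document
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]        # topic for this word
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]  # word given topic
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(1)
n_topics, vocab_size = 3, 8                  # toy sizes (assumed)
alpha = [0.5] * n_topics                     # Dirichlet parameter for topic distributions
beta = [0.5] * vocab_size                    # Dirichlet parameter for word distributions
topic_word = [dirichlet(beta, rng) for _ in range(n_topics)]
theta, topics, words = generate_document(20, alpha, topic_word, rng)
print(words)
```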

Approximation
▪ Infer/Model process
• Involves intractable computations
▪ Approximation methods
• Gibbs sampling method [Steyvers2007]
  ▪ A form of Markov Chain Monte Carlo (MCMC)
• Variational approximation [Blei2003], etc.
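To illustrate the Gibbs sampling option, here is a compact sketch of a collapsed Gibbs sampler for LDA in its common textbook form. It is not taken from the slides: the hyperparameter values, iteration count, toy documents, and function name are assumptions, and a practical system would use a tested library implementation.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # counts n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # counts n(k, w)
    topic_total = [0] * n_topics                               # counts n(k)
    assign = []
    for d, doc in enumerate(docs):                             # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                               # remove current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k                               # record the new assignment
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Point estimate of each document's topic distribution from the final counts
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# Toy corpus (assumed): two documents use words 0-2, two use words 3-5
docs = [[0, 1, 2, 0, 1, 2, 0, 1], [3, 4, 5, 3, 4, 5, 4, 3],
        [0, 0, 1, 2, 2, 1, 0, 2], [5, 5, 4, 3, 3, 4, 5, 3]]
theta, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(theta)
```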

Interpretation
Latent Acoustic Topics
[Figure: audio features, acoustic words, and latent acoustic topics]
Acoustic Topics
• Probabilistic (soft) clustering in terms of acoustic words’ co-occurrence
• Probabilistic assignment to individual topics (in the figure, the size represents the probability)
Acoustic Words
• Discrete symbols of acoustic characteristics
• Play a role similar to that of words in text
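One common way to turn continuous audio features into discrete symbols is vector quantization against a codebook. The slides do not state how the acoustic words are obtained, so the sketch below is an assumption: the 2-D features, the 3-entry `codebook`, and the function names are hypothetical. Each frame is mapped to its nearest codeword, and the resulting word counts play the role of a bag-of-words document.

```python
def nearest_codeword(frame, codebook):
    """Index of the codebook vector closest to the frame (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(frame, code))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best

def to_acoustic_words(frames, codebook):
    """Map each feature frame to a discrete acoustic word (a codebook index)."""
    return [nearest_codeword(frame, codebook) for frame in frames]

def word_histogram(words, vocab_size):
    """Count how often each acoustic word occurs in a clip."""
    counts = [0] * vocab_size
    for w in words:
        counts[w] += 1
    return counts

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]        # hypothetical 3-word codebook
frames = [[0.1, -0.2], [0.9, 1.2], [3.7, 0.3], [1.1, 0.8]]
words = to_acoustic_words(frames, codebook)
hist = word_histogram(words, len(codebook))
print(words, hist)   # [0, 1, 2, 1] [1, 2, 1]
```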

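For Classification Applications
▪ Two-step Learning
• Unsupervised modeling: Acoustic Topic Model
• Supervised classifier: Classifiers (Multiclass SVM)
[Figure: block diagram. Training phase: Training DB, Acoustic Topic Model, Topic Distribution Probability, Classifiers (Multiclass SVM). Test phase: Test signal, passed through the same blocks.]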
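The two-step structure can be sketched end to end on toy data. Everything here is a stand-in rather than the method on the slides. The topic-word probabilities in `topic_word` are hypothetical and assumed to come from step one. Topic distributions are estimated with a simple EM fold-in instead of full LDA inference. A nearest-centroid rule replaces the multiclass SVM to keep the example dependency-free.

```python
def infer_topic_distribution(word_counts, topic_word, n_iter=50):
    """Estimate a clip's topic distribution by EM, with topic-word probabilities fixed."""
    n_topics = len(topic_word)
    theta = [1.0 / n_topics] * n_topics
    total = sum(word_counts)
    for _ in range(n_iter):
        new_theta = [0.0] * n_topics
        for w, count in enumerate(word_counts):
            if count == 0:
                continue
            resp = [theta[k] * topic_word[k][w] for k in range(n_topics)]
            norm = sum(resp)
            for k in range(n_topics):
                new_theta[k] += count * resp[k] / norm   # expected topic counts
        theta = [t / total for t in new_theta]
    return theta

def train_centroids(features, labels):
    """Step two (stand-in for the SVM): mean topic distribution per class."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for k, value in enumerate(feat):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(feat, centroids):
    """Assign the label whose centroid is closest to the topic distribution."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(feat, center))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical output of step one: 2 topics over 4 acoustic words
topic_word = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
train_counts = [[8, 7, 1, 0], [9, 6, 0, 1], [0, 1, 7, 8], [1, 0, 9, 6]]
train_labels = ["news", "news", "music", "music"]

train_features = [infer_topic_distribution(c, topic_word) for c in train_counts]
centroids = train_centroids(train_features, train_labels)

test_feature = infer_topic_distribution([1, 1, 6, 8], topic_word)
print(classify(test_feature, centroids))   # music
```

The point of the split is that the topic model needs no labels, so only the comparatively small classifier in step two depends on annotated data.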

Possible Applications
Content Identification
• Music Information Retrieval [Levi2009]
• Audio Fingerprinting [Kim2012]
Audio Scene Analysis
• Understanding auditory scene [Kim2009]
• Environmental sound classification [Kim2010]
User Modeling
• Behavioral Analysis [Kim2011]
• Emotion recognition [Kim2013]

Target application
Automatic classification of TV genre using audio content

Scenarios
▪ Off-line
• Assumes the system knows when the program starts and ends.
• Prior segmentation required.
▪ On-line
• Makes decisions without prior segmentation
  ▪ Every X seconds
  ▪ Online scene detection, etc.
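The "every X seconds" variant of the on-line scenario amounts to cutting the incoming stream into fixed-length segments and deciding on each one independently. A minimal sketch follows, assuming one acoustic word per frame. The frame rate, the segment length, and `toy_classifier` are placeholders, not details from the slides.

```python
def segment_stream(words, frames_per_second, segment_seconds):
    """Split per-frame acoustic words into consecutive fixed-length segments.

    An incomplete trailing segment is dropped.
    """
    size = int(frames_per_second * segment_seconds)
    if size <= 0:
        raise ValueError("segment must contain at least one frame")
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]

def online_decisions(words, frames_per_second, segment_seconds, classify):
    """Make one decision per segment, with no prior program segmentation."""
    segments = segment_stream(words, frames_per_second, segment_seconds)
    return [classify(segment) for segment in segments]

def toy_classifier(segment):
    """Placeholder for the trained topic model plus classifier."""
    return "A" if sum(segment) / len(segment) < 10 else "B"

stream = list(range(25))                 # 25 frames of made-up acoustic words
segments = segment_stream(stream, frames_per_second=5, segment_seconds=2)
decisions = online_decisions(stream, 5, 2, toy_classifier)
print(decisions)   # ['A', 'B']
```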

Scenarios
▪ Models are trained in an off-line manner
[Figure: block diagram. Training phase: Training DB, Acoustic Topic Model, Topic Distribution Probability, Classifiers (Multiclass SVM). Test phase: Test signal, Segmentation (On/Off), Off-line result, On-line result.]

Dataset
▪ RAI dataset
• Provides a benchmarking test-bed (6-fold cross-validation)
• Italian TV broadcast programs
• 7 genres
• 262 programs (15 min per program)
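For readers unfamiliar with 6-fold cross-validation, the sketch below splits 262 program indices into six folds, each used once as the test set. How the benchmark actually assigns programs to folds is not stated in the slides, so the random shuffling, the seed, and the function name here are assumptions.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(262, 6))
for train, test in splits:
    print(len(train), len(test))
```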

Off-line classification
▪ Overall accuracy
• Comparison with conventional content-based approaches
▪ Competitive results using only audio content
Accuracy (%)    MLP* [2007]   MLP* [2009]   SVM** [2010]   GMM (64 mixtures)   ATM (64 topics, 2,048 words)
Audio only      -             -             86.6           93.6                94.3
Audio-visual    92.0          94.9          99.6           -                   -
* MLP: Multilayer Perceptron
** SVM: Support Vector Machine
[2007] M. Montagnuolo and A. Messina, “TV Genre Classification Using Multimodal Information and Multilayer Perceptrons,” LNAI 4733, 2007.
[2009] M. Montagnuolo and A. Messina, “Parallel neural networks for multimodal video genre classification,” Multimedia Tools and Applications, vol. 41, 2009.
[2010] H. Ekenel, et al., “Content-based Video Genre Classification Using Multiple Cues,” ACM, 2010.

Off-line classification
▪ Confusion matrix
• ATM
• GMM
[Figure: confusion matrices for ATM and GMM over the seven genres]
Genre codes: CT = Cartoon, CM = Commercial, FB = Football, MU = Music, NE = News, TS = Talk show, WF = Weather Forecast

On-line classification
▪ Accuracy according to length of segments
[Figure: accuracy (%) versus time (s) for ATM and GMM; time axis from 0 to 6, accuracy axis from 68 to 82]

On-line classification
▪ Per-class F-measure
[Figure: per-class F-measure, with panels for 1-second and 6-second segments]
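As a reminder of how a per-class F-measure is computed, the sketch below derives precision, recall, and their harmonic mean for each class from true and predicted labels. The balanced form (F1) is assumed, since the slides do not say which weighting was used, and the label sequences are made up.

```python
def per_class_f_measure(true_labels, predicted_labels):
    """Return the balanced F-measure (F1) for every class seen in either list."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == c and p == c)   # correctly labeled c
        fp = sum(1 for t, p in pairs if t != c and p == c)   # wrongly labeled c
        fn = sum(1 for t, p in pairs if t == c and p != c)   # missed instances of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[c] = 2 * precision * recall / (precision + recall)
        else:
            scores[c] = 0.0
    return scores

truth = ["NE", "NE", "MU", "MU", "FB", "FB"]     # made-up ground truth
pred = ["NE", "MU", "MU", "MU", "FB", "NE"]      # made-up predictions
scores = per_class_f_measure(truth, pred)
print(scores)
```

In this toy case NE scores 0.5, MU scores 0.8, and FB scores 2/3.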

Summary
▪ Genre of TV programs can be detected using only audio content
• Using context-based approach
▪ On-line and off-line tasks
• Competitive results with conventional audio-visual approaches in off-line tasks
• ATM outperforms GMM if segments are long enough in on-line tasks

Conclusions
▪ Acoustic Topic Model (ATM)
• Captures contextual information of audio signals by modeling the co-occurrence of text-like audio signals
• Can be used in various classification applications in combination with a supervised classifier
