6. /24
Hypothesis
An acoustic scene consists of a set of acoustic topics.
Each acoustic topic has a probability distribution over acoustic words.
Figure: acoustic scene → acoustic topics → acoustic words (signal characteristics)
7. /24
Problem formulation
Ball-picking problem (a.k.a. urn problem): Multinomial
What if the number of balls, i.e., , is random?
Conjugate prior of the multinomial: Dirichlet distribution
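The relationship between the multinomial and the Dirichlet distribution can be illustrated with a small numeric sketch. The prior, the observed counts, and all function names below are hypothetical and chosen only for illustration; the point is that a Dirichlet prior updated with multinomial counts is again a Dirichlet distribution.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def posterior_alpha(alpha, counts):
    """Conjugate update: Dirichlet prior plus multinomial counts gives a Dirichlet posterior."""
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Expected proportions under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

rng = random.Random(0)
prior = [1.0, 1.0, 1.0]   # hypothetical symmetric prior over 3 ball colors
counts = [5, 3, 2]        # hypothetical observed draws from the urn

posterior = posterior_alpha(prior, counts)   # [6.0, 4.0, 3.0]
theta = sample_dirichlet(posterior, rng)     # one plausible set of ball proportions
expected = dirichlet_mean(posterior)         # posterior mean of the proportions
```

Because the update is a simple addition of counts, no numerical integration is needed for this step.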
8. /24
Latent Dirichlet Allocation (LDA)
Graphical representation of LDA
Figure labels: Dirichlet parameter; topic distributions; topic; word; word distribution given topic; Dirichlet parameter
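The generative process behind LDA can be sketched in a few lines. This is a minimal illustration under assumed settings: the number of topics, the vocabulary size, the hyperparameter values, and all function names are hypothetical and are not taken from the slides.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the probability vector `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(n_words, alpha, topic_word, rng):
    """Generate one document: topic proportions, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)             # topic distribution of this document
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)           # pick a topic
        w = sample_categorical(topic_word[z], rng)   # pick a word given that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
n_topics, vocab_size = 3, 8        # hypothetical sizes
alpha = [0.5] * n_topics           # assumed Dirichlet parameter for topic proportions
beta = [0.5] * vocab_size          # assumed Dirichlet parameter for word distributions
topic_word = [sample_dirichlet(beta, rng) for _ in range(n_topics)]
theta, topics, words = generate_document(20, alpha, topic_word, rng)
```

In the acoustic setting, a "document" would correspond to an audio clip and a "word" to an acoustic word index.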
10. /24
Approximation
Infer/Model process
• Involves intractable computations, such as
Approximation methods
• Gibbs sampling method [Steyvers2007]
A form of Markov Chain Monte Carlo (MCMC)
• Variational approximation [Blei2003], etc.
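To make the Gibbs sampling option concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA. It assumes symmetric hyperparameters and a toy corpus of word indices; the function names, parameter values, and data are hypothetical, and the sketch is not claimed to match the exact procedure of the cited work.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling: resample each word's topic given all other assignments."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # counts per (document, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # counts per (topic, word)
    topic_total = [0] * n_topics                               # total words per topic
    assign = []
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def topic_proportions(counts, alpha):
    """Smoothed estimate of a document's topic distribution from its counts."""
    total = sum(counts) + len(counts) * alpha
    return [(n + alpha) / total for n in counts]

# Hypothetical toy corpus: each document is a list of word indices.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0, 1], [3, 4, 5, 5, 3]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
theta0 = topic_proportions(doc_topic[0], alpha=0.1)
```

The sampler only needs count tables, which is why it sidesteps the intractable normalization at the cost of many passes over the data.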
11. /24
Interpretation
Latent Acoustic Topics
Acoustic Topics
• Probabilistic assignment to individual topics (the size represents the probability)
Acoustic Words
• Derived from audio features
• Discrete symbols of acoustic characteristics
• Play a role similar to words in text
Probabilistic (soft) clustering in terms of the co-occurrence of acoustic words
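One common way to turn audio features into discrete symbols is vector quantization: each feature frame is replaced by the index of its nearest codeword. The sketch below assumes a given codebook and 2-dimensional toy frames; in practice the frames would be feature vectors such as MFCCs and the codebook would be learned from data. All names and values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codeword (its acoustic word)."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]

def word_histogram(words, vocab_size):
    """Count how often each acoustic word occurs in a clip."""
    counts = [0] * vocab_size
    for w in words:
        counts[w] += 1
    return counts

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]                  # hypothetical 3-word codebook
frames = [[0.1, -0.2], [0.9, 1.2], [3.8, 0.1], [1.1, 0.8]]       # hypothetical feature frames

words = quantize(frames, codebook)                # [0, 1, 2, 1]
histogram = word_histogram(words, len(codebook))  # [1, 2, 1]
```

The resulting word indices (or their counts per clip) are what a topic model would consume in place of text words.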
12. /24
Two-step Learning
For Classification Applications
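Figure: block diagram of the two-step pipeline
Training phase: Training DB → Acoustic Topic Model (unsupervised modeling) → Topic Distribution Probability → Classifiers (Multiclass SVM; supervised classifier)
Test phase: Test signal → Acoustic Topic Model → Topic Distribution Probability → Classifiers (Multiclass SVM)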
13. /24
Possible Applications
Content Identification
• Music Information Retrieval [Levi2009]
• Audio Fingerprinting [Kim2012]
Audio Scene Analysis
• Understanding auditory scenes [Kim2009]
• Environmental sound classification [Kim2010]
User Modeling
• Behavioral Analysis [Kim2011]
• Emotion recognition [Kim2013]
15. /24
Scenarios
Off-line
• Assumes the system knows when the program starts and ends.
• Prior segmentation required.
On-line
• Makes decisions every X seconds, without prior segmentation
• E.g., on-line scene detection
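The on-line scenario can be sketched as a sliding window over the stream of acoustic words, with one decision per window. The window length, the hop size, and the `dominant_word` function (a stand-in for a trained classifier) are hypothetical choices for illustration only.

```python
def sliding_windows(n_frames, window, hop):
    """Yield (start, end) index ranges, one decision window every `hop` frames."""
    start = 0
    while start + window <= n_frames:
        yield (start, start + window)
        start += hop

def online_decisions(words, window, hop, classify):
    """Apply `classify` to each window; no prior segmentation of the stream is needed."""
    return [classify(words[s:e]) for s, e in sliding_windows(len(words), window, hop)]

def dominant_word(segment):
    """Hypothetical stand-in for a trained classifier: returns the most frequent symbol."""
    return max(set(segment), key=segment.count)

stream = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # hypothetical acoustic-word stream
windows = list(sliding_windows(len(stream), window=4, hop=2))
decisions = online_decisions(stream, window=4, hop=2, classify=dominant_word)
```

Here `hop` plays the role of "X seconds" expressed in frames, and `window` controls how much context each decision sees.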
16. /24
Scenarios
Models are trained in an off-line manner
Figure: block diagram of the training and test phases
Training phase: Training DB → Acoustic Topic Model → Topic Distribution Probability → Classifiers (Multiclass SVM)
Test phase: Test signal, with segmentation switched on or off, leading to the off-line result or the on-line result
17. /24
Dataset
RAI dataset
• Provides a benchmark test bed (6-fold cross-validation)
• Italian TV broadcast programs
• 7 genres
• 262 programs (15 min per program)
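A 6-fold cross-validation over 262 programs can be sketched as follows. The slides do not state how programs are assigned to folds, so the shuffled round-robin assignment, the seed, and the function name are assumptions made only for illustration.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Split item indices into k folds; each fold serves once as the test set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # assumed random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # round-robin keeps fold sizes balanced
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = k_fold_indices(262, 6)
fold_sizes = [len(test) for _, test in splits]    # [44, 44, 44, 44, 43, 43]
```

Each program appears in exactly one test fold, so the reported accuracy is an average over six train/test rounds.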
18. /24
Off-line classification
[2007] M. Montagnuolo and A. Messina, “TV Genre Classification Using Multimodal Information and Multilayer Perceptrons,” LNAI 4733, 2007.
[2009] M. Montagnuolo and A. Messina, “Parallel Neural Networks for Multimodal Video Genre Classification,” Multimedia Tools and Applications, vol. 41, 2009.
[2010] H. Ekenel et al., “Content-based Video Genre Classification Using Multiple Cues,” ACM, 2010.
Overall accuracy
• Comparison with conventional content-based approaches
Competitive results using only audio content

Accuracy (%)    MLP* [2007]   MLP* [2009]   SVM** [2010]   GMM (64 mixtures)   ATM (64 topics, 2,048 words)
Audio only      -             -             86.6           93.6                94.3
Audio-visual    92.0          94.9          99.6           -                   -

* MLP: Multilayer Perceptron
** SVM: Support Vector Machine
22. /24
Summary
The genre of TV programs can be detected using only audio content
• Using a context-based approach
On-line and off-line tasks
• Results competitive with conventional audio-visual approaches in off-line tasks
• ATM outperforms GMM in on-line tasks if segments are long enough
23. /24
Conclusions
Acoustic Topic Model (ATM)
• Captures contextual information of audio signals by modeling the co-occurrence of text-like audio signals
• Can be used in various classification applications in combination with a supervised classifier
Ambiguities in sound
• Heterogeneous: a mixture of multiple sound sources
• Dependency on context: similar audio content may represent different meanings depending on the surrounding sound.
Acoustic words, which play a role similar to words in text
• Transform audio signals into text-like signals
• We have tried various strategies, like ASR and onomatopoeic words, but MFCC-VQ rocks