Anvita Audio Classification Presentation

Audio Clip Classification

Anvita Bajpai
anvita@mailcity.com

Source:
http://www.hindu.com/thehindu/seta/2002/01/10/stories/2002011000080300.htm

Exploding information

● One hour of TV broadcast across the world is 100 petabytes.

Source: http://www.sims.berkeley.edu/research/projects/how-much-info/summary.html#tv

Audio indexing
● Reasons for choosing audio data for study
  – Easier to process
  – Contains significant information
● Indexing – method of organizing data for further search and retrieval
  – Example – book indexing
● Audio indexing – indexing non-text data using the audio part of it

Example of an audio indexing system

Source: J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz, and A. Srivastava.
“Speech and language technologies for audio indexing and retrieval”, in Proc. of the IEEE,
88(8), pp. 1338-1353, 2000.

More examples of audio indexing tasks

● Spoken document retrieval
● Speaker identification
● Language identification
● Music classification
● Music/speech discrimination
● Audio classification
  – An important step in building an audio indexing system

Levels of information in audio signal
● Subsegmental information
  – Related to excitation source characteristics
● Segmental information
  – Related to system / physiological characteristics
● Suprasegmental information
  – Related to behavioural characteristics of audio

Audio clip classification
● Closed set problem
● To classify a given audio clip into one of the following predefined categories
  – Advertisement
  – Cartoon
  – Cricket
  – Football
  – News

Issues in audio clip classification
● Feature extraction
  – Effective representation of data to capture all significant properties of audio for the task
  – Robust under various conditions
● Classification
  – Formulation of a distance measure and rule/models
● Training models for the task
● Testing – actual classification task
● Combining evidence from different systems

Missing component in existing approaches and its importance
● Features derived based on spectral analysis
  – Carry significant properties of audio data at segmental level
  – Miss information present at subsegmental and suprasegmental levels
● Perceptually significant information in linear prediction (LP) residual of signal
  – Complementary in nature to the spectral information
  – Subsegmental and suprasegmental information not being used in current systems
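The LP residual mentioned above is the error left after predicting each sample from its past samples. The following is a minimal sketch of how it could be computed with the autocorrelation method; the function names, the default LP order of 10, and the synthetic test signal are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Estimate LP coefficients a[1..order] by the autocorrelation method."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values for lags 0..order
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-9 * r[0] * np.eye(order)  # tiny ridge for numerical stability
    return np.linalg.solve(R, r[1:])

def lp_residual(frame, order=10):
    """Return e[n] = s[n] - sum_k a[k] * s[n-k] for one frame (order is assumed)."""
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    prediction = np.zeros_like(frame)
    for k in range(1, order + 1):
        prediction[k:] += a[k - 1] * frame[:-k]
    return frame - prediction

# Synthetic check signal (not audio data from the study): a second-order
# autoregressive process driven by white noise.
rng = np.random.default_rng(0)
noise = rng.normal(size=4000)
signal = np.zeros(4000)
for i in range(2, 4000):
    signal[i] = 1.3 * signal[i - 1] - 0.6 * signal[i - 2] + noise[i]

residual = lp_residual(signal, order=2)
print("residual/signal energy ratio:", np.sum(residual**2) / np.sum(signal**2))
```

Because the predictor removes the second-order (spectral envelope) correlations, what remains in the residual relates mostly to the excitation source, which is the information the slides argue is missed by spectral features.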

Presence of audio-specific information in LP residual

[Audio examples: original signal (Aa1.wav) and its LP residual (Aa_res.wav)]

Extracting audio-specific information from LP residual
  – LP residual
    ● May contain higher-order correlation among samples
    ● It is difficult to extract it using standard signal processing and statistical techniques
  – Hence, autoassociative neural network (AANN) models are proposed to capture information from the residual
    ● Used to capture features for speaker recognition task
● Structure of network
  – 40L 48N 12N 48N 40L
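As a rough illustration of the 40L 48N 12N 48N 40L structure, here is a minimal forward-pass sketch, reading L as linear units and N as nonlinear units. It assumes tanh for the nonlinear units, blocks of 40 consecutive LP residual samples as input, and a confidence score of exp(-error); none of these details are stated on the slide, and the weights here are random rather than trained.

```python
import numpy as np

LAYER_SIZES = [40, 48, 12, 48, 40]      # 40L 48N 12N 48N 40L
# One flag per weight layer: hidden layers are nonlinear, output layer is linear.
NONLINEAR = [True, True, True, False]

def init_aann(rng, sizes=LAYER_SIZES):
    """Random (untrained) weights and zero biases for each weight layer."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def aann_forward(params, x):
    """Map input blocks of shape (n_blocks, 40) to their reconstructions."""
    h = np.asarray(x, dtype=float)
    for (weights, bias), nonlinear in zip(params, NONLINEAR):
        h = h @ weights + bias
        if nonlinear:
            h = np.tanh(h)  # assumed activation function
    return h

def confidence(params, x):
    """Assumed confidence score: exp(-mean squared reconstruction error)."""
    x = np.asarray(x, dtype=float)
    error = np.mean((aann_forward(params, x) - x) ** 2, axis=-1)
    return np.exp(-error)

rng = np.random.default_rng(0)
params = init_aann(rng)
blocks = rng.normal(size=(5, 40))       # stand-in for LP residual blocks
scores = confidence(params, blocks)
print(scores)
```

The 12-unit middle layer forces a compressed representation, so a model trained on one audio component should reconstruct blocks from that component with lower error than blocks from other components.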
–

Use of audio component knowledge
● Audio category
  – Composed of one or more audio components
● Audio component
  – Specific to an audio category
● Six components chosen for study
  – Music
  – Speech - Conversational, Cartoon, Clean
  – Noise - Football, Cricket

Training phase of AANN models
● Trained one AANN model for each of the six components
● Models trained for 2000 epochs

AANN training error curve

Testing phase
(confidence scores output of 6 AANN models for a news test clip)

(a) for a segment of the clip, (b) expanded version of the same. Total duration of the test clip is 10 s.

Work flow diagram

(of 6 components)

MLP – Multilayer perceptron

MLP for decision making task
● MLP used for decision making on the audio-specific information captured by AANN, as it is
  – Suitable for pattern recognition tasks
  – Able to form complex decision surfaces by using discriminative learning algorithms
● Structure of MLP used – 6L 24N 12N 5N
●

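MLP for decision making task (contd.)

[Figure: MLP with an input layer of 6 nodes (confidence scores output of the 6 component AANN models: M, S1, S2, S3, N1, N2), hidden layers of 24 and 12 nodes, and an output layer of 5 nodes giving the audio category (A, C, K, F, N)]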
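The decision stage can be sketched as a small 6L 24N 12N 5N network that maps the six component confidence scores to one of the five categories. This is only an illustration under assumptions: tanh units, random untrained weights, hypothetical component names and ordering, and a clip-level decision made by averaging the per-frame outputs. The slides do not specify these details.

```python
import numpy as np

# Hypothetical names and ordering for the six component models and five categories.
COMPONENTS = ["music", "speech_conversational", "speech_cartoon",
              "speech_clean", "noise_football", "noise_cricket"]
CATEGORIES = ["advertisement", "cartoon", "cricket", "football", "news"]
MLP_SIZES = [6, 24, 12, 5]  # 6L 24N 12N 5N

def init_mlp(rng, sizes=MLP_SIZES):
    """Random (untrained) weights and zero biases for each weight layer."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, scores):
    """Linear input units, tanh (assumed) on both hidden layers and the output layer."""
    h = np.asarray(scores, dtype=float)
    for weights, bias in params:
        h = np.tanh(h @ weights + bias)
    return h

def classify_clip(params, frame_scores):
    """Average per-frame outputs over the clip (an assumption) and pick the largest."""
    outputs = mlp_forward(params, frame_scores)      # shape (n_frames, 5)
    clip_output = outputs.mean(axis=0)
    return CATEGORIES[int(np.argmax(clip_output))], clip_output

rng = np.random.default_rng(0)
params = init_mlp(rng)
# Stand-in for the confidence scores of the six AANN models over 100 frames.
frame_scores = rng.uniform(0.0, 1.0, size=(100, len(COMPONENTS)))
label, clip_output = classify_clip(params, frame_scores)
print(label, clip_output)
```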

Classification results

Audio class      % of clips correctly classified
                 DB1        DB2
Advertisement    83.00%     43.50%
Cartoon          88.00%     45.50%
Cricket          86.00%     38.50%
Football         90.50%     75.50%
News             85.50%     63.30%
Average          86.60%     53.26%

DB1 – Data collected from a single TV channel; contains 200 clips, 40 of each category
DB2 – Data collected across all broadcast channels; contains 1659 clips:
Adv. – 226, Cartoon – 208, Cricket – 318, Football – 600, News – 306

Classification results for spectral features-based system [1]

Audio class      % of clips correctly classified
                 Spectral features-based system    LP residual-based system
                 DB1        DB2                    DB1        DB2
Advertisement    85.00%     65.00%                 83.00%     43.50%
Cartoon          90.00%     75.00%                 88.00%     45.50%
Cricket          90.00%     65.00%                 86.00%     38.50%
Football         92.50%     40.00%                 90.50%     75.50%
News             87.50%     65.30%                 85.50%     63.30%

Average          89.00%     62.06%                 86.60%     53.26%

Ref. [1] Gaurav Aggarwal, Features for Audio Indexing, M.Tech. report, CSE Dept., IIT Madras, Apr. 2002

Classification results from source and spectral features-based systems

[Figure: diagram showing A, System 1 and System 2]

A – All test audio clips (DB2)
System 1 – clips recognised using the spectral features-based system
System 2 – clips recognised using the excitation source (LP residual) based system

Results of combined (subsegmental and segmental) system for DB2

Audio class      % of clips correctly classified in systems
                 Spectral    LP residual    Abstract level    Rank+measurement level
                 based       based          combination       combination
Advertisement    65.00%      43.50%         83.00%            92.47%
Cartoon          75.00%      45.50%         92.00%            98.55%
Cricket          65.00%      38.50%         87.50%            88.67%
Football         40.00%      75.50%         87.00%            91.16%
News             65.30%      63.30%         86.30%            95.10%

Average          62.06%      53.26%         87.25%            93.18%
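The slides do not say how the two systems were combined, so the following is only a generic sketch of what rank-level and measurement-level combination can look like. It assumes a Borda count for the rank level and a sum of normalized scores for the measurement level; the score values are made up for illustration and are not results from the study.

```python
import numpy as np

CATEGORIES = ["advertisement", "cartoon", "cricket", "football", "news"]

def borda_points(scores):
    """Rank-level evidence: 0 points for the lowest score, up to n-1 for the highest."""
    order = np.argsort(scores)              # indices from lowest to highest score
    points = np.empty(len(scores), dtype=int)
    points[order] = np.arange(len(scores))
    return points

def combine_rank_level(scores_1, scores_2):
    """Add the Borda points of the two systems and pick the best category index."""
    return int(np.argmax(borda_points(scores_1) + borda_points(scores_2)))

def combine_measurement_level(scores_1, scores_2):
    """Add the normalized scores (assumed non-negative) of the two systems."""
    p1 = np.asarray(scores_1, dtype=float) / np.sum(scores_1)
    p2 = np.asarray(scores_2, dtype=float) / np.sum(scores_2)
    return int(np.argmax(p1 + p2))

# Made-up per-category scores for one clip from two hypothetical systems.
spectral_scores = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
residual_scores = np.array([0.05, 0.50, 0.10, 0.20, 0.15])

print("system 1 alone:", CATEGORIES[int(np.argmax(spectral_scores))])
print("rank level:", CATEGORIES[combine_rank_level(spectral_scores, residual_scores)])
print("measurement level:",
      CATEGORIES[combine_measurement_level(spectral_scores, residual_scores)])
```

An abstract-level combination, by contrast, would use only the final label from each system and discard the ranks and scores.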

Suprasegmental information in Hilbert envelope of LP residual of audio signal

Suprasegmental information in LP
residual for audio clip classification

Autocorrelation samples of Hilbert envelope of LP residual for 5 audio classes
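The Hilbert envelope is the magnitude of the analytic signal, and its autocorrelation shows peaks at lags where the envelope repeats. A minimal sketch using only NumPy follows; the function names and the synthetic inputs are assumptions for illustration, and in the setting of the slides the input would be the LP residual of an audio clip.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed through the FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies, zero the rest.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * gain))

def normalized_autocorrelation(x, max_lag):
    """Autocorrelation of the mean-removed sequence for lags 0..max_lag, scaled so r[0] = 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(max_lag + 1)])
    return r / r[0]

# Synthetic checks (not data from the study).
tone = np.cos(2 * np.pi * 50 * np.arange(1000) / 1000)   # envelope should be flat
envelope = hilbert_envelope(tone)

pulses = np.zeros(1000)
pulses[::100] = 1.0                                       # repeats every 100 samples
acf = normalized_autocorrelation(pulses, max_lag=150)
print("strongest lag between 50 and 150:", 50 + int(np.argmax(acf[50:151])))
```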

Statistics of autocorrelation sequence

Correction – here we have statistics of autocorrelation sequence peaks of HE (not LP residual)

Statistics of autocorrelation sequence

Scope of future work
● Extending the framework for other audio indexing applications
● Exploring methods to add suprasegmental information to the combined system
● (Though far away) Building a multimedia indexing system

Summary and conclusions
● Need to organize audio data because of its large volume and need in real-life applications
● Presence of audio-specific information in LP residual
● AANN model's ability to capture subsegmental information from residual for the task
● Use of MLP for decision making using the information captured by AANN
● Complementary nature of source information to the system information
● Presence of audio-specific suprasegmental information in LP residual

Major contributions
● Extraction of audio-specific information from LP residual using NN models
● Showing the complementary nature of source and system information for the audio clip classification task
● Showing the presence of audio-specific suprasegmental information in LP residual

References
1. T. Zhang and C.-C. J. Kuo, “Content-based classification and retrieval of audio,” in Conference on Advanced Signal Processing Algorithms, Architectures, and Implementations VIII, San Diego, California, July 1998, vol. 3461 of Proc. of SPIE.
2. J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz, and A. Srivastava, “Speech and language technologies for audio indexing and retrieval,” in Proc. of the IEEE, 88(8), pp. 1338-1353, 2000.
3. Y. Wang, Z. Liu, and J. Huang, “Multimedia Content Analysis using Audio and Visual Clues,” IEEE SP Magazine, 17(6), Nov. 2000.
4. M.A. Kramer, “Nonlinear principal component analysis using autoassociative neural networks,” AIChE Journal, vol. 37, pp. 233-243, Feb. 1991.
5. J. Makhoul, “Linear prediction: A tutorial review,” in Proc. IEEE, vol. 63, pp. 561-580, 1975.
6. B. Yegnanarayana, S.R.M. Prasanna, and K.S. Rao, “Speech enhancement using excitation source information,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, Orlando, FL, USA, May 2002.
7. S.R.M. Prasanna, Ch.S. Gupta, and B. Yegnanarayana, “Autoassociative neural network models for speaker verification using source features,” in Proc. Sixth Int. Conf. Cognitive Neural Systems, Boston University, Boston, USA, May-June 2002.
8. B. Yegnanarayana, Artificial Neural Networks, Prentice Hall of India, New Delhi, 1999.

Related publications
1. Anvita Bajpai and B. Yegnanarayana, “Audio Clip Classification using LP
Residual and Neural Networks Models”, European Signal and Image
Processing Conference (EUSIPCO-2004), Vienna, Austria, 6-10 September
2004
2. Anvita Bajpai and B. Yegnanarayana, “Exploring Features for Audio
Indexing using LP Residual and AANN Models”, accepted for The 17th
International FLAIRS Conference (FLAIRS - 2004), Miami Beach, Florida,
17-19 May 2004.
3. Anvita Bajpai and B. Yegnanarayana, “Exploring Features for Audio Clip
Classification using LP Residual and Neural Networks Models”,
International Conference on Intelligent Signal and Image Processing (ICISIP-
2004), Chennai, India, 4-7 January 2004
4. Gaurav Aggarwal, Anvita Bajpai and B. Yegnanarayana, “Exploring
Features for Audio Indexing”, in Indian Research Scholar Seminar (IRIS-
2002), Indian Institute of Science, Bangalore, India, March 2002

Following are extra slides not part of
main presentation

Effect of # of epochs used for AANN
training

Confidence scores output of 6 AANN models for a news test clip

● Even well-trained humans don't always react the way they were trained.
  – Source: www.computer.org/computer/homepage/0103/random/r1014.pdf, by Bob Colwell

Classification of audio using spectral features
● Extraction of features, based on
  – Volume
    ● Standard deviation and dynamic range of volume, volume undulation, 4 Hz modulation energy
  – Zero crossing rate (ZCR)
    ● Standard deviation of ZCR, silence-nonsilence ratio
  – Pitch
    ● Pitch contour, pitch standard deviation, similar pitch ratio, pitch-nonpitch ratio
  – Spectrum
    ● Frequency centroid, bandwidth, ratio of energy in various frequency sub-bands
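A few of the features listed above can be sketched as follows. This is a minimal illustration, assuming RMS amplitude as the volume measure, a frame length of 256 samples with a hop of 128, and normalization of the volume statistics by the maximum volume; the slide does not give these definitions, and the function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    index = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[index]

def volume(frames):
    """Per-frame volume, taken here as RMS amplitude (an assumed definition)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def frequency_centroid(frames, sample_rate):
    """Power-weighted mean frequency of each frame, in Hz."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    return (power @ freqs) / (power.sum(axis=1) + 1e-12)

def clip_features(x, sample_rate, frame_len=256, hop=128):
    """Clip-level statistics of the frame-level features."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    v = volume(frames)
    z = zero_crossing_rate(frames)
    c = frequency_centroid(frames, sample_rate)
    v_max = v.max() + 1e-12
    return {
        "volume_std": v.std() / v_max,                       # assumed normalization
        "volume_dynamic_range": (v.max() - v.min()) / v_max,
        "zcr_mean": z.mean(),
        "zcr_std": z.std(),
        "frequency_centroid_mean": c.mean(),
    }

# Synthetic check signal (not audio data from the study): a steady 1 kHz tone.
sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate
features = clip_features(np.sin(2 * np.pi * 1000 * t + 0.3), sample_rate)
print(features)
```

A steady tone gives nearly constant volume and ZCR across frames, whereas speech alternates between voiced, unvoiced and silent frames, which is why the standard deviations of these features are useful for discrimination.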

Features for Categorization of Audio Clips (4 Hz modulation energy)

[Figure panels: Cricket, Football, News]

Features for Categorization of Audio Clips (Similar Pitch Ratio) (contd.)

[Figure panels: Cricket, Football, News]

Importance of Task Dependent Feature (Standard deviation of ZCR)

[Figure panels: Speaker 1, Speaker 2, Music]
