Constantine Kotropoulos, Associate Professor, Aristotle University of Thessaloniki, Department of Informatics, Sparse and Low Rank Representations in Music Signal Analysis
Sparse and Low Rank Representations in Music Signal Analysis
1. Sparse and Low Rank Representations in Music Signal Analysis
Constantine Kotropoulos, Yannis Panagakis
Artificial Intelligence & Information Analysis Laboratory
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki 54124, GREECE
2nd Greek Signal Processing Jam, Thessaloniki, May 17th, 2012
Sparse and Low Rank Representations in Music Signal Analysis 1/54
2. Outline
1 Introduction
2 Auditory spectro-temporal modulations
3 Suitable data representations for classification
4 Joint sparse low-rank representations in the ideal case
5 Joint sparse low-rank representations in the presence of noise
6 Joint sparse low-rank subspace-based classification
7 Music signal analysis
8 Conclusions
4. Introduction
Music genre classification
Genre: the most popular description of music content despite the
lack of a commonly agreed definition. To classify music recordings
into distinguishable genres using information extracted from the
audio signal.
Musical structure analysis
To derive the musical form, i.e., the structural description of a
music piece at the time scale of segments, such as intro, verse,
chorus, bridge, from the audio signal.
Music tagging
Tags: text-based labels encoding semantic information related to
music (e.g., instrumentation, genres, emotions). Manual tagging is
expensive, time consuming, and applicable mainly to popular music;
automatic tagging is fast and applies to new and unpopular music as well.
7. Introduction
Motivation
The appealing properties of slow temporal and spectro-temporal
modulations from the human perceptual point of view^a;
The strong theoretical foundations of sparse representations^{b,c}
and low-rank representations^d.
^a K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory system," IEEE Trans. Speech and Audio Processing, vol. 3, no. 5, pp. 382–396, 1995.
^b E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Information Theory, vol. 52, no. 2, pp. 489–509, February 2006.
^c D. L. Donoho, "Compressed sensing," IEEE Trans. Information Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
^d G. Liu, Z. Lin, S. Yan, J. Sun, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Trans. Pattern Analysis and Machine Intelligence, 2011, arXiv:1010.2955v4 (preprint).
9. Notations
Span
Let span(X) denote the linear space spanned by the columns of X.
Then, Y ∈ span(X) denotes that all column vectors of Y belong to
span(X).
10. Notations
Vector norms
$\|x\|_0$ is the $\ell_0$ quasi-norm counting the number of nonzero entries in x.
If $|\cdot|$ denotes the absolute value operator, $\|x\|_1 = \sum_i |x_i|$ and
$\|x\|_2 = \sqrt{\sum_i x_i^2}$ are the $\ell_1$ and the $\ell_2$ norm of x, respectively.
11. Notations
Matrix norms
mixed $\ell_{p,q}$ matrix norm: $\|X\|_{p,q} = \big( \sum_j \big( \sum_i |x_{ij}|^p \big)^{q/p} \big)^{1/q}$.
For p = q = 0, the matrix $\ell_0$ quasi-norm, $\|X\|_0$, returns the number of
nonzero entries in X. For p = q = 1, the matrix $\ell_1$ norm is obtained:
$\|X\|_1 = \sum_i \sum_j |x_{ij}|$.
Frobenius norm: $\|X\|_F = \sqrt{\sum_i \sum_j x_{ij}^2}$.
$\ell_2/\ell_1$ norm of X: $\|X\|_{2,1} = \sum_j \sqrt{\sum_i x_{ij}^2}$.
The nuclear norm of X, $\|X\|_*$, is the sum of the singular values of X.
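These definitions can be checked numerically; a small NumPy sketch with example values of my own choosing:

```python
import numpy as np

X = np.array([[3.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]])

l0 = np.count_nonzero(X)                        # matrix l0 quasi-norm: number of nonzero entries
l1 = np.abs(X).sum()                            # matrix l1 norm
fro = np.sqrt((X ** 2).sum())                   # Frobenius norm
l21 = np.sqrt((X ** 2).sum(axis=0)).sum()       # l2/l1 norm: sum of the column l2 norms
nuc = np.linalg.svd(X, compute_uv=False).sum()  # nuclear norm: sum of singular values

print(l0, l1, fro, l21, nuc)  # 2 7.0 5.0 7.0 7.0 (nuclear norm up to rounding)
```

For this (nearly diagonal) X the $\ell_1$, $\ell_{2,1}$, and nuclear norms happen to coincide; for a generic matrix they differ.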
12. Notations
Support
A vector x is said to be q-sparse if the size of the support of x (i.e., the
set of indices associated with nonzero vector elements) is no larger than
q.
The support of a collection of vectors X = [x1, x2, . . . , xN] is defined as
the union of all the individual supports.
A matrix X is called q-joint-sparse if $|\mathrm{supp}(X)| \leq q$. That is, at most
q rows of X contain nonzero elements, because
$\|X\|_{0,q} = |\mathrm{supp}(X)|$ for any q^a.
^a M. Davies and Y. Eldar, "Rank awareness in joint sparse recovery," arXiv:1004.4529v1, 2010.
14. Auditory spectro-temporal modulations
Computational auditory model
It is inspired by psychoacoustical and neurophysiological
investigations in the early and central stages of the human
auditory system.
[Block diagram: audio signal → early auditory model → auditory spectrogram → central auditory model → auditory temporal modulations and auditory spectro-temporal modulations (cortical representation)]
16. Auditory spectro-temporal modulations
Early auditory system
Auditory Spectrogram: time-frequency distribution of energy along
a tonotopic (logarithmic frequency) axis.
[Figure: the early auditory model producing the auditory spectrogram]
17. Auditory spectro-temporal modulations
Central auditory system - Temporal modulations
[Figure: the auditory spectrogram (tonotopic frequency vs. time) mapped to auditory temporal modulations; modulation rate ω in Hz]
18. Auditory spectro-temporal modulations
Auditory temporal modulations across 10 music genres
Blues Classical Country Disco Hiphop
Jazz Metal Pop Reggae Rock
19. Auditory spectro-temporal modulations
Central auditory system - Spectro-temporal modulations
[Figure: the auditory spectrogram mapped to auditory spectro-temporal modulations; rate ω in Hz, scale Ω in cycles/octave]
20. Auditory spectro-temporal modulations
Efficient implementation through the constant-Q transform (CQT)
21. Auditory spectro-temporal modulations
Parameters and implementation (1)
The audio signal is analyzed by employing 128 constant-Q filters
covering 8 octaves from 44.9 Hz to 11 kHz (i.e., 16 filters per
octave). The magnitude of the CQT is compressed by raising
each element of the CQT matrix to the power of 0.1^a.
The 2D multiresolution wavelet analysis is implemented via a bank
of 2D Gaussian filters with scales ∈ {0.25, 0.5, 1, 2, 4, 8}
(Cycles/Octave) and rates ∈ {±2, ±4, ±8, ±16, ±32} (Hz).
For each music recording, the extracted 4D cortical representation
is time-averaged, yielding a 3D rate-scale-frequency cortical
representation.
^a C. Schoerkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," in 7th Sound and Music Computing Conf., Barcelona, Spain, 2010.
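The parameter grid above fixes the dimensionality of the time-averaged representation; a quick sanity check in NumPy (this is where the 7680-dimensional feature vector of the next slide comes from):

```python
n_filters = 128                            # constant-Q filters: 16 per octave x 8 octaves
scales = [0.25, 0.5, 1, 2, 4, 8]           # cycles/octave
rates = [s * r for r in (2, 4, 8, 16, 32)  # +/- rates in Hz, 10 values in total
         for s in (+1, -1)]

# After time-averaging, the rate-scale-frequency representation has
# len(rates) * len(scales) * n_filters entries:
dim = len(rates) * len(scales) * n_filters
print(dim)  # 7680
```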
24. Auditory spectro-temporal modulations
Parameters and implementation (2)
To sum up, each music recording is represented by a vector
x ∈ R_+^{7680}, obtained by stacking the elements of the 3D cortical
representation into a vector.
An ensemble of music recordings is represented by the data
matrix X ∈ R_+^{7680×S}, where S is the number of available
recordings.
Each row of X is normalized to the range [0, 1] by subtracting the
row minimum from each entry and then dividing by the difference
between the row maximum and the row minimum.
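The row-wise normalization can be sketched as follows (a minimal NumPy illustration; the small epsilon guard against constant rows is my addition, not stated on the slide):

```python
import numpy as np

def normalize_rows(X, eps=1e-12):
    """Scale each row of X to [0, 1] via (x - row_min) / (row_max - row_min)."""
    row_min = X.min(axis=1, keepdims=True)
    row_max = X.max(axis=1, keepdims=True)
    return (X - row_min) / (row_max - row_min + eps)

X = np.array([[2.0, 4.0, 6.0],
              [1.0, 1.5, 2.0]])
Xn = normalize_rows(X)
print(Xn.min(axis=1), Xn.max(axis=1))  # each row now spans ~[0, 1]
```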
28. Learning Problem
Statement
Let X ∈ R^{d×S} be the data matrix that contains S vector samples of
size d in its columns, i.e., x_s ∈ R^d, s = 1, 2, . . . , S.
Without loss of generality, the data matrix can be partitioned as
X = [A | Y], where
A = [A1 | A2 | . . . | AK] ∈ R^{d×N} represents a set of N training samples
that belong to K classes;
Y = [Y1 | Y2 | . . . | YK] ∈ R^{d×M} contains M = S − N test vector
samples in its columns.
If certain assumptions hold, learn a block-diagonal matrix
Z = diag[Z1, Z2, . . . , ZK] ∈ R^{N×M} such that Y = A Z.
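A toy illustration of the setup (sizes are hypothetical, mine for the example): two classes, each spanning its own subspace, with one test sample per class drawn from that class's span, so a block-diagonal Z with Y = AZ exists by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per, K = 6, 3, 2                       # dimension, training samples per class, classes

# Training matrix A = [A1 | A2]; each Ak spans its own subspace.
A_blocks = [rng.standard_normal((d, n_per)) for _ in range(K)]
A = np.hstack(A_blocks)                     # d x N with N = K * n_per

# One test sample per class, drawn from that class's span: y_k = A_k z_k.
Z_blocks = [rng.standard_normal((n_per, 1)) for _ in range(K)]
Y = np.hstack([A_blocks[k] @ Z_blocks[k] for k in range(K)])

# Block-diagonal representation Z = diag[Z1, Z2] satisfies Y = A Z.
Z = np.zeros((K * n_per, K))
for k in range(K):
    Z[k * n_per:(k + 1) * n_per, k:k + 1] = Z_blocks[k]

print(np.allclose(Y, A @ Z))  # True
```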
33. Learning Problem
Assumptions
If
1 the data are exactly drawn from independent linear subspaces, i.e.,
span(Ak) linearly spans the k-th class data space, k = 1, 2, . . . , K,
2 Y ∈ span(A),
3 the data contain neither outliers nor noise,
then each test vector sample that belongs to the k-th class can be
represented as a linear combination of the training samples in Ak.
38. Solutions
Sparsest Representation (SR)
Z ∈ R^{N×M} is the sparsest representation of the test data Y ∈ R^{d×M}
with respect to the training data A ∈ R^{d×N}, obtained by solving the
optimization problem^a:
SR: $\operatorname{argmin}_{z_i} \|z_i\|_0$ subject to $y_i = A z_i$, (1)
^a E. Elhamifar and R. Vidal, "Sparse subspace clustering," in IEEE Int. Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 2790-2797.
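Problem (1) is combinatorial and NP-hard in general, which is why a convex relaxation is used later. Purely for intuition, here is a standard greedy heuristic, orthogonal matching pursuit; this is my illustration, not part of the presented method, and A is given orthonormal columns so that exact recovery is guaranteed in this toy case.

```python
import numpy as np

def omp(A, y, q):
    """Greedy approximation to argmin ||z||_0 s.t. y = A z, with at most q nonzeros."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(q):
        if np.linalg.norm(residual) < 1e-10:
            break                                     # y already represented exactly
        j = int(np.argmax(np.abs(A.T @ residual)))    # column most correlated with residual
        if j in support:
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)  # refit on selected columns
        residual = y - A[:, support] @ coef
    z = np.zeros(A.shape[1])
    z[support] = coef
    return z

rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # orthonormal columns (toy dictionary)
z_true = np.zeros(8)
z_true[[2, 5]] = [1.5, -2.0]                      # a 2-sparse ground truth
y = A @ z_true
z = omp(A, y, q=2)
print(np.allclose(z, z_true))  # True
```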
39. Solutions
Lowest-rank representation (LRR)
or Z ∈ R^{N×M} is the lowest-rank representation of the test data
Y ∈ R^{d×M} with respect to the training data A ∈ R^{d×N}, obtained by
solving the optimization problem^a:
LRR: $\operatorname{argmin}_{Z} \operatorname{rank}(Z)$ subject to $Y = A Z$. (2)
^a G. Liu, Z. Lin, S. Yan, J. Sun, and Y. Ma (2011)
40. Solutions
Convex relaxations
The convex envelope of the $\ell_0$ norm is the $\ell_1$ norm^a, while the convex
envelope of the rank function is the nuclear norm^b.
Convex relaxations can be obtained by replacing the $\ell_0$ norm and the
rank function with their convex envelopes:
SR: $\operatorname{argmin}_{z_i} \|z_i\|_1$ subject to $y_i = A z_i$, (3)
LRR: $\operatorname{argmin}_{Z} \|Z\|_*$ subject to $Y = A Z$. (4)
^a D. Donoho, "For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution," Communications on Pure and Applied Mathematics, vol. 59, no. 7, pp. 907-934, 2006.
^b M. Fazel, Matrix Rank Minimization with Applications, Ph.D. thesis, Dept. Electrical Engineering, Stanford University, CA, USA, 2002.
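Relaxation (3) is equivalent to a linear program after the standard split z = u - v with u, v ≥ 0; a minimal sketch using scipy.optimize.linprog (an illustration with toy sizes of my choosing, not the solver used in the talk):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve argmin ||z||_1 s.t. A z = y via the LP split z = u - v, u, v >= 0."""
    d, n = A.shape
    c = np.ones(2 * n)                 # objective: sum(u) + sum(v) = ||z||_1
    A_eq = np.hstack([A, -A])          # equality constraint: A u - A v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    u, v = res.x[:n], res.x[n:]
    return u - v

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))
z_true = np.zeros(16)
z_true[[3, 11]] = [1.0, -2.0]          # a sparse feasible solution
y = A @ z_true
z = basis_pursuit(A, y)
# z is feasible, and its l1 norm cannot exceed that of the feasible z_true:
print(np.linalg.norm(A @ z - y), np.abs(z).sum())
```

Relaxation (4) additionally needs a nuclear-norm solver (e.g., via singular value thresholding); the norm itself is simply `np.linalg.svd(Z, compute_uv=False).sum()`.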
41. Solutions
SR pros and cons
The SR matrix Z ∈ R^{N×M} is sparse block-diagonal and has good
discriminative properties, as has been demonstrated for SR-based
classifiers^a.
However, the SR
1 cannot model generic subspace structures. Indeed, the SR accurately
models subregions of subspaces, the so-called bouquets, rather
than generic subspaces^b.
2 does not capture the global structure of the data, since it is
computed for each data sample individually. Although sparsity
offers an efficient representation, it damages the high
within-class homogeneity that is desirable for classification,
especially in the presence of noise.
^a J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, 2009.
^b J. Wright and Y. Ma, "Dense error correction via l1-minimization," IEEE Trans. Information Theory, vol. 56, no. 7, pp. 3540-3560, 2010.
45. Solutions
LRR pros and cons
The LRR matrix Z ∈ R^{N×M}
1 models data stemming from generic subspace structures;
2 accurately preserves the global data structure;
3 for clean data, also exhibits dense within-class
homogeneity and zero between-class affinities, making it an
appealing representation for classification purposes, e.g., in music
mood classification^a;
4 for data contaminated with noise and outliers, the low-rank
constraint seems to enforce noise correction^b.
However, the LRR loses sparsity within the classes.
^a Y. Panagakis and C. Kotropoulos, "Automatic music mood classification via low-rank representation," in Proc. 19th European Signal Processing Conf., Barcelona, Spain, 2011, pp. 689–693.
^b E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of ACM, vol. 58, no. 3, pp. 1-37, 2011.
52. Joint sparse low-rank representations (JSLRR)
Motivation
Intuitively, a representation matrix that is able to reveal the most
characteristic subregions of the subspaces must be
simultaneously row sparse and low-rank.
The row sparsity ensures that only a small fraction of the training
samples is involved in the representation.
The low-rank constraint ensures that the representation vectors
(i.e., the columns of the representation matrix) are correlated in
the sense that the data lying onto a single subspace are
represented as a linear combination of the same few training
samples.
Sparse and Low Rank Representations in Music Signal Analysis 23/54
55. JSLRR
Problem statement and solution
The JSLRR of Y ∈ R^{d×M} with respect to A ∈ R^{d×N} is the matrix Z ∈ R^{N×M} with rank r ≤ min(q, M), where q ≤ N is the size of the support of Z.
It can be found by minimizing the rank function regularized by the ℓ_{0,q} quasi-norm.
The ℓ_{0,q} regularization term ensures that the low-rank matrix is also row sparse, since ||Z||_{0,q} = |supp(Z)| for any q.
A convex relaxation of the just mentioned problem is solved:
JSLRR: argmin_Z ||Z||_* + θ1 ||Z||_1 subject to Y = A Z, (5)
where the term ||Z||_1 promotes sparsity in the LRR matrix and θ1 > 0 balances the two norms in (5).
Sparse and Low Rank Representations in Music Signal Analysis 24/54
59. JSLRR
Any theoretical guarantee?
The JSLRR has a block-diagonal structure, a property that makes it
appealing for classification. This fact is proved in Theorem 1, which is
a consequence of Lemma 1.
Sparse and Low Rank Representations in Music Signal Analysis 25/54
60. JSLRR
Lemma 1
Let ||·||_θ = ||·||_* + θ ||·||_1, with θ > 0. For any four matrices B, C, D, and F of compatible dimensions,
||[B C; D F]||_θ ≥ ||[B 0; 0 F]||_θ = ||B||_θ + ||F||_θ. (6)
Sparse and Low Rank Representations in Music Signal Analysis 25/54
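Lemma 1 can be sanity-checked numerically. A minimal sketch with numpy (the helper name `norm_theta` and the random test matrices are ours, not from the slides):

```python
import numpy as np

def norm_theta(M, theta):
    """||M||_theta = ||M||_* + theta * ||M||_1 (nuclear plus elementwise l1)."""
    return np.linalg.norm(M, "nuc") + theta * np.abs(M).sum()

rng = np.random.default_rng(0)
theta = 0.5
B, C = rng.standard_normal((3, 3)), rng.standard_normal((3, 4))
D, F = rng.standard_normal((2, 3)), rng.standard_normal((2, 4))

full = np.block([[B, C], [D, F]])           # [B C; D F]
diag = np.block([[B, np.zeros((3, 4))],     # [B 0; 0 F]
                 [np.zeros((2, 3)), F]])

lhs = norm_theta(full, theta)
rhs = norm_theta(diag, theta)
# Lemma 1: zeroing the off-diagonal blocks cannot increase the norm,
# and the block-diagonal norm splits additively.
```

For a block-diagonal matrix the singular values are the union of the blocks' singular values, which is why the right-hand side splits into ||B||_θ + ||F||_θ.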
61. JSLRR
Theorem 1
Assume that the data are exactly drawn from independent linear subspaces. That is, the training vectors of the kth class lie in span(A_k), k = 1, 2, . . . , K, and Y ∈ span(A). Then, the minimizer of (5) is block-diagonal.
Sparse and Low Rank Representations in Music Signal Analysis 25/54
62. Example 1
Ideal case
4 linear pairwise independent subspaces are constructed whose bases {U_i}_{i=1}^{4} are computed by U_{i+1} = R U_i, i = 1, 2, 3.
U_1 ∈ R^{600×110} is a column orthonormal random matrix and R ∈ R^{600×600} is a random rotation matrix.
The data matrix X = [X_1, X_2, X_3, X_4] ∈ R^{600×400} is obtained by picking 100 samples from each subspace. That is, X_i ∈ R^{600×100}, i = 1, 2, 3, 4.
Next, the data matrix is partitioned into the training matrix A ∈ R^{600×360} and the test matrix Y ∈ R^{600×40} by employing 10-fold cross-validation.
Sparse and Low Rank Representations in Music Signal Analysis 26/54
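The synthetic construction above can be sketched in a few lines of numpy (a random rotation generically yields independent subspaces; the coefficient matrices `C_i` used to draw the 100 samples per subspace are an assumption on our part):

```python
import numpy as np

rng = np.random.default_rng(1)

# Basis of the first subspace: a random 600 x 110 column-orthonormal matrix.
U1, _ = np.linalg.qr(rng.standard_normal((600, 110)))
# Random rotation: an orthonormal 600 x 600 matrix from a QR factorization.
R, _ = np.linalg.qr(rng.standard_normal((600, 600)))

# U_{i+1} = R U_i, i = 1, 2, 3 -> bases of the four subspaces.
bases = [U1]
for _ in range(3):
    bases.append(R @ bases[-1])

# 100 samples per subspace: X_i = U_i C_i with random coefficients C_i.
blocks = [U @ rng.standard_normal((110, 100)) for U in bases]
X = np.hstack(blocks)  # the 600 x 400 data matrix [X_1, X_2, X_3, X_4]
```

Splitting the 400 columns 9:1 per subspace then gives the 600×360 training matrix A and the 600×40 test matrix Y of the example.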
65. Example 1
JSLRR, LRR, SR matrices Z ∈ R360×40
Sparse and Low Rank Representations in Music Signal Analysis 27/54
66. 1 Introduction
2 Auditory spectro-temporal modulations
3 Suitable data representations for classification
4 Joint sparse low-rank representations in the ideal case
5 Joint sparse low-rank representations in the presence of noise
6 Joint sparse low-rank subspace-based classification
7 Music signal analysis
8 Conclusions
Sparse and Low Rank Representations in Music Signal Analysis 28/54
67. JSLRR
Revisiting
The data are approximately drawn from a union of subspaces. The deviations from the ideal assumptions can be treated collectively as additive noise contaminating the ideal model, i.e., Y = A Z + E.
The noise term E models both small (but densely supported) deviations and gross (but sparsely supported) corruptions (i.e., outliers or missing data).
In the presence of noise, both the rank and the density of the representation matrix Z increase, since the columns of Z contain non-zero elements associated with more than one class.
If one requests to reduce the rank of Z or to increase the sparsity of Z, the noise in the test set can be smoothed and, simultaneously, Z admits a close to block-diagonal structure.
Sparse and Low Rank Representations in Music Signal Analysis 29/54
71. Robust JSLRR
Optimization Problem
A solution is sought for the convex optimization problem:
Robust JSLRR: argmin_{Z,E} ||Z||_* + θ1 ||Z||_1 + θ2 ||E||_{2,1} subject to Y = A Z + E, (7)
where θ2 > 0 is a regularization parameter and ||·||_{2,1} denotes the ℓ2/ℓ1 norm.
Problem (7) can be solved iteratively by employing the Linearized Alternating Direction Augmented Lagrange Multiplier (LADALM) method^a, a variant of the Alternating Direction Augmented Lagrange Multiplier method^b.
^a J. Yang and X. M. Yuan, “Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization,” Math. Comput., (to appear) 2011.
^b D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, Athena Scientific, Belmont, MA, 2/e, 1996.
Sparse and Low Rank Representations in Music Signal Analysis 30/54
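The objective in (7) mixes three norms. A small helper that evaluates it for given iterates can be useful when monitoring convergence; the function names below are ours, not from the slides:

```python
import numpy as np

def nuclear(M):
    """Nuclear norm: sum of singular values."""
    return np.linalg.norm(M, "nuc")

def l1(M):
    """Elementwise l1 norm."""
    return np.abs(M).sum()

def l21(M):
    """l2/l1 norm: sum of the l2 norms of the columns."""
    return np.linalg.norm(M, axis=0).sum()

def robust_jslrr_objective(Z, E, theta1, theta2):
    """Objective of (7): ||Z||_* + theta1 ||Z||_1 + theta2 ||E||_{2,1}."""
    return nuclear(Z) + theta1 * l1(Z) + theta2 * l21(E)

# Tiny check: for Z = I_3 and E = 0, nuclear(I_3) = 3 and l1(I_3) = 3.
Z = np.eye(3)
E = np.zeros((4, 3))
val = robust_jslrr_objective(Z, E, theta1=0.5, theta2=1.0)
```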
73. Robust JSLRR
LADALM
That is, one solves
argmin_{J,Z,W,E} ||J||_* + θ1 ||W||_1 + θ2 ||E||_{2,1} subject to Y = A Z + E, Z = J, J = W, (8)
by minimizing the augmented Lagrangian function:
L(J, Z, W, E, Λ1, Λ2, Λ3) = ||J||_* + θ1 ||W||_1 + θ2 ||E||_{2,1} + tr[Λ1ᵀ (Y − A Z − E)] + tr[Λ2ᵀ (Z − J)] + tr[Λ3ᵀ (J − W)] + (µ/2) (||Y − A Z − E||_F² + ||Z − J||_F² + ||J − W||_F²), (9)
where Λ1, Λ2, and Λ3 are the Lagrange multipliers and µ > 0 is a penalty parameter.
Sparse and Low Rank Representations in Music Signal Analysis 31/54
75. Robust JSLRR
Optimization with respect to J[t]
J[t+1] = argmin_{J[t]} L(J[t], Z[t], W[t], E[t], Λ1[t], Λ2[t], Λ3[t])
≈ argmin_{J[t]} (1/µ) ||J[t]||_* + (1/2) ||J[t] − (Z[t] − J[t] − Λ3[t]/µ + W[t] + Λ2[t]/µ)||_F²
J[t+1] ← D_{µ⁻¹}[Z[t] − J[t] − Λ3[t]/µ + W[t] + Λ2[t]/µ]. (10)
The solution is obtained via the singular value thresholding operator, defined for any matrix Q as D_τ[Q] = U S_τ[Σ] Vᵀ, with Q = U Σ Vᵀ being the singular value decomposition and S_τ[q] = sgn(q) max(|q| − τ, 0) being the shrinkage operator.
Sparse and Low Rank Representations in Music Signal Analysis 32/54
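The two operators used in this step are easy to write down in numpy; a minimal sketch (function names are ours):

```python
import numpy as np

def shrink(Q, tau):
    """Shrinkage operator S_tau[q] = sgn(q) * max(|q| - tau, 0), elementwise."""
    return np.sign(Q) * np.maximum(np.abs(Q) - tau, 0.0)

def svt(Q, tau):
    """Singular value thresholding D_tau[Q] = U S_tau[Sigma] V^T."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

# On a diagonal matrix the singular values are the diagonal entries,
# so thresholding by 0.5 maps 3, 1, 0.2 to 2.5, 0.5, 0.
A = np.diag([3.0, 1.0, 0.2])
B = svt(A, 0.5)
```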
76. Robust JSLRR
Optimization with respect to Z[t]
Z[t+1] = argmin_{Z[t]} L(J[t+1], Z[t], W[t], E[t], Λ1[t], Λ2[t], Λ3[t])
Z[t+1] = (I + Aᵀ A)⁻¹ [Aᵀ (Y − E[t]) + J[t+1] + (Aᵀ Λ1[t] − Λ2[t])/µ], (11)
i.e., an unconstrained least squares problem.
Sparse and Low Rank Representations in Music Signal Analysis 32/54
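Update (11) is a single linear solve; a sketch (argument names are ours, and in practice one would factor I + AᵀA once outside the iteration loop):

```python
import numpy as np

def update_Z(A, Y, E, J, L1, L2, mu):
    """Z-step (11): solve (I + A^T A) Z = A^T (Y - E) + J + (A^T L1 - L2)/mu."""
    N = A.shape[1]
    rhs = A.T @ (Y - E) + J + (A.T @ L1 - L2) / mu
    return np.linalg.solve(np.eye(N) + A.T @ A, rhs)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
Y = rng.standard_normal((6, 3))
# With E, J, and the multipliers at zero, the right-hand side is just A^T Y.
Z = update_Z(A, Y, np.zeros((6, 3)), np.zeros((4, 3)),
             np.zeros((6, 3)), np.zeros((4, 3)), mu=1.0)
```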
77. Robust JSLRR
Optimization with respect to W[t]
W[t+1] = argmin_{W[t]} L(J[t+1], Z[t+1], W[t], E[t], Λ1[t], Λ2[t], Λ3[t])
= argmin_{W[t]} (θ1/µ) ||W[t]||_1 + (1/2) ||W[t] − (J[t+1] + Λ3[t]/µ)||_F²
W[t+1] ← S_{θ1 µ⁻¹}[J[t+1] + Λ3[t]/µ]. (12)
Sparse and Low Rank Representations in Music Signal Analysis 32/54
78. Robust JSLRR
Optimization with respect to E[t]
E[t+1] = argmin_{E[t]} L(J[t+1], Z[t+1], W[t+1], E[t], Λ1[t], Λ2[t], Λ3[t])
= argmin_{E[t]} (θ2/µ) ||E[t]||_{2,1} + (1/2) ||E[t] − (Y − A Z[t+1] + Λ1[t]/µ)||_F². (13)
Let M[t] = Y − A Z[t+1] + Λ1[t]/µ. Update E[t+1] column-wise as follows:
e_j[t+1] ← (S_{θ2 µ⁻¹}[||m_j[t]||_2] / ||m_j[t]||_2) m_j[t]. (14)
Sparse and Low Rank Representations in Music Signal Analysis 32/54
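The column-wise update (14) shrinks the ℓ2 norm of each column of M[t], zeroing columns whose norm falls below the threshold. A vectorized sketch (names are ours; the small epsilon guards the division for zero columns):

```python
import numpy as np

def update_E(M, tau):
    """Column-wise l2/l1 shrinkage (14): scale column m_j by
    max(||m_j||_2 - tau, 0) / ||m_j||_2; sub-threshold columns become zero."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return M * scale

M = np.array([[3.0, 0.1],
              [4.0, 0.0]])   # column norms: 5 and 0.1
E = update_E(M, tau=1.0)     # first column scaled by 4/5, second zeroed
```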
79. Robust JSLRR
Updating of Lagrange multiplier matrices
Λ1[t+1] = Λ1[t] + µ(Y − AZ[t+1] − E[t+1] ),
Λ2[t+1] = Λ2[t] + µ(Z[t+1] − J[t+1] ),
Λ3[t+1] = Λ3[t] + µ(J[t+1] − W[t+1] ). (15)
Sparse and Low Rank Representations in Music Signal Analysis 32/54
80. Special cases
Robust joint SR (JSR)
The solution of the convex optimization problem is sought:
Robust JSR: argmin_{Z,E} ||Z||_1 + θ2 ||E||_{2,1} subject to Y = A Z + E. (16)
(16) takes into account the correlations between the test samples, while seeking to jointly represent the test samples from a specific class by a few columns of the training matrix.
L1(Z, J, E, Λ1, Λ2) = ||J||_1 + θ2 ||E||_{2,1} + tr[Λ1ᵀ (Y − A Z − E)] + tr[Λ2ᵀ (Z − J)] + (µ/2) (||Y − A Z − E||_F² + ||Z − J||_F²), (17)
where Λ1, Λ2 are Lagrange multipliers and µ > 0 is a penalty parameter.
Sparse and Low Rank Representations in Music Signal Analysis 33/54
83. Special cases
Robust LRR
The solution of the convex optimization problem is sought:
Robust LRR: argmin_{Z,E} ||Z||_* + θ2 ||E||_{2,1} subject to Y = A Z + E, (18)
by minimizing an augmented Lagrangian function similar to (17), where the first term ||J||_1 is replaced by ||J||_*.
Sparse and Low Rank Representations in Music Signal Analysis 34/54
85. Example 2
Noisy case
4 linear pairwise independent subspaces are constructed as in Example 1 and the matrices A ∈ R^{600×360} and Y ∈ R^{600×40} are obtained.
50 randomly picked columns of A are replaced by linear combinations of randomly chosen vectors from two subspaces with random weights. Thus, the training set is now contaminated by outliers.
The 5th column of the test matrix Y is replaced by a linear combination of vectors not drawn from any of the 4 subspaces, and the 15th column of Y is replaced by a vector drawn from the 1st and the 4th subspaces, as previously described.
Sparse and Low Rank Representations in Music Signal Analysis 35/54
88. Example 2
Representation matrices (zoom in the 5th and 15th test samples)
Sparse and Low Rank Representations in Music Signal Analysis 36/54
89. 1 Introduction
2 Auditory spectro-temporal modulations
3 Suitable data representations for classification
4 Joint sparse low-rank representations in the ideal case
5 Joint sparse low-rank representations in the presence of noise
6 Joint sparse low-rank subspace-based classification
7 Music signal analysis
8 Conclusions
Sparse and Low Rank Representations in Music Signal Analysis 37/54
90. Joint sparse low-rank subspace-based classification
Algorithm
Input: Training matrix A ∈ R^{d×N} and test matrix Y ∈ R^{d×M}.
Output: A class label for each column of Y.
1 Solve (8) to obtain Z ∈ R^{N×M} and E ∈ R^{d×M}.
2 for m = 1 to M
3   ȳ_m = y_m − e_m.
4   for k = 1 to K
5     Compute the residuals r_k(ȳ_m) = ||ȳ_m − A δ_k(z_m)||_2.
6   end for
7   class(ȳ_m) = argmin_k r_k(ȳ_m).
8 end for
Sparse and Low Rank Representations in Music Signal Analysis 38/54
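The classification loop above translates directly to numpy. A sketch (the function name, the `labels` vector encoding which training column belongs to which class, and the toy data are our assumptions; δ_k(z_m) keeps only the class-k coefficients of z_m):

```python
import numpy as np

def jslr_classify(A, Y, Z, E, labels, K):
    """Residual-based classification: denoise each test sample (y - e) and
    assign it to the class whose training columns best reconstruct it."""
    preds = []
    for m in range(Y.shape[1]):
        y_bar = Y[:, m] - E[:, m]          # step 3: remove the estimated noise
        residuals = []
        for k in range(K):
            mask = (labels == k)           # delta_k: keep class-k coefficients
            residuals.append(np.linalg.norm(y_bar - A[:, mask] @ Z[mask, m]))
        preds.append(int(np.argmin(residuals)))
    return np.array(preds)

# Toy check: two 1-D "subspaces" (the coordinate axes) in R^2.
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
labels = np.array([0, 1])
Y = np.array([[2.0, 0.0],
              [0.0, 3.0]])
Z = np.array([[2.0, 0.0],
              [0.0, 3.0]])
E = np.zeros((2, 2))
pred = jslr_classify(A, Y, Z, E, labels, K=2)
```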
98. Joint sparse low-rank subspace-based classification
Linearity concentration index
The LCI of a coefficient vector z_m ∈ R^N associated with the mth test sample is defined as
LCI(z_m) = (K · max_k ||δ_k(z_m)||_2 / ||z_m||_2 − 1) / (K − 1) ∈ [0, 1]. (19)
If LCI(z_m) = 1, the test sample is drawn from a single subspace. If LCI(z_m) = 0, the test sample is drawn evenly from all subspaces.
By choosing a threshold c ∈ (0, 1), the mth test sample is claimed to be valid if LCI(z_m) > c. Otherwise, the test sample can be either rejected as totally invalid (for very small values of LCI(z_m)) or classified into multiple classes by assigning to it the labels associated with the larger values of ||δ_k(z_m)||_2.
Sparse and Low Rank Representations in Music Signal Analysis 39/54
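Formula (19), as transcribed from the slide with ℓ2 norms, is a one-liner; a sketch (the `labels` encoding of training-column classes is our assumption):

```python
import numpy as np

def lci(z, labels, K):
    """Linearity concentration index (19): concentration of the coefficient
    mass of z in a single class, 1 = one class, small = spread out."""
    per_class = [np.linalg.norm(z[labels == k]) for k in range(K)]
    return (K * max(per_class) / np.linalg.norm(z) - 1.0) / (K - 1.0)

labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
z_single = np.array([1.0, 2.0, 0, 0, 0, 0, 0, 0])  # all mass in class 0
z_even = np.ones(8)                                 # spread over all classes
```

With ℓ2 norms, an evenly spread vector yields (√K − 1)/(K − 1) rather than exactly 0 (1/3 for K = 4), so 0 is the limit of extreme spreading rather than the even-spread value.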
100. 1 Introduction
2 Auditory spectro-temporal modulations
3 Suitable data representations for classification
4 Joint sparse low-rank representations in the ideal case
5 Joint sparse low-rank representations in the presence of noise
6 Joint sparse low-rank subspace-based classification
7 Music signal analysis
8 Conclusions
Sparse and Low Rank Representations in Music Signal Analysis 40/54
101. Music genre classification: Datasets and evaluation
procedure
GTZAN dataset
1000 audio recordings, each 30 seconds long^a;
10 genre classes: Blues, Classical, Country, Disco, HipHop, Jazz, Metal, Pop, Reggae, and Rock;
each genre class contains 100 audio recordings.
The recordings are converted to monaural wave format at 16 kHz sampling rate with 16 bits and normalized, so that they have zero mean amplitude with unit variance.
^a G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, July 2002.
Sparse and Low Rank Representations in Music Signal Analysis 41/54
102. Music genre classification: Datasets and evaluation
procedure
ISMIR 2004 Genre dataset
1458 full audio recordings;
6 genre classes: Classical (640), Electronic (229), JazzBlues (52), MetalPunk (90), RockPop (203), and World (244).
Sparse and Low Rank Representations in Music Signal Analysis 41/54
103. Music genre classification: Datasets and evaluation
procedure
Protocols
GTZAN dataset: stratified 10-fold cross-validation: Each training set
consists of 900 audio recordings yielding a training matrix AGTZAN .
ISMIR 2004 Genre dataset: The ISMIR2004 Audio Description Contest
protocol defines training and evaluation sets, which consist of 729
audio files each.
Sparse and Low Rank Representations in Music Signal Analysis 41/54
104. Music genre classification: Datasets and evaluation
procedure
Classifiers
The JSLRSC, the JSSC, and the LRSC;
the SRC^a with the coefficients estimated by the LASSO^b;
the linear regression classifier (LRC)^c;
the SVM with a linear kernel, and the NN classifier with the cosine similarity.
^a J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
^b R. Tibshirani, “Regression shrinkage and selection via the LASSO,” J. Royal Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
^c I. Naseem, R. Togneri, and M. Bennamoun, “Linear regression for face recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 11, pp. 2106–2112, 2010.
Sparse and Low Rank Representations in Music Signal Analysis 41/54
105. Music genre classification
Parameters θ1 > 0 and θ2 > 0
Sparse and Low Rank Representations in Music Signal Analysis 42/54
107. Music genre classification
Comparison with the state-of-the-art
Dataset:                 GTZAN                        ISMIR 2004 Genre
Rank  Reference           Accuracy (%)   Reference           Accuracy (%)
1)    Chang et al.^a      92.70          Lee et al.^b        86.83
2)    Lee et al.^b        90.60          Holzapfel et al.^c  83.50
3)    Panagakis et al.^d  84.30          Panagakis et al.^d  83.15
4)    Bergstra et al.^e   82.50          Pampalk et al.      82.30
5)    Tsunoo et al.^f     77.20
^a K. Chang, J. S. R. Jang, and C. S. Iliopoulos, “Music genre classification via compressive sampling,” in Proc. 11th Int. Symp. Music Information Retrieval, pp. 387–392, 2010.
^b C. H. Lee, J. L. Shih, K. M. Yu, and H. S. Lin, “Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features,” IEEE Trans. Multimedia, vol. 11, no. 4, pp. 670–682, 2009.
^c A. Holzapfel and Y. Stylianou, “Musical genre classification using nonnegative matrix factorization-based features,” IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 424–434, February 2008.
^d Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 576–588, 2010.
^e J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, “Aggregate features and AdaBoost for music classification,” Machine Learning, vol. 65, no. 2–3, pp. 473–484, 2006.
^f E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama, “Beyond timbral statistics: Improving music classification using percussive patterns and bass lines,” IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 1003–1014, 2011.
Sparse and Low Rank Representations in Music Signal Analysis 44/54
108. Music genre classification
Confusion matrices
Sparse and Low Rank Representations in Music Signal Analysis 45/54
109. Music genre classification
Dimensionality reduction via random projections
Let the true low dimensionality of the data be denoted by r. A random projection matrix, drawn from a zero-mean normal distribution, provides with high probability a stable embedding^a, with the dimensionality of the projection d selected as the minimum value such that d > 2r log(7680/d).
r is estimated by robust principal component analysis on a training set for each dataset.
d = 1581 is found for the GTZAN dataset and d = 1398 for the ISMIR 2004 Genre dataset.
^a R. G. Baraniuk, V. Cevher, and M. B. Wakin, “Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective,” Proceedings of the IEEE, vol. 98, no. 6, pp. 959–971, 2010.
Sparse and Low Rank Representations in Music Signal Analysis 46/54
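The minimal d satisfying d > 2r log(7680/d) can be found by a simple upward scan, since d − 2r log(7680/d) is increasing in d. A sketch (the function name is ours; 7680 is taken from the slide's criterion as the ambient feature dimensionality, and r = 500 below is an illustrative value, not the estimate used in the experiments):

```python
import math

def min_embedding_dim(r, D=7680):
    """Smallest integer d with d > 2 r log(D / d) (the slide's criterion)."""
    d = 1
    while d <= 2 * r * math.log(D / d):
        d += 1
    return d

d = min_embedding_dim(r=500)
```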
113. Music genre classification
Accuracy after rejecting 1 out of 5 test samples
JSLRSC achieves a classification accuracy of 95.51% on the GTZAN
dataset. For the ISMIR 2004 Genre dataset, the accuracy of JSSC is
92.63%, while that of JSLRSC is 91.55%.
[Figure: classification accuracy (%) versus rejection threshold c for JSLRSC, JSSC, LRSC, LRC, SRC, SVM, and NN; left panel: GTZAN (c from 0.29 to 0.35), right panel: ISMIR 2004 Genre (c from 0.41 to 0.48).]
Sparse and Low Rank Representations in Music Signal Analysis 48/54
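The slide does not spell out the rejection mechanism, so the following is only a generic sketch of reject-option classification: a sample is rejected when its top class-confidence score falls below the threshold c (the confidence measure actually thresholded by JSLRSC/JSSC may differ):

```python
import numpy as np

def classify_with_rejection(scores, c):
    """Reject-option classification: scores is an (N, K) array of
    per-class confidence values; a sample is rejected (label -1) when
    its best score falls below the threshold c. A generic sketch --
    the confidence measure used in the talk is not specified here."""
    labels = scores.argmax(axis=1)
    labels[scores.max(axis=1) < c] = -1
    return labels
```

Sweeping c over a range (as in the figure) trades rejection rate against accuracy on the retained samples.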
116. Music structure analysis
Optimization problem
Let a music recording of K music segments be represented by a
sequence of beat-synchronous feature vectors X = [x1 | x2 | . . .
| xN] ∈ R^{d×N}. Learn Z ∈ R^{N×N} by minimizing

argmin_{Z,E} λ1 ‖Z‖_1 + (λ2/2) ‖Z‖_F^2 + λ3 ‖E‖_1 subject to X = XZ + E, z_ii = 0.

Let Z = Ũ Σ̃ Ṽ^T be the skinny SVD of Z. Define U = Ũ Σ̃^{1/2}. Set
M = UU^T. Build a nonnegative symmetric affinity matrix W ∈ R_+^{N×N}
with elements w_ij = m_ij^2 and apply normalized cuts^a.
a
J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Analysis and Machine
Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
Sparse and Low Rank Representations in Music Signal Analysis 49/54
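The post-processing step (skinny SVD of Z, M = UU^T, w_ij = m_ij^2) can be sketched as follows; the resulting affinity matrix W would then be handed to a normalized-cuts or spectral-clustering routine with K clusters:

```python
import numpy as np

def segment_affinity(Z, tol=1e-10):
    """Post-process a learned self-representation Z (N x N):
    skinny SVD Z = U S V^T, U' = U S^(1/2), M = U'U'^T, w_ij = m_ij^2.
    Returns the nonnegative symmetric affinity matrix for normalized cuts."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    keep = s > tol * s.max()          # drop numerically zero singular values
    Up = U[:, keep] * np.sqrt(s[keep])
    M = Up @ Up.T
    return M ** 2                     # elementwise square: nonnegative, symmetric
```

Squaring M elementwise guarantees nonnegativity, and symmetry follows from M = U'U'^T, so W is a valid input for normalized cuts.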
117. Music tagging
Optimization problem
Assume that the tag-recording matrix Y and the matrix of the ATM
representations X are jointly low-rank. Learn a low-rank weight matrix
W such that:

argmin_{W,E} ‖W‖_* + λ ‖E‖_1 subject to Y = WX + E. (20)
Sparse and Low Rank Representations in Music Signal Analysis 50/54
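One plausible way to solve a problem of the form (20) is an inexact augmented Lagrangian method with an auxiliary split J = W, alternating singular value thresholding for the nuclear norm, a least-squares update for W, and soft thresholding for E. This is a sketch under those assumptions, not necessarily the solver used in the talk:

```python
import numpy as np

def soft(A, t):
    """Elementwise soft thresholding (proximal operator of the l1 norm)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def svt(A, t):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def low_rank_tagger(Y, X, lam=0.1, mu=1.0, rho=1.1, iters=200):
    """Inexact-ALM sketch for  min ||W||_* + lam ||E||_1  s.t.  Y = W X + E,
    via the auxiliary split J = W."""
    T = Y.shape[0]                  # number of tags
    d = X.shape[0]                  # feature dimensionality
    W = np.zeros((T, d)); J = W.copy(); E = np.zeros_like(Y)
    L1 = np.zeros_like(Y); L2 = np.zeros_like(W)
    I = np.eye(d)
    for _ in range(iters):
        J = svt(W + L2 / mu, 1.0 / mu)                  # nuclear-norm prox
        A = (Y - E + L1 / mu) @ X.T + J - L2 / mu       # normal equations for W
        W = A @ np.linalg.inv(X @ X.T + I)
        E = soft(Y - W @ X + L1 / mu, lam / mu)         # sparse-error prox
        L1 += mu * (Y - W @ X - E)                      # dual ascent
        L2 += mu * (W - J)
        mu = min(mu * rho, 1e10)
    return W, E
```

Tags for a new recording x would then be predicted by thresholding the scores W @ x. The increasing penalty mu drives the constraint Y = WX + E toward feasibility.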
120. Conclusions
Summary-Future Work
A robust framework for solving classification and clustering
problems in music signal analysis has been developed.
In all three problems addressed, the proposed techniques either
achieve top performance or match the state of the art.
Efficient implementations exploiting incremental update rules are
desperately needed.
Performance improvement for small sample sets deserves further
elaboration.
Sparse and Low Rank Representations in Music Signal Analysis 53/54