COMPARISON OF FEATURES FOR MUSICAL INSTRUMENT RECOGNITION

Antti Eronen
Signal Processing Laboratory, Tampere University of Technology
P.O. Box 553, FIN-33101 Tampere, Finland
antti.eronen@tut.fi

ABSTRACT

Several features were compared with regard to recognition performance in a musical instrument recognition system. Both mel-frequency and linear prediction cepstral and delta cepstral coefficients were calculated. Linear prediction analysis was carried out both on a uniform and a warped frequency scale, and reflection coefficients were also used as features. The performance of earlier described features relating to the temporal development, modulation properties, brightness, and spectral synchrony of sounds was also analysed. The database consisted of 5286 acoustic and synthetic solo tones from 29 different Western orchestral instruments, out of which 16 instruments were included in the test set. The best performance for solo tone recognition, 35% for individual instruments and 77% for families, was obtained with a feature set consisting of two sets of mel-frequency cepstral coefficients and a subset of the other analysed features. The confusions made by the system were analysed and compared to results reported in a human perception experiment.

1. INTRODUCTION

Automatic musical instrument recognition is a fascinating and essential subproblem in music indexing, retrieval, and automatic transcription. It is closely related to computational auditory scene analysis. However, musical instrument recognition has not received as much research interest as speaker recognition, for instance.

The musical instrument recognition systems implemented to date still have limited practical usability. Brown has reported a system that is able to recognize four woodwind instruments from monophonic recordings with a performance comparable to that of humans [1]. Martin's system recognized a wider set of instruments, although it did not perform as well as human subjects in a similar task [2].

This paper continues the work presented in [3] by using new cepstral features and introducing a significant extension to the evaluation data. The research focuses on comparing different features with regard to recognition accuracy in a solo tone recognition task. First, we analyse different cepstral features that are based either on linear prediction (LP) or filterbank analysis. Both conventional LP, which has uniform frequency resolution, and the more psychoacoustically motivated warped linear prediction (WLP) are used. WLP based features have not been used for musical instrument recognition before. Second, other features are analysed that are related to the temporal development, modulation properties, brightness, and spectral synchrony of sounds.

The evaluation database is extended to include several examples of a particular instrument. Both acoustic and synthetic isolated notes of 16 Western orchestral instruments are used for testing, whereas the training data includes examples of 29 instruments. The performance of the system and the confusions it makes are compared to the results reported in a human perception experiment, which used a subset of the same data as stimuli [2].
2. FEATURE EXTRACTION

2.1. Cepstral features

For isolated musical tones, the onset has been found to be important for recognition by human subjects [4]. Motivated by this, the cepstral analyses are made separately for the onset and steady state segments of a tone. Each tone is segmented into onset and steady state segments based on the root mean square (RMS) energy level of the signal. The steady state begins when the signal achieves its average RMS energy level for the first time, and the onset segment is the 10 dB rise before this point.
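As a concrete illustration of this segmentation rule, here is a minimal sketch, not the paper's implementation; the 10 ms frame length and the function names are our assumptions:

```python
import numpy as np

def segment_tone(x, sr, frame=0.010):
    """Split a tone into onset and steady-state segments from its RMS envelope.

    The steady state starts where the envelope first reaches the average
    RMS level of the tone; the onset is the preceding rise from 10 dB
    below that point. The 10 ms frame length is an illustrative choice.
    """
    hop = int(frame * sr)
    frames = len(x) // hop
    rms = np.sqrt([np.mean(x[i*hop:(i+1)*hop] ** 2) for i in range(frames)])
    rms_db = 20 * np.log10(rms + 1e-12)
    avg_db = 20 * np.log10(np.mean(rms) + 1e-12)
    steady = int(np.argmax(rms_db >= avg_db))    # first frame at the average level
    below = np.where(rms_db[:steady] <= avg_db - 10)[0]
    onset = int(below[-1]) if below.size else 0  # last frame 10 dB below it
    return onset * hop, steady * hop             # onset start, steady-state start
```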
For the onset portion of tones, both LP and filterbank analyses were performed in approximately 20 ms long Hamming-windowed frames with 25% overlap. In the steady state segment, a frame length of 40 ms was used. If the onset was shorter than 80 ms, the beginning of the steady state was moved forward so that at least 80 ms was analysed. Prior to the analyses, each acoustic signal was preemphasized with the high-pass filter $1 - 0.97z^{-1}$ to flatten the spectrum.

The LP coefficients were obtained from an all-pole approximation of the windowed waveform, and were computed using the autocorrelation method. In the calculation of the WLP coefficients, the frequency warping transformation was obtained by replacing the unit delays of the predicting filter with first-order all-pass elements. In the z-domain this corresponds to the mapping

$$z^{-1} \rightarrow \tilde{z}^{-1} = \frac{z^{-1} - \lambda}{1 - \lambda z^{-1}}. \tag{1}$$

In the implementation this means replacing the autocorrelation network with a warped autocorrelation network [5]. The parameter $\lambda$ is selected so that the resulting frequency mapping approximates the desired frequency scale; by selecting $\lambda = 0.7564$ for 44.1 kHz samples, a Bark scale approximation is obtained [6].
based either on linear prediction (LP) or filterbank analysis. Both     are transformed into cepstral coefficients cn with the recursion [7,
conventional LP having uniform frequency resolution and more           pp. 163]
psychoacoustically motivated warped linear prediction (WLP) are                                               n–1

                                                                                                              ∑ kck an – k .
used. WLP based features have not been used for musical instru-                                       1
                                                                                        c n = – a n – --
                                                                                                       -                               (2)
ment recognition before. Second, other features are analysed that                                     n
                                                                                                              k=1
are related to the temporal development, modulation properties,
                                                                       The number of cepstral coefficients was equal to the analysis
brightness, and spectral synchronity of sounds.
                                                                       order after the zeroth coefficient, which is a function of the chan-
     The evaluation database is extended to include several exam-
                                                                       nel gain, was discarded.
ples of a particular instrument. Both acoustic and synthetic iso-
                                                                           For the mel-frequency cepstral coefficient (MFCC) calcula-
lated notes of 16 Western orchestral instruments are used for
                                                                       tions, a discrete Fourier transform was first calculated for the win-
testing, whereas the training data includes examples of 29 instru-




21-24 October 2001, New Paltz, New York                                                                                          W2001-1
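Recursion (2) maps directly to code; a small sketch (the function name is ours):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n from LP coefficients a_1..a_p via
    recursion (2); a_k with k > p is taken as zero."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k - 1] * a[n - k - 1]
                  for k in range(1, n) if n - k <= p)
        c[n - 1] = -(a[n - 1] if n <= p else 0.0) - acc / n
    return c
```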
For the mel-frequency cepstral coefficient (MFCC) calculations, a discrete Fourier transform was first calculated for the windowed waveform. The length of the transform was 1024 or 2048 points for the 20 ms and 40 ms frames, respectively. Forty triangular bandpass filters with equal bandwidth on the mel-frequency scale were simulated, and the MFCCs were calculated from the log filterbank amplitudes using a shifted discrete cosine transform [7, p. 189].
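The MFCC pipeline described above can be sketched as follows. The frame and transform sizes, the 40-filter count, and the dropped zeroth (gain) coefficient follow the text; the mel formula, filter shapes, and normalization details are illustrative assumptions (the paper's analysis was based on Slaney's implementation, see Section 7).

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=44100, n_fft=1024, n_filt=40, n_ceps=12):
    """MFCCs of one windowed frame: DFT -> 40 triangular filters spaced
    uniformly on the mel scale -> log -> shifted DCT, dropping the
    zeroth (gain) coefficient. n_ceps = 12 follows Section 4."""
    spec = np.abs(np.fft.rfft(frame, n_fft))
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filt + 2))
    bins = np.floor(edges * n_fft / sr).astype(int)
    fbank = np.zeros((n_filt, len(spec)))
    for i in range(n_filt):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        if c > lo: fbank[i, lo:c] = np.linspace(0.0, 1.0, c - lo, endpoint=False)
        if hi > c: fbank[i, c:hi] = np.linspace(1.0, 0.0, hi - c, endpoint=False)
    logamp = np.log(fbank @ spec + 1e-12)
    k = np.arange(1, n_ceps + 1)            # skip coefficient 0 (channel gain)
    dct = np.cos(np.pi / n_filt * np.outer(k, np.arange(n_filt) + 0.5))
    return dct @ logamp
```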
In all cases, the median values of the cepstral coefficients were stored for the onset and steady state segments. Delta cepstral coefficients were calculated by fitting a first-order polynomial over the cepstral trajectories. For the delta cepstral coefficients, the median of their absolute values was calculated. We also experimented with coefficient standard deviations in the case of the MFCCs.
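One plausible reading of the delta computation is a regression delta, i.e. the slope of a first-order polynomial fitted over a sliding window of the cepstral trajectory, summarized per segment by medians; the window length here is our assumption:

```python
import numpy as np

def delta(ceps, w=2):
    """Regression delta: slope of a first-order polynomial fitted over a
    sliding window of +/- w frames (w = 2 is our assumption)."""
    n = len(ceps)
    pad = np.pad(ceps, ((w, w), (0, 0)), mode='edge')
    denom = 2.0 * sum(k * k for k in range(1, w + 1))
    return sum(k * (pad[w + k:w + k + n] - pad[w - k:w - k + n])
               for k in range(1, w + 1)) / denom

def segment_features(ceps):
    """Stored statistics for one segment, ceps of shape (frames, coeffs):
    medians of the cepstra and medians of the absolute delta cepstra."""
    return np.median(ceps, axis=0), np.median(np.abs(delta(ceps)), axis=0)
```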
2.2. Spectral and temporal features

The calculation of the other features analysed in this study has been described in [3] and is only briefly summarized here.
The amplitude envelope contains information about, for example, the type of excitation, i.e. whether a violin has been bowed or plucked. Tight coupling between the excitation and the resonance structure is indicated by a short onset duration. To measure the slope of the amplitude decay after the onset, a line was fitted over the amplitude envelope on a dB scale; the mean square error of this fit was also used as a feature. The crest factor, i.e. the maximum divided by the RMS value, was also used to characterize the shape of the amplitude envelope.

The strength and frequency of amplitude modulation (AM) were measured in two frequency ranges: 4-8 Hz to measure tremolo, i.e. AM in conjunction with vibrato, and 10-40 Hz for the graininess or roughness of tones.
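A minimal sketch of these envelope features, under stated assumptions: the envelope is the framewise RMS in dB, the decay slope comes from a least-squares line after the onset, and the AM frequency and strength are read from the strongest peak of the envelope spectrum in each band (the strength normalization is our choice):

```python
import numpy as np

def envelope_features(env, fps, onset_end):
    """env: linear RMS envelope sampled at fps frames/s (fps > 80 so that
    40 Hz AM is observable); onset_end: frame where the onset ends."""
    env_db = 20 * np.log10(env + 1e-12)
    # Post-onset decay: least-squares line over the dB envelope;
    # both its slope and the mean square error serve as features.
    t = np.arange(onset_end, len(env)) / fps
    slope, icpt = np.polyfit(t, env_db[onset_end:], 1)
    mse = np.mean((env_db[onset_end:] - (slope * t + icpt)) ** 2)
    crest = env.max() / np.sqrt(np.mean(env ** 2))   # crest factor
    # AM frequency/strength: strongest envelope-spectrum peak per band.
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fps)
    am = {}
    for name, f_lo, f_hi in (("tremolo", 4.0, 8.0), ("roughness", 10.0, 40.0)):
        band = (freqs >= f_lo) & (freqs <= f_hi)
        i = int(np.argmax(np.where(band, spec, 0.0)))
        am[name] = (freqs[i], spec[i] / (spec.sum() + 1e-12))
    return slope, mse, crest, am
```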
The spectral centroid (SC) corresponds to perceived brightness and has been one of the interpretations for the dissimilarity ratings in many multidimensional scaling studies [4]. The SC was calculated from a short-time power spectrum of the signal using logarithmic frequency resolution. The normalized value of the SC is the absolute value in Hz divided by the fundamental frequency. The mean, maximum, and standard deviation of the SC were used as features.
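A sketch of the centroid feature; averaging on a logarithmic frequency axis is one possible reading of "logarithmic frequency resolution", not necessarily the paper's exact procedure:

```python
import numpy as np

def spectral_centroid(frame, sr, f0, f_min=50.0):
    """Centroid of one frame's power spectrum averaged on a log-frequency
    axis, plus its f0-normalized value (log-domain averaging assumed)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    keep = freqs >= f_min                    # skip DC to keep the log defined
    w = power[keep] / (power[keep].sum() + 1e-12)
    sc = np.exp(np.sum(w * np.log(freqs[keep])))   # centroid in Hz
    return sc, sc / f0
```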
Onset asynchrony refers to differences in the rate of energy development of different frequency components. A sinusoid envelope representation was used to calculate the intensity envelopes of the different harmonics, and the standard deviation of the onset durations of the harmonics was used as one feature. Another feature measuring this property is obtained by fitting the intensity envelopes of the individual harmonics to the overall intensity envelope during the onset period; the average mean square error of those fits was used as a feature.

The fundamental frequency (f0) of the tones is measured using the algorithm from [8] and used as a feature. Its standard deviation was also used as a measure of vibrato.

3. EXPERIMENTAL SETUP

Samples from five different sources were included in the validation database. First, the samples used in [3] consisted of the samples from the McGill University Master Samples collection (MUMS) [9], as well as recordings of an acoustic guitar made at Tampere University of Technology. The other sources of samples were the University of Iowa website, IRCAM Studio Online (SOL), and a Roland XP-30 synthesizer. The MUMS and SOL samples are recorded in studios with different acoustic characteristics and recording equipment, and the samples from the University of Iowa are recorded in an anechoic chamber. The samples from the Roland synthesizer were played on the keyboard and recorded through analog lines into a Silicon Graphics Octane workstation. The synthesizer has a dynamic keyboard, so these samples have varying dynamics. The samples from SOL include only the first 1.5 seconds of the played note.

The cross validation aimed at conditions as realistic as possible with this data set. On each trial, the training data consisted of all the samples except those of the particular performer and instrument being tested. In this way, the training data is maximally utilized, but the system has never heard the samples from that particular instrument in those circumstances before. There were 16 instruments that had at least three independent recordings, so these instruments were used for testing. The instruments can be seen in Figure 4. A total of 5286 samples of 29 Western orchestral instruments were included in the data set, out of which 3337 samples were used for testing. The classifier made its choice among the 29 instruments. In these tests, a random guesser would score 3.5% in the individual instrument recognition task, and 16.7% in family classification.

In each test, classifications were performed separately for the instrument family and individual instrument cases. A k-nearest neighbours (kNN) classifier was used, with k = 11 for instrument family classification and k = 5 for individual instrument classification. The distance metric was the Mahalanobis distance with an equal covariance matrix for all classes. This was implemented by using the discrete form of the Karhunen-Loeve transform to uncorrelate the features and normalize their variances, and then using the Euclidean distance metric in the normalized space.
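The described classifier can be sketched compactly: the discrete Karhunen-Loeve transform is an eigendecomposition of the pooled covariance matrix; projecting onto the eigenvectors and dividing by the square roots of the eigenvalues whitens the features, after which Euclidean distance equals the shared-covariance Mahalanobis distance. A minimal sketch (names and regularization are ours):

```python
import numpy as np
from collections import Counter

def fit_whitener(X, eps=1e-9):
    """Eigendecompose the pooled covariance of the training features X
    (rows = samples). In the returned space, Euclidean distance equals
    the Mahalanobis distance with one covariance shared by all classes."""
    mu = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return mu, evecs / np.sqrt(evals + eps)  # scale columns to unit variance

def knn_predict(Xtr, ytr, x, mu, W, k=5):
    """kNN vote in the whitened space; k = 5 (instrument) or 11 (family)."""
    Ztr = (Xtr - mu) @ W
    dists = np.linalg.norm(Ztr - (x - mu) @ W, axis=1)
    votes = [ytr[i] for i in np.argsort(dists)[:k]]
    return Counter(votes).most_common(1)[0][0]
```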




4. RESULTS

Different orders of the linear prediction filter were used to see their effect on the performance of several LP and WLP based features. The results for instrument family and individual instrument recognition are shown in Figure 1. The feature vector at all points consisted of two sets of coefficients: medians over the onset period and medians over the steady state. The optimal analysis order was between 9 and 14, above and below which performance degrades. The number of cepstral coefficients was one less than the analysis order. WLP cepstral and reflection coefficients outperformed LP cepstral and reflection coefficients at all analysis orders calculated. The best accuracy with the LP based features, 33% for individual instruments (66% for instrument families), was obtained with WLP cepstral coefficients (WLPCC) of order 13.

[Figure 1. Classification performance as a function of analysis order for different LP based features; percent correct for family and instrument recognition versus LP analysis order.]
In Figure 2, the classification accuracy is presented as a function of the features. The cepstral parameters are mel-frequency cepstral coefficients or their derivatives. The optimal number of MFCCs was 12, above and below which the performance slowly degraded. However, optimization of the filterbank parameters for the MFCCs should still be done, but was left for future research. Using the MFCCs from both the onset and the steady state, the accuracies were 32% (69%). Because of computational cost considerations, the MFCCs were selected as the cepstrum features for the remaining experiments. Adding the mel-frequency delta cepstrum coefficients (DMFCC) slightly improved the performance; using the MFCCs and DMFCCs of the steady state resulted in 34% (72%) accuracy.

[Figure 2. Classification performance as a function of features; percent correct per feature for individual instrument and instrument family classification. The features printed in italics were included in the best performing configuration.]

The other features alone did not prove very successful. Onset duration was the most successful, with 35% accuracy in instrument family classification. For individual instruments, the spectral centroid gave the best accuracy, 10%. Both were clearly inferior to the MFCCs and DMFCCs. It should be noted, however, that the MFCC features are vectors of coefficients, whereas the other features each consist of a single number.

The best accuracy, 35% (77%), was obtained by using a feature vector consisting of the features printed in italics in Figure 2. The feature set was found by using a subset of the data and a simple backward selection algorithm. If the MFCCs were replaced with order 13 WLPCCs, the accuracy was 35% (72%).
In practical situations, a recognition system is likely to have more than one note to use for classification. A simulation was made to test the system's behaviour in this situation. Random sequences of notes were generated and each note was classified individually. The final classification result was pooled across the sequence by using the majority rule. The recognition accuracies were averaged over 50 runs for each instrument and note sequence length. Figure 3 shows the average accuracies for individual instrument and family classification. With 11 random notes, the average accuracy increased to 51% (96%). In instrument family classification, the recognition accuracy for the tenor saxophone was the worst (55% with 11 notes), whereas the accuracy for all other instruments was over 90%. In the case of individual instruments, the accuracy for the tenor trombone, tuba, cello, violin, viola, and guitar was poorer than with one note; the accuracy for the other instruments was higher.

[Figure 3. Classification performance as a function of note sequence length.]
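The pooling step is a plain majority vote over the single-note decisions; for example (tie-breaking by first occurrence is our choice):

```python
from collections import Counter

def classify_sequence(note_decisions):
    """Pool per-note classifications with the majority rule; ties break
    toward the earliest-seen label (our choice)."""
    return Counter(note_decisions).most_common(1)[0][0]

# classify_sequence(["viola", "violin", "viola"]) -> "viola"
```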
The recognition accuracy depends on the recording circumstances, as may be expected. The individual instrument recognition accuracies were 32%, 87%, 21%, and 37% for the samples from the MUMS, Iowa, Roland, and SOL sources, respectively. The Iowa samples included only the woodwinds and the French horn, which were on average recognized with 49% accuracy. Thus, the recognition accuracy is clearly better for the Iowa samples recorded in an anechoic chamber. The samples from the other three sources are comparable, with the exception that the samples from SOL did not include the tenor or soprano sax. With the synthesized samples the performance is clearly worse, which is probably due both to the varying quality of the synthetic tones and to the varying dynamics.

5. DISCUSSION

The confusion matrix for the feature set giving the best accuracy is presented in Figure 4. There are large differences in the recognition accuracies of the different instruments. The soprano sax is recognized correctly in 72% of the cases, while the classification accuracies for the violin and the guitar are only 4%. The French horn is the most common target of misclassifications.
[Figure 4. Confusion matrix for the best performing feature set. Entries are expressed as percentages and are rounded to the nearest integer. The boxes indicate instrument families.]

It is interesting to compare the behaviour of the system to that of human subjects. Martin [2] has reported a listening experiment in which fourteen subjects recognized 137 samples from the McGill collection, a subset of the data used in our evaluations. The differences in the instrument sets are small: Martin's samples did not include any sax or guitar samples, but had the piccolo and the English horn, which were not present in our test data. In his test, the subjects recognized the individual instrument correctly in 45.9% of cases (91.7% for instrument families). Our system made more outside-family confusions than the subjects in Martin's test. It was not able to generalize into more abstract instrument families as well as humans, which was also the case in Martin's computer simulations [2]. In individual instrument classification, the difference is perhaps smaller.

The within-family confusions made by the system are quite similar to the confusions made by humans. Examples include the French horn as the tenor trombone and vice versa, the tuba as the French horn, or the B-flat clarinet as the E-flat clarinet. The confusions between the viola and the violin, and between the cello and the double bass, were also common to both humans and our system. Among the confusions occurring outside the instrument family, confusions of the B-flat clarinet as the soprano or alto sax were common to both our system and the subjects.

6. CONCLUSIONS

Warped linear prediction based features proved successful in the automatic recognition of musical instrument solo tones, and resulted in better accuracy than that obtained with the corresponding conventional LP based features. The mel-frequency cepstral coefficients gave the best accuracy in instrument family classification, and would also be the selection for the sake of computational complexity. The best overall accuracy was obtained by augmenting the mel-cepstral coefficients with features describing the type of excitation, brightness, modulations, synchrony, and fundamental frequency of tones.

Care should be taken when interpreting the presented results on the accuracy obtained with different features. First, the best set of features for musical instrument recognition depends on the context [2,4]. Second, the extraction algorithms for the features other than the cepstral coefficients are still in their early stages of development. However, since the accuracy improved when the cepstral features were augmented with the other features, this approach should be developed further.

7. ACKNOWLEDGEMENT

The samples made available on the web by the University of Iowa (http://theremin.music.uiowa.edu/~web/) and IRCAM (http://soleil.ircam.fr/) helped greatly in collecting our database. The warping toolbox by Härmä and Karjalainen (http://www.acoustics.hut.fi/software/warp/) was used for the calculation of the WLP based features. Our MFCC analysis was based on Slaney's implementation (http://rvl4.ecn.purdue.edu/~malcolm/interval/1998-010/).

8. REFERENCES

[1] Brown, J. C. "Feature dependence in the automatic identification of musical woodwind instruments." J. Acoust. Soc. Am., Vol. 109, No. 3, pp. 1064-1072, 2001.
[2] Martin, K. D. Sound-Source Recognition: A Theory and Computational Model. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1999. Available at: http://sound.media.mit.edu/Papers/kdm-phdthesis.pdf.
[3] Eronen, A. & Klapuri, A. "Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features." Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, June 5-9, 2000.
[4] Handel, S. "Timbre perception and auditory object identification." In Moore (ed.), Hearing. New York, Academic Press.
[5] Härmä, A. et al. "Frequency-Warped Signal Processing for Audio Applications." J. Audio Eng. Soc., Vol. 48, No. 11, pp. 1011-1031, 2000.
[6] Smith, J. O. & Abel, J. S. "Bark and ERB Bilinear Transforms." IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 6, pp. 697-708, 1999.
[7] Rabiner, L. R. & Juang, B. H. Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[8] Klapuri, A. "Pitch Estimation Using Multiple Independent Time-Frequency Windows." Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999.
[9] Opolko, F. & Wapnick, J. McGill University Master Samples (compact disc). McGill University, 1987.

Mais conteúdo relacionado

Semelhante a Comparison of features for musical instrument recognition

Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...
Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...
Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...Cemal Ardil
 
International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...CSCJournals
 
Computational Spectroscopy in G03
Computational Spectroscopy in G03Computational Spectroscopy in G03
Computational Spectroscopy in G03Inon Sharony
 
Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalizationKamal Bhatt
 
Fractal analysis of resting state functional connectivity of the brain
Fractal analysis of resting state functional connectivity of the brainFractal analysis of resting state functional connectivity of the brain
Fractal analysis of resting state functional connectivity of the brainWonsang You
 
Emotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio SpeechEmotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio SpeechIOSR Journals
 
Recursive Neural Networks
Recursive Neural NetworksRecursive Neural Networks
Recursive Neural NetworksSangwoo Mo
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_ASrimatre K
 
IRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVMIRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVMIRJET Journal
 
The Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” Task
The Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” TaskThe Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” Task
The Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” Taskmultimediaeval
 
Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...
Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...
Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...IDES Editor
 
Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2hasan_elektro
 
⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture Tasks
⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture Tasks⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture Tasks
⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture TasksVictor Asanza
 
ECG_based_Biometric_Recognition_using_Wa.pdf
ECG_based_Biometric_Recognition_using_Wa.pdfECG_based_Biometric_Recognition_using_Wa.pdf
ECG_based_Biometric_Recognition_using_Wa.pdfssuser48b47e
 
Coincidence analysis of the effect of cochlear disparities on temporal respo...
Coincidence analysis of the effect of cochlear  disparities on temporal respo...Coincidence analysis of the effect of cochlear  disparities on temporal respo...
Coincidence analysis of the effect of cochlear disparities on temporal respo...TeruKamogashira
 
1103.0305
1103.03051103.0305
1103.0305nk179
 
Long latency responses (Niraj)
Long latency responses (Niraj)Long latency responses (Niraj)
Long latency responses (Niraj)Niraj Kumar
 

Semelhante a Comparison of features for musical instrument recognition (20)

Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...
Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...
Investigation of-combined-use-of-mfcc-and-lpc-features-in-speech-recognition-...
 
International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (1) Issue...
 
Computational Spectroscopy in G03
Computational Spectroscopy in G03Computational Spectroscopy in G03
Computational Spectroscopy in G03
 
Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalization
 
Fractal analysis of resting state functional connectivity of the brain
Fractal analysis of resting state functional connectivity of the brainFractal analysis of resting state functional connectivity of the brain
Fractal analysis of resting state functional connectivity of the brain
 
Emotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio SpeechEmotion Recognition Based On Audio Speech
Emotion Recognition Based On Audio Speech
 
Recursive Neural Networks
Recursive Neural NetworksRecursive Neural Networks
Recursive Neural Networks
 
Clinical electro physio assessment
Clinical electro physio assessmentClinical electro physio assessment
Clinical electro physio assessment
 
Advancements in Neural Vocoders
Advancements in Neural VocodersAdvancements in Neural Vocoders
Advancements in Neural Vocoders
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_A
 
IRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVMIRJET- Implementing Musical Instrument Recognition using CNN and SVM
IRJET- Implementing Musical Instrument Recognition using CNN and SVM
 
018 bci gamma band
018   bci gamma band018   bci gamma band
018 bci gamma band
 
The Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” Task
The Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” TaskThe Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” Task
The Munich LSTM-RNN Approach to the MediaEval 2014 “Emotion in Music” Task
 
Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...
Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...
Chebyshev Functional Link Artificial Neural Networks for Denoising of Image C...
 
Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2Csss2010 20100803-kanevski-lecture2
Csss2010 20100803-kanevski-lecture2
 
⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture Tasks
⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture Tasks⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture Tasks
⭐⭐⭐⭐⭐ EMG Signal Processing with Clustering Algorithms for Motor Gesture Tasks
 
ECG_based_Biometric_Recognition_using_Wa.pdf
ECG_based_Biometric_Recognition_using_Wa.pdfECG_based_Biometric_Recognition_using_Wa.pdf
ECG_based_Biometric_Recognition_using_Wa.pdf
 
Coincidence analysis of the effect of cochlear disparities on temporal respo...
Coincidence analysis of the effect of cochlear  disparities on temporal respo...Coincidence analysis of the effect of cochlear  disparities on temporal respo...
Coincidence analysis of the effect of cochlear disparities on temporal respo...
 
1103.0305
1103.03051103.0305
1103.0305
 
Long latency responses (Niraj)
Long latency responses (Niraj)Long latency responses (Niraj)
Long latency responses (Niraj)
 

Mais de chakravarthy Gopi

400 puzzles-and-answers-for-interview
400 puzzles-and-answers-for-interview400 puzzles-and-answers-for-interview
400 puzzles-and-answers-for-interviewchakravarthy Gopi
 
Linear predictive coding documentation
Linear predictive coding  documentationLinear predictive coding  documentation
Linear predictive coding documentationchakravarthy Gopi
 
The performance of turbo codes for wireless communication systems
The performance of turbo codes for wireless communication systemsThe performance of turbo codes for wireless communication systems
The performance of turbo codes for wireless communication systemschakravarthy Gopi
 
implementation of _phase_lock_loop_system_using_control_system_techniques
implementation of _phase_lock_loop_system_using_control_system_techniquesimplementation of _phase_lock_loop_system_using_control_system_techniques
implementation of _phase_lock_loop_system_using_control_system_techniqueschakravarthy Gopi
 
A REAL TIME PRECRASH VEHICLE DETECTION SYSTEM
A REAL TIME PRECRASH VEHICLE DETECTION SYSTEMA REAL TIME PRECRASH VEHICLE DETECTION SYSTEM
A REAL TIME PRECRASH VEHICLE DETECTION SYSTEMchakravarthy Gopi
 
Lpc vocoder implemented by using matlab
Lpc vocoder implemented by using matlabLpc vocoder implemented by using matlab
Lpc vocoder implemented by using matlabchakravarthy Gopi
 

Mais de chakravarthy Gopi (12)

14104042 resume
14104042 resume14104042 resume
14104042 resume
 
400 puzzles-and-answers-for-interview
400 puzzles-and-answers-for-interview400 puzzles-and-answers-for-interview
400 puzzles-and-answers-for-interview
 
More puzzles to puzzle you
More puzzles to puzzle youMore puzzles to puzzle you
More puzzles to puzzle you
 
puzzles to puzzle u
 puzzles to puzzle u puzzles to puzzle u
puzzles to puzzle u
 
Mahaprasthanam
MahaprasthanamMahaprasthanam
Mahaprasthanam
 
GATE Ece(2012)
GATE Ece(2012)GATE Ece(2012)
GATE Ece(2012)
 
Linear predictive coding documentation
Linear predictive coding  documentationLinear predictive coding  documentation
Linear predictive coding documentation
 
The performance of turbo codes for wireless communication systems
The performance of turbo codes for wireless communication systemsThe performance of turbo codes for wireless communication systems
The performance of turbo codes for wireless communication systems
 
implementation of _phase_lock_loop_system_using_control_system_techniques
implementation of _phase_lock_loop_system_using_control_system_techniquesimplementation of _phase_lock_loop_system_using_control_system_techniques
implementation of _phase_lock_loop_system_using_control_system_techniques
 
A REAL TIME PRECRASH VEHICLE DETECTION SYSTEM
A REAL TIME PRECRASH VEHICLE DETECTION SYSTEMA REAL TIME PRECRASH VEHICLE DETECTION SYSTEM
A REAL TIME PRECRASH VEHICLE DETECTION SYSTEM
 
Model resume
Model resumeModel resume
Model resume
 
Lpc vocoder implemented by using matlab
Lpc vocoder implemented by using matlabLpc vocoder implemented by using matlab
Lpc vocoder implemented by using matlab
 

Comparison of features for musical instrument recognition

  • 1. COMPARISON OF FEATURES FOR MUSICAL INSTRUMENT RECOGNITION Antti Eronen Signal Processing Laboratory, Tampere University of Technology P.O.Box 553, FIN-33101 Tampere, Finland antti.eronen@tut.fi ABSTRACT ments. The performance of the system and the confusions it Several features were compared with regard to recognition per- makes are compared to the results reported in a human perception formance in a musical instrument recognition system. Both mel- experiment, which used a subset of the same data as stimuli [2]. frequency and linear prediction cepstral and delta cepstral coeffi- cients were calculated. Linear prediction analysis was carried out 2. FEATURE EXTRACTION both on a uniform and a warped frequency scale, and reflection 2.1. Cepstral features coefficients were also used as features. The performance of earlier For isolated musical tones, the onset has been found to be described features relating to the temporal development, modula- important for recognition by human subjects [4]. Motivated by tion properties, brightness, and spectral synchronity of sounds this, the cepstral analyses are made separately for the onset and was also analysed. The data base consisted of 5286 acoustic and steady state segments of a tone. Based on the root mean square synthetic solo tones from 29 different Western orchestral instru- (RMS) -energy level of the signal, each tone is segmented into ments, out of which 16 instruments were included in the test set. onset and steady state segments. The steady state begins when the The best performance for solo tone recognition, 35% for individ- signal achieves its average RMS-energy level for the first time, ual instruments and 77% for families, was obtained with a feature and the onset segment is the 10 dB rise before this point. set consisting of two sets of mel-frequency cepstral coefficients For the onset portion of tones, both LP and filterbank analyses and a subset of the other analysed features. The confusions made were performed in approximately 20 ms length hamming win- by the system were analysed and compared to results reported in a dowed frames with 25% overlap. In the steady state segment, human perception experiment. frame length of 40 ms was used. If the onset was shorter than 80 1. INTRODUCTION ms, the beginning of steady state was moved forward so that at least 80 ms was analysed. Prior to the analyses, each acoustic sig- Automatic musical instrument recognition is a fascinating and nal was preemphasized with the high pass filter 1, – 0.97z to –1 essential subproblem in music indexing, retrieval, and automatic flatten the spectrum. transcription. It is closely related to computational auditory scene The LP coefficients were obtained from an all-pole approxi- analysis. However, musical instrument recognition has not mation of the windowed waveform, and were computed using the received as much research interest as speaker recognition, for autocorrelation method. In the calculation of the WLP coeffi- instance. cients, the frequency warping transformation was obtained by The implemented musical instrument recognition systems replacing the unit delays of the predicting filter with first-order still have limited practical usability. Brown has reported a system all-pass elements. In the z-domain this can be interpreted by the that is able to recognize four woodwind instruments from mono- mapping phonic recordings with a performance comparable to that of –1 human’s [1]. 
Martin’s system recognized a wider set of instru- –1 –1z –λ z →z ˜ = ------------------- . - (1) –1 ments, although it did not perform as well as human subjects in a 1 – λz similar task [2]. In the implementation this means replacing the autocorrela- This paper continues the work presented in [3] by using new tion network with a warped autocorrelation network [5]. The cepstral features and introducing a significant extension to the parameter λ is selected in such a way that the resulting frequency evaluation data. The research focuses on comparing different fea- mapping approximates the desired frequency scale. By selecting tures with regard of recognition accuracy in a solo tone recogni- λ=0.7564 for 44.1 kHz samples, a Bark scale approximation was tion task. First, we analyse different cepstral features that are obtained [6]. Finally, the obtained linear prediction coefficients an based either on linear prediction (LP) or filterbank analysis. Both are transformed into cepstral coefficients cn with the recursion [7, conventional LP having uniform frequency resolution and more pp. 163] psychoacoustically motivated warped linear prediction (WLP) are n–1 ∑ kck an – k . used. WLP based features have not been used for musical instru- 1 c n = – a n – -- - (2) ment recognition before. Second, other features are analysed that n k=1 are related to the temporal development, modulation properties, The number of cepstral coefficients was equal to the analysis brightness, and spectral synchronity of sounds. order after the zeroth coefficient, which is a function of the chan- The evaluation database is extended to include several exam- nel gain, was discarded. ples of a particular instrument. Both acoustic and synthetic iso- For the mel-frequency cepstral coefficient (MFCC) calcula- lated notes of 16 Western orchestral instruments are used for tions, a discrete Fourier transform was first calculated for the win- testing, whereas the training data includes examples of 29 instru- 21-24 October 2001, New Paltz, New York W2001-1
  • 2. dowed waveform. The length of the transform was 1024 or 2048 point for 20 ms and 40 ms frames, respectively. 40 triangular WLP cepstra Refl. coeffs. (WLP) bandpass filters having equal bandwith on the mel-frequency 70 LP cepstra scale were simulated, and the MFCCs were calculated from the Refl. coeffs. (LP) 60 log filterbank amplitudes using a shifted discrete cosine transform Percent correct [7, p.189]. 50 Family recognition In all cases, the median values of cepstral coefficients were 40 stored for the onset and steady state segments. Delta cepstral coefficients were calculated by fitting a first order polynomial 30 over the cepstral trajectories. For the delta-cepstral coefficients, 20 Instrument recognition the median of their absolute value was calculated. We also experi- mented with coefficient standard deviations in the case of the 10 MFCCs. 0 0 5 10 15 20 25 30 35 40 45 2.2. Spectral and temporal features LP analysis order Calculation of the other features analysed in this study has Figure 1. Classification performance as a function of analysis been described in [3] and will be only shortly summarized here. order for different LP based features. Amplitude envelope contains information e.g. about the type of excitation; i.e. whether a violin has been bowed or plucked. samples are recorded in studios with different acoustic character- Tight coupling between the excitation and the resonance structure istics and recording equipment, and the samples from Iowa Uni- is indicated by a short onset duration. To measure the slope of the versity are recorded in an anechoic chamber. The samples from amplitude decay after the onset, a line was fitted over the ampli- the Roland synthesizer were played on the keyboard and recorded tude envelope on a dB scale. Also, the mean square error of the fit through analog lines into a Silicon Graphics Octane workstation. was used as a feature. Crest factor, i.e. maximum / RMS value The synthesizer has a dynamic keyboard, thus these samples have was also used to characterize the shape of the amplitude envelope. varying dynamics. The samples from SOL include only the first Strength and frequency of amplitude modulation (AM) was 1.5 seconds of the played note. measured at two frequency ranges: from 4-8 Hz to measure trem- Cross validation aimed at as realistic conditions as possible olo, i.e. AM in conjunction with vibrato, and 10-40 Hz for graini- with this data set. On each trial, the training data consisted of all ness or roughness of tones. the samples except those of the particular performer and instru- Spectral centroid (SC) corresponds to perceived brightness ment being tested. In this way, the training data is maximally uti- and has been one of the interpretations for the dissimilarity rat- lized, but the system has never heard the samples from that ings in many multidimensional scaling studies [4]. SC was calcu- particular instrument in those circumstances before. There were lated from a short time power spectrum of the signal using 16 instruments that had at least three independent recordings, so logarithmic frequency resolution. The normalized value of SC is these instruments were used for testing. The instruments can be the absolute value in Hz divided by the fundamental frequency. seen in Figure 4. A total of 5286 samples of 29 Western orchestral The mean, maximum and standard deviation values of SC were instruments were included in the data set, out of which 3337 sam- used as features. ples were used for testing. 
The classifier made its choice among Onset asynchrony refers to the differences in the rate of the the 29 instruments. In these tests, a random guesser would score energy development of different frequency components. A sinu- 3.5% in the individual instrument recognition task, and 16.7% in soid envelope representation was used to calculate the intensity family classification. envelopes for different harmonics, and the standard deviation of In each test, classifications were performed separately for the onset durations for different harmonics was used as a one feature. instrument family and individual instrument cases. A k-nearest Another feature measuring this property is obtained by fitting the neighbours (kNN) classifier was used, where the values of k were intensity envelopes of individual harmonics into the overall inten- 11 for instrument family and for 5 individual instrument classifi- sity evelope during the onset period, and the average mean square cation. The distance metric was Mahalanobis with equal covari- error of those fits was used as a feature. ance matrix for all classes, which was implemented by using the Fundamental frequency (f0) of tones is measured using the discrete form of the Karhunen-Loeve transform to uncorrelate the algorithm from [8], and used as a feature. Also, its standard devi- features and normalize the variances, and then by using the eucli- ation was used as measure for vibrato. dean distance metric in the normalized space. 3. EXPERIMENTAL SETUP 4. RESULTS Samples from five different sources were included in the vali- Different orders of the linear prediction filter were used to see dation database. First, the samples used in [3] consisted of the the effect of that on the performance of several LP and WLP samples from the McGill University Master Samples Collection based features. The results for instrument family and individual (MUMS) [9], as well as recordings of an acoustic guitar made at instrument recognition are shown in Figure 1. The feature vector Tampere University of Technology. The other sources of samples at all points consisted of two sets of coefficients: medians over the were the University of Iowa website, IRCAM Studio Online onset period and medians over the steady state. The optimal anal- (SOL), and a Roland XP-30 synthesizer. The MUMS and SOL ysis order was between 9 and 14, above and below which per- 21-24 October 2001, New Paltz, New York W2001-2
  • 3. Individual instrument 100 Instrument family 80 Random guess (instrument) Percent correct Random guess (family) 60 23 std of MFCCs of steady state std of MFCCs of onset 21 DMFCCs of steady state 40 DMFCCs of onset 19 MFCCs of steady 20 Individual instrument MFCCs of onset Instrument family 17 std of f0 0 1 3 5 7 9 11 fundamental frequency (f0) Feature Note sequence length 15 onset duration error of fit between onset intensities Figure 3. Classification performance as a function of note 13 std of component onset durations strength of AM, range 10−40Hz sequence length. 11 frequency of AM, range 10−40Hz backward select algorithm. If the MFCCs were replaced with heuristic strength of AM, range 4−8Hz 9 strength of AM, range 4−8Hz order 13 WLPCCs, the accuracy was 35% (72%). frequency of AM, range 4−8Hz In practical situations, a recognition system is likely to have 7 std of normalized SC more than one note to use for classification. A simulation was std of SC 5 mean of SC made to test the system’s behaviour in this situation. Random mean of normalized SC sequences of notes were generated and each note was classified 3 crest factor individually. The final classification result was pooled across the mean square error of line fit slope of line fit (post onset decay) sequence by using the majority rule. The recognition accuracies 1 were averaged over 50 runs for each instrument and note 0 10 20 30 40 50 60 sequence length. Figure 3 shows the average accuracies for indi- Percent correct vidual instrument and family classification. With 11 random Figure 2. Classification performance as a function of features. notes, the average accuracy increased to 51% (96%). In instru- The features printed in italics were included in the best per- ment family classification, the recognition accuracy for the tenor forming configuration. saxophone was the worst (55% with 11 notes), whereas the accu- racy for the all other instruments was over 90%. In the case of formance degrades. The number of cepstral coefficients was one individual instruments, the accuracy for the tenor trombone, tuba, less than the analysis order. WLP cepstral and reflection coeffi- cello, violin, viola and guitar was poorer than with one note, the cients outperformed LP cepstral and reflection coefficients at all accuracy for the other instruments was higher. analysis orders calculated. The best accuracy with LP based fea- The recognition accuracy depends on the recording circum- tures was 33% for individual instruments (66% for instrument stances, as may be expected. The individual instrument recogni- families), and was obtained with WLP cepstral coefficients tion accuracies were 32%, 87%, 21% and 37% for the samples (WLPCC) of order 13. from MUMS, Iowa, Roland and SOL sources, respectively. The In Figure 2, the classification accuracy is presented as a func- Iowa samples included only the woodwinds and the French horn, tion of features. The cepstral parameters are mel-frequency ceps- which were on the average recognized with 49% accuracy. Thus, tral coefficients or their derivatives. The optimal number of the recognition accuracy is clearly better for the Iowa samples MFCCs was 12, above and below which the performance slowly recorded in an anechoic chamber. The samples from the other degraded. However, optimization of the filter bank parameters three sources are comparable with the exception that the samples should be done for the MFCCs, but was left for future research. from SOL did not include tenor or soprano sax. 
In Figure 2, the classification accuracy is presented as a function of the features.

[Figure 2: horizontal bar chart of classification performance per feature (percent correct, 0–60) for individual instrument and instrument family classification, with the random-guess baselines. The features, from top to bottom: std of MFCCs of steady state; std of MFCCs of onset; DMFCCs of steady state; DMFCCs of onset; MFCCs of steady state; MFCCs of onset; std of f0; fundamental frequency (f0); onset duration; error of fit between onset intensities; std of component onset durations; strength of AM, range 10–40 Hz; frequency of AM, range 10–40 Hz; heuristic strength of AM, range 4–8 Hz; strength of AM, range 4–8 Hz; frequency of AM, range 4–8 Hz; std of normalized SC; std of SC; mean of SC; mean of normalized SC; crest factor; mean square error of line fit; slope of line fit (post onset decay). Caption: Classification performance as a function of features. The features printed in italics were included in the best performing configuration.]

The cepstral parameters in Figure 2 are mel-frequency cepstral coefficients or their derivatives. The optimal number of MFCCs was 12, above and below which the performance slowly degraded. However, optimization of the filter bank parameters for the MFCCs should still be done; this was left for future research. Using the MFCCs from both the onset and the steady state gave accuracies of 32% (69%). Because of computational cost considerations, the MFCCs were selected as the cepstrum features for the remaining experiments. Adding the mel-frequency delta cepstrum coefficients (DMFCC) slightly improved the performance: using the MFCCs and DMFCCs of the steady state resulted in 34% (72%) accuracy.
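The paper does not specify how the delta coefficients were computed; a common definition, assumed in this sketch (names ours), is a first-order regression over a window of neighbouring frames:

```python
import numpy as np

def delta_cepstrum(ceps, width=2):
    """Delta coefficients by linear regression over +/- `width` frames.

    `ceps` has shape (n_frames, n_coeffs) of floats; the edges are
    handled by repeating the first and last frames. This regression
    form is the usual definition, not necessarily the paper's exact
    variant:
      delta_t = sum_{k=1}^{W} k (c_{t+k} - c_{t-k}) / (2 sum_k k^2)
    """
    n = ceps.shape[0]
    padded = np.pad(ceps, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    delta = np.zeros_like(ceps)
    for k in range(1, width + 1):
        delta += k * (padded[width + k:width + k + n]
                      - padded[width - k:width - k + n])
    return delta / denom
```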
The other features alone did not prove very successful. Onset duration was the most successful, with 35% accuracy in instrument family classification. For individual instruments, the spectral centroid gave the best accuracy, 10%. Both were clearly inferior to the MFCCs and DMFCCs. It should be noted, however, that the MFCC features are vectors of coefficients, whereas each of the other features consists of a single number.

The best accuracy, 35% (77%), was obtained by using a feature vector consisting of the features printed in italics in Figure 2. The feature set was found by using a subset of the data and a simple backward selection algorithm. If the MFCCs were replaced with order 13 WLPCCs, the accuracy was 35% (72%).

In practical situations, a recognition system is likely to have more than one note to use for classification. A simulation was made to test the system's behaviour in this situation: random sequences of notes were generated, each note was classified individually, and the final classification result was pooled across the sequence by using the majority rule. The recognition accuracies were averaged over 50 runs for each instrument and note sequence length. Figure 3 shows the average accuracies for individual instrument and family classification.

[Figure 3: classification performance (percent correct, 0–100) as a function of note sequence length (1–11) for individual instrument and instrument family classification. Caption: Classification performance as a function of note sequence length.]

With 11 random notes, the average accuracy increased to 51% (96%). In instrument family classification, the recognition accuracy for the tenor saxophone was the worst (55% with 11 notes), whereas the accuracy for all the other instruments was over 90%. In the case of individual instruments, the accuracy for the tenor trombone, tuba, cello, violin, viola and guitar was poorer than with one note; the accuracy for the other instruments was higher.
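The majority-rule pooling used in this simulation can be sketched as follows (Python; function and variable names are ours, and the per-note classifier outputs are assumed to be available):

```python
import numpy as np
from collections import Counter

def classify_sequence(note_labels):
    """Pool per-note classifications with the majority rule;
    ties are broken by the first label to reach the top count."""
    return Counter(note_labels).most_common(1)[0][0]

def simulate_accuracy(pred_by_instrument, seq_len, n_runs=50, rng=None):
    """Average sequence-level accuracy over random note sequences.

    `pred_by_instrument` maps each true instrument label to the NumPy
    array of per-note predicted labels the classifier produced for it.
    """
    rng = rng or np.random.default_rng()
    hits, trials = 0, 0
    for true_label, preds in pred_by_instrument.items():
        for _ in range(n_runs):
            seq = rng.choice(preds, size=seq_len, replace=True)
            hits += (classify_sequence(seq.tolist()) == true_label)
            trials += 1
    return hits / trials
```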
The recognition accuracy depends on the recording circumstances, as may be expected. The individual instrument recognition accuracies were 32%, 87%, 21% and 37% for the samples from the MUMS, Iowa, Roland and SOL sources, respectively. The Iowa samples included only the woodwinds and the French horn, which were on average recognized with 49% accuracy. Thus, the recognition accuracy is clearly better for the Iowa samples, which were recorded in an anechoic chamber. The samples from the other three sources are comparable, with the exception that the SOL samples did not include the tenor or soprano sax. With the synthesized samples the performance is clearly worse, which is probably due both to the varying quality of the synthetic tones and to the varying dynamics.

5. DISCUSSION

The confusion matrix for the feature set giving the best accuracy is presented in Figure 4.

[Figure 4: confusion matrix for the best performing feature set. Caption: Entries are expressed as percentages and are rounded to the nearest integer. The boxes indicate instrument families.]

There are large differences in the recognition accuracies of the different instruments. The soprano sax is recognized correctly in 72% of the cases, while the classification accuracies for the violin and the guitar are only 4%. The French horn is the most common target of misclassifications.

It is interesting to compare the behaviour of the system to that of human subjects. Martin [2] has reported a listening experiment in which fourteen subjects recognized 137 samples from the McGill collection, a subset of the data used in our evaluations. The differences in the instrument sets are small: Martin's samples did not include any sax or guitar samples, but had the piccolo and the English horn, which were not present in our test data. In his test, the subjects recognized the individual instrument correctly in 45.9% of cases (91.7% for instrument families). Our system made more outside-family confusions than the subjects in Martin's test; it was not able to generalize to more abstract instrument families as well as humans, which was also the case in Martin's computer simulations [2]. In individual instrument classification, the difference is perhaps smaller.

The within-family confusions made by the system are quite similar to the confusions made by humans. Examples include the French horn classified as the tenor trombone and vice versa, the tuba as the French horn, and the B-flat clarinet as the E-flat clarinet. The confusions between the viola and the violin, and between the cello and the double bass, were also common to both humans and our system. Among the confusions occurring outside the instrument family, confusing the B-flat clarinet with the soprano or alto sax was common to both our system and the subjects.

6. CONCLUSIONS

Warped linear prediction based features proved successful in the automatic recognition of musical instrument solo tones, and resulted in better accuracy than the corresponding conventional LP based features. The mel-frequency cepstral coefficients gave the best accuracy in instrument family classification, and would also be the choice for reasons of computational complexity. The best overall accuracy was obtained by augmenting the mel-cepstral coefficients with features describing the type of excitation, brightness, modulations, synchrony and fundamental frequency of tones.

Care should be taken when interpreting the presented results on the accuracy obtained with different features. First, the best set of features for musical instrument recognition depends on the context [2,4]. Second, the extraction algorithms for features other than the cepstral coefficients are still in their early stages of development. However, since the accuracy improved when the cepstral features were combined with the other features, this approach should be developed further.

7. ACKNOWLEDGEMENT

The samples made available on the web by the University of Iowa (http://theremin.music.uiowa.edu/~web/) and IRCAM (http://soleil.ircam.fr/) helped greatly in collecting our database. The warping toolbox by Härmä and Karjalainen (http://www.acoustics.hut.fi/software/warp/) was used for the calculation of the WLP based features. Our MFCC analysis was based on Slaney's implementation (http://rvl4.ecn.purdue.edu/~malcolm/interval/1998-010/).

8. REFERENCES

[1] Brown, J. C. "Feature dependence in the automatic identification of musical woodwind instruments." J. Acoust. Soc. Am., Vol. 109, No. 3, pp. 1064-1072, 2001.
[2] Martin, K. D. Sound-Source Recognition: A Theory and Computational Model. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1999. Available at: http://sound.media.mit.edu/Papers/kdm-phdthesis.pdf.
[3] Eronen, A. & Klapuri, A. "Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features." Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, June 5-9, 2000.
[4] Handel, S. "Timbre perception and auditory object identification." In Moore (ed.), Hearing. New York: Academic Press.
[5] Härmä, A. et al. "Frequency-Warped Signal Processing for Audio Applications." J. Audio Eng. Soc., Vol. 48, No. 11, pp. 1011-1031, 2000.
[6] Smith, J. O. & Abel, J. S. "Bark and ERB Bilinear Transforms." IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 6, pp. 697-708, 1999.
[7] Rabiner, L. R. & Juang, B. H. Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[8] Klapuri, A. "Pitch Estimation Using Multiple Independent Time-Frequency Windows." Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999.
[9] Opolko, F. & Wapnick, J. McGill University Master Samples (compact disc). McGill University, 1987.