Konversa: A Personal Audiolog

Guru Gopalakrishnan    Krishnakanth Chimalamarri    Midhun Achuthan
University of Southern California
{ggopalak,chimalam,achuthan}@usc.edu

Abstract

In recent years, the computing capability of mobile phones has increased drastically. With the advent of 3G, these devices also remain connected to the internet over high-bandwidth connections for most of the day. We present Konversa (derived from "conversa", the Portuguese word for conversation), a system that logs a mobile phone user's non-phone conversations over the day. These conversations are time-stamped and location-stamped, so that the user can keep track of when and where each conversation took place.

1. Introduction

Mobile phones are unobtrusive devices that are carried by the user throughout the day. They have capable microphones that can capture sound of average quality, although the quality varies across environments. Outdoor conversations are often subject to heavy noise interference, and even indoor conversations may be disturbed by music or crowd noise. There are also situations where the user is not participating in a conversation but is close enough to another conversation that the phone's microphone picks up its audio.

Konversa discards such conversations and, by the end of the day, presents the user with a time- and location-stamped log of all the conversations he took part in during the day. This is done by opportunistically sending the recorded audio samples to a remote server on the internet. Classification algorithms run on the server to determine which of the samples contain relevant conversations, and the server sends the decision back to the phone so that irrelevant samples can be discarded. Konversa runs as a service on the Android G1 phone and provides an interactive application to play back the audio clips by time and location on a map. The backend server is a Linux machine that listens for requests from the phone. Konversa is written in Java and MATLAB.

2. Design and Implementation

2.1 Design Issues

We faced a number of challenges while designing Konversa. Although the Android G1 is computationally superior to most other phones on the market, it is not powerful enough to run the classification algorithms. In addition, the input audio samples have to be filtered and processed before features can be extracted. Doing this on the phone without interfering with basic operations such as calls and text messaging is not feasible given the limited memory the phone allocates to user applications. To overcome this limitation, we set up a backend server to which the phone can opportunistically connect and upload recorded samples. The server processes these samples and sends the results back to the phone.

2.2 System Specification

Konversa was deployed on the developer version of the Android G1 phone. The Android platform supports applications written in Java that are compiled for and run on the Dalvik VM. The phone comes with various built-in sensors; for this experiment, we use the microphone to record audio and the GPS service to track the location of the phone. The backend server runs Ubuntu Linux 8 with a JRE. The classification algorithms are written in MATLAB, which also runs on the server.
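As a rough illustration of how a clip could be captured and location-stamped with the standard Android APIs mentioned above (MediaRecorder for the microphone, LocationManager for GPS), the following Java sketch records one clip and reads the last known GPS fix. The class and method names are illustrative, and the scheduling and caching logic of the recorder thread described in section 2.3.2 below is omitted.

```java
import android.content.Context;
import android.location.Location;
import android.location.LocationManager;
import android.media.MediaRecorder;
import java.io.IOException;

// Minimal sketch: record one audio clip and read the last known GPS fix.
// Names such as ClipCapture and captureClip are illustrative, not Konversa's actual API.
public class ClipCapture {

    public static Location captureClip(Context ctx, String outputPath, long durationMs)
            throws IOException, InterruptedException {
        MediaRecorder recorder = new MediaRecorder();
        recorder.setAudioSource(MediaRecorder.AudioSource.MIC);          // phone microphone
        recorder.setOutputFormat(MediaRecorder.OutputFormat.THREE_GPP);  // 3GP container
        recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);     // AMR narrow-band audio
        recorder.setOutputFile(outputPath);
        recorder.prepare();
        recorder.start();
        Thread.sleep(durationMs);        // record for the requested duration (e.g. 60 s)
        recorder.stop();
        recorder.release();

        // Location stamp: last known GPS fix (may be null if GPS has no fix yet).
        LocationManager lm = (LocationManager) ctx.getSystemService(Context.LOCATION_SERVICE);
        return lm.getLastKnownLocation(LocationManager.GPS_PROVIDER);
    }
}
```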
2.3 Architecture

2.3.1 Network Architecture

Konversa communicates with the backend server via 802.11g Wi-Fi or 3G, and does so opportunistically. When there is no connectivity, it caches all recorded samples in the local phone memory. When a connection is available, a separate thread uploads as many samples as it can via SFTP. After processing, the server sends back a decision based on which the phone either saves or discards each sample.

Fig 1. Konversa network

2.3.2 System Architecture

Konversa runs on the phone as a service and uses the phone's microphone to capture audio. The recorder thread wakes up at specified intervals, enables the microphone and starts capturing. It saves the audio file in the native 3GP format in a temporary cache on the phone, along with its time stamp and location stamp. This cache contains unclassified and unprocessed audio files. The communications thread periodically picks an unprocessed file and sends it to the backend server via Secure FTP (SFTP). We chose SFTP because personal audio samples may contain private information and need to be sent over a secure channel.

Fig 2. System architecture

On receiving a file, the server performs some pre-processing before the features can be extracted and classified. The following steps take place on the server.

a. Conversion to raw format

Android (and most other phones) record audio in the 3GP format specified by the Third Generation Partnership Project. This format stores video as MPEG-4 Part 2, H.263 or MPEG-4 Part 10 (AVC/H.264), and audio as AMR-NB, AMR-WB, AMR-WB+, AAC-LC or HE-AAC [3]. In order to apply signal processing techniques, the audio has to be decoded and uncompressed to a raw format. The server decodes the AMR audio channel to raw WAV, which the audio signal processing tools use for further processing. The open source tool ffmpeg is used to decode the samples to their raw form.

b. Noise removal

Audio captured on the phone is often subject to environmental noise. Konversa suppresses noise using a band-pass filter: a low-pass filter removes noise above 3400 Hz and a high-pass filter removes noise below 300 Hz, leaving a sample with a frequency range between 300 and 3400 Hz. Two different tools were tested, JSyn and Sound eXchange (SoX); SoX gave better performance in terms of speed.

c. Silence removal

Most conversation samples contain silence for irregular periods of time. The presence of silence affects the features extracted from the audio sample, which in turn affects classification accuracy. After experimenting with various thresholds, we got the best results by trimming anything below 0.3% of the sample's peak amplitude. Removing silence also shortens the clip length by varying amounts. SoX was used to remove the silence from the clips.

Fig 3. Comparison of waveforms after each step of filtering

Fig 4. Comparison of waveforms (a) before and (b) after removing silence
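As a rough sketch of this pre-processing chain, the server-side steps (a)-(c) could be driven from Java by invoking ffmpeg and SoX as external processes, along the following lines. The file naming and the exact SoX silence-trimming arguments are illustrative assumptions, not the parameters Konversa actually uses.

```java
import java.io.File;
import java.io.IOException;

// Sketch of the server-side pre-processing chain: decode, band-pass filter, trim silence.
// Paths and the exact SoX silence arguments are illustrative assumptions.
public class PreProcessor {

    public static File preprocess(File clip3gp) throws IOException, InterruptedException {
        File raw = new File(clip3gp.getPath().replace(".3gp", ".wav"));
        File filtered = new File(clip3gp.getPath().replace(".3gp", "-bp.wav"));
        File trimmed = new File(clip3gp.getPath().replace(".3gp", "-clean.wav"));

        // a. Decode the AMR audio track in the 3GP container to raw WAV.
        run("ffmpeg", "-y", "-i", clip3gp.getPath(), raw.getPath());

        // b. Band-pass to the 300-3400 Hz voice band using SoX high-pass and low-pass filters.
        run("sox", raw.getPath(), filtered.getPath(), "highpass", "300", "lowpass", "3400");

        // c. Trim segments below a small fraction of peak amplitude (arguments are illustrative).
        run("sox", filtered.getPath(), trimmed.getPath(),
            "silence", "-l", "1", "0.1", "0.3%", "-1", "0.5", "0.3%");

        return trimmed;
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("Command failed: " + String.join(" ", cmd));
        }
    }
}
```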
d. Splitting up samples for conversation detection

The recorded clips originally have a duration of 60s. In order to detect a conversation, these samples have to be split up further. We found that splitting them into 5s parts was sufficient for the classifier to identify the speaker's presence. Since removing silence shortens the clip, the last part is discarded if it is shorter than 5s.

e. Extracting features

In order to classify audio, we need to extract features that are unique to a speaker. The most widely used features are Mel-Frequency Cepstral Coefficients (MFCCs). These are the coefficients that collectively make up an MFC, a representation of the short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum of a spectrum") [4]. The features for each part are extracted using MATLAB.

f. Classification

The resulting feature vector is then classified using the VQ codebook that was initially trained on the user's voice. The algorithm is explained in detail in the next section. The result is sent back to the phone. If it is "yes", the communication thread moves the file into the classified files list, which the user can access through the GUI via the native SQLite database; otherwise the file is removed from the cache.
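On the phone, handling the server's answer amounts to either cataloguing the clip or deleting it. The sketch below shows one way this could look with Android's SQLite API; the table and column names are hypothetical, not Konversa's actual schema.

```java
import android.content.ContentValues;
import android.database.sqlite.SQLiteDatabase;
import java.io.File;

// Sketch of the client-side decision handler. The "clips" table and its columns
// are hypothetical names, not the schema Konversa actually uses.
public class DecisionHandler {

    public static void handleDecision(SQLiteDatabase db, File clip, boolean isConversation,
                                      long timestampMillis, double latitude, double longitude) {
        if (isConversation) {
            // Keep the clip: record its path, time stamp and location stamp for the GUI.
            ContentValues row = new ContentValues();
            row.put("path", clip.getAbsolutePath());
            row.put("timestamp", timestampMillis);
            row.put("latitude", latitude);
            row.put("longitude", longitude);
            db.insert("clips", null, row);
        } else {
            // Not a conversation involving the user: discard the cached file.
            clip.delete();
        }
    }
}
```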
3. Classification

Konversa has a fairly simple multi-threaded architecture, where separate threads handle capturing and communication as explained in section 2.3. Most of the development work was devoted to identifying and implementing the right kind of classifier. After experimenting with a few different models, we had to choose one that gives the maximum accuracy with minimum implementation constraints and integrates seamlessly with the underlying client-server architecture. In this section we describe the models we experimented with and why we chose the vector quantization model.

a. Artificial Neural Networks

Artificial neural networks implement a discrimination-based learning procedure. We used a three-layer multilayer perceptron model trained with backpropagation. 20 MFCC features were extracted for each of 20 windows of the sample, each window of size 512; these features were fed to 400 perceptrons in the input layer. The hidden layer consisted of 300 perceptrons, and the output layer had a single perceptron that outputs the result of the classification. A similar model has previously been used for text-dependent speech recognition, where the number of output perceptrons corresponds to the categories of phonemes to identify [2]. Although skeptical that this model would work well for text-independent classification, we decided to experiment and see what the results would look like. The network was trained with samples from 4 different speakers. We refer to the target speaker as Starget and to the remaining participants as S1, S2 and S3.

Fig 5. Structure of the neural network

For each of the 4 speakers, we collected 6 samples of length 10s to create the training set. The ideal output for Starget was {1.0} and {0.0} for the remaining speakers. The network was trained for 500 epochs per sample or until the error rate dropped to 0.001. After training on each of the samples, the network tried to classify the samples from the training set and from a validation set containing 4 samples per speaker that did not belong to the training set. The results were not very impressive: we achieved a classification accuracy of around 60-70% even on the training set and 50-60% on the validation set, clearly indicating that the neural network model was not suitable in this scenario. Furthermore, as noted in [8], the optimal structure of such a network has to be selected by trial and error, the available data has to be split into training and cross-validation sets, and the temporal structure of speech signals remains difficult to handle, all of which makes the approach disadvantageous.
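For reference, the forward pass of the 400-300-1 network just described is small enough to sketch directly. The layer sizes come from the text; the sigmoid activation, the weight fields, and everything omitted (initialization and the backpropagation loop) are assumptions for illustration.

```java
// Forward pass of the 400-300-1 multilayer perceptron described above.
// Weights would be learned by backpropagation; that training loop is omitted here.
public class SpeakerMlp {

    private final double[][] w1 = new double[300][400];  // input -> hidden weights
    private final double[] b1 = new double[300];
    private final double[] w2 = new double[300];          // hidden -> output weights
    private double b2;

    /** Returns a score near 1.0 for Starget and near 0.0 for other speakers. */
    public double classify(double[] mfcc) {               // mfcc.length == 400 (20 coeffs x 20 windows)
        double[] hidden = new double[300];
        for (int j = 0; j < 300; j++) {
            double sum = b1[j];
            for (int i = 0; i < 400; i++) sum += w1[j][i] * mfcc[i];
            hidden[j] = sigmoid(sum);
        }
        double out = b2;
        for (int j = 0; j < 300; j++) out += w2[j] * hidden[j];
        return sigmoid(out);
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
}
```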
b. Gaussian Mixture Models

A mixture model is a probabilistic model for density estimation using a mixture distribution, and can be regarded as a form of unsupervised learning or clustering. Voice is considered to be a mixture of Gaussian components, and GMMs are known to perform well for speaker recognition. A mixture model consists of several Gaussian components, each with a mean, a variance and a weight. These have to be initialized to some value and then trained using EM. We initialized with k-means clustering, using 19 clusters, over MFCC vectors of 20 dimensions.

We used a library called COMIRVA as a starting point for our design. This library was tuned for musical instruments, so we had to adapt it; in particular, we added a floor value of 0.01 to the covariances [6] and restricted the MFCC frequency range to the voice band between 300 and 3000 Hz. We initially recorded a training sample of length 90s on the phone [6] and trained our GMM with EM (Expectation Maximization) from the initialized values. To classify an input sample, we compute the log-likelihood that the model generated the sample. On each EM iteration, the standard re-estimation formulas for the mixture weights, means and variances are applied, which guarantee a monotonic increase in the model's likelihood value [5].

From our experiments we decided not to go with GMMs, due to their poor performance in the presence of noise and our limited training set. From [6,7] we find that GMMs are usually trained on 16 speakers or more; we were testing with 3 speakers, and this could be one of the reasons why the GMM did not work well. Also, because of singularities in the covariance matrix during training, we had to use a variance floor. Although this technique is standard in speech processing, it might have removed the subtle differences between speakers in our limited training set.

c. Vector Quantization (LBG)

Vector quantization (VQ) is a lossy data compression method based on the principle of block coding; it is a fixed-to-fixed length algorithm. In 1980, Linde, Buzo and Gray (LBG) proposed a VQ design algorithm based on a training sequence. Before this, VQ design was considered difficult because it required multidimensional integrals; using training sequences eliminates the need for these integrals. This is the LBG VQ algorithm, which we used in our implementation.

VQ is an approximation algorithm. Similar to rounding-off, the VQ design problem is: "Given a vector source with its statistical properties known, given a distortion measure, and given the number of code-vectors, find a codebook and a partition which result in the smallest average distortion" [10].
Two conditions have to be satisfied by this algorithm:

* Nearest Neighbor Condition: the encoding region for a code-vector should consist of all vectors that are closer to that code-vector than to any other code-vector.

* Centroid Condition: a code-vector should be the average of all training vectors that fall in its encoding region. At least one training vector per region should be ensured to avoid divide-by-zero problems.

LBG solves the design problem by applying these two conditions iteratively. Initially the centroid of all training vectors is calculated; a splitting stage then doubles the number of code-vectors, and the two conditions are applied iteratively until the algorithm terminates. We chose a codebook size of 16. Fig 6 shows an example of VQ clustering [9].

Fig 6. LBG-VQ clustering, from [9]

4. Training

Once the classifier was determined, we performed a systematic training process. The samples collected for training go through all the pre-processing steps outlined in the architecture. Training was performed on 3 different speakers.

In the first round of training, 4 samples of length 10s were collected for each speaker. Clips 1-4 were mapped to the target speaker; clips 5-12 correspond to the other 2 speakers. The results were good for classifying samples that were recorded close to the mouth, but there was some confusion when classifying samples recorded at a distance.

We then changed the training set. The sample length was increased to 45 seconds to model more realistic conversations; after removing silence, these lengths were reduced by varying amounts. Each speaker talked for 45 seconds in two different positions: in the first, the phone was held close to the mouth, and in the second, the phone was kept on a desk about 1.5m away from the speaker. This training set gave better classification accuracy in both positions.

One major question remained unanswered: how will the model scale to participants who were not part of the training set? Answering it requires collecting as many audio samples as possible; we managed to increase the training set to 6 speakers. The results are provided in section 6.
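To make the codebook training and matching described in sections 3c and 4 concrete, the following is a minimal Java sketch of LBG: a splitting step followed by Lloyd-style refinement under the nearest-neighbor and centroid conditions. The perturbation factor, the number of refinement iterations and the use of average distortion as the matching score are illustrative assumptions, not Konversa's exact parameters.

```java
// Minimal LBG vector-quantization sketch (class and method names are illustrative).
public class LbgVq {

    /** Train a codebook of the given size (a power of two) from MFCC frames. */
    public static double[][] train(double[][] frames, int codebookSize, double eps) {
        int dim = frames[0].length;
        double[][] codebook = { centroid(frames) };          // start from the global centroid
        while (codebook.length < codebookSize) {
            // Splitting step: perturb each code-vector into two.
            double[][] split = new double[codebook.length * 2][dim];
            for (int i = 0; i < codebook.length; i++) {
                for (int d = 0; d < dim; d++) {
                    split[2 * i][d]     = codebook[i][d] * (1 + eps);
                    split[2 * i + 1][d] = codebook[i][d] * (1 - eps);
                }
            }
            codebook = refine(frames, split, 20);             // Lloyd iterations
        }
        return codebook;
    }

    /** Refinement: assign frames to the nearest code-vector, then recompute centroids. */
    private static double[][] refine(double[][] frames, double[][] codebook, int iters) {
        int dim = frames[0].length;
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[codebook.length][dim];
            int[] count = new int[codebook.length];
            for (double[] f : frames) {
                int best = nearest(codebook, f);
                count[best]++;
                for (int d = 0; d < dim; d++) sum[best][d] += f[d];
            }
            for (int c = 0; c < codebook.length; c++) {
                if (count[c] == 0) continue;                  // keep old vector to avoid divide-by-zero
                for (int d = 0; d < dim; d++) codebook[c][d] = sum[c][d] / count[c];
            }
        }
        return codebook;
    }

    /** Average distortion of a clip's frames against a speaker's codebook. */
    public static double distortion(double[][] codebook, double[][] frames) {
        double total = 0;
        for (double[] f : frames) total += dist(codebook[nearest(codebook, f)], f);
        return total / frames.length;
    }

    private static int nearest(double[][] codebook, double[] v) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int i = 0; i < codebook.length; i++) {
            double d = dist(codebook[i], v);
            if (d < bestD) { bestD = d; best = i; }
        }
        return best;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    private static double[] centroid(double[][] frames) {
        double[] c = new double[frames[0].length];
        for (double[] f : frames) for (int d = 0; d < c.length; d++) c[d] += f[d];
        for (int d = 0; d < c.length; d++) c[d] /= frames.length;
        return c;
    }
}
```

One plausible decision rule, under these assumptions, is to label a 5s sub-sample 'yes' when its average distortion against Starget's codebook falls below a threshold (or below its distortion against the other speakers' codebooks).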
5. Modelling Conversations

In order to detect a conversation, we primarily look for the presence of the selected speaker Starget in the samples. We assume that in an active conversation, Starget speaks for at least 7 to 10s at a stretch. We split the initial 60s sample (<60s after removing silence) into sub-samples of 5s duration. If Starget speaks for most of a sub-sample (Fig 7a), classification accuracy is higher; if Starget's speech is split across two sub-samples, as in Fig 7b, the probability of a classification error is higher. After splitting the original sample into sub-samples, the features are extracted and classified. The output for each original sample is a list of 'yes' or 'no' labels based on the classification results of its sub-samples. We define the conversation probability Pconversation as the percentage of positives (sub-samples classified as 'yes') over the total number of sub-samples in the original sample.

We set a minimum and a maximum threshold on the conversation probability; in our experiments the minimum threshold was set to 20% and the maximum to 80%. The minimum threshold tolerates some false negatives (sub-samples wrongly classified as 'no'), which can arise from the boundary effect described at the beginning of this section or from the inherent inaccuracies of the classifier. The maximum threshold distinguishes a conversation from one-way speech, say a lecture, as it is highly unlikely that a person talks for most of the duration of a 60s conversation sample.

Fig 7. Different scenarios that can affect the classification accuracy of sub-samples
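The decision rule of section 5 reduces to counting positive sub-samples and checking the ratio against the two thresholds. A minimal sketch, assuming the 20% and 80% thresholds given above:

```java
// Decide whether a 60s clip contains a conversation involving Starget,
// given the per-sub-sample classifier outputs (true = 'yes').
public class ConversationDetector {

    private static final double MIN_THRESHOLD = 0.20;  // below this: too few positives
    private static final double MAX_THRESHOLD = 0.80;  // above this: likely one-way speech

    public static boolean isConversation(boolean[] subSampleLabels) {
        if (subSampleLabels.length == 0) return false;
        int positives = 0;
        for (boolean label : subSampleLabels) {
            if (label) positives++;
        }
        double pConversation = (double) positives / subSampleLabels.length;
        return pConversation >= MIN_THRESHOLD && pConversation <= MAX_THRESHOLD;
    }
}
```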
6. Results

The following are the results of classification using the VQ codebook, with a training set of 45s samples and 2 samples per speaker. Starget is the speaker we wish to identify in the conversation; S1 and S2 are the other speakers who participated in the training set. Each 45s sample was split into 5s sub-samples. The tables below show how the error rate on a 5s sub-sample changes as the distance from the microphone increases.

a) Starget speaks alone (false negative rate)

Distance (m)   False Negative (%)
0.1            0
0.2            0
0.3            0
0.4            10
0.5            15
0.6            35
0.7            40
0.8            45
0.9            55
1.0            60

Table 1

Fig 8. Increase in false negative % with distance

b) Starget speaks with S1 (false positive rate)

Distance (m)   False Positive (%)
0.1            0
0.2            0
0.3            0
0.4            20
0.5            30
0.6            35
0.7            40
0.8            45
0.9            50
1.0            55

Table 2

Fig 9. Increase in false positive % with distance

c) Starget speaks with S2 (false positive rate)

Distance (m)   False Positive (%)
0.1            0
0.2            0
0.3            5
0.4            15
0.5            25
0.6            35
0.7            45
0.8            55
0.9            60
1.0            65

Table 3

Fig 10. Increase in false positive % with distance

In general, the false positive rate in detecting conversations increases as the distance from the microphone increases.

7. Conclusion and Future Work

We developed a phone application that provides a convenient way for users to keep track of important information they come across in their daily conversations. Textual conversations such as short messages and emails can be logged easily; providing a framework to log audio clips based on contextual information, along with their spatio-temporal characteristics, requires robust classification algorithms and network connectivity. While we were able to provide a delay-tolerant communication framework, there are still improvements to be made in the classification algorithms. During the course of this work we evaluated several classification methods. Artificial neural networks performed poorly because of the improper selection of training data. GMMs, although claimed to work well with noiseless data, were given very limited data in our tests (3 speakers), which could have led to their poor performance. VQ performs well for a small number of speakers and handles noise better, which is why we chose it.

Apart from improving classification accuracy, future work includes tuning how often the microphone is turned on to sample audio. Since only a few conversations occur during the day, keeping the microphone on can severely drain the device's battery; choosing the right recording frequency involves a trade-off between resource constraints and missing whole or parts of important conversations. Other ideas include building a multi-user system in which data from multiple users is analyzed to determine relationships between them based on the spatio-temporal characteristics of their conversations; in such systems, however, user privacy is a major concern.

8. Acknowledgement

First and foremost, we would like to thank Prof. Gaurav Sukhatme for this opportunity and his constant encouragement, which helped us move forward and complete the project. We are also extremely grateful to Dr. Sameera Poduri and Karthik Dantu for their time and support throughout the course of this work. We also thank Prof. Fei Sha for enlightening us on speech recognition models.

9. References

1. On the Effectiveness of MFCCs and Their Statistical Distribution Properties in Speaker Identification.
2. Speech Recognition Using Neural Networks, Center for Spoken Language Understanding. http://speech.bme.ogi.edu/tutordemos/nnet_recog/recog.html
3. 3rd Generation Partnership Project. http://www.3gpp.org
4. Mel-Frequency Cepstral Coefficient. http://en.wikipedia.org/wiki/Mel_frequency_cepstral_coefficient
5. D. A. Reynolds and R. C. Rose. Robust Text-Independent Speaker Identification Using Gaussian Mixture Models. 1995.
6. M. Ben, M. Betser, F. Bimbot and G. Gravier. Speaker Diarization Using Bottom-Up Clustering Based on a Parameter-Derived Distance Between Adapted GMMs. 2004.
7. J. P. Campbell, Jr. Speaker Recognition. Book chapter.
8. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier and T. Merlin. A Tutorial on Text-Independent Speaker Verification.
9. http://www.data-compression.com/vq.shtml
10. H. B. Kekre and T. K. Sarode. Speech Data Compression Using Vector Quantization.
Konversa.docx - konversa.googlecode.com
Konversa.docx - konversa.googlecode.com
Konversa.docx - konversa.googlecode.com
Konversa.docx - konversa.googlecode.com
Konversa.docx - konversa.googlecode.com
Konversa.docx - konversa.googlecode.com
Konversa.docx - konversa.googlecode.com
Konversa.docx - konversa.googlecode.com

More Related Content

What's hot

Paper id 2720144
Paper id 2720144Paper id 2720144
Paper id 2720144IJRAT
 
Automatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approachAutomatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approachAbdullah al Mamun
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisRushin Shah
 
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...ijceronline
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognitionRichie
 
Speech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using VocoderSpeech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using VocoderIJTET Journal
 
multirate signal processing for speech
multirate signal processing for speechmultirate signal processing for speech
multirate signal processing for speechRudra Prasad Maiti
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Voice over internet protocol final
Voice over internet protocol finalVoice over internet protocol final
Voice over internet protocol finalYrasumalli Reddy
 
Speech Compression using LPC
Speech Compression using LPCSpeech Compression using LPC
Speech Compression using LPCDisha Modi
 
PSoC BASED SPEECH RECOGNITION SYSTEM
PSoC BASED SPEECH RECOGNITION SYSTEMPSoC BASED SPEECH RECOGNITION SYSTEM
PSoC BASED SPEECH RECOGNITION SYSTEMIJRES Journal
 
Analysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition TechniquesAnalysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition Techniquesidescitation
 
Audio Steganography Using Tone Insertion Technique
Audio Steganography Using Tone Insertion TechniqueAudio Steganography Using Tone Insertion Technique
Audio Steganography Using Tone Insertion TechniqueEditor IJCATR
 
Channel Coding and Clipping in OFDM for WiMAX using SDR
Channel Coding and Clipping in OFDM for WiMAX using SDRChannel Coding and Clipping in OFDM for WiMAX using SDR
Channel Coding and Clipping in OFDM for WiMAX using SDRidescitation
 

What's hot (20)

Paper id 2720144
Paper id 2720144Paper id 2720144
Paper id 2720144
 
Automatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approachAutomatic Speaker Recognition system using MFCC and VQ approach
Automatic Speaker Recognition system using MFCC and VQ approach
 
COLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech AnalysisCOLEA : A MATLAB Tool for Speech Analysis
COLEA : A MATLAB Tool for Speech Analysis
 
Linear Predictive Coding
Linear Predictive CodingLinear Predictive Coding
Linear Predictive Coding
 
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 
Speech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using VocoderSpeech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using Vocoder
 
multirate signal processing for speech
multirate signal processing for speechmultirate signal processing for speech
multirate signal processing for speech
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
Voice over internet protocol final
Voice over internet protocol finalVoice over internet protocol final
Voice over internet protocol final
 
SPEECH CODING
SPEECH CODINGSPEECH CODING
SPEECH CODING
 
Speech Compression using LPC
Speech Compression using LPCSpeech Compression using LPC
Speech Compression using LPC
 
PSoC BASED SPEECH RECOGNITION SYSTEM
PSoC BASED SPEECH RECOGNITION SYSTEMPSoC BASED SPEECH RECOGNITION SYSTEM
PSoC BASED SPEECH RECOGNITION SYSTEM
 
Analysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition TechniquesAnalysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition Techniques
 
User manual ramon
User manual  ramon User manual  ramon
User manual ramon
 
Audio Steganography Using Tone Insertion Technique
Audio Steganography Using Tone Insertion TechniqueAudio Steganography Using Tone Insertion Technique
Audio Steganography Using Tone Insertion Technique
 
Channel Coding and Clipping in OFDM for WiMAX using SDR
Channel Coding and Clipping in OFDM for WiMAX using SDRChannel Coding and Clipping in OFDM for WiMAX using SDR
Channel Coding and Clipping in OFDM for WiMAX using SDR
 
lpc and horn noise detection
lpc and horn noise detectionlpc and horn noise detection
lpc and horn noise detection
 
Speech coding std
Speech coding stdSpeech coding std
Speech coding std
 
[IJET-V1I6P21] Authors : Easwari.N , Ponmuthuramalingam.P
[IJET-V1I6P21] Authors : Easwari.N , Ponmuthuramalingam.P[IJET-V1I6P21] Authors : Easwari.N , Ponmuthuramalingam.P
[IJET-V1I6P21] Authors : Easwari.N , Ponmuthuramalingam.P
 

Viewers also liked

activelearning.ppt
activelearning.pptactivelearning.ppt
activelearning.pptbutest
 
GoOpen 2010: Lars Åge Kamfjord
GoOpen 2010: Lars Åge KamfjordGoOpen 2010: Lars Åge Kamfjord
GoOpen 2010: Lars Åge KamfjordFriprogsenteret
 
Community_Partners - PBworks: Online Collaboration
Community_Partners - PBworks: Online CollaborationCommunity_Partners - PBworks: Online Collaboration
Community_Partners - PBworks: Online Collaborationbutest
 
Lecture Note
Lecture NoteLecture Note
Lecture Notebutest
 

Viewers also liked (6)

activelearning.ppt
activelearning.pptactivelearning.ppt
activelearning.ppt
 
GoOpen 2010: Lars Åge Kamfjord
GoOpen 2010: Lars Åge KamfjordGoOpen 2010: Lars Åge Kamfjord
GoOpen 2010: Lars Åge Kamfjord
 
Andressa
AndressaAndressa
Andressa
 
Community_Partners - PBworks: Online Collaboration
Community_Partners - PBworks: Online CollaborationCommunity_Partners - PBworks: Online Collaboration
Community_Partners - PBworks: Online Collaboration
 
Lecture Note
Lecture NoteLecture Note
Lecture Note
 
HDsamplesLMC
HDsamplesLMCHDsamplesLMC
HDsamplesLMC
 

Similar to Konversa.docx - konversa.googlecode.com

Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...IOSR Journals
 
Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...IOSR Journals
 
Optimal Communication Of Real Time Data On Secure Cdma Ip...
Optimal Communication Of Real Time Data On Secure Cdma Ip...Optimal Communication Of Real Time Data On Secure Cdma Ip...
Optimal Communication Of Real Time Data On Secure Cdma Ip...Stefanie Yang
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognitionphyuhsan
 
A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...TELKOMNIKA JOURNAL
 
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...IJCSEA Journal
 
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLABA GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLABsipij
 
Mohammad Faisal Kairm(073714556) Assignment 2
Mohammad Faisal Kairm(073714556) Assignment 2Mohammad Faisal Kairm(073714556) Assignment 2
Mohammad Faisal Kairm(073714556) Assignment 2mashiur
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEIRJET Journal
 
Digital Watermarking Of Audio Signals.pptx
Digital Watermarking Of Audio Signals.pptxDigital Watermarking Of Audio Signals.pptx
Digital Watermarking Of Audio Signals.pptxAyushJaiswal781174
 
Utterance based speaker identification
Utterance based speaker identificationUtterance based speaker identification
Utterance based speaker identificationIJCSEA Journal
 
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and SynthesizerIRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and SynthesizerIRJET Journal
 
ETE405-lec8.pptx
ETE405-lec8.pptxETE405-lec8.pptx
ETE405-lec8.pptxmashiur
 
A robust audio watermarking in cepstrum domain composed of sample's relation ...
A robust audio watermarking in cepstrum domain composed of sample's relation ...A robust audio watermarking in cepstrum domain composed of sample's relation ...
A robust audio watermarking in cepstrum domain composed of sample's relation ...ijma
 
A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...
A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...
A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...ijma
 
Performance Analysis of VoIP by Communicating Two Systems
Performance Analysis of VoIP by Communicating Two Systems Performance Analysis of VoIP by Communicating Two Systems
Performance Analysis of VoIP by Communicating Two Systems IOSR Journals
 
Speech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law compandingSpeech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law compandingiosrjce
 

Similar to Konversa.docx - konversa.googlecode.com (20)

Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...
 
Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...Speech Recognized Automation System Using Speaker Identification through Wire...
Speech Recognized Automation System Using Speaker Identification through Wire...
 
Optimal Communication Of Real Time Data On Secure Cdma Ip...
Optimal Communication Of Real Time Data On Secure Cdma Ip...Optimal Communication Of Real Time Data On Secure Cdma Ip...
Optimal Communication Of Real Time Data On Secure Cdma Ip...
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognition
 
A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...A comparison of different support vector machine kernels for artificial speec...
A comparison of different support vector machine kernels for artificial speec...
 
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T...
 
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLABA GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
 
Mohammad Faisal Kairm(073714556) Assignment 2
Mohammad Faisal Kairm(073714556) Assignment 2Mohammad Faisal Kairm(073714556) Assignment 2
Mohammad Faisal Kairm(073714556) Assignment 2
 
H42045359
H42045359H42045359
H42045359
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
Digital Watermarking Of Audio Signals.pptx
Digital Watermarking Of Audio Signals.pptxDigital Watermarking Of Audio Signals.pptx
Digital Watermarking Of Audio Signals.pptx
 
Utterance based speaker identification
Utterance based speaker identificationUtterance based speaker identification
Utterance based speaker identification
 
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and SynthesizerIRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
 
ETE405-lec8.pptx
ETE405-lec8.pptxETE405-lec8.pptx
ETE405-lec8.pptx
 
A robust audio watermarking in cepstrum domain composed of sample's relation ...
A robust audio watermarking in cepstrum domain composed of sample's relation ...A robust audio watermarking in cepstrum domain composed of sample's relation ...
A robust audio watermarking in cepstrum domain composed of sample's relation ...
 
A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...
A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...
A Robust Audio Watermarking in Cepstrum Domain Composed of Sample's Relation ...
 
Performance Analysis of VoIP by Communicating Two Systems
Performance Analysis of VoIP by Communicating Two Systems Performance Analysis of VoIP by Communicating Two Systems
Performance Analysis of VoIP by Communicating Two Systems
 
Bq4301381388
Bq4301381388Bq4301381388
Bq4301381388
 
H010625862
H010625862H010625862
H010625862
 
Speech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law compandingSpeech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law companding
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Konversa.docx - konversa.googlecode.com

  • 1. Konversa : A Personal Audiolog<br />Guru Gopalakrishnan Krishnakanth Chimalamarri Midhun Achuthan<br />University of Southern California<br />{ggopalak,chimalam,achuthan}@usc.edu<br />Abstract<br />In recent years, the computing capabilility of mobile phone devices have increased drastically. Also, with the advent of 3G, these devices remain connected to the internet almost throughout the day over high bandwidth connections. We present konversa (Derived from conversa, the Portuguese translation for conversation) a system that logs a mobile phone user's non-phone conversations over the day. These conversations are time stamped and location stamped, so that the user can keep track of when and where he made the conversation.<br />1. Introduction<br />Mobile phones are unobtrusive devices that are carried by the user throughout the day. These devices have powerful microphones which allow average quality sound to be captured. The quality of sound may not always be the same in different environments. Outdoor conversations are often subject to high noise interference. Even indoor conversations may be interefered by music or crowd noise. There are also situations where the user may not be participating in a conversation, but is in close proximity of another conversation that his phone's microphone is able to capture the audio signals.<br />Konversa discards such conversations and by the end of the day presents the user with a time-and-location-stamped log of all the <br />conversations he has made during the day. This is done by opportunistically sending the recorded audio samples to a remote server on the internet. We run classification algorithms on the server to determine which of the samples contain relevant conversations. The server sends back the decision to the phone so that the phone can discard irrelevant samples. Konversa runs as a service on the Android G1 Phone and provides an interactive application to playback the audio clips based on time and location on the map. The backend server is a Linux machine that listens for requests from the phone. Konversa is written in Java and MATLAB.<br />2. Design and Implementation<br />2.1 Design Issues<br />We were faced with a number of challenges while designing konversa. Although the Android G1 phones are computationaly superior to most other phones in the market, they are not powerful enough to run classification algorithms. Also, in order to extract features, the input audio sample had to be filtered and processed. Doing this on the phone without interfering with the phones basic operations like calls and text messaging is not computationaly feasible given the phones limited memory allocation to user applications. To overcome this limitation, we had to setup a backend server to which the phone can opportunistically connect and upload samples of recordings. The server will then process these samples and send back the results to the phone.<br />2.2 System Specification<br />Konversa was deployed on the developer version of the Android G1 phone. The android platform supports applications written in Java that can be compiled and run on the Dalvik VM. The phone comes with various built-in sensors. For this experiment, we use the microphone to record audio and the GPS service to track the location of the phone. The backend server runs Ubuntu Linux 8 with JRE. 
Classification algorithms are written in MATLAB which runs on the server.<br />2.3 Architecture<br />2.3.1 Network Architecture<br />Konversa can communicate with the backend server via 802.11g Wi-Fi or 3G . It does so opportunistically. When there is no connectivity it caches all the recorded samples on the local phone memory. When there is a connection, a separate thread attempts to upload as many samples as it can, via SFTP. After processing, the server sends back a decision based on which the phone may save or discard the sample.<br /> <br /> <br />Fig 1. Konversa Network<br />2.3.2 System Architecture<br />Konversa runs on the phone as a service. It uses the phones mic to capture audio. The recorder thread wakes up at specified intervals, enables the mic and starts capturing. It saves the audio file in the tive 3GP format on the phone in a temperory cache memory along with its time-stamp and location-stamp. This contains unclassified and unprocessed audio files. The communications thread periodically picks an unprocessed file and sends it to the backend server via Secure FTP (SFTP). We chose SFTP because personal audio samples may contain private information and they need to be sent over a secure channel.<br /> <br /> <br /> Fig 2. System Architecture<br />On receiving a file, the server needs to do some pre-processing before the features can be extracted and classified. The following processing takes place at the server.<br />a. Conversion to raw format<br />Android (and most other phones) record audio in 3GP format specified by the Third Generation Partnership Project. This format stores video in MPEG-4 Part 2 or H.263 or MPEG-4 Part 10 (AVC/H.264), and audio streams as AMR-NB, AMR-WB, AMR-WB+, AAC-LC or HE-AAC. [3]. In order to apply signal processing techniques, the audio has to be decoded and uncompressed to its raw format. The server decodes the audio channel in AMR format to the raw WAV format which is used by various audio signal processing tools for further processing. The ffmpeg tool, an open source initiative, is used to decode the samples to its raw form.<br />b. Noise removal<br />Audio captured from the phone is often subject to noise from the environment. Konversa eliminates noise using a band pass filter. A low pass filter filters out noise about 3400Hz. A high pass filter fitlers out noise below 300Hz. The resulting sample has a frequency range between 300 and 3400 hz. Two different tools were tested - JSyn and Sound Exchange (SoX). Sox gave a better performance in terms of speed.<br />c. Silence removal<br />Most conversation samples contain silence for irregular periods of time. The presence of silence affects the features extracted from the audio sample which in turn affects the classification accuracy. After experimenting on various thresholds, we got the best results by trimming anything below 0.3% of the peak amplitude value of the sample. Removing silence also shortens the clip length by varying amounts. Sound exchange was used to remove the silence from the clips.<br />Fig 3. Comparison of waveforms after each step of filtering<br />Fig 4. Comparison of waveforms (a) Before removing silence and (b) After removing silence<br />d.Splitting up samples for conversation detection.<br />The recorded clips originally had a duration of 60s. In order to detect a conversation, these samples have to be split up further. We discovered that splitting them into 5s parts was reasonable to identify the speakers presence from the classifier. 
Since removing silence shortens the clip, we sometimes discard the last part if its less than 5s.<br />e. Extracting features<br />In order to classify audio, we need to extract features that are unique to a speaker. The most widely used features are MFCC( Mel-Frequency Cepstral Coefficients). These are coefficients that collectively make up an MFC which is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. They are derived from a type of cepstral representation of the audio clip (a nonlinear quot; spectrum-of-a-spectrumquot; )[4]. The features for each part are extracted using MATLAB. We use<br />f. Classification<br />The resulting feature vector is then classified using the VQ codebook that was initially trained over the user's voice. The algorithm is explained in detail in the next section.<br />The result is sent back to the phone. If it is quot; yesquot; , the communication thread will move the file into the classified files list, which can be accessed by the user through the GUI via the native SQLite database. Otherwise the file is removed from the cache.<br />3. Classification<br />Konversa has a fairly simple multi-threaded architecture, where separate threads handle the capturing and communication as explained in section 2.3. Most of the development work was devoted to identifying and implementing the right kind of classifier. After experimenting on a few different models, we had to chose one which gives the maximum accuracy with minimum implementation constraints that will seamlessly integrate with the underlying client-server architecture. In this section we explain a few models that we experimented on and why we choose the vector quantization model as the winner.<br />a. Artificial Neural Networks<br />Artificial Neural Networks implement a discrimination-based learning procedure. We used a 3-layer multiperceptron model. Training was based on backpropagation. 20 MFCC features were extracted for 20 windows of the sample, each of size 512. These features were input to 400 perceptrons on the input-layer. The hidden layer consists of 300 perceptrons and the output layer had a single perceptron which outputs the result of the classification. A similar model has been experimented previosly for text-dependent speech recognition in which case the number of output perceptrons correspond to the categories of phenomes to identify [2]. Although skeptical that this model would work well for text-independent classification, we decided to experiment and see what the results would like. The network was trained with samples from 4 different speakers. We will refer to the target speaker as Starget. The remaining participants will be referred to as S1,S2 and S3.<br /> <br />Fig 5. Structure of the neural network<br />For each of the 4 speakers, we collected 6 samples each of length 10s for creating the training set. The ideal output set for Starget was {1.0} and {0.0} for the remaining speakers. The network was trainined for 500 epochs each or until the error rate dropped to 0.001. After training for each of the sample, the network tried to classify the samples from the training set and a validation set which contained 4 samples each that did not belong to the training set. The results were not very impressive. 
We achieved a classification accuracy of around 60 - 70% even on the training set and 50% - 60% on the validation set, clearly indicating that the neural network model was not suitable in this scenario. Also from [8], their optimal structure has to be selected by trial-and-error procedures. The need to split the available train data in training and cross-validation sets, and the fact that the temporal structure of speech signals re-mains difficult to handle, makes it disadvantageous.<br />b. Gaussian Mixture Models<br />Mixture model is a probabilistic model for density estimation using a mixture distribution. A mixture model can be regarded as a type of unsupervised learning or clustering. Voice is considered to be a mixture of Gaussian components and hence the GMMs are known to perform well with speaker recognition. A mixture model consists of several Gaussian Components, each of these components has mean, variance and weight. These have to be initialized to certain value and then trained using EM. The initialization algorithm we used was k-means clustering, with 19 clusters, for MFCC vector of 20 dimensions. We used a library called COMIRVA as a starting point for our design. This library was optimized for music instruments and we had to tweak it, particularly add a floor value to the co-variances (of 0.01) [6] and tweak the MFCC frequency to voice frequency between 300 and 3000Hz. We had initially recorded a training sample on the phone for length of 90s[6] and trained our GMM using EM ( Expectation Maximization) over initialized values. Then we try to classify out input samples and calculate the log-likelihood and probability of that model and representing the given classification sample.<br />On each EM iteration, the following reestimation formulas are used which gaurantee a monotonic increase in the model's likelihood value [5]<br />From our experiments we decided not to go with GMM due its poor poerformance in presence of noise and also our limited training set. From [6,7] we find out that GMMs are usually trained over 16 speakers or more. We were testing with 3 speakers and this could one of the reason why GMM did not work well. Also due to Singualirities in the Matrix while training, we had to use a floor. This technique although standard in speech processing, might have removed the subtle differences in speakers in our limited training set. <br />c. Vector Quantization (LBG)<br />Vector quantization (VQ) is a lossy data compression method based on the principle of block coding. It is a fixed-to-fixed length algorithm. In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a training sequence, before this the VQ was considered tough due to multiple dimensional integrals. Using training sequences eliminates the need for these integrals. This is LBG VQ, which we used in our implementation.<br />A VQ is an approximation algorithm. 
c. Vector Quantization (LBG)

Vector quantization (VQ) is a lossy data-compression method based on the principle of block coding; it is a fixed-to-fixed-length algorithm. In 1980, Linde, Buzo and Gray (LBG) proposed a VQ design algorithm based on a training sequence; before this, VQ design was considered difficult because it required multi-dimensional integrals, which the use of a training sequence eliminates. This LBG VQ is what we used in our implementation.

VQ is an approximation algorithm. Similar to rounding-off, the VQ design problem is: "Given a vector source with its statistical properties known, given a distortion measure, and given the number of code-vectors, find a codebook and a partition which result in the smallest average distortion" [10]. Two conditions have to be satisfied by this algorithm:

 * Nearest neighbour condition: the encoding region of a code-vector should consist of all vectors that are closer to that code-vector than to any other code-vector.
 * Centroid condition: a code-vector should be the average of all training vectors that fall in its encoding region. At least one training vector per region is needed to avoid divide-by-zero problems.

LBG solves the design problem by applying these two conditions iteratively. Initially, the average of all training vectors is computed as a single code-vector; a splitting stage then doubles the number of code-vectors, and the two conditions are applied repeatedly until the algorithm terminates. We chose a codebook size of 16. Fig 6 shows an example of VQ clustering [9].

Fig 6. LBG-VQ clustering, from [9]

4. Training

Once the classifier was determined, we performed a systematic training process. The samples collected for training go through the entire pre-processing pipeline outlined in the architecture. Training was performed on 3 different speakers.

In the first round of training, 4 samples of length 10 s were collected for each speaker. Clips 1-4 were mapped to the target speaker; clips 5-12 correspond to the other 2 speakers. The results were good for classifying samples recorded close to the mouth, but there was some confusion when classifying samples recorded at a distance.

We then changed the training set. The sample length was increased to 45 s to model more realistic conversations; after silence removal these lengths were reduced by varying amounts. Each speaker talked for 45 s in two different positions: in the first, the phone was held close to the mouth; in the second, the phone was kept on a desk about 1.5 m away from the speaker. This training set gave better classification accuracy in both positions.

One major question was still unanswered: how would the model scale to participants who were not part of the training set? To investigate this, it was necessary to collect as many audio samples as possible, and we managed to increase the training set to 6 speakers. The results are provided in section 6.

5. Modelling conversations

To detect a conversation, we primarily look for the presence of the selected speaker Starget in the samples. We assume that in an active conversation Starget speaks for at least 7 to 10 s at a stretch. We split the initial 60 s sample (less than 60 s after removing silence) into sub-samples of 5 s duration. If Starget speaks during most of a sub-sample (Fig 7.a), the classification accuracy is high; if Starget's speech is split across two sub-samples, as in Fig 7.b, the probability of a classification error is higher. After splitting the original sample into sub-samples, the features are extracted and classified. The output for each original sample is a list of 'yes' or 'no' decisions based on the classification results of the sub-samples.

We define the conversation probability Pconversation as the percentage of positives (sub-samples classified as 'yes') over the total number of sub-samples in the given original sample.

We set a minimum and a maximum threshold for the conversation probability; in our experiments the minimum threshold was set to 20% and the maximum to 80%. The minimum threshold overcomes some of the false negatives (sub-samples the classifier wrongly labelled 'no'), which can arise from the splitting effect described at the beginning of this section or from the inherent inaccuracies of the classifier. The maximum threshold distinguishes a conversation from one-way speech, say a lecture, as it is highly unlikely that one person talks for most of the duration of a 60 s conversation sample. A sketch of this decision rule is shown below.
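The following is a minimal Java sketch of how the per-sample decision can be made from the sub-sample classifications, using the 20% and 80% thresholds described above. The class and method names are illustrative; in Konversa this logic runs on the server after the classifier has labelled each 5 s sub-sample.

```java
import java.util.List;

/** Illustrative conversation decision from per-sub-sample classifier outputs. */
public class ConversationDetectorSketch {

    static final double MIN_THRESHOLD = 0.20; // below this: Starget barely present, no conversation
    static final double MAX_THRESHOLD = 0.80; // above this: likely one-way speech such as a lecture

    /**
     * @param subSampleIsTarget one boolean per 5 s sub-sample, true if it was classified as Starget ('yes')
     * @return true if the original 60 s sample is accepted as a conversation
     */
    static boolean isConversation(List<Boolean> subSampleIsTarget) {
        if (subSampleIsTarget.isEmpty()) return false;
        long positives = subSampleIsTarget.stream().filter(b -> b).count();
        double pConversation = (double) positives / subSampleIsTarget.size();
        return pConversation >= MIN_THRESHOLD && pConversation <= MAX_THRESHOLD;
    }

    public static void main(String[] args) {
        // Example: Starget detected in 4 of 10 sub-samples -> 40%, accepted as a conversation.
        System.out.println(isConversation(List.of(true, false, true, false, false,
                                                  true, false, true, false, false)));
    }
}
```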
Fig 7. Different scenarios which can affect the classification accuracy of sub-samples

6. Results

The following are the results of classification using the VQ codebook on a training set of 45 s samples, with 2 samples per speaker. Starget is the speaker we wish to identify in the conversation; S1 and S2 are the other speakers who participated in the training set. Each 45 s sample was split into 5 s sub-samples. The following tables show how the error rate on a 5 s sub-sample changes as the distance from the microphone increases.

a) Starget speaks alone (false negative rate)

Distance (m)    False negative (%)
0.1             0
0.2             0
0.3             0
0.4             10
0.5             15
0.6             35
0.7             40
0.8             45
0.9             55
1.0             60

Table 1

Fig 8. Increase in false negative % with distance

b) Starget speaks with S1 (false positive rate)

Distance (m)    False positive (%)
0.1             0
0.2             0
0.3             0
0.4             20
0.5             30
0.6             35
0.7             40
0.8             45
0.9             50
1.0             55

Table 2

Fig 9. Increase in false positive % with distance

c) Starget speaks with S2 (false positive rate)

Distance (m)    False positive (%)
0.1             0
0.2             0
0.3             5
0.4             15
0.5             25
0.6             35
0.7             45
0.8             55
0.9             60
1.0             65

Table 3

Fig 10. Increase in false positive % with distance

In general, both the false negative and the false positive rates increase as the distance between the speaker and the microphone increases.

7. Conclusion and Future Work

We developed a phone application that provides a convenient way for users to keep track of important information they may come across during their daily conversations. Textual conversations such as short messages and emails can be logged easily; providing a framework to log audio clips based on contextual information along with their spatio-temporal characteristics requires robust classification algorithms and network connectivity. While we were able to provide a delay-tolerant communication framework, there are still improvements to be made in the classification algorithms. During the course of this work we evaluated several classification methods. Artificial neural networks performed poorly because of the improper selection of training data. GMMs, although claimed to work well with noiseless data, were tested with a very limited amount of data (3 speakers), which could have led to their bad performance. VQ performs well for a small number of speakers and also handles noise better, so we chose VQ.

Some of the future work, apart from improving the classification accuracy, includes tuning how frequently the microphone is switched on to record. Since not many conversations take place during the day, keeping the microphone on continuously can severely drain the device's battery. Choosing the right recording frequency is a challenge and involves a trade-off between resource constraints and the risk of missing all or part of an important conversation.
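Purely as an illustration of this trade-off, the following Java sketch shows one possible duty-cycling scheme for the recorder: wake up on a fixed schedule, capture one 60 s clip, then release the microphone. The interval and the recordClip placeholder are assumptions made for the sketch rather than parameters of Konversa; a longer interval saves battery but increases the chance of missing a conversation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative duty-cycled recorder: battery use versus missed conversations. */
public class DutyCycledRecorderSketch {

    static final long WAKE_INTERVAL_MINUTES = 5; // longer interval -> less battery drain, more misses
    static final long CLIP_LENGTH_SECONDS = 60;  // length of each recorded clip

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(DutyCycledRecorderSketch::recordClip,
                0, WAKE_INTERVAL_MINUTES, TimeUnit.MINUTES);
    }

    /** Placeholder for enabling the microphone and capturing one clip. */
    static void recordClip() {
        System.out.println("Recording a " + CLIP_LENGTH_SECONDS + " s clip ...");
        // On the phone this would enable the mic, save the clip to the cache with its
        // time stamp and location stamp, and then release the mic until the next wake-up.
    }
}
```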
Other ideas include building a multi-user system in which data from multiple users can be analyzed to determine relationships between them based on the spatio-temporal characteristics of their conversations. In such systems, however, user privacy is a major concern.

8. Acknowledgement

First and foremost, we would like to thank Prof. Gaurav Sukhatme for this opportunity and for his constant encouragement, which helped us move forward and complete the project. We are also extremely grateful to Dr. Sameera Poduri and Karthik Dantu for their time and support throughout the course of this work. We also thank Prof. Fei Sha for enlightening us on speech recognition models.

9. References

1. On the Effectiveness of MFCCs and Their Statistical Distribution Properties in Speaker Identification.
2. Speech Recognition Using Neural Networks, Center for Spoken Language Understanding. http://speech.bme.ogi.edu/tutordemos/nnet_recog/recog.html
3. 3rd Generation Partnership Project. http://www.3gpp.org
4. Mel Frequency Cepstral Coefficient. http://en.wikipedia.org/wiki/Mel_frequency_cepstral_coefficient
5. D. A. Reynolds and R. C. Rose. Robust Text-Independent Speaker Identification Using Gaussian Mixture Models. 1995.
6. M. Ben, M. Betser, F. Bimbot and G. Gravier. Speaker Diarization Using Bottom-Up Clustering Based on a Parameter-Derived Distance Between Adapted GMMs. 2004.
7. J. P. Campbell, Jr. Speaker Recognition. Book chapter.
8. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier and T. Merlin. A Tutorial on Text-Independent Speaker Verification.
9. Vector Quantization. http://www.data-compression.com/vq.shtml
10. H. B. Kekre and T. K. Sarode. Speech Data Compression Using Vector Quantization.