16. 1. Hypothesis
“Is it possible, with today’s known technology,
to automatically trigger a recording device
with a random word in a sentence
over a telephone line?”
23. 2. Historic Overview
• Early Days (1700 - 1900)
First artificial speech synthesizer
- late 1700’s: Russian professor Christian Kratzenstein
- Resonant tube attached to a pipe organ
Enhancement
- mid 1800’s: Charles Wheatstone
- Replaced the tubes with leather resonators
36. 2. Historic Overview
• Early Days (1700 - 1900)
1939 World’s Fair: VODER
Homer Dudley
- Based on the Wheatstone resonator
- Electrical & mechanical parts
40. 2. Historic Overview
• First Speech Recognizers (1950 - 1980)
- Digit Recognition System based on speech formants
vs.
- 10-syllable recognizer using Dynamic Time Warping
72. 4. Phonetics & Speech Perception
• Phonetics
- Studies the smallest units of human speech (phonemes)
- Originated in India around 2500 BC
- IPA (International Phonetic Alphabet)
- 44 phonemes in American English
91. 4. Phonetics & Speech Perception
• Speech Perception
- Variations in speech
Phonetic environment can alter the sound of a phoneme
[o] in “Bob” and [u] in “vulture”
Speed of speech
Fast → shorter vowels, less pronounced stops, poor articulation
Speaker identity
- Gender and age differences
- Vocal cord size and hormone levels
- Place of birth
104. 5. Telephone Speech Coding & Compression
• Early days: Analog
- Speech converted to a control voltage in the phone
- Passed through copper lines → crosstalk
• 1980’s - present day: Digital
- Main advantages: longer distances / greater speed / less carrier noise
- Use of optical fiber lines → no crosstalk
112. 5. Telephone Speech Coding & Compression
• Now: Mobile Phones
- GSM: Speech
- UMTS: Data
- Frequency content of 3100 Hz (the 300 Hz - 3400 Hz telephone band)
- Compressed full-rate (13 kbit/s) or half-rate (6.5 kbit/s) at an 8 kHz sample rate
• Technique: Linear Predictive Coding (LPC)
- Formants (resonances of the human vocal tract) are removed from the speech
- What is left ≈ a simple residual → digitized with a Fourier transform
- Formants are synthesized again in the receiver’s cellphone
- Of great interest for speech recognition
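The LPC idea above can be sketched in a few lines: fit predictor coefficients from the signal’s autocorrelation (Levinson-Durbin), then keep only the small residual that a coder would transmit. This is a minimal NumPy sketch on an invented synthetic AR(2) signal, not the actual GSM codec.

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                             # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k                         # remaining prediction error
    return a, err

# Demo on a synthetic AR(2) signal (invented for illustration):
# x[t] = 0.75*x[t-1] - 0.5*x[t-2] + noise, so A(z) has a1 = -0.75, a2 = 0.5
rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)
x = np.zeros(20000)
for t in range(2, 20000):
    x[t] = 0.75 * x[t - 1] - 0.5 * x[t - 2] + noise[t]
a, err = lpc(x, order=2)
residual = np.convolve(x, a)[: len(x)]  # the flattened signal a coder would transmit
```

The residual has much less variance than the speech-like signal itself, which is exactly why transmitting residual plus coefficients is cheaper than transmitting the waveform.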
127. 6. Speech Enhancement
• Pre-Filtering
- Frequency based
- Filter banks
- Commonly known as an equalizer
- Used adaptively to suppress unwanted frequencies
- Boosts the low end lost due to telephone coding
- Improves audibility
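A filter bank of this kind can be sketched as per-band gains applied in the frequency domain. The band edges and gains below are invented for illustration (boost a low band, cut everything above the telephone band); a real adaptive pre-filter would choose them from the signal.

```python
import numpy as np

def filterbank_eq(x, sr, bands, gains):
    """Crude frequency-domain filter bank: one gain per frequency band."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    for (lo, hi), g in zip(bands, gains):
        X[(freqs >= lo) & (freqs < hi)] *= g   # scale all bins in this band
    return np.fft.irfft(X, n=len(x))

# Demo: boost the low end lost to telephone coding, cut hiss above 3400 Hz
sr = 8000
t = np.arange(sr) / sr
x = (np.sin(2 * np.pi * 200 * t)
     + 0.5 * np.sin(2 * np.pi * 3000 * t)
     + 0.3 * np.sin(2 * np.pi * 3600 * t))
y = filterbank_eq(x, sr, bands=[(0, 300), (300, 3400), (3400, 4000)],
                  gains=[2.0, 1.0, 0.0])
```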
135. 6. Speech Enhancement
• Noise-Filtering
- Spectral Subtraction
- Simple and effective
- Uses the amplitude of the noise
- “Underwater” effect if overused
- Wiener Filtering
- Invented in the 1940’s by Norbert Wiener
- Uses the Fourier transform to detect noise
- Stationary (non-adaptive)
- Uses deconvolution to remove the noise
- Signal Subspace approach
- Represents noise and the original signal in “layers”
- Assigns vectors to high and low amplitudes
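Spectral subtraction, the simplest of the three, fits in a dozen lines: estimate the noise magnitude spectrum from a speech-free stretch, subtract it from each frame’s magnitude, floor at zero, and resynthesize with the noisy phase. A minimal non-overlapping sketch on invented synthetic data (a tone in white noise); real systems use overlapping windows.

```python
import numpy as np

def spectral_subtract(x, noise_ref, frame=256, alpha=1.0):
    """Magnitude spectral subtraction, frame by frame (no overlap, for brevity)."""
    noise_mag = np.abs(np.fft.rfft(noise_ref[:frame]))   # noise amplitude estimate
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame + 1, frame):
        X = np.fft.rfft(x[start:start + frame])
        # subtract the noise magnitude, floor at zero, keep the noisy phase
        mag = np.maximum(np.abs(X) - alpha * noise_mag, 0.0)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(X)), n=frame)
    return out

# Demo: a 500 Hz tone in white noise, plus a noise-only reference segment
rng = np.random.default_rng(1)
frame = 256
t = np.arange(4 * frame) / 8000.0
noisy = np.sin(2 * np.pi * 500.0 * t) + 0.3 * rng.standard_normal(len(t))
noise_ref = 0.3 * rng.standard_normal(frame)
cleaned = spectral_subtract(noisy, noise_ref, frame=frame)
```

Raising `alpha` subtracts more aggressively; push it too far and the flooring leaves isolated spectral peaks, which is exactly the “underwater” (musical noise) artifact named above.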
148. 6. Speech Enhancement
• Spectral Restoration
- Fixes dropouts in the signal
- Works on a small scale
- Adds filtered full-band noise in the gaps
- Listener perceives the signal as a whole
- Poor results with SREs
- Most SREs can fill the gap in a different way
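One way to read “filtered full-band noise in the gaps” is: shape white noise with the magnitude spectrum of the audio just before the dropout and splice it in. This is only a sketch of that interpretation, on invented data, not a specific published restoration algorithm.

```python
import numpy as np

def fill_dropout(x, start, end, seed=0):
    """Fill a dropout [start:end) with noise shaped like the audio before it."""
    n = end - start
    ref = x[max(0, start - n):start]            # the frame preceding the gap
    env = np.abs(np.fft.rfft(ref, n=n))         # its magnitude spectrum
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(len(env)))
    patch = np.fft.irfft(env * phase, n=n)      # noise with the same spectral shape
    out = x.copy()
    out[start:end] = patch
    return out

# Demo: a tone with a 200-sample dropout
t = np.arange(2048) / 8000.0
damaged = np.sin(2 * np.pi * 440.0 * t)
damaged[1000:1200] = 0.0
restored = fill_dropout(damaged, 1000, 1200)
```

A listener tends to perceive such a patch as continuous audio, but an SRE sees spectrally plausible yet phonetically meaningless content, which matches the “poor results with SREs” point above.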
158. 7. Speech Recognition Engine
• Dynamic Time Warping (DTW)
- Mostly used in the early days
- Fast & simple, but not accurate with complex speech
- Measures similarities in time and speed
- e.g. a video is played twice, once fast and once slow; a DTW-based algorithm will see that it is the same video
- Compares speech to a speech database
- Needs training most of the time
- Does not use phonemes
- Uses interval-based vectors
- Vector taken at the wrong time = bad representation
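The classic DTW recurrence is short enough to show whole: build a cumulative-cost matrix where each cell extends the cheapest of its three predecessors. The sequences below are toy stand-ins for the “same video at two speeds” example.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, diagonal match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A ramp, the same ramp played half as fast, and an unrelated constant sequence
fast = np.linspace(0.0, 1.0, 10)
slow = np.linspace(0.0, 1.0, 20)
other = np.ones(20)
```

`dtw_distance(fast, slow)` stays small despite the length mismatch, while `dtw_distance(fast, other)` does not, which is the whole point of warping the time axis.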
170. 7. Speech Recognition Engine
• Statistically Based Speech Recognition
Hidden Markov Models
- Heart and soul of statistically based SREs
- Allows use by people with different accents / dialects
- Markov Model: “predict” the future by knowing the current state
- Hidden Markov Model: infer the hidden current state from what can be observed
- The “future” = grammar file
- Statistically rules out possibilities as the word progresses
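The “rules out possibilities as the word progresses” step is the Viterbi algorithm: at each time step, keep only the most probable way of reaching each hidden state. A minimal sketch; the two states and all probabilities are invented for illustration (they are the standard textbook toy HMM, not a real acoustic model).

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a discrete observation sequence."""
    T, n_states = len(obs), len(start_p)
    V = np.zeros((T, n_states))                 # best log-probability per state
    back = np.zeros((T, n_states), dtype=int)   # where that best path came from
    V[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = V[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            V[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(V[-1]))]              # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state model with 3 possible observation symbols
start_p = np.array([0.6, 0.4])
trans_p = np.array([[0.7, 0.3],
                    [0.4, 0.6]])
emit_p = np.array([[0.5, 0.4, 0.1],
                   [0.1, 0.3, 0.6]])
path = viterbi([0, 1, 2], start_p, trans_p, emit_p)
```

In an SRE the hidden states would be phoneme (sub-)states and the observations acoustic feature vectors, but the pruning logic is the same.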
177. 7. Speech Recognition Engine
• Statistically Based Speech Recognition
Acoustic Model
- Gathers statistical information for the HMM
- Does this by analyzing a speech corpus (read or continuous)
- Different corpora (language, gender, frequency range)
- ISIP Switchboard corpus: 240 h of speech from 500 talkers, telephone quality
183. 7. Speech Recognition Engine
• Statistically Based Speech Recognition
Language Model
- Tries to predict the next word
- Uses a grammar file
- e.g. “Phone Steve Young; Phone Young; Phone Steve; Phone Young Steve”
- Multiple grammar files can be combined to predict entire sentences
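The grammar-file idea can be sketched directly with the example from the slide: given the words heard so far, the model returns which words the grammar allows next, and everything else is ruled out. The helper name is ours; real grammar formats (e.g. finite-state grammars) are richer than this semicolon-separated toy.

```python
def next_words(grammar, prefix):
    """Words the grammar allows next, given the part of the sentence heard so far."""
    sentences = [rule.strip().split() for rule in grammar.split(";")]
    heard = prefix.split()
    options = set()
    for s in sentences:
        # the rule must start with what was heard and still have a word left
        if s[: len(heard)] == heard and len(s) > len(heard):
            options.add(s[len(heard)])
    return sorted(options)

# The grammar file from the slide
grammar = "Phone Steve Young; Phone Young; Phone Steve; Phone Young Steve"
```

After hearing “Phone”, only “Steve” or “Young” remain possible; after “Phone Steve”, only “Young”. This is the statistical narrowing the HMM exploits.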
193. 8. Speech Analytics
- Separate engine
- Analyzes gender, age, identity and the topic discussed
• Audio Mining
- Analyzes the audio as soon as it enters the system
- Useful with background noise
- Matches the source to a speech database
e.g.: emotion detection in customer services
Music recognition software (“Shazam”, “Soundhound”)
206. 8. Speech Analytics
• Keyword Spotting
2 kinds:
Isolated word
- clearly enforced breaks
- non-spontaneous
- the user knows they are talking to an SRE
Unconstrained spotting
- continuous-speech KWS
- difficult due to speech segmentation
221. 9. Conclusion
“Is it possible, with today’s known technology,
to automatically trigger a recording device
with a random word in a sentence
over a telephone line?”
Answer:
YES
226. 9. Conclusion
- Keyword spotting algorithm based on a statistically based SRE
- Appropriate acoustic model
- ISIP Switchboard speech corpus: telephone-compressed source
- Grammar file? → Maybe, but it will be big
- Normal speech corpus? → A lot of pre-filtering / might not be successful
- LPC? → artifacts in the output due to 2x LPC filtering
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
- Complicated research\n\n- Start with core of research\n- Research question\n
\n
\n
\n
\n
- Ask opinion\n\n- Idea: CSI\n
\n
\n
\n
\n
\n
\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Early days: focus on Voice Reproduction\n\n- Produce Vowels\n- Different tube size/shape -> Differen vowel\n\n
- Reed: represents vocal folds\n- Resonator changed by hand: Different vowels\n
- Major breakthrough\n\n- Originally used for dictation\n\n- Aimed at office-market\n
- Major breakthrough\n\n- Originally used for dictation\n\n- Aimed at office-market\n
- Major breakthrough\n\n- Originally used for dictation\n\n- Aimed at office-market\n
- Major breakthrough\n\n- Originally used for dictation\n\n- Aimed at office-market\n
- Oscillator or Noise as source\n\n- Frequency -> controlled by foot pedal\n- Hands -> control band pass filters\n\n- Idea -> Importance of signal spectrum for speech representation\n
- Oscillator or Noise as source\n\n- Frequency -> controlled by foot pedal\n- Hands -> control band pass filters\n\n- Idea -> Importance of signal spectrum for speech representation\n
- Oscillator or Noise as source\n\n- Frequency -> controlled by foot pedal\n- Hands -> control band pass filters\n\n- Idea -> Importance of signal spectrum for speech representation\n
- Oscillator or Noise as source\n\n- Frequency -> controlled by foot pedal\n- Hands -> control band pass filters\n\n- Idea -> Importance of signal spectrum for speech representation\n
- Oscillator or Noise as source\n\n- Frequency -> controlled by foot pedal\n- Hands -> control band pass filters\n\n- Idea -> Importance of signal spectrum for speech representation\n
- Computer popularity\n\n- BELL: Digit: \nonly works separated\nused formant frequencies to detect and compare\n\n
- Computer popularity\n\n- BELL: Digit: \nonly works separated\nused formant frequencies to detect and compare\n\n
- Computer popularity\n\n- BELL: Digit: \nonly works separated\nused formant frequencies to detect and compare\n\n
- Computer popularity\n\n- BELL: Digit: \nonly works separated\nused formant frequencies to detect and compare\n\n
- Computer popularity\n\n- BELL: Digit: \nonly works separated\nused formant frequencies to detect and compare\n\n
- Computer popularity\n\n- BELL: Digit: \nonly works separated\nused formant frequencies to detect and compare\n\n
- Rise of statistically base SRE\n\n- Andrej Markov\n\n- CMU: Used HMM\n\n- DARPA: USA gov organization\n\n
- Rise of statistically base SRE\n\n- Andrej Markov\n\n- CMU: Used HMM\n\n- DARPA: USA gov organization\n\n
- Rise of statistically base SRE\n\n- Andrej Markov\n\n- CMU: Used HMM\n\n- DARPA: USA gov organization\n\n
- Rise of statistically base SRE\n\n- Andrej Markov\n\n- CMU: Used HMM\n\n- DARPA: USA gov organization\n\n
- Rise of statistically base SRE\n\n- Andrej Markov\n\n- CMU: Used HMM\n\n- DARPA: USA gov organization\n\n
- Rise of statistically base SRE\n\n- Andrej Markov\n\n- CMU: Used HMM\n\n- DARPA: USA gov organization\n\n
\n
\n
\n
\n
\n
\n
- Lungs: Pump air = fuel\n\n- Larynx: houses vocal folds\n- Vocal folds: no muscle\n- Comp to Reed in clarinet\n- Closed: speech\n- Open: breathing\n- Gap?: big/ small\n\n- Articulators: Shaping / Syllables\n
- Lungs: Pump air = fuel\n\n- Larynx: houses vocal folds\n- Vocal folds: no muscle\n- Comp to Reed in clarinet\n- Closed: speech\n- Open: breathing\n- Gap?: big/ small\n\n- Articulators: Shaping / Syllables\n
- Lungs: Pump air = fuel\n\n- Larynx: houses vocal folds\n- Vocal folds: no muscle\n- Comp to Reed in clarinet\n- Closed: speech\n- Open: breathing\n- Gap?: big/ small\n\n- Articulators: Shaping / Syllables\n
- Lungs: Pump air = fuel\n\n- Larynx: houses vocal folds\n- Vocal folds: no muscle\n- Comp to Reed in clarinet\n- Closed: speech\n- Open: breathing\n- Gap?: big/ small\n\n- Articulators: Shaping / Syllables\n
\n
\n
\n
\n
\n
\n
- Lowest level of speech which still form a contrast between sounds\n\n- Noted in an IPA\n\n- Used in the acoustic model of an SRE\n\n\n\n
- Lowest level of speech which still form a contrast between sounds\n\n- Noted in an IPA\n\n- Used in the acoustic model of an SRE\n\n\n\n
- Lowest level of speech which still form a contrast between sounds\n\n- Noted in an IPA\n\n- Used in the acoustic model of an SRE\n\n\n\n
- Lowest level of speech which still form a contrast between sounds\n\n- Noted in an IPA\n\n- Used in the acoustic model of an SRE\n\n\n\n
- Lowest level of speech which still form a contrast between sounds\n\n- Noted in an IPA\n\n- Used in the acoustic model of an SRE\n\n\n\n
- Lowest level of speech which still form a contrast between sounds\n\n- Noted in an IPA\n\n- Used in the acoustic model of an SRE\n\n\n\n
- AC: Defines start & end points of phonemes (VOT)\nStudy articulation\n\nUnaspirated: regular speech\nAspirated: my Macboo[k] is broken\nVoiced: [G]od, [B]ob\n\n- SS: difficult. \n[k] is pronounced different \n“How to recognize speech”\n“How to wreck a nice beach”\n\n- CP: different pronunciations\n“sheep” and “cheap”\n
- AC: Defines start & end points of phonemes (VOT)\nStudy articulation\n\nUnaspirated: regular speech\nAspirated: my Macboo[k] is broken\nVoiced: [G]od, [B]ob\n\n- SS: difficult. \n[k] is pronounced different \n“How to recognize speech”\n“How to wreck a nice beach”\n\n- CP: different pronunciations\n“sheep” and “cheap”\n
- AC: Defines start & end points of phonemes (VOT)\nStudy articulation\n\nUnaspirated: regular speech\nAspirated: my Macboo[k] is broken\nVoiced: [G]od, [B]ob\n\n- SS: difficult. \n[k] is pronounced different \n“How to recognize speech”\n“How to wreck a nice beach”\n\n- CP: different pronunciations\n“sheep” and “cheap”\n
- AC: Defines start & end points of phonemes (VOT)\nStudy articulation\n\nUnaspirated: regular speech\nAspirated: my Macboo[k] is broken\nVoiced: [G]od, [B]ob\n\n- SS: difficult. \n[k] is pronounced different \n“How to recognize speech”\n“How to wreck a nice beach”\n\n- CP: different pronunciations\n“sheep” and “cheap”\n
- AC: Defines start & end points of phonemes (VOT)\nStudy articulation\n\nUnaspirated: regular speech\nAspirated: my Macboo[k] is broken\nVoiced: [G]od, [B]ob\n\n- SS: difficult. \n[k] is pronounced different \n“How to recognize speech”\n“How to wreck a nice beach”\n\n- CP: different pronunciations\n“sheep” and “cheap”\n
- Psychological or phonetic

- Vocal cord size (VC size): female & child = small
  male = large = more air
- Fully analog
- Manual switchboards (human operators)
Cellphone: Dr. Martin Cooper

- GSM: Global System for Mobile Communications
- UMTS: Universal Mobile Telecommunications System

- LPC (linear predictive coding): removes the "buzz" (excitation) from the voice
  Signals are carried digitally
  The LPC data is used to resynthesize the voice in the receiving phone
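The LPC step can be sketched as follows: a short speech frame is modeled as an all-pole filter whose coefficients are found from the frame's autocorrelation via the Levinson-Durbin recursion, and a codec transmits those coefficients plus excitation parameters instead of raw samples. A minimal pure-Python sketch; the synthetic frame, 8 kHz rate, and order 10 are illustrative, not GSM's actual configuration.

```python
# Minimal LPC sketch: estimate all-pole predictor coefficients for one
# frame via autocorrelation + Levinson-Durbin. Illustrative only; real
# codecs (GSM, UMTS) add quantization, excitation coding, and framing.
import math

def autocorr(x, max_lag):
    # Autocorrelation r[0..max_lag] of the frame
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a[1..order], residual energy) so that the predictor is
    x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)               # prediction error shrinks
    return a[1:], e

# Synthetic voiced-like frame: decaying 200 Hz resonance, 8 kHz sampling
fs = 8000
frame = [math.exp(-i / 400.0) * math.sin(2 * math.pi * 200 * i / fs)
         for i in range(240)]
r = autocorr(frame, 10)
coeffs, err = levinson_durbin(r, 10)
```

Only `coeffs` (and a description of the excitation) would need to cross the channel; the receiving phone runs the inverse filter to resynthesize the frame.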
- FB: band-pass filter to the telephone mid-frequency range (≈300–3400 Hz)
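The band-limiting step can be sketched with a windowed-sinc FIR band-pass filter. The 300–3400 Hz band is the classic telephone passband; the tap count and the test tones below are illustrative.

```python
# Sketch of the "FB" step: band-pass a signal to the telephone band
# (~300-3400 Hz) with a windowed-sinc FIR filter. Pure Python, 8 kHz
# sampling assumed; tap count is illustrative.
import math

def bandpass_fir(lo_hz, hi_hz, fs, num_taps=101):
    """Band-pass = low-pass at hi_hz minus low-pass at lo_hz."""
    def lowpass(fc):
        m = num_taps - 1
        h = []
        for n in range(num_taps):
            x = n - m / 2
            # windowed-sinc kernel (Hamming window)
            val = 2 * fc / fs if x == 0 else \
                math.sin(2 * math.pi * fc * x / fs) / (math.pi * x)
            val *= 0.54 - 0.46 * math.cos(2 * math.pi * n / m)
            h.append(val)
        return h
    return [a - b for a, b in zip(lowpass(hi_hz), lowpass(lo_hz))]

def convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 8000
h = bandpass_fir(300, 3400, fs)
t = [i / fs for i in range(2000)]
low_tone = [math.sin(2 * math.pi * 100 * ti) for ti in t]    # outside band
mid_tone = [math.sin(2 * math.pi * 1000 * ti) for ti in t]   # inside band
r_low, r_mid = rms(convolve(low_tone, h)), rms(convolve(mid_tone, h))
```

A 100 Hz tone falls outside the passband and is strongly attenuated, while a 1000 Hz tone passes nearly unchanged, which mimics what the telephone channel itself does to speech.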
After band-limiting -> noise filtering

- 3 methods:

- SS (spectral subtraction): used in Ozone
- Wiener filter: linear, non-adaptive
- Signal subspace: splits the signal into layers (subspaces)
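Of the three, spectral subtraction is the easiest to sketch: subtract an estimated noise magnitude spectrum from the noisy spectrum per frequency bin, flooring the result so magnitudes never go negative. The bin values, `alpha`, and `floor` below are illustrative.

```python
# Minimal spectral-subtraction sketch: per-bin magnitude subtraction
# with a spectral floor. Spectra are plain lists of bin magnitudes
# (in practice these come from an FFT of a windowed frame).
def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, floor=0.02):
    out = []
    for y, n in zip(noisy_mag, noise_mag):
        s = y - alpha * n              # subtract noise estimate
        out.append(max(s, floor * y))  # floor avoids negative magnitudes
    return out

# Illustrative spectra: bin 2 carries signal, the rest is mostly noise
noisy = [1.0, 0.5, 3.0, 0.4]
noise = [0.4, 0.45, 0.5, 0.45]
clean = spectral_subtract(noisy, noise)
```

The floor parameter trades residual noise against "musical noise" artifacts; a Wiener filter instead scales each bin by an estimated signal-to-noise gain, and subspace methods project the frame onto a low-rank signal subspace.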