SPEECH PROCESSING PREREQUISITES

Organization: Speech Processing
Prerequisites
Introduction
Speech Production
Representation of Speech Signals
Speech Processing
Govind
Center for Computational Engineering & Networking
Amrita Vishwa Vidyapeetham

Outline
Introduction
Human Speech Production and Perception Systems
Representation of Speech in the Time and Frequency Domains
Speech Sounds and Features
Signal Processing Methods for Estimating Speech Features
Speech Processing Applications
Speech Recognition
Speech Synthesis

Prerequisites: S&S, DSP & ADSP
Prior Knowledge Required:
Signals and Systems
Digital Signal Processing
Advanced DSP

Signals and Systems
Classification of Signals
LTI systems
Correlation/Convolution Operations
Fourier Representation: FS, DTFS, DTFT, DFT, FFT, Z-transform
Concepts of Impulse Response, Frequency Response, etc.
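As a refresher on the LTI concepts listed above, the following minimal sketch (plain Python, not part of the original slides) computes the output of an LTI system by convolving an input with an impulse response, and evaluates the frequency response of the same system. The two-tap averaging filter `h` and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def convolve(x, h):
    """Linear convolution: y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def frequency_response(h, omega):
    """DTFT of a finite impulse response: H(e^{jw}) = sum_n h[n] e^{-jwn}."""
    return sum(hn * cmath.exp(-1j * omega * n) for n, hn in enumerate(h))


# Hypothetical two-tap moving-average filter (an assumption for illustration).
h = [0.5, 0.5]

# Feeding a unit impulse into the system returns the impulse response itself.
impulse_out = convolve([1.0, 0.0, 0.0], h)

# The averager passes DC (w = 0) and blocks the highest frequency (w = pi).
gain_dc = abs(frequency_response(h, 0.0))
gain_nyquist = abs(frequency_response(h, math.pi))
```

Here `impulse_out` starts with the taps of `h`, `gain_dc` is 1 and `gain_nyquist` is numerically zero, which is the low-pass behaviour expected of an averager.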

Digital Signal Processing
Sampling: Nyquist, Aliasing
FFT implementation of DFT
Design of FIR and IIR filters
Structures for realization of Filters
Multirate signal processing: Filter banks
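To make the sampling item concrete, here is a small sketch of aliasing, assuming an 8 kHz sampling rate and a 7 kHz tone (both values are illustrative, not taken from the slides). A sinusoid above the Nyquist frequency fs/2 produces exactly the same samples as a lower-frequency sinusoid, up to a sign.

```python
import math


def alias_frequency(f_hz, fs_hz):
    """Apparent frequency (0..fs/2) of a sinusoid at f_hz after sampling at fs_hz."""
    return abs(f_hz - fs_hz * round(f_hz / fs_hz))


fs = 8000.0       # assumed sampling rate, chosen only for illustration
f_high = 7000.0   # above the Nyquist frequency fs/2 = 4000 Hz
f_alias = alias_frequency(f_high, fs)

# The sampled 7 kHz tone is sample-for-sample a sign-flipped 1 kHz tone.
n_samples = 32
high = [math.sin(2 * math.pi * f_high * n / fs) for n in range(n_samples)]
low = [-math.sin(2 * math.pi * f_alias * n / fs) for n in range(n_samples)]
max_diff = max(abs(a - b) for a, b in zip(high, low))
```

`max_diff` is at the level of floating-point rounding, so the two tones cannot be told apart once sampled. This is why an anti-aliasing filter is normally applied before sampling.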

Advanced DSP
Time-Frequency Analysis (TFA)
TFA by STFT
TFA by Wigner Distributions
TFA by Wavelets
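A minimal sketch of time-frequency analysis by STFT, assuming a Hamming window, a 64-sample frame and non-overlapping frames (all of these are illustrative choices, not prescribed by the slides). The test signal is a hypothetical tone that jumps from 500 Hz to 1500 Hz halfway through; the STFT localizes each frequency in time, which a single DFT over the whole signal would not do.

```python
import cmath
import math


def stft_magnitude(x, frame_len, hop):
    """Magnitude STFT: Hamming-windowed frames, one-sided DFT per frame."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


fs = 8000  # assumed sampling rate in Hz
# Synthetic signal: 500 Hz for the first 256 samples, 1500 Hz afterwards.
x = [math.sin(2 * math.pi * (500 if n < 256 else 1500) * n / fs)
     for n in range(512)]

frames = stft_magnitude(x, frame_len=64, hop=64)
# Frequency of the strongest bin in each frame (bin spacing is fs / frame_len).
peak_hz = [max(range(len(s)), key=s.__getitem__) * fs / 64 for s in frames]
```

The direct DFT used here is only practical for short frames; a real implementation would use an FFT.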

References
L. Rabiner, Biing-Hwang Juang and B. Yegnanarayana, "Fundamentals of Speech Recognition", Pearson Education Inc., 2009
Douglas O'Shaughnessy, "Speech Communication", University Press, 2001
Thomas F. Quatieri, "Discrete-Time Speech Signal Processing", Pearson Education Inc., 2004

Introduction
Information in Speech
Message
Language
Accent
Speaker
Emotions/Stress
Applications
Recognition
Speech Recognition
Speaker Recognition/Verification
Emotion Recognition, etc.
Synthesis
Text to Speech Synthesis
Speech Enhancement
Voice Conversion

Applications: Recognition
Speech | Objective | Information Extracted
(speech signal) | Message | "Author of the danger..."
(speech signal) | Speaker | "It's Govind speaking"
(speech signal) | Speaker claim has to be verified | "Hi Govind, your claim is accepted"

Applications: Synthesis
Columns: Input, Objective, Output
Text To Speech Synthesis
Input: text ("Epochs Occur..."). Objective: synthesize the text as speech.
Speech Enhancement
Objectives: remove noise; remove reverberation; enhance the desired speaker's speech.
Voice Conversion
Objective: convert the source speaker's speech to the target speaker's speech.

What Makes Automatic Processing of Speech Complicated?
It is an inter-disciplinary area:
1 Signal Processing: the process of extracting relevant information from the speech signal.
2 Physics: the science of understanding the relationship between the physical speech signal and the physiological mechanisms that produced it.
3 Pattern Recognition: grouping or classifying patterns of various events in speech.
4 Communication and Information Theory: efficient ways of encoding and decoding the parameters of speech, and efficient search for patterns of interest in speech (dynamic programming, Viterbi search, stack algorithms, etc.).
5 Linguistics: the relationship between sounds (phonology), the syntax and semantics of a language, and the sense derived from meaning (pragmatics).
6 Computer Science: the study of different algorithms for implementation in software or hardware.
7 Psychology: understanding the psychological state of the speaker or listener helps in tasks such as emotion analysis.
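The list above mentions dynamic programming as a tool for searching for patterns in speech. One classic instance is dynamic time warping (DTW), which aligns two sequences spoken at different rates. The sketch below is a generic textbook form of DTW, not a method taken from the slides. The sequences are hypothetical one-dimensional feature tracks and the absolute difference is an assumed local distance.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences (absolute difference)."""
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]


# Hypothetical feature tracks: 'stretched' is a time-stretched copy of 'template'.
template = [1.0, 2.0, 3.0]
stretched = [1.0, 2.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]

same_pattern_cost = dtw_distance(template, stretched)
different_pattern_cost = dtw_distance(template, shifted)
```

The time-stretched copy aligns at zero cost, while the sequence with different values does not, which is the property that makes DTW useful for comparing utterances of different durations.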

Speaker-Listener Schematic Diagram in Speech Communication
Figure: Schematic diagram of speech communication (courtesy: Rabiner et al.)

Production-Perception Block Diagram
Speech production: Message Formulation -> (text) -> Language Code -> (phonemes, prosody) -> Neuro-Muscular Controls -> (articulatory motion) -> Vocal Tract System -> Acoustic Waveform
Transmission Channel
Speech perception: Acoustic Waveform -> Basilar Membrane Motion -> (spectrum analysis) -> Neural Transduction -> (feature extraction, coding) -> Language Translation -> (phonemes, words, sentences) -> Message Understanding -> (semantics)
Figure: Speech production block diagram (courtesy: Rabiner et al.)

Speech Production
Figure: Speech production mechanism (courtesy: Thomas F. Quatieri, "Discrete-Time Speech Signal Processing", Chapter 3, p. 58, Pearson Edu., Delhi)

Mechanical Equivalent of Speech Production System
Figure: Speech production mechanism (courtesy: Rabiner et al.)

Spectro-Temporal Representation
Classification of Phonemes
Representation of Speech Signal
Figure: Speech signal in the time domain (amplitude -1 to 1, horizontal axis 0 to 2.5)

Glottal Air Flow During Speech Production
Figure: Glottal air flow (courtesy: Rabiner et al.)

Glottal Air Flow: Graphical Illustration
Figure: Two panels plotted as amplitude against time (samples), over roughly 1.3 x 10^4 to 1.55 x 10^4 samples: the speech waveform, and the glottal flow measured by EGG. Annotations in the figure: Speech, EGG, Glottis Vibration.
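The glottal vibration illustrated above shows up in voiced speech as a repeating waveform period. A minimal sketch of how that period might be estimated from the speech waveform alone, using the autocorrelation method (a standard technique swapped in here, not one named on the slides), is given below. The synthetic two-harmonic signal, the 60 to 400 Hz search range and the function name are all assumptions.

```python
import math


def estimate_pitch_hz(x, fs, n_terms, f0_min=60.0, f0_max=400.0):
    """Pick the autocorrelation peak within the lag range of plausible pitch."""
    min_lag = int(fs / f0_max)
    max_lag = int(fs / f0_min)
    assert len(x) >= n_terms + max_lag, "signal too short for the lag range"
    best_lag, best_r = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag, summed over a fixed number of terms.
        r = sum(x[n] * x[n + lag] for n in range(n_terms))
        if r > best_r:
            best_lag, best_r = lag, r
    return fs / best_lag


fs = 8000    # assumed sampling rate in Hz
f0 = 100.0   # fundamental frequency of the synthetic "voiced" signal
x = [math.sin(2 * math.pi * f0 * n / fs)
     + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
     for n in range(480)]

pitch = estimate_pitch_hz(x, fs, n_terms=320)
```

On this clean synthetic signal the estimate lands on the 100 Hz fundamental. Real speech is only quasi-periodic, so practical pitch trackers add pre-processing and smoothing on top of this idea.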

Classification of Speech Sounds
Silence (S): No speech is produced
Unvoiced (U): Vocal folds are not vibrating
Voiced (V): Periodic vibration of the vocal folds
Figure: Speech signal in the time domain (amplitude -1 to 1, horizontal axis 0 to 2.5), with segments labelled U, S and V

Classification of Speech Sounds
Separation of voiced sounds from unvoiced sounds and silence is known as voiced-non-voiced detection
Issues in voiced-non-voiced detection:
Difficult to distinguish weak unvoiced sounds from silence
Difficult to distinguish weakly periodic voiced sounds from unvoiced sounds
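One simple way to attempt this three-way decision is to combine short-time energy with the zero-crossing rate. The sketch below is a generic illustration rather than the method used in the slides. The thresholds `energy_floor` and `zcr_split`, the 30 ms frame and the synthetic frames are assumptions. The last frame shows the first issue listed above: a weak unvoiced sound falls under the energy threshold and is mistaken for silence.

```python
import math
import random


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)


def classify_frame(frame, energy_floor=1e-4, zcr_split=0.25):
    """Return 'S' (silence), 'U' (unvoiced) or 'V' (voiced); thresholds are assumed."""
    if short_time_energy(frame) < energy_floor:
        return "S"
    return "U" if zero_crossing_rate(frame) > zcr_split else "V"


fs = 8000   # assumed sampling rate in Hz
n = 240     # 30 ms frame
rng = random.Random(0)  # fixed seed so the noise frame is reproducible

silence = [0.0] * n
unvoiced = [rng.uniform(-0.1, 0.1) for _ in range(n)]                    # noise-like
voiced = [0.8 * math.sin(2 * math.pi * 120 * t / fs) for t in range(n)]  # periodic
weak_unvoiced = [0.05 * s for s in unvoiced]                             # very quiet noise

labels = [classify_frame(f) for f in (silence, unvoiced, voiced)]
weak_label = classify_frame(weak_unvoiced)
```

Fixed thresholds like these are fragile on real recordings, which is why the difficulties above matter in practice.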

Spectrograms: Narrow-band & Wide-band
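The figure for this slide is not reproduced here. As the terms are commonly used, a wide-band spectrogram is computed with a short analysis window and a narrow-band spectrogram with a long one. The sketch below uses the rule of thumb that a window of duration T gives a frequency resolution on the order of 1/T. The 5 ms and 30 ms durations and the 100 Hz pitch are illustrative assumptions.

```python
def analysis_bandwidth_hz(window_ms):
    """Rule-of-thumb spectral resolution (about 1/T) for a window of T milliseconds."""
    return 1000.0 / window_ms


def resolves_harmonics(window_ms, f0_hz):
    """True when adjacent pitch harmonics (spaced f0 apart) can be separated."""
    return analysis_bandwidth_hz(window_ms) < f0_hz


f0 = 100.0  # assumed pitch of the speaker in Hz

wide_band = resolves_harmonics(5.0, f0)      # short window: harmonics smeared together
narrow_band = resolves_harmonics(30.0, f0)   # long window: harmonics separated
```

Under these assumptions the long window separates individual pitch harmonics, while the short window trades that away for finer time resolution.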

Spectral Envelope from a Long Segment of Speech
Figure: Magnitude (0 to 40) plotted against frame index (0 to 30) and frequency (0 to 4000 Hz)

Classification of sound units
Phonemes
Vowels (Front, Mid, Back): i (eve), I (it), e (hate), (met), U (book), u (boot), Λ (up), a (father), o (obey), c (all)
Diphthongs: ay (buy), aw (down), ey (bait), O (boy)
Affricates: tz (sports), jh (judge), ch (church)
Semi-Vowels
Liquids: l (large), r (run)
Glides: w (wit), y (you)
Consonants
Nasals: m (met), n (net), ng (sing)
Whispers: h (he)
Plosives, voiced: b (ball), d (debt), g (get)
Plosives, unvoiced: k (kit), p (pen), t (ten)
Fricatives, voiced: v (vat), dh (that), z (zoo)
Fricatives, unvoiced: f (fun), th (thing), s (sat), sh (should)

Representation of sound units in speech
Sounds are classified into vowels and consonants
Vowels: produced by exciting a fixed vocal tract shape with quasi-periodic glottal pulses
Vowels are classified into front, mid and back based on the position of the tongue hump
Front vowels: /i/ ("eve"), /I/ ("it"), /æ/ ("at"), /e/ ("hate")
Mid vowels: /a/ ("father"), /Λ/ ("up")
Back vowels: /U/ ("foot"), /u/ ("boot"), /o/ ("obey")
Another classification is based on the length of vowels: long and short
Diphthongs: combination of two vowels
/ay/ as in "buy", /aw/ as in "down", /ey/ as in "bait", /o/ as in "boat", /cy/ as in "boy", etc.

Front Vowel
Figure: Speech signal (amplitude -1 to 1) and spectrogram (frequency axis marked from 1000 to 7000) for the front vowels I ("it"), e ("hate") and i ("eve")

Vowel Analysis
Front vowels are found to show a high-frequency resonance
Front vowels are distinguished from one another by the tongue height during vowel production
Mid vowels are found to show a well-separated and balanced distribution of resonant frequencies
Back vowels show almost no energy beyond the low-frequency region
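One rough way to check observations like these on a speech frame is to measure the fraction of spectral energy above a cutoff frequency. The sketch below uses two synthetic frames as stand-ins: a "front-like" frame with one low and one high resonance, and a "back-like" frame with only low-frequency components. The tone frequencies, the 1500 Hz cutoff and the function names are illustrative assumptions, not measured formant values.

```python
import cmath
import math


def dft_power(x):
    """One-sided power spectrum computed with a direct DFT."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def high_band_ratio(x, fs, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz."""
    power = dft_power(x)
    total = sum(power)
    high = sum(p for k, p in enumerate(power) if k * fs / len(x) >= cutoff_hz)
    return high / total if total > 0 else 0.0


fs = 8000   # assumed sampling rate in Hz
n = 160     # 20 ms frame, 50 Hz bin spacing


def two_tone(f1, f2):
    """Synthetic frame with two equal-amplitude sinusoids (a crude stand-in for a vowel)."""
    return [math.sin(2 * math.pi * f1 * t / fs) + math.sin(2 * math.pi * f2 * t / fs)
            for t in range(n)]


front_like = two_tone(300, 2300)   # low plus high resonance
back_like = two_tone(300, 900)     # low-frequency components only

front_ratio = high_band_ratio(front_like, fs, cutoff_hz=1500)
back_ratio = high_band_ratio(back_like, fs, cutoff_hz=1500)
```

With these stand-ins, half of the energy of the front-like frame lies above the cutoff and essentially none of the back-like frame's does. Real vowels have broader resonances, so the contrast would be less extreme.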

Diphthongs

Semivowels
Group of sounds consisting of /w/, /r/, /l/ and /y/
Difficult to characterize because they are vowel-like in nature
Characterized by a gliding transition in the vocal tract area function between adjacent phonemes
Best described as transitional, vowel-like sounds

Nasal Consonants
Group of sounds consisting of /m/, /n/ and /ŋ/
Produced with glottal excitation and the vocal tract totally constricted at some point along the oral passageway
The velum is lowered so that air flows through the nasal cavity instead of the oral cavity
Due to the acoustic coupling of the oral cavity to the pharynx, anti-resonances are created
/m/, /n/ and /ŋ/ are produced by constriction at the lips, behind the teeth and at the velum, respectively.
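In signal-processing terms, an anti-resonance is often idealized as a pair of complex-conjugate zeros in the transfer function, which carves a dip into the spectrum near the zero frequency. The sketch below evaluates such a zero pair. The 1000 Hz zero frequency, the radius 0.95 and the 8 kHz sampling rate are assumed values for illustration only and do not describe any particular nasal.

```python
import cmath
import math


def zero_pair_gain(f_hz, f_zero_hz, radius, fs_hz):
    """|H| at f_hz for H(z) = 1 - 2 r cos(theta) z^-1 + r^2 z^-2 (a conjugate zero pair)."""
    theta = 2 * math.pi * f_zero_hz / fs_hz
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs_hz)
    return abs(1 - 2 * radius * math.cos(theta) * z_inv + radius ** 2 * z_inv ** 2)


fs = 8000.0       # assumed sampling rate in Hz
f_zero = 1000.0   # hypothetical anti-resonance frequency
r = 0.95          # zero radius; closer to 1 gives a deeper, narrower dip

gain_at_zero = zero_pair_gain(f_zero, f_zero, r, fs)
gain_away = zero_pair_gain(2000.0, f_zero, r, fs)
```

The gain at the zero frequency is far below the gain one kilohertz away, which is the spectral dip an anti-resonance produces.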

Nasalized Vowels

Unvoiced Fricatives
Produced by exciting the vocal tract with a turbulent airflow through a narrow constriction
/f/ ("four"), /θ/ ("thing"), /s/ ("sat") and /sh/ ("shut") form the class of unvoiced fricatives
/f/: Constriction at the teeth
/s/: Constriction near the middle of the oral cavity
/sh/: Constriction at the end of the oral tract

Voiced Fricatives
/v/ ("vat"), /ð/ ("that"), /z/ ("zoo") and /zh/ ("azure") form the class of voiced fricatives
/v/: Constriction at the teeth
/z/: Constriction near the middle of the oral cavity
/zh/: Constriction at the end of the oral tract
Apart from the glottal vibration, the place of articulation is the same as for the corresponding unvoiced fricatives
