Part1 speech basics
1. Unit 6 Speech Signal
DR MINAKSHI PRADEEP ATRE
PVG’S COET & GKPIM PUNE
2. References
Book: Speech and Audio Processing by Dr Shaila Apte madam
Pdf document: http://cs.haifa.ac.il/~nimrod/Compression/Speech/S1Basics2010.pdf
For speech samples:
https://www.signalogic.com/index.pl?page=speech_codec_wav_samples
3. Contents
Speech:
1. Basics of speech signal and its features
2. LTI representation of speech signal
3. LTV representation of speech signal
4. Estimation of fundamental frequency
5. Identification of voiced and unvoiced speech
6. Noise removal
4. Speech
Speech signal is generated naturally
Being naturally occurring, it is random in nature
Necessary to understand the generalized human speech production
Simple linear time invariant (LTI) model for speech production
Inherently time varying nature of speech
Introduction to linear time variant (LTV) model of speech
Speech type: consonants, fricatives
Voiced and unvoiced (V/UV) speech
6. Vocal Tract
Vocal tract is the cavity between the vocal cords and the
lips, and acts as a resonator that spectrally shapes the
periodic input, much like the cavity of a musical wind
instrument.
Simple model of a steady-state vowel regards the vocal
tract as a linear time-invariant (LTI) filter with a periodic
impulse-like input.
7. What is Speech signal?
Created at the vocal cords, travels through the vocal tract, and
is produced at the speaker's mouth
Reaches the listener's ear as a pressure wave
Non-stationary, but can be divided into sound segments that share
some common acoustic properties over a short time interval
Two major classes of phonemes: vowels and consonants
8. Phonemes
The basic sounds of a language (e.g. "a" in the word "father") are
called phonemes
A typical speech utterance consists of a string of vowel and
consonant phonemes whose temporal and spectral characteristics
change with time
In addition, the time-varying source and system can also
nonlinearly interact in a complex way: our simple model is correct for
a steady vowel, but the sounds of speech are not always well
represented by linear time-invariant systems
9. Vowel Production
In vowel production, air is forced from the lungs by contraction of
the muscles around the lung cavity
Air flows through the vocal cords, which are two masses of flesh,
causing periodic vibration of the cords whose rate gives the pitch of
the sound
Resulting periodic puffs of air act as an excitation input, or source,
to the vocal tract
11. Speech Production
A sound source excites a (vocal tract) filter
◦ Voiced: Periodic source, created by vocal cords
◦ Unvoiced: Aperiodic and noisy source
Pitch is the fundamental frequency of vocal cord vibration (also called F0); it is followed by 4-5
formants (F1 - F5) at higher frequencies
For a uniform-tube model of the vocal tract, natural frequencies occur at
odd multiples of 500 Hz (500, 1500, 2500 Hz, ...).
These resonant frequencies
are called formants.
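The "odd multiples of 500 Hz" figure can be checked with a short sketch. It assumes the usual textbook idealization, which the slide does not state: a uniform tube about 17 cm long, closed at the glottis and open at the lips, with sound speed 340 m/s. The function name and default values are illustrative only.

```python
def tube_resonances(length_m=0.17, speed_of_sound=340.0, count=3):
    """Resonances of a uniform tube closed at one end and open at the other.

    Quarter-wavelength resonator: F_k = (2k - 1) * c / (4 * L), k = 1, 2, ...
    The 17 cm length and 340 m/s speed are assumed textbook values.
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, count + 1)]


# Roughly 500, 1500 and 2500 Hz for the assumed tube
print([round(f) for f in tube_resonances()])
```

Real vowels deviate from these values because the tract is not uniform, which is why the measured formants in the table differ from 500, 1500 and 2500 Hz.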
Typical formant frequencies for selected vowels (Hz)

Vowel   Adult male            Adult female
        F1    F2    F3        F1    F2    F3
(i)     255   2330  3000      340   2610  3210
(u)     290   940   2180      390   995   2585
(ae)    735   1625  2465      950   1955  2900

The table lists the first three formants (F1 to F3) for each vowel
12. LTI Model for speech production
[Block diagram: Impulse Train Generator (Glottis) or Random Signal Generator -> Impulse Response of Vocal Tract -> Generated Speech; the periodic source is selected]
Impulse train generator is used as the excitation signal when a voiced segment is produced (a vowel, e.g. “a”)
Basic assumption: the source of excitation and the vocal tract system are independent
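A minimal sketch of the voiced branch of this model: a periodic impulse train is passed through a cascade of second-order resonators that stand in for the vocal tract impulse response. The resonator form, the 80 Hz bandwidth, the 8 kHz sampling rate and the 100 Hz pitch are assumptions chosen for illustration. The formant values are the adult male (ae) row of the formant table.

```python
import math


def impulse_train(n_samples, period):
    """Periodic excitation: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, freq_hz, bandwidth_hz, fs):
    """Second-order all-pole filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)           # pole radius
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


def synthesize_voiced(formants_hz, f0_hz=100.0, fs=8000,
                      duration_s=0.05, bandwidth_hz=80.0):
    """Excite a fixed (time-invariant) formant cascade with an impulse train."""
    period = round(fs / f0_hz)                           # pitch period in samples
    signal = impulse_train(round(fs * duration_s), period)
    for f in formants_hz:
        signal = resonator(signal, f, bandwidth_hz, fs)
    return signal


vowel_ae = synthesize_voiced([735, 1625, 2465])
print(len(vowel_ae))  # 400 samples = 50 ms at 8 kHz
```

Because the source and the filter are built separately and only combined at the end, the sketch also reflects the independence assumption stated on the slide.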
13. LTI Model for speech production
[Block diagram: Impulse Train Generator (Glottis) or Random Signal Generator -> Impulse Response of Vocal Tract -> Generated Speech; the random source is selected]
Random signal generator is used as the excitation signal when an unvoiced segment is produced (a consonant, e.g. “s”)
The LTI model is used for a short segment of speech (about 10 ms), over which the vocal tract parameters can be assumed to remain constant
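The two ideas on this slide, a random excitation and analysis over short segments of about 10 ms, can be sketched as follows. Gaussian white noise is assumed here as the random source, and the 8 kHz sampling rate and helper names are illustrative, not taken from the slides.

```python
import random


def white_noise(n_samples, seed=0):
    """Aperiodic excitation: zero-mean Gaussian noise (assumed noise model)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def split_into_frames(signal, fs, frame_ms=10.0):
    """Cut a signal into non-overlapping frames of `frame_ms` milliseconds.

    Within one frame the vocal tract parameters are treated as constant.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


fs = 8000
noise = white_noise(fs // 10)           # 100 ms of unvoiced excitation
frames = split_into_frames(noise, fs)
print(len(frames), len(frames[0]))      # prints: 10 80
```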
14. Nature of Speech Signal
Speech is generated by components such as the vocal cords and the vocal tract
It’s not possible to generate a speech signal on its own
Speech is a random signal
Speech can have infinitely many features (like the story of the blind people who each touch a different part of an
elephant and describe what the elephant looks like)
So it’s a complex problem
Uttering different words is possible because humans can change the resonant modes of
the vocal cavity and can also stretch the vocal cords to some extent, modifying the pitch
period for different vowels
That is why we need the linear time-varying (LTV) model
15. Linear Time-varying Model: Speech production
[Block diagram labels: Impulse Train Generator, Random Signal Generator, Amplitude, Impulse Response of Vocal Tract, Generated Speech]
Pitch period is variable
Impulse response is variable
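One way to picture the time-varying model is to let the source type, pitch, amplitude and filter change from frame to frame while the filter state carries over. The sketch below does this with a single resonator. The per-frame dictionary format, the one-formant filter, the 80 Hz bandwidth and all numeric values are hypothetical simplifications, not the model from the slides.

```python
import math
import random


def ltv_synthesize(frames, fs=8000, frame_len=80, bandwidth_hz=80.0, seed=0):
    """Frame-by-frame synthesis with time-varying source and filter.

    `frames` is a list of dicts with keys voiced, f0_hz, gain, formant_hz
    (a hypothetical parameter format used only for this illustration).
    """
    rng = random.Random(seed)
    r = math.exp(-math.pi * bandwidth_hz / fs)
    out = []
    y1 = y2 = 0.0        # filter state is kept across frame boundaries
    next_pulse = 0       # sample index of the next glottal pulse
    for k, p in enumerate(frames):
        # Filter coefficients are recomputed for every frame
        a1 = 2.0 * r * math.cos(2.0 * math.pi * p["formant_hz"] / fs)
        a2 = -r * r
        for i in range(frame_len):
            n = k * frame_len + i
            if p["voiced"]:
                if n >= next_pulse:
                    x = 1.0
                    next_pulse = n + round(fs / p["f0_hz"])  # variable pitch period
                else:
                    x = 0.0
            else:
                x = rng.gauss(0.0, 1.0)
                next_pulse = n + 1   # restart pulses as soon as voicing resumes
            y = p["gain"] * x + a1 * y1 + a2 * y2
            out.append(y)
            y2, y1 = y1, y
    return out


params = [
    {"voiced": True,  "f0_hz": 100.0, "gain": 1.0, "formant_hz": 735.0},
    {"voiced": True,  "f0_hz": 125.0, "gain": 0.8, "formant_hz": 500.0},
    {"voiced": False, "f0_hz": 0.0,   "gain": 0.1, "formant_hz": 2500.0},
]
speech = ltv_synthesize(params)
print(len(speech))  # 240 samples: three 10 ms frames at 8 kHz
```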
16. Speech Sound Categories
Periodic (Sonorants, Voiced)
Noisy (Fricatives, Unvoiced)
Impulsive (Plosive)
Example:
In the word “shop,” the “sh,” “o,” and “p” are generated from a
noisy, periodic, and impulsive source, respectively
18. Pitch
Pitch period: The time duration of one glottal cycle
Pitch (fundamental frequency): The reciprocal of the pitch period.
Remember: pitch is
calculated only
for voiced segments
19. Pitch Detection
The pitch period and V/UV
decisions are fundamental
to many speech coders
Many methods exist for the
calculation, for example:
◦ Autocorrelation function
◦ Zero-crossing rate (ZCR)
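The two methods listed above can be sketched on synthetic frames. The pitch search range (60 to 400 Hz), the ZCR threshold of 0.25 and the test signals are assumptions for illustration. A practical detector would add energy checks and smoothing across frames.

```python
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_voiced(frame, zcr_threshold=0.25):
    """Crude V/UV decision: noisy (unvoiced) frames cross zero often.

    The threshold is a hypothetical value, not one given in the slides.
    """
    return zero_crossing_rate(frame) < zcr_threshold


def autocorr_pitch(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate pitch as fs / lag, where lag maximizes the autocorrelation.

    Each lag's sum is divided by its number of terms so that long lags
    are not penalized for having fewer products.
    """
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        terms = len(frame) - lag
        val = sum(frame[n] * frame[n + lag] for n in range(terms)) / terms
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag


fs = 8000
# Synthetic voiced frame: a decaying pulse repeated every 80 samples (100 Hz)
voiced = [0.9 ** (n % 80) for n in range(240)]
# Synthetic unvoiced frame: Gaussian noise
rng = random.Random(1)
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(240)]

print(is_voiced(voiced), is_voiced(unvoiced))
print(autocorr_pitch(voiced, fs))
```

As the slides note, the pitch estimate is only meaningful for frames classified as voiced, so the V/UV decision would normally run first.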
20. Features or categorization of speech sound
Speech sounds are studied and classified from the following
perspectives:
1) The nature of the source: periodic, noisy, or impulsive, and
combinations of the three
2) The shape of the vocal tract
3) The time-domain waveform, which gives the pressure change with
time at the lips
4) The time-varying spectral characteristics revealed through the
spectrogram
21. Spectrogram
Time-varying spectral characteristics of the speech signal can be displayed
graphically as a two-dimensional pattern
Vertical axis: frequency, horizontal axis: time
The pseudo-color of the pattern is proportional to signal
energy (red: high energy)
The resonance frequencies of the vocal tract show up as “energy bands”
Voiced intervals are characterized by a striated appearance (due to the periodicity of the
signal)
Unvoiced intervals are more solidly filled in
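A spectrogram of this kind is a sequence of short-time spectra: the signal is cut into overlapping windowed frames and the energy in each frequency bin is computed per frame. The sketch below uses a direct DFT so that it needs only the standard library. The frame length, hop size, Hann window and 1 kHz test tone are illustrative choices, and an FFT routine would normally replace the inner loop.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return rows[t][k]: energy at time frame t and frequency bin k.

    Bin k corresponds to k * fs / frame_len Hz, for k = 0 .. frame_len // 2.
    """
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]                 # Hann window
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(abs(acc) ** 2)
        rows.append(row)
    return rows


fs = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(512)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
print(peak_bin * fs / 64)   # frequency of the strongest bin in the first frame
```

Plotting rows along the horizontal axis and bins along the vertical axis, with color mapped to energy, gives the display described on the slide.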
23. Most common Manner of articulation
Plosive, or oral stop, where there is complete occlusion (blockage) of both the oral and nasal
cavities of the vocal tract, and therefore no air flow. Examples include English /p t k/ (voiceless)
and /b d g/ (voiced)
Nasal stop, where there is complete occlusion of the oral cavity, and the air passes instead
through the nose. The shape and position of the tongue determine the resonant cavity that
gives different nasal stops their characteristic sounds. Examples include English /m, n/
Fricative, sometimes called spirant, where there is continuous frication (turbulent and noisy
airflow) at the place of articulation. Examples include English /f, s/ (voiceless), /v, z/ (voiced), etc.
24. Most common Manner of articulation
Sibilants are a type of fricative where the airflow is guided by a groove in the tongue toward the
teeth, creating a high-pitched and very distinctive sound. These are by far the most common
fricatives. English sibilants include /s/ and /z/
Affricate, which begins like a plosive, but this releases into a fricative rather than having a
separate release of its own. The English letters "ch" and "j" represent affricates
Trill, in which the articulator (usually the tip of the tongue) is held in place, and the airstream
causes it to vibrate. The double "r" of Spanish "perro" is a trill.
Approximant, where there is very little obstruction. Examples include English /w/ and /r/. Lateral
approximants, usually shortened to lateral, are a type of approximant pronounced with the side
of the tongue. English /l/ is a lateral.