Straight

9-10 August, 2002, Computational Audition
Fixed Point Representations for
Very High-Quality Speech and
Sound Modification System
Hideki Kawahara
Wakayama University, Japan

Summary
• A functional approach (“computational” after Marr) is important and productive.
• Fixed points provide feature values as well as their reliability indices.
  – Using within-channel information
• The fixed point concept may provide a clue for integrating the Fourier based concept and the wavelet-Mellin transform based concept.
Reference system
“Local” center of gravity

original
STRAIGHT: demo

STRAIGHT demo: morphing
neutral, angry
interpolation, extrapolation
Word: /hai/ (“Yes” in Japanese)

Background
• “Auditory Brain” Project by CREST
  – Short-term goal: speech processing systems based on functional models of auditory functions.
  – STRAIGHT: a very high-quality speech manipulation system
  – Fixed point based algorithms (an alternative way of reducing the dimensionality of auditory representations)
  – Long-term goal: “computational” theories of audition
• Frustration with the ways auditory models are used in ASR and with how speech processing systems are evaluated.

“Auditory Brain” Project
• To develop a very high-quality speech and/or sound manipulation system that is based on perceptually relevant parameters and does not preserve phase/waveform information.
  – Are distance based quality measures relevant?
  – Why do periodic sounds sound smoother and richer (in the auditory fovea)?
  – Is it relevant to test highly nonlinear speech perception using elementary sounds?

Why high quality?
• An ecological approach for investigating a highly nonlinear system: the human
The response is not necessarily predictable
from elementary test signals
It is necessary to use
ecologically valid stimuli
Naturalness

Hans Moravec: Robot, 2000, Oxford
xbox
iMac
PlayStation-3
Key issue: compatibility
Background figure is removed.
Please visit Hans Moravec’s page for
the original figure.
Faster than exponential growth in computing power
(Chapter 3: Power and Presence, Page 60)
http://www.frc.ri.cmu.edu/~hpm/book98/

• Computational theories of speech/auditory perception
  – Ecological constraints on evolution
  – It cannot be ad hoc.
    » When there is an elegant and reasonable algorithm and it does not violate ecological (biological and environmental) constraints, there is no reason to deny that the algorithm shares common underlying principles with our auditory system.

• Computational theories of speech/auditory perception
  – Periodicity: time-frequency sampling grid
  – Periodicity: stable reference point for the wavelet-Mellin transform
  – Log-linear frequency axis
    » Wavelet-Mellin transform: shape and size
  – Why two ears? ICA
  – Long-term correlation (structure)
    » ASR, music

STRAIGHT: a core technology
• Conceptually simple architecture
  – Channel VOCODER
  – Source-filter model
• Graded parameters (vs. binary decision)
  – Sensitivity analysis
  – Morphing
• Reliability / Transparency
  – No post-processing
  – Weakly constrained model

Structure of STRAIGHT
STRAIGHT: architecture

Spectral envelope estimation
STRAIGHT: structure

Weakly constrained spectral
envelope estimation
waveform
Time window
Interferences
in the time domain
Interferences in the
frequency domain

Reduction of edge discontinuity
Reduction of periodicity interference
Smoothing by spline basis
Composite window

Time-frequency smoothing (current implementation)
F0 synchronous Gaussian window
Complementary time window
Reduced interference spectrum

Compensation of over-smoothing

Fixed point based algorithms
• Fixed points in the frequency domain: → F0 extraction
• Fixed points in the time domain: → Excitation extraction

Fixed point of mapping
y = f(x); a fixed point is a value of x that satisfies x = f(x)
Examples
* Instantaneous frequency of a filter output around a sinusoidal component
* Energy centroid of a windowed signal around an event
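The first example on this slide can be tried numerically. The sketch below is an illustration under stated assumptions, not the STRAIGHT implementation: it assumes a Gaussian-windowed (Gabor) filter, a pure 203 Hz cosine as the test signal, and hypothetical function names (`gabor_output`, `instantaneous_frequency`, `fixed_points`). A fixed point is reported where the instantaneous frequency of the filter output equals the filter center frequency and the mapping crosses the diagonal from above.

```python
import cmath
import math

def gabor_output(x, fs, fc, center, sigma):
    """Gaussian-windowed spectrum of x at frequency fc, window centred on sample `center`.
    The output is demodulated by fc, so its phase advances at (signal frequency - fc)."""
    half = int(4 * sigma * fs)  # truncate the Gaussian at about 4 sigma
    acc = 0j
    for n in range(center - half, center + half + 1):
        t = (n - center) / fs
        acc += x[n] * math.exp(-0.5 * (t / sigma) ** 2) * cmath.exp(-2j * math.pi * fc * n / fs)
    return acc

def instantaneous_frequency(x, fs, fc, center, sigma):
    """Instantaneous frequency (Hz) from the phase advance over one sample."""
    a = gabor_output(x, fs, fc, center, sigma)
    b = gabor_output(x, fs, fc, center + 1, sigma)
    return fc + cmath.phase(b * a.conjugate()) * fs / (2 * math.pi)

def fixed_points(x, fs, centers_hz, center, sigma):
    """Return (position, slope) of each downward crossing of y = f(x) through y = x."""
    ys = [instantaneous_frequency(x, fs, fc, center, sigma) for fc in centers_hz]
    found = []
    for i in range(len(centers_hz) - 1):
        x0, x1 = centers_hz[i], centers_hz[i + 1]
        g0, g1 = ys[i] - x0, ys[i + 1] - x1
        if g0 > 0 >= g1:
            position = x0 + (x1 - x0) * g0 / (g0 - g1)  # linear interpolation
            slope = (ys[i + 1] - ys[i]) / (x1 - x0)     # slope of the mapping itself
            found.append((position, slope))
    return found

fs = 8000
f0 = 203.0  # assumed test frequency
signal = [math.cos(2 * math.pi * f0 * n / fs) for n in range(800)]
grid = [150.0 + 5.0 * k for k in range(21)]  # filter center frequencies, 150 to 250 Hz
points = fixed_points(signal, fs, grid, center=400, sigma=0.005)
```

For a clean sinusoid the mapping is nearly flat around the component, so the single fixed point sits at the sinusoid's frequency and the slope of the mapping is close to zero.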

Averaging and fixed point
• Prominent component
Window locations
Average of windowed value
Fixed point
background

Averaging and fixed point
• Prominent component
Window locations
Average of windowed value
Fixed point
background
Parameters (position, slope, [level])

F0 estimation

Window selection for reliable
representation of mapping
Refinement of F0 synchronous windows

Window with harmonic
cancellation

Fixed-point-based sinusoidal
component extraction

Fixed-point-based sinusoidal
frequency and C/N estimation
C/N information enables
optimum F0 estimation based on
multiple harmonic components
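One possible reading of "optimum" here is inverse-variance weighting. The sketch below is an assumption-laden illustration, not the method from the talk: it assumes that the frequency error variance of each harmonic is inversely proportional to its C/N (as a linear power ratio), so that the F0 candidate f_k / k gets the weight k² · C/N. The function name and the numbers are hypothetical.

```python
def combine_harmonics(frequencies_hz, cn_ratios):
    """Inverse-variance weighted F0 from harmonic frequency estimates.

    frequencies_hz[k-1] is the estimated frequency of harmonic k.
    cn_ratios[k-1] is its C/N as a linear power ratio (assumed to be
    inversely proportional to the frequency error variance).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for k, (freq, cn) in enumerate(zip(frequencies_hz, cn_ratios), start=1):
        weight = k * k * cn           # variance of freq / k shrinks by k squared
        weighted_sum += weight * freq / k
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical estimates for harmonics 1 to 3; the second one is the most reliable.
f0_estimate = combine_harmonics([200.5, 399.0, 601.2], [100.0, 1000.0, 10.0])
```

With these numbers the result is dominated by the second harmonic, whose candidate is 199.5 Hz.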

Approximate estimation of C/N

Reliable built-in mechanism
for fundamental component selection
linear filter arrangement
log-linear filter arrangement
mapping filter output

Fixed points on C/N map
Fundamental
component

F0 evaluation based on EGG
gross error
W/O：0.72%
with：0.32%
female

Graded source information
• Fixed point based F0 extraction (with C/N map)
F0 trajectories (resolution: 1/F0)
C/N for each fixed point
Graded aperiodicity information is also extracted

Fixed points in the time domain
• How to define auditory temporal events
  – Localized energy centroid
Alternative representation

Fixed points in the time domain
Squared whitened signal Energy centroid
Gaussian
window
Amount of energy
concentration
Speech
waveform

waveform
Energy centroid
Window center
Fixed points
Fixed point based event detection
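The idea of locating events as fixed points of the mapping from window center to energy centroid can be sketched as follows. This is a minimal illustration under assumptions: the energy signal is synthetic (two Gaussian bumps stand in for the squared whitened signal), the window is Gaussian with an assumed width of 20 samples, and the function names are hypothetical.

```python
import math

def centroid(energy, tau, sigma_w):
    """Energy centroid of the signal seen through a Gaussian window centred at tau."""
    num = 0.0
    den = 0.0
    for n, e in enumerate(energy):
        w = math.exp(-0.5 * ((n - tau) / sigma_w) ** 2)
        num += n * e * w
        den += e * w
    return num / den

def event_fixed_points(energy, taus, sigma_w):
    """Window centers where the centroid equals the window center (downward crossings)."""
    cs = [centroid(energy, tau, sigma_w) for tau in taus]
    events = []
    for i in range(len(taus) - 1):
        g0 = cs[i] - taus[i]
        g1 = cs[i + 1] - taus[i + 1]
        if g0 > 0 >= g1:  # the centroid stops leading the window and starts lagging it
            events.append(taus[i] + (taus[i + 1] - taus[i]) * g0 / (g0 - g1))
    return events

def bump(n, center, width):
    return math.exp(-0.5 * ((n - center) / width) ** 2)

# Synthetic energy with events at samples 100 and 260 (assumed test data).
energy = [bump(n, 100, 5) + 0.5 * bump(n, 260, 5) for n in range(360)]
events = event_fixed_points(energy, list(range(50, 311)), sigma_w=20.0)
```

The crossing between the two events (near sample 180) goes in the opposite direction and is therefore not reported; only the two stable fixed points remain.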

Mean time
duration
Definition of event in the time domain
Event location
Windowed whitened signal

Windowed event location and
the original event location
Gaussian window
Approximation of envelope
Windowed location
Original
location
Window location

Slope at fixed point
duration
Window
parameter
Duration can be estimated from
the geometrical parameter at
the fixed point
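A short worked derivation shows why the slope carries the duration. It assumes, for illustration only, that the event's energy envelope and the window are both Gaussian; the symbols t_0, sigma_e, sigma_w, tau, c and s are notation introduced here, not taken from the slides.

```latex
% Assumed event envelope (location t_0, duration \sigma_e) and window (center \tau, width \sigma_w)
e(t) = \exp\!\left(-\frac{(t - t_0)^2}{2\sigma_e^2}\right), \qquad
w(t - \tau) = \exp\!\left(-\frac{(t - \tau)^2}{2\sigma_w^2}\right)

% Energy centroid of the windowed event as a function of the window center
c(\tau) = \frac{\int t \, e(t) \, w(t - \tau) \, dt}{\int e(t) \, w(t - \tau) \, dt}
        = \frac{\sigma_w^2 \, t_0 + \sigma_e^2 \, \tau}{\sigma_e^2 + \sigma_w^2}

% Fixed point: c(\tau) = \tau \iff \tau = t_0. Slope at the fixed point:
s = \frac{dc}{d\tau} = \frac{\sigma_e^2}{\sigma_e^2 + \sigma_w^2}
\quad\Longrightarrow\quad
\sigma_e = \sigma_w \sqrt{\frac{s}{1 - s}}
```

Under these assumptions the fixed point gives the event location, and the slope together with the known window parameter gives the event duration.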

Equivalence between the time domain
definition and the frequency domain
definition
waveform
Time domain definition
Frequency domain definition
Group delay
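The equivalence can be checked numerically: the energy centroid of a signal equals its group delay averaged over frequency with the power spectrum as weight. The sketch below is an illustration with assumed test data (a decaying oscillation that starts at sample 20) and hypothetical function names; it uses the standard discrete-time group delay expression Re(DFT(n·x) / DFT(x)).

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short signals)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]

def time_domain_location(x):
    """Energy centroid: sum(n |x[n]|^2) / sum(|x[n]|^2)."""
    energy = [abs(v) ** 2 for v in x]
    return sum(n * e for n, e in enumerate(energy)) / sum(energy)

def frequency_domain_location(x):
    """Power-weighted average group delay.
    The group delay is Re(ramp / spec); multiplying by |spec|^2 gives
    Re(ramp * conj(spec)), which avoids dividing by spectral zeros."""
    spec = dft(x)
    ramp = dft([n * v for n, v in enumerate(x)])
    num = sum((r * s.conjugate()).real for r, s in zip(ramp, spec))
    den = sum(abs(s) ** 2 for s in spec)
    return num / den

# Assumed test signal: a damped oscillation excited at sample 20.
x = [0.0] * 64
for n in range(20, 64):
    x[n] = math.exp(-(n - 20) / 6.0) * math.cos(0.9 * (n - 20))

t_time = time_domain_location(x)
t_freq = frequency_domain_location(x)
```

Both values agree to rounding error, and both lie after sample 20: the energy centroid lags the excitation instant, which is why a compensation step is needed to locate the excitation.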

Inverse problem:
Where is the excitation?
Minimum phase
response
Event as the
energy centroid
Excitation
(impulse)
compensation

Equivalence in definitions
• Frequency domain definition of the event location
• Assuming causality
Group delay

Group delay of a minimum phase
response
Computation via the cepstrum
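A common way to obtain the minimum phase group delay from a magnitude spectrum alone is to fold the real cepstrum onto its causal part. The sketch below illustrates that textbook procedure; it is not claimed to be the exact computation used in STRAIGHT. The one-pole test filter, its coefficient `a`, and the function names are assumptions chosen because the closed-form group delay is known.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT / inverse DFT (adequate for short sequences)."""
    n_pts = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / n_pts) for n in range(n_pts))
           for k in range(n_pts)]
    return [v / n_pts for v in out] if inverse else out

def minimum_phase_group_delay(magnitude):
    """Group delay (in samples) of the minimum phase response with the given
    magnitude spectrum, sampled on an even-length DFT grid."""
    n_pts = len(magnitude)
    half = n_pts // 2
    # Real cepstrum of the log magnitude spectrum.
    cep = [v.real for v in dft([math.log(m) for m in magnitude], inverse=True)]
    # Fold the anti-causal part onto the causal part.
    folded = [0.0] * n_pts
    folded[0] = cep[0]
    for n in range(1, half):
        folded[n] = 2.0 * cep[n]
    folded[half] = cep[half]
    # Minus the frequency derivative of the phase is Re(DFT(n * folded)).
    return [v.real for v in dft([n * c for n, c in enumerate(folded)])]

# Assumed test case: one-pole minimum phase filter H(z) = 1 / (1 - a z^-1).
a = 0.5
n_pts = 64
omegas = [2 * math.pi * k / n_pts for k in range(n_pts)]
magnitude = [1.0 / abs(1 - a * cmath.exp(-1j * w)) for w in omegas]
delay = minimum_phase_group_delay(magnitude)
# Closed-form group delay of the same filter, for comparison.
expected = [(a * math.cos(w) - a * a) / (1 - 2 * a * math.cos(w) + a * a) for w in omegas]
```

Subtracting such a minimum phase group delay from an observed group delay is one way to read the compensation step named in this part of the talk.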

Compensation based on
minimum phase group delay
Observed group delay
Causal group delay
Compensated event location
Compensated event duration

example
Observed group delay
Minimum phase
group delay
Compensated group delay

Excitation estimation based on fixed
point based event detection
Event based concentration Excitation based concentration
Energy centroid
Compensated
group delay
excitation
Vocal fold closure

Excitation extraction accuracy
Standard
deviation

Estimated
excitation
Speech
waveform
Multiple resolution display of
events (fixed points)

Multiple resolution display of
events (fixed points)
demo
Fixed points due to one excitation
align on a straight line

Phase map of wavelet transform

Instantaneous frequency based
fixed points

Group delay based fixed points

Source attribute control

Group delay manipulated
mixed-mode excitation source
group delay asymmetry
impulse response
..provides continuous coverage
from pulse train to random noise
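The continuum from pulse to noise can be illustrated with an all-pass pulse whose group delay is randomized. The sketch below is a simplified illustration, not the STRAIGHT excitation source: it assumes a flat magnitude spectrum, independent Gaussian group delay values per frequency bin, and a hypothetical `delay_spread` parameter (in samples) that controls how far the energy is dispersed in time.

```python
import cmath
import math
import random

def all_pass_pulse(n_pts, delay_spread, seed=0):
    """One real excitation pulse with unit magnitude spectrum and random group delay.

    delay_spread = 0 gives a unit impulse; larger values spread the same
    energy over time, so the pulse becomes increasingly noise-like."""
    rng = random.Random(seed)
    half = n_pts // 2
    spectrum = [1 + 0j] * n_pts  # bins 0 and n_pts/2 stay real
    phase = 0.0
    for k in range(1, half):
        # Phase is minus the running integral of group delay over frequency.
        phase -= rng.gauss(0.0, delay_spread) * 2 * math.pi / n_pts
        spectrum[k] = cmath.exp(1j * phase)
        spectrum[n_pts - k] = spectrum[k].conjugate()  # Hermitian symmetry, real output
    # Inverse DFT (naive form, adequate for a short pulse).
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_pts)
                for k in range(n_pts)).real / n_pts
            for n in range(n_pts)]

pulse_like = all_pass_pulse(64, 0.0)   # all energy at sample 0
noise_like = all_pass_pulse(64, 8.0)   # same energy, dispersed in time
```

Because the magnitude spectrum is untouched, both outputs have the same total energy and the same spectral envelope; only the temporal concentration changes.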

Summary
• A functional approach (“computational” after Marr) is important and productive.
• Fixed points provide feature values as well as their reliability indices.
  – Using within-channel information
• The fixed point concept may provide a clue for integrating the Fourier based concept and the wavelet-Mellin transform based concept.

Colleagues
• Haruhiro Katayose, Toshio Irino, Takanobu Nishiura (Wakayama Univ.)
• Minoru Tsuzaki, Hideki Iwasawa (ATR)
• Parham Zolfaghari (NTT)
• Kiyohiro Shikano, Hiroshi Saruwatari (NAIST)
• Fumitada Itakura, Kazuya Takeda, Shoji Kajita, Hideki Banno (CIAIR, Nagoya Univ.)
• Masato Akagi, Masashi Unoki (JAIST)
• Seiichi Nakagawa (Toyohashi Inst. Tech)
• Shigeki Sagayama, Nobuaki Minematsu (Univ. Tokyo)
• Diane Kewley-Port (Indiana Univ. USA)
• Osamu Fujimura (Ohio State Univ. USA)
• Alain de Cheveigné (IRCAM, France)
• Roy D. Patterson (CNBH, UK)

References
• Hideki Kawahara, Ikuyo Masuda-Katsuse and Alain de Cheveigné: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication, 27, pp.187-207 (1999).
• Hideki Kawahara, Haruhiro Katayose, Alain de Cheveigné, Roy D. Patterson: Fixed Point Analysis of Frequency to Instantaneous Frequency Mapping for Accurate Estimation of F0 and Periodicity, Proc. EUROSPEECH'99, Volume 6, pp.2781-2784 (1999).
• Hideki Kawahara, Yoshinori Atake and Parham Zolfaghari: Accurate vocal event detection method based on a fixed-point to weighted average group delay, ICSLP-2000, Beijing, pp.664-667 (2000).
• H. Kawahara and P. Zolfaghari: Systematic F0 glitches around vowel nasal transitions, EUROSPEECH'2001, pp.2459-2462 (2001).
• H. Kawahara, Jo Estill and O. Fujimura: Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT, MAVEBA 2001, Sept.13-15, Firenze, Italy (2001).
• H. Kawahara and H. Katayose: Scat generation research program based on STRAIGHT, a high-quality speech analysis, modification and synthesis system, J. IPSJ, 43, 2, pp.208-218 (2002). (in Japanese)

For computational “Audition”
seed#1 seed#2
F0 trajectory and
frequency axis modification
Parts preparation
Mixing and level adjustment

Nonlinear time warping based
on phase of the F0 component
(FM pulse train)
without time warping / with time warping

Nonlinear time warping based
on phase of the F0 component
(vowel sequence /aiueo/)
without time warping / with time warping
