Khmer ASR system overview and future directions

Khmer ASR system
Sethserey SAM
sam.sethserey@itc.edu.kh

Part I: ASR in general
o  Definition
o  Type of ASR
o  ASR flow chart
o  Data requirement
o  Performance of ASR systems
o  Fundamental methods to create ASR system


What is an ASR system?
o  ASR: automatic speech recognition
o  ASR system: a system or tool that converts an audio stream containing speech into text.
[Figure: speech input → ASR system → text output (example outputs: "Seven", "Seven days", "Zaven", …)]


ASR: what for?
o  ASR systems improve your life (work, business, communication, etc.)

Typology of ASR systems
o  Speaker-dependent vs. speaker-independent

o  Language constraints:
n  isolated word recognition
n  connected word recognition
n  keyword spotting
n  continuous speech recognition

o  Vocabulary size:
n  small (100 words)
n  medium (5 000 words)
n  large (50 000 words)

o  Robustness constraints
n  laboratory (office) conditions: imposed microphone, channel noise …


Levels of complexity


ASR flow chart

[Figure: speech signal ("seven") → ASR system: signal processing (digitizing & feature extraction) → decoding/searching → text hypotheses ("Seven", "Seven days", "Zaven", …)]
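
As a rough illustration of the signal-processing stage, the sketch below cuts a digitized signal into short overlapping frames and computes one log-energy value per frame. The 16 kHz sample rate, 25 ms frame and 10 ms shift are common choices assumed here for illustration, not values taken from the Khmer system. Real front ends compute richer features, and the function names are hypothetical.

```python
import math

def frame_signal(samples, frame_len, frame_shift):
    """Split a list of samples into overlapping frames of frame_len samples."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames

def log_energy(frame, floor=1e-10):
    """Log of the frame's total energy; the floor avoids log(0) on silence."""
    return math.log(sum(x * x for x in frame) + floor)

# Assumed settings (not from the slides): 16 kHz audio, 25 ms frames, 10 ms shift.
SAMPLE_RATE = 16000
FRAME_LEN = SAMPLE_RATE * 25 // 1000    # 400 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000  # 160 samples

# One second of a synthetic 440 Hz tone stands in for recorded speech.
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
frames = frame_signal(signal, FRAME_LEN, FRAME_SHIFT)
features = [log_energy(f) for f in frames]
print(len(frames), "frames; first log-energy:", round(features[0], 2))
```

The decoding/searching stage then works on such per-frame feature sequences rather than on raw samples.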


ASR data requirement
o  To train the acoustic model (AM) and the language model (LM), a large amount of data (text and audio) is needed:
n  Audio + transcription data
n  Pronunciation dictionary
n  Text data


ASR Performance
o  Evaluations of English ASR systems at the National Institute of Standards and Technology (NIST)


Causes of ASR’s error rate
o  Current ASR for continuous speech cannot reach a word error rate (WER) of 0%. Why?
n  The acoustic model is affected by speaker characteristics and the environment: gender, age, emotion, pitch, accent, physical state, channel noise, etc.
n  The lexical model is affected by incorrect word pronunciations.
n  The language model is affected by incorrect word usage and grammar mistakes.
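
For reference, WER is usually computed as the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition; the function name is hypothetical, and the example strings reuse the "seven" example from the slides.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("seven days", "seven"))  # one deleted word out of two
print(word_error_rate("seven", "zaven"))       # one substituted word out of one
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.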

Three fundamental methods for
creating a new ASR system

o  Enough training data → bootstrapping
o  Small amount of data → adaptation
o  No data → cross-language transfer


Part II:
Khmer language & its processing
o  Khmer language
o  Why research on Khmer ASR?


Khmer Language
o  Official language of Cambodia
o  Spoken by more than 15 M people
o  An atonal language
o  Writing system
n  33 consonants, 23 dependent vowels
n  14 independent vowels, 13 diacritics and various signs
n  No explicit word boundary


Why research on Khmer ASR?
o  An under-resourced language
n  Lack of text and speech data in digital form
n  Lack of linguistic documents (both soft and hard copies)
o  No explicit word segmentation
n  Automatic word segmentation is needed
n  State-of-the-art segmentation methods use
–  hand-crafted lexicons, word frequencies,
–  optimization criteria …
o  Other under-resourced, unsegmented languages in the region: Burmese, Lao, Thai, Vietnamese
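
To make the lexicon-plus-frequency idea concrete, here is a minimal sketch of one common approach: dynamic programming over all split points, choosing the segmentation with the highest product of unigram word probabilities. It is an illustration under assumptions, not the method used in the Khmer system. The toy lexicon, its counts, the unknown-character penalty and the function name are all hypothetical, and Latin letters stand in for Khmer text, which would also need care with character clusters.

```python
import math

def segment(text, word_counts, unknown_log_prob=-20.0):
    """Split unsegmented text into the most probable sequence of lexicon words."""
    total = sum(word_counts.values())
    max_word_len = max(len(w) for w in word_counts)
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if piece in word_counts:
                score = math.log(word_counts[piece] / total)
            elif end - start == 1:
                score = unknown_log_prob  # fallback: unknown single character
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk back through the stored start indices to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical toy lexicon with word frequencies.
toy_counts = {"the": 50, "them": 5, "men": 8, "me": 10, "eat": 7, "neat": 2, "en": 1}
print(segment("themeneat", toy_counts))
```

With these counts, "the men eat" scores higher than "the me neat" or "them en eat" because its words are more frequent, which is the role word frequencies play in the criteria listed above.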


Part III:
Khmer ASR at a glance
o  Corpus
o  Speech corpus setup
o  Text corpus setup
o  General overview
o  Current ASR system
o  Future work


Corpus: Speech corpus setup
o  Two types of corpus:
n  Small transcribed corpus (2007-2008)
o  Transcribed manually by engineering students at ITC
o  Only 6 hours of transcribed signal
o  Nature: radio signal (poor quality) downloaded from Radio Australia, Radio Free Asia and Voice of America

n  Large transcribed corpus (2011)
o  Corresponding text and speech were already available
o  Students helped verify the transcription
o  21 hours of transcribed signal
o  Nature: read speech from newspapers


Corpus: Text corpus setup
o  Retrieving text from the Web is becoming a common approach
o  Well-selected rich-content websites vs. crawling the Web
o  Adapting ClipsTextTk, an open source tool for corpus creation, for the Khmer language
n  Conversion from legacy character encoding to Unicode
n  Automatic segmentation
n  Conversion of special signs and numbers to text
n  Normalization of word spelling
o  Text corpus obtained from 5 sites:
n  2,5000 HTML pages retrieved
n  After processing: 0.5 M sentences, 15 M words
n  Duration: November 2007 – January 2008
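
One small piece of the normalization steps listed above can be sketched as follows: mapping Khmer digits (Unicode U+17E0 to U+17E9) to ASCII digits, a plausible first step before numbers are spelled out as words. This is an assumption about what such a step might look like, not the ClipsTextTk implementation, and the function name is hypothetical.

```python
KHMER_DIGIT_ZERO = 0x17E0  # code point of the Khmer digit zero

def khmer_digits_to_ascii(text):
    """Replace each Khmer digit with its ASCII digit; leave other characters unchanged."""
    converted = []
    for ch in text:
        offset = ord(ch) - KHMER_DIGIT_ZERO
        converted.append(str(offset) if 0 <= offset <= 9 else ch)
    return "".join(converted)

# The Khmer digits two, zero, zero, seven.
print(khmer_digits_to_ascii("\u17e2\u17e0\u17e0\u17e7"))
```

Spelling the resulting number out as Khmer words would need a language-specific number grammar, which is not shown here.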


Corpus: Overview
o  Description of Khmer ASR corpus

Type                      Small corpus                 Large corpus
Signal                    ~6h of transcribed signal    ~20h of transcribed signal
(acoustic model)          (radio)                      (read speech)
Text                      0.5 million sentences,       To be improved
(language model)          ~15.5 million words
Pronunciation dictionary  ~20 000 words                To be improved
(lexical model)

Current ASR system
Continuous ASR   Training & testing corpus   Word Error Rate (%)
system                                       Context-dependent   Context-dependent
                                             (8gau)              (16gau)
Khmer ASR v1     - LM: 15.5M words           42.5                40.3
                 - Training AM: 5h
                 - Testing: 172p
Khmer ASR v2     - LM: 15M words             36.4                35
                 - Training AM: 20h
                 - Testing: 290p
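
The gain from v1 to v2 can be expressed as an absolute and a relative WER reduction. The short sketch below does that arithmetic on the four figures in the table. Note that the two systems were tested on different sets (172p vs. 290p), so the comparison is only indicative. The function and variable names are hypothetical.

```python
def relative_reduction(old_wer, new_wer):
    """Relative WER reduction in percent, going from old_wer to new_wer."""
    return (old_wer - new_wer) / old_wer * 100.0

# WER (%) from the table: (Khmer ASR v1, Khmer ASR v2) per acoustic-model setup.
results = {"8gau": (42.5, 36.4), "16gau": (40.3, 35.0)}

for setup, (v1, v2) in results.items():
    absolute = v1 - v2
    relative = relative_reduction(v1, v2)
    print(f"{setup}: {absolute:.1f} points absolute, {relative:.1f}% relative")
```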


Future Work
o  Collect more text data for the language model
o  Next challenge: how to improve Khmer ASR for speaker-independent use and in different environments?
