© 2005, it - Instituto de Telecomunicações. All rights reserved.
Arlindo Veiga (1,2)
Sara Candeias (1)
Fernando Perdigão (1,2)
(1) Instituto de Telecomunicações, Polo de Coimbra, Portugal
(2) Universidade de Coimbra, DEEC, Portugal
STIL 2011
8th Symposium in Information and Human Language Technology
Oct. 24-26, 2011, Cuiabá, Brazil
GENERATING A PRONUNCIATION DICTIONARY
FOR EUROPEAN PORTUGUESE
USING A JOINT-SEQUENCE MODEL
WITH EMBEDDED STRESS ASSIGNMENT
SUMMARY
• Goal
• Problem Statement
• G2P System
• Joint-Sequence Model
• Stressed Vowel Assignment
• Results
• Conclusions
STIL 2011 - Cuiabá, Brazil - Oct.24-26 2011
GOAL
• To generate a pronunciation dictionary for European Portuguese (EP)
• To develop a grapheme-to-phoneme (G2P) conversion system for EP
PROBLEM STATEMENT
How? By implementing an automatic system for G2P conversion.
What approaches?
• linguistic rules
  • Portuguese orthography is roughly phonologically based → rules give good coverage of the association between graphemes (G) and phonemes (P)
  • No natural language fully satisfies this assumption → the association between G and P is not quite one-to-one → a list of exceptions is needed
  • Writing the rules is very complex, hard and tiresome
• statistics
  • From pronunciation examples, the pronunciation of unseen words can be predicted by analogy
  • Analogy alone is not smart enough: vaga → /v "a g 6/ vs. vagarosa → /v 6 g 6 r "O z 6/
• mixed: linguistic rules combined with statistics
G2P SYSTEM
System based on a mixed approach founded on:
• a stochastic model: the joint-sequence model
• rules for stressed vowel assignment
JOINT-SEQUENCE MODEL
Alignment between graphemes and phonemes: ideally “one-to-one”
< B r a s i l >
/ b r 6 z i l /
The raw symbols do not always align one-to-one:
< c h a m o u >
/ S 6 m o /
< t ê m >
/ t 6~ i~ 6~ i~ /
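A minimal sketch of how a grapheme-phoneme alignment could be computed, assuming an edit-distance style dynamic program in which every grapheme receives zero or one phoneme. The function name `align_1_01`, the unit cost for a grapheme without phoneme, and the naive "same symbol is free" substitution cost are illustrative assumptions, not the authors' implementation.

```python
def align_1_01(graphemes, phonemes, sub_cost=None):
    """Pair every grapheme with one phoneme or with "" (no phoneme)."""
    if sub_cost is None:
        # Assumed cost: free when the letter and the phoneme symbol coincide.
        sub_cost = lambda g, p: 0 if g.lower() == p.lower() else 1
    n, m = len(graphemes), len(phonemes)
    if m > n:
        raise ValueError("more phonemes than graphemes")
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i graphemes with the first j phonemes
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(min(i, m) + 1):
            skip = cost[i - 1][j] + 1  # grapheme i gets no phoneme
            if skip < cost[i][j]:
                cost[i][j], back[i][j] = skip, "skip"
            if j > 0:  # grapheme i gets phoneme j
                pair = cost[i - 1][j - 1] + sub_cost(graphemes[i - 1], phonemes[j - 1])
                if pair < cost[i][j]:
                    cost[i][j], back[i][j] = pair, "pair"
    # Follow the back-pointers from the end of the word to its start.
    pairs, i, j = [], n, m
    while i > 0:
        if back[i][j] == "pair":
            pairs.append((graphemes[i - 1], phonemes[j - 1]))
            j -= 1
        else:
            pairs.append((graphemes[i - 1], ""))
        i -= 1
    return pairs[::-1]


brasil = align_1_01(list("Brasil"), "b r 6 z i l".split())
chamou = align_1_01(list("chamou"), "S 6 m o".split())
```

For < B r a s i l > every grapheme is paired with exactly one phoneme. For < c h a m o u > two graphemes are left without a phoneme, and which ones are chosen depends on the assumed costs.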
• Implementing the Levenshtein algorithm (“1-01”)
• Defining alternative symbols
  • Graphemes → digraphs
  • Phonemes → SAMPA UniChar
< c h a m o u >
< S a m º >
/ S 6 m o /
< t ê m >
/ t 6~ i~ 6~ i~ /
/ t ï ï /
/ t Æ ï /
Graphonemes
GOAL: to compute the most probable pronunciation of a word given the word's graphoneme form
TECHNIQUE: using n-grams
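The idea of an n-gram model over graphonemes can be illustrated with a small bigram sketch. It assumes words are already aligned as (grapheme, phoneme) pairs, uses additive smoothing with arbitrary constants, and searches for the best pronunciation with a Viterbi pass. All names (`train_bigrams`, `log_prob`, `decode`) and the two training words are illustrative. The real system is trained on a full dictionary and evaluated with n up to 8.

```python
import math
from collections import defaultdict

BOS, EOS = ("<s>", "<s>"), ("</s>", "</s>")  # word boundary graphonemes


def train_bigrams(aligned_words):
    """Count graphoneme bigrams and the phonemes seen with each grapheme."""
    counts = defaultdict(lambda: defaultdict(int))
    candidates = defaultdict(set)
    for word in aligned_words:
        seq = [BOS] + list(word) + [EOS]
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
        for g, p in word:
            candidates[g].add(p)
    return counts, candidates


def log_prob(counts, prev, cur, alpha=0.1, vocab=100):
    """Smoothed log P(cur | prev); alpha and vocab are assumed constants."""
    row = counts.get(prev, {})
    return math.log((row.get(cur, 0) + alpha) / (sum(row.values()) + alpha * vocab))


def decode(graphemes, counts, candidates):
    """Return the phonemes of the most probable graphoneme sequence."""
    best = {BOS: (0.0, [])}  # last graphoneme -> (log score, phonemes so far)
    for g in list(graphemes) + [None]:  # None stands for the end of the word
        if g is None:
            options = [EOS]
        else:
            options = [(g, p) for p in sorted(candidates.get(g, {""}))]
        step = {}
        for cur in options:
            score, phones = max(
                ((s + log_prob(counts, prev, cur), ph) for prev, (s, ph) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (score, phones if cur == EOS else phones + [cur[1]])
        best = step
    return [p for p in best[EOS][1] if p]


aligned = [
    [("S", "S"), ("a", "6"), ("m", "m"), ("º", "o")],  # < S a m º > / S 6 m o /
    [("v", "v"), ("a", "a"), ("g", "g"), ("a", "6")],  # vaga, stress mark dropped
]
counts, candidates = train_bigrams(aligned)
pronunciation = decode("vaga", counts, candidates)
```

With these two toy entries the grapheme <a> has two candidate phonemes, /a/ and /6/, and the bigram context decides between them at each position.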
G2P SYSTEM
• Several errors were due to incorrect stress assignment, e.g. solidamente, incansavelmente
• Marking the stressed vowel (Vstressed) improved the statistical model by expressing graphoneme classes unequivocally
• Stress is assigned with 6 rules
STRESSED VOWEL ASSIGNMENT
For adverbs ending in <mente> (<rápido> → <rapidamente>, fast → quickly):
• An algorithm divides the word into two parts, <ROOT> and <mente>.
• The <ROOT> part is processed by a specific module (a list of graphematic patterns in which the Vstressed is identified).
To generate a univocal graphoneme, we attributed special symbols to the Vstressed.
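The <ROOT>/<mente> split can be sketched as follows. This is only an illustration: the pattern table `ROOT_STRESS`, its single entry, and the apostrophe used as stress marker are hypothetical stand-ins for the pattern list and the special symbols mentioned above.

```python
SUFFIX = "mente"

# Hypothetical pattern list: root -> index of its stressed vowel.
ROOT_STRESS = {"rapida": 1}


def split_mente(word):
    """Return (root, "mente") for words ending in <mente>, else (word, "")."""
    if word.lower().endswith(SUFFIX) and len(word) > len(SUFFIX):
        return word[:-len(SUFFIX)], word[-len(SUFFIX):]
    return word, ""


def mark_root_stress(word, marker="'"):
    """Insert an assumed marker before the stressed vowel of the root."""
    root, suffix = split_mente(word)
    index = ROOT_STRESS.get(root.lower())
    if not suffix or index is None:
        return word  # not split, or root not covered by the pattern list
    return root[:index] + marker + root[index:] + suffix
```

A purely orthographic test like this also matches words that end in <mente> without being adverbs, so a practical version would need an additional check.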
VOCABULARY
To estimate the graphoneme model:
• SpeechDat pronunciation dictionary
• 15k entries
• Deletion of foreign words
• Change of some transcriptions
• Standardization of the pronunciation
Applied to the CETEMPúblico vocabulary:
40k words → 40k pronunciations
DICTIONARY
CETEMPúblico 40k pronunciations:
• Iterative procedure:
  • Long manual verification
  • Correction of the transcriptions
• Comparison to the pronunciations of LOQUENDO
  • The majority of the transcriptions agreed.
  • The transcriptions from our dictionary were the right ones most of the time.
This dictionary was used for the training and test procedure.
EXPERIMENTS
All experiments were based on the dictionary of 40k pronunciations, in two versions:
• with stress marking
• without stress marking
To train and test the model, each of these two dictionaries was partitioned into five folds for a cross-validation procedure.
Final results were obtained by averaging the five partial results.
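The five-fold procedure can be sketched as below. The shuffling seed, the way folds are cut, and the `train_fn` / `error_fn` callbacks are assumptions made for illustration. The slides do not say how the folds were drawn.

```python
import random


def five_fold_splits(entries, k=5, seed=0):
    """Yield (train, test) pairs; each entry is tested exactly once."""
    items = list(entries)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, folds[i]


def cross_validate(entries, train_fn, error_fn, k=5):
    """Average the error rate over the k partial results."""
    rates = []
    for train, test in five_fold_splits(entries, k):
        model = train_fn(train)
        rates.append(error_fn(model, test))
    return sum(rates) / len(rates)
```

Here `train_fn` would estimate the graphoneme model on four folds and `error_fn` would return an error rate on the remaining fold.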
RESULTS
The performance of the G2P conversion system was expressed as two average error rates: the phoneme error rate (PER) and the word error rate (WER).
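A sketch of how the two error rates could be computed. It assumes the common definitions, which the slides do not spell out: PER as the phoneme-level edit distance divided by the number of reference phonemes, and WER as the fraction of words whose predicted transcription differs from the reference. Function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def per_wer(references, hypotheses):
    """Return (PER, WER) for parallel lists of phoneme sequences."""
    phone_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    phone_total = sum(len(r) for r in references)
    word_errors = sum(r != h for r, h in zip(references, hypotheses))
    return phone_errors / phone_total, word_errors / len(references)


refs = [["v", "a", "g", "6"], ["S", "6", "m", "o"]]
hyps = [["v", "6", "g", "6"], ["S", "6", "m", "o"]]
per, wer = per_wer(refs, hyps)
```

In this toy case one of eight phonemes and one of two words are wrong.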
The figures summarize the results obtained using n-grams with n between 2 and 8.
• The use of n-grams with large contexts (n greater than 5) did not improve the system. In fact, there was a slight increase in the error rates (lack of samples to estimate large contexts).
• The marking of the stressed vowel contributed to a significant improvement in the system performance.
CONCLUSIONS
The joint-sequence model with embedded stress assignment gave good results.
By inspecting the test errors, we observed that most of them resulted from uncommon grapheme patterns or compound words without graphic stress marks.
The most frequent errors resulted from the pronunciation of stressed <e> and <o>, since they can be pronounced as /E/ vs. /e/ (<selo>: verb vs. noun) and /O/ vs. /o/ (<ovos> (pl.) vs. <ovo> (sing.)) without any systematic rule.
Our system is freely available at http://www.co.it.pt/~labfala/g2p/ and includes models, dictionaries and the G2P converter.
Thank you
Contact: Sara Candeias (saracandeias@co.it.pt)
INTRODUCTION
Generate a pronunciation dictionary for European Portuguese (EP)
• Grapheme-to-phoneme conversion (G2P)
  Bom dia → b"o~ d"i6 (en. Good morning)
• Applications: component of ASR and TTS systems, e.g. in language learning, machine translation, …
• For correct pronunciation we need:
  • G2P, stress assignment
• Contributions of this paper:
  • Show phonological constraints (stressed vowel)
  • Evaluate a mixed approach for a G2P system
  • Make the dictionary, the model and the converter publicly available