Investigating the Possibilities of Using SMT for Text Annotation
László J. Laki (1, 2)
laki.laszlo@itk.ppke.hu

(1) Pázmány Péter Catholic University, Faculty of Information Technology
(2) MTA-PPKE Language Technology Research Group

This work was partially supported by TÁMOP-4.2.1.B – 11/2/KMR-2011–0002
OUTLINE
•   SMT as POS tagger
•   Baseline system
•   Decreasing the size of target vocabulary
•   Handling OOV words
•   Evaluation
•   Conclusion
STATISTICAL MACHINE TRANSLATION
• Frameworks
  – MOSES (Koehn et al., 2007)
  – JOSHUA (Li et al., 2009)
  – SRILM (Stolcke, 2002)
• Corpus
  – Szeged Korpusz 2 (Csendes et al., 2003)
  – 1.2 million words
  – MSD coding system
THE BASELINE SYSTEM
Plain text:
  a konszolidációra való törekvés találkozott a budapest#bank igényeivel is -
  tudjuk meg garadnai#róbert adattárház-menedzsertől .
Reference annotation:
  a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn]
  találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3]
  is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn]
  adattárház-menedzser_[Nc-sb] ._[Punct]
System’s annotation:
  a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn]
  találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3]
  is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn]
  adattárház-menedzsertől ._[Punct]

• Correct annotation:    24557
• Incorrect annotation:    646
• No annotation:          1697

System    BLEU score   Accuracy
MOSES     98.49%       91.29%
JOSHUA    97.31%       91.07%
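
The reported MOSES accuracy can be reproduced from the three counts above. The sketch below assumes that tokens left without annotation count as errors, so accuracy is the share of correct annotations among all tokens. The function name is illustrative and not part of the original system.

```python
def token_accuracy(correct: int, incorrect: int, unannotated: int) -> float:
    """Share of all tokens whose annotation matches the reference.

    Assumption: tokens with no annotation are counted as errors.
    """
    total = correct + incorrect + unannotated
    return correct / total


# Counts reported on the slide for the baseline system.
accuracy = token_accuracy(correct=24557, incorrect=646, unannotated=1697)
print(f"{accuracy:.2%}")  # 91.29%
```

Under this assumption the 1697 unannotated tokens account for most of the error, which is why the later slides concentrate on OOV words.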
DECREASING THE SIZE OF TARGET VOCABULARY
• With only POS disambiguation
   – Annotate to POS tags without lemmatization
      • (e.g. találkozik_[Vmis3s---n] -> [Vmis3s---n])
   – Target vocabulary size: 152694 -> 1128 tokens
   – Accuracy: 91.46% (+0.17%)
• With simplifying POS tags
   – Annotate to main POS tags
      • (e.g. [Vmis3s---n] -> V)
   – Target vocabulary size: 1128 -> 14 tokens
   – Accuracy: 92.20% (+0.91%)
• Conclusion
   – None of the OOV words were tagged (1698 tokens)
   – Quality increased slightly, at the cost of significant
     information loss
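
The two reduction steps can be sketched as simple string transformations on the `lemma_[tag]` format used in the examples. This is only an illustration: the function names are hypothetical, and taking the first letter of the tag as the main POS is an assumption that a real mapping would likely need to refine for tags such as `[Punct]`.

```python
import re

# Matches the annotation format shown on the slides, e.g. "találkozik_[Vmis3s---n]".
TOKEN_RE = re.compile(r"^(?P<lemma>.+)_\[(?P<tag>[^\]]+)\]$")


def to_full_tag(annotated: str) -> str:
    """Drop the lemma and keep only the bracketed tag."""
    match = TOKEN_RE.match(annotated)
    if match is None:
        raise ValueError(f"not in lemma_[tag] format: {annotated!r}")
    return f"[{match.group('tag')}]"


def to_main_pos(full_tag: str) -> str:
    """Reduce a full tag to its main POS (assumed: first letter of the tag)."""
    return full_tag.strip("[]")[0]


full = to_full_tag("találkozik_[Vmis3s---n]")
print(full)               # [Vmis3s---n]
print(to_main_pos(full))  # V
```

Each step maps many distinct targets onto one, which is what shrinks the target vocabulary and also what discards the lemma and the detailed morphological features.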
HANDLING OOV WORDS
• OOV words belong to just a few word classes
• Analyze the context of the OOV words
• Create a dictionary based on word frequencies calculated
  from the training set
• Words not included in this dictionary are replaced with the
  string "unk"
• Tested with different thresholds

Token              #
ezt                120
a                  100
kívül              6
diplomáciai        4
magyarországi      4
képességet         2
erőfeszítéseken    2
adhatnák           1

Plain text:
  ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül
  mindenekelőtt a magyarországi multinacionálisok adhatnák .
Modified text:
  ezt a unk és unk a diplomáciai unk kívül mindenekelőtt a magyarországi
  unk unk .
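
A minimal sketch of the replacement step, assuming a word is kept when its training-set frequency reaches the threshold and is replaced by "unk" otherwise. The function name is hypothetical, and the counts for "és", "mindenekelőtt" and "." are invented for illustration because the slide does not list them. The other counts come from the table above.

```python
def replace_rare_words(tokens, freq, threshold, unk="unk"):
    """Replace every token whose training frequency is below the threshold."""
    return [tok if freq.get(tok, 0) >= threshold else unk for tok in tokens]


# Frequencies from the slide, plus three assumed entries (marked below).
freq = {
    "ezt": 120, "a": 100, "kívül": 6, "diplomáciai": 4, "magyarországi": 4,
    "képességet": 2, "erőfeszítéseken": 2, "adhatnák": 1,
    "és": 90, "mindenekelőtt": 5, ".": 150,  # assumed counts, not from the slide
}

plain = ("ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül "
         "mindenekelőtt a magyarországi multinacionálisok adhatnák .").split()

modified = " ".join(replace_rare_words(plain, freq, threshold=3))
print(modified)
```

With a threshold of 3 this reproduces the modified text shown above. Words unseen in training, such as "lobbyerőt", have frequency 0 and are always replaced.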
HANDLING OOV WORDS
Accuracy by threshold:

Threshold         Original text   Lemmatized text   Multiple thresholds
X<1 (Baseline)    91.46%
                  93.13%          92.57%            93.28%
                  90.40%          92.25%            90.65%
                  88.41%          91.81%            88.62%
                  87.07%          91.48%            87.40%
                  85.97%          91.10%            86.15%

• In the original text
  – Best accuracy: 93.13%
• In case of lemmas
  – Best accuracy: 92.57%
• Multiple thresholds
  – Best accuracy: 93.28%
INTRODUCING POSTFIXES
• Goal: Separate nouns, verbs, adjectives,
  etc.
• Different POS types have characteristic
  postfixes
• Use last characters of the OOV words.
  – Last 2,3,4 characters
  – e.g. noun: házból -> unk_ból
         verb: megállítottuk -> unk_tuk
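
The postfix idea can be sketched in a few lines: instead of a bare "unk", the replacement token keeps the last n characters of the original word. The function name is hypothetical. The two examples from the slide correspond to n = 3.

```python
def unk_with_postfix(word: str, n_chars: int, unk: str = "unk") -> str:
    """Replace a word with the unk marker plus its last n_chars characters."""
    return f"{unk}_{word[-n_chars:]}"


print(unk_with_postfix("házból", 3))         # unk_ból
print(unk_with_postfix("megállítottuk", 3))  # unk_tuk
```

Words shorter than n characters are kept whole after the marker, since Python slicing does not fail on short strings.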
INTRODUCING POSTFIXES
Accuracy by threshold and number of characters kept:

Threshold         2         3         4
X<1 (Baseline)    91.46%    91.46%    91.46%
                  95.17%    95.83%    95.96%
                  94.17%    95.32%    95.90%
                  93.48%    94.97%    95.73%
                  92.94%    94.70%    95.60%
                  92.61%    94.55%    95.55%
EVALUATION
• Baseline:
  – Choose the best
• PurePos:
  – Maxent and HMM based
  – Includes morphological disambiguation
• OpenNLP
  – Maxent based
  – Perceptron based

Only POS tagging
System                      Token accuracy   Sentence accuracy
Baseline (BL)               89.66%           25.27%
SMT-Baseline2               91.46%           34.53%
SMT-OOV-postfix             95.96%           56.47%
PurePos                     96.03%           55.87%
PurePos-MorphTable          97.29%           66.40%
OpenNLP Maxent (ONM)        95.28%           26.00%
OpenNLP Perceptron (ONP)    94.98%           26.67%

POS tagging + lemmatization
System                      Token accuracy   Sentence accuracy
SMT-Baseline1               91.29%           33.73%
PurePos                     83.92%           10.00%
PurePos-MorphTable          84.89%           11.60%
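
The gap between the token and sentence columns follows from how the two metrics are usually defined. The sketch below assumes that a sentence counts as correct only if every token in it is tagged correctly. The function name and the toy data are hypothetical and are not taken from the evaluation.

```python
def token_and_sentence_accuracy(reference, predicted):
    """Return (token accuracy, sentence accuracy) for parallel tag sequences."""
    total_tokens = correct_tokens = correct_sentences = 0
    for ref_sent, pred_sent in zip(reference, predicted):
        if len(ref_sent) != len(pred_sent):
            raise ValueError("reference and prediction differ in length")
        matches = [r == p for r, p in zip(ref_sent, pred_sent)]
        total_tokens += len(matches)
        correct_tokens += sum(matches)
        # Assumption: one wrong tag makes the whole sentence wrong.
        correct_sentences += all(matches)
    return correct_tokens / total_tokens, correct_sentences / len(reference)


# Toy data: two sentences, one wrong tag in the second.
reference = [["T", "N", "V"], ["N", "V"]]
predicted = [["T", "N", "V"], ["N", "A"]]
token_acc, sentence_acc = token_and_sentence_accuracy(reference, predicted)
print(token_acc, sentence_acc)  # 0.8 0.5
```

Because a single error invalidates a whole sentence, sentence accuracy falls much faster than token accuracy as sentences get longer.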
CONCLUSION
• An SMT system was examined for part-of-speech
  disambiguation and lemmatization in Hungarian
• Fully automated system
• Best accuracy about 96%
• Decreasing the size of target vocabulary
• Handling OOV words
THANK YOU FOR YOUR ATTENTION

     laki.laszlo@itk.ppke.hu
