A CROSS-LINGUAL ANNOTATION PROJECTION-BASED SELF-SUPERVISION APPROACH
FOR OPEN INFORMATION EXTRACTION

  The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011)
                             November 10th, 2011, Chiang Mai

                          Seokhwan Kim (POSTECH)
                          Minwoo Jeong (Microsoft Bing)
                            Jonghoon Lee (POSTECH)
                          Gary Geunbae Lee (POSTECH)
Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions




Information Extraction
• Goal
   - To generate structured information from natural language
     documents
      • Representing semantic relationships among a set of arguments

 Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.

                     Person          Barack Obama
                     Birthday        August 4, 1961
                     Birthplace      Honolulu
Previous Approaches
• Many supervised machine learning approaches have been
  successfully applied to the RDC task
    - (Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta
      and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al.,
      2006)
    - Large amounts of training data are required
• Weakly-supervised techniques have been sought
    - (Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
    - To learn the IE system without significant annotation effort
• Open Information Extraction
    - (Banko et al., 2007; Wu and Weld, 2010)

Open Information Extraction
• An alternative weakly-supervised IE paradigm
    - (Banko et al., 2007)
• Problem Definition
      $f : D \to \{ \langle e_i, r_{i,j}, e_j \rangle \mid 1 \le i, j \le N \}$
    - Binary relation extraction between e_i and e_j
    - Considering relationships explicitly represented by r_{i,j}
• Goal
    - Large-scale IE
       • Domain-independent
       • Relation-independent
    - Without hand-crafted rules or hand-annotated training examples

How to Eliminate Human Supervision
• Self-supervised Learning for Open IE
    - Using automatically obtained training examples
      • From external knowledge

• Previous Systems
    - TextRunner (Banko et al., 2007)
      • Penn Treebank
      • A small set of heuristics about syntactic structural constraints
    - WoE (Wu and Weld, 2010)
      • Wikipedia articles
      • Wikipedia Infoboxes




What’s the Problem?
• Previous approaches mainly depend on language-specific
  knowledge for English
    - Heuristic-based Approach
      • Requires a syntactic treebank for the target language
      • Requires heuristics designed for the target language
    - Wikipedia-based Approach
      • Wikipedia articles and infoboxes are available for languages other than English
      • But the amount of available resources differs widely among languages
           - English Wikipedia: 3,500,000 articles
           - Korean Wikipedia: 150,000 articles




Cross-lingual Annotation Projection
• Goal
   - To obtain training examples for the target language L_T
• Method
   - To leverage parallel corpora to project the annotations on the
     source language L_S to the target language L_T
   - The premise is that parallel corpora between L_S and L_T are much
     easier to obtain than the task-specific training dataset for L_T

          <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
      Barack Obama was born in Honolulu , Hawaii .


   버락 오바마               는       하와이         의      호놀룰루              에서        태어났다
   (beo-rak-o-ba-ma)   (neun)   (ha-wa-i)   (ui)   (ho-nol-rul-ru)   (e-seo)   (tae-eo-nat-da)


   <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
Cross-lingual Annotation Projection
• Previous Work
    - Part-of-speech tagging (Yarowsky and Ngai, 2001)
    - Named-entity tagging (Yarowsky et al., 2001)
    - Verb classification (Merlo et al., 2002)
    - Dependency parsing (Hwa et al., 2005)
    - Mention detection (Zitouni and Florian, 2008)
    - Semantic role labeling (Pado and Lapata, 2009)
• To the best of our knowledge, no previous work has applied annotation
  projection to the Open IE task



Annotation
• To obtain annotations for the sentences in L_S
• Procedure
    - A set of entities in the given sentence is identified
    - Each instance is composed of a pair of entities
    - For each instance, extraction is performed

         <e1, r12, e2> = <Barack Obama, was born in, Honolulu>

      Barack Obama was born in Honolulu , Hawaii .
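
The instance-generation step of this procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the entity list is written by hand, `make_instances` is a hypothetical helper, and pairing entities in left-to-right order without repetition is an assumption.

```python
from itertools import combinations
from typing import List, Tuple

def make_instances(entities: List[str]) -> List[Tuple[str, str]]:
    """Return every pair of entities, keeping their order in the sentence."""
    return list(combinations(entities, 2))

# Entities identified in "Barack Obama was born in Honolulu , Hawaii ."
entities = ["Barack Obama", "Honolulu", "Hawaii"]
instances = make_instances(entities)

for e_i, e_j in instances:
    print(e_i, "|", e_j)
```

An extractor then decides, for each pair, whether a phrase such as "was born in" expresses a relation between the two entities.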




Projection
• To project the annotations from the sentences in L_S onto
  the sentences in L_T using word alignment information
• Procedure
    - A set of entities in the given sentence is identified
    - Each instance is composed of a pair of entities
    - For each instance, the existence of a relationship is determined
    - If the instance is positive, the contextual subtext is projected

           <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
       Barack Obama was born in Honolulu , Hawaii .


    버락 오바마               는       하와이         의      호놀룰루              에서        태어났다
    (beo-rak-o-ba-ma)   (neun)   (ha-wa-i)   (ui)   (ho-nol-rul-ru)   (e-seo)   (tae-eo-nat-da)


    <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
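
The projection of the example triple can be sketched as follows. This is an illustrative sketch rather than the authors' code: the word alignment is written by hand (in the talk it comes from an automatic aligner), the Korean tokens are given in romanized form, and `project_span` and `project_triple` are hypothetical helper names.

```python
from typing import Dict, List, Sequence, Set, Tuple

Span = List[int]  # token indices of one triple element

def project_span(span: Span, alignment: Dict[int, Set[int]]) -> List[int]:
    """Collect the target-token indices aligned to any source token in the span."""
    target: Set[int] = set()
    for i in span:
        target |= alignment.get(i, set())  # unaligned tokens contribute nothing
    return sorted(target)

def project_triple(triple: Tuple[Span, Span, Span],
                   alignment: Dict[int, Set[int]],
                   tgt_tokens: Sequence[str]) -> Tuple[str, str, str]:
    """Map <e1, r, e2> from source-token spans to target-language strings."""
    e1, rel, e2 = (
        " ".join(tgt_tokens[j] for j in project_span(span, alignment))
        for span in triple
    )
    return e1, rel, e2

src_tokens = ["Barack", "Obama", "was", "born", "in", "Honolulu", ",", "Hawaii", "."]
tgt_tokens = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ui",
              "ho-nol-rul-ru", "e-seo", "tae-eo-nat-da"]

# Hand-written alignment (assumption): source index -> set of target indices.
alignment = {0: {0}, 1: {0}, 2: {6}, 3: {6}, 4: {5}, 5: {4}, 7: {2}}

# <Barack Obama, was born in, Honolulu> as source-token index spans.
triple = ([0, 1], [2, 3, 4], [5])
projected = project_triple(triple, alignment, tgt_tokens)
print(projected)
```

Sorting the projected indices restores target-language word order, which is why "in" and "was born" come out as "e-seo tae-eo-nat-da".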




Overall Architecture

   [Figure: system pipeline]
   English-Korean Parallel Corpus -> Self-Supervision -> Korean Annotated Corpus
   Korean Annotated Corpus -> Learning -> Korean Open IE Model
   Korean Raw Text + Korean Open IE Model -> Extraction -> Extracted Results




Cross-lingual Annotation Projection-based Self-Supervision

   [Figure: self-supervision pipeline over the parallel corpus]
   Annotation:  English Sentences -> English Preprocessors -> English Open IE System
                -> English Annotated Corpus
   Projection:  Korean Sentences -> Korean Preprocessors -> Word Alignment -> Projection
                -> Korean Annotated Corpus

Cross-lingual Annotation Projection-based Self-Supervision
• Dataset
    - English-Korean Parallel Corpus
      • 266,892 sentence pairs in English and Korean

• Preprocessors
    - English
      • OpenNLP toolkit
    - Korean
      • Espresso toolkit




Cross-lingual Annotation Projection-based Self-Supervision
• English Open IE
    - Our own implementation of Banko’s method
      • Dataset
           - The WSJ part of Penn Treebank
           - By applying a series of heuristics (Banko, 2009)
           - 1,028,361 instances from 49,208 sentences (9.0% were positive)
      • Model
           - Conditional Random Fields (CRF)
                • With Lexical and POS tag features
                • CRF++ toolkit




Cross-lingual Annotation Projection-based Self-Supervision
• Word Alignment
   - Aligned by GIZA++ toolkit
     • In the standard configuration in both directions
     • The bi-directional alignments were joined using the grow-diag-final
       algorithm
   - Chunk-based Reorganization
     • To reduce the word alignment errors
     • Generating alignments between pairs of base phrase chunks
     • Using a simple greedy algorithm
          - Based on the overlap score of aligned words between base phrase chunks




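
The chunk-based reorganization of the word alignment can be sketched as a greedy one-to-one matching. This is a sketch under stated assumptions, not the authors' algorithm: the slides do not define the overlap score, so it is taken here to be the number of word links between two chunks; the chunk boundaries, the word links (including one deliberately noisy link) and the names `chunk_index` and `align_chunks` are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Chunk = Tuple[int, int]  # [start, end) token offsets of a base phrase chunk

def chunk_index(token: int, chunks: List[Chunk]) -> Optional[int]:
    """Return the index of the chunk containing the token, if any."""
    for k, (start, end) in enumerate(chunks):
        if start <= token < end:
            return k
    return None

def align_chunks(word_links: List[Tuple[int, int]],
                 src_chunks: List[Chunk],
                 tgt_chunks: List[Chunk]) -> List[Tuple[int, int]]:
    """Greedily pair chunks, highest overlap first, each chunk used at most once."""
    overlap: Counter = Counter()
    for s, t in word_links:
        sc, tc = chunk_index(s, src_chunks), chunk_index(t, tgt_chunks)
        if sc is not None and tc is not None:
            overlap[(sc, tc)] += 1  # assumed score: number of shared word links
    aligned: List[Tuple[int, int]] = []
    used_src: Set[int] = set()
    used_tgt: Set[int] = set()
    for (sc, tc), _score in sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0])):
        if sc not in used_src and tc not in used_tgt:
            aligned.append((sc, tc))
            used_src.add(sc)
            used_tgt.add(tc)
    return sorted(aligned)

# English: [Barack Obama] [was born] [in Honolulu] , [Hawaii] .
src_chunks = [(0, 2), (2, 4), (4, 6), (7, 8)]
# Korean (romanized): [beo-rak-o-ba-ma neun] [ha-wa-i ui] [ho-nol-rul-ru e-seo] [tae-eo-nat-da]
tgt_chunks = [(0, 2), (2, 4), (4, 6), (6, 7)]
# Word links (source index, target index); the last one is a noisy link.
word_links = [(0, 0), (1, 0), (2, 6), (3, 6), (4, 5), (5, 4), (7, 2), (3, 4)]

chunk_alignment = align_chunks(word_links, src_chunks, tgt_chunks)
print(chunk_alignment)
```

In this toy input the noisy link from "born" to "ho-nol-rul-ru" is discarded, because both chunks it touches are already claimed by higher-scoring pairs.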
Cross-lingual Annotation Projection-based Self-Supervision
• Annotated Dataset
    - English
    - 598,115 instances
      • 169,771 positive instances

• Projected Dataset
    - Korean
    - 278,730 instances
      • 89,743 positive instances




Learning & Extraction
• Extractor for Korean Open IE
    - Maximum Entropy (ME) model
      • To detect whether or not each given instance is positive
      • Features
           - Lexical, POS Tag
           - On the dependency path
      • Maximum Entropy Modeling toolkit
    - Conditional Random Fields (CRF) model
      • To identify the contextual subtext indicating the semantic relationship
      • Features
           - Lexical, POS Tag
           - On the dependency path
      • CRF++ toolkit
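
For the CRF stage, each positive instance has to be turned into a token-level label sequence before training. A minimal sketch, assuming an ENT / B-REL / I-REL / O label scheme (the slides do not state the label set) and a hypothetical helper `encode_labels`:

```python
from typing import List, Sequence

def encode_labels(n_tokens: int,
                  e1: Sequence[int],
                  rel: Sequence[int],
                  e2: Sequence[int]) -> List[str]:
    """Label entity tokens ENT, relation tokens B-REL/I-REL, everything else O."""
    labels = ["O"] * n_tokens
    for i in list(e1) + list(e2):
        labels[i] = "ENT"
    # The first relation token (in sentence order) opens the phrase.
    for k, i in enumerate(sorted(rel)):
        labels[i] = "B-REL" if k == 0 else "I-REL"
    return labels

tokens = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ui",
          "ho-nol-rul-ru", "e-seo", "tae-eo-nat-da"]
# Projected triple <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
labels = encode_labels(len(tokens), e1=[0], rel=[5, 6], e2=[4])

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The ME model would see the same instance as a single positive or negative example, so only the CRF needs this per-token encoding.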


Evaluation #1
• Dataset
    - 250 sentences from Korean Wikipedia articles
    - With manually annotated gold standard
      • 1,434 instances
      • 308 positive instances

• Baseline
    - Heuristic-based System
      • Sejong treebank corpus (Korean)
      • A set of heuristics utilized for the English Open IE system except
        language-specific rules




Evaluation #1
• Comparison of performances

                Model              Precision  Recall  F-measure
               Heuristic             47.7      20.1     28.3
               Projection            33.6      49.0     39.8
          Heuristic + Projection     41.9      46.4     44.1
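
The F column can be cross-checked against the P and R columns. The sketch below assumes F is the balanced F-measure (harmonic mean of precision and recall), which the slides do not state explicitly. Because P and R are themselves rounded, the recomputed values can differ from the reported ones in the last digit.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F) copied from the table above.
reported = {
    "Heuristic": (47.7, 20.1, 28.3),
    "Projection": (33.6, 49.0, 39.8),
    "Heuristic + Projection": (41.9, 46.4, 44.1),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _f) in reported.items()}

for name, (_p, _r, f) in reported.items():
    print(f"{name}: reported F = {f}, recomputed F = {recomputed[name]:.2f}")
```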




Evaluation #2
• Datasets
    - Korean Newswire
       • 302,276 documents
       • 2,565,487 sentences
    - Korean Wikipedia
       • 123,000 articles
       • 1,342,003 sentences

• Manual Evaluation
    - For four relation types
       • BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF




Evaluation #2
• Evaluation results for four relation types

                          Newswire                          Wikipedia
  Type           Precision (%)  # of extractions   Precision (%)  # of extractions
  Birth Place        65.2             256              69.1             971
  Won Award          57.4             824              63.3             286
  Acquisition        67.0            1112              50.3             143
  Invent Of          53.1              32              47.6             103

    - 3,727 extractions with a precision of 63.7% for four relation types



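
The overall figures for Evaluation #2 can be recomputed from the per-type results. The sketch assumes the overall precision is micro-averaged, that is, each precision is weighted by its number of extractions; the slides do not say how the 63.7% was aggregated.

```python
# (corpus, relation type) -> (precision in %, number of extractions)
results = {
    ("Newswire", "Birth Place"): (65.2, 256),
    ("Newswire", "Won Award"): (57.4, 824),
    ("Newswire", "Acquisition"): (67.0, 1112),
    ("Newswire", "Invent Of"): (53.1, 32),
    ("Wikipedia", "Birth Place"): (69.1, 971),
    ("Wikipedia", "Won Award"): (63.3, 286),
    ("Wikipedia", "Acquisition"): (50.3, 143),
    ("Wikipedia", "Invent Of"): (47.6, 103),
}

total_extractions = sum(n for _p, n in results.values())
# Weight each precision by its extraction count (micro-average).
overall_precision = sum(p * n for p, n in results.values()) / total_extractions

print(total_extractions, round(overall_precision, 1))
```

Under that assumption the per-type rows reproduce both the 3,727 extractions and the 63.7% precision reported on the slide.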
Evaluation #2
• Distribution of the errors



             Error Type                 # of errors
             Chunking Error             364 (26.9%)
             Dependency Parsing Error   461 (34.1%)
             Extracting Error           527 (39.0%)
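
The percentages in the error table are shares of all errors, and the error total is consistent with the overall precision reported for Evaluation #2. A short check, assuming every incorrect extraction is counted under exactly one error type:

```python
errors = {
    "Chunking Error": 364,
    "Dependency Parsing Error": 461,
    "Extracting Error": 527,
}

total_errors = sum(errors.values())
shares = {name: round(100 * n / total_errors, 1) for name, n in errors.items()}

# 3,727 is the total number of extractions reported for Evaluation #2.
total_extractions = 3727
implied_precision = round(
    100 * (total_extractions - total_errors) / total_extractions, 1
)

print(total_errors, shares, implied_precision)
```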




Conclusions
• Summary
   - A Cross-lingual Annotation Projection Approach for Open IE
   - Korean Open IE system developed using an English Open IE
     system and an English-Korean parallel corpus
   - Our system outperformed the heuristic-based system
     (F-measure of 39.8 vs. 28.3)
   - Our system achieved a precision of 63.7% in a large-scale
     evaluation
• Ongoing Work
   - Reducing sensitivity to the errors committed by preprocessors
   - Investigating hybrid approaches considering various external
     knowledge sources


Q&A

Sequential Labeling for Tracking Dynamic Dialog States
 
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
 
MMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognitionMMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognition
 
A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...
 
A spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information accessA spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information access
 
An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...
 
An Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information ExtractionAn Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information Extraction
 

Último

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 

Último (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

  • 1. A CROSS-LINGUAL ANNOTATION PROJECTION-BASED SELF-SUPERVISION APPROACH FOR OPEN INFORMATION EXTRACTION
      The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011)
      November 10th, 2011, Chiang Mai
      Seokhwan Kim (POSTECH), Minwoo Jeong (Microsoft Bing), Jonghoon Lee (POSTECH), Gary Geunbae Lee (POSTECH)
  • 2. Contents
      • Introduction
      • Open Information Extraction
      • Cross-Lingual Annotation Projection
      • Implementation
      • Evaluation
      • Conclusions
  • 4. Information Extraction
      • Goal
          - To generate structured information from natural language documents
              • Representing semantic relationships among a set of arguments
      • Example: "Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."
          Person      Barack Obama
          Birthday    August 4, 1961
          Birthplace  Honolulu
  • 5. Previous Approaches
      • Many supervised machine learning approaches have been successfully applied to the RDC task
          - (Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al., 2006)
          - Large amounts of training data are required
      • Weakly-supervised techniques have been sought
          - (Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
          - To learn the IE system without significant annotation effort
      • Open Information Extraction
          - (Banko et al., 2007; Wu and Weld, 2010)
  • 7. Open Information Extraction
      • An alternative weakly-supervised IE paradigm
          - (Banko et al., 2007)
      • Problem Definition
          $f: D \to \{\langle e_i, r_{i,j}, e_j \rangle \mid 1 \le i, j \le N\}$
          - Binary relation extraction between $e_i$ and $e_j$
          - Considering relationships explicitly represented by $r_{i,j}$
      • Goal
          - Large-scale IE
              • Domain-independent
              • Relation-independent
          - Without hand-crafted rules or hand-annotated training examples
  • 8. How to Eliminate Human Supervision
      • Self-supervised Learning for Open IE
          - Using automatically obtained training examples
              • From external knowledge
      • Previous Systems
          - TextRunner (Banko et al., 2007)
              • Penn Treebank
              • A small set of heuristics about syntactic structural constraints
          - WoE (Wu and Weld, 2010)
              • Wikipedia articles
              • Wikipedia Infoboxes
  • 9. What’s the Problem?
      • Previous approaches mainly depend on language-specific knowledge for English
          - Heuristic-based Approach
              • Syntactic treebank for the target language
              • Heuristics designed for the target language
          - Wikipedia-based Approach
              • Wikipedia articles and infoboxes are available not only for English
              • Differences among languages in the amount of available resources
                  - English Wikipedia: 3,500,000 articles
                  - Korean Wikipedia: 150,000 articles
  • 11. Cross-lingual Annotation Projection
      • Goal
          - To obtain training examples for the target language LT
      • Method
          - To leverage parallel corpora to project the annotations on the source language LS to the target language LT
          - The premise is that parallel corpora between LS and LT are much easier to obtain than the task-specific training dataset for LT
      • Example
          <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
          Barack Obama was born in Honolulu , Hawaii .
          버락 오바마 는 하와이 의 호놀룰루 에서 태어났다
          (beo-rak-o-ba-ma) (neun) (ha-wa-i) (ui) (ho-nol-rul-ru) (e-seo) (tae-eo-nat-da)
          <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
  • 12. Cross-lingual Annotation Projection
      • Previous Work
          - Part-of-speech tagging (Yarowsky and Ngai, 2001)
          - Named-entity tagging (Yarowsky et al., 2001)
          - Verb classification (Merlo et al., 2002)
          - Dependency parsing (Hwa et al., 2005)
          - Mention detection (Zitouni and Florian, 2008)
          - Semantic role labeling (Pado and Lapata, 2009)
      • To the best of our knowledge, no work has reported on the Open IE task
  • 16. Annotation
      • To obtain annotations for the sentences in LS
      • Procedure
          - A set of entities in the given sentence is identified
          - Each instance is composed of a pair of entities
          - For each instance, extraction is performed
      • Example
          <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
          Barack Obama was born in Honolulu , Hawaii .
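The first two steps of the annotation procedure can be sketched as follows. This is a minimal illustration only: `Entity`, `Instance`, and `make_instances` are hypothetical names, the entity spans are written by hand, and the sketch stops at candidate generation. Deciding which tokens actually express the relation is left to the extractor, which the slides describe elsewhere as a CRF model.

```python
from itertools import combinations
from typing import List, NamedTuple


class Entity(NamedTuple):
    start: int  # index of the first token of the entity
    end: int    # index one past the last token of the entity


class Instance(NamedTuple):
    e1: Entity
    e2: Entity
    between: List[str]  # tokens between the two entities (relation candidates)


def make_instances(tokens: List[str], entities: List[Entity]) -> List[Instance]:
    """Build one candidate instance per left-to-right pair of entities."""
    ordered = sorted(entities)
    return [
        Instance(a, b, tokens[a.end:b.start])
        for a, b in combinations(ordered, 2)
    ]


tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
# Hand-written spans for "Barack Obama", "Honolulu", and "Hawaii" (assumed).
entities = [Entity(0, 2), Entity(5, 6), Entity(7, 8)]
instances = make_instances(tokens, entities)

for inst in instances:
    e1 = " ".join(tokens[inst.e1.start:inst.e1.end])
    e2 = " ".join(tokens[inst.e2.start:inst.e2.end])
    print(e1, "|", " ".join(inst.between), "|", e2)
```

Three entities give three candidate pairs; only some of them would be labeled positive by the extractor.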
  • 21. Projection
      • To project the annotations from the sentences in LS onto the sentences in LT using word alignment information
      • Procedure
          - A set of entities in the given sentence is identified
          - Each instance is composed of a pair of entities
          - For each instance, the existence of relationship is determined
          - If the instance is positive, the contextual subtext is projected
      • Example
          <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
          Barack Obama was born in Honolulu , Hawaii .
          버락 오바마 는 하와이 의 호놀룰루 에서 태어났다
          (beo-rak-o-ba-ma) (neun) (ha-wa-i) (ui) (ho-nol-rul-ru) (e-seo) (tae-eo-nat-da)
          <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
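The projection step for the example sentence pair can be sketched as below. It is an illustration under stated assumptions, not the authors' implementation: `project_span` and `project_triple` are hypothetical names, the word-alignment links are written by hand rather than produced by an aligner, the Korean tokens are given in their romanized form, and dropping a triple when any part has no aligned target token is an assumed policy.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]   # half-open token span [start, end) in the source
Link = Tuple[int, int]   # (source token index, target token index)


def project_span(span: Span, alignment: List[Link]) -> List[int]:
    """Return the sorted target indices aligned to any token in the span."""
    start, end = span
    return sorted({t for s, t in alignment if start <= s < end})


def project_triple(
    triple: Tuple[Span, Span, Span],
    alignment: List[Link],
    target_tokens: List[str],
) -> Optional[Tuple[str, ...]]:
    """Project (e1, relation, e2) spans onto the target sentence.

    Returns None if some part has no aligned target token (assumed policy).
    """
    projected = []
    for span in triple:
        indices = project_span(span, alignment)
        if not indices:
            return None
        projected.append(" ".join(target_tokens[i] for i in indices))
    return tuple(projected)


# Source: 0 Barack 1 Obama 2 was 3 born 4 in 5 Honolulu 6 , 7 Hawaii 8 .
target = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ui",
          "ho-nol-rul-ru", "e-seo", "tae-eo-nat-da"]
# Hand-written links for illustration only.
links = [(0, 0), (1, 0), (2, 6), (3, 6), (4, 5), (5, 4), (7, 2)]
# <Barack Obama, was born in, Honolulu> as source token spans.
source_triple = ((0, 2), (2, 5), (5, 6))

result = project_triple(source_triple, links, target)
print(result)
```

With these links the projected relation phrase follows target word order, so "was born in" becomes "e-seo tae-eo-nat-da", matching the triple shown on the slide.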
  • 23. Overall Architecture
      [Diagram] English-Korean Parallel Corpus → Self-Supervision → Korean Annotated Corpus → Learning → Korean Open IE Model; Raw Korean Text → Extraction → Extracted Results
  • 24. Cross-lingual Annotation Projection-based Self-Supervision
      [Diagram] Stages: Annotation, Projection. Components: Parallel Corpus, English Sentences, Korean Sentences, English Preprocessors, Korean Preprocessors, Word Alignment, English Open IE System, Projection, English Annotated Corpus, Korean Annotated Corpus
  • 25. Cross-lingual Annotation Projection-based Self-Supervision
      • Dataset
          - English-Korean Parallel Corpus
              • 266,892 bi-sentence pairs in English and Korean
      • Preprocessors
          - English
              • OpenNLP toolkit
          - Korean
              • Espresso toolkit
  • 26. Cross-lingual Annotation Projection-based Self-Supervision
      • English Open IE
          - Our own implementation of Banko’s method
              • Dataset
                  - The WSJ part of Penn Treebank
                  - By applying a series of heuristics (Banko, 2009)
                  - 1,028,361 instances from 49,208 sentences (9.0% were positive)
              • Model
                  - Conditional Random Fields (CRF)
                      • With lexical and POS tag features
                      • CRF++ toolkit
  • 27. Cross-lingual Annotation Projection-based Self-Supervision
      • Word Alignment
          - Aligned by GIZA++ toolkit
              • In the standard configuration in both directions
              • The bi-directional alignments were joined using the grow-diag-final algorithm
          - Chunk-based Reorganization
              • To reduce the word alignment errors
              • Generating alignments between pairs of base phrase chunks
              • Using a simple greedy algorithm
                  - Based on the overlap score of aligned words between base phrase chunks
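One way to picture the chunk-based reorganization is the sketch below. The slides do not define the overlap score or the greedy procedure in detail, so several choices here are assumptions: the score is taken to be the number of word-alignment links between two chunks, each chunk is matched at most once, and `chunk_of` and `align_chunks` are hypothetical names. The chunk boundaries and links are written by hand.

```python
from collections import Counter
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)
Link = Tuple[int, int]  # (source token index, target token index)


def chunk_of(index: int, chunks: List[Span]) -> Optional[int]:
    """Return the position of the chunk containing a token index, if any."""
    for k, (start, end) in enumerate(chunks):
        if start <= index < end:
            return k
    return None


def align_chunks(
    word_links: List[Link], src_chunks: List[Span], tgt_chunks: List[Span]
) -> List[Tuple[int, int]]:
    """Greedily pair chunks by the number of word links they share."""
    scores: Counter = Counter()
    for s, t in word_links:
        cs, ct = chunk_of(s, src_chunks), chunk_of(t, tgt_chunks)
        if cs is not None and ct is not None:
            scores[(cs, ct)] += 1

    used_src, used_tgt, pairs = set(), set(), []
    # Highest overlap first; ties are broken by chunk position.
    for (cs, ct), _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if cs not in used_src and ct not in used_tgt:
            pairs.append((cs, ct))
            used_src.add(cs)
            used_tgt.add(ct)
    return sorted(pairs)


# Source chunks: "Barack Obama", "was born", "in Honolulu", "Hawaii"
src_chunks = [(0, 2), (2, 4), (4, 6), (7, 8)]
# Target chunks: "beo-rak-o-ba-ma neun", "ha-wa-i ui",
#                "ho-nol-rul-ru e-seo", "tae-eo-nat-da"
tgt_chunks = [(0, 2), (2, 4), (4, 6), (6, 7)]
# Hand-written links; (1, 3) is a deliberately noisy link.
word_links = [(0, 0), (1, 0), (1, 3), (2, 6), (3, 6), (4, 5), (5, 4), (7, 2)]

chunk_pairs = align_chunks(word_links, src_chunks, tgt_chunks)
print(chunk_pairs)
```

In this toy input the noisy link from "Obama" to "ui" is outvoted by the two links that tie "Barack Obama" to its correct target chunk, which is the kind of error the reorganization is meant to reduce.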
  • 28. Cross-lingual Annotation Projection-based Self-Supervision
      • Annotated Dataset
          - English
          - 598,115 instances
              • 169,771 positive instances
      • Projected Dataset
          - Korean
          - 278,730 instances
              • 89,743 positive instances
  • 29. Learning & Extraction
      • Extractor for Korean Open IE
          - Maximum Entropy (ME) model
              • To detect whether or not each given instance is positive
              • Features
                  - Lexical, POS Tag
                  - On the dependency path
              • Maximum Entropy Modeling toolkit
          - Conditional Random Fields (CRF) model
              • To identify the contextual subtext indicating the semantic relationship
              • Features
                  - Lexical, POS Tag
                  - On the dependency path
              • CRF++ toolkit
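The two-stage design above, a detector followed by a sequence labeler, can be sketched as a small pipeline. This is only an illustration of the control flow: `extract_relation` is a hypothetical name, the `REL`/`O` label scheme is an assumption, and the two toy functions stand in for the trained ME and CRF models, which are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence


def extract_relation(
    tokens: Sequence[str],
    is_positive: Callable[[Sequence[str]], bool],
    tag_tokens: Callable[[Sequence[str]], List[str]],
) -> Optional[List[str]]:
    """Stage 1 decides whether a relation exists; stage 2 labels its tokens."""
    if not is_positive(tokens):
        return None  # negative instance: nothing to extract
    labels = tag_tokens(tokens)
    return [tok for tok, label in zip(tokens, labels) if label == "REL"]


# Toy stand-ins for the trained models (assumptions, not real classifiers).
def toy_detector(tokens: Sequence[str]) -> bool:
    return "tae-eo-nat-da" in tokens


def toy_tagger(tokens: Sequence[str]) -> List[str]:
    relation_words = {"e-seo", "tae-eo-nat-da"}
    return ["REL" if tok in relation_words else "O" for tok in tokens]


sentence = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ui",
            "ho-nol-rul-ru", "e-seo", "tae-eo-nat-da"]
relation = extract_relation(sentence, toy_detector, toy_tagger)
print(relation)
```

Running the detector first means the sequence labeler is applied only to instances judged positive.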
  • 31. Evaluation #1
      • Dataset
          - 250 sentences from Korean Wikipedia articles
          - With manually annotated gold standard
              • 1,434 instances
              • 308 positive instances
      • Baseline
          - Heuristic-based System
              • Sejong treebank corpus (Korean)
              • A set of heuristics utilized for the English Open IE system except language-specific rules
  • 32. Evaluation #1
      • Comparison of performances
          Model                     P      R      F
          Heuristic                47.7   20.1   28.3
          Projection               33.6   49.0   39.8
          Heuristic + Projection   41.9   46.4   44.1
  • 36. Evaluation #2
      • Datasets
          - Korean Newswire
              • 302,276 documents
              • 2,565,487 sentences
          - Korean Wikipedia
              • 123,000 articles
              • 1,342,003 sentences
      • Manual Evaluation
          - For four relation types
              • BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF
  • 37. Evaluation #2
      • Evaluation results for four relation types
                         Newswire                       Wikipedia
          Type           Precision   # of extractions   Precision   # of extractions
          Birth Place    65.2        256                69.1        971
          Won Award      57.4        824                63.3        286
          Acquisition    67.0        1112               50.3        143
          Invent Of      53.1        32                 47.6        103
      • 3,727 extractions with a precision of 63.7% for four relation types
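The summary line is consistent with the table if the overall precision is read as a micro-average, that is, each cell's precision weighted by its number of extractions. The slides do not say how the 63.7% was computed, so the check below is a sketch under that assumption; the variable names are illustrative.

```python
# (precision in %, number of extractions) per corpus and relation type,
# copied from the table on slide 37.
results = {
    ("Newswire", "Birth Place"): (65.2, 256),
    ("Newswire", "Won Award"): (57.4, 824),
    ("Newswire", "Acquisition"): (67.0, 1112),
    ("Newswire", "Invent Of"): (53.1, 32),
    ("Wikipedia", "Birth Place"): (69.1, 971),
    ("Wikipedia", "Won Award"): (63.3, 286),
    ("Wikipedia", "Acquisition"): (50.3, 143),
    ("Wikipedia", "Invent Of"): (47.6, 103),
}

total_extractions = sum(count for _, count in results.values())
# Estimated number of correct extractions, summed over all cells.
estimated_correct = sum(p / 100 * count for p, count in results.values())
overall_precision = 100 * estimated_correct / total_extractions

print(total_extractions, round(overall_precision, 1))
```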
  • 38. Evaluation #2
      • Distribution of the errors
          Error Type                  # of errors
          Chunking Error              364 (26.9%)
          Dependency Parsing Error    461 (34.1%)
          Extracting Error            527 (39.0%)
  • 40. Conclusions
      • Summary
          - A Cross-lingual Annotation Projection Approach for Open IE
          - Korean Open IE system developed using an English Open IE system and an English-Korean parallel corpus
          - Our system outperformed the heuristic-based system
          - Our system achieved 63.7% in precision from a large-scale evaluation
      • Ongoing Work
          - Reducing sensitivity to the errors committed by preprocessors
          - Investigating hybrid approaches considering various external knowledge sources
  • 41. Q&A