A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011)
November 10th, 2011, Chiang Mai

Seokhwan Kim (POSTECH)
Minwoo Jeong (Microsoft Bing)
Jonghoon Lee (POSTECH)
Gary Geunbae Lee (POSTECH)

Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions

Information Extraction
• Goal
  - To generate structured information from natural language documents
    • Representing semantic relationships among a set of arguments
• Example
  Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.

  Person      Barack Obama
  Birthday    August 4, 1961
  Birthplace  Honolulu

Previous Approaches
• Many supervised machine learning approaches have been successfully applied to the relation detection and characterization (RDC) task
  - (Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al., 2006)
  - Large amounts of training data are required
• Weakly-supervised techniques have been sought
  - (Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
  - To learn the IE system without significant annotation effort
  - (Banko et al., 2007; Wu and Weld, 2010)

Open Information Extraction
• An alternative weakly-supervised IE paradigm
  - (Banko et al., 2007)
• Problem Definition
  $f: D \rightarrow \{\langle e_i, r_{i,j}, e_j \rangle \mid 1 \le i, j \le N\}$
  - Binary relation extraction between $e_i$ and $e_j$
  - Considering relationships explicitly represented by $r_{i,j}$
• Goal
  - Large-scale IE
    • Domain-independent
    • Relation-independent
  - Without hand-crafted rules or hand-annotated training examples

How to Eliminate Human Supervision
• Self-supervised Learning for Open IE
  - Using automatically obtained training examples
    • From external knowledge
• Previous Systems
  - TextRunner (Banko et al., 2007)
    • Penn Treebank
    • A small set of heuristics about syntactic structural constraints
  - WoE (Wu and Weld, 2010)
    • Wikipedia articles
    • Wikipedia Infoboxes

What’s the Problem?
• Previous approaches mainly depend on language-specific knowledge for English
  - Heuristic-based Approach
    • Syntactic treebank for the target language
    • Heuristics designed for the target language
  - Wikipedia-based Approach
    • Wikipedia articles and infoboxes are available not only for English
    • Differences among languages in the amount of available resources
      - English Wikipedia: 3,500,000 articles
      - Korean Wikipedia: 150,000 articles

Cross-lingual Annotation Projection
• Goal
  - To obtain training examples for the target language LT
• Method
  - To leverage parallel corpora to project annotations from the source language LS onto the target language LT
  - The premise is that parallel corpora between LS and LT are much easier to obtain than a task-specific training dataset for LT

<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
Barack Obama was born in Honolulu , Hawaii .

버락 오바마 는 하와이 의 호놀룰루 에서 태어났다
(beo-rak-o-ba-ma) (neun) (ha-wa-i) (ui) (ho-nol-rul-ru) (e-seo) (tae-eo-nat-da)

<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>

Cross-lingual Annotation Projection
• Previous Work
  - Part-of-speech tagging (Yarowsky and Ngai, 2001)
  - Named-entity tagging (Yarowsky et al., 2001)
  - Verb classification (Merlo et al., 2002)
  - Dependency parsing (Hwa et al., 2005)
  - Mention detection (Zitouni and Florian, 2008)
  - Semantic role labeling (Pado and Lapata, 2009)
• To the best of our knowledge, no previous work has applied annotation projection to the Open IE task

Annotation
• To obtain annotations for the sentences in LS
• Procedure
  - A set of entities in the given sentence is identified
  - Each instance is composed of a pair of entities
  - For each instance, extraction is performed
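
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: entity spans are given as token offsets, instances are unordered entity pairs, and `toy_extract` is a hypothetical stand-in for the English Open IE system (it is not the CRF-based extractor used in this work).

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # [start, end) token offsets of an entity


@dataclass(frozen=True)
class Triple:
    arg1: str
    relation: str
    arg2: str


def annotate(tokens: List[str], entities: List[Span],
             extract: Callable[[List[str], Span, Span], Optional[str]]) -> List[Triple]:
    """Build one instance per entity pair and keep the pairs with a relation."""
    triples = []
    for left, right in combinations(entities, 2):  # assumption: unordered pairs
        relation = extract(tokens, left, right)
        if relation is not None:  # positive instance
            triples.append(Triple(" ".join(tokens[left[0]:left[1]]),
                                  relation,
                                  " ".join(tokens[right[0]:right[1]])))
    return triples


def toy_extract(tokens: List[str], left: Span, right: Span) -> Optional[str]:
    """Hypothetical extractor: accept a short in-between context containing 'born'."""
    between = tokens[left[1]:right[0]]
    if 0 < len(between) <= 3 and "born" in between:
        return " ".join(between)
    return None


tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
entities = [(0, 2), (5, 6), (7, 8)]  # Barack Obama, Honolulu, Hawaii
triples = annotate(tokens, entities, toy_extract)
print(triples)
```

With three entities there are three candidate instances; only the (Barack Obama, Honolulu) pair is kept as positive by the toy extractor.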

Projection
• To project the annotations from the sentences in LS onto the sentences in LT using word alignment information
• Procedure
  - For each instance, the existence of relationship is determined
  - If the instance is positive, the contextual subtext is projected
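
A minimal sketch of this projection step, assuming word alignments are given as (source index, target index) pairs. The alignment links below are written by hand for the running example and are not GIZA++ output; dropping instances whose parts have no aligned target words is also an assumption of this sketch.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]   # [start, end) token offsets in the source sentence
Link = Tuple[int, int]   # (source token index, target token index)


def project_span(span: Span, alignment: List[Link]) -> List[int]:
    """Target token indices aligned to any source token inside the span."""
    start, end = span
    return sorted({t for s, t in alignment if start <= s < end})


def project_instance(arg1: Span, relation: Optional[Span], arg2: Span,
                     alignment: List[Link], tgt_tokens: List[str]):
    """Project one annotated instance; relation=None marks a negative instance."""
    def text(span: Span) -> str:
        return " ".join(tgt_tokens[i] for i in project_span(span, alignment))

    e1, e2 = text(arg1), text(arg2)
    if not e1 or not e2:
        return None              # assumption: drop instances with unaligned entities
    if relation is None:
        return (e1, None, e2)    # negative instance: nothing else to project
    rel = text(relation)
    return (e1, rel, e2) if rel else None


# Source: Barack(0) Obama(1) was(2) born(3) in(4) Honolulu(5) ,(6) Hawaii(7) .(8)
tgt_tokens = "버락 오바마 는 하와이 의 호놀룰루 에서 태어났다".split()
alignment = [(0, 0), (1, 1), (2, 7), (3, 7), (4, 6), (5, 5), (7, 3)]  # hand-written

positive = project_instance((0, 2), (2, 5), (5, 6), alignment, tgt_tokens)
negative = project_instance((5, 6), None, (7, 8), alignment, tgt_tokens)
print(positive)
print(negative)
```

The positive instance reproduces the projected triple of the running example: the relation "was born in" maps to "에서 태어났다" (e-seo tae-eo-nat-da).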

Overall Architecture
[Figure: overall system pipeline. English-Korean Parallel Corpus → Self-Supervision → Korean Annotated Corpus → Learning → Korean Open IE Model; Korean Raw Text → Extraction (with the Korean Open IE Model) → Extracted Results]

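Cross-lingual Annotation Projection-based Self-Supervision
[Figure: self-supervision pipeline. The Parallel Corpus is split into English Sentences and Korean Sentences, which pass through the English Preprocessors and Korean Preprocessors. Annotation: the English Open IE System produces the English Annotated Corpus. Projection: Word Alignment between the two sides is used to project the English Annotated Corpus onto the Korean Annotated Corpus]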

• Dataset
  - English-Korean Parallel Corpus
    • 266,892 English-Korean sentence pairs
• Preprocessors
  - English
    • OpenNLP toolkit
  - Korean
    • Espresso toolkit

• English Open IE
  - Our own implementation of Banko’s method
• Dataset
  - The WSJ part of the Penn Treebank
  - Instances obtained by applying a series of heuristics (Banko, 2009)
  - 1,028,361 instances from 49,208 sentences (9.0% were positive)
• Model
  - Conditional Random Fields (CRF)
    • With lexical and POS tag features
    • CRF++ toolkit

• Word Alignment
  - Aligned by the GIZA++ toolkit
    • In the standard configuration in both directions
    • The bi-directional alignments were joined using the grow-diag-final algorithm
  - Chunk-based Reorganization
    • To reduce the word alignment errors
    • Generating alignments between pairs of base phrase chunks
    • Using a simple greedy algorithm
      - Based on the overlap score of aligned words between base phrase chunks
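
The chunk-based reorganization could look roughly like the sketch below. The slides do not define the overlap score or the greedy procedure in detail, so two assumptions are made here: the score of a chunk pair is the number of word-alignment links between the two chunks, and chunk alignments are one-to-one. The function names and the example links are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Span = Tuple[int, int]   # [start, end) token offsets of a base phrase chunk
Link = Tuple[int, int]   # (source token index, target token index)


def chunk_of(index: int, chunks: List[Span]) -> Optional[int]:
    """Position of the chunk that contains the given token index, if any."""
    for pos, (start, end) in enumerate(chunks):
        if start <= index < end:
            return pos
    return None


def align_chunks(word_links: List[Link], src_chunks: List[Span],
                 tgt_chunks: List[Span]) -> List[Tuple[int, int]]:
    """Greedily pick chunk pairs in order of decreasing overlap score."""
    scores = Counter()
    for s, t in word_links:
        cs, ct = chunk_of(s, src_chunks), chunk_of(t, tgt_chunks)
        if cs is not None and ct is not None:
            scores[(cs, ct)] += 1  # assumed overlap score: number of shared links

    aligned, used_src, used_tgt = [], set(), set()
    for (cs, ct), _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if cs not in used_src and ct not in used_tgt:
            aligned.append((cs, ct))
            used_src.add(cs)
            used_tgt.add(ct)
    return sorted(aligned)


# English chunks: [Barack Obama] [was born] [in Honolulu] [Hawaii]
src_chunks = [(0, 2), (2, 4), (4, 6), (7, 8)]
# Korean chunks: [버락 오바마 는] [하와이 의] [호놀룰루 에서] [태어났다]
tgt_chunks = [(0, 3), (3, 5), (5, 7), (7, 8)]
# Hand-written word links; (3, 2) plays the role of an alignment error.
word_links = [(0, 0), (1, 1), (2, 7), (3, 7), (4, 6), (5, 5), (7, 3), (3, 2)]

chunk_alignment = align_chunks(word_links, src_chunks, tgt_chunks)
print(chunk_alignment)
```

In this example the noisy link (3, 2) is outvoted: the chunk "was born" is already paired with "태어났다" by two links, so the weaker competing pair is discarded.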

• Annotated Dataset
  - English
  - 598,115 instances
    • 169,771 positive instances
• Projected Dataset
  - Korean
  - 278,730 instances
    • 89,743 positive instances

Learning & Extraction
• Extractor for Korean Open IE
  - Maximum Entropy (ME) model
    • To detect whether or not each given instance is positive
    • Features
      - Lexical, POS Tag
      - On the dependency path
    • Maximum Entropy Modeling toolkit
  - Conditional Random Fields (CRF) model
    • To identify the contextual subtext indicating the semantic relationship
    • Features
      - Lexical, POS Tag
      - On the dependency path
    • CRF++ toolkit
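
The two-stage design above (a detector followed by a sequence tagger) can be sketched as below. Both models are replaced by trivial rule-based stand-ins, and the POS tags ("V", "J") and the "REL"/"O" label scheme are hypothetical; the actual system uses a trained ME classifier and a CRF trained with CRF++.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag) of a token on the dependency path


def extract_relation(path: List[Token],
                     is_positive: Callable[[List[Token]], bool],
                     tag: Callable[[List[Token]], List[str]]) -> Optional[str]:
    """Stage 1 decides whether a relation exists; stage 2 marks its words."""
    if not is_positive(path):
        return None
    labels = tag(path)
    words = [word for (word, _pos), label in zip(path, labels) if label == "REL"]
    return " ".join(words) if words else None


def toy_detector(path: List[Token]) -> bool:
    """Stand-in for the ME model: positive if the path contains a verb."""
    return any(pos == "V" for _word, pos in path)


def toy_tagger(path: List[Token]) -> List[str]:
    """Stand-in for the CRF model: label verbs and particles as relation words."""
    return ["REL" if pos in {"V", "J"} else "O" for _word, pos in path]


# Dependency path between 버락 오바마 and 호놀룰루 in the running example (assumed).
path_positive = [("에서", "J"), ("태어났다", "V")]
# Path between 하와이 and 호놀룰루: no verb, so the detector rejects it.
path_negative = [("의", "J")]

relation = extract_relation(path_positive, toy_detector, toy_tagger)
no_relation = extract_relation(path_negative, toy_detector, toy_tagger)
print(relation, no_relation)
```

Separating detection from tagging means the sequence model only has to label instances that the classifier has already accepted as positive.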

Evaluation #1
• Dataset
  - 250 sentences from Korean Wikipedia articles
  - With manually annotated gold standard
    • 1,434 instances
    • 308 positive instances
• Baseline
  - Heuristic-based System
    • Sejong treebank corpus (Korean)
    • The set of heuristics used for the English Open IE system, excluding language-specific rules

Evaluation #1
• Comparison of performances

Model                    P (%)   R (%)   F (%)
Heuristic                47.7    20.1    28.3
Projection               33.6    49.0    39.8
Heuristic + Projection   41.9    46.4    44.1
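
As a quick consistency check, the F column can be recomputed from P and R, assuming that F denotes the balanced F-measure 2PR / (P + R) (the slide does not state the weighting). Because P and R are already rounded to one decimal, the recomputed values may differ from the reported ones in the last digit.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) in percent, copied from the table above.
rows = {
    "Heuristic": (47.7, 20.1, 28.3),
    "Projection": (33.6, 49.0, 39.8),
    "Heuristic + Projection": (41.9, 46.4, 44.1),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _f) in rows.items()}
for name, (_p, _r, reported) in rows.items():
    print(f"{name}: reported F = {reported}, recomputed F = {recomputed[name]:.2f}")
```

The recomputed values agree with the reported F scores to within 0.1.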

Evaluation #2
• Datasets
  - Korean Newswire
    • 302,276 documents
    • 2,565,487 sentences
  - Korean Wikipedia
    • 123,000 articles
    • 1,342,003 sentences
• Manual Evaluation
  - For four relation types
    • BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF

Evaluation #2
• Evaluation results for four relation types

              Newswire                           Wikipedia
Type          Precision (%)   # of extractions   Precision (%)   # of extractions
Birth Place   65.2            256                69.1            971
Won Award     57.4            824                63.3            286
Acquisition   67.0            1112               50.3            143
Invent Of     53.1            32                 47.6            103

3,727 extractions with a precision of 63.7% for four relation types
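
The summary line can be reproduced from the table, assuming the overall precision is the extraction-weighted average of the per-cell precisions (the slide does not say how it was aggregated).

```python
# (corpus, relation type) -> (precision in percent, number of extractions)
results = {
    ("Newswire", "Birth Place"): (65.2, 256),
    ("Newswire", "Won Award"): (57.4, 824),
    ("Newswire", "Acquisition"): (67.0, 1112),
    ("Newswire", "Invent Of"): (53.1, 32),
    ("Wikipedia", "Birth Place"): (69.1, 971),
    ("Wikipedia", "Won Award"): (63.3, 286),
    ("Wikipedia", "Acquisition"): (50.3, 143),
    ("Wikipedia", "Invent Of"): (47.6, 103),
}

total_extractions = sum(count for _precision, count in results.values())
# Estimated number of correct extractions per cell, summed over all cells.
estimated_correct = sum(precision / 100 * count
                        for precision, count in results.values())
overall_precision = 100 * estimated_correct / total_extractions

print(total_extractions, round(overall_precision, 1))
```

Under this assumption the eight cells sum to the reported 3,727 extractions, and the weighted average rounds to the reported 63.7%.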

Evaluation #2
• Distribution of the errors

Error Type                 # of errors
Chunking Error             364 (26.9%)
Dependency Parsing Error   461 (34.1%)
Extracting Error           527 (39.0%)
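
The percentages are each error type's share of all errors. The short check below also relates the error total to the 3,727 extractions reported on the previous slide, assuming that every incorrect extraction is counted exactly once in this table.

```python
errors = {
    "Chunking Error": 364,
    "Dependency Parsing Error": 461,
    "Extracting Error": 527,
}

total_errors = sum(errors.values())
shares = {name: round(100 * count / total_errors, 1)
          for name, count in errors.items()}

total_extractions = 3727  # from the preceding results slide
implied_precision = 100 * (total_extractions - total_errors) / total_extractions

print(total_errors, shares, round(implied_precision, 1))
```

Under that assumption the error counts are consistent with the overall precision of 63.7%.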

Conclusions
• Summary
  - A Cross-lingual Annotation Projection Approach for Open IE
  - A Korean Open IE system was developed using an English Open IE system and an English-Korean parallel corpus
  - Our system outperformed the heuristic-based system (F-measure of 39.8 vs. 28.3)
  - Our system achieved a precision of 63.7% in a large-scale evaluation
• Ongoing Work
  - Reducing sensitivity to the errors committed by preprocessors
  - Investigating hybrid approaches considering various external knowledge sources
