Named Entity Annotation and Tagging in the Domain of Epizootics
Svitlana Volkova, William Hsu, Doina Caragea
K-State Laboratory for Knowledge Discovery in Databases (KDD)
Kansas State University, Department of Computing and Information Sciences, Manhattan, KS 66506
OVERVIEW
We present an information extraction (IE) application in the domain of animal diseases. Previously, such tasks were performed only for data related to human diseases; in contrast, our task is directly tied to web crawling for retrieving information about infectious animal diseases.

GAZETTEER COLLECTION AND ONTOLOGY CONSTRUCTION
The main purpose of IE using a gazetteer is to retrieve tokens that match at least one term, synonym, or abbreviation among the known animal disease names. We collect prior domain-specific knowledge and, as a result, construct an ontology of animal disease concepts. The extraction technique is based on a pattern-matching approach. The gazetteer is semi-automatically collected from official web portals. Using the initial gazetteer, we enrich the ontology with latent synonymic and causal relations between related concepts.

[Figure: system architecture. A crawler collects documents from the WWW (official reports about animal disease outbreaks, surveillance networks, disease descriptions, fact sheets, etc.) and email sources into a document collection stored in a database. Domain-specific knowledge: a medical ontology containing names of diseases, viruses, animal species, etc., organized in a conceptual hierarchy. Domain-independent knowledge: a location hierarchy containing names of countries, states or provinces, cities, etc., and a canonical date/time representation.]

Gazetteer sources:
1. Disease names and fact sheets from the Iowa State University Center for Food Security and Public Health (CFSPH): http://www.cfsph.iastate.edu/diseaseinfo/animaldiseaseindex.htm
2. World Organisation for Animal Health (OIE) animal disease data: http://www.oie.int/eng/maladies/en_alpha.htm
3. Department for Environment, Food and Rural Affairs, UK (DEFRA): http://www.defra.gov.uk/animalh/diseases/vetsurveillance/az_index.htm
4. United States Department of Agriculture (USDA), Animal and Plant Health Inspection Service: http://www.aphis.usda.gov/animal_health/animal_diseases/
5. Medline Plus, a service of the National Library of Medicine and the National Institutes of Health: http://www.nlm.nih.gov/medlineplus/animaldiseasesandyourhealth.html
6. Wikipedia: http://en.wikipedia.org/wiki/Animal_diseases

THE EFFECT OF ONTOLOGY SIZE AND QUALITY ON THE ACCURACY OF DISEASE EXTRACTION
The data set is sampled from crawled animal disease sources in which the number of occurrences of disease named entities exceeds a predefined threshold. All animal diseases were manually annotated within the data set for later cross-validation.

In the first experiment, the baseline runs use dictionary look-up with and without the capitalization feature (runs 1a, 1b). The next runs add only synonyms (2a, 2b) or only abbreviations (3a, 3b), respectively; the last runs (4a, 4b) combine all of the above features.

In the second experiment, we divide the data into training and test sets. Using the training examples, we learn a model for animal disease name extraction by discovering relations between concepts, and we report accuracy on the test set.

In the third experiment, we compare our approach of learning relations between concepts with the Google Sets method. We report results in terms of precision, recall, and F-measure, and we build learning curves for both methods to show the influence of ontology size and quality on extraction accuracy.

Features used: list look-up features (flexible pattern matching); document-level features (keyword appearance within a predefined window); word-level morphological features.

[Figure: averaged "FMD" extraction performance over 50 web sites. Run 1: 3.60%; Run 2 (only capitalization): 14.00%; Run 3 (only abbreviations + synonyms): 84.36%.]
[Figure: averaged "RVF" extraction performance over 50 web sites. Run 1: 0.20%; Run 2 (only capitalization): 38.02%; Run 3 (only abbreviations + synonyms): 57.52%.]
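The gazetteer described above can be sketched as a small data structure mapping canonical disease names to their synonyms and abbreviations. This is a minimal illustration under assumed entries; the names below are examples, not the project's actual gazetteer.

```python
# Illustrative sketch of a semi-automatically collected gazetteer: each
# canonical disease name maps to its synonyms and abbreviations.
# Entries are examples only, not the project's full gazetteer.
GAZETTEER = {
    "foot and mouth disease": {
        "synonyms": ["aphthous fever", "hoof-and-mouth disease"],
        "abbreviations": ["FMD"],
    },
    "rift valley fever": {
        "synonyms": [],
        "abbreviations": ["RVF"],
    },
}

def expand_terms(gazetteer):
    """Map every surface form (canonical name, synonym, abbreviation)
    back to its canonical disease name for dictionary look-up."""
    surface_to_canonical = {}
    for canonical, entry in gazetteer.items():
        surface_to_canonical[canonical] = canonical
        for syn in entry["synonyms"]:
            surface_to_canonical[syn.lower()] = canonical
        for abbr in entry["abbreviations"]:
            # abbreviations keep their case, since "FMD" != "fmd"
            surface_to_canonical[abbr] = canonical
    return surface_to_canonical

terms = expand_terms(GAZETTEER)
# "FMD" and "aphthous fever" both resolve to "foot and mouth disease"
```

Flattening the ontology into a surface-form table like this is what makes dictionary look-up a single hash lookup per candidate token.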
RELATION DISCOVERY BETWEEN CONCEPTS
Synonymy ("is a kind of" relation, e.g., "Swine influenza" is a kind of "Swine fever").
Example A: "Diseases such as Foot and Mouth Disease, Bovine TB or Johne's Disease have far-reaching potential for major economic impact on cattle producers."
Causal links ("is caused by", e.g., "Ovine epididymitis is caused by Brucella ovis").
Example F: "Bluetongue virus (BTV), a member of the Orbivirus genus within the Reoviridae family, causes Bluetongue disease in livestock (sheep, goats, cattle)."

[Figure: a fragment of the ontology relating diseases (Q fever, Dipylidium infection, Baylisascariasis) to their causal agents (Coxiella burnetii / C. burnetii, Tapeworm, Baylisascaris procyonis / B. procyonis, B. melis, B. transfuga).]

INFORMATION EXTRACTION IN THE DOMAIN OF EPIZOOTICS
The IE task in the domain of epizootics can be defined as the automatic extraction of structured information related to animal diseases from unstructured web documents with varying content. Goal: to extract structured information, with facts and entities related to events, from unstructured or semi-structured sources. The task requires developing several modules for tagging specific entities (animal disease names, species, vaccines, serotypes, locations, dates/times, quantities, etc.) at the document level within a crawled collection of documents.

DICTIONARY LOOK-UP METHOD FOR DISEASE EXTRACTION
Disease Extractor module:
Input: text from a file; the gazetteer.
Output: indices of the first/last character; matched text and its length; canonical disease names; associated synonyms/abbreviations; non-unique/unique diseases.
Example: "The US saw its latest FMD outbreak in Montebello, California in 1929, where 3,600 animals were slaughtered."

Results:

Method A (relation discovery within training data):
Training instances: 429, 773, 955, 1159, 1287, 1442, 1561, 1590, 1619, 1682
Accuracy:           0.964, 0.929, 0.927, 0.925, 0.964, 0.929, 0.927, 0.925, 0.964, 0.929

Method B (relation discovery using Google Sets):
Training instances: 429, 754, 925, 1118, 1238, 1385, 1497, 1524, 1552, 1611
Accuracy:           0.962, 0.961, 0.864, 0.862, 0.962, 0.961, 0.864, 0.862, 0.962, 0.961

Dictionary look-up (max. 429 instances):
Run:      1a, 1b, 2a, 2b, 3a, 3b, 4a, 4b
Accuracy: 0.885, 0.920, 0.886, 0.896, 0.887, 0.922, 0.889, 0.933

[Figure: learning curves for Method A (relation discovery within training data) and Method B (relation discovery using Google Sets); accuracy (0.80–1.00) versus number of ontology concepts (400–1650).]
[Figure: precision, recall, and F-measure compared across Method A, Method B, and dictionary look-up.]
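The dictionary look-up step can be sketched as a scan of the text for gazetteer surface forms, reporting character offsets, the matched text, and the canonical disease name. This is a hypothetical sketch of the Disease Extractor module, not the actual implementation; the two-entry term table is illustrative.

```python
import re

def extract_diseases(text, surface_to_canonical):
    """Dictionary look-up: return (start, end, matched_text, canonical_name)
    for every gazetteer surface form found in the text.
    Illustrative sketch, not the deployed Disease Extractor module."""
    hits = []
    for surface, canonical in surface_to_canonical.items():
        # Match full names case-insensitively; match all-caps
        # abbreviations exactly, so "FMD" does not match "fmd".
        flags = 0 if surface.isupper() else re.IGNORECASE
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text, flags):
            hits.append((m.start(), m.end(), m.group(0), canonical))
    return sorted(hits)

# Assumed toy surface-form table for the example sentence below.
terms = {
    "FMD": "foot and mouth disease",
    "foot and mouth disease": "foot and mouth disease",
}
sentence = ("The US saw its latest FMD outbreak in Montebello, "
            "California in 1929, where 3,600 animals were slaughtered.")
matches = extract_diseases(sentence, terms)
# the abbreviation "FMD" is found and resolved to "foot and mouth disease"
```

The character offsets returned here correspond to the module's first/last-character output; match length is simply `end - start`.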
Run configurations (dictionary look-up):
1a: initial gazetteer only, without capitalization
1b: initial gazetteer, with capitalization
2a: initial gazetteer + synonyms only, without capitalization
2b: initial gazetteer + synonyms only, with capitalization
3a: initial gazetteer + abbreviations only, without capitalization
3b: initial gazetteer + abbreviations only, with capitalization
4a: initial gazetteer + synonyms + abbreviations, without capitalization
4b: initial gazetteer + synonyms + abbreviations, with capitalization

[Figure: recall range per run (1a–4b) across document numbers 0–100.]

CLASSIFICATION-BASED NAMED ENTITY RECOGNITION
Named entity recognition (NER) is a subtask of IE that seeks to locate and classify atomic elements in text into predefined categories, such as:
- disease names (e.g., "foot and mouth disease");
- viruses (e.g., "picornavirus") and serotypes (e.g., "Asia-1");
- species and their quantities (e.g., "sheep", "pigs");
- locations where an outbreak happened, at different levels of granularity (e.g., "United Kingdom", "eastern provinces of Shandong and Jiangsu, China");
- dates in different formats, including special cases (e.g., "last Tuesday", "two months ago");
- organizations that report outbreaks (e.g., "DEFRA", "CDC").

NLP TASKS
[Figure: pipeline of NLP tasks applied to a sample sentence.]
Named entity recognition: "Foot-and-mouth disease[DIS] killed 15 hogs on a farm in Taiwan[LOC]."
Syntactic analysis: "Foot-and-mouth disease [SUBJ] killed [VP] 15 hogs on a farm in Taiwan [PP]."
Extraction: Fact: killed; Disease: foot-and-mouth disease; Species: hog; Quantity: 15; Location: Taiwan.
Co-reference resolution: "Foot-and-mouth disease killed 15 hogs on a farm in Taiwan. The outbreak was reported on 9 June."
Template generation: Event: outbreak; Disease: foot-and-mouth disease; Species: hog; Quantity: 15; Location: Taiwan; Date/Time: 9 June.

FUTURE WORK
The animal disease extraction task is a prerequisite for more advanced content analysis of the unstructured documents in our corpora. We will therefore design an NER-driven system for extracting structured tuples that describe animal disease-related events. The approach extends the shared NER task of identifying persons, organizations, and locations to cover not only disease names but also the constituent entities and attributes of these event tuples: dates and times, quantities with relevant units, and geo-referenced locations. A primary overall objective of the IE task is to support timeline and map-based visualization of events.

ACKNOWLEDGEMENTS
This work is supported through a grant from the U.S. Department of Defense. A collaborative program on IE with faculty at the University of Illinois at Urbana-Champaign (ChengXiang Zhai, Dan Roth, Jiawei Han, and Kevin Chang) and the 2009 Data Sciences Summer Institute (DSSI) on Multimodal Information Access and Synthesis (MIAS) was made possible through the support of DHS/ONR. We appreciate effective discussions with Dr. Chris Callison-Burch, Dr. Mark Dredze, and Dr. Jason Eisner of the Center for Language and Speech Processing, Johns Hopkins University, and with Tim Weninger, Research Fellow, UIUC. We thank John Drouhard and Landon Fowles (KDD Lab, IE Team) for assistance with experiments.
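The template-generation stage sketched in the NLP pipeline above could be prototyped as follows. The tag names (DIS, SPC, QTY, LOC, DATE) and the function are illustrative assumptions for this sketch, not the deployed system.

```python
def fill_event_template(tagged):
    """Assemble an outbreak event tuple from per-category NER output.
    Hypothetical sketch of template generation, not the actual system.
    `tagged` maps assumed NER tag names to extracted surface strings."""
    return {
        "event": "outbreak",
        "disease": tagged.get("DIS"),
        "species": tagged.get("SPC"),
        "quantity": tagged.get("QTY"),
        "location": tagged.get("LOC"),
        "date": tagged.get("DATE"),
    }

# NER output for: "Foot-and-mouth disease killed 15 hogs on a farm in
# Taiwan. The outbreak was reported on 9 June."
tagged = {"DIS": "foot-and-mouth disease", "SPC": "hog",
          "QTY": "15", "LOC": "Taiwan", "DATE": "9 June"}
event = fill_event_template(tagged)
# event carries the Disease/Species/Quantity/Location/Date slots of
# the template; missing slots default to None via dict.get
```

Emitting event tuples in this shape is what would feed the timeline and map-based visualizations named as the overall objective.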
KANSAS STATE UNIVERSITY | KNOWLEDGE DISCOVERY IN DATABASES LABORATORY | NATIONAL AGRICULTURAL BIOSECURITY CENTER @ K-STATE