SlideShare uma empresa Scribd logo
1 de 89
Laboratory for Knowledge Discovery in Databases Entity Extraction, Animal Disease-related Event Recognition and Classification from Web Presenter: Svitlana Volkova  Adviser: William H. Hsu Committee: Dr. Doina Caragea, Dr. Gurdip Singh  Supported by: K-State National Agricultural Biosecurity Center (NABC), US Department of Defense
Agenda Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Background & Motivation Related Work ,[object Object]
Document Classification
Entity and Relation Extraction
Event RecognitionFramework for Epidemiological Analytics Disease-related Document Classification Domain-specific Entity Extraction ,[object Object]
Sequence Labeling using Syntactic FeaturesDisease-related Event Recognition  Summary & Future Work
Motivation influence on the travel and trade cause economic crises, political instability diseases, zoonotic in type can cause loss of life Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Animal Disease Monitoring Systems: ManuallySupportedWeb Interfaces (1) International: World Animal Health Information Database (WAHID) Interface - http://www.oie.int/wahis/public.php?page=home WHO Global Atlas of Infectious Diseases - http://diseasemaps.usgs.gov/index.htm Emergency Prevention System (EMPRES) for Transboundary Animal and Plant Pests and Diseases - http://www.fao.org/EMPRES/default.html Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Animal Disease Monitoring Systems: ManuallySupportedWeb Interfaces(2) USA Centers for Disease Control and Prevention (CDC) - http://www.cdc.gov U.S. Department of Agriculture (USDA) - http://www.usda.gov/wps/portal/usdahome U.S. Geological Survey (USGS) and U.S. Geological Survey (USGS) National Wildlife Health Center (NWHC) - http://www.nwhc.usgs.gov Iowa State University Center for Food Security and Public Health (CFSPH) - http://www.cfsph.iastate.edu BioPortal - http://biocomputingcorp.com/bpsystem.html FMD BioPortal - https://fmdbioportal.ucdavis.edu United Kingdom Department for Environment Food and Rural Affairs (DEFRA) - http://www.defra.gov.uk Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Animal Disease Monitoring Systems: Automated Web Services (1) BioCaster - http://biocaster.nii.ac.jp/ ,[object Object]
classifies documents as topically relevant or not
taxonomy of 4300 named entities (50 disease names, 243 country names, 4025 province/city names, latitudes and longitudes)
identifies 40 diseases at up to 25-30 locations per day
multilingual information extraction on to English, French, Spanish, Chinese, Thai, Vietnamese, Japanese
uses ontology pattern matching approaches to recognize disease-location-verb pairs
plots events on a Google Map
does not classify events into categories and does not report past outbreaks
no timeline visualizationThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
BioCaster -http://biocaster.nii.ac.jp/ Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Animal Disease Monitoring Systems: Automated Web Services (2) Information retrieval system MedISys  -  http://medusa.jrc.it/medisys/homeedition/all/home.html Pattern-based Understanding and Learning System (PULS) - http://sysdb.cs.helsinki.fi/puls/jrc/all ,[object Object]
collects an average 50000 news articles per day from about 1400 news portals and about 150 specialized Public Health sites
43 languages
current ontology contains 2400 disease names, 400 organisms, 1500 political entities and over 70000 location names including towns, cities, provinces
real-time news clustering and filtering by matching 3000 patterns
does not classify events and does not report past outbreaks.Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
MedISys - http://medusa.jrc.it/medisys/homeedition/all/home.html *part of the Europe Media Monitor (EMM) product family http://emm.jrc.it/overview.html Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Pattern-based Understanding and Learning System (PULS) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Animal Disease Monitoring Systems: Automated Web Services (3) HealthMap - http://healthmap.org/en ,[object Object]
2300 locations and 1100 disease names
identifies between 20-30 outbreaks per day
multiple languages English, Russian, Arabic, French, Portuguese, Spanish, Chinese
manually supported systemEpiSpider- http://www.epispider.org/ ,[object Object]
ProMED-Mail - www.promedmail.org
The Global Disaster Alert Coordinating System (GDACS) - www.gdacs.org
Central Intelligence Agency (CIA) Factbook - https://www.cia.gov/library/publications/the-world-factbook/
TheUnited Nations Human Development Report sites - http://hdr.undp.org/enThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
HealthMap - http://healthmap.org/en Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
ProMED-Mail -www.promedmail.org Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
EpiSpider - http://www.epispider.org/ Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Animal Disease-related Data Online Structured Data Unstructured Data Official reports by different organizations: state and federal laboratories, bioportals;  health care providers; governmental agricultural or environmental agencies. Web-pages News E-mails (e.g., ProMed-Mail) Blogs Medical literature (e.g., books) Scientific papers (e.g., PubMed) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Problem Statement Suppose we have a document collection D with documents collected from different domains C: news, web pages, scientific papers, medical literature, e-mails. We classify documents into two classes: disease-related documents DR; disease non-related document DNR. We extract a set of events E from every document di in DR for every domain cj.  For every event ek in E we extract a set of domain-specific and domain-independent entities: disease, species, location, date, event status. We classify recognized events from E into: two classes – suspected or confirmed; three classes – susceptible, infected or recovered. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Research Tasks Classification of the disease-related documents collected from different domains Domain-specific entity extraction  ,[object Object],Automated animal disease-related event recognition and classification from unstructured web data Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Related Work: Text Categorization Supervised learning  training and testing datasets; lots of labeled data; Unsupervised Learning clustering; only unlabeled data; Semi-supervised Learning small amount of labeled data and a lot of unlabeled data. Feature Representation - “bag-of-words”    Binary       Term Frequency	   TF-IDF 	Word bigrams, “bag-of-concepts” Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Classification Algorithms Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Lazy IB1, IBk, KStar Meta AdaBoost Trees J48, RandomForest Rules ZeroR Bayes Naive Bayes, Naive Bayes Multinomial Functions Logistic, Multilayer Perceptron, RBFNetwork
Related Work: Entity Extraction Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Gazetteers and Regular expressions Limitations: dictionary look-up methods achieve high precision, but low recall limited to the size of the dictionary Hidden Markov Models (HMM) Conditional Random Fields (CRF) NER Systems Stanford NER System CMU Lemur Toolkit Open NLP Ontology-based Biomedical Entity Extraction BioCaster: 50 animal disease names + SVM PULS: 2400 human diseases and 1100 animal diseases
Related Work: Relation Extraction Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Related Work: Event Recognition Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 BioCaster for disease-location pairs and calculating their frequency in the document and in the collection PULS allows extracting metadata and structured facts related to animal disease outbreaks using pattern matching approach
Summary Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Overview of the existing automated and manually supported systems for monitoring animal disease outbreaks. Approaches for text categorization: supervised, unsupervised and semi-supervised learning and different feature representations: “bag-of-words”, terms frequency, binary features, word bigrams, classification algorithms: lazy learners, decision trees, Naïve Bayes. Entity extraction approaches: gazetteers, regular expressions, Hidden Markov Models  and Conditional random Fields; existing NER systems; ontology-based biomedical entity extraction. Relation extraction for automated ontology construction works. Animal disease-related event recognition methods.
Framework for Epidemiological Analytics Framework for Epidemiological Analytics Main Functional Components Data Collection (Document Relevance Classification) -> Data Sharing -> Search -> Data Analysis (Entity Extraction and Event Recognition) -> Visualization
Existing Systems vs. Designed System (1) Existing Systems Designed System Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Existing Systems vs. Designed System (2) Existing Systems Designed System Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
System Architecture Crawled Documents Search Interface 1. Entity Extraction     Component  2. Temporal Extraction Component Extracted Entities 3. Spatial      Extraction Component  Extracted EventTuples 4. Event Recognition     Component Data Storage Web Server TimeLine/Map View Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Main System Components Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
1. Data Collection (1) Periodically crawl the web using Heritrix crawler - http://crawler.archive.org/ set of seeds (ProMED-Mail, DEFRA etc.) set of terms (animal disease names from the ontology)  Text-to-tag ratio-based method for content extraction from web pages Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
1. Data Collection (2) WWW Email Crawler DB Document Collection Query Literature Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
2. Data Sharing Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Document relevance classification using Naive Bayes Classifier from Mallet - http://mallet.cs.umass.edu Relevant Non-relevant
3. Search Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Lucene-based* ranking Query-based keyword search Search by animal disease name and/or location *Lucene - http://lucene.apache.org
4. Data Analysis Event example: “On 12 September 2007, a new foot-and-mouth disease outbreak was confirmed in Egham, Surrey” Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
5. Visualization Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Map View GoogleMaps API - http://code.google.com/apis/maps/ TimeLine View SIMILE API - http://www.simile-widgets.org/timeline/
Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Summary Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Compare our system to other systems BioCaster, MedISys/PULS, HealthMap  System Architecture Data Collection -> Data Sharing -> Search -> Data Analysis ->  Visualization Main System Components Entity Extraction Component domain-specific entities (disease names, their synonyms, abbreviations, corresponding viruses and disease serotypes, species)  domain-independent entities  (locations and dates) Event Recognition Component
Disease-related Document Classification Binary Classification using Supervised Learning Feature Representations: “Bag-of-words”, TF, Bigrams Classifiers:  Naïve Bayes, MaxEntropy,  J48
SupervisedLearningFramework New Documents DTest Feature Representation  R1 … Feature Representation  Rn Learned Model  M1 … Learned Model Mk Crawled Documents DTrain Classifier Disease Related - DR  (processed to the next phases) Disease Non-related – DNR (eliminated from the index) Feature Representations: R1 – “bag-of-words” binary, |R1|=28908 R2 – “bag-of-words”  term frequency, |R2|=28908 R3 – “bag-of-words”  bigrams, |R3|=99108 R4 –  noun and verb keywords represented as binary counts, |R4|=2 R5 –  noun and verb keywords normalized frequency, |R5|=2 Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Experiment ADisease-related Document Classification ~1500 crawled documents Foot-and-mouth disease (FMD) Rift valley fever (RVF) Focused Crawl Terms [foot and mouth disease, FMD, rift valley fever, RVF] After labeling - 813 related and 752 non-related docs Testing with 10-fold cross validation  + OR - Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Classification Results: Precision, Recall,F-Measure, Area Under Curve Binary Features Noun and Verb Frequency Features Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Classification Results: Accuracy Binary Features All unigrams, all bigrams and all term frequency features Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Summary (1) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 “Bag-of-words” representation gives higher accuracy; Generative approaches give the highest accuracy:  Naïve Bayes together with comprehensive feature representation R3 using bigram as features – 0.97; MaxEnt classier using unigram “bag-of words” representation R2 – 0.96; MaxEnt classier using comprehensive binary counts as feature representation R1 – 0.94. Normalized term frequency is much better than just binary features.
Summary (2) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction in the Domain of Veterinary Medicine (1) Ontology-based Entity Extraction Automated Ontology Construction
Domain Meta-data Domain-independent knowledge Domain-specific knowledge Location hierarchy names of countries, states, cities; Time hierarchy canonical dates. Medical ontology diseases, serotypes, and viruses. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Manually-constructedInitial Ontology |OINIT|=429	|OS|=581	 |OA|=581	 |OS+A|=605 1. Disease names and fact sheets from Iowa State University Center for Food Security and Public Health (CFSPH):  ,[object Object],2. Word Organization of Animal Health (OIE) Animal Disease Data: ,[object Object],3. Department for Environmental Food and Rural Affairs, UK (DEFRA): ,[object Object],4. Wikipedia ,[object Object],[object Object]
Algorithm for Semantic Relation Discovery and Automated Ontology Expansion Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Experiment BOntology-based Entity Extraction ,[object Object]
100 manually labeled document for entity extractionThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction Results: ROC Curves Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction Results: Learning Curves |OG|=754..1238  |OR|=772..1287 Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction in the Domain of Veterinary Medicine (2) Sequence Labeling using Syntactic Features  with Sliding Window
Syntactic Feature Extraction POS tag numeric word-level feature Capitalization binary word-level feature Capitalization inside binary word-level feature for identifying abbreviations Position in the sentence numeric document-level feature Position in the document numeric document-level feature Frequency numeric document-level feature Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Sequence Labeling Approach Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
An Example of Syntactic Feature Extraction “Severe disease in dairy cattle caused by Salmonella Newport” POS= [NNP, IN, NNS, VBN, …] = [2, 0, 2, 5, …] Xi = [POSi, CAPi, ICAPi, SPOSi, DPOSi, FREQi] Xi-3 = [2, 0, 0, 5, 5, 1] Xi = [2, 1, 0, 8, 8, 1] Xi-2 = [5, 0, 0, 6, 6, 1] Newport Xi-1 = [0, 0, 0, 7, 7, 1] … … wi wi+1 wi+2 wi-3 wi-1 wi+3 wi-2 cattle  caused  by  Salmonella  Xi+1 = [2, 1, 0, 9, 9, 1] Xi+2 = [-1, -1, -1, -1, -1, -1] Fi = [Xi, Xi-1, Xi-2, Xi-3, Xi+1, Xi+2, Xi+3], w = 3 Class = {0, 1} Xi+3 = [-1, -1, -1, -1, -1, -1] Fi = [2, 1, 0, 8, 8, 1, 0, 0, 0, 7, 7, 1, 5, 0, 0, 6, 6, 1, 2, 0, 0, 5, 5, 1,         2, 1, 0, 9, 9, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], Class = [1] Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Experiment CSequence Labeling using Syntactic Features 100 manually labeled documents from Experiment B Number of disease names is more that 5 per document Keep capitalization Remove stop words 202977 examples in the dataset 80% for training (approx. 160000 examples) 20% for testing (approx. 40000 examples) Results are averaged over 3 runs We do not report accuracy because the data set is unbalanced (approx. 8570 positive examples vs. approx. 194430 negative examples) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction Results: F-measure (1) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction Results: F-measure (2) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction Results: Precision, Recall, AUC (1) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Entity Extraction Results: Precision, Recall, AUC (2) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Summary BioCaster named entity recognition system 200 news articles  F-score – 0.769 for all named entity classes SVM and feature window -2/+1 including surface word,  orthography, biomedical prefixes/suffixes, lemma, head noun etc. DNA, RNA, cell type extraction SVM and orthographic features F-score – 0.799 during the identification phase and 66.5 during the classification phase; Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Disease-related Event Recognition and Classification (1) Sentence-based Event Recognition and Classification
Animal Disease-related Event Types Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010

Mais conteúdo relacionado

Semelhante a Master Thesis

Improving Disease Surveillance in the United States Using Companion Animal Data
Improving Disease Surveillance in the United States Using Companion Animal DataImproving Disease Surveillance in the United States Using Companion Animal Data
Improving Disease Surveillance in the United States Using Companion Animal DataPamela Okerholm
 
RIFF - A Social Network and Collaborative Platform For Public Health Disease ...
RIFF - A Social Network and Collaborative Platform For Public Health Disease ...RIFF - A Social Network and Collaborative Platform For Public Health Disease ...
RIFF - A Social Network and Collaborative Platform For Public Health Disease ...InSTEDD
 
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...Taha Kass-Hout, MD, MS
 
Exploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease InformaticsExploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease InformaticsNigel Collier
 
Internet2 and Public Health Surveillance
Internet2 and Public Health SurveillanceInternet2 and Public Health Surveillance
Internet2 and Public Health SurveillanceTaha Kass-Hout, MD, MS
 
Genomics, Cellular Networks, Preventive Medicine, and Society
Genomics, Cellular Networks, Preventive Medicine, and SocietyGenomics, Cellular Networks, Preventive Medicine, and Society
Genomics, Cellular Networks, Preventive Medicine, and SocietyLarry Smarr
 
Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...
Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...
Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...Larry Smarr
 
Exploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease InformaticsExploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease InformaticsNigel Collier
 
Evolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response SystemEvolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response SystemTaha Kass-Hout, MD, MS
 
Using the internet to search for clinical trial information
Using the internet to search for clinical trial informationUsing the internet to search for clinical trial information
Using the internet to search for clinical trial informationEURORDIS - Rare Diseases Europe
 
HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...
HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...
HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...Forum One
 
Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...Fondazione Giannino Bassetti
 
Multimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalMultimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalSvitlana volkova
 
A generic model for disease outbreak
A generic model for disease outbreakA generic model for disease outbreak
A generic model for disease outbreakijcsit
 

Semelhante a Master Thesis (20)

MS Thesis Short
MS Thesis ShortMS Thesis Short
MS Thesis Short
 
Improving Disease Surveillance in the United States Using Companion Animal Data
Improving Disease Surveillance in the United States Using Companion Animal DataImproving Disease Surveillance in the United States Using Companion Animal Data
Improving Disease Surveillance in the United States Using Companion Animal Data
 
Summit2013 ho-jin choi - summit2013
Summit2013   ho-jin choi - summit2013Summit2013   ho-jin choi - summit2013
Summit2013 ho-jin choi - summit2013
 
RIFF - A Social Network and Collaborative Platform For Public Health Disease ...
RIFF - A Social Network and Collaborative Platform For Public Health Disease ...RIFF - A Social Network and Collaborative Platform For Public Health Disease ...
RIFF - A Social Network and Collaborative Platform For Public Health Disease ...
 
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
 
Exploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease InformaticsExploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease Informatics
 
Web Intelligence 2010
Web Intelligence 2010Web Intelligence 2010
Web Intelligence 2010
 
Environmental Public Health Tracking Network
Environmental Public Health Tracking NetworkEnvironmental Public Health Tracking Network
Environmental Public Health Tracking Network
 
Internet2 and Public Health Surveillance
Internet2 and Public Health SurveillanceInternet2 and Public Health Surveillance
Internet2 and Public Health Surveillance
 
Genomics, Cellular Networks, Preventive Medicine, and Society
Genomics, Cellular Networks, Preventive Medicine, and SocietyGenomics, Cellular Networks, Preventive Medicine, and Society
Genomics, Cellular Networks, Preventive Medicine, and Society
 
Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...
Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...
Genomics in Society: Genomics, Cellular Networks, Preventive Medicine, and So...
 
Exploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease InformaticsExploiting NLP for Digital Disease Informatics
Exploiting NLP for Digital Disease Informatics
 
Evolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response SystemEvolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response System
 
Using the internet to search for clinical trial information
Using the internet to search for clinical trial informationUsing the internet to search for clinical trial information
Using the internet to search for clinical trial information
 
Artificial Intelligence in Medicine
Artificial Intelligence in MedicineArtificial Intelligence in Medicine
Artificial Intelligence in Medicine
 
HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...
HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...
HealthMap.org: Aggregation of Online Media Reports for Global Infectious Dise...
 
Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...Big data and machine learning: opportunità per la medicina di precisione e i ...
Big data and machine learning: opportunità per la medicina di precisione e i ...
 
Evs ppt
Evs pptEvs ppt
Evs ppt
 
Multimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalMultimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location Retrieval
 
A generic model for disease outbreak
A generic model for disease outbreakA generic model for disease outbreak
A generic model for disease outbreak
 

Mais de Svitlana volkova

Mais de Svitlana volkova (13)

EACL'12 Poster
EACL'12 PosterEACL'12 Poster
EACL'12 Poster
 
Grace Hopper Celebration 2010
Grace Hopper Celebration 2010Grace Hopper Celebration 2010
Grace Hopper Celebration 2010
 
Multilingual Ner Using Wiki
Multilingual Ner Using WikiMultilingual Ner Using Wiki
Multilingual Ner Using Wiki
 
WiML Poster
WiML PosterWiML Poster
WiML Poster
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)
 
Social Networks
Social NetworksSocial Networks
Social Networks
 
Methods Of Reliability Analysis
Methods Of Reliability AnalysisMethods Of Reliability Analysis
Methods Of Reliability Analysis
 
Ohio Project
Ohio ProjectOhio Project
Ohio Project
 
Ukraine Presentation
Ukraine PresentationUkraine Presentation
Ukraine Presentation
 
Ukraine Presentation at Kansas State University
Ukraine Presentation at Kansas State UniversityUkraine Presentation at Kansas State University
Ukraine Presentation at Kansas State University
 
Communicatons Fulbright
Communicatons FulbrightCommunicatons Fulbright
Communicatons Fulbright
 
Communications Ternopil
Communications TernopilCommunications Ternopil
Communications Ternopil
 

Último

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 

Último (20)

Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 

Master Thesis

  • 1. Laboratory for Knowledge Discovery in Databases Entity Extraction, Animal Disease-related Event Recognition and Classification from Web Presenter: Svitlana Volkova Adviser: William H. Hsu Committee: Dr. Doina Caragea, Dr. Gurdip Singh Supported by: K-State National Agricultural Biosecurity Center (NABC), US Department of Defense
  • 2.
  • 5.
  • 6. Sequence Labeling using Syntactic FeaturesDisease-related Event Recognition Summary & Future Work
  • 7. Motivation influence on the travel and trade cause economic crises, political instability diseases, zoonotic in type can cause loss of life Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 8. Animal Disease Monitoring Systems: ManuallySupportedWeb Interfaces (1) International: World Animal Health Information Database (WAHID) Interface - http://www.oie.int/wahis/public.php?page=home WHO Global Atlas of Infectious Diseases - http://diseasemaps.usgs.gov/index.htm Emergency Prevention System (EMPRES) for Transboundary Animal and Plant Pests and Diseases - http://www.fao.org/EMPRES/default.html Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 9. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 10. Animal Disease Monitoring Systems: ManuallySupportedWeb Interfaces(2) USA Centers for Disease Control and Prevention (CDC) - http://www.cdc.gov U.S. Department of Agriculture (USDA) - http://www.usda.gov/wps/portal/usdahome U.S. Geological Survey (USGS) and U.S. Geological Survey (USGS) National Wildlife Health Center (NWHC) - http://www.nwhc.usgs.gov Iowa State University Center for Food Security and Public Health (CFSPH) - http://www.cfsph.iastate.edu BioPortal - http://biocomputingcorp.com/bpsystem.html FMD BioPortal - https://fmdbioportal.ucdavis.edu United Kingdom Department for Environment Food and Rural Affairs (DEFRA) - http://www.defra.gov.uk Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 11.
  • 12. classifies documents as topically relevant or not
  • 13. taxonomy of 4300 named entities (50 disease names, 243 country names, 4025 province/city names, latitudes and longitudes)
  • 14. identifies 40 diseases at up to 25-30 locations per day
  • 15. multilingual information extraction on to English, French, Spanish, Chinese, Thai, Vietnamese, Japanese
  • 16. uses ontology pattern matching approaches to recognize disease-location-verb pairs
  • 17. plots events on a Google Map
  • 18. does not classify events into categories and does not report past outbreaks
  • 19. no timeline visualizationThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 20. BioCaster -http://biocaster.nii.ac.jp/ Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 21.
  • 22. collects an average 50000 news articles per day from about 1400 news portals and about 150 specialized Public Health sites
  • 24. current ontology contains 2400 disease names, 400 organisms, 1500 political entities and over 70000 location names including towns, cities, provinces
  • 25. real-time news clustering and filtering by matching 3000 patterns
  • 26. does not classify events and does not report past outbreaks.Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 27. MedISys - http://medusa.jrc.it/medisys/homeedition/all/home.html *part of the Europe Media Monitor (EMM) product family http://emm.jrc.it/overview.html Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 28. Pattern-based Understanding and Learning System (PULS) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 29.
  • 30. 2300 locations and 1100 disease names
  • 31. identifies between 20-30 outbreaks per day
  • 32. multiple languages English, Russian, Arabic, French, Portuguese, Spanish, Chinese
  • 33.
  • 35. The Global Disaster Alert Coordinating System (GDACS) - www.gdacs.org
  • 36. Central Intelligence Agency (CIA) Factbook - https://www.cia.gov/library/publications/the-world-factbook/
  • 37. TheUnited Nations Human Development Report sites - http://hdr.undp.org/enThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 38. HealthMap - http://healthmap.org/en Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 39. ProMED-Mail -www.promedmail.org Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 40. EpiSpider - http://www.epispider.org/ Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 41. Animal Disease-related Data Online Structured Data Unstructured Data Official reports by different organizations: state and federal laboratories, bioportals; health care providers; governmental agricultural or environmental agencies. Web-pages News E-mails (e.g., ProMed-Mail) Blogs Medical literature (e.g., books) Scientific papers (e.g., PubMed) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 42. Problem Statement Suppose we have a document collection D with documents collected from different domains C: news, web pages, scientific papers, medical literature, e-mails. We classify documents into two classes: disease-related documents DR; disease non-related document DNR. We extract a set of events E from every document di in DR for every domain cj. For every event ek in E we extract a set of domain-specific and domain-independent entities: disease, species, location, date, event status. We classify recognized events from E into: two classes – suspected or confirmed; three classes – susceptible, infected or recovered. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 43.
  • 44. Related Work: Text Categorization Supervised learning training and testing datasets; lots of labeled data; Unsupervised Learning clustering; only unlabeled data; Semi-supervised Learning small amount of labeled data and a lot of unlabeled data. Feature Representation - “bag-of-words” Binary Term Frequency TF-IDF Word bigrams, “bag-of-concepts” Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 45. Classification Algorithms Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Lazy IB1, IBk, KStar Meta AdaBoost Trees J48, RandomForest Rules ZeroR Bayes Naive Bayes, Naive Bayes Multinomial Functions Logistic, Multilayer Perceptron, RBFNetwork
  • 46. Related Work: Entity Extraction Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Gazetteers and Regular expressions Limitations: dictionary look-up methods achieve high precision, but low recall limited to the size of the dictionary Hidden Markov Models (HMM) Conditional Random Fields (CRF) NER Systems Stanford NER System CMU Lemur Toolkit Open NLP Ontology-based Biomedical Entity Extraction BioCaster: 50 animal disease names + SVM PULS: 2400 human diseases and 1100 animal diseases
  • 47. Related Work: Relation Extraction Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 48. Related Work: Event Recognition Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 BioCaster for disease-location pairs and calculating their frequency in the document and in the collection PULS allows extracting metadata and structured facts related to animal disease outbreaks using pattern matching approach
  • 49. Summary Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Overview of the existing automated and manually supported systems for monitoring animal disease outbreaks. Approaches for text categorization: supervised, unsupervised and semi-supervised learning and different feature representations: “bag-of-words”, terms frequency, binary features, word bigrams, classification algorithms: lazy learners, decision trees, Naïve Bayes. Entity extraction approaches: gazetteers, regular expressions, Hidden Markov Models and Conditional random Fields; existing NER systems; ontology-based biomedical entity extraction. Relation extraction for automated ontology construction works. Animal disease-related event recognition methods.
  • 50. Framework for Epidemiological Analytics Framework for Epidemiological Analytics Main Functional Components Data Collection (Document Relevance Classification) -> Data Sharing -> Search -> Data Analysis (Entity Extraction and Event Recognition) -> Visualization
  • 51. Existing Systems vs. Designed System (1) Existing Systems Designed System Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 52. Existing Systems vs. Designed System (2) Existing Systems Designed System Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 53. System Architecture Crawled Documents Search Interface 1. Entity Extraction Component 2. Temporal Extraction Component Extracted Entities 3. Spatial Extraction Component Extracted EventTuples 4. Event Recognition Component Data Storage Web Server TimeLine/Map View Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 54. Main System Components Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 55. 1. Data Collection (1) Periodically crawl the web using Heritrix crawler - http://crawler.archive.org/ set of seeds (ProMED-Mail, DEFRA etc.) set of terms (animal disease names from the ontology) Text-to-tag ratio-based method for content extraction from web pages Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 56. 1. Data Collection (2) WWW Email Crawler DB Document Collection Query Literature Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 57. 2. Data Sharing Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Document relevance classification using Naive Bayes Classifier from Mallet - http://mallet.cs.umass.edu Relevant Non-relevant
  • 58. 3. Search Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Lucene-based* ranking Query-based keyword search Search by animal disease name and/or location *Lucene - http://lucene.apache.org
  • 59. 4. Data Analysis Event example: “On 12 September 2007, a new foot-and-mouth disease outbreak was confirmed in Egham, Surrey” Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 60. 5. Visualization Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Map View GoogleMaps API - http://code.google.com/apis/maps/ TimeLine View SIMILE API - http://www.simile-widgets.org/timeline/
  • 61. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 62. Summary Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Compare our system to other systems BioCaster, MedISys/PULS, HealthMap System Architecture Data Collection -> Data Sharing -> Search -> Data Analysis -> Visualization Main System Components Entity Extraction Component domain-specific entities (disease names, their synonyms, abbreviations, corresponding viruses and disease serotypes, species) domain-independent entities (locations and dates) Event Recognition Component
  • 63. Disease-related Document Classification Binary Classification using Supervised Learning Feature Representations: “Bag-of-words”, TF, Bigrams Classifiers: Naïve Bayes, MaxEntropy, J48
  • 64. SupervisedLearningFramework New Documents DTest Feature Representation R1 … Feature Representation Rn Learned Model M1 … Learned Model Mk Crawled Documents DTrain Classifier Disease Related - DR (processed to the next phases) Disease Non-related – DNR (eliminated from the index) Feature Representations: R1 – “bag-of-words” binary, |R1|=28908 R2 – “bag-of-words” term frequency, |R2|=28908 R3 – “bag-of-words” bigrams, |R3|=99108 R4 – noun and verb keywords represented as binary counts, |R4|=2 R5 – noun and verb keywords normalized frequency, |R5|=2 Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 65. Experiment ADisease-related Document Classification ~1500 crawled documents Foot-and-mouth disease (FMD) Rift valley fever (RVF) Focused Crawl Terms [foot and mouth disease, FMD, rift valley fever, RVF] After labeling - 813 related and 752 non-related docs Testing with 10-fold cross validation + OR - Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 66. Classification Results: Precision, Recall,F-Measure, Area Under Curve Binary Features Noun and Verb Frequency Features Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 67. Classification Results: Accuracy Binary Features All unigrams, all bigrams and all term frequency features Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 68. Summary (1) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 “Bag-of-words” representation gives higher accuracy; Generative approaches give the highest accuracy: Naïve Bayes together with comprehensive feature representation R3 using bigram as features – 0.97; MaxEnt classier using unigram “bag-of words” representation R2 – 0.96; MaxEnt classier using comprehensive binary counts as feature representation R1 – 0.94. Normalized term frequency is much better than just binary features.
  • 69. Summary (2) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 70. Entity Extraction in the Domain of Veterinary Medicine (1) Ontology-based Entity Extraction Automated Ontology Construction
  • 71. Domain Meta-data Domain-independent knowledge Domain-specific knowledge Location hierarchy names of countries, states, cities; Time hierarchy canonical dates. Medical ontology diseases, serotypes, and viruses. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 72.
  • 73. Algorithm for Semantic Relation Discovery and Automated Ontology Expansion Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 74.
  • 75. 100 manually labeled document for entity extractionThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 76. Entity Extraction Results: ROC Curves Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 77. Entity Extraction Results: Learning Curves |OG|=754..1238 |OR|=772..1287 Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 78. Entity Extraction in the Domain of Veterinary Medicine (2) Sequence Labeling using Syntactic Features with Sliding Window
  • 79. Syntactic Feature Extraction POS tag numeric word-level feature Capitalization binary word-level feature Capitalization inside binary word-level feature for identifying abbreviations Position in the sentence numeric document-level feature Position in the document numeric document-level feature Frequency numeric document-level feature Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 80. Sequence Labeling Approach Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 81. An Example of Syntactic Feature Extraction “Severe disease in dairy cattle caused by Salmonella Newport” POS= [NNP, IN, NNS, VBN, …] = [2, 0, 2, 5, …] Xi = [POSi, CAPi, ICAPi, SPOSi, DPOSi, FREQi] Xi-3 = [2, 0, 0, 5, 5, 1] Xi = [2, 1, 0, 8, 8, 1] Xi-2 = [5, 0, 0, 6, 6, 1] Newport Xi-1 = [0, 0, 0, 7, 7, 1] … … wi wi+1 wi+2 wi-3 wi-1 wi+3 wi-2 cattle caused by Salmonella Xi+1 = [2, 1, 0, 9, 9, 1] Xi+2 = [-1, -1, -1, -1, -1, -1] Fi = [Xi, Xi-1, Xi-2, Xi-3, Xi+1, Xi+2, Xi+3], w = 3 Class = {0, 1} Xi+3 = [-1, -1, -1, -1, -1, -1] Fi = [2, 1, 0, 8, 8, 1, 0, 0, 0, 7, 7, 1, 5, 0, 0, 6, 6, 1, 2, 0, 0, 5, 5, 1, 2, 1, 0, 9, 9, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], Class = [1] Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 82. Experiment CSequence Labeling using Syntactic Features 100 manually labeled documents from Experiment B Number of disease names is more that 5 per document Keep capitalization Remove stop words 202977 examples in the dataset 80% for training (approx. 160000 examples) 20% for testing (approx. 40000 examples) Results are averaged over 3 runs We do not report accuracy because the data set is unbalanced (approx. 8570 positive examples vs. approx. 194430 negative examples) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 83. Entity Extraction Results: F-measure (1) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 84. Entity Extraction Results: F-measure (2) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 85. Entity Extraction Results: Precision, Recall, AUC (1) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 86. Entity Extraction Results: Precision, Recall, AUC (2) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 87. Summary BioCaster named entity recognition system 200 news articles F-score – 0.769 for all named entity classes SVM and feature window -2/+1 including surface word, orthography, biomedical prefixes/suffixes, lemma, head noun etc. DNA, RNA, cell type extraction SVM and orthographic features F-score – 0.799 during the identification phase and 66.5 during the classification phase; Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 88. Disease-related Event Recognition and Classification (1) Sentence-based Event Recognition and Classification
  • 89. Animal Disease-related Event Types Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 90. Event Recognition Methodology Step 1. Entity recognition from raw text. Step 2. Sentence classification from which entities are extracted as being related to an event or not; if they are related to an event we classify them as confirmed or suspected. Step 3. Combination of entities within an event sentence into the structured tuples and aggregation of tuples related to the same event into one comprehensive tuple. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 91. Step 1.Entity Recognition Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Locate and classify atomic elements into predefined categories: Disease names:“foot and mouth disease”, “rift valley fever”; viruses: “picornavirus”; serotypes: “Asia-1”; Species: “sheep”, “pigs”, “cattle” and “livestock”; Locationsof events specified at different levels of geo-granularity: “United Kingdom", “eastern provinces of Shandong and Jiangsu, China”; Datesin different formats: “last Tuesday”, “two month ago”.
  • 92. Entity Recognition Tools Animal Disease Extractor* relies on a medical ontology, automatically-enriched with synonyms and causative viruses. Species Extractor* pattern matching on a stemmed dictionary of animal names from Wikipedia. Location Extractor Stanford NER Tool** (uses conditional random fields); NGA GEOnet Names Database (GNS)*** for location disambiguation and retrieving latitude/longitude. Date/Time Extractor set of regular expressions. *KDD KSU DSEx - http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/ **Stanford NER - http://nlp.stanford.edu/ner/index.shtml ***GNS - http://earth-info.nga.mil/gns/html/ Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 93. Step 2. Event Sentence Classification Constraint: True events should include a disease name together with a status verb from Google Sets* and WordNet** (eliminate event non-related sentences). “Foot and mouth disease is[V] a highly pathogenic animal disease”. Confirmed status verbs “happened” and verb phrases “strike out” “On 9 Jun 2009, the farm's owner reported[V] symptoms of FMD in more than 30 hogs”. Suspected status verbs “catch” and verb phrases “be taken in” “RVF is suspected[V] in Saudi Arabia in September 2000”. *GoogleSets - http://labs.google.com/sets **WordNet - http://wordnet.princeton.edu/ Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 94. Step 3. Event Tuple Generation Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Event attributes: disease date location species confirmation status Event tuple: Eventi = < disease; date; location; species; status > = <FMD, 9 Jun 2009, Taoyuan, hog, confirmed> Event tuple with missing attributes: Eventj = <FMD, ?, ?, ?, confirmed>
  • 95. Event Recognition Workflow Step 1: Entity Recognition Foot-and-mouth disease[DIS]on hog[SP] farm in Taoyuan[LOC]. Taiwan's TVBS television station reports that agricultural authorities confirmed foot-and-mouth disease[DIS] on a hog[SP] farm in Taoyuan[LOC]. On 9 Jun 2009[DT], the farm's owner reported symptoms of FMD[DIS] in more than 30 hogs[SP]. Subsequent testing confirmed FMD[DIS]. Agricultural authorities asked the farmer to strengthen immunization. The outbreak has not affected other farms. Authorities stipulated that the affected hog[SP] farm may not sell pork for 2 weeks. Step 2: Sentence Classification YES 1. Foot-and-mouth disease[DIS]on hog[SP] farm in Taoyuan[LOC]. YES 2.Taiwan's TVBS television station reports that agricultural authorities confirmedfoot-and-mouth disease[DIS]on a hog[SP] farm in Taoyuan[LOC]. YES 3. On 9 Jun 2009[DT], the farm's owner reported symptoms of FMD[DIS] in more than 30 hogs[SP]. YES 4. Subsequent testing confirmedFMD[DIS]. NO 5. Agricultural authorities asked the farmer to strengthen immunization. NO 6. The outbreak has not affected other farms. NO 7. Authorities stipulated that the affected hog[SP] farm may not sell pork for 2 weeks. Step 3a: Tuple Generation E1 = <Foot-and-mouth disease, ?, Taoyuan, hog, ?> E3 = <FMD, 9 Jun 2009,?, hog, reported> E2 = <Foot-and-mouth disease, ?, Taoyuan, hog, confirmed > E4 = <FMD, ?, ?, ?, confirmed> Step 3b: Tuple Aggregation E = <disease, date, location, species, status> = <Foot-and-mouth disease, 9 Jun 2009, Taoyuan, hog, confirmed > Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 96. Rule-based event recognition approach Step 1. Entity Recognition Step 2. Event Sentence Classification Step 3. Event Tuple Generation & Aggregation Tuple aggregation is based on the set of rules: - disease name is the main attribute for the event tuple; - optimally combine the event tuples around a disease name (event tuple should have max number of event attributes); - if tuple has a new disease, then form the next tuple. The First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx 2010)
  • 97. Experiment DEvent Recognition and Classification The First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx 2010) ~100 event-related documents Foot-and-mouth disease (FMD) Rift valley fever (RVF) Manually created 2 sets of summaries for 100 docs DUCView Pyramid Scoring Tool* – Score [0..1] relies on multiple summaries to assign the significance weights to summarization content units (i.e., entities) to compare automatically generated event tuples with entities from human summaries. Scorei = < wddisease; wtdate; wllocation; wsspecies; wcstatus… >, subject to disease + status = 2
  • 98. Event Score Distribution by Range We interpret the Pyramid score values as an event extraction accuracy: # of unique contributing entities (TP); # of entities not in the summary (FP); # of extra contributing entities from summary (FN). multiple summaries – majority voting for annotation. The First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx 2010)
  • 99. Event Recognition & Classification Results The First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx 2010)
  • 100. Event Recognition and Classification: Susceptible, Infected and Recovered Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 “The signs suggested the 27 pigs could be suffering from foot and mouth disease (FMD) in Island of Anglesey, Wales as reported on 2/21/2001” “The UK Ministry of Agriculture confirmed on 2/20/2001 that 27 pigs found with vesicles in an abattoir near Brentwood, Essex, have Foot and Mouth Disease” “Almost 2000 cattle and more than 15 000 sheep, have been or are waiting to be slaughtered since the resurgence of the disease in Northumberland as reported on 8/31/2001”
  • 101. Disease-related Event Recognition and Classification (2) Event Recognition and Classification in Predictive Epidemiology Domain
  • 102. ENTITY EXTRACTION Document 3, sentence s31 Almost 2000 cattle[SP] are waiting to be slaughtered on 02/28/2001[DATE]since the resurgence of FMD[DIS] in Northumberland[LOC]. Document 2, sentence s21 The UK Ministry of Agriculture confirmed on 2/20/01[DATE] that 27 pigs[SP] found with vesicles in an abattoir near Brentwood, Essex[LOC] have FMD[DIS]. Document 1, sentence s11, s12 The signs suggested the 27 pigs[SP] could be suffering from foot and mouth disease[DIS] in Anglesey, Wales[LOC].It was reported on 02/18/01[DATE]. … … EVENT TUPLE GENERATION e11 = [27 pigs, FMD, ?, Anglesey, Wales, “suggest”] e12 = [?,?, 02/18/01, ?, “report”] e21 = [27 pigs, FMD, 2/20/01, Brentwood, Essex, “confirm”] e31 = [2000 cattle, FMD, 2/28/01, Northumberland, “slaughter”] EVENT TUPLE CLASSIFICATION Susceptible Recovered Infected EVENT TUPLE AGGREGATION E2 = [27 pigs, FMD, 2/20/01, Brentwood, Essex, Infected] E3 = [2000 cattle, FMD, 2/28/01, Northumberland, Recovered] E1= [27 pigs, FMD, 02/18/01, Anglesey, Wales, Susceptible] Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 103. Modified Algorithm for Event Recognition & SIR Classification Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 104. The spread of foot-and-mouth disease outbreak in UK, 2001 118 ProMed-Mail reports yellow - susceptible red - infected green - recovered Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 105. Summary The accuracy of the event recognition completely depends on the separate entity extraction accuracy The event aggregation and deduplication requires much comprehensive heuristics and additional knowledge, for example co-reference resolution BioCaster 950 disease-location pairs per month reported results - 887/950 correct disease-location pairs and 0.934 precision MedISys/PULS 100 English-language documents with 156 events Reported results – 0.88 precision Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 106. Conclusions, Contributions and Future Work Summary: 1. Disease-related Document Classification 2. Ontology-based Entity Extraction 3. Entity Extraction using Sequence Labeling 4. Event Recognition and Classification
  • 107. Conclusions (1) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Disease-related Document Classification: applied supervised framework and considered different feature representations for the documents and machine learning classification algorithms; evaluated the document relevance classification using binary features, keyword term frequency and the “bag-of-words” representation using word unigrams and bigrams. Our experimental results demonstrate the efficiency of our text categorization component in the designed framework for epidemiological analytics.
  • 108. Conclusions (2) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Ontology-based Domain-specific Entity Extraction used a semantic relationship extraction based on syntactic patterns and POS tagging to construct an ontology; compared the automatically-constructed ontology obtained using our relationship extraction approach with an ontology constructed using GoogleSets expansion approach; compared the entities extracted using all ontology in terms of precision and recall, report F-measure. The results show that our semantic relationship extraction approach brings new knowledge to an initial ontology and, therefore, boosts the domain-specific biomedical entity extraction results.
  • 109. Conclusions (3) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Entity Extraction using Sequence Labeling Approach exacted syntactic word-level features (capitalization, POS tagging, abbreviations, term frequency) using different sliding window; evaluated our approach using different machine learning algorithms together with different feature representations and various window size. reported results of the domain-specific entity extraction in terms of F-measure, precision and recall and compared them with the results from the other surveillance systems.
  • 110. Conclusions (4) Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Event Recognition and Classification presented our novel sentence-based approach applied several lists of verbs for confirmation status extraction including WordNet and GoogleSets. used DUCView tool to calculate scores for automatically generated event tuples, which can be seen as a measure of accuracy of our approach. The highest accuracy was obtained using a WordNet augmented list of verbs. Applied event recognition and classification approach for the predictive epidemiology domain.
  • 111. Contributions Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Paper “Computational Knowledge and Information Management in Veterinary Epidemiology” IEEE Intelligence and Security Informatics Conference (ISI'10), 23-26 May 2010, Vancouver, BC, Canada Paper “Animal Disease Event Recognition and Classification” First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx'10), WWW Conference, 26-30 April 2010, Raleigh, NC, USA Paper “Boosting Biomedical Entity Extraction by Using Syntactic Patterns for Semantic Relation Discovery” (to appear) 2010 IEEE/WIC/ACM International Conference on Web Intelligence (WI'10), August 31 - September 3, York University, Toronto, Canada Poster “Named Entity Recognition and Tagging in the Domain of Epizootics” Women in Machine Learning Workshop (WiML'09) Workshop, 6-7 Dec 2009, Vancouver, Canada ACM Poster Presentation Competition “Automated Event Extraction and Named Entity Recognition in the Domain of Veterinary Medicine” 2010 Grace Hopper Celebration of Women in Computing (GHC'10),September 28 - October 1, Atlanta, Georgia, USA
  • 112. Future Work Domain-specific Entity Extraction add automated multilingual ontology construction for the domain of veterinary medicine using other semistructured sources e.g., Wikipedia. Automated Ontology Construction extend our semantic relationship extraction approach to other domains and other generalized named entities Event Recognition and Classification apply deeper syntactic analysis of the sentence and part-of-speech tagging in addition to the list of verbs; consider the negation words, modal words and tense; to integrate coreference resolution functionality. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 113. Acknowledgments Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010 Faculty: Dr. William H. Hsu Dr. Doina Caragea Dr. Gurdip Singh KDD Lab alumni: Tim Weninger (crawler deployment) and Jing Xia (rule-based event extraction) KDD Lab assistants: Information Extraction Team: John Drouhard, Landon Fowles, Swathi Bujuru Spatial Data Mining Team: Wesam Elshamy, AndrewBerggren Topic Detection & Tracking Team: Danny Jones, Srinivas Reddy
  • 114. Thank you! Svitlana Volkova, svitlana.volkova@gmail.com http://people.cis.ksu.edu/~svitlana Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010