Analysing Entity Type Variation across Biomedical Subdomains
1. Analysing Entity Type Variation across Biomedical Subdomains
Claudiu Mihăilă, Riza Theresa Batista-Navarro, Sophia Ananiadou
Claudiu Mihăilă
National Centre for Text Mining
School of Computer Science
University of Manchester
26 May 2012
2. BioTxtM 2012
Introduction
• Named entities
o Atomic elements, classified into various categories (protein,
gene, disease, treatment, metabolite etc.)
[Figure: example sentence with entity and event annotations; labels shown: Organism, Pro, Transcription, +Reg, Theme]
In contrast to the phenotype of the pta ackA double mutant, pbgP transcription was reduced in the pmrD mutant.
4. BioTxtM 2012
Methodology
• Full-text open-access journal articles from UKPMC
• 20 subdomains; 400 articles, each tagged with a single broad subject term
o Allergy & Immunology
o Biology
o Cell Biology
o Communicable Diseases
o Critical Care
o Environmental Health
o Genetics
o Health Services Research
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology
o Pharmacology
o Physiology
o Public Health
o Pulmonary Medicine
o Rheumatology
o Tropical Medicine
o Virology
5. BioTxtM 2012
Methodology
• NE source: A_Silver = A_UKPMC ∪ A_OSCAR ∪ A_NeMine
[Figure: corpus annotation; the subdomain corpus is annotated by UKPMC, OSCAR and NeMine.]
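A minimal sketch of how annotations from several named-entity sources could be combined, assuming the silver annotation is the plain set union of the three sources. The `Annotation` tuple, the `merge_silver` helper and the character offsets are hypothetical and only illustrate the idea; the slides do not describe how overlapping or conflicting spans are handled.

```python
from typing import Iterable, NamedTuple, Set


class Annotation(NamedTuple):
    """Hypothetical representation: character span plus entity type."""
    start: int
    end: int
    entity_type: str


def merge_silver(*sources: Iterable[Annotation]) -> Set[Annotation]:
    """Return the set union of all source annotations (assumed merge rule)."""
    silver: Set[Annotation] = set()
    for source in sources:
        silver |= set(source)
    return silver


# Made-up annotations for a single document.
ukpmc = {Annotation(0, 4, "Gene")}
oscar = {Annotation(10, 17, "Chemical molecule")}
nemine = {Annotation(0, 4, "Gene"), Annotation(20, 28, "Disease")}

silver = merge_silver(ukpmc, oscar, nemine)
```

Identical annotations produced by two sources collapse into one entry, because the merge works on sets.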
6. BioTxtM 2012
Methodology
Entity types by source, combined into the silver annotation:
• NeMine: Gene, Protein, Disease, Drug, Metabolite, Bacteria, Diagnostic process, General phenomenon, Indicator, Natural phenomenon, Organ, Pathologic function, Symptom, Therapeutic process
• UKPMC: Gene, Protein, Disease, Drug, Metabolite, Gene|Protein
• OSCAR: Chemical molecule, Chemical adjective, Enzyme, Reaction
7. BioTxtM 2012
Methodology
• Feature vectors
Document d: entity-type counts and the corresponding relative frequencies
Entity type          Count   Relative frequency
Enzyme                   2    0.41%
Chemical molecule       71   14.85%
Disease                  8    1.67%
Drug                    12    2.51%
Gene                    15    3.13%
Gene|Protein           155   32.42%
Metabolite               3    0.62%
Protein                188   39.33%
Reaction                24    5.02%
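The step from raw counts to a feature vector can be sketched as follows, assuming each feature is the share of a document's entity mentions that belong to one type. The counts are those listed for document d; the function name `relative_frequencies` is illustrative. Values are computed exactly and rounded only for display, so the last digit can differ from the figures on the slide.

```python
from typing import Dict


def relative_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalise per-type counts so that the values sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {entity_type: 0.0 for entity_type in counts}
    return {entity_type: n / total for entity_type, n in counts.items()}


# Entity-type counts for document d, as listed on the slide.
counts_d = {
    "Enzyme": 2,
    "Chemical molecule": 71,
    "Disease": 8,
    "Drug": 12,
    "Gene": 15,
    "Gene|Protein": 155,
    "Metabolite": 3,
    "Protein": 188,
    "Reaction": 24,
}

features_d = relative_frequencies(counts_d)

for entity_type, share in features_d.items():
    print(f"{entity_type:<20}{100 * share:6.2f}%")
```

Normalising by the document total makes documents of different lengths comparable, which raw counts would not be.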
12. BioTxtM 2012
Feature evaluation
• Good features for
o Cell Biology
o Pharmacology
o Health Sciences
o Public Health
• Not-so-good features for
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology
Frobenius norm of 2 vectors for each pair.
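For reference, a small sketch of the Frobenius norm itself. The slide does not say how the two vectors of a pair are combined, so the example makes the assumption that they are stacked as the two rows of a matrix; for a single vector the Frobenius norm coincides with the Euclidean norm. The function name and the example vectors are illustrative only.

```python
import math
from typing import Sequence


def frobenius_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Square root of the sum of squared entries of a matrix."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


# Two made-up vectors for one subdomain pair, stacked as matrix rows (assumption).
vector_a = [3.0, 0.0]
vector_b = [0.0, 4.0]
pair_norm = frobenius_norm([vector_a, vector_b])
```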
14. BioTxtM 2012
Classifier selection
Classifier                   Top result count   Percentage
J48                                         0        0%
JRip                                        4     2.10%
Logistic                                    2     1.05%
Random Tree                                 0        0%
Random Forest                              86    45.26%
SMO                                         0        0%
AdaBoost + J48                              6     3.15%
AdaBoost + JRip                             7     3.68%
AdaBoost + Decision Stump                  16     8.42%
AdaBoost + Logistic                         0        0%
AdaBoost + Random Tree                      0        0%
AdaBoost + Random Forest                   68    35.78%
AdaBoost + SMO                              1     0.52%
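The percentages in this table can be checked with a short calculation. The top-result counts add up to 190, which is also the number of unordered pairs that can be formed from 20 subdomains, so the sketch assumes each percentage is a count divided by 190. The "AdaBoost + " prefix is only a naming choice for the boosted variants.

```python
from math import comb

# Top-result counts as listed on the slide.
top_counts = {
    "J48": 0,
    "JRip": 4,
    "Logistic": 2,
    "Random Tree": 0,
    "Random Forest": 86,
    "SMO": 0,
    "AdaBoost + J48": 6,
    "AdaBoost + JRip": 7,
    "AdaBoost + Decision Stump": 16,
    "AdaBoost + Logistic": 0,
    "AdaBoost + Random Tree": 0,
    "AdaBoost + Random Forest": 68,
    "AdaBoost + SMO": 1,
}

# Number of unordered subdomain pairs: 20 choose 2.
n_pairs = comb(20, 2)

# Share of pairs on which each classifier gave the top result, in percent.
shares = {name: 100.0 * count / n_pairs for name, count in top_counts.items()}

for name, share in shares.items():
    print(f"{name:<28}{share:6.2f}%")
```

The printed values are rounded, so the last digit can differ slightly from the slide.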
15. BioTxtM 2012
Classifier evaluation
• Dissimilar subdomains
o Cell Biology
o Pharmacology
o Health Sciences
o Public Health
• Similar subdomains
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology
Random Forest F-score for each pair.
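As a reminder of what the reported measure captures, here is a minimal sketch of an F-score computed from classification outcomes. It assumes the balanced F1 variant (harmonic mean of precision and recall); the slides do not state which variant or averaging scheme was used, and the counts below are made up.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score (F1) from true positives, false positives and false negatives."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Made-up outcome for one subdomain pair: 8 documents correctly assigned to
# the first subdomain, 2 wrongly assigned to it, 2 missed.
example = f_score(true_pos=8, false_pos=2, false_neg=2)
```

A high F-score for a pair means the two subdomains are easy to tell apart from their entity-type distributions; a low one suggests they are similar.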
16. BioTxtM 2012
Conclusions
• To remember
o Significant semantic variation of biomedical sublanguages
o Distinguishable bio-subdomains using only NE types
o Caution needed when adapting NLP tools to subdomains
• To do
o Extension to bio-events
o Combination with lexical, syntactical, discourse features
o Extension to other domains