Analysing Entity Type Variation across Biomedical Subdomains
Claudiu Mihăilă, Riza Theresa Batista-Navarro, Sophia Ananiadou

Claudiu Mihăilă
National Centre for Text Mining
School of Computer Science
University of Manchester

26 May 2012
BioTxtM 2012




Introduction
• Named entities
     o   Atomic elements, classified into various categories (protein, gene, disease, treatment, metabolite, etc.)
     o   Annotated example (labels on the slide: Pro, Organism, Transcription, +Reg; relation: Theme):
         In contrast to the phenotype of the pta ackA double mutant, pbgP transcription was reduced in the pmrD mutant.








Introduction
• Corpora








Methodology
• Full-text open-access journal articles from UKPMC
• 20 subdomains 400 single broad-subject-termed articles
     o   Allergy & Immunology
     o   Biology
     o   Cell Biology
     o   Communicable Diseases
     o   Critical Care
     o   Environmental Health
     o   Genetics
     o   Health Services Research
     o   Medical Informatics
     o   Medicine
     o   Microbiology
     o   Neoplasms
     o   Neurology
     o   Pharmacology
     o   Physiology
     o   Public Health
     o   Pulmonary Medicine
     o   Rheumatology
     o   Tropical Medicine
     o   Virology








Methodology
• NE source: A_Silver = A_UKPMC ∪ A_Oscar ∪ A_NeMine
• [Figure: the corpus of 20 subdomains is annotated by UKPMC, OSCAR and NeMine to produce the annotation]
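Reading the silver annotation as the set union of the three tools' outputs (an assumption about how the sources are combined), a minimal Python sketch could look like the following. The `Annotation` tuple, the `silver_annotation` helper and the character offsets are hypothetical and serve only as illustration.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One named-entity mention: character span plus NE type (hypothetical layout)."""
    start: int
    end: int
    ne_type: str


def silver_annotation(*sources):
    """Merge the annotation sets of several taggers by plain set union."""
    merged = set()
    for source in sources:
        merged.update(source)
    # Sort by offsets so the merged output is deterministic.
    return sorted(merged)


# Hypothetical tagger outputs for one document.
ukpmc = {Annotation(0, 4, "Gene"), Annotation(10, 17, "Disease")}
oscar = {Annotation(20, 27, "Chemical molecule")}
nemine = {Annotation(10, 17, "Disease"), Annotation(30, 35, "Organ")}

silver = silver_annotation(ukpmc, oscar, nemine)
```

A plain union keeps an identical mention only once but does not resolve overlapping or conflicting spans; whether the original pipeline does so is not stated on the slide.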








Methodology
• NE types per source, merged into the Silver Annotation

NeMine                 UKPMC           OSCAR
Gene                   Gene            Chemical molecule
Protein                Protein         Chemical adjective
Disease                Disease         Enzyme
Drug                   Drug            Reaction
Metabolite             Metabolite
Bacteria               Gene|Protein
Diagnostic process
General phenomenon
Indicator
Natural phenomenon
Organ
Pathologic function
Symptom
Therapeutic process




Methodology
• Feature vectors


Document d: raw NE counts and relative frequencies

NE type              Count    Share of NEs in d
Enzyme                   2          0.41%
Chemical molecule       71         14.85%
Disease                  8          1.67%
Drug                    12          2.51%
Gene                    15          3.13%
Gene|Protein           155         32.42%
Metabolite               3          0.62%
Protein                188         39.33%
Reaction                24          5.02%
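The percentages for document d appear to be each NE-type count divided by the document's total NE count (478 for the counts listed). Under that assumption, the feature vector can be sketched in Python as follows; the function name `to_relative_frequencies` is hypothetical.

```python
def to_relative_frequencies(counts):
    """Turn raw NE-type counts into percentages of the document's NE total."""
    total = sum(counts.values())
    if total == 0:
        # A document without any NE yields an all-zero vector.
        return {ne_type: 0.0 for ne_type in counts}
    return {ne_type: 100.0 * n / total for ne_type, n in counts.items()}


# Counts for document d, as listed on the slide.
doc_d = {
    "Enzyme": 2,
    "Chemical molecule": 71,
    "Disease": 8,
    "Drug": 12,
    "Gene": 15,
    "Gene|Protein": 155,
    "Metabolite": 3,
    "Protein": 188,
    "Reaction": 24,
}

features = to_relative_frequencies(doc_d)
for ne_type, share in features.items():
    print(f"{ne_type:<20}{share:6.2f}%")
```

The sketch rounds to two decimals when printing, whereas the slide seems to truncate, so the last digit can differ. Normalising by the total makes documents of different lengths comparable.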






Methodology
• Chi-squared statistics
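The formula on this slide did not survive, so the following Python sketch only illustrates one common way to obtain a chi-squared value per NE type for a pair of subdomains. It is an assumption, not necessarily the authors' setup: for each type, a 2x2 contingency table contrasts mentions of that type with mentions of all other types in the two subdomains. The function names and the toy counts are hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # An empty row or column carries no evidence of a difference.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def chi_squared_vector(counts_a, counts_b):
    """One chi-squared value per NE type for a pair of subdomains."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    vector = {}
    for ne_type in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(ne_type, 0)
        c = counts_b.get(ne_type, 0)
        # Rows: subdomain A / subdomain B; columns: this type / all other types.
        vector[ne_type] = chi_squared_2x2(a, total_a - a, c, total_b - c)
    return vector


# Toy aggregated counts for two subdomains (hypothetical numbers).
subdomain_a = {"Protein": 30, "Drug": 10}
subdomain_b = {"Protein": 10, "Drug": 30}
chi2 = chi_squared_vector(subdomain_a, subdomain_b)
```

A larger value for a type suggests that its distribution differs more strongly between the two subdomains.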








Methodology
• Frobenius norm




     o   Value shown on the slide: 1247.0725
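The Frobenius norm is the square root of the sum of squared entries; for a vector it coincides with the Euclidean length. A minimal Python sketch, with a hypothetical helper name and toy inputs that do not attempt to reproduce the value on the slide:

```python
import math


def frobenius_norm(values):
    """Square root of the sum of squared entries of a vector or a matrix."""
    # Treat a flat vector as a matrix with one entry per row.
    rows = [row if isinstance(row, (list, tuple)) else [row] for row in values]
    return math.sqrt(sum(x * x for row in rows for x in row))


vector_norm = frobenius_norm([3.0, 4.0])
matrix_norm = frobenius_norm([[1.0, 2.0], [2.0, 4.0]])
```

Applied to the chi-squared values of one subdomain pair, the norm condenses the per-feature scores into a single number for that pair.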








Feature evaluation
• Good features for
     o   Cell Biology
     o   Pharmacology
     o   Health Sciences
     o   Public Health

• Not-so-good features for
     o   Medical Informatics
     o   Medicine
     o   Microbiology
     o   Neoplasms
     o   Neurology
                   Frobenius norm of χ² vectors for each pair.




Feature evaluation
• Mean chi-squared for every feature over all pairs








Classifier selection
Classifier                     Top result count    Share
J48                                   0             0%
JRip                                  4             2.10%
Logistic                              2             1.05%
Random Tree                           0             0%
Random Forest                        86            45.26%
SMO                                   0             0%
AdaBoost + J48                        6             3.15%
AdaBoost + JRip                       7             3.68%
AdaBoost + Decision Stump            16             8.42%
AdaBoost + Logistic                   0             0%
AdaBoost + Random Tree                0             0%
AdaBoost + Random Forest             68            35.78%
AdaBoost + SMO                        1             0.52%
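The top-result counts add up to 190, which is also the number of unordered pairs that can be formed from 20 subdomains. Reading each count as the number of subdomain pairs on which a classifier scored best is an assumption; under it, the shares can be checked with a short Python sketch (the dictionary keys are illustrative labels, not tool identifiers).

```python
from math import comb

# Top-result counts from the table; zero-count classifiers are omitted.
top_counts = {
    "JRip": 4,
    "Logistic": 2,
    "Random Forest": 86,
    "AdaBoost + J48": 6,
    "AdaBoost + JRip": 7,
    "AdaBoost + Decision Stump": 16,
    "AdaBoost + Random Forest": 68,
    "AdaBoost + SMO": 1,
}

total_pairs = comb(20, 2)  # unordered pairs of 20 subdomains
shares = {name: 100.0 * n / total_pairs for name, n in top_counts.items()}

for name, share in shares.items():
    print(f"{name:<28}{share:6.2f}%")
```

Random Forest, alone or boosted, accounts for 154 of the 190 top results, which matches its use in the classifier evaluation that follows.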




Classifier evaluation
• Dissimilar subdomains
     o   Cell Biology
     o   Pharmacology
     o   Health Sciences
     o   Public Health

• Similar subdomains
     o   Medical Informatics
     o   Medicine
     o   Microbiology
     o   Neoplasms
     o   Neurology
                     Random Forest F-score for each pair.




Conclusions
• To remember
     o Significant semantic variation of biomedical sublanguages
     o Distinguishable bio-subdomains using only NE types
     o Caution needed when adapting NLP tools to subdomains
• To do
     o Extension to bio-events
     o Combination with lexical, syntactic and discourse features
     o Extension to other domains








Thank you!




        http://misteringo.deviantart.com/art/Bunnies-Scream-Again-79745974
