LeadMine: A grammar and dictionary driven approach to chemical entity recognition
Daniel Lowe and Roger Sayle
NextMove Software Ltd, Cambridge

NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com

1. Abstract
LeadMine is a system for recognizing entities, especially chemical entities,
using large grammars and dictionaries [1]. Entities are identified without an
explicit tokenization step. To allow recognition of terms slightly outside the
coverage of these resources, spelling correction, entity extension and entity
merging are used. Recall is enhanced by abbreviation detection, and precision
is enhanced by removing abbreviations of non-entities. Using training data to
produce further dictionaries of terms to recognize or ignore, LeadMine
achieved 86.2% precision and 85.0% recall on an unused development set.
10. Bibliography
1. Sayle R, Xie PH, Muresan S. Improved Chemical Text Mining of Patents with Infinite
Dictionaries and Automatic Spelling Correction. Journal of Chemical Information
and Modeling. 2011;52(1):51–62.
2. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara
R, Darsow M, Guedj M, Ashburner M. ChEBI: a database and ontology for chemical
entities of biological interest. Nucleic Acids Research. 2008;36:D344–350.
3. Schwartz A, Hearst M. A Simple Algorithm for Identifying Abbreviation Definitions
in Biomedical Text. In: Proceedings of the Pacific Symposium on Biocomputing.
Kauai; 2003. pp. 451–462.
4. LeadMine Annotation
The rules for chemical nomenclature are written as formal grammars, e.g.
    alkanStem : 'meth' | 'eth' | 'prop' …
    alkane : alkanStem 'ane'
The systematic chemical name grammar uses 485 rules, many of which are
inherited by the derived grammars.
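The two grammar rules quoted above can be approximated with a regular expression. This is only a minimal sketch: it assumes nothing beyond the three stems shown on the poster (the elided remainder is not filled in), and the names `ALKAN_STEMS` and `find_alkanes` are hypothetical, not part of LeadMine.

```python
import re

# Only the stems that appear in the quoted rule; the real grammar lists more.
ALKAN_STEMS = ["meth", "eth", "prop"]

# alkane : alkanStem 'ane'
ALKANE = re.compile("(?:" + "|".join(ALKAN_STEMS) + ")ane")

def find_alkanes(text):
    """Return (start, end, name) for every alkane name the toy grammar accepts."""
    return [(m.start(), m.end(), m.group()) for m in ALKANE.finditer(text)]

print(find_alkanes("methane and propane"))
```

Because the pattern is matched directly against the character stream, no tokenization step is needed, which mirrors the approach described in the abstract.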
The 2.94 million term PubChem dictionary is the primary source of
trivial names. It was produced by running a series of filters against the
~94 million synonyms provided by PubChem. The filters included
removing terms that are English words or that start with an English word.
Records for structures containing tetrasaccharides (or longer)
or hexadecapeptides (or longer) were also excluded.
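The English-word filter described above might look roughly like the following. It is a sketch under stated assumptions: the word list is a tiny hypothetical stand-in, "starts with an English word" is interpreted as "the first whitespace-separated token is an English word", and the structure-based exclusions are not modelled.

```python
# Hypothetical stand-in for a full English word list.
ENGLISH_WORDS = {"lead", "red", "water"}

def keep_synonym(term, english_words=ENGLISH_WORDS):
    """Keep a synonym unless it is an English word or starts with one."""
    words = term.lower().split()
    if not words:
        return False
    is_english = term.lower().strip() in english_words
    starts_with_english = words[0] in english_words
    return not (is_english or starts_with_english)

def filter_synonyms(synonyms, english_words=ENGLISH_WORDS):
    return [s for s in synonyms if keep_synonym(s, english_words)]

print(filter_synonyms(["aspirin", "lead", "Red dye 40", "genistein"]))
```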
5. Entity extension and entity merging
Entities are extended until they reach whitespace, a mismatched bracket or
an English word. Entities are then trimmed of non-essential parts. Finally,
adjacent entities are merged unless they are distinct molecules or one is an
instance of the other according to ChEBI [2] (e.g. genistein is an isoflavone).
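The extension step can be sketched as growing a match outwards character by character. The version below is an illustration only: it handles the whitespace and mismatched-bracket stopping conditions, omits the English-word stop and the trimming step, and the function name `extend_entity` is hypothetical.

```python
OPENERS = "([{"
CLOSERS = ")]}"

def extend_entity(text, start, end):
    """Grow text[start:end] until whitespace or a mismatched bracket is reached."""
    # Rightwards: a closing bracket is absorbed only if an opener is waiting for it.
    depth = 0
    for ch in text[start:end]:
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS and depth > 0:
            depth -= 1
    while end < len(text) and not text[end].isspace():
        ch = text[end]
        if ch in CLOSERS:
            if depth == 0:
                break
            depth -= 1
        elif ch in OPENERS:
            depth += 1
        end += 1
    # Leftwards: the mirror image, with the roles of openers and closers swapped.
    depth = 0
    for ch in reversed(text[start:end]):
        if ch in CLOSERS:
            depth += 1
        elif ch in OPENERS and depth > 0:
            depth -= 1
    while start > 0 and not text[start - 1].isspace():
        ch = text[start - 1]
        if ch in OPENERS:
            if depth == 0:
                break
            depth -= 1
        elif ch in CLOSERS:
            depth += 1
        start -= 1
    return start, end

sentence = "oil of α-Santalol was"
s = sentence.index("Santalol")
a, b = extend_entity(sentence, s, s + len("Santalol"))
print(sentence[a:b])
```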
9. Conclusions
LeadMine combines the capabilities of grammars to recognize regular
entities with the coverage of dictionaries. The results are readily
understandable and can be iteratively improved.
7. Abbreviation Detection
The Schwartz and Hearst algorithm [3] was adapted to recognize
abbreviations of the following forms:
• Tetrahydrofuran (THF)
• THF (tetrahydrofuran)
• Tetrahydrofuran (THF;
• Tetrahydrofuran (THF,
• (tetrahydrofuran, THF)
• THF = tetrahydrofuran
A list of domain-specific abbreviations is also used for cases where the long
form does not contain the characters of the abbreviation, e.g. mercury → Hg or
estrone → E1.
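The character-matching core of the Schwartz and Hearst algorithm [3] can be sketched as below. This is a simplified illustration, not LeadMine's adaptation: it covers only the "long form (SHORT" family of patterns, omits the original algorithm's limits on short-form shape and search window, and the names `best_long_form` and `find_definitions` are hypothetical.

```python
import re

def best_long_form(short, candidate):
    """Match the short form's characters right to left inside the candidate text."""
    s, l = len(short) - 1, len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # The first character of the short form must also start a word.
        while l >= 0 and (candidate[l].lower() != ch or
                          (s == 0 and l > 0 and candidate[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]

# Short form directly after "(" and closed by ")", ";" or ",".
PAREN = re.compile(r"\(([^\s();,]+)[);,]")

def find_definitions(text):
    found = []
    for m in PAREN.finditer(text):
        short = m.group(1)
        long_form = best_long_form(short, text[:m.start()].rstrip())
        if long_form:
            found.append((short, long_form))
    return found

print(find_definitions("The solvent was Tetrahydrofuran (THF; dried)"))
print(best_long_form("Hg", "mercury"))
```

The second call returns `None` because "mercury" contains no "g", which is the kind of case the separate list of domain-specific abbreviations is there to cover.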
8. Evaluation
The training set was used to automatically identify holes in coverage and
common false positives, and from these to derive a dictionary of terms to
include (WhiteList) and a dictionary of terms to exclude (BlackList). The
workflow was then evaluated on the development set for the task of
identifying all chemical entity mentions.
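One way such lists could be derived is by comparing gold-standard mentions with system output on the training set. The sketch below is an assumption-laden illustration: it compares mention strings only (ignoring positions), and the `min_count` threshold and the function name `derive_lists` are hypothetical rather than taken from the poster.

```python
from collections import Counter

def derive_lists(gold_mentions, predicted_mentions, min_count=2):
    """Return (whitelist, blacklist) from repeated misses and repeated false positives."""
    gold = Counter(gold_mentions)
    predicted = Counter(predicted_mentions)
    missed = gold - predicted      # annotated but not found: holes in coverage
    spurious = predicted - gold    # found but not annotated: false positives
    whitelist = sorted(t for t, n in missed.items() if n >= min_count)
    blacklist = sorted(t for t, n in spurious.items() if n >= min_count)
    return whitelist, blacklist

gold = ["THF", "THF", "water", "DMSO"]
predicted = ["water", "lead", "lead", "DMSO"]
print(derive_lists(gold, predicted))
```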
Configuration            Precision   Recall   F-score
Baseline                 0.869       0.820    0.844
WhiteList                0.862       0.850    0.856
BlackList                0.882       0.803    0.841
WhiteList + BlackList    0.873       0.832    0.852
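The F-score column is consistent with the balanced F-measure, the harmonic mean of precision and recall. A short check, assuming that definition (the poster does not state the formula), reproduces the reported values:

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

results = {
    "Baseline": (0.869, 0.820),
    "WhiteList": (0.862, 0.850),
    "BlackList": (0.882, 0.803),
    "WhiteList + BlackList": (0.873, 0.832),
}
for name, (p, r) in results.items():
    print(f"{name}: {f_score(p, r):.3f}")
```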
8. Non-entity abbreviation removal
The Schwartz and Hearst algorithm is used to find abbreviations that are
recognized entities but whose unabbreviated form is not an entity.
The abbreviation is then ignored, e.g.
current good manufacturing practice (cGMP)
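The removal rule can be expressed as a small filter over detected (short form, long form) pairs. This is a sketch only: `is_entity` stands in for whatever recognizer is used, and the toy entity set is hypothetical.

```python
def abbreviations_to_ignore(definitions, is_entity):
    """Short forms that look like entities but whose long form is not one."""
    return {short for short, long_form in definitions
            if is_entity(short) and not is_entity(long_form)}

# Hypothetical recognizer: a fixed set of strings treated as chemical entities.
KNOWN_ENTITIES = {"cGMP", "THF", "tetrahydrofuran"}

definitions = [
    ("cGMP", "current good manufacturing practice"),
    ("THF", "tetrahydrofuran"),
]
print(abbreviations_to_ignore(definitions, KNOWN_ENTITIES.__contains__))
```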
3. Normalization
LeadMine works internally on a normalized string with mappings back to the
original input. Normalization allows XML tags to be ignored and means that
fewer lexical variants need to be recognized.

Input                                             Normalized
œstradiol                                         oestradiol
5` or 5’ or 5′ (backtick/quotation mark/prime)    5'
<p>H<sub>2</sub>O</p>                             H2O
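Normalizing while keeping a mapping back to the input can be sketched by recording, for each output character, the index of the input character it came from. The replacement table below contains only the cases from the table above, the tag pattern is a deliberately simple assumption, and the function name `normalize` is hypothetical.

```python
import re

# Only the substitutions shown in the table; a real system would have many more.
REPLACEMENTS = {"œ": "oe", "`": "'", "\u2019": "'", "\u2032": "'"}

# Simplistic tag pattern: "<" or "</" followed by a letter, up to the next ">".
TAG = re.compile(r"</?[A-Za-z][^<>]*>")

def normalize(text):
    """Return (normalized_text, offsets); offsets[i] is the input index of output char i."""
    out, offsets = [], []
    i = 0
    while i < len(text):
        tag = TAG.match(text, i)
        if tag:
            i = tag.end()  # skip the tag entirely
            continue
        for ch in REPLACEMENTS.get(text[i], text[i]):
            out.append(ch)
            offsets.append(i)
        i += 1
    return "".join(out), offsets

print(normalize("<p>H<sub>2</sub>O</p>"))
```

An entity found at positions `i..j` of the normalized string can then be reported against the original input using `offsets[i]` and `offsets[j - 1] + 1`.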
Entity extension and merging examples:

Input                   Found entities              After extension/merging
α-Santalol              Santalol                    α-Santalol
Allura Red AC dye       Allura Red AC dye           Allura Red AC
Glycine ester           Glycine AND ester           Glycine ester
Hexane-benzene          Hexane AND benzene          Hexane AND benzene
Genistein isoflavone    Genistein AND isoflavone    Genistein AND isoflavone
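The merging decisions in the last three rows can be reproduced with a small rule. The lookup tables here are hypothetical stand-ins: `MOLECULES` marks names that denote a specific molecule, and `IS_A` plays the role of the ChEBI hierarchy [2]; neither reflects LeadMine's actual data.

```python
# Hypothetical: names that resolve to one specific molecule.
MOLECULES = {"glycine", "hexane", "benzene", "genistein"}

# Hypothetical stand-in for ChEBI "is a" relations: (instance, class).
IS_A = {("genistein", "isoflavone")}

def should_merge(left, right):
    """Merge adjacent entities unless both are distinct molecules or one is an instance of the other."""
    a, b = left.lower(), right.lower()
    if a in MOLECULES and b in MOLECULES:
        return False
    if (a, b) in IS_A or (b, a) in IS_A:
        return False
    return True

for pair in [("Glycine", "ester"), ("Hexane", "benzene"), ("Genistein", "isoflavone")]:
    print(pair, should_merge(*pair))
```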
2. Workflow
[Workflow diagram not reproduced. Figure legend: Optional; Blue: Grammars; Green: Traditional dictionaries; Orange: Blocking dictionaries.]
