In grammars we trust: LeadMine, a knowledge driven solution

BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA 8th October 2013
In grammars we trust: LeadMine,
a knowledge driven solution
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK

Approaches to Entity
recognition
• Dictionary based
• Grammar based
• Machine Learning
LeadMineLeadMine

Optional

Normalization
Input Normalized
œstradiol oestradiol
5` or 5’ or 5′ (backtick/quotation mark/prime) 5'
<p>H<sub>2</sub>O</p> H2O

Blue: Grammars
Green: Traditional dictionaries
Orange: Blocking dictionaries

Advantages of grammars
• Don’t require annotated corpora
• Encode knowledge about the domain
• Very fast recognition
• Allow spelling correction if an entity is a near
match to one recognized by the grammar

Simple grammar Example
Digit1to9 : ‘1’ | ‘2’ |’4’ |’5’ |’6’ |’7’ |’8’ |’9’
Digit : Digit1to9 | ‘0’
Cid : ‘CID:’ Digit1to9 Digit*
C I D 1..9:
0..9

Grammar for IUPAC names
• Grammar for complete molecules: 485 rules
– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
• Generally aims to match a superset of the
nomenclature covered by IUPAC
• Specifically this is the superset that can be
theoretically be converted to structures

Grammar inheritance
• Molecule grammar serves as a good starting
point for a substituent grammar or generic
chemical grammar
– Inherit rules rather than duplicate them
– Allow overriding of rules
pluralizedChemical : chemical 's'
elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition
metal'|'transuranic element' | _elementaryMetalAtom

Dictionaries… bigger is better
• For high recall of trivial names, dictionaries
with high coverage are required.
• The largest publically available dictionary is
PubChem with over 94 million terms
• However most of these terms are either not
useful or actually detrimental to text mining

Aggressive filtering
• “what you don't see won't hurt you”
• Hence remove terms are also English words or start with an
English word
– Accomplished using a large English dictionary with
chemistry terms removed
• Remove internal identifiers used by depositors
• Remove terms that are matched by our grammars
• Ultimate result: 94 million  2.94 million

Structure Aware filtering
• “Do not tag proteins, polypeptides (> 15aa),
nucleic acid polymers, polysaccharides,
oligosaccharides [tetrasaccharide or longer] and other
biochemicals.”
• About 40,000 polypeptides and
oligosaccharides excluded from PubChem
using these criteria

Entity Extension
• Even PubChem is far from comprehensive hence it can be
useful to extend the start and/or end of entities to avoid
partial hits
– α-santalol can be recognized from santalol in the
dictionary
• Extension is bracketing aware and blocked by English words
• Entity trimming also performed to comply with the
annotation guidelines
– ‘Allura Red AC dye’  ‘Allura Red AC’

Entity Merging
• Adjacent entities may actually be part of one
entity
– Ethyl ester one entity
– (+)-limonene epoxide  one entity
BUT
– Hexane-benzene two entities

Using an ontology to determine
when terms add information
• Genistein isoflavone  two entities
• Glycine ester  one entity
Genistein showing isoflavone core structure

Abbreviation detection
• Based on the Hearst and Schwartz algorithm
• Detects abbreviations of the following forms:
– Tetrahydrofuran (THF)
– THF (tetrahydrofuran)
– Tetrahydrofuran (THF;
– Tetrahydrofuran (THF,
– (tetrahydrofuran, THF)
– THF = tetrahydrofuran
Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.

Domain-specific abbreviations
• Some abbreviations are not acronyms
• Can use string replacements to recognize
them e.g.
– Sodium  Na
– Estradiol  E2
Hence can recognize: 17α-ethinylestradiol  EE2

Non-entity abbreviation
removal
• Finds entities detected as abbreviations of
unrecognized entities
– Can mean a common chemical abbreviation has
been redefined in the scope of the document
current good manufacturing practice (cGMP)
cGMP = Cyclic guanosine monophosphate =

Making the most of the
knowledge provided
• Use training data to identify:
– Terms that are not currently recognized (whitelist)
– Terms that are often false positives (blacklist)
• Each false positive and false negative is placed
into such a list if its inclusion increased F-score
(harmonic mean of precision and recall)

CEM Task Results
(on development set)
Configuration Precision Recall F-score
Baseline 0.87 0.82 0.84
WhiteList 0.86 0.85 0.86
BlackList 0.88 0.80 0.84
WhiteList +
BlackList
0.87 0.83 0.85

CDI task ranking
• Uses precision of entities when running
against the development set with the results
broken down by:
– Title vs abstract?
– Which dictionary matched?
– Was the entity’s bounds modified?
– Did the entity occur more than once in the
document?

Conclusions
• Grammars complement dictionaries to allow recognition
of novel entities
• Both the coverage and quality of dictionaries is
important
• The meaning of novel abbreviations can be determined
algorithmically
• Entities can be classified based on the resource that
recognized them

Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com

In grammars we trust: LeadMine, a knowledge driven solution

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (18)

Semelhante a In grammars we trust: LeadMine, a knowledge driven solution

Semelhante a In grammars we trust: LeadMine, a knowledge driven solution (20)

Mais de NextMove Software

Mais de NextMove Software (20)

Último

Último (20)

In grammars we trust: LeadMine, a knowledge driven solution