We present a system employing large grammars and dictionaries to recognize a broad range of chemical entities. The system utilizes these resources to identify chemical entities without an explicit tokenization step. To allow recognition of terms slightly outside the coverage of these resources, we employ spelling correction, entity extension, and merging of adjacent entities. Recall is enhanced by the use of abbreviation detection, and precision is enhanced by the removal of abbreviations of non-entities. With the use of training data to produce further dictionaries of terms to recognize/ignore, our system achieved 86.2% precision and 85.0% recall on an unused development set.
In grammars we trust: LeadMine, a knowledge driven solution
1. BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA 8th October 2013
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK
Approaches to entity recognition
• Dictionary based
• Grammar based
• Machine learning
LeadMine
Normalization
Input                                            Normalized
œstradiol                                        oestradiol
5` or 5’ or 5′ (backtick/quotation mark/prime)   5'
<p>H<sub>2</sub>O</p>                            H2O
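The three rows above can be reproduced with a very small normalizer. This is only a sketch in Python: the function name `normalize` and the two lookup tables are illustrative assumptions, and only the mappings shown on the slide are taken from the source. LeadMine's real normalization presumably covers far more cases.

```python
import re

# Assumed lookup tables; only these entries are taken from the slide.
PRIME_LIKE = {"`": "'", "\u2019": "'", "\u2032": "'"}  # backtick, quotation mark, prime
LIGATURES = {"\u0153": "oe"}                            # oe ligature
TAG = re.compile(r"<[^>]+>")                            # any HTML/XML tag


def normalize(text: str) -> str:
    """Strip markup tags, then map look-alike characters to plain forms."""
    text = TAG.sub("", text)
    for src, dst in {**PRIME_LIKE, **LIGATURES}.items():
        text = text.replace(src, dst)
    return text
```

Normalizing before matching means the grammars and dictionaries only need to contain one spelling of each term.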
Blue: Grammars
Green: Traditional dictionaries
Orange: Blocking dictionaries
Advantages of grammars
• Don't require annotated corpora
• Encode knowledge about the domain
• Very fast recognition
• Allow spelling correction if an entity is a near match to one recognized by the grammar
Simple grammar example
Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Digit : Digit1to9 | '0'
Cid : 'CID:' Digit1to9 Digit*
[Figure: state machine accepting 'C', 'I', 'D', ':', one digit 1..9, then any number of digits 0..9]
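Because the Cid rule is regular, it can be written as an ordinary regular expression, which a regex engine turns into a state machine much like the one pictured. The Python sketch below is an illustration only; the names `CID`, `is_cid` and `find_cids` are assumptions and not part of LeadMine.

```python
import re

# Cid : 'CID:' Digit1to9 Digit*
CID = re.compile(r"CID:[1-9][0-9]*")


def is_cid(candidate: str) -> bool:
    """True if the whole string is derivable from the Cid rule."""
    return CID.fullmatch(candidate) is not None


def find_cids(text: str) -> list:
    """Return every Cid occurrence in running text, with no tokenization step."""
    return [match.group(0) for match in CID.finditer(text)]
```

A leading zero is rejected because the first digit must come from Digit1to9, so `is_cid("CID:0123")` is false while `is_cid("CID:2244")` is true.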
Grammar for IUPAC names
• Grammar for complete molecules: 485 rules
  - trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
  - ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
• Generally aims to match a superset of the nomenclature covered by IUPAC
• Specifically, this is the superset that can theoretically be converted to structures
Grammar inheritance
• The molecule grammar serves as a good starting point for a substituent grammar or a generic chemical grammar
  - Inherit rules rather than duplicate them
  - Allow overriding of rules
pluralizedChemical : chemical 's'
elementaryMetalAtom : 'lanthanide' | 'lanthanoid' | 'transition metal' | 'transuranic element' | _elementaryMetalAtom
Dictionaries… bigger is better
• For high recall of trivial names, dictionaries with high coverage are required
• The largest publicly available dictionary is PubChem, with over 94 million terms
• However, most of these terms are either not useful or actually detrimental to text mining
Aggressive filtering
• "What you don't see won't hurt you"
• Hence remove terms that are also English words or start with an English word
  - Accomplished using a large English dictionary with chemistry terms removed
• Remove internal identifiers used by depositors
• Remove terms that are matched by our grammars
• Ultimate result: 94 million → 2.94 million
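The filtering steps listed above can be sketched as a single pass over the raw term list. Everything named here is an assumption made for illustration: `filter_dictionary`, the `INTERNAL_ID` pattern (depositor identifiers are guessed to look like capital letters followed by digits) and the `grammar_matches` callback standing in for the grammar recognizer.

```python
import re

# Assumed shape of a depositor's internal identifier, e.g. "ABC123456".
INTERNAL_ID = re.compile(r"^[A-Z]{2,}[-_ ]?\d{4,}$")


def filter_dictionary(terms, english_words, grammar_matches):
    """Keep only terms that survive all three filters from the slide."""
    kept = []
    for term in terms:
        tokens = term.lower().split()
        if not tokens:
            continue
        # 1. Drop English words and terms that start with an English word.
        if term.lower() in english_words or tokens[0] in english_words:
            continue
        # 2. Drop internal identifiers used by depositors.
        if INTERNAL_ID.match(term):
            continue
        # 3. Drop terms the grammars already recognize.
        if grammar_matches(term):
            continue
        kept.append(term)
    return kept
```

The English word list is expected to have chemistry terms removed beforehand, otherwise names such as "lead" compounds that are genuinely wanted would be filtered too.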
Structure-aware filtering
• "Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals."
• About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria
Entity extension
• Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits
  - α-santalol can be recognized from santalol in the dictionary
• Extension is bracketing aware and blocked by English words
• Entity trimming is also performed to comply with the annotation guidelines
  - "Allura Red AC dye" → "Allura Red AC"
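One way to picture the extension step is as growing a dictionary hit out to the surrounding whitespace, then undoing the growth when it would swallow an English word or unbalance the brackets. The Python below is a rough sketch under those assumptions; `extend_entity` and `is_balanced` are hypothetical names, and the slides do not describe LeadMine's actual rules in this detail. Trimming is not covered.

```python
def is_balanced(s: str) -> bool:
    """True if every bracket in s is opened and closed in order."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def extend_entity(text, start, end, english_words):
    """Extend the span [start, end) to token boundaries where allowed."""
    new_start, new_end = start, end
    while new_start > 0 and not text[new_start - 1].isspace():
        new_start -= 1
    while new_end < len(text) and not text[new_end].isspace():
        new_end += 1
    # Do not absorb trailing sentence punctuation.
    while new_end > end and text[new_end - 1] in ".,;:":
        new_end -= 1
    # Extension is blocked by English words on either side.
    if text[new_start:start].strip("-").lower() in english_words:
        new_start = start
    if text[end:new_end].strip("-").lower() in english_words:
        new_end = end
    # Bracketing aware: reject an extension that unbalances brackets.
    if not is_balanced(text[new_start:new_end]):
        return start, end
    return new_start, new_end
```

With the text "rich in α-santalol." and the span of the dictionary hit "santalol", the returned span covers "α-santalol".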
Entity merging
• Adjacent entities may actually be part of one entity
  - Ethyl ester → one entity
  - (+)-limonene epoxide → one entity
BUT
  - Hexane-benzene → two entities
Using an ontology to determine when terms add information
• Genistein isoflavone → two entities
• Glycine ester → one entity
[Figure: genistein, showing the isoflavone core structure]
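The two examples above suggest a simple test: merge adjacent terms only when the second term is not already implied by the first. The sketch below assumes a toy is-a ontology (`TOY_IS_A`) and hypothetical helper names; the real ontology and merge rules used by LeadMine are not given on the slide.

```python
# Toy is-a ontology, assumed for illustration only.
TOY_IS_A = {
    "genistein": {"isoflavone"},
    "isoflavone": {"flavonoid"},
    "glycine": {"amino acid"},
}


def ancestors(term, is_a):
    """All classes reachable from term by following is-a links."""
    seen, stack = set(), [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def should_merge(first, second, is_a=TOY_IS_A):
    """Merge only if the second term adds information about the first."""
    return second.lower() not in ancestors(first.lower(), is_a)
```

Genistein is already an isoflavone, so "isoflavone" adds nothing and the two stay separate; glycine is not an ester, so "glycine ester" names something new and is merged.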
Abbreviation detection
• Based on the Schwartz and Hearst algorithm
• Detects abbreviations of the following forms:
  - Tetrahydrofuran (THF)
  - THF (tetrahydrofuran)
  - Tetrahydrofuran (THF;
  - Tetrahydrofuran (THF,
  - (tetrahydrofuran, THF)
  - THF = tetrahydrofuran
Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
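The core of the cited algorithm is a right-to-left character match between the short form and the text preceding it. The Python below is a sketch of that matching step plus a simplified candidate finder; it only handles the "long form (short form", "(short form;" and "(short form," patterns, not the reversed or "=" forms listed above, and the function names and length limits are assumptions rather than LeadMine's implementation.

```python
import re


def best_long_form(short_form, long_form):
    """Match short-form characters right to left inside long_form."""
    s, l = len(short_form) - 1, len(long_form) - 1
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():
            s -= 1
            continue
        # The first short-form character must start a word in the long form.
        while (l >= 0 and long_form[l].lower() != c) or (
            s == 0 and l > 0 and long_form[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]


def find_abbreviations(sentence):
    """Return (short form, long form) pairs for 'long (short...)' patterns."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = re.split(r"[;,]", m.group(1))[0].strip()
        if not (2 <= len(short) <= 10 and short[0].isalnum()
                and any(ch.isalpha() for ch in short)):
            continue
        words = sentence[:m.start()].split()
        max_words = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-max_words:]))
        if long_form:
            pairs.append((short, long_form))
    return pairs
```

For the sentence "The reaction was run in tetrahydrofuran (THF; anhydrous) overnight." this yields the single pair ("THF", "tetrahydrofuran").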
Domain-specific abbreviations
• Some abbreviations are not acronyms
• Can use string replacements to recognize them, e.g.
  - Sodium → Na
  - Estradiol → E2
Hence can recognize: 17α-ethinylestradiol → EE2
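To see why the replacements help, consider a deliberately simplified matcher that only asks whether the abbreviation's characters appear in order in the long form. This is a sketch, not the matching LeadMine uses; `REPLACEMENTS` holds just the two examples from the slide, and the function names are assumptions.

```python
import re

# The two replacements shown on the slide; a real table would be larger.
REPLACEMENTS = {"sodium": "Na", "estradiol": "E2"}


def apply_replacements(long_form, replacements=REPLACEMENTS):
    """Rewrite known names to their domain-specific symbols."""
    for name, symbol in replacements.items():
        long_form = re.sub(re.escape(name), symbol, long_form, flags=re.IGNORECASE)
    return long_form


def is_subsequence(short, long_form):
    """True if the characters of short occur in order within long_form."""
    remaining = iter(long_form.lower())
    return all(ch in remaining for ch in short.lower())
```

"EE2" cannot be matched against "17α-ethinylestradiol" directly because no "2" follows the second "e", but after rewriting to "17α-ethinylE2" the match succeeds.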
Non-entity abbreviation removal
• Finds recognized entities that the document defines as abbreviations of unrecognized (non-entity) terms
  - This can mean a common chemical abbreviation has been redefined in the scope of the document
Example: current good manufacturing practice (cGMP)
Usual meaning: cGMP = cyclic guanosine monophosphate [structure image]
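In code, this amounts to collecting the short forms whose long forms are not themselves entities and suppressing those short forms for the rest of the document. The sketch below assumes the abbreviation pairs have already been detected and that `is_entity` is a stand-in for the recognizer; both names are hypothetical.

```python
def remove_non_entity_abbreviations(entities, abbreviations, is_entity):
    """Drop entities the document redefines as non-chemical abbreviations.

    entities      : list of recognized entity strings
    abbreviations : list of (short form, long form) pairs found in the document
    is_entity     : callable deciding whether a long form is a chemical entity
    """
    blocked = {short for short, long_form in abbreviations
               if not is_entity(long_form)}
    return [entity for entity in entities if entity not in blocked]
```

If a document defines cGMP as "current good manufacturing practice", every later cGMP hit in that document is discarded, while documents that never redefine it are unaffected.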
Making the most of the knowledge provided
• Use training data to identify:
  - Terms that are not currently recognized (whitelist)
  - Terms that are often false positives (blacklist)
• Each false positive and false negative is placed into such a list if its inclusion increases the F-score (harmonic mean of precision and recall)
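The selection rule above can be sketched as a greedy loop over candidate terms. The bookkeeping is an assumption made for illustration: each whitelist candidate is described by how many missed annotations it would recover (`gained`) and how many new false positives it would introduce (`spurious`). A blacklist would work symmetrically, trading removed false positives against lost true positives.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_whitelist(candidates, tp, fp, fn):
    """Keep each candidate only if adding it raises the F-score.

    candidates maps term -> (gained, spurious) as counted on training data.
    """
    chosen = []
    best = f_score(tp, fp, fn)
    for term, (gained, spurious) in candidates.items():
        trial = f_score(tp + gained, fp + spurious, fn - gained)
        if trial > best:
            chosen.append(term)
            tp, fp, fn, best = tp + gained, fp + spurious, fn - gained, trial
    return chosen
```

A term that recovers one annotation but adds ten false positives lowers the F-score and is therefore left out.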
CEM task results (on development set)
Configuration           Precision  Recall  F-score
Baseline                0.87       0.82    0.84
WhiteList               0.86       0.85    0.86
BlackList               0.88       0.80    0.84
WhiteList + BlackList   0.87       0.83    0.85
CDI task ranking
• Uses precision of entities when running against the development set, with the results broken down by:
  - Title vs abstract?
  - Which dictionary matched?
  - Were the entity's bounds modified?
  - Did the entity occur more than once in the document?
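One plausible reading of this ranking scheme is: group development-set predictions by the four features, measure precision per group, and use that precision as the confidence for new entities with the same features. The sketch below follows that reading; the feature tuple layout and function names are assumptions.

```python
from collections import defaultdict


def bucket_precision(dev_predictions):
    """Precision per feature bucket.

    dev_predictions: iterable of (features, is_correct), where features is a
    tuple such as (section, dictionary, bounds_modified, repeated).
    """
    totals = defaultdict(lambda: [0, 0])  # features -> [correct, total]
    for features, is_correct in dev_predictions:
        totals[features][0] += int(is_correct)
        totals[features][1] += 1
    return {features: correct / total
            for features, (correct, total) in totals.items()}


def rank_entities(entities, precision, default=0.0):
    """Sort (text, features) pairs by the precision of their bucket."""
    return sorted(entities,
                  key=lambda entity: precision.get(entity[1], default),
                  reverse=True)
```

Entities whose feature combination never appeared in the development set fall back to the `default` score and sort last.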
Conclusions
• Grammars complement dictionaries to allow recognition of novel entities
• Both the coverage and the quality of dictionaries are important
• The meaning of novel abbreviations can be determined algorithmically
• Entities can be classified based on the resource that recognized them
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com