Advanced grammars for
state-of-the-art named
entity recognition (NER)
Roger Sayle and Daniel Lowe
NextMove Software, Cambridge, UK
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Overview
• NextMove Software’s LeadMine text-mining engine
internally uses “CaffeineFix” (.cfx) technology for
specifying and efficiently matching important terms.
• In addition to case-sensitive and case-insensitive
term matching CaffeineFix/LeadMine also support
spelling correction (fuzzy matching).
• The most common usage is to simply compile
dictionaries into binary form for fast matching.
• Advanced users can specify “regular expressions”.
• In this presentation, we go beyond REGEXPs.
LeadMine v2 entity types
1. Chemicals
   1.1 Dictionary Names
       1.1.1 Abbreviations
       1.1.2 CAS RN Numbers
       1.1.3 Registry Numbers
   1.2 Systematic Names
       1.2.1 Functional Groups
       1.2.2 Elements
       1.2.3 Acids
       1.2.4 SMILES
       1.2.5 InChIs
   1.3 Generic Classes
   1.4 Polymers
   1.5 Formulae
2. Biomolecules
   2.1 Proteins
       2.1.1 Targets
       2.1.2 P450s
   2.2 Genes
   2.3 E.C. Numbers
   2.4 PDB Codes
3. Anatomy
   3.1 Cell Types
   3.2 Cytogenetic Loci
4. Cell Lines
5. Diseases
6. Symptoms
7. Mechanisms of Action
8. Species/Organisms
9. Companies
10. Named Reactions
11. Regions
12. Languages/Possessives
Named entity normal forms
• Chemicals: SMILES and/or InChI
• Proteins: UniProt
• Genes: Entrez GeneID/HGNC
• Targets: ChEMBL
• Species/Organism: NCBI Taxonomy ID
• Diseases/Symptoms: ICD-10
• Named Reactions: RXNO
• Mechanism of Action: ATC
• Many of these can also use NLM MeSH Terms.
Example entity dictionary as DAG
• Nitrogen-containing heterocycles as minimal DFA:
– Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine,
Pyrimidine, Pyrazine
• CaffeineFix supports (very large) user dictionaries.
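The idea of compiling a word list into a minimal DFA can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a nested-dict trie and suffix merging by subtree signature); the function names are hypothetical and it is not how CaffeineFix itself is implemented.

```python
def build_trie(words):
    """Return a trie as nested dicts; the key "" marks end of word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root

def count_trie_states(node):
    """Count trie nodes, ignoring the end-of-word markers."""
    return 1 + sum(count_trie_states(child)
                   for ch, child in node.items() if ch)

def minimize(node, registry):
    """Merge identical subtrees bottom-up; return a canonical state id."""
    signature = tuple(sorted(
        (ch, minimize(child, registry) if ch else -1)
        for ch, child in node.items()
    ))
    return registry.setdefault(signature, len(registry))

words = ["pyrrole", "pyrazole", "imidazole", "pyridine",
         "pyridazine", "pyrimidine", "pyrazine"]
trie = build_trie(words)
registry = {}  # one entry per distinct state of the minimal DFA
minimize(trie, registry)
print(count_trie_states(trie), "trie states ->", len(registry), "DFA states")
```

Shared prefixes ("pyr…") are already merged by the trie; the minimization step additionally merges shared suffixes ("…azole", "…idine", "…azine"), which is where the saving over a plain trie comes from.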
OBO ontologies as dictionaries
• In addition to regular TSV (tab-separated value) files
for storing dictionaries, LeadMine’s obo2dict also
supports OBO ontologies, a convenient method for
tracking synonyms and foreign language forms.
[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
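As a rough illustration of how such a stanza can be flattened into term/identifier pairs, here is a small Python sketch. It is not LeadMine's obo2dict; the function name `obo_to_dict` and the choice to ignore synonym scopes and language tags are assumptions made for brevity.

```python
import re

SYNONYM = re.compile(r'^synonym:\s*"([^"]*)"')

def obo_to_dict(text):
    """Yield (term, id) pairs from the [Term] stanzas of OBO text."""
    term_id = None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            term_id = None                      # start of a new stanza
        elif line.startswith("id:"):
            term_id = line[3:].strip()
        elif line.startswith("name:") and term_id:
            yield line[5:].strip(), term_id
        else:
            m = SYNONYM.match(line)
            if m and term_id:
                yield m.group(1), term_id

stanza = '''[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
'''
for term, ident in obo_to_dict(stanza):
    print(f"{term}\t{ident}")   # one tab-separated dictionary line per term
```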
Plural form generation
• LeadMine’s pluralize automatically generates English
plural forms from singular dictionary entries.
diels-alder couplings RXNO:0000006
diels-alder cycloadditions RXNO:0000006
diels-alder reactions RXNO:0000006
acridine syntheses RXNO:0000518
acyclic beckmann rearrangements RXNO:0000564
acyloin condensations RXNO:0000085
olefin metatheses RXNO:0000280
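A much-simplified sketch of plural generation, covering only the patterns visible in the examples above, might look like the following. The rules and the function name are assumptions for illustration; LeadMine's own pluralize tool presumably handles many more cases.

```python
def pluralize(term):
    """Pluralize the final word of an English term (rough heuristic)."""
    if term.endswith("is"):                       # synthesis -> syntheses
        return term[:-2] + "es"
    if term.endswith(("s", "x", "z", "ch", "sh")):
        return term + "es"
    if term.endswith("y") and term[-2:-1] not in "aeiou":
        return term[:-1] + "ies"                  # assay keeps its "y"
    return term + "s"

for singular in ["diels-alder reaction", "acridine synthesis",
                 "olefin metathesis", "acyloin condensation"]:
    print(pluralize(singular))
```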
Unusual entities
• ISBN, URL, PubMed
• Roman Numerals, Date
• ColorState, Zip codes
• Katakana
• HELM, InChI, SMILES, v2000
• Credit Card Numbers
• Region
• Person
• Disease
• Journal
• SQL statement
• Solvent Mixture
• Hearst Patterns
• Unknown acid
• Unknown antibody
• Unknown disease
• Unknown INN
• Ordinal numbers
• Cardinal numbers
• de, es, fr, it, sv
Grammars within grammars
• LeadMine grammars are specified constructively,
effectively producing even more entity types.
• Region = City + Continent + Country + Island + Lake +
Mountain + Ocean + River + Sea + State/Province +
OtherFeature + OtherRegion.
• City = CityAlbania + CityAndorra + CityAustralia + CityAustria +
… + CityUS + …
• CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ +
CityUS_CA + CityUS_CO + …
Pharma registry numbers
• CaffeineFix v2.0 supports sets of user-defined
regular expressions as dictionaries.
• One application is specifying the format of
registry numbers, such as GSK204454A
• Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | …
• Number: d{3-7}
• Suffix: (“.” d) | [“a” .. “z”]
• RegistryNumber: Prefix [“ ” | “-”] Number [Suffix]
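The grammar above translates fairly directly into a conventional regular expression. The Python sketch below uses only the prefixes named on the slide, reads "d{3-7}" as three to seven digits, and accepts an upper-case suffix letter so that the example GSK204454A matches; all three are assumptions on my part rather than details of the CaffeineFix syntax.

```python
import re

PREFIXES = ["A", "AZ", "BMY", "GSK", "LY"]   # only the prefixes listed above
# Try longer prefixes first so "AZ" is not cut short to "A".
prefix = "|".join(sorted(map(re.escape, PREFIXES), key=len, reverse=True))

REGISTRY = re.compile(
    rf"\b(?:{prefix})"        # Prefix
    rf"[ -]?"                 # optional space or hyphen
    rf"\d{{3,7}}"             # Number: assumed to mean 3 to 7 digits
    rf"(?:\.\d|[A-Za-z])?\b"  # optional Suffix (upper case is an assumption)
)

text = "The compound GSK204454A was compared with LY-2835219."
print([m.group(0) for m in REGISTRY.finditer(text)])
```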
Cardinal numbers
• English
– One, ten, two thousand and forty eight, ten million
• German
– Eins, Zehn, Hundert, Million, Viermillion
– Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig
• French
– Trois cents, un mille, mille neuf cent quatre-vingts dix-huit
• Italian
– Uno, due, trenta, ottocentosessantamila settecentoottantanove
• Swedish
– en miljon trehundrasjuttiåtta tusen niohundrasjuttiett
CAS registry number grammar
• Two to seven digits, followed by a hyphen, two digits,
a hyphen and a final check digit
– e.g. 7732-18-5
• Regular Expression: (([1-9]\d{2,6})|([5-9]\d))-\d\d-\d
CAS check digit calculation
• More generally CaffeineFix’s finite state machines
can do limited processing...
• The final check digit of a CAS number is calculated by
series term summation modulo 10.
• Multiply the last digit before the check digit by 1,
the previous digit by 2, the one before that by 3, and
so on, then take the sum modulo 10.
• The CAS number for water is 7732-18-5.
• The checksum 5 is calculated as (1x8 + 2x1 + 3x2 +
4x3 + 5x7 + 6x7) mod 10 = 5.
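The calculation can be checked with a short Python sketch. The format pattern here simply encodes "two to seven digits, two digits, one check digit" from the prose description; the function names are hypothetical, and this ordinary code stands in for what the slides do inside a finite state machine.

```python
import re

# Assumption: two to seven leading digits, as in the prose description.
CAS_FORMAT = re.compile(r"[1-9]\d{1,6}-\d\d-\d")

def cas_check_digit(body_digits):
    """Weighted sum: rightmost digit x1, next x2, and so on, modulo 10."""
    weighted = (w * int(d) for w, d in enumerate(reversed(body_digits), 1))
    return sum(weighted) % 10

def is_valid_cas(text):
    """True if text has CAS RN format and a consistent check digit."""
    if not CAS_FORMAT.fullmatch(text):
        return False
    digits = text.replace("-", "")
    return cas_check_digit(digits[:-1]) == int(digits[-1])

print(cas_check_digit("773218"))   # water: 7732-18-?
print(is_valid_cas("7732-18-5"), is_valid_cas("7732-18-8"))
```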
FSM for matching CAS check digits
CAS number correction example
• 7732-18-8? Did you mean...
– 7732-18-5
– 7732-11-8
– 77328-18-8
– 7733-18-8
– 77342-18-8
– 77392-18-8
– 71732-18-8
– 76732-18-8
– 97732-18-8
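One way to reproduce this kind of suggestion list is to enumerate every string one digit edit away from the input and keep those that pass the format and checksum tests. The Python sketch below does this by brute force; it is an assumption about the general approach, not the finite-state fuzzy matching that CaffeineFix actually uses, and it may return candidates beyond those shown on the slide.

```python
import re

CAS_FORMAT = re.compile(r"[1-9]\d{1,6}-\d\d-\d")  # assumed: 2 to 7 lead digits

def is_valid_cas(text):
    if not CAS_FORMAT.fullmatch(text):
        return False
    digits = text.replace("-", "")
    total = sum(w * int(d) for w, d in enumerate(reversed(digits[:-1]), 1))
    return total % 10 == int(digits[-1])

def suggest(text):
    """Return valid CAS numbers exactly one digit edit away from text."""
    candidates = set()
    for i in range(len(text) + 1):
        for d in "0123456789":
            candidates.add(text[:i] + d + text[i:])           # insertion
            if i < len(text) and text[i].isdigit():
                candidates.add(text[:i] + d + text[i + 1:])   # substitution
        if i < len(text) and text[i].isdigit():
            candidates.add(text[:i] + text[i + 1:])           # deletion
    candidates.discard(text)
    return sorted(c for c in candidates if is_valid_cas(c))

print(suggest("7732-18-8"))
```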
Roman numerals
One useful operator is NonEmpty, which removes the empty string
from the set of valid matches and so requires at least one
character to match.
Thousands: M, MM, MMM
Hundreds: C, CC, CCC, CD, D, DC, DCC, DCCC, CM
Tens: X, XX, XXX, XL, L, LX, LXX, LXXX, XC
Units: I, II, III, IV, V, VI, VII, VIII, IX
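In conventional regular-expression terms, each of the four columns is optional, so their concatenation also accepts the empty string; that is the case NonEmpty has to exclude. A minimal Python sketch, with the emptiness check written out explicitly (the function name is hypothetical):

```python
import re

ROMAN = re.compile(
    r"M{0,3}"                  # thousands
    r"(?:CM|CD|D?C{0,3})"      # hundreds
    r"(?:XC|XL|L?X{0,3})"      # tens
    r"(?:IX|IV|V?I{0,3})"      # units
)

def is_roman(text):
    # The "NonEmpty" step: every group may match nothing, so reject "".
    return bool(text) and ROMAN.fullmatch(text) is not None

for candidate in ["MMXVII", "XLII", "IIII", ""]:
    print(repr(candidate), is_roman(candidate))
```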
Unknown acid
• Another operator allows wildcards with exceptions,
effectively a “not” operator.
• An unknown acid is “[a-z’-]+ acid” where the first
word excludes:
– Stop words: a, the, and, any, is, in, was, etc.
– Common qualifiers: acceptable, preferred, etc.
– Adjectives: battery, free, inorganic, strong, etc.
– Known acids: acetic, nitric, amino, carboxylic, etc.
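The "wildcard with exceptions" idea can be approximated in Python by matching the wildcard first and then filtering against an exclusion set. The exclusion list below contains only the words quoted above (the real lists are clearly much longer), and the function name is hypothetical.

```python
import re

EXCLUDED = {
    "a", "the", "and", "any", "is", "in", "was",   # stop words
    "acceptable", "preferred",                      # common qualifiers
    "battery", "free", "inorganic", "strong",       # adjectives
    "acetic", "nitric", "amino", "carboxylic",      # known acids
}
CANDIDATE = re.compile(r"\b([a-z'-]+) acid\b")

def unknown_acids(text):
    """Return '<word> acid' phrases whose first word is not excluded."""
    return [m.group(0) for m in CANDIDATE.finditer(text.lower())
            if m.group(1) not in EXCLUDED]

sentence = "Treated with strong acid, then zoledronic acid and acetic acid."
print(unknown_acids(sentence))
```

In a finite-state setting the exclusion is compiled into the automaton itself rather than applied as a post-filter, but the set of accepted strings is the same.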
Unknown INN
• A variation on this theme allows LeadMine to
recognize novel (recently announced) kinase
inhibitors and antibodies based on the structure of
their INN names.
• An unknown kinase inhibitor is “[a-z]+inib” and an
unknown antibody is “[a-z]+mab” where the words
exclude previously known/reported INN names and
“colliding” English words.
april != captopril, KappaB != rozrolimupab, yuletide != exenatide,
triumvir != zanamivir, etc.
Person grammar
• The named person grammar matches:
1. [Salutation] FirstName [Initials] Surname [Suffix]
2. [Salutation] FirstName [Initials] UnknownSurname [Suffix]
3. [Salutation] UnknownFirstName [Initials] Surname [Suffix]
• where
Salutation includes Mr., Mrs., Dr., Sir, His Highness, …
FirstName includes David, John, Sarah, Tom, Angela, …
Surname includes Smith, Jones, Overington, …
UnknownFirstName excludes Big, Lake, The, Outer, etc.
UnknownSurname excludes Avenue, Bridge, Street, etc.
List construction operator
• Another frequently used idiom is the set of operators
for constructing comma-separated lists.
• These turn the grammar matching “X” into the
grammar matching things like “X, X, X and X”.
• More specifically:
(X [“,” “ ”? X]* (“ and ” | “ or ” | “ and/or ”))? X
• Another variation of this allows “other”, “similar”
and “related” before the final X if the list is non-empty.
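One reading of the expression above (an optional comma-separated prefix ending in a conjunction, followed by a final X) can be written as a small regex-building helper in Python. The helper name `list_of` and the colour example are illustrative assumptions.

```python
import re

def list_of(x):
    """Regex for 'X', 'X and X', 'X, X and X', ... given a regex for X."""
    x = f"(?:{x})"
    return f"(?:{x}(?:, ?{x})*(?: and | or | and/or ))?{x}"

COLOUR_LIST = re.compile(list_of("red|green|blue"))

for phrase in ["red", "red and/or blue", "red, green and blue", "red, green"]:
    print(phrase, "->", COLOUR_LIST.fullmatch(phrase) is not None)
```

Because the helper takes a pattern and returns a pattern, it composes: the result of `list_of` can itself be embedded in a larger grammar, which is the property the slides rely on.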
Hearst pattern grammars
• An example use of list constructions is in the
recognition of Hearst patterns.
1. X such as Y [“including”, “especially” etc.]
2. Y and other X [“and related”, “or similar” etc.]
3. such X as Y
• where X is a category or classification term;
• and Y is a list of exemplified terms.
• Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings
of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992.
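The first pattern, "X such as Y", can be sketched in Python with single-word terms standing in for the entity grammars that X and Y would really be. This is a deliberately minimal illustration; the names `SUCH_AS` and `hearst_such_as` are hypothetical.

```python
import re

TERM = r"[a-z]+"                                   # stand-in for an entity
ITEMS = rf"(?:{TERM}(?:, {TERM})* (?:and|or) )?{TERM}"
SUCH_AS = re.compile(rf"({TERM}) such as ({ITEMS})")

def hearst_such_as(text):
    """Return (category, [examples]) pairs for 'X such as Y' matches."""
    return [(m.group(1), re.split(r", | and | or ", m.group(2)))
            for m in SUCH_AS.finditer(text.lower())]

print(hearst_such_as("Solvents such as methanol, ethanol and water are common."))
```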
Complex object builder
• An application of the list construction operator is in
our “complex object builder” construction operator.
ComplexObjectBuilder cob;
cob.insert(“red”, “lorry”, “lorries”);
cob.insert(“yellow”, “lorry”, “lorries”);
• Allows matching not only of
“red lorry”, “red lorries”, “yellow lorry” and “yellow lorries”
• But also of…
“red and yellow lorries”, “yellow and red lorries”, etc.
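To make the behaviour concrete, here is a toy Python stand-in that mimics the `insert` call shown above and adds a hypothetical `matches` method. It is not the real ComplexObjectBuilder API, and it is deliberately loose (for example, it does not check where commas and conjunctions appear).

```python
from collections import defaultdict

class ComplexObjectBuilder:
    """Toy stand-in for the builder named on the slide (not its real API)."""

    def __init__(self):
        self.modifiers = defaultdict(list)   # (singular, plural) -> modifiers

    def insert(self, modifier, singular, plural):
        self.modifiers[(singular, plural)].append(modifier)

    def matches(self, text):
        for (singular, plural), mods in self.modifiers.items():
            for head in (singular, plural):
                if not text.endswith(" " + head):
                    continue
                parts = text[: -len(head) - 1].replace(",", " ").split()
                words = [p for p in parts if p not in ("and", "or")]
                if not words or any(w not in mods for w in words):
                    continue
                # A coordinated list of modifiers needs the plural head.
                if len(words) == 1 or head == plural:
                    return True
        return False

cob = ComplexObjectBuilder()
cob.insert("red", "lorry", "lorries")
cob.insert("yellow", "lorry", "lorries")
print(cob.matches("red lorry"), cob.matches("red and yellow lorries"))
```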
Complex disease examples
• Adenomatous polyps of the colon and rectum.
• Fibroepithelial or epithelial hyperplasias.
• Inherited spinocerebellar ataxia.
• Stage II or stage III colorectal cancer.
• Inherited breast and ovarian cancers.
• Argentinian, Bolivian and Korean haemorrhagic
fevers.
• Dermatitis due to heat, cold, radiation, cosmetics,
fungi and shellfish.
Grammars for Safety text mining
• “May cause lung damage if swallowed”
– “may” → “can”, “could”, “may”, “might”, “will”, etc.
– “cause” → “lead to”, “result in”, “trigger”, “bring on”, …
– “lung damage” → “explosion”, “cancer”, “injury”, …
– “if” → “when”, “once”…
– “swallowed” → “heated”, “shaken”, “dried”, “ignited”…
• “Highly toxic”
– “highly” → “very”, “extremely”, “unusually”, “intensely”…
– “toxic” → “explosive”, “carcinogenic”, “poisonous”…
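The slot-substitution idea behind the first phrase can be sketched as a regular expression assembled from alternative lists. Only the alternatives quoted above are included, and the helper and variable names are assumptions for illustration.

```python
import re

def alternatives(*options):
    """Build a non-capturing alternation from literal phrases."""
    return "(?:" + "|".join(re.escape(o) for o in options) + ")"

MODAL = alternatives("can", "could", "may", "might", "will")
VERB = alternatives("cause", "lead to", "result in", "trigger", "bring on")
EFFECT = alternatives("lung damage", "explosion", "cancer", "injury")
COND = alternatives("if", "when", "once")
EVENT = alternatives("swallowed", "heated", "shaken", "dried", "ignited")

HAZARD = re.compile(rf"{MODAL} {VERB} {EFFECT} {COND} {EVENT}", re.IGNORECASE)

for phrase in ["May cause lung damage if swallowed",
               "Might result in explosion when heated",
               "May contain nuts"]:
    print(phrase, "->", HAZARD.search(phrase) is not None)
```

With five slots of three to five alternatives each, this one template already covers over a thousand distinct phrasings.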
Efficient protein variant naming
• CaffeineFix technology can also be applied to naming
peptides and arbitrary protein variants/mutants.
• Consider a database of the following 11 peptides:
– CFFQNCPRG phenylpressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin
DAG representation of sequences
These 11 peptides may be efficiently represented and
searched as a “directed acyclic graph” [38 vs. 99 states]
Entirety of UniProt/Swiss-Prot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which contains over
192M amino acids, require 142M states (1.4Gb).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin 46AA) is
canonically named as [L25I]P01542 in 0.002s.
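The naming convention itself, [original residue, position, new residue] followed by the reference identifier, can be illustrated on the 11-peptide database listed earlier. The Python sketch below uses a plain linear scan and considers substitutions only; it stands in for the DAG search purely to show the output format, and the function name is hypothetical.

```python
PEPTIDES = {
    "CFFQNCPRG": "phenylpressin", "CFVRNCPTG": "annetocin",
    "CFWTSCPIG": "octopressin",   "CYFQNCPRG": "argipressin",
    "CYFQNCPKG": "lypressin",     "CYFRNCPIG": "cephalotocin",
    "CYIQNCPLG": "oxytocin",      "CYIQNCPPG": "prol-oxytocin",
    "CYIQNCPRG": "vasotocin",     "CYIQSCPIG": "seritocin",
    "CYISNCPIG": "isotocin",
}

def name_variant(query, database):
    """Name query relative to the closest same-length database entry."""
    def diffs(ref):
        # (reference residue, 1-based position, query residue)
        return [(r, i, q)
                for i, (r, q) in enumerate(zip(ref, query), 1) if r != q]

    candidates = [seq for seq in database if len(seq) == len(query)]
    best = min(candidates, key=lambda seq: len(diffs(seq)))
    changes = ",".join(f"{r}{i}{q}" for r, i, q in diffs(best))
    return f"[{changes}]{database[best]}" if changes else database[best]

print(name_variant("CYIQNCPLG", PEPTIDES))   # exact match
print(name_variant("CYIQNCPLA", PEPTIDES))   # one substitution
```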
Application to precision medicine
• A more realistic example is that the sequence of the
gene “spastic paraplegia 4” with six mutations from
OMIM:604277 can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with -num_descriptions 1 -num_alignments 1.
Summary
• LeadMine’s .cfx files can do far more than efficiently
match very large dictionaries of terms.
• Indeed, many of the grammars used at NextMove
Software potentially match an infinite number of
terms.
• Construction of domain specific grammars can be
done in collaboration with LeadMine customers.
Chemical structure representation in PubChem
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be useful
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)
 

Advanced grammars for state-of-the-art named entity recognition (NER)

  • 1. Advanced grammars for state-of-the-art named entity recognition (NER). Roger Sayle and Daniel Lowe, NextMove Software, Cambridge, UK. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 2. Overview • NextMove Software’s LeadMine text-mining engine internally uses “CaffeineFix” (.cfx) technology for specifying and efficiently matching important terms. • In addition to case-sensitive and case-insensitive term matching, CaffeineFix/LeadMine also support spelling correction (fuzzy matching). • The most common usage is simply to compile dictionaries into binary form for fast matching. • Advanced users specify “regular expressions”. • In this presentation, we go beyond REGEXPs.
  • 3. LeadMine v2 entity types 1. Chemicals 2. Biomolecules 3. Anatomy 4. Cell Lines 5. Diseases 6. Symptoms 7. Mechanisms of Action 8. Species/Organisms 9. Companies 10. Named Reactions 11. Regions 12. Languages/Possessives 1.1 Dictionary Names 1.2 Systematic Names 1.3 Generic Classes 1.4 Polymers 1.5 Formulae 2.1 Proteins 2.2 Genes 2.3 E.C. Numbers 2.4 PDB Codes 3.1 Cell Types 3.2 Cytogenetic Loci 1.1.1 Abbreviations 1.1.2 CAS RN Numbers 1.1.3 Registry Numbers 1.2.1 Functional Groups 1.2.2 Elements 1.2.3 Acids 1.2.4 SMILES 1.2.5 InChIs 2.1.1 Targets 2.1.2 P450s
  • 4. Named entity normal forms • Chemicals: SMILES and/or InChI • Proteins: UniProt • Genes: Entrez GeneID/HGNC • Targets: ChEMBL • Species/Organism: NCBI Taxonomy ID • Diseases/Symptoms: ICD-10 • Named Reactions: RXNO • Mechanism of Action: ATC • Many of these can also use NLM MeSH Terms.
  • 5. Example entity dictionary as DAG • Nitrogen-containing heterocycles as minimal DFA: – Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine, Pyrimidine, Pyrazine • CaffeineFix supports (very large) user dictionaries.
  • 6. OBO ontologies as dictionaries • In addition to regular TSV (tab-separated value) files for storing dictionaries, LeadMine’s obo2dict also supports OBO ontologies, a convenient method for tracking synonyms and foreign language forms. [Term] id: RXNO:0000006 name: Diels-Alder reaction synonym: "Diels-Alder cycloaddition" EXACT [] synonym: "ディールス・アルダー反応" EXACT Japanese []
  • 7. Plural form generation • LeadMine’s pluralize automatically generates English plural forms from singular dictionary entries. diels-alder couplings RXNO:0000006 diels-alder cycloadditions RXNO:0000006 diels-alder reactions RXNO:0000006 acridine syntheses RXNO:0000518 acyclic beckmann rearrangements RXNO:0000564 acyloin condensations RXNO:0000085 olefin metatheses RXNO:0000280
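The plural forms listed on this slide can be approximated with a handful of suffix rules. The sketch below is only an illustration in Python: the function name `pluralize` is borrowed from the slide, but the rules are assumptions and not LeadMine's actual implementation, which presumably handles many more irregular forms.

```python
def pluralize(term: str) -> str:
    """Pluralize the last word of an English term with a few suffix rules.

    Hypothetical rule set for illustration; irregular plurals are not handled.
    """
    if term.endswith("is"):
        # Greek-derived nouns: synthesis -> syntheses, metathesis -> metatheses
        return term[:-2] + "es"
    if term.endswith(("s", "x", "z", "ch", "sh")):
        # Sibilant endings take "-es"
        return term + "es"
    if term.endswith("y") and len(term) > 1 and term[-2] not in "aeiou":
        # Consonant + y: therapy -> therapies
        return term[:-1] + "ies"
    return term + "s"


for singular in ("diels-alder reaction", "acridine synthesis", "olefin metathesis"):
    print(singular, "->", pluralize(singular))
```

Each generated plural would keep the identifier of its singular entry (for example RXNO:0000006), so the dictionary grows without any change to the normal forms.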
  • 8. Unusual entities • ISBN, URL, PubMed • Roman Numerals, Date • Color, State, Zip codes • Katakana • HELM, InChI, SMILES, v2000 • Credit Card Numbers • Region • Person • Disease • Journal • SQL statement • Solvent Mixture • Hearst Patterns • Unknown acid • Unknown antibody • Unknown disease • Unknown INN • Ordinal numbers • Cardinal numbers • de, es, fr, it, sv
  • 9. Grammars within grammars • LeadMine grammars are specified constructively, effectively producing even more entity types. • Region = City + Continent + Country + Island + Lake + Mountain + Ocean + River + Sea + State/Province + OtherFeature + OtherRegion. • City = CityAlbania + CityAndorra + CityAustralia + CityAustria + … + CityUS + … • CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ + CityUS_CA + CityUS_CO + …
  • 10. Pharma registry numbers • CaffeineFix v2.0 supports sets of user-defined regular expressions as dictionaries. • One application is specifying the format of registry numbers, such as GSK204454A • Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | … • Number: \d{3-7} • Suffix: (“.” \d) | [“a” .. “z”] • RegistryNumber: Prefix [“ ” | “-”] Number [Suffix]
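The registry-number grammar on this slide translates fairly directly into an ordinary regular expression. The Python sketch below uses only the five prefixes the slide spells out (the slide's list continues with “…”), and it assumes case-insensitive matching so that the slide's own example, GSK204454A, is accepted by the lower-case suffix rule; both choices are assumptions made for illustration.

```python
import re

# Only the prefixes named on the slide; the real list is longer.
PREFIXES = ["A", "AZ", "BMY", "GSK", "LY"]

# Longest prefixes first so that "AZ" is tried before "A".
_prefix = "(?:" + "|".join(sorted(PREFIXES, key=len, reverse=True)) + ")"

REGISTRY_NUMBER = re.compile(
    _prefix
    + r"[ -]?"             # optional space or hyphen separator
    + r"\d{3,7}"           # Number: three to seven digits
    + r"(?:\.\d|[a-z])?",  # optional Suffix: ".digit" or a single letter
    re.IGNORECASE,         # assumption: case is ignored
)


def is_registry_number(text: str) -> bool:
    """Return True if the whole string has the registry-number shape."""
    return REGISTRY_NUMBER.fullmatch(text) is not None


print(is_registry_number("GSK204454A"))
print(is_registry_number("AZ-1234"))
```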
  • 11. Cardinal numbers • English – One, ten, two thousand and forty eight, ten million • German – Eins, Zehn, Hundert, Million, Viermillion – Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig • French – Trois cents, un mille, mille neuf cent quatre-vingts dix-huit • Italian – Uno, due, trenta, ottocentosessantamila settecentoottantanove • Swedish – en miljon trehundrasjuttiåtta tusen niohundrasjuttiett
  • 12. CAS registry number grammar • Two to seven digits, followed by a hyphen, two digits, a hyphen and a final check digit – e.g. 7732-18-5 • Regular Expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d
  • 13. CAS check digit calculation • More generally, CaffeineFix’s finite state machines can do limited processing... • The final check digit of a CAS number is calculated as a weighted sum modulo 10. • The last digit before the check digit is multiplied by 1, the digit before it by 2, the digit before that by 3, and so on; the check digit is the sum modulo 10. • The CAS number for water is 7732-18-5. • The checksum 5 is calculated as (1×8 + 2×1 + 3×2 + 4×3 + 5×7 + 6×7) mod 10 = 105 mod 10 = 5.
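The check-digit rule on this slide can be written out in a few lines of Python. This is a plain procedural sketch rather than the finite state machine that CaffeineFix compiles, and the function names are hypothetical.

```python
def cas_check_digit(body_digits: str) -> int:
    """Compute the CAS check digit from all digits before the check digit.

    The rightmost digit has weight 1, the next one weight 2, and so on.
    """
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def has_valid_check_digit(cas_number: str) -> bool:
    """Check the final digit of a hyphenated CAS number such as 7732-18-5."""
    digits = cas_number.replace("-", "")
    return cas_check_digit(digits[:-1]) == int(digits[-1])


# Water: (1*8 + 2*1 + 3*2 + 4*3 + 5*7 + 6*7) = 105, and 105 mod 10 = 5
print(cas_check_digit("773218"))
print(has_valid_check_digit("7732-18-5"))
```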
  • 14. FSM for matching CAS check digits
  • 15. FSM for matching CAS check digits
  • 16. CAS number correction example • 7732-18-8? Did you mean... – 7732-18-5 – 7732-11-8 – 77328-18-8 – 7733-18-8 – 77342-18-8 – 77392-18-8 – 71732-18-8 – 76732-18-8 – 97732-18-8
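One way to picture this correction step is as a brute-force search: generate every string one edit (insertion, deletion or substitution) away from the input and keep those that are well-formed CAS numbers with a correct check digit. The Python sketch below does exactly that. It is not how CaffeineFix works internally (the slides describe fuzzy matching against a finite state machine), and the format pattern used here is an assumption.

```python
import re

# Assumed shape: 2-7 digits (no leading zero), hyphen, 2 digits, hyphen, check digit.
CAS_FORMAT = re.compile(r"[1-9]\d{1,6}-\d\d-\d")


def is_valid_cas(text: str) -> bool:
    """Well-formed and the weighted digit sum modulo 10 equals the check digit."""
    if not CAS_FORMAT.fullmatch(text):
        return False
    digits = [int(c) for c in text if c.isdigit()]
    body, check = digits[:-1], digits[-1]
    total = sum(w * d for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def single_edit_variants(text: str, alphabet: str = "0123456789-") -> set:
    """All strings exactly one insertion, deletion or substitution away."""
    variants = set()
    for i in range(len(text) + 1):
        for ch in alphabet:
            variants.add(text[:i] + ch + text[i:])          # insertion
    for i in range(len(text)):
        variants.add(text[:i] + text[i + 1:])                # deletion
        for ch in alphabet:
            variants.add(text[:i] + ch + text[i + 1:])       # substitution
    variants.discard(text)
    return variants


def suggest_cas(text: str) -> list:
    """Valid CAS numbers within one edit of the input, sorted as strings."""
    return sorted(v for v in single_edit_variants(text) if is_valid_cas(v))


for candidate in suggest_cas("7732-18-8"):
    print(candidate)
```

Under these assumptions, the search for 7732-18-8 returns the same nine candidates listed on the slide: three substitutions and six single-digit insertions.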
  • 17. Roman numerals • One useful operator is NonEmpty, which removes the empty string from the set of valid matches and so requires at least one character to match. • Thousands: M, MM, MMM • Hundreds: C, CC, CCC, CD, D, DC, DCC, DCCC, CM • Tens: X, XX, XXX, XL, L, LX, LXX, LXXX, XC • Units: I, II, III, IV, V, VI, VII, VIII, IX
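The need for NonEmpty is easy to see in regular-expression form: if each of the four positions (thousands, hundreds, tens, units) is optional, the concatenation also matches the empty string. The Python sketch below emulates NonEmpty with a lookahead; that is a regex workaround chosen for illustration, not the CaffeineFix operator itself.

```python
import re

ROMAN = re.compile(
    r"(?=[MDCLXVI])"        # NonEmpty: at least one numeral must follow
    r"M{0,3}"               # thousands: "", M, MM, MMM
    r"(?:CM|CD|D?C{0,3})"   # hundreds:  "", C ... CM
    r"(?:XC|XL|L?X{0,3})"   # tens:      "", X ... XC
    r"(?:IX|IV|V?I{0,3})"   # units:     "", I ... IX
)


def is_roman(text: str) -> bool:
    """True for well-formed Roman numerals from I to MMMCMXCIX."""
    return ROMAN.fullmatch(text) is not None


print(is_roman("MMXVII"))
print(is_roman(""))
```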
  • 18. Unknown acid • Another operator allows wildcards with exceptions, effectively a “not” operator. • An unknown acid is “[a-z’-]+ acid” where the first word excludes: – Stop words: a, the, and, any, is, in, was, etc. – Common qualifiers: acceptable, preferred, etc. – Adjectives: battery, free, inorganic, strong, etc. – Known acids: acetic, nitric, amino, carboxylic, etc.
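The “wildcard with exceptions” idea can be imitated in Python by matching the wildcard first and then filtering out excluded first words. In CaffeineFix the exclusion is presumably built into the automaton; the two-step version below is only a sketch, and its exclusion set contains just the examples named on the slide.

```python
import re

# Only the example exclusions from the slide; real lists would be far longer.
EXCLUDED_FIRST_WORDS = {
    "a", "the", "and", "any", "is", "in", "was",     # stop words
    "acceptable", "preferred",                        # common qualifiers
    "battery", "free", "inorganic", "strong",         # adjectives
    "acetic", "nitric", "amino", "carboxylic",        # known acids
}

CANDIDATE_ACID = re.compile(r"\b([a-z'-]+) acid\b")


def find_unknown_acids(text: str) -> list:
    """Return '<word> acid' phrases whose first word is not excluded."""
    return [
        match.group(0)
        for match in CANDIDATE_ACID.finditer(text.lower())
        if match.group(1) not in EXCLUDED_FIRST_WORDS
    ]


print(find_unknown_acids("Battery acid and nitric acid were replaced by ferulic acid."))
```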
  • 19. Unknown INN • A variation on this theme allows LeadMine to recognize novel (recently announced) kinase inhibitors and antibodies based on the structure of their INN names. • An unknown kinase inhibitor is “[a-z]+inib” and an unknown antibody is “[a-z]+mab” where the words exclude previously known/reported INN names and “colliding” English words. april != captopril, KappaB != rozrolimupab, yuletide != exenatide, triumvir != zanamivir, etc.
  • 20. Person grammar • The named person grammar matches: 1. [Salutation] FirstName [Initials] Surname [Suffix] 2. [Salutation] FirstName [Initials] UnknownSurname [Suffix] 3. [Salutation] UnknownFirstName [Initials] Surname [Suffix] • where Salutation includes Mr., Mrs., Dr., Sir, His Highness, … FirstName includes David, John, Sarah, Tom, Angela, … Surname includes Smith, Jones, Overington, … UnknownFirstName excludes Big, Lake, The, Outer, etc. UnknownSurname excludes Avenue, Bridge, Street, etc.
  • 21. List construction operator • Another frequently used idiom is the set of operators for constructing comma-separated lists. • These turn the grammar matching “X” into the grammar matching things like “X, X, X and X”. • More specifically: X [ “,” “ ”? X]* (“ and ” | “ or ” | “ and/or ”)? X • Another variation of this allows “other”, “similar” and “related” to precede the final X if the list is non-empty.
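The list construction can be sketched as a function that takes the pattern for one item and returns the pattern for a list of items. The Python version below is one reading of the slide's expression: it makes the final conjunction and item optional as a unit, so that a single “X” still matches. The helper name `list_of` is hypothetical.

```python
import re


def list_of(item_pattern: str) -> str:
    """Turn a regex for X into a regex for 'X', 'X and X', 'X, X, X or X', ..."""
    x = f"(?:{item_pattern})"
    conjunction = "(?: and | or | and/or )"
    # First item, any number of comma-separated items, optional final conjunct.
    return f"{x}(?:, ?{x})*(?:{conjunction}{x})?"


COLOUR_LIST = re.compile(list_of("red|green|blue|yellow"))

for text in ("red", "red, green, blue and yellow", "red and/or blue", "red, and"):
    print(repr(text), COLOUR_LIST.fullmatch(text) is not None)
```

Because the operator works on patterns rather than strings, it composes with any other grammar: the same function could wrap the unknown-acid or registry-number patterns.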
  • 22. Hearst pattern grammars • An example use of list constructions is in the recognition of Hearst patterns. 1. X such as Y [“including”, “especially” etc.] 2. Y and other X [“and related”, “or similar” etc.] 3. such X as Y • where X is a category or classification term, • and Y is a list of exemplified terms. • Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992.
  • 23. Complex object builder • An application of the list construction operator is in our “complex object builder” construction operator. ComplexObjectBuilder cob; cob.insert("red", "lorry", "lorries"); cob.insert("yellow", "lorry", "lorries"); • Allows matching not only of “red lorry”, “red lorries”, “yellow lorry” and “yellow lorries” • But also of… “red and yellow lorries”, “yellow and red lorries”, etc.
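To make the behaviour of the complex object builder concrete, here is a small Python imitation. Only the class name and the three-argument `insert` come from the slide; the `pattern` and `matches` methods, and the rule that a shared head noun must be plural, are assumptions made for this sketch.

```python
import re
from collections import defaultdict


class ComplexObjectBuilder:
    """Toy imitation: modifiers that share a head noun may be coordinated."""

    def __init__(self):
        # (singular, plural) -> list of modifiers registered for that head noun
        self._modifiers = defaultdict(list)

    def insert(self, modifier: str, singular: str, plural: str) -> None:
        self._modifiers[(singular, plural)].append(modifier)

    def pattern(self) -> re.Pattern:
        alternatives = []
        for (singular, plural), modifiers in self._modifiers.items():
            mod = "(?:" + "|".join(map(re.escape, modifiers)) + ")"
            head = f"(?:{re.escape(singular)}|{re.escape(plural)})"
            # "red lorry", "red lorries"
            simple = f"{mod} {head}"
            # "red and yellow lorries": a modifier list sharing one plural head
            shared = f"{mod}(?:, ?{mod})*(?: and | or | and/or ){mod} {re.escape(plural)}"
            alternatives += [shared, simple]
        return re.compile("|".join(alternatives))

    def matches(self, text: str) -> bool:
        return self.pattern().fullmatch(text) is not None


cob = ComplexObjectBuilder()
cob.insert("red", "lorry", "lorries")
cob.insert("yellow", "lorry", "lorries")
print(cob.matches("red and yellow lorries"))
```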
  • 24. Complex disease examples • Adenomatous polyps of the colon and rectum. • Fibroepithelial or epithelial hyperplasias. • Inherited spinocerebellar ataxia. • Stage II or stage III colorectal cancer. • Inherited breast and ovarian cancers. • Argentinian, Bolivian and Korean haemorrhagic fevers. • Dermatitis due to heat, cold, radiation, cosmetics, fungi and shellfish.
  • 25. Grammars for safety text mining • “May cause lung damage if swallowed” – “may” → “can”, “could”, “may”, “might”, “will”, etc. – “cause” → “lead to”, “result in”, “trigger”, “bring on”, … – “lung damage” → “explosion”, “cancer”, “injury”, … – “if” → “when”, “once”… – “swallowed” → “heated”, “shaken”, “dried”, “ignited”… • “Highly toxic” – “highly” → “very”, “extremely”, “unusually”, “intensely”… – “toxic” → “explosive”, “carcinogenic”, “poisonous”…
  • 26. Efficient protein variant naming • CaffeineFix technology can also be applied to naming peptides and arbitrary protein variants/mutants. • Consider a database of the following 11 peptides: – CFFQNCPRG phenylpressin – CFVRNCPTG annetocin – CFWTSCPIG octopressin – CYFQNCPRG argipressin – CYFQNCPKG lypressin – CYFRNCPIG cephalotocin – CYIQNCPLG oxytocin – CYIQNCPPG prol-oxytocin – CYIQNCPRG vasotocin – CYIQSCPIG seritocin – CYISNCPIG isotocin
  • 27. DAG representation of sequences • These 11 peptides may be efficiently represented and searched as a “directed acyclic graph” [38 vs. 99 states]
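The saving comes from sharing both common prefixes and common suffixes. A minimal Python sketch of the idea: build a prefix tree (trie) from the peptides, then merge every pair of nodes that accept exactly the same set of remaining suffixes. This is a textbook construction offered as an illustration; it is not claimed to be the CaffeineFix compiler, and its state counts depend on counting conventions, so they are printed rather than compared with the figures on the slide.

```python
PEPTIDES = [
    "CFFQNCPRG", "CFVRNCPTG", "CFWTSCPIG", "CYFQNCPRG", "CYFQNCPKG", "CYFRNCPIG",
    "CYIQNCPLG", "CYIQNCPPG", "CYIQNCPRG", "CYIQSCPIG", "CYISNCPIG",
]


def build_trie(words):
    """Prefix tree: each node records whether it is final and its children."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_trie_states(node):
    return 1 + sum(count_trie_states(child) for child in node["next"].values())


def minimize(node, registry):
    """Give equivalent nodes (same finality, same outgoing edges) one shared id."""
    edges = tuple(sorted((ch, minimize(child, registry))
                         for ch, child in node["next"].items()))
    return registry.setdefault((node["final"], edges), len(registry))


def to_dag(words):
    registry = {}
    start = minimize(build_trie(words), registry)
    states = {state_id: key for key, state_id in registry.items()}
    return start, states


def accepts(start, states, word):
    """Follow the word through the DAG and report whether it ends in a final state."""
    state = start
    for ch in word:
        state = dict(states[state][1]).get(ch)
        if state is None:
            return False
    return states[state][0]


start, states = to_dag(PEPTIDES)
print("trie states:", count_trie_states(build_trie(PEPTIDES)))
print("DAG states: ", len(states))
```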
  • 28. Entirety of UniProt/Swiss-Prot • Using this representation, all 540546 protein sequences in uniprot_sprot, which together contain over 192M amino acids, require 142M states (1.4Gb). • This data structure allows close analogues to be identified much faster than using NCBI blastp. • For example, all 540546 sequences can be queried against this database (i.e. all-against-all) in ~9m30s on a single core on a laptop. • The sequence from PDB 1CRN (crambin 46AA) is canonically named as [L25I]P01542 in 0.002s.
  • 29. Application to precision medicine • A more realistic example is that the sequence of the gene “spastic paraplegia 4” with six mutations from OMIM:604277 can be canonically named as [I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0 • Run-time for this query is 0.2s. • By comparison, blastp 2.2.29+ takes about 6s. – With default arguments, NCBI blastp run time is 7s. – Only 6s with -num_descriptions 1 -num_alignments 1.
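The canonical names above follow the pattern [mutations]accession, where each mutation appears to be written as reference residue, 1-based position, variant residue. Once the closest reference sequence is known, producing the name is a simple comparison. The Python sketch below covers only that last step, for substitutions between sequences of equal length; finding the nearest reference in the DAG is the hard part and is not shown. The accession `REF1` is a placeholder, not a real UniProt identifier.

```python
def name_variant(query: str, reference: str, accession: str) -> str:
    """Name a substitution-only variant relative to a reference sequence."""
    if len(query) != len(reference):
        raise ValueError("this sketch handles substitutions only (equal lengths)")
    mutations = [
        f"{ref_aa}{position}{query_aa}"
        for position, (ref_aa, query_aa) in enumerate(zip(reference, query), start=1)
        if ref_aa != query_aa
    ]
    # An exact match is named by the accession alone.
    return f"[{','.join(mutations)}]{accession}" if mutations else accession


# Vasotocin (CYIQNCPRG) described relative to oxytocin (CYIQNCPLG)
print(name_variant("CYIQNCPRG", "CYIQNCPLG", "REF1"))
```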
  • 30. Summary • LeadMine’s .cfx files can do far more than efficiently match very large dictionaries of terms. • Indeed, many of the grammars used at NextMove Software potentially match an infinite number of terms. • Construction of domain-specific grammars can be done in collaboration with LeadMine customers.