ChemAxon’s Naming provides reliable conversion between English chemical names and chemical structures. It is the underlying technology used in ChemAxon’s chemical text mining tool D2S (Document-to-Structure), JChem for SharePoint, and Chemicalize.org. This presentation highlights the latest enhancements, including: Chinese chemical name recognition to accommodate the fast-growing Chinese scientific literature; custom corporate ID to structure conversion using a web service; and database indexing of structures from document repositories.
7. Document to Structure
• Extract chemical information from documents
– Names, SMILES, InChI, CAS number …
– Embedded objects
– Structure images: OSRA supported currently; multiple OSR engines (CLiDE, Imago…) in 6.1
– Works with scanned non-searchable PDF
– Returns structures and their locations in the document
– Corrects OCR errors
– Supported formats: PDF, text, XML, HTML, MS Office documents (doc, docx, ppt, pptx, xls, xlsx), OpenOffice
[Diagram: non-searchable chemical patent documents → D2S → structures (text + image) + location]
12. JChem for SharePoint
• SharePoint 2010 and 2007
– Sketch, import/export, store structures
– Structure search
– Calculate properties and naming
– Filtering and sorting
• New improvements
– Index and search chemical information in documents
• Text
• Embedded structure object
– Connect SharePoint to your chemical database
13. Free Online Service Chemicalize.org
• Extract
• Interactively display
• Calculate
• Search
Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–615
17. Name to Structure
• 2-(乙酰氧基)苯甲酸
• 阿司匹林
• 2-(acetyloxy)benzoic acid
• Aspirin
• Acetylsalicylate
• Easprin …
• 50-78-2
[Diagram: the names above (Chinese and English) → N2S → structure]
18. In Fact, Even without CN2S…
[Diagram: 阿司匹林 → N2S with a customized dictionary, without CN2S]
19. Customized Dictionary
• A SMILES file “custom_names.smi”
• Default location ChemAxon DIR
e.g. in Windows 7: C:\Users\USERNAME\chemaxon
• Format
SMILES Tab ANY text string
c1ccccc1 CXN000001
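The format above (a SMILES string, a tab, then any text string) is simple enough to read with a few lines of code. The following is a minimal Python sketch, assuming the file is plain text with one entry per line. The function name parse_custom_names is hypothetical and is not part of any ChemAxon API.

```python
import io


def parse_custom_names(lines):
    """Map each custom name to its SMILES, reading 'SMILES<TAB>name' lines."""
    name_to_smiles = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so the name may be any text string.
        smiles, sep, name = line.partition("\t")
        if not sep or not name.strip():
            raise ValueError(f"expected 'SMILES<TAB>name', got: {line!r}")
        name_to_smiles[name.strip()] = smiles.strip()
    return name_to_smiles


# The single entry shown on the slide: benzene under a corporate ID.
sample = io.StringIO("c1ccccc1\tCXN000001\n")
lookup = parse_custom_names(sample)
print(lookup["CXN000001"])
```

Keying the dictionary by name rather than by SMILES mirrors the direction of the conversion: a corporate ID found in a document is looked up to obtain a structure.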
21. Customized Dictionary
• A SMILES file “custom_names.smi”
• Default location ChemAxon DIR
e.g. in Windows 7: C:\Users\USERNAME\chemaxon
• Format
SMILES Tab ANY text string
• From version 6.0, a custom web service can also be used
c1ccccc1 CXN000001
28. The Challenges
3. English: name alterations
丁烷 → buta + ane → butane
4. Chinese: many characters have different meanings depending on context
盐 = salt
酸 = acid
盐酸 = hydrochloric acid
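One common way to handle this kind of ambiguity is to prefer the longest dictionary match, so that 盐酸 is read as one term rather than as 盐 followed by 酸. The Python sketch below illustrates the idea with the three entries from this slide. It is not ChemAxon's actual algorithm, and the names GLOSSARY and segment_longest_match are hypothetical.

```python
# Three entries taken from the slide; a real glossary would be far larger.
GLOSSARY = {"盐": "salt", "酸": "acid", "盐酸": "hydrochloric acid"}


def segment_longest_match(text, glossary):
    """Greedily split text into glossary terms, trying the longest term first."""
    max_len = max(len(term) for term in glossary)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate at position i, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in glossary:
                pieces.append((candidate, glossary[candidate]))
                i += size
                break
        else:
            raise ValueError(f"no glossary entry at position {i}: {text[i]!r}")
    return pieces


print(segment_longest_match("盐酸", GLOSSARY))
```

A character-by-character reading of the same input would yield "salt" and "acid" separately, which is the misreading the slide warns about.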
29. The Challenges
5. Chinese names are usually abbreviated
苯 = benzene
苯基 = phenyl
There are many other challenges. Overall solution:
Make our N2S more tolerant of mistakes
30. Accuracy Result
• Test set: 38,600 Chinese names + CAS numbers
• Contains unusual, incorrect, and ambiguous names, radicals, inorganic salts
• Conversion rate = 58–78%
• Accuracy = 91%
• Looking for another test set from Chinese patents
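To put the percentages on this slide in absolute terms, the short Python calculation below converts them to name counts. It assumes, which the slide does not state, that accuracy is measured over the names that were successfully converted. The function name estimate_counts is hypothetical.

```python
TOTAL_NAMES = 38_600            # size of the test set, from the slide
CONVERSION_PERCENTS = (58, 78)  # low and high ends of the reported range
ACCURACY_PERCENT = 91           # ASSUMPTION: measured over converted names only


def estimate_counts(total, conversion_percent, accuracy_percent):
    """Return (converted, correct) name counts, rounded down to whole names."""
    converted = total * conversion_percent // 100
    correct = converted * accuracy_percent // 100
    return converted, correct


for percent in CONVERSION_PERCENTS:
    converted, correct = estimate_counts(TOTAL_NAMES, percent, ACCURACY_PERCENT)
    print(f"{percent}% conversion: {converted} converted, about {correct} correct")
```

Under that assumption, roughly 20,000 to 27,000 of the 38,600 names would end up with a correct structure.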
38. But English and Chinese Names are Different, Really
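(1S,3S)-1-bromo-1-chloro-3-ethyl-3-methylcyclohexane
Fundamental Organic Chemistry I, Qiyi Xing et al., Ed. 3, Page 44
(1S,3S)-1-甲基-1-乙基-3-氯-3-溴环己烷 (literally: (1S,3S)-1-methyl-1-ethyl-3-chloro-3-bromocyclohexane)
基础有机化学(上),邢其毅等,第三版,第44页 (the Chinese edition of the same reference)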