The document discusses building a structure-centric community for chemists using crowd-sourcing. It describes ChemSpider, a free online database of over 20 million chemical structures that allows users to search, deposit, and curate chemical data. ChemSpider aims to connect structured chemical information to other resources and enable collaborative authoring and discovery. Challenges include maintaining data quality at scale and integrating with established authorities through an open community effort.
9. WikiProteins
What is Tegafur?
Building a Structure Centric Community for Chemists
10. Commonly Lacking…
Approaches generally lack “structural intelligence”
Structures have properties (Mw, MF, exp. & pred. properties)
Collections of structures need to be searchable by structure
Most data collections are “self-contained” and rarely
connect to other resources via “structure”
11. A Search Engine for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is Viagra?
What are the stereocenters of cholesterol?
Where can I find publications about Taxol?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
ChemSpider can answer all of these questions
12. ChemSpider Data Content
Over 20 million unique chemical structures:
Online Databases – PubChem, DrugBank, HMDB, Wikipedia
Chemical Vendors – over 40 different vendors and growing
Personal Depositions – individual contributions
Journal Publishers
Content database vendors
Analytical data collections
Patents (9 MILLION Structures to search patents)
Web scraping
Content is linked back to the original data sources
13. A Structure Centric Community for Chemists
A FREE ACCESS platform for deposition,
management, curation, annotation and extension of
information associated with chemical structures
Semantically connect to other sites providing access to
knowledge, data and information of determined quality
Search by alphanumeric text, chemical structure and
substructure and combination searches
Predict properties for submitted structures
14. Tell me about Aspirin
15. Tell me about Aspirin
16. Links out to KEGG
Kyoto Encyclopedia of Genes and Genomes
17. Tell me about Aspirin
18. Tell me about Aspirin
19. Tell me about Aspirin
20. Tell me about Aspirin
21. Abstract Compounds?
Is there any information about “Quesnoin”?
Type in the name (and there may be many) or other
identifier
Paste a chemical structure
Draw the structure
24. Example Search 2
What compounds have a mass of 300+/-0.001?
or search a combination of intrinsic/predicted properties
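A mass-window query like the one above can be sketched in a few lines of Python. This is only an illustration of the idea, not ChemSpider's implementation: the `Compound` record, the `search_by_mass` helper and the sample entries are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    monoisotopic_mass: float  # in Da


def search_by_mass(compounds, target, tolerance):
    """Return compounds whose mass lies within target +/- tolerance, nearest first."""
    hits = [c for c in compounds if abs(c.monoisotopic_mass - target) <= tolerance]
    return sorted(hits, key=lambda c: abs(c.monoisotopic_mass - target))


# Placeholder records for illustration only (not real compounds or masses).
records = [
    Compound("compound A", 300.0004),
    Compound("compound B", 299.9985),
    Compound("compound C", 300.0100),
    Compound("compound D", 299.9995),
]

hits = search_by_mass(records, target=300.0, tolerance=0.001)
hit_names = [c.name for c in hits]
print(hit_names)
```

A combined query would add further predicates (for example on a predicted property) to the same filter.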
27. Search Open Access Journals – ChemSpider
28. Search PubMed – ChemSpider
29. The Quality of Data Online…
Aggregating data opens up quality issues
Structure-identifier associations are “dirty”
Structures are COMMONLY incorrect – stereochem issues
Manual curation of small databases is enough work – what
about millions of structures?
Structures are far from perfect. What is a “correct structure”?
Full stereochemistry?
Historical timeline of structure?
Who is the authority?
30. Who holds THE Quality Authority?
Chemical Abstracts Service is the structural authority
today. 1400 (?) employees, world standard in chemistry
information
101 years of knowledge, process and expertise.
MANUAL curation is key. Robotic curation is enabling
How can an online, free access system peacefully
coexist with the authority?
31. Quality is a Major Issue – Search Butanol
34. Wikipedia Chemistry Curation project
Only ca. 5000 organic structures, 7000 total structures
MONTHS of work so far for a team of 6 people
Many errors removed in the process. Curation process
is a daily event for users/depositors
Slow and torturous process for stereo molecules.
35. Thymol Blue on ChemSpider
Data online includes:
UV-vis spectrum
Measured experimental properties
Link to Wikipedia article
Links to chromatography details
Multiple identifiers/trade names etc.
Links to vendors/suppliers/other databases
Safety information
36. Differences between ChemSpider/Wikipedia
ChemSpider | Wikipedia
>20 million unique structures | ~5000 organics, 2000 others
Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, … | Text
Prediction of properties | No
Analytical Data | No
Active depositors/curators – 30 | Active editors – about 50 (?)
5000 people/day; 1100 registered | ????
Compound monographs linked | Detailed compound monographs
37. Differences between Wikipedia/ChemSpider
Wikipedia | ChemSpider
Supported by tried and tested MediaWiki platform | Primarily Microsoft .NET technologies with OS components
Established infrastructure and Wikimedia Foundation team | “Out of a basement” on three servers and 5 volunteers
Chemistry is a subset of the ‘Pedia | Chemistry is the focus of ‘Spider
GFDL licensing for everything | Mixed “licensing”
Strong team of WP:Chem advocates, curators and admins | Growing team of WP:Chem advocates, curators and admins
Worldwide reputation as quality source | Growing reputation as focused on quality
38. Crowd-sourcing Curation
How to curate data for millions of structures?
Robot processes can clean up depositions
Search for Chloride and check molecular formula for Cl
Check for stereochemistry and remove names with stereo
Provide a simple-to-use platform to curate, annotate
and tag data
Provide curator administration to prevent vandalism
(Veropedia)
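The two robot checks named above (a "chloride" name against a formula lacking Cl, and stereo descriptors in names attached to a structure without stereochemistry) could look roughly like the sketch below. It is an assumption-laden illustration, not the ChemSpider robots: the function names, the keyword table and the stereo regex are hypothetical, the formula parser handles only simple formulas without brackets or charges, and the sketch flags suspicious names for review instead of removing them.

```python
import re

# Hypothetical rule table: a keyword in a name implies an element in the formula.
NAME_ELEMENT_RULES = {"chloride": "Cl"}

# Matches descriptors such as (R), (S), (E), (Z), (2S) or (2R,3S).
STEREO_DESCRIPTOR = re.compile(r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)")


def element_counts(formula):
    """Count atoms per element in a simple formula such as 'C9H8O4'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def curation_flags(names, formula, has_defined_stereo):
    """Return (name, reason) pairs for names that look inconsistent with the record."""
    counts = element_counts(formula)
    flags = []
    for name in names:
        for keyword, element in NAME_ELEMENT_RULES.items():
            if keyword in name.lower() and counts.get(element, 0) == 0:
                flags.append((name, f"name mentions '{keyword}' but formula has no {element}"))
        if not has_defined_stereo and STEREO_DESCRIPTOR.search(name):
            flags.append((name, "stereo descriptor in name but structure has no defined stereo"))
    return flags


# A benzoic acid record (C7H6O2) carrying two deliberately wrong synonyms.
flags = curation_flags(
    ["benzoic acid", "benzoyl chloride", "(R)-benzoic acid"],
    formula="C7H6O2",
    has_defined_stereo=False,
)
for name, reason in flags:
    print(f"{name}: {reason}")
```

Flagged names would then go to human curators, which matches the mix of robotic and manual curation described in these slides.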
40. Post Comments
Anyone can “Post Comments” associated with a
structure. To curate data we require login to track
41. Crowd-sourcing Chemistry
Crowd-sourced curation: identify and tag errors, edit
names, synonyms, identify records for deprecation
ALSO
Crowd-sourced deposition: anyone can deposit data
(structures, text, images, analytical data)
42. But, when registered and logged in…
Ability to curate and add to the database
Add structures
“Clean” structures
Add data (spectra, CIFs, images)
Add links to other pages (URLs)
Add publication details
43. Adding to the Database - Structure
44. Adding New Text Data
Add Publication Add URL
Add Identifier
46. Can ChemSpider Enable Discovery?
Yes, chemists can search by text, structure, substructure or
properties to look at relationships and probe drug discovery
47. ChemSpider – Research in Progress
Supporting Open Notebook Science as a repository –
JC Bradley at Drexel University
For the purpose of online virtual screening
Applying descriptors of various types to filter a
database of 20 million compounds
In progress:
Utilizing SimBioSys’ LASSO Descriptor
Collaboration based on NISS’ ChemModLab
48. LASSO
Ligand Activity by Surface Similarity Order
49. LASSO Descriptors on ChemSpider
SEMANTIC WEB in action
50. LASSO Searching Method 1
Ask the question “What are the top 1000 molecules
with similar LASSO descriptors to the actives for the
Estrogen Receptor”
51. It WORKS - Enrichment Plot
60% of the actives were recovered in the top 1% of the database.
“Environmental binders” are weak binders
The top ranked compounds may well be active ER binders
Likely candidates for experimental investigation
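The enrichment figure quoted above is the fraction of known actives that land in the top slice of a ranked database. A minimal sketch of that calculation follows; the function name and the toy data are hypothetical, and the numbers are chosen only to show the arithmetic, not to reproduce the LASSO result.

```python
import math


def fraction_of_actives_recovered(ranked_ids, active_ids, top_fraction):
    """Fraction of known actives found in the top `top_fraction` of a ranked list."""
    actives = set(active_ids)
    if not actives:
        raise ValueError("need at least one known active")
    # Number of top-ranked entries to inspect (at least one).
    cutoff = max(1, math.ceil(len(ranked_ids) * top_fraction))
    found = actives.intersection(ranked_ids[:cutoff])
    return len(found) / len(actives)


# Toy data: 1000 ranked placeholder IDs (best first) and 5 hypothetical actives.
ranked = [f"cmpd{i}" for i in range(1000)]
actives = {"cmpd0", "cmpd3", "cmpd7", "cmpd250", "cmpd900"}

recovered = fraction_of_actives_recovered(ranked, actives, top_fraction=0.01)
print(f"{recovered:.0%} of actives recovered in the top 1%")
```

Repeating the calculation over a range of `top_fraction` values gives the points of an enrichment plot.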
52. Tipping Point
Tipping point - the point at
which a slow gradual change
becomes irreversible and then
proceeds with gathering pace
53. ChemSpider Forums/Blogs
Forum.chemspider.com
www.chemspider.com/blog
55. What would we most like to do?
Enable “Collaborative Science”. What would that look
like?
Access to chemical supplies when people need them
Awareness of available literature, patents, databases of
curated content – whether Open Access or not.
Transaction fees (or not) are between user and provider
Host Open Notebook Science exchanges
56. “ChemSpider Inside”
Instrument vendors integrated ChemSpider into their
metabolism ID project – ChemSpider linked to all Mass
Spec Instruments doing Metabolite ID?
Wikipedia roundtrip linking to ChemSpider
Google indexing ChemSpider at “fixed rate”
Integration to desktop drawing packages
Members of Microsoft BioIT Alliance
Discussions on Taverna’s Workflow Sourceforge group
Hosting Open Access articles shortly…
57. Where to from here? Short term
Integrated text and structure/substructure searching of the
Open Access literature is in development
Web-based scraping of structure-based information –
examples in place
Enhanced web services layer to integrate searches
Deposit updated Patent Database (9 million structures)
Reaction handling and deposition
58. Where to from here? Mid-term
Spidering for Chemistry – extract data from articles,
webpages and data sources AND stay within copyright
WiChempedia project – wiki-layers on top of
ChemSpider, alongside Wikipedia curation project.
Deeper integration to text-based searching and
conversion of chemical names to structures for online
structure searching:
Improved integration with NCBI Entrez system
Deliver “dedicated websites” for specific publishers
59. Where to from here? Mid-term
An extensible datamodel “on the fly” allows us to
easily expand to integrate abstract data to structures
Data mine and curate “parameters” – physicochemical
and physiological parameters to enable QSAR
analysis, data modeling and provision of models
online (UNC-Chapel Hill, NISS)
60. Our Challenges
There are “no employees”
ChemSpider is non-funded
System is hyper-dependent
on ISP, power and limited
compute power
We are upsetting a lot of
people – evangelists,
cheminformatics system
vendors, publishers, data
content providers
61. Acknowledgments
The ChemSpider team of volunteer developers
ChemSpider Advisory Group
Our curators, depositors and users
Suppliers of commercial software – Microsoft,
ACD/Labs, OpenEye, ChemAxon, SimBioSys
SureChem – Structure Based Online Patent Searching
62. Further reading
www.chemspider.com/blog
Internet-based tools for communication and
collaboration in chemistry, Drug Discovery Today,
Volume 13, Numbers 11/12, June 2008, 502-506,
doi:10.1016/j.drudis.2008.03.015
A perspective of publicly accessible/open-access
chemistry databases, Drug Discovery Today, Volume
13, Numbers 11/12, June 2008, 495-501,
doi:10.1016/j.drudis.2008.03.017