This document discusses the challenges of managing chemistry data online and the development of an open data repository to address these challenges. It proposes a new architecture for a data repository that would integrate diverse chemistry data types through APIs and user interfaces. The repository would standardize data, enable deposition from various sources, and provide metrics and recognition to encourage participation. However, challenges remain around data formats, encouraging data sharing, and meeting scientists' needs. The document advocates for continued testing and collaboration to develop effective solutions.
Dealing with the complex challenge of managing diverse chemistry data online
1. Dealing with the complex challenge
of managing diverse chemistry
data online
Antony Williams, Valery Tkachenko, Alexey
Pshenichnov and Ken Karapetyan
ACS San Francisco
August 2014
4. About Me…as a Chemist
• I’ve performed a few dozen chemical
syntheses
• I’ve run thousands of analytical spectra
• I’ve generated thousands of NMR assignments
• I’ve probably published <5% of all work
• Most of it has been lost
• But things can be different today….
• But it still needs to be associated with me…
5. Think about chemistry a mo’
• If we imagine that permission exists…
(i.e. forget IP, chemical and pharma
companies etc…think students…)
– How many syntheses are performed?
– How many spectra are run?
– How many properties are measured?
– How many compounds are made?
– How many, how much, how big?
– Let’s go manage it all!!
11. Open Data are everywhere
• Is Openness and Social Sharing changing
the world?
• The cultural experiments in Open Data and
exchange are almost daily
• Mobile platforms enhance participation
• And then what of Chemistry Data???
12. An Experiment - ChemSpider
• ChemSpider allowed the community to
participate in linking the internet of chemistry
& crowdsourcing of data
• Successful experiment in terms of building a
central hub for integrated web search
• More people are “users” than “contributors”
• Yet basic feedback and game-play helps
14. An EPSRC Call
“…the identification of the need for a UK
national service for the provision of a
searchable, electronic chemical database
for the UK academic research community.”
16. We set a vision…
• Manage “all” of the chemistry data associated
with chemical substances – PUBLISHED and
UNPUBLISHED
• Based on user-selected licensing, the data are to be
downloadable, reusable and interactive
• Build a platform that enables the scientist
• Data storage, validation, standardization and
curation
• Collaborative data sharing
• Provide a data platform that can enable and
enhance publishing of scientific papers
17. Data Repository
• Registration of chemical compounds
• Deposition of chemical syntheses
• Addition of analytical data
• Integration to electronic notebooks
• Rewards and recognition for data sharing
• Document processing
• Hosting of data as private, embargoed or
public
18. Development of Data Repository
• Data repository should not just be a data
dump – should not be a “big disk”
• Searchable, integrated, segregated
repository of data types
• Data access including private, shared,
embargoed and public
• Delivery of derived models from data
20. New Repository Architecture
• Data tier: Compounds, Reactions, Spectra, Materials, Documents
• Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API
• User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets
• User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, Paid 3rd party integrations (various platforms – SharePoint, Google, etc)
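As a rough illustration of the tiering on this slide, the sketch below puts one API per data type in front of a store and lets a widget talk only to that API. The names DataTypeAPI, InMemoryAPI and render_widget, and the sample records, are hypothetical and are not taken from the repository itself.

```python
from typing import Dict, Protocol


class DataTypeAPI(Protocol):
    """Data access tier: one API per data type, all with the same shape."""

    def get(self, record_id: str) -> dict: ...


class InMemoryAPI:
    """Hypothetical stand-in for a real store in the data tier."""

    def __init__(self, records: Dict[str, dict]) -> None:
        self._records = dict(records)

    def get(self, record_id: str) -> dict:
        return self._records[record_id]


def render_widget(api: DataTypeAPI, record_id: str) -> str:
    """UI component tier: a widget reads through the API, never the store."""
    record = api.get(record_id)
    return ", ".join(f"{key}={value}" for key, value in sorted(record.items()))


# One API per data type named on the slide (sample records are invented).
apis: Dict[str, DataTypeAPI] = {
    "compounds": InMemoryAPI({"c1": {"name": "ethanol", "formula": "C2H6O"}}),
    "spectra": InMemoryAPI({"s1": {"compound": "c1", "technique": "NMR"}}),
}

print(render_widget(apis["compounds"], "c1"))
```

An application in the user interface tier would then be assembled from such widgets, so a new data type needs a new API and widget set rather than changes to the existing ones.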
21. Input data pipeline
• Deposition sources: Web UI for unified depositions; Dropbox, Google Drive, SkyDrive, etc; LabTrove and other templated data; Documents; API, FTP, etc
• Deposition Gateway
• Staging databases (raw data): Compounds, Reactions, Spectra, Materials, Articles / CSSP
• Processing modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module, … Module
• Staging databases (validated data)
• All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
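The simple security model described for the data slices could be sketched as follows. This is a minimal illustration only: the DataSlice fields, the owner check, and the rule that an embargoed slice becomes visible once its embargo date has passed are all assumptions, not details given on the slide.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    """One data source/collection; every record in it shares one visibility."""
    name: str
    owner: str
    visibility: Visibility
    embargo_until: Optional[date] = None  # only meaningful when EMBARGOED


def can_view(data_slice: DataSlice, user: str, today: date) -> bool:
    """Return True if `user` may read the slice on the given day."""
    if user == data_slice.owner:
        return True
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if data_slice.visibility is Visibility.EMBARGOED:
        # Assumption: an embargo with no end date never opens by itself.
        return (data_slice.embargo_until is not None
                and today >= data_slice.embargo_until)
    return False  # PRIVATE: owner only
```

Attaching visibility to the slice rather than to each record keeps the check to a single lookup per data source.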
26. For Deposition of Data
• Quality of data at source
• ensuring chemicals are correct - VALIDATION
• reactions map and balance as appropriate –
VALIDATION and STANDARDIZATION
• file format handling for analytical data types –
binary file formats are proprietary -
STANDARDIZATION
• valid interpretation of data – VALIDATION and
ANNOTATION
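One of the checks listed above, that a reaction balances, can be illustrated with a small sketch. It assumes flat molecular formulas without brackets, charges or isotopes, and the function names are hypothetical; a production validator would work from structures rather than formula strings.

```python
import re
from collections import Counter
from typing import List, Tuple

# An element symbol followed by an optional count, e.g. "C2", "H", "Cl3".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C2H6O'."""
    counts: Counter = Counter()
    position = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != position:
            raise ValueError(f"cannot parse formula: {formula!r}")
        counts[match.group(1)] += int(match.group(2) or 1)
        position = match.end()
    if not formula or position != len(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    return counts


def is_balanced(reactants: List[Tuple[int, str]],
                products: List[Tuple[int, str]]) -> bool:
    """Each side is a list of (stoichiometric coefficient, formula) pairs."""
    def atom_totals(side: List[Tuple[int, str]]) -> Counter:
        totals: Counter = Counter()
        for coefficient, formula in side:
            for element, count in parse_formula(formula).items():
                totals[element] += coefficient * count
        return totals

    return atom_totals(reactants) == atom_totals(products)
```

For example, is_balanced([(2, "H2"), (1, "O2")], [(2, "H2O")]) holds, while the same reaction with all coefficients set to 1 would be flagged for review.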
27. Input data pipeline (diagram repeated from slide 21)
33. ChEMBL (1.3 million records)
• 11,020 records with 4 bonds and zero charge,
e.g. CHEMBL501101 or CHEMBL501973
• 271 records with hypervalent oxygen (e.g.
CHEMBL2219679), carbon (e.g. 1005895),
boron, chlorine, iodine or phosphine
• 6,177 records where direction of bond makes
no sense, e.g. CHEMBL12760 and
CHEMBL34704
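Counts like these come from rule-based structure checks. The sketch below shows the general idea for valence only: the table of allowed valences is deliberately tiny and simplified, the Atom record is hypothetical, and a neutral four-bonded nitrogen is used purely as an illustration of the kind of record that would be flagged.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Simplified allowed valences keyed by (element, formal charge).
# This table is an assumption for illustration, not a complete rule set.
ALLOWED_VALENCE: Dict[Tuple[str, int], Set[int]] = {
    ("B", 0): {3}, ("B", -1): {4},
    ("C", 0): {4},
    ("N", 0): {3}, ("N", 1): {4},
    ("O", 0): {2}, ("O", 1): {3}, ("O", -1): {1},
}


@dataclass(frozen=True)
class Atom:
    index: int
    element: str
    charge: int
    bond_order_sum: int  # total bond order, implicit hydrogens included


def valence_issues(atoms: List[Atom]) -> List[str]:
    """Return one message per atom whose valence is not in the table."""
    issues = []
    for atom in atoms:
        allowed = ALLOWED_VALENCE.get((atom.element, atom.charge))
        if allowed is None:
            continue  # element/charge not covered by this sketch
        if atom.bond_order_sum not in allowed:
            issues.append(
                f"atom {atom.index}: {atom.element} charge {atom.charge:+d} "
                f"has valence {atom.bond_order_sum}"
            )
    return issues
```

Skipping uncovered element and charge combinations keeps the check conservative: it only reports cases the table can actually judge.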
36. The challenges of analytical data
• Vendors produce complex proprietary data
formats, so standard formats are required
(JCAMP, NetCDF, AnIML)
• ChemSpider already hosts thousands of JCAMP spectra
• Support of “assigned spectra” in place
• Data validation approaches understood
• There are a myriad of analytical data types…
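To make the appeal of a text-based standard concrete, here is a minimal sketch of reading the header of a JCAMP-DX file, where each labelled record is a line of the form ##LABEL=value and $$ starts a comment. The choice of which labels to treat as required is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List

# Labels this sketch treats as required (an assumption for illustration).
REQUIRED_LABELS = ("TITLE", "JCAMP-DX", "DATA TYPE", "ORIGIN", "OWNER")


def parse_jcamp_header(text: str) -> Dict[str, str]:
    """Collect ##LABEL=value records up to ##END=, ignoring data lines."""
    records: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("$$", 1)[0].strip()  # drop trailing comments
        if not line.startswith("##") or "=" not in line:
            continue  # numeric data rows and blank lines
        label, value = line[2:].split("=", 1)
        label = label.strip().upper()
        if label == "END":
            break
        records.setdefault(label, value.strip())  # keep the first occurrence
    return records


def missing_labels(records: Dict[str, str],
                   required: Iterable[str] = REQUIRED_LABELS) -> List[str]:
    """List required labels that are absent or empty."""
    return [label for label in required if not records.get(label)]
```

A deposition check could reject or annotate a spectrum whenever missing_labels returns a non-empty list, before any attempt is made to read the data table itself.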
44. Depositions from ELNs
• Development work integrating chemistry
into the Southampton LabTrove notebook
• Stoichiometry table development
• Analytical data integration
• “ChemTrove” rolled out to a small test
group in January
66. What can drive participation?
• What can drive scientists to participate and
contribute?
• Ensuring provenance of their data for reuse
• Mandates from funding agencies
• Improved systems to ease contribution
• Additional contributions to science
• Improved publishing processes
• Recognition for contributions
72. Rewards and Recognition
Congratulations! Your 1st CSSP
article has been published.
Philosopher Lao Tzu said “A
journey of a thousand miles begins
with a single step”. In the same
way we hope that this will be the
first of many submissions that you
make to CSSP.
The First Step badge is
awarded when a user
submits (& has published)
their 1st
CSSP article.
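The badge rule stated on this slide is simple enough to sketch directly. The Article record and the function name are hypothetical; only the rule itself (a first article that has been both submitted and published) comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Article:
    title: str
    published: bool  # submitted articles that are not yet published are False


def earned_badges(articles: List[Article]) -> List[str]:
    """Award 'First Step' once at least one submitted article is published."""
    published_count = sum(1 for article in articles if article.published)
    return ["First Step"] if published_count >= 1 else []
```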
74. AltMetrics Feeds
• For our data repository, ensure contributions of
data feed out to the AltMetrics platforms
• Every data point, every data download, use
and reuse will be associated with the scientist
• Data will be DOI’ed (presently under review)
• Services provided will allow for AltMetrics use
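Associating every download and reuse with the contributing scientist amounts to attributing usage events to an owner. The sketch below shows that aggregation only; the event tuples, the owner mapping and the function name are hypothetical, and nothing here represents the interface of any AltMetrics platform.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def summarize_usage(events: Iterable[Tuple[str, str]],
                    owners: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Aggregate (data_id, action) events into per-contributor action counts.

    `owners` maps each data identifier (for example a DOI) to its contributor.
    """
    summary: Dict[str, Counter] = defaultdict(Counter)
    for data_id, action in events:
        summary[owners[data_id]][action] += 1
    return {contributor: dict(counts) for contributor, counts in summary.items()}
```

A feed built from such a summary would let each data point count toward its contributor's record.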
75. What do we have in place?
• We are testing an early form of the data
repository on our data – ChemSpider and our
archive of publications
• Working with collaborators to define needs
• Testing and enhancing deposition systems
• Chemical validation & standardization platform
• Analytical data handling formats
• And lots in development…
76. The Challenges Ahead
• Chemistry is NOT just nicely defined structures!
• Materials, minerals, attached to beads,
polymers, ambiguous materials
• Domain-specific measurements
• File format standards are limited in application
• Encouraging scientists to free up their data
• AltMetrics, open data mandates, systems
• The data explosion continues
77. But it’s not easy of course
• Not everything we would like for data
handling is in place yet
• Many systems, tools and platforms are already
available, but we don’t know about them, or
even if we did, contributing is “more work”
• “What’s in it for me?”, “It’s my data”, “It’s too
much work”, “What credit do I get?”