Representation of molecular structures and related computations on the Semantic Web. Universal Data Model and its Ontology.
1. Chemical Semantics, Inc.
Chemical Semantics
Mirek Sopek*, Neil Ostlund, Jacob W.G. Bloom, Stuart Chalk
Chemical Semantics Inc., 1115 NW 4th Street, Gainesville, Florida
*sopek@chemicalsemantics.com
2. Hypercube / Chemical Semantics, Inc. – March 2016, San Diego
Chemical Semantics goals
Interoperable PUBLISHING of Computational Chemistry calculations
Semantic REPRESENTATION OF DATA for both humans and machines
FEDERATION of published data with existing web-based chemical datasets
Cloud-like ARCHIVING of Computational Chemistry calculation results, input/output files etc.
http://chemsem.com
3. CSI Portal – a short review
chemsem.com – EXISTING PLATFORM FOR DATA PUBLISHING
4. CSI Portal – what’s new?
Enhanced stability and security
SPARQL Query Generator based on chemical drawings
Extending the range of QC packages to: ADF, DALTON, GAMESS, GAMESS-UK, Gaussian, Jaguar, Molpro, NWChem, ORCA, Psi4, and Q-Chem (thanks to the use of cclib)
6. What is a data model and why is it important?
What is a data model:
A data model organizes data elements and standardizes how the data elements relate to one another.
As such, a data model should be distinguished from its serializations (i.e. file formats).
The most important place where we work directly with data models is in the software!
7. Data Models in Chemistry
TABULAR data models (most popular: MOL files, MOLDEN files, ZMT, GJF, HIN, relational DBs etc.)
TREE-based data models (CML, AnIML, CSX etc.)
KEY-VALUE/MIXED data models (CIF, new PDB/mmCIF, JCAMP-DX)
8. Why we need new data models and standards
Existing data models have various levels of extensibility, but all of them fall short when a new kind of data, unknown or unforeseen when the model was created, appears in them.
Adding such a new kind of data to a model usually breaks it or, in the best case, the data is ignored.
There is no provision for dynamic sharing of data where people can add new data in real time.
9. What is the solution?
We are convinced that the solution comes in the form of a GRAPH-based data model built on the smallest possible data pattern: a TRIPLE.
The best implementation is offered by RDF, the Resource Description Framework known from Semantic Technologies.
10. Why triples?
Arbitrary N-tuples can be constructed out of 3-tuples.
Proved by W. V. Quine, Mathematical Logic, Harvard University Press, 1940.
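A minimal sketch of the idea, assuming the common "one fresh node plus positional predicates" pattern rather than Quine's own construction. The helper names `decompose` and `recompose`, the `_:tN` node labels, the `argN` predicates and the element "C" are all made up for illustration; only the three coordinate values are taken from this deck.

```python
from itertools import count

_ids = count(1)  # source of fresh node names (hypothetical naming scheme)

def decompose(record):
    """Turn an n-tuple into a set of 3-tuples hanging off one fresh node."""
    node = f"_:t{next(_ids)}"
    return {(node, f"arg{i}", value) for i, value in enumerate(record)}

def recompose(triples, node):
    """Rebuild the original n-tuple from the triples of a given node."""
    parts = {int(p[3:]): o for s, p, o in triples if s == node}
    return tuple(parts[i] for i in range(len(parts)))

# A 5-tuple: atom label, element, x, y, z
atom = ("A1", "C", 0.06968, 1.299703, 0.021584)
triples = decompose(atom)
node = next(iter(triples))[0]
restored = recompose(triples, node)
```

Each component of the 5-tuple becomes one triple, and nothing is lost: `restored` equals `atom`.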
11. RDF data model
Anatomy of the triple:
Subject Predicate Object
Thing Property Value
For example:
<molecule> gnvc:hasInChIString "1S/H2O/h1H2"
gc:hasInChIKey "DUGIDELPOPULAW-UHFFFAOYSA-N"
12. RDF data model
A typical data set contains a large number of triples forming a DIRECTED GRAPH.
Identification and addressing of nodes is done via a URI scheme – a generalization of URLs, the standard web addresses.
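To make the addressing concrete, here is a small sketch of how a prefixed name such as gc:Atom stands for a full URI. The prefix-to-namespace pairs are the ones used elsewhere in this deck; the `expand` helper is hypothetical and is not part of any RDF library.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "gc": "http://purl.org/gc/",
}

def expand(curie, prefixes=PREFIXES):
    """Expand a prefixed name such as 'gc:Atom' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local

atom_uri = expand("gc:Atom")
type_uri = expand("rdf:type")
```

Because every node and property resolves to a globally unique address, two independently published graphs that mention gc:Atom are talking about the same thing.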
13. RDF data model in software
The RDF data model in software is usually represented as an unordered SET of TRIPLES (3-TUPLES).
For example, in Python a triple is a 3-tuple:
(subject, predicate, object)
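The "unordered set" part matters in practice. A sketch with a plain Python set, assuming made-up `ex:` node names (the gc: and rdf: namespaces are the ones used in this deck):

```python
GC = "http://purl.org/gc/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

graph = set()
graph.add(("ex:A1", RDF_TYPE, GC + "Atom"))
graph.add(("ex:A1", GC + "isElement", "F"))
graph.add(("ex:A1", GC + "isElement", "F"))  # duplicate, silently absorbed

# Merging two data sets is a set union; order never matters.
other = {
    ("ex:M1", RDF_TYPE, GC + "Molecule"),
    ("ex:M1", GC + "hasAtom", "ex:A1"),
}
merged = graph | other
```

Stating the same fact twice changes nothing, and merging data from two sources needs no schema negotiation. RDF libraries offer a richer graph object, but the underlying behaviour is the same.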
14. How do we interact with the model?
Through SPARQL queries
Through specific API calls in your language of preference
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX gc: <http://purl.org/gc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?graph
WHERE {
  GRAPH ?graph {
    {
      ?something gc:hasAtom ?atom1 ;
                 rdf:type ?somethingType ;
                 rdfs:label ?somethingLabel .
      ?atom1 gc:isElement "F" .
    }
    UNION
    {
      ?something gc:hasAtom ?atom2 ;
                 rdf:type ?somethingType ;
                 rdfs:label ?somethingLabel .
      ?atom2 gc:isElement "Cl" .
    }
    UNION
    {
      ?something gc:hasAtom ?atom3 ;
                 rdf:type ?somethingType ;
                 rdfs:label ?somethingLabel .
      ?atom3 gc:isElement "Br" .
    }
    UNION
    (…)
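# Imports needed by the fragments below
import rdflib
from rdflib import URIRef
from rdflib.namespace import RDF

ua = URIRef('http://purl.org/gc/Atom')
um = URIRef('http://purl.org/gc/Molecule')
ur = URIRef('http://purl.org/gc/Residue')
# urn (the Turtle source) and uhr (presumably the predicate linking a
# molecule to its residues) are defined in a part of the script not shown
g = rdflib.Graph()
g.parse(urn, format="turtle")
nmc = 0  # molecule counter
for m in g.subjects(RDF.type, um):
    nmc += 1
    napm = 0  # number of atoms per molecule
    res = list(g.objects(m, uhr))  # g.objects() returns a generator, so materialize it once
    if res:
        (…)
# Look up single values for a node (vURI); gcn is a namespace object defined elsewhere
v = graph.value(subject=vURI, predicate=RDF.type)
h = graph.value(subject=vURI, predicate=gcn.hasName)
a = graph.value(subject=vURI, predicate=gcn.hasValue)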
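Both styles of access boil down to matching triple patterns. The sketch below mimics the three halogen branches of the query shown on this slide in plain Python, without a SPARQL engine. The `match` and `contains_halogen` helpers and the sample molecules M1 and M2 are invented for illustration; it also ignores named graphs, types and labels, which the real query uses.

```python
def match(triples, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), t)):
            yield t

triples = {
    ("M1", "gc:hasAtom", "A1"), ("A1", "gc:isElement", "C"),
    ("M1", "gc:hasAtom", "A2"), ("A2", "gc:isElement", "Cl"),
    ("M2", "gc:hasAtom", "A3"), ("A3", "gc:isElement", "H"),
}

HALOGENS = {"F", "Cl", "Br"}  # the three UNION branches visible on the slide

def contains_halogen(triples):
    """Subjects that have at least one atom whose element is F, Cl or Br."""
    found = set()
    for subj, _, atom in match(triples, p="gc:hasAtom"):
        for _, _, element in match(triples, s=atom, p="gc:isElement"):
            if element in HALOGENS:
                found.add(subj)
    return found

halogenated = contains_halogen(triples)
```

Here only M1 qualifies, through its chlorine atom A2.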
15. Software interaction with the model?
Of all data models, the RDF GRAPH offers almost unlimited extensibility.
Its serializations (JSON-LD and Turtle) are the best to work with.
18. Data model and its serializations
There are a number of serializations for RDF graphs: RDF/XML, N-Triples, Turtle, JSON-LD etc.
The most important today are JSON-LD and Turtle.
We should never forget that they are just SERIALIZATIONS of the underlying, more fundamental Data Model.
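The distinction can be shown with a toy example. The two layouts below are ad hoc JSON formats invented for this sketch; they are not real Turtle or JSON-LD, and the node names are made up. The point is only that two different texts decode to the same set of triples.

```python
import json

triples = {
    ("mol1", "gc:hasAtom", "a1"),
    ("mol1", "gc:hasAtom", "a2"),
    ("a1", "gc:isElement", "F"),
}

def to_flat(triples):
    """Serialization 1: a flat JSON array of [s, p, o] rows."""
    return json.dumps(sorted(triples))

def from_flat(text):
    return {tuple(row) for row in json.loads(text)}

def to_nested(triples):
    """Serialization 2: JSON grouped by subject, then by predicate."""
    doc = {}
    for s, p, o in sorted(triples):
        doc.setdefault(s, {}).setdefault(p, []).append(o)
    return json.dumps(doc, indent=2)

def from_nested(text):
    return {(s, p, o)
            for s, preds in json.loads(text).items()
            for p, objs in preds.items()
            for o in objs}

flat, nested = to_flat(triples), to_nested(triples)
```

The strings `flat` and `nested` look nothing alike, yet `from_flat(flat)` and `from_nested(nested)` give back the identical model.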
20. CSI Molecular Data Models
Existing model (currently used on our portal):
Closely follows the CSX (XML) data model presented here last year
The new data model features:
Alternative methods to describe molecular geometry: Cartesian, fractional and internal coordinates
Flexible representation of molecular hierarchies (molecules, residues, groups, chains, templates etc.)
Cleaner serializations to both JSON-LD and Turtle, which are also easier for humans to work with
Closer integration with the Gainesville Core Ontology
21. CSI Molecular Data Model
Geometrical objects: top-level class hierarchy
[Diagram: gc:Locus at the top; gc:Atom, gc:Point, gc:DummyAtom and gc:GhostAtom connected by four subclass arrows, labelled rdf:subClass on the slide]
23. CSI Molecular Data Model
Cartesian coordinates representation
[Diagram; content recoverable from the slide:]
mSys rdf:type gc:MolecularSystem
cart rdf:type gc:CartesianCoordinates
p1 rdf:type gc:Point ; gc:isPositionFor A1 ; gc:hasXValue 0.06968 ; gc:hasYValue 1.299703 ; gc:hasZValue 0.021584
p2 rdf:type gc:Point ; gc:isPositionFor A2 ; gc:hasXValue 1.000204 ; gc:hasYValue 1.658998 ; gc:hasZValue 0.011623
p9 rdf:type gc:Point ; gc:isPositionFor A7 ; gc:hasVectorValue "1.000204 1.658998 0.01162361"
The edges gc:contains and gc:usesType link mSys, cart and the points; their exact endpoints are not recoverable from the transcript.
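A sketch of how software might read such a structure, assuming the per-axis and the single-vector encodings are interchangeable ways to position a point. Nodes are written as plain strings, and the `value` and `coordinates` helpers are hypothetical; the property names and numbers are those on the slide.

```python
triples = {
    ("p1", "rdf:type", "gc:Point"),
    ("p1", "gc:isPositionFor", "A1"),
    ("p1", "gc:hasXValue", 0.06968),
    ("p1", "gc:hasYValue", 1.299703),
    ("p1", "gc:hasZValue", 0.021584),
    ("p2", "rdf:type", "gc:Point"),
    ("p2", "gc:isPositionFor", "A2"),
    ("p2", "gc:hasXValue", 1.000204),
    ("p2", "gc:hasYValue", 1.658998),
    ("p2", "gc:hasZValue", 0.011623),
    # compact alternative: one space-separated vector literal
    ("p9", "rdf:type", "gc:Point"),
    ("p9", "gc:isPositionFor", "A7"),
    ("p9", "gc:hasVectorValue", "1.000204 1.658998 0.01162361"),
}

def value(triples, s, p):
    """Return the single object for (s, p), or None if absent."""
    return next((o for s2, p2, o in triples if (s2, p2) == (s, p)), None)

def coordinates(triples):
    """Map each atom to an (x, y, z) tuple, whichever encoding its point uses."""
    coords = {}
    points = [s for s, p, o in triples if p == "rdf:type" and o == "gc:Point"]
    for pt in points:
        atom = value(triples, pt, "gc:isPositionFor")
        vec = value(triples, pt, "gc:hasVectorValue")
        if vec is not None:
            xyz = tuple(float(v) for v in vec.split())
        else:
            xyz = tuple(value(triples, pt, f"gc:has{axis}Value") for axis in "XYZ")
        coords[atom] = xyz
    return coords

xyz = coordinates(triples)
```

The reader asks the graph for what it needs and does not depend on the order or layout of the data.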
24. CSI Molecular Data Model
Molecular hierarchy
[Diagram; content recoverable from the slide:]
mSys rdf:type gc:MolecularSystem ; gc:hasMolecules M1
M1 hasResidue R1 , R2 (typed gc:Residue)
Residues and groups are linked to the atoms A1 to A7 via gc:hasAtom
g1 , g2 rdf:type gc:Group, each also typed with a ChEBI class (chebi:CHEBI_32952 appears twice on the slide) and labelled "Amine Group" and "Carboxylic acid group"
b1_2 rdf:type gc:SingleBond ; gc:binds A1 , A2
b6_7 rdf:type gc:DoubleBond ; gc:binds A6 , A7
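A sketch of walking this hierarchy in software. The assignment of atoms to residues R1 and R2 is a guess, since the flattened diagram does not show it unambiguously, and the `gc:` prefix on hasResidue is likewise an assumption. The helper functions are hypothetical.

```python
from collections import defaultdict

triples = {
    ("mSys", "gc:hasMolecules", "M1"),
    ("M1", "gc:hasResidue", "R1"),
    ("M1", "gc:hasResidue", "R2"),
    # atom-to-residue assignment below is assumed for illustration
    ("R1", "gc:hasAtom", "A1"), ("R1", "gc:hasAtom", "A3"),
    ("R1", "gc:hasAtom", "A5"), ("R1", "gc:hasAtom", "A7"),
    ("R2", "gc:hasAtom", "A2"), ("R2", "gc:hasAtom", "A4"),
    ("R2", "gc:hasAtom", "A6"),
    ("b1_2", "rdf:type", "gc:SingleBond"),
    ("b1_2", "gc:binds", "A1"), ("b1_2", "gc:binds", "A2"),
    ("b6_7", "rdf:type", "gc:DoubleBond"),
    ("b6_7", "gc:binds", "A6"), ("b6_7", "gc:binds", "A7"),
}

def objects(triples, s, p):
    """All objects for a given subject and predicate, sorted."""
    return sorted(o for s2, p2, o in triples if (s2, p2) == (s, p))

def atoms_of_molecule(triples, molecule):
    """Collect a molecule's atoms by descending through its residues."""
    atoms = []
    for residue in objects(triples, molecule, "gc:hasResidue"):
        atoms.extend(objects(triples, residue, "gc:hasAtom"))
    return sorted(atoms)

def bonds(triples):
    """Map each bond node to the pair of atoms it binds."""
    ends = defaultdict(list)
    for s, p, o in triples:
        if p == "gc:binds":
            ends[s].append(o)
    return {b: tuple(sorted(a)) for b, a in ends.items()}

m1_atoms = atoms_of_molecule(triples, "M1")
bond_map = bonds(triples)
```

Bonds are nodes in their own right, so a bond order or any other bond property is just one more triple on that node.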
25. CSI Molecular Data Model
Internal coordinates
[Diagram; content recoverable from the slide:]
mSys rdf:type gc:MolecularSystem ; gc:contains zmat
zmat rdf:type gc:InternalCoordinates ; gc:hasZmatLines ( zL1 zL2 zL3 zL4 … ), an RDF list
zL1 hasFirstAtom A1
zL2 hasFirstAtom A2 ; hasSecondAtom A1 ; hasDistance v1
zL3 hasFirstAtom , hasSecondAtom , hasThirdAtom (atoms A1 to A3) ; hasDistance v2 ; hasAngle v3
zL4 hasFirstAtom , hasSecondAtom , hasThirdAtom , hasFourthAtom (atoms A1 to A4) ; hasDistance v4 ; hasAngle v5 ; hasDihedral v6
v1 rdf:type Data_view_value ; hasName "R2" ; hasValue 1.399645
A second value node, also labelled v1 on the slide: rdf:type Data_view_value ; hasValue 1.081060
v6 rdf:type Data_view_value ; hasName "D3" ; hasValue 118.774
v7 rdf:type Data_as_pointer ; hasAdditiveInverseData v6
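One detail worth a sketch is the value node typed Data_as_pointer. The code below assumes that hasAdditiveInverseData means "this variable is the negative of the one pointed to", which is my reading of the property name and is not confirmed by the slide. The `value` and `resolve` helpers are hypothetical.

```python
triples = {
    ("v6", "rdf:type", "Data_view_value"),
    ("v6", "hasName", "D3"),
    ("v6", "hasValue", 118.774),
    ("v7", "rdf:type", "Data_as_pointer"),
    ("v7", "hasAdditiveInverseData", "v6"),
}

def value(triples, s, p):
    """Return the single object for (s, p), or None if absent."""
    return next((o for s2, p2, o in triples if (s2, p2) == (s, p)), None)

def resolve(triples, node):
    """Return the number a Z-matrix variable stands for."""
    direct = value(triples, node, "hasValue")
    if direct is not None:
        return direct
    target = value(triples, node, "hasAdditiveInverseData")
    if target is not None:
        # assumed semantics: pointer to the additive inverse of the target
        return -resolve(triples, target)
    raise KeyError(f"{node} has neither a value nor a pointer")

d3 = resolve(triples, "v6")
minus_d3 = resolve(triples, "v7")
```

Under that reading, editing the single value on v6 automatically updates every variable that points to it, as a named variable does in a Z-matrix input file.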
26. POC - Representation of residues
Proof of concept based on AMBER residues (http://ambermd.org/doc/prep.html)
As simple as adding a few more triples to the existing structure.
Another example of the data model’s flexibility and of the processing software’s immunity to changes in the data patterns.
28. The contents
[Diagram: residue template built on the internal-coordinates structure; content recoverable from the slide:]
mTemplate rdf:type gc:Templates ; gc:contains residue
residue rdf:type gc:PolymericTemplates ; gc:hasZmatLines ( zL1 zL2 zL3 zL4 … ), an RDF list
The Z-matrix lines use the same hasFirstAtom … / hasDistance / hasAngle / hasDihedral pattern as the internal coordinates representation.
Residue data added to the entries, for example:
gc:residueAtomName "DUMM" ; gc:residueAtomSymbol "DU" ; gc:residueTopologicalType "M" ; gc:AtomCharge 0.0
gc:residueAtomName "CD2" ; gc:residueAtomSymbol "CD" ; gc:residueTopologicalType "E" ; gc:AtomCharge -0.0110
Corresponding AMBER prep fields: I , IGRAPH(I) , ISYMBL(I) , ITREE(I) , NA(I) , NB(I) , NC(I) , R(I) , THETA(I) , PHI(I) , CHG(I)
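A sketch of how one atom line of an AMBER prep file could be turned into the residue triples above. The field order is the one listed on the slide and the four predicates are the ones shown there; the function name, the subject "zL4" and the geometry numbers in the sample line are invented for illustration.

```python
FIELDS = ("I", "IGRAPH", "ISYMBL", "ITREE", "NA", "NB", "NC",
          "R", "THETA", "PHI", "CHG")

# Field -> predicate, for the four properties visible on the slide.
PREDICATES = {
    "IGRAPH": "gc:residueAtomName",
    "ISYMBL": "gc:residueAtomSymbol",
    "ITREE": "gc:residueTopologicalType",
    "CHG": "gc:AtomCharge",
}

def prep_line_to_triples(line, subject):
    """Split one whitespace-separated prep atom line into triples."""
    record = dict(zip(FIELDS, line.split()))
    record["CHG"] = float(record["CHG"])
    return {(subject, PREDICATES[f], record[f]) for f in PREDICATES}

# Illustrative line; the connectivity and geometry columns are made up.
line = "4 CD2 CD E 3 2 1 1.400 120.0 180.0 -0.0110"
triples = prep_line_to_triples(line, "zL4")
```

The connectivity and geometry columns (NA, NB, NC, R, THETA, PHI) would map onto the existing Z-matrix properties, so only the four residue-specific fields need new predicates.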
29. Amber residues
Creating residue templates on the basis of internal coordinate representations adds completely new data to the system.
However, the existing information is still readable by software that "knew" how to interpret it.
The new data can now be extracted by software that "knows" about residues.
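This behaviour can be sketched with two small consumers over a plain set of triples. The function names and the sample data are hypothetical, and attaching gc:residueAtomName directly to atom nodes is a simplification made for this example.

```python
def count_atoms(triples):
    """'Old' consumer: only understands gc:hasAtom."""
    return len({o for s, p, o in triples if p == "gc:hasAtom"})

def residue_names(triples):
    """'New' consumer: also understands gc:residueAtomName."""
    return {s: o for s, p, o in triples if p == "gc:residueAtomName"}

original = {
    ("M1", "gc:hasAtom", "A1"),
    ("M1", "gc:hasAtom", "A2"),
}
before = count_atoms(original)

# Extending the data set means adding triples; nothing is rewritten.
extended = original | {
    ("A1", "gc:residueAtomName", "DUMM"),
    ("A2", "gc:residueAtomName", "CD2"),
}
after = count_atoms(extended)
names = residue_names(extended)
```

The old consumer returns the same answer before and after the extension because it never looks at predicates it does not know, while the new consumer finds the residue names in the same graph.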
30. Use in software
Excel example
Python example
PHP example
http://chemicalsemantics.com/rda/
31. Ontological description of the data model
The structure of the RDF data model can be described in an Ontology.
http://purl.org/gc
32. Conclusions
The RDF data model delivers the greatest possible extensibility while preserving compatibility with the software used to create and consume it.
It is suitable not only for knowledge representation and metadata encoding, but is also the best data model for encoding molecular structure information.
33. Acknowledgements
I would like to thank the following people for making this
presentation possible:
Dr. Neil S. Ostlund
Dr. Jacob W.G. Bloom
Dr. Bing Wang
Dr. Stuart Chalk
34. Chemical Semantics, Inc.
Chemical Semantics
Thank you!
Mirek Sopek, PhD
Chemical Semantics, Inc.
1115 NW 4th Street
32601 Gainesville, Florida
cell: +1 917 3467500
web: www.chemicalsemantics.com
email: sopek@chemicalsemantics.com