http://wiki.knoesis.org/index.php/MaterialWays
http://www.knoesis.org/?q=research/semMat
Abstract
The sharing, discovery, and application of materials science and engineering data and documents are possible only if domain scientists are able and willing to do so. We need to overcome technological challenges, such as the development of convenient computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data, and cultural challenges, such as proper protection, control, and credit for sharing data. Our thesis and value proposition is that associating machine-processable semantics with materials science and engineering data and documents can provide a solid foundation for overcoming challenges associated with data discovery, integration, and interoperability caused by data heterogeneity. Specifically, lightweight semantics in the form of file-level annotation, which is easy to use and has a low upfront cost, can enable document discovery and sharing, while deeper data-level annotation using standardized ontologies can benefit semantic search and summarization. Machine processability achieved through fine-grained semantic annotation, extraction, and translation can enable data integration, interoperability, and reasoning, ultimately leading to Linked Open Materials Science Data. Thus, different granularities of semantics provide a continuum of trade-offs between cost/ease of use and expressiveness. In this presentation, we also show the application of semantic techniques for content extraction from materials and process specifications, which are semi-structured and table-rich, and the application of semantic web techniques and technologies for materials vocabulary integration and curation (via Semantic MediaWiki), semantic web visualization, efficient representation of provenance metadata and access control (via singleton property), and biomaterials information extraction.
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analytics, and Applications
1. Semantics-enhanced Cyberinfrastructure for ICMSE :
Interoperability, Analytics, and Applications
Krishnaprasad Thirunarayan (T. K. Prasad) and Amit Sheth
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
2. Relevant Funded Projects :
A Brush with Pain Points and Promise
• Semantic Web-based Data Exchange and
Interoperability for OEM-Supplier Collaboration
(Pratt and Whitney) (2014-2015)
• KDDM: Federated Semantic Services Platform
for Open Materials Science and Engineering
(AFRL) (2013-2016)
• Computer Assisted Document Interpretation
Tools. (NSF SBIR Phases I and II with Cohesia
Corp.) (1999-2002)
• Document => Materials and Process Specs (alloys)
3. Selected URLs and Publications
• http://www.knoesis.org/?q=research/semMat
• http://wiki.knoesis.org/index.php/MaterialWays
• Nishita Jaykumar, PavanKalyan Yallamelli, Vinh Nguyen, Sarasi
Lalithsena, Krishnaprasad Thirunarayan, Amit Sheth. KnowledgeWiki:
An OpenSource Tool for Creating Community-Curated Vocabulary,
with a Use Case in Materials Science. In LDOW - WWW 2016.
Montreal, Canada; 2016.
• Vinh Nguyen, Olivier Bodenreider, Amit Sheth. Don't like RDF
Reification? Making Statements about Statements using Singleton
Property. 23rd International conference on World Wide Web (WWW
2014). NY: ACM; 2014. p. 759-770.
• Krishnaprasad Thirunarayan, Amit Sheth, Kalpa Gunaratna, Vinh
Nguyen, Siva Cheekula, Sarasi Lalithsena, Nishita Jaykumar, Swapnil
Soni, Clare Paul. Architecture and Prototype for Materials Knowledge
Management System using Semantic Web Technologies and
Techniques: A Preliminary Report. WSU, 2014
4. Selected URLs and Publications
• Krishnaprasad Thirunarayan, On Embedding Machine-
Processable Semantics into Documents, In: IEEE Transactions
on Knowledge and Data Engineering, Vol. 17, No. 7, pp. 1014-
1018, July 2005.
• K. Thirunarayan, A. Berkovich, and D. Sokol, An Information
Extraction Approach to Reorganizing and Summarizing
Specifications, In: Information and Software Technology
Journal, Vol. 47, Issue 4, pp. 215-232, 2005.
• K. Thirunarayan, A. Berkovich, and D. Sokol, Semi-automatic
Content Extraction from Specifications, In: Proceedings of 6th
International Conference on Applications of Natural Language
to Information Systems, LNCS 2553, pp. 40-51, June 2002.
5. Outline
• Domain Goals and Challenges
• Utility and Continuum of Machine-Processable Semantics : An
Architecture
• What?: Nature of Data and Granularity of Semantics
• Why?: Lightweight semantics and its benefits
• How?: Community-ratified Ontologies
+ Semantic Annotations of Data and Documents
+ Linked Open Materials Data
• Applications:
• (Skip) Long-term Research: Processing Tabular Data
• Integrating vocabularies : Matvocab KnowledgeWiki use case
• Document Annotation : Biomaterials use case
• Visualization and Navigation : iExplore
• Private-Public Data Sharing
• Conclusion
6. Domain Goals and Challenges
• Materials Science and Engineering Data and
Document sharing, discovery, and application are
possible only if domain scientists are able and
willing to do so.
• Technological challenges
– Computational tools and repositories conducive to easy
exchange, curation, attribution, and analysis of data
• Cultural challenges
– Proper protection, control, and credit for sharing data
7. Our Thesis / Value Proposition
Associating machine-processable semantics
with materials science and engineering data
and documents can help overcome
challenges associated with data discovery,
integration and interoperability caused by
data heterogeneity.
8. What?: Nature of Data
• Structured Data (e.g., relational)
• Semi-structured, Heterogeneous Documents
(e.g., publications and technical specs which
usually include text, numerics, units of measure,
images and equations)
• Tabular data (e.g., ad hoc spreadsheets and
complex tables incorporating “irregular” entries)
10. What?: Granularity of Semantics and Applications: Examples
• Synonyms
– Chemistry, Chemical Composition, Chemical Analysis, ...
– Bend Test, Bending, ...
– Delivery Condition, Process/Surface Finish, Temper, "as received by
purchaser", ...
• Co-reference vs broadening/narrowing
– Tubing vs welded tubing vs flash-welded part
• Capturing characteristic-value pairs
– Recognize and Normalize: “0.1 inch and under in nominal thickness”
is translated to “Thickness <= 0.1 in”.
– Glean elided characteristic: controlled term “solution heat treated”
implies the attribute “heat treat type”.
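The "recognize and normalize" and "glean elided characteristic" steps above can be sketched as follows. This is only a minimal illustration in Python: the regular expression, the `COMPARATORS` mapping, and the `IMPLIED_ATTRIBUTES` table are assumptions made for this example, not the rules used by the original extraction tool.

```python
import re

# Hypothetical mapping from comparison phrases to operators (assumption).
COMPARATORS = {
    "and under": "<=",
    "and over": ">=",
    "under": "<",
    "over": ">",
}

# Hypothetical table of controlled terms and the attribute each one implies.
# The single entry is the example given on the slide.
IMPLIED_ATTRIBUTES = {"solution heat treated": "heat treat type"}

# Matches phrases such as "0.1 inch and under in nominal thickness".
PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?:inch|in)\.?\s+"
    r"(?P<cmp>and under|and over|under|over)\s+in\s+nominal\s+"
    r"(?P<attr>\w+)",
    re.IGNORECASE,
)


def normalize(phrase):
    """Translate a textual constraint into 'Attribute <op> value in'."""
    match = PATTERN.search(phrase)
    if match is None:
        return None
    operator = COMPARATORS[match.group("cmp").lower()]
    attribute = match.group("attr").capitalize()
    return f"{attribute} {operator} {match.group('value')} in"


def glean_attribute(term):
    """Return the attribute implied by a controlled term, if known."""
    return IMPLIED_ATTRIBUTES.get(term.lower())


normalized = normalize("0.1 inch and under in nominal thickness")
implied = glean_attribute("solution heat treated")
```

Here `normalized` holds the slide's target form, and `implied` holds the attribute that the term leaves unstated.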
12. 1
• Ontology: Agreement about a common
vocabulary/nomenclature, conceptual models and
domain knowledge
– Codified as Schema + Knowledge Base.
– Agreement is what enables interoperability.
– Formal machine processable description is what
leads to automation.
13. 2
• Semantic Annotation (Metadata Extraction):
Associating meaning with data, or labeling data so
it is more meaningful to the system and people.
– Manual
– Semi-automatic (automatic with human
verification)
– Automatic
14. 3
• Reasoning/Computation:
– Semantics enabled search
– Data integration
– Answering complex queries and making connections
(paths, sub-graphs)
– Analyses including pattern discovery, mining, hypothesis
validation
– Visualization
16. Using Semantics to Climb Levels of Abstraction: an example
• Level 0: Raw Data [in TEXT], e.g., number: “150”
• Level 1: Annotated Data [in RDF], e.g., label: Systolic blood pressure of 150 mmHg
• Level 2: Interpreted data (deductive) [in OWL], e.g., threshold: Elevated Blood Pressure
• Level 3: Interpreted data (abductive) [in OWL], e.g., diagnosis: Hyperthyroidism, …
• Figure labels: SSN Ontology, Intellego
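The climb from a raw number to a candidate diagnosis can be sketched in a few lines of Python. Everything beyond the slide's own example values is an assumption made for illustration: the 140 mmHg threshold, the function names, and the one-entry explanation table are hypothetical and carry no clinical authority.

```python
# Assumed threshold for illustration only; not taken from the slides.
ELEVATED_SYSTOLIC_MMHG = 140.0

# Hypothetical background knowledge: finding -> candidate explanations.
CANDIDATE_EXPLANATIONS = {"Elevated Blood Pressure": ["Hyperthyroidism"]}


def annotate(raw_text, observed_property, unit):
    """Level 0 -> 1: attach a label and a unit to a raw reading."""
    return {"property": observed_property, "value": float(raw_text), "unit": unit}


def interpret_deductive(observation):
    """Level 1 -> 2: apply a threshold rule to the annotated value."""
    if (observation["property"] == "systolic blood pressure"
            and observation["value"] >= ELEVATED_SYSTOLIC_MMHG):
        return "Elevated Blood Pressure"
    return None


def interpret_abductive(finding):
    """Level 2 -> 3: list explanations that could account for the finding."""
    return CANDIDATE_EXPLANATIONS.get(finding, [])


observation = annotate("150", "systolic blood pressure", "mmHg")
finding = interpret_deductive(observation)
diagnoses = interpret_abductive(finding)
```

The deductive step yields a single conclusion from a rule, while the abductive step yields a list of hypotheses that would still need to be ranked or confirmed.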
18. What?: Granularity of Semantics and Associated Applications
• Lightweight semantics: File and document-level
annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and
extraction for semantic search and summarization
• Fine-grained semantics: Data integration,
interoperability and reasoning in Linked Open
Materials Science Data
19. Computer Assisted Document Extraction Tool
[Screenshots: tree/structure view of the Spec; typical view of the tagged Spec]
20. Computer Assisted Document Extraction Tool
Example: Procedure Melt Methods
[Screenshots: view of the original Spec; tagged Spec; Tag Editor]
22. Why?: Benefits of Lightweight Semantics
• Ease of use by domain experts
– Faster and wider adoption, promoting evolution
• Low upfront cost to support
• Shallow semantics has wider applicability to a
range of documents/data and appeal to a broader
community
• Bottom-line: “Learn to Walk before we Run”
23. How?: Using Semantic Web Technologies
Machine-processable semantics achieved by
addressing
• Syntactic Heterogeneity: Using XML syntax and
RDF datamodel (labelled graph structure)
• Semantic Heterogeneity:
– Using “common” controlled vocabularies, taxonomies
and ontologies
– Using federated data sources, exchanges, querying,
and services
24. How?: Ingredients for Semantics-based Cyber Infrastructure
• Use of community-ratified controlled vocabularies
and lightweight ontologies (upper-level,
hierarchies)
• Ease registration, publishing, and discovery
• Provide support for provenance and access control
• Track data citation for credit for data sharing
• Semi-automatic annotation of data and documents: Manual + Automatic
25. How?: Search Continuum
• Keyword-based full-text search
• + Manually provided content and source metadata
• Uses upper-level ontology
• + Automatically extracted metadata
• Map text to concepts/properties/values
• Semantic + faceted search using background knowledge
• + Deeper semi-automatic content annotation and
extraction
• Aggregating related pieces of information; conditioning
• Integration and Interoperation
• + Linked Open Materials Science Data
• + Federated and Faceted Querying and Services
26. Linked Open Data
• Use “URIs” as identifiers to describe things
http://dbpedia.org/resource/John_F._Kennedy
• Associate descriptions to the identifiers
– Example: db:John_F._Kennedy db:Profession db:Politician
27. Linked Open Data
• Connect things together
– db:John_F._Kennedy db:Profession db:Politician
– ex:John_Kennedy owl:sameAs db:John_F._Kennedy
– ex:John_Kennedy ex:authored_book ex:A_Nation_of_Immigrants
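To illustrate what "connecting things together" buys, the sketch below stores the figure's statements as plain Python tuples and follows owl:sameAs links to merge what two datasets say about the same person. The edge directions are an assumed reading of the figure, and the helper names `same_as_closure` and `describe` are hypothetical.

```python
# Triples as read from the figure (edge directions are an assumption).
triples = {
    ("db:John_F._Kennedy", "db:Profession", "db:Politician"),
    ("ex:John_Kennedy", "ex:authored_book", "ex:A_Nation_of_Immigrants"),
    ("ex:John_Kennedy", "owl:sameAs", "db:John_F._Kennedy"),
}


def same_as_closure(graph, node):
    """Collect every identifier linked to node by owl:sameAs, in either direction."""
    aliases = {node}
    changed = True
    while changed:
        changed = False
        for subject, predicate, obj in graph:
            if predicate != "owl:sameAs":
                continue
            if subject in aliases and obj not in aliases:
                aliases.add(obj)
                changed = True
            if obj in aliases and subject not in aliases:
                aliases.add(subject)
                changed = True
    return aliases


def describe(graph, node):
    """Return sorted (predicate, object) pairs stated about node or any of its aliases."""
    aliases = same_as_closure(graph, node)
    return sorted(
        (predicate, obj)
        for subject, predicate, obj in graph
        if subject in aliases and predicate != "owl:sameAs"
    )


facts = describe(triples, "db:John_F._Kennedy")
```

Asking about the `db:` identifier also returns the book recorded under the `ex:` identifier, which is the practical effect of the owl:sameAs link.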
29. Example: Lightweight Semantic Registration of Data
• Title of data
• Keywords: selected from five tier vocabulary provided
• Type of data: maps, excel files, images, text
• Data format: structured or unstructured
• Description of data: brief unstructured description of content
• Contact information of provider(s): name of provider(s), email for verification, lineage
• Spatial extent of data and reference system: location
• Temporal extent of data: date range in time or age range if not recent
• Date and type of Related Publication(s): Journal, Thesis, Agency report, not published
• Host site for publication: Journal, Library, Personal computer
• Access restrictions: copyright regulations
31. Problems and A Practical Approach
(“When rubber meets the road”)
Deeper Issues: Semantic Formalization
of Tabular Data
32. Nature of tables
• Compact structures for sharing information
– Minimize duplication
• Types of Tables
– Regular : Dense Grid with explicit schema
information in terms of column and row
headings => Tractable
– Irregular: Sparse Grid with implicit schema and
ad hoc placement of heading => Hard
34. Challenges Associated with Typical Spreadsheet/Table
• Meant for human consumption
• Irregular :
– Not simple rectangular grid
• Heterogeneous
– All rows not interpreted similarly
• Complex
– Meaning of each row and each column context
dependent
• Footnotes modify meaning of entries (esp. in materials
and process specifications)
35. Practical Semi-Automatic Content Extraction
• DESIGN: Develop regular data structures that
can be used to formalize tabular information.
– Provide a natural expression of data
– Provide semantics to data, thereby removing potential
ambiguities
– Enable automatic translation
• USE: Manual population of regular tables and
automatic translation into LOD
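The DESIGN and USE steps above can be sketched as a regular table (one dictionary per row, with explicit column names) that is translated mechanically into subject-predicate-object triples. The row contents, the `ex:` namespace, and the function name `table_to_triples` are hypothetical and serve only to show the shape of the translation.

```python
def table_to_triples(rows, key_column, namespace="ex:"):
    """Translate a regular table into (subject, predicate, object) triples.

    Each row becomes one subject, identified by the value in key_column;
    every other non-empty cell becomes one triple.
    """
    triples = []
    for row in rows:
        subject = namespace + row[key_column].replace(" ", "_")
        for column, value in row.items():
            if column == key_column or value in ("", None):
                continue
            predicate = namespace + column.replace(" ", "_")
            triples.append((subject, predicate, value))
    return triples


# Hypothetical, manually populated regular table (not data from the slides).
rows = [
    {"material": "Alloy X", "max thickness in": "0.1",
     "heat treat type": "solution heat treated"},
    {"material": "Alloy Y", "max thickness in": "", "heat treat type": "annealed"},
]

lod_triples = table_to_triples(rows, key_column="material")
```

Because the schema is explicit in the column names, no per-table interpretation logic is needed; empty cells are simply skipped rather than guessed.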
37. Matvocab home page
• Search and discovery
• Annotate documents
• Visualize the knowledge base
• Query vocabulary
• View, edit, and add
• Create and process assertions
38. Vocabulary Creation / Curation
N. Jaykumar, P. Yallamelli, V. Nguyen, S. Lalithsena,
K. Thirunarayan, A. Sheth, C. Paul:
KnowledgeWiki: An OpenSource Tool for Creating Community
Curated Vocabulary, with a Use Case in Materials Science
(Linked Data on the Web, World Wide Web Conference 2016)
39. KnowledgeWiki: An OpenSource Tool for Creating
Community-Curated Vocabulary, with a Use Case in
Materials Science
WWW - LDOW 2016, Canada
Nishita Jaykumar, Pavankalyan Yallamelli, Vinh Nguyen,
Sarasi Lalithsena, Krishnaprasad Thirunarayan, Amit Sheth
Kno.e.sis, Wright State University
Clare Paul
Air Force Research Laboratory, Wright-Patterson AFB
40. Context for Research
• Collaboration with AFRL
• Inputs to the consolidated vocabulary (MatVocab):
– ASM Handbook, MIL-HDBK-5, MIL-HDBK-17
– Standardized vocabularies: SKOS, Dublin Core, QUDT, VAEM, …
– Crowdsourcing from domain experts
41. Motivating Example
Facts:
Name | Definition | Source
A-Basis | The mechanical property value is the value above which … | ASM Handbook, Volume 21: Composites.
ABasis | A statistically-based material property; a 95% lower… | Composite Materials Handbook - Volume 1. MIL-HDBK-17F-1F, 17 June 2002
A-Basis | The lower of either a statistically calculated number… | Metallic Materials and Elements for Aerospace Vehicle Structures, MIL-HDBK-5J, 31 January 2003
42. Motivating Example
Facts:
Name | Definition | Source
YoungsModulus | The ratio of normal stress to corresponding … | ASM Handbook, Volume 21: Composites.
ModulusYoungs | The ratio of change in stress to change … | MIL-HDBK-17
• The same term can have multiple definitions, each of which needs to be represented with its provenance information, such as source and time.
44. Semantic Mediawiki
• Extension to Mediawiki
• We use the Semantic Form extension of Semantic Mediawiki for our task
• Inability to represent metadata about the metadata
• Representing entities and simple metadata (example from http://www.slideshare.net/cool_uk/semantic-mediawiki-simple-tutorial):
The '''United Kingdom''' is a country located in [[Located in::Europe]].
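The inline annotation above can be read as a triple: the page is the subject, the text before `::` is the property, and the text after it is the value. The Python sketch below extracts such triples with a regular expression. It is an illustrative parser for the simple `[[Property::value]]` form only, not Semantic Mediawiki's own parser, and the function name is hypothetical.

```python
import re

# Matches the simple inline form [[Property::value]]; display-text variants
# using "|" are deliberately left out of this sketch.
ANNOTATION = re.compile(r"\[\[([^:\]|]+)::([^\]|]+)\]\]")


def extract_annotations(page_title, wikitext):
    """Return (page, property, value) triples found in the wikitext."""
    return [
        (page_title, match.group(1).strip(), match.group(2).strip())
        for match in ANNOTATION.finditer(wikitext)
    ]


wikitext = "The '''United Kingdom''' is a country located in [[Located in::Europe]]."
page_triples = extract_annotations("United Kingdom", wikitext)
```

Each annotation yields exactly one triple, which is why this form has no place to attach a source or a license to the statement itself.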
46. Our Approach
• Adopted the Singleton Property method for capturing triple metadata in SMW
• Importing legacy data with provenance in bulk using the Singleton Property method
• Importing existing RDF datasets with provenance into SMW for curation
47. Singleton Property
Facts:
Subject | Predicate | Object | Source | License
Autoclave | hasDefinition | “A closed vessel for producing…” | MIL-HDBK-17F-1F, 17 | All rights reserved
Singleton Property Translation:
Subject | Predicate | Object
hasDefinition#1 | rdf:sp | hasDefinition
Autoclave | hasDefinition#1 | “A closed vessel for producing…”
hasDefinition#1 | hasSource | MIL-HDBK-17
hasDefinition#1 | hasLicense | All rights reserved
A singleton property represents one specific relationship between two entities under a certain context. It is assigned a URI, as any other property, and can be considered a subproperty or an instance of a generic property.
Source: "Don't like RDF reification?: making statements about statements using singleton property." Proceedings of the 23rd international conference on World wide web. ACM, 2014.
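The translation shown on this slide can be sketched as a small Python function: mint a unique property for one statement, link it to the generic property, restate the fact with the minted property, and hang the metadata off that property. The function name, the `#n` naming scheme, and the use of a counter are illustrative assumptions; the `rdf:sp` label is copied from the slide.

```python
from itertools import count


def to_singleton_triples(subject, predicate, obj, metadata, counter):
    """Translate one fact plus its metadata into singleton-property triples."""
    singleton = f"{predicate}#{next(counter)}"   # unique property for this statement
    triples = [
        (singleton, "rdf:sp", predicate),        # link to the generic property
        (subject, singleton, obj),               # the fact itself
    ]
    # Metadata is attached to the singleton property, not to the subject.
    triples.extend((singleton, key, value) for key, value in metadata.items())
    return triples


counter = count(1)
autoclave_triples = to_singleton_triples(
    "Autoclave",
    "hasDefinition",
    "A closed vessel for producing…",
    {"hasSource": "MIL-HDBK-17", "hasLicense": "All rights reserved"},
    counter,
)
```

A second definition of the same term would receive `hasDefinition#2` from the same counter, so each definition keeps its own source and license.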
48. Why use Singleton Property?
• Formal semantics defined
• Scalable, e.g., to LOD
• Compatible with existing standards
– RDF, RDFS, SPARQL
• Can be used to capture multiple types of metadata
– Provenance, time, location
References:
Fu, Gang, et al. "Exposing Provenance Metadata Using Different RDF Models." arXiv preprint arXiv:1509.02822 (2015).
Nguyen, Vinh, Olivier Bodenreider, and Amit Sheth.
Hernández, Daniel, Aidan Hogan, and Markus Krötzsch. "Reifying RDF: What Works Well With Wikidata?." Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems co-located with 14th International Semantic Web Conference (ISWC 2015), Bethlehem, PA, USA. 2015.
49. Singleton vs. Regular Template
[Figure: two template layouts for the term Autoclave. One lists Definition Text, Image, Source, Rights; the other lists Definition Text, Image, Source, Rights, Source, Rights.]
52. Use Case in Materials Science
• Properties of interest to domain experts:
– Definition Text
– Source
– License
– Creator
– Abbreviation
– Synonyms
– Units
– …
• mv: is the matvocab namespace
53. Statistics of the Vocabulary Import Use Case
No. | Type | SMW
1 | Number of vocabularies imported | 3
2 | Total number of terms imported from ASM | 1295
3 | Total number of terms imported from MIL-HDBK-5 | 19
4 | Total number of terms imported from MIL-HDBK-17 | 179
5 | Total number of Singleton Templates created | 6
6 | Total number of Regular Templates created | 5
7 | Total number of pages created | 1,685
55. Annotate, search, and track provenance
• Vocabulary is used to annotate documents.
• Annotated documents can be indexed.
• Documents can be integrated reliably based
on common terms of interest and
provenance information.
57. Provenance Metadata
• Explains the origin of an artifact, such as
– How was it created?
– Who created it?
– When was it created?
• Example: for a given material X
– Which processes are involved in making the material and what are the relevant performance properties?
– What are the inputs, control parameters and outputs of a process?
– Which research/engineering team performed an experiment?
58. Capturing and Exploring provenance metadata - iExplore
[Figure: provenance graph with the nodes generic PMC prepreg, generic hand lay-up, generic PMC lay-up, generic autoclave cure, and generic PMC, connected by two "subjected to" edges and two "yields" edges.]
60. Biomaterials Knowledge Extraction :
Protein/Peptides/Amino Acids-Precious Metal Bindings
• Recognition and extraction of crystalline surface
patterns for precious metals (e.g., Gold/Silver
surface patterns via Miller Indices - Au(100),
Au(110), Ag(111)), protein/peptide/amino acid
sequences, and indicators of binding relationship.
– Example Input: They found that an alanine-substituted
peptide (AYSSGAPPAPPF) exhibited the highest
affinity for gold, while a proline-substituted peptide
(AYPPGAPPMPPF) showed almost no affinity.
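A rough pattern-based sketch of this extraction task is given below in Python. The regular expressions and the two cue phrases are assumptions chosen to fit the example sentence on this slide; a real extractor would need a much richer set of patterns and cues.

```python
import re

# Crystalline surfaces written with Miller indices, e.g. Au(100) or Ag(111).
SURFACE = re.compile(r"\b(Au|Ag)\((\d{3})\)")

# Parenthesized runs of one-letter amino acid codes, at least five residues long.
PEPTIDE = re.compile(r"\(([ACDEFGHIKLMNPQRSTVWY]{5,})\)")

# Hypothetical cue phrases indicating a binding relationship.
AFFINITY_CUES = ("highest affinity", "no affinity")


def extract_surfaces(text):
    """Return (metal, Miller index) pairs mentioned in the text."""
    return SURFACE.findall(text)


def extract_bindings(text):
    """Pair each peptide with the first cue found before the next peptide."""
    matches = list(PEPTIDE.finditer(text))
    results = []
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        window = text[match.end():end]
        cue = next((c for c in AFFINITY_CUES if c in window), None)
        results.append((match.group(1), cue))
    return results


sentence = (
    "They found that an alanine-substituted peptide (AYSSGAPPAPPF) exhibited "
    "the highest affinity for gold, while a proline-substituted peptide "
    "(AYPPGAPPMPPF) showed almost no affinity."
)
bindings = extract_bindings(sentence)
surfaces = extract_surfaces("Au(100), Au(110), Ag(111)")
```

The window-based pairing is deliberately simple: it assumes the binding indicator follows the peptide it describes, which holds for this sentence but not in general.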
63. Goal and Example Accomplishment
• Implement a Collaboration Platform using Semantic
Web technology in the backend.
– Semantic Web representation (RDF) and querying
(SPARQL) hidden from the users (domain scientists) for
convenience.
• Example functionality incorporated in the “Beta”
version of the PW-11 Collaboration Platform
– Creation of a project by its owner and assigning users to
groups (e.g., ordinary, external, foreign) in a project
– Assigning access control rights based on group/user/file
– Searching, requesting, and uploading files respecting
access restrictions
64. Overall Plan
• Implement necessary user interfaces and backend
processing to facilitate the Collaboration use cases.
– Develop and document user interfaces to support flexible
access control and data exchange
– Store information as metadata in the form of triples to
support light-weight reasoning
• Virtuoso triple store
– Upload and store files (in the server’s file system)
respecting user-project access control restrictions
• Ubuntu, Java VM, Apache Tomcat Web Server
65. Pre-requisites
• Pre-populated set of authorized users (for
authentication)
– Realistically this will require significant scrutiny of a user
outside the collaboration platform.
• Simple access control architecture and mechanisms
(that can be extended further based on user feedback).
• Kno.e.sis prototype assumed availability of an ITAR
certified container to host the collaboration platform.
Thus, the development of additional infrastructure for
ITAR compliance was out of scope.
66. Public-Private Data Sharing
• Enhance publicly available datasets while
retaining intellectual property data privately for
businesses
• Three tiers of data and metadata:
– Private (e.g., ongoing experimental processes, intellectual property data)
– Selectively shared (e.g., with ongoing collaborators, licensed data)
– Public (e.g., released products, material specifications)
67. Federated Architecture
[Figure: a federated endpoint comprising (1) User Authentication, (2) Federated Semantic Query Processor, and (3) Semantics Mappings, placed over three components: OEM partner A, OEM partner B, and OEM supplier C. Each component holds Private, Shared, and Public data behind its own AC Processor and Semantic Query Processor.]
68. Principles of a Federation
• Each component controls access to its local data
independently (local autonomy).
• A query is decomposed into multiple sub-queries; each sub-query is executed at one component.
• Results from sub-queries are combined by the federated query processor, which controls global access.
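These principles can be sketched in Python as follows. Each component filters its own records by the access tier of the requesting user (local autonomy), and a federated function sends one sub-query to each component and combines the results. The class, the user names, the records, and the simplification that every component receives the same sub-query are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    records: list                                   # (subject, predicate, object, tier)
    owners: set = field(default_factory=set)
    collaborators: set = field(default_factory=set)

    def visible_tiers(self, user):
        """Local access control: decide which tiers this user may see."""
        if user in self.owners:
            return {"private", "shared", "public"}
        if user in self.collaborators:
            return {"shared", "public"}
        return {"public"}

    def run_subquery(self, user, predicate):
        """Execute a sub-query locally, respecting local access control."""
        tiers = self.visible_tiers(user)
        return [
            (s, p, o) for s, p, o, tier in self.records
            if p == predicate and tier in tiers
        ]


def federated_query(components, user, predicate):
    """Send one sub-query per component and combine the results."""
    combined = []
    for component in components:
        combined.extend(component.run_subquery(user, predicate))
    return combined


# Hypothetical components and data.
partner_a = Component(
    "OEM partner A",
    [("ex:AlloyX", "ex:hasSpec", "ex:Spec1", "public"),
     ("ex:AlloyX", "ex:hasSpec", "ex:DraftSpec2", "private")],
    owners={"alice"}, collaborators={"bob"},
)
supplier_c = Component(
    "OEM supplier C",
    [("ex:AlloyY", "ex:hasSpec", "ex:Spec3", "shared")],
    owners={"carol"}, collaborators={"alice"},
)

alice_view = federated_query([partner_a, supplier_c], "alice", "ex:hasSpec")
bob_view = federated_query([partner_a, supplier_c], "bob", "ex:hasSpec")
```

The federated function never inspects tiers itself; it only merges what each component chooses to release.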
69. Kno.e.sis Tools
• Doozer: Ontology creator from Wikipedia
category hierarchy
• Scooner: Tool for trailblazing using semantic
triples
• Kino: Faceted Search Engine
• iExplore: Visualize and navigate semantic /
linked data
• BLOOMS: Ontology alignment tool
70. Take Away
Use of semantic web technologies
can help overcome challenges associated with
data discovery, integration, and interoperability,
caused by data heterogeneity, and
use of provenance and access control
can help to share/exchange data reliably.
71. Thank you, and please visit us at
http://knoesis.org/
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA