2. Summary
• Semantic markup vs. markup semantics
• Why markup semantics
• Why XML is not enough
• Markup semantics with EARMARK and Linguistic Act
• Real-world scenarios
• Conclusions
3. Shift of meaning
• 1990, Web of documents: First Era of the Web (WWW)
✦ Markup = document markup: it tells us something about the text or content of a document
✦ Tag = markup element: a syntactic item representing the building block of a document structure
✦ Semantics and markup = markup semantics: “what is the meaning of a markup element title contained in a document d?”
• Today, Web of data: Second Era of the Web (Semantic Web)
✦ Markup = resource markup: any data added to a resource with the intention to semantically describe it
✦ Tag = keyword: it is used to identify a non-hierarchical keyword or term assigned to a piece of information (such as an Internet bookmark, digital image or computer file)
✦ Semantics and markup = semantic markup: “the resource r has the string ‘Dealing with Markup Semantics’ as title”
4. Markup semantics today
• The document markup is still here:
✦ a lot of research issues are still open problems
✦ some of those partially solved issues can be addressed in a better way through
today’s tools and technologies
• So, our question is:
Why has the Semantic Web not yet properly addressed markup semantics?
Possible answers:
✦ Because the document markup is dead, really
✦ Because markup semantics is not an interesting research topic
✦ Because markup semantics is not a useful tool for solving valuable problems
✦ Actually, the Semantic Web addressed markup semantics
5. The document markup is dead... wait, really?
• The document markup does not play any important role in
today’s research fields and company interests
Are we definitely sure?
Maybe not!
6. Research groups’ interest in markup semantics
• Does it mean that there are no research communities interested in this issue? Well,
actually, it is an old and still-open issue:
✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M. (2002). Towards a Semantics for XML Markup.
✦ Dubin, D. (2003). Object mapping for markup semantics.
✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M., Huitfeldt, C. (2003). XML Semantics and Digital Libraries.
✦ Simons, G. F., Lewis, W. D., Farrar, S. O., Langendoen, D. T., Fitzsimons, B., Gonzalez, H. (2004). The semantics of
markup: mapping legacy markup schemas to a common semantics.
✦ Garcia, R., Celma, O. (2005). Semantic Integration and Retrieval of Multimedia Metadata.
✦ Marcoux, Y. (2006). A natural-language approach to modeling: Why is some XML so difficult to write?
✦ Van Deursen, D., Poppe, C., Martens, G., Mannens, E., Van de Walle, R. (2008). XML to RDF Conversion: a Generic Approach.
✦ Marcoux, Y., Rizkallah, E. (2009). Intertextual semantics: A semantics for information design.
✦ Sperberg-McQueen, C. M., Marcoux, Y., Huitfeldt, C. (2009). Two representations of the semantics of TEI Lite.
✦ Nuzzolese, A., Gangemi, A., Presutti, V. (2010). Gathering Lexical Linked Data and Knowledge Patterns from FrameNet.
• “The problem addressed seems old and seems to have been solved before, but actually
has not [sufficiently]”
– by an anonymous reviewer
7. Markup semantics and real-world problems
• Some advantages of having a formal and machine-readable
semantics of markup:
✦ perform both syntactic and semantic validation
✦ infer facts from documents automatically
✦ simplify the federation, conversion and translation of documents among digital
repositories
✦ query the structure of the document by considering its semantics
✦ create visualisations of documents considering the semantics of their
structures rather than their markup vocabularies
✦ increase the accessibility of documents’ content (see the “tag abuse” issue)
✦ guarantee better maintainability when a markup schema evolves
• Fields of interest: digital libraries and digital (and semantic)
publishing
8. Semantic Web approaching markup semantics
• RDFa may be a valid choice for associating formal semantics with arbitrary
text fragments
✦ Pros: easy to use and parse, compliant with XML-like formats
✦ Cons: we need to modify the structure of the document (more attributes, more elements)
Original document (1 markup element only):
<?xml version="1.0" encoding="UTF-8"?>
<p>Fabio says that overlhappens</p>
RDFa enhancing (2 markup elements, 3 attributes):
<?xml version="1.0" encoding="UTF-8"?>
<p prefix=": http://www.example.com/
          foaf: http://xmlns.com/foaf/0.1/">
  <span about=":fv" property="foaf:firstName">Fabio</span>
  says that overlhappens
</p>
• There are domains (e.g., those having to deal with administrative and juridical
documents) in which we cannot modify the structure of documents
• How can we say that the element p in the document means “paragraph”?
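The structural cost of the RDFa enhancement can be checked mechanically. The following is only a minimal sketch using Python's standard `xml.etree.ElementTree`; the function names `markup_footprint` and `text_content` are illustrative and not part of any tool mentioned in these slides.

```python
import xml.etree.ElementTree as ET

# The two versions of the paragraph from this slide.
PLAIN = "<p>Fabio says that overlhappens</p>"
RDFA = (
    '<p prefix=": http://www.example.com/ foaf: http://xmlns.com/foaf/0.1/">'
    '<span about=":fv" property="foaf:firstName">Fabio</span>'
    " says that overlhappens</p>"
)

def markup_footprint(xml_text):
    """Return (number of elements, number of attributes) of an XML fragment."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())  # includes the root element itself
    attributes = sum(len(el.attrib) for el in elements)
    return len(elements), attributes

def text_content(xml_text):
    """Return the character content with all markup stripped."""
    return "".join(ET.fromstring(xml_text).itertext())

plain_footprint = markup_footprint(PLAIN)
rdfa_footprint = markup_footprint(RDFA)
print(plain_footprint, rdfa_footprint)
```

The text content is the same in both versions; only the markup structure grows, which is exactly what some domains do not allow.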
9. Our problems in addressing markup semantics
• Let’s use XML for defining document markup structures
✦ Pros: it is today’s common format, used in a lot of tools and applications
✦ Cons: it does not define a formal way for specifying markup semantics
• Let’s use OWL for defining formal semantics and then associating it to
XML markup
✦ Pros: OWL was created to define semantics
✦ Cons: we have to use XML-based approaches (RDFa, GRDDL) to link semantics to
XML markup and this is not always possible
• A compromise between XML and OWL is not fully satisfying
• A solution: to elevate either the document markup formalism or the
formal semantics model to the level of the other, that means:
✦ to use XML for document markup and another formalism, fully compliant with XML in
all the possible scenarios, for defining its markup semantics (does it exist?), or
✦ to develop an OWL ontology for defining document markup and another OWL
ontology for specifying its semantics
try to guess what we did
10. EARMARK
• The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup
meta-language and an ontology of (document) markup
✦ More expressive than XML: it allows markup structures to be organised as graphs
✦ It makes it easy to associate OWL semantics with document items: an EARMARK
document is a set of OWL assertions, and all the markup items and text nodes are
individuals of particular classes
✦ A lot of tools are available: a Java API, and frameworks to convert XML documents into
EARMARK ones and to convert complex EARMARK documents (i.e., having a graph
structure) into XML ones, applying overlapping tricks to store as much information as
possible in the simple XML tree hierarchy
more information at http://palindrom.es/phd/research/earmark
11. An example: XML tricks
[Figure: element p contains agent, noun and verb over the text “Fabio says that overlhappens”]
• This is not directly representable in XML (unless using tricks): “noun” and “verb” overlap
• To be representable in XML it should be...
[Figure: element p contains agent, noun and verb, where verb contains a second noun fragment, over the same text]
• XML serialisation with TEI fragmentation:
<p>
  <agent>Fabio</agent> says that
  <noun xml:id="e1" next="e2">
    overl
  </noun>
  <verb>
    h<noun xml:id="e2">ap</noun>pens
  </verb>
</p>
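A consumer of the fragmented serialisation has to stitch the pieces back together. The sketch below shows one way this could be done with Python's standard library, assuming (as in the excerpt above) that `next` holds the bare `xml:id` of the following fragment; `rejoin_fragments` is a hypothetical helper, not part of TEI tooling or of the EARMARK API.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

FRAGMENTED = (
    "<p><agent>Fabio</agent> says that "
    '<noun xml:id="e1" next="e2">overl</noun>'
    '<verb>h<noun xml:id="e2">ap</noun>pens</verb></p>'
)

def rejoin_fragments(root, tag):
    """Follow `next` pointers to rebuild the text of fragmented `tag` elements."""
    by_id = {el.get(XML_ID): el for el in root.iter(tag)}
    targets = {el.get("next") for el in by_id.values() if el.get("next")}
    rebuilt = []
    for frag_id, el in by_id.items():
        if frag_id in targets:
            continue  # not the first fragment of its chain
        parts = []
        while el is not None:
            parts.append("".join(el.itertext()))
            nxt = el.get("next")
            el = by_id.get(nxt) if nxt else None
        rebuilt.append("".join(parts))
    return rebuilt

root = ET.fromstring(FRAGMENTED)
nouns = rejoin_fragments(root, "noun")
verbs = ["".join(v.itertext()) for v in root.iter("verb")]
print(nouns, verbs)
```

The extra identifiers and pointers exist only to work around the tree hierarchy; they say nothing about the document itself.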
12. An example: EARMARK document
[Figure: element p contains agent, noun and verb over the text “Fabio says that overlhappens”]
ex:doc a :StringDocuverse;
  :hasContent "Fabio says that overlhappens".
ex:r0-5 a :PointerRange;
  :refersTo ex:doc;
  :begins "0"; :ends "5".
ex:r5-16 a :PointerRange;
  :refersTo ex:doc;
  :begins "5"; :ends "16".
ex:r16-21 a :PointerRange;
  :refersTo ex:doc;
  :begins "16"; :ends "21".
ex:r22-24 a :PointerRange;
  :refersTo ex:doc;
  :begins "22"; :ends "24".
ex:r21-28 a :PointerRange;
  :refersTo ex:doc;
  :begins "21"; :ends "28".
ex:agent a :Element;
  :hasGeneralIdentifier "agent";
  c:firstItem [c:itemContent ex:r0-5].
ex:noun a :Element;
  :hasGeneralIdentifier "noun";
  c:firstItem [c:itemContent ex:r16-21;
    c:nextItem [c:itemContent ex:r22-24]] .
ex:verb a :Element;
  :hasGeneralIdentifier "verb";
  c:firstItem [c:itemContent ex:r21-28].
ex:p a :Element ; :hasGeneralIdentifier "p";
  c:firstItem [c:itemContent ex:agent; c:nextItem [c:itemContent ex:r5-16;
    c:nextItem [c:itemContent ex:noun; c:nextItem [c:itemContent ex:verb]]]].
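To see how ranges over a shared docuverse express overlap, the assertions above can be mirrored in plain Python. This is only an illustrative sketch: the dictionaries `RANGES` and `ELEMENTS` and the helper functions are assumptions made for this example, not the EARMARK Java API.

```python
DOCUVERSE = "Fabio says that overlhappens"

# (begin, end) offsets copied from the PointerRange individuals.
RANGES = {
    "r0-5": (0, 5),
    "r5-16": (5, 16),
    "r16-21": (16, 21),
    "r22-24": (22, 24),
    "r21-28": (21, 28),
}

# Ordered content of each element: names of ranges or of nested elements.
ELEMENTS = {
    "agent": ["r0-5"],
    "noun": ["r16-21", "r22-24"],
    "verb": ["r21-28"],
    "p": ["agent", "r5-16", "noun", "verb"],
}

def ranges_of(name):
    """Flatten an element (or a single range) into its list of offsets."""
    if name in RANGES:
        return [RANGES[name]]
    return [r for item in ELEMENTS[name] for r in ranges_of(item)]

def text_of(name):
    """Concatenate the docuverse text covered by an element or range."""
    return "".join(DOCUVERSE[b:e] for b, e in ranges_of(name))

def overlap(a, b):
    """True if two elements cover at least one common character."""
    return any(b1 < e2 and b2 < e1
               for b1, e1 in ranges_of(a)
               for b2, e2 in ranges_of(b))

print(text_of("noun"), text_of("verb"), overlap("noun", "verb"))
```

No fragmentation or pointer attributes are needed: the noun simply refers to two ranges, one of which lies inside the range of the verb.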
13. Towards markup semantics
• EARMARK is suitable for expressing markup semantics
straightforwardly using OWL
• What model can we use? It must:
✦ follow precise and theoretically-founded principles
✦ be interoperable across different markup vocabularies
• A large number of vocabularies address the representation of
terms vs. meanings vs. things, e.g., SKOS, FRBR, CIDOC, OWL-WordNet
Problems:
✦ too specific for particular contexts
✦ they are not interoperable
14. Linguistic Act ontology design pattern
• References: any individual from the
world we are describing – e.g., Fabio
• Meanings: any (meta-level) object
that explains something – e.g., person
• Information entities: any symbol
that has a meaning or denotes one or
more references – e.g., the string
“Fabio”
• Linguistic acts: any communicative
situation including information entities,
agents, meanings, references, and a
possible spatio-temporal context – e.g.,
to add markup to a document
http://ontologydesignpatterns.org/cp/owl/semantics.owl
15. Example: “Results” section of a paper
2 XML excerpts of “Results” sections:
<div class="section">
  <h1>Results</h1>
  <p>...</p>
</div>

<section>
  <info>
    <title>Results</title>
  </info>
  <para>...</para>
</section>
Related EARMARK conversions:
ex1:div a :Element;
  :hasGeneralIdentifier "div";
  c:firstItem [c:itemContent ex1:class;
    c:nextItem [c:itemContent ex1:h1;
      c:nextItem [c:itemContent ex1:p]]];
  la:expresses doco:Section, deo:Results.
...
ex1:p a :Element;
  :hasGeneralIdentifier "p";
  c:firstItem [c:itemContent ex1:someText];
  la:expresses doco:Paragraph.
...

ex2:section a :Element;
  :hasGeneralIdentifier "section";
  c:firstItem [c:itemContent ex2:info;
    c:nextItem [c:itemContent ex2:para]];
  la:expresses doco:Section, deo:Results.
...
ex2:para a :Element;
  :hasGeneralIdentifier "para";
  c:firstItem [c:itemContent ex2:someText];
  la:expresses doco:Paragraph.
...
We are using the Document Components Ontology (http://purl.org/spar/doco) and
the Discourse Elements Ontology (http://purl.org/spar/deo) to specify the semantics of markup elements
16. Searches on heterogeneous repositories
• Problem: how to search across a large number of digital libraries
that store their documents as XML in different and
non-interoperable formats?
• Query: give me all the markup elements that represent
paragraphs of any “Results” section of any available document that
were written by any person called Fabio
SELECT ?x WHERE {
?x a :Element ; la:expresses doco:Paragraph ;
dc:creator [a foaf:Person ; foaf:name "Fabio"];
(^c:itemContent/^c:item)+
[a :Element; la:expresses doco:Section , deo:Results]
}
ex1:p and ex2:para are returned
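The logic of the query can be mimicked in plain Python to show why the element names never appear in it. This is a rough sketch, not a SPARQL engine: the `Element` class is an assumption for this example, the `ex3:` entries are hypothetical distractors that are not in the slides, and containment is flattened to a direct parent link.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Element:
    name: str
    expresses: frozenset
    parent: Optional[str] = None
    creator: Optional[str] = None

ELEMENTS = {e.name: e for e in [
    Element("ex1:div", frozenset({"doco:Section", "deo:Results"})),
    Element("ex1:p", frozenset({"doco:Paragraph"}), "ex1:div", "Fabio"),
    Element("ex2:section", frozenset({"doco:Section", "deo:Results"})),
    Element("ex2:para", frozenset({"doco:Paragraph"}), "ex2:section", "Fabio"),
    # Hypothetical distractors: a section that is not a Results section.
    Element("ex3:sec", frozenset({"doco:Section"})),
    Element("ex3:p", frozenset({"doco:Paragraph"}), "ex3:sec", "Fabio"),
]}

def ancestors(name):
    """Yield every enclosing element, like the `+` in the property path."""
    parent = ELEMENTS[name].parent
    while parent is not None:
        yield ELEMENTS[parent]
        parent = ELEMENTS[parent].parent

def result_paragraphs_by(author):
    """Names of paragraph elements by `author` inside any Results section."""
    wanted = {"doco:Section", "deo:Results"}
    return sorted(
        e.name for e in ELEMENTS.values()
        if "doco:Paragraph" in e.expresses
        and e.creator == author
        and any(wanted <= a.expresses for a in ancestors(e.name))
    )

matches = result_paragraphs_by("Fabio")
print(matches)
```

The selection relies only on what each element expresses, so `p` and `para` are treated alike.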
17. Semantic format conversion
• Problem: how to convert a document from an (unknown) format
into a target one, without knowing the markup vocabulary of the
former but being able to query its semantics
• Convert: substitute any markup element representing a section
with a new one named “sec” that contains the same elements and
text content of the removed one
DELETE {?s :hasGeneralIdentifier ?gi}
INSERT {?s :hasGeneralIdentifier "sec"}
WHERE {
?s a :Element; :hasGeneralIdentifier ?gi;
la:expresses doco:Section
}
The previous excerpts change to:
<sec class="section">
  <h1>Results</h1>
  ...

<sec>
  <info>
    <title>Results</title>
    ...
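The effect of the update can be sketched in plain Python: elements are renamed according to what they express, whatever their original name. The dictionary layout and the function `rename_sections` are assumptions made for this illustration only.

```python
def rename_sections(elements, new_name="sec"):
    """Return a copy in which every element expressing doco:Section is renamed."""
    converted = {}
    for key, el in elements.items():
        el = dict(el)  # shallow copy, so the input is left untouched
        if "doco:Section" in el["expresses"]:
            el["gi"] = new_name
        converted[key] = el
    return converted

# General identifiers ("gi") and semantics of the elements seen so far.
DOCS = {
    "ex1:div": {"gi": "div", "expresses": {"doco:Section", "deo:Results"}},
    "ex1:p": {"gi": "p", "expresses": {"doco:Paragraph"}},
    "ex2:section": {"gi": "section", "expresses": {"doco:Section", "deo:Results"}},
    "ex2:para": {"gi": "para", "expresses": {"doco:Paragraph"}},
}

converted = rename_sections(DOCS)
print({key: el["gi"] for key, el in converted.items()})
```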
18. Markup sensibility
• Problem: how to determine whether a markup element that is valid at the syntactic
and structural level is also valid at the semantic level
• Semantic constraints can be defined as ontological axioms of the underlying
ontology, so that we can check whether a document adheres to or contradicts
them
<smith> a :Element; :hasGeneralIdentifier "TLCPerson";
  la:denotes </ontology/uk/person/JohnSmith> ...
</ontology/uk/person/JohnSmith> a akomantoso:Person.
<akomaNtoso> ...
  <TLCPerson id="smith" href="/ontology/uk/person/JohnSmith" /> ...
  <speech id="sp_1" by="#smith" as="#mineconomy">
    <p>Honorable Members of the Parliament...</p>
  </speech> ...
</akomaNtoso>
<sp_1> a :Element; :hasGeneralIdentifier "speech";
  la:expresses akomantoso:Speech; la:denotes _:aSpeechEvent; ...
_:aSpeechEvent a akomantoso:SpeechEvent;
  akomantoso:hasSpeaker </ontology/uk/person/JohnSmith>.
[] a la:LinguisticAct; sit:isSettingFor <sp_1>, akomantoso:Speech,
  </ontology/uk/person/JohnSmith>, _:aSpeechEvent.
19. Verifying semantic constraints
• Verify: check whether the markup element “speech” denotes exactly one
speech event that involves at least one person as speaker, where that person is
introduced in the document through a markup element
(Element that hasGeneralIdentifier value "speech")
SubClassOf
(sit:hasSetting only
(la:LinguisticAct that
sit:isSettingFor exactly 1 (Element and la:InformationEntity)
and
sit:isSettingFor exactly 1 (
(akomantoso:SpeechEvent and la:Reference)
that
akomantoso:hasSpeaker some (
akomantoso:Person that la:isDenotedBy some Element
)
)
and
sit:isSettingFor value akomantoso:Speech
)
)
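The intent of the axiom can be approximated by a simple check in plain Python. This is a deliberately simplified, closed-world sketch and not OWL reasoning: the data layout, the short identifiers (`JohnSmith`, `aSpeechEvent`) and the function `check_speech` are all assumptions made for this example.

```python
# Markup elements and what they denote (simplified identifiers).
ELEMENTS = {
    "smith": {"gi": "TLCPerson", "denotes": "JohnSmith"},
    "sp_1": {"gi": "speech", "denotes": "aSpeechEvent"},
}
TYPES = {"JohnSmith": "Person", "aSpeechEvent": "SpeechEvent"}
SPEAKERS = {"aSpeechEvent": ["JohnSmith"]}

def check_speech(element_id, elements, types, speakers):
    """Return the list of violated constraints (empty if the element is fine)."""
    event = elements[element_id].get("denotes")
    if types.get(event) != "SpeechEvent":
        return ["speech does not denote a speech event"]
    # Every reference that some markup element in the document denotes.
    denoted = {el.get("denotes") for el in elements.values()}
    valid = [s for s in speakers.get(event, [])
             if types.get(s) == "Person" and s in denoted]
    if not valid:
        return ["no speaker is a person introduced by a markup element"]
    return []

ok = check_speech("sp_1", ELEMENTS, TYPES, SPEAKERS)

# Same document without the TLCPerson element: the speaker is never introduced.
without_person = {"sp_1": ELEMENTS["sp_1"]}
broken = check_speech("sp_1", without_person, TYPES, SPEAKERS)
print(ok, broken)
```

A reasoner working on the OWL axiom operates under the open-world assumption, so this check only conveys the shape of the constraint.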
20. Conclusions
• The issue of markup semantics is still an interesting research field, with a lot of
possible applications in real-world scenarios
• We proposed our approach for addressing markup semantics through Semantic
Web technologies and we introduced EARMARK, as a new document markup
meta-language, and the Linguistic Act ontology design pattern for expressing
semantics of EARMARK document markup
• We showed how to use these models for addressing real scenarios in which the
use of markup semantics can help when doing particular tasks, such as querying
on heterogeneous document repositories, converting document markup across
different vocabularies, and verifying the validity of markup elements at a semantic
level
• Future developments:
✦ a software assistant that helps users in the definition of markup semantics of a given XML schema
✦ two applications for the semantic validation of markup documents and for the visualisation of
document parts according to their semantics