In order to make semantic assertions about the text content of a document, we need a mechanism to identify and organize the text structures of the document itself. Such a mechanism would closely resemble a document-oriented markup language but would be free of the classical constraints of an embedded markup language: it would impose no limitations of sequentiality, containment, or contiguity on text fragments. Over the past years we developed EARMARK, our OWL proposal for expressing arbitrary semantic annotations about the structure and the text content of a document. In this paper we describe FRETTA, our mechanism for rendering arbitrary EARMARK annotations (including non-sequential, non-hierarchical and non-contiguous ones) in XML, bringing into a unifying framework half a dozen syntactic tricks used in the literature to handle overlapping structures in a strictly hierarchical language.
Embedding semantic annotations within texts: the FRETTA approach
1. Embedding semantic annotations
within texts: the FRETTA approach
Gioele Barabucci - barabucc@cs.unibo.it
Silvio Peroni - essepuntato@cs.unibo.it
Francesco Poggi - fpoggi@cs.unibo.it
Fabio Vitali - fabio@cs.unibo.it
http://creativecommons.org/licenses/by-sa/3.0
2. Outline
• Conversion from an XML format into another
• Overlapping markup
• Abstract conversion framework
• FRETTA
• Evaluation
• Conclusions
3. Converting XML vocabularies that use
syntactic workarounds
• The conversion of OpenOffice Writer documents (ODT) into Microsoft Word
documents (DOCX) (and vice versa) is not a straightforward operation
• Converters exist and are included as core components of word processors
• Those converters do not implement mechanisms for a full and effective document
conversion, especially when particular features are needed – e.g., information tracking
document changes occurring over time
4. What happens to markup
OpenOffice (ODT)
Before the tracked change:
<text:p>
  The beginning
  and the end.
</text:p>
After the tracked change:
<text:tracked-changes>
  <text:changed-region text:id="S1">
    <text:insertion><office:change-info>
      <dc:creator>John Smith</dc:creator>
      <dc:date>2009-10-27T18:45:00</dc:date>
    </office:change-info></text:insertion>
  </text:changed-region>
</text:tracked-changes>
[…]
<text:p>The beginning and
  <text:change-start text:change-id="S1"/></text:p>
<text:p>also
  <text:change-end text:change-id="S1"/>
  the end.</text:p>
Microsoft Word (DOCX)
Before the tracked change:
<w:p>
  <w:r>
    <w:t>
      The beginning
      and the end.
    </w:t>
  </w:r>
</w:p>
After the tracked change:
<w:p>
  <w:pPr><w:rPr>
    <w:ins w:id="0" w:author="John Smith"
      w:date="2009-10-27T18:50:00Z"/>
  </w:rPr></w:pPr>
  <w:r><w:t>The beginning and </w:t></w:r></w:p>
<w:p>
  <w:ins w:id="1" w:author="John Smith"
    w:date="2009-10-27T18:50:00Z">
    <w:r><w:t>also </w:t></w:r></w:ins>
  <w:r><w:t>the end.</w:t></w:r></w:p>
5. Overlapping markup
• Overlapping markup is needed when different markup items refer to the same
document fragment
Previous example in incorrect XML
<p>The beginning and <ins></p>
<p>also </ins> the end</p>
XML formalisation via workarounds
<p>The beginning and <ins start="foo"/></p>
<p>also <ins end="foo"/>the end</p>
• Different techniques to embed overlapping structures in XML hierarchies:
✦ milestones: a pair of empty elements representing the start and the end tags, connected to each other by
special attributes
✦ fragmentation: elements separated within the primary hierarchy and connected to each other by special
attributes
✦ twin documents: each hierarchy is represented by a different document which contains the same textual
content
✦ stand-off: places overlapping elements in a separate resource (e.g. another file) specifying the position
(down to the individual character) of each start and end location within the main structure
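The milestone and stand-off techniques listed above can be sketched in a few lines of Python. This is only an illustration, not FRETTA code: the character-offset model, the function names, the <doc> wrapper and the from/to attributes are assumptions made for the example, and the insertion is placed so that it crosses the paragraph boundary ("and also ").

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

TEXT = "The beginning and also the end"
PARAGRAPHS = [(0, 18), (18, 30)]  # primary hierarchy: two <p> character ranges
INS = (14, 23)                    # hypothetical insertion "and also ", spans both <p>


def milestones(text, paragraphs, span, ident):
    """Mark the span with two empty elements linked by a shared identifier."""
    start, end = span
    out = []
    for p_start, p_end in paragraphs:
        out.append("<p>")
        pos = p_start
        for cut, attr in ((start, "start"), (end, "end")):
            # A start marker belongs to the paragraph it opens in,
            # an end marker to the paragraph it closes in.
            if attr == "start":
                inside = p_start <= cut < p_end
            else:
                inside = p_start < cut <= p_end
            if inside:
                out.append(escape(text[pos:cut]))
                out.append(f'<ins {attr}="{ident}"/>')
                pos = cut
        out.append(escape(text[pos:p_end]))
        out.append("</p>")
    return "<doc>" + "".join(out) + "</doc>"


def stand_off(text, paragraphs, span):
    """Keep only the primary hierarchy inline; describe the overlapping
    element in a separate resource that points back by character offset."""
    primary = "<doc>" + "".join(
        f"<p>{escape(text[s:e])}</p>" for s, e in paragraphs) + "</doc>"
    external = f'<ins from="{span[0]}" to="{span[1]}"/>'
    return primary, external


inline = milestones(TEXT, PARAGRAPHS, INS, "foo")
ET.fromstring(inline)  # raises ParseError if the result were not well-formed
print(inline)
print(stand_off(TEXT, PARAGRAPHS, INS))
```

Both results are well-formed XML: the overlap survives only through the shared identifier (milestones) or through the offsets stored outside the document (stand-off).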
6. Abstract conversion framework
XML format 1 with overlapping workarounds (e.g., ODT + change tracking) → XML format 2 with overlapping workarounds (e.g., DOCX + change tracking)
• Step 1: Identification of XML overlapping workarounds and creation of a document with explicit overlap (XML document in format 1 → EARMARK document in format 1)
• Step 2: Syntactic and semantic conversion from format 1 into format 2 (EARMARK document in format 1 → EARMARK document in format 2)
• Step 3: Linearisation into an XML document with overlapping workarounds (EARMARK document in format 2 → XML document in format 2); this step is today’s contribution
• EARMARK is a non-XML markup metalanguage used as an intermediate language for the conversion. It allows markup structures to be organized both as trees and as generic graphs with no particular limitations.
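As a rough illustration of Step 1, the sketch below reads a document that uses the milestone workaround and recovers the overlap explicitly as character ranges. It is not the authors' implementation: a plain list of (name, start, end) tuples stands in for an EARMARK document, the start/end attribute convention is the one from the slide 5 example, the <doc> wrapper and the function name are assumptions, and the input is a variant of that example in which the insertion starts mid-paragraph.

```python
import xml.etree.ElementTree as ET

# Variant of the slide 5 example; <doc> is a hypothetical wrapper element.
SOURCE = ('<doc><p>The beginning <ins start="foo"/>and </p>'
          '<p>also <ins end="foo"/>the end</p></doc>')


def explicit_ranges(xml_source):
    """Return (text, ranges), where ranges are (name, start, end) offsets."""
    chunks, ranges, open_milestones = [], [], {}
    offset = 0

    def emit(fragment):
        # Append a piece of text content and advance the running offset.
        nonlocal offset
        if fragment:
            chunks.append(fragment)
            offset += len(fragment)

    def walk(elem):
        if "start" in elem.attrib:
            # Opening milestone: remember where the range begins.
            open_milestones[elem.get("start")] = offset
        elif "end" in elem.attrib:
            # Closing milestone: pair it with its opening counterpart.
            begin = open_milestones.pop(elem.get("end"))
            ranges.append((elem.tag, begin, offset))
        else:
            # Ordinary element: its range covers all of its content.
            begin = offset
            emit(elem.text)
            for child in elem:
                walk(child)
                emit(child.tail)
            ranges.append((elem.tag, begin, offset))

    root = ET.fromstring(xml_source)
    for child in root:  # the wrapper itself is not recorded as a range
        walk(child)
        emit(child.tail)
    return "".join(chunks), ranges


text, ranges = explicit_ranges(SOURCE)
print(text)
print(ranges)
```

In the result the "ins" range (14, 23) crosses the boundary between the two "p" ranges, which is exactly the overlap that the milestone pair was hiding.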
7. FRETTA
• FRETTA (From EARMARK To Tag) is a general and extensible Java framework
for expressing EARMARK documents in an embedded XML syntax
• Users that want to convert from EARMARK into XML document formats
must indicate which workarounds are used in a certain target format
• FRETTA performs the requested conversion by passing through four
consecutive steps
EARMARK document → workaround specification → structural conversion → semantic conversion → linearisation → XML document
✦ workaround specification: the user specifies which workaround to use to represent an (EARMARK) overlapping element in XML
✦ structural conversion: pure-structural conversion that produces a new EARMARK document in which overlapping elements are transformed appropriately according to the specified workarounds
✦ semantic conversion: semantic conversion that may change the current structure of the EARMARK document according to how the target XML format handles the specified workarounds
✦ linearisation: generation of the resulting XML tree with the requested workarounds
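The four steps can be mimicked on a toy model to show how they fit together. The sketch below is written in Python for brevity and does not reflect FRETTA's actual Java API: named character ranges stand in for an EARMARK document, only the fragmentation workaround is handled, and every name in it (SPEC, structural, semantic, linearise, the group and part attributes) is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

TEXT = "The beginning and also the end"
# Hypothetical stand-in for an EARMARK document: named character ranges.
RANGES = [("p", 0, 18), ("p", 18, 30), ("ins", 14, 23)]

# Step 1, workaround specification: which workaround renders which element.
SPEC = {"ins": "fragmentation"}


def structural(ranges, spec):
    """Step 2: split each range listed in spec at the boundaries of the
    primary ranges, so that the resulting ranges nest properly."""
    primary = [r for r in ranges if r[0] not in spec]
    out = [(name, s, e, {}) for name, s, e in primary]
    overlapping = [r for r in ranges if r[0] in spec]
    for n, (name, s, e) in enumerate(overlapping):
        for _, p_start, p_end in primary:
            lo, hi = max(s, p_start), min(e, p_end)
            if lo < hi:
                out.append((name, lo, hi, {"group": f"g{n}"}))
    return out


def semantic(ranges, link_attribute):
    """Step 3: adapt to the target vocabulary; in this toy version only the
    attribute that links the fragments together is renamed."""
    return [(name, s, e, {link_attribute: attrs["group"]} if attrs else {})
            for name, s, e, attrs in ranges]


def linearise(text, ranges):
    """Step 4: emit start and end tags for properly nested ranges."""
    events = []
    for name, s, e, attrs in ranges:
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        # At equal offsets: end tags first (shortest range first),
        # then start tags (longest range first).
        events.append((s, 1, -(e - s), f"<{name}{attr_text}>"))
        events.append((e, 0, e - s, f"</{name}>"))
    out, pos = [], 0
    for offset, _, _, tag in sorted(events):
        out.append(escape(text[pos:offset]))
        out.append(tag)
        pos = offset
    out.append(escape(text[pos:]))
    return "".join(out)


result = linearise(TEXT, semantic(structural(RANGES, SPEC), "part"))
print(result)
```

Keeping the steps as separate functions mirrors the separation described above: the structural step knows nothing about the target vocabulary, and the linearisation step only ever sees ranges that already nest.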
8. Evaluation
• Comparing FRETTA’s outputs against a set of twelve TEI documents (TEIDocs) written by markup experts
• The evaluation took into account four different principles
✦ well-formedness (WF): whether the framework returns well-formed XML documents
✦ validity (V): whether the framework returns valid XML documents according to the particular target XML vocabulary
✦ naturalness (N): how structurally similar the XML documents returned by the framework are to TEIDocs
✦ minimality (M): how much the number of nodes (i.e., elements, attributes and text nodes) in the XML documents returned by the framework varies from TEIDocs

document         workarounds    WF  V   N   M
agrippine        fragmentation  ✓   ✓   ✓   ✓
agrippine        milestones     ✓   ✓   ✓   ✓
drivemycar       fragmentation  ✓   ✓   X   X
johnlovesmary    fragmentation  ✓   ✓   ✓   ✓
johnlovesmary    milestones     ✓   ✓   ✓   ✓
peergynt         fragmentation  ✓   ✓   ✓   ✓
peergynt         milestones     ✓   ✓   ✓   ✓
peterpaulhammer  milestones     ✓   ✓   ✓   ✓
thoughtalice     fragmentation  ✓   ✓   ✓   ✓
titwillow        fragmentation  ✓   ✓   X   ✓
titwillow        fragmentation  ✓   ✓   X   X
titwillow        milestones     ✓   ✓   X   ✓

• 100% of the documents are well-formed and valid
• 67% remain natural (N) against TEIDocs
• 83% remain minimal (M) against TEIDocs
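The summary percentages follow directly from the table: each one is the share of the twelve rows that pass the given principle. A small check, with the rows transcribed from the table and hypothetical helper names:

```python
T, F = True, False

# Rows transcribed from the evaluation table: (document, workaround, WF, V, N, M).
RESULTS = [
    ("agrippine", "fragmentation", T, T, T, T),
    ("agrippine", "milestones", T, T, T, T),
    ("drivemycar", "fragmentation", T, T, F, F),
    ("johnlovesmary", "fragmentation", T, T, T, T),
    ("johnlovesmary", "milestones", T, T, T, T),
    ("peergynt", "fragmentation", T, T, T, T),
    ("peergynt", "milestones", T, T, T, T),
    ("peterpaulhammer", "milestones", T, T, T, T),
    ("thoughtalice", "fragmentation", T, T, T, T),
    ("titwillow", "fragmentation", T, T, F, T),
    ("titwillow", "fragmentation", T, T, F, F),
    ("titwillow", "milestones", T, T, F, T),
]

COLUMNS = {"WF": 2, "V": 3, "N": 4, "M": 5}


def percentage(principle):
    """Share of rows that satisfy the principle, rounded to a whole percent."""
    column = COLUMNS[principle]
    passed = sum(row[column] for row in RESULTS)
    return round(100 * passed / len(RESULTS))


for principle in COLUMNS:
    print(principle, percentage(principle))
```

Eight of the twelve rows pass N and ten pass M, which gives the 67% and 83% reported above.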
9. Conclusions
• Converting XML documents with overlaps expressed via XML
workarounds is not a straightforward task
• We propose an abstract framework to address this issue, composed of
three consecutive steps
• FRETTA implements the third step of the conversion framework. It
enables one to convert any EARMARK document (that allows multiple
overlapping hierarchies at the same time) into one or more embedded
XML markup structures
• Future work:
✦ developing algorithms that autonomously select the workarounds to adopt in the
conversions
✦ integrating FRETTA in the broader framework for the semi-automatic and
round-trip conversion from any supported XML format into another