Embedding semantic annotations within texts: the FRETTA approach

Gioele Barabucci - barabucc@cs.unibo.it
Silvio Peroni - essepuntato@cs.unibo.it
Francesco Poggi - fpoggi@cs.unibo.it
Fabio Vitali - fabio@cs.unibo.it

http://creativecommons.org/licenses/by-sa/3.0

Outline




•   Conversion from an XML format into another

•   Overlapping markup

•   Abstract conversion framework

•   FRETTA

•   Evaluation

•   Conclusions
Converting XML vocabularies that use syntactic workarounds
•   Converting OpenOffice Writer documents (ODT) into Microsoft Word
    documents (DOCX), and vice versa, is not a straightforward operation

•   Converters exist and are included as core components of word processors

•   Those converters do not implement mechanisms for a full and effective document
    conversion, especially when particular features are needed, e.g., information tracking
    document changes occurring over time
What happens to markup

OpenOffice (ODT), without change tracking:

    <text:p>
        The beginning
        and the end.
    </text:p>

OpenOffice (ODT), with a tracked insertion:

    <text:tracked-changes>
        <text:changed-region text:id="S1">
            <text:insertion><office:change-info>
                <dc:creator>John Smith</dc:creator>
                <dc:date>2009-10-27T18:45:00</dc:date>
            </office:change-info></text:insertion>
        </text:changed-region>
    </text:tracked-changes>
    […]
    <text:p>The beginning and
        <text:change-start text:change-id="S1"/></text:p>
    <text:p>also
        <text:change-end text:change-id="S1"/>
        the end.</text:p>

Microsoft Word (DOCX), without change tracking:

    <w:p>
        <w:r>
            <w:t>
                The beginning
                and the end.
            </w:t>
        </w:r>
    </w:p>

Microsoft Word (DOCX), with a tracked insertion:

    <w:p>
        <w:pPr><w:rPr>
            <w:ins w:id="0" w:author="John Smith"
                 w:date="2009-10-27T18:50:00Z"/>
        </w:rPr></w:pPr>
        <w:r><w:t>The beginning and </w:t></w:r></w:p>
    <w:p>
        <w:ins w:id="1" w:author="John Smith"
            w:date="2009-10-27T18:50:00Z">
            <w:r><w:t>also </w:t></w:r></w:ins>
        <w:r><w:t>the end.</w:t></w:r></w:p>
Overlapping markup

•       Overlapping markup is needed when different markup items refer to the same
        document fragment but cannot be nested one inside the other
        Previous example in incorrect (not well-formed) XML
        <p>The beginning and <ins></p>
        <p>also </ins> the end</p>

        XML formalisation via workarounds
        <p>The beginning and <ins start="foo"/></p>
        <p>also <ins end="foo"/>the end</p>

•       Different techniques to embed overlapping structures in XML hierarchies:
    ✦     milestones: a pair of empty elements representing the start and the end tags, connected to each other by
          special attributes
    ✦     fragmentation: elements separated within the primary hierarchy and connected to each other by special
          attributes
    ✦     twin documents: each hierarchy is represented by a different document which contains the same textual
          content
    ✦     stand-off: places overlapping elements in a separate resource (e.g. another file) specifying the position
          (down to the individual character) of each start and end location within the main structure
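
The relation between two of these techniques can be sketched in a few lines of Python (used here only for brevity). The sketch reads the milestone version of the example above and recovers the same information in stand-off form, i.e. the plain text plus the character offsets of the overlapping element. The wrapping `<doc>` root and the function name `milestones_to_standoff` are assumptions made for illustration; the `start`/`end` attributes are those of the slide's example.

```python
import xml.etree.ElementTree as ET

# The slide's milestone example, wrapped in a hypothetical <doc> root
# so that the fragment is a well-formed document.
MILESTONE_XML = (
    '<doc>'
    '<p>The beginning and <ins start="foo"/></p>'
    '<p>also <ins end="foo"/>the end</p>'
    '</doc>'
)


def milestones_to_standoff(xml_text):
    """Return (plain_text, ranges), where ranges maps (tag, id) to (start, end) offsets."""
    root = ET.fromstring(xml_text)
    chunks = []   # text pieces in document order
    opened = {}   # (tag, id) -> offset of the start milestone
    ranges = {}   # (tag, id) -> (start offset, end offset)
    pos = 0       # current character offset in the plain text

    def add(text):
        nonlocal pos
        if text:
            chunks.append(text)
            pos += len(text)

    def walk(elem):
        # Empty milestone elements only mark a position in the text.
        if "start" in elem.attrib:
            opened[(elem.tag, elem.attrib["start"])] = pos
        elif "end" in elem.attrib:
            key = (elem.tag, elem.attrib["end"])
            ranges[key] = (opened.pop(key), pos)
        add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)

    walk(root)
    return "".join(chunks), ranges


text, ranges = milestones_to_standoff(MILESTONE_XML)
start, end = ranges[("ins", "foo")]
print(repr(text[start:end]))  # the text covered by the <ins> milestones
```

Once the element is a pair of offsets rather than a tag, it no longer has to fit inside the paragraph hierarchy, which is what makes the overlap explicit.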
Abstract conversion framework

Source: XML format 1 with overlapping workarounds (e.g., ODT + change tracking)
Target: XML format 2 with overlapping workarounds (e.g., DOCX + change tracking)

XML document, format 1
    -> Step 1: Identification of XML overlapping workarounds and creation of
       a document with explicit overlap
EARMARK document, format 1
    -> Step 2: Syntactic and semantic conversion from format 1 into format 2
EARMARK document, format 2
    -> Step 3: Linearisation into an XML document with overlapping workarounds
       (today's contribution)
XML document, format 2

EARMARK is a non-XML markup metalanguage used as the intermediate language for the conversion.
It allows markup structures to be organized both as trees and as generic graphs with no particular limitations.
FRETTA

 •   FRETTA (From EARMARK To Tag) is a general and extensible Java framework
     for expressing EARMARK documents in an embedded XML syntax

 •   Users that want to convert from EARMARK into XML document formats
     must indicate which workarounds are used in a certain target format

 •   FRETTA performs the requested conversion in four consecutive steps

EARMARK document -> workaround specification -> structural conversion -> semantic conversion -> linearisation -> XML document

     1. Workaround specification: the user specifies which workaround to use
        to represent an (EARMARK) overlapping element in XML
     2. Structural conversion: a purely structural conversion that produces a
        new EARMARK document in which overlapping elements are transformed
        according to the specified workarounds
     3. Semantic conversion: a conversion that may change the current
        structure of the EARMARK document according to how the target XML
        format handles the specified workarounds
     4. Linearisation: generation of the resulting XML tree with the
        requested workarounds
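
To make the last step more concrete, here is a minimal Python sketch of a linearisation that uses the fragmentation workaround. It is not FRETTA's API (FRETTA is a Java framework working on EARMARK documents); it only assumes a simplified model in which the text is a string, paragraphs are offset ranges, and one overlapping element is another offset range. The `<doc>` root, the `group` and `part` attributes, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def linearise_fragmentation(text, paragraphs, overlay, tag="ins"):
    """Emit one <p> per paragraph range, splitting the overlay at paragraph boundaries."""
    o_start, o_end = overlay
    out = []
    part = 0  # running number of the fragment, stored in the hypothetical 'part' attribute
    for p_start, p_end in paragraphs:
        # Portion of the overlay that falls inside this paragraph, if any.
        s, e = max(p_start, o_start), min(p_end, o_end)
        if s < e:
            part += 1
            body = (
                escape(text[p_start:s])
                + f'<{tag} group="g1" part="{part}">'
                + escape(text[s:e])
                + f"</{tag}>"
                + escape(text[e:p_end])
            )
        else:
            body = escape(text[p_start:p_end])
        out.append(f"<p>{body}</p>")
    return "<doc>" + "".join(out) + "</doc>"


text = "The beginning and also the end"
paragraphs = [(0, 18), (18, 30)]   # "The beginning and " | "also the end"
overlay = (14, 22)                 # "and also", crossing the paragraph boundary

xml_out = linearise_fragmentation(text, paragraphs, overlay)
print(xml_out)

# The result parses as XML and still carries exactly the original text.
root = ET.fromstring(xml_out)
```

The two `ins` fragments share the same `group` value, which plays the role of the "special attributes" that connect the pieces of a fragmented element.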
Evaluation

•       Comparing FRETTA's outputs against a set of twelve TEI
        documents (TEIDocs) written by markup experts

•       The evaluation took into account four different principles
    ✦     well-formedness (WF): whether the framework returns well-formed XML
          documents
    ✦     validity (V): whether the framework returns valid XML documents
          according to the particular target XML vocabulary
    ✦     naturalness (N): how much the XML documents returned by the
          framework are structurally similar to TEIDocs
    ✦     minimality (M): how much the amount of nodes (i.e., elements,
          attributes and text nodes) in the XML documents returned by
          the framework varies from TEIDocs

        document          workarounds      WF   V    N    M
        agrippine         fragmentation    ✓    ✓    ✓    ✓
        agrippine         milestones       ✓    ✓    ✓    ✓
        drivemycar        fragmentation    ✓    ✓    X    X
        johnlovesmary     fragmentation    ✓    ✓    ✓    ✓
        johnlovesmary     milestones       ✓    ✓    ✓    ✓
        peergynt          fragmentation    ✓    ✓    ✓    ✓
        peergynt          milestones       ✓    ✓    ✓    ✓
        peterpaulhammer   milestones       ✓    ✓    ✓    ✓
        thoughtalice      fragmentation    ✓    ✓    ✓    ✓
        titwillow         fragmentation    ✓    ✓    X    ✓
        titwillow         fragmentation    ✓    ✓    X    X
        titwillow         milestones       ✓    ✓    X    ✓

        100% of the documents are well-formed and valid
        67% remain natural (N) with respect to TEIDocs
        83% remain minimal (M) with respect to TEIDocs
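
The summary percentages follow directly from the table, and the node count behind minimality is easy to approximate. The Python sketch below transcribes the table and recomputes the shares; it also shows one possible way to count elements, attributes and text nodes. Ignoring whitespace-only text is an assumption of this sketch, the slides do not say how such nodes were treated, and the sample XML string is illustrative rather than taken from the evaluation corpus.

```python
import xml.etree.ElementTree as ET

# (document, workaround, WF, V, N, M), transcribed from the results table.
RESULTS = [
    ("agrippine", "fragmentation", True, True, True, True),
    ("agrippine", "milestones", True, True, True, True),
    ("drivemycar", "fragmentation", True, True, False, False),
    ("johnlovesmary", "fragmentation", True, True, True, True),
    ("johnlovesmary", "milestones", True, True, True, True),
    ("peergynt", "fragmentation", True, True, True, True),
    ("peergynt", "milestones", True, True, True, True),
    ("peterpaulhammer", "milestones", True, True, True, True),
    ("thoughtalice", "fragmentation", True, True, True, True),
    ("titwillow", "fragmentation", True, True, False, True),
    ("titwillow", "fragmentation", True, True, False, False),
    ("titwillow", "milestones", True, True, False, True),
]


def share(column):
    """Percentage of rows whose given column is True, rounded to an integer."""
    return round(100 * sum(row[column] for row in RESULTS) / len(RESULTS))


def count_nodes(xml_text):
    """Count elements, attributes and non-blank text nodes in an XML document."""
    root = ET.fromstring(xml_text)
    total = 0
    for elem in root.iter():
        total += 1 + len(elem.attrib)
        # elem.text precedes the first child; elem.tail follows the element itself.
        total += sum(1 for t in (elem.text, elem.tail) if t and t.strip())
    return total


summary = {name: share(col) for name, col in (("WF", 2), ("V", 3), ("N", 4), ("M", 5))}
print(summary)

# Illustrative input only: <p>, <ins>, one attribute and three text nodes.
sample_nodes = count_nodes('<p>The beginning <ins part="1">and</ins> the end</p>')
print(sample_nodes)
```

Comparing `count_nodes` for a FRETTA output and for the corresponding expert-written document would give a rough measure of how far the two are from each other in size.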
Conclusions


•       Converting XML documents with overlaps expressed via XML
        workarounds is not a straightforward task

•       We propose an abstract framework to address this issue, composed of
        three consecutive steps

•       FRETTA implements the third step of the conversion framework: it
        converts any EARMARK document (which may contain multiple
        overlapping hierarchies at the same time) into one or more embedded
        XML markup structures

•       Future work:
    ✦    developing algorithms that autonomously select the workarounds to adopt in the
         conversions
    ✦    integrating FRETTA into the broader framework for the semi-automatic, round-trip
         conversion from any supported XML format into another
Thanks for your attention

Embedding semantic annotations within texts: the FRETTA approach

  • 1. Embedding semantic annotations within texts: the FRETTA approach
      Gioele Barabucci - barabucc@cs.unibo.it
      Silvio Peroni - essepuntato@cs.unibo.it
      Francesco Poggi - fpoggi@cs.unibo.it
      Fabio Vitali - fabio@cs.unibo.it
      http://creativecommons.org/licenses/by-sa/3.0
  • 2. Outline
      • Conversion from an XML format into another
      • Overlapping markup
      • Abstract conversion framework
      • FRETTA
      • Evaluation
      • Conclusions
  • 3. Converting XML vocabularies that use syntactic workarounds
      • Converting OpenOffice Writer documents (ODT) into Microsoft Word documents (DOCX), and vice versa, is not a straightforward operation
      • Converters exist and are included as core components of word processors
      • Those converters do not implement mechanisms for a full and effective document conversion, especially when particular features are needed, e.g., tracking the changes made to a document over time
  • 4. What happens to markup
      OpenOffice (ODT), original paragraph:
          <text:p>
              The beginning
              and the end.
          </text:p>
      OpenOffice (ODT), with a tracked insertion:
          <text:tracked-changes>
              <text:changed-region text:id="S1">
                  <text:insertion><office:change-info>
                      <dc:creator>John Smith</dc:creator>
                      <dc:date>2009-10-27T18:45:00</dc:date>
                  </office:change-info></text:insertion>
              </text:changed-region>
          </text:tracked-changes>
          […]
          <text:p>The beginning and
              <text:change-start text:change-id="S1"/></text:p>
          <text:p>also <text:change-end text:change-id="S1"/> the end.</text:p>
      Microsoft Word (DOCX), original paragraph:
          <w:p>
              <w:r>
                  <w:t>
                      The beginning
                      and the end.
                  </w:t>
              </w:r>
          </w:p>
      Microsoft Word (DOCX), with a tracked insertion:
          <w:p>
              <w:pPr><w:rPr>
                  <w:ins w:id="0" w:author="John Smith"
                      w:date="2009-10-27T18:50:00Z"/>
              </w:rPr></w:pPr>
              <w:r><w:t>The beginning and </w:t></w:r></w:p>
          <w:p>
              <w:ins w:id="1" w:author="John Smith"
                  w:date="2009-10-27T18:50:00Z">
                  <w:r><w:t>also </w:t></w:r></w:ins>
              <w:r><w:t>the end.</w:t></w:r></w:p>
  • 5. Overlapping markup
      • Overlapping markup is needed when different markup items refer to the same document fragment
      Previous example in incorrect XML:
          <p>The beginning and <ins></p>
          <p>also </ins> the end</p>
      XML formalisation via workarounds:
          <p>The beginning and <ins start="foo"/></p>
          <p>also <ins end="foo"/>the end</p>
      • Different techniques to embed overlapping structures in XML hierarchies:
          ✦ milestones: a pair of empty elements representing the start and the end tags, connected to each other by special attributes
          ✦ fragmentation: elements separated within the primary hierarchy and connected to each other by special attributes
          ✦ twin documents: each hierarchy is represented by a different document which contains the same textual content
          ✦ stand-off: places overlapping elements in a separate resource (e.g. another file) specifying the position (down to the individual character) of each start and end location within the main structure
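The milestone workaround on this slide can be reproduced with a short Python sketch. It is an illustration only, not FRETTA code: the function names, the `<doc>` root element, and the representation of the overlapping element as a single character range (in the spirit of stand-off markup) are all assumptions made for the example. It first shows that the overlapping version is rejected by an XML parser, then rebuilds the milestone version from the range.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml):
    """Return True if the string parses as XML."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


def with_milestones(paragraphs, start, end, name="ins", link="foo"):
    """Wrap each paragraph in <p> and mark the character range [start, end)
    of the concatenated text with two empty milestone elements.
    Assumes 0 <= start <= end <= total text length."""
    out, offset = [], 0
    pending = [(start, "start"), (end, "end")]  # boundaries still to emit
    for text in paragraphs:
        pieces, cursor = [], 0
        # Emit every boundary that falls inside (or at the end of) this paragraph.
        while pending and pending[0][0] - offset <= len(text):
            pos, attr = pending.pop(0)
            local = pos - offset
            pieces.append(escape(text[cursor:local]))
            pieces.append(f'<{name} {attr}="{link}"/>')
            cursor = local
        pieces.append(escape(text[cursor:]))
        out.append("<p>" + "".join(pieces) + "</p>")
        offset += len(text)
    # Hypothetical root element, needed only so the result is a single tree.
    return "<doc>" + "".join(out) + "</doc>"


overlapping = "<doc><p>The beginning and <ins></p><p>also </ins> the end</p></doc>"
milestones = with_milestones(["The beginning and ", "also the end"], 18, 23)
print(is_well_formed(overlapping), is_well_formed(milestones))
print(milestones)
```

The range 18 to 23 covers the inserted word "also " at the start of the second paragraph, so the start milestone lands at the end of the first paragraph and the end milestone after "also ", as on the slide.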
  • 6. Abstract conversion framework
      Input: XML format 1 with overlapping workarounds (e.g., ODT + change tracking)
      Output: XML format 2 with overlapping workarounds (e.g., DOCX + change tracking)
      • Step 1: Identification of XML overlapping workarounds and creation of a document with explicit overlap (XML document in format 1 → EARMARK document in format 1)
      • Step 2: Syntactic and semantic conversion from format 1 into format 2 (EARMARK document in format 1 → EARMARK document in format 2)
      • Step 3: Linearisation into an XML document with overlapping workarounds (EARMARK document in format 2 → XML document in format 2). This step is today's contribution.
      EARMARK is a non-XML markup metalanguage used as intermediate language for the conversion. It allows markup structures to be organized both as trees and as generic graphs with no particular limitations.
  • 7. FRETTA
      • FRETTA (From EARMARK To Tag) is a general and extensible Java framework for expressing EARMARK documents in an embedded XML syntax
      • Users who want to convert from EARMARK into XML document formats must indicate which workarounds are used in a given target format
      • FRETTA performs the requested conversion in four consecutive steps, going from an EARMARK document to an XML document:
          ✦ workaround specification: the user specifies which workaround to use to represent an (EARMARK) overlapping element in XML
          ✦ structural conversion: purely structural conversion that produces a new EARMARK document in which overlapping elements are transformed appropriately according to the specified workarounds
          ✦ semantic conversion: semantic conversion that may change the current structure of the EARMARK document according to how the target XML format handles the specified workarounds
          ✦ linearisation: generation of the resulting XML tree with the requested workarounds
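The four steps can be pictured with a toy Python pipeline. This is a sketch under stated assumptions and does not reflect FRETTA's actual Java API or the EARMARK data model: the document is reduced to a list of paragraph strings plus character ranges, the only workaround handled is fragmentation, the `part` linking attribute, the `<doc>` root, and the `ins` to `add` rename are hypothetical, and ranges may cross paragraph boundaries but not each other.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Step 1, workaround specification: hypothetical user-supplied mapping.
SPEC = {"ins": "fragmentation"}


def structural(paragraphs, ranges, spec):
    """Step 2: split each (name, start, end) range into per-paragraph fragments."""
    fragments = []
    for name, start, end in ranges:
        if spec[name] != "fragmentation":
            raise NotImplementedError(spec[name])
        offset = 0
        for index, text in enumerate(paragraphs):
            lo = max(start, offset) - offset
            hi = min(end, offset + len(text)) - offset
            if lo < hi:  # the range covers part of this paragraph
                fragments.append((index, lo, hi, name))
            offset += len(text)
    return fragments


def semantic(fragments, rename):
    """Step 3: adapt element names to the target vocabulary (a plain rename here)."""
    return [(i, lo, hi, rename.get(name, name)) for i, lo, hi, name in fragments]


def linearise(paragraphs, fragments, link="foo"):
    """Step 4: emit the XML tree; fragments of one range share the same link value."""
    out = []
    for index, text in enumerate(paragraphs):
        body, cursor = [], 0
        for i, lo, hi, name in fragments:
            if i == index:
                body.append(escape(text[cursor:lo]))
                body.append(f'<{name} part="{link}">{escape(text[lo:hi])}</{name}>')
                cursor = hi
        body.append(escape(text[cursor:]))
        out.append("<p>" + "".join(body) + "</p>")
    return "<doc>" + "".join(out) + "</doc>"


paragraphs = ["The beginning and ", "also the end"]
fragments = structural(paragraphs, [("ins", 14, 23)], SPEC)
xml = linearise(paragraphs, semantic(fragments, {"ins": "add"}))
ET.fromstring(xml)  # raises ParseError if the result were not well-formed
print(xml)
```

Keeping the steps as separate functions mirrors the separation described on the slide: the structural step knows only about workarounds, while the rename in the semantic step is the only place that depends on the target vocabulary.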
  • 8. Evaluation
      • FRETTA's outputs were compared against a set of twelve TEI documents (TEIDocs) written by markup experts
      • The evaluation took into account four different principles:
          ✦ well-formedness (WF): whether the framework returns well-formed XML documents
          ✦ validity (V): whether the framework returns valid XML documents according to the particular target XML vocabulary
          ✦ naturalness (N): how structurally similar the XML documents returned by the framework are to TEIDocs
          ✦ minimality (M): how much the number of nodes (i.e., elements, attributes and text nodes) in the XML documents returned by the framework varies from TEIDocs

      document          workarounds     WF  V   N   M
      agrippine         fragmentation   ✓   ✓   ✓   ✓
      agrippine         milestones      ✓   ✓   ✓   ✓
      drivemycar        fragmentation   ✓   ✓   X   X
      johnlovesmary     fragmentation   ✓   ✓   ✓   ✓
      johnlovesmary     milestones      ✓   ✓   ✓   ✓
      peergynt          fragmentation   ✓   ✓   ✓   ✓
      peergynt          milestones      ✓   ✓   ✓   ✓
      peterpaulhammer   milestones      ✓   ✓   ✓   ✓
      thoughtalice      fragmentation   ✓   ✓   ✓   ✓
      titwillow         fragmentation   ✓   ✓   X   ✓
      titwillow         fragmentation   ✓   ✓   X   X
      titwillow         milestones      ✓   ✓   X   ✓

      • 100% of the documents are well-formed and valid
      • 67% are natural (N) with respect to TEIDocs
      • 83% are minimal (M) with respect to TEIDocs
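The summary percentages can be re-derived from the twelve table rows, as the Python sketch below does. The `count_nodes` helper is only one plausible reading of the node count behind minimality (elements, attributes and non-whitespace text nodes); the slides do not say how the authors computed or compared it, so treat that function as an assumption.

```python
import xml.etree.ElementTree as ET

T, F = True, False
# (document, workaround, WF, V, N, M), copied from the evaluation table.
ROWS = [
    ("agrippine", "fragmentation", T, T, T, T),
    ("agrippine", "milestones", T, T, T, T),
    ("drivemycar", "fragmentation", T, T, F, F),
    ("johnlovesmary", "fragmentation", T, T, T, T),
    ("johnlovesmary", "milestones", T, T, T, T),
    ("peergynt", "fragmentation", T, T, T, T),
    ("peergynt", "milestones", T, T, T, T),
    ("peterpaulhammer", "milestones", T, T, T, T),
    ("thoughtalice", "fragmentation", T, T, T, T),
    ("titwillow", "fragmentation", T, T, F, T),
    ("titwillow", "fragmentation", T, T, F, F),
    ("titwillow", "milestones", T, T, F, T),
]


def share(rows, column):
    """Rounded percentage of rows whose flag at index `column` is True."""
    return round(100 * sum(row[column] for row in rows) / len(rows))


def count_nodes(xml):
    """Count elements, attributes and non-whitespace text nodes (assumed metric)."""
    total = 0
    for element in ET.fromstring(xml).iter():
        total += 1 + len(element.attrib)
        total += bool(element.text and element.text.strip())
        total += bool(element.tail and element.tail.strip())
    return total


summary = {name: share(ROWS, index) for index, name in enumerate(("WF", "V", "N", "M"), start=2)}
print(summary)  # {'WF': 100, 'V': 100, 'N': 67, 'M': 83}
print(count_nodes('<p>also <ins end="foo"/>the end</p>'))  # 5
```

Eight of the twelve rows pass N and ten pass M, which gives the 67% and 83% figures reported on the slide.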
  • 9. Conclusions
      • Converting XML documents with overlaps expressed via XML workarounds is not a straightforward task
      • We propose an abstract framework to address this issue, composed of three consecutive steps
      • FRETTA implements the third step of the conversion framework. It converts any EARMARK document (which may contain multiple overlapping hierarchies at the same time) into one or more embedded XML markup structures
      • Future work:
          ✦ developing algorithms that autonomously select the workarounds to adopt in the conversions
          ✦ integrating FRETTA in the broader framework for the semi-automatic and round-trip conversion from any supported XML format into another
  • 10. Thanks for your attention