Provenance at the Dagstuhl seminar on Semantic Data Management, April 2012

Paolo Missier, José Manuel Gómez-Pérez, Satya Sahoo

SWPM’12, June 2012

Dagstuhl repost @ SWPM 12 - P.Missier



Previously at Dagstuhl...
Much provenance, not much semantics
- final report to be published soon
Interim Seminar wiki




The provenance day @Dagstuhl
Tuesday (main topic: provenance, person in charge: Grigoris Antoniou)
Session 1. Provenance in semantic data management
 ■ Tutorial: Provenance: some useful concepts (Paul, 20 minutes)
 ■ An introduction to the W3C PROV family of specs (Paul Groth / Luc Moreau / Paolo Missier / Olaf, 30 minutes)
 ■ Presentations from other attendees
    ■ Manuel Salvadores: "Access Control in SPARQL: The BioPortal Use Case" (15-20 min)
    ■ Bryan Thompson: Simple and effective provenance mechanism for triples or quads based on composition

Session 2. Presentations
 ■ Kerry Taylor: Reaping the rewards: what is the provenance saying? (20 min)
 ■ Martin Theobald: Reasoning in Uncertain RDF Knowledge Bases with Lineage (20 min)
 ■ James Cheney: Database Wiki and provenance for SPARQL updates (10-15 min)

Session 3. Working groups and wrap-up
 ■ Objective: obtain roadmaps about typical problems on provenance
 ■ Working groups
    ■ Frank Van Harmelen: Provenance and scalability
    ■ Paolo Missier: Provenance-specific benchmarks and corpora
    ■ José Manuel Gómez-Pérez: Novel usages of provenance information
    ■ Norbert Fuhr: Provenance and uncertainty




WG: Novel usages of provenance information (José Manuel Gómez-Pérez)
• Data integration
  – assisted analysis, exploration along different dimensions of quality
  – SmartCities, OpenStreetMap
• Analytics in social networks
  – detect cool members in social networks
• Provenance diff (hard in general)
• Billing / Privacy
  – emerging pay-per-query models
• Credit, attribution, citation and licensing
• Result reproducibility (e.g., Executable Paper Challenge)
• Determining the quality of a report generated by third parties for an organisation (e.g., a Government report)




WG: creating provenance-specific benchmarks
• Another one of the spontaneous Working Group activities at Dagstuhl
• Not strictly “semantic”
  – but PROV-RDF is one of the expected encodings
• Led by Satya Sahoo and Paolo Missier
• A community initiative

Goal:

To collect a corpus of reference provenance traces
from multiple contributors
from multiple domains
and make it available as a community resource




Collecting reference provenance datasets
Why:
• to better understand actual usages of provenance
• for analysing properties of provenance graphs
  – patterns in graphs
• to create a level playing field for performance comparison
  – storage, compression methods
  – query models, query processing
     • SPARQL
     • Datalog
     • Graph query languages
• to test algorithms that prove interesting hypotheses
  – “prov(D) contains valid indicators for quality(D)”

How:
• By collecting submissions from the community
• By generating synthetic provenance

What: submissions
Submission:
  - a collection of traces
  - a collection of queries
hopefully from a variety of different domains

Interesting properties of each trace:
• Graph structure -- regularity, recognizable patterns
• Graph size
• Scaling factors
• What it is to be used for

Submission:
• Diversity of structure and size within the family
• Number of traces
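The trace properties listed above (graph size, structure) can be made concrete with a small sketch. It assumes, purely for illustration, that a trace is available as a list of (subject, relation, object) edges; the toy trace and the `trace_profile` helper are hypothetical and not part of any submission format discussed at the seminar.

```python
from collections import Counter

# Hypothetical toy trace: (subject, PROV-style relation, object) edges.
TRACE = [
    ("e2", "wasGeneratedBy", "a1"),
    ("a1", "used", "e1"),
    ("e3", "wasGeneratedBy", "a2"),
    ("a2", "used", "e2"),
    ("a2", "used", "e1"),
]


def trace_profile(edges):
    """Summarise the size and a few structural features of one trace."""
    nodes = {n for s, _, t in edges for n in (s, t)}
    out_degree = Counter(s for s, _, _ in edges)
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        # How often each relation type occurs in the trace.
        "relations": dict(Counter(r for _, r, _ in edges)),
        "max_out_degree": max(out_degree.values(), default=0),
    }


profile = trace_profile(TRACE)
```

Profiles of this kind, computed per trace, would give a first indication of the diversity of structure and size within a submission.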



What: Trace format
• The PROV assumptions:
  – uptake: PROV will be successful (!)
  – interoperability: PROV will be sufficiently expressive to provide interoperability

• Thus, expecting a PROV encoding for submissions seems reasonable

• Advantages:
  – tools are being built to parse, visualize, validate, analyse PROV-compliant traces
  – multiple encodings available
      • especially good if RDF is your thing
• Issues:
  – conversion: existing traces are not natively PROV
  – is there a need to dereference data at the end of URIs?
  – licensing: multiple tiers? specific to each dataset?




What: Queries
• Hypothesis: Some queries are generic, in the sense that they apply across multiple collections of traces

Single trace queries:
• Reachability queries over data and activity dependencies
   – backwards (diagnosis)
   – forwards (impact analysis)
• “chains of responsibility” (delegation)

Aggregation queries:
• production/usages of data, activities across traces
   – assumes uniformity within a collection

• Do graph mining problems apply? Do they have interesting interpretations?
   – e.g. subgraph discovery
• Feature extraction for learning, mining

• Pairwise trace comparison:
   – “earliest divergence” queries between pairs of “nearly isomorphic” traces
   – differencing (complex)
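As an illustration of the single-trace reachability queries above, here is a minimal sketch using breadth-first search. It assumes a trace reduced to hypothetical (dependent, dependency) pairs; the node names and helper functions are invented for the example and do not come from any benchmark submission.

```python
from collections import defaultdict, deque

# Hypothetical dependency edges: the first node depends on the second.
DEPENDS_ON = [
    ("e2", "a1"), ("a1", "e1"),
    ("e3", "a2"), ("a2", "e2"),
    ("e4", "a3"), ("a3", "e0"),
]


def build_index(edges):
    """Index the edges in both directions."""
    backward = defaultdict(set)  # node -> nodes it depends on
    forward = defaultdict(set)   # node -> nodes that depend on it
    for dependent, dependency in edges:
        backward[dependent].add(dependency)
        forward[dependency].add(dependent)
    return backward, forward


def reachable(start, adjacency):
    """Return every node reachable from start (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


backward, forward = build_index(DEPENDS_ON)
lineage_of_e3 = reachable("e3", backward)  # backwards: diagnosis
impact_of_e1 = reachable("e1", forward)    # forwards: impact analysis
```

The same traversal serves both directions; only the adjacency index changes, which is one reason such queries are plausible candidates for a generic benchmark workload.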
A provenance repository
• If traces are submitted in one of the PROV standard encodings, then the P-rep can provide validation services upon admission

• PROV is expected to support the following encodings:
  – PROV-N -- the technology-neutral notation
  – RDF -- the main official encoding
  – XML -- unofficial XSD available
  – JSON -- unofficial
  – (Datalog? -- even more unofficial but syntactically very close to PROV-N)

Available validations:
• Syntax:
  – PROV-N syntax
  – XML schema validation
• Consistency:
  – validation wrt PROV-constraints

[Figure: PROV-N is converted by N 2 JSON, N 2 RDF and N 2 XML into PROV-JSON, PROV-RDF and PROV-XML respectively]
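To give a feel for what a PROV-N submission might look like, the sketch below serialises a tiny trace into PROV-N-style statements. It assumes the PROV-N notation as published by the W3C (the `-` marker stands for an omitted time); the `ex` prefix, the namespace URI and the `to_prov_n` helper are hypothetical, and the output is not validated against the PROV constraints.

```python
def to_prov_n(entities, activities, used, generated,
              prefix="ex", namespace="http://example.org/"):
    """Serialise a small trace as PROV-N-style text (illustrative only)."""
    lines = ["document", f"  prefix {prefix} <{namespace}>"]
    lines += [f"  entity({prefix}:{e})" for e in entities]
    lines += [f"  activity({prefix}:{a})" for a in activities]
    # used(activity, entity, time); '-' marks an unspecified time.
    lines += [f"  used({prefix}:{a}, {prefix}:{e}, -)" for a, e in used]
    # wasGeneratedBy(entity, activity, time)
    lines += [f"  wasGeneratedBy({prefix}:{e}, {prefix}:{a}, -)"
              for e, a in generated]
    lines.append("endDocument")
    return "\n".join(lines)


document_text = to_prov_n(
    entities=["e1", "e2"],
    activities=["a1"],
    used=[("a1", "e1")],
    generated=[("e2", "a1")],
)
print(document_text)
```

A repository could run a syntax check on text like this at admission time, before attempting any conversion to the other encodings.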



Low-hanging fruit
• Wikipedia history pages
  – dumps freely available
  – or, through the Wikipedia REST API
• OpenStreetMap history pages
  – very similar structure

• ...any other?
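One way a page history could be read as provenance is sketched below. The revision records are hypothetical and their field names (`rev_id`, `parent_id`, `user`) are assumptions made for the example, not the actual schema of the Wikipedia dumps or API; the mapping to `wasDerivedFrom` and `wasAttributedTo` is one plausible choice among several.

```python
# Hypothetical revision records for a single page, oldest first.
REVISIONS = [
    {"rev_id": 101, "parent_id": None, "user": "alice"},
    {"rev_id": 102, "parent_id": 101, "user": "bob"},
    {"rev_id": 103, "parent_id": 102, "user": "alice"},
]


def revisions_to_prov(revisions):
    """Map each revision to PROV-style (subject, relation, object) triples."""
    statements = []
    for rev in revisions:
        entity = f"rev:{rev['rev_id']}"
        # Every revision is attributed to the user who saved it.
        statements.append((entity, "wasAttributedTo", f"user:{rev['user']}"))
        # Every revision except the first derives from its parent.
        if rev["parent_id"] is not None:
            statements.append(
                (entity, "wasDerivedFrom", f"rev:{rev['parent_id']}")
            )
    return statements


statements = revisions_to_prov(REVISIONS)
```

Because OpenStreetMap histories have a very similar structure, the same mapping could presumably be reused there with different field names.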




Can we learn from similar initiatives?
• Well-established repositories for testing Machine Learning methods
  – the UCI Machine Learning Repository
  – the KDD Cup datasets
  – ... and more

• “Building better RDF benchmarks”: Kavitha Srinivas @Dagstuhl
  – DBpedia, UniProt -- large but no representative query workload
  – YAGO: Wikipedia <-> WordNet, 8 queries
  – Barton Library, 7 queries
  – Linked Sensor Dataset, no queries
  – TPC-H as RDF
  – Berlin SPARQL Benchmark (BSBM), 12 queries + mixes
  – Lehigh University Benchmark (LUBM), 14 queries
  – SP2Bench (DBLP), 12 queries

  – Original approach:
     • Turn every dataset into a benchmark
     • by editing the dataset to enforce measures of
        – Coverage and Coherence

WG: Provenance and uncertainty (Norbert Fuhr)
• Uncertainty in the data
  – Sensor data, Customer reviews
• Issues
  – Reliability (“is this the original painting?”)
  – Authenticity
• Sources of uncertain provenance
  – Information extraction / NLP methods
  – Human errors
  – Inferences
  – Instruments
• Challenges
  – We need a data model for uncertainty in provenance
     • probabilistic dependency relations
  – Explanation of the derivation of uncertain results
• Limitations
  – Hard rules vs soft rules
  – Knowledge acquisition process of those rules
  – provenance incompleteness vs uncertainty
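A very rough sketch of what "probabilistic dependency relations" could mean in practice: each dependency edge carries a probability, and the confidence in a derivation chain is the product of its edge probabilities. This assumes the edges are statistically independent, which is a strong simplification chosen only for illustration; the edge values and helper names are hypothetical and this is not a model proposed by the working group.

```python
# Hypothetical probabilities attached to (dependent, dependency) edges.
EDGE_PROBABILITY = {
    ("report", "extraction"): 0.9,      # e.g. an NLP extraction step
    ("extraction", "sensor_log"): 0.5,  # e.g. an unreliable instrument
}


def chain_confidence(chain, edge_probability):
    """Multiply edge probabilities along a chain (assumes independence)."""
    confidence = 1.0
    for edge in zip(chain, chain[1:]):
        confidence *= edge_probability[edge]
    return confidence


def explain(chain, edge_probability):
    """List each step of the chain with its probability, as an explanation."""
    return [
        f"{a} depends on {b} with p={edge_probability[(a, b)]}"
        for a, b in zip(chain, chain[1:])
    ]


chain = ["report", "extraction", "sensor_log"]
confidence = chain_confidence(chain, EDGE_PROBABILITY)
explanation = explain(chain, EDGE_PROBABILITY)
```

Even this toy version shows why explanation matters: the overall confidence alone does not reveal which step in the derivation is the weak one.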
13
                                        •

Mais conteúdo relacionado

Destaque

Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Your data won’t stay smart forever:exploring the temporal dimension of (big ...Your data won’t stay smart forever:exploring the temporal dimension of (big ...
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Paolo Missier
 

Destaque (10)

Invited talk @ DCC09 workshop
Invited talk @ DCC09 workshopInvited talk @ DCC09 workshop
Invited talk @ DCC09 workshop
 
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
 
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
 
Ipaw12 datalog paper talk
Ipaw12 datalog paper talkIpaw12 datalog paper talk
Ipaw12 datalog paper talk
 
ProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphsProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphs
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBT
 
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Your data won’t stay smart forever:exploring the temporal dimension of (big ...Your data won’t stay smart forever:exploring the temporal dimension of (big ...
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07
 

Semelhante a SWPM12 report on the dagstuhl seminar on Semantic Data Management

Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSync
nisohq
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Language
butest
 
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Nolan Nichols
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
Xavier Llorà
 
Benchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - ParticipationBenchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - Participation
jdbess
 
Provenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingProvenance Management to Enable Data Sharing
Provenance Management to Enable Data Sharing
University of Arizona
 

Semelhante a SWPM12 report on the dagstuhl seminar on Semantic Data Management (20)

Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
Discovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social MediaDiscovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social Media
 
Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSync
 
Resource Sync - Introduction
Resource Sync - IntroductionResource Sync - Introduction
Resource Sync - Introduction
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall Forum
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Language
 
OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objects
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
 
myExperiment and the Rise of Social Machines
myExperiment and the Rise of Social MachinesmyExperiment and the Rise of Social Machines
myExperiment and the Rise of Social Machines
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Benchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - ParticipationBenchmark Tutorial -- IV - Participation
Benchmark Tutorial -- IV - Participation
 
The Economics of Data Sharing
The Economics of Data SharingThe Economics of Data Sharing
The Economics of Data Sharing
 
West coastrollout
West coastrolloutWest coastrollout
West coastrollout
 
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 Unexpected Challenges in Large Scale Machine Learning by Charles Parker Unexpected Challenges in Large Scale Machine Learning by Charles Parker
Unexpected Challenges in Large Scale Machine Learning by Charles Parker
 
Stream Reasoning: State of the Art and Beyond
Stream Reasoning: State of the Art and BeyondStream Reasoning: State of the Art and Beyond
Stream Reasoning: State of the Art and Beyond
 
Michener Plenary PPSR2012
Michener Plenary PPSR2012Michener Plenary PPSR2012
Michener Plenary PPSR2012
 
Provenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingProvenance Management to Enable Data Sharing
Provenance Management to Enable Data Sharing
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Data
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in Chennai
 

Mais de Paolo Missier

Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

Mais de Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Último

The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 

Último (20)

Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 

SWPM12 report on the Dagstuhl seminar on Semantic Data Management

  • 1. Provenance at the Dagstuhl seminar on Semantic Data Management, April 2012
    Paolo Missier, Jose Manuel Gómez-Perez, Satya Sahoo
    SWPM’12, June 2012
    Dagstuhl repost @ SWPM 12 - P.Missier
  • 2. Previously at Dagstuhl...
    Much provenance, not much semantics
    - final report to be published soon
    Interim Seminar wiki
  • 3. The provenance day @Dagstuhl
    Tuesday (main topic: provenance, person in charge: Grigoris Antoniou)
    Session 1. Provenance in semantic data management
      ■ Tutorial: Provenance some useful concepts (Paul, 20 minutes)
      ■ An introduction to the W3C PROV family of specs (Paul Groth / Luc Moreau / Paolo Missier / Olaf, 30 minutes)
      ■ Presentations from other attendees.
        ■ Manuel Salvadores: "Access Control in SPARQL: The BioPortal Use Case" (15-20 min)
        ■ Bryan Thompson: Simple and effective provenance mechanism for triples or quads based on composition
    Session 2. Presentations
      ■ Kerry Taylor: Reaping the rewards: what is the provenance saying? (20 min)
      ■ Martin Theobald: Reasoning in Uncertain RDF Knowledge Bases with Lineage (20 min)
      ■ James Cheney: Database Wiki and provenance for SPARQL updates (10-15 min).
    Session 3. Working groups and wrap-up
      ■ Objective: obtain roadmaps about typical problems on provenance
      ■ Working groups
        ■ Frank Van Harmelen: Provenance and scalability
        ■ Paolo Missier: Provenance-specific benchmarks and corpora
        ■ José Manuel Gómez-Pérez: Novel usages of provenance information
        ■ Norbert Fuhr: Provenance and uncertainty
  • 4. WG: Novel usages of provenance information (José Manuel Gómez-Pérez)
    • Data integration
      – assisted analysis, exploration along different dimensions of quality
      – SmartCities, OpenStreetMap
    • Analytics in social networks
      – detect cool members in social networks
    • Provenance diff (hard in general)
    • Billing / Privacy
      – emerging pay-per-query models
    • Credit, attribution, citation and licensing
    • Result reproducibility (e.g., Executable Paper Challenge)
    • Determining the quality of a report generated by third parties for an organisation (e.g., a Government report)
  • 5. WG: creating provenance-specific benchmarks
    • Another one of the spontaneous Working Group activities at Dagstuhl
    • Not strictly “semantic”
      – but PROV-RDF is one of the expected encodings
    • Led by Satya Sahoo, PM
    • A community initiative
    Goal: To collect a corpus of reference provenance traces from multiple contributors from multiple domains and make it available as a community resource
  • 6. Collecting reference provenance datasets
    Why:
    • to better understand actual usages of provenance
    • for analysing properties of provenance graphs
      – patterns in graphs
    • to create a level playing field for performance comparison
      – storage, compression methods
      – query models, query processing
        • SPARQL
        • Datalog
        • Graph query languages
    • to test algorithms that prove interesting hypotheses
      – “prov(D) contains valid indicators for quality(D)”
    How:
    • By collecting submissions from the community
    • By generating synthetic provenance
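The "generating synthetic provenance" option could look roughly like the sketch below. It is only an illustration under stated assumptions: a trace is taken to be an acyclic graph of entities and activities linked by the PROV relations `used` and `wasGeneratedBy`, and the function name `synthetic_trace`, the identifier scheme (`e0`, `a1`, ...) and the parameters are hypothetical, not part of the working group's proposal.

```python
import random


def synthetic_trace(n_activities, max_inputs=3, seed=None):
    """Return a random provenance trace as (subject, relation, object) edges.

    Hypothetical generator: each activity uses between 1 and `max_inputs`
    entities that already exist and generates exactly one new entity,
    so the resulting graph is acyclic by construction.
    """
    rng = random.Random(seed)
    entities = ["e0"]  # a single source entity with no recorded provenance
    edges = []
    for i in range(1, n_activities + 1):
        activity = f"a{i}"
        k = rng.randint(1, min(max_inputs, len(entities)))
        for used in rng.sample(entities, k):
            edges.append((activity, "used", used))
        output = f"e{i}"
        edges.append((output, "wasGeneratedBy", activity))
        entities.append(output)
    return edges


if __name__ == "__main__":
    for edge in synthetic_trace(3, seed=42):
        print(edge)
```

Varying `n_activities` and `max_inputs` gives simple control over graph size and fan-in. A realistic generator would also need to reproduce domain-specific patterns.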
  • 7. What: submissions
    Submission:
    - a collection of traces
    - a collection of queries
    hopefully from a variety of different domains
    • Interesting properties of each trace:
      • Graph structure -- regularity, recognizable patterns
      • Graph size
      • Scaling factors
      • what it is to be used for
    Submission:
    • Diversity of structure and size within the family
    • Numerosity of traces
  • 8. What: Traces format
    • The PROV assumptions:
      – uptake: PROV will be successful (!)
      – interoperability: PROV will be sufficiently expressive to provide interoperability
    • Thus, expecting PROV encoding for submissions seems reasonable
    • Advantages:
      – tools are being built to parse, visualize, validate, analyse PROV-compliant traces
      – multiple encodings available
        • especially good if RDF is your thing
    • Issues:
      – Conversion: existing traces are not natively PROV
      – is there a need to dereference data at the end of URIs?
      – licensing: multiple tiers? specific to each dataset?
  • 9. What: Queries
    • Hypothesis: Some queries are generic, in the sense that they apply across multiple collections of traces
    Single trace queries:
    • Reachability queries over data and activity dependencies
      – backwards (diagnosis)
      – forwards (impact analysis)
    • “chains of responsibility” (delegation)
    Aggregation queries:
    • production/usages of data, activities across traces
      – assumes uniformity within a collection
    • Do graph mining problems apply? Do they have interesting interpretations?
      – e.g. subgraph discovery
    • Feature extraction for learning, mining
    • Pairwise trace comparison:
      – “earliest divergence” queries between pairs of "nearly isomorphic" traces
      – differencing (complex)
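The two single-trace reachability queries can be illustrated with a plain breadth-first traversal. This is a minimal sketch, assuming a trace is given as a list of hypothetical `(dependent, dependency)` pairs. The function name `reachable` and the node names are made up for illustration, and a real benchmark would express such queries in SPARQL, Datalog or a graph query language instead.

```python
from collections import defaultdict, deque


def reachable(edges, start, direction="backward"):
    """Return every node reachable from `start` in a dependency graph.

    `edges` holds (dependent, dependency) pairs.
    "backward" follows dependent -> dependency (diagnosis);
    "forward" follows dependency -> dependent (impact analysis).
    """
    if direction not in ("backward", "forward"):
        raise ValueError("direction must be 'backward' or 'forward'")
    adjacency = defaultdict(set)
    for dependent, dependency in edges:
        if direction == "backward":
            adjacency[dependent].add(dependency)
        else:
            adjacency[dependency].add(dependent)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


if __name__ == "__main__":
    trace = [("report", "analysis"), ("analysis", "raw"), ("chart", "analysis")]
    print(sorted(reachable(trace, "report")))          # what did the report depend on?
    print(sorted(reachable(trace, "raw", "forward")))  # what is affected if raw changes?
```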
  • 10. A provenance repository
    • If traces are submitted in one of the PROV standard encodings, then the P-rep can provide validation services upon admission
    • PROV is expected to support the following encodings:
      – PROV-N -- the technology-neutral notation
      – RDF -- the main official encoding
      – XML -- unofficial XSD available
      – JSON -- unofficial
      – (Datalog? -- even more unofficial but syntactically very close to PROV-N)
    Available validations:
    • Syntax:
      – PROV-N syntax
      – XML schema validation
    • Consistency:
      – validation wrt PROV-constraints
    [Diagram: PROV-N converted to PROV-JSON, PROV-RDF and PROV-XML; conversions labelled N2JSON, N2RDF, N2XML]
  • 11. Low-hanging fruits
    • Wikipedia history pages
      – dumps freely available
      – or, through the Wikipedia REST API
    • OpenStreetMap history pages
      – very similar structure
    • ...any other?
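As an illustration of why page histories are low-hanging fruit, the sketch below turns a list of revision records into PROV-style statements. It makes no network call and rests on assumptions: the input is a list of dictionaries with `revid`, `parentid` and `user` keys (modelled on what a wiki revision listing typically returns, with `parentid` equal to 0 for a first revision), and the `rev:` and `user:` identifier prefixes and the function name `revisions_to_prov` are hypothetical. Only the relation names `wasDerivedFrom` and `wasAttributedTo` come from PROV.

```python
def revisions_to_prov(revisions):
    """Map revision records to (subject, relation, object) PROV-style statements.

    Each revision becomes an entity that is derived from its parent revision
    (when one exists) and attributed to the editing user (when known).
    """
    statements = []
    for rev in revisions:
        entity = f"rev:{rev['revid']}"
        if rev.get("parentid"):  # 0 or missing means there is no parent revision
            statements.append((entity, "wasDerivedFrom", f"rev:{rev['parentid']}"))
        if rev.get("user"):
            statements.append((entity, "wasAttributedTo", f"user:{rev['user']}"))
    return statements


if __name__ == "__main__":
    history = [
        {"revid": 101, "parentid": 0, "user": "Alice"},
        {"revid": 102, "parentid": 101, "user": "Bob"},
    ]
    for statement in revisions_to_prov(history):
        print(statement)
```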
  • 12. Can we learn from similar initiatives?
    • Well-established repositories for testing Machine Learning methods
      – the UCI Machine Learning repositories
      – the KDD Cup datasets
      – ... and more
    • “Building better RDF benchmarks”: Kavitha Srinivas @Dagstuhl
      – DBpedia, UniProt -- large but no representative query workload
      – YAGO: Wikipedia <-> WordNet, 8 queries
      – Barton Library, 7 queries
      – Linked Sensor Dataset, no queries
      – TPC-H as RDF
      – Berlin SPARQL Benchmark (BSBM), 12 queries + mixes
      – Lehigh University Benchmark (LUBM), 14 queries
      – SP2Bench (DBLP), 12 queries
      – Original approach:
        • Turn every dataset into a benchmark
        • by editing the dataset to enforce measures of
          – Coverage and Coherence
  • 13. WG: Provenance and uncertainty (Norbert Fuhr)
    • Uncertainty in the data
      – Sensor data, Customer reviews
    • Issues
      – Reliability (“is this the original painting?”)
      – Authenticity
    • Sources of uncertain provenance
      – Information extraction / NLP methods
      – Human errors
      – Inferences
      – Instruments
    • Challenges
      – We need a data model for uncertainty in provenance
        • probabilistic dependency relations
      – Explanation of the derivation of uncertain results
    • Limitations
      – Hard rules vs soft rules
      – Knowledge acquisition process of those rules
      – provenance incompleteness vs uncertainty
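One very simple reading of "probabilistic dependency relations" is sketched below, purely as an illustration and not as the working group's model. It assumes each dependency edge carries a probability that the dependency really holds, and that edges are independent, so the confidence in a whole derivation chain is the product of its edge probabilities. The function name `chain_confidence`, the node names and the numbers are all hypothetical.

```python
from math import prod


def chain_confidence(dependencies, chain):
    """Confidence that every dependency along `chain` holds.

    `dependencies` maps (dependent, dependency) pairs to probabilities.
    `chain` lists nodes from the final result back to the source.
    Assumes independent edges, which is a strong simplification.
    """
    return prod(dependencies[(a, b)] for a, b in zip(chain, chain[1:]))


if __name__ == "__main__":
    deps = {("summary", "extract"): 0.9, ("extract", "page"): 0.5}
    print(chain_confidence(deps, ["summary", "extract", "page"]))
```

Under this reading, a long derivation through NLP extraction steps loses confidence quickly, which is one reason the explanation of uncertain results is listed as a challenge.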
