2. Digital Enterprise Research Institute www.deri.ie
Motivation
Wikipedia is one of the widest-known knowledge bases available on the Web.
Everyone can contribute: trust and quality concerns!
Use of provenance information to identify trust and quality values for pages.
Data provenance is the history, the origins and the evolution of data.
Ability to answer the following questions about data:
Who created/modified it? When?
What is the content? Where is it located?
How and why was it created?
Which tools and processes were used?
3. Semantic provenance in Wikipedia
• By representing Wikipedia provenance information with Semantic Web technologies we enable:
– Transparency
– Reusability
– Integration with the Web of Data
• Our contribution:
– A semantic model to represent provenance information in wikis
– A software architecture to extract provenance from Wikipedia
– An application that uses and exposes provenance data to compute measures and statistics on Wikipedia articles
4. SIOC
Semantically-Interlinked Online Communities
Describes the content and structure of community sites.
The SIOC Core ontology: http://rdfs.org/sioc/spec
• Wiki and WikiArticle classes with the SIOC Types module.
Advantages of using SIOC:
• Widely used on the Web.
• Integration with existing SIOC data and other popular lightweight ontologies like FOAF, DC, etc.
• Same queries to find items on a Wiki or a Blog, Forum, etc.
5. The SIOC Actions module
• From a document-centric (SIOC) to an action-centric (SIOC Actions) view of online communities. [Champin, Passant - 2010]
• It represents the dynamics of online communities, how they evolve:
– A set of actions, performed by a user at some time, impacting one or more objects.
– In Wikipedia actions are edits made by users on the articles.
Relies on the Event Ontology [Raimond et al. - 2007]
http://motools.sourceforge.net/event/event.html
6. The W7 Model
• Ontological model created to describe the semantics of data provenance [Ram, Liu - 2007]
– Based on Bunge's ontology (1977).
– Tracks the history of the events affecting the status of things during their lifecycle.
– Extensible and generic, it can be used in different domains.
– 7 interrogative words: What, How, When, Where, Who, Which, Why.
– Not implemented in RDFS/OWL.
7. Our modelling solution
1 – What
An event (i.e. change of state) that happens to data during its lifetime.
In Wikipedia every type of event (creation, modification, deletion) leads to the creation of a new article revision.
Just using SIOC Core we can model versioning and history of wiki articles.
<http://example.com/action?title=Linked_Data#38010613>
sioca:creates
<http://en.wikipedia.org/w/index.php?title=Linked_Data&oldid=38010613>;
sioca:modifies
<http://en.wikipedia.org/wiki/Linked_Data>;
a sioca:Action.
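The URI pattern used in the triples above can be sketched as a small generator. This is only an illustration in Python, not the authors' extraction code: the function name `action_turtle` is hypothetical, the `http://example.com/action` base is the placeholder used on the slide, and the title is assumed to be already URL-safe (underscores instead of spaces).

```python
def action_turtle(title, rev_id):
    """Return Turtle for the edit action that produced one article revision.

    Hypothetical helper: it mirrors the URI pattern shown on the slide and
    assumes `title` is already URL-safe (e.g. "Linked_Data").
    """
    action = f"http://example.com/action?title={title}#{rev_id}"
    revision = f"http://en.wikipedia.org/w/index.php?title={title}&oldid={rev_id}"
    article = f"http://en.wikipedia.org/wiki/{title}"
    return (
        f"<{action}>\n"
        f"    sioca:creates <{revision}> ;\n"   # the new revision created by the edit
        f"    sioca:modifies <{article}> ;\n"   # the article the edit applies to
        f"    a sioca:Action ."
    )


print(action_turtle("Linked_Data", 38010613))
```

Each revision gets its own action resource, so the full history of an article is the set of actions whose sioca:modifies points at that article.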
8. Our modelling solution
2 – How
The action leading to an event.
• In Wikipedia the actions are the edits applied to the articles.
• By analyzing diffs between revisions we identify the type of action involved in the creation of the newer revision:
( Insertion | Update | Deletion ) ( Sentence | Reference )
• To model the differences between revisions we created a lightweight Diff ontology that aims at describing changes to plain text documents.
(http://vocab.deri.ie/diff#)
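The classification of an edit by diffing two revisions could look roughly like the sketch below. The slides do not show the actual diff algorithm, so this is an assumption-laden illustration: it uses Python's standard `difflib` rather than the authors' PHP code, splits text into sentences with a naive regular expression, and treats any changed sentence containing a `<ref` tag as a Reference change. The function names are hypothetical.

```python
import difflib
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def classify_edit(old_text, new_text):
    """Return a list of (action, target) pairs describing the edit.

    action is Insertion, Update or Deletion; target is Sentence or Reference.
    """
    old, new = split_sentences(old_text), split_sentences(new_text)
    labels = {"insert": "Insertion", "replace": "Update", "delete": "Deletion"}
    actions = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged sentences carry no action
        changed = old[i1:i2] + new[j1:j2]
        # Assumption: wiki references are recognised by their <ref ...> markup.
        target = "Reference" if any("<ref" in s for s in changed) else "Sentence"
        actions.append((labels[tag], target))
    return actions


old_rev = "Linked Data is a method. It uses URIs."
new_rev = "Linked Data is a method. It uses HTTP URIs."
print(classify_edit(old_rev, new_rev))
```

Working at sentence level rather than character level keeps the output close to the categories listed on the slide.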
9. Our modelling solution
3 – When
The time an event occurs.
• In Wikipedia every edit has a timestamp recorded, and edits are considered instantaneous.
• Use of dc:created or event:time
<http://example.com/action?title=Linked_Data#380106133>
dc:created "2010-08-21T06:36:17Z";
event:time [
a time:Instant;
time:inXSDDateTime "2010-08-21T06:36:17Z"
];
a sioca:Action.
10. Our modelling solution
4 – Where
The online space or the location associated with an event.
In Wikipedia the information about the location of the user editing the page is not provided.
This information cannot be modelled.
11. Our modelling solution
5 – Who
An agent involved in an event.
In Wikipedia it is represented by the editor of a page.
We use the sioc:UserAccount class to identify the account of the agent.
<http://example.com/action?title=Linked_Data#36243686>
sioc:has_creator
<http://en.wikipedia.org/wiki/User:Timbl>;
a sioca:Action.
12. Our modelling solution
6 – Which
The programs or instruments used in the event.
• In Wikipedia it is represented by the MediaWiki software used to edit the articles.
• Different in case the editor is a “bot”.
13. Our modelling solution
7 – Why
The reasons behind the event occurrence.
• In Wikipedia it is defined by the justifications for a change inserted by a user in the “comment” field.
• Property diff:comment with the diff:Diff class as domain.
15. Application using Wikipedia provenance data
The application has three main parts:
• Data Collection
– Extracts and generates provenance data from Wikipedia using our model.
• Firefox plug-in
– From the provenance data collected, it computes and shows statistical information directly on Wikipedia pages.
• Exposing the data to the Web of data
– The statistical information and the provenance data are provided as Linked Open Data.
16. Data Collection
A PHP script has been developed to extract all the articles belonging to a category and all its subcategories, and for each article, its entire revision history.
Then the program extracts provenance information from the articles collected at the previous step: it calculates the diff function between versions and retrieves other information from the Wikipedia API.
We ran our experiment with the “Semantic Web” category and all its 166 Wikipedia articles. All the data has been loaded into an RDF store.
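The two API requests behind this collection step can be sketched as follows. The original script is written in PHP and is not shown on the slides, so this Python version is only an illustration: the function names are hypothetical, and the choice of `rvprop` fields is an assumption based on the provenance dimensions described earlier (revision id, timestamp, user, comment). The query parameters themselves are standard MediaWiki Action API parameters.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=500):
    """URL listing the pages and subcategories of a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page|subcat",  # subcategories are returned so they can be visited in turn
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def revision_history_url(title, limit=500):
    """URL fetching the revision history of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",  # assumed fields: what, when, who, why
        "rvlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(category_members_url("Semantic Web"))
print(revision_history_url("Linked_Data"))
```

A full crawler would also have to follow the API's continuation tokens and recurse into each returned subcategory; both are omitted here for brevity.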
18. A Firefox plug-in
• This application displays a table directly on top of Wikipedia articles exposing information about the most active users and their edits.
• It consists of:
– 1) The triplestore, exposing a SPARQL endpoint;
– 2) A PHP script, which queries the triplestore and sends the results to the Greasemonkey script;
– 3) A Greasemonkey script, which retrieves the URL of the loaded Wikipedia page, sends the request to the PHP script and then displays the returned HTML data on the Wikipedia page.
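The middle component (query the triplestore, return HTML) might work along the lines of the sketch below. The real component is a PHP script whose query is not shown on the slides, so everything here is an assumption: the SPARQL query shape, the function names, and the namespace URIs used for the `sioc:` and `sioca:` prefixes. Only the properties sioca:modifies and sioc:has_creator come from the model described earlier.

```python
import html

# Assumed query: count the actions per user that modify a given article.
SPARQL_TEMPLATE = """PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX sioca: <http://rdfs.org/sioc/actions#>
SELECT ?user (COUNT(?action) AS ?edits)
WHERE {{
  ?action a sioca:Action ;
          sioca:modifies <{article}> ;
          sioc:has_creator ?user .
}}
GROUP BY ?user
ORDER BY DESC(?edits)
LIMIT {limit}"""


def top_editors_query(article_uri, limit=10):
    """Build the SPARQL query for the most active editors of one article."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(c in article_uri for c in '<>" {}|\\^`'):
        raise ValueError("article_uri is not a safe IRI")
    return SPARQL_TEMPLATE.format(article=article_uri, limit=int(limit))


def render_table(bindings):
    """Render SPARQL JSON result bindings as the HTML table sent to the browser."""
    rows = "".join(
        f"<tr><td>{html.escape(b['user']['value'])}</td>"
        f"<td>{html.escape(b['edits']['value'])}</td></tr>"
        for b in bindings
    )
    return f"<table><tr><th>User</th><th>Edits</th></tr>{rows}</table>"


print(top_editors_query("http://en.wikipedia.org/wiki/Linked_Data", limit=5))
```

Keeping the query on the server side means the browser script only needs to send the page URL and inject the returned HTML.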
20. To the Web of data
• The application is currently available at http://vmuss06.deri.ie/WikiProvenance/index.php.
• Using this web service it is possible to obtain RDF for the provenance data generated with our model.
• It is also possible to obtain the statistical information displayed by the Firefox plug-in represented in RDF.
• To represent the statistics we use SCOVO, the Statistical Core Vocabulary (http://vocab.deri.ie/scovo)
21. To the Web of data
• As an example, the following triples state that the user “KingsleyIdehen” made 11 edits on the SIOC page:
@prefix WikiStats: <http://vmuss06.deri.ie/WikipediaStats.owl#>.
@prefix scovo: <http://purl.org/NET/scovo#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
<WikiStats:title=SIOC&user=KingsleyIdehen&edits>
a scovo:Item ;
rdf:value 11 ;
scovo:dimension WikiStats:Edits ;
scovo:dimension <http://wikipedia.org/wiki/SIOC>;
scovo:dimension <http://wikipedia.org/wiki/User:KingsleyIdehen>.
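Statistics items of this shape could be derived from a list of editors roughly as sketched below. This is a hedged illustration and not the application's code: the function name `scovo_items` is hypothetical, the item URI is assumed to live in the WikiStats namespace declared above, and titles and user names are assumed to be URL-safe already.

```python
from collections import Counter

# Assumption: item URIs are minted in the WikiStats namespace from the slide.
WIKISTATS = "http://vmuss06.deri.ie/WikipediaStats.owl#"


def scovo_items(title, editors):
    """Return one Turtle scovo:Item per user, most active editors first.

    `editors` holds one user name per revision of the article `title`.
    """
    counts = Counter(editors)
    items = []
    for user, edits in counts.most_common():
        items.append(
            f"<{WIKISTATS}title={title}&user={user}&edits>\n"
            f"    a scovo:Item ;\n"
            f"    rdf:value {edits} ;\n"
            f"    scovo:dimension WikiStats:Edits ;\n"
            f"    scovo:dimension <http://wikipedia.org/wiki/{title}> ;\n"
            f"    scovo:dimension <http://wikipedia.org/wiki/User:{user}> ."
        )
    return items


revisions = ["KingsleyIdehen"] * 11 + ["Timbl"]
print("\n\n".join(scovo_items("SIOC", revisions)))
```

Each item combines three dimensions (the measure, the article and the user), which is how SCOVO attaches a single value to a point in a statistical dataset.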
22. Conclusions and Future Work
Our contribution:
• A specific lightweight ontology for provenance in wikis, based on the W7 model and SIOC.
• A framework for the extraction of provenance data from Wikipedia.
• An application to access the generated data in a meaningful way and to expose it to the Web of data.
Future work:
A refinement of the proposed model and an alignment with other general-purpose ontologies for provenance representation.
To improve the performance and extend the features of the application.
To model statistics using the SDMX vocabulary (Statistical Data and Metadata eXchange).
Comment:
• Very large amount of data generated for the “Semantic Web” category and its 166 articles: almost 1.5 million triples for a total of 8,656 revisions.
23. Questions?
Applications and source code:
http://vmuss06.deri.ie/WikiProvenance/index.php
The Diff ontology:
http://vocab.deri.ie/diff#
Contacts:
fabrizio.orlandi@deri.org
@BadmotorF
http://www.slideshare.net/badmotorfinger