SlideShare uma empresa Scribd logo
1 de 15
Data.dcs: Converting Legacy Data into Linked Data Matthew Rowe Organisations, Information and Knowledge Group University of Sheffield
Outline Problem Legacy data contained within the Department of Computer Science Motivation Why produce linked data? Converting Legacy Data into Linked Data: Triplification of Legacy Data Coreference  Resolution Linking into the Web of Linked Data Deployment Conclusions
Problem ,[object Object],People Research groups Publications Legacy data is defined as important information which is stored in proprietary formats  Each member of the DCS maintains his/her own web page Heterogeneous formatting Different presentation of content Devoid of any semantic markup Hard to find information due to its distribution and heterogeneity E.g. cannot query for all publications which two given people have worked on in the past year Must manually search and filter through a lot of information ,[object Object],Rarely updated  Contains duplicates and errors Not machine-readable
Motivation Leveraging legacy data from the DCS in a machine-readable and consistent form would allow related information to be linked together People would be linked with their publications Research groups would be linked to their members Co-authors of papers could be found Linking DCS data into the Web of Linked Data would allow additional information to be provided: Listing conferences which DCS members have attended Provide up-to-date publication listings Via external linked datasets The DCS web site use case is indicative of the makeup of the WWW overall: Organisations provide legacy data which is intended for human consumption HTML documents are provided which lack any explicit machine-readable semantics (Mika et al, 2009) presented crawling logs from 2008-2009 where only 0.6% of web documents contained RDFa and 2% contained the hCardmicroformat
Converting Legacy Data into Linked Data The approach is divided into 3 different stages: Triplification Converting legacy data into RDF triples Coreference Resolution Identifying coreferring entities into the RDF dataset Linked to the Web of Linked Data Querying and discovering links into the Linked Data cloud
Triplification of Legacy Data The DCS publication database provides publication listings as XML However, all publication information is contained within the same <description> element (title, author, year, book title): <description>   <![CDATA[Rowe, M. (2009). Interlinking Distributed Social Graphs.    In <i>Proceedings of Linked Data on the Web Workshop, WWW 2009,    Madrid, Spain. (2009)</i>. Madrid, Madrid, Spain.<br>   <br>Edited by Sarah Duffy on Tue, 08 Dec 2009 09:31:30 +0000.]]> </description> DCS web site provides person information and research group listings in HTML documents Information is devoid of markup and is provided in a heterogeneous formats Extracting person information from HTML documents requires the derivation of context windows Portions of a given HTML document which contain person information (e.g. a person’s name together with their email address and homepage)
Triplification of Legacy Data
Triplification of Legacy Data Context windows are generated by identifying portions of a HTML document which contain a person’s name The structure of the HTML DOM is then used to partition the window such that it contains information about a single person HTML markup provides clues as to the segmentation of legacy data within the document Once a name is identified a set of algorithms moves up the DOM tree until a layout element is discovered (e.g. <div>, <li>, <tr>) Provided with a set of context windows, legacy data is then extracted from the window using Hidden Markov Models (HMMs) Context windows are extracted from HTML documents for DCS members and are provided from the publication listings XML HMMs:  Input: sequence of tokens from the context window Output: most likely state labels for each of the tokens States correspond to information which is to be extracted
Triplification of Legacy Data An RDF dataset is built from the extracted legacy data This provides the sourcedataset from which a linked dataset is built For person information triples are formed as follows: <http://data.dcs.shef.ac.uk/person/12025>  rdf:typefoaf:Person ; foaf:name "Matthew Rowe" . <http://www.dcs.shef.ac.uk/~mrowe/publications.html>  foaf:topic <http://data.dcs.shef.ac.uk/person/12025> For publication information triples are formed as follows: <http://data.dcs.shef.ac.uk/paper/239>  rdf:typebib:Entry ; bib:title "Interlinking Distributed Social Graphs." ; bib:hasYear "2009" ;	 bib:hasBookTitle "Proceedings of Linked Data on the Web Workshop, WWW, Madrid, Spain." ; foaf:maker _:a1 . _:a1 rdf:typefoaf:Person ;  foaf:name "Matthew Rowe"
Coreference Resolution The triplification of legacy data contained within the DCS web sites (from ~12,000 HTML documents)  produced 17,896 instances of foaf:Person and 1,088 instances of bib:Entry Contains many equivalent foaf:Person instances Must also assign people to their publications We create information about each research group manually to relate DCS members with their research groups: <http://data.dcs.shef.ac.uk/group/oak>	 rdf:typefoaf:Group ; foaf:name  "Organisations, Information and Knowledge Group" ; foaf:workplaceHomepage  <http://oak.dcs.shef.ac.uk> Querying the source dataset for all the people to appear on each of the group personnel listings pages provides the members of each group This provides a list of people who are members of the DCS For each person equivalent instances are found by Graph smushing: detecting equivalent instances which have the same email address or homepage Co-occurrence queries: detecting equivalent instances by matching coworkers who appear in the same page People are assigned to their publications using queries containing an abbreviated form of their name Applicable in this context due to the localised nature of the dataset
Coreference Resolution
Linking to the Web of Linked Data Our dataset at this stage in the approach is not linked data All links are internal to the dataset To overcome the burden of researchers updating their publications we query the DBLP linked dataset using a Networked Graph SPARQL query: The query detects authored research papers in DBLP based on coauthorship with co-workers The CONSTRUCT clause creates a triple relating DCS members with their papers in DBLP Using foaf:made When the Data.dcs member is later looked up, a follow-your-nose strategy can be used to get all papers from DBLP Can also look up conferences attended, countries visited, etc CONSTRUCT {   ?qfoaf:made ?paper .   ?pfoaf:made ?paper  } WHERE {   ?group foaf:member ?q .   ?group foaf:member ?p .   ?qfoaf:name ?n .   ?pfoaf:name ?c .    GRAPH <http://www4.wiwiss.fu-berlin.de/dblp/>    {     ?paper dc:creator ?x .     ?xfoaf:name ?n .     ?paper dc:creator ?y .     ?yfoaf:name ?c .   }   FILTER (?p != ?q) } <http://data.dcs.shef.ac.uk/person/Fabio-Ciravegna> foaf:made  <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icml/IresonCCFKL05> ; foaf:made  <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ijcai/BrewsterCW01>
Deployment Data.dcs is now up and running and can be accessed at the following URL: http://data.dcs.shef.ac.uk (please try it!) The data is deployed using Recipe 1 from “How to Publish Linked Data” http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ Recipe 2 for Slash Namespaces from “Best Practices for Publishing RDF Vocabularies” http://www.w3.org/TR/swbp-vocab-pub/ Viewing <http://data.dcs.shef.ac.uk/group/oak> using OpenLinks’sURIBurner
Conclusions Leveraging legacy data requires information extraction components able to handle heterogeneous formats Hidden Markov Models provide a single solution to this problem, however other methods exist which could be explored Presented methods are applicable to other domains, simply requires a different topology and training   Current methods for Linked DCS Data into the Web of Linked Data are conservative: Bespoke SPARQL queries Future work will include the exploration of machine learning classification techniques to perform URI disambiguation This work is now being used as a blueprint for producing linked data from the entire University of Sheffield Covering different faculties and departments Intended to foster and identify possible collaborations across work areas
Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? (Mika et al, 2009) - Peter Mika, Edgar Meij, and Hugo Zaragoza. Investigating the semantic gap through query log analysis. In 8th International Semantic Web Conference (ISWC2009), October 2009.

Mais conteúdo relacionado

Mais procurados

Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic WebTomek Pluskiewicz
 
RDFa Introductory Course Session 4/4 When RDFa
RDFa Introductory Course Session 4/4 When RDFaRDFa Introductory Course Session 4/4 When RDFa
RDFa Introductory Course Session 4/4 When RDFaPlatypus
 
Microformats Workshop (2009)
Microformats Workshop  (2009)Microformats Workshop  (2009)
Microformats Workshop (2009)Kelley Howell
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
An introduction to Linked (Open) Data
An introduction to Linked (Open) DataAn introduction to Linked (Open) Data
An introduction to Linked (Open) DataAli Khalili
 
Wed roman tut_open_datapub
Wed roman tut_open_datapubWed roman tut_open_datapub
Wed roman tut_open_datapubeswcsummerschool
 
Semantic Web 2.0: Creating Social Semantic Information Spaces
Semantic Web 2.0: Creating Social Semantic Information SpacesSemantic Web 2.0: Creating Social Semantic Information Spaces
Semantic Web 2.0: Creating Social Semantic Information SpacesJohn Breslin
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataMatthew Rowe
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in LibrariesCarl Hess
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015Cason Snow
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked DataJuan Sequeda
 
How links can make your open data even greater
How links can make your open data even greaterHow links can make your open data even greater
How links can make your open data even greaterCristina Sarasua
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureChris Bizer
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecyclegeoknow
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)Besnik Fetahu
 
Get on the Linked Data Web!
Get on the Linked Data Web!Get on the Linked Data Web!
Get on the Linked Data Web!Armin Haller
 

Mais procurados (20)

Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
RDFa Introductory Course Session 4/4 When RDFa
RDFa Introductory Course Session 4/4 When RDFaRDFa Introductory Course Session 4/4 When RDFa
RDFa Introductory Course Session 4/4 When RDFa
 
Microformats Workshop (2009)
Microformats Workshop  (2009)Microformats Workshop  (2009)
Microformats Workshop (2009)
 
Linked Data In Action
Linked Data In ActionLinked Data In Action
Linked Data In Action
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
The Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of LeipzigThe Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of Leipzig
 
An introduction to Linked (Open) Data
An introduction to Linked (Open) DataAn introduction to Linked (Open) Data
An introduction to Linked (Open) Data
 
Wed roman tut_open_datapub
Wed roman tut_open_datapubWed roman tut_open_datapub
Wed roman tut_open_datapub
 
Semantic Web 2.0: Creating Social Semantic Information Spaces
Semantic Web 2.0: Creating Social Semantic Information SpacesSemantic Web 2.0: Creating Social Semantic Information Spaces
Semantic Web 2.0: Creating Social Semantic Information Spaces
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic Data
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in Libraries
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked Data
 
How links can make your open data even greater
How links can make your open data even greaterHow links can make your open data even greater
How links can make your open data even greater
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecycle
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
 
Get on the Linked Data Web!
Get on the Linked Data Web!Get on the Linked Data Web!
Get on the Linked Data Web!
 
Jarrar: Linked Data
Jarrar: Linked DataJarrar: Linked Data
Jarrar: Linked Data
 
Library Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic ControlLibrary Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic Control
 

Destaque

Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristicsGeorge Ang
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extractionunyil96
 
Information Extraction from the Web - In today's web
Information Extraction from the Web - In today's webInformation Extraction from the Web - In today's web
Information Extraction from the Web - In today's webBenjamin Habegger
 
Open Calais
Open CalaisOpen Calais
Open Calaisymark
 
Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013Anna Lisa Gentile
 
Predicting Online Community Churners using Gaussian Sequences
Predicting Online Community Churners using Gaussian SequencesPredicting Online Community Churners using Gaussian Sequences
Predicting Online Community Churners using Gaussian SequencesMatthew Rowe
 
Social Computing Research with Apache Spark
Social Computing Research with Apache SparkSocial Computing Research with Apache Spark
Social Computing Research with Apache SparkMatthew Rowe
 

Destaque (9)

Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristics
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extraction
 
Information Extraction from the Web - In today's web
Information Extraction from the Web - In today's webInformation Extraction from the Web - In today's web
Information Extraction from the Web - In today's web
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Open Calais
Open CalaisOpen Calais
Open Calais
 
Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013
 
Predicting Online Community Churners using Gaussian Sequences
Predicting Online Community Churners using Gaussian SequencesPredicting Online Community Churners using Gaussian Sequences
Predicting Online Community Churners using Gaussian Sequences
 
Social Computing Research with Apache Spark
Social Computing Research with Apache SparkSocial Computing Research with Apache Spark
Social Computing Research with Apache Spark
 
Comparing Ontotext KIM and Apache Stanbol
Comparing Ontotext KIM and Apache StanbolComparing Ontotext KIM and Apache Stanbol
Comparing Ontotext KIM and Apache Stanbol
 

Semelhante a Data.dcs: Converting Legacy Data into Linked Data

An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...IJCSIS Research Publications
 
15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data bindingPhúc Đỗ
 
15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data bindingPhúc Đỗ
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked datavafopoulos
 
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Stuart Chalk
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data TutorialSören Auer
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET Journal
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Jane Stevenson
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGChris Ewing
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Websamar_slideshare
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
20100614 ISWSA Keynote
20100614 ISWSA Keynote20100614 ISWSA Keynote
20100614 ISWSA KeynoteAxel Polleres
 
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...cscpconf
 
A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...csandit
 
Information Intermediaries
Information IntermediariesInformation Intermediaries
Information IntermediariesDave Reynolds
 
Data Portability with SIOC and FOAF
Data Portability with SIOC and FOAFData Portability with SIOC and FOAF
Data Portability with SIOC and FOAFUldis Bojars
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And VisualizationIvan Ermilov
 

Semelhante a Data.dcs: Converting Legacy Data into Linked Data (20)

An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...
 
15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data binding
 
15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data binding
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
 
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data Tutorial
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description Framework
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
 
Gt ea2009
Gt ea2009Gt ea2009
Gt ea2009
 
The Social Data Web
The Social Data WebThe Social Data Web
The Social Data Web
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIG
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Web
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
20100614 ISWSA Keynote
20100614 ISWSA Keynote20100614 ISWSA Keynote
20100614 ISWSA Keynote
 
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
 
A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...
 
Information Intermediaries
Information IntermediariesInformation Intermediaries
Information Intermediaries
 
Data Portability with SIOC and FOAF
Data Portability with SIOC and FOAFData Portability with SIOC and FOAF
Data Portability with SIOC and FOAF
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And Visualization
 

Mais de Matthew Rowe

Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...Matthew Rowe
 
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting RatingsSemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings Matthew Rowe
 
The Semantic Evolution of Online Communities
The Semantic Evolution of Online CommunitiesThe Semantic Evolution of Online Communities
The Semantic Evolution of Online CommunitiesMatthew Rowe
 
From Mining to Understanding: The Evolution of Social Web Users
From Mining to Understanding: The Evolution of Social Web UsersFrom Mining to Understanding: The Evolution of Social Web Users
From Mining to Understanding: The Evolution of Social Web UsersMatthew Rowe
 
Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...Matthew Rowe
 
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...Matthew Rowe
 
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...Matthew Rowe
 
Identity: Physical, Cyber, Future
Identity: Physical, Cyber, FutureIdentity: Physical, Cyber, Future
Identity: Physical, Cyber, FutureMatthew Rowe
 
Measuring the Topical Specificity of Online Communities
Measuring the Topical Specificity of Online CommunitiesMeasuring the Topical Specificity of Online Communities
Measuring the Topical Specificity of Online CommunitiesMatthew Rowe
 
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...Matthew Rowe
 
Attention Economics in Social Web Systems
Attention Economics in Social Web SystemsAttention Economics in Social Web Systems
Attention Economics in Social Web SystemsMatthew Rowe
 
What makes communities tick? Community health analysis using role compositions
What makes communities tick? Community health analysis using role compositionsWhat makes communities tick? Community health analysis using role compositions
What makes communities tick? Community health analysis using role compositionsMatthew Rowe
 
Existing Research and Future Research Agenda
Existing Research and Future Research AgendaExisting Research and Future Research Agenda
Existing Research and Future Research AgendaMatthew Rowe
 
Tutorial: Social Semantics
Tutorial: Social SemanticsTutorial: Social Semantics
Tutorial: Social SemanticsMatthew Rowe
 
Modelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online CommunitiesModelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online CommunitiesMatthew Rowe
 
Using Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
Using Behaviour Analysis to Detect Cultural Aspects in Social Web SystemsUsing Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
Using Behaviour Analysis to Detect Cultural Aspects in Social Web SystemsMatthew Rowe
 
Anticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsAnticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsMatthew Rowe
 
Forecasting Audience Increase on Youtube
Forecasting Audience Increase on YoutubeForecasting Audience Increase on Youtube
Forecasting Audience Increase on YoutubeMatthew Rowe
 
Predicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic WebPredicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic WebMatthew Rowe
 
PhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social DataPhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social DataMatthew Rowe
 

Mais de Matthew Rowe (20)

Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
 
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting RatingsSemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
 
The Semantic Evolution of Online Communities
The Semantic Evolution of Online CommunitiesThe Semantic Evolution of Online Communities
The Semantic Evolution of Online Communities
 
From Mining to Understanding: The Evolution of Social Web Users
From Mining to Understanding: The Evolution of Social Web UsersFrom Mining to Understanding: The Evolution of Social Web Users
From Mining to Understanding: The Evolution of Social Web Users
 
Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...
 
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
 
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
 
Identity: Physical, Cyber, Future
Identity: Physical, Cyber, FutureIdentity: Physical, Cyber, Future
Identity: Physical, Cyber, Future
 
Measuring the Topical Specificity of Online Communities
Measuring the Topical Specificity of Online CommunitiesMeasuring the Topical Specificity of Online Communities
Measuring the Topical Specificity of Online Communities
 
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
 
Attention Economics in Social Web Systems
Attention Economics in Social Web SystemsAttention Economics in Social Web Systems
Attention Economics in Social Web Systems
 
What makes communities tick? Community health analysis using role compositions
What makes communities tick? Community health analysis using role compositionsWhat makes communities tick? Community health analysis using role compositions
What makes communities tick? Community health analysis using role compositions
 
Existing Research and Future Research Agenda
Existing Research and Future Research AgendaExisting Research and Future Research Agenda
Existing Research and Future Research Agenda
 
Tutorial: Social Semantics
Tutorial: Social SemanticsTutorial: Social Semantics
Tutorial: Social Semantics
 
Modelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online CommunitiesModelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online Communities
 
Using Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
Using Behaviour Analysis to Detect Cultural Aspects in Social Web SystemsUsing Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
Using Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
 
Anticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsAnticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community Forums
 
Forecasting Audience Increase on Youtube
Forecasting Audience Increase on YoutubeForecasting Audience Increase on Youtube
Forecasting Audience Increase on Youtube
 
Predicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic WebPredicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic Web
 
PhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social DataPhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social Data
 

Último

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 

Último (20)

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 

Data.dcs: Converting Legacy Data into Linked Data

  • 1. Data.dcs: Converting Legacy Data into Linked Data Matthew Rowe Organisations, Information and Knowledge Group University of Sheffield
  • 2. Outline Problem Legacy data contained within the Department of Computer Science Motivation Why produce linked data? Converting Legacy Data into Linked Data: Triplification of Legacy Data Coreference Resolution Linking into the Web of Linked Data Deployment Conclusions
  • 3.
  • 4. Motivation Leveraging legacy data from the DCS in a machine-readable and consistent form would allow related information to be linked together People would be linked with their publications Research groups would be linked to their members Co-authors of papers could be found Linking DCS data into the Web of Linked Data would allow additional information to be provided: Listing conferences which DCS members have attended Provide up-to-date publication listings Via external linked datasets The DCS web site use case is indicative of the makeup of the WWW overall: Organisations provide legacy data which is intended for human consumption HTML documents are provided which lack any explicit machine-readable semantics (Mika et al, 2009) presented crawling logs from 2008-2009 where only 0.6% of web documents contained RDFa and 2% contained the hCardmicroformat
  • 5. Converting Legacy Data into Linked Data The approach is divided into 3 different stages: Triplification Converting legacy data into RDF triples Coreference Resolution Identifying coreferring entities into the RDF dataset Linked to the Web of Linked Data Querying and discovering links into the Linked Data cloud
  • 6. Triplification of Legacy Data The DCS publication database provides publication listings as XML However, all publication information is contained within the same <description> element (title, author, year, book title): <description> <![CDATA[Rowe, M. (2009). Interlinking Distributed Social Graphs. In <i>Proceedings of Linked Data on the Web Workshop, WWW 2009, Madrid, Spain. (2009)</i>. Madrid, Madrid, Spain.<br> <br>Edited by Sarah Duffy on Tue, 08 Dec 2009 09:31:30 +0000.]]> </description> DCS web site provides person information and research group listings in HTML documents Information is devoid of markup and is provided in a heterogeneous formats Extracting person information from HTML documents requires the derivation of context windows Portions of a given HTML document which contain person information (e.g. a person’s name together with their email address and homepage)
  • 8. Triplification of Legacy Data Context windows are generated by identifying portions of a HTML document which contain a person’s name The structure of the HTML DOM is then used to partition the window such that it contains information about a single person HTML markup provides clues as to the segmentation of legacy data within the document Once a name is identified a set of algorithms moves up the DOM tree until a layout element is discovered (e.g. <div>, <li>, <tr>) Provided with a set of context windows, legacy data is then extracted from the window using Hidden Markov Models (HMMs) Context windows are extracted from HTML documents for DCS members and are provided from the publication listings XML HMMs: Input: sequence of tokens from the context window Output: most likely state labels for each of the tokens States correspond to information which is to be extracted
  • 9. Triplification of Legacy Data An RDF dataset is built from the extracted legacy data This provides the sourcedataset from which a linked dataset is built For person information triples are formed as follows: <http://data.dcs.shef.ac.uk/person/12025> rdf:typefoaf:Person ; foaf:name "Matthew Rowe" . <http://www.dcs.shef.ac.uk/~mrowe/publications.html> foaf:topic <http://data.dcs.shef.ac.uk/person/12025> For publication information triples are formed as follows: <http://data.dcs.shef.ac.uk/paper/239> rdf:typebib:Entry ; bib:title "Interlinking Distributed Social Graphs." ; bib:hasYear "2009" ; bib:hasBookTitle "Proceedings of Linked Data on the Web Workshop, WWW, Madrid, Spain." ; foaf:maker _:a1 . _:a1 rdf:typefoaf:Person ; foaf:name "Matthew Rowe"
  • 10. Coreference Resolution The triplification of legacy data contained within the DCS web sites (from ~12,000 HTML documents) produced 17,896 instances of foaf:Person and 1,088 instances of bib:Entry Contains many equivalent foaf:Person instances Must also assign people to their publications We create information about each research group manually to relate DCS members with their research groups: <http://data.dcs.shef.ac.uk/group/oak> rdf:typefoaf:Group ; foaf:name "Organisations, Information and Knowledge Group" ; foaf:workplaceHomepage <http://oak.dcs.shef.ac.uk> Querying the source dataset for all the people to appear on each of the group personnel listings pages provides the members of each group This provides a list of people who are members of the DCS For each person equivalent instances are found by Graph smushing: detecting equivalent instances which have the same email address or homepage Co-occurrence queries: detecting equivalent instances by matching coworkers who appear in the same page People are assigned to their publications using queries containing an abbreviated form of their name Applicable in this context due to the localised nature of the dataset
  • 12. Linking to the Web of Linked Data Our dataset at this stage in the approach is not linked data All links are internal to the dataset To overcome the burden of researchers updating their publications we query the DBLP linked dataset using a Networked Graph SPARQL query: The query detects authored research papers in DBLP based on coauthorship with co-workers The CONSTRUCT clause creates a triple relating DCS members with their papers in DBLP Using foaf:made When the Data.dcs member is later looked up, a follow-your-nose strategy can be used to get all papers from DBLP Can also look up conferences attended, countries visited, etc CONSTRUCT { ?qfoaf:made ?paper . ?pfoaf:made ?paper } WHERE { ?group foaf:member ?q . ?group foaf:member ?p . ?qfoaf:name ?n . ?pfoaf:name ?c . GRAPH <http://www4.wiwiss.fu-berlin.de/dblp/> { ?paper dc:creator ?x . ?xfoaf:name ?n . ?paper dc:creator ?y . ?yfoaf:name ?c . } FILTER (?p != ?q) } <http://data.dcs.shef.ac.uk/person/Fabio-Ciravegna> foaf:made <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icml/IresonCCFKL05> ; foaf:made <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ijcai/BrewsterCW01>
  • 13. Deployment Data.dcs is now up and running and can be accessed at the following URL: http://data.dcs.shef.ac.uk (please try it!) The data is deployed using Recipe 1 from “How to Publish Linked Data” http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ Recipe 2 for Slash Namespaces from “Best Practices for Publishing RDF Vocabularies” http://www.w3.org/TR/swbp-vocab-pub/ Viewing <http://data.dcs.shef.ac.uk/group/oak> using OpenLinks’sURIBurner
  • 14. Conclusions Leveraging legacy data requires information extraction components able to handle heterogeneous formats Hidden Markov Models provide a single solution to this problem, however other methods exist which could be explored Presented methods are applicable to other domains, simply requires a different topology and training Current methods for Linked DCS Data into the Web of Linked Data are conservative: Bespoke SPARQL queries Future work will include the exploration of machine learning classification techniques to perform URI disambiguation This work is now being used as a blueprint for producing linked data from the entire University of Sheffield Covering different faculties and departments Intended to foster and identify possible collaborations across work areas
  • 15. Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? (Mika et al, 2009) - Peter Mika, Edgar Meij, and Hugo Zaragoza. Investigating the semantic gap through query log analysis. In 8th International Semantic Web Conference (ISWC2009), October 2009.