LDIF
Linked Data Integration Framework


               LINKED DATA CHALLENGES
•   Data sources that overlap in content may:
    •   use a wide range of different RDF vocabularies
    •   use different identifiers for the same real-world entity
    •   provide conflicting values for the same properties
•   Implications:
    •   Queries are usually hand-crafted against individual sources – no
        different from an API
    •   Improvised or manual merging of entities
•   Integrating public datasets with internal databases poses the same
    problems
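
To make the overlap concrete, here is a minimal sketch of the problem with invented URIs and values (`exv:` is a made-up vocabulary and the population figures are illustrative): two sources describe the same city under different identifiers, different vocabularies, and conflicting property values.

```turtle
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix exv:         <http://example.org/vocab#> .

# Source A: DBpedia-style identifier and vocabulary
<http://dbpedia.org/resource/Berlin>
    a dbpedia-owl:City ;
    dbpedia-owl:populationTotal 3500000 .

# Source B: a different identifier for the same real-world city,
# a different vocabulary, and a conflicting population value
<http://example.org/places/berlin-de>
    a exv:Place ;
    exv:population 3400000 .
```

A query written against Source A finds nothing in Source B, which is exactly the per-source, API-like situation described above.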


                                    LDIF
•   LDIF homogenizes Linked Data from multiple sources into a clean, local
    target representation while keeping track of data provenance
          1   Collect data: Managed download and update

          2   Translate data into a single target vocabulary

          3   Resolve identifier aliases into local target URIs

          4   Cleanse data by resolving conflicting values

          5   Output

•   Open source (Apache License, Version 2.0)
•   Collaboration between Freie Universität Berlin and mes|semantics


                         LDIF PIPELINE

1   Collect data
2   Translate data
3   Resolve identities
4   Cleanse data
5   Output

Supported data sources:
•   RDF dumps (various formats)
•   SPARQL Endpoints
•   Crawling Linked Data


                             LDIF PIPELINE

1   Collect data
2   Translate data
3   Resolve identities
4   Cleanse data
5   Output

Sources use a wide range of different RDF vocabularies.

    { dbpedia-owl:City, schema:Place, fb:location.citytown }  →  R2R  →  local:City

•   Mappings expressed in RDF (Turtle)
•   Simple mappings using OWL / RDFS statements
    (x rdfs:subClassOf y)
•   Complex mappings with SPARQL expressivity
•   Transformation functions
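
A minimal sketch of what such a class mapping can look like in R2R's Turtle syntax; the `mp:` mapping namespace and `local:` target vocabulary are placeholders, so check the R2R documentation for the exact mapping vocabulary before reusing this shape.

```turtle
@prefix r2r: <http://www4.wiwiss.fu-berlin.de/bizer/r2r/> .
@prefix mp:  <http://example.org/mappings#> .

# Map every dbpedia-owl:City instance onto the local target class.
mp:CityMapping
    a r2r:ClassMapping ;
    r2r:prefixDefinitions
        "dbpedia-owl: <http://dbpedia.org/ontology/> . local: <http://example.org/vocab#>" ;
    r2r:sourcePattern "?SUBJ a dbpedia-owl:City" ;
    r2r:targetPattern "?SUBJ a local:City" .
```

The source pattern selects matching triples, and the target pattern rewrites them into the local vocabulary; more complex mappings can use SPARQL-level expressivity and transformation functions as noted above.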


                             LDIF PIPELINE

1   Collect data
2   Translate data
3   Resolve identities
4   Cleanse data
5   Output

Sources use different identifiers for the same entity.

    [Figure: an entity at 52° 31′ N, 13° 24′ O must be matched against the
    candidates Berlin, Germany / Berlin, CT / Berlin, MD / Berlin, NJ /
    Berlin, MA; Silk resolves that Berlin = Berlin, Germany]

•   Profiles expressed in XML
•   Supports various comparators and transformations
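
A trimmed-down sketch of what a Silk linkage rule in that XML profile format can look like. The element names follow published Silk examples, but the dataset IDs, property paths, metric names, and thresholds here are illustrative assumptions, so treat this as a shape rather than a copy-paste profile.

```xml
<Silk>
  <Interlinks>
    <Interlink id="cities">
      <LinkType>owl:sameAs</LinkType>
      <SourceDataset dataSource="sourceA" var="a">
        <RestrictTo>?a rdf:type local:City</RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="sourceB" var="b">
        <RestrictTo>?b rdf:type local:City</RestrictTo>
      </TargetDataset>
      <LinkageRule>
        <!-- Combine a label comparison with a geographic distance check -->
        <Aggregate type="average">
          <Compare metric="levenshteinDistance" threshold="1">
            <Input path="?a/rdfs:label" />
            <Input path="?b/rdfs:label" />
          </Compare>
          <Compare metric="wgs84" threshold="5">
            <Input path="?a/local:coordinates" />
            <Input path="?b/local:coordinates" />
          </Compare>
        </Aggregate>
      </LinkageRule>
    </Interlink>
  </Interlinks>
</Silk>
```

Entity pairs that score above the rule's threshold are treated as the same real-world entity and collapsed onto one local target URI.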


                             LDIF PIPELINE

1   Collect data
2   Translate data
3   Resolve identities
4   Cleanse data
5   Output

Sources provide different values for the same property.

    [Figure: one source (rated ★★) says the Berlin population is 3.4M,
    another (rated ★★★) says 3.5M; Sieve keeps the better-rated value, 3.5M]

•   Profiles expressed in XML
•   Supports various quality assessment policies
    and conflict resolution methods
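
An illustrative sketch of a Sieve profile in that XML format, loosely following published Sieve examples; the metric, scoring function, and fusion function names are assumptions here and should be checked against the Sieve documentation before use.

```xml
<Sieve xmlns="http://sieve.wbsg.de/">
  <QualityAssessment>
    <!-- Score each source graph, e.g. by a preference list of trusted sources -->
    <AssessmentMetric id="sieve:reputation">
      <ScoringFunction class="ScoredList">
        <Param name="list" value="http://en.wikipedia.org http://example.org"/>
      </ScoringFunction>
    </AssessmentMetric>
  </QualityAssessment>
  <Fusion>
    <!-- For conflicting population values, keep the value from the best-scored graph -->
    <Class name="local:City">
      <Property name="local:population">
        <FusionFunction class="KeepFirst" metric="sieve:reputation"/>
      </Property>
    </Class>
  </Fusion>
</Sieve>
```

The quality assessment step scores the named graphs the data came from, and the fusion step uses those scores to pick a single value per property, which is how 3.5M wins over 3.4M in the example above.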


                             LDIF PIPELINE

1   Collect data
2   Translate data
3   Resolve identities
4   Cleanse data
5   Output

Output options:
•   N-Quads
•   N-Triples
•   SPARQL Update Stream

•   Provenance tracking using Named Graphs
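
In N-Quads, the fourth element of each statement names the graph the triple belongs to, which is how provenance can ride along with the data. A sketch with invented URIs:

```
# Data triple, tagged with the named graph it was imported into:
<http://example.org/resource/Berlin> <http://example.org/vocab#population> "3500000" <http://example.org/provenance/import42> .

# Metadata about that import graph can itself be stated as ordinary quads:
<http://example.org/provenance/import42> <http://example.org/vocab#importedFrom> <http://dbpedia.org/> <http://example.org/provenance/metadata> .
```

N-Triples output drops the graph element, so it is the option to use when the consumer does not need provenance.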
                    LDIF ARCHITECTURE

Application Layer
    Application Code, consuming the integrated data through SPARQL or an RDF API

Data Access, Integration and Storage Layer (LDIF)
    Web Data Access Module → Data Translation Module → Identity Resolution
    Module → Data Quality and Fusion Module → Integrated Web Data

Web of Data / Publication Layer (reached over HTTP)
    Database A and Database B exposed through LD Wrappers; a CMS publishing
    RDFa and RDF/XML


                               LDIF VERSIONS
•   In-memory
    •   keeps all intermediate results in memory
    •   fast, but scalability limited by local RAM
•   RDF Store (TDB)
    •   stores intermediate results in a Jena TDB RDF store
    •   can process more data than In-memory, but does not scale beyond a
        single machine
•   Cluster (Hadoop)
    •   scales by parallelizing work across multiple machines using Hadoop
    •   can process a virtually unlimited amount of data


                           THANK YOU

•   Website: http://ldif.wbsg.de
•   Google group: http://bit.ly/ldifgroup



•   Supported in part by
    •   Vulcan Inc. as part of its Project Halo
    •   EU FP7 project LOD2 - Creating Knowledge out of Interlinked
        Data (Grant No. 257943)
