dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data


dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a "vertical" analytics perspective (by storing compact lists of literal values for a given attribute). http://diuf.unifr.ch/main/xi/diplodocus/


Short and Long-Tail RDF Analytics for
Massive Webs of Data
Marcin Wylot, Jigé Pont, Mariusz Wiśniewski,
and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg
Switzerland
International Semantic Web Conference
26th October 2011, Bonn, Germany

Motivation

● increasingly large semantic/LoD data sets
● increasingly complex queries
○ real time analytic queries
■ e.g. “return the professor who supervises the most students”

urgent need for more efficient and scalable
solutions for RDF data management

3 recipes to speed-up

○collocation
○collocation
○collocation

Why collocation?
Because collocating data reduces I/O
operations, which are one of the biggest
bottlenecks in database systems.

Outline
● architecture
● main idea
● data structures
● basic operations (inserts, queries)
● evaluation & results
● future work

Molecule Clusters
● extremely compact sub-graphs
● precomputed joins

List of Literals
● extremely compact list of sorted values

Hash Table
● lexicographic tree to encode URIs
● template-based indexing
● extremely compact lists of homologous nodes
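
The idea of encoding URIs through a lexicographic tree can be sketched as a character trie that hands out integer keys. This is only an illustration under assumptions: the class name `UriTrie`, its methods, and the sequential key assignment are hypothetical and not taken from the dipLODocus[RDF] implementation.

```python
class UriTrie:
    """Hypothetical character trie mapping each URI to a compact integer key."""

    def __init__(self):
        self._root = {}      # nested dicts: one level per character
        self._next_key = 0   # assumption: keys are assigned sequentially

    def encode(self, uri):
        """Return the key for uri, creating a new one on first sight."""
        node = self._root
        for ch in uri:
            node = node.setdefault(ch, {})
        # "" marks the end of a URI; it can never collide with a character.
        if "" not in node:
            node[""] = self._next_key
            self._next_key += 1
        return node[""]

    def lookup(self, uri):
        """Return the key for uri, or None if it was never encoded."""
        node = self._root
        for ch in uri:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("")


trie = UriTrie()
student0 = trie.encode("http://example.org/Student0")
student1 = trie.encode("http://example.org/Student1")
print(student0, student1)
```

URIs that share a long prefix, as most URIs in one dataset do, share the corresponding path in the tree, which is one reason such a structure stays compact.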

Basic operations - inserts
n-pass algorithm

Basic operations - queries - triple patterns
?x type Student.
?x takesCourse Course0.

?x type Student.
?x takesCourse Course0.
?x takesCourse Course1.

=> intersection of sorted lists
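
The intersection step named above can be sketched as a linear merge over sorted lists of subject keys. The function name `intersect_sorted` and the integer keys are assumptions made for illustration; how the system actually obtains one list per triple pattern is not shown here.

```python
def intersect_sorted(a, b):
    """Return the elements common to two ascending lists, in ascending order."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical sorted key lists, one per triple pattern.
students = [1, 2, 3, 5, 8]       # ?x type Student
takes_course0 = [2, 3, 8, 9]     # ?x takesCourse Course0
takes_course1 = [3, 4, 8]        # ?x takesCourse Course1

result = intersect_sorted(intersect_sorted(students, takes_course0), takes_course1)
print(result)  # [3, 8]
```

Because every list is already sorted, each intersection touches each element at most once and needs no hashing or re-sorting.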

Basic operations - queries - molecule queries

?a name 'Student1'.
?a ?b ?c.
?c ?d ?e.

Basic operations - queries
aggregates and analytics
?x type Student.
?x age ?y
filter (?y < 21)
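
A filter such as `?y < 21` can exploit a sorted list of literals: a binary search finds the cut-off, and everything before it qualifies. The sketch below assumes the list stores `(value, subject key)` pairs sorted by value; that layout, the function name `subjects_below`, and the sample keys are hypothetical.

```python
from bisect import bisect_left


def subjects_below(sorted_literals, bound):
    """Return subject keys whose literal value is strictly below bound.

    sorted_literals is a list of (value, subject_key) pairs sorted by value.
    """
    # (bound,) sorts before every (bound, key) pair, so the cut excludes value == bound.
    cut = bisect_left(sorted_literals, (bound,))
    return [subject for _, subject in sorted_literals[:cut]]


# Hypothetical list of literals for the "age" attribute.
ages = [(18, 101), (19, 104), (20, 102), (21, 107), (25, 103)]
students = {101, 102, 103}  # keys matching ?x type Student

young_students = sorted(set(subjects_below(ages, 21)) & students)
print(young_students)  # [101, 102]
```

The range scan reads one contiguous slice of the list, which matches the collocation argument made earlier in the talk.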

Performance Evaluation
We used the Lehigh University Benchmark (LUBM).
We generated two datasets, for 10 and 100 universities:
● 10 universities: 1 272 814 distinct triples and 315 003 distinct strings
● 100 universities: 13 876 209 distinct triples and 3 301 868 distinct strings

We compared the execution times of the 14 LUBM queries
and 3 analytic queries inspired by BowlognaBench:
● return the professor who supervises the most students
● return a big molecule containing everything around
Student0 within scope 2
● return the names of all graduate students

Future work
● open source
○ cleaning code
○ extending code
● parallelising operations
○ multi-core architecture
○ cloud
● automated database design

Conclusions
● advanced data collocation
○ molecules, RDF sub-graphs
○ lists of literals, compact sorted list of values
○ hash table indexed by templates
● slower inserts and updates
○ compact ordered structures
○ data redundancy
● 30 times faster on LUBM queries
● 350 times faster on analytic queries

Transitivity

● Inheritance Manager
○ typeX subClassOf typeY

● Query
○ ?z type typeY
■ ?z type typeY
■ ?z type typeX

● subClassOf
● subPropertyOf
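
The query expansion on this slide can be sketched as follows: a pattern on a type is rewritten into one pattern per type in its subclass closure. The function name `expand_type` and the pair-list input are assumptions for illustration; the slide does not describe how the Inheritance Manager stores the hierarchy.

```python
def expand_type(target, subclass_pairs):
    """Return target plus all of its direct and indirect subclasses.

    subclass_pairs is a list of (child, parent) tuples, one per subClassOf triple.
    """
    children = {}
    for child, parent in subclass_pairs:
        children.setdefault(parent, []).append(child)

    result, seen, stack = [], set(), [target]
    while stack:
        current = stack.pop()
        if current in seen:  # guards against cycles in the hierarchy
            continue
        seen.add(current)
        result.append(current)
        stack.extend(children.get(current, []))
    return result


hierarchy = [("typeX", "typeY")]  # typeX subClassOf typeY
for t in expand_type("typeY", hierarchy):
    print(f"?z type {t}")
```

The same traversal would apply to subPropertyOf by expanding the predicate instead of the type.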

Serialising Molecules

#TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE
#TEMPLATES - the number of templates in the molecule
TEMPLATE_SIZE - the size of a template identifier in bytes
#TRIPLES - the number of triples in the molecule
KEY_SIZE - the size of a key in bytes, for example 8 in our case (Intel 64, Linux)
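
As a worked instance of the formula above, the sketch below computes the serialised size of a molecule. Only `KEY_SIZE = 8` comes from the slide; the function name `molecule_size`, the value used for `TEMPLATE_SIZE`, and the example counts are assumptions.

```python
def molecule_size(n_templates, n_triples, template_size, key_size=8):
    """Serialised molecule size in bytes:
    #TEMPLATES * TEMPLATE_SIZE + #TRIPLES * KEY_SIZE
    """
    return n_templates * template_size + n_triples * key_size


# Hypothetical molecule: 3 templates, 10 triples, 4-byte template entries (assumed).
print(molecule_size(n_templates=3, n_triples=10, template_size=4))  # 92
```

With 8-byte keys the triple term dominates as soon as a molecule holds more than a handful of triples.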


9. System Architecture

10. Main Idea - Hybrid Storage

11. Main Idea - data structures

12. Declarative Templates

13. Template Matching


22. Results - LUBM - 10 Universities

23. Results - LUBM - 100 Universities

24. Results - analytic 10 Universities

25. Results - analytic 100 Universities


28. Thank you for your attention

29. Update Manager - lazy updates
