SlideShare uma empresa Scribd logo
Bringing the Semantic Web
closer to reality
PostgreSQL as RDF Graph Database
Jimmy Angelakos
EDINA, University of Edinburgh
FOSDEM
04-05/02/2017
or how to export your data to
someone who's expecting RDF
Jimmy Angelakos
EDINA, University of Edinburgh
FOSDEM
04-05/02/2017
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
Semantic Web? RDF?
● Resource Description Framework
– Designed to overcome the limitations of HTML
– Make the Web machine readable
– Metadata data model
– Multigraph (Labelled, Directed)
– Triples (Subject – Predicate – Object)
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
● Information addressable via URIs
● <http://example.org/person/Mark_Twain>
<http://example.org/relation/author>
<http://example.org/books/Huckleberry_Finn>
● <http://edina.ac.uk/ns/item/74445709>
<http://purl.org/dc/terms/title>
"The power of words: A model of honesty and
fairness" .
● Namespaces
@prefix dc:
<http://purl.org/dc/elements/1.1/> .
RDF Triples
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
Triplestores
● Offer persistence to our RDF graph
● RDFLib extended by RDFLib-SQLAlchemy
● Use PostgreSQL as storage backend!
● Querying
– SPARQL
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
{ +
"DOI": "10.1007/11757344_1", +
"URL": "http://dx.doi.org/10.1007/11757344_1", +
"type": "book-chapter", +
"score": 1.0, +
"title": [ +
"CrossRef Listing of Deleted DOIs" +
], +
"member": "http://id.crossref.org/member/297", +
"prefix": "http://id.crossref.org/prefix/10.1007", +
"source": "CrossRef", +
"created": { +
"date-time": "2006-10-19T13:32:01Z", +
"timestamp": 1161264721000, +
"date-parts": [ +
[ +
2006, +
10, +
19 +
] +
] +
}, +
"indexed": { +
"date-time": "2015-12-24T00:59:48Z", +
"timestamp": 1450918788746, +
"date-parts": [ +
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
import psycopg2
from rdflib import plugin, Graph, Literal, URIRef
from rdflib.namespace import Namespace
from rdflib.store import Store
import rdflib_sqlalchemy
EDINA = Namespace('http://edina.ac.uk/ns/')
PRISM =
Namespace('http://prismstandard.org/namespaces/basic/2.1/')
rdflib_sqlalchemy.registerplugins()
dburi =
URIRef('postgresql+psycopg2://myuser:mypass@localhost/rdfgraph'
)
ident = EDINA.rdfgraph
store = plugin.get('SQLAlchemy', Store)(identifier=ident)
gdb = Graph(store, ident)
gdb.open(dburi, create=True)
gdb.bind('edina', EDINA)
gdb.bind('prism', PRISM)
item = EDINA['item/' + str(1)]
triples = []
triples += (item, RDF.type, EDINA.Item),
triples += (item, PRISM.doi, Literal('10.1002/crossmark_policy'),
gdb.addN(t + (gdb,) for t in triples)
gdb.serialize(format='turtle')
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
BIG
DATA
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
Super size me!
● Loop over original database contents
● Create triples
● Add them to graph efficiently
● Serialise graph efficiently
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
But but… ?
● Big data without Java?
● Graphs without Java?
– Gremlin? BluePrints? TinkerPop? Jena? Asdasdfaf? XYZZY?
● Why not existing triplestores? “We are the only Graph DB that...”
● Python/Postgres run on desktop hardware
● Simplicity (few LOC and readable)
● Unoptimised potential for improvement→
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
conn = psycopg2.connect(database='mydb', user='myuser')
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
seqcur = conn.cursor()
seqcur.execute("""
CREATE SEQUENCE IF NOT EXISTS item;
""")
conn.commit()
cur = conn.cursor('serverside_cur',
cursor_factory=psycopg2.extras.DictCursor)
cur.itersize = 50000
cur.arraysize = 10000
cur.execute("""
SELECT data
FROM mytable
""")
while True:
recs = cur.fetchmany(10000)
if not recs:
break
for r in recs:
if 'DOI' in r['data'].keys():
seqcur.execute("SELECT nextval('item')")
item = EDINA['item/' + str(seqcur.fetchone()[0])]
triples += (item, RDF.type, EDINA.Item),
triples += (item, PRISM.doi, Literal(r['data']['DOI'])),
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
1st
challenge
● rdflib-sqlalchemy
– No ORM, autocommit (!)
– Creates SQL statements, executes one at a time
– INSERT INTO … VALUES (…);
INSERT INTO … VALUES (…);
INSERT INTO … VALUES (…);
– We want INSERT INTO … VALUES (…),(…),
(…)
– Creates lots of indexes which must be dropped
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
2nd
challenge
● How to restart if interrupted?
● Solved with querying and caching.
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
from rdflib.plugins.sparql import prepareQuery
orgQ = prepareQuery("""
SELECT ?org ?pub
WHERE { ?org a foaf:Organization .
?org rdfs:label ?pub . }
""", initNs = { 'foaf': FOAF, 'rdfs': RDFS })
orgCache = {}
for o in gdb.query(orgQ):
orgCache[o[1].toPython()] = URIRef(o[0].toPython())
if 'publisher' in r['data'].keys():
publisherFound = False
if r['data']['publisher'] in orgCache.keys():
publisherFound = True
triples += (item, DCTERMS.publisher, orgCache[r['data']
['publisher']]),
if not publisherFound:
seqcur.execute("SELECT nextval('org')")
org = EDINA['org/' + str(seqcur.fetchone()[0])]
orgCache[r['data']['publisher']] = org
triples += (org, RDF.type, FOAF.Organization),
triples += (org, RDFS.label, Literal(r['data']
['publisher'])),
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
3rd
challenge
● rdflib-sqlalchemy (yup… guessed it)
– Selects whole graph into memory
● Server side cursor:
res = connection.execution_options(
stream_results=True).execute(q)
● Batching:
while True:
result = res.fetchmany(1000) …
yield …
– Inexplicably caches everything read in RAM!
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
4th
challenge
● Serialise efficiently! Multiprocessing→
– Processes, JoinableQueues
● Turtle: Unsuitable N-Triples→
– UNIX magic
python3 rdf2nt.py |
split -a4 -d -C4G
--additional-suffix=.nt
--filter='gzip > $FILE.gz' -
exported/rdfgraph_
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
Desktop hardware...
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
5th
challenge
● Serialisation outran HDD!
● Waits for:
– JoinableQueues to empty
– sys.stdout.flush()
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
rdfgraph=> dt
List of relations
Schema | Name | Type | Owner
--------+-----------------------------------+-------+----------
public | kb_a8f93b2ff6_asserted_statements | table | myuser
public | kb_a8f93b2ff6_literal_statements | table | myuser
public | kb_a8f93b2ff6_namespace_binds | table | myuser
public | kb_a8f93b2ff6_quoted_statements | table | myuser
public | kb_a8f93b2ff6_type_statements | table | myuser
(5 rows)
rdfgraph=> x
Expanded display is on.
rdfgraph=> select * from kb_a8f93b2ff6_asserted_statements limit
1;
-[ RECORD 1 ]-----------------------------------------------
id | 62516955
subject | http://edina.ac.uk/ns/creation/12993043
predicate | http://www.w3.org/ns/prov#wasAssociatedWith
object | http://edina.ac.uk/ns/agent/12887967
context | http://edina.ac.uk/ns/rdfgraph
termcomb | 0
Time: 0.531 ms
rdfgraph=>
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
More caveats!
● Make sure you are not entering literals in a URI field.
● Also make sure your URIs are valid (amazingly some
DOIs fail when urlencoded)
● rdflib-sqlalchemy unoptimized for FTS
– Your (BTree) indices will be YUGE.
– Don't use large records (e.g. 10+ MB cnt:bytes)
● you need to drop index; insert; create
index
– pg_dump is your friend
– Massive maintenance work memory (Postgres)
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
DROP INDEX public.kb_a8f93b2ff6_uri_index;
DROP INDEX public.kb_a8f93b2ff6_type_mkc_key;
DROP INDEX public.kb_a8f93b2ff6_quoted_spoc_key;
DROP INDEX public.kb_a8f93b2ff6_member_index;
DROP INDEX public.kb_a8f93b2ff6_literal_spoc_key;
DROP INDEX public.kb_a8f93b2ff6_klass_index;
DROP INDEX public.kb_a8f93b2ff6_c_index;
DROP INDEX public.kb_a8f93b2ff6_asserted_spoc_key;
DROP INDEX public."kb_a8f93b2ff6_T_termComb_index";
DROP INDEX public."kb_a8f93b2ff6_Q_termComb_index";
DROP INDEX public."kb_a8f93b2ff6_Q_s_index";
DROP INDEX public."kb_a8f93b2ff6_Q_p_index";
DROP INDEX public."kb_a8f93b2ff6_Q_o_index";
DROP INDEX public."kb_a8f93b2ff6_Q_c_index";
DROP INDEX public."kb_a8f93b2ff6_L_termComb_index";
DROP INDEX public."kb_a8f93b2ff6_L_s_index";
DROP INDEX public."kb_a8f93b2ff6_L_p_index";
DROP INDEX public."kb_a8f93b2ff6_L_c_index";
DROP INDEX public."kb_a8f93b2ff6_A_termComb_index";
DROP INDEX public."kb_a8f93b2ff6_A_s_index";
DROP INDEX public."kb_a8f93b2ff6_A_p_index";
DROP INDEX public."kb_a8f93b2ff6_A_o_index";
DROP INDEX public."kb_a8f93b2ff6_A_c_index";
ALTER TABLE ONLY public.kb_a8f93b2ff6_type_statements
DROP CONSTRAINT kb_a8f93b2ff6_type_statements_pkey;
ALTER TABLE ONLY public.kb_a8f93b2ff6_quoted_statements
DROP CONSTRAINT kb_a8f93b2ff6_quoted_statements_pkey;
ALTER TABLE ONLY public.kb_a8f93b2ff6_namespace_binds
DROP CONSTRAINT kb_a8f93b2ff6_namespace_binds_pkey;
ALTER TABLE ONLY public.kb_a8f93b2ff6_literal_statements
DROP CONSTRAINT kb_a8f93b2ff6_literal_statements_pkey;
ALTER TABLE ONLY public.kb_a8f93b2ff6_asserted_statements
DROP CONSTRAINT kb_a8f93b2ff6_asserted_statements_pkey;
Bringing the Semantic Web closer to reality
PostgreSQL as RDF Graph Database
Thank you =)
Twitter: @vyruss
EDINA Labs blog – http://labs.edina.ac.uk
Hack – http://github.com/vyruss/rdflib-sqlalchemy
Dataset – Stay tuned

Mais conteúdo relacionado

Mais procurados

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
P. Taylor Goetz
 
Jena – A Semantic Web Framework for Java
Jena – A Semantic Web Framework for JavaJena – A Semantic Web Framework for Java
Jena – A Semantic Web Framework for Java
Aleksander Pohl
 
Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
Olesya Eidam
 
RDF Data Model
RDF Data ModelRDF Data Model
RDF Data Model
Jose Emilio Labra Gayo
 
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & SpectrumABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
Amazon Web Services
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot
Real time analytics at any scale | PostgreSQL User Group NL | Marco SlotReal time analytics at any scale | PostgreSQL User Group NL | Marco Slot
Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot
Citus Data
 
Introduction to OpenRefine
Introduction to OpenRefineIntroduction to OpenRefine
Introduction to OpenRefine
Heather Myers
 
Distributed computing with spark
Distributed computing with sparkDistributed computing with spark
Distributed computing with spark
Javier Santos Paniego
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
Databricks
 
Distributed Systems In One Lesson
Distributed Systems In One LessonDistributed Systems In One Lesson
Distributed Systems In One Lesson
Tim Berglund
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
Prashanth Babu
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and Where
Eugene Hanikblum
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
Graph and RDF databases
Graph and RDF databasesGraph and RDF databases
Graph and RDF databases
Nassim Bahri
 
Curso completo de Elasticsearch
Curso completo de ElasticsearchCurso completo de Elasticsearch
Curso completo de Elasticsearch
Federico Andrés Ocampo
 
How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
Atlogys Technical Consulting
 

Mais procurados (20)

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Jena – A Semantic Web Framework for Java
Jena – A Semantic Web Framework for JavaJena – A Semantic Web Framework for Java
Jena – A Semantic Web Framework for Java
 
Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
 
RDF Data Model
RDF Data ModelRDF Data Model
RDF Data Model
 
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & SpectrumABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot
Real time analytics at any scale | PostgreSQL User Group NL | Marco SlotReal time analytics at any scale | PostgreSQL User Group NL | Marco Slot
Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot
 
Introduction to OpenRefine
Introduction to OpenRefineIntroduction to OpenRefine
Introduction to OpenRefine
 
Distributed computing with spark
Distributed computing with sparkDistributed computing with spark
Distributed computing with spark
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Distributed Systems In One Lesson
Distributed Systems In One LessonDistributed Systems In One Lesson
Distributed Systems In One Lesson
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and Where
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Graph and RDF databases
Graph and RDF databasesGraph and RDF databases
Graph and RDF databases
 
Curso completo de Elasticsearch
Curso completo de ElasticsearchCurso completo de Elasticsearch
Curso completo de Elasticsearch
 
How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
 

Semelhante a Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
Turi, Inc.
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2
Dimitris Kontokostas
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
javier ramirez
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
Data Con LA
 
A Little SPARQL in your Analytics
A Little SPARQL in your AnalyticsA Little SPARQL in your Analytics
A Little SPARQL in your Analytics
Dr. Neil Brittliff
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
DataWorks Summit
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 

Semelhante a Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database (20)

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Graph databases & data integration v2
Graph databases & data integration v2Graph databases & data integration v2
Graph databases & data integration v2
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
A Little SPARQL in your Analytics
A Little SPARQL in your AnalyticsA Little SPARQL in your Analytics
A Little SPARQL in your Analytics
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 

Mais de Jimmy Angelakos

Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]
Jimmy Angelakos
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]
Jimmy Angelakos
 
Practical Partitioning in Production with Postgres
Practical Partitioning in Production with PostgresPractical Partitioning in Production with Postgres
Practical Partitioning in Production with Postgres
Jimmy Angelakos
 
Changing your huge table's data types in production
Changing your huge table's data types in productionChanging your huge table's data types in production
Changing your huge table's data types in production
Jimmy Angelakos
 
The State of (Full) Text Search in PostgreSQL 12
The State of (Full) Text Search in PostgreSQL 12The State of (Full) Text Search in PostgreSQL 12
The State of (Full) Text Search in PostgreSQL 12
Jimmy Angelakos
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
Jimmy Angelakos
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic Data
Jimmy Angelakos
 
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλονEισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Jimmy Angelakos
 
PostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data ReplicationPostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data Replication
Jimmy Angelakos
 

Mais de Jimmy Angelakos (9)

Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]
 
Practical Partitioning in Production with Postgres
Practical Partitioning in Production with PostgresPractical Partitioning in Production with Postgres
Practical Partitioning in Production with Postgres
 
Changing your huge table's data types in production
Changing your huge table's data types in productionChanging your huge table's data types in production
Changing your huge table's data types in production
 
The State of (Full) Text Search in PostgreSQL 12
The State of (Full) Text Search in PostgreSQL 12The State of (Full) Text Search in PostgreSQL 12
The State of (Full) Text Search in PostgreSQL 12
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic Data
 
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλονEισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
 
PostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data ReplicationPostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data Replication
 

Último

Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery FleetStork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
Vince Scalabrino
 
Photoshop Tutorial for Beginners (2024 Edition)
Photoshop Tutorial for Beginners (2024 Edition)Photoshop Tutorial for Beginners (2024 Edition)
Photoshop Tutorial for Beginners (2024 Edition)
alowpalsadig
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
Flutter vs. React Native: A Detailed Comparison for App Development in 2024
Flutter vs. React Native: A Detailed Comparison for App Development in 2024Flutter vs. React Native: A Detailed Comparison for App Development in 2024
Flutter vs. React Native: A Detailed Comparison for App Development in 2024
dhavalvaghelanectarb
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
safelyiotech
 
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
campbellclarkson
 
Beginner's Guide to Observability@Devoxx PL 2024
Beginner's  Guide to Observability@Devoxx PL 2024Beginner's  Guide to Observability@Devoxx PL 2024
Beginner's Guide to Observability@Devoxx PL 2024
michniczscribd
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
Alberto Brandolini
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
gapen1
 
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in NashikUpturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
Alina Yurenko
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
dakas1
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Peter Caitens
 
Boost Your Savings with These Money Management Apps
Boost Your Savings with These Money Management AppsBoost Your Savings with These Money Management Apps
Boost Your Savings with These Money Management Apps
Jhone kinadey
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
ShulagnaSarkar2
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
Reetu63
 
Building API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructureBuilding API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructure
confluent
 

Último (20)

Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery FleetStork Product Overview: An AI-Powered Autonomous Delivery Fleet
Stork Product Overview: An AI-Powered Autonomous Delivery Fleet
 
bgiolcb
bgiolcbbgiolcb
bgiolcb
 
Photoshop Tutorial for Beginners (2024 Edition)
Photoshop Tutorial for Beginners (2024 Edition)Photoshop Tutorial for Beginners (2024 Edition)
Photoshop Tutorial for Beginners (2024 Edition)
 
Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
 
Flutter vs. React Native: A Detailed Comparison for App Development in 2024
Flutter vs. React Native: A Detailed Comparison for App Development in 2024Flutter vs. React Native: A Detailed Comparison for App Development in 2024
Flutter vs. React Native: A Detailed Comparison for App Development in 2024
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
 
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
 
Beginner's Guide to Observability@Devoxx PL 2024
Beginner's  Guide to Observability@Devoxx PL 2024Beginner's  Guide to Observability@Devoxx PL 2024
Beginner's Guide to Observability@Devoxx PL 2024
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
 
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in NashikUpturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in Nashik
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
 
Boost Your Savings with These Money Management Apps
Boost Your Savings with These Money Management AppsBoost Your Savings with These Money Management Apps
Boost Your Savings with These Money Management Apps
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
 
Building API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructureBuilding API data products on top of your real-time data infrastructure
Building API data products on top of your real-time data infrastructure
 

Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database

  • 1. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database Jimmy Angelakos EDINA, University of Edinburgh FOSDEM 04-05/02/2017
  • 2. or how to export your data to someone who's expecting RDF Jimmy Angelakos EDINA, University of Edinburgh FOSDEM 04-05/02/2017
  • 3. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database Semantic Web? RDF? ● Resource Description Framework – Designed to overcome the limitations of HTML – Make the Web machine readable – Metadata data model – Multigraph (Labelled, Directed) – Triples (Subject – Predicate – Object)
  • 4. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database ● Information addressable via URIs ● <http://example.org/person/Mark_Twain> <http://example.org/relation/author> <http://example.org/books/Huckleberry_Finn> ● <http://edina.ac.uk/ns/item/74445709> <http://purl.org/dc/terms/title> "The power of words: A model of honesty and fairness" . ● Namespaces @prefix dc: <http://purl.org/dc/elements/1.1/> . RDF Triples
  • 5. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database Triplestores ● Offer persistence to our RDF graph ● RDFLib extended by RDFLib-SQLAlchemy ● Use PostgreSQL as storage backend! ● Querying – SPARQL
  • 6. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database { + "DOI": "10.1007/11757344_1", + "URL": "http://dx.doi.org/10.1007/11757344_1", + "type": "book-chapter", + "score": 1.0, + "title": [ + "CrossRef Listing of Deleted DOIs" + ], + "member": "http://id.crossref.org/member/297", + "prefix": "http://id.crossref.org/prefix/10.1007", + "source": "CrossRef", + "created": { + "date-time": "2006-10-19T13:32:01Z", + "timestamp": 1161264721000, + "date-parts": [ + [ + 2006, + 10, + 19 + ] + ] + }, + "indexed": { + "date-time": "2015-12-24T00:59:48Z", + "timestamp": 1450918788746, + "date-parts": [ +
  • 7. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database import psycopg2 from rdflib import plugin, Graph, Literal, URIRef from rdflib.namespace import Namespace from rdflib.store import Store import rdflib_sqlalchemy EDINA = Namespace('http://edina.ac.uk/ns/') PRISM = Namespace('http://prismstandard.org/namespaces/basic/2.1/') rdflib_sqlalchemy.registerplugins() dburi = URIRef('postgresql+psycopg2://myuser:mypass@localhost/rdfgraph' ) ident = EDINA.rdfgraph store = plugin.get('SQLAlchemy', Store)(identifier=ident) gdb = Graph(store, ident) gdb.open(dburi, create=True) gdb.bind('edina', EDINA) gdb.bind('prism', PRISM) item = EDINA['item/' + str(1)] triples = [] triples += (item, RDF.type, EDINA.Item), triples += (item, PRISM.doi, Literal('10.1002/crossmark_policy'), gdb.addN(t + (gdb,) for t in triples) gdb.serialize(format='turtle')
  • 8. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database BIG DATA
  • 9. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database Super size me! ● Loop over original database contents ● Create triples ● Add them to graph efficiently ● Serialise graph efficiently
  • 10. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database But but… ? ● Big data without Java? ● Graphs without Java? – Gremlin? BluePrints? TinkerPop? Jena? Asdasdfaf? XYZZY? ● Why not existing triplestores? “We are the only Graph DB that...” ● Python/Postgres run on desktop hardware ● Simplicity (few LOC and readable) ● Unoptimised potential for improvement→
  • 11. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database conn = psycopg2.connect(database='mydb', user='myuser') cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor) seqcur = conn.cursor() seqcur.execute(""" CREATE SEQUENCE IF NOT EXISTS item; """) conn.commit() cur = conn.cursor('serverside_cur', cursor_factory=psycopg2.extras.DictCursor) cur.itersize = 50000 cur.arraysize = 10000 cur.execute(""" SELECT data FROM mytable """) while True: recs = cur.fetchmany(10000) if not recs: break for r in recs: if 'DOI' in r['data'].keys(): seqcur.execute("SELECT nextval('item')") item = EDINA['item/' + str(seqcur.fetchone()[0])] triples += (item, RDF.type, EDINA.Item), triples += (item, PRISM.doi, Literal(r['data']['DOI'])),
  • 12. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database 1st challenge ● rdflib-sqlalchemy – No ORM, autocommit (!) – Creates SQL statements, executes one at a time – INSERT INTO … VALUES (…); INSERT INTO … VALUES (…); INSERT INTO … VALUES (…); – We want INSERT INTO … VALUES (…),(…), (…) – Creates lots of indexes which must be dropped
  • 13. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database 2nd challenge ● How to restart if interrupted? ● Solved with querying and caching.
  • 14. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database from rdflib.plugins.sparql import prepareQuery orgQ = prepareQuery(""" SELECT ?org ?pub WHERE { ?org a foaf:Organization . ?org rdfs:label ?pub . } """, initNs = { 'foaf': FOAF, 'rdfs': RDFS }) orgCache = {} for o in gdb.query(orgQ): orgCache[o[1].toPython()] = URIRef(o[0].toPython()) if 'publisher' in r['data'].keys(): publisherFound = False if r['data']['publisher'] in orgCache.keys(): publisherFound = True triples += (item, DCTERMS.publisher, orgCache[r['data'] ['publisher']]), if not publisherFound: seqcur.execute("SELECT nextval('org')") org = EDINA['org/' + str(seqcur.fetchone()[0])] orgCache[r['data']['publisher']] = org triples += (org, RDF.type, FOAF.Organization), triples += (org, RDFS.label, Literal(r['data'] ['publisher'])),
  • 15. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database 3rd challenge ● rdflib-sqlalchemy (yup… guessed it) – Selects whole graph into memory ● Server side cursor: res = connection.execution_options( stream_results=True).execute(q) ● Batching: while True: result = res.fetchmany(1000) … yield … – Inexplicably caches everything read in RAM!
  • 16. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database 4th challenge ● Serialise efficiently! Multiprocessing→ – Processes, JoinableQueues ● Turtle: Unsuitable N-Triples→ – UNIX magic python3 rdf2nt.py | split -a4 -d -C4G --additional-suffix=.nt --filter='gzip > $FILE.gz' - exported/rdfgraph_
  • 17. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database Desktop hardware...
  • 18. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database 5th challenge ● Serialisation outran HDD! ● Waits for: – JoinableQueues to empty – sys.stdout.flush()
  • 19. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database rdfgraph=> dt List of relations Schema | Name | Type | Owner --------+-----------------------------------+-------+---------- public | kb_a8f93b2ff6_asserted_statements | table | myuser public | kb_a8f93b2ff6_literal_statements | table | myuser public | kb_a8f93b2ff6_namespace_binds | table | myuser public | kb_a8f93b2ff6_quoted_statements | table | myuser public | kb_a8f93b2ff6_type_statements | table | myuser (5 rows) rdfgraph=> x Expanded display is on. rdfgraph=> select * from kb_a8f93b2ff6_asserted_statements limit 1; -[ RECORD 1 ]----------------------------------------------- id | 62516955 subject | http://edina.ac.uk/ns/creation/12993043 predicate | http://www.w3.org/ns/prov#wasAssociatedWith object | http://edina.ac.uk/ns/agent/12887967 context | http://edina.ac.uk/ns/rdfgraph termcomb | 0 Time: 0.531 ms rdfgraph=>
  • 20. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database More caveats! ● Make sure you are not entering literals in a URI field. ● Also make sure your URIs are valid (amazingly some DOIs fail when urlencoded) ● rdflib-sqlalchemy unoptimized for FTS – Your (BTree) indices will be YUGE. – Don't use large records (e.g. 10+ MB cnt:bytes) ● you need to drop index; insert; create index – pg_dump is your friend – Massive maintenance work memory (Postgres)
  • 21. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database DROP INDEX public.kb_a8f93b2ff6_uri_index; DROP INDEX public.kb_a8f93b2ff6_type_mkc_key; DROP INDEX public.kb_a8f93b2ff6_quoted_spoc_key; DROP INDEX public.kb_a8f93b2ff6_member_index; DROP INDEX public.kb_a8f93b2ff6_literal_spoc_key; DROP INDEX public.kb_a8f93b2ff6_klass_index; DROP INDEX public.kb_a8f93b2ff6_c_index; DROP INDEX public.kb_a8f93b2ff6_asserted_spoc_key; DROP INDEX public."kb_a8f93b2ff6_T_termComb_index"; DROP INDEX public."kb_a8f93b2ff6_Q_termComb_index"; DROP INDEX public."kb_a8f93b2ff6_Q_s_index"; DROP INDEX public."kb_a8f93b2ff6_Q_p_index"; DROP INDEX public."kb_a8f93b2ff6_Q_o_index"; DROP INDEX public."kb_a8f93b2ff6_Q_c_index"; DROP INDEX public."kb_a8f93b2ff6_L_termComb_index"; DROP INDEX public."kb_a8f93b2ff6_L_s_index"; DROP INDEX public."kb_a8f93b2ff6_L_p_index"; DROP INDEX public."kb_a8f93b2ff6_L_c_index"; DROP INDEX public."kb_a8f93b2ff6_A_termComb_index"; DROP INDEX public."kb_a8f93b2ff6_A_s_index"; DROP INDEX public."kb_a8f93b2ff6_A_p_index"; DROP INDEX public."kb_a8f93b2ff6_A_o_index"; DROP INDEX public."kb_a8f93b2ff6_A_c_index"; ALTER TABLE ONLY public.kb_a8f93b2ff6_type_statements DROP CONSTRAINT kb_a8f93b2ff6_type_statements_pkey; ALTER TABLE ONLY public.kb_a8f93b2ff6_quoted_statements DROP CONSTRAINT kb_a8f93b2ff6_quoted_statements_pkey; ALTER TABLE ONLY public.kb_a8f93b2ff6_namespace_binds DROP CONSTRAINT kb_a8f93b2ff6_namespace_binds_pkey; ALTER TABLE ONLY public.kb_a8f93b2ff6_literal_statements DROP CONSTRAINT kb_a8f93b2ff6_literal_statements_pkey; ALTER TABLE ONLY public.kb_a8f93b2ff6_asserted_statements DROP CONSTRAINT kb_a8f93b2ff6_asserted_statements_pkey;
  • 22. Bringing the Semantic Web closer to reality PostgreSQL as RDF Graph Database Thank you =) Twitter: @vyruss EDINA Labs blog – http://labs.edina.ac.uk Hack – http://github.com/vyruss/rdflib-sqlalchemy Dataset – Stay tuned