Searching and Querying Knowledge Graphs with Solr/SIREn - A Reference Architecture: Presented by Renaud Delbru & Giovanni Tummarello, SIREn Solutions
4. Agenda
• What is a Knowledge Graph?
• Goals and approaches to searching a Knowledge Graph
• A reference Architecture
• Demo
• Conclusions
5. What is a “Knowledge Graph”?
See also:
• http://www.google.ie/insidesearch/features/search/knowledge.html
• http://semanticweb.com/at-semtechbiz-knowledge-graphs-are-everywhere_b37724
6. An Enterprise Knowledge Graph
[Diagram: sources feeding the enterprise knowledge graph: RDF data graphs, tables, data graphs, content (references, key concepts, relations), external domain data, customer data, domain ontologies]
NEW KNOWLEDGE PRODUCTS
• Smarter search
• Faceted browsing
• Domain/customer-specific targeting and mashups
8. Complex relational structure (schema)
Film
- alias
- directed_by
- starring
- rating
- soundtrack
- ...
Director
- alias
- gender
- nationality
- spouse
- ...
Performance
- actor
- character
- type
- ...
Actor
- alias
- netflix_id
- gender
- nationality
- spouse
- ...
Character
- alias
- created_by
- species
- occupation
- powers
- ...
Rating
- country
- rating_system
- min_age
- max_age
- ...
Soundtrack
- alias
- artist
- concert_tour
- release
- ...
Artist
- alias
- genre
- origin
- album
- label
- ...
Release
- date
- length
- track_list
- label
- ...
Concert Tour
- start_date
- end_date
- gross_proceeds
- concerts
- ...
9. Complex relational structure (instance)
[Figure: instance graph for the film “Forrest Gump” (http://freebase.com/m123123).
Node types: Film.film, Award.Award_winning_work, Person.Person, Person.Actor, Film.Performance, Film.Character; type nodes are labelled “Film”, “Award Winning Work”, “Person”, “Actor” and “character”.
Edge labels: type, types, label, performance, firstName, lastName.
Literals include the labels “Forrest Gump” and “Robert Zemeckis”, the names “Tom” “Hanks”, “Forrest” “Gump” and “Jenny” “---”, the value “US”, the node id m432432, and descriptions such as “An American movie released in ..”, “Life is like a box of chocolate said Forrest’s mom..”, “Oscar winner actor, best known for his interpretations of..” and “Forrest Gump’s friend and eventually wife, she will die of AIDS …”.]
10. What do we mean by searching a KG?
• Typically “Entity Search” on the content of a KG
• .. via full text entity search.
  • “Forrest Gump”
  • “Forrest .. Movie”
  • “Tom Hanks 1994”
  • “box of chocolate tom forrest jenny”
• .. or structured/semistructured search
  • “house”
  • *.performance.actor.nationality = “UK”
  • *.performance.character.nationality = “US”
• … with sorting/ranking/faceting etc
• … cost effectively, at interactive/production speed
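The structured queries above can be read as path expressions over a nested entity. The following is a minimal Python sketch of that idea, not SIREn's query language: the helper names (`values_at`, `matches`) and the sample entity are assumptions made for illustration, and the leading `*.` (any root entity) is left out of the path.

```python
from typing import Any, Iterator


def values_at(node: Any, path: list[str]) -> Iterator[Any]:
    """Yield every value reachable from `node` by following `path`.

    Lists are fanned out, so multi-valued attributes are handled per item.
    """
    if isinstance(node, list):
        for item in node:
            yield from values_at(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from values_at(node[path[0]], path[1:])


def matches(entity: dict, dotted_path: str, expected: str) -> bool:
    """True if any value found at `dotted_path` equals `expected`."""
    return any(v == expected for v in values_at(entity, dotted_path.split(".")))


# Hypothetical nested entity, loosely following the Forrest Gump example.
film = {
    "label": "Forrest Gump",
    "performance": [
        {
            "actor": {"label": "Tom Hanks", "nationality": "US"},
            "character": {"label": "Forrest Gump", "nationality": "US"},
        }
    ],
}

print(matches(film, "performance.actor.nationality", "UK"))
print(matches(film, "performance.character.nationality", "US"))
```

The point of the sketch is that the path keeps the actor's nationality tied to the actor, which is what a flat keyword search cannot express.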
11. Challenges and tradeoffs
• High diversity of data
  • Large schema
  • Arbitrary entity relations
• Relational data
  • Entity search based on neighbouring nodes
  • False positives for multi-value attributes
• Updates
  • Query-time joins → allow quickly changing graphs, but indexes might be complex to maintain
  • Index-time joins → materialization, might require considerable reindexing
[Chart: Response Time against Update Granularity, positioning Index-Time Join and Query-Time Join]
12. A typical basic approach
• The basic approach typically flattens everything into 1 doc per entity
• Pro: relatively easy
• Cons:
  • Loses the information about relationships and nested data
  • Loss of precision when flattening, false positives
  • Can't do structured queries
  • Primitive ranking
[Diagram: one Document with fields Film Title, Description(s), Actors (MVA), Nationalities (MVA), More “related text”]
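The false-positive problem with multi-valued attributes (MVA) can be shown in a few lines. This is an illustrative Python sketch with made-up data: the second actor and the field names `actor_name` and `actor_nationality` are assumptions, not part of the original deck.

```python
# A nested film entity. The second actor is a made-up placeholder.
film = {
    "title": "Goldeneye",
    "actors": [
        {"name": "Pierce Brosnan", "nationality": "Irish"},
        {"name": "Example Actor", "nationality": "British"},
    ],
}


def flatten_entity(entity: dict) -> dict:
    """Flatten into one document with independent multi-valued fields."""
    return {
        "title": entity["title"],
        "actor_name": [a["name"] for a in entity["actors"]],
        "actor_nationality": [a["nationality"] for a in entity["actors"]],
    }


def flat_match(doc: dict, name: str, nationality: str) -> bool:
    """Each field is tested on its own; the name/nationality pairing is lost."""
    return name in doc["actor_name"] and nationality in doc["actor_nationality"]


def nested_match(entity: dict, name: str, nationality: str) -> bool:
    """Both conditions must hold on the same actor."""
    return any(
        a["name"] == name and a["nationality"] == nationality
        for a in entity["actors"]
    )


flat_doc = flatten_entity(film)
# "A British actor named Pierce Brosnan" does not exist in the data,
# yet the flattened document matches the query.
print(flat_match(flat_doc, "Pierce Brosnan", "British"))
print(nested_match(film, "Pierce Brosnan", "British"))
```

The flattened document answers yes because the two conditions are satisfied by different actors, which is the loss of precision listed under the cons.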
13. Relationally/Nested data aware approaches
• Query-Time Join
  • One (flat) document per entity
  • Join result sets at query time to compute relations
• Index-Time Join
  • One (SIREn) or more (Blockjoin) documents per entity
  • Join computed at index time
14. Query Time Join
• One (flat) document per entity
• Join result sets at query time to compute relations
  • Similar to RDB
• Advantages:
  • Easier to index and update
• Limitations:
  • Can lead to an enormous number of joins
  • High memory requirements to be fast
  • Slow response time
[Diagram: separate Documents for Film, Performance (×2) and Actor (×2), joined at query time]
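A query-time join over flat documents can be sketched as below. This is plain Python over in-memory dictionaries, not Solr's join query parser; the identifiers (`f1`, `a1`, ...) and the second film are hypothetical.

```python
# One flat "document" per entity, linked by identifiers (all values illustrative).
films = {"f1": {"title": "Goldeneye"}, "f2": {"title": "Example Film"}}
actors = {
    "a1": {"name": "Pierce Brosnan", "nationality": "Irish"},
    "a2": {"name": "Example Actor", "nationality": "British"},
}
performances = [
    {"film": "f1", "actor": "a1"},
    {"film": "f2", "actor": "a2"},
]


def films_with_actor_nationality(nationality: str) -> list[str]:
    """Resolve Film -> Performance -> Actor with two joins at query time."""
    # Step 1: select the matching actors.
    actor_ids = {aid for aid, a in actors.items() if a["nationality"] == nationality}
    # Step 2 (join 1): performances that reference one of those actors.
    film_ids = {p["film"] for p in performances if p["actor"] in actor_ids}
    # Step 3 (join 2): the films referenced by those performances.
    return sorted(films[fid]["title"] for fid in film_ids)


print(films_with_actor_nationality("Irish"))
```

Updating an actor only touches one small document, which is the advantage noted above. The cost is that every query has to hold intermediate identifier sets in memory and repeat the joins.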
15. Blockjoin
• One (flat) document per entity
• Relations computed at index time
  • Related documents are indexed in the same “block”
  • Faster response time
• Works well for a small and well-defined schema
  • Upfront effort required to design and configure the system
  • Increased memory usage due to creation of artificial documents
[Diagram: a Document Block grouping the Film, Performance and Actor (×2) Documents]
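The block layout can be sketched as follows. The code assumes the convention used by Lucene's block join, where child documents are stored contiguously and the parent closes the block; everything else (the `is_parent` marker, the function names, the sample data) is a simplification for illustration and not the Lucene API.

```python
from typing import Callable


def build_block(parent: dict, children: list[dict]) -> list[dict]:
    """Lay out children first and the parent last, as one contiguous block."""
    block = [dict(child, is_parent=False) for child in children]
    block.append(dict(parent, is_parent=True))
    return block


def parents_with_matching_child(
    index: list[dict], predicate: Callable[[dict], bool]
) -> list[dict]:
    """Single pass over the index: a parent matches if a child in its block did."""
    results = []
    child_matched = False
    for doc in index:
        if doc["is_parent"]:
            if child_matched:
                results.append(doc)
            child_matched = False  # the next block starts after the parent
        elif predicate(doc):
            child_matched = True
    return results


# Illustrative data: two film blocks, each with actor child documents.
index = build_block(
    {"title": "Goldeneye"},
    [{"name": "Pierce Brosnan", "nationality": "Irish"}],
) + build_block(
    {"title": "Example Film"},
    [{"name": "Example Actor", "nationality": "British"}],
)

hits = parents_with_matching_child(index, lambda d: d.get("nationality") == "Irish")
print([h["title"] for h in hits])
```

Because the relation is encoded in document order, the join needs no lookup at query time. The trade-off is that changing one child means reindexing its whole block.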
16. SIREn
• Lucene/Solr/Elasticsearch plugin for indexing and searching JSON
• Rich data model (JSON)
  • Nested objects, nested arrays, datatypes
  • Multi-valued attributes
• Schema-agnostic
  • No need to define structure (nested model)
  • No need to define schema (fields)
• Advantages of SIREn vs Blockjoin
  • Can handle arbitrary and large nested models
  • Better memory usage
[Diagram: a single Document containing a tree of Film, Performance and Actor nodes, labelled with hierarchical node ids 1, 1.1, 1.2, 1.1.1, 1.1.2]
18. Our Reference Architecture
• Currently, it requires a lot of design and development effort to index and search a knowledge graph
  • No generic solution; ad-hoc solutions for particular user apps or graphs.
  • Custom code -> costly to maintain, limited extensibility
• We are building a reference architecture to simplify the task and reduce the effort involved
  • Reduce custom code by use of generic, standardised tools
  • Quickly adapt to change in the data schema or to new data requirements
19. Reference Architecture
• Carefully analyse data to understand which nodes will be updated and which will not
  • Nodes that are likely to be updated
    • Query-time join
  • Nodes that are relatively fixed over time
    • Index-time join
• Split the graph accordingly
  • Each node will be associated with a particular graph pattern to extract
  • Output is a set of subgraphs
• Generate documents
  • Each subgraph can be converted into a tree
  • For each subgraph, generate a JSON representation
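The first step, deciding per node type between query-time and index-time join, could be sketched like this. The update-rate figures, the threshold and the function name are all assumptions for illustration; the deck does not prescribe how the analysis is done.

```python
def choose_join_strategy(
    updates_per_day: dict[str, float], threshold: float = 1.0
) -> dict[str, str]:
    """Map each node type to a join strategy from its observed update rate.

    Frequently updated node types get a query-time join; node types that are
    relatively fixed over time are materialized with an index-time join.
    The threshold is an arbitrary, tunable assumption.
    """
    return {
        node_type: "query-time join" if rate > threshold else "index-time join"
        for node_type, rate in updates_per_day.items()
    }


# Hypothetical update rates per node type.
observed = {"Film": 0.01, "Actor": 0.05, "Rating": 25.0}
strategy = choose_join_strategy(observed)
for node_type, choice in strategy.items():
    print(f"{node_type}: {choice}")
```

The resulting map then drives how the graph is split: node types marked for index-time join are pulled into the extracted subgraph, the others stay as separate documents.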
20. Using the RDF stack for Graph Operations
• RDF as generic and common graph data model
• Simplify integration of various data sources
• Many tools available to convert various data formats into RDF
• Supported by major Graph DBs: Neo4J, Titan, etc
• SPARQL vs SQL
  • It has CONSTRUCT!
21. Reference Architecture
• Currently, it requires a lot of design and development effort to index and search a knowledge graph
  • Gives some reasons, e.g., no generic solutions, ad-hoc solutions for particular user apps or graphs, etc.
  • Custom code -> costly to maintain, limited extensibility
• We are building a reference architecture to simplify the task and reduce the effort involved
  • Reduce custom code by use of generic, standardised tools
  • Quickly adapt to change in data schema or new data requirements
22. Extraction & Mapping
• Major efforts involved
  • Extract graph subset of interest
    • Given a graph pattern, extract all possible matching subgraphs
  • Map it to a simplified schema
    • Graph schema is usually more verbose for flexibility reasons.
    • However, apps do not need such flexibility, and the schema can be simplified.
• Solution: SPARQL
  • SQL for graph-oriented databases
  • Standardised language to query RDF graphs
  • Supported by most major Graph DBs: Neo4j, Titan, Virtuoso, Stardog, Oracle, etc.
23. Extraction & Mapping
• Approach similar to Solr’s data import handler for RDBMS
• Two queries:
  1. Query to retrieve all identifiers
SELECT ?id WHERE {
?id a <http://rdf.freebase.com/ns/film.film> .
}
LIMIT 10000
24. Extraction & Mapping
• Approach similar to Solr’s data import handler for RDBMS
• Two queries:
  1. Query to retrieve all identifiers
  2. Query to construct the subgraph from one identifier
CONSTRUCT {
?id siren:title ?title .
?id siren:actors ?actor .
?actor siren:name ?name .
?actor siren:nationality ?nat .
}
WHERE {
?id rdfs:label ?title .
?id freebase:film.film.starring ?starring .
?starring freebase:film.film.actor ?actor .
?actor rdfs:label ?name .
?actor freebase:people.person.nationality ?nat .
}
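The two-query approach can be driven by a small loop. The sketch below is an assumption about how such a driver might look, not the authors' implementation: it pins `?id` in the CONSTRUCT query with a SPARQL 1.1 `VALUES` clause, and it takes the SPARQL client as two caller-supplied callables (`select_ids`, `construct`), since no particular client library is named in the deck. Prefix declarations are omitted, as on the slides.

```python
from typing import Callable, Iterable, Iterator

ID_QUERY = """\
SELECT ?id WHERE {
  ?id a <http://rdf.freebase.com/ns/film.film> .
}
LIMIT 10000"""

# Same CONSTRUCT query as on the slide, plus a VALUES clause that pins ?id.
CONSTRUCT_TEMPLATE = """\
CONSTRUCT {
  ?id siren:title ?title .
  ?id siren:actors ?actor .
  ?actor siren:name ?name .
  ?actor siren:nationality ?nat .
}
WHERE {
  VALUES ?id { <__ENTITY_URI__> }
  ?id rdfs:label ?title .
  ?id freebase:film.film.starring ?starring .
  ?starring freebase:film.film.actor ?actor .
  ?actor rdfs:label ?name .
  ?actor freebase:people.person.nationality ?nat .
}"""

# Characters that must not appear inside an IRI reference (<...>).
FORBIDDEN = set('<>"{}|^`\\ ')


def construct_query_for(entity_uri: str) -> str:
    """Return the CONSTRUCT query restricted to a single entity."""
    if not entity_uri or FORBIDDEN & set(entity_uri):
        raise ValueError(f"not a safe IRI: {entity_uri!r}")
    return CONSTRUCT_TEMPLATE.replace("__ENTITY_URI__", entity_uri)


def export_subgraphs(
    select_ids: Callable[[str], Iterable[str]],
    construct: Callable[[str], object],
) -> Iterator[tuple[str, object]]:
    """Query 1 lists the identifiers; query 2 runs once per identifier."""
    for entity_uri in select_ids(ID_QUERY):
        yield entity_uri, construct(construct_query_for(entity_uri))
```

A caller would pass functions that send the query text to its SPARQL endpoint and return the identifiers and the constructed subgraph respectively.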
25. Extraction & Mapping
CONSTRUCT {
?id siren:title ?title .
?id siren:actors ?actor .
?actor siren:name ?name .
?actor siren:nationality ?nat .
}
WHERE {
?id rdfs:label ?title .
?id freebase:film.film.starring ?starring .
?starring freebase:film.film.actor ?actor .
?actor rdfs:label ?name .
?actor freebase:people.person.nationality ?nat .
}
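[Diagram, source graph: Goldeneye has a label and a starring node; the starring node links to an actor (name: Pierce Brosnan, nationality: Irish) and a character (name: James Bond, occupation: Spy)]
[Diagram, extracted tree: title: Goldeneye; actor: … (name: Pierce Brosnan, nationality: Irish)]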
33. Representing Entity Graph in JSON
• Each entity subgraph can be mapped to a tree
• How to convert a graph into JSON?
• Solution: JSON-LD
  • Standardised format to export RDF graphs into JSON
  • Also supported by major Graph DBs
{
  "@id": "m/01npcx",
  "title": "Goldeneye",
  "actor": [
    {
      "@id": "m/018p4y",
      "name": "Pierce Brosnan",
      "nationality": "Irish"
    },
    {
      ...
    }
  ]
}
[Diagram: the extracted tree: title “Goldeneye”; actor … with name “Pierce Brosnan” and nationality “Irish”]
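The mapping from an entity subgraph to a JSON tree can be sketched in Python as below. This is not a JSON-LD processor (there is no `@context` handling); it only mimics the nested shape shown on the slide. The triples are written by hand with shortened property names, and the function name `to_tree` is hypothetical. The sketch also assumes that an object is an entity exactly when it appears as the subject of some triple.

```python
import json

# Triples as a CONSTRUCT query might return them (property names shortened).
TRIPLES = [
    ("m/01npcx", "title", "Goldeneye"),
    ("m/01npcx", "actor", "m/018p4y"),
    ("m/018p4y", "name", "Pierce Brosnan"),
    ("m/018p4y", "nationality", "Irish"),
]


def to_tree(triples, root, _seen=frozenset()):
    """Nest every entity reachable from `root` under the property pointing to it."""
    seen = _seen | {root}
    subjects = {s for s, _, _ in triples}
    node = {"@id": root}
    multi = {}
    for s, p, o in triples:
        if s != root:
            continue
        if o in subjects and o not in seen:
            value = to_tree(triples, o, seen)  # nested entity
        elif o in subjects:
            value = {"@id": o}  # already on the path: keep a reference, avoid cycles
        else:
            value = o  # plain literal
        multi.setdefault(p, []).append(value)
    # Single values are emitted as scalars, repeated properties as arrays.
    for p, values in multi.items():
        node[p] = values[0] if len(values) == 1 else values
    return node


tree = to_tree(TRIPLES, "m/01npcx")
print(json.dumps(tree, indent=2))
```

With a single actor the `actor` property comes out as one object rather than an array; a real exporter would typically let the mapping decide which properties are always arrays.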
34. JSON in Solr: SIREn’s new update Handler
• We need to easily ingest arbitrary JSON documents in Solr
• Update Handler that mimics Elasticsearch
  • Index full JSON into a SIREn field
  • Flatten JSON into a set of Solr fields
  • Schema updated automatically (using Solr’s ManagedSchema)
• Reduce up-front design effort
  • No need to design the schema beforehand
  • You can change the SPARQL query upfront; the schema will dynamically adapt
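Flattening a JSON document into a set of fields could look like the sketch below. The dotted field-naming convention and the function name are assumptions for illustration; the deck does not say how the SIREn update handler names the generated Solr fields.

```python
def flatten(node, prefix=""):
    """Flatten nested JSON into {dotted.field.name: [values]}.

    Arrays do not add a path segment, so every leaf becomes a value of a
    multi-valued field named after its path of object keys.
    """
    fields = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            for name, values in flatten(value, child_prefix).items():
                fields.setdefault(name, []).extend(values)
    elif isinstance(node, list):
        for item in node:
            for name, values in flatten(item, prefix).items():
                fields.setdefault(name, []).extend(values)
    else:
        fields.setdefault(prefix, []).append(node)
    return fields


doc = {
    "@id": "m/01npcx",
    "title": "Goldeneye",
    "actor": [
        {"@id": "m/018p4y", "name": "Pierce Brosnan", "nationality": "Irish"},
    ],
}

flat = flatten(doc)
for name in sorted(flat):
    print(name, flat[name])
```

Any field name that shows up here for the first time is what an automatically managed schema would have to add, which is why no schema needs to be designed beforehand.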
37. Handling New Data Schema
• Show example of new schema requirement and how this is quickly integrated in the pipeline with minimum effort
  • This might be better demonstrated with a live demo
• New data requirement:
  • Attach name of characters to an actor
• Only one modification: SPARQL queries
• The change will be dynamically propagated downstream
38. Updating
[Diagram: SPARQL and Customizable Materialization Templates turn graph entities (Movies, People, Companies, …) into entity documents (Label, Type, Desc, Prop 1 … Prop N, with nested and regular prop:value pairs), searched through the Siren Query Language. Annotations: “1 entity to be updated”; “Same entity materialized”; “2 total entity documents updated”.]
40. Possible next steps
[Diagram: the same pipeline (SPARQL, Customizable Materialization Templates, entity documents for Movies, People, Companies, … with nested and regular prop:value pairs, Siren Query Language), with added Query Time Join, Query Translator and Low Level Siren Query Language components]
41. Wrapping up
• A Knowledge Graph does not immediately match your typical search problem
  • Many useful pieces of data attached to “neighbouring nodes”
• An efficient index or query time join strategy is required
• Introduced a reference architecture based on
  • SPARQL templates,
  • JSON-LD
• .. Directly ingested by SIREn 1.4 for Solr
  • High quality free text search
  • Structured/Semistructured queries
• Still “customization” to do, but better than ad hoc approaches.
• See more at http://siren.solutions