Analytics and Graph Traversal with Solr - Yonik Seeley, Cloudera
- 2. 2
© Cloudera, Inc. All rights reserved.
My Background
• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford
- 4. 4
Graph Databases
• Graph Databases are all about Nodes and Edges (relationships)
[Diagram: nodes Stanford, RPI, Ann, Cloudera, NJ, and Mike, connected by edges labeled attended, recommended, works_at, and lives_in]
- 5. 5
Properties
• Node Ann: bday: 5/01/1970
• Edge attended (Ann to Stanford): start: 1992, end: 1993, degree: MS, subject: Computer Science
• Node Stanford: type: private, opened: 1891, location: Stanford, CA
- 6. 6
Graph to Document Mapping
Relationships without properties
• Only index the nodes
  • properties are field values
  • nodes without properties can be skipped
• Edges defined at query-time only
  • implicit based on field value matches
Node1    id: node1, relation: node2
Node2    id: node2
- 7. 7
Graph to Document Mapping
Relationships with properties
• Relationships with properties:
  • Model the relationship as a document
  • "Pointers" can be field values on any of the documents
Pointers on the relationship document:
  Relationship1    target1: node1, target2: node2 OR target: [node1, node2]
  Node1            id: node1
  Node2            id: node2
Pointers on the node documents:
  Relationship1    id: relationship1
  Node1            rel: relationship1
  Node2            rel: relationship1
- 8. 8
Document Mapping

type: edu
name: Stanford
opened: 1891
address: Stanford, CA
state: CA

type: attendance
who: Ann
where: Stanford
start: 1992
end: 1993
degree: MS
subject: Computer Science

type: person
name: Ann
bday: 5/01/1970
address: Branchburg, NJ
state: NJ
- 10. 10
Graph Query (filter)
• Breadth-first graph traversal
• Modeled as a normal Query
  • usable as main query, filter query, facet query, input to another query, etc.
  • cached by default in filterCache
q={!graph from=nodeIdField to=edgeIdField}<starting_query>
• Output is a set of documents
  • edges are defined by matches between the fromField and toField
  • each iteration moves to nodes identified by the edge field
- 11. 11
Graph Filter – continued
• Optional arguments
  • maxDepth – maximum number of hops from the root
  • traversalFilter – arbitrary query applied to nodes on each hop
  • returnRoot – (true/false) include the root in the final set
  • returnOnlyLeaf – (true/false) return only docs w/o value in the "to" field
• NOTE: {!graph} isn't (currently) distributed!
  • Edges are only followed within a shard
  • Still useful, and compatible with distributed search
- 12. 12
Twitter Example
q={!graph from=user_id to=following}name:Yonik

user_id: lucene_solr
name: Yonik Seeley
following: [heismark,shalinmangar]

user_id: heismark
name: Mark Miller
following: [lucene_solr,GRRMSpeaking,...]

user_id: shalinmangar
name: Shalin Mangar
following: [romseygeek,_hossman,...]
• Finds everyone that Yonik follows, and everyone they follow in turn, etc.
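The traversal described above can be sketched in plain Python over in-memory dictionaries. This is only an illustration of the breadth-first semantics, not Solr's implementation: the function name `graph_query`, its keyword arguments, and the `is_yonik` helper are hypothetical, and the `following` lists contain only the values visible on the slide (the elided entries are left out).

```python
def graph_query(docs, is_root, from_field, to_field, max_depth=None, return_root=True):
    """Breadth-first traversal over a list of dict "documents" (illustrative only)."""
    roots = {i for i, doc in enumerate(docs) if is_root(doc)}
    visited = set(roots)
    frontier = set(roots)
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        # Collect the edge values held by the documents in the current frontier.
        edge_values = set()
        for i in frontier:
            edge_values.update(docs[i].get(to_field, []))
        # Next hop: unvisited documents whose node id matches one of those values.
        frontier = {
            i for i, doc in enumerate(docs)
            if i not in visited and doc.get(from_field) in edge_values
        }
        visited |= frontier
        depth += 1
    kept = visited if return_root else visited - roots
    return [docs[i] for i in sorted(kept)]


# Only the values shown on the slide; elided list entries are omitted.
docs = [
    {"user_id": "lucene_solr", "name": "Yonik Seeley",
     "following": ["heismark", "shalinmangar"]},
    {"user_id": "heismark", "name": "Mark Miller",
     "following": ["lucene_solr", "GRRMSpeaking"]},
    {"user_id": "shalinmangar", "name": "Shalin Mangar",
     "following": ["romseygeek", "_hossman"]},
]


def is_yonik(doc):
    # Stand-in for the starting query name:Yonik.
    return "Yonik" in doc["name"]


everyone = graph_query(docs, is_yonik, "user_id", "following")
one_hop = graph_query(docs, is_yonik, "user_id", "following",
                      max_depth=1, return_root=False)
print([d["user_id"] for d in everyone])  # ['lucene_solr', 'heismark', 'shalinmangar']
print([d["user_id"] for d in one_hop])   # ['heismark', 'shalinmangar']
```

Users that appear only as `following` values (for example GRRMSpeaking) are not returned because no document with that `user_id` exists in this small sample.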
- 13. 13
{!graph} vs {!join}
q={!join from=following to=user_id}name:Yonik
q={!graph from=user_id to=following maxDepth=1 returnRoot=false}name:Yonik
• pseudo-join filter query {!join} == single-step {!graph}
• Note the from/to switch (a discrepancy caught too late!)
  • graph: travels "to" nodes identified by the edge field
  • join: looks at values in the "from" field and travels to documents with those values in the "to" field.
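The from/to switch is easier to see side by side. The following is a rough Python sketch of both operations over in-memory dictionaries, assuming the semantics stated in the bullets above; `join`, `graph_one_hop`, and `values_of` are hypothetical helper names, and the sample data uses only the values visible on the Twitter slide.

```python
def values_of(doc, field):
    """Return a field's values as a list, whether it is single- or multi-valued."""
    value = doc.get(field, [])
    return value if isinstance(value, list) else [value]


def join(docs, matches, from_field, to_field):
    # {!join}: read from_field on the matching docs, return docs holding
    # any of those values in to_field.
    wanted = {v for d in docs if matches(d) for v in values_of(d, from_field)}
    return [d for d in docs if wanted.intersection(values_of(d, to_field))]


def graph_one_hop(docs, matches, from_field, to_field):
    # {!graph ... maxDepth=1 returnRoot=false}: read to_field (the edge field)
    # on the root docs, return non-root docs whose from_field (the node id) matches.
    roots = [d for d in docs if matches(d)]
    edge_values = {v for d in roots for v in values_of(d, to_field)}
    return [d for d in docs
            if d not in roots and edge_values.intersection(values_of(d, from_field))]


docs = [
    {"user_id": "lucene_solr", "name": "Yonik Seeley",
     "following": ["heismark", "shalinmangar"]},
    {"user_id": "heismark", "name": "Mark Miller",
     "following": ["lucene_solr", "GRRMSpeaking"]},
    {"user_id": "shalinmangar", "name": "Shalin Mangar",
     "following": ["romseygeek", "_hossman"]},
]


def is_yonik(doc):
    # Stand-in for the query name:Yonik.
    return "Yonik" in doc["name"]


via_join = join(docs, is_yonik, from_field="following", to_field="user_id")
via_graph = graph_one_hop(docs, is_yonik, from_field="user_id", to_field="following")
print(via_join == via_graph)  # True
```

The two results coincide here because the root document does not follow itself; in this sketch the join would keep such a self-referencing root, while the single-step graph with returnRoot=false would drop it.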
- 14. 14
Graph Streaming Expressions
- 15. 15
Graph streaming expressions
• Breadth-first graph traversals
• Part of streaming expressions
  • fully distributed
  • cross collections as well as shards
  • parallelizable
- 16. 16
Graph streaming expressions example
• Index some books in one collection
curl http://localhost:8983/solr/books/update -H 'Content-type:text/csv' -d '
id,cat,pubyear_i,title,author,series_s,sequence_i
book1,fantasy,2000,A Storm of Swords,George R.R. Martin,A Song of Ice and Fire,3
book2,fantasy,2005,A Feast for Crows,George R.R. Martin,A Song of Ice and Fire,4
book3,fantasy,2011,A Dance with Dragons,George R.R. Martin,A Song of Ice and Fire,5
book4,sci-fi,1987,Consider Phlebas,Iain M. Banks,The Culture,1
book5,sci-fi,1988,The Player of Games,Iain M. Banks,The Culture,2
book6,sci-fi,1990,Use of Weapons,Iain M. Banks,The Culture,3
book7,fantasy,1984,Shadows Linger,Glen Cook,The Black Company,2
book8,fantasy,1984,The White Rose,Glen Cook,The Black Company,3
book9,fantasy,1989,Shadow Games,Glen Cook,The Black Company,4
book10,sci-fi,2001,Gridlinked,Neal Asher,Ian Cormac,1
book11,sci-fi,2003,The Line of Polity,Neal Asher,Ian Cormac,2
book12,sci-fi,2005,Brass Man,Neal Asher,Ian Cormac,3
'
- 17. 17
Graph streaming expressions example
• Index some book reviews into another collection
curl http://localhost:8983/solr/reviews/update -H 'Content-type:text/csv' -d '
id,book_s,user_s,rating_i,review_t
book1_r1,book1,Yonik,5,awesome book!
book1_r2,book1,Aarav,2,too bloody
book1_r3,book1,Haruka,5,awesome world building
book2_r1,book2,Yonik,5,another great one
book2_r2,book2,Maria,5,wow!
book4_r1,book4,Yonik,2,i am lying... actually liked it
book4_r2,book4,Aarav,5,Loved it
book7_r1,book7,Yonik,4,read back in college but it was good
book10_r1,book10,Maria,5,I want a gridlink!
book11_r1,book11,Maria,1,Blech
book11_r2,book11,Aarav,4,is this the first book?
book12_r1,book12,Yonik,5,Mr. Crane is scary...
'
1. Find books I like
2. Find who else rated those books highly
3. Find other books they rated highly
4. Profit!
- 18. 18
1. Search expression to find my high ratings
URL="http://localhost:8983/solr/reviews/stream"
# Use search expression to find reviews where I gave the book a "5"
curl $URL -d 'expr=search(reviews,
    q="user_s:Yonik AND rating_i:5",
    fl="id,book_s,user_s,rating_i",
    sort="user_s asc")'
{"result-set":{"docs":[
{"rating_i":5,"id":"book2_r1","user_s":"Yonik","book_s":"book2"},
{"rating_i":5,"id":"book1_r1","user_s":"Yonik","book_s":"book1"},
{"rating_i":5,"id":"book12_r1","user_s":"Yonik","book_s":"book12"},
{"EOF":true,"RESPONSE_TIME":4}]}}
- 19. 19
2. gatherNodes expression to find users
curl $URL -d 'expr=gatherNodes(reviews,
    search(reviews, q="user_s:Yonik AND rating_i:5",
           fl="book_s,user_s,rating_i", sort="user_s asc"),
    walk="book_s->book_s",
    gather="user_s",
    fq="rating_i:[4 TO *] -user_s:Yonik",
    trackTraversal=true )'
{"result-set":{"docs":[
{"node":"Haruka","collection":"reviews","field":"user_s","ancestors":["book1"],"level":1},
{"node":"Maria","collection":"reviews","field":"user_s","ancestors":["book2"],"level":1},
{"EOF":true,"RESPONSE_TIME":22}]}}
[Callout on the "node" entries: "gather" values]
- 20. 20
3. gatherNodes to find high ratings by those users
curl $URL -d 'expr=gatherNodes(reviews,
    gatherNodes(reviews,
        search(reviews, q="user_s:Yonik AND rating_i:5",
               fl="id,book_s,user_s,rating_i", sort="user_s asc"),
        walk="book_s->book_s",
        gather="user_s", fq="rating_i:[4 TO *] -user_s:Yonik"),
    walk="node->user_s",
    gather="book_s",
    fq="rating_i:[4 TO *]",
    avg(rating_i),
    trackTraversal=true)'
{"result-set":{"docs":[
{"node":"book10","avg(rating_i)":5.0,"field":"book_s","level":2,"collection":"reviews","ancestors":["Maria"]},
{"EOF":true,"RESPONSE_TIME":65}]}}
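To make the three steps concrete, here is a small Python sketch that replays them over the review rows from the earlier slide (ids, books, users, and ratings only). It is a plain in-memory approximation, not the streaming-expression engine. One assumption is labeled in the code: books already visited at level 0 are dropped in step 3, which is the sketch's guess at why the output above lists only book10.

```python
from collections import defaultdict

# (id, book_s, user_s, rating_i) rows copied from the reviews CSV.
REVIEWS = [
    ("book1_r1", "book1", "Yonik", 5),
    ("book1_r2", "book1", "Aarav", 2),
    ("book1_r3", "book1", "Haruka", 5),
    ("book2_r1", "book2", "Yonik", 5),
    ("book2_r2", "book2", "Maria", 5),
    ("book4_r1", "book4", "Yonik", 2),
    ("book4_r2", "book4", "Aarav", 5),
    ("book7_r1", "book7", "Yonik", 4),
    ("book10_r1", "book10", "Maria", 5),
    ("book11_r1", "book11", "Maria", 1),
    ("book11_r2", "book11", "Aarav", 4),
    ("book12_r1", "book12", "Yonik", 5),
]

# Step 1: books I rated 5 (the inner search expression).
my_books = {book for _, book, user, rating in REVIEWS
            if user == "Yonik" and rating == 5}

# Step 2: walk book_s->book_s, gather user_s, keep ratings >= 4, exclude me.
like_minded = {user for _, book, user, rating in REVIEWS
               if book in my_books and rating >= 4 and user != "Yonik"}

# Step 3: walk node->user_s, gather book_s, keep ratings >= 4.
# Assumption: books already visited in step 1 are skipped.
ratings_by_book = defaultdict(list)
for _, book, user, rating in REVIEWS:
    if user in like_minded and rating >= 4 and book not in my_books:
        ratings_by_book[book].append(rating)

recommendations = {book: sum(r) / len(r) for book, r in ratings_by_book.items()}

print(sorted(my_books))     # ['book1', 'book12', 'book2']
print(sorted(like_minded))  # ['Haruka', 'Maria']
print(recommendations)      # {'book10': 5.0}
```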
- 21. 21
Retrieving complete traversal
curl $URL -d 'expr=gatherNodes(reviews, [...], scatter="branches,leaves")'
{"result-set":{"docs":[
{"node":"book12","collection":"reviews","field":"book_s","level":0},
{"node":"book1","collection":"reviews","field":"book_s","level":0},
{"node":"book2","collection":"reviews","field":"book_s","level":0},
{"node":"Haruka","collection":"reviews","field":"user_s","level":1},
{"node":"Maria","collection":"reviews","field":"user_s","level":1},
{"node":"book10","avg(rating_i)":5.0,"field":"book_s","level":2,"collection":"reviews","ancestors":["Maria"]},
{"EOF":true,"RESPONSE_TIME":111}]}}
- 22. 22
{!graph} single collection/shard version
curl "http://localhost:8983/solr/reviews/query" -d '
q={!graph from=user_s to=user_s returnRoot=false traversalFilter=$f1 v=$g1}&
g1={!graph from=book_s to=book_s returnRoot=false traversalFilter=$f1 v=$q1}&
q1=user_s:Yonik AND rating_i:5&
f1=rating_i:[4 TO *]
'
- 23. 23
More graph expressions
• shortestPath
  • Finds the shortest path between "from" and "to"
• scoreNodes: tf-idf inspired scoring
  • wraps a gatherNodes expression that finds the co-occurrence count
  • tf factor – the co-occurrence count
  • idf factor – boosts nodes that are rarer overall
- 24. 24
Network analysis and visualization
curl http://localhost:8983/solr/reviews/graph -d 'expr=gatherNodes(reviews, [...], scatter="branches,leaves")'
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph id="G" edgedefault="directed">
    <node id="book12">
        <data key="field">book_s</data>
        <data key="level">0</data>
    </node>
    <node id="book1">
        <data key="field">book_s</data>
[...]
- 26. 26
Analyzing Book Reviews w/ JSON Facet API
- 27. 27
JSON Facet API w/ Book Reviews
• Same books & reviews data set as before, except:
  • Index books and reviews into the same collection
  • Index a book and its reviews into the same shard
    • eliminates cross-shard "edges" between books & reviews
- 28. 28
compositeId router
[Diagram: a 32-bit hash ring divided among shard1, shard2, and shard3; the three ids below fall within a 16 bit range of the ring]
  id:book1            full 32 bit hash of "book1"
  id:book1!review1    top 16 bits of "book1", bottom 16 "review1"
  id:book1!review2    top 16 bits of "book1", bottom 16 "review2"
• Easy collocation of documents in SolrCloud
• Works right out of the box (it's default!)
• Restrict queries to shards for performance:
&q=reviewer:yonik AND book_id:book1 &_route_=book1!
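The bit layout on this slide can be sketched in a few lines of Python. This is an illustration only: `zlib.crc32` is used as a stand-in 32-bit hash (Solr uses its own hash function, so real hash values will differ), and `hash32`, `composite_hash`, and `route_range` are hypothetical names.

```python
import zlib


def hash32(text):
    """Stand-in 32-bit hash; an assumption for illustration, not Solr's hash."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Top 16 bits from the prefix before '!', bottom 16 bits from the rest."""
    if "!" not in doc_id:
        return hash32(doc_id)  # plain id: full 32 bit hash
    prefix, rest = doc_id.split("!", 1)
    return (hash32(prefix) & 0xFFFF0000) | (hash32(rest) & 0x0000FFFF)


def route_range(route):
    """Inclusive hash range covered by a route such as 'book1!'."""
    low = hash32(route.rstrip("!")) & 0xFFFF0000
    return low, low | 0x0000FFFF


ids = ["book1", "book1!review1", "book1!review2"]
low, high = route_range("book1!")
for doc_id in ids:
    value = composite_hash(doc_id)
    print(f"{doc_id:15} {value:#010x} in range: {low <= value <= high}")
```

Because all three ids share the top 16 bits, they land in the same 16 bit slice of the ring, which is what lets a request carrying `_route_=book1!` be sent only to the shard covering that slice.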
- 29. 29
Refresher: Facet commands and Domains
• Domain: A set of documents
• Facet command: create sub-domains / "facet buckets"
[Diagram: Domains and Facet Commands A, B, and C; each facet command splits a domain into sub-domains]
- 30. 30
Unique authors, books by genre
curl http://localhost:8983/solr/books/query -d '
q=cat:*&
json.facet=
{
  num_authors : "hll(author)",
  genres : {
    type: terms,
    field: cat
  }
}
'
[…]
"facets":{
  "count":13,
  "num_authors":5,
  "genres":{
    "buckets":[{
        "val":"fantasy",
        "count":7},
      {
        "val":"sci-fi",
        "count":6}]}}}
Callouts:
• q=cat:* : root domain defined by docs matching the query
• hll(author) : hyper-log-log distributed cardinality function
• genres : one bucket per unique value in the "cat" field
- 32. 32
Number of book reviews per genre
json.facet={
  genres : {
    type: terms,
    field: cat,
    facet: {
      reviews : {
        type: query,
        domain:{join:{from:id, to:book_s}}
      }
    }
  }
}
"facets":{
  "count":13,
  "genres":{
    "buckets":[{
        "val":"fantasy",
        "count":7,
        "reviews":{
          "count":7}},
      {
        "val":"sci-fi",
        "count":6,
        "reviews":{
          "count":5}}]}}}
Callouts:
• reviews sub-facet: Calculated per-bucket
• domain join: domain switch! happens before faceting
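The per-bucket domain switch can be mimicked in Python: for each genre bucket, take the book ids in that bucket, join from `id` to `book_s` to reach the reviews, then compute statistics on the joined set. This is a sketch over the 12 books and 12 reviews shown in the earlier CSV slides; `genre_review_stats` is a hypothetical name, and the average rating is included only as one example of a per-bucket statistic. The facet output on the slides reports 13 books, so the index behind it evidently differed slightly from those CSV rows and the numbers here need not match it.

```python
# Book id -> genre, from the books CSV.
BOOKS = [
    {"id": "book1", "cat": "fantasy"}, {"id": "book2", "cat": "fantasy"},
    {"id": "book3", "cat": "fantasy"}, {"id": "book4", "cat": "sci-fi"},
    {"id": "book5", "cat": "sci-fi"}, {"id": "book6", "cat": "sci-fi"},
    {"id": "book7", "cat": "fantasy"}, {"id": "book8", "cat": "fantasy"},
    {"id": "book9", "cat": "fantasy"}, {"id": "book10", "cat": "sci-fi"},
    {"id": "book11", "cat": "sci-fi"}, {"id": "book12", "cat": "sci-fi"},
]

# (book_s, rating_i) pairs, from the reviews CSV.
REVIEWS = [
    ("book1", 5), ("book1", 2), ("book1", 5), ("book2", 5), ("book2", 5),
    ("book4", 2), ("book4", 5), ("book7", 4), ("book10", 5),
    ("book11", 1), ("book11", 4), ("book12", 5),
]


def genre_review_stats(books, reviews):
    buckets = {}
    for genre in sorted({b["cat"] for b in books}):
        # Terms facet: the bucket's domain is the set of books in this genre.
        bucket_ids = {b["id"] for b in books if b["cat"] == genre}
        # Domain switch: join from:id to:book_s, giving the reviews of those books.
        ratings = [rating for book, rating in reviews if book in bucket_ids]
        buckets[genre] = {
            "count": len(bucket_ids),
            "reviews": {
                "count": len(ratings),
                "rating": sum(ratings) / len(ratings) if ratings else None,
            },
        }
    return buckets


stats = genre_review_stats(BOOKS, REVIEWS)
for genre, bucket in stats.items():
    print(genre, bucket)
```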
- 33. 33
Average rating for each genre
json.facet={
  genres : {
    type: terms,
    field: cat,
    facet: {
      reviews : {
        type: query,
        domain:{join:{from:id, to:book_s}},
        facet: {
          rating:"avg(rating_i)"
        }
      }}}}
"facets":{
  "count":13,
  "genres":{
    "buckets":[{
        "val":"fantasy",
        "count":7,
        "reviews":{
          "count":7,
          "rating":3.857142}},
      {
        "val":"sci-fi",
        "count":6,
        "reviews":{
          "count":5,
          "rating":4.2}}]}}}
- 34. 34
Who gives the highest ratings per genre?
json.facet={
  genres : {
    type: terms,
    field: cat,
    facet: {
      reviews : {
        type: terms,
        field: user_s,
        sort: "rating desc",
        limit:3,
        domain:{join:{from:id, to:book_s}},
        facet: {
          rating:"avg(rating_i)"
        }
[...]
"facets":{
  "count":13,
  "genres":{
    "buckets":[{
        "val":"fantasy",
        "count":7,
        "reviews":{
          "buckets":[
            {
              "val":"Haruka",
              "count":1,
              "rating":5.0},
            {
              "val":"Yonik",
              "count":3,
              "rating":4.66666667},
            {
              "val":"Maria",
              "count":2,
              "rating":3.0}]}},
      {
        "val":"sci-fi",
        "count":6,
        "reviews":{
          "buckets":[
- 35. 35
Histogram: average rating trends over time
json.facet={
  genres : {
    type: terms,
    field: cat,
    facet: {
      reviews : {
        domain:{join:{from:id, to:book_s}},
        type: range,
        field: review_date_i,
        start: 1980,
        end: 2020,
        gap: 10,
        facet: {
          rating:"avg(rating_i)"
        }
      }}}}
"facets":{
  "count":13,
  "genres":{
    "buckets":[{
        "val":"fantasy",
        "count":7,
        "reviews":{
          "buckets":[
            {
              "val":1980,
              "count":1323,
              "rating":3.17},
            {
              "val":1990,
              "count":1452,
              "rating":3.26},
            {
              "val":2000,
              "count":1559,
              "rating":3.48},
            {
              "val":2010,
              "count":1793,
              "rating":3.54}]}},
      {
        "val":"sci-fi",
- 36. 36
Streaming Expressions vs JSON Facets
- 37. 37
JSON Facet API
• More focused on web-scale interactive responses
• Tighter integration
• Just another search component
• Utilizes existing distributed search framework
• Single request-response top-N, grouping, highlighting, faceting, etc.
• Multiple facets in single request
• Block join / nested document support
• Document centric
- 38. 38
Streaming Expressions
• More general purpose, larger scope
• Wrap streams within streams to do pretty much anything
• Not tied to documents (analytics across joins w/ external DBs)
• Update streams, machine learning streams, etc.
• Exact results in distributed mode (e.g. cardinality)
• Distributed joins, graph
• Synergy: Increasingly works with JSON Facet API to push down work to leaves