The document discusses processing linked data at high speeds using the Signal/Collect graph algorithm framework. It provides examples of how Signal/Collect can be used to perform tasks like RDFS subclass inference and PageRank calculation on semantic graphs. It also summarizes performance results showing that TripleRush, an implementation of Signal/Collect, outperforms other graph processing systems on benchmark datasets. Finally, it discusses ongoing work on graph partitioning with TripleRush.
2. Processing Linked Data at
Warp Speed
Abraham Bernstein
CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
4. "Earth's Location in the Universe (JPEG)" by Andrew Z. Colvin - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg#mediaviewer/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg
5.
6.
7. "IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
8. Processing Graphs
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
11. Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
12. • Vertices as stateful
processing units
• Vertices interact through
signals along edges
• Which are collected by a
processing function that
updates the vertex state
Processing Graphs Naturally
13. • Define the graph structure
• Vertices represent
RDFS classes
• Edges from superclasses
to subclasses
• Vertex state initialized
with the class that the
vertex represents
Signal/Collect: An Intuition
for RDFS subclass inference
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
id: animal
id: bird
id: owlid: penguin
Stutz et al, 2010
17. PageRank in Code
class Document(id: Any) extends Vertex(id, 0.15) {
def collect = 0.15 + 0.85 * signals[Double].foldLeft(0.0)(_ + _)
}
Algorithm
class Citation(citer: Any, cited: Any) extends Edge(citer, cited) {
def signal = source.state.asInstanceOf[Double] * weight / source.sumOfOutWeights
}
ExecutionInitialization
object Algorithm {
def executeCitationRank(db: SparqlAccessor) {
val computeGraph = new AsynchronousComputeGraph()
val citations = new SparqlTuples(db, "select ?source ?target where {"
+ "?source <http://lsdis.cs.uga.edu/projects/semdis/opus#cites> ?target}")
citations foreach {
case (citer, cited) =>
computeGraph.addVertex[Document](citer)
computeGraph.addVertex[Document](cited)
computeGraph.addEdge[Citation](citer, cited)
}
computeGraph.execute()
}
}
18. Signal/Collect as a Platform
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
19. Tr i p l e S t o re
E LV I S D Y L A N
J O B S
? X I N S P I R E D ? Y
? Y I N S P I R E D ? Z
I N S P I R E D
DATA QUERY
IN
SP
IR
E
D
22. DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }
✗
Elvis inspired Dylan!
Dylan inspired Jobs
✓
?X inspired ?Y!
?Y inspired ?Z
23. DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }
✗
Elvis inspired Dylan!
Dylan inspired Jobs
✓
?X inspired ?Y!
?Y inspired ?Z
24. P e r f o r m a n c e R e s u l t s
Distributed (8 nodes), LUBM 10240 (~1.36 billion triples)
Single-node, LUBM 160 (~21 million triples)
Fastest L1 L2 L3 L4 L5 L6 L7 Geo.
of 10 runs mean
TripleRush 3,111.2 1,457.9 0.7 3.5 9.5 29.1 1,165.8 62.1
Trinity.RDF 12,648.0 6,018.0 8,735.0 5.0 4.0 9.0 31,214.0 450.0
TriAD 7,631.0 1,663.0 4,290.0 2.1 0.5 69.0 14,895.0 249.0
TriAD-SG 2,146.0 2,025.0 1,647.0 1.3 0.7 1.4 16,863.0 106.0
Fastest L1 L2 L3 L4 L5 L6 L7 Geo
of 10 runs mean
TripleRush 22.6 27.8 0.4 1.0 0.4 0.9 21.2 2.94
Trinity.RDF 281.0 132.0 110.0 5.0 4.0 9.0 630.0 46.0
TriAD 427.0 117.0 210.0 2.0 0.5 19.0 693.0 39.0
TriAD-SG 97.0 140.0 31.0 1.0 0.2 1.8 711.0 14.0
25. O n g o i n g Wo r k :
G r a p h P a r t i t i o n i n g
• https://github.com/uzh/triplerush
26. T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
27. D e c o m p o s i n g F r a u d
P a t t e r n s
• Participants can be
labeled as:
• Splitters
• Aggregators
• Forwarders
$10k$11k
2 x $5k $10k
10k CHF
2k CHF
4k CHF
6k CHF
8k CHF
28. E l i c i t a t i o n P ro c e s s
FilterConnectMatch
29. B i t c o i n Tr a n s a c t i o n s :
R u n t i m e v s . M a t c h i n g C o m p l e x i t y
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
4" 6" 8" 10" 12"
Time%(sec)%
Matching%Complexity%
Time"in"GC"
Total"Processing"Time"
Dataset Size: 50M Transactions, Matching Duration: 1 week
Runtime: 27 min
Throughput: 1.8M / min
Runtime: 35 min
Throughput: 1.4M / min
Runtime: 98 min
Throughput: 0.5M / min
30. T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://flic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://flic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://flic.kr/p/2CG5PU,CC-BY-SA
32. Vertex Coloring in action
Optimized Version of DSA Running on a MacBook Pro with 8 workers
(slow, due to lots of IO for logging, bookkeeping, etc.)
• Scaled to: 10 Million vertices / variables
Verman and Bernstein, 2014
33. Industry Usage
!
“We
are
using
Signal/Collect
to
analyze
millions
of
claims
every
day
to
iden9fy
opportuni9es
for
our
clients
to
save
money
through
be=er
healthcare
or
avoiding
fraud,
waste,
and
abuse.”
US
Healthcare
Analy/cs
Company
34. "IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
35. Processing Graph Streams
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
48. Semantic Flow
Processing is:
• Time-stamped tripes t = <s,p,o> [time]
• Semantic flow F = [t1, t2, ... tn]
• Perform query matching on cached subset of F
• Subject to Stress—incoming data-rate
overwhelms the system’s processing
capability
52. Eviction is:
• Remove cached data
• Note: Eviction may lower recall
• Evict potential to produce future results
• Types of eviction strategies (considered):
• Random
• Time-based (i.e, FIFO)
• Least Recently Used (LRU)
56. Why not LRU?
LRU
C
Input: EPG
Channel , <Play>, Show 5E
B
Input: EPG
Channel , <Play>, Show 4D
InputInput: EPG
Channel , <Play>, Show 6F
Input: EPG
Channel , <Play>, Show 7G
Input: EPG
Channel , <Play>, Show 8H
D
EPG Cache
57. CLOCK is:
• Consider both recency (LRU) and past
results
• Giving each data entry a score
• The score could be incremented and
depreciated
• Named by the buffer management
algorithm CLOCK
72. Limitations
• Static depreciation function: dep()
• Experiment on other datasets
• Real implementation to investigate other
performance metrics
• Only local eviction strategies considered
73. CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Semantic Web Reasoning
KB:Asserted Triples
Entailed KB: !
Asserted & Infered Triples
DLReasoning
InductiveR.
AnalogicalR.
YourR.
Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
CLOCK Eviction is:
!
• The weight is adjusted by dep():"
• Linear: dep(w) = w - 1"
• Exponential: dep(w) = w * ρ (0< ρ < 1)
Recency History