A presentation motivating graph stream processing as a paradigm for large-scale complex analytics and gelly-streaming, our new framework based on Apache Flink.
2. @GraphDevroom
Real Graphs are dynamic
Graphs created by events happening in real-time
• liking a post
• buying a book
• listening to a song
• rating a movie
• packet switching in computer networks
• bitcoin transactions
Each event adds an edge to the graph
2
4. @GraphDevroom
In a batch world
We create and analyze a snapshot of the real graph
• all events / interactions / relationships that
happened between t0 and tn
• the Facebook social network on January 30 2016
• user web logs gathered between March 1st 12:00 and 16:00
• retweets and replies for 24h after the announcement of the
death of David Bowie
4
6. @GraphDevroom
In a streaming world
• We receive and consume the events as they are
happening, in real-time
• We analyze the evolving graph and receive results
continuously
6
18. @GraphDevroom
Sounds expensive?
Challenges
• maintain the graph structure
• how to apply state updates efficiently?
• update the result
• re-run the analysis for each event?
• design an incremental algorithm?
• run separate instances on multiple snapshots?
• compute only on most recent events
18
22. @GraphDevroom
Data Streams as ADTs
22
• Direct access to the
execution graph / topology
• Suitable for engineers
• Abstract Data Type
Transformations hide
operator details
• Suitable data analysts
and engineers
similar to: PCollection, DStream
DataStream
23. @GraphDevroom
Nature of a DataStream Job
23
• Tasks are long running in
a pipelined execution.
• State is kept within tasks.
• Transformations are
applied per-record or per-
window.
Execution Graph
unbounded
data sinks
unbounded
data sources
• operator parallelism
• stream partitioning
Execution Properties
24. @GraphDevroom
Working with DataStreams
24
Creation Transformations
DataStream<String> myStream =
-for supported data sources:
env.addSource(new FlinkKafkaConsumer<String>(…));
env.addSource(new RMQSource<String>(…));
env.addSource(new TwitterSource(propsFile));
env.socketTextStream(…);
-for testing:
env.fromCollection(…);
env.fromElements(…);
-for adding any custom source:
env.addSource(MyCustomSource(…));
Properties
myStream.setParallelism(3)
myStream.broadcast();
.rebalance();
.forward();
.keyBy(key);
partitioning
partition stream and operator state by key
myStream.map(…);
myStream.flatMap(…);
myStream.filter(…);
myStream.union(myOtherStream);
-for aggregations on partitioned-by-key streams:
myKeyStream.reduce(…);
myKeyStream.fold(…);
myKeyStream.sum(…);
26. @GraphDevroom
Working with Windows
26
Why windows?
We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently
under different notions of time and deal with late events!
#sec
40 80
SUM #2
0
SUM #1
20 60 100
#sec
40 80
SUM #3
SUM #2
0
SUM #1
20 60 100
120
15 38 65 88
15 38
38 65
65 88
15 38 65 88
110 120
myKeyStream.timeWindow(
Time.of(60, TimeUnit.SECONDS),
Time.of(20, TimeUnit.SECONDS));
1) Sliding windows
2) Tumbling windows
myKeyStream.timeWindow(
Time.of(60, TimeUnit.SECONDS));
window buckets/panes
28. @GraphDevroom
Single-Pass Graph Streaming
with Windows
• Each event represents an edge addition
• Each edge is processed once and thrown away,
i.e. the graph structure is not explicitly maintained
• The state maintained corresponds to a graph
summary, a continuously improving property, an
aggregation
• Recent events can be grouped in a graph window
and processed independently
28
29. @GraphDevroom
What’s the benefit?
• Get results faster
• No need to wait for the job to finish
• Sometimes, early approximations are better than late exact
answers
• Get results continuously
• Process unbounded number of events
• Use less memory
• single-pass algorithms don’t store the graph structure
• run computations on a graph summary
29
30. @GraphDevroom
What can you do in this model?
• transformations, e.g. mapping, filtering vertex /
edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
30
31. @GraphDevroom
What can you do in this model?
• transformations, e.g. mapping, filtering vertex /
edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
31
32. @GraphDevroom
What can you do in this model?
• transformations, e.g. mapping, filtering vertex /
edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
32
33. @GraphDevroom
What can you do in this model?
• transformations, e.g. mapping, filtering vertex /
edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
33
34. @GraphDevroom
What can you do in this model?
• transformations, e.g. mapping, filtering vertex /
edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
34
35. @GraphDevroom
What can you do in this model?
• transformations, e.g. mapping, filtering vertex /
edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
35
36. @GraphDevroom
What can you do in this model?
• transformations, e.g. mapping, filtering vertex /
edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
36
47. @GraphDevroom
What can you do in this model?
• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties
graph summary
algorithm algorithm~R1 R2
47
48. @GraphDevroom
What can you do in this model?
• neighborhood aggregations on windows, e.g.
triangle counting, clustering coefficient (no
iterations… yet!)
48
50. @GraphDevroom
Batch Connected Components
• State: the graph and a component ID per vertex
(initially equal to vertex ID)
• Iterative Computation: For each vertex:
• choose the min of neighbors’ component IDs and own
component ID as new ID
• if component ID changed since last iteration, notify neighbors
50
55. @GraphDevroom
Streaming Connected Components
• State: a disjoint set data structure for the
components
• Computation: For each edge
• if seen for the 1st time, create a component with ID the min of
the vertex IDs
• if in different components, merge them and update the
component ID to the min of the component IDs
• if only one of the endpoints belongs to a component, add the
other one to the same component
55
68. @GraphDevroom
Streaming Bipartite Detection
Similar to connected components, but
• each vertex is also assigned a sign, (+) or (-)
• edge endpoints must have different signs
• when merging components, if flipping all signs doesn’t work =>
the graph is not bipartite
68
75. @GraphDevroom
Introducing Gelly-Stream
75
• Gelly-Stream enriches the DataStream API with two new additional ADTs:
• GraphStream:
• A representation of a data stream of edges.
• Edges can have state (e.g. weights).
• Supports property streams, transformations and aggregations.
• GraphWindow:
• A “time-slice” of a graph stream.
• It enables neighborhood aggregations (and iterations in the future)
76. @GraphDevroom
Graph Property Streams
76
A
B
C D
A B C D A CGraph Stream:
.getEdges()
.getVertices()
.numberOfVertices()
.numberOfEdges()
.getDegrees()
.inDegrees()
.outDegrees()
GraphStream -> DataStream
95. @GraphDevroom
Summary
• Many graph analysis problems can be covered in single-pass
• Processing dynamic graphs requires an incremental graph
processing model
• We introduce Gelly-Stream, a simple yet powerful library for
graph streams