Presentation of the Gradoop framework at the Graph Database Meetup in Munich (https://www.meetup.com/inovex-munich/events/231187528/). The talk covers the Extended Property Graph Model (EPGM), its operators, and how they are implemented on top of Apache Flink. It also includes benchmark results on scalability (see www.gradoop.com).
1. Distributed Graph Analytics with Gradoop
inovex Meetup Munich
Let's talk about Graph Databases
July 2016
Martin Junghanns (@kc1s)
University of Leipzig – Database Research Group
3. Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016
Motivation EPGM Operators Implementation Benchmark
Motivation
4. Motivation
Graph = (Vertices, Edges)
"Graphs are everywhere"
5. Motivation
[Figure: follower graph over the users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, and Trent]
Graph = (Users, Followers)
"Graphs are everywhere"
6. Motivation
[Figure: friendship graph over the users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, and Trent]
Graph = (Users, Friendships)
"Graphs are everywhere"
7. Motivation
[Figure: graph of the users Alice, Bob, Dave, Carol, Mallory, and Peggy and the bands AC/DC and Metallica, connected by friendship and like edges]
Graph = (Users ∪ Bands, Friendships ∪ Likes)
"Graphs are heterogeneous"
8. Motivation
[Figure: graph of the users Alice, Bob, Dave, Carol, Mallory, and Peggy and the bands AC/DC and Metallica, connected by friendship and like edges]
Graph = (Users ∪ Bands, Friendships ∪ Likes)
"Graphs can be analyzed"
9. Motivation
[Figure: the graph of users and bands, annotated with the numeric scores 0.2, 0.28, 0.26, 0.33, 0.25, 0.26, 3.6, and 2.82]
Graph = (Users ∪ Bands, Friendships ∪ Likes)
"Graphs can be analyzed"
10. Motivation
Assuming a social network
19. Motivation
Assuming a social network
• Heterogeneous data
1. Determine subgraph
• Apply graph transformation
2. Find communities
• Handle collections of graphs
3. Filter communities
• Aggregation, Selection
4. Find common subgraph
• Apply dedicated algorithm
24. Motivation
"And let's not forget …"
25. Motivation
"… Graphs are large."
26. Motivation
Gradoop: "An open-source framework and research platform for
efficient, distributed and domain-independent
management and analytics of heterogeneous graph data."
27. Motivation
[Figure: system categories positioned by data volume and problem complexity versus ease of use: Graph Processing Systems, Graph Databases, and Graph Dataflow Systems (Gelly)]
28. Motivation
[Figure: Gradoop architecture, listed top to bottom]
Graph Analytical Language (GrALa)
Extended Property Graph Model (EPGM), I/O
Apache Flink Operator Implementation
Apache Flink Distributed Operator Execution
Distributed Graph Store (Apache HBase)
Distributed File System (Apache HDFS)
29. Extended Property Graph Model (EPGM)
33. EPGM
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
[Figure: example with vertices 1 to 5 labeled Person or Band, edges 1 to 5 labeled likes or knows, and two logical graphs labeled 1|Community and 2|Community]
34. EPGM
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
• Properties
[Figure: the example graph with vertices 1 to 5 and edges 1 to 5, now annotated with properties]
Vertices: Person (name: Alice, born: 1984); Band (name: Metallica, founded: 1981); Person (name: Bob); Person (name: Eve); Band (name: AC/DC, founded: 1973)
Edges: likes (since: 2014); likes (since: 2013); likes (since: 2015); knows; likes (since: 2014)
Logical graphs: 1|Community|interest:Heavy Metal; 2|Community|interest:Hard Rock
35. Operators
54. Implementation
Apache Flink Gradoop on Flink
[Figure: the example graph with its two logical graphs 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock, and its representation as three Flink DataSets]
DataSet<EPGMGraphHead>
Id | Label | Properties
1 | Community | {interest:Heavy Metal}
2 | Community | {interest:Hard Rock}
DataSet<EPGMVertex>
Id | Label | Properties | Graphs
1 | Person | {name:Alice, born:1984} | {1}
2 | Band | {name:Metallica, founded:1981} | {1}
3 | Person | {name:Bob} | {1,2}
4 | Band | {name:AC/DC, founded:1973} | {2}
5 | Person | {name:Eve} | {2}
DataSet<EPGMEdge>
Id | Label | Source | Target | Properties | Graphs
1 | likes | 1 | 2 | {since:2014} | {1}
2 | likes | 3 | 2 | {since:2013} | {1}
3 | likes | 3 | 4 | {since:2015} | {2}
4 | knows | 3 | 5 | {} | {2}
5 | likes | 5 | 4 | {since:2014} | {2}
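The vertex and edge tables on this slide can be mirrored in plain Java to see how graph membership works: every vertex and edge row carries the set of logical graph ids it belongs to, so a logical graph is recovered by filtering on that set. This is only a sketch under stated assumptions: `EpgmTables` and its methods are hypothetical names, it uses plain collections instead of Flink DataSets, and it is not Gradoop's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory mirror of the EPGM vertex and edge tables (not Gradoop's classes). */
final class EpgmTables {

  /** Vertex row: id, type label, properties, ids of the logical graphs containing it. */
  static final class Vertex {
    final int id;
    final String label;
    final Map<String, Object> properties;
    final Set<Integer> graphs;

    Vertex(int id, String label, Map<String, Object> properties, Set<Integer> graphs) {
      this.id = id;
      this.label = label;
      this.properties = properties;
      this.graphs = graphs;
    }
  }

  /** Edge row: like a vertex row, plus source and target vertex ids. */
  static final class Edge {
    final int id;
    final String label;
    final int source;
    final int target;
    final Map<String, Object> properties;
    final Set<Integer> graphs;

    Edge(int id, String label, int source, int target,
         Map<String, Object> properties, Set<Integer> graphs) {
      this.id = id;
      this.label = label;
      this.source = source;
      this.target = target;
      this.properties = properties;
      this.graphs = graphs;
    }
  }

  /** The vertex table from the slide. */
  static List<Vertex> exampleVertices() {
    return List.of(
        new Vertex(1, "Person", Map.of("name", "Alice", "born", 1984), Set.of(1)),
        new Vertex(2, "Band", Map.of("name", "Metallica", "founded", 1981), Set.of(1)),
        new Vertex(3, "Person", Map.of("name", "Bob"), Set.of(1, 2)),
        new Vertex(4, "Band", Map.of("name", "AC/DC", "founded", 1973), Set.of(2)),
        new Vertex(5, "Person", Map.of("name", "Eve"), Set.of(2)));
  }

  /** The edge table from the slide. */
  static List<Edge> exampleEdges() {
    return List.of(
        new Edge(1, "likes", 1, 2, Map.of("since", 2014), Set.of(1)),
        new Edge(2, "likes", 3, 2, Map.of("since", 2013), Set.of(1)),
        new Edge(3, "likes", 3, 4, Map.of("since", 2015), Set.of(2)),
        new Edge(4, "knows", 3, 5, Map.of(), Set.of(2)),
        new Edge(5, "likes", 5, 4, Map.of("since", 2014), Set.of(2)));
  }

  /** Vertices of one logical graph: keep rows whose graph set contains the id. */
  static List<Vertex> verticesOf(int graphId, List<Vertex> vertices) {
    List<Vertex> result = new ArrayList<>();
    for (Vertex v : vertices) {
      if (v.graphs.contains(graphId)) {
        result.add(v);
      }
    }
    return result;
  }

  /** Edges of one logical graph, selected the same way. */
  static List<Edge> edgesOf(int graphId, List<Edge> edges) {
    List<Edge> result = new ArrayList<>();
    for (Edge e : edges) {
      if (e.graphs.contains(graphId)) {
        result.add(e);
      }
    }
    return result;
  }
}
```

Calling `EpgmTables.verticesOf(2, EpgmTables.exampleVertices())` yields the rows for Bob, AC/DC, and Eve. Bob is also returned for graph 1, which is how overlapping logical graphs share a vertex without duplicating it.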
55. Implementation
LogicalGraph grouped = graph1.combine(graph2).groupBy()
  .useVertexLabel()
  .useEdgeLabel()
  .addVertexAggregate(new CountAggregator())
  .addEdgeAggregate(new CountAggregator());
[Figure: the example logical graphs 1 and 2 and the grouped result, with super vertices 6 (Person, count: 3) and 7 (Band, count: 2) connected by the super edges likes (count: 4) and knows (count: 1)]
56. Implementation
[Figure: dataflow of the grouping operator on Flink]
Vertices (V):
• Map: extract attributes
  (1,[Person],[]) (2,[Band],[]) (3,[Person],[]) (4,[Band],[]) (5,[Person],[])
• GroupBy(1) + GroupReduce*: assign vertices to groups, compute aggregates, create super vertex tuples, forward updated group members
  (-,6,[Person],[3]) (1,6,[],[]) (-,7,[Band],[2]) (2,7,[],[]) (3,6,[],[]) (4,7,[],[]) (5,6,[],[])
• Filter + Map: extract super vertex tuples, build super vertices
  v6, v7
• Filter + Map: extract group members, reduce memory footprint
  (1,6) (2,7) (3,6) (4,7) (5,6)
Edges (E):
• Map: extract attributes
  (1,1,2,[likes],[]) (2,3,2,[likes],[]) (3,3,4,[likes],[]) (4,3,5,[knows],[]) (5,5,4,[likes],[])
• Join*: replace Source/TargetId with corresponding super vertex id
  (1,6,7,[likes],[]) (2,6,7,[likes],[]) (3,6,7,[likes],[]) (4,6,6,[knows],[]) (5,6,7,[likes],[])
• GroupBy(1,2,3) + GC + GR* + Map: assign edges to groups, compute aggregates, build super edges
  e6, e7
* requires worker communication
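The grouping dataflow on this slide can be traced on a single machine to check the intermediate tuples. The following sketch groups vertices by label, assigns one super vertex id per group, rewrites edge endpoints to those ids, and counts the resulting super edges. It is a simplified illustration, not Gradoop's Flink implementation: `GroupingSketch`, its method names, and the string key used for super edges are all assumptions made for this example, and plain maps replace the distributed GroupBy, GroupReduce, and Join steps.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Single-machine sketch of label-based graph grouping; all names are illustrative. */
final class GroupingSketch {

  /** Minimal edge tuple: (source vertex id, target vertex id, label). */
  static final class Edge {
    final int source;
    final int target;
    final String label;

    Edge(int source, int target, String label) {
      this.source = source;
      this.target = target;
      this.label = label;
    }
  }

  /** Group vertices by label and count the members of each group. */
  static Map<String, Integer> countVerticesByLabel(Map<Integer, String> vertexLabels) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String label : vertexLabels.values()) {
      counts.merge(label, 1, Integer::sum);
    }
    return counts;
  }

  /** Give each label group a super vertex id and map every member vertex to it. */
  static Map<Integer, Integer> assignSuperVertices(Map<Integer, String> vertexLabels,
                                                   int firstSuperId) {
    Map<String, Integer> superIdByLabel = new LinkedHashMap<>();
    Map<Integer, Integer> superIdByVertex = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> vertex : vertexLabels.entrySet()) {
      Integer superId = superIdByLabel.get(vertex.getValue());
      if (superId == null) {
        // First vertex of a new group: allocate the next super vertex id.
        superId = firstSuperId + superIdByLabel.size();
        superIdByLabel.put(vertex.getValue(), superId);
      }
      superIdByVertex.put(vertex.getKey(), superId);
    }
    return superIdByVertex;
  }

  /** Replace edge endpoints by super vertex ids, then group by (source, target, label). */
  static Map<String, Integer> countSuperEdges(List<Edge> edges,
                                              Map<Integer, Integer> superIdByVertex) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Edge e : edges) {
      String key = superIdByVertex.get(e.source) + "-[" + e.label + "]->"
          + superIdByVertex.get(e.target);
      counts.merge(key, 1, Integer::sum);
    }
    return counts;
  }

  /** Vertex ids and labels of the example graph. */
  static Map<Integer, String> exampleVertices() {
    Map<Integer, String> vertices = new LinkedHashMap<>();
    vertices.put(1, "Person");
    vertices.put(2, "Band");
    vertices.put(3, "Person");
    vertices.put(4, "Band");
    vertices.put(5, "Person");
    return vertices;
  }

  /** Edges of the example graph. */
  static List<Edge> exampleEdges() {
    return List.of(
        new Edge(1, 2, "likes"),
        new Edge(3, 2, "likes"),
        new Edge(3, 4, "likes"),
        new Edge(3, 5, "knows"),
        new Edge(5, 4, "likes"));
  }
}
```

With `firstSuperId = 6`, the Person vertices 1, 3, and 5 map to super vertex 6 and the Band vertices 2 and 4 map to super vertex 7, matching the member tuples (1,6), (2,7), (3,6), (4,7), (5,6). The super edge counts are 4 for likes from 6 to 7 and 1 for knows from 6 to 6.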
58. Implementation
EPGM API (I/O)
interface DataSource<G extends EPGMGraphHead,
                     V extends EPGMVertex,
                     E extends EPGMEdge> {
  getLogicalGraph(...) : LogicalGraph<G, V, E>
  getGraphCollection(...) : GraphCollection<G, V, E>
}
interface DataSink<G extends EPGMGraphHead,
                   V extends EPGMVertex,
                   E extends EPGMEdge> {
  write(LogicalGraph<G, V, E>) : void
  write(GraphCollection<G, V, E>) : void
}
class GraphDataSource<...> implements DataSource<...> { }
class HBaseDataSource<...> implements DataSource<...> { }
class JSONDataSource<...> implements DataSource<...> { }
class TLFDataSource<...> implements DataSource<...> { }
class HBaseDataSink<...> implements DataSink<...> { }
class JSONDataSink<...> implements DataSink<...> { }
class TLFDataSink<...> implements DataSink<...> { }
59. Benchmark
60. Benchmark
1. Extract subgraph containing only Persons and knows relations
2. Transform Persons to keep only the necessary information
3. Find communities using Label Propagation
4. Aggregate vertex count for each community
5. Select communities with more than 50K users
6. Combine large communities to a single graph
7. Group graph by Persons' location and gender
8. Aggregate vertex and edge count of grouped graph
http://ldbcouncil.org/
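Steps 3 to 5 of this workflow can be tried on a toy graph. The sketch below runs a simplified, deterministic label propagation (synchronous updates, ties broken by the smallest label), counts the vertices per community, and selects the communities above a size threshold. It is a hedged illustration only: `CommunitySketch` and its methods are hypothetical names, the tie-breaking rule is an assumption chosen to make the result reproducible, and it is not the distributed implementation used in the benchmark, where the threshold is 50K users.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy version of workflow steps 3 to 5; names and tie-breaking rule are illustrative. */
final class CommunitySketch {

  /** Build a symmetric adjacency list from undirected edges given as {a, b} pairs. */
  static Map<Integer, List<Integer>> undirected(int[][] edges) {
    Map<Integer, List<Integer>> neighbors = new TreeMap<>();
    for (int[] e : edges) {
      neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
      neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
    }
    return neighbors;
  }

  /**
   * Step 3: synchronous label propagation. Every vertex starts with its own id as
   * label and adopts the most frequent label among its neighbors in each round;
   * the smallest label wins ties. Assumes every neighbor is also a key of the map.
   */
  static Map<Integer, Integer> labelPropagation(Map<Integer, List<Integer>> neighbors,
                                                int maxIterations) {
    Map<Integer, Integer> labels = new TreeMap<>();
    for (Integer vertex : neighbors.keySet()) {
      labels.put(vertex, vertex);
    }
    for (int i = 0; i < maxIterations; i++) {
      Map<Integer, Integer> next = new TreeMap<>();
      for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
        Map<Integer, Integer> frequency = new TreeMap<>();
        for (Integer neighbor : entry.getValue()) {
          frequency.merge(labels.get(neighbor), 1, Integer::sum);
        }
        int best = labels.get(entry.getKey()); // kept if the vertex has no neighbors
        int bestCount = 0;
        // TreeMap iterates in ascending label order, so ties keep the smallest label.
        for (Map.Entry<Integer, Integer> f : frequency.entrySet()) {
          if (f.getValue() > bestCount) {
            best = f.getKey();
            bestCount = f.getValue();
          }
        }
        next.put(entry.getKey(), best);
      }
      if (next.equals(labels)) {
        break; // converged: no label changed in this round
      }
      labels = next;
    }
    return labels;
  }

  /** Step 4: aggregate the vertex count of each community (label). */
  static Map<Integer, Integer> communitySizes(Map<Integer, Integer> labels) {
    Map<Integer, Integer> sizes = new TreeMap<>();
    for (Integer label : labels.values()) {
      sizes.merge(label, 1, Integer::sum);
    }
    return sizes;
  }

  /** Step 5: select the communities with more than minSize vertices. */
  static List<Integer> selectLargerThan(Map<Integer, Integer> sizes, int minSize) {
    List<Integer> selected = new ArrayList<>();
    for (Map.Entry<Integer, Integer> entry : sizes.entrySet()) {
      if (entry.getValue() > minSize) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }
}
```

For a graph made of a 4-clique on vertices 1 to 4 and a separate triangle on vertices 5 to 7, this variant converges to the two communities labeled 1 (four vertices) and 5 (three vertices), and `selectLargerThan(sizes, 3)` keeps only community 1. Synchronous updates can oscillate on some graphs, for example a single isolated edge, which is why the iteration cap is part of the signature.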
61. Benchmark
https://git.io/vgozj
66. Summary
Release History
• 0.0.1 First Prototype (May 2015)
– Hadoop MapReduce and Giraph for operator implementations
– Too much complexity
– Performance loss through serialization in HDFS/HBase
• 0.0.2 Using Flink as execution layer (June 2015)
– Basic operators
• 0.1 December 2015
– System-side identifiers (UUID)
– Improved property handling
– More operator implementations (e.g., Equality, Bool operators)
– Code refactoring
• 0.2-SNAPSHOT August 2016
– Graph Pattern Matching
– Frequent Subgraph Mining
– Memory optimization (96-bit ID, Dictionary Encoding, …)
– Refactoring
67. Summary
Contributions welcome!
• Code
– I/O Formats (GraphML, DOT, …)
– Operators and Algorithms
– Tuning (Memory consumption, serialization, …)
– API improvements
• Use cases and data
– Business Intelligence
– Fraud Detection
– Pattern Mining
– …
68. Summary
• Extended Property Graph Model
– Schema flexible: Type Labels and Properties
– Logical Graphs / Graph Collections
• Graph and Collection Operators
– Combination to analytical workflows
• Implemented on Apache Flink
– Built-in scalability
– Combine with other libraries
69. www.gradoop.com
[1] Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E.,
"Analyzing Extended Property Graphs with Apache Flink",
Int. Workshop on Network Data Analytics (NDA), SIGMOD 2016.
[2] Petermann, A.; Junghanns, M.,
"Scalable Business Intelligence with Graph Collections",
it – Special Issue on Big Data Analytics, 2016.
[3] Petermann, A.; Junghanns, M.; Müller, M.; Rahm, E.,
"Graph-based Data Integration and Business Intelligence with BIIIG",
Proc. VLDB Conf. (Demo), 2014.