SlideShare uma empresa Scribd logo
1 de 59
Baixar para ler offline
Web-Scale Graph Analytics
with Apache Spark
Joseph K Bradley
Bay Area Apache Spark Meetup
September 7, 2017
2
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. Carnegie Mellon in Machine Learning
3
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
3	3	
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
4
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
4	4	
DATABRICKS RUNTIME 3.2
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com
5
Apache Spark Engine
…
Spark Core
Spark
Streaming
Spark SQL MLlib GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, & R APIs
Standard libraries
6
7
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames: DataFrame-based graphs
• Bisecting K-Means: now part of MLlib
• Stanford CoreNLP integration: UDFs for NLP
spark-packages.org
8
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
9
Graphs
vertex
edge
id City State
“JFK” “New York” NY
Example: airports & flights between them
JFK
IAD
LAX
SFO
SEA
DFW
src dst delay tripID
“JFK” “SEA” 45 1058923
10
Apache Spark’s GraphX library
Overview
•  General-purpose graph
processing library
•  Optimized for fast
distributed computing
•  Library of algorithms:
PageRank, Connected
Components, etc.
10	
Challenges
•  No Java, Python APIs
•  Lower-level RDD-based API
(vs. DataFrames)
•  Cannot use recent Spark
optimizations: Catalyst
query optimizer, Tungsten
memory management
11
The GraphFrames Spark Package
Goal: DataFrame-based graphs on Apache Spark
•  Simplify interactive queries
•  Support motif-finding for structural pattern search
•  Benefit from DataFrame optimizations
Collaboration between Databricks, UC Berkeley & MIT
+ Now with community contributors & committers!
11
12
Graphs
vertex
edge
JFK
IAD
LAX
SFO
SEA
DFW
id City State
“JFK” “New York” NY
src dst delay tripID
“JFK” “SEA” 45 1058923
13
GraphFrames
13	
id City State
“JFK” “New York” NY
“SEA” “Seattle” WA
src dst delay tripID
“JFK” “SEA” 45 1058923
“DFW” “SFO” -7 4100224
vertices DataFrame edges DataFrame
14
Graph analysis with GraphFrames
Simple queries
Motif finding
Graph algorithms
14
15
Simple queries
SQL queries on vertices & edges
15	
Simple graph queries (e.g., vertex degrees)
16
Motif finding
16	
JFK
IAD
LAX
SFO
SEA
DFW
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
17
Motif finding
17	
JFK
IAD
LAX
SFO
SEA
DFW
(b)
(a)Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
18
Motif finding
18	
JFK
IAD
LAX
SFO
SEA
DFW
(b)
(a)
(c)
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
19
Motif finding
19	
JFK
IAD
LAX
SFO
SEA
DFW
(b)
(a)
(c)
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
20
Motif finding
20	
JFK
IAD
LAX
SFO
SEA
DFW
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
(b)
(a)
(c)
Then filter using vertex &
edge data.
paths.filter(“e1.delay > 20”)
21
Graph algorithms
Find important vertices
•  PageRank
21	
Find paths between sets of vertices
•  Breadth-first search (BFS)
•  Shortest paths
Find groups of vertices
(components, communities)
•  Connected components
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
Other
•  Triangle counting
•  SVDPlusPlus
22
Saving & loading graphs
Save & load the DataFrames.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
g.vertices.write.parquet(...)
g.edges.write.parquet(...)
22
23
GraphFrames vs. GraphX
23	
GraphFrames GraphX
Built on DataFrames RDDs
Languages Scala, Java, Python Scala
Use cases Queries & algorithms Algorithms
Vertex IDs Any type (in Catalyst) Long
Vertex/edge
attributes
Any number of
DataFrame columns
Any type (VD, ED)
24
2 types of graph libraries
Graph algorithms Graph queries
Standard & custom algorithms
Optimized for batch processing
Motif finding
Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)
25
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
26
Algorithm implementations
Mostly wrappers for GraphX
•  PageRank
•  Shortest paths
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
•  SVDPlusPlus
26	
Some algorithms implemented
using DataFrames
•  Breadth-first search
•  Connected components
•  Triangle counting
•  Motif finding
27
Moving implementations to DataFrames
DataFrames are optimized for a huge number of small records.
•  columnar storage
•  code generation (“Project Tungsten”)
•  query optimization (“Project Catalyst”)
27
28
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
29
Pros of integer vertex IDs
GraphFrames take arbitrary vertex IDs.
à convenient for users
Algorithms prefer integer vertex IDs.
à optimize in-memory storage
à reduce communication
Our task: Map unique vertex IDs to unique (long) integers.
30
The hashing trick?
• Possible solution: hash vertex ID to long integer
• What is the chance of collision?
•  1 - (k-1)/N * (k-2)/N * …
•  seems unlikely with long range N=264
•  with 1 billion nodes, the chance is ~5.4%
• Problem: collisions change graph topology.
Name Hash
Tim 84088
Joseph -2070372689
Xiangrui 264245405
Felix 67762524
31
Generating unique IDs
Spark has built-in methods to generate unique IDs.
•  RDD: zipWithUniqueId(), zipWithIndex()
•  DataFrame: monotonically_increasing_id()
!
Possible solution: just use these methods
32
How it works
ParCCon	1	
Vertex	 ID	
Tim	 0	
Joseph	 1	
ParCCon	2	
Vertex	 ID	
Xiangrui	 100	+	0	
Felix	 100	+	1	
ParCCon	3	
Vertex	 ID	
…	 200	+	0	
…	 200	+	1
33
… but not always
• DataFrames/RDDs are immutable and reproducible by design.
• However, records do not always have stable orderings.
•  distinct
•  repartition
• cache() does not help.
ParCCon	1	
Vertex	 ID	
Tim	 0	
Joseph	 1	
ParCCon	1	
Vertex	 ID	
Joseph	 0	
Tim	 1	
re-compute
34
Our implementation
We implemented (v0.5.0) an expensive but correct version:
1.  (hash) re-partition + distinct vertex IDs
2.  sort vertex IDs within each partition
3.  generate unique integer IDs
35
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
36
Connected Components
Assign each vertex a component ID such that vertices receive the
same component ID iff they are connected.
Applications:
•  fraud detection
• Spark Summit 2016 keynote from Capital One
•  clustering
•  entity resolution
1	 3	
2
37
Naive implementation (GraphX)
1.  Assign each vertex a unique component ID.
2.  Iterate until convergence:
•  For each vertex v, update:
component ID of v ß Smallest component ID in neighborhood of v
Pro: easy to implement
Con: slow convergence on large-diameter graphs
38
Small-/large-star algorithm
Kiveris et al. "Connected Components in MapReduce and Beyond."
1.  Assign each vertex a unique ID.
2.  Iterate until convergence:
• (small-star) for each vertex,
connect smaller neighbors to smallest neighbor
• (big-star) for each vertex,
connect bigger neighbors to smallest neighbor (or itself)
39
Small-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
40
Big-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
41
Another interpretation
1	 5	 7	 8	 9	
1	 x	
5	 x	
7	 x	
8	 x	
9	
adjacency	matrix
42
Small-star operation
1	 5	 7	 8	 9	
1	 x	 x	 x	
5	
7	
8	 x	
9	
1	 5	 7	 8	 9	
1	 x	
5	 x	
7	 x	
8	 x	
9	
rotate	&	liK
43
Big-star operation
liK	
1	 5	 7	 8	 9	
1	 x	 x	
5	 x	
7	 x	
8	
9	
1	 5	 7	 8	 9	
1	 x	
5	 x	
7	 x	
8	 x	
9
44
Convergence
1	 5	 7	 8	 9	
1	 x	 x	 x	 x	 x	
5	
7	
8	
9
45
Properties of the algorithm
• Small-/big-star operations do not change graph connectivity.
• Extra edges are pruned during iterations.
• Each connected component converges to a star graph.
• Converges in log2(#nodes) iterations
46
Implementation
Iterate:
• filter
• self-join
Challenge: handle these operations at scale.
47
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
48
Skewed joins
Real-world graphs contain big components.
à data skew during connected components iterations
src	 dst	
0	 1	
0	 2	
0	 3	
0	 4	
…	 …	
0	 2,000,000	
1	 3	
2	 5	
src	 Component	id	 neighbors	
0	 0	 2,000,000	
1	 0	 10	
2	 3	 5	
join
49
Skewed joins
4
src	 dst	
0	 1	
0	 2	
0	 3	
0	 4	
…	 …	
0	 2,000,000	
hash	join	
1	 3	
2	 5	
broadcast	join	
(#nbrs	>	1,000,000)	
union	
src	 Component	id	 neighbors	
0	 0	 2,000,000	
1	 0	 10	
2	 3	 5
50
Checkpointing
We checkpoint every 2 iterations to avoid:
•  query plan explosion (exponential growth)
•  optimizer slowdown
•  disk out of shuffle space
•  unexpected node failures
5
51
Experiments
twitter-2010 from WebGraph datasets (small diameter)
•  42 million vertices, 1.5 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 4 minutes
•  GraphFrames: 6 minutes
–  algorithm difference, checkpointing, checking skewness
5
52
Experiments
uk-2007-05 from WebGraph datasets
•  105 million vertices, 3.7 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 25 minutes
–  slow convergence
•  GraphFrames: 4.5 minutes
5
53
Experiments
regular grid 32,000 x 32,000 (large diameter)
•  1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1 hour
5
54
Experiments
regular grid 50,000 x 50,000 (large diameter)
•  2.5 billion nodes, 10 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1.6 hours
5
55
Future improvements
GraphFrames
•  update inefficient code (due to Spark 1.6 compatibility)
•  better graph partitioning
•  letting Spark SQL handle skewed joins and iterations
•  graph compression
Connected Components
•  local iterations
•  node pruning and better stopping criteria
56
https://spark-summit.org/eu-2017/
15% Discount code: Databricks
hRp://dbricks.co/2sK35XT
https://databricks.com/company/careers
Thank you!
Get started with GraphFrames
Docs, downloads & tutorials
http://graphframes.github.io
https://docs.databricks.com
Dev community
Github issues & PRs
Twitter: @jkbatcmu à I’ll share my slides.

Mais conteúdo relacionado

Mais procurados

Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...
Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...
Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...Neo4j
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
GPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphGPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphNeo4j
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceHarald Erb
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 

Mais procurados (20)

Spark
SparkSpark
Spark
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...
Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...
Training Series - Build A Routing Web Application With OpenStreetMap, Neo4j, ...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
GPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphGPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge Graph
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data Science
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 

Semelhante a Web-Scale Graph Analytics with Apache® Spark™

Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
Challenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkChallenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkDatabricks
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large GraphsNishant Gandhi
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Sparknickmbailey
 
1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real World1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real WorldAchim Friedland
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14
Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14
Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14Yuichiro Yasui
 
Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...bhargavi804095
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Big Analytics Without Big Hassles 04/10/14 Webinar
Big Analytics Without Big Hassles 04/10/14 WebinarBig Analytics Without Big Hassles 04/10/14 Webinar
Big Analytics Without Big Hassles 04/10/14 WebinarParadigm4Inc
 

Semelhante a Web-Scale Graph Analytics with Apache® Spark™ (20)

Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Challenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkChallenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache Spark
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Spark
 
1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real World1st UIM-GDB - Connections to the Real World
1st UIM-GDB - Connections to the Real World
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Presentation1
Presentation1Presentation1
Presentation1
 
Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14
Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14
Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14
 
Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Big Analytics Without Big Hassles 04/10/14 Webinar
Big Analytics Without Big Hassles 04/10/14 WebinarBig Analytics Without Big Hassles 04/10/14 Webinar
Big Analytics Without Big Hassles 04/10/14 Webinar
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 

Último (20)

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 

Web-Scale Graph Analytics with Apache® Spark™

  • 1. Web-Scale Graph Analytics with Apache Spark Joseph K Bradley Bay Area Apache Spark Meetup September 7, 2017
  • 2. 2 About me Software engineer at Databricks Apache Spark committer & PMC member Ph.D. Carnegie Mellon in Machine Learning
  • 3. 3 TEAM About Databricks Started Spark project (now Apache Spark) at UC Berkeley in 2009 3 3 PRODUCT Unified Analytics Platform MISSION Making Big Data Simple
  • 4. 4 UNIFIED ANALYTICS PLATFORM Try Apache Spark in Databricks! •  Collaborative cloud environment •  Free version (community edition) 4 4 DATABRICKS RUNTIME 3.2 •  Apache Spark - optimized for the cloud •  Caching and optimization layer - DBIO •  Enterprise security - DBES Try for free today. databricks.com
  • 5. 5 Apache Spark Engine … Spark Core Spark Streaming Spark SQL MLlib GraphX Unified engine across diverse workloads & environments Scale out, fault tolerant Python, Java, Scala, & R APIs Standard libraries
  • 6. 6
  • 7. 7 Spark Packages 340+ packages written for Spark 80+ packages for ML and Graphs E.g.: • GraphFrames: DataFrame-based graphs • Bisecting K-Means: now part of MLlib • Stanford CoreNLP integration: UDFs for NLP spark-packages.org
  • 8. 8 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  • 9. 9 Graphs vertex edge id City State “JFK” “New York” NY Example: airports & flights between them JFK IAD LAX SFO SEA DFW src dst delay tripID “JFK” “SEA” 45 1058923
  • 10. 10 Apache Spark’s GraphX library Overview •  General-purpose graph processing library •  Optimized for fast distributed computing •  Library of algorithms: PageRank, Connected Components, etc. 10 Challenges •  No Java, Python APIs •  Lower-level RDD-based API (vs. DataFrames) •  Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
  • 11. 11 The GraphFrames Spark Package Goal: DataFrame-based graphs on Apache Spark •  Simplify interactive queries •  Support motif-finding for structural pattern search •  Benefit from DataFrame optimizations Collaboration between Databricks, UC Berkeley & MIT + Now with community contributors & committers! 11
  • 12. 12 Graphs vertex edge JFK IAD LAX SFO SEA DFW id City State “JFK” “New York” NY src dst delay tripID “JFK” “SEA” 45 1058923
  • 13. 13 GraphFrames 13 id City State “JFK” “New York” NY “SEA” “Seattle” WA src dst delay tripID “JFK” “SEA” 45 1058923 “DFW” “SFO” -7 4100224 vertices DataFrame edges DataFrame
  • 14. 14 Graph analysis with GraphFrames Simple queries Motif finding Graph algorithms 14
  • 15. 15 Simple queries SQL queries on vertices & edges 15 Simple graph queries (e.g., vertex degrees)
  • 16. 16 Motif finding 16 JFK IAD LAX SFO SEA DFW Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 17. 17 Motif finding 17 JFK IAD LAX SFO SEA DFW (b) (a)Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 18. 18 Motif finding 18 JFK IAD LAX SFO SEA DFW (b) (a) (c) Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 19. 19 Motif finding 19 JFK IAD LAX SFO SEA DFW (b) (a) (c) Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 20. 20 Motif finding 20 JFK IAD LAX SFO SEA DFW Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”) (b) (a) (c) Then filter using vertex & edge data. paths.filter(“e1.delay > 20”)
  • 21. 21 Graph algorithms Find important vertices •  PageRank 21 Find paths between sets of vertices •  Breadth-first search (BFS) •  Shortest paths Find groups of vertices (components, communities) •  Connected components •  Strongly connected components •  Label Propagation Algorithm (LPA) Other •  Triangle counting •  SVDPlusPlus
  • 22. 22 Saving & loading graphs Save & load the DataFrames. vertices = sqlContext.read.parquet(...) edges = sqlContext.read.parquet(...) g = GraphFrame(vertices, edges) g.vertices.write.parquet(...) g.edges.write.parquet(...) 22
  • 23. 23 GraphFrames vs. GraphX 23 GraphFrames GraphX Built on DataFrames RDDs Languages Scala, Java, Python Scala Use cases Queries & algorithms Algorithms Vertex IDs Any type (in Catalyst) Long Vertex/edge attributes Any number of DataFrame columns Any type (VD, ED)
  • 24. 24 2 types of graph libraries Graph algorithms Graph queries Standard & custom algorithms Optimized for batch processing Motif finding Point queries & updates GraphFrames: Both algorithms & queries (but not point updates)
  • 25. 25 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  • 26. 26 Algorithm implementations Mostly wrappers for GraphX •  PageRank •  Shortest paths •  Strongly connected components •  Label Propagation Algorithm (LPA) •  SVDPlusPlus 26 Some algorithms implemented using DataFrames •  Breadth-first search •  Connected components •  Triangle counting •  Motif finding
  • 27. 27 Moving implementations to DataFrames DataFrames are optimized for a huge number of small records. •  columnar storage •  code generation (“Project Tungsten”) •  query optimization (“Project Catalyst”) 27
  • 28. 28 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  • 29. 29 Pros of integer vertex IDs GraphFrames take arbitrary vertex IDs. à convenient for users Algorithms prefer integer vertex IDs. à optimize in-memory storage à reduce communication Our task: Map unique vertex IDs to unique (long) integers.
  • 30. 30 The hashing trick? • Possible solution: hash vertex ID to long integer • What is the chance of collision? •  1 - (k-1)/N * (k-2)/N * … •  seems unlikely with long range N=264 •  with 1 billion nodes, the chance is ~5.4% • Problem: collisions change graph topology. Name Hash Tim 84088 Joseph -2070372689 Xiangrui 264245405 Felix 67762524
  • 31. 31 Generating unique IDs Spark has built-in methods to generate unique IDs. •  RDD: zipWithUniqueId(), zipWithIndex() •  DataFrame: monotonically_increasing_id() ! Possible solution: just use these methods
  • 32. 32 How it works ParCCon 1 Vertex ID Tim 0 Joseph 1 ParCCon 2 Vertex ID Xiangrui 100 + 0 Felix 100 + 1 ParCCon 3 Vertex ID … 200 + 0 … 200 + 1
  • 33. 33 … but not always • DataFrames/RDDs are immutable and reproducible by design. • However, records do not always have stable orderings. •  distinct •  repartition • cache() does not help. ParCCon 1 Vertex ID Tim 0 Joseph 1 ParCCon 1 Vertex ID Joseph 0 Tim 1 re-compute
  • 34. 34 Our implementation We implemented (v0.5.0) an expensive but correct version: 1.  (hash) re-partition + distinct vertex IDs 2.  sort vertex IDs within each partition 3.  generate unique integer IDs
  • 35. 35 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  • 36. 36 Connected Components Assign each vertex a component ID such that vertices receive the same component ID iff they are connected. Applications: •  fraud detection • Spark Summit 2016 keynote from Capital One •  clustering •  entity resolution 1 3 2
  • 37. 37 Naive implementation (GraphX) 1.  Assign each vertex a unique component ID. 2.  Iterate until convergence: •  For each vertex v, update: component ID of v ß Smallest component ID in neighborhood of v Pro: easy to implement Con: slow convergence on large-diameter graphs
  • 38. 38 Small-/large-star algorithm Kiveris et al. "Connected Components in MapReduce and Beyond." 1.  Assign each vertex a unique ID. 2.  Iterate until convergence: • (small-star) for each vertex, connect smaller neighbors to smallest neighbor • (big-star) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
  • 39. 39 Small-star operation Kiveris et al., Connected Components in MapReduce and Beyond.
  • 40. 40 Big-star operation Kiveris et al., Connected Components in MapReduce and Beyond.
  • 41. 41 Another interpretation 1 5 7 8 9 1 x 5 x 7 x 8 x 9 adjacency matrix
  • 42. 42 Small-star operation 1 5 7 8 9 1 x x x 5 7 8 x 9 1 5 7 8 9 1 x 5 x 7 x 8 x 9 rotate & liK
  • 43. 43 Big-star operation liK 1 5 7 8 9 1 x x 5 x 7 x 8 9 1 5 7 8 9 1 x 5 x 7 x 8 x 9
  • 44. 44 Convergence 1 5 7 8 9 1 x x x x x 5 7 8 9
  • 45. 45 Properties of the algorithm • Small-/big-star operations do not change graph connectivity. • Extra edges are pruned during iterations. • Each connected component converges to a star graph. • Converges in log2(#nodes) iterations
  • 47. 47 Outline Intro to GraphFrames Moving implementations to DataFrames •  Vertex indexing •  Scaling Connected Components •  Other challenges: skewed joins and checkpoints Future of GraphFrames
  • 48. 48 Skewed joins Real-world graphs contain big components. à data skew during connected components iterations src dst 0 1 0 2 0 3 0 4 … … 0 2,000,000 1 3 2 5 src Component id neighbors 0 0 2,000,000 1 0 10 2 3 5 join
  • 49. 49 Skewed joins 4 src dst 0 1 0 2 0 3 0 4 … … 0 2,000,000 hash join 1 3 2 5 broadcast join (#nbrs > 1,000,000) union src Component id neighbors 0 0 2,000,000 1 0 10 2 3 5
  • 50. 50 Checkpointing We checkpoint every 2 iterations to avoid: •  query plan explosion (exponential growth) •  optimizer slowdown •  disk out of shuffle space •  unexpected node failures 5
  • 51. 51 Experiments twitter-2010 from WebGraph datasets (small diameter) •  42 million vertices, 1.5 billion edges 16 r3.4xlarge workers on Databricks •  GraphX: 4 minutes •  GraphFrames: 6 minutes –  algorithm difference, checkpointing, checking skewness 5
  • 52. 52 Experiments uk-2007-05 from WebGraph datasets •  105 million vertices, 3.7 billion edges 16 r3.4xlarge workers on Databricks •  GraphX: 25 minutes –  slow convergence •  GraphFrames: 4.5 minutes 5
  • 53. 53 Experiments regular grid 32,000 x 32,000 (large diameter) •  1 billion nodes, 4 billion edges 32 r3.8xlarge workers on Databricks •  GraphX: failed •  GraphFrames: 1 hour 5
  • 54. 54 Experiments regular grid 50,000 x 50,000 (large diameter) •  2.5 billion nodes, 10 billion edges 32 r3.8xlarge workers on Databricks •  GraphX: failed •  GraphFrames: 1.6 hours 5
  • 55. 55 Future improvements GraphFrames •  update inefficient code (due to Spark 1.6 compatibility) •  better graph partitioning •  letting Spark SQL handle skewed joins and iterations •  graph compression Connected Components •  local iterations •  node pruning and better stopping criteria
  • 59. Thank you! Get started with GraphFrames Docs, downloads & tutorials http://graphframes.github.io https://docs.databricks.com Dev community Github issues & PRs Twitter: @jkbatcmu à I’ll share my slides.