2015/05/26
김상우(lightzeus@gmail.com)
정향민(jhyangmin@gmail.com)
Graph Analytics in Spark
Social networks
Relationships between people, and between people and social networking services
World Wide Web
Web pages connected by hyperlinks
Graph
Vertices + Edges → Graph
Vertex
• Person
• Web page
• …
Edge
• Human relationship
• Hyperlink
• …
Degree: the number of edges connected to a vertex
Edges may be directed
Example: Degree: 4
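To make the definition of degree concrete, here is a minimal sketch in plain Scala collections (not GraphX). It reuses the eight edges shown in the deck's Edge Table diagram; the object and function names are illustrative assumptions, not part of any library.

```scala
object DegreeSketch {
  // Edges as (vertex, vertex) pairs, copied from the deck's Edge Table diagram.
  val edges: Seq[(String, String)] = Seq(
    "A" -> "B", "A" -> "C", "C" -> "D", "B" -> "C",
    "A" -> "E", "A" -> "F", "E" -> "F", "E" -> "D")

  // Degree = number of edges touching a vertex, so count every endpoint occurrence.
  def degrees(es: Seq[(String, String)]): Map[String, Int] =
    es.flatMap { case (s, d) => Seq(s, d) }
      .groupBy(identity)
      .map { case (v, occurrences) => v -> occurrences.size }

  // If the pairs are read as directed (source, destination),
  // in-degree counts only the destination endpoints.
  def inDegrees(es: Seq[(String, String)]): Map[String, Int] =
    es.groupBy(_._2).map { case (v, incoming) => v -> incoming.size }
}
```

With these edges, `DegreeSketch.degrees(DegreeSketch.edges)("A")` evaluates to 4, because A touches the edges to B, C, E, and F.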
PageRank
A way to find the important vertices in a graph
It finds the most important websites when users either follow hyperlinks from site to site or jump to a site at random
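The random-surfer idea can be sketched as a simple fixed-iteration loop in plain Scala. This is an illustration under stated assumptions, not GraphX's implementation: it uses the un-normalized update rank(v) = resetProb + (1 - resetProb) * Σ rank(u) / outDegree(u), runs a fixed number of iterations, and borrows the four directed edges from the deck's GraphX example. Because GraphX's `pageRank(tol, resetProb)` stops based on a tolerance, its reported values can differ from what this loop converges to.

```scala
object PageRankSketch {
  // Directed edges (source -> destination), the same four edges as the deck's GraphX example.
  val edges: Seq[(Long, Long)] = Seq((3L, 7L), (5L, 3L), (2L, 5L), (5L, 7L))

  // resetProb models the chance of jumping to a random page instead of following a link.
  def pageRank(edges: Seq[(Long, Long)],
               resetProb: Double = 0.15,
               iterations: Int = 20): Map[Long, Double] = {
    val vertices: Set[Long] = edges.flatMap { case (s, d) => Seq(s, d) }.toSet
    val outDegree: Map[Long, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    var ranks: Map[Long, Double] = vertices.map(v => v -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Each edge carries rank(source) / outDegree(source) to its destination.
      val contributions: Map[Long, Double] = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).sum }
      // Vertices with no incoming edges receive only the reset probability.
      ranks = vertices
        .map(v => v -> (resetProb + (1 - resetProb) * contributions.getOrElse(v, 0.0)))
        .toMap
    }
    ranks
  }
}
```

Calling `PageRankSketch.pageRank(PageRankSketch.edges)` ranks vertex 7 highest, since it is the only vertex with two incoming links.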
Triangle Counting
Counts the triangles that pass through a selected vertex
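A minimal sketch of per-vertex triangle counting in plain Scala follows. It treats edges as undirected and, for each vertex, counts the pairs of its neighbors that are themselves connected. The names are illustrative assumptions, and GraphX's own `triangleCount` may be implemented differently.

```scala
object TriangleCountSketch {
  def triangleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // Build undirected adjacency sets, ignoring self-loops.
    val pairs = edges
      .filter { case (s, d) => s != d }
      .flatMap { case (s, d) => Seq(s -> d, d -> s) }
    val neighbors: Map[Long, Set[Long]] =
      pairs.groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toSet }

    neighbors.map { case (v, ns) =>
      // For each neighbor n, count the neighbors it shares with v.
      // Every triangle through v is seen twice (once from each other corner).
      val seenTwice = ns.toSeq.map(n => (neighbors(n) & ns).size).sum
      v -> seenTwice / 2
    }
  }
}
```

On the four edges of the deck's GraphX example, `(3,7)`, `(5,3)`, `(2,5)`, `(5,7)`, this yields one triangle each for vertices 3, 5, and 7 and none for vertex 2.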
How graph analysis works
Graph analysis works only with a selected vertex and that vertex's neighbors
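This vertex-plus-neighbors pattern can be sketched as a small neighborhood aggregation in plain Scala. Each vertex receives one message per neighbor and combines the messages with a merge function. The helper name `aggregateNeighbors` and the sample values are hypothetical, chosen only for illustration.

```scala
object NeighborhoodSketch {
  // Hypothetical per-vertex values, for illustration only.
  val values: Map[String, Int] =
    Map("A" -> 1, "B" -> 2, "C" -> 3, "D" -> 4, "E" -> 5, "F" -> 6)

  val edges: Seq[(String, String)] = Seq(
    "A" -> "B", "A" -> "C", "C" -> "D", "B" -> "C",
    "A" -> "E", "A" -> "F", "E" -> "F", "E" -> "D")

  // Every vertex sees only the values of its direct neighbors,
  // merged pairwise with mergeMsg.
  def aggregateNeighbors[M](edges: Seq[(String, String)], values: Map[String, M])
                           (mergeMsg: (M, M) => M): Map[String, M] =
    edges
      .flatMap { case (s, d) => Seq(d -> values(s), s -> values(d)) } // send both ways
      .groupBy(_._1)
      .map { case (v, msgs) => v -> msgs.map(_._2).reduce(mergeMsg) }
}
```

For example, `NeighborhoodSketch.aggregateNeighbors(edges, values)((a: Int, b: Int) => a max b)` gives each vertex the largest value among its neighbors, without any vertex looking beyond its own neighborhood.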
Table vs. Graph
[Figure: a table view (Row, Row, Row, Row → Result) compared with a graph view (Dependency Graph)]
Source: UC Berkeley Lab
Graph storage format
[Figure, built up over four slides: a property graph with vertices A to F is stored as three tables, each an RDD, split into partitions Part. 1 and Part. 2]
• Vertex Table (RDD): A, B, C, D, E, F
• Edge Table (RDD): (A, B), (A, C), (C, D), (B, C), (A, E), (A, F), (E, F), (E, D)
• Routing Table (RDD): for each vertex, the partitions (1, 2, or both) whose edges reference it
Source: UC Berkeley Lab
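How a routing table could be derived from a partitioned edge table is sketched here in plain Scala. The assignment of the first four edges to partition 1 and the last four to partition 2 is an assumption, chosen to be consistent with the two mirror caches shown in the deck (B, C, D, A and D, E, F, A). The names are illustrative and this is not GraphX's internal code.

```scala
object RoutingTableSketch {
  // Edge table split into two partitions (assumed assignment, see lead-in).
  val edgePartitions: Map[Int, Seq[(String, String)]] = Map(
    1 -> Seq("A" -> "B", "A" -> "C", "C" -> "D", "B" -> "C"),
    2 -> Seq("A" -> "E", "A" -> "F", "E" -> "F", "E" -> "D"))

  // For each vertex, collect the IDs of the edge partitions that mention it.
  def routingTable(parts: Map[Int, Seq[(String, String)]]): Map[String, Set[Int]] =
    parts.toSeq
      .flatMap { case (pid, es) => es.flatMap { case (s, d) => Seq(s -> pid, d -> pid) } }
      .groupBy(_._1)
      .map { case (v, entries) => v -> entries.map(_._2).toSet }
}
```

Under this assumed partitioning, A and D are referenced by both partitions, so their values would have to be shipped to both; B and C go only to partition 1, E and F only to partition 2.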
GraphX analysis process
[Figure, built up over three slides: Vertex Table (RDD) with A to F, and Edge Table (RDD) with (A, B), (A, C), (C, D), (B, C), (A, E), (A, F), (E, F), (E, D)]
• Mirror Cache: each edge partition keeps a copy of the vertices its edges reference (one cache holds B, C, D, A; the other holds D, E, F, A)
• Change: changed vertex values are sent from the vertex table to the mirror caches
• Scan and Local Aggregate: each edge partition scans its edges and aggregates the results locally before they return to the vertex table
Source: UC Berkeley Lab
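The mirror, scan, and local-aggregate steps can be sketched in plain Scala as three small functions. This is a simplified illustration, assuming each edge sends its source value to its destination and that messages are summed; the function names are hypothetical and do not correspond to GraphX internals.

```scala
object LocalAggregateSketch {
  type Edge = (String, String)

  // Mirror cache: copy into an edge partition only the vertex values its edges reference.
  def mirrorCache(part: Seq[Edge], vertexValues: Map[String, Double]): Map[String, Double] = {
    val needed: Set[String] = part.flatMap { case (s, d) => Seq(s, d) }.toSet
    vertexValues.filter { case (v, _) => needed(v) }
  }

  // Scan + local aggregate: walk the partition's edges and sum messages per destination.
  def localAggregate(part: Seq[Edge], mirror: Map[String, Double]): Map[String, Double] =
    part
      .map { case (s, d) => d -> mirror(s) }
      .groupBy(_._1)
      .map { case (v, msgs) => v -> msgs.map(_._2).sum }

  // Final merge: combine the partial sums from all partitions into the vertex table.
  def merge(partials: Seq[Map[String, Double]]): Map[String, Double] =
    partials
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (v, sums) => v -> sums.map(_._2).sum }
}
```

Aggregating inside each partition first means that only one partial result per vertex and partition has to travel back to the vertex table, rather than one message per edge.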
GraphX example (1/3)
Vertex IDs must be numeric: GraphX defines VertexId as a 64-bit Long.
[Figure: Vertex Table (RDD) and an edge table with columns source, destination, attribute]
GraphX example (2/3)
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Run in spark-shell, where `sc` is the predefined SparkContext.
// Create the vertex RDD: (vertex ID, (name, role))
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array(
    (3L, ("rxin", "student")),
    (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")),
    (2L, ("istoica", "prof"))))

// Create the edge RDD: Edge(source ID, destination ID, attribute)
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(
    Edge(3L, 7L, "collab"),
    Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"),
    Edge(5L, 7L, "pi")))

// Build the graph from the two RDDs
val graph = Graph(users, relationships)
GraphX example (3/3)
graph.degrees.collect
Array[(org.apache.spark.graphx.VertexId, Int)] = Array((2,1), (3,2), (5,3), (7,2))

graph.edges.collect
Array[org.apache.spark.graphx.Edge[String]] = Array(Edge(3,7,collab), Edge(5,3,advisor), Edge(2,5,colleague), Edge(5,7,pi))

graph.vertices.collect
Array[(org.apache.spark.graphx.VertexId, (String, String))] = Array((2,(istoica,prof)), (3,(rxin,student)), (5,(franklin,prof)), (7,(jgonzal,postdoc)))

graph.pageRank(0.1, 0.15).vertices.collect
Array[(org.apache.spark.graphx.VertexId, Double)] = Array((2,0.15), (3,0.2679375), (5,0.2775), (7,0.3954375))

graph.triangleCount.vertices.collect
Array[(org.apache.spark.graphx.VertexId, Int)] = Array((2,0), (3,1), (5,1), (7,1))
PageRank Benchmark
[Figure: benchmark chart, annotated "Good!"]
Source: UC Berkeley Lab
Graph-Parallel Analytics
• Collaborative Filtering
  ◦ Alternating Least Squares
  ◦ Stochastic Gradient Descent
  ◦ Tensor Factorization
• Graph Analytics
  ◦ PageRank
  ◦ Triangle-Counting
  ◦ Shortest Path
• Community Detection
  ◦ Triangle-Counting
  ◦ K-core Decomposition
  ◦ K-Truss
  ◦ Label Propagation
• Classification
  ◦ Neural Networks
GraphX
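• Analyzes graphs made of vertices and edges
  ◦ Relationship analysis
  ◦ Graph analysis runs faster than on Hadoop or naive Spark
• Supports analyses more complex than Map/Reduce
• Future plans
  ◦ More algorithms
  ◦ Analysis of graphs that change over time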
Q&A

스사모 Tech Talk - GraphX