SlideShare uma empresa Scribd logo
1 de 34
Baixar para ler offline
1 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
MatFast
IN-MEMORY	DISTRIBUTED	MATRIX	COMPUTATION	
PROCESSING	AND	OPTIMIZATION	BASED	ON	SPARK	SQL
Mingjie	Tang
Yanbo Liang	
Oct,	2017
2 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
About	Authors
à Yongyang Yu
• Machine	learning,	Database	system,	Computation	Algebra	
• PhD	students	at	Purdue	University	
à Mingjie	Tang
• Spark	SQL,	Spark	ML,	Database,	Machine	Learning
• Software	Engineer	at	Hortonworks
à Yanbo Liang
• Apache	Spark	committer,	Spark	MLlib
• Staff	Software	Engineer	at	Hortonworks
à …	All	Other	Contributors
3 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Agenda
Motivation
Overview	of	MatFast
Implementation	 and	optimization
Use	cases
4 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Motivation
à Many	applications	rely	on	efficient	processing	of	queries	over	big	
matrix	data:
– Recommender	 systems
– Social	network	analysis
– Predict	traffic	data	flow
– Anti-fraud	and	spam	detection		
– Bioinformatics
5 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Motivation
à Recommender	 Systems
Netflix’s	user-movie	rating	table	(sample)
Problem:	Predict	 the	missing	entries	 in	the	table
Input:	User-movie	rating	table	with	missing	entries
Output:	Complete	user-movie	rating	table	with	predictions
For	Netflix,	#users	=	80	million,	#movies	=	2	million
Batma
n	
begins
Alice	in	
Wonde
rland
Doctor	
Strange
Trolls Ironma
n
Alice 4 ? 3 5 4
Bob ? 5 4 ? ?
Cindy 3 ? ? ? 2
movies
users
6 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Motivation
à Gaussian	Non-negative	Matrix	Factorization	(GNMF)
– Assumption:				𝑉"×$ ≈ 𝑊"×'	×	𝐻'×$
V ≈ W H×users
(80M)
movies	 (2M)
users
(80M)
modeling	 dims	
(e.g.,	topics,	age,	language,	etc.)
modeling
dims	(500)
movies	 (2M)
𝐻 = 𝐻 ∗ 𝑊,
	×	𝑉 	/	(𝑊,
	×	𝑊	×	𝐻)
𝑊 = 𝑊 ∗ 𝑉	×	𝐻,
	/	(𝑊	×	𝐻	×	𝐻,
)
fori =	1	to nIter do
end
Initialize	W and	H
Matrix	operation	for	GNMF	 Algorithm
huge	volume
dense/sparse	
storage
iterative	 execution
7 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Motivation
à User-Movie	 Rating	Prediction	with	GNMF
val p = 200 // number of topics
val V = loadMatrix(“in/V”) // read matrix
val max_niter = 10 // max number of iteration
W = RandomMatrix(V.nrows, p)
H = RandomMatrix(p, V.ncols)
for (i <- 0 until max_niter) {
H = H * (W.t %*% V) / (W.t %*% W %*% H)
W = W * (V %*% H.t) / (W %*% H %*% H.t)
}
(H %*% W).saveToHive()
8 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
State	of	the	art	solution	in	Spark	ecosystem
à Alternative	Least	Square	approach	in	Spark	(ALS)
– Experiment	on	Spotify	data
– 50+	million	users	x	30+	million	songs	
– 50	billion	ratings	For	rank	10	with	10	iterations	
– ~1	hour	running	time
à How	to	extend	ALS	to	other	matrix	computation?
– SVD
– PCA
– QR
9 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Observation
H
W
transpose
V
mat-mat mat-mat
mat-mat mat-elem
mat-elem
loop
𝐻 = 𝐻 ∗ 𝑊,
	×	𝑉 	/	(𝑊,
	×	𝑊	×	𝐻)
𝑊 = (													)										(																														)
for i =	1	to nIter do
end
Initialize	 W and	H
Matrix	computation	evaluation	pipeline
𝑊 ∗ 𝑉	× 𝐻, / 𝑊	×	𝐻	 ×
intermediate	result	cost:
8 X	1016( )
10 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
𝑊	×	𝐻	
Observation
H
W
transpose
V
mat-mat mat-mat
mat-mat mat-elem
mat-elem
intermediate	result	cost:
5	X	1011
loop
materialization	of	result
𝐻 = 𝐻 ∗ 𝑊,
	×	𝑉 	/	(𝑊,
	×	𝑊	×	𝐻)
𝑊 = (													)									(																												)
for i =	1	to nIter do
end
Initialize	 W and	H
Matrix	computation	evaluation	pipeline
𝑊 ∗ 𝑉	× 𝐻, / 𝐻,
)×(
11 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
𝑊	×	𝐻	
Observation
H
W
transpose
V
mat-mat mat-mat
mat-mat
chained
mat-elem
intermediate	result	size:
5	X	1011
loop
𝐻 = 𝐻 ∗ 𝑊,
	×	𝑉 	/	(𝑊,
	×	𝑊	×	𝐻)
𝑊 = (													)										(																											)
for i =	1	to nIter do
end
Initialize	 W and	H
Matrix	computation	evaluation	pipeline
𝑊 ∗ 𝑉	× 𝐻, / 𝐻,×( )
12 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Overview	of	MatFast
13 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Matrix	operators
à Unary	operator
– Transpose:	B =	AT
à Binary	operators
– B =	A +	𝛽;	B =	A *	𝛽;	
– C =	A ★ B,	★∈{+,	*,	/};	
– C =	A B (A %*% B)
à Others
– return	a	matrix:	abs(A),	pow(A,	p)
– return	a	vector:	rowSum(A),	colSum(A)
– return	a	scalar:		max(A),	min(A)
matrix-matrix	multiplication
14 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Optimization	targets
à MATFAST generates	a	computation- and	communication-
efficient	execution	plan:
– Optimize	a	single	matrix	operator	in	an	expression
– Optimize	multiple	operators	in	an	expression
– Exploit	data	dependency	between	different	 expressions
15 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Comparison	with	other	systems
Single Distributed w.	multiple	 nodes
R ScaLAPACK SciDB SystemML MLlib DMac
huge	volume. ✔ ✔ ✔ ✔ ✔
sparse	comp. ✔ 〜 ✔ 〜 〜
multiple	
operators	
✔ ✔ ✔ ✔ ✔ ✔
partition	w.	
dependency
✔
opt.	exec. plan ✔
interface R	script C/Fortran SQL-like R-like Java/Scala Scala
fault	tolerance ✔ ✔ ✔ ✔
open	source ✔ ✔ 〜 ✔ ✔
16 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Compare	with	Spark	SQL
Matrix	operators SQL	relational	
query
Data	type matrix relational	table
Operators
transpose, mat-mat,	
mat-scalar,	mat-elem
join,	select,	group
by,	aggregate
Execution	
scheme
iterative acyclic
17 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Spark	SQL
System	framework
MATFAST
ML	algorithms:	SVD,	PCA,	NMF,	
PageRank,	QR,	etc
Spark	RDD
Applications:	Image processing, Text	
processing,	Collaborative	filtering,
Spatial	computation,	etc.
18 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
System	framework
MATFAST
Components Architecture
19 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
MatFast within	Spark	Catalyst	
à Extend	Spark	Catalyst
Rule	based	optimization
(single	matrix	operators,	
multiple	matrix	operators)
Cost	based	optimization
(optimizing	data	partitioning)
20 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Implementation	and	optimization
21 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Optimization	1:	a	Single	Operator	- Cost	Based	Optimization
plan
1
plan
2
plan
3
plan
4
plan
5
MatFast
AverageExecutionTime(s)
102
10
3
10
4
((A1
× A2
) × A3
) × A4
(A1
× (A2
× A3
)) × A4
A1
× ((A2
× A3
) × A4
)
A1
× (A2
× (A3
× A4
))
(A1
× A2
) × (A3
× A4
)
22 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Optimization	2:	optimizing	data	partitioning	in	pipeline	
à Distribute	matrix	data	over	a	set	of	workers
à How	to	determine	the	data	partitioning	scheme	for	a	matrix	such	that	minimum	
shuffle	cost	is	introduced	for	the	entire	pipeline?	
à Partitioning	schemes
– Row	scheme	(“r”)
– Column	scheme	(“c”)
– Block-Cyclic	scheme	(“b-c”)
– Broadcast	scheme	(“b”)
23 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Optimization	2:	optimizing	data	partitioning	in	pipeline	
à How	to	determine	the	data	partitioning	scheme	for	a	matrix	such	that	minimum	
shuffle	cost	is	introduced	for	the	entire	pipeline?
𝐻 = 𝐻 ∗ 𝑊,
	×	𝑉 	/	(𝑊,
	×	𝑊	×	𝐻)
𝑊 = 𝑊 ∗ 𝑉	×	𝐻,
	/	(𝑊	×	𝐻	×	𝐻,
)
compute	firstWT
00 WT
10 WT
20 WT
30
WT
01 WT
11 WT
21 WT
31
E0
E1
E2
E3
W
W00 W01
W10 W11
W20 W21
W30 W31
WT
Hash-based	partition,	(i+	j)	%	N
3	block	shuffles 7	block	shuffles
Total:	 20	block	shuffles
Executors
24 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Optimization	2:	optimizing	data	partitioning	in	pipeline	
à How	to	determine	the	data	partitioning	scheme	for	a	matrix	such	that	minimum	
shuffle	cost	is	introduced	for	the	entire	pipeline?
𝐻 = 𝐻 ∗ 𝑊,
	×	𝑉 	/	(𝑊,
	×	𝑊	×	𝐻)
𝑊 = 𝑊 ∗ 𝑉	×	𝐻,
	/	(𝑊	×	𝐻	×	𝐻,
)
compute	firstWT
00 WT
10 WT
20 WT
30
WT
01 WT
11 WT
21 WT
31
E0
E1
E2
E3
W
W00 W01
W10 W11
W20 W21
W30 W31
WT
Row-based	partition
Total:	 12	block	shuffles
Executors
3	block	shuffles 3 block	shuffles
25 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Optimization	2:	optimizing	data	partitioning	in	pipeline	
…
……
…
……
……
…
W
T
V
W
T
V
WT
W
H
WTW
WTWH
…
H
…
H
…
V
…
VHT
…
W
…
WH
…
WHH
T
…
W
W
…
H
stage 1
stage 2 stage 3 stage 4
stage 5
stage 6
𝐻 = 𝐻 ∗ 𝑊,
	×	𝑉 	/	(𝑊,
	×	𝑊	×	𝐻)
𝑊 = 𝑊 ∗ 𝑉	×	𝐻,
	/	(𝑊	×	𝐻	×	𝐻,
)
H
à We	need	an	optimized	plan	to	
determine	an	optimized	data	
partitioning	scheme	for	each	
matrix	such	that	minimum	shuffle	
overhead	is	introduced	for	the	
entire	pipeline.
à For	example,	with	hash-based	
data	partitioning,	the	
computation	pipeline	involves	
multiple	shuffles	for	aligning	the	
data	blocks.
26 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Optimization	2:	optimizing	data	partitioning	in	pipeline	
à MATFAST determines	 the	
partitioning	scheme	for	an	
input	matrix	with	min	shuffle	
cost	according	to	the	cost	
model.
à Greedily optimizes	each	
operator
𝑠23(24) ⟵ argmin
<=>(=?)
𝐶AB$$(𝑜𝑝, 𝑠23 , 𝑠24 , 𝑠B)
à Physical	execution	plan	with	optimized	data	
partitioning
……
…
WT
V
W
T
V
…
……
W
T
W
…
WTW
…
W
T
WH
…
H
…
H
…
V
…
VHT
…
HT
…
HHT
…
…
W
WHH
T
…
W
W
stage 1
stage 2 stage 3 stage 4
…
Row	scheme
for	W
Row	scheme	
for	V
27 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Case	studies
28 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Experiments
à Dataset	APIs
– Code	examples	link
à Compare	with	state-of-the-art	systems
– Spark	MLlib (provided	matrix	operation)
– SystemML (Spark)
– ScaLAPACK
– SciDB
à Netflix	data	
– 100,480,507	ratings	
– 17,770	movies	 from	480,189	customers
à Social	network	data
29 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
PageRank	on	different	datasets
MATFAST
30 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
GNMF	on	the	Netflix	dataset
MATFAST
31 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Future	plan
à More	user	friend	APIs
à Advanced	plan	optimizer
à Python	and	R	interface
à Vertical	applications
32 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Conclusion
à Proposed	and	realized	MATFAST,	an	in-memory	distributed	platform	that	
optimizes	query	pipelines	of	matrix	operations
à Take	advantage	of	dynamic	cost-based	 analysis	and	rule-based	 heuristics	to	
generate	a	query	execution	plan
à Communication-efficient	 data	partitioning	scheme	assignment
33 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Reference
– Yongyang Yu, MingJie Tang, Walid G.	Aref, Qutaibah M.	Malluhi, Mostafa	
M.	Abbas, Mourad Ouzzani:
In-Memory	Distributed	Matrix	Computation	Processing	and	
Optimization. ICDE 2017: 1047-1058
34 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Thanks
Q	&	A
mtang@hortonworks.com

Mais conteúdo relacionado

Mais procurados

Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Spark Summit
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Databricks
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
Databricks
 
Cooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache SparkCooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache Spark
Databricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 

Mais procurados (20)

Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
 
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Apache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing SystemApache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing System
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
 
Cooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache SparkCooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache Spark
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 

Destaque

Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Spark Summit
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Data Con LA
 

Destaque (20)

Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
 
Partner Ecosystem Showcase for Apache Ranger and Apache Atlas
Partner Ecosystem Showcase for Apache Ranger and Apache AtlasPartner Ecosystem Showcase for Apache Ranger and Apache Atlas
Partner Ecosystem Showcase for Apache Ranger and Apache Atlas
 
Ibm watson
Ibm watsonIbm watson
Ibm watson
 
빅데이터윈윈 컨퍼런스_데이터시각화자료
빅데이터윈윈 컨퍼런스_데이터시각화자료빅데이터윈윈 컨퍼런스_데이터시각화자료
빅데이터윈윈 컨퍼런스_데이터시각화자료
 
Softnix Messaging Server
Softnix Messaging ServerSoftnix Messaging Server
Softnix Messaging Server
 
Using Big Data to Transform Your Customer’s Experience - Part 1

Using Big Data to Transform Your Customer’s Experience - Part 1
Using Big Data to Transform Your Customer’s Experience - Part 1

Using Big Data to Transform Your Customer’s Experience - Part 1

 
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Put Alternative Data to Use in Capital Markets

Put Alternative Data to Use in Capital Markets
Put Alternative Data to Use in Capital Markets

Put Alternative Data to Use in Capital Markets

 
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
 
Softnix Security Data Lake
Softnix Security Data Lake Softnix Security Data Lake
Softnix Security Data Lake
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with Spark
 
Building the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time AnalyticsBuilding the Ideal Stack for Real-Time Analytics
Building the Ideal Stack for Real-Time Analytics
 
The Evolution of Data Architecture
The Evolution of Data ArchitectureThe Evolution of Data Architecture
The Evolution of Data Architecture
 
CWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / ClouderaCWIN17 Frankfurt / Cloudera
CWIN17 Frankfurt / Cloudera
 
Spark meetup - Zoomdata Streaming
Spark meetup  - Zoomdata StreamingSpark meetup  - Zoomdata Streaming
Spark meetup - Zoomdata Streaming
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Zoomdata
ZoomdataZoomdata
Zoomdata
 
Security implementation on hadoop
Security implementation on hadoopSecurity implementation on hadoop
Security implementation on hadoop
 
Cloudera and Qlik: Big Data Analytics for Business
Cloudera and Qlik: Big Data Analytics for BusinessCloudera and Qlik: Big Data Analytics for Business
Cloudera and Qlik: Big Data Analytics for Business
 

Semelhante a MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL Yanbo Liang and Mingie Tang

Streamline - Stream Analytics for Everyone
Streamline - Stream Analytics for EveryoneStreamline - Stream Analytics for Everyone
Streamline - Stream Analytics for Everyone
DataWorks Summit/Hadoop Summit
 

Semelhante a MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL Yanbo Liang and Mingie Tang (20)

Enterprise data science at scale
Enterprise data science at scaleEnterprise data science at scale
Enterprise data science at scale
 
Unlocking insights in streaming data
Unlocking insights in streaming dataUnlocking insights in streaming data
Unlocking insights in streaming data
 
SparkR best practices for R data scientist
SparkR best practices for R data scientistSparkR best practices for R data scientist
SparkR best practices for R data scientist
 
SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Integrate SparkR with existing R packages to accelerate data science workflows
 Integrate SparkR with existing R packages to accelerate data science workflows Integrate SparkR with existing R packages to accelerate data science workflows
Integrate SparkR with existing R packages to accelerate data science workflows
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With Spark
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
 
Spark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's KeynoteSpark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's Keynote
 
Streamline - Stream Analytics for Everyone
Streamline - Stream Analytics for EveryoneStreamline - Stream Analytics for Everyone
Streamline - Stream Analytics for Everyone
 
Schema Registry & Stream Analytics Manager
Schema Registry  & Stream Analytics ManagerSchema Registry  & Stream Analytics Manager
Schema Registry & Stream Analytics Manager
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
#HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course #HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 

Mais de Spark Summit

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit
 

Mais de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 

Último

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 

Último (20)

ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL Yanbo Liang and Mingie Tang

  • 1. 1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved MatFast IN-MEMORY DISTRIBUTED MATRIX COMPUTATION PROCESSING AND OPTIMIZATION BASED ON SPARK SQL Mingjie Tang Yanbo Liang Oct, 2017
  • 2. 2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved About Authors à Yongyang Yu • Machine learning, Database system, Computation Algebra • PhD students at Purdue University à Mingjie Tang • Spark SQL, Spark ML, Database, Machine Learning • Software Engineer at Hortonworks à Yanbo Liang • Apache Spark committer, Spark MLlib • Staff Software Engineer at Hortonworks à … All Other Contributors
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Motivation Overview of MatFast Implementation and optimization Use cases
  • 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Motivation à Many applications rely on efficient processing of queries over big matrix data: – Recommender systems – Social network analysis – Predict traffic data flow – Anti-fraud and spam detection – Bioinformatics
  • 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Motivation à Recommender Systems Netflix’s user-movie rating table (sample) Problem: Predict the missing entries in the table Input: User-movie rating table with missing entries Output: Complete user-movie rating table with predictions For Netflix, #users = 80 million, #movies = 2 million Batma n begins Alice in Wonde rland Doctor Strange Trolls Ironma n Alice 4 ? 3 5 4 Bob ? 5 4 ? ? Cindy 3 ? ? ? 2 movies users
  • 6. 6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Motivation à Gaussian Non-negative Matrix Factorization (GNMF) – Assumption: 𝑉"×$ ≈ 𝑊"×' × 𝐻'×$ V ≈ W H×users (80M) movies (2M) users (80M) modeling dims (e.g., topics, age, language, etc.) modeling dims (500) movies (2M) 𝐻 = 𝐻 ∗ 𝑊, × 𝑉 / (𝑊, × 𝑊 × 𝐻) 𝑊 = 𝑊 ∗ 𝑉 × 𝐻, / (𝑊 × 𝐻 × 𝐻, ) fori = 1 to nIter do end Initialize W and H Matrix operation for GNMF Algorithm huge volume dense/sparse storage iterative execution
  • 7. 7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Motivation à User-Movie Rating Prediction with GNMF val p = 200 // number of topics val V = loadMatrix(“in/V”) // read matrix val max_niter = 10 // max number of iteration W = RandomMatrix(V.nrows, p) H = RandomMatrix(p, V.ncols) for (i <- 0 until max_niter) { H = H * (W.t %*% V) / (W.t %*% W %*% H) W = W * (V %*% H.t) / (W %*% H %*% H.t) } (H %*% W).saveToHive()
  • 8. 8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved State of the art solution in Spark ecosystem à Alternative Least Square approach in Spark (ALS) – Experiment on Spotify data – 50+ million users x 30+ million songs – 50 billion ratings For rank 10 with 10 iterations – ~1 hour running time à How to extend ALS to other matrix computation? – SVD – PCA – QR
  • 9. 9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Observation H W transpose V mat-mat mat-mat mat-mat mat-elem mat-elem loop 𝐻 = 𝐻 ∗ 𝑊, × 𝑉 / (𝑊, × 𝑊 × 𝐻) 𝑊 = ( ) ( ) for i = 1 to nIter do end Initialize W and H Matrix computation evaluation pipeline 𝑊 ∗ 𝑉 × 𝐻, / 𝑊 × 𝐻 × intermediate result cost: 8 X 1016( )
  • 10. 10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved 𝑊 × 𝐻 Observation H W transpose V mat-mat mat-mat mat-mat mat-elem mat-elem intermediate result cost: 5 X 1011 loop materialization of result 𝐻 = 𝐻 ∗ 𝑊, × 𝑉 / (𝑊, × 𝑊 × 𝐻) 𝑊 = ( ) ( ) for i = 1 to nIter do end Initialize W and H Matrix computation evaluation pipeline 𝑊 ∗ 𝑉 × 𝐻, / 𝐻, )×(
  • 11. 11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved 𝑊 × 𝐻 Observation H W transpose V mat-mat mat-mat mat-mat chained mat-elem intermediate result size: 5 X 1011 loop 𝐻 = 𝐻 ∗ 𝑊, × 𝑉 / (𝑊, × 𝑊 × 𝐻) 𝑊 = ( ) ( ) for i = 1 to nIter do end Initialize W and H Matrix computation evaluation pipeline 𝑊 ∗ 𝑉 × 𝐻, / 𝐻,×( )
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Overview of MatFast
  • 13. 13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Matrix operators à Unary operator – Transpose: B = AT à Binary operators – B = A + 𝛽; B = A * 𝛽; – C = A ★ B, ★∈{+, *, /}; – C = A B (A %*% B) à Others – return a matrix: abs(A), pow(A, p) – return a vector: rowSum(A), colSum(A) – return a scalar: max(A), min(A) matrix-matrix multiplication
  • 14. 14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization targets à MATFAST generates a computation- and communication- efficient execution plan: – Optimize a single matrix operator in an expression – Optimize multiple operators in an expression – Exploit data dependency between different expressions
  • 15. 15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Comparison with other systems Single Distributed w. multiple nodes R ScaLAPACK SciDB SystemML MLlib DMac huge volume. ✔ ✔ ✔ ✔ ✔ sparse comp. ✔ 〜 ✔ 〜 〜 multiple operators ✔ ✔ ✔ ✔ ✔ ✔ partition w. dependency ✔ opt. exec. plan ✔ interface R script C/Fortran SQL-like R-like Java/Scala Scala fault tolerance ✔ ✔ ✔ ✔ open source ✔ ✔ 〜 ✔ ✔
  • 16. 16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Compare with Spark SQL Matrix operators SQL relational query Data type matrix relational table Operators transpose, mat-mat, mat-scalar, mat-elem join, select, group by, aggregate Execution scheme iterative acyclic
  • 17. 17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Spark SQL System framework MATFAST ML algorithms: SVD, PCA, NMF, PageRank, QR, etc Spark RDD Applications: Image processing, Text processing, Collaborative filtering, Spatial computation, etc.
  • 18. 18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved System framework MATFAST Components Architecture
  • 19. 19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved MatFast within Spark Catalyst à Extend Spark Catalyst Rule based optimization (single matrix operators, multiple matrix operators) Cost based optimization (optimizing data partitioning)
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Implementation and optimization
  • 21. 21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization 1: a Single Operator - Cost Based Optimization plan 1 plan 2 plan 3 plan 4 plan 5 MatFast AverageExecutionTime(s) 102 10 3 10 4 ((A1 × A2 ) × A3 ) × A4 (A1 × (A2 × A3 )) × A4 A1 × ((A2 × A3 ) × A4 ) A1 × (A2 × (A3 × A4 )) (A1 × A2 ) × (A3 × A4 )
  • 22. 22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization 2: optimizing data partitioning in pipeline à Distribute matrix data over a set of workers à How to determine the data partitioning scheme for a matrix such that minimum shuffle cost is introduced for the entire pipeline? à Partitioning schemes – Row scheme (“r”) – Column scheme (“c”) – Block-Cyclic scheme (“b-c”) – Broadcast scheme (“b”)
  • 23. 23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization 2: optimizing data partitioning in pipeline à How to determine the data partitioning scheme for a matrix such that minimum shuffle cost is introduced for the entire pipeline? 𝐻 = 𝐻 ∗ 𝑊, × 𝑉 / (𝑊, × 𝑊 × 𝐻) 𝑊 = 𝑊 ∗ 𝑉 × 𝐻, / (𝑊 × 𝐻 × 𝐻, ) compute firstWT 00 WT 10 WT 20 WT 30 WT 01 WT 11 WT 21 WT 31 E0 E1 E2 E3 W W00 W01 W10 W11 W20 W21 W30 W31 WT Hash-based partition, (i+ j) % N 3 block shuffles 7 block shuffles Total: 20 block shuffles Executors
  • 24. 24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization 2: optimizing data partitioning in pipeline à How to determine the data partitioning scheme for a matrix such that minimum shuffle cost is introduced for the entire pipeline? 𝐻 = 𝐻 ∗ 𝑊, × 𝑉 / (𝑊, × 𝑊 × 𝐻) 𝑊 = 𝑊 ∗ 𝑉 × 𝐻, / (𝑊 × 𝐻 × 𝐻, ) compute firstWT 00 WT 10 WT 20 WT 30 WT 01 WT 11 WT 21 WT 31 E0 E1 E2 E3 W W00 W01 W10 W11 W20 W21 W30 W31 WT Row-based partition Total: 12 block shuffles Executors 3 block shuffles 3 block shuffles
  • 25. 25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization 2: optimizing data partitioning in pipeline … …… … …… …… … W T V W T V WT W H WTW WTWH … H … H … V … VHT … W … WH … WHH T … W W … H stage 1 stage 2 stage 3 stage 4 stage 5 stage 6 𝐻 = 𝐻 ∗ 𝑊, × 𝑉 / (𝑊, × 𝑊 × 𝐻) 𝑊 = 𝑊 ∗ 𝑉 × 𝐻, / (𝑊 × 𝐻 × 𝐻, ) H Ã We need an optimized plan to determine an optimized data partitioning scheme for each matrix such that minimum shuffle overhead is introduced for the entire pipeline. Ã For example, with hash-based data partitioning, the computation pipeline involves multiple shuffles for aligning the data blocks.
  • 26. 26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization 2: optimizing data partitioning in pipeline à MATFAST determines the partitioning scheme for an input matrix with min shuffle cost according to the cost model. à Greedily optimizes each operator 𝑠23(24) ⟵ argmin <=>(=?) 𝐶AB$$(𝑜𝑝, 𝑠23 , 𝑠24 , 𝑠B) à Physical execution plan with optimized data partitioning …… … WT V W T V … …… W T W … WTW … W T WH … H … H … V … VHT … HT … HHT … … W WHH T … W W stage 1 stage 2 stage 3 stage 4 … Row scheme for W Row scheme for V
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Case studies
  • 28. 28 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Experiments à Dataset APIs – Code examples link à Compare with state-of-the-art systems – Spark MLlib (provided matrix operation) – SystemML (Spark) – ScaLAPACK – SciDB à Netflix data – 100,480,507 ratings – 17,770 movies from 480,189 customers à Social network data
  • 29. 29 © Hortonworks Inc. 2011 – 2017. All Rights Reserved PageRank on different datasets MATFAST
  • 30. 30 © Hortonworks Inc. 2011 – 2017. All Rights Reserved GNMF on the Netflix dataset MATFAST
  • 31. 31 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Future plan à More user friend APIs à Advanced plan optimizer à Python and R interface à Vertical applications
  • 32. 32 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Conclusion à Proposed and realized MATFAST, an in-memory distributed platform that optimizes query pipelines of matrix operations à Take advantage of dynamic cost-based analysis and rule-based heuristics to generate a query execution plan à Communication-efficient data partitioning scheme assignment
  • 33. 33 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Reference – Yongyang Yu, MingJie Tang, Walid G. Aref, Qutaibah M. Malluhi, Mostafa M. Abbas, Mourad Ouzzani: In-Memory Distributed Matrix Computation Processing and Optimization. ICDE 2017: 1047-1058
  • 34. 34 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Thanks Q & A mtang@hortonworks.com