Apache Spark is an open-source cluster computing framework that speeds up analytics relative to Hadoop MapReduce by keeping data in memory where possible. It is built around Resilient Distributed Datasets (RDDs), which are operated on in parallel across a cluster. Spark also makes development easier than with Hadoop through APIs in Scala, Java, and Python and through interactive shells. It provides unified analytics capabilities, including SQL, streaming, machine learning, and graph processing. Spark is deployed on clusters of over 1,000 nodes, and the Spark 1.1.0 release had 171 contributors.
[Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing
1. Leveraging Spark for Cluster Computing
Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.
All Spark images and code are from http://spark.apache.org/
Robin M. E. Swezey
Rakuten Institute of Technology, Tokyo
Intelligence Domain Group
robin.swezey@mail.rakuten.com
2. What is Spark?
3. In short, Spark is the future of open-source MapReduce
4. Why Spark?
Current Hadoop stack is heterogeneous
Spark = Fully integrated analytics suite and cluster computing framework
Berkeley AMP lab + Apache Software Foundation
Apache Hadoop, Hive, Mahout, Pig and their respective logos are trademarks of The Apache Software Foundation.
5. Platform
On the surface, very similar to Hadoop
• Relies on HDFS
• Runs on YARN, Mesos, or standalone
• MapReduce + General cluster computing
6. Platform
Key differences with usual stack
1. Resilient Distributed Dataset (RDD)
Central to Spark (loosely comparable to an R data frame)
[Diagram: three RDDs]
7. Platform
Key differences with usual stack
1. Resilient Distributed Dataset (RDD)
Central to Spark (loosely comparable to an R data frame)
RDD<String>
RDD<Tuple>
RDD<Tuple>
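The chain of typed datasets above can be pictured with a small single-machine sketch. `ToyRDD` is a hypothetical stand-in written in plain Python; it is not Spark's API, and unlike a real RDD it evaluates eagerly. It only illustrates two ideas: each transformation yields a new dataset and leaves its parent untouched, and each dataset remembers how it was derived, which is the property that lets a lost partition be recomputed.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions: List[List[Any]],
                 parent: Optional["ToyRDD"] = None, op: str = "source"):
        self.partitions = partitions  # each inner list models one block of data
        self.parent = parent          # the dataset this one was derived from
        self.op = op                  # label of the step that produced it

    def map(self, fn: Callable[[Any], Any], op: str = "map") -> "ToyRDD":
        # Apply fn partition by partition; the parent dataset is not modified.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, op=op)

    def lineage(self) -> List[str]:
        # Walk back through the parents to list the steps, oldest first.
        node: Optional["ToyRDD"] = self
        chain: List[str] = []
        while node is not None:
            chain.append(node.op)
            node = node.parent
        return list(reversed(chain))

    def collect(self) -> List[Any]:
        # Flatten all partitions into one local list.
        return [x for part in self.partitions for x in part]


# Made-up input: two partitions of text lines, i.e. an "RDD<String>".
lines = ToyRDD([["a b", "c"], ["d e f"]])
# (line, word count) pairs, i.e. an "RDD<Tuple>".
pairs = lines.map(lambda s: (s, len(s.split())), op="to_pair")
# (line, has more than one word) pairs, another "RDD<Tuple>".
flags = pairs.map(lambda t: (t[0], t[1] > 1), op="flag_long")

print(flags.collect())
print(flags.lineage())
```

In Spark itself the transformations are lazy and the partitions live on different machines; the sketch keeps everything local so that it runs without a cluster.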
8. Platform
Key differences with usual stack
2. Better resource utilization
Disk is slow. Memory is fast. Several levels of persistence.
Read blocks from disk
Cache aggregates in memory
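The "read from disk once, then serve from memory" idea can be sketched in plain Python. Everything here is illustrative: `read_blocks_from_disk` simulates a slow read with a counter, and `CachedAggregate` is a hypothetical helper, not part of Spark. In PySpark the corresponding calls would be `rdd.cache()` or `rdd.persist(...)` with a chosen storage level.

```python
from typing import Callable, List, Optional

disk_reads = 0  # counts how often the simulated disk is touched


def read_blocks_from_disk() -> List[int]:
    """Simulated slow read: returns the numbers 1..100 and counts the access."""
    global disk_reads
    disk_reads += 1
    return list(range(1, 101))


class CachedAggregate:
    """Hypothetical helper: computes a value once, then keeps it in memory."""

    def __init__(self, compute: Callable[[], int]):
        self._compute = compute
        self._value: Optional[int] = None
        self._cached = False

    def get(self) -> int:
        if not self._cached:
            # First access: pay the cost of going to "disk".
            self._value = self._compute()
            self._cached = True
        return self._value


total = CachedAggregate(lambda: sum(read_blocks_from_disk()))
first = total.get()   # triggers the simulated disk read
second = total.get()  # served from memory, no second read
print(first, second, disk_reads)
```

The second call returns the same aggregate without touching the simulated disk again, which is the effect that matters for iterative jobs that reuse the same data.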
9. Platform
Key differences with usual stack
2. Better resource utilization
More cores > more machines. Resource locality.
Each node × each core / each local block
10. Platform
Key differences with usual stack
3. Easier development & operations
Scala, Java, Python API
Logistic Regression: 8 Lines
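The slide's actual code is not reproduced in this transcript, so the following is only a plain-Python analogue of the map-then-reduce shape such a logistic regression usually takes: a per-point gradient is computed with `map`, and the gradients are summed with `reduce`. The data set, the iteration count, and the helper names (`dot`, `add`, `point_gradient`, `predict`) are all made up for illustration, and nothing here runs on Spark.

```python
import math
from functools import reduce
from typing import List, Tuple

Point = Tuple[List[float], float]  # (features x, label y in {-1.0, +1.0})

# Tiny, made-up, linearly separable data set (illustrative only).
points: List[Point] = [
    ([1.0, 2.0], 1.0),
    ([2.0, 1.5], 1.0),
    ([-1.0, -2.0], -1.0),
    ([-2.0, -1.0], -1.0),
]


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def add(a: List[float], b: List[float]) -> List[float]:
    return [ai + bi for ai, bi in zip(a, b)]


def point_gradient(w: List[float], p: Point) -> List[float]:
    """Gradient of the logistic loss for one point (the "map" step)."""
    x, y = p
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


w = [0.0, 0.0]
for _ in range(20):
    # "map" each point to its gradient, then "reduce" by summing the vectors.
    gradient = reduce(add, map(lambda p: point_gradient(w, p), points))
    w = [wi - gi for wi, gi in zip(w, gradient)]


def predict(weights: List[float], x: List[float]) -> float:
    return 1.0 if dot(weights, x) >= 0 else -1.0


print(w, [predict(w, x) for x, _ in points])
```

On a cluster, the `map` and `reduce` steps would run over a cached distributed dataset, so each iteration reuses data already held in memory.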
11. Platform
Key differences with usual stack
3. Easier Analytics
Interactive Shells in Scala, Python
Easy to connect with SparkContext (e.g. IPython Notebook)
12. Platform
Key differences with usual stack
4. Integrated Solution
Easy MapReduce
DBMS-like Functionality
Machine Learning
Streaming
13. Applications
Easy MapReduce
Thanks to RDD Distributed Operators
map()
reduce()
reduceByKey()
groupBy()
sample()
pipe()
foreach()
fold()
histogram()
…
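To make the semantics of a few of these operators concrete, here is a word count written in plain Python. `reduce_by_key` is a hypothetical local helper that mimics what `reduceByKey()` does to key/value pairs; the input lines are made up. In PySpark the same pipeline would typically chain `flatMap`, `map`, and `reduceByKey` on an RDD.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List, Tuple


def reduce_by_key(pairs: Iterable[Tuple[str, int]],
                  fn: Callable[[int, int], int]) -> Dict[str, int]:
    """Merge the values of pairs that share a key, using fn."""
    merged: Dict[str, int] = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


lines: List[str] = ["spark is fast", "spark is fun"]  # made-up input

# Split every line into words (the flatMap-like step).
words = [word for line in lines for word in line.split()]

# map(): turn each word into a (word, 1) pair.
pairs = list(map(lambda word: (word, 1), words))

# reduceByKey(): add up the ones per word.
counts = reduce_by_key(pairs, lambda a, b: a + b)

# reduce(): collapse all the ones into a single total.
total_words = reduce(lambda a, b: a + b, map(lambda kv: kv[1], pairs))

print(counts, total_words)
```

The difference between the last two steps is the point: `reduce` yields one value for the whole dataset, while `reduceByKey` yields one value per key.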
14. Applications
Sped-up Analytics with DBMS-like SQL Functionality
Integrated
Unified Data Access
Hive Compatible
Standard Connectivity
15. Applications
Streaming
16. Applications
Machine Learning
Statistics
Classification / Regression
Collaborative Filtering
Clustering
Dimensionality Reduction
Feature Extraction
Image: http://en.wikipedia.org/wiki/Machine_learning#mediaviewer/File:Linear-svm-scatterplot.svg
17. Applications
Graph Processing
Flexible
Fast
PageRank
Connected components
Label propagation
SVD++
Triangle count
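As an illustration of the first algorithm in the list, here is a minimal single-machine PageRank in plain Python. It is a sketch of the textbook iteration, not the implementation Spark ships. It assumes every node appears as a key in `links`, has at least one outgoing link, and only links to nodes that are also keys. The graph, the damping factor of 0.85, and the iteration count are made-up example values.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iterative PageRank over an adjacency list with no dangling nodes."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node splits its current rank evenly among its out-links.
        contribs = {node: 0.0 for node in nodes}
        for node, outs in links.items():
            share = ranks[node] / len(outs)
            for dest in outs:
                contribs[dest] += share
        # Combine the received contributions with the teleport term.
        ranks = {node: (1.0 - damping) / n + damping * contribs[node]
                 for node in nodes}
    return ranks


# Made-up three-page graph: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

In a distributed setting the contribution step is a map over the edges followed by a per-destination sum, which is why the algorithm fits the operator model shown earlier in the deck.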
18. More
How does it scale?
There are deployed clusters of 1,000+ nodes
19. More
There’s open-source, and there’s highly supported open-source
Spark 1.1.0 had 171 contributors!
21. Thank you!
http://spark.apache.org