2. What is Spark?
• General engine for large-scale data processing
• Supports acyclic data flow (DAG execution) and in-memory computing
• Java, Scala, Python, R interfaces
• Libraries: SQL and DataFrames, MLlib, GraphX, and
Spark Streaming.
3. RDD
Resilient Distributed Dataset
• Distributed collection of objects in memory
• Fault-tolerant: lost partitions of an RDD can be
recomputed automatically from its lineage
• An RDD can be cached in memory to avoid
recomputing it
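To make the fault-tolerance and caching bullets concrete, here is a toy, single-process model in plain Python. It is not Spark's API: `ToyRDD` and `lose_cache` are hypothetical names invented for this sketch. It assumes only that a dataset remembers its parent and the function that derives it (its lineage), so lost results can be rebuilt, and that caching keeps computed results around.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # base data, only set on the root dataset
        self.parent = parent      # dataset this one was derived from
        self.fn = fn              # function applied to each parent element
        self.cached = None        # materialized results, if caching is on
        self.use_cache = False
        self.compute_count = 0    # how many times results were (re)computed

    def map(self, fn):
        # Record the lineage only; no work happens here.
        return ToyRDD(parent=self, fn=fn)

    def cache(self):
        self.use_cache = True
        return self

    def collect(self):
        # Serve from the cache when a cached copy exists.
        if self.use_cache and self.cached is not None:
            return list(self.cached)
        # Otherwise rebuild from the lineage (parent + function).
        if self.parent is None:
            result = list(self.source)
        else:
            result = [self.fn(x) for x in self.parent.collect()]
        self.compute_count += 1
        if self.use_cache:
            self.cached = result
        return list(result)

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. a failed worker."""
        self.cached = None


base = ToyRDD(source=[1, 2, 3])
squares = base.map(lambda x: x * x).cache()

first = squares.collect()     # computed from lineage, then cached
second = squares.collect()    # served from the cache, no recomputation
computes_before_loss = squares.compute_count

squares.lose_cache()          # the cached copy disappears
rebuilt = squares.collect()   # rebuilt from lineage, same values as before
```

In PySpark the corresponding calls would be along the lines of `rdd.map(...)`, `rdd.cache()` and `rdd.collect()`. Real RDDs track lineage per partition across a cluster, which this sketch does not attempt to model.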
4. RDD operations
• Transformations
  • operations on RDDs that return a new RDD
  (map, filter, …)
  • transformations are lazy: nothing is computed
  until an action runs
• Actions
  • return a result to the driver program or
  write it to storage, and kick off a
  computation (count, first, …)
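The split between lazy transformations and actions can be sketched in plain Python. This is a toy model, not Spark's API: `LazyDataset` and `traced_double` are hypothetical names made up for illustration. The sketch assumes that transformations only record what to do, and that an action replays the recorded steps over the data.

```python
class LazyDataset:
    """Toy model of lazy evaluation (hypothetical, not a Spark class)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps   # recorded transformations, not yet applied

    # Transformations: return a new dataset and do no work.
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Build an iterator pipeline that replays the recorded steps.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a result.
    def count(self):
        return sum(1 for _ in self._run())

    def first(self):
        return next(self._run())


calls = []  # records every element that the map function actually sees

def traced_double(x):
    calls.append(x)
    return x * 2

numbers = LazyDataset([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(traced_double)
calls_after_transformations = len(calls)   # still 0: nothing has run yet

total = evens_doubled.count()         # action: runs the whole pipeline
first_value = evens_doubled.first()   # action: reruns it, stops at one element
```

Each action here recomputes from the source data, which is also why caching an intermediate dataset can save work when several actions reuse it. In PySpark the equivalent chain would look roughly like `sc.parallelize(data).filter(...).map(...).count()`.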
7. Why use it when there is …
• Hadoop? Spark is better for iterative processing.
• Storm? A rather different thing.
• Flink? Looks interesting and worth keeping an eye on,
but Spark seems to be evolving faster.
• …
Not an “exclusive OR” scenario: Spark fits well
into the Hadoop ecosystem.
8. Spark use-cases
• Hadoop-like big data processing (as an addition
or a replacement)
• Data scientist/analyst’s workplace (great with
IPython notebooks or something similar)
• Distributed Python environment (easily run
your own tasks on the cluster)
9. MLlib
• spark.mllib contains the original API, built on
top of RDDs.
• spark.ml provides a higher-level API, built on
top of DataFrames, for constructing ML
pipelines.
http://spark.apache.org/docs/latest/mllib-guide.html
17. What’s new in 1.5
● Improved DataFrames, ML Pipelines, R support
● The first phase of Project Tungsten, a new execution backend
for DataFrames/SQL
○ Memory management and binary processing: manage memory
explicitly and eliminate the overhead of the JVM object model
and garbage collection
○ Cache-aware computation
○ Code generation (SQL and DataFrames)
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
18. What’s new in 1.5
● ML/New Algorithms: multilayer perceptron classifier (Scala),
PrefixSpan for sequential pattern mining (Scala), association
rule generation, 1-sample Kolmogorov-Smirnov test, etc.
● Python API: distributed matrices
(pyspark.mllib.linalg.distributed), streaming k-means and
linear models, LDA, power iteration clustering, etc.
19. What’s new in 1.5
More details
● Announcing Spark 1.5
https://databricks.com/blog/2015/09/09/announcing-spark-1-5.html
● Spark Release 1.5.0
http://spark.apache.org/releases/spark-release-1-5-0.html