Apache Spark & MLlib
Grigory Sapunov / eclass.cc
Moscow Independent Data Science Meetup / 14.09.2015
https://ru.linkedin.com/in/grigorysapunov

What is Spark?
• General engine for large-scale data processing
• Supports cyclic data flow and in-memory computing
• Java, Scala, Python, R interfaces
• Libraries: SQL and DataFrames, MLlib, GraphX, and Spark Streaming

RDD
Resilient Distributed Dataset
• Distributed collection of objects in memory
• Fault-tolerant: an RDD can be reconstructed automatically
• An RDD can be cached to avoid repeating computations
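
The two properties above (automatic reconstruction and caching) can be sketched with a toy model in plain Python. This is not the Spark API: `ToyDataset`, `lose_cached_data` and the `computations` counter are hypothetical names invented for this illustration, and the sketch assumes that reconstruction means re-running the recipe that produced the data.

```python
class ToyDataset:
    """Toy stand-in for an RDD: it stores a recipe for the data, not only the data."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from scratch
        self._cached = None       # materialized copy, if any
        self._use_cache = False
        self.computations = 0     # how many times the recipe actually ran

    def cache(self):
        """Ask for the result to be kept in memory after the first computation."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, reusing the cached copy when one is available."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def lose_cached_data(self):
        """Simulate losing the in-memory copy, for example when a worker dies."""
        self._cached = None


squares = ToyDataset(lambda: [x * x for x in range(5)]).cache()
first_result = squares.collect()    # runs the recipe
second_result = squares.collect()   # served from the cache, no recomputation
squares.lose_cached_data()
third_result = squares.collect()    # rebuilt by re-running the recipe
```

After these three calls the recipe has run twice: once for the first result and once after the cached copy was lost.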

RDD operations
• Transformations
  • operations on RDDs that return a new RDD (map, filter, …)
  • transformations are lazy
• Actions
  • return a result to the driver program or write it to storage, and kick off a computation (count, first, …)
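
The difference between lazy transformations and actions can be sketched in plain Python. This is a toy model, not the Spark API: `LazyPipeline` and `traced_double` are hypothetical names used only here, and only the method names (map, filter, count, first) are borrowed from the slide.

```python
class LazyPipeline:
    """Toy model: transformations only record steps, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    # Transformations: return a new pipeline, compute nothing.
    def map(self, fn):
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        """Chain the recorded steps into one lazy iterator over the source."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a result to the caller.
    def count(self):
        return sum(1 for _ in self._run())

    def first(self):
        return next(self._run())


calls = []

def traced_double(x):
    """Double x and record that the function was really invoked."""
    calls.append(x)
    return 2 * x

numbers = LazyPipeline(range(1, 6))
large_doubles = numbers.map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has been computed yet
total = large_doubles.count()      # the action runs the whole chain
```

Building `large_doubles` does not call `traced_double` at all; the calls happen only when `count()` or `first()` is invoked.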

https://spark.apache.org/docs/latest/cluster-overview.html

Why use it when there is …
• Hadoop? Spark is better for iterative processes.
• Storm? A rather different thing.
• Flink? Looks interesting and is worth keeping an eye on, but Spark seems to be evolving faster.
• …
This is not an “exclusive OR” scenario: Spark fits well into the Hadoop ecosystem.

Spark use-cases
• Hadoop-like big data processing (as an addition or a replacement)
• Data scientist/analyst’s workplace (great with IPython notebooks or something similar)
• Distributed Python environment (easily run your own tasks on the cluster)

MLlib
• spark.mllib contains the original API built on top of RDDs.
• spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
http://spark.apache.org/docs/latest/mllib-guide.html
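
The pipeline idea can be sketched in plain Python: a chain of stages is fitted once on training data and then reused on new data. This is a toy model and not the spark.ml API; `Scaler`, `Thresholder` and `ToyPipeline` are hypothetical classes, and plain lists stand in for DataFrames.

```python
class Scaler:
    """Stage that learns something from the data: the maximum value."""

    def fit(self, rows):
        self.max_ = max(rows)
        return self

    def transform(self, rows):
        return [r / self.max_ for r in rows]


class Thresholder:
    """Stage with nothing to learn: it only maps values to 0/1 labels."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [1 if r >= self.cutoff else 0 for r in rows]


class ToyPipeline:
    """Runs the stages in order; each stage is fitted on the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Fit on training values, then apply the fitted chain to new values.
model = ToyPipeline([Scaler(), Thresholder(0.5)]).fit([2.0, 4.0, 8.0])
labels = model.transform([1.0, 4.0, 6.0])
```

The point of the design is that the whole chain behaves like a single model: the scaling learned during `fit` is applied unchanged to new data in `transform`.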

MLlib
MLlib: Machine Learning in Apache Spark,
http://arxiv.org/pdf/1505.06807.pdf

MLlib evolution
http://arxiv.org/pdf/1505.06807.pdf

MLlib
• Classification and regression (SVM, logistic regression, linear regression, naive Bayes, decision trees, random forests, GBTs, …)
• Clustering (k-means, GMM, PIC, LDA, streaming k-means)
• Collaborative filtering (ALS)
• Dimensionality reduction (SVD, PCA)
• and much more…

Spark Version Timeline
1.5.0 (Sep 09 2015)
1.4.1 (Jul 15 2015)
1.4.0 (Jun 11 2015)
1.3.1 (Apr 17 2015)
1.3.0 (Mar 13 2015)
1.2.2 (Apr 17 2015)
1.2.1 (Feb 09 2015)
1.2.0 (Dec 18 2014)
1.1.1 (Nov 26 2014)
1.1.0 (Sep 11 2014)
1.0.2 (Aug 05 2014)
0.9.2 (Jul 23 2014)

What’s new in 1.5
● Improved DataFrames, ML Pipelines, R support
● The first phase of Project Tungsten, a new execution backend for DataFrames/SQL
  ○ Memory management and binary processing: manage memory explicitly, eliminate the overhead of the JVM object model and garbage collection
  ○ Cache-aware computation
  ○ Code generation (SQL and DataFrames)
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

What’s new in 1.5
● ML/New algorithms: multilayer perceptron classifier (Scala), PrefixSpan for sequential pattern mining (Scala), association rule generation, 1-sample Kolmogorov-Smirnov test, etc.
● Python API: distributed matrices (pyspark.mllib.linalg.distributed), streaming k-means and linear models, LDA, power iteration clustering, etc.

What’s new in 1.5
More details
● Announcing Spark 1.5
https://databricks.com/blog/2015/09/09/announcing-spark-1-5.html
● Spark Release 1.5.0
http://spark.apache.org/releases/spark-release-1-5-0.html
