Spark - Alexis Seigneurin (English)

Alexis Seigneurin
@aseigneurin @ippontech

Spark
● Processing of large volumes of data
● Distributed processing on commodity hardware
● Written in Scala, with bindings for Java and Python

History
● 2009: AMPLab, UC Berkeley
● June 2013: enters the Apache Incubator
● February 2014: "Top-level project" of the Apache Software Foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0

Use cases
● Log analysis
● Processing of text files
● Analytics
● Distributed search (Google, before)
● Fraud detection
● Product recommendation

Close to Hadoop
● Same use cases
● Same development model: MapReduce
● Integration with the ecosystem

Simpler than Hadoop
● API simpler to learn
● “Relaxed” MapReduce
● Spark Shell: interactive processing

Faster than Hadoop
Spark officially sets a new record in large-scale sorting (5th November 2014)
● Sorting 100 TB of data
● Hadoop MR: 72 minutes
○ With 2100 nodes (50400 cores)
● Spark: 23 minutes
○ With 206 nodes (6592 cores)

Spark ecosystem
● Spark
● Spark Shell
● Spark Streaming
● Spark SQL
● Spark ML
● GraphX

Integration
● YARN, ZooKeeper, Mesos
● HDFS
● Cassandra
● Elasticsearch
● MongoDB

RDD
● Resilient Distributed Dataset
● Abstraction of a collection processed in parallel
● Fault tolerant
● Can work with tuples:
○ Key - Value
○ Tuples must be independent of each other

Sources
● Files on HDFS
● Local files
● Collection in memory
● Amazon S3
● NoSQL database
● ...
● Or a custom implementation of InputFormat

Transformations
● Processes an RDD, returns another RDD
● Lazy!
● Examples:
○ map(): one value → another value
○ mapToPair(): one value → a tuple
○ filter(): filters values/tuples given a condition
○ groupByKey(): groups values by key
○ reduceByKey(): aggregates values by key
○ join(), cogroup()...: joins two RDDs

Actions
● Does not return an RDD
● Examples:
○ count(): counts values/tuples
○ saveAsHadoopFile(): saves results in Hadoop’s format
○ foreach(): applies a function on each item
○ collect(): retrieves values in a list (List<T>)
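The split between lazy transformations and actions can be tried without a cluster. The sketch below is only an analogy in plain Java: java.util.stream is also lazy, so filter() and map() play the role of transformations and collect() plays the role of an action. It does not use the Spark API, and the class name, method name and sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: Java streams stand in for RDDs (assumption, not the Spark API).
class LazyPipelineSketch {

    // Builds the pipeline; every value reaching map() is recorded in 'trace'.
    static Stream<Integer> buildPipeline(List<String> values, List<String> trace) {
        return values.stream()
                .filter(v -> !v.isEmpty())      // "transformation": nothing runs yet
                .map(v -> {                     // "transformation": nothing runs yet
                    trace.add(v);
                    return v.length();
                });
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        Stream<Integer> lengths = buildPipeline(Arrays.asList("Acer", "", "Tilia"), trace);
        System.out.println("Touched before the action: " + trace);

        // The terminal operation is what triggers the work, like an action on an RDD.
        List<Integer> result = lengths.collect(Collectors.toList());
        System.out.println("Touched after the action: " + trace);
        System.out.println("Result: " + result);
    }
}
```

The trace stays empty until collect() is called, which is the behaviour the "Lazy!" bullet describes for RDD transformations.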

Spark - Example
● Trees of Paris: CSV file, Open Data
● Count of trees by species
geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...

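Spark - Example
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Count the trees of each species, sorted by species name
JavaSparkContext sc = new JavaSparkContext("local", "arbres");
sc.textFile("data/arbresalignementparis2010.csv")
    .filter(line -> !line.startsWith("geom"))   // skip the header line
    .map(line -> line.split(";"))               // split each line into fields
    .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))   // (species, 1)
    .reduceByKey((x, y) -> x + y)               // sum the counts per species
    .sortByKey()
    .foreach(t -> System.out.println(t._1 + " : " + t._2));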
[Diagram: data flowing through textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach]

Spark - Example
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
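To see what each step of the example contributes, the same logic can be mimicked in plain Java, without Spark, on the four sample rows of the CSV shown earlier. This is a single-machine sketch, not the Spark implementation: the class and method names are assumptions, and a TreeMap stands in for reduceByKey() followed by sortByKey().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine sketch of the tree-count pipeline (hypothetical names, no Spark).
class TreeCountSketch {

    static Map<String, Integer> countBySpecies(List<String> lines) {
        // TreeMap keeps keys sorted, which plays the role of sortByKey().
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (line.startsWith("geom")) {
                continue;                               // filter(): skip the header
            }
            String[] fields = line.split(";");          // map(): line -> fields
            counts.merge(fields[4], 1, Integer::sum);   // mapToPair() + reduceByKey()
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta",
            "48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;",
            "48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;",
            "48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;",
            "48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29");
        // foreach(): print one "species : count" line per entry
        countBySpecies(lines).forEach((species, n) -> System.out.println(species + " : " + n));
    }
}
```

With Spark the per-key sums are computed in parallel across partitions; here a single loop does all the work.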

Topology & Terminology
● One master / several workers
○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver

Spark in a cluster
Several options
● YARN
● Mesos
● Standalone
○ Workers started manually
○ Workers started by the master

Storage & Processing
MapReduce
● Spark (API)
● Distributed processing
● Fault tolerant
Storage
● HDFS, NoSQL databases...
● Distributed storage
● Fault tolerant

Data locality
● Process the data where it is stored
● Avoid network I/Os
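The idea can be illustrated with a toy placement rule: given the hosts that hold a replica of a data block, prefer a worker running on one of those hosts. This is a deliberately simplified sketch, not Spark's scheduler; the class, method and host names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy placement rule (hypothetical names): prefer a worker co-located with the data.
class LocalitySketch {

    // Returns the first worker host that also stores a replica of the block,
    // or falls back to the first worker (the data then travels over the network).
    static String pickWorker(List<String> workerHosts, Set<String> replicaHosts) {
        for (String host : workerHosts) {
            if (replicaHosts.contains(host)) {
                return host;
            }
        }
        return workerHosts.get(0);
    }

    public static void main(String[] args) {
        List<String> workers = Arrays.asList("node1", "node2", "node3");
        Set<String> replicas = new HashSet<>(Arrays.asList("node3", "node7"));
        System.out.println("Task placed on: " + pickWorker(workers, replicas));
    }
}
```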

Data locality
[Diagram: three nodes, each running a Spark Worker next to an HDFS Datanode; a Spark Master and an HDFS Namenode, each with a standby instance]

Demo
$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/bin/spark-class \
    org.apache.spark.deploy.worker.Worker \
    spark://MBP-de-Alexis:7077 \
    --cores 2 --memory 2G
$ mvn clean package
$ $SPARK_HOME/bin/spark-submit \
    --master spark://MBP-de-Alexis:7077 \
    --class com.seigneurin.spark.WikipediaMapReduceByKey \
    --deploy-mode cluster \
    target/pres-spark-0.0.1-SNAPSHOT.jar

Spark SQL
● Usage of an RDD in SQL
● SQL engine: converts SQL statements into low-level instructions

Spark SQL
Prerequisites:
● Use tabular data
● Describe the schema → SchemaRDD
Describing the schema:
● Programmatic description of the data
● Schema inference through reflection (POJO)
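What "inference through reflection" means can be shown with the standard java.lang.reflect API: column names and types are read from the fields of a POJO instead of being listed by hand. The sketch below only illustrates the principle and is not how Spark SQL is implemented; the Tree class and the inferSchema() helper are assumptions made for this example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Illustration of the principle only (hypothetical names, no Spark involved).
class SchemaInferenceSketch {

    // Hypothetical POJO mirroring two columns of the tree data set.
    static class Tree {
        private float hauteurenm;
        private String espece;
    }

    // Derives "column name -> type name" from the instance fields of a class.
    static Map<String, String> inferSchema(Class<?> type) {
        Map<String, String> schema = new TreeMap<>();   // sorted for a stable output
        for (Field field : type.getDeclaredFields()) {
            if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
                continue;                               // keep instance fields only
            }
            schema.put(field.getName(), field.getType().getSimpleName());
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(inferSchema(Tree.class));
    }
}
```

The programmatic description is the alternative when no such class exists, for example when the columns are only known at run time.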

Spark SQL - Example
● Creating tabular data (type Row)
JavaRDD<Row> rdd = trees.map(fields -> Row.create(
    Float.parseFloat(fields[3]), fields[4]));
---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |

Spark SQL - Example
● Describing the schema
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
fields.add(DataType.createStructField("espece", DataType.StringType, false));
StructType schema = DataType.createStructType(fields);
JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
schemaRDD.registerTempTable("tree");
---------------------------------------
| hauteurenm | espece |
---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |

Spark SQL - Example
● Counting trees by species
sqlContext.sql("SELECT espece, COUNT(*) "
        + "FROM tree "
        + "WHERE espece <> '' "
        + "GROUP BY espece "
        + "ORDER BY espece")
    .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...

Micro-batches
● Slices a continuous flow of data into batches
● Same API as for batch processing
● ≠ Apache Storm (which processes records one at a time)

DStream
● Discretized Streams
● Sequence of RDDs
● Initialized with a Duration
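The slicing itself is easy to picture: every event falls into the batch whose time slot contains its arrival time, and each batch is then processed like a small RDD. The following plain-Java sketch mimics that grouping on a list of timestamps. It is an illustration under assumptions, not the Spark Streaming API: the names are hypothetical, and unlike a real DStream it does not emit empty batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java picture of micro-batching (hypothetical names, no Spark involved).
class MicroBatchSketch {

    // Groups event timestamps (in ms) into consecutive batches of 'batchMillis' each.
    // Key = batch index, value = the events that arrived during that batch.
    static Map<Long, List<Long>> slice(List<Long> timestamps, long batchMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : timestamps) {
            long batchIndex = t / batchMillis;
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        // A batch duration of 1000 ms, comparable to the Duration given to a DStream.
        List<Long> events = Arrays.asList(100L, 900L, 1200L, 2500L);
        slice(events, 1000).forEach((index, batch) ->
            System.out.println("batch " + index + " -> " + batch));
    }
}
```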

Window operations
● Sliding window
● Reuses data from other windows
● Initialized with a window length and a slide interval
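How the two parameters interact can be sketched in plain Java by treating the stream as a list of batches: a window is emitted every slide interval and covers the most recent batches up to the window length. The code below is an assumption-laden illustration, not the Spark Streaming API. Both parameters are expressed as a number of batches, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java picture of a sliding window over batches (hypothetical names).
class SlidingWindowSketch {

    // Emits one window every 'slideInterval' batches; each window concatenates
    // the last 'windowLength' batches (fewer at the very start of the stream).
    static List<List<Integer>> windows(List<List<Integer>> batches,
                                       int windowLength, int slideInterval) {
        List<List<Integer>> result = new ArrayList<>();
        for (int end = slideInterval; end <= batches.size(); end += slideInterval) {
            int start = Math.max(0, end - windowLength);
            List<Integer> window = new ArrayList<>();
            for (List<Integer> batch : batches.subList(start, end)) {
                window.addAll(batch);
            }
            result.add(window);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = Arrays.asList(
            Arrays.asList(1), Arrays.asList(2), Arrays.asList(3), Arrays.asList(4));
        // Window length of 3 batches, sliding by 2 batches.
        System.out.println(windows(batches, 3, 2));
    }
}
```

When the window length exceeds the slide interval, consecutive windows overlap, so the same batch contributes to more than one window.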

Sources
● Socket
● Kafka
● Flume
● HDFS
● MQ (ZeroMQ...)
● Twitter
● ...
● Or a custom implementation of Receiver

Spark Streaming Demo
● Receive Tweets with hashtag #Android
○ Twitter4J
● Detection of the language of the Tweet
○ Language Detection
● Indexing with Elasticsearch
● Reporting with Kibana 4

Demo
● Launch Elasticsearch
$ curl -X DELETE localhost:9200
$ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
"tweets": {
"properties": {
"user": {"type": "string","index": "not_analyzed"},
"text": {"type": "string"},
"createdAt": {"type": "date","format": "date_time"},
"language": {"type": "string","index": "not_analyzed"}
}
}
}'
● Launch Kibana -> http://localhost:5601
● Launch the Spark Streaming process

@aseigneurin
aseigneurin.github.io
@ippontech
blog.ippon.fr
