Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

by Dean Wampler, Ph.D. (@deanwampler)

Outline
•Reactive Enterprise Architectures: The Lightbend Perspective
•Big Data and the Emergence of Apache Spark
•An Architecture for Fast Data

Reactive Enterprise Applications: The Lightbend Perspective

The Lightbend Reactive Platform

Online Services
IoT
Retail
Education
Technology
Social
Media
Finance

Big Data and the Emergence of Apache Spark

Distributed compute frameworks: MapReduce
• Distributed computation over that data.

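Hadoop
[Diagram: a Hadoop cluster. The Master runs the Resource Manager (YARN) and the Name Node (HDFS). Each Slave Node runs a Node Mgr and a Data Node over its local disks. MR job #1 and MR job #2 run on the cluster; Flume and Sqoop (from DBs) bring data in.]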

Hadoop Strengths
• Lowest CapEx system for Big Data.
• Excellent for ingesting and integrating diverse datasets.
• Flexible: from classic analytics (aggregations and data warehousing) to machine learning.

Hadoop Weaknesses
• Complex administration.
• YARN can’t manage all distributed services.
• MapReduce:
  • Has poor performance.
  • Has a difficult programming model.
  • Doesn’t support stream processing.

Hadoop 2013: Embrace Spark
[Diagram: the same Hadoop cluster (YARN, HDFS, Master, Slave Nodes, Flume, Sqoop, DBs), now running Spark job #1 and Spark job #2 alongside MR job #1 and MR job #2.]

Spark vs. MapReduce Performance
100x better for many algorithms.

Spark: Major Performance Improvements
Sort 100TB

One of the Fastest Growing OS Projects

Modules
• Spark Streaming (~Real Time)
• MLlib (Machine Learning)
• SQL/DataFrames (Structured Data)
• GraphX (Graphs)
• Spark RDD (Core)

The Core - Resilient Distributed Datasets
[Diagram: an RDD split into Partitions 1 to 4, spread across the four Nodes of a Cluster.]

“Inverted Index” in Spark
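// toWords and sortByCount are helper functions not shown on the slide.
sparkContext.textFile("/path/to/input")
  .map { line =>
    // Each line is "id,contents"; split on the first comma only.
    val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap {
    // Emit ((word, id), 1) for every word occurrence.
    case (id, contents) => toWords(contents).map(w => ((w, id), 1))
  }.reduceByKey {
    // Sum the occurrences of each (word, id) pair.
    (count1, count2) => count1 + count2
  }.map {
    // Re-key by word alone.
    case ((word, path), n) => (word, (path, n))
  }.groupByKey
  .map {
    // Sort each word's (path, count) list with sortByCount.
    case (word, list) => (word, sortByCount(list))
  }.saveAsTextFile("/path/to/output")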
[Diagram: the dataflow textFile → map → flatMap → reduceByKey → map → groupByKey → map → saveAsTextFile]
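The code above calls two helpers, toWords and sortByCount, that the slide does not define. As a minimal sketch of what the pipeline computes, the following runs the same steps on ordinary Scala collections, with no Spark dependency. The helper bodies, the object name InvertedIndexSketch, and the sample input are assumptions made for illustration, not the author's implementations.

```scala
object InvertedIndexSketch {
  // Hypothetical helper: lower-case the text and split on non-letter characters.
  def toWords(contents: String): Seq[String] =
    contents.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Hypothetical helper: order (id, count) pairs by descending count, then by id.
  def sortByCount(pairs: Iterable[(String, Int)]): Seq[(String, Int)] =
    pairs.toSeq.sortBy { case (id, n) => (-n, id) }

  // Each input line is "id,contents", as on the slide.
  def build(lines: Seq[String]): Map[String, Seq[(String, Int)]] =
    lines
      .map { line =>
        val array = line.split(",", 2)
        (array(0), array(1))
      }
      .flatMap { case (id, contents) => toWords(contents).map(w => ((w, id), 1)) }
      .groupBy { case (key, _) => key }                       // stands in for reduceByKey
      .map { case (key, ones) => (key, ones.map(_._2).sum) }
      .toSeq
      .map { case ((word, id), n) => (word, (id, n)) }
      .groupBy { case (word, _) => word }                     // stands in for groupByKey
      .map { case (word, pairs) => (word, sortByCount(pairs.map(_._2))) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input: two tiny documents.
    val index = build(Seq("doc1,spark and kafka", "doc2,spark spark mesos"))
    index.toSeq.sortBy(_._1).foreach(println)
  }
}
```

On a cluster, reduceByKey combines counts within each partition before shuffling, which the local groupBy here does not model; the sketch only mirrors the result, not the execution strategy.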

SQL queries and a “DataFrame” DSL
• For data with a fixed schema...
• Write SQL queries (currently a subset of HiveQL).
• Use the equivalent, Python-inspired DataFrame API.

Use SQL or the Idiomatic DataFrame API
// SQL:
sqlContext.sql("""
  SELECT state, age, COUNT(*) AS cnt
  FROM people
  GROUP BY state, age
  ORDER BY cnt DESC, state ASC, age ASC
""")
// DataFrame (Scala):
people.select($"state", $"age")
  .groupBy($"state", $"age").count()
  .orderBy($"count".desc, $"state".asc, $"age".asc)
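To make the semantics of the query concrete, here is a minimal sketch of the same group, count, and order steps on a plain Scala collection. The Person case class, its name field, and the object name GroupCountSketch are assumptions for illustration; the slide does not show the schema of the people table.

```scala
object GroupCountSketch {
  // Hypothetical record type standing in for a row of the `people` table.
  final case class Person(name: String, state: String, age: Int)

  // Returns (state, age, count) rows ordered by count DESC, state ASC, age ASC.
  def countByStateAndAge(people: Seq[Person]): Seq[(String, Int, Int)] =
    people
      .groupBy(p => (p.state, p.age))                                   // GROUP BY state, age
      .map { case ((state, age), group) => (state, age, group.size) }   // COUNT(*)
      .toSeq
      .sortBy { case (state, age, cnt) => (-cnt, state, age) }          // ORDER BY
}
```

Negating the count in the sort key is a simple way to get descending order on one field while keeping the others ascending.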

Spark Streaming: “Mini-batch” Processing
[Diagram: a DStream as a sequence of RDDs (RDD #1, RDD #2, RDD #3, …), each holding the Events that arrived in one batch interval ∆, with boundaries at t0, t1 = t0 + ∆, t2 = t0 + 2∆, and t3 = t0 + 3∆. A Window spans 2 batches.]
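The mini-batch idea can be sketched without Spark: events are bucketed by the interval they arrive in, and a window is a run of consecutive batches. The Event type, the function names, and the millisecond timestamps below are assumptions for illustration, and the sketch assumes every event arrives at or after t0.

```scala
object MiniBatchSketch {
  // Hypothetical event type: an arrival time in milliseconds plus a payload.
  final case class Event(timeMs: Long, payload: String)

  // Assign each event to batch k when its time falls in [t0 + k*delta, t0 + (k+1)*delta).
  // Empty intervals are kept as empty batches so that windows stay aligned in time.
  def toBatches(events: Seq[Event], t0: Long, deltaMs: Long, numBatches: Int): Vector[Seq[Event]] = {
    val byIndex = events.groupBy(e => ((e.timeMs - t0) / deltaMs).toInt)
    Vector.tabulate(numBatches)(k => byIndex.getOrElse(k, Seq.empty))
  }

  // Sliding windows over consecutive batches, for example size 2 as in the diagram.
  def windows(batches: Vector[Seq[Event]], size: Int): Vector[Seq[Event]] =
    batches.sliding(size).map(_.flatten).toVector
}
```

With a 2-second interval, for instance, `toBatches(events, t0, 2000L, 3)` would yield three batches, and `windows(batches, 2)` would yield two overlapping windows.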

Streaming Inverted Index
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaBrokers = "host1:port1,host2:port2,..."
val kafkaTopics = Set("topic1", "topic2", ...)
val sparkConf = new SparkConf().setAppName("...")
// Mini-batch interval of 2 seconds.
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with kafkaBrokers and kafkaTopics
val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers)
val messages =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, kafkaTopics)
// Records arrive as (key, message) pairs; the slide names the first element topic.
messages.flatMap { case (topic, text) => toWords(text).map(w => ((w, topic), 1L)) }
  .reduceByKey(_ + _)
  .map { case ((word, topic), n) => (word, (topic, n)) }
  .groupByKey
  .map { case (word, list) => (word, sortByCount(list)) }
  .saveAsTextFiles("/path/to/output")
ssc.start()
ssc.awaitTermination()

Fast as in Streaming. Why?
• Update a search engine in real time as web pages or documents change.
• Train a spam filter with every email.
• Detect anomalies as they happen by processing logs and monitoring data.

Fast Data Architecture
[Diagram labels: Internet; HTTP/REST; Reactive Services; Web Services; Logs and Other Files; Akka Streams, Actors, Cluster, …, Persist; Spark Core, Streaming, SQL, MLlib, GraphX; HDFS, S3, CFSv2; SQL/NoSQL; all on Mesos or YARN, on Bare Metal or Cloud.]

Core of Spark, Kafka, and Cassandra

“SMACK” Stack

Data Sources

Reactive Streams
[Diagram: a Producer sends Events to a Consumer through a bounded queue, with feedback flowing back toward the Producer.]
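The bounded-queue-with-feedback idea can be illustrated with a small sketch. Note the assumption: the Reactive Streams specification signals demand without blocking (a subscriber requests n elements), whereas this sketch uses java.util.concurrent.ArrayBlockingQueue and treats a refused offer as the feedback signal. The object and function names are hypothetical, and this is not how Akka Streams is implemented.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BoundedQueueSketch {
  // Offer each event to a bounded queue. `offer` returns false when the queue
  // is full; that refusal is the feedback telling the producer to hold back.
  // Returns the events that were refused, in order.
  def produce(queue: ArrayBlockingQueue[String], events: Seq[String]): Seq[String] =
    events.filterNot(e => queue.offer(e))

  // Drain up to `n` events, freeing capacity for the producer.
  // `poll` returns null when the queue is empty, which ends the drain early.
  def consume(queue: ArrayBlockingQueue[String], n: Int): Seq[String] =
    Iterator.continually(queue.poll()).take(n).takeWhile(_ != null).toList
}
```

Because the queue has a fixed capacity, a slow consumer cannot cause unbounded memory growth; the producer learns of the overload and can retry later, drop, or slow down.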

Lightbend Reactive Platform

Kafka for Stream Storage

[Diagram: Producers and Consumers (Service 1, Service 2, Service 3, Log & Other Files, Internet, Services) connected point to point: N * M links.]

[Diagram: the same Producers and Consumers (Service 1, Service 2, Service 3, Log & Other Files, Internet, Services), now needing only N + M links.]
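The arithmetic behind the two link counts is simple enough to state in code. This is a sketch only; the object and function names are hypothetical, and it assumes N producers, M consumers, and a single shared intermediary in the second case.

```scala
object LinkCountSketch {
  // Point to point: every producer connects directly to every consumer.
  def pointToPoint(producers: Int, consumers: Int): Int = producers * consumers

  // Shared intermediary: every producer and every consumer connects to it once.
  def viaHub(producers: Int, consumers: Int): Int = producers + consumers
}
```

For 5 producers and 4 consumers this is 20 links against 9, and the gap widens as either side grows.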

Minibatch Processing

Short and Long-term Storage

Infrastructure

Next Steps
• Learn - Fast Data: Big Data Evolved
• Watch - Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
• Review - Spark success stories by Lightbend clients
