2. Intro - What’s Spark?
Spark is an open-source distributed
computing framework for fast, large-scale
data processing that runs alongside Hadoop
and can read Hadoop data.
It was developed at UC Berkeley's AMPLab and is
now an Apache Incubator project.
Spark offers a functional
MapReduce-style model with in-memory
datasets (RDDs), well suited to repeated
iterations over distributed data.
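Spark's RDD API deliberately mirrors Scala's collection API. As a rough flavour (plain Scala here, no cluster needed), the functional map/reduce style looks like this; on a cluster the List would be an RDD and the operations would run distributed:

```scala
// Word count in the functional style Spark borrows from Scala collections.
val lines = List("spark at berkeley", "spark over hadoop")
val counts = lines
  .flatMap(_.split(" "))                                // split into words
  .map(word => (word, 1))                               // emit (word, 1) pairs
  .groupBy(_._1)                                        // group pairs by word
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the counts
// counts("spark") == 2
```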
3. Basic problems
1. Incoming event ingestion (filtering,
categorizing, storing).
We currently use RabbitMQ and Storm, but Spark could be
used here.
2. Batch processing.
In our case: attribution per conversion, aggregation per keyword/ad, and optimization per campaign.
We currently have a proprietary infrastructure that
doesn't scale very well. Spark would shine here.
3. Grids/widgets for online use - slice 'n dice.
We currently have aggregation tables in MySQL.
A short demo of what we did here with Spark...
4. Problem #3: Grids over billions of cells
We manage billions of keywords.
We handle hundreds of millions of clicks
and conversions per day.
Our clients query the data in many
different ways:
● different aggregations, time periods, filters, sorts
● drill-down to specific items of
concern
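A single grid query boils down to filter → aggregate → sort → page. A minimal sketch on plain Scala collections (the record shape and field names are invented for illustration, not from the demo; an RDD version has the same shape):

```scala
// Hypothetical record for one keyword-day of performance data.
case class Stat(keyword: String, day: Int, clicks: Long)

val stats = List(Stat("shoes", 10, 7), Stat("shoes", 11, 5), Stat("hats", 10, 3))
val (from, to) = (10, 11)

val topKeywords = stats
  .filter(s => s.day >= from && s.day <= to)              // time-period slice
  .groupBy(_.keyword)                                     // aggregate per keyword
  .map { case (kw, ss) => (kw, ss.map(_.clicks).sum) }    // total clicks
  .toList
  .sortBy { case (_, clicks) => -clicks }                 // sort by metric
  .take(50)                                               // first grid page
// topKeywords.head == ("shoes", 12L)
```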
5. Architecture for demo
[Diagram] A Web Server serves the grid from an App
Server; the App Server submits queries to the Spark
Master, which fans work out to two Spark Workers, each
colocated with a Cassandra node.
7. Code Snippet - setting up an RDD
// Imports needed (cassandraHost/Port, keyspaceName, columnFamily,
// predicate and sc are defined elsewhere):
import java.nio.ByteBuffer
import java.util
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}
import org.apache.cassandra.db.IColumn

// Configure a Hadoop job that reads a Cassandra column family over Thrift
val job = new Job()
job.setInputFormatClass(classOf[ColumnFamilyInputFormat])
val configuration: Configuration = job.getConfiguration
ConfigHelper.setInputInitialAddress(configuration, cassandraHost)
ConfigHelper.setInputRpcPort(configuration, cassandraPort)
ConfigHelper.setOutputInitialAddress(configuration, cassandraHost)
ConfigHelper.setOutputRpcPort(configuration, cassandraPort)
ConfigHelper.setInputColumnFamily(configuration, keyspaceName, columnFamily)
ConfigHelper.setThriftFramedTransportSizeInMb(configuration, 2047)
ConfigHelper.setThriftMaxMessageLengthInMb(configuration, 2048)
ConfigHelper.setInputSlicePredicate(configuration, predicate)
ConfigHelper.setInputPartitioner(configuration, "Murmur3Partitioner")
ConfigHelper.setOutputPartitioner(configuration, "Murmur3Partitioner")

// Each Cassandra row becomes (row key, sorted map of column name -> column)
val casRdd = sc.newAPIHadoopRDD(configuration, classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer], classOf[util.SortedMap[ByteBuffer, IColumn]])
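A natural next step (a sketch not shown in the deck; it assumes the row keys are UTF-8 strings) is to decode the raw keys and cache the RDD, so subsequent grid queries are served from cluster memory instead of re-reading Cassandra:

```scala
import java.nio.charset.StandardCharsets

// Assumption: row keys are UTF-8 encoded strings. duplicate() avoids
// consuming the shared buffer; cache() pins the decoded rows in RAM.
val cachedRDD = casRdd.map { case (key, columns) =>
  (StandardCharsets.UTF_8.decode(key.duplicate()).toString, columns)
}.cache()
```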
8. Mapp’in & reduc’in with Spark
val flatRdd = creaeteFlatRDD(cachedRDD, startDate, endDate, profileId, statusInTarget)
val withGroupByScores = flatRdd.map {
case (entity, performance) => {
val scores = performance.groupBy(score => score.name )
(entity, scores)
}
}
val withAggrScores = withGroupByScores.map {
case (entity, scores) => { val aggrScores = scores.map {
case (column, sc) => {
val aggregation = sc.reduce[Score]({
(left, right) => { Score(left.name, left.value + right.value) })
(column, aggregation)
}}
(entity, aggrScores)
}
}
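The same groupBy + reduce logic, stripped down to plain Scala collections so it can be run without a cluster (the Score case class here is a stand-in for the one used in the job):

```scala
case class Score(name: String, value: Double)

// Group scores by name, then reduce each group to a single summed Score.
val performance = List(Score("clicks", 10.0), Score("clicks", 5.0), Score("conv", 2.0))
val aggregated = performance
  .groupBy(_.name)
  .map { case (name, ss) =>
    (name, ss.reduce((l, r) => Score(l.name, l.value + r.value)))
  }
// aggregated("clicks").value == 15.0
```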
9. Reading RAM is suddenly a hot spot...
// Pack (date, column, value) into one compact byte array, cutting
// per-object overhead on the hot read path
def createByteArray(date: String, column: Column, value: ByteBuffer): Array[Byte] = {
  val daysFromEpoch = calcDaysFromEpoch(date)
  val columnOrdinal = column.id
  // 4 bytes for the day, 4 for the column ordinal, then the raw value
  val buffer = ByteBuffer.allocate(4 + 4 + value.remaining())
  buffer.putInt(daysFromEpoch)
  buffer.putInt(columnOrdinal)
  buffer.put(value)
  buffer.array()
}
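For completeness, a hypothetical inverse (not in the deck) that unpacks such an array back into its parts; it mirrors the layout written above:

```scala
import java.nio.ByteBuffer

// Unpack a byte array written as [daysFromEpoch:Int][columnOrdinal:Int][value].
def readByteArray(bytes: Array[Byte]): (Int, Int, Array[Byte]) = {
  val buffer = ByteBuffer.wrap(bytes)
  val daysFromEpoch = buffer.getInt()     // first 4 bytes
  val columnOrdinal = buffer.getInt()     // next 4 bytes
  val value = new Array[Byte](buffer.remaining())
  buffer.get(value)                       // rest is the raw value
  (daysFromEpoch, columnOrdinal, value)
}
```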
10. White Hat* - facts
● For this demo: EC2 cluster of one Master and two Slave nodes.
● Each Slave: 240 GB memory, 32 cores, SSD drives,
10 Gb network
● Data size: 100 GB
● Cassandra 2.1
● Spark 0.8.0
● Rule of thumb for cost estimation: ~$25K / TB of data.
You'll probably need 2× the memory, as RDDs are immutable.
* Colored hats metaphor taken from de Bono’s “Six Thinking Hats”
11. Yellow Hat - optimism
● Full slice 'n dice over all data with acceptable latency for
online use (< 5 seconds)
● Additional aggregations at no extra performance cost
● Ease of setup (but as always, be prepared for some
tinkering)
● Single source of truth
● Horizontal scale
● MapReduce capabilities for machine learning algorithms
● Enables merging recent data with old data (what Nathan
Marz coined the "lambda architecture")
12. Black Hat - concerns
● System stability
● Changing API
● Limited ecosystem
● Scala-based code - learning curve
● Maintainability: optimal speed demands a low level of
abstraction.
● Data duplication, especially in-transit
● Master node is a single point of failure
● Scheduling
13. Green Hat - alternatives
● Alternatives to Spark:
○ Cloudera’s Impala (commercial product)
○ Presto (recently open-sourced out of Facebook)
○ Trident/Storm (for stream processing)
14. Red Hat - emotions, intuitions
Spark’s technology is nothing short of astonishing:
yesterday’s “impossible!” is today’s “merely difficult...”