Here is my talk from Scala by the Bay 2017, Building a High-Performance Database with Scala, Akka, and Spark. It covers integrating Akka and Spark; when to use actors, futures, and reactive streams; back pressure; reactive monitoring with Kamon; and extremely high-speed Scala: how to avoid allocating, copying, and deserializing with high-performance Filo vectors and BinaryRecords.
http://github.com/filodb/FiloDB
http://github.com/velvia/filo
2. Who am I
• User of and contributor to Spark since 0.9, Cassandra since 0.6
• Created Spark Job Server and FiloDB
• Talks at Spark Summit, Cassandra Summit, Strata, Scala Days, etc.
http://velvia.github.io/
4. Needs
• Ingest HUGE streams of events — IoT etc.
• Real-time, low latency, and somewhat flexible queries
• Dashboards, quick answers on new data
• Flexible schemas and query patterns
• Keep your streaming pipeline super simple
• Streaming = hardest to debug. Simplicity rules!
6. Spark + HDFS Streaming
[Diagram: Kafka → Spark Streaming → many small files (microbatches) → dedup/consolidate job → larger, efficient files]
• High latency
• Big impedance mismatch between streaming
systems and a file system designed for big blobs
of data
7. Cassandra?
• Ingest HUGE streams of events — IoT etc.
  • C* is not efficient for writing raw events
• Real-time, low latency, and somewhat flexible queries
  • C* is real-time, but only low latency for simple lookups. Add Spark => much higher latency
• Flexible schemas and query patterns
  • C* only handles simple lookups
10. 100% Reactive
• Scala
• Akka Cluster
• Spark
• Monix / Reactive Streams
• Typesafe Config for all configuration
• Scodec, Ficus, Enumeratum, Scalactic, etc.
• Even most of the performance-critical parts are written in Scala :)
12. Why use Scala and Akka?
• Akka Cluster!
• Just the right abstractions: streams, futures, Akka, type safety…
• Failure handling and supervision are critical for databases
• All the pattern matching and immutable goodness :)
13. Scala Big Data Projects
• Spark
• GeoMesa
• Khronus - Akka time-series DB
• Sirius - Akka distributed KV Store
• FiloDB!
17. Akka vs Futures
• Akka Actors:
  • External FiloDB node API (remote + cluster)
  • Async messaging with clients
  • Cluster/distributed state management
• Futures and Observables:
  • Core I/O
  • Columnar data processing / ingestion
  • Type-safe processing stages
18. Futures for Single Actions
/**
 * Clears all data from the column store for the given projection, for all versions.
 * More like a truncation than a drop.
 * NOTE: please make sure there are no reprojections or writes going on before calling this
 */
def clearProjectionData(projection: Projection): Future[Response]

/**
 * Completely and permanently drops the dataset from the column store.
 * @param dataset the DatasetRef for the dataset to drop.
 */
def dropDataset(dataset: DatasetRef): Future[Response]

/**
 * Appends the ChunkSets and incremental indices in the segment to the column store.
 * @param segment the ChunkSetSegment to write / merge to the columnar store
 * @param version the version # to write the segment to
 * @return Success, or Future.failure(exception) otherwise.
 */
def appendSegment(projection: RichProjection,
                  segment: ChunkSetSegment,
                  version: Int): Future[Response]
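Because each single action returns a Future, the steps compose naturally with for-comprehensions. A minimal sketch with stub stand-ins (StubStore, the simplified String/Seq signatures, and the "gdelt" dataset name are all illustrative, not FiloDB's actual API):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-ins for FiloDB's Response ADT, just for illustration
sealed trait Response
case object Success extends Response

// A stub column store with the same Future-per-action shape
object StubStore {
  def clearProjectionData(projection: String): Future[Response] = Future(Success)
  def appendSegment(projection: String, segment: Seq[Int]): Future[Response] = Future(Success)
}

// The for-comprehension sequences the async steps without blocking any thread
val result: Future[Response] = for {
  _    <- StubStore.clearProjectionData("gdelt")
  resp <- StubStore.appendSegment("gdelt", Seq(1, 2, 3))
} yield resp

println(Await.result(result, 5.seconds))  // prints Success in this stub
```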
19. Monix / Reactive Streams
• http://monix.io
• “observable sequences that are exposed as asynchronous streams, expanding on the observer pattern, strongly inspired by ReactiveX and by Scalaz, but designed from the ground up for back-pressure and made to cleanly interact with Scala’s standard library, compatible out-of-the-box with the Reactive Streams protocol”
• Much better than Future[Iterator[_]]
20. Monix / Reactive Streams
def readChunks(projection: RichProjection,
               columns: Seq[Column],
               version: Int,
               partMethod: PartitionScanMethod,
               chunkMethod: ChunkScanMethod = AllChunkScan): Observable[ChunkSetReader] = {
  scanPartitions(projection, version, partMethod)
    // Partitions to pipeline of single chunks
    .flatMap { partIndex =>
      stats.incrReadPartitions(1)
      readPartitionChunks(projection.datasetRef, version, columns, partIndex, chunkMethod)
    }
    // Collate single chunks into ChunkSetReaders
    .scan(new ChunkSetReaderAggregator(columns, stats)) { _ add _ }
    .collect { case agg: ChunkSetReaderAggregator if agg.canEmit => agg.emit() }
}
21. Functional Reactive Stream Processing
• Ingest stream merged with flush commands
• Built in async/parallel tasks via mapAsync
• Notify on end of stream, errors
val combinedStream = Observable.merge(stream.map(SomeData), flushStream)
combinedStream.map {
  case SomeData(records) =>
    shard.ingest(records)
    None
  case FlushCommand(group) =>
    shard.switchGroupBuffers(group)
    Some(FlushGroup(shard.shardNum, group, shard.latestOffset))
}.collect { case Some(flushGroup) => flushGroup }
  .mapAsync(numParallelFlushes)(shard.createFlushTask _)
  .foreach { x => }
  .recover { case ex: Exception => errHandler(ex) }
27. Yes, Akka in Spark
• Columnar ingestion is stateful - we need stickiness of state, which is inherently difficult in Spark
• Akka (cluster) gives us a separate, asynchronous control channel to talk to FiloDB ingestors
• Spark only gives data flow primitives, not async messaging
• We need to route incoming records to the correct ingestion node; sorting data is inefficient and forces all nodes to wait for sorting to finish
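The routing idea can be sketched with a plain hash-based partition map (Node, shardFor, nodeFor, and the two-node layout are hypothetical names for illustration; in FiloDB the actual partition map is maintained by the NodeClusterActor):

```scala
// Hypothetical sketch of shard routing, not FiloDB's actual code
final case class Node(host: String)

val numShards = 4

// Partition map: shard number -> owning ingestion node
// (a plain Map here; a NodeClusterActor would keep this up to date)
val partitionMap: Map[Int, Node] =
  (0 until numShards).map(s => s -> Node(s"node-${s % 2}")).toMap

// Mask the sign bit so the shard number is always non-negative
def shardFor(partitionKey: String): Int =
  (partitionKey.hashCode & Int.MaxValue) % numShards

def nodeFor(partitionKey: String): Node = partitionMap(shardFor(partitionKey))

// Every record with the same key always routes to the same node,
// so no global sort is needed before ingestion
assert(nodeFor("sensor-42") == nodeFor("sensor-42"))
```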
28. Data Ingestion Setup
[Diagram: two Executors, each with an NCA and dataset actors DsCA1/DsCA2 feeding task0/task1 via Row Source Actors; a Node Cluster Actor maintains the Partition Map]
29. FiloDB Separate Nodes
[Diagram: same executor/actor layout as slide 28, but ingestion runs in separate FiloDB Node processes rather than inside the Spark executors; the Node Cluster Actor still maintains the Partition Map]
30. Testing Akka Cluster
• MultiNodeSpec / sbt-multi-jvm
• NodeClusterSpec
• Tests joining of different cluster nodes and
partition map updates
• Is partition map updated properly if a cluster
node goes down — inject network failures
• Lessons
31. Kamon Tracing
• http://kamon.io
• One trace can encapsulate multiple Future steps, all executing on different threads
• Tunable tracing levels
• Summary stats and histograms for segments
• Super useful for production debugging of a reactive stack
36. How do you go REALLY fast?
• Don’t serialize
• Don’t allocate
• Don’t copy
37. Filo fast
• Filo binary vectors - 2 billion records/sec
• Spark InMemoryColumnStore - 125 million
records/sec
• Spark CassandraColumnStore - 25 million
records/sec
38. Filo: High Performance Binary Vectors
• Designed for NoSQL, not a file format
• Random or linear access
• On or off heap
• Missing value support
• Scala only, but cross-platform support possible
http://github.com/velvia/filo is a binary data vector library designed for extreme read performance with minimal deserialization costs.
39. Billions of Ops / Sec
• JMH benchmark: 0.5ns per FiloVector element access / add
• 2 Billion adds per second - single threaded
• Who said Scala cannot be fast?
• Spark API (row-based) limits performance significantly
val randomInts = (0 until numValues).map(i => util.Random.nextInt)
val randomIntsAray = randomInts.toArray
val filoBuffer = VectorBuilder(randomInts).toFiloBuffer
val sc = FiloVector[Int](filoBuffer)

@Benchmark
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
def sumAllIntsFiloApply(): Int = {
  var total = 0
  for { i <- 0 until numValues optimized } {
    total += sc(i)
  }
  total
}
40. JVM Inlining
• Very small methods can be inlined by the JVM
• final def avoids virtual method dispatch
• Thus, methods in traits and abstract classes are not inlinable
0.5ns/read is achieved through a stack of very small methods:

val base = baseReader.readInt(0)
final def apply(i: Int): Int = base + dataReader.read(i)

case (32, _) => new TypedBufferReader[Int] {
  final def read(i: Int): Int = reader.readInt(i)
}

final def readInt(i: Int): Int = unsafe.getInt(byteArray, (offset + i * 4).toLong)
41. BinaryRecord
• Tough problem: FiloDB must handle many different datasets, each with different schemas
• Cannot rely on static types and standard serialization mechanisms - case classes, Protobuf, etc.
• Serialization very costly, especially strings
• Solution: BinaryRecord
42. BinaryRecord II
• BinaryRecord is a binary (i.e., transport-ready) record class that supports any schema or mix of column types
• Values can be extracted or written with no serialization cost
• UTF8-encoded string class
• String compares as fast as native Java strings
• Immutable API once built
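A minimal sketch of the idea, assuming a hypothetical fixed layout of [age: Int][name length: Int][name UTF-8 bytes] in a ByteBuffer (the layout and accessor names are illustrative; the real BinaryRecord supports arbitrary schemas and variable-length fields):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Illustrative fixed layout: age: Int @ offset 0, name length: Int @ 4, name bytes @ 8
def writeRecord(age: Int, name: String): ByteBuffer = {
  val nameBytes = name.getBytes(StandardCharsets.UTF_8)
  val buf = ByteBuffer.allocate(8 + nameBytes.length)
  buf.putInt(0, age)
  buf.putInt(4, nameBytes.length)
  var i = 0
  while (i < nameBytes.length) { buf.put(8 + i, nameBytes(i)); i += 1 }
  buf
}

// Field accessors: absolute reads straight out of the buffer,
// no intermediate record object is ever deserialized
def age(rec: ByteBuffer): Int = rec.getInt(0)
def nameLen(rec: ByteBuffer): Int = rec.getInt(4)

// Decode the string only if and when it is actually needed
def name(rec: ByteBuffer): String = {
  val bytes = new Array[Byte](nameLen(rec))
  var i = 0
  while (i < bytes.length) { bytes(i) = rec.get(8 + i); i += 1 }
  new String(bytes, StandardCharsets.UTF_8)
}

val rec = writeRecord(39, "Evan")
println(age(rec))   // 39, read directly from the buffer
println(name(rec))  // Evan
```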
43. Use Case: Sorting
• Regular sorting: deserialize record, create sort key, compare sort keys
• BinaryRecord sorting: binary compare fields directly — no deserialization, no object allocations
45. BinaryRecord Sorting
• BinaryRecord sorting: binary compare fields
directly — no deserialization, no object allocations
name: Str age: Int
lastTimestamp:
Long
group: Str
name: Str age: Int
lastTimestamp:
Long
group: Str
46. SBT-JMH
• Super useful tool to leverage JMH, the best micro
benchmarking harness
• JMH is written by the JDK folks
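Setup is a plugin line plus enabling the plugin (the version number below is illustrative; check the sbt-jmh README for the current release):

// project/plugins.sbt
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.4.7")

// build.sbt
enablePlugins(JmhPlugin)

// then run benchmarks from the sbt shell, e.g.:
//   jmh:run -i 10 -wi 5 -f1 .*FiloApply.*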
47. In Summary
• Scala, Akka, reactive can give you both awesome
abstractions AND performance
• Use Akka for distribution, state, protocols
• Use reactive/Monix for functional, concurrent
stream processing
• Build (or use FiloDB’s) fast low-level abstractions
with good APIs