RETHINKING STREAMING ANALYTICS
FOR SCALE
Helena Edelson
1
@helenaedelson
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/stream-analytics-scalability
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
@helenaedelson
Who Is This Person?
• VP of Product Engineering @Tuplejump
• Big Data, Analytics, Cloud Engineering, Cyber Security
• Committer / Contributor to FiloDB, Spark Cassandra
Connector, Akka, Spring Integration
2
• @helenaedelson
• github.com/helena
• linkedin.com/in/helenaedelson
• slideshare.net/helenaedelson
@helenaedelson
@helenaedelson
Tuplejump - Open Source
github.com/tuplejump
• FiloDB - part of this talk
• Calliope - the first Spark-Cassandra integration
• Stargate - an open source Lucene indexer for Cassandra
• SnackFS - an open source HDFS-compatible file system built on Cassandra
4
@helenaedelson
What Will We Talk About
• The Problem Domain
• Example Context
• Rethinking Architecture
• We don't have to look far to look back
• Streaming & Data Science
• Challenging Assumptions
• Revisiting the goal and the stack
• Integration
• Simplification
5
@helenaedelson
THE PROBLEM DOMAIN
Delivering Meaning From A Flood Of Data
6
@helenaedelson
@helenaedelson
The Problem Domain
Need to build scalable, fault tolerant, distributed data
processing systems that can handle massive amounts of
data from disparate sources, with different data structures.
7
@helenaedelson
Translation
How to build adaptable, elegant systems
for complex analytics and learning tasks
to run as large-scale clustered dataflows
8
@helenaedelson
How Much Data
Yottabyte = quadrillion gigabytes or septillion bytes
We all have a lot of data
• Terabytes
• Petabytes...
http://en.wikipedia.org/wiki/Yottabyte
9
100 trillion $ in DC fees
@helenaedelson
Delivering Meaning
• Deliver meaning in sec/sub-sec latency
• Disparate data sources & schemas
• Billions of events per second
• High-latency batch processing
• Low-latency stream processing
• Aggregation of historical from the stream
10
@helenaedelson
While We Monitor, Predict & Proactively
Handle
• Massive event spikes & bursty traffic
• Fast producers / slow consumers
• Network partitioning & out of sync systems
• DC down
• Wait, we've DDoS'd ourselves from fast streams?
• Autoscale issues
– When we scale down VMs how do we not lose data?
11
@helenaedelson
And stay within our
AWS / Rackspace budget
12
@helenaedelson
EXAMPLE CONTEXT:
CYBER SECURITY
Hunting The Hunter
13
@helenaedelson
• Track activities of international threat actor groups,
nation-state, criminal or hacktivist
• Intrusion attempts
• Actual breaches
• Profile adversary activity
• Analysis to understand their motives, anticipate actions
and prevent damage
Adversary Profiling & Hunting:
Online & Offline
14
@helenaedelson
• Machine events
• Endpoint intrusion detection
• Anomalies/indicators of attack or compromise
• Machine learning
• Training models based on patterns from historical data
• Predict potential threats
• Profiling for adversary identification
Stream Processing
15
@helenaedelson
Data Requirements & Description
• Streaming event data
• Log messages
• User activity records
• System ops & metrics data
• Disparate data sources
• Wildly differing data structures
16
@helenaedelson
Massive Amounts Of Data
• One machine can generate 2+ TB per day
• Tracking millions of devices
• 1 million writes per second - bursty
• High % writes, lower % reads
17
@helenaedelson
RETHINKING
ARCHITECTURE
18
@helenaedelson
19
A few years in Silicon Valley on a Cloud Engineering team
@helenaedelson
@helenaedelson
20
Batch analytics data flow from several years ago looked like...
@helenaedelson
21
Batch analytics data flow from several years ago looked like...
@helenaedelson
22
Transforming data multiple times, multiple ways
@helenaedelson
23
Sweet, let's triple the code we have to update and regression test
every time our analytics logic changes
@helenaedelson
STREAMING &
DATA SCIENCE
Enter Streaming for Big Data
24
@helenaedelson
Streaming:
Big Data, Fast Data, Fast Timeseries Data
• Reactive processing of data as it comes in to derive
instant insights
• Is this enough?
• Need to combine with existing big data, historical
processing, ad hoc queries
25
@helenaedelson
New Requirements, Common Use Case
I need fast access to historical data on the fly for
predictive modeling with real time data from the stream
26
@helenaedelson
It's Not A Stream It's A Flood
• Netflix
• 50 - 100 billion events per day
• 1 - 2 million events per second at peak
• LinkedIn
• 500 billion write events per day
• 2.5 trillion read events per day
• 4.5 million events per second at peak with Kafka
• 1 PB of stream data
27
@helenaedelson
Which Translates To
• Do it fast
• Do it cheap
• Do it at scale
28
@helenaedelson
Oh, and don't lose data
29
@helenaedelson
AND THEN WE GREEKED OUT
30
Lambda
@helenaedelson
Lambda Architecture
A data-processing architecture designed to handle
massive quantities of data by taking advantage of both
batch and stream processing methods.
31
• Or, "How to beat the CAP theorum"
• An approach coined by Nathan Mars
• This was a huge stride forward
@helenaedelson
• Doing complex asynchronous transformations
• That need to run with low latency (say, a few seconds to a few
hours)
• Examples
• Weather analytics and prediction system
• News recommendation system
32
Applications Using Lambda Architecture
@helenaedelson
https://www.mapr.com/developercentral/lambda-architecture 33
@helenaedelson
Implementing Is Hard
• Real-time pipeline backed by KV store for updates
• Many moving parts - KV store, real time, batch
• Running similar code in two places
• Still ingesting data to Parquet/HDFS
• Reconcile queries against two different places
34
@helenaedelson
Performance Tuning & Monitoring
on so many disparate systems
35
Also Hard
@helenaedelson
36
λ: Streaming & Batch Flows
Evolution Or Just Addition?
Or Just Technical Debt?
@helenaedelson
Lambda Architecture
An immutable sequence of records is captured and fed into
• a batch system
• and a stream processing system
in parallel
37
@helenaedelson
WAIT, DUAL SYSTEMS?
38
Challenge Assumptions
@helenaedelson
Which Translates To
• Performing analytical computations & queries in dual
systems
• Duplicate Code
• Untyped Code - Strings
• Spaghetti Architecture for Data Flows
• One Busy Network
39
@helenaedelson
40
Why?
• Why support code, machines and running services of
two analytics systems?
• Is a separate batch system needed?
• Can we do everything in a streaming system?
@helenaedelson
YES
41
• A unified system for streaming and batch
• Real-time processing and reprocessing
• Code changes
• Fault tolerance
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
@helenaedelson
ANOTHER ASSUMPTION:
ETL
42
Challenge Assumptions
@helenaedelson
Extract, Transform, Load (ETL)
• Extraction of data from one system into another
• Transforming it
• Loading it into another system
43
@helenaedelson
ETL
• Each step can introduce errors and risk
• Writing intermediary files
• Parsing and re-parsing plain text
• Tools can cost millions of dollars
• Decreases throughput
• Increased complexity
• Can duplicate data after failover
44
@helenaedelson
Extract, Transform, Load (ETL)
"Designing and maintaining the ETL process is often
considered one of the most difficult and resource-
intensive portions of a data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
45
Also unnecessarily redundant and often typeless
@helenaedelson
And let's duplicate the pattern over
all our DataCenters
46
@helenaedelson
47
These are not the solutions you're looking for
@helenaedelson
REVISITING THE GOAL
48
@helenaedelson
Removing The 'E' in ETL
Thanks to technologies like Avro and Protobuf we don't need the
"E" in ETL. Instead of text dumps that you need to parse over
multiple systems (e.g. Scala and Avro):
• A return to strong typing in the big data ecosystem
• Can work with binary data that remains strongly typed
49
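To make the bullets above concrete, here is a minimal sketch (not from the deck) of that strongly typed binary path using plain Apache Avro from Scala. The RawEvent schema and its fields are invented for illustration; in practice the schema would live in a shared module or schema registry.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical event schema, shared by producers and consumers
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"RawEvent","fields":[
    |  {"name":"wsid","type":"string"},
    |  {"name":"temperature","type":"double"}]}""".stripMargin)

val event: GenericRecord = new GenericData.Record(schema)
event.put("wsid", "725030:14732")
event.put("temperature", 21.4)

// Encode to binary: this is what flows through the pipeline instead of text dumps
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(event, encoder)
encoder.flush()
val bytes = out.toByteArray

// Any consumer decodes against the same schema - the types are enforced end to end
val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)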
@helenaedelson
Removing The 'L' in ETL
If data collection is backed by a distributed messaging
system (e.g. Kafka) you can do real-time fanout of the
ingested data to all consumers. No need to batch "load".
• From there each consumer can do their own transformations
50
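A sketch of that fanout with the plain Kafka consumer API, not from the deck: each downstream service subscribes with its own group.id and receives the full stream in real time, then applies its own transformation. The broker address, topic and group names are assumptions, and a recent kafka-clients is assumed for poll(Duration).

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumed broker
props.put("group.id", "fraud-scoring")             // every group gets its own copy of the stream
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
consumer.subscribe(List("raw_events").asJava)      // assumed topic

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach { record =>
    // this consumer's own transformation goes here - no batch "load" step in sight
  }
}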
@helenaedelson
51
#NoMoreGreekLetterArchitectures
@helenaedelson
NoETL
52
@helenaedelson
Pick Technologies Wisely
Based on your requirements
• Latency
• Real time / Sub-Second: < 100ms
• Near real time (low): > 100 ms or a few seconds - a few hours
• Consistency
• Highly Scalable
• Topology-Aware & Multi-Datacenter support
• Partitioning Collaboration - do they play together well
53
@helenaedelson
And Remember
• Flows erode
• Entropy happens
• "Everything fails, all the time" - Kyle Kingsbury
54
@helenaedelson
REVISITING THE STACK
55
@helenaedelson
Stream Processing & Frameworks
56
+
GearPump
@helenaedelson
Strategies
57
• Partition For Scale & Data Locality
• Replicate For Resiliency
• Share Nothing
• Fault Tolerance
• Asynchrony
• Async Message Passing
• Memory Management
• Data lineage and reprocessing in
runtime
• Parallelism
• Elastically Scale
• Isolation
• Location Transparency
@helenaedelson
Fault Tolerance
• Graceful service degradation
• Data integrity / accuracy under failure
• Resiliency during traffic spikes
• Pipeline congestion / bottlenecks
• Easy to debug and find failure source
• Easy to deploy
58
@helenaedelson
Strategy Technologies
Scalable Infrastructure / Elastic Spark, Cassandra, Kafka
Partition For Scale, Network Topology Aware Cassandra, Spark, Kafka, Akka Cluster
Replicate For Resiliency Spark, Cassandra, Akka Cluster all hash the node ring
Share Nothing, Masterless Cassandra, Akka Cluster both Dynamo style
Fault Tolerance / No Single Point of Failure Spark, Cassandra, Kafka
Replay From Any Point Of Failure Spark, Cassandra, Kafka, Akka + Akka Persistence
Failure Detection Cassandra, Spark, Akka, Kafka
Consensus & Gossip Cassandra & Akka Cluster
Parallelism Spark, Cassandra, Kafka, Akka
Asynchronous Data Passing Kafka, Akka, Spark
Fast, Low Latency, Data Locality Cassandra, Spark, Kafka
Location Transparency Akka, Spark, Cassandra, Kafka
My Nerdy Chart
59
@helenaedelson
SMACK
• Scala & Spark Streaming
• Mesos
• Akka
• Cassandra
• Kafka
60
@helenaedelson
Spark Streaming
• One runtime for streaming and batch processing
• Join streaming and static data sets
• No code duplication
• Easy, flexible data ingestion from disparate sources to
disparate sinks
• Easy to reconcile queries against multiple sources
• Easy integration of KV durable storage
61
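As a minimal, self-contained sketch of the "join streaming and static data sets" point above (a local socket source and a tiny in-memory lookup table stand in for Kafka and Cassandra here), everything runs in the one Spark runtime:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-static-join").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Static reference data, loaded once (could equally be a Cassandra table)
val deviceOwners = ssc.sparkContext.parallelize(Seq("dev-1" -> "alice", "dev-2" -> "bob"))

// Live events keyed by device id, e.g. "dev-1,42.0" lines arriving on a socket
val events = ssc.socketTextStream("localhost", 9999).flatMap { line =>
  line.split(",") match {
    case Array(id, metric) => List((id, metric.toDouble))
    case _                 => Nil
  }
}

// Per-batch join of the stream against the static data set - no second batch system
events.transform(rdd => rdd.join(deviceOwners)).print()

ssc.start()
ssc.awaitTermination()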
@helenaedelson
Training
Data
Feature
Extraction
Model
Training
Model
Testing
Test Data
Your Data Extract Data To Analyze
Train your model to predict
62
val ssc = new StreamingContext(conf, Milliseconds(500))
val model = KMeans.train(dataset, ...) // learn offline
val stream = KafkaUtils
  .createStream(ssc, zkQuorum, group, ..)
  .map(event => model.predict(event.feature))
@helenaedelson
63
High performance concurrency framework for Scala and
Java
• Fault Tolerance
• Asynchronous messaging and data processing
• Parallelization
• Location Transparency
• Local / Remote Routing
• Akka: Cluster / Persistence / Streams
@helenaedelson
Akka Actors
64
A distribution and concurrency abstraction
• Compute Isolation
• Behavioral Context Switching
• No Exposed Internal State
• Event-based messaging
• Easy parallelism
• Configurable fault tolerance
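A small hypothetical sketch of those properties with classic Akka actors: internal state stays hidden, messaging is event based, and context.become provides the behavioral context switching mentioned above. The message and actor names are invented.

import akka.actor.{Actor, ActorSystem, Props}

case object Flood
case object Drain
final case class Event(payload: String)

class IngestionActor extends Actor {
  def receive: Receive = normal

  def normal: Receive = {
    case Event(payload) => // process at full rate; state never leaks outside the actor
    case Flood          => context.become(backpressured) // switch behavior on an event
  }

  def backpressured: Receive = {
    case Event(_) => // buffer or shed load while the pipeline is congested
    case Drain    => context.become(normal)
  }
}

val system = ActorSystem("sketch")
val ingestion = system.actorOf(Props[IngestionActor], "ingestion")
ingestion ! Event("raw,csv,line")
ingestion ! Flood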
@helenaedelson
High Performance Streaming
Built On Akka
• Apache Flink - uses Akka for
• Actor model and hierarchy, Deathwatch and distributed
communication between job and task managers
• GearPump - models the entire streaming system with
an actor hierarchy
• Supervision, Isolation, Concurrency
65
@helenaedelson
Apache Cassandra
• Extremely Fast
• Extremely Scalable
• Multi-Region / Multi-Datacenter
• Always On
• No single point of failure
• Survive regional outages
• Easy to operate
• Automatic & configurable replication
66
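As an illustration of "automatic & configurable replication", the whole multi-datacenter story is one statement. This is a sketch, not from the deck; the keyspace and data center names are examples, executed here through the DataStax Java driver.

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Three replicas in each data center; Cassandra handles placement, repair and failover
session.execute(
  """CREATE KEYSPACE IF NOT EXISTS weather WITH replication = {
    |  'class': 'NetworkTopologyStrategy', 'us-east': 3, 'eu-west': 3
    |}""".stripMargin)

session.close()
cluster.close()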
@helenaedelson
90% of streaming data at Netflix is
stored in Cassandra
67
@helenaedelson
STREAM INTEGRATION
68
@helenaedelson
69
KillrWeather
http://github.com/killrweather/killrweather
A reference application showing how to easily integrate streaming and
batch data processing with Apache Spark Streaming, Apache
Cassandra, Apache Kafka and Akka for fast, streaming computations
on time series data in asynchronous event-driven environments.
http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
@helenaedelson
val context = new StreamingContext(conf, Seconds(1))
val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
  context, kafkaParams, kafkaTopics)
stream.flatMap(func1).saveToCassandra(ks1, table1)
stream.map(func2).saveToCassandra(ks1, table1)
context.start()
70
Kafka, Spark Streaming and Cassandra
@helenaedelson
class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ActorInitializationException   => Stop
      case _: FailedToSendMessageException   => Restart
      case _: ProducerClosedException        => Restart
      case _: NoBrokersForPartitionException => Escalate
      case _: KafkaException                 => Escalate
      case _: Exception                      => Escalate
    }

  private val producer = new KafkaProducer[K, V](config)

  override def postStop(): Unit = producer.close()

  def receive = {
    case e: KafkaMessageEnvelope[K, V] => producer.send(e)
  }
}
71
Kafka, Spark Streaming, Cassandra & Akka
@helenaedelson
Spark Streaming, ML, Kafka & C*
val ssc = new StreamingContext(new SparkConf()…, Seconds(5))

val testData = ssc.cassandraTable[String](keyspace, table).map(LabeledPoint.parse)

val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
  .map(_._2).map(LabeledPoint.parse)

trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))

model.trainOn(trainingStream)

// Making predictions on testData
model
  .predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveToCassandra("ml_keyspace", "predictions")
72
@helenaedelson
STREAM INTEGRATION:
DATA LOCALITY &
TIMESERIES
73
SMACK
@helenaedelson
74
@helenaedelson
75
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}
Kafka, Spark Streaming, Cassandra & Akka
@helenaedelson
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}
76
Now we can replay
• On failure
• Reprocessing on code changes
• Future computation...
@helenaedelson
77
Here we are pre-aggregating to a table for fast querying later -
in other secondary stream aggregations and in scheduled computations
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}
@helenaedelson
CREATE TABLE weather.raw_data (
  wsid text, year int, month int, day int, hour int,
  temperature double, dewpoint double, pressure double,
  wind_direction int, wind_speed double, one_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

CREATE TABLE daily_aggregate_precip (
  wsid text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
Data Model (simplified)
78
@helenaedelson
79
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {

  import settings._

  val stream = KafkaUtils.createStream(
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  stream
    .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
}
Gets the partition key: Data Locality
Spark C* Connector feeds this to Spark
Cassandra Counter column in our schema,
no expensive `reduceByKey` needed. Simply
let C* do it: not expensive and fast.
@helenaedelson
The Thing About S3
"Amazon S3 is a simple key-value store" -
docs.aws.amazon.com/AmazonS3/latest/dev/UsingObjects.html
• Keys 2015/05/01 and 2015/05/02 do not live in the “same
place”
• You can roll your own with AmazonS3Client and do the heavy
lifting yourself and throw that data into Spark
80
@helenaedelson
CREATE TABLE weather.raw_data (
  wsid text, year int, month int, day int, hour int,
  temperature double, dewpoint double, pressure double,
  wind_direction int, wind_speed double, one_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
C* Clustering Columns: writes by most recent, reads return most recent first
Timeseries Data
81
Cassandra will automatically sort by most recent for both write and read
@helenaedelson
82
val multipleStreams = (1 to numstreams).map { i =>
  streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))
}
streamingContext.union(multipleStreams)
  .map { httpRequest => TimelineRequestEvent(httpRequest) }
  .saveToCassandra("requests_ks", "timeline")
CREATE TABLE IF NOT EXISTS requests_ks.timeline (
  timesegment bigint, url text, t_uuid timeuuid, method text, headers map<text, text>, body text,
  PRIMARY KEY ((url, timesegment), t_uuid)
);
Record Every Event In The Order In
Which It Happened, Per URL
timesegment protects from writing
unbounded partitions.
timeuuid protects from simultaneous
events over-writing one another.
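A hedged sketch of how those two guards might be derived when building each TimelineRequestEvent. The one-hour bucket size, the UUIDs helper and the HttpRequest accessors (uri, method, headers, body) are assumptions for illustration, not code from the deck.

import java.util.UUID
import com.datastax.driver.core.utils.UUIDs

final case class TimelineRequestEvent(
  timesegment: Long, url: String, t_uuid: UUID,
  method: String, headers: Map[String, String], body: String)

object TimelineRequestEvent {
  private val BucketMillis = 60 * 60 * 1000L                  // one-hour partitions, assumed

  def apply(req: HttpRequest): TimelineRequestEvent =
    TimelineRequestEvent(
      timesegment = System.currentTimeMillis() / BucketMillis, // bounds the partition size
      url         = req.uri,                                   // assumed HttpRequest accessors
      t_uuid      = UUIDs.timeBased(),                         // unique even for simultaneous events
      method      = req.method,
      headers     = req.headers,
      body        = req.body)
}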
@helenaedelson
val stream = KafkaUtils.createDirectStream(...)
  .map(_._2.split(","))
  .map(RawWeatherData(_))

stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

stream
  .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
  .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
83
Replay and Reprocess - Any Time
Data is on the nodes doing the querying
- Spark C* Connector - Partitions
• Timeseries data with Data Locality
• Co-located Spark + Cassandra nodes
• S3 does not give you
Cassandra & Spark Streaming:
Data Locality For Free®
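A sketch of what "data is on the nodes doing the querying" looks like with the Spark Cassandra Connector (assuming connector 1.2+; the station ids are invented): repartition the driving keys onto the replicas that own them, then join locally against the raw table from the slides.

import com.datastax.spark.connector._

// Partition keys we want to read, e.g. station ids from weather.raw_data
val stationIds = ssc.sparkContext.parallelize(Seq(Tuple1("725030:14732"), Tuple1("722950:23174")))

val localRows = stationIds
  .repartitionByCassandraReplica(CassandraKeyspace, CassandraTableRaw)          // move tasks to the data
  .joinWithCassandraTable[RawWeatherData](CassandraKeyspace, CassandraTableRaw) // node-local reads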
@helenaedelson
class PrecipitationActor(ssc: StreamingContext, settings: Settings) extends AggregationActor {
  import akka.pattern.pipe

  def receive: Actor.Receive = {
    case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
  }

  /** Returns the `k` highest precipitation values for the station in the `year`. */
  def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
    val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
      ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync().map(toTopK) pipeTo requester
  }
}
84
Queries pre-aggregated
tables from the stream
Compute Isolation: Actor
@helenaedelson
85
class TemperatureActor(sc: SparkContext, settings: Settings) extends AggregationActor {
  import akka.pattern.pipe

  def receive: Actor.Receive = {
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }

  def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
    sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
      .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
      .collectAsync()
      .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}
C* data is automatically sorted by most recent - due to our data model.
Additional Spark or collection sort not needed.
Efficient Batch Analysis
@helenaedelson
A NEW APPROACH
86
Simplification
@helenaedelson
Everything On The Streaming Platform
87
@helenaedelson
Reprocessing
• Start a new stream job to re-process from the
beginning
• Save re-processed data as a version table
• Application should then read from new version table
• Stop old version of the job, and delete the old table
88
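A hedged sketch of those four steps with the Spark 1.x Kafka direct stream used earlier in the deck, reusing context, kafkaParams and kafkaTopics from the previous Kafka slide; parseEvent and the events_v2 table are placeholders.

// 1. Replay the topic from the beginning, running the new logic
val replayParams = kafkaParams + ("auto.offset.reset" -> "smallest")

val reprocessed = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    context, replayParams, kafkaTopics)
  .map(_._2)
  .map(parseEvent)                                   // hypothetical parser with the updated logic

// 2. Write to a new version table; 3. readers switch to it; 4. stop the old job, drop the old table
reprocessed.saveToCassandra("ks1", "events_v2")      // placeholder keyspace and versioned table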
@helenaedelson
One Pipeline For Fast & Big Data
89
• How do I make the SMACK stack work for ML, Ad-Hoc + Fast Data?
• How do I combine Spark Streaming + Ad Hoc and have good
performance?
@helenaedelson
FiloDB
Designed to ingest streaming data, including machine,
event, and time-series data, and run very fast analytical
queries over them.
90
• Distributed, versioned, columnar analytics database
• Built for fast streaming analytics & OLAP
• Currently based on Apache Cassandra & Spark
• github.com/tuplejump/FiloDB
@helenaedelson
Breakthrough Performance For
Analytical Queries
• Queries run in parallel in Spark for scale-out ad-hoc analysis
• Fast for interactive data science and ad hoc queries
• Up to 200x Faster Queries for Spark on Cassandra 2.x
• Parquet Performance with Cassandra Flexibility
• Increased performance ceiling coming
91
@helenaedelson
Versioning & Why It Matters
• Immutability
• Databases: let's mutate one giant piece of state in place
• Basically hasn't changed since the 1970s!
• With Big Data and streaming, incremental processing is
increasingly important
92
@helenaedelson
FiloDB Versioning
FiloDB is built on functional principles and lets you version and
layer changes
• Incrementally add a column or a few rows as a new version
• Add changes as new versions, don't mutate!
• Writes are idempotent - exactly once ingestion
• Easily control what versions to query
• Roll back changes inexpensively
• Stream out new versions as continuous queries :)
93
@helenaedelson
No Cassandra? Keep All In Memory
94
• Unlike RDDs and DataFrames, FiloDB can ingest new data,
and still be fast
• Unlike RDDs, FiloDB can filter in multiple ways, no need for
entire table scan
@helenaedelson
Spark Streaming to FiloDB
val ratingsStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

ratingsStream.foreachRDD { (message: RDD[(String, String)], batchTime: Time) =>
  val df = message
    .map(_._2.split(","))
    .map(rating => Rating(trim(rating)))
    .toDF("fromuserid", "touserid", "rating")

  // add the batch time to the DataFrame
  val dfWithBatchTime = df.withColumn(
    "batch_time", org.apache.spark.sql.functions.lit(batchTime.milliseconds))

  // save the DataFrame to FiloDB
  dfWithBatchTime.write.format("filodb.spark")
    .option("dataset", "ratings")
    .save()
}
95
For comparison, the same DataFrame could be written straight to Cassandra: dfWithBatchTime.write.format("org.apache.spark.sql.cassandra")
@helenaedelson
Architectyr?
"This is a giant mess"
- Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps 96
@helenaedelson
97
@helenaedelson
98
@helenaedelson
99
@helenaedelson
github.com/helena
helena@tuplejump.com
slideshare.net/helenaedelson
THANKS!
@helenaedelson
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/stream-
analytics-scalability
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Rethinking Streaming Analytics for Scale

  • 1. RETHINKING STREAMING ANALYTICS FOR SCALE Helena Edelson 1 @helenaedelson
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /stream-analytics-scalability
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. @helenaedelson Who Is This Person? • VP of Product Engineering @Tuplejump • Big Data, Analytics, Cloud Engineering, Cyber Security • Committer / Contributor to FiloDB, Spark Cassandra Connector, Akka, Spring Integration 2 • @helenaedelson • github.com/helena • linkedin.com/in/helenaedelson • slideshare.net/helenaedelson
  • 6. @helenaedelson Tuplejump - Open Source github.com/tuplejump • FiloDB - part of this talk • Calliope - the first Spark-Cassandra integration • Stargate - an open source Lucene indexer for Cassandra • SnackFS - open source HDFS for Cassandra 4
  • 7. @helenaedelson What Will We Talk About • The Problem Domain • Example Context • Rethinking Architecture • We don't have to look far to look back • Streaming & Data Science • Challenging Assumptions • Revisiting the goal and the stack • Integration • Simplification 5
  • 8. @helenaedelson THE PROBLEM DOMAIN Delivering Meaning From A Flood Of Data 6 @helenaedelson
  • 9. @helenaedelson The Problem Domain Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures. 7
  • 10. @helenaedelson Translation How to build adaptable, elegant systems for complex analytics and learning tasks to run as large-scale clustered dataflows 8
  • 11. @helenaedelson How Much Data Yottabyte = quadrillion gigabytes or septillion bytes We all have a lot of data • Terabytes • Petabytes... http://en.wikipedia.org/wiki/Yottabyte 9 100 trillion $ in DC fees
  • 12. @helenaedelson Delivering Meaning • Deliver meaning in sec/sub-sec latency • Disparate data sources & schemas • Billions of events per second • High-latency batch processing • Low-latency stream processing • Aggregation of historical from the stream 10
  • 13. @helenaedelson While We Monitor, Predict & Proactively Handle • Massive event spikes & bursty traffic • Fast producers / slow consumers • Network partitioning & out of sync systems • DC down • Wait, we've DDOS'd ourselves from fast streams? • Autoscale issues – When we scale down VMs how do we not lose data? 11
  • 14. @helenaedelson And stay within our AWS / Rackspace budget 12
  • 16. @helenaedelson • Track activities of international threat actor groups, nation-state, criminal or hacktivist • Intrusion attempts • Actual breaches • Profile adversary activity • Analysis to understand their motives, anticipate actions and prevent damage Adversary Profiling & Hunting: Online & Offline 14
  • 17. @helenaedelson • Machine events • Endpoint intrusion detection • Anomalies/indicators of attack or compromise • Machine learning • Training models based on patterns from historical data • Predict potential threats • profiling for adversary Identification Stream Processing 15
  • 18. @helenaedelson Data Requirements & Description • Streaming event data • Log messages • User activity records • System ops & metrics data • Disparate data sources • Wildly differing data structures 16
  • 19. @helenaedelson Massive Amounts Of Data • One machine can generate 2+ TB per day • Tracking millions of devices • 1 million writes per second - bursty • High % writes, lower % reads 17
  • 21. @helenaedelson 19 A few years on a Silicon Valley Cloud Engineering team @helenaedelson
  • 22. @helenaedelson 20 Batch analytics data flow from several years ago looked like...
  • 23. @helenaedelson 21 Batch analytics data flow from several years ago looked like...
  • 25. @helenaedelson 23 Sweet, let's triple the code we have to update and regression test every time our analytics logic changes
  • 27. @helenaedelson Streaming: Big Data, Fast Data, Fast Timeseries Data • Reactive processing of data as it comes in to derive instant insights • Is this enough? • Need to combine with existing big data, historical processing, ad hoc queries 25
  • 28. @helenaedelson New Requirements, Common Use Case I need fast access to historical data on the fly for predictive modeling with real time data from the stream 26
  • 29. @helenaedelson It's Not A Stream It's A Flood • Netflix • 50 - 100 billion events per day • 1 - 2 million events per second at peak • LinkedIn • 500 billion write events per day • 2.5 trillion read events per day • 4.5 million events per second at peak with Kafka • 1 PB of stream data 27
  • 30. @helenaedelson Which Translates To • Do it fast • Do it cheap • Do it at scale 28
  • 32. @helenaedelson AND THEN WE GREEKED OUT 30 Lambda
  • 33. @helenaedelson Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. 31 • Or, "How to beat the CAP theorem" • An approach coined by Nathan Marz • This was a huge stride forward
  • 34. @helenaedelson • Doing complex asynchronous transformations • That need to run with low latency (say, a few seconds to a few hours) • Examples • Weather analytics and prediction system • News recommendation system 32 Applications Using Lambda Architecture
  • 36. @helenaedelson Implementing Is Hard • Real-time pipeline backed by KV store for updates • Many moving parts - KV store, real time, batch • Running similar code in two places • Still ingesting data to Parquet/HDFS • Reconcile queries against two different places 34
  • 37. @helenaedelson Performance Tuning & Monitoring on so many disparate systems 35 Also Hard
  • 38. @helenaedelson 36 λ: Streaming & Batch Flows Evolution Or Just Addition? Or Just Technical Debt?
  • 39. @helenaedelson Lambda Architecture An immutable sequence of records is captured and fed into • a batch system • and a stream processing system, in parallel 37
  • 41. @helenaedelson Which Translates To • Performing analytical computations & queries in dual systems • Duplicate Code • Untyped Code - Strings • Spaghetti Architecture for Data Flows • One Busy Network 39
  • 42. @helenaedelson 40 Why? • Why support code, machines and running services of two analytics systems? • Is a separate batch system needed? • Can we do everything in a streaming system?
  • 43. @helenaedelson YES 41 • A unified system for streaming and batch • Real-time processing and reprocessing • Code changes • Fault tolerance http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
  • 45. @helenaedelson Extract, Transform, Load (ETL) • Extraction of data from one system into another • Transforming it • Loading it into another system 43
  • 46. @helenaedelson ETL • Each step can introduce errors and risk • Writing intermediary files • Parsing and re-parsing plain text • Tools can cost millions of dollars • Decreases throughput • Increased complexity • Can duplicate data after failover 44
  • 47. @helenaedelson Extract, Transform, Load (ETL) "Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm 45 Also unnecessarily redundant and often typeless
  • 48. @helenaedelson And let's duplicate the pattern over all our DataCenters 46
  • 49. @helenaedelson 47 These are not the solutions you're looking for
  • 51. @helenaedelson Removing The 'E' in ETL Thanks to technologies like Avro and Protobuf we don't need the "E" in ETL. Instead of text dumps that you need to parse over multiple systems (e.g. Scala and Avro): • A return to strong typing in the big data ecosystem • Can work with binary data that remains strongly typed 49
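 A minimal sketch of the difference. The field layout and sample line are hypothetical, and the hand-rolled Java-serialization codec below is only a stand-in for what a generated Avro or Protobuf (de)serializer would provide:

 import java.io._

 // Text dump: every consumer re-parses strings and re-discovers the layout
 val line   = "010010:99999,2015,12,1,8,-3.4"
 val parsed = line.split(",")             // Array[String] - typeless
 val temp   = parsed(5).toDouble          // breaks silently if the layout shifts

 // Schema-based: the record stays a typed value end to end
 final case class RawWeatherEvent(wsid: String, year: Int, month: Int, day: Int,
                                  hour: Int, temperature: Double)

 // Stand-in codec: in practice a generated Avro/Protobuf (de)serializer
 def encode(e: RawWeatherEvent): Array[Byte] = {
   val bos = new ByteArrayOutputStream()
   val out = new ObjectOutputStream(bos)
   out.writeObject(e); out.close()
   bos.toByteArray
 }
 def decode(bytes: Array[Byte]): RawWeatherEvent =
   new ObjectInputStream(new ByteArrayInputStream(bytes))
     .readObject().asInstanceOf[RawWeatherEvent]

 val payload   = encode(RawWeatherEvent("010010:99999", 2015, 12, 1, 8, -3.4))
 val roundTrip = decode(payload)          // still a RawWeatherEvent, not a String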
  • 52. @helenaedelson Removing The 'L' in ETL If data collection is backed by a distributed messaging system (e.g. Kafka) you can do real-time fanout of the ingested data to all consumers. No need to batch "load". • From there each consumer can do their own transformations 50
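 As a sketch of that fanout, two independent consumer groups subscribe to the same topic and each transforms the live feed its own way; nothing is batch-"loaded" anywhere. The broker address, topic and group names here are assumptions:

 import java.util.{Collections, Properties}
 import scala.collection.JavaConverters._
 import org.apache.kafka.clients.consumer.KafkaConsumer

 def consumerFor(groupId: String): KafkaConsumer[String, String] = {
   val props = new Properties()
   props.put("bootstrap.servers", "localhost:9092")       // assumption: local broker
   props.put("group.id", groupId)
   props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   val c = new KafkaConsumer[String, String](props)
   c.subscribe(Collections.singletonList("raw_weather"))   // hypothetical topic
   c
 }

 val analyticsConsumer = consumerFor("stream-analytics")   // e.g. a Spark Streaming job
 val archiveConsumer   = consumerFor("cold-storage")       // e.g. a Cassandra/HDFS writer

 // both groups receive the full fan-out of the same ingested records
 val values = analyticsConsumer.poll(1000).asScala.map(_.value)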
  • 55. @helenaedelson Pick Technologies Wisely Based on your requirements • Latency • Real time / Sub-Second: < 100ms • Near real time (low): > 100 ms or a few seconds - a few hours • Consistency • Highly Scalable • Topology-Aware & Multi-Datacenter support • Partitioning Collaboration - do they play together well 53
  • 56. @helenaedelson And Remember • Flows erode • Entropy happens • "Everything fails, all the time" - Kyle Kingsbury 54
  • 58. @helenaedelson Stream Processing & Frameworks 56 + GearPump
  • 59. @helenaedelson Strategies 57 • Partition For Scale & Data Locality • Replicate For Resiliency • Share Nothing • Fault Tolerance • Asynchrony • Async Message Passing • Memory Management • Data lineage and reprocessing in runtime • Parallelism • Elastically Scale • Isolation • Location Transparency
  • 60. @helenaedelson Fault Tolerance • Graceful service degradation • Data integrity / accuracy under failure • Resiliency during traffic spikes • Pipeline congestion / bottlenecks • Easy to debug and find failure source • Easy to deploy 58
  • 61. @helenaedelson My Nerdy Chart (Strategy → Technologies) 59
 • Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
 • Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
 • Replicate For Resiliency: Spark, Cassandra, Akka Cluster (all hash the node ring)
 • Share Nothing, Masterless: Cassandra, Akka Cluster (both Dynamo style)
 • Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
 • Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
 • Failure Detection: Cassandra, Spark, Akka, Kafka
 • Consensus & Gossip: Cassandra & Akka Cluster
 • Parallelism: Spark, Cassandra, Kafka, Akka
 • Asynchronous Data Passing: Kafka, Akka, Spark
 • Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
 • Location Transparency: Akka, Spark, Cassandra, Kafka
  • 62. @helenaedelson SMACK • Scala & Spark Streaming • Mesos • Akka • Cassandra • Kafka 60
  • 63. @helenaedelson Spark Streaming • One runtime for streaming and batch processing • Join streaming and static data sets • No code duplication • Easy, flexible data ingestion from disparate sources to disparate sinks • Easy to reconcile queries against multiple sources • Easy integration of KV durable storage 61
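 A minimal sketch of joining each micro-batch against a static dataset already in Cassandra. The keyspace, table and topic names are hypothetical, and conf, zkQuorum and group are assumed to be defined as in the other snippets:

 import com.datastax.spark.connector._
 import com.datastax.spark.connector.streaming._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.kafka.KafkaUtils

 val ssc = new StreamingContext(conf, Seconds(1))

 // static reference data in Cassandra: (deviceId, profile)
 val profiles = ssc.sparkContext.cassandraTable[(String, String)]("ks1", "device_profiles")

 KafkaUtils.createStream(ssc, zkQuorum, group, Map("events" -> 1))
   .map { case (_, csv) => val f = csv.split(","); (f(0), f(1)) }   // (deviceId, payload)
   .transform(rdd => rdd.join(profiles))                            // enrich each micro-batch with static data
   .map { case (id, (payload, profile)) => (id, payload, profile) }
   .saveToCassandra("ks1", "enriched_events")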
  • 64. @helenaedelson [Diagram: Your Data → Extract Data To Analyze → Feature Extraction → Model Training (Training Data) → Model Testing (Test Data). Train your model to predict.] 62
 val ssc = new StreamingContext(conf, Milliseconds(500))
 val model = KMeans.train(dataset, ...)   // learn offline
 val stream = KafkaUtils
   .createStream(ssc, zkQuorum, group, ...)
   .map(event => model.predict(event.feature))
  • 65. @helenaedelson 63 High performance concurrency framework for Scala and Java • Fault Tolerance • Asynchronous messaging and data processing • Parallelization • Location Transparency • Local / Remote Routing • Akka: Cluster / Persistence / Streams
  • 66. @helenaedelson Akka Actors 64 A distribution and concurrency abstraction • Compute Isolation • Behavioral Context Switching • No Exposed Internal State • Event-based messaging • Easy parallelism • Configurable fault tolerance
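 A minimal, hypothetical actor to make those points concrete: state lives only inside the actor, and callers interact through asynchronous messages rather than shared memory.

 import akka.actor.{Actor, ActorLogging, ActorSystem, Props}

 // hypothetical message type for the sketch
 final case class WordCount(text: String)

 class CounterActor extends Actor with ActorLogging {
   // no exposed internal state: mutation happens only inside the actor
   private var total = 0L

   def receive: Receive = {
     case WordCount(text) =>
       total += text.split("\\s+").length
       log.info("running total: {}", total)
   }
 }

 object Example extends App {
   val system  = ActorSystem("sketch")
   val counter = system.actorOf(Props[CounterActor], "counter")
   counter ! WordCount("rethinking streaming analytics")   // async, fire-and-forget
 }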
  • 67. @helenaedelson High Performance Streaming Built On Akka • Apache Flink - uses Akka for • Actor model and hierarchy, Deathwatch and distributed communication between job and task managers • GearPump - models the entire streaming system with an actor hierarchy • Supervision, Isolation, Concurrency 65
  • 68. @helenaedelson Apache Cassandra • Extremely Fast • Extremely Scalable • Multi-Region / Multi-Datacenter • Always On • No single point of failure • Survive regional outages • Easy to operate • Automatic & configurable replication 66
  • 69. @helenaedelson 90% of streaming data at Netflix is stored in Cassandra 67
  • 71. @helenaedelson 69 KillrWeather http://github.com/killrweather/killrweather A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments. http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/ databricks/apps/weather
  • 72. @helenaedelson Kafka, Spark Streaming and Cassandra 70
 val context = new StreamingContext(conf, Seconds(1))
 val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
   context, kafkaParams, kafkaTopics)
 stream.flatMap(func1).saveToCassandra(ks1, table1)
 stream.map(func2).saveToCassandra(ks1, table1)
 context.start()
  • 73. @helenaedelson Kafka, Spark Streaming, Cassandra & Akka 71
 class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {

   override val supervisorStrategy =
     OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
       case _: ActorInitializationException   => Stop
       case _: FailedToSendMessageException   => Restart
       case _: ProducerClosedException        => Restart
       case _: NoBrokersForPartitionException => Escalate
       case _: KafkaException                 => Escalate
       case _: Exception                      => Escalate
     }

   private val producer = new KafkaProducer[K, V](config)

   override def postStop(): Unit = producer.close()

   def receive = {
     case e: KafkaMessageEnvelope[K, V] => producer.send(e)
   }
 }
  • 74. @helenaedelson Spark Streaming, ML, Kafka & C* 72
 val ssc = new StreamingContext(new SparkConf()…, Seconds(5))
 val testData = ssc.cassandraTable[String](keyspace, table).map(LabeledPoint.parse)

 val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
     ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
   .map(_._2).map(LabeledPoint.parse)

 trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")

 val model = new StreamingLinearRegressionWithSGD()
   .setInitialWeights(Vectors.dense(weights))

 model.trainOn(trainingStream)

 // Making predictions on testData
 model
   .predictOnValues(testData.map(lp => (lp.label, lp.features)))
   .saveToCassandra("ml_keyspace", "predictions")
  • 77. @helenaedelson Kafka, Spark Streaming, Cassandra & Akka 75
 class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
   extends AggregationActor {
   import settings._

   val stream = KafkaUtils.createStream(
       ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
     .map(_._2.split(","))
     .map(RawWeatherData(_))

   stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

   stream
     .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
     .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
 }
  • 78. @helenaedelson Now we can replay • On failure • Reprocessing on code changes • Future computation... 76
 class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
   extends AggregationActor {
   import settings._

   val stream = KafkaUtils.createStream(
       ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
     .map(_._2.split(","))
     .map(RawWeatherData(_))

   stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

   stream
     .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
     .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
 }
  • 79. @helenaedelson Here we are pre-aggregating to a table for fast querying later - for other secondary stream aggregations and scheduled computations 77
 class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
   extends AggregationActor {
   import settings._

   val stream = KafkaUtils.createStream(
       ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
     .map(_._2.split(","))
     .map(RawWeatherData(_))

   stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

   stream
     .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
     .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
 }
  • 80. @helenaedelson Data Model (simplified) 78
 CREATE TABLE weather.raw_data (
   wsid text, year int, month int, day int, hour int,
   temperature double, dewpoint double, pressure double,
   wind_direction int, wind_speed double, one_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

 CREATE TABLE daily_aggregate_precip (
   wsid text,
   year int,
   month int,
   day int,
   precipitation counter,
   PRIMARY KEY ((wsid), year, month, day)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
  • 81. @helenaedelson 79
 class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
   extends AggregationActor {
   import settings._

   val stream = KafkaUtils.createStream(
       ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
     .map(_._2.split(","))
     .map(RawWeatherData(_))

   stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

   stream
     .map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
     .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
 }
 • Gets the partition key (wsid): Data Locality - the Spark Cassandra Connector feeds this to Spark
 • Counter column in our schema: no expensive `reduceByKey` needed - simply let C* do the aggregation, not expensive and fast
  • 82. @helenaedelson The Thing About S3 "Amazon S3 is a simple key-value store" - docs.aws.amazon.com/AmazonS3/latest/dev/UsingObjects.html • Keys 2015/05/01 and 2015/05/02 do not live in the “same place” • You can roll your own with AmazonS3Client and do the heavy lifting yourself and throw that data into Spark 80
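 A rough sketch of "rolling your own": list the keys for a day with AmazonS3Client, then hand the paths to Spark. The bucket, prefix and path scheme are assumptions, and pagination of the listing is ignored for brevity:

 import scala.collection.JavaConverters._
 import com.amazonaws.services.s3.AmazonS3Client
 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("s3-backfill"))
 val s3 = new AmazonS3Client()

 // enumerate the object keys under one "day" prefix
 val keys = s3.listObjects("my-event-bucket", "2015/05/01")
   .getObjectSummaries.asScala.map(_.getKey)

 // S3 gives you no locality or clustering - just bytes behind keys
 val events = sc.textFile(keys.map(k => s"s3n://my-event-bucket/$k").mkString(","))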
  • 83. @helenaedelson C* Clustering Columns: Timeseries Data 81
 CREATE TABLE weather.raw_data (
   wsid text, year int, month int, day int, hour int,
   temperature double, dewpoint double, pressure double,
   wind_direction int, wind_speed double, one_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
 • Writes by most recent, reads return most recent first
 • Cassandra will automatically sort by most recent for both write and read
  • 84. @helenaedelson Record Every Event In The Order In Which It Happened, Per URL 82
 val multipleStreams = (1 to numstreams).map { i =>
   streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))
 }
 streamingContext.union(multipleStreams)
   .map { httpRequest => TimelineRequestEvent(httpRequest) }
   .saveToCassandra("requests_ks", "timeline")

 CREATE TABLE IF NOT EXISTS requests_ks.timeline (
   timesegment bigint, url text, t_uuid timeuuid, method text,
   headers map<text, text>, body text,
   PRIMARY KEY ((url, timesegment), t_uuid)
 );
 • timesegment protects from writing unbounded partitions
 • timeuuid protects simultaneous events from over-writing one another
  • 85. @helenaedelson Cassandra & Spark Streaming: Data Locality For Free® 83
 val stream = KafkaUtils.createDirectStream(...)
   .map(_._2.split(","))
   .map(RawWeatherData(_))

 stream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

 stream
   .map(hour => (hour.id, hour.year, hour.month, hour.day, hour.oneHourPrecip))
   .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
 • Replay and reprocess - any time
 • Data is on the nodes doing the querying - Spark Cassandra Connector partitions
 • Timeseries data with Data Locality: co-located Spark + Cassandra nodes - S3 does not give you that
  • 86. @helenaedelson Compute Isolation: Actor - queries pre-aggregated tables from the stream 84
 class PrecipitationActor(ssc: StreamingContext, settings: Settings) extends AggregationActor {
   import akka.pattern.pipe

   def receive: Actor.Receive = {
     case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
   }

   /** Returns the `k` highest precipitation values for a station in the `year`. */
   def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
     val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
       ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

     ssc.cassandraTable[Double](keyspace, dailytable)
       .select("precipitation")
       .where("wsid = ? AND year = ?", wsid, year)
       .collectAsync().map(toTopK) pipeTo requester
   }
 }
  • 87. @helenaedelson Efficient Batch Analysis 85
 class TemperatureActor(sc: SparkContext, settings: Settings) extends AggregationActor {
   import akka.pattern.pipe

   def receive: Actor.Receive = {
     case e: GetMonthlyHiLowTemperature => highLow(e, sender)
   }

   def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
     sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
       .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
       .collectAsync()
       .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
 }
 • C* data is automatically sorted by most recent - due to our data model. Additional Spark or collection sort not needed.
  • 89. @helenaedelson Everything On The Streaming Platform 87
  • 90. @helenaedelson Reprocessing • Start a new stream job to re-process from the beginning • Save re-processed data as a version table • Application should then read from new version table • Stop old version of the job, and delete the old table 88
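 A sketch of the first two steps, reusing the ingest code from earlier. The "_v2" table name and the offset-reset flag are illustrative assumptions, and kafkaParams, KafkaTopicRaw, RawWeatherData and CassandraKeyspace are assumed to be defined as in the previous snippets:

 import kafka.serializer.StringDecoder
 import org.apache.spark.streaming.kafka.KafkaUtils

 // start a fresh job that reads the topic from the earliest available offsets
 val reprocessParams = kafkaParams + ("auto.offset.reset" -> "smallest")

 KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
     ssc, reprocessParams, Set(KafkaTopicRaw))
   .map(_._2.split(","))
   .map(RawWeatherData(_))
   .saveToCassandra(CassandraKeyspace, "raw_weather_data_v2")   // new version table

 // readers switch to raw_weather_data_v2 once it has caught up,
 // then the old job is stopped and the old table dropped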
  • 91. @helenaedelson One Pipeline For Fast & Big Data 89 • How do I make the SMACK stack work for ML, Ad-Hoc + Fast Data? • How do I combine Spark Streaming + Ad Hoc and have good performance?
  • 92. @helenaedelson FiloDB Designed to ingest streaming data, including machine, event, and time-series data, and run very fast analytical queries over them. 90 • Distributed, versioned, columnar analytics database • Built for fast streaming analytics & OLAP • Currently based on Apache Cassandra & Spark • github.com/tuplejump/FiloDB
  • 93. @helenaedelson Breakthrough Performance For Analytical Queries • Queries run in parallel in Spark for scale-out ad-hoc analysis • Fast for interactive data science and ad hoc queries • Up to 200x Faster Queries for Spark on Cassandra 2.x • Parquet Performance with Cassandra Flexibility • Increased performance ceiling coming 91
  • 94. @helenaedelson Versioning & Why It Matters • Immutability • Databases: let's mutate one giant piece of state in place • Basically hasn't changed since the 1970s! • With Big Data and streaming, incremental processing is increasingly important 92
  • 95. @helenaedelson FiloDB Versioning FiloDB is built on functional principles and lets you version and layer changes • Incrementally add a column or a few rows as a new version • Add changes as new versions, don't mutate! • Writes are idempotent - exactly once ingestion • Easily control what versions to query • Roll back changes inexpensively • Stream out new versions as continuous queries :) 93
  • 96. @helenaedelson No Cassandra? Keep All In Memory 94 • Unlike RDDs and DataFrames, FiloDB can ingest new data, and still be fast • Unlike RDDs, FiloDB can filter in multiple ways, no need for entire table scan
  • 97. @helenaedelson Spark Streaming to FiloDB 95
 val ratingsStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
   ssc, kafkaParams, topics)

 ratingsStream.foreachRDD { (message: RDD[(String, String)], batchTime: Time) =>
   val df = message
     .map(_._2.split(","))
     .map(rating => Rating(trim(rating)))
     .toDF("fromuserid", "touserid", "rating")

   // add the batch time to the DataFrame
   val dfWithBatchTime = df.withColumn(
     "batch_time", org.apache.spark.sql.functions.lit(batchTime.milliseconds))

   // save the DataFrame to FiloDB
   dfWithBatchTime.write.format("filodb.spark")
     .option("dataset", "ratings")
     .save()
 }
 // to save to Cassandra instead, swap the sink format:
 // dfWithBatchTime.write.format("org.apache.spark.sql.cassandra")
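 Reading the same dataset back for ad hoc queries - a sketch that assumes the filodb.spark data source serves reads the same way it serves the write above, with sqlContext already in scope:

 val ratings = sqlContext.read.format("filodb.spark")
   .option("dataset", "ratings")
   .load()

 // ad hoc query over the ingested stream, in the same runtime
 ratings.filter(ratings("rating") > 4)
   .groupBy("touserid")
   .count()
   .show()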
  • 98. @helenaedelson Architectyr? "This is a giant mess" - Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps 96
  • 103. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/stream- analytics-scalability