2. 2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Questions? tiny.cloudera.com/streaming-singapore-questions
3. 3
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years
– Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner
Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
7. 7
When to stream, and when not to
• We are looking for an SLA sweet spot
• Multiple milliseconds to seconds
• Not minutes
• Not consistently low milliseconds or less
• It doesn’t come for free
12. 12
But there are multiple sources
[Diagram: Source Systems 1, 2 and 3 are each ingested into the destination system]
13. 13
But...
• Sources, sinks and ingestion channels may go down
• Sources and sinks may be producing/consuming at different rates
• Regular maintenance windows may need to be scheduled
• We need a resilient message broker
14. 14
Need for a message broker
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker, and the destination system extracts from the broker]
16. 16
But ‘queue’ doesn’t ‘push’
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker; a streaming ingestion process extracts from the broker and pushes to the storage system]
17. 17
Streaming data ingestion process
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker; a streaming ingestion process, labeled Kafka Connect and Apache Flume, extracts from the broker and pushes to the storage system]
18. 18
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker; a streaming ingestion process, labeled Kafka Connect and Apache Flume, extracts from the broker and pushes to the storage system]
20. 20
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker; a streaming ingestion process, labeled Kafka Connect and Apache Flume, extracts from the broker and pushes to the storage system]
21. 21
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker; a streaming ingestion process, labeled Kafka Connect and Apache Flume, extracts from the broker and pushes to the storage system. Annotation: can be used to do simple transformations]
22. 22
Two types of transformations
Atomic
• Need to work with one event at a time
• Example: mask a credit card number
With context
• Need to refer to external context
• Example: convert a zip code to a state by looking up a cache
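The two kinds of transformation can be sketched in plain Scala. This is a minimal illustration and not Flume or Kafka Connect code: the `Event` case class, the `maskCard` and `addState` helpers, and the two-entry zip-to-state map standing in for a cache are all hypothetical names introduced here.

```scala
// Hypothetical event type, for illustration only.
final case class Event(cardNumber: String, zip: String, state: Option[String] = None)

object Transformations {
  // Atomic: needs only the event itself. Keeps the last four digits visible.
  def maskCard(e: Event): Event = {
    val digits = e.cardNumber.filter(_.isDigit)
    val masked = ("*" * math.max(0, digits.length - 4)) + digits.takeRight(4)
    e.copy(cardNumber = masked)
  }

  // With context: needs an external lookup. An in-memory Map stands in for the cache.
  def addState(e: Event, zipToState: Map[String, String]): Event =
    e.copy(state = zipToState.get(e.zip))
}

object TransformationsDemo {
  def main(args: Array[String]): Unit = {
    val cache = Map("94105" -> "CA", "10001" -> "NY") // assumed sample data
    val in = Event("4111-1111-1111-1111", "94105")
    println(Transformations.addState(Transformations.maskCard(in), cache))
  }
}
```

The atomic step can run anywhere an event passes through, while the contextual step is only as fast and as fresh as the cache it consults.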
23. 23
Atomic transformations
• Require no context
• Can be done simply within Flume interceptors, Kafka Connect or Spark Streaming
24. 24
Flume Interceptors
• Mask fields
• Validate information against an external source
• Extract fields
• Modify data format
• Filter or split events
25. 25
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker; a streaming ingestion process, labeled Kafka Connect and Apache Flume, extracts from the broker and pushes to the storage system. Annotation: can be used to do simple transformations]
28. 28
Streaming architecture for ingestion
[Diagram: Source Systems 1, 2 and 3 are ingested into a message broker; a streaming ingestion process, labeled Copycat (the earlier name for Kafka Connect) and Apache Flume, extracts from the broker and pushes to the storage system. Annotation: can be used to do simple transformations]
29. 29
Semantic types
• At most once
– Not good for many cases
– Only where performance/SLA is more important than accuracy
• Exactly once
– Expensive to achieve but desirable
• At least once
– Easiest to achieve
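The difference between at-most-once and at-least-once delivery can be sketched with a toy sender and an unreliable sink. This is an illustrative model only, not the behavior of any particular broker: `FlakySink`, its two failure sets, and the `Delivery` helpers are hypothetical names introduced here.

```scala
import scala.collection.mutable

// Hypothetical sink. On the FIRST attempt for a message it may lose the
// message in transit (dropFirstSend) or store it but lose the ack (dropFirstAck).
final class FlakySink(dropFirstSend: Set[String], dropFirstAck: Set[String]) {
  val stored = mutable.ArrayBuffer.empty[String]
  private val attempts = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Returns true only if the sender sees an acknowledgement.
  def deliver(msg: String): Boolean = {
    attempts(msg) += 1
    val first = attempts(msg) == 1
    if (first && dropFirstSend(msg)) false // lost in transit, nothing stored
    else {
      stored += msg
      !(first && dropFirstAck(msg))        // stored, but the ack may be lost
    }
  }
}

object Delivery {
  // At most once: send and never retry, so a lost message stays lost.
  def atMostOnce(sink: FlakySink, msgs: Seq[String]): Unit =
    msgs.foreach(m => sink.deliver(m))

  // At least once: resend until acknowledged, so a lost ack causes a duplicate.
  def atLeastOnce(sink: FlakySink, msgs: Seq[String]): Unit =
    msgs.foreach { m => while (!sink.deliver(m)) () }
}
```

In this model at-most-once drops the message that was lost in transit, while at-least-once delivers everything but stores the message whose acknowledgement was lost twice. Exactly once requires closing that last gap.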
30. 30
Categories of storage systems
“Puts” based
• Records can be re-inserted without side effects, since a re-inserted record has the same key as the original and overwrites it
“Appends” based
• Records cannot be re-inserted without creating duplicates
31. 31
How to achieve exactly once?
• For “puts” based storage systems
– At least once is enough (keys have to be unique, though, i.e. a primary key)
– Re-inserted records will have duplicate keys
– They will simply overwrite the existing record with the same value
• For “appends” based storage systems (e.g. HDFS)
– Still easiest to do at least once
– Need to de-duplicate before processing
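A minimal sketch of both cases, assuming at-least-once delivery has already redelivered one record. Plain Scala collections stand in for the storage systems here; `Record`, `putAll`, `appendAll` and `dedupe` are hypothetical names, not HBase or HDFS APIs.

```scala
import scala.collection.mutable

// Hypothetical record with a unique key (for example a primary key).
final case class Record(key: String, value: String)

object ExactlyOnce {
  // "Puts"-based store: writing the same key again overwrites the existing record.
  def putAll(records: Seq[Record]): Map[String, String] =
    records.foldLeft(Map.empty[String, String])((store, r) => store.updated(r.key, r.value))

  // "Appends"-based store: every write is kept, so redelivery leaves duplicates.
  def appendAll(records: Seq[Record]): Vector[Record] = records.toVector

  // De-duplicate appended data by key before processing, keeping the first copy.
  def dedupe(appended: Seq[Record]): Seq[Record] = {
    val seen = mutable.HashSet.empty[String]
    appended.filter(r => seen.add(r.key)) // add returns false for a key already seen
  }
}
```

With the put-based store the duplicate disappears on write. With the append-based store the cost moves to read time, where every consumer has to de-duplicate first.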
37. 37
Why did streaming suck?
• Increments with Cassandra
• Double increment
• No strong consistency
• Storm without Kafka
• Not exactly once
• Not at least once
• Batch would have to re-process EVERY record to remove duplicates
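The double-increment problem can be illustrated without any database. In this sketch an immutable Scala `Map` stands in for a counter table, and `increment` and `put` are hypothetical helpers rather than Cassandra or HBase calls. The point is only that replaying an increment changes the result, while replaying an absolute write does not.

```scala
object Counters {
  // Increment-style update: applying the same event twice counts it twice.
  def increment(counts: Map[String, Long], key: String, delta: Long): Map[String, Long] =
    counts.updated(key, counts.getOrElse(key, 0L) + delta)

  // Put-style update: writing the absolute total is safe to replay.
  def put(counts: Map[String, Long], key: String, total: Long): Map[String, Long] =
    counts.updated(key, total)
}

object CountersDemo {
  def main(args: Array[String]): Unit = {
    val start = Map("clicks" -> 10L)
    // The same "+1" event is delivered twice under at-least-once semantics.
    val viaIncrement = Counters.increment(Counters.increment(start, "clicks", 1L), "clicks", 1L)
    // The same "total is 11" write is delivered twice.
    val viaPut = Counters.put(Counters.put(start, "clicks", 11L), "clicks", 11L)
    println(s"increment replayed: $viaIncrement, put replayed: $viaPut")
  }
}
```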
38. 38
We have come a long way
• We don’t have to use increments any more, and we can have consistency
• HBase
• We can have state in our streaming platform
• Spark Streaming
• We don’t lose data
• Spark Streaming
• Kafka
• Other options
• Full-universe de-duplication
• Again, HBase with versions
41. 41
Advanced Streaming
• Ad-hoc analysis will identify value
• Ad-hoc will become batch
• The value will demand less latency from batch
• Batch will become streaming
42. 42
Advanced Streaming
• Requirements for ideal batch-to-streaming frameworks
• Something that can span both paradigms
• Something that can use the tools of ad-hoc
• SQL
• MLlib
• R
• Scala
• Java
• Development through a common IDE
• Debugging
• Unit Testing
• Common deployment model
43. 43
Spark Streaming Example
import org.apache.spark.{SparkConf, SparkContext}

// Batch word count over a text file (path is assumed to be defined elsewhere)
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// RDDs have no print(); collect the counts to the driver and print them
wordCounts.collect().foreach(println)
44. 44
Spark Streaming Example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()             // start the computation
ssc.awaitTermination()  // block until the job is stopped
46. 46
Advanced Streaming
• In Spark Streaming
• A DStream is a sequence of RDDs, one per micro-batch interval
• If we can access RDDs in Spark Streaming
• We can convert to Vectors
• KMeans
• Principal component analysis
• We can convert to LabeledPoint
• NaiveBayes
• Random Forest
• Linear Support Vector Machines
• We can convert to DataFrames
• SQL
• R