Spark Streaming
State of the Union and Beyond
Tathagata “TD” Das
@tathadas
Feb 19, 2015
Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
What is Databricks?
Founded by the creators of Spark in 2013
Largest organization contributing to Spark
End-to-end hosted service, Databricks Cloud
What is Spark Streaming?
Spark Streaming
Scalable, fault-tolerant stream processing system
Inputs: Kafka, Flume, Kinesis, HDFS/S3, Twitter
Outputs: file systems, databases, dashboards
High-level API: joins, windows, …; often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
What can you use it for?
Real-time fraud detection in transactions
React to anomalies in sensors in real-time
Cat videos in tweets as soon as they go viral
How does it work?
Data streams are chopped up into batches
Each batch is processed in Spark
Results pushed out in batches
[Diagram: data streams → receivers → batches → results]
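The three steps above can be sketched in plain Python, without any Spark dependency. This is only an illustration of the micro-batch idea: the helper names `micro_batches` and `word_count` are made up for this sketch, and batches are cut by record count here, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Chop an iterable stream into consecutive lists of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def word_count(batch):
    """Process one batch: split each line into words and count them."""
    return Counter(word for line in batch for word in line.split(" "))


# Hypothetical input standing in for a live data stream.
lines = ["a b", "b c", "c c", "a"]

# Each batch is processed independently and yields its own result.
results = [word_count(batch) for batch in micro_batches(lines, batch_size=2)]
for counts in results:
    print(dict(counts))
```

Each element of `results` corresponds to one batch, which mirrors how results are pushed out batch by batch.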
Streaming Word Count
// Create a DStream from data received over a socket
val lines = ssc.socketTextStream("localhost", 9999)
// Split lines into words
val words = lines.flatMap(_.split(" "))
// Count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// Print some counts on screen
wordCounts.print()
// Start processing the stream
ssc.start()
Word Count
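import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations (needed before Spark 1.3)

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    // Process the stream in 1-second batches
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    // Block until the computation is stopped
    ssc.awaitTermination()
  }
}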
Word Count: Spark Streaming vs. Storm
The Spark Streaming version is the NetworkWordCount program shown above. The Storm version of the same word count follows.
public class WordCountTopology {
public static class SplitSentence extends ShellBolt implements IRichBolt {
public SplitSentence() {
super("python", "splitsentence.py");
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
public static class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<String, Integer>();
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count++;
counts.put(word, count);
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
Storm
public static void main(String[] args) throws Exception {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
Config conf = new Config();
conf.setDebug(true);
if (args != null && args.length > 0) {
conf.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
}
else {
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();
}
}
}
Languages
Can natively use Scala, Java, and Python
Can use any other language by using pipe()
Integrates with Spark Ecosystem
[Diagram: Spark Streaming, Spark SQL, MLlib, and GraphX built on top of Spark Core]
Combine batch and streaming processing
Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset)
    .filter( ... )
}
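As a rough illustration of what joining each batch with a static data set means, here is a plain-Python sketch with no Spark dependency. The `join` helper, the keys, and the sample values are hypothetical. Unlike a real RDD join, the static side here is a dictionary holding a single value per key.

```python
def join(batch, dataset):
    """Inner-join (key, value) pairs in a batch against a static key -> value lookup."""
    return [(key, (value, dataset[key])) for key, value in batch if key in dataset]


# Static data set, loaded once (hypothetical user tiers).
static_dataset = {"u1": "gold", "u2": "silver"}

# Two incoming batches of (user, event) pairs (hypothetical events).
batches = [[("u1", "click"), ("u3", "view")], [("u2", "buy")]]

# The same join is applied to every batch as it arrives.
joined = [join(batch, static_dataset) for batch in batches]
print(joined)
```

Records whose key is missing from the static data set (here `u3`) are dropped, as in an inner join.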
Combine machine learning with streaming
Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
model.predict(event.feature)
}
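The "learn offline, apply online" pattern can be sketched in plain Python as follows. The `NearestCentroidModel` class and the cluster centers are invented for this illustration and stand in for a model that was trained offline. This is not MLlib's KMeans API.

```python
import math


class NearestCentroidModel:
    """Stand-in for an offline-trained clustering model (hypothetical)."""

    def __init__(self, centers):
        self.centers = centers

    def predict(self, point):
        """Return the index of the center closest to the point."""
        distances = [math.dist(point, center) for center in self.centers]
        return distances.index(min(distances))


# Assume these centers came out of an offline training run.
model = NearestCentroidModel([(0.0, 0.0), (10.0, 10.0)])

# Apply the fixed model to every event of every incoming batch.
batches = [[(1.0, 1.0), (9.0, 8.0)], [(0.5, -0.2)]]
predictions = [[model.predict(point) for point in batch] for batch in batches]
print(predictions)
```

The model is never updated while the stream runs; only prediction happens online.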
Combine SQL with streaming
Interactively query streaming data with SQL
// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}
// Interactively query table
sqlContext.sql("select * from latestEvents")
History
Late 2011 – research idea
AMPLab, UC Berkeley
"We need to make Spark faster"
"Okay...umm, how??!?!"
History
Late 2011 – idea
AMPLab, UC Berkeley
Q2 2012 – prototype
Rewrote large parts of Spark core
Smallest job - 900 ms → <50 ms
Q3 2012
Spark core improvements
open sourced in Spark 0.6
Feb 2013 – Alpha release
7.7k lines, merged in 7 days
Released with Spark 0.7
Jan 2014 – Stable release
Graduation with Spark 0.9
Current state of Spark Streaming
Adoption
Roadmap
Development
What have we added in the last year?
Python API
Core functionality in Spark 1.2,
with sockets and files as sources
Kafka support coming in Spark 1.3
Other sources coming in future
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()
Streaming MLlib algorithms
val model = new StreamingKMeans()
.setK(args(3).toInt)
.setDecayFactor(1.0)
.setRandomCenters(args(4).toInt, 0.0)
// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(testDStream.map { lp =>
(lp.label, lp.features) } ).print()
Continuous learning and prediction on streaming data
StreamingLinearRegression in Spark 1.1
StreamingKMeans in Spark 1.2
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
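To give a feel for what a decay factor does in continuous learning, here is a plain-Python sketch of a decay-weighted update for a single cluster center. The function name `update_center` and the exact update rule are assumptions made for illustration. MLlib's StreamingKMeans may differ in detail, and a full algorithm would first assign each point to its nearest center.

```python
def update_center(center, weight, batch_points, decay):
    """Blend an existing center with the mean of a new batch.

    weight is the effective number of points seen so far; decay in [0, 1]
    discounts that history before the new batch is mixed in.
    """
    discounted = weight * decay
    count = len(batch_points)
    if count == 0:
        return center, discounted
    dims = range(len(center))
    batch_mean = [sum(point[i] for point in batch_points) / count for i in dims]
    new_weight = discounted + count
    new_center = [
        (center[i] * discounted + batch_mean[i] * count) / new_weight for i in dims
    ]
    return new_center, new_weight


# decay = 1.0 keeps all history: old and new data count equally per point.
kept, kept_weight = update_center([0.0], 2.0, [[3.0], [3.0]], decay=1.0)

# decay = 0.0 forgets history: the center jumps to the latest batch mean.
forgot, forgot_weight = update_center([0.0], 2.0, [[3.0], [3.0]], decay=0.0)
print(kept, forgot)
```

A decay factor between 0 and 1 trades stability against responsiveness to recent data.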
Other library additions
Amazon Kinesis integration [Spark 1.1]
More fault-tolerant Flume integration [Spark 1.1]
New Kafka API for more native integration [Spark 1.3]
System Infrastructure
Automated driver fault-tolerance [Spark 1.0]
Graceful shutdown [Spark 1.0]
Write Ahead Logs for zero data loss [Spark 1.2]
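The idea behind a write ahead log can be sketched in plain Python: persist each received record durably before keeping it in memory, so that it can be replayed after a failure. The `WriteAheadLog` class, the `receive` helper, and the file name are hypothetical. This is not how Spark Streaming implements its logs.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log file with one JSON record per line (hypothetical)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the record to disk before returning

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


def receive(record, wal, buffer):
    """Log the record durably first, then keep it in memory for processing."""
    wal.append(record)
    buffer.append(record)


log_dir = tempfile.mkdtemp()
wal = WriteAheadLog(os.path.join(log_dir, "received.log"))
buffer = []
for record in ["event-1", "event-2"]:
    receive(record, wal, buffer)

buffer = []  # simulate a crash that wipes in-memory data
recovered = wal.replay()
print(recovered)
```

Because the log is written before the in-memory buffer, anything lost in the simulated crash can be recovered from disk.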
Contributors to Streaming
[Bar chart: number of contributors to Spark Streaming per release, Spark 0.9 through Spark 1.2; y-axis 0 to 40]
Contributors - Full Picture
[Bar chart: number of contributors per release, Spark 0.9 through Spark 1.2, comparing Streaming alone with Core + Streaming (w/o SQL, MLlib, …); y-axis 0 to 120]
All contributions to core Spark directly improve Spark Streaming
Spark Packages
More contributions from the
community in spark-packages
Alternate Kafka receiver
Apache Camel receiver
Cassandra examples
http://spark-packages.org/
Who is using Spark Streaming?
Spark Summit 2014 Survey
40% of Spark users were using Spark Streaming in production or prototyping
Another 39% were evaluating it
Production: 9%
Prototyping: 31%
Evaluating: 39%
Not using: 21%
80+ known deployments
Intel China builds big data solutions for large enterprises
Multiple streaming applications for different businesses
Real-time risk analysis for a top online payment company
Real-time deal and flow metric reporting for a top online shopping company
Complicated stream processing
SQL queries on streams
Join streams with large historical datasets
> 1TB/day passing through Spark Streaming
Stack: YARN, Spark Streaming, Kafka, RocketMQ, HBase
One of the largest publishing and education companies, which wants to accelerate its push into digital learning
Needed to combine student activities and domain events to continuously update the learning model of each student
Earlier implementation in Storm, but has now moved to Spark Streaming
Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing
Stack: YARN, Spark Streaming, Kafka, Cassandra, Apache Blur
More information: http://dbricks.co/1BnFZZ8
Leading advertising automation company with an exchange platform for in-feed ads
Process clickstream data for optimizing real-time bidding for ads
Stack: Mesos+Marathon, Spark Streaming, Kinesis, MySQL, Redis, RabbitMQ, SQS
http://techblog.netflix.com/2015/02/whats-trending-on-netflix.html
http://goo.gl/mJNf8X
Neuroscience @ Freeman Lab, Janelia Farm
Spark Streaming and MLlib to analyze neural activities
Laser microscope scans Zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!
http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
Neuroscience @ Freeman Lab, Janelia Farm
Streaming machine learning algorithms on time series data of every neuron
2TB/hour and increasing with brain size
80 HPC nodes
Why are they adopting Spark Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
What’s coming next?
Beyond Spark 1.3
Libraries
Streaming machine learning algorithms
A/B testing
Online Latent Dirichlet Allocation (LDA)
More streaming linear algorithms
Streaming + SQL, Streaming + DataFrames
Beyond Spark 1.3
Operational Ease
Better flow control
Elastic scaling
Cross-version upgradability
Improved support for non-Hadoop environments
Beyond Spark 1.3
Performance
Higher throughput, especially of stateful operations
Lower latencies
Easy deployment of streaming apps in Databricks Cloud!
You can help!
Roadmaps are heavily driven by community feedback
We have listened to community demands over the last year
Write Ahead Logs for zero data loss
New Kafka integration for stronger semantics
Let us know what you want to see in Spark Streaming
Spark user mailing list, tweet it to me @tathadas
Takeaways
Spark Streaming is a scalable, fault-tolerant stream processing system with a high-level API and a rich set of libraries
80+ deployments in the industry
More libraries and operational ease in the roadmap
Backup slides
Typesafe survey of Spark users
2136 developers, data scientists,
and other tech professionals
http://java.dzone.com/articles/apache-spark-survey-typesafe-0
Typesafe survey of Spark users
65% of Spark users are interested
in Spark Streaming
Typesafe survey of Spark users
2/3 of Spark users want to process
event streams
More use cases
•  Big data solution provider for enterprises
•  Multiple applications for different businesses
-  Monitoring + optimizing online services of Tier-1 bank
-  Fraudulent transaction detection for Tier-2 bank
•  Kafka → Spark Streaming → Cassandra, MongoDB
•  Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB
•  Provides data analytics solutions for Communication
Service Providers
-  4 of 5 top mobile ops, 3 of 4 top internet backbone providers
-  Processes >50% of all US mobile traffic
•  Multiple applications for different businesses
-  Real-time anomaly detection in cell tower traffic
-  Real-time call quality optimizations
•  Kafka → Spark Streaming
http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
•  Runs claims processing applications for healthcare providers
http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims
•  Predictive models can look for claims that are likely to be held up for approval
•  Spark Streaming allows model scoring in seconds instead of hours

 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Spark streaming State of the Union - Strata San Jose 2015

  • 1. Spark Streaming State of the Union and Beyond Tathagata “TD” Das @tathadas Feb 19, 2015
  • 2. Who am I? Project Management Committee (PMC) member of Spark Lead developer of Spark Streaming Formerly in AMPLab, UC Berkeley Software developer at Databricks
  • 3. What is Databricks? Founded by the creators of Spark in 2013. Largest organization contributing to Spark. End-to-end hosted service, Databricks Cloud.
  • 5. Spark Streaming: scalable, fault-tolerant stream processing system. Sources: Kafka, Flume, Kinesis, HDFS/S3, Twitter. Sinks: file systems, databases, dashboards. High-level API: joins, windows, …; often 5x less code. Fault-tolerant: exactly-once semantics, even for stateful ops. Integration: integrate with MLlib, SQL, DataFrames, GraphX.
  • 6. What can you use it for? Real-time fraud detection in transactions. React to anomalies in sensors in real time. Cat videos in tweets as soon as they go viral.
  • 7. How does it work? Data streams are chopped up into batches. Each batch is processed in Spark. Results are pushed out in batches. [Diagram: data streams → receivers → batches → Spark → results]
  • 8. Streaming Word Count
      val lines = ssc.socketTextStream("localhost", 9999)          // create DStream from data over socket
      val words = lines.flatMap(_.split(" "))                      // split lines into words
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)   // count the words
      wordCounts.print()                                           // print some counts on screen
      ssc.start()                                                  // start processing the stream
  • 9. Word Count
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object NetworkWordCount {
        def main(args: Array[String]) {
          val sparkConf = new SparkConf().setAppName("NetworkWordCount")
          // Streaming context with a 1-second batch interval
          val ssc = new StreamingContext(sparkConf, Seconds(1))
          val lines = ssc.socketTextStream("localhost", 9999)
          val words = lines.flatMap(_.split(" "))
          val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
          wordCounts.print()
          ssc.start()
          ssc.awaitTermination()
        }
      }
  • 10. Word Count: Spark Streaming vs. Storm
      Spark Streaming:
      object NetworkWordCount {
        def main(args: Array[String]) {
          val sparkConf = new SparkConf().setAppName("NetworkWordCount")
          val ssc = new StreamingContext(sparkConf, Seconds(1))
          val lines = ssc.socketTextStream("localhost", 9999)
          val words = lines.flatMap(_.split(" "))
          val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
          wordCounts.print()
          ssc.start()
          ssc.awaitTermination()
        }
      }
      Storm:
      public class WordCountTopology {
        public static class SplitSentence extends ShellBolt implements IRichBolt {
          public SplitSentence() {
            super("python", "splitsentence.py");
          }
          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
          }
          @Override
          public Map<String, Object> getComponentConfiguration() {
            return null;
          }
        }
        public static class WordCount extends BaseBasicBolt {
          Map<String, Integer> counts = new HashMap<String, Integer>();
          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) count = 0;
            count++;
            counts.put(word, count);
            collector.emit(new Values(word, count));
          }
          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
          }
        }
        public static void main(String[] args) throws Exception {
          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("spout", new RandomSentenceSpout(), 5);
          builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
          builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
          Config conf = new Config();
          conf.setDebug(true);
          if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
          } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
          }
        }
      }
  • 11. Languages: natively supported languages (shown as logos on the slide). Any other language can be used via pipe().
  • 12. Integrates with Spark Ecosystem: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
  • 13. Combine batch and streaming processing: join data streams with static data sets
      // Create data set from Hadoop file
      val dataset = sparkContext.hadoopFile("file")
      // Join each batch in stream with the dataset
      kafkaStream.transform { batchRDD =>
        batchRDD.join(dataset).filter( ... )
      }
  • 14. Combine machine learning with streaming: learn models offline, apply them online
      // Learn model offline
      val model = KMeans.train(dataset, ...)
      // Apply model online on stream
      kafkaStream.map { event => model.predict(event.feature) }
  • 15. Combine SQL with streaming: interactively query streaming data with SQL
      // Register each batch in stream as table (foreachRDD runs once per batch RDD)
      kafkaStream.foreachRDD { batchRDD => batchRDD.registerTempTable("latestEvents") }
      // Interactively query table
      sqlContext.sql("select * from latestEvents")
  • 16. History: Late 2011 – research idea (AMPLab, UC Berkeley). "We need to make Spark faster." "Okay...umm, how??!?!"
  • 17. History: Late 2011 – idea (AMPLab, UC Berkeley). Q2 2012 – prototype: rewrote large parts of Spark core; smallest job went from 900 ms to <50 ms. Q3 2012 – Spark core improvements open sourced in Spark 0.6. Feb 2013 – alpha release: 7.7k lines, merged in 7 days; released with Spark 0.7.
  • 18. History: Late 2011 – idea (AMPLab, UC Berkeley). Q2 2012 – prototype: rewrote large parts of Spark core; smallest job went from 900 ms to <50 ms. Q3 2012 – Spark core improvements open sourced in Spark 0.6. Feb 2013 – alpha release: 7.7k lines, merged in 7 days; released with Spark 0.7. Jan 2014 – stable release: graduation with Spark 0.9.
  • 21. What have we added in the last year?
  • 22. Python API: core functionality in Spark 1.2, with sockets and files as sources. Kafka support coming in Spark 1.3. Other sources coming in future.
      lines = ssc.socketTextStream("localhost", 9999)
      counts = lines.flatMap(lambda line: line.split(" ")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)
      counts.pprint()
  • 23. Streaming MLlib algorithms: continuous learning and prediction on streaming data. StreamingLinearRegression in Spark 1.1, StreamingKMeans in Spark 1.2.
      val model = new StreamingKMeans()
        .setK(args(3).toInt)
        .setDecayFactor(1.0)
        .setRandomCenters(args(4).toInt, 0.0)
      // Apply model to DStreams
      model.trainOn(trainingDStream)
      model.predictOnValues(testDStream.map { lp => (lp.label, lp.features) }).print()
      https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
  • 24. Other library additions: Amazon Kinesis integration [Spark 1.1]. More fault-tolerant Flume integration [Spark 1.1]. New Kafka API for more native integration [Spark 1.3].
  • 25. System Infrastructure: Automated driver fault-tolerance [Spark 1.0]. Graceful shutdown [Spark 1.0]. Write Ahead Logs for zero data loss [Spark 1.2].
  • 26. Contributors to Streaming [Bar chart: number of contributors per release, Spark 0.9 through Spark 1.2; vertical axis 0 to 40]
  • 27. Contributors - Full Picture [Bar chart: Streaming vs. Core + Streaming (w/o SQL, MLlib, …) contributors per release, Spark 0.9 through Spark 1.2; vertical axis 0 to 120] All contributions to core Spark directly improve Spark Streaming.
  • 28. Spark Packages: more contributions from the community in spark-packages: alternate Kafka receiver, Apache Camel receiver, Cassandra examples. http://spark-packages.org/
  • 29. Who is using Spark Streaming?
  • 30. Spark Summit 2014 Survey: 40% of Spark users were using Spark Streaming in production or prototyping; another 39% were evaluating it. Breakdown: production 9%, prototyping 31%, evaluating 39%, not using 21%.
  • 33. Intel China builds big data solutions for large enterprises Multiple streaming applications for different businesses Real-time risk analysis for a top online payment company Real-time deal and flow metric reporting for a top online shopping company
  • 34. Complicated stream processing: SQL queries on streams; join streams with large historical datasets; > 1TB/day passing through Spark Streaming. Stack: Kafka, RocketMQ, Spark Streaming on YARN, HBase.
  • 35. One of the largest publishing and education companies wants to accelerate its push into digital learning. Needed to combine student activities and domain events to continuously update the learning model of each student. Earlier implementation was in Storm, but it has now moved to Spark Streaming.
  • 36. Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing. Stack: Kafka, Spark Streaming on YARN, Cassandra, Apache Blur. More information: http://dbricks.co/1BnFZZ8
  • 37. Leading advertising automation company with an exchange platform for in-feed ads. Processes clickstream data to optimize real-time bidding for ads. Stack: Mesos + Marathon, Spark Streaming, Kinesis, MySQL, Redis, RabbitMQ, SQS.
  • 39. Neuroscience @ Freeman Lab, Janelia Farm: Spark Streaming and MLlib to analyze neural activities. Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons! http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
  • 40. Neuroscience @ Freeman Lab, Janelia Farm: streaming machine learning algorithms on time series data of every neuron. 2TB/hour and increasing with brain size. 80 HPC nodes.
  • 41. Why are they adopting Spark Streaming? Easy, high-level API. Unified API across batch and streaming. Integration with Spark SQL and MLlib. Ease of operations.
  • 43. Beyond Spark 1.3: Libraries. Streaming machine learning algorithms: A/B testing, Online Latent Dirichlet Allocation (LDA), more streaming linear algorithms. Streaming + SQL, Streaming + DataFrames.
  • 44. Beyond Spark 1.3: Operational Ease. Better flow control. Elastic scaling. Cross-version upgradability. Improved support for non-Hadoop environments.
  • 45. Beyond Spark 1.3: Performance. Higher throughput, especially of stateful operations. Lower latencies. Easy deployment of streaming apps in Databricks Cloud!
  • 46. You can help! Roadmaps are heavily driven by community feedback. We have listened to community demands over the last year: Write Ahead Logs for zero data loss, new Kafka integration for stronger semantics. Let us know what you want to see in Spark Streaming: Spark user mailing list, or tweet it to me @tathadas.
  • 47. Takeaways: Spark Streaming is a scalable, fault-tolerant stream processing system with a high-level API and a rich set of libraries. 80+ deployments in the industry. More libraries and operational ease on the roadmap.
  • 49. Typesafe survey of Spark users 2136 developers, data scientists, and other tech professionals http://java.dzone.com/articles/apache-spark-survey-typesafe-0
  • 50. Typesafe survey of Spark users 65% of Spark users are interested in Spark Streaming
  • 51. Typesafe survey of Spark users 2/3 of Spark users want to process event streams
  • 53. Big data solution provider for enterprises. Multiple applications for different businesses: monitoring and optimizing online services of a Tier-1 bank; fraudulent transaction detection for a Tier-2 bank. Kafka → Spark Streaming → Cassandra, MongoDB. Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB.
  • 54. Provides data analytics solutions for Communication Service Providers: 4 of 5 top mobile ops, 3 of 4 top internet backbone providers; processes >50% of all US mobile traffic. Multiple applications for different businesses: real-time anomaly detection in cell tower traffic; real-time call quality optimizations. Kafka → Spark Streaming. http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
  • 55. Runs claims processing applications for healthcare providers. Predictive models can look for claims that are likely to be held up for approval. Spark Streaming allows model scoring in seconds instead of hours. http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims
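
As a rough illustration of the micro-batch model from slide 7 (data streams are chopped up into batches, each batch is processed, results are pushed out in batches), here is a minimal plain-Python sketch. It does not use Spark and is not how Spark Streaming is implemented; the function names `chop_into_batches`, `word_count`, and `run`, and the sample events, are hypothetical and exist only for this example.

```python
from collections import Counter, defaultdict

def chop_into_batches(events, batch_interval):
    """Group (timestamp_seconds, line) events into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, line in events:
        # Events whose timestamps fall in the same interval share a batch index.
        batches[int(timestamp // batch_interval)].append(line)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]

def word_count(batch):
    """Process one batch: split lines into words and count them."""
    return Counter(word for line in batch for word in line.split(" ") if word)

def run(events, batch_interval=1.0):
    """Produce one result (a word-count table) per batch."""
    return [word_count(batch) for batch in chop_into_batches(events, batch_interval)]

# Hypothetical sample stream: (arrival time in seconds, text line).
events = [
    (0.1, "spark streaming"),
    (0.7, "spark"),
    (1.2, "streaming word count"),
]
results = run(events, batch_interval=1.0)
for batch_number, counts in enumerate(results):
    print(batch_number, dict(counts))
```

With a 1-second interval, the first two events land in the same batch and the third in the next one, so each result only reflects the words that arrived during its own interval. This mirrors the per-batch behaviour of the word count on slide 8.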