Yi Pan
Streams Team @LinkedIn
Committer and PMC Chair, Apache Samza
1
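class PageKeyViewsCounterTask implements StreamTask, InitableTask {
  // Fields are not shown on the slide; the system and stream names below are assumed for illustration.
  private final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
  private KeyValueStore<String, Counter> pageKeyViews;

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    int newCount = pageKeyViews.get(pageKey).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }

  public void init(Config config, TaskContext context) {
    pageKeyViews = (KeyValueStore<String, Counter>) context.getStore("myPageKeyViews");
  }
}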
[Figure: Task-0, Task-1 and Task-2, deployed via YARN]
• Pros
◦ Simple API
◦ Built-in support for state
◦ Leverages YARN for fault tolerance
◦ High performance (1.2 Mqps / host)
• Cons
◦ Not easy to write an end-to-end processing pipeline in a single program
◦ Deployment is tightly coupled with YARN
◦ No support for running as a batch job
• High-level API
• Flexible Deployment Model
• Convergence between Batch and Stream Processing
4
Application logic: Count PageViewEvent for each member in a 5 minute window
and send the counts to PageViewEventPerMemberStream
PageViewEvent -> re-partition by memberId -> window -> map -> sendTo -> PageViewEventPerMemberStream
5
Application in low-level API
Job-1: PageViewRepartitionTask
PageViewEvent -> re-partition -> PageViewEventByMemberId
Job-2: PageViewByMemberIdCounterTask
PageViewEventByMemberId -> window -> map -> sendTo -> PageViewEventPerMemberStream
6
• Job-1: Repartition job
public class PageViewRepartitionTask implements StreamTask {
  // Intermediate stream, keyed by memberId, that Job-2 consumes.
  private final SystemStream pageViewByMIDStream = new SystemStream("kafka", "PageViewEventByMemberId");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) throws Exception {
    PageViewEvent pve = (PageViewEvent) envelope.getMessage();
    // Using memberId as the message key re-partitions the events by member.
    collector.send(new OutgoingMessageEnvelope(pageViewByMIDStream, pve.memberId, pve));
  }
}
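The repartition job relies on messages with the same key ending up in the same partition of the intermediate stream. A minimal plain-Java sketch of that idea, assuming a simple hashCode-based partitioner: the class and method names are hypothetical, and Kafka's own default partitioner uses a different hash function.

```java
// Hypothetical illustration of key-based partitioning; this is not Samza or Kafka code.
final class KeyPartitionerSketch {
    // Maps a key to a partition in [0, numPartitions).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String[] memberIds = {"member-1", "member-2", "member-1"};
        for (String id : memberIds) {
            // The same memberId always maps to the same partition.
            System.out.println(id + " -> partition " + partitionFor(id, 3));
        }
    }
}
```

Because the mapping depends only on the key and the partition count, every PageViewEvent for a given member reaches the same downstream task, which is what lets Job-2 keep per-member counters locally.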
7
• Job-2: Window-based counter
public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
  private final SystemStream pageViewCounterStream = new SystemStream("kafka", "PageViewEventPerMemberStream");
  private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
  private Long windowSize;

  @Override
  public void init(Config config, TaskContext context) throws Exception {
    this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
        context.getStore("windowed-counter-store");
    this.windowSize = config.getLong("task.window.ms");
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
    getWindowCounterEvent().forEach(counter ->
        collector.send(new OutgoingMessageEnvelope(pageViewCounterStream, counter.memberId, counter)));
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) throws Exception {
    PageViewEvent pve = (PageViewEvent) envelope.getMessage();
    countPageViewEvent(pve);
  }
}
8
• Job-2: Window-based counter
public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
  ...
  // Collects the counters of all windows that started within the last windowSize milliseconds.
  List<PageViewPerMemberIdCounterEvent> getWindowCounterEvent() {
    List<PageViewPerMemberIdCounterEvent> retList = new ArrayList<>();
    Long currentTimestamp = System.currentTimeMillis();
    Long cutoffTimestamp = currentTimestamp - this.windowSize;
    String lowerBound = String.format("%08d-", cutoffTimestamp);
    String upperBound = String.format("%08d-", currentTimestamp + 1);
    this.windowedCounters.range(lowerBound, upperBound).forEachRemaining(entry ->
        retList.add(entry.getValue()));
    return retList;
  }

  // Increments the counter stored under "<window start>-<memberId>".
  void countPageViewEvent(PageViewEvent pve) {
    String key = String.format("%08d-%s", (pve.timestamp - pve.timestamp % this.windowSize), pve.memberId);
    PageViewPerMemberIdCounterEvent counter = this.windowedCounters.get(key);
    if (counter == null) {
      counter = new PageViewPerMemberIdCounterEvent(pve.memberId, (pve.timestamp - pve.timestamp % this.windowSize), 0);
    }
    counter.count++;
    this.windowedCounters.put(key, counter);
  }
}
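The counter task above encodes the window start into the store key so that one range scan returns every counter of a window. A rough plain-Java sketch of that key scheme, assuming a sorted in-memory TreeMap in place of the KeyValueStore and a zero-padding width of 13 digits (chosen here so that epoch-millisecond values sort correctly as strings; the slide uses %08d). All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical stand-in for the slide's store: a sorted in-memory map, not a Samza KeyValueStore.
final class WindowKeySketch {
    private final TreeMap<String, Integer> counters = new TreeMap<>();
    private final long windowSizeMs;

    WindowKeySketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Aligns the timestamp to the start of its window and prefixes the key with it,
    // zero-padded so that lexicographic order matches numeric order.
    String keyFor(long timestampMs, String memberId) {
        long windowStart = timestampMs - timestampMs % windowSizeMs;
        return String.format(Locale.ROOT, "%013d-%s", windowStart, memberId);
    }

    // Increments the counter for the member in the window containing timestampMs.
    void count(long timestampMs, String memberId) {
        counters.merge(keyFor(timestampMs, memberId), 1, Integer::sum);
    }

    // Returns the counts of all windows whose start lies in [fromMs, toMs), in key order.
    List<Integer> countsBetween(long fromMs, long toMs) {
        String lower = String.format(Locale.ROOT, "%013d-", fromMs);
        String upper = String.format(Locale.ROOT, "%013d-", toMs);
        return new ArrayList<>(counters.subMap(lower, true, upper, false).values());
    }
}
```

With a 5-minute window (300000 ms), events at 1000 ms and 2000 ms for the same member share one key, while an event at 300001 ms starts a new one.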
9
• Samza High Level API (NEW)
– Ability to express a multi-stage processing pipeline in a single user program
– Built-in library to provide high-level stream transformation functions
10
public class RepartitionAndCounterExample implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    Supplier<Integer> initialValue = () -> 0;
    MessageStream<PageViewEvent> pageViewEvents =
        graph.getInputStream("pageViewEventStream", (k, m) -> (PageViewEvent) m);
    OutputStream<String, MyStreamOutput, MyStreamOutput> pageViewEventPerMemberStream =
        graph.getOutputStream("pageViewEventPerMemberStream", m -> m.memberId, m -> m);

    pageViewEvents
        .partitionBy(m -> m.memberId)
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), initialValue,
            (m, c) -> c + 1))
        .map(MyStreamOutput::new)
        .sendTo(pageViewEventPerMemberStream);
  }
}
Built-in transform functions
11
• Visualized execution plan
[Figure: visualization of the execution plan]
12
• Built-in transformation functions in high-level API
Stateless functions
  filter       select a subset of messages from the stream
  map          map one input message to an output message
  flatMap      map one input message to 0 or more output messages
  merge        union all inputs into a single output stream
I/O functions
  partitionBy  re-partition the input messages based on a specific field
  sendTo       send the result to an output stream
  sink         send the result to an external system (e.g. external DB)
Stateful functions
  window       window aggregation on the input stream
  join         join messages from two input streams
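The stateless functions listed above behave much like their java.util.stream counterparts. A small plain-Java analogy follows; it does not use the Samza MessageStream API, works on in-memory lists rather than unbounded streams, and every class and method name in it is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical analogy using java.util.stream; not the Samza high-level API.
final class StatelessOpsSketch {
    // filter: keep a subset of the messages.
    static List<String> nonEmpty(List<String> in) {
        return in.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList());
    }

    // map: exactly one output message per input message.
    static List<Integer> lengths(List<String> in) {
        return in.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: zero or more output messages per input message.
    static List<String> words(List<String> in) {
        return in.stream()
            .flatMap(s -> Arrays.stream(s.split(" ")).filter(w -> !w.isEmpty()))
            .collect(Collectors.toList());
    }

    // merge: union of several inputs into a single output.
    static List<String> merge(List<String> a, List<String> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList());
    }
}
```

The stateful functions (window, join) have no direct one-line equivalent here, because they need state that survives across messages.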
13
• High-level API
• Flexible Deployment Model
• Convergence between Batch and Stream Processing
14
• Tight dependency on YARN
• Can’t easily port over to non-YARN clusters (e.g. Mesos, Kubernetes, AWS)
• Can’t directly embed stream processing in another application (e.g. a web frontend)
15
• Flexible deployment of Samza applications
  – Samza-as-a-library (NEW)
    • Run embedded stream processing in a user program
    • ZooKeeper-based coordination between multiple instances of the user program
  – Samza in a cluster
    • Run stream processing as a managed program in a cluster (e.g. SamzaContainer in YARN)
    • Use the cluster manager (e.g. YARN) to provide deployment, coordination, and resource management
16
A Samza job is composed of a collection of standalone processes
● Full control over
  ● Application’s life cycle
  ● Physical resources allocated to Samza processors
  ● Configuration and initialization
[Figure: several StreamProcessor instances, each containing a SamzaContainer and a JobCoordinator; one instance is marked as the leader]
17
● ZooKeeper-based JobCoordinator (stateful use case)
● JobCoordinator uses ZooKeeper for leader election
● Leader will perform partition assignments among all active StreamProcessors
[Figure: ZooKeeper and several StreamProcessor instances, each containing a SamzaContainer and a JobCoordinator]
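The election-then-assignment flow described above can be modelled without any ZooKeeper client. In this sketch the leader is simply the processor with the lowest id (in the spirit of the common "lowest sequence number wins" election recipe) and partitions are handed out round-robin. Both choices are assumptions made for illustration, not Samza's actual assignment strategy, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model; no ZooKeeper client is involved.
final class LeaderAssignmentSketch {
    // Picks the processor with the lowest id as the leader.
    static String electLeader(List<String> activeProcessors) {
        return activeProcessors.stream()
            .min(Comparator.naturalOrder())
            .orElseThrow(() -> new IllegalArgumentException("no active processors"));
    }

    // Spreads partitions 0..numPartitions-1 round-robin over the sorted processors.
    static Map<String, List<Integer>> assign(List<String> activeProcessors, int numPartitions) {
        if (activeProcessors.isEmpty()) {
            throw new IllegalArgumentException("no active processors");
        }
        List<String> sorted = new ArrayList<>(activeProcessors);
        Collections.sort(sorted);
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String processor : sorted) {
            assignment.put(processor, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            assignment.get(sorted.get(partition % sorted.size())).add(partition);
        }
        return assignment;
    }
}
```

When a processor joins or leaves, the leader would recompute the assignment over the new set of active processors; in this model that is just another call to assign.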
18
● Embedded application code example
public class WikipediaZkLocalApplication {
  /**
   * Executes the application using the local application runner.
   * It takes two required command line arguments:
   *   config-factory: a fully qualified {@link org.apache.samza.config.factories.PropertiesConfigFactory} class name
   *   config-path: path to application properties
   *
   * @param args command line arguments
   */
  public static void main(String[] args) {
    CommandLine cmdLine = new CommandLine();
    OptionSet options = cmdLine.parser().parse(args);
    Config config = cmdLine.loadConfig(options);

    LocalApplicationRunner runner = new LocalApplicationRunner(config);
    WikipediaApplication app = new WikipediaApplication();

    runner.run(app);
    runner.waitForFinish();
  }
}
19
20
job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
job.coordinator.zk.connect=my-zk.server:2191
• Embedded application launch sequence
[Figure: myApp.main() -> StreamApplication -> LocalApplicationRunner -> (runner.run()) -> n × StreamProcessor (streamProcessor.start())]
21
• Cluster-based application launch sequence
[Figure: run-app.sh -> main() (app.class=my.app.MyStreamApplication) -> RemoteApplicationRunner -> (jobRunner.run()) -> n × JobRunner -> Yarn RM -> run-jc.sh -> JobCoordinator; task.execute=run-local-app.sh -> run-local-app.sh -> myApp.main() -> StreamApplication -> LocalApplicationRunner -> (runner.run()) -> n × StreamProcessor (streamProcessor.start())]
22
23
• High-level API
• Flexible Deployment Model
• Convergence between Batch and Stream Processing
24
Application logic: Count PageViewEvent for each member in a 5 minute window
and send the counts to PageViewEventPerMemberStream
PageViewEvent -> re-partition by memberId -> window -> map -> sendTo -> PageViewEventPerMemberStream
HDFS
PageViewEvent: hdfs://mydbsnapshot/PageViewEvent/
PageViewEventPerMemberStream: hdfs://myoutputdb/PageViewEventPerMemberFiles
25
• No code change in application
Old config:
streams.pageViewEventStream.system=kafka
streams.pageViewEventPerMemberStream.system=kafka
New config:
streams.pageViewEventStream.system=hdfs
streams.pageViewEventStream.physical.name=hdfs://mydbsnapshot/PageViewEvent/
streams.pageViewEventPerMemberStream.system=hdfs
streams.pageViewEventPerMemberStream.physical.name=hdfs://myoutputdb/PageViewEventPerMemberFiles
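The reason no application code changes is that the program only refers to logical stream ids, and the binding to a system and a physical location is looked up from configuration. A minimal sketch of that lookup with java.util.Properties: only the streams.<id>.system and streams.<id>.physical.name keys come from the slide, the resolver class is hypothetical, and falling back to the stream id when no physical name is set is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical resolver; it only illustrates config-driven stream binding.
final class StreamConfigSketch {
    // Parses properties-file text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Returns "<system>:<physical name>" for a logical stream id.
    // Assumption: the stream id itself is used when no physical name is configured.
    static String resolve(Properties props, String streamId) {
        String system = props.getProperty("streams." + streamId + ".system");
        String physical = props.getProperty("streams." + streamId + ".physical.name", streamId);
        return system + ":" + physical;
    }
}
```

Resolving "pageViewEventStream" against the old config yields the Kafka binding and against the new config the HDFS path, while the code that asks for the stream stays the same.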
26
27
API: High-level API; Unified Stream & Batch Processing
RUNNER: Remote Runner (run in remote cluster); Local Runner (run locally)
DEPLOYMENT: Cluster-based (Yarn, (Mesos)); Embedded (ZooKeeper, Standalone)
PROCESSOR: StreamProcessor
  Streams: Kafka, Kinesis, HDFS ...
  Local State: RocksDb, In-Memory
  Remote Data: Multithreading
27
• Samza runner for Apache Beam
• Event-time processing
• Support for exactly-once processing
• Support for partition expansion in stateful applications
• Easy access to adjunct datasets
• SQL over streams
28
Q&A
29

Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 

Nextcon samza preso july - final

  • 1. Yi Pan, Streams Team @LinkedIn. Committer and PMC Chair, Apache Samza
  • 2. class PageKeyViewsCounterTask implements StreamTask, InitableTask {
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          GenericRecord record = (GenericRecord) envelope.getMessage();
          String pageKey = record.get("page-key").toString();
          // Increment the per-key counter held in the local store.
          int newCount = pageKeyViews.get(pageKey).incrementAndGet();
          // Emit the updated count, keyed by page key.
          collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
        }
        public void init(Config config, TaskContext context) {
          pageKeyViews = (KeyValueStore<String, Counter>) context.getStore("myPageKeyViews");
        }
      }
      Task-0, Task-1, Task-2: deployed via YARN
  • 3. Pros
        ◦ Simple API
        ◦ Built-in support for state
        ◦ Leverages YARN for fault tolerance
        ◦ High performance (1.2 Mqps / host)
      Cons
        ◦ Not easy to write an end-to-end processing pipeline in a single program
        ◦ Deployment is tightly coupled with YARN
        ◦ No support for running as a batch job
  • 4. High-level API • Flexible Deployment Model • Convergence between Batch and Stream Processing
  • 5. Application logic: Count PageViewEvent for each member in a 5-minute window and send the counts to PageViewEventPerMemberStream
      Pipeline: PageViewEvent → re-partition by memberId → window → map → sendTo → PageViewEventPerMemberStream
  • 6. Application in low-level API
      Job-1 (PageViewRepartitionTask): PageViewEvent → re-partition → PageViewEventByMemberId
      Job-2 (PageViewByMemberIdCounterTask): PageViewEventByMemberId → window → map → sendTo → PageViewEventPerMemberStream
  • 7. Job-1: Repartition job
      public class PageViewRepartitionTask implements StreamTask {
        private final SystemStream pageViewByMIDStream = new SystemStream("kafka", "PageViewEventByMemberId");

        @Override
        public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) throws Exception {
          PageViewEvent pve = (PageViewEvent) envelope.getMessage();
          // Use memberId as the message key so all events for a member go to the same partition.
          collector.send(new OutgoingMessageEnvelope(pageViewByMIDStream, pve.memberId, pve));
        }
      }
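Re-partitioning by key works because every message with the same key is routed to the same partition of the intermediate stream. The sketch below illustrates that property with a plain hash-modulo scheme. The class and method names are hypothetical, and the hash function is an assumption: a real messaging system may use a different one.

```java
// Hypothetical helper, not part of Samza or Kafka.
class KeyPartitionerSketch {
    // Map a key to a partition index in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int first = partitionFor("member-42", 8);
        int second = partitionFor("member-42", 8);
        // The same key always maps to the same partition.
        System.out.println(first == second);
    }
}
```

Because the mapping depends only on the key and the partition count, a downstream task that owns one partition sees every event for the members assigned to it.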
  • 8. Job-2: Window-based counter
      public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
        private final SystemStream pageViewCounterStream = new SystemStream("kafka", "PageViewEventPerMemberStream");
        private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
        private Long windowSize;

        @Override
        public void init(Config config, TaskContext context) throws Exception {
          this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>) context.getStore("windowed-counter-store");
          this.windowSize = config.getLong("task.window.ms");
        }

        @Override
        public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
          // Emit every counter that falls inside the current window.
          getWindowCounterEvent().forEach(counter ->
              collector.send(new OutgoingMessageEnvelope(pageViewCounterStream, counter.memberId, counter)));
        }

        @Override
        public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) throws Exception {
          PageViewEvent pve = (PageViewEvent) envelope.getMessage();
          countPageViewEvent(pve);
        }
      }
  • 9. Job-2: Window-based counter
      public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
        ...
        List<PageViewPerMemberIdCounterEvent> getWindowCounterEvent() {
          List<PageViewPerMemberIdCounterEvent> retList = new ArrayList<>();
          Long currentTimestamp = System.currentTimeMillis();
          Long cutoffTimestamp = currentTimestamp - this.windowSize;
          // Keys start with the window timestamp, so a range scan selects recent windows.
          String lowerBound = String.format("%08d-", cutoffTimestamp);
          String upperBound = String.format("%08d-", currentTimestamp + 1);
          this.windowedCounters.range(lowerBound, upperBound).forEachRemaining(entry -> retList.add(entry.getValue()));
          return retList;
        }

        void countPageViewEvent(PageViewEvent pve) {
          // Key layout: <window start timestamp>-<memberId>
          String key = String.format("%08d-%s", (pve.timestamp - pve.timestamp % this.windowSize), pve.memberId);
          PageViewPerMemberIdCounterEvent counter = this.windowedCounters.get(key);
          if (counter == null) {
            counter = new PageViewPerMemberIdCounterEvent(pve.memberId, (pve.timestamp - pve.timestamp % this.windowSize), 0);
          }
          counter.count++;
          this.windowedCounters.put(key, counter);
        }
      }
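The counter task above depends on keys of the form "window start, dash, member id", so that a lexicographic range scan returns the counters of the chosen windows. The following is a minimal sketch of that key layout, using a java.util.TreeMap as a stand-in for the key-value store. The class name is hypothetical, and the padding width of 13 digits is an assumption of this sketch (it fits epoch milliseconds, so string order matches numeric order); the slide uses %08d.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for the windowed counter store; a TreeMap keeps keys in
// lexicographic order, which is what the range scan relies on.
class WindowKeySketch {
    private final TreeMap<String, Integer> counters = new TreeMap<>();
    private final long windowSizeMs;

    WindowKeySketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Align a timestamp to the start of its tumbling window.
    long windowStart(long timestampMs) {
        return timestampMs - timestampMs % windowSizeMs;
    }

    // Key layout: zero-padded window start, a dash, then the member id.
    // Width 13 is an assumption of this sketch; the slide uses %08d.
    String key(long timestampMs, String memberId) {
        return String.format("%013d-%s", windowStart(timestampMs), memberId);
    }

    void count(long timestampMs, String memberId) {
        counters.merge(key(timestampMs, memberId), 1, Integer::sum);
    }

    // Return "key=count" entries for windows whose start lies in [fromMs, toMs).
    List<String> range(long fromMs, long toMs) {
        String lower = String.format("%013d-", fromMs);
        String upper = String.format("%013d-", toMs);
        List<String> out = new ArrayList<>();
        counters.subMap(lower, true, upper, false)
                .forEach((k, v) -> out.add(k + "=" + v));
        return out;
    }

    public static void main(String[] args) {
        WindowKeySketch sketch = new WindowKeySketch(300_000L); // 5-minute windows
        sketch.count(10_000L, "alice");
        sketch.count(20_000L, "alice");
        sketch.count(310_000L, "bob");
        System.out.println(sketch.range(0L, 300_000L)); // prints [0000000000000-alice=2]
    }
}
```

Both of alice's events fall into the window starting at 0 and share one key, while bob's event belongs to the next window and is excluded by the upper bound.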
  • 10. Samza High Level API (NEW)
      – Ability to express a multi-stage processing pipeline in a single user program
      – Built-in library to provide high-level stream transformation functions
  • 11. public class RepartitionAndCounterExample implements StreamApplication {
        @Override
        public void init(StreamGraph graph, Config config) {
          Supplier<Integer> initialValue = () -> 0;
          MessageStream<PageViewEvent> pageViewEvents =
              graph.getInputStream("pageViewEventStream", (k, m) -> (PageViewEvent) m);
          OutputStream<String, MyStreamOutput, MyStreamOutput> pageViewEventPerMemberStream =
              graph.getOutputStream("pageViewEventPerMemberStream", m -> m.memberId, m -> m);

          // Built-in transform functions
          pageViewEvents
              .partitionBy(m -> m.memberId)
              .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), initialValue, (m, c) -> c + 1))
              .map(MyStreamOutput::new)
              .sendTo(pageViewEventPerMemberStream);
        }
      }
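The keyed tumbling window above takes an initial-value supplier and a fold function, (m, c) -> c + 1, which together define a per-key, per-window count. The sketch below shows those fold semantics in plain Java; it is not the Samza implementation. All class names are hypothetical, and assigning events to windows by an event timestamp is an assumption made to keep the example deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Supplier;

// Hypothetical event type, used by this sketch only.
class PageViewSketchEvent {
    final String memberId;
    final long timestampMs;

    PageViewSketchEvent(String memberId, long timestampMs) {
        this.memberId = memberId;
        this.timestampMs = timestampMs;
    }
}

// Folds events into one value per (window start, member id) pair.
class KeyedTumblingFoldSketch {
    private final long windowMs;
    private final Supplier<Integer> initialValue;
    private final BiFunction<PageViewSketchEvent, Integer, Integer> foldFn;
    // "<windowStart>/<memberId>" -> folded value, in first-arrival order
    private final Map<String, Integer> state = new LinkedHashMap<>();

    KeyedTumblingFoldSketch(long windowMs, Supplier<Integer> initialValue,
                            BiFunction<PageViewSketchEvent, Integer, Integer> foldFn) {
        this.windowMs = windowMs;
        this.initialValue = initialValue;
        this.foldFn = foldFn;
    }

    void accept(PageViewSketchEvent event) {
        long windowStart = event.timestampMs - event.timestampMs % windowMs;
        String windowKey = windowStart + "/" + event.memberId;
        // Start from the initial value the first time a key is seen in a window.
        Integer current = state.containsKey(windowKey) ? state.get(windowKey) : initialValue.get();
        state.put(windowKey, foldFn.apply(event, current));
    }

    Map<String, Integer> results() {
        return state;
    }

    public static void main(String[] args) {
        KeyedTumblingFoldSketch window =
            new KeyedTumblingFoldSketch(300_000L, () -> 0, (m, c) -> c + 1);
        window.accept(new PageViewSketchEvent("alice", 1_000L));
        window.accept(new PageViewSketchEvent("alice", 2_000L));
        window.accept(new PageViewSketchEvent("bob", 1_000L));
        window.accept(new PageViewSketchEvent("alice", 301_000L));
        System.out.println(window.results()); // prints {0/alice=2, 0/bob=1, 300000/alice=1}
    }
}
```

Swapping the fold function changes the aggregate without touching the windowing logic, which is the design point of passing the supplier and the fold as parameters.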
  • 12. Visualized execution plan (figure)
  • 13. Built-in transformation functions in high-level API
      filter: select a subset of messages from the stream
      map: map one input message to an output message
      flatMap: map one input message to 0 or more output messages
      merge: union all inputs into a single output stream
      partitionBy: re-partition the input messages based on a specific field
      sendTo: send the result to an output stream
      sink: send the result to an external system (e.g. external DB)
      window: window aggregation on the input stream
      join: join messages from two input streams
      Groups shown on the slide: stateless functions, I/O functions, stateful functions
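The per-message semantics of merge, filter, flatMap, and map listed above can be tried out on in-memory data. The sketch below uses java.util.stream purely as an analogy; it is not the Samza MessageStream API, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream operators standing in for stream transformations.
class StatelessOperatorsSketch {
    static List<String> run(List<String> first, List<String> second) {
        return Stream.concat(first.stream(), second.stream())     // merge: union of both inputs
            .filter(line -> !line.isEmpty())                       // filter: keep a subset
            .flatMap(line -> Arrays.stream(line.split(" ")))      // flatMap: one message to 0..n messages
            .map(String::toUpperCase)                              // map: one message to one message
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> out = run(Arrays.asList("a b", ""), Arrays.asList("c"));
        System.out.println(out); // prints [A, B, C]
    }
}
```

Unlike this finite example, a stream processor applies the same operators to unbounded input, one message at a time.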
  • 14. High-level API • Flexible Deployment Model • Convergence between Batch and Stream Processing
  • 15. Limitations of the existing deployment model
      ◦ Tight dependency on YARN
      ◦ Can't easily port over to non-YARN clusters (e.g. Mesos, Kubernetes, AWS)
      ◦ Can't directly embed stream processing in another application (e.g. a web frontend)
  • 16. Flexible deployment of Samza applications
      – Samza-as-a-library (NEW)
          • Run embedded stream processing in a user program
          • ZooKeeper-based coordination between multiple instances of the user program
      – Samza in a cluster
          • Run stream processing as a managed program in a cluster (e.g. SamzaContainer in YARN)
          • Use the cluster manager (e.g. YARN) to provide deployment, coordination, and resource management
  • 17. A Samza job is composed of a collection of standalone processes
      ● Full control over
          ● Application's life cycle
          ● Physical resources allocated to Samza processors
          ● Configuration and initialization
      Diagram: several StreamProcessor instances, each with a Samza Container and a Job Coordinator; one of them is marked Leader
  • 18. ZooKeeper-based JobCoordinator (stateful use case)
      ● JobCoordinator uses ZooKeeper for leader election
      ● Leader will perform partition assignments among all active StreamProcessors
      Diagram: ZooKeeper connected to several StreamProcessor instances, each with a Samza Container and a Job Coordinator
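The coordination described above has two steps: one processor becomes leader, and the leader divides the input partitions among the active processors. The sketch below illustrates both steps without a ZooKeeper client. Choosing the smallest processor id as leader and assigning partitions round-robin are assumptions of this sketch, not confirmed details of Samza's coordinator; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of leader election plus partition assignment.
class PartitionAssignmentSketch {
    // Assumption: the processor with the smallest id acts as leader.
    static String electLeader(List<String> activeProcessors) {
        if (activeProcessors.isEmpty()) {
            throw new IllegalArgumentException("no active processors");
        }
        return Collections.min(activeProcessors);
    }

    // Assumption: partitions 0..numPartitions-1 are dealt round-robin over sorted ids.
    static Map<String, List<Integer>> assign(List<String> activeProcessors, int numPartitions) {
        if (activeProcessors.isEmpty()) {
            throw new IllegalArgumentException("no active processors");
        }
        List<String> sorted = new ArrayList<>(activeProcessors);
        Collections.sort(sorted);
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String processor : sorted) {
            assignment.put(processor, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = sorted.get(partition % sorted.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> active = Arrays.asList("p2", "p1", "p3");
        System.out.println(electLeader(active)); // prints p1
        System.out.println(assign(active, 8));   // prints {p1=[0, 3, 6], p2=[1, 4, 7], p3=[2, 5]}
        // When p2 leaves, the leader recomputes the assignment for the remaining processors.
        System.out.println(assign(Arrays.asList("p1", "p3"), 8)); // prints {p1=[0, 2, 4, 6], p3=[1, 3, 5, 7]}
    }
}
```

Recomputing the full assignment whenever membership changes is the simplest policy; it moves more partitions than strictly necessary, which matters for stateful jobs that must restore local state.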
  • 19. Embedded application code example
      public class WikipediaZkLocalApplication {
        /**
         * Executes the application using the local application runner.
         * It takes two required command line arguments
         *  config-factory: a fully qualified {@link org.apache.samza.config.factories.PropertiesConfigFactory} class name
         *  config-path: path to application properties
         *
         * @param args command line arguments
         */
        public static void main(String[] args) {
          CommandLine cmdLine = new CommandLine();
          OptionSet options = cmdLine.parser().parse(args);
          Config config = cmdLine.loadConfig(options);

          LocalApplicationRunner runner = new LocalApplicationRunner(config);
          WikipediaApplication app = new WikipediaApplication();

          runner.run(app);
          runner.waitForFinish();
        }
      }
  • 20. Embedded application code example (same WikipediaZkLocalApplication code as slide 19), with the coordinator configuration:
      job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
      job.coordinator.zk.connect=my-zk.server:2191
  • 21. Embedded application launch sequence
      myApp.main() (Stream Application) → runner.run() (Local Application Runner) → streamProcessor.start() (Stream Processor, ×n)
  • 22. Cluster-based application launch sequence
      run-app.sh → main() (Remote Application Runner, app.class=my.app.MyStreamApplication) → jobRunner.run() (JobRunner, ×n) → Yarn RM → run-jc.sh (Job Coordinator) → run-local-app.sh (task.execute=run-local-app.sh) → myApp.main() (Stream Application) → runner.run() (Local Application Runner) → streamProcessor.start() (Stream Processor, ×n)
  • 24. High-level API • Flexible Deployment Model • Convergence between Batch and Stream Processing
  • 25. Application logic: Count PageViewEvent for each member in a 5-minute window and send the counts to PageViewEventPerMemberStream
      Pipeline: PageViewEvent → re-partition by memberId → window → map → sendTo → PageViewEventPerMemberStream
      Input and output on HDFS:
        PageViewEvent: hdfs://mydbsnapshot/PageViewEvent/
        PageViewEventPerMemberStream: hdfs://myoutputdb/PageViewEventPerMemberFiles
  • 26. No code change in application
      Old config:
        streams.pageViewEventStream.system=kafka
        streams.pageViewEventPerMemberStream.system=kafka
      New config:
        streams.pageViewEventStream.system=hdfs
        streams.pageViewEventStream.physical.name=hdfs://mydbsnapshot/PageViewEvent/
        streams.pageViewEventPerMemberStream.system=hdfs
        streams.pageViewEventPerMemberStream.physical.name=hdfs://myoutputdb/PageViewEventPerMemberFiles
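The switch from Kafka to HDFS needs no code change because the application refers to streams only by logical id, and the system behind each id is looked up in configuration. The sketch below mimics that lookup with java.util.Properties. The class and method names are hypothetical, and falling back to the stream id when no physical name is configured is an assumption of this sketch; only the config keys come from the slide.

```java
import java.util.Properties;

// Hypothetical lookup helper; only the "streams.<id>.*" keys are taken from the slide.
class StreamConfigSketch {
    static String systemOf(Properties config, String streamId) {
        return config.getProperty("streams." + streamId + ".system");
    }

    // Assumption: use the stream id itself when no physical name is configured.
    static String physicalNameOf(Properties config, String streamId) {
        return config.getProperty("streams." + streamId + ".physical.name", streamId);
    }

    public static void main(String[] args) {
        Properties oldConfig = new Properties();
        oldConfig.setProperty("streams.pageViewEventStream.system", "kafka");

        Properties newConfig = new Properties();
        newConfig.setProperty("streams.pageViewEventStream.system", "hdfs");
        newConfig.setProperty("streams.pageViewEventStream.physical.name",
                              "hdfs://mydbsnapshot/PageViewEvent/");

        // The application code asks for the same logical id in both cases.
        String streamId = "pageViewEventStream";
        System.out.println(systemOf(oldConfig, streamId) + " " + physicalNameOf(oldConfig, streamId));
        System.out.println(systemOf(newConfig, streamId) + " " + physicalNameOf(newConfig, streamId));
    }
}
```

Keeping the logical id stable is what lets the same pipeline definition run against a live stream or a file snapshot.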
  • 27. API: High-level API (Unified Stream & Batch Processing)
      RUNNER: Remote Runner (Run in Remote Cluster); Local Runner (Run Locally)
      DEPLOYMENT: Cluster-based (Yarn, (Mesos)); Embedded (ZooKeeper, Standalone)
      PROCESSOR: StreamProcessor
        Streams: Kafka, Kinesis, HDFS ...
        Local State: RocksDb, In-Memory
        Remote Data
        Multithreading
  • 28. Future work
      ◦ Samza runner for Apache Beam
      ◦ Event-time processing
      ◦ Support for exactly-once processing
      ◦ Support partition expansion for stateful applications
      ◦ Easy access to adjunct datasets
      ◦ SQL over streams