Implement a scalable statistical
aggregation system using Akka
Scala by the Bay, 12 Nov 2016
Stanley Nguyen, Vu Ho
Email Security@Symantec Singapore
The system
Provides a service that answers time-series analytical questions such as
COUNT, TOPK, SET MEMBERSHIP, and CARDINALITY over a dynamic set
of data streams, using a statistical approach.
Motivation
• The system collects data from multiple sources in streaming log format
• Some common questions in an email anti-abuse system:
  • Most frequent items (IP, domain, sender, etc.)
  • Number of unique items
  • Have we seen an item before?
=> We need to be able to answer such questions in a timely manner
Data statistics
• 6K email logs/second
• One email log is flattened out into sub-events:
  • IP, sender, sender domain, etc.
  • Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
Total: ~200K messages/second
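As a rough sanity check, the two quoted rates imply an average fan-out per email log. The sketch below only divides the numbers from the slide. The split into fields and time periods is not given in the deck, so `fieldsPerPeriod` is a hypothetical helper that shows how the fan-out could decompose.

```scala
object FanOut {
  // Figures quoted on the slide.
  val emailLogsPerSecond: Int = 6000
  val totalMessagesPerSecond: Int = 200000

  // Average number of derived messages per email log (about 33).
  val messagesPerLog: Double = totalMessagesPerSecond.toDouble / emailLogsPerSecond

  // Hypothetical decomposition: if every field is tracked for `timePeriods`
  // windows, this is the implied number of fields per log.
  def fieldsPerPeriod(timePeriods: Int): Double = messagesPerLog / timePeriods
}
```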
Challenges
• Our system needs to be
  • Responsive
  • Space efficient
  • Reactive
  • Extensible
  • Scalable
  • Resilient
Sketching data structures
• How many times have we seen a certain IP?
  • Count-Min Sketch (CMS): counting things + TopK
• How many unique senders did we see yesterday?
  • HyperLogLog (HLL): set cardinality
• Did we see a certain IP last month?
  • Bloom Filter (BF): set membership
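To make the first of these structures concrete, here is a minimal Count-Min Sketch using only the Scala standard library. It is an illustrative sketch, not the implementation used in the talk. The class name, the use of seeded MurmurHash3 as the per-row hash, and the `String` item type are all assumptions. TopK tracking on top of the counts is not shown.

```scala
import scala.util.hashing.MurmurHash3

/** Minimal Count-Min Sketch: `depth` rows of `width` counters each. */
final class CountMinSketch(depth: Int, width: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding MurmurHash3 with the row index.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width // keep the index non-negative
  }

  // Increment one counter in every row.
  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Point query: the minimum over all rows. Collisions can only inflate
  // a counter, so the estimate never underestimates the true count.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

Memory use is fixed at `depth * width` counters regardless of how many distinct items arrive, which is the space/accuracy trade-off the sketch makes.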
SPACE / SPEED
What is available
• Implementations of data structures for finding cardinality (i.e. counting things), set membership, and top-k elements: solved by using streamlib / Twitter Algebird
What we try to solve
• A dynamic, reactive, distributed system for answering cardinality (i.e. counting things), set membership, and top-k queries
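The deck delegates the data structures themselves to streamlib / Algebird. For readers unfamiliar with what such a library encapsulates, the following is a minimal standard-library Bloom filter for the set-membership case. It is a sketch under assumptions, not the API of either library: the class name, the seeded MurmurHash3 hashing, and the parameters are illustrative.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

/** Minimal Bloom filter: `numHashes` bit positions per item in a bit set of `numBits`. */
final class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new mutable.BitSet(numBits)

  // Derive the bit positions for an item by varying the hash seed.
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = MurmurHash3.stringHash(item, seed)
      ((h % numBits) + numBits) % numBits
    }

  def add(item: String): Unit = positions(item).foreach(p => bits.add(p))

  // false means "definitely not seen"; true means "possibly seen".
  def mightContain(item: String): Boolean = positions(item).forall(bits.contains)
}
```

An added item is always reported as possibly present (no false negatives); false positives become more likely as the bit set fills up.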
Sketching data structures
• Responsive
• Space efficient
• Reactive
• Extensible
• Scalable
• Resilient
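The "space efficient" property can be quantified for a Count-Min Sketch with the standard sizing bounds: width = ceil(e / eps) and depth = ceil(ln(1 / delta)) give an overestimate of at most eps * N (N being the total count inserted) with probability at least 1 - delta. The helper below is a worked example of that formula. The eps and delta values in the usage note are illustrative and do not come from the talk.

```scala
object CmsSizing {
  // Number of counters per row for a relative error bound of eps.
  def width(eps: Double): Int = math.ceil(math.E / eps).toInt

  // Number of rows (hash functions) for a failure probability of delta.
  def depth(delta: Double): Int = math.ceil(math.log(1.0 / delta)).toInt

  // Total table size, assuming one 8-byte Long per counter by default.
  def bytes(eps: Double, delta: Double, bytesPerCounter: Int = 8): Long =
    width(eps).toLong * depth(delta) * bytesPerCounter
}
```

For example, eps = 0.001 and delta = 0.01 give 2719 x 5 counters, about 109 KB with 8-byte counters, independent of the number of distinct keys.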
Akka Actor
BACK PRESSURE?
Akka Stream
GraphDSL
FLOW-SHAPE NODE
Using GraphDSL
(msg-type, @timestamp, key, value)
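The tuple above can be modelled as a small case class, and the flattening of one email log into sub-events as a pure function. This is only a sketch: the field types, the `EmailLog` record, and the message-type names are assumptions, since the deck does not show the real log schema.

```scala
// Field names follow the tuple on the slide; the types are assumptions.
final case class Event(msgType: String, timestamp: Long, key: String, value: Long)

// Hypothetical email-log record; the real schema is not given in the deck.
final case class EmailLog(timestamp: Long, ip: String, sender: String)

object Flatten {
  // Sender domain is taken as the text after the last '@'.
  def senderDomain(sender: String): String =
    sender.substring(sender.lastIndexOf('@') + 1)

  // One email log becomes one event per tracked field, each counting as 1.
  def subEvents(log: EmailLog): List[Event] = List(
    Event("ip", log.timestamp, log.ip, 1L),
    Event("sender", log.timestamp, log.sender, 1L),
    Event("sender-domain", log.timestamp, senderDomain(log.sender), 1L)
  )
}
```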
GraphDSL - Limitations
Our design – Dynamic stream
Merge Hub
• Provided by Akka Streams: allows a dynamic set of TCP producers
Splitter Hub
• Splits the stream by event type to a dynamic set of downstream consumers.
• Consumers are actors that implement the CMS, BF, HLL, etc. logic.
• Not available in akka-stream.
Splitter Hub API
• Similar to Akka Streams' built-in BroadcastHub, but with a different back-pressure implementation.
• [[SplitterHub]].source can be supplied with a predicate/selector function to return a filtered subset of the data.
selector
Splitter Hub’s Implementation
Splitter Hub
• The [[Source]] can be materialized any number of times; each materialization creates a new consumer, which can be registered with the hub and then receives the upstream items that match its selector function.
Consumers can be added at run time
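The routing behaviour described here (selector per consumer, consumers joining at run time) can be illustrated with a toy, single-threaded model. This is explicitly not the SplitterHub from the talk: it uses no Akka, has no asynchrony and no back-pressure, and all names (`SplitterModel`, `register`, `publish`) are hypothetical.

```scala
import scala.collection.mutable

// Toy model of selector-based routing; NOT the actual SplitterHub.
final class SplitterModel[T] {
  final class Consumer(val selector: T => Boolean) {
    val received = mutable.ListBuffer.empty[T]
  }

  private val consumers = mutable.ListBuffer.empty[Consumer]

  // Analogous to materializing the hub's source with a selector at run time.
  def register(selector: T => Boolean): Consumer = {
    val c = new Consumer(selector)
    consumers += c
    c
  }

  // Deliver an upstream element to every consumer whose selector matches.
  def publish(elem: T): Unit =
    consumers.foreach(c => if (c.selector(elem)) c.received += elem)
}
```

A consumer registered late only sees elements published after it joined, which mirrors the dynamic-consumer behaviour the slide describes.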
Consumers
• Can be either local or remote.
• Managed by a coordination actor.
• Each implements a specific data structure (CMS/BF/HLL) for a particular event type over a specific time range.
• Responsibilities:
  • Answer a specific query.
  • Regularly persist a serialized form of the internal data structure (count-min table, etc.).
[Diagram labels: COUNT-QUERY, forward, ref, snapshot]
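Persisting the internal data structure amounts to turning a table of counters into bytes and back. The sketch below shows one way to do that for a count-min table with plain `java.io` streams. The byte layout (depth, width, then counters row by row) and the object name are assumptions; the deck does not describe the actual snapshot format.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object CountTableCodec {
  // Assumed layout: depth (Int), width (Int), then all counters row by row.
  def toBytes(table: Array[Array[Long]]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeInt(table.length)
    out.writeInt(if (table.isEmpty) 0 else table(0).length)
    table.foreach(row => row.foreach(counter => out.writeLong(counter)))
    out.flush()
    buffer.toByteArray
  }

  // Reads the counters back in the same row-major order they were written.
  def fromBytes(bytes: Array[Byte]): Array[Array[Long]] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val depth = in.readInt()
    val width = in.readInt()
    Array.fill(depth)(Array.fill(width)(in.readLong()))
  }
}
```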
• Responsive
• Space efficient
• Reactive
• Extensible
• Scalable
• Resilient
Scaling out
• What if the data does not fit on one machine?
• What if a server crashes?
• How do we maintain back-pressure end to end?
Scaling out
Akka stream TCP
• Handled by the kernel (back-pressure, reliable delivery).
• For each worker, we create one source per message type it is responsible for, using the SplitterHub source() API.
• Connect each source to a TCP connection and send it to the worker.
• Back-pressure is maintained across the network.
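The idea behind back-pressure can be illustrated with a bounded buffer: once the buffer is full, the producer is refused until the consumer drains something. This is only an analogy in plain Scala, not Akka Streams code. TCP achieves the same effect in the kernel through its receive window, and Akka Streams through demand signalling; the object and method names below are hypothetical.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BoundedBufferDemo {
  // How many of `items` offers succeed when nobody consumes from the buffer.
  def acceptedWithoutConsumer(capacity: Int, items: Int): Int = {
    val buffer = new ArrayBlockingQueue[Int](capacity)
    (1 to items).count(i => buffer.offer(i)) // offer returns false when full
  }

  // After the consumer takes one element, exactly one more offer succeeds.
  def acceptedAfterOnePoll(capacity: Int): Boolean = {
    val buffer = new ArrayBlockingQueue[Int](capacity)
    (1 to capacity).foreach(i => buffer.offer(i))
    val refusedWhenFull = !buffer.offer(0)
    buffer.poll()
    refusedWhenFull && buffer.offer(0) && !buffer.offer(0)
  }
}
```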
Master-Worker communication
Master Failover
• The Coordinator is a single point of failure.
• Run the Coordinator actor as a Cluster Singleton across multiple nodes, so another node can take over if the active one fails.
• Workers communicate with the master (heartbeat) using Cluster Client.
Worker Failover
• Workers persist all events to a DB journal + snapshots.
  • Akka Persistence.
  • Redis for storing the journal + snapshots.
• When a worker is down, its keys are redistributed.
  • The master then redirects traffic to other workers.
  • CMS actors are restored on the new worker from snapshot + journal.
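Restoring state from a snapshot plus a journal follows the usual event-sourcing recipe: load the latest snapshot, then replay only the journal entries recorded after it. The sketch below shows that logic in plain Scala, without Akka Persistence or Redis. The `Snapshot` and `Journaled` shapes are hypothetical, and an exact `Map` stands in for the CMS table to keep the example short.

```scala
object Recovery {
  // Hypothetical shapes: the deck does not describe the persisted formats.
  final case class Snapshot(sequenceNr: Long, counts: Map[String, Long])
  final case class Journaled(sequenceNr: Long, key: String, value: Long)

  // Start from the snapshot, then replay only the events recorded after it,
  // in sequence-number order.
  def recover(snapshot: Snapshot, journal: Seq[Journaled]): Map[String, Long] =
    journal
      .filter(_.sequenceNr > snapshot.sequenceNr)
      .sortBy(_.sequenceNr)
      .foldLeft(snapshot.counts) { (state, e) =>
        state.updated(e.key, state.getOrElse(e.key, 0L) + e.value)
      }
}
```

Events already covered by the snapshot are skipped, so replaying the full journal does not double-count them.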
Benchmark
• Akka-stream on single node: 100K+ msg/second (one msg-type)
• Akka-stream on remote node (remote TCP): 15-20K msg/second (one msg-type)
• Akka-stream on remote node (remote TCP) with Akka Persistence journal: 2000+ msg/second (one msg-type)
Conclusion
• Our system is
  • Responsive
  • Reactive
  • Scalable
  • Resilient
• Future work:
  • Make workers metric-agnostic
  • Scale out the master
  • Exactly-once delivery for workers
  • More flexible filtering using SplitterHub
Q&A

Mais conteúdo relacionado

Mais procurados

Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
 
Visualizing C2_MLADS_2015
Visualizing C2_MLADS_2015Visualizing C2_MLADS_2015
Visualizing C2_MLADS_2015Todd Lanning
 
Time series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalTime series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalconfluent
 
Ceilometer presentation ODS Grizzly.pdf
Ceilometer presentation ODS Grizzly.pdfCeilometer presentation ODS Grizzly.pdf
Ceilometer presentation ODS Grizzly.pdfOpenStack Foundation
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Spark Summit
 
Streaming ETL to Elastic with Apache Kafka and KSQL
Streaming ETL to Elastic with Apache Kafka and KSQLStreaming ETL to Elastic with Apache Kafka and KSQL
Streaming ETL to Elastic with Apache Kafka and KSQLconfluent
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...Databricks
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streamingphanleson
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsLightbend
 
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...HostedbyConfluent
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rullimfrancis
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
KSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache KafkaKSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache Kafkaconfluent
 
Introduction to the Processor API
Introduction to the Processor APIIntroduction to the Processor API
Introduction to the Processor APIconfluent
 
Streaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETLStreaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETLconfluent
 
INTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINEINTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINESingleStore
 
PowerStream Demo
PowerStream DemoPowerStream Demo
PowerStream DemoSingleStore
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsAsis Mohanty
 

Mais procurados (20)

Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
Visualizing C2_MLADS_2015
Visualizing C2_MLADS_2015Visualizing C2_MLADS_2015
Visualizing C2_MLADS_2015
 
Time series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalTime series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_final
 
Ceilometer presentation ODS Grizzly.pdf
Ceilometer presentation ODS Grizzly.pdfCeilometer presentation ODS Grizzly.pdf
Ceilometer presentation ODS Grizzly.pdf
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Streaming ETL to Elastic with Apache Kafka and KSQL
Streaming ETL to Elastic with Apache Kafka and KSQLStreaming ETL to Elastic with Apache Kafka and KSQL
Streaming ETL to Elastic with Apache Kafka and KSQL
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
 
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rulli
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
KSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache KafkaKSQL: Open Source Streaming for Apache Kafka
KSQL: Open Source Streaming for Apache Kafka
 
Introduction to the Processor API
Introduction to the Processor APIIntroduction to the Processor API
Introduction to the Processor API
 
Streaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETLStreaming Transformations - Putting the T in Streaming ETL
Streaming Transformations - Putting the T in Streaming ETL
 
INTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINEINTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINE
 
PowerStream Demo
PowerStream DemoPowerStream Demo
PowerStream Demo
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 

Destaque

MDJ 202 2nd Assgmnt
MDJ 202 2nd AssgmntMDJ 202 2nd Assgmnt
MDJ 202 2nd AssgmntSyeera Azryn
 
Acessibilidade para as pessoas com necessidades comunicativas especiais
Acessibilidade para as pessoas com necessidades comunicativas especiaisAcessibilidade para as pessoas com necessidades comunicativas especiais
Acessibilidade para as pessoas com necessidades comunicativas especiaisValdemar Júnior
 
Final draft(I 2)
Final draft(I 2)Final draft(I 2)
Final draft(I 2)Keaton Ott
 
3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"
3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"
3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"Badalona Serveis Assistencials
 
Wealthiest income – analysis and commentary - Canada - 2016
Wealthiest income – analysis and commentary - Canada - 2016Wealthiest income – analysis and commentary - Canada - 2016
Wealthiest income – analysis and commentary - Canada - 2016paul young cpa, cga
 
7.a Jan Schwietzke - "Caring me, tratamiento online para la depresión"
7.a Jan Schwietzke - "Caring  me, tratamiento online para la depresión"7.a Jan Schwietzke - "Caring  me, tratamiento online para la depresión"
7.a Jan Schwietzke - "Caring me, tratamiento online para la depresión"Badalona Serveis Assistencials
 
62 oitava categoria - caso 02 e caso 03
62   oitava categoria - caso 02 e caso 0362   oitava categoria - caso 02 e caso 03
62 oitava categoria - caso 02 e caso 03Fatoze
 
6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...
6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...
6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...Badalona Serveis Assistencials
 
Southern Transport Service SOP
Southern Transport Service SOPSouthern Transport Service SOP
Southern Transport Service SOPRichard Gibbens
 

Destaque (16)

MDJ 202 2nd Assgmnt
MDJ 202 2nd AssgmntMDJ 202 2nd Assgmnt
MDJ 202 2nd Assgmnt
 
Acessibilidade para as pessoas com necessidades comunicativas especiais
Acessibilidade para as pessoas com necessidades comunicativas especiaisAcessibilidade para as pessoas com necessidades comunicativas especiais
Acessibilidade para as pessoas com necessidades comunicativas especiais
 
Final draft(I 2)
Final draft(I 2)Final draft(I 2)
Final draft(I 2)
 
Carpeta san francisco11
Carpeta san francisco11Carpeta san francisco11
Carpeta san francisco11
 
Mapa conceptual
Mapa conceptual Mapa conceptual
Mapa conceptual
 
Curriculum vitae sheyla
Curriculum vitae sheylaCurriculum vitae sheyla
Curriculum vitae sheyla
 
3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"
3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"
3.c Pilar Sala - "Lecciones aprendidas. Aspectos prácticos del cambio de modelo"
 
PPTSIRISHPROPOSAL
PPTSIRISHPROPOSALPPTSIRISHPROPOSAL
PPTSIRISHPROPOSAL
 
Wealthiest income – analysis and commentary - Canada - 2016
Wealthiest income – analysis and commentary - Canada - 2016Wealthiest income – analysis and commentary - Canada - 2016
Wealthiest income – analysis and commentary - Canada - 2016
 
7.a Jan Schwietzke - "Caring me, tratamiento online para la depresión"
7.a Jan Schwietzke - "Caring  me, tratamiento online para la depresión"7.a Jan Schwietzke - "Caring  me, tratamiento online para la depresión"
7.a Jan Schwietzke - "Caring me, tratamiento online para la depresión"
 
Manual passo a passo instalação moldura 2 DIN Fiat Ducato/Peugeot Boxer/Citro...
Manual passo a passo instalação moldura 2 DIN Fiat Ducato/Peugeot Boxer/Citro...Manual passo a passo instalação moldura 2 DIN Fiat Ducato/Peugeot Boxer/Citro...
Manual passo a passo instalação moldura 2 DIN Fiat Ducato/Peugeot Boxer/Citro...
 
62 oitava categoria - caso 02 e caso 03
62   oitava categoria - caso 02 e caso 0362   oitava categoria - caso 02 e caso 03
62 oitava categoria - caso 02 e caso 03
 
Sobrecargado De Informacion: Medidas Que Tomar
Sobrecargado De Informacion: Medidas Que TomarSobrecargado De Informacion: Medidas Que Tomar
Sobrecargado De Informacion: Medidas Que Tomar
 
6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...
6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...
6.b German Lorenzo y Silvia Morea - "Experiencia en VIC de integración de ser...
 
evaluacion
evaluacionevaluacion
evaluacion
 
Southern Transport Service SOP
Southern Transport Service SOPSouthern Transport Service SOP
Southern Transport Service SOP
 

Semelhante a [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
Akka Microservices Architecture And Design
Akka Microservices Architecture And DesignAkka Microservices Architecture And Design
Akka Microservices Architecture And DesignYaroslav Tkachenko
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...GeeksLab Odessa
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCBuilding a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCKonrad Malawski
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317Nan Zhu
 
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsStrata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsSingleStore
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesDatabricks
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemAccumulo Summit
 
Streaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQLStreaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQLNick Dearden
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...DataStax Academy
 
Kafka streams decoupling with stores
Kafka streams decoupling with storesKafka streams decoupling with stores
Kafka streams decoupling with storesYoni Farin
 
Real time data-pipeline from inception to production
Real time data-pipeline from inception to productionReal time data-pipeline from inception to production
Real time data-pipeline from inception to productionShreya Mukhopadhyay
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 

Semelhante a [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka (20)

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Akka Microservices Architecture And Design
Akka Microservices Architecture And DesignAkka Microservices Architecture And Design
Akka Microservices Architecture And Design
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYCBuilding a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsStrata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
 
Streaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQLStreaming ETL with Apache Kafka and KSQL
Streaming ETL with Apache Kafka and KSQL
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 
Kafka streams decoupling with stores
Kafka streams decoupling with storesKafka streams decoupling with stores
Kafka streams decoupling with stores
 
Real time data-pipeline from inception to production
Real time data-pipeline from inception to productionReal time data-pipeline from inception to production
Real time data-pipeline from inception to production
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 

Último

Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
[ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

  • 1. Implement a scalable statistical aggregation system using Akka. Scala by the Bay, 12 Nov 2016. Stanley Nguyen, Vu Ho, Email Security @ Symantec Singapore
  • 2. The system: provides a service that answers time-series analytical questions such as COUNT, TOPK, SET MEMBERSHIP, and CARDINALITY on a dynamic set of data streams, using a statistical approach.
  • 3. Motivation
      • The system collects data from multiple sources in streaming log format.
      • Some common questions in an email anti-abuse system:
          • Most frequent items (IP, domain, sender, etc.)
          • Number of unique items
          • Have we seen an item before?
      • => We need to be able to answer such questions in a timely manner.
  • 4. Data statistics
      • 6K email logs/second
      • One email log is flattened out into subevents:
          • IP, sender, sender domain, etc.
          • Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
      • Total: ~200K messages/second
  • 5. Challenges. Our system needs to be: responsive, space efficient, reactive, extensible, scalable, resilient.
  • 6. Sketching data structures
      • How many times have we seen a certain IP? Count-Min Sketch (CMS): counting things + TopK
      • How many unique senders did we see yesterday? HyperLogLog (HLL): set cardinality
      • Did we see a certain IP last month? Bloom Filter (BF): set membership
      • SPACE / SPEED
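
To make the "counting things" case on slide 6 concrete, here is a minimal Count-Min Sketch in plain Scala. It is only an illustrative sketch, not the speakers' code: the slides say the real system relies on streamlib / Twitter Algebird, and the class name, the `depth`/`width` parameters, and the use of seeded `MurmurHash3` as the per-row hash are all assumptions made for this example.

```scala
import scala.util.hashing.MurmurHash3

/** Minimal Count-Min Sketch: `depth` hash rows, each holding `width` counters. */
final class CountMinSketch(depth: Int, width: Int) {
  require(depth > 0 && width > 0, "depth and width must be positive")

  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, obtained by seeding MurmurHash3 with the row index.
  private def bucket(row: Int, item: String): Int =
    Math.floorMod(MurmurHash3.stringHash(item, row), width)

  /** Record `count` occurrences of `item` in every row. */
  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(row, item)) += count

  /** Collisions can only inflate a counter, so the minimum over all rows
    * is the tightest upper bound the sketch can give on the true count. */
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(row, item))).min
}
```

Memory use is fixed at `depth * width` counters regardless of how many distinct items are seen, which is the space-for-accuracy trade the slide alludes to: an estimate may overcount but never undercounts.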
  • 7. What is available / What we try to solve
      • What is available: implementing data structures for finding cardinality (i.e. counting things), set membership, and top-k elements. This is solved by using streamlib / Twitter Algebird.
      • What we try to solve: implementing a dynamic, reactive, distributed system for answering cardinality (i.e. counting things), set membership, and top-k queries.
  • 9. Responsive, space efficient, reactive, extensible, scalable, resilient
  • 14. Our design – Dynamic stream
  • 15. Merge Hub: provided by Akka Stream; allows a dynamic set of TCP producers.
  • 16. Splitter Hub
      • Splits the stream based on event type to a dynamic set of downstream consumers.
      • Consumers are actors which implement CMS, BF, HLL, etc. logic.
      • Not available in akka-stream.
  • 17. Splitter Hub API
      • Similar to Akka Stream's built-in BroadcastHub, but with a different back-pressure implementation.
      • [[SplitterHub]].source can be supplied with a predicate/selector function to return a filtered subset of the data.
  • 19. Splitter Hub
      • The [[Source]] can be materialized any number of times; each materialization creates a new consumer which can be registered with the hub, and then receives items matching the selector function from the upstream.
      • Consumers can be added at run time.
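
The selector idea on slides 17 and 19 can be modeled in a few lines of plain Scala. This is deliberately not Akka code and has no back-pressure: `SplitterModel`, `register`, `unregister`, `publish`, and the `Event` case class are hypothetical names invented for this sketch, and the real SplitterHub API is not shown in the slides. The only point illustrated is that consumers carry their own predicate and can join or leave while events are flowing.

```scala
import scala.collection.mutable

/** Plain-Scala model of selector-based fan-out (no Akka, no back-pressure). */
final class SplitterModel[T] {
  // Each consumer is a (selector, handler) pair keyed by a registration id.
  private val consumers =
    mutable.LinkedHashMap.empty[Int, (T => Boolean, T => Unit)]
  private var nextId = 0

  /** Register a consumer at run time; the returned id can be passed to `unregister`. */
  def register(selector: T => Boolean)(handler: T => Unit): Int = {
    val id = nextId
    nextId += 1
    consumers.update(id, (selector, handler))
    id
  }

  def unregister(id: Int): Unit = {
    consumers.remove(id)
    ()
  }

  /** Deliver `elem` to every registered consumer whose selector accepts it. */
  def publish(elem: T): Unit =
    consumers.values.foreach { case (selector, handler) =>
      if (selector(elem)) handler(elem)
    }
}

// Hypothetical event shape, loosely following (msg-type, @timestamp, key, value).
final case class Event(msgType: String, timestamp: Long, key: String, value: Long)
```

A consumer interested only in IP events would call `hub.register(_.msgType == "ip")(handler)`; events published before a consumer registers are simply not seen by it.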
  • 20. Consumers
      • Can be either local or remote.
      • Managed by a coordination actor.
      • Each implements a specific data structure (CMS/BF/HLL) for a particular event type and a specific time range.
      • Responsibilities:
          • Answer a specific query.
          • Regularly persist a serialized form of the internal data structure (count-min table, etc.).
      • (Diagram labels: COUNT-QUERY, forward, ref, snapshot)
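
As an illustration of a consumer that both answers a query and persists its internal structure, here is a small Bloom filter with byte-array snapshots in plain Scala. It is a sketch under assumptions: the class, the double-hashing scheme built on `MurmurHash3`, and the `snapshot`/`restore` names are invented for this example, and the slides do not say which serialization format the real consumers use.

```scala
import java.util.{BitSet => JBitSet}
import scala.util.hashing.MurmurHash3

/** Minimal Bloom filter over strings that can be snapshotted to bytes. */
final class BloomFilter private (numBits: Int, numHashes: Int, bits: JBitSet) {
  require(numBits > 0 && numHashes > 0, "numBits and numHashes must be positive")

  def this(numBits: Int, numHashes: Int) =
    this(numBits, numHashes, new JBitSet(numBits))

  // Double hashing: derive `numHashes` bit positions from two base hashes.
  private def positions(item: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(item, 0)
    val h2 = MurmurHash3.stringHash(item, h1)
    (0 until numHashes).map(i => Math.floorMod(h1 + i * h2, numBits))
  }

  def add(item: String): Unit = positions(item).foreach(i => bits.set(i))

  /** false means "definitely never added"; true means "possibly added". */
  def mightContain(item: String): Boolean =
    positions(item).forall(i => bits.get(i))

  /** Serialized bit array, suitable for writing to a snapshot store. */
  def snapshot: Array[Byte] = bits.toByteArray
}

object BloomFilter {
  /** Rebuild a filter from a snapshot taken with the same parameters. */
  def restore(numBits: Int, numHashes: Int, snapshot: Array[Byte]): BloomFilter =
    new BloomFilter(numBits, numHashes, JBitSet.valueOf(snapshot))
}
```

The snapshot is only meaningful together with the `numBits` and `numHashes` it was created with, so a real system would have to store those alongside the bytes.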
  • 21. Responsive, space efficient, reactive, extensible, scalable, resilient
  • 22. Scaling out
      • What if the data does not fit on one machine?
      • What if a server crashes?
      • How do we maintain back-pressure end to end?
  • 24. Akka Stream TCP
      • Handled by the kernel (back-pressure, reliable).
      • For each worker, we create a source for each message type it is responsible for, using the SplitterHub source() API.
      • Connect each source to a TCP connection and send to the worker.
      • Back-pressure is maintained across the network.
  • 26. Master failover
      • The Coordinator is the single point of failure.
      • Run multiple Coordinator actors as a Cluster Singleton.
      • Workers communicate with the master (heartbeat) using Cluster Client.
  • 27. Worker failover
      • Each worker persists all events to a DB journal + snapshot.
          • Akka Persistence.
          • Redis for storing the journal + snapshot.
      • When a worker is down, its keys are redistributed.
          • The master then redirects traffic to other workers.
          • CMS actors are restored on the new worker from snapshot + journal.
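
The "restore from snapshot + journal" step on slide 27 follows the usual event-sourcing recipe: start from the latest snapshot, then replay only the journal entries recorded after it. The plain-Scala sketch here shows that recipe without Akka Persistence or Redis. `Counted`, `Snapshot`, and `Recovery.recover` are hypothetical names, and an exact `Map` of counts stands in for the CMS table to keep the example short.

```scala
/** A journaled increment: `seqNr` orders events, `key` is the counted item. */
final case class Counted(seqNr: Long, key: String, amount: Long)

/** State captured at `seqNr`; every event up to and including it is already applied. */
final case class Snapshot(seqNr: Long, counts: Map[String, Long])

object Recovery {
  /** Rebuild state by applying only the journal events newer than the snapshot. */
  def recover(snapshot: Option[Snapshot], journal: Seq[Counted]): Snapshot = {
    val start = snapshot.getOrElse(Snapshot(0L, Map.empty[String, Long]))
    journal
      .filter(_.seqNr > start.seqNr) // skip events the snapshot already covers
      .sortBy(_.seqNr)               // replay in the order they were written
      .foldLeft(start) { (state, ev) =>
        val updated = state.counts.getOrElse(ev.key, 0L) + ev.amount
        Snapshot(ev.seqNr, state.counts.updated(ev.key, updated))
      }
  }
}
```

Recovering with a snapshot and recovering from the full journal alone should give the same state; the snapshot only shortens the replay.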
  • 28. Benchmark
      • Akka Stream on a single node: 100K+ msg/second (one msg-type)
      • Akka Stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
      • Akka Stream on a remote node (remote TCP) with Akka Persistence journal: 2000+ msg/second (one msg-type)
  • 29. Conclusion
      • Our system is: responsive, reactive, scalable, resilient.
      • Future work:
          • Make workers metric-agnostic
          • Scale out the master
          • Exactly-once delivery for workers
          • More flexible filtering using SplitterHub
  • 30. Q&A