O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

Modern Stream Processing With Apache Flink @ GOTO Berlin 2017

1.579 visualizações

Publicada em

In our fast moving world it becomes more and more important for companies to gain near real-time insights from their data to make faster decisions. These insights do not only provide a competitve edge over ones rivals but also enable a company to create completely new services and products. Amongst others, predictive user interfaces and online recommendation can be implemented when being able to process large amounts of data in real-time.

Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with milliseconds latency. Moreover, its expressive APIs allow you to quickly solve your problems, ranging from classical analytical workloads to distributed event-driven applications.

In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink’s basic concepts of fault-tolerance, statefulness and event-time aware processing, we will take a look at the different APIs and what they allow us to do. The talk will be concluded by demonstrating how we can use Flink’s higher level abstractions such as FlinkCEP and StreamSQL to do declarative stream processing.

Publicada em: Tecnologia
  • Entre para ver os comentários

Modern Stream Processing With Apache Flink @ GOTO Berlin 2017

  1. 1. Modern Stream Processing with Apache Flink® Till Rohrmann @stsffap GOTO Berlin 2017 1
  2. 2. 2 Original creators of Apache Flink® dA Platform 2 Open Source Apache Flink + dA Application Manager
  3. 3. What changes faster? Data or Query? 3 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter tuning Batch Processing Use Case Data changes fast application logic is long-lived continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, … Stream Processing Use Case
  4. 4. Batch Processing 4
  5. 5. Stream Processing 5
  6. 6. 6 Apache Flink in a Nutshell
  7. 7. Apache Flink in a Nutshell 7 Queries Applications Devices etc. Database Stream File / Object Storage Stateful computations over streams real-time and historic fast, scalable, fault tolerant, in-memory, event time, large state, exactly-once Historic Data Streams Application
  8. 8. 8 Event Streams State (Event) Time Snapshots The Core Building Blocks real-time and hindsight complex business logic consistency with out-of-order data and late data forking / versioning / time-travel
  9. 9. Powerful Abstractions 9 Process Function (events, state, time) DataStream API (streams, windows) Stream SQL / Tables (dynamic tables) Stream- & Batch Data Processing High-level Analytics API Stateful Event- Driven Applications val stats = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) .sum((a, b) -> a.add(b)) def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = { // work with event and state (event, state.value) match { … } out.collect(…) // emit events state.update(…) // modify state // schedule a timer callback ctx.timerService.registerEventTimeTimer(event.timestamp + 500) } Layered abstractions to navigate simple to complex use cases
  10. 10. Hardened at scale 10 Athena X Streaming SQL Platform Service Streaming Platform as a Service Fraud detection Streaming Analytics Platform 100s jobs, 1000s nodes, TBs state metrics, analytics, real time ML Streaming SQL as a platform
  11. 11. 11 Distributed application infrastructure
  12. 12. Good old centralized architecture 12 The big mean central database $$$ The grumpy DBA Application Application Application Application
  13. 13. Write log and tables 13 From the Apache Cassandra docs
  14. 14. Modern distributed app. architecture 14 Event Sourcing Turning the DB inside out CQRS
  15. 15. Modern distributed app. architecture 15 Application Sensor APIs Application Application Application The limit to what you can do is how sophisticated can you compute over the stream! Boils down to: How well do you handle state and time
  16. 16. 16 A Flink-favored approach
  17. 17. Event Sourcing + Memory Image 17 event log persists events (temporarily) event / command Process main memory update local variables/structures periodically snapshot the memory
  18. 18. Event Sourcing + Memory Image 18 Recovery: Restore snapshot and replay events since snapshot event log persists events (temporarily) Process
  19. 19. Distributed Memory Image 19 Distributed application, many memory images. Snapshots are all consistent together.
  20. 20. Stateful Event & Stream Processing 20 Scalable embedded state Access at memory speed & scales with parallel operators
  21. 21. Compute, State, and Storage 21 Classic tiered architecture Streaming architecture database layer compute layer application state + backup compute + stream storage and snapshot storage (backup) application state
  22. 22. Performance 22 synchronous reads/writes across tier boundary asynchronous writes of large blobs all modifications are local Classic tiered architecture Streaming architecture
  23. 23. Consistency 23 distributed transactions at scale typically at-most / at-least once exactly once per state =1 =1 Classic tiered architecture Streaming architecture
  24. 24. Scaling a Service 24 separately provision additional database capacity provision compute and state together Classic tiered architecture Streaming architecture provision compute
  25. 25. Rolling out a new Service 25 provision a new database (or add capacity to an existing one) simply occupies some additional backup space Classic tiered architecture Streaming architecture provision compute and state together
  26. 26. What users built on checkpoints… § Upgrades and Rollbacks § Cross Datacenter Failover § State Archiving § Application Migration § Spot Instance Region Chasing § A/B testing § … 26
  27. 27. 27 What is the next wave of stream processing applications?
  28. 28. What changes faster? Data or Query? 28 Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and tuning Batch Processing Use Case Data changes fast application logic is long-lived continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, … Stream Processing Use Case
  29. 29. Analytics & Business Logic 29 Stream Source analytics metrics Business Logic Stream Processor events ApplicationDatabase
  30. 30. Blending Analytics & Business Logic 30 Stream Source Streaming Application Stream Processor events RPCs (Actor) Messages …
  31. 31. AthenaX by Uber 31
  32. 32. 32 Can one build an entire sophisticated web application (say a social network) on a stream processor? (Yes, we can!™)
  33. 33. 33 Social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation) on Kafka/Flink/Elasticsearch/Redis More: https://data-artisans.com/blog/drivetribe-cqrs-apache-flink @
  34. 34. 34 The next wave of stream processing applications… … is all types of stateful applications that react to data and time!
  35. 35. 35 Stateful stream applications versioning, upgrading, rollback, duplicating, migrating, … Continuous applications
  36. 36. The dA Platform Architecture 36 dA Platform 2 Apache Flink Stateful stream processing Kubernetes Container platform Logging Streams from Kafka, Kinesis, S3, HDFS, Databases, ... dA Application Manager Application lifecycle management Metrics CI/CD Real-time Analytics Anomaly- & Fraud Detection Real-time Data Integration Reactive Microservices (and more)
  37. 37. Versioned Applications, not Jobs/Jars 37 Stream Processing Application Version 3 Version 2 Version 1 Code and Application Snapshot upgrade upgrade New Application Version 3a Version 2a fork / duplicate
  38. 38. Deployments, not Flink Clusters 38 Testing / QA Kubernetes Cluster Production Kubernetes Cluster Threat Metrics App. Testing Activity Monitor Application Fraud Detection Application Fraud Detection App. Testing
  39. 39. Hooks for CI/CD pipelines 39 Kubernetes Cluster Application Version 1 CI Service dA Application Manager push update trigger CI upgrade API stateful upgrade Application Version 2
  40. 40. 4 Thank you! @stsffap @ApacheFlink @dataArtisans
  41. 41. We are hiring! data-artisans.com/careers 41

×