
Google Dataflow Intro


Main concepts of Google Dataflow. Pipelines, Windowing, Triggers, Late Data, etc.

Published in: Software

  1. Google Dataflow: introduction (Sep 2015, iglushkov@machinezone.com)
  2. What is Google Dataflow
     ❖ Data processing system: batch and streaming
     ❖ Set of SDKs
     ❖ Google Cloud Platform managed services:
        ❖ Google Compute Engine (VMs)
        ❖ Google Cloud Storage (r/w data)
        ❖ BigQuery (r/w data)
  3. Programming Model
     ❖ Pipeline - entire series of computations
     ❖ PCollection - set of data in a pipeline
     ❖ Transform - any data processing operation
     ❖ Pipeline I/O - data source and data sink APIs
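The four pieces of the programming model can be illustrated with a toy example. This is a minimal sketch in plain Python, not the Dataflow SDK: `ToyPipeline`, `to_upper`, and `drop_short` are hypothetical names, the source is an ordinary list standing in for a PCollection, and the sink is a list that receives the output.

```python
from typing import Callable, Iterable, List


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: source -> transforms -> sink."""

    def __init__(self, source: Iterable):
        self._source = source  # plays the role of the input PCollection
        self._transforms: List[Callable[[Iterable], Iterable]] = []

    def apply(self, transform: Callable[[Iterable], Iterable]) -> "ToyPipeline":
        # Transforms are only recorded here; nothing runs until run() is called.
        self._transforms.append(transform)
        return self

    def run(self, sink: list) -> list:
        data = self._source
        for transform in self._transforms:
            data = transform(data)
        sink.extend(data)  # plays the role of the data sink
        return sink


def to_upper(words):
    return (w.upper() for w in words)


def drop_short(words):
    return (w for w in words if len(w) > 3)


result = (
    ToyPipeline(["data", "is", "flowing"])
    .apply(to_upper)
    .apply(drop_short)
    .run([])
)
print(result)
```

Recording transforms first and executing them later mirrors the idea that a pipeline is a description of the whole computation, which a runner can then execute locally or on a managed service.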
  4. Pipeline
     ❖ Data + Transforms
     ❖ Branching + merging
     ❖ Multiple sources
     ❖ Unit testing + Integration testing
     ❖ Pipeline Execution Parameters (local/prod)
     ❖ Where the data comes from, what it looks like, what to do with it, where to store it
  5. PCollection
     ❖ Represents data in a pipeline from any source
     ❖ Potentially unlimited (stream)
     ❖ Serializable, immutable, no random access to elements
     ❖ Deferred data (may have yet to be computed)
     ❖ Windowing, triggers
  6. Windowing
     ❖ Window - a logical subdivision of a PCollection
     ❖ Each element is assigned to one or more windows
     ❖ Fixed time windows
     ❖ Sliding time windows
     ❖ Per-session windows
     ❖ Single global window
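How an element ends up in "one or more windows" is easiest to see in code. The following is a minimal sketch in plain Python, not the Dataflow SDK API: the function names are hypothetical, timestamps are integer seconds, and windows are half-open `(start, end)` intervals. Session windows are computed over a batch of timestamps for a single key, which is a simplification of how a streaming system would merge them incrementally.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def fixed_windows(ts: int, size: int) -> List[Window]:
    """Each timestamp falls into exactly one fixed window."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """A timestamp falls into every window of length `size` that starts
    on a multiple of `period` and still covers it."""
    last_start = ts - ts % period
    starts = range(last_start, ts - size, -period)
    return [(s, s + size) for s in starts]


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Timestamps closer together than `gap` are merged into one session."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Still inside the current session: extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions


print(fixed_windows(65, size=60))
print(sliding_windows(65, size=60, period=30))
print(session_windows([1, 3, 20, 22, 50], gap=10))
```

With a 60-second window sliding every 30 seconds, the element at second 65 belongs to two windows, which is why sliding windows can emit the same element in more than one aggregate.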
  7. Late Data
     ❖ Event time / Processing time
     ❖ No order guarantee
     ❖ No consistent delta between Event and Processing time
     ❖ Watermark
     ❖ Late data
     ❖ Triggers to refine windowing, data reporting time
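The relationship between event time, the watermark, and late data can be sketched as follows. This is plain Python, not the Dataflow SDK, and the watermark rule is an assumption made for illustration: here it is simply the largest event time seen so far minus a fixed `max_skew`, whereas a real service estimates its watermark from the sources. The name `classify` is hypothetical.

```python
from typing import List, Tuple


def classify(events: List[Tuple[int, str]], max_skew: int) -> List[Tuple[str, str]]:
    """Label each element as on time or late.

    `events` holds (event_time, value) pairs in arrival order, that is,
    in processing-time order. An element is late when its event time is
    already behind the watermark at the moment it arrives.
    """
    watermark = float("-inf")
    labelled = []
    for event_time, value in events:
        status = "late" if event_time < watermark else "on_time"
        labelled.append((value, status))
        # Assumed heuristic: the watermark trails the newest event time by max_skew.
        watermark = max(watermark, event_time - max_skew)
    return labelled


arrivals = [(10, "a"), (12, "b"), (30, "c"), (11, "d"), (28, "e")]
print(classify(arrivals, max_skew=5))
```

Element `d` arrives after `c` pushed the watermark to 25, so its event time of 11 is behind the watermark and it counts as late; `e` is also out of order but still within the allowed skew.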
  8. Triggers
     ❖ Enough data for the window -> aggregate result: “pane”
     ❖ Help handle late data
     ❖ Time-based triggers
     ❖ Data-driven triggers (e.g. certain amount is enough)
     ❖ Composite triggers: OR, AND - operations on triggers
     ❖ Window Accumulation modes: accumulate/discard the previous “panes”
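The difference between the two accumulation modes shows up as soon as a trigger fires more than once for the same window. Below is a minimal plain-Python sketch, not the Dataflow SDK: `fire_panes` is a hypothetical function that models a data-driven trigger firing after every `count` elements of a single window.

```python
from typing import List


def fire_panes(elements: List[int], count: int, accumulate: bool) -> List[List[int]]:
    """Emit a pane each time `count` new elements have arrived.

    accumulate=True  -> each pane also contains everything from earlier panes
    accumulate=False -> each pane contains only elements since the last firing
    """
    panes: List[List[int]] = []
    fired: List[int] = []   # elements already emitted in earlier panes
    buffer: List[int] = []  # elements that arrived since the last firing
    for element in elements:
        buffer.append(element)
        if len(buffer) >= count:
            panes.append(fired + buffer if accumulate else list(buffer))
            if accumulate:
                fired = fired + buffer
            buffer = []
    return panes


data = [1, 2, 3, 4, 5, 6]
print(fire_panes(data, count=2, accumulate=False))
print(fire_panes(data, count=2, accumulate=True))
```

In discarding mode a downstream consumer has to add the panes together itself; in accumulating mode each new pane replaces the previous result for that window.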
  9. Transforms
     ❖ Math, convert format, grouping, filtering, combining
     ❖ [PCollection] -> [PCollection]
     ❖ Core Transforms: ParDo, GroupByKey, Combine, …
     ❖ Functions with business logic to apply: Serializable, Thread-compatible, Idempotent
     ❖ Composite Transforms
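The three core transforms named above compose naturally into a word count. The sketch below imitates them in plain Python over ordinary lists; it is not the Dataflow SDK, and `par_do`, `group_by_key`, and `combine_per_key` are hypothetical helpers named after the concepts. Each helper returns a new list and leaves its input untouched, in the spirit of immutable PCollections.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def par_do(fn: Callable, pcoll: Iterable) -> list:
    """Apply fn to every element; fn may return zero or more outputs."""
    return [out for element in pcoll for out in fn(element)]


def group_by_key(pcoll: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pcoll:
        groups[key].append(value)
    return sorted(groups.items())  # sorted only to make the output deterministic


def combine_per_key(combiner: Callable, grouped: Iterable) -> list:
    """Reduce each key's values to a single result."""
    return [(key, combiner(values)) for key, values in grouped]


lines = ["a b a", "b c"]
pairs = par_do(lambda line: [(word, 1) for word in line.split()], lines)
counts = combine_per_key(sum, group_by_key(pairs))
print(counts)
```

The function passed to `par_do` has no side effects and gives the same output for the same input, which is the property the slide asks for when it lists idempotent, thread-compatible functions.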
  10. Pipeline I/O
     ❖ Read/Write from/to external sources
     ❖ Text Files in Google Cloud Storage or local FS
     ❖ BigQuery tables
     ❖ Google Cloud PubSub
     ❖ Custom Sources and Sinks
  11. Extra
     ❖ Parallelization, distribution, optimization, scaling
     ❖ Dataflow monitoring UI and CLI
     ❖ Logging
     ❖ Unit testing (locally) any Fn, end-to-end
     ❖ Introspection toolchain
     ❖ Update toolchain: for code, windowing configs
  12. Questions?
