How a BEAM runner executes a pipeline
Javier Ramirez (@supercoco9)
Head of Engineering @teamdatatonic
2018-10-02
▪ Why do I care?
▪ Pipeline basics
▪ Runner Overview
▪ Exploring the graph
▪ Implementing PCollections
▪ Watermark Propagation
▪ Implementing PTransforms: Read, ParDo, GroupByKey,
Window, and Flatten
▪ Optimising the execution Plan
▪ Persisting state (coders and snapshots)
▪ FnAPI and RunnerAPI for SDK independence
Why
▪ I started using the Dataflow private alpha when it was “just” a serverless runner
▪ Then Beam was born as a common layer on top of multiple runners
▪ Wanted to understand what is part of Beam and what’s part of the runner
▪ Might help choosing the right runner for the job
Pipeline overview
■ Write pipeline code in Java, Python, or Go
■ The abstraction is a Directed Acyclic Graph (DAG) where nodes are transforms and edges are data flowing as
PCollections.
■ Both PTransforms and PCollections can be distributed and parallelised, and the model is fault-tolerant, so
they need to be serializable to be sent across workers
■ Read data from one or more inputs, bounded or unbounded
■ Apply transforms, stateless or stateful
■ Write data to one or more outputs
■ Optionally, keep track of metrics
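A minimal sketch of a pipeline with this shape, using the Java SDK; the bucket paths and transform names are placeholders:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("Read", TextIO.read().from("gs://my-bucket/input-*.txt"))      // bounded input
     .apply("SplitWords", FlatMapElements.into(TypeDescriptors.strings())
         .via((String line) -> Arrays.asList(line.split("\\s+"))))         // stateless transform
     .apply("CountPerWord", Count.perElement())                            // implies a GroupByKey
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("Write", TextIO.write().to("gs://my-bucket/counts"));          // output

    p.run().waitUntilFinish();  // from here on, the runner owns the DAG
  }
}
```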
Runner overview
BEAM-compatible is a very flexible claim
▪ Can choose to support only some languages (The portability API will change this)
▪ Can choose to support only batch or streaming processing
▪ Can choose to what extent to support early triggers and late data, refinements, state…
▪ Needs to translate from BEAM code to runner-native code
▪ Is responsible for submitting and monitoring the pipeline
▪ Must serialize/deserialize data and functions across workers and stages
▪ Is responsible for performance, scalability, optimisations, and enforcing the BEAM
model guarantees (some methods will be called exactly once; a transform will not be
executed by more than one thread at once within a worker; if a bundle of data is
processed by a transform more than once, it will not generate duplicates…)
Runner entrypoint: exploring the DAG
■ Beam provides a method to traverse (visit) the graph. Runners need to walk the graph to:
■ Validate the pipeline
■ Get insights to choose the best execution strategy
❏ Example: Spark Runner
❏ Chooses between the batch and streaming engines by visiting the graph and checking whether any
PCollection is unbounded
❏ Detects PCollections that are used in more than one transform, and creates internal caches to store
those collections
■ Translate the BEAM transforms into native transforms
■ Optimise the graph execution (to minimize serialization and shuffling)
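As a sketch of what such a visitor can look like in the Java SDK, here is a minimal boundedness check modelled loosely on the Spark runner behaviour above; the class name is made up, the Beam API calls are real:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.runners.TransformHierarchy;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PValue;

// Walks the DAG and records whether any PCollection is unbounded.
class BoundednessVisitor extends Pipeline.PipelineVisitor.Defaults {
  private boolean sawUnbounded = false;

  @Override
  public void visitValue(PValue value, TransformHierarchy.Node producer) {
    if (value instanceof PCollection
        && ((PCollection<?>) value).isBounded() == PCollection.IsBounded.UNBOUNDED) {
      sawUnbounded = true;  // one unbounded PCollection is enough to pick streaming
    }
  }

  boolean requiresStreamingEngine() {
    return sawUnbounded;
  }
}

// Usage, before translating the pipeline:
//   BoundednessVisitor visitor = new BoundednessVisitor();
//   pipeline.traverseTopologically(visitor);
//   boolean useStreamingEngine = visitor.requiresStreamingEngine();
```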
Implementing PCollections
■ Unordered bags of elements
■ Might be bounded or unbounded
■ All the elements are of the same type and the PCollection has a coder to serialize/deserialize
elements
■ Every element will always have
■ A Timestamp (might be negative infinity if not important)
■ A Window, which is initially the global window, but can be changed via transforms
■ Every PCollection has a watermark estimating how complete it is
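Both per-element properties can be observed from inside a DoFn; a small sketch (the DoFn itself is hypothetical):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

class InspectElementFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    // c.timestamp() is the element's event timestamp; unset timestamps are
    // BoundedWindow.TIMESTAMP_MIN_VALUE (i.e. negative infinity).
    c.output(c.element() + " @ " + c.timestamp() + " in " + window);
  }
}
```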
Watermark Propagation
Watermark propagation diagram taken from the Flink documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html
Implementing PTransforms
■ Beam can do pretty complex things with just a few primitives
■ Read
■ Flatten
■ Window
■ GroupByKey
■ ParDo
Implementing Read
■ Read can be bounded or unbounded.
■ The runner calls split to partition the source into bundles
■ The runner gets a reader object and then executes the read loop sketched below
■ If supported, the runner can call splitAtFraction to enable dynamic rebalancing
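A sketch of that read loop, using the BoundedSource/BoundedReader method names from the Java SDK; the surrounding executor class and the emit helper are hypothetical runner internals:

```java
import java.util.List;

import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.joda.time.Instant;

class BoundedReadExecutor {
  <T> void readBundles(BoundedSource<T> source, PipelineOptions options) throws Exception {
    // 1. Split the source into bundles of roughly the desired size in bytes.
    List<? extends BoundedSource<T>> bundles = source.split(64 * 1024 * 1024, options);

    // 2. For each bundle, get a reader and iterate it.
    for (BoundedSource<T> bundle : bundles) {
      try (BoundedSource.BoundedReader<T> reader = bundle.createReader(options)) {
        for (boolean more = reader.start(); more; more = reader.advance()) {
          emitDownstream(reader.getCurrent(), reader.getCurrentTimestamp());
        }
        // 3. While a bundle is being read, the runner may call
        //    reader.splitAtFraction(0.5) to hand the remainder of the
        //    bundle to an idle worker (dynamic work rebalancing).
      }
    }
  }

  private <T> void emitDownstream(T element, Instant timestamp) {
    // hand the element to the next transform in the fused stage (runner-specific)
  }
}
```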
Implementing unbounded Read
■ The general pattern is the same for both bounded and unbounded, but unbounded sources also
■ Report a watermark that the runner needs to associate with the elements and propagate downstream
■ Provide Record IDs in case we need to use deduplication to enforce exactly-once processing
■ Support Checkpointing. The runner can get information about the current checkpoint on the stream,
and can call “FinalizeCheckpoint” to tell the unbounded source the elements are safe in the pipeline
and can be acknowledged from the stream if needed
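A rough sketch of the corresponding unbounded loop, again using the Java SDK method names; scheduling and checkpoint frequency are deliberately simplified:

```java
import org.apache.beam.sdk.io.UnboundedSource;
import org.joda.time.Instant;

class UnboundedReadExecutor {
  <T> void readLoop(UnboundedSource.UnboundedReader<T> reader) throws Exception {
    boolean available = reader.start();
    while (true) {  // an unbounded read never finishes on its own
      if (available) {
        // Record IDs are only meaningful if the source requires deduplication.
        byte[] recordId = reader.getCurrentRecordId();
        emitDownstream(reader.getCurrent(), reader.getCurrentTimestamp(), recordId);
      }

      // Periodically propagate the source watermark downstream.
      advanceWatermark(reader.getWatermark());

      // Periodically checkpoint; once the runner has durably persisted the
      // emitted elements, it acknowledges them back to the stream.
      UnboundedSource.CheckpointMark mark = reader.getCheckpointMark();
      // ... persist runner state, then:
      mark.finalizeCheckpoint();

      available = reader.advance();
    }
  }

  private <T> void emitDownstream(T element, Instant ts, byte[] recordId) { /* runner-specific */ }
  private void advanceWatermark(Instant watermark) { /* runner-specific */ }
}
```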
Implementing Flatten
■ The runner only needs to verify that the windowing strategies of all the PCollections to flatten are
compatible
■ The result is a single PCollection containing all the elements and windows of the input
PCollections without any changes
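A minimal Java example of this behaviour, assuming both inputs share the same (default global) windowing:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class FlattenExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    PCollection<String> a = p.apply("SourceA", Create.of("a1", "a2"));
    PCollection<String> b = p.apply("SourceB", Create.of("b1"));

    // Elements, timestamps, and windows pass through unchanged; the runner
    // only verifies that the windowing strategies of a and b are compatible.
    PCollection<String> merged =
        PCollectionList.of(a).and(b).apply(Flatten.pCollections());

    p.run().waitUntilFinish();
  }
}
```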
Implementing Window
■ Window is just a grouping key with a maximum timestamp
■ One element can be conceptually in one window only. If you need to assign an element to
multiple windows, it counts as multiple elements from Beam’s point of view.
■ The runner may choose to use a physical representation where one element appears to be
assigned to multiple windows for storage efficiency, but it maps conceptually to multiple
elements
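For example, with sliding windows a single element conceptually becomes several elements, one per window; a minimal sketch:

```java
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class WindowExample {
  static PCollection<String> applySlidingWindows(PCollection<String> input) {
    // Each element falls into 10 overlapping windows, so it conceptually
    // becomes 10 elements, even if a runner physically stores it only once.
    return input.apply(Window.into(
        SlidingWindows.of(Duration.standardMinutes(10))
            .every(Duration.standardMinutes(1))));
  }
}
```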
Implementing GroupByKey
■ GroupByKey groups a PCollection of key-value pairs by Key and Window
■ GroupByKey will emit results only when window triggers allow it, and should automatically drop
expired elements
■ Since GroupByKey is closely related to windows, it needs to be able to merge elements by
window when requested, for example to support session windows
■ GroupByKey needs to choose the timestamp to emit with the results
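A short sketch putting these points together: grouping under fixed windows, with a TimestampCombiner choosing the timestamp the results carry:

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.TimestampCombiner;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class GroupByKeyExample {
  static PCollection<KV<String, Iterable<Long>>> groupPerMinute(
      PCollection<KV<String, Long>> input) {
    return input
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Emit each group with the end-of-window timestamp.
            .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
        .apply(GroupByKey.create());
  }
}
```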
Implementing ParDo
■ Conceptually simple:
■ Setup is called once per instance of the ParDo
■ The runner decides on the bundle size (some runners allow user control)
■ It calls startBundle once per bundle
■ It calls processElement once per element
■ If we are using timely processing, it calls onTimer for each timer activation
■ It finishes by calling finishBundle
■ If an element fails, the whole bundle is retried
■ Teardown is called to release ParDo resources
■ Under the hood, the runner needs to take into account that ParDos can be stateful and can have side
inputs. In those cases the runner is responsible for keeping and propagating state, and for
materialising the side inputs
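A sketch of a DoFn exposing this lifecycle (the timer case is omitted, since @OnTimer additionally needs a @TimerId declaration):

```java
import org.apache.beam.sdk.transforms.DoFn;

class LifecycleFn extends DoFn<String, String> {
  private transient StringBuilder buffer;  // stands in for an expensive resource

  @Setup
  public void setup() {
    buffer = new StringBuilder();  // once per DoFn instance, e.g. open a client
  }

  @StartBundle
  public void startBundle() {
    buffer.setLength(0);  // once per bundle, e.g. reset state/buffers
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().toUpperCase());  // once per element; a failure retries the bundle
  }

  @FinishBundle
  public void finishBundle() {
    // once per bundle, e.g. flush buffered writes
  }

  @Teardown
  public void teardown() {
    buffer = null;  // release resources, e.g. close the client
  }
}
```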
Optimising the DAG execution
■ Two levels of optimisation
■ Execution plan (Supported by BEAM)
■ Intermediate data materialisation (Depends on the Runner)
Optimising the Execution Plan
■ Fusion and combine aggregation are core concepts behind the Dataflow/BEAM model
■ The Java SDK provides core helpers to deal with this
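A conceptual sketch of fusion, deliberately not using the Beam API: two consecutive ParDos are executed as a single pass, so the intermediate PCollection is never materialised or shuffled:

```java
import java.util.function.Function;

class FusionSketch {
  // The runner rewrites parDo1 -> parDo2 into one stage that applies both
  // functions per element, instead of writing out the intermediate data.
  static <A, B, C> Function<A, C> fuse(Function<A, B> parDo1, Function<B, C> parDo2) {
    return parDo1.andThen(parDo2);
  }
}
```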
Intermediate data materialisation
■ What to do if one transform fails downstream and we need to reprocess the data?
■ With some sources (like a static file) we could potentially replay the whole data and retry. Not
deterministic or fast, but it would work
■ With unbounded sources (or with bounded sources with changing data), we might not be
able to replay the whole data, so we need to have some way of
“materialising/checkpointing/snapshotting” the data
❏ Flink and IBM Streams: distributed snapshots
❏ Samza: incremental checkpoints
Transforms called more than once
SDK independent runners
■ Recap: Executing a user's pipeline can be broken up into the following categories:
■ Perform any grouping and triggering logic including keeping track of a watermark, window merging,
propagating status...
■ Pipeline execution management such as polling for counter information, job status, …
■ Execute user-definable functions (UDFs), including custom sources/sinks and regular DoFns...
■ Executing UDFs is the only category that requires a language-specific SDK context to execute in.
Moving the execution of UDFs to language-specific SDK harnesses and using an RPC service between the
two allows for a cross-language, cross-runner portability story.
SDK independent runners: RunnerAPI & FnAPI
■ The harness is a Docker container able to run the language-specific parts of the pipeline. The
Runner is responsible for launching and managing the container. Communication between
Runner and Harness is via the FnAPI, implemented over gRPC
SDK independent runners: FnAPI
▪ Why do I care?
▪ Pipeline basics
▪ Runner Overview
▪ Exploring the graph
▪ Implementing PCollections
▪ Watermark Propagation
▪ Implementing PTransforms: Read, ParDo, GroupByKey,
Window, and Flatten
▪ Optimising the execution Plan
▪ Persisting state (coders and snapshots)
▪ FnAPI and RunnerAPI for SDK independence
Cheers
Javier Ramirez (@supercoco9)
Head of Engineering
@teamdatatonic