TPL Dataflow Pipelines Usage and Best Practices

Pipeline.
TPL Dataflow.
Usage.

by Alexey Kursov
http://www.linkedin.com/in/kursov

TPL Dataflow
The Task Parallel Library (TPL) provides dataflow components to help increase the
robustness of concurrency-enabled applications. These dataflow components are
collectively referred to as the TPL Dataflow Library. The dataflow model provides in-process
message passing for coarse-grained dataflow and pipelining tasks.

Pipeline basics
In software engineering, a pipeline consists of a chain of processing elements
(processes, threads, coroutines, etc.), arranged so that the output of each element is
the input of the next. Usually some amount of buffering is provided between
consecutive elements. The information that flows in these pipelines is often a stream
of records, bytes or bits.

The concept is also called the pipes and filters design pattern. It was named by
analogy to a physical pipeline.

Pipeline basics
A linear pipeline is a series of processing stages arranged linearly to perform a
specific function over a data stream. Linear pipelines are typically used for
instruction execution, arithmetic computation and memory access.

A non-linear pipeline (also called a dynamic pipeline) can be configured to perform
various functions at different times. A dynamic pipeline also has feed-forward
or feedback connections. A non-linear pipeline also allows very long instruction words.

Dataflow programming
Dataflow programming is a programming paradigm that models a program as a directed
graph of the data flowing between operations, thus implementing dataflow principles
and architecture.
● emphasizes the movement of data
● models a program as a series of connections
● connects operations through explicitly defined inputs and outputs

Popular in

● parallel computing frameworks
● database engine designs
● digital signal processing
● network routing
● graphics processing

Usage
In Unix-like computer operating systems, a pipeline is the original software pipeline:
a set of processes chained by their standard streams, so that the output of each
process (stdout) feeds directly as input (stdin) to the next one. Each connection is
implemented by an anonymous pipe. Filter programs are often used in this
configuration.
The concept was invented by Douglas McIlroy for Unix shells and was named by analogy
to a physical pipeline.
Abstract and concrete examples:
% program1 | program2 | program3
% ls | grep xxx

Usage
Cascading is a Java application framework that enables typical developers to
quickly and easily develop rich Data Analytics and Data Management applications
that can be deployed and managed across a variety of computing environments.
Cascading works seamlessly with Apache Hadoop and API compatible distributions.
It follows a ‘source-pipe-sink’ paradigm, in which data is captured from sources and flows
through reusable ‘pipes’ that perform data analysis processes, and the results are stored in
output files or ‘sinks’.

Usage
Apache Crunch (Simple and Efficient MapReduce Pipelines by Cloudera)
The Apache Crunch Java library provides a framework for writing, testing, and
running MapReduce pipelines. Its goal is to make pipelines that are composed of
many user-defined functions simple to write, easy to test, and efficient to run.

Storm
Storm is a distributed realtime computation system. Similar to how Hadoop provides
a set of general primitives for doing batch processing, Storm provides a set of
general primitives for doing realtime computation. Storm is simple and can be used with
any programming language.

TPL Dataflow
The Task Parallel Library (TPL) provides dataflow components to help increase the
robustness of concurrency-enabled applications. These dataflow components are
collectively referred to as the TPL Dataflow Library.

[Figure: layer diagram, top to bottom: Data Flow; Tasks and coordination data structures;
Task Parallel Library; Threads]

What does it provide?
● provides a foundation for message passing and for parallelizing CPU-intensive and
I/O-intensive applications
● gives you explicit control over how data is buffered and moves around the system
● improves responsiveness and throughput by efficiently managing the underlying threads
● lets you easily create a mesh through which your data flows
● meshes can split and join the data flows, and can even contain data flow loops
● lets you create custom blocks and extend functionality

Types of blocks
Dataflow blocks are data structures that buffer and process data.

1. source blocks (act as a source of data): ISourceBlock<TOutput>
2. target blocks (act as a receiver of data): ITargetBlock<TInput>
3. propagator blocks (act as both a source block and a target block):
IPropagatorBlock<TInput, TOutput>

Buffering blocks
● BufferBlock<T> - stores a first in, first out (FIFO) queue of messages that can be written to by
multiple sources or read from by multiple targets. When a target receives a message from a
BufferBlock<T>, that message is removed from the queue, so each message goes to only one target.
Diagram elements: input, output (original)

● BroadcastBlock<T> - broadcasts a message to multiple components. It holds the current value and
offers it to every linked target.
Diagram elements: input, current value, Task, output (originals or copies)

● WriteOnceBlock<T> - resembles the BroadcastBlock<T> class, except that a WriteOnceBlock<T> object
can be written to one time only.
Diagram elements: input, first written value (read-only), Task, output (originals or copies)
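
The difference between the three buffering blocks is easiest to see by reading from each one. The following is a minimal sketch, not a reference implementation: it assumes C# 7 or later and the System.Threading.Tasks.Dataflow NuGet package, and the class and method names (BufferingBlocksDemo, ReadFromBuffer, ReadFromBroadcast, WriteTwice) are hypothetical.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class; only the block types come from TPL Dataflow.
public static class BufferingBlocksDemo
{
    // BufferBlock<T>: a FIFO queue; receiving a message removes it.
    public static int[] ReadFromBuffer()
    {
        var buffer = new BufferBlock<int>();
        buffer.Post(1);
        buffer.Post(2);
        buffer.Post(3);
        // Receive() blocks until a message is available.
        return new[] { buffer.Receive(), buffer.Receive(), buffer.Receive() };
    }

    // BroadcastBlock<T>: holds the current value; receiving does not remove it.
    public static int[] ReadFromBroadcast()
    {
        // The delegate is the cloning function; identity is enough for an int.
        var broadcast = new BroadcastBlock<int>(value => value);
        broadcast.Post(42);
        return new[] { broadcast.Receive(), broadcast.Receive() };
    }

    // WriteOnceBlock<T>: accepts the first value and declines every later one.
    public static (bool first, bool second, int stored) WriteTwice()
    {
        var once = new WriteOnceBlock<int>(value => value);
        bool first = once.Post(1);   // accepted
        bool second = once.Post(2);  // declined, the block already has a value
        return (first, second, once.Receive());
    }
}
```

Only a single value is posted to the BroadcastBlock<T> here on purpose: with several values in flight, which one a reader sees depends on timing, because the block keeps overwriting its current value.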

Execution blocks
● ActionBlock<TInput> - a target block that calls a delegate when it receives data.
Diagram elements: input, Task

● TransformBlock<TInput, TOutput> - acts as both a source and a target; the delegate that you
pass must return a value of type TOutput.
Diagram elements: input, Task, output

● TransformManyBlock<TInput, TOutput> - resembles TransformBlock<TInput, TOutput>, except that it
produces zero or more output values for each input value, instead of only one output value for
each input value.
Diagram elements: input, Task, output
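
The three execution blocks chain naturally into a small pipeline. The sketch below is one possible arrangement under stated assumptions: the class name ExecutionBlocksDemo, the method RunAsync and the word-splitting task are made up for illustration; the block types, LinkTo, PropagateCompletion, Complete and Completion are part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class: trims lines, splits them into words, collects the words.
public static class ExecutionBlocksDemo
{
    public static async Task<List<string>> RunAsync(IEnumerable<string> lines)
    {
        var words = new List<string>();
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };

        // TransformBlock: exactly one output per input.
        var trim = new TransformBlock<string, string>(line => line.Trim());

        // TransformManyBlock: zero or more outputs per input.
        var split = new TransformManyBlock<string, string>(
            line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));

        // ActionBlock: end of the pipeline. It processes one message at a time
        // by default, so the plain List<string> is safe here.
        var collect = new ActionBlock<string>(word => words.Add(word));

        trim.LinkTo(split, propagate);
        split.LinkTo(collect, propagate);

        foreach (var line in lines) trim.Post(line);

        trim.Complete();           // no more input
        await collect.Completion;  // completion flows through the links
        return words;
    }
}
```

Calling `await ExecutionBlocksDemo.RunAsync(new[] { "  hello world ", "tpl dataflow" })` yields the four words in input order, because the blocks preserve message order by default.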

Grouping blocks
● BatchBlock<T> - combines sets of input data, which are known as batches, into arrays of output data.
Diagram elements: input, Task, output

● JoinBlock<T1, T2> and JoinBlock<T1, T2, T3> - collect input elements and propagate
System.Tuple<T1, T2> or System.Tuple<T1, T2, T3> objects that contain those elements.
Diagram elements: input (T1), input (T2), Task, output

● BatchedJoinBlock<T1, T2> and BatchedJoinBlock<T1, T2, T3> - collect batches of input elements
and propagate System.Tuple<IList<T1>, IList<T2>> or System.Tuple<IList<T1>, IList<T2>, IList<T3>>
objects that contain those elements.
Diagram elements: input (T1), input (T2), Task, output
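
A short sketch of the two most common grouping blocks follows. The class GroupingBlocksDemo and its methods BatchAsync and JoinPair are hypothetical names chosen for this example; it assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class for BatchBlock<T> and JoinBlock<T1, T2>.
public static class GroupingBlocksDemo
{
    // Groups the items into arrays of at most batchSize elements.
    public static async Task<List<int[]>> BatchAsync(IEnumerable<int> items, int batchSize)
    {
        var batch = new BatchBlock<int>(batchSize);
        foreach (var item in items) batch.Post(item);

        // Completing the block also releases the last, possibly smaller, batch.
        batch.Complete();

        var batches = new List<int[]>();
        while (await batch.OutputAvailableAsync())
        {
            while (batch.TryReceive(out int[] next)) batches.Add(next);
        }
        return batches;
    }

    // Pairs one value from each target into a tuple.
    public static Tuple<int, string> JoinPair(int number, string name)
    {
        var join = new JoinBlock<int, string>();
        join.Target1.Post(number);
        join.Target2.Post(name);
        return join.Receive(); // blocks until both inputs have arrived
    }
}
```

With seven items and a batch size of three, BatchAsync returns two full batches followed by a final batch of one element.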

LinkTo and Predicate
Link/Unlink
The ISourceBlock<TOutput>.LinkTo method links a source dataflow block to a target block and returns an
IDisposable. To unlink the blocks, call Dispose on the object returned by LinkTo. The predefined
dataflow block types handle all thread-safety aspects of linking and unlinking. The source is also unlinked
automatically after the target has received a set number of messages, if you pass a DataflowLinkOptions
with MaxMessages set to a positive value (the default, -1, means unlimited).
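
Both ways of unlinking can be sketched in a few lines. This is an illustrative example only: the class LinkingDemo and its methods are hypothetical, and two BufferBlock<int> instances stand in for whatever source and target a real pipeline would use.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class showing manual and automatic unlinking.
public static class LinkingDemo
{
    // Dispose the link before posting: the message never reaches the target.
    public static int PostAfterUnlink()
    {
        var source = new BufferBlock<int>();
        var target = new BufferBlock<int>();

        IDisposable link = source.LinkTo(target);
        link.Dispose();            // unlink

        source.Post(5);
        return source.Receive();   // 5 is still held by the source
    }

    // MaxMessages = 1: the link removes itself after one delivered message.
    public static int[] LinkForOneMessage()
    {
        var source = new BufferBlock<int>();
        var target = new BufferBlock<int>();
        source.LinkTo(target, new DataflowLinkOptions { MaxMessages = 1 });

        source.Post(1);
        source.Post(2);

        int delivered = target.Receive();   // 1 travelled over the link
        int leftBehind = source.Receive();  // 2 stayed in the source
        return new[] { delivered, leftBehind };
    }
}
```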

Predicate
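When you link a target block you can supply a predicate that checks each message before it is offered to
the target's input buffer. The predicate is a delegate that receives a message of the target block's TInput
type and returns a bool value; you pass it as an argument to a LinkTo overload, alongside any
DataflowLinkOptions. With most source blocks, a message that no linked target accepts stays in the source
and holds back the messages behind it, so add a link that accepts everything else.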
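
A predicate link can be sketched as a simple router. The example below is an assumption-laden illustration: the class PredicateLinkDemo, the method RouteAsync and the even/odd split are invented for this sketch; the LinkTo overload that takes DataflowLinkOptions and a Predicate<T> is part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class: routes even numbers to one sink and the rest to another.
public static class PredicateLinkDemo
{
    public static async Task<(List<int> evens, List<int> odds)> RouteAsync(IEnumerable<int> numbers)
    {
        var evens = new List<int>();
        var odds = new List<int>();

        var source = new BufferBlock<int>();
        var evenSink = new ActionBlock<int>(n => evens.Add(n));
        var oddSink = new ActionBlock<int>(n => odds.Add(n));
        var options = new DataflowLinkOptions { PropagateCompletion = true };

        // Links are tried in the order they were created.
        source.LinkTo(evenSink, options, n => n % 2 == 0); // filtered link
        source.LinkTo(oddSink, options);                   // accepts everything else

        foreach (var n in numbers) source.Post(n);
        source.Complete();

        await Task.WhenAll(evenSink.Completion, oddSink.Completion);
        return (evens, odds);
    }
}
```

The second, unfiltered link is what keeps the source from stalling on a message the predicate rejects. When rejected messages should simply be dropped, a link to `DataflowBlock.NullTarget<int>()` serves the same purpose.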

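Other options
You can specify:
● degree of parallelism for a block (MaxDegreeOfParallelism)
● maximum number of messages that may be buffered by the block (BoundedCapacity)
● task scheduler (TaskScheduler)
● maximum number of messages processed per task (MaxMessagesPerTask)
● cancellation (CancellationToken)
● greedy behavior (Greedy, for grouping blocks)
● completion (for example, PropagateCompletion on DataflowLinkOptions)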
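
Most of these options are set through an options object passed to the block's constructor. The following sketch shows one way to combine them; the class BlockOptionsDemo, the method SumOfSquaresAsync and the specific values (4, 8, 16) are arbitrary assumptions, not recommended settings.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class: sums squares in parallel with a bounded input buffer.
public static class BlockOptionsDemo
{
    public static async Task<int> SumOfSquaresAsync(
        IEnumerable<int> numbers,
        CancellationToken token = default(CancellationToken))
    {
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4,   // up to four messages processed at once
            BoundedCapacity = 8,          // at most eight messages held by the block
            MaxMessagesPerTask = 16,      // a task handles at most 16 messages before yielding
            TaskScheduler = TaskScheduler.Default,
            CancellationToken = token
        };

        int sum = 0;
        // The delegate may run on several threads, so the update must be atomic.
        var square = new ActionBlock<int>(
            n => { Interlocked.Add(ref sum, n * n); }, options);

        // SendAsync waits while the block is full; Post would return false instead.
        foreach (var n in numbers) await square.SendAsync(n, token);

        square.Complete();
        await square.Completion;
        return sum;
    }
}
```

Greedy behavior is configured separately, through GroupingDataflowBlockOptions on blocks such as BatchBlock<T> and JoinBlock<T1, T2>.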

Recommendations
Recommendations for building TPL Dataflow pipelines:

● make each block do one thing well
● design for composition
● be stateless where you can
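
One way to follow these recommendations is to build small, stateless blocks and wrap them behind a single propagator with DataflowBlock.Encapsulate. The sketch below is only an illustration: CompositionDemo, CreateWordCounter and the word-counting task are hypothetical.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class: two single-purpose blocks exposed as one block.
public static class CompositionDemo
{
    public static IPropagatorBlock<string, int> CreateWordCounter()
    {
        // Each block does one thing and keeps no state between messages.
        var split = new TransformBlock<string, string[]>(
            line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        var count = new TransformBlock<string[], int>(words => words.Length);

        split.LinkTo(count, new DataflowLinkOptions { PropagateCompletion = true });

        // Callers see one block: input goes to split, output comes from count.
        return DataflowBlock.Encapsulate<string, int>(split, count);
    }
}
```

A caller can then treat the result like any other block, for example `counter.Post("make each block do one thing")` followed by `counter.Receive()`, or link it into a larger mesh with LinkTo.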

Use cases
1. Prototyping pipelines for use in more complex systems
2. Development of flexible asynchronous applications that process some data, like:
○ Image processors
○ Sound processors
○ Pipelines in mobile phone apps
○ Data analysis/mining services
○ Web-crawlers
○ etc.
3. Study pipeline based development

Useful links

● http://www.nuget.org/packages/Microsoft.Tpl.Dataflow/
● http://msdn.microsoft.com/en-us/library/hh228603.aspx
● http://blogs.microsoft.co.il/blogs/bnaya/archive/2012/01/28/tpl-dataflow-walkthrough-part-5.aspx
● http://www.cascading.org/
● http://crunch.apache.org/
● http://storm-project.net/
