Apache Flink is an open-source streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams. Flink is a top-level Apache project. It is a scalable data analytics framework that is compatible with Hadoop, and it can execute both stream processing and batch processing.
2. Lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
Punctuality
Respect KnolX session timings. Please do not join a session
more than 5 minutes after its start time.
Feedback
Make sure to submit constructive feedback for all sessions,
as it is very helpful for the presenter.
Mute
Stay on mute unless you have questions or concerns.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
3. Agenda
01 Big Data evolution
02 Introduction to Flink
03 Features of Flink
04 Architecture of Flink
05 Anatomy of a Flink program
06 Demo
4. Big Data Evolution
Problems with Big Data:
● Storing huge and exponentially growing datasets.
● Processing huge datasets that have a complex structure.
● The 3 V's of Big Data: Volume, Variety, Velocity
5. Big Data Evolution (continued)
● In the early 2000s, the Big Data era began with multiple frameworks, each
focusing on a specific Big Data problem.
6. Big Data Evolution (continued)
● A unified platform that can, on its own, handle various Big Data problems:
➢ Batch processing
➢ Stream processing
➢ Graph processing
➢ Iterative processing
● A unified platform must have the following characteristics to solve Big
Data problems:
➢ Distributed/parallel computation
➢ Fault tolerance
➢ Ease of use (developer-friendly APIs)
➢ Powerful predefined operators/functions (such as join and filter)
➢ Fast
7. Apache Spark (3G Big Data Framework)
● Spark is a fast cluster computing engine that is claimed to run some
in-memory applications up to 100 times faster than Hadoop MapReduce.
● Apache Spark is best known for its in-memory computing capabilities that
deliver high-speed processing.
➢ Problem
● Processes data streams in micro-batches rather than in real time.
● High throughput but medium latency in some use cases.
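The latency cost of micro-batching can be illustrated with a toy model. This is only a sketch under simplifying assumptions: processing time is ignored, results are assumed to be emitted exactly at batch boundaries, and the function names and arrival times are hypothetical. It does not model Spark's actual scheduler.

```python
def micro_batch_latencies(arrival_times, batch_interval):
    """Latency per event when output is produced only at batch boundaries."""
    latencies = []
    for t in arrival_times:
        # An event waits until the end of the batch window it falls into.
        batch_end = (t // batch_interval + 1) * batch_interval
        latencies.append(batch_end - t)
    return latencies


def per_event_latencies(arrival_times, processing_time=0):
    """Latency per event when each event is handled as soon as it arrives."""
    return [processing_time for _ in arrival_times]


# Hypothetical arrival times in milliseconds.
arrivals = [0, 1, 4, 5, 9]
print(micro_batch_latencies(arrivals, batch_interval=5))  # [5, 4, 1, 5, 1]
print(per_event_latencies(arrivals))                      # [0, 0, 0, 0, 0]
```

In this model an event can wait up to one full batch interval before its result appears, which is the "medium latency" referred to above.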
8. Introduction to Flink
● Apache Flink is a Big Data framework and distributed processing engine for
stateful computations over unbounded and bounded data streams.
● Flink is based on the streaming-first principle, which means it is a true
stream processing engine. Flink treats batch processing as a special case of
streaming.
● Flink has been designed to run in all common cluster environments, perform
computations at in-memory speed and at any scale.
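The idea that batch is a special case of streaming can be sketched in plain Python, without Flink. The operator below is written once against an iterator and is then applied to a bounded input (a list) and to an unbounded one (an endless counter). The names `running_sum`, `bounded`, and `unbounded` are illustrative only and are not part of any Flink API.

```python
import itertools


def running_sum(stream):
    """A streaming operator: emits the running total after each element."""
    total = 0
    for value in stream:
        total += value
        yield total


# Bounded input ("batch"): the stream ends, so all results can be collected.
bounded = [1, 2, 3]
batch_result = list(running_sum(bounded))

# Unbounded input ("streaming"): the source never ends, so we only look at
# the first few results instead of waiting for completion.
unbounded = itertools.count(1)  # 1, 2, 3, ...
first_three = list(itertools.islice(running_sum(unbounded), 3))

print(batch_result)  # [1, 3, 6]
print(first_three)   # [1, 3, 6]
```

The same operator code serves both cases; the only difference is whether the input ever ends.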
10. ➢ A Flink application may consume real-time data from streaming sources such as
message queues or distributed logs, like Apache Kafka or Kinesis.
➢ Flink can also consume bounded, historic data from a variety of data sources.
➢ The streams of results being produced by a Flink application can be sent to a wide
variety of systems that can be connected as sinks
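The source, transformation, and sink roles described above can be sketched with plain Python generators. This is a conceptual illustration only: `source`, `to_upper`, `ListSink`, and `run` are hypothetical names, and a real Flink job would use Flink's own connectors rather than in-memory lists.

```python
def source(records):
    """Stands in for a streaming source such as a message queue."""
    for record in records:
        yield record


def to_upper(stream):
    """A simple transformation applied to each record as it flows through."""
    for record in stream:
        yield record.upper()


class ListSink:
    """Stands in for an external system that receives the results."""

    def __init__(self):
        self.collected = []

    def write(self, record):
        self.collected.append(record)


def run(stream, sink):
    """Pulls records through the pipeline and hands each one to the sink."""
    for record in stream:
        sink.write(record)


sink = ListSink()
run(to_upper(source(["order-1", "order-2"])), sink)
print(sink.collected)  # ['ORDER-1', 'ORDER-2']
```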
11. ➢ Programs in Flink are inherently parallel and distributed.
➢ During execution, a stream has one or more stream partitions, and each
operator has one or more operator subtasks.
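A rough way to picture stream partitions and operator subtasks is shown below. The sketch splits one stream into partitions and lets each "subtask" apply the same operator to its own partition. It runs sequentially and uses a simple modulo rule, both of which are simplifications; `partition` and `double_subtask` are hypothetical names, not Flink functions.

```python
def partition(events, parallelism, key_fn):
    """Splits a stream into `parallelism` partitions based on a key."""
    partitions = [[] for _ in range(parallelism)]
    for event in events:
        partitions[key_fn(event) % parallelism].append(event)
    return partitions


def double_subtask(stream_partition):
    """One parallel instance (subtask) of a map operator."""
    return [event * 2 for event in stream_partition]


events = list(range(6))
partitions = partition(events, parallelism=2, key_fn=lambda e: e)

# Each subtask handles exactly one partition, independently of the others.
results = [double_subtask(p) for p in partitions]

print(partitions)  # [[0, 2, 4], [1, 3, 5]]
print(results)     # [[0, 4, 8], [2, 6, 10]]
```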
12. ➢ Flink facilitates stateful operations.
➢ The handling of the current event can depend on the accumulated effect of all
the events that came before it.
➢ The set of parallel instances of a stateful operator is effectively a sharded
key-value store. Each parallel instance is responsible for handling events for a
specific group of keys, and the state for those keys is kept locally.
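The "sharded key-value store" view of a stateful operator can be sketched as follows. Each parallel instance keeps a local dictionary, and every key is always routed to the same instance. The CRC32-modulo routing and the names `CountSubtask` and `subtask_index` are assumptions made for this illustration; Flink's actual key-to-subtask assignment works differently.

```python
import zlib


class CountSubtask:
    """One parallel instance of a stateful operator with local keyed state."""

    def __init__(self):
        self.state = {}  # key -> number of events seen so far for that key

    def process(self, key):
        # The result depends on all earlier events with the same key.
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]


def subtask_index(key, parallelism):
    """Stable routing of a key to an instance (stand-in for Flink's scheme)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


parallelism = 2
subtasks = [CountSubtask() for _ in range(parallelism)]

outputs = []
for key in ["a", "b", "a", "c", "a"]:
    owner = subtasks[subtask_index(key, parallelism)]
    outputs.append(owner.process(key))

print(outputs)  # [('a', 1), ('b', 1), ('a', 2), ('c', 1), ('a', 3)]
```

Because a key never moves between instances, no instance needs to read another instance's state to compute its result.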
13. Flink Architecture
➢ Flink 1.x's architecture consists of various components, such as deployment,
core processing, and APIs.
➢ Flink has a layered architecture and each component is a part of a
specific layer.
➢ Each layer is built on top of the layer below it for clear abstraction.
14. Flink's Distributed Execution
➢ Flink is based on a master-slave architecture.
➢ Several processes take part in a Flink program's execution, namely the
Job Manager, Task Manager, and Job Client.