ApacheCon 2020 - Flink SQL in 2020: Time to show off!

© 2020 Ververica
Timo Walther
@twalthr
Flink SQL in 2020

About me
● Apache Flink Committer and PMC Member
● Working on Flink since before it became part of the Apache Software Foundation
● Software Engineer at Ververica
(formerly data Artisans; acquired by Alibaba in 2019)
● Part of the SDK Team, focused on Table / SQL API and Ecosystem

Apache Flink is a Distributed Data Processing System
Stateful computations over streams
real-time and historic
fast, scalable, fault tolerant,
event time, large state, exactly-once.

Scalable and Consistent Data Processing
● Flexible and expressive APIs
● Guaranteed correctness
○ Exactly-once state consistency
○ Event-time semantics
● Processing at massive scale
○ Runs on 10,000s of cores
○ Manages 10s of TBs of state, either in memory or on disk

Powered By Apache Flink
Details about these use cases and more users are listed on Flink’s website at https://flink.apache.org/poweredby.html
Also check out the Flink Forward YouTube channel, with more than 350 recorded talks, at https://www.youtube.com/channel/UCY8_lgiZLZErZPF47a2hXMA

Flink SQL in a Nutshell
A standard-compliant SQL service
to query static and streaming data alike
that leverages the performance, scalability, and consistency
of Apache Flink.

Refreshing Streaming SQL Semantics
● Basically all tables that are processed with SQL queries change over time
○ Transactions from applications
○ Bulk inserts from ETL processes
○ …
● Traditional processors run SQL queries on static snapshots of the tables
○ The query input is finite → the result is also finite and definitive
● Stream SQL processors run continuous queries on changing (dynamic) tables
○ The query input is unbounded → the result is potentially unbounded and continuously updated
● The semantics of a query are the same for both a snapshot and a continuously changing table!

Running a One-time Query on a Static Table Snapshot
Query:
SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Input table clicks:
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…

Result:
user  cnt
Mary  2
Bob   1

● Take a snapshot when the query starts
● A row that was added after the query was started is not considered
● A final result is produced
● The query terminates

Running a Continuous Query on a Changing Table
Query:
SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Input table clicks:
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…

Result:
user  cnt
Mary  2  (updated from 1)
Bob   1
Liz   1

● Ingest all changes as they happen
● Continuously update the result
● The result is identical to the one-time query (at this point)

Why is Stream-Batch Unification Important?
● Usability
○ ANSI SQL syntax: No custom “StreamSQL” syntax.
○ ANSI SQL semantics: No stream-specific result semantics.
● Portability
○ Run the same query on bounded & unbounded data
○ Run the same query on recorded & real-time data
○ Bootstrapping query state or backfilling results from historic data
[Figure: timeline from past to future, marking "start of the stream" and "now", with bounded queries and unbounded queries placed along it]
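As a rough sketch of what this portability can look like in a SQL Client session, the statements below run the same aggregation first as a bounded and then as an unbounded query. This assumes a table `clicks` is already registered on a source that can be read in both modes, and that the Flink 1.11 SQL Client property `execution.type` is used (later Flink releases name this option `execution.runtime-mode`). `user` is quoted with backticks because it is a reserved keyword in Flink SQL.

```sql
-- Bounded execution: read the data that is available now,
-- produce a final result, and terminate.
SET execution.type=batch;

SELECT `user`, COUNT(url) AS cnt
FROM clicks
GROUP BY `user`;

-- Unbounded execution: the query text is unchanged,
-- but the result is continuously updated as new rows arrive.
SET execution.type=streaming;

SELECT `user`, COUNT(url) AS cnt
FROM clicks
GROUP BY `user`;
```

Only the execution setting differs between the two runs; the query itself stays the same.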

What about Time? Aren't we in the Streaming Space?
● Proper time handling is very important in many continuous queries
○ Group or join rows that are temporally related
○ Semantics are the same if a query runs on a snapshot
● Tracking progress in time enables efficient execution of continuous queries
○ Determine when input of a computation is complete
○ Determine when rows are no longer needed and clean up state
○ Periodically trigger computations and result updates
● Flink SQL supports sophisticated event-time handling with watermarks
● These are streaming optimizations; they do not change the semantics of standard SQL queries!
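To make the role of watermarks more concrete, here is a minimal sketch of an event-time table and a windowed aggregation in Flink 1.11 syntax. The topic name, consumer group, broker address, JSON format, column types, the 5-second out-of-orderness bound, and the 1-hour window size are all assumptions chosen for illustration; only the column names `user`, `cTime`, and `url` come from the slides.

```sql
CREATE TABLE clicks (
  `user` STRING,          -- quoted: USER is a reserved keyword in Flink SQL
  cTime  TIMESTAMP(3),
  url    STRING,
  -- Declares cTime as the event-time attribute and tolerates rows
  -- that arrive up to 5 seconds out of order (assumed bound).
  WATERMARK FOR cTime AS cTime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clicks',                                  -- assumed topic name
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'clicks-demo',               -- assumed consumer group
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Clicks per user and hour. Once the watermark passes the end of a window,
-- the window's input is complete: its result is emitted and its state dropped.
SELECT
  `user`,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS window_end,
  COUNT(url) AS cnt
FROM clicks
GROUP BY `user`, TUMBLE(cTime, INTERVAL '1' HOUR);
```

The query is plain SQL; the watermark declaration only tells the engine when a window can be finalized and cleaned up.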

What Will You See in This Demo?
● Read and write data from and to different storage systems
○ Apache Kafka
○ MySQL (via a generic JDBC connector)
○ S3-compatible storage
● Manage catalog metadata
○ Create (alter and drop) tables and views with DDL statements
○ Persistently store catalog metadata in Apache Hive Metastore
● Show how Flink unifies batch and stream processing with SQL
○ Demonstrate different ways to join dynamic tables
● Maintain the results of continuous queries in Kafka and MySQL
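The following sketch shows one way a continuous query result could be maintained in MySQL through the generic JDBC connector. It borrows column names from the demo's Orders table; the column types, topic, consumer group, database URL, credentials, and the sink table `priority_counts` are hypothetical and not taken from the demo itself.

```sql
-- Source: orders arriving on a Kafka topic (types and options are assumptions).
CREATE TABLE orders (
  o_orderkey      BIGINT,
  o_ordertime     TIMESTAMP(3),
  o_custkey       BIGINT,
  o_orderpriority STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'orders-demo',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Sink: a MySQL table written via JDBC. The primary key lets the
-- connector upsert rows instead of appending duplicates.
CREATE TABLE priority_counts (
  o_orderpriority STRING,
  order_cnt       BIGINT,
  PRIMARY KEY (o_orderpriority) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/demo',   -- hypothetical database
  'table-name' = 'priority_counts',
  'username' = 'demo_user',                     -- hypothetical credentials
  'password' = 'demo_password'
);

-- Continuous query: each new order updates the count for its priority in MySQL.
INSERT INTO priority_counts
SELECT o_orderpriority, COUNT(*) AS order_cnt
FROM orders
GROUP BY o_orderpriority;
```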

Our Demo Environment
[Figure: architecture of the demo environment]
Components: SQL Client, JobManager, TaskManager, Data Provider, MetaStore, S3-compatible Storage
Interactions shown: Submit query; Assign & monitor query tasks; Execute query tasks; Coordinate; Push events; Manage & lookup catalog metadata; Read & write data; Query data

Our Demo Scenario - An Order System (derived from TPC-H)
[Figure: entity-relationship diagram; tables are linked by 1:n relationships and marked as either frequently updated or seldom updated]
Orders: o_orderkey, o_ordertime, o_custkey, o_orderpriority, ...
Lineitem: l_orderkey, l_linenumber, l_ordertime, l_proctime, l_currency, l_extendedprice, ...
Rates: rs_symbol, rs_timestamp, rs_rate
Rates History: rs_symbol, rs_timestamp, rs_rate
Customer: c_custkey, c_name, c_nationkey, ...
Nation: n_nationkey, n_name, n_regionkey
Region: r_regionkey, r_name
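Using the column names of this schema, a join against an external lookup table could be sketched as follows. This is an illustration under assumptions, not the demo's actual definition: the column types, connector options, and the decision to back `rates` with a JDBC table are hypothetical, and `l_proctime` is assumed to be a processing-time attribute.

```sql
-- Stream of line items (types and options are assumptions).
CREATE TABLE lineitem (
  l_orderkey      BIGINT,
  l_linenumber    INT,
  l_currency      STRING,
  l_extendedprice DOUBLE,
  l_proctime AS PROCTIME()      -- processing-time attribute
) WITH (
  'connector' = 'kafka',
  'topic' = 'lineitem',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'lineitem-demo',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Current exchange rates, looked up in an external database.
CREATE TABLE rates (
  rs_symbol STRING,
  rs_rate   DOUBLE
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/demo',   -- hypothetical database
  'table-name' = 'rates',
  'username' = 'demo_user',                     -- hypothetical credentials
  'password' = 'demo_password'
);

-- Each line item is enriched with the rate that is valid
-- at the moment the row is processed.
SELECT
  l.l_orderkey,
  l.l_linenumber,
  l.l_extendedprice * r.rs_rate AS converted_price
FROM lineitem AS l
JOIN rates FOR SYSTEM_TIME AS OF l.l_proctime AS r
  ON r.rs_symbol = l.l_currency;
```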

SQL Feature Set in Flink 1.11
STREAMING ONLY
● OVER / WINDOW
○ UNBOUNDED + BOUNDED PRECEDING
● INNER JOIN with
○ Time-versioned table
○ External lookup table
● MATCH_RECOGNIZE
○ Pattern Matching/CEP (SQL:2016)
BATCH ONLY
● Full TPC-DS support
STREAMING & BATCH
● SELECT FROM WHERE
● GROUP BY [HAVING]
○ Non-windowed
○ TUMBLE, HOP, SESSION windows
● JOIN
○ Time-Windowed INNER + OUTER JOIN
○ Non-windowed INNER + OUTER JOIN
● User-Defined Functions
○ Scalar
○ Aggregation
○ Table-valued
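As one illustration from the streaming-only column, the sketch below uses an OVER window with a bounded PRECEDING range. It reuses column names from the demo's Lineitem table; the column types, connector options, the 5-second watermark delay, and the 1-hour range are assumptions made for the example.

```sql
CREATE TABLE lineitem (
  l_orderkey      BIGINT,
  l_currency      STRING,
  l_extendedprice DOUBLE,
  l_ordertime     TIMESTAMP(3),
  -- Event-time attribute; required for ordering the OVER window.
  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'lineitem',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'lineitem-demo',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- For every line item, the total price of all items in the same
-- currency during the preceding hour (including the current row).
SELECT
  l_orderkey,
  l_currency,
  l_extendedprice,
  SUM(l_extendedprice) OVER (
    PARTITION BY l_currency
    ORDER BY l_ordertime
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
  ) AS revenue_last_hour
FROM lineitem;
```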

SQL Feature Set in Flink 1.11
CREATE TABLE people (
  id BIGINT,
  name STRING,
  email STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'people',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);
● Changelog processing support (FLIP-95, FLIP-105)
○ New table source and sink interfaces
○ Deeper integration with connectors (interpret a Kafka topic as a changelog)
○ Change Data Capture (CDC) processing using the Debezium format
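A query over such a changelog table might look like the sketch below. It assumes the `people` table from this slide is registered; the grouping by email domain is only an example chosen for illustration.

```sql
-- Number of people per email domain. Because the topic is interpreted as a
-- changelog, inserts, updates, and deletes captured by Debezium all adjust
-- the counts, so the result mirrors the current state of the source table.
SELECT
  SUBSTRING(email FROM POSITION('@' IN email) + 1) AS email_domain,
  COUNT(*) AS num_people
FROM people
GROUP BY SUBSTRING(email FROM POSITION('@' IN email) + 1);
```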

Summary
● Flink SQL is evolving super fast!
● Flink SQL runs continuous queries at scale on static and dynamic data.
● Flink SQL connects to many systems in the data ecosystem.
● Flink can do a lot more
○ Python Table API & support for notebooks like Apache Zeppelin
○ Java/Scala DataStream API
○ Stateful Functions API
Go, check it out!
=> https://github.com/fhueske/flink-sql-demo
