The search and recommendation systems for Alibaba’s e-commerce platform use batch and streaming processing heavily. Flink SQL and the Table API (a SQL-like DSL) provide a simple, flexible, and powerful language for expressing data processing logic. More importantly, they open the door to unifying the semantics of batch and streaming jobs.
Blink is a project at Alibaba that improves Apache Flink to make it ready for large-scale production use. To support our products, we made many improvements to Flink SQL and the Table API in Blink. We added support for User-Defined Table Functions (UDTF), User-Defined Aggregates (UDAGG), window aggregates, retraction, and more. We are actively working with the Flink community to contribute these improvements back. In this talk, we will present the rationale, semantics, design, and implementation of these improvements. We will also share our experience of running large-scale Flink SQL and Table API jobs at Alibaba.
5. About Alibaba
Alibaba Group
• Operates the world’s largest e-commerce platform
• Recorded GMV of $485 billion in 2016 and $17.8 billion of GMV in a single day on Nov 11, 2016
Realtime Data Infrastructure
• Supports internal products such as search, recommendation, BI
• Also supports external customers through its cloud service
6. Blink – Alibaba’s version of Flink
Looked into Flink two years ago
• best choice of unified computing engine
• a few issues in Flink that can be problems for large scale applications
Started “Blink” project
• aimed to make Flink work reliably and efficiently at the very large scale at Alibaba
Made various improvements in Flink runtime
Enhanced Flink SQL & Table API to be production ready
Working with Flink community to contribute back since last August
• several key improvements
• hundreds of patches
7. Blink Ecosystem in Alibaba
[Figure: Blink ecosystem stack, top to bottom]
• Products: Search, Recommendation, BI, Security, Ads
• Platforms: Machine Learning Platform, StreamCompute Platform
• Blink: SQL & Table API, DataStream API, DataSet API, Runtime Engine
• Cluster Resource Management (Yarn/Fuxi)
• Storage (HDFS/Pangu)
9. Why SQL & Table API
Unified batch and streaming
• Flink currently offers DataSet API for batch and DataStream API for streaming
• We want a single API that can run in both batch and streaming mode
Improved development efficiency
• Users only describe the semantics of data processing
• Leave hard optimization problems to the system
• SQL is proven to be good at describing data processing
• Table API offers seamless integration with Scala and Java
• Table API makes it easy to extend standard SQL when necessary
11. Dynamic Tables
Apply Changelog Stream to Dynamic Table
• Append Mode: each stream record is an insert to the dynamic table. Hence, all records of a stream are appended to the dynamic table
• Update Mode: a stream record can represent an insert, update, or delete modification on the dynamic table (append mode is in fact a special case of update mode)
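The two modes can be illustrated with a minimal plain-Python sketch. It is not Flink code: the function names `apply_append` and `apply_update`, the `(op, key, row)` record layout, and the sample rows are assumptions made for illustration only.

```python
def apply_append(table, record):
    """Append mode: every stream record is an insert."""
    table.append(record)


def apply_update(table, change):
    """Update mode: a record is an insert, update, or delete on a keyed table."""
    op, key, row = change
    if op in ("insert", "update"):
        table[key] = row
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


# Append mode: the dynamic table is the growing list of all records.
append_table = []
for rec in [("Latte", 6), ("Mocha", 8)]:
    apply_append(append_table, rec)

# Update mode: the dynamic table is keyed, so later records can modify rows.
update_table = {}
for change in [
    ("insert", 1, {"name": "Latte", "stock": 1000}),
    ("insert", 8, {"name": "Mocha", "stock": 800}),
    ("update", 1, {"name": "Latte", "stock": 998}),
    ("delete", 8, None),
]:
    apply_update(update_table, change)
```

A stream that only ever sends inserts through `apply_update` behaves like append mode, which is why append mode can be seen as a special case of update mode.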
12. Derive Changelog Stream from Dynamic Table
• REDO Mode: the stream records the new value of a modified element to redo lost changes of completed transactions
• REDO+UNDO Mode: the stream records the old and the new value of a changed element to undo incomplete transactions and redo lost changes of completed transactions
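As a rough illustration of the difference between the two modes, the following plain-Python sketch derives a changelog from a sequence of keyed updates. The function `derive_changelog`, the `("redo" | "undo", key, value)` message layout, and the sample updates are hypothetical and are not Flink's internal representation.

```python
def derive_changelog(updates, with_undo):
    """Turn keyed updates into a changelog stream.

    REDO mode emits only the new value of each change.
    REDO+UNDO mode also emits the old value before overwriting it.
    """
    table = {}
    log = []
    for key, value in updates:
        if with_undo and key in table:
            log.append(("undo", key, table[key]))  # old value
        log.append(("redo", key, value))           # new value
        table[key] = value
    return log


# Hypothetical updates: (id, stock)
updates = [(1, 1000), (8, 800), (1, 998)]

redo_log = derive_changelog(updates, with_undo=False)
redo_undo_log = derive_changelog(updates, with_undo=True)
```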
13. There is no such thing as Stream SQL
Stream SQL?
Dynamic Tables generalize the concept of Static Tables
SQL serves as the unified way to describe data processing in both batch and streaming
15. Blink SQL & Table API Overview
Simple Query: Select and Where
Stream-Stream Inner Join
User Defined Function (UDF)
User Defined Table Function (UDTF)
User Defined Aggregate Function (UDAGG)
Retraction (stream only)
Aggregate
16. A Simple Query: Select and Where
Input table:
id  name   price  sales  stock
1   Latte  6      1      1000
8   Mocha  8      1      800
4   Breve  5      1      200
7   Tea    4      1      2000
1   Latte  6      2      998
Result table:
id  name   price  sales  stock
1   Latte  6      1      1000
1   Latte  6      2      998
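The slide's query text is not preserved here, so the following plain-Python sketch assumes a filter of the form `id = 1`, which is consistent with the rows shown. It only mimics the select-and-where semantics and is not Flink code.

```python
columns = ("id", "name", "price", "sales", "stock")
rows = [
    (1, "Latte", 6, 1, 1000),
    (8, "Mocha", 8, 1, 800),
    (4, "Breve", 5, 1, 200),
    (7, "Tea", 4, 1, 2000),
    (1, "Latte", 6, 2, 998),
]

# Assumed predicate: keep only rows whose id is 1 (inferred from the result rows).
id_idx = columns.index("id")
result = [row for row in rows if row[id_idx] == 1]
```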
17. Stream-Stream Inner Join
Left stream:
id1  name   stock
1    Latte  1000
8    Mocha  800
4    Breve  200
3    Water  5000
7    Tea    2000
Right stream:
id2  price  sales
1    6      1
8    8      1
9    3      1
4    5      1
7    4      1
Join result:
id  name   price  sales  stock
1   Latte  6      1      1000
8   Mocha  8      1      800
4   Breve  5      1      200
7   Tea    4      1      2000
This is proposed and discussed in FLINK-5878.
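One common way to think about a stream-stream inner join is that each side buffers its rows in keyed state and probes the other side's state when a new row arrives. The plain-Python sketch below illustrates that idea under simplifying assumptions (unbounded state, no retraction, no time bounds). The class `StreamStreamInnerJoin` and its methods are hypothetical and do not describe the design in FLINK-5878.

```python
from collections import defaultdict


class StreamStreamInnerJoin:
    """Symmetric hash join sketch: buffer both sides, probe on arrival."""

    def __init__(self):
        self.left_state = defaultdict(list)   # id -> [(name, stock)]
        self.right_state = defaultdict(list)  # id -> [(price, sales)]

    @staticmethod
    def _joined(key, left, right):
        name, stock = left
        price, sales = right
        return (key, name, price, sales, stock)

    def on_left(self, key, row):
        self.left_state[key].append(row)
        return [self._joined(key, row, r) for r in self.right_state.get(key, [])]

    def on_right(self, key, row):
        self.right_state[key].append(row)
        return [self._joined(key, l, row) for l in self.left_state.get(key, [])]


join = StreamStreamInnerJoin()
output = []
# The arrival order below is arbitrary; records of both streams interleave.
output += join.on_left(1, ("Latte", 1000))
output += join.on_right(8, (8, 1))
output += join.on_right(1, (6, 1))
output += join.on_left(8, ("Mocha", 800))
output += join.on_left(4, ("Breve", 200))
output += join.on_left(3, ("Water", 5000))   # never matched
output += join.on_right(9, (3, 1))           # never matched
output += join.on_right(4, (5, 1))
output += join.on_right(7, (4, 1))
output += join.on_left(7, ("Tea", 2000))
```

Rows with id 3 and id 9 stay in state without a match, which is why they do not appear in the join result.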
18. User Defined Function (UDF)
A UDF converts a scalar input to a scalar output. Creating and using a UDF is simple.
We have enhanced UDF/UDTF to support variable types and variable arguments.
Input (scalar):
long1  long2  int1  int2  int3
10L    25L    6     100   1000
Output (scalar):
lSum  iSum
35L   1106
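The variable-argument behavior in the example above can be mimicked in plain Python. This is only a sketch of the semantics: `var_sum` is a hypothetical name, and Python does not distinguish the long and int types that the slide's columns use.

```python
def var_sum(*values):
    """Scalar function with a variable number of arguments: returns their sum."""
    return sum(values)


l_sum = var_sum(10, 25)          # two "long" arguments
i_sum = var_sum(6, 100, 1000)    # three "int" arguments
```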
19. User Defined Table Function (UDTF)
A UDTF converts a scalar input to a table output (multiple rows and columns).
Input (scalar):
line
Tom#23 Jark#17 David#50
Output (table):
name   age
Tom    23
Jark   17
David  50
We shipped UDTF in Flink release 1.2 (FLINK-4469).
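A table function emits zero or more rows for each input value. A Python generator is a convenient stand-in for that behavior. The function name `split_people` and its separator parameters are assumptions for illustration and are not part of the Flink API.

```python
def split_people(line, record_sep=" ", field_sep="#"):
    """Emit one (name, age) row per record found in the input line."""
    for record in line.split(record_sep):
        name, age = record.split(field_sep)
        yield name, int(age)


rows = list(split_people("Tom#23 Jark#17 David#50"))
```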
20. User Defined Aggregate Function (UDAGG) - Motivation
A UDAGG converts a table input to a scalar output.
Flink has built-in aggregates (count, sum, avg, min, max) for SQL and the Table API, for example “SELECT SUM(stock) as total”:
Input (table):
id  name   price  sales  stock
1   Latte  6      1      1000
8   Mocha  8      1      800
4   Breve  5      1      200
Output (scalar):
total
2000
What if a user wants an aggregate that is not covered by the built-in aggregates, say a weighted average? We need an aggregate interface to support user-defined aggregate functions.
21. UDAGG – Accumulator (ACC)
id  name   price  sales  stock
1   Latte  6      1      1000
8   Mocha  8      1      800
4   Breve  5      1      200
7   Tea    4      1      2000
1   Latte  6      2      998
A UDAGG represents its state using an accumulator.
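To make the accumulator idea concrete, here is a plain-Python sketch of a weighted-average aggregate, the example aggregate named in the motivation. The class and method names are illustrative and only loosely modeled on an accumulator-style interface. Weighting price by sales is an assumption chosen to reuse the table above.

```python
class WeightedAvgAccumulator:
    """State of the aggregate: running weighted sum and total weight."""

    def __init__(self):
        self.weighted_sum = 0
        self.weight = 0


class WeightedAvg:
    def create_accumulator(self):
        return WeightedAvgAccumulator()

    def accumulate(self, acc, value, weight):
        acc.weighted_sum += value * weight
        acc.weight += weight

    def get_value(self, acc):
        # No rows (or zero total weight) yields no result.
        return acc.weighted_sum / acc.weight if acc.weight else None


# (price, sales) pairs taken from the table above
price_sales = [(6, 1), (8, 1), (5, 1), (4, 1), (6, 2)]

agg = WeightedAvg()
acc = agg.create_accumulator()
for price, sales in price_sales:
    agg.accumulate(acc, price, sales)
weighted_avg_price = agg.get_value(acc)  # 35 / 6
```

Because all state lives in the accumulator, the aggregate can be updated one record at a time, which is what a streaming runtime needs.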
23. UDAGG – Merge
Motivated by local and global aggregation (and session window merging, etc.), we need a merge method that can merge partially aggregated accumulators into a single accumulator.
How to count the total visits on Taobao web pages in real time?
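A minimal plain-Python sketch of local and global aggregation for a visit count follows. Each partition counts locally, and a merge step combines the partial accumulators. The names `CountAccumulator`, `accumulate`, and `merge`, and the sample partitions, are hypothetical.

```python
class CountAccumulator:
    def __init__(self):
        self.count = 0


def accumulate(acc, _visit):
    """Local step: count one visit."""
    acc.count += 1


def merge(acc, partials):
    """Global step: fold partial accumulators into a single accumulator."""
    for partial in partials:
        acc.count += partial.count
    return acc


# Hypothetical page visits, already spread over three parallel partitions.
partitions = [["v1", "v2", "v3"], ["v4"], ["v5", "v6"]]

local_accs = []
for visits in partitions:
    local = CountAccumulator()
    for visit in visits:
        accumulate(local, visit)
    local_accs.append(local)

total = merge(CountAccumulator(), local_accs).count
```

Only one small accumulator per partition crosses the network, instead of every visit record.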
24. UDAGG – Retraction – Motivations
Incorrect! The freq of cnt=1 should be 2
25. UDAGG – Retraction – Motivations
We need a retract method in UDAGG, which can retract the UNDO messages from the accumulator
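The problem and the fix can be sketched in plain Python with a two-level aggregate: the first level counts records per key, and the second level counts how many keys have each count. The input words are hypothetical, chosen so that the result without retraction shows the kind of error noted above. The function `count_of_counts` is illustrative only.

```python
from collections import defaultdict


def count_of_counts(words, use_retraction):
    per_word = defaultdict(int)  # first-level aggregate: word -> cnt
    freq = defaultdict(int)      # second-level aggregate: cnt -> number of words
    for word in words:
        old = per_word[word]
        new = old + 1
        per_word[word] = new
        if use_retraction and old > 0:
            freq[old] -= 1       # UNDO: retract the outdated count
        freq[new] += 1           # REDO: accumulate the new count
    return {cnt: n for cnt, n in freq.items() if n > 0}


words = ["a", "b", "c", "a"]     # "a" occurs twice, "b" and "c" once

without_retraction = count_of_counts(words, use_retraction=False)
with_retraction = count_of_counts(words, use_retraction=True)
```

Without the UNDO step, the second-level aggregate still counts the stale value cnt=1 for "a", so the frequency of cnt=1 is 3 instead of 2.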
26. Retraction – Solution
Retraction is introduced to handle updates.
We use the query optimizer to decide where retraction is needed.
The design doc and the progress of the retraction implementation are tracked in FLINK-6047. We delivered this in Flink release 1.3.
[Figure: query plan annotated for retraction. Node labels: DataStreamAgg (Redo+Undo), Update Table (consume Undo log), DataStreamAgg (Redo), Update Table, Append Table, TableScan without PK (Redo), Sink Table (does not need retraction), UpsertSink. Two edges are marked NeedRetraction.]
27. UDAGG – Summary
Master JIRA for UDAGG is FLINK-5564. We have shipped this in Flink release 1.3.
28. Aggregate – Over Aggregate
Calculate a moving average (over the past 5 seconds) and emit the result for each record.
Input:
time   itemID  price
1000   101     1
3000   201     2
4000   301     3
5000   101     1
5000   401     4
7000   301     3
8000   501     5
10000  101     1
Result:
time   itemID  avgPrice
1000   101     1
3000   201     1.5
4000   301     2
5000   101     2.2
5000   401     2.2
7000   301     2.6
8000   501     3
10000  101     2.8
A time-based Group Aggregate is not able to differentiate two records with the same row time, but an Over Aggregate can.
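The averages in the example can be reproduced with a short plain-Python sketch. It assumes a time range of 5000 time units (5 seconds if `time` is in milliseconds) with both bounds inclusive and no partitioning by itemID, which is what matches the numbers shown. It evaluates over the complete input rather than incrementally, so it only illustrates the semantics.

```python
def over_range_avg(rows, range_ms):
    """For each row, average the price of all rows in [time - range_ms, time]."""
    results = []
    for time, item_id, _price in rows:
        window = [p for (ts, _item, p) in rows if time - range_ms <= ts <= time]
        results.append((time, item_id, round(sum(window) / len(window), 2)))
    return results


# (time, itemID, price)
rows = [
    (1000, 101, 1),
    (3000, 201, 2),
    (4000, 301, 3),
    (5000, 101, 1),
    (5000, 401, 4),
    (7000, 301, 3),
    (8000, 501, 5),
    (10000, 101, 1),
]

result = over_range_avg(rows, range_ms=5000)
avg_prices = [avg for (_time, _item, avg) in result]
```

One result row is produced per input row, including the two rows that share time 5000, which is the property that distinguishes an over aggregate from a group aggregate.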
29. Aggregate – Summary
The design of aggregates is mainly tracked in FLIP-11 (FLINK-4557). We delivered the above aggregates in Flink release 1.3.
Grouping methods: Group By / Over
Time types: event time; processing time (only for streaming)
Unbounded Aggregate: early firing under a certain emit configuration (by default it emits the result on every input record)
Windows:
• Time/Count + TUMBLE/SESSION/SLIDE window
• OVER Rows/Time Range window
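For contrast with the over window, a tumbling time window assigns each record to exactly one fixed-size, non-overlapping window and emits one result per window. The plain-Python sketch below illustrates this under assumptions: the function `tumble_avg`, the 5000-unit window size, and the sample (time, price) records are all chosen for illustration.

```python
from collections import defaultdict


def tumble_avg(records, size):
    """Group (time, price) records into tumbling windows and average each one."""
    sums = defaultdict(lambda: [0, 0])       # window start -> [sum, count]
    for time, price in records:
        window_start = time - (time % size)  # window is [start, start + size)
        sums[window_start][0] += price
        sums[window_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(sums.items())}


# Hypothetical (time, price) records
records = [(1000, 1), (3000, 2), (4000, 3), (5000, 1),
           (5000, 4), (7000, 3), (8000, 5), (10000, 1)]

window_avgs = tumble_avg(records, size=5000)
```

Both records at time 5000 land in the same window and contribute to a single result, whereas an over window would emit one result for each of them.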
30. Contributions to Flink SQL & Table API
Flink blog: “Continuous Queries on Dynamic Tables” (posted at https://flink.apache.org/news/2017/04/04/dynamic-tables.html)
UDF (several improvements are released in 1.3)
UDTF (FLINK-4469, released in 1.2)
UDAGG (FLINK-5564, released in 1.3)
Retraction (FLINK-6047, released in 1.3)
Group/Over Window Aggregate (FLINK-4557, released in 1.3)
Unbounded Stream Group Aggregate (FLINK-6216, released in 1.3)
Stream-Stream Inner Join (FLINK-5878, targeted for release 1.4)
More coming…
31. SQL & Table API in Large Scale Production
Section 4
32. SQL & Table API in Alibaba Production - example
SQL & Table API has proven to be a successful and sufficient declarative language for data processing.
It significantly reduces the development effort needed to rewrite existing jobs or implement new ones.
33. SQL & Table API in Alibaba Production - Summary
Blink@Alibaba
In production at Alibaba for more than a year
• Hundreds of jobs
• The biggest cluster has more than 1500 nodes
• The biggest job has thousands of tasks and tens of TB of state
Blink SQL@Alibaba
In production since before China’s Singles’ Day 2016 (the biggest shopping festival, similar to Black Friday in the US)
• Blink jobs written in SQL & Table API are used for real-time analysis in the recommendation system, which helps improve targeting efficiency and thereby increases the traffic-to-sales conversion.
• The biggest SQL job has thousands of tasks and more than a TB of state
The latest release of Blink SQL will be used to support the entire Alibaba internal business and serve external customers via the Alibaba Cloud StreamCompute service