Streaming SQL with Apache Calcite
Julian Hyde
Apache Big Data, Vancouver, 2016/05/09
@julianhyde
SQL · Query planning · Query federation · OLAP · Streaming · Hadoop
Apache member · VP Calcite · PMC: Arrow, Drill, Kylin
Thanks:
● Milinda Pathirage & Yi Pan (Samza)
● Haohui Mai (Storm)
● Fabian Hueske & Stephan Ewen (Flink)
Streaming data sources
Sources:
● Devices / sensors
● Web servers
● (Micro-)services
● Databases (CDC)
● Synthetic streams
● Logging / tracing
Transports:
● Kafka
● Storm
● NiFi
[Diagram: IoT devices, services, web servers and a database feeding streams into a data center]
How much is your data worth?
Recent data is more valuable
➢ ...if you act on it in time
Data moves from expensive memory to cheaper disk as it cools
Old + new data is more valuable still
➢ ...if we have a means to combine them
[Chart: value of data ($/GB) against time: now, 1 hour ago, 1 day ago, 1 week ago, 1 year ago]
Why query streams?
Duality:
● “Your database is just a cache of my stream”
● “Your stream is just change-capture of my database”
“Data is the new oil”
● Treating events/messages as data allows you to extract and refine them
Declarative approach to streaming applications
Why SQL?
➢ API to your database
➢ Ask for what you want; the system decides how to get it
➢ Query planner (optimizer) converts logical queries to physical plans
➢ Mathematically sound language (relational algebra)
➢ For all data, not just “flat” data in a database
➢ Opportunity for novel data organizations & algorithms
➢ Standard
https://www.flickr.com/photos/pere/523019984/ (CC BY-NC-SA 2.0)
Data workloads
● Batch
● Transaction processing
● Single-record lookup
● Search
● Interactive / OLAP
● Exploration / profiling
● Continuous execution generating alerts (CEP)
● Continuous load
Building a streaming SQL standard via consensus
Please! No more “SQL-like” languages!
Key technologies are open source (many are Apache projects)
Calcite is providing leadership: developing example queries, TCK
(Optional) Use Calcite’s framework to build a streaming SQL parser/planner for your engine
Several projects are working with us: Samza, Storm, Flink. (Also non-streaming SQL in Cassandra, Drill, Flink, Hive, Kylin, Phoenix.)
Simple queries
select *
from Products
where unitPrice < 20
select stream *
from Orders
where units > 1000
➢ Traditional (non-streaming)
➢ Products is a table
➢ Retrieves records from -∞ to now
➢ Streaming
➢ Orders is a stream
➢ Retrieves records from now to +∞
➢ Query never terminates
Stream-table duality
select *
from Orders
where units > 1000
➢ Yes, you can use a stream as
a table
➢ And you can use a table as a
stream
➢ Actually, Orders is both
➢ Use the stream keyword
➢ Where to actually find the
data? That’s up to the system
select stream *
from Orders
where units > 1000
Combining past and future
select stream *
from Orders as o
where units > (
select avg(units)
from Orders as h
where h.productId = o.productId
and h.rowtime > o.rowtime - interval '1' year)
➢ Orders is used as both stream and table
➢ System determines where to find the records
➢ Query is invalid if records are not available
The “pie chart” problem
➢ Task: Write a web page summarizing orders over the last hour
➢ Problem: The Orders stream only contains the current few records
➢ Solution: Materialize short-term history (a sketch follows the query below)
[Pie chart: orders over the last hour: Beer 48%, Cheese 30%, Wine 22%]
select productId, count(*)
from Orders
where rowtime > current_timestamp - interval '1' hour
group by productId
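A sketch of the materialization step, using the standing-INSERT pattern from the DML slide later in this deck; OrderHistory is a hypothetical table from which the system expires old rows:
insert into OrderHistory -- hypothetical short-term history table
select stream rowtime, productId
from Orders
The pie-chart query then runs against the materialized history rather than the raw stream.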
Aggregation and windows on streams
GROUP BY aggregates multiple rows into sub-totals
➢ In regular GROUP BY, each row contributes to exactly one sub-total
➢ In multi-GROUP BY (e.g. HOP, GROUPING SETS) a row can contribute to more than one sub-total
Window functions leave the number of rows unchanged, but compute extra expressions for each row (based on neighboring rows)
[Diagram: GROUP BY, multi GROUP BY and window functions as overlapping sets]
GROUP BY
select stream productId,
  floor(rowtime to hour) as rowtime,
  sum(units) as u,
  count(*) as c
from Orders
group by productId,
  floor(rowtime to hour)
Input:
rowtime  productId  units
09:12    100        5
09:25    130        10
09:59    100        3
10:00    100        19
11:05    130        20

Output:
rowtime  productId  u   c
09:00    100        8   2
09:00    130        10  1
10:00    100        19  1
(the 11:05 row's total is not emitted yet; waiting for a row >= 12:00)
When are rows emitted?
The replay principle:
A streaming query produces the same result as the corresponding non-streaming query would if given the same data in a table.
Output must not rely on implicit information (arrival order, arrival time, processing time, or punctuations)
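For example, by the stream-table duality, dropping the stream keyword from the GROUP BY query above must produce the same hourly sub-totals once the same rows are in the Orders table:
select productId,
  floor(rowtime to hour) as rowtime,
  sum(units) as u,
  count(*) as c
from Orders -- Orders used as a table here
group by productId,
  floor(rowtime to hour)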
Making progress
It’s not enough to get the right result. We need to give the right result at the right time.
Ways to make progress without compromising safety:
➢ Monotonic columns (e.g. rowtime) and expressions (e.g. floor(rowtime to hour))
➢ Punctuations (aka watermarks)
➢ Or a combination of both
select stream productId,
  count(*) as c
from Orders
group by productId;

ERROR: Streaming aggregation requires at least one monotonic expression in GROUP BY clause
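A minimal fix, reusing the monotonic expression floor(rowtime to hour) from the GROUP BY slide:
select stream productId,
  floor(rowtime to hour) as rowtime,
  count(*) as c
from Orders
group by productId,
  floor(rowtime to hour); -- monotonic expression added
Because floor(rowtime to hour) never decreases, the system knows an hour's sub-totals are complete as soon as it sees a row from a later hour.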
Window types
➢ Tumbling window: “Every T seconds, emit the total for the last T seconds”
➢ Hopping window: “Every T1 seconds, emit the total for the last T2 seconds”
➢ Sliding window: “Every record, emit the total for the surrounding T seconds” or “Every record, emit the total for the surrounding T records” (see next slide…)
select … from Orders group by floor(rowtime to hour)
select … from Orders
group by tumble(rowtime, interval '1' hour)
select stream … from Orders
group by hop(rowtime, interval '1' hour, interval '2' hour)
Window functions
select stream sum(units) over w as units1h,
  sum(units) over w (partition by productId) as units1hp,
  rowtime, productId, units
from Orders
window w as (order by rowtime range interval '1' hour preceding)

Input:
rowtime  productId  units
09:12    100        5
09:25    130        10
09:59    100        3
10:17    100        10

Output:
units1h  units1hp  rowtime  productId  units
5        5         09:12    100        5
15       10        09:25    130        10
18       8         09:59    100        3
23       13        10:17    100        10
Join stream to a table
Inputs are the Orders stream and the Products table; output is a stream. Acts as a “lookup”.
Execute by caching the table in a hash map (if the table is not too large); stream order is preserved.
select stream *
from Orders as o
join Products as p
on o.productId = p.productId
Join stream to a changing table
Execution is more difficult if the Products table is being changed while the query executes.
To do things properly (e.g. to get the same results when we replay the data), we’d need temporal database semantics.
(Sometimes doing things properly is too expensive.)
select stream *
from Orders as o
join Products as p
on o.productId = p.productId
and o.rowtime
  between p.startEffectiveDate
  and p.endEffectiveDate
Join stream to a stream
We can join streams if the join condition forces them into “lock step”, within a window (in this case, 1 hour).
Which stream to put into a hash table? It depends on relative rates, outer joins, and how we’d like the output sorted.
select stream *
from Orders as o
join Shipments as s
on o.productId = s.productId
and s.rowtime
  between o.rowtime
  and o.rowtime + interval '1' hour
Planning queries
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc

[Plan diagram:
sort (Key: c desc)
  group (Key: productName, Agg: count)
    filter (Condition: action = 'purchase')
      join (Key: productId)
        scan (Table: splunk) (Splunk)
        scan (Table: products) (MySQL)]
Optimized query
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc

[Plan diagram, with the filter pushed towards the Splunk source:
sort (Key: c desc)
  group (Key: productName, Agg: count)
    join (Key: productId)
      filter (Condition: action = 'purchase') (Splunk)
        scan (Table: splunk)
      scan (Table: products) (MySQL)]
Apache Calcite
Apache top-level project since October 2015
Query planning framework
➢ Relational algebra, rewrite rules
➢ Cost model & statistics
➢ Federation via adapters
➢ Extensible
Packaging
➢ Library
➢ Optional SQL parser, JDBC server
➢ Community-authored rules, adapters
Embedded:
● Apache Drill
● Apache Hive
● Apache Kylin
● Apache Phoenix*
● Cascading Lingual
Adapters:
● Apache Cassandra
● Apache Spark
● CSV
● Druid*
● In-memory
● JDBC
● JSON
● MongoDB
● Splunk
● Web tables
Streaming:
● Apache Flink*
● Apache Samza
● Apache Storm
* Under development
Architecture
[Diagram: a conventional database architecture side by side with Calcite's]
Relational algebra (plus streaming)
Core operators:
➢ Scan
➢ Filter
➢ Project
➢ Join
➢ Sort
➢ Aggregate
➢ Union
➢ Values
Streaming operators:
➢ Delta (converts relation to stream)
➢ Chi (converts stream to relation)
In SQL, the STREAM keyword signifies Delta
Streaming algebra
➢ Filter
➢ Route
➢ Partition
➢ Round-robin
➢ Queue
➢ Aggregate
➢ Merge
➢ Store
➢ Replay
➢ Sort
➢ Lookup
Optimizing streaming queries
The usual relational transformations still apply: push filters and projects towards
sources, eliminate empty inputs, etc.
The transformations for delta are mostly simple:
➢ Delta(Filter(r, predicate)) → Filter(Delta(r), predicate)
➢ Delta(Project(r, e0, ...)) → Project(Delta(r), e0, …)
➢ Delta(Union(r0, r1), ALL) → Union(Delta(r0), Delta(r1))
But not always:
➢ Delta(Join(r0, r1, predicate)) → Union(Join(r0, Delta(r1)), Join(Delta(r0), r1))
➢ Delta(Scan(aTable)) → Empty
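A worked example: applying the join rule to the stream-to-table join from earlier,
Delta(Join(Orders, Products))
  → Union(Join(Orders, Delta(Products)), Join(Delta(Orders), Products))
and since Products is a table, Delta(Products) → Empty, so the first branch vanishes and the plan simplifies to Join(Delta(Orders), Products), the “lookup” execution described on the stream-to-table join slide.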
ORDER BY
Sorting a streaming query is valid as long as the system can make progress.
select stream productId,
floor(rowtime to hour) as rowtime,
sum(units) as u,
count(*) as c
from Orders
group by productId,
floor(rowtime to hour)
order by rowtime, c desc
Union
As in a typical database, we rewrite x union y to select distinct * from (x union all y)
We can implement x union all y by simply combining the inputs in arrival order, but then the output is no longer monotonic. Monotonicity is too useful to squander!
To preserve monotonicity, we merge on the sort key (e.g. rowtime).
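A sketch of a monotonicity-preserving merge, assuming Orders and Shipments both carry a monotonic rowtime and a productId column:
select stream rowtime, productId from Orders
union all
select stream rowtime, productId from Shipments
Because both inputs are sorted on rowtime, the system can merge rather than interleave, and the output stream stays monotonic in rowtime.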
DML
➢ View & standing INSERT give same results
➢ Useful for chained transforms
➢ But internals are different
insert into LargeOrders
select stream * from Orders
where units > 1000

create view LargeOrders as
select stream * from Orders
where units > 1000

Use DML to maintain a “window” (materialized stream history):
upsert into OrdersSummary
select stream productId,
  count(*) over lastHour as c
from Orders
window lastHour as (
  partition by productId
  order by rowtime
  range interval '1' hour preceding)
Summary: Streaming SQL features
Standard SQL over streams and relations
Streaming queries on relations, and relational queries on streams
Joins between stream-stream and stream-relation
Queries are valid if the system can get the data, with a reasonable latency
➢ Monotonic columns and punctuation are ways to achieve this
Views, materialized views and standing queries
Summary: The benefits of streaming SQL
Relational algebra covers needs of data-in-flight and data-at-rest applications
High-level language lets the system optimize quality of service (QoS) and data
location
Give DB tools and traditional users access to streaming data;
give message-oriented tools access to historic data
Combine real-time and historic data, and produce actionable results
Discussion continues at Apache Calcite, with contributions from Samza, Flink,
Storm and others. Please join in!
Thank you!
@julianhyde
@ApacheCalcite
http://calcite.apache.org
http://calcite.apache.org/docs/stream.html
“Data in Flight”, Communications of the ACM, January 2010 [article]

  • 33. Summary: The benefits of streaming SQL Relational algebra covers needs of data-in-flight and data-at-rest applications High-level language lets the system optimize quality of service (QoS) and data location Give DB tools and traditional users to access streaming data; give message-oriented tools access to historic data Combine real-time and historic data, and produce actionable results Discussion continues at Apache Calcite, with contributions from Samza, Flink, Storm and others. Please join in!