2. Agenda
► Why and what is Big SQL 3.0?
  • Not a sales pitch, I promise!
► Overview of the challenges
► How we solved (some of) them
  • Architecture and interaction with Hadoop
  • Query rewrite
  • Query optimization
► Future challenges
3. The Perfect Storm
► Increased business interest in SQL on Hadoop to improve the pace and efficiency of adopting Hadoop
► SQL engines on Hadoop are moving away from MR towards MPP architectures
► SQL users expect the same level of language expressiveness, features and (to some extent) performance as RDBMSs
► IBM has decades of experience and assets in building SQL engines… why not leverage them?
4. The Result? Big SQL 3.0
► MapReduce replaced with a modern MPP shared-nothing architecture
► Architected from the ground up for low latency and high throughput
► Same SQL expressiveness as relational RDBMSs, which allows application portability
► Rich enterprise capabilities…
5. Big SQL 3.0 At a Glance
Application Portability & Integration
  • Data shared with the Hadoop ecosystem
  • Comprehensive file format support
  • Superior enablement of IBM software
Performance
  • Powerful SQL query rewriter
  • Cost-based optimizer
  • Optimized for concurrent-user throughput
  • Result sets not constrained by available memory
Federation
  • Distributed requests to multiple data sources within a single SQL statement
  • Main data sources supported: DB2, Teradata, Oracle, Netezza
Enterprise Capabilities
  • Advanced security / auditing
  • Resource and workload management
  • Self-tuning memory management
  • Comprehensive monitoring
Rich SQL
  • Comprehensive SQL support
  • IBM SQL PL compatibility
6. How did we do it?
► Big SQL is derived from an existing IBM shared-nothing RDBMS
  • A very mature MPP architecture
  • Already understands distributed joins and optimization
► Behavior is sufficiently different that it is considered a separate product
  • Certain SQL constructs are disabled
  • Traditional data warehouse partitioning is unavailable
  • New SQL constructs are introduced
► On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but…
[Diagram: a traditional distributed RDBMS architecture with multiple database partitions]
7. Challenges for a traditional RDBMS on Hadoop
► Data placement
  • Traditional databases expect to have full control over data placement
  • Data placement plays an important role in performance (e.g. co-located joins)
  • Hadoop's randomly scattered data goes against the grain of this
► Reading and writing Hadoop files
  • Normally an RDBMS has its own storage format
  • The format is highly optimized to minimize the cost of moving data into memory
  • Hadoop has a practically unbounded number of storage formats, all with different capabilities
8. Challenges for a traditional RDBMS on Hadoop
► Query optimization
  • Statistics on Hadoop are a relatively new concept
  • They are frequently not available
  • The database optimizer can use statistics not traditionally available in Hive
  • Hive-style partitioning (grouping data into different files/directories) is a new concept
► Resource management
  • A database server almost always runs in isolation
  • In Hadoop the nodes must be shared with many other tasks
    – Data nodes
    – MR task tracker and tasks
    – HBase region servers, etc.
  • We needed to learn to play nice with others
9. Architecture Overview
[Diagram: a management node hosts the Big SQL master node, the Big SQL scheduler, DDL and UDF FMPs, a database service, and the Hive metastore and Hive server. Each compute node hosts a Big SQL worker node (with a Java I/O FMP, a native I/O FMP, a UDF FMP and temp data), sharing the machine with the HDFS data node, the MR task tracker and other services. *FMP = fenced-mode process]
10. Big SQL Scheduler
► The Scheduler is the main RDBMS ↔ Hadoop service interface
  • Interfaces with the Hive metastore for table metadata
  • Acts like the MapReduce job tracker for Big SQL
    – Big SQL provides query predicates for the scheduler to perform partition elimination
    – Determines splits for each "table" involved in the query
    – Schedules splits on available Big SQL nodes (favoring scheduling local to the data)
    – Serves work (splits) to the I/O engines
    – Coordinates "commits" after INSERTs
► The Scheduler allows the database engine to be largely unaware of the Hadoop world
[Diagram: the scheduler runs alongside the Big SQL master node and the DDL/UDF FMPs on the management node, coordinating the Hive metastore and the Big SQL worker nodes]
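The locality-favoring split scheduling described above can be sketched in plain Python. This is a hypothetical, simplified model (the `Split` class and `schedule_splits` function are illustrative, not the product's API): assign each HDFS split to a worker, preferring a worker that holds a local replica of the block and breaking ties by load.

```python
# Hypothetical sketch of locality-aware split scheduling; names are
# illustrative, not Big SQL internals.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Split:
    path: str
    offset: int
    length: int
    replica_hosts: tuple  # hosts holding a copy of this block

def schedule_splits(splits, workers):
    """Greedily assign splits, favoring data-local workers and balancing load."""
    load = {w: 0 for w in workers}
    assignment = defaultdict(list)
    for split in splits:
        # Candidates: workers that hold a local replica, else all workers.
        local = [w for w in split.replica_hosts if w in load]
        candidates = local or workers
        # Pick the least-loaded candidate to spread work evenly.
        target = min(candidates, key=lambda w: load[w])
        assignment[target].append(split)
        load[target] += split.length
    return assignment

workers = ["node1", "node2", "node3"]
splits = [
    Split("/t1/part-0", 0, 128, ("node1", "node2")),
    Split("/t1/part-1", 0, 128, ("node2", "node3")),
    Split("/t1/part-2", 0, 128, ("node1", "node3")),
]
plan = schedule_splits(splits, workers)
# With this layout, every split lands on a node holding one of its replicas.
assert all(w in s.replica_hosts for w, ss in plan.items() for s in ss)
```

The real scheduler also handles partition elimination and commit coordination; this sketch only captures the locality preference.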
11. I/O Fenced-Mode Processes
► Native I/O FMP
  • The high-speed interface for a limited number of common file formats
► Java I/O FMP
  • Handles all other formats via the standard Hadoop/Hive APIs
► Both perform multi-threaded direct I/O on local data
► The database engine had to be taught storage format capabilities
  • The projection list is pushed into the I/O format
  • Predicates are pushed as close to the data as possible (into the storage format, if possible)
  • Predicates that cannot be pushed down are evaluated within the database engine
► The database engine is only aware of which nodes need to read
  • The Scheduler directs the readers to their portion of the work
[Diagram: a Big SQL worker node with its Java and native I/O FMPs reading local HDFS data, sharing the compute node with the HDFS data node, the MR task tracker and other services]
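The split between pushed-down and residual predicates can be sketched as follows. This is an illustrative toy (the `split_predicates` function and the capability set are invented for the example): predicates the storage format can evaluate are pushed into the reader, and the rest are left for the database engine.

```python
# Illustrative sketch (not product code) of dividing predicates between a
# storage format and the engine, based on what the format can evaluate.
def split_predicates(predicates, format_supports):
    """format_supports is the set of operators the storage format can evaluate."""
    pushed, residual = [], []
    for col, op, val in predicates:
        (pushed if op in format_supports else residual).append((col, op, val))
    return pushed, residual

# Assume a format that supports only equality and range comparisons.
preds = [("p_brand", "=", "Brand#23"),
         ("p_container", "=", "MED BOX"),
         ("p_name", "LIKE", "%steel%")]
pushed, residual = split_predicates(preds, {"=", "<", ">", "<=", ">="})
assert len(pushed) == 2
assert residual == [("p_name", "LIKE", "%steel%")]
```

The residual `LIKE` predicate here would be evaluated inside the database engine, as the slide describes.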
12. Query Compilation
[Diagram: the Big SQL master node on the management node, with the scheduler and the DDL/UDF FMPs]
There is a lot involved in SQL compilation:
► Parsing
  • Catch syntax errors
  • Generate an internal representation of the query
► Semantic checking
  • Determine if the query makes sense
  • Incorporate view definitions
  • Add logic for constraint checking
► Query optimization
  • Modify the query to improve performance (query rewrite)
  • Choose the most efficient "access plan"
► Pushdown analysis
  • Federation "optimization"
► Threaded code generation
  • Generate efficient "executable" code
13. Query Rewrite
► Why is query rewrite important?
  • There are many ways to express the same query
  • Query generators often produce suboptimal queries and don't permit "hand optimization"
  • Complex queries often result in redundancy, especially with views
  • For large data volumes, optimal access plans are more crucial, as the penalty for poor planning is greater

select sum(l_extendedprice) / 7.0 avg_yearly
from tpcd.lineitem, tpcd.part
where p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (select 0.2 * avg(l_quantity)
                    from tpcd.lineitem
                    where l_partkey = p_partkey);

select sum(l_extendedprice) / 7.0 as avg_yearly
from temp (l_quantity, avgquantity, l_extendedprice) as
  (select l_quantity,
          avg(l_quantity) over (partition by l_partkey) as avgquantity,
          l_extendedprice
   from tpcd.lineitem, tpcd.part
   where p_partkey = l_partkey
     and p_brand = 'Brand#23'
     and p_container = 'MED BOX')
where l_quantity < 0.2 * avgquantity

  • Query correlation eliminated
  • Lineitem table accessed only once
  • Execution time reduced by half!
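The effect of this rewrite can be illustrated in plain Python over toy data (the rows and helper functions are hypothetical, and the join with `part` is omitted for brevity): the correlated form re-scans `lineitem` once per outer row, while the decorrelated form computes the per-partkey average in a single pass, yet both produce the same answer.

```python
# Toy illustration of decorrelation: same result, one scan instead of many.
from collections import defaultdict

lineitem = [  # (l_partkey, l_quantity, l_extendedprice) -- invented data
    (1, 10, 100.0), (1, 30, 300.0), (2, 5, 50.0), (2, 100, 1000.0),
]

def correlated():
    """Correlated form: one re-scan of lineitem per outer row."""
    total = 0.0
    for pk, qty, price in lineitem:
        qs = [q for p, q, _ in lineitem if p == pk]  # inner re-scan
        if qty < 0.2 * (sum(qs) / len(qs)):
            total += price
    return total / 7.0

def decorrelated():
    """Decorrelated form: one pass builds the per-partkey averages
    (the avg(l_quantity) over (partition by l_partkey) window)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for pk, qty, _ in lineitem:
        sums[pk] += qty
        counts[pk] += 1
    avg = {pk: sums[pk] / counts[pk] for pk in sums}
    return sum(price for pk, qty, price in lineitem
               if qty < 0.2 * avg[pk]) / 7.0

assert correlated() == decorrelated()
```

On real data volumes the single-pass form is what halves the execution time on this slide.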
14. Query Rewrite
► Most existing query rewrite rules remain unchanged
  • 140+ existing query rewrites are leveraged
  • Almost none are impacted by "the Hadoop world"
► There were, however, a few modifications that were required…
15. Query Rewrite and Indexes
► Column nullability and indexes can help drive query optimization
  • Can produce more efficiently decorrelated subqueries and joins
  • Used to prove uniqueness of joined rows ("early-out" join)
► Very few Hadoop data sources support the concept of an index
► In the Hive metastore all columns are implicitly nullable
► Big SQL introduces advisory constraints and nullability indicators
  • The user can specify whether or not constraints can be "trusted" for query rewrites

create hadoop table users
(
  id        int not null primary key,
  office_id int null,
  fname     varchar(30) not null,
  lname     varchar(30) not null,
  salary    timestamp(3) null,
  constraint fk_ofc foreign key (office_id)
    references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;

(The null / not null clauses are the nullability indicators; the primary key and foreign key clauses are the advisory constraints.)
16. Query Pushdown
► Pushdown moves processing down as close to the data as possible
  • Projection pushdown – retrieve only the necessary columns
  • Selection pushdown – push search criteria
► Big SQL understands the capabilities of the readers and storage formats involved
  • As much as possible is pushed down
  • Residual processing is done in the server
  • The optimizer costs queries based upon how much can be pushed down

3) External Sarg Predicate,
   Comparison Operator: Equal (=)
   Subquery Input Required: No
   Filter Factor: 0.04
   Predicate Text:
   --------------
   (Q1.P_BRAND = 'Brand#23')

4) External Sarg Predicate,
   Comparison Operator: Equal (=)
   Subquery Input Required: No
   Filter Factor: 0.025
   Predicate Text:
   --------------
   (Q1.P_CONTAINER = 'MED BOX')

select sum(l_extendedprice) / 7.0 as avg_yearly
from temp (l_quantity, avgquantity, l_extendedprice) as
  (select l_quantity,
          avg(l_quantity) over (partition by l_partkey) as avgquantity,
          l_extendedprice
   from tpcd.lineitem, tpcd.part
   where p_partkey = l_partkey
     and p_brand = 'Brand#23'
     and p_container = 'MED BOX')
where l_quantity < 0.2 * avgquantity
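Projection pushdown, the other half of the slide, can be sketched with a minimal hypothetical reader interface (the `read_with_projection` function is invented for illustration): the reader materializes only the columns the query references instead of deserializing every field of each row.

```python
# Minimal sketch of projection pushdown with an invented reader interface.
def read_with_projection(rows, schema, needed_cols):
    """Yield tuples containing only the projected columns."""
    idx = [schema.index(c) for c in needed_cols]
    for row in rows:
        yield tuple(row[i] for i in idx)

schema = ["p_partkey", "p_brand", "p_container", "p_comment"]
rows = [(1, "Brand#23", "MED BOX", "..."),
        (2, "Brand#12", "JUMBO", "...")]
# Only two of the four columns are actually decoded and returned.
projected = list(read_with_projection(rows, schema, ["p_partkey", "p_brand"]))
assert projected == [(1, "Brand#23"), (2, "Brand#12")]
```

With columnar formats the saving is larger still, since unprojected columns need not even be read from disk.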
17. Statistics
► Big SQL utilizes Hive statistics collection with some extensions:
  • Additional support for column groups, histograms and frequent values
  • Automatic determination of partitions that require statistics collection vs. explicit collection
  • Partitioned tables: added table-level versions of NDV, min, max, null count and average column length
  • Hive catalogs as well as database engine catalogs are populated
  • We are restructuring the relevant code for submission back to Hive
► Capability for statistics fabrication if no stats are available at compile time

Table statistics
  • Cardinality (count)
  • Number of files
  • Total file size
Column statistics
  • Minimum value (all types)
  • Maximum value (all types)
  • Cardinality (non-nulls)
  • Distribution (number of distinct values, NDV)
  • Number of null values
  • Average length of the column value (all types)
  • Histogram – number of buckets configurable
  • Frequent values (MFV) – number configurable
Column group statistics
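The column statistics listed above can be sketched in a few lines of Python. This is a rough illustration, not the actual collection code: it computes min, max, non-null cardinality, NDV, null count, average length and an equi-width histogram with a configurable bucket count for a numeric column.

```python
# Rough sketch of per-column statistics collection (illustrative only).
def column_stats(values, buckets=4):
    non_null = [v for v in values if v is not None]
    lo, hi = min(non_null), max(non_null)
    width = (hi - lo) / buckets or 1          # guard against constant columns
    hist = [0] * buckets
    for v in non_null:
        hist[min(int((v - lo) / width), buckets - 1)] += 1
    return {
        "min": lo, "max": hi,
        "cardinality": len(non_null),          # non-null count
        "ndv": len(set(non_null)),             # number of distinct values
        "nulls": len(values) - len(non_null),
        "avg_len": sum(len(str(v)) for v in non_null) / len(non_null),
        "histogram": hist,
    }

stats = column_stats([1, 5, 5, 9, None, 13, 17], buckets=4)
assert stats["ndv"] == 5 and stats["nulls"] == 1
assert sum(stats["histogram"]) == stats["cardinality"]
```

Histograms and frequent values matter because they give the optimizer better selectivity estimates for range predicates over non-uniformly distributed data.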
18. Costing Model
► Only a few extensions were required to the cost model
► The TBSCAN operator cost model was extended to evaluate the cost of reading from Hadoop
► New elements taken into account: number of files, size of files, number of partitions, number of nodes
► The optimizer now knows in which subset of nodes the data resides
  – Better costing!
[Example access plan: a hash join (HSJOIN) combining a table scan of TPCH5TB_PARQ.ORDERS with a nested-loop join over grouped and filtered scans of TPCH5TB_PARQ.CUSTOMER, connected by broadcast, local and directed table queues (BTQ/LTQ/DTQ), with per-operator cardinality and cost estimates shown at each node]
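A back-of-envelope version of the extended table-scan costing can be sketched as follows. The constants and formula here are invented for illustration; the point is only that the inputs the slide names (number of files, total file size, number of partitions, number of nodes holding the data) all feed the estimate.

```python
# Illustrative TBSCAN cost sketch with invented constants, not the real model.
def tbscan_cost(num_files, total_bytes, num_partitions, num_nodes,
                per_file_open_cost=10.0, bytes_per_cost_unit=1e6):
    # I/O is parallelized across the nodes that actually hold the data.
    io_cost = (total_bytes / bytes_per_cost_unit) / num_nodes
    # Each file adds fixed open/seek overhead; partitions add metadata work.
    overhead = num_files * per_file_open_cost + num_partitions
    return io_cost + overhead

# More nodes holding the data -> cheaper scan, all else being equal.
assert tbscan_cost(10, 1e9, 4, 8) < tbscan_cost(10, 1e9, 4, 4)
```

Knowing the subset of nodes the data resides on is what lets the optimizer divide the I/O term correctly rather than assuming uniform placement.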
19. New Access Plans
We can access a Hadoop table as:
► "Scattered" partitioned:
  • Only accesses data local to the node
► Replicated:
  • Accesses local and remote data
    – The optimizer could also use a broadcast table queue
    – The HDFS shared file system provides the replication
The data is not hash partitioned on a particular column (aka "scattered partitioned"), and a new parallel join strategy is introduced.
20. Parallel Join Strategies
Replicated vs. broadcast join
► All tables are "scatter" partitioned
► Join predicate: STORE.STOREKEY = DAILY_SALES.STOREKEY
► Replicate the smaller table to the partitions of the larger table using either:
  • A broadcast table queue, or
  • A replicated HDFS scan
A table queue represents communication between nodes or subagents.
[Diagram: STORE scanned and sent to every partition of DAILY_SALES, via a broadcast table queue or a replicated scan, then joined]
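The replicated/broadcast strategy can be modeled with a toy hash join (helper names and data are invented): the small STORE table is copied to every partition of DAILY_SALES, so each node joins its local sales fragment without moving the large table.

```python
# Toy model of a broadcast join over "scatter" partitioned data.
def broadcast_join(small_table, big_partitions, key_small, key_big):
    # Build a hash table on the small table once; ship it to every partition.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key_small], []).append(row)
    results = []
    for partition in big_partitions:       # runs on each node in parallel
        for row in partition:
            for match in lookup.get(row[key_big], []):
                results.append((match, row))
    return results

store = [{"storekey": 1, "name": "A"}, {"storekey": 2, "name": "B"}]
sales = [[{"storekey": 1, "amt": 10}],   # three "scattered" partitions
         [{"storekey": 2, "amt": 20}],
         [{"storekey": 1, "amt": 5}]]
joined = broadcast_join(store, sales, "storekey", "storekey")
assert len(joined) == 3
```

This is the right choice exactly when one side is small enough that shipping it everywhere is cheaper than moving the big side.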
21. Parallel Join Strategies
Repartitioned join
► All tables are "scatter" partitioned
► Join predicate: DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY
► Both tables are large
  • Too expensive to broadcast or replicate either
  • Repartition both tables on the join columns
  • Use directed table queues (DTQs)
[Diagram: DAILY_FORECAST and DAILY_SALES each scanned and repartitioned on STOREKEY through directed table queues, then joined]
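The repartitioned strategy can likewise be sketched with toy code (names and data invented): both large tables are hashed on the join key into the same number of buckets, playing the role of the directed table queues, so matching keys always meet in the same bucket.

```python
# Toy sketch of a repartitioned (shuffle) join via hash partitioning.
def repartition(rows, key, n):
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets

def repartitioned_join(left, right, key, n=4):
    lparts = repartition(left, key, n)
    rparts = repartition(right, key, n)
    out = []
    for lp, rp in zip(lparts, rparts):     # each bucket pair on one node
        index = {}
        for row in lp:
            index.setdefault(row[key], []).append(row)
        for row in rp:
            out.extend((m, row) for m in index.get(row[key], []))
    return out

forecast = [{"storekey": k, "fcst": k * 10} for k in range(6)]
sales = [{"storekey": k % 3, "amt": k} for k in range(9)]
joined = repartitioned_join(forecast, sales, "storekey")
assert len(joined) == 9  # every sales row matches exactly one forecast row
```

Repartitioning moves each table once, which beats broadcasting when both sides are large.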
22. Future Challenges
► The challenges never end!
  • That's what makes this job fun!
  • The Hadoop ecosystem continues to expand
  • New storage techniques, indexing techniques, etc.
► Here are a few areas we're exploring…
23. Future Challenges
► Dynamic split allocation
  • React to competing workloads
  • If one node is slow, hand the work you would have given it to another node
► More pushdown!
  • Currently we push projection/selection down
  • Should we push more advanced operations? Aggregation? Joins?
► Join co-location
  • Perform co-located joins when tables are partitioned on the same join key
► Explicit MapReduce-style parallelism ("SQL MR")
  • Expand SQL to explicitly perform partitioned operations
24. Queries?
(Optimized, of course)
Try Big SQL 3.0 Beta on the cloud!
https://bigsql.imdemocloud.com/
Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM
Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri
Editor's Notes
• Query rewrite: rewriting a given SQL query into a semantically equivalent form that may be processed more efficiently.
• MFVs and histograms obtain better selectivity estimates for range predicates over data that is non-uniformly distributed.
• Stats are stored in the Hive metastore for the stats Hive currently supports, and in our internal catalog tables for all of them.
• Min/max are kept in Hive only for a subset of types.
• Average column value length is kept in Hive only for strings.
• Column and table stats are collected together.
• Next: automatic stats collection.