The Challenges of SQL on Hadoop

© 2015 IBM CorporationHadoop Summit – San Jose 2015
Challenges of SQL on Hadoop
A story from the trenches
Scott C. Gray (sgray@us.ibm.com)
Senior Architect and STSM, Big SQL, Big Data Open Source

© 2015 IBM Corporation2 Hadoop Summit – San Jose, CA – June 2015
Why SQL on Hadoop?
 Why are you even asking? This should be
obvious by now! :)
 Hadoop is designed for any data
 Doesn't impose any structure
 Extremely flexible
 At lowest levels is API based
 Requires strong programming expertise
 Steep learning curve
 Even simple operations can be tedious
 Why not use SQL in places its
strengths shine?
 Familiar widely used syntax
 Separation of what you want vs. how to get it
 Robust ecosystem of tools

SQL Engines Everywhere!
 SQL engines are springing up everywhere and maturing at an incredible pace!
 In some cases (certainly not all), the richness of SQL in these engines matches or
surpasses that of traditional data warehouses
 <ShamelessPlug>e.g. IBM’s Big SQL</ShamelessPlug>
 Robust SQL plus inexpensive, easily expandable and reliable clustering leads to a deep
burning desire…
IBM Big SQL
SQL

The Data Model Plop
 Ditch that pricy data warehouse and plop it on Hadoop!
 Problem solved, right?
database
partition
database
partition
database
partition
database
partition
$$$ ¢¢¢

 Your plan may just work! Or…it may not
 Even moving your application from one traditional data warehouse to another requires:
 Planning
 Tuning
 An intimate understanding of architectural differences between products
 Hadoop’s architecture adds another level of potential impedance mismatch that needs to be
considered as well..
Whoa…Hold On There, Buckaroo!
database
partition
database
partition
database
partition
database
partition
$$$
¢¢¢
?

Challenges of SQL on Hadoop
 Hadoop’s architecture presents significant challenges to matching the functionality of a
traditional data warehouse
 Everything is a tradeoff though!
 Hadoop’s architecture helps solve problems that have challenged data warehouses
 It opens data processing well beyond just relational
 This presentation will discuss a (very small) subset of the following challenges and ways in
which some projects are addressing them
 Data placement
 Indexing
 Data manipulation
 File formats out the wazoo (that’s a technical term)
 Caching (buffer pools)
 Optimization and data ownership
 Security
 Competing workloads

Disclaimer(s)
 This presentation is not designed to scare you or dishearten you
 Only to educate you and make you think and plan
 And to teach you about the bleeding edge technologies that will solve all your problems!
 I work on one of these SQL engines (IBM’s Big SQL)
 I wouldn’t be doing so if it weren’t solving real problems for real customers
 There are a LOT of technologies out there
 I haven’t used all of them, and I’m sure I’m missing at least one of your favorite. Sorry.
 Call me out when I’m wrong. I want to learn too!

Data Placement
 Most DW’s rely heavily on controlled data placement
 Data is explicitly partitioned across the cluster
 A particular node “owns” a known subset of data
 Partitioning tables on the same key(s) and on the same
nodes allows for co-located processing
 The fundamental design of HDFS explicitly implements
“random” data placement
 No matter which node writes a block there is no
guarantee a copy will live on that node
 Rebalancing HDFS can move blocks around
 So, no co-located processing without bending over
backwards (more on this later)
Partition A
T1 T2
Partition B
T1 T2
Partition C
T1 T2
Query
Coordinator
HDFS

Query Processing Without Data Placement
 Without co-location the options for join processing are limited
 Redistribution Join
 DB engines read and filter “local” blocks for each table
 Records with the same key are shipped to the same
node to be joined
 In the worst case both joined tables are moved in their entirety!
 Doesn’t really work well for non-equijoins (!=, <, >, etc.)
 Hash Join
 Smaller, or heavily filtered, tables are shipped to all
other nodes
 An in memory hash table is used for very fast joins
 Can still lead to a lot of network to move the small table
 Tricks like bloom filters can help optimize these types of joins
T1
T1
DB
Engine
T1DB
Engine
T2
DB
Engine T2
DB
Engine
DB
Engine
DB
Engine
DB
Engine
Broadcast Join
T1
T1
DB
Engine
T1DB
Engine
T2
DB
Engine
Hash Join
T2 T2

Data Placement – Explicit Placement Policies
 HDFS
 HDFS has supported a pluggable data placement policy for some time now
 This could be used to keep blocks for specific tables “together”
 HDFS doesn’t know the data in the blocks, so it would be an “all or nothing” policy
• A full copy of both tables together a given host
• Can be more granular by placing hive-style partitions together (next slide)
 What do you do when a host “fills up”?
 I’m not aware of any SQL engine that leverages this feature now
 HBase (e.g. HBASE-10576)
 HBase today takes advantage of HDFS write behavior such that table regions are “most likely” local
 There are projects underway to cause the HBase balancer to split tables (regions) together
 This nicely solves the problem of a host “filling up”
 Obviously, this is restricted to HBase storage only

Data Placement – Partitioning Without Placement
 Without explicit placement, the next best
thing is reducing the amount of data to
be scanned
 In Hive, “partitioning” allows for subdividing
data by the value in a set of columns
 Queries only access the directories
required to satisfy the query
 Typically cannot be taken advantage of when
joining on the partitioning column
 Scanning a lot of partitions can be
quite expensive!
 Other platforms, like Jethrodata, similarly allow for range partitioning
 Allows for more control over the number of directories/data

Avoid-The-Join: Nested Data Types
 One way to avoid the cost of joins is to physically
nest related data
 E.g. store data as nested JSON, AVRO, etc.
 Each department row contains all employees
 Apache Drill allows this with no schema provided!
 Impala is adding language support to simplify
such queries
 An ARRAY-of-STRUCT implicitly treated as a table
 Aggregates an be applied to arrays
 Dynamic JSON schema discovery
CREATE HADOOP TABLE DEPARTMENT
(
DEPT_ID INT NOT NULL,
DEPT_NAME VARCHAR(30) NOT NULL,
...
EMPLOYEES ARRAY<STRUCT<
EMP_ID:INT,
EMP_NAME: VARCHAR(30),
SALARY DECIMAL(10,2)
...
>>
)
ROW FORMAT SERDE ‘com.myco.MyJsonSerDe’
SELECT D.DEPT_NAME, SUM(E.SALARY)
FROM DEPARTMENT D,
UNNEST(D.EMPLOYEES) AS E
Big SQL Example
SELECT DEPT_NAME, SUM(E.SALARY)
FROM
(SELECT D.DEPT_NAME, FLATTEN(D.EMPLOYEES) E
FROM `myfile.json` D)

Avoid-The-Join: The Gotchas
 While avoiding the cost of the joins, nested data types have some downsides:
 The row size can become very large
 Most storage formats must completely read the entire row even when the complex
column is not being used
 You are no longer relational!
• Becomes expensive to slice the data another way

Indexing, It’s a Challenge!
 HDFS’ random block placement is problematic for traditional indexing
 An index is typically just a data file organized by indexed columns
 Each block in the index file will, of course, be randomly scattered
 Each index entry will point to data in the base data, which is
ALSO randomly scattered!
 This sort of “global” index will work for smaller point or range queries
 Network I/O costs grow as the scan range increases on the index
 Many SQL engines allow users to just drop data files into a directory to make it available
 How does the index know it needs to be updated?
D
D
D
I
D
D = Data I = Index
D
I
D I
I
D
D
D

Indexing and Hadoop Legacy
 Hive-derived database engines use standard Hadoop classes to allow access to any data
 InputFormat – Used to interpret, split, and read a given file type
 OutputFormat – Used to write data into a given file type
 This interfaces are great!
 They were established with the very first version of Hadoop (MapReduce, specifically)
 They are ubiquitous
 You can turn literally any file format into a table!
 But…the interface lacks any kind of “seek” operation!
 A feature necessary to implement an index

Hive “Indexes”
 Hive has supported indexing since 0.8, but they are
barely indexes in the traditional sense
 They are limited in utility
 No other Hive derived SQL solution uses them
 The index table contains
 One row for each [index-values,blockoffset] pair
 A set of bit for each row in the block (1 = a row contains
the indexed columns)
 This sort of index is useful for
 Indexing any file type, regardless of format
 Skipping base table blocks that don’t contain matching values
 Avoiding interpretation of data in rows that don’t match the index
 You still have to read each matching block in its entirety (up to the last “1”)
CREATE INDEX IDX1 ON T1 (A, B)
ROW FORMAT DELIMITED
A B Block
Offset
Bits
CA San Jose 6371541 011010010000…
CA San Jose 4718461 110100000111…
CA Berkeley 1747665 110000000011…
NY New York 1888828 1111111100001…

Block Level Indexing and Synopsis
 The latest trend in indexing is with “smarter” file formats
 Exemplified by Parquet and ORC
 These formats typically
 Store data in a compressed columnar(-ish) format
 Store indexing and/or statistical information within each block
 Can be configured with search criteria prior to reading data
 Index and data are intimately tied together and always in sync
 Optimizations include
 Skipping of blocks that do not match your search criteria
 Quickly seeking within a block to data matching your search criteria
 You still have to at least “peek” at every block
 Fetching a single row out of a billion will still take some time
Parquet
ORC

Indexing in HBase
 All HBase tables are inherently partitioned and indexed on row key
 Provides near-RDBMS levels of performance for fetching on row key (yay!!)
 At a non-negligible cost in writes due to index maintenance (boo!!)
 And requires persistent servers (memory, CPU) instead of just simple flat files
 Today HBase has no native secondary index support (it’s coming!)
 But there are many solutions that will provide them for you….

 Most secondary index solutions for HBase store the index in another HBase table
 The index is automatically maintained via HBase co-processors (kind of like triggers)
 There is a measurable cost to index maintenance
 Bulk load bypasses co-processors (indexes aren’t maintained)
 Big SQL is exploring using co-processors to store index data outside of HBase
 E.g. using a Lucene index
 Stored locally with each region server
Secondary Indexes in HBase
Row Key C1 C2 C3
12345 Frank Martin 44
12346 Mary Lee 22
T1
Row Key Pointer
Martin|44 12345
Lee|22 12346
CREATE INDEX IDX1
ON T1 (C2, C3)
T1_IDX1

Secondary Indexes in HBase
 Global Index (e.g. Phoenix, Big SQL)
 Index table is maintained like a regular HBase
table (regions are randomly scattered)
 Index data likely not co-located with base data
 Good for point or small scan queries
 Suffers from “network storm” during large index
scans
Region Server
T1
Region A
T1_IDX1
Region B
Region Server
T1
Region B
Region Server
T1_IDX1
Region A
 Local Index (e.g. Phoenix)
 Custom HBase balance ensures index data is
co-located with base data
• There is a small chance that it will be remote
 No network hop to go from index to base data
 BUT, for a given index key all index region
servers must be polled
• Potentially more expensive for single row
lookups
Region Server
T1
Region A
Region Server
T1
Region B
T1_IDX1
Region A
T1_IDX1
Region B

Data Manipulation (DML) In a Non-Updateable World
 Another “gotcha” of HDFS is that it only supports write and append
 Modifying data is difficult without the ability to update data!
 Variable block length and block append (HDFS-3689) may allow some crude
modification features
 As a result, you’ll notice very few SQL solutions support DML operations
 Those that do support it have to bend over backwards to accommodate the file system
 Modifications are logged next to original data
 Reads of original data are merged

Hive ACID Tables
 Hive 0.14 introduced “ACID” tables (which
aren’t quite ACID yet!)
 Modifications are logged, in row order,
next to the base data files
 During read, delta file changes are
“merged” into the base data
 Minor compaction process merges delta
files together, major re-builds base data
 Not suitable for OLTP
 Single row update still scans all base
data and produces one delta file

Data Modification in HBase
 HBase takes a similar approach
 Changes are logged to the write-ahead-log (WAL)
 Final view of the row is cached in memory
 Base data (HFILES) are periodically rebuilt by merging
changes
 HBase achieves OLTP levels of performance by caching changes in memory
 Hive supports “UPSERT” semantics
 It is still difficult (costly) to implement SQL UPDATE semantics
Region Server
Write
Ahead
Log
(WAL)
Cache
HFILE
HFILE
HFILE

Oh, There’s So Much More!!
 Other areas and technologies I would have liked to have covered:
 File formats out the wazoo
• Trade-offs in a propriety format vs. being completely agnostic
 Caching
• How file formats and compression make efficient caching difficult
 Schema-discovery and schema-less querying
• What if the data doesn’t have a rigid schema? (Hint: Drill It)
 Optimization and data ownership
• How do you optimize a query if you have no statistics? (dynamic vs. static optimization)
 Security
• Sharing the raw data with tools outside of the database leads to security model mismatches
 Competing workloads
• How does the database deal with competing workloads from other Hadoop tools?
“Sir Not-Appearing-In-This-Film”

Thank You!
 Thanks for putting up with me
 Queries? (Optimized of course!)

The Challenges of SQL on Hadoop

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a The Challenges of SQL on Hadoop

Semelhante a The Challenges of SQL on Hadoop (20)

Mais de DataWorks Summit

Mais de DataWorks Summit (20)

Último

Último (20)

The Challenges of SQL on Hadoop