© 2015 IBM Corporation
Hadoop Summit – San Jose 2015
Challenges of SQL on Hadoop
A story from the trenches
Scott C. Gray (sgray@us.ibm.com)
Senior Architect and STSM, Big SQL, Big Data Open Source
Why SQL on Hadoop?
 Why are you even asking? This should be
obvious by now! :)
 Hadoop is designed for any data
 Doesn't impose any structure
 Extremely flexible
 At lowest levels is API based
 Requires strong programming expertise
 Steep learning curve
 Even simple operations can be tedious
 Why not use SQL where its strengths shine?
 Familiar, widely used syntax
 Separation of what you want vs. how to get it
 Robust ecosystem of tools
SQL Engines Everywhere!
 SQL engines are springing up everywhere and maturing at an incredible pace!
 In some cases (certainly not all), the richness of SQL in these engines matches or
surpasses that of traditional data warehouses
 <ShamelessPlug>e.g. IBM’s Big SQL</ShamelessPlug>
 Robust SQL plus inexpensive, easily expandable and reliable clustering leads to a deep
burning desire…
The Data Model Plop
 Ditch that pricy data warehouse and plop it on Hadoop!
 Problem solved, right?
[Diagram: an expensive ($$$) warehouse of database partitions being plopped onto an inexpensive (¢¢¢) Hadoop cluster]
Whoa…Hold On There, Buckaroo!
 Your plan may just work! Or…it may not
 Even moving your application from one traditional data warehouse to another requires:
 Planning
 Tuning
 An intimate understanding of architectural differences between products
 Hadoop’s architecture adds another level of potential impedance mismatch that needs to be considered as well.
[Diagram: the same warehouse-to-Hadoop move ($$$ to ¢¢¢), now with a question mark over it]
Challenges of SQL on Hadoop
 Hadoop’s architecture presents significant challenges to matching the functionality of a
traditional data warehouse
 Everything is a tradeoff though!
 Hadoop’s architecture helps solve problems that have challenged data warehouses
 It opens data processing well beyond just relational
 This presentation will discuss a (very small) subset of the following challenges and ways in
which some projects are addressing them
 Data placement
 Indexing
 Data manipulation
 File formats out the wazoo (that’s a technical term)
 Caching (buffer pools)
 Optimization and data ownership
 Security
 Competing workloads
Disclaimer(s)
 This presentation is not designed to scare you or dishearten you
 Only to educate you and make you think and plan
 And to teach you about the bleeding edge technologies that will solve all your problems!
 I work on one of these SQL engines (IBM’s Big SQL)
 I wouldn’t be doing so if it weren’t solving real problems for real customers
 There are a LOT of technologies out there
 I haven’t used all of them, and I’m sure I’m missing at least one of your favorites. Sorry.
 Call me out when I’m wrong. I want to learn too!
Data Placement
 Most DW’s rely heavily on controlled data placement
 Data is explicitly partitioned across the cluster
 A particular node “owns” a known subset of data
 Partitioning tables on the same key(s) and on the same
nodes allows for co-located processing
 The fundamental design of HDFS explicitly implements
“random” data placement
 No matter which node writes a block there is no
guarantee a copy will live on that node
 Rebalancing HDFS can move blocks around
 So, no co-located processing without bending over
backwards (more on this later)
[Diagram: a query coordinator over Partitions A, B, and C, each owning its own slice of tables T1 and T2, contrasted with HDFS]
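(Not from the original deck.) For contrast, this is roughly how a shared-nothing warehouse declares explicit, co-located placement; a minimal sketch in DB2/Big SQL-style DDL with invented table and column names:

CREATE TABLE ORDERS (
  ORDER_ID BIGINT NOT NULL,
  CUST_ID  BIGINT NOT NULL,
  AMOUNT   DECIMAL(10,2)
)
DISTRIBUTE BY HASH (CUST_ID);   -- every node owns a known hash range of CUST_ID

CREATE TABLE CUSTOMERS (
  CUST_ID BIGINT NOT NULL,
  NAME    VARCHAR(60)
)
DISTRIBUTE BY HASH (CUST_ID);   -- same key, same placement as ORDERS

-- A join on CUST_ID can now run entirely node-locally (a co-located join)
SELECT C.NAME, SUM(O.AMOUNT)
FROM ORDERS O
JOIN CUSTOMERS C ON O.CUST_ID = C.CUST_ID
GROUP BY C.NAME;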
Query Processing Without Data Placement
 Without co-location the options for join processing are limited
 Redistribution Join
 DB engines read and filter “local” blocks for each table
 Records with the same key are shipped to the same
node to be joined
 In the worst case both joined tables are moved in their entirety!
 Doesn’t really work well for non-equijoins (!=, <, >, etc.)
 Hash Join
 Smaller, or heavily filtered, tables are shipped to all
other nodes
 An in memory hash table is used for very fast joins
 Can still lead to a lot of network traffic to move the small table
 Tricks like bloom filters can help optimize these types of joins
[Diagrams: Broadcast Join and Hash Join data flows, showing tables T1 and T2 moving between DB engines]
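(Not from the original deck.) As a hedged sketch, Hive exposes the broadcast strategy as a "map join", either via an explicit hint or by letting the optimizer convert small-table joins automatically; table names are invented:

-- Explicit hint: broadcast the small dimension table to every node
SELECT /*+ MAPJOIN(d) */ d.store_name, SUM(f.amount)
FROM sales_fact f
JOIN dim_store d ON f.store_id = d.store_id
GROUP BY d.store_name;

-- Or let Hive decide based on table size
SET hive.auto.convert.join = true;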
Data Placement – Explicit Placement Policies
 HDFS
 HDFS has supported a pluggable data placement policy for some time now
 This could be used to keep blocks for specific tables “together”
 HDFS doesn’t know the data in the blocks, so it would be an “all or nothing” policy
• A full copy of both tables lives together on a given host
• Can be more granular by placing hive-style partitions together (next slide)
 What do you do when a host “fills up”?
 I’m not aware of any SQL engine that leverages this feature now
 HBase (e.g. HBASE-10576)
 HBase today takes advantage of HDFS write behavior such that table regions are “most likely” local
 There are projects underway to make the HBase balancer co-locate regions of related tables
 This nicely solves the problem of a host “filling up”
 Obviously, this is restricted to HBase storage only
Data Placement – Partitioning Without Placement
 Without explicit placement, the next best
thing is reducing the amount of data to
be scanned
 In Hive, “partitioning” allows for subdividing
data by the value in a set of columns
 Queries only access the directories
required to satisfy the query
 Typically cannot be taken advantage of when
joining on the partitioning column
 Scanning a lot of partitions can be
quite expensive!
 Other platforms, like JethroData, similarly allow for range partitioning
 Allows for more control over the number of directories/data
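(Not from the original deck.) A minimal sketch of Hive-style partitioning and the pruning it enables; table, column, and partition values are invented:

CREATE TABLE sales (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING, region STRING)
STORED AS PARQUET;

-- Only the sale_date=2015-06-09/region=WEST directory is scanned
SELECT SUM(amount)
FROM sales
WHERE sale_date = '2015-06-09'
  AND region = 'WEST';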
Avoid-The-Join: Nested Data Types
 One way to avoid the cost of joins is to physically
nest related data
 E.g. store data as nested JSON, AVRO, etc.
 Each department row contains all employees
 Apache Drill allows this with no schema provided!
 Impala is adding language support to simplify
such queries
 An ARRAY-of-STRUCT implicitly treated as a table
 Aggregates can be applied to arrays
 Dynamic JSON schema discovery
Big SQL example:

CREATE HADOOP TABLE DEPARTMENT
(
  DEPT_ID INT NOT NULL,
  DEPT_NAME VARCHAR(30) NOT NULL,
  ...
  EMPLOYEES ARRAY<STRUCT<
    EMP_ID: INT,
    EMP_NAME: VARCHAR(30),
    SALARY: DECIMAL(10,2)
    ...
  >>
)
ROW FORMAT SERDE 'com.myco.MyJsonSerDe'

SELECT D.DEPT_NAME, SUM(E.SALARY)
FROM DEPARTMENT D,
     UNNEST(D.EMPLOYEES) AS E
GROUP BY D.DEPT_NAME

Apache Drill example (no schema required):

SELECT DEPT_NAME, SUM(E.SALARY)
FROM
  (SELECT D.DEPT_NAME, FLATTEN(D.EMPLOYEES) E
   FROM `myfile.json` D)
GROUP BY DEPT_NAME
Avoid-The-Join: The Gotchas
 While avoiding the cost of the joins, nested data types have some downsides:
 The row size can become very large
 Most storage formats must completely read the entire row even when the complex
column is not being used
 You are no longer relational!
• Becomes expensive to slice the data another way
Indexing, It’s a Challenge!
 HDFS’ random block placement is problematic for traditional indexing
 An index is typically just a data file organized by indexed columns
 Each block in the index file will, of course, be randomly scattered
 Each index entry will point to data in the base data, which is
ALSO randomly scattered!
 This sort of “global” index will work for smaller point or range queries
 Network I/O costs grow as the scan range increases on the index
 Many SQL engines allow users to just drop data files into a directory to make it available
 How does the index know it needs to be updated?
[Diagram: data (D) and index (I) blocks randomly scattered across the cluster]
Indexing and Hadoop Legacy
 Hive-derived database engines use standard Hadoop classes to allow access to any data
 InputFormat – Used to interpret, split, and read a given file type
 OutputFormat – Used to write data into a given file type
 These interfaces are great!
 They were established with the very first version of Hadoop (MapReduce, specifically)
 They are ubiquitous
 You can turn literally any file format into a table!
 But…the interface lacks any kind of “seek” operation!
 A feature necessary to implement an index
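(Not from the original deck.) This flexibility is what Hive-style DDL exposes directly: a table can be declared over any InputFormat/OutputFormat pair. A minimal sketch using the stock text classes, with an invented table and path:

CREATE EXTERNAL TABLE weblogs (line STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/weblogs';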
Hive “Indexes”
 Hive has supported indexes since 0.8, but they are barely indexes in the traditional sense
 They are limited in utility
 No other Hive-derived SQL solution uses them
 The index table contains
 One row for each [index-values,blockoffset] pair
 A set of bits, one per row in the block (1 = the row contains the indexed values)
 This sort of index is useful for
 Indexing any file type, regardless of format
 Skipping base table blocks that don’t contain matching values
 Avoiding interpretation of data in rows that don’t match the index
 You still have to read each matching block in its entirety (up to the last “1”)
CREATE INDEX IDX1 ON T1 (A, B)
ROW FORMAT DELIMITED
A  | B        | Block Offset | Bits
CA | San Jose | 6371541      | 011010010000…
CA | San Jose | 4718461      | 110100000111…
CA | Berkeley | 1747665      | 110000000011…
NY | New York | 1888828      | 1111111100001…
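(Not from the original deck.) A hedged sketch of the fuller Hive DDL behind the slide's shorthand; exact handler names and options vary a bit by Hive version:

CREATE INDEX IDX1 ON TABLE T1 (A, B)
AS 'BITMAP'              -- or 'COMPACT'; shorthand for the built-in index handler classes
WITH DEFERRED REBUILD;   -- the index table starts empty

-- Hive does not notice files dropped directly into the table directory;
-- the index must be rebuilt explicitly to pick them up
ALTER INDEX IDX1 ON T1 REBUILD;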
Block Level Indexing and Synopsis
 The latest trend in indexing is with “smarter” file formats
 Exemplified by Parquet and ORC
 These formats typically
 Store data in a compressed columnar(-ish) format
 Store indexing and/or statistical information within each block
 Can be configured with search criteria prior to reading data
 Index and data are intimately tied together and always in sync
 Optimizations include
 Skipping of blocks that do not match your search criteria
 Quickly seeking within a block to data matching your search criteria
 You still have to at least “peek” at every block
 Fetching a single row out of a billion will still take some time
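(Not from the original deck.) A hedged example of opting into these block-level statistics with ORC in Hive; the property and setting names are standard ORC/Hive ones, but defaults differ across versions, and the table is invented:

CREATE TABLE clicks (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index' = 'true',             -- min/max statistics per row group
  'orc.bloom.filter.columns' = 'user_id'   -- bloom filters for point lookups
);

-- Enable predicate pushdown so non-matching stripes/row groups are skipped
SET hive.optimize.index.filter = true;
SELECT url FROM clicks WHERE user_id = 42;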
Indexing in HBase
 All HBase tables are inherently partitioned and indexed on row key
 Provides near-RDBMS levels of performance for fetching on row key (yay!!)
 At a non-negligible cost in writes due to index maintenance (boo!!)
 And requires persistent servers (memory, CPU) instead of just simple flat files
 Today HBase has no native secondary index support (it’s coming!)
 But there are many solutions that will provide them for you….
Secondary Indexes in HBase
 Most secondary index solutions for HBase store the index in another HBase table
 The index is automatically maintained via HBase co-processors (kind of like triggers)
 There is a measurable cost to index maintenance
 Bulk load bypasses co-processors (indexes aren’t maintained)
 Big SQL is exploring using co-processors to store index data outside of HBase
 E.g. using a Lucene index
 Stored locally with each region server
T1:
Row Key | C1    | C2     | C3
12345   | Frank | Martin | 44
12346   | Mary  | Lee    | 22

CREATE INDEX IDX1 ON T1 (C2, C3)

T1_IDX1:
Row Key   | Pointer
Martin|44 | 12345
Lee|22    | 12346
Secondary Indexes in HBase
 Global Index (e.g. Phoenix, Big SQL)
 Index table is maintained like a regular HBase
table (regions are randomly scattered)
 Index data likely not co-located with base data
 Good for point or small scan queries
 Suffers from “network storm” during large index
scans
[Diagram: global index, where T1 and T1_IDX1 regions live on different region servers]
 Local Index (e.g. Phoenix)
 Custom HBase balancer ensures index data is co-located with base data
• There is a small chance that it will be remote
 No network hop to go from index to base data
 BUT, for a given index key all index region
servers must be polled
• Potentially more expensive for single row
lookups
[Diagram: local index, where each region server hosts the T1_IDX1 region alongside the matching T1 region]
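(Not from the original deck.) For reference, Apache Phoenix expresses the two flavors directly; a sketch reusing the T1 example from the previous slide:

-- Global index: a separate HBase table, regions placed independently of T1
CREATE INDEX T1_IDX1 ON T1 (C2, C3);

-- Local index: index data kept with the region server that holds the base rows
CREATE LOCAL INDEX T1_IDX2 ON T1 (C2, C3);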
Data Manipulation (DML) In a Non-Updateable World
 Another “gotcha” of HDFS is that it only supports write and append
 Modifying data is difficult without the ability to update data!
 Variable block length and block append (HDFS-3689) may allow some crude
modification features
 As a result, you’ll notice very few SQL solutions support DML operations
 Those that do support it have to bend over backwards to accommodate the file system
 Modifications are logged next to original data
 Reads of original data are merged
Hive ACID Tables
 Hive 0.14 introduced “ACID” tables (which
aren’t quite ACID yet!)
 Modifications are logged, in row order,
next to the base data files
 During read, delta file changes are
“merged” into the base data
 Minor compaction merges delta files together; major compaction rebuilds the base data
 Not suitable for OLTP
 Single row update still scans all base
data and produces one delta file
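(Not from the original deck.) A hedged sketch of the Hive 0.14-era setup; settings and requirements are as documented for that release, and the table and columns are invented:

SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must be bucketed, stored as ORC, and flagged transactional
CREATE TABLE customers (
  id    INT,
  name  STRING,
  state STRING
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Each statement writes a delta file next to the base data;
-- compactions later fold the deltas back into the base
UPDATE customers SET state = 'CA' WHERE id = 42;
DELETE FROM customers WHERE id = 43;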
Data Modification in HBase
 HBase takes a similar approach
 Changes are logged to the write-ahead-log (WAL)
 Final view of the row is cached in memory
 Base data (HFILES) are periodically rebuilt by merging
changes
 HBase achieves OLTP levels of performance by caching changes in memory
 HBase supports “UPSERT” semantics
 It is still difficult (costly) to implement SQL UPDATE semantics
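(Not from the original deck.) A hedged illustration of the distinction using Apache Phoenix, whose DML maps directly onto HBase puts; column names are adapted from the earlier T1 example and are illustrative only:

-- UPSERT writes the row whether or not it already exists: a single cheap put
UPSERT INTO T1 (ROW_KEY, C1, C2, C3) VALUES ('12347', 'Ann', 'Chen', 31);

-- Classic SQL UPDATE semantics (read the row, modify it, write it back, e.g.
-- SET C3 = C3 + 1) need a read-modify-write cycle that HBase does not give
-- you for free, which is why engines surface UPSERT instead.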
[Diagram: region server with a write-ahead log (WAL), an in-memory cache, and HFILEs]
Oh, There’s So Much More!!
 Other areas and technologies I would have liked to have covered:
 File formats out the wazoo
• Trade-offs in a proprietary format vs. being completely agnostic
 Caching
• How file formats and compression make efficient caching difficult
 Schema-discovery and schema-less querying
• What if the data doesn’t have a rigid schema? (Hint: Drill It)
 Optimization and data ownership
• How do you optimize a query if you have no statistics? (dynamic vs. static optimization)
 Security
• Sharing the raw data with tools outside of the database leads to security model mismatches
 Competing workloads
• How does the database deal with competing workloads from other Hadoop tools?
“Sir Not-Appearing-In-This-Film”
Thank You!
 Thanks for putting up with me
 Queries? (Optimized of course!)
Mais conteúdo relacionado

Mais procurados

YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionYARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionDataWorks Summit
 
Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2DataWorks Summit
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdIBM Analytics
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
The Future of Hadoop Security
The Future of Hadoop SecurityThe Future of Hadoop Security
The Future of Hadoop SecurityDataWorks Summit
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paperJethroData
 

Mais procurados (20)

YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop AdoptionYARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
YARN: the Key to overcoming the challenges of broad-based Hadoop Adoption
 
Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
The Future of Hadoop Security
The Future of Hadoop SecurityThe Future of Hadoop Security
The Future of Hadoop Security
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paper
 

Destaque

Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBRadenko Zec
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetupnvvrajesh
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...DataWorks Summit
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Gruter
 
Building Hadoop with Chef
Building Hadoop with ChefBuilding Hadoop with Chef
Building Hadoop with ChefJohn Martin
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon AthenaJulien SIMON
 
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Matthew (정재화)
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenDataWorks Summit
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets DataWorks Summit
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeDataWorks Summit
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentDataWorks Summit
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...DataWorks Summit
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]DataWorks Summit
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)DataWorks Summit
 

Destaque (20)

Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013
 
Building Hadoop with Chef
Building Hadoop with ChefBuilding Hadoop with Chef
Building Hadoop with Chef
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Azure Document Db
Azure Document DbAzure Document Db
Azure Document Db
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
Hadoop과 SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop)
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets
 
Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance Initiative
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
 

Semelhante a The Challenges of SQL on Hadoop

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Yahoo Developer Network
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseJonathan Bloom
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014Stratebi
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 

Semelhante a The Challenges of SQL on Hadoop (20)

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big data and tools
Big data and tools Big data and tools
Big data and tools
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data Warehouse
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Bigdata ppt
Bigdata pptBigdata ppt
Bigdata ppt
 
Bigdata
BigdataBigdata
Bigdata
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Último (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

The Challenges of SQL on Hadoop

  • 1. © 2015 IBM CorporationHadoop Summit – San Jose 2015 Challenges of SQL on Hadoop A story from the trenches Scott C. Gray (sgray@us.ibm.com) Senior Architect and STSM, Big SQL, Big Data Open Source
  • 2. © 2015 IBM Corporation2 Hadoop Summit – San Jose, CA – June 2015 Why SQL on Hadoop?  Why are you even asking? This should be obvious by now! :)  Hadoop is designed for any data  Doesn't impose any structure  Extremely flexible  At lowest levels is API based  Requires strong programming expertise  Steep learning curve  Even simple operations can be tedious  Why not use SQL in places its strengths shine?  Familiar widely used syntax  Separation of what you want vs. how to get it  Robust ecosystem of tools
  • 3. © 2015 IBM Corporation3 Hadoop Summit – San Jose, CA – June 2015 SQL Engines Everywhere!  SQL engines are springing up everywhere and maturing at an incredible pace!  In some cases (certainly not all), the richness of SQL in these engines matches or surpasses that of traditional data warehouses  <ShamelessPlug>e.g. IBM’s Big SQL</ShamelessPlug>  Robust SQL plus inexpensive, easily expandable and reliable clustering leads to a deep burning desire… IBM Big SQL SQL
  • 4. © 2015 IBM Corporation4 Hadoop Summit – San Jose, CA – June 2015 The Data Model Plop  Ditch that pricy data warehouse and plop it on Hadoop!  Problem solved, right? database partition database partition database partition database partition $$$ ¢¢¢
  • 5. © 2015 IBM Corporation5 Hadoop Summit – San Jose, CA – June 2015  Your plan may just work! Or…it may not  Even moving your application from one traditional data warehouse to another requires:  Planning  Tuning  An intimate understanding of architectural differences between products  Hadoop’s architecture adds another level of potential impedance mismatch that needs to be considered as well.. Whoa…Hold On There, Buckaroo! database partition database partition database partition database partition $$$ ¢¢¢ ?
  • 6. © 2015 IBM Corporation6 Hadoop Summit – San Jose, CA – June 2015 Challenges of SQL on Hadoop  Hadoop’s architecture presents significant challenges to matching the functionality of a traditional data warehouse  Everything is a tradeoff though!  Hadoop’s architecture helps solve problems that have challenged data warehouses  It opens data processing well beyond just relational  This presentation will discuss a (very small) subset of the following challenges and ways in which some projects are addressing them  Data placement  Indexing  Data manipulation  File formats out the wazoo (that’s a technical term)  Caching (buffer pools)  Optimization and data ownership  Security  Competing workloads
  • 7. © 2015 IBM Corporation7 Hadoop Summit – San Jose, CA – June 2015 Disclaimer(s)  This presentation is not designed to scare you or dishearten you  Only to educate you and make you think and plan  And to teach you about the bleeding edge technologies that will solve all your problems!  I work on one of these SQL engines (IBM’s Big SQL)  I wouldn’t be doing so if it weren’t solving real problems for real customers  There are a LOT of technologies out there  I haven’t used all of them, and I’m sure I’m missing at least one of your favorite. Sorry.  Call me out when I’m wrong. I want to learn too!
  • 8. Data Placement
     Most DW’s rely heavily on controlled data placement
     Data is explicitly partitioned across the cluster
     A particular node “owns” a known subset of data
     Partitioning tables on the same key(s) and on the same nodes allows for co-located processing
     The fundamental design of HDFS explicitly implements “random” data placement
     No matter which node writes a block there is no guarantee a copy will live on that node
     Rebalancing HDFS can move blocks around
     So, no co-located processing without bending over backwards (more on this later)
    [Slide diagram: T1 and T2 co-partitioned across Partitions A, B, and C under a query coordinator, contrasted with HDFS]
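To make the contrast concrete, here is a minimal sketch, assuming DB2-style DISTRIBUTE BY HASH syntax for the warehouse side and invented table names throughout; other warehouses use similar but not identical clauses:

    -- Warehouse-style explicit placement: both tables hash-distributed on the join
    -- key, so matching rows land on the same database partition (co-located join)
    CREATE TABLE orders    (order_id BIGINT, cust_id BIGINT, amount DECIMAL(10,2)) DISTRIBUTE BY HASH (cust_id);
    CREATE TABLE customers (cust_id BIGINT, name VARCHAR(30))                      DISTRIBUTE BY HASH (cust_id);

    -- Hive has no equivalent clause: CLUSTERED BY only controls bucketing within the
    -- table's files, not which HDFS nodes the underlying blocks end up on
    CREATE TABLE orders_hive (order_id BIGINT, cust_id BIGINT, amount DECIMAL(10,2))
    CLUSTERED BY (cust_id) INTO 16 BUCKETS;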
  • 9. Query Processing Without Data Placement
     Without co-location the options for join processing are limited
     Redistribution Join
       DB engines read and filter “local” blocks for each table
       Records with the same key are shipped to the same node to be joined
       In the worst case both joined tables are moved in their entirety!
       Doesn’t really work well for non-equijoins (!=, <, >, etc.)
     Hash Join (see the hint sketch below)
       The smaller, or heavily filtered, table is shipped to all other nodes
       An in-memory hash table is used for very fast joins
       Can still lead to a lot of network traffic to move the small table
       Tricks like bloom filters can help optimize these types of joins
    [Slide diagrams: data flow for a broadcast join and a hash join across DB engines holding T1 and T2]
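As an illustration (not from the original deck), Hive exposes this choice through its map-join machinery; the MAPJOIN hint and the setting below are standard HiveQL, though the table names are made up:

    -- Ask Hive to broadcast the small dimension table to every node and build an
    -- in-memory hash table there, instead of redistributing both join inputs
    SELECT /*+ MAPJOIN(d) */ f.order_id, d.region_name
    FROM   orders_fact f
    JOIN   region_dim d ON f.region_id = d.region_id;

    -- Newer Hive versions can make the same decision automatically
    SET hive.auto.convert.join=true;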
  • 10. Data Placement – Explicit Placement Policies
     HDFS
       HDFS has supported a pluggable data placement policy for some time now
       This could be used to keep blocks for specific tables “together”
       HDFS doesn’t know the data in the blocks, so it would be an “all or nothing” policy
      • A full copy of both tables together on a given host
      • Can be more granular by placing hive-style partitions together (next slide)
       What do you do when a host “fills up”?
       I’m not aware of any SQL engine that leverages this feature now
     HBase (e.g. HBASE-10576)
       HBase today takes advantage of HDFS write behavior such that table regions are “most likely” local
       There are projects underway to have the HBase balancer keep the regions of related tables together
       This nicely solves the problem of a host “filling up”
       Obviously, this is restricted to HBase storage only
  • 11. Data Placement – Partitioning Without Placement
     Without explicit placement, the next best thing is reducing the amount of data to be scanned
     In Hive, “partitioning” allows for subdividing data by the value in a set of columns
     Queries only access the directories required to satisfy the query
     Typically cannot be taken advantage of when joining on the partitioning column
     Scanning a lot of partitions can be quite expensive!
     Other platforms, like Jethrodata, similarly allow for range partitioning
       Allows for more control over the number of directories/data
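For reference, a minimal sketch of Hive-style partitioning and the pruning it enables; the table and column names are invented, the syntax is standard Hive DDL:

    -- Each distinct sale_date value becomes its own directory under the table path
    CREATE TABLE sales (
      order_id BIGINT,
      amount   DECIMAL(10,2)
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS PARQUET;

    -- Only the sale_date=2015-06-01 directory is read; all other partitions are pruned
    SELECT SUM(amount)
    FROM   sales
    WHERE  sale_date = '2015-06-01';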
  • 12. Avoid-The-Join: Nested Data Types
     One way to avoid the cost of joins is to physically nest related data
     E.g. store data as nested JSON, AVRO, etc.
     Each department row contains all employees
     Apache Drill allows this with no schema provided!
     Impala is adding language support to simplify such queries
     An ARRAY-of-STRUCT is implicitly treated as a table
     Aggregates can be applied to arrays
     Dynamic JSON schema discovery
    Big SQL example:
      CREATE HADOOP TABLE DEPARTMENT (
        DEPT_ID   INT NOT NULL,
        DEPT_NAME VARCHAR(30) NOT NULL,
        ...
        EMPLOYEES ARRAY<STRUCT<
          EMP_ID:   INT,
          EMP_NAME: VARCHAR(30),
          SALARY:   DECIMAL(10,2)
          ...
        >>
      ) ROW FORMAT SERDE ‘com.myco.MyJsonSerDe’

      SELECT D.DEPT_NAME, SUM(E.SALARY)
      FROM DEPARTMENT D, UNNEST(D.EMPLOYEES) AS E
    Drill example (no schema required):
      SELECT DEPT_NAME, SUM(E.SALARY)
      FROM (SELECT D.DEPT_NAME, FLATTEN(D.EMPLOYEES) E
            FROM `myfile.json` D)
  • 13. Avoid-The-Join: The Gotchas
     While avoiding the cost of the joins, nested data types have some downsides:
     The row size can become very large
     Most storage formats must completely read the entire row even when the complex column is not being used
     You are no longer relational!
    • Becomes expensive to slice the data another way
  • 14. Indexing, It’s a Challenge!
     HDFS’ random block placement is problematic for traditional indexing
     An index is typically just a data file organized by indexed columns
     Each block in the index file will, of course, be randomly scattered
     Each index entry will point to data in the base data, which is ALSO randomly scattered!
     This sort of “global” index will work for smaller point or range queries
     Network I/O costs grow as the scan range increases on the index
     Many SQL engines allow users to just drop data files into a directory to make the data available
       How does the index know it needs to be updated?
    [Slide diagram: data (D) and index (I) blocks randomly scattered across the cluster]
  • 15. Indexing and Hadoop Legacy
     Hive-derived database engines use standard Hadoop classes to allow access to any data
       InputFormat – Used to interpret, split, and read a given file type
       OutputFormat – Used to write data into a given file type
     These interfaces are great!
       They were established with the very first version of Hadoop (MapReduce, specifically)
       They are ubiquitous
       You can turn literally any file format into a table!
     But…the interface lacks any kind of “seek” operation!
       A feature necessary to implement an index
  • 16. Hive “Indexes”
     Hive has supported indexing since 0.8, but they are barely indexes in the traditional sense
       They are limited in utility
       No other Hive-derived SQL solution uses them
     The index table contains
       One row for each [index-values, block-offset] pair
       A set of bits for each row in the block (1 = the row contains the indexed values)
     This sort of index is useful for
       Indexing any file type, regardless of format
       Skipping base table blocks that don’t contain matching values
       Avoiding interpretation of data in rows that don’t match the index
     You still have to read each matching block in its entirety (up to the last “1”)
    CREATE INDEX IDX1 ON T1 (A, B) ROW FORMAT DELIMITED
    Example index contents:
      A    B           Block Offset   Bits
      CA   San Jose    6371541        011010010000…
      CA   San Jose    4718461        110100000111…
      CA   Berkeley    1747665        110000000011…
      NY   New York    1888828        1111111100001…
    (A fuller version of this DDL is sketched below.)
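For completeness, a hedged sketch of the surrounding Hive DDL (the BITMAP index handler, DEFERRED REBUILD, and the index-filter setting are standard Hive constructs of that era; the table and columns follow the slide):

    -- Create a bitmap-style index; WITH DEFERRED REBUILD means it stays empty until rebuilt
    CREATE INDEX IDX1 ON TABLE T1 (A, B)
    AS 'BITMAP'
    WITH DEFERRED REBUILD;

    -- Populate (and later refresh) the index table by scanning T1
    ALTER INDEX IDX1 ON T1 REBUILD;

    -- Let the optimizer use the index data to skip non-matching blocks
    SET hive.optimize.index.filter=true;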
  • 17. Block Level Indexing and Synopsis
     The latest trend in indexing is with “smarter” file formats
       Exemplified by Parquet and ORC
     These formats typically
       Store data in a compressed columnar(-ish) format
       Store indexing and/or statistical information within each block
       Can be configured with search criteria prior to reading data
     Index and data are intimately tied together and always in sync
     Optimizations include
       Skipping of blocks that do not match your search criteria
       Quickly seeking within a block to data matching your search criteria
     You still have to at least “peek” at every block
       Fetching a single row out of a billion will still take some time
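A minimal sketch of how this is typically switched on in Hive with ORC; the property and setting names are real ORC/Hive options, the table is invented:

    -- Row-group indexes and bloom filters are written inside each ORC stripe
    CREATE TABLE events_orc (
      user_id  BIGINT,
      event_ts STRING,
      payload  STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.create.index'='true',
                   'orc.bloom.filter.columns'='user_id');

    -- Push the WHERE predicate into the reader so non-matching stripes/row groups are skipped
    SET hive.optimize.index.filter=true;
    SELECT * FROM events_orc WHERE user_id = 12345;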
  • 18. Indexing in HBase
     All HBase tables are inherently partitioned and indexed on row key
     Provides near-RDBMS levels of performance for fetching on row key (yay!!)
     At a non-negligible cost in writes due to index maintenance (boo!!)
     And requires persistent servers (memory, CPU) instead of just simple flat files
     Today HBase has no native secondary index support (it’s coming!)
     But there are many solutions that will provide them for you…
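To illustrate what “indexed on row key” buys you, a hedged Apache Phoenix sketch (Phoenix maps a SQL PRIMARY KEY onto the HBase row key; the table and columns are invented):

    -- The composite PRIMARY KEY becomes the HBase row key
    CREATE TABLE orders (
      cust_id  BIGINT NOT NULL,
      order_ts DATE   NOT NULL,
      amount   DECIMAL(10,2),
      CONSTRAINT pk PRIMARY KEY (cust_id, order_ts)
    );

    SELECT * FROM orders WHERE cust_id = 42;   -- row-key prefix scan: fast
    SELECT * FROM orders WHERE amount > 100;   -- no index on amount: full table scan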
  • 19. Secondary Indexes in HBase
     Most secondary index solutions for HBase store the index in another HBase table
     The index is automatically maintained via HBase co-processors (kind of like triggers)
     There is a measurable cost to index maintenance
     Bulk load bypasses co-processors (indexes aren’t maintained)
     Big SQL is exploring using co-processors to store index data outside of HBase
       E.g. using a Lucene index
       Stored locally with each region server
    CREATE INDEX IDX1 ON T1 (C2, C3)
    T1:
      Row Key   C1      C2       C3
      12345     Frank   Martin   44
      12346     Mary    Lee      22
    T1_IDX1:
      Row Key      Pointer
      Martin|44    12345
      Lee|22       12346
  • 20. Secondary Indexes in HBase
     Global Index (e.g. Phoenix, Big SQL)
       Index table is maintained like a regular HBase table (regions are randomly scattered)
       Index data likely not co-located with base data
       Good for point or small scan queries
       Suffers from “network storm” during large index scans
     Local Index (e.g. Phoenix)
       Custom HBase balancer ensures index data is co-located with base data
      • There is a small chance that it will be remote
       No network hop to go from index to base data
       BUT, for a given index key all index region servers must be polled
      • Potentially more expensive for single-row lookups
    [Slide diagrams: global index with T1 and T1_IDX1 regions on different region servers vs. local index with T1_IDX1 regions co-located on the base table’s region servers]
    (Both flavors are sketched in Phoenix syntax below.)
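A hedged sketch of the two flavors using Apache Phoenix DDL (CREATE INDEX and CREATE LOCAL INDEX are real Phoenix statements; T1 is the table from the previous slide):

    -- Global index: a separate index table whose regions are scattered like any other table
    CREATE INDEX T1_IDX_GLOBAL ON T1 (C2, C3);

    -- Local index: index data is kept alongside the base table's regions, so the hop
    -- from index entry to base row stays on the same region server
    CREATE LOCAL INDEX T1_IDX_LOCAL ON T1 (C2, C3);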
  • 21. Data Manipulation (DML) In a Non-Updateable World
     Another “gotcha” of HDFS is that it only supports write and append
     Modifying data is difficult without the ability to update it in place!
     Variable block length and block append (HDFS-3689) may allow some crude modification features
     As a result, you’ll notice very few SQL solutions support DML operations
     Those that do support it have to bend over backwards to accommodate the file system
       Modifications are logged next to the original data
       Reads merge the original data with the logged modifications
  • 22. Hive ACID Tables
     Hive 0.14 introduced “ACID” tables (which aren’t quite ACID yet!)
     Modifications are logged, in row order, next to the base data files
     During read, delta file changes are “merged” into the base data
     Minor compaction merges delta files together; major compaction rebuilds the base data
     Not suitable for OLTP
       A single-row update still scans all base data and produces one delta file
    (Example DDL and settings are sketched below.)
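For orientation, a hedged sketch of the Hive 0.14-era DDL and settings involved (the configuration names are real Hive settings; the table is invented):

    -- ACID tables require the DbTxnManager and, in this era, ORC storage plus bucketing
    SET hive.support.concurrency=true;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

    CREATE TABLE accounts (
      id      INT,
      balance DECIMAL(10,2)
    )
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Writes a delta file next to the base ORC files; readers merge it in until compaction
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;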
  • 23. Data Modification in HBase
     HBase takes a similar approach
       Changes are logged to the write-ahead log (WAL)
       Final view of the row is cached in memory
       Base data (HFiles) is periodically rebuilt by merging changes
     HBase achieves OLTP levels of performance by caching changes in memory
     HBase supports “UPSERT” semantics (a Phoenix UPSERT sketch follows below)
     It is still difficult (costly) to implement SQL UPDATE semantics
    [Slide diagram: region server with write-ahead log (WAL), in-memory cache, and HFiles]
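As a concrete illustration of that distinction, Apache Phoenix exposes HBase’s write model directly as an UPSERT statement (real Phoenix syntax; the table and column names are adapted from the T1 example on slide 19):

    -- Insert-or-overwrite in one shot: no read of the existing row is needed,
    -- which is essentially what an HBase Put does under the covers
    UPSERT INTO T1 (ROW_KEY, C1, C2, C3) VALUES (12347, 'Ann', 'Chen', 31);

    -- A true SQL UPDATE with a non-key predicate would instead require a scan
    -- plus a read-modify-write cycle, which is where the cost comes in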
  • 24. Oh, There’s So Much More!!
     Other areas and technologies I would have liked to have covered:
     File formats out the wazoo
    • Trade-offs in a proprietary format vs. being completely agnostic
     Caching
    • How file formats and compression make efficient caching difficult
     Schema-discovery and schema-less querying
    • What if the data doesn’t have a rigid schema? (Hint: Drill It)
     Optimization and data ownership
    • How do you optimize a query if you have no statistics? (dynamic vs. static optimization)
     Security
    • Sharing the raw data with tools outside of the database leads to security model mismatches
     Competing workloads
    • How does the database deal with competing workloads from other Hadoop tools?
    “Sir Not-Appearing-In-This-Film”
  • 25. Thank You!
     Thanks for putting up with me
     Queries? (Optimized of course!)