SlideShare uma empresa Scribd logo
1 de 49
Baixar para ler offline
SQL on Hadoop
Raj Nadipalli
rajesh.nadipalli@gmail.com
08.25.2016
Agenda
•  The need for SQL on Hadoop
•  Current popular open source options for SQL on Hadoop
•  Feature review of Hive, SparkSQL, Drill, HAWQ, Phoenix, Splice machine
and Trafodian
•  Q&A
The Need for SQL
Transition to Modern Data Architecture - Legacy
Source: HDInsight Essentials (ISBN 1784396664)
Transition to Modern Data Architecture - New
SQL on Hadoop – the need
•  Hadoop is a fit for ETL offload and/or Data warehouse offload
•  But…it is not a cheap replacement to RDBMS; it’s a new platform
•  SQL-on-Hadoop is an abstraction on HDFS and YARN
•  SQL-on-Hadoop enables ad-hoc analysis on files, ETL and abstractions on
complex data types
•  SQL-on-Hadoop reduces the learning curve for existing DBA’s and developers
SQL on Hadoop – Patterns of Usage
Batch
•  ETL jobs that run typically daily
•  Select, Insert, Update and Delete
•  Latency minutes to hours
Interactive
•  Ad-hoc analysis and queries
•  Typically Select
•  Latency seconds to minutes
Transactio
nal
•  ACID transactional support
•  Select, Insert, Update and Delete
•  Latency milliseconds to second
Batch
Interactive
Transactio
nal
Patterns of Usage & Popular Options
General Architecture of SQL Engines
3 general architectures for SQL Engines
•  Leverage YARN and managed completely by Hadoop. Examples Hive,
SparkSQL
•  Separate process outside Hadoop; does share metadata, HDFS and
infrastructure. Example Hawq, Drill
•  Leverage Hbase Example Phoenix, Trafodian, Splicemachine
Hive Overview
Hive Leveraging YARN
Source: HDInsight Essentials (ISBN 1784396664)
Hive Key Features
•  Standard component in all Hadoop distributions
•  Hcatalog (Hive metadata) is standard metastore for several Hadoop components
•  Data warehouse solution built on Hadoop for providing data summarization.
•  Provides SQL-like query language called HiveQL.
•  Minimal learning curve for SQL developers/business analysts.
•  Ability to bring structure to various formats.
•  Simple interface for ad-hoc query, summarization, joins on big data.
•  Extensible via data types, functions and scripts.
•  JDBC driver to connect with SQL IDE’s like SquirrelSQL, DBeaver
Notable New Features (Hive 1.2)
•  Transactional support (ACID) limited to ORC formats	
•  Row level inserts, updates, or deletes
•  ALTER SCHEMA DDL	
•  ANSI SQL compliance which is not 100%, but getting closer	
•  Cost Based Optimizer	
•  SQL Temporary Tables
WordCount in Hive
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs.txt' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, ’s')) AS word FROM docs) w
GROUP BY word ORDER BY word;
Hive Architecture
But….
•  Latency; queries run typically in minutes and not seconds.
•  HiveQL is not ANSI compliant; RDBMS queries will need rewrite
•  Update, Delete support is limited (Transactions support only works on ORC)
•  No Primary Keys, Foreign Keys, Constraints..
Bottom line… this is not a replacement for transactional databases
SparkSQL Overview
Spark & SparkSQL Key Features
•  Strong community support - currently the most popular Hadoop component (2015,
2016)
•  Can run as YARN application and runs in memory and disk
•  Developers can use Java, Scala, Python, R & SQL
•  Powerful SQL features like Pivot tables, efficient Joins
•  Improved Query performance and caching with Catalyst
•  Claims 100X faster response time over MapReduce in memory, 10X on disk.
•  Supports HiveQL and leverages Hive metastore
•  JDBC, OBDC and beeline support
Spark Architecture (2.0 )
Spark Core (Resilient Distributed Dataset)
Catalyst
SQL Dataframe / Dataset
ML
Pipelines
Structured
Streaming
GraphFrames
Spark SQL
Source: http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
Spark-SQL in action
> ./spark-shell // initialize spark shell
> sqlContext.sql(“CREATE TABLE page_view(viewTime INT, page_url, STRING, user
STRING”) // create Hive table
> sqlContext.sql(“LOAD DATA LOCAL INPATH ‘/path/to/file.txt’ INTO TABLE page_view”) //
load data
> sqlContext.sql(“SELECT * FROM page_view”) // select data from table
Spark-SQL in action using thrift and beeline
But….
•  Latency (better than Hive); may not be sufficient for some applications.
•  Queries are not ANSI compliant; RDBMS queries will need rewrite
•  Update, Delete support is not supported
•  No Primary Keys, Foreign Keys, Constraints..
Bottom line… maybe good for ad-hoc queries but not for transactions.
What’s behind Catalyst
HAWQ Overview
(Incubator Project)
HAWQ Key Features
•  MPP database that is native to Hadoop HDFS, YARN and Ambari managed
•  Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
•  Many times faster than other Hive and Spark-SQL
•  Full transaction capability and consistency guarantee: ACID
•  Multi-language user defined function support: Python, Perl, Java, C/C++, R
•  Standard connectivity: JDBC/ODBC
•  Support multiple level partitioning, constraints, procedures.
•  Built on top of Greenplum DB engine
HAWQ Architecture
Source: http://hdb.docs.pivotal.io/20/overview/HAWQArchitecture.html
•  HAWQ Master is entry point for all
client requests
•  HAWQ master hosts metadata in
global system catalog.
•  Physical Segments are units that
process data in parallel. One per
Hadoop Data Node
•  Each segment can start multiple
Query Executors (QE) that act as
virtual segments
HAWQ – Getting Started
Create Table and check list
	
CREATE	TABLE	product_dim	
	(product_id					integer,	
		product_name				varchar(200))	
DISTRIBUTED	BY	(product_id)	;	
	
gpadmin-#	dt		
									List	of	relations	
		Schema		|	Name		|	Type		|		Owner			
----------+-------+-------+---------	
	student0	|	product_dim	|	table	|	
gpadmin	
(1	row)
Query global catalog
	gpadmin=#	select	*	from	pg_tables	where	
tablename='users’;	
	schemaname	|	tablename			|	tableowner	|	tablespace	|	
hasindexes	|	hasrules	|	hastriggers		
------------+-----------+------------+------------
+------------+----------+-------------	
	student0			|	product_dim	|	gpadmin				|												|	f			
|	f								|	f	
(1	row)	
	
gpadmin=# d product_dim;
Table “product_dim"
Column | Type | Modifiers
--------------+------------------------+-----------
product_id | integer |
product_name | character varying(200) |
But….
•  Currently HAWQ is in Apache incubation state (was Pivotal earlier)
•  For best performance, data has to be stored in HAWQ format; which is not usable by
other tools in eco-system.
•  stored by HAWQ in HDFS can only be used by HAWQ.
•  Additional processing/compute is required to mange data intake from external files
(like .csv) into Hawq.
•  Does not work well with complex data types
Drill Overview
Drill Key Features
•  Scale-Out MPP Query Engine with low latency
•  Schema-less access to any data, any data source with SQL interface
•  Supports ANSI SQL
•  Scale in all dimensions
•  Excellent Query response times; great for ad-hoc queries & data exploration
•  Row/Column level controls
•  Can query Hadoop and other NoSQL data stores
Drill Architecture
Source: https://drill.apache.org/docs/drill-query-execution/
Drill in action
use dfs.yelp;
alter session set `store.format`='json';
>SELECT id, type, name, ppu FROM dfs.`/Users/brumsby/drill/donuts.json`;
+------------+------------+------------+------------+
| id | type | name | ppu |
+------------+------------+------------+------------+
| 0001 | donut | Cake | 0.55 |
+------------+------------+------------+------------+
1 row selected (0.248 seconds);
Source: https://drill.apache.org/docs/create-table-as-ctas/
But..
•  Requires separate processes to run Drillbits
•  Only read-only support
•  Memory and CPU intensive; example for 128GB server, recommendations
File system = 20G, HBase = 20G, OS = 8G, YARN = 60GB, Drill=20GB
Reference: https://drill.apache.org/docs/configuring-multitenant-resources/
Trafodion Overview
(Incubator Project)
Trafodian Key Features
•  OLTP and ODS with ANSI support
•  Distributed ACID transaction protection across multiple statements, tables and rows
•  Integrated with HBase and Hive
•  Support for nested loop, merge, hash joins
•  ODBC and JDBC drivers
Trafodian Architecture - Components
Source: http://trafodion.apache.org/presentations/dtm-architecture.pdf
Sqlci or trafci
Connect with HBase
Connect with Hive
hive> create database testdb;
hive> create table testdb.test1(col1 string, col2 string);
hive> insert into testdb.test1 values('Big','Data’);
Esgyn, from terminal … type
$ sqlci
select count(*) from hive.testdb.test1;
Phoenix
Phoenix Features
•  SQL on Hbase and lightweight.
•  Converts SQL statements to series of Hbase API’s.
•  Recently added ACID support
•  Supports Schemas, Composite Primary Key, Parallel Scan, Joins (limited)
•  Supports Sequences, Rowtimestamps
•  Supports write to Hbase, read from Phoenix
•  Supports Statistics collection
Phoenix Architecture
Source: http://phoenix.apache.org/presentations/OC-HUG-2014-10-4x3.pdf
Phoenix DDL Example
CREATE TABLE IF NOT EXISTS METRIC_RECORD (
METRIC_NAME VARCHAR,
HOSTNAME VARCHAR,
SERVER_TIME UNSIGNED_LONG NOT NULL
METRIC_VALUE DOUBLE,
…
CONSTRAINT pk PRIMARY KEY (METRIC_NAME, HOSTNAME,
SERVER_TIME))Source: http://www.slideshare.net/enissoz/apache-phoenix-past-present-and-future-of-sql-over-hbase
https://phoenix.apache.org/sequences.html
But..
•  Phoenix does not integrate with Hive
•  Phoenix doesn’t support cross-row transactions yet.
•  Query optimizer and join mechanisms needs improvements
•  Secondary indexes can get out of sync with the primary table
•  Multi-tenancy is constrained as it uses a single HBase table.
Splice Machine
Overview
(not Apache but Open Source)
Splice Machine Key Features
•  True ANSI SQL Engine with ACID support
•  Offload OLTP and OLAP workloads
•  Supports joins, transactions, constraints,
stored procedures, window functions
•  Leverages proven processing of Apache
Derby
•  Simplifies Lambda architecture using one
interface
•  Auto Workload isolation, OLAP goes to
SPARK, OLTP to HBase
Simplified Lambda architecture
Splice Machine key commands
Source: http://doc.splicemachine.com/Developers/CmdLineReference/CmdLineSyntax.html
So let’s conclude
SQL on Hadoop Feature Matrix
Feature Hive /
HiveQL
Spark SQL Drill HAWQ Phoenix Trafodian Splice
Machine
Key Use
Cases
Batch / ETL /
Long-running
jobs
Batch / ETL /
advanced
analytics / ad-
hoc queries
Self-Service Data
Exploration
Interactive BI / Ad-
hoc queries
Interactive
BI / Ad-hoc
queries
OLTP on
Hadoop
OLTP on
Hadoop
OLTP and OLAP
on Hadoop
SQL
Support
HiveQL HiveQL ANSI SQL ANSI SQL ANSI SQL ANSI SQL ANSI SQL
ACID
Support
Limited None None ACID
Support
ACID Support
(new)
ACID
Support
ACID Support
Metastore Hcatalog Hcatalog Hcatalog Hcatalog HBase Hcatalog +
Custom
Hcatalog +
Custom
Native to
Hadoop
Yes, YARN +
HDFS
Yes, YARN +
HDFS
Requires Drillbits
as separate
process
YARN +
HDFS
(special
format)
HBase HBase HBase &
SPARK
Latency High Medium (in
memory)
Low Low Low Low Low
Data Types
Supported
Relational &
Limited
complex
types
Relational &
Limited
complex types
Relational &
excellent schema
less support
Relational
only
Relational &
Limited
complex
types
Relational &
Limited complex
types
SQL on Hadoop

Mais conteúdo relacionado

Mais procurados

Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM Cynthia Saracco
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paperJethroData
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Nicolas Morales
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetDataWorks Summit
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprisesnvvrajesh
 
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache OrcIngesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache OrcDataWorks Summit
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on HadoopSenturus
 

Mais procurados (20)

Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
JethroData technical white paper
JethroData technical white paperJethroData technical white paper
JethroData technical white paper
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprises
 
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache OrcIngesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache Orc
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 

Destaque

Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetupnvvrajesh
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on HadoopDataWorks Summit
 
The Hive Overview
The Hive OverviewThe Hive Overview
The Hive Overviewthehivecs
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Gruter
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBRadenko Zec
 
eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.Vijaykumar Vangapandu
 
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETLGoodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETLInside Analysis
 
Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption Cloudera, Inc.
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
HBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsHBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsMichael Stack
 
Apache hbase overview (20160427)
Apache hbase overview (20160427)Apache hbase overview (20160427)
Apache hbase overview (20160427)Steve Min
 
Hadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both WorldsHadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both WorldsInside Analysis
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine OverviewKunal Gupta
 

Destaque (20)

Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on Hadoop
 
The Hive Overview
The Hive OverviewThe Hive Overview
The Hive Overview
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
NoSQL & HBase overview
NoSQL & HBase overviewNoSQL & HBase overview
NoSQL & HBase overview
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013Tajo and SQL-on-Hadoop in Tech Planet 2013
Tajo and SQL-on-Hadoop in Tech Planet 2013
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.
 
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETLGoodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
 
Splice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakesSplice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakes
 
Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
HBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsHBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbms
 
Apache hbase overview (20160427)
Apache hbase overview (20160427)Apache hbase overview (20160427)
Apache hbase overview (20160427)
 
Hadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both WorldsHadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both Worlds
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 

Semelhante a SQL on Hadoop

Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Qubole - Big data in cloud
Qubole - Big data in cloudQubole - Big data in cloud
Qubole - Big data in cloudDmitry Tolpeko
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsGuy Harrison
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For HadoopCloudera, Inc.
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and dockerBob Ward
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonClaudiu Barbura
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetupRemus Rusanu
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 

Semelhante a SQL on Hadoop (20)

Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Shark
SharkShark
Shark
 
מיכאל
מיכאלמיכאל
מיכאל
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Qubole - Big data in cloud
Qubole - Big data in cloudQubole - Big data in cloud
Qubole - Big data in cloud
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For Hadoop
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 

Mais de nvvrajesh

HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platformnvvrajesh
 
Information management and enterprise architecture
Information management and enterprise architectureInformation management and enterprise architecture
Information management and enterprise architecturenvvrajesh
 
Pentaho bi suite overview presentation
Pentaho bi suite overview   presentationPentaho bi suite overview   presentation
Pentaho bi suite overview presentationnvvrajesh
 
Social Networking for Non-Profits
Social Networking for Non-ProfitsSocial Networking for Non-Profits
Social Networking for Non-Profitsnvvrajesh
 
Oracle business intelligence overview
Oracle business intelligence overviewOracle business intelligence overview
Oracle business intelligence overviewnvvrajesh
 
BI the Agile Way
BI the Agile WayBI the Agile Way
BI the Agile Waynvvrajesh
 
Agile Process in a Nutshell
Agile Process in a NutshellAgile Process in a Nutshell
Agile Process in a Nutshellnvvrajesh
 

Mais de nvvrajesh (7)

HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platform
 
Information management and enterprise architecture
Information management and enterprise architectureInformation management and enterprise architecture
Information management and enterprise architecture
 
Pentaho bi suite overview presentation
Pentaho bi suite overview   presentationPentaho bi suite overview   presentation
Pentaho bi suite overview presentation
 
Social Networking for Non-Profits
Social Networking for Non-ProfitsSocial Networking for Non-Profits
Social Networking for Non-Profits
 
Oracle business intelligence overview
Oracle business intelligence overviewOracle business intelligence overview
Oracle business intelligence overview
 
BI the Agile Way
BI the Agile WayBI the Agile Way
BI the Agile Way
 
Agile Process in a Nutshell
Agile Process in a NutshellAgile Process in a Nutshell
Agile Process in a Nutshell
 

Último

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Último (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

SQL on Hadoop

  • 1. SQL on Hadoop Raj Nadipalli rajesh.nadipalli@gmail.com 08.25.2016
  • 2. Agenda •  The need for SQL on Hadoop •  Current popular open source options for SQL on Hadoop •  Feature review of Hive, SparkSQL, Drill, HAWQ, Phoenix, Splice machine and Trafodian •  Q&A
  • 4. Transition to Modern Data Architecture - Legacy Source: HDInsight Essentials (ISBN 1784396664)
  • 5. Transition to Modern Data Architecture - New
  • 6. SQL on Hadoop – the need •  Hadoop is a fit for ETL offload and/or Data warehouse offload •  But…it is not a cheap replacement to RDBMS; it’s a new platform •  SQL-on-Hadoop is an abstraction on HDFS and YARN •  SQL-on-Hadoop enables ad-hoc analysis on files, ETL and abstractions on complex data types •  SQL-on-Hadoop reduces the learning curve for existing DBA’s and developers
  • 7. SQL on Hadoop – Patterns of Usage Batch •  ETL jobs that run typically daily •  Select, Insert, Update and Delete •  Latency minutes to hours Interactive •  Ad-hoc analysis and queries •  Typically Select •  Latency seconds to minutes Transactio nal •  ACID transactional support •  Select, Insert, Update and Delete •  Latency milliseconds to second
  • 9. General Architecture of SQL Engines 3 general architectures for SQL Engines •  Leverage YARN and managed completely by Hadoop. Examples Hive, SparkSQL •  Separate process outside Hadoop; does share metadata, HDFS and infrastructure. Example Hawq, Drill •  Leverage Hbase Example Phoenix, Trafodian, Splicemachine
  • 11. Hive Leveraging YARN Source: HDInsight Essentials (ISBN 1784396664)
  • 12. Hive Key Features •  Standard component in all Hadoop distributions •  Hcatalog (Hive metadata) is standard metastore for several Hadoop components •  Data warehouse solution built on Hadoop for providing data summarization. •  Provides SQL-like query language called HiveQL. •  Minimal learning curve for SQL developers/business analysts. •  Ability to bring structure to various formats. •  Simple interface for ad-hoc query, summarization, joins on big data. •  Extensible via data types, functions and scripts. •  JDBC driver to connect with SQL IDE’s like SquirrelSQL, DBeaver
  • 13. Notable New Features (Hive 1.2) •  Transactional support (ACID) limited to ORC formats •  Row level inserts, updates, or deletes •  ALTER SCHEMA DDL •  ANSI SQL compliance which is not 100%, but getting closer •  Cost Based Optimizer •  SQL Temporary Tables
  • 14. WordCount in Hive CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs.txt' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, ’s')) AS word FROM docs) w GROUP BY word ORDER BY word;
  • 16. But…. •  Latency; queries run typically in minutes and not seconds. •  HiveQL is not ANSI compliant; RDBMS queries will need rewrite •  Update, Delete support is limited (Transactions support only works on ORC) •  No Primary Keys, Foreign Keys, Constraints.. Bottom line… this is not a replacement for transactional databases
  • 18. Spark & SparkSQL Key Features •  Strong community support - currently the most popular Hadoop component (2015, 2016) •  Can run as YARN application and runs in memory and disk •  Developers can use Java, Scala, Python, R & SQL •  Powerful SQL features like Pivot tables, efficient Joins •  Improved Query performance and caching with Catalyst •  Claims 100X faster response time over MapReduce in memory, 10X on disk. •  Supports HiveQL and leverages Hive metastore •  JDBC, OBDC and beeline support
  • 19. Spark Architecture (2.0 ) Spark Core (Resilient Distributed Dataset) Catalyst SQL Dataframe / Dataset ML Pipelines Structured Streaming GraphFrames Spark SQL Source: http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
  • 20. Spark-SQL in action > ./spark-shell // initialize spark shell > sqlContext.sql(“CREATE TABLE page_view(viewTime INT, page_url, STRING, user STRING”) // create Hive table > sqlContext.sql(“LOAD DATA LOCAL INPATH ‘/path/to/file.txt’ INTO TABLE page_view”) // load data > sqlContext.sql(“SELECT * FROM page_view”) // select data from table
  • 21. Spark-SQL in action using thrift and beeline
  • 22. But…. •  Latency (better than Hive); may not be sufficient for some applications. •  Queries are not ANSI compliant; RDBMS queries will need rewrite •  Update, Delete support is not supported •  No Primary Keys, Foreign Keys, Constraints.. Bottom line… maybe good for ad-hoc queries but not for transactions.
  • 25. HAWQ Key Features •  MPP database that is native to Hadoop HDFS, YARN and Ambari managed •  Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension •  Many times faster than other Hive and Spark-SQL •  Full transaction capability and consistency guarantee: ACID •  Multi-language user defined function support: Python, Perl, Java, C/C++, R •  Standard connectivity: JDBC/ODBC •  Support multiple level partitioning, constraints, procedures. •  Built on top of Greenplum DB engine
  • 26. HAWQ Architecture Source: http://hdb.docs.pivotal.io/20/overview/HAWQArchitecture.html •  HAWQ Master is entry point for all client requests •  HAWQ master hosts metadata in global system catalog. •  Physical Segments are units that process data in parallel. One per Hadoop Data Node •  Each segment can start multiple Query Executors (QE) that act as virtual segments
  • 27. HAWQ – Getting Started Create Table and check list CREATE TABLE product_dim (product_id integer, product_name varchar(200)) DISTRIBUTED BY (product_id) ; gpadmin-# dt List of relations Schema | Name | Type | Owner ----------+-------+-------+--------- student0 | product_dim | table | gpadmin (1 row) Query global catalog gpadmin=# select * from pg_tables where tablename='users’; schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers ------------+-----------+------------+------------ +------------+----------+------------- student0 | product_dim | gpadmin | | f | f | f (1 row) gpadmin=# d product_dim; Table “product_dim" Column | Type | Modifiers --------------+------------------------+----------- product_id | integer | product_name | character varying(200) |
  • 28. But…. •  Currently HAWQ is in Apache incubation state (was Pivotal earlier) •  For best performance, data has to be stored in HAWQ format; which is not usable by other tools in eco-system. •  stored by HAWQ in HDFS can only be used by HAWQ. •  Additional processing/compute is required to mange data intake from external files (like .csv) into Hawq. •  Does not work well with complex data types
  • 30. Drill Key Features •  Scale-Out MPP Query Engine with low latency •  Schema-less access to any data, any data source with SQL interface •  Supports ANSI SQL •  Scale in all dimensions •  Excellent Query response times; great for ad-hoc queries & data exploration •  Row/Column level controls •  Can query Hadoop and other NoSQL data stores
  • 32. Drill in action use dfs.yelp; alter session set `store.format`='json'; >SELECT id, type, name, ppu FROM dfs.`/Users/brumsby/drill/donuts.json`; +------------+------------+------------+------------+ | id | type | name | ppu | +------------+------------+------------+------------+ | 0001 | donut | Cake | 0.55 | +------------+------------+------------+------------+ 1 row selected (0.248 seconds); Source: https://drill.apache.org/docs/create-table-as-ctas/
  • 33. But.. •  Requires separate processes to run Drillbits •  Only read-only support •  Memory and CPU intensive; example for 128GB server, recommendations File system = 20G, HBase = 20G, OS = 8G, YARN = 60GB, Drill=20GB Reference: https://drill.apache.org/docs/configuring-multitenant-resources/
  • 35. Trafodian Key Features •  OLTP and ODS with ANSI support •  Distributed ACID transaction protection across multiple statements, tables and rows •  Integrated with HBase and Hive •  Support for nested loop, merge, hash joins •  ODBC and JDBC drivers
  • 36. Trafodian Architecture - Components Source: http://trafodion.apache.org/presentations/dtm-architecture.pdf
  • 37. Sqlci or trafci Connect with HBase Connect with Hive hive> create database testdb; hive> create table testdb.test1(col1 string, col2 string); hive> insert into testdb.test1 values('Big','Data’); Esgyn, from terminal … type $ sqlci select count(*) from hive.testdb.test1;
  • 39. Phoenix Features •  SQL on Hbase and lightweight. •  Converts SQL statements to series of Hbase API’s. •  Recently added ACID support •  Supports Schemas, Composite Primary Key, Parallel Scan, Joins (limited) •  Supports Sequences, Rowtimestamps •  Supports write to Hbase, read from Phoenix •  Supports Statistics collection
  • 41. Phoenix DDL Example CREATE TABLE IF NOT EXISTS METRIC_RECORD ( METRIC_NAME VARCHAR, HOSTNAME VARCHAR, SERVER_TIME UNSIGNED_LONG NOT NULL METRIC_VALUE DOUBLE, … CONSTRAINT pk PRIMARY KEY (METRIC_NAME, HOSTNAME, SERVER_TIME))Source: http://www.slideshare.net/enissoz/apache-phoenix-past-present-and-future-of-sql-over-hbase https://phoenix.apache.org/sequences.html
  • 42. But.. •  Phoenix does not integrate with Hive •  Phoenix doesn’t support cross-row transactions yet. •  Query optimizer and join mechanisms needs improvements •  Secondary indexes can get out of sync with the primary table •  Multi-tenancy is constrained as it uses a single HBase table.
  • 44. Splice Machine Key Features •  True ANSI SQL Engine with ACID support •  Offload OLTP and OLAP workloads •  Supports joins, transactions, constraints, stored procedures, window functions •  Leverages proven processing of Apache Derby •  Simplifies Lambda architecture using one interface •  Auto Workload isolation, OLAP goes to SPARK, OLTP to HBase
  • 46. Splice Machine key commands Source: http://doc.splicemachine.com/Developers/CmdLineReference/CmdLineSyntax.html
  • 48. SQL on Hadoop Feature Matrix Feature Hive / HiveQL Spark SQL Drill HAWQ Phoenix Trafodian Splice Machine Key Use Cases Batch / ETL / Long-running jobs Batch / ETL / advanced analytics / ad- hoc queries Self-Service Data Exploration Interactive BI / Ad- hoc queries Interactive BI / Ad-hoc queries OLTP on Hadoop OLTP on Hadoop OLTP and OLAP on Hadoop SQL Support HiveQL HiveQL ANSI SQL ANSI SQL ANSI SQL ANSI SQL ANSI SQL ACID Support Limited None None ACID Support ACID Support (new) ACID Support ACID Support Metastore Hcatalog Hcatalog Hcatalog Hcatalog HBase Hcatalog + Custom Hcatalog + Custom Native to Hadoop Yes, YARN + HDFS Yes, YARN + HDFS Requires Drillbits as separate process YARN + HDFS (special format) HBase HBase HBase & SPARK Latency High Medium (in memory) Low Low Low Low Low Data Types Supported Relational & Limited complex types Relational & Limited complex types Relational & excellent schema less support Relational only Relational & Limited complex types Relational & Limited complex types