© 2014 IBM Corporation
Big SQL 3.0
Data warehouse-grade SQL on Hadoop
Scott C. Gray (sgray@us.ibm.com)
Hebert Pereyra (pereyra@ca.ibm.com)
Please Note
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be incorporated
into any contract. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
Agenda
Brief introduction to Hadoop
Why SQL on Hadoop?
SQL-on-Hadoop landscape
What is Hive?
Big SQL 3.0
• What is it?
• SQL capabilities
• Architecture
• Application portability and
integration
• Enterprise capabilities
• Performance
Conclusion
What is Hadoop?
Hadoop is not a single piece of software; you can't install "hadoop"
It is an ecosystem of software packages that work together
• Hadoop Core (API's)
• HDFS (File system)
• MapReduce (Data processing framework)
• Hive (SQL access)
• HBase (NoSQL database)
• Sqoop (Data movement)
• Oozie (Job workflow)
… There is a LOT of "Hadoop" software
However, there is one common component they all build on: HDFS…
• *Not exactly 100% true but 99.999% true
The Hadoop Filesystem (HDFS)
Driving principles
• Files are stored across the entire cluster
• Programs are brought to the data, not the data to the program
Distributed file system (DFS) stores blocks across the whole cluster
• Blocks of a single file are distributed across the cluster
• A given block is typically replicated as well for resiliency
• Just like a regular file system, the contents of a file are up to the application
[Diagram: a logical file divided into blocks 1-4; the blocks are distributed across the cluster, with each block replicated on multiple nodes for resiliency]
Data processing on Hadoop
Hadoop (HDFS) doesn't dictate file content/structure
• It is just a filesystem!
• It looks and smells almost exactly like the filesystem on your laptop
• Except, you can ask it "where does each block of my file live?"
The entire Hadoop ecosystem is built around that question!
• Parallelize work by sending your programs to the data
• Each copy processes a given block of the file
• Other nodes may be chosen to aggregate together computed results
[Diagram: a logical file's splits 1-3 are each read by an App (Read) instance running on a node holding that split; another node runs App (Compute) to aggregate the results]
Hadoop MapReduce
MapReduce is a way of writing parallel processing programs
• Leverages the design of the HDFS filesystem
Programs are written in two pieces: Map and Reduce
Programs are submitted to the MapReduce job scheduler (JobTracker)
• The scheduler looks at the blocks of input needed for the job (the "splits")
• For each split, tries to schedule the processing on a host holding the split
• Hosts are chosen based upon available processing resources
Program is shipped to a host and given a split to process
Output of the program is written back to HDFS
MapReduce - Mappers
Mappers
• Small program (typically), distributed across the cluster, local to data
• Handed a portion of the input data (called a split)
• Each mapper parses, filters, or transforms its input
• Produces grouped <key,value> pairs
[Diagram, Map phase highlighted: splits 1-4 of the logical input file each feed a map task with a local sort; reduce tasks then copy and merge the sorted output and write logical output files to DFS]
MapReduce – The Shuffle
The shuffle is transparently orchestrated by MapReduce
The output of each mapper is locally grouped together by key
One node is chosen to process data for each unique key
[Diagram, shuffle highlighted: the sorted output of each mapper is copied to the reducer responsible for each key and merged there before reducing; results are written to DFS]
MapReduce - Reduce
Reducers
• Small programs (typically) that aggregate all of the values for the
key that they are responsible for
• Each reducer writes output to its own file
[Diagram, Reduce phase highlighted: each reducer merges its copied map output, applies the reduce function, and writes its own logical output file to DFS]
Why SQL for Hadoop?
Hadoop is designed for any data
• Doesn't impose any structure
• Extremely flexible
At the lowest levels it is API based
• Requires strong programming expertise
• Steep learning curve
• Even simple operations can be
tedious
Yet many, if not most, use cases deal with structured data!
• e.g. aging old warehouse data into a queryable archive
Why not use SQL where its strengths shine?
• Familiar widely used syntax
• Separation of what you want vs. how to get it
• Robust ecosystem of tools
[Diagram: three warehouse augmentation patterns - (1) pre-processing hub: BigInsights as the landing zone for all data, feeding the data warehouse through information integration; (2) queryable archive: aged warehouse data remains queryable in BigInsights; (3) exploratory analysis: warehouse data combined with unstructured information, with Streams for real-time processing]
SQL-on-Hadoop landscape
The SQL-on-Hadoop landscape is changing rapidly!
They all have their different strengths and weaknesses
Many, including Big SQL, draw their basic design from Hive…
Then along came Hive
Hive was the first SQL interface for Hadoop data
• De facto standard for SQL on Hadoop
• Ships with all major Hadoop distributions
SQL queries are executed using MapReduce (today)
• More on this later!
Hive introduced several important concepts/components…
Hive tables
In most cases, a table is simply a directory (on HDFS) full of files
Hive doesn't dictate the content/structure of these files
• It is designed to work with existing user data
• In fact, there is no such thing as a "Hive table": in Hive, Java classes define a "table"
/biginsights/hive/warehouse/myschema.db/mytable/
file1
file2
…
CREATE TABLE my_strange_table (
c1 string,
c2 timestamp,
c3 double
)
ROW FORMAT SERDE "com.myco.MyStrangeSerDe"
WITH SERDEPROPERTIES ( "timestamp.format" = "mm/dd/yyyy" )
INPUTFORMAT "com.myco.MyStrangeInputFormat"
OUTPUTFORMAT "com.myco.MyStrangeOutputFormat"
InputFormat and SerDe
InputFormat – Hadoop concept
• Defines a java class that can read from a particular data source
– E.g. file format, database connection, region servers, web servers, etc.
• Each InputFormat produces its own record format as output
• Responsible for determining splits: how to break up the data from the data
source so that work can be divided between mappers
• Each table defines an InputFormat (in the catalog) that understands the
table’s file structure
SerDe (Serializer/Deserializer) – Hive concept
• A class written to interpret the records produced by an InputFormat
• Responsible for converting that record to a row (and back)
• A row has a clearly defined Hive structure (an array of values)
[Diagram: data file → InputFormat (records) → SerDe (rows)]
Hive tables (cont.)
For many common file formats Hive provides a simplified syntax
This just pre-selects combinations of classes and configurations
create table users
(
id int,
office_id int
)
row format delimited
fields terminated by '|'
stored as textfile
create table users
(
id int,
office_id int
)
row format serde 'org.apache.hive…LazySimpleSerde'
with serdeproperties ( 'field.delim' = '|' )
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.mapred.TextOutputFormat'
Hive partitioned tables
Most table types can be partitioned
Partitioning is on one or more columns
Each unique value becomes a partition
Query predicates can be used to eliminate scanned partitions
CREATE TABLE demo.sales (
part_id int,
part_name string,
qty int,
cost double
)
PARTITIONED BY (
state char(2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
biginsights/hive/warehouse/demo.db/sales/
  state=NJ/  data1.csv
  state=AR/  data1.csv
  state=CA/  data2.csv
  state=NY/  data1.csv

select *
from demo.sales
where state in ('NJ', 'CA');
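Partitions can also be registered by hand for data that already exists on HDFS. A small sketch using standard Hive DDL (the path shown is illustrative):

-- Register an existing directory as a new partition of demo.sales
ALTER TABLE demo.sales
  ADD PARTITION (state = 'TX')
  LOCATION '/biginsights/hive/warehouse/demo.db/sales/state=TX';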
Hive MetaStore
Hive maintains a centralized database of metadata
• Typically stored in a traditional RDBMS
Contains table definitions
• Location (directory on HDFS)
• Column names and types
• Partition information
• Classes used to read/write the table
• Etc.
Security
• Groups, roles, permissions
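Because definitions live in the MetaStore, any Hive-compatible client can inspect them. A quick sketch using standard Hive commands, against the demo.sales table defined earlier:

-- Show everything the MetaStore knows about the table:
-- location, columns, partition keys, SerDe, and I/O classes
DESCRIBE FORMATTED demo.sales;

-- List the partitions currently registered for the table
SHOW PARTITIONS demo.sales;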
Query processing in Hive
Up until version 0.13 (just released) Hive used MapReduce for
query processing
• They are moving to a new framework called "Tez"
It is useful to understand query processing in MR to understand
how Big SQL tackles the same problem…
Hive – Joins in MapReduce
For joins, MR is used to group data together at the same reducer based
upon the join key
• Mappers read blocks from each “table” in the join
• The <key> is the value of the join key, the <value> is the record to be joined
• Reducer receives a mix of records from each table with the same join key
• Reducers produce the results of the join
[Diagram: map tasks read splits of the employees and depts files; rows are shuffled on dept_id so each reducer (dept 1, dept 2, dept 3) receives the matching rows from both tables and produces the join results]

select e.fname, e.lname, d.dept_name
from employees e, depts d
where e.salary > 30000
  and d.dept_id = e.dept_id
N-way Joins in MapReduce
For N-way joins involving different join keys, multiple jobs are used
select e.fname, e.lname, d.dept_name, p.phone_type, p.phone_number
from employees e, depts d, emp_phones p
where e.salary > 30000
  and d.dept_id = e.dept_id
  and p.emp_id = e.emp_id

[Diagram: the first MapReduce job joins employees and depts on dept_id (reducers keyed by dept), writing temp files; a second job joins that intermediate result with emp_phones on emp_id (reducers keyed by emp_id) to produce the final results]
Agenda
Brief introduction to Hadoop
Why SQL on Hadoop?
What is Hive?
SQL-on-Hadoop landscape
Big SQL 3.0
• What is it?
• SQL capabilities
• Architecture
• Application portability and
integration
• Enterprise capabilities
• Performance
Conclusion
Big SQL 3.0 – At a glance
Available for POWER Linux (Redhat) and Intel x64 Linux (Redhat/SUSE)
11-Apr-2014
Open processing
As with Hive, Big SQL applies SQL to your existing data
• No proprietary storage format
A "table" is simply a view on your Hadoop data
Table definitions shared with Hive
• The Hive Metastore catalogs table definitions
• Reading/writing data logic is shared
with Hive
• Definitions can be shared across the
Hadoop ecosystem
Sometimes SQL isn't the answer!
• Use the right tool for the right job
[Diagram: Pig, Sqoop, and Big SQL all reach the Hadoop cluster through the Hive APIs, sharing table definitions in the Hive Metastore alongside Hive itself]
Creating tables in Big SQL
Big SQL syntax is derived from Hive's syntax with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null,
salary timestamp(3) null,
constraint fk_ofc foreign key (office_id)
references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;
Creating tables in Big SQL
Big SQL syntax is derived from Hive's syntax with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null,
salary timestamp(3) null,
constraint fk_ofc foreign key (office_id)
references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;
Hadoop Keyword
• Big SQL requires the HADOOP keyword
• Big SQL has internal traditional RDBMS table support
• Stored only at the head node
• Does not live on HDFS
• Supports full ACID capabilities
• Not usable for "big" data
• The HADOOP keyword identifies the table as living on HDFS
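To illustrate the distinction, a minimal hedged sketch (table and column names are invented): the first definition creates a head-node-only table, the second places the same data on HDFS.

-- Local (non-HADOOP) table: stored only at the head node,
-- full ACID, but not suitable for "big" data
create table app_audit (
  event_id   int not null,
  event_desc varchar(200)
);

-- Hadoop table: the HADOOP keyword places the data on HDFS
create hadoop table app_events (
  event_id   int not null,
  event_desc varchar(200)
)
stored as textfile;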
Creating tables in Big SQL
Big SQL syntax is derived from Hive's syntax with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null,
salary timestamp(3) null,
constraint fk_ofc foreign key (office_id)
references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;
Nullability Indicators
• Enforced on read
• Used by query optimizer for smarter rewrites
Creating tables in Big SQL
Big SQL syntax is derived from Hive's syntax with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null,
salary timestamp(3) null,
constraint fk_ofc foreign key (office_id)
references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;
Constraints
• Unenforced
• Useful as documentation and to drive query builders
• Used by query optimizer for smarter rewrites
Creating tables in Big SQL
Big SQL syntax is derived from Hive's syntax with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null,
salary timestamp(3) null,
constraint fk_ofc foreign key (office_id)
references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;
Extended Data Types
• Data types derived from base Hive data types
• Can provide additional constraints on the Hive type
Table types
Big SQL supports many of the "standard" Hadoop storage formats
• Text delimited
• Text delimited sequence files
• Binary delimited sequence files
• Parquet
• RC
• ORC
• Avro
Each has different features/advantages/disadvantages
Custom file formats may be supported as well via custom java classes
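Choosing a format is largely a matter of the STORED AS clause. A minimal sketch, assuming the PARQUETFILE storage keyword and illustrative table names:

-- Columnar Parquet storage: good compression and scan performance
create hadoop table sales_parquet (
  sale_id int,
  amount  double
)
stored as parquetfile;

-- Delimited text: human readable and easy to exchange
create hadoop table sales_text (
  sale_id int,
  amount  double
)
row format delimited fields terminated by ','
stored as textfile;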
Populating Big SQL tables
There are a number of ways to populate tables
Tables can be defined against existing data
• All validation is performed at query time
Rows can be directly inserted into tables
• Data validation is performed and data is converted to the storage format
• Only suitable for testing
• Produces one physical data file per call to INSERT
create external hadoop table csv_data
(
c1 int not null primary key,
c2 varchar(20) null
)
row format delimited fields terminated by ','
stored as textfile
location '/user/bob/csv_data'
insert into t1 values (5, 'foo'), (6, 'bar'), (7, 'baz')
Populating Big SQL tables (cont.)
Tables can be populated from other tables
Tables can be created from other tables
• Great way to convert between storage types or partition data
insert into top_sellers
select employee_id, rank() over (order by sales)
from (
select employee_id, sum(sales) sales
from product_sales
group by employee_id
)
limit 10;
create hadoop table partitioned_sales
partitioned by (dept_id int not null)
stored as rcfile
as
select emp_id, prod_id, qty, cost, dept_id
from sales
Populating Big SQL tables (cont.)
The LOAD HADOOP statement is used to populate Hadoop tables from an
external data source
• Statement runs on the cluster – cannot access data at the client
• Nodes of the cluster ingest data in parallel
• Performs data validation during load
• Performs data conversion (to storage format) during load
Supports the following sources of data
• Any valid Hadoop URL (e.g. hdfs://, sftp://, etc.)
• JDBC data sources (e.g. Oracle, DB2, Netezza, etc.)
Loading from URL
Data may be loaded from delimited files read via any valid URL
• If no URI scheme is provided, HDFS is assumed:
Example loading via SFTP:
Just remember LOAD HADOOP executes on the cluster
• So file:// will be local to the node chosen to run the statement
LOAD HADOOP USING FILE URL '/user/biadmin/mydir/abc.csv'
WITH SOURCE PROPERTIES(
  'field.delimiter'=',',
  'date.time.format'='yyyy-MM-dd-HH.mm.ss.S')

LOAD HADOOP USING FILE URL
  'sftp://biadmin.biadmin@myserver.abc.com:22/home/biadmin/mydir'

LOAD HADOOP USING FILE URL 'file:///path/to/myfile/file.csv'
Loading from JDBC data source
A JDBC URL may be used to load directly from external data source
• Tested internally against Oracle, Teradata, DB2, and Netezza
It supports many options to partition the extraction of data
• Providing a table and partitioning column
• Providing a query and a WHERE clause to use for partitioning
Example usage:
LOAD USING JDBC
CONNECTION URL 'jdbc:db2://myhost:50000/SAMPLE'
WITH PARAMETERS (
user = 'myuser',
password='mypassword'
)
FROM TABLE STAFF WHERE "dept=66 and job='Sales'"
INTO TABLE staff_sales
PARTITION ( dept=66 , job='Sales')
APPEND WITH LOAD PROPERTIES (bigsql.load.num.map.tasks = 1) ;
SQL capabilities
Leverage IBM's rich SQL heritage, expertise, and technology
• SQL standards compliant query support
• SQL bodied functions and stored procedures
– Encapsulate your business logic and security at the server
• DB2 compatible SQL PL support
– Cursors
– Anonymous blocks (batches of statements)
– Flow of control (if/then/else, error handling, prepared statements, etc.)
The same SQL you use on your data warehouse should run with
few or no modifications
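As a sketch of the SQL PL support described above (procedure and table names are invented):

-- A DB2-style SQL PL procedure using a variable and flow of control
CREATE PROCEDURE flag_big_orders (IN p_threshold DOUBLE)
LANGUAGE SQL
BEGIN
  DECLARE v_count INTEGER DEFAULT 0;

  -- Count qualifying orders
  SELECT COUNT(*) INTO v_count
  FROM orders
  WHERE o_totalprice > p_threshold;

  -- Only record alerts when something qualified
  IF v_count > 0 THEN
    INSERT INTO order_alerts
      SELECT o_orderkey, o_totalprice
      FROM orders
      WHERE o_totalprice > p_threshold;
  END IF;
END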
SQL capability highlights
Full support for subqueries
• In SELECT, FROM, WHERE and
HAVING clauses
• Correlated and uncorrelated
• Equality, non-equality subqueries
• EXISTS, NOT EXISTS, IN, ANY,
SOME, etc.
All standard join operations
• Standard and ANSI join syntax
• Inner, outer, and full outer joins
• Equality, non-equality, cross join support
• Multi-value join
• UNION, INTERSECT, EXCEPT
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
SQL capability highlights (cont.)
Extensive analytic capabilities
• Grouping sets with CUBE and ROLLUP
• Standard OLAP operations
• Analytic aggregates
LEAD, LAG, RANK, DENSE_RANK, ROW_NUMBER, RATIO_TO_REPORT,
FIRST_VALUE, LAST_VALUE, CORRELATION, COVARIANCE, STDDEV, VARIANCE,
REGR_AVGX, REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_ICPT, REGR_R2,
REGR_SLOPE, REGR_SXX, REGR_SXY, REGR_SYY, WIDTH_BUCKET,
VAR_SAMP, VAR_POP, STDDEV_POP, STDDEV_SAMP, COVAR_SAMP, COVAR_POP, NTILE
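A short sketch of these capabilities against the demo.sales table defined earlier (column usage is illustrative):

-- Rank each part's total quantity sold within its state
SELECT state, part_id, SUM(qty) AS total_qty,
       RANK() OVER (PARTITION BY state ORDER BY SUM(qty) DESC) AS rank_in_state
FROM demo.sales
GROUP BY state, part_id;

-- Subtotals per state plus a grand total via ROLLUP
SELECT state, SUM(cost) AS total_cost
FROM demo.sales
GROUP BY ROLLUP (state);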
Architected for performance
Architected from the ground up for low latency and high throughput
MapReduce replaced with a modern MPP architecture
• Compiler and runtime are native code (not java)
• Big SQL worker daemons live directly on cluster
– Continuously running (no startup latency)
– Processing happens locally at the data
• Message passing allows data to flow directly
between nodes
Operations occur in memory with the ability
to spill to disk
• Supports aggregations and sorts larger than
available RAM
[Diagram: a SQL-based application connects through the IBM data server client to the Big SQL engine within InfoSphere BigInsights; the SQL MPP run-time reads data sources in CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats]
Big SQL 3.0 – Architecture
Head (coordinator) node
• Listens to the JDBC/ODBC connections
• Compiles and optimizes the query
• Coordinates the execution of the query
Big SQL worker processes reside on compute nodes (some or all)
Worker nodes stream data between each other as needed
[Diagram: management nodes host the Big SQL head node, Hive Metastore, Name Node, and Job Tracker; each compute node runs a Big SQL worker alongside the Task Tracker and Data Node, all over GPFS/HDFS]
Extreme parallelism
Massively parallel SQL engine that replaces MR
Shared-nothing architecture that eliminates scalability and networking
issues
Engine pushes processing out to data nodes to maximize data locality.
Hadoop data accessed natively via C++ and Java readers and writers.
Inter- and intra-node parallelism where work is distributed to multiple
worker nodes and on each node multiple worker threads collaborate on
the I/O and data processing (scale out horizontally and scale up
vertically)
Intelligent data partition elimination based on SQL predicates
Fault tolerance through active health monitoring and management of
parallel data and worker nodes
A process model view of Big SQL 3.0
Big SQL 3.0 – Architecture (cont.)
Big SQL's runtime execution engine is all native code
For common table formats a native I/O engine is utilized
• e.g. delimited, RC, SEQ, Parquet, …
For all others, a java I/O engine is used
• Maximizes compatibility with existing tables
• Allows for custom file formats and SerDe's
All Big SQL built-in functions are native code
Customer built UDx's can be developed in C++ or Java
Maximize performance without sacrificing
extensibility
[Diagram: within each Big SQL worker, a native I/O engine and a Java I/O engine (SerDe, InputFormat) feed the runtime, with both native and Java UDFs; the worker is co-located with the Task Tracker and Data Node]
Big SQL 3.0 works with Hadoop
All data is Hadoop data
• In files in HDFS
• SEQ, RC, delimited, Parquet …
Never need to copy data to a proprietary representation
All data is cataloged in the Hive metastore
• It is the Hadoop catalog
• It is flexible and extensible
All Hadoop data is in a Hadoop filesystem
• HDFS or GPFS-FPO
Resource management
Big SQL doesn't run in isolation
Nodes tend to be shared with a variety of Hadoop services
• Task tracker
• Data node
• HBase region servers
• MapReduce jobs
• etc.
Big SQL can be constrained to limit its footprint on the cluster
• % of CPU utilization
• % of memory utilization
Resources are automatically adjusted based upon workload
• Always fitting within constraints
• Self-tuning memory manager that re-distributes resources across
components dynamically
• default WLM concurrency control for heavy queries
[Diagram: a compute node shared by the Task Tracker, Data Node, HBase region server, several MR tasks, and the Big SQL worker]
Performance
Query rewrites
• Exhaustive query rewrite capabilities
• Leverages additional metadata such as constraints and nullability
Optimization
• Statistics and heuristic driven query optimization
• Query optimizer based upon decades of IBM RDBMS experience
Tools and metrics
• Highly detailed explain plans and query diagnostic tools
• Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
  AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
  AND STORE.STOREKEY = DAILY_SALES.STOREKEY
  AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
  AND STORE_NUMBER = '03'
  AND CATEGORY = 72
GROUP BY ITEM_DESC
[Diagram: query transformation applies dozens of rewrites, then access plan generation evaluates hundreds or thousands of plan options, e.g. alternative NLJOIN, HSJOIN, and ZZJOIN orderings over Product, Store, Daily Sales, and Period]
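Plans can be inspected with the DB2-style explain facility; a hedged sketch, assuming the explain tables have been created:

-- Capture the access plan for later formatting (e.g. with db2exfmt)
EXPLAIN PLAN FOR
SELECT ITEM_DESC, SUM(QUANTITY_SOLD)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
  AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
  AND STORE.STOREKEY = DAILY_SALES.STOREKEY
GROUP BY ITEM_DESC;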
Application portability and integration
Big SQL 3.0 adopts IBM's standard Data Server Client Drivers
• Robust, standards compliant ODBC, JDBC, and .NET drivers
• Same driver used for DB2 LUW, DB2/z and Informix
• Expands support to numerous languages (Python, Ruby, Perl, etc.)
Putting the story together….
• Big SQL shares a common SQL dialect with DB2
• Big SQL shares the same client drivers with DB2
• Data warehouse augmentation just got significantly easier
[Diagram: compatible SQL + compatible drivers = portable application]
Application portability and integration (cont.)
This compatibility extends beyond your own applications
Open integration across Business Analytic Tools
• IBM Optim Data Studio performance tool portfolio
• Superior enablement for IBM Software – e.g. Cognos
• Enhanced support by 3rd party software – e.g. Microstrategy
Query federation
Data never lives in isolation
• Either as a landing zone or a queryable archive it is desirable to
query data across Hadoop and active Data warehouses
Big SQL provides the ability to query heterogeneous systems
• Join Hadoop to other relational databases
• Query optimizer understands capabilities of external system
– Including available statistics
• As much work as possible is pushed to each system to process
[Diagram: the Big SQL head node coordinates workers on the compute nodes, which can join Hadoop data with external data sources]
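A hedged sketch of what federation setup could look like, using DB2-style server and nickname objects (server name, credentials, and tables are invented; the exact DDL surfaced by Big SQL 3.0 is an assumption here):

-- Register the remote warehouse and expose one of its tables locally
CREATE WRAPPER DRDA;
CREATE SERVER dwh TYPE DB2/UDB VERSION '10.5' WRAPPER DRDA
  AUTHORIZATION "myuser" PASSWORD "mypassword"
  OPTIONS (DBNAME 'SAMPLE');
CREATE NICKNAME dwh_customers FOR dwh.myschema.customers;

-- Join Hadoop data to warehouse data; eligible work is pushed
-- down to the remote system by the optimizer
SELECT c.cust_name, COUNT(*) AS visits
FROM weblogs w                -- Hadoop table (hypothetical)
JOIN dwh_customers c          -- nickname over the remote table
  ON c.cust_id = w.cust_id
GROUP BY c.cust_name;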
Enterprise security
Users may be authenticated via
• Operating system
• Lightweight directory access protocol (LDAP)
• Kerberos
User authorization mechanisms include
• Full GRANT/REVOKE based security (see the sketch at the end of this slide)
• Group and role based hierarchical security
• Object level, column level, or row level (fine-grained) access controls
Auditing
• You may define audit policies and track user activity
Transport layer security (TLS)
• Protect integrity and confidentiality of data between the client and Big SQL
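The GRANT/REVOKE model noted above follows DB2 conventions; a brief hedged sketch (role, user, and table names are invented):

-- Role-based authorization on a Hadoop table
CREATE ROLE analysts;
GRANT ROLE analysts TO USER bob;
GRANT SELECT ON demo.sales TO ROLE analysts;
REVOKE SELECT ON demo.sales FROM USER alice;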
Monitoring
Comprehensive runtime monitoring infrastructure that helps
answer the question: what is going on in my system?
• SQL interfaces to the monitoring data via table functions (example below)
• Ability to drill down into more granular metrics for problem determination and/or detailed performance analysis
• Runtime statistics collected during the execution of the section for a (SQL) access
plan
• Support for event monitors to track specific types of operations and activities
• Protect against and discover unknown/suspicious behaviour by monitoring data access via the Audit facility
[Diagram: worker threads collect metrics locally in connection control blocks and push data up incrementally; a monitor query extracts data directly at the reporting level (for example, a service class)]
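A hedged sketch of the SQL monitoring interface, assuming DB2-style table functions such as MON_GET_CONNECTION are surfaced:

-- Per-connection work metrics drawn from in-memory monitor data
SELECT application_handle, rows_read, rows_returned, total_cpu_time
FROM TABLE(MON_GET_CONNECTION(CAST(NULL AS BIGINT), -2)) AS t
ORDER BY rows_read DESC;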
Performance, Benchmarking, Benchmarketing
Performance matters to customers
Benchmarking appeals to engineers to drive product innovation
Benchmarketing is used to convey performance in a memorable
and appealing way
SQL over Hadoop is in the “Wild West” of Benchmarketing
• 100x claims! Compared to what? Conforming to what rules?
The TPC (Transaction Processing Performance Council) is the
grand-daddy of all multi-vendor SQL-oriented organizations
• Formed in August, 1988
• TPC-H and TPC-DS are the most relevant to SQL over Hadoop
– R/W nature of workload not suitable for HDFS
Big Data Benchmarking Community (BDBC) formed
Power and Performance of Standard SQL
Everyone loves performance numbers, but that's not the whole story
• How much work do you have to do to achieve those numbers?
A portion of our internal performance numbers are based upon read-only
versions of TPC benchmarks
Big SQL is capable of executing
• All 22 TPC-H queries without modification
• All 99 TPC-DS queries without modification
Original Query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name

Re-written for Hive:

SELECT s_name, count(1) AS numwait
FROM
 (SELECT s_name FROM
  (SELECT s_name, t2.l_orderkey, l_suppkey,
          count_suppkey, max_suppkey
   FROM
    (SELECT l_orderkey,
            count(distinct l_suppkey) as count_suppkey,
            max(l_suppkey) as max_suppkey
     FROM lineitem
     WHERE l_receiptdate > l_commitdate
     GROUP BY l_orderkey) t2
   RIGHT OUTER JOIN
    (SELECT s_name, l_orderkey, l_suppkey
     FROM
      (SELECT s_name, t1.l_orderkey, l_suppkey,
              count_suppkey, max_suppkey
       FROM
        (SELECT l_orderkey,
                count(distinct l_suppkey) as count_suppkey,
                max(l_suppkey) as max_suppkey
         FROM lineitem
         GROUP BY l_orderkey) t1
       JOIN
        (SELECT s_name, l_orderkey, l_suppkey
         FROM orders o
         JOIN
          (SELECT s_name, l_orderkey, l_suppkey
           FROM nation n
           JOIN supplier s
             ON s.s_nationkey = n.n_nationkey
            AND n.n_name = 'INDONESIA'
           JOIN lineitem l
             ON s.s_suppkey = l.l_suppkey
           WHERE l.l_receiptdate > l.l_commitdate) l1
         ON o.o_orderkey = l1.l_orderkey
        AND o.o_orderstatus = 'F') l2
       ON l2.l_orderkey = t1.l_orderkey) a
     WHERE (count_suppkey > 1) or ((count_suppkey=1)
       AND (l_suppkey <> max_suppkey))) l3
    ON l3.l_orderkey = t2.l_orderkey) b
  WHERE (count_suppkey is null)
    OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries
Big SQL is up to 41x faster than Hive 0.12
*Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic
BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H
Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are
performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3.
Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in
a production environment. Results as of April 22, 2014
Big SQL is 10x faster than Hive 0.12 (total workload elapsed time)
Comparing Big SQL and Hive 0.12
for Decision Support Queries
* Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI
Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark
Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of
99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publically
available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results
may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production
environment. Results as of April 22, 2014
How many times faster is Big SQL than Hive 0.12?
* Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI
Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark
Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of
99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publically
available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results
may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production
environment. Results as of April 22, 2014
Max speedup of 74x
Queries sorted by speed up ratio (worst to best)
Avg speedup of 20x
BigInsights Big SQL 3.0: Summary
Big SQL provides rich, robust, standards-based SQL support for data
stored in BigInsights
• Uses IBM common client ODBC/JDBC drivers
Big SQL fully integrates with SQL applications and tools
• Existing queries run with no or few modifications*
• Existing JDBC and ODBC compliant tools can be leveraged
Big SQL provides faster and more reliable performance
• Big SQL uses more efficient access paths to the data
• Queries processed by Big SQL no longer need to use MapReduce
• Big SQL is optimized to more efficiently move data over the network
Big SQL provides enterprise-grade data management
• Security, Auditing, workload management …
Conclusion
Today, it seems, performance numbers are the name of the game
But in reality there is so much more…
• How rich is the SQL?
• How difficult is it to (re-)use your existing SQL?
• How secure is your data?
• Is your data still open for other uses on Hadoop?
• Can your queries span your enterprise?
• Can other Hadoop workloads co-exist in harmony?
• …
With Big SQL 3.0 performance doesn't mean compromise
Questions?
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 

Último (20)

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

Big SQL 3.0 - Toronto Meetup -- May 2014

  • 1. © 2014 IBM Corporation Datawarehouse-grade SQL on Hadoop 3.03.03.0 Big!Big! Scott C. Gray (sgray@us.ibm.com) Hebert Pereyra (pereyra@ca.ibm.com)
  • 2. Please Note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  • 3. Agenda Brief introduction to Hadoop Why SQL on Hadoop? SQL-on-Hadoop landscape What is Hive? Big SQL 3.0 • What is it? • SQL capabilities • Architecture • Application portability and integration • Enterprise capabilities • Performance Conclusion 2
  • 4. What is Hadoop? Hadoop is not a piece of software, you can't install "hadoop" It is an ecosystem of software that work together • Hadoop Core (API's) • HDFS (File system) • MapReduce (Data processing framework) • Hive (SQL access) • HBase (NoSQL database) • Sqoop (Data movement) • Oozie (Job workflow) • …. There are is a LOT of "Hadoop" software However, there is one common component they all build on: HDFS… • *Not exactly 100% true but 99.999% true
  • 5. The Hadoop Filesystem (HDFS) Driving principals • Files are stored across the entire cluster • Programs are brought to the data, not the data to the program Distributed file system (DFS) stores blocks across the whole cluster • Blocks of a single file are distributed across the cluster • A given block is typically replicated as well for resiliency • Just like a regular file system, the contents of a file is up to the application 10110100 10100100 11100111 11100101 00111010 01010010 11001001 01010011 00010100 10111010 11101011 11011011 01010110 10010101 00101010 10101110 01001101 01110100 Logical File 1 2 3 4 Blocks 1 Cluster 1 1 2 2 2 3 3 34 4 4
  • 6. Data processing on Hadoop Hadoop (HDFS) doesn't dictate file content/structure • It is just a filesystem! • It looks and smells almost exactly like the filesystem on your laptop • Except, you can ask it "where does each block of my file live?" The entire Hadoop ecosystem is built around that question! • Parallelize work by sending your programs to the data • Each copy processes a given block of the file • Other nodes may be chosen to aggregate together computed results 10110100 10100100 11100111 11100101 00111010 01010010 11001001 01010011 00010100 10111010 11101011 11011011 01010110 10010101 1 2 3 Logical File Splits 1 Cluster 23 App (Read) App (Read) App (Read) App (Compute) Result
  • 7. Hadoop MapReduce MapReduce is a way of writing parallel processing programs • Leverages the design of the HDFS filesystem Programs are written in two pieces: Map and Reduce Programs are submitted to the MapReduce job scheduler (JobTracker) • The scheduler looks at the blocks of input needed for the job (the "splits") • For each split, tries to schedule the processing on a host holding the split • Hosts are chosen based upon available processing resources Program is shipped to a host and given a split to process Output of the program is written back to HDFS
  • 8. MapReduce - Mappers Mappers • Small program (typically), distributed across the cluster, local to data • Handed a portion of the input data (called a split) • Each mapper parses, filters, or transforms its input • Produces grouped <key,value> pairs [Diagram: blocks of a logical input file feed parallel map/sort tasks; their output is later copied and merged by reducers into logical output files on DFS]
  • 9. MapReduce – The Shuffle The shuffle is transparently orchestrated by MapReduce The output of each mapper is locally grouped together by key One node is chosen to process data for each unique key [Diagram: sorted map output is copied across the cluster and merged at the reducer responsible for each key]
  • 10. MapReduce - Reduce Reducers • Small programs (typically) that aggregate all of the values for the key that they are responsible for • Each reducer writes output to its own file [Diagram: reduce phase, in which merged, keyed records are aggregated by reducers and written back to DFS as one logical output file per reducer]
  • 11. Why SQL for Hadoop? Hadoop is designed for any data • Doesn't impose any structure • Extremely flexible At the lowest levels it is API based • Requires strong programming expertise • Steep learning curve • Even simple operations can be tedious Yet many, if not most, use cases deal with structured data! • e.g. aging old warehouse data into a queryable archive Why not use SQL where its strengths shine? • Familiar, widely used syntax • Separation of what you want vs. how to get it • Robust ecosystem of tools [Diagram: BigInsights as (1) a pre-processing hub and landing zone for all data, (2) a queryable archive, and (3) an exploratory-analysis platform that combines with unstructured information, alongside the data warehouse, information integration, and real-time processing with Streams]
  • 12. SQL-on-Hadoop landscape The SQL-on-Hadoop landscape is changing rapidly! The offerings all have different strengths and weaknesses Many, including Big SQL, draw their basic designs from Hive…
  • 13. Then along came Hive Hive was the first SQL interface for Hadoop data • De facto standard for SQL on Hadoop • Ships with all major Hadoop distributions SQL queries are executed using MapReduce (today) • More on this later! Hive introduced several important concepts/components…
  • 14. Hive tables In most cases, a table is simply a directory (on HDFS) full of files Hive doesn't dictate the content/structure of these files • It is designed to work with existing user data • In fact, there is no such thing as a "Hive table": in Hive, Java classes define a "table" /biginsights/hive/warehouse/myschema.db/mytable/ file1 file2 … CREATE TABLE my_strange_table ( c1 string, c2 timestamp, c3 double ) ROW FORMAT SERDE "com.myco.MyStrangeSerDe" WITH SERDEPROPERTIES ( "timestamp.format" = "mm/dd/yyyy" ) INPUTFORMAT "com.myco.MyStrangeInputFormat" OUTPUTFORMAT "com.myco.MyStrangeOutputFormat"
  • 15. InputFormat and SerDe InputFormat – Hadoop concept • Defines a java class that can read from a particular data source – E.g. file format, database connection, region servers, web servers, etc. • Each InputFormat produces its own record format as output • Responsible for determining splits: how to break up the data from the data source so that work can be divided between mappers • Each table defines an InputFormat (in the catalog) that understands the table’s file structure SerDe (Serializer/Deserializer) – Hive concept • A class written to interpret the records produced by an InputFormat • Responsible for converting that record to a row (and back) • A row is a clearly defined Hive structure (an array of values) [Diagram: data file → InputFormat (records) → SerDe (rows)]
  • 16. Hive tables (cont.) For many common file formats, Hive provides a simplified syntax that just pre-selects combinations of classes and configurations create table users ( id int, office_id int ) row format delimited fields terminated by '|' stored as textfile create table users ( id int, office_id int ) row format serde 'org.apache.hive…LazySimpleSerde' with serdeproperties ( 'field.delim' = '|' ) inputformat 'org.apache.hadoop.mapred.TextInputFormat' outputformat 'org.apache.hadoop.mapred.TextOutputFormat'
  • 17. Hive partitioned tables Most table types can be partitioned Partitioning is on one or more columns Each unique value becomes a partition Query predicates can be used to eliminate scanned partitions CREATE TABLE demo.sales ( part_id int, part_name string, qty int, cost double ) PARTITIONED BY ( state char(2) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'; [Diagram: .../hive/warehouse/demo.db/sales contains one subdirectory per partition value (state=NJ, state=AR, state=CA, state=NY), each holding its own data files] select * from demo.sales where state in ('NJ', 'CA');
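To make this concrete, partitions can be populated with dynamic partitioning, and a predicate on the partitioning column then restricts the scan to the matching directories. A minimal HiveQL-style sketch, assuming the demo.sales table above and a hypothetical staging_sales source table:

    -- dynamic partitioning may need to be enabled first in Hive
    set hive.exec.dynamic.partition.mode=nonstrict;

    -- each distinct value of state lands in its own state=XX subdirectory
    insert into table demo.sales partition (state)
    select part_id, part_name, qty, cost, state
    from staging_sales;

    -- only the state=NJ and state=CA directories are scanned
    select part_name, sum(qty)
    from demo.sales
    where state in ('NJ', 'CA')
    group by part_name;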
  • 18. Hive MetaStore Hive maintains a centralized database of metadata • Typically stored in a traditional RDBMS Contains table definitions • Location (directory on HDFS) • Column names and types • Partition information • Classes used to read/write the table • Etc. Security • Groups, roles, permissions
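The metastore contents can also be inspected directly from the SQL layer. A small sketch using Hive's DESCRIBE FORMATTED statement (the exact output layout varies by version), assuming the demo.sales table from the previous slide:

    describe formatted demo.sales;
    -- among other metadata, this reports:
    --   Location:       hdfs://.../warehouse/demo.db/sales
    --   SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    --   InputFormat:    org.apache.hadoop.mapred.TextInputFormat
    --   Partition keys: state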
  • 19. Query processing in Hive Up until version 0.13 (just released), Hive used MapReduce for query processing • They are moving to a new framework called "Tez" It is useful to understand query processing in MapReduce in order to see how Big SQL tackles the same problem…
  • 20. Hive – Joins in MapReduce For joins, MR is used to group data together at the same reducer based upon the join key • Mappers read blocks from each “table” in the join • The <key> is the value of the join key, the <value> is the record to be joined • Reducer receives a mix of records from each table with the same join key • Reducers produce the results of the join [Diagram: mappers read blocks of employees and depts; one reducer per department produces the joined rows] select e.fname, e.lname, d.dept_name from employees e, depts d where e.salary > 30000 and d.dept_id = e.dept_id
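When one side of the join is small enough to fit in memory, Hive can avoid the shuffle entirely by replicating the small table to every mapper (a map-side join). A hedged sketch using the standard Hive MAPJOIN hint with the tables from the slide:

    -- depts is broadcast to each mapper; no reduce phase is needed
    select /*+ MAPJOIN(d) */ e.fname, e.lname, d.dept_name
    from employees e
    join depts d on d.dept_id = e.dept_id
    where e.salary > 30000;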
  • 21. N-way Joins in MapReduce For N-way joins involving different join keys, multiple jobs are used select e.fname, e.lname, d.dept_name, p.phone_type, p.phone_number from employees e, depts d, emp_phones p where e.salary > 30000 and d.dept_id = e.dept_id and p.emp_id = e.emp_id [Diagram: a first MR job joins employees to depts on dept_id and writes temp files; a second job joins that result to emp_phones on emp_id to produce the final results]
  • 22. Agenda Brief introduction to Hadoop Why SQL on Hadoop? What is Hive? SQL-on-Hadoop landscape Big SQL 3.0 • What is it? • SQL capabilities • Architecture • Application portability and integration • Enterprise capabilities • Performance Conclusion 21
  • 23. Big SQL 3.0 – At a glance Available for POWER Linux (Redhat) and Intel x64 Linux (Redhat/SUSE) 11-Apr-2014
  • 24. Open processing As with Hive, Big SQL applies SQL to your existing data • No proprietary storage format A "table" is simply a view on your Hadoop data Table definitions are shared with Hive • The Hive Metastore catalogs table definitions • Reading/writing data logic is shared with Hive • Definitions can be shared across the Hadoop ecosystem Sometimes SQL isn't the answer! • Use the right tool for the right job [Diagram: Pig, Sqoop, and Big SQL all reach the Hadoop cluster through the Hive APIs and share the Hive Metastore]
  • 25. Creating tables in Big SQL Big SQL syntax is derived from Hive's syntax with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null, salary timestamp(3) null, constraint fk_ofc foreign key (office_id) references office (office_id) ) row format delimited fields terminated by '|' stored as textfile;
  • 26. Creating tables in Big SQL Big SQL syntax is derived from Hive's syntax with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null, salary timestamp(3) null, constraint fk_ofc foreign key (office_id) references office (office_id) ) row format delimited fields terminated by '|' stored as textfile; Hadoop Keyword • Big SQL requires the HADOOP keyword • Big SQL has internal traditional RDBMS table support • Stored only at the head node • Does not live on HDFS • Supports full ACID capabilities • Not usable for "big" data • The HADOOP keyword identifies the table as living on HDFS
  • 27. Creating tables in Big SQL Big SQL syntax is derived from Hive's syntax with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null, salary timestamp(3) null, constraint fk_ofc foreign key (office_id) references office (office_id) ) row format delimited fields terminated by '|' stored as textfile; Nullability Indicators • Enforced on read • Used by query optimizer for smarter rewrites
  • 28. Creating tables in Big SQL Big SQL syntax is derived from Hive's syntax with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null, salary timestamp(3) null, constraint fk_ofc foreign key (office_id) references office (office_id) ) row format delimited fields terminated by '|' stored as textfile; Constraints • Unenforced • Useful as documentation and to drive query builders • Used by query optimizer for smarter rewrites
  • 29. Creating tables in Big SQL Big SQL syntax is derived from Hive's syntax with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null, salary timestamp(3) null, constraint fk_ofc foreign key (office_id) references office (office_id) ) row format delimited fields terminated by '|' stored as textfile; Extended Data Types • Data types derived from base Hive data types • Can provide additional constraints on the Hive type
  • 30. Table types Big SQL supports many of the "standard" Hadoop storage formats • Text delimited • Text delimited sequence files • Binary delimited sequence files • Parquet • RC • ORC • Avro Each has different features/advantages/disadvantages Custom file formats may be supported as well via custom java classes
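Switching between these formats is largely a matter of the STORED AS clause. A minimal sketch (the users_parquet table name is illustrative, not from the deck):

    create hadoop table users_parquet (
        id        int not null,
        office_id int,
        fname     varchar(30),
        lname     varchar(30)
    )
    stored as parquetfile;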
  • 31. Populating Big SQL tables There are a number of ways to populate tables Tables can be defined against existing data • All validation is performed at query time Rows can be directly inserted into tables • Data validation is performed and rows are converted to the storage format • Only suitable for testing • Produces one physical data file per call to INSERT create external hadoop table csv_data ( c1 int not null primary key, c2 varchar(20) null ) row format delimited fields terminated by ',' stored as textfile location '/user/bob/csv_data' insert into t1 values (5, 'foo'), (6, 'bar'), (7, 'baz')
  • 32. Populating Big SQL tables (cont.) Tables can be populated from other tables Tables can be created from other tables • Great way to convert between storage types or partition data insert into top_sellers select employee_id, rank() over (order by sales) from ( select employee_id, sum(sales) sales from product_sales group by employee_id ) limit 10; create hadoop table partitioned_sales partitioned by (dept_id int not null) stored as rcfile as select emp_id, prod_id, qty, cost, dept_id from sales
  • 33. Populating Big SQL tables (cont.) The LOAD HADOOP statement is used to populate Hadoop tables from an external data source • The statement runs on the cluster – it cannot access data at the client • Nodes of the cluster ingest data in parallel • Performs data validation during load • Performs data conversion (to the storage format) during load Supports the following sources of data • Any valid Hadoop URL (e.g. hdfs://, sftp://, etc.) • JDBC data sources (e.g. Oracle, DB2, Netezza, etc.)
  • 34. Loading from URL Data may be loaded from delimited files read via any valid URL • If no URI scheme is provided, HDFS is assumed: Example loading via SFTP: Just remember LOAD HADOOP executes on the cluster • So file:// will be local to the node chosen to run the statement LOAD HADOOP USING FILE URL '/user/biadmin/mydir/abc.csv' WITH SOURCE PROPERTIES( 'field.delimiter'=',', 'date.time.format'='yyyy-MM-dd-HH.mm.ss.S') LOAD HADOOP USING FILE URL 'sftp://biadmin.biadmin@myserver.abc.com:22/home/biadmin/mydir' LOAD HADOOP USING FILE URL 'file:///path/to/myfile/file.csv'
  • 35. Loading from JDBC data source A JDBC URL may be used to load directly from external data source • Tested internally against Oracle, Teradata, DB2, and Netezza It supports many options to partition the extraction of data • Providing a table and partitioning column • Providing a query and a WHERE clause to use for partitioning Example usage: LOAD USING JDBC CONNECTION URL 'jdbc:db2://myhost:50000/SAMPLE' WITH PARAMETERS ( user = 'myuser', password='mypassword' ) FROM TABLE STAFF WHERE "dept=66 and job='Sales'" INTO TABLE staff_sales PARTITION ( dept=66 , job='Sales') APPEND WITH LOAD PROPERTIES (bigsql.load.num.map.tasks = 1) ;
  • 36. SQL capabilities Leverage IBM's rich SQL heritage, expertise, and technology • SQL standards compliant query support • SQL bodied functions and stored procedures – Encapsulate your business logic and security at the server • DB2 compatible SQL PL support – Cursors – Anonymous blocks (batches of statements) – Flow of control (if/then/else, error handling, prepared statements, etc.) The same SQL you use on your data warehouse should run with few or no modifications
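As a flavor of what the SQL PL support makes possible, here is a minimal, hypothetical sketch of a stored procedure with flow of control (the procedure, table, and column names are illustrative, not from the deck):

    CREATE PROCEDURE flag_top_sellers (IN min_sales INTEGER)
    LANGUAGE SQL
    BEGIN
       DECLARE n INTEGER;
       -- count qualifying rows, then record an audit entry if any exist
       SELECT COUNT(*) INTO n FROM product_sales WHERE sales > min_sales;
       IF n > 0 THEN
          INSERT INTO sales_audit VALUES (CURRENT TIMESTAMP, n);
       END IF;
    END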
  • 37. SQL capability highlights Full support for subqueries • In SELECT, FROM, WHERE and HAVING clauses • Correlated and uncorrelated • Equality, non-equality subqueries • EXISTS, NOT EXISTS, IN, ANY, SOME, etc. All standard join operations • Standard and ANSI join syntax • Inner, outer, and full outer joins • Equality, non-equality, cross join support • Multi-value join • UNION, INTERSECT, EXCEPT SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey ) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate ) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name;
  • 38. SQL capability highlights (cont.) Extensive analytic capabilities • Grouping sets with CUBE and ROLLUP • Standard OLAP operations • Analytic aggregates LEAD LAG RANK DENSE_RANK ROW_NUMBER RATIO_TO_REPORT FIRST_VALUE LAST_VALUE CORRELATION COVARIANCE STDDEV VARIANCE REGR_AVGX REGR_AVGY REGR_COUNT REGR_INTERCEPT REGR_ICPT REGR_R2 REGR_SLOPE REGR_SXX REGR_SXY REGR_SYY WIDTH_BUCKET VAR_SAMP VAR_POP STDDEV_POP STDDEV_SAMP COVAR_SAMP COVAR_POP NTILE
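A brief illustration of the style of analysis these enable, using hypothetical table and column names:

    -- sales totals with subtotals and a grand total via ROLLUP
    select dept_id, state, sum(sales) as total_sales
    from product_sales
    group by rollup (dept_id, state);

    -- rank employees by sales within each department
    select dept_id, employee_id,
           rank() over (partition by dept_id order by sales desc) as sales_rank
    from product_sales;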
  • 39. Architected for performance Architected from the ground up for low latency and high throughput MapReduce replaced with a modern MPP architecture • Compiler and runtime are native code (not java) • Big SQL worker daemons live directly on the cluster – Continuously running (no startup latency) – Processing happens locally at the data • Message passing allows data to flow directly between nodes Operations occur in memory with the ability to spill to disk • Supports aggregations and sorts larger than available RAM [Diagram: SQL-based applications connect through the IBM data server client to the Big SQL engine (SQL compiler plus MPP runtime) on InfoSphere BigInsights, reading CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom data sources]
  • 40. Big SQL 3.0 – Architecture Head (coordinator) node • Listens for JDBC/ODBC connections • Compiles and optimizes the query • Coordinates the execution of the query Big SQL worker processes reside on compute nodes (some or all) Worker nodes stream data between each other as needed [Diagram: management nodes host Big SQL, the Hive Metastore, the Name Node, and the Job Tracker; each compute node runs a Task Tracker, a Data Node, and a Big SQL worker over GPFS/HDFS] 39
  • 41. Extreme parallelism Massively parallel SQL engine that replaces MR Shared-nothing architecture that eliminates scalability and networking issues Engine pushes processing out to data nodes to maximize data locality. Hadoop data accessed natively via C++ and Java readers and writers. Inter- and intra-node parallelism where work is distributed to multiple worker nodes and on each node multiple worker threads collaborate on the I/O and data processing (scale out horizontally and scale up vertically) Intelligent data partition elimination based on SQL predicates Fault tolerance through active health monitoring and management of parallel data and worker nodes
  • 42. A process model view of Big SQL 3.0
  • 43. Big SQL 3.0 – Architecture (cont.) Big SQL's runtime execution engine is all native code For common table formats a native I/O engine is utilized • e.g. delimited, RC, SEQ, Parquet, … For all others, a java I/O engine is used • Maximizes compatibility with existing tables • Allows for custom file formats and SerDes All Big SQL built-in functions are native code Customer-built UDxs can be developed in C++ or Java Maximize performance without sacrificing extensibility [Diagram: each Big SQL worker hosts the native I/O engine, a Java I/O engine (SerDes and I/O formats), and runtimes for both native and Java UDFs] 42
  • 44. Big SQL 3.0 works with Hadoop All data is Hadoop data • In files in HDFS • SEQ, RC, delimited, Parquet … Never need to copy data to a proprietary representation All data is cataloged in the Hive metastore • It is the Hadoop catalog • It is flexible and extensible All Hadoop data is in a Hadoop filesystem • HDFS or GPFS-FPO 43
  • 45. Resource management Big SQL doesn't run in isolation Nodes tend to be shared with a variety of Hadoop services • Task tracker • Data node • HBase region servers • MapReduce jobs • etc. Big SQL can be constrained to limit its footprint on the cluster • % of CPU utilization • % of memory utilization Resources are automatically adjusted based upon workload • Always fitting within constraints • Self-tuning memory manager that re-distributes resources across components dynamically • Default WLM concurrency control for heavy queries [Diagram: a compute node sharing resources among the Task Tracker, Data Node, HBase, MapReduce tasks, and Big SQL]
  • 46. Performance Query rewrites • Exhaustive query rewrite capabilities • Leverages additional metadata such as constraints and nullability Optimization • Statistics and heuristic driven query optimization • Query optimizer based upon decades of IBM RDBMS experience Tools and metrics • Highly detailed explain plans and query diagnostic tools • Extensive number of available performance metrics SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST) FROM PERIOD, DAILY_SALES, PRODUCT, STORE WHERE PERIOD.PERKEY=DAILY_SALES.PERKEY AND PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND STORE.STOREKEY=DAILY_SALES.STOREKEY AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND STORE_NUMBER='03' AND CATEGORY=72 GROUP BY ITEM_DESC [Diagram: query transformation applies dozens of rewrites, then access plan generation weighs hundreds or thousands of alternatives, e.g. different join orders and methods (NLJOIN, HSJOIN, ZZJOIN) over Store, Product, Period, and Daily Sales]
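The chosen access plan can be captured with DB2-style explain facilities; a sketch, assuming the explain tables have already been created in the Big SQL database (the query reuses the slide's tables):

    EXPLAIN PLAN FOR
    SELECT item_desc, SUM(quantity_sold)
    FROM daily_sales, product
    WHERE product.prodkey = daily_sales.prodkey
    GROUP BY item_desc;

The populated explain tables can then be formatted into a readable plan report with tooling such as db2exfmt.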
  • 47. Application portability and integration Big SQL 3.0 adopts IBM's standard Data Server Client Drivers • Robust, standards compliant ODBC, JDBC, and .NET drivers • Same driver used for DB2 LUW, DB2/z and Informix • Expands support to numerous languages (Python, Ruby, Perl, etc.) Putting the story together…. • Big SQL shares a common SQL dialect with DB2 • Big SQL shares the same client drivers with DB2 • Data warehouse augmentation just got significantly easier [Diagram: compatible SQL plus compatible drivers yields portable applications]
  • 48. Application portability and integration (cont.) This compatibility extends beyond your own applications Open integration across Business Analytic Tools • IBM Optim Data Studio performance tool portfolio • Superior enablement for IBM Software – e.g. Cognos • Enhanced support by 3rd party software – e.g. Microstrategy
  • 49. Query federation Data never lives in isolation • Either as a landing zone or a queryable archive, it is desirable to query data across Hadoop and active data warehouses Big SQL provides the ability to query heterogeneous systems • Join Hadoop to other relational databases • Query optimizer understands capabilities of the external system – Including available statistics • As much work as possible is pushed to each system to process [Diagram: the Big SQL head node coordinating Big SQL workers on each compute node (Task Tracker, Data Node, Big SQL)]
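Under the covers this follows the DB2 federation model; a minimal, hypothetical sketch (the server, user, and table names are illustrative, and whether Big SQL 3.0 exposes exactly this DDL is an assumption):

    -- register the remote warehouse and map a local user to it
    CREATE WRAPPER drda;
    CREATE SERVER sales_dw TYPE DB2/UDB VERSION '10.5' WRAPPER drda
       AUTHORIZATION "feduser" PASSWORD "********"
       OPTIONS (DBNAME 'SALESDB');
    CREATE USER MAPPING FOR bob SERVER sales_dw
       OPTIONS (REMOTE_AUTHID 'bob', REMOTE_PASSWORD '********');

    -- expose a warehouse table locally and join it to Hadoop data
    CREATE NICKNAME dw_regions FOR sales_dw.APP.REGIONS;
    SELECT h.part_name, r.region_name
    FROM demo.sales h JOIN dw_regions r ON r.state = h.state;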
  • 50. Enterprise security Users may be authenticated via • Operating system • Lightweight directory access protocol (LDAP) • Kerberos User authorization mechanisms include • Full GRANT/REVOKE based security • Group and role based hierarchical security • Object level, column level, or row level (fine-grained) access controls Auditing • You may define audit policies and track user activity Transport layer security (TLS) • Protect integrity and confidentiality of data between the client and Big SQL
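A small sketch of the authorization styles listed above, using illustrative object names (the row-level example uses DB2-style row access control; its applicability to Hadoop tables exactly as shown is an assumption):

    -- role-based, GRANT/REVOKE security
    CREATE ROLE analysts;
    GRANT SELECT ON demo.sales TO ROLE analysts;

    -- fine-grained (row level) access control
    CREATE PERMISSION nj_only ON demo.sales
       FOR ROWS WHERE state = 'NJ'
       ENFORCED FOR ALL ACCESS
       ENABLE;
    ALTER TABLE demo.sales ACTIVATE ROW ACCESS CONTROL;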
  • 51. Monitoring Comprehensive runtime monitoring infrastructure that helps answer the question: what is going on in my system? • SQL interfaces to the monitoring data via table functions • Ability to drill down into more granular metrics for problem determination and/or detailed performance analysis • Runtime statistics collected during the execution of the section for a (SQL) access plan • Support for event monitors to track specific types of operations and activities • Protect against and discover unknown/suspicious behaviour by monitoring data access via the Audit facility [Diagram: worker threads collect metrics locally and push data up incrementally to a reporting level (e.g. service class), from which a monitor query can extract it directly]
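Because the monitoring interfaces are table functions, metrics can be queried with plain SQL. A sketch using a DB2-style monitoring function (its availability under Big SQL exactly as shown is an assumption):

    -- one row per connection, with basic work counters
    SELECT application_handle, rows_read, rows_returned
    FROM TABLE(MON_GET_CONNECTION(NULL, -1)) AS t;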
  • 52. Performance, Benchmarking, Benchmarketing Performance matters to customers Benchmarking appeals to Engineers to drive product innovation Benchmarketing used to convey performance in a memorable and appealing way SQL over Hadoop is in the “Wild West” of Benchmarketing • 100x claims! Compared to what? Conforming to what rules? The TPC (Transaction Processing Performance Council) is the grand-daddy of all multi-vendor SQL-oriented organizations • Formed in August, 1988 • TPC-H and TPC-DS are the most relevant to SQL over Hadoop – R/W nature of workload not suitable for HDFS Big Data Benchmarking Community (BDBC) formed 51
  • 53. Power and Performance of Standard SQL Everyone loves performance numbers, but that's not the whole story • How much work do you have to do to achieve those numbers? A portion of our internal performance numbers are based upon read-only versions of TPC benchmarks Big SQL is capable of executing • All 22 TPC-H queries without modification • All 99 TPC-DS queries without modification Original Query: SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name Re-written for Hive: SELECT s_name, count(1) AS numwait FROM (SELECT s_name FROM (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem WHERE l_receiptdate > l_commitdate GROUP BY l_orderkey) t2 RIGHT OUTER JOIN (SELECT s_name, l_orderkey, l_suppkey FROM (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem GROUP BY l_orderkey) t1 JOIN (SELECT s_name, l_orderkey, l_suppkey FROM orders o JOIN (SELECT s_name, l_orderkey, l_suppkey FROM nation n JOIN supplier s ON s.s_nationkey = n.n_nationkey AND n.n_name = 'INDONESIA' JOIN lineitem l ON s.s_suppkey = l.l_suppkey WHERE l.l_receiptdate > l.l_commitdate) l1 ON o.o_orderkey = l1.l_orderkey AND o.o_orderstatus = 'F') l2 ON l2.l_orderkey = t1.l_orderkey) a WHERE (count_suppkey > 1) or ((count_suppkey=1) AND (l_suppkey <> max_suppkey))) l3 ON l3.l_orderkey = t2.l_orderkey) b WHERE (count_suppkey is null) OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c GROUP BY s_name ORDER BY numwait DESC, s_name 52
  • 54. 53 Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries Big SQL is up to 41x faster than Hive 0.12 *Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  • 55. Big SQL is 10x faster than Hive 0.12 (total workload elapsed time) 54 Comparing Big SQL and Hive 0.12 for Decision Support Queries * Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  • 56. How many times faster is Big SQL than Hive 0.12? * Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014 Max Speedup of 74x 55 Queries sorted by speedup ratio (worst to best) Avg Speedup of 20x
  • 57. BigInsights Big SQL 3.0: Summary Big SQL provides rich, robust, standards-based SQL support for data stored in BigInsights • Uses IBM common client ODBC/JDBC drivers Big SQL fully integrates with SQL applications and tools • Existing queries run with no or few modifications* • Existing JDBC and ODBC compliant tools can be leveraged Big SQL provides faster and more reliable performance • Big SQL uses more efficient access paths to the data • Queries processed by Big SQL no longer need to use MapReduce • Big SQL is optimized to more efficiently move data over the network Big SQL provides enterprise-grade data management • Security, auditing, workload management … 56
  • 58. Conclusion Today, it seems, performance numbers are the name of the game But in reality there is so much more… • How rich is the SQL? • How difficult is it to (re-)use your existing SQL? • How secure is your data? • Is your data still open for other uses on Hadoop? • Can your queries span your enterprise? • Can other Hadoop workloads co-exist in harmony? • … With Big SQL 3.0 performance doesn't mean compromise