Big SQL: A Technical Introduction
Created by C. M. Saracco, IBM Silicon Valley Lab
June 2016
Executive summary
 What is Big SQL?
 Industry-standard SQL query interface for BigInsights data
 New Hadoop query engine derived from decades of IBM R&D investment in RDBMS
technology, including database parallelism and query optimization
 Why Big SQL?
 Easy on-ramp to Hadoop for SQL professionals
 Support familiar SQL tools / applications (via JDBC and ODBC drivers)
 What operations are supported?
 Create tables / views. Store data in DFS, HBase, or Hive warehouse
 Load data into tables (from local files, remote files, RDBMSs)
 Query data (project, restrict, join, union, wide range of sub-queries, wide range of built-in
functions, UDFs, . . . . )
 UPDATE/DELETE data (in Big SQL HBase tables)
 GRANT / REVOKE privileges, create roles, create column masks and row permissions
 Transparently join / union data between Hadoop and RDBMSs in single query
 Collect statistics and inspect detailed data access plan
 Establish workload management controls
 Monitor Big SQL usage
 . . . .
Agenda
 Big SQL overview
 Motivation
 Architecture
 Distinguishing characteristics
 Using Big SQL: the basics
 Invocation options
 Creating tables and views
 Populating tables with data
 Querying data
 Big SQL: beyond the basics
 Performance
 Fine-grained access control
 SerDes (serializers / deserializers)
 . . .
SQL access for Hadoop: Why?
 Data warehouse modernization is
a leading Hadoop use case
 Off-load “cold” warehouse data into query-ready Hadoop platform
 Explore / transform / analyze / aggregate
social media data, log records, etc. and upload
summary data to warehouse
 Limited availability of skills in
MapReduce, Pig, etc.
 SQL opens the data to a much wider
audience
 Familiar, widely known syntax
 Common catalog for identifying data and
structure
2012 Big Data @ Work Study surveying 1144
business and IT professionals in 95 countries
What is Big SQL?
[Diagram: a SQL-based application connects via the IBM data server client to the Big SQL engine (SQL MPP run-time) over data stored in DFS, all within BigInsights]
 Comprehensive, standard SQL
– SELECT: joins, unions, aggregates, subqueries . . .
– UPDATE/DELETE (HBase-managed tables)
– GRANT/REVOKE, INSERT … INTO
– SQL procedural logic (SQL PL)
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
 Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuous running daemons (no start up latency)
– Message passing allows data to flow between nodes
without persisting intermediate results
– In-memory operations with ability to spill to disk (useful
for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
 Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
 Integration with RDBMSs via LOAD, query
federation
Big SQL architecture
 Head (coordinator / management) node
 Listens for JDBC/ODBC connections
 Compiles and optimizes the query
 Coordinates the execution of the query . . . . Analogous to Job Tracker for Big SQL
 Optionally stores user data in a traditional RDBMS table (single node only). Useful for some reference data.
 Big SQL worker processes reside on compute nodes (some or all)
 Worker nodes stream data between each other as needed
 Workers can spill large data sets to local disk if needed
 Allows Big SQL to work with data sets larger than available memory
Distinguishing characteristics
 Data shared with Hadoop ecosystem; comprehensive file format support; superior enablement of IBM and third-party software
 Modern MPP runtime; powerful SQL query rewriter; cost-based optimizer; optimized for concurrent user throughput; results not constrained by memory
 Distributed requests to multiple data sources within a single SQL statement. Main data sources supported: DB2, Teradata, Oracle, Netezza, Informix, SQL Server
 Advanced security/auditing; resource and workload management; self-tuning memory management; comprehensive monitoring
 Comprehensive SQL support; IBM SQL PL compatibility; extensive analytic functions
Agenda
 Big SQL overview
 Motivation
 Architecture
 Distinguishing characteristics
 Using Big SQL: the basics
 Invocation options
 Creating tables and views
 Populating tables with data
 Querying data
 Big SQL: beyond the basics
 Performance
 Fine-grained access control
 SerDes (serializers / deserializers)
 . . .
Invocation options
 Command-line interface:
Java SQL Shell (JSqsh)
 Web tooling (Data Server
Manager)
 Tools that support IBM
JDBC/ODBC driver
Big SQL web tooling (Data Server Manager)
 Invoked from BigInsights Home
 Develop and execute Big SQL, monitor database, etc.
Creating a Big SQL table
 Standard CREATE TABLE DDL with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;
Worth noting:
• “Hadoop” keyword creates table in DFS
• Row format delimited and textfile formats are default
• Constraints not enforced (but useful for query optimization)
• Examples in these charts focus on DFS storage, both within or external to Hive
warehouse. HBase examples provided separately
Results from previous CREATE TABLE . . .
 Data stored in subdirectory of Hive warehouse
. . . /hive/warehouse/myid.db/users
 Default schema is user ID. Can create new schemas
 “Table” is just a subdirectory under schema.db
 Table’s data are files within table subdirectory
 Metadata collected (Big SQL & Hive)
 SYSCAT.* and SYSHADOOP.* views
 Optionally, use LOCATION clause of CREATE TABLE to layer Big SQL
schema over existing DFS directory contents
 Useful if table contents already in DFS
 Avoids need to LOAD data into Hive
 Example below (the SerDe example later also uses this clause)
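For instance, a minimal sketch (table name, columns, and directory path are illustrative, not from the original deck):
-- Illustrative sketch: layer a Big SQL schema over files already in DFS.
-- EXTERNAL means DROP TABLE won't remove the underlying directory or data.
create external hadoop table users_ext
( id int,
  fname varchar(30),
  lname varchar(30))
row format delimited
fields terminated by '|'
stored as textfile
location '/user/biadmin/data/users';   -- existing DFS directory (illustrative path)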
CREATE VIEW
 Standard SQL syntax
create view my_users as
select fname, lname from biadmin.users where id > 100;
Populating tables via LOAD
 Typically best runtime performance
 Load data from local or remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
 Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
Populating tables via INSERT
 INSERT INTO . . . SELECT FROM . . .
 Parallel read and write operations
CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL, product_name VARCHAR(150),
Quantity INT, order_method_en VARCHAR(90) )
STORED AS parquetfile;
-- source tables do not need to be in Parquet format
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
and sales.quantity > 5500;
 INSERT INTO . . . VALUES(. . . )
 Not parallelized. 1 file per INSERT. Not recommended except for quick tests
CREATE HADOOP TABLE foo (col1 int, col2 varchar(10));
INSERT INTO foo VALUES (1, 'hello');
CREATE . . . TABLE . . . AS SELECT . . .
 Create a Big SQL table based on contents of other table(s)
 Source tables can be in different file formats or use different
underlying storage mechanisms
-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, extern.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;
SQL capability highlights
 Query operations
 Projections, restrictions
 UNION, INTERSECT, EXCEPT
 Wide range of built-in functions (e.g. OLAP)
 Various Oracle, Netezza compatibility items
 Full support for subqueries
 In SELECT, FROM, WHERE and
HAVING clauses
 Correlated and uncorrelated
 Equality, non-equality subqueries
 EXISTS, NOT EXISTS, IN, ANY,
SOME, etc.
 All standard join operations
 Standard and ANSI join syntax
 Inner, outer, and full outer joins
 Equality, non-equality, cross join support
 Multi-value join
 Stored procedures, user-defined
functions, user-defined aggregates
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate >
l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
Power of standard SQL
 Everyone loves performance numbers, but that's not the whole story
 How much work do you have to do to achieve those numbers?
 A portion of our internal performance numbers are based upon industry
standard benchmarks
 Big SQL is capable of executing
 All 22 TPC-H queries without modification
 All 99 TPC-DS queries without modification
Original query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (
SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name

Re-written query (as required by engines lacking full subquery support):

SELECT s_name, count(1) AS numwait
FROM
(SELECT s_name FROM
(SELECT s_name, t2.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
WHERE l_receiptdate > l_commitdate
GROUP BY l_orderkey) t2
RIGHT OUTER JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM
(SELECT s_name, t1.l_orderkey, l_suppkey,
count_suppkey, max_suppkey
FROM
(SELECT l_orderkey,
count(distinct l_suppkey) as count_suppkey,
max(l_suppkey) as max_suppkey
FROM lineitem
GROUP BY l_orderkey) t1
JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM orders o
JOIN
(SELECT s_name, l_orderkey, l_suppkey
FROM nation n
JOIN supplier s
ON s.s_nationkey = n.n_nationkey
AND n.n_name = 'INDONESIA'
JOIN lineitem l
ON s.s_suppkey = l.l_suppkey
WHERE l.l_receiptdate > l.l_commitdate) l1
ON o.o_orderkey = l1.l_orderkey
AND o.o_orderstatus = 'F') l2
ON l2.l_orderkey = t1.l_orderkey) a
WHERE (count_suppkey > 1) or ((count_suppkey=1)
AND (l_suppkey <> max_suppkey))) l3
ON l3.l_orderkey = t2.l_orderkey) b
WHERE (count_suppkey is null)
OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
Agenda
 Big SQL overview
 Motivation
 Architecture
 Distinguishing characteristics
 Using Big SQL: the basics
 Invocation options
 Creating tables and views
 Populating tables with data
 Querying data
 Big SQL: beyond the basics
 Performance
 Fine-grained access control
 SerDes (serializers / deserializers)
 . . .
A word about . . . performance
 TPC (Transaction Processing Performance Council)
 Formed August 1988
 Widely recognized as most credible, vendor-independent SQL benchmarks
 TPC-H and TPC-DS are the most relevant to SQL over Hadoop
• R/W nature of other TPC workloads not suitable for HDFS
 Hadoop-DS benchmark: BigInsights, Hive, Cloudera and now Hawq
 Original benchmark run by IBM & reviewed by TPC certified auditor in Oct 2014
 Updates (not reviewed by a TPC certified auditor) available – contact your IBM account rep
 Based on TPC-DS. Key deviations
• No data maintenance or persistence phases (not supported across all vendors)
 Common set of queries across all solutions
• Subset that all vendors can successfully execute at scale factor
• Queries are not cherry picked
 Most complete TPC-DS-like benchmark executed so far
 Analogous to porting a relational workload to SQL on Hadoop
IBM first (and only) vendor to release audited benchmark
 Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale (Oct. 2014)
 InfoSizing, Transaction
Processing Performance
Council Certified Auditors
verified both IBM results
as well as results on
Cloudera Impala and
HortonWorks HIVE.
 These results are for a
non-TPC benchmark. A
subset of the TPC-DS
Benchmark standard
requirements was
implemented
http://public.dhe.ibm.com/common/ssi/ecm/im/en/imw14800usen/IMW14800USEN.PDF
A word about . . . data access plans
 Cost-based optimizer with query rewrite
technology
 Example: SELECT DISTINCT COL_PK,
COL_X . . . FROM TABLE
 Automatically rewrite query to avoid sort –
primary key constraint indicates no nulls, no
duplicates
 Transparent to programmer
 ANALYZE TABLE … to collect statistics
 Automatic or manual collection
 Efficient runtime performance
 EXPLAIN to report detailed access plan
 Example below
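A hedged sketch of both statements (schema, table, and column names are illustrative; exact ANALYZE clause support varies by Big SQL release):
-- Illustrative sketch: collect statistics for the optimizer
ANALYZE TABLE myschema.users COMPUTE STATISTICS FOR ALL COLUMNS;
-- Populate DB2-style explain tables; format the plan afterwards with db2exfmt
EXPLAIN PLAN FOR
  SELECT DISTINCT col_pk, col_x FROM myschema.mytable;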
A word about . . . column masking
1) Create and grant access and roles *
CREATE ROLE MANAGER
CREATE ROLE EMPLOYEE
GRANT SELECT ON SAL_TBL TO USER socrates
GRANT SELECT ON SAL_TBL TO USER newton
GRANT ROLE MANAGER TO USER socrates
GRANT ROLE EMPLOYEE TO USER newton
2) Create permissions *
CREATE MASK SALARY_MASK ON SAL_TBL FOR
COLUMN SALARY RETURN
CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER,'MANAGER') = 1
THEN SALARY
ELSE 0.00
END
ENABLE
3) Enable access control *
ALTER TABLE SAL_TBL ACTIVATE COLUMN ACCESS CONTROL
Data in SAL_TBL:
EMP_NO FIRST_NAME SALARY
------- ------------ -----------
1 Steve 250000
2 Chris 200000
3 Paula 1000000
4a) Select as an EMPLOYEE
CONNECT TO TESTDB USER newton
SELECT * FROM SAL_TBL
EMP_NO FIRST_NAME SALARY
------- ------------ -----------
1 Steve 0
2 Chris 0
3 Paula 0
3 record(s) selected.
4b) Select as a MANAGER
CONNECT TO TESTDB USER socrates
SELECT * FROM SAL_TBL
EMP_NO FIRST_NAME SALARY
------- ------------ -----------
1 Steve 250000
2 Chris 200000
3 Paula 1000000
3 record(s) selected.
* Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
A word about . . . row-based access control
1) Create and grant access and roles *
CREATE ROLE BRANCH_A_ROLE
GRANT ROLE BRANCH_A_ROLE TO USER newton
GRANT SELECT ON BRANCH_TBL TO USER newton
2) Create permissions *
CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL
FOR ROWS WHERE(VERIFY_ROLE_FOR_USER(SESSION_USER,'BRANCH_A_ROLE') = 1
AND
BRANCH_TBL.BRANCH_NAME = 'Branch_A')
ENFORCED FOR ALL ACCESS
ENABLE
3) Enable access control *
ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL
Data in BRANCH_TBL:
EMP_NO FIRST_NAME BRANCH_NAME
------- ------------ -----------
1 Steve Branch_B
2 Chris Branch_A
3 Paula Branch_A
4 Craig Branch_B
5 Pete Branch_A
6 Stephanie Branch_B
7 Julie Branch_B
8 Chrissie Branch_A
4) Select as Branch_A user
CONNECT TO TESTDB USER newton
SELECT * FROM BRANCH_TBL
EMP_NO FIRST_NAME BRANCH_NAME
----------- ------------ -----------
2 Chris Branch_A
3 Paula Branch_A
5 Pete Branch_A
8 Chrissie Branch_A
4 record(s) selected.
* Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
A word about . . . data types
 Variety of primitives supported
 TINYINT, INT, DECIMAL(p,s), FLOAT, REAL, CHAR, VARCHAR,
TIMESTAMP, DATE, VARBINARY, BINARY, . . .
 Maximum length 32K (character types)
 Complex types
 ARRAY: ordered collection of elements of same type
 Associative ARRAY (equivalent to Hive MAP type): unordered collection of key/value pairs. Keys must be primitive types (consistent with Hive)
 ROW (equivalent to Hive STRUCT type): collection of elements of different types
 Nesting supported for ARRAY of ROW (STRUCT) types
 Query predicates for ARRAY or ROW columns must specify elements of a
primitive type
CREATE HADOOP TABLE mytable (id INT, info INT ARRAY[10]);
SELECT * FROM mytable WHERE info[8]=12;
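A hedged sketch of ROW and ARRAY columns (table, column, and field names are illustrative; exact ROW DDL and dereference syntax may vary by release):
-- Illustrative sketch: ROW column plus an ARRAY column
CREATE HADOOP TABLE employees
( id      INT,
  address ROW(street VARCHAR(30), city VARCHAR(25)),
  phones  VARCHAR(15) ARRAY[3] );
-- Predicates reference a primitive element, not the whole ROW or ARRAY
SELECT id FROM employees WHERE address.city = 'San Jose';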
A word about . . . SerDes
 Custom serializers / deserializers (SerDes)
 Read / write complex or “unusual” data formats (e.g., JSON)
 Commonly used by Hadoop community
 Developed by user or available publicly
 Users add SerDes to appropriate directory and reference SerDe
when creating table
 Example
-- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe
-- Location clause points to DFS dir containing JSON data
-- External clause means DFS dir & data won’t be dropped by a DROP TABLE command
create external hadoop table socialmedia-json (Country varchar(20), FeedInfo varchar(300),
. . . )
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '</hdfs_path>/myJSON';
select * from socialmedia-json;
Sample JSON input for previous example
JSON-based social media data to load into Big SQL Table socialmedia-json defined with SerDe
Sample Big SQL query output for JSON data
Sample output: Select * from socialmedia-json
A word about . . . query federation
 Data rarely lives in isolation
 Big SQL transparently queries heterogeneous systems
 Join Hadoop to RDBMSs
 Query optimizer understands capabilities of external system
• Including available statistics
 As much work as possible is pushed to each system to process
[Diagram: Big SQL head node coordinating four compute nodes, each running a Task Tracker, a Data Node, and a Big SQL worker]
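Federation is configured with DB2-style DDL. A hedged sketch, with illustrative server, mapping, and nickname names (wrapper and server types, options, and required clauses vary by data source and release):
-- Illustrative sketch: register a remote DB2 source and expose one table
CREATE WRAPPER DRDA;
CREATE SERVER mydb2srv TYPE DB2/UDB VERSION 10.5 WRAPPER DRDA
  AUTHORIZATION "myID" PASSWORD "myPassword"
  OPTIONS (HOST 'some.host.com', PORT '50000', DBNAME 'sampledb');
CREATE USER MAPPING FOR USER SERVER mydb2srv
  OPTIONS (REMOTE_AUTHID 'myID', REMOTE_PASSWORD 'myPassword');
CREATE NICKNAME db2media FOR mydb2srv.myschema.media;
-- Join Hadoop data with the remote table in a single statement
SELECT h.id, h.name
FROM media_db2table_jan h, db2media r
WHERE h.id = r.id;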
A word about . . . resource management
 Big SQL doesn't run in isolation
 Nodes tend to be shared with a variety of Hadoop services
 HBase region servers
 MapReduce jobs
 . . .
 Big SQL can be constrained to limit its footprint on the cluster
 CPU utilization
 Memory utilization
 Resources are automatically adjusted based upon workload
 Always fitting within constraints
 Self-tuning memory manager that re-distributes resources across components
dynamically
 Default WLM concurrency control for heavy queries
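Big SQL inherits DB2-style workload management DDL. A minimal sketch (service class name and limits are illustrative):
-- Illustrative sketch: limit concurrent "heavy" activities
-- (queries would be routed to the service class via CREATE WORKLOAD, not shown)
CREATE SERVICE CLASS heavy_queries;
CREATE THRESHOLD limit_heavy FOR SERVICE CLASS heavy_queries ACTIVITIES
  ENFORCEMENT DATABASE
  WHEN CONCURRENTDBCOORDACTIVITIES > 5 AND QUEUEDACTIVITIES > 20
  STOP EXECUTION;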
A word about . . . application portability
 Big SQL adopts IBM's standard Data Server Client Drivers
 Robust, standards compliant ODBC, JDBC, and .NET drivers
 Same driver used for DB2 LUW, DB2/z and Informix
 Putting the story together….
 Big SQL shares a common SQL dialect with DB2
 Big SQL shares the same client drivers with DB2
 Data warehouse augmentation just got easier
 Integration with popular third party tools just got easier
[Diagram: compatible SQL + compatible drivers = portable applications]
A word about . . . SQL compatibility
 General
 OFFSET clause with FETCH FIRST n ROWS ONLY and LIMIT/OFFSET
 ORDER BY with ASC NULLS FIRST and DESC NULLS FIRST
 User-defined aggregates
 Extensions to NULLS FIRST / NULLS LAST
 Additional scalar functions (THIS_WEEK, NEXT_YEAR, HASH, OVERLAPS, . . . )
 Oracle syntax compatibility
 JOIN syntax (using +)
 CONNECT BY support for hierarchical queries
 ROWNUM support
 DUAL table support
 Netezza syntax compatibility
 CREATE TEMPORARY TABLE
 JOIN Syntax: USING
 CASTING
 ISNULL as synonym for IS NULL; NOTNULL as synonym for NOT NULL
 NOW as synonym for CURRENT TIMESTAMP
 SQL_COMPAT session variable. When SQL_COMPAT='NPS' (sketch below):
• Double-dot notation for database objects
• TRANSLATE function parameter syntax
• Netezza-style procedural language (NZPLSQL)
• . . .
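A minimal sketch of NPS mode in action (database and table names are illustrative):
-- Illustrative sketch: switch the session into Netezza (NPS) compatibility mode
SET SQL_COMPAT = 'NPS';
-- Double-dot notation for database objects is now accepted
SELECT COUNT(*) FROM gosalesdw..go_region_dim;
-- TRANSLATE uses Netezza parameter order (source, from, to)
VALUES TRANSLATE('hello', 'el', 'ip');   -- returns 'hippo' in NPS mode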
A word about . . . high availability
 Big SQL master node high availability
 Scheduler automatically restarted upon failure
 Catalog changes replicated to warm standby instance
 Warm standby automatically takes over if the primary fails
 Worker node failure leads to blacklisting / auto resubmission
[Diagram: primary Big SQL master (scheduler + catalogs) with a warm standby replica sharing database logs; worker nodes beneath, each over HDFS data]
A word about . . . HBase support
 Big SQL with HBase – basic operations
– Create tables and views
– LOAD / INSERT data
– Query data with full SQL breadth
– UPDATE / DELETE data
– . . .
 HBase-specific design points
 Column mapping
 Dense / composite columns
– FORCE KEY UNIQUE option
– Salting option
– Secondary indexes
– . . . .
 Details covered in separate presentation
A word about . . . Big SQL and Spark
 What is Apache Spark?
 Fast, general-purpose engine for working with Big Data
 Part of IBM Open Platform for Apache Hadoop
 Popular built-in libraries for machine learning, query, streaming, etc.
 Data in Big SQL accessible through Spark
 Big SQL meta data in HCatalog
 Big SQL tables use common Hadoop file formats
 Spark SQL provides access to structured data
 Sample approach
 From Big SQL: CREATE HADOOP TABLE . . . in Hive warehouse
or over DFS directory
 From Spark:
• Create HiveContext
• Issue queries (Hive syntax) directly against Big SQL tables
• Invoke Spark transformations and actions as desired, including those
specific to other Spark libraries (e.g., MLlib)
Technical preview: launch Spark jobs from Big SQL
 Spark jobs can be invoked from Big SQL using a table UDF
abstraction
 Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick
off a Spark job that reads a JSON file stored on HDFS
SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
language => 'scala',
class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
card => 100000)) AS doc
WHERE doc.country IS NOT NULL
Technical preview: BLU Acceleration
 In-memory processing of columnar data
 Supported for Big SQL tables on head node
 Based on proven high performance technology for analytical queries
 Simple syntax
 CREATE TABLE with “organize by column” clause
 INSERT, DELETE, positioned UPDATE, LOAD via supplied procedure
 Standard SELECT, including joins with Big SQL tables in HDFS, Hive,
HBase
 Usage considerations
 Dimension tables (join with Big SQL fact tables in Hadoop)
 Small / medium data marts
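A minimal sketch (all table, column, and join-key names are illustrative):
-- Illustrative sketch: column-organized (BLU) dimension table on the head node
CREATE TABLE dim_region
( region_key  INT NOT NULL,
  region_name VARCHAR(50) )
ORGANIZE BY COLUMN;
-- Join the in-memory dimension with a Big SQL fact table in Hadoop
SELECT f.product_key, d.region_name
FROM sales_fact f, dim_region d          -- sales_fact: hypothetical Hadoop fact table
WHERE f.region_key = d.region_key;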
Impersonation (new in 4.2)
 What is Big SQL impersonation?
 Similar to Hive impersonation
 Operations performed as connected end user, not “bigsql” service user
• CREATE TABLE
• LOAD
• Queries
• . . .
 Supported only for Big SQL Hive tables (not HBase tables)
 Consider using when
 Underlying data produced outside of Big SQL service
 Data shared across multiple services on your cluster
Get started with Big SQL: External resources
 Hadoop Dev: links to videos, white paper, lab, . . . .
https://developer.ibm.com/hadoop/

Mais conteúdo relacionado

Mais procurados

Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...
Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...
Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...Đông Đô
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouseVianney FOUCAULT
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataMarco Garcia
 
ClickHouse Defense Against the Dark Arts - Intro to Security and Privacy
ClickHouse Defense Against the Dark Arts - Intro to Security and PrivacyClickHouse Defense Against the Dark Arts - Intro to Security and Privacy
ClickHouse Defense Against the Dark Arts - Intro to Security and PrivacyAltinity Ltd
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Altinity Ltd
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Hashicorp Vault Open Source vs Enterprise
Hashicorp Vault Open Source vs EnterpriseHashicorp Vault Open Source vs Enterprise
Hashicorp Vault Open Source vs EnterpriseStenio Ferreira
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introductionAlexey Grigorev
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceHarald Erb
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 

Mais procurados (20)

Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...
Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...
Xây dụng và kết hợp Kafka, Druid, Superset để đua vào ứng dụng phân tích dữ l...
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
 
ClickHouse Defense Against the Dark Arts - Intro to Security and Privacy
ClickHouse Defense Against the Dark Arts - Intro to Security and PrivacyClickHouse Defense Against the Dark Arts - Intro to Security and Privacy
ClickHouse Defense Against the Dark Arts - Intro to Security and Privacy
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Hashicorp Vault Open Source vs Enterprise
Hashicorp Vault Open Source vs EnterpriseHashicorp Vault Open Source vs Enterprise
Hashicorp Vault Open Source vs Enterprise
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Redis introduction
Redis introductionRedis introduction
Redis introduction
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data Science
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 

Destaque

Big Data: Querying complex JSON data with BigInsights and Hadoop
Big Data:  Querying complex JSON data with BigInsights and HadoopBig Data:  Querying complex JSON data with BigInsights and Hadoop
Big Data: Querying complex JSON data with BigInsights and HadoopCynthia Saracco
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab Cynthia Saracco
 
Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark Cynthia Saracco
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase Cynthia Saracco
 
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Cynthia Saracco
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guideCynthia Saracco
 

Destaque (6)

Big Data: Querying complex JSON data with BigInsights and Hadoop
Big Data:  Querying complex JSON data with BigInsights and HadoopBig Data:  Querying complex JSON data with BigInsights and Hadoop
Big Data: Querying complex JSON data with BigInsights and Hadoop
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab
 
Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase
 
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guide
 

Semelhante a Big Data: SQL on Hadoop from IBM

Using your DB2 SQL Skills with Hadoop and Spark
Using your DB2 SQL Skills with Hadoop and SparkUsing your DB2 SQL Skills with Hadoop and Spark
Using your DB2 SQL Skills with Hadoop and SparkCynthia Saracco
 
Big SQL NYC Event December by Virender
Big SQL NYC Event December by VirenderBig SQL NYC Event December by Virender
Big SQL NYC Event December by Virendervithakur
 
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"Lviv Startup Club
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS dataCynthia Saracco
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperScott Gray
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformHortonworks
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
Informix warehouse and accelerator overview
Informix warehouse and accelerator overviewInformix warehouse and accelerator overview
Informix warehouse and accelerator overviewKeshav Murthy
 
Database Performance Management in Cloud
Database Performance Management in CloudDatabase Performance Management in Cloud
Database Performance Management in CloudDr. Amarjeet Singh
 
SQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQLSQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQLPeter Eisentraut
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Comparing sql and nosql dbs
Comparing sql and nosql dbsComparing sql and nosql dbs
Comparing sql and nosql dbsVasilios Kuznos
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSJane Man
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platformMostafa
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 

Semelhante a Big Data: SQL on Hadoop from IBM (20)

Using your DB2 SQL Skills with Hadoop and Spark
Using your DB2 SQL Skills with Hadoop and SparkUsing your DB2 SQL Skills with Hadoop and Spark
Using your DB2 SQL Skills with Hadoop and Spark
 
Big SQL NYC Event December by Virender
Big SQL NYC Event December by VirenderBig SQL NYC Event December by Virender
Big SQL NYC Event December by Virender
 
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS data
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_Whitepaper
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Informix warehouse and accelerator overview
Informix warehouse and accelerator overviewInformix warehouse and accelerator overview
Informix warehouse and accelerator overview
 
Database Performance Management in Cloud
Database Performance Management in CloudDatabase Performance Management in Cloud
Database Performance Management in Cloud
 
SQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQLSQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQL
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Comparing sql and nosql dbs
Comparing sql and nosql dbsComparing sql and nosql dbs
Comparing sql and nosql dbs
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OS
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platform
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 

Mais de Cynthia Saracco

Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Cynthia Saracco
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data AnalyticsCynthia Saracco
 
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
Big Data:  Big SQL web tooling (Data Server Manager) self-study labBig Data:  Big SQL web tooling (Data Server Manager) self-study lab
Big Data: Big SQL web tooling (Data Server Manager) self-study labCynthia Saracco
 
Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study labCynthia Saracco
 
Big Data: Get started with SQL on Hadoop self-study lab
Big Data:  Get started with SQL on Hadoop self-study lab Big Data:  Get started with SQL on Hadoop self-study lab
Big Data: Get started with SQL on Hadoop self-study lab Cynthia Saracco
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsightsCynthia Saracco
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Cynthia Saracco
 

Mais de Cynthia Saracco (7)

Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
 
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
Big Data:  Big SQL web tooling (Data Server Manager) self-study labBig Data:  Big SQL web tooling (Data Server Manager) self-study lab
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
 
Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study lab
 
Big Data: Get started with SQL on Hadoop self-study lab
Big Data:  Get started with SQL on Hadoop self-study lab Big Data:  Get started with SQL on Hadoop self-study lab
Big Data: Get started with SQL on Hadoop self-study lab
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 

Último

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Big Data: SQL on Hadoop from IBM

  • 1. © 2016 IBM Corporation Big SQL: A Technical Introduction Created by C. M. Saracco, IBM Silicon Valley Lab June 2016
  • 2. © 2016 IBM Corporation2 Executive summary  What is Big SQL?  Industry-standard SQL query interface for BigInsights data  New Hadoop query engine derived from decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization  Why Big SQL?  Easy on-ramp to Hadoop for SQL professionals  Support familiar SQL tools / applications (via JDBC and ODBC drivers)  What operations are supported?  Create tables / views. Store data in DFS, HBase, or Hive warehouse  Load data into tables (from local files, remote files, RDBMSs)  Query data (project, restrict, join, union, wide range of sub-queries, wide range of built-in functions, UDFs, . . . . )  UPDATE/DELETE data (in Big SQL HBase tables)  GRANT / REVOKE privileges, create roles, create column masks and row permissions  Transparently join / union data between Hadoop and RDBMSs in single query  Collect statistics and inspect detailed data access plan  Establish workload management controls  Monitor Big SQL usage  . . . .
  • 3. © 2016 IBM Corporation3 Agenda  Big SQL overview  Motivation  Architecture  Distinguishing characteristics Using Big SQL: the basics  Invocation options  Creating tables and views  Populating tables with data  Querying data  Big SQL: beyond the basics  Performance  Fine-grained access control  SerDes (serializers / deserializers)  . . .
  • 4. © 2016 IBM Corporation4 SQL access for Hadoop: Why?  Data warehouse modernization is a leading Hadoop use case  Off-load “cold” warehouse data into query- ready Hadoop platform  Explore / transform / analyze / aggregate social media data, log records, etc. and upload summary data to warehouse  Limited availability of skills in MapReduce, Pig, etc.  SQL opens the data to a much wider audience  Familiar, widely known syntax  Common catalog for identifying data and structure 2012 Big Data @ Work Study surveying 1144 business and IT professionals in 95 countries
  • 5. © 2016 IBM Corporation5 What is Big SQL? SQL-based Application Big SQL Engine Data Storage IBM data server client SQL MPP Run-time DFS  Comprehensive, standard SQL – SELECT: joins, unions, aggregates, subqueries . . . – UPDATE/DELETE (HBase-managed tables) – GRANT/REVOKE, INSERT … INTO – SQL procedural logic (SQL PL) – Stored procs, user-defined functions – IBM data server JDBC and ODBC drivers  Optimization and performance – IBM MPP engine (C++) replaces Java MapReduce layer – Continuous running daemons (no start up latency) – Message passing allow data to flow between nodes without persisting intermediate results – In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM) – Cost-based query optimization with 140+ rewrite rules  Various storage formats supported – Text (delimited), Sequence, RCFile, ORC, Avro, Parquet – Data persisted in DFS, Hive, HBase – No IBM proprietary format required  Integration with RDBMSs via LOAD, query federation BigInsights
  • 6. (Big SQL architecture, continued)  Worker nodes stream data between each other as needed  Workers can spill large data sets to local disk if needed  Allows Big SQL to work with data sets larger than available memory
  • 7. © 2016 IBM Corporation7 Distinguishing characteristics
   Data shared with Hadoop ecosystem
   Comprehensive file format support
   Superior enablement of IBM and third-party software
   Modern MPP runtime
   Powerful SQL query rewriter
   Cost-based optimizer
   Optimized for concurrent user throughput
   Results not constrained by memory
   Distributed requests to multiple data sources within a single SQL statement
   Main data sources supported: DB2, Teradata, Oracle, Netezza, Informix, SQL Server
   Advanced security/auditing
   Resource and workload management
   Self-tuning memory management
   Comprehensive monitoring
   Comprehensive SQL support
   IBM SQL PL compatibility
   Extensive analytic functions
  • 8. © 2016 IBM Corporation8 Agenda  Big SQL overview  Motivation  Architecture  Distinguishing characteristics  Using Big SQL: the basics  Invocation options  Creating tables and views  Populating tables with data  Querying data  Big SQL: beyond the basics  Performance  Fine-grained access control  SerDes (serializers / deserializers)  . . .
  • 9. © 2016 IBM Corporation9 Invocation options  Command-line interface: Java SQL Shell (JSqsh)  Web tooling (Data Server Manager)  Tools that support IBM JDBC/ODBC driver
  • 10. © 2016 IBM Corporation10 Big SQL web tooling (Data Server Manager)  Invoked from BigInsights Home  Develop and execute Big SQL, monitor database, etc.
  • 11. © 2016 IBM Corporation11 Creating a Big SQL table  Standard CREATE TABLE DDL with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null) row format delimited fields terminated by '|' stored as textfile; Worth noting: • “Hadoop” keyword creates table in DFS • Row format delimited and textfile formats are default • Constraints not enforced (but useful for query optimization) • Examples in these charts focus on DFS storage, both within or external to Hive warehouse. HBase examples provided separately
  • 12. © 2016 IBM Corporation12 Results from previous CREATE TABLE . . .  Data stored in subdirectory of Hive warehouse . . . /hive/warehouse/myid.db/users  Default schema is user ID. Can create new schemas  “Table” is just a subdirectory under schema.db  Table’s data are files within table subdirectory  Meta data collected (Big SQL & Hive)  SYSCAT.* and SYSHADOOP.* views  Optionally, use LOCATION clause of CREATE TABLE to layer Big SQL schema over existing DFS directory contents  Useful if table contents already in DFS  Avoids need to LOAD data into Hive  Example provided later
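As a quick taste of the LOCATION clause ahead of the fuller SerDe-based example later, here is a minimal sketch that layers a Big SQL table over data already sitting in a DFS directory. The directory path and column definitions are hypothetical, not taken from the deck:

  create external hadoop table web_logs
     (ip varchar(15), url varchar(500), hit_ts timestamp)
     row format delimited fields terminated by ','
     stored as textfile
     location '/user/bigsql/sampledata/web_logs';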
  • 13. © 2016 IBM Corporation13 CREATE VIEW  Standard SQL syntax create view my_users as select fname, lname from biadmin.users where id > 100;
  • 14. © 2016 IBM Corporation14 Populating tables via LOAD  Typically best runtime performance  Load data from a local or remote file system load hadoop using file url 'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;  Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection load hadoop using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb' with parameters (user='myID', password='myPassword') from table MEDIA columns (ID, NAME) where 'CONTACTDATE < ''2012-02-01''' into table media_db2table_jan overwrite with load properties ('num.map.tasks' = 10);
  • 15. © 2016 IBM Corporation15 Populating tables via INSERT  INSERT INTO . . . SELECT FROM . . .  Parallel read and write operations CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet ( product_key INT NOT NULL, product_name VARCHAR(150), quantity INT, order_method_en VARCHAR(90) ) STORED AS parquetfile; -- source tables do not need to be in Parquet format insert into big_sales_parquet SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en FROM sls_sales_fact sales, sls_product_dim prod, sls_product_lookup pnumb, sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key AND sales.quantity > 5500;  INSERT INTO . . . VALUES( . . . )  Not parallelized. 1 file per INSERT. Not recommended except for quick tests CREATE HADOOP TABLE foo (col1 int, col2 varchar(10)); INSERT INTO foo VALUES (1, 'hello');
  • 16. © 2016 IBM Corporation16 CREATE . . . TABLE . . . AS SELECT . . .  Create a Big SQL table based on contents of other table(s)  Source tables can be in different file formats or use different underlying storage mechanisms -- source tables in this example are external (just DFS files) CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat ( product_key INT NOT NULL , product_line_code INT NOT NULL , product_type_key INT NOT NULL , product_type_code INT NOT NULL , product_line_en VARCHAR(90) , product_line_de VARCHAR(90) ) as select product_key, d.product_line_code, product_type_key, product_type_code, product_line_en, product_line_de from extern.sls_product_dim d, extern.sls_product_line_lookup l where d.product_line_code = l.product_line_code;
  • 17. © 2016 IBM Corporation17 SQL capability highlights  Query operations  Projections, restrictions  UNION, INTERSECT, EXCEPT  Wide range of built-in functions (e.g. OLAP)  Various Oracle, Netezza compatibility items  Full support for subqueries  In SELECT, FROM, WHERE and HAVING clauses  Correlated and uncorrelated  Equality, non-equality subqueries  EXISTS, NOT EXISTS, IN, ANY, SOME, etc.  All standard join operations  Standard and ANSI join syntax  Inner, outer, and full outer joins  Equality, non-equality, cross join support  Multi-value join  Stored procedures, user-defined functions, user-defined aggregates SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey ) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate ) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name;
  • 18. © 2016 IBM Corporation18 Power of standard SQL
   Everyone loves performance numbers, but that's not the whole story
   How much work do you have to do to achieve those numbers?
   A portion of our internal performance numbers are based upon industry standard benchmarks
   Big SQL is capable of executing
   All 22 TPC-H queries without modification
   All 99 TPC-DS queries without modification

  Original Query:
  SELECT s_name, count(*) AS numwait
  FROM supplier, lineitem l1, orders, nation
  WHERE s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (SELECT * FROM lineitem l2
                WHERE l2.l_orderkey = l1.l_orderkey
                  AND l2.l_suppkey <> l1.l_suppkey)
    AND NOT EXISTS (SELECT * FROM lineitem l3
                    WHERE l3.l_orderkey = l1.l_orderkey
                      AND l3.l_suppkey <> l1.l_suppkey
                      AND l3.l_receiptdate > l3.l_commitdate)
    AND s_nationkey = n_nationkey
    AND n_name = ':1'
  GROUP BY s_name
  ORDER BY numwait desc, s_name;

  Re-written query (the manual rework required by engines without full subquery support):
  SELECT s_name, count(1) AS numwait
  FROM (SELECT s_name
        FROM (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey
              FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey,
                           max(l_suppkey) as max_suppkey
                    FROM lineitem
                    WHERE l_receiptdate > l_commitdate
                    GROUP BY l_orderkey) t2
              RIGHT OUTER JOIN
                   (SELECT s_name, l_orderkey, l_suppkey
                    FROM (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey
                          FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey,
                                       max(l_suppkey) as max_suppkey
                                FROM lineitem
                                GROUP BY l_orderkey) t1
                          JOIN (SELECT s_name, l_orderkey, l_suppkey
                                FROM orders o
                                JOIN (SELECT s_name, l_orderkey, l_suppkey
                                      FROM nation n
                                      JOIN supplier s
                                        ON s.s_nationkey = n.n_nationkey
                                       AND n.n_name = 'INDONESIA'
                                      JOIN lineitem l
                                        ON s.s_suppkey = l.l_suppkey
                                      WHERE l.l_receiptdate > l.l_commitdate) l1
                                  ON o.o_orderkey = l1.l_orderkey
                                 AND o.o_orderstatus = 'F') l2
                            ON l2.l_orderkey = t1.l_orderkey) a
                    WHERE (count_suppkey > 1)
                       OR ((count_suppkey = 1) AND (l_suppkey <> max_suppkey))) l3
                ON l3.l_orderkey = t2.l_orderkey) b
        WHERE (count_suppkey IS NULL)
           OR ((count_suppkey = 1) AND (l_suppkey = max_suppkey))) c
  GROUP BY s_name
  ORDER BY numwait DESC, s_name
  • 19. © 2016 IBM Corporation19 Agenda  Big SQL overview  Motivation  Architecture  Distinguishing characteristics Using Big SQL: the basics  Invocation options  Creating tables and views  Populating tables with data  Querying data  Big SQL: beyond the basics  Performance  Fine-grained access control  SerDes (serializers / deserializers)  . . .
  • 20. © 2016 IBM Corporation20 A word about . . . performance  TPC (Transaction Processing Performance Council)  Formed August 1988  Widely recognized as the most credible, vendor-independent SQL benchmarks  TPC-H and TPC-DS are the most relevant to SQL over Hadoop • R/W nature of other workloads not suitable for HDFS  Hadoop-DS benchmark: BigInsights, Hive, Cloudera and now Hawq  Original benchmark run by IBM & reviewed by a TPC-certified auditor in Oct 2014  Updates (not reviewed by a TPC-certified auditor) available – contact your IBM account rep  Based on TPC-DS. Key deviations • No data maintenance or persistence phases (not supported across all vendors)  Common set of queries across all solutions • Subset that all vendors can successfully execute at the chosen scale factor • Queries are not cherry-picked  Most complete TPC-DS-like benchmark executed so far  Analogous to porting a relational workload to SQL on Hadoop
  • 21. © 2016 IBM Corporation21 IBM first (and only) vendor to release audited benchmark  Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale (Oct. 2014)  InfoSizing, Transaction Processing Performance Council certified auditors, verified the IBM results as well as results on Cloudera Impala and Hortonworks Hive  These results are for a non-TPC benchmark. A subset of the TPC-DS Benchmark standard requirements was implemented http://public.dhe.ibm.com/common/ssi/ecm/im/en/imw14800usen/IMW14800USEN.PDF
  • 22. © 2016 IBM Corporation22 A word about . . . data access plans  Cost-based optimizer with query rewrite technology  Example: SELECT DISTINCT COL_PK, COL_X . . . FROM TABLE  Automatically rewrite query to avoid sort – primary key constraint indicates no nulls, no duplicates  Transparent to programmer  ANALYZE TABLE … to collect statistics  Automatic or manual collection  Efficient runtime performance  EXPLAIN to report detailed access plan  Subset shown at right
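A minimal sketch of the statistics / access-plan workflow just described. The table and column names are illustrative, and the exact EXPLAIN output format varies by release and tooling:

  -- gather table and column statistics for the cost-based optimizer
  ANALYZE TABLE gosalesdw.go_region_dim COMPUTE STATISTICS FOR COLUMNS region_key, region_en;

  -- capture the detailed access plan for inspection
  EXPLAIN PLAN FOR
  SELECT region_key, region_en FROM gosalesdw.go_region_dim WHERE region_key > 5000;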
  • 23. © 2016 IBM Corporation23 A word about . . . column masking
  Data (unmasked base table):
  SELECT * FROM SAL_TBL
  EMP_NO FIRST_NAME SALARY
  ------- ------------ -----------
  1 Steve 250000
  2 Chris 200000
  3 Paula 1000000
  1) Create and grant access and roles *
  CREATE ROLE MANAGER
  CREATE ROLE EMPLOYEE
  GRANT SELECT ON SAL_TBL TO USER socrates
  GRANT SELECT ON SAL_TBL TO USER newton
  GRANT ROLE MANAGER TO USER socrates
  GRANT ROLE EMPLOYEE TO USER newton
  2) Create permissions *
  CREATE MASK SALARY_MASK ON SAL_TBL FOR COLUMN SALARY RETURN
  CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER,'MANAGER') = 1
  THEN SALARY
  ELSE 0.00
  END
  ENABLE
  3) Enable access control *
  ALTER TABLE SAL_TBL ACTIVATE COLUMN ACCESS CONTROL
  4a) Select as an EMPLOYEE
  CONNECT TO TESTDB USER newton
  SELECT * FROM SAL_TBL
  EMP_NO FIRST_NAME SALARY
  ------- ------------ -----------
  1 Steve 0
  2 Chris 0
  3 Paula 0
  3 record(s) selected.
  4b) Select as a MANAGER
  CONNECT TO TESTDB USER socrates
  SELECT * FROM SAL_TBL
  EMP_NO FIRST_NAME SALARY
  ------- ------------ -----------
  1 Steve 250000
  2 Chris 200000
  3 Paula 1000000
  3 record(s) selected.
  * Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
  • 24. © 2016 IBM Corporation24 A word about . . . row-based access control
  Data (base table):
  SELECT * FROM BRANCH_TBL
  EMP_NO FIRST_NAME BRANCH_NAME
  ------- ------------ -----------
  1 Steve Branch_B
  2 Chris Branch_A
  3 Paula Branch_A
  4 Craig Branch_B
  5 Pete Branch_A
  6 Stephanie Branch_B
  7 Julie Branch_B
  8 Chrissie Branch_A
  1) Create and grant access and roles *
  CREATE ROLE BRANCH_A_ROLE
  GRANT ROLE BRANCH_A_ROLE TO USER newton
  GRANT SELECT ON BRANCH_TBL TO USER newton
  2) Create permissions *
  CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL FOR ROWS
  WHERE (VERIFY_ROLE_FOR_USER(SESSION_USER,'BRANCH_A_ROLE') = 1
  AND BRANCH_TBL.BRANCH_NAME = 'Branch_A')
  ENFORCED FOR ALL ACCESS
  ENABLE
  3) Enable access control *
  ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL
  4) Select as Branch_A user
  CONNECT TO TESTDB USER newton
  SELECT * FROM BRANCH_TBL
  EMP_NO FIRST_NAME BRANCH_NAME
  ----------- ------------ -----------
  2 Chris Branch_A
  3 Paula Branch_A
  5 Pete Branch_A
  8 Chrissie Branch_A
  4 record(s) selected.
  * Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
  • 25. © 2016 IBM Corporation25 A word about . . . data types  Variety of primitives supported  TINYINT, INT, DECIMAL(p,s), FLOAT, REAL, CHAR, VARCHAR, TIMESTAMP, DATE, VARBINARY, BINARY, . . .  Maximum 32K  Complex types  ARRAY: ordered collection of elements of same type  Associative ARRAY (equivalent to Hive MAP type): unordered collection of key/value pairs. Keys must be primitive types (consistent with Hive)  ROW (equivalent to Hive STRUCT type): collection of elements of different types  Nesting supported for ARRAY of ROW (STRUCT) types  Query predicates for ARRAY or ROW columns must specify elements of a primitive type CREATE HADOOP TABLE mytable (id INT, info INT ARRAY[10]); SELECT * FROM mytable WHERE info[8]=12;
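Building on the ARRAY example above, here is a hedged sketch of a ROW type alongside an ARRAY; the table, field names, and the dot/bracket element access shown are illustrative assumptions rather than deck content:

  CREATE HADOOP TABLE employees (
    id INT,
    name VARCHAR(40),
    address ROW(street VARCHAR(60), city VARCHAR(30)),   -- ROW ~ Hive STRUCT
    phones VARCHAR(20) ARRAY[3]                          -- ordered collection
  );
  -- predicates must resolve to elements of a primitive type
  SELECT name FROM employees WHERE address.city = 'San Jose' AND phones[1] LIKE '408%';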
  • 26. © 2016 IBM Corporation26 A word about . . . SerDes  Custom serializers / deserializers (SerDes)  Read / write complex or “unusual” data formats (e.g., JSON)  Commonly used by Hadoop community  Developed by user or available publicly  Users add SerDes to appropriate directory and reference SerDe when creating table  Example -- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe -- Location clause points to DFS dir containing JSON data -- External clause means DFS dir & data won't be dropped after a DROP TABLE command create external hadoop table socialmedia-json (Country varchar(20), FeedInfo varchar(300), . . . ) row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' location '</hdfs_path>/myJSON'; select * from socialmedia-json;
  • 27. © 2016 IBM Corporation27 Sample JSON input for previous example [screenshot: JSON-based social media data to load into Big SQL; table socialmedia-json defined with the SerDe]
  • 28. © 2016 IBM Corporation28 Sample Big SQL query output for JSON data [screenshot: sample output of select * from socialmedia-json]
  • 29. © 2016 IBM Corporation29 A word about . . . query federation  Data rarely lives in isolation  Big SQL transparently queries heterogeneous systems  Join Hadoop to RDBMSs  Query optimizer understands capabilities of external system • Including available statistics  As much work as possible is pushed to each system to process [diagram: Big SQL head node coordinating compute nodes (each running Big SQL worker, Task Tracker, and Data Node), with federated access to external data sources]
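Federation setup follows the DB2-style DDL that Big SQL inherits. The following is a hedged sketch only: the server name, host, credentials, and remote table are hypothetical, and the exact CREATE SERVER options vary by data source:

  -- register the remote database and how to authenticate against it
  CREATE SERVER SALES_DB2 TYPE DB2/UDB VERSION '10.5' WRAPPER DRDA
    OPTIONS (HOST 'db2.example.com', PORT '50000', DBNAME 'SALESDB');
  CREATE USER MAPPING FOR USER SERVER SALES_DB2
    OPTIONS (REMOTE_AUTHID 'myID', REMOTE_PASSWORD 'myPassword');
  -- expose a remote table locally, then join it to Hadoop data in one query
  CREATE NICKNAME REGION_REF FOR SALES_DB2.SALESSCHEMA.REGION_REF;
  SELECT f.product_key, r.region_name
  FROM sls_sales_fact f JOIN REGION_REF r ON f.region_key = r.region_key;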
  • 30. © 2016 IBM Corporation30 A word about . . . resource management  Big SQL doesn't run in isolation  Nodes tend to be shared with a variety of Hadoop services  HBase region servers  MapReduce jobs  . . .  Big SQL can be constrained to limit its footprint on the cluster  CPU utilization  Memory utilization  Resources are automatically adjusted based upon workload  Always fitting within constraints  Self-tuning memory manager that re-distributes resources across components dynamically  Default WLM concurrency control for heavy queries
  • 31. © 2016 IBM Corporation31 A word about . . . application portability  Big SQL adopts IBM's standard Data Server Client Drivers  Robust, standards compliant ODBC, JDBC, and .NET drivers  Same driver used for DB2 LUW, DB2/z and Informix  Putting the story together….  Big SQL shares a common SQL dialect with DB2  Big SQL shares the same client drivers with DB2  Data warehouse augmentation just got easier  Integration with popular third party tools just got easier Compatible SQL Compatible Drivers Portable Application
  • 32. © 2016 IBM Corporation32 A word about . . . SQL compatibility  General  OFFSET clause with FETCH FIRST n ROWS ONLY and LIMIT/OFFSET  ORDER BY with ASC NULLS FIRST and DESC NULLS FIRST  User-defined aggregates  Extensions to NULLS FIRST / NULLS LAST  Additional scalar functions (THIS_WEEK, NEXT_YEAR, HASH, OVERLAPS, . . . )  Oracle syntax compatibility  JOIN syntax (using +)  CONNECT BY support for hierarchical queries  ROWNUM support  DUAL table support  Netezza syntax compatibility  CREATE TEMPORARY TABLE  JOIN syntax: USING  CASTING  ISNULL as synonym for IS NULL; NOTNULL as synonym for NOT NULL  NOW as synonym for CURRENT TIMESTAMP  SQL_COMPAT session variable. When SQL_COMPAT='NPS': • Double-dot notation for database objects • TRANSLATE function parameter syntax • Netezza-style procedural language (NZPLSQL) • . . .
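A few of the compatibility items above in one hedged sketch; the table names are hypothetical and exact behavior depends on the Big SQL release:

  -- LIMIT/OFFSET extension alongside standard FETCH FIRST
  SELECT fname, lname FROM users ORDER BY lname FETCH FIRST 10 ROWS ONLY;
  SELECT fname, lname FROM users ORDER BY lname LIMIT 10 OFFSET 20;
  -- Netezza compatibility mode
  SET SQL_COMPAT = 'NPS';
  SELECT * FROM mydb..users;   -- double-dot notation resolves under NPS mode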
  • 33. © 2016 IBM Corporation33 A word about . . . high availability  Big SQL master node high availability  Scheduler automatically restarted upon failure  Catalog changes replicated to warm standby instance  Warm standby automatically takes over if the primary fails  Worker node failure leads to black listing / auto resubmission [diagram: primary and standby Big SQL master nodes (each with Scheduler and Catalogs) sharing database logs, with worker nodes over HDFS data]
  • 34. © 2016 IBM Corporation34 A word about . . . HBase support  Big SQL with HBase – basic operations – Create tables and views – LOAD / INSERT data – Query data with full SQL breadth – UPDATE / DELETE data – . . .  HBase-specific design points  Column mapping  Dense / composite columns – FORCE KEY UNIQUE option – Salting option – Secondary indexes – . . . .  Details covered in separate presentation
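As a taste of the column mapping design point ahead of that separate presentation, here is a hedged sketch of Big SQL's HBase DDL; the table, column family, and qualifiers are hypothetical:

  CREATE HBASE TABLE reviews (
    review_id INT,
    username VARCHAR(20),
    rating INT
  )
  COLUMN MAPPING (
    KEY        MAPPED BY (review_id),   -- HBase row key
    data:user  MAPPED BY (username),    -- family:qualifier pairs
    data:score MAPPED BY (rating)
  );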
  • 35. © 2016 IBM Corporation35 A word about . . . Big SQL and Spark  What is Apache Spark?  Fast, general-purpose engine for working with Big Data  Part of IBM Open Platform for Apache Hadoop  Popular built-in libraries for machine learning, query, streaming, etc.  Data in Big SQL accessible through Spark  Big SQL meta data in HCatalog  Big SQL tables use common Hadoop file formats  Spark SQL provides access to structured data  Sample approach  From Big SQL: CREATE HADOOP TABLE . . . in Hive warehouse or over DFS directory  From Spark: • Create HiveContext • Issue queries (Hive syntax) directly against Big SQL tables • Invoke Spark transformations and actions as desired, including those specific to other Spark libraries (e.g., MLlib)
  • 36. © 2016 IBM Corporation36 Technical preview: launch Spark jobs from Big SQL  Spark jobs can be invoked from Big SQL using a table UDF abstraction  Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick off a Spark job that reads a JSON file stored on HDFS SELECT * FROM TABLE(SYSHADOOP.EXECSPARK( language => 'scala', class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile', uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json', card => 100000)) AS doc WHERE doc.country IS NOT NULL
  • 37. © 2016 IBM Corporation37 Technical preview: BLU Acceleration  In-memory processing of columnar data  Supported for Big SQL tables on head node  Based on proven high performance technology for analytical queries  Simple syntax  CREATE TABLE with “organize by column” clause  INSERT, DELETE, positioned UPDATE, LOAD via supplied procedure  Standard SELECT, including joins with Big SQL tables in HDFS, Hive, HBase  Usage considerations  Dimension tables (join with Big SQL fact tables in Hadoop)  Small / medium data marts
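A hedged sketch of the usage pattern above: a columnar dimension table on the head node joined to a Hadoop-resident Big SQL fact table (all names are illustrative):

  -- head-node table with columnar (BLU) organization
  CREATE TABLE dim_product (
    product_key INT NOT NULL,
    product_name VARCHAR(150)
  ) ORGANIZE BY COLUMN;

  -- join the columnar dimension to a Big SQL fact table in Hadoop
  SELECT d.product_name, SUM(f.quantity) AS total_qty
  FROM sls_sales_fact f JOIN dim_product d ON f.product_key = d.product_key
  GROUP BY d.product_name;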
  • 38. © 2016 IBM Corporation38 Impersonation (new in 4.2)  What is Big SQL impersonation?  Similar to Hive impersonation  Operations performed as connected end user, not “bigsql” service user • CREATE TABLE • LOAD • Queries • . . .  Supported only for Big SQL Hive tables (not HBase tables)  Consider using when  Underlying data produced outside of Big SQL service  Data shared across multiple services on your cluster
  • 39. © 2016 IBM Corporation39 Get started with Big SQL: External resources  Hadoop Dev: links to videos, white paper, lab, . . . . https://developer.ibm.com/hadoop/