5. Introduction
• The approach to projects (and database usage) has changed
• 10-15 years ago… the "product" approach
• New project
• I usually use Oracle or SQL Server
• So… I will use Oracle or SQL Server for this project
• Nowadays… the "solution" approach
• New project
• What kind of data will my system store?
• Do I have expectations regarding consistency, security, sizing, pricing/licensing, etc.?
• So… I will use the right tool for the right job!
Companies now live in a heterogeneous world!
6. Hadoop: what it is, how it works, and what it can do
• Hadoop is a framework used:
• For distributed storage (HDFS : Hadoop Distributed File System)
• To process high volumes of data using the MapReduce programming model (among others)
• Hadoop is open source
• But some enterprise distributions exist
• Hadoop is designed to analyze lots of data
• Hadoop is designed to scale to tens of petabytes
• Hadoop is designed to manage structured and unstructured data
• Hadoop is designed to run on commodity servers
• Hadoop is initially designed for analytics and batch workloads
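As a quick illustration of the HDFS side described above, storing and listing a file on the distributed file system takes a couple of shell commands (a minimal sketch; paths and file name are illustrative):
$ hdfs dfs -mkdir -p /user/laurent/raw
$ hdfs dfs -put sales.csv /user/laurent/raw/
$ hdfs dfs -ls /user/laurent/raw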
11. Right tool for the right job?
• Oracle
• Perfect for OLTP workloads
• Ideal as an all-in-one database:
• Structured: tables, constraints, typed data, etc.
• Unstructured: images, videos, binary
• Many formats: XML, JSON
• Sharded database
• Hadoop
• Free and scalable
• Many open data formats (Avro, Parquet, Kudu etc.)
• Many processing tools (MapReduce, Spark, Kafka etc.)
• Analytic workloads
• Designed to manage large amounts of data quickly
How can I connect Hadoop to Oracle, Oracle to Hadoop, and query data?
Which solutions exist to exchange data between Oracle and Hadoop?
How can I reuse my Oracle data in a Hadoop workload?
14. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports from a Hadoop cluster
• Scenarios
• Enrich analytic workloads with multiple data sources
[Diagram: Sqoop moves Oracle data into Hadoop/HDFS, where it is combined with unstructured data; the analytic workload runs on Hadoop and its results land on HDFS and/or back in Oracle]
15. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports
• Scenarios
• Offload analytic workloads onto Hadoop
[Diagram: Sqoop exports Oracle data to Hadoop/HDFS; the analytic workload runs on Hadoop and Sqoop loads the results back into Oracle]
16. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports
• Scenarios
• Offload analytic workloads onto Hadoop and keep the data on HDFS
[Diagram: Sqoop exports Oracle data to Hadoop/HDFS; the analytic workload runs on Hadoop and the results stay on HDFS]
17. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports
• Scenarios
• Data archiving into Hadoop
[Diagram: Sqoop archives Oracle data into compressed filesets on HDFS]
19. Hadoop & Oracle: let them talk together
• Sqoop import moves data from the RDBMS to Hadoop (see the sketch below)
• One Oracle session per mapper
• Reads are done in direct path mode
• A SQL statement can be used to filter the data to import
• Results can be stored in various formats: delimited text, Hive, Parquet, compressed or not
• A key issue is data type conversion
• Hive Datatype mapping (--map-column-hive "TIME_ID"=timestamp)
• Java Datatype mapping (--map-column-java "ID"=Integer, "VALUE"=String)
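Putting these options together, a minimal sqoop import sketch (connection details reused from the export example below; table, filter, and target directory are illustrative):
$ sqoop import
> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl
> --username sh --password sh
> --table SALES
> --where "AMOUNT_SOLD > 1500"
> --map-column-java PROD_ID=Integer
> --target-dir /user/laurent/sqoop_raw
> --num-mappers 4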
20. Hadoop & Oracle: let them talk together
• Sqoop export moves data from Hadoop to the RDBMS
• The destination table has to be created first
• Direct mode is possible
• Two modes
• Insert mode
• Update mode
$ sqoop export
> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl
> --username sh --password sh
> --direct
> --table S_RESULT
> --export-dir=/user/hive/warehouse/hive_sample.db/s_result
> --input-fields-terminated-by '\001'
SQL> select * from sh.s_result;
PROD_ID P_SUM P_MIN
---------- ---------- ----------
47 1132200.93 25.97
46 749501.85 21.23
45 1527220.89 42.09
44 889945.74 42.09
…/…
22. Hadoop & Oracle: let them talk together
• Spark for Hadoop
• Spark is an Open Source distributed computing framework
• Fault tolerant by design
• Can work with various cluster managers
• YARN
• MESOS
• Spark Standalone
• Kubernetes (Experimental)
• Centered on a data structure called RDD (Resilient Distributed Dataset)
• Based on various components
• Spark Core
• Spark SQL (Data Abstraction)
• Spark Streaming (Data Ingestion)
• Spark MLlib (Machine Learning)
• Spark GraphX (Graph processing on top of Spark)
[Diagram: Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX run on top of Spark Core]
24. Hadoop & Oracle: let them talk together
• Spark for Hadoop
• Evolution
• RDD, DataFrames, and DataSets can be filled from an Oracle Data source
• RDD
• DataFrame: RDD + named-column organization
• DataSet: DataFrame specialization
• Untyped (DataFrame = DataSet[Row]) or typed (DataSet[T]); best for Spark SQL
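A minimal spark-shell sketch of filling a DataFrame from an Oracle data source over JDBC (assumes the Oracle JDBC driver is on the classpath; connection details are reused from the Sqoop examples, and the table and columns come from the SH sample schema):
scala> val df = spark.read.format("jdbc").
     |   option("url", "jdbc:oracle:thin:@//192.168.99.8:1521/orcl").
     |   option("user", "sh").option("password", "sh").
     |   option("driver", "oracle.jdbc.OracleDriver").
     |   option("dbtable", "SH.PRODUCTS").
     |   load()
scala> df.filter("PROD_LIST_PRICE > 100").groupBy("PROD_CATEGORY").count().show()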
26. Hadoop & Oracle: let them talk together
• Spark for Hadoop: Spark vs MapReduce
• MR is batch oriented (Map, then Reduce); Spark also targets interactive and (near) real-time processing
• MR stores intermediate data on disk; Spark keeps data in memory
• MR is written in Java, Spark is written in Scala
• Performance comparison
• WordCount on a file of 2Gb
• Execution time with and without optimization (mapper, reducer, memory, partitioning, etc.) → see http://repository.stcloudstate.edu/cgi/viewcontent.cgi?article=1008&context=csit_etds
                      MR       Spark
Without optimization  3'53''   34''
With optimization     2'23''   29''
→ Spark roughly 5x faster
34. Hadoop & Oracle: let them talk together
SQL> create public database link hivedsn connect to "laurent" identified by "laurent" using 'HIVEDSN';
Database link created.
SQL> show user
USER is "SH"
SQL> select p.prod_name,sum(s."quantity_sold")
2 from products p, s_hive@hivedsn s
3 where p.prod_id=s."prod_id"
4 and s."amount_sold">1500
5 group by p.prod_name;
PROD_NAME SUM(S."QUANTITY_SOLD")
-------------------------------------------------- ----------------------
Mini DV Camcorder with 3.5" Swivel LCD 3732
Envoy Ambassador 20607
256MB Memory Card 3
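This Hive database link presumably goes through Database Gateway for ODBC (DG4ODBC) on top of a Hive ODBC driver; a minimal configuration sketch (file contents, DSN name, and driver path are assumptions, not shown in the deck):
# $ORACLE_HOME/hs/admin/initHIVEDSN.ora
HS_FDS_CONNECT_INFO = HIVEDSN        # ODBC DSN declared in odbc.ini
HS_FDS_SHAREABLE_NAME = /usr/lib64/libodbc.so
# tnsnames.ora entry used by the database link (plus a matching
# SID_DESC with PROGRAM=dg4odbc in listener.ora)
HIVEDSN =
  (DESCRIPTION=
    (ADDRESS=(PROTOCOL=tcp)(HOST=oel6.localdomain)(PORT=1521))
    (CONNECT_DATA=(SID=HIVEDSN))
    (HS=OK))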
35. Hadoop & Oracle: let them talk together
------------------------------------------------------------------------------------------------
SQL_ID abb5vb85sd3kt, child number 0
-------------------------------------
select p.prod_name,sum(s."quantity_sold") from products p,
s_hive@hivedsn s where p.prod_id=s."prod_id" and s."amount_sold">1500
group by p.prod_name
Plan hash value: 3779319722
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Inst |IN-OUT|
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 204 (100)| | | |
| 1 | HASH GROUP BY | | 71 | 4899 | 204 (1)| 00:00:01 | | |
|* 2 | HASH JOIN | | 100 | 6900 | 203 (0)| 00:00:01 | | |
| 3 | TABLE ACCESS FULL| PRODUCTS | 72 | 2160 | 3 (0)| 00:00:01 | | |
| 4 | REMOTE | S_HIVE | 100 | 3900 | 200 (0)| 00:00:01 | HIVED~ | R->S |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("P"."PROD_ID"="S"."prod_id")
Remote SQL Information (identified by operation id):
----------------------------------------------------
4 - SELECT `prod_id`,`quantity_sold`,`amount_sold` FROM `S_HIVE` WHERE
`amount_sold`>1500 (accessing 'HIVEDSN' )
hive> show create table s_hive;
.../...
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE TABLE `s_hive`( |
.../...
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='6', |
| 'numRows'='2756529', |
| 'rawDataSize'='120045918', |
| 'totalSize'='122802447', |
| 'transient_lastDdlTime'='1505740396') |
+----------------------------------------------------+--+
SQL> select count(*) from s_hive@hivedsn where "amount_sold">1500;
COUNT(*)
----------
24342
37. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors
• Components available
• Oracle Datasource for Apache Hadoop
• Oracle Loader for Hadoop
• Oracle SQL Connector for HDFS
• Oracle R Advanced Analytics for Hadoop
• Oracle XQuery for Hadoop
• Oracle Data Integrator Enterprise Edition
• BigData Connectors are licensed separately from the Big Data Appliance (BDA)
• BigData Connectors can be installed on a BDA or on any Hadoop cluster
• BigData Connectors must be licensed for all processors of a Hadoop cluster
• Public price: $2,000 per Oracle processor
38. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle Datasource for Apache Hadoop
• Available for Hive and Spark
• Enables an Oracle table as a data source in Hive or Spark (see the sketch after this list)
• Based on Hive external tables
• Metadata is stored in HCatalog
• Data stays on the Oracle server
• Secured (Wallet and Kerberos integration)
• Writing data from Hive to Oracle is possible
• Performance
• Filter predicates are pushed down
• Projection Pushdown to retrieve only required columns
• Partition pruning enabled
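A hedged sketch of declaring an Oracle table as a Hive data source with this connector (the storage handler and property names follow the OD4H documentation, but treat them as assumptions and verify against your connector version; connection values are reused from earlier examples):
hive> CREATE EXTERNAL TABLE products_ora (
    >   prod_id INT,
    >   prod_name STRING)
    > STORED BY 'oracle.hcat.osh.OracleStorageHandler'
    > WITH SERDEPROPERTIES (
    >   'oracle.hcat.osh.columns.mapping' = 'prod_id,prod_name')
    > TBLPROPERTIES (
    >   'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@//192.168.99.8:1521/orcl',
    >   'mapreduce.jdbc.username' = 'sh',
    >   'mapreduce.jdbc.password' = 'sh',
    >   'mapreduce.jdbc.input.table.name' = 'PRODUCTS');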
40. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle Loader for Hadoop
• Loads data from Hadoop into an Oracle table
• Java MapReduce application
• Online and offline modes
• Requires several XML input files:
• A loader map: describes the destination table (types, format, etc.)
• An input file description: Avro, delimited text, KV (if Oracle NoSQL file)
• An output file description: JDBC (online), OCI (online), delimited text (offline), DataPump (offline)
• A database connection description file
41. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle Loader for Hadoop
• Oracle Shell for Hadoop Loaders (OHSH)
• A set of declarative commands to copy contents from Oracle to Hadoop (Hive)
• Needs Copy To Hadoop, which is included in the Big Data SQL licence
$ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader
-D oracle.hadoop.loader.jobName=HDFSUSER_sales_sh_loadJdbc
-D mapred.reduce.tasks=0
-D mapred.input.dir=/user/laurent/sqoop_raw
-D mapred.output.dir=/user/laurent/OracleLoader
-conf /home/laurent/OL_connection.xml
-conf /home/laurent/OL_inputFormat.xml
-conf /home/laurent/OL_mapconf.xml
-conf /home/laurent/OL_outputFormat.xml
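A sketch of what one of these configuration files can contain, here the connection description (property names per the Oracle Loader for Hadoop documentation; values reused from earlier examples):
<!-- /home/laurent/OL_connection.xml (illustrative) -->
<configuration>
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value>jdbc:oracle:thin:@//192.168.99.8:1521/orcl</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value>sh</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value>sh</value>
  </property>
</configuration>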
42. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle SQL Connector for HDFS
• Java MapReduce application
• Creates an Oracle external table and links it to HDFS files
• Same limitations as Oracle external tables:
• No insert, update, or delete
• Parallel query enabled with automatic load balancing
• Full scans only
• Indexing is not possible
• Two commands:
• createTable: creates the external table and links its location files to HDFS files
• publish: refreshes the location files in the table DDL
• Can be used to read:
• DataPump files on HDFS
• Delimited text files on HDFS
• Delimited text files in Hive tables
Data is not updated in real time.
43. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle SQL Connector for HDFS
• CreateTable
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar
oracle.hadoop.exttab.ExternalTable
-D oracle.hadoop.exttab.tableName=T1_EXT
-D oracle.hadoop.exttab.sourceType=hive
-D oracle.hadoop.exttab.hive.tableName=T1
-D oracle.hadoop.exttab.hive.databaseName=hive_sample
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//192.168.99.8:1521/ORCL
-D oracle.hadoop.connection.user=sh
-D oracle.hadoop.exttab.printStackTrace=true
-createTable
CREATE TABLE "SH"."T1_EXT"
( "ID" NUMBER(*,0),
"V" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "SALES_HIVE_DIR"
ACCESS PARAMETERS
( RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'01'
MISSING FIELD VALUES ARE NULL
(
"ID" CHAR NULLIF "ID"=0X'5C4E',
"V" CHAR(4000) NULLIF "V"=0X'5C4E'
)
)
LOCATION
( 'osch-20170919025617-5707-1',
'osch-20170919025617-5707-2',
'osch-20170919025617-5707-3'
)
)
REJECT LIMIT UNLIMITED
PARALLEL
$ grep uri /data/sales_hive/osch-20170919025617-5707-1
<uri_list>
<uri_list_item size="9" compressionCodec="">
hdfs://hadoop1.localdomain:8020/user/hive/warehouse/hive_sample.db/t1/000000_0
</uri_list_item>
</uri_list>
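When new files land on the Hive/HDFS side, the location files can be refreshed with the second command, publish (a sketch reusing the parameters above; the exact required options may vary by source type):
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar
oracle.hadoop.exttab.ExternalTable
-D oracle.hadoop.exttab.tableName=T1_EXT
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//192.168.99.8:1521/ORCL
-D oracle.hadoop.connection.user=sh
-publish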
46. Hadoop & Oracle: let them talk together
• Oracle BigData SQL
• Support for queries against non-relational data sources
• Apache Hive
• HDFS
• Oracle NoSQL
• Apache HBase
• Other NoSQL Databases
• “Cold” tablespaces (and datafiles) storage on Hadoop/HDFS
• Licensing
• BigData SQL is licensed separately from Big Data Appliance
• Installation on a BDA is not mandatory
• BigData SQL is licensed per disk drive, per Hadoop cluster
• Public price: $4,000 per disk drive
• All disks in a Hadoop cluster have to be licensed
47. Hadoop & Oracle: let them talk together
• Oracle BigData SQL
• Three-phase installation
• BigDataSQL Parcel deployment (CDH)
• Database Server bundle configuration
• Package deployment on the database Server
• For Oracle 12.1.0.2 and above
• Some patches are needed! → See the Oracle Big Data SQL Master Compatibility Matrix (Doc ID 2119369.1)
48. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: Features
• External Table with new Access drivers
• ORACLE_HIVE: existing Hive tables; metadata is stored in HCatalog
• ORACLE_HDFS: creates an external table directly on HDFS files; metadata is declared through access parameters (mandatory)
• Smart Scan for HDFS
• Oracle external tables typically require full scans
• BigData SQL extends Smart Scan capabilities to external tables:
• Smaller result sets sent to the Oracle server
• Data movement and network traffic reduced
• Storage Indexes (only for Hive and HDFS sources)
• Oracle external tables cannot have indexes
• BigData SQL maintains storage indexes automatically
• Available for =, <, <=, !=, >=, >, IS NULL and IS NOT NULL
• Predicate pushdown and column projection
• Read only tablespaces on HDFS
49. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: example
• External Table creation on a pre-existing Hive Table
SQL> CREATE TABLE "LAURENT"."ORA_TRACKS" (
2 ref1 VARCHAR2(4000),
3 ref2 VARCHAR2(4000),
4 artist VARCHAR2(4000),
5 title VARCHAR2(4000))
6 ORGANIZATION EXTERNAL
7 (TYPE ORACLE_HIVE
8 DEFAULT DIRECTORY DEFAULT_DIR
9 ACCESS PARAMETERS (
10 com.oracle.bigdata.cluster=clusterName
11 com.oracle.bigdata.tablename=hive_lolo.tracks_h)
12 )
13 PARALLEL 2 REJECT LIMIT UNLIMITED
14 /
Table created.
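The ORACLE_HDFS driver follows the same pattern but points directly at HDFS files; a hedged sketch of a table like the SALES_HDFS queried on a later slide (path, row format, and column list are assumptions):
SQL> CREATE TABLE "LAURENT"."SALES_HDFS" (
  2  prod_id NUMBER,
  3  quantity_sold NUMBER,
  4  amount_sold NUMBER)
  5  ORGANIZATION EXTERNAL
  6  (TYPE ORACLE_HDFS
  7  DEFAULT DIRECTORY DEFAULT_DIR
  8  ACCESS PARAMETERS (
  9  com.oracle.bigdata.cluster=clusterName
 10  com.oracle.bigdata.rowformat: DELIMITED FIELDS TERMINATED BY ',')
 11  LOCATION ('/user/laurent/sales/*')
 12  )
 13  PARALLEL 2 REJECT LIMIT UNLIMITED
 14  /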
51. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: example
SQL_ID dm5u21rng1mf4, child number 0
-------------------------------------
select p.prod_name,sum(s."QUANTITY_SOLD") from products p,
laurent.sales_hdfs s where p.prod_id=s."PROD_ID" and
s."AMOUNT_SOLD">300 group by p.prod_name
Plan hash value: 4039843832
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1364 (100)| |
| 1 | HASH GROUP BY | | 71 | 4899 | 1364 (1)| 00:00:01 |
|* 2 | HASH JOIN | | 20404 | 1374K| 1363 (1)| 00:00:01 |
| 3 | TABLE ACCESS FULL | PRODUCTS | 72 | 2160 | 3 (0)| 00:00:01 |
|* 4 | EXTERNAL TABLE ACCESS STORAGE FULL| SALES_HDFS | 20404 | 777K| 1360 (1)| 00:00:01 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("P"."PROD_ID"="S"."PROD_ID")
4 - filter("S"."AMOUNT_SOLD">300)
24 rows selected.
52. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: Read-only tablespace offload
• Move cold data in a read-only tablespace to HDFS
• Use a FUSE mount point to the HDFS root
SQL> select tablespace_name,STATUS from dba_tablespaces where tablespace_name='MYTBS';
TABLESPACE_NAME STATUS
------------------------------ ---------
MYTBS READ ONLY
SQL> select tablespace_name,status,file_name from dba_data_files where tablespace_name='MYTBS';
TABLESPACE_NAME STATUS FILE_NAME
------------------------------ --------- --------------------------------------------------------------------------------
MYTBS AVAILABLE /u01/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:cluster/user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
[oracle@oel6 MYTBS]$ pwd
/u01/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:cluster/user/oracle/cluster-oel6.localdomain-orcl/MYTBS
[oracle@oel6 MYTBS]$ ls -l
total 4
lrwxrwxrwx 1 oracle oinstall 82 Sep 19 18:21 mytbs01.dbf -> /mnt/fuse-cluster-hdfs/user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
$ hdfs dfs -ls /user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
-rw-r--r-- 3 oracle oinstall 104865792 2017-09-19 18:21 /user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
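The FUSE mount behind this symlink can be created with the hadoop-fuse-dfs helper shipped with CDH (a sketch; the NameNode address is reused from an earlier slide and the mount point is an assumption):
$ sudo mkdir -p /mnt/fuse-cluster-hdfs
$ sudo hadoop-fuse-dfs dfs://hadoop1.localdomain:8020 /mnt/fuse-cluster-hdfs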
54. Hadoop & Oracle: let them talk together
• Gluent Data Platform
• Presents data stored in Hadoop (in various formats) to any compatible RDBMS (Oracle, SQL Server, Teradata)
• Offloads your data and your workload into Hadoop
• Tables or contiguous partitions
• Takes advantage of a distributed platform (storage and processing)
• Advises which schemas or data can be safely offloaded to Hadoop
55. Hadoop & Oracle: let them talk together
• Conclusion
• Hadoop integration with Oracle can help:
• To take advantage of distributed storage and processing
• To optimize storage placement
• To reduce TCO (workload offloading, Oracle Data Mining Option, etc.)
• Many scenarios
• Many products for many solutions
• Many Prices
• Choose the best solution(s) for your specific problems!