5. Introduction
• The approach to projects (and database usage) has changed
• 10-15 years ago… the "product" approach
• New project
• I usually use Oracle or SQL Server
• So… I will use Oracle or SQL Server for this project
• Nowadays… the "solution" approach
• New project
• What kind of data will my system store?
• Do I have expectations regarding consistency, security, sizing, pricing/licensing, etc.?
• So… I will use the right tool for the right job!
Companies now live in a heterogeneous world!
6. Hadoop: what it is, how it works, and what it can do
• Hadoop is a framework used:
• For distributed storage (HDFS : Hadoop Distributed File System)
• To process high volumes of data using the MapReduce programming model (among others)
• Hadoop is open source
• But some enterprise distributions exist
• Hadoop is designed to analyze lots of data
• Hadoop is designed to scale to tens of petabytes
• Hadoop is designed to manage structured and unstructured data
• Hadoop is designed to run on commodity servers
• Hadoop is initially designed for analytics and batch workloads
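As a quick illustration of the HDFS side described above, storing and listing a file on the distributed file system takes a couple of shell commands (a minimal sketch; paths and file name are illustrative):
$ hdfs dfs -mkdir -p /user/laurent/raw
$ hdfs dfs -put sales.csv /user/laurent/raw/
$ hdfs dfs -ls /user/laurent/raw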
11. Right tool for the right job?
• Oracle
• Perfect for OLTP workloads
• Ideal as an all-in-one database:
• Structured: tables, constraints, typed data, etc.
• Unstructured: images, videos, binary
• Many formats: XML, JSON
• Sharded database
• Hadoop
• Free and scalable
• Many open data formats (Avro, Parquet, Kudu etc.)
• Many processing tools (MapReduce, Spark, Kafka etc.)
• Analytic workloads
• Designed to manage large amounts of data quickly
How can I connect Hadoop to Oracle, Oracle to Hadoop, and query data?
Which solutions exist to exchange data between Oracle and Hadoop?
How can I reuse my Oracle data in a Hadoop workload?
14. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports from a Hadoop cluster
• Scenarios
• Enrich analytic workloads with multiple data sources
[Diagram: Sqoop moves Oracle data into Hadoop/HDFS, where it is combined with unstructured data; the analytic workload runs on Hadoop and its results land on HDFS and/or back in Oracle]
15. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports
• Scenarios
• Offload analytic workloads onto Hadoop
[Diagram: Sqoop exports Oracle data to Hadoop/HDFS; the analytic workload runs on Hadoop and Sqoop loads the results back into Oracle]
16. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports
• Scenarios
• Offload analytic workloads onto Hadoop and keep the data on HDFS
[Diagram: Sqoop exports Oracle data to Hadoop/HDFS; the analytic workload runs on Hadoop and the results stay on HDFS]
17. Hadoop & Oracle: let them talk together
• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)
• Basically, a tool to run data exports and imports
• Scenarios
• Data archiving into Hadoop
[Diagram: Sqoop archives Oracle data into compressed filesets on HDFS]
19. Hadoop & Oracle: let them talk together
• Sqoop import moves data from the RDBMS to Hadoop (see the sketch below)
• One Oracle session per mapper
• Reads are done in direct path mode
• A SQL statement can be used to filter the data to import
• Results can be stored in various formats: delimited text, Hive, Parquet, compressed or not
• A key issue is data type conversion
• Hive Datatype mapping (--map-column-hive "TIME_ID"=timestamp)
• Java Datatype mapping (--map-column-java "ID"=Integer, "VALUE"=String)
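Putting these options together, a minimal sqoop import sketch (connection details reused from the export example below; table, filter, and target directory are illustrative):
$ sqoop import
> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl
> --username sh --password sh
> --table SALES
> --where "AMOUNT_SOLD > 1500"
> --map-column-java PROD_ID=Integer
> --target-dir /user/laurent/sqoop_raw
> --num-mappers 4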
20. Hadoop & Oracle: let them talk together
• Sqoop export moves data from Hadoop to the RDBMS
• The destination table has to be created first
• Direct mode is possible
• Two modes
• Insert mode
• Update mode
$ sqoop export
> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl
> --username sh --password sh
> --direct
> --table S_RESULT
> --export-dir=/user/hive/warehouse/hive_sample.db/s_result
> --input-fields-terminated-by '\001'
SQL> select * from sh.s_result;
PROD_ID P_SUM P_MIN
---------- ---------- ----------
47 1132200.93 25.97
46 749501.85 21.23
45 1527220.89 42.09
44 889945.74 42.09
…/…
22. Hadoop & Oracle: let them talk together
• Spark for Hadoop
• Spark is an Open Source distributed computing framework
• Fault tolerant by design
• Can work with various cluster managers
• YARN
• MESOS
• Spark Standalone
• Kubernetes (Experimental)
• Centered on a data structure called RDD (Resilient Distributed Dataset)
• Based on various components
• Spark Core
• Spark SQL (Data Abstraction)
• Spark Streaming (Data Ingestion)
• Spark MLlib (Machine Learning)
• Spark GraphX (Graph processing on top of Spark)
[Diagram: Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX run on top of Spark Core]
24. Hadoop & Oracle: let them talk together
• Spark for Hadoop
• Evolution
• RDD, DataFrames, and DataSets can be filled from an Oracle Data source
• RDD
• DataFrame: RDD + named-column organization
• DataSet: DataFrame specialization
• Untyped (DataFrame = DataSet[Row]) or typed (DataSet[T]); best for Spark SQL
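A minimal spark-shell sketch of filling a DataFrame from an Oracle data source over JDBC (assumes the Oracle JDBC driver is on the classpath; connection details are reused from the Sqoop examples, and the table and columns come from the SH sample schema):
scala> val df = spark.read.format("jdbc").
     |   option("url", "jdbc:oracle:thin:@//192.168.99.8:1521/orcl").
     |   option("user", "sh").option("password", "sh").
     |   option("driver", "oracle.jdbc.OracleDriver").
     |   option("dbtable", "SH.PRODUCTS").
     |   load()
scala> df.filter("PROD_LIST_PRICE > 100").groupBy("PROD_CATEGORY").count().show()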
26. Hadoop & Oracle: let them talk together
• Spark for Hadoop: Spark vs MapReduce
• MR is batch oriented (Map, then Reduce); Spark also targets interactive and (near) real-time processing
• MR stores intermediate data on disk; Spark keeps data in memory
• MR is written in Java, Spark is written in Scala
• Performance comparison
• WordCount on a file of 2Gb
• Execution time with and without optimization (mapper, reducer, memory, partitioning, etc.) → see http://repository.stcloudstate.edu/cgi/viewcontent.cgi?article=1008&context=csit_etds
                      MR       Spark
Without optimization  3'53''   34''
With optimization     2'23''   29''
→ Spark roughly 5x faster
34. Hadoop & Oracle: let them talk together
SQL> create public database link hivedsn connect to "laurent" identified by "laurent" using 'HIVEDSN';
Database link created.
SQL> show user
USER is "SH"
SQL> select p.prod_name,sum(s."quantity_sold")
2 from products p, s_hive@hivedsn s
3 where p.prod_id=s."prod_id"
4 and s."amount_sold">1500
5 group by p.prod_name;
PROD_NAME SUM(S."QUANTITY_SOLD")
-------------------------------------------------- ----------------------
Mini DV Camcorder with 3.5" Swivel LCD 3732
Envoy Ambassador 20607
256MB Memory Card 3
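This Hive database link presumably goes through Database Gateway for ODBC (DG4ODBC) on top of a Hive ODBC driver; a minimal configuration sketch (file contents, DSN name, and driver path are assumptions, not shown in the deck):
# $ORACLE_HOME/hs/admin/initHIVEDSN.ora
HS_FDS_CONNECT_INFO = HIVEDSN        # ODBC DSN declared in odbc.ini
HS_FDS_SHAREABLE_NAME = /usr/lib64/libodbc.so
# tnsnames.ora entry used by the database link (plus a matching
# SID_DESC with PROGRAM=dg4odbc in listener.ora)
HIVEDSN =
  (DESCRIPTION=
    (ADDRESS=(PROTOCOL=tcp)(HOST=oel6.localdomain)(PORT=1521))
    (CONNECT_DATA=(SID=HIVEDSN))
    (HS=OK))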
35. Hadoop & Oracle: let them talk together
------------------------------------------------------------------------------------------------
SQL_ID abb5vb85sd3kt, child number 0
-------------------------------------
select p.prod_name,sum(s."quantity_sold") from products p,
s_hive@hivedsn s where p.prod_id=s."prod_id" and s."amount_sold">1500
group by p.prod_name
Plan hash value: 3779319722
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Inst |IN-OUT|
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 204 (100)| | | |
| 1 | HASH GROUP BY | | 71 | 4899 | 204 (1)| 00:00:01 | | |
|* 2 | HASH JOIN | | 100 | 6900 | 203 (0)| 00:00:01 | | |
| 3 | TABLE ACCESS FULL| PRODUCTS | 72 | 2160 | 3 (0)| 00:00:01 | | |
| 4 | REMOTE | S_HIVE | 100 | 3900 | 200 (0)| 00:00:01 | HIVED~ | R->S |
------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("P"."PROD_ID"="S"."prod_id")
Remote SQL Information (identified by operation id):
----------------------------------------------------
4 - SELECT `prod_id`,`quantity_sold`,`amount_sold` FROM `S_HIVE` WHERE
`amount_sold`>1500 (accessing 'HIVEDSN' )
hive> show create table s_hive;
.../...
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE TABLE `s_hive`( |
.../...
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='6', |
| 'numRows'='2756529', |
| 'rawDataSize'='120045918', |
| 'totalSize'='122802447', |
| 'transient_lastDdlTime'='1505740396') |
+----------------------------------------------------+--+
SQL> select count(*) from s_hive@hivedsn where "amount_sold">1500;
COUNT(*)
----------
24342
37. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors
• Components available
• Oracle Datasource for Apache Hadoop
• Oracle Loader for Hadoop
• Oracle SQL Connector for HDFS
• Oracle R Advanced Analytics for Hadoop
• Oracle XQuery for Hadoop
• Oracle Data Integrator Enterprise Edition
• BigData Connectors are licensed separately from the Big Data Appliance (BDA)
• BigData Connectors can be installed on a BDA or on any Hadoop cluster
• BigData Connectors must be licensed for all processors of a Hadoop cluster
• Public price: $2,000 per Oracle processor
38. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle Datasource for Apache Hadoop
• Available for Hive and Spark
• Enables an Oracle table as a data source in Hive or Spark (see the sketch after this list)
• Based on Hive external tables
• Metadata is stored in HCatalog
• Data stays on the Oracle server
• Secured (Wallet and Kerberos integration)
• Writing data from Hive to Oracle is possible
• Performance
• Filter predicates are pushed down
• Projection Pushdown to retrieve only required columns
• Partition pruning enabled
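A hedged sketch of declaring an Oracle table as a Hive data source with this connector (the storage handler and property names follow the OD4H documentation, but treat them as assumptions and verify against your connector version; connection values are reused from earlier examples):
hive> CREATE EXTERNAL TABLE products_ora (
    >   prod_id INT,
    >   prod_name STRING)
    > STORED BY 'oracle.hcat.osh.OracleStorageHandler'
    > WITH SERDEPROPERTIES (
    >   'oracle.hcat.osh.columns.mapping' = 'prod_id,prod_name')
    > TBLPROPERTIES (
    >   'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@//192.168.99.8:1521/orcl',
    >   'mapreduce.jdbc.username' = 'sh',
    >   'mapreduce.jdbc.password' = 'sh',
    >   'mapreduce.jdbc.input.table.name' = 'PRODUCTS');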
40. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle Loader for Hadoop
• Loads data from Hadoop into an Oracle table
• Java MapReduce application
• Online and offline modes
• Requires several XML input files:
• A loader map: describes the destination table (types, format, etc.)
• An input file description: Avro, delimited text, KV (if Oracle NoSQL file)
• An output file description: JDBC (online), OCI (online), delimited text (offline), DataPump (offline)
• A database connection description file
41. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle Loader for Hadoop
• Oracle Shell for Hadoop Loaders (OHSH)
• A set of declarative commands to copy contents from Oracle to Hadoop (Hive)
• Needs Copy To Hadoop, which is included in the Big Data SQL licence
$ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader
-D oracle.hadoop.loader.jobName=HDFSUSER_sales_sh_loadJdbc
-D mapred.reduce.tasks=0
-D mapred.input.dir=/user/laurent/sqoop_raw
-D mapred.output.dir=/user/laurent/OracleLoader
-conf /home/laurent/OL_connection.xml
-conf /home/laurent/OL_inputFormat.xml
-conf /home/laurent/OL_mapconf.xml
-conf /home/laurent/OL_outputFormat.xml
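A sketch of what one of these configuration files can contain, here the connection description (property names per the Oracle Loader for Hadoop documentation; values reused from earlier examples):
<!-- /home/laurent/OL_connection.xml (illustrative) -->
<configuration>
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value>jdbc:oracle:thin:@//192.168.99.8:1521/orcl</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value>sh</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value>sh</value>
  </property>
</configuration>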
42. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle SQL Connector for HDFS
• Java MapReduce application
• Creates an Oracle external table and links it to HDFS files
• Same limitations as Oracle external tables:
• No insert, update, or delete
• Parallel query enabled with automatic load balancing
• Full scans only
• Indexing is not possible
• Two commands:
• createTable: creates the external table and links its location files to HDFS files
• publish: refreshes the location files in the table DDL
• Can be used to read:
• DataPump files on HDFS
• Delimited text files on HDFS
• Delimited text files in Hive tables
Data is not updated in real time.
43. Hadoop & Oracle: let them talk together
• Oracle BigData Connectors: Oracle SQL Connector for HDFS
• CreateTable
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar
oracle.hadoop.exttab.ExternalTable
-D oracle.hadoop.exttab.tableName=T1_EXT
-D oracle.hadoop.exttab.sourceType=hive
-D oracle.hadoop.exttab.hive.tableName=T1
-D oracle.hadoop.exttab.hive.databaseName=hive_sample
-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//192.168.99.8:1521/ORCL
-D oracle.hadoop.connection.user=sh
-D oracle.hadoop.exttab.printStackTrace=true
-createTable
CREATE TABLE "SH"."T1_EXT"
( "ID" NUMBER(*,0),
"V" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "SALES_HIVE_DIR"
ACCESS PARAMETERS
( RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'01'
MISSING FIELD VALUES ARE NULL
(
"ID" CHAR NULLIF "ID"=0X'5C4E',
"V" CHAR(4000) NULLIF "V"=0X'5C4E'
)
)
LOCATION
( 'osch-20170919025617-5707-1',
'osch-20170919025617-5707-2',
'osch-20170919025617-5707-3'
)
)
REJECT LIMIT UNLIMITED
PARALLEL
$ grep uri /data/sales_hive/osch-20170919025617-5707-1
<uri_list>
<uri_list_item size="9" compressionCodec="">
hdfs://hadoop1.localdomain:8020/user/hive/warehouse/hive_sample.db/t1/000000_0
</uri_list_item>
</uri_list>
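When new files land on the Hive/HDFS side, the location files can be refreshed with the second command, publish (a sketch reusing the parameters above; the exact required options may vary by source type):
$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar
oracle.hadoop.exttab.ExternalTable
-D oracle.hadoop.exttab.tableName=T1_EXT
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//192.168.99.8:1521/ORCL
-D oracle.hadoop.connection.user=sh
-publish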
46. Hadoop & Oracle: let them talk together
• Oracle BigData SQL
• Support for queries against non-relational data sources
• Apache Hive
• HDFS
• Oracle NoSQL
• Apache HBase
• Other NoSQL Databases
• “Cold” tablespaces (and datafiles) storage on Hadoop/HDFS
• Licensing
• BigData SQL is licensed separately from Big Data Appliance
• Installation on a BDA is not mandatory
• BigData SQL is licensed per disk drive, per Hadoop cluster
• Public price: $4,000 per disk drive
• All disks in a Hadoop cluster have to be licensed
47. Hadoop & Oracle: let them talk together
• Oracle BigData SQL
• Three-phase installation
• BigDataSQL Parcel deployment (CDH)
• Database Server bundle configuration
• Package deployment on the database Server
• For Oracle 12.1.0.2 and above
• Some patches are needed! → See the Oracle Big Data SQL Master Compatibility Matrix (Doc ID 2119369.1)
48. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: Features
• External Table with new Access drivers
• ORACLE_HIVE: existing Hive tables; metadata is stored in HCatalog
• ORACLE_HDFS: creates an external table directly on HDFS files; metadata is declared through access parameters (mandatory)
• Smart Scan for HDFS
• Oracle external tables typically require full scans
• BigData SQL extends Smart Scan capabilities to external tables:
• Smaller result sets sent to the Oracle server
• Data movement and network traffic reduced
• Storage Indexes (only for Hive and HDFS sources)
• Oracle external tables cannot have indexes
• BigData SQL maintains storage indexes automatically
• Available for =, <, <=, !=, >=, >, IS NULL and IS NOT NULL
• Predicate pushdown and column projection
• Read only tablespaces on HDFS
49. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: example
• External Table creation on a pre-existing Hive Table
SQL> CREATE TABLE "LAURENT"."ORA_TRACKS" (
2 ref1 VARCHAR2(4000),
3 ref2 VARCHAR2(4000),
4 artist VARCHAR2(4000),
5 title VARCHAR2(4000))
6 ORGANIZATION EXTERNAL
7 (TYPE ORACLE_HIVE
8 DEFAULT DIRECTORY DEFAULT_DIR
9 ACCESS PARAMETERS (
10 com.oracle.bigdata.cluster=clusterName
11 com.oracle.bigdata.tablename=hive_lolo.tracks_h)
12 )
13 PARALLEL 2 REJECT LIMIT UNLIMITED
14 /
Table created.
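The ORACLE_HDFS driver follows the same pattern but points directly at HDFS files; a hedged sketch of a table like the SALES_HDFS queried on a later slide (path, row format, and column list are assumptions):
SQL> CREATE TABLE "LAURENT"."SALES_HDFS" (
  2  prod_id NUMBER,
  3  quantity_sold NUMBER,
  4  amount_sold NUMBER)
  5  ORGANIZATION EXTERNAL
  6  (TYPE ORACLE_HDFS
  7  DEFAULT DIRECTORY DEFAULT_DIR
  8  ACCESS PARAMETERS (
  9  com.oracle.bigdata.cluster=clusterName
 10  com.oracle.bigdata.rowformat: DELIMITED FIELDS TERMINATED BY ',')
 11  LOCATION ('/user/laurent/sales/*')
 12  )
 13  PARALLEL 2 REJECT LIMIT UNLIMITED
 14  /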
51. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: example
SQL_ID dm5u21rng1mf4, child number 0
-------------------------------------
select p.prod_name,sum(s."QUANTITY_SOLD") from products p,
laurent.sales_hdfs s where p.prod_id=s."PROD_ID" and
s."AMOUNT_SOLD">300 group by p.prod_name
Plan hash value: 4039843832
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1364 (100)| |
| 1 | HASH GROUP BY | | 71 | 4899 | 1364 (1)| 00:00:01 |
|* 2 | HASH JOIN | | 20404 | 1374K| 1363 (1)| 00:00:01 |
| 3 | TABLE ACCESS FULL | PRODUCTS | 72 | 2160 | 3 (0)| 00:00:01 |
|* 4 | EXTERNAL TABLE ACCESS STORAGE FULL| SALES_HDFS | 20404 | 777K| 1360 (1)| 00:00:01 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("P"."PROD_ID"="S"."PROD_ID")
4 - filter("S"."AMOUNT_SOLD">300)
24 rows selected.
52. Hadoop & Oracle: let them talk together
• Oracle BigData SQL: Read-only tablespace offload
• Move cold data in a read-only tablespace to HDFS
• Use a FUSE mount point to the HDFS root
SQL> select tablespace_name,STATUS from dba_tablespaces where tablespace_name='MYTBS';
TABLESPACE_NAME STATUS
------------------------------ ---------
MYTBS READ ONLY
SQL> select tablespace_name,status,file_name from dba_data_files where tablespace_name='MYTBS';
TABLESPACE_NAME STATUS FILE_NAME
------------------------------ --------- --------------------------------------------------------------------------------
MYTBS AVAILABLE /u01/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:cluster/user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
[oracle@oel6 MYTBS]$ pwd
/u01/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:cluster/user/oracle/cluster-oel6.localdomain-orcl/MYTBS
[oracle@oel6 MYTBS]$ ls -l
total 4
lrwxrwxrwx 1 oracle oinstall 82 Sep 19 18:21 mytbs01.dbf -> /mnt/fuse-cluster-hdfs/user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
$ hdfs dfs -ls /user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
-rw-r--r-- 3 oracle oinstall 104865792 2017-09-19 18:21 /user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf
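The FUSE mount behind this symlink can be created with the hadoop-fuse-dfs helper shipped with CDH (a sketch; the NameNode address is reused from an earlier slide and the mount point is an assumption):
$ sudo mkdir -p /mnt/fuse-cluster-hdfs
$ sudo hadoop-fuse-dfs dfs://hadoop1.localdomain:8020 /mnt/fuse-cluster-hdfs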
54. Hadoop & Oracle: let them talk together
• Gluent Data Platform
• Presents data stored in Hadoop (in various formats) to any compatible RDBMS (Oracle, SQL Server, Teradata)
• Offloads your data and your workload into Hadoop
• Tables or contiguous partitions
• Takes advantage of a distributed platform (storage and processing)
• Advises which schemas or data can be safely offloaded to Hadoop
55. Hadoop & Oracle: let them talk together
• Conclusion
• Hadoop integration with Oracle can help:
• To take advantage of distributed storage and processing
• To optimize storage placement
• To reduce TCO (workload offloading, Oracle Data Mining Option, etc.)
• Many scenarios
• Many products for many solutions
• Many Prices
• Choose the best solution(s) for your specific problems!