From Oracle to Hadoop: 
Unlocking Hadoop for Your RDBMS with 
Apache Sqoop and Other Tools 
Guy Harrison, David Robson, Kate Ting 
{guy.harrison, david.robson}@software.dell.com, 
kate@cloudera.com 
October 16, 2014
About Guy, David, & Kate 
Guy Harrison @guyharrison 
- Executive Director of R&D @ Dell 
- Author of Oracle Performance Survival Guide & MySQL Stored Procedure Programming 
David Robson @DavidR021 
- Principal Technologist @ Dell 
- Sqoop Committer, Lead on Toad for Hadoop & OraOop 
Kate Ting @kate_ting 
- Technical Account Mgr @ Cloudera 
- Sqoop Committer/PMC, Co-author of Apache Sqoop Cookbook
RDBMS and Hadoop
• The relational database reigned supreme for more than two decades
• Hadoop and other non-relational tools have overthrown that hegemony
• We are unlikely to return to a “one size fits all” model based on Hadoop
- Though some will try
• For the foreseeable future, enterprise information architectures will include relational and non-relational stores
Scenarios
1. We need to access RDBMS to make sense of Hadoop data
[Diagram labels: Weblogs, Flume, Products, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]
Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
[Diagram labels: Products, Sales, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]
Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
3. Feeding YARN/MR output into RDBMS
[Diagram labels: Weblogs, Flume, HDFS, YARN/MR1, Analytic output, SQOOP, Weblog Summary, RDBMS]
Scenarios
1. We need to access RDBMS to make sense of Hadoop data
2. We want to use Hadoop to analyse RDBMS data
3. Hadoop output belongs in RDBMS data warehouse
4. We archive old RDBMS data to Hadoop
[Diagram labels: Sales, RDBMS, SQOOP, Old Sales, HDFS, HQL, SQL, BI platform]
SQOOP
• SQOOP was created in 2009 by Aaron Kimball as a means of moving data between SQL databases and Hadoop
• It provided a generic implementation for moving data
• It also provided a framework for implementing database-specific optimized connectors
How SQOOP works (import)
[Diagram labels: RDBMS (table metadata, table data), SQOOP, Table.java, Hive DDL, Hive table, map tasks (DataDrivenDBInputFormat, FileOutputFormat), HDFS files]
SQOOP & Oracle
SQOOP issues with Oracle
• SQOOP uses primary key ranges to divide up data between mappers
• However, deletes tend to hit older key values harder, making key ranges unbalanced
• Data is almost never arranged on disk in key order, so index scans collide on disk
• Load is unbalanced, and IO block requests >> blocks in the table
[Diagram: Oracle table on disk; one mapper (Oracle session) range-scans index blocks for ID > 0 and ID < MAX/2, another for ID > MAX/2]
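The imbalance described on this slide can be illustrated with a minimal Python sketch. It is not Sqoop's actual splitter: it only divides a key range into equal-width slices and counts the surviving rows in each slice. The function names and the sample table (IDs 1 to 1000, with 90% of the older half deleted) are assumptions made for illustration.

```python
def split_key_range(min_id, max_id, num_mappers):
    """Split [min_id, max_id] into num_mappers contiguous, equal-width ranges."""
    width = (max_id - min_id + 1) / num_mappers
    bounds = [min_id + round(i * width) for i in range(num_mappers)]
    bounds.append(max_id + 1)
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def rows_per_range(ids, ranges):
    """Count how many existing rows fall into each mapper's key range."""
    return [sum(1 for x in ids if lo <= x <= hi) for lo, hi in ranges]


# Hypothetical table: deletes have removed 90% of the older half of the keys.
surviving_ids = [i for i in range(1, 1001) if i > 500 or i % 10 == 0]

ranges = split_key_range(1, 1000, 2)
counts = rows_per_range(surviving_ids, ranges)
print(ranges, counts)
```

With two mappers, the ranges are equally wide, but the second mapper receives ten times as many rows as the first, so the job finishes only when the overloaded mapper does.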
Other problems
• Oracle might run each mapper using a full scan – clobbering the database
• Oracle might run each mapper in parallel – clobbering the database
• Sqoop may clobber the database cache
[Chart: Elapsed time (s) vs. number of mappers]
[Chart: Database load, database time (s) vs. number of mappers]
High speed connector design
• Partition data based on physical storage
• Bypass Oracle buffering
• Bypass Oracle parallelism
• Do not require or use indexes
• Never read the same data block more than once
• Support Oracle datatypes
[Diagram labels: Oracle table, three Oracle sessions, three Hadoop mappers, HDFS]
Imports (Oracle->Hadoop)
• Uses Oracle block/extent map to equally divide IO
• Uses Oracle direct path (non-buffered) IO for all reads
• Round-robin, sequential or random allocation
• All mappers get an equal number of blocks & no block is read twice
• If table is partitioned, each mapper can work on a separate partition – results in partitioned output
[Diagram labels: Oracle table, three Oracle sessions, three Hadoop mappers, HDFS]
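The three allocation strategies named above can be sketched as follows. This is an illustrative Python model, not the connector's code: the function name, the `method` strings, and the idea of numbering chunks of blocks from 0 are all assumptions. The point it shows is that every chunk goes to exactly one mapper and mapper workloads differ by at most one chunk.

```python
import random


def allocate_chunks(chunk_ids, num_mappers, method="roundrobin", seed=0):
    """Assign each chunk of table blocks to exactly one mapper."""
    chunks = list(chunk_ids)
    if method == "random":
        # Shuffle first, then deal out like round-robin so sizes stay even.
        random.Random(seed).shuffle(chunks)
    if method in ("roundrobin", "random"):
        return [chunks[m::num_mappers] for m in range(num_mappers)]
    if method == "sequential":
        # Contiguous runs of chunks; the first `extra` mappers get one more.
        base, extra = divmod(len(chunks), num_mappers)
        allocation, start = [], 0
        for m in range(num_mappers):
            size = base + (1 if m < extra else 0)
            allocation.append(chunks[start:start + size])
            start += size
        return allocation
    raise ValueError(f"unknown allocation method: {method}")


rr = allocate_chunks(range(12), 3, "roundrobin")
seq = allocate_chunks(range(12), 3, "sequential")
rnd = allocate_chunks(range(12), 3, "random")
print(rr, seq, rnd, sep="\n")
```

Unlike a key-range split, this division depends only on how many chunks exist on disk, so deleted rows or skewed key values do not unbalance the mappers.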
Exports (Hadoop-> Oracle)
• Optionally leverages Oracle partitions and temporary tables for parallel writes
• Performs MERGE into Oracle table (updates existing rows, inserts new rows)
• Optionally uses Oracle NOLOGGING (faster but unrecoverable)
[Diagram labels: HDFS, three Hadoop mappers, three Oracle sessions, Oracle table]
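The MERGE behaviour mentioned above (update rows that already exist, insert the rest) can be modelled in a few lines of Python. This is only a sketch of the semantics using an in-memory dictionary keyed by a hypothetical `id` column; it does not issue SQL and does not reflect how the connector stages data in Oracle.

```python
def merge_rows(target, incoming, key="id"):
    """Merge incoming rows into target in place; return (updated, inserted)."""
    updated = inserted = 0
    for row in incoming:
        k = row[key]
        if k in target:
            # Matched: overwrite the columns supplied by the incoming row.
            target[k] = {**target[k], **row}
            updated += 1
        else:
            # Not matched: insert as a new row.
            target[k] = dict(row)
            inserted += 1
    return updated, inserted


# Hypothetical data: one row already exists in the table, one is new.
table = {1: {"id": 1, "city": "Palo Alto"}, 2: {"id": 2, "city": "Brno"}}
exported = [{"id": 2, "city": "Prague"}, {"id": 3, "city": "Sydney"}]
result = merge_rows(table, exported)
print(result, table)
```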
Import – Oracle to Hadoop
• When data is unclustered (randomly distributed by PK), old SQOOP scales poorly
• Clustered data shows better scalability but is still much slower than the direct approach
• New SQOOP typically outperforms old SQOOP by 5-20 times
• We’ve seen the limiting factor as:
- Data IO bandwidth, or
- Network out of DB, or
- Hadoop CPU
[Chart: Elapsed time (s) vs. number of mappers; series: direct=false with unclustered data, direct=false with clustered data, direct=true]
Import - Database overhead
• As you increase mappers in old Sqoop, database load increases rapidly
- (sometimes non-linearly)
• In new Sqoop, queuing occurs only after IO bandwidth is exceeded
[Chart: DB time (minutes) vs. number of mappers; series: Sqoop, Direct]
Export – Hadoop to Oracle
• On export, old SQOOP would hit the database writer bottleneck early on and fail to parallelize
• New SQOOP uses partitioning and direct path inserts
• Typically bottlenecks on write IO on the Oracle side
[Chart: Elapsed time (minutes) vs. number of mappers; series: Sqoop, Direct]
Reduction in database load
• 45% reduction in DB CPU
• 83% reduction in elapsed time
• 90% reduction in total database time
• 99.9% reduction in database IO
8 node Hadoop cluster, 1B rows, 310GB
[Chart: % reduction by metric; bar labels as extracted: IO time, IO requests, DB time, Elapsed time, CPU time; values as extracted: 55.31, 83.45, 90.59, 99.98, 99.28]
Replication
• No matter how fast we make SQOOP, it’s a drag to have to run a SQOOP job before every Hadoop job.
• Replicating data into Hadoop cuts down on SQOOP overhead on both sides and avoids stale data.
Shareplex® for Oracle and Hadoop
Sqoop 1.4.5 Summary 
Sqoop 1.4.5 without --direct | Sqoop 1.4.5 with --direct
Minimal privileges required | Access to DBA views required
Works on most object types: e.g. IOT | 5x-20x faster performance on tables
Favors Sqoop terminology | Favors Oracle terminology
Database load increases non-linearly | Up to 99% reduction in database IO
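As a rough illustration of the two modes compared above, the following Python sketch assembles a Sqoop 1 import command line with or without `--direct`. The helper name and its parameters are hypothetical; the flags themselves (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`, `--direct`) are standard Sqoop 1 import arguments. The sketch only builds the argument list and does not run Sqoop.

```python
import shlex


def build_sqoop_import(connect, username, table, target_dir,
                       num_mappers=4, direct=False):
    """Return the argv list for a Sqoop 1 import (hypothetical helper)."""
    argv = [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "-P",  # prompt for the password instead of putting it on the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]
    if direct:
        # Ask Sqoop to use the database-specific direct connector.
        argv.append("--direct")
    return argv


generic = build_sqoop_import(
    "jdbc:oracle:thin:@//localhost:1521/g12c", "OPSG",
    "OPSG.CUSTOMERS", "CUSTOMERS")
fast = build_sqoop_import(
    "jdbc:oracle:thin:@//localhost:1521/g12c", "OPSG",
    "OPSG.CUSTOMERS", "CUSTOMERS", direct=True)
print(shlex.join(fast))
```

The resulting list can be passed to `subprocess.run` on a host where the `sqoop` client is installed. The connection string and schema are taken from the example later in this deck.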
Future of SQOOP
Sqoop 1 Import Architecture 
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities
Sqoop 1 Export Architecture 
sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities \
  --export-dir /temp/cities
Sqoop 1 Challenges
• Concerns with usability
- Cryptic, contextual command line arguments
• Concerns with security
- Client access to Hadoop bin/config, DB
• Concerns with extensibility
- Connectors tightly coupled with data format
Sqoop 2 Design Goals
• Ease of use
- REST API and Java API
• Ease of security
- Separation of responsibilities
• Ease of extensibility
- Connector SDK, focus on pluggability
Ease of Use 
Sqoop 1 Sqoop 2 
sqoop import \
  -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
  -Ddfs.replication=1 \
  -Dmapred.map.tasks.speculative.execution=false \
  --num-mappers 4 \
  --hive-import --hive-table CUSTOMERS --create-hive-table \
  --connect jdbc:oracle:thin:@//localhost:1521/g12c \
  --username OPSG --password opsg --table OPSG.CUSTOMERS \
  --target-dir CUSTOMERS.CUSTOMERS
Ease of Security 
Sqoop 1 Sqoop 2 
sqoop import \
  -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
  -Ddfs.replication=1 \
  -Dmapred.map.tasks.speculative.execution=false \
  --num-mappers 4 \
  --hive-import --hive-table CUSTOMERS --create-hive-table \
  --connect jdbc:oracle:thin:@//localhost:1521/g12c \
  --username OPSG --password opsg --table OPSG.CUSTOMERS \
  --target-dir CUSTOMERS.CUSTOMERS
• Role-based access to connection objects 
• Prevents misuse and abuse 
• Administrators create, edit, delete 
• Operators use
Ease of Extensibility 
Sqoop 1 Sqoop 2 
Tight Coupling 
• Connectors fetch and store data from db
• Framework handles serialization, format conversion, integration
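The split of responsibilities described on this slide can be sketched in Python. This is not the Sqoop 2 Connector SDK (which is a Java API); the class and function names here are invented for illustration, and JSON stands in for whatever intermediate format a framework might choose. The connector only fetches and stores records, while the framework function owns serialization between the two ends.

```python
import json
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical connector contract: move records in and out of a store."""

    @abstractmethod
    def extract(self):
        """Yield records from the data store."""

    @abstractmethod
    def load(self, records):
        """Write records to the data store."""


class InMemoryConnector(Connector):
    """Toy connector backed by a Python list, standing in for a database."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def extract(self):
        return iter(self.rows)

    def load(self, records):
        self.rows.extend(records)


def transfer(source, sink):
    """Framework side: serialize records, then hand deserialized rows to sink."""
    wire = [json.dumps(record, sort_keys=True) for record in source.extract()]
    sink.load(json.loads(item) for item in wire)
    return len(wire)


src = InMemoryConnector([{"id": 1, "city": "Brno"}, {"id": 2, "city": "Sydney"}])
dst = InMemoryConnector()
moved = transfer(src, dst)
print(moved, dst.rows)
```

Because neither connector knows about the wire format, a new data store only needs a new `Connector` subclass, which is the kind of pluggability the slide attributes to Sqoop 2.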
Takeaway
• Apache Sqoop
- Bulk data transfer tool between external structured datastores and Hadoop
• Sqoop 1.4.5 now with a --direct parameter option for Oracle
- 5x-20x performance improvement on Oracle table imports
• Sqoop 2
- Ease of use, security, extensibility
Questions? 
Guy Harrison @guyharrison 
David Robson @DavidR021 
Kate Ting @kate_ting 
Visit Dell at Booth #102 
Visit Cloudera at Booth #305 
Book Signing: Today @ 3:15pm 
Office Hours: Tomorrow @ 11am

 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 

Último (20)

Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 

From Oracle to Hadoop with Sqoop and other tools

  • 1. From Oracle to Hadoop: Unlocking Hadoop for Your RDBMS with Apache Sqoop and Other Tools Guy Harrison, David Robson, Kate Ting {guy.harrison, david.robson}@software.dell.com, kate@cloudera.com October 16, 2014
  • 2. About Guy, David, & Kate
      Guy Harrison @guyharrison
      - Executive Director of R&D @ Dell
      - Author of Oracle Performance Survival Guide & MySQL Stored Procedure Programming
      David Robson @DavidR021
      - Principal Technologist @ Dell
      - Sqoop Committer, Lead on Toad for Hadoop & OraOop
      Kate Ting @kate_ting
      - Technical Account Mgr @ Cloudera
      - Sqoop Committer/PMC, Co-author of Apache Sqoop Cookbook
  • 6. RDBMS and Hadoop
      - The relational database reigned supreme for more than two decades
      - Hadoop and other non-relational tools have overthrown that hegemony
      - We are unlikely to return to a “one size fits all” model based on Hadoop, though some will try
      - For the foreseeable future, enterprise information architectures will include relational and non-relational stores
  • 7. Scenarios
      1. We need to access RDBMS to make sense of Hadoop data
      [Diagram labels: Weblogs, Flume, HDFS, YARN/MR1, Analytic output, Products, RDBMS, SQOOP]
  • 8. Scenarios
      1. Reference data is in the RDBMS
      2. We want to run analysis outside of the RDBMS
      [Diagram labels: Products, Sales, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]
  • 9. Scenarios
      1. Reference data is in the RDBMS
      2. We want to run analysis outside of the RDBMS
      3. Feeding YARN/MR output into RDBMS
      [Diagram labels: Weblogs, Flume, HDFS, YARN/MR1, Analytic output, SQOOP, Weblog Summary, RDBMS]
  • 10. Scenarios
      1. We need to access RDBMS to make sense of Hadoop data
      2. We want to use Hadoop to analyse RDBMS data
      3. Hadoop output belongs in RDBMS data warehouse
      4. We archive old RDBMS data to Hadoop
      [Diagram labels: HDFS, Old Sales, HQL, SQOOP, RDBMS, Sales, SQL, BI platform]
  • 11. SQOOP
      - SQOOP was created in 2009 by Aaron Kimball as a means of moving data between SQL databases and Hadoop
      - It provided a generic implementation for moving data
      - It also provided a framework for implementing database-specific optimized connectors
  • 12. How SQOOP works (import)
      [Diagram labels: RDBMS (table metadata, table data); SQOOP; Table.java; Hive DDL; Hive table; map tasks, each with DataDrivenDBInputFormat and FileOutputFormat; HDFS files; HDFS]
  • 14. SQOOP issues with Oracle
      - SQOOP uses primary key ranges to divide up data between mappers
      - However, deletes tend to hit older key values harder, making key ranges unbalanced
      - Data is almost never arranged on disk in key order, so index scans collide on disk
      - Load is unbalanced, and IO block requests >> blocks in the table
      [Diagram: two mappers, each with its own Oracle session, range-scan index blocks of the Oracle table on disk; one takes ID > 0 and ID < MAX/2, the other ID > MAX/2]
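The imbalance described on slide 14 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the table contents, the delete pattern, and the helper names `key_range_splits` and `rows_per_split` are hypothetical, and the equal-width splitting is a simplification of what Sqoop's generic import does, not its actual code.

```python
def key_range_splits(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into contiguous, equal-width key ranges."""
    width = (max_id - min_id + 1) / num_mappers
    bounds = [min_id + round(i * width) for i in range(num_mappers)]
    bounds.append(max_id + 1)  # exclusive upper bound of the last range
    return [(bounds[i], bounds[i + 1]) for i in range(num_mappers)]


def rows_per_split(ids, splits):
    """Count how many surviving rows fall into each [lo, hi) key range."""
    return [sum(1 for x in ids if lo <= x < hi) for lo, hi in splits]


# Hypothetical table: IDs 1..1000, where 90% of the older half was deleted.
ids = [i for i in range(1, 1001) if i > 500 or i % 10 == 0]

splits = key_range_splits(1, 1000, 4)
counts = rows_per_split(ids, splits)
print(splits)
print(counts)
```

With this made-up delete pattern, the two mappers assigned the older key ranges receive far fewer rows than the two assigned the newer ranges, even though every range has the same width.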
  • 15. Other problems
      - Oracle might run each mapper using a full scan, clobbering the database
      - Oracle might run each mapper in parallel, clobbering the database
      - Sqoop may clobber the database cache
      [Charts: elapsed time (s) vs. number of mappers; database load as database time (s) vs. number of mappers]
  • 16. High speed connector design
      - Partition data based on physical storage
      - Bypass Oracle buffering
      - Bypass Oracle parallelism
      - Do not require or use indexes
      - Never read the same data block more than once
      - Support Oracle datatypes
      [Diagram: Oracle table, three Oracle sessions each paired with a Hadoop mapper, HDFS]
  • 17. Imports (Oracle -> Hadoop)
      - Uses Oracle block/extent map to equally divide IO
      - Uses Oracle direct path (non-buffered) IO for all reads
      - Round-robin, sequential or random allocation
      - All mappers get an equal number of blocks and no block is read twice
      - If table is partitioned, each mapper can work on a separate partition, which results in partitioned output
      [Diagram: Oracle table, three Oracle sessions each paired with a Hadoop mapper, HDFS]
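The block-based allocation on slide 17 can be sketched as follows. This is a simplified illustration, not the connector's implementation: the extent map here is a hypothetical list of (start_block, block_count) pairs, whereas the real connector reads Oracle's extent metadata, and the function names are invented for the example.

```python
def chunk_extents(extents, chunk_blocks):
    """Cut each extent (start_block, n_blocks) into chunks of at most chunk_blocks."""
    chunks = []
    for start, n in extents:
        for offset in range(0, n, chunk_blocks):
            chunks.append((start + offset, min(chunk_blocks, n - offset)))
    return chunks


def allocate_round_robin(chunks, num_mappers):
    """Hand chunks to mappers in turn so each mapper gets a similar block count."""
    assignment = [[] for _ in range(num_mappers)]
    for i, chunk in enumerate(chunks):
        assignment[i % num_mappers].append(chunk)
    return assignment


# Hypothetical extent map: three extents of 64, 64 and 128 blocks.
extents = [(0, 64), (128, 64), (512, 128)]

chunks = chunk_extents(extents, 16)
assignment = allocate_round_robin(chunks, 4)
blocks_per_mapper = [sum(n for _, n in mapper) for mapper in assignment]
print(blocks_per_mapper)
```

Because the work is divided by physical blocks rather than key values, the per-mapper load does not depend on how rows are distributed across the primary key, and each block appears in exactly one mapper's list.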
  • 18. Exports (Hadoop -> Oracle)
      - Optionally leverages Oracle partitions and temporary tables for parallel writes
      - Performs MERGE into Oracle table (updates existing rows, inserts new rows)
      - Optionally uses Oracle NOLOGGING (faster but unrecoverable)
      [Diagram: Oracle table, three Oracle sessions each paired with a Hadoop mapper, HDFS]
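The MERGE behaviour mentioned on slide 18 (update rows that already exist, insert the rest) can be sketched in plain Python on lists of dicts. This only illustrates the semantics; it is not Oracle's MERGE statement or the connector's code, and `merge_rows` and the sample rows are hypothetical.

```python
def merge_rows(target, incoming, key="id"):
    """Return the merged table plus counts of updated and inserted rows."""
    by_key = {row[key]: dict(row) for row in target}
    updated = inserted = 0
    for row in incoming:
        if row[key] in by_key:
            by_key[row[key]].update(row)  # existing key: overwrite columns
            updated += 1
        else:
            by_key[row[key]] = dict(row)  # new key: insert the row
            inserted += 1
    merged = sorted(by_key.values(), key=lambda r: r[key])
    return merged, updated, inserted


# Hypothetical data: one existing row changes, one new row arrives.
target = [{"id": 1, "city": "Sydney"}, {"id": 2, "city": "Perth"}]
incoming = [{"id": 2, "city": "Melbourne"}, {"id": 3, "city": "Hobart"}]

merged, updated, inserted = merge_rows(target, incoming)
print(merged, updated, inserted)
```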
  • 19. Import: Oracle to Hadoop
      - When data is unclustered (randomly distributed by PK), old SQOOP scales poorly
      - Clustered data shows better scalability but is still much slower than the direct approach
      - New SQOOP is typically 5-20 times faster
      - Limiting factors we have seen:
          - Data IO bandwidth, or
          - Network out of DB, or
          - Hadoop CPU
      [Chart: elapsed time (s) vs. number of mappers; series: direct=false with unclustered data, direct=false with clustered data, direct=true]
  • 20. Import: Database overhead
      - As you increase mappers in old Sqoop, database load increases rapidly (sometimes non-linearly)
      - In new Sqoop, queuing occurs only after IO bandwidth is exceeded
      [Chart: DB time (minutes) vs. number of mappers; series: Sqoop, Direct]
  • 21. Export: Hadoop to Oracle
      - On export, old SQOOP would hit the database writer bottleneck early on and fail to parallelize
      - New SQOOP uses partitioning and direct path inserts
      - Typically bottlenecks on write IO on the Oracle side
      [Chart: elapsed time (minutes) vs. number of mappers; series: Sqoop, Direct]
  • 22. Reduction in database load
      - 45% reduction in DB CPU
      - 83% reduction in elapsed time
      - 90% reduction in total database time
      - 99.9% reduction in database IO
      8 node Hadoop cluster, 1B rows, 310GB
      [Bar chart: % reduction by category (IO time, IO requests, DB time, elapsed time, CPU time); plotted values: 55.31, 83.45, 90.59, 99.98, 99.28]
  • 23. Replication
      - No matter how fast we make SQOOP, it’s a drag to have to run a SQOOP job before every Hadoop job
      - Replicating data into Hadoop cuts down on SQOOP overhead on both sides and avoids stale data
      Shareplex® for Oracle and Hadoop
  • 24. Sqoop 1.4.5 Summary
      Sqoop 1.4.5 without --direct | Sqoop 1.4.5 with --direct
      Minimal privileges required | Access to DBA views required
      Works on most object types, e.g. IOT | 5x-20x faster performance on tables
      Favors Sqoop terminology | Favors Oracle terminology
      Database load increases non-linearly | Up to 99% reduction in database IO
  • 26. Sqoop 1 Import Architecture
      sqoop import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities
  • 27. Sqoop 1 Export Architecture
      sqoop export --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities --export-dir /temp/cities
  • 28. Sqoop 1 Challenges
      - Concerns with usability: cryptic, contextual command line arguments
      - Concerns with security: client access to Hadoop bin/config, DB
      - Concerns with extensibility: connectors tightly coupled with data format
  • 29. Sqoop 2 Design Goals
      - Ease of use: REST API and Java API
      - Ease of security: separation of responsibilities
      - Ease of extensibility: Connector SDK, focus on pluggability
  • 30. Ease of Use (Sqoop 1 vs. Sqoop 2)
      Sqoop 1:
      sqoop import -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" -Ddfs.replication=1 -Dmapred.map.tasks.speculative.execution=false --num-mappers 4 --hive-import --hive-table CUSTOMERS --create-hive-table --connect jdbc:oracle:thin:@//localhost:1521/g12c --username OPSG --password opsg --table OPSG.CUSTOMERS --target-dir CUSTOMERS.CUSTOMERS
  • 31. Ease of Security (Sqoop 1 vs. Sqoop 2)
      Sqoop 1:
      sqoop import -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" -Ddfs.replication=1 -Dmapred.map.tasks.speculative.execution=false --num-mappers 4 --hive-import --hive-table CUSTOMERS --create-hive-table --connect jdbc:oracle:thin:@//localhost:1521/g12c --username OPSG --password opsg --table OPSG.CUSTOMERS --target-dir CUSTOMERS.CUSTOMERS
      Sqoop 2:
      - Role-based access to connection objects
      - Prevents misuse and abuse
      - Administrators create, edit, delete
      - Operators use
  • 32. Ease of Extensibility (Sqoop 1 vs. Sqoop 2)
      Sqoop 1:
      - Tight coupling
      Sqoop 2:
      - Connectors fetch and store data from db
      - Framework handles serialization, format conversion, integration
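The division of labour on slide 32 (connectors only fetch and store rows, the framework handles serialization) can be sketched roughly as below. This is not the Sqoop 2 Connector SDK: the `Connector` protocol, `InMemoryConnector`, and `transfer` are hypothetical names, and JSON stands in for whatever intermediate format a real framework would use.

```python
import json
from typing import Iterable, Iterator, Protocol


class Connector(Protocol):
    """Hypothetical connector contract: only fetch rows and store rows."""

    def extract(self) -> Iterator[dict]: ...

    def load(self, rows: Iterable[dict]) -> int: ...


class InMemoryConnector:
    """Stand-in for a database connector, backed by a Python list."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def extract(self):
        yield from self.rows

    def load(self, rows):
        before = len(self.rows)
        self.rows.extend(rows)
        return len(self.rows) - before


def transfer(source, target, serialize=json.dumps, deserialize=json.loads):
    """Framework side: move rows through a serialized intermediate format."""
    wire = (serialize(row) for row in source.extract())
    return target.load(deserialize(item) for item in wire)


src = InMemoryConnector([{"id": 1, "city": "Palo Alto"}, {"id": 2, "city": "Brno"}])
dst = InMemoryConnector()
moved = transfer(src, dst)
print(moved, dst.rows)
```

Neither connector knows about the wire format in this sketch, so swapping JSON for another serializer would only change the arguments passed to `transfer`.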
  • 33. Takeaway
      - Apache Sqoop: bulk data transfer tool between external structured datastores and Hadoop
      - Sqoop 1.4.5 now has a --direct parameter option for Oracle: 5x-20x performance improvement on Oracle table imports
      - Sqoop 2: ease of use, security, extensibility
  • 34. Questions?
      Guy Harrison @guyharrison, David Robson @DavidR021, Kate Ting @kate_ting
      Visit Dell at Booth #102
      Visit Cloudera at Booth #305
      Book Signing: Today @ 3:15pm
      Office Hours: Tomorrow @ 11am

Editor's Notes

  1. When you think about Dell you probably think about laptops
  2. Or servers that might run databases or a Hadoop cluster, but you probably don't think of Dell as having expertise in either Oracle or Hadoop
  3. But actually Dell now has a billion-dollar software arm, which includes the world's number one independent database tool, Toad, used by millions of users and supporting almost every data platform
  4. Guy to improve diagram