SlideShare uma empresa Scribd logo
1 de 24
Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP Guy Harrison Director, R&D Melbourne www.guyharrison.net Guy.harrison@quest.com @guyharrison
Introductions
Agenda RDBMS-Hadoop interoperability scenarios Interoperability options Cloudera SQOOP Extending SQOOP  Quest OraOop extension for Cloudera SQOOP Performance comparisons  Lessons learned and best practices
Scenario #1: Reference data in RDBMS  RDBMS Customers HDFS Products WebLogs
Scenario #2: Hadoop for off-line analytics RDBMS Customers HDFS Products Sales History
Scenario #3: Hadoop for RDBMS archive RDBMS HDFS Sales 2010 Sales 2009  Sales 2008  Sales 2008
Scenario #4: MapReduce results to RDBMS  RDBMS HDFS WebLog Summary WebLogs
Options for RDBMS inter-op DBInputFormat: Allows database records to be used as mapper inputs. BUT: Not inherently scalable or efficient For repeated analysis, better to stage in Hadoop Tedious coding of DBWritable classes for each table SQOOP Open source utility provided by Cloudera Configurable command line interface to copy RDBMS->HDFS Support for Hive, Hbase Generates java classes for future M-R tasks Extensible to provide optimized adaptors for specific targets Bi-Directional
SQOOP Details   SQOOP import  Divide table into ranges using primary key max/min Create mappers for each range  Mappers write to multiple HDFS nodes Creates text or sequence files  Generates Java class for resulting HDFS file Generates Hive definition and auto-loads into HIVE SQOOP export Read files in HDFS directory via MapReduce Bulk parallel insert into database table
SQOOP details SQOOP features: Compatible with almost any JDBC enabled database Auto load into HIVE  Hbase support  Special handling for database LOBs Job management  Cluster configuration (jar file distribution) WHERE clause support Open source, and included in Cloudera distributions SQOOP fast paths & plug ins Invokemysqldump, mysqlimport for MySQL jobs  Similar fast paths for PostgreSQL Extensibility architecture for 3rd parties (like Quest ) Teradata, Netezza, etc.
Working with Oracle  SQOOP approach is generic and applicable to all RDBMS However for Oracle, sub-optimal in some respects: Oracle may parallelize and serialize individual mappers Oracle optimizer may decline to use index range scans Oracle physical storage often deliberately not in primary key order (reverse key indexes, hash partitioning, etc) Primary keys often not be evenly distributed Index range scans use single block random reads vs.faster multi-block table scans Index range scans load into Oracle buffer cache  Pollutes cache increasing IO for other users Limited help to SQOOP since rows are only read once  Luckily, SQOOP extensibility allows us to add optimizations for specific targets
Oracle – parallelism  Aggregate Sort Scan OracleSALEStable Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave Oracle Master(QC) Client (JDBC) Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave Oracle PQ  Slave SELECT cust_id, SUM (amount_sold) FROM sh.sales GROUP BY cust_id ORDER BY 2 DESC
Oracle – parallelism  HDFS OracleSALEStable Hadoop Mapper Hadoop Mapper Hadoop Mapper Hadoop Mapper
Oracle – parallelism  HDFS OracleSALEStable Hadoop Mapper Hadoop Mapper Hadoop Mapper Hadoop Mapper
Index range scans  Hadoop Mapper ID > 0 and ID < MAX/2 Hadoop Mapper ID > MAX/2 Oracle Session Oracle Session Index range scan Buffer cache Index range scan Index block Index block Index block Index block Index block Index block Oracle table
Ideal architecture  HDFS OracleSALEStable Oracle Session Hadoop Mapper Oracle Session Hadoop Mapper Oracle Session Hadoop Mapper Oracle Session Hadoop Mapper
Quest/Cloudera OraOop for SQOOP Design goals  Partition data based on physical storage By-pass Oracle buffering By-pass Oracle parallelism Do not require or use indexes Never read the same data block more than once Support esoteric datatypes (eventually)  Support RAC clusters  Availability: Freely available from www.quest.com/ora-oop Packaged with Cloudera Enterprise  Commercial support from Quest/Cloudera within Enterprise distribution
OraOop Throughput
Oracle overhead
Extending SQOOP  SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing: Extend ManagerFactory (what to handle) Extend ConnManager (DB connection and metadata) For imports: Extend DataDrivenDBInputFormat (gets the data) Data allocation (getSplits()) Split serialization (“io.serializations” property) Data access logic (createDBRecordReader(), getSelectQuery()) Implement progress (nextKeyValue(), getProgress()) Similar procedure for extending exports
SQOOP/OraOop best practices  Use sequence files for LOBs OR Set inline-lob-limit  Directly control datanodes for widest destination bandwidth Can’t rely on mapred.max.maps.per.node Set number of mappers realistically  Disable speculative execution (our default) Leads to duplicate DB reads  Set Oracle row fetch size extra high  Keeps the mappers streaming to HDFS
Conclusion RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop SQOOP extensions can provide optimizations for specific targets Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value  Try out OraOop for SQOOP!
Hadoop and rdbms with sqoop

Mais conteúdo relacionado

Mais procurados

Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use caseDavin Abraham
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLliuknag
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache SqoopAvkash Chauhan
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 

Mais procurados (20)

Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
 
Is hadoop for you
Is hadoop for youIs hadoop for you
Is hadoop for you
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 

Destaque

Apache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopApache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopCloudera, Inc.
 
Habits of Effective Sqoop Users
Habits of Effective Sqoop UsersHabits of Effective Sqoop Users
Habits of Effective Sqoop UsersKathleen Ting
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Cloudera, Inc.
 
Five database trends - updated April 2015
Five database trends - updated April 2015Five database trends - updated April 2015
Five database trends - updated April 2015Guy Harrison
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseCloudera, Inc.
 
Data Warehousing using Hadoop
Data Warehousing using HadoopData Warehousing using Hadoop
Data Warehousing using HadoopDataWorks Summit
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Co-existence or competition - RDBMS and Hadoop
Co-existence or competition  - RDBMS and HadoopCo-existence or competition  - RDBMS and Hadoop
Co-existence or competition - RDBMS and HadoopFlytxt
 
Extreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateExtreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateBobby Curtis
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine OverviewKunal Gupta
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Spark Summit
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
阿里自研数据库 Ocean base实践
阿里自研数据库 Ocean base实践阿里自研数据库 Ocean base实践
阿里自研数据库 Ocean base实践wuqiuping
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and OracleTanel Poder
 
Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks
 
Awkward Family Photos Slide Show
Awkward Family Photos   Slide ShowAwkward Family Photos   Slide Show
Awkward Family Photos Slide ShowRobinNicole621
 
Music Row\'s Newest Star
Music Row\'s Newest StarMusic Row\'s Newest Star
Music Row\'s Newest Starlizabetht
 

Destaque (20)

Apache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopApache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for Hadoop
 
Habits of Effective Sqoop Users
Habits of Effective Sqoop UsersHabits of Effective Sqoop Users
Habits of Effective Sqoop Users
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
 
Five database trends - updated April 2015
Five database trends - updated April 2015Five database trends - updated April 2015
Five database trends - updated April 2015
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
Data Warehousing using Hadoop
Data Warehousing using HadoopData Warehousing using Hadoop
Data Warehousing using Hadoop
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Co-existence or competition - RDBMS and Hadoop
Co-existence or competition  - RDBMS and HadoopCo-existence or competition  - RDBMS and Hadoop
Co-existence or competition - RDBMS and Hadoop
 
Advanced Sqoop
Advanced Sqoop Advanced Sqoop
Advanced Sqoop
 
Hld and lld
Hld and lldHld and lld
Hld and lld
 
Extreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateExtreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGate
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
阿里自研数据库 Ocean base实践
阿里自研数据库 Ocean base实践阿里自研数据库 Ocean base实践
阿里自研数据库 Ocean base实践
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration Hortonworks Oracle Big Data Integration
Hortonworks Oracle Big Data Integration
 
Awkward Family Photos Slide Show
Awkward Family Photos   Slide ShowAwkward Family Photos   Slide Show
Awkward Family Photos Slide Show
 
Music Row\'s Newest Star
Music Row\'s Newest StarMusic Row\'s Newest Star
Music Row\'s Newest Star
 

Semelhante a Hadoop and rdbms with sqoop

Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google CloudPgDay.Seoul
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?Srihari Srinivasan
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
NoSQL Options Compared
NoSQL Options ComparedNoSQL Options Compared
NoSQL Options ComparedSergey Bushik
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopMukund Babbar
 

Semelhante a Hadoop and rdbms with sqoop (20)

Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?It takes two to tango! : Is SQL-on-Hadoop the next big step?
It takes two to tango! : Is SQL-on-Hadoop the next big step?
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
NoSQL Options Compared
NoSQL Options ComparedNoSQL Options Compared
NoSQL Options Compared
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on Hadoop
 

Mais de Guy Harrison

Thriving and surviving the Big Data revolution
Thriving and surviving the Big Data revolutionThriving and surviving the Big Data revolution
Thriving and surviving the Big Data revolutionGuy Harrison
 
Mega trends in information management
Mega trends in information managementMega trends in information management
Mega trends in information managementGuy Harrison
 
Big datacamp2013 share
Big datacamp2013 shareBig datacamp2013 share
Big datacamp2013 shareGuy Harrison
 
Hadoop, Oracle and the big data revolution collaborate 2013
Hadoop, Oracle and the big data revolution collaborate 2013Hadoop, Oracle and the big data revolution collaborate 2013
Hadoop, Oracle and the big data revolution collaborate 2013Guy Harrison
 
Hadoop, oracle and the industrial revolution of data
Hadoop, oracle and the industrial revolution of data Hadoop, oracle and the industrial revolution of data
Hadoop, oracle and the industrial revolution of data Guy Harrison
 
Making the most of ssd in oracle11g
Making the most of ssd in oracle11gMaking the most of ssd in oracle11g
Making the most of ssd in oracle11gGuy Harrison
 
Oracle sql high performance tuning
Oracle sql high performance tuningOracle sql high performance tuning
Oracle sql high performance tuningGuy Harrison
 
Next generation databases july2010
Next generation databases july2010Next generation databases july2010
Next generation databases july2010Guy Harrison
 
Optimize oracle on VMware (April 2011)
Optimize oracle on VMware (April 2011)Optimize oracle on VMware (April 2011)
Optimize oracle on VMware (April 2011)Guy Harrison
 
Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014Guy Harrison
 
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Guy Harrison
 
High Performance Plsql
High Performance PlsqlHigh Performance Plsql
High Performance PlsqlGuy Harrison
 
Performance By Design
Performance By DesignPerformance By Design
Performance By DesignGuy Harrison
 
Optimize Oracle On VMware (Sep 2011)
Optimize Oracle On VMware (Sep 2011)Optimize Oracle On VMware (Sep 2011)
Optimize Oracle On VMware (Sep 2011)Guy Harrison
 
Thanks for the Memory
Thanks for the MemoryThanks for the Memory
Thanks for the MemoryGuy Harrison
 
Top 10 tips for Oracle performance
Top 10 tips for Oracle performanceTop 10 tips for Oracle performance
Top 10 tips for Oracle performanceGuy Harrison
 
How I learned to stop worrying and love Oracle
How I learned to stop worrying and love OracleHow I learned to stop worrying and love Oracle
How I learned to stop worrying and love OracleGuy Harrison
 
Performance By Design
Performance By DesignPerformance By Design
Performance By DesignGuy Harrison
 
High Performance Plsql
High Performance PlsqlHigh Performance Plsql
High Performance PlsqlGuy Harrison
 
Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)Guy Harrison
 

Mais de Guy Harrison (20)

Thriving and surviving the Big Data revolution
Thriving and surviving the Big Data revolutionThriving and surviving the Big Data revolution
Thriving and surviving the Big Data revolution
 
Mega trends in information management
Mega trends in information managementMega trends in information management
Mega trends in information management
 
Big datacamp2013 share
Big datacamp2013 shareBig datacamp2013 share
Big datacamp2013 share
 
Hadoop, Oracle and the big data revolution collaborate 2013
Hadoop, Oracle and the big data revolution collaborate 2013Hadoop, Oracle and the big data revolution collaborate 2013
Hadoop, Oracle and the big data revolution collaborate 2013
 
Hadoop, oracle and the industrial revolution of data
Hadoop, oracle and the industrial revolution of data Hadoop, oracle and the industrial revolution of data
Hadoop, oracle and the industrial revolution of data
 
Making the most of ssd in oracle11g
Making the most of ssd in oracle11gMaking the most of ssd in oracle11g
Making the most of ssd in oracle11g
 
Oracle sql high performance tuning
Oracle sql high performance tuningOracle sql high performance tuning
Oracle sql high performance tuning
 
Next generation databases july2010
Next generation databases july2010Next generation databases july2010
Next generation databases july2010
 
Optimize oracle on VMware (April 2011)
Optimize oracle on VMware (April 2011)Optimize oracle on VMware (April 2011)
Optimize oracle on VMware (April 2011)
 
Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014
 
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
 
High Performance Plsql
High Performance PlsqlHigh Performance Plsql
High Performance Plsql
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
 
Optimize Oracle On VMware (Sep 2011)
Optimize Oracle On VMware (Sep 2011)Optimize Oracle On VMware (Sep 2011)
Optimize Oracle On VMware (Sep 2011)
 
Thanks for the Memory
Thanks for the MemoryThanks for the Memory
Thanks for the Memory
 
Top 10 tips for Oracle performance
Top 10 tips for Oracle performanceTop 10 tips for Oracle performance
Top 10 tips for Oracle performance
 
How I learned to stop worrying and love Oracle
How I learned to stop worrying and love OracleHow I learned to stop worrying and love Oracle
How I learned to stop worrying and love Oracle
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
 
High Performance Plsql
High Performance PlsqlHigh Performance Plsql
High Performance Plsql
 
Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)
 

Último

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Último (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

Hadoop and rdbms with sqoop

  • 1. Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP Guy Harrison Director, R&D Melbourne www.guyharrison.net Guy.harrison@quest.com @guyharrison
  • 3.
  • 4. Agenda RDBMS-Hadoop interoperability scenarios Interoperability options Cloudera SQOOP Extending SQOOP Quest OraOop extension for Cloudera SQOOP Performance comparisons Lessons learned and best practices
  • 5. Scenario #1: Reference data in RDBMS RDBMS Customers HDFS Products WebLogs
  • 6. Scenario #2: Hadoop for off-line analytics RDBMS Customers HDFS Products Sales History
  • 7. Scenario #3: Hadoop for RDBMS archive RDBMS HDFS Sales 2010 Sales 2009 Sales 2008 Sales 2008
  • 8. Scenario #4: MapReduce results to RDBMS RDBMS HDFS WebLog Summary WebLogs
  • 9. Options for RDBMS inter-op DBInputFormat: Allows database records to be used as mapper inputs. BUT: Not inherently scalable or efficient For repeated analysis, better to stage in Hadoop Tedious coding of DBWritable classes for each table SQOOP Open source utility provided by Cloudera Configurable command line interface to copy RDBMS->HDFS Support for Hive, Hbase Generates java classes for future M-R tasks Extensible to provide optimized adaptors for specific targets Bi-Directional
  • 10. SQOOP Details SQOOP import Divide table into ranges using primary key max/min Create mappers for each range Mappers write to multiple HDFS nodes Creates text or sequence files Generates Java class for resulting HDFS file Generates Hive definition and auto-loads into HIVE SQOOP export Read files in HDFS directory via MapReduce Bulk parallel insert into database table
  • 11. SQOOP details SQOOP features: Compatible with almost any JDBC enabled database Auto load into HIVE Hbase support Special handling for database LOBs Job management Cluster configuration (jar file distribution) WHERE clause support Open source, and included in Cloudera distributions SQOOP fast paths & plug ins Invokemysqldump, mysqlimport for MySQL jobs Similar fast paths for PostgreSQL Extensibility architecture for 3rd parties (like Quest ) Teradata, Netezza, etc.
  • 12. Working with Oracle SQOOP approach is generic and applicable to all RDBMS However for Oracle, sub-optimal in some respects: Oracle may parallelize and serialize individual mappers Oracle optimizer may decline to use index range scans Oracle physical storage often deliberately not in primary key order (reverse key indexes, hash partitioning, etc) Primary keys often not be evenly distributed Index range scans use single block random reads vs.faster multi-block table scans Index range scans load into Oracle buffer cache Pollutes cache increasing IO for other users Limited help to SQOOP since rows are only read once Luckily, SQOOP extensibility allows us to add optimizations for specific targets
  • 13. Oracle – parallelism Aggregate Sort Scan OracleSALEStable Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave Oracle Master(QC) Client (JDBC) Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave Oracle PQ Slave SELECT cust_id, SUM (amount_sold) FROM sh.sales GROUP BY cust_id ORDER BY 2 DESC
  • 14. Oracle – parallelism HDFS OracleSALEStable Hadoop Mapper Hadoop Mapper Hadoop Mapper Hadoop Mapper
  • 15. Oracle – parallelism HDFS OracleSALEStable Hadoop Mapper Hadoop Mapper Hadoop Mapper Hadoop Mapper
  • 16. Index range scans Hadoop Mapper ID > 0 and ID < MAX/2 Hadoop Mapper ID > MAX/2 Oracle Session Oracle Session Index range scan Buffer cache Index range scan Index block Index block Index block Index block Index block Index block Oracle table
  • 17. Ideal architecture HDFS OracleSALEStable Oracle Session Hadoop Mapper Oracle Session Hadoop Mapper Oracle Session Hadoop Mapper Oracle Session Hadoop Mapper
  • 18. Quest/Cloudera OraOop for SQOOP Design goals Partition data based on physical storage By-pass Oracle buffering By-pass Oracle parallelism Do not require or use indexes Never read the same data block more than once Support esoteric datatypes (eventually) Support RAC clusters Availability: Freely available from www.quest.com/ora-oop Packaged with Cloudera Enterprise Commercial support from Quest/Cloudera within Enterprise distribution
  • 21. Extending SQOOP  SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing: Extend ManagerFactory (what to handle) Extend ConnManager (DB connection and metadata) For imports: Extend DataDrivenDBInputFormat (gets the data) Data allocation (getSplits()) Split serialization (“io.serializations” property) Data access logic (createDBRecordReader(), getSelectQuery()) Implement progress (nextKeyValue(), getProgress()) Similar procedure for extending exports
  • 22. SQOOP/OraOop best practices Use sequence files for LOBs OR Set inline-lob-limit Directly control datanodes for widest destination bandwidth Can’t rely on mapred.max.maps.per.node Set number of mappers realistically Disable speculative execution (our default) Leads to duplicate DB reads Set Oracle row fetch size extra high Keeps the mappers streaming to HDFS
  • 23. Conclusion RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop SQOOP extensions can provide optimizations for specific targets Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value Try out OraOop for SQOOP!