Sqoop 2 
Refactoring for generic data transfer 
Abraham Elmahrek
Cloudera Ingest!
Introduction to Sqoop 2 
• Ease of use: Provide a REST API and a Java API for easy integration. Existing clients include a Hue UI and a command-line client.
• Extensible: Provide a connector SDK and focus on pluggability. Existing connectors include the Generic JDBC connector and the HDFS connector.
• Security: Emphasize separation of responsibilities. Eventually have ACLs or RBAC.
Life of a Request 
• Client 
– Talks to server over REST + JSON 
– Does nothing but send requests 
• Server 
– Extracts metadata from data source 
– Delegates to execution engine 
– Does all the heavy lifting really 
• MapReduce 
– Parallelizes execution of the job
Workflow
Job Types 
IMPORT into Hadoop and EXPORT out of Hadoop
Responsibilities 
Transfer data from Connector A to Hadoop 
(Diagram legend: connector responsibilities; Sqoop framework responsibilities)
Connector Definitions 
• Connectors define: 
– How to connect to a data source 
– How to extract data from a data source 
– How to load data to a data source 
public Importer getImporter(); // Supply extract method 
public Exporter getExporter(); // Supply load method 
public Class getConnectionConfigurationClass(); 
public Class getJobConfigurationClass(MJob.Type type); // MJob.Type is IMPORT or EXPORT
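The four methods above can be sketched as a small, self-contained interface. This is only an illustration: `Connector`, `Importer`, `Exporter`, `JobType`, the config classes, and `InMemoryConnector` are hypothetical stand-ins defined here, not the Sqoop SDK types, and the in-memory list stands in for a real data source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Sqoop classes.
class ConnectorSketch {
    enum JobType { IMPORT, EXPORT }

    // Extract side: reads records out of a data source.
    interface Importer { List<String> extract(); }

    // Load side: writes records into a data source.
    interface Exporter { void load(List<String> records); }

    // Mirrors the shape of the four methods listed on the slide.
    interface Connector {
        Importer getImporter();
        Exporter getExporter();
        Class<?> getConnectionConfigurationClass();
        Class<?> getJobConfigurationClass(JobType type);
    }

    // Assumed configuration holders; the field names are invented.
    static class ConnectionConfig { String connectionString; }
    static class ImportJobConfig { String tableName; }
    static class ExportJobConfig { String tableName; }

    // A toy connector whose "data source" is an in-memory list.
    static class InMemoryConnector implements Connector {
        final List<String> store = new ArrayList<>(List.of("1,alice", "2,bob"));

        public Importer getImporter() { return () -> new ArrayList<>(store); }
        public Exporter getExporter() { return records -> store.addAll(records); }
        public Class<?> getConnectionConfigurationClass() { return ConnectionConfig.class; }
        public Class<?> getJobConfigurationClass(JobType type) {
            return type == JobType.IMPORT ? ImportJobConfig.class : ExportJobConfig.class;
        }
    }

    public static void main(String[] args) {
        InMemoryConnector connector = new InMemoryConnector();
        System.out.println(connector.getImporter().extract());
        connector.getExporter().load(List.of("3,carol"));
        System.out.println(connector.store.size());
    }
}
```

Returning a different configuration class per job type is what lets a framework ask the user for different settings on IMPORT and EXPORT.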
Intermediate Data Format 
• Describe a single record as it moves through Sqoop 
• Currently available 
– CSV 
col1,col2,col3,... 
col1,col2,col3,... 
...
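As a rough sketch of what a CSV intermediate record looks like in code, the class below encodes one record as a comma-separated line and decodes it back. The class and method names are hypothetical, and the sketch assumes fields contain no commas, quotes, or newlines; it does not reproduce the escaping rules of Sqoop's CSV format.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified CSV record codec; not the Sqoop implementation.
class CsvRecordSketch {
    // Joins the fields of one record into a single CSV line.
    static String encode(List<String> fields) {
        return String.join(",", fields);
    }

    // Splits a CSV line back into fields; limit -1 keeps trailing empty fields.
    static List<String> decode(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        String line = encode(List.of("1", "alice", "NY"));
        System.out.println(line);
        System.out.println(decode(line));
    }
}
```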
What’s Wrong w/ Current Implementation? 
• Treating Hadoop as a first-class citizen prevents transfers between 
components within the Hadoop ecosystem 
– HBase to HDFS not supported 
– HDFS to Accumulo not supported 
• Hadoop ecosystem not well defined 
– Accumulo was not considered part of Hadoop ecosystem 
– What’s next? Kafka?
Refactoring 
• Connectors already defined extractors and loaders 
– Refactor the connector SDK 
• Pull out HDFS integration to a connector 
• Improve Schema integration 
Transfer data from Connector A to Connector B
Connector SDK 
• Connectors assume all roles 
• Add Direction for FROM and TO 
• Initializers and destroyers for both directions 
(Diagram legend: connector responsibilities)
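The FROM/TO idea can be sketched as follows. The names `Direction`, `NamedConnector`, and `runJob` are hypothetical and only illustrate the lifecycle suggested by the slide: each connector is initialized and destroyed for the direction it plays in a given job.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-direction lifecycle hooks; not the Sqoop SDK.
class DirectionSketch {
    enum Direction { FROM, TO }

    // A connector that can act as either side of a transfer.
    static class NamedConnector {
        final String name;

        NamedConnector(String name) { this.name = name; }

        // Records the lifecycle event instead of touching a real data source.
        void initialize(Direction d, List<String> log) { log.add(name + ":init:" + d); }
        void destroy(Direction d, List<String> log) { log.add(name + ":destroy:" + d); }
    }

    // Runs the lifecycle for one job: initialize both sides, transfer, destroy both sides.
    static List<String> runJob(NamedConnector from, NamedConnector to) {
        List<String> log = new ArrayList<>();
        from.initialize(Direction.FROM, log);
        to.initialize(Direction.TO, log);
        log.add("transfer");
        from.destroy(Direction.FROM, log);
        to.destroy(Direction.TO, log);
        return log;
    }

    public static void main(String[] args) {
        NamedConnector jdbc = new NamedConnector("jdbc");
        NamedConnector hdfs = new NamedConnector("hdfs");
        System.out.println(runJob(jdbc, hdfs)); // jdbc is FROM, hdfs is TO
        System.out.println(runJob(hdfs, jdbc)); // same connectors, roles swapped
    }
}
```

Because the direction is an argument rather than part of the connector's type, the same connector can appear on either side of a job.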
HDFS Connector 
• Move Hadoop role to connector 
• Schemaless 
• Data formats 
– Text (CSV) 
– Sequence 
– etc.
Schema Improvements 
• Schema per connector 
• Intermediate data format (IDF) has a Schema 
• Introduce matcher 
• Schema represents data as it moves through the system
Matcher 
• Matcher ensures data goes to right place 
• Combinations 
– FROM and TO schema 
– FROM schema 
– TO schema 
– No schema = Error
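The four combinations can be sketched as a single resolution step. This is a guess at the intent of the slide, not the actual matcher code: it assumes that when only one side supplies a schema, that schema is used for both sides, and that having no schema at all is an error. `SchemaCombinationSketch`, `Resolved`, and `resolve` are hypothetical names, and a schema is reduced to a list of column names.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: decides which schema each side of a transfer uses.
class SchemaCombinationSketch {
    // Holds the schema chosen for each direction.
    static final class Resolved {
        final List<String> from;
        final List<String> to;

        Resolved(List<String> from, List<String> to) {
            this.from = from;
            this.to = to;
        }
    }

    static Resolved resolve(Optional<List<String>> from, Optional<List<String>> to) {
        if (from.isPresent() && to.isPresent()) {
            return new Resolved(from.get(), to.get()); // FROM and TO schema
        }
        if (from.isPresent()) {
            return new Resolved(from.get(), from.get()); // FROM schema only (assumed reuse)
        }
        if (to.isPresent()) {
            return new Resolved(to.get(), to.get()); // TO schema only (assumed reuse)
        }
        throw new IllegalStateException("No schema on either side of the transfer");
    }

    public static void main(String[] args) {
        Resolved r = resolve(Optional.of(List.of("id", "name")), Optional.empty());
        System.out.println(r.to);
    }
}
```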
Matcher 
• Matcher types: Location, Name, User defined 
• Location: ensure that the FROM schema matches the TO schema by index 
location in the schema 
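Matching by index location can be sketched as copying field i of the FROM record into column i of the TO record. The handling of mismatched lengths below (missing columns become null, extra fields are dropped) is an assumption made for this sketch, and `LocationMatcherSketch` and `matchByLocation` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch of index-based (location) matching; not the Sqoop matcher.
class LocationMatcherSketch {
    // Copies FROM fields into a TO-sized record, position by position.
    // Assumption: missing trailing columns become null, extra fields are dropped.
    static Object[] matchByLocation(Object[] fromRecord, int toColumnCount) {
        Object[] toRecord = new Object[toColumnCount];
        int shared = Math.min(fromRecord.length, toColumnCount);
        System.arraycopy(fromRecord, 0, toRecord, 0, shared);
        return toRecord;
    }

    public static void main(String[] args) {
        Object[] from = {1, "alice"};
        System.out.println(Arrays.toString(matchByLocation(from, 3)));
    }
}
```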
Check out http://ingest.tips for general ingest
Thank you

Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup
