sqoop
Automatic database import



Aaron Kimball
Cloudera Inc.
October 2, 2009
The problem
Structured data in traditional databases cannot be easily
combined with unstructured data stored in HDFS




                Where’s the bridge?
Sqoop = SQL-to-Hadoop
▪   Easy import of data from many databases to HDFS
▪   Generates code for use in MapReduce applications
▪   Integrates with Hive




               [Diagram: mysql feeds sqoop, which loads Hadoop; custom
                MapReduce programs reinterpret the data via
                auto-generated datatype definitions]
In this talk…
▪   Features and overview
▪   How it works
▪   Working with imported data
▪   The extension API
▪   Optimizations
▪   Future directions
Key Features
▪   JDBC-based implementation
    ▪   Works with many popular database vendors
▪   Auto-generation of tedious user-side code
    ▪   Write MapReduce applications to work with your data, faster
▪   Integration with Hive
    ▪   Allows you to stay in a SQL-based environment
▪   Extensible backend
    ▪   Database-specific code paths for better performance
Example Input
mysql> use corp;
Database changed

mysql> describe employees;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| firstname  | varchar(32) | YES  |     | NULL    |                |
| lastname   | varchar(32) | YES  |     | NULL    |                |
| jobtitle   | varchar(64) | YES  |     | NULL    |                |
| start_date | date        | YES  |     | NULL    |                |
| dept_id    | int(11)     | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+
Loading into HDFS

$ sqoop --connect jdbc:mysql://db.foo.com/corp \
        --table employees




▪   Imports “employees” table into HDFS directory
    ▪   Data imported as text or SequenceFiles
    ▪   Optionally compress and split data during import
▪   Generates employees.java for your use
Auto-generated Class

public class employees {
  public Integer get_id();
  public String get_firstname();
  public String get_lastname();
  public String get_jobtitle();
  public java.sql.Date get_start_date();
  public Integer get_dept_id();
  // and serialization methods for Hadoop
}
Hive Integration
 Imports the table definition into Hive after the data has been imported to HDFS
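
For example, a sketch that mirrors the earlier import command (the --hive-import flag is the spelling used by later Sqoop releases; treat it as an assumption for this early version):

$ sqoop --connect jdbc:mysql://db.foo.com/corp \
        --table employees --hive-import

Roughly, Sqoop then generates a CREATE TABLE statement matching the imported columns and loads the HDFS files into the new Hive table.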
Additional Options
▪   Multiple data representations supported
    ▪   TextFile – ubiquitous; easy import into Hive
    ▪   SequenceFile – supports compression, higher performance
▪   Supports local and remote Hadoop clusters, databases
▪   Can select a subset of columns, specify a WHERE clause
▪   Control for delimiters and quote characters:
    ▪   --fields-terminated-by, --lines-terminated-by,
        --optionally-enclosed-by, etc.
    ▪   Also supports delimiter conversion
        (--input-fields-terminated-by, etc.)
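
Putting several of these options together (a sketch; --columns and --where are the flag spellings from later Sqoop releases and are assumptions here):

$ sqoop --connect jdbc:mysql://db.foo.com/corp \
        --table employees \
        --columns id,firstname,dept_id \
        --where "start_date > '2009-01-01'" \
        --fields-terminated-by '\t'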
Under the Hood…
▪   JDBC
    ▪   Allows Java applications to submit SQL queries to databases
    ▪   Provides metadata about databases (column names, types, etc.)
▪   Hadoop
    ▪   Allows input from arbitrary sources via different InputFormats
    ▪   Provides multiple JDBC-based InputFormats to read from
        databases
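
As a minimal illustration of the JDBC metadata Sqoop consumes, the sketch below lists a table's columns and SQL types using standard JDBC (DatabaseMetaData.getColumns); the URL and table name are the hypothetical ones from the earlier example, and this is not Sqoop's actual code:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ListColumns {
  public static void main(String[] args) throws Exception {
    // Hypothetical URL from the earlier example; any JDBC database works.
    Connection conn =
        DriverManager.getConnection("jdbc:mysql://db.foo.com/corp");
    DatabaseMetaData meta = conn.getMetaData();
    // Column metadata for the "employees" table: the same information
    // Sqoop uses to choose Java types for its generated class.
    ResultSet cols = meta.getColumns(null, null, "employees", null);
    while (cols.next()) {
      // DATA_TYPE is a java.sql.Types constant (e.g. Types.VARCHAR).
      System.out.println(cols.getString("COLUMN_NAME")
          + "\t" + cols.getInt("DATA_TYPE"));
    }
    conn.close();
  }
}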
InputFormat Woes
▪   DBInputFormat allows database records to be used as mapper
    inputs
▪   The trouble with this is:
    ▪   Connecting an entire Hadoop cluster to a database is a
        performance nightmare
    ▪   Databases have lower read bandwidth than HDFS; for repeated
        analyses, much better to make a copy in HDFS first
    ▪   Users must write a class that describes a record from each table
        they want to import or work with (a “DBWritable”)
DBWritable Example
import java.io.*;   // DataInput, DataOutput, IOException
import java.sql.*;  // ResultSet, PreparedStatement, SQLException
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

class MyRecord implements Writable, DBWritable {
  long msg_id;
  String msg;
  // DBWritable: populate fields from a JDBC result set
  public void readFields(ResultSet resultSet) throws SQLException {
    this.msg_id = resultSet.getLong(1);
    this.msg = resultSet.getString(2);
  }
  // Writable: deserialize fields from Hadoop's binary format
  public void readFields(DataInput in) throws IOException {
    this.msg_id = in.readLong();
    this.msg = in.readUTF();
  }
  // Writable also requires write(); not shown on the original slide
  public void write(DataOutput out) throws IOException {
    out.writeLong(this.msg_id);
    out.writeUTF(this.msg);
  }
  // DBWritable also requires write(); not shown on the original slide
  public void write(PreparedStatement stmt) throws SQLException {
    stmt.setLong(1, this.msg_id);
    stmt.setString(2, this.msg);
  }
}
A Direct Type Mapping
        JDBC Type                    Java Type
        CHAR                         String
        VARCHAR                      String
        LONGVARCHAR                  String
        NUMERIC                      java.math.BigDecimal
        DECIMAL                      java.math.BigDecimal
        BIT                          boolean
        TINYINT                      byte
        SMALLINT                     short
        INTEGER                      int
        BIGINT                       long
        REAL                         float
        FLOAT                        double
        DOUBLE                       double
        BINARY                       byte[]
        VARBINARY                    byte[]
        LONGVARBINARY                byte[]
        DATE                         java.sql.Date
        TIME                         java.sql.Time
        TIMESTAMP                    java.sql.Timestamp
        http://java.sun.com/j2se/1.3/docs/guide/jdbc/getstart/mapping.html
Class Autogeneration
Working with Sqoop
▪   Basic workflow:
    ▪   Import initial table with Sqoop
    ▪   Use auto-generated table class in MapReduce analyses
    ▪   … Or write Hive queries over imported tables
    ▪   Perform periodic re-imports to ingest new data


▪   Table classes can parse records from delimited files in HDFS
Processing Records in MapReduce
void map(LongWritable k, Text v, Context c) {
  MyRecord r = new MyRecord();
  r.parse(v);            // auto-generated text parser
  process(r.get_msg());  // your logic here
  ...
}
Extending Sqoop
Direct-mode Imports
▪   MySQL provides mysqldump for high-performance table output
    ▪   Sqoop special-cases jdbc:mysql:// for faster loading
▪   Similar mechanism used for PostgreSQL
▪   Avoids JDBC overhead




▪   Looking to do this for other databases as well; feedback welcome
Further Optimizations
▪   DBInputFormat uses LIMIT/OFFSET to generate data splits
    ▪   Requires ORDER BY clause in generated queries
    ▪   Most databases then run multiple full table scans in parallel
▪   DataDrivenDBInputFormat generates split ranges based on actual
    values in the index column
    ▪   Allows databases to use index range scans
    ▪   Considerably better parallel performance
    ▪   Available for use in Hadoop by Sqoop or other apps
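
To make the contrast concrete, here is the shape of query each strategy issues for one split (hypothetical table, column, and row counts; a sketch, not Sqoop's literal output):

-- LIMIT/OFFSET split: requires ORDER BY, and each task scans past its offset
SELECT id, msg FROM messages ORDER BY id LIMIT 25000 OFFSET 50000;

-- Data-driven split: a bounded range the database can serve with an index range scan
SELECT id, msg FROM messages WHERE id >= 50000 AND id < 75000;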
Future Directions
▪   “unsqoop” to export data from Hive/HDFS back to database
▪   Integration with Cloudera Desktop
▪   Continued performance optimization
▪   Support for Hive features: partition columns, buckets
▪   Integration with Avro
Conclusions
▪   Most database import tasks are “turning the crank”
▪   Sqoop can automate a lot of this
    ▪   Allows more efficient use of existing data sources in concert with
        new, unstructured data
▪   Available as part of Cloudera’s Distribution for Hadoop


                      www.cloudera.com/hadoop-sqoop
                         www.cloudera.com/hadoop


                            aaron@cloudera.com
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
