SQOOP
 SQOOP is a data ingestion tool.
 SQOOP is designed to transfer data between HDFS and RDBMSs such as MySQL, Oracle, etc.
 It can also export data from HDFS back to an RDBMS.
 Simple to use: the user specifies the “what” and leaves the “how” to the underlying processing engine.
 Rapid development.
 No Java coding is required.
 Originally developed by Cloudera.
Why SQOOP
• Data is already available in RDBMSs worldwide.
• Nightly processing has run on RDBMSs for years.
• Certain data needs to be moved from the RDBMS to Hadoop for processing.
• Transferring that data with hand-written scripts is inefficient and time-consuming.
• Traditional databases already have reporting and data visualization applications configured.
SQOOP Under the hood
• The dataset being transferred is sliced up into partitions.
• A map-only job is launched, with individual mappers responsible for transferring a slice of the dataset.
• Each record is handled in a type-safe manner, since SQOOP uses the database metadata to infer the data types.
How SQOOP Import works
• Step-1
 SQOOP introspects the database to gather the necessary metadata for the data being imported.
• Step-2
 SQOOP submits a map-only Hadoop job to the cluster, which performs the data transfer using the metadata captured in Step 1.
• The imported data is saved in an HDFS directory named after the table being imported.
• By default these files contain comma-delimited fields, with newlines separating records.
 The user can override this format by specifying the field separator and record terminator characters, as sketched below.
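• A minimal sketch of overriding the default delimiters on import (the connect string, table, and credentials reuse the sakila examples from the practice session; the target directory and the tab/newline delimiter choice are only illustrations):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --fields-terminated-by '\t' --lines-terminated-by '\n' --target-dir '/user/cloudera/test/film_tsv' -m 1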
How SQOOP Export works
• Step-1
 SQOOP introspects the database to gather the necessary metadata for the data being exported.
• Step-2
 Transfer the data:
 SQOOP divides the input dataset into splits.
 Sqoop uses individual map tasks to push the splits to the database.
 Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
The target table must already exist in the database. Sqoop performs a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows that violate constraints in the database (for example, a particular primary key value already exists), the export fails.
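• A minimal export sketch, shown here next to the process it illustrates (the same command appears again in the practice session; the test table in the sakila database is assumed to exist already, with columns matching the files under the export directory):
 sqoop export --connect jdbc:mysql://192.168.45.1:3306/sakila --table test --username sqoop --password sqoop --export-dir /user/cloudera/actor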
Importing Data into Hive
• --hive-import
Appending the above option to a SQOOP import command makes SQOOP populate the Hive metastore with the appropriate metadata for the table and invoke the necessary commands to load the table or partition.
• During a Hive import, SQOOP converts the data from the native data types of the external datastore into the corresponding Hive types.
• SQOOP automatically chooses the native delimiter set used by Hive.
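• A minimal Hive import sketch (the connect string and table reuse the practice-session examples; --hive-table is a real, optional Sqoop flag shown here only to illustrate overriding the Hive table name, and the name used is an assumption):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table language --username sqoop --password sqoop -m 1 --hive-import --hive-table sakila_language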
Importing Data into HBase
• SQOOP can populate data into a specific column family of an HBase table.
• The HBase table and column family settings are required in order to import data into HBase.
• Data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.
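• A minimal HBase import sketch (this mirrors the practice-session example; --hbase-create-table is a real Sqoop flag that asks Sqoop to create the table and column family if they do not already exist, added here only as an illustration):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --columns 'actor_id,first_name,last_name' --hbase-table ActorInfo --column-family ActorName --hbase-row-key actor_id --hbase-create-table -m 1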
Connecting to a Database Server
• The connect string is similar to a URL, and is communicated to
Sqoop with the --connect argument.
• This describes the server and database to connect to; it may also
specify the port.
• You can use the --username and --password or -P parameters to
supply a username and a password to the database.
• For example:
• sqoop import --connect jdbc:mysql://IPAddress:port/DBName --table tableName --username sqoop --password sqoop
Controlling Parallelism
• Sqoop imports data in parallel from most database sources. You can
specify the number of map tasks (parallel processes) to use to
perform the import by using the -m or --num-mappers argument.
• NOTE: Do not increase the degree of parallelism beyond what your database can reasonably support. For example, connecting 100 concurrent clients to your database may increase the load on the database server to the point where performance suffers.
• Sqoop uses a splitting column to split the workload. By default,
Sqoop will identify the primary key column (if present) in a table
and use it as the splitting column.
How Parallel import works
• The low and high values for the splitting column are retrieved from
the database, and the map tasks operate on evenly-sized
components of the total range. By default, four tasks are used. For
example, if you had a table with a primary key column of id whose
minimum value was 0 and maximum value was 1000, and Sqoop
was directed to use 4 tasks, Sqoop would run four processes which
each execute SQL statements of the form SELECT * FROM
sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250),
(250, 500), (500, 750), and (750, 1001) in the different tasks.
• NOTE: Sqoop cannot currently split on a multi-column primary key. If your table has no index column, or has a multi-column key, then you must manually choose a splitting column with --split-by, as sketched below.
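• A minimal sketch of choosing the splitting column manually (film_actor is assumed here to be a sakila table with a composite primary key, so a single splitting column must be supplied; the target directory is only an illustration):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film_actor --username sqoop --password sqoop --split-by actor_id --target-dir /user/cloudera/test/film_actor -m 4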
Incremental Imports
• Sqoop provides an incremental import mode which can be used to
retrieve only rows newer than some previously-imported set of rows.
• Sqoop supports two types of incremental imports:
• 1) append and 2) lastmodified.
• 1) You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
• 2) lastmodified mode should be used when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the one specified with --last-value are imported (see the sketch below).
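• A minimal lastmodified sketch (this assumes the actor table has a last_update timestamp column, as in the sample sakila schema; the cutoff timestamp and target directory are only illustrations):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --check-column last_update --incremental lastmodified --last-value '2006-02-15 00:00:00' --target-dir /user/cloudera/test/actor_updates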
Install SQOOP
• To install SQOOP:
• Download sqoop-*.tar.gz
• tar -xvf sqoop-*.*.tar.gz
• export HADOOP_HOME=/some/path/hadoop-dir
• Add the vendor-specific JDBC jar to $SQOOP_HOME/lib
• Change to the Sqoop bin folder
• ./sqoop help
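• A worked sketch of these steps as shell commands (the version numbers, install path, and MySQL connector jar name are assumptions; substitute whatever you downloaded):
 tar -xvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
 export SQOOP_HOME=/opt/sqoop-1.4.7.bin__hadoop-2.6.0
 export HADOOP_HOME=/opt/hadoop
 cp mysql-connector-java-5.1.49.jar $SQOOP_HOME/lib/
 $SQOOP_HOME/bin/sqoop help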
Practice Session
SQOOP Commands
• A basic import of a table
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop
• Load sample data to a target directory
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --target-dir '/user/cloudera/test/film' -m 1
• Load sample data with output directory and package
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --package-name org.sandeep.sample --outdir '/home/cloudera/sandeep/test1' --target-dir '/user/cloudera/test/film' -m 1
• Controlling the import parallelism (8 parallel tasks):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --target-dir '/user/cloudera/test/film' --split-by film_id -m 8
SQOOP Commands
• Incremental import (append mode), saving the output in tab-separated format
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --check-column actor_id --incremental append --last-value 180 --target-dir /user/cloudera/test/film3 --fields-terminated-by '\t'
• Selecting specific columns from the actor table
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --columns 'actor_id,first_name,last_name' --target-dir /user/cloudera/test/actor1
• Using a free-form query to import with a condition
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --query 'select * from film where film_id < 91 and $CONDITIONS' --username sqoop --password sqoop --target-dir '/user/cloudera/test/film2' --split-by film_id -m 2
SQOOP Commands
• Storing data in SequenceFiles
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --as-sequencefile --target-dir /user/cloudera/test/f
• Importing data to Hive:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table language --username sqoop --password sqoop -m 1 --hive-import
• Importing only the schema to a Hive table
 sqoop create-hive-table --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --fields-terminated-by ','
• Importing data to HBase:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --columns 'actor_id,first_name,last_name' --hbase-table ActorInfo --column-family ActorName --hbase-row-key actor_id -m 1
SQOOP Commands
• Import all tables
 sqoop import-all-tables --connect jdbc:mysql://192.168.45.1:3306/sakila --username sqoop --password sqoop
• SQOOP EXPORT
 sqoop export --connect jdbc:mysql://192.168.45.1:3306/sakila --table test --username sqoop --password sqoop --export-dir /user/cloudera/actor
• SQOOP Version:
 $ sqoop version
• List tables present in a database
 sqoop list-tables --connect jdbc:mysql://192.168.45.1:3306/sakila --username sqoop --password sqoop
SQOOP JOBS
Creating saved jobs is done with the --create action. This operation
requires a -- followed by a tool name and its arguments. The tool and
its arguments will form the basis of the saved job.
• Step-1 (Create a job)
 sqoop job --create myjob -- export --connect jdbc:mysql://192.168.45.1:3306/sakila --table test --username sqoop --password sqoop --export-dir /user/cloudera/actor
• Step-2 (View the list of available jobs)
 sqoop job --list
• Step-3 (Verify the job details)
 sqoop job --show myjob
• Step-4 (Execute the job)
 sqoop job --exec myjob
Saved jobs and passwords
• Sqoop does not store passwords in the metastore as it is
not a secure resource.
• Hence, if you create a job that requires a password, you will be prompted for that password each time you execute the job.
• You can enable storing passwords in the metastore by setting sqoop.metastore.client.record.password to true in the configuration (see the sketch below).
• Note: set sqoop.metastore.client.record.password to true if
you are executing saved jobs via Oozie because Sqoop
cannot prompt the user to enter passwords while being
executed as Oozie tasks.
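• A minimal sketch of enabling this setting (the property is typically placed in Sqoop's conf/sqoop-site.xml; the exact file location depends on your installation):
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>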
Sqoop-eval
• The eval tool allows users to quickly run simple SQL
queries against a database; results are printed to the
console. This allows users to preview their import
queries to ensure they import the data they expect.
• sqoop eval --connect jdbc:mysql://192.168.45.1:3306/sakila --query 'select * from film limit 10' --username sqoop --password sqoop
• sqoop eval --connect jdbc:mysql://192.168.45.1:3306/sakila --query "insert into test values(200,'test','test','2006-01-01 00:00:00')" --username sqoop --password sqoop
Sqoop-codegen
• The codegen tool generates Java code; it does not perform a full import.
• The tool can be used to regenerate the code if the generated Java source file is lost.
• sqoop codegen --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop
Thank You
• Questions?
• Feedback?
Write to me: explorehadoop@gmail.com