Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

(Enhancing Sqoop for Synchronization)
Shashank Tandon, Expedia
Kopal Niranjan, Expedia

Agenda
• Problem statement
• Why Sqoop
• Expedia Enhancements for Sqoop
• New Tool: Hive Merge
• Data Synchronization
• Demo

Data Synchronization

Problem Statement
• Import large volumes of data from an RDBMS into Hive
tables.
• Support multiple partitions on Hive while importing.
• Handle the regular updates happening on the RDBMS:
– Merge the new/updated data into the Hive tables.
– Merge the data in parallel.

Community Solution - Sqoop
• Sqoop is an open source tool designed to efficiently
transfer bulk data between Hadoop and structured data
stores such as relational databases.
• Supports various relational databases, such as Teradata, SQL
Server, Oracle, MySQL, and DB2.

Enhanced Sqoop Features
• Sqoop features enhanced to meet community and business needs.
- Hive Merge
  - Merges the incremental data migrated to HDFS into your
    existing Hive tables.
  - Supports merges based on composite keys.
  - Merges into older partitions and also adds new partitions.

Enhanced Sqoop Features
- Hive Dynamic Partition
- Hive Dynamic Partition with Partition Format
- Hive External Table
- Compression codecs such as Snappy

HCatalog for Hive
- HCatalog is a Java wrapper on top of the Hive metastore.
- Sqoop supports all the latest Hive features through HCatalog.

External tables with HCatalog

Sqoop Import to Hive Managed Table
• Sqoop connects to the MySQL database test.
• It imports the table MYTABLE into the Hive managed table test_part1.
• The Hive managed table is located under /apps/hive/warehouse.

New Enhancement: Import to Hive External Table
• The import command creates a Hive external table in the
user-managed directory /user/root/test_part2.

Dynamic Partitioning with HCatalog

Sqoop Import to Hive Static Partition
• Only one static partition can be passed as a Sqoop argument.

Sqoop Import to Hive Static Partition
• Check Hive Partition

Sqoop Import to Hive Static Partition on Date column
• Only one static partition can be passed as a Sqoop argument,
and its date value must be specified manually.
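
To make the limitation concrete, here is a minimal sketch that builds one Sqoop command line per partition value. The host name, the source column load_date, the partition key dt, and the dates are hypothetical; the flags used (--hive-import, --hive-table, --hive-partition-key, --hive-partition-value, --where) are standard Sqoop import arguments. The script only builds and prints the argument lists; it does not run Sqoop.

```python
import shlex
from datetime import date, timedelta


def static_partition_import(partition_value):
    """Build the argument list for one import into one static Hive partition."""
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/test",          # hypothetical host
        "--table", "MYTABLE",
        "--where", f"load_date = '{partition_value}'",    # hypothetical source column
        "--hive-import",
        "--hive-table", "test_part1",
        "--hive-partition-key", "dt",                     # hypothetical partition key
        "--hive-partition-value", partition_value,
    ]


# One separate job is needed for every partition value.
start = date(2015, 1, 1)
commands = [
    static_partition_import((start + timedelta(days=i)).isoformat())
    for i in range(3)
]
for cmd in commands:
    print(shlex.join(cmd))
```

With a few hundred partitions, this approach means a few hundred separate jobs, each with a manually supplied value.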

Questions
How do we import data if there are more than 200 partitions?
Do we have to run these jobs manually again and again?
How do we import data if the partition date format is month, day, or year?
Is there any way to pass the format?

New Enhancement: Import to Hive Dynamic Partition
• A new argument, --hcatalog-dynamic-partition-keys, is
passed to Sqoop.
• It works along with the existing static partition key.
• If both are passed, the static partition key takes
precedence.

New Enhancement: Import to Hive Dynamic Partition with Date Format
• A new argument, --hcatalog-dynamic-partition-key-format, is
passed together with --hcatalog-dynamic-partition-keys.
• Check the Hive partitions after the Sqoop import.
• The partitions created will be in the user-specified format.
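
The effect of a partition format can be sketched as follows. This is only an illustration of the idea: the slides do not show which format syntax the new argument accepts, so Python's strftime patterns stand in for it, and the column name booking_date and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def group_by_partition(rows, key, fmt):
    """Group rows by the partition value derived from a date column.

    fmt is a strftime pattern used as a stand-in for the partition format.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key].strftime(fmt)].append(row)
    return dict(partitions)


rows = [
    {"id": 1, "booking_date": date(2015, 1, 15)},
    {"id": 2, "booking_date": date(2015, 1, 28)},
    {"id": 3, "booking_date": date(2015, 2, 3)},
]

by_month = group_by_partition(rows, "booking_date", "%Y-%m")
by_year = group_by_partition(rows, "booking_date", "%Y")
print(sorted(by_month))  # monthly partition values
print(sorted(by_year))   # yearly partition values
```

The same rows land in two monthly partitions or a single yearly partition, depending only on the format that is passed.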

Password encrypted in Sqoop Metastore
• The password is now saved in the Sqoop metastore in
encrypted form.
• The logic is the same as for file encryption, where a generic
passkey and algorithm are passed on the command line.

Issues with Sqoop Merge Tool
• It is designed to merge two directories on HDFS and needs
modification to support merging of Hive tables.
• The output directory must be specified while performing the
merge.
• It supports merging based on a single column only.
• Merging many partitions requires a separate Sqoop job per
partition, run sequentially.

Merge Incremental Data using Sqoop and Hive External Table
• Import records from the base table to an HDFS directory.
• Import updates using incremental imports to another HDFS
directory.
• Create a Hive external table for each of the two directories.
• Create a view that combines the record sets from both the
Base (base_table) and Change (incremental_table) tables.

Merge Incremental Data using Sqoop and Hive External Table
• The view now contains the most up-to-date set of records.
• Generate a table from the view created in the above step.
• Replace the base table with the entries from the
generated table.
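
The reconciliation that the view performs can be sketched in a few lines. This is a minimal in-memory illustration, not the Hive view itself; it assumes each record carries a key column and a last-modified column, and the names id, city, and modified_date are hypothetical.

```python
def reconcile(base_rows, incremental_rows, key="id", modified="modified_date"):
    """Return the most recent version of each record across both record sets."""
    latest = {}
    # Incremental rows come second, so they win ties on the modified value.
    for row in base_rows + incremental_rows:
        current = latest.get(row[key])
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


base_table = [
    {"id": 1, "city": "Seattle", "modified_date": "2015-01-01"},
    {"id": 2, "city": "Chicago", "modified_date": "2015-01-01"},
]
incremental_table = [
    {"id": 2, "city": "Bellevue", "modified_date": "2015-02-01"},  # update
    {"id": 3, "city": "Austin", "modified_date": "2015-02-01"},    # new record
]

merged = reconcile(base_table, incremental_table)
for row in merged:
    print(row)
```

The result holds one row per key: unchanged records from the base set, updated records from the change set, and any new records.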

New Tool: Hive Merge
• Import original base table into Hive

New Tool: Hive Merge
• Import incremental data into Hive

New Tool: Hive Merge
• Finally, merge the data using the hive-merge tool.

Acquiring locks during Hive Merge
• To ensure that only one Hive merge runs on a given table at a
time, the tool acquires a lock at the start and releases it once
it finishes.
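
The acquire-at-start, release-at-finish pattern can be sketched as below. The slides do not say how the tool implements its lock, so an in-process threading.Lock per table is used purely as a stand-in; table_merge_lock is a hypothetical helper name.

```python
import threading
from contextlib import contextmanager

_table_locks = {}                    # one lock per table name
_registry_guard = threading.Lock()   # protects the dictionary above


@contextmanager
def table_merge_lock(table):
    """Hold the merge lock for a table, rejecting a concurrent merge."""
    with _registry_guard:
        lock = _table_locks.setdefault(table, threading.Lock())
    if not lock.acquire(blocking=False):
        raise RuntimeError(f"a merge is already running on {table}")
    try:
        yield
    finally:
        # Released even if the merge fails, so later merges are not blocked.
        lock.release()


with table_merge_lock("test_part1"):
    print("merging test_part1")
```

Releasing in a finally block matters here: a failed merge must not leave the table locked.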

Performance metrics : Hive Merge tool

Other Key Enhancements
• Save encrypted password in Sqoop Metastore
• Teradata varchar/char support
• Teradata current timestamp support
• Sqoop Job runs for Incremental Import
• Snappy compression support in HCatalog

Apache Sqoop Jiras
These are some of the JIRAs for which we have provided
patches:
• SQOOP-2332: Dynamic Partition in Sqoop HCatalog- if
Hive table does not exists & add support for Partition Date
Format
• SQOOP-2335: Support for Hive External Table in Sqoop -
HCatalog
• SQOOP-2585: Merging hive tables using sqoop
• SQOOP-2596: Precision of varchar/char column cannot be
retrieved from teradata database during sqoop import
• SQOOP-2801: Secure RDBMS password in Sqoop
Metastore in a encrypted form.
• SQOOP-2331: Snappy Compression Support in
Sqoop-HCatalog

Questions

Hive Merge Internal Architecture
Step 1: Identify the partitions to update. Skip this step for
non-partitioned tables.

Step 2: Merge the new partitions with the old partitions (only for
partitioned tables).

Step 3: Delete older versions.
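
The three steps can be sketched on in-memory data. This is an illustration under stated assumptions, not the tool's implementation: a table is modeled as a dictionary from partition value to rows, a row is assumed to stay in the same partition across versions, and the column names hotel_id, room, price, and month are hypothetical.

```python
def hive_merge(table, incremental, keys, partition_col):
    """Merge incremental rows into a partitioned table, keyed by a composite key."""
    # Step 1: identify the partitions touched by the incremental data.
    touched = {row[partition_col] for row in incremental}
    for part in touched:
        # Step 2: merge the new rows with the old rows of this partition.
        merged = {tuple(r[k] for k in keys): r for r in table.get(part, [])}
        for r in incremental:
            if r[partition_col] == part:
                # Step 3: the older version with the same key is dropped.
                merged[tuple(r[k] for k in keys)] = r
        table[part] = list(merged.values())
    return touched


table = {
    "2015-01": [
        {"hotel_id": 1, "room": "A", "price": 100, "month": "2015-01"},
        {"hotel_id": 1, "room": "B", "price": 120, "month": "2015-01"},
    ]
}
incremental = [
    {"hotel_id": 1, "room": "A", "price": 90, "month": "2015-01"},  # update
    {"hotel_id": 2, "room": "A", "price": 80, "month": "2015-02"},  # new partition
]

touched = hive_merge(table, incremental, keys=("hotel_id", "room"), partition_col="month")
print(sorted(touched))
```

Because each touched partition is processed independently, the per-partition work is a natural candidate for running in parallel.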
