Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
1. Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
(Enhancing Sqoop for Synchronization)
Shashank Tandon, Expedia
Kopal Niranjan, Expedia
2. Agenda
• Problem statement
• Why Sqoop
• Expedia Enhancements for Sqoop
• New Tool: Hive Merge
• Data Synchronization
• Demo
| Expedia Inc. Proprietary & Confidential1
3. Data Synchronization
4. Problem Statement
• Import large volumes of data from an RDBMS into Hive tables.
• Support multiple partitions on Hive while importing.
• Handle regular updates happening on the RDBMS.
– Merge the new/updated data into Hive tables.
– Merge the data in parallel.
5. Community Solution - Sqoop
• Sqoop is an open source tool designed to efficiently
transfer bulk data between Hadoop and structured data
stores such as relational databases.
• Supports various relational databases such as Teradata, SQL
Server, Oracle, MySQL, and DB2.
6. Enhanced Sqoop Features
• Enhanced Sqoop features for community and business needs.
- Hive Merge
- Merges the incremental data migrated to HDFS into your
existing Hive tables.
- Supports merge based on composite keys.
- Merges into older partitions and also adds new partitions.
7. Enhanced Sqoop Features
- Hive Dynamic Partition
- Hive Dynamic Partition with Partition Format
- Hive External Table
- Compression codecs such as Snappy
8. HCatalog for Hive
- HCatalog is a Java wrapper on top of the Hive metastore.
- Sqoop supports the latest Hive features through HCatalog.
10. Sqoop Import to Hive Managed Table
• Sqoop connects to the MySQL database test.
• It imports the table MYTABLE into the Hive managed table test_part1.
• The Hive managed table is located under /apps/hive/warehouse.
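The command on this slide was shown as an image and is not preserved in the text. As a rough sketch only, the Python snippet below assembles a Sqoop import invocation consistent with the bullets above. The host name dbhost and the helper build_import_command are hypothetical; --connect, --table, --hive-import, and --hive-table are standard Sqoop import arguments, but the exact arguments used in the talk are not known.

```python
def build_import_command(jdbc_url, source_table, hive_table):
    """Return the argv list for a basic Sqoop import into a Hive managed table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC URL of the source database
        "--table", source_table,     # relational table to read
        "--hive-import",             # load the imported data into Hive
        "--hive-table", hive_table,  # name of the target Hive table
    ]


# "dbhost" is a placeholder; the slides only name the database (test).
cmd = build_import_command("jdbc:mysql://dbhost/test", "MYTABLE", "test_part1")
print(" ".join(cmd))
```

Credentials are left out on purpose; the list can be handed to subprocess.run once they are added.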
12. New Enhancement: Import to Hive External Table
• With this enhancement, the import creates a Hive external table in
the user-managed directory /user/root/test_part2.
15. Sqoop Import to Hive Static Partition
• Only one static partition can be passed as a Sqoop argument.
16. Sqoop Import to Hive Static Partition
• Check the Hive partition.
17. Sqoop Import to Hive Static Partition on Date column
• Only one static partition can be passed as a Sqoop argument, with
the date value specified manually.
18. Questions
How to import data if there are more than 200 partitions?
Should I manually run these jobs again and again?
How to import data if the date format is month, day, or year?
Is there any way to pass the format?
19. New Enhancement: Import to Hive Dynamic Partition
• A new Sqoop argument, --hcatalog-dynamic-partition-keys, is
introduced.
• It works alongside the existing static partition key.
• If both are passed, the static partition key takes precedence.
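One possible reading of the precedence rule, sketched in Python. This is an illustration, not the patch's actual logic: the function resolve_partition_spec and the sample columns are hypothetical, and the assumption is that a static key overrides a dynamic key of the same name.

```python
def resolve_partition_spec(row, dynamic_keys, static_key=None, static_value=None):
    """Build the partition spec for one row.

    Dynamic keys take their values from the row itself. A static key, if
    given, overrides any dynamic key of the same name (assumed rule).
    """
    spec = {key: row[key] for key in dynamic_keys}
    if static_key is not None:
        spec[static_key] = static_value
    return spec


row = {"id": 7, "country": "US", "year": "2015"}
dynamic_only = resolve_partition_spec(row, ["country", "year"])
with_static = resolve_partition_spec(row, ["country", "year"], "year", "2016")
print(dynamic_only)
print(with_static)
```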
21. New Enhancement: Import to Hive Dynamic Partition with Date Format
• A new argument, --hcatalog-dynamic-partition-key-format, is passed
together with --hcatalog-dynamic-partition-keys.
• Check the Hive Partitions after the Sqoop Import.
• The partitions created will be in the user-specified format.
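To illustrate how a format turns a date column into partition values, here is a small Python sketch. The slides do not state the format syntax the new argument accepts, so Python's strftime directives are used purely as a stand-in, and partition_value is a hypothetical helper.

```python
from datetime import date


def partition_value(day, fmt):
    """Derive a partition value from a date using a format string."""
    return day.strftime(fmt)


days = [date(2015, 1, 3), date(2015, 1, 28), date(2015, 2, 1)]

# A monthly format collapses the three dates into two partitions,
# a yearly format collapses them into one.
by_month = sorted({partition_value(d, "%Y-%m") for d in days})
by_year = sorted({partition_value(d, "%Y") for d in days})
print(by_month)
print(by_year)
```

The coarser the format, the fewer partitions are created for the same data, which is what removes the need to run one job per date value.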
23. Password encrypted in Sqoop Metastore
• The password is now saved in the Sqoop metastore in
encrypted form.
• The logic is the same as for file encryption, where a generic
passkey and algorithm are passed on the command line.
24. Issues with Sqoop Merge Tool
• Designed to merge two directories on HDFS. It needs
modification to support merging of Hive tables.
• The output directory must be specified while performing the
merge.
• Supports merge based on a single column only.
• Merging many partitions requires a separate Sqoop job per
partition, run sequentially.
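The last point can be made concrete with a short Python sketch that builds one sqoop merge invocation per partition. The arguments --new-data, --onto, --target-dir, --jar-file, --class-name, and --merge-key belong to the stock Sqoop merge tool; the directory layout, jar name, class name, and the helper build_merge_jobs are hypothetical.

```python
def build_merge_jobs(partitions, base_root, delta_root, out_root,
                     jar_file, class_name, merge_key):
    """Return one `sqoop merge` argv list per partition directory."""
    jobs = []
    for part in partitions:
        jobs.append([
            "sqoop", "merge",
            "--new-data", f"{delta_root}/{part}",  # newer dataset
            "--onto", f"{base_root}/{part}",       # older dataset
            "--target-dir", f"{out_root}/{part}",  # output must be given
            "--jar-file", jar_file,
            "--class-name", class_name,
            "--merge-key", merge_key,              # a single column only
        ])
    return jobs


# All paths and names below are placeholders for illustration.
jobs = build_merge_jobs(
    ["month=2015-01", "month=2015-02", "month=2015-03"],
    "/data/base", "/data/delta", "/data/merged",
    "mytable.jar", "MyTable", "id",
)
for job in jobs:
    print(" ".join(job))
```

Three partitions already mean three separate jobs, each needing its own output directory.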
25. Merge Incremental data using Sqoop and Hive External Table
• Import records from the base table to an HDFS directory.
• Import updates using incremental imports to another HDFS
directory.
• Create a Hive external table for each of the two directories.
• Create a view that combines record sets from both the
Base (base_table) and Change (incremental_table) tables.
26. Merge Incremental data using Sqoop and Hive External Table
• The view now contains the most up-to-date set of records.
• Generate a table from the view created in the step above.
• Replace the base table with the entries from the generated
table.
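The reconciliation the view performs can be sketched in Python as follows. The slides do not say how the view picks the winning record; this sketch assumes every record carries a key column (id) and a last-modified column (modified), both hypothetical names, and keeps the most recently modified record per key.

```python
def reconcile(base_rows, incremental_rows, key="id", modified="modified"):
    """Combine base and incremental records, keeping the newest per key."""
    latest = {}
    # Base rows are visited first, so on a tie the incremental row wins.
    for row in list(base_rows) + list(incremental_rows):
        current = latest.get(row[key])
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


base_table = [
    {"id": 1, "name": "a", "modified": 1},
    {"id": 2, "name": "b", "modified": 1},
]
incremental_table = [
    {"id": 2, "name": "b2", "modified": 2},  # update to an existing record
    {"id": 3, "name": "c", "modified": 2},   # brand-new record
]
merged = reconcile(base_table, incremental_table)
print(merged)
```

The result plays the role of the generated table that then replaces the base table.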
27. New Tool: Hive Merge
• Import original base table into Hive
28. New Tool: Hive Merge
• Import incremental data into Hive
29. New Tool: Hive Merge
• Finally, merge the data using the hive-merge tool.
30. Acquiring locks during Hive Merge
• To allow only a single Hive merge to run on a given table at a
time, the tool acquires a lock at the start and releases it once it
finishes.
32. Other Key Enhancements
• Save encrypted password in Sqoop Metastore
• Teradata varchar/char support
• Teradata current timestamp support
• Sqoop Job runs for Incremental Import
• Snappy compression support in HCatalog
33. Apache Sqoop Jiras
These are some of the JIRAs for which we have provided
patches:
• SQOOP-2332: Dynamic Partition in Sqoop HCatalog if the
Hive table does not exist, and support for Partition Date
Format
• SQOOP-2335: Support for Hive External Table in Sqoop
HCatalog
34. • SQOOP-2585: Merging Hive tables using Sqoop
• SQOOP-2596: Precision of varchar/char column cannot be
retrieved from Teradata database during Sqoop import
• SQOOP-2801: Secure RDBMS password in Sqoop
Metastore in an encrypted form.
• SQOOP-2331: Snappy Compression Support in Sqoop
HCatalog
37. Hive Merge Internal Architecture
Step 1: Identify partitions to update. Skip this step for
non-partitioned tables.
38. Hive Merge Internal Architecture
Step 2: Merge the new partitions with the old partitions (only for
partitioned tables).
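The architecture diagrams for these two steps are not preserved, so the following Python sketch is only one way to read them. It assumes the table is held as a mapping from partition value to rows and that rows are matched on a composite key; hive_merge, the column names, and the sample data are all hypothetical.

```python
from collections import defaultdict


def hive_merge(existing, incoming, partition_col, key_cols):
    """Merge incoming rows into a partitioned table.

    existing: dict mapping partition value -> list of row dicts
    incoming: list of row dicts from the incremental import
    """
    # Step 1: identify which partitions the incremental data touches.
    touched = defaultdict(list)
    for row in incoming:
        touched[row[partition_col]].append(row)

    merged = {part: list(rows) for part, rows in existing.items()}

    # Step 2: merge only the touched partitions; a new row replaces an
    # old row with the same composite key, other rows are kept as they are.
    for part, new_rows in touched.items():
        by_key = {tuple(r[c] for c in key_cols): r for r in merged.get(part, [])}
        for r in new_rows:
            by_key[tuple(r[c] for c in key_cols)] = r
        merged[part] = list(by_key.values())
    return merged


existing = {
    "2015-01": [{"id": 1, "site": "us", "amt": 10, "month": "2015-01"}],
    "2015-02": [{"id": 2, "site": "us", "amt": 5, "month": "2015-02"}],
}
incoming = [
    {"id": 1, "site": "us", "amt": 12, "month": "2015-01"},  # update
    {"id": 3, "site": "uk", "amt": 7, "month": "2015-03"},   # new partition
]
result = hive_merge(existing, incoming, "month", ["id", "site"])
print(result)
```

Partitions that receive no incremental rows, such as 2015-02 here, are never rewritten.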