SlideShare uma empresa Scribd logo
1 de 39
Bridging the gap of Relational to
Hadoop using Sqoop@Expedia
(Enhancing Sqoop for Synchronization)
Shashank Tandon, Expedia
Kopal Niranjan, Expedia
Agenda
• Problem statement
• Why- Sqoop
• Expedia Enhancements for Sqoop.
• New Tool : Hive Merge
• Data Synchronization
• Demo
| Expedia Inc. Proprietary & Confidential1
| Expedia Inc. Proprietary & Confidential2
Data Synchronization
Problem Statement
• Import huge amount of data available on RDBMS to Hive
table
• Support multiple partitions on Hive while importing.
• Regular updates happening on RDBMS.
–Merge the new/updated data to hive tables.
–Merge the data in parallel.
| Expedia Inc. Proprietary & Confidential3
Community Solution - Sqoop
• Sqoop is an open source tool designed to efficiently
transfer bulk data between Hadoop and structured data
stores such as relational databases.
• Support various relational databases like Teradata, SQL
Server, Oracle,Mysql,DB2 etc.
| Expedia Inc. Proprietary & Confidential4
Enhanced Sqoop Features
• Enhanced Sqoop Features for community business needs.
- Hive Merge
- Merges the incremental data migrated to hdfs into your
existing hive tables.
- Supports merge based on composite keys
- Merges older partitions as well as add new partitions.
| Expedia Inc. Proprietary & Confidential5
Enhanced Sqoop Features
- Hive Dynamic Partition
- Hive Dynamic Partition with Partition Format
- Hive External Table
- Compression like Snappy
| Expedia Inc. Proprietary & Confidential6
Hcatalog for Hive
- Hcatalog is a java wrapper on top of Hive metastore.
- Sqoop supports all the latest Hive features using Hcatalog.
| Expedia Inc. Proprietary & Confidential7
External tables with HCatalog
| Expedia Inc. Proprietary & Confidential8
Sqoop Import to Hive Managed Table
| Expedia Inc. Proprietary & Confidential9
• Sqoop connects to mysql database test
• Import table MYTABLE in a hive managed table test_part1
• The hive managed table is located in /apps/hive/warehouse
| Expedia Inc. Proprietary & Confidential10
New Enhancement :Import to Hive External Table
| Expedia Inc. Proprietary & Confidential11
• The above command creates a hive table in the user managed
Directory /user/root/test_part2
| Expedia Inc. Proprietary & Confidential12
Dynamic Partitioning with HCatalog
| Expedia Inc. Proprietary & Confidential13
Sqoop Import to Hive Static Partition
• Can pass only 1 static partition as sqoop argument
| Expedia Inc. Proprietary & Confidential14
Sqoop Import to Hive Static Partition
• Check Hive Partition
| Expedia Inc. Proprietary & Confidential15
Sqoop Import to Hive Static Partition on Date column
• Can pass only 1 static partition as sqoop argument with
date value specified manually.
| Expedia Inc. Proprietary & Confidential16
Questions
| Expedia Inc. Proprietary & Confidential17
How to Import Data if there are more than 200 partitions ?
Should I manually run these jobs again and again ?
How to Import Data if the date format is month or day or year?
Is there any way that I can pass the format ?
New Enhancement : Import to Hive Dynamic Partition
• A new argument is passed –hcatalog-dynamic-partition-
keys in sqoop.
• It works along with current static partition key.
• If both are passed then it will give more preference to static
partition key.
| Expedia Inc. Proprietary & Confidential18
| Expedia Inc. Proprietary & Confidential19
New Enhancement : Import to Hive Dynamic Partition with
Date Format
• A new argument is passed –hcatalog-dynamic-partition-
key-format with argument –hcatalog-dynamic-partition-
keys.
• Check the Hive Partitions after the Sqoop Import.
• The partitions created will be in the user-specified format.
| Expedia Inc. Proprietary & Confidential20
| Expedia Inc. Proprietary & Confidential21
Password encrypted in Sqoop Metastore
• Password will now be saved in Sqoop metastore in
encrypted manner.
• The logic is same as done in file encryption where generic
passkey and algorithm is passed in command line.
| Expedia Inc. Proprietary & Confidential22
Issues with Sqoop Merge Tool
• Designed to merge two directories on HDFS. Will need
modification to support merging of Hive tables.
• The output directory must be specified while performing the
merge.
• Supports merge based on a single column.
• To merge many partitions, each will require separate
sequential Sqoop jobs.
| Expedia Inc. Proprietary & Confidential23
Merge Incremental data using Sqoop and Hive External
Table
• Import records from base table to a HDFS directory.
• Import updates using incremental imports to another HDFS
directory.
• Create a hive external table for both the directories.
• Create a view that combines record sets from both the
Base (base_table) and Change (incremental_table) tables.
| Expedia Inc. Proprietary & Confidential24
Merge Incremental data using Sqoop and Hive External
Table
• The view now contains the most up-to-date set of records.
• Generate a table from the view created in above step.
• Replace the base table with the entries from the above
generated table.
| Expedia Inc. Proprietary & Confidential25
New Tool: Hive Merge
• Import original base table into Hive
| Expedia Inc. Proprietary & Confidential26
New Tool : Hive merge
• Import incremental data into Hive
| Expedia Inc. Proprietary & Confidential27
• Finally merge data using tool hive-merge.
| Expedia Inc. Proprietary & Confidential28
New Tool : Hive merge
Acquiring locks during Hive Merge
• In order to allow only single Hive merge happen on same
table, tool acquire lock in the start and release lock once it
finishes.
| Expedia Inc. Proprietary & Confidential29
Performance metrics : Hive Merge tool
| Expedia Inc. Proprietary & Confidential30
Other Key Enhancements
• Save encrypted password in Sqoop Metastore
• Teradata varchar/char support
• Teradata current timestamp support
• Sqoop Job runs for Incremental Import
• Snappy compression support in Hcatalog
| Expedia Inc. Proprietary & Confidential31
Apache Sqoop Jiras
These are the few jiras for which the patch has been
provided by us:
• SQOOP-2332: Dynamic Partition in Sqoop HCatalog- if
Hive table does not exists & add support for Partition Date
Format
• SQOOP-2335 :Support for Hive External Table in Sqoop –
Hcatalog
| Expedia Inc. Proprietary & Confidential32
• SQOOP-2585: Merging hive tables using sqoop
• SQOOP-2596:Precision of varchar/char column cannot be
retrieved from teradata database during sqoop import
• SQOOP-2801: Secure RDBMS password in Sqoop
Metastore in a encrypted form.
• SQOOP-2331: Snappy Compression Support in Sqoop-
Hcatalog
| Expedia Inc. Proprietary & Confidential33
34
Demo
Questions
| Expedia Inc. Proprietary & Confidential35
Hive Merge Internal Architecture
| Expedia Inc. Proprietary & Confidential36
Step 1: Identify partitions to update. Skip this step for
non-partitioned tables.
Hive Merge Internal Architecture
| Expedia Inc. Proprietary & Confidential37
Step 2: Merge the new partitions with the old partitions(only for
partitioned tables).
Hive Merge Internal Architecture
| Expedia Inc. Proprietary & Confidential38
Step 3: Delete older versions.

Mais conteúdo relacionado

Mais procurados

Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesSingleStore
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021StreamNative
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaDataWorks Summit
 
Cassandra Summit 2014: Launching PlayStation 4 with Apache Cassandra
Cassandra Summit 2014: Launching PlayStation 4 with Apache CassandraCassandra Summit 2014: Launching PlayStation 4 with Apache Cassandra
Cassandra Summit 2014: Launching PlayStation 4 with Apache CassandraDataStax Academy
 
Capital One: Using Cassandra In Building A Reporting Platform
Capital One: Using Cassandra In Building A Reporting PlatformCapital One: Using Cassandra In Building A Reporting Platform
Capital One: Using Cassandra In Building A Reporting PlatformDataStax Academy
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Data Con LA
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analyticsDataWorks Summit
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceDataWorks Summit/Hadoop Summit
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020confluent
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovBig Data Spain
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONUJerome Boulon
 

Mais procurados (20)

Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
ebay
ebayebay
ebay
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at Alibaba
 
Cassandra Summit 2014: Launching PlayStation 4 with Apache Cassandra
Cassandra Summit 2014: Launching PlayStation 4 with Apache CassandraCassandra Summit 2014: Launching PlayStation 4 with Apache Cassandra
Cassandra Summit 2014: Launching PlayStation 4 with Apache Cassandra
 
Capital One: Using Cassandra In Building A Reporting Platform
Capital One: Using Cassandra In Building A Reporting PlatformCapital One: Using Cassandra In Building A Reporting Platform
Capital One: Using Cassandra In Building A Reporting Platform
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analytics
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONU
 

Destaque

About Expedia, Inc.
About Expedia, Inc.About Expedia, Inc.
About Expedia, Inc.ExpediaIncPR
 
Expedia presentation
Expedia presentationExpedia presentation
Expedia presentationeqc3w
 
Expedia company presentation
Expedia company presentationExpedia company presentation
Expedia company presentationNicole Grieble
 
Expedia hotel view: overview
Expedia hotel view: overviewExpedia hotel view: overview
Expedia hotel view: overviewExpedia EMEA
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataDataWorks Summit/Hadoop Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewDataWorks Summit/Hadoop Summit
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJDaniel Madrigal
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success DataWorks Summit/Hadoop Summit
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisDataWorks Summit/Hadoop Summit
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondDataWorks Summit/Hadoop Summit
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentDataWorks Summit/Hadoop Summit
 

Destaque (20)

About Expedia, Inc.
About Expedia, Inc.About Expedia, Inc.
About Expedia, Inc.
 
Expedia case study
Expedia case studyExpedia case study
Expedia case study
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Expedia presentation
Expedia presentationExpedia presentation
Expedia presentation
 
Expedia company presentation
Expedia company presentationExpedia company presentation
Expedia company presentation
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Simplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & TroubleshootingSimplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & Troubleshooting
 
Expedia hotel view: overview
Expedia hotel view: overviewExpedia hotel view: overview
Expedia hotel view: overview
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of Data
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture View
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
 
Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
 
Toward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFSToward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFS
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service Deployment
 

Semelhante a Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)Stratebi
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Le novità di SQL Server 2022
Le novità di SQL Server 2022Le novità di SQL Server 2022
Le novità di SQL Server 2022Gianluca Hotz
 
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!KNIMESlides
 
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIMEBig Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIMERosaria Silipo
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Kai Sasaki
 
Symfony2 for legacy app rejuvenation: the eZ Publish case study
Symfony2 for legacy app rejuvenation: the eZ Publish case studySymfony2 for legacy app rejuvenation: the eZ Publish case study
Symfony2 for legacy app rejuvenation: the eZ Publish case studyGaetano Giunta
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Kubernetes 1.16 and rancher 2.3 enhancements
Kubernetes 1.16 and rancher 2.3 enhancementsKubernetes 1.16 and rancher 2.3 enhancements
Kubernetes 1.16 and rancher 2.3 enhancementsSaiyam Pathak
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Maintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopMaintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopKai Sasaki
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted CloudColin Charles
 
MySQL in the Hosted Cloud - Percona Live 2015
MySQL in the Hosted Cloud - Percona Live 2015MySQL in the Hosted Cloud - Percona Live 2015
MySQL in the Hosted Cloud - Percona Live 2015Colin Charles
 
Spring Batch Performance Tuning
Spring Batch Performance TuningSpring Batch Performance Tuning
Spring Batch Performance TuningGunnar Hillert
 
Docker for the enterprise
Docker for the enterpriseDocker for the enterprise
Docker for the enterpriseBert Poller
 

Semelhante a Bridging the gap of Relational to Hadoop using Sqoop @ Expedia (20)

SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
gk.pptx
gk.pptxgk.pptx
gk.pptx
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Le novità di SQL Server 2022
Le novità di SQL Server 2022Le novità di SQL Server 2022
Le novità di SQL Server 2022
 
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!
 
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIMEBig Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0
 
Symfony2 for legacy app rejuvenation: the eZ Publish case study
Symfony2 for legacy app rejuvenation: the eZ Publish case studySymfony2 for legacy app rejuvenation: the eZ Publish case study
Symfony2 for legacy app rejuvenation: the eZ Publish case study
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Kubernetes 1.16 and rancher 2.3 enhancements
Kubernetes 1.16 and rancher 2.3 enhancementsKubernetes 1.16 and rancher 2.3 enhancements
Kubernetes 1.16 and rancher 2.3 enhancements
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Maintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopMaintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoop
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
ECS19 - Robi Voncina - Upgrade to SharePoint 2019
ECS19 - Robi Voncina - Upgrade to SharePoint 2019ECS19 - Robi Voncina - Upgrade to SharePoint 2019
ECS19 - Robi Voncina - Upgrade to SharePoint 2019
 
MySQL in the Hosted Cloud - Percona Live 2015
MySQL in the Hosted Cloud - Percona Live 2015MySQL in the Hosted Cloud - Percona Live 2015
MySQL in the Hosted Cloud - Percona Live 2015
 
Spring Batch Performance Tuning
Spring Batch Performance TuningSpring Batch Performance Tuning
Spring Batch Performance Tuning
 
Docker for the enterprise
Docker for the enterpriseDocker for the enterprise
Docker for the enterprise
 

Mais de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Mais de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Último

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Último (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

  • 1. Bridging the gap of Relational to Hadoop using Sqoop@Expedia (Enhancing Sqoop for Synchronization) Shashank Tandon, Expedia Kopal Niranjan, Expedia
  • 2. Agenda • Problem statement • Why- Sqoop • Expedia Enhancements for Sqoop. • New Tool : Hive Merge • Data Synchronization • Demo | Expedia Inc. Proprietary & Confidential1
  • 3. | Expedia Inc. Proprietary & Confidential2 Data Synchronization
  • 4. Problem Statement • Import huge amount of data available on RDBMS to Hive table • Support multiple partitions on Hive while importing. • Regular updates happening on RDBMS. –Merge the new/updated data to hive tables. –Merge the data in parallel. | Expedia Inc. Proprietary & Confidential3
  • 5. Community Solution - Sqoop • Sqoop is an open source tool designed to efficiently transfer bulk data between Hadoop and structured data stores such as relational databases. • Support various relational databases like Teradata, SQL Server, Oracle,Mysql,DB2 etc. | Expedia Inc. Proprietary & Confidential4
  • 6. Enhanced Sqoop Features • Enhanced Sqoop Features for community business needs. - Hive Merge - Merges the incremental data migrated to hdfs into your existing hive tables. - Supports merge based on composite keys - Merges older partitions as well as add new partitions. | Expedia Inc. Proprietary & Confidential5
  • 7. Enhanced Sqoop Features - Hive Dynamic Partition - Hive Dynamic Partition with Partition Format - Hive External Table - Compression like Snappy | Expedia Inc. Proprietary & Confidential6
  • 8. Hcatalog for Hive - Hcatalog is a java wrapper on top of Hive metastore. - Sqoop supports all the latest Hive features using Hcatalog. | Expedia Inc. Proprietary & Confidential7
  • 9. External tables with HCatalog | Expedia Inc. Proprietary & Confidential8
  • 10. Sqoop Import to Hive Managed Table | Expedia Inc. Proprietary & Confidential9 • Sqoop connects to mysql database test • Import table MYTABLE in a hive managed table test_part1 • The hive managed table is located in /apps/hive/warehouse
  • 11. | Expedia Inc. Proprietary & Confidential10
  • 12. New Enhancement :Import to Hive External Table | Expedia Inc. Proprietary & Confidential11 • The above command creates a hive table in the user managed Directory /user/root/test_part2
  • 13. | Expedia Inc. Proprietary & Confidential12
  • 14. Dynamic Partitioning with HCatalog | Expedia Inc. Proprietary & Confidential13
  • 15. Sqoop Import to Hive Static Partition • Can pass only 1 static partition as sqoop argument | Expedia Inc. Proprietary & Confidential14
  • 16. Sqoop Import to Hive Static Partition • Check Hive Partition | Expedia Inc. Proprietary & Confidential15
  • 17. Sqoop Import to Hive Static Partition on Date column • Can pass only 1 static partition as sqoop argument with date value specified manually. | Expedia Inc. Proprietary & Confidential16
  • 18. Questions | Expedia Inc. Proprietary & Confidential17 How to Import Data if there are more than 200 partitions ? Should I manually run these jobs again and again ? How to Import Data if the date format is month or day or year? Is there any way that I can pass the format ?
  • 19. New Enhancement : Import to Hive Dynamic Partition • A new argument is passed –hcatalog-dynamic-partition- keys in sqoop. • It works along with current static partition key. • If both are passed then it will give more preference to static partition key. | Expedia Inc. Proprietary & Confidential18
  • 20. | Expedia Inc. Proprietary & Confidential19
  • 21. New Enhancement : Import to Hive Dynamic Partition with Date Format • A new argument is passed –hcatalog-dynamic-partition- key-format with argument –hcatalog-dynamic-partition- keys. • Check the Hive Partitions after the Sqoop Import. • The partitions created will be in the user-specified format. | Expedia Inc. Proprietary & Confidential20
  • 22. | Expedia Inc. Proprietary & Confidential21
  • 23. Password encrypted in Sqoop Metastore • Password will now be saved in Sqoop metastore in encrypted manner. • The logic is same as done in file encryption where generic passkey and algorithm is passed in command line. | Expedia Inc. Proprietary & Confidential22
  • 24. Issues with Sqoop Merge Tool • Designed to merge two directories on HDFS. Will need modification to support merging of Hive tables. • The output directory must be specified while performing the merge. • Supports merge based on a single column. • To merge many partitions, each will require separate sequential Sqoop jobs. | Expedia Inc. Proprietary & Confidential23
  • 25. Merge Incremental data using Sqoop and Hive External Table • Import records from base table to a HDFS directory. • Import updates using incremental imports to another HDFS directory. • Create a hive external table for both the directories. • Create a view that combines record sets from both the Base (base_table) and Change (incremental_table) tables. | Expedia Inc. Proprietary & Confidential24
  • 26. Merge Incremental data using Sqoop and Hive External Table • The view now contains the most up-to-date set of records. • Generate a table from the view created in above step. • Replace the base table with the entries from the above generated table. | Expedia Inc. Proprietary & Confidential25
  • 27. New Tool: Hive Merge • Import original base table into Hive | Expedia Inc. Proprietary & Confidential26
  • 28. New Tool : Hive merge • Import incremental data into Hive | Expedia Inc. Proprietary & Confidential27
  • 29. • Finally merge data using tool hive-merge. | Expedia Inc. Proprietary & Confidential28 New Tool : Hive merge
  • 30. Acquiring locks during Hive Merge • In order to allow only single Hive merge happen on same table, tool acquire lock in the start and release lock once it finishes. | Expedia Inc. Proprietary & Confidential29
  • 31. Performance metrics : Hive Merge tool | Expedia Inc. Proprietary & Confidential30
  • 32. Other Key Enhancements • Save encrypted password in Sqoop Metastore • Teradata varchar/char support • Teradata current timestamp support • Sqoop Job runs for Incremental Import • Snappy compression support in Hcatalog | Expedia Inc. Proprietary & Confidential31
  • 33. Apache Sqoop Jiras These are the few jiras for which the patch has been provided by us: • SQOOP-2332: Dynamic Partition in Sqoop HCatalog- if Hive table does not exists & add support for Partition Date Format • SQOOP-2335 :Support for Hive External Table in Sqoop – Hcatalog | Expedia Inc. Proprietary & Confidential32
  • 34. • SQOOP-2585: Merging hive tables using sqoop • SQOOP-2596:Precision of varchar/char column cannot be retrieved from teradata database during sqoop import • SQOOP-2801: Secure RDBMS password in Sqoop Metastore in a encrypted form. • SQOOP-2331: Snappy Compression Support in Sqoop- Hcatalog | Expedia Inc. Proprietary & Confidential33
  • 36. Questions | Expedia Inc. Proprietary & Confidential35
  • 37. Hive Merge Internal Architecture | Expedia Inc. Proprietary & Confidential36 Step 1: Identify partitions to update. Skip this step for non-partitioned tables.
  • 38. Hive Merge Internal Architecture | Expedia Inc. Proprietary & Confidential37 Step 2: Merge the new partitions with the old partitions(only for partitioned tables).
  • 39. Hive Merge Internal Architecture | Expedia Inc. Proprietary & Confidential38 Step 3: Delete older versions.