Cloud Friendly Hadoop & Hive

         Joydeep Sen Sarma



           Qubole
Agenda

• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud

Qubole Data Service

[Architecture diagram, built up over several slides. Layers from top to bottom:]
  SDK, ODBC
  Explore – Integrate – Analyze – Schedule
  API
  Oozie, Hive, Pig, Sqoop
  Hadoop
  AWS EC2, AWS S3
[Shown alongside the stack: Vertica, MySQL, S3://adco/logs]

Step 1 (Optional): Set Up Hadoop

Step 2: Fire Away

[Diagram, built up over several slides: jobs are submitted to the "AdCo Hadoop" cluster]

select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as
county from SMALL_TABLE a) t
group by t.county;

hadoop jar myapp.jar -Dmapred.min.split.size=32000000 -partitioner org.apache…

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id)
group by a.id, a.zip;

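The first query streams each zip value through a script named geo.py. The deck does not show that script, so the following is only a minimal sketch of what it could look like. It relies on Hive's TRANSFORM convention of passing rows as tab-separated text on standard input and reading rows back from standard output. The lookup table, the helper names, and the "UNKNOWN" fallback are assumptions made for illustration.

```python
import sys

# Hypothetical lookup table; a real script would load a complete
# ZIP-to-county dataset shipped alongside it.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "10001": "New York",
}


def county_for(zip_code):
    """Return the county for a ZIP code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform(lines):
    """Map each input row (one ZIP code per line) to one output row (its county)."""
    for line in lines:
        # Hive separates columns with tabs; this query passes a single column.
        zip_code = line.rstrip("\n").split("\t")[0]
        yield county_for(zip_code)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point when Hive runs the script: read rows, write one county per row."""
    for county in transform(stdin):
        stdout.write(county + "\n")
```

Used as a TRANSFORM script, the file would end with a call to main(). Keeping the row logic in transform() makes it possible to check the mapping locally before running it on a cluster.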
Come back anytime




Hadoop as Service
1. Detect when cluster is required
  – Not all Hive statements require a cluster (EXPLAIN/SHOW/..)


2. Atomically create cluster
  – Long-running process, concurrency control using MySQL


3. Shut down when not in use
  – Do it on the hour boundary (whose?)
  – Not if user sessions are active!

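Steps 1 and 2 above can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the keyword list beyond EXPLAIN and SHOW is a guess, the table name cluster_start is hypothetical, and Python's built-in sqlite3 stands in for MySQL so the example is self-contained. The idea carried over is that a unique key lets exactly one of several concurrent requests claim the job of starting the cluster.

```python
import sqlite3

# Statement keywords assumed to be answerable from metadata alone.
# The slide names EXPLAIN and SHOW; DESCRIBE and USE are added here as a guess.
METADATA_ONLY = ("explain", "show", "describe", "use")


def needs_cluster(statement):
    """Return True unless the statement starts with a metadata-only keyword."""
    words = statement.strip().lower().split(None, 1)
    first_word = words[0] if words else ""
    return first_word not in METADATA_ONLY


def make_demo_db():
    """Create an in-memory database with the hypothetical cluster_start table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cluster_start (account TEXT PRIMARY KEY, state TEXT)")
    return conn


def try_claim_cluster_start(conn, account):
    """Try to become the single process that starts the cluster for `account`.

    The primary key on `account` lets the INSERT succeed for exactly one
    caller; every other caller gets an IntegrityError and should wait.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cluster_start (account, state) VALUES (?, 'STARTING')",
                (account,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A caller would check needs_cluster() first, then call try_claim_cluster_start(); only the caller that receives True goes on to launch instances, while the others poll the row until its state changes.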
Hadoop as Service
• Archive Job History/Logs to S3
  – Transparent access to Old jobs



• Auto-Config different node types
  – Use ALL ephemeral drives for HDFS/MR
  – Use right number of slots per machine


• Scrub, Scrub, Scrub
  – Bad Nodes, Bad Clusters, AWS timeouts


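The auto-configuration bullet can be illustrated with a small sketch. The property names are the Hadoop 1.x ones; the slot heuristic, the directory layout under each mount point, and the default task heap size are assumptions chosen for the example, not values taken from the talk.

```python
def node_config(cores, memory_mb, ephemeral_mounts, task_heap_mb=1024):
    """Build per-node Hadoop 1.x settings from the machine's shape.

    The slot heuristic is illustrative: total slots are capped by both CPU
    count and memory, and roughly one third of them go to reducers.
    """
    by_cpu = cores
    by_memory = memory_mb // task_heap_mb
    total_slots = max(2, min(by_cpu, by_memory))
    reduce_slots = max(1, total_slots // 3)
    map_slots = total_slots - reduce_slots
    return {
        # Spread HDFS blocks and MapReduce scratch space over every drive.
        "dfs.data.dir": ",".join(m + "/hdfs/data" for m in ephemeral_mounts),
        "mapred.local.dir": ",".join(m + "/mapred/local" for m in ephemeral_mounts),
        "mapred.tasktracker.map.tasks.maximum": str(map_slots),
        "mapred.tasktracker.reduce.tasks.maximum": str(reduce_slots),
    }
```

For example, node_config(8, 15000, ["/mnt", "/mnt2"]) lists both mounts in dfs.data.dir and mapred.local.dir and assigns 6 map slots and 2 reduce slots.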
Scaling Up

insert overwrite table dest
select … from ads join
campaigns on … group by …;

[Diagram, built up over several slides. Labels: Master, Job Tracker, Map Tasks, Reduce Tasks, Slaves, StarCluster, AWS. Later slides add Progress, Supply, and Demand.]

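The slides label Progress, Supply, and Demand but do not give a formula. One way such a comparison could work is sketched below, purely as an assumption: demand is the slot-seconds of work still pending, supply is the slot-seconds the current cluster can deliver within a target window, and nodes are added only when demand exceeds supply. All parameter names are hypothetical.

```python
import math


def nodes_to_add(pending_tasks, avg_task_seconds, slots_per_node,
                 current_nodes, target_seconds, max_nodes):
    """Return how many nodes to add so pending work fits in target_seconds."""
    # Demand: slot-seconds of work still to run.
    demand = pending_tasks * avg_task_seconds
    # Supply: slot-seconds the current cluster offers within the target window.
    supply = current_nodes * slots_per_node * target_seconds
    if demand <= supply:
        return 0
    # Cluster size that would cover the demand, capped at max_nodes.
    needed_nodes = math.ceil(demand / (slots_per_node * target_seconds))
    return max(0, min(needed_nodes, max_nodes) - current_nodes)
```

With 1000 pending one-minute tasks, 4 slots per node, 2 nodes, a 10-minute target, and a 20-node cap, the function asks for 18 more nodes; with only 10 pending tasks it asks for none.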
Scaling Down
1. On hour boundary – check if node is required:
   – Can’t remove nodes with map-outputs (today)
   – Don’t go below minimum cluster size


2. Remove node from Map-Reduce Cluster

3. Request HDFS Decommissioning – fast!
   – Delete affected cache files instead of re-replicating
   – One surviving replica and we are done.


4. Delete Instance
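The eligibility check in step 1 can be sketched as below. The deck leaves open whose hour boundary applies; this sketch assumes it is each instance's own billing hour, measured from its launch time. The Node fields and the five-minute window are hypothetical, and deciding whether the cluster still needs the capacity is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    launch_epoch: int       # seconds since epoch when the instance started
    has_map_outputs: bool   # still holds map outputs that reducers may need


def near_hour_boundary(node, now_epoch, window_seconds=300):
    """True if the node is within window_seconds of completing an hour of uptime."""
    seconds_into_hour = (now_epoch - node.launch_epoch) % 3600
    return seconds_into_hour >= 3600 - window_seconds


def removable_nodes(nodes, now_epoch, min_cluster_size):
    """Pick nodes that can be removed without breaking the listed rules."""
    spare = len(nodes) - min_cluster_size
    chosen = []
    for node in nodes:
        if spare <= 0:
            break  # never go below the minimum cluster size
        if node.has_map_outputs:
            continue  # reducers may still fetch from this node
        if not near_hour_boundary(node, now_epoch):
            continue  # the current hour is already paid for, so keep the node
        chosen.append(node)
        spare -= 1
    return chosen
```

The nodes returned here would then go through steps 2 to 4: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.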
Spot Instances




On average, 50-60% cheaper
Spot Instance: Challenges
• Can lose Spot nodes anytime
  – Disastrous for HDFS
  – Hybrid Mode: Use mix of On-Demand and Spot
  – Hybrid Mode: Keep one replica in On-Demand nodes



• Spot Instances may not be available
  – Timeout and use On-Demand nodes as fallback



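Both mitigations can be sketched briefly. In the first function, one replica of each block is pinned to an On-Demand node so that losing every Spot node cannot lose the block. In the second, Spot capacity is requested first and any shortfall after the timeout is covered by On-Demand nodes. This is a simplified illustration, not HDFS's placement policy or an AWS API: request_spot and request_on_demand are hypothetical callables supplied by the caller.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose nodes for a block so that one replica lives on an On-Demand node."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    first = rng.choice(on_demand_nodes)
    # Remaining replicas may land on any other node, Spot or On-Demand.
    others = [n for n in spot_nodes + on_demand_nodes if n != first]
    rest = rng.sample(others, min(replication - 1, len(others)))
    return [first] + rest


def acquire_nodes(count, request_spot, request_on_demand, timeout_seconds):
    """Ask for Spot nodes first; cover any shortfall with On-Demand nodes."""
    # request_spot is assumed to return the nodes granted before the timeout.
    granted = list(request_spot(count, timeout_seconds))
    shortfall = count - len(granted)
    if shortfall > 0:
        granted = granted + list(request_on_demand(shortfall))
    return granted
```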
Query History/Results




Cheap to Test
• Evaluate expressions on sample data
• Run Query on Sample

Fastest Hive SaaS
• Works with Small Files!
  – Faster Split Computation (8x)
  – Prefetching S3 files (30%)
• Stable JVM Reuse!
  – Fix re-entrancy issues
  – 1.2-2x speedup
• Direct writes to S3
  – HIVE-1620
• Columnar Cache
  – Use HDFS as cache for S3
  – Up to 5x faster for JSON data
• NEW – Multi-Tenant Hive Server

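The "Use HDFS as cache for S3" bullet describes a read-through cache. The sketch below shows only that general pattern, under loud assumptions: a dictionary stands in for files stored in HDFS, fetch_remote is a hypothetical callable that reads an object from S3, and the columnar re-encoding mentioned on the slide is not modelled at all.

```python
class ReadThroughCache:
    """Serve reads from a local store, fetching from the remote store on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # hypothetical callable: key -> bytes
        self._store = {}                   # stand-in for files kept in HDFS
        self.hits = 0
        self.misses = 0

    def read(self, key):
        """Return the data for `key`, filling the cache on first access."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        data = self._fetch_remote(key)
        self._store[key] = data
        return data
```

Only the first read of a key pays the remote round trip; later reads are served from the cluster-local copy.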
Questions?


           @Qubole
Free Trial: www.qubole.com

Mais conteúdo relacionado

Mais procurados

The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeRick van den Bosch
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleVasu S
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWSGary Stafford
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaDatabricks
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringAnant Corporation
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure DatabricksSascha Dittmann
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenMS Cloud Summit
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIDenny Lee
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 

Mais procurados (20)

The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data Lake
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
Yahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile PlatformYahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile Platform
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure Databricks
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 
Digital Transformation with Microsoft Azure
Digital Transformation with Microsoft AzureDigital Transformation with Microsoft Azure
Digital Transformation with Microsoft Azure
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BI
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 

Destaque

Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataQubole
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoptionQubole
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Joydeep Sen Sarma
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015Michael Mersch
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSAmazon Web Services
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup Qubole
 
Cortana Analytics Suite
Cortana Analytics SuiteCortana Analytics Suite
Cortana Analytics SuiteJames Serra
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsITProceed
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleQubole
 
Fortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsFortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsAmazon Web Services
 
Azure ARM’d and Ready
Azure ARM’d and ReadyAzure ARM’d and Ready
Azure ARM’d and Readymscug
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...Amazon Web Services
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012dataxu
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream AnalyticsMarco Parenzan
 

Destaque (20)

Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big Data
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWS
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
 
RDO-Packstack Workshop
RDO-Packstack Workshop RDO-Packstack Workshop
RDO-Packstack Workshop
 
Cortana Analytics Suite
Cortana Analytics SuiteCortana Analytics Suite
Cortana Analytics Suite
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - Qubole
 
Creating a fortigate vpn network & security blog
Creating a fortigate vpn   network & security blogCreating a fortigate vpn   network & security blog
Creating a fortigate vpn network & security blog
 
Fortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsFortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure Workloads
 
Azure ARM’d and Ready
Azure ARM’d and ReadyAzure ARM’d and Ready
Azure ARM’d and Ready
 
Azure Document Db
Azure Document DbAzure Document Db
Azure Document Db
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012DataXu: Programmatic Premium Webinar - June 7, 2012
DataXu: Programmatic Premium Webinar - June 7, 2012
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
 

Semelhante a Qubole hadoop-summit-2013-europe

Cloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveCloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveDataWorks Summit
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceDataWorks Summit
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster
Amazon-style shopping cart analysis using MapReduce on a Hadoop clusterAmazon-style shopping cart analysis using MapReduce on a Hadoop cluster
Amazon-style shopping cart analysis using MapReduce on a Hadoop clusterAsociatia ProLinux
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
Hadoop入門とクラウド利用
Hadoop入門とクラウド利用Hadoop入門とクラウド利用
Hadoop入門とクラウド利用Naoki Yanai
 
BigData- On - AWS Cloud -1
BigData- On - AWS Cloud -1BigData- On - AWS Cloud -1
BigData- On - AWS Cloud -1Milind gunjan
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceJoydeep Sen Sarma
 
High Performance Cloud Computing
High Performance Cloud ComputingHigh Performance Cloud Computing
High Performance Cloud ComputingDeepak Singh
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
 

Semelhante a Qubole hadoop-summit-2013-europe (20)

Cloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveCloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and Hive
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Qubole hadoop-summit-2013-europe

  • 1. Cloud Friendly Hadoop & Hive. Joydeep Sen Sarma, Qubole
  • 2. Agenda
      – What is Qubole Data Service
      – Hadoop as a Service in Cloud
      – Hive as a Service in Cloud
  • 3. Qubole Data Service [diagram labels: AWS EC2, AWS S3]
  • 4. Qubole Data Service [diagram labels: API; Oozie, Hive, Pig, Sqoop; Hadoop; AWS EC2, AWS S3]
  • 5. Qubole Data Service [diagram labels: API; Oozie, Hive, Pig, Sqoop; Hadoop; Vertica, MySQL; AWS EC2, AWS S3, S3://adco/logs]
  • 6-7. Qubole Data Service [diagram labels: SDK, ODBC; Explore – Integrate – Analyze – Schedule; API; Oozie, Hive, Pig, Sqoop; Hadoop; Vertica, MySQL; AWS EC2, AWS S3, S3://adco/logs]
  • 8. Agenda
      – What is Qubole Data Service
      – Hadoop as a Service in Cloud
      – Hive as a Service in Cloud
  • 10. Step 2: Fire Away [diagram labels: AdCo, Hadoop]
  • 11-12. Step 2: Fire Away [diagram labels: AdCo, Hadoop]
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
  • 13-14. Step 2: Fire Away [diagram labels: AdCo, Hadoop]
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
      insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
  • 15-16. Step 2: Fire Away [diagram labels: AdCo, Hadoop]
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
  • 17. Step 2: Fire Away [diagram labels: AdCo, Hadoop]
  • 19. Hadoop as Service
      1. Detect when a cluster is required
          – Not all Hive statements require a cluster (EXPLAIN/SHOW/..)
      2. Atomically create the cluster
          – Long-running process; concurrency control using MySQL
      3. Shut down when not in use
          – Do it on an hour boundary (whose?)
          – Not if user sessions are active!
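Steps 1 and 3 of the slide above can be sketched in a few lines of Python. This is only an illustration, not Qubole's implementation: the prefix list beyond EXPLAIN and SHOW, the function names, and the choice to measure the "hour boundary" from the cluster's own launch time (the slide leaves "whose?" open) are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical list of metadata-only statement prefixes. The slide names only
# EXPLAIN and SHOW; DESCRIBE is an assumed addition.
METADATA_ONLY_PREFIXES = ("explain", "show", "describe")


def needs_cluster(statement: str) -> bool:
    """Return True if the statement is assumed to need a running cluster."""
    words = statement.strip().split(None, 1)
    if not words:
        return False  # an empty statement needs nothing
    return words[0].lower() not in METADATA_ONLY_PREFIXES


def should_shut_down(started_at: datetime, now: datetime,
                     active_sessions: int, running_jobs: int,
                     window: timedelta = timedelta(minutes=5)) -> bool:
    """Shut down only when idle and close to the end of a paid hour.

    Assumption: the hour boundary is counted from the cluster's launch time.
    """
    if active_sessions > 0 or running_jobs > 0:
        return False
    into_hour = (now - started_at) % timedelta(hours=1)
    return timedelta(hours=1) - into_hour <= window
```

For example, `needs_cluster("SHOW TABLES")` is false while `needs_cluster("select count(1) from ads")` is true, so only the second statement would trigger cluster creation.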
  • 20. Hadoop as Service
      • Archive job history/logs to S3
          – Transparent access to old jobs
      • Auto-configure different node types
          – Use ALL ephemeral drives for HDFS/MR
          – Use the right number of slots per machine
      • Scrub, scrub, scrub
          – Bad nodes, bad clusters, AWS timeouts
  • 21. Scaling Up [diagram labels: Master, Job Tracker, StarCluster; Slaves, Map Tasks, Reduce Tasks; AWS]
  • 22-24. Scaling Up [same diagram, with a query submitted]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 25. Scaling Up [same diagram and query; adds label: Progress]
  • 26-27. Scaling Up [same diagram and query; adds labels: Progress, Supply, Demand]
  • 28-29. Scaling Up [same diagram and query; labels: Progress]
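The slides only label "Progress", "Supply" and "Demand" without giving a formula, so the following Python sketch is a guess at the kind of calculation involved, not the method used in the talk. It assumes demand is the remaining task work in slot-seconds and supply is what each node can deliver within a target completion time; every name and parameter is hypothetical.

```python
import math


def nodes_to_add(pending_tasks: int, avg_task_secs: float,
                 slots_per_node: int, current_nodes: int,
                 target_secs: float, max_nodes: int) -> int:
    """Estimate how many nodes to add so pending work fits the target time.

    Assumed model: demand = pending_tasks * avg_task_secs (slot-seconds),
    supply per node = slots_per_node * target_secs (slot-seconds).
    """
    demand = pending_tasks * avg_task_secs
    supply_per_node = slots_per_node * target_secs
    needed = math.ceil(demand / supply_per_node)
    needed = min(needed, max_nodes)  # respect the configured cluster ceiling
    return max(0, needed - current_nodes)
```

In this model the average task time would come from the job's observed progress, which is one plausible reading of why the diagram shows progress being reported back to the master.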
  • 30. Scaling Down
      1. On the hour boundary, check whether the node is required:
          – Can't remove nodes with map outputs (today)
          – Don't go below the minimum cluster size
      2. Remove the node from the Map-Reduce cluster
      3. Request HDFS decommissioning (fast!)
          – Delete affected cache files instead of re-replicating
          – One surviving replica and we are done
      4. Delete the instance
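The eligibility check in step 1 of the slide above might look like the Python sketch below. The `Node` fields, the five-minute window before the hour boundary, and the rule that a node with running tasks is still "required" are assumptions added for illustration; the slide states only the map-output and minimum-size constraints.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    minutes_into_hour: int   # minutes since this node's last hour boundary
    has_map_outputs: bool    # still holds map outputs that reducers may fetch
    running_tasks: int


def removable_nodes(nodes: List[Node], min_cluster_size: int,
                    boundary_window_min: int = 5) -> List[str]:
    """Pick nodes that can be removed without shrinking below the minimum."""
    budget = len(nodes) - min_cluster_size
    chosen: List[str] = []
    for node in nodes:
        if budget <= 0:
            break
        near_boundary = 60 - node.minutes_into_hour <= boundary_window_min
        idle = node.running_tasks == 0  # assumed meaning of "not required"
        if near_boundary and idle and not node.has_map_outputs:
            chosen.append(node.name)
            budget -= 1
    return chosen
```

Each chosen node would then go through steps 2 to 4: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.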
  • 31. Spot Instances: on average 50-60% cheaper
  • 32. Spot Instance: Challenges
      • Can lose Spot nodes at any time
          – Disastrous for HDFS
          – Hybrid Mode: use a mix of On-Demand and Spot
          – Hybrid Mode: keep one replica in On-Demand nodes
      • Spot Instances may not be available
          – Time out and use On-Demand nodes as fallback
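The "time out and fall back to On-Demand" behaviour can be sketched as follows. The `request_spot` and `request_on_demand` callables are hypothetical stand-ins for whatever provisioning calls a real system would make (they are not AWS API functions), and the polling loop is an assumed design.

```python
import time
from typing import Callable, List


def acquire_nodes(count: int,
                  request_spot: Callable[[int], List[str]],
                  request_on_demand: Callable[[int], List[str]],
                  timeout_secs: float,
                  poll_secs: float = 1.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Try Spot until the timeout, then fill the shortfall with On-Demand."""
    deadline = clock() + timeout_secs
    nodes: List[str] = []
    while len(nodes) < count and clock() < deadline:
        # Ask only for what is still missing; the callable returns node ids.
        nodes.extend(request_spot(count - len(nodes)))
        if len(nodes) < count:
            sleep(poll_secs)
    if len(nodes) < count:
        nodes.extend(request_on_demand(count - len(nodes)))
    return nodes
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock; in normal use the defaults apply.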
  • 33. Agenda
      – What is Qubole Data Service
      – Hadoop as a Service in Cloud
      – Hive as a Service in Cloud
  • 35. Cheap to Test: evaluate expressions on sample data
  • 36. Cheap to Test: run query on sample
  • 37-41. Fastest Hive SaaS
      • Works with small files!
          – Faster split computation (8x)
          – Prefetching S3 files (30%)
      • Stable JVM reuse!
          – Fix re-entrancy issues
          – 1.2-2x speedup
      • Direct writes to S3
          – HIVE-1620
      • Columnar cache
          – Use HDFS as cache for S3
          – Up to 5x faster for JSON data
      • NEW: Multi-Tenant Hive Server
  • 42. Questions? @Qubole Free Trial: www.qubole.com