Coordinating the Many Tools of Big Data
Strata 2013

Alan F. Gates
@alanfgates




                              Page 1
Big Data = Terabytes, Petabytes, …




Image Credit: Gizmodo
             © Hortonworks 2013
                                        Page 2
But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit
  2012 on calculations Twitter is doing via UDFs in Pig.
  This equation uses stochastic gradient descent to do
  machine learning with their data:



   w(t+1) = w(t) − γ(t) ∇(f(x; w(t)), y)
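
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not Twitter's Pig UDF. It assumes f is a logistic-regression model, that the gradient is taken of the log loss, and that the step size decays as γ(t) = γ0/√t. The names sgd_step and train are made up for this example.

```python
import math


def sgd_step(w, x, y, learning_rate):
    """Apply one update w <- w - gamma * gradient for a single example (x, y).

    Assumption: f(x; w) is a logistic model and the loss is log loss,
    whose gradient with respect to w is (p - y) * x.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))  # predicted probability f(x; w)
    return [wi - learning_rate * (p - y) * xi for wi, xi in zip(w, x)]


def train(examples, dims, epochs=50, gamma0=0.5):
    """Run SGD over the examples with an assumed decaying step size."""
    w = [0.0] * dims
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            t += 1
            gamma = gamma0 / math.sqrt(t)  # gamma(t) in the slide's notation
            w = sgd_step(w, x, y, gamma)
    return w


# Two toy examples: the second feature separates the classes.
model = train([([1.0, 2.0], 1), ([1.0, -2.0], 0)], dims=2)
```

Each example touches the weights once per pass, which is why the rule fits a record-at-a-time UDF style of execution.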




                                                       Page 3
And New Tools
• Apache Hadoop brings with it a large selection of tools
  and paradigms
   – Apache HBase, Apache Cassandra – Distributed, high volume
     reads and writes of individual data records
   – Apache Hive – SQL
   – Apache Pig, Cascading – Data flow programming for ETL, data
     modeling, and exploration
   – Apache Giraph – Graph processing
   – MapReduce – Batch processing
   – Storm, S4 – Stream processing
   – Plus lots of commercial offerings




                                                                   Page 4
Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g.
  SAS).



[Diagram: separate systems, one per machine: Data Warehouse, Data Mart, Statistical Analysis, Cube/MOLAP, OLTP]



                                                                            Page 5
Cloud: Many Tools One Platform
   • Users no longer want to be concerned with what platform their data is in – just
     apply the tool to it
   • SQL no longer the only or primary data access tool

[Diagram: Data Warehouse, Data Mart, Statistical Analysis, Cube/MOLAP, and OLTP shown together on one platform]




                                                                                     Page 6
Upside - Pick the Right Tool for the Job




                                       Page 7
Downside – Tools Don’t Play Well Together

• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user defined function interfaces




                                                 Page 8
Downside – Wasted Developer Time
• Wastes developer time, since each tool reimplements the
  same functionality

  Pig                  Hive
  Parser               Parser
                       Metadata
  Optimizer            Optimizer
  Physical Planner     Physical Planner
  Executor             Executor

[Diagram: the Pig and Hive stacks side by side, with the region between them labeled "Overlap"]

                                                             Page 10
Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense




                                                          Page 11
Hadoop = Distributed Data Operating System

Service                       Hadoop Component
Table Management              Hive
Access To Metadata            HCatalog
User authentication           Knox
Resource management           YARN
Notification                  HCatalog
REST/Connectors               WebHCat, WebHDFS, Hive, HBase, Oozie
Relational data processing    Tez

Legend: Exists | Pieces exist in this component | New Project

                                                                                           Page 12
HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside
  Hadoop
• Presents tools with a table paradigm that abstracts away
  storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access

[Diagram: Hive, Pig (via HCatLoader), and MapReduce (via HCatInputFormat) all connect to the Metastore; external systems reach it over REST through WebHCat]
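
The idea of a table paradigm that hides storage details can be sketched with a toy registry. This is only an illustration of the concept and does not use HCatalog's API: ToyMetastore, TableInfo, and the inline data strings are hypothetical names invented for this example, and the two formats (CSV and JSON lines) stand in for real Hadoop storage formats.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class TableInfo:
    columns: list        # shared data model: ordered column names
    storage_format: str  # "csv" or "jsonl" in this toy example
    data: str            # stand-in for a file location on the cluster


class ToyMetastore:
    """Hypothetical registry: tools ask for a table by name, never by format."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def read(self, name):
        """Return rows as dicts, whatever the underlying storage format is."""
        info = self._tables[name]
        if info.storage_format == "csv":
            rows = csv.reader(io.StringIO(info.data))
            return [dict(zip(info.columns, row)) for row in rows]
        if info.storage_format == "jsonl":
            records = (json.loads(line) for line in info.data.splitlines())
            return [{c: rec[c] for c in info.columns} for rec in records]
        raise ValueError(f"unknown storage format: {info.storage_format}")


store = ToyMetastore()
store.register("users_csv", TableInfo(["id", "state"], "csv", "1,CA\n2,NY\n"))
store.register(
    "users_json",
    TableInfo(
        ["id", "state"],
        "jsonl",
        '{"id": "1", "state": "CA"}\n{"id": "2", "state": "NY"}',
    ),
)

# Both tables come back in the same shape, so a caller needs no format logic.
csv_rows = store.read("users_csv")
json_rows = store.read("users_json")
```

Because every reader goes through the same read path, adding a storage format means changing one place rather than every tool.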

                                                             Page 17
Tez – Moving Beyond MapReduce
• Low-level data-processing execution engine
• Use it as the base for MapReduce, Hive, Pig, Cascading,
  etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of
  the queue between steps in the pipeline
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• Built on YARN



                                                       Page 18
Pig/Hive-MR versus Pig/Hive-Tez

    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state

[Diagram: under Pig/Hive - MR the query runs as Job 1, Job 2, and Job 3, with I/O synchronization barriers between them; under Pig/Hive - Tez it runs as a single job]
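
The difference between the two plans can be imitated on tiny in-memory tables. The sketch below is a toy model, not Tez or MapReduce code: the sample rows, the function names, and the dictionary that stands in for HDFS are all assumptions made for illustration. It evaluates the query above once with intermediate results materialized between steps and once as a single pipeline.

```python
from collections import defaultdict

# Hypothetical sample data for tables a, b, and c.
a = [
    {"id": 1, "itemId": 10, "state": "CA"},
    {"id": 2, "itemId": 11, "state": "CA"},
    {"id": 3, "itemId": 10, "state": "NY"},
]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]


def join(left, right, key):
    """Hash join that yields merged rows lazily, one at a time."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in left:
        for r in index[l[key]]:
            yield {**r, **l}


def group_by_state(rows):
    """Compute COUNT(*) and AVG(price) per state."""
    acc = defaultdict(lambda: [0, 0.0])
    for row in rows:
        acc[row["state"]][0] += 1
        acc[row["state"]][1] += row["price"]
    return {state: (n, total / n) for state, (n, total) in acc.items()}


def run_as_separate_jobs():
    """Each join writes its full output before the next step starts."""
    store = {}  # stand-in for intermediate files on HDFS
    store["job1"] = list(join(a, b, "id"))
    store["job2"] = list(join(store["job1"], c, "itemId"))
    return group_by_state(store["job2"]), len(store)


def run_as_single_pipeline():
    """Rows flow through both joins and the aggregation with no writes."""
    return group_by_state(join(join(a, b, "id"), c, "itemId")), 0


staged_result, staged_writes = run_as_separate_jobs()
piped_result, piped_writes = run_as_single_pipeline()
```

Both functions return the same aggregates; only the number of materialized intermediate results differs.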
                                                                                                          Page 20
FastQuery: Beyond Batch with YARN




Tez Generalizes Map-Reduce
  Simplified execution plans process data more efficiently

Always-On Tez Service
  Low latency processing for all Hadoop data processing




                                                                  Page 21
Knox – Single Sign On




                        Page 22
Today’s Access Options
• Direct Access
   – Access Services via REST (WebHDFS, WebHCat)
   – Need knowledge of and access to whole cluster
   – Security handled by each component in the cluster
   – Kerberos details exposed to users


          [Diagram: User → {REST} → Hadoop Cluster]


• Gateway / Portal Nodes
   – Dedicated nodes behind firewall
   – Users SSH to the node to access Hadoop services

          [Diagram: User → SSH → GW Node → Hadoop Cluster]
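
As a concrete picture of the direct-access option, the sketch below builds a WebHDFS request URL. The /webhdfs/v1 path prefix, the op parameter, and the user.name parameter come from the WebHDFS REST API; the host name, port, and HDFS path are placeholder assumptions, and webhdfs_url is a helper invented for this example. No request is sent.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, user):
    """Build a WebHDFS URL for one operation on one HDFS path.

    user.name is only honored when the cluster runs without Kerberos;
    a secured cluster would need SPNEGO or delegation tokens instead.
    """
    query = urlencode({"op": op, "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Placeholder host and port: the client must know where the NameNode lives.
url = webhdfs_url("namenode.example.com", 50070, "/user/alice", "LISTSTATUS", "alice")
```

The URL embeds a specific cluster host, which is the "need knowledge of and access to whole cluster" drawback listed above.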


                                                                 Page 23
Knox Design Goals
• Operators can firewall the cluster without giving end users
  access to a “gateway node”
• Users see one cluster end-point that aggregates
  capabilities for data access, metadata and job control
• Provide perimeter security to make Hadoop security setup
  easier
• Enable integration with enterprise and cloud identity
  management environments




                                                        Page 24
Perimeter Verification & Authentication
Verification
- Verify identity token
- SAML, propagation of identity
Authentication
- Establish identity at Gateway to authenticate with LDAP + AD

[Diagram: Client → {REST} → Knox Gateway → WebHDFS / WebHCat → Hadoop Cluster (NN, DN, JT, Hive, HCat). An ID provider (KDC, AD, LDAP) is labeled Verification; a user store (KDC, AD, LDAP) is labeled Authentication.]
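
The perimeter pattern can be sketched as a single entry point that authenticates a caller and then maps a public path to an internal service. This toy model is not Knox code or its API: ToyGateway, the in-memory user store, the hashed-password check, and the internal host names are all hypothetical stand-ins (a real deployment would consult LDAP, AD, or a KDC as the slide describes).

```python
import hashlib
import hmac


class ToyGateway:
    """Hypothetical single end-point in front of cluster services."""

    def __init__(self, user_store, routes):
        # user_store: username -> SHA-256 hex digest of the password.
        # This is a stand-in for an LDAP/AD lookup, not a safe way to store passwords.
        self._user_store = user_store
        # routes: public path prefix -> internal base URL hidden from clients.
        self._routes = routes

    def _authenticate(self, user, password):
        expected = self._user_store.get(user)
        supplied = hashlib.sha256(password.encode()).hexdigest()
        return expected is not None and hmac.compare_digest(expected, supplied)

    def route(self, user, password, path):
        """Return the internal URL for a request, or raise if it is rejected."""
        if not self._authenticate(user, password):
            raise PermissionError("authentication failed at the gateway")
        for prefix, backend in self._routes.items():
            if path.startswith(prefix):
                return backend + path
        raise LookupError(f"no service registered for {path}")


gateway = ToyGateway(
    user_store={"alice": hashlib.sha256(b"secret").hexdigest()},
    routes={
        "/webhdfs/": "http://nn.internal:50070",      # placeholder host
        "/templeton/": "http://hcat.internal:50111",  # placeholder host
    },
)
target = gateway.route("alice", "secret", "/webhdfs/v1/tmp")
```

Clients only ever see the gateway address, so the operator can change or firewall the internal hosts without touching client configuration.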
                                                                                      Page 25
Thank You




   © Hortonworks 2012
                        Page 26

More Related Content

What's hot

Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
DataWorks Summit
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
alanfgates
 

What's hot (20)

Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Dead
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
HiveACIDPublic
HiveACIDPublicHiveACIDPublic
HiveACIDPublic
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft Library
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetup
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 

Viewers also liked

Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)
Linnea Hanson
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
alanfgates
 

Viewers also liked (14)

Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)
 
Simply the best college best work
Simply the best   college best workSimply the best   college best work
Simply the best college best work
 
Bowling event
Bowling eventBowling event
Bowling event
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Ninjutsu
NinjutsuNinjutsu
Ninjutsu
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
Rpp reproduksi - copy (1)
Rpp reproduksi - copy (1)Rpp reproduksi - copy (1)
Rpp reproduksi - copy (1)
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache training
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Brownian motion
Brownian motionBrownian motion
Brownian motion
 
Types dbms
Types dbmsTypes dbms
Types dbms
 

Similar to Strata feb2013

Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
Hortonworks
 
Introduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsIntroduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI Tools
DataWorks Summit
 
Introduction to Hortonworks Data Platform
Introduction to Hortonworks Data PlatformIntroduction to Hortonworks Data Platform
Introduction to Hortonworks Data Platform
Hortonworks
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
yaevents
 

Similar to Strata feb2013 (20)

Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and Beyond
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
 
Introduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsIntroduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI Tools
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014
 
Introduction to Hortonworks Data Platform
Introduction to Hortonworks Data PlatformIntroduction to Hortonworks Data Platform
Introduction to Hortonworks Data Platform
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptxM. Florence Dayana - Hadoop Foundation for Analytics.pptx
M. Florence Dayana - Hadoop Foundation for Analytics.pptx
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop Search
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Strata feb2013

  • 1. Coordinating the Many Tools of Big Data Strata 2013 Alan F. Gates @alanfgates Page 1
  • 2. Big Data = Terabytes, Petabytes, … Image Credit: Gizmodo © Hortonworks 2013 Page 2
  • 3. But It Is Also Complex Algorithms • An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs in Pig. This equation uses stochastic gradient descent to do machine learning with their data: w(t+1) =w(t) −γ(t)∇(f(x;w(t)),y) © Hortonworks 2013 Page 3
  • 4. And New Tools • Apache Hadoop brings with it a large selection of tools and paradigms – Apache HBase, Apache Cassandra – Distributed, high volume reads and rights of individual data records – Apache Hive - SQL – Apache Pig, Cascading – Data flow programming for ETL, data modeling, and exploration – Apache Giraph – Graph processing – MapReduce – Batch processing – Storm, S4 – Stream processing – Plus lots of commercial offerings © Hortonworks 2013 Page 4
  • 5. Pre-Cloud: One Tool per Machine • Databases presented SQL or SQL-like paradigms for operating on data • Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS). Data Mart Statistical Analysis Data Warehouse Cube/M OLTP OLAP © Hortonworks 2013 Page 5
  • 6. Cloud: Many Tools One Platform • Users no longer want to be concerned with what platform their data is in – just apply the tool to it • SQL no longer the only or primary data access tool Statistical Data Analysis Mart Data Warehouse Cube/M OLT OLAP P © Hortonworks 2013 Page 6
  • 7. Upside - Pick the Right Tool for the Job © Hortonworks 2013 Page 7
  • 8. Downside – Tools Don’t Play Well Together • Hard for users to share data between tools – Different storage formats – Different data models – Different user defined function interfaces © Hortonworks 2013 Page 8
  • 9. Downside – Wasted Developer Time • Wastes developer time since each tool supplies redundant functionality [Diagram: Hive and Pig stacks side by side, each with Parser, Optimizer, Physical Planner, and Executor, plus a Metadata box]
  • 10. Downside – Wasted Developer Time • Wastes developer time since each tool supplies redundant functionality [Diagram: Hive and Pig stacks side by side, each with Parser, Optimizer, Physical Planner, and Executor, plus a Metadata box; the duplicated layers are marked "Overlap"]
  • 11. Conclusion: We Need Services • We need to find a way to share services where we can • Gives users the same experience across tools • Allows developers to share effort when it makes sense
  • 12. Hadoop = Distributed Data Operating System
        Service                      Hadoop Component
        Table Management             Hive
        Access To Metadata           HCatalog
        User authentication          Knox
        Resource management          YARN
        Notification                 HCatalog
        REST/Connectors              webhcat, webhdfs, Hive, HBase, Oozie
        Relational data processing   Tez
        Legend: Exists; Pieces exist in this component; New Project
  • 14. HCatalog – Table Management • Opens up Hive’s tables to other tools inside and outside Hadoop • Presents tools with a table paradigm that abstracts away storage details • Provides a shared data model • Provides a shared code path for data and metadata access
  • 15. HCatalog – Table Management • Opens up Hive’s tables to other tools inside and outside Hadoop • Presents tools with a table paradigm that abstracts away storage details • Provides a shared data model • Provides a shared code path for data and metadata access [Diagram: Hive, Metastore]
  • 16. HCatalog – Table Management • Opens up Hive’s tables to other tools inside and outside Hadoop • Presents tools with a table paradigm that abstracts away storage details • Provides a shared data model • Provides a shared code path for data and metadata access [Diagram: Hive, Pig (HCatLoader), and MapReduce (HCatInputFormat) around the Metastore]
  • 17. HCatalog – Table Management • Opens up Hive’s tables to other tools inside and outside Hadoop • Presents tools with a table paradigm that abstracts away storage details • Provides a shared data model • Provides a shared code path for data and metadata access [Diagram: Hive, Pig (HCatLoader), MapReduce (HCatInputFormat), and External Systems (REST, WebHCat) around the Metastore]
  • 18. Tez – Moving Beyond MapReduce • Low level data-processing execution engine • Use it as the base for MapReduce, Hive, Pig, Cascading etc. • Enables pipelining of jobs • Removes task and job launch times • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline • Does not write intermediate output to HDFS – Much lighter disk and network usage • Built on YARN
  • 19. Pig/Hive-MR versus Pig/Hive-Tez SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state [Diagram: Pig/Hive – MR runs Job 1, Job 2, and Job 3, separated by I/O synchronization barriers]
  • 20. Pig/Hive-MR versus Pig/Hive-Tez SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state [Diagram: Pig/Hive – MR runs Job 1, Job 2, and Job 3, separated by I/O synchronization barriers; Pig/Hive – Tez runs a single job]
  • 21. FastQuery: Beyond Batch with YARN • Tez Generalizes Map-Reduce – Simplified execution plans process data more efficiently • Always-On Tez Service – Low latency processing for all Hadoop data processing
  • 22. Knox – Single Sign On
  • 23. Today’s Access Options • Direct Access – Access services via REST (WebHDFS, WebHCat) – Need knowledge of and access to whole cluster – Security handled by each component in the cluster – Kerberos details exposed to users [Diagram: User → {REST} → Hadoop Cluster] • Gateway / Portal Nodes – Dedicated nodes behind firewall – Users SSH to node to access Hadoop services [Diagram: User → SSH → GW Node → Hadoop Cluster]
  • 24. Knox Design Goals • Operators can firewall cluster without end user access to “gateway node” • Users see one cluster end-point that aggregates capabilities for data access, metadata and job control • Provide perimeter security to make Hadoop security setup easier • Enable integration with enterprise and cloud identity management environments
  • 25. Perimeter Verification & Authentication • Verification – Verify identity token – SAML, propagation of identity • Authentication – Establish identity at Gateway to authenticate with LDAP + AD [Diagram: Client → {REST} → Knox Gateway → Hadoop Cluster (NN, JT, DNs, WebHDFS, WebHCat, Hive, HCat); Authentication against a User Store (KDC, AD, LDAP); Verification against an ID Provider (KDC, AD, LDAP)]
  • 26. Thank You
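
The update rule on slide 3 can be sketched in a few lines of plain Python. This is only an illustration, not the Pig UDF code from the talk: it assumes a logistic loss and a linear model, and the names `sgd_step`, `train`, `predict`, the decaying step size, and the toy data are all made up for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(w, x, y, gamma):
    """One update: w <- w - gamma * gradient of the logistic loss at (x, y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # For the logistic loss, the gradient with respect to w is (p - y) * x.
    return [wi - gamma * (p - y) * xi for wi, xi in zip(w, x)]


def train(data, dim, epochs=50, gamma0=0.5, seed=0):
    """Run SGD over shuffled data with a step size gamma(t) that decays with t."""
    rng = random.Random(seed)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        rows = data[:]
        rng.shuffle(rows)
        for x, y in rows:
            gamma = gamma0 / (1.0 + 0.01 * t)  # assumed schedule for gamma(t)
            w = sgd_step(w, x, y, gamma)
            t += 1
    return w


def predict(w, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


# Toy data: the first feature is a constant bias term, and the label is 1
# when the second feature is positive.
DATA = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
W = train(DATA, dim=2)
```

Each call to `sgd_step` looks at a single record, which is what makes this style of learning a natural fit for record-at-a-time processing over large data sets.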
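
Slide 17 shows external systems reaching table metadata over REST through WebHCat. A minimal sketch of what such a request URL might look like follows. It assumes WebHCat's default port 50111 and its `templeton/v1` DDL resource path; the host, database, table, and user names are placeholders, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the URL that describes one table through WebHCat's DDL resource.

    The port and path layout are assumptions based on WebHCat defaults;
    check them against the cluster's own configuration.
    """
    path = f"/templeton/v1/ddl/database/{quote(database)}/table/{quote(table)}"
    return f"http://{host}:{port}{path}?{urlencode({'user.name': user})}"


# Placeholder names, for illustration only.
URL = webhcat_table_url("webhcat.example.com", "default", "clicks", "alan")
```

An external tool would issue an HTTP GET against this URL and read the table's columns from the JSON response, without needing to know the storage format.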
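
The contrast on slides 19 and 20 can be illustrated with a toy model of the query in plain Python. This is a sketch under assumptions: the split into two join jobs and one aggregation job is a guess at how the diagram maps to the query, the sample rows are invented, and a Python list stands in for intermediate output written to HDFS.

```python
from collections import defaultdict

# Invented rows for tables a, b, and c from the slide's query.
A = [
    {"id": 1, "itemId": 10, "state": "CA"},
    {"id": 2, "itemId": 11, "state": "CA"},
    {"id": 3, "itemId": 10, "state": "NY"},
]
B = [{"id": 1}, {"id": 2}, {"id": 3}]
C = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 8.0}]


def join(left, right, key):
    """Inner hash join of two lists of dicts on a single column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate(rows):
    """GROUP BY state, returning (count, average price) per state."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["state"]].append(row["price"])
    return {s: (len(p), sum(p) / len(p)) for s, p in groups.items()}


def run_as_three_jobs():
    """MapReduce style: each job's full output is stored before the next starts."""
    stored = []                       # stands in for intermediate files on HDFS
    ab = join(A, B, "id")             # Job 1
    stored.append(list(ab))
    abc = join(ab, C, "itemId")       # Job 2
    stored.append(list(abc))
    return aggregate(abc), len(stored)  # Job 3


def run_as_single_job():
    """Single-job style: the same stages chained with nothing stored in between."""
    return aggregate(join(join(A, B, "id"), C, "itemId")), 0
```

Both functions return the same answer; the second element of each result counts how many intermediate outputs were stored, which is the cost the synchronization barriers on the slide represent.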
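
Slides 23 and 24 contrast direct REST access with a single gateway end-point. The sketch below builds both kinds of URL for a WebHDFS directory listing. The direct form assumes the NameNode's HTTP port 50070 and the `/webhdfs/v1` prefix. The gateway form, including port 8443 and the `/gateway/{cluster}` prefix, is an assumed layout used only for illustration, since the slides do not specify one. Host and cluster names are placeholders.

```python
from urllib.parse import quote, urlencode


def direct_webhdfs_url(namenode_host, path, op, port=50070):
    """Direct access: the client must know and reach the NameNode itself."""
    return f"http://{namenode_host}:{port}/webhdfs/v1{quote(path)}?{urlencode({'op': op})}"


def gateway_webhdfs_url(gateway_host, cluster, path, op, port=8443):
    """Gateway access: the client only knows one end-point in front of the cluster.

    The '/gateway/<cluster>' prefix and the port are assumptions for this sketch.
    """
    query = urlencode({"op": op})
    return f"https://{gateway_host}:{port}/gateway/{cluster}/webhdfs/v1{quote(path)}?{query}"


DIRECT = direct_webhdfs_url("nn1.example.com", "/user/alan", "LISTSTATUS")
VIA_GATEWAY = gateway_webhdfs_url("gw.example.com", "prod", "/user/alan", "LISTSTATUS")
```

With the gateway form, cluster host names stay behind the firewall, and authentication can be handled once at the perimeter rather than by each component.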

Editor's Notes

  1. This is how we tend to think of Big data
  2. Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up, and it is hard to apply different types of processing without moving data around.
  3. Hive is the only SQL-based app in this pile. Other apps are still in the picture; it’s not like Hadoop is displacing everything.