SlideShare uma empresa Scribd logo
1 de 24
Baixar para ler offline
Hadoop and Hive Development at
   Facebook

Dhruba Borthakur Zheng Shao
{dhruba, zshao}@facebook.com
Presented at Hadoop World, New York
October 2, 2009
Hadoop @ Facebook
Who generates this data?

  Lots of data is generated on Facebook
    –  300+ million active users
    –  30 million users update their statuses at least once each
       day
    –  More than 1 billion photos uploaded each month
    –  More than 10 million videos uploaded each month
    –  More than 1 billion pieces of content (web links, news
       stories, blog posts, notes, photos, etc.) shared each
       week
Data Usage

  Statistics per day:
    –  4 TB of compressed new data added per day
    –  135TB of compressed data scanned per day
    –  7500+ Hive jobs on production cluster per day
    –  80K compute hours per day
  Barrier to entry is significantly reduced:
    –  New engineers go though a Hive training session
    –  ~200 people/month run jobs on Hadoop/Hive
    –  Analysts (non-engineers) use Hadoop through Hive
Where is this data stored?

  Hadoop/Hive Warehouse
  –  4800 cores, 5.5 PetaBytes
  –  12 TB per node
  –  Two level network topology
       1 Gbit/sec from node to rack switch
       4 Gbit/sec to top level rack switch
Data Flow into Hadoop Cloud
                                                     Network

                                                     Storage

                                                     and

                                                     Servers

    Web
Servers

                        Scribe
MidTier





     Oracle
RAC
   Hadoop
Hive
Warehouse
   MySQL

Hadoop Scribe: Avoid Costly Filers
                                                          Scribe
Writers


                                                                      RealBme

                                                                      Hadoop

                                                                      Cluster

    Web
Servers
               Scribe
MidTier





     Oracle
RAC
         Hadoop
Hive
Warehouse
                    MySQL

    http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
HDFS Raid

  Start the same: triplicate
   every data block
                                                 A               B                C
  Background encoding
   –  Combine third replica of                   A               B                C
      blocks from a single file to
      create parity block                        A               B                C
   –  Remove third replica
   –  Apache JIRA HDFS-503                                     A+B+C
  DiskReduce from CMU
   –  Garth Gibson research                     A file with three blocks A, B and C

   http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
Archival: Move old data to cheap storage

Hadoop
Warehouse


                                                 NFS




               Hadoop
Archive
Node
                     Cheap
NAS




                                 Hadoop
Archival
Cluster



                    hEp://issues.apache.org/jira/browse/HDFS‐220

 Hive
Query

Dynamic-size MapReduce Clusters

  Why multiple compute clouds in Facebook?
    –  Users unaware of resources needed by job
    –  Absence of flexible Job Isolation techniques
    –  Provide adequate SLAs for jobs
  Dynamically move nodes between clusters
    –  Based on load and configured policies
    –  Apache Jira MAPREDUCE-1044
Resource Aware Scheduling (Fair Share
Scheduler)
  We use the Hadoop Fair Share Scheduler
    –  Scheduler unaware of memory needed by job
  Memory and CPU aware scheduling
    –  RealTime gathering of CPU and memory usage
    –  Scheduler analyzes memory consumption in realtime
    –  Scheduler fair-shares memory usage among jobs
    –  Slot-less scheduling of tasks (in future)
    –  Apache Jira MAPREDUCE-961
Hive – Data Warehouse

  Efficient SQL to Map-Reduce Compiler

  Mar 2008: Started at Facebook
  May 2009: Release 0.3.0 available
  Now: Preparing for release 0.4.0


  Countable for 95%+ of Hadoop jobs @ Facebook
  Used by ~200 engineers and business analysts at Facebook
   every month
Hive Architecture

  Web UI + Hive CLI + JDBC/                Map Reduce       HDFS
            ODBC                       User-defined
     Browse, Query, DDL              Map-reduce Scripts

  MetaStore               Hive QL              UDF/UDAF
                                                 substr
                     Parser                       sum
  Thrift API                                    average
                    Planner    Execution
                                                          FileFormats
                                                 SerDe
                   Optimizer                                TextFile
                                                  CSV     SequenceFile
                                                 Thrift      RCFile
                                                 Regex
Hive DDL

  DDL
   –  Complex columns
   –  Partitions
   –  Buckets
  Example
   –  CREATE TABLE sales (
        id INT,
        items ARRAY<STRUCT<id:INT, name:STRING>>,
        extra MAP<STRING, STRING>
      ) PARTITIONED BY (ds STRING)
      CLUSTERED BY (id) INTO 32 BUCKETS;
Hive Query Language

  SQL
   –    Where
   –    Group By
   –    Equi-Join
   –    Sub query in from clause
  Example
   –  SELECT r.*, s.*
      FROM r JOIN (
        SELECT key, count(1) as count
        FROM s
        GROUP BY key) s
      ON r.key = s.key
      WHERE s.count > 100;
Group By

  4 different plans based on:
  –  Does data have skew?
  –  partial aggregation
  Map-side hash aggregation
  –  In-memory hash table in mapper to do partial
     aggregations
  2-map-reduce aggregation
  –  For distinct queries with skew and large cardinality
Join

  Normal map-reduce Join
  –  Mapper sends all rows with the same key to a single
     reducer
  –  Reducer does the join
  Map-side Join
  –  Mapper loads the whole small table and a portion of big
     table
  –  Mapper does the join
  –  Much faster than map-reduce join
Sampling

  Efficient sampling
   –  Table can be bucketed
   –  Each bucket is a file
   –  Sampling can choose some buckets
  Example
   –  SELECT product_id, sum(price)
      FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
      GROUP BY product_id
Multi-table Group-By/Insert

FROM users
INSERT INTO TABLE pv_gender_sum
   SELECT gender, count(DISTINCT userid)
   GROUP BY gender
INSERT INTO
   DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
   SELECT age, count(DISTINCT userid)
   GROUP BY age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
   SELECT country, gender, count(DISTINCT userid)
   GROUP BY country, gender;
File Formats

  TextFile:
   –  Easy for other applications to write/read
   –  Gzip text files are not splittable
  SequenceFile:
   –  Only hadoop can read it
   –  Support splittable compression
  RCFile: Block-based columnar storage
   –    Use SequenceFile block format
   –    Columnar storage inside a block
   –    25% smaller compressed size
   –    On-par or better query performance depending on the query
SerDe

  Serialization/Deserialization
  Row Format
   –    CSV (LazySimpleSerDe)
   –    Thrift (ThriftSerDe)
   –    Regex (RegexSerDe)
   –    Hive Binary Format (LazyBinarySerDe)
  LazySimpleSerDe and LazyBinarySerDe
   –  Deserialize the field when needed
   –  Reuse objects across different rows
   –  Text and Binary format
UDF/UDAF

  Features:
   –    Use either Java or Hadoop Objects (int, Integer, IntWritable)
   –    Overloading
   –    Variable-length arguments
   –    Partial aggregation for UDAF
  Example UDF:
   –  public class UDFExampleAdd extends UDF {
        public int evaluate(int a, int b) {
          return a + b;
        }
      }
Hive – Performance

  Date      SVN Revision        Major Changes            Query A   Query B   Query C
2/22/2009      746906      Before Lazy Deserialization    83 sec    98 sec   183 sec
2/23/2009      747293         Lazy Deserialization        40 sec    66 sec   185 sec
3/6/2009       751166        Map-side Aggregation         22 sec    67 sec   182 sec
4/29/2009      770074             Object Reuse            21 sec    49 sec   130 sec
6/3/2009       781633           Map-side Join *           21 sec    48 sec   132 sec
8/5/2009       801497        Lazy Binary Format *         21 sec    48 sec   132 sec

        QueryA: SELECT count(1) FROM t;
        QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
        QueryC: SELECT * FROM t;
        map-side time only (incl. GzipCodec for comp/decompression)
        * These two features need to be tested with other queries.
Hive – Future Works

    Indexes
    Create table as select
    Views / variables
    Explode operator
    In/Exists sub queries
    Leverage sort/bucket information in Join

Mais conteúdo relacionado

Mais procurados

IBM Platform Computing Elastic Storage
IBM Platform Computing  Elastic StorageIBM Platform Computing  Elastic Storage
IBM Platform Computing Elastic StoragePatrick Bouillaud
 
Red Hat Storage Day Boston - Supermicro Super Storage
Red Hat Storage Day Boston - Supermicro Super StorageRed Hat Storage Day Boston - Supermicro Super Storage
Red Hat Storage Day Boston - Supermicro Super StorageRed_Hat_Storage
 
Red Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed_Hat_Storage
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
 
Best practices for using flash in hyperscale software storage architectures
Best practices for using flash in hyperscale software storage architecturesBest practices for using flash in hyperscale software storage architectures
Best practices for using flash in hyperscale software storage architecturesEric Carter
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed_Hat_Storage
 
How MariaDB is approaching DBaaS
How MariaDB is approaching DBaaSHow MariaDB is approaching DBaaS
How MariaDB is approaching DBaaSMariaDB plc
 
Cisco UCS Integrated Infrastructure for Big Data with Cassandra
Cisco UCS Integrated Infrastructure for Big Data with CassandraCisco UCS Integrated Infrastructure for Big Data with Cassandra
Cisco UCS Integrated Infrastructure for Big Data with CassandraDataStax Academy
 
Red Hat Storage Day Seattle: Why Software-Defined Storage Matters
Red Hat Storage Day Seattle: Why Software-Defined Storage MattersRed Hat Storage Day Seattle: Why Software-Defined Storage Matters
Red Hat Storage Day Seattle: Why Software-Defined Storage MattersRed_Hat_Storage
 
Ac922 watson 180208 v1
Ac922 watson 180208 v1Ac922 watson 180208 v1
Ac922 watson 180208 v1IBM Sverige
 
Software-Defined Storage (SDS)
Software-Defined Storage (SDS)Software-Defined Storage (SDS)
Software-Defined Storage (SDS)HTS Hosting
 
Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...
Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...
Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...Red_Hat_Storage
 
IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015Doug O'Flaherty
 
Red Hat Storage Day Dallas - Why Software-defined Storage Matters
Red Hat Storage Day Dallas - Why Software-defined Storage MattersRed Hat Storage Day Dallas - Why Software-defined Storage Matters
Red Hat Storage Day Dallas - Why Software-defined Storage MattersRed_Hat_Storage
 
Red Hat Storage Day Dallas - Gluster Storage in Containerized Application
Red Hat Storage Day Dallas - Gluster Storage in Containerized Application Red Hat Storage Day Dallas - Gluster Storage in Containerized Application
Red Hat Storage Day Dallas - Gluster Storage in Containerized Application Red_Hat_Storage
 
Ibm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashIbm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashAshutosh Mate
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 
Severalnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IXSeveralnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IXSeveralnines
 

Mais procurados (20)

IBM Platform Computing Elastic Storage
IBM Platform Computing  Elastic StorageIBM Platform Computing  Elastic Storage
IBM Platform Computing Elastic Storage
 
Red Hat Storage Day Boston - Supermicro Super Storage
Red Hat Storage Day Boston - Supermicro Super StorageRed Hat Storage Day Boston - Supermicro Super Storage
Red Hat Storage Day Boston - Supermicro Super Storage
 
Red Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and FutureRed Hat Ceph Storage: Past, Present and Future
Red Hat Ceph Storage: Past, Present and Future
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
 
Best practices for using flash in hyperscale software storage architectures
Best practices for using flash in hyperscale software storage architecturesBest practices for using flash in hyperscale software storage architectures
Best practices for using flash in hyperscale software storage architectures
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
Red Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference ArchitecturesRed Hat Storage Day New York - New Reference Architectures
Red Hat Storage Day New York - New Reference Architectures
 
How MariaDB is approaching DBaaS
How MariaDB is approaching DBaaSHow MariaDB is approaching DBaaS
How MariaDB is approaching DBaaS
 
Cisco UCS Integrated Infrastructure for Big Data with Cassandra
Cisco UCS Integrated Infrastructure for Big Data with CassandraCisco UCS Integrated Infrastructure for Big Data with Cassandra
Cisco UCS Integrated Infrastructure for Big Data with Cassandra
 
Red Hat Storage Day Seattle: Why Software-Defined Storage Matters
Red Hat Storage Day Seattle: Why Software-Defined Storage MattersRed Hat Storage Day Seattle: Why Software-Defined Storage Matters
Red Hat Storage Day Seattle: Why Software-Defined Storage Matters
 
Ac922 watson 180208 v1
Ac922 watson 180208 v1Ac922 watson 180208 v1
Ac922 watson 180208 v1
 
Software-Defined Storage (SDS)
Software-Defined Storage (SDS)Software-Defined Storage (SDS)
Software-Defined Storage (SDS)
 
Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...
Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...
Red Hat Storage Day New York - Red Hat Gluster Storage: Historical Tick Data ...
 
IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015
 
Red Hat Storage Day Dallas - Why Software-defined Storage Matters
Red Hat Storage Day Dallas - Why Software-defined Storage MattersRed Hat Storage Day Dallas - Why Software-defined Storage Matters
Red Hat Storage Day Dallas - Why Software-defined Storage Matters
 
Red Hat Storage Day Dallas - Gluster Storage in Containerized Application
Red Hat Storage Day Dallas - Gluster Storage in Containerized Application Red Hat Storage Day Dallas - Gluster Storage in Containerized Application
Red Hat Storage Day Dallas - Gluster Storage in Containerized Application
 
Ibm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ashIbm spectrum scale_backup_n_archive_v03_ash
Ibm spectrum scale_backup_n_archive_v03_ash
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
Severalnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IXSeveralnines Training: MySQL® Cluster - Part IX
Severalnines Training: MySQL® Cluster - Part IX
 

Semelhante a Hadoop and Hive Development at Facebook

Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And HdfsCloudera, Inc.
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 

Semelhante a Hadoop and Hive Development at Facebook (20)

Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
מיכאל
מיכאלמיכאל
מיכאל
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 

Mais de elliando dias

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slideselliando dias
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScriptelliando dias
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de containerelliando dias
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agilityelliando dias
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Librarieselliando dias
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!elliando dias
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Webelliando dias
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduinoelliando dias
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorceryelliando dias
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Designelliando dias
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makeselliando dias
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.elliando dias
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Studyelliando dias
 
From Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn IntroductionFrom Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn Introductionelliando dias
 

Mais de elliando dias (20)

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slides
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
 
Ragel talk
Ragel talkRagel talk
Ragel talk
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
 
Minicurso arduino
Minicurso arduinoMinicurso arduino
Minicurso arduino
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
Rango
RangoRango
Rango
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makes
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
 
From Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn IntroductionFrom Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn Introduction
 

Último

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Hadoop and Hive Development at Facebook

  • 1. Hadoop and Hive Development at Facebook Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009
  • 3. Who generates this data?   Lots of data is generated on Facebook –  300+ million active users –  30 million users update their statuses at least once each day –  More than 1 billion photos uploaded each month –  More than 10 million videos uploaded each month –  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  • 4. Data Usage   Statistics per day: –  4 TB of compressed new data added per day –  135TB of compressed data scanned per day –  7500+ Hive jobs on production cluster per day –  80K compute hours per day   Barrier to entry is significantly reduced: –  New engineers go though a Hive training session –  ~200 people/month run jobs on Hadoop/Hive –  Analysts (non-engineers) use Hadoop through Hive
  • 5. Where is this data stored?   Hadoop/Hive Warehouse –  4800 cores, 5.5 PetaBytes –  12 TB per node –  Two level network topology   1 Gbit/sec from node to rack switch   4 Gbit/sec to top level rack switch
  • 6. Data Flow into Hadoop Cloud Network
 Storage
 and
 Servers
 Web
Servers
 Scribe
MidTier
 Oracle
RAC
 Hadoop
Hive
Warehouse
 MySQL

  • 7. Hadoop Scribe: Avoid Costly Filers Scribe
Writers
 RealBme
 Hadoop
 Cluster
 Web
Servers
 Scribe
MidTier
 Oracle
RAC
 Hadoop
Hive
Warehouse
 MySQL
 http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
  • 8. HDFS Raid   Start the same: triplicate every data block A B C   Background encoding –  Combine third replica of A B C blocks from a single file to create parity block A B C –  Remove third replica –  Apache JIRA HDFS-503 A+B+C   DiskReduce from CMU –  Garth Gibson research A file with three blocks A, B and C http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
  • 9. Archival: Move old data to cheap storage Hadoop
Warehouse
 NFS
 Hadoop
Archive
Node
 Cheap
NAS
 Hadoop
Archival
Cluster
 hEp://issues.apache.org/jira/browse/HDFS‐220
 Hive
Query

  • 10. Dynamic-size MapReduce Clusters   Why multiple compute clouds in Facebook? –  Users unaware of resources needed by job –  Absence of flexible Job Isolation techniques –  Provide adequate SLAs for jobs   Dynamically move nodes between clusters –  Based on load and configured policies –  Apache Jira MAPREDUCE-1044
  • 11. Resource Aware Scheduling (Fair Share Scheduler)   We use the Hadoop Fair Share Scheduler –  Scheduler unaware of memory needed by job   Memory and CPU aware scheduling –  RealTime gathering of CPU and memory usage –  Scheduler analyzes memory consumption in realtime –  Scheduler fair-shares memory usage among jobs –  Slot-less scheduling of tasks (in future) –  Apache Jira MAPREDUCE-961
  • 12. Hive – Data Warehouse   Efficient SQL to Map-Reduce Compiler   Mar 2008: Started at Facebook   May 2009: Release 0.3.0 available   Now: Preparing for release 0.4.0   Countable for 95%+ of Hadoop jobs @ Facebook   Used by ~200 engineers and business analysts at Facebook every month
  • 13. Hive Architecture Web UI + Hive CLI + JDBC/ Map Reduce HDFS ODBC User-defined Browse, Query, DDL Map-reduce Scripts MetaStore Hive QL UDF/UDAF substr Parser sum Thrift API average Planner Execution FileFormats SerDe Optimizer TextFile CSV SequenceFile Thrift RCFile Regex
  • 14. Hive DDL   DDL –  Complex columns –  Partitions –  Buckets   Example –  CREATE TABLE sales ( id INT, items ARRAY<STRUCT<id:INT, name:STRING>>, extra MAP<STRING, STRING> ) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
  • 15. Hive Query Language   SQL –  Where –  Group By –  Equi-Join –  Sub query in from clause   Example –  SELECT r.*, s.* FROM r JOIN ( SELECT key, count(1) as count FROM s GROUP BY key) s ON r.key = s.key WHERE s.count > 100;
  • 16. Group By   4 different plans based on: –  Does data have skew? –  partial aggregation   Map-side hash aggregation –  In-memory hash table in mapper to do partial aggregations   2-map-reduce aggregation –  For distinct queries with skew and large cardinality
  • 17. Join   Normal map-reduce Join –  Mapper sends all rows with the same key to a single reducer –  Reducer does the join   Map-side Join –  Mapper loads the whole small table and a portion of big table –  Mapper does the join –  Much faster than map-reduce join
  • 18. Sampling   Efficient sampling –  Table can be bucketed –  Each bucket is a file –  Sampling can choose some buckets   Example –  SELECT product_id, sum(price) FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32) GROUP BY product_id
  • 19. Multi-table Group-By/Insert FROM users INSERT INTO TABLE pv_gender_sum SELECT gender, count(DISTINCT userid) GROUP BY gender INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir' SELECT age, count(DISTINCT userid) GROUP BY age INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir' SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;
  • 20. File Formats   TextFile: –  Easy for other applications to write/read –  Gzip text files are not splittable   SequenceFile: –  Only hadoop can read it –  Support splittable compression   RCFile: Block-based columnar storage –  Use SequenceFile block format –  Columnar storage inside a block –  25% smaller compressed size –  On-par or better query performance depending on the query
  • 21. SerDe   Serialization/Deserialization   Row Format –  CSV (LazySimpleSerDe) –  Thrift (ThriftSerDe) –  Regex (RegexSerDe) –  Hive Binary Format (LazyBinarySerDe)   LazySimpleSerDe and LazyBinarySerDe –  Deserialize the field when needed –  Reuse objects across different rows –  Text and Binary format
  • 22. UDF/UDAF   Features: –  Use either Java or Hadoop Objects (int, Integer, IntWritable) –  Overloading –  Variable-length arguments –  Partial aggregation for UDAF   Example UDF: –  public class UDFExampleAdd extends UDF { public int evaluate(int a, int b) { return a + b; } }
  • 23. Hive – Performance Date SVN Revision Major Changes Query A Query B Query C 2/22/2009 746906 Before Lazy Deserialization 83 sec 98 sec 183 sec 2/23/2009 747293 Lazy Deserialization 40 sec 66 sec 185 sec 3/6/2009 751166 Map-side Aggregation 22 sec 67 sec 182 sec 4/29/2009 770074 Object Reuse 21 sec 49 sec 130 sec 6/3/2009 781633 Map-side Join * 21 sec 48 sec 132 sec 8/5/2009 801497 Lazy Binary Format * 21 sec 48 sec 132 sec   QueryA: SELECT count(1) FROM t;   QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;   QueryC: SELECT * FROM t;   map-side time only (incl. GzipCodec for comp/decompression)   * These two features need to be tested with other queries.
  • 24. Hive – Future Works   Indexes   Create table as select   Views / variables   Explode operator   In/Exists sub queries   Leverage sort/bucket information in Join