Intro to the Hadoop Stack @ April 2011 JavaMUG
About Me
david.engfer@gmail.com

@engfer

Meetup organizer for DFWBigData.org
  > Hadoop, Cassandra, and all other things
    BigData and NoSQL
  > Join up!

Sr. Consultant @
  > Rapidly growing national IT consulting firm
    focused on career development while
    operating within a local-office project model
What is Hadoop?
0 “framework for running [distributed] applications on
 large clusters built of commodity hardware”
                                   –from the Hadoop Wiki

0 Originally created by Doug Cutting
   > Named the project after his son’s toy elephant


0 The name “Hadoop” has now evolved
 to cover a family of products, but at its
 core, it’s essentially just the
 MapReduce programming paradigm
 + a distributed file system
History
>_< Growing Pains
Jeffrey Dean: lots of data + tape backup +
expensive servers + high network bandwidth +
expensive databases + non-linear scalability + etc.
(http://bit.ly/ec31VL + http://bit.ly/gq84Ot)

History
Growing Pains + Solutions = Google white papers:
  Google File System (2003)
  MapReduce (2004)
  BigTable (2006)

History
The white papers inspired Hadoop Core, c. 2005
Hadoop Distributed
              File System (HDFS)
0 OSS implementation of Google File System (bit.ly/ihXkof)
0 Master/slave architecture
0 Designed to run on commodity hardware
0 Hardware failures assumed in design
0 Fault-tolerant via replication
0 Semi-POSIX compliance; relaxed for performance
0 Unix-like permissions; ties into host’s users & groups
Hadoop Distributed
               File System (HDFS)
0 Written in Java
0 Optimized for larger files
0 Focus on streaming data (high-throughput > low-latency)
0 Rack-aware
0 Only *nix for production env.
0 Web consoles for stats
HDFS Client API’s
0 “Shell-like” commands (hadoop dfs [cmd])
    > cat                chgrp               chmod              chown
      copyFromLocal      copyToLocal         cp                 du, dus
      expunge            get                 getmerge           ls, lsr
      mkdir              moveFromLocal       mv                 put
      rm, rmr            setrep              stat               tail
      test               text                touchz


0 Native Java API (see the sketch after this list)


0 Thrift API for other languages (http://bit.ly/fLgCJC)
    > C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
      Smalltalk, and OCaml
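
As a sketch of the native Java API (class name and paths here are made up for
illustration), the shell commands above map onto org.apache.hadoop.fs.FileSystem
calls; this assumes a core-site.xml on the classpath pointing at the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLs {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from core-site.xml on the classpath
    FileSystem fs = FileSystem.get(new Configuration());

    // Rough equivalents of `hadoop dfs -mkdir` and `hadoop dfs -ls`
    fs.mkdirs(new Path("/foo"));
    for (FileStatus status : fs.listStatus(new Path("/foo"))) {
      System.out.println(status.getPath() + "\t"
          + status.getLen() + "\t"
          + status.getReplication());
    }
  }
}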
Other HDFS Admin Tools
0 hadoop dfsadmin [opts]
 > Basic admin utilities for the DFS cluster
 > Change file-level replication factors, set quotas, upgrade,
   safemode, reporting, etc

0 hadoop fsck [opts]
 > Runs distributed file system checking and fixing utility


0 hadoop balancer
 > Utility that rebalances block storage across the nodes
HDFS Node Types
Master
  NameNode
     0 Single node responsible for:
        > Filesystem metadata operations on cluster
        > Replication and locations of file blocks
     0 SPOF =(

  CheckpointNode or BackupNode (backups)
     0 Nodes responsible for:
        > NameNode backup mechanisms

Slaves
  DataNodes
     0 Nodes responsible for:
        > Storage of file blocks
        > Serving actual file data to clients
HDFS Architecture
The client talks to the NameNode for FS/namespace/meta ops; the NameNode ships
namespace backups to the BackupNode and exchanges heartbeats, balancing, and
replication traffic with the DataNodes. The DataNodes write blocks to local
disk and serve the actual file data to clients.

HDFS Architecture
To store a giant file, the HDFS client asks the NameNode only for block
locations and FS ops (<No file data goes through the NameNode!>); the data
Xfer happens directly between the client and the DataNodes.
Putting files on HDFS
The client buffers blocks to local disk… the giant file is split into {64MB}
blocks. For each block, the NameNode returns the block size and the nodes to
write to (based on the “replication factor”; 3 by default).

Putting files on HDFS
While buffering to local disk, the client Xfers each block directly to its
assigned DataNodes: {node1, node2, node3} for the first block, then
{node1, node3, node5}, {node1, node4, node5}, {node2, node3, node4},
{node2, node4, node5}, …

Putting files on HDFS
Ad nauseam… until every block of the giant file is stored.
Getting files from HDFS
The HDFS client asks the NameNode, which returns the locations of the blocks
for the file; the client then streams the blocks directly from the DataNodes.
(See the Java sketch below for both the write and read paths.)
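
A hedged sketch of both paths through the native Java API (file name, sizes,
and the inline data are made up for illustration); the create() overload below
pins the buffer size, replication factor, and block size explicitly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/foo/giant-file");

    // Put: the client writes a stream; HDFS splits it into blocks and the
    // blocks are pipelined directly to the DataNodes the NameNode assigns
    // (buffer 4096 bytes, replication factor 3, block size 64MB)
    FSDataOutputStream out =
        fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
    out.write("110010101001010100101010...".getBytes());
    out.close();

    // Get: the NameNode returns only block locations; the bytes are
    // streamed directly from the DataNodes
    FSDataInputStream in = fs.open(file);
    byte[] buf = new byte[1024];
    int n = in.read(buf);
    System.out.println(new String(buf, 0, n));
    in.close();
  }
}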
Fault Tolerance?
1) The NameNode detects DataNode loss.
2) Blocks are auto-replicated on the remaining nodes to satisfy the
   replication factor.

Fault Tolerance?
NameNode loss = FAIL (requires manual intervention)
   not an EPIC fail, because you have the BackupNode to replay any FS operations

   **automatic failover is in the works
Live horizontal scaling and rebalancing
1) The NameNode detects that a new DataNode has been added to the cluster.
2) Blocks are re-balanced and re-distributed.
3) Once the replication factor is satisfied, extra replicas are removed.
HDFS Demonstration
Other HDFS Utils
0 HDFS RAID (http://bit.ly/fqnzs5)
  > Uses distributed RAID instead of replication
    (useful at petabyte scale)


0 Flume/Scribe/Chukwa
   > Log collection and aggregation
     frameworks that support streaming
     log data to HDFS
   > Flume = Cloudera (http://bit.ly/gX8LeO)
   > Scribe = Facebook (http://bit.ly/dIh3If)
MapReduce
0 Distributed programming paradigm and framework that is
 the OSS implementation of Google’s MapReduce
 (http://bit.ly/gXZbsk)


0 Modeled using the ideas behind functional programming
 map() and reduce() operations
  > Distributed on as many nodes as you would like

0 2-phase process: map( ) → reduce( )
   > map( ): sub-divide & conquer
   > reduce( ): combine & reduce cardinality
MapReduce ABC’s
0 Essentially, it’s…
   1. Take a large problem and divide it into sub-problems
   2. Perform the same function on all sub-problems
   3. Combine the output from all sub-problems

0 Ex: Searching
   1.   Take a large problem and divide it into sub-problems
        #    Different groups of rows in DB; different parts of files; 1 user from a list of
             users; etc.
   2.   Perform the same function on all sub-problems
        #    Search for a key in the given partition of data for the sub-problem; count
             words; etc.
   3.   Combine the output from all sub-problems
        #    Combine the results into a result-set and return to the client
M/R Facts
0 M/R is excellent for problems where the “sub-problems”
 are not interdependent
  > For example, the output of one “mapper” should not depend
    on the output or communication with another “mapper”
0 The reduce phase does not begin execution until all
  mappers have finished
0 Failed map and reduce tasks get auto-restarted
0 Rack/HDFS-aware
MapReduce Visualized
Input is split among Mappers; each Mapper takes a <key_i, value_i> and emits
intermediate pairs: <keyA, value_a>, <keyB, value_b>, <keyC, value_c>, …
The framework then sorts and groups the pairs by key, and each group goes to
a Reducer:
   <keyA, list(value_a, value_b, value_c, …)> → Reducer
   <keyB, list(value_a, value_b, value_c, …)> → Reducer
   <keyC, list(value_a, value_b, value_c, …)> → Reducer
The Reducers’ results form the Output.
Example: Word Count
Lots of big input files are split among Mappers; each Mapper runs count()
over its part and emits per-word counts:
   <?, file1_part1> → <“foo”, 3>, <“bar”, 14>, <“baz”, 6>, …
   <?, file1_part2> → <“foo”, 21>, <“bar”, 78>, <“baz”, 12>, …
   <?, file2_part1> → <“foo”, 11>, <“bar”, 22>, <“baz”, 31>, …
   <?, file2_part2> → <“foo”, 1>, <“bar”, 41>, <“baz”, 10>, …
Sort and group by key:
   <“foo”, (3, 21, 11, 1)>   <“bar”, (14, 78, 22, 41)>   <“baz”, (6, 12, 31, 10)>
Each Reducer runs sum() and emits the output:
   bar,155   baz,59   foo,36   …
(A Java sketch of this job follows.)
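
As a concrete companion to the diagram, here is a minimal sketch of the
classic WordCount job against the Hadoop Java MapReduce API; treat it as the
canonical example rather than code from this deck (class and path names are
illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // count(): emit <word, 1> for every word in this mapper's input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // sum(): add up all counts for a word after the sort/group phase
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-sum on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}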
Hadoop’s MapReduce
0 MapReduce tasks are submitted as a “job”
  > Jobs can be assigned to a specified “queue” of jobs
     # By default, jobs are submitted to the “default” queue
  > Job submission is controlled by ACL’s for each queue


0 Rack-aware and HDFS-aware
   > The JobTracker communicates with the HDFS NameNode
     and schedules map/reduce operations using input data
     locality on HDFS DataNodes
M/R Nodes
Master
  JobTracker
     0 Single node responsible for:
        > Coordinating all M/R tasks & events
        > Managing job queues and scheduling
        > Maintaining and controlling TaskTrackers
        > Moving/restarting map/reduce tasks if needed
     0 SPOF =(
        > Uses “checkpointing” to combat this

Slaves
  TaskTrackers
     0 Worker nodes responsible for:
        > Executing individual map and reduce tasks
          as assigned by JobTracker (in separate JVM)
Conceptual Overview
The JobTracker controls and heartbeats the TaskTracker nodes.
TaskTrackers store their temporary data on HDFS.
Job Submission
M/R Clients submit jobs to the JobTracker, where the jobs get queued.
map()’s are assigned to TaskTrackers (HDFS DataNode locality aware);
mappers are spawned in a separate JVM, execute, and store their results
on HDFS.

Job Submission
Once all mappers have finished, the reduce phase begins: reducers are
spawned on the TaskTrackers and read the mappers’ temporary data from HDFS.
MapReduce Tips
0 Keys and values can be any type of object
   > Can specify custom data splitters, partitioners, combiners,
     InputFormat’s, and OutputFormat’s
0 Use ToolRunner.run(Tool) to run your Java jobs (sketch below)…
   > Will use GenericOptionsParser and DistributedCache so that
     -files, -libjars, & -archives options are available to distribute
     your mappers, reducers, and any other utilities
   > Without this, your mappers, reducers, and other utilities will
     not be propagated and added to the classpath of the other
     nodes (ClassNotFoundException)
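
A minimal sketch of that pattern (MyJob is a hypothetical class name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Implementing Tool means ToolRunner feeds your args through
// GenericOptionsParser, so -D, -files, -libjars, & -archives all work
public class MyJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // already populated with -D options
    // ... build the Job from conf and submit it here ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}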
MapReduce Demonstration
Other M/R Utils
0 $HADOOP_HOME/contrib/*
   > PriorityScheduler & FairScheduler
   > HOD (Hadoop On Demand)
     # Uses TORQUE resource manager to dynamically allocate, use,
       and destroy MapReduce clusters on an as-needed basis
     # Great for development and testing
  > Hadoop Streaming (next slide...)


0 Amazon’s Elastic MapReduce (EMR)
   > Essentially production HOD for EC2 data/clusters
Hadoop Streaming
0 Allows you to write MapReduce jobs in languages other than
 Java by running any command line process
  > Input data is partitioned and given to the standard input (STDIN) of
    the command line mappers and reducers specified
  > Output (STDOUT) from the command line mappers and reducers
    get combined into the M/R pipeline
0 Can specify custom partitioners and combiners
0 Can specify files & archives to propagate to all nodes and
 unpack on local file system (-archives & -file)
 hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
     -input "/foo/bar/input.txt" \
     -mapper splitz.py \
     -reducer /bin/wc \
     -output "/foo/baz/out" \
     -archives 'hdfs://hadoop1/foo/bar/cachedir.jar' \
     -file ~/scripts/splitz.py \
     -D mapred.job.name="Foo bar"
Pig
0 Framework and language (Pig Latin) for creating and
  submitting Hadoop MapReduce jobs
0 Common data operations (not supported by POJO-M/R)
  like join, group, filter, sort, select, etc. are provided
0 Don’t need to know Java
0 Removes boilerplate aspect from M/R
  > 200 lines in Java → 15 lines in Pig!
0 Relational qualities (reads and feels SQL-ish)
Pig
0 Fact from Wiki: 40% of Yahoo’s M/R jobs are in Pig
0 Interactive shell (grunt) exists
0 User Defined Functions (UDF) – see the Java sketch below
   > Allow you to specify Java code where the logic may be too
     complex for Pig Latin
   > UDF’s can be part of most every operation in Pig Latin
   > Great for loading and storing custom formats as well as
     transforming data
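
As an illustration, a minimal UDF sketch (a hypothetical lower-casing
EvalFunc, similar in spirit to the tutorial's org.apache.pig.tutorial.ToLower
used on the next slide):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Usage in Pig Latin, after REGISTER'ing the jar containing this class:
//   clean = FOREACH raw GENERATE ToLowerCase(query);
public class ToLowerCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null)
      return null;
    return ((String) input.get(0)).toLowerCase();
  }
}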
Pig Relational Operations
COGROUP           JOIN               SPLIT
CROSS             LIMIT              STORE
DISTINCT          LOAD               STREAM
FILTER            MAPREDUCE          UNION
FOREACH           ORDER BY
GROUP             SAMPLE

  most of these are pretty self-explanatory
Example Pig Script
Taken from the Pig tutorial on the Pig wiki: the Temporal Query Phrase Popularity script
processes a search query log file from the Excite search engine and compares the
occurrence frequency of search phrases across two time periods separated by twelve hours.

REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time,
         org.apache.pig.tutorial.ToLower(query) as query;            -- UDF’s
houred = FOREACH clean2 GENERATE user,
         org.apache.pig.tutorial.ExtractHour(time) as hour, query;
ngramed1 = FOREACH houred GENERATE user, hour,
         flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
ngramed2 = DISTINCT ngramed1;
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0),
         COUNT($1) AS count;
hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram,
         $1 as hour, $2 as count;
hour00 = FILTER hour_frequency2 BY hour eq '00';
hour12 = FILTER hour_frequency3 BY hour eq '12';
same = JOIN hour00 BY $0, hour12 BY $0;
same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as
         ngram, $2 as count00, $5 as count12;
STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();

Now… imagine this equivalent in Java…
ZooKeeper
0 Centralized coordination service for use by distributed
  applications
   > Configuration, naming, synchronization (locks), ownership (master
     election), etc.

  ZooKeeper Service: an ensemble of Servers with one elected Leader;
  Clients may connect to any Server in the ensemble

0 Important system guarantees:
   > Sequential consistency (great for locking)
   > Atomicity – all or nothing at all
   > Data consistency – all clients view the same system state regardless of
     the server they connect to
ZooKeeper
0 Hierarchical namespace of “znodes” (like directories), with data
  stored at the leaf znodes

0 Operations (see the client sketch below):
   > create a node at a location in the tree
   > delete a node
   > exists - tests if a node exists at a location
   > get data from a node
   > set data on a node
   > get children from a node
   > sync - waits for data to be propagated
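
A minimal sketch of those operations via the ZooKeeper Java client (host
list, paths, and payloads are made up for illustration):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble (any server will do)
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, null);

    // create: make a znode at a location in the tree
    if (zk.exists("/app", false) == null) {
      zk.create("/app", "config-v1".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // get/set data on a znode (-1 version = overwrite any version)
    byte[] data = zk.getData("/app", false, null);
    zk.setData("/app", "config-v2".getBytes(), -1);

    // Ephemeral-sequential znodes are the usual building block for locks
    // and master election (lowest sequence number wins)
    zk.create("/app/lock-", new byte[0],
              Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    zk.close();
  }
}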
HBase
0 Sparse, non-relational, column-oriented distributed database
  built on top of Hadoop Core (HDFS + MapReduce)

0 Modeled after Google’s BigTable (http://bit.ly/fQ1NMA)

0 NoSQL = Not Only SQL… not “SQL is terrible”
0 HBase also has:
   > Strong consistency model
   > In-memory operation
   > LZO compression (optional)
   > Live migrations
   > MapReduce support for querying
What HBase Is…
0 Good at fast/streaming writes
0 Fault tolerant
0 Good at linear horizontal scalability
0 Very efficient at managing billions of rows and millions of
  columns
0 Good at keeping row history
0 Good at auto-balancing
0 A complement to a SQL DB/warehouse
0 Great with non-normalized data
What HBase Is NOT…
0 Made for table joins
0 Made for splitting into normalized tables (see previous)
0 A complete replacement for a SQL relational database
0 A complete replacement for a SQL data warehouse
0 Great for storing small amounts of data
0 Great for storing gobs of large binary data
0 The best way to do OLTP
0 The best way to do live adhoc querying of any column
0 A replacement for a proper caching mechanism
0 ACID compliant (http://bit.ly/hhFXCS)
HBase Facts
0 Written in Java
0 Uses ZooKeeper to store metadata and -ROOT- region
0 Column-oriented store = flexible schema
   > Can alter the schema simply by adding the column name and
     data on insert (“put”)
   > No schema migrations!
0 Every column has a timestamp associated with it
   > Same column with most recent timestamp wins
0 Can export metrics for use with Ganglia, or as JMX
0 hbase hbck
   > Check for errors and fix them (like HDFS fsck)
HBase Client API’s
0 jRuby interactive shell (hbase shell)
   > DDL/DML commands
   > Admin commands
   > Cluster commands


0 Java API (http://bit.ly/ij0MgF) - see the sketch below

0 REST API
   > Provided using Stargate


0 Thrift API for other languages (http://bit.ly/fLgCJC)
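
A minimal sketch against the Java API of this era (table name, column family,
and values are assumptions; they echo the JSON example a few slides ahead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.)
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical table

    // put: row key + column family + qualifier + value; flexible schema
    // means new columns appear simply by writing them
    Put put = new Put(Bytes.toBytes("xyz"));
    put.add(Bytes.toBytes("address"), Bytes.toBytes("line1"),
            Bytes.toBytes("hello"));
    table.put(put);

    // get: the most recent timestamp wins unless older versions are asked for
    Result result = table.get(new Get(Bytes.toBytes("xyz")));
    byte[] value = result.getValue(Bytes.toBytes("address"),
                                   Bytes.toBytes("line1"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}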
Column-Oriented?
   0 Traditional RDBMSs use row-oriented storage, which stores
     entire rows sequentially on disk
               Row 1 – Cols 1-3                          Row 2 – Cols 1-3
                              Row 3 – Cols 1-3

   0 Whereas column-oriented storage only stores columns for
     each row (or column-families) sequentially on disk
    Row 1 – Col 1   Row 2 – Col 1                Row 1 – Col 2     Row 2 – Col 2
    Row 3 – Col 1                                Row 3 – Col 2

                     Row 1 – Col 3       Row 3 – Col 3


Where’s Row 2 - Col 2?              Not needed because columns are stored
                                    sequentially, so rows have flexible schema!
Think of HBase Tables As…
   0 More like JSON
     > And less like spreadsheets (rows have a flexible schema)

     {
       "1" : {                                 <- row id
          "A" : { v: "x", ts: 4282 },          <- columns
          "B" : { v: "z", ts: 4282 }
       },
       "aaaaa" : {
          "A" : { v: "y", ts: 4282 }
       },
       "xyz" : {
          "address" : {                        <- column families allow grouping
             "line1" : { v: "hello", ts: 4282 },  of columns (faster retrieval)
             "line2" : { v: "there", ts: 4282 }, <- recent TS = default col value
             "line2" : { v: "there", ts: 1234 }  <- old TS
          },
          "fooo" : { v: "wow!", ts: 4282 }
       },
       "zzzzz" : {
          "A" : { v: "woot", ts: 4282 },       <- value & timestamp (TS)
          "B" : { v: "1337", ts: 4282 }
       }
     }

     Modified from http://bit.ly/hbGWIG
HBase Overview
0 Data is sent using the client
0 The Master server keeps track of the metadata for RegionServers and their
  containing Regions, and stores it in ZooKeeper
0 The HBase client communicates with the ZooKeeper cluster only to get Region
  information; moreover, no data is sent through the Master
0 The actual row “data” (bytes) is sent directly to and from the RegionServers
0 Therefore, neither the Master server nor the ZooKeeper cluster serves as a
  data bottleneck

Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
HBase Overview
0 All HBase data (HLog and HFiles) is stored on HDFS
0 HDFS breaks files into 64MB chunks and replicates the chunks N times (3 by
  default) to store on “actual” disk (giving HBase its fault tolerance)

Pretty diagrams from Lars George: http://goo.gl/wRLJP
Understanding HBase
0 Tables are split into groups of ~100 rows (configurable) called Regions
  (HRegions)
0 Regions are assigned to particular RegionServers by the Master server.
  The Master only contains region-location metadata and contains no “real”
  row data.

Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
Writing to HBase
1) The HBase client gets the assigned RegionServers (and regions) from the
   Master server for the particular keys (rows) in question and sends
   commands/data
2) The transaction is written to the write-ahead-log on HDFS (disk) first
3) The same data is written to the in-memory store for the assigned region
   (row group)
4) The in-memory store is periodically flushed to HDFS (disk) when its size
   reaches a threshold

Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
HBase Scalability
0 Additional RegionServers can be added to the live system. The Master server
  will then rebalance the cluster to migrate Regions onto the new
  RegionServers.
0 Moreover, additional HDFS DataNodes can be added to give more disk space to
  the HDFS cluster.

Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
HBase Demonstration
Hive
0 Data warehouse infrastructure on top of Hadoop Core
  > Stores data on HDFS
  > Allows you to add custom MapReduce plugins
0 HiveQL
  > SQL-like language pretty close to ANSI SQL
    # Supports joins
  > JDBC driver exists
0 Has interactive shell (like MySQL & PostgreSQL) to run
 interactive queries
Hive
0 When running a HiveQL query/script, in the background Hive creates and runs
  a series of MapReduce jobs
   > BigData means it can take a long time to run queries

0 Therefore, it’s good for offline BigETL, but not a good replacement for an
  OLTP/OLAP data warehouse (like Oracle)

0 Learn more from the wiki: http://bit.ly/epauio

   > SHOW TABLES;

   > CREATE TABLE rating (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING)
     ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '\t'
     STORED AS TEXTFILE;

   > DESCRIBE rating;

(A JDBC sketch follows.)
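
A minimal sketch of the JDBC route mentioned earlier, assuming the HiveServer
JDBC driver of this era and a server on localhost:10000 (the query and table
are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Hive's JDBC driver; the URL points at a running HiveServer
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // Behind the scenes, Hive compiles this into MapReduce jobs
    ResultSet rs = stmt.executeQuery(
        "SELECT movieid, AVG(rating) FROM rating GROUP BY movieid");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getString(2));
    }
    con.close();
  }
}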
Other useful utilities around
                  Hadoop
0 Sqoop (http://bit.ly/eRfVEJ)
   > Load SQL data from a table into HDFS or Hive
   > Generates Java classes to interact with the loaded data
0 Oozie (http://bit.ly/eNLi3B)
   > Orchestrates complex workflows around multiple MapReduce jobs
0 Mahout (http://bit.ly/hCXRjL)
   > Algorithm library for collaborative filtering, clustering, classifiers,
     and machine learning
0 Cascading (http://bit.ly/gyZNiI)
   > Data query abstraction layer similar to Pig
   > Java API that sits on top of MapReduce framework
   > Since it’s a Java API you can use it with any program that uses a JVM
     language: Groovy, Scala, Clojure, jRuby, jython, etc.
What about support?
0 Community, wikis, forums, IRC


0 Cloudera provides enterprise support
   > Offerings:
    # Cloudera Enterprise
    # Support, professional services, training, management apps
  > Cloudera Distribution of Hadoop (CDH)
     # Tested and hardened version of Hadoop products plus some
       other goodies (oozie, flume, hue, sqoop, whirr)
       ~ Separate codebase, but patches are made to and from the Apache versions
     # Packages: debian, redhat, EC2, VM
      If you want to try Hadoop, CDH is probably the way to go.
          I recommend this instead of downloading each project individually.
Who uses this stuff?




                  and many more
Where the heck can I use this stuff?
0 The hardest part is finding the right use-cases to apply Hadoop
  (and any NoSQL system)
  > SQL databases are great for data that fits on one machine
  > Lots of tooling support for SQL; not as much for Hadoop (yet)


0 A few questions to think about:
   > How much data are you processing?
   > Are you throwing away valuable data due to space?
   > Are you processing data where steps aren’t interdependent?

0 Log storage, log processing, utility data, research data,
  biological data, medical records, events, mail, tweets, market
  data, financial data
NoSQL ≠ “No SQL”

NoSQL = “Not Only SQL”
The Law Of the Instrument

“It is tempting, if the only tool you have is
a hammer, to treat everything as if it
were a nail.”
                       -Abraham Maslow
?’s
Thank You



 david.engfer@gmail.com




submit feedback here!

Injustice - Developers Among Us (SciFiDevCon 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Intro to the Hadoop Stack @ April 2011 JavaMUG

  • 13. HDFS Node Types
Master: NameNode
0 Single node responsible for:
  > Filesystem metadata operations on cluster
  > Replication and locations of file blocks
0 SPOF =(
(backups): CheckpointNode or BackupNode
0 Nodes responsible for:
  > NameNode backup mechanisms
Slaves: DataNodes
0 Nodes responsible for:
  > Storage of file blocks
  > Serving actual file data to clients
  • 14. HDFS Architecture (diagram)
0 Clients send FS/namespace/metadata ops to the NameNode; the BackupNode holds namespace backups
0 The NameNode coordinates the DataNodes (heartbeats, balancing, replication, etc.)
0 DataNodes serve data directly to clients and write blocks to local disk
  • 15. HDFS Architecture (diagram)
0 For a giant file, the client asks the NameNode for block locations and FS ops; the NameNode returns metadata only (<No file data!!>)
0 The actual data Xfer happens directly between the client and the DataNodes
  • 16. Putting files on HDFS (diagram)
0 The client buffers blocks of the giant file to local disk ({64MB} each)
0 The NameNode returns the block size and the nodes for each block (based on the "replication factor"; 3 by default)
  • 17–21. Putting files on HDFS (diagrams)
0 While buffering to local disk, the client Xfers each block directly to its assigned DataNodes, e.g. {node1, node2, node3}, then {node1, node3, node5}, {node1, node4, node5}, {node2, node3, node4}, {node2, node4, node5} (one node set per block, chosen by the NameNode per the replication factor)
  • 22. Putting files on HDFS (diagram)
0 ...ad nauseam, until every block of the giant file is written
  • 23. Getting files from HDFS (diagram)
0 The client asks the NameNode, which returns the locations of the blocks for the file
0 The client then streams the blocks directly from the DataNodes
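For concreteness, here is a minimal sketch of the put/get flow from slides 16–23 through the native Java API; the NameNode URI (hdfs://namenode:9000) and the /foo/giant-file path are made-up examples, not from the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // assumed NameNode URI

    FileSystem fs = FileSystem.get(conf);

    // "Put": the client API hides the block-by-block Xfer to the DataNodes
    Path dst = new Path("/foo/giant-file");
    FSDataOutputStream out = fs.create(dst); // NameNode assigns DataNodes per block
    out.write("hello hdfs".getBytes("UTF-8"));
    out.close();

    // "Get": the NameNode returns block locations; bytes stream from DataNodes
    FSDataInputStream in = fs.open(dst);
    byte[] buf = new byte[64];
    int n = in.read(buf);
    System.out.println(new String(buf, 0, n, "UTF-8"));
    in.close();
  }
}

The same put could, of course, be done from the shell with hadoop dfs -put.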
  • 24. Fault Tolerance? (diagram)
0 The NameNode detects DataNode loss
  • 25–27. Fault Tolerance? (diagrams)
0 Blocks from the lost node are auto-replicated onto the remaining nodes until the replication factor is satisfied again
  • 28. Fault Tolerance? (diagram)
0 NameNode loss = FAIL (requires manual intervention)
  > Not an EPIC fail, because you have the BackupNode to replay any FS operations
  > **automatic failover is in the works
  • 29. Live horizontal scaling and rebalancing (diagram)
0 The NameNode detects that a new DataNode has been added to the cluster
  • 30–32. Live horizontal scaling and rebalancing (diagrams)
0 Blocks are re-balanced and re-distributed across the cluster, including the new node
  • 33. Live horizontal scaling and rebalancing (diagram)
0 Once the replication factor is satisfied, extra replicas are removed
  • 35. Other HDFS Utils
0 HDFS RAID (http://bit.ly/fqnzs5)
  > Uses distributed RAID instead of replication (useful at petabyte scale)
0 Flume/Scribe/Chukwa
  > Log collection and aggregation frameworks that support streaming log data to HDFS
  > Flume = Cloudera (http://bit.ly/gX8LeO)
  > Scribe = Facebook (http://bit.ly/dIh3If)
  • 36. MapReduce
0 Distributed programming paradigm and framework; the OSS implementation of Google's MapReduce (http://bit.ly/gXZbsk)
0 Modeled on the map() and reduce() operations from functional programming
  > Distributed across as many nodes as you would like
0 2-phase process: map() → reduce()
  > map: sub-divide & conquer
  > reduce: combine & reduce cardinality
  • 37. MapReduce ABC's
0 Essentially, it's...
  1. Take a large problem and divide it into sub-problems
  2. Perform the same function on all sub-problems
  3. Combine the output from all sub-problems
0 Ex: Searching
  1. Divide into sub-problems # different groups of rows in a DB; different parts of files; 1 user from a list of users; etc.
  2. Perform the same function on all sub-problems # search for a key in the given partition of data; count words; etc.
  3. Combine the output # combine the results into a result-set and return to the client
  • 38. M/R Facts
0 M/R is excellent for problems where the "sub-problems" are not interdependent
  > For example, the output of one "mapper" should not depend on the output of, or communication with, another "mapper"
0 The reduce phase does not begin execution until all mappers have finished
0 Failed map and reduce tasks get auto-restarted
0 Rack/HDFS-aware
  • 39. MapReduce Visualized (diagram)
0 Input is split among Mappers, each emitting <key, value> pairs (e.g. <keyA, valuea>, <keyB, valueb>, <keyC, valuec>, ...)
0 Pairs are sorted and grouped by key into <keyA, list(valuea, valueb, valuec, ...)>, <keyB, list(...)>, <keyC, list(...)>
0 Each Reducer consumes one group and produces the Output
  • 40. Example: Word Count (diagram)
0 Lots of big input files are split among Mappers, which count() words and emit pairs like <"foo", 3>, <"bar", 14>, <"baz", 6>
0 Pairs are sorted and grouped by key, e.g. <"foo", (3, 21, 11, 1)>, <"bar", (14, 78, 22, 41)>, <"baz", (6, 12, 31, 10)>
0 Reducers sum() each group, producing the output: bar,155 baz,59 foo,36
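A minimal sketch of that word-count job against the Java API of this era (class names and I/O paths here are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map: <?, line> -> <word, 1> for every word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // reduce: <word, (1, 1, ...)> -> <word, sum>
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-sum locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with something like hadoop jar wordcount.jar WordCount /in /out (paths illustrative); note the combiner reuses the reducer to shrink the data each mapper ships across the network.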
  • 41. Hadoop's MapReduce
0 MapReduce tasks are submitted as a "job"
  > Jobs can be assigned to a specified "queue" of jobs # by default, jobs are submitted to the "default" queue
  > Job submission is controlled by ACL's for each queue
0 Rack-aware and HDFS-aware
  > The JobTracker communicates with the HDFS NameNode and schedules map/reduce operations using input-data locality on HDFS DataNodes
  • 42. M/R Nodes
Master: JobTracker
0 Single node responsible for:
  > Coordinating all M/R tasks & events
  > Managing job queues and scheduling
  > Maintaining and controlling TaskTrackers
  > Moving/restarting map/reduce tasks if needed
0 SPOF =(
  > Uses "checkpointing" to combat this
Slaves: TaskTrackers
0 Worker nodes responsible for:
  > Executing individual map and reduce tasks as assigned by the JobTracker (in a separate JVM)
  • 43. Conceptual Overview (diagram)
0 The JobTracker controls and heartbeats the TaskTracker nodes
0 TaskTrackers store temporary data on HDFS
  • 44. Job Submission (diagram)
0 M/R clients submit jobs to the JobTracker; jobs get queued
0 map()'s are assigned to TaskTrackers (HDFS DataNode locality-aware)
0 Mappers are spawned in a separate JVM, execute, and store their results on HDFS (temporary data)
  • 45. Job Submission (diagram)
0 Once the mappers finish, the reduce phase begins
0 Reducers are spawned on TaskTrackers and read the temporary data back from HDFS
  • 46. MapReduce Tips
0 Keys and values can be any type of object
  > Can specify custom data splitters, partitioners, combiners, InputFormat's, and OutputFormat's
0 Use ToolRunner.run(Tool) to run your Java jobs (see the sketch below)
  > It wires in GenericOptionsParser and the DistributedCache so that the -files, -libjars, & -archives options are available to distribute your mappers, reducers, and any other utilities
  > Without this, your mappers, reducers, and other utilities will not be propagated and added to the classpath of the other nodes (ClassNotFoundException)
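A minimal sketch of that pattern; assume the Job-building code from the WordCount sketch above lives in run():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // already populated by GenericOptionsParser
    // ... build and submit the Job here, as in the WordCount sketch ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses -files/-libjars/-archives/-D before handing the
    // remaining args to run(), and ships the listed jars and files to the
    // other nodes via the DistributedCache
    int exitCode = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(exitCode);
  }
}

Invocation would then look something like hadoop jar myjob.jar MyJob -libjars extra.jar /in /out (jar and paths hypothetical).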
  • 48. Other M/R Utils
0 $HADOOP_HOME/contrib/*
  > PriorityScheduler & FairScheduler
  > HOD (Hadoop On Demand)
    # Uses the TORQUE resource manager to dynamically allocate, use, and destroy MapReduce clusters on an as-needed basis
    # Great for development and testing
  > Hadoop Streaming (next slide...)
0 Amazon's Elastic MapReduce (EMR)
  > Essentially production HOD for EC2 data/clusters
  • 49. Hadoop Streaming
0 Allows you to write MapReduce jobs in languages other than Java by running any command-line process
  > Input data is partitioned and fed to the standard input (STDIN) of the specified command-line mappers and reducers
  > Output (STDOUT) from the command-line mappers and reducers gets combined back into the M/R pipeline
0 Can specify custom partitioners and combiners
0 Can specify files & archives to propagate to all nodes and unpack on the local file system (-archives & -file)

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.job.name="Foo bar" \
  -input "/foo/bar/input.txt" \
  -mapper splitz.py \
  -reducer /bin/wc \
  -output "/foo/baz/out" \
  -archives 'hdfs://hadoop1/foo/bar/cachedir.jar' \
  -file ~/scripts/splitz.py
  • 50. Pig
0 Framework and language (Pig Latin) for creating and submitting Hadoop MapReduce jobs
0 Provides common data operations (not supported by POJO-M/R) like join, group, filter, sort, select, etc.
0 Don't need to know Java
0 Removes the boilerplate aspect from M/R
  > 200 lines in Java → 15 lines in Pig!
0 Relational qualities (reads and feels SQL-ish)
  • 51. Pig
0 Fact from the wiki: 40% of Yahoo's M/R jobs are in Pig
0 Interactive shell (grunt) exists
0 User Defined Functions (UDF's)
  > Allow you to specify Java code where the logic may be too complex for Pig Latin
  > UDF's can be part of most every operation in Pig Latin
  > Great for loading and storing custom formats as well as transforming data
  • 52. Pig Relational Operations
COGROUP, CROSS, DISTINCT, FILTER, FOREACH, GROUP, JOIN, LIMIT, LOAD, MAPREDUCE, ORDER BY, SAMPLE, SPLIT, STORE, STREAM, UNION
(most of these are pretty self-explanatory)
  • 53–54. Example Pig Script
Taken from the Pig tutorial on the Pig wiki: the Temporal Query Phrase Popularity script processes a search-query log file from the Excite search engine and compares the frequency of occurrence of search phrases across two time periods separated by twelve hours.

REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
ngramed2 = DISTINCT ngramed1;
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) AS count;
hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram, $1 as hour, $2 as count;
hour00 = FILTER hour_frequency2 BY hour eq '00';
hour12 = FILTER hour_frequency3 BY hour eq '12';
same = JOIN hour00 BY $0, hour12 BY $0;
same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as ngram, $2 as count00, $5 as count12;
STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();

The fully-qualified org.apache.pig.tutorial.* calls are UDF's. Now... imagine this equivalent in Java...
  • 55. ZooKeeper
0 Centralized coordination service for use by distributed applications
  > Configuration, naming, synchronization (locks), ownership (master election), etc.
0 (diagram) A ZooKeeper Service is an ensemble of servers with one elected Leader; many clients connect to any of the servers
0 Important system guarantees:
  > Sequential consistency (great for locking)
  > Atomicity – all or nothing at all
  > Data consistency – all clients view the same system state regardless of the server they connect to
  • 56. ZooKeeper
0 Hierarchical namespace of "znodes" (like directories), with leaf znodes at the bottom of the tree
0 Operations (see the sketch below):
  > create a node at a location in the tree
  > delete a node
  > exists – tests if a node exists at a location
  > get data from a node
  > set data on a node
  > get children from a node
  > sync – waits for data to be propagated
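A minimal sketch of those operations with the ZooKeeper Java client; the connect string and the /app1 znode are made-up examples:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkDemo {
  public static void main(String[] args) throws Exception {
    // connect to the ensemble; the watcher receives session/node events
    // (a real client would wait for the connected event before issuing ops)
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000,
        new Watcher() {
          public void process(WatchedEvent event) {
            System.out.println("event: " + event);
          }
        });

    // create a persistent znode, then read and update it
    zk.create("/app1", "config-v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    Stat stat = zk.exists("/app1", false);          // exists
    byte[] data = zk.getData("/app1", false, stat); // get data
    System.out.println(new String(data));
    zk.setData("/app1", "config-v2".getBytes(), stat.getVersion()); // set data
    System.out.println(zk.getChildren("/", false)); // get children
    zk.delete("/app1", -1);                         // delete (-1 = any version)
    zk.close();
  }
}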
  • 57. HBase
0 Sparse, non-relational, column-oriented distributed database built on top of Hadoop Core (HDFS + MapReduce)
0 Modeled after Google's BigTable (http://bit.ly/fQ1NMA)
0 NoSQL = "Not Only SQL"... ...not "SQL is terrible"
0 HBase also has:
  > Strong consistency model
  > In-memory operation
  > LZO compression (optional)
  > Live migrations
  > MapReduce support for querying
  • 58. What HBase Is...
0 Good at fast/streaming writes
0 Fault tolerant
0 Good at linear horizontal scalability
0 Very efficient at managing billions of rows and millions of columns
0 Good at keeping row history
0 Good at auto-balancing
0 A complement to a SQL DB/warehouse
0 Great with non-normalized data
  • 59. What HBase Is NOT...
0 Made for table joins
0 Made for splitting into normalized tables (see previous)
0 A complete replacement for a SQL relational database
0 A complete replacement for a SQL data warehouse
0 Great for storing small amounts of data
0 Great for storing gobs of large binary data
0 The best way to do OLTP
0 The best way to do live ad-hoc querying of any column
0 A replacement for a proper caching mechanism
0 ACID compliant (http://bit.ly/hhFXCS)
  • 60. HBase Facts
0 Written in Java
0 Uses ZooKeeper to store metadata and the -ROOT- region
0 Column-oriented store = flexible schema
  > Can alter the schema simply by adding the column name and data on insert ("put")
  > No schema migrations!
0 Every column has a timestamp associated with it
  > Same column with the most recent timestamp wins
0 Can export metrics for use with Ganglia, or as JMX
0 hbase hbck
  > Checks for errors and fixes them (like HDFS fsck)
  • 61. HBase Client API's
0 jRuby interactive shell (hbase shell)
  > DDL/DML commands
  > Admin commands
  > Cluster commands
0 Java API (http://bit.ly/ij0MgF) – see the sketch below
0 REST API
  > Provided using Stargate
0 API for other languages (http://bit.ly/fLgCJC)
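A minimal sketch of the Java API from this era; "mytable" and the "info" column family are made-up names, and the table is assumed to exist already (e.g. created with create 'mytable', 'info' in the hbase shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "mytable");

    // "put": write row1 -> info:city = Dallas; the schema is flexible, so the
    // column only needs to name an existing column family
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Dallas"));
    table.put(put);

    // "get": read the most recent version of that cell back
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
    System.out.println(Bytes.toString(city));

    table.close();
  }
}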
  • 62. Column-Oriented?
0 Traditional RDBMS's use row-oriented storage, which stores entire rows sequentially on disk:
  [Row 1 – Cols 1-3] [Row 2 – Cols 1-3] [Row 3 – Cols 1-3]
0 Whereas column-oriented storage only stores the columns each row actually has (or column-families) sequentially on disk:
  [Row 1 – Col 1] [Row 2 – Col 1] [Row 3 – Col 1] [Row 1 – Col 2] [Row 3 – Col 2] [Row 1 – Col 3] [Row 3 – Col 3]
0 Where's Row 2 – Col 2? Not needed: because columns are stored sequentially, rows have a flexible schema!
  • 63. Think of HBase Tables As...
0 More like JSON, and less like spreadsheets:

{
  "1" : {                                    // row id
    "A" : { v: "x", ts: 4282 },              // columns
    "B" : { v: "z", ts: 4282 }
  },
  "aaaaa" : {
    "A" : { v: "y", ts: 4282 }
  },
  "xyz" : {
    "address" : {                            // column families allow grouping of columns (faster retrieval)
      "line1" : { v: "hello", ts: 4282 },
      "line2" : { v: "there", ts: 4282 },    // most recent TS = default col value
      "line2" : { v: "there", ts: 1234 }     // old TS
    },
    "fooo" : { v: "wow!", ts: 4282 }         // flexible schema; value & timestamp (TS)
  },
  "zzzzz" : {
    "A" : { v: "woot", ts: 4282 },
    "B" : { v: "1337", ts: 4282 }
  }
}

Modified from http://bit.ly/hbGWIG
  • 64. HBase Overview (diagram)
0 The Master server keeps track of the metadata for RegionServer's and their Regions, and stores it in ZooKeeper
0 The HBase client communicates with the ZooKeeper cluster only to get Region information; no data is sent through the Master
0 The actual row "data" (bytes) is sent directly to and from the RegionServers
0 Therefore, neither the Master server nor the ZooKeeper cluster serves as a data bottleneck
Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
  • 65. HBase Overview (diagram)
0 All HBase data (the HLog and HFiles) is stored on HDFS
0 HDFS breaks files into 64MB chunks and replicates each chunk N times (3 by default) onto "actual" disk, giving HBase its fault tolerance
Pretty diagrams from Lars George: http://goo.gl/wRLJP
  • 66. Understanding HBase (diagram)
0 Tables are split into groups of ~100 rows (configurable) called Regions
0 Regions are assigned to particular RegionServer's by the Master server
0 The Master only contains region-location metadata and contains no "real" row data
Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
  • 67. Writing to HBase (diagram)
1) The HBase client gets the assigned RegionServers (and Regions) from the Master server for the particular keys (rows) in question, and sends the commands/data
2) The transaction is written to the write-ahead log on HDFS (disk) first
3) The same data is written to the in-memory store for the assigned Region (row group)
4) The in-memory store is periodically flushed to HDFS (disk) when its size reaches a threshold
Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
  • 68. HBase Scalability (diagram)
0 Additional RegionServers can be added to the live system; the Master server will then rebalance the cluster to migrate Regions onto the new RegionServers
0 Moreover, additional HDFS DataNodes can be added to give more disk space to the HDFS cluster
Pretty diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV
  • 70. Hive
0 Data warehouse infrastructure on top of Hadoop Core
  > Stores data on HDFS
  > Allows you to add custom MapReduce plugins
0 HiveQL
  > SQL-like language pretty close to ANSI SQL # supports joins
  > JDBC driver exists
0 Has an interactive shell (like MySQL & PostgreSQL) to run interactive queries
  • 71. Hive
0 When running a HiveQL query/script, Hive creates and runs a series of MapReduce jobs in the background
  > BigData means it can take a long time to run queries
0 Therefore, it's good for offline BigETL, but not a good replacement for an OLTP/OLAP data warehouse (like Oracle)
0 Learn more from the wiki: http://bit.ly/epauio

> SHOW TABLES;

> CREATE TABLE rating (
    userid INT,
    movieid INT,
    rating INT,
    unixtime STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

> DESCRIBE rating;
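Since a JDBC driver exists, you can also hit Hive from plain Java. A minimal sketch, assuming a Hive server on localhost:10000 and the rating table from the slide above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
  public static void main(String[] args) throws Exception {
    // HiveServer-era driver class and JDBC URL
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // this aggregate compiles down to MapReduce jobs behind the scenes
    ResultSet rs = stmt.executeQuery(
        "SELECT movieid, AVG(rating) FROM rating GROUP BY movieid");
    while (rs.next()) {
      System.out.println(rs.getInt(1) + "\t" + rs.getDouble(2));
    }
    con.close();
  }
}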
  • 72. Other useful utilities around Hadoop
0 Sqoop (http://bit.ly/eRfVEJ)
  > Loads SQL data from a table into HDFS or Hive
  > Generates Java classes to interact with the loaded data
0 Oozie (http://bit.ly/eNLi3B)
  > Orchestrates complex workflows around multiple MapReduce jobs
0 Mahout (http://bit.ly/hCXRjL)
  > Algorithm library for collaborative filtering, clustering, classifiers, and machine learning
0 Cascading (http://bit.ly/gyZNiI)
  > Data query abstraction layer similar to Pig
  > Java API that sits on top of the MapReduce framework
  > Since it's a Java API, you can use it with any program that uses a JVM language: Groovy, Scala, Clojure, jRuby, Jython, etc.
  • 73. What about support?
0 Community, wikis, forums, IRC
0 Cloudera provides enterprise support
  > Offerings:
    # Cloudera Enterprise
    # Support, professional services, training, management apps
  > Cloudera's Distribution of Hadoop (CDH)
    # Tested and hardened version of the Hadoop products plus some other goodies (Oozie, Flume, Hue, Sqoop, Whirr)
      ~ Separate codebase, but patches are made to and from the Apache versions
    # Packages: Debian, Red Hat, EC2, VM
0 If you want to try Hadoop, CDH is probably the way to go. I recommend this instead of downloading each project individually.
  • 74. Who uses this stuff? (logos) ...and many more
  • 75. Where the heck can I use this stuff?
0 The hardest part is finding the right use-cases to apply Hadoop (and any NoSQL system)
  > SQL databases are great for data that fits on one machine
  > Lots of tooling support for SQL; not as much for Hadoop (yet)
0 A few questions to think about:
  > How much data are you processing?
  > Are you throwing away valuable data due to space?
  > Are you processing data where steps aren't interdependent?
0 Log storage, log processing, utility data, research data, biological data, medical records, events, mail, tweets, market data, financial data
  • 76.
  • 77. NoSQL
  • 78. NoSQL =
  • 79. The Law of the Instrument
"It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail." –Abraham Maslow
  • 80. ?’s