Hadoop
Saeed Iqbal P11-6501 MSCS
What Is Hadoop? (inspired by Google's MapReduce and GFS)
 • A system for processing mind-bogglingly large amounts
   of data.
 • The Apache Hadoop software library is a framework:
    • It allows for the distributed processing of large data sets
      across clusters of computers using simple programming
      models.
    • It is designed to scale up from single servers to thousands
      of machines, each offering local computation and storage.
    • Rather than relying on hardware to deliver high availability,
      the library itself is designed to detect and handle failures at
      the application layer, delivering a highly available service
      on top of a cluster of computers, each of which may be
      prone to failure.
Hadoop Core
• An open-source, flexible, and available architecture
  for large-scale computation and data processing on
  a network of commodity hardware
• Open-source software + commodity hardware
  • Reduced IT costs
• Two core components:
  MapReduce: computation
  HDFS: storage
Hadoop, Why?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
   • Failure is expected, rather than exceptional.
   • The number of nodes in a cluster is not constant.
• Need common infrastructure
   • Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor's, but
   • Workloads are I/O bound, not CPU bound
Hadoop History
•   2004—Initial versions of what is now Hadoop Distributed Filesystem and
    MapReduce implemented by Doug Cutting and Mike Cafarella.
•   December 2005—Nutch ported to the new framework. Hadoop runs reliably on
    20 nodes.
•   January 2006—Doug Cutting joins Yahoo!.
•   February 2006—Apache Hadoop project officially started to support the
    standalone development of MapReduce and HDFS.
•   February 2006—Adoption of Hadoop by Yahoo! Grid team.
•   April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
•   May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
•   May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than
    April benchmark).
•   October 2006—Research cluster reaches 600 nodes.
•   December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3
    hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
•   January 2007—Research cluster reaches 900 nodes.
•   April 2007—Research clusters—2 clusters of 1000 nodes.
•   April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
•   October 2008—Loading 10 terabytes of data per day onto research clusters.
•   March 2009—17 clusters with a total of 24,000 nodes.
•   April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1400
    nodes) and the 100 terabyte sort in 173 minutes (on 3400 nodes).
Who uses Hadoop?
Amazon/A9
Facebook
Google
IBM
Joost
Last.fm
New York Times
PowerSet
Veoh
Yahoo!
Twitter.com
How does HDFS Work?

Let us suppose we have a file of size 300 MB.

Side note: MapReduce has undergone a complete overhaul in
hadoop-0.23, and we now have what we call MapReduce 2.0
(MRv2), or YARN. The fundamental idea of MRv2 is to split up
the two major functionalities of the JobTracker, resource
management and job scheduling/monitoring, into separate
daemons.
How does HDFS Work?

HDFS splits the file into blocks.
The size of each block is 128 MB, so the 300 MB file becomes
two 128 MB blocks plus one 44 MB block.
How does HDFS Work?

HDFS will keep 3 copies of each block.
HDFS stores these blocks on datanodes and distributes the
replicas across the DNs.
How does HDFS Work?

The Name Node tracks blocks and Data Nodes.

[Figure: a Name Node tracking blocks across a set of DNs]
How does HDFS Work?

Sometimes a datanode will die. That is not a problem: the Name
Node notices the missing replicas and re-replicates the affected
blocks onto other datanodes.
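To make this concrete in code, here is a minimal sketch using Hadoop's Java FileSystem API to copy a file into HDFS and then ask the Name Node where each block's replicas landed. The file names and paths are hypothetical; the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Copy a (hypothetical) 300 MB local file into HDFS; HDFS splits it into
    // blocks (e.g. two 128 MB blocks plus one 44 MB block) and replicates each.
    Path dst = new Path("/data/file300mb.dat");
    fs.copyFromLocalFile(new Path("file300mb.dat"), dst);

    // The Name Node reports which datanodes hold each block's replicas.
    FileStatus status = fs.getFileStatus(dst);
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
          + " datanodes=" + Arrays.toString(b.getHosts()));
    }
  }
}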
Example for MapReduce
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
Map output
Worker 1:
  (the 1), (weather 1), (is 1), (good 1).
Worker 2:
  (today 1), (is 1), (good 1).
Worker 3:
  (good 1), (weather 1), (is 1), (good 1).
Reduce Output
Worker 1:
  (the 1)
Worker 2:
  (is 3)
Worker 3:
  (weather 2)
Worker 4:
  (today 1)
Worker 5:
  (good 4)
MapReduce Architecture
MapReduce: Programming Model
Process data using special map() and reduce()
  functions
  The map() function is called on every item in the
    input and emits a series of intermediate key/value
    pairs
  All values associated with a given key are grouped
    together
  The reduce() function is called on every unique
    key, and its value list, and emits a value that is
    added to the output
MapReduce: Programming Model

[Figure: MapReduce dataflow for the input lines "How now
brown cow" and "How does it work now". Map tasks emit
<word, 1> pairs; the MapReduce framework groups the values
by key (e.g. <How, 1 1>, <now, 1 1>); reduce tasks sum each
list to produce the output: brown 1, cow 1, does 1, How 2,
it 1, now 2, work 1.]
MapReduce: Programming Model
More formally,
   Map(k1, v1) --> list(k2, v2)
   Reduce(k2, list(v2)) --> list(v2)
MapReduce Life Cycle

Write a Map function and a Reduce function, then run the
program as a MapReduce job.
Hadoop Environment
Hadoop has become the kernel of the distributed
  operating system for Big Data
The project includes these modules:
Hadoop Common: The common utilities that support
  the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A
  distributed file system that provides high-throughput
  access to application data.
Hadoop YARN: A framework for job scheduling and
  cluster resource management.
Hadoop MapReduce: A YARN-based system for
  parallel processing of large data sets.
Hadoop Architecture

The ecosystem as a stack, from top to bottom:
• Hue (Web Console) and Mahout (Data Mining)
• Oozie (Job Workflow & Scheduling)
• Zookeeper (Coordination), spanning the stack
• Sqoop/Flume (Data Integration) and Pig/Hive (Analytical Language)
• MapReduce Runtime (Dist. Programming Framework) and Hbase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS) at the base
Zookeeper – Coordination Framework
What is ZooKeeper
• A centralized service for
   •   maintaining configuration information and naming,
   •   providing distributed synchronization,
   •   and providing group services.
• A set of tools to build distributed applications that can
  safely handle partial failures
• ZooKeeper was designed to store coordination data
   •   Status information
   •   Configuration
   •   Location information
ZooKeeper
• ZooKeeper allows distributed processes to coordinate
  with each other through a shared hierarchical name
  space of data registers (we call these registers
  znodes), much like a file system.
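As a small hedged sketch of what coordinating through znodes looks like with the Java client (the znode path and payload are made up; a ZooKeeper server is assumed on localhost:2181):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
  public static void main(String[] args) throws Exception {
    // Connect with a 3-second session timeout; this watcher ignores events.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

    // Create a persistent znode holding a piece of configuration data.
    zk.create("/app/config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any process in the cluster can now read the same register.
    byte[] data = zk.getData("/app/config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}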
Flume / Sqoop – Data Integration Framework
FLUME

• Flume is:
   • A distributed, reliable, and available data collection service
   • It efficiently collects, aggregates, and moves large
     amounts of log data
   • Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for collecting data of all formats
• It has a simple and flexible architecture based on
  streaming data flows.
Flume: High-Level Overview
• Logical Node
• Source
• Sink
Flume Architecture

[Figure: many log sources feed Flume nodes, which aggregate
events and write them into HDFS]
Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently
    transferring bulk data between Apache Hadoop and
    structured data stores such as relational databases.
•   Easy, parallel database import/export
•   What can you do with it?
    •   Import data from an RDBMS into HDFS
    •   Export data from HDFS back into an RDBMS
Sqoop Architecture & Example

[Figure: Sqoop sits between HDFS and an RDBMS, moving data in
both directions]

$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...

$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
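The reverse direction looks much the same; a hedged sketch, where City_copy is a hypothetical, already-created MySQL table with a matching schema:

$ sqoop export --connect jdbc:mysql://localhost/world --username root --table City_copy --export-dir City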
Pig / Hive – Analytical Language
Why Hive and Pig?
Although MapReduce is very powerful, it can also be
   complex to master
Many organizations have business or data analysts who
   are skilled at writing SQL queries, but not at writing
   Java code
Many organizations have programmers who are skilled
   at writing code in scripting languages
Hive and Pig are two projects which evolved separately
   to help such people analyze huge amounts of data
   via MapReduce
   Hive was initially developed at Facebook, Pig at Yahoo!
Hive
What is Hive?
   An SQL-like interface to Hadoop
Data warehouse infrastructure that provides easy data
  summarization, ad hoc querying, and analysis of large
  datasets stored in Hadoop-compatible file systems.
   MapReduce for execution
   HDFS for storage
Hive Query Language
   Basic SQL: Select, From, Join, Group-By
   Equi-Join, Multi-Table Insert, Multi-Group-By
   Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid
Hive

Hive compiles SQL into MapReduce jobs:

   SQL --> Hive --> MapReduce
Pig
Apache Pig is a platform to analyze large data sets.
In simple terms: you have lots and lots of data on which you
   need to do some processing or analysis. One way is to write
   MapReduce code and then run that processing over the data.
The other way is to write Pig scripts, which are in turn
   converted to MapReduce code that processes your data.
Pig consists of two parts:
•   The Pig Latin language
•   The Pig engine

A = LOAD 'a.txt' AS (id, name, age);    -- remaining fields elided
B = LOAD 'b.txt' AS (id, address);      -- remaining fields elided
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
Pig Latin & Pig Engine
• Pig Latin is a scripting language which allows you to
    describe how data from one or more inputs should be
    read, how it should be processed, and then where it
    should be stored.
•   The flows can be simple or complex, with some
    processing applied in between. Data can be picked up
    from multiple inputs.
•   We can say Pig Latin describes a directed acyclic
    graph where the edges are data flows and the nodes
    are operators that process the data.
•   Pig Engine:
•   The job of the engine is to execute the data flow written
    in Pig Latin in parallel on the Hadoop infrastructure.
Why Pig is required when we can code it all in MR
• Pig provides all the standard data processing operations,
    such as sort, group, join, filter, order by, and union, right
    inside Pig Latin.
•   In MR we have to do lots of manual coding.
    Pig optimizes Pig Latin scripts while compiling
    them into MR jobs.
•   It creates an optimized version of MapReduce to run on
    the Hadoop infrastructure.
•   It takes much less time to write a Pig Latin script than to
    write the corresponding MR code.
•   Where Pig is useful:
    Transactional ETL data pipelines (the most common use)
    Research on raw data
    Iterative processing
Pig

Pig compiles scripts into MapReduce jobs:

   Script --> Pig --> MapReduce
WordCount Example
• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  <   Hello, 1>
  <   World, 1>
  <   Bye, 1>
  <   World, 1>
  <   Hello, 1>
  <   Hadoop, 1>
  <   Goodbye, 1>
  <   Hadoop, 1>

• The reduce just sums up the values
  <   Bye, 1>
  <   Goodbye, 1>
  <   Hadoop, 2>
  <   Hello, 2>
  <   World, 2>
WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) for every token.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the 1s for each word and emits (word, total count).
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires the mapper and reducer into a job and runs it.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage() AS
(token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;
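The script above assumes the input already holds one token per line. If each line is a whole sentence, a common variant (a sketch, not from the original slides) splits lines with Pig's built-in TOKENIZE:

A = LOAD 'wordcount/input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS token;
C = GROUP B BY token;
D = FOREACH C GENERATE group, COUNT(B) AS count;
DUMP D;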
WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;


SELECT token, count(*) FROM wordcount GROUP BY token;
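As with the Pig version, this assumes one token per line. For line-oriented input, a hedged variant (the docs table is hypothetical) uses Hive's split and explode built-ins:

CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE docs;
SELECT word, count(*) FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w GROUP BY word;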
Hbase – Column NoSQL DB
Structured-data vs Raw-data
• Apache HBase™ is the Hadoop database, a
  distributed, scalable, big data store.
  • Apache HBase is an open-source, distributed, versioned, column-
    oriented store modeled after Google's Bigtable: A Distributed Storage
    System for Structured Data by Chang et al.

• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
  –   PUT
  –   GET
  –   DELETE
  –   SCAN
Hbase
HBase is a type of "NoSQL" database. NoSQL?
 "NoSQL" is a general term meaning that the database
 isn't an RDBMS which supports SQL as its primary
 access language, but there are many types of NoSQL
 databases: BerkeleyDB is an example of a local
 NoSQL database, whereas HBase is very much a
 distributed database.

 Technically speaking, HBase is really more a "Data
 Store" than "Data Base" because it lacks many of the
 features you find in an RDBMS, such as typed
 columns, secondary indexes, triggers, and advanced
 query languages, etc.
HBase Examples
hbase>   create 'mytable', 'mycf'
hbase>   list
hbase>   put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase>   put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase>   put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase>   scan 'mytable'
hbase>   disable 'mytable'
hbase>   drop 'mytable'




Hbase reference : http://hbase.apache.org
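The same operations are available through the Java client API; a minimal sketch using the HBase 1.x-style client (same table and values as the shell example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {
      // Equivalent of: put 'mytable', 'row1', 'mycf:col1', 'val1'
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
      table.put(put);

      // Read the cell back.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"))));
    }
  }
}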
Oozie – Job Workflow & Scheduling
What is Oozie?
Oozie is a server-based workflow scheduler system to
   manage Apache Hadoop jobs (e.g. loading data, storing
   data, analyzing data, cleaning data, running MapReduce
   jobs, etc.)
   A Java web application
Oozie is a workflow scheduler for Hadoop.
Workflows are triggered by time or by data availability.

[Figure: a workflow DAG of five jobs]
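Workflows themselves are XML definitions stored in HDFS; submitting one through the Oozie Java client API looks roughly like this hedged sketch (the server URL and application path are placeholders):

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

    // Point at a workflow.xml already uploaded to HDFS (placeholder path).
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/wordcount-wf");

    String jobId = oozie.run(conf);  // submit and start the workflow
    System.out.println("status: " + oozie.getJobInfo(jobId).getStatus());
  }
}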
Mahout – Data Mining
What is Mahout?
A machine-learning tool
Distributed and scalable machine learning algorithms
  on the Hadoop platform
Makes building intelligent applications easier and faster
Its core algorithms for clustering, classification, and
  batch-based collaborative filtering are implemented
  on top of Apache Hadoop using the map/reduce
  paradigm.
Mahout Use Cases
Yahoo: Spam Detection
Foursquare: Recommendations
SpeedDate.com: Recommendations
Adobe: User Targeting
Amazon: Personalization Platform
Use Case Example
Predict what a user will like based on
   their historical behavior
   the aggregate behavior of people similar to them
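A hedged sketch of that idea using Mahout's Taste recommender API (the ratings.csv file of userID,itemID,rating triples and the user ID are hypothetical):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendDemo {
  public static void main(String[] args) throws Exception {
    // Historical behavior: userID,itemID,rating triples (hypothetical file).
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // "People similar to them": the 10 nearest users by Pearson correlation.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 5 items predicted for user 42.
    for (RecommendedItem item : recommender.recommend(42, 5)) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}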
Recap – Hadoop Architecture

• Hue (Web Console) and Mahout (Data Mining)
• Oozie (Job Workflow & Scheduling)
• Zookeeper (Coordination)
• Sqoop/Flume (Data Integration) and Pig/Hive (Analytical Language)
• MapReduce Runtime (Dist. Programming Framework) and Hbase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS)
THANKS
