What is Big Data ?
●   How big is “Big Data”?
    ●   Is 30 to 40 terabytes big data?
    ●   ….
●   Big data are datasets that grow so large that they
    become awkward to work with using on-hand
    database management tools
●   Today Terabyte, Petabyte, Exabyte
●   Tomorrow ?
Enterprises & Big Data
●   Most companies are currently using traditional tools to
    store data
●   Big data: The next frontier for innovation, competition,
    and productivity
●   The use of big data will become a key basis of competition
●   Organisations across the globe need to take the rising
    importance of big data more seriously
Hadoop is an ecosystem, not a single product.




When you deal with big data, the data center is your computer.
•   A Brief History of Hadoop
•   Contributors and Development
•   What is Hadoop
•   Why Hadoop
•   Hadoop Ecosystem
A Brief History of Hadoop
•   Hadoop has its origins in Apache Nutch

•   Nutch was started in 2002

•   Challenge: how to handle the billions of pages on the Web?

•   2003: Google published the GFS (Google File System) paper

•   2004: NDFS (Nutch Distributed File System)

•   2004: Google published the MapReduce paper

•   2005: Nutch developers started working on their own MapReduce
    implementation
Contributors and Development




Lifetime patches contributed for all Hadoop-related projects: community members by
current employer
* source : JIRA tickets
Contributors and Development




* Resource: Kerberos Konference (Yahoo) – 2010
Development in ASF/Hadoop
●   Resources
    ●   Mailing List
    ●   Wiki Pages , blogs
    ●   Issue Tracking – JIRA
    ●   Version Control SVN – Git
What is Hadoop
•   Open-source project administered by the ASF

•   Data-intensive storage

•   Massively parallel processing (MPP)

•   Enables applications to work with thousands of nodes and
    petabytes of data

•   Suitable for applications with large data sets
What is Hadoop ?

•   Scalable

•   Fault Tolerance

•   Reliable data storage using the Hadoop Distributed
    File System (HDFS)

•   High-performance parallel data processing using a
    technique called MapReduce
What is Hadoop ?

•   Hadoop is becoming the de facto standard for large-scale
    data processing

•   Becoming more than just MapReduce

•   The ecosystem is growing rapidly, with lots of great tools around it
What is Hadoop ?



Yahoo Hadoop Cluster
38,000 machines distributed across 20 different clusters.
Source: Yahoo 2010

50,000 m : January 2012
Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/

SGI Hadoop Cluster
Why Hadoop?
•       Hadoop has its origins in Apache Nutch
•       Can process big data (petabytes and more)
•       Virtually unlimited data storage and analysis (scale out by adding nodes)
•       No licence cost: Apache License 2.0
•       Can be built out of commodity hardware
•       IT cost reduction
    •        Results
         •      Be one step ahead of the competition
         •      Stay there
Is Hadoop an alternative to an RDBMS?
 •   At the moment Apache Hadoop is not a substitute for a database
 •   No relations
 •   Key-value pairs
 •   Big data
 •   Unstructured (text)
 •   Semi-structured (sequence / binary files)
 •   Structured (HBase, modeled on Google Bigtable)
 •   Works fine together with an RDBMS
Hadoop Ecosystem
   ETL Tools           BI Reporting     RDBMS


Pig (Data   Flow)      Hive (SQL)        Sqoop


 MapReduce (Job     Scheduling/Execution System)

HBase (Key-Value store)



                        HDFS
        (Hadoop Distributed File System)
Hadoop Ecosystem
           Important components of Hadoop


•   HDFS: A distributed, fault-tolerant file system
•   MapReduce: A parallel data processing framework
•   Hive: A query framework (like SQL)
•   Pig: A query scripting tool


•   HBase : realtime read/write access to your Big Data
Hadoop Ecosystem
Hadoop is a Distributed Data Computing Platform
HDFS




NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file
metadata: which files are in the system and how each file is broken down into blocks. The
DataNodes store the blocks (including their replicas) and constantly report to the NameNode
to keep the metadata current.
Hadoop Cluster
Writing Files To HDFS


               •   Client consults NameNode
               •   Client writes block directly to
                   one DataNode
               •   DataNode replicates block
               •   Cycle repeats for next block
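
The write steps above can be sketched as a toy in-memory model. This is only an illustration, not the HDFS client API: the class names `DataNode` and `NameNode`, the `write_file` helper, the block id format, and the round-robin placement are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy DataNode: keeps blocks in a dict (hypothetical model)."""
    name: str
    blocks: dict = field(default_factory=dict)

    def store(self, block_id, data, pipeline):
        # Store the block locally, then forward it to the next node,
        # which is how the block ends up replicated.
        self.blocks[block_id] = data
        if pipeline:
            pipeline[0].store(block_id, data, pipeline[1:])


class NameNode:
    """Toy NameNode: only tracks metadata, never touches block data."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("not enough DataNodes for the replication factor")
        self.datanodes = datanodes
        self.replication = replication
        self.metadata = {}  # file name -> list of (block_id, [DataNode names])
        self._next = 0

    def allocate(self, filename, block_index):
        # Round-robin placement is a simplification chosen for this sketch.
        n = len(self.datanodes)
        targets = [self.datanodes[(self._next + i) % n]
                   for i in range(self.replication)]
        self._next = (self._next + 1) % n
        block_id = f"{filename}#blk{block_index}"
        self.metadata.setdefault(filename, []).append(
            (block_id, [node.name for node in targets]))
        return block_id, targets


def write_file(namenode, filename, data, block_size):
    """Client side: ask the NameNode, write to one DataNode, repeat per block."""
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id, targets = namenode.allocate(filename, index)
        targets[0].store(block_id, data[start:start + block_size], targets[1:])
```

The client only ever talks to the first DataNode of each block; the remaining copies are created by the DataNodes forwarding the block among themselves.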
Reading Files From HDFS




•   Client consults NameNode
•   Client receives Data Node list for each block
•   Client picks first Data Node for each block
•   Client reads blocks sequentially
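
As a rough sketch of the read steps above, assuming the NameNode answer is an ordered list of blocks with their DataNode lists, and each DataNode is modelled as a plain dict. The function name `read_file` and both data shapes are hypothetical, not the HDFS API.

```python
def read_file(block_locations, datanodes):
    """Reassemble a file from its blocks.

    block_locations: ordered list of (block_id, [DataNode names]),
                     standing in for the NameNode's answer.
    datanodes:       DataNode name -> {block_id: bytes}, standing in
                     for the cluster.
    """
    chunks = []
    for block_id, holders in block_locations:
        first = holders[0]                         # pick the first DataNode listed
        chunks.append(datanodes[first][block_id])  # read the block from it
    return b"".join(chunks)                        # blocks are read in order
```

For example, with two blocks held on `DN1` and `DN2` respectively, the client contacts one DataNode per block and concatenates the results in block order.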
Rackawareness & Fault Tolerance

NameNode

Rack Aware
  Rack 1: DN1, DN2, DN3, DN5
  Rack 5: DN5, DN6, DN7, DN8
  Rack N

Metadata (File.txt)
  Blk A: DN1, DN5, DN6
  Blk B: DN1, DN2, DN9
  Blk C: DN5, DN9, DN10
•   Never lose all data if an entire rack fails
•   Traffic within a rack has higher bandwidth and lower latency
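
A minimal sketch of rack-aware placement, assuming a simplified version of the usual HDFS policy: one replica on the writer's node and two on a different rack. The topology dict and the function names `place_replicas` and `survives_rack_failure` are made up for this example, and the choice of "first suitable rack, first two nodes" is a simplification.

```python
def place_replicas(topology, local_node):
    """Pick three replica holders spanning two racks.

    topology: rack name -> list of node names (hypothetical layout).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[local_node]
    # First other rack that can hold two replicas.
    remote_rack = next(rack for rack, nodes in topology.items()
                       if rack != local_rack and len(nodes) >= 2)
    second, third = topology[remote_rack][:2]
    return [local_node, second, third]


def survives_rack_failure(replicas, topology, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    failed = set(topology[failed_rack])
    return any(node not in failed for node in replicas)
```

Because the replicas always span two racks, losing any single rack leaves at least one copy readable, while two of the three copies still share a rack and benefit from the cheaper in-rack traffic.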
Cluster Health
Hadoop Ecosystem
           Important components of Hadoop


•   HDFS: A distributed, fault-tolerant file system
•   MapReduce: A parallel data processing framework
•   Hive: A query framework (like SQL)
•   Pig: A query scripting tool
•   HBase: A column-oriented database for OLTP
MapReduce-Paradigm
•   Simplified Data Processing on Large Clusters
•   Splitting a big problem/data set into little pieces
•   Key-Value
MapReduce-Batch Processing
•       Phases
    •     Map
    •     Sort/Shuffle
    •     Reduce (Aggregation)
•       Coordination
    •     Job Tracker
    •     Task Tracker
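
The three phases can be illustrated with a single-process word count. This is a sketch of the paradigm only, not the Hadoop MapReduce API; the function names and the use of whitespace-separated words as keys are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (key, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_phase(mapped):
    """Sort/Shuffle: group all emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: aggregate the values of each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(splits):
    """Run map on every split, then shuffle, then reduce."""
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle_phase(mapped))
```

In a real cluster each split would be mapped on a different node and the grouped keys would be partitioned across several reducers; here everything runs in one process to keep the data flow visible.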
MapReduce-Map
[Figure: Datanodes 1, 2 and 3 each run a MAP task that emits four key/value (K, V) pairs, every pair with the value 1]
MapReduce-Sort/Shuffle
[Figure: on Datanodes 1, 2 and 3 a SORT step groups the emitted 1 values by key]
MapReduce-Reduce
[Figure: after SORT, REDUCE tasks on Datanodes 1, 2 and 3 sum the 1 values per key (K, V); the reduced values shown are 4, 2, 3 and 3]
MapReduce-All Phases
[Figure: the complete MAP, SORT and REDUCE pipeline; the reduced values shown are 4, 2, 3 and 3]
MapReduce-Job & Task Tracker

[Figure: JobTracker and TaskTrackers alongside the Namenode and Datanodes]

JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data
processing job, the JobTracker partitions the work and assigns different map and reduce tasks
to each TaskTracker in the cluster.
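
A very rough sketch of that assignment step, assuming plain round-robin scheduling. A real JobTracker also weighs data locality and free task slots, which this example ignores. The function name `assign_tasks` and the task naming scheme are hypothetical.

```python
from itertools import cycle


def assign_tasks(input_splits, num_reducers, task_trackers):
    """Hand out one map task per input split, then the reduce tasks.

    Returns a dict: TaskTracker name -> list of task names.
    """
    assignments = {tracker: [] for tracker in task_trackers}
    trackers = cycle(task_trackers)  # round-robin over the TaskTrackers
    for index, _split in enumerate(input_splits):
        assignments[next(trackers)].append(f"map-{index}")
    for index in range(num_reducers):
        assignments[next(trackers)].append(f"reduce-{index}")
    return assignments
```

The number of map tasks follows from the number of input splits, while the number of reduce tasks is a job setting, which is why the two are passed separately.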
Summary of HDFS and MR
Hive
•   Data warehousing package built on top of Hadoop
•   It began its life at Facebook, processing large amounts of user
    and log data
•   Hadoop subproject with many contributors
•   Ad hoc queries, summarization, and data analysis on
    Hadoop-scale data
•   Directly query data in different formats (text/binary) and file
    formats (flat/sequence)
•   HiveQL: an SQL-like query language
Hive Components
[Figure: Hive components: Mgmt. Web UI, Hive CLI (Browsing, Queries, DDL), Thrift API, MetaStore, Hive QL (Parser, Planner, Execution), Map Reduce, HDFS]
*Thrift: interface definition language
Pig
•       Pig Latin is the language used to express data flows
•       Pig Latin can be extended using UDFs (user-defined functions)
•       Pig was originally developed at Yahoo Research
•       PigPen is an Eclipse plug-in that provides an environment for
        developing Pig programs
•       Running Pig programs
    •       Script: a script file that contains Pig commands
    •       Grunt: an interactive shell
    •       Embedded: run from Java
Pig
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
      AS (year:chararray, temperature:int, quality:int);

grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}

grunt> filtered_records = FILTER records BY temperature != 22;
grunt> DUMP filtered_records;

grunt> grouped_records = GROUP records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
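
For readers who do not know Pig Latin, the FILTER and GROUP steps of the session above correspond roughly to the following plain Python. This only mirrors the semantics on the five sample tuples; it says nothing about how Pig executes the script on a cluster, and the function names are made up for this example.

```python
from itertools import groupby

# The five sample tuples from the session: (year, temperature, quality).
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]


def filter_records(rows):
    """Counterpart of: FILTER records BY temperature != 22"""
    return [row for row in rows if row[1] != 22]


def group_by_year(rows):
    """Counterpart of: GROUP records BY year"""
    ordered = sorted(rows, key=lambda row: row[0])  # stable sort by year
    return {year: list(group)
            for year, group in groupby(ordered, key=lambda row: row[0])}
```

Each key of the grouped result maps to the full tuples for that year, matching the shape of the bags that DUMP prints.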
HBase
•   Random, realtime read/write access to your Big Data

•   Billions of rows X millions of columns

•   Column-oriented store modeled after Google's BigTable

•   provides Bigtable-like capabilities on top of Hadoop and HDFS

•   HBase is not a column-oriented database in the typical RDBMS
    sense, but utilizes an on-disk column storage format
HBase-Datamodel
    •        (Table, RowKey, Family,Column, Timestamp) → Value




•       Think of tags. Values any length, no predefined names or widths

•       Column names carry info (just like tags)
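
The mapping (Table, RowKey, Family, Column, Timestamp) → Value can be pictured as a versioned dictionary. The sketch below is a toy in-memory model, not the HBase client API: the class `Table`, its `put` and `get` methods, and the default of three retained versions are assumptions for illustration.

```python
class Table:
    """Toy model of one HBase table as a versioned key-value map."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, family, column) -> list of (timestamp, value), newest first
        self.cells = {}

    def put(self, row, column, value, timestamp):
        family, qualifier = column.split(":", 1)  # e.g. "cf:a" -> "cf", "a"
        versions = self.cells.setdefault((row, family, qualifier), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda item: item[0], reverse=True)
        del versions[self.max_versions:]          # drop versions beyond the limit

    def get(self, row, column, versions=1):
        family, qualifier = column.split(":", 1)
        stored = self.cells.get((row, family, qualifier), [])
        return [value for _timestamp, value in stored[:versions]]
```

Writing twice to the same row and column does not overwrite the cell; it adds a newer version, and a read returns only the newest one unless more versions are requested.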
Create Sample Table
hbase(main):003:0> create 'test', 'cf'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
hbase(main):007:0> scan 'test'
ROW       COLUMN+CELL
row1     column=cf:a, timestamp=1288380727188, value=value12
row2     column=cf:b, timestamp=1288380738440, value=value2
row3     column=cf:c, timestamp=1288380747365, value=value3
hbase(main):007:0> scan 'test', { VERSIONS => 3 }
ROW       COLUMN+CELL
row1     column=cf:a, timestamp=1288380727188, value=value12
row1     column=cf:a, timestamp=1288380727188, value=value11
row2     column=cf:b, timestamp=1288380738440, value=value2
row3     column=cf:c, timestamp=1288380747365, value=value3
Hbase-Architecture
•   Splits

•   Auto-Sharding

•   Master

•   Region Servers

•   HFile
Splits & RegionServers




•   Rows grouped in regions and served by different servers
•   Table dynamically split into “regions”
•   Each region contains values [startKey, endKey)
•   Regions hosted on a regionserver
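
Since each region covers a half-open key range [startKey, endKey), finding the region for a row key is a sorted-boundary lookup. The sketch below assumes string row keys; the class `RegionMap`, the split keys, and the server names are hypothetical and only illustrate the idea.

```python
from bisect import bisect_right


class RegionMap:
    """Maps row keys to region servers using sorted split keys.

    With split keys ["g", "p"] the regions are
    [ , "g"), ["g", "p") and ["p", ) in that order.
    """

    def __init__(self, split_keys, servers):
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = sorted(split_keys)
        self.servers = servers

    def locate(self, row_key):
        # bisect_right puts a key equal to a split key into the region
        # that starts at that key, because startKey is inclusive.
        index = bisect_right(self.split_keys, row_key)
        return self.servers[index]
```

When a table is split further, only the list of split keys and the server assignment change; the lookup itself stays the same.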
Other Components
•   Flume

•   Sqoop
Commercial Products
•   Oracle Big Data Appliance

•   Microsoft Azure + Excel + MapReduce

•   Cloud computing, Amazon elastic computing

•   IBM Hadoop-based InfoSphere BigInsights

•   VMware Spring for Apache Hadoop

•   Toad for Cloud Databases

•   MapR, Cloudera, Hortonworks, Datameer
Thank You



Faruk Berksöz
fberksoz@gmail.com

 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Hadoop hbase mapreduce

  • 2. What is Big Data? ● How big is “Big Data”? ● Is 30-40 terabytes big data? ● …. ● Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools ● Today: terabytes, petabytes, exabytes ● Tomorrow?
  • 3. Enterprises & Big Data ● Most companies are currently using traditional tools to store data ● Big data: The next frontier for innovation, competition, and productivity ● The use of big data will become a key basis of competition ● Organisations across the globe need to take the rising importance of big data more seriously
  • 4. Hadoop is an ecosystem, not a single product. When you deal with BigData, the data center is your computer.
  • 5. A Brief History of Hadoop • Contributors and Development • What is Hadoop • Why Hadoop • Hadoop Ecosystem
  • 6. A Brief History of Hadoop • Hadoop has its origins in Apache Nutch • Nutch was started in 2002 • Challenge: how to handle the billions of pages on the Web? • 2003: Google published the GFS (Google File System) paper • 2004: NDFS (Nutch Distributed File System) • 2004: Google published the MapReduce paper • 2005: Nutch developers started implementing MapReduce
  • 8. Contributors and Development: Lifetime patches contributed for all Hadoop-related projects: community members by current employer * Source: JIRA tickets
  • 10. Contributors and Development * Source: Kerberos Konference (Yahoo) – 2010
  • 11. Development in ASF/Hadoop ● Resources ● Mailing List ● Wiki Pages , blogs ● Issue Tracking – JIRA ● Version Control SVN – Git
  • 13. What is Hadoop • Open-source project administered by the ASF • Data-intensive storage • and Massively Parallel Processing (MPP) • Enables applications to work with thousands of nodes and petabytes of data • Suitable for applications with large data sets
  • 14. What is Hadoop? • Scalable • Fault tolerant • Reliable data storage using the Hadoop Distributed File System (HDFS) • High-performance parallel data processing using a technique called MapReduce
  • 15. What is Hadoop? • Hadoop is becoming the de facto standard for large-scale data processing • It is becoming more than just MapReduce • The ecosystem is growing rapidly, with lots of great tools around it
  • 16. What is Hadoop? • Yahoo Hadoop Cluster: 38,000 machines distributed across 20 different clusters (Source: Yahoo, 2010) • 50,000 m : January 2012 (Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/) • SGI Hadoop Cluster
  • 21. Why Hadoop? • Hadoop has its origins in Apache Nutch • Can process Big Data (petabytes and more) • Unlimited data storage and analysis • No licence cost: Apache License 2.0 • Can be built from commodity hardware • IT cost reduction • Results • Be one step ahead of the competition • Stay there
  • 22. Is Hadoop an alternative to an RDBMS? • At the moment Apache Hadoop is not a substitute for a database • No relations • Key-value pairs • Big Data • Unstructured (text) • Semi-structured (Seq / binary files) • Structured (HBase, modeled after Google BigTable) • Works fine together with an RDBMS
  • 24. Hadoop Ecosystem [Figure: layered stack, top to bottom: ETL Tools, BI Reporting, RDBMS | Pig (Data Flow), Hive (SQL), Sqoop | MapReduce (Job Scheduling/Execution System) | HBase (Key-Value store) | HDFS (Hadoop Distributed File System)]
  • 25. Hadoop Ecosystem: Important components of Hadoop • HDFS: a distributed, fault-tolerant file system • MapReduce: a parallel data processing framework • Hive: a query framework (like SQL) • Pig: a query scripting tool • HBase: realtime read/write access to your Big Data
  • 26. Hadoop Ecosystem Hadoop is a Distributed Data Computing Platform
  • 27. HDFS
  • 28. HDFS: NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes provide backup store of the blocks and constantly report to the NameNode to keep the metadata current.
  • 30. Writing Files To HDFS • Client consults NameNode • Client writes block directly to one DataNode • DataNode replicates block • Cycle repeats for next block
  • 31. Reading Files From HDFS • Client consults NameNode • Client receives Data Node list for each block • Client picks first Data Node for each block • Client reads blocks sequentially
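The write and read steps above can be sketched in a few lines of Python. This is only an illustration, not HDFS client code: the 64 MB block size is an assumed (configurable) value, the helper names `split_into_blocks` and `read_plan` are hypothetical, and the block-to-DataNode lists are borrowed from the rack-awareness slide.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) for every block of a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def read_plan(block_locations):
    """For each block, pick the first DataNode in the list the NameNode returned."""
    return [(block_id, nodes[0]) for block_id, nodes in block_locations]


# A 200 MB file becomes three full blocks plus one 8 MB tail block.
blocks = split_into_blocks(200 * MB)

# Hypothetical NameNode answer: block id -> DataNodes holding a replica.
block_locations = [
    ("Blk A", ["DN1", "DN5", "DN6"]),
    ("Blk B", ["DN1", "DN2", "DN9"]),
    ("Blk C", ["DN5", "DN9", "DN10"]),
]
plan = read_plan(block_locations)
```

The client then reads the blocks in order, contacting only the DataNode chosen for each block; the NameNode is consulted for metadata, not for the data itself.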
  • 32. Rack Awareness & Fault Tolerance [Figure: NameNode rack-aware metadata for File.txt: Blk A: DN1, DN5, DN6; Blk B: DN1, DN2, DN9; Blk C: DN5, DN9, DN10. DataNodes are grouped into Rack 1, Rack 5, ... Rack N.] • Never lose all data if an entire rack fails • Within a rack there is higher bandwidth and lower latency
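A rough sketch of rack-aware replica placement, assuming the commonly described HDFS default of three replicas: the first on the writer's node and the other two on two different nodes of one other rack. The topology below and the function name `place_replicas` are hypothetical, chosen only for illustration.

```python
import random


def place_replicas(topology, writer, rng=None):
    """Pick three DataNodes for one block.

    topology maps rack name -> list of DataNode names (assumed layout).
    Replica 1 goes to the writer's node, replicas 2 and 3 go to two
    distinct nodes on a single other rack.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need another rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


# Hypothetical two-rack cluster.
topology = {
    "rack1": ["DN1", "DN2", "DN3"],
    "rack5": ["DN5", "DN6", "DN7", "DN8"],
}
replicas = place_replicas(topology, "DN1", random.Random(0))
```

With this placement a whole rack can fail and at least one replica of the block survives, while two of the three copies still share the cheaper in-rack network path.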
  • 34. Hadoop Ecosystem: Important components of Hadoop • HDFS: a distributed, fault-tolerant file system • MapReduce: a parallel data processing framework • Hive: a query framework (like SQL) • Pig: a query scripting tool • HBase: a column-oriented database for OLTP
  • 35. MapReduce-Paradigm • Simplified Data Processing on Large Clusters • Splitting a big problem/data set into little pieces • Key-Value
  • 36. MapReduce-Batch Processing • Phases • Map • Sort/Shuffle • Reduce (Aggregation) • Coordination • Job Tracker • Task Tracker
  • 37. MapReduce-Map [Figure: on each of Datanode 1, 2 and 3 a MAP task emits key-value (K, V) pairs, each with value 1.]
  • 38. MapReduce-Sort/Shuffle [Figure: on each of Datanode 1, 2 and 3 a SORT step groups the emitted pairs by key.]
  • 39. MapReduce-Reduce [Figure: on each of Datanode 1, 2 and 3 a REDUCE step aggregates the sorted values into one result per key, for example 4, 3 and 3.]
  • 40. MapReduce-All Phases [Figure: the full MAP, SORT and REDUCE pipeline across the three Datanodes.]
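The three phases can be imitated in plain Python on a single machine. This is a minimal sketch of the paradigm, not the Hadoop API: the input splits and the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration, with one split standing in for the data on each Datanode.

```python
from collections import defaultdict


def map_phase(split):
    """MAP: emit a (key, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapped_outputs):
    """SORT/SHUFFLE: bring all values for the same key together, keys sorted."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """REDUCE: aggregate the list of values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# One hypothetical split per Datanode.
splits = ["a b a", "b a c", "a c b c"]
mapped = [map_phase(split) for split in splits]
counts = reduce_phase(shuffle(mapped))
```

Each map call only sees its own split, and each reduce result only depends on the values of one key, which is what lets a real cluster run both phases in parallel.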
  • 41. MapReduce-Job & Task Tracker [Figure labels: Namenode, Datanodes] JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
  • 42. Summary of HDFS and MR
  • 44. Hive
  • 45. Hive • Data warehousing package built on top of Hadoop • It began its life at Facebook, processing large amounts of user and log data • Hadoop subproject with many contributors • Ad hoc queries, summarization, and data analysis on Hadoop-scale data • Directly query data in different formats (text/binary) and file formats (Flat/Sequence) • HiveQL: like SQL
  • 46. Hive Components [Figure: Hive CLI (Browsing, Queries, DDL), Mgmt. Web UI and Thrift API on top of Hive QL (Parser, Planner, Execution) and the MetaStore, running on Map Reduce and HDFS.] * Thrift: Interface Definition Language
  • 48. Pig • Data flows are expressed in a language called Pig Latin • Pig Latin can be extended using UDFs (User Defined Functions) • Originally developed at Yahoo Research • PigPen is an Eclipse plug-in that provides an environment for developing Pig programs • Running Pig programs • Script: a script file that contains Pig commands • Grunt: an interactive shell • Embedded: from Java
  • 49. Pig
      grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
      grunt> DUMP records;
      (1950,0,1)
      (1950,22,1)
      (1950,-11,1)
      (1949,111,1)
      (1949,78,1)
      grunt> DESCRIBE records;
      records: {year: chararray,temperature: int,quality: int}
      grunt> filtered_records = FILTER records BY temperature != 22;
      grunt> DUMP filtered_records;
      grunt> grouped_records = GROUP records BY year;
      grunt> DUMP grouped_records;
      (1949,{(1949,111,1),(1949,78,1)})
      (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
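For readers who do not know Pig Latin, the FILTER and GROUP steps of the session above correspond roughly to the following plain Python. This is only an analogy on the five sample records; Pig itself compiles such statements into MapReduce jobs rather than running them in memory.

```python
from collections import defaultdict

# The five (year, temperature, quality) records shown by DUMP records.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]

# FILTER records BY temperature != 22
filtered_records = [r for r in records if r[1] != 22]

# GROUP records BY year: one bag of records per distinct year.
groups = defaultdict(list)
for record in records:
    groups[record[0]].append(record)
grouped_records = dict(sorted(groups.items()))
```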
  • 51. HBase • Random, realtime read/write access to your Big Data • Billions of rows × millions of columns • Column-oriented store modeled after Google's BigTable • Provides BigTable-like capabilities on top of Hadoop and HDFS • HBase is not a column-oriented database in the typical RDBMS sense, but uses an on-disk column storage format
  • 52. HBase-Datamodel • (Table, RowKey, Family,Column, Timestamp) → Value • Think of tags. Values any length, no predefined names or widths • Column names carry info (just like tags)
  • 53. HBase-Datamodel • (Table, RowKey, Family,Column, Timestamp) → Value
  • 54. HBase-Datamodel • (Table, RowKey, Family,Column, Timestamp) → Value
  • 55. Create Sample Table
      hbase(main):003:0> create 'test', 'cf'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
      hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
      hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
      hbase(main):007:0> scan 'test'
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
      hbase(main):007:0> scan 'test', { VERSIONS => 3 }
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row1    column=cf:a, timestamp=1288380727188, value=value11
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
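The (Table, RowKey, Family, Column, Timestamp) → Value model behind this session can be mimicked with a small in-memory class. This is a toy sketch, not the HBase client API: the class name `TinyTable`, its methods and the integer timestamps are all assumptions made for illustration.

```python
class TinyTable:
    """Toy versioned cell store: (row, 'family:qualifier') -> [(ts, value), ...]."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = {}

    def put(self, row, column, value, ts):
        """Store a new version of a cell; older versions are kept, newest first."""
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(key=lambda pair: pair[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, column, versions=1):
        """Return up to `versions` values of a cell, newest first."""
        return [value for _, value in self.cells.get((row, column), [])[:versions]]


test = TinyTable()
test.put("row1", "cf:a", "value11", ts=1)
test.put("row1", "cf:a", "value12", ts=2)
test.put("row2", "cf:b", "value2", ts=3)

latest = test.get("row1", "cf:a")                 # like a plain scan
history = test.get("row1", "cf:a", versions=3)    # like VERSIONS => 3
```

A second put to the same row and column does not overwrite the first one; it adds a newer version, which is why the plain scan shows only value12 while the versioned scan also shows value11.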
  • 56. HBase-Architecture • Splits • Auto-Sharding • Master • Region Servers • HFile
  • 57. Splits & RegionServers • Rows grouped in regions and served by different servers • Table dynamically split into “regions” • Each region contains values [startKey, endKey) • Regions hosted on a regionserver
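Because each region covers a half-open key range [startKey, endKey), finding the region for a row key is a binary search over the sorted split keys. The sketch below assumes a hypothetical `RegionMap` class with made-up split keys and server names; it is not how the HBase client is implemented, only an illustration of the range lookup.

```python
import bisect


class RegionMap:
    """Map row keys to regions defined by sorted split keys.

    With split keys ["g", "p"] there are three regions:
    [-inf, "g"), ["g", "p") and ["p", +inf).
    """

    def __init__(self, split_keys, servers):
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Return (region index, region server) responsible for row_key."""
        # bisect_right makes the start key inclusive and the end key exclusive.
        index = bisect.bisect_right(self.split_keys, row_key)
        return index, self.servers[index]


regions = RegionMap(["g", "p"], ["rs1", "rs2", "rs3"])
```

For example, `regions.locate("orange")` falls in the middle region, and a row key equal to a split key such as `"g"` belongs to the region that starts at that key.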
  • 59. Other Components • Flume • Sqoop
  • 60. Commercial Products • Oracle Big Data Appliance • Microsoft Azure + Excel + MapReduce • Cloud computing, Amazon elastic computing • IBM Hadoop-based InfoSphere BigInsights • VMware Spring for Apache Hadoop • Toad for Cloud Databases • MapR, Cloudera, Hortonworks, Datameer