Ali Bahu
10/17/2012
APACHE HADOOP
INTRODUCTION
• Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It enables applications to work with thousands of computationally independent computers and petabytes of data.
• Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
• Hadoop is implemented in Java.
WHY HADOOP
• Need to process multi-petabyte datasets
• It is expensive to build reliability into each application
• Node failure is expected, and Hadoop is designed to tolerate it
• The number of nodes is not constant
• Efficient, reliable, and open source under the Apache License
• Workloads are I/O-bound rather than CPU-bound
WHO USES HADOOP
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
COMMODITY HARDWARE
• Typically a two-level architecture
• Nodes are commodity PCs
• 30-40 nodes per rack
• Uplink from each rack is 3-4 Gbit/s
• Rack-internal links are 1 Gbit/s
HDFS ARCHITECTURE
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
• Very Large Distributed File System
  - 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
• Optimized for Batch Processing
  - Data locations are exposed so that computations can move to where the data resides
  - Provides very high aggregate bandwidth
• Runs in user space on heterogeneous operating systems
• Single Namespace for the entire cluster
• Data Coherency
  - Write-once-read-many access model
  - Clients can only append to existing files
• Files are broken up into blocks
  - Typically 128 MB block size
  - Each block is replicated on multiple Data Nodes
• Intelligent Client
  - Client can find the location of blocks
  - Client accesses data directly from the Data Node
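The block arithmetic above can be made concrete with a short sketch. It assumes the 128 MB block size from the slide and a replication factor of 3; the replication factor and the function names are illustrative assumptions, not values or APIs taken from HDFS itself.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # "typically 128 MB", per the slide
REPLICATION = 3                 # assumed replication factor, not stated on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    # Only the last block may be smaller than block_size.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


MB = 1024 * 1024
print(len(split_into_blocks(1024 * MB)))                # 8 blocks for a 1 GB file
print([b // MB for b in split_into_blocks(300 * MB)])   # [128, 128, 44]
print(raw_storage(300 * MB) // MB)                      # 900 MB of raw disk
```

A large block size keeps the number of blocks per file small, which matters because the Name Node tracks every block.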
NAME NODE METADATA
• Metadata in Memory
  - The entire metadata is kept in main memory
  - No demand paging of metadata
• Types of Metadata
  - List of files
  - List of blocks for each file
  - List of Data Nodes for each block
  - File attributes, e.g. creation time, replication factor
• Transaction Log
  - Records file creations, file deletions, etc.
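As a rough illustration of the metadata listed above, the sketch below keeps the file list, the per-file block list, the per-block Data Node list, and a transaction log in ordinary in-memory Python structures. The class and method names are hypothetical, and the log here is a plain list rather than a file on disk.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    replication: int                             # file attribute: replication factor
    created: float                               # file attribute: creation time
    blocks: list = field(default_factory=list)   # ordered block IDs for the file


class NameNodeSketch:
    """Hypothetical in-memory model; not the real Name Node implementation."""

    def __init__(self):
        self.files = {}            # path -> FileMeta (list of files)
        self.block_locations = {}  # block ID -> set of Data Node names
        self.log = []              # transaction log: creations and deletions

    def create(self, path, replication=3):
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = FileMeta(replication, time.time())
        self.log.append(("create", path))

    def add_block(self, path, block_id):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set()

    def report_block(self, block_id, datanode):
        # Called when a Data Node reports that it holds a copy of the block.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def delete(self, path):
        meta = self.files.pop(path)
        for block_id in meta.blocks:
            self.block_locations.pop(block_id, None)
        self.log.append(("delete", path))

    def locate(self, path):
        """Return [(block ID, sorted Data Node names)] in file order."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path].blocks]
```

Because every lookup is a dictionary access in memory, a client can resolve block locations quickly and then read the data directly from the Data Nodes.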
DATA NODE
• Block Server
  - Stores data in the local file system (e.g. ext3)
  - Stores metadata of a block (e.g. CRC)
  - Serves data and metadata to Clients
• Block Report
  - Periodically sends a report of all existing blocks to the Name Node
• Facilitates Pipelining of Data
  - Forwards data to other specified Data Nodes
BLOCK PLACEMENT
• Block Placement Strategy
  - One replica on the local node
  - Second replica on a remote rack
  - Third replica on the same remote rack
  - Additional replicas are randomly placed
• Clients read from the nearest replica
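The placement strategy above can be sketched as follows. The `topology` dictionary (rack name to node names), the function names, and the three-step distance measure used for "nearest" are assumptions made for illustration. The sketch also assumes the cluster has at least one remote rack with two or more nodes.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick Data Nodes for one block, following the slide's strategy.

    topology: dict mapping rack name -> list of node names (hypothetical shape).
    writer:   the node issuing the write, which receives the first replica.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]                                   # 1st replica: local node
    if replication >= 2:                                # 2nd replica: a remote rack
        remote = [n for n in rack_of if rack_of[n] != rack_of[writer]]
        chosen.append(rng.choice(remote))
    if replication >= 3:                                # 3rd replica: same remote rack
        same_rack = [n for n in rack_of
                     if rack_of[n] == rack_of[chosen[1]] and n != chosen[1]]
        chosen.append(rng.choice(same_rack))
    while len(chosen) < replication:                    # extra replicas: random
        rest = [n for n in rack_of if n not in chosen]
        chosen.append(rng.choice(rest))
    return chosen


def nearest_replica(reader, replicas, topology):
    """Prefer a replica on the reader's node, then on its rack, then any other."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    def distance(node):
        if node == reader:
            return 0
        return 1 if rack_of[node] == rack_of[reader] else 2

    return min(replicas, key=distance)
```

Putting two of the three replicas on one remote rack means a block survives the loss of an entire rack while only one copy has to cross the rack uplink on the way in.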
DATA CORRECTNESS
• Use Checksums to validate data
  - Use CRC32, etc.
• File Creation
  - Client computes a checksum per 512 bytes
  - Data Node stores the checksum
• File Access
  - Client retrieves the data and checksum from the Data Node
  - If validation fails, the Client tries other replicas
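A minimal sketch of this checksum scheme, using CRC32 from Python's standard `zlib` module over 512-byte chunks. Modeling each replica as a `(data, stored_checksums)` pair is an assumption made for illustration, as are the function names.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, per the slide


def checksums(data):
    """CRC32 of every 512-byte chunk (the last chunk may be shorter)."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def read_verified(replicas):
    """Return the data of the first replica whose checksums validate.

    replicas: iterable of (data, stored_checksums) pairs, one per Data Node.
    """
    for data, stored in replicas:
        if checksums(data) == stored:
            return data
        # Validation failed: fall through and try the next replica.
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
sums = checksums(original)                # computed by the client at file creation
corrupted = bytearray(original)
corrupted[700] ^= 0xFF                    # flip bits in the second chunk
assert read_verified([(bytes(corrupted), sums), (original, sums)]) == original
```

Checksumming small chunks rather than whole blocks lets a reader detect corruption without reading the full 128 MB block first.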
NAMENODE FAILURE
• A single point of failure
• Transaction Log stored in multiple directories
  - A directory on the local file system
  - A directory on a remote file system (NFS/CIFS)
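The idea of keeping the transaction log in several directories can be sketched as below. The file name `edits.log` and the function name are hypothetical, and the remote directory is assumed to be an ordinary mount point (for example an NFS mount), so both copies are written with normal file calls.

```python
import os


def append_to_logs(record, log_dirs, name="edits.log"):
    """Append one log record to the same file in every configured directory."""
    for directory in log_dirs:
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, name)
        with open(path, "a", encoding="utf-8") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the record to stable storage before continuing
```

If the local disk of the Name Node is lost, the copy on the remote file system still holds every recorded change.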
DATA PIPELINING
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the Client moves on to write the next block in the file
• The Rebalancer is used to ensure that the percentage of disk used is similar across DataNodes
  - Usually run when new DataNodes are added
  - The cluster stays online while the Rebalancer is active
  - The Rebalancer is throttled to avoid network congestion
  - Command line tool
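The write pipeline in the first four bullets above can be sketched as a chain of forwarding calls. The class and function names are hypothetical, and the sketch forwards whole blocks in a single call, which is a simplification of how data would actually be streamed.

```python
class DataNodeSketch:
    """Hypothetical Data Node that stores a block and forwards it downstream."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def receive(self, block_id, data, downstream):
        """Store the block, forward it, and return the names that stored it."""
        self.blocks[block_id] = data
        if downstream:
            next_node, rest = downstream[0], downstream[1:]
            return [self.name] + next_node.receive(block_id, data, rest)
        return [self.name]


def write_file(blocks, pipeline_for):
    """Write blocks one at a time, each through its own pipeline.

    blocks:       list of (block ID, bytes) pairs in file order.
    pipeline_for: callable returning the Data Nodes for a block ID; it stands
                  in for the list the client retrieves before each write.
    """
    for block_id, data in blocks:
        pipeline = pipeline_for(block_id)
        # The client sends the block only to the first Data Node.
        stored_on = pipeline[0].receive(block_id, data, pipeline[1:])
        if len(stored_on) != len(pipeline):
            raise IOError(f"block {block_id} was not written to every replica")
        # All replicas are written: move on to the next block.
```

Because each Data Node forwards the data itself, the client uploads every block once, regardless of the replication factor.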
HADOOP MAP/REDUCE
• The Map-Reduce programming model
  - Framework for distributed processing of large data sets
  - Pluggable user code runs in a generic framework
• Common design pattern in data processing
  - cat * | grep | sort | uniq -c | cat > file
  - input | map | shuffle | reduce | output
• Useful for:
  - Log processing
  - Web search indexing
  - Ad-hoc queries, etc.
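The correspondence between the shell pipeline and the map, shuffle, and reduce stages can be shown with a small single-process sketch. It runs in plain Python and does not use the Hadoop API; all function names are illustrative. The mapper plays the role of `grep`, the shuffle plays the role of `sort`, and the reducer plays the role of `uniq -c`.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user's mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key and sort by key, like `sort` in the shell pipeline."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the user's reducer to each key and its grouped values."""
    return [(key, reducer(key, values)) for key, values in groups]


def run(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Pluggable user code: keep matching lines (grep), then count duplicates (uniq -c).
def grep_mapper(pattern):
    def mapper(line):
        if pattern in line:
            yield line, 1
    return mapper


def count_reducer(key, values):
    return sum(values)


log_lines = ["GET /a", "POST /b", "GET /a", "GET /c"]
print(run(log_lines, grep_mapper("GET"), count_reducer))
# [('GET /a', 2), ('GET /c', 1)]
```

Only `grep_mapper` and `count_reducer` are specific to this job; the three phase functions are the generic framework that the user code plugs into.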
