Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
      Instructor: Professor Lothar Piepmayer




  HDFS at a glance
Agenda

1. Design of HDFS
2.1. HDFS Concepts - Blocks
2.2. HDFS Concepts - Namenode and datanodes
3.1. Dataflow - Anatomy of a file read
3.2. Dataflow - Anatomy of a file write
3.3. Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line
The Design of HDFS

Very large distributed file system
  Up to 10K nodes, 1 billion files, 100PB
Streaming data access
  Write once, read many times
Commodity hardware
  Files are replicated to handle hardware failure
        Detect failures and recover from them
Poor fit for

Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
HDFS Blocks

Normal filesystem blocks are a few kilobytes
HDFS has a large block size
    Default 64MB
    Typical 128MB
Unlike in a filesystem for a single disk, a file in HDFS that is
 smaller than a single block does not occupy a full block
HDFS Blocks


A file is stored as blocks on various nodes in the Hadoop cluster.
HDFS keeps several replicas of each data block, placed on
 different nodes across the cluster.
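
To make the block and replication arithmetic concrete, here is a minimal sketch in Python. It assumes the 64MB default block size from the earlier slide and a replication factor of 3, which the slides do not state. The function names are hypothetical and are not part of any Hadoop API.

```python
MB = 1024 ** 2

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes of the blocks a file of `file_size` bytes is split into.

    The last block holds only the remaining bytes, which is why a small
    file does not occupy a full block.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

def raw_storage(file_size, block_size=64 * MB, replication=3):
    """Total bytes stored across the cluster; replication=3 is an assumption."""
    return sum(split_into_blocks(file_size, block_size)) * replication

# A 150MB file becomes two full blocks plus one 22MB block.
print([size // MB for size in split_into_blocks(150 * MB)])  # [64, 64, 22]
# A 1MB file costs 3MB of raw storage, not 3 x 64MB.
print(raw_storage(1 * MB) // MB)  # 3
```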
HDFS Blocks




Source: Dhruba Borthakur, "Design and Evolution of the Apache Hadoop File System (HDFS)"
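Why are blocks in HDFS so large?

Minimize the cost of seeks
=> Make the time to transfer a block much longer than the time to
   seek to its start, so a large file is read at close to the
   disk transfer rate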
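
The seek-cost argument can be checked with a short calculation. This is a rough sketch: the 10 ms seek time, the 100 MB/s transfer rate, and the 1% target are illustrative assumptions, not measurements from the slides, and the function names are hypothetical.

```python
def min_block_size_mb(seek_time_s, transfer_rate_mb_s, seek_fraction):
    """Block size (MB) at which one seek costs `seek_fraction` of the transfer time."""
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_mb_s * transfer_time_s

def seek_overhead(block_size_mb, seek_time_s, transfer_rate_mb_s):
    """Fraction of total read time spent seeking, with one seek per block."""
    transfer_time_s = block_size_mb / transfer_rate_mb_s
    return seek_time_s / (seek_time_s + transfer_time_s)

# Assumed disk: 10 ms seek, 100 MB/s transfer, seek limited to 1% of transfer time.
print(min_block_size_mb(0.010, 100, 0.01))  # 100.0

# With 4KB blocks almost all time goes to seeking; with 64MB blocks very little does.
print(round(seek_overhead(4 / 1024, 0.010, 100), 3))
print(round(seek_overhead(64, 0.010, 100), 3))
```

Under these assumed figures the break-even block size is about 100MB, the same order of magnitude as the 64MB and 128MB sizes on the earlier slide.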
Benefits of the block abstraction

A file can be larger than any single disk in the network
Simplifies the storage subsystem
Provides fault tolerance and availability
Namenode & Datanodes

 Namenode (master)
 – manages the filesystem namespace
 – maintains the filesystem tree and metadata for all the
 files and directories in the tree.
 Datanodes (slaves)
 – store block data in the local file system
 – periodically report back to the namenode with lists of all
 the blocks they hold
 Clients communicate with both namenode and datanodes.
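
The division of labor above can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: the class `ToyNamenode`, its methods, and the block and datanode names are all hypothetical and bear no relation to the real Hadoop classes.

```python
from collections import defaultdict

class ToyNamenode:
    """Toy stand-in for the namenode's metadata (hypothetical, not Hadoop code)."""

    def __init__(self):
        self.file_blocks = {}                    # path -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> datanodes holding it

    def add_file(self, path, block_ids):
        """Record which blocks make up a file in the namespace."""
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Replace what is known about `datanode` with its latest full report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def get_block_locations(self, path):
        """What a client asks for first: each block, in order, with its holders."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]

nn = ToyNamenode()
nn.add_file("/foo", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
nn.block_report("dn3", ["blk_2"])
print(nn.get_block_locations("/foo"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1', 'dn3'])]
```

In this model the client would contact the namenode only for the location list and then fetch block contents directly from the listed datanodes.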
Anatomy of a File Read


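Benefits:
- The namenode serves only block locations, not data, so it does not become a bottleneck
- Data traffic is spread across the datanodes, so many clients can read concurrently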
Writing in HDFS


[Figure: write dataflow; labels: Namenode, Datanode, Block]
Writing in HDFS


Exceptions: a datanode fails during a write
  The pipeline is closed; the failed node is dropped from it
   and its partial block is deleted
  The namenode arranges a replica on a new datanode
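
The failure handling can be sketched as a small function. This is a simplified illustration, not the HDFS client logic: the function name, the datanode names, and the replication factor of 3 are assumptions.

```python
def handle_datanode_failure(pipeline, failed, spare_datanodes, replication=3):
    """Return (surviving_pipeline, re_replication_targets) after `failed` drops out.

    The write continues on the survivors; the targets are the nodes a
    namenode-like coordinator would pick to restore the replica count.
    """
    survivors = [dn for dn in pipeline if dn != failed]
    missing = max(0, replication - len(survivors))
    candidates = [dn for dn in spare_datanodes
                  if dn != failed and dn not in survivors]
    return survivors, candidates[:missing]

survivors, targets = handle_datanode_failure(
    ["dn1", "dn2", "dn3"], failed="dn2", spare_datanodes=["dn3", "dn4", "dn5"])
print(survivors)  # ['dn1', 'dn3']
print(targets)    # ['dn4']
```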
Coherency Model


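Data being written (for example, during a copy) is not guaranteed to be visible to readers while the write is in progress
Use sync() to force buffered data out to the datanodes and make it visible to new readers
Call sync() at suitable points in applications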
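
A local-file analogy may help here. The sketch below does not use HDFS or its sync() call. It uses an ordinary buffered file in Python, on the assumption that a small write stays in the process buffer until it is flushed, to show how written data can be invisible to another observer until the writer forces it out. The function name and file name are hypothetical.

```python
import os
import tempfile

def visible_sizes(payload: bytes):
    """Return the file size seen from outside before and after the writer syncs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "part-0")
        with open(path, "wb") as out:       # buffered writer
            out.write(payload)              # a small payload stays in the buffer
            before = os.path.getsize(path)  # what another process would see now
            out.flush()                     # push the buffer to the OS
            os.fsync(out.fileno())          # ask the OS to persist it to disk
            after = os.path.getsize(path)
    return before, after

before, after = visible_sizes(b"content")
print(before, after)
```

The trade-off is the same in spirit as on the slide: syncing more often narrows the window in which data could be lost or unseen, at the cost of extra I/O.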
Parallel copying in HDFS

Transfer data between clusters
   % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Implemented as a MapReduce job; each file is copied by a single map
Each map copies at least 256MB
Default maximum is 20 maps per node
Between clusters running different HDFS versions, use the webhdfs protocol:
   % hadoop distcp webhdfs://namenode1:50070/foo
      webhdfs://namenode2:50070/bar
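
The two sizing rules above (at least 256MB per map, at most 20 maps per node) suggest a simple estimate of how many maps a copy would use. This is a back-of-the-envelope sketch based only on those two rules. It is not distcp's actual code, and the function name is hypothetical.

```python
MIN_BYTES_PER_MAP = 256 * 1024 ** 2  # from the slide: each map copies at least 256MB
MAX_MAPS_PER_NODE = 20               # from the slide: default maximum maps per node
GB = 1024 ** 3

def estimate_distcp_maps(total_bytes, nodes):
    """Estimate the number of maps from the two rules, with a floor of one map."""
    by_size = max(1, total_bytes // MIN_BYTES_PER_MAP)
    return min(by_size, MAX_MAPS_PER_NODE * nodes)

print(estimate_distcp_maps(1 * GB, nodes=3))       # 4
print(estimate_distcp_maps(10 * GB, nodes=3))      # 40
print(estimate_distcp_maps(1000 * GB, nodes=100))  # 2000
```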
Setup

Cluster with 3 nodes:
    4 GB RAM
    2 CPUs @ 2.0GHz+
    100 GB HDD
Using VMware on 3 different servers
Network: 100Mbps
Operating System: Ubuntu 11.04
    Windows: Not tested
Setup Guide - Single Node


Prerequisites: Java runtime, ssh
  http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml
Cluster


/etc/hadoop/masters
/etc/hadoop/slaves
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
Command Line

Similar to *nix
    hadoop fs -ls /
    hadoop fs -mkdir /test
    hadoop fs -rmr /test
    hadoop fs -cp /1 /2
    hadoop fs -copyFromLocal /3 hdfs://localhost/
Namenode-specific:
    hadoop namenode -format
    start-all.sh
Command Line

Sorting: Standard method to test cluster
    TeraGen: Generate dummy data
    TeraSort: Sort
    TeraValidate: Validate sort result
Command Line:
    hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar \
     terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
Benchmark Result

2 Nodes, 1GB data: 0:03:38
3 Nodes, 1GB data: 0:03:13

2 Nodes, 10GB data: 0:38:07
3 Nodes, 10GB data: 0:31:28

The virtual machines' hard disks are the bottleneck
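
The benchmark times can be turned into speedup figures with a few lines of Python. The timings are copied from the slide. The comparison against an ideal 1.5x for going from 2 to 3 nodes assumes perfectly linear scaling, which is an idealization.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# (data size, node count) -> elapsed time, as reported on the slide
results = {
    ("1GB", 2): "0:03:38", ("1GB", 3): "0:03:13",
    ("10GB", 2): "0:38:07", ("10GB", 3): "0:31:28",
}

def speedup(size):
    """Speedup from adding the third node for the given data size."""
    return to_seconds(results[(size, 2)]) / to_seconds(results[(size, 3)])

print(f"{speedup('1GB'):.2f}")   # 1.13
print(f"{speedup('10GB'):.2f}")  # 1.21
```

Both figures fall well short of 1.5x, which fits the observation that something other than node count limited throughput.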
Who wins…?
References

Hadoop: The Definitive Guide
