SlideShare uma empresa Scribd logo
1 de 23
++
Hadoop 기본 과정
++
Overview
HDFSHDFSHDFSHDFS
ImpalaImpalaImpalaImpala
MapReduceMapReduceMapReduceMapReduce
CascadingCascadingCascadingCascading HiveHiveHiveHive
++
Big Data for What?
Service
CAP Theorem, Fast Response ,Scale Out , Schema
Free ...
Distributor with RDBMS
NoSQL
MongoDB , HBASE , CouchDB ...
Analysis
Hadoop <--- today’s topic!!!
++
What’s Hadoop
Consist of
HDFS (Hadoop Distributed File System)
MapReduce
++
HDFS Architecture
master
namenode
slave
bunch of
datanode
NameNodeNameNodeNameNodeNameNode
DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode
++
single master
Strong Point
simple architecture
master have global knowledge.
file and block namespace (memory and disk)
mapping from files to blocks (memory and disk)
location of each block’s replicas ( only memory)
master can make sophisticated decisions.
++
single master
Weak Point
SPOF(= single point of failure )
bottleneck
minimizing master’s involvement is important
++
Fast Recovery for NameNode
Secondary Namenode
crawls namenode’s
operation log
maintains
namenode’s data
NameNodeNameNodeNameNodeNameNode
DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode
Secondary NameNodeSecondary NameNodeSecondary NameNodeSecondary NameNode
++
HA for NameNode
active namenode
do normal
namenode’s
operation
standby namenode
maintain
namenode’s data
ready to be active
namenode
NameNode(active)NameNode(active)NameNode(active)NameNode(active)
DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode
NameNode(standby)NameNode(standby)NameNode(standby)NameNode(standby)
++
block
each file consists of blocks
size
default 64M
replication ( default 3 )
++
write operation
client send ‘write request’ to
namenode
namenode lock file and select
datanode to be written.
namenode response datanode list
to client.
client send file content to datanode.
datanode store file and relay to
other datanode.
finally client send close request to
namenode.
namenode release write lock
NameNodeNameNodeNameNodeNameNode
DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode
clientclientclientclient
write lock & allocate datanodewrite lock & allocate datanodewrite lock & allocate datanodewrite lock & allocate datanode
++
read operation
client send ‘read request’ to
namenode
namenode lock file and select
datanode to be written.
namenode response datanode list
to client.
client send read request to
datanode.
datanode send content to client
finally client send close request to
namenode.
namenode release read lock
NameNodeNameNodeNameNodeNameNode
DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode
clientclientclientclient
read lockread lockread lockread lock
++
block(again)
reason to use big-size-block
reduce client’s need to interact with namenode
reduce the size of metadata stored on namenode
++
namenode’s operation
namespace management and locking
replica placement
creation, re-replication, rebalancing
garbage collection
stale replica detection
++ namespace management and locking
goal
ensure proper serialization
use read lock/write lock
++
block replica placement
goal
maximize data reliability and availability
maximize network bandwidth utilization
default strategy is ...
one on same datanode.
one on other datanode in same rack.
one on other datanode in other rack.
++ creation, re-replication, rebalancing
creation
client create new files
consider
disk space utilization
number of recent creation
spread replicas
re-replication
number of available replica falls below proper goal
datanode down, replica corruption ...
rebalancing
move replicas for better disk space and load balancing
++
garbage collection
what’s garbage?
block not in namenode’s metadata.
mechanism
when exchanging HeartBeat with namenode,
datanode reports subset of block it has.
master replies with garbage blocks.
datanode deletes grabage blocks.
++
stale replica detection
mechanism
storing with generation timestamp.
when restarting, datanode reports its set of blocks with its
generation timestamp
++
Datanode’s operation
check data integrity
datanode use checksumming to detect corruption.
++
filesystem api
hdfs provide basic linux utilities.
ex)
hdfs dfs -mkdir -p /foo
hdfs dfs -ls /foo
hdfs dfs -cat /foo/bar.txt
hdfs dfs -rm -r /foo
++
etc
raid?
native library?
++
end
thanks ....

Mais conteúdo relacionado

Mais procurados

HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User ReferenceBiju Nair
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFSApache Apex
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Ravi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi namboori
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisSameer Tiwari
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemAnshul Bhatnagar
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemAnand Kulkarni
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14jijukjoseph
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 

Mais procurados (19)

HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Ravi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS ArchitectureRavi Namboori Hadoop & HDFS Architecture
Ravi Namboori Hadoop & HDFS Architecture
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
March 2011 HUG: HDFS Federation
March 2011 HUG: HDFS FederationMarch 2011 HUG: HDFS Federation
March 2011 HUG: HDFS Federation
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
HDFS_Command_Reference
HDFS_Command_ReferenceHDFS_Command_Reference
HDFS_Command_Reference
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Redis vs Memcached
Redis vs MemcachedRedis vs Memcached
Redis vs Memcached
 

Semelhante a HDFS introduction

Hadoop- A Highly Available and Secure Enterprise DataWarehousing solution
Hadoop- A Highly Available and Secure Enterprise DataWarehousing solutionHadoop- A Highly Available and Secure Enterprise DataWarehousing solution
Hadoop- A Highly Available and Secure Enterprise DataWarehousing solutionEdureka!
 
Design for a Distributed Name Node
Design for a Distributed Name NodeDesign for a Distributed Name Node
Design for a Distributed Name NodeAaron Cordova
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabsSiva Sankar
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryIJRESJOURNAL
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataMarco Garcia
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges DataWorks Summit
 

Semelhante a HDFS introduction (20)

Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop- A Highly Available and Secure Enterprise DataWarehousing solution
Hadoop- A Highly Available and Secure Enterprise DataWarehousing solutionHadoop- A Highly Available and Secure Enterprise DataWarehousing solution
Hadoop- A Highly Available and Secure Enterprise DataWarehousing solution
 
Design for a Distributed Name Node
Design for a Distributed Name NodeDesign for a Distributed Name Node
Design for a Distributed Name Node
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on Raspberry
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 

Último

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 

Último (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 

HDFS introduction

  • 3. ++ Big Data for What? Service CAP Theorem, Fast Response ,Scale Out , Schema Free ... Distributor with RDBMS NoSQL MongoDB , HBASE , CouchDB ... Analysis Hadoop <--- today’s topic!!!
  • 4. ++ What’s Hadoop Consist of HDFS (Hadoop Distributed File System) MapReduce
  • 5. ++ HDFS Architecture master namenode slave bunch of datanode NameNodeNameNodeNameNodeNameNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode
  • 6. ++ single master Strong Point simple architecture master have global knowledge. file and block namespace (memory and disk) mapping from files to blocks (memory and disk) location of each block’s replicas ( only memory) master can make sophisticated decisions.
  • 7. ++ single master Weak Point SPOF(= single point of failure ) bottleneck minimizing master’s involvement is important
  • 8. ++ Fast Recovery for NameNode Secondary Namenode crawls namenode’s operation log maintains namenode’s data NameNodeNameNodeNameNodeNameNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode Secondary NameNodeSecondary NameNodeSecondary NameNodeSecondary NameNode
  • 9. ++ HA for NameNode active namenode do normal namenode’s operation standby namenode maintain namenode’s data ready to be active namenode NameNode(active)NameNode(active)NameNode(active)NameNode(active) DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode NameNode(standby)NameNode(standby)NameNode(standby)NameNode(standby)
  • 10. ++ block each file consists of blocks size default 64M replication ( default 3 )
  • 11. ++ write operation client send ‘write request’ to namenode namenode lock file and select datanode to be written. namenode response datanode list to client. client send file content to datanode. datanode store file and relay to other datanode. finally client send close request to namenode. namenode release write lock NameNodeNameNodeNameNodeNameNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode clientclientclientclient write lock & allocate datanodewrite lock & allocate datanodewrite lock & allocate datanodewrite lock & allocate datanode
  • 12. ++ read operation client send ‘read request’ to namenode namenode lock file and select datanode to be written. namenode response datanode list to client. client send read request to datanode. datanode send content to client finally client send close request to namenode. namenode release read lock NameNodeNameNodeNameNodeNameNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode DataNodeDataNodeDataNodeDataNode clientclientclientclient read lockread lockread lockread lock
  • 13. ++ block(again) reason to use big-size-block reduce client’s need to interact with namenode reduce the size of metadata stored on namenode
  • 14. ++ namenode’s operation namespace management and locking replica placement creation, re-replication, rebalancing garbage collection stale replica detection
  • 15. ++ namespace management and locking goal ensure proper serialization use read lock/write lock
  • 16. ++ block replica placement goal maximize data reliability and availability maximize network bandwidth utilization default strategy is ... one on same datanode. one on other datanode in same rack. one on other datanode in other rack.
  • 17. ++ creation, re-replication, rebalancing creation client create new files consider disk space utilization number of recent creation spread replicas re-replication number of available replica falls below proper goal datanode down, replica corruption ... rebalancing move replicas for better disk space and load balancing
  • 18. ++ garbage collection what’s garbage? block not in namenode’s metadata. mechanism when exchanging HeartBeat with namenode, datanode reports subset of block it has. master replies with garbage blocks. datanode deletes grabage blocks.
  • 19. ++ stale replica detection mechanism storing with generation timestamp. when restarting, datanode reports its set of blocks with its generation timestamp
  • 20. ++ Datanode’s operation check data integrity datanode use checksumming to detect corruption.
  • 21. ++ filesystem api hdfs provide basic linux utilities. ex) hdfs dfs -mkdir -p /foo hdfs dfs -ls /foo hdfs dfs -cat /foo/bar.txt hdfs dfs -rm -r /foo