SlideShare uma empresa Scribd logo
1 de 36
Baixar para ler offline
Reaching 10,000
Aaron Cordova
Booz Allen Hamilton | Hadoop Meetup DC | Sep 7 2010
cordova_aaron@bah.com
Lots of Applications Require
Scalability
                   Machine Learning
     Text
                                Defense
              Intelligence
Graph Analytics                         Bio-Metrics
                               Video
         Bio-Informatics
                                 Network Security
     Images
                      Structured Data
Hadoop Scales
Linear Scalability
Cost ->




                           Data Size ->
          Shared Nothing                  Shared Disk
Massive Parallelism
MapReduce

Simplified Distributed Programming Model
Fault Tolerant
Designed to Scale to Thousands of Servers
Many Algorithms Easily Expressed as Map and Reduce
HDFS

Distributed File System
Optimized for High-Throughput
Fault Tolerant Through Replication, Checksumming
Designed to Scale to 10,000 servers
Hadoop is a Platform
Pig

    MapReduce
                       HBase

Cascading                  Flume
            HDFS

                        Nutch
 Mahout
                Hive
HBase

Scalable Structured store
Fast Lookups
Durable, Consistent Writes
Automatic Partitioning
Mahout


Scalable Machine Learning Algorithms
Clustering
Classification
Fuzzy Table


Low-Latency Parallel Search
Generalized Fuzzy Matching
Images, Biometrics, Audio
One Major Problem
HDFS Single NameNode

Single NameSpace - easy to serialize operations
NameSpace stored entirely in memory
Changes written to transaction log first
Single Point of Failure
Performance Bottleneck?
NameNode Scalability
                 “100,000 HDFS clients on a 10,000-node
                 HDFS cluster will exceed the throughput
                 capacity of a single name-node.
                 ... any solution intended for single
                 namespace server optimization lacks
Konstantin       scalability.
Shvachko
                 ... the most promising solutions seem to
Login Apr 2010   be based on distributing the namespace
                 server ...”
Goal
                                    50

       writes/second (thousands)
                                   37.5



                                    25



                                   12.5



                                     0



                                      Single NN   Target
HDFS Single NameNode

Server grade machine
Lots of memory
Reliable components
RAID
Hot-Failover
Needs Parallelism
Scaling NameNode

Grow memory
Read-only Replicas of NameNode
Multiple static namespace partitions
Distributed name server, partition namespace
dynamically
Distributed NameNode
Features

Fast Lookups
Durable, Consistent writes
Automatic Partitioning
Can we use HBase?
Mappings as HBase Tables
NameSpace

filename : blocks   DataNodes

                   node : blocks   Blocks

                                   block : nodes
How to order namespace?
Depth First Search Order
                /
                /dir1
                /dir1/subdir
                /dir1/subdir/file
                /dir2/file1
                /dir2/file2
Depth First Operations


   Delete (Recursive)
    Move / Rename
Breadth First Search Order
                0/
                1/dir1
                2/dir2/file1
                2/dir2/file2
                2/dir1/subdir
                3/dir2/subdir/file
Breadth First Operations



            List
Current Architecture
 NameNode




DataNode    DataNode   DFSClient   DFSClient
Proposed Architecture


 RServer   RServer    RServer       RServer




DNNProxy   DNNProxy    DNNProxy        DNNProxy

DataNode   DataNode     DFSClient       DFSClient
100k clients -> 41k writes/s
Anticipated Performance
                             50
writes/second (thousands)




                            37.5



                             25



                            12.5



                              0
                               100               150                      200            250
                                                 # machines hosting namespace

                                     Single NN           Distributed NN         Target
Issues


Synchronization - multiple writers, changes
Name distribution hotspots
Current Status

Working code exists that uses HBase with slightly
modified DFSClient and DataNode for create, write,
close, open, read, mkdirs, delete.
New component: HealthServer monitors DataNodes
and does garbage collection. More like BigTable
master, can die, restart without affecting clients.
Code


Will be at http://code.google.com/p/hdfs-dnn
Available under the Apache license - whichever is
compatible with Hadoop
Doesn’t HBase run on HDFS?
Self-Hosted HBase

May be possible to have HBase use the same HDFS
instance it’s supporting
Some recursion and self-reference already exists:
HBase Metadata table is itself a table in HBase
Have to work out bootstrapping and failure recovery to
resolve any potential circular dependencies

Mais conteúdo relacionado

Mais procurados

HDFS introduction
HDFS introductionHDFS introduction
HDFS introduction
injae yeo
 
HBase User Group #9: HBase and HDFS
HBase User Group #9: HBase and HDFSHBase User Group #9: HBase and HDFS
HBase User Group #9: HBase and HDFS
Cloudera, Inc.
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
DataWorks Summit
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
elliando dias
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Anand Kulkarni
 

Mais procurados (20)

HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Redis vs Memcached
Redis vs MemcachedRedis vs Memcached
Redis vs Memcached
 
HDFS introduction
HDFS introductionHDFS introduction
HDFS introduction
 
HBase User Group #9: HBase and HDFS
HBase User Group #9: HBase and HDFSHBase User Group #9: HBase and HDFS
HBase User Group #9: HBase and HDFS
 
Hug syncsort etl hadoop big data
Hug syncsort etl hadoop big dataHug syncsort etl hadoop big data
Hug syncsort etl hadoop big data
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Automation of Hadoop cluster operations in Arm Treasure Data
Automation of Hadoop cluster operations in Arm Treasure DataAutomation of Hadoop cluster operations in Arm Treasure Data
Automation of Hadoop cluster operations in Arm Treasure Data
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workl...
 
MyCassandra (Full English Version)
MyCassandra (Full English Version)MyCassandra (Full English Version)
MyCassandra (Full English Version)
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
HDFS & ASM
HDFS & ASMHDFS & ASM
HDFS & ASM
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfs
 

Destaque

Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014
Tsuyoshi OZAWA
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 
ckan 2.0: Harvesting from other sources
ckan 2.0: Harvesting from other sourcesckan 2.0: Harvesting from other sources
ckan 2.0: Harvesting from other sources
Chengjen Lee
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
datasalt
 

Destaque (16)

Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012Facebook's HBase Backups - StampedeCon 2012
Facebook's HBase Backups - StampedeCon 2012
 
Hadoop-2 @ eBay
Hadoop-2 @ eBayHadoop-2 @ eBay
Hadoop-2 @ eBay
 
Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014
 
Final version sql over hadoop ver1
Final version sql over hadoop ver1Final version sql over hadoop ver1
Final version sql over hadoop ver1
 
12 SQL On-Hadoop Tools
12 SQL On-Hadoop Tools12 SQL On-Hadoop Tools
12 SQL On-Hadoop Tools
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
 
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
Security needs in Hadoop’s Current and Future – How Apache Ranger can help?
 
DCAT-AP exchanging metadata
DCAT-AP exchanging metadataDCAT-AP exchanging metadata
DCAT-AP exchanging metadata
 
DCAT: a tale of exchanging metadata
DCAT: a tale of exchanging metadataDCAT: a tale of exchanging metadata
DCAT: a tale of exchanging metadata
 
ckan 2.0: Harvesting from other sources
ckan 2.0: Harvesting from other sourcesckan 2.0: Harvesting from other sources
ckan 2.0: Harvesting from other sources
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Hadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster AccessHadoop Operations: How to Secure and Control Cluster Access
Hadoop Operations: How to Secure and Control Cluster Access
 
HBase ArcheTypes
HBase ArcheTypesHBase ArcheTypes
HBase ArcheTypes
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 

Semelhante a Design for a Distributed Name Node

Dynamic Namespace Partitioning with Giraffa File System
Dynamic Namespace Partitioning with Giraffa File SystemDynamic Namespace Partitioning with Giraffa File System
Dynamic Namespace Partitioning with Giraffa File System
DataWorks Summit
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
yarapavan
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
preetik9044
 

Semelhante a Design for a Distributed Name Node (20)

RuG Guest Lecture
RuG Guest LectureRuG Guest Lecture
RuG Guest Lecture
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Dynamic Namespace Partitioning with Giraffa File System
Dynamic Namespace Partitioning with Giraffa File SystemDynamic Namespace Partitioning with Giraffa File System
Dynamic Namespace Partitioning with Giraffa File System
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hbase
HbaseHbase
Hbase
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and Kafka
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
 
Securing your Big Data Environments in the Cloud
Securing your Big Data Environments in the CloudSecuring your Big Data Environments in the Cloud
Securing your Big Data Environments in the Cloud
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 

Mais de Aaron Cordova (7)

Apache Accumulo and the Data Lake
Apache Accumulo and the Data LakeApache Accumulo and the Data Lake
Apache Accumulo and the Data Lake
 
Large Scale Accumulo Clusters
Large Scale Accumulo ClustersLarge Scale Accumulo Clusters
Large Scale Accumulo Clusters
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
 
Accumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and RoadmapAccumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and Roadmap
 
Text Indexing in Accumulo
Text Indexing in AccumuloText Indexing in Accumulo
Text Indexing in Accumulo
 
Accumulo on EC2
Accumulo on EC2Accumulo on EC2
Accumulo on EC2
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Design for a Distributed Name Node