SlideShare a Scribd company logo
1 of 29
Download to read offline
Hadoop Distributed File System

                Dhruba Borthakur
   Apache Hadoop Project Management Committee
              dhruba@apache.org
             dhruba@facebook.com
Hadoop, Why?
•  Need to process huge datasets on large clusters
   of computers
•  Nodes fail every day
   – Failure is expected, rather than exceptional.
   – The number of nodes in a cluster is not constant.
•  Very expensive to build reliability into each
   application.
•  Need common infrastructure
   – Efficient, reliable, easy to use
   – Open Source, Apache License
Hadoop History
•    Dec 2004 – Google paper published
•    July 2005 – Nutch uses new MapReduce implementation
•    Jan 2006 – Doug Cutting joins Yahoo!
•    Feb 2006 – Hadoop becomes a new Lucene subproject
•    Apr 2007 – Yahoo! running Hadoop on 1000-node cluster
•    Jan 2008 – An Apache Top Level Project
•    Feb 2008 – Yahoo! production search index with Hadoop
•    July 2008 – First reports of a 4000-node cluster
Who uses Hadoop?
•    Amazon/A9
•    Facebook
•    Google
•    IBM
•    Joost
•    Last.fm
•    New York Times
•    PowerSet
•    Veoh
•    Yahoo!
What is Hadoop used for?
•  Search
   –  Yahoo, Amazon, Zvents,
•  Log processing
   –  Facebook, Yahoo, ContextWeb. Joost, Last.fm
•  Recommendation Systems
   –  Facebook
•  Data Warehouse
   –  Facebook, AOL
•  Video and Image Analysis
   –  New York Times, Eyealike
Public Hadoop Clouds
•  Hadoop Map-reduce on Amazon EC2 instances
    –  http://wiki.apache.org/hadoop/AmazonEC2
•  IBM Blue Cloud
   –  Partnering with Google to offer web-scale infrastructure
•  Global Cloud Computing Testbed
   –  Joint effort by Yahoo, HP and Intel
Commodity Hardware



Typically in 2 level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
Goals of HDFS
•  Very Large Distributed File System
   – 10K nodes, 100 million files, 10 PB
•  Assumes Commodity Hardware
   – Files are replicated to handle hardware failure
   – Detect failures and recovers from them
•  Optimized for Batch Processing
   – Data locations exposed so that computations can
   move to where data resides
   – Provides very high aggregate bandwidth
Distributed File System
•  Single Namespace for entire cluster
•  Data Coherency
   – Write-once-read-many access model
   – Client can only append to existing files
•  Files are broken up into blocks
   – Typically 128 MB block size
   – Each block replicated on multiple DataNodes
•  Intelligent Client
   – Client can find location of blocks
   – Client accesses data directly from DataNode
Functions of a NameNode
•  Manages File System Namespace
   – Maps a file name to a set of blocks
   – Maps a block to the DataNodes where
   it resides
•  Cluster Configuration Management
•  Replication Engine for Blocks
NameNode Metadata
•  Meta-data in Memory
   – The entire metadata is in main memory
   – No demand paging of FS meta-data
•  Types of Metadata
   – List of files
   – List of Blocks for each file
   – List of DataNodes for each block
   – File attributes, e.g access time, replication factor
•  A Transaction Log
   – Records file creations, file deletions. etc
DataNode
•  A Block Server
   – Stores data in the local file system (e.g. ext3)
   – Stores meta-data of a block (e.g. CRC)
   – Serves data and meta-data to Clients
•  Block Report
   – Periodically sends a report of all existing blocks to
   the NameNode
•  Facilitates Pipelining of Data
   – Forwards data to other specified DataNodes
Block Placement
•  Current Strategy
   -- One replica on random node on local rack
   -- Second replica on a random remote rack
   -- Third replica on same remote rack
   -- Additional replicas are randomly placed
•  Clients read from nearest replica
•  Would like to make this policy pluggable
Replication Engine
•  NameNode detects DataNode failures
   – Chooses new DataNodes for new
   replicas
   – Balances disk usage
   – Balances communication traffic to
   DataNodes
Data Correctness
•  Use Checksums to validate data
   – Use CRC32
•  File Creation
   – Client computes checksum per 512 byte
   – DataNode stores the checksum
•  File access
   – Client retrieves the data and checksum
   from DataNode
   – If Validation fails, Client tries other replicas
Namenode Failure
•  A single point of failure
•  Transaction Log stored in multiple
   directories
   – A directory on the local file system
   – A directory on a remote file system
   (NFS/CIFS)
•  Need to develop a real HA solution
Data Pipelining
•  Client retrieves a list of DataNodes on
   which to place replicas of a block
•  Client writes block to the first DataNode
•  The first DataNode forwards the data to
   the next DataNode in the Pipeline
•  When all replicas are written, the Client
   moves on to the next block in file
Secondary NameNode
•  Copies FsImage and Transaction Log from
   NameNode to a temporary directory
•  Merges FSImage and Transaction Log into a
   new FSImage in temporary directory
•  Uploads new FSImage to the NameNode
   – Transaction Log on NameNode is purged
User Interface
•  Command for HDFS User:
   – hadoop dfs -mkdir /foodir
   – hadoop dfs -cat /foodir/myfile.txt
   – hadoop dfs -rm /foodir myfile.txt
•  Command for HDFS Administrator
   – hadoop dfsadmin -report
   – hadoop dfsadmin -decommission datanodename
•  Web Interface
   – http://host:port/dfshealth.jsp
Hadoop Map/Reduce
•  Implementation of the Map-Reduce programming model
   – Framework for distributed processing of large data sets
   – Data handled as collections of key-value pairs
   – Pluggable user code runs in generic framework
•  Very common design pattern in data processing
   – Demonstrated by a unix pipeline example:
   cat * | grep | sort | unique -c | cat > file
   input | map | shuffle | reduce | output
•  Natural for:
    – Log processing
    – Web search indexing
    – Ad-hoc queries
Hadoop Subprojects
•  Hive
  –  A Data Warehouse with SQL support
•  HBase
  – table storage for semi-structured data
•  Zookeeper
  – coordinating distributed applications
•  Mahout
  –  Machine learning
Hadoop at Facebook
•  Hardware
  –  4800 cores, 600 machines, 16GB/8GB per
     machine – Nov 2008
  –  4 SATA disks of 1 TB each per machine
  –  2 level network hierarchy, 40 machines per
     rack
Hadoop at Facebook
•  Single HDFS cluster across all cores
   –  2 PB raw capacity
   –  Ingest rate is 2 TB compressed per day
          – 10 TB uncompressed
   –  12 Million files
Hadoop Growth at Facebook
Data Flow at Facebook



Web
Servers
           Scribe
Servers



                                                  Network

                                                  Storage





 Oracle
RAC
   Hadoop
Cluster
           MySQL

Hadoop Usage at Facebook
•  Statistics per day:
   –  55TB of compressed data scanned per day
   –  3200+ jobs on production cluster per day
   –  80M compute minutes per day
•  Barrier to entry is significantly reduced:
   –  SQL like language called Hive
   –  http://hadoop.apache.org/hive/
Useful Links
•  HDFS Design:
  – http://hadoop.apache.org/core/docs/current/hdfs_design.html

•  Hadoop API:
   – http://lucene.apache.org/hadoop/api/
•  Hadoop Wiki
   –  http://wiki.apache.org/hadoop/

More Related Content

What's hot

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
Database , 4 Data Integration
Database , 4 Data IntegrationDatabase , 4 Data Integration
Database , 4 Data IntegrationAli Usman
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoopSandeep Patil
 
google file system
google file systemgoogle file system
google file systemdiptipan
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - RangerIsheeta Sanghi
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented DatabasesFabio Fumarola
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7abdulrahmanhelan
 
Distributed data processing
Distributed data processingDistributed data processing
Distributed data processingAyisha Kowsar
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologiesneeraj rathore
 

What's hot (20)

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Database , 4 Data Integration
Database , 4 Data IntegrationDatabase , 4 Data Integration
Database , 4 Data Integration
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
 
Introduction to Big Data and hadoop
Introduction to Big Data and hadoopIntroduction to Big Data and hadoop
Introduction to Big Data and hadoop
 
google file system
google file systemgoogle file system
google file system
 
Document Database
Document DatabaseDocument Database
Document Database
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7
 
Distributed data processing
Distributed data processingDistributed data processing
Distributed data processing
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
 
Kdd process
Kdd processKdd process
Kdd process
 

Viewers also liked

Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3Sung-jae Park
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfsdatabloginfo
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapakapa rohit
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemAnshul Bhatnagar
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
presentation_Hadoop_File_System
presentation_Hadoop_File_Systempresentation_Hadoop_File_System
presentation_Hadoop_File_SystemBrett Keim
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesNitin Khattar
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemAnand Kulkarni
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoopcneudecker
 
Digital jewellery
Digital jewelleryDigital jewellery
Digital jewelleryChitradevi
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems Ioanna Tsalouchidou
 
Artificial Passenger Sulbha
Artificial Passenger   SulbhaArtificial Passenger   Sulbha
Artificial Passenger SulbhaSulbha Bakshi
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureAngelo Gino Varrati
 

Viewers also liked (20)

Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfs
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
presentation_Hadoop_File_System
presentation_Hadoop_File_Systempresentation_Hadoop_File_System
presentation_Hadoop_File_System
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenes
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
 
Digital jewellery
Digital jewelleryDigital jewellery
Digital jewellery
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Artificial Passenger Sulbha
Artificial Passenger   SulbhaArtificial Passenger   Sulbha
Artificial Passenger Sulbha
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azure
 
Amazon S3 Overview
Amazon S3 OverviewAmazon S3 Overview
Amazon S3 Overview
 

Similar to Hadoop Distributed File System

Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-researchsaintdevil163
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopMedia Gorod
 

Similar to Hadoop Distributed File System (20)

Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-research
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Big data and hadoop anupama
Big data and hadoop anupamaBig data and hadoop anupama
Big data and hadoop anupama
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 

More from elliando dias

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slideselliando dias
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScriptelliando dias
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de containerelliando dias
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agilityelliando dias
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Librarieselliando dias
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!elliando dias
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Webelliando dias
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduinoelliando dias
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorceryelliando dias
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Designelliando dias
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makeselliando dias
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.elliando dias
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Studyelliando dias
 

More from elliando dias (20)

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slides
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
 
Ragel talk
Ragel talkRagel talk
Ragel talk
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
 
Minicurso arduino
Minicurso arduinoMinicurso arduino
Minicurso arduino
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
Rango
RangoRango
Rango
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makes
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
 

Recently uploaded

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Hadoop Distributed File System

  • 1. Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
  • 2. Hadoop, Why? •  Need to process huge datasets on large clusters of computers •  Nodes fail every day – Failure is expected, rather than exceptional. – The number of nodes in a cluster is not constant. •  Very expensive to build reliability into each application. •  Need common infrastructure – Efficient, reliable, easy to use – Open Source, Apache License
  • 3. Hadoop History •  Dec 2004 – Google paper published •  July 2005 – Nutch uses new MapReduce implementation •  Jan 2006 – Doug Cutting joins Yahoo! •  Feb 2006 – Hadoop becomes a new Lucene subproject •  Apr 2007 – Yahoo! running Hadoop on 1000-node cluster •  Jan 2008 – An Apache Top Level Project •  Feb 2008 – Yahoo! production search index with Hadoop •  July 2008 – First reports of a 4000-node cluster
  • 4. Who uses Hadoop? •  Amazon/A9 •  Facebook •  Google •  IBM •  Joost •  Last.fm •  New York Times •  PowerSet •  Veoh •  Yahoo!
  • 5. What is Hadoop used for? •  Search –  Yahoo, Amazon, Zvents, •  Log processing –  Facebook, Yahoo, ContextWeb. Joost, Last.fm •  Recommendation Systems –  Facebook •  Data Warehouse –  Facebook, AOL •  Video and Image Analysis –  New York Times, Eyealike
  • 6. Public Hadoop Clouds •  Hadoop Map-reduce on Amazon EC2 instances –  http://wiki.apache.org/hadoop/AmazonEC2 •  IBM Blue Cloud –  Partnering with Google to offer web-scale infrastructure •  Global Cloud Computing Testbed –  Joint effort by Yahoo, HP and Intel
  • 7. Commodity Hardware Typically in 2 level architecture – Nodes are commodity PCs – 30-40 nodes/rack – Uplink from rack is 3-4 gigabit – Rack-internal is 1 gigabit
  • 8. Goals of HDFS •  Very Large Distributed File System – 10K nodes, 100 million files, 10 PB •  Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detect failures and recovers from them •  Optimized for Batch Processing – Data locations exposed so that computations can move to where data resides – Provides very high aggregate bandwidth
  • 9. Distributed File System •  Single Namespace for entire cluster •  Data Coherency – Write-once-read-many access model – Client can only append to existing files •  Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes •  Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
  • 10.
  • 11. Functions of a NameNode •  Manages File System Namespace – Maps a file name to a set of blocks – Maps a block to the DataNodes where it resides •  Cluster Configuration Management •  Replication Engine for Blocks
  • 12. NameNode Metadata •  Meta-data in Memory – The entire metadata is in main memory – No demand paging of FS meta-data •  Types of Metadata – List of files – List of Blocks for each file – List of DataNodes for each block – File attributes, e.g access time, replication factor •  A Transaction Log – Records file creations, file deletions. etc
  • 13. DataNode •  A Block Server – Stores data in the local file system (e.g. ext3) – Stores meta-data of a block (e.g. CRC) – Serves data and meta-data to Clients •  Block Report – Periodically sends a report of all existing blocks to the NameNode •  Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 14. Block Placement •  Current Strategy -- One replica on random node on local rack -- Second replica on a random remote rack -- Third replica on same remote rack -- Additional replicas are randomly placed •  Clients read from nearest replica •  Would like to make this policy pluggable
  • 15. Replication Engine •  NameNode detects DataNode failures – Chooses new DataNodes for new replicas – Balances disk usage – Balances communication traffic to DataNodes
  • 16. Data Correctness •  Use Checksums to validate data – Use CRC32 •  File Creation – Client computes checksum per 512 byte – DataNode stores the checksum •  File access – Client retrieves the data and checksum from DataNode – If Validation fails, Client tries other replicas
  • 17. Namenode Failure •  A single point of failure •  Transaction Log stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) •  Need to develop a real HA solution
  • 18. Data Pipelining •  Client retrieves a list of DataNodes on which to place replicas of a block •  Client writes block to the first DataNode •  The first DataNode forwards the data to the next DataNode in the Pipeline •  When all replicas are written, the Client moves on to the next block in file
  • 19. Secondary NameNode •  Copies FsImage and Transaction Log from NameNode to a temporary directory •  Merges FSImage and Transaction Log into a new FSImage in temporary directory •  Uploads new FSImage to the NameNode – Transaction Log on NameNode is purged
  • 20. User Interface •  Command for HDFS User: – hadoop dfs -mkdir /foodir – hadoop dfs -cat /foodir/myfile.txt – hadoop dfs -rm /foodir myfile.txt •  Command for HDFS Administrator – hadoop dfsadmin -report – hadoop dfsadmin -decommission datanodename •  Web Interface – http://host:port/dfshealth.jsp
  • 21. Hadoop Map/Reduce •  Implementation of the Map-Reduce programming model – Framework for distributed processing of large data sets – Data handled as collections of key-value pairs – Pluggable user code runs in generic framework •  Very common design pattern in data processing – Demonstrated by a unix pipeline example: cat * | grep | sort | unique -c | cat > file input | map | shuffle | reduce | output •  Natural for: – Log processing – Web search indexing – Ad-hoc queries
  • 22.
  • 23. Hadoop Subprojects •  Hive –  A Data Warehouse with SQL support •  HBase – table storage for semi-structured data •  Zookeeper – coordinating distributed applications •  Mahout –  Machine learning
  • 24. Hadoop at Facebook •  Hardware –  4800 cores, 600 machines, 16GB/8GB per machine – Nov 2008 –  4 SATA disks of 1 TB each per machine –  2 level network hierarchy, 40 machines per rack
  • 25. Hadoop at Facebook •  Single HDFS cluster across all cores –  2 PB raw capacity –  Ingest rate is 2 TB compressed per day – 10 TB uncompressed –  12 Million files
  • 26. Hadoop Growth at Facebook
  • 27. Data Flow at Facebook Web
Servers
 Scribe
Servers
 Network
 Storage
 Oracle
RAC
 Hadoop
Cluster
 MySQL

  • 28. Hadoop Usage at Facebook •  Statistics per day: –  55TB of compressed data scanned per day –  3200+ jobs on production cluster per day –  80M compute minutes per day •  Barrier to entry is significantly reduced: –  SQL like language called Hive –  http://hadoop.apache.org/hive/
  • 29. Useful Links •  HDFS Design: – http://hadoop.apache.org/core/docs/current/hdfs_design.html •  Hadoop API: – http://lucene.apache.org/hadoop/api/ •  Hadoop Wiki –  http://wiki.apache.org/hadoop/