Hadoop Training
Hadoop & HDFS
Agenda
• History of Hadoop
• Hadoop Ecosystem
• Hadoop Animal Planet
• What is Hadoop?
• Distinctions of Hadoop
• Hadoop Components
• The Hadoop Distributed Filesystem
• Design of HDFS
• When Not to Use Hadoop?
• HDFS Concepts
• Anatomy of a File Read
• Anatomy of a File Write
• Replication & Rack Awareness
• MapReduce Components
• Typical MapReduce Job
History of Hadoop
Hadoop Ecosystem
• Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers. HDFS, MapReduce, and YARN form the core of Apache Hadoop.
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• Apache Hadoop YARN (short, in self-deprecating fashion, for Yet Another Resource Negotiator) is a cluster management technology and one of the key features of second-generation Hadoop.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.
• Apache Pig is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs.
• Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable (A Distributed Storage System for Structured Data, Chang et al.).
• ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
• Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase.
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
• Machine learning is a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
• R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis.
• Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
• MongoDB is a document database that provides high performance, high availability, and easy scalability. Documents (objects) map nicely to programming language data types, embedded documents and arrays reduce the need for joins, and the dynamic schema makes polymorphism easier.
• Apache CouchDB, commonly referred to as CouchDB, is an open source database that focuses on ease of use and on being "a database that completely embraces the web". It is a document-oriented NoSQL database that uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
• Cascading is a software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs.
Hadoop Animal Planet
What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Distinctions of Hadoop
• Scalable
• Robust
• Accessible
• Simple
Hadoop Components
[Diagram: a Hadoop cluster. The MapReduce layer consists of a JobTracker (master) and TaskTrackers (slaves); the HDFS layer consists of a NameNode (master) and DataNodes (slaves).]
• HDFS – Hadoop Distributed File System (storage):
• Data is split and distributed across nodes
• Each split is replicated
• NameNode is the master & DataNodes are the slaves
• MapReduce (processing):
• Splits a task across processors
• Execution is near the data & the results are merged
• Self-healing
• JobTracker is the master & TaskTrackers are the slaves
The Hadoop Distributed Filesystem:
• When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage storage across a network of machines are called distributed filesystems. One of the biggest challenges for distributed filesystems is handling node (machine) failure without suffering data loss.
• Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. HDFS is designed to meet this challenge.
Design of HDFS:
• HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
• Very Large Files: files hundreds of gigabytes (GB) or terabytes (TB) in size.
• Streaming Data Access: HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times. A dataset is typically generated or copied from a source, and various analyses are then performed on that dataset over time.
• Commodity Hardware: commonly available, inexpensive hardware (not enterprise-grade).
When Not to Use Hadoop?
• HDFS is not a good fit:
• When you need low-latency (fast) access to data: applications that require fast data access will not work well with HDFS, which is optimized for high data throughput.
• When you have lots of small files: the NameNode holds the filesystem metadata in memory, and every block, file, and directory occupies around 150 bytes. A very large number of small files therefore puts a heavy burden on the NameNode (a rough estimate follows this list).
• When random writes to a file are needed: files in HDFS may be written to by a single writer, and writes are always made at the end of the file, in append-only fashion. There is no support for multiple writers or for modifications at arbitrary offsets (positions).
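To make the small-files point concrete, here is a rough back-of-the-envelope sketch based on the ~150-byte figure above. It assumes each small file costs the NameNode one file entry plus one block entry; actual per-object overheads vary by Hadoop version.

// Rough NameNode heap estimate for many small files, using the
// ~150-byte-per-object figure quoted above. Assumes one file entry plus
// one block entry per small file; real per-object costs vary by version.
public class SmallFilesEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;        // ten million small files
        long bytesPerObject = 150L;      // approximate metadata cost per object
        long objectsPerFile = 2L;        // one file entry + one block entry
        long heapBytes = files * bytesPerObject * objectsPerFile;
        System.out.printf("~%.1f GB of NameNode heap for %,d small files%n",
                heapBytes / 1e9, files);
    }
}

Ten million small files already cost on the order of 3 GB of NameNode heap, regardless of how little data those files actually hold.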
HDFS Concepts
• HDFS Block: an HDFS block is the smallest unit of data that can be read or written in HDFS. HDFS stores files in blocks that are typically at least 64 MB or (more commonly now) 128 MB in size, much larger than the 4-32 KB blocks seen in most filesystems.
• In HDFS, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
Why is a block in HDFS so large?
• Disk seek time and data transfer rate are not at the same level (data transfer is much faster): currently a disk seek takes around 10 ms, while the transfer rate is around 100 MB/s. The block size is therefore chosen to be large, so that the time spent transferring data dominates the time spent seeking, as the calculation below shows.
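A quick calculation makes this concrete. The sketch below uses the 10 ms seek and 100 MB/s transfer figures quoted above; the block sizes themselves are just illustrative.

// Back-of-the-envelope: what fraction of a block read is seek overhead,
// using the 10 ms seek time and 100 MB/s transfer rate quoted above.
public class BlockSizeOverhead {
    public static void main(String[] args) {
        double seekMs = 10.0;                        // average disk seek time
        double transferMBPerSec = 100.0;             // sustained transfer rate
        double[] blockSizesMB = {0.004, 4, 64, 128}; // ~4 KB up to 128 MB

        for (double mb : blockSizesMB) {
            double transferMs = mb / transferMBPerSec * 1000.0;
            double seekShare = seekMs / (seekMs + transferMs) * 100.0;
            System.out.printf("block %8.3f MB: transfer %8.2f ms, seek = %5.1f%% of read%n",
                    mb, transferMs, seekShare);
        }
    }
}

For a 128 MB block the transfer takes about 1.28 s, so the 10 ms seek is well under 1% of the read; for a 4 KB block, the seek dominates almost completely.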
HDFS Concepts:
• NameNode
• The master node, responsible for the entire cluster
• Manages the filesystem namespace
• Runs on enterprise-grade hardware
• DataNode
• Slaves which run on commodity/cheap hardware
• Store and retrieve data when told to (by the client or the NameNode)
• Send heartbeat signals to the NameNode, along with reports of the blocks they store
• Secondary NameNode
• A backup for the NameNode (not a hot standby)
• Periodically merges the fsimage and edit log files
A File in HDFS
[hadoop@hadoopmaster1 tmp]$ hdfs dfs -ls /tmp/test
Found 1 items
-rw-r--r--   3 hadoop supergroup      62752 2015-03-08 00:33 /tmp/test/hadoop_storage_cleanupFilesList_tmp.log
[hadoop@hadoopmaster1 tmp]$ hdfs fsck /tmp/test/hadoop_storage_cleanupFilesList_tmp.log -files -racks -blocks
Connecting to namenode via http://hadoop-fip.cemodperf1.nokia.com:50070
FSCK started by hadoop (auth:KERBEROS_SSL) from /10.10.10.207 for path /tmp/test/hadoop_storage_cleanupFilesList_tmp.log at
/tmp/test/hadoop_storage_cleanupFilesList_tmp.log 62752 bytes, 1 block(s): OK
0. BP-1043234896-10.10.10.13-1450426556411:blk_1074160221_419397 len=62752 repl=3 [/default-rack/10.10.10.112:1004, /default-rack/10.10.10.111:1004, /default-rack/10.10.10.15:1004]
Status: HEALTHY
 Total size: 62752 B
 Total dirs: 0
 Total files: 1
 Total symlinks: 0
 Total blocks (validated): 1 (avg. block size 62752 B)
 Minimally replicated blocks: 1 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 6
 Number of racks: 1
FSCK ended at Sun Mar 08 00:34:05 IST 2015 in 1 milliseconds
The filesystem under path '/tmp/test/hadoop_storage_cleanupFilesList_tmp.log' is HEALTHY
The fsck output shows that this file occupies a single 62752-byte block with three replicas (repl=3), all placed on /default-rack, since this cluster has only one rack configured.
Anatomy of a File Read
1. The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
2. DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file.
3. For each block, the namenode returns the addresses of the datanodes that have a copy of that block; the datanodes are sorted according to their proximity to the client. The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O. The client then calls read() on the stream.
4. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
5. When the end of the block is reached, DFSInputStream closes the connection to that datanode and finds the best datanode for the next block.
6. Blocks are read in order, with DFSInputStream opening new connections to datanodes as the client reads through the stream. It also calls the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream. (A code sketch of this read path follows.)
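Seen from client code, the whole read path is hidden behind a few calls. Below is a minimal sketch using Hadoop's Java FileSystem API; the file path is a placeholder, and the namenode address is assumed to come from the usual configuration files (core-site.xml).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the read path above; the path below is a placeholder.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Returns a DistributedFileSystem when fs.defaultFS points at HDFS (step 1).
        FileSystem fs = FileSystem.get(conf);

        // open() triggers the namenode RPC for block locations (step 2) and
        // returns an FSDataInputStream wrapping a DFSInputStream (step 3).
        try (FSDataInputStream in = fs.open(new Path("/tmp/test/sample.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            // Each read streams data from the closest datanode (steps 4-6).
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // try-with-resources calls close() on the FSDataInputStream
    }
}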
Anatomy of a File Write
1. The client creates the file by calling create() on DistributedFileSystem.
2. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it. The namenode checks the client's access permissions and checks whether the file already exists (if it does, an IOException is thrown). The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
3. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline.
4. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.
5. DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
6. When the client has finished writing data, it calls close() on the stream and waits for acknowledgments before contacting the namenode to signal that the file is complete. (A code sketch of this write path follows.)
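The corresponding client-side sketch for the write path, again using the standard FileSystem API. The output path is a placeholder; the replication factor and block size come from the cluster defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the write path above; the path below is a placeholder.
public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() makes the namenode RPC (steps 1-2); with overwrite=false it
        // throws an IOException if the file already exists.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/test/output.log"), false)) {
            // Writes are buffered into packets and pushed down the datanode
            // pipeline by the DataStreamer (steps 3-5).
            out.writeBytes("hello hdfs\n");
        } // close() flushes remaining packets and completes the file (step 6)
    }
}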
Two Level Network Architecture
[Diagram: a two-level network topology, with cluster nodes grouped into racks; each node connects to its rack switch, and rack switches uplink to a core switch.]
Replication & Rack Awareness
[Diagram: Blocks A, B, and C replicated across DataNodes 1-12, which are spread over Racks 1, 2, and 3, illustrating rack-aware replica placement.]
MapReduce Components:
• Job Tracker:
• Coordinates all the jobs run on the system by scheduling tasks
• Keeps a record of the overall progress of each job
• If a task fails, reschedules it on a different TaskTracker
• Task Tracker:
• Slave daemon which accepts tasks to be run on a block of data
• Sends progress reports to the JobTracker as heartbeat signals at regular intervals
MapReduce Job
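A typical MapReduce job, in code: the classic word count, sketched here against the standard org.apache.hadoop.mapreduce API. The mapper emits a (word, 1) pair per token and the reducer sums the counts per word; input and output paths are taken from the command line, and everything else uses job defaults.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {     // emit (word, 1) for each token
                word.set(itr.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;                      // sum the counts for each word
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In practice this class would be packaged into a jar and submitted to the cluster with the hadoop jar command, passing the input and output directories as arguments.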
Hadoop 1 vs 2
[Diagram: in Hadoop 2, YARN takes over cluster resource management, so MapReduce becomes one of several processing frameworks running on top of HDFS.]
Thank You!