Topic 9a-Hadoop Storage- HDFS.pptx

Big Data Analytics
Dr. Danish Mahmood
Big Data Platforms
• Hadoop
• Architecture
• Storage
• Resource navigator
• Computations
• Ecosystems
• HBASE
• HIVE
• ZOOKEEPER
• MOSES
• … Etc.
• Spark
• Architecture
• Concept of RDDs
• Spark Streaming
• Spark MLlib
• Spark SQL
• Eco systems
In this presentation: Hadoop Architecture, Storage, Resource Navigator, and Computations
Yet to come: Spark and its ecosystem
Introduction
• In the “distributed data” world, the terms Spark, Hadoop, and Kafka should sound familiar.
• However, with numerous big data solutions available, it may be unclear exactly what each one is, how they differ, and which is the better fit.
• Determine what kinds of applications, such as machine learning, distributed streaming, and data storage, you can expect to make effective and efficient by using Hadoop, Spark, and Kafka.
What is Hadoop?
Hadoop is open-source software that stores massive amounts of data across large numbers of
commodity-grade computers and coordinates them to tackle tasks that are too large for a single computer to process on its own.
Store and compute:
can be used to write software that stores data or runs computations across hundreds or thousands of
machines without needing to know the details of what each machine can do, or how it can communicate.
Error Handling:
failures are expected and are handled within the framework itself, which significantly reduces the amount of error
handling necessary within your solution.
Key components
At its most basic level of functionality, Hadoop includes the common libraries shared by its modules, the file
system HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and its
implementation of MapReduce.
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to
application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop-related Apache Projects
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also
provides a dashboard for viewing cluster health and ability to view MapReduce, Pig and Hive
applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
Hadoop-related Apache Projects
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and
expressive programming model that supports a wide range of applications, including ETL,
machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a
powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch
and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed applications.
Distinctive Layers of Hadoop: YARN (diagram)
Distinctive Layers of Hadoop: HDFS (diagram)
Common Use Cases for Big Data in Hadoop
• Log Data Analysis
the most common use case; it fits the HDFS model perfectly: write once, read often.
• Data Warehouse Modernization
• Fraud Detection
• Risk Modeling
• Social Sentiment Analysis
• Image Classification
• Graph Analysis
• Beyond
Data Storage Operations on HDFS
• Hadoop is designed to work best with a modest number of
extremely large files.
• Average file sizes ➔ larger than 500MB.
• Write Once, Read Often model.
• Content of individual files cannot be modified, other than
appending new data at the end of the file.
What we can do:
• Create a new file
• Append content to the end of a file
• Delete a file
• Rename a file
• Modify file attributes like owner
HDFS Daemons
NameNode
Keeps the metadata of all files/blocks in the file system, and tracks where across the cluster the file
data is kept.
It does not store the data of these files itself; it acts as a kind of block lookup dictionary (an index or address book
of blocks).
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to
add/copy/move/delete a file.
The NameNode responds to successful requests by returning a list of the relevant DataNode servers
where the data lives.
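Conceptually, this "block lookup dictionary" can be pictured as a mapping from file names to blocks, and from blocks to the DataNodes that hold them. The following is a toy Python model for illustration only; the names and structures are invented here and are not the real NameNode data structures:

# Toy model of NameNode metadata: which blocks make up a file,
# and which DataNodes hold each block. Purely illustrative.
namespace = {
    "/user/alice/events.log": ["blk_001", "blk_002"],
}
block_locations = {
    "blk_001": ["datanode1", "datanode3", "datanode7"],
    "blk_002": ["datanode2", "datanode4", "datanode7"],
}

def locate_file(path):
    """Return, per block, the DataNodes a client could read from."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate_file("/user/alice/events.log"))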
HDFS Daemons
DataNode
DataNode stores data in the Hadoop Filesystem
A functional filesystem has more than one DataNode, with data replicated across them
On startup, a DataNode connects to the NameNode; spinning until that service comes up.
It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has provided the location of
the data
HDFS Daemons
Secondary NameNode
Not a failover NameNode.
The only purpose of the Secondary NameNode is to perform periodic checkpoints. It periodically
downloads the current NameNode image and edit log files, merges them into a new image, and uploads
the new image back to the (primary and only) NameNode.
Default checkpoint time is one hour. It can be set to one minute on highly busy clusters where lots
of write operations are being performed.
HDFS blocks
• A file is divided into blocks (default: 64MB in Hadoop 1.x, 128MB in Hadoop 2.0) and duplicated in multiple
places (see the short example after this list).
• Dividing into blocks is normal for a file system. E.g., the default block size in Linux is 4KB.
• The difference with HDFS is the scale: Hadoop was designed to operate at the petabyte scale.
• Every data block stored in HDFS has its own metadata and needs to be tracked by a central server.
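For example, the number of blocks a file needs is simply the file size divided by the block size, rounded up; only the last block may be smaller than the block size. A minimal Python illustration (the 500MB figure is only an example):

import math

block_size_mb = 128                                   # Hadoop 2.x default block size
file_size_mb = 500                                    # example file size
num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)                                     # 4 blocks; only the last one is smaller than 128 MB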
HDFS Blocks
When HDFS stores the replicas of the original blocks across the
Hadoop cluster, it tries to ensure that the block replicas are stored
in different failure points.
Name Node (diagram)
Data Node
• Data Replication:
• HDFS is designed to handle large-scale data in a distributed environment
• Hardware or software failures, or network partitions, do occur
• Therefore replication is needed for fault tolerance
• Replica placement:
• Replicating to all machines would make initialization time too high
• An approximate solution: keep only 3 replicas (a placement sketch follows this list)
One replica resides on the current node
One replica resides on the current rack
One replica resides on another rack
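The placement rule described above can be sketched as a small Python function. This is only an illustration of the rule as stated on this slide (the writing node, one other node on the same rack, one node on another rack), not the actual HDFS block placement policy code; all names below are invented:

import random

def place_replicas(writer_node, topology):
    """topology: dict mapping rack name -> list of node names."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]                                      # 1st replica: the current node
    same_rack = [n for n in topology[local_rack] if n != writer_node]
    replicas.append(random.choice(same_rack))                     # 2nd replica: the current rack
    other_racks = [n for r, nodes in topology.items() if r != local_rack for n in nodes]
    replicas.append(random.choice(other_racks))                   # 3rd replica: another rack
    return replicas

topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(place_replicas("dn2", topology))                            # e.g. ['dn2', 'dn1', 'dn5']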
Data Replication (diagram)
Rack Awareness (diagram)
Data Replication
Re-replicating missing replicas
• Missing heartbeats signify lost nodes
• Name node consults metadata, finds affected data
• Name node consults rack awareness script
• Name node tells a data node to re-replicate (a toy sketch of this flow follows)
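The re-replication flow in the list above, as a toy Python sketch; the data structures and names are invented for illustration, and the real NameNode logic also applies rack awareness and throttling:

REPLICATION_FACTOR = 3

def handle_lost_node(lost_node, block_locations, live_nodes):
    """When heartbeats stop, find under-replicated blocks and ask a
    surviving holder to copy each one to a new node."""
    for blk, holders in block_locations.items():
        if lost_node in holders:
            holders.remove(lost_node)                    # consult metadata: affected data
            if len(holders) < REPLICATION_FACTOR:
                target = [n for n in live_nodes if n not in holders][0]
                print(f"tell {holders[0]} to re-replicate {blk} to {target}")
                holders.append(target)

blocks = {"blk_001": ["dn1", "dn3", "dn7"], "blk_002": ["dn3", "dn4", "dn5"]}
handle_lost_node("dn3", blocks, live_nodes=["dn1", "dn2", "dn4", "dn5", "dn7"])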
Name Node Failure
• NameNode is the single point of failure in the cluster- Hadoop 1.0
• If NameNode is down due to software glitch, restart the machine
• If original NameNode can be restored, secondary can re-establish the most current metadata
snapshot
• If the machine doesn’t come up, the metadata for the cluster is irretrievable. In this situation, create a new
NameNode, use the secondary to copy metadata to the new primary, and restart the whole cluster
• Before Hadoop 2.0, the NameNode was a single point of failure and an operational limitation.
• Before Hadoop 2, few Hadoop clusters were able to scale beyond 3,000 or 4,000 nodes.
• Multiple NameNodes can be used in Hadoop 2.x. (HDFS High Availability feature – one is in an Active
state, the other one is in a Standby state).
High Availability of the NameNodes
Standby NameNode – keeps the state of the block locations and block metadata in memory.
JournalNode – if a failure occurs, the Standby NameNode reads all completed journal entries to ensure the
new Active NameNode is fully consistent with the state of the cluster.
Zookeeper – provides coordination and
configuration services for distributed systems.
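The failover behaviour described above can be pictured with a very small toy model: the Standby keeps applying journal entries, and on failover it replays every completed entry it has not yet applied before declaring itself Active. This is a conceptual sketch only; real HDFS HA involves the Quorum Journal Manager, fencing, and ZooKeeper failover controllers, and the class and fields below are invented:

class StandbyNameNode:
    """Toy model: keeps namespace/block metadata in memory and catches up from the journal."""
    def __init__(self):
        self.metadata = {}
        self.last_applied_txid = 0

    def apply(self, entry):
        self.metadata.update(entry["change"])
        self.last_applied_txid = entry["txid"]

    def become_active(self, journal):
        # read all completed journal entries not yet applied, then take over
        for entry in sorted(journal, key=lambda e: e["txid"]):
            if entry["txid"] > self.last_applied_txid:
                self.apply(entry)
        print(f"Now Active at txid {self.last_applied_txid}, consistent with the cluster state")

journal = [{"txid": 1, "change": {"/a": "blk_1"}}, {"txid": 2, "change": {"/b": "blk_2"}}]
StandbyNameNode().become_active(journal)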
Several useful commands for HDFS
All hadoop commands are invoked by the bin/hadoop script.
% hadoop fsck / -files -blocks
➔ lists the blocks that make up each file in HDFS.
For HDFS, the URI scheme is hdfs, and for the local file system, the scheme is file.
A file or directory in HDFS can be specified in a fully qualified way, such as:
hdfs://namenodehost/parent/child or hdfs://namenodehost
The HDFS file system shell is similar to the Linux file commands, with the following general
syntax: hdfs dfs -file_cmd
For instance, mkdir runs as:
$ hdfs dfs -mkdir /user/directory_name
Several useful commands for HDFS (continued)
Calculating HDFS nodes storage
• Key players in computing HDFS node storage
• H = HDFS Storage size
• C = Compression ratio. It depends on the type of compression used and size of the data. When no
compression is used, C=1.
• R = Replication factor. It is usually 3 in a production cluster.
• S = Initial size of data that needs to be moved to Hadoop. This could be a combination of historical data and
incremental data.
• i = intermediate data factor. It is usually 1/3 or 1/4. It is Hadoop’s intermediate working space, dedicated
to storing intermediate results of Map tasks or any temporary storage used in Pig or Hive. This is a
common guideline for many production applications; even Cloudera has recommended 25% for
intermediate results. (A short sketch of the resulting formula follows this list.)
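Combining these factors the way the following slides do, the required HDFS storage is roughly H = (R + i) × S / C, and the number of data nodes is n = H / d (optionally reduced by a target disk-utilization fraction). A minimal Python sketch under those assumptions; the function and parameter names are illustrative, not taken from Hadoop itself:

import math

def estimate_hdfs_storage(S, R=3, i=0.25, C=1.0):
    """Required HDFS storage H for initial data S (same unit as S)."""
    return (R + i) * S / C

def estimate_data_nodes(H, disk_per_node, utilization=1.0):
    """Approximate number of data nodes needed for storage H."""
    return math.ceil(H / (disk_per_node * utilization))

# Example from the next slides: X GB of data, R = 3, i = 1/4, no compression
X = 100
print(estimate_hdfs_storage(X))                       # 325.0, i.e. 3.25 * X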
Calculating Initial Data
• This could be a combination of historical data and incremental data.
• In this, we need to consider the growth rate of the initial data as well, at least for the next 3–6 month
period.
• For example: we have 500 TB of data now, 50 TB is expected to be ingested in the next three
months, and output files from MR jobs may amount to at least 10 % of the initial data, so we
need to consider 600 TB as the initial data size.
• i.e., 500 TB + 50 TB + 500*10/100 = 600 TB initial size
• Now, if each node offers 8 TB of storage, how many nodes will be needed? Number of data
nodes (n): n = H/d (d = disk space available per node) = 600/8 (without considering replication,
the intermediate data factor, or any compression techniques that may be employed)
• Question: Is it feasible to use 100% disk space?
Estimating size for Hadoop storage based on
initial data
• Suppose you have to upload X GB of data into HDFS (Hadoop 2.0) with no compression,
a replication factor of 3, and an intermediate factor of 0.25 = ¼. Compute how many times Hadoop’s
storage will increase with respect to the initial data, i.e., X GB.
• H = (3+1/4)*X = 3.25*X
With the assumptions above, the Hadoop storage is estimated to be 3.25 times the size of the
initial data.
H = HDFS Storage size
C = Compression ratio. When no compression is used, C=1.
R = Replication factor. It is usually 3 in a production cluster.
S = Initial size of data that needs to be moved to Hadoop. This could be a combination of historical data and incremental data.
i = intermediate data factor. It is usually 1/3 or 1/4. When no information is given, assume it is zero.
If 8 TB is the available disk space per node (10 disks of 1 TB each, with 2 disks excluded for the operating system,
logs, etc.), and assuming the initial data size is 600 TB, how will you estimate the number of data nodes (n)?
• Estimating the hardware requirement is always challenging in a Hadoop environment because we never know
when the data storage demand of a business will increase.
• We must understand the following factors in detail to arrive at the right number of nodes to add to the cluster
for the current scenario:
• The actual size of data to store – 600 TB
• At what pace the data will increase in the future (per day/week/month/quarter/year) – Data trending
analysis or business requirement justification (prediction)
• We are in Hadoop world, so replication factor plays an important role – default 3x replicas
• Hardware machine overhead (OS, logs etc.) – 2 disks were considered
• Intermediate mapper and reducer data output on hard disk - 1x
• Space utilization between 60 % and 70 % – as careful designers, we never want our hard drives to be
filled to their full storage capacity.
• Compression ratio
Calculation to find the number of data nodes
required to store 600 TB of data
• Rough Calculation
• Data Size – 600 TB
• Replication factor – 3
• Intermediate data – 1
• Total Storage requirement – (3+1) * 600 = 2400 TB
• Available disk size for storage – 8 TB
• Total number of required data nodes (approx.): n = H/d
• 2400/8 = 300 machines
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper
node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop
administrator. The intermediate data is cleaned up after the Hadoop job completes.
Calculation to find the number of data nodes
required to store 600 TB of data
• Actual Calculation:
• Disk space utilization – 65 % (differs from business to business)
• Compression ratio – 2.3
• Total Storage requirement – 2400/2.3 = 1043.5 TB
• Available disk size for storage – 8*0.65 = 5.2 TB
• Total number of required data nodes (approx.): 1043.5/5.2 = 201 machines
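Continuing from the illustrative helpers sketched earlier (estimate_hdfs_storage and estimate_data_nodes are assumptions of this text, not slide material), the rough and actual figures above can be reproduced as follows:

S = 600                                                       # TB of initial data
H_rough = estimate_hdfs_storage(S, R=3, i=1, C=1.0)           # (3 + 1) * 600 = 2400 TB
n_rough = estimate_data_nodes(H_rough, disk_per_node=8)       # 2400 / 8 = 300 machines

H_actual = estimate_hdfs_storage(S, R=3, i=1, C=2.3)          # 2400 / 2.3 ≈ 1043.5 TB
n_actual = estimate_data_nodes(H_actual, disk_per_node=8, utilization=0.65)
print(n_rough, n_actual)                                      # 300 201, matching the slides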
Case: Business has predicted 20 % data increase in a quarter
and we need to predict the new machines to be added in a
year.
• Data increase – 20 % over a quarter
• 1st quarter: 1043.5 * 0.2 = 208.7 TB
• 2nd quarter: 1043.5 * 1.2 * 0.2 = 250.44 TB
• 3rd quarter: 1043.5 * (1.2)^2 * 0.2 = 300.5 TB
• 4th quarter: 1043.5 * (1.2)^3 * 0.2 = 360.6 TB
• Additional data nodes requirement (approx.):
• 1st quarter: 208.7/5.2 = 41 machines
• 2nd quarter: 250.44/5.2 = 49 machines
• 3rd quarter: 300.5/5.2 = 58 machines
• 4th quarter: 360.6/5.2 = 70 machines
Compound-growth formula for each quarter’s increment: ΔA = P (1 + R/100)ⁿ * (R/100)
Here, P = 1043.5 TB, R = 20 % per quarter, and n = the number of quarters already elapsed (0, 1, 2, 3).
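A short Python sketch of the calculation above (variable names are illustrative); each quarter’s new data is 20 % of the data held at the start of that quarter:

import math

P = 1043.5                 # TB stored at the start of the year
r = 0.20                   # 20 % growth per quarter
usable_per_node = 5.2      # TB per node (8 TB * 65 % utilization)

for n in range(4):
    delta = P * (1 + r) ** n * r                   # new data added in quarter n+1
    nodes = math.ceil(delta / usable_per_node)     # extra data nodes for that quarter
    print(f"Q{n + 1}: {delta:.1f} TB -> {nodes} machines")
# Q1: 208.7 TB -> 41, Q2: 250.4 TB -> 49, Q3: 300.5 TB -> 58, Q4: 360.6 TB -> 70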
Thought Question
• Imagine that you are uploading a file of 1664MB into HDFS (Hadoop 2.0).
• 8 blocks have been successfully uploaded into HDFS. Find how many blocks are remaining.
• Another client wants to read the uploaded data while the upload is still in progress, i.e., the data
already written to those 8 blocks. What will happen in such a scenario: will the 8 blocks of data
that have been uploaded be displayed or available for use?