2. Big Data Platforms
• Hadoop
• Architecture
• Storage
• Resource Negotiator
• Computations
• Ecosystems
• HBase
• Hive
• ZooKeeper
• MOSES
• … Etc.
• Spark
• Architecture
• Concept of RDDs
• Spark Streaming
• Spark MLlib
• Spark SQL
• Ecosystems
In this presentation: Hadoop Architecture, Storage, Resource Negotiator, and Computations.
Yet to come: the Spark topics.
3. Introduction
• In the “distributed data” world, the terms Spark, Hadoop, and Kafka should sound familiar.
• However, with numerous big data solutions available, it may be unclear exactly what they are, what their main differences are, and which is better.
• Determine what kinds of applications, such as machine learning, distributed streaming, and data storage, you can expect to make effective and efficient by using Hadoop, Spark, and Kafka.
4. What is Hadoop?
Hadoop is open-source software that stores massive amounts of data while coordinating large numbers of
commodity-grade computers to tackle tasks that are too large for a single computer to process on its own.
Store and compute:
Hadoop can be used to write software that stores data or runs computations across hundreds or thousands of
machines without needing to know the details of what each machine can do, or how it can communicate.
Error Handling:
Failures are expected and are handled within the framework itself, which significantly reduces the amount of
error handling necessary within your solution.
Key components
At its most basic level of functionality, Hadoop includes the common libraries shared between its modules, the file
system HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and its
implementation of MapReduce (a conceptual sketch of the MapReduce model follows below).
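As a rough illustration of the MapReduce programming model, here is a minimal sketch in plain Python (not the Hadoop Java API): a map step emits key-value pairs and a reduce step aggregates them per key. The function names and sample data are illustrative only; on a real cluster, Hadoop runs these phases in parallel over data stored in HDFS.

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit (word, 1) pairs for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct key (word)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    # Tiny in-memory stand-in for data that Hadoop would read from HDFS.
    sample = ["hadoop stores data", "hadoop computes data"]
    print(reduce_phase(map_phase(sample)))
    # {'hadoop': 2, 'stores': 1, 'data': 2, 'computes': 1}
```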
5. The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to
application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
6. Hadoop-related Apache Projects
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also
provides a dashboard for viewing cluster health and ability to view MapReduce, Pig and Hive
applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
7. Hadoop-related Apache Projects
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and
expressive programming model that supports a wide range of applications, including ETL,
machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a
powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch
and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed applications.
11. Common Use Cases for Big Data in Hadoop
• Log Data Analysis
the most common use case; it fits the HDFS Write Once, Read Often model perfectly.
• Data Warehouse Modernization
• Fraud Detection
• Risk Modeling
• Social Sentiment Analysis
• Image Classification
• Graph Analysis
• Beyond
13. Data Storage Operations on HDFS
• Hadoop is designed to work best with a modest number of
extremely large files.
• Average file sizes ➔ larger than 500MB.
• Write Once, Read Often model.
• Content of individual files cannot be modified, other than
appending new data at the end of the file.
What we can do (see the sketch after this list):
• Create a new file
• Append content to the end of a file
• Delete a file
• Rename a file
• Modify file attributes like owner
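On a cluster where the `hdfs` command-line client is installed and configured, the operations listed above map directly onto `hdfs dfs` subcommands. The following is a minimal sketch driving those subcommands from Python; the paths and file names are hypothetical and error handling is omitted.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand; assumes the hdfs client is on PATH."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Hypothetical paths, for illustration only.
hdfs("-mkdir", "-p", "/user/demo")                                  # create a directory
hdfs("-put", "local_events.log", "/user/demo/events.log")           # create a new file
hdfs("-appendToFile", "more_events.log", "/user/demo/events.log")   # append to the end
hdfs("-mv", "/user/demo/events.log", "/user/demo/events_old.log")   # rename
hdfs("-chown", "analyst", "/user/demo/events_old.log")              # modify file owner
hdfs("-rm", "/user/demo/events_old.log")                            # delete
```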
15. HDFS Daemons
NameNode
Keeps the metadata of all files/blocks in the file system, and tracks where across the cluster the file
data is kept.
It does not store the data of these files itself; it acts as a kind of block lookup dictionary (an index or address book
of blocks).
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to
add/copy/move/delete a file.
The NameNode responds to successful requests by returning a list of relevant DataNode servers
where the data lives (see the fsck example below).
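One way to observe the NameNode's "address book" role in practice is `hdfs fsck`, which asks the NameNode for a file's block list and the DataNodes holding each block. A minimal sketch; the path is hypothetical and a configured hdfs client is assumed to be on PATH.

```python
import subprocess

# Ask the NameNode which blocks make up the file and which DataNodes hold them.
report = subprocess.run(
    ["hdfs", "fsck", "/user/demo/events.log", "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)
```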
16. HDFS Daemons
DataNode
The DataNode stores data in the Hadoop file system.
A functional filesystem has more than one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode, spinning until that service comes up.
It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode once the NameNode has provided the location of
the data.
17. HDFS Daemons
Secondary NameNode
Not a failover NameNode.
The only purpose of the Secondary NameNode is to perform periodic checkpoints: it periodically
downloads the current NameNode image and edit log files, merges them into a new image, and uploads
the new image back to the (primary and only) NameNode.
The default checkpoint interval is one hour. It can be set to one minute on highly busy clusters where many
write operations are being performed.
18. HDFS blocks
• A file is divided into blocks (default: 64 MB; 128 MB in Hadoop 2.0) and replicated in multiple places (a quick
worked example follows after this list).
• Dividing into blocks is normal for a file system. E.g., the default block size in Linux is 4 KB.
• The difference with HDFS is the scale: Hadoop was designed to operate at the petabyte scale.
• Every data block stored in HDFS has its own metadata and needs to be tracked by a central server.
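As a quick worked example of how a file is split, assuming the Hadoop 2.0 default of 128 MB blocks (the file size here is arbitrary):

```python
import math

BLOCK_SIZE_MB = 128          # Hadoop 2.0 default block size
file_size_mb = 1000          # arbitrary example file

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB

print(num_blocks, last_block_mb)   # 8 blocks; the last block holds only 104 MB
```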
19. HDFS Blocks
When HDFS stores the replicas of the original blocks across the
Hadoop cluster, it tries to ensure that the block replicas are stored
in different failure points.
21. Data Node
• Data Replication:
• HDFS is designed to handle large-scale data in a distributed environment.
• Hardware or software failures and network partitions do occur.
• Therefore, replication is needed for fault tolerance.
• Replica placement:
• Creating replicas on all machines would take too long to initialize.
• An approximate solution: keep only 3 replicas (sketched below).
One replica resides on the current node.
One replica resides on another node in the current rack.
One replica resides on a node in another rack.
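A minimal sketch of the simplified placement rule described above. The node and rack names are made up; in a real cluster the NameNode's rack-awareness policy makes this decision.

```python
def place_replicas(writer_node, writer_rack, cluster):
    """Pick 3 replica locations per the simplified rule above: the writing node,
    another node in the same rack, and a node in a different rack."""
    same_rack = [n for n, r in cluster.items() if r == writer_rack and n != writer_node]
    other_rack = [n for n, r in cluster.items() if r != writer_rack]
    return [writer_node, same_rack[0], other_rack[0]]

# Hypothetical 2-rack cluster: node name -> rack id.
cluster = {"node1": "rack1", "node2": "rack1", "node3": "rack2", "node4": "rack2"}
print(place_replicas("node1", "rack1", cluster))   # ['node1', 'node2', 'node3']
```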
24. Data Replication
Re-replicating missing replicas
• Missing heartbeats signify lost nodes.
• The NameNode consults its metadata and finds the affected data.
• The NameNode consults the rack awareness script.
• The NameNode tells a DataNode to re-replicate.
25. Name Node Failure
• The NameNode is the single point of failure in the cluster in Hadoop 1.0.
• If the NameNode is down due to a software glitch, restart the machine.
• If the original NameNode can be restored, the secondary can re-establish the most current metadata
snapshot.
• If the machine doesn't come up, the metadata for the cluster is irretrievable. In this situation, create a new
NameNode, use the secondary to copy the metadata to the new primary, and restart the whole cluster.
26. • Before Hadoop 2.0, the NameNode was a single point of failure and an operational limitation.
• Before Hadoop 2, few Hadoop clusters were able to scale beyond 3,000 or 4,000 nodes.
• Multiple NameNodes can be used in Hadoop 2.x (the HDFS High Availability feature: one NameNode is in an
Active state, the other is in a Standby state).
27. High Availability of the NameNodes
Standby NameNode – keeps the state of the block locations and block metadata in memory.
JournalNode – if a failure occurs, the Standby NameNode reads all completed journal entries to
ensure the new Active NameNode is fully consistent with the state of the cluster.
ZooKeeper – provides coordination and configuration services for distributed systems.
28. Several useful commands for HDFS
All Hadoop commands are invoked by the bin/hadoop script.
% hadoop fsck / -files -blocks
➔ lists the blocks that make up each file in HDFS.
For HDFS, the scheme name is hdfs, and for the local file system, the scheme name is file.
A file or directory in HDFS can be specified in a fully qualified way, such as:
hdfs://namenodehost/parent/child or hdfs://namenodehost
The HDFS file system shell commands are similar to Linux file commands, with the following general
syntax: hdfs dfs -file_cmd
For instance, mkdir runs as:
$ hdfs dfs -mkdir /user/directory_name
30. Calculating HDFS node storage
• Key players in computing HDFS node storage (a sketch combining them follows after this list):
• H = HDFS storage size
• C = Compression ratio. It depends on the type of compression used and the size of the data. When no
compression is used, C = 1.
• R = Replication factor. It is usually 3 in a production cluster.
• S = Initial size of the data that needs to be moved to Hadoop. This could be a combination of historical data and
incremental data.
• i = Intermediate data factor. It is usually 1/3 or 1/4. It is Hadoop's intermediate working space, dedicated
to storing intermediate results of Map tasks or any temporary storage used in Pig or Hive. This is a
common guideline for many production applications; even Cloudera has recommended 25% for
intermediate results.
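The worked examples on the following slides combine these variables as H ≈ S × (R + i) / C (replicas plus intermediate working space, reduced by the compression ratio). A hedged sketch of that arithmetic:

```python
def estimate_hdfs_storage(S, R=3, i=0.25, C=1.0):
    """Estimate HDFS storage for S units of raw data, following the arithmetic
    used in the worked examples on the later slides."""
    return S * (R + i) / C

print(estimate_hdfs_storage(600))        # 1950.0 TB with R=3, i=1/4, C=1
print(estimate_hdfs_storage(600, i=1))   # 2400.0 TB, matching the rough calculation on slide 34
```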
31. Calculating Initial Data
• This could be a combination of historical data and incremental data.
• Here, we also need to consider the growth rate of the initial data, at least for the next 3-6 month
period.
• For example, if we have 500 TB of data now, 50 TB is expected to be ingested in the next three
months, and output files from MR jobs may create at least 10% of the initial data, then we
need to consider 600 TB as the initial data size.
• i.e., 500 TB + 50 TB + 500*10/100 TB = 600 TB initial size (see the short sketch after this list).
• Now, if we have nodes with 8 TB of usable storage each, how many nodes will be needed? Number of data
nodes (n): n = H/d (d = disk space available per node) = 600/8, without considering the replication
and intermediate data factors, or any compression techniques that may be employed.
• Question: Is it feasible to use 100% of the disk space?
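The same arithmetic as a short sketch; as the bullet notes, this naive node count ignores replication, intermediate data, and compression.

```python
# Initial data size: current data + expected ingest + MR job output (10% of current).
S = 500 + 50 + 500 * 10 / 100     # 600.0 TB
d = 8                             # disk space available per node, in TB
n = S / d                         # naive node count: 75.0 machines
print(S, n)
```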
32. Estimating the size of Hadoop storage based on initial data
• Suppose you have to upload X GB of data into HDFS (Hadoop 2.0), with no compression,
a replication factor of 3, and an intermediate factor of 0.25 = 1/4. Compute how many times Hadoop's
storage will increase with respect to the initial data, i.e., X GB.
• H = (3 + 1/4)*X = 3.25*X (reproduced in the sketch after this list)
With the assumptions above, the Hadoop storage is estimated to be 3.25 times the size of the
initial data.
H = HDFS storage size
C = Compression ratio. When no compression is used, C = 1.
R = Replication factor. It is usually 3 in a production cluster.
S = Initial size of the data that needs to be moved to Hadoop. This could be a combination of historical data and incremental data.
i = Intermediate data factor. It is usually 1/3 or 1/4. When no information is given, assume it is zero.
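Using the same arithmetic as the estimation sketch above, with C = 1, R = 3, and i = 1/4, reproduces the 3.25x multiplier:

```python
# Storage multiplier relative to the initial X GB, per the slide's assumptions.
R, i, C = 3, 0.25, 1.0
multiplier = (R + i) / C
print(multiplier)   # 3.25, i.e. H = 3.25 * X
```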
33. If 8 TB is the available disk space per node (10 disks of 1 TB each, with 2 disks excluded for the operating
system, etc.), and the initial data size is 600 TB, how will you estimate the number of data nodes (n)?
• Estimating the hardware requirement is always challenging in a Hadoop environment, because we never know
when the data storage demand of a business will increase.
• We must understand the following factors in detail to come to a conclusion about adding the right
number of nodes to the cluster for the current scenario:
• The actual size of the data to store – 600 TB
• The pace at which the data will increase in the future (per day/week/month/quarter/year) – data trend
analysis or business requirement justification (prediction)
• We are in the Hadoop world, so the replication factor plays an important role – default 3x replicas
• Hardware machine overhead (OS, logs, etc.) – 2 disks were considered
• Intermediate mapper and reducer data output on hard disk – 1x
• Disk space utilization between 60% and 70% – as careful designers, we never want our hard drives to be
filled to capacity.
• Compression ratio
34. Calculation to find the number of data nodes
required to store 600 TB of data
• Rough calculation:
• Data size – 600 TB
• Replication factor – 3
• Intermediate data factor – 1
• Total storage requirement – (3+1) * 600 = 2400 TB
• Available disk size for storage – 8 TB
• Total number of required data nodes (approx.): n = H/d = 2400/8 = 300 machines
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper
node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop
administrator. The intermediate data is cleaned up after the Hadoop job completes.
35. Calculation to find the number of data nodes
required to store 600 TB of data
• Actual calculation:
• Disk space utilization – 65% (differs from business to business)
• Compression ratio – 2.3
• Total storage requirement – 2400/2.3 = 1043.5 TB
• Available disk size for storage – 8 * 0.65 = 5.2 TB
• Total number of required data nodes (approx.): 1043.5/5.2 = 201 machines (both calculations are reproduced in the sketch below)
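Both the rough and the actual calculations can be checked with a short sketch; the values are taken from the two slides above.

```python
import math

S, R, i = 600, 3, 1                 # TB, replication factor, intermediate factor
raw_need = (R + i) * S              # 2400 TB

# Rough estimate: full 8 TB per node, no compression.
rough_nodes = math.ceil(raw_need / 8)                         # 300 machines

# Actual estimate: 2.3x compression, 65% usable disk per node.
compressed_need = raw_need / 2.3                              # ~1043.5 TB
usable_per_node = 8 * 0.65                                    # 5.2 TB
actual_nodes = math.ceil(compressed_need / usable_per_node)   # ~201 machines

print(rough_nodes, round(compressed_need, 1), actual_nodes)
```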
36. Case: The business has predicted a 20% data increase per quarter, and we need to predict the new
machines to be added in a year.
• Data increase – 20% over a quarter
• 1st quarter: 1043.5 * 0.2 = 208.7 TB
• 2nd quarter: 1043.5 * 1.2 * 0.2 = 250.44 TB
• 3rd quarter: 1043.5 * (1.2)^2 * 0.2 = 300.5 TB
• 4th quarter: 1043.5 * (1.2)^3 * 0.2 = 360.6 TB
• Additional data node requirement (approx.):
• 1st quarter: 208.7/5.2 = 41 machines
• 2nd quarter: 250.44/5.2 = 49 machines
• 3rd quarter: 300.5/5.2 = 58 machines
• 4th quarter: 360.6/5.2 = 70 machines
Compound interest formula for the quarterly increment: A = P (1 + R/100)ⁿ * (R/100)
Here, P = 1043.5 TB, R = 20% per quarter, and n = 0, 1, 2, 3 is the number of quarters already elapsed (reproduced in the sketch below).
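The quarterly figures can be reproduced with the same compound-growth arithmetic; the usable disk per node is taken from the previous slide, and machine counts are rounded up.

```python
import math

P = 1043.5        # current storage requirement, TB
rate = 0.20       # 20% growth per quarter
usable = 5.2      # usable disk per node, TB

for quarter in range(1, 5):
    added_tb = P * (1 + rate) ** (quarter - 1) * rate
    added_nodes = math.ceil(added_tb / usable)
    print(f"Q{quarter}: {added_tb:.1f} TB -> {added_nodes} machines")
# Q1: 208.7 TB -> 41, Q2: 250.4 TB -> 49, Q3: 300.5 TB -> 58, Q4: 360.6 TB -> 70
```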
37. Thought Question
• Imagine that you are uploading a file of 1664 MB into HDFS (Hadoop 2.0).
• 8 blocks have been successfully uploaded into HDFS. Find how many blocks are remaining.
• Another client wants to read the uploaded data while the upload is still in progress,
i.e., the data already written to the 8 uploaded blocks. What will happen in such a scenario? Will the 8 blocks
of data that have been uploaded be displayed or available for use?