Big data Hadoop

  1. BIG DATA – HADOOP Governance Team 6 Dec 16
  2. Today’s Overview • 1. Big Data Fundamentals • 2. Hadoop and Components • 3. Q&A
  3. Agenda – Big Data Fundamentals • What is Big Data? • Basic Characteristics of Big Data • Sources of Big Data • V’s of Big Data • Processing of Data – Traditional Approach vs Big Data Approach
  4. What is Big Data
  5. What is Big Data – cont’d • Big Data is a collection of data sets so large that they cannot be processed using a traditional approach. It includes the following: – Structured data – traditional relational data – Semi-structured data – XML – Unstructured data – images/PDF/media, etc.
  6. Various V’s- Big Data
  7. Processing - Data • Traditional Approach • Big Data Approach
  8. Hadoop Fundamentals • What is Hadoop? • Key Characteristics • Components • HDFS • MapReduce • YARN • Benefits of Hadoop
  9. What is Hadoop • Hadoop is an open-source software framework for storing large amounts of data and processing/querying that data on a cluster with multiple nodes of commodity (i.e. low-cost) hardware.
  10. Key Characteristics -Hadoop • Reliable • Flexible • Scalable • Economical
  11. Components • Common libraries • High-volume distributed data storage system – HDFS • High-volume distributed data processing framework – MapReduce • Resource and metadata management – YARN
  12. HDFS • What is HDFS? • Architecture • Components • Basic Features
  13. What is HDFS? HDFS holds very large amounts of data and provides easy access to it. To store such huge data, the files are spread across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes the data available for parallel processing.
  14. Components – HDFS  Master/slave architecture  An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.  There are a number of DataNodes, usually one per node in the cluster.  The DataNodes manage the storage attached to the nodes they run on.
  15. Components – HDFS HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and the set of blocks is stored in DataNodes. DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
  16. Features • Highly fault-tolerant • High throughput • Suitable Distributed Storage for large Amount of Data • Streaming access to file system data • Can be built out of commodity hardware
  17. MapReduce • What is MapReduce • Tasks /Components • Basic Features • Demo
  18. What is MapReduce • A framework used mainly to process large amounts of data in parallel on large clusters of commodity hardware • Based on the divide-and-conquer principle, which provides built-in fault tolerance and redundancy • A batch-oriented parallel processing engine for processing large volumes of data
  19. MapReduce – Map stage: The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data. – Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
  20. Stages of Each Task • The Map task has the following stages – Map – Combine – Partition • The Reduce task has the following stages – Shuffle and Sort – Reduce
  21. Demo • Refer to the PDF attachment • Reads text and counts the number of occurrences of each word
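The demo itself is in the attached PDF, but the word-count flow can be sketched in plain Python. This is a simulation of the map, shuffle, and reduce steps, not actual Hadoop code; the input lines are made up for illustration:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: sum the counts for one word.
    return key, sum(values)

lines = ["big data hadoop", "big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 1}
```

In real Hadoop the mapper and reducer run on different nodes and the shuffle happens over the network; here all three steps run in one process.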
  22. YARN • What is YARN? • Architecture and Components
  23. YARN • YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management
  24. Hive • What is Hive? • Architecture of Hive • Flow in Hive • Data Types • Sample Query • Not Hive • Demo
  25. What is Hive • A data warehouse infrastructure tool for processing structured data on the Hadoop platform • Originally developed by Facebook, then moved under the Apache umbrella • When large volumes of data must be retrieved from multiple sources and an RDBMS no longer fits as a solution, we move to Hive
  26. What is Hive • A query-engine wrapper on top of Hadoop used to perform OLAP • Provides HiveQL, which is similar to SQL • Targeted at users/developers with a SQL background • Stores its schema in a database and processes the data in HDFS • Data is stored in HDFS/HBase, and every table references a file in HDFS/HBase
  27. Architecture – Hive • Components – User Interface – infrastructure tool used for interaction between the user and HDFS/HBase – Metastore – stores schemas, tables, etc.; mainly used to hold metadata information – SerDe – libraries used to serialize/deserialize custom data formats; reads and writes rows from/to the tables – Query Processor
  28. Architecture -Hive
  29. Data Types • Integral types – TINYINT, SMALLINT, INT, BIGINT • Floating-point types – DOUBLE, DECIMAL • String types – CHAR, VARCHAR • Misc types – BOOLEAN, BINARY • Date/time types – TIMESTAMP, DATE • Complex types – STRUCT, MAP, ARRAY
  30. Sample Query • Create Table • Drop Table • Alter Table • Rename Table- Rename the table name • Load Data –Insert • Create View • Select
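As a hedged illustration of the statement types listed above, the following HiveQL sketch covers each one. The table and column names are made up for illustration and the HDFS path is hypothetical:

```sql
-- Create a table (schema goes to the metastore, data lives in HDFS)
CREATE TABLE employees (id INT, name STRING, salary DOUBLE);

-- Alter / rename the table
ALTER TABLE employees RENAME TO staff;

-- Load data (insert) from a file in HDFS into the table
LOAD DATA INPATH '/user/data/staff.txt' INTO TABLE staff;

-- Create a view and select from it
CREATE VIEW high_paid AS SELECT * FROM staff WHERE salary > 50000;
SELECT name FROM high_paid;

-- Drop the table
DROP TABLE staff;
```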
  31. Operators and Built-in Functions • Arithmetic operators • Relational operators • Logical operators • Aggregate and built-in functions • Supports indexes/ordering/joins
  32. Disadvantages of Hive • Not for real-time queries • Supports ACID only from version 0.14 onwards • Poor performance – processing takes longer, since Hive internally generates and runs a MapReduce or Spark program each time it processes a record set
  33. Disadvantages of Hive • It can process only large volumes of structured data, not the other categories
  34. Hive Interface Options • CLI • HUE (Hadoop User Experience) • JDBC/ODBC – Java
  37. CAP • CAP Theorem – Consistency • Reads from all the nodes are always consistent – Availability • Reads/writes are always acknowledged, either with success or failure – Partition Tolerance • The system can tolerate a communication outage that splits the cluster into multiple silos/data sets A distributed data system can provide only two of the above properties. Distributed data storage is based on this theorem.
  38. ACID • ACID – Atomicity – Consistency – Isolation – Durability
  39. BASE • BASE – Basic availability – Soft state – Eventual consistency These properties are mainly used by distributed databases for non-transactional data
  40. SCV • SCV – Speed – Consistency – Volume High-volume data processing is based on this principle; a data processing system can satisfy at most two of the above properties
  41. Sharding • Sharding is the process of horizontally partitioning a large volume of data into smaller, more manageable data sets
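A minimal sketch of hash-based sharding, assuming string keys and a fixed shard count (both are illustrative choices, and the byte-sum hash stands in for a real stable hash such as MD5):

```python
def shard_for(key, num_shards):
    # Deterministically map a key to one of the shards.
    # A byte sum is used so the result is stable across runs,
    # unlike Python's built-in hash() on strings.
    return sum(key.encode()) % num_shards

# Partition four illustrative records across three shards.
shards = {i: [] for i in range(3)}
for key in ["user-1", "user-2", "user-3", "user-4"]:
    shards[shard_for(key, 3)].append(key)
print(shards)
```

Because the mapping is deterministic, any node can compute which shard holds a given key without consulting a central index.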
  42. Replication • Replication stores multiple copies of a data set, known as replicas It provides high availability, scalability, and fault tolerance, since the data is stored on multiple nodes Replication is implemented in the following ways: Master-slave Peer-to-peer
  43. HDFS
  44. HDFS
  45. HDFS Commands • op-project-dist/hadoop-hdfs/HDFSCommands.html
  46. HDFS • Blocks – In HDFS a file is split into small segments used to store the data. Each segment is called a block – The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x; the size can be changed in the HDFS configuration
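The number of blocks a file occupies follows directly from the block size. A quick calculation, assuming the Hadoop 2.x default of 128 MB:

```python
import math

def num_blocks(file_size_mb, block_size_mb=128):
    # A file is split into ceil(size / block_size) blocks;
    # the last block may be smaller than the block size.
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(500))      # 4 blocks (3 full blocks + 1 of 116 MB)
print(num_blocks(500, 64))  # 8 blocks under the Hadoop 1.x default of 64 MB
```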
  47. Types of File Format – MR • TextInputFormat – default • KeyValueTextInputFormat • SequenceFileInputFormat • SequenceFileAsTextInputFormat
  48. Reader and Writer • RecordReader – – Reads records from the file line by line; each line in the file is treated as one record – Runs before the Mapper function • RecordWriter – writes content into a file as output – Runs after the Reducer
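The line-by-line behaviour can be sketched in Python, mirroring how the default line reader hands (offset, line) records to the mapper. This is a sketch of the idea, not Hadoop's actual implementation:

```python
def record_reader(text):
    # Yield (byte_offset, line) pairs, one record per line, the way
    # Hadoop's default line RecordReader feeds records to the Mapper.
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

data = "first record\nsecond record\n"
records = list(record_reader(data))
print(records)  # [(0, 'first record'), (13, 'second record')]
```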
  49. Reducer • IdentityReducer – does not have shuffle capability • CustomReducer – shuffle and sorting capability
  50. Box Classes in MR • Equivalent to wrapper classes in Java • IntWritable • FloatWritable • LongWritable • DoubleWritable • Text • Mainly used for (K,V) pairs in MR
  51. Schema on Read/Write • Hadoop – schema-on-read approach • RDBMS – schema-on-write approach
  52. Key Steps in Big Data Solution • Ingesting Data • Storing Data • Processing Data
  53. HDFS
  54. Hadoop Tools • 15+ frameworks and tools such as Sqoop, Flume, Kafka, Pig, Hive, Spark, and Impala are used to ingest data into HDFS, store and process data within HDFS, and query data from HDFS for business intelligence and analytics. Some tools, like Pig and Hive, are abstraction layers on top of MapReduce, whilst others, like Spark and Impala, improve on MapReduce’s architecture/design for much better latencies, supporting near-real-time (NRT) and real-time processing.
  55. NRT • Near Real time – – Near real-time processing is when speed is important, but processing time in minutes is acceptable in lieu of seconds
  56. Heartbeat – HDFS • A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker
  57. MapReduce – Partition • All the values for a single key go from the mappers to the same reducer, which helps distribute the map output evenly over the reducers
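Hadoop's default partitioner is a hash partitioner: reducer = hash(key) mod numReduceTasks. A Python sketch, with a byte sum standing in for Java's hashCode (the keys and reducer count are illustrative):

```python
def partition(key, num_reducers):
    # Every occurrence of the same key maps to the same reducer index,
    # so one reducer sees all the values for that key.
    return sum(key.encode()) % num_reducers

# The two ("big", 1) pairs, even from different mappers, land on one reducer.
assignments = {k: partition(k, 4) for k, _ in [("big", 1), ("data", 1), ("big", 1)]}
print(assignments)
```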
  58. HDFS vs NAS (Network Attached Storage) • HDFS data blocks are distributed across the local drives of all machines in a cluster • NAS data is stored on dedicated hardware • HDFS has data redundancy because of the replication protocol • NAS has no data redundancy
  59. Commodity Hardware • Commodity hardware refers to inexpensive systems that do not offer high availability or high quality. Commodity hardware includes RAM because there are specific services that need to be executed in RAM
  60. Port Numbers • NameNode 50070 • JobTracker 50030 • TaskTracker 50060
  61. Combine-MapReduce • A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
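The data-volume saving is easy to see in a small sketch: run a local mini-reduce on one mapper's output before anything crosses the network (the input sentence is illustrative):

```python
from collections import Counter

def mapper(line):
    # Mapper: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Local mini-reduce on the mapper's node: collapse repeated keys
    # before the pairs are shipped to the reducers over the network.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

raw = mapper("to be or not to be")
combined = combiner(raw)
print(len(raw), len(combined))  # 6 pairs shrink to 4 before the shuffle
```

The reducers still receive correct partial counts; they just receive fewer pairs, which is exactly the efficiency gain the combiner provides.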
  62. MapReduce Programs • Driver – main-method class, invoked by the scheduler • Mapper • Reducer
  63. JobTracker – Functionality – Client applications submit MapReduce jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data – The JobTracker locates TaskTracker nodes with available slots at or near the data – The JobTracker submits the work to the chosen TaskTracker nodes – The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker – When the work is completed, the JobTracker updates its status – Client applications can poll the JobTracker for information
  64. DW – Data Warehouse • A database specifically for analysis and reporting purposes
  65. Hive-Supported File Formats • Text file (plain raw data) • Sequence file (key-value pairs) • RCFile (Record Columnar File, which stores the table’s columns in a columnar fashion)
  66. NameNode vs Hive Metastore • NameNode – stores the metadata about the files in Hadoop • Hive Metastore – stores the metadata about the tables/databases in Hive
  67. Tez – Hive • Executes complex directed acyclic graphs of general data-processing tasks • Performs better than MapReduce
  68. Bucketing – Hive • Bucketing provides a mechanism to query and examine random samples of the data • Bucketing offers the capability to execute queries on a random subset of the data
  69. Reference -Hive • Guide-Programming-Apache-Hive-ebook.pdf