2. Hadoop
Apache Hadoop is an open-source software framework written in Java for distributed
storage and distributed processing of very large data sets on computer clusters built
from commodity hardware.
3. History
o Hadoop was created by Doug Cutting and Mike Cafarella in 2005
o Hadoop's components were inspired by Google's papers on MapReduce and the
Google File System
Pictured: Doug Cutting, Mike Cafarella
4. Core of Hadoop
o HDFS (Hadoop Distributed File System): storage part
o MapReduce: processing part
5. HDFS- Hadoop Distributed File System
HDFS is a file system designed specifically for storing very large
data sets, and it can be deployed on a cluster of commodity
hardware.
8. Input splits are stored on commodity hardware, and the Name Node handles the metadata
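The division of labour between storage and metadata can be illustrated with a toy model. This is only a sketch: the split size, the node names, the round-robin placement, and the functions store_file and read_file are assumptions made for illustration and are not Hadoop's actual API or defaults.

```python
# Toy model: worker nodes hold the bytes, the Name Node holds only metadata.
# SPLIT_SIZE and the node names are illustrative assumptions, not Hadoop defaults.
SPLIT_SIZE = 16  # bytes per split (assumption; real HDFS blocks are far larger)


def store_file(data, node_names):
    """Cut data into splits, place them round-robin, and record the placement."""
    data_nodes = {name: {} for name in node_names}  # node -> {split_id: bytes}
    name_node = []                                  # metadata only: (split_id, node, length)
    for split_id, start in enumerate(range(0, len(data), SPLIT_SIZE)):
        chunk = data[start:start + SPLIT_SIZE]
        node = node_names[split_id % len(node_names)]  # round-robin (assumption)
        data_nodes[node][split_id] = chunk
        name_node.append((split_id, node, len(chunk)))
    return data_nodes, name_node


def read_file(data_nodes, name_node):
    """Reassemble the file by following the Name Node's metadata."""
    return b"".join(data_nodes[node][split_id] for split_id, node, _ in name_node)


data = b"big data stored across commodity hardware"
data_nodes, name_node = store_file(data, ["node-a", "node-b", "node-c"])
for entry in name_node:
    print(entry)  # one metadata row per split; no file contents
```

Note that the metadata table never contains file contents, only where each split lives. Replication, which a real cluster would need for fault tolerance, is left out of this sketch.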
Diagram: Big Data, Name Node, Job Tracker, and three Task Trackers, each running a Map task.
Every three seconds, each Task Tracker reports back to the Job Tracker that it is alive.
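The liveness check described here can be sketched as a small simulation. The three-second interval comes from the slide; the expiry threshold, the tracker names, and the JobTracker class below are hypothetical and chosen only to make the idea concrete.

```python
HEARTBEAT_INTERVAL = 3  # seconds, as stated on the slide
EXPIRY = 30             # assumption: a tracker silent for longer than this is treated as dead


class JobTracker:
    """Hypothetical stand-in that only records heartbeats."""

    def __init__(self):
        self.last_seen = {}  # task tracker name -> time of last heartbeat

    def heartbeat(self, tracker, now):
        self.last_seen[tracker] = now

    def alive(self, now):
        """Return the trackers heard from within the expiry window."""
        return sorted(t for t, seen in self.last_seen.items() if now - seen <= EXPIRY)


job_tracker = JobTracker()
# Simulated clock: tt1 and tt2 report every 3 s for a minute; tt3 stops after 9 s.
for tick in range(0, 61, HEARTBEAT_INTERVAL):
    job_tracker.heartbeat("tt1", tick)
    job_tracker.heartbeat("tt2", tick)
    if tick <= 9:
        job_tracker.heartbeat("tt3", tick)

print(job_tracker.alive(now=60))  # tt3 has been silent for 51 s, so it is missing
```

In this sketch a missed heartbeat is detected passively, by comparing timestamps, rather than by the Job Tracker polling each Task Tracker.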
9. File Formats
o Text Input Format (default)
o Key Value Text Input Format
o Sequence File Input Format
o Sequence File as Text Input Format
10. The record reader converts the data to (key, value) pairs
Flow: Record Reader -> Mapper -> Reducer -> Name Node
The record reader emits (Key, Value) = (Byte offset, Entire Line).
The mappers and the reducer hold the program logic and understand
only (Key, Value) pairs.
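A minimal sketch of this conversion, assuming plain text input: the record_reader function below yields (byte offset, entire line) pairs, and the word-splitting mapper stands in for arbitrary program logic. Both function names are hypothetical and do not belong to the Hadoop API.

```python
def record_reader(data):
    """Yield (byte offset, entire line) pairs, one per line of input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the next line starts after this line's bytes


def mapper(key, value):
    """Example program logic (an assumption): emit (word, 1) for each word."""
    for word in value.split():
        yield word, 1


text = b"hello hadoop\nhello hdfs\n"
records = list(record_reader(text))
print(records)  # [(0, 'hello hadoop'), (13, 'hello hdfs')]

pairs = [kv for key, value in records for kv in mapper(key, value)]
print(pairs)
```

The mapper never sees the raw file, only the pairs handed to it by the record reader. Here it ignores the key (the offset) and works on the value.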
11. Reducer
The reducer combines the processed
data from the Data Nodes and reports
to the Name Node where the
output is stored
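The combining step can be sketched as follows, assuming the map tasks have already produced (word, 1) pairs on two nodes. The sample data, the shuffle helper, and the summing reducer are illustrative assumptions; reporting the output location to the Name Node is not modelled.

```python
from collections import defaultdict


def shuffle(mapped_pairs):
    """Group values by key so each key reaches the reducer exactly once."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Example reduce logic (an assumption): sum the values for one key."""
    return key, sum(values)


# Hypothetical output of the map tasks on two Data Nodes
node_outputs = [
    [("hello", 1), ("hadoop", 1)],
    [("hello", 1), ("hdfs", 1)],
]
all_pairs = [pair for output in node_outputs for pair in output]
result = dict(reducer(k, v) for k, v in sorted(shuffle(all_pairs).items()))
print(result)  # {'hadoop': 1, 'hdfs': 1, 'hello': 2}
```

Grouping by key before reducing is what lets the reducer treat data from different nodes uniformly.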
Cont…