The document provides an overview of the Hadoop ecosystem and its core components, HDFS and MapReduce. HDFS is the storage layer that stores large files across the nodes of a cluster; MapReduce is the processing framework that handles large datasets in parallel across those nodes. The document also covers other tools in the Hadoop ecosystem, such as Hive and Pig, as well as commercial Hadoop distributions, and gives examples of running MapReduce jobs and accessing HDFS from the command line.
1. Hadoop Eco System
Presented by : Sreenu Musham
27th March, 2015
Sreenu Musham
Data Warehouse Architect
sreenu.musham@yahoo.com
Hadoop Introduction
2. Agenda
Big Data and Its Challenges (Recap)
Hadoop and Its Evolution
Terminology used
HDFS
MapReduce
Hadoop Eco System
Hadoop Distributors
Feel of Hadoop (how it looks?)
3. Big Data and Challenges
1024 KB = 1MB
1024 MB = 1GB
1024 GB = 1TB
1024 TB = 1Petabyte
So on … Exabytes, Zettabytes, Yottabytes, Brontobytes,
Geopbytes
4. Big Data and Challenges
Disk I/O, network, and processing bottlenecks make it hard to finish jobs in time
Storage: costly on enterprise-class machines
Vertical scaling is not always the right solution
Reliability
Handling unstructured data
Schema-less data
5. Disk Vs Transfer Rate
[Figure: disk capacity has grown much faster than transfer rate, so the time to read an entire disk has grown across drive generations: 126 seconds, 58 minutes, 4 hours]
6. What is Hadoop?
Apache open-source software framework for reliable,
scalable, distributed computing over massive amounts of data
A framework where a job is divided among the nodes and
processed in parallel
Hides underlying system details and complexities from the
user
Developed in Java
A set of machines running HDFS and MapReduce is
known as a Hadoop cluster
Core Components
HDFS MapReduce
7. Hadoop is not for all types of work
• Processing transactions
• Low-latency data access
• Lots of small files
• Intensive calculations with little data
Hadoop initiated new kinds of analysis
• Not just the old thinking applied to bigger data
• Iterate over whole data sets, not only sample sets
• Use multiple data sources (beyond structured
data)
• Work with schema-less data
10. Story Behind Hadoop
2003: Google publishes its paper on GFS
2004: Google publishes its paper on MapReduce
2005: Created by Doug Cutting and Michael Cafarella (Yahoo)
2006: Yahoo donated the project to Apache; the name "Hadoop" was chosen by Doug
2008: Terabyte Sort
2009: Runs a 4000-node Hadoop cluster
2010: Launches HIVE
13. What is HDFS?
HDFS runs on top of the existing file system
Uses blocks to store a file or parts of a file
Stores data across multiple nodes
The size of a file can be larger than any single disk in the
network
The default block size is 64MB
The Main Objectives
Storing Very large files
Streaming data access
Commodity hardware
Allow access to data on any node in the cluster
Able to handle hardware failures
14. What is HDFS?
If a chunk of a file is smaller than the HDFS block size,
only the needed space is used
Example: a 300MB file with 64MB blocks uses four full blocks
plus a final 44MB chunk, which occupies only 44MB on disk
HDFS blocks are replicated to several nodes for reliability
Maintains checksums of data for corruption detection and
recovery
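To get a feel for how a program reads data stored in HDFS, the sketch below uses Hadoop's Java FileSystem API. It is not part of the original slides; the class name HdfsCat and the example path are illustrative only.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Illustrative utility (hypothetical name): print an HDFS file to stdout
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                             // e.g. hdfs://namenode/user/sreenu/samplefile.txt (example path)
    Configuration conf = new Configuration();         // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                    // HDFS locates the blocks and the nodes that hold them
      IOUtils.copyBytes(in, System.out, 4096, false); // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}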
15. What is MapReduce?
A programming model for processing the data in
the Hadoop cluster
Consists of two phases: Map, and then Reduce
Each Map task operates on a discrete portion of the overall
dataset
•Typically one HDFS block of data
After all Maps are complete, the MapReduce system
distributes the intermediate data to the nodes that perform
the Reduce phase
The Hadoop framework parallelizes the computation and handles
failures, communication efficiency, and performance issues.
16. Sample Data Flow
MAP
- Take a large problem and divide it into sub-problems
- Break the data set down into small chunks
- Perform the same function on all sub-tasks
REDUCE
Combine the output from all sub-tasks
It works like a Unix pipeline
cat input | grep <pattern> | sort | uniq -c > output
Input | Map | Shuffle & Sort | Reduce | Output
19. Sample Data Flow
Input -> Map -> Shuffle & Sort -> Reduce -> Output
<k1, v1> -> list(<k2, v2>) -> <k2, list(v2)> -> list(<k3, v3>)
Map output: (1950, 0), (1950, 22), (1950, −11), (1949, 111), (1949, 78)
After Shuffle & Sort: (1949, [111, 78]), (1950, [0, 22, −11])
Reduce output: (1949, 111), (1950, 22)
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(1, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(2, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(3, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(4, 0043012650999991949032418004...0500001N9+00781+99999999999...)
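As a sketch of how the map and reduce functions behind this flow could be written in Java with the org.apache.hadoop.mapreduce API (this code is not in the original deck): the class names are illustrative, and the field offsets assume the full fixed-width NCDC record rather than the truncated lines shown above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (year, temperature) for each input record
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);   // year field (offset assumes the full record)
    String temp = line.substring(87, 92);   // signed temperature field, e.g. +0011 or -0011
    int airTemperature = temp.startsWith("+")
        ? Integer.parseInt(temp.substring(1))
        : Integer.parseInt(temp);
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

// Reduce: keep the maximum temperature seen for each year
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}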
24. Replication of Blocks
[Diagram: a Hadoop cluster with three racks (Rack 1, Rack 2, Rack 3), each containing several data nodes, plus a Name Node. An HDFS client writes file.txt; the Name Node tracks its block list (file.txt: B1, B2, B3) and the blocks are replicated across nodes on different racks.]
28. Accessing hadoop/HDFS
hadoop fs -ls <path>
hadoop fs -mkdir testsreenu
hadoop fs -copyFromLocal samplefile.txt (same as: hadoop fs -put samplefile.txt)
Running mapreduce
hadoop jar newjob.jar samplefile in_dir out_dir
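For illustration only (not from the original slides), the main class packaged inside a jar like newjob.jar might look like the driver below, which wires in the mapper and reducer sketched earlier; the class and job names are hypothetical, and args[0]/args[1] stand for in_dir and out_dir.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class: configures and submits the MapReduce job
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();                 // Hadoop 2.x style; older releases use new Job()
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));   // in_dir: input files in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // out_dir: must not already exist

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);       // 0 on success, 1 on failure
  }
}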
Hive
Pig
PIG code
a. Load customer records
cust = LOAD '/input/custs' USING PigStorage(',') AS
(custid:chararray, firstname:chararray, lastname:chararray, age:long, profession:chararray);
b. Select only 100 records
amt = LIMIT cust 100;
DUMP amt;
c. Group customer records by profession
groupbyprofession = GROUP cust BY profession;
DESCRIBE groupbyprofession;
Feel of Hadoop
29. Hadoop Vs Other Systems
Computing Model
- Distributed Databases: notion of transactions; a transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; a job is the unit of work; no concurrency control
Data Model
- Distributed Databases: structured data with a known schema; Read/Write mode
- Hadoop: any data in any format, (un)(semi)structured; ReadOnly mode
Cost Model
- Distributed Databases: expensive servers
- Hadoop: cheap commodity machines
Fault Tolerance
- Distributed Databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance
Key Characteristics
- Distributed Databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance
• Cloud Computing
• A computing model where any computing
infrastructure can run on the cloud
• Hardware & Software are provided as remote services
• Elastic: grows and shrinks based on the user’s
demand
• Example: Amazon EC2
30. Q & A
Hadoop Eco System
Presented By
Sreenu Musham