BigData: A Survey on Scheduling Methods in Hadoop MapReduce
1. Acharya Institute of Technology, Bangalore
A Technical Seminar on
A Survey of Scheduling Methods
in Hadoop MapReduce Framework
Presented by,
Mahantesh C. Angadi
M.Tech (CNE) First Year
Mahantesh.mtcn.13@acharya.ac.in
Under the Guidance of,
Prof. Amogh P. Kulkarni
AIT, Bangalore
Dept. of ISE, AIT, Bangalore
2. Agenda
Motivation
Introduction
What is BigData…?
What is Hadoop…?
What is HDFS and MapReduce…?
Challenges in MapReduce
Literature Survey on Scheduling in MapReduce
Survey of proposed scheduling methods
Conclusion
References.
3. Motivation
“Necessity” is the Mother of All the Inventions…!
In the early 2000s, Google faced a serious
challenge: to organize the world's information.
Google designed a new data processing infrastructure.
i. Google File System (GFS)
ii. MapReduce
In 2004, Google published a paper describing its work to the
community.
Doug Cutting decided to use the technique Google described.
4. Introduction
With the current trend of increased Internet use in
everything, a lot of data is generated and needs to be analysed.
Web search engines and social networking sites capture and
analyze every user action on their sites to improve site
design, detect spam, and find advertising opportunities.
This data is best processed using distributed
computing and parallel processing mechanisms.
Hadoop MapReduce is one of the most popular such
techniques for handling BigData, so here we discuss its
different scheduling methods.
5. What is BigData…?
Today we live in the data age.
Every day, we create 2.5 quintillion bytes of data; 90% of
this data is unstructured.
90% of the data in the world today has been created in the
last two years alone.
Cisco estimates that by the end of 2015 global Internet
traffic will reach 4.8 zettabytes a year.
Ex. social networking sites, airlines, healthcare
departments, satellites, etc.
6. How is BigData Generated…?
7. What is Apache Hadoop…?
Apache Hadoop is an open-source software
framework.
A platform to manage Big Data.
It's not just a tool; it's a framework of tools.
Most Important Hadoop subprojects:
i. HDFS: Hadoop Distributed File System
ii. MapReduce: A Programming Model
9. Why only Hadoop…?
It is schema-less, whereas an RDBMS is schema-based.
Handles large volumes of unstructured data easily.
Hadoop is designed to run on cheap commodity
hardware.
Automatically handles data replication and node
failure.
Moving Computation is cheaper than moving Data.
Last but not least, it's free! (Open source)
10. What is Hadoop HDFS…?
Inspired by Google File System.
It's a scalable, distributed, reliable file system
written in Java for the Hadoop framework.
An HDFS cluster primarily consists of:
i. NameNode
ii. DataNode
Stores very large files in blocks across machines in
a large Cluster, deployed on low-cost hardware.
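The block-and-replica mechanics above can be sketched in a few lines. This is a toy illustration only: the tiny block size, node names, and round-robin placement below are made up for readability (real HDFS uses large blocks, e.g. 128 MB, and rack-aware replica placement chosen by the NameNode).

```python
import itertools

BLOCK_SIZE = 4    # bytes per block; tiny here only for illustration
REPLICATION = 3   # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int = REPLICATION):
    """Round-robin placement sketch: each block is copied to `replication` nodes."""
    placement = {}
    ring = itertools.cycle(datanodes)
    for block_id in range(num_blocks):
        placement[block_id] = [next(ring) for _ in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement[0])  # 4 ['dn1', 'dn2', 'dn3']
```

The point of replication is the slide's earlier claim: with three copies per block, the loss of any single DataNode never loses data.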
11. What is MapReduce…?
A software framework for distributed processing of
large data sets on computer clusters.
First developed by Google.
Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner.
It includes JobTracker and TaskTracker.
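The map-shuffle-reduce data flow can be sketched in memory with the classic WordCount example. No Hadoop cluster is involved; this only shows the shape of the computation that JobTracker and TaskTracker distribute across nodes.

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word, like the WordCount mapper
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase("to be or not to be")))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the map and reduce calls run in parallel on many machines, and the shuffle moves data over the network; the logic per phase is the same.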
14. Challenges of MapReduce
Job Scheduling problems
As the number and variety of jobs to be executed across
heterogeneous clusters increase, so does the complexity of
scheduling them efficiently to meet the required performance
objectives.
Energy Efficiency Problems
Cluster sizes are usually in the hundreds or
thousands of nodes, so there is a need to look at the energy
efficiency of MapReduce clusters.
15. Literature Survey
Hadoop MapReduce Scheduling methods can be categorized
based on their runtime behavior as follows.
Adaptive (Dynamic) Algorithms
These methods use previous, current, and/or
future parameter values to make scheduling decisions.
Ex. Fair, Capacity, Throughput scheduler, etc.
Non-adaptive (Static) Algorithms
These methods do not take into account changes
taking place in the environment and schedule jobs/tasks
per a predefined policy/order.
Ex. FIFO (First In First Out).
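The practical difference between the static and adaptive families can be seen in a toy single-slot simulation with made-up task counts (this is an illustration of the idea, not Hadoop's actual scheduler code): FIFO makes a small job wait behind a big one, while fair sharing interleaves them.

```python
def fifo_schedule(jobs):
    """FIFO: run each job to completion in arrival order; return finish times."""
    clock, finish = 0, {}
    for name, tasks in jobs:
        clock += tasks
        finish[name] = clock
    return finish

def fair_schedule(jobs):
    """Fair sharing sketch: round-robin one task per job per pass."""
    remaining = {name: tasks for name, tasks in jobs}
    clock, finish = 0, {}
    while remaining:
        for name in list(remaining):
            clock += 1
            remaining[name] -= 1
            if remaining[name] == 0:
                finish[name] = clock
                del remaining[name]
    return finish

jobs = [("big", 10), ("small", 2)]   # (job name, number of unit tasks)
print(fifo_schedule(jobs))   # {'big': 10, 'small': 12}
print(fair_schedule(jobs))   # {'small': 4, 'big': 12}
```

Under FIFO the 2-task job finishes at time 12; under fair sharing it finishes at time 4, while the big job finishes at time 12 either way. This is why Facebook contributed the Fair Scheduler for mixed workloads.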
17. [1]. Survey of Task Scheduling Methods for
MapReduce Framework in Hadoop.
This paper surveys various earlier scheduling
methods which have been proposed.
These scheduling methods include:
First In First Out (FIFO) scheduler,
Fair Scheduler,
Capacity Scheduler,
LATE scheduler,
Deadline Constraint scheduler,
etc.
18. [1]. Conclusion and future scope
Performance can be improved by achieving data locality in
the MapReduce framework.
The authors conclude with a discussion of how these
scheduling methods apply to heterogeneous Hadoop clusters.
19. [2]. Perform Wordcount MapReduce Job in Single Node
Apache Hadoop Cluster & Compress Data Using LZO
Algorithm.
Services like Yahoo, Facebook, and Twitter generate huge
volumes of data which must be stored and retrieved as
clients access it.
Storing this data requires ever-larger databases, increasing
physical storage and complicating the analysis required for
business growth.
The Lempel-Ziv-Oberhumer (LZO) algorithm is used to
compress the redundant data.
The LZO algorithm was designed with "speed as the
priority."
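LZO itself is not in the Python standard library, so as a stand-in the sketch below uses zlib at two compression levels to illustrate the same speed-versus-ratio trade-off the slide describes: a fast level that compresses less thoroughly (LZO's design point) versus a slow level that squeezes harder (closer to gzip's). The input data and levels are made up for illustration; the timings and ratios here are not LZO's.

```python
import zlib

data = b"redundant data " * 10_000   # highly repetitive input, like log files

fast = zlib.compress(data, level=1)   # favours speed, analogous to LZO's goal
small = zlib.compress(data, level=9)  # favours ratio, analogous to gzip usage

# Both round-trip losslessly; the fast level trades some size for speed.
assert zlib.decompress(fast) == data
assert zlib.decompress(small) == data
print(len(data), len(fast), len(small))
```

Either way the compressed file is far smaller than the original, which is the storage win the paper measures in HDFS.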
20. [2]. Conclusion and future scope
The LZO algorithm compresses files about 5 times faster
than the gzip format.
LZO decompression is about 2 times faster than
gzip.
The LZO file is slightly larger than the gzip file after
compression.
A file compressed with either LZO or gzip is much
smaller than the original file.
In future this can be implemented on heterogeneous
multi-node clusters.
21. [3]. S3: An Efficient Shared Scan Scheduler on MapReduce
Framework.
To improve performance, multiple jobs operating on a common
data file can be processed as a batch to share the cost of
scanning the file.
Jobs often do not arrive at the same time.
S3 operates as follows:
While the system is processing a batch of sub-jobs,
other sub-jobs wait in the job queue.
When a new job arrives, its sub-jobs are aligned with the
waiting sub-jobs in the queue.
Once the current batch of sub-jobs completes processing,
the next batch of sub-jobs is initiated.
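The shared-scan idea can be sketched as a single pass over the input that feeds every job in the batch. The record format and the two per-record job functions below are hypothetical; the point is that the file is read once, not once per job.

```python
def shared_scan(records, jobs):
    """Run a batch of jobs over a single scan of a shared input.

    `jobs` maps a job name to a per-record function; each record is read
    once and handed to every job in the batch (None results are skipped).
    """
    results = {name: [] for name in jobs}
    for record in records:            # one scan, shared by all jobs
        for name, fn in jobs.items():
            out = fn(record)
            if out is not None:
                results[name].append(out)
    return results

records = ["error: disk", "ok", "error: net", "ok"]   # stand-in for a log file
batch = {
    "count_errors": lambda r: 1 if r.startswith("error") else None,
    "upper": lambda r: r.upper(),
}
out = shared_scan(records, batch)
print(sum(out["count_errors"]), out["upper"][0])  # 2 ERROR: DISK
```

With N jobs over the same file, this amortizes one scan instead of paying for N scans, which is the cost S3 shares across the batch.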
22. [3]. Conclusion and future scope
S3 can exploit the sharing of data scan to improve
performance.
Unlike existing batch-based schedulers, S3 allows jobs to
be processed as they arrive, so an arriving job does not
need to wait for a long time.
More computational policies, such as computational
resources and job priorities, can be added to S3 to make it
more flexible.
23. [4]. Two Sides of a Coin: Optimizing the Schedule of
MapReduce Jobs to Minimize their Makespan and Improve
Cluster Performance.
This paper addresses the key challenge of increasing the
utilization of MapReduce clusters.
The goal is to automate the design of a job schedule
that minimizes the overall completion time (makespan) of a
set of MapReduce jobs.
A novel abstraction framework and a heuristic called
BalancedPools are discussed.
24. [4]. Conclusion and future scope
They simulated the approach over a realistic workload
and observed 15%-38% completion-time improvements.
This shows that the order in which jobs are executed can
have a significant impact on their overall completion time
and on cluster resource utilization.
A future step may be to address the more general
problem of minimizing the makespan of batch workloads.
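Why job order changes the makespan can be illustrated with the classical two-stage flow-shop model (a map stage followed by a reduce stage, one slot each) and Johnson's rule. This is a simplification for illustration, with made-up stage times; it is not the paper's BalancedPools heuristic itself, though that heuristic builds on the same flow-shop intuition.

```python
def makespan(order, stage_times):
    """Two-stage flow-shop makespan: a job's reduce starts only after its
    own map finishes AND the previous job's reduce finishes."""
    t_map = t_reduce = 0
    for job in order:
        m, r = stage_times[job]
        t_map += m                           # this job's map finishes here
        t_reduce = max(t_reduce, t_map) + r  # reduce waits for both events
    return t_reduce

def johnson_order(stage_times):
    """Johnson's rule: jobs with map < reduce first (ascending map time),
    then jobs with map >= reduce (descending reduce time)."""
    head = sorted((j for j, (m, r) in stage_times.items() if m < r),
                  key=lambda j: stage_times[j][0])
    tail = sorted((j for j, (m, r) in stage_times.items() if m >= r),
                  key=lambda j: stage_times[j][1], reverse=True)
    return head + tail

jobs = {"A": (5, 2), "B": (1, 6), "C": (4, 4)}   # hypothetical (map, reduce) times
print(makespan(["A", "B", "C"], jobs))      # 17: arbitrary arrival order
print(makespan(johnson_order(jobs), jobs))  # 13: Johnson's order B, C, A
```

Reordering the same three jobs cuts the makespan from 17 to 13 time units, the same kind of gap behind the paper's reported 15%-38% improvements.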
25. [5]. ThroughputScheduler: Learning to Schedule on
Heterogeneous Hadoop Clusters.
Presently available schedulers for Hadoop clusters assign
tasks to nodes without regard to the capability of the nodes.
This paper proposes a method, which reduces the overall job
completion time on a cluster of heterogeneous nodes by
actively scheduling tasks on nodes based on optimally
matching job requirements to node capabilities.
Node capabilities are learned by running probe jobs on the
cluster.
A Bayesian active learning scheme is used to learn the
resource requirements of jobs on the fly.
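The capability-matching idea can be sketched with a toy runtime model and greedy placement. The node profiles, job demands, and the runtime formula below are all made up for illustration; the actual paper learns these parameters with probe jobs and Bayesian active learning rather than assuming them.

```python
def estimated_runtime(job, node):
    """Toy cost model: runtime grows with job demand, shrinks with capability."""
    return job["cpu"] / node["cpu"] + job["io"] / node["io"]

def assign_tasks(jobs, nodes, slots_per_node=1):
    """Greedy sketch: each task goes to the free node where its estimated
    runtime is lowest, matching job requirements to node capabilities."""
    free = {name: slots_per_node for name in nodes}
    assignment = {}
    for task, job in jobs.items():
        candidates = [n for n in nodes if free[n] > 0]
        best = min(candidates, key=lambda n: estimated_runtime(job, nodes[n]))
        assignment[task] = best
        free[best] -= 1
    return assignment

# Hypothetical heterogeneous cluster and workload:
nodes = {"fast-cpu": {"cpu": 4.0, "io": 1.0}, "fast-io": {"cpu": 1.0, "io": 4.0}}
jobs = {"cpu-bound": {"cpu": 8.0, "io": 1.0}, "io-bound": {"cpu": 1.0, "io": 8.0}}
print(assign_tasks(jobs, nodes))  # {'cpu-bound': 'fast-cpu', 'io-bound': 'fast-io'}
```

A capability-blind scheduler could just as easily put the CPU-bound task on the I/O-optimized node; matching demand to capability is where the reported 20%-40% gains come from.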
26. [5]. Conclusion and future scope
The framework learns both server capabilities and job task
parameters autonomously.
ThroughputScheduler can reduce total job completion time
by almost 20% compared to the Hadoop Fair Scheduler
and 40% compared to FIFO Scheduler.
ThroughputScheduler also reduces average mapping time
by 33% compared to either of these schedulers.
27. Conclusion
Local data processing takes less time than moving data
across the network, so to improve job performance most
algorithms work to improve data locality. To meet user
expectations, scheduling algorithms must use prediction methods
based on the volume of data to be processed and the underlying
hardware. As future work, we can consider developing algorithms
that schedule jobs efficiently on heterogeneous clusters.
28. References
[1]. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on
Large Clusters.”, Proc. Sixth Symp. on Operating System Design and
Implementation (OSDI), San Francisco, CA, Dec. 6-8, USENIX, 2004.
[2]. Lei Shi, Xiaohui Li, Kian-Lee Tan, “S3: An Efficient Shared Scan Scheduler
on MapReduce Framework.”, School of Computing National University of
Singapore, comp.nus.edu.sg, 2012.
[3]. Dr. Umesh Bellur, Nidhi Tiwari, “Scheduling and Energy Efficiency
Improvement Techniques for Hadoop MapReduce: State of Art and Directions
for Future Research.”, Department of Computer Science and Engineering
Indian Institute of Technology, Mumbai.
[4]. Abhishek Verma, Ludmila Cherkasova, Roy H. Campbell, “Two Sides of a
Coin: Optimizing the Schedule of MapReduce Jobs to Minimize Their Makespan
and Improve Cluster Performance.”, HP Labs. Supported in part by Air Force
Research grant FA8750-11-2-0084.
[5]. Nandan Mirajkar, Sandeep Bhujbal, Aaradhana Deshmukh, “Perform
Wordcount MapReduce Job in Single Node Apache Hadoop Cluster and
Compress Data Using Lempel-Ziv-Oberhumer (LZO) Algorithm.”, Department
of Advanced Software and Computing Technologies IGNOU –I2IT, Centre of
Excellence for Advanced Education and Research Pune, India.
29. References continued…
[6]. Shouvik Bardhan, Daniel A. Menascé, “The Anatomy of
MapReduce Jobs, Scheduling, and Performance Challenges”,
Proceedings of the 2013 Conference of the Computer Measurement
Group, San Diego, CA, November 5-8, 2013.
[7]. Shekhar Gupta, Christian Fritz, Bob Price, Roger Hoover, and
Johan de Kleer, “ThroughputScheduler: Learning to Schedule on
Heterogeneous Hadoop Clusters”, USENIX Association, 10th
International Conference on Autonomic Computing (ICAC 2013).
[8]. Nilam Kadale, U. A. Mande, “Survey of Task Scheduling Method
for MapReduce Framework in Hadoop.”, 2nd National Conference on
Innovative Paradigms in Engineering & Technology (NCIPET 2013).
[9]. Tom White, “Hadoop: The Definitive Guide.”, 2nd edition, O’Reilly
Media, Sebastopol, CA 95472, October 2010.
[10]. J. Jeffrey Hanson, “An Introduction to the Hadoop Distributed
File System.”, IBM developerWorks, 2011.
Dept. of ISE, AIT, Bangalore