Big Data Processing: Performance Gain Through In-Memory Computation
By
Group 4: David Holland, Joy Rahman, Prosunjit Biswas, Rehana Begam, Yang Zhou
Introduction:
The main objective of this project is to analyze the performance gain in Big Data processing achieved through in-memory computation. We studied the Hadoop MapReduce and Spark in-memory frameworks, measured the execution time of a benchmark on both, and analyzed the results to quantify the performance gain achieved by Spark.
Background and Motivation:
The rapid development of the Internet has generated vast amounts of data that pose big challenges to traditional data processing models. To deal with these challenges, a variety of cluster computing frameworks have been proposed to support large-scale, data-intensive applications on commodity machines. MapReduce, introduced by Google, is one such successful framework for processing large data sets in a scalable, reliable and fault-tolerant manner.
Apache Hadoop provides an open-source implementation of MapReduce. It is a very popular general-purpose framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware, and it is used for many different classes of data-intensive applications. To process data, Hadoop MapReduce ships code to the nodes that hold the required data, and the nodes then process the data in parallel. This approach takes advantage of data locality. The term Hadoop often refers to the "Hadoop ecosystem", that is, the combination of Hadoop and the additional software packages that can be installed on top of or alongside it, such as Pig, Hive, HBase and Spark.
Spark is an emerging compute engine for Hadoop data. It provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Spark is designed around a global cache mechanism and can achieve better response times because of its in-memory access across the distributed machines of the cluster.
Hadoop MapReduce is poorly suited to iterative operations because of the cost of reloading data from disk at each iteration. MapReduce cannot keep reused data and state information across an execution, so it reads the same data repeatedly and materializes intermediate results on local disks in every iteration, incurring many disk accesses, I/O operations and unnecessary computations. Spark, on the other hand, achieves better execution times by caching intermediate data in memory across iterative operations. Most machine learning algorithms run over the same data set iteratively, and in MapReduce there is no easy way to share state and data from one iteration to the next. Spark is designed to overcome these shortcomings of MapReduce: through a data structure called the Resilient Distributed Dataset (RDD), Spark can effectively improve the performance of iterative jobs with low-latency requirements.
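To make the contrast concrete, the following PySpark sketch (hypothetical master URL and input path; written against the RDD API of Spark 1.x) caches a parsed dataset in memory, so that repeated actions over it avoid re-reading from HDFS:

    from pyspark import SparkContext

    # Hypothetical master URL and input path; adjust for the actual cluster.
    sc = SparkContext("spark://master:7077", "CacheDemo")

    # Parse the edge list once...
    edges = sc.textFile("hdfs:///data/edges.txt") \
              .map(lambda line: tuple(line.split()))

    # ...and pin it in memory. Without cache(), every action below
    # would re-read and re-parse the file from disk.
    edges.cache()

    print(edges.count())   # first action: reads from HDFS, fills the cache
    print(edges.count())   # later actions: served from memory

In MapReduce, by contrast, the equivalent of each action is a separate job that must re-read its input from HDFS.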
In this project, we conducted experiments to compare the system performance of Hadoop MapReduce and Spark, using execution time as the performance metric. We chose a typical iterative algorithm, PageRank, and ran it over several real data sets on both frameworks.
Experimental Environment:
I. Cluster Architecture
The experimental cluster is composed of six computers: one designated as the master, and the other five as slaves. All of the machines run Ubuntu 12.04.2 (GNU/Linux 3.5.0-28-generic, x86_64).
Table 1 shows the hostname, machine model, IP address, CPU and memory information of the computers. We use Hadoop 1.2.1 and Spark 1.1.0 for all the experiments. Figure 1 shows the overall testbed architecture of our system.
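For reference, in Hadoop 1.x the worker nodes of such a cluster are listed, one hostname per line, in the conf/slaves file on the master node (a sketch assuming the hostnames from Table 1 resolve through /etc/hosts):

    # conf/slaves -- hosts started by start-dfs.sh / start-mapred.sh
    slave0
    slave1
    slave2
    slave3
    slave4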
II. Dataset Description
We chose five real graph datasets for our comparative experiments; Table 2 lists them. They are all in edge-list format: each line in a file is a [src ID] [target ID] pair separated by whitespace. These real graph datasets come from SNAP [3].
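For illustration, the first few lines of such a file look like the following (hypothetical node IDs; SNAP files may also begin with comment lines marked by '#'):

    # FromNodeId    ToNodeId
    0    4
    0    11
    2    4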
Figure 1: Testbed Architecture
Implementation:
I. Benchmark: PageRank
PageRank is an algorithm used by Google Search to rank websites in its search-engine results; it was named after Larry Page, one of the founders of Google. PageRank measures the importance of website pages by counting the number and quality of links to a page to produce a rough estimate of how important the page is. The underlying assumption is that more important websites are likely to receive more links from other websites.
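In one common formulation, with damping factor $d$ (typically 0.85), $N$ the total number of pages, $B(p)$ the set of pages that link to $p$, and $L(q)$ the number of outbound links of $q$, each iteration updates

$$PR(p) = \frac{1-d}{N} + d \sum_{q \in B(p)} \frac{PR(q)}{L(q)}$$

Some implementations, including the PageRank example distributed with Spark, use the simplified damped form $0.15 + 0.85 \sum_{q \in B(p)} PR(q)/L(q)$.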
II. Execution Model for MapReduce
Figure 2 shows the steps for setting up HDFS for our input datasets. In the execution model for MapReduce, given the input files for the PageRank algorithm, we distribute the data over the Hadoop cluster and run three different MapReduce jobs on the data; Figure 3 gives a summary of these jobs.
Hostname          Machine      IP               CPU info   Memory
master            Hadoop-6     10.0.0.11        1          2 GB
slave0-slave4     Hadoop-2-5   10.0.0.2/4/5/9   1          2 GB

Table 1: Information of machines in the cluster

Name               File size   Nodes     Edges       Description
wiki-Vote          1.0 MB      7,115     103,689     Wikipedia who-votes-on-whom network
p2p-Gnutella31     10.8 MB     62,586    147,892     Gnutella peer-to-peer network from August 31, 2002
soc-Epinions1      5.531 MB    75,879    508,837     Who-trusts-whom network of Epinions.com
soc-Slashdot0811   10.496 MB   77,360    905,468     Slashdot social network from November 2008
web-Google         71.9 MB     875,713   5,105,039   Web graph from Google

Table 2: Graph Datasets
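As a rough, simplified sketch of what one such iteration job looks like in the MapReduce model (an illustrative Hadoop Streaming-style mapper and reducer in Python, not our exact job code), the mapper distributes each page's rank to its neighbors and the reducer re-aggregates the contributions:

    #!/usr/bin/env python
    # mapper.py -- input lines: "page<TAB>rank<TAB>comma-separated-neighbors"
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        page, rank = fields[0], float(fields[1])
        neighbors = fields[2].split(",") if len(fields) > 2 and fields[2] else []
        # Re-emit the adjacency list so the reducer can rebuild the record.
        print("%s\t#LINKS\t%s" % (page, ",".join(neighbors)))
        # Send an equal share of this page's rank to every neighbor.
        for n in neighbors:
            print("%s\t%f" % (n, rank / len(neighbors)))

    #!/usr/bin/env python
    # reducer.py -- keys arrive grouped: sum contributions, apply damping.
    import sys

    def emit(page, links, total):
        print("%s\t%f\t%s" % (page, 0.15 + 0.85 * total, links))

    current, links, total = None, "", 0.0
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if parts[0] != current:
            if current is not None:
                emit(current, links, total)
            current, links, total = parts[0], "", 0.0
        if parts[1] == "#LINKS":
            links = parts[2]
        else:
            total += float(parts[1])
    if current is not None:
        emit(current, links, total)

Chaining several such jobs, one per iteration, is what forces MapReduce to write ranks back to HDFS between iterations.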
III. Execution Model for Spark
Here we used the same datasets and HDFS configuration as for the MapReduce execution model, and rewrote the MapReduce jobs to take advantage of Spark's RDDs. A summary of this work is given in Figure 4.
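A condensed PySpark version of the same computation, modeled on the PageRank example distributed with Spark (the input path and iteration count below are placeholders), shows where the RDD cache comes in:

    from pyspark import SparkContext

    sc = SparkContext(appName="PageRank")

    # Build the adjacency lists once and cache them: the link structure
    # is reused unchanged in every iteration.
    lines = sc.textFile("hdfs:///data/web-Google.txt") \
              .filter(lambda l: not l.startswith("#"))
    links = lines.map(lambda l: tuple(l.split())) \
                 .distinct().groupByKey().cache()
    ranks = links.mapValues(lambda _: 1.0)

    for i in range(5):  # five iterations, as in our experiments
        # Each page sends rank / out-degree to every neighbor...
        contribs = links.join(ranks).flatMap(
            lambda kv: [(n, kv[1][1] / len(kv[1][0])) for n in kv[1][0]])
        # ...and the contributions are summed and damped.
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)

    print(ranks.take(5))

Because links and the successive ranks RDDs stay in memory, each iteration starts from cached data rather than from HDFS.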
Figure 2: HDFS setup
Figure 3: Execution model for Hadoop
Figure 4: Execution model on Spark
Experimental Results:
For each dataset, we run PageRank with different numbers of iterations on Hadoop and Spark to see how the iteration count affects performance, and we record the total running time each dataset takes for each number of iterations. We stop at five iterations for all the datasets, rather than running to convergence, because five iterations is enough to quantify the time differences between Hadoop and Spark.
Figure 5 shows the running time required for each dataset when the iteration number is 1. As we can see, Spark yields little improvement while the iteration number is small. Figures 6 and 7 show the results for PageRank on the same datasets with iteration numbers 2 and 3, respectively: the running time for Hadoop keeps growing, and Spark outperforms Hadoop on all of the datasets.
The graph in Figure 8 shows the running-time results for the benchmark when the iteration number is 5. From this figure we can see that Spark clearly performs better than the Hadoop MapReduce jobs.
We also tried to compare the frameworks on larger datasets, but found that while MapReduce can handle them, Spark cannot, because it does not get the memory it requires to run the benchmark on them.
Figure 5: Running time comparison, iter num=1
Figure 6: Running time comparison, iter num=2
Figure 7: Running time comparison, iter num=3
Figure 8: Running time comparison, iter num=5
When we ran web-Google, a large dataset (71.9 MB) with 875,713 nodes and 5,105,039 edges, we found that the running time for Spark is much higher than that of MapReduce; Figure 9 shows the corresponding graph. The rise in Spark's running time may be caused by virtualization overhead.
Figure 9: Running time for web-Google dataset
Figure 10: Console output of a Spark run
Figure 11: Console output with memory error in Spark
The two figures above (Figures 10 and 11) show typical console output from the Spark runs. Figure 10 shows the output of a successful run, including the total running time required. In Figure 11, we can see the "not enough space to cache partition rdd_x_y in memory" error for cit-Patents, a larger dataset (267.5 MB) with 3,774,768 nodes and 16,518,948 edges. This means that with our existing cluster configuration and its limited memory, Spark cannot run the PageRank benchmark on a bigger dataset.
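One way to work around this on a memory-constrained cluster (a sketch only; we did not evaluate it) is to raise the per-executor memory and let cached partitions spill to disk through Spark's persistence levels:

    from pyspark import SparkConf, SparkContext, StorageLevel

    # The memory value here is an assumption; tune it to the node size.
    conf = SparkConf().set("spark.executor.memory", "1500m")
    sc = SparkContext(conf=conf, appName="PageRank")

    links = sc.textFile("hdfs:///data/cit-Patents.txt") \
              .filter(lambda l: not l.startswith("#")) \
              .map(lambda l: tuple(l.split())).groupByKey()

    # MEMORY_AND_DISK spills partitions that do not fit in RAM to local
    # disk instead of dropping and recomputing them, at the cost of I/O.
    links.persist(StorageLevel.MEMORY_AND_DISK)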
Conclusion:
In this project, we worked with Hadoop MapReduce and Spark to compare their performance in terms of running time and memory consumption. We found that although Spark performs better on small datasets, MapReduce is much more efficient on large datasets even when memory is scarce. Spark needs sufficient memory to execute the benchmark correctly; without it, Spark can take far longer or even crash.
If speed is not a demanding requirement and abundant memory is not available, Spark is not the right choice. As long as there is enough disk space to accommodate the original dataset and the intermediate results, Hadoop MapReduce remains a good choice.
References:
[1] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in HotCloud, June 2010.
[2] L. Gu and H. Li, "Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark," in 2013 IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013.
[3] SNAP: http://snap.stanford.edu/data/
[4] http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
[5] http://stackoverflow.com/questions/24167194/why-is-the-spark-task-running-on-a-single-node