Report on the project
Big Data Processing: Performance Gain Through
In-Memory Computation
By
Group 4: David Holland, Joy Rahman, Prosunjit Biswas, Rehana Begam, Yang Zhou
Introduction:
The main objective of this project is to analyze the performance gain that in-memory computation brings to Big Data processing. We studied the Hadoop MapReduce and Spark frameworks, measured the execution time of a benchmark on both of them, and analyzed the results to quantify the performance gain achieved by Spark.
Background and Motivation:
The rapid development of the Internet has generated vast amounts of data that pose big challenges to traditional data processing models. To deal with these challenges, a variety of cluster computing frameworks have been proposed to support large-scale, data-intensive applications on commodity machines. MapReduce, introduced by Google, is one such successful framework for processing large data sets in a scalable, reliable and fault-tolerant manner.
Apache Hadoop provides an open-source implementation of MapReduce. It is a very popular general-purpose framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware, and it is used for many different classes of data-intensive applications. To process data, Hadoop MapReduce ships code to the nodes that hold the required data, and those nodes then process the data in parallel; this approach takes advantage of data locality. The term Hadoop often refers to the "Hadoop ecosystem", that is, Hadoop together with the additional software packages that can be installed on top of or alongside it, such as Pig, Hive, HBase, Spark and others.
Spark is an emerging compute engine for Hadoop data. It provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Spark is designed around a global cache mechanism and can achieve better response times because of its in-memory access across the distributed machines of the cluster.
Hadoop MapReduce is poorly suited to iterative operations because of the cost of reloading data from disk at each iteration. MapReduce cannot keep reused data and state information across iterations; it therefore reads the same data repeatedly and materializes intermediate results on local disks in each iteration, incurring many disk accesses, I/Os and unnecessary computations. Spark, on the other hand, achieves better execution times by caching intermediate data in memory for iterative operations. Most machine learning algorithms run over the same data set iteratively, and in MapReduce there is no easy way to share state and data from one iteration to the next. Spark is designed to overcome these shortcomings of MapReduce: through a data structure called Resilient Distributed Datasets (RDDs), it can effectively improve the performance of iterative jobs with low-latency requirements.
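To make this concrete, the minimal PySpark sketch below (our own illustration; the HDFS path and the placeholder per-iteration work are assumptions) shows the caching behaviour just described: the data is loaded and parsed once, and every later pass works on the in-memory copy instead of re-reading from disk.

```python
from pyspark import SparkContext

sc = SparkContext(appName="CacheExample")

# Load once from HDFS; cache() keeps the parsed records in cluster
# memory, so each subsequent action reuses them instead of re-reading.
edges = sc.textFile("hdfs:///data/edges.txt") \
          .map(lambda line: tuple(line.split())) \
          .cache()

# Every pass below operates on the cached copy of `edges`.
for i in range(5):
    count = edges.count()  # stands in for one step of an iterative job
    print("iteration", i, "edges:", count)
```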
In this project, we conducted a series of experiments to compare the performance of Hadoop MapReduce and Spark, using execution time as the performance metric. We chose a typical iterative algorithm, PageRank, and ran it on several real data sets on both frameworks.
Experimental Environment:
I. Cluster Architecture
The experimental cluster is composed of six computers: one designated as master, the other five as slaves. All of them run Ubuntu 12.04.2 (GNU/Linux 3.5.0-28-generic, x86_64). Table 1 shows the hostname, machine model, IP address, CPU and memory information of the computers. We use Hadoop 1.2.1 and Spark 1.1.0 for all the experiments. Figure 1 shows the overall testbed architecture of our system.
II. Dataset Description
We chose five real graph datasets, listed in Table 2, for the comparative experiments. They are all in edge-list format: each line in a file is a [src ID] [target ID] pair separated by whitespace. All of these graphs come from SNAP [3].
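As an illustration of this format (the IDs below are made up, not taken from the datasets), a line such as `0 11342` denotes an edge from node 0 to node 11342, and a plain Python reader for such files might look like:

```python
# Example edge-list lines (SNAP convention; IDs are illustrative):
#   0  11342
#   0  824020
# Comment lines in SNAP files start with '#'.

def parse_edges(path):
    """Yield (src, dst) integer pairs from an edge-list file."""
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            src, dst = line.split()
            yield int(src), int(dst)
```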
Figure 1: Testbed Architecture
Implementation:
I. Benchmark: PageRank
PageRank is an algorithm used by Google Search to rank websites in its search engine results. PageRank was named after Larry Page, one of the founders of Google. It measures the importance of web pages by counting the number and quality of links to a page, producing a rough estimate of how important the page is. The underlying assumption is that more important websites are likely to receive more links from other websites.
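For reference, one iteration of the standard PageRank update computes, for every page p (the report does not state the damping factor used; 0.85 is the common choice):

```latex
% d: damping factor (commonly 0.85), N: number of pages,
% In(p): pages linking to p, L(q): number of out-links of page q.
PR(p) = \frac{1 - d}{N} + d \sum_{q \in \mathrm{In}(p)} \frac{PR(q)}{L(q)}
```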
II. Execution Model for MapReduce
Figure 2 shows the steps for setting up HDFS for our input datasets. In the execution model for MapReduce, given the input files for the PageRank algorithm, we distribute the data over the Hadoop cluster and run three different MapReduce jobs on the data; Figure 3 gives a summary of the jobs, and a sketch of the per-iteration pattern follows the tables below.
Hostname        Machine     IP               CPU info   Memory
master          Hadoop-6    10.0.0.11        1          2 GB
slave0-slave4   Hadoop-2-5  10.0.0.2/4/5/9   1          2 GB

Table 1: Information of machines in the cluster

Name              File size  Nodes    Edges      Description
wiki-Vote         1.0 MB     7,115    103,689    Wikipedia who-votes-on-whom network
p2p-Gnutella31    10.8 MB    62,586   147,892    Gnutella peer-to-peer network from August 31, 2002
soc-Epinions1     5.531 MB   75,879   508,837    Who-trusts-whom network of Epinions.com
soc-Slashdot0811  10.496 MB  77,360   905,468    Slashdot social network from November 2008
web-Google        71.9 MB    875,713  5,105,039  Web graph from Google

Table 2: Graph Datasets
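To sketch the shape of one ranking iteration in MapReduce (a hedged illustration of the pattern, not the project's exact jobs; the input line format and damping factor are assumptions), a Hadoop Streaming mapper emits each node's rank contributions to its out-neighbours:

```python
#!/usr/bin/env python
# Streaming mapper for one PageRank iteration (illustrative sketch).
# Assumed input line: "<node>\t<current rank>\t<comma-separated out-links>"
import sys

for line in sys.stdin:
    node, rank, adjacency = line.rstrip("\n").split("\t")
    targets = adjacency.split(",") if adjacency else []
    # Pass the graph structure through so the reducer can re-emit it.
    print("%s\tGRAPH\t%s" % (node, adjacency))
    # Send an equal share of this node's rank to every out-neighbour.
    for t in targets:
        print("%s\tRANK\t%s" % (t, float(rank) / len(targets)))
```

and the matching reducer sums the contributions per node:

```python
#!/usr/bin/env python
# Streaming reducer: sums the contributions for each node and applies
# the damping factor (simplified update without the 1/N term).
import sys
from itertools import groupby

D = 0.85  # damping factor: an assumption, the report does not state it

def records(stream):
    for line in stream:
        node, kind, value = line.rstrip("\n").split("\t")
        yield node, kind, value

for node, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    adjacency, total = "", 0.0
    for _, kind, value in group:
        if kind == "GRAPH":
            adjacency = value
        else:
            total += float(value)
    print("%s\t%s\t%s" % (node, (1 - D) + D * total, adjacency))
```

Because each such job writes its full output back to HDFS, every additional iteration pays the disk round trip again, which is exactly the overhead Spark's caching avoids.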
III. Execution Model for Spark
In this case, we used the same datasets and HDFS configuration as for the MapReduce execution model, and rewrote the MapReduce jobs to take advantage of Spark's RDDs. A summary of this work is given in Figure 4.
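A condensed PySpark version of the ranking loop (a sketch in the spirit of Spark's bundled PageRank example, with an assumed HDFS path; dangling nodes and normalization are ignored for brevity) shows where the RDD advantage comes from: the links RDD is cached once and reused by every iteration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="SparkPageRank")

# Build (node, [out-neighbours]) adjacency lists once and cache them:
# every iteration below reuses the in-memory copy instead of HDFS.
lines = sc.textFile("hdfs:///data/wiki-Vote.txt")
links = (lines.filter(lambda l: not l.startswith("#"))
              .map(lambda l: tuple(l.split()))
              .groupByKey()
              .mapValues(list)
              .cache())
ranks = links.mapValues(lambda _: 1.0)

for _ in range(5):  # five iterations, matching our experiments
    contribs = links.join(ranks).flatMap(
        lambda nv: [(dst, nv[1][1] / len(nv[1][0])) for dst in nv[1][0]])
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
                     .mapValues(lambda s: 0.15 + 0.85 * s))

print(ranks.take(5))
```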
Figure 2: HDFS setup
Figure 3: Execution model for Hadoop
Figure 4: Execution model on Spark
Experimental Results:
For each dataset we ran PageRank with different numbers of iterations on Hadoop and Spark to see how the iteration count affects performance, recording the total running time for each dataset at each iteration count. We stopped at five iterations for all the datasets, rather than running to convergence, because five iterations is enough to quantify the time differences between Hadoop and Spark.
Figure 5 shows the running time required for each of the datasets when the iteration number is 1. As it shows, Spark achieves little improvement when the iteration number is small. Figures 6 and 7 show the results for PageRank on the same datasets when the iteration number is 2 and 3, respectively: Hadoop's running time keeps growing, and Spark outperforms Hadoop on all of the datasets.
Figure 8 plots the running times for the benchmark when the iteration number is 5; here too, Spark performs better than the Hadoop MapReduce jobs. We also tried to compare the frameworks on larger datasets, but found that while MapReduce can handle them, Spark cannot, because it does not get enough memory to run the benchmark on them.
Figure 5: Running time comparison, iter num=1
Figure 6: Running time comparison, iter num=2
Figure 7: Running time comparison, iter num=3
Figure 8: Running time comparison, iter num=5
For web-Google, a large dataset (71.9 MB) with 875,713 nodes and 5,105,039 edges, we found that the running time for Spark is much higher than that of MapReduce; Figure 9 shows the corresponding graph. The rise in Spark's running time may be caused by virtualization overhead.
Figure 9: Running time for web-Google dataset
Figure 10: Console output of a Spark run
Figure 11: Console output with memory error in Spark
The two figures above (Figures 10 and 11) show typical console output for the Spark runs. Figure 10 shows the output of a successful run, including the total running time required. In Figure 11, we see the "not enough space to cache partition rdd_x_y in memory" error for cit-Patents, a larger dataset (267.5 MB) with 3,774,768 nodes and 16,518,948 edges. With our existing cluster configuration and its limited memory, Spark cannot run the PageRank benchmark on a dataset this big.
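This error indicates that the storage memory Spark sets aside for cached partitions is exhausted. When more RAM is available, the relevant Spark 1.x knobs can be raised (the values below are illustrative, not our cluster's settings):

```python
from pyspark import SparkConf, SparkContext

# Illustrative Spark 1.x settings: larger executor heaps, and a larger
# share of each heap reserved for the RDD cache (default is 0.6).
conf = (SparkConf()
        .setAppName("SparkPageRank")
        .set("spark.executor.memory", "2g")
        .set("spark.storage.memoryFraction", "0.8"))
sc = SparkContext(conf=conf)
```

Alternatively, persisting the RDD with StorageLevel.MEMORY_AND_DISK lets partitions that do not fit in memory spill to disk instead of being dropped, trading some of the in-memory speedup for robustness.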
Conclusion:
In this project, we compared Hadoop MapReduce and Spark in terms of running time and memory consumption. We found that although Spark performs better for small datasets, MapReduce is much more efficient for large datasets, even with limited memory. Spark needs enough memory to execute the benchmark correctly; without it, Spark can take longer or even crash.
If speed is not a pressing requirement and abundant memory is not available, Spark is not the right choice. As long as there is enough disk space to accommodate the original dataset and the intermediate results, Hadoop MapReduce is a good choice.
References:
[1] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in Proc. HotCloud, June 2010.
[2] L. Gu and H. Li, "Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark," in Proc. 2013 IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013.
[3] SNAP: http://snap.stanford.edu/data/
[4] hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
[5] stackoverflow.com/questions/24167194/why-is-the-spark-task-running-on-a-single-node