HADOOP
VS.
APACHE SPARK
Hadoop and Spark are popular Apache projects in the big data
ecosystem.
Apache Spark is an open-source platform designed to improve on the
original Hadoop MapReduce processing component of the Hadoop ecosystem.
 The Apache Software Foundation developed the Hadoop project as
open-source software for reliable, scalable, distributed computing.
 It is a framework that allows distributed processing of large
datasets across clusters of computers using simple
programming models.
 Hadoop scales easily from a single machine to clusters of many
machines, each offering local storage and computation.
 The Hadoop libraries are designed to detect and handle failures
at the application layer, so the framework itself delivers fault
tolerance on top of commodity hardware.
The project includes these modules:
 Hadoop Common: Java libraries and utilities required by the other
Hadoop modules. These libraries provide OS-level and filesystem
abstractions and contain the Java files and scripts needed to
start and run Hadoop.
 Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
 Hadoop YARN: A framework for job scheduling and cluster
resource management.
 Hadoop MapReduce: A YARN-based system for parallel
processing of large datasets.
Hadoop MapReduce, HDFS and YARN provide a scalable, fault-tolerant and
distributed platform for storing and processing very large datasets across clusters
of commodity computers. Hadoop uses the same set of nodes for data storage and
for computation, which improves the performance of large-scale jobs by
co-locating computation with the data it operates on.
Hadoop Distributed File System – HDFS
HDFS is a distributed filesystem designed to store large volumes of
data reliably.
HDFS splits a single large file into blocks and distributes them across
a cluster of commodity machines.
HDFS overlays the existing local filesystem on each node. Data is stored in
fixed-size blocks, with a default block size of 128 MB.
HDFS also stores redundant copies of these data blocks on multiple nodes
to ensure reliability and fault tolerance, making it a distributed,
reliable and scalable file system.
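The block placement just described can be sketched in plain Python. This is a simplified model for illustration only, not the HDFS implementation; the node names, round-robin placement, and `place_blocks` helper are all hypothetical, though the 128 MB block size and replication factor of 3 are HDFS defaults.

```python
# Simplified sketch of HDFS-style block placement (illustrative only).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, nodes):
    """Return a mapping of block index -> nodes holding a replica."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Round-robin placement for the sketch; real HDFS is rack-aware.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

layout = place_blocks(file_size=500 * 1024 * 1024, nodes=["n1", "n2", "n3", "n4"])
# A 500 MB file needs 4 blocks, each replicated on 3 of the 4 nodes.
```

Losing any single node leaves at least two replicas of every block, which is why the cluster can keep serving the file during failures.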
Hadoop YARN
YARN (Yet Another Resource Negotiator), a central component in the
Hadoop ecosystem, is a framework for job scheduling and cluster resource
management. The basic idea of YARN is to split the functionalities of
resource management and job scheduling/monitoring into separate
daemons.
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and
generating large datasets with a parallel, distributed algorithm on a cluster. The Mapper
transforms each input key/value pair into a set of intermediate pairs; the Reducer merges
the intermediate pairs that share a key and outputs the required values. Mappers run in
parallel across the cluster nodes, and Reducers run on whichever nodes YARN assigns.
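The map → shuffle/sort → reduce pipeline can be sketched in plain Python on a single machine. This is a toy word count to show the model only; a real Hadoop job would distribute the mappers and reducers across nodes.

```python
from itertools import groupby
from operator import itemgetter

# Toy sketch of the MapReduce model: map -> shuffle/sort -> reduce.

def mapper(line):
    # Emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Merge all intermediate values that share a key.
    return (word, sum(counts))

lines = ["spark and hadoop", "hadoop mapreduce", "spark streaming"]
# The framework's shuffle phase groups intermediate pairs by key;
# here a sort stands in for it.
intermediate = sorted(pair for line in lines for pair in mapper(line))
result = dict(reducer(word, [c for _, c in group])
              for word, group in groupby(intermediate, key=itemgetter(0)))
# result == {"and": 1, "hadoop": 2, "mapreduce": 1, "spark": 2, "streaming": 1}
```

Because mappers only see one record at a time and reducers only see one key at a time, both stages parallelize naturally across a cluster.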
Apache Spark
 It is a framework for performing data analytics on a distributed
computing cluster.
 It provides in-memory computation for faster data processing
than MapReduce.
 It utilizes the Hadoop Distributed File System (HDFS) and can run
on top of an existing Hadoop cluster.
 It can process both structured data in Hive and streaming
data from sources such as HDFS, Flume, Kafka, and
Twitter.
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant processing of live data streams.
Input data can be ingested from many sources, such as TCP sockets, Flume,
and Kafka, and can be processed using complex algorithms expressed with
high-level functions like map, reduce, and join. Finally, the processed data
can be pushed out to filesystems (HDFS), databases, and live dashboards.
We can also apply Spark’s graph processing algorithms and machine learning
to data streams.
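Spark Streaming processes a live stream as a sequence of small time-sliced batches (micro-batches). The sketch below imitates that idea in plain Python, with batch size standing in for the batch interval; the `micro_batches` helper and the sample stream are illustrative assumptions, not Spark API.

```python
from collections import Counter

# Simplified sketch of micro-batching: incoming records are grouped into
# small batches, and each batch is processed with the same batch-style logic.

def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

live_stream = iter(["ok", "err", "ok", "ok", "err", "ok", "ok"])
running = Counter()
for batch in micro_batches(live_stream, batch_size=3):
    running.update(batch)  # e.g. push updated counts to a live dashboard
# running == Counter({"ok": 5, "err": 2})
```

Treating the stream as a series of small batches is what lets Spark Streaming reuse the core batch engine, and with it Spark's fault tolerance and high-level operators.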
Spark SQL
 Apache Spark provides a separate module, Spark SQL, for processing
structured data.
 Unlike the basic RDD API, the interfaces provided by Spark SQL give
Spark more information about the structure of the data and the
computation being performed.
 Internally, Spark SQL uses this additional information to perform extra
optimizations.
Datasets and DataFrames
 A Dataset is a distributed collection of data in Apache Spark. It
provides the benefits of RDDs while also using Spark SQL’s
optimized execution engine.
 A Dataset can be constructed from JVM objects and then manipulated
using functional transformations.
 A DataFrame is a Dataset organized into named columns. It is
conceptually equivalent to a table in a relational database or an
R/Python data frame, but with richer optimizations under the hood.
 A DataFrame can be constructed from various data sources, such as
structured data files, Hive tables, external databases, or existing
RDDs.
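The difference between a row-oriented RDD and a DataFrame's named columns can be sketched without Spark at all. The tiny example below is purely illustrative: positional tuples versus a column-keyed structure, where knowing column names is what lets an engine like Spark SQL optimize which data a query actually touches.

```python
# Toy sketch: row (RDD-like) view vs. named-column (DataFrame-like) view.
rows = [("alice", 34), ("bob", 45), ("carol", 29)]  # tuples: positional access only

columns = {
    "name": [r[0] for r in rows],  # named columns: the structure is explicit
    "age":  [r[1] for r in rows],
}

# With named columns, a query such as "select name where age > 30"
# is self-describing and only needs the columns it mentions.
over_30 = [n for n, a in zip(columns["name"], columns["age"]) if a > 30]
# over_30 == ["alice", "bob"]
```

This extra schema information is exactly what Spark SQL exploits for its optimizations, such as reading only the required columns from columnar file formats.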
Resilient Distributed Datasets (RDDs)
Spark’s core abstraction is the resilient distributed dataset (RDD), a
fault-tolerant collection of elements that can be operated on in parallel.
RDDs can be created in two ways: by parallelizing an existing collection
in the driver program, or by referencing a dataset in an external storage
system, such as a shared filesystem, HDFS, or HBase.
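The first creation route, parallelizing a collection, can be sketched in plain Python: the driver splits the data into partitions, and each partition can then be processed independently, much as workers would run `rdd.map(...)` on their own partitions. The `parallelize` helper here is a hypothetical stand-in, not the Spark API.

```python
# Minimal sketch of "parallelizing an existing collection" (illustrative only).
def parallelize(data, num_partitions):
    # Deal elements round-robin into partitions; Spark slices by range instead.
    return [data[i::num_partitions] for i in range(num_partitions)]

partitions = parallelize(list(range(10)), num_partitions=3)
# partitions == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

# Each partition can be transformed in parallel, like rdd.map(lambda x: x * 2):
doubled = [[x * 2 for x in part] for part in partitions]
```

The partition is the unit of parallelism: the more partitions, the more tasks Spark can schedule concurrently, and a lost partition can be recomputed from its lineage rather than restored from a replica.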
Here is a quick comparison before concluding.

Difficulty
Hadoop: MapReduce is difficult to program and needs additional abstractions.
Apache Spark: Spark is easy to program and does not require any additional abstractions.

Interactive mode
Hadoop: There is no built-in interactive mode, except via tools such as Pig and Hive.
Apache Spark: Spark has an interactive mode.

Streaming
Hadoop: Hadoop MapReduce can only process batches of stored data.
Apache Spark: Spark can process data in real time through Spark Streaming.

Performance
Hadoop: MapReduce does not leverage the memory of the Hadoop cluster to the maximum.
Apache Spark: Spark has been reported to execute batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.

Latency
Hadoop: MapReduce is completely disk-oriented.
Apache Spark: Spark achieves lower-latency computations by caching partial results in the memory of its distributed workers.

Ease of coding
Hadoop: Writing Hadoop MapReduce pipelines is a complex and lengthy process.
Apache Spark: Spark code is typically far more compact.
CONTACT US
Write to us: business@altencalsoftlabs.com
Visit Our Website: https://www.altencalsoftlabs.com
USA | FRANCE | UK | INDIA | SINGAPORE