Apache Spark | NR
Apache Spark
What is Spark?
Spark is a cluster computing platform designed to be fast and general
purpose.
On the speed side, Spark extends the popular MapReduce model to
efficiently support more types of computations, including interactive
queries and stream processing. Speed is important in processing large
datasets, as it means the difference between exploring data
interactively and waiting minutes or hours. One of the main features
Spark offers for speed is the ability to run computations in memory, but
the system is also more efficient than MapReduce for complex
applications running on disk.
On the generality side, Spark is designed to cover a wide range of
workloads that previously required separate distributed systems,
including batch applications, iterative algorithms, interactive queries,
and streaming. By supporting these workloads in the same engine,
Spark makes it easy and inexpensive to combine different processing
types, which is often necessary in production data analysis pipelines. In
addition, it reduces the management burden of maintaining separate
tools.
Spark is designed to be highly accessible, offering simple APIs in Python,
Java, Scala, and SQL, and rich built-in libraries. It also integrates closely
with other Big Data tools. In particular, Spark can run in Hadoop
clusters and access any Hadoop data source, including Cassandra.
It is a general-purpose data processing engine, suitable for use in a wide
range of circumstances.
Interactive queries across large data sets, processing of streaming data
from sensors or financial systems, and machine learning tasks are the
workloads most frequently associated with Spark.
Developers can also use it to support other data processing tasks,
benefiting from Spark’s extensive set of developer libraries and APIs,
and its comprehensive support for languages such as Java, Python, R
and Scala.
Spark is often used alongside Hadoop's data storage module, HDFS, but it
can also integrate equally well with other popular data storage
subsystems such as HBase, Cassandra, MapR-DB, MongoDB and Amazon's S3.
What is the need for Spark?
There are many reasons to choose Spark, but three are key:
• Simplicity
Spark’s capabilities are accessible via a set of rich APIs, all designed
specifically for interacting quickly and easily with data at scale.
These APIs are well documented, and structured in a way that makes it
straightforward for data scientists and application developers to quickly
put Spark to work.
• Speed
Spark is designed for speed, operating both in memory and on
disk. In 2014, Spark was used to win the Daytona Gray Sort
benchmarking
challenge, processing 100 terabytes of data stored on solid-state
drives in just 23 minutes. The previous winner, which used Hadoop and a
different cluster configuration, took 72 minutes. This win was the
result of processing a static data set. Spark's performance can be even
greater when supporting interactive queries of data stored in memory,
with claims that Spark can be 100 times faster than Hadoop's
MapReduce in these situations.
• Support
Spark supports a range of programming languages, including
Java, Python, R, and Scala. Although often closely associated with
Hadoop’s underlying storage system, HDFS, Spark includes native
support for tight integration with a number of leading storage solutions
in the Hadoop ecosystem and beyond. Additionally, the Apache Spark
community is large, active, and international. A growing set of
commercial providers including Databricks, IBM, and all of the main
Hadoop vendors deliver comprehensive support for Spark-based
solutions.
Importance of Spark
Spark is a general-purpose data processing engine, an API-powered
toolkit
which data scientists and application developers incorporate into their
applications to rapidly query, analyze and transform data at scale.
Spark’s flexibility makes it well-suited to tackling a range of use cases,
and it is capable of handling several petabytes of data at a time,
distributed across a cluster of thousands of cooperating physical or
virtual servers. Typical use cases include:
• Stream processing:
From log files to sensor data, application developers increasingly have
to cope with “streams” of data. This data arrives in a steady stream,
often from multiple sources simultaneously. While it is certainly
feasible to allow these data streams to be stored on disk and analyzed
retrospectively, it can sometimes be sensible or important to process
and act upon the data as it arrives. Streams of data related to financial
transactions, for example, can be processed in real time to identify (and
refuse) potentially fraudulent transactions.
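The streaming idea above can be sketched in a few lines of plain Python: inspect each record as it arrives and act on it immediately, rather than storing everything and analyzing it later. The record fields and the amount threshold here are hypothetical, purely for illustration; a real system would use a streaming engine such as Spark Streaming.

```python
def is_suspicious(txn, limit=10_000):
    """Toy rule: flag any single transaction above a fixed amount."""
    return txn["amount"] > limit

def process_stream(transactions):
    """Act on each record as it arrives; yield (txn, accepted) decisions."""
    for txn in transactions:
        yield txn, not is_suspicious(txn)

# Two incoming transactions: the second exceeds the toy limit and is refused.
stream = [{"id": 1, "amount": 250}, {"id": 2, "amount": 50_000}]
decisions = {txn["id"]: ok for txn, ok in process_stream(stream)}
print(decisions)  # {1: True, 2: False}
```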
• Machine learning:
As data volumes grow, machine learning approaches become more
feasible and increasingly accurate. Software can be trained to identify
and act upon triggers within well-understood data sets before applying
the same solutions to new and unknown data. Spark’s ability to store
data in memory and rapidly run repeated queries makes it well suited
to training machine learning algorithms. Running broadly similar
queries again and again, at scale, significantly reduces the time
required to iterate through a set of possible solutions in order to find
the most efficient algorithms.
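The benefit described above can be illustrated with a toy example: the data is loaded once, then many "broadly similar queries" (here, scoring candidate parameters) run against the in-memory copy instead of re-reading from disk on every pass. The model and scoring rule are invented stand-ins, not Spark APIs.

```python
# Data held in memory once, reused by every iteration below.
data = [(x, 2 * x) for x in range(100)]

def loss(slope):
    """Squared error of the candidate model y = slope * x over the cached data."""
    return sum((y - slope * x) ** 2 for x, y in data)

# Iterate through a set of possible solutions; each pass reuses the cached data.
best = min(range(5), key=loss)
print(best)  # 2
```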
• Interactive analytics:
Rather than running pre-defined queries to create static dashboards of
sales or production line productivity or stock prices, business analysts
and data scientists increasingly want to explore their data by asking a
question, viewing the result, and then either altering the initial
question slightly or drilling deeper into results. This interactive query
process requires systems such as Spark that are able to respond and
adapt quickly.
• Data integration:
Data produced by different systems across a business is rarely clean or
consistent enough to simply and easily be combined for reporting or
analysis. Extract, transform, and load (ETL) processes are often used to
pull data from different systems, clean and standardize it, and then load
it into a separate system for analysis. Spark (and Hadoop)
are increasingly being used to reduce the cost and time required for
this ETL process.
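A minimal sketch of that ETL pattern in plain Python: pull records from two inconsistent "systems", clean and standardize them, then load them into one analysis-ready collection. The field names and formats are invented for illustration.

```python
# Two sources with inconsistent field names and formats.
system_a = [{"name": "  Alice ", "amount": "10.5"}]
system_b = [{"NAME": "BOB", "AMT": 7}]

def transform_a(rec):
    """Standardize a system-A record: trim names, parse amounts."""
    return {"name": rec["name"].strip().title(), "amount": float(rec["amount"])}

def transform_b(rec):
    """Standardize a system-B record to the same schema."""
    return {"name": rec["NAME"].title(), "amount": float(rec["AMT"])}

# Extract + transform each source, then load into a single consistent table.
warehouse = [transform_a(r) for r in system_a] + [transform_b(r) for r in system_b]
print(warehouse)
# [{'name': 'Alice', 'amount': 10.5}, {'name': 'Bob', 'amount': 7.0}]
```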
How to install Apache Spark?
Follow these simple steps to download Java, Spark, and Hadoop and get
them running on a laptop (in this case, one running Mac OS X). If you do
not currently have the Java JDK (version 7 or higher) installed,
download it and follow the steps to install it for your operating system.
Visit the Spark downloads page, select a pre-built package, and
download Spark. Double-click the archive file to expand its contents
ready for use.
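For those working from a terminal instead of double-clicking, the steps above look roughly like the following. The archive name depends on the exact version you downloaded; the 3.x build shown here is just a placeholder.

```shell
# Confirm the Java JDK is installed and meets the minimum version.
java -version

# Expand the downloaded Spark archive and enter the new directory.
tar -xzf spark-3.5.0-bin-hadoop3.tgz
cd spark-3.5.0-bin-hadoop3
```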
Testing Spark
Open a text console, and navigate to the newly created directory. Start
Spark’s interactive shell:
./bin/spark-shell
A series of messages will scroll past as Spark and Hadoop are
configured.
Once the scrolling stops, you will see a simple prompt.
At this prompt, let's create some data: a simple sequence of numbers
from 1 to 50,000.
val data = 1 to 50000
Now, let’s place these 50,000 numbers into a Resilient Distributed
Dataset (RDD) which we’ll call sparkSample. It is this RDD upon which
Spark can perform analysis.
val sparkSample = sc.parallelize(data)
Now we can filter the data in the RDD to find any values less than 10.
sparkSample.filter(_ < 10).collect()
Spark should report the result: an array containing the values less
than 10, i.e. 1 through 9.
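The same computation can be sketched in plain Python, which makes the expected result easy to verify; this mirrors what Spark's filter and collect return for the session above.

```python
# val data = 1 to 50000
data = range(1, 50001)

# sparkSample.filter(_ < 10).collect()
sample = [x for x in data if x < 10]
print(sample)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```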
Difference between Spark and Hadoop
1. Map side of Hadoop MapReduce (see left side of the image above):
• Each map task outputs its data as key-value pairs.
• The output is stored in a CIRCULAR BUFFER instead of being written to disk.
• The circular buffer is around 100 MB in size. When it is 80% full (the default), the data is spilled to disk; these spills are called shuffle spill files.
• Many map tasks run on a particular node, so many spill files are created. Hadoop merges all the spill files on a node into one big file, which is SORTED and PARTITIONED based on the number of reducers.
Map side of Spark (see right side of the image above):
• Initial design:
o The output of the map side is written to the OS BUFFER CACHE.
o The operating system decides whether the data can stay in the OS buffer cache or should be spilled to DISK.
o Each map task creates as many shuffle spill files as there are reducers.
o Spark does not merge and partition the shuffle spill files into one big file, as Apache Hadoop does.
o Example: if there are 6000 (R) reducers and 2000 (M) map tasks, there will be M*R = 2000*6000 = 12 million shuffle files, because in Spark each map task creates as many shuffle spill files as there are reducers. This caused performance degradation.
o This was the initial design of Apache Spark.
• Shuffle file consolidation:
o Same as above, except that the map tasks running on the same core are consolidated into a single file, so each CORE outputs as many shuffle files as there are reducers.
o Example: if there are 6000 (R) reducers and 4 (C) cores, the number of shuffle files will be R*C = 6000*4 = 24,000. Note the huge reduction in the number of shuffle files.
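The shuffle-file arithmetic from the two examples above can be checked directly:

```python
# Map tasks, reducers, and cores from the examples in the text.
M, R, C = 2000, 6000, 4

initial_design = M * R   # one spill file per (map task, reducer) pair
consolidated = C * R     # one spill file per (core, reducer) pair

print(initial_design)  # 12000000 (12 million)
print(consolidated)    # 24000
```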
2. Reduce side of Hadoop MR:
o PUSHES the intermediate files (shuffle files) created at the map side, and the data is loaded into memory.
o If the buffer reaches 70% of its limit, the data is spilled to disk.
o The spills are then merged to form bigger files.
o Finally, the reduce method is invoked.
Reduce side of Apache Spark:
o PULLS the intermediate files (shuffle files) to the reduce side.
o The data is written directly to memory.
o If the data doesn't fit in memory, it is spilled to disk (from Spark 0.9 onwards; before that, an OOM (out of memory) exception would be thrown).
o Finally, the reducer functionality is invoked.
3.
Hadoop is essentially a distributed data infrastructure: It distributes massive data
collections across multiple nodes within a cluster of commodity servers, which means
you don't need to buy and maintain expensive custom hardware. It also indexes and
keeps track of that data, enabling big-data processing and analytics far more effectively
than was possible previously.
Spark, on the other hand, is a data-processing tool that operates on those distributed
data collections; it doesn't do distributed storage.
4. Processing Data (diagram in the original slides).
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 


module, HDFS, but can also integrate equally well with other popular data storage subsystems such as HBase, Cassandra, MapR-DB, MongoDB, and Amazon's S3.
What is the need for Spark?

There are many reasons to choose Spark, but three are key:

• Simplicity: Spark's capabilities are accessible via a set of rich APIs, all designed specifically for interacting quickly and easily with data at scale. These APIs are well documented and structured in a way that makes it straightforward for data scientists and application developers to put Spark to work quickly.

• Speed: Spark is designed for speed, operating both in memory and on disk. In 2014, Spark was used to win the Daytona GraySort benchmarking challenge, processing 100 terabytes of data stored on solid-state drives in just 23 minutes. The previous winner, which used Hadoop and a different cluster configuration, took 72 minutes. This win came from processing a static data set; Spark's performance can be even greater when supporting interactive queries of data stored in memory, with claims that Spark can be 100 times faster than Hadoop's MapReduce in these situations.

• Support: Spark supports a range of programming languages, including Java, Python, R, and Scala. Although often closely associated with Hadoop's underlying storage system, HDFS, Spark includes native support for tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond. Additionally, the Apache Spark community is large, active, and international, and a growing set of commercial providers, including Databricks, IBM, and all of the main Hadoop vendors, deliver comprehensive support for Spark-based solutions.
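The benchmark figures quoted above can be turned into rough throughput numbers. A quick back-of-the-envelope sketch (using only the 100 TB, 23-minute, and 72-minute figures from the text; note the two runs used different cluster configurations, so this is a wall-clock comparison, not a controlled one):

```python
# Rough throughput comparison for the Daytona GraySort figures quoted above.
DATA_TB = 100      # terabytes sorted
SPARK_MIN = 23     # Spark's winning time, minutes
HADOOP_MIN = 72    # previous (Hadoop-based) winner's time, minutes

spark_tb_per_min = DATA_TB / SPARK_MIN     # ~4.35 TB/min
hadoop_tb_per_min = DATA_TB / HADOOP_MIN   # ~1.39 TB/min
speedup = HADOOP_MIN / SPARK_MIN           # ~3.1x faster wall-clock

print(round(spark_tb_per_min, 2))   # 4.35
print(round(hadoop_tb_per_min, 2))  # 1.39
print(round(speedup, 1))            # 3.1
```

So the on-disk benchmark win is roughly a 3x speedup; the "100 times faster" claim applies specifically to interactive, in-memory workloads.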
Importance of Spark

Spark is a general-purpose data processing engine: an API-powered toolkit which data scientists and application developers incorporate into their applications to rapidly query, analyze, and transform data at scale. Spark's flexibility makes it well suited to tackling a range of use cases, and it is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. Typical use cases include:

• Stream processing: From log files to sensor data, application developers increasingly have to cope with "streams" of data. This data arrives in a steady flow, often from multiple sources simultaneously. While it is certainly feasible to store these data streams on disk and analyze them retrospectively, it can sometimes be sensible or important to process and act upon the data as it arrives. Streams of data related to financial transactions, for example, can be processed in real time to identify, and refuse, potentially fraudulent transactions.
• Machine learning: As data volumes grow, machine learning approaches become more feasible and increasingly accurate. Software can be trained to identify and act upon triggers within well-understood data sets before applying the same solutions to new and unknown data. Spark's ability to store data in memory and rapidly run repeated queries makes it well suited to training machine learning algorithms. Running broadly similar queries again and again, at scale, significantly reduces the time required to iterate through a set of possible solutions in order to find the most efficient algorithms.

• Interactive analytics: Rather than running pre-defined queries to create static dashboards of sales, production-line productivity, or stock prices, business analysts and data scientists increasingly want to explore their data by asking a question, viewing the result, and then either altering the initial question slightly or drilling deeper into the results. This interactive query process requires systems such as Spark that are able to respond and adapt quickly.

• Data integration: Data produced by different systems across a business is rarely clean or consistent enough to be combined simply and easily for reporting or analysis. Extract, transform, and load (ETL) processes are often used to pull data from different systems, clean and standardize it, and then load it into a separate system for analysis. Spark (and Hadoop) are increasingly being used to reduce the cost and time required for this ETL process.
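The ETL pattern described above can be sketched with a toy example. This is plain Python, not Spark code (the row format, field names, and validation rule are all invented for illustration); a real Spark job would express the same extract, transform, and load steps as RDD or DataFrame transformations distributed across a cluster:

```python
# Minimal extract-transform-load sketch in plain Python (illustrative only).
# Extract: raw rows as they might arrive from different source systems.
raw_rows = [
    " alice ,  31 ",        # messy whitespace
    "BOB,29",               # inconsistent casing
    "carol,not_a_number",   # bad record that should be dropped
]

def transform(row):
    """Clean and standardize one record, or return None to drop it."""
    name, age = (field.strip() for field in row.split(","))
    if not age.isdigit():
        return None                      # validation: drop non-numeric ages
    return {"name": name.lower(), "age": int(age)}

# Load: collect the cleaned records for the downstream analysis system.
clean = [rec for rec in map(transform, raw_rows) if rec is not None]
print(clean)  # [{'name': 'alice', 'age': 31}, {'name': 'bob', 'age': 29}]
```

The same clean/standardize/filter logic, applied record by record, is exactly the kind of work Spark parallelizes when the input is terabytes rather than three rows.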
How to install Apache Spark?

Follow these simple steps to download Java, Spark, and Hadoop and get them running on a laptop (in this case, one running Mac OS X):

1. If you do not currently have the Java JDK (version 7 or higher) installed, download it and follow the steps to install it for your operating system.
2. Visit the Spark downloads page, select a pre-built package, and download Spark.
3. Double-click the archive file to expand its contents, ready for use.
Testing Spark

Open a text console and navigate to the newly created directory. Start Spark's interactive shell:

./bin/spark-shell

A series of messages will scroll past as Spark and Hadoop are configured. Once the scrolling stops, you will see a simple prompt.
At this prompt, let's create some data: a simple sequence of numbers from 1 to 50,000.

val data = 1 to 50000

Now, let's place these 50,000 numbers into a Resilient Distributed Dataset (RDD), which we'll call sparkSample. It is this RDD upon which Spark can perform analysis.

val sparkSample = sc.parallelize(data)
Now we can filter the data in the RDD to find any values less than 10.

sparkSample.filter(_ < 10).collect()

Spark should report the result: an array containing all values less than 10.
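To see what the shell session above actually computes, here is a plain-Python analogue of the same three steps (no Spark required; a Python list stands in for the RDD, so the distribution across a cluster is the one thing this sketch does not show):

```python
# Plain-Python analogue of the spark-shell session above:
#   val data = 1 to 50000
#   val sparkSample = sc.parallelize(data)
#   sparkSample.filter(_ < 10).collect()
data = range(1, 50001)                         # 1 to 50000, inclusive
spark_sample = list(data)                      # stand-in for sc.parallelize(data)
result = [x for x in spark_sample if x < 10]   # filter(_ < 10) then collect()
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In real Spark, `filter` is lazy and merely records the transformation; only `collect()` triggers the distributed computation and brings the results back to the driver.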
Difference between Spark and Hadoop

1. Map side of Hadoop MapReduce (see the left side of the image above):

• Each map task outputs its data as key-value pairs.
• The output is stored in a circular buffer instead of being written to disk.
• The size of the circular buffer is around 100 MB. When the circular buffer is 80% full (the default threshold), the data is spilled to disk; these files are called shuffle spill files.
• Many map tasks run on a particular node, so many spill files are created. Hadoop merges all the spill files on a node into one big file, which is sorted and partitioned based on the number of reducers.
Map side of Spark (see the right side of the image above):

• Initial design:
  o The output of the map side is written to the OS buffer cache.
  o The operating system decides whether the data can stay in the OS buffer cache or should be spilled to disk.
  o Each map task creates as many shuffle spill files as there are reducers.
  o Spark does not merge and partition the shuffle spill files into one big file, as Apache Hadoop does.
  o Example: with 2,000 map tasks (M) and 6,000 reducers (R), there will be M * R = 2000 * 6000 = 12 million shuffle files, because each map task creates one spill file per reducer. This caused performance degradation.
  o This was the initial design of Apache Spark.

• Shuffle file consolidation:
  o Same as above, with one difference: map tasks that run on the same core write into a consolidated set of files, so each core outputs as many shuffle files as there are reducers.
  o Example: with 6,000 reducers (R) and 4 cores (C), the number of shuffle files is R * C = 6000 * 4 = 24,000. Note the huge reduction in the number of shuffle files.
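The two examples above come down to simple counting arithmetic; a quick sketch with the figures from the text (2,000 map tasks, 6,000 reducers, 4 cores):

```python
# Shuffle-file counts for the two Spark shuffle designs described above.
M = 2000   # map tasks
R = 6000   # reducers
C = 4      # cores per the consolidation example

# Initial design: one file per (map task, reducer) pair.
initial_design = M * R
print(initial_design)  # 12000000 -> 12 million shuffle files

# Shuffle file consolidation: one file per (core, reducer) pair.
consolidated = C * R
print(consolidated)    # 24000
```

The key point: consolidation replaces the per-map-task factor M with the much smaller per-node factor C, so the file count no longer grows with the number of map tasks.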
2. Reduce side of Hadoop MapReduce:

• The intermediate files (shuffle files) created at the map side are PUSHED to the reduce side, and the data is loaded into memory.
• If the buffer reaches 70% of its limit, the data is spilled to disk.
• The spills are then merged to form bigger files.
• Finally, the reduce method is invoked.
Reduce side of Apache Spark:

• The reduce side PULLS the intermediate files (shuffle files) from the map side.
• The data is written directly to memory.
• If the data doesn't fit in memory, it is spilled to disk from Spark 0.9 onwards; before that, an OOM (out of memory) exception would be thrown.
• Finally, the reducer functionality is invoked.

3. Hadoop is essentially a distributed data infrastructure: it distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don't need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was previously possible. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it does not do distributed storage.

4. Processing Data
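The map and reduce sides described in points 1 and 2 can be sketched end to end with a toy in-memory word count. This is plain Python with no Hadoop or Spark; the "shuffle" here is a simple dictionary grouping rather than spill files pushed or pulled over the network:

```python
from collections import defaultdict

# Toy in-memory MapReduce word count illustrating the
# map -> shuffle -> reduce flow described above.
documents = ["spark is fast", "hadoop is storage", "spark is general"]

# Map side: each "map task" emits (word, 1) key-value pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (stand-in for the sorted, partitioned
# spill files a real framework would produce and transfer).
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce side: sum the grouped counts for each key.
counts = {word: sum(vals) for word, vals in shuffled.items()}
print(counts["spark"], counts["is"])  # 2 3
```

Everything Hadoop and Spark add on top of this skeleton (circular buffers, spill files, consolidation, push versus pull) is machinery for doing the same grouping when the data no longer fits on one machine.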