SlideShare uma empresa Scribd logo
1 de 20
Applying Stratosphere for
Big Data Analytics
1
P R E S E N T E D B Y : -
J V P S A V I N A S H ( A 2 0 3 4 4 3 9 7 )
A S H O K D E S H P A N D E ( A 2 0 3 3 4 7 6 4 )
Points of Discussion
 Big Data and Hadoop
 Map-Reduce Framework
 Stratosphere and its Components
 Stratosphere and its Architecture
 Stratosphere and its Operators
 Stratosphere vs Map-Reduce
 Execution and Analysis
2
Big Data
 Big Data is a collection of large and complex data sets that it become difficult to process using
on-hand database management tools. The challenges include capture, storage, search, sharing,
analysis and visualization.
 Problems :-
1) Large-Scale Data Storage
2) Large-Scale Data Analysis
 Solution :-
Hadoop – HDFS - MapReduce
3
Hadoop Approach
 Hadoop is a software framework for distributed processing of large datasets across large
clusters of computers.
Large datasets  Terabytes or Petabytes of data
Large Clusters  hundreds or thousands of nodes
 Hadoop is based on simple programming model called MapReduce.
 Hadoop = HDFS + Map / Reduce Infrastructure .
 Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop
applications.
 HDFS creates multiple replicas of data blocks and distributes them on computer nodes
throughout a cluster to enable reliable, extremely rapid computations.
 HDFS Data Block is usually 64MB or 128MB. Each block is replicated multiple(default = 3)
times and stored on different data nodes.
 MapReduce is a programming model for parallel data processing. Hadoop can run map reduce
programs in multiple languages like Java , Python , Ruby and C++.
4
Map Reduce - Example
5
MAP FUNCTION REDUCE FUNCTION
Operates on set of (key , value) pairs. Operates on set of (key , value) pairs from Mapper.
Map is applied in parallel on input data set. This
produces output keys and list of values for each
key depending on the functionality.
Reduce is then applied in parallel to each group , again
producing a collection of key , values.
Mapper output are partitioned per reducer = No.
of reduce task for that job.
Reducers cannot be set by user.
StratoSphere
 A massively parallel data processing system.
 Extends MapReduce with more operators.
 Support for advanced data flow graphs.
 Compiler/Optimizer , Java/Scala Interface , YARN
 Data Flow Composition
6
StratoSphere – Components
7
Query is parsed into Sopremo Plan which is a DAG
(Directed Acyclic Graph) of interconnected data
processing operators
End Users Specify Data Analysis tasks by writing
Meteor Queries.
A Generalization of MapReduce programming
paradigm.
Interprets data flow graphs and distributes tasks to
the computation nodes.
StratoSphere – Architecture
(1)
Users formulate a query that
is parsed into Sopremo Plan
(2)
Imports the
packages .
(3)
Registers the
discovered operators
and predefined
functions
(4)
Validates the script
and translate it into
Sopremo Plan
(5)
The plan is analyzed by
the schema inferencer
to obtain a global
schema
(6)
Creates a
consistent
PACT plan .
8
StratoSphere - Operators
9
MAP
Record-at-a-
Time Accepts Single Record as Input ,
Emits Any number of Records ,
Applications :- Filters / Transformations
One Input
REDUCE
Group-at-a-
Time
Groups the record of its input on Record
Key.
Accepts A list of Records as Input ,
Emits Any number of Records ,
Applications :- Aggregations
One Input
JOIN
Record-at-a-
Time
Joins both inputs on their Record Keys
and non-matched records are discarded.
Accepts One Record of each Input ,
Emits Any number of Records ,
Applications :- Equi-Joins
Two Inputs
StratoSphere - Operators
10
CROSS
Record-at-a-
Time
Cartesian product of the records of
both inputs.
Accepts One Record of each Input ,
Emits Any number of Records ,
Very Expensive Operation.
Two Inputs
CO-
GROUP
Group-at-a-
Time
Groups the record of its input on Record
Key.
Accepts One list of Records for each
Input ,
Emits Any number of Records .
Two Inputs
UNION
Record-at-a-
Time
Merges two or more input data sets into
a single output data set.
Follows Bag Semantics.
Duplicates are not removed.
Two Inputs
vs
11
Conclusions :-
1) Most tasks do not fit the MapReduce model.
2) Very Expensive – Always go to disk and HDFS.
3) Tedious to implement .
vs
12
Conclusions :-
1) Joins do not fit the MapReduce model.
2) Time Consuming to implement .
3) Hard Optimization necessary .
vs
13
Loop is outside the system
• Hard to Program
• Very Poor Performance
Loop is inside the system
• Easy to Program
• Huge Performance Gains
Summary : Feature Matrix
Map Reduce StratoSphere
Operators
• Map
• Reduce
• Map
• Reduce (multiple sort keys)
• Cross
• Join
• CoGroup
• Union
• Iterate , Iterate Delta
Composition Only MapReduce Arbitrary Data Flows
Data Exchange Batch through disk
Pipe-lined , in-memory
(automatic spilling to disk)
14
Stratosphere - Web Log Analysis
15
Stratosphere Query Interface
Web Log Analysis – continued…..
Optimizer Query Plan
Web Log Analysis – continued…..
17
Job Submission
Web Log Analysis – continued…..
18
Dashboard – Running Jobs
19
Web Log Analysis – continued…..
Dashboard – Running Jobs
20
Dashboard –Job Plan
Web Log Analysis – continued…..

Mais conteúdo relacionado

Mais procurados (20)

Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop MapReduce joins
Hadoop MapReduce joinsHadoop MapReduce joins
Hadoop MapReduce joins
 
Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
Finalprojectpresentation
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Hadoop
HadoopHadoop
Hadoop
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 

Semelhante a Stratosphere with big_data_analytics

Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Generating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersGenerating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersBRNSSPublicationHubI
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combinersijcsit
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 

Semelhante a Stratosphere with big_data_analytics (20)

Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Generating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop ClustersGenerating Frequent Itemsets by RElim on Hadoop Clusters
Generating Frequent Itemsets by RElim on Hadoop Clusters
 
Unit 2
Unit 2Unit 2
Unit 2
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Data Science
Data ScienceData Science
Data Science
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 

Último

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

Stratosphere with big_data_analytics

  • 1. Applying Stratosphere for Big Data Analytics 1 P R E S E N T E D B Y : - J V P S A V I N A S H ( A 2 0 3 4 4 3 9 7 ) A S H O K D E S H P A N D E ( A 2 0 3 3 4 7 6 4 )
  • 2. Points of Discussion  Big Data and Hadoop  Map-Reduce Framework  Stratosphere and its Components  Stratosphere and its Architecture  Stratosphere and its Operators  Stratosphere vs Map-Reduce  Execution and Analysis 2
  • 3. Big Data  Big Data is a collection of large and complex data sets that it become difficult to process using on-hand database management tools. The challenges include capture, storage, search, sharing, analysis and visualization.  Problems :- 1) Large-Scale Data Storage 2) Large-Scale Data Analysis  Solution :- Hadoop – HDFS - MapReduce 3
  • 4. Hadoop Approach  Hadoop is a software framework for distributed processing of large datasets across large clusters of computers. Large datasets  Terabytes or Petabytes of data Large Clusters  hundreds or thousands of nodes  Hadoop is based on simple programming model called MapReduce.  Hadoop = HDFS + Map / Reduce Infrastructure .  Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.  HDFS creates multiple replicas of data blocks and distributes them on computer nodes throughout a cluster to enable reliable, extremely rapid computations.  HDFS Data Block is usually 64MB or 128MB. Each block is replicated multiple(default = 3) times and stored on different data nodes.  MapReduce is a programming model for parallel data processing. Hadoop can run map reduce programs in multiple languages like Java , Python , Ruby and C++. 4
  • 5. Map Reduce - Example 5 MAP FUNCTION REDUCE FUNCTION Operates on set of (key , value) pairs. Operates on set of (key , value) pairs from Mapper. Map is applied in parallel on input data set. This produces output keys and list of values for each key depending on the functionality. Reduce is then applied in parallel to each group , again producing a collection of key , values. Mapper output are partitioned per reducer = No. of reduce task for that job. Reducers cannot be set by user.
  • 6. StratoSphere  A massively parallel data processing system.  Extends MapReduce with more operators.  Support for advanced data flow graphs.  Compiler/Optimizer , Java/Scala Interface , YARN  Data Flow Composition 6
  • 7. StratoSphere – Components 7 Query is parsed into Sopremo Plan which is a DAG (Directed Acyclic Graph) of interconnected data processing operators End Users Specify Data Analysis tasks by writing Meteor Queries. A Generalization of MapReduce programming paradigm. Interprets data flow graphs and distributes tasks to the computation nodes.
  • 8. StratoSphere – Architecture (1) Users formulate a query that is parsed into Sopremo Plan (2) Imports the packages . (3) Registers the discovered operators and predefined functions (4) Validates the script and translate it into Sopremo Plan (5) The plan is analyzed by the schema inferencer to obtain a global schema (6) Creates a consistent PACT plan . 8
  • 9. StratoSphere - Operators 9 MAP Record-at-a- Time Accepts Single Record as Input , Emits Any number of Records , Applications :- Filters / Transformations One Input REDUCE Group-at-a- Time Groups the record of its input on Record Key. Accepts A list of Records as Input , Emits Any number of Records , Applications :- Aggregations One Input JOIN Record-at-a- Time Joins both inputs on their Record Keys and non-matched records are discarded. Accepts One Record of each Input , Emits Any number of Records , Applications :- Equi-Joins Two Inputs
  • 10. StratoSphere - Operators 10 CROSS Record-at-a- Time Cartesian product of the records of both inputs. Accepts One Record of each Input , Emits Any number of Records , Very Expensive Operation. Two Inputs CO- GROUP Group-at-a- Time Groups the record of its input on Record Key. Accepts One list of Records for each Input , Emits Any number of Records . Two Inputs UNION Record-at-a- Time Merges two or more input data sets into a single output data set. Follows Bag Semantics. Duplicates are not removed. Two Inputs
  • 11. vs 11 Conclusions :- 1) Most tasks do not fit the MapReduce model. 2) Very Expensive – Always go to disk and HDFS. 3) Tedious to implement .
  • 12. vs 12 Conclusions :- 1) Joins do not fit the MapReduce model. 2) Time Consuming to implement . 3) Hard Optimization necessary .
  • 13. vs 13 Loop is outside the system • Hard to Program • Very Poor Performance Loop is inside the system • Easy to Program • Huge Performance Gains
  • 14. Summary : Feature Matrix Map Reduce StratoSphere Operators • Map • Reduce • Map • Reduce (multiple sort keys) • Cross • Join • CoGroup • Union • Iterate , Iterate Delta Composition Only MapReduce Arbitrary Data Flows Data Exchange Batch through disk Pipe-lined , in-memory (automatic spilling to disk) 14
  • 15. Stratosphere - Web Log Analysis 15 Stratosphere Query Interface
  • 16. Web Log Analysis – continued….. Optimizer Query Plan
  • 17. Web Log Analysis – continued….. 17 Job Submission
  • 18. Web Log Analysis – continued….. 18 Dashboard – Running Jobs
  • 19. 19 Web Log Analysis – continued….. Dashboard – Running Jobs
  • 20. 20 Dashboard –Job Plan Web Log Analysis – continued…..