Stratosphere with big_data_analytics
1. Applying Stratosphere for Big Data Analytics
PRESENTED BY:
JVPS AVINASH (A20344397)
ASHOK DESHPANDE (A20334764)
2. Points of Discussion
Big Data and Hadoop
Map-Reduce Framework
Stratosphere and its Components
Stratosphere and its Architecture
Stratosphere and its Operators
Stratosphere vs Map-Reduce
Execution and Analysis
3. Big Data
Big Data is a collection of data sets so large and complex that they become difficult to process
using on-hand database management tools. The challenges include capture, storage, search, sharing,
analysis and visualization.
Problems :-
1) Large-Scale Data Storage
2) Large-Scale Data Analysis
Solution :-
Hadoop – HDFS - MapReduce
4. Hadoop Approach
Hadoop is a software framework for distributed processing of large datasets across large
clusters of computers.
Large datasets: terabytes or petabytes of data.
Large clusters: hundreds or thousands of nodes.
Hadoop is based on a simple programming model called MapReduce.
Hadoop = HDFS + Map/Reduce infrastructure.
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop
applications.
HDFS creates multiple replicas of data blocks and distributes them on computer nodes
throughout a cluster to enable reliable, extremely rapid computations.
An HDFS data block is usually 64 MB or 128 MB. Each block is replicated multiple times
(default = 3) and the replicas are stored on different data nodes.
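The block and replication figures above translate directly into a storage estimate. The following is a minimal sketch of that arithmetic; the function name `hdfs_footprint` and its parameters are hypothetical and not part of any Hadoop API.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw storage in MB) for a single file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times. A partial last block only
    # occupies as much space as the data it holds, so raw storage is simply
    # the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb


blocks, raw_mb = hdfs_footprint(1024)  # a 1 GB file with the defaults above
print(blocks, raw_mb)
```

With 128 MB blocks and three replicas, a 1 GB file is split into 8 blocks and occupies about 3 GB of raw cluster storage.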
MapReduce is a programming model for parallel data processing. Hadoop can run MapReduce
programs written in multiple languages, such as Java, Python, Ruby and C++.
5. Map Reduce - Example
MAP FUNCTION
• Operates on a set of (key, value) pairs.
• Map is applied in parallel to the input data set. It produces output keys and a list of values
for each key, depending on the functionality.
• Mapper output is partitioned per reducer; the number of partitions equals the number of reduce
tasks for that job.
REDUCE FUNCTION
• Operates on the set of (key, value) pairs from the Mapper.
• Reduce is then applied in parallel to each group, again producing a collection of
(key, value) pairs.
• The number of reducers can be set by the user in the job configuration.
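The map, partition and reduce steps described above can be sketched as a single-process word count. This is only an illustration of the programming model, not Hadoop code: `map_fn`, `reduce_fn`, `partition` and `run_job` are hypothetical names, and a CRC32 checksum stands in for a hash partitioner.

```python
import zlib
from collections import defaultdict


def map_fn(_key, line):
    """Map: one (key, value) pair in, any number of (key, value) pairs out."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: one key with the list of its values in, (key, value) pairs out."""
    yield word, sum(counts)


def partition(key, num_reducers):
    """Assign a key to a reducer (deterministic stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(records, num_reducers=2):
    # Map phase: apply map_fn to every record and route each output pair
    # to the partition of the reducer responsible for its key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            partitions[partition(out_key, num_reducers)][out_key].append(out_value)
    # Reduce phase: each reducer processes the groups of its own partition.
    result = {}
    for part in partitions:
        for key in sorted(part):
            for out_key, out_value in reduce_fn(key, part[key]):
                result[out_key] = out_value
    return result


lines = [(0, "big data needs big clusters"), (1, "data flows")]
word_counts = run_job(lines)
print(word_counts)
```

Because all values of one key end up in the same partition, the result does not depend on the number of partitions chosen for the job.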
6. StratoSphere
A massively parallel data processing system.
Extends MapReduce with more operators.
Support for advanced data flow graphs.
Compiler/Optimizer, Java/Scala Interface, YARN
Data Flow Composition
7. StratoSphere – Components
Meteor: End users specify data analysis tasks by writing Meteor queries.
Sopremo: The query is parsed into a Sopremo plan, which is a DAG (Directed Acyclic Graph) of
interconnected data processing operators.
PACT: A generalization of the MapReduce programming paradigm.
Nephele: Interprets data flow graphs and distributes tasks to the computation nodes.
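To make the idea of a plan as a DAG of interconnected operators concrete, here is a small sketch that orders a made-up plan so that every operator runs after the operators it reads from. The operator names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever scheduling the real system performs.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each operator maps to the set of operators it reads from.
plan = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

# static_order() yields every operator after all of its inputs; it raises
# graphlib.CycleError if the graph contains a cycle, i.e. is not a DAG.
order = list(TopologicalSorter(plan).static_order())
print(order)
```

The two sources have no dependency on each other, so their relative order is free; an execution engine could run such independent branches in parallel.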
8. StratoSphere – Architecture
(1) Users formulate a query that is parsed into a Sopremo plan.
(2) Imports the packages.
(3) Registers the discovered operators and predefined functions.
(4) Validates the script and translates it into a Sopremo plan.
(5) The plan is analyzed by the schema inferencer to obtain a global schema.
(6) Creates a consistent PACT plan.
9. StratoSphere - Operators
MAP (Record-at-a-Time, One Input)
• Accepts a single record as input.
• Emits any number of records.
• Applications: filters / transformations.
REDUCE (Group-at-a-Time, One Input)
• Groups the records of its input on the record key.
• Accepts a list of records as input.
• Emits any number of records.
• Applications: aggregations.
JOIN (Record-at-a-Time, Two Inputs)
• Joins both inputs on their record keys; non-matched records are discarded.
• Accepts one record of each input.
• Emits any number of records.
• Applications: equi-joins.
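The three operators above differ mainly in what the user function receives. The sketch below mimics those contracts on in-memory lists; `map_op`, `reduce_op` and `join_op` are hypothetical helper names rather than the Stratosphere API, and the sample data is invented.

```python
from collections import defaultdict


def map_op(records, fn):
    """Record-at-a-time: fn gets one record and returns any number of records."""
    for record in records:
        yield from fn(record)


def reduce_op(records, key, fn):
    """Group-at-a-time: fn gets a key and the list of records sharing that key."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    for group_key, group in groups.items():
        yield from fn(group_key, group)


def join_op(left, right, left_key, right_key, fn):
    """Equi-join: fn gets one record of each input whose keys are equal.
    Records without a matching partner are discarded."""
    index = defaultdict(list)
    for right_rec in right:
        index[right_key(right_rec)].append(right_rec)
    for left_rec in left:
        for right_rec in index.get(left_key(left_rec), []):
            yield from fn(left_rec, right_rec)


orders = [("alice", 30), ("bob", 5), ("alice", 12), ("carol", 8)]
cities = [("alice", "Chicago"), ("bob", "Pune")]

# Map as a filter: keep orders with an amount of at least 8.
large = list(map_op(orders, lambda o: [o] if o[1] >= 8 else []))
# Reduce as an aggregation: total amount per customer.
totals = list(reduce_op(large, key=lambda o: o[0],
                        fn=lambda k, group: [(k, sum(a for _, a in group))]))
# Join on the customer name; "carol" has no city and is dropped.
joined = list(join_op(totals, cities, lambda t: t[0], lambda c: c[0],
                      lambda t, c: [(t[0], c[1], t[1])]))
print(large, totals, joined)
```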
10. StratoSphere - Operators
CROSS (Record-at-a-Time, Two Inputs)
• Builds the Cartesian product of the records of both inputs.
• Accepts one record of each input.
• Emits any number of records.
• Very expensive operation.
CO-GROUP (Group-at-a-Time, Two Inputs)
• Groups the records of each input on the record key.
• Accepts one list of records for each input.
• Emits any number of records.
UNION (Record-at-a-Time, Two Inputs)
• Merges two or more input data sets into a single output data set.
• Follows bag semantics: duplicates are not removed.
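The two-input operators above can be sketched in the same spirit. The helper names `cross_op`, `cogroup_op` and `union_op` are hypothetical. One assumption is made explicit here: for a key that occurs in only one input, this CoGroup sketch calls the user function with an empty list for the other side.

```python
from collections import defaultdict
from itertools import chain, product


def cross_op(left, right, fn):
    """Cartesian product: fn is called for every pair of records."""
    for left_rec, right_rec in product(left, right):
        yield from fn(left_rec, right_rec)


def cogroup_op(left, right, left_key, right_key, fn):
    """Group both inputs by key; fn gets the key and one list per input.
    Assumption: a key missing from one input yields an empty list there."""
    left_groups, right_groups = defaultdict(list), defaultdict(list)
    for rec in left:
        left_groups[left_key(rec)].append(rec)
    for rec in right:
        right_groups[right_key(rec)].append(rec)
    for key in sorted(left_groups.keys() | right_groups.keys()):
        yield from fn(key, left_groups.get(key, []), right_groups.get(key, []))


def union_op(*inputs):
    """Bag union: concatenate the inputs and keep duplicates."""
    return chain(*inputs)


a = [("x", 1), ("y", 2)]
b = [("x", 10), ("x", 20), ("z", 30)]


def first(rec):
    return rec[0]


pairs = list(cross_op(a, b, lambda l, r: [(l[0], r[0])]))
sizes = list(cogroup_op(a, b, first, first,
                        lambda k, ls, rs: [(k, len(ls), len(rs))]))
merged = list(union_op(a, a))
print(len(pairs), sizes, merged)
```

The cross of 2 and 3 records already produces 6 pairs; output size grows with the product of the input sizes, which is why the operator is flagged as very expensive.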
11. Stratosphere vs Map-Reduce
Conclusions:
1) Most tasks do not fit the MapReduce model.
2) Very expensive: data always goes to disk and HDFS.
3) Tedious to implement.
12. Stratosphere vs Map-Reduce
Conclusions:
1) Joins do not fit the MapReduce model.
2) Time-consuming to implement.
3) Hard, manual optimization is necessary.
13. Stratosphere vs Map-Reduce
Map-Reduce: loop is outside the system
• Hard to program
• Very poor performance
Stratosphere: loop is inside the system
• Easy to program
• Huge performance gains
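One reason a loop inside the system can pay off is that the system can restrict each round to the records that actually changed. The sketch below illustrates that workset idea with a connected-components computation. It is a plain Python loop written for illustration, not Stratosphere's iteration operator, and `connected_components` is a hypothetical name.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Solution set: current component id of every vertex.
    solution = {v: v for v in vertices}
    # Workset: vertices whose id changed and still has to be propagated.
    workset = set(vertices)
    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]
                    next_workset.add(n)
        # Only changed vertices are revisited in the next round.
        workset = next_workset
    return solution


components = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
print(components)
```

Expressed as chained MapReduce jobs, each round would be a separate job driven by an external program that rereads the full data set; here the workset shrinks until nothing changes and the loop ends.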
14. Summary : Feature Matrix
Operators
• Map Reduce: Map, Reduce
• StratoSphere: Map, Reduce (multiple sort keys), Cross, Join, CoGroup, Union, Iterate, Iterate Delta
Composition
• Map Reduce: Only MapReduce
• StratoSphere: Arbitrary data flows
Data Exchange
• Map Reduce: Batch, through disk
• StratoSphere: Pipelined, in-memory (automatic spilling to disk)
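The difference between batch and pipelined data exchange can be illustrated with two processing stages in plain Python. In this sketch, materializing a list between the stages stands in for writing a complete intermediate result, while chained generators stand in for pipelining. All function names are hypothetical, and spilling to disk is not modeled.

```python
def parse(lines, log):
    """Stage 1: turn text into numbers, recording when each record is handled."""
    for line in lines:
        log.append(f"parse {line}")
        yield int(line)


def square(numbers, log):
    """Stage 2: square each number, recording when each record is handled."""
    for n in numbers:
        log.append(f"square {n}")
        yield n * n


def run_batch(lines):
    log = []
    parsed = list(parse(lines, log))     # stage 1 finishes completely first
    squared = list(square(parsed, log))  # stage 2 starts only afterwards
    return squared, log


def run_pipelined(lines):
    log = []
    # Stage 2 pulls records from stage 1 one at a time; no full intermediate
    # result is ever built between the two stages.
    squared = list(square(parse(lines, log), log))
    return squared, log


batch_result, batch_log = run_batch(["1", "2"])
pipe_result, pipe_log = run_pipelined(["1", "2"])
print(batch_log)
print(pipe_log)
```

Both variants compute the same result, but in the pipelined run each record passes through both stages before the next record is read.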