Stratosphere:
System Overview
Robert Metzger
mail@robertmetzger.de
Twitter: @rmetzger_
Big Data Beers Meetup, Nov. 19th, 2013
Stratosphere
… is a distributed data processing engine
… automatically handles parallelization
… brings database technology to the world of big data
Overview
● Extends MapReduce with more operators
○ Known from Hadoop: map, reduce
○ New in Stratosphere: cross, join, cogroup
● Support for advanced data flow graphs
[Figure: a Map (M) → Reduce (R) flow labeled "Known from Hadoop" next to a larger graph of Map (M), Reduce (R), and Join (J) operators labeled "New in Stratosphere"]
● Compiler/Optimizer, Java/Scala Interface, YARN
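To make the additional operators concrete, here is a minimal single-machine sketch in plain Java of what cross, join, and cogroup compute. It does not use the Stratosphere API: the class name OperatorSketch, the int[]{key, value} record layout, and the string output format are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not Stratosphere code. Records are int[]{key, value}.
final class OperatorSketch {

    // cross: pair every left record with every right record (Cartesian product).
    static List<String> cross(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left)
            for (int[] r : right)
                out.add(l[1] + "x" + r[1]);
        return out;
    }

    // join: pair only the records whose keys are equal.
    static List<String> join(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left)
            for (int[] r : right)
                if (l[0] == r[0]) out.add(l[0] + ":" + l[1] + "," + r[1]);
        return out;
    }

    // cogroup: for each key, emit the complete group of values from both sides.
    static List<String> coGroup(List<int[]> left, List<int[]> right) {
        Map<Integer, List<Integer>> l = groupByKey(left);
        Map<Integer, List<Integer>> r = groupByKey(right);
        TreeSet<Integer> keys = new TreeSet<>(l.keySet());
        keys.addAll(r.keySet());
        List<String> out = new ArrayList<>();
        for (Integer k : keys)
            out.add(k + ":" + l.getOrDefault(k, List.of()) + r.getOrDefault(k, List.of()));
        return out;
    }

    private static Map<Integer, List<Integer>> groupByKey(List<int[]> records) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int[] rec : records)
            groups.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> left = List.of(new int[]{1, 10}, new int[]{2, 20});
        List<int[]> right = List.of(new int[]{1, 7}, new int[]{3, 9});
        System.out.println(cross(left, right));   // [10x7, 10x9, 20x7, 20x9]
        System.out.println(join(left, right));    // [1:10,7]
        System.out.println(coGroup(left, right)); // [1:[10][7], 2:[20][], 3:[][9]]
    }
}
```

Unlike join, cogroup also sees keys that exist on only one side, which is why keys 2 and 3 appear with an empty group.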
Stratosphere System Stack
[Figure: system stack; boxes listed in the order they appear]
● Front ends: Java API, Scala API, Meteor, ..., Hive
● Stratosphere Optimizer
● Stratosphere Runtime
● Hadoop MR
● Cluster Manager: YARN, Direct, EC2
● Storage: Local Files, HDFS, S3, ...
Stratosphere in a Cluster
● Operators are executed over the whole cluster
● Side by side with Hadoop
● Scales by adding more nodes
● Support for YARN is in development
● We have a LocalExecutor
[Figure: job submission to a Master Node running the JobManager (Resource Mgmt, Compiler, Web Interface); 4 Worker Nodes, each cluster node running a Stratosphere TaskManager next to a Hadoop DataNode]
1. Data Flows

2. Optimizer

3. Iterations

4. Scala Interface
Data Flows: Execution Models
Apache Hadoop MR is limited to one data flow: Map (M) → Reduce (R)
One of many possible data flows in Stratosphere:
[Figure: a graph combining Map (M), Reduce (R), and Join (J) operators]
Complex Data Flows in Hadoop
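[Figure: a data flow with Filtering (M), Grouping (R), Joining (J), and a second Grouping (R), shown next to the chain of Map (M) and Reduce (R) steps needed to express it in Hadoop]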
Data Flows: Lessons Learned

1. Most tasks do not fit the MapReduce model
2. Very expensive
○ Always go to disk and HDFS

3. Tedious to implement
○ Custom data types and file formats between jobs

That’s why higher level abstractions for MR exist.
Advanced Data Flows in Stratosphere
● Data flow graphs are supported natively
● Stratosphere only writes to disk if necessary; otherwise data stays in memory
[Figure: a Stratosphere data flow of Map (M), Reduce (R), and Join (J) operators]
Skeleton of a Stratosphere Program
● Input: text file, JDBC source, CSV, etc.
● Transformations
○ map, reduce, join, iterate etc.

● Output: to file etc.
● Data Types
○ PactRecord: Tuples with n fields.
○ Custom data types for vectors, images, audio (we only expect serialization and comparison)
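As a rough illustration of what "serialization and comparison" means for a custom data type, the sketch below defines a vector type in plain Java that can write itself to bytes, read itself back, and compare itself to another instance. The class name DenseVector and its methods are hypothetical; it deliberately avoids Stratosphere's own type interfaces and uses only java.io.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical custom type, not a Stratosphere class.
final class DenseVector implements Comparable<DenseVector> {
    private double[] values;

    DenseVector(double... values) { this.values = values.clone(); }

    // Serialization: a length prefix followed by the raw doubles.
    void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) out.writeDouble(v);
    }

    // Deserialization: mirrors write() field by field.
    void read(DataInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
    }

    // Comparison: lexicographic order over the components.
    @Override
    public int compareTo(DenseVector other) {
        return Arrays.compare(values, other.values);
    }

    double[] toArray() { return values.clone(); }

    // Round trip through a byte array, similar to shipping a record between nodes.
    static DenseVector roundTrip(DenseVector v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            v.write(new DataOutputStream(bytes));
            DenseVector copy = new DenseVector();
            copy.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Serialization lets a runtime move the value between nodes or spill it to disk; comparison lets it sort and group by the value.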
Data Flows: Code Example

[Figure: data flow of the example job with Map (M), Reduce (R), and Join (J) operators]
FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);

// Filter Mapper
MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
    .input(orders).build();

ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
    .input(customers)
    .keyField(PactInteger.class, 0).build(); // Define group key

MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
    .input1(ordersFiltered)
    .input2(groupedCustomers).build();

ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
    .input(joined)
    .keyField(PactInteger.class, 0).build();

FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy);
Map Stub and PactRecord by Example
MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
    .input(orders).build();

public class FilterOrders extends MapStub {
    @Override
    public void map(PactRecord order, Collector<PactRecord> out)
            throws Exception {
        // Read the date field of the order record
        PactString date = order.getField(Orders.DATE_IDX, PactString.class);
        // Forward only the orders with the matching date
        if (date.getValue().equals("11.20.2013")) {
            out.collect(order);
        }
    }
}
2. Optimizer
Joins in Hadoop
Map (Broadcast) Join

Reduce (Repartition) Join

● Which strategy to choose?
● How to configure it?
Lessons Learned:
● Joins do not naturally fit MapReduce
● Very time consuming to implement
● Hand optimization necessary
Source: Sebastian Schelter, TU Berlin
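The two strategies named on this slide can be compared with a small simulation. The sketch below is plain Java with loops standing in for cluster nodes; the class name JoinStrategySketch, the String[]{key, payload} record layout, and the round-robin slicing are assumptions for illustration, not Hadoop or Stratosphere code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation. Records are String[]{key, payload}.
final class JoinStrategySketch {

    // Local hash join: build a table on one input, probe it with the other.
    static List<String> hashJoin(List<String[]> build, List<String[]> probe) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] b : build)
            table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b[1]);
        List<String> out = new ArrayList<>();
        for (String[] p : probe)
            for (String match : table.getOrDefault(p[0], List.of()))
                out.add(p[0] + ":" + match + "," + p[1]);
        return out;
    }

    // Broadcast join: every node gets a full copy of the small input
    // and an arbitrary (here round-robin) slice of the large input.
    static List<String> broadcastJoin(List<String[]> small, List<String[]> large, int nodes) {
        List<String> out = new ArrayList<>();
        for (int node = 0; node < nodes; node++) {
            List<String[]> slice = new ArrayList<>();
            for (int i = node; i < large.size(); i += nodes) slice.add(large.get(i));
            out.addAll(hashJoin(small, slice));
        }
        Collections.sort(out); // sorted only to make results comparable
        return out;
    }

    // Repartition join: both inputs are split by key hash, so records with
    // equal keys always meet on the same node.
    static List<String> repartitionJoin(List<String[]> left, List<String[]> right, int nodes) {
        List<String> out = new ArrayList<>();
        for (int node = 0; node < nodes; node++)
            out.addAll(hashJoin(partition(left, node, nodes), partition(right, node, nodes)));
        Collections.sort(out);
        return out;
    }

    private static List<String[]> partition(List<String[]> records, int node, int nodes) {
        List<String[]> part = new ArrayList<>();
        for (String[] r : records)
            if (Math.floorMod(r[0].hashCode(), nodes) == node) part.add(r);
        return part;
    }
}
```

Both methods return the same pairs; they differ in how much data each one moves. Broadcasting copies the small input to every node, while repartitioning sends each record of both inputs to exactly one node.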
Joins with Stratosphere
● Implemented natively in the system
● Optimizer decides join strategy:
○ Sort-Merge Join
○ Hybrid Hash Join
○ Data shipping strategy
● Hybrid Hash Join starts in memory and gracefully degrades to disk
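For readers unfamiliar with the first strategy in the list, here is a minimal in-memory sort-merge join in plain Java. It is a textbook sketch, not the Stratosphere implementation; the class name SortMergeJoinSketch and the int[]{key, value} record layout are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch. Input records are int[]{key, value};
// output records are int[]{key, leftValue, rightValue}.
final class SortMergeJoinSketch {

    static List<int[]> join(List<int[]> left, List<int[]> right) {
        // Phase 1: sort copies of both inputs by key.
        List<int[]> l = new ArrayList<>(left);
        List<int[]> r = new ArrayList<>(right);
        Comparator<int[]> byKey = Comparator.comparingInt(rec -> rec[0]);
        l.sort(byKey);
        r.sort(byKey);

        // Phase 2: merge the two sorted runs with one cursor per side.
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < l.size() && j < r.size()) {
            int lk = l.get(i)[0], rk = r.get(j)[0];
            if (lk < rk) {
                i++;
            } else if (lk > rk) {
                j++;
            } else {
                // Find the end of the run of equal keys on the right side.
                int jEnd = j;
                while (jEnd < r.size() && r.get(jEnd)[0] == lk) jEnd++;
                // Pair each left record of this key with each right record of this key.
                while (i < l.size() && l.get(i)[0] == lk) {
                    for (int k = j; k < jEnd; k++)
                        out.add(new int[]{lk, l.get(i)[1], r.get(k)[1]});
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }
}
```

The output arrives ordered by the join key, so a later grouping on that same key can reuse the order instead of sorting again.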
Optimizer Magic
Recap example job:
[Figure: Filtering (M) and Grouping (R) feed Joining (J), followed by a second Grouping (R)]
● We require a grouped input for the reducer (sorting or hashing)
● Optimizer chooses Sort-Merge-Join → no sorting for reduce
Stratosphere Optimizer
● Cost-based optimizer
○ Enumerate different execution plans
○ Choose the cheapest one
● Optimizer collects statistics
○ Size of input and output
● Operators (Map, Reduce, Join) tell how they modify fields
● In-memory chaining of operators
● Memory Distribution
⇒ Focus on your application logic rather than parallel execution.
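The "enumerate plans, choose the cheapest" idea can be sketched with a toy cost model. Everything below is hypothetical: the class name PlanChooser, the three plan names, and the cost formulas (estimated bytes sent over the network) are illustrative assumptions and say nothing about the real optimizer's cost functions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy cost model, not the Stratosphere optimizer.
final class PlanChooser {

    // Enumerate candidate plans for one join and estimate the bytes each one ships.
    static Map<String, Long> enumeratePlans(long leftBytes, long rightBytes, int nodes) {
        Map<String, Long> costs = new LinkedHashMap<>();
        costs.put("broadcast-left", leftBytes * nodes);        // copy left input to every node
        costs.put("broadcast-right", rightBytes * nodes);      // copy right input to every node
        costs.put("repartition-both", leftBytes + rightBytes); // ship each record once
        return costs;
    }

    // Choose the plan with the lowest estimated cost.
    static String cheapest(long leftBytes, long rightBytes, int nodes) {
        return enumeratePlans(leftBytes, rightBytes, nodes).entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(cheapest(1_000L, 1_000_000L, 4));     // broadcast-left
        System.out.println(cheapest(1_000_000L, 1_000_000L, 4)); // repartition-both
    }
}
```

The input size estimates play the role of the statistics mentioned above: with a tiny left input, broadcasting wins; with two large inputs, repartitioning is cheaper.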
3. Iterations
Algorithms that need iterations
● K-Means
● Gradient descent
● PageRank
● Logistic Regression
● Path algorithms on graphs
● Graph communities / dense sub-components
● Inference (belief propagation)
Why Iterations?
● Many algorithms loop over the data
○ Machine learning: iteratively refine the model
○ Graph processing: propagate information hop by hop
[Figure: Example: Connected Components. Vertex labels shown for the Initial Input, the 1st Iteration, and the 2nd Iteration]
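The connected components example can be written as a small iterative label propagation in plain Java. This is a single-machine sketch under assumptions: the class name ConnectedComponentsSketch and the sample graph are made up for illustration (they are not the graph from the slide), and each pass of the loop stands for one iteration.

```java
import java.util.Arrays;

// Hypothetical sketch, not Stratosphere code.
final class ConnectedComponentsSketch {

    // edges: undirected pairs of vertex ids in [0, vertexCount).
    // Returns, per vertex, the smallest vertex id in its component.
    static int[] run(int vertexCount, int[][] edges) {
        int[] label = new int[vertexCount];
        for (int v = 0; v < vertexCount; v++) label[v] = v; // initial input: own id

        boolean changed = true;
        while (changed) {                 // one pass of this loop = one iteration
            changed = false;
            int[] next = label.clone();
            for (int[] e : edges) {       // propagate labels one hop along every edge
                int min = Math.min(label[e[0]], label[e[1]]);
                if (min < next[e[0]]) { next[e[0]] = min; changed = true; }
                if (min < next[e[1]]) { next[e[1]] = min; changed = true; }
            }
            label = next;
        }
        return label;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        System.out.println(Arrays.toString(run(6, edges))); // [0, 0, 0, 3, 3, 5]
    }
}
```

Labels travel one hop per iteration, so vertex 2 needs two iterations to learn about vertex 0. The loop stops once an iteration changes nothing.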
Iterations in Hadoop
● Loop is outside the system
○ Hard to program
○ Very poor performance
Usually each iteration is more than a single map and reduce!
[Figure: a Driver spawns the 1st Iteration, the 2nd Iteration, ..., the n-th Iteration, each as a separate Map (M) → Reduce (R) job]
Iterations in Stratosphere
● Loop is inside the system
○ Easy to program
○ Huge performance gains
[Figure: an Iterate operator enclosing a data flow of M, C, and R operators]
4. Scala Interface
● Functional, object-oriented programming language
● ScaLa = Scalable Language
● Very productive (few LOC)
● Feels like a scripting language
● No more UDFs
● Easy to integrate
● Runs in the JVM, compatible with regular Java classes
● Basis for developing embedded domain-specific languages (DSLs)
Do more, write less!
// Scala
class Person(val firstName: String, val lastName: String)

// Java
public class Person {
    private final String firstName;
    private final String lastName;

    public Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    public String getFirstName() {
        return firstName;
    }

    public String getLastName() {
        return lastName;
    }
}
Let the code speak
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
  .groupBy { word => word }
  .count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))
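For comparison, the same word count logic can be traced on a single machine with the Java streams API. This is only a sketch of what the pipeline computes; the class name WordCountSketch is hypothetical, and nothing here is Stratosphere API or runs in parallel on a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine sketch of the word count logic.
final class WordCountSketch {

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // like flatMap { line => line.split(" ") }
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // like groupBy { word => word } followed by count()
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```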
Example in Scala
[Figure: data flow of the example job with Map (M), Reduce (R), and Join (J) operators]

// Java
FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);
MapContract ordersFiltered = MapContract.builder(FilterOrders.class).input(orders).build();
ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
    .input(customers)
    .keyField(PactInteger.class, 0)
    .build();
MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
    .input1(ordersFiltered).input2(groupedCustomers).build();
ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
    .input(joined)
    .keyField(PactInteger.class, 0)
    .build();
FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy, "output: word counts");

// Scala
val customers = DataSource(customersPath, CsvInputFormat[Customer])
val orders = DataSource(ordersPath, CsvInputFormat[Order])
val ordersFiltered = orders filter { order => order.date.equals("11.20.2013") }
val groupedCustomers = customers groupBy { cust => cust.zip } reduceGroup { grp =>
  (grp.buffered.head.zip, grp.maxBy { _.total }) }
val joined = ordersFiltered.join(groupedCustomers)
  .where { ord => ord.c_id }
  .isEqualTo { cust => cust._1 }
  .map { (orders, cust) => cust }
val max = joined groupBy { cust => cust.category_id } reduceGroup { _.maxBy { _.sum } }
val output = max.write(outputPath, DelimitedOutputFormat(formatOutput.tupled))
val plan = new ScalaPlan(Seq(output), "BDB Example")
Summary: Feature Matrix
Stratosphere: Database inspired Big Data Analytics
Operators
● Map Reduce: Map, Reduce
● Stratosphere: Map, Reduce (multiple sort keys), Cross, Join, CoGroup, Union, Iterate, Iterate Delta
Composition
● Map Reduce: Only MapReduce
● Stratosphere: Arbitrary data flows
Data Exchange
● Map Reduce: Batch through disk
● Stratosphere: Pipelined, in-memory (automatic spilling to disk)
Get In Touch
Stratosphere is the next-generation open source Big Data Analytics Platform.
Quickstart: http://stratosphere.eu/quickstart
Website: http://stratosphere.eu
GitHub: https://github.com/stratosphere
Mailing List: https://groups.google.com/d/forum/stratosphere-dev
Twitter: @stratosphere_eu

Mais conteúdo relacionado

Mais procurados

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxDatabricks
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Spark Summit
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big DataLeonardo Gamas
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and DataframeNamgee Lee
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RaySpark Summit
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsFlink Forward
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flinkFlink Forward
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?Miklos Christine
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSigmoid
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 

Mais procurados (20)

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big Data
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
 
Meet scala
Meet scalaMeet scala
Meet scala
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 

Destaque

Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...
Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...
Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...stratosphere_eu
 
The formation of the ozone layer
The formation of the ozone layerThe formation of the ozone layer
The formation of the ozone layerBn_QaBBaN
 
Air Pollution
Air PollutionAir Pollution
Air Pollutionmargori
 
Ozone layer depletion ppt
Ozone layer depletion pptOzone layer depletion ppt
Ozone layer depletion pptAnchal Singhal
 

Destaque (6)

Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...
Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...
Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (...
 
Stratosphere
StratosphereStratosphere
Stratosphere
 
The formation of the ozone layer
The formation of the ozone layerThe formation of the ozone layer
The formation of the ozone layer
 
Air Pollution
Air PollutionAir Pollution
Air Pollution
 
Atmosphere 2
Atmosphere  2Atmosphere  2
Atmosphere 2
 
Ozone layer depletion ppt
Ozone layer depletion pptOzone layer depletion ppt
Ozone layer depletion ppt
 

Semelhante a Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 
Introduction To Groovy 2005
Introduction To Groovy 2005Introduction To Groovy 2005
Introduction To Groovy 2005Tugdual Grall
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 

Semelhante a Stratosphere System Overview Big Data Beers Berlin. 20.11.2013 (20)

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
hadoop.ppt
hadoop.ppthadoop.ppt
hadoop.ppt
 
Hadoop 3
Hadoop 3Hadoop 3
Hadoop 3
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Introduction To Groovy 2005
Introduction To Groovy 2005Introduction To Groovy 2005
Introduction To Groovy 2005
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 

Mais de Robert Metzger

How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)Robert Metzger
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupRobert Metzger
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupApache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupRobert Metzger
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)Robert Metzger
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupRobert Metzger
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkRobert Metzger
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkRobert Metzger
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016Robert Metzger
 
Flink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in ReviewFlink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in ReviewRobert Metzger
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Robert Metzger
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureRobert Metzger
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community UpdateRobert Metzger
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingRobert Metzger
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community UpdateRobert Metzger
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Robert Metzger
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateRobert Metzger
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 

Mais de Robert Metzger (20)

How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
 
dA Platform Overview
dA Platform OverviewdA Platform Overview
dA Platform Overview
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupApache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin Meetup
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016
 
Flink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in ReviewFlink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in Review
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community Update
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community Update
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 

Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

  • 1. Stratosphere: System Overview Robert Metzger mail@robertmetzger.de Twitter: @rmetzger_ Big Data Beers Meetup, Nov. 19th, 2013
  • 2. Stratosphere … is a distributed data processing engine … automatically handles parallelization … brings database technology to the world of big data
  • 3. Overview ● Extends MapReduce with more operators: map and reduce (known from Hadoop); cross, join, and cogroup (new in Stratosphere) ● Support for advanced data flow graphs [diagram: a single Map-Reduce chain (known from Hadoop) next to a graph of Map, Reduce, and Join operators (new in Stratosphere)] ● Compiler/Optimizer, Java/Scala Interface, YARN
  • 4. Stratosphere System Stack ● APIs: Java API, Scala API, Meteor, ... ● Stratosphere Optimizer ● Stratosphere Runtime ● Cluster Manager: YARN, Direct, EC2 ● Storage: Local Files, HDFS, S3, ... [the diagram also shows Hive and Hadoop MR alongside the Stratosphere layers]
  • 5. Stratosphere in a Cluster ● Operators are executed over the whole cluster ● Side by side with Hadoop ● Scales by adding more nodes ● Support for YARN is in development ● We have a LocalExecutor [diagram: jobs are submitted to a master node running the JobManager (resource management, compiler, web interface); 4 worker nodes each run a Stratosphere TaskManager next to a Hadoop DataNode]
  • 6. 1. Data Flows 2. Optimizer 3. Iterations 4. Scala Interface
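  • 7. Data Flows: Execution Models ● Apache Hadoop MR is limited to one data flow [diagram: Map followed by Reduce] ● One of many possible data flows in Stratosphere [diagram: graph of Map, Reduce, and Join operators]
  • 8. Complex Data Flows in Hadoop [diagram: a logical flow of filtering (Map), grouping (Reduce), joining (Join), and grouping (Reduce), next to the chain of separate Map and Reduce jobs needed to express it in Hadoop]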
  • 9. Data Flows: Lessons Learned 1. Most tasks do not fit the MapReduce model 2. Very expensive ○ Always go to disk and HDFS 3. Tedious to implement ○ Custom data types and file formats between jobs That’s why higher level abstractions for MR exist.
  • 10. Advanced Data Flows in Stratosphere ● Data flow graphs are supported natively ● Stratosphere only writes to disk if necessary, otherwise in-memory [diagram: graph of Map, Reduce, and Join operators]
  • 11. Skeleton of a Stratosphere Program ● Input: text file, JDBC source, CSV, etc. ● Transformations ○ map, reduce, join, iterate etc. ● Output: to file etc. ● Data Types ○ PactRecord: Tuples with n fields. ○ custom data types for vectors, images, audio (we only expect serialization and compare)
  • 12. Data Flows: Code Example [diagram: data flow of Map, Reduce, Join, and Reduce operators]
      FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
      FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);
      // Filter mapper
      MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
          .input(orders).build();
      ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
          .input(customers)
          .keyField(PactInteger.class, 0).build();  // define group key
      MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
          .input1(ordersFiltered)
          .input2(groupedCustomers).build();
      ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
          .input(joined)
          .keyField(PactInteger.class, 0).build();
      FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy);
  • 13. Map Stub and PactRecord by Example
      MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
          .input(orders).build();

      public class FilterOrders extends MapStub {
          @Override
          public void map(PactRecord order, Collector<PactRecord> out) throws Exception {
              PactString date = order.getField(Orders.DATE_IDX, PactString.class);
              if (date.getValue().equals("11.20.2013")) {
                  out.collect(order);
              }
          }
      }
  • 14. 1. Data Flows 2. Optimizer 3. Iterations 4. Scala Interface
  • 15. Joins in Hadoop ● Map (Broadcast) Join ● Reduce (Repartition) Join ● Which strategy to choose? ● How to configure it? Lessons Learned: ● Joins do not naturally fit MapReduce ● Very time consuming to implement ● Hand optimization necessary Source: Sebastian Schelter, TU Berlin
  • 16. Joins with Stratosphere ● Natively implemented into the system ● Optimizer decides join strategy: ○ Sort-merge-join ○ Hybrid Hash Join ○ Data Shipping Strategy ● Hybrid Hash Join starts in-memory and gracefully degrades to disk
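The two local join strategies named on this slide can be sketched in plain Java. This is only an illustration on in-memory lists, not Stratosphere's runtime code: the class `JoinSketch`, its methods, and the row layout (`int[]{key, value}`) are hypothetical, and the hash join shown here is the simple in-memory variant without the spilling that makes the real one "hybrid".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain in-memory Java, not the Stratosphere runtime.
// Input rows are int[]{key, value}; output rows are int[]{key, firstValue, secondValue}.
final class JoinSketch {

    // Hash join: build a hash table on the first input, then probe it with the second.
    static List<int[]> hashJoin(List<int[]> build, List<int[]> probe) {
        Map<Integer, List<int[]>> table = new HashMap<>();
        for (int[] row : build) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        List<int[]> out = new ArrayList<>();
        for (int[] p : probe) {
            for (int[] b : table.getOrDefault(p[0], List.of())) {
                out.add(new int[] {p[0], b[1], p[1]});
            }
        }
        return out;
    }

    // Sort-merge join: sort both inputs by key, then advance two cursors in lockstep.
    static List<int[]> sortMergeJoin(List<int[]> left, List<int[]> right) {
        Comparator<int[]> byKey = Comparator.comparingInt(row -> row[0]);
        List<int[]> l = new ArrayList<>(left);
        List<int[]> r = new ArrayList<>(right);
        l.sort(byKey);
        r.sort(byKey);
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < l.size() && j < r.size()) {
            int lk = l.get(i)[0];
            int rk = r.get(j)[0];
            if (lk < rk) {
                i++;
            } else if (lk > rk) {
                j++;
            } else {
                // Find the end of the run of equal keys on the right side.
                int jEnd = j;
                while (jEnd < r.size() && r.get(jEnd)[0] == lk) {
                    jEnd++;
                }
                // Pair every left row of this key with every right row of this key.
                while (i < l.size() && l.get(i)[0] == lk) {
                    for (int k = j; k < jEnd; k++) {
                        out.add(new int[] {lk, l.get(i)[1], r.get(k)[1]});
                    }
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> customers = List.of(new int[] {1, 10}, new int[] {2, 20});
        List<int[]> orders = List.of(new int[] {2, 200}, new int[] {1, 100}, new int[] {2, 201});
        for (int[] row : sortMergeJoin(customers, orders)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Both methods return the same set of joined rows. The sort-merge variant additionally emits them ordered by key, which is the property the optimizer can reuse for a following grouped reduce.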
  • 17. Optimizer Magic ● Recap example job [diagram: filtering (Map), grouping (Reduce), joining (Join), grouping (Reduce)] ● We require a grouped input for the reducer (sorting or hashing) ● Optimizer chooses Sort-Merge-Join → no sorting for reduce
  • 18. Stratosphere Optimizer ● Cost-based optimizer ○ Enumerate different execution plans ○ Choose the cheapest one ● Optimizer collects statistics ○ Size of input and output ● Operators (Map, Reduce, Join) tell how they modify fields ● In-memory chaining of operators ● Memory Distribution ⇒ Focus on your application logic rather than parallel execution.
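The "enumerate plans, choose the cheapest" step can be sketched with a toy cost model. Everything here is an assumption made for illustration: the class `PlanChooser`, the plan names, and the cost formula (estimated bytes shipped over the network, with a broadcast copying one input to every node) are hypothetical and are not taken from the Stratosphere optimizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a toy cost-based choice between join shipping strategies.
final class PlanChooser {

    // Enumerate candidate plans with an assumed cost: bytes sent over the network.
    static Map<String, Long> enumeratePlans(long leftBytes, long rightBytes, int nodes) {
        Map<String, Long> costs = new LinkedHashMap<>();
        costs.put("broadcast-left", leftBytes * nodes);         // copy the left input to every node
        costs.put("broadcast-right", rightBytes * nodes);       // copy the right input to every node
        costs.put("repartition-both", leftBytes + rightBytes);  // hash-partition both inputs once
        return costs;
    }

    // Pick the plan with the lowest estimated cost.
    static String cheapest(Map<String, Long> costs) {
        String best = null;
        long bestCost = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : costs.entrySet()) {
            if (e.getValue() < bestCost) {
                best = e.getKey();
                bestCost = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A small left input joined with a large right input on 4 nodes.
        System.out.println(cheapest(enumeratePlans(1_000_000L, 10_000_000_000L, 4)));
    }
}
```

With input sizes as the only statistic, a small input next to a large one favors broadcasting the small side, while two inputs of similar size favor repartitioning both.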
  • 19. 1. Data Flows 2. Optimizer 3. Iterations 4. Scala Interface
  • 20. Algorithms that need iterations ● K-Means ● Gradient descent ● Page-Rank ● Logistic Regression ● Path algorithms on graphs ● Graph communities / dense sub-components ● Inference (belief propagation)
  • 21. Why Iterations? ● Many algorithms loop over the data ○ Machine learning: iteratively refine the model ○ Graph processing: propagate information hop by hop ● Example: Connected Components [figure: vertex labels of a small graph at the initial input, after the 1st iteration, and after the 2nd iteration]
  • 22. Iterations in Hadoop ● Loop is outside the system ○ Hard to program ○ Very poor performance ● Usually each iteration is more than a single map and reduce! [diagram: a driver program spawns the 1st, 2nd, ..., n-th iteration, each as a separate Map and Reduce job]
  • 23. Iterations in Stratosphere ● Loop is inside the system ○ Easy to program ○ Huge performance gains [diagram: an Iterate operator enclosing a data flow of M, C, and R operators]
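The connected components example can be sketched as a loop that propagates the smallest vertex label hop by hop until nothing changes. This is a minimal single-machine illustration, not the Stratosphere iteration operator; the class `ConnectedComponentsSketch`, the 0-based vertex ids, and the sample graph are assumptions and do not reproduce the graph from the slide.

```java
import java.util.Arrays;

// Illustrative only: label propagation for connected components on one machine.
final class ConnectedComponentsSketch {

    // Vertices are 0..n-1; each edge is an undirected pair {a, b}.
    // Returns, per vertex, the smallest vertex id in its component.
    static int[] components(int n, int[][] edges) {
        int[] label = new int[n];
        for (int v = 0; v < n; v++) {
            label[v] = v; // initial input: every vertex is its own component
        }
        boolean changed = true;
        while (changed) { // one pass of this loop corresponds to one iteration
            changed = false;
            int[] next = label.clone();
            for (int[] e : edges) {
                // Both endpoints adopt the smaller label seen in the previous iteration.
                int min = Math.min(label[e[0]], label[e[1]]);
                if (min < next[e[0]]) {
                    next[e[0]] = min;
                    changed = true;
                }
                if (min < next[e[1]]) {
                    next[e[1]] = min;
                    changed = true;
                }
            }
            label = next;
        }
        return label;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {4, 5}, {5, 6}};
        System.out.println(Arrays.toString(components(7, edges)));
    }
}
```

Each pass reads only the labels from the previous pass, so information travels exactly one hop per iteration, and the number of iterations grows with the longest path in a component.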
  • 24. 1. Data Flows 2. Optimizer 3. Iterations 4. Scala Interface
  • 25. ● Functional, object-oriented programming language ● ScaLa = Scalable Language ● Very productive (few LOC) ● Feels like a scripting language ● No more UDFs ● Easy to integrate ● Runs on the JVM and is compatible with regular Java classes ● Basis for developing embedded domain specific languages (DSL)
  • 26. Do more, write less!
      Scala:
      class Person(val firstName: String, val lastName: String)

      Java:
      public class Person {
          private final String firstName;
          private final String lastName;

          public Person(String firstName, String lastName) {
              this.firstName = firstName;
              this.lastName = lastName;
          }

          public String getFirstName() { return firstName; }

          public String getLastName() { return lastName; }
      }
  • 27. Let the code speak
      val input = TextFile(textInput)
      val words = input.flatMap { line => line.split(" ") }
      val counts = words
        .groupBy { word => word }
        .count()
      val output = counts.write(wordsOutput, CsvOutputFormat())
      val plan = new ScalaPlan(Seq(output))
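For readers who want to try the word count logic without a Stratosphere installation, the same split, group, and count steps can be written against plain Java streams. This is a local, single-JVM sketch and not the Stratosphere Java API; the class `WordCountSketch` is hypothetical, and the filter for empty strings is an addition that the Scala version above does not have.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: word count on an in-memory list of lines.
final class WordCountSketch {

    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // like flatMap { line => line.split(" ") }
                .filter(word -> !word.isEmpty())                 // added here: skip empty tokens
                .collect(Collectors.groupingBy(                  // like groupBy { word => word }.count()
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("to be or", "not to be")));
    }
}
```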
  • 28. Example in Scala [diagram: data flow of Map, Reduce, Join, and Reduce operators]
      Java:
      FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
      FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);
      MapContract ordersFiltered = MapContract.builder(FilterOrders.class).input(orders).build();
      ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
          .input(customers)
          .keyField(PactInteger.class, 0)
          .build();
      MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
          .input1(ordersFiltered).input2(groupedCustomers).build();
      ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
          .input(joined)
          .keyField(PactInteger.class, 0)
          .build();
      FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy, "output: word counts");

      Scala:
      val customers = DataSource(customersPath, CsvInputFormat[Customer])
      val orders = DataSource(ordersPath, CsvInputFormat[Order])
      val ordersFiltered = orders filter { order => order.date.equals("11.20.2013") }
      val groupedCustomers = customers groupBy { cust => cust.zip } reduceGroup { grp => (grp.buffered.head.zip, grp.maxBy { _.total }) }
      val joined = ordersFiltered
        .join(groupedCustomers)
        .where { ord => ord.c_id }
        .isEqualTo { cust => cust._1 }
        .map { (orders, cust) => cust }
      val max = joined groupBy { cust => cust.category_id } reduceGroup { _.maxBy { _.sum } }
      val output = max.write(outputPath, DelimitedOutputFormat(formatOutput.tupled))
      val plan = new ScalaPlan(Seq(output), "BDB Example")
  • 29. Summary: Feature Matrix. Stratosphere: Database inspired Big Data Analytics
      Operators. Map Reduce: Map, Reduce. Stratosphere: Map, Reduce (multiple sort keys), Cross, Join, CoGroup, Union, Iterate, Iterate Delta.
      Composition. Map Reduce: only MapReduce. Stratosphere: arbitrary data flows.
      Data Exchange. Map Reduce: batch through disk. Stratosphere: pipelined, in-memory (automatic spilling to disk).
  • 30. Get In Touch Stratosphere is the next-generation open source Big Data Analytics Platform. Quickstart: http://stratosphere.eu/quickstart Website: http://stratosphere.eu GitHub: https://github.com/stratosphere Mailing List: https://groups.google.com/d/forum/stratosphere-dev Twitter: @stratosphere_eu