SlideShare uma empresa Scribd logo
1 de 31
Baixar para ler offline
The State of Spark
And Where We’re Going Next
Matei Zaharia
Community Growth
Project History
Spark started as research project in 2009
Open sourced in 2010
» 1st version was 1600 LOC, could run Wikipedia demo
Growing community since
Entered Apache Incubator in June 2013
Development Community
With over 100 developers and 25 companies,
one of the most active communities in big data
Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
Past 6 months: more active devs than Hadoop MapReduce!
Development Community
Healthy across the whole ecosystem
Release Growth
Spark 0.6:
- Java API, Maven,
standalone mode
- 17 contributors
Sept ‘13
Feb ‘13
Oct ‘12
Spark 0.7:
- Python API,
Spark Streaming
- 31 contributors
Spark 0.8:
- YARN, MLlib,
monitoring UI
- 67 contributors
YARN support (Yahoo!)
Columnar compression in Shark (Yahoo!)
Fair scheduling (Intel)
Metrics reporting (Intel, Quantifind)
New RDD operators (Bizo, ClearStory)
Scala 2.10 support (Imaginea)
Some Community Contributions
Conferences
0
50
100
150
200
250
300
350
400
450
500
AMP Camp 1
(Aug 2012)
AMP Camp 2
(Aug 2013)
Spark Summit
(Nov 2013)
Attendees
Projects Built on Spark
Spark
Spark
Streaming
(real-time)
GraphX
(graph)
…
Shark"
(SQL)
MLbase
(machine
learning)
BlinkDB
What’s Next?
Our View
While big data tools have advanced a lot, they
are still far too difficult to tune and use
Goal: design big data systems that are as
powerful & seamless as those for small data
Current Priorities
Standard libraries
Deployment
Out-of-the-box usability
Enterprise use
 +
Standard Libraries
While writing K-means in 30 lines is great, it’s
even better to call it from a library!
Spark’s MLlib and GraphX will be standard
libraries supported by core developers
» MLlib in Spark 0.8 with 7 algorithms
» GraphX coming soon
» Both operate directly on RDDs
Standard Libraries
val	
  rdd:	
  RDD[Array[Double]]	
  =	
  ...	
  
val	
  model	
  =	
  KMeans.train(rdd,	
  k	
  =	
  10)	
  
	
  
	
  
val	
  graph	
  =	
  Graph(vertexRDD,	
  edgeRDD)	
  
val	
  ranks	
  =	
  PageRank.run(graph,	
  iters	
  =	
  10)	
  
Standard Libraries
Beyond these libraries, Databricks is investing
heavily in higher-level projects
Spark Streaming:"
easier 24/7 operation and optimizations coming in 0.9
Shark:"
calling Spark libs (e.g. MLlib), optimizer, Hive 0.11 & 0.12
Goal: a complete and interoperable stack
Deployment
Want Spark to easily run anywhere
Spark 0.8: much improved YARN, EC2 support
Spark 0.8.1: support for YARN 2.2
SIMR: launch Spark in MapReduce clusters as
a Hadoop job (no installation needed!)
» For experimenting; see talk by Ahir
Monitoring and metrics (0.8)
Better support for large # of tasks (0.8.1)
High availability for standalone mode (0.8.1)
External hashing & sorting (0.9)
Ease of Use
Long-term: remove need to tune beyond defaults
Next Releases
Spark 0.8.1 (this month)
» YARN 2.2, standalone mode HA, optimized shuffle,
broadcast & result fetching
Spark 0.9 (Jan 2014)
» Scala 2.10 support, configuration system, Spark
Streaming improvements
What Makes Spark Unique?
Big Data Systems Today
MapReduce
Pregel
Dremel
GraphLab
Storm
Giraph
Drill
Tez
Impala
S4
…
Specialized systems
(iterative, interactive and"
streaming apps)
General batch"
processing
Spark’s Approach
Instead of specializing, generalize MapReduce"
to support new apps in same engine
Two changes (general task DAG & data sharing)
are enough to express previous models!
Unification has big benefits
» For the engine
» For users
Spark
Streaming
GraphX
…
Shark
MLbase
Code Size
0
20000
40000
60000
80000
100000
120000
140000
Hadoop
MapReduce
Storm
(Streaming)
Impala (SQL)
 Giraph
(Graph)
Spark
non-test, non-example source lines
Code Size
0
20000
40000
60000
80000
100000
120000
140000
Hadoop
MapReduce
Storm
(Streaming)
Impala (SQL)
 Giraph
(Graph)
Spark
non-test, non-example source lines
Streaming
Code Size
0
20000
40000
60000
80000
100000
120000
140000
Hadoop
MapReduce
Storm
(Streaming)
Impala (SQL)
 Giraph
(Graph)
Spark
non-test, non-example source lines
Streaming
Shark*
* also calls into Hive
Code Size
0
20000
40000
60000
80000
100000
120000
140000
Hadoop
MapReduce
Storm
(Streaming)
Impala (SQL)
 Giraph
(Graph)
Spark
non-test, non-example source lines
Streaming
GraphX
Shark*
* also calls into Hive
Performance
Hive
Impala(disk)
Impala(mem)
Shark(disk)
Shark(mem)
0
5
10
15
20
25
30
35
40
45
ResponseTime(s)
SQL
Hadoop
Giraph
GraphX
0
5
10
15
20
25
30
ResponseTime(min)
Graph
Storm
Spark
0
5
10
15
20
25
30
35
Throughput(MB/s/node)
Streaming
What it Means for Users
Separate frameworks:

…
HDFS
read
HDFS
write
ETL
HDFS
read
HDFS
write
train
HDFS
read
HDFS
write
query
HDFS
HDFS
read
ETL
train
query
Spark:

Interactive"
analysis
Combining Processing Types
val	
  points	
  =	
  sc.runSql[Double,	
  Double](	
  
	
  	
  “select	
  latitude,	
  longitude	
  from	
  historic_tweets”)	
  
	
  
val	
  model	
  =	
  KMeans.train(points,	
  10)	
  
	
  
sc.twitterStream(...)	
  
	
  	
  .map(t	
  =>	
  (model.closestCenter(t.location),	
  1))	
  
	
  	
  .reduceByWindow(“5s”,	
  _	
  +	
  _)	
  
From Scala:
Combining Processing Types
GENERATE	
  KMeans(tweet_locations)	
  AS	
  TABLE	
  tweet_clusters	
  
	
  
	
  
//	
  Scala	
  table	
  generating	
  function	
  (TGF):	
  
object	
  KMeans	
  {	
  
	
  	
  @Schema(spec	
  =	
  “x	
  double,	
  y	
  double,	
  cluster	
  int”)	
  
	
  	
  def	
  apply(points:	
  RDD[(Double,	
  Double)])	
  =	
  {	
  
	
  	
  	
  	
  ...	
  
	
  	
  }	
  
}	
  
	
  
From SQL (in Shark 0.8.1):
Conclusion
Next challenge in big data will be complex
and low-latency applications
Spark offers a unified engine to tackle and
combine these apps
Best strength is the community: enjoy Spark
Summit!
Contributors
Aaron Davidson
Alexander Pivovarov
Ali Ghodsi
Ameet Talwalkar
Andre Shumacher
Andrew Ash
Andrew Psaltis
Andrew Xia
Andrey Kouznetsov
Andy Feng
Andy Konwinski
Ankur Dave
Antonio Lupher
Benjamin Hindman
Bill Zhao
Charles Reiss
Chris Mattmann
Christoph Grothaus
Christopher Nguyen
Chu Tong
Cliff Engle
Cody Koeninger
David McCauley
Denny Britz
Dmitriy Lyubimov
Edison Tung
Eric Zhang
Erik van Oosten
Ethan Jewett
Evan Chan
Evan Sparks
Ewen Cheslack-Postava
Fabrizio Milo
Fernand Pajot
Frank Dai
Gavin Li
Ginger Smith
Giovanni Delussu
Grace Huang
Haitao Yao
Haoyuan Li
Harold Lim
Harvey Feng
Henry Milner
Henry Saputra
Hiral Patel
Holden Karau
Ian Buss
Imran Rashid
Ismael Juma
James Phillpotts
Jason Dai
Jerry Shao
Jey Kottalam
Joseph E. Gonzalez
Josh Rosen
Justin Ma
Kalpit Shah
Karen Feng
Karthik Tunga
Kay Ousterhout
Kody Koeniger
Konstantin Boudnik
Lee Moon Soo
Lian Cheng
Liang-Chi Hsieh
Marc Mercer
Marek Kolodziej
Mark Hamstra
Matei Zaharia
Matthew Taylor
Michael Heuer
Mike Potts
Mikhail Bautin
Mingfei Shi
Mosharaf Chowdhury
Mridul Muralidharan
Nathan Howell
Neal Wiggins
Nick Pentreath
Olivier Grisel
Patrick Wendell
Paul Cavallaro
Paul Ruan
Peter Sankauskas
Pierre Borckmans
Prabeesh K.
Prashant Sharma
Ram Sriharsha
Ravi Pandya
Ray Racine
Reynold Xin
Richard Benkovsky
Richard McKinley
Rohit Rai
Roman Tkalenko
Ryan LeCompte
S. Kumar
Sean McNamara
Shane Huang
Shivaram Venkataraman
Stephen Haberman
Tathagata Das
Thomas Dudziak
Thomas Graves
Timothy Hunter
Tyson Hamilton
Vadim Chekan
Wu Zeming
Xinghao Pan

Mais conteúdo relacionado

Mais procurados

Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellDatabricks
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myuiMakoto Yui
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Databricks
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...randyguck
 
HadoopDistributions
HadoopDistributionsHadoopDistributions
HadoopDistributionsDemet Aksoy
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 

Mais procurados (20)

Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 
HadoopDistributions
HadoopDistributionsHadoopDistributions
HadoopDistributions
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 

Destaque

Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in SparkDatabricks
 
Spark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteSpark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteDatabricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0 Databricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrDatabricks
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLDatabricks
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 

Destaque (18)

Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 
Spark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin KeynoteSpark Summit EU 2015: Reynold Xin Keynote
Spark Summit EU 2015: Reynold Xin Keynote
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 

Semelhante a Spark-summit-2013 Matei Zaharia

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDatabricks
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Stratio
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 

Semelhante a Spark-summit-2013 Matei Zaharia (20)

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015Spark Streaming @ Berlin Apache Spark Meetup, March 2015
Spark Streaming @ Berlin Apache Spark Meetup, March 2015
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 

Último

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 

Último (20)

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Spark-summit-2013 Matei Zaharia

  • 1. The State of Spark And Where We’re Going Next Matei Zaharia
  • 3. Project History Spark started as research project in 2009 Open sourced in 2010 » 1st version was 1600 LOC, could run Wikipedia demo Growing community since Entered Apache Incubator in June 2013
  • 4. Development Community With over 100 developers and 25 companies, one of the most active communities in big data Comparison: Storm (48), Giraph (52), Drill (18), Tez (12) Past 6 months: more active devs than Hadoop MapReduce!
  • 6. Release Growth Spark 0.6: - Java API, Maven, standalone mode - 17 contributors Sept ‘13 Feb ‘13 Oct ‘12 Spark 0.7: - Python API, Spark Streaming - 31 contributors Spark 0.8: - YARN, MLlib, monitoring UI - 67 contributors
  • 7. YARN support (Yahoo!) Columnar compression in Shark (Yahoo!) Fair scheduling (Intel) Metrics reporting (Intel, Quantifind) New RDD operators (Bizo, ClearStory) Scala 2.10 support (Imaginea) Some Community Contributions
  • 8. Conferences 0 50 100 150 200 250 300 350 400 450 500 AMP Camp 1 (Aug 2012) AMP Camp 2 (Aug 2013) Spark Summit (Nov 2013) Attendees
  • 9. Projects Built on Spark Spark Spark Streaming (real-time) GraphX (graph) … Shark" (SQL) MLbase (machine learning) BlinkDB
  • 11. Our View While big data tools have advanced a lot, they are still far too difficult to tune and use Goal: design big data systems that are as powerful & seamless as those for small data
  • 13. Standard Libraries While writing K-means in 30 lines is great, it’s even better to call it from a library! Spark’s MLlib and GraphX will be standard libraries supported by core developers » MLlib in Spark 0.8 with 7 algorithms » GraphX coming soon » Both operate directly on RDDs
  • 14. Standard Libraries val  rdd:  RDD[Array[Double]]  =  ...   val  model  =  KMeans.train(rdd,  k  =  10)       val  graph  =  Graph(vertexRDD,  edgeRDD)   val  ranks  =  PageRank.run(graph,  iters  =  10)  
  • 15. Standard Libraries Beyond these libraries, Databricks is investing heavily in higher-level projects Spark Streaming:" easier 24/7 operation and optimizations coming in 0.9 Shark:" calling Spark libs (e.g. MLlib), optimizer, Hive 0.11 & 0.12 Goal: a complete and interoperable stack
  • 16. Deployment Want Spark to easily run anywhere Spark 0.8: much improved YARN, EC2 support Spark 0.8.1: support for YARN 2.2 SIMR: launch Spark in MapReduce clusters as a Hadoop job (no installation needed!) » For experimenting; see talk by Ahir
  • 17. Monitoring and metrics (0.8) Better support for large # of tasks (0.8.1) High availability for standalone mode (0.8.1) External hashing & sorting (0.9) Ease of Use Long-term: remove need to tune beyond defaults
  • 18. Next Releases Spark 0.8.1 (this month) » YARN 2.2, standalone mode HA, optimized shuffle, broadcast & result fetching Spark 0.9 (Jan 2014) » Scala 2.10 support, configuration system, Spark Streaming improvements
  • 19. What Makes Spark Unique?
  • 20. Big Data Systems Today MapReduce Pregel Dremel GraphLab Storm Giraph Drill Tez Impala S4 … Specialized systems (iterative, interactive and" streaming apps) General batch" processing
  • 21. Spark’s Approach Instead of specializing, generalize MapReduce" to support new apps in same engine Two changes (general task DAG & data sharing) are enough to express previous models! Unification has big benefits » For the engine » For users Spark Streaming GraphX … Shark MLbase
  • 23. Code Size 0 20000 40000 60000 80000 100000 120000 140000 Hadoop MapReduce Storm (Streaming) Impala (SQL) Giraph (Graph) Spark non-test, non-example source lines Streaming
  • 24. Code Size 0 20000 40000 60000 80000 100000 120000 140000 Hadoop MapReduce Storm (Streaming) Impala (SQL) Giraph (Graph) Spark non-test, non-example source lines Streaming Shark* * also calls into Hive
  • 25. Code Size 0 20000 40000 60000 80000 100000 120000 140000 Hadoop MapReduce Storm (Streaming) Impala (SQL) Giraph (Graph) Spark non-test, non-example source lines Streaming GraphX Shark* * also calls into Hive
  • 27. What it Means for Users Separate frameworks: … HDFS read HDFS write ETL HDFS read HDFS write train HDFS read HDFS write query HDFS HDFS read ETL train query Spark: Interactive" analysis
  • 28. Combining Processing Types val  points  =  sc.runSql[Double,  Double](      “select  latitude,  longitude  from  historic_tweets”)     val  model  =  KMeans.train(points,  10)     sc.twitterStream(...)      .map(t  =>  (model.closestCenter(t.location),  1))      .reduceByWindow(“5s”,  _  +  _)   From Scala:
  • 29. Combining Processing Types GENERATE  KMeans(tweet_locations)  AS  TABLE  tweet_clusters       //  Scala  table  generating  function  (TGF):   object  KMeans  {      @Schema(spec  =  “x  double,  y  double,  cluster  int”)      def  apply(points:  RDD[(Double,  Double)])  =  {          ...      }   }     From SQL (in Shark 0.8.1):
  • 30. Conclusion Next challenge in big data will be complex and low-latency applications Spark offers a unified engine to tackle and combine these apps Best strength is the community: enjoy Spark Summit!
  • 31. Contributors Aaron Davidson Alexander Pivovarov Ali Ghodsi Ameet Talwalkar Andre Shumacher Andrew Ash Andrew Psaltis Andrew Xia Andrey Kouznetsov Andy Feng Andy Konwinski Ankur Dave Antonio Lupher Benjamin Hindman Bill Zhao Charles Reiss Chris Mattmann Christoph Grothaus Christopher Nguyen Chu Tong Cliff Engle Cody Koeninger David McCauley Denny Britz Dmitriy Lyubimov Edison Tung Eric Zhang Erik van Oosten Ethan Jewett Evan Chan Evan Sparks Ewen Cheslack-Postava Fabrizio Milo Fernand Pajot Frank Dai Gavin Li Ginger Smith Giovanni Delussu Grace Huang Haitao Yao Haoyuan Li Harold Lim Harvey Feng Henry Milner Henry Saputra Hiral Patel Holden Karau Ian Buss Imran Rashid Ismael Juma James Phillpotts Jason Dai Jerry Shao Jey Kottalam Joseph E. Gonzalez Josh Rosen Justin Ma Kalpit Shah Karen Feng Karthik Tunga Kay Ousterhout Kody Koeniger Konstantin Boudnik Lee Moon Soo Lian Cheng Liang-Chi Hsieh Marc Mercer Marek Kolodziej Mark Hamstra Matei Zaharia Matthew Taylor Michael Heuer Mike Potts Mikhail Bautin Mingfei Shi Mosharaf Chowdhury Mridul Muralidharan Nathan Howell Neal Wiggins Nick Pentreath Olivier Grisel Patrick Wendell Paul Cavallaro Paul Ruan Peter Sankauskas Pierre Borckmans Prabeesh K. Prashant Sharma Ram Sriharsha Ravi Pandya Ray Racine Reynold Xin Richard Benkovsky Richard McKinley Rohit Rai Roman Tkalenko Ryan LeCompte S. Kumar Sean McNamara Shane Huang Shivaram Venkataraman Stephen Haberman Tathagata Das Thomas Dudziak Thomas Graves Timothy Hunter Tyson Hamilton Vadim Chekan Wu Zeming Xinghao Pan