SlideShare uma empresa Scribd logo
1 de 28
What’s New in Spark 2?
Eyal Ben Ivri
In just two words…
Not Much
In just OVER two words…
Not Much,
but let’s talk about it.
Let’s start from the beginning
• What is Apache Spark?
• An open source cluster computing framework.
• Originally developed at the University of California, Berkeley's AMPLab.
• Aimed and designed to be a Big Data computational framework.
Spark Components
Data Sources
Spark Core (Batch Processing, RDD, SparkContext)
SparkSQL
(DataFrame)
Spark
Streaming
Spark MLlib
(ML Pipelines)
Spark GraphX
Spark
Packages
Spark Components (v2)
Data Sources
Spark Core (Batch Processing, RDD, SparkContext)
Spark
Streaming
Spark MLlib
(ML Pipelines)
Spark GraphX
Spark
Packages
Spark SQL
(SparkSession,
DataSet)
Timeline
UC
Berkeley’s
AMPLab
(2009)
Open
Sourced
(2010)
Apache
Foundation
(2013)
Top-Level
Apache
Project (Feb
2014)
Version 1.0
(May 2014)
World
record in
large scale
sorting (Nov
214)
Version 1.6
(Jan 2016)
Version
2.0.0 (Jul
2016)
Version
2.0.1 (Oct
2016,
Current)
Version History (major changes)
1.0 –
SparkSQL
(formally Shark
project)
1.1 –
Streaming
support for
python
1.2 – Core
engine
improvements.
GraphX
graduates.
1.3 –
DataFrame.
Python engine
improvements.
1.4 – SparkR
1.5 – Bugs and
Performance
1.6 – Dataset
(experimental)
Spark 2.0.x
• Major notables:
• Scala 2.11
• SparkSession
• Performance
• API Stability
• SQL:2003 support
• Structured Streaming
• R UDFs and Mllib algorithms implementation
API
• Spark doesn't like API changes
• The good news:
• To migrate, you’ll have to perform
little to no changes to your code.
• The (not so) bad news:
• To benefit from all the
performance improvements, some
old code might need more
refactoring.
Programming API
• Unifying DataFrame and Dataset:
• Dataset[Row] = DataFrame
• SparkSession replaces
SqlContext/HiveContext
• Both kept for backwards compatibility.
• Simpler, more performant accumulator
API
• A new, improved Aggregator API for
typed aggregation in Datasets
SQL Language
• Improved SQL functionalities (SQL
2003 support)
• Can now run all 99 TPC-DS
queries
• The parser support ANSI-SQL as
well as HiveQL
• Subquery support
SparkSQL new features
• Native CSV data source (Based on
Databricks’ spark-csv package)
• Better off-heap memory
management
• Bucketing support (Hive
implementation)
• Performance performance
performance
Demo
Spark API
SparkSQL
Judges Ruling
Dataset was supposed to
be the future like a 6
months ago
SQL 2003 is so 2003
The API lives on!
SQL 2003 is cool
Structured Streaming (Alpha)
Structured Streaming is a scalable and
fault-tolerant stream processing engine
built on the Spark SQL engine.
You can express your streaming
computation the same way you would
express a batch computation on static
data.
Structured Streaming (cont.)
You can use the Dataset / DataFrame API in Scala, Java or Python to
express streaming aggregations, event-time windows, stream-to-
batch joins, etc.
The computation is executed on the same optimized Spark SQL
engine.
Exactly-once fault-tolerance guarantees through checkpointing and
Write Ahead Logs.
Windowing streams
How many vehicles entered each toll booth every 5 minutes?
val windowedCounts = cars.groupBy(
window($"timestamp", ”5 minutes", ”5 minutes"),
$”car"
).count()
Demo Structured Streaming
Judges Ruling
Still a micro-batching process SparkSQL is the future
Tungsten Project - Phase 2
Performance improvements
Heap Memory management
Connectors optimizations
Let’s Pop the hood
Project Tungsten
Bring Spark performance closer to the bare metal, through:
• Native memory Management
• Runtime code generation
Started @ Version 1.4
The cornerstone that enabled the Catalyst engine
Project Tungsten - Phase 2
Whole stage code generation
a. A technique that blends state-of-the-art from modern compilers and MPP databases.
b. Gives a performance boost of up to x9 faster
c. Emit optimized bytecode at runtime that collapses the entire query into a single function
d. Eliminating virtual function calls and leveraging CPU registers for intermediate data
Project Tungsten - Phase 2
Optimized input / output
a. Caching for Dataframes is based on Parquet
b. Faster Parquet reader
c. Google Gueva is OUT
d. Smarter HadoopFS connector
you have to be running on DataFrame / Dataset
Overall Judges Ruling
I Want to complain but i don’t know
what about!!
Internal performance improvements
aside, this feels more like Spark 1.7
I like flink...
All is good
SparkSQL is for sure the future of
spark
The competition has done well
for Spark
Thank you (Questions?)
Long live Spark (and Flink)
Eyal Ben Ivri
https://github.com/eyalbenivri/spark2demo

Mais conteúdo relacionado

Mais procurados

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1
AjayRawat971036
 

Mais procurados (20)

Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
 Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay ListingsScalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
 
Spark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein Falaki
 
Apache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source EcosystemApache Spark Usage in the Open Source Ecosystem
Apache Spark Usage in the Open Source Ecosystem
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Optimizing Apache Spark UDFs
Optimizing Apache Spark UDFsOptimizing Apache Spark UDFs
Optimizing Apache Spark UDFs
 

Destaque

Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Chris Fregly
 

Destaque (20)

Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Sparkstreaming
SparkstreamingSparkstreaming
Sparkstreaming
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
Scala training workshop 02
Scala training workshop 02Scala training workshop 02
Scala training workshop 02
 
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
IndexedRDD: Efficeint Fine-Grained Updates for RDD's-(Ankur Dave, UC Berkeley)
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
(ARC301) Scaling Up to Your First 10 Million Users
(ARC301) Scaling Up to Your First 10 Million Users(ARC301) Scaling Up to Your First 10 Million Users
(ARC301) Scaling Up to Your First 10 Million Users
 
Electronic governance steps in the right direction?
Electronic governance   steps in the right direction?Electronic governance   steps in the right direction?
Electronic governance steps in the right direction?
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
 
Scala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentationScala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentation
 
Scala - the good, the bad and the very ugly
Scala - the good, the bad and the very uglyScala - the good, the bad and the very ugly
Scala - the good, the bad and the very ugly
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
AWS re:Invent 2016: Learn How FINRA Aligns Billions of Time Ordered Events wi...
AWS re:Invent 2016: Learn How FINRA Aligns Billions of Time Ordered Events wi...AWS re:Invent 2016: Learn How FINRA Aligns Billions of Time Ordered Events wi...
AWS re:Invent 2016: Learn How FINRA Aligns Billions of Time Ordered Events wi...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 

Semelhante a What's New in Spark 2?

Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 

Semelhante a What's New in Spark 2? (20)

Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?
 
Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...
Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...
Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...
 
PowerStream Demo
PowerStream DemoPowerStream Demo
PowerStream Demo
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

What's New in Spark 2?

  • 1. What’s New in Spark 2? Eyal Ben Ivri
  • 2. In just two words… Not Much
  • 3. In just OVER two words… Not Much, but let’s talk about it.
  • 4. Let’s start from the beginning • What is Apache Spark? • An open source cluster computing framework. • Originally developed at the University of California, Berkeley's AMPLab. • Aimed and designed to be a Big Data computational framework.
  • 5. Spark Components Data Sources Spark Core (Batch Processing, RDD, SparkContext) SparkSQL (DataFrame) Spark Streaming Spark MLlib (ML Pipelines) Spark GraphX Spark Packages
  • 6. Spark Components (v2) Data Sources Spark Core (Batch Processing, RDD, SparkContext) Spark Streaming Spark MLlib (ML Pipelines) Spark GraphX Spark Packages Spark SQL (SparkSession, DataSet)
  • 7. Timeline UC Berkeley’s AMPLab (2009) Open Sourced (2010) Apache Foundation (2013) Top-Level Apache Project (Feb 2014) Version 1.0 (May 2014) World record in large scale sorting (Nov 214) Version 1.6 (Jan 2016) Version 2.0.0 (Jul 2016) Version 2.0.1 (Oct 2016, Current)
  • 8. Version History (major changes) 1.0 – SparkSQL (formally Shark project) 1.1 – Streaming support for python 1.2 – Core engine improvements. GraphX graduates. 1.3 – DataFrame. Python engine improvements. 1.4 – SparkR 1.5 – Bugs and Performance 1.6 – Dataset (experimental)
  • 9. Spark 2.0.x • Major notables: • Scala 2.11 • SparkSession • Performance • API Stability • SQL:2003 support • Structured Streaming • R UDFs and Mllib algorithms implementation
  • 10. API • Spark doesn't like API changes • The good news: • To migrate, you’ll have to perform little to no changes to your code. • The (not so) bad news: • To benefit from all the performance improvements, some old code might need more refactoring.
  • 11. Programming API • Unifying DataFrame and Dataset: • Dataset[Row] = DataFrame • SparkSession replaces SqlContext/HiveContext • Both kept for backwards compatibility. • Simpler, more performant accumulator API • A new, improved Aggregator API for typed aggregation in Datasets
  • 12. SQL Language • Improved SQL functionalities (SQL 2003 support) • Can now run all 99 TPC-DS queries • The parser support ANSI-SQL as well as HiveQL • Subquery support
  • 13. SparkSQL new features • Native CSV data source (Based on Databricks’ spark-csv package) • Better off-heap memory management • Bucketing support (Hive implementation) • Performance performance performance
  • 15. Judges Ruling Dataset was supposed to be the future like a 6 months ago SQL 2003 is so 2003 The API lives on! SQL 2003 is cool
  • 16. Structured Streaming (Alpha) Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.
  • 17. Structured Streaming (cont.) You can use the Dataset / DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to- batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.
  • 18. Windowing streams How many vehicles entered each toll booth every 5 minutes? val windowedCounts = cars.groupBy( window($"timestamp", ”5 minutes", ”5 minutes"), $”car" ).count()
  • 19.
  • 20.
  • 22. Judges Ruling Still a micro-batching process SparkSQL is the future
  • 23. Tungsten Project - Phase 2 Performance improvements Heap Memory management Connectors optimizations Let’s Pop the hood
  • 24. Project Tungsten Bring Spark performance closer to the bare metal, through: • Native memory Management • Runtime code generation Started @ Version 1.4 The cornerstone that enabled the Catalyst engine
  • 25. Project Tungsten - Phase 2 Whole stage code generation a. A technique that blends state-of-the-art from modern compilers and MPP databases. b. Gives a performance boost of up to x9 faster c. Emit optimized bytecode at runtime that collapses the entire query into a single function d. Eliminating virtual function calls and leveraging CPU registers for intermediate data
  • 26. Project Tungsten - Phase 2 Optimized input / output a. Caching for Dataframes is based on Parquet b. Faster Parquet reader c. Google Gueva is OUT d. Smarter HadoopFS connector you have to be running on DataFrame / Dataset
  • 27. Overall Judges Ruling I Want to complain but i don’t know what about!! Internal performance improvements aside, this feels more like Spark 1.7 I like flink... All is good SparkSQL is for sure the future of spark The competition has done well for Spark
  • 28. Thank you (Questions?) Long live Spark (and Flink) Eyal Ben Ivri https://github.com/eyalbenivri/spark2demo