Anatomy of Data Frame API
A deep dive into the Spark Data Frame API
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Spark SQL library
● Dataframe abstraction
● Pig/Hive pipeline vs Spark SQL
● Logical plan
● Optimizer
● Different steps in Query analysis
Spark SQL library
● Data source API: a universal API for loading/saving structured data
● DataFrame API: a higher-level representation for structured data
● SQL interpreter and optimizer: express data transformations in SQL
● SQL service: Hive thrift server
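As a quick illustration of the data source API's uniform load/save path, here is a minimal sketch; it assumes Spark 1.4-era DataFrameReader/Writer APIs and hypothetical input/output paths:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataSourceRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("datasource").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // load structured data through the universal data source API
    val df = sqlContext.read.format("json").load("input.json")

    // save it back in another format through the same API
    df.write.format("parquet").save("output.parquet")
  }
}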
Architecture of Spark SQL (layered, bottom to top):
● Data sources: CSV, JSON, JDBC
● Data Source API
● Data Frame API
● Dataframe DSL | Spark SQL and HQL
DataFrame API
● Single abstraction for representing structured data in Spark
● DataFrame = RDD + Schema (aka SchemaRDD)
● All data source APIs return DataFrames
● Introduced in 1.3
● Inspired by data frames in R and Python pandas
● .rdd converts back to the underlying RDD representation, an RDD[Row] (see the sketch below)
● Support for a DataFrame DSL in Spark
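A minimal sketch of the RDD + Schema view, reusing the sc/sqlContext setup from the previous sketch; sales.json is a hypothetical input:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = sqlContext.read.json("sales.json") // data source APIs return a DataFrame
df.printSchema()                            // the Schema half of RDD + Schema

val rows: RDD[Row] = df.rdd                 // drop back to the plain RDD[Row] representation
println(rows.first())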
Need for new abstraction
● Single abstraction for structured data
○ Ability to combine data from multiple sources
○ Uniform access from all the different language APIs
○ Ability to support multiple DSLs
● Familiar interface for data scientists
○ Same API as R/pandas
○ Easy to convert from an R local data frame to Spark
○ The new SparkR in 1.4 is built around it
Data structure of the structured world
● A Data Frame is a data structure to represent structured data, whereas an RDD is a data structure for unstructured data
● Having a single data structure allows us to build multiple DSLs targeting different developers
● All DSLs use the same optimizer and code generator underneath
● Compare this with Hadoop Pig and Hive
Pig and Hive pipeline
● Hive: Hive queries (HiveQL) → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
● Pig: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
Issues with the Pig and Hive flow
● Pig and Hive share many similar steps but are independent of each other
● Each project implements its own optimizer and executor, which prevents them from benefiting from each other's work
● There is no common data structure on which we can build both the Pig and Hive dialects
● Neither optimizer is flexible enough to accommodate multiple DSLs
● Lots of duplicated effort and poor interoperability
Spark SQL pipeline
● Hive queries (HiveQL) → Hive parser → DataFrame
● Spark SQL queries (SparkQL) → SparkSQL parser → DataFrame
● Dataframe DSL → DataFrame
● DataFrame → Catalyst → Spark RDD code
Spark SQL flow
● Multiple DSLs share the same optimizer and executor
● All DSLs ultimately generate Dataframes
● Catalyst is a new optimizer, built from the ground up for Spark, structured as a rule-based framework
● Catalyst allows developers to plug in custom rules specific to their DSL
● You can plug in your own DSL too!!
What is a data frame?
● A data frame is a container for a Logical Plan
● A Logical Plan is a tree which represents the data and its schema
● Every transformation is represented as a tree manipulation
● These trees are manipulated and optimized by Catalyst rules
● The logical plan is converted to a physical plan for execution
Explain Command
● The explain command on a dataframe allows us to look at these plans
● There are three types of logical plans:
○ Parsed logical plan
○ Analysed logical plan
○ Optimized logical plan
● Explain also shows the physical plan
● Example: DataFrameExample.scala (sketched below)
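DataFrameExample.scala is not reproduced here; a hedged sketch of what it likely demonstrates, reusing the sqlContext from the earlier sketches:

val df = sqlContext.read.json("sales.json")

// explain(true) prints the parsed, analysed and optimized logical plans
// followed by the physical plan
df.explain(true)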
Filter example
● In the last example, all the plans looked the same as there were no dataframe operations
● In this example, we are going to apply two filters on the data frame
● Observe the generated optimized plan
● Example: FilterExampleTree.scala (sketched below)
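A hedged sketch in the spirit of FilterExampleTree.scala; the column names c1 and c2 match the plan trees on the following slides, but the input file is an assumption:

val df = sqlContext.read.json("data.json")
val filtered = df.filter("c1 != 0").filter("c2 != 0")

// the optimized plan collapses the two filters into one &&-ed filter
filtered.explain(true)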
Optimized Plan
● The optimized plan is where Spark plugs in its set of optimization rules
● In our example, when multiple filters are added, Spark &&s them together for better performance
● Developers can even plug their own rules into the optimizer
Accessing Plan trees
● Every dataframe carries a queryExecution object which allows us to access these plans individually
● We can access the plans as follows (see the sketch below):
○ Parsed plan - queryExecution.logical
○ Analysed plan - queryExecution.analyzed
○ Optimized plan - queryExecution.optimizedPlan
● numberedTreeString on a plan allows us to see the hierarchy
● Example: FilterExampleTree.scala
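A short sketch of those accessors on the filtered dataframe from the previous sketch (queryExecution is a developer API, so the exact output varies by Spark version):

val qe = filtered.queryExecution
println(qe.logical.numberedTreeString)       // parsed plan
println(qe.analyzed.numberedTreeString)      // analysed plan
println(qe.optimizedPlan.numberedTreeString) // optimized plan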
Filter tree representation

Before optimization (two separate filters):
00 Filter NOT (CAST(c2#1, DoubleType) = CAST(0, DoubleType))
01 Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType))
02 LogicalRDD [c1#0,c2#1,c3#2,c4#3]

After optimization (filters combined with &&):
Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0))
LogicalRDD [c1#0,c2#1,c3#2,c4#3]
Manipulating Trees
● Every optimization in Spark SQL is implemented as a tree (logical plan) transformation
● A series of these transformations makes for a modular optimizer
● All tree manipulations are done using Scala case classes
● As developers, we can write these manipulations too
● Let's create an OR filter rather than an AND
● OrFilter.scala (sketched below)
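OrFilter.scala is not reproduced here; below is a hedged sketch of such a rule, written against Spark 1.x Catalyst internals. The object name and the manual application at the end are assumptions:

import org.apache.spark.sql.catalyst.expressions.{And, Or}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object OrFilterRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // pattern match on the Filter case class and rebuild its condition,
    // turning an AND of two predicates into an OR
    case Filter(And(left, right), child) => Filter(Or(left, right), child)
  }
}

// applying the rule by hand to an analysed plan:
// val rewritten = OrFilterRule(df.queryExecution.analyzed)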
Understanding steps in plan
● A logical plan goes through a series of rules that resolve and optimize it
● Each step is a tree manipulation like the ones we have seen before
● We can apply the rules one by one to see how a given plan evolves over time
● This understanding shows us how to tweak a given query for better performance
● Ex: StepsInQueryPlanning.scala
Query

select a.customerId from (
  select customerId, amountPaid as amount
  from sales
  where 1 = '1') a
where amount = 500.0
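A hedged sketch of how StepsInQueryPlanning.scala might issue this query, assuming the sales data is registered as a temp table first:

val sales = sqlContext.read.json("sales.json")
sales.registerTempTable("sales")

val result = sqlContext.sql(
  """select a.customerId from (
    |  select customerId, amountPaid as amount
    |  from sales where 1 = '1') a
    |where amount = 500.0""".stripMargin)

result.explain(true) // watch the plans evolve as described in the next slides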
Parsed Plan
● This is the plan generated after parsing the DSL
● Normally these plans are generated by the specific parsers, like the HiveQL parser, the Dataframe DSL parser etc.
● Usually they recognize the different transformations and represent them as tree nodes
● It's a straightforward translation without much tweaking
● This is fed to the analyser to generate the analysed plan
Parsed Logical Plan (root first):

`Project a.customerId
  `Filter (amount = 500)
    `SubQuery a
      `Projection 'customerId,'amountPaid
        `Filter (1 = 1)
          UnResolvedRelation Sales
Analyzed plan
● We use sqlContext.analyzer to access the rules that generate the analyzed plan
● These rules have to be run in sequence to resolve the different entities in the logical plan
● The different entities to be resolved are:
○ Relations (aka tables)
○ References, e.g. subqueries, aliases etc.
○ Data type casting
ResolveRelations Rule
● This rule resolves all the relations (tables) specified in the plan
● Whenever it finds a new unresolved relation, it consults the catalog, aka the registerTempTable list
● Once it finds the relation, it replaces it with the actual relation
Resolved Relation Logical Plan

Before (parsed plan):
`Project a.customerId
  `Filter (amount = 500)
    `SubQuery a
      `Projection 'customerId,'amountPaid
        `Filter (1 = 1)
          UnResolvedRelation Sales

After (relation resolved):
`Project a.customerId
  `Filter (amount = 500)
    `SubQuery a
      `Projection 'customerId,'amountPaid
        Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid..]
ResolveReferences
● This rule resolves all the references in the plan
● All aliases and column names get a unique number, which allows later rules to locate them irrespective of their position
● This unique numbering allows subqueries to be removed for better optimization
Resolved References Plan

Before (relation resolved):
`Project a.customerId
  `Filter (amount = 500)
    `SubQuery a
      `Projection 'customerId,'amountPaid
        `Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid..]

After (references resolved):
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        `Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]
PromoteString
● This rule allows the analyser to promote strings to the right data types
● In our query's Filter (1 = '1'), we are comparing a double with a string
● This rule puts a cast from string to double in place to get the right semantics
Promote String Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        `Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]

After (string promoted to double):
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        `Filter (1 = CAST(1, DoubleType))
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]
Optimize
Eliminate Subqueries
● This rule allows the analyser to eliminate superfluous subqueries
● This is possible as we now have a unique identifier for each of the references
● Removing subqueries allows us to do advanced optimizations in subsequent steps
Eliminate Subqueries Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        `Filter (1 = CAST(1, DoubleType))
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]

After (subqueries eliminated):
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      `Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]
Constant Folding
● Simplifies expressions which result in constant values
● In our plan, Filter (1 = 1) always results in true
● So constant folding replaces it with true
Constant Folding Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      `Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]

After (constant folded):
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      `Filter True
        JsonRelation Sales[amountPaid#0..]
Simplify Filters
● This rule simplifies filters by:
○ Removing always-true filters
○ Removing the entire plan subtree if the filter is always false
● In our query, the true filter will be removed
● By simplifying filters, we can avoid multiple iterations over the data
Simplify Filter Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      `Filter True
        JsonRelation Sales[amountPaid#0..]

After (true filter removed):
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      JsonRelation Sales[amountPaid#0..]
PushPredicateThroughFilter
● It's always good to have filters close to the data source for better optimizations
● This rule pushes the filters down, next to the JsonRelation
● When we rearrange the tree nodes, we need to make sure the predicate is rewritten to match the aliases
● In our example, the filter is rewritten to use the alias amountPaid rather than amount
PushPredicateThroughFilter Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      JsonRelation Sales[amountPaid#0..]

After (predicate pushed down, alias rewritten):
Project customerId#1L
  Projection customerId#1L,amountPaid#0
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]
Project Collapsing
● Removes unnecessary projects from the plan
● In our plan, we don't need the second projection (customerId, amountPaid) as we only require one projection (customerId)
● So we can get rid of the second projection
● This gives us the most optimized plan
Project Collapsing Plan

Before:
Project customerId#1L
  Projection customerId#1L,amountPaid#0
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]

After (projects collapsed):
Project customerId#1L
  Filter (amountPaid#0 = 500)
    JsonRelation Sales[amountPaid#0..]
Generating Physical Plan
● Catalyst can take a logical plan and turn it into a physical plan, aka a Spark plan
● On queryExecution, we have a plan called executedPlan which gives us the physical plan (see the sketch below)
● On the physical plan, we can call executeCollect or executeTake to start evaluating the plan
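A closing sketch of those calls on the query from earlier, against Spark 1.x developer APIs; exact return types vary across versions:

val qe = result.queryExecution
val physicalPlan = qe.executedPlan        // the Spark plan chosen for execution
println(physicalPlan.numberedTreeString)

// start evaluating the plan directly
physicalPlan.executeTake(10).foreach(println)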
References
● https://www.youtube.com/watch?v=GQSNJAzxOr8
● https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
● http://spark.apache.org/sql/