
Anatomy of Data Frame API: A deep dive into Spark Data Frame API


In this presentation, we discuss the internals of the Spark DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api


  1. Anatomy of Data Frame API: A deep dive into the Spark Data Frame API. https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
  2. ● Madhukara Phatak ● Big data consultant and trainer at datamantra.io ● Consults in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  3. Agenda ● Spark SQL library ● Dataframe abstraction ● Pig/Hive pipeline vs Spark SQL ● Logical plan ● Optimizer ● Different steps in query analysis
  4. Spark SQL library ● Data source API: universal API for loading/saving structured data ● DataFrame API: higher-level representation for structured data ● SQL interpreter and optimizer: express data transformations in SQL ● SQL service: Hive thrift server
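A minimal sketch of the data source API's uniform load/save, assuming a Spark 1.4-style SQLContext and hypothetical file paths:

    // load structured data through the data source API; the format name
    // selects the connector, and the result is always a DataFrame
    val df = sqlContext.read.format("json").load("sales.json")

    // the same uniform API saves the DataFrame in another format
    df.write.format("parquet").save("sales.parquet")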
  5. Architecture of Spark SQL: data sources (CSV, JSON, JDBC) plug into the Data Source API, which feeds the Data Frame API; on top of that sit Spark SQL/HQL and the Dataframe DSL
  6. DataFrame API ● Single abstraction for representing structured data in Spark ● DataFrame = RDD + Schema (aka SchemaRDD) ● All data source APIs return a DataFrame ● Introduced in 1.3 ● Inspired by R and Python Pandas ● .rdd converts a DataFrame to its RDD representation, resulting in RDD[Row] ● Support for a DataFrame DSL in Spark
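A minimal sketch of dropping from a DataFrame to its RDD[Row] representation, assuming an existing SparkContext sc and a hypothetical sales.json input:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}

    val sqlContext = new SQLContext(sc)

    // the data source API returns a DataFrame with an inferred schema
    val df = sqlContext.jsonFile("sales.json")   // 1.3-style; read.json from 1.4

    // .rdd exposes the underlying RDD representation
    val rows: RDD[Row] = df.rdd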
  7. Need for a new abstraction ● Single abstraction for structured data ○ Ability to combine data from multiple sources ○ Uniform access from all the different language APIs ○ Ability to support multiple DSLs ● Familiar interface for data scientists ○ Same API as R/Pandas ○ Easy to convert from an R local data frame to Spark ○ The new 1.4 SparkR is built around it
  8. Data structure of the structured world ● A Data Frame is a data structure for representing structured data, whereas an RDD is a data structure for unstructured data ● Having a single data structure allows us to build multiple DSLs targeting different developers ● All DSLs use the same optimizer and code generator underneath ● Compare with Hadoop's Pig and Hive
  9. Pig and Hive pipeline ● Hive: Hive queries (HiveQL) → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan ● Pig: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
  10. Issues with the Pig and Hive flow ● Pig and Hive share a lot of similar steps but are independent of each other ● Each project implements its own optimizer and executor, which prevents them from benefiting from each other's work ● There is no common data structure on which we can build both the Pig and Hive dialects ● The optimizer is not flexible enough to accommodate multiple DSLs ● Lots of duplicated effort and poor interoperability
  11. Spark SQL pipeline ● Hive queries (HiveQL) → Hive parser → DataFrame ● Spark SQL queries (SparkQL) → SparkSQL parser → DataFrame ● Dataframe DSL → DataFrame ● DataFrame → Catalyst → Spark RDD code
  12. Spark SQL flow ● Multiple DSLs share the same optimizer and executor ● All DSLs ultimately generate Dataframes ● Catalyst is a new optimizer built from the ground up for Spark; it is a rule-based framework ● Catalyst allows developers to plug in custom rules specific to their DSL ● You can plug in your own DSL too!
  13. What is a data frame? ● A data frame is a container for a Logical Plan ● A Logical Plan is a tree which represents data and schema ● Every transformation is represented as a tree manipulation ● These trees are manipulated and optimized by Catalyst rules ● The logical plan is converted to a physical plan for execution
  14. Explain command ● The explain command on a dataframe allows us to look at these plans ● There are three types of logical plans ○ Parsed logical plan ○ Analyzed logical plan ○ Optimized logical plan ● Explain also shows the physical plan ● Example: DataFrameExample.scala
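A quick sketch of invoking explain on any DataFrame df:

    // extended = true prints the parsed, analyzed and optimized logical
    // plans followed by the physical plan; df.explain() alone prints
    // only the physical plan
    df.explain(true)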
  15. Filter example ● In the last example, all the plans looked the same as there were no dataframe operations ● In this example, we are going to apply two filters on the data frame ● Observe the generated optimized plan ● Example: FilterExampleTree.scala
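A minimal sketch of the two-filter example, with hypothetical columns c1 and c2, in the spirit of FilterExampleTree.scala:

    // two separate filter calls on the same DataFrame; watch how the
    // optimizer merges them into one Filter node with an && condition
    val filtered = df.filter(df("c1") !== 0).filter(df("c2") !== 0)
    filtered.explain(true)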
  16. Optimized plan ● The optimized plan is the result of Spark applying its set of optimization rules ● In our example, when multiple filters are added, Spark combines them with && for better performance ● Developers can even plug their own rules into the optimizer
  17. Accessing plan trees ● Every dataframe has a queryExecution object attached, which allows us to access these plans individually ● We can access the plans as follows ○ Parsed plan: queryExecution.logical ○ Analyzed: queryExecution.analyzed ○ Optimized: queryExecution.optimizedPlan ● numberedTreeString on a plan allows us to see the hierarchy ● Example: FilterExampleTree.scala
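A sketch of walking these plans for any DataFrame df:

    val qe = df.queryExecution

    // each stage of the plan, printed with numbered tree nodes
    println(qe.logical.numberedTreeString)        // parsed plan
    println(qe.analyzed.numberedTreeString)       // analyzed plan
    println(qe.optimizedPlan.numberedTreeString)  // optimized plan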
  18. Filter tree representation ● Before optimization: 00 Filter NOT (CAST(c2#1, DoubleType) = CAST(0, DoubleType)) → 01 Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType)) → 02 LogicalRDD [c1#0,c2#1,c3#2,c4#3] ● After optimization: Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0)) → LogicalRDD [c1#0,c2#1,c3#2,c4#3]
  19. Manipulating trees ● Every optimization in Spark SQL is implemented as a tree transformation ● A series of these transformations gives a modular optimizer ● All tree manipulations are done using Scala case classes ● As developers, we can write these manipulations too ● Let's create an OR filter rather than an AND ● Example: OrFilter.scala
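A minimal sketch of such a manipulation using Catalyst's internal classes, in the spirit of OrFilter.scala (these are internal APIs, so they may change between versions):

    import org.apache.spark.sql.catalyst.expressions.{And, Or}
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

    // transform walks the tree and rewrites every Filter whose condition
    // is an And into the same Filter with an Or
    val orPlan: LogicalPlan = df.queryExecution.optimizedPlan transform {
      case Filter(And(left, right), child) => Filter(Or(left, right), child)
    }

    println(orPlan.numberedTreeString)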
  20. Understanding the steps in a plan ● The logical plan goes through a series of rules that resolve and optimize it ● Each step is a tree manipulation like the ones we have seen before ● We can apply the rules one by one to see how a given plan evolves over time ● This understanding allows us to tweak a given query for better performance ● Example: StepsInQueryPlanning.scala
  21. Query select a.customerId from (select customerId, amountPaid as amount from sales where 1 = '1') a where amount = 500.0
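A sketch of registering the relation and running this query, assuming the sales data has already been loaded into a DataFrame df:

    // make the DataFrame visible to SQL under the relation name "sales"
    df.registerTempTable("sales")

    val result = sqlContext.sql(
      """select a.customerId from
        |  (select customerId, amountPaid as amount
        |   from sales where 1 = '1') a
        |where amount = 500.0""".stripMargin)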
  22. Parsed plan ● This is the plan generated after parsing the DSL ● Normally these plans are generated by the specific parsers, like the HiveQL parser, the Dataframe DSL parser, etc. ● They usually recognize the different transformations and represent them as tree nodes ● It's a straightforward translation without much tweaking ● This plan is fed to the analyzer to generate the analyzed plan
  23. Parsed logical plan (root first): `Project a.customerId → `Filter (amount = 500) → `SubQuery a → `Projection 'customerId,'amountPaid → `Filter (1 = 1) → UnResolvedRelation Sales
  24. Analyzed plan ● We use sqlContext.analyzer to access the rules that generate the analyzed plan ● These rules have to be run in sequence to resolve the different entities in the logical plan ● The entities to be resolved are ○ Relations (aka tables) ○ References, e.g. subqueries, aliases, etc. ○ Data type casting
  25. ResolveRelations rule ● This rule resolves all the relations (tables) specified in the plan ● Whenever it finds a new unresolved relation, it consults the catalog (the registerTempTable list) ● Once it finds the relation, it replaces the unresolved relation with the actual one
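The same catalog lookup can be done by hand; a sketch assuming "sales" was registered as above:

    // fetch the relation registered under the name "sales" from the
    // catalog, which is what ResolveRelations does during analysis
    val salesRelation = sqlContext.table("sales")
    println(salesRelation.queryExecution.analyzed.numberedTreeString)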
  26. Resolved relation logical plan ● Before: `Project a.customerId → `Filter (amount = 500) → `SubQuery a → `Projection 'customerId,'amountPaid → `Filter (1 = 1) → UnResolvedRelation Sales ● After: `Project a.customerId → `Filter (amount = 500) → `SubQuery a → `Projection 'customerId,'amountPaid → `Filter (1 = 1) → SubQuery sales → JsonRelation Sales[amountPaid..]
  27. ResolveReferences ● This rule resolves all the references in the plan ● All aliases and column names get a unique number, which allows the analyzer to locate them irrespective of their position ● This unique numbering later allows subqueries to be removed for better optimization
  28. Resolved references plan ● Before: `Project a.customerId → `Filter (amount = 500) → `SubQuery a → `Projection 'customerId,'amountPaid → `Filter (1 = 1) → SubQuery sales → JsonRelation Sales[amountPaid..] ● After: Project customerId#1L → Filter (amount#4 = 500) → SubQuery a → Projection customerId#1L,amountPaid#0 → `Filter (1 = 1) → SubQuery sales → JsonRelation Sales[amountPaid#0..]
  29. PromoteStrings ● This rule allows the analyzer to promote strings to the right data types ● In our query, Filter(1 = '1'), we are comparing a double with a string ● This rule inserts a cast from string to double to get the right semantics
  30. Promote string plan ● Before: Project customerId#1L → Filter (amount#4 = 500) → SubQuery a → Projection customerId#1L,amountPaid#0 → `Filter (1 = 1) → SubQuery sales → JsonRelation Sales[amountPaid#0..] ● After: Project customerId#1L → Filter (amount#4 = 500) → SubQuery a → Projection customerId#1L,amountPaid#0 → `Filter (1 = CAST(1, DoubleType)) → SubQuery sales → JsonRelation Sales[amountPaid#0..]
  31. Optimize
  32. Eliminate subqueries ● This rule allows the optimizer to eliminate superfluous subqueries ● This is possible because we now have a unique identifier for each of the references ● Removing subqueries allows us to do advanced optimizations in subsequent steps
  33. Eliminate subqueries plan ● Before: Project customerId#1L → Filter (amount#4 = 500) → SubQuery a → Projection customerId#1L,amountPaid#0 → `Filter (1 = CAST(1, DoubleType)) → SubQuery sales → JsonRelation Sales[amountPaid#0..] ● After: Project customerId#1L → Filter (amount#4 = 500) → Projection customerId#1L,amountPaid#0 → `Filter (1 = CAST(1, DoubleType)) → JsonRelation Sales[amountPaid#0..]
  34. Constant folding ● Simplifies expressions which result in constant values ● In our plan, Filter(1 = 1) always evaluates to true ● So constant folding replaces it with true
  35. Constant folding plan ● Before: Project customerId#1L → Filter (amount#4 = 500) → Projection customerId#1L,amountPaid#0 → `Filter (1 = CAST(1, DoubleType)) → JsonRelation Sales[amountPaid#0..] ● After: Project customerId#1L → Filter (amount#4 = 500) → Projection customerId#1L,amountPaid#0 → `Filter true → JsonRelation Sales[amountPaid#0..]
  36. Simplify filters ● This rule simplifies filters by ○ Removing always-true filters ○ Removing the entire plan subtree if the filter is always false ● In our query, the true filter will be removed ● By simplifying filters, we avoid multiple iterations over the data
  37. Simplify filter plan ● Before: Project customerId#1L → Filter (amount#4 = 500) → Projection customerId#1L,amountPaid#0 → `Filter true → JsonRelation Sales[amountPaid#0..] ● After: Project customerId#1L → Filter (amount#4 = 500) → Projection customerId#1L,amountPaid#0 → JsonRelation Sales[amountPaid#0..]
  38. PushPredicateThroughProject ● It's always good to have filters close to the data source for better optimization ● This rule pushes the filter down, next to the JsonRelation ● When we rearrange the tree nodes, we need to rewrite the predicate to match the aliases ● In our example, the filter is rewritten to use the alias amountPaid rather than amount
  39. PushPredicateThroughProject plan ● Before: Project customerId#1L → Filter (amount#4 = 500) → Projection customerId#1L,amountPaid#0 → JsonRelation Sales[amountPaid#0..] ● After: Project customerId#1L → Projection customerId#1L,amountPaid#0 → Filter (amountPaid#0 = 500) → JsonRelation Sales[amountPaid#0..]
  40. Project collapsing ● Removes unnecessary projections from the plan ● In our plan, we don't need the second projection (customerId, amountPaid), as we only require one projection, i.e. customerId ● So we can get rid of the second projection ● This gives us the most optimized plan
  41. Project collapsing plan ● Before: Project customerId#1L → Projection customerId#1L,amountPaid#0 → Filter (amountPaid#0 = 500) → JsonRelation Sales[amountPaid#0..] ● After: Project customerId#1L → Filter (amountPaid#0 = 500) → JsonRelation Sales[amountPaid#0..]
  42. Generating the physical plan ● Catalyst can take a logical plan and turn it into a physical plan, or Spark plan ● On queryExecution, we have a field called executedPlan which gives us the physical plan ● On the physical plan, we can call executeCollect or executeTake to start evaluating the plan
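A sketch of going all the way to the physical plan for any DataFrame df:

    // the physical (Spark) plan generated from the optimized logical plan
    val physical = df.queryExecution.executedPlan
    println(physical.numberedTreeString)

    // evaluate the plan directly; this is what collect()/take() use
    // under the hood
    val allRows  = physical.executeCollect()
    val firstRow = physical.executeTake(1)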
  43. References ● https://www.youtube.com/watch?v=GQSNJAzxOr8 ● https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html ● http://spark.apache.org/sql/
