Higher Order Functions
Herman van Hövell @westerflyer
2018-10-03, London Spark Summit EU 2018
2
About Me
- Software Engineer @Databricks
Amsterdam office
- Apache Spark Committer and
PMC member
- In a previous life Data Engineer &
Data Analyst.
3
Complex Data
Complex data types in Spark SQL
- Struct. For example: struct(a: Int, b: String)
- Array. For example: array(a: Int)
- Map. For example: map(key: String, value: Int)
This provides primitives to build tree-based data models
- High expressiveness. Often alleviates the need for ‘flat-earth’ multi-table designs.
- More natural data models that are closer to reality
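As a quick, hedged illustration (not from the slides; the column names are made up), all three complex types can be combined in a single DataFrame:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("complex-types").getOrCreate()
import spark.implicits._

// A tiny DataFrame that combines a struct, an array and a map column.
val df = Seq((1, "Alice", Seq(1, 2, 3), Map("visits" -> 10)))
  .toDF("id", "name", "scores", "stats")
  .select(
    struct($"id", $"name").as("user"),  // struct(a: Int, b: String)
    $"scores",                          // array(a: Int)
    $"stats"                            // map(key: String, value: Int)
  )

df.printSchema()  // prints a root with user: struct, scores: array<int>, stats: map<string,int>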
4
Complex Data - Tweet JSON
{
"created_at": "Wed Oct 03 11:41:57 +0000 2018",
"id_str": "994633657141813248",
"text": "Looky nested data #spark #sseu",
"display_text_range": [0, 140],
"user": {
"id_str": "12343453",
"screen_name": "Westerflyer"
},
"extended_tweet": {
"full_text": "Looky nested data #spark #sseu",
"display_text_range": [0, 249],
"entities": {
"hashtags": [{
"text": "spark",
"indices": [211, 225]
}, {
"text": "sseu",
"indices": [239, 249]
}]
}
}
}
adapted from: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html
root
|-- created_at: string (nullable = true)
|-- id_str: string (nullable = true)
|-- text: string (nullable = true)
|-- user: struct (nullable = true)
| |-- id_str: string (nullable = true)
| |-- screen_name: string (nullable = true)
|-- display_text_range: array (nullable = true)
| |-- element: long (containsNull = true)
|-- extended_tweet: struct (nullable = true)
| |-- full_text: string (nullable = true)
| |-- display_text_range: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- entities: struct (nullable = true)
| | |-- hashtags: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- indices: array (nullable = true)
| | | | | |-- element: long (containsNull = true)
| | | | |-- text: string (nullable = true)
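Loading such documents is straightforward; a minimal sketch, assuming the tweets live in a (hypothetical) line-delimited JSON file:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tweet-json").getOrCreate()

// Read the (hypothetical) file of tweet JSON documents, one object per line.
val tweets = spark.read.json("/tmp/tweets.json")

// Spark infers the nested struct/array schema shown above.
tweets.printSchema()

// Nested struct fields can be addressed with dot notation.
tweets.select("user.screen_name", "extended_tweet.entities.hashtags").show(truncate = false)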
5
Manipulating Complex Data
Structs are easy :)
Maps/Arrays not so much...
- Easy to read single values/retrieve keys
- Hard to transform or to summarize
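For example (a sketch against the tweets DataFrame from the previous snippet, assuming the same SparkSession), reading a struct field or a single array element is one expression, while transforming every element of an array is not:

import spark.implicits._

// Easy: struct fields and single array elements are one expression away.
tweets.select(
  $"user.screen_name",                     // struct field
  $"extended_tweet.display_text_range"(0)  // single array element
)

// Hard (before Spark 2.4): applying an operation to *every* element of an
// array, e.g. adding 1 to each value, needs explode/collect or a UDF;
// these are exactly the options discussed next.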
6
Transforming an Array
Let’s say we want to add 1 to every element of the vals field of
every row in an input table.
Input:
Id  Vals
1   [1, 2, 3]
2   [4, 5, 6]

Output:
Id  Vals
1   [2, 3, 4]
2   [5, 6, 7]
How would we do this?
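To follow along, here is a minimal sketch that creates an input_tbl like the one above (assuming nothing beyond a plain Spark shell):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("transform-array").getOrCreate()
import spark.implicits._

// The running example: an id plus an array of integers per row.
Seq(
  (1, Seq(1, 2, 3)),
  (2, Seq(4, 5, 6))
).toDF("id", "vals")
 .createOrReplaceTempView("input_tbl")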
7
Transforming an Array
Option 1 - Explode and Collect
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
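The same explode-and-collect approach in the DataFrame API looks roughly like this (a sketch, assuming the input_tbl defined earlier):

import org.apache.spark.sql.functions._
import spark.implicits._

val result = spark.table("input_tbl")
  .select($"id", explode($"vals").as("val"))  // 1. explode the array into rows
  .groupBy($"id")
  .agg(collect_list($"val" + 1).as("vals"))   // 2. add 1 and collect back into an array

result.show()  // id 1 -> [2, 3, 4], id 2 -> [5, 6, 7] (row order not guaranteed)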
8
Transforming an Array
Option 1 - Explode and Collect - Explode
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
1. Explode
9
Transforming an Array
Option 1 - Explode and Collect - Explode
Before explode:
Id  Vals
1   [1, 2, 3]
2   [4, 5, 6]

After explode:
Id  Val
1   1
1   2
1   3
2   4
2   5
2   6
10
Transforming an Array
Option 1 - Explode and Collect - Collect
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
1. Explode
2. Collect
11
Transforming an Array
Option 1 - Explode and Collect - Collect
Exploded values, with 1 added:
Id  Val
1   1 + 1
1   2 + 1
1   3 + 1
2   4 + 1
2   5 + 1
2   6 + 1

Collected back per id:
Id  Vals
1   [2, 3, 4]
2   [5, 6, 7]
12
Transforming an Array
Option 1 - Explode and Collect - Complexity
== Physical Plan ==
ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Exchange hashpartitioning(id, 200)
+- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Generate explode(vals), [id], false, [val]
+- FileScan parquet default.input_tbl
13
Transforming an Array
Option 1 - Explode and Collect - Complexity
• Shuffles the data around, which is very expensive
• collect_list does not respect pre-existing ordering
== Physical Plan ==
ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Exchange hashpartitioning(id, 200)
+- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Generate explode(vals), [id], false, [val]
+- FileScan parquet default.input_tbl
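A plan like this can be reproduced with a plain EXPLAIN (a sketch, assuming the input_tbl from earlier):

// Print the physical plan of the explode-and-collect query; note the
// Exchange (shuffle) between the partial and final aggregates.
spark.sql("""
  explain
  select id, collect_list(val + 1) as vals
  from (select id, explode(vals) as val from input_tbl) x
  group by id
""").show(truncate = false)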
14
Transforming an Array
Option 1 - Explode and Collect - Pitfalls
Keys need to be unique - duplicate ids are merged into a single array:

Id  Vals               Id  Vals
1   [1, 2, 3]     =>   1   [5, 6, 7, 2, 3, 4]
1   [4, 5, 6]

Values need to have data - rows with a null (or empty) array silently disappear:

Id  Vals               Id  Vals
1   null          =>   2   [5, 6, 7]
2   [4, 5, 6]
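The second pitfall is easy to reproduce (a sketch, assuming the SparkSession and imports from the earlier snippets):

import org.apache.spark.sql.functions._
import spark.implicits._

// id 1 has a null array: explode produces no rows for it, so after the
// group-by the id is simply gone from the result.
val withNull = Seq(
  (1, null.asInstanceOf[Seq[Int]]),
  (2, Seq(4, 5, 6))
).toDF("id", "vals")

withNull
  .select($"id", explode($"vals").as("val"))
  .groupBy($"id")
  .agg(collect_list($"val" + 1).as("vals"))
  .show()  // only id 2 survives; explode_outer would keep the row, but with a null element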
15
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
16
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int])
17
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
18
Transforming an Array
Option 2 - Scala UDF
Pros
- Faster than Explode & Collect
- Does not suffer from the correctness pitfalls
Cons
- Still relatively slow; it requires a lot of serialization
- You need to register a UDF per element type
- Does not work for SQL
- Clunky
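The per-type registration cost, sketched (assuming a SparkSession named spark; the extra function names below are made up): the Int version says nothing about arrays of Long or Double, so every element type needs its own function and registration.

// One UDF per element type: the same logic, repeated.
def addOneInt(values: Seq[Int]): Seq[Int]          = values.map(_ + 1)
def addOneLong(values: Seq[Long]): Seq[Long]       = values.map(_ + 1L)
def addOneDouble(values: Seq[Double]): Seq[Double] = values.map(_ + 1.0)

val plusOneInt    = spark.udf.register("plusOneInt",    addOneInt _)
val plusOneLong   = spark.udf.register("plusOneLong",   addOneLong _)
val plusOneDouble = spark.udf.register("plusOneDouble", addOneDouble _)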
19
When are you going to talk about Higher Order Functions?
20
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
21
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Higher Order Function
22
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
Can we do the same for Spark SQL?
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Higher Order Function
Anonymous ‘Lambda’ Function
23
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl
- Spark SQL native code: fast & no serialization needed
- Works for SQL
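From the Scala side the same higher-order function is reachable through spark.sql or, inside the DataFrame API, through expr() (a sketch against the input_tbl from earlier; in Spark 2.4 the function is exposed via SQL expressions, dedicated Column-based helpers came in later versions):

import org.apache.spark.sql.functions._
import spark.implicits._

// Plain SQL:
val viaSql = spark.sql(
  "select id, transform(vals, val -> val + 1) as vals from input_tbl")

// Embedded in the DataFrame API via expr():
val viaExpr = spark.table("input_tbl")
  .select($"id", expr("transform(vals, val -> val + 1)").as("vals"))

viaExpr.show()  // id 1 -> [2, 3, 4], id 2 -> [5, 6, 7]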
24
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl
Higher Order Function
transform is the Higher Order Function. It takes an input array and a lambda
expression, and applies that expression to each element of the array.
25
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl Anonymous ‘Lambda’ Function
val -> val + 1 is the lambda function. It is the operation that is applied
to each value in the array. The function consists of two components separated
by the -> symbol:
1. The argument list.
2. The expression used to calculate the new value.
26
Higher Order Functions in Spark SQL
Nesting
select id,
transform(vals, val ->
transform(val, e -> e + 1)) as vals
from nested_input_tbl
Capture
select id,
ref_value,
transform(vals, val -> ref_value + val) as vals
from nested_input_tbl
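A sketch of what the two queries do on a hypothetical nested_input_tbl (assuming the same SparkSession; column names are adapted slightly so both cases fit in one table):

import spark.implicits._

// Hypothetical table with a flat array, a nested array and a reference value.
Seq(
  (1, 10, Seq(1, 2, 3), Seq(Seq(1, 2), Seq(3))),
  (2, 20, Seq(4, 5, 6), Seq(Seq(4)))
).toDF("id", "ref_value", "vals", "nested")
 .createOrReplaceTempView("nested_input_tbl")

// Nesting: add 1 to every element of every inner array.
spark.sql("""
  select id, transform(nested, val -> transform(val, e -> e + 1)) as nested
  from nested_input_tbl
""").show(false)  // id 1 -> [[2, 3], [4]], id 2 -> [[5]]

// Capture: the lambda can reference other columns of the same row.
spark.sql("""
  select id, transform(vals, val -> ref_value + val) as vals
  from nested_input_tbl
""").show(false)  // id 1 -> [11, 12, 13], id 2 -> [24, 25, 26]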
27
Didn’t you say these were faster?
28
Performance
29
Higher Order Functions in Spark SQL
Spark 2.4 will ship with the following higher order functions:
Array
- transform
- filter
- exists
- aggregate/reduce
- zip_with
Map
- transform_keys
- transform_values
- map_filter
- map_zip_with
A lot of new collection-based expressions were also added; a few usage sketches follow below.
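A few hedged usage sketches (SQL as shipped in Spark 2.4, run against the input_tbl defined earlier; the aliases and the map literal are made up):

spark.sql("""
  select
    filter(vals, x -> x % 2 = 0)            as evens,     -- keep the even elements
    exists(vals, x -> x > 5)                as has_gt_5,  -- is any element > 5?
    aggregate(vals, 0, (acc, x) -> acc + x) as total,     -- fold/sum the array
    zip_with(vals, vals, (a, b) -> a + b)   as doubled,   -- combine two arrays element-wise
    transform_values(map('a', 1, 'b', 2), (k, v) -> v + 1) as bumped_map
  from input_tbl
""").show(false)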
30
Future work
Disclaimer: All of this is speculative and has not been discussed on the Dev list!
Arrays and Maps have received a lot of love. However, working with wide
struct fields is still non-trivial (a lot of typing). We can do better here:
- The following dataset functions should work for nested fields:
- withColumn()
- withColumnRenamed()
- The following functions should be added for struct fields:
- select()
- withColumn()
- withColumnRenamed()
Questions?