Kazuaki Ishizaki (石崎 一明)
IBM Research – Tokyo (日本アイ・ビー・エム(株)東京基礎研究所)
@kiszk
Looking back at Spark 2.x and forward to 3.0
About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research - Tokyo https://ibm.biz/ishizaki
– Compiler optimization
– Language runtime
– Parallel processing
▪ Working on the IBM Java virtual machine (now OpenJ9) for over 20 years
– In particular, just-in-time compiler
▪ Apache Spark Committer for the SQL package (since 2018/9)
– My first PR was merged in 2015/12
▪ ACM Distinguished Member
▪ SNS
– @kiszk
– Slideshare: https://www.slideshare.net/ishizaki
Today’s Talk
▪ I will not talk about the distributed framework
– You are more familiar with it than I am
▪ I will not talk about SQL, machine learning, and other libraries
– I expect @maropu will talk about SQL in the next session
▪ I will talk about how a program is executed on an executor at a node
Outline
▪ How is a DataFrame/Dataset program executed?
▪ What are the problems in Spark 2.x?
▪ What’s new in Spark 3.0?
▪ Why was I appointed as a committer?
An Apache Spark Program Written by a User
▪ This DataFrame program is written in Scala
df: DataFrame[int] = (1 to 100).toDF
df.selectExpr("value + 1")
  .selectExpr("value + 2")
  .show
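A runnable variant for spark-shell on Spark 2.x (a minimal sketch; note the added "as value" alias, which keeps the column name stable so that the second selectExpr can still resolve it):

import spark.implicits._  // `spark` is the SparkSession provided by spark-shell

val df = (1 to 100).toDF  // one int column named "value"
df.selectExpr("value + 1 as value")
  .selectExpr("value + 2")
  .show(3)
// first rows of column "(value + 2)": 4, 5, 6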
Java Code Is Actually Executed
▪ A DataFrame/Dataset program is translated into a Java program, which is what actually runs
– An optimizer combines the two arithmetic operations into one
– Whole-stage codegen puts multiple operations (read, selectExpr, and projection) into one loop
while (itr.hasNext()) { // execute a row
  // get a value from a row in DF
  int value = ((Row)itr.next()).getInt(0);
  // compute a new value
  int mapValue = value + 3;
  // store a new value to a row in DF
  outRow.write(0, mapValue);
  append(outRow);
}
df: DataFrame[int] = …
df.selectExpr("value + 1")
  .selectExpr("value + 2")
  .show

(Diagram: code generation turns the Scala program above into the Java loop; the input values 1, 2, 3, 4, … are held as Unsafe data on the heap.)
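You can inspect the code that whole-stage codegen produces for such a query; a minimal sketch for spark-shell (debugCodegen comes from the debug package, as used later in this deck):

import spark.implicits._
import org.apache.spark.sql.execution.debug._  // adds debugCodegen to a Dataset/DataFrame

val q = (1 to 100).toDF.selectExpr("value + 1 as value").selectExpr("value + 2")
q.explain()       // physical plan: a single WholeStageCodegen stage
q.debugCodegen()  // dumps the generated Java source for that stage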
How a Program is Translated to Java Code
From Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust
Who is More Familiar with Each Module
▪ Four Japanese committers are in this room
(Diagram label: Project Tungsten)
Major Items in Spark 2.x for Me
▪ Improve performance
– by improving data representation
– by eliminating serialization/deserialization (ser/de)
– by improving generated code
▪ Stable code generation
– No more Java exceptions when a program has a large number of columns (>1000)
Array Internal Representation
▪ Before Spark 2.1, an array (UnsafeArrayData) was internally represented by a sparse/indirect structure
– Good for small memory consumption if an array is sparse
▪ From Spark 2.1, the array representation is dense/contiguous
– Good for performance
(Diagram: the old layout stores len = 2 plus per-element offsets, offset[0] and offset[1], that point at the values a[0] = 7 and a[1] = 8; the new layout stores len = 2, per-element null bits, and the values 7 and 8 contiguously.)
SPARK-15962 improves this representation
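A quick way to see the dense layout from SPARK-15962 is to build an UnsafeArrayData from a primitive array and check its size; a minimal sketch (the 24-byte breakdown assumes the post-2.1 format sketched above: an 8-byte length word, one 8-byte word of null bits, and 8-byte-aligned values, so treat the exact number as illustrative):

import org.apache.spark.sql.catalyst.util.UnsafeArrayData

val arr = UnsafeArrayData.fromPrimitiveArray(Array(7, 8))  // dense, contiguous bytes
println(arr.numElements())                      // 2
println(s"${arr.getInt(0)}, ${arr.getInt(1)}")  // 7, 8
println(arr.getSizeInBytes)                     // 24 = 8 (len) + 8 (null bits) + 8 (values)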
This Was (the First) Tough PR for Me
▪ I spent three months on it, with 270 comments
A Simple Dataset Program with Array
▪ Read an integer array in a row
▪ Create a new array from the first element
ds: Dataset[Array[Int]] = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))
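This program runs as-is in spark-shell (a minimal sketch; collect is added only to force execution and show the result):

import spark.implicits._

val ds = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0))).collect()  // Array(Array(7))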
Weird Generated Pseudo Code with Dataset
▪ Data conversion is too slow
– Between the internal representation (Tungsten) and the Java object format (Object[])
▪ Element-wise data copy is too slow

ds: Dataset[Array[Int]] = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))

Code generation ↓

ArrayData inArray;
while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  // De: data conversion, then element-wise data copy of each element with a null check
  // the lambda runs on Java objects: int[] mapArray = new int[] { a[0] };
  // Ser: data conversion with Java object creation, then element-wise data copy back
  append(outRow);
}
Generated Source Java Code
▪ Data conversion is done by boxing or unboxing
▪ Element-wise data copy is done by a for-loop

ds: Dataset[Array[Int]] = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))

Code generation ↓

ArrayData inArray;
while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  // De: element-wise data copy into Java objects (boxing), with null checks
  Object[] tmp = new Object[inArray.numElements()];
  for (int i = 0; i < tmp.length; i++) {
    tmp[i] = (inArray.isNullAt(i)) ? null : inArray.getInt(i);
  }
  ArrayData array = new GenericIntArrayData(tmp);
  int[] javaArray = array.toIntArray();  // data conversion (unboxing)
  int[] mapArray = (int[])map_func.apply(javaArray);
  // Ser: data conversion, then element-wise data copy back
  outArray = new GenericArrayData(mapArray);
  for (int i = 0; i < outArray.numElements(); i++) {
    if (outArray.isNullAt(i)) {
      arrayWriter.setNullInt(i);
    } else {
      arrayWriter.write(i, outArray.getInt(i));
    }
  }
  append(outRow);
}
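To see where these Ser/De steps come from, print the query plan; a minimal sketch (on Spark 2.x the plan shows the extra nodes wrapped around the user function):

import spark.implicits._

val ds = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0))).explain(true)
// the plan contains, around the lambda:
//   SerializeFromObject [...]
//   +- MapElements <function1>, ...
//      +- DeserializeToObject ...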
Too Long Actually-Generated Java Code (Spark 2.0)
▪ Too long to read (the slide highlights the data conversion and the element-wise data copy)

ds.map(a => Array(a(0))).debugCodegen

protected void processNext() throws java.io.IOException {
  while (inputadapter_input.hasNext()) {
    InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
    ArrayData inputadapter_value = inputadapter_isNull ?
      null : (inputadapter_row.getArray(0));
    boolean deserializetoobject_isNull1 = inputadapter_isNull;
    ArrayData deserializetoobject_value1 = null;
    if (!inputadapter_isNull) {
      final int deserializetoobject_n = inputadapter_value.numElements();
      final Object[] deserializetoobject_values = new Object[deserializetoobject_n];
      for (int deserializetoobject_j = 0;
           deserializetoobject_j < deserializetoobject_n; deserializetoobject_j ++) {
        if (inputadapter_value.isNullAt(deserializetoobject_j)) {
          deserializetoobject_values[deserializetoobject_j] = null;
        } else {
          boolean deserializetoobject_feNull = false;
          int deserializetoobject_fePrim =
            inputadapter_value.getInt(deserializetoobject_j);
          boolean deserializetoobject_teNull = deserializetoobject_feNull;
          int deserializetoobject_tePrim = -1;
          if (!deserializetoobject_feNull) {
            deserializetoobject_tePrim = deserializetoobject_fePrim;
          }
          if (deserializetoobject_teNull) {
            deserializetoobject_values[deserializetoobject_j] = null;
          } else {
            deserializetoobject_values[deserializetoobject_j] = deserializetoobject_tePrim;
          }
        }
      }
      deserializetoobject_value1 = new GenericArrayData(deserializetoobject_values);
    }
    boolean deserializetoobject_isNull = deserializetoobject_isNull1;
    final int[] deserializetoobject_value = deserializetoobject_isNull ?
      null : (int[]) deserializetoobject_value1.toIntArray();
    deserializetoobject_isNull = deserializetoobject_value == null;
    Object mapelements_obj = ((Expression) references[0]).eval(null);
    scala.Function1 mapelements_value1 = (scala.Function1) mapelements_obj;
    boolean mapelements_isNull = false || deserializetoobject_isNull;
    final int[] mapelements_value = mapelements_isNull ?
      null : (int[]) mapelements_value1.apply(deserializetoobject_value);
    mapelements_isNull = mapelements_value == null;
    final boolean serializefromobject_isNull = mapelements_isNull;
    final ArrayData serializefromobject_value = serializefromobject_isNull ?
      null : new GenericArrayData(mapelements_value);
    serializefromobject_holder.reset();
    serializefromobject_rowWriter.zeroOutNullBytes();
    if (serializefromobject_isNull) {
      serializefromobject_rowWriter.setNullAt(0);
    } else {
      final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
      if (serializefromobject_value instanceof UnsafeArrayData) {
        final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
        serializefromobject_holder.grow(serializefromobject_sizeInBytes);
        ((UnsafeArrayData) serializefromobject_value).writeToMemory(
          serializefromobject_holder.buffer, serializefromobject_holder.cursor);
        serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
      } else {
        final int serializefromobject_numElements = serializefromobject_value.numElements();
        serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
        for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements;
             serializefromobject_index++) {
          if (serializefromobject_value.isNullAt(serializefromobject_index)) {
            serializefromobject_arrayWriter.setNullAt(serializefromobject_index);
          } else {
            final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index);
            serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
          }
        }
      }
      serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor,
        serializefromobject_holder.cursor - serializefromobject_tmpCursor);
      serializefromobject_rowWriter.alignToWords(serializefromobject_holder.cursor - serializefromobject_tmpCursor);
    }
    serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
    append(serializefromobject_result);
    if (shouldStop()) return;
  }
}
Simple Generated Code for Array on Spark 2.2
▪ Data conversion and element-wise copy are not used
▪ Bulk copy is faster than element-wise data copy

ds: Dataset[Array[Int]] = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))

Code generation ↓

while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  // bulk data copy: the whole input array is copied at once, using memcpy()
  int[] mapArray = (int[])map.apply(javaArray);  // the lambda: new int[] { a[0] }
  // bulk data copy: the result array is copied back at once
  append(outRow);
}

SPARK-15985 and SPARK-17490 simplify Ser/De by using bulk data copy
Simple Generated Java Code for Array
▪ Data conversion and element-wise copy are not used
▪ Bulk copy is faster than element-wise data copy

ds: Dataset[Array[Int]] = Seq(Array(7, 8)).toDS
ds.map(a => Array(a(0)))

Code generation ↓

while (itr.hasNext()) {
  inArray = ((Row)itr.next()).getArray(0);
  // bulk data copy to a Java array (the whole array, using memcpy())
  int[] javaArray = inArray.toIntArray();
  int[] mapArray = (int[])mapFunc.apply(javaArray);
  // bulk data copy back to the internal representation
  outArray = UnsafeArrayData
    .fromPrimitiveArray(mapArray);
  outArray.writeToMemory(outRow);
  append(outRow);
}

SPARK-15985 and SPARK-17490 simplify Ser/De by using bulk data copy
Dataset for Array Is Not Extremely Slow
▪ Good news: 4.5x faster than Spark 2.0
▪ Bad news: still 12x slower than DataFrame
(Chart: relative execution time over DataFrame; shorter is better. Dataset is 4.5x faster than on Spark 2.0, but still 12x slower than DataFrame.)

Dataset:
  ds = Seq(Array(…), Array(…), …).toDS.cache
  ds.map(a => Array(a(0)))

DataFrame:
  df = Seq(Array(…), Array(…), …).toDF("a").cache
  df.selectExpr("Array(a[0])")
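To reproduce this kind of comparison yourself, a rough micro-benchmark sketch (not the measurement behind the slide; spark.time only prints wall-clock time, and the data size here is arbitrary):

import spark.implicits._

val data = Seq.fill(100000)(Array(1, 2, 3, 4, 5))
val ds = data.toDS.cache(); ds.count()       // materialize the cache
val df = data.toDF("a").cache(); df.count()

spark.time { ds.map(a => Array(a(0))).foreach(_ => ()) }      // Dataset path
spark.time { df.selectExpr("array(a[0])").foreach(_ => ()) }  // DataFrame path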
Spark 2.4 Supports Array Built-in Functions
▪ These built-in functions operate on array elements without writing a loop
– Single-array input: array_min, array_max, array_position, ...
– Two-array input: array_intersect, array_union, array_except, ...
▪ Before Spark 2.4, users had to write a function using Dataset or a UDF

SPARK-23899 is an umbrella entry

Before (Dataset):
  ds: Dataset[Array[Int]] = Seq(Array(7, 8)).toDS
  ds.map(a => a.min)

Spark 2.4 (built-in):
  df: DataFrame = Seq(Array(7, 8)).toDF("a")
  df.selectExpr("array_min(a)")

@ueshin co-wrote a blog entry at
https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html
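The same functions are also available through the typed column API; a minimal sketch:

import spark.implicits._
import org.apache.spark.sql.functions.{array_min, array_max}

val df = Seq(Array(7, 8), Array(3, 9)).toDF("a")
df.select(array_min($"a"), array_max($"a")).show()
// a = [7, 8] -> min 7, max 8;  a = [3, 9] -> min 3, max 9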
Pre-Spark 2.3 Throws a Java Exception with Large Columns
▪ A Java class file has multiple limitations
– The bytecode size of a method must be less than 64KB
– A constant pool entry (e.g. a symbol name) must be less than 64KB
df.groupBy("id").agg(max("c1"), sum("c2"), …, min("c4000"))
01:11:11.123 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to
compile: org.codehaus.janino.JaninoRuntimeException: Code of method
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions
/UnsafeRow;" of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows
beyond 64 KB
...
Spark generated a huge method whose bytecode size is more than 64KB
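A sketch of the kind of query that used to trigger this, building the wide aggregation programmatically (the column names c1 ... c4000 and the row count are made up; on Spark 2.2 and earlier such a query could fail with the Janino error above):

import org.apache.spark.sql.functions._

// synthetic DataFrame with an id column plus 4000 value columns c1 ... c4000
val wide = spark.range(100).select(
  (col("id") +: (1 to 4000).map(i => (col("id") + i).as(s"c$i"))): _*)

// one aggregate per column, as in df.groupBy("id").agg(...) above
val aggs = (1 to 4000).map(i => max(col(s"c$i")))
wide.groupBy("id").agg(aggs.head, aggs.tail: _*).count()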
Spark 2.3 Fixes the Java Exception with Large Columns
▪ Conservatively generate small methods when potentially large Java code is produced
– This policy is applied in many places in the code generators

SPARK-22150 is an umbrella entry that has 25 sub-tasks
Major Items in Spark 3.0
▪ JDK11 and Scala 2.12 support (available in the master branch)
– SPARK-24417
▪ Tungsten intermediate representation (IR)
– Easy to restructure generated code
– SPARK-25728 (under proposal)
▪ DataSource V2 API
– SPARK-25528
▪ …
Motivation for the Tungsten IR
▪ It is not easy to restructure Java code after it has been generated
– Code generation is done by string concatenation

int i = ...
func1(i + 1, i * 2);
...
func1(i + 500, i * 2);
// hard to split here into two parts without parsing the Java code
func1(i + 501, i * 2);
...
func1(i + 1000, i * 2);
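To make the contrast concrete, here is a purely hypothetical sketch of the idea (this is not the SPARK-25728 API; the case classes and the split are invented for illustration): when code is a tree of values instead of one string, a list of statements can be split at any boundary without parsing.

// hypothetical mini-IR: statements are data, not strings
sealed trait Expr
case class Load(name: String) extends Expr
case class Const(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr
case class Invoke(fn: String, args: Seq[Expr])

// the 1000 calls as data: func1(i + n, i * 2) for n = 1 .. 1000
val calls = (1 to 1000).map(n =>
  Invoke("func1", Seq(Add(Load("i"), Const(n)), Mul(Load("i"), Const(2)))))

// splitting into two methods is just splitting the list
val (first, second) = calls.splitAt(500)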
Structured IR Allows Us to Restructure Generated Code
▪ Ease of code restructuring (shown in blue on the slide)
▪ Ease of rebuilding an expression (shown in green)

(Diagram: an IR tree with a Method node whose children are Invoke nodes; one Invoke takes +(load i, 500) and *(load i, 2) as arguments, the next takes +(load i, 501), ...; it is easy to split the tree into two parts between the Invoke nodes.)
Other Major Items in Spark 2.x
▪ PySpark performance improvement
– Using Pandas UDFs with Apache Arrow can drastically improve PySpark performance
▪ https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
▪ Project Hydrogen
– Barrier execution mode for integrating ML/DL frameworks with Spark
▪ https://databricks.com/blog/2018/07/25/bay-area-apache-spark-meetup-summary-databricks-hq.html
What I Am Interested in
▪ Tungsten IR in Spark
– Ease of code restructuring
– (In the future) applying multiple optimizations
▪ Improvement of generated code in Spark
– For the Parquet reader
– Data representation for the array table cache
▪ Integration of Spark with DL/ML frameworks (TensorFlow, …) and others
Possible Integration of Spark through Arrow (My View)
▪ Frameworks: DL/ML frameworks (TensorFlow, …)
▪ Resources: GPU, …
– RAPIDS (by NVIDIA) may help integrate with GPUs

(Diagram, from rapids.ai: components exchanging in-memory columnar data.)
Why Was I Appointed as a Committer?
▪ Continued to make contributions to a certain component (SQL)
▪ Reviewed many pull requests
▪ Shared knowledge based on my expertise in the community
– Compiler and Java virtual machine
▪ Met committers and contributors in person
– Hadoop Source Code Reading, Hadoop Spark Conference Japan, Spark Summit, other meetups
Contribute to open source!
Dive into the Spark community!
by Apache Spark committer Saruta-san (猿田さん)
https://www.slideshare.net/hadoopxnttdata/apache-spark-commnity-nttdata-sarutak