
Understanding transactional writes in datasource v2


Next Generation Datasource API for Spark 2.0



  1. Transactional Writes in Datasource V2 ● Next Generation Datasource API for Spark 2.0 ● https://github.com/phatak-dev/spark2.0-examples
  2. ● Madhukara Phatak ● Director of Engineering, Tellius ● Works on Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  3. Agenda ● Introduction to Data Source V2 ● Shortcomings of Datasource Write API ● Anatomy of Datasource V2 Write API ● Per Partition Transaction ● Source Level Transaction ● Partition Affinity
  4. Structured Data Processing
  5. Spark SQL Architecture (layered diagram): CSV / JSON / JDBC → Data Source API → DataFrame API → Spark SQL and HQL / Dataframe DSL
  6. Data Source API ● Universal API for loading/saving structured data ● Built-in support for Hive, Avro, JSON, JDBC, Parquet ● Third-party integration through spark-packages ● Support for smart sources ● Introduced in Spark 1.3 along with DataFrame ● Third parties already supporting: CSV, MongoDB, Cassandra, etc.
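
A quick sketch of what the data source API looks like from the user side: only the format name (and options) changes between connectors, while the read/write calls stay the same. The file paths below are made up for illustration.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("datasource-api-example")
      .master("local[*]")
      .getOrCreate()

    // Load a CSV file and save it back as Parquet through the data source API.
    val salesDf = spark.read
      .format("csv")
      .option("header", "true")
      .load("/tmp/sales.csv")          // hypothetical input path

    salesDf.write
      .format("parquet")
      .save("/tmp/sales_parquet")      // hypothetical output path
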
  7. Shortcomings of V1 API ● Introduced in 1.3 but did not evolve compared to other parts of Spark ● Dependency on high-level APIs like DataFrame, SparkContext, etc. ● Lack of support for columnar reads ● Lack of partition awareness ● No transaction support in the write API ● Lack of extensibility
  8. Introduction to Datasource V2 API
  9. V2 API ● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API ● The V2 API mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1 ● Currently in beta; it will become GA in a future release ● The V1 API will be deprecated ● No user-facing code change is needed to use V2 data sources.
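
To illustrate the last point, here is a minimal sketch of using a V2 source from user code. The format name "com.example.mysql.v2" is a made-up placeholder, not a class from the talk's repository; the DataFrame write call itself is identical to what one would do with a V1 source.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("v2-write-example")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0, 10).toDF("value")

    // "com.example.mysql.v2" stands in for any DataSourceV2 implementation's name;
    // nothing in the user-facing API reveals whether the source is V1 or V2.
    df.write
      .format("com.example.mysql.v2")
      .mode(SaveMode.Append)
      .save()
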
  10. Shortcomings of V1 Write API
  11. No Transaction Support in Write ● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS ● The interface did not have any transactional support, which is needed for sophisticated sources like databases ● For example, when data is partially written to a database and the job aborts, the rows already written are not cleaned up ● This is less of an issue in HDFS because a successful job writes a _SUCCESS marker file, so partial output from a failed job can be detected
  12. Anatomy of V2 Write API
  13. Interfaces (diagram): on the master, user code → WriteSupport → DataSourceWriter → DataWriterFactory; on each worker, the factory creates a DataWriter
  14. WriteSupport Interface ● Entry point to the data source ● Has one method: def createWriter(jobId: String, schema: StructType, mode: SaveMode, options: DataSourceOptions): Optional[DataSourceWriter] ● SaveMode and schema are the same as in the V1 API ● Returns an Optional so that read-only sources can return empty
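
A rough sketch of how a source might plug into this entry point. The class and helper names below are invented for illustration; only the createWriter signature comes from the Spark 2.3 API.

    import java.util.Optional

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
    import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
    import org.apache.spark.sql.types.StructType

    // Hypothetical entry point for a MySQL-style source.
    class SimpleMysqlSource extends DataSourceV2 with WriteSupport {

      override def createWriter(
          jobId: String,
          schema: StructType,
          mode: SaveMode,
          options: DataSourceOptions): Optional[DataSourceWriter] = {
        // A read-only source would return Optional.empty() here instead.
        Optional.of(buildWriter(options))
      }

      // Placeholder for constructing the concrete DataSourceWriter (next slides).
      private def buildWriter(options: DataSourceOptions): DataSourceWriter = ???
    }
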
  15. DataSourceWriter Interface ● Entry point to the writer ● Has three methods: ○ def createWriterFactory(): DataWriterFactory[Row] ○ def commit(messages: Array[WriterCommitMessage]) ○ def abort(messages: Array[WriterCommitMessage]) ● Responsible for creating the writer factory ● WriterCommitMessage is the interface for communicating commit information from the data writers back to the driver ● Transactional support is visible throughout the API
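
A hedged sketch of a driver-side writer; the class name, constructor parameters, and staging-table strategy in the comments are assumptions, while the three method signatures follow the Spark 2.3 interface.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

    // Hypothetical driver-side writer; commit()/abort() receive one commit
    // message per partition-level writer, which is where job-level
    // transactions hook in.
    class MysqlDataSourceWriter(url: String, table: String) extends DataSourceWriter {

      override def createWriterFactory(): DataWriterFactory[Row] =
        ??? // e.g. a serializable factory holding only the url and table name

      override def commit(messages: Array[WriterCommitMessage]): Unit = {
        // All partitions committed: finalize the job here, e.g. promote rows
        // from a staging table into the target table.
      }

      override def abort(messages: Array[WriterCommitMessage]): Unit = {
        // Some partition failed: undo whatever the committed partitions wrote,
        // e.g. drop the staging table.
      }
    }
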
  16. DataWriterFactory Interface ● Follows the factory design pattern to create the actual data writers ● Responsible for creating writers that are uniquely identified per partition ● Has one method to create a data writer: def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] ● attemptNumber is used for retried tasks
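
A minimal sketch of a factory, assuming a JDBC-style source; only the createDataWriter signature is from the Spark 2.3 API, the rest is illustrative.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory}

    // Hypothetical factory. It is serialized and shipped to the executors, so it
    // should carry only small, serializable state (connection details, not a
    // live java.sql.Connection).
    class MysqlDataWriterFactory(url: String, table: String) extends DataWriterFactory[Row] {

      override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] = {
        // partitionId identifies which partition this writer handles;
        // attemptNumber distinguishes retries of the same partition,
        // which a source can use to de-duplicate writes.
        ??? // e.g. new MysqlTransactionalDataWriter(url, table, partitionId)
      }
    }
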
  17. DataWriter Interface ● Interface responsible for the actual writing of data ● Runs on the worker nodes ● Methods exposed: ○ def write(record: Row) ○ def commit(): WriterCommitMessage ○ def abort() ● Looks very similar to the Hadoop write interface
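
A skeleton of a worker-side writer showing when each method is called; the commit-message case class is an assumption (Spark only requires it to be serializable).

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

    // Hypothetical commit message carrying per-partition statistics.
    case class RowsWritten(partitionId: Int, count: Long) extends WriterCommitMessage

    // Skeleton of a worker-side writer for a single partition attempt.
    class SkeletonDataWriter(partitionId: Int) extends DataWriter[Row] {
      private var count = 0L

      override def write(record: Row): Unit = {
        // Called once per row in the partition: buffer or push the row to the sink.
        count += 1
      }

      override def commit(): WriterCommitMessage = {
        // Called after every row of the partition was written successfully; the
        // returned message travels back to DataSourceWriter.commit on the driver.
        RowsWritten(partitionId, count)
      }

      override def abort(): Unit = {
        // Called when the task fails: clean up whatever write() already produced.
      }
    }
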
  18. Observations from the API ● The API does not depend on high-level constructs like SparkContext, DataFrame, etc. ● Transaction support throughout the API ● The write interface is quite simple and can be used for a wide variety of sources ● No more fiddling with RDDs in the data source layer
  19. Simple Mysql Data Source
  20. Mysql Source ● The MySQL source is responsible for writing data using the JDBC API ● Implements all the interfaces discussed earlier ● Has a single partition ● Shows how all the different APIs come together to build a full-fledged source ● Ex: SimpleMysqlWriter.scala
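
A sketch in the spirit of the SimpleMysqlWriter.scala example, not a copy of it: a non-transactional JDBC data writer. The JDBC URL, table name, and column layout (user STRING, age INT) are assumptions made here for illustration.

    import java.sql.DriverManager

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

    case object SimpleCommitted extends WriterCommitMessage

    // Hypothetical single-partition JDBC writer with auto-commit left on.
    class SimpleMysqlDataWriter(url: String, table: String) extends DataWriter[Row] {

      private val connection = DriverManager.getConnection(url)
      private val statement =
        connection.prepareStatement(s"INSERT INTO $table (user, age) VALUES (?, ?)")

      override def write(record: Row): Unit = {
        // With auto-commit on, every row is visible immediately; if the job
        // aborts halfway, the rows already inserted stay behind (the problem
        // described on slide 11).
        statement.setString(1, record.getString(0))
        statement.setInt(2, record.getInt(1))
        statement.executeUpdate()
      }

      override def commit(): WriterCommitMessage = {
        connection.close()
        SimpleCommitted
      }

      override def abort(): Unit = connection.close()
    }
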
  21. Transactional Writes
  22. Distributed Writes ● Distributed writing is hard ● There are many reasons a write can fail: ○ The connection is dropped ○ Error while writing data for a partition ○ Duplicate data because of task retries ● Many of these issues crop up frequently in Spark applications ● Ex: MysqlTransactionExample.scala
  23. Transactional Support ● The Datasource V2 API has good support for transactions ● Transactions can be implemented at: ○ Partition level ○ Job level ● This transaction support helps handle errors from partially written data ● Ex: MysqlWithTransaction
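
A sketch of partition-level transactions in the spirit of the MysqlWithTransaction example: each DataWriter opens one JDBC transaction, commits it in commit(), and rolls it back in abort(). As before, the URL, table, and columns are assumptions.

    import java.sql.DriverManager

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

    case class PartitionCommitted(partitionId: Int) extends WriterCommitMessage

    // Hypothetical partition-level transactional JDBC writer.
    class MysqlTransactionalDataWriter(url: String, table: String, partitionId: Int)
        extends DataWriter[Row] {

      private val connection = DriverManager.getConnection(url)
      connection.setAutoCommit(false) // one JDBC transaction per partition attempt

      private val statement =
        connection.prepareStatement(s"INSERT INTO $table (user, age) VALUES (?, ?)")

      override def write(record: Row): Unit = {
        // Rows stay invisible to other readers until the transaction commits.
        statement.setString(1, record.getString(0))
        statement.setInt(2, record.getInt(1))
        statement.executeUpdate()
      }

      override def commit(): WriterCommitMessage = {
        connection.commit() // the whole partition becomes visible atomically
        connection.close()
        PartitionCommitted(partitionId)
      }

      override def abort(): Unit = {
        connection.rollback() // a failed or retried task leaves no partial rows
        connection.close()
      }
    }
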
  24. Partition Affinity
  25. Partition Locations ● Many data sources today provide native support for partitioning ● These partitions can be distributed over a cluster of machines ● Making Spark aware of this partitioning scheme makes reading much faster ● Works best for co-located data sources
  26. Preferred Locations ● DataReaderFactory exposes the preferredLocations API to pass partitioning information to Spark ● The API returns the host names of the machines where the partition's data is available ● Spark uses this only as a hint; it may not honor it ● If we return a hostname Spark does not recognize, it is simply ignored ● Spark stores this information in the underlying RDD ● Ex: SimpleDataSourceWithPartitionAffinity.scala
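
A sketch of a reader factory reporting partition affinity; the class name and the source of the host names are assumptions, while preferredLocations() and createDataReader() follow the Spark 2.3 reader API.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

    // Hypothetical factory for one partition of a partition-aware source; the
    // host names would come from whatever metadata the source exposes.
    class PartitionAwareReaderFactory(partitionHosts: Array[String])
        extends DataReaderFactory[Row] {

      // Scheduling hint: prefer running this partition's task on these hosts.
      // Spark may ignore the hint, and host names it cannot resolve to an
      // executor are silently dropped.
      override def preferredLocations(): Array[String] = partitionHosts

      override def createDataReader(): DataReader[Row] =
        ??? // the actual per-partition reader is omitted in this sketch
    }
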
  27. References ● http://blog.madhukaraphatak.com/categories/datasource-v2-series ● https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html ● https://issues.apache.org/jira/browse/SPARK-15689
