Polymorphic Table Functions (PTFs) allow SQL queries to invoke Spark computations and integrate the results as relational tables. PTFs define a Spark job as a class that can be called from SQL like a table. The class implements methods for describing the table structure and executing the Spark logic. This provides a scalable way to leverage Spark's capabilities from SQL without needing intermediate data storage. Example use cases include integrating various data sources, complex ETL, and invoking machine learning models from SQL queries.
2. Please note
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice and at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information
about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality
described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance that
any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
4. Notices and disclaimers
continued
It is the customer’s responsibility to ensure its own compliance with legal
requirements and to obtain advice of competent legal counsel as to
the identification and interpretation of any relevant laws and regulatory
requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal
advice or represent or warrant that its services or products will ensure that
the customer follows any law.
Information concerning non-IBM products was obtained from the suppliers of
those products, their published announcements or other publicly available
sources. IBM has not tested those products in connection with this publication and
cannot confirm the accuracy of performance, compatibility, or any other claims
related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party
products, or the ability of any such third-party products to interoperate with IBM’s
products. IBM expressly disclaims all warranties, expressed or implied, including
but not limited to, the implied warranties of merchantability and fitness for a
particular purpose.
The provision of the information contained herein is not intended to, and does not,
grant any right or license under any IBM patents, copyrights, trademarks or other
intellectual property right.
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service
names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at “Copyright and trademark information” at:
www.ibm.com/legal/copytrade.shtml.
5. Agenda
Motivation
Traditional ways of integrating Spark and SQL
Shortcomings of the traditional approach
Polymorphic Table Functions
What are PTFs?
How do PTFs work?
Use cases for PTFs
Conclusions
7. Traditional Way of
Integrating Spark and Data Stores
▪ Spark reads data from and writes
data to data stores (e.g. JDBC,
Parquet, …)
▪ Spark initiates communication
▪ Spark processes data which it read
and writes data which it generated
▪ Spark is in the driver seat
8. A New Way of
Integrating Spark and Data Stores
▪ The data store is in the driver seat
▪ It uses the result of a Spark
computation as a table
▪ The data store delegates complex
transformations on tables to Spark
▪ Solution:
Polymorphic Table Functions (PTF)
10. What are Polymorphic Table Functions?
▪ A table function is used to invoke the Spark code
▪ The result of a computation in Spark is presented as a relational table
▪ Why polymorphic?
▪ Single function is used to execute all kinds of Spark code
▪ This table function can return many different types of tables
▪ Different number of columns
▪ Different types of columns
▪ Tables may in addition be passed as arguments to the Spark
computation
▪ Table function may be used like a normal table in SQL (e.g. for joins
with other tables or table functions, for filtering, etc.)
11. Example: Polymorphic Table Functions in Big SQL
▪ Polymorphic table function is called EXECSPARK
▪ Class which implements the SparkPtf interface must be passed as
argument to EXECSPARK
▪ Class may be written in Scala or Java
▪ Class must implement four methods:
Method Purpose
describe Called by Big SQL at compile time to determine which columns, with which data types, are returned
execute Called at runtime; implements the actual Spark computation
destroy Used for cleanup
cardinality Used by the Big SQL optimizer to estimate the size of the returned table
12. PTF: A Complete Example: Reading a JSON File
Calling SQL Statement
select cast(country as varchar(10)) c
from table(
SYSHADOOP.EXECSPARK(
class => 'org.examples.ReadJsonFileSecure',
uri => '/user/bigsql/tmp/demo.json'
)
) as doc
where country is not null
order by c;
14. PTF: A Complete Example: Reading a JSON File
PTF Implementation: Class Definition & describe Method
public class ReadJsonFileSecure implements SparkPtf {
Dataset<Row> df = null;
@Override
public StructType describe(SQLContext ctx,
Map<String, Object> arguments) {
df = ctx.read().json((String)arguments.get("URI"));
df.cache();
return df.schema();
}
15. PTF: A Complete Example: Reading a JSON File
PTF Implementation: execute Method
@Override
public Dataset<Row> execute(SQLContext ctx,
Map<String, Object> arguments) {
if (df == null) {
df = ctx.read().json((String)arguments.get("URI"));
df.cache();
}
return df;
}
16. PTF: A Complete Example: Reading a JSON File
PTF Implementation: destroy & cardinality Methods
@Override
public void destroy(SQLContext ctx,
Map<String, Object> arguments) {
if (df != null) {
df = null;
}
}
@Override
public long cardinality(SQLContext ctx,
Map<String, Object> arguments) {
if (arguments.get("CARD") == null)
return 100;
else
return (long) arguments.get("CARD");
}
}
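Taken together, the engine drives every PTF through the same lifecycle: describe at compile time, cardinality during optimization, execute at runtime, and destroy for cleanup. The Spark-free Java sketch below illustrates that call order with a simplified stand-in interface (names here are hypothetical; the real SparkPtf methods take an SQLContext and return StructType/Dataset<Row>):

```java
import java.util.*;

// Simplified, Spark-free stand-in for the SparkPtf contract
// (hypothetical interface; the real one uses SQLContext,
// StructType and Dataset<Row>).
interface SimplePtf {
    List<String> describe(Map<String, Object> args);   // column names
    List<String[]> execute(Map<String, Object> args);  // result rows
    long cardinality(Map<String, Object> args);        // row-count estimate
    void destroy(Map<String, Object> args);            // cleanup
}

public class PtfLifecycle {
    // Toy PTF that "reads" an in-memory table instead of a JSON file.
    static class ToyPtf implements SimplePtf {
        private List<String[]> cached;
        public List<String> describe(Map<String, Object> args) {
            return Arrays.asList("country");
        }
        public List<String[]> execute(Map<String, Object> args) {
            if (cached == null) {  // cache, as ReadJsonFileSecure does
                cached = Arrays.asList(new String[]{"DE"}, new String[]{"US"});
            }
            return cached;
        }
        public long cardinality(Map<String, Object> args) {
            Object card = args.get("CARD");
            return card == null ? 100L : (Long) card;  // default estimate
        }
        public void destroy(Map<String, Object> args) { cached = null; }
    }

    // The host SQL engine calls the four methods in this order.
    public static void main(String[] a) {
        SimplePtf ptf = new ToyPtf();
        Map<String, Object> args = new HashMap<>();
        System.out.println("schema: " + ptf.describe(args));       // compile time
        System.out.println("card:   " + ptf.cardinality(args));    // optimization
        System.out.println("rows:   " + ptf.execute(args).size()); // runtime
        ptf.destroy(args);                                          // cleanup
    }
}
```

Caching the DataFrame in describe/execute, as the ReadJsonFileSecure example does, avoids recomputing the source when the engine calls both methods for a single query.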
17. What is IBM Db2 Big SQL?
▪ Advanced, fully-featured SQL engine for physical and virtual data
lakes on Hadoop and/or Cloud Object Storage
▪ Focusses on
▪ data virtualization,
▪ SQL compatibility,
▪ scalability,
▪ performance,
▪ enterprise security/governance
▪ Supports queries on Hive and HBase data
▪ Uses Hive Metastore
▪ Supports CSV, Parquet, ORC, …
▪ Bidirectional integration with Spark:
▪ Spark JDBC data source enables execution of Big SQL queries from Spark and consumes the results as data frames
▪ Polymorphic table function enables execution of Spark jobs from Big SQL and consumes the results as tables.
19. Efficient Integration
• Low latency: Spark jobs execute on a long-running Spark application that is
co-located with the Big SQL engine (no spark-submit per job)
• High throughput: data movement exploits parallelism and local data transfer
at the nodes
[Diagram: Big SQL workers co-located with Spark executors. The head node runs the
Big SQL head together with the Spark driver; each worker node runs a Big SQL worker
together with a Spark executor. Control flows across engines from the head node,
while job data flows within each node from HDFS.]
20. How it works
[Diagram: The Big SQL head (co-located with the Spark driver) receives
SELECT * FROM TABLE(EXECSPARK(class => 'com….PTF', …)),
loads the class, and calls describe to obtain the schema. At runtime it calls
execute; the Spark driver launches stages/tasks on the executors, and each
Big SQL worker (co-located with a Spark executor) receives the DataFrame
partitions, which are scanned from HDFS or local storage.]
21. High-throughput data transfer
• Parallelism across nodes and within nodes
• DataFrame partitions are streamed from the Spark executor to the Big SQL worker
as soon as Spark produces them
• Partition data is serialized in the Big SQL runtime format on the Spark side
• No deserialization overhead once the data arrives in Big SQL
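The streaming handoff described above can be illustrated with a plain-Java producer/consumer sketch. This is only an illustration of the pattern, not Big SQL's actual transport: the producer thread stands in for the Spark executor and pushes each serialized partition into a bounded channel as soon as it is ready, while the consumer, standing in for the Big SQL worker, processes partitions while later ones are still being produced.

```java
import java.util.concurrent.*;

// Illustrative sketch of streaming partitions as they are produced:
// producer = Spark executor, consumer = Big SQL worker (assumed roles,
// not the real implementation).
public class PartitionStream {
    static final byte[] POISON = new byte[0];  // end-of-stream marker

    public static int transfer(int numPartitions) throws InterruptedException {
        // Bounded channel: the producer blocks instead of materializing
        // the whole result before any transfer starts.
        BlockingQueue<byte[]> channel = new ArrayBlockingQueue<>(4);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < numPartitions; i++) {
                    // In the real system this would be a DataFrame partition
                    // already serialized in the worker's runtime format.
                    channel.put(("partition-" + i).getBytes());
                }
                channel.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int consumed = 0;
        for (byte[] part = channel.take(); part != POISON; part = channel.take()) {
            consumed++;  // the worker would consume rows here without a
                         // deserialization step, per the bullets above
        }
        producer.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("partitions received: " + transfer(8));
    }
}
```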
22. Use Cases
▪ Data virtualization to many data sources otherwise not supported
▪ All sources for which there is a Spark DataSource become available in
SQL engine
▪ Complex transformations that are hard/impossible to express in SQL
become possible
▪ Exploitation of Spark libraries and external packages in SQL engine
▪ ETL
▪ Machine learning
▪ Custom aggregates
23. Example
SELECT people.firstname, people.lastname, people.phone
FROM TABLE(execspark(
class => 'com.ibm.bigsql.DataSource',
format => 'jdbc',
url => 'jdbc:mysql://localhost/mydb',
driver => 'com.mysql.jdbc.Driver',
user => 'joe',
password => 'joespwd',
load => 'default.people'
)) AS people
WHERE people.age > 55
Using DataSource PTF to Query a MySQL Table
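A DataSource-style PTF like the one above presumably just forwards its named arguments to Spark's DataFrameReader (format(...), option(...), load(...)). The helper below is a hypothetical plain-Java sketch of that mapping, not the actual com.ibm.bigsql.DataSource code; the uppercased argument keys follow the arguments.get("URI") and "CARD" convention seen in the earlier example.

```java
import java.util.*;

// Hypothetical sketch: map the PTF's named SQL arguments onto the
// DataFrameReader calls a DataSource-style PTF would issue.
// FORMAT selects the source, LOAD is the path/table, and every
// remaining argument becomes an option.
public class DataSourceArgs {
    public static List<String> toReaderCalls(Map<String, Object> args) {
        List<String> calls = new ArrayList<>();
        calls.add("format(\"" + args.remove("FORMAT") + "\")");
        Object load = args.remove("LOAD");
        args.remove("CLASS");  // consumed by EXECSPARK itself (assumption)
        // TreeMap gives a deterministic option order for the sketch.
        for (Map.Entry<String, Object> e : new TreeMap<>(args).entrySet()) {
            calls.add("option(\"" + e.getKey().toLowerCase()
                      + "\", \"" + e.getValue() + "\")");
        }
        calls.add("load(\"" + load + "\")");
        return calls;
    }

    public static void main(String[] a) {
        Map<String, Object> args = new HashMap<>();
        args.put("FORMAT", "jdbc");
        args.put("URL", "jdbc:mysql://localhost/mydb");
        args.put("USER", "joe");
        args.put("LOAD", "default.people");
        System.out.println(String.join(".", toReaderCalls(args)));
    }
}
```

Run against the JDBC example's arguments, this produces the chain format("jdbc").option(...).load("default.people"), mirroring how a Spark DataFrameReader would be driven.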
24. Example:
SELECT DESCRIBE doc.*
FROM TABLE(execspark(
class => 'com.ibm.bigsql.DataSource',
format => 'csv',
skipHeader => 'true',
load => 'hdfs://user/guest/sample.csv'
)) AS doc
Using DataSource PTF to Describe a CSV File
25. Example:
SELECT people.firstname, people.lastname, people.phone
FROM TABLE(execspark(
class => 'com.ibm.bigsql.DataSource',
format => 'parquet',
load => 'hdfs://user/guest/people.parquet'
)) AS people
WHERE people.age > 55
Using DataSource PTF to Query a Parquet File
27. Summary
Polymorphic table functions
▪ Provide a fast and scalable way to transfer data from Spark to SQL
engine with SQL engine in control
▪ No intermediate storage necessary like in naïve solution
▪ Many possible use cases supported
▪ Leverage the analytics capabilities of Spark in SQL engine
28. Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
Andreas Weininger
andreas.Weininger@de.ibm.com
@aweininger