SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
Polymorphic Table Functions:
The best way to
integrate SQL and Spark
Andreas Weininger
andreas.Weininger@de.ibm.com
@aweininger
Please note
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice and at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information
about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality
described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance that
any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
Notices and disclaimers
© 2020 International Business Machines Corporation. No part of this document
may be reproduced or transmitted in any form without written permission from
IBM.
U.S. Government Users Restricted Rights — use, duplication or disclosure
restricted by GSA ADP Schedule Contract with IBM.
This document is current as of the initial date of publication and may be changed
by IBM at any time. Not all offerings are available in every country in which IBM
operates.
Information in these presentations (including information relating to products that
have not yet been announced by IBM) has been reviewed for accuracy as of the
date of initial publication and could include unintentional technical or
typographical errors. IBM shall have no responsibility to update this information.
This document is distributed “as is” without any warranty, either express or
implied. In no event, shall IBM be liable for any damage arising from the use of this
information, including but not limited to, loss of data, business interruption, loss
of profit or loss of opportunity. IBM products and services are warranted per the
terms and conditions of the agreements under which they are provided. The
performance data and client examples cited are presented for illustrative
purposes only. Actual performance results may vary depending on specific
configurations and operating conditions.
IBM products are manufactured from new parts or new and used parts.
In some cases, a product may not be new and may have been previously installed.
Regardless, our warranty terms apply.”
Any statements regarding IBM's future direction, intent or product plans
are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in a controlled,
isolated environments. Customer examples are presented as illustrations of how
those customers have used IBM products and the results they may have
achieved. Actual performance, cost, savings or other results in other
operating environments may vary.
References in this document to IBM products, programs, or services does not
imply that IBM intends to make such products, programs or services available in all
countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by
independent session speakers, and do not necessarily reflect the views of
IBM. All materials and discussions are provided for informational purposes only,
and are neither intended to, nor shall constitute legal or other guidance or advice
to any individual participant or their specific situation.
Notices and disclaimers
continued
It is the customer’s responsibility to insure its own compliance with legal
requirements and to obtain advice of competent legal counsel as to
the identification and interpretation of any relevant laws and regulatory
requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal
advice or represent or warrant that its services or products will ensure that
the customer follows any law.
Information concerning non-IBM products was obtained from the suppliers of
those products, their published announcements or other publicly available
sources. IBM has not tested those products about this publication and cannot
confirm the accuracy of performance, compatibility or any other claims related to
non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party
products, or the ability of any such third-party products to interoperate with IBM’s
products. IBM expressly disclaims all warranties, expressed or implied, including
but not limited to, the implied warranties of merchantability and fitness for a
purpose.
The provision of the information contained herein is not intended to, and does not,
grant any right or license under any IBM patents, copyrights, trademarks or other
intellectual property right.
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service
names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at “Copyright and trademark information” at:
www.ibm.com/legal/copytrade.shtml.
Agenda
Motivation
Traditional ways of integrating Spark and SQL
Shortcomings of the traditional approach
Polymorphic Table Functions
What are PTFs?
How do PTFs work?
Use cases for PTFs
Conclusions
Motivation
Traditional Way of
Integrating Spark and Data Stores
▪ Spark reads data from and writes
data to data stores (e.g. JDBC,
Parquet, …)
▪ Spark initiates communication
▪ Spark processes data which it read
and writes data which it generated
▪ Spark is in the driver seat
Data Store
Spark is in the
driver seat
A New Way of
Integrating Spark and Data Stores
▪ The data store is in the driver seat
▪ It uses the result of a Spark
computation as a table
▪ The data store delegates complex
transformations on tables to Spark
▪ Solution:
Polymorphic Table Functions (PTF)
Data Store
Data store is in
the driver seat
Polymorphic Table Functions
What are Polymorphic Table Functions?
▪ A table function is used to invoke the Spark code
▪ The result of a computation in Spark is presented as relational table
▪ Why polymorphic?
▪ Single function is used to execute all kinds of Spark code
▪ This table function can return many different types of tables
▪ Different number of columns
▪ Different types of columns
▪ Tables may in addition be passed as arguments to the Spark
computation
▪ Table function may be used like a normal table in SQL (e.g. for joins
with other tables or table functions, for filtering, etc.)
Example: Polymorphic Table Functions in Big SQL
▪ Polymorphic table function is called EXECSPARK
▪ Class which implements the SparkPtf interface must be passed as
argument to EXECSPARK
▪ Class may be written in Scala or Java
▪ Class must implement four methods:
Method Purpose
describe Used by Big SQL to understand which columns with which data type are returned
execute Called during runtime. Implements the actual Spark computation
destroy Used for cleanup
cardinality Used by the optimizer of Big SQL to estimate the size of the returned table
PTF: A Complete Example: Reading a JSON File
Calling SQL Statement
select cast(country as varchar(10)) c
from table(
SYSHADOOP.EXECSPARK(
class => ‘org.examples.ReadJsonFileSecure’,
uri => ‘/user/bigsql/tmp/demo.json’
)
) as doc where country is not null
order by c ;
PTF: A Complete Example: Reading a JSON File
PTF Implementation: Imports
package org.examples;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import com.ibm.biginsights.bigsql.spark.SparkPtf;
PTF: A Complete Example: Reading a JSON File
PTF Implementation: Class Definition & describe Method
public class ReadJsonFileSecure implements SparkPtf {
Dataset<Row> df = null;
@Override
public StructType describe(SQLContext ctx,
Map<String, Object> arguments) {
df = ctx.read().json((String)arguments.get(“URI”));
df.cache();
return df.schema();
}
PTF: A Complete Example: Reading a JSON File
PTF Implementation: execute Method
@Override
public Dataset<Row> execute(SQLContext ctx,
Map<String, Object> arguments) {
if (df == null) {
df = ctx.read().json((String)arguments.get(“URI”));
df.cache();
}
return df;
}
PTF: A Complete Example: Reading a JSON File
PTF Implementation: destroy & cardinality Methods
@Override
public void destroy(SQLContext ctx,
Map<String, Object> arguments) {
if (df != null) { df = null;
}
}
@Override
public long cardinality(SQLContext ctx,
Map<String, Object> arguments) {
if (arguments.get(“CARD”) == null)
return 100;
else
return (long) arguments.get(“CARD”);
}
}
What is IBM Db2 Big SQL?
▪ Advanced, fully-featured SQL engine for physical and virtual data
lakes on Hadoop and/or Cloud Object Storage
▪ Focusses on
▪ data virtualization,
▪ SQL compatibility,
▪ scalability,
▪ performance,
▪ enterprise security/governance
▪ Supports queries on Hive and Hbase data
▪ Uses Hive Metastore
▪ Supports CSV, Parquet, ORC, …
▪ Bidirectional integration with Spark:
▪ Spark JDBC data source enables execution of Big SQL queries from Spark and consumes the results as data frames
▪ Polymorphic table function enables execution of Spark jobs from Big SQL and consumes the results as tables.
IBM Cloud / © 2020 IBM Corporation
Virtualized
environment to
query
heterogeneous
data
SQL-on-Hadoop to virtualizes more than 10 different data sources: RDBMS, NoSQL, HDFS or
Object Store with efficient query rewrite and predicate pushdown
MS SQL
Server Netezza Oracle PostgreSQL Teradata
DB2
LUW, Db2z, DB2 on iInformix
WebHDFS
Object Store
(S3)
Hive HBase HDFS
Hadoop
Db2 Big SQL
NoSQL
ML
Model
Transparent
High function
Autonomous
High performance
• Appears to be one source
• Programmers don’t need to know how / where data is stored
• Full SQL support against all data
• Capabilities of sources as well
• Non-disruptive to data sources, existing applications,
systems.
• Optimization of distributed queries
Efficient Integration
• Low latency from Spark jobs executing on long running Spark app that is
co-located with Big SQL engine (no spark-submit per job)
• High throughput data movement by exploiting parallelism and local data
transfer at the nodes
Big SQL Workers co-located with Spark Executors
Head Node
Big SQL Head Spark Driver
Worker Node
Big SQL Worker Spark Executor
Worker Node
Big SQL Worker Spark Executor
Worker Node
Big SQL Worker Spark Executor
…
HDFS
Job
Data Data Data
data flow
control flow
across engines
control/data flow
within engine
How it works
Spark DriverBigSQL Head
Head Node
Spark ExecutorBigSQL Worker
Worker Node
DataFrame
Partitions
HDFS
Local
Storage
SELECT *
FROM TABLE(
EXECSPARK(
class => ‘com….PTF’, …))
describe
schema
execute
scan
stages/
tasks
class
High-throughput data transfer
• Parallelism across nodes and within nodes
• DataFrame partitions are streamed from Spark executor to BigSQL worker as soon as Spark
produce them
• Partition data is serialized in BigSQL runtime format on the Spark side
• No deserialization overhead once data arrives in BigSQL
Spark ExecutorBigSQL Worker
DataFrame
Partitions
Worker Nodes
Use Cases
▪ Data virtualization to many data sources otherwise not supported
▪ All sources for which there is a Spark DataSource become available in
SQL engine
▪ Complex transformations that are hard/impossible to express in SQL
become possible
▪ Exploitation of Spark libraries and external packages in SQL engine
▪ ETL
▪ Machine learning
▪ Custom aggregates
Example
SELECT people.firstname, people.lastname, people.phone
FROM TABLE(execspark(
class => 'com.ibm.bigsql.DataSource',
format => ‘jdbc’,
url => ‘jdbc:mysql://localhost/mydb’,
driver => ‘com.mysql.jdbc.Driver’,
user => ‘joe’,
password => ‘joespwd’
load => ‘default.people’
)) AS people
WHERE people.age > 55
Using DataSource PTF to Query a Mysql Table
Example:
SELECT DESCRIBE doc.*
FROM TABLE(execspark(
class => 'com.ibm.bigsql.DataSource',
format => ‘csv’,
skipHeader = true
load => ‘hdfs://user/guest/sample.csv’
)) AS doc
Using DataSource PTF to Describe a CSV File
Example:
SELECT people.firstname, people.lastname, people.phone
FROM TABLE(execspark(
class => 'com.ibm.bigsql.DataSource',
format => ‘parquet’,
load => ‘hdfs://user/guest/people.parquet’
)) AS people
WHERE people.age > 55
Using DataSource PTF to Query a Parquet File
Conclusions
Summary
Polymorphic table functions
▪ Provide a fast and scalable way to transfer data from Spark to SQL
engine with SQL engine in control
▪ No intermediate storage necessary like in naïve solution
▪ Many possible use cases supported
▪ Leverage the analytics capabilities of Spark in SQL engine
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
Andreas Weininger
andreas.Weininger@de.ibm.com
@aweininger

Mais conteúdo relacionado

Mais procurados

SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data ArchitectureSplunk
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 
Flash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lonFlash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lonJeffrey T. Pollock
 
Fighting Financial Crime with Artificial Intelligence
Fighting Financial Crime with Artificial IntelligenceFighting Financial Crime with Artificial Intelligence
Fighting Financial Crime with Artificial IntelligenceDataWorks Summit
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseJeffrey T. Pollock
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data ScientistsDomino Data Lab
 
CI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. HuntCI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. HuntDatabricks
 
Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7mmathipra
 
Big Data Scotland 2017
Big Data Scotland 2017Big Data Scotland 2017
Big Data Scotland 2017Ray Bugg
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
 
Fit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownFit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownInside Analysis
 
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)Jeffrey T. Pollock
 
Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Steven Moy
 
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...Databricks
 
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics PlatformDriven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics PlatformArne Roßmann
 
One Slide Overview: ORCL Big Data Integration and Governance
One Slide Overview: ORCL Big Data Integration and GovernanceOne Slide Overview: ORCL Big Data Integration and Governance
One Slide Overview: ORCL Big Data Integration and GovernanceJeffrey T. Pollock
 

Mais procurados (20)

SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
 
The Manulife Journey
The Manulife JourneyThe Manulife Journey
The Manulife Journey
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Flash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lonFlash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lon
 
Fighting Financial Crime with Artificial Intelligence
Fighting Financial Crime with Artificial IntelligenceFighting Financial Crime with Artificial Intelligence
Fighting Financial Crime with Artificial Intelligence
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San Jose
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data Scientists
 
CI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. HuntCI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. Hunt
 
Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7
 
Big Data Scotland 2017
Big Data Scotland 2017Big Data Scotland 2017
Big Data Scotland 2017
 
Stream based Data Integration
Stream based Data IntegrationStream based Data Integration
Stream based Data Integration
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...
 
Fit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownFit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data Letdown
 
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
 
Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019
 
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
 
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics PlatformDriven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
 
One Slide Overview: ORCL Big Data Integration and Governance
One Slide Overview: ORCL Big Data Integration and GovernanceOne Slide Overview: ORCL Big Data Integration and Governance
One Slide Overview: ORCL Big Data Integration and Governance
 

Semelhante a Polymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark

Was liberty elastic clusters and centralised admin
Was liberty   elastic clusters and centralised adminWas liberty   elastic clusters and centralised admin
Was liberty elastic clusters and centralised adminsflynn073
 
Radically Simple Management & Assembly of API-based Applications
Radically Simple Management & Assembly of API-based ApplicationsRadically Simple Management & Assembly of API-based Applications
Radically Simple Management & Assembly of API-based Applicationsvinodmut
 
Improving Software Delivery with Software Defined Environments (IBM Interconn...
Improving Software Delivery with Software Defined Environments (IBM Interconn...Improving Software Delivery with Software Defined Environments (IBM Interconn...
Improving Software Delivery with Software Defined Environments (IBM Interconn...Michael Elder
 
Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...
Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...
Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...Marc Nehme
 
How to Containerize WebSphere Application Server Traditional, and Why You Mig...
How to Containerize WebSphere Application Server Traditional, and Why You Mig...How to Containerize WebSphere Application Server Traditional, and Why You Mig...
How to Containerize WebSphere Application Server Traditional, and Why You Mig...David Currie
 
Informix REST API Tutorial
Informix REST API TutorialInformix REST API Tutorial
Informix REST API TutorialBrian Hughes
 
Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...
Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...
Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...paul young cpa, cga
 
Highly successful performance tuning of an informix database
Highly successful performance tuning of an informix databaseHighly successful performance tuning of an informix database
Highly successful performance tuning of an informix databaseIBM_Info_Management
 
Think 2018 - MicroProfile OpenAPI
Think 2018  - MicroProfile OpenAPIThink 2018  - MicroProfile OpenAPI
Think 2018 - MicroProfile OpenAPIArthur De Magalhaes
 
OpenWhisk Part 1 Research Data at Interconnect 2017
OpenWhisk Part 1 Research Data at Interconnect 2017OpenWhisk Part 1 Research Data at Interconnect 2017
OpenWhisk Part 1 Research Data at Interconnect 2017Perry Cheng
 
Improving Predictability and Efficiency with Kanban Metrics using Rational In...
Improving Predictability and Efficiency with Kanban Metrics using Rational In...Improving Predictability and Efficiency with Kanban Metrics using Rational In...
Improving Predictability and Efficiency with Kanban Metrics using Rational In...Paulo Lacerda
 
Accelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsAccelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsIBM
 
Managing integration in a multi cluster world
Managing integration in a multi cluster worldManaging integration in a multi cluster world
Managing integration in a multi cluster worldShikha Srivastava
 
Session 2546 - Solving Performance Problems in CICS using CICS Performance A...
Session 2546 -  Solving Performance Problems in CICS using CICS Performance A...Session 2546 -  Solving Performance Problems in CICS using CICS Performance A...
Session 2546 - Solving Performance Problems in CICS using CICS Performance A...nick_garrod
 
TI 1641 - delivering enterprise software at the speed of cloud
TI 1641 - delivering enterprise software at the speed of cloudTI 1641 - delivering enterprise software at the speed of cloud
TI 1641 - delivering enterprise software at the speed of cloudVincent Burckhardt
 
Smarter Documentation: Shedding Light on the Black Box of Reporting Data
Smarter Documentation: Shedding Light on the Black Box of Reporting DataSmarter Documentation: Shedding Light on the Black Box of Reporting Data
Smarter Documentation: Shedding Light on the Black Box of Reporting DataKelly Raposo
 
Scaling Your Microservices With LoopBack
Scaling Your Microservices With LoopBackScaling Your Microservices With LoopBack
Scaling Your Microservices With LoopBackRitchie Martori
 
OpenWhisk Introduction
OpenWhisk IntroductionOpenWhisk Introduction
OpenWhisk IntroductionIoana Baldini
 

Semelhante a Polymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark (20)

Was liberty elastic clusters and centralised admin
Was liberty   elastic clusters and centralised adminWas liberty   elastic clusters and centralised admin
Was liberty elastic clusters and centralised admin
 
Radically Simple Management & Assembly of API-based Applications
Radically Simple Management & Assembly of API-based ApplicationsRadically Simple Management & Assembly of API-based Applications
Radically Simple Management & Assembly of API-based Applications
 
Improving Software Delivery with Software Defined Environments (IBM Interconn...
Improving Software Delivery with Software Defined Environments (IBM Interconn...Improving Software Delivery with Software Defined Environments (IBM Interconn...
Improving Software Delivery with Software Defined Environments (IBM Interconn...
 
Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...
Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...
Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In...
 
How to Containerize WebSphere Application Server Traditional, and Why You Mig...
How to Containerize WebSphere Application Server Traditional, and Why You Mig...How to Containerize WebSphere Application Server Traditional, and Why You Mig...
How to Containerize WebSphere Application Server Traditional, and Why You Mig...
 
Informix REST API Tutorial
Informix REST API TutorialInformix REST API Tutorial
Informix REST API Tutorial
 
Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...
Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...
Vision 2016 fpm 1072 - tips on using ibm cognos command center with ibm plann...
 
Highly successful performance tuning of an informix database
Highly successful performance tuning of an informix databaseHighly successful performance tuning of an informix database
Highly successful performance tuning of an informix database
 
Think 2018 - MicroProfile OpenAPI
Think 2018  - MicroProfile OpenAPIThink 2018  - MicroProfile OpenAPI
Think 2018 - MicroProfile OpenAPI
 
Gateway deepdive
Gateway deepdiveGateway deepdive
Gateway deepdive
 
OpenWhisk Part 1 Research Data at Interconnect 2017
OpenWhisk Part 1 Research Data at Interconnect 2017OpenWhisk Part 1 Research Data at Interconnect 2017
OpenWhisk Part 1 Research Data at Interconnect 2017
 
Improving Predictability and Efficiency with Kanban Metrics using Rational In...
Improving Predictability and Efficiency with Kanban Metrics using Rational In...Improving Predictability and Efficiency with Kanban Metrics using Rational In...
Improving Predictability and Efficiency with Kanban Metrics using Rational In...
 
Accelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsAccelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUs
 
Managing integration in a multi cluster world
Managing integration in a multi cluster worldManaging integration in a multi cluster world
Managing integration in a multi cluster world
 
Session 2546 - Solving Performance Problems in CICS using CICS Performance A...
Session 2546 -  Solving Performance Problems in CICS using CICS Performance A...Session 2546 -  Solving Performance Problems in CICS using CICS Performance A...
Session 2546 - Solving Performance Problems in CICS using CICS Performance A...
 
TI 1641 - delivering enterprise software at the speed of cloud
TI 1641 - delivering enterprise software at the speed of cloudTI 1641 - delivering enterprise software at the speed of cloud
TI 1641 - delivering enterprise software at the speed of cloud
 
Smarter Documentation: Shedding Light on the Black Box of Reporting Data
Smarter Documentation: Shedding Light on the Black Box of Reporting DataSmarter Documentation: Shedding Light on the Black Box of Reporting Data
Smarter Documentation: Shedding Light on the Black Box of Reporting Data
 
Scaling Your Microservices With LoopBack
Scaling Your Microservices With LoopBackScaling Your Microservices With LoopBack
Scaling Your Microservices With LoopBack
 
API management with GraphQL
API management with GraphQLAPI management with GraphQL
API management with GraphQL
 
OpenWhisk Introduction
OpenWhisk IntroductionOpenWhisk Introduction
OpenWhisk Introduction
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 

Último (20)

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 

Polymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark

  • 1. Polymorphic Table Functions: The best way to integrate SQL and Spark Andreas Weininger andreas.Weininger@de.ibm.com @aweininger
  • 2. Please note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  • 3. Notices and disclaimers © 2020 International Business Machines Corporation. No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights — use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed “as is” without any warranty, either express or implied. In no event, shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted per the terms and conditions of the agreements under which they are provided. The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.” Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
  • 4. Notices and disclaimers continued It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer follows any law. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products about this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a purpose. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at: www.ibm.com/legal/copytrade.shtml.
  • 5. Agenda Motivation Traditional ways of integrating Spark and SQL Shortcomings of the traditional approach Polymorphic Table Functions What are PTFs? How do PTFs work? Use cases for PTFs Conclusions
  • 7. Traditional Way of Integrating Spark and Data Stores ▪ Spark reads data from and writes data to data stores (e.g. JDBC, Parquet, …) ▪ Spark initiates communication ▪ Spark processes data which it read and writes data which it generated ▪ Spark is in the driver seat Data Store Spark is in the driver seat
  • 8. A New Way of Integrating Spark and Data Stores ▪ The data store is in the driver seat ▪ It uses the result of a Spark computation as a table ▪ The data store delegates complex transformations on tables to Spark ▪ Solution: Polymorphic Table Functions (PTF) Data Store Data store is in the driver seat
  • 10. What are Polymorphic Table Functions? ▪ A table function is used to invoke the Spark code ▪ The result of a computation in Spark is presented as relational table ▪ Why polymorphic? ▪ Single function is used to execute all kinds of Spark code ▪ This table function can return many different types of tables ▪ Different number of columns ▪ Different types of columns ▪ Tables may in addition be passed as arguments to the Spark computation ▪ Table function may be used like a normal table in SQL (e.g. for joins with other tables or table functions, for filtering, etc.)
  • 11. Example: Polymorphic Table Functions in Big SQL ▪ Polymorphic table function is called EXECSPARK ▪ Class which implements the SparkPtf interface must be passed as argument to EXECSPARK ▪ Class may be written in Scala or Java ▪ Class must implement four methods: Method Purpose describe Used by Big SQL to understand which columns with which data type are returned execute Called during runtime. Implements the actual Spark computation destroy Used for cleanup cardinality Used by the optimizer of Big SQL to estimate the size of the returned table
  • 12. PTF: A Complete Example: Reading a JSON File Calling SQL Statement select cast(country as varchar(10)) c from table( SYSHADOOP.EXECSPARK( class => ‘org.examples.ReadJsonFileSecure’, uri => ‘/user/bigsql/tmp/demo.json’ ) ) as doc where country is not null order by c ;
  • 13. PTF: A Complete Example: Reading a JSON File PTF Implementation: Imports package org.examples; import java.util.Map; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SQLContext; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import com.ibm.biginsights.bigsql.spark.SparkPtf;
  • 14. PTF: A Complete Example: Reading a JSON File PTF Implementation: Class Definition & describe Method public class ReadJsonFileSecure implements SparkPtf { Dataset<Row> df = null; @Override public StructType describe(SQLContext ctx, Map<String, Object> arguments) { df = ctx.read().json((String)arguments.get(“URI”)); df.cache(); return df.schema(); }
  • 15. PTF: A Complete Example: Reading a JSON File PTF Implementation: execute Method @Override public Dataset<Row> execute(SQLContext ctx, Map<String, Object> arguments) { if (df == null) { df = ctx.read().json((String)arguments.get(“URI”)); df.cache(); } return df; }
  • 16. PTF: A Complete Example: Reading a JSON File PTF Implementation: destroy & cardinality Methods @Override public void destroy(SQLContext ctx, Map<String, Object> arguments) { if (df != null) { df = null; } } @Override public long cardinality(SQLContext ctx, Map<String, Object> arguments) { if (arguments.get(“CARD”) == null) return 100; else return (long) arguments.get(“CARD”); } }
  • 17. What is IBM Db2 Big SQL? ▪ Advanced, fully-featured SQL engine for physical and virtual data lakes on Hadoop and/or Cloud Object Storage ▪ Focusses on ▪ data virtualization, ▪ SQL compatibility, ▪ scalability, ▪ performance, ▪ enterprise security/governance ▪ Supports queries on Hive and Hbase data ▪ Uses Hive Metastore ▪ Supports CSV, Parquet, ORC, … ▪ Bidirectional integration with Spark: ▪ Spark JDBC data source enables execution of Big SQL queries from Spark and consumes the results as data frames ▪ Polymorphic table function enables execution of Spark jobs from Big SQL and consumes the results as tables.
  • 18. IBM Cloud / © 2020 IBM Corporation Virtualized environment to query heterogeneous data SQL-on-Hadoop to virtualizes more than 10 different data sources: RDBMS, NoSQL, HDFS or Object Store with efficient query rewrite and predicate pushdown MS SQL Server Netezza Oracle PostgreSQL Teradata DB2 LUW, Db2z, DB2 on iInformix WebHDFS Object Store (S3) Hive HBase HDFS Hadoop Db2 Big SQL NoSQL ML Model Transparent High function Autonomous High performance • Appears to be one source • Programmers don’t need to know how / where data is stored • Full SQL support against all data • Capabilities of sources as well • Non-disruptive to data sources, existing applications, systems. • Optimization of distributed queries
  • 19. Efficient Integration • Low latency from Spark jobs executing on long running Spark app that is co-located with Big SQL engine (no spark-submit per job) • High throughput data movement by exploiting parallelism and local data transfer at the nodes Big SQL Workers co-located with Spark Executors Head Node Big SQL Head Spark Driver Worker Node Big SQL Worker Spark Executor Worker Node Big SQL Worker Spark Executor Worker Node Big SQL Worker Spark Executor … HDFS Job Data Data Data data flow control flow across engines control/data flow within engine
  • 20. How it works Spark DriverBigSQL Head Head Node Spark ExecutorBigSQL Worker Worker Node DataFrame Partitions HDFS Local Storage SELECT * FROM TABLE( EXECSPARK( class => ‘com….PTF’, …)) describe schema execute scan stages/ tasks class
  • 21. High-throughput data transfer • Parallelism across nodes and within nodes • DataFrame partitions are streamed from Spark executor to BigSQL worker as soon as Spark produce them • Partition data is serialized in BigSQL runtime format on the Spark side • No deserialization overhead once data arrives in BigSQL Spark ExecutorBigSQL Worker DataFrame Partitions Worker Nodes
  • 22. Use Cases ▪ Data virtualization to many data sources otherwise not supported ▪ All sources for which there is a Spark DataSource become available in SQL engine ▪ Complex transformations that are hard/impossible to express in SQL become possible ▪ Exploitation of Spark libraries and external packages in SQL engine ▪ ETL ▪ Machine learning ▪ Custom aggregates
  • 23. Example SELECT people.firstname, people.lastname, people.phone FROM TABLE(execspark( class => 'com.ibm.bigsql.DataSource', format => ‘jdbc’, url => ‘jdbc:mysql://localhost/mydb’, driver => ‘com.mysql.jdbc.Driver’, user => ‘joe’, password => ‘joespwd’ load => ‘default.people’ )) AS people WHERE people.age > 55 Using DataSource PTF to Query a Mysql Table
  • 24. Example: SELECT DESCRIBE doc.* FROM TABLE(execspark( class => 'com.ibm.bigsql.DataSource', format => ‘csv’, skipHeader = true load => ‘hdfs://user/guest/sample.csv’ )) AS doc Using DataSource PTF to Describe a CSV File
  • 25. Example: SELECT people.firstname, people.lastname, people.phone FROM TABLE(execspark( class => 'com.ibm.bigsql.DataSource', format => ‘parquet’, load => ‘hdfs://user/guest/people.parquet’ )) AS people WHERE people.age > 55 Using DataSource PTF to Query a Parquet File
  • 27. Summary Polymorphic table functions ▪ Provide a fast and scalable way to transfer data from Spark to SQL engine with SQL engine in control ▪ No intermediate storage necessary like in naïve solution ▪ Many possible use cases supported ▪ Leverage the analytics capabilities of Spark in SQL engine
  • 28. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions. Andreas Weininger andreas.Weininger@de.ibm.com @aweininger