Extending Spark for Qbeast's SQL Data Source
with Paola Pardo and Cesare Cugnasco
Barcelona Spark Meetup
24th of October 2019
From the research to the industry
At first it was Extraction Transformation Loading
Hybrid Transactional Analytical Processing
Then the Lambda architecture tried to reduce latency
Figure: A plot of the relative bandwidth of system components in the Titan supercomputer at the Oak Ridge Leadership Class Facility. Source: Bauer, Andrew C., et al. "In situ methods, infrastructures, and applications on high performance computing platforms."
Big Data HTAP: general design
● Fast consistent layer
○ Consistent and transactional (to various degrees)
○ Storage: memory, local storage
● Cheap/throughput layer
○ Weak consistency, high latency, immutable files
○ Storage: non-POSIX distributed file systems, object stores
● Query execution
○ On-demand resources, decoupled storage/CPU
○ Temporary storage: local disk, object stores
● Data flow: data ingestion, periodical flushes
Examples: Google's Procella, Snowflake
Big Data HTAP: min-max pruning, zone maps, Bloom filters
[Diagram: primary key partitions A and B; each file carries metadata (min/max, Bloom filter, range), which is also kept in a metadata server. Example range labels: June, May, March, June.]
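A minimal sketch of min-max (zone map) pruning, assuming each file exposes the minimum and maximum value of one column in its metadata. The names `FileMeta` and `prune` are hypothetical and do not come from any of the systems named in the talk.

```scala
// Hypothetical model of zone-map pruning; names are illustrative only.
object ZoneMapPruning {
  // Per-file metadata: the smallest and largest value of one column.
  final case class FileMeta(name: String, min: Double, max: Double)

  // Keep only the files whose [min, max] interval overlaps the
  // half-open query range [lower, upper); all other files are skipped
  // without being read.
  def prune(files: Seq[FileMeta], lower: Double, upper: Double): Seq[FileMeta] =
    files.filter(f => f.max >= lower && f.min < upper)
}
```

A Bloom filter would play the same role for equality predicates; it is left out here to keep the sketch short.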
Image credit: Nemo Jantzen, Lucky Me, 2015. Photography, acrylic, and glass spheres on wooden canvas
960 KB 7 KB
High-priority
Medium-priority
Low-priority
RAM
Persistent
memory
Local disk
Object storage
Cold storage
QDB: file layout
Original data OutlookTree
Metadata and buffer
in fast storage
Data in columnar format
in slower storage
Hybrid columnar row
Row data
Disk
S3
Optane
Columnar-to-row mapping based on the fact that the random priority = DHT token
Interactive Big Data Visualization
Outline
● Overview
○ Catalyst Optimizer
○ APIs
○ Spark-Cassandra
● Extensions
○ Sampling Pushdown
○ Multidimensional Filter Pushdown
● Future work
Overview
● Catalyst Optimizer
● DataSources APIs
○ Key Concepts
○ Examples
● Spark-Cassandra-Connector
○ CassandraSourceRelation
Catalyst Optimizer
User Query
SELECT sum(v)
FROM (
  SELECT t1.id, t1.value + 1 + 2 AS v
  FROM t1 JOIN t2
  WHERE t1.id == t2.id AND t2.id > 50
) tmp
● Expressions
○ New value computed on input values
● Attributes
○ Column of a data collection
○ Dataset, data operation
Unresolved Plan (for the user query above, top to bottom)
AGG
PROJECT
FILTER
JOIN
UnresolvedRelation t1, UnresolvedRelation t2
Analysis
JOIN
UnresolvedRelation t1 UnresolvedRelation t2
JOIN
MyCustomRelation t1 MyCustomRelation t2
Metadata
● Tree
○ Abstraction of the user's program
○ Node objects
● Rules
○ Transform the tree
○ Logical Optimization
○ Heuristics
Logical Plan
SELECT t1.value+1+2 AS v
ADD(t1.value, ADD(Literal(1), Literal(2)))
Optimized Logical Plan
Before: ADD(t1.value, ADD(Literal(1), Literal(2)))
After: ADD(t1.value, Literal(3))
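The rewrite from `1 + 2` to `3` can be illustrated with a toy rule over a toy expression tree. This is only a sketch: `Expr`, `Attribute`, `Literal`, `Add` and `fold` are hypothetical stand-ins, not Catalyst's actual classes or rules.

```scala
// Toy expression tree; this is not Catalyst's actual class hierarchy.
object ConstantFolding {
  sealed trait Expr
  final case class Attribute(name: String) extends Expr
  final case class Literal(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Bottom-up rule: fold the children first, then collapse
  // Add(Literal, Literal) into a single Literal.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
    case other => other
  }
}
```

Applied to `Add(Attribute("t1.value"), Add(Literal(1), Literal(2)))`, `fold` returns `Add(Attribute("t1.value"), Literal(3))`, matching the optimized plan on the slide.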
Physical Planning
● Strategies
○ Set of transformations
○ E.g., selects the best join execution
● Rule executor
○ Ensure requirements
○ Apply physical optimization
Cost-based
DataSource API
● Key part to integrate data sources
○ How to read from and write to storage
○ Statistics
○ Physical Planning
● Hadoop, Hive
● Presto and Cassandra connectors
DataSource API
// Simplified from org.apache.spark.sql.sources (interfaces.scala)
trait RelationProvider {
  def createRelation(sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation
}
abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
  def unhandledFilters(filters: Array[Filter]): Array[Filter]
  def sizeInBytes: Long
  def needConversion: Boolean
}
trait TableScan {
  def buildScan(): RDD[Row]
}
class DefaultSource extends RelationProvider with SchemaRelationProvider {
  // Creates a relation with an undefined schema (null)
  override def createRelation(sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }
  // Takes the schema of the table and produces a MyCustomRelation
  override def createRelation(sqlContext: SQLContext,
      parameters: Map[String, String], schema: StructType): BaseRelation = {
    // The slide elides the first argument; reading it from the "path"
    // option is an assumption made here for illustration.
    new MyCustomRelation(parameters("path"), schema)(sqlContext)
  }
}
DataSource API
class MyCustomRelation(location: String, userSchema: StructType)
    (@transient val sqlContext: SQLContext)
  extends BaseRelation
  with Serializable {
  override def schema: StructType = {
    // implementation which returns a StructType
    // (built from a sequence of StructFields);
    // returning the user-supplied schema is the simplest case
    userSchema
  }
}
DataSource API: limitations
● Limited extension
● Lack of info about partitions
● Lack of columnar and streaming support
trait LimitedScan {
def buildScan(limit: Int): RDD[Row]
}
trait PrunedLimitedScan {
def buildScan(requiredColumns: Array[String],
limit: Int): RDD[Row]
}
trait PrunedFilteredLimitedScan {
def buildScan(requiredColumns: Array[String],
filters: Array[Filter], limit: Int): RDD[Row]
}
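(Illustrative only: pushing down a limit would require adding traits like these to org.apache.spark.sql.sources; they are not part of Spark.)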
DataSourceV2 API
● Written in Java, available since Spark 2.3
● ReadSupport or WriteSupport
● Own partitioner
● Mix in some Supports* interfaces
DataSourceV2 API
DataSourceV2
  with ReadSupport
  with ReadSupportWithSchema
DataSourceReader
  with SupportsPushDownFilters
  with SupportsPushDownRequiredColumns
  ...
InputPartition
InputPartitionReader
// Simplified from org.apache.spark.sql.sources.v2 (Spark 2.4)
public interface ReadSupport extends DataSourceV2 {
  DataSourceReader createReader(DataSourceOptions options);
}
public interface DataSourceReader {
  StructType readSchema();
  List<InputPartition<InternalRow>> planInputPartitions();
}
public interface SupportsPushDownRequiredColumns extends DataSourceReader {
  void pruneColumns(StructType requiredSchema);
}
public interface InputPartition<T> extends Serializable {
  InputPartitionReader<T> createPartitionReader();
}
public interface InputPartitionReader<T> extends Closeable {
  boolean next() throws IOException;
  T get();
}
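The `next()` / `get()` contract of the partition reader can be sketched without Spark on the classpath. In the block below, `PartitionReader` is a simplified stand-in for the Java interface above, and `SeqReader` and `readAll` are hypothetical helpers introduced only for illustration.

```scala
object ReaderProtocol {
  // Simplified stand-in for the Java InputPartitionReader<T> interface.
  trait PartitionReader[T] extends java.io.Closeable {
    def next(): Boolean // advance; false when the partition is exhausted
    def get(): T        // current record; valid only after next() returned true
  }

  // Hypothetical reader over an in-memory partition.
  final class SeqReader[T](rows: Seq[T]) extends PartitionReader[T] {
    private val it = rows.iterator
    private var current: Option[T] = None
    def next(): Boolean = {
      current = if (it.hasNext) Some(it.next()) else None
      current.isDefined
    }
    def get(): T =
      current.getOrElse(throw new NoSuchElementException("call next() first"))
    def close(): Unit = ()
  }

  // Drain a reader in the order the contract prescribes: next(), then get(),
  // and always close() at the end.
  def readAll[T](reader: PartitionReader[T]): List[T] =
    try {
      val out = List.newBuilder[T]
      while (reader.next()) out += reader.get()
      out.result()
    } finally reader.close()
}
```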
Spark-Cassandra-Connector
● DataStax open-source
● RDDs, DataFrames and CQL
CassandraSourceRelation
PrunedFilteredScan InsertableRelation
BaseRelation:
● schema
● sizeInBytes
● unhandledFilters
private[cassandra] class
CassandraSourceRelation(
tableRef: TableRef,
userSpecifiedSchema: Option[StructType],
filterPushdown: Boolean,
confirmTruncate: Boolean,
tableSizeInBytes: Option[Long],
connector: CassandraConnector,
readConf: ReadConf,
writeConf: WriteConf,
sparkConf: SparkConf,
override val sqlContext:
SQLContext)
extends BaseRelation
with InsertableRelation
with PrunedFilteredScan
with Logging
org.apache.spark.sql.cassandra.CassandraConn...
CassandraSourceRelation
PrunedFilteredScan
● Column Pruning
○ Discard columns
● Filter Pushdown
○ Discard rows
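The two halves of a pruned, filtered scan can be illustrated on plain Scala collections. This sketch assumes rows are maps from column name to value and models a single filter kind; `Row`, `GreaterThan` and `buildScan` here are hypothetical stand-ins, not Spark's classes.

```scala
object PrunedFilteredScanSketch {
  type Row = Map[String, Double]

  // Stand-in for Spark's source filters; only one filter kind is modeled.
  final case class GreaterThan(column: String, value: Double)

  // Filter pushdown discards rows; column pruning discards columns.
  def buildScan(data: Seq[Row], requiredColumns: Seq[String],
                filters: Seq[GreaterThan]): Seq[Row] =
    data
      .filter(row => filters.forall(f => row.get(f.column).exists(_ > f.value)))
      .map(row => row.filter { case (name, _) => requiredColumns.contains(name) })
}
```

Both steps run at the source here, so only the surviving rows and columns would travel to the query engine.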
Limitations
● DataSource API
● Pushdown restrictions
○ Filtering on only one column
○ No custom index support
Extensions
● Scenario
● Sampling Pushdown
○ Sample Operator
○ Changes
● Multidimensional Filter Pushdown
○ Filter Pushdown
○ Changes
Scenario
A Qbeast-indexed table and query examples:
CREATE TABLE keyspace.table (
  id double PRIMARY KEY,
  x double,
  y double,
  z double
);
CREATE CUSTOM INDEX IF NOT EXISTS table_idx
  ON keyspace.table (x, y, z)
SELECT * FROM keyspace.table
WHERE x >= 0.1826763 AND x < 0.5555
  AND y >= 1.9 AND y < 2.863653
  AND z >= 0.1 AND z < 10.78645
SELECT * FROM keyspace.table
WHERE expr(table_idx, 'precision=0.1')
Scenario (continued)
● The range query on x, y and z is the target of filter pushdown
● The expr(table_idx, 'precision=0.1') query is the target of sampling pushdown
Sample Operator on Spark
● Sample
○ lower/upper bound
○ with/without replacement
○ seed
SELECT * FROM keyspace.table TABLESAMPLE (5 ROWS)
SELECT * FROM keyspace.table TABLESAMPLE (10 PERCENT)
df.sample(...)
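How the bound and seed parameters interact can be sketched with a small Bernoulli sampler over an in-memory sequence. This is an illustration under simplifying assumptions (sampling without replacement only, one random draw per row); `BernoulliSample` is a hypothetical name and does not reproduce Spark's sampler implementation.

```scala
import scala.util.Random

object BernoulliSample {
  // Keep a row when its random draw falls in [lowerBound, upperBound).
  // The seed makes the result reproducible across runs.
  def sample[T](rows: Seq[T], lowerBound: Double, upperBound: Double,
                seed: Long): Seq[T] = {
    val rng = new Random(seed)
    rows.filter { _ =>
      val draw = rng.nextDouble()
      draw >= lowerBound && draw < upperBound
    }
  }
}
```

With bounds `0.0` and `0.05` each row is kept with probability 5%, which corresponds to `TABLESAMPLE (5 PERCENT)`; the exact number of returned rows varies with the seed.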
Sampling Pushdown
Catalyst Optimizer
DataSource API
CassandraSourceRelation
● Filter Pushdown
● Column Pruning
● Sampling with Qbeast?
● Filter Pushdown
● Column Pruning
● Sampling Pushdown?
● New interfaces for the Scan
● New method to detect sampling
operator and Datasource
Sampling Pushdown
DataSource API: new scan interfaces
● SampledScan
● PrunedSampledScan
● SampledFilteredScan
● PrunedSampledFilteredScan
@InterfaceStability.Stable
trait SampledFilteredScan {
def buildScan(filters: Array[Filter], sample:
Sample): RDD[Row]
}
@InterfaceStability.Stable
trait PrunedSampledScan {
def buildScan(requiredColumns: Array[String],
sample: Sample): RDD[Row]
}
@InterfaceStability.Stable
trait SampledScan {
def buildScan(sample: Sample): RDD[Row]
}
Sampling Pushdown
@InterfaceStability.Stable
trait PrunedSampledFilteredScan {
def pushSampling(sample: Sample): Boolean
def buildScan(requiredColumns: Array[String],
filters: Array[Filter], sample: Sample): RDD[Row]
}
org.apache.spark.sql.sources.interfaces
case s @ Sample(_, _, _, _, physical_op @ PhysicalOperation(p, f, l:
LogicalRelation)) =>
l.relation match {
case scan: PrunedSampledFilteredScan if scan.pushSampling(s) =>
pruneFilterProject(
l,
p,
f,
(a, f) => toCatalystRDD(l, a,
scan.buildScan(a.map(_.name).toArray, f, s))) :: Nil
case _ => Nil
}
Sampling Pushdown
org.apache.spark.sql.execution.datasources.DataSourceStrategy
Sampling Pushdown
Processing the pushdown:
1. User-level option to push down sampling
2. Detection of Sample
3. Analysis
4. Write CQL expression to query the index
5. Let Qbeast handle it again!
Sampling Pushdown
private[cassandra] class CassandraSourceRelation(
    // other stuff
    sampling: Boolean,
    override val sqlContext: SQLContext)
  extends BaseRelation
  with InsertableRelation
  with PrunedFilteredScan
  with PrunedSampledFilteredScan
  with Logging {

  override def pushSampling(sample: Sample): Boolean = {
    // check if the table is indexed and the user wants to
    // push down the operator
    ???
  }

  override def buildScan(requiredColumns: Array[String],
      filters: Array[Filter], sample: Sample): RDD[Row] = {
    // construct the index CQL code and push it through the scan
    ???
  }
}
org.apache.spark.sql.cassandra.CassandraConn...
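Step 4 of the pushdown, writing a CQL expression that queries the index, might look like the sketch below. It rests on a loudly stated assumption that is not confirmed by the slides: that the `precision` option of the index expression equals the sampled fraction (upper bound minus lower bound). `SamplingCql`, its local `Sample` case class and `toCql` are hypothetical names.

```scala
object SamplingCql {
  // Mirrors the fields of the Sample operator shown in the slides.
  final case class Sample(lowerBound: Double, upperBound: Double,
                          withReplacement: Boolean, seed: Long)

  // ASSUMPTION: the index 'precision' option equals the sampled fraction.
  // Returns None when the sample cannot be expressed through the index,
  // in which case the caller would fall back to a regular scan.
  def toCql(keyspace: String, table: String, index: String,
            s: Sample): Option[String] =
    if (s.withReplacement) None
    else {
      val fraction = s.upperBound - s.lowerBound
      Some(s"SELECT * FROM $keyspace.$table " +
        s"WHERE expr($index, 'precision=$fraction')")
    }
}
```

Returning an `Option` keeps the decision of whether to push down separate from query construction, in the spirit of the `pushSampling` / `buildScan` split above.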
Sampling Pushdown
SELECT * from keyspace.table
TABLESAMPLE (5 PERCENT)
Simple Lookup
Sample(0.0, 0.05, false, 983653)
Full Table Scan
Filter Pushdown
Multidimensional Pruning
Catalyst Optimizer
DataSource API
CassandraSourceRelation
● Filter Pushdown
● Column Pruning
● Sampling with Qbeast
● Multidimensional pushdown?
● Filter Pushdown
● Column Pruning
● Sampling Pushdown
Multidimensional Pruning
Processing the pushdown:
1. Detect the index
2. Analyze the predicate
3. Push down the filters to Cassandra
4. Let Qbeast handle it!
private val qbeast = table.qbeastColumns.map(_.columnName)

/** Returns the set of predicates that contain double ranges
  * for the Qbeast index. */
val qbeastPredicatesToPushdown: Set[Predicate] = {
  // Columns that have both a "<" and a ">=" predicate
  val doubleRange = rangePredicatesByName.filter(p =>
    p._2.exists(Predicates.isLessThanPredicate) &&
    p._2.exists(Predicates.isGreaterThanOrEqualPredicate))
  // Push down only if every indexed column is bounded on both sides
  if (qbeast.toSet subsetOf doubleRange.keySet) {
    val eqQbeast = qbeast.flatMap(rangePredicatesByName)
    eqQbeast.toSet
  } else {
    Set.empty
  }
}
Multidimensional Pruning
val predicatesToPushDown: Set[Predicate] =
  partitionKeyPredicatesToPushDown ++
  clusteringColumnPredicatesToPushDown ++
  indexedColumnPredicatesToPushDown ++
  qbeastPredicatesToPushdown
org.apache.spark.sql.cassandra.BasicCassandraPredicatePushDown
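The rule behind the Qbeast predicate selection can be reproduced on a toy predicate model, free of connector dependencies. `Predicate`, `Gte`, `Lt` and `predicatesToPushDown` below are hypothetical simplifications, not the connector's types.

```scala
object QbeastPushdownSketch {
  sealed trait Predicate { def column: String }
  final case class Gte(column: String, value: Double) extends Predicate
  final case class Lt(column: String, value: Double) extends Predicate

  // Push down only when every indexed column has both a lower (>=) and
  // an upper (<) bound; otherwise push nothing.
  def predicatesToPushDown(indexed: Seq[String],
                           predicates: Seq[Predicate]): Set[Predicate] = {
    val byColumn = predicates.groupBy(_.column)
    val closed = byColumn.filter { case (_, ps) =>
      ps.exists(_.isInstanceOf[Gte]) && ps.exists(_.isInstanceOf[Lt])
    }
    if (indexed.toSet.subsetOf(closed.keySet))
      indexed.flatMap(c => closed(c)).toSet
    else Set.empty
  }
}
```

The all-or-nothing choice mirrors the code above: a multidimensional index query is only issued when the predicate describes a bounded range in every indexed dimension.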
Multidimensional Pushdown
SELECT * from keyspace.table
WHERE x >= 0.1826763 AND x < 0.5555
AND y >= 1.9 AND y < 2.863653
AND z >= 0.1 AND z < 10.78645
FILTER(isNotNull)
PrunedFilteredScan
FILTER(x,y, z, isNotNull)
Full Table Scan
Example
Future Work
● Dimensional Aware
● Join Strategy
● Storage
Dimensional Aware
● Useful for data locality strategies
● Physical planning
Join Strategy in Spark
● Shuffle-Hash-Join
● Broadcast-Join
● Sort-Merge-Join
Join on Qbeast
● Dimensional-aware data partitioning
● Speculative optimization on sampling
Integration with Arrow
● Save Qbeast data in Arrow
● Static column with file information
● Make analytics faster
● Spark support since 2.3
Future Work
● Dimensional Aware
● Join Strategy
● Storage
● DataSource V2
DataSourceV2
● New Java class
● New method to detect the sampling operator and data source
DataSource API v2: SupportsPushDownSampling
package org.apache.spark.sql.sources.v2.reader;
@InterfaceStability.Evolving
public interface SupportsPushDownSampling extends
DataSourceReader {
boolean pushSampling(Sample sample);
}
DataSourceV2
case s @ Sample(_, _, _, _, l @ PhysicalOperation(p, f, e: DataSourceV2Relation)) =>
//implementation of pruning and filter pushdown
ProjectExec(p, withFilter) :: Nil
case _ => Nil
}
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Último (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cugnasco

  • 1. Extending Spark for Qbeast's SQL Data Source with Paola Pardo and Cesare Cugnasco BarcelonaSpark Meetup 24th of October 2019
  • 2. From the research to the industry
  • 3. Hybrid Transactional Analytical Processing: at first it was Extraction, Transformation, Loading (ETL)
  • 4. Hybrid Transactional Analytical Processing: then the Lambda architecture tried to reduce latency
  • 5. A plot of the relative bandwidth of system components in the Titan supercomputer at the Oak Ridge Leadership Class Facility. Source: Bauer, Andrew C., et al. "In situ methods, infrastructures, and applications on high performance computing platforms."
  • 6. Big Data HTAP: general design ● Fast consistent layer ○ Consistent and transactional (to varying degrees) ○ Storage: memory, local storage ● Cheap/throughput layer ○ Weak consistency, high latency, immutable files ○ Storage: non-POSIX distributed file systems, object stores ● Query execution ○ On-demand resources, decoupled storage/CPU ○ Temporary storage: local disk, object stores ● Data flow: data ingestion, periodic flushes
  • 8. Big Data HTAP: min-max pruning, zone maps, bloom filters. Diagram: primary key partitions A and B; each block carries metadata (min/max, Bloom filter, range) registered in a metadata server; example block labels: June, May, March, June
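
As a rough illustration of the min/max (zone map) pruning named on this slide, here is a minimal sketch in plain Scala. It is not Qbeast or Spark code: `PartitionMeta` and `prune` are hypothetical names, and the model assumes a single numeric column with a half-open query range.

```scala
// Toy model of min/max (zone map) pruning. All names are hypothetical.
object ZoneMapSketch {
  // Metadata a metadata server might keep per block: the value range of one column.
  final case class PartitionMeta(name: String, min: Double, max: Double)

  // Keep only the blocks whose [min, max] range can overlap the query range [lo, hi).
  // Blocks that fail this check are skipped without reading any data.
  def prune(parts: Seq[PartitionMeta], lo: Double, hi: Double): Seq[PartitionMeta] =
    parts.filter(p => p.max >= lo && p.min < hi)
}
```

A query over `[5, 15)` against blocks covering `[0, 10]` and `[20, 30]` would read only the first block; a Bloom filter would play the same role for equality lookups.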
  • 15. Image credit: Nemo Jantzen, Lucky Me, 2015. Photography, acrylic, and glass spheres on wooden canvas
  • 16. Image credit: Nemo Jantzen, Lucky Me, 2015. Photography, acrylic, and glass spheres on wooden canvas. Image sizes shown: 960 KB and 7 KB
  • 19. QDB: file layout ● Original data ● OutlookTree ● Metadata and buffer in fast storage ● Data in columnar format in slower storage
  • 20. Hybrid columnar/row layout ● Row data ● Storage tiers: Disk, S3, Optane ● Columnar-to-row mapping is based on the fact that the random priority = DHT token
  • 21. Interactive Big Data Visualization
  • 22. Outline ● Overview ○ Catalyst Optimizer ○ APIs ○ Spark-Cassandra ● Extensions ○ Sampling Pushdown ○ Multidimensional Filter Pushdown ● Future work
  • 23. Overview ● Catalyst Optimizer ● DataSource APIs ○ Key Concepts ○ Examples ● Spark-Cassandra-Connector ○ CassandraSourceRelation
  • 25. User Query: SELECT sum(v) FROM (SELECT t1.id, t1.value+1+2 AS v FROM t1 JOIN t2 WHERE t1.id == t2.id AND t2.id > 50) ● Expressions ○ A new value computed from input values ● Attributes ○ A column of a data collection ○ Dataset, data operation
  • 26. Unresolved Plan (top to bottom): AGG, PROJECT, FILTER, JOIN over UnresolvedRelation t1 and UnresolvedRelation t2. Query: SELECT sum(v) FROM (SELECT t1.id, t1.value+1+2 AS v FROM t1 JOIN t2 WHERE t1.id == t2.id AND t2.id > 50)
  • 27. Analysis: using metadata, JOIN(UnresolvedRelation t1, UnresolvedRelation t2) is resolved to JOIN(MyCustomRelation t1, MyCustomRelation t2)
  • 28. Logical Plan ● Tree ○ Abstraction of the user's program ○ Node objects ● Rules ○ Transform the tree ○ Logical optimization ○ Heuristics. Example: SELECT t1.value+1+2 AS v is represented as Add(Add(t1.value, Literal(1)), Literal(2))
  • 29. Optimized Logical Plan: Add(Add(t1.value, Literal(1)), Literal(2)) is rewritten to Add(t1.value, Literal(3))
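
The rewrite from `t1.value+1+2` to `t1.value+3` can be sketched as a rule applied bottom-up over an expression tree. The code below is a toy model in plain Scala, not Catalyst itself: `Expr`, `transformUp` and `foldConstants` are stand-in names defined here, and the second rule case (re-associating two literals) is an assumption about how the slide's result is reached.

```scala
// Toy expression tree and rewrite rule, loosely modelled on the slide.
// These are NOT Spark's Catalyst classes; every name is defined locally.
object ConstantFoldingSketch {
  sealed trait Expr
  final case class Attribute(name: String) extends Expr
  final case class Literal(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Apply a rule to the children first, then to the rebuilt node.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val rebuilt = e match {
      case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
  }

  // Fold literal additions, and merge two literals separated by one Add.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b))         => Literal(a + b)
    case Add(Add(x, Literal(a)), Literal(b)) => Add(x, Literal(a + b))
  }
}
```

Applying `transformUp` with `foldConstants` to `Add(Add(Attribute("t1.value"), Literal(1)), Literal(2))` leaves a single addition with one literal operand, which is the shape shown on the optimized plan slide.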
  • 30. Physical Planning ● Strategies ○ Set of transformations ○ E.g. selects the best join execution ● Rule executor ○ Ensure requirements ○ Apply optimization
  • 31. Physical Planning ● Strategies ○ Set of transformations ○ E.g. selects the best join execution ● Rule executor ○ Ensure requirements ○ Apply physical optimization
  • 33. DataSource API ● Key part of integrating data sources ○ How to read from and write to storage ○ Statistics ○ Physical Planning ● Hadoop, Hive ● Presto and Cassandra connectors
  • 34. DataSource API (org.apache.spark.sql.sources.interfaces): trait RelationProvider { def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation } abstract class BaseRelation { def sqlContext: SQLContext def schema: StructType def unhandledFilters: Array[Filter] def sizeInBytes: Long def needConversion: Boolean } trait TableScan { def buildScan(): RDD[Row] }
  • 35. DataSource API: class DefaultSource extends RelationProvider with SchemaRelationProvider { /* creates a relation with an undefined schema (null) */ override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = { createRelation(sqlContext, parameters, null) } /* gets the schema of the table and produces a MyCustomRelation */ override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation = { /* implementation */ return new MyCustomRelation(<>, schema)(sqlContext) } } class MyCustomRelation(location: String, userSchema: StructType)(@transient val sqlContext: SQLContext) extends BaseRelation with Serializable { override def schema: StructType = { /* implementation which returns a StructType (or a sequence of StructFields) */ } }
  • 36. DataSource API ● Limited extension ● Lack of info about partitions ● Lack of columnar and streaming support. Example (org.apache.spark.sql.sources.interfaces): trait LimitedScan { def buildScan(limit: Int): RDD[Row] } trait PrunedLimitedScan { def buildScan(requiredColumns: Array[String], limit: Int): RDD[Row] } trait PrunedFilteredLimitedScan { def buildScan(requiredColumns: Array[String], filters: Array[Filter], limit: Int): RDD[Row] }
  • 37. DataSourceV2 API ● Written in Java, since Spark 2.3 ● ReadSupport or WriteSupport ● Own partitioner ● Mix in some Support interfaces. Structure: DataSourceV2 with ReadSupport with ReadSupportWithSchema; DataSourceReader with SupportsPushDownFilters with SupportsPushDownRequiredColumns ....; InputPartition; InputPartitionReader
  • 38. DataSourceV2 API: public interface ReadSupport extends DataSourceV2 { DataSourceReader createReader(DataSourceOptions options); } public interface DataSourceReader { StructType readSchema(); List<InputPartition<Row>> planInputPartitions(); } public interface SupportsPushDownRequiredColumns extends DataSourceReader { void pruneColumns(StructType requiredSchema); } public interface InputPartition<T> { InputPartitionReader<T> createPartitionReader(); } public interface InputPartitionReader<T> extends Closeable { boolean next(); T get(); }
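
The `next()`/`get()` contract of `InputPartitionReader` is easiest to see with a tiny reader and the loop that drains it. The sketch below redefines a trait with the same shape in plain Scala so that it runs without Spark on the classpath; `SeqPartitionReader` and `readAll` are hypothetical helpers, not part of any Spark API.

```scala
// Stand-alone model of the next()/get()/close() reading contract.
// The trait mirrors the shape of the slide's interface but is defined locally.
object PartitionReaderSketch {
  trait InputPartitionReader[T] extends java.io.Closeable {
    def next(): Boolean // advance; false when the partition is exhausted
    def get(): T        // current record; only valid after next() returned true
  }

  // Hypothetical reader over an in-memory sequence standing in for one partition.
  final class SeqPartitionReader[T](rows: Seq[T]) extends InputPartitionReader[T] {
    private val it = rows.iterator
    private var current: Option[T] = None
    def next(): Boolean = {
      current = if (it.hasNext) Some(it.next()) else None
      current.isDefined
    }
    def get(): T = current.getOrElse(throw new NoSuchElementException("call next() first"))
    def close(): Unit = ()
  }

  // The consuming loop: advance, read, and always close the reader.
  def readAll[T](reader: InputPartitionReader[T]): List[T] = {
    val out = List.newBuilder[T]
    try { while (reader.next()) out += reader.get() } finally reader.close()
    out.result()
  }
}
```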
  • 39. Spark-Cassandra-Connector ● DataStax open source ● RDDs, DataFrames and CQL
  • 40. CassandraSourceRelation ● PrunedFilteredScan ● InsertableRelation ● BaseRelation: schema, sizeInBytes, unhandledFilters. Source (org.apache.spark.sql.cassandra.CassandraConn...): private[cassandra] class CassandraSourceRelation( tableRef: TableRef, userSpecifiedSchema: Option[StructType], filterPushdown: Boolean, confirmTruncate: Boolean, tableSizeInBytes: Option[Long], connector: CassandraConnector, readConf: ReadConf, writeConf: WriteConf, sparkConf: SparkConf, override val sqlContext: SQLContext) extends BaseRelation with InsertableRelation with PrunedFilteredScan with Logging
  • 41. CassandraSourceRelation: PrunedFilteredScan ● Column Pruning ○ Discard columns ● Filter Pushdown ○ Discard rows
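
Column pruning and filter pushdown both come down to what a `buildScan(requiredColumns, filters)` call returns. The following is a toy in-memory version in plain Scala, assuming rows are maps from column name to `Double`; the `Filter` types are defined locally and only imitate the names of Spark's source filters.

```scala
// Toy model of a pruned, filtered scan over in-memory rows.
// Row and Filter are local stand-ins, not Spark classes.
object PrunedFilteredScanSketch {
  type Row = Map[String, Double]

  sealed trait Filter
  final case class GreaterThanOrEqual(attribute: String, value: Double) extends Filter
  final case class LessThan(attribute: String, value: Double) extends Filter

  private def matches(row: Row, f: Filter): Boolean = f match {
    case GreaterThanOrEqual(a, v) => row(a) >= v
    case LessThan(a, v)           => row(a) < v
  }

  // Filter pushdown discards rows; column pruning discards columns.
  def buildScan(table: Seq[Row], requiredColumns: Array[String], filters: Array[Filter]): Seq[Row] =
    table
      .filter(row => filters.forall(f => matches(row, f)))
      .map(row => requiredColumns.map(c => c -> row(c)).toMap)
}
```

In a real connector the same two arguments would be translated into the column list and `WHERE` clause of the query sent to the store, so that the discarded rows and columns never reach Spark.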
  • 42. Limitations ● DataSource API ● Pushdown restrictions ○ Filtering on only one column ○ No custom index support
  • 43. Extensions ● Scenario ● Sampling Pushdown ○ Sample Operator ○ Changes ● Multidimensional Filter Pushdown ○ Filter Pushdown ○ Changes
  • 44. Scenario: a Qbeast-indexed table and query examples. CREATE TABLE keyspace.table ( id double PRIMARY KEY, x double, y double, z double ); CREATE CUSTOM INDEX IF NOT EXISTS table_idx ON keyspace.table (x, y, z); SELECT * FROM keyspace.table WHERE x >= 0.1826763 AND x < 0.5555 AND y >= 1.9 AND y < 2.863653 AND z >= 0.1 AND z < 10.78645; SELECT * FROM keyspace.table WHERE expr(table_idx, 'precision=0.1');
  • 45. Scenario: a Qbeast-indexed table and query examples. CREATE TABLE keyspace.table ( id double PRIMARY KEY, x double, y double, z double ); CREATE CUSTOM INDEX IF NOT EXISTS table_idx ON keyspace.table (x, y, z); Filter pushdown: SELECT * FROM keyspace.table WHERE x >= 0.1826763 AND x < 0.5555 AND y >= 1.9 AND y < 2.863653 AND z >= 0.1 AND z < 10.78645; Sampling pushdown: SELECT * FROM keyspace.table WHERE expr(table_idx, 'precision=0.1');
  • 46. Sample Operator on Spark ● Sample ○ lower/upper bound ○ with/without replacement ○ seed. Examples: SELECT * FROM keyspace.table TABLESAMPLE(5 ROWS); SELECT * FROM keyspace.table TABLESAMPLE(10 PERCENT); df.sample(...)
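
To make the three Sample parameters concrete, here is a minimal sketch of sampling without replacement in plain Scala: each row gets one pseudo-random draw and is kept when the draw falls inside `[lowerBound, upperBound)`. This only illustrates the idea; Spark's own sampler classes differ in detail, and `SampleSketch.sample` is a hypothetical helper.

```scala
import scala.util.Random

// Toy sampler without replacement: a row is kept when its random draw
// falls inside [lowerBound, upperBound). The seed makes runs repeatable.
object SampleSketch {
  def sample[T](rows: Seq[T], lowerBound: Double, upperBound: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    rows.filter { _ =>
      val x = rng.nextDouble() // uniform in [0.0, 1.0)
      x >= lowerBound && x < upperBound
    }
  }
}
```

With bounds `0.0` and `0.1` this keeps roughly 10 percent of the rows, which corresponds to `TABLESAMPLE(10 PERCENT)`; the catch is that every row still has to be read first, which is what a sampling pushdown tries to avoid.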
  • 47. Sampling Pushdown. Catalyst Optimizer, DataSource API, CassandraSourceRelation ● Filter Pushdown ● Column Pruning ● Sampling with Qbeast? ● Filter Pushdown ● Column Pruning ● Sampling Pushdown?
  • 48. Sampling Pushdown ● New interfaces for the scan ● New method to detect the sampling operator and the data source. DataSource API additions: Sampled Scan, Sampled Filtered Scan, Pruned Sampled Scan, Pruned Sampled Filtered Scan
  • 49. Sampling Pushdown (org.apache.spark.sql.sources.interfaces): @InterfaceStability.Stable trait SampledScan { def buildScan(sample: Sample): RDD[Row] } @InterfaceStability.Stable trait SampledFilteredScan { def buildScan(filters: Array[Filter], sample: Sample): RDD[Row] } @InterfaceStability.Stable trait PrunedSampledScan { def buildScan(requiredColumns: Array[String], sample: Sample): RDD[Row] } @InterfaceStability.Stable trait PrunedSampledFilteredScan { def pushSampling(sample: Sample): Boolean def buildScan(requiredColumns: Array[String], filters: Array[Filter], sample: Sample): RDD[Row] }
  • 50. Sampling Pushdown (org.apache.spark.sql.execution.datasources.DataSourceStrategy): case s @ Sample(_, _, _, _, physical_op @ PhysicalOperation(p, f, l: LogicalRelation)) => l.relation match { case scan: PrunedSampledFilteredScan if scan.pushSampling(s) => pruneFilterProject( l, p, f, (a, f) => toCatalystRDD(l, a, scan.buildScan(a.map(_.name).toArray, f, s))) :: Nil case _ => Nil }
  • 51. Sampling Pushdown. Processing the pushdown: 1. User-level option to push down sampling 2. Detection of Sample 3. Analysis 4. Write a CQL expression to query the index 5. Let Qbeast handle it again!
  • 52. Sampling Pushdown (org.apache.spark.sql.cassandra.CassandraConn...): private[cassandra] class CassandraSourceRelation( /* other stuff */ sampling: Boolean, override val sqlContext: SQLContext) extends BaseRelation with InsertableRelation with PrunedFilteredScan with PrunedSampledFilteredScan with Logging. override def pushSampling(sample: Sample): Boolean = { /* check if the table is indexed and the user wants to push down the operator */ } override def buildScan(requiredColumns: Array[String], filters: Array[Filter], sample: Sample): RDD[Row] = { /* construct the index CQL code and push it through the scanning */ }
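
The two stubbed methods can be pictured with a small stand-alone model. Everything below is an assumption made for illustration: `Sample` is a local case class rather than Spark's logical operator, the two boolean flags stand in for the index check and the user option, and mapping the sampled fraction directly to the `precision` argument of `expr(...)` is a guess based on the scenario query, not something the slides confirm.

```scala
// Stand-alone sketch of the sampling pushdown decision and the CQL it might emit.
// Sample, the flags and the fraction-to-precision mapping are all hypothetical.
object SamplingPushdownSketch {
  final case class Sample(lowerBound: Double, upperBound: Double, withReplacement: Boolean, seed: Long)

  // Push down only when the table has a Qbeast index and the user enabled the option.
  def pushSampling(tableIsIndexed: Boolean, userEnabled: Boolean): Boolean =
    tableIsIndexed && userEnabled

  // Assumption: the sampled fraction is passed to the index as its precision.
  def samplingCql(table: String, index: String, sample: Sample): String = {
    val precision = sample.upperBound - sample.lowerBound
    s"SELECT * FROM $table WHERE expr($index, 'precision=$precision')"
  }
}
```

When `pushSampling` returns false, the planner case on slide 50 does not match and Spark falls back to sampling on top of a regular scan.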
  • 53. Sampling Pushdown: SELECT * FROM keyspace.table TABLESAMPLE (5 PERCENT). Plans compared: Simple Lookup versus Sample(0.0, 0.05, false, 983653) over a Full Table Scan
  • 55. Multidimensional Pruning. Catalyst Optimizer, DataSource API, CassandraSourceRelation ● Filter Pushdown ● Column Pruning ● Sampling with Qbeast ● Multidimensional pushdown? ● Filter Pushdown ● Column Pruning ● Sampling Pushdown
  • 56. Multidimensional Pruning. Processing the pushdown: 1. Detect the index 2. Analyze the predicate 3. Push down the filters to Cassandra 4. Let Qbeast handle it!
  • 57. Multidimensional Pruning (org.apache.spark.sql.cassandra.BasicCassandraPredicateToPushdown): private val qbeast = table.qbeastColumns.map(_.columnName) /** Returns the set of predicates that contain double ranges for the Qbeast index */ val qbeastPredicatesToPushdown: Set[Predicate] = { val doubleRange = rangePredicatesByName.filter(p => p._2.exists(Predicates.isLessThanPredicate) && p._2.exists(Predicates.isGreaterThanOrEqualPredicate)) if (qbeast.toSet subsetOf doubleRange.keySet) { val eqQbeast = qbeast.flatMap(rangePredicatesByName) eqQbeast.toSet } else Set.empty } val predicatesToPushDown: Set[Predicate] = partitionKeyPredicatesToPushDown ++ clusteringColumnPredicatesToPushDown ++ indexedColumnPredicatesToPushDown ++ qbeastPredicatesToPushdown
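
The check on this slide pushes the range predicates down only when every indexed column has both a lower (`>=`) and an upper (`<`) bound. A self-contained version of that logic, with locally defined predicate types instead of the connector's `Predicates` helpers, could look like the sketch below; the type and parameter names are hypothetical.

```scala
// Self-contained model of the "double range on every indexed column" check.
// Predicate types are local stand-ins for the connector's own abstractions.
object QbeastPredicateSketch {
  sealed trait Predicate { def column: String }
  final case class GreaterThanOrEqual(column: String, value: Double) extends Predicate
  final case class LessThan(column: String, value: Double) extends Predicate

  def qbeastPredicatesToPushdown(indexed: Seq[String], predicates: Seq[Predicate]): Set[Predicate] = {
    val byName = predicates.groupBy(_.column)
    // Columns constrained from both sides: at least one >= and at least one <.
    val doubleRange = byName.filter { case (_, ps) =>
      ps.exists(_.isInstanceOf[LessThan]) && ps.exists(_.isInstanceOf[GreaterThanOrEqual])
    }
    // Push down only if ALL indexed columns are covered; otherwise push nothing.
    if (indexed.toSet.subsetOf(doubleRange.keySet)) indexed.flatMap(byName).toSet
    else Set.empty
  }
}
```

For the scenario query, which bounds `x`, `y` and `z` on both sides, all six predicates are returned; removing any one bound leaves a column half-open and the result is empty, so the query is handled without the index.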
  • 58. Multidimensional Pushdown: SELECT * FROM keyspace.table WHERE x >= 0.1826763 AND x < 0.5555 AND y >= 1.9 AND y < 2.863653 AND z >= 0.1 AND z < 10.78645. Plans compared: FILTER(isNotNull) over PrunedFilteredScan versus FILTER(x, y, z, isNotNull) over a Full Table Scan
  • 64. Future Work ● Dimensional Aware ● Join Strategy ● Storage
  • 65. Dimensional Aware ● Useful for data locality strategies ● Physical Planning
  • 66. Join Strategy in Spark ● Shuffle-Hash-Join ● Broadcast-Join ● Sort-Merge-Join
  • 67. Join on Qbeast ● Dimensional-aware data partition ● Speculative optimization on sampling
  • 68. Integration with Arrow ● Save Qbeast data in Arrow ● Static column with file information ● Make analytics faster ● Spark support since 2.3
  • 69. Future Work ● Dimensional Aware ● Join Strategy ● Storage ● DataSource V2
  • 70. DataSourceV2 ● New Java class ● New method to detect the sampling operator and the data source. DataSource API v2: Supports Pushdown Sampling
  • 71. DataSourceV2: package org.apache.spark.sql.sources.v2.reader; @InterfaceStability.Evolving public interface SupportsPushDownSampling extends DataSourceReader { boolean pushSampling(Sample sample); } Planner case: case s @ Sample(_, _, _, _, l @ PhysicalOperation(p, f, e: DataSourceV2Relation)) => /* implementation of pruning and filter pushdown */ ProjectExec(p, withFilter) :: Nil case _ => Nil