SlideShare uma empresa Scribd logo
1 de 42
Baixar para ler offline
© 2017 IBM Corporation
Using Pluggable Apache Spark SQL Filters
to help GridPocket users keep up with the
Jones’ (and save the planet)
Paula Ta-Shma,
Guy Gerson
IBM Research
Contact: paula@il.ibm.com
Joint work with:
Tal Ariel, IBM
Filip Gluszak, GridPocket
Gal Lushi, IBM
Yosef Moatti, IBM
Papa Niamadio, GridPocket
Nathaël Noguès, GridPocket
© 2017 IBM Corporation2
You have 5 minutes to pack your bags…
© 2017 IBM Corporation3
?
© 2017 IBM Corporation4
?
© 2017 IBM Corporation
Using Pluggable Apache Spark SQL Filters
to help GridPocket users keep up with the
Jones’ (and save the planet)
Paula Ta-Shma,
Guy Gerson
IBM Research
Contact: paula@il.ibm.com
Joint work with:
Tal Ariel, IBM
Filip Gluszak, GridPocket
Gal Lushi, IBM
Yosef Moatti, IBM
Papa Niamadio, GridPocket
Nathaël Noguès, GridPocket
© 2017 IBM Corporation6
What would convince you to reduce your energy usage ?
#EUres2
© 2017 IBM Corporation7
What would convince you to reduce your energy usage ?
§ Save $ ?
#EUres2
© 2017 IBM Corporation8
What would convince you to reduce your energy usage ?
§ Save $ ?
§ Save the planet ?
#EUres2
© 2017 IBM Corporation9
Competing to Save Energy
§ Research shows that utility
customers are most influenced by
peer pressure to save energy
§ => Help utilities compare their
customers’ energy consumption
with that of their neighbours
– Queries are anonymized
#EUres2
© 2017 IBM Corporation10
GridPocket
§ A smart grid company developing
energy management applications
and cloud services for electricity,
water and gas utilities
§ HQ based in France
§ http://www.gridpocket.com/
§ Developed open source data
generator
§ Provided industry specific use
cases and algorithms
The GridPocket Dataset
§ Our target: 1 million meters
reporting every 15 minutes
§ Records are ~100 bytes
§ Generated ~1TB in 3 months
§ Allowing for 1 order of magnitude
growth in each dimension gives
1PB in 3 months
§ => Use object storage
§ Typically at least order of
magnitude lower storage cost
than NoSQL databases
§ NoSQL dataset – one large
denormalized table containing
meter reading information
– SQL query for nearest neighbours
#EUres2
© 2017 IBM Corporation11
What is Object Storage ?
§ Objects contain data and metadata
§ Written once and not modified
– Cannot append or update data
- Can overwrite completely
- Can update metadata
– No rename operation
§ Accessed through RESTful HTTP
– PUT/GET/POST/DELETE object/bucket
– Flat namespace
§ High capacity, low cost
§ Storage of choice for Big datasets
– Analytics works best on equally sized
objects
§ Examples
– Amazon S3, IBM COS, Google Cloud
Storage, OpenStack Swift
Swift
© 2017 IBM Corporation12
Object Storage and Spark are separately managed
microservices
© 2017 IBM Corporation13
Object Storage and Spark are separately managed
microservices
Want to query data on object storage directly
© 2017 IBM Corporation14
Object Storage and Spark are separately managed
microservices
Want to query data on object storage directly
© 2017 IBM Corporation15
Object Storage and Spark are separately managed
microservices
Want to query data on object storage directly
Our goal:Minimize
1. Number of bytes shipped
2. Number of REST requests
THE key factors affecting cost
(and performance)
© 2017 IBM Corporation16
How is this done today ?
1. Use specialized column based
formats such as Parquet, ORC
– Column wise compression
– Column pruning
– Specialized metadata
© 2017 IBM Corporation17
How is this done today ?
1. Use specialized column based
formats such as Parquet, ORC
– Column wise compression
– Column pruning
– Specialized metadata
We want one solution for all data
formats
© 2017 IBM Corporation18
How is this done today ?
1. Use specialized column based
formats such as Parquet, ORC
– Column wise compression
– Column pruning
– Specialized metadata
We want one solution for all data
formats
2. Use Hive Style Partitioning to layout
Object Storage data
– Partition pruning
© 2017 IBM Corporation19
Hive Style Partitioning and Partition Pruning
§ Relies on a dataset naming convention
§ Object storage has flat namespace, but can create a virtual folder hierarchy
– Use ‘/’ in object name
§ Data can be partitioned according to a column e.g. dt
– Information about object contents is encoded in object name e.g. dt=2015-09-14
§ Spark SQL can query fields in object name as well as in data e.g. “dt”
– Filters the objects which need to be read from Object Storage and sent to Spark
© 2017 IBM Corporation20
Limitations of Today’s Hive Style Partitioning in Spark
§ Only one hierarchy is possible
– Like database primary key, no secondary keys
§ Changing partitioning scheme requires rewriting entire dataset !
– Hierarchy cannot be changed without renaming all objects
§ No range partitioning
– Only supports partitioning with discrete types e.g. Gender:M/F, date, age etc.
– doesn't work well for timestamps, arbitrary floating point numbers etc.
§ A deep hierarchy may result in small and non uniform object sizes
– May reduce performance
© 2017 IBM Corporation21
Can we do more ?
NY
© 2017 IBM Corporation22
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
© 2017 IBM Corporation23
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
Min/max values
© 2017 IBM Corporation24
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
One or more bounding boxes
© 2017 IBM Corporation25
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
A set of values
© 2017 IBM Corporation26
Filter According to Metadata
§ Store one metadata record per object
– Unlike database fully inverted index
§ Various index types
– Min and max value for ordered columns
– Bounding boxes for geospatial data
– Bloom filters as space efficient value lists
§ Users can choose which columns to index
and index type per column
– Can index additional columns later on
§ Main requirement: no false negatives
§ Avoids touching irrelevant objects altogether
§ Handles updates (PUT/DELETE object)
– Filtering out irrelevant objects always works
#EUres2
NY
© 2017 IBM Corporation27
Spark SQL query execution flow
Query
Prune
partitions
Read data
Query
Prune
partitions
Optional file
filter
Read data
Metadata
Filter
© 2017 IBM Corporation28
The Interface
Filters should extend this trait and implement application specific filtering logic.
trait ExecutionFileFilter {
/**
* @param dataFilters query predicates for data columns
* @param f represents an object which exists in the file catalog
* @return true if the object needs to be scanned during execution
*/
def isRequired(dataFilters: Seq[Filter], f: FileStatus) : Boolean
}
Turn filter on by specifying Filter class:
sqlContext.setConf("spark.sql.execution.fileFilter", “TestFileFilter”)
< 20 lines of code to integrate into Spark
#EUres2
© 2017 IBM Corporation29
Data Ingest: How to best Partition the Data
§ Organizing data in
objects in a smart way
generates more effective
metadata
§ Partition during data
ingestion
§ Example: Kafka allows
user to define partitioning
function
§ Needs to scale
horizontally – want
stateless partitioner
#EUres2
Client 1
Partition 0
Client 2
Client 3
Partition 1
Partition 2
Partition 3
© 2017 IBM Corporation30
Geospatial Partitioning: Grid Partitioner
§ Divide the world map into a grid
– Precision depends on use case
§ Each data point belongs to a cell
§ Each cell is hashed to a partition
§ A partition can contain multiple
cells
§ Partitions periodically generate
objects
§ Each object is described using a
list of bounding boxes
– 1 per participating cell
#EUres2
1 2 3 4
5 6 7 8
Partition 0
Partition 1
Partition 2
8
2
6
4
5
1
7
3
6 6
5
1
2 7
Objects
© 2017 IBM Corporation31
Experimental Results
The GridPocket Dataset
§ 1 million meters reporting every 15 minutes
§ Records are ~100 bytes
§ Generated ~1TB in 3 months
§ Partitioned using Grid Partitioner using precision 1 (cells are roughly 10 km2)
§ Compared using 50 and 100 Kakfa partitions
#EUres2
Get my neighbours’ average usage
SELECT AVG(usage)
FROM (
SELECT vid as meter_id,
(MAX(index)-MIN(index)) as usage
FROM dataset
WHERE (lat BETWEEN 43.300 AND 44.100)
AND (lng BETWEEN 6.800 AND 7.600)
GROUP BY vid
)
© 2017 IBM Corporation32
Experimental Results
The GridPocket Dataset
§ 1 million meters reporting every 15 minutes
§ Records are ~100 bytes
§ Generated ~1TB in 3 months
§ Partitioned using Grid Partitioner using precision 1 (cells are roughly 10 km2)
§ Compared using 50 and 100 Kakfa partitions
#EUres2
Get my neighbours’ average usage
SELECT AVG(usage)
FROM (
SELECT vid as meter_id,
(MAX(index)-MIN(index)) as usage
FROM dataset
WHERE (lat BETWEEN 43.300 AND 44.100)
AND (lng BETWEEN 6.800 AND 7.600)
GROUP BY vid
)
© 2017 IBM Corporation33
#GB Transferred for bounding box queries
#EUres2
0
200
400
600
800
1000
1200
20x20 km 10x10 km 5x5 km
no filter filter
#GB transferred
50 partitions 100 partitions
1 TB dataset, grid partitioner, precision 1, average of 10 randomly located queries
© 2017 IBM Corporation34
#GET Requests for bounding box queries
#EUres2
1 TB dataset, grid partitioner, precision 1, took average of 10 randomly located queries
0
2000
4000
6000
8000
10000
12000
20x20 km 10x10 km 5x5 km
no filter filter
#GET requests
50 partitions 100 partitions
© 2017 IBM Corporation35
Time (sec) for bounding box queries
#EUres2
1 TB dataset, precision 1, took average of 10 randomly located queries
0
500
1000
1500
2000
2500
20x20 km 10x10 km 5x5 km
no filter filter
Time (sec)
50 partitions 100 partitions
© 2017 IBM Corporation36
Demo
§ Runs on IOSTACK testbed
– 3 Spark and 3 Object
Storage nodes
§ Demo dataset
– 1 million meters
– Report every 15 mins
– 1 day’s worth of data
– 11 GB (csv)
§ Grid Partitioner Config
– Cells have precision 1
- ~10 km2
– 100 partitions
Get my neighbours’ average usage
SELECT AVG(usage)
FROM (
SELECT vid as meter_id,
(MAX(index)-MIN(index)) as usage
FROM dataset
WHERE (lat BETWEEN 43.300 AND 44.100)
AND (lng BETWEEN 6.800 AND 7.600)
GROUP BY vid
)
© 2017 IBM Corporation37
Conclusions: Don’t let your vacation turn into a relocation
§ Running SQL queries directly
on Big Datasets in Object
Storage is viable using
metadata
§ A small change to Spark
enables dramatic
performance improvements
§ We focused here on
geospatial data and the
GridPocket use case,
although it also applies to
other data types and use
cases
§ IBM is investing in Spark
SQL as a backbone for SQL
processing services in the
cloud
#EUres2
© 2017 IBM Corporation38
Thanks!
Contact:
paula@il.ibm.com
https://www.linkedin.com/in/paulatashma/
guyger@il.ibm.com
https://www.linkedin.com/in/guy-gerson-82619164/
© 2017 IBM Corporation39
Backup
#EUres2
© 2017 IBM Corporation40
GridPocket Use Case and Dataset
§ NoSQL dataset – one large denormalized table
– date, index, sumHC, sumHP, type, vid, size, temp, city, region, lat, lng
– Index = meter reading
– sumHC = total of energy consumed since midnight in off-hours
– sumHP = total of energy consumed since midnight in rush-hours
– Type = elec/gas/water
– Vid = meter id
– Size = apartment size in square meters
© 2017 IBM Corporation41
Example Filter Scenario
§ Want to analyze data from active sensors only
§ External DB contains sensor activity info
Data Layout
Archives/dt=01-01-2014
Archives/dt=01-01-2014/sensor1.json (500MB)
Archives/dt=01-01-2014/sensor2.json (500MB)
Archives/dt=01-01-2014/sensor3.json (500MB)
Archives/dt=02-01-2014
Archives/dt=02-01-2014/sensor1.json (500MB)
Archives/dt=02-01-2014/sensor2.json (500MB)
Archives/dt=02-01-2014/sensor3.json (500MB)
more...
#EUres2
sensor active
sensor1 FALSE
sensor2 TRUE
sensor3 FALSE
© 2017 IBM Corporation42
Example Filter
class LiveSensorFilter extends ExecutionFileFilter {
//get a list of live sensors from an external Service
val activeSensors = SensorService.getLiveSensors
//returns true if object represents a live sensor
@Override
def isRequired(dataFilters:Seq[org.apache.spark.sql.sources.Filter],
fileStatus: FileStatus) : Boolean = {
activeSensors.contains(Utils.getSensorIdFromPath(fileStatus))
}
Turn filter on:
sqlContext.setConf("spark.sql.execution.fileFilter", “LiveSensorFilter ”)
#EUres2

Mais conteúdo relacionado

Mais procurados

Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
Databricks
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Spark Summit
 

Mais procurados (20)

Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael Nitschinger
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
 

Destaque

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Spark Summit
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Databricks
 

Destaque (10)

Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
 
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiBuilding Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 

Semelhante a Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up with the Jones' (and save the planet) Paula Ta-Shma and Guy Gerson

Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA
 

Semelhante a Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up with the Jones' (and save the planet) Paula Ta-Shma and Guy Gerson (20)

A Brave new object store world
A Brave new object store worldA Brave new object store world
A Brave new object store world
 
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low CostHow The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
 
Elasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep diveElasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep dive
 
Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data Lake
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient Java
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
 
Managing your Black Friday Logs
Managing your Black Friday LogsManaging your Black Friday Logs
Managing your Black Friday Logs
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
 
Devteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearchDevteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearch
 
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataRealtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
S100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804cS100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804c
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
S106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902aS106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902a
 
Database Architecture - Case Study - SMS Gyan.pdf
Database Architecture - Case Study - SMS Gyan.pdfDatabase Architecture - Case Study - SMS Gyan.pdf
Database Architecture - Case Study - SMS Gyan.pdf
 
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
 

Mais de Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit
 

Mais de Spark Summit (20)

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 

Último

➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Último (20)

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up with the Jones' (and save the planet) Paula Ta-Shma and Guy Gerson

  • 1. © 2017 IBM Corporation Using Pluggable Apache Spark SQL Filters to help GridPocket users keep up with the Jones’ (and save the planet) Paula Ta-Shma, Guy Gerson IBM Research Contact: paula@il.ibm.com Joint work with: Tal Ariel, IBM Filip Gluszak, GridPocket Gal Lushi, IBM Yosef Moatti, IBM Papa Niamadio, GridPocket Nathaël Noguès, GridPocket
  • 2. © 2017 IBM Corporation2 You have 5 minutes to pack your bags…
  • 3. © 2017 IBM Corporation3 ?
  • 4. © 2017 IBM Corporation4 ?
  • 5. © 2017 IBM Corporation Using Pluggable Apache Spark SQL Filters to help GridPocket users keep up with the Jones’ (and save the planet) Paula Ta-Shma, Guy Gerson IBM Research Contact: paula@il.ibm.com Joint work with: Tal Ariel, IBM Filip Gluszak, GridPocket Gal Lushi, IBM Yosef Moatti, IBM Papa Niamadio, GridPocket Nathaël Noguès, GridPocket
  • 6. © 2017 IBM Corporation6 What would convince you to reduce your energy usage ? #EUres2
  • 7. © 2017 IBM Corporation7 What would convince you to reduce your energy usage ? § Save $ ? #EUres2
  • 8. © 2017 IBM Corporation8 What would convince you to reduce your energy usage ? § Save $ ? § Save the planet ? #EUres2
  • 9. © 2017 IBM Corporation9 Competing to Save Energy § Research shows that utility customers are most influenced by peer pressure to save energy § => Help utilities compare their customers’ energy consumption with that of their neighbours – Queries are anonymized #EUres2
  • 10. © 2017 IBM Corporation10 GridPocket § A smart grid company developing energy management applications and cloud services for electricity, water and gas utilities § HQ based in France § http://www.gridpocket.com/ § Developed open source data generator § Provided industry specific use cases and algorithms The GridPocket Dataset § Our target: 1 million meters reporting every 15 minutes § Records are ~100 bytes § Generated ~1TB in 3 months § Allowing for 1 order of magnitude growth in each dimension gives 1PB in 3 months § => Use object storage § Typically at least order of magnitude lower storage cost than NoSQL databases § NoSQL dataset – one large denormalized table containing meter reading information – SQL query for nearest neighbours #EUres2
  • 11. © 2017 IBM Corporation11 What is Object Storage ? § Objects contain data and metadata § Written once and not modified – Cannot append or update data - Can overwrite completely - Can update metadata – No rename operation § Accessed through RESTful HTTP – PUT/GET/POST/DELETE object/bucket – Flat namespace § High capacity, low cost § Storage of choice for Big datasets – Analytics works best on equally sized objects § Examples – Amazon S3, IBM COS, Google Cloud Storage, OpenStack Swift Swift
  • 12. © 2017 IBM Corporation12 Object Storage and Spark are separately managed microservices
  • 13. © 2017 IBM Corporation13 Object Storage and Spark are separately managed microservices Want to query data on object storage directly
  • 14. © 2017 IBM Corporation14 Object Storage and Spark are separately managed microservices Want to query data on object storage directly
  • 15. © 2017 IBM Corporation15 Object Storage and Spark are separately managed microservices Want to query data on object storage directly Our goal:Minimize 1. Number of bytes shipped 2. Number of REST requests THE key factors affecting cost (and performance)
  • 16. © 2017 IBM Corporation16 How is this done today ? 1. Use specialized column based formats such as Parquet, ORC – Column wise compression – Column pruning – Specialized metadata
  • 17. © 2017 IBM Corporation17 How is this done today ? 1. Use specialized column based formats such as Parquet, ORC – Column wise compression – Column pruning – Specialized metadata We want one solution for all data formats
  • 18. © 2017 IBM Corporation18 How is this done today ? 1. Use specialized column based formats such as Parquet, ORC – Column wise compression – Column pruning – Specialized metadata We want one solution for all data formats 2. Use Hive Style Partitioning to layout Object Storage data – Partition pruning
  • 19. © 2017 IBM Corporation19 Hive Style Partitioning and Partition Pruning § Relies on a dataset naming convention § Object storage has flat namespace, but can create a virtual folder hierarchy – Use ‘/’ in object name § Data can be partitioned according to a column e.g. dt – Information about object contents is encoded in object name e.g. dt=2015-09-14 § Spark SQL can query fields in object name as well as in data e.g. “dt” – Filters the objects which need to be read from Object Storage and sent to Spark
  • 20. © 2017 IBM Corporation20 Limitations of Today’s Hive Style Partitioning in Spark § Only one hierarchy is possible – Like database primary key, no secondary keys § Changing partitioning scheme requires rewriting entire dataset ! – Hierarchy cannot be changed without renaming all objects § No range partitioning – Only supports partitioning with discrete types e.g. Gender:M/F, date, age etc. – doesn't work well for timestamps, arbitrary floating point numbers etc. § A deep hierarchy may result in small and non uniform object sizes – May reduce performance
  • 21. © 2017 IBM Corporation21 Can we do more ? NY
  • 22. © 2017 IBM Corporation22 Can we do more ? § Generate metadata per object column and index it § Various index types – Min/max, bounding boxes, value lists § Filter objects according to this metadata § Applies to all formats e.g. json, csv NY
  • 23. © 2017 IBM Corporation23 Can we do more ? § Generate metadata per object column and index it § Various index types – Min/max, bounding boxes, value lists § Filter objects according to this metadata § Applies to all formats e.g. json, csv NY Min/max values
  • 24. © 2017 IBM Corporation24 Can we do more ? § Generate metadata per object column and index it § Various index types – Min/max, bounding boxes, value lists § Filter objects according to this metadata § Applies to all formats e.g. json, csv NY One or more bounding boxes
  • 25. © 2017 IBM Corporation25 Can we do more ? § Generate metadata per object column and index it § Various index types – Min/max, bounding boxes, value lists § Filter objects according to this metadata § Applies to all formats e.g. json, csv NY A set of values
  • 26. © 2017 IBM Corporation26 Filter According to Metadata § Store one metadata record per object – Unlike database fully inverted index § Various index types – Min and max value for ordered columns – Bounding boxes for geospatial data – Bloom filters as space efficient value lists § Users can choose which columns to index and index type per column – Can index additional columns later on § Main requirement: no false negatives § Avoids touching irrelevant objects altogether § Handles updates (PUT/DELETE object) – Filtering out irrelevant objects always works #EUres2 NY
  • 27. © 2017 IBM Corporation27 Spark SQL query execution flow Query Prune partitions Read data Query Prune partitions Optional file filter Read data Metadata Filter
  • 28. © 2017 IBM Corporation28 The Interface Filters should extend this trait and implement application specific filtering logic. trait ExecutionFileFilter { /** * @param dataFilters query predicates for data columns * @param f represents an object which exists in the file catalog * @return true if the object needs to be scanned during execution */ def isRequired(dataFilters: Seq[Filter], f: FileStatus) : Boolean } Turn filter on by specifying Filter class: sqlContext.setConf("spark.sql.execution.fileFilter", “TestFileFilter”) < 20 lines of code to integrate into Spark #EUres2
  • 29. © 2017 IBM Corporation29 Data Ingest: How to best Partition the Data § Organizing data in objects in a smart way generates more effective metadata § Partition during data ingestion § Example: Kafka allows user to define partitioning function § Needs to scale horizontally – want stateless partitioner #EUres2 Client 1 Partition 0 Client 2 Client 3 Partition 1 Partition 2 Partition 3
  • 30. © 2017 IBM Corporation30 Geospatial Partitioning: Grid Partitioner § Divide the world map into a grid – Precision depends on use case § Each data point belongs to a cell § Each cell is hashed to a partition § A partition can contain multiple cells § Partitions periodically generate objects § Each object is described using a list of bounding boxes – 1 per participating cell #EUres2 1 2 3 4 5 6 7 8 Partition 0 Partition 1 Partition 2 8 2 6 4 5 1 7 3 6 6 5 1 2 7 Objects
  • 31. © 2017 IBM Corporation31 Experimental Results The GridPocket Dataset § 1 million meters reporting every 15 minutes § Records are ~100 bytes § Generated ~1TB in 3 months § Partitioned using Grid Partitioner using precision 1 (cells are roughly 10 km2) § Compared using 50 and 100 Kakfa partitions #EUres2 Get my neighbours’ average usage SELECT AVG(usage) FROM ( SELECT vid as meter_id, (MAX(index)-MIN(index)) as usage FROM dataset WHERE (lat BETWEEN 43.300 AND 44.100) AND (lng BETWEEN 6.800 AND 7.600) GROUP BY vid )
  • 32. © 2017 IBM Corporation32 Experimental Results The GridPocket Dataset § 1 million meters reporting every 15 minutes § Records are ~100 bytes § Generated ~1TB in 3 months § Partitioned using Grid Partitioner using precision 1 (cells are roughly 10 km2) § Compared using 50 and 100 Kakfa partitions #EUres2 Get my neighbours’ average usage SELECT AVG(usage) FROM ( SELECT vid as meter_id, (MAX(index)-MIN(index)) as usage FROM dataset WHERE (lat BETWEEN 43.300 AND 44.100) AND (lng BETWEEN 6.800 AND 7.600) GROUP BY vid )
  • 33. © 2017 IBM Corporation33 #GB Transferred for bounding box queries #EUres2 0 200 400 600 800 1000 1200 20x20 km 10x10 km 5x5 km no filter filter #GB transferred 50 partitions 100 partitions 1 TB dataset, grid partitioner, precision 1, average of 10 randomly located queries
  • 34. © 2017 IBM Corporation34 #GET Requests for bounding box queries #EUres2 1 TB dataset, grid partitioner, precision 1, took average of 10 randomly located queries 0 2000 4000 6000 8000 10000 12000 20x20 km 10x10 km 5x5 km no filter filter #GET requests 50 partitions 100 partitions
  • 35. © 2017 IBM Corporation35 Time (sec) for bounding box queries #EUres2 1 TB dataset, precision 1, took average of 10 randomly located queries 0 500 1000 1500 2000 2500 20x20 km 10x10 km 5x5 km no filter filter Time (sec) 50 partitions 100 partitions
  • 36. © 2017 IBM Corporation36 Demo § Runs on IOSTACK testbed – 3 Spark and 3 Object Storage nodes § Demo dataset – 1 million meters – Report every 15 mins – 1 day’s worth of data – 11 GB (csv) § Grid Partitioner Config – Cells have precision 1 - ~10 km2 – 100 partitions Get my neighbours’ average usage SELECT AVG(usage) FROM ( SELECT vid as meter_id, (MAX(index)-MIN(index)) as usage FROM dataset WHERE (lat BETWEEN 43.300 AND 44.100) AND (lng BETWEEN 6.800 AND 7.600) GROUP BY vid )
  • 37. © 2017 IBM Corporation37 Conclusions: Don’t let your vacation turn into a relocation § Running SQL queries directly on Big Datasets in Object Storage is viable using metadata § A small change to Spark enables dramatic performance improvements § We focused here on geospatial data and the GridPocket use case, although it also applies to other data types and use cases § IBM is investing in Spark SQL as a backbone for SQL processing services in the cloud #EUres2
  • 38. © 2017 IBM Corporation38 Thanks! Contact: paula@il.ibm.com https://www.linkedin.com/in/paulatashma/ guyger@il.ibm.com https://www.linkedin.com/in/guy-gerson-82619164/
  • 39. © 2017 IBM Corporation39 Backup #EUres2
  • 40. © 2017 IBM Corporation40 GridPocket Use Case and Dataset § NoSQL dataset – one large denormalized table – date, index, sumHC, sumHP, type, vid, size, temp, city, region, lat, lng – Index = meter reading – sumHC = total of energy consumed since midnight in off-hours – sumHP = total of energy consumed since midnight in rush-hours – Type = elec/gas/water – Vid = meter id – Size = apartment size in square meters
  • 41. © 2017 IBM Corporation41 Example Filter Scenario § Want to analyze data from active sensors only § External DB contains sensor activity info Data Layout Archives/dt=01-01-2014 Archives/dt=01-01-2014/sensor1.json (500MB) Archives/dt=01-01-2014/sensor2.json (500MB) Archives/dt=01-01-2014/sensor3.json (500MB) Archives/dt=02-01-2014 Archives/dt=02-01-2014/sensor1.json (500MB) Archives/dt=02-01-2014/sensor2.json (500MB) Archives/dt=02-01-2014/sensor3.json (500MB) more... #EUres2 sensor active sensor1 FALSE sensor2 TRUE sensor3 FALSE
  • 42. © 2017 IBM Corporation42 Example Filter class LiveSensorFilter extends ExecutionFileFilter { //get a list of live sensors from an external Service val activeSensors = SensorService.getLiveSensors //returns true if object represents a live sensor @Override def isRequired(dataFilters:Seq[org.apache.spark.sql.sources.Filter], fileStatus: FileStatus) : Boolean = { activeSensors.contains(Utils.getSensorIdFromPath(fileStatus)) } Turn filter on: sqlContext.setConf("spark.sql.execution.fileFilter", “LiveSensorFilter ”) #EUres2