OSCON 2019 - Optimizing analytical queries on Cassandra by 100x
1. About me
Software Engineer in the Data Platform team at Intuit
Speaking at Strata New York 2019 - Time Travel for Data Pipelines: Solving the Mystery of What Changed?
Technical Lead for Real-Time Analytics and Lineage Framework (SuperGlue)
Contributor to spark-cassandra-connector
Twitter: shradha151
LinkedIn: https://www.linkedin.com/in/shradha-ambekar-a0504714
9. Recognized as one of the world's leading companies
• 2004 - 2019: Most Admired: Computer Software
• 2002 - 2019: 100 Best Companies to Work For
• 2018: Most Innovative Companies
• 2018: Companies Best Positioned for Breakout Growth
13. Cassandra Data Model
Cassandra is a wide column store - a hybrid between a key-value store and a tabular data management system. Its data model is a partitioned row store with tunable consistency.
PARTITION KEY: dt, country, offering, category, name
CLUSTERING KEY: local_ts, company_id
A Cassandra partition stores many rows.
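As a minimal sketch, the table described above could be created from Scala through the DataStax Java driver; the column types are assumptions, since only the key columns are named in the slides, and a locally reachable cluster is assumed.

import com.datastax.oss.driver.api.core.CqlSession

// Session against a local cluster (defaults to localhost:9042).
val session = CqlSession.builder().build()

// Hypothetical schema: only the key columns come from the slides.
session.execute(
  """CREATE TABLE IF NOT EXISTS htap.event (
    |  dt text, country text, offering text, category text, name text,
    |  local_ts timestamp, company_id bigint,
    |  PRIMARY KEY ((dt, country, offering, category, name), local_ts, company_id)
    |)""".stripMargin)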
16. Cassandra Partition
• Cassandra organizes data into partitions.
• A Cassandra partition stores many rows.
• Within a partition, rows are ordered by the clustering keys.
• The partition key of a row is hashed to determine its token.
• Data is located on a token range.
18. Basic Aggregate Queries on Cassandra
These may result in a timeout:
1. The coordinator fetches data from multiple nodes
2. Stores the data in heap
3. Aggregates the data before returning results
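A minimal sketch of the kind of server-side aggregate that can time out, reusing the session from the sketch above: the coordinator must pull matching rows from many nodes, buffer them in heap, and aggregate before answering.

// Unrestricted aggregate: effectively a full cluster scan at the coordinator.
val count = session
  .execute("SELECT count(*) FROM htap.event")
  .one()
  .getLong(0)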
21. Spark with Cassandra
What worked:
• Writes scaled very well
What did not work:
• Extremely high latencies for analytical queries
• Basic analytical queries involving an IN clause took several minutes and sometimes crashed (see the sketch below)
• Workload was not distributed when scanning multiple Cassandra partitions
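A minimal sketch of the kind of IN-clause analytical query that was slow; the keyspace and table names come from earlier slides, while the filter values are made up for illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("htap-analytics").getOrCreate()
import spark.implicits._

val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "event", "keyspace" -> "htap"))
  .load()

// IN on a partition key column plus an equality predicate.
events
  .filter($"dt".isin("2018-01-01", "2018-01-02") && $"country" === "US")
  .groupBy($"offering")
  .count()
  .show()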
22. Presentation flow for next slides
• Problem Statement
• Performance Metrics for Problem Statement
• Concepts & Debugging Workflow
• Solution
• Performance Metrics After Fix
• Takeaways
24. Performance Metrics
JOB RUNS FOR MINUTES AND SOMETIMES FAILS
ERROR [ReadStage:93968] 2018-01-31 15:05:24,224 SliceQueryFilter.java (line 200) Scanned
over 100000 tombstones in htap.event ; query aborted (see tombstone_failure_threshold)
28. Spark SQL Catalyst
• An implementation-agnostic framework for manipulating trees of relational operators and expressions
• It defines all the expressions, logical plans, and optimization APIs for Spark SQL
[Figure: Catalyst stages while parsing a SQL query]
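To see each Catalyst stage for a query, the standard explain API can be used; this assumes the events DataFrame from the earlier sketch is registered as a temp view.

events.createOrReplaceTempView("events")
val df = spark.sql("SELECT offering, count(*) FROM events GROUP BY offering")
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(true)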
31. Physical Planning (Rules Executor + Strategies)
Spark takes the logical plan and generates zero or more physical plans by using strategies:
• DataSourceStrategy
• Aggregation
• JoinSelection
• InMemoryScans
• FileSourceStrategy
The physical plan is executed to generate an RDD.
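The planner's output can also be inspected programmatically (assuming the df from the previous sketch):

println(df.queryExecution.optimizedPlan) // logical plan after optimizer rules
println(df.queryExecution.sparkPlan)     // physical plan selected via the strategies
val internalRows = df.queryExecution.toRdd // the RDD the executed physical plan produces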
33. Physical Plan Strategies -> DataSource API
Enables Spark SQL to read from a data source.
Built-in data sources:
• JSON
• Parquet
• CSV
• JDBC
• ORC
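For example, reading through the built-in sources uses the same standard API (paths here are made up):

val usersJson = spark.read.format("json").load("/data/users.json")
val eventsPq  = spark.read.format("parquet").load("/data/events/")
val ordersCsv = spark.read.format("csv").option("header", "true").load("/data/orders.csv")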
34. CassandraDataSource: Implementing a Custom External Data Source (V1)
Generic Data Source V1 API:
• Extend RelationProvider and override createRelation, which captures the necessary relation metadata and returns a BaseRelation.
• Extend BaseRelation with a scan trait and with InsertableRelation, and override:
- sizeInBytes: estimate of the size of the relation
- unhandledFilters: returns the filters that cannot be pushed down to the data source
- buildScan: applies the predicate rules and returns an RDD
- insert: writes the DataFrame to the data source
Cassandra implementation:
class DefaultSource extends RelationProvider
- createRelation returns CassandraSourceRelation(sqlContext, schema, table)
class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
- sizeInBytes: estimated size of the Cassandra table
- unhandledFilters: applies the predicate pushdown rules and returns the filters not handled by Cassandra
- buildScan: applies the predicate pushdown rules and returns a CassandraTableScanRDD
- insert: writes the DataFrame to the data source
A minimal skeleton of this contract is sketched below.
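The sketch below shows the Data Source V1 contract using Spark's real sources API; the ExampleRelation class, its schema, and the filter handling are illustrative stand-ins, not the actual connector code.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new ExampleRelation(sqlContext, parameters("keyspace"), parameters("table"))
}

class ExampleRelation(
    override val sqlContext: SQLContext,
    keyspace: String,
    table: String)
  extends BaseRelation with PrunedFilteredScan with InsertableRelation {

  // Hypothetical schema; a real relation would derive it from the source.
  override def schema: StructType = StructType(Seq(
    StructField("dt", StringType), StructField("country", StringType)))

  // Rough size estimate; Spark uses it, e.g., for broadcast-join decisions.
  override def sizeInBytes: Long = 100L * 1024 * 1024 * 1024

  // Filters returned here are re-applied by Spark after the scan.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case _: EqualTo | _: In => true // pretend only these can be pushed down
      case _                  => false
    }

  // Pushes the handled filters to the source and returns the scan RDD.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row] // placeholder: real code builds a table-scan RDD

  override def insert(data: DataFrame, overwrite: Boolean): Unit =
    () // placeholder: real code writes the DataFrame to the source
}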
35. External DataSource Example
CASSANDRA DATA SOURCE
A DataSource instance can be created with:
spark.read.format("org.apache.spark.sql.cassandra").options(Map("table" -> "event", "keyspace" -> "htap")).load()
or by creating the table using DDL, as sketched below.
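A minimal sketch of the DDL route, using Spark SQL's data-source table syntax with the same keyspace and table options as the read example above:

spark.sql("""
  CREATE TABLE event
  USING org.apache.spark.sql.cassandra
  OPTIONS (keyspace "htap", table "event")
""")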
37. Basic Cassandra Pushdown
Starting with Cassandra 2.2, an IN predicate is supported on any partition key column; before that, only the last partition key column predicate could be an IN.
Starting with Cassandra 2.2, IN is also supported on clustering columns; before that, an IN predicate was allowed only on the last clustering column, and only if it was preceded by equality predicates.
The buildScan method in the CassandraSourceRelation class is invoked by DataSourceStrategy during physical planning and applies the predicate rules.
Released in spark-cassandra-connector 2.3.1
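To verify the pushdown, the plan can be inspected for the earlier query (events as defined in the previous sketch; filter values are made up). With the connector's rules applied, the In and EqualTo filters should appear as pushed filters on the scan node rather than as post-scan filters.

events
  .filter($"dt".isin("2018-01-01", "2018-01-02") && $"country" === "US")
  .explain() // look for the pushed filters in the scan node of the physical plan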
39. Performance Improvement
Total Cores = 8
Total Size of Table = 100 GB
Rows Scanned = 10,477,680
Time = 1.9 min
IMPROVEMENT: >15 min down to 1.9 min = 87.3%
40. Spark Job Scans the Required Cassandra Partitions but Creates Only One Spark Partition
42. Takeaways
If queries are running slow:
• Analyze the logical and physical plans to check whether the expected rules are applied.
• Use the DAG to identify the slow-running step - scan, shuffle, transformations, etc.
If scans are the issue and you are using an external data source:
• Check the rows scanned and the number of tasks launched.
• Look at the buildScan method in the DataSourceRelation class to debug the problem.
You may get a chance to contribute to open source!
48. RDD Contract
• Parent RDDs (RDD dependencies)
• getPartitions: returns an array of the partitions that the dataset is divided into
• A compute function that does the computation on each partition
• An optional Partitioner that defines how keys are hashed
• Optional preferred locations (aka locality info), i.e. the hosts for a partition where the records live or are the closest to read from
A minimal custom RDD implementing this contract is sketched below.
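The sketch uses Spark's real RDD API; the RangeSliceRDD itself is a made-up example that simply yields ranges of numbers.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangeSlicePartition(override val index: Int, val start: Long, val end: Long)
  extends Partition

class RangeSliceRDD(sc: SparkContext, slices: Int) extends RDD[Long](sc, Nil) {
  // getPartitions: how the dataset is divided.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](slices)(i =>
      new RangeSlicePartition(i, i * 100L, (i + 1) * 100L))

  // compute: produce the records of one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val p = split.asInstanceOf[RangeSlicePartition]
    (p.start until p.end).iterator
  }

  // getPreferredLocations: hosts where the partition's data lives (none here).
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}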
49. Stack calls when an application is submitted to the Master
Ref: https://trongkhoanguyen.com/spark/understand-the-scheduler-component-in-spark-core/
50. CassandraTableScanRDD - Current Implementation
An RDD representing a scan of a Cassandra table.
• getDependencies: returns the RDD dependencies
• getPartitions: if predicates are pushed down, only 1 Spark partition is created; otherwise a Spark partition consists of one or more contiguous token ranges
• getPreferredLocations: tells Spark the preferred nodes to fetch a partition from
• compute: fetches the data for the token range corresponding to each partition
• partitioner: defines how the data is partitioned
52. Fix Required: Update the getPartitions Method in CassandraTableScanRDD
PULL REQUEST: https://github.com/datastax/spark-cassandra-connector/pull/1214
getPartitions:
• If predicates are pushed down, determines the number of Cassandra partitions to scan based on the query.
• Creates as many Spark partitions as Cassandra partitions scanned (based on the query), as sketched below.
[Diagram: Cassandra partition keys mapped to Spark partitions (token range, machine)]
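A simplified, self-contained sketch of the idea behind the fix; the real change is in the pull request above, and every name and type here is hypothetical.

final case class TokenRange(start: Long, end: Long)
final case class ScanPartition(index: Int, ranges: Seq[TokenRange])

def getPartitions(
    predicatesPushed: Boolean,
    queriedKeyRanges: Seq[TokenRange],   // token ranges of the Cassandra partitions named in the query
    groupedRanges: Seq[Seq[TokenRange]]  // default grouping of contiguous token ranges
): Array[ScanPartition] =
  if (predicatesPushed)
    // One Spark partition per queried Cassandra partition -> N parallel tasks.
    queriedKeyRanges.zipWithIndex.map { case (r, i) => ScanPartition(i, Seq(r)) }.toArray
  else
    // Fall back to grouping contiguous token ranges into Spark partitions.
    groupedRanges.zipWithIndex.map { case (g, i) => ScanPartition(i, g) }.toArray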
54. Performance Improvement
Total Cores = 8
Total Size of Table = 100 GB
Rows Scanned = 10,477,680
Time = 32 sec
IMPROVEMENT: ~15 min to 1.9 min to 32 sec (after first + second patch) = 97%
57. Takeaways
If the workload is not distributed:
• Check the number of Spark tasks launched.
• Check the configuration properties (number of cores, etc.) passed to the Spark job.
If the config parameters and other infrastructure settings look good:
• Check the partitioning logic for the RDD.
• Validate the number of partitions created based on that logic.
You may get a chance to contribute to open source!
58. Q&A
Your opportunity to ask and learn
We are hiring!
Twitter: shradha151
LinkedIn: https://www.linkedin.com/in/shradha-ambekar-a0504714