OSCON 2019 - Optimizing analytical queries on Cassandra by 100x
1. About me
Software Engineer in the Data Platform team at Intuit
Speaking at Strata New York 2019 - Time Travel for Data Pipelines: Solving the Mystery of What Changed?
Technical Lead for Real-Time Analytics and Lineage Framework (SuperGlue)
Contributor to spark-cassandra-connector
Twitter: shradha151
LinkedIn: https://www.linkedin.com/in/shradha-ambekar-a0504714
9. Recognized as one of the world's leading companies
• 2004 - 2019: Most Admired: Computer Software
• 2002 - 2019: 100 Best Companies to Work For
• 2018: Most Innovative Companies
• 2018: Companies Best Positioned for Breakout Growth
13. Cassandra Data Model
Cassandra is a wide column store - a hybrid between a key-value store and a tabular data management system. Its data model is a partitioned row store with tunable consistency.
PARTITION KEY: dt, country, offering, category, name
CLUSTERING KEY: local_ts, company_id
A Cassandra partition stores many rows.
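As a minimal sketch, the table described above could be created from Scala through the DataStax Java driver; the column types are assumptions, since only the key columns are named in the slides, and a locally reachable cluster is assumed.

import com.datastax.oss.driver.api.core.CqlSession

// Session against a local cluster (defaults to localhost:9042).
val session = CqlSession.builder().build()

// Hypothetical schema: only the key columns come from the slides.
session.execute(
  """CREATE TABLE IF NOT EXISTS htap.event (
    |  dt text, country text, offering text, category text, name text,
    |  local_ts timestamp, company_id bigint,
    |  PRIMARY KEY ((dt, country, offering, category, name), local_ts, company_id)
    |)""".stripMargin)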
16. Cassandra Partition
• Cassandra organizes data into partitions.
• A Cassandra partition stores many rows.
• Within a partition, rows are ordered by the clustering keys.
• The partition key of a row is hashed to determine its token.
• Data is located on a token range.
18. Basic Aggregate Queries on Cassandra
These may result in a timeout:
1. The coordinator fetches data from multiple nodes
2. Stores the data in heap
3. Aggregates the data before returning results
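A minimal sketch of the kind of server-side aggregate that can time out, reusing the session from the sketch above: the coordinator must pull matching rows from many nodes, buffer them in heap, and aggregate before answering.

// Unrestricted aggregate: effectively a full cluster scan at the coordinator.
val count = session
  .execute("SELECT count(*) FROM htap.event")
  .one()
  .getLong(0)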
21. Spark with Cassandra
What worked:
• Writes scaled very well
What did not work:
• Extremely high latencies for analytical queries
• Basic analytical queries involving an IN clause took several minutes and sometimes crashed (see the sketch below)
• Workload was not distributed when scanning multiple Cassandra partitions
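A minimal sketch of the kind of IN-clause analytical query that was slow; the keyspace and table names come from earlier slides, while the filter values are made up for illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("htap-analytics").getOrCreate()
import spark.implicits._

val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "event", "keyspace" -> "htap"))
  .load()

// IN on a partition key column plus an equality predicate.
events
  .filter($"dt".isin("2018-01-01", "2018-01-02") && $"country" === "US")
  .groupBy($"offering")
  .count()
  .show()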
22. Presentation flow for next slides
• Problem Statement
• Performance Metrics for Problem Statement
• Concepts & Debugging Workflow
• Solution
• Performance Metrics After Fix
• Takeaways
24. Performance Metrics
JOB RUNS FOR MINUTES AND SOMETIMES FAILS
ERROR [ReadStage:93968] 2018-01-31 15:05:24,224 SliceQueryFilter.java (line 200) Scanned
over 100000 tombstones in htap.event ; query aborted (see tombstone_failure_threshold)
28. Spark SQL Catalyst
• An implementation-agnostic framework for manipulating trees of relational operators and expressions
• It defines all the expressions, logical plans, and optimization APIs for Spark SQL
[Figure: Catalyst stages while parsing a SQL query]
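To see each Catalyst stage for a query, the standard explain API can be used; this assumes the events DataFrame from the earlier sketch is registered as a temp view.

events.createOrReplaceTempView("events")
val df = spark.sql("SELECT offering, count(*) FROM events GROUP BY offering")
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(true)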
31. Physical Planning (Rules Executor + Strategies)
Spark takes the logical plan and generates zero or more physical plans by using strategies:
• DataSourceStrategy
• Aggregation
• JoinSelection
• InMemoryScans
• FileSourceStrategy
The physical plan is executed to generate an RDD.
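The planner's output can also be inspected programmatically (assuming the df from the previous sketch):

println(df.queryExecution.optimizedPlan) // logical plan after optimizer rules
println(df.queryExecution.sparkPlan)     // physical plan selected via the strategies
val internalRows = df.queryExecution.toRdd // the RDD the executed physical plan produces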
33. Physical Plan Strategies -> DataSource API
Enables Spark SQL to read from a data source.
Built-in data sources:
• JSON
• Parquet
• CSV
• JDBC
• ORC
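For example, reading through the built-in sources uses the same standard API (paths here are made up):

val usersJson = spark.read.format("json").load("/data/users.json")
val eventsPq  = spark.read.format("parquet").load("/data/events/")
val ordersCsv = spark.read.format("csv").option("header", "true").load("/data/orders.csv")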
34. CassandraDataSource: Implementing a Custom External Data Source (V1)
Generic Data Source V1 API:
• Extend RelationProvider and override createRelation, which captures the necessary relation metadata and returns a BaseRelation.
• Extend BaseRelation with a scan trait and with InsertableRelation, and override:
- sizeInBytes: estimate of the size of the relation
- unhandledFilters: returns the filters that cannot be pushed down to the data source
- buildScan: applies the predicate rules and returns an RDD
- insert: writes the DataFrame to the data source
Cassandra implementation:
class DefaultSource extends RelationProvider
- createRelation returns CassandraSourceRelation(sqlContext, schema, table)
class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
- sizeInBytes: estimated size of the Cassandra table
- unhandledFilters: applies the predicate pushdown rules and returns the filters not handled by Cassandra
- buildScan: applies the predicate pushdown rules and returns a CassandraTableScanRDD
- insert: writes the DataFrame to the data source
A minimal skeleton of this contract is sketched below.
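The sketch below shows the Data Source V1 contract using Spark's real sources API; the ExampleRelation class, its schema, and the filter handling are illustrative stand-ins, not the actual connector code.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new ExampleRelation(sqlContext, parameters("keyspace"), parameters("table"))
}

class ExampleRelation(
    override val sqlContext: SQLContext,
    keyspace: String,
    table: String)
  extends BaseRelation with PrunedFilteredScan with InsertableRelation {

  // Hypothetical schema; a real relation would derive it from the source.
  override def schema: StructType = StructType(Seq(
    StructField("dt", StringType), StructField("country", StringType)))

  // Rough size estimate; Spark uses it, e.g., for broadcast-join decisions.
  override def sizeInBytes: Long = 100L * 1024 * 1024 * 1024

  // Filters returned here are re-applied by Spark after the scan.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case _: EqualTo | _: In => true // pretend only these can be pushed down
      case _                  => false
    }

  // Pushes the handled filters to the source and returns the scan RDD.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row] // placeholder: real code builds a table-scan RDD

  override def insert(data: DataFrame, overwrite: Boolean): Unit =
    () // placeholder: real code writes the DataFrame to the source
}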
35. External DataSource Example
CASSANDRA DATA SOURCE
A DataSource instance can be created with:
spark.read.format("org.apache.spark.sql.cassandra").options(Map("table" -> "event", "keyspace" -> "htap")).load()
or by creating the table using DDL, as sketched below.
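A minimal sketch of the DDL route, using Spark SQL's data-source table syntax with the same keyspace and table options as the read example above:

spark.sql("""
  CREATE TABLE event
  USING org.apache.spark.sql.cassandra
  OPTIONS (keyspace "htap", table "event")
""")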
37. Basic Cassandra Pushdown
Starting with Cassandra 2.2, an IN predicate is supported on any partition key column; before that, only the last partition key column predicate could be an IN.
Starting with Cassandra 2.2, IN is also supported on clustering columns; before that, an IN predicate was allowed only on the last clustering column, and only if it was preceded by equality predicates.
The buildScan method in the CassandraSourceRelation class is invoked by DataSourceStrategy during physical planning and applies the predicate rules.
Released in spark-cassandra-connector 2.3.1
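To verify the pushdown, the plan can be inspected for the earlier query (events as defined in the previous sketch; filter values are made up). With the connector's rules applied, the In and EqualTo filters should appear as pushed filters on the scan node rather than as post-scan filters.

events
  .filter($"dt".isin("2018-01-01", "2018-01-02") && $"country" === "US")
  .explain() // look for the pushed filters in the scan node of the physical plan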
39. Performance Improvement
Total Cores = 8
Total Size of Table = 100 GB
Rows Scanned = 10,477,680
Time = 1.9 min
IMPROVEMENT: >15 min down to 1.9 min = 87.3%
40. Spark Job Scans the Required Cassandra Partitions but Creates Only One Spark Partition
42. Takeaways
If queries are running slow:
• Analyze the logical and physical plans to check whether the expected rules are applied.
• Use the DAG to identify the slow-running step - scan, shuffle, transformations, etc.
If scans are the issue and you are using an external data source:
• Check the rows scanned and the number of tasks launched.
• Look at the buildScan method in the DataSourceRelation class to debug the problem.
You may get a chance to contribute to open source!
48. RDD Contract
• Parent RDDs (RDD dependencies)
• getPartitions: returns an array of the partitions that the dataset is divided into
• A compute function that does the computation on each partition
• An optional Partitioner that defines how keys are hashed
• Optional preferred locations (aka locality info), i.e. the hosts for a partition where the records live or are the closest to read from
A minimal custom RDD implementing this contract is sketched below.
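The sketch uses Spark's real RDD API; the RangeSliceRDD itself is a made-up example that simply yields ranges of numbers.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangeSlicePartition(override val index: Int, val start: Long, val end: Long)
  extends Partition

class RangeSliceRDD(sc: SparkContext, slices: Int) extends RDD[Long](sc, Nil) {
  // getPartitions: how the dataset is divided.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](slices)(i =>
      new RangeSlicePartition(i, i * 100L, (i + 1) * 100L))

  // compute: produce the records of one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val p = split.asInstanceOf[RangeSlicePartition]
    (p.start until p.end).iterator
  }

  // getPreferredLocations: hosts where the partition's data lives (none here).
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}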
49. Stack calls when an application is submitted to the Master
Ref: https://trongkhoanguyen.com/spark/understand-the-scheduler-component-in-spark-core/
50. CassandraTableScanRDD - Current Implementation
An RDD representing a scan of a Cassandra table.
• getDependencies: returns the RDD dependencies
• getPartitions: if predicates are pushed down, only 1 Spark partition is created; otherwise a Spark partition consists of one or more contiguous token ranges
• getPreferredLocations: tells Spark the preferred nodes to fetch a partition from
• compute: fetches the data for the token range corresponding to each partition
• partitioner: defines how the data is partitioned
52. Fix Required: Update the getPartitions Method in CassandraTableScanRDD
PULL REQUEST: https://github.com/datastax/spark-cassandra-connector/pull/1214
getPartitions:
• If predicates are pushed down, determines the number of Cassandra partitions to scan based on the query.
• Creates as many Spark partitions as Cassandra partitions scanned (based on the query), as sketched below.
[Diagram: Cassandra partition keys mapped to Spark partitions (token range, machine)]
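A simplified, self-contained sketch of the idea behind the fix; the real change is in the pull request above, and every name and type here is hypothetical.

final case class TokenRange(start: Long, end: Long)
final case class ScanPartition(index: Int, ranges: Seq[TokenRange])

def getPartitions(
    predicatesPushed: Boolean,
    queriedKeyRanges: Seq[TokenRange],   // token ranges of the Cassandra partitions named in the query
    groupedRanges: Seq[Seq[TokenRange]]  // default grouping of contiguous token ranges
): Array[ScanPartition] =
  if (predicatesPushed)
    // One Spark partition per queried Cassandra partition -> N parallel tasks.
    queriedKeyRanges.zipWithIndex.map { case (r, i) => ScanPartition(i, Seq(r)) }.toArray
  else
    // Fall back to grouping contiguous token ranges into Spark partitions.
    groupedRanges.zipWithIndex.map { case (g, i) => ScanPartition(i, g) }.toArray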
54. Performance Improvement
Total Cores = 8
Total Size of Table = 100 GB
Rows Scanned = 10,477,680
Time = 32 sec
IMPROVEMENT: ~15 min to 1.9 min to 32 sec (after first + second patch) = 97%
57. Takeaways
If the workload is not distributed:
• Check the number of Spark tasks launched.
• Check the configuration properties (number of cores, etc.) passed to the Spark job.
If the config parameters and other infrastructure settings look good:
• Check the partitioning logic for the RDD.
• Validate the number of partitions created based on that logic.
You may get a chance to contribute to open source!
58. Q&A
Your opportunity to ask and learn
We are hiring!
Twitter: shradha151
LinkedIn: https://www.linkedin.com/in/shradha-ambekar-a0504714