Parallelization of Structured Streaming Jobs Using Delta Lake

© Tubi, proprietary and confidential
- Oliver Lewis, Sr. Data Engineer

Datalake Throughput
• Requests/s: 40,000
• Aggregate records/day: 800M
• Volume/day: 500GB

Analytics Powered By Stream of Immutable Events

Engineering Challenges w/ Stream First Architecture
• Datalake file right-sizing
• Backfill / Data Deletion Process is a nightmare
• Multiple Streams writing to the same location

Delta @ Tubi
• Optimization of ingested parquet files
• Data Deletion Use cases (GDPR/CCPA)
• _spark_metadata failures in backfill operations

Example of a simple structured streaming job

Strategies to Backfill
1) Write a batch job to backfill.
2) Gracefully terminate the streaming job.
Gotcha: Do not simply swap readStream for read and writeStream for write. In a batch query, flatMapGroupsWithState is implicitly converted to mapGroups, so you lose state management entirely.

Issues in backfilling large datasets
Suppose we set start_date to 2016-01-01 and end_date to 2020-05-31 and run the job.
Structuring the job like this causes several problems:
1) If the job fails, it may do so only after running for a long time.
2) State management cannot hold the state for such a large date range.

Encapsulate the Task.
What we need is a small batch size that can be
● TRIGGERED
● EXECUTED
● COMPLETED
so that at any given time we do not store too much state on the executors and can clearly track completion.
At scale, any date can be sent as input and we should be able to generate the same output, i.e. the task should be idempotent.
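
The idea of small, date-keyed, idempotent tasks can be sketched as follows. This is a minimal illustration in plain Python rather than Spark. The names `daily_batches` and `run_backfill_task`, the sample events, and the in-memory `output` dict are all hypothetical stand-ins for the real job and its sink. It assumes one calendar day per batch and that each run overwrites that day's output.

```python
from datetime import date, timedelta


def daily_batches(start_date, end_date):
    """Yield every date from start_date to end_date inclusive, one per batch."""
    current = start_date
    while current <= end_date:
        yield current
        current += timedelta(days=1)


def run_backfill_task(day, events, output):
    """Process one day's events and overwrite that day's slot in the output.

    Overwriting (rather than appending) is what makes a rerun for the same
    day produce the same result, i.e. the task is idempotent.
    """
    output[day] = sorted(e for e in events if e[0] == day)
    return day


# Hypothetical events: (event_date, payload)
events = [
    (date(2020, 5, 30), "play"),
    (date(2020, 5, 31), "pause"),
    (date(2020, 5, 30), "seek"),
]
output = {}
for day in daily_batches(date(2020, 5, 30), date(2020, 5, 31)):
    run_backfill_task(day, events, output)

# Rerunning a single day leaves the output unchanged.
snapshot = dict(output)
run_backfill_task(date(2020, 5, 30), events, output)
```

Because each task only touches its own day, a failed day can be retried on its own without redoing the rest of the range.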

Performance
To make the backfill go faster, our immediate intuition is to increase the size of the cluster.
Example: a job with 3886 tasks took 8.2 mins on 64 cores.
With 3886 cores, every task can run at once and the job can complete in ~8 secs.
So our intuition is CORRECT.
Increasing the cluster size helps only while the number of cores is less than or equal to the number of tasks; beyond that, the run time is bounded by the most expensive task.

Performance
● But if the number of cores is greater than the number of tasks, then you have a large cluster that is not being fully utilized.
● This is an important limitation of our initial intuition: spinning up a larger cluster does not always increase performance.
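
The arithmetic behind the two Performance slides can be reproduced with a rough model. The sketch below assumes every task takes the same amount of time and that each core runs one task at a time. Both are simplifications, and `waves` and `estimated_seconds` are hypothetical helper names, not Spark APIs.

```python
import math


def waves(num_tasks, num_cores):
    """Number of scheduling rounds needed if each core runs one task at a time."""
    return math.ceil(num_tasks / num_cores)


def estimated_seconds(num_tasks, num_cores, seconds_per_task):
    """Estimated run time, assuming all tasks take the same time."""
    return waves(num_tasks, num_cores) * seconds_per_task


# Figures from the slide: 3886 tasks on 64 cores took 8.2 minutes.
observed_seconds = 8.2 * 60
seconds_per_task = observed_seconds / waves(3886, 64)

for cores in (64, 3886, 8000):
    estimate = estimated_seconds(3886, cores, seconds_per_task)
    print(f"{cores} cores: {estimate:.1f} s")
```

Under this model, 64 cores need 61 waves of roughly 8 seconds each, 3886 cores need a single wave, and going beyond 3886 cores does not reduce the estimate any further.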

Backfilling in parallel
1) Separate the business logic from the execution logic.
2) Run multiple streams in parallel. Each job is submitted to the Spark scheduler, which is responsible for executing the job depending on the number of free cores available.
3) Use Scala parallel collections (.par).
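
The separation of business logic from execution logic can be sketched in plain Python. This is only an analogue of the Scala `.par` approach, using a thread pool. `backfill_one_day` and `run_in_parallel` are hypothetical names, and in a real backfill the task body would trigger a Spark job instead of returning a string.

```python
from concurrent.futures import ThreadPoolExecutor


def backfill_one_day(day):
    """Business logic: what to do for a single input (hypothetical placeholder)."""
    return f"backfilled {day}"


def run_in_parallel(task, inputs, max_workers=4):
    """Execution logic: submit every input to a thread pool and collect results.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(task, inputs))


days = ["2020-05-29", "2020-05-30", "2020-05-31"]
results = run_in_parallel(backfill_one_day, days)
```

Keeping the task function free of any threading code means the same function can be run sequentially, in a pool, or from a scheduler without changes.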

Par collections limitations
1) Rob Pike: Concurrency is not parallelism.
2) Par collections do launch Spark jobs in parallel, but the Spark scheduler may not actually execute the jobs in parallel.

Futures and Fair Scheduler Pool
1) By default, each pool gets an equal share of the cluster, but inside each pool, jobs run in FIFO order.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
2) You can further configure each pool's schedulingMode, minShare, and weight.
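
The pattern of running each backfill in its own future and tagging that future's thread with a scheduler pool can be sketched without a cluster. `StandInContext` below is a hypothetical stand-in for a SparkContext that only mimics per-thread local properties, and `run_in_pool` is likewise an illustrative name. With a real SparkContext, the `setLocalProperty` call from the slide would take the stand-in's place, and fair scheduling would also require `spark.scheduler.mode` to be set to `FAIR`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StandInContext:
    """Hypothetical stand-in for SparkContext: local properties are per thread."""

    def __init__(self):
        self._local = threading.local()

    def setLocalProperty(self, key, value):
        if not hasattr(self._local, "props"):
            self._local.props = {}
        self._local.props[key] = value

    def getLocalProperty(self, key):
        return getattr(self._local, "props", {}).get(key)


def run_in_pool(context, pool_name, job):
    """Tag the current thread with a scheduler pool, then run the job in it."""
    context.setLocalProperty("spark.scheduler.pool", pool_name)
    return job()


context = StandInContext()
pools = ["pool1", "pool2", "pool3"]

with ThreadPoolExecutor(max_workers=len(pools)) as executor:
    futures = [
        executor.submit(
            run_in_pool,
            context,
            pool,
            # The job reports which pool its own thread was assigned to.
            lambda: context.getLocalProperty("spark.scheduler.pool"),
        )
        for pool in pools
    ]
    seen = [f.result() for f in futures]
```

The property is set inside the worker thread, not in the submitting thread, because the pool assignment is a per-thread setting.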

DEMO
https://dbc-5f2acc18-b29d.cloud.databricks.com/?o=6211501775943445#notebook/3133024054071987/

Performance Chart

Failure and Recovery handling
● We should be able to handle failures and retries within the job.
● Build a simple StateStore to track which state each job is currently in.
● If a job has finished successfully, we can remove it from the state store.
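
A state store of this kind can be sketched in a few lines. This is a minimal in-memory illustration, not the implementation from the talk. The `StateStore` methods, `run_with_retries`, the state strings, and the two sample jobs are all hypothetical, and a real store would likely need to persist its contents so they survive a driver restart.

```python
class StateStore:
    """Minimal in-memory store mapping a job id to its current state."""

    def __init__(self):
        self._states = {}

    def set(self, job_id, state):
        self._states[job_id] = state

    def remove(self, job_id):
        self._states.pop(job_id, None)

    def pending(self):
        return dict(self._states)


def run_with_retries(job_id, job, store, max_attempts=3):
    """Run job(), retrying on failure and recording progress in the store."""
    for attempt in range(1, max_attempts + 1):
        store.set(job_id, f"RUNNING (attempt {attempt})")
        try:
            result = job()
        except Exception as error:
            store.set(job_id, f"FAILED: {error}")
        else:
            store.remove(job_id)  # finished successfully, nothing left to track
            return result
    return None


attempts = {"count": 0}


def flaky_job():
    """Hypothetical job that fails once and then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient error")
    return "done"


def broken_job():
    """Hypothetical job that fails on every attempt."""
    raise RuntimeError("bad input")


store = StateStore()
flaky_result = run_with_retries("2020-05-30", flaky_job, store)
broken_result = run_with_retries("2020-05-31", broken_job, store)
```

After both calls, only the job that exhausted its retries remains in the store, so `store.pending()` doubles as the list of dates that still need attention.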

Thank You.
https://corporate.tubitv.com/company/careers/
Blog: https://code.tubitv.com/
Contact:
https://www.linkedin.com/in/oliveralewis/
olewis@tubi.tv
