WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Rose Toomey, Coatue Management
Spark At Scale In the
Cloud
#UnifiedDataAnalytics #SparkAISummit
About me
NYC. Finance. Technology. Code.
• At each job I wrote code, but found that the data
challenges just kept growing
– Lead API Developer at Gemini Trust
– Director at Novus Partners
• Now: coding and working with data full time
– Software Engineer at Coatue Management
How do you process this…
Numbers are approximate.
• Dataset is 35+ TiB raw
• 80k+ input files in an unsplittable, compressed row-based
format with heavy skew and a deeply nested directory structure
• Processing results in 275+ billion rows cached to disk
• Lots of data written back out to S3
– Including stages ending in sustained writes of tens of TiB
4
On a very big Spark cluster…
Sometimes you just need to bring the entire
dataset into memory.
The more nodes a Spark cluster has, the more
important configuration tuning becomes.
Even more so in the cloud, where you will
regularly experience I/O variance and
unreliable nodes.
In the cloud?
• Infrastructure management is hard
– Scaling resources and bandwidth in a datacenter
is not instant
– Spark/Hadoop clusters are not islands – you’re
managing an entire ecosystem of supporting
players
• Optimizing Spark jobs is hard
Let’s limit the number of hard things we’re going to tackle
at once.
Things going wrong at scale
Everything is relative. In smaller clusters, these
configurations worked fine.
• Everything is waiting on everything else because Netty
doesn't have enough firepower to shuffle faster
• Speculation meets skew and relaunches the very
slowest parts of a join, leaving most of the cluster idle
• An external service rate limits, which causes blacklisting
to sideline most of a perfectly good cluster
7
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Putting together a big
Spark cluster
• What kind of nodes should the
cluster have? Big? Small?
Medium?
• What's your resource limitation for
the number of executors?
– Just memory (standalone)
– Both memory and vCPUs (YARN)
• Individual executors should have
how much memory and how many
virtual CPUs?
Galactic Wreckage in Stephan's Quintet
9
One Very Big Standalone Node
One mega instance configured with many
"just right" executors, each provisioned with
• < 32 GiB heap (sweet spot for GC)
• 5 cores (for good throughput)
• Minimizes shuffle overhead
• Like the pony, not offered by your cloud
provider. Also, poor fault tolerance.
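As a minimal sketch (with illustrative numbers, not taken from this deck), such "just right" executors might be requested like this:
import org.apache.spark.sql.SparkSession

// Hypothetical sizing: 5 cores per executor, heap safely under the ~32 GiB compressed-OOPs limit.
val spark = SparkSession.builder()
  .appName("just-right-executors")
  .config("spark.executor.cores", "5")      // good throughput per executor
  .config("spark.executor.memory", "28g")   // < 32 GiB keeps GC prompt and OOPs compressed
  .getOrCreate()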
10
Multiple Medium-sized Nodes
When looking at medium sized nodes, we
have a choice:
• Just one executor
• Multiple executors
But a single executor might not be the best
resource usage:
• More cores on a single executor is not
necessarily better
• When using a cluster manager like
YARN, more executors could be a more
efficient use of CPU and memory
11
Many Small Nodes
12
• 500+ small nodes
• Each node over-provisioned
relative to multiple executor per
node configurations
• Single executor per node
• Most fault tolerant but big
communications overhead
“Desperate affairs require
desperate measures.”
Vice Admiral Horatio Nelson
Why ever choose the worst solution?
Single executor per small (or medium) node is the worst
configuration for cost, provisioning, and resource usage. Why not
recommend against it?
• Resilient to node degradation and loss
• Quick transition to production: relative over-provisioning of
resources to each executor behaves more like a notebook
• Awkward instance sizes may provision more quickly than larger
instances
13
Onward!
Now that you have your cluster composition in mind, you'll need to scale
up your base infrastructure to support the number of nodes:
• Memory and garbage collection
• Tune RPC for cluster communications
• Where do you put very large datasets?
• How do you get them off the cluster?
• No task left behind: scheduling in difficult times
14
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Spark memory management
SPARK-10000: Consolidate
storage and execution memory
management
• NewRatio controls
Young/Old proportion
• spark.memory.fraction
sets storage and execution
space to ~60% tenured
space
16
Executor heap layout:
• Young Generation: 1/3 of heap
• Old Generation: 2/3 of heap
• 300m reserved
• spark.memory.fraction (~60% of the remainder): unified execution and storage space
– 50% execution (dynamic – will take more)
– 50% storage (spark.memory.storageFraction)
• Remaining ~40%: Spark metadata, user data structures, OOM safety
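A worked example of that arithmetic, as a minimal sketch: the 28 GiB heap is a hypothetical figure, and only the unified-region math above is modeled (not the Young/Old split).
// Worked example of the unified-memory arithmetic for a hypothetical 28 GiB heap.
val heapBytes   = 28L * 1024 * 1024 * 1024
val reserved    = 300L * 1024 * 1024                // fixed 300 MiB reserved
val usable      = heapBytes - reserved
val unified     = (usable * 0.6).toLong             // spark.memory.fraction = 0.6
val storage     = (unified * 0.5).toLong            // spark.memory.storageFraction = 0.5
val userAndMeta = usable - unified                  // Spark metadata, user data structures, OOM safety
println(f"unified=${unified / 1e9}%.1f GB, storage=${storage / 1e9}%.1f GB, other=${userAndMeta / 1e9}%.1f GB")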
17
Field guide to Spark GC tuning
• Lots of minor GC - easy fix
– Increase Eden space (high allocation rate)
• Lots of major GC - need to diagnose the trigger
– Triggered by promotion - increase Eden space
– Triggered by Old Generation filling up - increase Old Generation
space or decrease spark.memory.fraction
• Full GC before stage completes
– Trigger minor GC earlier and more often
18
Full GC tailspin
Balance sizing up against tuning code
• Switch to bigger and/or more nodes
• Look for slow running stages caused by avoidable shuffle, tune
joins and aggregation operations
• Checkpoint both to preserve work at strategic points and to
truncate DAG lineage
• Cache to disk only
• Trade CPU for memory by compressing data in memory using
spark.rdd.compress
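A hedged sketch of the last three bullets; the checkpoint directory, input path, and pipeline are hypothetical placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .config("spark.rdd.compress", "true")                // trade CPU for memory on serialized cached data
  .getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical location

val enriched = spark.read.parquet("hdfs:///data/raw")  // placeholder for an expensive pipeline
  .groupBy("key").count()
val durable = enriched.persist(StorageLevel.DISK_ONLY) // cache to disk only
val pruned  = durable.checkpoint()                     // preserve work and truncate DAG lineage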
19
Which garbage collector?
Throughput or latency?
• ParallelGC favors throughput
• G1GC is low latency
– Shiny new things like string deduplication
– Vulnerable to wide rows
Whichever you choose, collect early and often.
20
Where to cache big datasets
• To disk. Which is slow.
• But frees up as much tenured space as possible for
execution, and storing things which must be in memory
– internal metadata
– user data structures
– broadcasting the skew side of joins
21
22
Perils of caching to disk
19/04/13 01:27:33 WARN BlockManagerMasterEndpoint: No more replicas
available for rdd_48_27005 !
When you lose an executor, you lose all the cached blocks stored by that
executor even if the node is still running.
• If lineage is gone, the entire job will fail
• If lineage is present, RDD#getOrCompute tries to compensate for the missing
blocks by re-ingesting the source data. While it keeps your job from failing,
this could introduce enormous slowdowns if the source data is skewed, your
ingestion process is complex, etc.
23
Self healing block management
// use this with replication >= 2 when caching to disk on a non-distributed filesystem
spark.storage.replication.proactive = true
Pro-active block replenishment in case of node/executor failures
https://issues.apache.org/jira/browse/SPARK-15355
https://github.com/apache/spark/pull/14412
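A minimal sketch of pairing this setting with replicated disk caching; the S3 path is hypothetical, and DISK_ONLY_2 supplies the replication >= 2 the comment above calls for.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .config("spark.storage.replication.proactive", "true")
  .getOrCreate()

// Two disk replicas: if an executor dies, its blocks are re-replicated from the
// surviving copy instead of being recomputed from the source data.
val big = spark.read.parquet("s3a://bucket/huge-dataset")
big.persist(StorageLevel.DISK_ONLY_2)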
24
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Networking
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Tune RPC for cluster
communications
Netty server processing RPC requests
is the backbone of both authentication
and shuffle services.
Insufficient RPC resources cause slow
speed mayhem: clients disassociate,
operations time out.
org.apache.spark.network.util.
TransportConf is the shared config for
both shuffle and authentication services.
Ruth Teitelbum and Marlyn Meltzer
reprogramming ENIAC, 1946
26
Scaling RPC
// used for auth
spark.rpc.io.serverThreads = coresPerDriver * rpcThreadMultiplier
// used for shuffle
spark.shuffle.io.serverThreads = coresPerDriver * rpcThreadMultiplier
Where "RPC thread multiplier" is a scaling factor to increase the service's thread pool.
• 8 is aggressive, might cause issues
• 4 is moderately aggressive
• 2 is recommended (start here, benchmark, then increase)
• 1 (number of vCPU cores) is default but is too small for a large cluster
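A sketch of deriving these values programmatically; the core count is a hypothetical instance size and the multiplier follows the recommended starting point of 2.
import org.apache.spark.SparkConf

val coresPerDriver      = 16   // hypothetical vCPU count
val rpcThreadMultiplier = 2    // recommended starting point; benchmark before raising
val serverThreads       = (coresPerDriver * rpcThreadMultiplier).toString

// pass this conf to SparkSession.builder().config(conf)
val conf = new SparkConf()
  .set("spark.rpc.io.serverThreads", serverThreads)      // auth
  .set("spark.shuffle.io.serverThreads", serverThreads)  // shuffle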
27
Shuffle
The definitive presentation on shuffle tuning:
Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu
and Sital Kedia)
So this section focuses on
• Some differences from the configurations presented in Liu and
Kedia's talk, as well as
• Configurations that weren't covered in that presentation
28
Strategy for lots of shuffle clients
1. Scale the server way up
// mentioned in Liu/Kedia presentation but now deprecated
// spark.shuffle.service.index.cache.entries = 2048
// default: 100 MiB
spark.shuffle.service.index.cache.size = 256m
// length of accept queue. default: 64
spark.shuffle.io.backLog = 8192
// default (not increased by spark.network.timeout)
spark.rpc.lookupTimeout = 120s
29
Strategy for lots of shuffle clients
2. Make clients more patient, more fault tolerant, with fewer
simultaneous requests in flight
spark.reducer.maxReqsInFlight = 5 // default: Int.MaxValue
spark.shuffle.io.maxRetries = 10 // default: 3
spark.shuffle.io.retryWait = 60s // default: 5s
30
Strategy for lots of shuffle clients
spark.shuffle.io.numConnectionsPerPeer = 1
Scaling this up conservatively for multiple executor per node
configurations can be helpful.
Not recommended to change the default for single executor per
node.
31
Shuffle partitions
spark.sql.shuffle.partitions = max(1, nodes - 1) *
coresPerExecutor * parallelismPerCore
where parallelism per core is some hyperthreading factor, let's say 2.
This formula is not the best for large shuffles, although it can be adjusted.
Apache Spark Core—Deep Dive—Proper Optimization (Daniel Tomes)
recommends setting this value to max(cluster executor cores,
shuffle stage input / 200 MB). That translates to 5242 partitions
per TB. Highly aggressive shuffle optimization is required for a large
dataset on a cluster with a large number of executors.
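A sketch of Tomes' heuristic with hypothetical numbers, assuming an active SparkSession named spark:
val executors         = 1000                 // hypothetical cluster size
val coresPerExecutor  = 4
val shuffleInputMB    = 21L * 1024 * 1024    // e.g. a ~21 TiB shuffle stage
val targetPartitionMB = 200L

val byCores = (executors * coresPerExecutor).toLong
val bySize  = shuffleInputMB / targetPartitionMB
// partitions = max(total executor cores, shuffle stage input / 200 MB)
spark.conf.set("spark.sql.shuffle.partitions", math.max(byCores, bySize).toString)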
32
Kill Spill
spark.shuffle.spill.numElementsForceSpillThreshold = 25000000
spark.sql.windowExec.buffer.spill.threshold = 25000000
spark.sql.sortMergeJoinExec.buffer.spill.threshold = 25000000
• Spill is the number one cause of poor performance on very large
Spark clusters. These settings control when Spark spills data from
memory to disk – the defaults are a bad choice!
• Set these to a big Integer value – start with 25000000 and
increase if you can. More is more.
• SPARK-21595: Separate thresholds for buffering and spilling in
ExternalAppendOnlyUnsafeRowArray
Scaling AWS S3 Writes
Hadoop AWS S3 support in 3.2.0 is
amazing
• Especially the new S3A committers
https://hadoop.apache.org/docs/r3.2.0/hado
op-aws/tools/hadoop-aws/index.html
EMR: write to HDFS and copy off using
s3DistCp (limit reducers if necessary)
Databricks: writing directly to S3 just works
First NASA ISINGLASS rocket launch
34
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Services
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Task Scheduling
Spark's powerful task scheduling
settings can interact in unexpected
ways at scale.
• Dynamic resource allocation
• External shuffle
• Speculative Execution
• Blacklisting
• Task reaper
Apollo 13 Mailbox at Mission Control
36
Dynamic resource allocation
Dynamic resource allocation benefits a multi-tenant cluster where
multiple applications can share resources.
If you have an ETL pipeline running on a large transient Spark
cluster, dynamic allocation is not useful to your single application.
Note that even in the first case, when your application no longer
needs some executors, those cluster nodes don't get spun down:
• Dynamic allocation requires an external shuffle service
• The node stays live and shuffle blocks continue to be served from it
37
External shuffle service
spark.shuffle.service.enabled = true
spark.shuffle.registration.timeout = 60000 // default: 5000ms
spark.shuffle.registration.maxAttempts = 5 // default: 3
Even without dynamic allocation, an external shuffle service may be a good idea.
• If you lose executors through dynamic allocation, the external shuffle process still
serves up those blocks.
• The external shuffle service could be more responsive than the executor itself
However, the registration values are insufficient for a large busy cluster:
SPARK-20640 Make rpc timeout and retry for shuffle registration configurable
38
Speculative execution
When speculative execution works as intended, tasks running slowly
due to transient node issues don't bog down that stage indefinitely.
• Spark calculates the median execution time of all tasks in the stage
• spark.speculation.quantile - don't start speculating until this
percentage of tasks are complete (default 0.75)
• spark.speculation.multiplier - expressed as a multiple of the
median execution time, this is how slow a task must be to be
considered for speculation
• Whichever copy is still running when the first one finishes gets killed
39
One size does not fit all
spark.speculation = true
spark.speculation.quantile = 0.8 //default: 0.75
spark.speculation.multiplier = 4 // default: 1.5
These were our standard speculative execution settings. They
worked "fine" in most of our pipelines. But they worked fine
because the median size of the tasks at 80% was OK.
What happens when reasonable settings meet unreasonable
data?
40
21.2 TB shuffle, 20% of tasks killed
41
Speculation: unintended consequences
The median task length is based on the fast 80% – but due to heavy skew, this estimate is bad!
That causes the scheduler to take the worst part of the job and launch more copies of the
longest-running tasks ... one of which then gets killed.
spark.speculation = true
// start later (might get a better estimate)
spark.speculation.quantile = 0.90
// default 1.5 - require a task to be really bad
spark.speculation.multiplier = 6
The solution was two-fold:
• Start speculative execution later (increase the quantile) and require a greater slowness
multiplier
• Do something about the skew
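The deck doesn't show how the skew itself was handled; one common remediation is salting the hot join key, sketched here with hypothetical DataFrames left and right joined on key.
import org.apache.spark.sql.functions._

val saltBuckets = 32
// Spread each hot key across saltBuckets partitions on the large side,
// and replicate the other side across all salt values.
val saltedLeft  = left.withColumn("salt", (rand() * saltBuckets).cast("int"))
val saltedRight = right.withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))
val joined      = saltedLeft.join(saltedRight, Seq("key", "salt"))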
42
Benefits of speculative execution
• Speculation can be very helpful when the application is interacting
with an external service. Example: writing to S3
• When speculation kills a task that was going to fail anyway, it
doesn't count against the failed tasks for that
stage/executor/node/job
• Clusters are not tuned in a day! Speculation can help pave over
slowdowns caused by scaling issues
• Useful canary: when you see tasks being intentionally killed in any
quantity, it's worth investigating why
43
Blacklisting
spark.blacklist.enabled = true
spark.blacklist.task.maxTaskAttemptsPerExecutor = 1 // task blacklisted from executor
spark.blacklist.stage.maxFailedTasksPerExecutor = 2 // executor blacklisted from stage
// how many different tasks must fail in successful task sets before the executor
// is blacklisted from the application
spark.blacklist.application.maxFailedTasksPerExecutor = 2
spark.blacklist.timeout = 1h // executor removed from blacklist, takes new tasks
Blacklisting prevents Spark from scheduling tasks on executors/nodes which have failed too many
times in the current stage.
The default number of failures is too conservative when using flaky external services. Let's see
how quickly it can add up...
44
45
Blacklisting gone wrong
• While writing three very large datasets to S3, something went
wrong about 17 TiB in
• 8600+ errors trying to write to S3 in the space of eight minutes,
distributed across 1000 nodes
– Some executors backoff and retry, succeed
– Speculative execution kicks in, softening the blow
– But all the nodes quickly accumulate at least two failed tasks,
many have more and get blacklisted
• Eventually translating to four failed tasks, killing the job
46
47
Don't blacklist too soon
• We enabled blacklisting but didn't adjust the defaults because we never "needed" to
before
• Post mortem showed cluster blocks were too large for our s3a settings
spark.blacklist.enabled = true
spark.blacklist.stage.maxFailedTasksPerExecutor = 8 // default: 2
spark.blacklist.application.maxFailedTasksPerExecutor = 24 // default: 2
spark.blacklist.timeout = 15m // default: 1h
Solution was to
• Make blacklisting a lot more tolerant of failure
• Repartition data on write for better block size
• Adjust s3a settings to raise multipart upload size
48
Don't fear the reaper
spark.task.reaper.enabled = true
// default: -1 (prevents executor from self-destructing)
spark.task.reaper.killTimeout = 180s
The task reaper monitors to make sure tasks that get interrupted or killed actually shut
down.
On a large job, give a little extra time before killing the JVM
• If you've increased timeouts, the task may need more time to shut down cleanly
• If the task reaper kills the JVM abruptly, you could lose cached blocks
SPARK-18761 Uncancellable / unkillable tasks may starve jobs of resources
49
Spark at scale in the cloud
Building
• Composition
• Structure
Scaling
• Memory
• Services
• S3
Scheduling
• Speculation
• Blacklisting
Tuning
Patience
Tolerance
Acceptance
Increase tolerance
• If you find a timeout or number of retries, raise it
• If you find a buffer, backlog, queue, or threshold, increase it
• If you have an MR task with a number of reducers trying to use
a service concurrently in a large cluster
– Either limit the number of active tasks per reducer, or
– Limit the number of reducers active at the same time
51
Be more patient
spark.network.timeout = 120s // default – might be too low for a large cluster under load
Spark has a lot of different networking timeouts. This is the
biggest knob to turn: increasing this increases many settings at
once.
(This setting does not increase the spark.rpc.timeout used by
shuffle and authentication services.)
52
Executor heartbeat timeouts
spark.executor.heartbeatInterval = 10s // default
spark.executor.heartbeatInterval should be significantly
less than spark.network.timeout.
Executors missing heartbeats usually signify a memory issue, not
a network problem.
• Increase the number of partitions in the dataset
• Remediate skew causing some partition(s) to be much larger
than the others
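A minimal sketch of the first remedy; the path, partition count, and column name are hypothetical, assuming an active SparkSession named spark.
import org.apache.spark.sql.functions.col

val wide = spark.read.parquet("s3a://bucket/events")   // hypothetical skew-prone dataset
println(wide.rdd.getNumPartitions)                     // e.g. a few hundred oversized partitions
// Spread the same data over more, smaller partitions before the memory-heavy stage.
val finer = wide.repartition(4000, col("customer_id"))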
53
Be resilient to failure
spark.stage.maxConsecutiveAttempts = 10 // default: 4
// default: 4 (would go higher for cloud storage misbehavior)
spark.task.maxFailures = 12
spark.max.fetch.failures.per.stage = 10 // default: 4 (helps shuffle)
Increase the number of failures your application can accept at the task and stage level.
Use blacklisting and speculation to your advantage. It's better to concede some extra resources to a
stage which eventually succeeds than to fail the entire job:
• Note that tasks killed through speculation - which might otherwise have failed - don't count against
you here.
• Blacklisting - which in the best case removes from a stage or job a host which can't participate
anyway - also helps proactively keep this count down. Just be sure to raise the number of failures
there too!
54
Koan
A Spark job that is broken
is only a special case of a
Spark job that is working.
Koan Mu calligraphy by Brigitte D'Ortschy
is licensed under CC BY 3.0
55
Interested?
• What we do: data engineering @ Coatue
‒ Terabyte scale, billions of rows
‒ Lambda architecture
‒ Functional programming
• Stack
‒ Scala (cats, shapeless, fs2, http4s)
‒ Spark / Hadoop / EMR / Databricks
‒ Data warehouses
‒ Python / R / Tableau
‒ Chat with me or email: rtoomey@coatue.com
‒ Twitter: @prasinous
56
Digestifs
Resources, links, configurations
Useful things for later
Desirable heap size for executors
spark.executor.memory = ???
JVM flag -XX:+UseCompressedOops allows you to use 4-byte pointers instead
of 8 (on by default in JDK 7+).
• < 32 GB: good for prompt GC, supports compressed OOPs.
• 32–48 GB: "dead zone" – without compressed OOPs over 32 GB, you need almost 48 GB to hold the
same number of objects.
• 49–64+ GB: very large joins, or the special case of wide rows with G1GC.
58
How many concurrent tasks per executor?
spark.executor.cores = ???
Defaults to number of physical cores, but represents the maximum number of
concurrent tasks that can run on a single executor.
• < 2: too few cores – doesn't make good use of parallelism.
• 2–4: recommended size for "most" Spark apps.
• 5: HDFS client performance tops out.
• > 8: too many cores – overhead from context switching outweighs the benefit.
59
Memory
• Spark docs: Garbage Collection Tuning
• Distribution of Executors, Cores and Memory for a Spark Application
running in Yarn (spoddutur.github.io/spark-notes)
• How-to: Tune Your Apache Spark Jobs (Part 2) - (Sandy Ryza)
• Why Your Spark Applications Are Slow or Failing, Part 1: Memory
Management (Rishitesh Mishra)
• Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities
(Fabian Lange)
• Everything by Aleksey Shipilëv at https://shipilev.net/, @shipilev, or
anywhere else
60
GC debug logging
Restart your cluster with these options in
spark.executor.extraJavaOptions and
spark.driver.extraJavaOptions
-verbose:gc -XX:+PrintGC -XX:+PrintGCDateStamps 
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails 
-XX:+PrintGCCause -XX:+PrintTenuringDistribution 
-XX:+PrintFlagsFinal
61
Parallel GC: throughput friendly
-XX:+UseParallelGC -XX:ParallelGCThreads=NUM_THREADS
• The heap size set using spark.driver.memory and
spark.executor.memory
• Defaults to one third Young Generation and two thirds Old
Generation
• Number of threads does not scale 1:1 with number of cores
– Start with 8
– After 8 cores, use 5/8 remaining cores
– After 32 cores, use 5/16 remaining cores
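That heuristic written out as code – a sketch of the JVM default; the additional 5/16 step beyond 32 cores is not modeled here.
// 8 threads for the first 8 cores, then 5/8 of a thread per additional core.
def defaultParallelGCThreads(cores: Int): Int =
  if (cores <= 8) cores else 8 + ((cores - 8) * 5) / 8

Seq(8, 16, 32, 64).foreach(c => println(s"$c cores -> ${defaultParallelGCThreads(c)} GC threads"))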
62
Parallel GC: sizing Young Generation
• Eden is 3/4 of young generation
• Each of the two survivor spaces is 1/8 of young generation
By default, -XX:NewRatio=2, meaning that Old Generation occupies 2/3
of the heap
• Increase NewRatio to give Old Generation more space (3 for
3/4 of the heap)
• Decrease NewRatio to give Young Generation more space (1
for 1/2 of the heap)
63
Parallel GC: sizing Old Generation
By default, spark.memory.fraction allows cached internal data
to occupy 0.6 * (heap size - 300M). Old Generation needs
to be bigger than spark.memory.fraction.
• Decrease spark.memory.storageFraction (default 0.5) to free
up more space for execution
• Increase Old Generation space to combat spilling to disk,
cache eviction
64
G1 GC: latency friendly
-XX:+UseG1GC -XX:ParallelGCThreads=X 
-XX:ConcGCThreads=(2*X)
Parallel GC threads are the "stop the world" worker threads. Defaults to the same
calculation as parallel GC; some articles recommend 8 + max(0, cores - 8) * 0.625.
Concurrent GC threads mark in parallel with the running application. The default of a
quarter as many threads as used for parallel GC may be conservative for a large Spark
application. Several articles recommend scaling this number of threads up in conjunction
with a lower initiating heap occupancy.
Garbage First Garbage Collector Tuning (Monica Beckwith)
65
G1 GC logging
Same as shown for parallel GC, but also
-XX:+UnlockDiagnosticVMOptions 
-XX:+PrintAdaptiveSizePolicy 
-XX:+G1SummarizeConcMark
G1 offers a range of GC logging information on top of the
standard parallel GC logging options.
Collecting and reading G1 garbage collector logs - part 2 (Matt
Robson)
66
G1 Initiating heap occupancy
-XX:InitiatingHeapOccupancyPercent=35
By default, G1 GC will initiate garbage collection when the heap is 45 percent full. This can lead to
a situation where full GC is necessary before the less costly concurrent phase has run or
completed.
By triggering concurrent GC sooner and scaling up the number of threads available to perform the
concurrent work, the more aggressive concurrent phase can forestall full collections.
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
(Karunanithi Shanmugam)
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing (Eric Kaczmarek and
Liqi Yi, Intel)
67
G1 Region size
-XX:G1HeapRegionSize=16m
The heap defaults to region size between 1 and 32 MiB. For example, a heap with <= 32 GiB has a region size
of 8 MiB; one with <= 16 GiB has 4 MiB.
If you see Humongous Allocation in your GC logs, indicating an object which occupies > 50% of your current
region size, then consider increasing G1HeapRegionSize. Changing this setting is not recommended for most
cases because
• Increasing region size reduces the number of available regions, plus
• The additional cost of copying/cleaning up the larger regions may reduce throughput or increase latency
Most commonly caused by a dataset with very wide rows. If you can't improve G1 performance, switch back to
parallel GC.
Plumbr.io handbook: GC Tuning: In Practice: Other Examples: Humongous Allocations
68
G1 string deduplication
-XX:+UseStringDeduplication 
-XX:+PrintStringDeduplicationStatistics
May decrease your memory usage if you have a significant
number of duplicate String instances in memory.
JEP 192: String Deduplication in G1
69
Shuffle
• Scaling Apache Spark at Facebook (Ankit Agarwal and Sameer Agarwal)
• Spark Shuffle Deep Dive (Bo Yang)
These older presentations sometimes pertain to previous versions of Spark
but still have substantial value.
• Optimal Strategies for Large Scale Batch ETL Jobs (Emma Tang) - 2017
• Apache Spark @Scale: A 60 TB+ production use case from Facebook
(Sital Kedia, Shuojie Wang and Avery Ching) - 2016
• Apache Spark the fastest open source engine for sorting a petabyte
(Reynold Xin) - 2014
70
S3
• Best Practices Design Patterns: Optimizing Amazon S3
Performance (Mai-Lan Tomsen Bukovec, Andy Warfield, and
Tim Harris)
• Seven Tips for Using S3DistCp on Amazon EMR to Move
Data Efficiently Between HDFS and Amazon S3 (Illya
Yalovyy)
• Cost optimization through performance improvement of
S3DistCp (Sarang Anajwala)
71
S3: EMR
Write your data to HDFS and then create a separate step using s3DistCp to
copy the files to S3.
This utility is problematic for large clusters and large datasets:
• Primitive error handling
– Deals with being rate limited by S3 by.... trying harder, choking, failing
– No way to increase the number of failures allowed
– No way to distinguish between being rate limited and getting fatal backend
errors
• If any s3DistCp step fails, EMR job fails even if a later s3DistCp step
succeeds
72
Using s3DistCp on a large cluster
-D mapreduce.job.reduces=(numExecutors / 2)
The default number of reducers is one per executor – documentation says the "right"
number is probably 0.95 or 1.75 times the available reduce slots. All three choices are bad for s3DistCp, where the
reduce phase of the job writes to S3. Experiment to figure out how much to scale down
the number of reducers so the data is copied off in a timely manner without too much
rate limiting.
On large jobs, we recommend running the s3DistCp step as many times as necessary to
ensure all your data makes it off HDFS to S3 before the cluster shuts down.
Hadoop Map Reduce Tutorial: Map-Reduce User Interfaces
73
Databricks
fs.s3a.multipart.threshold = 2147483647 // default (in bytes)
fs.s3a.multipart.size = 104857600
fs.s3a.connection.maximum = min(clusterNodes, 500)
fs.s3a.connection.timeout = 60000 // default: 20000ms
fs.s3a.block.size = 134217728 // default: 32M - used for reading
fs.s3a.fast.upload = true // disable if writes are failing
// spark.stage.maxConsecutiveAttempts = 10 // default: 4 – increase if writes are failing
Databricks Runtime uses its own S3 committer code, which provides
reliable performance writing directly to S3.
74
Hadoop 3.2.0
// https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/committers.html
fs.s3a.committer.name = directory
fs.s3a.committer.staging.conflict-mode = replace // replace == overwrite
fs.s3a.attempts.maximum = 20 // How many times we should retry commands on transient errors
fs.s3a.retry.throttle.limit = 20 // number of times to retry throttled request
fs.s3a.retry.throttle.interval = 1000ms
// Controls the maximum number of simultaneous connections to S3
fs.s3a.connection.maximum = ???
// Number of (part)uploads allowed to the queue before blocking additional uploads.
fs.s3a.max.total.tasks = ???
If you're lucky enough to have access to Hadoop 3.2.0, here are some highlights
pertinent to large clusters.
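A hedged sketch of wiring the directory committer into a Spark job; the committer-binding classes below come from the optional spark-hadoop-cloud module, and their presence on your classpath is an assumption.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")  // assumes spark-hadoop-cloud bindings
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()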
75
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 

Apache Spark At Scale in the Cloud

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Rose Toomey, Coatue Management Spark At Scale In the Cloud #UnifiedDataAnalytics #SparkAISummit
  • 3. About me NYC. Finance. Technology. Code. • Each job I wrote code but found that the data challenges just kept growing – Lead API Developer at Gemini Trust – Director at Novus Partners • Now: coding and working with data full time – Software Engineer at Coatue Management
  • 4. How do you process this… Numbers are approximate. • Dataset is 35+ TiB raw • Input files are 80k+ unsplittable compressed row-based format with heavy skew, deeply nested directory structure • Processing results in 275+ billion rows cached to disk • Lots of data written back out to S3 – Including stages ending in sustained writes of tens of TiB 4
  • 5. On a very big Spark cluster… Sometimes you just need to bring the entire dataset into memory. The more nodes a Spark cluster has, the more important configuration tuning becomes. Even more so in the cloud, where you will regularly experience I/O variance and unreliable nodes.
  • 6. In the cloud? • Infrastructure management is hard – Scaling resources and bandwidth in a datacenter is not instant – Spark/Hadoop clusters are not islands – you’re managing an entire ecosystem of supporting players • Optimizing Spark jobs is hard Let’s limit the number of hard things we’re going to tackle at once.
  • 7. Things going wrong at scale Everything is relative. In smaller clusters, these configurations worked fine. • Everything is waiting on everything else because Netty doesn't have enough firepower to shuffle faster • Speculation meets skew and relaunches the very slowest parts of a join, leaving most of the cluster idle • An external service rate limits, which causes blacklisting to sideline most of a perfectly good cluster 7
  • 8. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Networking • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 9. Putting together a big Spark cluster • What kind of nodes should the cluster have? Big? Small? Medium? • What's your resource limitation for the number of executors? – Just memory (standalone) – Both memory and vCPUs (YARN) • Individual executors should have how much memory and how many virtual CPUs?Galactic Wreckage in Stephan's Quintet 9
  • 10. One Very Big Standalone Node One mega instance configured with many "just right" executors, each provisioned with • < 32 GiB heap (sweet spot for GC) • 5 cores (for good throughput) • Minimizes shuffle overhead • Like the pony, not offered by your cloud provider. Also, poor fault tolerance. 10
  • 11. Multiple Medium-sized Nodes When looking at medium sized nodes, we have a choice: • Just one executor • Multiple executors But a single executor might not be the best resource usage: • More cores on a single executor is not necessarily better • When using a cluster manager like YARN, more executors could be a more efficient use of CPU and memory 11
  • 12. Many Small Nodes 12 • 500+ small nodes • Each node over-provisioned relative to multiple executor per node configurations • Single executor per node • Most fault tolerant but big communications overhead “Desperate affairs require desperate measures.” Vice Admiral Horatio Nelson
  • 13. Why ever choose the worst solution? Single executor per small (or medium) node is the worst configuration for cost, provisioning, and resource usage. Why not recommend against it? • Resilient to node degradation and loss • Quick transition to production: relative over-provisioning of resources to each executor behaves more like a notebook • Awkward instance sizes may provision more quickly than larger instances 13
  • 14. Onward! Now that you have your cluster composition in mind, you’ll need to scale up your base infrastructure to support the number of nodes: • Memory and garbage collection • Tune RPC for cluster communications • Where do you put very large datasets? • How do you get them off the cluster? • No task left behind: scheduling in difficult times 14
  • 15. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Networking • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 16. Spark memory management SPARK-10000: Consolidate storage and execution memory management • NewRatio controls the Young/Old proportion • spark.memory.fraction sets storage and execution space to ~60% of tenured space. Diagram: the heap splits into Young Generation (1/3) and Old Generation (2/3); 300 MiB is reserved; spark.memory.fraction (~60%) is shared between execution (50%, dynamic – will take more) and storage (50%, spark.memory.storageFraction); the remaining ~40% holds Spark metadata, user data structures, and OOM safety headroom. 16
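
The fractions above can be set explicitly when the application is configured. A minimal sketch in Scala, assuming you build the SparkSession yourself; the values shown are simply the documented defaults made visible, not a tuning recommendation from this deck.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // The unified memory manager splits (heap - 300 MiB) into the
    // spark.memory.fraction region (execution + storage) and the remainder
    // (user data structures, Spark metadata, OOM safety).
    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.6")         // execution + storage share of usable heap
      .set("spark.memory.storageFraction", "0.5")  // storage's protected share within that region

    val spark = SparkSession.builder()
      .appName("memory-fractions-sketch")
      .master("local[*]")  // local master only so the sketch runs standalone
      .config(conf)
      .getOrCreate()
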
  • 17. 17
  • 18. Field guide to Spark GC tuning • Lots of minor GC - easy fix – Increase Eden space (high allocation rate) • Lots of major GC - need to diagnose the trigger – Triggered by promotion - increase Eden space – Triggered by Old Generation filling up - increase Old Generation space or decrease spark.memory.fraction • Full GC before stage completes – Trigger minor GC earlier and more often 18
  • 19. Full GC tailspin Balance sizing up against tuning code • Switch to bigger and/or more nodes • Look for slow running stages caused by avoidable shuffle, tune joins and aggregation operations • Checkpoint both to preserve work at strategic points but also to truncate DAG lineage • Cache to disk only • Trade CPU for memory by compressing data in memory using spark.rdd.compress 19
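
As a rough illustration of the last three bullets, here is a sketch of disk-only caching, checkpointing to truncate lineage, and spark.rdd.compress. The checkpoint directory and the bigDf placeholder are assumptions made for the example, not values from the deck.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("gc-tailspin-sketch")
      .config("spark.rdd.compress", "true")  // trade CPU for memory on serialized cached blocks
      .master("local[*]")                    // local master only so the sketch runs
      .getOrCreate()

    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  // hypothetical path

    val bigDf = spark.range(0, 1000000L).toDF("id")  // placeholder for the real dataset

    // Persist to disk only, keeping tenured space free for execution...
    val cached = bigDf.persist(StorageLevel.DISK_ONLY)
    // ...and checkpoint at a strategic point to truncate DAG lineage.
    val truncated = cached.checkpoint()
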
  • 20. Which garbage collector? Throughput or latency? • ParallelGC favors throughput • G1GC is low latency – Shiny new things like string deduplication – vulnerable to wide rows Whichever you choose, collect early and often. 20
  • 21. Where to cache big datasets • To disk. Which is slow. • But frees up as much tenured space as possible for execution, and storing things which must be in memory – internal metadata – user data structures – broadcasting the skew side of joins 21
  • 22. 22
  • 23. Perils of caching to disk 19/04/13 01:27:33 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_48_27005 ! When you lose an executor, you lose all the cached blocks stored by that executor even if the node is still running. • If lineage is gone, the entire job will fail • If lineage is present, RDD#getOrCompute tries to compensate for the missing blocks by re-ingesting the source data. While it keeps your job from failing, this could introduce enormous slowdowns if the source data is skewed, your ingestion process is complex, etc. 23
  • 24. Self healing block management // use this with replication >= 2 when caching to disk in non-distributed filesystem spark.storage.replication.proactive = true Pro-active block replenishment in case of node/executor failures https://issues.apache.org/jira/browse/SPARK-15355 https://github.com/apache/spark/pull/14412 24
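
A short sketch of how the proactive setting might be paired with a replicated disk storage level; the session and dataset are placeholders.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("proactive-replication-sketch")
      .config("spark.storage.replication.proactive", "true")
      .master("local[*]")  // local master only so the sketch runs
      .getOrCreate()

    val df = spark.range(0, 1000000L).toDF("id")  // placeholder dataset

    // DISK_ONLY_2 keeps two copies of each block, so losing one executor's copy
    // can be healed from the replica instead of recomputing from lineage.
    df.persist(StorageLevel.DISK_ONLY_2).count()
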
  • 25. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Networking • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 26. Tune RPC for cluster communications The Netty server processing RPC requests is the backbone of both the authentication and shuffle services. Insufficient RPC resources cause slow-speed mayhem: clients disassociate, operations time out. org.apache.spark.network.util.TransportConf is the shared config for both shuffle and authentication services. Ruth Teitelbaum and Marlyn Meltzer reprogramming ENIAC, 1946 26
  • 27. Scaling RPC // used for auth spark.rpc.io.serverThreads = coresPerDriver * rpcThreadMultiplier // used for shuffle spark.shuffle.io.serverThreads = coresPerDriver * rpcThreadMultiplier Where "RPC thread multiplier" is a scaling factor to increase the service's thread pool. • 8 is aggressive, might cause issues • 4 is moderately aggressive • 2 is recommended (start here, benchmark, then increase) • 1 (number of vCPU cores) is default but is too small for a large cluster 27
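
A sketch of the multiplier arithmetic feeding these settings; coresPerDriver = 5 and the multiplier of 2 are just the recommended starting values from the slide, and the variable names are mine.

    import org.apache.spark.SparkConf

    // Hypothetical sizing: 5 cores and the conservative multiplier of 2
    // recommended above as a starting point. Benchmark, then increase.
    val coresPerDriver = 5
    val rpcThreadMultiplier = 2
    val serverThreads = coresPerDriver * rpcThreadMultiplier  // 10

    val conf = new SparkConf()
      .set("spark.rpc.io.serverThreads", serverThreads.toString)      // auth
      .set("spark.shuffle.io.serverThreads", serverThreads.toString)  // shuffle
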
  • 28. Shuffle The definitive presentation on shuffle tuning: Tuning Apache Spark for Large-Scale Workloads (Gaoxiang Liu and Sital Kedia) So this section focuses on • Some differences to configurations presented in Liu and Kedia's presentation, as well as • Configurations that weren't shown in this presentation 28
  • 29. Strategy for lots of shuffle clients 1. Scale the server way up // mentioned in Liu/Kedia presentation but now deprecated // spark.shuffle.service.index.cache.entries = 2048 // default: 100 MiB spark.shuffle.service.index.cache.size = 256m // length of accept queue. default: 64 spark.shuffle.io.backLog = 8192 // default (not increased by spark.network.timeout) spark.rpc.lookupTimeout = 120s 29
  • 30. Strategy for lots of shuffle clients 2. Make clients more patient and more fault tolerant, with fewer simultaneous requests in flight spark.reducer.maxReqsInFlight = 5 // default: Int.MaxValue spark.shuffle.io.maxRetries = 10 // default: 3 spark.shuffle.io.retryWait = 60s // default: 5s 30
  • 31. Strategy for lots of shuffle clients spark.shuffle.io.numConnectionsPerPeer = 1 Scaling this up conservatively for multiple executor per node configurations can be helpful. Not recommended to change the default for single executor per node. 31
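
Taken together, the client-side strategy from the last three slides might look like the following SparkConf sketch; the values are the ones quoted above.

    import org.apache.spark.SparkConf

    // Fewer requests in flight, more retries, longer waits between retries.
    val conf = new SparkConf()
      .set("spark.reducer.maxReqsInFlight", "5")           // default: Int.MaxValue
      .set("spark.shuffle.io.maxRetries", "10")            // default: 3
      .set("spark.shuffle.io.retryWait", "60s")            // default: 5s
      .set("spark.shuffle.io.numConnectionsPerPeer", "1")  // leave at default for one executor per node
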
  • 32. Shuffle partitions spark.sql.shuffle.partitions = max(1, nodes - 1) * coresPerExecutor * parallelismPerCore where parallelism per core is some hyperthreading factor, let's say 2. This formula is not the best for large shuffles, although it can be adjusted. Apache Spark Core—Deep Dive—Proper Optimization (Daniel Tomes) recommends setting this value to max(cluster executor cores, shuffle stage input / 200 MB), which works out to roughly 5242 partitions per TiB of shuffle input. Highly aggressive shuffle optimization is required for a large dataset on a cluster with a large number of executors. 32
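
A sketch of both sizing rules; the node count, cores per executor, and 21 TiB shuffle input are hypothetical numbers chosen only to make the arithmetic concrete.

    import org.apache.spark.sql.SparkSession

    val nodes = 500
    val coresPerExecutor = 5
    val parallelismPerCore = 2
    val clusterExecutorCores = (nodes - 1) * coresPerExecutor            // 2495

    // Rule of thumb from this slide:
    val byCores = math.max(1, nodes - 1) * coresPerExecutor * parallelismPerCore  // 4990

    // Tomes' recommendation: shuffle stage input / 200 MB, floored at cluster cores.
    val shuffleInputBytes = 21L * 1024 * 1024 * 1024 * 1024              // e.g. a 21 TiB shuffle
    val byInputSize = (shuffleInputBytes / (200L * 1024 * 1024)).toInt   // ~110100
    val shufflePartitions = math.max(clusterExecutorCores, byInputSize)

    val spark = SparkSession.builder()
      .appName("shuffle-partitions-sketch")
      .config("spark.sql.shuffle.partitions", shufflePartitions.toString)
      .master("local[*]")  // local master only so the sketch runs
      .getOrCreate()
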
  • 33. Kill Spill spark.shuffle.spill.numElementsForceSpillThreshold = 25000000 spark.sql.windowExec.buffer.spill.threshold = 25000000 spark.sql.sortMergeJoinExec.buffer.spill.threshold = 25000000 • Spill is the number one cause of poor performance on very large Spark clusters. These settings control when Spark spills data from memory to disk – the defaults are a bad choice! • Set these to a big Integer value – start with 25000000 and increase if you can. More is more. • SPARK-21595: Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray
  • 34. Scaling AWS S3 Writes Hadoop AWS S3 support in 3.2.0 is amazing • Especially the new S3A committers https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/index.html EMR: write to HDFS and copy off using s3DistCp (limit reducers if necessary) Databricks: writing directly to S3 just works First NASA ISINGLASS rocket launch 34
  • 35. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Services • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 36. Task Scheduling Spark's powerful task scheduling settings can interact in unexpected ways at scale. • Dynamic resource allocation • External shuffle • Speculative Execution • Blacklisting • Task reaper Apollo 13 Mailbox at Mission Control 36
  • 37. Dynamic resource allocation Dynamic resource allocation benefits a multi-tenant cluster where multiple applications can share resources. If you have an ETL pipeline running on a large transient Spark cluster, dynamic allocation is not useful to your single application. Note that even in the first case, when your application no longer needs some executors, those cluster nodes don't get spun down: • Dynamic allocation requires an external shuffle service • The node stays live and shuffle blocks continue to be served from it 37
  • 38. External shuffle service spark.shuffle.service.enabled = true spark.shuffle.registration.timeout = 60000 // default: 5000 ms spark.shuffle.registration.maxAttempts = 5 // default: 3 Even without dynamic allocation, an external shuffle service may be a good idea. • If you lose executors through dynamic allocation, the external shuffle process still serves up those blocks. • The external shuffle service could be more responsive than the executor itself. However, the default registration values are insufficient for a large busy cluster: SPARK-20640 Make rpc timeout and retry for shuffle registration configurable 38
  • 39. Speculative execution When speculative execution works as intended, tasks running slowly due to transient node issues don't bog down that stage indefinitely. • Spark calculates the median execution time of all tasks in the stage • spark.speculation.quantile - don't start speculating until this fraction of tasks is complete (default 0.75) • spark.speculation.multiplier - expressed as a multiple of the median execution time, this is how slow a task must be to be considered for speculation • Whichever copy (original or speculative) is still running when the other finishes gets killed 39
  • 40. One size does not fit all spark.speculation = true spark.speculation.quantile = 0.8 //default: 0.75 spark.speculation.multiplier = 4 // default: 1.5 These were our standard speculative execution settings. They worked "fine" in most of our pipelines. But they worked fine because the median size of the tasks at 80% was OK. What happens when reasonable settings meet unreasonable data? 40
  • 41. 21.2 TB shuffle, 20% of tasks killed 41
  • 42. Speculation: unintended consequences The median task length is based on the fast 80% - but due to heavy skew, this estimate is bad! It causes the scheduler to take the worst part of the job and launch more copies of the longest-running tasks ... one of which then gets killed. spark.speculation = true spark.speculation.quantile = 0.90 // start later (might get a better estimate) spark.speculation.multiplier = 6 // default 1.5 - require a task to be really bad The solution was two-fold: • Start speculative execution later (increase the quantile) and require a greater slowness multiplier • Do something about the skew (one option sketched below) 42
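
For the second half of the fix, one remediation this deck mentions elsewhere (slide 21) is broadcasting the skew side of the join so the hot keys never shuffle. A minimal sketch with placeholder data; facts and skewedDim are invented names.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder()
      .appName("skew-broadcast-sketch")
      .master("local[*]")  // local master only so the sketch runs
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-ins: a large skewed fact table and a small dimension table.
    val facts = Seq((1, "a"), (2, "b"), (1, "c")).toDF("key", "payload")
    val skewedDim = Seq((1, "hot"), (2, "cold")).toDF("key", "label")

    // Broadcasting the small side ships it to every executor, so the heavily
    // skewed join key is never shuffled across the network.
    val joined = facts.join(broadcast(skewedDim), Seq("key"))
    joined.show()
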
  • 43. Benefits of speculative execution • Speculation can be very helpful when the application is interacting with an external service. Example: writing to S3 • When speculation kills a task that was going to fail anyway, it doesn't count against the failed tasks for that stage/executor/node/job • Clusters are not tuned in a day! Speculation can help pave over slowdowns caused by scaling issues • Useful canary: when you see tasks being intentionally killed in any quantity, it's worth investigating why 43
  • 44. Blacklisting spark.blacklist.enabled = true spark.blacklist.task.maxTaskAttemptsPerExecutor = 1 // task blacklisted from executor spark.blacklist.stage.maxFailedTasksPerExecutor = 2 // executor blacklisted from stage // how many different tasks must fail in successful task sets before executor // blacklisted from application spark.blacklist.application.maxFailedTasksPerExecutor = 2 spark.blacklist.timeout = 1h // executor removed from blacklist, takes new tasks Blacklisting prevents Spark from scheduling tasks on executors/nodes which have failed too many times in the current stage. The default number of failures is too conservative when using flaky external services. Let's see how quickly it can add up... 44
  • 45. 45
  • 46. Blacklisting gone wrong • While writing three very large datasets to S3, something went wrong about 17 TiB in • 8600+ errors trying to write to S3 in the space of eight minutes, distributed across 1000 nodes – Some executors back off and retry, succeed – Speculative execution kicks in, padding the blow – But all the nodes quickly accumulate at least two failed tasks, many have more and get blacklisted • Eventually translating to four failed tasks, killing the job 46
  • 47. 47
  • 48. Don't blacklist too soon • We enabled blacklisting but didn't adjust the defaults because - we never "needed" to before • Post mortem showed cluster blocks were too large for our s3a settings spark.blacklist.enabled = true spark.blacklist.stage.maxFailedTasksPerExecutor = 8 // default: 2 spark.blacklist.application.maxFailedTasksPerExecutor = 24 // default: 2 spark.blacklist.timeout = 15m // default: 1h Solution was to • Make blacklisting a lot more tolerant of failure • Repartition data on write for better block size • Adjust s3a settings to raise multipart upload size 48
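
A sketch of the three-part fix in one place: tolerant blacklisting, a larger s3a multipart size, and a repartition before the write. The partition count and output path are placeholders; in production the destination would be an s3a:// URI.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("blacklist-tolerance-sketch")
      .config("spark.blacklist.enabled", "true")
      .config("spark.blacklist.stage.maxFailedTasksPerExecutor", "8")        // default: 2
      .config("spark.blacklist.application.maxFailedTasksPerExecutor", "24") // default: 2
      .config("spark.blacklist.timeout", "15m")                              // default: 1h
      .master("local[*]")  // local master only so the sketch runs
      .getOrCreate()

    // Raise the s3a multipart upload size so each written object needs fewer parts.
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.multipart.size", "104857600")  // 100 MiB

    val df = spark.range(0, 1000000L).toDF("id")  // placeholder for the real output

    // Repartition on write so object sizes land in a comfortable range for S3.
    df.repartition(2000)
      .write.mode("overwrite")
      .parquet("/tmp/blacklist-sketch-output")  // s3a://bucket/prefix in production
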
  • 49. Don't fear the reaper spark.task.reaper.enabled = true spark.task.reaper.killTimeout = 180s // default: -1 (never kill, which prevents the executor from self-destructing) The task reaper monitors tasks that get interrupted or killed to make sure they actually shut down. On a large job, give a little extra time before killing the JVM • If you've increased timeouts, the task may need more time to shut down cleanly • If the task reaper kills the JVM abruptly, you could lose cached blocks SPARK-18761 Uncancellable / unkillable tasks may starve jobs of resources 49
  • 50. Spark at scale in the cloud Building • Composition • Structure Scaling • Memory • Services • S3 Scheduling • Speculation • Blacklisting Tuning Patience Tolerance Acceptance
  • 51. Increase tolerance • If you find a timeout or number of retries, raise it • If you find a buffer, backlog, queue, or threshold, increase it • If you have an MR task with a number of reducers trying to use a service concurrently in a large cluster – Either limit the number of active tasks per reducer, or – Limit the number of reducers active at the same time 51
  • 52. Be more patient // default - might be too low for a large cluster under load spark.network.timeout = 120s Spark has a lot of different networking timeouts. This is the biggest knob to turn: increasing this increases many settings at once. (This setting does not increase the spark.rpc.timeout used by shuffle and authentication services.) 52
  • 53. Executor heartbeat timeouts spark.executor.heartbeatInterval = 10s // default spark.executor.heartbeatInterval should be significantly less than spark.network.timeout. Executors missing heartbeats usually signify a memory issue, not a network problem. • Increase the number of partitions in the dataset • Remediate skew causing some partition(s) to be much larger than the others 53
  • 54. Be resilient to failure spark.stage.maxConsecutiveAttempts = 10 // default: 4 spark.task.maxFailures = 12 // default: 4 (would go higher for cloud storage misbehavior) spark.max.fetch.failures.per.stage = 10 // default: 4 (helps shuffle) Increase the number of failures your application can accept at the task and stage level. Use blacklisting and speculation to your advantage. It's better to concede some extra resources to a stage which eventually succeeds than to fail the entire job: • Note that tasks killed through speculation - which might otherwise have failed - don't count against you here. • Blacklisting - which in the best case removes from a stage or job a host which can't participate anyway - also helps proactively keep this count down. Just be sure to raise the number of failures there too! 54
  • 55. Koan A Spark job that is broken is only a special case of a Spark job that is working. Koan Mu calligraphy by Brigitte D'Ortschy is licensed under CC BY 3.0 55
  • 56. Interested? • What we do: data engineering @ Coatue ‒ Terabyte scale, billions of rows ‒ Lambda architecture ‒ Functional programming • Stack ‒ Scala (cats, shapeless, fs2, http4s) ‒ Spark / Hadoop / EMR / Databricks ‒ Data warehouses ‒ Python / R / Tableau ‒ Chat with me or email: rtoomey@coatue.com ‒ Twitter: @prasinous 56
  • 58. Desirable heap size for executors spark.executor.memory = ??? The JVM flag -XX:+UseCompressedOops allows you to use 4-byte pointers instead of 8 (on by default in JDK 7+). < 32 GB: good for prompt GC, supports compressed OOPs. 32-48 GB: the "dead zone"; without compressed OOPs over 32 GB, you need almost 48 GB to hold the same number of objects. 49-64+ GB: very large joins, or the special case of wide rows with G1GC. 58
  • 59. How many concurrent tasks per executor? spark.executor.cores = ??? Defaults to number of physical cores, but represents the maximum number of concurrent tasks that can run on a single executor. < 2 Too few cores. Doesn't make good use of parallelism. 2 - 4 recommended size for "most" spark apps. 5 HDFS client performance tops out. > 8 Too many cores. Overhead from context switching outweighs benefit. 59
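
Putting the two tables above together, one plausible executor shape is a heap comfortably under the compressed-OOPs ceiling with 5 cores. A sketch only; the exact numbers are illustrative, not prescribed by the deck.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "28g")  // < 32 GiB: prompt GC, compressed OOPs still apply
      .set("spark.executor.cores", "5")     // HDFS client throughput tops out around here
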
  • 60. Memory • Spark docs: Garbage Collection Tuning • Distribution of Executors, Cores and Memory for a Spark Application running in Yarn (spoddutur.github.io/spark-notes) • How-to: Tune Your Apache Spark Jobs (Part 2) - (Sandy Ryza) • Why Your Spark Applications Are Slow or Failing, Part 1: Memory Management (Rishitesh Mishra) • Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities (Fabian Lange) • Everything by Aleksey Shipilëv at https://shipilev.net/, @shipilev, or anywhere else 60
  • 61. GC debug logging Restart your cluster with these options in spark.executor.extraJavaOptions and spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintFlagsFinal 61
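
A sketch of wiring those flags into the Spark options programmatically rather than editing config files by hand; the flag list is the JDK 8 set quoted above (JDK 9+ unified GC logging uses different flags).

    import org.apache.spark.SparkConf

    val gcLogging = Seq(
      "-verbose:gc",
      "-XX:+PrintGC",
      "-XX:+PrintGCDateStamps",
      "-XX:+PrintGCTimeStamps",
      "-XX:+PrintGCDetails",
      "-XX:+PrintGCCause",
      "-XX:+PrintTenuringDistribution",
      "-XX:+PrintFlagsFinal"
    ).mkString(" ")

    // Applied to both drivers and executors; requires a cluster restart to take effect.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", gcLogging)
      .set("spark.driver.extraJavaOptions", gcLogging)
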
  • 62. Parallel GC: throughput friendly -XX:+UseParallelGC -XX:ParallelGCThreads=NUM_THREADS • The heap size is set using spark.driver.memory and spark.executor.memory • Defaults to one third Young Generation and two thirds Old Generation • Number of threads does not scale 1:1 with number of cores – Start with 8 – After 8 cores, use 5/8 of the remaining cores – After 32 cores, use 5/16 of the remaining cores 62
  • 63. Parallel GC: sizing Young Generation • Eden is 3/4 of young generation • Each of the two survivor spaces is 1/8 of young generation By default, -XX:NewRatio=2, meaning that Old Generation occupies 2/3 of the heap • Increase NewRatio to give Old Generation more space (3 for 3/4 of the heap) • Decrease NewRatio to give Young Generation more space (1 for 1/2 of the heap) 63
  • 64. Parallel GC: sizing Old Generation By default, spark.memory.fraction allows cached internal data to occupy 0.6 * (heap size - 300M). Old Generation needs to be bigger than spark.memory.fraction. • Decrease spark.memory.storageFraction (default 0.5) to free up more space for execution • Increase Old Generation space to combat spilling to disk, cache eviction 64
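
A small worked example of the constraint that Old Generation must exceed the spark.memory.fraction region, using a hypothetical 28 GiB executor heap and default settings everywhere else.

    // With NewRatio = 2, Old Generation takes newRatio / (newRatio + 1) of the heap.
    val heapGiB = 28.0
    val newRatio = 2.0
    val oldGenGiB = heapGiB * newRatio / (newRatio + 1)            // ~18.7 GiB

    // The unified Spark region is spark.memory.fraction * (heap - 300 MiB).
    val sparkMemoryFraction = 0.6
    val unifiedRegionGiB = sparkMemoryFraction * (heapGiB - 0.3)   // ~16.6 GiB

    // ~18.7 GiB of tenured space > ~16.6 GiB of unified Spark memory, so cached
    // and execution data can be promoted without immediately filling Old Gen.
    println(f"old gen: $oldGenGiB%.1f GiB, spark.memory.fraction region: $unifiedRegionGiB%.1f GiB")
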
  • 65. G1 GC: latency friendly -XX:+UseG1GC -XX:ParallelGCThreads=X -XX:ConcGCThreads=(2*X) Parallel GC threads are the "stop the world" worker threads. Defaults to the same calculation as parallel GC; some articles recommend 8 + max(0, cores - 8) * 0.625. Concurrent GC threads mark in parallel with the running application. The default of a quarter as many threads as used for parallel GC may be conservative for a large Spark application. Several articles recommended scaling this number of threads up in conjunction with a lower initiating heap occupancy. Garbage First Garbage Collector Tuning (Monica Beckwith) 65
  • 66. G1 GC logging Same as shown for parallel GC, but also -XX:+UnlockDiagnosticVMOptions -XX:+PrintAdaptiveSizePolicy -XX:+G1SummarizeConcMark G1 offers a range of GC logging information on top of the standard parallel GC logging options. Collecting and reading G1 garbage collector logs - part 2 (Matt Robson) 66
  • 67. G1 Initiating heap occupancy -XX:InitiatingHeapOccupancyPercent=35 By default, G1 GC will initiate garbage collection when the heap is 45 percent full. This can lead to a situation where full GC is necessary before the less costly concurrent phase has run or completed. By triggering concurrent GC sooner and scaling up the number of threads available to perform the concurrent work, the more aggressive concurrent phase can forestall full collections. Best practices for successfully managing memory for Apache Spark applications on Amazon EMR (Karunanithi Shanmugam) Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing (Eric Kaczmarek and Liqi Yi, Intel) 67
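
A sketch pulling the G1 flags from the surrounding slides into a single executor option string. The 16-core node is hypothetical, and the ConcGCThreads ratio follows the 2*X shown on slide 65 (the JVM default is a quarter of ParallelGCThreads).

    import org.apache.spark.SparkConf

    // 8 + max(0, cores - 8) * 0.625 for a hypothetical 16-core node = 13
    val parallelGcThreads = 8 + math.ceil((16 - 8) * 0.625).toInt
    val concGcThreads = 2 * parallelGcThreads  // per the slide's 2*X; default would be /4

    val g1Options = Seq(
      "-XX:+UseG1GC",
      s"-XX:ParallelGCThreads=$parallelGcThreads",
      s"-XX:ConcGCThreads=$concGcThreads",
      "-XX:InitiatingHeapOccupancyPercent=35"  // start concurrent marking earlier than the 45% default
    ).mkString(" ")

    val conf = new SparkConf().set("spark.executor.extraJavaOptions", g1Options)
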
  • 68. G1 Region size -XX:G1HeapRegionSize=16m The heap defaults to a region size between 1 and 32 MiB. For example, a heap with <= 32 GiB has a region size of 8 MiB; one with <= 16 GiB has 4 MiB. If you see Humongous Allocation in your GC logs, indicating an object which occupies > 50% of your current region size, then consider increasing G1HeapRegionSize. Changing this setting is not recommended for most cases because • Increasing region size reduces the number of available regions, plus • The additional cost of copying/cleaning up the larger regions may reduce throughput or increase latency Humongous allocations are most commonly caused by a dataset with very wide rows. If you can't improve G1 performance, switch back to parallel GC. Plumbr.io handbook: GC Tuning: In Practice: Other Examples: Humongous Allocations 68
  • 69. G1 string deduplication -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics May decrease your memory usage if you have a significant number of duplicate String instances in memory. JEP 192: String Deduplication in G1 69
  • 70. Shuffle • Scaling Apache Spark at Facebook (Ankit Agarwal and Sameer Agarwal) • Spark Shuffle Deep Dive (Bo Yang) These older presentations sometimes pertain to previous versions of Spark but still have substantial value. • Optimal Strategies for Large Scale Batch ETL Jobs (Emma Tang) - 2017 • Apache Spark @Scale: A 60 TB+ production use case from Facebook (Sital Kedia, Shuojie Wang and Avery Ching) - 2016 • Apache Spark the fastest open source engine for sorting a petabyte (Reynold Xin) - 2014 70
  • 71. S3 • Best Practices Design Patterns: Optimizing Amazon S3 Performance (Mai-Lan Tomsen Bukovec, Andy Warfield, and Tim Harris) • Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3 (Illya Yalovyy) • Cost optimization through performance improvement of S3DistCp (Sarang Anajwala) 71
  • 72. S3: EMR Write your data to HDFS and then create a separate step using s3DistCp to copy the files to S3. This utility is problematic for large clusters and large datasets: • Primitive error handling – Deals with being rate limited by S3 by.... trying harder, choking, failing – No way to increase the number of failures allowed – No way to distinguish between being rate limited and getting fatal backend errors • If any s3DistCp step fails, EMR job fails even if a later s3DistCp step succeeds 72
  • 73. Using s3DistCp on a large cluster -D mapreduce.job.reduces=(numExecutors / 2) The default number of reducers is one per executor - documentation says the "right" number is probably 0.95 or 1.75. All three choices are bad for s3DistCp, where the reduce phase of the job writes to S3. Experiment to figure out how much to scale down the number of reducers so the data is copied off in a timely manner without too much rate limiting. On large jobs, recommend running s3DistCp step as many times as necessary to ensure all your data makes it off HDFS to S3 before the cluster shuts down. Hadoop Map Reduce Tutorial: Map-Reduce User Interfaces 73
  • 74. Databricks fs.s3a.multipart.threshold = 2147483647 // default (in bytes) fs.s3a.multipart.size = 104857600 fs.s3a.connection.maximum = min(clusterNodes, 500) fs.s3a.connection.timeout = 60000 // default: 20000ms fs.s3a.block.size = 134217728 // default: 32M - used for reading fs.s3a.fast.upload = true // disable if writes are failing // spark.stage.maxConsecutiveAttempts = 10 // default 4 - increase if writes are failing Databricks Runtime uses its own S3 committer code, which provides reliable performance writing directly to S3. 74
  • 75. Hadoop 3.2.0 // https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/committers.html fs.s3a.committer.name = directory fs.s3a.committer.staging.conflict-mode = replace // replace == overwrite fs.s3a.attempts.maximum = 20 // How many times we should retry commands on transient errors fs.s3a.retry.throttle.limit = 20 // number of times to retry a throttled request fs.s3a.retry.throttle.interval = 1000ms // Controls the maximum number of simultaneous connections to S3 fs.s3a.connection.maximum = ??? // Number of (part)uploads allowed to the queue before blocking additional uploads. fs.s3a.max.total.tasks = ??? If you're lucky enough to have access to Hadoop 3.2.0, here are some highlights pertinent to large clusters. 75
  • 76. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT