1. Securely explore your data
PERFORMANCE MODELS
FOR APACHE ACCUMULO:
THE HEAVY TAIL OF A SHARED-
NOTHING ARCHITECTURE
Chris McCubbin
Director of Data Science
Sqrrl Data, Inc.
2. TODAY’S TALK
1. Quick intro to performance optimization
2. Techniques for targeted performance improvement
through distributed application modeling
3. A deep dive into improving bulk load application
performance
4. A shallow dive into partial schemas
©2014 Sqrrl Data, Inc
3. SO, YOUR DISTRIBUTED
APPLICATION IS SLOW
• Today’s distributed applications are built on tens or
hundreds of library components
• Each comes in many versions, so internet advice can be
ineffective or, worse, flat-out wrong
• Hundreds of settings
• Some, shall we say, could be better documented
• Shared-nothing architectures are usually
“shared-little” architectures with tricky
interactions
• Profiling is hard and time-consuming
4. ROUND UP THE ‘USUAL
SUSPECTS’?
• “Common knowledge” that some things can cause
performance issues
• Too much network usage
• Disk Bound
• Stragglers
• Framework settings
• Unbalanced distribution
• SerDe
• This might be a good start, but we really want to
focus on the biggest problem if we can
• Technology, installations and use cases have high
variability: what works for one job on one cluster may
be useless on another
6. MAKING A MODEL
• Identify points where low-impact metrics can be collected
• Add metrics where needed
• Create parallel state machine models with
components driven by these metrics
• Estimate running times and bottlenecks from
a priori information and/or measured statistics
• Focus testing on validation of the initial
model and the (estimated) pain points
• Apply Amdahl’s Law
• Rinse, repeat
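Applying Amdahl's Law makes the "focus on the biggest problem" step quantitative: the overall speedup is capped by the fraction of runtime the optimized component occupies. A minimal sketch (class and method names are mine, phase times taken from the bulk ingest model later in the talk):

```java
// Amdahl's Law: overall speedup when a fraction p of the runtime
// is accelerated by a factor s, and the remaining (1 - p) is untouched.
public class Amdahl {
    public static double speedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        // Using the hypothetical bulk ingest phases: reduce is 168 of
        // 46 + 168 + 17 = 231 total seconds (~73%). Even an infinitely
        // fast reduce caps the whole job's speedup:
        double p = 168.0 / (46 + 168 + 17);
        System.out.printf("max speedup if reduce were free: %.2fx%n",
                speedup(p, Double.POSITIVE_INFINITY));
    }
}
```

Speeding up any phase other than the dominant one buys almost nothing, which is why the model targets the longest bar first.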
7. The Apache Accumulo™ sorted, distributed key/value store is a secure, robust,
scalable, high performance data storage and retrieval system.
• Many applications in real-time storage and analysis of “big data”:
• Spatio-temporal indexing in non-relational distributed databases - Fox et al
2013 IEEE International Congress on Big Data
• Big Data Dimensional Analysis - Gadepally et al IEEE HPEC 2014
• Leading its peers in performance and scalability:
• Achieving 100,000,000 database inserts per second using Accumulo and
D4M - Kepner et al IEEE HPEC 2014
• An NSA Big Graph experiment (Technical Report NSA-RD-2013-056002v1)
• Benchmarking Apache Accumulo BigData Distributed Table Store Using Its
Continuous Test Suite - Sen et al 2013 IEEE International Congress on Big
Data
For more papers and presentations, see http://accumulo.apache.org/papers.html
8. • Collections of KV pairs form Tables
• Tables are partitioned into Tablets
• Metadata tablets hold info about
other tablets, forming a 3-level
hierarchy
• A Tablet is a unit of work for a
Tablet Server
SCALING UP: DIVIDE & CONQUER
[Diagram: three tables split by key range into tablets. Adam’s Table has data tablets (-∞ : thing) and (thing : ∞); Encyclopedia has data tablets (-∞ : Ocelot), (Ocelot : Yak), and (Yak : ∞); Foo has a single data tablet (-∞ to ∞). A root tablet (-∞ to ∞), found via a well-known location in ZooKeeper, points to Metadata Tablet 1 (-∞ to “Encyclopedia:Ocelot”) and Metadata Tablet 2 (“Encyclopedia:Ocelot” to ∞), which in turn point to the data tablets.]
9. BULK INGEST OVERVIEW
• Accumulo supports two mechanisms to bring
data in: streaming ingest and bulk ingest.
• Bulk Ingest
• Goal: maximize throughput without constraining
latency.
• Create a set of Accumulo Rfiles by some means,
then register those files with Accumulo.
• RFiles are groups of sorted key-value pairs with
some indexing information
• MapReduce has a built-in key sorting phase: a good
fit to produce RFiles
11. BULK INGEST MODEL
Hypothetical resource usage over time:
• Map: 100% CPU, 20% disk, 0% network — 46 seconds
• Reduce: 40% CPU, 100% disk, 20% network — 168 seconds
• Register: 10% CPU, 20% disk, 40% network — 17 seconds
12. INSIGHT
Hypothetical resource usage over time:
• Map: 100% CPU, 20% disk, 0% network — 46 seconds
• Reduce: 40% CPU, 100% disk, 20% network — 168 seconds
• Register: 10% CPU, 20% disk, 40% network — 17 seconds
• Spare disk here, spare CPU there – can we even out resource consumption?
• Why did reduce take 168 seconds? It should be more like 40 seconds.
• No clear bottleneck during registration – is there a synchronization or
serialization problem?
13. LOOKING DEEPER: REFINED BULK INGEST MODEL
[Diagram: the refined model as two parallel timelines — a map thread (Setup → Map → Sort → Spill → Merge → Serve) and a reduce thread (Shuffle → Sort → Reduce → Output) — synchronized by a parallel latch, plotted against time.]
14. BULK INGEST MODEL PREDICTIONS
• We can constrain parts of the model by physical
throughput limitations
• Disk -> memory (~100 MB/s avg 7200 RPM sequential read rate)
• Input reader
• Memory -> disk (~100 MB/s)
• Spill, OutputWriter
• Disk -> disk (~50 MB/s)
• Merge
• Network (gigabit ≈ 125 MB/s)
• Shuffle
• And/or algorithmic limitations
• Sort, (Our) Map, (Our) Reduce, SerDe
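These physical limits translate into quick lower bounds: a phase that moves B megabytes through its slowest resource cannot finish faster than B divided by that resource's bandwidth. A minimal sketch (class, constant, and method names are mine; the bandwidth figures are the slide's assumptions, and the 7 GB merge volume is only an illustrative input):

```java
// Lower-bound a phase's runtime from its data volume and the slowest
// resource it touches. Bandwidths follow the slide's assumptions:
// sequential disk ~100 MB/s, disk-to-disk ~50 MB/s, gigabit ~125 MB/s.
public class PhaseBounds {
    static final double DISK_SEQ_MBS  = 100.0; // disk -> memory, memory -> disk
    static final double DISK_COPY_MBS =  50.0; // disk -> disk (merge)
    static final double NET_GIGE_MBS  = 125.0; // gigabit ethernet (shuffle)

    // Seconds needed to move `megabytes` at `bandwidthMBs` MB/s.
    public static double lowerBoundSeconds(double megabytes, double bandwidthMBs) {
        return megabytes / bandwidthMBs;
    }

    public static void main(String[] args) {
        // e.g. merging ~7 GB of spill data disk-to-disk can't beat:
        System.out.printf("merge >= %.0f s%n",
                lowerBoundSeconds(7000, DISK_COPY_MBS));
    }
}
```

Comparing these bounds against measured phase times is how the model flags a phase (like the 168-second reduce) as suspiciously slow.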
15. PERFORMANCE GOAL MODEL
Performance goals obtained through:
• Simulation of individual components
• Prediction of available resources at runtime
16. INSTRUMENTATION
CONFIGURATION
• application version: 1.3.3
• application sha: 8d17baf8
• yarn.nodemanager.resource.memory-mb: 43008
• yarn.scheduler.minimum-allocation-mb: 2048
• yarn.scheduler.maximum-allocation-mb: 43008
• yarn.app.mapreduce.am.resource.mb: 2048
• yarn.app.mapreduce.am.command-opts: -Xmx1536m
• mapreduce.map.memory.mb: 2048
• mapreduce.map.java.opts: -Xmx1638m
• mapreduce.reduce.memory.mb: 2048
• mapreduce.reduce.java.opts: -Xmx1638m
• mapreduce.task.io.sort.mb: 100
• mapreduce.map.sort.spill.percent: 0.8
• mapreduce.task.io.sort.factor: 10
• mapreduce.reduce.shuffle.parallelcopies: 5
• mapreduce.job.reduce.slowstart.completedmaps: 1
• mapreduce.map.output.compress: FALSE
• mapred.map.output.compression.codec: n/a
• description: baseline

SYSTEM
• node num: 1
• map num containers: 20
• red num containers: 20
• cores physical: 12
• cores logical: 24
• disk num: 8
• disk bandwidth: 100
• replication: 1
• monitoring: TRUE

DATA
• input type: arcsight
• input block size: 32
• input block count: 20
• input total: 672054649
• output map: 9313303723
• output map:combine input records: 243419324
• output map:combine records out: 209318830
• output map:spill: 7325671992
• output map:combine: 7301374577
• output final: 573802787

TIME
• map:setup avg: 8
• map:map avg: 12
• map:sort avg: 12
• map:spill avg: 12
• map:spill count: 7
• map:merge avg: 46
• map total: 290
• red:shuffle avg: 6
• red:merge avg: 38
• red:reduce avg: 68
• red:total avg: 112
• red:reducer count: 20
• job:total: 396

RATIOS
• input explosion factor: 13.877904
• compression intermediate: 1.003327786
• load combiner output: 0.783972562
• total ratio: 0.786581455

CONSTANTS
• avg schema entry size (bytes): 59
• effective MB/sec: 1.618488025
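Several RATIOS entries can be re-derived from the raw DATA counters. The mapping of counters to ratios below is my inference from the values lining up, not something the slide states:

```java
// Re-deriving the RATIOS entries from the raw DATA counters above.
// Which counter feeds which ratio is inferred, not documented.
public class IngestRatios {
    public static void main(String[] args) {
        double outputMap        = 9313303723.0; // bytes out of map
        double outputMapCombine = 7301374577.0; // bytes out of combiner
        double outputMapSpill   = 7325671992.0; // bytes spilled to disk

        double compressionIntermediate = outputMapSpill / outputMapCombine;
        double loadCombinerOutput      = outputMapCombine / outputMap;
        double totalRatio              = compressionIntermediate * loadCombinerOutput;

        System.out.printf("compression intermediate: %.9f%n", compressionIntermediate);
        System.out.printf("load combiner output:     %.9f%n", loadCombinerOutput);
        System.out.printf("total ratio:              %.9f%n", totalRatio);
    }
}
```

Keeping the derived ratios next to the raw counters makes it easy to sanity-check an instrumentation run against the model.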
18. PATH TO IMPROVEMENT
1. Profiling revealed much time spent serializing/deserializing
Accumulo’s Key class
• Supported by recent investigations of, e.g., Spark jobs:
“as much as half of the CPU time is spent deserializing and
decompressing data.” https://www.eecs.berkeley.edu/~keo/
publications/nsdi15-final147.pdf
2. With proper configuration, MapReduce supports
comparison of MR keys in serialized form
3. Rewriting Key’s serialization yielded an order-preserving
encoding, easy to compare in serialized form
4. Configure MapReduce to use native code to compare Keys
5. Tweak map input size and spill memory for as few spills as
possible
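The payoff of an order-preserving encoding is that keys can then be ordered with a plain unsigned lexicographic byte comparison, so the sort never pays deserialization cost. A minimal sketch (class and method names are illustrative, not Accumulo's actual comparator):

```java
import java.nio.charset.StandardCharsets;

// Comparing keys in serialized form: with an order-preserving
// encoding, unsigned lexicographic byte comparison gives the same
// ordering as deserializing and comparing the keys themselves.
public class SerializedKeyCompare {
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff; // unsigned byte values
            if (x != y) return x - y;
        }
        return a.length - b.length; // on a common prefix, shorter sorts first
    }

    public static void main(String[] args) {
        byte[] k1 = "row1:colA".getBytes(StandardCharsets.UTF_8);
        byte[] k2 = "row1:colB".getBytes(StandardCharsets.UTF_8);
        System.out.println(compareBytes(k1, k2) < 0); // true
    }
}
```

In Hadoop terms, this is the kind of logic a `RawComparator` registered for the intermediate key type would run on the serialized bytes.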
20. PERFORMANCE MEASUREMENT
Optimized sorting
Insights:
• Map is slower than expected
• Intermediate data inflation ratio (output from map) is very high, and the
mapper is now disk-bound
• Amdahl’s law strikes again
• Reducer Output is also already disk bound.
• Can we trade disk time in Map for ‘free’ CPU time in Reduce?
[Refined bulk ingest model diagram (map and reduce thread phases, as on slide 13), measured with the optimized sort]
21. PATH TO IMPROVEMENT
• Evaluation of data passed from map to reduce
revealed inefficiencies:
• Constant timestamp cost 8 bytes per key
• Repeated column names could be encoded/
compressed
• Some Key/Value pairs didn’t need to be created until
reduce
• Blocks of data output from the mapper are guaranteed to
transfer ‘en masse’ to the same reducer
• Hypothesis
• Create ‘dehydrated’ key-value pairs of consecutive
values when possible
• Spend CPU time in reduce to ‘rehydrate’ the key-values
prior to output
• Fewer keys in shuffle also means the sort phase is more
efficient
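The dehydrate/rehydrate hypothesis can be sketched as packing a run of consecutive key-value pairs that share a row and timestamp into one intermediate record in map, then unpacking it in reduce. The record layout and names below are illustrative, not the actual Sqrrl implementation:

```java
import java.util.ArrayList;
import java.util.List;

// "Dehydrate" a run of consecutive KV pairs (same row, same timestamp)
// into one packed record; "rehydrate" it back into KV pairs in reduce.
// The row and timestamp are stored once instead of once per key.
public class Dehydrate {
    record KV(String row, String column, long ts, String value) {}

    // Packed layout: [row, ts, col1, val1, col2, val2, ...]
    public static List<Object> dehydrate(List<KV> run) {
        List<Object> packed = new ArrayList<>();
        packed.add(run.get(0).row());
        packed.add(run.get(0).ts());
        for (KV kv : run) {
            packed.add(kv.column());
            packed.add(kv.value());
        }
        return packed;
    }

    public static List<KV> rehydrate(List<Object> packed) {
        String row = (String) packed.get(0);
        long ts = (Long) packed.get(1);
        List<KV> out = new ArrayList<>();
        for (int i = 2; i < packed.size(); i += 2)
            out.add(new KV(row, (String) packed.get(i), ts, (String) packed.get(i + 1)));
        return out;
    }

    public static void main(String[] args) {
        List<KV> run = List.of(new KV("doc1", "title", 7L, "Ocelots"),
                               new KV("doc1", "body", 7L, "A medium-sized wild cat"));
        System.out.println(rehydrate(dehydrate(run)).equals(run)); // true
    }
}
```

Besides shrinking shuffle bytes, fewer (larger) intermediate records also means fewer key comparisons during the sort, which is the efficiency the last bullet refers to.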
22. PERFORMANCE MEASUREMENT
Optimized map code
• Improvement:
• Big speedup in map function
• Twice as fast
• Reduced intermediate inflation sped up all
steps between map and reduce
23. DO TRY THIS AT HOME
With these steps, we achieved a 6X speedup:
• Perform comparisons on serialized objects
• With Map/Reduce, calculate how many merge
steps are needed
• Avoid premature data inflation
• Leverage compression to shift bottlenecks
• Always consider how fast your code should run
Hints for Accumulo Application Optimization
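"Calculate how many merge steps are needed" can be made concrete: with an F-way merge (mapreduce.task.io.sort.factor) over S spill files, roughly ceil(log base F of S) passes are required, so keeping S at or below F avoids extra disk-to-disk passes. A simplified model (Hadoop's actual merge planning is more subtle than this, e.g. its first pass may merge fewer files):

```java
// Simplified merge-pass count: each pass collapses up to `sortFactor`
// files into one, repeating until a single sorted output remains.
public class MergePasses {
    public static int passes(int spills, int sortFactor) {
        int passes = 0;
        while (spills > 1) {
            spills = (int) Math.ceil((double) spills / sortFactor);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // The instrumentation above measured 7 spills with
        // mapreduce.task.io.sort.factor = 10: a single merge pass.
        System.out.println(passes(7, 10)); // 1
    }
}
```

This is why the earlier step tweaked map input size and spill memory: every avoided pass removes a full disk-to-disk copy of the intermediate data.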
24. POSTSCRIPT: CARRYING
IMPROVEMENTS IN TO THE
APPLICATION
• Recall that we “dehydrated” consecutive KVs
into one KV out of map, and “rehydrated”
them in reduce
• Specifically, for document storage
• We can do this if we know the schema of the
document in advance
• What if we just store dehydrated documents
on disk?
25. POSTSCRIPT: PARTIAL SCHEMAS
• Advantages
• Bulk ingest just got even faster (no rehydrate step)
• Disk footprint smaller
• Potentially faster query response
• Potential issues
• Need to keep schemas around (but still want to
have flexible schemas)
• How do you handle (lazy) updates?
• Documents need to be rehydrated at some point…
when? And what’s the perf trade-off?
• Perhaps we should model this?
• To be continued…