
Apache Arrow: In Theory, In Practice

Apache Arrow is designed to make things faster. It's focused on speeding up communication between systems as well as processing within any one system. In this talk I'll start by discussing what Arrow is and why it was built, covering an overview of the key components, goals, vision, and current state. I'll then take the audience through a detailed engineering review of how we used Arrow to solve several problems when building the Apache-licensed Dremio product, including Arrow performance characteristics, working with Arrow APIs, managing memory, sizing Arrow vectors, and moving data between processes and/or nodes. We'll also review several code examples of specific data processing implementations and how they interact with Arrow data. Lastly we'll spend a short amount of time on what's next for Arrow. This is a highly technical talk targeted at people building data infrastructure systems and complex workflows.


  1. Apache Arrow: In Theory, In Practice
     Apache Arrow Meetup @ Enigma, November 1, 2017
     Jacques Nadeau
     © 2017 Dremio Corporation @DremioHQ
  2. Who? Jacques Nadeau (@intjesus)
     • CTO & co-founder of Dremio
     • Apache member
     • VP, Apache Arrow
     • PMCs: Arrow, Calcite, Incubator, Heron (incubating)
  3. Arrow In Theory
  4. The Apache Arrow Project
     • Started Feb 17, 2016 (Apache top-level project)
     • Focused on columnar in-memory analytics:
       1. 10-100x speedup on many workloads
       2. A common data layer enables companies to choose best-of-breed systems
       3. Designed to work with any programming language
       4. Support for both relational and complex data as-is
     • Committers & contributors from: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  5. Arrow goals
     • Well-documented and cross-language compatible
     • Designed to take advantage of modern CPU characteristics
     • Embeddable in execution engines, storage layers, etc.
     • Interoperable
  6. Arrow In-Memory Columnar Format
     • Shredded nested data structures
     • Randomly accessible
     • Maximizes CPU throughput: pipelining, SIMD, cache locality
     • Scatter/gather I/O
  7. High-Performance Sharing & Interchange
     Before:
     • Each system has its own internal memory format
     • 70-80% of CPU wasted on serialization and deserialization
     • Functionality duplication and unnecessary conversions
     With Arrow:
     • All systems utilize the same memory format
     • No overhead for cross-system communication
     • Projects can share functionality (e.g. a Parquet-to-Arrow reader)
  8. Common Processing Libraries (soon)
     • High-performance canonical processing for Arrow data structures:
       sort, hash table, dictionary encoding, predicate application & masking
     • Multiple media and processing paradigms:
       memory, NVMe, 3D XPoint; x86, GPU, many-core (Phi), etc.
  9. Arrow Data Types
     • Scalars: Boolean; [u]int[8,16,32,64], Decimal, Float, Double; Date, Time, Timestamp; UTF8 String, Binary
     • Complex: Struct, Map, List
     • Advanced: Union (sparse & dense)
  10. Common Message Pattern
      • Schema negotiation: logical description of the structure, plus identification of dictionary-encoded nodes
      • Dictionary batch: dictionary ID, values
      • Record batch: up to 64K records per batch; leaf nodes up to 2B values
      (Stream sequence: schema negotiation, then 0..N dictionary batches, then 1..N record batches.)
  11. Columnar data
      persons = [{
        name: 'Joe', age: 18,
        phones: ['555-111-1111', '555-222-2222']
      }, {
        name: 'Jack', age: 37,
        phones: ['555-333-3333']
      }]
  12. Record Batch Construction
      For { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }:
      • name: bitmap, offset, and data vectors
      • age: bitmap and data vectors
      • phones: bitmap, list offset, offset, and data vectors
      • A data header describes the offsets into the data
      Each box (vector) is contiguous memory; the entire record batch is contiguous on the wire.
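To make the layout above concrete, here is a plain-Java sketch of the buffers Arrow would build for the two `persons` records. All names are illustrative, the arrays are on-heap stand-ins for Arrow's off-heap buffers, and validity bitmaps are omitted since every value here is non-null.

```java
import java.util.Arrays;

// Hypothetical sketch of Arrow-style columnar buffers for the `persons`
// example: variable-width columns use an offsets buffer plus one
// contiguous data buffer; fixed-width columns are directly indexable.
public class PersonsLayout {
    // name column: offsets into a single contiguous data buffer
    static final int[] NAME_OFFSETS = {0, 3, 7};            // "Joe" = [0,3), "Jack" = [3,7)
    static final byte[] NAME_DATA = "JoeJack".getBytes();
    // age column: fixed-width values
    static final int[] AGE_DATA = {18, 37};
    // phones column: list offsets point into a child string vector
    static final int[] PHONES_LIST_OFFSETS = {0, 2, 3};     // record 0 owns phones [0,2), record 1 owns [2,3)
    static final String[] PHONES_DATA = {"555-111-1111", "555-222-2222", "555-333-3333"};

    // Random access to a variable-width value: slice the data buffer by offsets.
    static String name(int row) {
        return new String(NAME_DATA, NAME_OFFSETS[row], NAME_OFFSETS[row + 1] - NAME_OFFSETS[row]);
    }

    // Random access to a list value: slice the child vector by list offsets.
    static String[] phones(int row) {
        return Arrays.copyOfRange(PHONES_DATA, PHONES_LIST_OFFSETS[row], PHONES_LIST_OFFSETS[row + 1]);
    }
}
```

Because every lookup is offset arithmetic into contiguous buffers, any record is randomly accessible without parsing the records before it.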
  13. Arrow Components
      • Core libraries
      • Within-project integrations
      • Extended integrations
  14. Arrow: Core Components
      • Java Library
      • C++ Library
      • C Library
      • Ruby Library
      • Python Library
      • JavaScript Library
  15. In-Project Arrow Building Blocks/Applications
      • Plasma: shared-memory caching layer, originally created in Ray
      • Feather: fast ephemeral format for movement of data between R/Python
      • ArrowRest (soon): RPC/IPC interchange library (active development)
      • ArrowRoutines (soon): common data manipulation components
  16. Arrow Integrations
      • Pandas: moves seamlessly to/from Arrow as a means for communication, serialization, and fast processing
      • GOAI (GPU Open Analytics Initiative), libgdf, and the GPU dataframe: leverage Arrow as the internal representation
      • Parquet: read and write Parquet quickly to/from Arrow; the C++ library builds directly on Arrow
      • Spark: supports conversion to Pandas via Arrow construction using the Arrow Java library
      • Dremio: OSS project; the Sabot engine executes entirely on Arrow memory
  17. Arrow In Practice
  18. Real World Arrow: Sabot
      • Dremio is an OSS data fabric product
      • The core engine is "Sabot": built entirely on top of the Arrow libraries, runs in the JVM
  19. Sabot: Arrow in Practice
      • Memory management
      • Vector sizing
      • RPC communication
      • Filtering/sorting
      • Row-wise algorithms: hash tables
      • Vector-wise algorithms: aggregation, unnesting
  20. Practice: Memory Management
      • Arrow includes a chunk-based managed allocator, built on top of Netty's jemalloc implementation
      • Create a tree of allocators:
        - Support both reservations and local limits
        - Include leak detection, debug ownership logs, and location accounting
      • Size allocators (reservation and maximum) based on workload management, when to trigger spilling, etc.
      • All Arrow vectors hold one or more off-heap buffers
      • Everything is manually reference-managed: some code is more complex, but it provides a strong understanding of memory availability
      (Diagram: allocator tree, e.g. Root (res: 0, max: 20g) over Job 1 and Job 2 (res: 10m, max: 1g) over Tasks; an IntVector holds validity and data buffers.)
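A minimal sketch of the allocator-tree accounting described above. This is an assumption-level model, not the real Arrow `BufferAllocator` API (which additionally does off-heap allocation, reservations, and leak detection); it only shows how a local limit at any level of the tree can reject an allocation while usage is charged all the way to the root.

```java
// Hypothetical tree-of-allocators accounting sketch: each node has a local
// byte limit, and every allocation is charged to the node and all ancestors.
public class Accountant {
    final Accountant parent;
    final long max;      // local limit in bytes; -1 means unlimited
    long allocated = 0;  // bytes charged to this node (includes descendants)

    Accountant(Accountant parent, long max) {
        this.parent = parent;
        this.max = max;
    }

    Accountant newChild(long max) {
        return new Accountant(this, max);
    }

    // First verify no level's limit would be exceeded, then charge each level.
    boolean allocate(long bytes) {
        for (Accountant a = this; a != null; a = a.parent)
            if (a.max >= 0 && a.allocated + bytes > a.max) return false;
        for (Accountant a = this; a != null; a = a.parent)
            a.allocated += bytes;
        return true;
    }

    // Release walks the same chain in reverse, un-charging every level.
    void release(long bytes) {
        for (Accountant a = this; a != null; a = a.parent)
            a.allocated -= bytes;
    }
}
```

The two-pass check-then-charge keeps a failed allocation side-effect free, which is the property that lets a task fail cleanly (and trigger spilling) instead of corrupting the parent's accounting.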
  21. Practice: Memory Management, Cont'd
      • Data moves through data pipelines
      • Ownership needs to be clear (to plan/control execution):
        - Allocated memory can be referenced by many consumers
        - One allocator 'owns' the accounted memory
        - Consumers can use a vector's transfer capability to leverage transfer semantics and hand off data ownership
      (Diagram: a Scan operator (res: 10m, max: 1g) transfers ownership to an Aggregate operator (res: 10m, max: 1g).)
      https://goo.gl/HN9nCH
  22. Practice: Vector Sizing
      • Batches are the smallest unit of work
      • Batches can be 1..64K records in size
      • Optimization problem:
        - Larger batches improve processing performance
        - Larger batches cause pipeline problems
        - Smaller batches cause more heap overhead
      • Execution-level adaptive resizing for wide records (100s-1000s of fields), e.g. 4095 records for a narrow batch vs. 127 records for a wide batch
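One way the trade-off above could be resolved is to derive the record count from a target batch byte size and an estimated per-record width. This is an assumed heuristic, not Sabot's actual policy; it just reproduces the 4095-vs-127 shape from the slide by rounding to a power-of-two-minus-one count.

```java
// Hypothetical adaptive batch sizing: wide records get fewer records per
// batch, narrow records get more, capped at the 64K record limit.
public class BatchSizer {
    static final int MAX_RECORDS = 65535;   // batches hold at most 64K records

    static int recordsPerBatch(int targetBatchBytes, int estimatedRecordBytes) {
        int n = targetBatchBytes / Math.max(1, estimatedRecordBytes);
        n = Integer.highestOneBit(Math.max(2, n));  // round down to a power of two
        return Math.min(n - 1, MAX_RECORDS);        // e.g. 4096 -> 4095 records
    }
}
```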
  23. Practice: RPC Communication
      • Goals:
        - Leverage gathering writes
        - Ensure connection resilience despite memory pressure
      • Custom Netty-based RPC protocol:
        - All messages include a structured (proto) message and a sidecar memory message, e.g. send(Listener listener, Proto structuredMessage, ArrowBuf... dataBodies)
        - Out-of-memory is handled at message consumption time, ensuring a fail-ack as opposed to a connection disconnect
      • The structured message and data bodies go out as one gathering write
      https://goo.gl/XWyrc1
  24. Filtering & Sorting
      • For filtering and sorting, create a selection vector: it describes the valid values and their ordering without reorganizing the underlying data
        - Two bytes per entry for filter purposes (single-batch horizon), e.g. sv2 = [2, 14, 35, 99]
        - Four bytes per entry for sort purposes (multi-batch horizon), e.g. sv4 = [1-2, 2-14, 1-35, 2-99] (batch-index pairs)
      • The 4-byte selection vector pattern is used frequently by other operations
      • A 6-byte selection vector is used in some cases (to manage wide batches)
      • Defer copying/compacting
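The two-byte (single-batch) pattern above can be sketched in a few lines. The method names and the greater-than predicate are illustrative; the point is that the filter emits record indices into a short array and leaves the data vectors untouched, deferring any copy or compaction.

```java
import java.util.Arrays;

// Sketch of the sv2 selection vector pattern: filtering produces an array
// of two-byte record indices instead of rewriting the underlying column.
public class SelectionVector {
    static short[] filterGreaterThan(int[] column, int threshold) {
        short[] sv2 = new short[column.length];
        int selected = 0;
        for (int i = 0; i < column.length; i++)
            if (column[i] > threshold) sv2[selected++] = (short) i; // store the index, not the value
        return Arrays.copyOf(sv2, selected);
    }

    // Downstream operators read through the indirection.
    static int get(int[] column, short[] sv2, int i) {
        return column[sv2[i]];
    }
}
```

A sort works the same way with a four-byte vector, since its entries must additionally identify which batch each record lives in.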
  25. Row-wise Algorithms: Hash Table + Aggregation
      When generating a hash table, maintaining a columnar structure for the keys slows hashing, insertion, and lookup.
      • Break data into fixed and variable values
      • Use consistent fixed-value insertion and dynamic variable output
      • Pivot data: one vector at a time for fixed values; all variable vectors at the same time
      • Hash and equality operate on buckets of bytes, avoiding excessive indirection
      • Maintain aggregation tables in columnar format
      (Diagram: a fixed block vector of rows shaped validity|fixed1|fixed2|varlen|varoffset, a variable block vector of len|data pairs, and columnar aggregation tables of partial aggregates; data is pivoted in, and unpivoted or directly projected out.)
  26. Example Pivot Code
      • Takes advantage of runs of nullable values, working a word at a time: ALL_SET, NONE_SET, SOME_SET
      • Ensures canonicalization of values based on validity:
        - Typically validity data is zeroed on allocation; other vectors are not
        - Vector data has to be cleared when pivoting nulled values
      • Conditions (branches) are avoided

      static void pivot8Bytes(
          VectorPivotDef def,
          FixedBlockVector fixedBlock,
          final int count) {
        ...
        // decode a word at a time.
        while (srcDataAddr < finalWordAddr) {
          final long bitValues = PlatformDependent.getLong(srcBitsAddr);
          if (bitValues == NONE_SET) {
            // noop (all nulls).
            bitTargetAddr += (WORD_BITS * blockLength);
            valueTargetAddr += (WORD_BITS * blockLength);
            srcDataAddr += (WORD_BITS * EIGHT_BYTE);
          } else if (bitValues == ALL_SET) {
            // all set: set the bit values using a constant OR.
            // Independently set the data values without transformation.
            final int bitVal = 1 << bitOffset;
            for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength) {
              PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | bitVal);
            }
            for (int i = 0; i < WORD_BITS; i++, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
              PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr));
            }
          } else {
            // some nulls, some not: update each value to zero or the value, depending on the null bit.
            for (int i = 0; i < WORD_BITS; i++, bitTargetAddr += blockLength, valueTargetAddr += blockLength, srcDataAddr += EIGHT_BYTE) {
              final int bitVal = ((int) (bitValues >>> i)) & 1;
              PlatformDependent.putInt(bitTargetAddr, PlatformDependent.getInt(bitTargetAddr) | (bitVal << bitOffset));
              PlatformDependent.putLong(valueTargetAddr, PlatformDependent.getLong(srcDataAddr) * bitVal);
            }
          }
          srcBitsAddr += WORD_BYTES;
        }
      }

      https://goo.gl/EgLy9r
  27. Practice: Parallel Columnar Shuffle
      • Partition data based on a hashed key
      • Avoid excessive batch buffering cost
      • Steps:
        1. Consolidate node-local streams, allowing a reduction in buffering memory in large clusters (k*n instead of n*n)
        2. Hash the key(s) to determine the bucket offset, generating a bucket vector
        3. Pre-allocate output buffers at the target output size, sized depending on narrow/wide batches
        4. Do columnar copies per vector, written in a C-like low-overhead pattern with no abstraction
      (Diagram: threads on Node 1 generate bucket vectors and do bucket-level copies into mux'd node-local streams, which reach Node 2's threads via gathering writes.)
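Step 2 above can be sketched as follows. The hash function and method names are assumptions for illustration; the idea is simply that hashing runs over the key column first and materializes a bucket vector, which then drives the bucket-level columnar copies of step 4.

```java
// Hypothetical sketch of bucket vector generation for a columnar shuffle:
// hash each key once, map it to a partition, and record the bucket per row.
public class Partitioner {
    static int[] bucketVector(long[] keys, int partitions) {
        int[] buckets = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            int h = Long.hashCode(keys[i]);                     // stand-in for the shuffle hash
            buckets[i] = (h & Integer.MAX_VALUE) % partitions;  // non-negative bucket in [0, partitions)
        }
        return buckets;
    }
}
```

Separating the hash pass from the copy pass keeps each loop tight over one vector, matching the C-like, no-abstraction copy pattern the slide describes.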
  28. Example Copier Code
      • Two-byte offset addresses (sv2)
      • A tight copy loop
      • Far more efficient than runtime-generated row-wise code, and also has a faster startup time

      public void copy(long offsetAddr, int count) {
        final List<ArrowBuf> sourceBuffers = source.getFieldBuffers();
        targetAlt.allocateNew(count);
        final List<ArrowBuf> targetBuffers = target.getFieldBuffers();
        final long max = offsetAddr + count * STEP_SIZE;
        final long srcAddr = sourceBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
        long dstAddr = targetBuffers.get(VALUE_BUFFER_ORDINAL).memoryAddress();
        for (long addr = offsetAddr; addr < max; addr += STEP_SIZE, dstAddr += SIZE) {
          PlatformDependent.putLong(dstAddr,
              PlatformDependent.getLong(srcAddr + ((char) PlatformDependent.getShort(addr)) * SIZE));
        }
      }

      https://goo.gl/fZEsfy
  29. Unnesting List Vectors
      • Common pattern: a list of objects that should be unrolled into separate records
      • Arrow's representation allows a direct unroll (no inner data copies required)
      • Since leaf vectors can be larger (up to 2B values), inner vectors may need to be split apart:
        - This makes use of SplitAndTransfer necessary
        - SplitAndTransfer is kept as cheap as possible: a noop for fixed data; an offset rewrite for variable-width vectors (a noop for the variable data itself); a bit rewrite & shift for validity vectors
      (Diagram: a List Vector composed of an offset vector over a struct vector's inner vectors.)
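The offset-rewrite case above can be sketched directly. This is an illustrative stand-in for what SplitAndTransfer does to a variable-width vector's offsets (names assumed): the value data buffer is shared untouched, and only the offsets of the transferred range are rebased to start at zero.

```java
// Sketch of the offset rewrite in a split-and-transfer of a variable-width
// vector: rebase the [startIndex, startIndex + length) slice of the offsets
// buffer so the receiving vector's first value begins at offset 0.
public class OffsetSplitter {
    static int[] splitOffsets(int[] offsets, int startIndex, int length) {
        int[] out = new int[length + 1];      // n values need n + 1 offsets
        int base = offsets[startIndex];
        for (int i = 0; i <= length; i++)
            out[i] = offsets[startIndex + i] - base;
        return out;
    }
}
```

Since only `length + 1` integers are rewritten while the (potentially much larger) data buffer is shared, the split stays cheap even for big vectors.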
  30. What's Coming
      • Arrow RPC/REST:
        - A generic way to retrieve data in Arrow format
        - A generic way to serve data in Arrow format
        - Simplifies integrations across the ecosystem
      • Arrow Routines: GPU and LLVM
  31. Get Involved
      • Join the community
        - dev@arrow.apache.org
        - Slack: https://apachearrowslackin.herokuapp.com/
        - http://arrow.apache.org
        - Follow @ApacheArrow, @DremioHQ, @intjesus
