The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight

Machine learning pipelines are a hot topic at the moment. Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of running machine learning models in production.


  1. The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
  2. Apache Arrow: Primer
  3. Arrow has become the industry standard for in-memory data
     ● >12M monthly downloads and growing exponentially
     ● Arrow powers dozens of open source and commercial technologies
     ● 10+ programming languages supported: Java, C, C++, Python, R, JavaScript, C#, Ruby, Rust, Go, …
     ● Arrow’s adoption provides numerous benefits:
       ○ 300+ developers contributing
       ○ Broad architecture (CPU/GPU/FPGA), OS and language support
       ○ Awareness and OSS thought leadership
  4. What is Arrow?
     What is it?
     ● A specification that outlines the in-memory binary layout of data
     ● A set of libraries and tools
     ● A set of standards to make analytical data transportable
     ● A representation for efficient analytical processing on CPUs and GPUs
     What isn’t it?
     ● It’s not an installable system
     ● It’s not a memory grid or in-memory cache
     ● It’s not designed for streaming or other single-record operations (e.g. transactions)
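To make "in-memory binary layout" concrete, here is a minimal sketch of how a nullable 32-bit integer column can be laid out as two buffers: a validity bitmap (one bit per slot, least-significant bit first, 1 meaning "not null") and a little-endian values buffer. It uses only the Python standard library, not the Arrow libraries, and the helper names `encode_int32_column` and `read_slot` are made up for illustration. The buffer padding and alignment that the Arrow format recommends are left out.

```python
import struct


def encode_int32_column(values):
    """Encode ints/None as (validity_bitmap, values_buffer, null_count)."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # LSB-first; 1 = valid
        # Null slots still occupy 4 bytes so every slot has a fixed offset.
        data += struct.pack("<i", 0 if v is None else v)
    return bytes(bitmap), bytes(data), values.count(None)


def read_slot(bitmap, data, i):
    """Random access: check the validity bit, then read at offset 4 * i."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, 4 * i)[0]


bitmap, data, null_count = encode_int32_column([1, None, 3, 4])
print(bitmap.hex(), len(data), null_count)
print(read_slot(bitmap, data, 1), read_slot(bitmap, data, 2))
```

Because every slot has a fixed width, any value can be located with simple offset arithmetic, which is what lets different languages share the same bytes without re-serializing them.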
  5. Arrow In Memory Columnar Format
     ● Shredded nested data structures
     ● Randomly accessible
     ● Maximize CPU throughput
       ○ Pipelining
       ○ SIMD
       ○ Cache locality
     ● Scatter/gather I/O
     [Figure: traditional memory layout vs. Arrow memory layout]
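The difference between a row-oriented ("traditional") layout and a columnar one can be sketched with the standard-library `array` module, which stores each column as one contiguous typed buffer. This is an illustration of the idea only, not Arrow itself; the sample rows and the `to_columns` helper are invented for the example.

```python
from array import array

# Row-oriented: each record is a separate object with mixed field types.
rows = [
    {"id": 1, "price": 9.5, "qty": 2},
    {"id": 2, "price": 3.0, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 4},
]


def to_columns(records):
    """Pivot records into one contiguous, fixed-width buffer per column."""
    return {
        "id": array("q", (r["id"] for r in records)),        # 64-bit ints
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
        "qty": array("q", (r["qty"] for r in records)),
    }


columns = to_columns(rows)

# An aggregate touches only the bytes of the column it needs.
total_qty = sum(columns["qty"])

# Random access is an index into a fixed-width buffer.
third_price = columns["price"][2]

print(total_qty, third_price, columns["qty"].itemsize * len(columns["qty"]))
```

Scanning one column reads adjacent memory and skips unrelated fields, which is the property that makes cache-friendly and SIMD-friendly processing possible.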
  6. Example Arrow Building Blocks
     Gandiva
     ● LLVM-based JIT compilation for execution of arbitrary expressions against Arrow data structures
     Feather
     ● Fast ephemeral format for movement of data between R/Python
     Arrow Flight
     ● RPC/IPC interchange library for efficient interchange of data between processes
     Parquet
     ● Read and write Arrow quickly to/from Parquet. The C++ library builds directly on Arrow.
  7. Apache Arrow Flight
  8. Arrow Flight
     ● High-performance wire protocol
     ● Focused on bulk transfer for analytics
     ● Full delivery of the Arrow interoperability promise
     ● Cross-platform
     ● Built for parallel
     ● Designed for security
  9. Arrow Data Paradigm: Streams of Batches
     ● Primary communication:
       ○ A stream of Arrow record batches
       ○ Bulk transfer targeting efficient movement
       ○ Effectively peer-to-peer
     ● Specific methods:
       ○ Put Stream: client sends a stream to the server
       ○ Get Stream: server sends a stream to the client
       ○ Both are initiated by the client
     [Figure: Put: the client sends a header, data batches, then an end marker, and the server replies "Thanks". Get: the client sends a descriptor, and the server returns a header, data batches, then an end marker.]
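The put/get exchange can be mimicked in-process with plain Python generators: a stream is a header (the schema) followed by batches of rows. This is a toy model with no networking and no Arrow libraries. `ToyFlightServer`, `put_stream`, `get_stream` and `make_stream` are hypothetical names chosen to mirror the slide, not the Flight API.

```python
from dataclasses import dataclass, field


def make_stream(schema, rows, batch_size):
    """Yield a header (the schema) and then the rows in fixed-size batches."""
    yield schema
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


@dataclass
class ToyFlightServer:
    """In-process stand-in for a server; stores streams by descriptor."""
    store: dict = field(default_factory=dict)

    def put_stream(self, descriptor, stream):
        messages = iter(stream)
        schema = next(messages)   # first message is the header
        batches = list(messages)  # remaining messages are data batches
        self.store[descriptor] = (schema, batches)
        return "thanks"           # acknowledgement once the stream ends

    def get_stream(self, descriptor):
        schema, batches = self.store[descriptor]
        yield schema              # header first, then the data batches
        yield from batches


server = ToyFlightServer()
rows = [(n, n * n) for n in range(10)]

# Both calls are initiated by the client.
ack = server.put_stream("squares", make_stream(("n", "n_squared"), rows, 4))
received = list(server.get_stream("squares"))

print(ack, received[0], [len(batch) for batch in received[1:]])
```

Sending data as batches rather than single records is what keeps the per-message overhead small for bulk transfers.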
  10. Big Datasets Require Parallel Streams
     ● Parallel consumption and locality awareness
       ○ A flight is composed of streams
       ○ Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
       ○ Systems can take advantage of location information to improve data locality
     ● Flights have two reference systems:
       ○ Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
       ○ Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
     ● Support for stream listing
       ○ ListFlights (Criteria)
       ○ GetFlightInfo (FlightDescriptor)
     [Figure: a flight whose streams are spread across Host 1 and Host 2; each endpoint is retrieved with its ticket]
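A rough sketch of parallel consumption, assuming a flight made of four streams spread over two hosts. The client first asks for the list of endpoints, then fetches every (ticket, location) pair concurrently. Everything here is a stand-in: `HOSTS`, `Endpoint`, `get_flight_info` and `fetch` are hypothetical and only imitate the roles of GetFlightInfo and a per-ticket stream fetch.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    ticket: bytes   # opaque to the client; only the server interprets it
    location: str   # where this stream can be consumed


# Hypothetical data held by two hosts, keyed by ticket.
HOSTS = {
    "host1": {b"t0": [1, 2, 3], b"t1": [4, 5]},
    "host2": {b"t2": [6], b"t3": [7, 8, 9, 10]},
}


def get_flight_info(descriptor):
    """Return the endpoints of a flight (the toy ignores the descriptor)."""
    return [
        Endpoint(ticket, host)
        for host, streams in HOSTS.items()
        for ticket in streams
    ]


def fetch(endpoint):
    """Redeem one ticket at its location and return that stream's rows."""
    return HOSTS[endpoint.location][endpoint.ticket]


endpoints = get_flight_info("marketing.yesterday.sales")

# One worker per endpoint; map() returns results in endpoint order.
with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
    parts = list(pool.map(fetch, endpoints))

total_rows = sum(len(part) for part in parts)
print(len(endpoints), total_rows)
```

A scheduler that knows each endpoint's location could place the consuming task on or near that host, which is the data-locality benefit the slide refers to.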
  11. Flight Spark Source
  12. Spark DataSource V2
     ● Columnar support
     ● Transactions
     ● Partitions
     ● Better support for pushdowns
  13. Flight Spark Source
     ● Uses Columnar Batch to leverage Spark’s Arrow support
     ● Supports pushdown of filters and projects
     ● Partitioned by Arrow Flight ticket
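One way to picture pushdown is to fold the requested columns and filter predicates into the command that is sent to the server, so that only the needed data crosses the wire. The sketch below is an assumption about the general idea, not the connector's actual implementation. `build_command` is a hypothetical helper, and it does no escaping or validation of its inputs.

```python
def build_command(table, columns=None, filters=None):
    """Fold a projection and filters into a binary command descriptor."""
    select_list = ",".join(columns) if columns else "*"
    sql = f"select {select_list} from {table}"
    if filters:
        # Filters are combined conjunctively.
        sql += " where " + " and ".join(filters)
    return sql.encode("utf-8")  # descriptors are arbitrary bytes


# With pushdown: the server evaluates the projection and the filter.
pushed = build_command("foo", columns=["a", "b"], filters=["c > 10"])

# Without pushdown: the client pulls everything and filters locally.
unpushed = build_command("foo")

print(pushed)
print(unpushed)
```

Each ticket returned for such a command could then map to one Spark partition, so the executors read their streams in parallel.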
  14. Benchmarks
     ● 4x node EMR querying 4x node Dremio AWS Edition (m5d.8xlarge)
     ● Return n rows to Spark executors, then perform a non-trivial calculation
     ● Table shows t1 (t2), where t1 is total time and t2 is transport time only
     ● All units are seconds

     Data Size        JDBC           Serial Flight   Parallel Flight   Parallel Flight - 8 nodes
     100,000          3.84 (1)       1 (1)           2.9 (2.21)        3.78 (3.02)
     1,000,000        6.5 (2.8)      1.41 (1)        3.07 (2.76)       4.38 (2.98)
     10,000,000       25.88 (22.9)   8.05 (4.3)      6.25 (3.43)       8.19 (4)
     100,000,000      223 (220)      109 (105)       18.72 (11)        8.53 (10)
     1,000,000,000    n/a            n/a             36.6 (16)         18.9 (15)
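The relative speedups implied by the table can be worked out directly from the total times (t1). The snippet below copies those totals from the slide and divides the baseline by the candidate. The dictionary keys and the `speedup` helper are naming choices made for this example.

```python
# Total time t1 in seconds, copied from the table; None marks "n/a".
TOTAL_SECONDS = {
    100_000:       {"jdbc": 3.84,  "serial": 1.0,   "parallel": 2.9,   "parallel8": 3.78},
    1_000_000:     {"jdbc": 6.5,   "serial": 1.41,  "parallel": 3.07,  "parallel8": 4.38},
    10_000_000:    {"jdbc": 25.88, "serial": 8.05,  "parallel": 6.25,  "parallel8": 8.19},
    100_000_000:   {"jdbc": 223.0, "serial": 109.0, "parallel": 18.72, "parallel8": 8.53},
    1_000_000_000: {"jdbc": None,  "serial": None,  "parallel": 36.6,  "parallel8": 18.9},
}


def speedup(rows, baseline, candidate):
    """Return baseline time / candidate time, or None if either is n/a."""
    base = TOTAL_SECONDS[rows][baseline]
    cand = TOTAL_SECONDS[rows][candidate]
    if base is None or cand is None:
        return None
    return base / cand


for rows in TOTAL_SECONDS:
    ratio = speedup(rows, "jdbc", "parallel")
    label = "n/a" if ratio is None else f"{ratio:.1f}x"
    print(f"{rows:>13,} rows: parallel Flight vs JDBC = {label}")
```

At 100,000,000 rows this gives roughly 12x for parallel Flight over JDBC on total time. At 100,000 rows serial Flight is the fastest option in the table, so parallel streams pay off only once the dataset is large enough.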
  15. Demo Time!
  16. Thanks!
     Let me know your thoughts
       ○ rymurr@dremio.com
       ○ https://github.com/rymurr
     Join the Arrow Community
       ○ @apachearrow
       ○ subscribe-dev@apache.arrow.org
       ○ arrow.apache.org
     Try out Dremio
       ○ bit.ly/dremiodeploy
       ○ community.dremio.com
     Benchmarks
       ● Flight: https://bit.ly/32IWvCB
       ● Spark Connector: https://bit.ly/3bpR0Ni
     Code Examples
       ● Arrow Flight Example Code: https://bit.ly/2XgjmUE