How Apache Arrow and Parquet boost cross-language interop
Uwe L. Korn
PyData Paris, 14th June 2016
About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• We optimize Replenishment and Pricing for the Retail
industry with Predictive Analytics
• Contributor to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL
Agenda
The Problem
Arrow
Parquet
Outlook
Why is columnar better?
Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
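To make the columnar argument concrete, here is a minimal, hypothetical comparison (not from the slides): summing one field from a row-wise list of records versus from a contiguous columnar NumPy buffer that vectorized (SIMD-friendly) kernels can scan directly.

```python
import numpy as np

# Row-wise: a list of records; summing one field hops between Python objects.
rows = [{"id": i, "price": i * 0.5} for i in range(1_000_000)]
row_sum = sum(r["price"] for r in rows)

# Columnar: one contiguous float64 buffer; NumPy sums it with a vectorized loop.
prices = np.arange(1_000_000, dtype=np.float64) * 0.5
col_sum = prices.sum()

assert abs(row_sum - col_sum) < 1e-3
```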
Different Systems - Varying Python Support
• Various levels of Python Support
• Built in Python
• Python API
• No Python at all
• Each tool/algorithm works on
columnar data
• Separate conversion routines are needed for each pair of systems
• this causes overhead
• there’s no one-size-fits-all solution
Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )
Apache Arrow
• Specification for in-memory
columnar data layout
• No overhead for cross-system /
cross-language communication
• Designed for efficiency (exploit
SIMD, cache locality, ..)
• Supports nested data structures
Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
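As a minimal sketch of what the Python bindings expose (assuming a present-day pyarrow install; the Python API was still in progress when this talk was given), an Arrow array is a typed, contiguous column with a separate validity bitmap for nulls, and several named arrays form a table, which is the unit exchanged between systems.

```python
import pyarrow as pa

# A nullable int64 Arrow array: values live in a contiguous buffer,
# nulls are tracked in a separate validity bitmap.
arr = pa.array([1, 2, None, 4], type=pa.int64())
print(arr.type)        # int64
print(arr.null_count)  # 1

# Several named arrays form a table, the unit handed between systems.
table = pa.table({"id": arr, "price": pa.array([0.5, 1.0, 1.5, 2.0])})
print(table.schema)
```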
Apache Arrow - The Impact
• An example: Retrieve a dataset from an MPP database
and analyze it in Pandas
• Run a query in the DB
• Pass it in columnar form to the DB driver
• The ODBC layer transforms it into row-wise form
• Pandas makes it columnar again
• Ugly real-life solution: export as CSV, bypass ODBC
• In the future: use Arrow as the interface between the DB and Pandas
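As an illustration of the kind of interface this enables (not part of the original slides, and relying on the later turbodbc ODBC driver, which can hand query results over as Arrow tables; the DSN and table names below are made up), the DB-to-Pandas path can skip the row-wise detour entirely:

```python
# Sketch only: assumes turbodbc is installed with Arrow support enabled
# and that an ODBC DSN named "mpp_dwh" points at the MPP database.
import turbodbc

connection = turbodbc.connect(dsn="mpp_dwh")
cursor = connection.cursor()
cursor.execute("SELECT store_id, sales FROM daily_sales")

# The result set arrives as an Arrow table (columnar end to end) ...
arrow_table = cursor.fetchallarrow()
# ... and converts to Pandas without an intermediate row-wise representation.
df = arrow_table.to_pandas()
```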
Apache Arrow
• Top-level Apache project from the beginning
• Not only a specification: also includes C++ / Java /
Python / .. code.
• Arrow structures / classes
• RPC (upcoming) & IPC (alpha) support
• Conversion code for Parquet, Pandas, ..
• Combined effort from developers of over 13 major OSS projects
• Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
• Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
Arrow in Action: Feather
• Language-agnostic file format for
binary data frame storage
• Read performance close to raw
disk I/O
• by Wes McKinney (Python) and
Hadley Wickham (R)
• Julia Support in progress
File layout: Arrow arrays + Feather metadata (FlatBuffers)
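A minimal usage sketch, assuming the feather-format Python package and pandas (R reads the same file via feather::read_feather()):

```python
import pandas as pd
import feather

df = pd.DataFrame({"id": [1, 2, 3], "price": [0.5, 1.0, 1.5]})

# Writes the columns as Arrow arrays plus Feather metadata (FlatBuffers) ...
feather.write_dataframe(df, "sales.feather")

# ... and reads them back at close to raw disk I/O speed.
df_roundtrip = feather.read_dataframe("sales.feather")
```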
Apache Parquet
Apache Parquet
• Binary file format for nested columnar data
• Inspired by Google's Dremel paper
• space and query efficient
• multiple encodings
• predicate pushdown
• column-wise compression
• many tools use Parquet as the default input format
• very popular in the JVM/Hadoop-based world
The Basics
• One file, including metadata
• Several row groups
• all with the same number of column chunks
• n pages per column chunk
• Benefits:
• pre-partitioned for fast distributed access
• statistics in the metadata for predicate pushdown
Blog post by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Hierarchy: File → Row Group → Column Chunk → Page
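To see this hierarchy and the metadata statistics in practice, here is a hypothetical sketch using today's pyarrow.parquet API (which post-dates this talk); "sales.parquet" is a placeholder file name:

```python
import pyarrow.parquet as pq

# Assumes an existing file written with column statistics enabled.
meta = pq.ParquetFile("sales.parquet").metadata

print(meta.num_row_groups, "row groups,", meta.num_columns, "columns each")

# Per-column-chunk statistics enable predicate pushdown: a reader can skip
# whole row groups whose min/max range cannot match the filter.
first_chunk = meta.row_group(0).column(0)
if first_chunk.statistics is not None:
    print(first_chunk.path_in_schema,
          first_chunk.statistics.min,
          first_chunk.statistics.max)
```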
Using Parquet in Python
• You can use it already today with Python:
• sqlContext.read.parquet("..").toPandas()
• Needs to pass through Spark, very slow
• Native Python support on its way:
• Parquet I/O to Arrow
• Arrow provides NumPy conversion
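What this native path ends up looking like, as a sketch with present-day pyarrow (the API did not exist yet at the time of this talk; "sales.parquet" is a placeholder):

```python
import pyarrow.parquet as pq

# Parquet I/O produces an Arrow table ...
table = pq.read_table("sales.parquet")

# ... and Arrow handles the conversion to Pandas/NumPy, zero-copy where possible.
df = table.to_pandas()
```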
State of Arrow & Parquet
Arrow
in-memory spec for columnar data
• Java (beta)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Parquet
columnar on-disk storage
• Java (mature)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Upcoming
• Parquet <-Arrow-> Pandas
• IPC on its way
• alpha implementation using memory-mapped files (see the sketch below)
• JVM <-> native with shared reference counting
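A sketch of IPC over a memory-mapped file using today's pyarrow (at the time of the talk this existed only as an alpha implementation): one process writes the table in the Arrow IPC file format, another maps the file and reads the arrays by referencing the mapped buffers instead of copying them.

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "price": [0.5, 1.0, 1.5]})

# Producer: write the table in the Arrow IPC file format.
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Consumer (possibly another process): memory-map the file and read the record
# batches; the buffers reference the mapping rather than being copied.
with pa.memory_map("shared.arrow", "r") as source:
    shared_table = pa.ipc.open_file(source).read_all()
```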
Get Involved!
• dev@arrow.apache.org & dev@parquet.apache.org
• https://apachearrowslackin.herokuapp.com/
• https://arrow.apache.org/
• https://parquet.apache.org/
• @ApacheArrow & @ApacheParquet
Questions?!