Free Movement of Data with Apache Arrow – data2day 2018

The data science/analytics/engineering landscape is getting more heterogeneous each day. Apache Arrow provides a common ground for all the tools in this space to interact with each other. The original presentation is in German.

Free Movement of Data with Apache Arrow
Uwe L. Korn
data2day, 26.09.2018

About me
• Data Scientist/Engineer at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Working in Python, C++11 and SQL
• Twitter: @xhochy
• Mail: uwe@apache.org

Software for retail, with AI
[Diagram labels: Suppliers, DCs, (Online) Stores, Customers; Demand Planning, Replenishment, Truckload Optimization, Staff Planning, Delivery Schedules, Pick Optimization, First Order Planning, Promotion Planning, Dynamic Pricing, Personalized Couponing, Initial Buy]

Big Data meets Data Science meets Serialization
[Diagram labels: JVM, Python / Native]

Why: Data Pipelines!
• Data is not part of a single application
• Varied usage, from reporting through user interaction to data science
• Enormous, heterogeneous landscape of tools
• Performance is critical because of the data size

The General Problem
• Good interoperability within an ecosystem
• Often based on a common backend (e.g. NumPy)
• Poor integration with other systems
• CSV is often the only solution
• "We need to talk!"
• A copy in RAM runs at about 10 GiB/s
• (De-)serialization comes on top of that
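The gap between a plain memory copy and a serialization round trip can be made visible with a small measurement. The following is a minimal sketch using only the Python standard library; `pickle` stands in for "some serialization format" and is an assumption for illustration, not something the talk names. The absolute numbers depend entirely on the machine.

```python
import pickle
import time
from array import array

N = 1_000_000
column = array("q", range(N))            # contiguous 8-byte signed integers
num_bytes = column.itemsize * len(column)

def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# 1) Plain in-memory copy of the contiguous buffer.
copied, copy_seconds = timed(lambda: column[:])

# 2) The same values as Python objects, serialized and deserialized again.
as_list = column.tolist()
restored, pickle_seconds = timed(lambda: pickle.loads(pickle.dumps(as_list)))

def mib_per_s(seconds):
    # Guard against a zero reading from a very coarse timer.
    return num_bytes / 2**20 / max(seconds, 1e-9)

print(f"copy:             {mib_per_s(copy_seconds):10.1f} MiB/s")
print(f"pickle roundtrip: {mib_per_s(pickle_seconds):10.1f} MiB/s")
```

Both paths end with the same values in memory; the second one additionally pays for converting every value to and from a byte stream.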

Columnar Data
[Figure] Source: https://arrow.apache.org/img/simd.png (https://arrow.apache.org/)
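As a rough illustration of what "columnar" means, the same small table can be held row by row or column by column. This is a sketch with made-up data (the column names `ids`, `prices`, `labels` are hypothetical) and uses the standard-library `array` type as a stand-in for a contiguous column buffer.

```python
from array import array

# Row-oriented: every record keeps all of its fields together.
rows = [(1, 9.5, "a"), (2, 7.25, "b"), (3, 8.0, "c")]

# Column-oriented: one contiguous buffer per column.
ids = array("q", (r[0] for r in rows))      # 8-byte integers
prices = array("d", (r[1] for r in rows))   # 8-byte floats
labels = [r[2] for r in rows]

# An aggregate over one column touches every record in the row layout ...
total_row_wise = sum(r[1] for r in rows)
# ... but only a single dense buffer in the columnar layout.
total_columnar = sum(prices)

print(total_row_wise, total_columnar)
```

The dense per-column buffer is what makes vectorized (SIMD) processing of a single column straightforward.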

Apache Parquet
• Columnar file format
• Started in 2012, at Apache since 2013
• Default for tabular data in Hadoop & co.
• By now also available for C++, Python, Rust, .NET, …
• Fast thanks to:
  • Encoding
  • Compression
  • Predicate pushdown
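Predicate pushdown relies on per-chunk statistics: if the stored minimum and maximum of a chunk rule out a match, the chunk is skipped without being decoded. The sketch below imitates that idea in plain Python. It is not the Parquet reader API; the helper names `build_row_groups` and `scan_greater_than` and the dictionary layout are invented for illustration.

```python
def build_row_groups(values, group_size):
    """Split values into chunks and record min/max statistics per chunk."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups

def scan_greater_than(groups, threshold):
    """Return all values > threshold and the number of chunks actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value can match: skip the chunk
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read

values = list(range(1000))
groups = build_row_groups(values, group_size=100)
matches, groups_read = scan_greater_than(groups, threshold=949)
print(len(matches), "matches,", groups_read, "of", len(groups), "chunks read")
```

The benefit depends on how the data is ordered: on sorted or clustered columns most chunks can be skipped, on randomly distributed values almost none.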

Save in one ecosystem, load in another…
… but always persist in between.

Apache Arrow
• Columnar in-memory format
• No overhead between systems
• Designed for modern SIMD processors and GPUs
• Available in: C, C++, Ruby, Go, Rust, Java, Python, JavaScript, Julia, R, Matlab, Lua.
• Open standard

Apache Arrow: Detail
• Example: string array
• 2 variants:
  • Plain: validity bitmap / offsets / values
  • Dictionary encoding:
    • All occurring values stored as a plain array
    • Index array mapping onto the values
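The two string array variants can be sketched by hand in plain Python. This is an illustrative re-implementation of the layout, not the Arrow library: the function names are hypothetical, the offsets use the platform's C `int` (32-bit on common platforms), and nulls in the index array are represented as `None` where a real Arrow index array would carry its own validity bitmap.

```python
from array import array

def encode_plain(strings):
    """Build validity bitmap, offsets and values buffers for a list of str/None."""
    validity = bytearray((len(strings) + 7) // 8)
    offsets = array("i", [0])              # n + 1 offsets into the values buffer
    values = bytearray()
    for i, s in enumerate(strings):
        if s is not None:
            validity[i // 8] |= 1 << (i % 8)   # least-significant bit first
            values += s.encode("utf-8")
        offsets.append(len(values))        # nulls get a zero-length slot
    return bytes(validity), offsets, bytes(values)

def decode_plain(validity, offsets, values):
    """Inverse of encode_plain."""
    out = []
    for i in range(len(offsets) - 1):
        if (validity[i // 8] >> (i % 8)) & 1:
            out.append(values[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out

def encode_dictionary(strings):
    """Store each distinct value once (as a plain array) plus an index per row."""
    dictionary, position, indices = [], {}, []
    for s in strings:
        if s is None:
            indices.append(None)
            continue
        if s not in position:
            position[s] = len(dictionary)
            dictionary.append(s)
        indices.append(position[s])
    return encode_plain(dictionary), indices

data = ["foo", None, "bar", "foo"]
validity, offsets, values = encode_plain(data)
dict_buffers, indices = encode_dictionary(data)
print(validity, list(offsets), values)   # b'\r' [0, 3, 3, 6, 9] b'foobarfoo'
print(decode_plain(*dict_buffers), indices)
```

Dictionary encoding pays off when a column has few distinct values: each repeated string costs one small index instead of its full bytes.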

Apache Arrow: Example 1
Pulling data from a DB
• Databases are geared towards small results (even with large input data)
• Machine learning requires granular data
• CSV export is always available and fast
• Instead:
  • Turbodbc for fast connectivity
  • Arrow as the data format along the path DB -> C++ -> Python / Pandas
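The DB -> C++ -> Python / Pandas path might look roughly like this from the Python side. This is a sketch that assumes Turbodbc is installed with Arrow support and that an ODBC data source exists; the function name `fetch_as_dataframe` and its arguments are hypothetical, and the import is deferred so the definition itself does not require the library.

```python
def fetch_as_dataframe(dsn, query):
    """Run `query` through Turbodbc and return the result as a pandas DataFrame.

    Assumes an ODBC data source named `dsn` is configured on this machine.
    """
    from turbodbc import connect   # needs turbodbc built with Arrow support

    connection = connect(dsn=dsn)
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        # The result set is assembled as Arrow columns on the C++ side ...
        table = cursor.fetchallarrow()
        # ... and handed to pandas without a row-by-row Python conversion.
        return table.to_pandas()
    finally:
        connection.close()
```

A call would then be something like `fetch_as_dataframe("my_dsn", "SELECT * FROM my_table")`, where both the data source name and the table are placeholders.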

Apache Arrow: Example 2
PySpark
• 1 million integers from Spark to PySpark
• 8 MiB of data (very little!)
• Until now: 2.57 s
• With Arrow (@pandas_udf): 0.05 s
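To put these timings into perspective, the effective transfer rates can be worked out directly. The sketch assumes 64-bit integers (the slide does not state the width), which gives about 7.6 MiB that the slide rounds to 8 MiB.

```python
N_VALUES = 1_000_000
BYTES_PER_VALUE = 8                     # assumption: 64-bit integers
payload_mib = N_VALUES * BYTES_PER_VALUE / 2**20

seconds_before = 2.57                   # timing from the slide, without Arrow
seconds_arrow = 0.05                    # timing from the slide, with Arrow

speedup = seconds_before / seconds_arrow
rate_before = payload_mib / seconds_before   # MiB/s
rate_arrow = payload_mib / seconds_arrow     # MiB/s

print(f"payload:  {payload_mib:.2f} MiB")
print(f"speedup:  {speedup:.1f}x")
print(f"before:   {rate_before:.1f} MiB/s")
print(f"with Arrow: {rate_arrow:.1f} MiB/s")
```

Even the faster path is far below the roughly 10 GiB/s of a plain RAM copy quoted earlier in the talk, and the slower one moves only a few MiB per second.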

Blue Yonder
Best decisions, delivered daily

Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0

Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360

Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA

11. Zero-Copy Interaction

16. Ray

17. Apache Arrow: The Goal
