Free Movement of Data with Apache Arrow
Uwe L. Korn
data2day, 26.09.2018
About me
• Data Scientist/Engineer at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• I work in Python, C++11 and SQL
• Twitter: @xhochy
• Mail: uwe@apache.org
Software for retail, with AI
[Figure: diagram with the stages Suppliers, DCs, (Online) Stores and Customers, labeled with the planning products Demand Planning, Replenishment, Truckload Optimization, Staff Planning, Delivery Schedules, Pick Optimization, First Order Planning, Promotion Planning, Dynamic Pricing, Personalized Couponing and Initial Buy]
Big Data meets Data Science meets serialization
[Figure labels: JVM, Python / Native]
Why: Data Pipelines!
• Data is not part of a single application
• It is used in different ways, from reporting through user interaction to data science
• Huge, heterogeneous landscape of tools
• Performance is critical because of the data size
The General Problem
• Good interoperability within an ecosystem
• Often based on a shared backend (e.g. NumPy)
• Poor integration with other systems
• CSV is often the only solution
• "We need to talk!"
• A copy in RAM runs at about 10 GiB/s
• (De)serialization comes on top of that
Columnar Data
Source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
Apache Parquet
• Columnar file format
• Started in 2012, at Apache since 2013
• Default for tabular data in Hadoop & co.
• Now also available for C++, Python, Rust, .NET, …
• Fast thanks to:
  • Encoding
  • Compression
  • Predicate pushdown
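Predicate pushdown can be illustrated with a toy sketch. It assumes that each chunk of a column carries min/max statistics, so a reader can skip chunks that cannot match a filter. The names `RowGroup`, `write_row_groups` and `scan_greater_than` are made up for this illustration; this is not the Parquet implementation or its API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max statistics stored beside it."""
    values: List[int]
    minimum: int
    maximum: int


def write_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into chunks and record statistics for each chunk."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of chunks actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        if group.maximum <= threshold:
            # The statistics prove that nothing in this chunk can match,
            # so it is skipped without looking at its values.
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = write_row_groups(list(range(100)), group_size=10)
matches, groups_read = scan_greater_than(groups, threshold=84)
print(f"read {groups_read} of {len(groups)} chunks, found {len(matches)} values")
```

With sorted or clustered data most chunks are skipped; with randomly distributed values the statistics help far less.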
Save in one ecosystem, load in another…
… but always persist in between.
Zero-Copy Interaction
Apache Arrow
• Columnar memory layout
• No overhead between systems
• Designed for modern SIMD processors and GPUs
• Available in: C, C++, Ruby, Go, Rust, Java, Python, JavaScript, Julia, R, Matlab, Lua
• Open standard
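The "no overhead between systems" idea can be sketched with Python's built-in buffer protocol as a stand-in. This is not Arrow itself: the "producer" and "consumer" below are only illustrative roles, and the in-place write is there purely to show that both sides look at the same memory.

```python
from array import array

# "Producer" system: a contiguous column of 64-bit integers.
column = array("q", [1, 2, 3, 4])

# "Consumer" system: a view on the same memory, obtained without copying.
view = memoryview(column)

# For contrast, an explicit copy that owns its own memory.
copied = array("q", column)

# Change a value in the producer's buffer.
column[0] = 100

print(view[0])    # the view sees the change because it shares the memory
print(copied[0])  # the copy still holds the old value
```

A shared, agreed-upon layout is what makes this kind of handover possible between different runtimes; without it, each side has to convert or serialize.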
Apache Arrow: Detail
• Example: string array
• 2 variants:
  • Plain: valid bitmap / offsets / values
  • Dictionary encoding:
    • All occurring values stored as a plain array
    • Index array that maps each entry to a value
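The two variants can be sketched in plain Python. This is a simplified illustration, not the Arrow library: the helper names are invented here, offsets are kept in a Python list (Arrow stores them as fixed-width integers), and the validity bitmap uses least-significant-bit numbering.

```python
def encode_plain(strings):
    """Build validity bitmap, offsets and values buffers for a list of str/None."""
    bitmap = bytearray((len(strings) + 7) // 8)
    offsets = [0]
    values = bytearray()
    for i, s in enumerate(strings):
        if s is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            values += s.encode("utf-8")
        # A null slot repeats the previous offset, so its length is zero.
        offsets.append(len(values))
    return bytes(bitmap), offsets, bytes(values)


def decode_plain(bitmap, offsets, values):
    """Rebuild the list of str/None from the three buffers."""
    out = []
    for i in range(len(offsets) - 1):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            out.append(values[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


def encode_dictionary(strings):
    """Return (dictionary, indices): distinct values plus one index per slot."""
    dictionary, positions, indices = [], {}, []
    for s in strings:
        if s is None:
            indices.append(None)
            continue
        if s not in positions:
            positions[s] = len(dictionary)
            dictionary.append(s)
        indices.append(positions[s])
    return dictionary, indices


data = ["apple", None, "pear", "apple"]
bitmap, offsets, values = encode_plain(data)
dictionary, indices = encode_dictionary(data)

print(bitmap, offsets, values)
print(dictionary, indices)
```

In the dictionary variant the distinct values would themselves be stored with `encode_plain`, so repeated strings cost only one small index each.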
Apache Arrow: Example 1
Data extraction from a DB
• Databases are geared towards small results (even with large input data)
• Machine learning requires granular data
• CSV export is always available and fast
• Instead:
  • Turbodbc for a fast connection
  • Arrow as the data format along the way: DB -> C++ -> Python / Pandas
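A minimal sketch of the DB -> Arrow -> Pandas path, assuming a turbodbc build with Arrow support and a configured ODBC data source. The helper name `fetch_arrow_table`, the DSN and the table name are placeholders chosen for illustration.

```python
def fetch_arrow_table(connection, query):
    """Run a query and return the whole result set as an Arrow table.

    `connection` is expected to be a turbodbc connection; any object with
    the same cursor interface works for this sketch.
    """
    cursor = connection.cursor()
    cursor.execute(query)
    # fetchallarrow() is only available when turbodbc was built with
    # Arrow support.
    return cursor.fetchallarrow()


# Hypothetical usage (DSN and table name are placeholders):
#   import turbodbc
#   connection = turbodbc.connect(dsn="my_dsn")
#   table = fetch_arrow_table(connection, "SELECT * FROM sales")
#   df = table.to_pandas()
```

The result stays columnar from the ODBC buffers to the DataFrame, which avoids building one Python object per cell.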
Apache Arrow: Example 2
PySpark
• 1 million integers from Spark to PySpark
• 8 MiB of data (very little!)
• Until now: 2.57 s
• With Arrow (@pandas_udf): 0.05 s
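The numbers above can be put in perspective with a small standard-library sketch. It assumes 64-bit integers and uses `pickle` on single values as a stand-in for row-at-a-time serialization; it does not reproduce what Spark does internally, and it only reuses the timings and the 10 GiB/s figure quoted in the talk.

```python
import pickle
from array import array

N = 1_000_000  # number of integers, as on the slide

# Columnar: one contiguous buffer of 64-bit integers.
column = array("q", range(N))
columnar_bytes = column.tobytes()

# Row-wise stand-in: every value is serialized on its own.
row_bytes = [pickle.dumps(value) for value in column]

# Both representations round-trip to the same values.
restored_column = array("q")
restored_column.frombytes(columnar_bytes)
restored_rows = [pickle.loads(blob) for blob in row_bytes]

payload_mib = len(columnar_bytes) / 2**20
# Time a plain memory copy would need at the 10 GiB/s quoted in the talk.
ideal_copy_seconds = len(columnar_bytes) / (10 * 2**30)
# Ratio of the two timings quoted on the slide.
speedup = 2.57 / 0.05

print(f"payload: {payload_mib:.2f} MiB")
print(f"ideal copy time: {ideal_copy_seconds * 1000:.2f} ms")
print(f"speedup quoted on the slide: about {speedup:.0f}x")
```

The payload comes out at roughly the 8 MiB quoted, and a raw copy of it would take under a millisecond, so both measured timings are dominated by conversion work rather than by moving bytes.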
Ray
Apache Arrow: The Goal
Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0
Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360
Blue Yonder
Best decisions,
delivered daily
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA

