SlideShare uma empresa Scribd logo
1 de 37
Baixar para ler offline
Wes McKinney
Apache Arrow
Cross-Language Development
Platform for In-Memory Analytics
DataEngConf Barcelona 2018
Wes McKinney
• Created Python pandas project (~2008), lead
developer/maintainer until 2013
• PMC Apache Arrow, Apache Parquet, ASF Member
• Wrote Python for Data Analysis (1e 2012, 2e
2017)
• Formerly Co-founder / CEO of DataPad (acquired
by Cloudera in 2014)
• Other OSS work: Ibis, Feather, Apache Kudu,
statsmodels
● Raise money to support full-time
open source developers
● Grow Apache Arrow ecosystem
● Build cross-language, portable
computational libraries for data
science
● Build relationships across industry
https://ursalabs.org
People
Initial Sponsors and Partners
Prospective sponsors / partners,
please reach out: info@ursalabs.org
Open standards: why do they matter?
• Simplify system architectures
• Reduce ecosystem fragmentation
• Improve interoperability
• Reuse more libraries and algorithms
Example open standards
• Human-readable semi-structured data: XML, JSON
• Structured data query language: SQL
• Binary storage formats (with metadata)
• NetCDF
• HDF5
• Apache Parquet, ORC
• Serialization / RPC protocols
• Apache Avro
• Protocol buffers
• Not an open standard: Excel, CSV (grrrr)
Standardizing in-memory data
• Best example: strided ndarray / tensor memory (NumPy /
Fortran-compatible)
• Why?
• Zero-overhead memory sharing between libraries in-memory and processes
via shared memory
• Reuse algorithms
• Reuse IO / storage code
Tables and data frames
• Notoriously not based on open standards
• Vary widely in supported data types (e.g. nested data)
• Where are they found?
• Internals of SQL databases
• Big data systems (Apache Spark, Apache Hive)
• In-memory data frame libraries: Python (pandas), R (base, data.table), Julia
(DataFrames.jl)
• We say “data frame” but the byte-level RAM layout varies greatly
from system-to-system
Many data frame projects
● These are the workhorse libraries of data engineering,
data preparation, and feature engineering for AI/ML
● Little code reuse across projects
● Uneven performance and features
What’s driving the fragmentation?
● Tight coupling between front end and back end
● Feudal nature of open source software communities
● Incompatible memory representations
○ Cannot share data without serialization
○ Cannot share algorithms because implementation
depends on memory representation
Apache Arrow
● OSS Community initiative conceived in 2015
● Intersection of big data, database systems, and data
science tools
● Key idea: Language agnostic open standards to
accelerate in-memory computing
● https://github.com/apache/arrow
Defragmenting Data Access
“Portable” Data Frames
pandas
R
JVM
Non-Portable Data Frames
Arrow
Portable Data Frames
…
Share data and algorithms at ~zero cost
Analytic database architecture
Front end API
Computation Engine
In-memory storage
IO and
Deserialization
● Vertically integrated /
“Black Box”
● Internal components do
not have a public API
● Users interact with front
end
Analytic database, deconstructed
Front end API
Computation Engine
In-memory storage
IO and
Deserialization
● Components have public
APIs
● Use what you need
● Different front ends can
be developed
Analytic database, deconstructed
Front end API
Computation Engine
In-memory storage
IO and
Deserialization
Arrow is front end agnostic
Arrow Use Cases
● Data access
○ Read and write widely used storage formats
○ Interact with database protocols, other data sources
● Data movement
○ Zero-copy interprocess communication
○ Efficient RPC / client-server communications
● Computation libraries
○ Efficient in-memory / out-of-core data frame-type analytics
○ LLVM-compilation for vectorized expression evaluation
Arrow’s Columnar Memory Format
• Runtime memory format for analytical query processing
• Companion to serialization tech like Apache {Parquet, ORC}
• “Fully shredded” columnar, supports flat and nested schemas
• Organized for cache-efficient access on CPUs/GPUs
• Optimized for data locality, SIMD, parallel processing
• Accommodates both random access and scan workloads
Arrow-accelerated Python + Apache Spark
● Joint work with Li Jin from Two
Sigma, Bryan Cutler from IBM
● Vectorized user-defined functions,
fast data export to pandas
import pandas as pd
from scipy import stats
@pandas_udf('double')
def cdf(v):
return pd.Series(stats.norm.cdf(v))
df.withColumn('cumulative_probability',
cdf(df.v))
Arrow-accelerated Python + Apache Spark
Spark SQL
Arrow Columnar
Stream Input
PySpark Worker
Zero copy via socket pandas
Arrow Columnar
Stream Output
to arrow
from arrow
from arrow
to arrow
New work: Arrow Flight RPC Framework
• A gRPC-based framework for defining custom data services that
send and receive Arrow columnar data natively
• Uses Protocol Buffers v3 for client protocol
• Pluggable command execution layer, authentication
• Low-level gRPC optimizations
• Write Arrow memory directly onto outgoing gRPC buffer
• Avoid any copying or deserialization
Arrow Flight - Parallel Get
Client Planner
GetFlightInfo
FlightInfo
DoGet Data Nodes
FlightData
DoGet
FlightData
...
Arrow Flight - Efficient gRPC transport
Client
DoGet
Data Node
FlightData
Row
Batch
Row
Batch
Row
Batch
Row
Batch
Row
Batch
...
Data transported in a Protocol
Buffer, but reads can be made
zero-copy by writing a custom
gRPC “deserializer”
Flight: Static datasets and custom commands
• Support “named” datasets, and “command” datasets
• Commands are binary, and will be server-dependent
• Implement custom commands using general structured data
serialization tools
message SQLQuery {
binary database_uri = 1;
binary query = 2;
}
Commands.proto GetFlightInfo RPC
type: CMD
cmd: <serialized command>
Flight: Custom actions
• Any request that is not a table pull or push, can be implemented as
an “action”
• Actions return a stream of opaque binary data
Arrow: History and Status
• Community initiative started in 2016, initially backed by leading
developers of ~13 major OSS data processing projects
• Project development status
• Codebase 2.5 years old
• > 190 distinct contributors
• 10 major releases
• Some level of support in 10 programming languages (C, C++, Go, Java,
JavaScript, MATLAB, Python, R, Ruby, Rust)
• Over 500K monthly installs in Python alone
Example: Gandiva, Arrow-LLVM compiler
https://github.com/dremio/gandiva
SELECT year(timestamp), month(timestamp), …
FROM table
...
Input Table
Fragment
Arrow Java
JNI (Zero-copy)
Evaluate
Gandiva
LLVM
Function
Arrow C++
Result Table
Fragment
Example use: Ray ML framework from Berkeley RISELab
March 20, 2017All Rights Reserved 30
Source: https://arxiv.org/abs/1703.03924
• Uses Plasma, shared
memory-based object store
originally developed for Ray
• Zero-copy reads of tensor
collections
Arrow on the GPU
• NVIDIA-led GPU Open Analytics Initiative
(http://gpuopenanalytics.com)
• “GPU DataFrame”: Arrow on the GPU
• Example: Execute Numba-compiled code on SQL results from MapD
shared via CUDA IPC
• Plasma also supports GPU shared memory
Some Industry Contributors to Apache Arrow
ClearCode
Upcoming Roadmap
• Software development lifecycle improvements
• Data ingest / access / export
• Computational libraries (CPU + GPU)
• Expanded language support
• Richer RPC / messaging
• More system integrations
Computational libraries
• “Kernel functions” performing vectorized analytics on Arrow
memory format
• Select CPU or GPU variant based on data location
• Operator graphs (compose multiple operators)
• Subgraph compiler (using LLVM -- see Gandiva)
• Runtime engine: execute operator graphs
Data Access / Ingest
• Apache Avro
• Apache Parquet nested data support
• Apache ORC
• CSV
• JSON
• ODBC / JDBC
• … and likely other data access points
Arrow-powered Data Science Systems
• Portable runtime libraries, usable from multiple programming
languages
• Decoupled front ends
• Companion to distributed systems like Dask, Ray
Getting involved
• Join dev@arrow.apache.org
• PRs to https://github.com/apache/arrow
• Learn more about the Ursa Labs vision for Arrow-powered data
science: https://ursalabs.org/tech/

Mais conteĂşdo relacionado

Mais procurados

Mais procurados (20)

Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Future of pandas
Future of pandasFuture of pandas
Future of pandas
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
 

Semelhante a Apache Arrow at DataEngConf Barcelona 2018

Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 

Semelhante a Apache Arrow at DataEngConf Barcelona 2018 (20)

.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Ml2
Ml2Ml2
Ml2
 
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analytics
 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
 
Portable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache BeamPortable Streaming Pipelines with Apache Beam
Portable Streaming Pipelines with Apache Beam
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
An Open Source Workbench for Prototyping Multimodal Interactions Based on Off...
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 

Mais de Wes McKinney

Mais de Wes McKinney (13)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

Apache Arrow at DataEngConf Barcelona 2018

  • 1. Wes McKinney Apache Arrow Cross-Language Development Platform for In-Memory Analytics DataEngConf Barcelona 2018
  • 2. Wes McKinney • Created Python pandas project (~2008), lead developer/maintainer until 2013 • PMC Apache Arrow, Apache Parquet, ASF Member • Wrote Python for Data Analysis (1e 2012, 2e 2017) • Formerly Co-founder / CEO of DataPad (acquired by Cloudera in 2014) • Other OSS work: Ibis, Feather, Apache Kudu, statsmodels
  • 3. ● Raise money to support full-time open source developers ● Grow Apache Arrow ecosystem ● Build cross-language, portable computational libraries for data science ● Build relationships across industry https://ursalabs.org
  • 5. Initial Sponsors and Partners Prospective sponsors / partners, please reach out: info@ursalabs.org
  • 6. Open standards: why do they matter? • Simplify system architectures • Reduce ecosystem fragmentation • Improve interoperability • Reuse more libraries and algorithms
  • 7. Example open standards • Human-readable semi-structured data: XML, JSON • Structured data query language: SQL • Binary storage formats (with metadata) • NetCDF • HDF5 • Apache Parquet, ORC • Serialization / RPC protocols • Apache Avro • Protocol buffers • Not an open standard: Excel, CSV (grrrr)
  • 8. Standardizing in-memory data • Best example: strided ndarray / tensor memory (NumPy / Fortran-compatible) • Why? • Zero-overhead memory sharing between libraries in-memory and processes via shared memory • Reuse algorithms • Reuse IO / storage code
  • 9. Tables and data frames • Notoriously not based on open standards • Vary widely in supported data types (e.g. nested data) • Where are they found? • Internals of SQL databases • Big data systems (Apache Spark, Apache Hive) • In-memory data frame libraries: Python (pandas), R (base, data.table), Julia (DataFrames.jl) • We say “data frame” but the byte-level RAM layout varies greatly from system-to-system
  • 10. Many data frame projects ● These are the workhorse libraries of data engineering, data preparation, and feature engineering for AI/ML ● Little code reuse across projects ● Uneven performance and features
  • 11. What’s driving the fragmentation? ● Tight coupling between front end and back end ● Feudal nature of open source software communities ● Incompatible memory representations ○ Cannot share data without serialization ○ Cannot share algorithms because implementation depends on memory representation
  • 12. Apache Arrow ● OSS Community initiative conceived in 2015 ● Intersection of big data, database systems, and data science tools ● Key idea: Language agnostic open standards to accelerate in-memory computing ● https://github.com/apache/arrow
  • 14. “Portable” Data Frames pandas R JVM Non-Portable Data Frames Arrow Portable Data Frames … Share data and algorithms at ~zero cost
  • 15. Analytic database architecture Front end API Computation Engine In-memory storage IO and Deserialization ● Vertically integrated / “Black Box” ● Internal components do not have a public API ● Users interact with front end
  • 16. Analytic database, deconstructed Front end API Computation Engine In-memory storage IO and Deserialization ● Components have public APIs ● Use what you need ● Different front ends can be developed
  • 17. Analytic database, deconstructed Front end API Computation Engine In-memory storage IO and Deserialization Arrow is front end agnostic
  • 18.
  • 19. Arrow Use Cases ● Data access ○ Read and write widely used storage formats ○ Interact with database protocols, other data sources ● Data movement ○ Zero-copy interprocess communication ○ Efficient RPC / client-server communications ● Computation libraries ○ Efficient in-memory / out-of-core data frame-type analytics ○ LLVM-compilation for vectorized expression evaluation
  • 20. Arrow’s Columnar Memory Format • Runtime memory format for analytical query processing • Companion to serialization tech like Apache {Parquet, ORC} • “Fully shredded” columnar, supports flat and nested schemas • Organized for cache-efficient access on CPUs/GPUs • Optimized for data locality, SIMD, parallel processing • Accommodates both random access and scan workloads
  • 21. Arrow-accelerated Python + Apache Spark ● Joint work with Li Jin from Two Sigma, Bryan Cutler from IBM ● Vectorized user-defined functions, fast data export to pandas import pandas as pd from scipy import stats @pandas_udf('double') def cdf(v): return pd.Series(stats.norm.cdf(v)) df.withColumn('cumulative_probability', cdf(df.v))
  • 22. Arrow-accelerated Python + Apache Spark Spark SQL Arrow Columnar Stream Input PySpark Worker Zero copy via socket pandas Arrow Columnar Stream Output to arrow from arrow from arrow to arrow
  • 23. New work: Arrow Flight RPC Framework • A gRPC-based framework for defining custom data services that send and receive Arrow columnar data natively • Uses Protocol Buffers v3 for client protocol • Pluggable command execution layer, authentication • Low-level gRPC optimizations • Write Arrow memory directly onto outgoing gRPC buffer • Avoid any copying or deserialization
  • 24. Arrow Flight - Parallel Get Client Planner GetFlightInfo FlightInfo DoGet Data Nodes FlightData DoGet FlightData ...
  • 25. Arrow Flight - Efficient gRPC transport Client DoGet Data Node FlightData Row Batch Row Batch Row Batch Row Batch Row Batch ... Data transported in a Protocol Buffer, but reads can be made zero-copy by writing a custom gRPC “deserializer”
  • 26. Flight: Static datasets and custom commands • Support “named” datasets, and “command” datasets • Commands are binary, and will be server-dependent • Implement custom commands using general structured data serialization tools message SQLQuery { binary database_uri = 1; binary query = 2; } Commands.proto GetFlightInfo RPC type: CMD cmd: <serialized command>
  • 27. Flight: Custom actions • Any request that is not a table pull or push, can be implemented as an “action” • Actions return a stream of opaque binary data
  • 28. Arrow: History and Status • Community initiative started in 2016, initially backed by leading developers of ~13 major OSS data processing projects • Project development status • Codebase 2.5 years old • > 190 distinct contributors • 10 major releases • Some level of support in 10 programming languages (C, C++, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust) • Over 500K monthly installs in Python alone
  • 29. Example: Gandiva, Arrow-LLVM compiler https://github.com/dremio/gandiva SELECT year(timestamp), month(timestamp), … FROM table ... Input Table Fragment Arrow Java JNI (Zero-copy) Evaluate Gandiva LLVM Function Arrow C++ Result Table Fragment
  • 30. Example use: Ray ML framework from Berkeley RISELab March 20, 2017All Rights Reserved 30 Source: https://arxiv.org/abs/1703.03924 • Uses Plasma, shared memory-based object store originally developed for Ray • Zero-copy reads of tensor collections
  • 31. Arrow on the GPU • NVIDIA-led GPU Open Analytics Initiative (http://gpuopenanalytics.com) • “GPU DataFrame”: Arrow on the GPU • Example: Execute Numba-compiled code on SQL results from MapD shared via CUDA IPC • Plasma also supports GPU shared memory
  • 32. Some Industry Contributors to Apache Arrow ClearCode
  • 33. Upcoming Roadmap • Software development lifecycle improvements • Data ingest / access / export • Computational libraries (CPU + GPU) • Expanded language support • Richer RPC / messaging • More system integrations
  • 34. Computational libraries • “Kernel functions” performing vectorized analytics on Arrow memory format • Select CPU or GPU variant based on data location • Operator graphs (compose multiple operators) • Subgraph compiler (using LLVM -- see Gandiva) • Runtime engine: execute operator graphs
  • 35. Data Access / Ingest • Apache Avro • Apache Parquet nested data support • Apache ORC • CSV • JSON • ODBC / JDBC • … and likely other data access points
  • 36. Arrow-powered Data Science Systems • Portable runtime libraries, usable from multiple programming languages • Decoupled front ends • Companion to distributed systems like Dask, Ray
  • 37. Getting involved • Join dev@arrow.apache.org • PRs to https://github.com/apache/arrow • Learn more about the Ursa Labs vision for Arrow-powered data science: https://ursalabs.org/tech/