Making NumPy-style and Pandas-style code faster and able to run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
Scaling PyData Up and Out
1. Scaling PyData Up and Out
Travis E. Oliphant, PhD @teoliphant
Python Enthusiast since 1997
Making NumPy- and Pandas-style code faster
and run in parallel.
PyData London 2016
2. @teoliphant
Scale Up vs Scale Out
Big Memory &
Many Cores
/ GPU Box
Best of Both
(e.g. GPU Cluster)
Many commodity
nodes in a cluster
Scale Up
(Bigger Nodes)
Scale Out
(More Nodes)
4. @teoliphant
I spent the first 15 years of my "Python life"
working to build the foundation of the Python
Scientific Ecosystem as a practitioner/developer
PEP 357
PEP 3118
SciPy
5. @teoliphant
I plan to spend the next 15 years ensuring this
ecosystem can scale both up and out as an
entrepreneur, evangelist, and director of
innovation. CONDA
NUMBA
DASK
DYND
BLAZE
DATA-FABRIC
6. @teoliphant
In 2012 we "took off" in
this effort to form the
foundation of scalable
Python: Numba (scale-up)
and Blaze (scale-out) were
the major efforts.
In early 2013, I formally
"retired" from NumPy
maintenance to pursue
"next-generation" NumPy.
Not so much a single new library, but a new emphasis on scale
8. @teoliphant
Organize a company
Empower people to solve the world's
greatest challenges.
We help people discover, analyze, and collaborate by
connecting their curiosity and experience with any data.
Purpose
Mission
9. @teoliphant
Start working on key technology
Blaze: blaze, odo, datashape, dynd, dask
Bokeh: interactive visualization in the web, including Jupyter
Numba: array-oriented Python-subset compiler
Conda: cross-platform package manager
with environments
Bootstrap and seed funding through 2014
VC funded in 2015
10. @teoliphant
is…
the Open Data Science Platform
powered by Python…
the fastest growing open data science language
• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams
Create a Platform!
Free and Open Source!
11. @teoliphant
Governance
Provenance
Security
OPERATIONS
Python
R
Spark | Scala
JavaScript
Java
Fortran
C/C++
DATA SCIENCE
LANGUAGES
DATA
Flat Files (CSV, XLS…) SQL DB NoSQL Hadoop Streaming
HARDWARE
Workstation Server Cluster
APPLICATIONS
Interactive Presentations,
Notebooks, Reports & Apps
Solution
Templates
Visual Data Exploration
Advanced Spreadsheets
APIs
ANALYTICS
Data Exploration
Querying Data Prep
Data Connectors
Visual Programming
Notebooks
Analytics Development
Stats
Data
Mining
Deep Learning
Machine
Learning
Simulation &
Optimization
Geospatial
Text & NLP
Graph & Network
Image Analysis
Advanced Analytics
IDEs CI/CD
Package, Dependency,
Environment Management
Web & Desktop App
Dev
SOFTWARE DEVELOPMENT
HIGH PERFORMANCE
Distributed Computing
Parallelism &
Multi-threading
Compiled Assets
GPUs & Multi-core
Compiler
Business
Analyst
Data
Scientist
Developer
Data
Engineer
DevOps
DATA SCIENCE
TEAM
Cloud On-premises
12. @teoliphant
Free Anaconda now with MKL as default
• Intel MKL (Math Kernel Library) provides enhanced
algorithms for basic math functions.
• Using MKL provides optimal performance for basic BLAS,
LAPACK, FFT, and math functions.
• Anaconda since version 2.5 has MKL provided as the default
in the free download of Anaconda (you can also distribute
binaries linked against these MKL-enhanced tools).
13. @teoliphant
Community maintained conda packages
Automatic
building
and testing for
your recipes
Awesome!
conda install -c conda-forge
conda config --add channels conda-forge
14. @teoliphant
Promises Made
• In 2012 we talked about the need for parallel Pandas and
parallel NumPy (both scale-up and scale-out).
• At the first PyData NYC in 2012 we described Blaze as an
effort to deliver scale-out, and Numba as well as DyND as an
effort to deliver scale-up.
• We introduced them early to rally a developer community.
• We can now rally a user community around working code!
15. @teoliphant
Milestone success: 2016
Numba is delivering on scaling up:
• NumPy's ufuncs and gufuncs on multiple
CPU and GPU threads
• General GPU code
• General CPU code
Dask is delivering on scaling out:
• dask.array (NumPy scaled out)
• dask.dataframe (Pandas scaled out)
• dask.bag (lists scaled out)
• dask.delayed (general scale-out
construction)
16. @teoliphant
Numba + Dask
Look at all of the data with Bokeh's datashader.
Decouple the data processing from the visualization.
Visualize arbitrarily large data.
• E.g. Open Street Map data:
• About 3 billion GPS coordinates
• https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/
• This image was rendered in one
minute on a standard MacBook
with 16 GB RAM
• Renders in milliseconds on
several 128 GB Amazon EC2
instances
17. @teoliphant
Categorical data: 2010 US Census
• One point per person
• 300 million total
• Categorized by race
• Interactive rendering with Numba + Dask
• No pre-tiling
18. @teoliphant
(Diagram: the JVM/YARN stack beside the native Python stack.)
Bottom line: 10-100X faster performance
• Interact with data in HDFS and
Amazon S3 natively from Python
• Distributed computations without the
JVM & Python/Java serialization
• Framework for easy, flexible
parallelism using directed acyclic
graphs (DAGs)
• Interactive, distributed computing
with in-memory persistence/caching
Bottom line: Leverage Python &
R with Spark
(Diagram labels: Batch Processing; Interactive Processing; HDFS; Ibis;
Impala; PySpark & SparkR; Python & R ecosystem; MPI; High Performance,
Interactive, Batch Processing; native read & write; NumPy, Pandas, …;
720+ packages)
19. What happened to Blaze?
Still going strong, just at another place in the stack
(and sometimes a description of an ecosystem).
23. @teoliphant
Blaze
Interface to query data on different storage systems http://blaze.pydata.org/en/latest/
from blaze import Data
iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…
Current focus is SQL and the PyData stack for run-time (dask, dynd, numpy, pandas,
xray, etc.) plus customer-driven needs (i.e. kdb, mongo, PostgreSQL).
25. @teoliphant
Blaze (like DyND) uses datashape as its type system
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
    petal_length: float64,
    petal_width: float64,
    sepal_length: float64,
    sepal_width: float64,
    species: string
    }""")
26. @teoliphant
Datashape
A structured data description language for all kinds of data
http://datashape.pydata.org/
(Diagram: a datashape is built from unit types; dimensions such as var, 3,
and 4 are joined by * to a dtype such as string, int32, or float64.)
var * { x : int32, y : string, z : float64 }
is a tabular datashape: the record
{ x : int32, y : string, z : float64 }
is an ordered struct dtype, a collection of types keyed by labels.
28. @teoliphant
server.yaml:
iriscsv:
  source: iris.csv
irisdb:
  source: sqlite:///flowers.db::iris
irisjson:
  source: iris.json
  dshape: "var * {name: string, amount: float64}"
irismongo:
  source: mongodb://localhost/mydb::iris
Blaze Server: Lights up your Dark Data
Builds off of the Blaze uniform interface to
host data remotely through a JSON web
API.
$ blaze-server server.yaml -e
localhost:6363/compute.json
30. @teoliphant
Compute recipes work with existing libraries and have multiple
backends: write once and run anywhere. Layer over existing data!
• Python lists
• NumPy arrays
• DyND
• Pandas DataFrame
• Spark, Impala, Ibis
• Mongo
• Dask
• Hint: Use odo to copy from
one format and/or engine to
another!
31. @teoliphant
Data Fabric (pre-alpha, demo-ware)
• New initiative just starting (if you can fund it, please tell me).
• Generalization of the Python buffer protocol to all languages
• A library to build a catalog of shared-memory chunks across the
cluster holding data that can be described by datashape
• A helper type library in DyND that any language can use to
select appropriate analytics
• Foundation for a cross-language "super-set" of the PyData stack
https://github.com/blaze/datafabric
32. Brief Overview of Numba
Go to Graham Markall's talk at 2:15 in LG6 to hear more.
33. @teoliphant
Classic Example
from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
Mandelbrot
36. @teoliphant
Numba Features
• Numba supports:
  – Windows, OS X, and Linux
  – 32- and 64-bit x86 CPUs and NVIDIA GPUs
  – Python 2 and 3
  – NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user's system.
• < 70 MB to install.
• Does not replace the standard Python interpreter
(all of your existing Python libraries are still available)
37. @teoliphant
How to Use Numba
1. Create a realistic benchmark test case.
(Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark.
(cProfile is a good choice.)
3. Identify hotspots that could potentially be compiled by Numba with a little
refactoring.
(See the online documentation.)
4. Apply @numba.jit and @numba.vectorize as needed to critical functions.
(Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check if there was a performance improvement.
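Steps 1 and 2 can be sketched with the standard library's cProfile; the benchmark function here is a hypothetical stand-in for a real workload:

```python
import cProfile
import io
import pstats

def benchmark(n=100_000):
    # Hypothetical hotspot: a pure-Python loop Numba could later compile
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total

profiler = cProfile.Profile()
profiler.enable()
benchmark()
profiler.disable()

# Print the five most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```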
39. @teoliphant
Example: Filter an array
(The slide shows a code figure, annotated:)
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• Numba decorator (nopython=True not required)
• 2.7x speedup over NumPy!
40. @teoliphant
NumPy UFuncs and GUFuncs
NumPy ufuncs (and gufuncs) are functions that operate "element-wise"
(or "sub-dimension-wise") across an array without an explicit loop.
This implicit loop (which is in machine code) is at the core of why NumPy
is fast. Dispatch is done internally to a particular code segment based
on the type of the array. It is a very powerful abstraction in the PyData
stack.
Making new fast ufuncs used to be possible only in C, which was painful!
With numba.vectorize and numba.guvectorize it is now easy!
The inner secrets of NumPy are now at your fingertips for you to make
your own magic!
41. @teoliphant
Example: Making a windowed compute filter
Perform a computation on a finite window of the input.
For a linear system, this is a FIR filter, which is what np.convolve
or sp.signal.lfilter can do.
But what if you want some arbitrary computation?
42. @teoliphant
Example: Making a windowed compute filter
(The slide shows a code figure:)
• Hand-coded implementation
• Build a ufunc for the kernel, which is faster for large arrays!
• This can now run easily on a GPU with target='gpu' and on many
cores with target='parallel'
• Array-oriented!
44. @teoliphant
Other interesting things
• CUDA simulator to debug your code in the Python interpreter
• Generalized ufuncs (@guvectorize), including GPU support and
multi-core (threaded) support
• Call ctypes and cffi functions directly and pass them as arguments
• Support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• "numba annotate" to dump an HTML-annotated version of compiled
code
• See: http://numba.pydata.org/numba-doc/0.23.0/
45. @teoliphant
Recently Added Numba Features
• Support for named tuples in nopython mode
• Support for sets in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• JIT classes (zero-cost abstraction)
• Support for np.dot (and the "@" operator on Python 3.5)
• Support for some of np.linalg
• generated_jit (jit the functions that are the return values of the decorated
function: meta-programming)
• SmartArrays, which can exist on host and GPU (transparent data access)
• Ahead-of-time compilation
• Disk caching of pre-compiled code
46. Overview of Dask as a Parallel
Processing Framework with Distributed
47. @teoliphant
Precursors to Parallelism
• Consider the following approaches first:
1. Use better algorithms
2. Try Numba or C/Cython
3. Store data in efficient formats
4. Subsample your data
• If you have to parallelize:
1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk)
2. Then a large workstation (24 cores, 1 TB RAM)
3. Finally, scale out to a cluster
48. @teoliphant
Overview of Dask
Dask is a Python parallel computing library that is:
• Familiar: implements parallel NumPy and Pandas objects
• Fast: optimized for demanding numerical applications
• Flexible: for sophisticated and messy algorithms
• Scales out: runs resiliently on clusters of 100s of machines
• Scales down: pragmatic in a single process on a laptop
• Interactive: responsive and fast for interactive data science
Dask complements the rest of Anaconda. It was developed with
NumPy, Pandas, and scikit-learn developers.
54. @teoliphant
• simple: easy-to-use API
• flexible: perform a lot of actions with a minimal amount of code
• fast: dispatching to run-time engines & Cython
• database-like: familiar ops
• interop: integration with the PyData stack
(((A + 1) * 2) ** 3)
58. @teoliphant
Using Anaconda and Dask on your Cluster
• Single machine with multiple threads or processes
• On a cluster with SSH (dcluster)
• Resource management: YARN (knit), SGE, Slurm
• On the cloud with Amazon EC2 (dec2)
• On a cluster with Anaconda for cluster management
• Manage multiple conda environments and packages
on bare-metal or cloud-based clusters
60. @teoliphant
Examples
1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames
• Demonstrate Pandas at scale
• Observe responsive user interface
2. Distributed language processing with text data using Dask Bags
• Explore data using a distributed memory cluster
• Interactively query data using libraries from Anaconda
3. Analyzing global temperature data using Dask Arrays
• Visualize complex algorithms
• Learn about dask collections and tasks
4. Handle custom code and workflows using Dask Imperative
• Deal with messy situations
• Learn about scheduling
61. @teoliphant
Example 1: Using Dask DataFrames on a cluster with CSV data
• Built from Pandas DataFrames
• Match Pandas interface
• Access data from HDFS, S3, local, etc.
• Fast, low latency
• Responsive user interface
(Diagram: each month of 2016 CSV data, January through May, loads into
a Pandas DataFrame; together they form one Dask DataFrame.)
62. @teoliphant
Example 2: Using Dask Bags on a cluster with text data
• Distributed natural language processing
with text data stored in HDFS
• Handles standard computations
• Looks like other parallel frameworks
(Spark, Hive, etc.)
• Access data from HDFS, S3, local, etc.
• Handles the common case
(Diagram: a function is mapped over partitions of data, and the partial
results are merged into a final result.)
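A minimal Dask Bag sketch; the in-memory sequence is a stand-in for reading text from HDFS, and the sentences are made up:

```python
import dask.bag as db

# Stand-in for something like: db.read_text('hdfs:///data/*.txt')
lines = db.from_sequence(['the quick brown fox', 'the lazy dog'])

# Tokenize, flatten, and count word frequencies across partitions
counts = lines.map(str.split).flatten().frequencies()
print(dict(counts.compute()))
```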
63. @teoliphant
Example 3: Using Dask Arrays with global temperature data
(Diagram: many NumPy n-dimensional arrays tiled together form one
Dask Array.)
• Built from NumPy n-dimensional arrays
• Matches NumPy interface (subset)
• Solve medium-large problems
• Complex algorithms
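A minimal Dask Array sketch (the shape and chunk sizes are illustrative):

```python
import dask.array as da

# One logical 4x4 array backed by four 2x2 NumPy chunks
x = da.ones((4, 4), chunks=(2, 2))

# NumPy-style expression, evaluated chunk-by-chunk in parallel
result = (x + x.T).sum(axis=0)
print(result.compute())
```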
64. @teoliphant
Example 4: Using Dask Delayed to handle custom workflows
• Manually handle functions to support messy situations
• Life saver when collections aren't flexible enough
• Combine futures with collections for best of both worlds
• Scheduler provides resilient and elastic execution
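A minimal dask.delayed sketch; the three-stage workflow (load, process, combine) is hypothetical:

```python
from dask import delayed

@delayed
def load(i):
    return list(range(i))   # pretend this reads a messy file

@delayed
def process(data):
    return sum(data)        # pretend this is custom analysis

@delayed
def combine(results):
    return max(results)

# Nothing runs yet; this just builds a task graph (a DAG)
total = combine([process(load(i)) for i in range(1, 4)])
print(total.compute())      # the scheduler executes the graph
```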
65. @teoliphant
Scale Up vs Scale Out
Big Memory &
Many Cores
/ GPU Box
Best of Both
(e.g. GPU Cluster)
Many commodity
nodes in a cluster
Scale Up
(Bigger Nodes)
Scale Out
(More Nodes)
Numba
Dask
Blaze
66. @teoliphant
Contribute to the next stage of the journey
Numba, Blaze and the
Blaze ecosystem (dask, odo, dynd, datashape, data-fabric, and beyond…)