Scaling PyData Up and Out
Travis E. Oliphant, PhD @teoliphant
Python Enthusiast since 1997
Making NumPy- and Pandas-style code faster
and run in parallel.
PyData London 2016
@teoliphant
Scale Up vs Scale Out
2
Big Memory &
Many Cores
/ GPU Box
Best of Both
(e.g. GPU Cluster)
Many commodity
nodes in a cluster
Scale Up
(Bigger Nodes)
Scale Out
(More Nodes)
@teoliphant
3
@jakevdp
"Python's Scientific Ecosystem"
@teoliphant
4
I spent the first 15 years of my “Python life”
working to build the foundation of the Python
Scientific Ecosystem as a practitioner/developer
PEP 357
PEP 3118
SciPy
@teoliphant
5
I plan to spend the next 15 years ensuring this
ecosystem can scale both up and out as an
entrepreneur, evangelist, and director of
innovation. CONDA
NUMBA
DASK
DYND
BLAZE
DATA-FABRIC
@teoliphant
6
In 2012 we “took off” in
this effort to form the
foundation of scalable
Python — Numba (scale-up)
and Blaze (scale-out) were
the major efforts.
In early 2013, I formally
“retired” from NumPy
maintenance to pursue
“next-generation” NumPy.
Not so much a single new library, but a new emphasis on scale
@teoliphant
7
Rally a community
“Apache” for Open
Data Science
Community-led and directed Conference Series
with sponsorship proceeds going directly to
NumFocus
@teoliphant
8
Organize a company
Empower people to solve the world’s
greatest challenges.
We help people discover, analyze, and collaborate by
connecting their curiosity and experience with any data.
Purpose
Mission
@teoliphant
9
Start working on key technology
Blaze — blaze, odo, datashape, dynd, dask
Bokeh — interactive visualization in web including Jupyter
Numba — Array-oriented Python subset compiler
Conda — Cross-platform package manager
with environments
Bootstrap and seed funding through 2014
VC funded in 2015
@teoliphant
10
is….
the Open Data Science Platform
powered by Python…
the fastest growing open data science language
• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams
Create a Platform!
Free and Open Source!
@teoliphant
11
Governance
Provenance
Security
OPERATIONS
Python
R
Spark | Scala
JavaScript
Java
Fortran
C/C++
DATA SCIENCE
LANGUAGES
DATA
Flat Files (CSV, XLS…) SQL DB NoSQL Hadoop Streaming
HARDWARE
Workstation Server Cluster
APPLICATIONS
Interactive Presentations,
Notebooks, Reports & Apps
Solution
Templates
Visual Data Exploration
Advanced Spreadsheets
APIs
ANALYTICS
Data Exploration
Querying Data Prep
Data Connectors
Visual Programming
Notebooks
Analytics Development
Stats
Data
Mining
Deep Learning
Machine
Learning
Simulation &
Optimization
Geospatial
Text & NLP
Graph & Network
Image Analysis
Advanced Analytics
IDEs CI/CD
Package, Dependency,
Environment Management
Web & Desktop App
Dev
SOFTWARE DEVELOPMENT
HIGH PERFORMANCE
Distributed Computing
Parallelism & Multi-threading
Compiled Assets
GPUs & Multi-core
Compiler
Business
Analyst
Data
Scientist
Developer
Data
Engineer
DevOps
DATA SCIENCE
TEAM
Cloud On-premises
@teoliphant
Free Anaconda now with MKL as default
12
•Intel MKL (Math Kernel Libraries) provide enhanced
algorithms for basic math functions.
•Using MKL provides optimal performance for basic BLAS,
LAPACK, FFT, and math functions.
•Since version 2.5, Anaconda has provided MKL as the default
in the free download of Anaconda (you can also distribute
binaries linked against these MKL-enhanced tools).
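As a quick check (a sketch, not from the deck), you can confirm which BLAS/LAPACK implementation your NumPy build uses:

import numpy as np

# Prints the BLAS/LAPACK build configuration; an MKL-enabled
# Anaconda install reports the MKL libraries here.
np.__config__.show()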
@teoliphant
Community-maintained conda packages
13
Automatic
building
and testing for
your recipes
Awesome!
conda install -c conda-forge
conda config --add channels conda-forge
@teoliphant
Promises Made
14
•In 2012 we talked about the need for parallel Pandas and
parallel NumPy (both scale-up and scale-out).
•At the first PyData NYC in 2012 we described Blaze as an
effort to deliver scale-out, and Numba as well as DyND as
efforts to deliver scale-up.
•We introduced them early to rally a developer community.
•We can now rally a user-community around working code!
@teoliphant
15
Milestone success — 2016
Numba is delivering on scaling up
• NumPy’s ufuncs and gufuncs on multiple
CPU and GPU threads
• General GPU code
• General CPU code
Dask is delivering on scaling out
• dask.array (NumPy scaled out)
• dask.dataframe (Pandas scaled out)
• dask.bag (lists scaled out)
• dask.delayed (general scale-out
construction)
@teoliphant
Numba + Dask
Look at all of the data with Bokeh’s datashader.
Decouple the data-processing from the visualization.
Visualize arbitrarily large data.
16
• E.g. Open Street Map data:
• About 3 billion GPS coordinates
• https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/
• This image was rendered in one minute on a standard MacBook with 16 GB RAM
• Renders in milliseconds on several 128GB Amazon EC2 instances
@teoliphant
Categorical data: 2010 US Census
17
• One point per
person
• 300 million total
• Categorized by
race
• Interactive
rendering with
Numba+Dask
• No pre-tiling
@teoliphant
18
[Diagram: the Python & R ecosystem (NumPy, Pandas, and 720+ packages) reads and writes HDFS natively, alongside YARN/JVM frameworks for batch processing, Ibis and Impala for interactive processing, PySpark & SparkR, and MPI for high performance, interactive, and batch processing]

Bottom Line: 10-100X faster performance
• Interact with data in HDFS and Amazon S3 natively from Python
• Distributed computations without the JVM & Python/Java serialization
• Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
• Interactive, distributed computing with in-memory persistence/caching

Bottom Line: Leverage Python & R with Spark
What happened to Blaze?
Still going strong — just at another place in the stack
(and sometimes a description of an ecosystem).
@teoliphant
20
Expressions
Metadata
Runtime
@teoliphant
21
+ - / * ^ []
join, groupby, filter
map, sort, take
where, topk
datashape, dtype, shape, stride
hdf5, json, csv, xls, protobuf, avro, ...
NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
@teoliphant
APIs, syntax, language
22
Data Runtime
Expressions
metadata
storage/containers
compute
datashape
blaze
dask
odo
parallelize optimize, JIT
numba
@teoliphant
Blaze
23
Interface to query data on different storage systems http://blaze.pydata.org/en/latest/
from blaze import Data

iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…
Current focus is SQL and the pydata stack for run-time (dask, dynd, numpy, pandas, xray, etc.) + customer-driven needs (i.e. kdb, mongo, PostgreSQL).
@teoliphant
Blaze
24
Select columns: iris[['sepal_length', 'species']]
Operate: log(iris.sepal_length * 10)
Reduce: iris.sepal_length.mean()
Split-apply-combine:
by(iris.species, shortest=iris.petal_length.min(),
   longest=iris.petal_length.max(),
   average=iris.petal_length.mean())
Add new columns:
transform(iris, sepal_ratio = iris.sepal_length / iris.sepal_width,
          petal_ratio = iris.petal_length / iris.petal_width)
Text matching: iris.like(species='*versicolor')
Relabel columns:
iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
Filter: iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
@teoliphant
25
datashape + blaze
Blaze (like DyND) uses datashape as its type system
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}""")
@teoliphant
Datashape
26
A structured data description language for all kinds of data
http://datashape.pydata.org/
[Diagram: anatomy of a datashape: dimensions (e.g. var, 3, 4) followed by a dtype; unit types include string, int32, float64]
var * { x : int32, y : string, z : float64 }
A tabular datashape is dimensions times a record. The record { x : int32, y : string, z : float64 } is an ordered struct dtype: a collection of types keyed by labels.
@teoliphant
Datashape
27
{
flowersdb: {
iris: var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}
},
iriscsv: var * {
sepal_length: ?float64,
sepal_width: ?float64,
petal_length: ?float64,
petal_width: ?float64,
species: ?string
},
irisjson: var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
},
irismongo: 150 * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}
}
# Arrays of Structures
100 * {
name: string,
birthday: date,
address: {
street: string,
city: string,
postalcode: string,
country: string
}
}
# Structure of Arrays
{
x: 100 * 100 * float32,
y: 100 * 100 * float32,
u: 100 * 100 * float32,
v: 100 * 100 * float32,
}
# Function prototype
(3 * int32, float64) -> 3 * float64
# Function prototype with broadcasting dimensions
(A... * int32, A... * int32) -> A... * int32
# Arrays
3 * 4 * int32
3 * 4 * int32
10 * var * float64
3 * complex[float64]
@teoliphant
Blaze Server — Lights up your Dark Data
28
Builds off of Blaze's uniform interface to host data remotely through a JSON web API.

server.yaml:
iriscsv:
  source: iris.csv
irisdb:
  source: sqlite:///flowers.db::iris
irisjson:
  source: iris.json
  dshape: "var * {name: string, amount: float64}"
irismongo:
  source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e
localhost:6363/compute.json
@teoliphant
Blaze Client
29
>>> from blaze import Data
>>> t = Data('blaze://localhost:6363')
>>> t.fields
[u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
>>> t.iriscsv
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
>>> t.irisdb
petal_length petal_width sepal_length sepal_width species
0 1.4 0.2 5.1 3.5 Iris-setosa
1 1.4 0.2 4.9 3.0 Iris-setosa
2 1.3 0.2 4.7 3.2 Iris-setosa
@teoliphant
Compute recipes work with existing libraries and have multiple
backends — write once and run anywhere. Layer over existing data!
• Python list
• Numpy arrays
• Dynd
• Pandas DataFrame
• Spark, Impala, Ibis
• Mongo
• Dask
30
• Hint: Use odo to copy from
one format and/or engine to
another!
@teoliphant
Data Fabric (pre-alpha, demo-ware)
• New initiative just starting (if you can fund it, please tell me).
• Generalization of Python buffer-protocol to all languages
• A library to build a catalog of shared-memory chunks across the
cluster holding data that can be described by data-shape
• A helper type library in DyND that any language can use to
select appropriate analytics for the data
• Foundation for a cross-language “super-set” of PyData stack
31
https://github.com/blaze/datafabric
Brief Overview of Numba
Go to Graham Markall’s Talk at 2:15 in LG6 to hear more.
@teoliphant
Classic Example
33
from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
Mandelbrot
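To reproduce the comparison on the next slide yourself (a minimal sketch; the sample point and call count are arbitrary), note that a Numba dispatcher keeps the original interpreted function on its .py_func attribute:

import timeit

mandel(-0.5, 0.0, 20)  # first call triggers compilation
fast = timeit.timeit(lambda: mandel(-0.5, 0.0, 20), number=100000)
slow = timeit.timeit(lambda: mandel.py_func(-0.5, 0.0, 20), number=100000)
print('speedup: %.1fx' % (slow / fast))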
@teoliphant
The Basics
34
CPython 1x
Numpy array-wide operations 13x
Numba (CPU) 120x
Numba (NVidia Tesla K20c) 2100x
Mandelbrot
@teoliphant
How Numba works
35
@jit
def do_math(a, b):
    …

>>> do_math(x, y)

[Diagram: the Python function plus its function arguments feed the pipeline
Bytecode Analysis → Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!]
@teoliphant
Numba Features
• Numba supports:
– Windows, OS X, and Linux
– 32 and 64-bit x86 CPUs and NVIDIA GPUs
– Python 2 and 3
– NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user’s system.
• < 70 MB to install.
• Does not replace the standard Python interpreter (all of your existing Python libraries are still available)
36
@teoliphant
How to Use Numba
1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. (cProfile is a good choice; see the sketch below.)
3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the online documentation.)
4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check if there was a performance improvement.
37
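A minimal sketch for step 2 (benchmark() is a placeholder for your own benchmark function):

import cProfile
import pstats

def benchmark():
    # placeholder for a realistic workload
    return sum(i * i for i in range(1000000))

cProfile.run('benchmark()', 'prof.out')  # step 2: profile the benchmark
pstats.Stats('prof.out').sort_stats('cumulative').print_stats(10)  # step 3: find hotspots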
@teoliphant
Example: Filter an array
38
@teoliphant
Example: Filter an array
39
Array Allocation
Looping over ndarray x as an iterator
Using numpy math functions
Returning a slice of the array
Numba decorator (nopython=True not required)
2.7x Speedup over NumPy!
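The code on these two slides is an image; the following is a minimal sketch matching the annotations above (the function name, the threshold argument, and the exact math are assumptions):

import numpy as np
from numba import jit

@jit  # Numba decorator; nopython mode is inferred, so nopython=True not required
def filter_above(x, threshold):
    out = np.empty_like(x)              # array allocation
    count = 0
    for v in x:                         # loop over ndarray x as an iterator
        if np.sqrt(v * v) > threshold:  # numpy math functions work in nopython mode
            out[count] = v
            count += 1
    return out[:count]                  # return a slice of the array

print(filter_above(np.random.randn(10), 0.5))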
@teoliphant
NumPy UFuncs and GUFuncs
40
NumPy ufuncs (and gufuncs) are functions that operate “element-wise” (or “sub-dimension-wise”) across an array without an explicit loop. This implicit loop (which is in machine code) is at the core of why NumPy is fast. Dispatch is done internally to a particular code-segment based on the type of the array. It is a very powerful abstraction in the PyData stack.
Making new fast ufuncs used to be only possible in C — painful!
With numba.vectorize and numba.guvectorize it is now easy!
The inner secrets of NumPy are now at your fingertips for you to make
your own magic!
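For instance (a sketch; rel_diff is an invented kernel, not from the deck), numba.vectorize builds a true NumPy ufunc from a scalar function, and target='parallel' runs the implicit loop on multiple CPU threads:

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64)'], target='parallel')
def rel_diff(x, y):
    # scalar kernel; Numba generates the element-wise machine-code loop
    return 2.0 * (x - y) / (x + y)

x = np.arange(1.0, 1000001.0)
print(rel_diff(x, x + 1.0)[:3])  # broadcasts like any NumPy ufunc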
@teoliphant
Example: Making a windowed compute filter
41
Perform a computation on a finite window of the input.
For a linear system, this is a FIR filter and what np.convolve or sp.signal.lfilter can do.
But what if you want some arbitrary computation?
@teoliphant
Example: Making a windowed compute filter
42
Hand-coded implementation
Build a ufunc for the kernel, which is faster for large arrays!
This can now run easily on the GPU with ‘target=gpu’ and on many cores with ‘target=parallel’
Array-oriented!
@teoliphant
Example: Making a windowed compute filter
43
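A minimal sketch of such a kernel as a generalized ufunc (the trailing-window mean and all names are assumptions; any arbitrary computation can replace the inner loop, and a parallel or CUDA target would move it to many cores or the GPU):

import numpy as np
from numba import guvectorize, float64, int64

@guvectorize([(float64[:], int64, float64[:])], '(n),()->(n)')
def window_mean(x, w, out):
    # out[i] is computed from the trailing window x[i-w+1 : i+1]
    for i in range(x.shape[0]):
        lo = max(0, i - w + 1)
        s = 0.0
        for j in range(lo, i + 1):
            s += x[j]
        out[i] = s / (i - lo + 1)

print(window_mean(np.arange(10.0), 3))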
@teoliphant
Other interesting things
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize) including GPU support and multi-core (threaded) support
• Call ctypes and cffi functions directly and pass them as arguments
• Support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.23.0/
44
@teoliphant
Recently Added Numba Features
• Support for named tuples in nopython mode
• Support for sets in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• JIT classes (zero-cost abstraction)
• Support of np.dot (and the ‘@’ operator on Python 3.5)
• Support for some of np.linalg
• generated_jit (jit the functions that are the return values of the decorated
function — meta-programming)
• SmartArrays which can exist on host and GPU (transparent data access).
• Ahead of Time Compilation
• Disk-caching of pre-compiled code
45
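Two of these features in one sketch (the function is illustrative): opt-in on-disk caching via cache=True, and np.dot in nopython mode:

import numpy as np
from numba import jit

@jit(nopython=True, cache=True)  # compiled code is cached on disk (opt-in)
def matvec(m, v):
    return np.dot(m, v)          # np.dot support in nopython mode

m = np.random.rand(50, 3)
v = np.random.rand(3)
print(matvec(m, v))              # later runs reload the cached machine code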
Overview of Dask as a Parallel
Processing Framework with Distributed
@teoliphant
Precursors to Parallelism
47
• Consider the following approaches first:
1. Use better algorithms
2. Try Numba or C/Cython
3. Store data in efficient formats
4. Subsample your data
• If you have to parallelize:
1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk)
2. Then a large workstation (24 cores, 1 TB RAM)
3. Finally, scale out to a cluster
@teoliphant
Overview of Dask
48
Dask is a Python parallel computing library that is:
• Familiar: Implements parallel NumPy and Pandas objects
• Fast: Optimized for demanding numerical applications
• Flexible: Supports sophisticated and messy algorithms
• Scales out: Runs resiliently on clusters of 100s of machines
• Scales down: Pragmatic in a single process on a laptop
• Interactive: Responsive and fast for interactive data science
Dask complements the rest of Anaconda. It was developed with NumPy, Pandas, and scikit-learn developers.
@teoliphant
Spectrum of Parallelization
49
Threads
Processes
MPI
ZeroMQ
Dask
Hadoop
Spark
SQL:
Hive
Pig
Impala
Implicit control: Restrictive but easy
Explicit control: Fast but hard
@teoliphant
Dask: From User Interaction to Execution
50
@teoliphant
Dask Collections: Familiar Expressions and API
51
Dask array (mimics NumPy): x.T - x.mean(axis=0)
Dask dataframe (mimics Pandas): df.groupby(df.index).value.mean()
Dask imperative (wraps custom code): def load(filename): … def clean(data): … def analyze(result): …
Dask bag (collection of data): b.map(json.loads).foldby(...)
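A self-contained sketch of three of these collections (data is generated in memory so the snippet stands alone; real workloads would use dd.read_csv, db.read_text, and friends):

import pandas as pd
import dask.array as da
import dask.bag as db
import dask.dataframe as dd

# Dask array: NumPy-style expressions over chunks
x = da.random.random((2000, 2000), chunks=(500, 500))
print((x.T - x.mean(axis=0)).std().compute())

# Dask dataframe: pandas-style expressions over partitions
df = dd.from_pandas(pd.DataFrame({'value': range(8)}), npartitions=2)
print(df.groupby(df.index).value.mean().compute())

# Dask bag: functional operations over a collection of Python objects
b = db.from_sequence(range(8), npartitions=2)
print(b.map(lambda i: i * i).sum().compute())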
@teoliphant
Dask Graphs: Example Machine Learning Pipeline
52
@teoliphant
Dask Graphs: Example Machine Learning Pipeline + Grid Search
53
@teoliphant
• simple: easy-to-use API
• flexible: perform lots of actions with a minimal amount of code
• fast: dispatching to run-time engines & cython
• database-like: familiar ops
• interop: integration with the PyData Stack
54
(((A + 1) * 2) ** 3)
@teoliphant
55
(B - B.mean(axis=0)) + (B.T / B.std())
@teoliphant
[Diagram: a Client on the user's machine (laptop) connects over the network to a Scheduler, which coordinates many Workers]
Dask Schedulers: Example - Distributed Scheduler
56
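Connecting from the client side looks like this (a sketch with a placeholder address; the modern entry point is dask.distributed.Client, the descendant of the 2016-era distributed Executor):

from dask.distributed import Client

client = Client('tcp://scheduler-host:8786')  # placeholder scheduler address
future = client.submit(sum, range(100))       # runs on some worker
print(future.result())                        # gather the result back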
@teoliphant
Cluster Architecture Diagram
57
[Diagram: a Client Machine connects to a Head Node, which manages several Compute Nodes]
@teoliphant
• Single machine with multiple threads or processes
• On a cluster with SSH (dcluster)
• Resource management: YARN (knit), SGE, Slurm
• On the cloud with Amazon EC2 (dec2)
• On a cluster with Anaconda for cluster management
• Manage multiple conda environments and packages on bare-metal or cloud-based clusters
Using Anaconda and Dask on your Cluster
58
@teoliphant
Scheduler Visualization with Bokeh
59
@teoliphant
Examples
60
1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames
• Demonstrate Pandas at scale
• Observe responsive user interface
2. Distributed language processing with text data using Dask Bags
• Explore data using a distributed memory cluster
• Interactively query data using libraries from Anaconda
3. Analyzing global temperature data using Dask Arrays
• Visualize complex algorithms
• Learn about dask collections and tasks
4. Handle custom code and workflows using Dask Imperative
• Deal with messy situations
• Learn about scheduling
@teoliphant
Example 1: Using Dask DataFrames on a cluster with CSV data
61
• Built from Pandas DataFrames
• Match Pandas interface
• Access data from HDFS, S3, local, etc.
• Fast, low latency
• Responsive user interface
[Diagram: monthly CSVs (January through May 2016), each loaded as a Pandas DataFrame, combined into one Dask DataFrame]
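In code this is just the pandas idiom at cluster scale (a sketch; the file pattern and column name are assumptions about the taxi CSVs):

import dask.dataframe as dd

# One logical dataframe over many monthly CSV partitions
df = dd.read_csv('yellow_tripdata_2016-*.csv')
print(df.passenger_count.mean().compute())  # pandas-style, executed in parallel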
@teoliphant
Example 2: Using Dask Bags on a cluster with text data
62
• Distributed natural language processing
with text data stored in HDFS
• Handles standard computations
• Looks like other parallel frameworks
(Spark, Hive, etc.)
• Access data from HDFS, S3, local, etc.
• Handles the common case
[Diagram: data partitions are each mapped through a function, then merged into a single result]
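A sketch of the same pattern (paths and record fields are assumptions; hdfs:// paths need the HDFS backend installed):

import json
import dask.bag as db

b = db.read_text('hdfs:///data/tweets-*.json').map(json.loads)
# standard computations: map, filter, pluck, frequencies, foldby, ...
print(b.pluck('text').map(len).mean().compute())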
@teoliphant
[Diagram: many NumPy arrays tiled together form one Dask Array]
Example 3: Using Dask Arrays with global temperature data
63
• Built from NumPy n-dimensional arrays
• Matches NumPy interface
(subset)
• Solve medium-large
problems
• Complex algorithms
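A sketch with synthetic data shaped like a global temperature cube (the dimensions and the anomaly computation are illustrative):

import dask.array as da

# (time, lat, lon) cube, chunked along the time axis
t = da.random.random((365, 180, 360), chunks=(30, 180, 360))
anomaly = t - t.mean(axis=0)       # NumPy-style expression over the whole cube
print(anomaly[0].max().compute())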
@teoliphant
Example 4: Using Dask Delayed to handle custom workflows
64
• Manually handle functions to support messy situations
• Life saver when collections aren't flexible enough
• Combine futures with collections for best of both worlds
• Scheduler provides resilient and elastic execution
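A sketch using the load/clean/analyze shape from the Dask Collections slide (the bodies are trivial stand-ins for real I/O and custom logic):

import dask

@dask.delayed
def load(filename):
    return list(range(10))           # stand-in for real I/O

@dask.delayed
def clean(data):
    return [x for x in data if x % 2 == 0]

@dask.delayed
def analyze(results):
    return sum(sum(r) for r in results)

results = [clean(load(f)) for f in ['a.csv', 'b.csv']]
print(analyze(results).compute())    # graph built lazily, executed on demand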
@teoliphant
Scale Up vs Scale Out
65
Big Memory &
Many Cores
/ GPU Box
Best of Both
(e.g. GPU Cluster)
Many commodity
nodes in a cluster
Scale Up
(Bigger Nodes)
Scale Out
(More Nodes)
Numba
Dask
Blaze
@teoliphant
Contribute to the next stage of the journey
66
Numba, Blaze and the Blaze ecosystem (dask, odo, dynd, datashape, data-fabric, and beyond…)
Mais conteĂşdo relacionado

Mais procurados

Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Databricks
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariKarissa Rae McKelvey
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with RDatabricks
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDataWorks Summit
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016MLconf
 

Mais procurados (16)

Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with...
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with R
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 

Destaque

Scala.io
Scala.ioScala.io
Scala.ioSteve Gury
 
Numba Overview
Numba OverviewNumba Overview
Numba Overviewstan_seibert
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...IntelÂŽ Software
 
Buzzwords Numba Presentation
Buzzwords Numba PresentationBuzzwords Numba Presentation
Buzzwords Numba Presentationkammeyer
 
Numba: Flexible analytics written in Python with machine-code speeds and avo...
Numba:  Flexible analytics written in Python with machine-code speeds and avo...Numba:  Flexible analytics written in Python with machine-code speeds and avo...
Numba: Flexible analytics written in Python with machine-code speeds and avo...PyData
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyTravis Oliphant
 

Destaque (7)

Scala.io
Scala.ioScala.io
Scala.io
 
Numba Overview
Numba OverviewNumba Overview
Numba Overview
 
PyData Paris 2015 - Track 3.3 Antoine Pitrou
PyData Paris 2015 - Track 3.3 Antoine PitrouPyData Paris 2015 - Track 3.3 Antoine Pitrou
PyData Paris 2015 - Track 3.3 Antoine Pitrou
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
 
Buzzwords Numba Presentation
Buzzwords Numba PresentationBuzzwords Numba Presentation
Buzzwords Numba Presentation
 
Numba: Flexible analytics written in Python with machine-code speeds and avo...
Numba:  Flexible analytics written in Python with machine-code speeds and avo...Numba:  Flexible analytics written in Python with machine-code speeds and avo...
Numba: Flexible analytics written in Python with machine-code speeds and avo...
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
 

Semelhante a Scaling PyData Up and Out

From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
eScience Cluster Arch. Overview
eScience Cluster Arch. OvervieweScience Cluster Arch. Overview
eScience Cluster Arch. OverviewFrancesco Bongiovanni
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsTakuya UESHIN
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesLynn Langit
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015dhiguero
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherDatabricks
 

Semelhante a Scaling PyData Up and Out (20)

From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
eScience Cluster Arch. Overview
eScience Cluster Arch. OvervieweScience Cluster Arch. Overview
eScience Cluster Arch. Overview
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - Panorays
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examples
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark Together
 

Mais de Travis Oliphant

Array computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataTravis Oliphant
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019Travis Oliphant
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019Travis Oliphant
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019Travis Oliphant
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationTravis Oliphant
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsTravis Oliphant
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with condaTravis Oliphant
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonTravis Oliphant
 
PyData Introduction
PyData IntroductionPyData Introduction
PyData IntroductionTravis Oliphant
 

Mais de Travis Oliphant (14)

Array computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyDataArray computing and the evolution of SciPy, NumPy, and PyData
Array computing and the evolution of SciPy, NumPy, and PyData
 
SciPy Latin America 2019
SciPy Latin America 2019SciPy Latin America 2019
SciPy Latin America 2019
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with conda
 
Blaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for PythonBlaze: a large-scale, array-oriented infrastructure for Python
Blaze: a large-scale, array-oriented infrastructure for Python
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
PyData Introduction
PyData IntroductionPyData Introduction
PyData Introduction
 
Numba
NumbaNumba
Numba
 

Último

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vĂĄzquez
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Último (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Scaling PyData Up and Out

  • 1. Scaling PyData Up and Out Travis E. Oliphant, PhD @teoliphant Python Enthusiast since 1997 Making NumPy- and Pandas-style code faster and run in parallel. PyData London 2016
  • 2. @teoliphant Scale Up vs Scale Out 2 Big Memory & Many Cores / GPU Box Best of Both (e.g. GPU Cluster) Many commodity nodes in a cluster ScaleUp (BiggerNodes) Scale Out (More Nodes)
  • 4. @teoliphant 4 I spent the first 15 years of my “Python life” working to build the foundation of the Python Scientific Ecosystem as a practitioner/developer PEP 357 PEP 3118 SciPy
  • 5. @teoliphant 5 I plan to spend the next 15 years ensuring this ecosystem can scale both up and out as an entrepreneur, evangelist, and director of innovation. CONDA NUMBA DASK DYND BLAZE DATA-FABRIC
  • 6. @teoliphant 6 In 2012 we “took off” in this effort to form the foundation of scalable Python — Numba (scale-up) and Blaze (scale-out) were the major efforts. In early 2013, I formally “retired” from NumPy maintenance to pursue “next-generation” NumPy. Not so much a single new library, but a new emphasis on scale
  • 7. @teoliphant 7 Rally a community “Apache” for Open Data Science Community-led and directed Conference Series with sponsorship proceeds going directly to NumFocus
  • 8. @teoliphant 8 Organize a company Empower people to solve the world’s greatest challenges. We help people discover, analyze, and collaborate by connecting their curiosity and experience with any data. Purpose Mission
  • 9. @teoliphant 9 Start working on key technology Blaze — blaze, odo, datashape, dynd, dask Bokeh — interactive visualization in web including Jupyter Numba — Array-oriented Python subset compiler Conda — Cross-platform package manager with environments Bootstrap and seed funding through 2014 VC funded in 2015
  • 10. @teoliphant 10 is…. the Open Data Science Platform powered by Python… the fastest growing open data science language • Accelerate Time-to-Value • Connect Data, Analytics & Compute • Empower Data Science Teams Create a Platform! Free and Open Source!
  • 11. @teoliphant 11 Governance Provenance Security OPERATIONS Python R Spark | Scala JavaScript Java Fortran C/C++ DATA SCIENCE LANGUAGES DATA Flat Files (CSV, XLS…)) SQL DB NoSQL Hadoop Streaming HARDWARE Workstation Server Cluster APPLICATIONS Interactive Presentations, Notebooks, Reports & Apps Solution Templates Visual Data Exploration Advanced Spreadsheets APIs ANALYTICS Data Exploration Querying Data Prep Data Connectors Visual Programming Notebooks Analytics Development Stats Data Mining Deep Learning Machine Learning Simulation & Optimization Geospatial Text & NLP Graph & Network Image Analysis Advanced Analytics IDEsCICD Package, Dependency, Environment Management Web & Desktop App Dev SOFTWARE DEVELOPMENT HIGH PERFORMANCE Distributed Computing Parallelism &
 Multi-threading Compiled Assets GPUs & Multi-core Compiler Business Analyst Data Scientist Developer Data Engineer DevOps DATA SCIENCE TEAM Cloud On-premises
  • 12. @teoliphant Free Anaconda now with MKL as default 12 •Intel MKL (Math Kernel Libraries) provide enhanced algorithms for basic math functions. •Using MKL provides optimal performance for basic BLAS, LAPACK, FFT, and math functions. •Anaconda since version 2.5 has MKL provided as the default in the free download of Anaconda (you can also distribute binaries linked against these MKL-enhanced tools).
  • 13. @teoliphant Community maintained conda packages 13 Automatic building and testing for your recipes Awesome! conda install -c conda-forge conda config --add channels conda-forge
  • 14. @teoliphant Promises Made 14 •In 2012 we talked about the need for parallel Pandas and Parallel NumPy (both scale-up and scale-out). •At the first PyData NYC in 2012 we described Blaze as an effort to deliver scale out and Numba as well as DyND as an effort to deliver scale-up. •We introduced them early to rally a developer community. •We can now rally a user-community around working code!
  • 15. @teoliphant 15 Milestone success — 2016 Numba is delivering on scaling up • NumPy’s ufuncs and gufuncs on multiple CPU and GPU threads • General GPU code • General CPU code Dask is delivering on scaling out • dask.array (NumPy scaled out) • dask.dataframe (Pandas scaled out) • dask.bag (lists scaled out) • dask.delayed (general scale-out construction)
  • 16. @teoliphantNumba + Dask Look at all of the data with Bokeh’s datashader. Decouple the data-processing from the visualization. Visualize arbitrarily large data. 16 • E.g. Open Street Map data: • About 3 billion GPS coordinates • https://blog.openstreetmap.org/ 2012/04/01/bulk-gps-point-data/. • This image was rendered in one minute on a standard MacBook with 16 GB RAM • Renders in a milliseconds on several 128GB Amazon EC2 instances
  • 17. @teoliphant Categorical data: 2010 US Census 17 • One point per person • 300 million total • Categorized by race • Interactive rendering with Numba+Dask • No pre-tiling
  • 18. @teoliphant (architecture diagram: YARN resource management; JVM-based batch and interactive processing through PySpark & SparkR, Ibis, and Impala over HDFS; the Python & R ecosystem of 720+ packages with native read & write of NumPy, Pandas, etc., plus MPI for high-performance, interactive, and batch processing) Bottom Line: 10-100X faster performance • Interact with data in HDFS and Amazon S3 natively from Python • Distributed computations without the JVM & Python/Java serialization • Framework for easy, flexible parallelism using directed acyclic graphs (DAGs) • Interactive, distributed computing with in-memory persistence/caching • Leverage Python & R with Spark 18
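For instance, dask reads HDFS (and S3) paths directly, so a distributed Pandas-style computation never touches the JVM. A minimal sketch, assuming an HDFS-enabled install and a hypothetical path and column:

    import dask.dataframe as dd

    df = dd.read_csv('hdfs:///data/taxi/*.csv')   # hypothetical HDFS location
    df.passenger_count.mean().compute()           # runs across the cluster, no JVM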
  • 19. What happened to Blaze? Still going strong — just at another place in the stack (and sometimes a description of an ecosystem).
  • 21. @teoliphant 21 (diagram of the Blaze layers: expressions such as + - / * ^ [], join, groupby, filter, map, sort, take, where, topk; metadata such as datashape, dtype, shape, stride; storage formats such as hdf5, json, csv, xls, protobuf, avro, ...; compute backends such as NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...)
  • 22. @teoliphant APIs, syntax, language 22 (diagram: blaze provides the expressions and API; datashape describes the metadata; odo moves data between storage/containers; dask parallelizes the runtime; numba optimizes and JIT-compiles the compute)
  • 23. @teoliphant Blaze 23 Interface to query data on different storage systems (http://blaze.pydata.org/en/latest/):
    from blaze import Data
    iris = Data('iris.csv')                        # CSV
    iris = Data('sqlite:///flowers.db::iris')      # SQL
    iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
    iris = Data('iris.json')                       # JSON
    iris = Data('s3://blaze-data/iris.csv')        # S3
    ...
  Current focus is SQL and the PyData stack for run-time (dask, dynd, numpy, pandas, xray, etc.) plus customer-driven needs (i.e. kdb, mongo, PostgreSQL).
  • 24. @teoliphant Blaze 24
    Select columns:       iris[['sepal_length', 'species']]
    Operate:              log(iris.sepal_length * 10)
    Reduce:               iris.sepal_length.mean()
    Split-apply-combine:  by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean())
    Add new columns:      transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width, petal_ratio=iris.petal_length / iris.petal_width)
    Text matching:        iris.like(species='*versicolor')
    Relabel columns:      iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
    Filter:               iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
  • 25. @teoliphant 25 datashape / blaze: Blaze (like DyND) uses datashape as its type system:
    >>> iris = Data('iris.json')
    >>> iris.dshape
    dshape("""var * {
        petal_length: float64,
        petal_width: float64,
        sepal_length: float64,
        sepal_width: float64,
        species: string
    }""")
  • 26. @teoliphant Datashape 26 A structured data description language for all kinds of data (http://datashape.pydata.org/). (diagram: a datashape composes dimensions, e.g. var, 3, 4, with a dtype built from unit types, e.g. string, int32, float64, joined by *; so var * { x: int32, y: string, z: float64 } is a tabular datashape, and the record dtype { x: int32, y: string, z: float64 } is an ordered struct, a collection of types keyed by labels)
  • 27. @teoliphant Datashape 27
    { flowersdb: { iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } },
      iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string },
      irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string },
      irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }
    # Arrays of Structures
    100 * { name: string, birthday: date, address: { street: string, city: string, postalcode: string, country: string } }
    # Structure of Arrays
    { x: 100 * 100 * float32, y: 100 * 100 * float32, u: 100 * 100 * float32, v: 100 * 100 * float32 }
    # Function prototype
    (3 * int32, float64) -> 3 * float64
    # Function prototype with broadcasting dimensions
    (A... * int32, A... * int32) -> A... * int32
    # Arrays
    3 * 4 * int32
    10 * var * float64
    3 * complex[float64]
  • 28. @teoliphant Blaze Server — Lights up your Dark Data 28 Builds off of Blaze's uniform interface to host data remotely through a JSON web API.
    $ blaze-server server.yaml -e
    localhost:6363/compute.json
  server.yaml:
    iriscsv:
      source: iris.csv
    irisdb:
      source: sqlite:///flowers.db::iris
    irisjson:
      source: iris.json
      dshape: "var * {name: string, amount: float64}"
    irismongo:
      source: mongodb://localhost/mydb::iris
  • 29. @teoliphant Blaze Client 29
    >>> from blaze import Data
    >>> t = Data('blaze://localhost:6363')
    >>> t.fields
    [u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
    >>> t.iriscsv
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    >>> t.irisdb
       petal_length  petal_width  sepal_length  sepal_width      species
    0           1.4          0.2           5.1          3.5  Iris-setosa
    1           1.4          0.2           4.9          3.0  Iris-setosa
    2           1.3          0.2           4.7          3.2  Iris-setosa
  • 30. @teoliphant Compute recipes work with existing libraries and have multiple backends: write once and run anywhere. Layer over existing data! • Python lists • NumPy arrays • DyND • Pandas DataFrames • Spark, Impala, Ibis • Mongo • Dask 30 • Hint: use odo to copy from one format and/or engine to another!
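Following that hint, odo moves the same data between formats and engines with one call; the URIs below are illustrative:

    from odo import odo

    odo('iris.csv', 'sqlite:///flowers.db::iris')    # CSV -> SQLite table
    odo('sqlite:///flowers.db::iris', 'iris.json')   # SQLite table -> JSON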
  • 31. @teoliphant Data Fabric (pre-alpha, demo-ware) • New initiative just starting (if you can fund it, please tell me). • Generalization of the Python buffer protocol to all languages • A library to build a catalog of shared-memory chunks across the cluster holding data that can be described by datashape • A helper type library in DyND that any language can use to select appropriate analytics • Foundation for a cross-language "super-set" of the PyData stack 31 https://github.com/blaze/datafabric
  • 32. Brief Overview of Numba. Go to Graham Markall's talk at 2:15 in LG6 to hear more.
  • 33. @teoliphant Classic Example 33 Mandelbrot:
    from numba import jit

    @jit
    def mandel(x, y, max_iters):
        c = complex(x, y)
        z = 0j
        for i in range(max_iters):
            z = z*z + c
            if z.real * z.real + z.imag * z.imag >= 4:
                return 255 * i // max_iters
        return 255
  • 34. @teoliphant The Basics 34 Mandelbrot speedups:
    CPython                       1x
    NumPy array-wide operations   13x
    Numba (CPU)                   120x
    Numba (NVidia Tesla K20c)     2100x
  • 35. @teoliphant How Numba works 35 (compilation pipeline: the @jit-decorated Python function plus the argument types at the first call, e.g. >>> do_math(x, y), feed bytecode analysis and type inference, producing Numba IR; after IR rewrites and lowering, LLVM IR is handed to the LLVM JIT, and the resulting machine code is cached and executed)
  • 36. @teoliphant Numba Features • Numba supports: Windows, OS X, and Linux; 32- and 64-bit x86 CPUs and NVIDIA GPUs; Python 2 and 3; NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user's system • < 70 MB to install • Does not replace the standard Python interpreter (all of your existing Python libraries are still available) 36
  • 37. @teoliphant How to Use Numba 1. Create a realistic benchmark test case (do not use your unit tests as a benchmark!). 2. Run a profiler on your benchmark (cProfile is a good choice). 3. Identify hotspots that could potentially be compiled by Numba with a little refactoring (see online documentation). 4. Apply @numba.jit and @numba.vectorize as needed to critical functions (small rewrites may be needed to work around Numba limitations). 5. Re-run the benchmark to check whether there was a performance improvement. A sketch of steps 2 and 4 follows. 37
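Here the benchmark function and the hotspot are hypothetical stand-ins:

    import cProfile
    import numpy as np
    import numba

    def benchmark():
        a = np.random.random(1000000)
        b = np.random.random(1000000)
        return dot_loop(a, b)                 # hypothetical hotspot

    @numba.jit(nopython=True)                 # step 4: compile the hotspot
    def dot_loop(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * b[i]
        return total

    cProfile.run('benchmark()', sort='cumtime')   # step 2: confirm where the time goes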
  • 39. @teoliphant Example: Filter an array 39 (the code on the slide illustrates: array allocation; looping over ndarray x as an iterator; using NumPy math functions; returning a slice of the array; the Numba decorator, where nopython=True is not required) 2.7x speedup over NumPy!
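Since the slide's code is an image not captured in this transcript, here is a sketch matching those bullets (the function name and filtering condition are assumptions):

    import numpy as np
    from numba import jit

    @jit                                   # nopython=True is not required here
    def filter_above(x, threshold):
        out = np.empty_like(x)             # array allocation
        count = 0
        for value in x:                    # loop over the ndarray as an iterator
            if np.abs(value) > threshold:  # NumPy math inside the loop
                out[count] = value
                count += 1
        return out[:count]                 # return a slice of the array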
  • 40. @teoliphant NumPy UFuncs and GUFuncs 40 NumPy ufuncs (and gufuncs) are functions that operate "element-wise" (or "sub-dimension-wise") across an array without an explicit loop. This implicit loop (which is in machine code) is at the core of why NumPy is fast. Dispatch is done internally to a particular code segment based on the type of the array. It is a very powerful abstraction in the PyData stack. Making new fast ufuncs used to be possible only in C (painful!). With numba.vectorize and numba.guvectorize it is now easy! The inner secrets of NumPy are now at your fingertips for you to make your own magic!
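For example, a scalar kernel becomes a full ufunc with one decorator; a minimal sketch (the kernel itself is illustrative):

    import numpy as np
    from numba import vectorize

    @vectorize(['float64(float64, float64)'], target='parallel')
    def rel_diff(x, y):
        # write only the scalar case; numba generates the machine-code loop
        return 2.0 * (x - y) / (x + y)

    a = np.random.random(1000000)
    b = np.random.random(1000000)
    rel_diff(a, b)        # broadcasts and threads like any NumPy ufunc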
  • 41. @teoliphant Example: Making a windowed compute filter 41 Perform a computation on a finite window of the input. For a linear system, this is a FIR filter, and it is what np.convolve or sp.signal.lfilter can do. But what if you want some arbitrary computation?
  • 42. @teoliphant Example: Making a windowed compute filter 42 Hand-coded implementation vs. building a ufunc for the kernel, which is faster for large arrays! This can now run easily on the GPU with target='gpu' and on many cores with target='parallel'. Array-oriented!
  • 43. @teoliphant Example: Making a windowed compute filter 43 (figure only; a sketch of the pattern follows)
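This is the moving-window mean gufunc from the Numba documentation, standing in for the slide's kernel:

    import numpy as np
    from numba import guvectorize

    @guvectorize(['void(float64[:], int64[:], float64[:])'], '(n),()->(n)')
    def move_mean(a, window_arr, out):
        # arbitrary computation over a finite window of the input
        window_width = window_arr[0]
        asum = 0.0
        count = 0
        for i in range(window_width):
            asum += a[i]
            count += 1
            out[i] = asum / count
        for i in range(window_width, len(a)):
            asum += a[i] - a[i - window_width]
            out[i] = asum / count

    move_mean(np.arange(20.0), 3)      # windowed filter, array-oriented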
  • 44. @teoliphant Other interesting things • CUDA Simulator to debug your code in the Python interpreter • Generalized ufuncs (@guvectorize) including GPU support and multi-core (threaded) support • Call ctypes and cffi functions directly and pass them as arguments • Support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • "numba annotate" to dump an HTML-annotated version of compiled code • See: http://numba.pydata.org/numba-doc/0.23.0/ 44
  • 45. @teoliphant Recently Added Numba Features • Support for named tuples in nopython mode • Support for sets in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) • JIT classes (zero-cost abstraction; see the sketch below) • Support for np.dot (and the '@' operator on Python 3.5) • Support for some of np.linalg • generated_jit (jit the functions that are the return values of the decorated function: meta-programming) • SmartArrays which can exist on host and GPU (transparent data access) • Ahead-of-Time compilation • Disk caching of pre-compiled code 45
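A minimal JIT-class sketch (the class and spec are illustrative; the decorator's import location has moved across Numba versions, from numba.jitclass in this era to numba.experimental.jitclass later):

    from numba import jitclass, float64

    spec = [('total', float64), ('count', float64)]

    @jitclass(spec)                    # compiled attribute access: zero-cost abstraction
    class RunningMean(object):
        def __init__(self):
            self.total = 0.0
            self.count = 0.0

        def add(self, value):
            self.total += value
            self.count += 1.0

        def mean(self):
            return self.total / self.count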
  • 46. Overview of Dask as a Parallel Processing Framework with Distributed
  • 47. @teoliphant Precursors to Parallelism 47 • Consider the following approaches first: 1. Use better algorithms 2. Try Numba or C/Cython 3. Store data in efficient formats 4. Subsample your data • If you have to parallelize: 1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk) 2. Then a large workstation (24 cores, 1 TB RAM) 3. Finally, scale out to a cluster
  • 48. @teoliphant Overview of Dask 48 Dask is a Python parallel computing library that is: • Familiar: implements parallel NumPy and Pandas objects • Fast: optimized for demanding numerical applications • Flexible: for sophisticated and messy algorithms • Scales out: runs resiliently on clusters of 100s of machines • Scales down: pragmatic in a single process on a laptop • Interactive: responsive and fast for interactive data science. Dask complements the rest of Anaconda. It was developed with NumPy, Pandas, and scikit-learn developers.
  • 50. @teoliphant Dask: From User Interaction to Execution 50
  • 51. @teoliphant Dask Collections: Familiar Expressions and API 51
    Dask array (mimics NumPy):            x.T - x.mean(axis=0)
    Dask dataframe (mimics Pandas):       df.groupby(df.index).value.mean()
    Dask bag (collection of data):        b.map(json.loads).foldby(...)
    Dask imperative (wraps custom code):  def load(filename): ...  def clean(data): ...  def analyze(result): ...
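The same expressions as a runnable sketch (the CSV paths and the 'value' column are hypothetical):

    import dask.array as da
    import dask.dataframe as dd

    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    (x.T - x.mean(axis=0)).sum().compute()        # NumPy-style, evaluated in chunks

    df = dd.read_csv('data-*.csv')                # hypothetical files
    df.groupby(df.index).value.mean().compute()   # Pandas-style, evaluated per partition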
  • 52. @teoliphant Dask Graphs: Example Machine Learning Pipeline 52
  • 53. @teoliphant Dask Graphs: Example Machine Learning Pipeline + Grid Search 53
  • 54. @teoliphant • simple: easy-to-use API • flexible: perform lots of actions with a minimal amount of code • fast: dispatching to run-time engines & Cython • database-like: familiar ops • interop: integration with the PyData stack 54 (((A + 1) * 2) ** 3)
  • 55. @teoliphant 55 (B - B.mean(axis=0)) + (B.T / B.std())
  • 56. @teoliphant Dask Schedulers: Example - Distributed Scheduler 56 (diagram: a Client on the user machine (laptop) connects over the network to a Scheduler that coordinates many Workers; a Client and Worker can also sit on the same network as the Scheduler)
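Connecting to that scheduler takes a couple of lines; the address below is hypothetical, and in the earliest releases the client class was spelled distributed.Executor:

    from dask.distributed import Client

    client = Client('tcp://10.0.0.1:8786')              # hypothetical scheduler address
    futures = client.map(lambda n: n ** 2, range(100))  # fan work out to the workers
    total = client.submit(sum, futures)                 # reduce on the cluster
    total.result()                                      # pull back one small result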
  • 57. @teoliphant Cluster Architecture Diagram 57 (diagram: a Client Machine talks to a Head Node, which manages several Compute Nodes)
  • 58. @teoliphant Using Anaconda and Dask on your Cluster 58 • Single machine with multiple threads or processes • On a cluster with SSH (dcluster) • Resource management: YARN (knit), SGE, Slurm • On the cloud with Amazon EC2 (dec2) • On a cluster with Anaconda for cluster management • Manage multiple conda environments and packages on bare-metal or cloud-based clusters
  • 60. @teoliphant Examples 60 1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames • Demonstrate Pandas at scale • Observe responsive user interface 2. Distributed language processing with text data using Dask Bags • Explore data using a distributed memory cluster • Interactively query data using libraries from Anaconda 3. Analyzing global temperature data using Dask Arrays • Visualize complex algorithms • Learn about dask collections and tasks 4. Handle custom code and workflows using Dask Imperative • Deal with messy situations • Learn about scheduling
  • 61. @teoliphant Example 1: Using Dask DataFrames on a cluster with CSV data 61 • Built from Pandas DataFrames • Match Pandas interface • Access data from HDFS, S3, local, etc. • Fast, low latency • Responsive user interface (diagram: a Dask DataFrame is a sequence of Pandas DataFrames, e.g. one partition per month: January 2016, February 2016, March 2016, April 2016, May 2016)
  • 62. @teoliphant Example 2: Using Dask Bags on a cluster with text data 62 • Distributed natural language processing with text data stored in HDFS • Handles standard computations • Looks like other parallel frameworks (Spark, Hive, etc.) • Access data from HDFS, S3, local, etc. • Handles the common case (diagram: data partitions flow through mapped functions and merge into a result)
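A sketch of the bag version of such a computation (the HDFS path and record fields are hypothetical, and the earliest dask releases spelled read_text as from_filenames):

    import json
    import dask.bag as db

    records = db.read_text('hdfs:///logs/2016-*.json').map(json.loads)
    (records.filter(lambda d: d['lang'] == 'en')   # per-record predicates
            .pluck('text')
            .map(len)
            .mean()
            .compute())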
  • 63. @teoliphant Example 3: Using Dask Arrays with global temperature data 63 • Built from NumPy n-dimensional arrays (diagram: a Dask Array is a grid of NumPy arrays) • Matches NumPy interface (subset) • Solve medium-large problems • Complex algorithms
  • 64. @teoliphant Example 4: Using Dask Delayed to handle custom workflows 64 • Manually handle functions to support messy situations • Life saver when collections aren't flexible enough • Combine futures with collections for best of both worlds • Scheduler provides resilient and elastic execution
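A sketch of that pattern, reusing the load/clean/analyze shape from slide 51 (the function bodies and filenames are stand-ins):

    from dask import delayed

    @delayed
    def load(filename):
        return open(filename).read()          # stand-in for real loading

    @delayed
    def clean(data):
        return data.strip()                   # stand-in for real cleaning

    @delayed
    def analyze(pieces):
        return sum(len(p) for p in pieces)    # stand-in for the real analysis

    files = ['a.dat', 'b.dat', 'c.dat']       # hypothetical inputs
    result = analyze([clean(load(fn)) for fn in files])
    result.compute()                          # the scheduler runs the whole task graph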
  • 65. @teoliphant Scale Up vs Scale Out 65 (diagram: the Scale Up (bigger nodes) vs Scale Out (more nodes) quadrant: big memory & many cores / GPU box, many commodity nodes in a cluster, and the best of both (e.g. GPU cluster), now annotated with Numba, Dask, and Blaze)
  • 66. @teoliphant Contribute to the next stage of the journey 66 Numba, Blaze, and the Blaze ecosystem (dask, odo, dynd, datashape, data-fabric, and beyond…)