SciPy 2013 talk on simple Python patterns to efficiently process large datasets.
The talk focuses on the patterns and the concepts rather than on the implementations, which can be found in the joblib and scikit-learn codebases.
Processing biggish data on commodity hardware: simple Python patterns
1. Processing biggish data
on commodity hardware
Simple Python patterns
Gaël Varoquaux INRIA/Parietal – Neurospin
Disclaimer: I’m French, I have opinions
We’re in Texas, I hope y’all have left your guns outside
Yeah, I know, Texas is bigger than France
3. My tools
Python, what else? + Numpy
+ Scipy
The ndarray is underused
by the data community
4. My tools
Python, what else? Patterns in this presentation:
scikit-learn
Machine learning in Python
joblib
Using Python functions as
pipeline jobs
5. Design philosophy
1. Fail gracefully
Easy to debug. Robust to errors.
2. Don’t solve hard problems
The original problem can be bent.
3. Dependencies suck
Distribution is an age-old problem.
4. Performance matters
Waiting kills productivity.
7. Processing big data
Speed ups in Hadoop, CPUs...
Execution pipelines
dataflow programming
parallel computing
Data access
storing
caching
Pipelines can get messy
Databases are tedious
8. 5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
9. Big how?
2 scenarios:
Many observations – samples
e.g. Twitter
Many descriptors per observation – features
e.g. brain scans
11. 1 On the fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
12. 1 Dropping data
The number one technique used to handle large datasets
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Performance tip: run the loop in parallel
Exploits redundancy across observations
Great when the number of samples is large
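A minimal sketch of this loop, with a hypothetical `subsample_and_aggregate` helper and a toy `estimate` callable standing in for the real algorithm:

import numpy as np

def subsample_and_aggregate(data, estimate, n_runs=10, fraction=0.1, seed=0):
    rng = np.random.RandomState(seed)
    n_samples = len(data)
    n_kept = int(fraction * n_samples)
    # 1. take a random fraction of the data
    # 2. run the algorithm on that fraction
    results = [estimate(data[rng.permutation(n_samples)[:n_kept]])
               for _ in range(n_runs)]
    # 3. aggregate results across sub-samplings
    return np.mean(results, axis=0)

# Toy usage: estimate column means from 10% sub-samples
data = np.random.normal(size=(100000, 50))
means = subsample_and_aggregate(data, lambda d: d.mean(axis=0))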
13. 1 Dimension reduction
Often individual features are low SNR
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast (sub-optimal) clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing, when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
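A minimal sketch of the first and last techniques; the sizes and inputs here are illustrative assumptions:

import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.feature_extraction.text import HashingVectorizer

# Random projections: 10000 features down to 300 random combinations
X = np.random.normal(size=(1000, 10000))
X_small = SparseRandomProjection(n_components=300).fit_transform(X)

# Hashing: stateless, so usable in parallel, on observations of varying size
docs = ['processing biggish data', 'simple python patterns']
X_hashed = HashingVectorizer().transform(docs)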
14. 1 An example: randomized SVD
sklearn.utils.extmath.randomized_svd
One random projection + power iterations
import numpy as np
from scipy import linalg
from scipy.sparse import linalg as splinalg
from sklearn.utils.extmath import randomized_svd

X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
18. 2 Convergence: statistics and speed
If the data are i.i.d., converges to expectations
Mini-batch = a bunch of observations
Trade-off between memory usage and vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))

scipy.cluster.vq.kmeans(X, 10, iter=2)
11.33 s

sklearn.cluster.MiniBatchKMeans(n_clusters=10,
                                n_init=2).fit(X)
0.62 s
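The same estimator can also consume a stream chunk by chunk through `partial_fit`, keeping a single mini-batch in memory at a time; a minimal sketch, with synthetic chunks standing in for a real data source:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10)
for _ in range(100):
    # In real use, each chunk would be read from disk or a socket
    chunk = np.random.normal(size=(1000, 200))
    km.partial_fit(chunk)  # update cluster centers with one mini-batch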
22. 3 Parallel processing patterns
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
On grids: distributed storage
The right grain of parallelism
Too fine ⇒ overhead
Too coarse ⇒ memory shortage
Scale by the relevant cache pool
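The corresponding joblib idiom, sketched with a toy function standing in for the real per-item work:

from joblib import Parallel, delayed

def costly_computation(i):
    return i ** 2  # stand-in for the real workload

# Embarrassingly parallel for loop: one task per iteration,
# dispatched to a pool of worker processes
results = Parallel(n_jobs=2)(
    delayed(costly_computation)(i) for i in range(10))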
23. 3 Queues – the magic behind joblib.Parallel
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
⇒ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib
⇒ Back and forth communication
Door open to race conditions
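In the joblib API this surfaces as the `pre_dispatch` argument of `Parallel`: with a generator as input, only a bounded number of tasks is instantiated ahead of the workers. A sketch:

from joblib import Parallel, delayed

def lazy_inputs():
    # A generator: items are built only when dispatched
    for i in range(1000):
        yield i

# pre_dispatch bounds the dispatch queue, so it fills up "slowly"
results = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(abs)(-i) for i in lazy_inputs())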
24. 3 What happens where: grand-central dispatch?
joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand-central dispatch design: dispatch queue has
a process of its own
Benefit: resource management in nested for loops
25. 4 Caching
For reproducible science:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
26. 4 The joblib approach
The memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a) # computes b by calling f(a)
c = g(a) # retrieves the result from the store
Challenges in the context of big data
a & b are big
Design goals
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code for caching
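A runnable sketch of the pattern above (`cachedir` is the keyword from the joblib of that era; recent releases spell it `location`):

import numpy as np
import joblib

mem = joblib.Memory(cachedir='.')

@mem.cache  # the same memoize pattern, as a decorator
def f(a):
    print('computing')
    return a ** 2

a = np.arange(10)
b = f(a)  # computes, prints 'computing'
c = f(a)  # silent: result retrieved from the on-disk store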
27. 4 Efficient input argument hashing – joblib.hash
Compute md5 of input arguments
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
= state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass data pointer to md5 algorithm
(“update” method)
- the rest: pickle
4. Update the md5 with the pickle
md5 is in the Python standard library
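Usage is a single call; the hash depends on content, not object identity:

import numpy as np
import joblib

a = np.random.normal(size=(1000, 1000))
key = joblib.hash(a)  # hexdigest string, usable as a store key
assert joblib.hash(a.copy()) == key  # same content, same hash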
28. 4 Fast, disk-based, concurrent, store – joblib.dump
Persisting arbitrary objects
Once again sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (cluster)
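Basic usage, sketched; large arrays inside the object land in separate files next to the main pickle:

import numpy as np
import joblib

obj = {'data': np.random.normal(size=(1000, 1000)), 'label': 'noise'}
filenames = joblib.dump(obj, 'obj.pkl')  # may write several files
obj2 = joblib.load('obj.pkl')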
29. 5 Fast I/O
Fast read-outs, for out-of-core computing
30. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
Chunk data for access patterns (pytables)
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress needs C-contiguous buffers
Store raw buffer + meta-information (strides, class...)
- use __reduce__
- rebuild: np.core.multiarray._reconstruct
not in pytables
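A sketch of the buffer-level idea with the standard library alone; the array must be C-contiguous, and the meta-information is kept alongside the bytes to rebuild it:

import zlib
import numpy as np

a = np.random.normal(size=(1000, 1000))  # C-contiguous
compressed = zlib.compress(a.data, 1)  # compress the raw buffer, no gzip framing
meta = (a.dtype, a.shape)  # what __reduce__ would record

b = np.frombuffer(zlib.decompress(compressed),
                  dtype=meta[0]).reshape(meta[1])
assert (a == b).all()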
31. 5 Benchmarking against np.save and pytables
[Benchmark figure: y-axis scale: 1 is np.save; NeuroImaging data (MNI atlas)]
32. @GaelVaroquaux
Summing up
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
33. @GaelVaroquaux
Cost of complexity underestimated
Know your problem
& solve it with simple primitives
Python modules
scikit-learn: machine learning
joblib: pipeline-ish patterns
Come work with me!
Positions available