SciPy 2013 talk on simple Python patterns to efficiently process large datasets.
The talk focuses on the patterns and the concepts rather than on the implementations, which can be found in the joblib and scikit-learn codebases.
Processing biggish data on commodity hardware: simple Python patterns
1. Processing biggish data
on commodity hardware
Simple Python patterns
Gaël Varoquaux INRIA/Parietal – Neurospin
Disclaimer: I’m French, I have opinions
We’re in Texas, I hope y’all have left your guns outside
Yeah, I know, Texas is bigger than France
3. My tools
Python, what else? + Numpy
+ Scipy
The ndarray is underused
by the data community
4. My tools
Python, what else? Patterns in this presentation:
scikit-learn
Machine learning in Python
joblib
Using Python functions as
pipeline jobs
5. Design philosophy
1. Fail gracefully
Easy to debug. Robust to errors.
2. Don’t solve hard problems
The original problem can be bent.
3. Dependencies suck
Distribution is an age-old problem.
4. Performance matters
Waiting kills productivity.
7. Processing big data
Speed ups in Hadoop, CPUs...
Execution pipelines
dataflow programming
parallel computing
Data access
storing
caching
Pipelines can get messy
Databases are tedious
8. 5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
9. Big how?
2 scenarios:
Many observations – samples
e.g. Twitter
Many descriptors per observation – features
e.g. brain scans
11. 1 On the fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
12. 1 Dropping data
The number one technique used to handle large datasets
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Performance tip: run the loop in parallel
Exploits redundancy across observations
Great when the number of samples is large
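A minimal sketch of this loop, with a hypothetical `subsample_and_aggregate` helper and a toy `estimate` callable standing in for the real algorithm:

import numpy as np

def subsample_and_aggregate(data, estimate, n_runs=10, fraction=0.1, seed=0):
    rng = np.random.RandomState(seed)
    n_samples = len(data)
    n_kept = int(fraction * n_samples)
    # 1. take a random fraction of the data
    # 2. run the algorithm on that fraction
    results = [estimate(data[rng.permutation(n_samples)[:n_kept]])
               for _ in range(n_runs)]
    # 3. aggregate results across sub-samplings
    return np.mean(results, axis=0)

# Toy usage: estimate column means from 10% sub-samples
data = np.random.normal(size=(100000, 50))
means = subsample_and_aggregate(data, lambda d: d.mean(axis=0))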
13. 1 Dimension reduction
Often individual features are low SNR
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast (sub-optimal) clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing, when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
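A minimal sketch of the first and last techniques; the sizes and inputs here are illustrative assumptions:

import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.feature_extraction.text import HashingVectorizer

# Random projections: 10000 features down to 300 random combinations
X = np.random.normal(size=(1000, 10000))
X_small = SparseRandomProjection(n_components=300).fit_transform(X)

# Hashing: stateless, so usable in parallel, on observations of varying size
docs = ['processing biggish data', 'simple python patterns']
X_hashed = HashingVectorizer().transform(docs)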
14. 1 An example: randomized SVD
sklearn.utils.extmath.randomized_svd
One random projection + power iterations
import numpy as np
from scipy import linalg
from scipy.sparse import linalg as splinalg
from sklearn.utils.extmath import randomized_svd

X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
18. 2 Convergence: statistics and speed
If the data are i.i.d., converges to expectations
Mini-batch = a bunch of observations
Trade-off between memory usage and vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))

scipy.cluster.vq.kmeans(X, 10, iter=2)
11.33 s

sklearn.cluster.MiniBatchKMeans(n_clusters=10,
                                n_init=2).fit(X)
0.62 s
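The same estimator can also consume a stream chunk by chunk through `partial_fit`, keeping a single mini-batch in memory at a time; a minimal sketch, with synthetic chunks standing in for a real data source:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10)
for _ in range(100):
    # In real use, each chunk would be read from disk or a socket
    chunk = np.random.normal(size=(1000, 200))
    km.partial_fit(chunk)  # update cluster centers with one mini-batch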
22. 3 Parallel processing patterns
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
On grids: distributed storage
The right grain of parallelism
Too fine ⇒ overhead
Too coarse ⇒ memory shortage
Scale by the relevant cache pool
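The corresponding joblib idiom, sketched with a toy function standing in for the real per-item work:

from joblib import Parallel, delayed

def costly_computation(i):
    return i ** 2  # stand-in for the real workload

# Embarrassingly parallel for loop: one task per iteration,
# dispatched to a pool of worker processes
results = Parallel(n_jobs=2)(
    delayed(costly_computation)(i) for i in range(10))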
23. 3 Queues – the magic behind joblib.Parallel
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
⇒ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib
⇒ Back and forth communication
Door open to race conditions
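In the joblib API this surfaces as the `pre_dispatch` argument of `Parallel`: with a generator as input, only a bounded number of tasks is instantiated ahead of the workers. A sketch:

from joblib import Parallel, delayed

def lazy_inputs():
    # A generator: items are built only when dispatched
    for i in range(1000):
        yield i

# pre_dispatch bounds the dispatch queue, so it fills up "slowly"
results = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(abs)(-i) for i in lazy_inputs())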
24. 3 What happens where: grand-central dispatch?
joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand-central dispatch design: dispatch queue has
a process of its own
Benefit: resource management in nested for loops
25. 4 Caching
For reproducible science:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
26. 4 The joblib approach
The memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a) # computes b by calling f(a)
c = g(a) # retrieves the result from the store
Challenges in the context of big data
a & b are big
Design goals
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code for caching
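A runnable sketch of the pattern above (`cachedir` is the keyword from the joblib of that era; recent releases spell it `location`):

import numpy as np
import joblib

mem = joblib.Memory(cachedir='.')

@mem.cache  # the same memoize pattern, as a decorator
def f(a):
    print('computing')
    return a ** 2

a = np.arange(10)
b = f(a)  # computes, prints 'computing'
c = f(a)  # silent: result retrieved from the on-disk store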
27. 4 Efficient input argument hashing – joblib.hash
Compute md5 of input arguments
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
= state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass data pointer to md5 algorithm
(“update” method)
- the rest: pickle
4. Update the md5 with the pickle
md5 is in the Python standard library
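Usage is a single call; the hash depends on content, not object identity:

import numpy as np
import joblib

a = np.random.normal(size=(1000, 1000))
key = joblib.hash(a)  # hexdigest string, usable as a store key
assert joblib.hash(a.copy()) == key  # same content, same hash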
28. 4 Fast, disk-based, concurrent, store – joblib.dump
Persisting arbitrary objects
Once again sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (cluster)
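Basic usage, sketched; large arrays inside the object land in separate files next to the main pickle:

import numpy as np
import joblib

obj = {'data': np.random.normal(size=(1000, 1000)), 'label': 'noise'}
filenames = joblib.dump(obj, 'obj.pkl')  # may write several files
obj2 = joblib.load('obj.pkl')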
29. 5 Fast I/O
Fast read-outs, for out-of-core computing
30. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
Chunk data for access patterns (pytables)
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress needs C-contiguous buffers
Store raw buffer + meta-information (strides, class...)
- use __reduce__
- rebuild: np.core.multiarray._reconstruct
not in pytables
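A sketch of the buffer-level idea with the standard library alone; the array must be C-contiguous, and the meta-information is kept alongside the bytes to rebuild it:

import zlib
import numpy as np

a = np.random.normal(size=(1000, 1000))  # C-contiguous
compressed = zlib.compress(a.data, 1)  # compress the raw buffer, no gzip framing
meta = (a.dtype, a.shape)  # what __reduce__ would record

b = np.frombuffer(zlib.decompress(compressed),
                  dtype=meta[0]).reshape(meta[1])
assert (a == b).all()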
31. 5 Benchmarking against np.save and pytables
[Benchmark figure: y-axis scale: 1 is np.save; NeuroImaging data (MNI atlas)]
32. @GaelVaroquaux
Summing up
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
33. @GaelVaroquaux
Cost of complexity underestimated
Know your problem
& solve it with simple primitives
Python modules
scikit-learn: machine learning
joblib: pipeline-ish patterns
Come work with me!
Positions available