Online statistical analysis using transducers and sketch algorithms. Don’t know what either is? You are going to learn something very cool (and perspective-changing) then. Know them, but want an experience report? Got you covered, fam.
3. Transducers at a glance
• Transducers decomplect the recursion mechanism, the transformation, the building of the output, and the access mechanism
• 3 user-facing “protocols”: xf, transducer, and CollReduce
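A minimal sketch of the decoupling described above, written in Python purely for illustration. The names `mapping`, `filtering`, `compose`, and `transduce` are hypothetical stand-ins chosen for this example (Clojure's own functions are `map`, `filter`, `comp`, and `transduce`), and the sketch leaves out early termination and completion steps. The point it tries to show: a transducer only wraps a reducing function, so the same transformation can be reused with any way of building the output.

```python
from functools import reduce


def mapping(f):
    """Transducer: apply f to each item before handing it downstream."""
    def transducer(rf):
        def step(acc, x):
            return rf(acc, f(x))
        return step
    return transducer


def filtering(pred):
    """Transducer: pass an item downstream only if pred(item) is true."""
    def transducer(rf):
        def step(acc, x):
            return rf(acc, x) if pred(x) else acc
        return step
    return transducer


def compose(*xforms):
    """Compose transducers so that data flows through them left to right."""
    def composed(rf):
        for xf in reversed(xforms):
            rf = xf(rf)
        return rf
    return composed


def transduce(xform, rf, init, coll):
    """Recursion mechanism: a plain left fold over any iterable."""
    return reduce(xform(rf), coll, init)


def append(acc, x):
    """Reducing function that builds a list."""
    acc.append(x)
    return acc


# The transformation knows nothing about the input or the output.
xf = compose(filtering(lambda x: x % 2 == 1), mapping(lambda x: x * x))

total = transduce(xf, lambda acc, x: acc + x, 0, range(10))  # 165
as_list = transduce(xf, append, [], range(10))  # [1, 9, 25, 49, 81]
```

The same `xf` is used once to sum and once to build a list; only the reducing function and the initial value change.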
7. Many batch algorithms can be turned into online ones
• Parallelize independent computations
• Find a recursive relation
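As one possible illustration of both ideas, the sketch below computes mean and variance online. It uses Welford's update as the recursive relation and the pairwise merge usually attributed to Chan et al. to combine independently computed partial results. This is a textbook choice made for this example, not code taken from kixi.stats, and the class name `Moments` is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def step(self, x):
        """Recursive relation: fold one more value into the state."""
        n = self.n + 1
        delta = x - self.mean
        mean = self.mean + delta / n
        m2 = self.m2 + delta * (x - mean)
        return Moments(n, mean, m2)

    def merge(self, other):
        """Combine two partial states computed on disjoint chunks."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    def variance(self):
        """Sample variance; undefined for fewer than two values."""
        return self.m2 / (self.n - 1) if self.n > 1 else math.nan


def summarize(values):
    """Single pass over the values, constant memory."""
    state = Moments()
    for x in values:
        state = state.step(x)
    return state


data = [2, 4, 4, 4, 5, 5, 7, 9]
whole = summarize(data)
# Two chunks processed independently, then merged.
merged = summarize(data[:3]).merge(summarize(data[3:]))
```

`whole` and `merged` agree up to floating-point rounding, which is what makes the computation safe to split across workers.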
8. github.com/MastodonC/kixi.stats
• Count
• (Arithmetic) mean
• Geometric mean
• Harmonic mean
• Median
• Variance
• Interquartile range
• Standard deviation
• Standard error
• Skewness
• Kurtosis
• Covariance
• Covariance matrix
• Correlation
• Correlation matrix
• Simple linear regression
• Standard error of the mean
• Standard error of the estimate
• Standard error of the prediction
• …
11. Annoyances
• Can only transduce one coll at a time
• Always have to pass in an xf
• Having functions that may or may not return a transducer is error-prone
15. Histogram construction
1. Pick a number of buckets K
2. For each incoming value:
   1. If a bucket for it exists, increment it
   2. Else, add a new bucket with count = 1
   3. If there are > K buckets, find the two closest adjacent buckets and merge them
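The steps above can be sketched in Python as follows. The slide does not say how two buckets are merged; this sketch assumes the merged bucket sits at the count-weighted mean of the two and carries the sum of their counts, which is a common choice for streaming histograms. The class and method names are hypothetical.

```python
from bisect import bisect_left


class StreamingHistogram:
    def __init__(self, max_buckets):
        if max_buckets < 1:
            raise ValueError("max_buckets must be at least 1")
        self.max_buckets = max_buckets  # K
        self.centroids = []  # bucket positions, kept sorted
        self.counts = []     # counts[i] belongs to centroids[i]

    def add(self, x):
        i = bisect_left(self.centroids, x)
        if i < len(self.centroids) and self.centroids[i] == x:
            # A bucket for this exact value exists: increment it.
            self.counts[i] += 1
            return
        # Otherwise add a new bucket with count = 1, keeping the order.
        self.centroids.insert(i, x)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_buckets:
            self._merge_closest()

    def _merge_closest(self):
        # Find the pair of neighbouring buckets with the smallest gap.
        gaps = [b - a for a, b in zip(self.centroids, self.centroids[1:])]
        j = gaps.index(min(gaps))
        c1, c2 = self.counts[j], self.counts[j + 1]
        # Assumption: merge at the count-weighted mean of the two buckets.
        merged = (self.centroids[j] * c1 + self.centroids[j + 1] * c2) / (c1 + c2)
        self.centroids[j:j + 2] = [merged]
        self.counts[j:j + 2] = [c1 + c2]

    def buckets(self):
        return list(zip(self.centroids, self.counts))


hist = StreamingHistogram(max_buckets=3)
for value in [1, 2, 10, 11, 20, 2]:
    hist.add(value)
result = hist.buckets()
```

Memory stays bounded by K buckets regardless of how many values arrive, at the cost of bucket positions being approximate once merges have happened.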
28. Takeaways
• Transducers are not only performant but also a good modularization protocol
• You don’t realise how often you want a distribution until you have it readily available
• Often approximations are good enough
• You can get surprisingly far on a single machine