2. Preliminaries
• Code is available from GitHub:
– git@github.com:tdunning/Chapter-16.git
• EC2 instances available
• Thumb drives also available
• Email to ted.dunning@gmail.com
• Twitter @ted_dunning
3. A Quick Review
• What is classification?
– goes-ins: predictors
– goes-outs: target variable
• What is classifiable data?
– continuous, categorical, word-like, text-like
– uniform schema
• How do we convert from classifiable data to a feature vector?
5. Classifiable Data
• Continuous
– A number that represents a quantity, not an id
– Blood pressure, stock price, latitude, mass
• Categorical
– One of a known, small set (color, shape)
• Word-like
– One of a possibly unknown, possibly large set
• Text-like
– Many word-like things, usually unordered
6. But that isn’t quite there
• Learning algorithms need feature vectors
– Have to convert from data to vector
• Can assign one location per feature
– or category
– or word
• Can assign one or more locations with hashing
– scary
– but safe on average
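The hashing option can be illustrated with a minimal sketch. It assumes a fixed vector size and two hashed locations ("probes") per feature; the function names, the record layout, and the default sizes are hypothetical choices for illustration, not taken from the talk's code.

```python
import hashlib


def _slot(key: str, probe: int, size: int) -> int:
    """Map a (key, probe) pair to a stable location in the vector."""
    digest = hashlib.md5(f"{key}#{probe}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def encode(record: dict, size: int = 32, probes: int = 2) -> list:
    """Encode a record into a fixed-size vector using hashing.

    Assumption for this sketch: numbers are continuous values, strings are
    categorical or word-like values, and lists of strings are text-like.
    """
    vec = [0.0] * size
    for name, value in record.items():
        if isinstance(value, (int, float)):
            # Continuous: the location depends on the field name only.
            for p in range(probes):
                vec[_slot(name, p, size)] += float(value)
        elif isinstance(value, str):
            # Categorical or word-like: location depends on name and value.
            for p in range(probes):
                vec[_slot(f"{name}={value}", p, size)] += 1.0
        else:
            # Text-like: an unordered bag of word-like values.
            for word in value:
                for p in range(probes):
                    vec[_slot(f"{name}={word}", p, size)] += 1.0
    return vec


vector = encode({"mass": 2.5, "color": "red", "body": ["cheap", "pills"]})
```

Collisions simply add together, which is the "scary" part; using more than one probe per feature makes it unlikely that two features share all of their locations.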
15. Generating new features
• Sometimes the existing features are difficult to use
• Restating the geometry using new reference points may help
• Automatic reference points using k-means can be better than manual references
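As a rough sketch of the idea, the code below finds reference points with a plain Lloyd-style k-means and then restates each point as its distances to those reference points. The function names are hypothetical, and passing the initial centroids in explicitly is an assumption made to keep the example deterministic.

```python
import math


def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial, iterations=10):
    """Plain k-means starting from caller-supplied initial centroids."""
    centroids = [tuple(c) for c in initial]
    k = len(centroids)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids


def reference_features(point, centroids):
    """Restate a point as its distance to each reference point."""
    return [dist(point, c) for c in centroids]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, initial=[(0, 0), (10, 10)])
new_features = [reference_features(p, centroids) for p in points]
```

The new features can be used in place of, or alongside, the original coordinates when training a model.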
19. Integration Issues
• Feature extraction is ideal for map-reduce
– Side data adds some complexity
• Clustering works great with map-reduce
– Cluster centroids to HDFS
• Model training works better sequentially
– Need centroids in normal files
• Model deployment shouldn’t depend on HDFS
22. Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Slide annotations: "Read from HDFS to local disk by distributed cache"; "Read from local disk from distributed cache"; "Written by map-reduce"]
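The mapper, combiner, and reducer steps on this slide can be sketched in plain Python, with no Hadoop involved. The function names and the in-memory "splits" are hypothetical stand-ins for the real job; because each value carries (n, sum/n), the same function can serve as both combiner and reducer.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((x - c) ** 2 for x, c in zip(point, centroids[i])),
    )


def mapper(point, centroids):
    """Assign the point to a cluster and emit (cluster id, (1, point))."""
    return nearest(point, centroids), (1, tuple(point))


def combine(values):
    """Sum the counts and take the count-weighted mean: emits (n, sum/n)."""
    total = sum(n for n, _ in values)
    dims = len(values[0][1])
    mean = tuple(sum(n * p[d] for n, p in values) / total for d in range(dims))
    return total, mean


def kmeans_step(splits, centroids):
    """One k-means iteration: map each split, combine locally, then reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            cid, value = mapper(point, centroids)
            local[cid].append(value)
        for cid, values in local.items():
            shuffled[cid].append(combine(values))  # combiner output
    return {cid: combine(values) for cid, values in shuffled.items()}


splits = [[(0, 0), (0, 2), (10, 10)], [(2, 0), (10, 12)]]
result = kmeans_step(splits, centroids=[(1, 1), (9, 9)])
```

In a real job the centroids would be side data read by every mapper, which is the part the slide annotations are about.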
23. Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, 1, point
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, n, sum/n
• Output to HDFS
[Slide annotations: "Read from NFS"; "Written by map-reduce"; "MapR FS"]
24. Modeling architecture
[Diagram labels: Input; Feature extraction and down sampling; Data join; Side-data (now via NFS); Sequential SGD learning; Map-reduce]
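The sequential SGD learning stage can be sketched as follows. The slide does not name the model, so logistic regression is an assumption here, and the function names, learning rate, and number of passes are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, clamped to avoid overflow for extreme scores."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Estimated probability that the target is 1 for feature vector x."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


def sgd_train(examples, dims, rate=0.1, passes=50):
    """Train logistic regression sequentially, one example at a time.

    `examples` is a sequence of (feature vector, 0-or-1 target) pairs.
    """
    weights = [0.0] * dims
    for _ in range(passes):
        for x, y in examples:
            error = y - predict(weights, x)
            for i, v in enumerate(x):
                weights[i] += rate * error * v
    return weights


examples = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
weights = sgd_train(examples, dims=1)
```

Each update touches only one example, so the learner needs its input as an ordinary sequential stream, such as a normal file, and does not need map-reduce.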