Machine learning derives mathematical models from raw data. In the model-building process, raw data is first processed into "features," and the features are then given to algorithms to train a model. The process of turning raw data into features is sometimes called feature engineering, and it is a crucial step in model building. Good features lead to successful models with strong predictive power; bad features lead to a lot of headaches and get you nowhere.
This talk aims to help the audience understand what a feature space is and why it is so important. We will go through some common feature space representations of English text and discuss which tasks they are suited for and why. Expect lots of pictures, whiteboard drawings, and handwaving. We will exercise our power of imagination to visualize high-dimensional feature spaces in our mind's eye. Presented by Alice Zheng, Director of Data Science at Dato.
6. The machine learning pipeline
Example raw data: "I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …"
Raw data → Features → Models → Predictions → Deploy in production
7. Three things to know about ML
• Feature = numeric representation of raw data
• Model = mathematical "summary" of features
• Making something that works = choosing the right model and features, given the data and task
13. Feature space in machine learning
• Raw data → high-dimensional vectors
• Collection of data points → point cloud in feature space
• Feature engineering = creating features of the appropriate granularity for the task
15. Crudely speaking, mathematicians fall into two categories:
the algebraists, who find it easiest to reduce all problems
to sets of numbers and variables, and the geometers, who
understand the world through shapes.
-- Masha Gessen, “Perfect Rigor”
25. When does bag-of-words fail?
[Figure: the documents "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen" plotted as points along the puppy, cat, and have axes]
Task: find a surface that separates documents about dogs vs. cats
Problem: the word "have" adds fluff instead of information
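A minimal sketch of the bag-of-words counts for these toy documents (assuming simple lowercasing and whitespace tokenization; this code is illustrative and not from the original slides):

```python
from collections import Counter

docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]

for doc in docs:
    # One dimension per distinct word; the value is the word's count.
    counts = Counter(doc.lower().split())
    # "have" is nonzero for every document (and 2 for the last one), so that
    # axis does nothing to help separate dog documents from cat documents.
    print(doc, "->", dict(counts))
```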
26. Improving on bag-of-words
• Idea: "normalize" word counts so that popular words are discounted
• Term frequency (tf) = number of times a term appears in a document
• Inverse document frequency of a word (idf) = log(N / number of documents containing the word)
• N = total number of documents
• Tf-idf count = tf x idf
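A short Python sketch of these definitions (function names are illustrative, and the log base is a free choice as long as it is used consistently):

```python
import math

def tf(term, doc_tokens):
    # Term frequency: number of times the term appears in the document.
    return doc_tokens.count(term)

def idf(term, corpus):
    # Inverse document frequency: log(N / number of documents containing the term).
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / n_containing)

def tf_idf(term, doc_tokens, corpus):
    # Tf-idf count = tf x idf.
    return tf(term, doc_tokens) * idf(term, corpus)
```

Here `corpus` is a list of tokenized documents; a term that appears in no document would divide by zero, so this sketch assumes queries come from the corpus vocabulary.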
27. From BOW to tf-idf
[Figure: the same bag-of-words point cloud on the puppy, cat, and have axes, annotated with idf weights]
With the four documents "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen":
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
28. From BOW to tf-idf
[Figure: the point cloud after tf-idf weighting, with the "have" dimension collapsed and a decision surface separating "I have a puppy" from "I have a cat"]
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
Tf-idf flattens uninformative dimensions in the BOW point cloud
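To see that flattening concretely, here is a small self-contained sketch (same toy corpus and formula as above; variable names are illustrative) that prints each document's tf-idf weights over the puppy/cat/have vocabulary, with the "have" dimension collapsing to 0:

```python
import math

docs = [
    "i have a puppy".split(),
    "i have a cat".split(),
    "i have a kitten".split(),
    "i have a dog and i have a pen".split(),
]
vocab = ["puppy", "cat", "have"]
n_docs = len(docs)

for doc in docs:
    vec = []
    for word in vocab:
        df = sum(1 for d in docs if word in d)             # documents containing the word
        weight = doc.count(word) * math.log(n_docs / df)   # tf-idf = tf x idf
        vec.append(round(weight, 3))
    # "have" appears in all 4 documents, so idf(have) = log(4/4) = 0 and that
    # dimension is flattened; "puppy" and "cat" keep a weight of log 4.
    print(" ".join(doc), "->", vec)
```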
29. That’s not all, folks!
• Geometry is the key to understanding feature space and
machine learning
• Many other fun topics:
- Feature normalization
- Feature transformations
- Model regularization
• Dato is hiring! jobs@dato.com
@RainyData, @DatoInc