Producing highly accurate predictive models in social data mining can be a challenge. Feature engineering using traditional methodologies can only take you so far. Finding the needle in a haystack when the subject matter is too domain-specific or prone to ambiguity can require large investments to achieve accurate results. In this presentation we discuss methodologies used by Toyota’s Research and Development Data Science Team and share secrets of building highly accurate predictive models for social data using innovative feature engineering techniques applied on the Apache Spark and MLlib platform.
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota
1. TOYOTA Machine Learning on Apache Spark™
FEATURE ENGINEERING
Brian Kursar, Sr Data Scientist
Toyota Motor Sales IT Research and Development
Big Data Day LA 2015
4. TOYOTA Big Data History
2015
2014
2013
2012
2011
2010
• C360 - Next Gen Insights Platform: Over 6B Records
• C360 - Customer Experience Analytics: Over 700M Records
• C360 - Toyota Social Media Intelligence Center: Over 500M Records
• Product Quality Analytics v2: Over 120M Records
• Marketing and Incentives Analytics: 70M Records
• Product Quality Analytics: Over 60M Records
6. Existing Tools
Jan – Feb Feb - Mar
+1% +1%
+1% +2%
Toyota Social Opinion 2013
WHY?
It doesn’t give you the “Why”
7. Toyota Online Conversations by the Numbers (2014 Study)
40% Retailers Selling Toyota Vehicles
11% Opinions on Marketing Campaigns
10% Feedback on Dealer Sales and Service Experiences
9% Opinions on Product Styling and Features
8% People In Market for a Toyota
8% Incident Reports Involving a Toyota Vehicle
7% Feedback on Product Quality
5% Customers Advocating for the Brand
2% Completely Irrelevant
8. Toyota Online Conversations by the Numbers
50%
Noise
9. Problem Statement
Millions of Social Media posts a day and not enough Resources to read them all
10. Technology Opportunities
Categorize and Prioritize incoming Social Media interactions in Real-Time using Machine Learning to provide ACTIONABLE INSIGHTS
• Campaign Opinions
• Customer Feedback
• Product Feedback
• Noise
11. Is this the image that executives at your company have when they hear the words “Machine Learning”? If so, help make it relatable.
12. First Spark MLlib Experiment
• Seat Cover Wrinkles/Cracking
• Brake Noise
• Shift Quality
• Oil Leaks
• HVAC Odor
• Dead Battery
• Rodent Wire Harness Damage
• Paint Chips
Time-box project to 12 Weeks
Classify at min 80% accuracy
13. How does one find training data for a noise they have never personally experienced?
14. Noise, Brakes
If it were as easy as using the keywords “Brake” and “Noise”, then why bother using Machine Learning?
15. Where can we find categorized specific Product Quality
Concerns reported by Customers to use as Training Data?
Challenge
16. Mechanical Turk
The Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace
that enables individuals and businesses (known as Requesters) to coordinate
the use of human intelligence to perform tasks that computers are currently
unable to do.
17. Mechanical Turk
• Define: Define Social data with Keywords “Brake” AND Toyota Product line (e.g. Prius, Camry, Avalon)
• Create: Create Human Intelligence Tasks: “Is this comment indicative of a problem concerning Brake Noise?” Y / N
• Train: Leverage results from Mechanical Turk as positive and negative training sets.
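The Define and Train steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the product-line tuple holds just the three models named on the slide, and the `(text, "Y"/"N")` answer format is a hypothetical stand-in for whatever Mechanical Turk actually returned.

```python
import re

# Assumption: only the three product lines named on the slide.
PRODUCT_LINES = ("prius", "camry", "avalon")


def is_candidate(post):
    """Define step: keep posts that mention 'brake' AND a Toyota product line."""
    words = set(re.findall(r"[a-z]+", post.lower()))
    has_brake = any(word.startswith("brake") for word in words)
    has_model = any(word in PRODUCT_LINES for word in words)
    return has_brake and has_model


def split_by_label(answers):
    """Train step: turn (text, 'Y'/'N') worker answers into two training sets."""
    positives = [text for text, label in answers if label == "Y"]
    negatives = [text for text, label in answers if label == "N"]
    return positives, negatives
```

A keyword filter like this only narrows the pool sent to workers; the Y/N answers still decide what counts as a positive example.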
18. When I’m backing up in my 2012 Prius, it sounds like
something hanging up or scraping as it rotates and only
happens in the morning..
I hear a squeak coming from the back wheel of
my Prius as I pull out from my driveway in the
morning.
True positives missed by workers in labeling due to lack of
experience with the problem they are looking to identify.
21. Social ML Pipeline
• Training Data: Hand Labeled Social, Labeled Internal Survey Responses
• Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words
• Featurizer: Chi-Square Feature Selector; All, Random, Top Popular, Top Random; N-Grams
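For readers unfamiliar with chi-square feature selection, here is a minimal pure-Python sketch of the idea: score each term by how strongly its presence depends on the label, then keep the top scorers. It is not the talk's Spark implementation; the function names and the toy binary-label setup are assumptions made for illustration.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term with the 2x2 chi-square statistic.

    docs: list of token lists; labels: list of 0/1 class labels.
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            (pos_df if label else neg_df)[term] += 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # positive docs containing the term
        b = neg_df[term]   # negative docs containing the term
        c = n_pos - a      # positive docs without the term
        d = n_neg - b      # negative docs without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

A term that appears in every document, such as "brake" in a corpus already filtered on that keyword, scores zero and drops out.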
23. Social ML Pipeline
• Training Data: Hand Labeled Social, Labeled Survey Responses
• Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS
• Featurizer: Chi-Square Feature Selector; All, Random, Top Popular, Top Random; N-Grams
• Support Vector Machines: Training Selection Filters; Training Set (9 Fold), Validation Set (1 Fold)
• Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
• Features
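The TF and IDF option names match the common textbook weighting variants, so a small sketch may help. The slide does not give formulas, so the definitions below are assumptions: in particular "Inv Smooth" is taken as log(1 + N/df) and "ProbDF" as max(0, log((N - df)/df)), which are only one of several conventions in use.

```python
import math


def tf_weight(count, max_count, scheme="natural"):
    """Term-frequency weight for one term in one document."""
    if count == 0:
        return 0.0
    if scheme == "natural":
        return float(count)
    if scheme == "boolean":
        return 1.0
    if scheme == "log":
        return 1.0 + math.log(count)
    if scheme == "augmented":
        # Scaled by the most frequent term in the same document.
        return 0.5 + 0.5 * count / max_count
    raise ValueError("unknown TF scheme: " + scheme)


def idf_weight(n_docs, doc_freq, scheme="inverse"):
    """Inverse-document-frequency weight for one term (doc_freq >= 1)."""
    if scheme == "unary":
        return 1.0
    if scheme == "inverse":
        return math.log(n_docs / doc_freq)
    if scheme == "inv_smooth":
        return math.log(1.0 + n_docs / doc_freq)
    if scheme == "prob":
        if doc_freq >= n_docs:
            return 0.0
        return max(0.0, math.log((n_docs - doc_freq) / doc_freq))
    raise ValueError("unknown IDF scheme: " + scheme)


def tf_idf(count, max_count, n_docs, doc_freq, tf="natural", idf="inverse"):
    """Combine one TF option with one IDF option."""
    return tf_weight(count, max_count, tf) * idf_weight(n_docs, doc_freq, idf)
```

Treating the two option lists as independent switches gives sixteen combinations to try per model.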
25. Social ML Pipeline
• Training Data: Hand Labeled Social, Labeled Survey Responses
• Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS
• Featurizer: Chi-Square Feature Selector; All, Random, Top Popular, Top Random; N-Grams
• Support Vector Machines: Training Selection Filters; Training Set (9 Fold), Validation Set (1 Fold)
• Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
• Features
• Model
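The "9 Fold" training set and "1 Fold" validation set read like standard 10-fold cross-validation, although the slides do not say so explicitly. Under that assumption, a minimal sketch of the split looks like this (the function name is hypothetical, and no shuffling or stratification is attempted):

```python
def k_fold_splits(items, k=10):
    """Yield (training, validation) pairs: k-1 folds to train, 1 to validate."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation
```

Each labeled example lands in the validation fold exactly once, so the averaged accuracy uses all of the training data.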
30. False Positives
I just had my friend at the toyota dealer rotate my tires and he
said … that the brake pads are getting thin really fast.
So what should I do when they get too thin in the future and start
to squeak?
31. False Positives
i cut the iac hose as shown in figure 20 in the manual but when i
start the car, it started gasping for air... choking...
sounds like it's about to die out.
i bought the power brake check valve (80190 part for
kragen)... but either i'm not installing it right or it's the wrong
size... i have no idea.
36. Measure Distance Similarity Between Concepts
• Noise: squeak, grind, groan, squeal
• Brakes: pads, caliper, rotor, wheel, drum
Distance is calculated between concepts based on the Minkowski distance formula.
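For reference, the Minkowski distance of order p between two vectors is the p-th root of the sum of |x_i - y_i|^p. The sketch below assumes each concept has already been mapped to a numeric vector; how the talk derives those vectors is not stated, so the inputs here are purely hypothetical.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """Special case p = 1 (L1)."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Special case p = 2 (L2)."""
    return minkowski(x, y, 2)
```

Setting p to 1 or 2 recovers the Manhattan and Euclidean distances, so a single function covers both.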
41. Social ML Pipeline
• Training Data: Hand Labeled Social, Labeled Survey Responses
• Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words
• Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
• Featurizer: Chi-Square Feature Selector; All, Random, Top Popular, Top Random; N-Grams
• Support Vector Machines: Training Selection Filters; Training Set (9 Fold), Validation Set (1 Fold)
• Model
• Synonyms, POS
• Distance Similarity
  Euclidean Distance (L2): √(|x|^2 + |y|^2)
  Manhattan Distance (L1): |x| + |y|
• Concept Manager
• Concept Interpreter
• Negation Evaluator
44. TO STEM OR NOT TO STEM
noun verb
When does stemming cause your entities to change?
45. TO STEM OR NOT TO STEM
noun noun
Will stemming produce false positives?
46. NEGATION
Negation: to contradict or deny something.
Great work done in medical document retrieval can be leveraged. How different is a sick person from a sick car?
47. NEGATION
• absence
• did not
• didn't
• doesn't
• isn’t
• neither
• never
• no
• none
• without
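One simple way to use a cue list like this is to tag the few tokens that follow a cue, so that "did not squeak" and "squeak" become different features. The sketch below is an assumption about how a negation evaluator could work, not the talk's implementation; the three-token window and the `NOT_` prefix are arbitrary illustrative choices.

```python
import re

# Cue list taken from the slide; "did not" is the only two-word cue.
SINGLE_CUES = {"absence", "didn't", "doesn't", "isn't", "neither",
               "never", "no", "none", "without"}
TWO_WORD_CUES = {("did", "not")}


def mark_negated(text, window=3):
    """Prefix up to `window` tokens after a negation cue with NOT_."""
    tokens = re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))
    marked = []
    remaining = 0  # how many upcoming tokens are still in negation scope
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in TWO_WORD_CUES:
            marked.extend(tokens[i:i + 2])
            remaining = window
            i += 2
            continue
        token = tokens[i]
        if token in SINGLE_CUES:
            marked.append(token)
            remaining = window
        elif remaining > 0:
            marked.append("NOT_" + token)
            remaining -= 1
        else:
            marked.append(token)
        i += 1
    return marked
```

With this tagging, a post saying the brakes never squeak no longer shares the plain "squeak" feature with a genuine complaint.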
48. N-GRAMS AND STOP WORDS
I a am an and any my no the took
Generic definition - the most common words in a language.
• For some models, Unigrams (single words) can be problematic because they lack an adjacent term that may assist in disambiguation or could indicate a voice.
• It helps to keep possessives in the feature set rather than removing them via a stop list, especially for Twitter data. They help identify a “Voice” versus marketing noise.
49. N-GRAMS AND STOP LISTS
• Each Model should have its own carefully selected stop words list
• Utilize unimportant entities in Stop Words Lists
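A short sketch shows why a per-model stop list matters for bigrams. The generic list below is the sample shown on the stop-words slide; the brake-noise list that keeps "my" and "no" is a hypothetical example, not the list used in the talk.

```python
# Sample generic stop words, as shown on the stop-words slide.
GENERIC_STOP_WORDS = {"i", "a", "am", "an", "and", "any", "my", "no", "the", "took"}

# Hypothetical per-model list: keep "my" (voice) and "no" (negation).
BRAKE_NOISE_STOP_WORDS = GENERIC_STOP_WORDS - {"my", "no"}


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text, stop_words, n=2):
    """Lowercase, drop stop words, then extract n-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return ngrams(tokens, n)
```

With the generic list, "my brakes squeak" yields only "brakes squeak"; with the per-model list it also yields "my brakes", which hints at an owner speaking rather than an advertisement.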
50. CONTEXT IS KING
• Not all Social Media Interactions are equal
• Some offer better Metadata than others
• Leverage relevant Metadata as features, feature weights, or filters
{
  "interaction": {
    "source": "web",
    "author": {
      "username": "johndoe",
      "name": "John Doe",
      "id": 10750902,
      "parentid": 10750901,
      "avatar": "http://a0.twimg.com/profile_images/1111111111/example.jpeg",
      "link": "http://twitter.com/johndoe"
    },
    "type": "reply",
    "created_at": "Fri, 17 Aug 2012 14:13:08 +0000",
    "content": "I like ice cream!",
    "id": "1e1e875ab43fa233e074337458bc1dca",
    "link": "http://twitter.com/johndoe/statuses/111111111111111111",
    "geo": {
      "latitude": 42.376104,
      "longitude": -71.237189
    }
  },
  "twitter": {
    "created_at": "Fri, 17 Aug 2012 14:13:08 +0000"
  },
  "demographic": {
    "gender": "mostly_male"
  }
}
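As an illustration of using metadata as features or filters, the sketch below pulls a few fields out of an interaction record shaped like the sample. Which fields are actually useful is model-specific; the feature names and the trimmed `SAMPLE` record are assumptions made for this example.

```python
import json

# Trimmed copy of the sample record, kept small for illustration.
SAMPLE = """
{
  "interaction": {
    "source": "web",
    "author": {"username": "johndoe"},
    "type": "reply",
    "content": "I like ice cream!",
    "geo": {"latitude": 42.376104, "longitude": -71.237189}
  },
  "demographic": {"gender": "mostly_male"}
}
"""


def metadata_features(record):
    """Extract simple metadata features from one interaction record."""
    interaction = record.get("interaction", {})
    author = interaction.get("author", {})
    return {
        "is_reply": interaction.get("type") == "reply",
        "has_geo": "geo" in interaction,
        "source": interaction.get("source"),
        "author": author.get("username"),
    }


features = metadata_features(json.loads(SAMPLE))
```

Using `.get` with defaults keeps the extractor tolerant of sources that supply less metadata than others.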
51. CONTEXT IS KING
• Leverage Metadata to set the Context for your documents
• Review Sites – Ask a Pointed Question
“I love it!”
Metadata can tell you what “it” is: Venue Name, Movie Name, Product Name, Brand Name.
52. CONTEXT IS KING
Parent Post (Context), Interaction Type = Tweet
Your new Prius is awesome. How do you like it?
Child Post, Interaction Type = Reply
I love it. Although, every morning, I keep hearing growling and squealing noises coming from the back seat. :P
53. CONTEXT IS KING
Interaction Type = Tweet
Interaction Author = @MrsKursar
I love it. Although, every morning, I keep hearing growling and squealing noises coming from the back seat. :P
Same text, different context
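One way to act on this is to attach the parent post's text to a reply before classification, so the model sees what "it" refers to. The flat record layout below (`id`, `type`, `parent_id`, `content`) is a hypothetical simplification, not the feed's actual schema.

```python
def with_context(post, posts_by_id):
    """Return the text to classify: parent text plus reply text for replies."""
    if post.get("type") == "reply":
        parent = posts_by_id.get(post.get("parent_id"))
        if parent is not None:
            return parent["content"] + " " + post["content"]
    return post["content"]


# Hypothetical records mirroring the parent/child example on the slides.
parent_post = {"id": 1, "type": "tweet",
               "content": "Your new Prius is awesome. How do you like it?"}
child_post = {"id": 2, "type": "reply", "parent_id": 1,
              "content": "I love it. Although, every morning, I keep hearing "
                         "growling and squealing noises coming from the back seat. :P"}
posts = {1: parent_post, 2: child_post}
```

A standalone tweet passes through unchanged, which is where author metadata has to carry the context instead.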
54. Brian Kursar – Sr Data Scientist
Toyota Motor Sales IT Research and Development
@briankursar
toyota.com/careers